A METHOD FOR BUILDING A FORECASTING MODEL WITH DYNAMIC WEIGHTS

V. Sineglazov
Doctor of Technical Sciences, Professor
Department of Aviation Computer-Integrated Systems
Institute of Aerospace Control Systems, National Aviation University
Komarov Av., 1, Kyiv, Ukraine, 03680
E-mail: svm@nau.edu.ua

O. Chumachenko
Candidate of Technical Sciences, Associate Professor*
E-mail: lobach21@mail.ru

V. Gorbatiuk*
E-mail: vladislav.horbatiuk@gmail.com

*Department of Technical Cybernetics, National Technical University of Ukraine «Kyiv Polytechnic Institute»
Peremogy Av., 37, Kyiv, Ukraine, 03056

A new time series forecasting method is presented that dynamically finds the weights of the input factors depending on the specific values of the factors themselves. The proposed method was tested on a set of real time series and showed better results than the method used as a baseline.

Keywords: time series forecasting, linear regression, Bayesian model averaging, neural networks


Introduction
Forecasting has always been one of the most interesting and important problems of mankind. It is also one of the hardest, since solving it means dealing with the following issues: a) it is impossible to take into account all the factors that influence the process we are trying to forecast; moreover, their influence can change over time: a factor that is not important today can play a major role tomorrow; b) there are always many (sometimes infinitely many) plausible models that fit the training data well, so we have to decide which model or set of models to use, and that is usually a very error-prone decision; c) it is often hard (if not impossible) to find the optimal complexity of the model.
In this paper we introduce a method that tries to deal with the first two issues, i.e. it flexibly determines the set of models to use for the given inputs and takes into account the volatile significance of the factors.

Problem statement
Let us have a sequence of $N$ data points $\vec{x} = \{x_1, \dots, x_N\}$ measured at successive, equally spaced time points $t_1, \dots, t_N$, i.e. $t_{i+1} - t_i = T = \mathrm{const}$ for $i = 1, \dots, N-1$. Then the problem of forecasting (Fig. 1) considered in this paper can be stated as follows: using the data we have (Fig. 1, a), build a model of the forecasted process that takes $n$ successive data points $x_{i-n+1}, \dots, x_i$ as input and outputs the forecast for the value $x_{i+k}$ at some future time point $t_{i+k}$ (Fig. 1, b). This model can be represented mathematically as $y = F(x_1, \dots, x_n)$, where $F$ is some unknown function. One important note is that this model can be defined implicitly or even work as a "black box": we give it an input and receive the desired output, which serves as a forecast.

Review of existing forecasting methods
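The most well-known forecasting method is probably linear regression [1]. It builds the following linear model:
$$y = w_0 + \sum_{i=1}^{n} w_i x_i,$$
where $w_1, \dots, w_n$ are the importance weights of the input variables $x_1, \dots, x_n$ respectively, and $w_0$ is a bias term that can be omitted. The weights are usually found by minimizing the mean squared error (MSE) of the model on the training data:
$$\mathrm{MSE} = \frac{1}{m} \sum_{j=1}^{m} \Big( y_j - w_0 - \sum_{i=1}^{n} w_i x_{ji} \Big)^2 \to \min_{w_0, \dots, w_n},$$
where $\vec{x}_j = (x_{j1}, \dots, x_{jn})$ is the $j$-th training case, $y_j$ is the corresponding known output value and $m$ is the number of training cases.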
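As an illustration, the MSE minimization described above can be sketched in Python by solving the normal equations. This is a minimal sketch rather than the implementation used in this paper: the function names `solve_linear_system`, `fit_linear_regression` and `predict` are hypothetical, and the input columns are assumed to be linearly independent.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: inputs are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear_regression(X, y, bias=True):
    """Return the weights minimizing the MSE; with bias=True, weights[0] is w_0."""
    rows = [([1.0] if bias else []) + list(row) for row in X]
    p = len(rows[0])
    # Normal equations: (R^T R) w = R^T y, where R is the (augmented) input matrix.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve_linear_system(gram, rhs)


def predict(weights, x, bias=True):
    """Apply the linear model y = w_0 + sum_i w_i * x_i to one input vector."""
    w0, rest = (weights[0], weights[1:]) if bias else (0.0, weights)
    return w0 + sum(w * v for w, v in zip(rest, x))
```

Calling `fit_linear_regression(X, y, bias=False)` gives the bias-free variant of the model.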
Even though linear regression remains one of the most widely used methods due to its simplicity, it has one natural limitation that follows from its definition: it cannot model complex nonlinear dependencies. To overcome this limitation, many nonlinear forecasting methods have been developed. Let us mention the most widely used ones.
1. Group method of data handling (GMDH) [2]. The GMDH is a set of forecasting algorithms based on a recursive selection of the best models and the subsequent construction of more complex models from the previously selected ones. The forecasting accuracy is improved by increasing the complexity of the models. The selection criterion is based on model performance on the test set, while the model's parameters are determined from the training set. The simplest models, also called base functions, usually share one predefined form; however, any kind of base function can be used, including harmonic series, exponential series, etc.
The GMDH-like algorithms have proven to be really effective on real-life problems mainly because of their use of an external criterion (i.e. models are selected using data that was not used for their training).
2. Artificial neural networks (ANN) [3]. An ANN is a system of connected and interacting artificial neurons, which are mathematical models of biological neural cells. An ANN is not programmed in the usual sense of the word: it is trained. During training, the neural network is able to detect complex relationships between input and output data and to generalize them. The ability of neural networks to forecast comes directly from their ability to generalize and find hidden relationships between input and output data. After training, the network is able to predict the future value of a certain sequence on the basis of several previous values and/or any current factors.
Two architectures are mainly used for the forecasting task: the feed-forward neural network [4] (Fig. 2) and the recurrent neural network [5] (Fig. 3). While a feed-forward ANN basically corresponds to a very complex function, a recurrent ANN adds some dynamics, i.e. it has a finite dynamic response to time series input data.
The main advantage of an ANN over other forecasting methods is that the network can model practically any functional relationship equally well, whereas most other methods are best suited for modelling some concrete type of function (obviously, polynomial smoothing is best suited for processes with a polynomial regular component, Fourier series smoothing is best suited for processes with a periodic regular component, etc.). Another important advantage of neural networks is the ability to learn.
3. Wavelet-based forecasting. This approach uses a discrete wavelet transform [7] to obtain the wavelet coefficients of the series and then predicts the future values using these coefficients as inputs.
One step of the discrete wavelet transform produces so-called approximation coefficients $a_j$ and detail coefficients $d_j$ given by:
$$a_j = \sum_{l} x_l \, g_{2j - l}, \qquad d_j = \sum_{l} x_l \, h_{2j - l},$$
where $g$ and $h$ are the impulse responses of the low-pass filter and the high-pass filter respectively. Usually, the approximation coefficients are decomposed further multiple times (Fig. 4).
Fig. 4. Graphic representation of wavelet decomposition
4. Various combinations of multiple methods. For instance, a combination of GMDH and ANN was suggested in [8]: instead of predefined base functions, small feed-forward neural networks can be used, thus eliminating the issue of selecting the most appropriate type of base function.
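To make the wavelet decomposition described above more concrete, the sketch below performs it with the Haar filter pair. The choice of the Haar wavelet and the names `haar_dwt_step` and `haar_decompose` are assumptions made for illustration only; the text does not fix a particular wavelet, and the sign convention of the detail coefficients varies between sources.

```python
import math


def haar_dwt_step(signal):
    """One Haar DWT step: return (approximation, detail) coefficient lists."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / s for a, b in pairs]  # low-pass branch
    detail = [(a - b) / s for a, b in pairs]  # high-pass branch
    return approx, detail


def haar_decompose(signal, levels):
    """Decompose the approximation coefficients repeatedly, `levels` times.

    Returns the final approximation and the detail coefficients of each level,
    ordered from the first (finest) level to the last (coarsest) one.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_dwt_step(approx)
        details.append(detail)
    return approx, details
```

With this orthonormal scaling the sum of squared coefficients after each step equals the sum of squared input values, which is a convenient sanity check.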
Despite the variety of existing forecasting methods, most of them can be generalized using the following equation:
$$\vec{w}^{*} = \arg\min_{\vec{w}} E\big( F(\vec{w}, X), \vec{y} \big),$$
where $E$ is some error function that is minimized; $F(\vec{w}, \vec{x})$ is a function that represents a forecasting model (linear or nonlinear), applied here to every row of $X$; $X$ is the matrix of training cases; $\vec{y}$ is the vector of known output values for the training cases. Clearly, such an approach ignores issues (a) and (b) given in the introduction: it uses a single model and assumes that the input variables have constant influence.

Overview of the suggested forecasting method
The main idea of the method is to 'dynamically' find a set of weights for given inputs rather than use a single 'static' set of weights; in other words, the inputs are used for both finding the appropriate weights and predicting the output using these weights.
We suggest naming the method "linear regression with dynamic weights" (LRDW). The method's inputs are the matrix of training cases $X \in \mathbb{R}^{m \times n}$ and the vector of known output values $\vec{y} \in \mathbb{R}^{m \times 1}$, where $m$ is the number of training cases and $n$ is the number of input variables.
The preprocessing stage needed to obtain these matrices from the raw time series $x_i$, $i = 1, \dots, N$, is left outside the scope of this method for the sake of simplicity (we suggest normalizing the time series values to a fixed range and then using an embedding technique [8] with an appropriate embedding dimension and horizon of prediction to obtain these matrices).
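The suggested preprocessing can be sketched as follows. This is only an illustration under stated assumptions: the target range defaults to [0, 1], which is our choice for the example, and the function names `normalize` and `embed` are hypothetical.

```python
def normalize(series, low=0.0, high=1.0):
    """Linearly rescale the series to [low, high] (min-max normalization)."""
    smallest, largest = min(series), max(series)
    if largest == smallest:
        raise ValueError("a constant series cannot be rescaled")
    scale = (high - low) / (largest - smallest)
    return [low + (value - smallest) * scale for value in series]


def embed(series, n, horizon):
    """Turn a time series into training cases.

    Row i of X holds n successive values; y[i] is the value `horizon`
    time steps after the last input value of that row.
    """
    X, y = [], []
    for start in range(len(series) - n - horizon + 1):
        last = start + n - 1  # index of the last input value in this row
        X.append(list(series[start:start + n]))
        y.append(series[last + horizon])
    return X, y
```

For example, `embed(normalize(raw_series), n=5, horizon=2)` would produce inputs in the shape used in the experiments below, where `raw_series` stands for any list of observed values.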
The method's parameters are the numbers $K \in \{1, 2, \dots, m\}$ and $\gamma \in (0; \infty)$, which are described later. The main steps of the method are:
1. Subtract the row mean from each row of the matrix $X$ (i.e. make the set of its rows zero-mean). Below, $\vec{x}_i = (x_{i1}, \dots, x_{in})$ denotes the $i$-th row of $X$.
2. Find the initial 'static' weights vector $\vec{w}^{(in)} = (w_1^{(in)}, \dots, w_n^{(in)})$ by fitting a linear regression without a bias term to the training data.
3. For each training case introduce the error function
$$E_i = \alpha \big( y_i - \vec{w}_i \cdot \vec{x}_i \big)^2 + \beta \big\| \vec{w}_i - \vec{w}^{(in)} \big\|^2, \quad i = 1, \dots, m,$$
where $\alpha$ and $\beta$ are some constants, $\alpha, \beta \in (0; \infty)$, and $\vec{w}_i$ are new, 'dynamic' weights vectors, one for each training case. To minimize a particular error function $E_i$ we need to make the squared prediction error for the $i$-th training case small by choosing appropriate weights $\vec{w}_i$, while keeping these weights close to the static ones $\vec{w}^{(in)}$. The tradeoff between how much to reduce the error and how close the weights $\vec{w}_i$ should stay to the initial ones is controlled by the parameters $\alpha$ and $\beta$: if $\alpha > \beta$, reducing the error matters more than keeping the weights, and vice versa. To reduce the number of the method's parameters we can divide all error functions by $\beta$ and let $\gamma = \alpha / \beta$. This shows the meaning of the second parameter $\gamma$: choosing $\gamma > 1$ is equivalent to choosing $\alpha > \beta$, and $\gamma < 1 \Leftrightarrow \alpha < \beta$. When the input values lie in the normalized range, a suitable choice of $\gamma$ is somewhere between 0.1 and 0.3.

4. Find the optimal set of weights $\vec{w}_i^{*} = (w_{i1}^{*}, \dots, w_{in}^{*})$ for each error function $E_i$ by solving the following linear system, obtained by taking the partial derivatives of $E_i$ (divided by $\beta$) with respect to the corresponding weights and equating them to 0:
$$w_{ij}^{*} + \gamma \, x_{ij} \sum_{l=1}^{n} x_{il} \, w_{il}^{*} = w_j^{(in)} + \gamma \, y_i \, x_{ij}, \quad j = 1, \dots, n.$$
6. The forecast for a new input vector $\vec{x} = (x_1, \dots, x_n)$ is calculated as follows:
a. Find the vector of derivatives of the new input, i.e. the differences between its successive values: $\vec{v} = (x_2 - x_1, \dots, x_n - x_{n-1})$.
b. Find the $K$ nearest neighbors of $\vec{v}$ in the matrix $V$, whose rows are the vectors of discrete derivatives of the training cases, and remember their indices in sorted order, from nearest to furthest, together with the corresponding distances.
d. Find the weights vector that will actually be used for the prediction. To do this, average the optimal weights vectors of the found neighbors depending on the distance from $\vec{v}$ to the vector of discrete derivatives of the corresponding training case. The averaging weights are calculated from these distances so that the weights vector of the nearest neighbor gets the biggest weight.
e. Finally, the forecast is calculated by applying the averaged weights vector to the input: $y = \sum_{j=1}^{n} w_j x_j$, where $w_j$ are the components of the averaged weights vector.
To sum up, the method finds a separate weights vector for each training case and then calculates the forecast for new inputs by finding the weights of the $K$ nearest training cases (nearest in the sense of the Euclidean distance between the vectors of derivatives), weighting them based on the distance to produce a single set of weights, and then applying these weights to the inputs. Clearly, with this approach the weights, and hence the importance of the input variables, will be different for different input vectors. Also, since each set of weights defines the corresponding forecasting model, we are not using a constant set of models; instead, we find the most appropriate set depending on the input. The parameter $K$ plays a 'smoothing' role: the bigger $K$ is, the more weights vectors are averaged and the closer the average is to the weights of a linear regression.
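The search for the optimal dynamic weights can be illustrated with a short sketch. It assumes that, after division by $\beta$, each error function has the form $E_i = \gamma (y_i - \vec{w}_i \cdot \vec{x}_i)^2 + \| \vec{w}_i - \vec{w}^{(in)} \|^2$, which is our reading of the description of the error functions. Under this assumption the linear system has a rank-one structure, so its solution can be written in closed form as $\vec{w}_i^{*} = \vec{w}^{(in)} + \frac{\gamma (y_i - \vec{w}^{(in)} \cdot \vec{x}_i)}{1 + \gamma \| \vec{x}_i \|^2} \vec{x}_i$. The function names are hypothetical.

```python
def dynamic_weights(x, y, w_static, gamma):
    """Minimizer of gamma * (y - w.x)^2 + ||w - w_static||^2 for one training case.

    ASSUMPTION: the quadratic form of the error function is inferred from the
    text; the closed form below follows from setting its gradient to zero.
    """
    residual = y - sum(w * v for w, v in zip(w_static, x))
    step = gamma * residual / (1.0 + gamma * sum(v * v for v in x))
    # The optimum moves the static weights along the direction of the input.
    return [w + step * v for w, v in zip(w_static, x)]


def all_dynamic_weights(X, y, w_static, gamma):
    """Compute one dynamic weights vector per training case."""
    return [dynamic_weights(row, target, w_static, gamma)
            for row, target in zip(X, y)]
```

As $\gamma$ tends to 0 the dynamic weights approach the static ones, while a large $\gamma$ lets each training case be fitted almost exactly, which matches the role of $\gamma$ described above.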
When searching for the nearest neighbors, vectors of derivatives are used instead of the original vectors because for the time series forecasting problem the dynamics (i.e. how values change over time) is usually much more important and representative than the exact values of the forecasted process. For other problems, where the inputs are not successive points of some time series, the original vectors should be used.
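The forecasting stage for a new input can be sketched as follows. The inverse-distance averaging scheme, the small constant `eps` that guards against division by zero, and the function names are assumptions made for illustration: the sketch only respects the requirement that the nearest neighbor gets the biggest weight. The new input is assumed to have been preprocessed in the same way as the training rows.

```python
import math


def derivatives(x):
    """Discrete derivatives: differences between successive input values."""
    return [b - a for a, b in zip(x, x[1:])]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lrdw_forecast(x_new, X_train, W_dynamic, K, eps=1e-12):
    """Forecast for x_new from the dynamic weights of its K nearest training cases.

    W_dynamic[i] is the dynamic weights vector of training case X_train[i].
    """
    v = derivatives(x_new)
    dists = [euclidean(v, derivatives(row)) for row in X_train]
    # Indices of the K nearest training cases, from nearest to furthest.
    nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:K]
    # ASSUMPTION: inverse-distance averaging weights, normalized to sum to 1.
    raw = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(raw)
    w_avg = [sum(r / total * W_dynamic[i][j] for r, i in zip(raw, nearest))
             for j in range(len(x_new))]
    # Apply the averaged weights vector to the inputs.
    return sum(w * value for w, value in zip(w_avg, x_new))
```

With `K = 1` the forecast simply uses the dynamic weights of the single training case whose dynamics are closest to those of the new input.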
The method is somewhat similar to Locally Linear Regression (LLR) [9] and Bayesian Model Averaging (BMA) [10]: like LLR, it finds a kind of local model for each training case, and like BMA, it averages multiple models for the given inputs. However, LLR loses global information (the 'static' weights of a linear regression) while building local regressions; as a result, these local models can overfit badly. And unlike BMA, where the set of averaged models is constant and the models' outputs are averaged, the proposed method selects the models to average depending on the inputs and averages the models themselves, not their outputs (there is no difference in the linear case, but in general these two averaging methods are not equivalent).

Testing performance of the proposed method
To test the performance of the proposed method, a set of 11 publicly available time series [11, 12] was used. Linear regression and GMDH were used for comparison. All methods shared the following parameters:
- time series embedding dimension (i.e. the number of input variables) n = 5;
- horizon of prediction h = 2 (predicting the value 2 time steps ahead);
- ratio of the training set size to the full data set size r = 0.5 (half of the cases were used for model training).
A bias term was omitted for both the LRDW and the linear regression; the default parameter values of the specific GMDH implementation [13] were used; the method's parameters were set to γ = 0.2 and K = 1. The normalized squared error (NSE) was used as the measure of model performance.
Short analysis of the obtained results:
- the LRDW has a better NSE on most but not all time series, so its parameters, especially γ, need to be chosen carefully;
- the average improvement in error is about 12 % relative to the NSE of a linear regression (the biggest improvement is ≈ 22.3 % for the 'Sunspots per month' time series);
- in general, GMDH performs better than the LRDW; however, the approach we used to obtain the LRDW from a linear regression can easily be applied to other forecasting methods, including GMDH, and it can possibly boost their performance as well;
- there are several time series for which the LRDW performed even better than GMDH.
A graphical example of the LRDW model producing better forecasts than the linear regression is given in Fig. 5 ('US aviation shipments' time series).
As can be seen, the LRDW forecast is very similar to the one obtained by linear regression, but for some cases the proposed method gives much more accurate predictions (the training cases were selected randomly).

Conclusion
The proposed method was tested on real data, and its performance (measured using the NSE criterion) is usually better than that of the method it originated from, linear regression. Hence, we believe that applying the same approach to other methods, including nonlinear ones like GMDH or neural networks, can improve their performance as well.
There are also possible improvements to the approach itself:
• instead of finding dynamic weights for each training case, it is possible to find them for clusters of training cases to improve the method's runtime efficiency;
• suitable values of the method's parameters can possibly be determined from the training data; for example, the value of the γ parameter could depend on the ratio between the total magnitude of the static weights (i.e. the sum of their values) and the magnitude of the error for this training case (when using these static weights);
• instead of finding nearest neighbors and averaging the corresponding dynamic weights, we can build a model that predicts the weight values from the input values using any suitable forecasting method.

Fig. 1. Graphic representation of the forecasting problem statement: a) known values; b) future values

It is important to omit a bias term: in practice, models without bias (given that the training input vectors are zero-mean) usually have a better prediction error on the whole data set.
3. To perform the next step we should introduce new error functions, one for each training case:

6. The forecast for a new input vector

The NSE, used as the model performance indicator, was calculated on the full data set. The obtained results are given in the

Fig. 5. Forecasts obtained by two different methods: solid line: original time series; dotted line: LRDW forecast; line with markers: linear regression forecast