THE EFFECT OF METHODS OF ELIMINATING SPIKES IN THE TIME SERIES OF FREIGHT FLOWS ON THEIR STATISTICAL CHARACTERISTICS

Introduction
One of the requirements imposed on initial time series in their analysis is homogeneity, which means the absence of strong breaks in trends and of abnormal observations (spikes).
An abnormal spike is understood as a separate value of the time series that does not correspond to the potential capabilities of the studied process. Remaining as a value of the series level, it exerts a material effect on the values of the main characteristics of the series, including the corresponding trend model [1, 2]. Abnormal observations take the form of a strong change in the series level (a jump or a drop) followed by an approximate recovery of the previous level.
According to the statistical properties of the studied set, several types of spikes are distinguished [3].
Statistical spikes are spikes that do not fall within the normal limits of the set. For freight flows, these can be a quantity requested for delivery of a material in one order that differs markedly from the average supply volume, single deliveries of special loads, etc.
Variative spikes are values that differ sharply in the studied characteristics. The presence of such spikes does not mean observation errors or flaws in work, but it can indicate the presence of elements from another set in the sample (for example, different cargo types in the given classification group) or qualitatively new formations in the process (for example, a new cargo handling technology or the use of other means of transport).
Remote spikes manifest themselves as long as the sample or the set has substantial variation. If the sample size increases, the sample becomes more homogeneous because units with intermediate values emerge, and as a result the spikes can move within the normal limits of the set. The above types of abnormal observations are closely connected with the character of the set variation and the type of distribution of the data array. Besides, the same array can contain several types at once, which must be considered in data processing. In dynamics, the term "spike" refers not to a unit of the aggregate but to a level of the time series.
Dynamic spikes, a newly introduced type, are non-standard values whose emergence (an extraordinary shock) influences not only the current but also the subsequent levels of the series. For example, the imposition of sanctions on the delivery of certain product types will affect both the current and the subsequent transportation volumes of the carrier companies.
Additive spikes are levels of the time series that underwent an irregular intervention affecting only the given levels. The difference between a theoretical level (by the results of smoothing or analytical leveling)
Innovative spikes represent internal changes in the system (e.g. a shift of levels upon transition to a new technology or an accidental single endogenous change in the system parameters).
From the point of view of the causes of spike emergence, a distinction is made between mistakenly abnormal values (e.g. registration errors made during inspection, a biased sample, etc.) and dissimilar data.
According to the known methods stated in works [1-3], elimination of abnormal observations is obligatory at the preliminary stage of source data processing. At the same time, there are no clear recommendations concerning the correction of specific time series. Therefore, it is necessary to analyze the quantitative and qualitative changes in the main characteristics of the time series depending on the methods of processing abnormal observations.

Literature review and problem statement
The statistical properties of aggregates and the nature of abnormal observations have a significant effect on preparing the basis for decision making when such spikes are present in the obtained data. In practice, identification of spikes is not limited to selecting a single standard procedure and applying it. Samples can be tested for spikes repeatedly, depending on the complexity of the problem. Several procedures can be applied to one data array at different investigation phases:
1. Processing and bringing together primary data.
2. Studying the nature of distribution after removal or correction of observation errors.
From the standpoint of the systems approach, the time series is in essence a system that includes two main interconnected components: the trend and the seasonal fluctuations [4]. A spike or an error in the source data of the time series can be considered an external disturbance, and the points with abnormal values can be attributed, by definition, to rarely occurring events. In the analysis and prediction of one-dimensional time series, a spike at one point is used as a standard system disturbance. Hereinafter, the following will be attributed to abnormal points:
- the points of the time series at which the levels considerably exceed the mean value of the entire series;
- the points at which the first differences considerably (by several times) exceed the series dispersion.
At the same time, it is necessary to proceed from the fact that if abnormal points appear continuously over a long period, then this is no longer an abnormal value but a change in the dynamics of the indicator levels.
Abnormally high or low values distort the indicators of statistical properties of the aggregates (the average, dispersion, coefficient of variation, autocorrelations) and complicate the analysis of the nature of the aggregate distribution. First of all, the size of spikes in the values of the time series levels can be commensurable with the change in the trend. The presence of such disturbances and the uncertainty of the laws of their distribution and of the connections between them complicate formalization of trends of a certain type, and the trend can be distorted when the disturbances are smoothed by one method or another. Moreover, the converse effect is possible: a regularly changing trend can be obtained from a purely random series by repeated smoothing (the Slutsky effect). In its turn, the presence of a trend in the time series is directly connected with the stationarity properties of the time series. A false spike in the source data distorts the "periodic" seasonal fluctuation and causes false spikes in the trend at all points located a multiple number of periods away from the point with the spike. For example, if a spike happened in February, then the indicator values will be distorted in all Februaries of the time series [5].
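The Slutsky effect can be illustrated with a small simulation. The sketch below is not part of the original study: the series length, the window of three points, the five smoothing passes and the function names are all assumptions chosen for illustration. It smooths simulated white noise repeatedly and compares the lag-1 autocorrelation before and after.

```python
import random
import statistics


def moving_average(x, window=3):
    """Centered moving average; the series is shortened at both ends."""
    half = window // 2
    return [statistics.fmean(x[i - half:i + half + 1])
            for i in range(half, len(x) - half)]


def lag1_autocorr(x):
    """Sample autocorrelation coefficient at lag 1."""
    m = statistics.fmean(x)
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(len(x) - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den


random.seed(0)  # fixed seed so that the illustration is reproducible
noise = [random.gauss(0, 1) for _ in range(2000)]  # purely random series

smoothed = noise
for _ in range(5):  # repeated smoothing (hypothetical number of passes)
    smoothed = moving_average(smoothed, 3)

r_noise = lag1_autocorr(noise)        # close to zero for white noise
r_smoothed = lag1_autocorr(smoothed)  # strongly positive after smoothing
print(round(r_noise, 3), round(r_smoothed, 3))
```

The smoothed series shows strong dependence between neighboring levels although the input contained none, which is why a trend found after aggressive smoothing should be treated with caution.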
Sometimes, the presence of multiple spikes in the tail area of the distribution can mean the presence of partial aggregates with specific properties. Such distributions demand mixed types of modeling that include several analytical functions for different partial aggregates. In model construction, the presence of spikes substantially affects not the values of the parameter estimates but their statistical properties (the dispersion estimates become biased).
According to the estimates made by different authors, the statistical sequences used for analysis can contain up to 5…10 % of abnormal values [1, 2]. For the analyzed sequence, it can be considered normal if the number of spikes remaining after the corresponding data transformations does not exceed 20 % of the total number of revealed spikes [6, 7]. However, even a single abnormal observation can lead to estimates and conclusions that do not agree with the sampled data or to the use of false premises at the decision-making stage.
Therefore, in each instance, the decision on the admissible number of spikes shall be made depending on the nature of the distribution, the intensity of variability of the sampled data and the significance of spikes in the decision to be made [3]. Dissimilar (abnormal) data with a known nature of distribution are the basic material for decision makers (DM). At the same time, the DM shall be cautious about automatic exclusion of abnormal values. If observations that seem abnormal at first sight but are actually correct are excluded, important information can be lost. A spike must by no means be excluded from the analysis merely because it has an extreme value. However, such exclusion is justified if there is confidence that the value is erroneous.

Research objective and tasks
The research objective was to establish the effect of the methods used in eliminating spikes in the time series of an industrial enterprise's freight flows on variations of their statistical characteristics.
To achieve the assigned objective, the following problems were solved:
- review of existing approaches to the analysis of abnormal values in the time series and their exclusion;
- study of the effect of various methods of censoration of abnormal values on the main statistical sample characteristics and properties of the time series, using the specific example of the volume of product delivery from an enterprise to consumers.

Research materials and methods
The analysis of spikes in time series has its specifics and represents a separate problem because the levels of a time series are arranged chronologically, i.e. in the order of their emergence. Besides, autocorrelation in the time series can propagate the spike effect to subsequent observations, and mere exclusion of the spike from the time series is not always applicable for the purpose of model construction. Spikes often appear at several levels at once, which leads to the so-called masking effect that hides spikes.
To solve the problem of detection and correction of abnormal values, the theory of statistical decisions is widely applied in practice with the use of parametric and nonparametric methods and algorithms. If the initial time series has a small number of observations, then visual analysis of tables or graphs is an efficient method of anomaly detection. In one-dimensional series, anomalies manifest themselves as value spikes to one side or the other. Information obtained by means of graphs can be used in two ways:
- from the point of view of the analysis: to exclude abnormal values or to bring them into accord with the general data model so that they have no effect on the analysis results;
- from the point of view of the decision maker: as grounds for checking.
The search for anomalies and their correction in the data set can be described as a process of selecting values that differ strongly from the surrounding data, fall out of the general series of values or are incompatible with other data.
The simplest approach to detection of abnormal values in one-dimensional data series is the statistical approach. If the series distribution is assumed to be known, it is necessary to determine the key statistical parameters: the average value and the dispersion. Based on these characteristics, it is possible to establish a threshold whose size depends on the series dispersion. All series elements exceeding this threshold can be considered potentially abnormal.
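The threshold rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the threshold multiplier k = 2.5, the function name `flag_by_threshold` and the short series of supply volumes are hypothetical and are not taken from the study.

```python
import statistics


def flag_by_threshold(series, k=2.5):
    """Return indices of levels farther than k standard deviations from the mean.

    k is an assumed tuning parameter; the paper does not prescribe a value.
    """
    mean = statistics.fmean(series)
    sd = statistics.stdev(series)  # square root of the sample dispersion
    if sd == 0:
        return []  # a constant series has no spikes
    return [i for i, v in enumerate(series) if abs(v - mean) > k * sd]


# Hypothetical daily supply volumes with one obvious spike at index 6.
volumes = [410, 425, 418, 402, 431, 415, 980, 420, 409, 427, 412, 423]
suspects = flag_by_threshold(volumes)
print(suspects)
```

Because the spike itself inflates both the mean and the dispersion, a threshold that is too strict can fail to flag it in short series; this is one reason why repeated or robust procedures are used.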
Two methods are used in practice for detection of spikes in non-stationary time series.
Method 1. The systematic component is first excluded from the time series, which results in a stationary series. Spikes are then detected by means of well-known criteria.
Method 2. The trend is established, and spikes are detected as exceptionally significant "residuals", i.e. differences between the actual values and the expected values of the response function obtained by regression analysis. This method enables not only establishment of the trend line but simultaneous spike detection as well. It must be understood here that spikes are deviations from the theoretical model so large that they cannot be explained by the random component of the time series. Such deviations can be evidence of some failure or mistake in obtaining the results. It is exactly this deviation from the theoretical model that the "residual" expresses.
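A minimal sketch of Method 2 is given below, assuming a linear trend fitted by ordinary least squares and a residual threshold of k = 2.5 standard deviations. Both choices, the function names and the synthetic series are assumptions made for illustration; the study does not specify the trend model or the cutoff.

```python
import statistics


def linear_trend(y):
    """Least-squares slope and intercept for y observed at t = 0, 1, 2, ..."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = statistics.fmean(y)
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def residual_spikes(y, k=2.5):
    """Indices whose residual from the fitted trend exceeds k residual std devs."""
    slope, intercept = linear_trend(y)
    residuals = [v - (intercept + slope * t) for t, v in enumerate(y)]
    sd = statistics.stdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > k * sd]


# Synthetic non-stationary series: linear growth plus one spike at t = 12.
series = [100 + 5 * t for t in range(20)]
series[12] += 80
spikes = residual_spikes(series)
print(spikes)
```

Testing the raw levels against the overall mean would mostly reflect the trend here; working with residuals separates the systematic component from the abnormal deviation.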
If the abnormal values found really distort information on the studied dynamic process and can affect the model quality and the reliability of the analysis results, then they must be corrected. The correction is made so that the traced tendency is not distorted by random fluctuations. For this purpose, depending on the logic and features of the problem being solved, several methods can be used:
- Removal of the record containing an abnormal value. If the number of records in the data sample significantly exceeds the minimum required for analysis, then the records containing abnormal values can simply be removed.
- Manual substitution of abnormal values. It is applied if the number of abnormal values is small and they can be processed manually. In this case, the analyst has to replace abnormal values with other ones that correspond better to the model of data behavior.
- Smoothing and data filtering. Methods of frequency or spatial filtering are used. In doing this, it is necessary to consider that processing changes not only the abnormal values but all values of the series.
- Data interpolation. Abnormal values are substituted with other values based on some nearest neighboring levels.
- Substitution with the most probable value. A histogram of the distribution of the series values is constructed to find the value corresponding to the histogram mode, which is statistically the most probable.
The correction procedure that consists in removal of the abnormal values, their substitution with calculated values or their mathematical transformation (taking the logarithm, taking the square root or standardization) is called censoration.
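Two of the substitution methods listed above, interpolation over neighboring levels and substitution with the histogram mode, can be sketched as follows. The bin width of 10 units, the exclusion of the abnormal level from the histogram, the function names and the data are all assumptions made for this illustration.

```python
import statistics


def interpolate_at(series, idx):
    """Replace series[idx] with the mean of its two nearest neighbors.

    At the ends of the series the single available neighbor is used.
    Assumes the series has at least two levels.
    """
    out = list(series)
    left = out[idx - 1] if idx > 0 else out[idx + 1]
    right = out[idx + 1] if idx < len(out) - 1 else out[idx - 1]
    out[idx] = (left + right) / 2
    return out


def substitute_mode(series, idx, bin_width=10):
    """Replace series[idx] with the midpoint of the most populated histogram bin.

    The abnormal level itself is left out of the histogram (an assumption).
    """
    out = list(series)
    bins = [int(v // bin_width) for i, v in enumerate(out) if i != idx]
    modal_bin = statistics.mode(bins)
    out[idx] = (modal_bin + 0.5) * bin_width
    return out


# Hypothetical supply volumes; index 6 is treated as the abnormal level.
volumes = [410, 425, 418, 402, 431, 415, 980, 420, 409, 412]
by_interpolation = interpolate_at(volumes, 6)
by_mode = substitute_mode(volumes, 6)
print(by_interpolation[6], by_mode[6])
```

Interpolation preserves the local behavior of the series, while the mode pulls the corrected level toward the most typical value of the whole sample, so the two methods can affect the dispersion differently.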
After the corresponding transformations, the aggregate often takes the form of some standard distribution with no spikes. However, the mathematical results shall not be in conflict with the research objective: excessive transformations will mask spikes, and insufficient ones will classify units lying within normal boundaries as spikes. In practice, this issue is solved according to the principle of parsimony [3]: simplify, but do not oversimplify. In doing this, the experience of previous studies involving similar data arrays is of great importance. If abnormal values remain after the transformations, then they are exactly the spikes (dissimilar data) that are not characteristic of the set in general but are present in it.

Effect of different methods of abnormal value censoration on statistical characteristics and properties of the time series
Take the volumes of raw material supply to a metallurgical enterprise over a period of N = 100 days as the source data. Let us show the effect of different methods of abnormal value censoration on the main statistical sample characteristics and properties of the FDR time series:
1. Arithmetic mean q̄ = 418.

Lag coefficients of autocorrelation ρ_k.
The absolute value of the variation coefficient makes it possible to judge the degree of constancy of the position and variability characteristics of the time series. The characteristics are considered satisfactory at |ν| = 20…33 %.
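For reference, the variation coefficient can be computed as the ratio of the standard deviation to the arithmetic mean, expressed in percent. The sketch below assumes the sample (n - 1) standard deviation; the paper does not state which estimator was used, and the function name and the data are hypothetical.

```python
import statistics


def variation_coefficient(series):
    """Coefficient of variation in percent: 100 * s / mean."""
    return statistics.stdev(series) / statistics.fmean(series) * 100


# Hypothetical levels: mean 420, sample standard deviation 20.
nu = variation_coefficient([400, 420, 440])
print(round(nu, 2))
```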
The graph of the autocorrelation function (ACF), which characterizes the change of the autocorrelation coefficients with the lag, makes it possible to reveal the trend and cyclic components in the time series. Thereby it is possible to identify the change in the stationarity property of the time series depending on the methods of abnormal value correction. In doing so, we proceed from the following a priori statements:
- if the first-order autocorrelation coefficient is the largest in the ACF graph, then the time series contains only a trend;
- if the autocorrelation coefficient of order K is the largest, the time series contains only cyclic fluctuations with a periodicity of K time instants;
- if none of the coefficients is significant (close to zero or not leaving the confidence region around zero), it can be assumed that either the series contains neither a trend nor a cyclic component (being "pure" white noise), or the series contains a nonlinear trend.
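The a priori statements above rely on the sample autocorrelation coefficients and on a confidence region around zero. A minimal sketch is shown below; the approximate 95 % bound of 1.96/sqrt(N), the function names and the artificial series with a period of four time instants are assumptions introduced for illustration.

```python
import math
import statistics


def acf(series, max_lag):
    """Sample autocorrelation coefficients for lags 1..max_lag."""
    n = len(series)
    m = statistics.fmean(series)
    denom = sum((v - m) ** 2 for v in series)
    return [
        sum((series[t] - m) * (series[t + k] - m) for t in range(n - k)) / denom
        for k in range(1, max_lag + 1)
    ]


def significant_lags(series, max_lag):
    """Lags whose coefficient leaves the approximate 95 % band around zero."""
    bound = 1.96 / math.sqrt(len(series))
    return [k for k, r in enumerate(acf(series, max_lag), start=1)
            if abs(r) > bound]


# Artificial cyclic series with a period of 4 time instants, N = 100.
cyclic = [0, 1, 0, -1] * 25
r = acf(cyclic, 6)
largest_lag = max(range(1, 7), key=lambda k: r[k - 1])
print(largest_lag, significant_lags(cyclic, 6))
```

In this example the largest coefficient falls on lag 4, which corresponds to the second statement: the series contains cyclic fluctuations with a periodicity of four time instants.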
The quickly decaying character of the ACF of the initial and transformed time series confirms their stationarity (white noise).
The above characteristics were estimated by robust methods [2, 13, 14]. These methods yield fairly reliable estimates for statistical aggregates while taking into account the reliability of the assumed distribution law and the presence of substantial deviations in the data values. They are aimed at ensuring the stability of statistical decision making in the presence of spikes that violate the conditions for using classical statistical models. Robustness is understood as insensitivity to various deviations and nonuniformities in the sample arising from some causes. Robust estimates are constructed so that their properties remain satisfactory for practice even when the true distribution of the experimental data differs from the expected distribution.
In the primary analysis, the general character of change in the levels of the time series is established and the initial values of its characteristics are defined; these serve as the basis for further comparative analysis. Fig. 1, a, which shows the initial time series, marks the points that can be attributed to potentially abnormal observations.
The visual analysis of the provided graphs gives the following a priori information:
- the given time series contains both obvious and suspected abnormal values;
- the dispersion of the series is not constant in time (heteroscedasticity), which is characteristic of a random walk process (the effect of a disturbance does not fade in it);
- two significant autocorrelation coefficients stand out in the correlogram: the first of them (at lag 1) demonstrates the presence of a trend, and the second one (at lag 13) shows the presence of cyclic fluctuations;
- the variation coefficient does not exceed 33 %, which confirms the constancy of the statistical characteristics.
Detection of abnormal values. The graph in Fig. 1 shows a close grouping of several abnormal values in the considered time series. The procedure for detection of such abnormal observations can be implemented in the Attestat software package by two methods:
- based on the L and E Tietjen-Moore criteria;
- based on Thompson's rule.
Both methods are invariant. The difference is that when the first method is used, it is necessary to set a priori the expected number (1 to 10) of abnormal values. The second method determines not only the presence of abnormal values in the time series but their number as well. All abnormal values revealed by the two methods are highlighted in red.
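To show why the expected number of abnormal values has to be set in advance for the first method, the E statistic of the Tietjen-Moore test can be sketched as follows. This is not the Attestat implementation: the Monte Carlo estimate of the critical value (used here instead of published tables), the number of trials, the function names and the data are assumptions made for illustration.

```python
import random
import statistics


def tietjen_moore_e(series, k):
    """E_k statistic: residual sum of squares after dropping the k values
    farthest from the mean, divided by the full sum of squares."""
    m = statistics.fmean(series)
    ordered = sorted(series, key=lambda v: abs(v - m))
    kept = ordered[:len(series) - k]  # the k most extreme values are dropped
    kept_mean = statistics.fmean(kept)
    num = sum((v - kept_mean) ** 2 for v in kept)
    den = sum((v - m) ** 2 for v in series)
    return num / den


def critical_value(n, k, alpha=0.05, trials=2000, seed=0):
    """Lower-tail critical value estimated by simulating normal samples."""
    rng = random.Random(seed)
    sims = sorted(
        tietjen_moore_e([rng.gauss(0, 1) for _ in range(n)], k)
        for _ in range(trials)
    )
    return sims[int(alpha * trials)]


def has_k_outliers(series, k, alpha=0.05):
    """True if E_k is small enough to reject 'no outliers' at level alpha."""
    return tietjen_moore_e(series, k) < critical_value(len(series), k, alpha)


# Hypothetical supply volumes with two suspected abnormal values (980 and 30).
volumes = [410, 425, 418, 402, 431, 415, 980, 420,
           409, 427, 412, 423, 30, 416, 421]
e_stat = tietjen_moore_e(volumes, 2)
print(round(e_stat, 4), has_k_outliers(volumes, 2))
```

The statistic is computed for a fixed k, so a wrong guess about the number of abnormal values changes the result; this is the practical difference from a rule that determines their number itself.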
Correction of abnormal values. Two organizational forms of the procedure for correction of abnormal observations were considered.
One-step correction. The originally revealed abnormal observations are removed from the time series or replaced with calculated ones. Then various transformation methods are applied (taking the logarithm, taking differences of the first and second order, etc.) to make the time series stationary.
Multistep (iterative) correction. In this case, the procedure of detection and elimination of abnormal values is repeated until the abnormal values are eliminated from the time series completely or partially (at the discretion of the decision maker). The statistical characteristics are recalculated in every iteration. If a trend or cyclic components are found after an iteration, then it is necessary to carry out the corresponding transformations to make the time series stationary.
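The iterative form of correction can be sketched as a loop that detects, replaces and recalculates. The sketch below makes several assumptions that are not taken from the study: detection uses a simple threshold of k = 2.5 standard deviations, flagged levels are replaced with the mean of the unflagged levels, and the function names and data are hypothetical.

```python
import statistics


def detect(series, k=2.5):
    """Indices of levels farther than k standard deviations from the mean."""
    mean = statistics.fmean(series)
    sd = statistics.stdev(series)
    if sd == 0:
        return []
    return [i for i, v in enumerate(series) if abs(v - mean) > k * sd]


def iterative_correction(series, k=2.5, max_iter=10):
    """Repeat detection and replacement until no level is flagged.

    Returns the corrected series and, per iteration, a tuple of
    (number of flagged levels, mean, dispersion) after the correction.
    """
    current = list(series)
    history = []
    for _ in range(max_iter):
        flagged = detect(current, k)
        if not flagged or len(flagged) == len(current):
            break
        normal = [v for i, v in enumerate(current) if i not in flagged]
        replacement = statistics.fmean(normal)
        for i in flagged:
            current[i] = replacement
        history.append((len(flagged), statistics.fmean(current),
                        statistics.variance(current)))
    return current, history


# Hypothetical volumes: the spike 980 masks the smaller spike 700.
volumes = [410, 425, 418, 402, 431, 415, 980, 420,
           409, 427, 412, 423, 700, 416, 421]
corrected, history = iterative_correction(volumes)
print([h[0] for h in history], round(max(corrected), 2))
```

In this example the second spike is found only on the second pass, after the first one has been corrected, which illustrates the masking effect and the need to recalculate the characteristics in every iteration.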
Partial robust processing of abnormal values. The procedure consists in replacing the abnormal values found in the FDR time series with their robust estimates (estimators). The following simplest estimators were used in the analysis of this time series:
- the common sample mean (VOS), which is adjusted at each iteration;
- the local sample mean (VLS) calculated in the three- or five-point neighborhood (including the abnormal point) of the time series level;
- the median (ME) calculated for three or five levels (including the abnormal one) of the time series;
- the most probable value determined in each iteration by the mode (MO) of the time series.
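The four estimators listed above can be sketched as follows. The treatment of the series ends, the use of the raw-value mode for MO and the data are assumptions made for illustration; the function names simply mirror the abbreviations used in the text.

```python
import statistics


def neighbourhood(series, idx, width=3):
    """Levels in a window of `width` points centered on idx (cut at the ends)."""
    half = width // 2
    return series[max(0, idx - half):idx + half + 1]


def vos(series):
    """Common sample mean over the whole series."""
    return statistics.fmean(series)


def vls(series, idx, width=3):
    """Local sample mean over the neighborhood, abnormal level included."""
    return statistics.fmean(neighbourhood(series, idx, width))


def me(series, idx, width=3):
    """Median over the neighborhood, abnormal level included."""
    return statistics.median(neighbourhood(series, idx, width))


def mo(series):
    """Most frequent value of the series (raw-value mode, an assumption)."""
    return statistics.mode(series)


# Hypothetical supply volumes; index 4 holds the abnormal level.
volumes = [410, 425, 418, 415, 980, 420, 409, 418, 412, 418]
estimates = {
    "VOS": vos(volumes),
    "VLS3": vls(volumes, 4, 3),
    "ME3": me(volumes, 4, 3),
    "ME5": me(volumes, 4, 5),
    "MO": mo(volumes),
}
print(estimates)
```

Since the neighborhood includes the abnormal level, the local mean remains strongly affected by it, whereas the median and the mode are not; this difference is worth keeping in mind when the estimators are compared.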

Discussion of the results obtained in the analysis of quantitative and qualitative changes in main time series characteristics depending on the methods of processing abnormal observations
Filtering out abnormal values results in an updated time series which can be further used in modeling and predicting the studied indicators in various systems [15-17].
The results of the analysis of quantitative and qualitative changes in the main time series characteristics depending on the methods of processing abnormal observations are given in Tables 1 and 2, and the change in the characteristics of the trend component is given in Table 3. The method of correction of abnormal time series levels affects first and foremost the quantitative characteristics of the time series and the trend component.
The mean value of the series level practically does not depend on the correction method.
For the considered series, deviations from the mean value of the actual data amounted to 0.7...1.7 % for one-step correction and 0.7...4.3 % for iterative correction.
In comparison with the actual data, the decrease in dispersion reaches 14...20 % for one-step correction and 30...99 % for iterative correction.
The degree of correlation of the time series levels responds weakly to the method of correction of abnormal values.
Iterative correction of abnormal observations by means of the VOS and MO estimators can lead to elimination of the trend and cyclic components.
On average, the trend accounts for 6 % of the dispersion of the actual series for all correction methods (except VOS and MO).
The remaining deterministic parts of the actual series account for about 75 % of the dispersion of the actual series for one-step correction and 54 % for iterative correction.
In scientific and technical literature, attention is usually focused on the fact that elimination of abnormal observations is obligatory at the preliminary stage of source data processing. At the same time, there are no clear recommendations on how to act correctly in each case.
The advantage of this study is that it shows, on the basis of robust methods, the extent to which the methods of abnormal observation elimination influence the behavior of the time series. A shortcoming of the study is that the ideas set forth in the paper cannot be expressed in an explicit analytical form.
The studies set forth in the paper continue earlier studies on modeling freight flows. They are aimed at eliminating uncertainties in planning transportation processes; therefore, continuation of the studies would be reasonable.

Conclusions
1. Methods of correction of abnormal time series levels affect first of all the quantitative characteristics of the time series and the trend component. The mean value of the series level practically does not depend on the correction method.
2. As a result of partial robust processing of abnormal values, an updated time series was obtained. It can be used further in modeling and predicting the indicators studied in various systems used in transportation, logistics, warehousing, etc.


Fig. 1 shows the trajectory of the considered time series of the supply volume and the graph of the sample autocorrelation function of this series.

Fig. 1. Characteristics of the FDR initial time series of supply volume: a - the graph of the initial time series; b - the graph of the sample autocorrelation function

Table 1
Results of one-step correction of the time series

Table 2
Results of iterative correction of the time series

Table 3
Change in the trend characteristics of the time series