Development of an Intelligent Subsystem for Operating System Incidents Forecasting

The object of research is a subsystem for predicting incidents on a server platform that runs an OS of the Windows family. One of the most problematic aspects of planning measures to prevent the harmful effects of network attacks (such as DDoS), hardware failures, etc. on a server system is obtaining an effective model for predicting operating system incidents.

In the course of the research, methods of forming and analyzing time series, exponential smoothing, and elements of machine learning theory based on the group method of data handling (GMDH) are used. To obtain accurate and reliable forecasts from the intelligent incident forecasting subsystem, elements of the theory of heuristic self-organization and a specific implementation of this theory, GMDH, are used. An algorithm is obtained, and a software implementation of an intelligent system for predicting OS incidents is developed together with its main operating characteristics. This became possible as a result of analyzing the constructed model of the intruder and the system log of security incidents, and of applying GMDH. A mechanism is proposed for generating a sample of OS incident events from the Windows system event log. Testing the proposed forecasting system on test samples shows that the forecasting results obtained with various settings of the machine learning system and parameters (degree of the reference polynomial, number of variables in the characteristic polynomial model, number of selection rows) are satisfactory. Applying the created algorithm for forecasting OS incidents shows that the use of a large number of polynomial models in GMDH yields a forecasting system that is qualitatively superior to systems based on classical regression models and methods. Due to this, a much more accurate forecast can be obtained than with classical regression methods or exponential smoothing. The percentage of false calculations using GMDH is less than 4 %.


Introduction
Most authors do not address the classification of methods and models for forecasting the operation of operating systems (OS). As for forecasting security events, the literature does not single out specific algorithms or forecasting models that should be used for this purpose. As a review of the literature shows, the most popular approaches at present are classical forecasting models (trend, regression), forecasting with neural networks, and Markov models [1][2][3]. Notable contributions to the theory and practice of creating forecasting algorithms, methods and systems are made in [4,5]. Therefore, it is relevant to analyze critical operating modes of operating systems using modern time series forecasting methods, and to develop new effective machine learning methods based on GMDH for use in incident forecasting subsystems. Thus, the object of research is a subsystem for forecasting operating system incidents on a server platform that runs an OS of the Windows family. The aim of research is to create a software tool for this subsystem that forecasts incidents of the server platform OS by means of time series forecasting with machine learning methods.

Methods of research
The forecasting subsystem models a time series. This series is taken from the OS system event log, which is maintained by the system that records and accounts for various security incidents in the Windows Server OS. In general, such systems detect and register violations of information properties such as confidentiality, integrity, availability, system manageability and the like. The content of this log generally correlates with the model of the intruder [6].
ISSN 2664-9969

Examples of intruder actions and typical attacks on the operating system, information about which is contained in the Windows system event logs, include the following [7]:
- attempts to scan the file system and steal key information;
- password guessing;
- collecting data from a non-empty Windows recycle bin;
- exceeding access authority;
- software implants (backdoors);
- greedy programs.

The time series for the incident forecasting subsystem is a sample of critical OS events for days 1 to 8 of OS operation, taken from the system log and shown in Table 1. MS Excel provides a range of tools for statistical analysis; to test the forecast, the «Exponential Smoothing» tool is selected. The resulting samples are the elements of the time series (TS) that is used later to obtain a forecast.
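How a list of logged events becomes a daily time series can be illustrated with a minimal Python sketch. It assumes the event timestamps have already been exported from the log; the function name `daily_incident_counts` and the sample timestamps are hypothetical and are not the data of Table 1.

```python
from collections import Counter
from datetime import datetime, timedelta


def daily_incident_counts(timestamps, first_day, days):
    """Count events per calendar day for `days` consecutive days from `first_day`."""
    per_day = Counter(ts.date() for ts in timestamps)
    # A Counter returns 0 for days on which no event was logged.
    return [per_day[first_day + timedelta(days=i)] for i in range(days)]


# Hypothetical timestamps of critical events (illustration only).
events = [
    datetime(2020, 3, 1, 9, 15),
    datetime(2020, 3, 1, 22, 40),
    datetime(2020, 3, 2, 3, 5),
    datetime(2020, 3, 4, 11, 30),
]
series = daily_incident_counts(events, datetime(2020, 3, 1).date(), days=4)
print(series)  # [2, 1, 0, 1]
```

Days without events are kept as zeros so that the series stays equally spaced, which smoothing and regression methods generally expect.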
The analysis of the TS of OS incidents gives the result shown in Fig. 1.
For this TS, the average deviation is calculated; it lies in the range from 18.67 to 119.85. The smoothed levels for each of the 8 available TS values (Fig. 1) make it possible to plan the expected number of such events for the next 8 days. However, the obtained polynomial forecasting model is not adequate for obtaining a forecast; therefore, forecasting with the group method of data handling (GMDH) is considered, and the resulting forecast is heuristic. The nature and absolute magnitude of the error do not allow a conclusion to be drawn about whether the trend is real.
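The smoothing step can be sketched in Python as follows. This is simple exponential smoothing with an assumed smoothing constant `alpha`; the sample values are hypothetical, and the alignment of the output may differ from what the MS Excel tool produces.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s[0] = x[0], s[t] = alpha*x[t] + (1-alpha)*s[t-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [float(series[0])]
    for value in series[1:]:
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed


def mean_absolute_deviation(series, smoothed):
    """Average one-step-ahead error: smoothed[t-1] serves as the forecast of series[t]."""
    errors = [abs(actual - forecast)
              for actual, forecast in zip(series[1:], smoothed[:-1])]
    return sum(errors) / len(errors)


counts = [10, 20, 30, 40]  # hypothetical daily incident counts
levels = exponential_smoothing(counts, alpha=0.5)
print(levels)  # [10.0, 15.0, 22.5, 31.25]
print(mean_absolute_deviation(counts, levels))
```

On a steadily growing series the smoothed level lags behind the data, which is one reason the deviation of such a forecast can be large.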
The correctness of this trend can be verified in two ways: empirically or experimentally. A predictive trend model based on regression analysis can be used [8]. However, to obtain accurate and reliable forecasts when studying complex objects such as an incident registration system, the theory of heuristic self-organization and its concrete implementation, GMDH, are used [9]. It makes sense to use GMDH as the basic method for forecasting incidents, since the data sample (the Windows system event log) contains several elements [10][11][12]. Therefore, an inductive approach is used, in which models of increasing complexity are generated successively until a minimum of some model quality criterion is found. This quality criterion is called an external criterion because different data are used for fitting the models and for evaluating their quality. The model at which the external criterion reaches its global minimum is the desired one.
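The inductive idea can be illustrated with a small Python sketch. As a stand-in for GMDH candidate models it uses polynomial trends of increasing degree; the data, the function names and the choice to scan all degrees up to `max_degree` (rather than stopping at the first minimum) are assumptions made for illustration. Each candidate is fitted on one part of the data and scored on a separate part, and that score plays the role of the external criterion.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_polynomial(ts, ys, degree):
    """Least-squares polynomial coefficients (lowest power first) via normal equations."""
    rows = [[t ** p for p in range(degree + 1)] for t in ts]
    size = degree + 1
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    moment = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve_linear_system(gram, moment)


def squared_error(coeffs, ts, ys):
    """Sum of squared residuals of the polynomial on the points (ts, ys)."""
    return sum((y - sum(c * t ** p for p, c in enumerate(coeffs))) ** 2
               for t, y in zip(ts, ys))


def select_by_external_criterion(train, check, max_degree):
    """Fit each degree on `train`, score it on `check`, keep the lowest score."""
    best = None
    for degree in range(max_degree + 1):  # max_degree must stay below len(train[0])
        coeffs = fit_polynomial(*train, degree)
        error = squared_error(coeffs, *check)
        if best is None or error < best[1]:
            best = (degree, error, coeffs)
    return best


# Hypothetical data generated from y = 1 + 2t, split into two disjoint parts.
train = ([0, 2, 4, 6], [1, 5, 9, 13])
check = ([1, 3, 5], [3, 7, 11])
degree, error, coeffs = select_by_external_criterion(train, check, max_degree=3)
print(degree, error)
```

The constant model (degree 0) scores poorly on the held-out points here, while any model that contains the linear term reproduces them almost exactly.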
The algorithm for finding the model of optimal structure for the incident forecasting subsystem can be represented as the following steps [9]:

1. A sample is given in the form of the TS of the system log:

$$D = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}.$$

2. Since GMDH requires both training and testing, the sample is divided into a training part and a test part. In the practical GMDH implementation, the proportion of these parts is selected manually.

3. Let $l$, $C$ be index sets from the range $\{1, \ldots, N\} = W$. These sets satisfy the partition conditions $l \cup C = W$, $l \cap C = \emptyset$. The matrix $X_l$ consists of the row vectors $\mathbf{x}_n$ for which the index $n \in l$. The vector $\mathbf{y}_l$ consists of those elements $y_n$ for which the index $n \in l$. The partition of the sample is written as follows:

$$X_W = \begin{pmatrix} X_l \\ X_C \end{pmatrix}, \qquad \mathbf{y}_W = \begin{pmatrix} \mathbf{y}_l \\ \mathbf{y}_C \end{pmatrix}.$$

4. Candidate models are generated inductively. In this case, restrictions are introduced on the length of the polynomial of the base model. For example, the degree of the polynomial of the base model should not exceed a specific natural value $q$. Then the basic model is written as a linear combination of a given number $F_0$ of products of free variables as follows:

$$y = f(x_1, x_2, \ldots, x_1^2, x_1 x_2, x_2^2, \ldots, x_m^q), \qquad (2)$$

where $f$ is the linear combination function. The arguments of (2) are redefined as follows:

$$x_1 \to a_1,\quad x_2 \to a_2,\quad \ldots,\quad x_m^q \to a_{F_0}.$$

5. To configure the parameters of the models, an internal criterion is used; it is calculated using the training sample. Each vector $\mathbf{x}_n$, an element of the sample $D$, is mapped to a vector $\mathbf{a}_n$. Next, a matrix $A_W$ is constructed, which represents a set of column vectors $\mathbf{a}_i$. The matrix $A_W$ is divided into submatrices $A_l$ and $A_C$. The smallest residual $\|\mathbf{y} - \hat{\mathbf{y}}\|$, where $\hat{\mathbf{y}} = A\hat{\boldsymbol{\omega}}$, is delivered by the parameter vector $\hat{\boldsymbol{\omega}}$, which is calculated by the least squares method [10] from the expression

$$\hat{\boldsymbol{\omega}}_G = (A_G^{T} A_G)^{-1} A_G^{T} \mathbf{y}_G,$$

where $G \in \{l, C, W\}$. The internal criterion for the model is the squared error of the form

$$\varepsilon_G^2 = \|\mathbf{y}_G - A_G \hat{\boldsymbol{\omega}}_G\|^2.$$

In accordance with the criterion $\varepsilon_G^2 \to \min$, the parameters $\boldsymbol{\omega}$ are selected and the error is calculated on the sample $G$, where $G = l$. As the model becomes more complex, the internal criterion does not give a minimum for models of optimal complexity; therefore, it is not suitable for choosing a model.

6. To select the best models, their quality is calculated. For this, a control sample and an external criterion are used. The error on the sample $H$ is denoted as follows:

$$\Delta^2(H) = \|\mathbf{y}_H - A_H \hat{\boldsymbol{\omega}}_G\|^2,$$

where $H \in \{l, C\}$, $H \cap G = \emptyset$. This means that the error is calculated on the sample $H$ with the model parameters obtained on the sample $G$.

7. A model that provides the minimum of the external criterion is considered optimal.
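The steps above can be sketched in Python as a single pass of a combinatorial search. This is an illustrative sketch under stated assumptions, not the authors' C# implementation: the function names, the limit `max_terms` on the number of terms per candidate, the always-included constant term and the sample data are all hypothetical. Each candidate is a subset of the redefined arguments $a_i$, its parameters are fitted by least squares on the training indices (internal criterion), and the candidates are ranked by the error on the control indices (external criterion).

```python
from itertools import combinations, combinations_with_replacement


def solve(matrix, rhs, eps=1e-12):
    """Gaussian elimination with partial pivoting; None if (near-)singular."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < eps:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def monomials(x, q):
    """Map x to a = (1, x_1, ..., x_m, x_1^2, x_1*x_2, ..., x_m^q)."""
    a = [1.0]
    for degree in range(1, q + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            term = 1.0
            for i in idx:
                term *= x[i]
            a.append(term)
    return a


def least_squares(A, y):
    """Parameters omega solving (A^T A) omega = A^T y."""
    cols = range(len(A[0]))
    gram = [[sum(r[i] * r[j] for r in A) for j in cols] for i in cols]
    moment = [sum(r[i] * t for r, t in zip(A, y)) for i in cols]
    return solve(gram, moment)


def residual(A, y, omega):
    """Squared error ||y - A omega||^2."""
    return sum((t - sum(w * v for w, v in zip(omega, r))) ** 2
               for r, t in zip(A, y))


def best_model(X, y, train_idx, check_idx, q=2, max_terms=2):
    """Return (external error, kept column indices, omega) of the best candidate."""
    A = [monomials(x, q) for x in X]
    best = None
    for size in range(1, max_terms + 1):
        for s in combinations(range(1, len(A[0])), size):
            keep = (0,) + s  # constant term is always included (assumption)
            A_l = [[A[n][i] for i in keep] for n in train_idx]
            A_C = [[A[n][i] for i in keep] for n in check_idx]
            omega = least_squares(A_l, [y[n] for n in train_idx])
            if omega is None:
                continue  # skip candidates that cannot be fitted
            delta = residual(A_C, [y[n] for n in check_idx], omega)
            if best is None or delta < best[0]:
                best = (delta, keep, omega)
    return best


# Hypothetical sample generated from y = 1 + 2*x1*x2.
X = [[1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [1, 3], [3, 3], [2, 3], [3, 2]]
y = [1 + 2 * a * b for a, b in X]
train_idx, check_idx = range(6), range(6, 9)
delta, keep, omega = best_model(X, y, train_idx, check_idx)
print(delta, keep, omega)
```

For an incident TS, the rows of `X` could be lagged values of the series and `y` the next value; that mapping is an assumption, since the text does not specify it.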

Research results and discussion
A forecasting application based on GMDH has been implemented in C#. Its interface allows the degree of the reference polynomial to be changed from 1 to 7 and the number of variables in the characteristic polynomial model from 2 to 7. The number of models that pass to the next selection row is assumed to be unknown a priori; therefore, this field on the screen form can be filled in manually. The number of selection rows for the model can be set from 1 to 10.
After a data sample is loaded from the security logs of the Windows 10 OS of the investigated personal computer, the parameters of the generated GMDH models are set, and the parameters for splitting the sample into training and verification parts can be entered (Fig. 2). When the parameters of the GMDH model are calculated, the following intermediate data of the model calculation stages are obtained (Fig. 3):
- regularity criteria for the models S[1], S[2], S[3], S[4];
- the global criterion at selection level 1;
- the absolute deviation of the estimates of the obtained forecasting models over the whole sample.
As the number of samples in the TS from the Windows security log increases, the quality of the forecast may improve (the error criterion may decrease). However, increasing the accuracy in this way requires a larger data set, taken from the input TS, for training and testing the models.
Depending on the goals and the expected length of the incident forecasting period, the parameters of the model can be changed on the dialogue form. However, as the number of variables in the model and the degree of the reference polynomial increase, the time required to obtain the best forecasting model can increase significantly.
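One way to see why these two parameters matter is to count the terms of a complete polynomial: with m variables and degree q it has C(m + q, q) terms, including the constant. The short Python sketch below evaluates this for the extremes of the ranges offered by the interface; the function name is hypothetical, and how the application actually enumerates its models is not described here.

```python
from math import comb


def full_polynomial_terms(variables, degree):
    """Number of monomials of total degree <= `degree` in `variables` variables."""
    return comb(variables + degree, degree)


print(full_polynomial_terms(2, 1))  # 3
print(full_polynomial_terms(7, 7))  # 3432
```

Since candidate models are built from subsets of these terms, the number of candidates grows much faster than the term count itself.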
As a result of running the program module, the best models are selected during training, and from these the single best model is chosen; it is then used to predict security incidents of the investigated OS (Fig. 4).
On the graph of the input and processed data of the forecasting model based on GMDH, it is possible to estimate the discrepancy between the data of the real sample and the data (shown in yellow) obtained on the basis of the best forecasting GMDH model (shown in red) (Fig. 5).
To forecast the time series obtained from the incident log of the server OS with GMDH, the best models must be selected at each iteration of the method. This approach reduces the total amount of computation (processor time) and the amount of memory required by the method itself. A comparison of the forecasting results produced by the GMDH model and by autoregression (Fig. 6) leads to the conclusion that the created software product can be used in practice.
Prediction results are obtained with various system settings (Fig. 6) and various parameters (degree of the reference polynomial, number of variables of the characteristic polynomial model, number of selection rows).

Conclusions
Testing the obtained forecasting subsystem on the generated test samples shows that the results of forecasting OS security incidents obtained with various system settings and parameters may differ slightly. The use of a large number of polynomial models, as in GMDH, yields a much more accurate forecast of OS incidents than classical regression methods or exponential smoothing.