DEVELOPMENT OF THE METHOD OF FEATURES LEARNING AND TRAINING DECISION RULES FOR THE PREDICTION OF VIOLATION OF SERVICE LEVEL AGREEMENT IN A CLOUD-BASED ENVIRONMENT



A learning algorithm for a multilayer feature extractor is developed that uses the principles of neural gas and sparse coding. An information-extreme method of binary encoding of the feature representation is proposed for building decision rules. This makes it possible to reduce the requirements for the volume of training data and computational resources and to provide high reliability of prediction of violation of the service level agreement in a cloud environment.

Keywords: data center, sparse coding, information criterion, machine learning, swarm algorithm

Introduction
The growing popularity of cloud-based services stimulates the spread of distributed data processing centers on a global scale, which creates numerous resource planning problems across different administrative domains. Effective resource planning implies simultaneously minimizing violations of the Service Level Agreement (SLA), decreasing the cost of using cloud-based services, and increasing the level of energy saving as well as the profit of the service provider. However, non-stationary demand for cloud resources generates variable load peaks, which renders ineffective the mechanisms of reactive scaling that start allocating additional resources only after a particular metric exceeds a critical value. That is why proactive and predictive principles of resource management, which make it possible to start allocating the necessary resources in advance, are an active area of research. In addition, predictive mechanisms make it possible to redistribute resources effectively by identifying unsuitable candidates (data centers or individual servers) for hosting virtual machines. In this case, prediction of SLA violation removes uncertainty regarding the functional state of services at different levels of the cloud system and increases the efficiency of multi-criteria optimization algorithms when planning resource allocation [1].
One of the approaches to prediction of SLA violation at different levels of a cloud-based system is to use the ideas and methods of machine learning, which form a predictive model by analyzing time series of changes in key performance indicators, key quality indicators, and system messages [2]. However, traditional one-level methods of machine learning are characterized by an exponential dependence of the number of model parameters on the number of recognition features, so under conditions of multi-dimensional observations their use increases the requirements for computing resources and for the volume of learning data. That is why the most promising direction in the synthesis of analytic tools for a cloud environment management system is the use of feature learning methods. These methods are designed to generate an informative dictionary of independent features of a higher level of abstraction with relatively low dimensionality, which greatly simplifies the synthesis of decision rules.
Thus, development of a method for learning features and decision rules for prediction of SLA violation is a relevant direction of research, as it is aimed at increasing the efficiency of the cloud environment management system.

Literature review and problem statement
The main problem of cloud service providers is to determine the best compromise between profit and user satisfaction. However, the solution of this problem is complicated by a priori uncertainty regarding the functional state of the service, which results from non-stationary demand and the heterogeneity of the physical and virtual components of the IT infrastructure. Papers [3,4] suggested applying the algorithms of decision trees, random forests, and Naïve Bayes to remove uncertainty concerning compliance with SLA conditions associated with exceeding the service response time, service availability, or a decrease in information security. However, only a small number of features were monitored in the proposed approaches, which prevented obtaining a highly reliable predictive model for a period of time in advance that is sufficient for taking the necessary measures. The authors of [4,5] proposed to examine the trends of resource usage within a sliding window of an assigned size in order to form the feature description of the predicted functional states. In study [5], the prediction model is based on a Long Short-Term Memory (LSTM) recurrent neural network. The use of such a network made it possible to reduce the response time of the services, but the experiment was carried out on virtual simulators and on trace data of limited volume. In this case, the LSTM network is quite deep when unrolled in time and requires large training datasets in order to avoid overfitting, which leaves the model ineffective over a long period of service operation. Paper [6] considers prediction of SLA violations that result from overloading of network channels in the IT infrastructure of the data center; it is based on a deep model that likewise needs a large training dataset to avoid convergence to a local extremum of the loss function.
Development of the ideology of autonomous computing in cloud-based systems drives research into and implementation of predictive analysis technologies at all levels of an infocommunication system. Telemetry data that are accumulated in the data center management system are characterized by high dimensionality, an imbalance in the number of samples that represent the functional states of services, and a relatively small amount of labeled data on SLA violation, especially at the beginning of deployment of new services. In order to analyze high-dimensional data with a small number of labeled samples, the authors of [7,8] propose to use unsupervised feature learning on the full dataset and to train the classifier of functional states on the labeled samples encoded by the learned features. Articles [9,10] show a high efficiency of neural network algorithms for feature learning based on stacking of autoencoders and restricted Boltzmann machines. However, this approach requires a very large amount of data and computational resources, which increases the cost of data analysis and delays building an effective model for predicting the state of individual services. That is why methods of matrix factorization for analyzing multi-dimensional samples, stacked into a multilayer structure on the basis of a nonlinear transformation and a pooling operator, are actively explored [8,11]. These methods include Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Non-negative Matrix Factorization (NMF). Papers [11,12] show that the most effective factorization is the one that provides a sparse data representation. Sparse encoding yields a noise-immune, compact representation of input data, in which each observation can be represented as a linear combination of a small number of basis vectors, which makes its interpretation and subsequent analysis easier.
Papers [12,13] propose the algorithm of sparse coding neural gas, which makes it possible to perform incremental unsupervised learning of a feature basis according to the principles of self-organization and Orthogonal Matching Pursuit (OMP). In this case, the algorithm of sparse coding neural gas is suitable for samples of limited volume. The proposed algorithm showed high efficiency in the analysis of images and noisy signals; however, its organization into a multilayer structure for simplifying the analysis of multi-dimensional observations of poorly formalized processes has not been explored yet.
The most effective methods of machine learning for labeled samples of limited volume are based on building an optimal separating hypersurface within the framework of a geometric approach. Articles [10,11] consider the use of support vector machines, which transform the feature space in order to construct a separating hypersurface; however, their application requires computationally expensive regularization of the model through the selection of kernels and of the regularization coefficient. The authors of [14,15] propose a method of transforming the space of original features using the computationally efficient operations of comparison and exclusive OR (XOR) for building separating "hyperspheres" (hyper-parallelepipeds) in the binary space of secondary features. In this case, binary feature encoding and a population-based algorithm of optimization of the parameters of decision rules by the information criterion make it possible to create an effective classifier model automatically, which makes the approach promising for the analysis of cloud system monitoring data.

The aim and objectives of the study
The aim of the present research is to increase the efficiency of forming the feature description and decision rules for the prediction of violation of SLA conditions in a cloud-based data center.
To accomplish the set aim, the following tasks had to be solved:
- to develop a method for learning a hierarchical feature extractor based on the ideas and methods of neural gas and sparse encoding of observations and to compare its effectiveness with that of the autoencoder;
- to develop machine learning algorithms for a system of prediction of SLA violation that use binary feature encoding and population-based optimization of the parameters of decision rules by the information criterion;
- to explore how the reliability of the prediction decisions that the system makes in the operating mode on test data depends on the parameters of the feature extractor and decision rules.

Algorithms of feature learning and decision rules
Observations for feature learning are collected by scanning the archived history of changes in the performance metrics of an infocommunication service with a fixed-size window W, within which the metric values are read in time with an assigned step Δ. For training the decision rules, a sample of these windows is formed in which each window is labeled with the service state at the moment of time that is ahead of the window by Δt steps. In this case, two functional states (classes) are considered.

An important step of data analysis is preliminary normalization, which is aimed at removing linear correlation between the components of an observation and at unifying the primary feature representation. Data whitening by the ZCA method (Zero-phase Component Analysis) is one of the most common methods of preliminary data normalization. The ZCA method implies the following steps:
1) calculation of the sample mean of features, $\mu := \mathrm{mean}(X)$;
2) calculation of the covariance matrix of the sample observations, $\Sigma := \mathrm{cov}(X)$;
3) singular value decomposition of the covariance matrix, $\Sigma \approx V D V^{T}$;
4) whitening of each observation by formula $x_{ZCA} := V D^{-1/2} V^{T} (x - \mu)$.

In the general case, learning a feature representation implies searching for a set of parameters on unlabeled data, for example, in the form of a set of basis vectors C, which are subsequently used by the encoding algorithm to reconstruct the distribution of the input data. To build the dictionary of basis vectors C, it is possible to use vector quantization algorithms such as k-means or neural gas. Neural gas is based on the principles of "soft" competition, which is why it is characterized by better convergence, independence of the initial search point, and a more optimal distribution of the code book vectors. The feature representation can be formed by one of the methods of sparse approximation, for example, orthogonal matching pursuit (OMP). However, the method of optimized orthogonal matching pursuit (Optimized OMP, OOMP) [13] is more effective in terms of minimizing the norm of the approximation residual. Encoding in the OOMP method is an iterative procedure that includes the following major steps:
1) search for the column $l_{win}$ of the matrix of basis vectors C that has not been selected yet (not added to set U) and minimizes the norm of the residual obtained at the current step: $l_{win} := \arg\min_{l \notin U} \min_{a} \| x - C_{U \cup l}\, a \|$;
2) updating of the set of selected basis vectors: $U := U \cup l_{win}$;
3) solution of the optimization problem $a^{OOMP} := \arg\min_{a} \| x - C_{U}\, a \|$.

To decrease the computational complexity of the first step, it is possible to use a population-based search algorithm or the implementation proposed in paper [12]. The latter introduces a temporary matrix R, which in the beginning equals $R = (r_1, \dots, r_l, \dots, r_M) = C$ at $\varepsilon_{U} := x$ and at each step is updated by formula $r_l := r_l - (r_{l_{win}}^{T} r_l)\, r_{l_{win}}$, where $r_{l_{win}}$ is the column of matrix R that has maximum overlap with the current residual $\varepsilon_{U}$ and whose index has not yet been added to U; this index is determined from formula $l_{win} := \arg\max_{l \notin U} (r_l^{T} \varepsilon_{U})^{2}$.
In the same way, the value of the residual is updated at each iteration: $\varepsilon_{U} := \varepsilon_{U} - (r_{l_{win}}^{T} \varepsilon_{U})\, r_{l_{win}}$.
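The whitening steps and the residual-based encoding described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation: the function names `zca_whiten` and `oomp_encode`, the regularizer `eps`, the stopping tolerance `tol`, and the re-normalization of the columns of R after each deflation are assumptions of the sketch.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X (n_samples x n_features).

    eps is a small regularizer (an assumption of this sketch) that keeps
    the inverse square root finite for near-zero eigenvalues.
    """
    mu = X.mean(axis=0)                        # step 1: sample mean
    sigma = np.cov(X, rowvar=False)            # step 2: covariance matrix
    V, D, _ = np.linalg.svd(sigma)             # step 3: sigma = V diag(D) V^T
    W = V @ np.diag(1.0 / np.sqrt(D + eps)) @ V.T
    return (X - mu) @ W, mu, W                 # step 4 (W is symmetric)

def oomp_encode(C, x, K, tol=1e-12):
    """Greedy sparse code of x over the columns of dictionary C (d x M).

    Uses the temporary matrix R and the residual update from the text;
    at most K basis vectors are selected.
    """
    C = np.asarray(C, dtype=float)
    x = np.asarray(x, dtype=float)
    M = C.shape[1]
    R = C / np.linalg.norm(C, axis=0)          # R starts as unit-norm copy of C
    residual = x.copy()                        # residual starts as x
    used = np.zeros(M, dtype=bool)             # membership in set U
    for _ in range(K):
        overlap = (R.T @ residual) ** 2
        overlap[used] = -np.inf                # only columns outside U compete
        l_win = int(np.argmax(overlap))
        if overlap[l_win] <= tol:              # nothing left to explain
            break
        r_win = R[:, l_win].copy()
        used[l_win] = True
        residual -= (r_win @ residual) * r_win # remove projection from residual
        free = ~used
        # Remove the projection onto r_win from the remaining columns of R,
        # then restore unit length (assumed here for numerical stability).
        R[:, free] -= np.outer(r_win, r_win @ R[:, free])
        norms = np.linalg.norm(R[:, free], axis=0)
        R[:, free] /= np.where(norms > tol, norms, 1.0)
    a = np.zeros(M)
    if used.any():
        # Coefficients of the selected columns by least squares.
        a[used] = np.linalg.lstsq(C[:, used], x, rcond=None)[0]
    return a, residual
```

A typical use would whiten the training windows once, keep `mu` and `W` for new observations, and then call `oomp_encode` on each whitened window with a small K to obtain its sparse code.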
Neural gas, which is used to search for C, is an algorithm of self-organization of an unstructured grid for identifying the topological structure of data. In the general case, the neural gas algorithm includes the following basic steps:
1) initialization of dictionary $C = (c_1, \dots, c_M)$ with random values from a uniform distribution;
2) selection of the t-th input observation x from set X, which has volume $t_{max}$;
3) calculation of the neighborhood size coefficient and of the learning rate from formulas
$\lambda_t := \lambda_0 (\lambda_{final} / \lambda_0)^{t / t_{max}}$, (4)
$\alpha_t := \alpha_0 (\alpha_{final} / \alpha_0)^{t / t_{max}}$, (5)
where $\lambda_0$, $\lambda_{final}$ are the initial and final values of coefficient $\lambda_t$; $\alpha_0$, $\alpha_{final}$ are the initial and final values of coefficient $\alpha_t$;
4) calculation of the distance from input vector x to the words of code book C and their arrangement in ascending order: $\| x - c_{l_0} \| \le \| x - c_{l_1} \| \le \dots \le \| x - c_{l_k} \| \le \dots \le \| x - c_{l_{M-1}} \|$.

To adapt the vector quantization algorithm to the selected encoding scheme and thereby reduce the errors of sparse approximation, we propose to use the modified algorithm studied in paper [13]. The modified neural gas algorithm for sparse encoding of observations consists of the following steps:
1) initialization of dictionary $C = (c_1, \dots, c_M)$ with random values from a uniform distribution;
2) selection of the t-th input observation x from set X, which has volume $t_{max}$;
3) normalization of basis vectors $c_1, \dots, c_M$ to unit length;
4) calculation of the neighborhood size coefficient and of the learning rate from formulas (4) and (5);
5) initialization of the set of indices of the columns of C that have already been used during the t-th iteration, $U := \varnothing$;
6) initialization of the residual that is minimized, $\varepsilon_{U} := x$;
7) initialization of the temporary matrix $R = (r_1, \dots, r_l, \dots, r_M) := C$, orthonormalized according to $C_U$;
8) initialization of the counter of steps of residual refinement, $h := 1$, $h = \overline{1, K-1}$;
9) calculation of the distance (scalar product) of vectors $r_{l_k}$ to $\varepsilon_{U}$ and their arrangement in ascending order: $-(r_{l_0}^{T} \varepsilon_{U})^{2} \le \dots \le -(r_{l_k}^{T} \varepsilon_{U})^{2} \le \dots$;
10) initialization of the counter of steps of code book refinement, $k := 0$;
11) updating of the code book words at the k-th step using the principle of orthogonality to the subspace spanned by $C_U$ and Oja's rule [11]: $y := r_{l_k}^{T} \varepsilon_{U}$, $\Delta_{l_k} := \alpha_t e^{-k / \lambda_t}\, y\, (\varepsilon_{U} - y\, r_{l_k})$, $c_{l_k} := c_{l_k} + \Delta_{l_k}$, $r_{l_k} := r_{l_k} + \Delta_{l_k}$.

The first layer of the feature extractor may analyze input signals from several time windows that overlap in time. After the first layer of the feature extractor has been trained, the whole training sample can be recoded into a sparse concatenated representation and used for learning the next layer. Prior to this, it is advisable to introduce non-linearity into the obtained representation and to reduce the number of basis vectors in the new layer [11]. The simplest non-linearity is a constraint in the form of non-negativity of features, under which the output $o_S$ of the S-th layer with sparse code $a_S^{OOMP}$ can be calculated from formula
$o_S := [\max(0, a_S^{OOMP});\ \max(0, -a_S^{OOMP})]$, (6)
where 0 is the vector of dimensionality M with zero components; max is the operator of element-wise maximum between two vectors. Application of non-linearity (6) doubles the dimensionality of the resulting code, $o_S \in R^{2M}$, but enhances informativeness due to the possibility of separate analysis of the negative and positive responses of a signal and retains the sparsity property. Thus, nonzero values of the higher-level feature representation signal the activation of a certain group of low-level features. In this case, it is possible to submit to the classifier both the output of the last layer of the feature extractor and the outputs of the lower layers, which allows classification analysis to take into account the specificity of the functional state at each level of abstraction.
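As an illustration of the basic neural gas steps and of non-linearity (6), a minimal NumPy sketch is given below. It is not the authors' code: the function names, the default values of the λ and α schedules, and the initialization inside the bounding box of the data are assumptions, and the code book update inside the loop is the standard rank-based neural gas rule rather than the modified rule with Oja's update.

```python
import numpy as np

def neural_gas(X, M, lam=(5.0, 0.01), alpha=(0.5, 0.005), seed=0):
    """Learn M code words (rows of C) from observations X (t_max x d)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    t_max = len(X)
    # Step 1: uniform random initialization (inside the data bounding box,
    # which is an assumption of this sketch).
    C = rng.uniform(X.min(axis=0), X.max(axis=0), size=(M, X.shape[1]))
    for t, x in enumerate(X):                                # step 2
        frac = t / t_max
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac           # formula (4)
        alpha_t = alpha[0] * (alpha[1] / alpha[0]) ** frac   # formula (5)
        dist = np.linalg.norm(C - x, axis=1)                 # step 4
        rank = np.argsort(np.argsort(dist))                  # rank k of each word
        # Standard neural gas update: every word moves toward x with a
        # step that decays exponentially with its rank.
        C += (alpha_t * np.exp(-rank / lam_t))[:, None] * (x - C)
    return C

def nonneg_split(a):
    """Non-linearity (6): stack positive and negative parts of a sparse code."""
    a = np.asarray(a, dtype=float)
    return np.concatenate([np.maximum(0.0, a), np.maximum(0.0, -a)])
```

For a two-window input, the codes of both windows would be passed through `nonneg_split` and concatenated before being used as training data for the next layer.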
The algorithm of coarse binary encoding of the feature vector for classification analysis involves comparing the value of the i-th feature with the corresponding lower $A_{B,l,i}$ and upper $A_{T,l,i}$ limits of the asymmetric field of control tolerances, which are determined by $\delta_{l,i}$, the parameter of the l-th field of control tolerances for the value of the i-th feature.
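Formation of the binary learning matrix, in which N is the number of features of the classifier, $n_m$ is the number of vectors of class $X_m^o$, and K is the number of recognition classes, is performed by the rule $x_{m,l,i}^{(j)} = 1$ if $A_{B,l,i} \le y_{m,i}^{(j)} \le A_{T,l,i}$, and $x_{m,l,i}^{(j)} = 0$ otherwise, where $y_{m,i}^{(j)}$ denotes the value of the i-th feature in the j-th vector of class $X_m^o$.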
The coordinates of the binary averaged vector $x_k$, relative to which the class containers are constructed in the radial basis, are calculated over the labeled vectors of the initial sample, the total volume of which is n.
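The binary encoding by fields of control tolerances can be sketched as follows. The sketch takes the lower and upper limits as given arrays, because it does not assume how they are derived from the tolerance parameters. The rule used in `class_centers` (a bit of the averaged vector is set when its frequency within the class exceeds its frequency over all labeled vectors) is an assumption made for illustration only, as are all function names.

```python
import numpy as np

def binarize(Y, lower, upper):
    """Encode features Y (n x N) against L tolerance fields.

    lower and upper have shape (L, N); the result has shape (n, L * N),
    with a 1 wherever a feature value lies inside the corresponding field.
    """
    Y = np.asarray(Y, dtype=float)
    inside = (Y[:, None, :] >= lower[None]) & (Y[:, None, :] <= upper[None])
    return inside.reshape(len(Y), -1).astype(np.uint8)

def class_centers(B, labels):
    """Binary averaged vector per class (assumed rule, see lead-in)."""
    overall = B.mean(axis=0)
    classes = np.unique(labels)
    centers = np.array(
        [B[labels == k].mean(axis=0) > overall for k in classes],
        dtype=np.uint8,
    )
    return classes, centers

def code_distance(u, v):
    """Hamming distance, i.e. the number of ones in u XOR v."""
    return int(np.count_nonzero(np.asarray(u) != np.asarray(v)))
```

With these pieces, a class container is described by its center and an integer radius, and membership of a new binary vector can be tested by comparing `code_distance` to that radius.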
As the criterion of the efficiency of machine learning of the classifier to recognize observations of class $X_k^o$, a modification of Kullback's information measure (7) is considered [14,15], in which $\alpha_k$, $\beta_k$ are estimates of the errors of the first and second kind, which define the operating region of the criterion in the form of inequalities $\alpha_k \le 0.5$ and $\beta_k \le 0.5$; ε is a small positive number that avoids division by zero and is chosen, as a rule, from the range $[10^{-4}, 10^{-2}]$.
Optimization of the parameters of the field of control tolerances $\{\delta_{l,i}\}$ consists in searching for the extremum of function (7) in the hyperspace of solutions. In this case, this work proposes to use Particle Swarm Optimization (PSO) as the search algorithm, since it is characterized by simplicity of implementation and interpretability [16]. Optimization of the radii of class containers can be implemented by sequential direct search with an assigned step, because the number of steps of this search is relatively small.
To improve the compactness of images and the inter-class gap in the binary space of secondary features, the machine learning algorithm takes into account the fuzzy compactness of images, which is calculated for class $X_k^o$ from formula (8) through $d_k$ and $d_c$, the radii of the containers of class $X_k^o$ and of the closest neighboring class $X_c^o$, respectively, and through the code distance between the centers of these class containers.
The effectiveness of each particle of the population-based algorithm, i.e., its closeness to the global optimum, is measured with a pre-determined fitness function, the role of which is performed here by the criterion function of machine learning efficiency (7). Each j-th particle, apart from its position $P_j$, retains the following information: $V_j$ is the current velocity of the particle and $Pbest_j$ is the best personal position of the particle. The best personal position of the j-th particle is the position at which the value of the fitness function for this particle has been maximal up to the current point of time. In addition, in order to search for the global extremum of the fitness function, the best particle is sought throughout the whole swarm, and its position is designated as Gbest. However, the swarm search algorithm considered above is aimed at increasing the value of the learning efficiency criterion averaged over the class alphabet. To additionally increase the compactness of images, it is necessary to modify the procedure for updating the best personal positions $Pbest_j$ of search agents by rule (9), in which the objective function E(...) is the averaged value of criterion function (7).
Similarly, it is necessary to modify the procedure for updating the best global position Gbest of search agents by rule (10).
In the examination mode, the decision on whether a vector (realization) belongs to one of the classes is made with the use of $\mu_k(x)$, the membership function of vector x to the container of class $X_k^o$. For more precise consideration of the distribution of binary vectors in the hyperspherical container of class $X_k^o$, the formula of the membership function can be adjusted with the use of $n_k(d)$, the number of vectors of class $X_k^o$ located at distance d from center $x_k$, and $n_{max}$, the maximum value in array $n_k(d)$, i.e., $n_{max} = \max_d \{ n_k(d) \}$.
Thus, the proposed algorithms of learning features and decision rules for the prediction of SLA violations are not demanding in terms of the amount of data and computing resources, which provides effective resource management at the early stages of service operation.
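The swarm search over the tolerance parameters can be illustrated with a generic particle swarm sketch that maximizes an arbitrary fitness function. It implements the plain scheme with positions, velocities, personal best positions, and a global best position. The compactness-aware rules (9) and (10) are not reproduced here, and the inertia and acceleration coefficients `w`, `c1`, `c2` as well as the function name are assumptions of the sketch.

```python
import numpy as np

def pso_maximize(fitness, bounds, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize fitness(p) over a box given as a list of (low, high) pairs."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    P = rng.uniform(lo, hi, size=(n_particles, len(lo)))   # positions P_j
    V = np.zeros_like(P)                                   # velocities V_j
    pbest = P.copy()                                       # Pbest_j
    pbest_val = np.array([fitness(p) for p in P])
    g = int(np.argmax(pbest_val))
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]       # Gbest
    for _ in range(n_iter):
        r1, r2 = rng.random(P.shape), rng.random(P.shape)
        V = w * V + c1 * r1 * (pbest - P) + c2 * r2 * (gbest - P)
        P = np.clip(P + V, lo, hi)                         # stay inside the box
        vals = np.array([fitness(p) for p in P])
        better = vals > pbest_val                          # plain Pbest update
        pbest[better], pbest_val[better] = P[better], vals[better]
        g = int(np.argmax(pbest_val))
        if pbest_val[g] > gbest_val:                       # plain Gbest update
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, gbest_val
```

In the setting of this paper, `fitness` would evaluate the averaged information criterion for a candidate set of tolerance parameters, with the radii of the class containers found by the inner sequential search.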

Results of physical simulation of the system of prediction of violation of SLA conditions
The effectiveness of the proposed algorithms is tested on the example of the problem of predicting the overloading of data center servers, which leads to SLA violation in terms of the accessibility, resource capacity, and response time metrics. The simulation was carried out with the CloudSim framework [16], in which 400 HP ProLiant ML110 G4 servers (Intel Xeon 3040, 2 cores×1860 MHz, 4 GB) and 400 HP ProLiant ML110 G5 servers (Intel Xeon 3075, 2 cores×2660 MHz, 4 GB) were assigned. Workload data, collected on the PlanetLab platform, were taken from the CoMon project [16]. The forecast horizon is 10 minutes, which is sufficient to migrate a virtual machine.
The architecture of the system of prediction of SLA violations is shown in Fig. 1 and includes a two-level feature extractor. The extractor analyzes monitoring data on the load of the processing resource of a virtual machine in two sub-windows that are shifted in time, with 50 % overlap and a reading step of 1 minute. The length of a sub-window is 50 minutes, which exceeds the prediction horizon several times; this value was chosen at our discretion and may not be optimal. The unlabeled sample for learning the two-level feature extractor includes 10,000 samples, and the a priori classified learning sample contains 100 samples for each of the two classes. The test sample of the classifier has the same volume as the learning sample.
Table 1 shows the results of machine learning at different capacities of the dictionaries of basis vectors of the first and second levels. An analysis of Table 1 shows that the best of the checked configurations of the feature extractor is the sixth one, which provides error-free decision rules for the test sample with the minimal number of basis vectors. An analysis of Table 2 shows that the optimum number of control tolerances for feature values is L=3 and that a further increase in the number of tolerances can lead to overfitting, which is evident from Table 2 at L=7.
Fig. 2 shows how the accuracy of the obtained decision rules on the learning and test samples depends on the number of learning vectors of the feature extractor.
An analysis of Fig. 2 shows that an increase in the number of the extractor's learning vectors improves accuracy on the learning and test samples for the classifier of functional states of the service. However, at a learning sample volume of about 5,000 samples, the effect of overfitting of the system was observed. After reaching 6,100 samples, it is possible to obtain an extractor that provides error-free decision rules for the test sample. Thus, the developed algorithm of learning features and decision rules makes it possible to obtain error-free decision rules for the test sample with an extractor that contains 30 basis vectors in the first layer and 20 vectors in the second layer. In this case, 6,100 learning samples are sufficient for learning the extractor.
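The windowing used in the experiment (two 50-minute sub-windows with 50 % overlap, a 1-minute reading step, and a 10-minute forecast horizon) can be sketched as follows. The function name and the overload rule based on a hypothetical `threshold` on the load value are assumptions, since the labeling criterion is not specified in the text.

```python
import numpy as np

def make_samples(load, sub_len=50, overlap=0.5, horizon=10, threshold=0.9):
    """Build two-window observations and look-ahead labels from a load trace.

    load is a 1-D array sampled once per minute. Each observation holds
    two sub-windows of sub_len points; the second one starts
    sub_len * (1 - overlap) points after the first. The label is 1 when
    the load `horizon` steps after the end of the second sub-window
    exceeds `threshold` (a hypothetical overload rule).
    """
    load = np.asarray(load, dtype=float)
    shift = int(sub_len * (1 - overlap))    # 25 points for 50 % overlap
    span = sub_len + shift                  # 75 points covered in total
    X, y = [], []
    for start in range(len(load) - span - horizon + 1):
        w1 = load[start:start + sub_len]
        w2 = load[start + shift:start + span]
        X.append(np.stack([w1, w2]))
        y.append(int(load[start + span - 1 + horizon] > threshold))
    return np.array(X), np.array(y)
```

Each observation covers 75 consecutive readings, which matches the input dimensionality of 75 features that is used for the autoencoder in the comparison.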

Discussion of results of physical simulation of machine learning process
The use of the proposed extractor and of the swarm algorithm of decision rule optimization modified by rules (9) and (10) makes it possible, as shown in Fig. 2, to obtain highly reliable decision rules. In this case, the diagram shows an overfitting section with a width of 1,100 samples, at the end of which the accuracy on the test sample reaches its maximum limit value. The effect of overfitting has components from both the extractor and the classifier. To assess the impact of rules (9) and (10) on the effect of overfitting, Fig. 3 shows how the accuracy of the obtained decision rules on the learning and test samples depends on the number of learning vectors of the extractor when these rules are not used.
Fig. 3. Charts of dependence of effectiveness of decision rules on the number of learning vectors of feature extractor without using rules (9) and (10): 1 - the curve of accuracy change for learning sample; 2 - the curve of accuracy change for test sample

As Fig. 3 shows, when the compactness of images is not considered by rules (9) and (10), obtaining highly reliable decision rules requires a learning sample of much larger volume, which in this case includes 8,500 samples.
To compare the generalizing ability of the proposed extractor with that of the popular extractor based on the deep autoencoder [9], Fig. 4 shows how the accuracy of the obtained decision rules on the learning and test samples depends on the number of learning vectors of the autoencoder. In this case, the autoencoder has the following configuration: the input dimensionality is 75 features; the number of nodes in the first hidden layer is 30; the number of nodes in the hidden layer that corresponds to the feature representation is 20.
An analysis of Fig. 4 shows that the deep autoencoder similarly makes it possible to obtain error-free decision rules for the test sample, but this requires more training samples, the number of which exceeds 10,000.
Thus, the developed information and algorithmic software makes it possible to obtain highly reliable decision rules for the prediction of violation of SLA conditions. In this case, the implemented algorithms, compared with the autoencoder, require a smaller volume of learning data, which allows earlier introduction of predictive mechanisms for managing the corresponding services.

Conclusions
1. Results of physical simulation prove the capability of both the proposed hierarchical feature extractor, based on the ideas and methods of neural gas and sparse encoding, and of the autoencoder to obtain error-free decision rules for learning and test samples. However, the proposed extractor, unlike the autoencoder, requires a 1.6 times smaller volume of learning samples to achieve the same result, which makes it possible to put predictive mechanisms for managing the appropriate cloud services into effect earlier.
2. It is shown that consideration of image compactness in the binary space of secondary features during optimization of the multi-level system of control tolerances for the values of primary features makes it possible to significantly reduce the negative effect of overfitting of the classifier and the requirements for the volume of learning samples.
3. It is shown that the proposed configuration of the extractor for the problem of predicting violation of SLA conditions is acceptable in terms of accuracy and complexity. In this case, two time windows that overlap in time by 50 % and are each read as 50 features are used at the input of the extractor. The first coding layer of the extractor contains 30 basis vectors, and the second layer contains 20. The intralayer pooling and non-linearity were formed by concatenating the sparse codes of each of the windows, doubling the length of the resulting code in order to separate its positive and negative components, and transforming the resulting code into a vector of non-negative features.



Fig. 1. Block diagram of the system of classification prediction of violation of SLA terms

Fig. 2. Charts of dependence of effectiveness of decision rules on the number of learning vectors of feature extractor: 1 - the curve of accuracy change for learning sample; 2 - the curve of accuracy change for test sample

Fig. 4. Charts of dependence of effectiveness of decision rules on the number of learning vectors of autoencoder: 1 - the curve of accuracy change for learning sample; 2 - the curve of accuracy change for test sample

Table 1
Results of machine learning of classifier at various configurations of feature extractor

Table 2 shows results of machine learning of the classifier with the sixth configuration of the feature extractor at a varying number of control tolerances for recognition features.