DEVELOPMENT OF THE DESCRIPTIVE BINARY MODEL AND ITS APPLICATION FOR IDENTIFICATION OF CLUMPS OF TOXIC CYANOBACTERIA

K. Nosov, PhD, Research Fellow, Scientific research unit*. E-mail: k.nosov@karazin.ua
G. Zholtkevych, Doctor of Technical Sciences, PhD, Professor, Department of Theoretical and Applied Computer Science*. E-mail: g.zholtkevych@karazin.ua
M. Georgiyants, MD, Professor, Department of Pediatrics Anesthesiology and Intensive Therapy, Kharkiv Medical Academy of Postgraduate Education, Amosova str., 58, Kharkiv, Ukraine, 61176. E-mail: eniram@bigmir.net
O. Vysotska, Doctor of Technical Sciences, Professor**. E-mail: olena.vysotska@nure.ua
Y. Balym, Doctor of Veterinary Science, Professor, Department of Reproductology, Kharkiv State Zooveterinary Academy, Academichna str., 1, Malaya Danylivka, Dergachi district, Kharkiv region, Ukraine, 62341. E-mail: yubalym8@gmail.com
A. Porvan, PhD**. E-mail: andrii.porvan@nure.ua
*V. N. Karazin Kharkiv National University, Svobody sq., 4, Kharkiv, Ukraine, 61022
**Department of Biomedical Engineering, Kharkiv National University of Radio Electronics, Nauky ave., 14, Kharkiv, Ukraine, 61166

A descriptive dynamic model of binary data is presented. Given initial observations whose time order has been disturbed, the model makes it possible to restore the original order on the basis of the parsimony principle. The model is applied to finding the systemic colorimetric parameters used for processing images of clumps of toxic cyanobacteria, based on the analysis of the components of the RGB model of a digital photograph.

Keywords: descriptive models, dynamical systems, binary data, parsimony, data mining, clumps of toxic cyanobacteria


Introduction
In the analytics of raw data, and in big data analytics in particular, three types of analysis are commonly distinguished: descriptive, predictive, and prescriptive. What are the differences between these three main types of analytics?
Descriptive analysis is sometimes regarded as the simplest class of analytics: it condenses big data into smaller, more useful pieces of information. This step makes raw data more suitable for human consumption by presenting the information derived from the data.
Predictive analytics is the next step up in data reduction. It uses a variety of statistical, modeling, data mining, and machine learning techniques to study data, thereby allowing analysts to make predictions about the future. Predictive analysis can only forecast what might happen in the future, because all predictive analytics are probabilistic in nature.
The emerging technology of prescriptive analytics goes beyond descriptive and predictive models by recommending one or more courses of action and showing the likely outcome of each decision. Prescriptive analytics anticipates not only what will happen and when it will happen, but also why it will happen. Further, prescriptive analytics suggests decision options on how to take advantage of a future opportunity or mitigate a future risk, and shows the implications of each decision option.
The authors have previously proposed a number of descriptive models of system dynamics to describe the behavior of complex natural and technical systems [1]. The idea underlying these models is to present the system as a set of interacting components. Each component corresponds to a certain part or property of the system. For example, in an animal community, a component is the number, biomass, or density of a species from the community. For such a system, a model of between-component interactions can be built, and the properties of the interactions and dynamics can be identified from raw data obtained from the real (natural) system.
The development and expansion of the arsenal of mathematical tools for implementing a systematic approach to biosafety issues has always been a topical problem. Mathematical and information models describing the structure and stability of biological systems are of the utmost importance. This problem lies at the interface of cybernetics, mathematical modeling, and computational and theoretical biology.
Clumps of biomass of toxic cyanobacteria, which arise during their mass development in water reservoirs, pose a threat to biosafety. This threat concerns the quality of drinking water for animals, which suffers from the release of toxins from rotting and dead organic matter in coastal areas.
Veterinary medicine deals with a wide range of issues related to the toxicity of cyanobacteria, alongside the problems of environmental safety for humans [3]. In this connection, monitoring the emergence and movement of these clumps is an urgent problem not only in ecology but also in veterinary medicine [4].
The topicality of this problem has recently increased considerably as a result of growing anthropogenic pressure on natural systems as a whole and on water bodies in particular, as well as the influence of global climate change. Thus, outbreaks of cyanobacterial biomass occur over significant water areas not only in inland water reservoirs but also in seas, e.g., the Baltic Sea. Monitoring these environmental disturbances requires remote (aerospace) methods, which use complicated equipment such as satellite color scanners. However, as noted in [5], the full picture of a "bloom" recorded on satellite images of the water surface can be reproduced by neither experimental nor theoretical models, owing to the complexity of the relations between the relevant factors of this phenomenon.
At the same time, as shown in [6], certain practically significant system aspects of the performance of plant communities can be registered remotely using mathematical modeling. The initial factual data can be obtained by methods that directly record only the parameters of the RGB model of digital photography, using relatively simple, widely used, and inexpensive equipment [7].
Such a possibility is provided by new classes of mathematical models, developed with the authors' participation, called discrete models of dynamical systems (DMDS). In particular, it was shown that the dynamics of some colorimetric parameters of crops of cultivated plants can be described by a mathematical model that is very similar to the so-called marginal model of succession [7]. In this case, the use of DMDS increases the information content obtained from relatively rough input colorimetric parameters, owing to the ability to model the relationships between these parameters.

Literature review and problem statement
Binary data arise when a response variable of interest can take only two values, say, {0, 1}. For example, to understand socioeconomic processes, economists often need to analyze individuals' binary decisions (whether to make a particular purchase, participate in the labor force, obtain a college degree, see a doctor, migrate to a different country, or vote in an election).
There are a few statistical methods for modeling binary data that express relations between independent variables and a binary response. Logistic regression is apparently the most popular of the well-known models [8]. It makes it possible to calculate the probability of a value of a binary dependent variable from a number of independent variables measured on an interval or ordinal scale [9]. In addition to calculating the probability, logistic regression allows one to estimate the parameters of the model, evaluate the quality of the statistical fit, test hypotheses about the values of the parameters, etc. [10].
This fairly simple model has numerous extensions in various directions.
One of the directions is multilevel models for binary outcomes. Multilevel or clustered data consist of units of analysis at a lower level nested within units of analysis at a higher level [11, 12]. For this type of analysis, it is assumed that the hierarchy of units of analysis is natural. Thus, individuals within a certain hierarchy tend to be more similar in their characteristics than individuals in a comparable sample chosen at random from the population [13, 14]. Family relations are a good example of a natural hierarchy, because we would expect the children of the same parents to be similar in many important ways. Students nested in schools are another typical hierarchical data structure. Hierarchies often reflect individual social differentiation, for example, when individuals with similar ability are grouped in selective schools. In other cases, nesting may reflect individual characteristics to a lesser degree and may arise at random. For example, in clinical trials, experimental research is often conducted at several randomly chosen hospitals or among randomly selected groups of subjects. Multilevel models can address, for example, clustered data, repeated measures, or longitudinal data [15, 16].
When repeated measurements are conducted on the same individuals, a hierarchy is established with individuals at level two and measurement occasions at level one. Such data are often referred to simply as longitudinal data. A huge number of effective models have been developed for data of this type. For example, for prospective studies with binary responses, the authors of [17] suggest using Poisson regression, which makes it possible to build models with correlated binary responses originating in longitudinal or cluster randomized trials. In [18], a new unified model was suggested on the basis of the likelihood ratio. This model extends the traditional marginal regression models that describe consecutive and long-term follow-ups of binary variables.
An important method of revealing causal relations through statistical techniques is structural equation modelling (SEM), which has been increasingly used in applied research. However, handling categorical, and in particular binary, variables is a great obstacle to its wider use. Nonetheless, there have been several more or less successful attempts to extend SEM to binary data. For example, in [19], Yule's transformation based on odds ratios is used to approximate the matrix of Pearson's correlation coefficients. In [20], an extension of SEM for predicting a dependent binary variable (the so-called non-linear mixed model, NLMM) is suggested.
There are a few approaches to causal inference and structure learning of Bayesian networks studied in statistics and artificial intelligence [21, 22]. Most of them derive candidate causal structures from the observed data set by assuming acyclicity of causal dependencies. These models use information up to the second-order statistics of the observed variables and narrow down the candidate directed acyclic graphs by means of constraints and/or scoring functions.
Meanwhile, the methods and models, both those included in this review and those beyond it, are hard to apply to some data structures. In this respect, the following general methodological remark can be made. Evidently, one should not expect that an exhaustive set of models will ever be developed that adequately describes all possible types of objects and systems: natural, technical, social, etc. The tasks a researcher solves may require the development and study of models that cannot be reduced to any previously encountered one.
In this study, we suggest a model of dichotomous data that solves the problem of processing a dataflow formed as a series of observations of a certain dynamical system with discrete states. In contrast to time-series-like data, the time order of the observations may be disturbed. Such a dataflow can arise, for example, when the results of monitoring the system's states are delivered via different channels, when measurements of these states are conducted asynchronously, and in other similar cases. As far as the authors know, such models for dichotomous data have not yet been developed.
Consider data on a dichotomous scale. In this case, the variables can take only two values from the set $B = \{0, 1\}$. Besides 0 and 1, these values can be denoted by true/false, male/female, etc. No order is established between the values.
Let the system comprise $N$ components, denoted by $A_1, A_2, \ldots, A_N$; each component takes values from the set $B$. The system is dynamical and has discrete time. Hence, its state at the moment $t$ can be denoted by $(A_1(t), A_2(t), \ldots, A_N(t))$, where each $A_i(t) \in B$. We also assume that the system's state at the moment $t+1$ is fully determined by the state at the moment $t$.
The initial data are presented in an observation table composed of $M$ cases (in rows) and $N$ columns:

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,N} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M,1} & a_{M,2} & \cdots & a_{M,N} \end{pmatrix}, \qquad (1)$$

where every $a_{i,j} \in B$.
Each column corresponds to one component.
We assume that: 1) each row of table (1) is a state of the system at a certain moment of time; 2) table (1) includes all, or at least the major part, of the available states of the system; 3) all rows in (1) are unique (that is, $M \le 2^N$).
The next assumption is that (1) presents the whole cycle of the dynamical system, but the sequence of moments may be out of order. It is assumed that the dynamical system described above, consisting of $N$ components, generates the observations in the form of the matrix (1).
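The assumptions on the observation table can be checked mechanically before any modeling. The following is a minimal Python sketch, not part of the original method; the function name `validate_observation_table` and the example rows are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Row = Tuple[int, ...]


def validate_observation_table(table: List[Row]) -> Tuple[int, int]:
    """Check the stated assumptions on table (1) and return (M, N)."""
    if not table:
        raise ValueError("the table must contain at least one row")
    n = len(table[0])
    for row in table:
        if len(row) != n:
            raise ValueError("all rows must have the same number of components")
        if any(value not in (0, 1) for value in row):
            raise ValueError("every entry must belong to B = {0, 1}")
    if len(set(table)) != len(table):
        raise ValueError("all rows must be unique")
    # Unique binary rows of length N automatically satisfy M <= 2**N.
    return len(table), n


# Hypothetical table with N = 3 components and M = 4 observed states.
example = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
M, N = validate_observation_table(example)
```

Rows are stored as tuples so that uniqueness can be tested with a set; the order of the rows is deliberately not checked, since the time order is assumed to be unknown at this stage.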

The aim and objectives of the study
The aim of the study is to develop an approach that allows one to construct a descriptive model of a dynamical system from data. This makes it possible to describe relationships between components and to restore the dynamics of the system, in particular the correct time order of observations.
To achieve the aim, the following objectives are stated:
- to recover the true order of the moments, i.e., the true order of rows, from the observation matrix (1);
- to recover the rules of dynamics of the given system from the observation matrix (1);
- to develop an approach that enables more efficient localization of clumps of toxic cyanobacteria using digital images. This approach should be based on the analysis of weighted directed graphs that reflect the dynamics of the colorimetric parameters of snapshots of cyanobacterial clumps.

The descriptive models of binary data
Suppose that the initial data are given in the form of the observation table (1).
To address the task, we apply the parsimony principle: the simpler the model, the more correct it is. In contrast to the above-mentioned models developed by the authors, we do not minimize a discrepancy between the model's dynamics and the observed data, but rather solve an unsupervised optimization problem.

Define a $K$-ary binary function as a map

$$F: B^K \to B.$$

In this definition, $K$ can take the value 0; in that case, the function $F$ is a constant from $B$, i.e., 0 or 1. Since a $K$-ary binary function is a map into $B$, it defines two disjoint sets of sequences of zeros and ones of length $K$: the first set consists of the sequences mapped by $F$ into 0, the second of those mapped into 1. If $M = 2^N$, a unique truth table for the map $F$ can be composed. If $M < 2^N$, the truth table is not unique.
Let us start with the second task. Suppose that the table $A$ is already a regular cycle. That is, its first row $(a_{1,1}, a_{1,2}, \ldots, a_{1,N})$ is the state of the system at the moment $t=1$, the second row is the state at $t=2$, and the $M$-th row is the state at $t=M$. After the last moment, the cycle repeats starting from the first row, since the system is finite-state. As mentioned above, the state at the moment $t+1$ is uniquely determined by the state at the moment $t$. Now let us try to establish the relationships between the components. For example, consider the component $A_1$. We need to build a binary function $F$ that performs the mapping (by rows):

$$F(a_{t,1}, a_{t,2}, \ldots, a_{t,N}) = a_{t+1,1}, \qquad t = 1, 2, \ldots, M,$$

where row $M+1$ is identified with row 1. This means that the function $F$ maps the system's state at $t=1$, written in the first row, into $a_{2,1}$ (the state of the component $A_1$ at $t=2$), the system's state at $t=2$ into the state of $A_1$ at $t=3$, and so on.
Because all the rows of the table $A$ are different, we can obtain a trivial solution of the problem by using an $N$-ary function as $F_1$. However, this mapping can sometimes be implemented by an $m_1$-ary function (on some $m_1$ components), where $m_1 < N$.
The function $F_1$ constructed in this manner is defined only on those combinations of values from $B$ that occur in (1).
For example, consider the table below ($N=3$, $M=4$). If the mapping $F_1$ needs to be found, this is easily done.
For a given component $A_i$, the minimal arity $\sigma_i$ of the map built according to the described approach is called the degree of dependency. The same $\sigma_i$ can be attained by several subsets of components (of the same size).
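As an illustration of the degree of dependency, the sketch below searches subsets of components of increasing size until the next value of a chosen component becomes a function of the current values of the subset. The table used here is hypothetical (an assumption made for illustration, not the authors' example), as are the function names `is_function_of` and `degree_of_dependency`; components are indexed from 0, so index 0 stands for $A_1$.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Row = Tuple[int, ...]


def is_function_of(table: Sequence[Row], target: int, subset: Tuple[int, ...]) -> bool:
    """True if the next value of `target` is determined by the current values of `subset`."""
    seen = {}
    m = len(table)
    for t in range(m):
        key = tuple(table[t][j] for j in subset)
        nxt = table[(t + 1) % m][target]  # the last row is followed by the first
        if seen.setdefault(key, nxt) != nxt:
            return False  # the same input would have to map to both 0 and 1
    return True


def degree_of_dependency(table: Sequence[Row], target: int) -> Tuple[int, List[Tuple[int, ...]]]:
    """Return the minimal arity for `target` and all subsets that attain it."""
    n = len(table[0])
    for k in range(n + 1):  # k = 0 corresponds to a constant function
        supports = [s for s in combinations(range(n), k)
                    if is_function_of(table, target, s)]
        if supports:
            return k, supports
    raise ValueError("rows must be unique for a mapping to exist")


# Hypothetical cycle with N = 3 and M = 4, rows already in time order.
cycle = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
print(degree_of_dependency(cycle, 0))  # -> (1, [(1,)])
print(degree_of_dependency(cycle, 1))  # -> (1, [(0,), (2,)])
```

In this hypothetical cycle, the next value of $A_1$ is determined by $A_2$ alone, while the next value of $A_2$ is determined either by $A_1$ or by $A_3$, which illustrates that the same degree can be attained by several subsets.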
Introduce the average degree of dependency over all components:

$$\sigma = \frac{1}{N} \sum_{i=1}^{N} \sigma_i, \qquad (2)$$

where $\sigma$ is the measure of the simplicity of the relations determined by the model. The smaller $\sigma$ is, the simpler the relations and, according to the interpretation of the parsimony principle used here, the more adequate the model. If the rows in (1) are out of time order, we face the first task of revealing the true time sequence of observations. One needs to select a permutation of rows that provides the minimal value of (2).
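For small tables, the search for an order of rows minimizing the average degree of dependency can be sketched as an exhaustive enumeration. The Python code below is a brute-force illustration under stated assumptions, not an efficient algorithm and not necessarily the authors' procedure: the names `min_arity`, `total_degree`, and `most_parsimonious_orders` are hypothetical, the example rows are invented, and the first row is held fixed because rotating a cycle does not change its transitions.

```python
from itertools import combinations, permutations
from typing import List, Sequence, Tuple

Row = Tuple[int, ...]


def min_arity(table: Sequence[Row], target: int) -> int:
    """Smallest number of components that determine the next value of `target`."""
    m, n = len(table), len(table[0])
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            seen = {}
            consistent = True
            for t in range(m):
                key = tuple(table[t][j] for j in subset)
                nxt = table[(t + 1) % m][target]  # the cycle wraps around
                if seen.setdefault(key, nxt) != nxt:
                    consistent = False
                    break
            if consistent:
                return k
    raise ValueError("rows must be unique")


def total_degree(table: Sequence[Row]) -> int:
    """Sum of the degrees of dependency of all components."""
    return sum(min_arity(table, i) for i in range(len(table[0])))


def most_parsimonious_orders(rows: Sequence[Row]) -> Tuple[float, List[List[Row]]]:
    """Return the minimal average degree and every row order that attains it."""
    n = len(rows[0])
    first, rest = rows[0], rows[1:]
    best_total, best_orders = None, []
    for perm in permutations(rest):  # (M - 1)! candidate cycles
        order = [first, *perm]
        total = total_degree(order)
        if best_total is None or total < best_total:
            best_total, best_orders = total, [order]
        elif total == best_total:
            best_orders.append(order)
    return best_total / n, best_orders


# Hypothetical observations delivered out of time order.
shuffled = [(1, 1, 0), (0, 0, 1), (1, 0, 0), (0, 1, 1)]
sigma, orders = most_parsimonious_orders(shuffled)
```

The minimum may be attained by more than one order, so the sketch returns all minimizers rather than a single one. Integer totals are compared instead of averages to avoid floating-point ties.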
The upper bound on the number of binary maps $F_1, F_2, \ldots, F_N$ to be considered in solving the problem is

$$M! \cdot N \cdot 2^N,$$

where $M!$ is the number of permutations of rows, $N$ is the number of components, and $2^N$ is the maximum number of sets to be viewed for each map $F_k$. The development of effective integer optimization algorithms suitable for large $M$ and $N$ is an important and challenging problem.
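To give a feeling for how quickly the search space grows, the sketch below evaluates the bound for a few table sizes. It assumes that the bound is the product of the three quantities named above (the number of row permutations $M!$, the number of components $N$, and the $2^N$ subsets of components); the function name `candidate_bound` is hypothetical.

```python
from math import factorial


def candidate_bound(m: int, n: int) -> int:
    """Assumed upper bound M! * N * 2**N on the number of candidate maps."""
    return factorial(m) * n * 2 ** n


for m, n in [(4, 3), (8, 4), (12, 4)]:
    print(f"M={m}, N={n}: {candidate_bound(m, n)}")
```

The factorial term dominates, which is consistent with the remark that exhaustive search becomes impractical even for tables of moderate size.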

Application of the theory to image processing regarding biosafety issues
The theoretical results obtained could find practical applications in identifying disturbances of bioproduction processes that create certain biosafety threats. A significant example of such a disturbance is, as already mentioned, the effect of spots of bloom resulting from eutrophication on the biosafety of water used for watering domestic animals.
To identify and locate clumps of cyanobacteria on the surface of water, a set of strategies that determine transitions between the states of clumps will be used. The parameters of the sets of strategies can be expressed in terms of directly measured "primary" colorimetric parameters (CPs), as well as "secondary", system colorimetric parameters (SCPs). The "primary" colorimetric parameters include the values of the RGB model and their derivatives expressed by elementary functions. The SCPs can be, e.g., the evenness of the "primary" values and/or the range of their variation.
Within the framework of the present work, the capabilities of the results obtained are shown by the example of digital image processing of clumps of toxic cyanobacteria, called spots of bloom.
Consider a clump of cyanobacteria as a dynamical system in a state of dynamic equilibrium, characterized by changes in the values of its parameters within a certain cycle. More precisely, the image of a clump, divided into a number of parts (rectangles in this case), is considered as a system comprising the four components described below.
Using the RGB color model, the average values of four parameters, $R/(R+G+B)$, $G/(R+G+B)$, $(R+G)/(R+G+B)$, and $R/G$, were calculated for each rectangle.
Using the set of rectangles as a sample, the 1st and 3rd quartiles of each parameter were calculated. Then the initial values of the four parameters were encoded: if, for a rectangle, the average of a parameter lies between the 1st and 3rd quartiles of this parameter, the new value is taken to be 1, otherwise 0. As a result of the encoding, we obtain an observation table with four columns and a number of rows equal to the number of rectangles in the division of the image.
The values of the components equal to 1 are hereinafter referred to as stable values (SVs), and the values equal to 0 as unstable values (UVs).
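The encoding step described above can be sketched in Python as follows. This is a minimal illustration under several assumptions that the text does not fix: pixels are given as (R, G, B) integer triples, pixels for which a ratio is undefined (R+G+B = 0 or G = 0) are skipped, the quartiles are computed with the default method of `statistics.quantiles`, and the interval between the quartiles is treated as closed. The function names are hypothetical.

```python
from statistics import quantiles
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Params = Tuple[float, ...]


def rectangle_parameters(pixels: Sequence[Pixel]) -> Params:
    """Average the four colorimetric parameters over the pixels of one rectangle."""
    sums = [0.0, 0.0, 0.0, 0.0]
    count = 0
    for r, g, b in pixels:
        total = r + g + b
        if total == 0 or g == 0:
            continue  # assumption: skip pixels where a ratio is undefined
        values = (r / total, g / total, (r + g) / total, r / g)
        for k, value in enumerate(values):
            sums[k] += value
        count += 1
    if count == 0:
        raise ValueError("no pixel with defined parameters in this rectangle")
    return tuple(s / count for s in sums)


def encode(samples: Sequence[Params]) -> List[Tuple[int, ...]]:
    """Encode each parameter as 1 (stable value) if it lies between Q1 and Q3, else 0."""
    bounds = []
    for column in zip(*samples):  # one column per colorimetric parameter
        q1, _, q3 = quantiles(column, n=4)
        bounds.append((q1, q3))
    return [tuple(1 if lo <= v <= hi else 0 for v, (lo, hi) in zip(row, bounds))
            for row in samples]


# Usage: one list of pixels per rectangle of the image division.
# table = encode([rectangle_parameters(p) for p in rectangles])
```

Note that rectangles with identical encodings produce duplicate rows; since the model assumes unique rows, duplicates would have to be merged before the table is used as matrix (1).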
The identification of the model from observable data is based on the principle of parsimony. The essence of this principle is the following: the simpler the description of the system's dynamics (in a certain sense determined by a numerical measure), the more adequate the model is. Thus, system identification from data requires solving an integer optimization problem that minimizes the specified measure of simplicity. An upper bound on the number of binary mappings that are candidates for solving the optimization problem has been obtained. This number grows rapidly with the dimension of the problem, and even for initial data of moderate size the problem becomes computationally expensive. This feature restricts the applicability of the model to a certain extent and poses the problem of developing effective computational algorithms for system identification from data.
The efficiency of the model was demonstrated by the example of revealing clumps of toxic cyanobacteria on the surface of the Baltic Sea. A reference image was used as the source of initial data, and the binary model was built on its basis. Then the properties of the dynamics of the binary model were used to calculate the index of evenness of the systemic colorimetric parameters. The index was applied to processing the snapshot with added digital noise that simulated unfavorable observation conditions. The processed image of the noisy snapshot contained details that were virtually invisible in this image before processing.
This model can be used for descriptive analysis of an incoming binary dataflow when the researcher needs information on the relationships that generate the data. For example, if the data come from channels with time displacement errors or asynchrony, the time order of the initial data may be disturbed and needs to be recovered.
The model can be extended in several directions. The most obvious extension is the application of the results to nominal data. The principle of parsimony underlying the identification of the model from data can also be applied in a number of different ways. In this paper, we used the simplest measure of dependence, based on the arity of the mappings that generate the model's dynamics. However, if, for example, the process generating a binary dataflow depends essentially on the amount of computer memory, it is reasonable to take explicitly into account the sizes of the series that form the transition mappings.

Conclusions
1. For the measure introduced in solving the stated problem, the true time order is calculated as the time order that minimizes this measure among all available time orders. Thus, the problem of recovering the true order has been solved.
2. A numerical measure that makes it possible to calculate the complexity of the system dynamics for an arbitrary time order of observations (true or not) has been introduced. This measure is based on the arity of the mappings that determine the dependence of the current value of each component on the preceding values of all components. Thus, the problem of restoring the law of system dynamics has been solved.
3. The capabilities of the dynamic binary model for selecting colorimetric parameters for the identification of clumps of toxic cyanobacteria have been demonstrated using a digital satellite image from the NASA site. The demonstration consisted of several steps:
- identification of the model from the reference image;
- derivation of the index that reflects the systemic properties of the dynamics;
- use of the index to process the image with digital noise simulating unfavorable visibility conditions.