A DECISION TREE IN A CLASSIFICATION OF FIRE HAZARD FACTORS


Introduction
At present, Ukraine experiences growing levels of natural and technogenic threats and fire hazards. In particular, every year there is a significant number of natural and technogenic disasters; among rather frequent risks are fires, including those of natural origin. In recent years, the problem of protecting forests from fire has become particularly acute because of higher air temperatures, rainfall shortage, and strong winds. Thus, in 2015 compared to 2014, the number of fires in natural ecosystems increased 2.2 times [1].
In general, Ukraine annually records on average about 3.5 thousand wildfires that destroy more than 5 thousand hectares of forest. In addition to forest fires, natural fires include those of peat, steppe, and farmland. They are of different nature, depending on the conditions of fire origin, vegetation, and soil. The main groups of natural factors include topography, vegetation, and climatic resources, which together determine the characteristics of a landscape structure and, to a certain extent and at a certain probability, contribute to fires. Therefore, the necessary steps in risk assessment are to determine the fixed and variable factors that affect the potential for fires and to classify them by the available features. Such research is increasingly important because it is essential to classify the factors of fire danger and to define them with appropriate weight factors to improve the monitoring and prediction of natural fires in Ukraine.

Literature review and problem statement
In recent years, there has been an increase in the use of predictive analysis for studying and predicting natural fires. Active development of remote sensing (RS) and accumulation of the obtained geospatial data have increased the number of studies on monitoring natural fires by means of remote sensing [2]. However, the authors of that study focus their attention only on the technical features of the MODIS instrument. Common approaches to using GIS technology to assess and map fire threats are considered in [3]. On the other hand, that study does not sufficiently consider the analytical capabilities of geographic information tools for monitoring and predicting fires. The authors of [4] suggest using neural network algorithms to predict forest fires. The predictive information of that study is mainly based on meteorological parameters, without regard to other factors. The analysis of the dynamics of natural fires over several centuries using six variables is described in [5], but the emphasis is placed on the characteristics of the terrain. The methods of spatial analysis and mapping of fire threat areas are used by the authors of [6]. The focus is on zonal factors (vegetation, soil cover, and underlying surface), leaving climatic factors aside. Statistical evaluation of the risks of natural disasters, including fires, is explored in [7]. The study mainly examines the patterns of the distribution, occurrence, and effects of natural fires. Spatiotemporal features of fires are investigated by the authors of [8], who have developed an algorithm of space-time prediction, using different methods of data mining with the main focus on the methodology of the study. The use of multivariate regression and artificial neural networks for predicting natural fires is analyzed in [9], which is based only on meteorological data. The problem of utilizing geospatial data to classify fire factors is considered in [10]. However, the classification contains only constant factors, without including variable (dynamic) factors. Methods of artificial intelligence are involved in the analysis of fires in [11]. The authors have developed a technique to study areas that endured natural fires, but they do not consider the factors of their occurrence. An evolutionary technology to determine the time and the spread of fire on the basis of production rules determined by experts is suggested by the authors of [12]. The modeling of the time and the spread of fire does not include the parameters of fire sources and their consequences.

A. Musienko
The analysis of the above-mentioned approaches shows that they need considerable improvement. Thus, despite the large number of studies on the use of predictive analysis for researching and predicting natural fires, it is necessary to specify the common risk assessment algorithms, depending on the environmental conditions of a territory. In particular, for estimating the models of the risk of forest fires, it is important to take into account classifications of fire factors.

The purpose and objectives of the study
The purpose of this study is to develop a classification of factors of fire danger with the use of a decision tree. According to the set goal, it is necessary to solve the following main tasks:
- to select and prepare geospatial data for the study;
- to choose an algorithm for the classification with the use of a decision tree;
- to classify fixed and variable factors affecting fire safety of an area;
- to assign weight coefficients to the main factors.

Data to study fire hazard factors
The study involved the use of geospatial data of the European Space Agency, which in recent years has expanded its group of remote sensing satellites and launched the Copernicus program. This program provides free satellite imagery and derived thematic products that can be used in the analysis of vegetation, climate change, and the safety of water resources, which are traditionally considered in determining the level of fire danger in an area.
Copernicus is a European system for monitoring the Earth, for which data are collected from various sources, including Earth observation satellites and on-site sensors. They are further processed to provide accurate and relevant information in six thematic areas: land cover, sea waters, the atmosphere, climate change, management of emergency, and security [13].
The Global Land Service is part of the Copernicus land monitoring service, which provides a range of biogeophysical data on the status and evolution of the Earth's surface on a global scale at middle and low spatial resolutions. These data are used to monitor vegetation, the water cycle, and the energy balance. The thematic products of Copernicus can be used to study the dynamics of regional climate changes and natural threats by assessing the dynamic and structural components of the vegetation cover of Ukraine.
One of the areas that is also supported by the system is the accumulation of data on territories that have endured fires. The Burnt Area product contains data on land areas that were burned during a season. Its use helps both to obtain integrated data on the scope of the fires for a certain period of time and to detect the outbreak of new fires during the time interval between recordings.
These samples can be used as training data to create models of fire risk using other Copernicus products, which are collected by the same methodology.
Among the Copernicus products that can characterize a fire hazard, it is necessary to single out the NDVI (Normalized Difference Vegetation Index), a simple quantitative indicator of the amount of photosynthetically active biomass (usually called the vegetation index). On the one hand, the NDVI can serve as an indicator of vegetation while, on the other hand, it reveals the presence of dry organic matter that is suitable for burning. This is one of the most common and widely used indices for solving problems on the basis of a quantitative assessment of vegetation. It is calculated by the following formula:
$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{RED}}{\mathrm{NIR} + \mathrm{RED}},$$
where NIR is reflection in the near infrared spectrum, and RED is reflection in the red spectrum.
Accordingly, the density of vegetation (NDVI) at some point in the image equals the difference of the intensities of the reflected light in the near infrared and red ranges, divided by the sum of their intensities.
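The formula above can be illustrated with a minimal Python sketch. The function name `ndvi` and the reflectance values are assumptions made only for this illustration; they are not taken from the Copernicus products used in the study.

```python
def ndvi(nir, red):
    """Return (NIR - RED) / (NIR + RED), or None when both bands are zero."""
    total = nir + red
    if total == 0:
        return None  # the index is undefined when no signal is reflected
    return (nir - red) / total


# Hypothetical reflectance values, chosen only to show the behaviour.
dense_vegetation = ndvi(0.50, 0.08)  # strong near infrared reflection
dry_grass = ndvi(0.30, 0.20)         # weaker contrast between the bands
```

With these assumed inputs, the first pixel receives a noticeably higher index than the second one, which agrees with the interpretation of the NDVI as a measure of photosynthetically active biomass.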
Low NDVI values indicate a poor condition of vegetation, which can be caused by a drought and can consequently lead to an increased fire danger. In this case, it is important to maintain appropriate scale and resolution of data, which should be more generalized than for conventional agricultural monitoring.
In addition to the NDVI, an important index characterizing the fire condition of vegetation is the DMP (Dry Matter Productivity), which shows the overall growth rate, or dry biomass increase, of the vegetation and is expressed in kilograms of dry matter per hectare per day (kgDM/ha/day). The DMP is directly related to the NPP (net primary productivity), but its units are adapted for agro-statistical purposes. In this case, high values of the DMP may contribute to fire risks within an area.
Another important measure is the Soil Water Index (SWI), which assesses the moisture conditions at different depths in the soil. This index indicates the occurrence of a drought and an increase in the fire hazard. Soil moisture is mainly driven by precipitation through the process of infiltration. The moisture content of the soil is a very heterogeneous variable, subject to small scale changes in soil properties and drainage. Satellite measurements are integrated over relatively large areas with the presence of vegetation, which complicates interpretation. High values of the index reduce fire risk, whereas low values indicate an increased danger of natural fires.
All parameters of natural systems are recorded with a temporal resolution of once every 10 days, so a prediction of fire danger should also be rendered once every 10 days.
A combined analysis of these data is used for determining integrated indicators that reflect the potential fire hazard of an area. For this purpose, a number of GIS analysis procedures are used within the ArcGIS Spatial Analyst module, including reclassification functions and raster algebra.
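The two GIS operations mentioned above, reclassification and raster algebra, can be sketched in plain Python on small nested lists. This is only an illustration of the idea and not the ArcGIS Spatial Analyst workflow itself; the function names, class breaks, hazard scores, and equal weights are all hypothetical.

```python
def reclassify(raster, breaks, scores):
    """Replace each cell by the score of the first interval whose upper bound it does not exceed."""
    result = []
    for row in raster:
        new_row = []
        for value in row:
            for upper, score in zip(breaks, scores):
                if value <= upper:
                    new_row.append(score)
                    break
            else:
                new_row.append(scores[-1])  # value lies above the last break
        result.append(new_row)
    return result


def weighted_sum(layers, weights):
    """Cell-by-cell weighted sum of equally sized rasters (simple raster algebra)."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[r][c] for layer, w in zip(layers, weights)) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2 x 2 rasters and class breaks; 3 = high hazard, 1 = low hazard.
ndvi_raster = [[0.1, 0.5], [0.8, 0.3]]
swi_raster = [[10, 70], [90, 30]]
ndvi_class = reclassify(ndvi_raster, [0.2, 0.6], [3, 2, 1])
swi_class = reclassify(swi_raster, [25, 60], [3, 2, 1])
hazard = weighted_sum([ndvi_class, swi_class], [0.5, 0.5])
```

In practice, the breaks and weights would come from the classification itself rather than from fixed constants as in this sketch.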

A rationale for the chosen method of the study
To solve the problem of classifying numerous factors of fire, we suggest using the method of building a decision tree, that is, a method of representing rules in a hierarchical, consistent structure where each object corresponds to a single node that produces the decision.
Decision trees were first suggested by Hovland and Hunt in the late 1950s.
The earliest known work to describe the nature of decision trees, "Experiments in Induction", was published in 1966 [14]. The basic component of the tree structure is the answer "Yes" or "No" if the tree is binary (the CART algorithm), or a series of questions identified dynamically when trees have an arbitrary number of branches (the C4.5 algorithm).
The model, which is presented in the form of a decision tree, is intuitive and conducive to understanding the problem being solved. This feature of decision trees is important not only for assigning a new object to a class but also for interpreting the classification model as a whole. The decision tree makes it possible to understand and explain why a particular object belongs to a particular class.
Decision trees can create classification models in areas where it is difficult for analysts to formalize knowledge. The algorithm for constructing a decision tree does not require the user to select input attributes (independent variables). The algorithm can accept all existing attributes as input; it will choose the most important among them by itself, and only they will be used to build the tree.
The accuracy of the models created with the help of decision trees is comparable to that of other methods of building classification models (statistical methods and neural networks). By now, a number of scalable algorithms have been developed and used to build decision trees from very large databases. Most algorithms for constructing decision trees are capable of some special handling of missing values. Many classical statistical methods for solving classification problems can only work with numeric data, whereas decision trees can process both numerical and categorical types of data [14].
A number of algorithms have been developed for automatic building of decision trees through training with examples. They include, among others, CART, C4.5, NewId, ITrule, CHAID, and CN2.
The CART (Classification And Regression Trees) algorithm is designed to build a binary decision tree. Such trees are called binary because each node in the tree partition has only two children. This algorithm solves both classification and regression problems.
The ID3 (Iterative Dichotomiser 3) algorithm is used to process natural language domains. ID3 is difficult to use for processing continuous data: if the value of an attribute is continuous, there are many ways of dividing the data by this attribute. The ID3 algorithm works recursively, starting from the tree root that contains all the data, and in each node the chosen attribute is used to break the set of data into subsets.
The ID3 algorithm starts with the original data set as the root node. At each iteration, the algorithm goes through every unused attribute of the set and calculates the entropy for this attribute, after which it selects the attribute that has the lowest entropy (or the greatest information gain). Next, the set is split by the selected attribute to obtain subsets of data. If all elements in a subset belong to the same class, this subset is not processed further, and the corresponding node in the decision tree becomes terminal. The work of the ID3 algorithm ends when each subset has been classified. This algorithm generally produces small trees, but it does not always give the smallest possible tree [15].
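The recursive procedure described above can be sketched as follows. This is a simplified illustration for categorical attributes only, not the implementation used in the study; the function names and the toy records (land cover, moisture, hazard class) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def id3(rows, attributes, target):
    """Build a nested-dict decision tree from a list of dict records."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:
        return labels[0]  # pure subset: terminal node
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no attributes left: majority class

    def remainder(attr):
        # Expected entropy after splitting the rows by attr.
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

    best = min(attributes, key=remainder)  # lowest entropy = greatest information gain
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, rest, target)
    return tree


# Hypothetical training records, for illustration only.
records = [
    {"cover": "pine", "moisture": "dry", "hazard": "high"},
    {"cover": "pine", "moisture": "wet", "hazard": "low"},
    {"cover": "grass", "moisture": "dry", "hazard": "high"},
    {"cover": "grass", "moisture": "wet", "hazard": "low"},
    {"cover": "water", "moisture": "wet", "hazard": "low"},
]
tree = id3(records, ["cover", "moisture"], "hazard")
```

On these toy records, splitting by moisture leaves no residual entropy, so it becomes the root, and both of its subsets are already pure.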
A drawback of the ID3 algorithm is its incorrect processing of attributes that have unique values for all objects of a training set. For such an attribute, the entropy after the split is zero, so the attribute is selected first, yet the resulting tree yields no new knowledge about the dependent variable: each subset obtained after splitting contains a single object. The C4.5 algorithm solves this problem by introducing normalization.
The C4.5 algorithm, based on the ID3 algorithm, adds a function of converting the decision tree into equivalent rules and decisions for important research tasks. C4.5 adopts the information entropy method and selects the attribute with the maximum information gain ratio, together with a threshold for splitting, as the best test attribute.
The C4.5 algorithm builds a decision tree with an unlimited number of branches in a node. This algorithm can only work with a discrete dependent attribute and, therefore, can only solve classification tasks. C4.5 is considered to be one of the best known and most used algorithms for building classification trees.
Working with the C4.5 algorithm, it is necessary to comply with the following requirements:
- each entry of a data set must be associated with a defined class, which means that one of the attributes of the data set should be the label of a class;
- the classes are to be discrete;
- each sample must clearly relate to one of the classes;
- the number of classes should be much less than the number of records in the test data set.
It should also be taken into account that the C4.5 algorithm works slowly with very large and noisy data sets.
A pseudocode of the C4.

The method of classifying fire hazard factors
The production of a decision tree:
1. The calculation of information entropy in the classification.
Suppose that S is the number of examples in the training set, where there are m classes of samples C_i (i=1, 2, ..., m), and S_i is the number of samples in the class C_i. The formula is as follows:
$$I(S_1, S_2, \ldots, S_m) = -\sum_{i=1}^{m} p_i \log_2 p_i,$$
where p_i = S_i/S is the probability of a random example belonging to C_i.
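As a minimal sketch of step 1, the entropy can be computed from the class counts S_1, ..., S_m. The function name `class_entropy` and the counts in the examples are assumptions made for illustration; the logarithm is taken to base 2.

```python
import math


def class_entropy(counts):
    """I(S_1, ..., S_m) = -sum(p_i * log2(p_i)), with p_i = S_i / S."""
    total = sum(counts)
    # Empty classes contribute nothing to the sum (0 * log 0 is treated as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical class counts: equally mixed, pure, and unevenly mixed samples.
balanced = class_entropy([5, 5])
pure = class_entropy([10, 0])
uneven = class_entropy([9, 5])
```

The entropy is largest for an equally mixed sample and vanishes for a sample that contains a single class.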

2. The calculation of information entropy of each attribute.
Let us assume that an attribute X has v values {x_1, x_2, ..., x_v} and divides S into v subsets {s_1, s_2, ..., s_v}; s_j includes those examples from S that assume the value x_j for the attribute X (j=1, 2, ..., v). The expected entropy (conditional entropy) of using the attribute X as a classification attribute is as follows:
$$E(X) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \ldots + s_{mj}}{S}\, I(s_{1j}, s_{2j}, \ldots, s_{mj}), \qquad I(s_{1j}, s_{2j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij},$$
where s_{ij} is the number of examples that relate to the class C_i in the subset s_j, and p_{ij} = s_{ij}/|s_j| is the probability of an example in s_j belonging to C_i.
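Step 2 can be sketched in the same way. Here the split by an attribute X is passed as a table of counts in which `partition[j][i]` plays the role of s_ij; the function names and the counts are hypothetical and serve only as an illustration.

```python
import math


def class_entropy(counts):
    """I(s_1, ..., s_m) = -sum(p_i * log2(p_i)) over the non-empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def expected_entropy(partition):
    """E(X): entropy of each subset s_j weighted by its share of all examples."""
    total = sum(sum(subset) for subset in partition)
    return sum(sum(subset) / total * class_entropy(subset) for subset in partition)


# Hypothetical counts for an attribute with three values and two classes:
# each inner list is [examples of class C_1, examples of class C_2] in one subset.
partition = [[2, 3], [4, 0], [3, 2]]
e_x = expected_entropy(partition)
```

The pure middle subset contributes nothing, so the expected entropy here is determined by the two mixed subsets.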

3. The calculation of the information gain with respect to the attribute.
The information gain of the attribute X is as follows:
$$\mathrm{Gain}(X) = I(S_1, S_2, \ldots, S_m) - E(X).$$
The information gain tends to favor test attributes with many values, which produce multiple branches. However, a test that produces multiple branches does not necessarily give the best predictive result for unknown objects. The information gain ratio compensates for this shortcoming of the information gain: it is a refinement of the information gain that eliminates the influence of attributes producing multiple branches. The gain ratio considers not only the number of branches but also the size of each branch (the number of examples), that is, not only the amount of information obtained for the classification but also the information contained in the split itself. The information gain ratio of the attribute X is as follows:
$$A(X) = \frac{\mathrm{Gain}(X)}{I(S_1, S_2, \ldots, S_v)},$$
where v is the number of node branches, and S_i is the number of entries in the i-th branch.
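The difference between the information gain and the gain ratio can be seen in a short sketch. The function names and both toy splits are assumptions made for illustration: one split uses an attribute with a unique value for every example, and the other uses an attribute with two values.

```python
import math


def class_entropy(counts):
    """I(S_1, ..., S_m) = -sum(p_i * log2(p_i)) over the non-empty groups."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def expected_entropy(partition):
    """E(X), where partition[j][i] is the count of class C_i in subset s_j."""
    total = sum(sum(subset) for subset in partition)
    return sum(sum(subset) / total * class_entropy(subset) for subset in partition)


def gain(partition):
    """Gain(X) = I(S_1, ..., S_m) - E(X)."""
    class_totals = [sum(column) for column in zip(*partition)]
    return class_entropy(class_totals) - expected_entropy(partition)


def gain_ratio(partition):
    """A(X) = Gain(X) / I(S_1, ..., S_v), with S_j the size of branch j."""
    split_info = class_entropy([sum(subset) for subset in partition])
    return gain(partition) / split_info if split_info > 0 else 0.0


# Hypothetical splits of four examples (two per class).
unique_id_split = [[1, 0], [1, 0], [0, 1], [0, 1]]  # one branch per example
two_value_split = [[2, 0], [0, 2]]                  # two pure branches
```

Both splits have the same information gain, but the gain ratio of the split with one branch per example is half that of the two-branch split, which is how the normalization penalizes attributes that produce many branches.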
4. The creation of a decision tree.
The information gain Gain(X) is calculated, followed by estimating the information gain ratio A(X) for each attribute; the attribute selected for the test is the one that has the greatest information gain ratio and an information gain that is not lower than the average value for all attributes. The test attribute becomes a node, and each of its values forms a branch with the corresponding subset of examples. If all examples of a node belong to the same class, the node is a leaf, which is labeled with this class. The initial decision tree is formed by recursion, which continues until all examples of each subset belong to the same class or until no attributes remain to be tested [10].

The results of classifying
The determining factors of natural fires are the vegetation cover and the soil, the nature of which affects the ranking of potential danger of fires. Depending on the ratio between the extent of the soil cover and the vegetation types, it is possible to distinguish between various spatial groupings in terms of the risk of fires. The main features of vegetation according to the types are the following:
- forest cover, shrubs, or grass cover;
- seasonal or evergreen cover of forests, coniferous or deciduous forests;
- the ratio of vegetation types and the extent of the projective soil cover.
For the soil cover, the important factors are open areas, soil moisture, and soil forming conditions (wetlands and peat deposits). Forms of land use are also important, including agricultural territories of various types and urban areas.
The relief factor contains several important components, including the physical surface altitude, which determines insolation and dissected topography, surface slope, and slope exposure. The altitude affects the level of surface insolation and the degree of rainfall, humidity, and temperature.
Climatic factors also play a key role in assessing the danger of natural fires. Traditionally, fire risk modeling takes into account a set of factors such as temperature, precipitation, moisture balance, solar radiation, as well as wind speed and direction. However, the sample in this study was selected by only the basic factors: the average annual temperature in degrees Celsius and the average annual rainfall in millimeters.
Thus, using a decision tree, we have classified the territory in terms of fire danger and obtained quantitative measurements that can act as weights for further modeling of fire hazards. The selected variable factors can have a nonlinear correlation with the fire hazard. For example, high and low values of the NDVI correspond to a low probability of fire, due to the low volume of biomass, and territories with index values that are average and below the average reflect a high probability of fire.
At the same time, low and high values of the DMP index are more dangerous, whereas average values are of less impact. A high value of the SWI corresponds to a low likelihood of fires.

Conclusions
We have determined a system of indicators of the constant fire risk factors (land cover, topography, and climatic resources) as well as variable indicators, which are taken into account while determining the level of a fire hazard in an area and are used to describe the state of vegetation, climate change, and supply of water resources. The study involves the use of geospatial data of Copernicus as a European system for monitoring the Earth, including the NDVI (Normalized Difference Vegetation Index), the DMP (Dry Matter Productivity), and the SWI (Soil Water Index).
1. To solve the problem of classifying fire factors, we suggest using the method of building decision trees, which is a method of arranging rules in a hierarchical consistent structure where each object corresponds to a single node through which the decision is made. The C4.5 algorithm is used to build a decision tree with an unlimited number of branches in a node and to represent causal relationships in the process of assigning an object of classification to a particular class.
2. The use of the C4.5 algorithm makes it possible to build a decision tree and classify fixed and variable factors of fire danger. We distinguish between 3 main classes of permanent environmental factors, which include land cover, topography, and climatic resources. They, in turn, are divided into subclasses. In particular, in the forest cover subclass, the greatest fire danger can arise in pine forests and thick bushes; in the relief subclass, it is important to take into account the slope exposure and the slope angle of a surface; in the subclass of climatic factors, the important indicators are high temperature and low rainfall. The construction of the decision tree for variable factors (the indices NDVI, DMP, and SWI) allows ranking these indicators according to their impact on the level of fire danger in an area.
3. The use of the C4.5 algorithm has helped obtain quantitative measurements that can act as weights for further modeling of fire hazards. The resulting values range from 0 to 1, where a value of 0 corresponds to areas where natural fires cannot occur (e.g., water surfaces), whereas values close to 1 indicate a high hazard potential of natural fires.
4. The decision trees obtained in the process of classifying are important for planning measures to prevent natural fires. They can also be used for zoning in terms of fire hazards in spatial modeling of fires, mathematical modeling of their effects, as well as in further monitoring and prediction of natural fires.