DEVELOPMENT OF AN IMAGE SEGMENTATION MODEL BASED ON A CONVOLUTIONAL NEURAL NETWORK

This paper considers a model of image segmentation based on convolutional neural networks and studies the efficiency of the process for models that involve training the deep layers of convolutional neural networks. There are objective difficulties associated with determining the optimal characteristics of neural networks.


Introduction
Image processing is extremely important in modern science and practice, so it is constantly evolving and improving. It is used in many industries, including precision farming (agromonitoring), security systems, quality control systems, etc. These areas involve vision systems, robotic complexes, unmanned aerial vehicles (UAVs), video surveillance systems, and web services and mobile applications for identification and search.
One type of image processing is segmentation, which is widely used in industry, art, medicine, space exploration, process management, automation, and many other fields [1]. Segmenting an image involves splitting the input image into regions that differ from one another according to a certain criterion. The result is a set of regions that together cover the entire input image.
There is a large number of image segmentation methods, the most common being methods based on image graph analysis, clustering, contour and threshold methods, as well as neural network methods. At the same time, these methods [2][3][4] work much faster on small images with limited coloration.
Since segmentation precedes a higher level of image processing, certain requirements apply to segmentation methods. In general, these requirements could be formulated as follows:
- maximum correspondence of the segmented area to the real object;
- high performance;
- resistance to errors;
- high accuracy.
Therefore, it is necessary to analyze image segmentation methods and choose the optimal one according to the above requirements, in particular high accuracy. It is also worth considering the parameters that characterize these methods, since changing them has a direct impact on the accuracy, speed, and overall efficiency of the segmentation process.
A modern relevant direction of production is the development of precision farming, which is based on the results of agromonitoring, namely on images from UAV video cameras used for the analysis of vegetation, assessment of the areas of damage to crops, forecasting yields, etc. An important criterion in this case is the UAV's ability to avoid collisions with close objects and to determine its position in space and the direction and trajectory of a flight by acquiring input data in the form of segmented images.
The effectiveness of these systems is determined by the accuracy of segmentation, for the evaluation of which experimental research is required.

Literature review and problem statement
Paper [5] reports the results of image segmentation using clustering. The search for clusters in images based on similar characteristics is shown. This process is characterized by high speed, accuracy, and resistance to errors at the initial stage. However, unresolved issues remain related to a significant decrease in the effectiveness of the method when the image size increases. The reason for this may be objective difficulties associated with the large number of small elements in the image and the poor operation of the method on noisy images, which makes the relevant studies impractical.

Study [6] considers threshold methods of image segmentation, which are characterized by high performance and ease of implementation. However, unresolved issues related to rather low segmentation accuracy remain. The accuracy suffers when the color palette is limited at the boundaries of image elements, which prevents the use of the method for full-color images.

Paper [7] reports the results of image segmentation using contour methods. Resistance to changes in the parameters of input images is shown. However, the problem of tearing up areas of the image remains unresolved. In addition, the methods are characterized by low performance. This could be caused by objective difficulties associated with the wide palette of colors in real images, which limits the use of this method for photorealistic images.

Study [8] reports the results of image segmentation using methods based on image graph analysis. It is shown that the obtained data do not depend on changes in the uniformity of colors and the size of the input images. However, these methods demonstrate poor performance and require a lot of memory. This could be caused by the difficulties of choosing a metric separately for each type of image and by the large number of graph elements needed to segment the image. This makes the cited studies costly in terms of computing resources.
An option to overcome the above difficulties associated with insufficient accuracy, efficiency, and performance may be to use neural network-based image segmentation methods [9], specifically convolutional ones. This is the approach used in work [10], where neural networks for color images are used but training parameters are not analyzed. In addition, a similar principle is implemented in paper [11], where learning parameters are given without explaining the possibilities of use, in particular for segmentation tasks.
All this suggests that it is advisable to conduct a study to improve the effectiveness of neural network learning, which would significantly improve the accuracy of image segmentation.

The aim and objectives of the study
The aim of this study is to improve the architecture of a convolutional neural network to segment images and select the learning parameters of this network. That would make it possible to build a new neural network with improved accuracy to segment images, which could be used as a pre-trained neural network for other tasks.
To accomplish the aim, the following tasks have been set:
- to investigate neural network models based on the PASCAL VOC set;
- to evaluate the Voc-3 model based on the NVIDIA-Aerial Drone set.

Materials and methods to study image segmentation using convolutional neural networks
We studied image segmentation by applying appropriate methods based on convolutional neural networks, taking into consideration neural network learning parameters. To test the effectiveness of these methods, the PASCAL VOC and NVIDIA-Aerial Drone sets were used, containing a large number of images with labeled groups of pixels and defined object classes. PASCAL VOC contains images where the desired class is clearly distinguished from the other pixels by color [12]. NVIDIA-Aerial Drone contains images acquired from video cameras attached to UAVs, shot from a height of several hundred meters [13]. Our research was carried out using the DIGITS programming environment together with the Caffe environment, which is designed for deep training of neural networks with an emphasis on speed and modularity in model development. The combination of these environments makes it possible to quickly train neural networks with deep layers and is used for the tasks of classification, segmentation of images, and detection of objects in them. DIGITS contains a pre-trained AlexNet model, which is characterized by parameters adapted for segmentation (Table 1) and has a flexible architecture (Fig. 1).
The AlexNet architecture consists of five convolutional layers, interleaved with pooling and normalization layers, and three fully connected layers; their parameters may change in the learning process. An RGB image of size 227×227×3 is fed to the input. The filter size of the first layer is 11×11, with 96 kernels. AlexNet has significantly lower memory requirements (by a factor of 10) with an increase in accuracy of more than 70 %. Thus, the AlexNet architecture was used as the base to build a specialized FCN-AlexNet model by adding a fully connected convolution layer and making the following changes in DIGITS:
- the training and verification networks should be combined into a single neural network;
- the specialized layers that retrieve image data for segmentation, SBDDSegDataLayer and VOCSegDataLayer, must be replaced by simple layers that retrieve data from LMDB;
- a power layer must be added to the shear layer of the input data;
- bilinear filtering of the neuron weights must be added to the upscore layer;
- an accuracy layer must be added to evaluate the effectiveness of the model on the test set of images;
- the training and verification processes should be normalized by setting the batch size=1 parameter [15].
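The bilinear filtering of neuron weights mentioned above is commonly implemented by initializing the upscore (deconvolution) layer with a fixed 2-D bilinear interpolation kernel. A minimal pure-Python sketch of such a kernel follows; the function name `bilinear_kernel` is our own, and real FCN implementations apply the same coefficients per channel:

```python
def bilinear_kernel(size):
    """size x size 2-D bilinear interpolation kernel, the standard fixed
    initialization for FCN upscore (deconvolution) layer weights."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    return [[(1 - abs(i - center) / factor) * (1 - abs(j - center) / factor)
             for j in range(size)] for i in range(size)]

kernel = bilinear_kernel(4)  # peak weight 0.5625 at the four central cells
```

Each weight falls off linearly with distance from the kernel center along both axes, so transposed convolution with this kernel performs plain bilinear upsampling of the coarse score map.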
The choice of the FCN-AlexNet model optimization algorithm is determined by the features of image segmentation, which require good convergence of the algorithm and, for practical use, high performance. A comparison of algorithms [16] shows that for the task of image segmentation, the best performance results are demonstrated by Adam (an increase of 10-50 %) and stochastic gradient descent (an increase of 5-20 %), depending on the metric. In addition, these algorithms show good convergence, especially stochastic gradient descent.
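The behavior of the two optimizers can be illustrated on a toy one-dimensional problem; this is an illustrative sketch, not the FCN-AlexNet training loop, and the hyperparameters are arbitrary:

```python
import math

def sgd(grad, x, lr=0.1, steps=100):
    """Plain gradient descent on a 1-D objective."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def adam(grad, x, lr=0.1, steps=100, b1=0.9, b2=0.999, eps=1e-8):
    """Adam (adaptive moment estimation) on the same 1-D objective."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
        v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Minimize f(x) = (x - 3)^2; its gradient is 2*(x - 3).
grad = lambda x: 2 * (x - 3)
x_sgd = sgd(grad, 0.0)    # converges very close to the minimum at 3
x_adam = adam(grad, 0.0)  # also approaches the minimum
```

On this smooth quadratic, plain gradient descent contracts geometrically toward the minimum, while Adam's per-parameter step normalization makes it approach at a roughly constant rate, which matches the convergence remark above.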
The main indicators of the effectiveness of neural network training determined during our study were accuracy and error. Accuracy is calculated as the percentage of correctly identified classes (or pixels belonging to a particular class) in the image relative to all classes (or all pixels). The error is calculated as the percentage of incorrectly recognized classes (or pixels belonging to a particular class) in the image relative to all classes (or all pixels). To assess the effectiveness of neural network training, optimal neural network parameters are determined: the duration of training (the number of epochs), the optimization algorithm (adaptive moment estimation (Adam) or stochastic gradient descent (SGD)), the type of change in the learning rate, the gamma coefficient, and the learning rate (the increment value). Combinations of parameters used in training the 4 models are summarized in Table 2.
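The pixel-level accuracy and error indicators defined above can be sketched as follows; `pixel_accuracy` is our own minimal helper, not part of DIGITS or Caffe:

```python
def pixel_accuracy(pred, truth):
    """Share of correctly labeled pixels in a segmentation map;
    the error indicator is the complement (1 - accuracy)."""
    total = correct = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            correct += (p == t)
    return correct / total

pred  = [[1, 1, 0],
         [0, 2, 2]]   # predicted class labels per pixel
truth = [[1, 0, 0],
         [0, 2, 1]]   # ground-truth labels
acc = pixel_accuracy(pred, truth)  # 4 of 6 pixels match
err = 1 - acc
```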
The error was investigated on the training and verification samples, which are the above sets of labeled segmented images (PASCAL VOC and NVIDIA-Aerial Drone). The PASCAL VOC set contains 20 categories of objects and includes 1,464 images for training and 1,449 for verification. For the NVIDIA-Aerial Drone set, the images were split at a ratio of 80 % for the training sample and 20 % for the test, using cross-validation to assess the accuracy of the model.
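The 80/20 split used for the NVIDIA-Aerial Drone set can be sketched as below; this is an illustrative helper of our own, as the actual split is configured inside DIGITS:

```python
import random

def split_80_20(items, seed=0):
    """Shuffle a dataset reproducibly and split it 80 % / 20 %
    into training and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]

train, test = split_80_20(range(100))  # 80 training items, 20 test items
```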
The error value on the training and, in particular, the verification sample should gradually decrease. This indicates correct neural network training and the absence of overtraining, that is, the adequacy of the model. Validation of training results is defined as a gradual decrease in the error on the test sample. The number of training epochs is selected so as to obtain the highest accuracy on the test sample in the absence of significant fluctuations in its numerical values. The criterion for increasing the number of training epochs is a gradual increase in accuracy on the test sample. The beginning of a drop in accuracy on the test sample is a criterion of overtraining, the absence of which is a condition for validation of the model.

1. Investigating neural network models for image segmentation from the PASCAL VOC set
Our study was carried out on pixel-wise labeled images from the PASCAL VOC set. Because the sizes of the images in the set differ, checking the model on a new image requires pixel-by-pixel markup of segmented areas. In addition, accuracy calculation occurs in soft real time, which requires a high-performance neural network with limited memory to ensure high accuracy. The neural network input image comes in RGB (256-color palette) of up to 10 MB in size, and the output is obtained in PNG (lossless) format. The batch size is 32 with the number of threads equal to 4. Pixel markup (if any) must match the Lightning Memory-Mapped Database (LMDB) annotation format. Models are imported in prototxt format. Segmentation time must not exceed 50 ms for a Full HD image.
The model performance check is illustrated by the charts that were constructed automatically in the DIGITS programming environment based on the specified parameters given in Table 2. The Caffe environment was used for hardware acceleration of training. The number of accuracy values is equal to the number of training epochs; the final accuracy of the model is the accuracy at the last epoch.

Fig. 2 shows the Voc-1 model performance check chart: on the training sample, an increase in the number of training stages (epochs) leads to a decrease in the error in absolute value from 3 to 1.5-2.5 and, after epoch 10, it stabilizes in the range of 1.3-2.7. Accuracy on the test sample reaches 72 % and almost does not increase after the first epoch.

Fig. 3 shows the Voc-2 model performance check chart: on the training sample, an increase in the number of epochs leads to a decrease in the error in absolute value from 3 to 0.7-0.2 and, after epoch 1, it practically does not change. Accuracy on the test sample reaches 80 % and smoothly increases after the first epoch to a value of 82 %.

Fig. 4 shows the Voc-3 performance check chart: on the training sample, an increase in the number of epochs leads to a decrease in the error in absolute value from 3 to 0.6-0.2 and, after epoch 10, it stabilizes in the range of 0.4-0.2. Accuracy on the test sample reaches 78 % and smoothly increases after the first epoch to a value of 83 %.

Fig. 5 shows the Voc-4 model performance check chart: on the training sample, an increase in the number of epochs leads to a decrease in the error in absolute value from 3 to 1.7-0.3 and, after epoch 10, it stabilizes in the range of 0.7-0.3. Accuracy on the test sample reaches 75 % and smoothly increases after the first epoch to a value of 81 %.
The results of our study are the findings from the efficiency check of the 4 models given in Table 3. Table 3 shows that the Voc-3 model demonstrates the greatest accuracy, 83 %, at a learning rate of 0.0001, based on SGD with a step-by-step technique for changing the learning rate. The smallest accuracy value, 72 %, belongs to the Voc-1 model, which uses the adaptive moment estimation algorithm. Fig. 6 shows the segmentation of an animal photograph in the DIGITS programming environment using the Voc-3 model trained for different numbers of epochs; the Caffe environment was used for hardware acceleration of training.

2. Evaluation of the trained Voc-3 model for segmenting images from the NVIDIA-Aerial Drone set
In practice, segmentation is part of environmental monitoring involving UAVs, which requires high segmentation accuracy for control and orientation in space. UAV video cameras acquire contrasting images, unlike those in the PASCAL VOC set. Therefore, the model with the highest accuracy, Voc-3, must be additionally trained on another set of images acquired from a UAV video camera. This task uses the NVIDIA-Aerial Drone set. At the same time, the model must demonstrate high performance with limited memory to ensure high accuracy in real time.
A set of NVIDIA-Aerial Drone images was used to test the effectiveness of the Voc-3 model trained with DIGITS. The Caffe environment was used for hardware acceleration of training. During our research, we used the parameter values defined as optimal in the study (Table 2):
- learning rate, 0.0001;
- training duration (the number of epochs), 50;
- optimization algorithm, SGD;
- the type of change in the learning rate, stepped;
- gamma coefficient, 0.1;
- pre-trained model, the fully connected convolutional neural network FCN-AlexNet.
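The stepped learning-rate policy with the base rate and gamma listed above corresponds to multiplying the rate by gamma after every fixed interval of epochs. A sketch follows; the interval `step_size` is a hypothetical value chosen for illustration, since the source does not state it:

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Stepped policy: multiply the learning rate by gamma after every
    step_size epochs, i.e. lr = base_lr * gamma ** (epoch // step_size)."""
    return base_lr * gamma ** (epoch // step_size)

lr_start = step_lr(0.0001, 0.1, 17, 0)   # base rate at the first epoch
lr_later = step_lr(0.0001, 0.1, 17, 20)  # reduced by gamma after one step
```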
Thus, the Drone-1 model was built; the results of its testing are shown in Fig. 7. Fig. 7 shows that the error in absolute value for the training sample of images becomes close to 0 after the first epoch of training, except for episodic spikes at certain epochs. On the test sample of images, the error after the first epoch also becomes close to 0, the accuracy of the model becomes close to 100 %, and it practically does not change. After the end of epoch 30, the accuracy value is 99 %. Fig. 8 shows the segmentation of an image acquired from a UAV video camera by the Drone-1 model in the DIGITS programming environment.
The available performance values make it possible to compare the Drone-1 model with others [2][3][4][5][6][7]; however, it makes more sense to compare it with models of similar architecture built by training on a similar basis. Therefore, the comparison was carried out according to the accuracy criterion with some well-known AlexNet-based models developed with similar parameters and trained on images acquired from UAV cameras: FireCAMP2 SLIC, VEDAI, NZAM/ONERA Christchurch, and ISPRS Potsdam [17,18]. The results of the accuracy assessment are given in Table 4. Table 4 shows that the greatest accuracy, namely 98 %, is demonstrated by the Drone-1 model, whereas the others were trained on images acquired from UAV cameras that were not part of the NVIDIA-Aerial Drone set.
The adequacy, reliability, and convergence of the Drone-1 model were compared with the other models [17,18]. To this end, we segmented 100 pixel-wise hand-labeled images from the Aeroscapes set [19]. Calculations were performed in the Jupyter Notebook environment in the Python language. Averaged results are summarized in Table 5 (Test results to assess the reliability, adequacy, and convergence of the model). Table 5 shows that the greatest accuracy, namely 87 %, is demonstrated by the developed Drone-1 model. This indicates the high reliability of the Drone-1 model.
Objects in the validation images are labeled with specific classes, according to which the image is segmented by a model that assigns classes to selected areas. During the experimental verification of the models, the results of class assignment included false positives (a certain class reported in the image in its absence) and false negatives (a certain class not reported in the image despite its presence). Table 5 shows that the share of images with incorrectly labeled classes is 5 %. Thus, the share of images with correctly labeled classes is 95 %, which indicates the high adequacy of the model. Segmentation also involves taking into consideration the proportion of incorrectly labeled pixels. In the experimental check of the models, pixel labeling results included false positives (a pixel labeled with a certain class it does not belong to) and false negatives (a pixel not labeled with a class it does belong to). Table 5 shows that the percentage of incorrectly labeled pixels is 13 %. Thus, the proportion of correctly labeled pixels is 87 %, which indicates high convergence of the model.
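The class-level false positives and false negatives described above amount to a set difference between the predicted and ground-truth class sets of an image. A minimal sketch with our own helper and hypothetical class names:

```python
def class_confusion(pred_classes, true_classes):
    """Class-level false positives (predicted but absent in ground truth)
    and false negatives (present in ground truth but not predicted)."""
    pred, true = set(pred_classes), set(true_classes)
    return sorted(pred - true), sorted(true - pred)

# Hypothetical class sets for one validation image:
fp, fn = class_confusion({"car", "tree"}, {"tree", "person"})
```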

Discussion of results of studying image segmentation using convolutional neural networks
The results of our study show that the trained Drone-1 model demonstrates high accuracy of image segmentation (Fig. 7). This is due to the choice of optimal neural network parameters, as well as the introduction of a convolutional layer into the standard neural network architecture. Drone-1 is based on Voc-3, the model that demonstrated the best accuracy values (Table 3) when trained on the PASCAL VOC set (Fig. 4), owing to the choice of a stepped function for changing the learning rate (Table 1). Fig. 6 visualizes the result obtained, which confirms the acquired data. Therefore, Voc-3 was chosen for additional training on the NVIDIA-Aerial Drone set with the same parameters. The Drone-1 model created in this way could be used to segment real images (Fig. 8), whose high segmentation accuracy determines the efficiency of a UAV control system.

The results of comparing the accuracy of image segmentation (Fig. 8) for Drone-1 and similar models are given in Table 4. Compared to the others, Drone-1 demonstrates high accuracy of segmentation of images that were in neither the training nor the test samples, indicating the adequacy of this model and the absence of overtraining. The accuracy and performance of the developed Drone-1 neural network model take higher values than those derived in [17], by 1-2 % and 20-50 %, respectively. At the same time, the segmentation process does not require significant computing resources at the stage of using the model. Compared to [18], the given model has a 3 % higher image segmentation accuracy. This is achieved by combining the training and test networks, replacing specialized layers with simple ones, adding the power and accuracy layers and the bilinear filtering of neuron weights, and setting the batch size parameter. The reliability, authenticity, and convergence of the developed Drone-1 neural network model are comparable with those of other models [17,18] and not inferior to them.
Since neural network model training was conducted on images from the NVIDIA-Aerial Drone set, high segmentation accuracy values are typical of images acquired from a drone camera, usually due to the high contrast of pixel groups. For other types of images, the accuracy will probably not be as high; that requires additional research.
The shortcomings include the long process of training the neural network and the cost of resources at the initial stage of training. Overcoming this disadvantage is possible by the parallelization of computing on GPU, the creation of new compact architectures, and the emergence of more pretrained neural models.
This model could be advanced by further increasing the accuracy, performance speed, as well as decreasing computing resources. That would require complex mathematical modeling, taking into consideration the subject area of application and the development of software modules for a particular system.

Conclusions
1. The Voc-1, Voc-2, Voc-3, and Voc-4 neural network models were investigated on the PASCAL VOC set. It was established that the highest accuracy, 83 %, is demonstrated by the Voc-3 model at a learning rate of 0.0001, based on SGD with a stepwise method of changing the learning rate. The smallest accuracy value, 72 %, belongs to the Voc-1 model, which uses the adaptive moment estimation algorithm. This means that SGD copes with the task better than Adam, since Voc-2 and Voc-4 also have significantly higher accuracy indicators, 82 % and 81 %, respectively.
2. The accuracy values of the Drone-1 model, created on the basis of the previously trained Voc-3 model for segmenting images from the NVIDIA-Aerial Drone set with the parameters determined in the study, were evaluated. These values are quite high: the error in absolute value for the training sample becomes close to 0 after the first epoch of training. On the test sample, the error after the first epoch also becomes close to 0, the accuracy of the model becomes close to 100 %, and it practically does not change; after the end of epoch 30, the accuracy value is 99 %. The derived accuracy values make it possible to assert the correctness of the choice of network architecture and parameters. This allows the model to be used for practical image segmentation tasks, for example, when estimating fire size, analyzing field vegetation, categorizing crops, etc.