IMPROVING A NEURAL NETWORK MODEL FOR SEMANTIC SEGMENTATION OF IMAGES OF MONITORED OBJECTS IN AERIAL PHOTOGRAPHS

This paper considers a model of the neural network for semantically segmenting the images of monitored objects on aerial photographs. Unmanned


Introduction
The use of unmanned aerial vehicles (UAVs) makes it possible to accelerate the process of monitoring critical infrastructure objects [1]. Such facilities include industrial enterprises [2], energy plants [3], chemically hazardous industries [4], and other strategic objects [5]. Disruption of the functioning of these facilities can threaten national interests and people's lives [6,7]. With the help of UAVs, objects are monitored by processing (analyzing) aerial photographs and a video stream. One type of image processing is segmentation. Segmentation of an aerial photograph involves dividing it into areas according to certain criteria. The result of segmentation is a set of areas that cover the entire aerial image. Therefore, the development of new and the improvement of existing neural network models for segmenting images of monitored objects (MOs) in aerial photographs is of particular relevance.

Literature review and problem statement
Paper [8] shows that traffic surveillance using UAVs has gained great popularity in civilian applications and remote sensing tasks. Owing to their high mobility, large field of view, and ability to cover large areas at different altitudes, UAVs have become a sought-after surveillance tool in recent years. The cited paper proposes a way of counting vehicles that eliminates redundant counting across sequential frames of UAV video. However, it did not address issues related to the segmentation of MO images.
Study [9] proposed various models based on convolutional neural networks (CNNs) to collect information obtained using a segmentation network; a generative adversarial network based on Pixel2Pixel was suggested. The discriminator employed a CNN to distinguish between the segmentation results produced by the generator model and those produced by the expert. The results showed that the network model could provide effective automatic segmentation of the hippocampus and is of practical importance for the correct diagnosis of diseases such as Alzheimer's disease. The disadvantages of that method are its high computational complexity and its lack of adaptation to the segmentation of MO images in aerial photographs.
Work [10] proposes a fast clustering algorithm based on superpixels for the segmentation of synthetic aperture radar images. Experimental results on two real synthetic aperture radar images show that the proposed method is superior to other modern methods both in segmentation accuracy and in computational efficiency. The disadvantage of that model is its lack of adaptation to the segmentation of MO images in aerial photographs.
Paper [11] shows that malware detection methods based on deep learning are generally highly accurate. However, when malware families with a high degree of similarity are detected, the detection accuracy is seriously compromised due to the lack of distinctive training features. To resolve that issue, the cited paper proposes a method for detecting malicious code that is based on image segmentation and a deep CNN. The disadvantages of the model are its high computational complexity and its lack of adaptation to the segmentation of MO images in aerial photographs.
Paper [12] proposes a multiscale model of real-time semantic segmentation. It has been shown experimentally that the proposed model can be used to solve many recognition problems and has good decoding ability. Despite this, the issues of automating the process of segmenting MO images in aerial photographs were not considered.
Study [13] proposes a new classification scheme for hyperspectral images of remote sensing of the Earth. The proposed model is able to increase intraclass similarity by locally suppressing spectral variations, while promoting cross-class variability on a global scale, resulting in recovery with more distinguishable pixels. Experimental results on three test datasets demonstrate a significant superiority of the proposed method over modern ones. The disadvantage of that model is its lack of adaptation to the segmentation of MO images in aerial photographs.
Paper [14] considers obtaining accurate multiscale semantic information from images for high-quality semantic segmentation. A model called cross fusion net (CF-Net) is proposed for fast and efficient extraction of multiscale semantic information. The model is able to encode more accurate semantic information from small-scale objects and, accordingly, to improve the accuracy of segmentation of small-scale objects. The disadvantage of the model is its computational complexity.
Our review of the literature [8][9][10][11][12][13][14] revealed the following shortcomings in the above models (methods):
- the computational complexity of the segmentation of MO images in aerial photographs obtained from UAVs;
- the lack of neural network models that solve the problem of segmenting MO images in aerial photographs.
All this suggests that it is advisable to conduct a study to improve a neural network model for the semantic segmentation of images of monitored objects in aerial photographs, which would significantly improve the accuracy and efficiency of the segmentation of MO images in aerial photographs.

The aim and objectives of the study
The aim of this study is to improve the neural network model for the segmentation of MO images in aerial photographs, including the choice of its training parameters. This will make it possible to automate the process of analysis (processing) of aerial photographs.
To accomplish the aim, the following tasks have been set:
- to investigate the effectiveness of MO image segmentation using a CNN;
- to evaluate the effectiveness of segmenting MO images in aerial photographs by the proposed U-NetWavelet model.

The study materials and methods
Suppose that a digital camera is installed on board a UAV. In this case, aerial photographs are transmitted through the communication channel to the computer of the ground control point. There, they are stored digitally as files. Segmentation is important for the tasks of analyzing images of monitored objects in aerial photographs. Semantic segmentation is the process of assigning each pixel in an image to a class label (color).
The mathematical statement of the problem of semantic segmentation is to assign each pixel of the MO image in the aerial photograph S(x,y,z) to the label (color) of a class (object) B_i by means of an operator P that characterizes the work of the CNN. In the proposed model, an RGB aerial photograph is fed to the CNN input (dimension, 6,000×4,000×3; JPEG format); the output is the label (color) of each pixel of the class (object) (Table 1).
Our study of object recognition in aerial photographs was conducted using CNN methods in combination with the selection of optimal training parameters.
To automate the process of semantic segmentation of MO images in aerial photographs, it is proposed to use the U-Net model as the basic one; it has demonstrated high efficiency in solving biomedical problems.
The architecture of the U-Net CNN is discussed in [15,16] and is shown in Fig. 1. A CNN uses a weight matrix in convolution operations. The convolution layer sums up the results of the element-by-element product of each fragment of the image and a matrix, the convolution kernel.
U-Net consists of a contracting (tapering) path (left side) and an expansive path (right side). The contracting path applies two 3×3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU), and a 2×2 max pooling operation with stride 2 for downsampling. At each downsampling stage, the number of feature channels is doubled. Each step of the expansive path consists of an upsampling of the feature map followed by a 2×2 convolution ("up-convolution"), which halves the number of feature channels. Each step of the contracting path consists of a downsampling of the feature map followed by 3×3 convolutions, each followed by a ReLU.
Cropping is necessary because of the loss of edge pixels with each convolution. On the last layer, a 1×1 convolution is used to map each 64-component feature vector to the desired number of classes. In total, there are 23 convolutional layers in the network.
The ReLU activation function and its mathematical notation are described in detail in [17,18]; the implementation of the max pooling operation is given in [17].
Training U-Net. U-Net is trained by stochastic gradient descent on input images and their corresponding segmentation maps. Because of the convolutions, the output image is smaller than the input by a constant border width. The energy function is computed by a pixel-wise Softmax over the final feature map combined with the cross-entropy function. The Softmax function is defined in [15] as:
$$p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))},$$
where p_k(x) is the value of the function, which approaches 1 when k has the maximum activation a_k(x); a_k(x) is the activation in feature channel k at the pixel position x∈Ω, Ω⊂ℤ²; K denotes the number of classes. The cross-entropy penalizes, at each position, the deviation of p_{ℓ(x)}(x) from 1 and is defined in [15] as:
$$E = \sum_{x \in \Omega} w(x) \log\left(p_{\ell(x)}(x)\right),$$
where ℓ: Ω→{1, …, K} is the true label of each pixel; w: Ω→ℝ is the weight map, which is introduced to give some pixels greater importance during training.
The separation boundary is calculated using morphological operations. The map of weighting coefficients is calculated according to the formula given in [15]:
$$w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right),$$
where w_c: Ω→ℝ is the weight map for balancing class frequencies; d_1: Ω→ℝ and d_2: Ω→ℝ denote, as in [15], the distances to the border of the nearest and the second nearest object; the values w_0=10 and σ=5 pixels have been experimentally established [11].
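As a minimal sketch of the training objective described above, the following NumPy code computes a pixel-wise Softmax, the weighted cross-entropy, and the border weight term in the form given for U-Net in [15]. The function names, the array layout (classes first), and the negated sign of the loss (so that smaller values are better) are assumptions made for illustration; the distance maps d1 and d2 are assumed to be precomputed.

```python
import numpy as np

def pixelwise_softmax(a):
    """Softmax over the class axis; a has shape (K, H, W)."""
    a = a - a.max(axis=0, keepdims=True)  # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)

def border_weight_map(w_c, d1, d2, w0=10.0, sigma=5.0):
    """w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))^2 / (2 * sigma^2))."""
    return w_c + w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))

def weighted_cross_entropy(a, labels, w):
    """Negated sum over pixels of w(x) * log p_{l(x)}(x).

    a: activations (K, H, W); labels: integer class map (H, W) with
    values in 0..K-1; w: weight map (H, W).
    """
    p = pixelwise_softmax(a)
    rows, cols = np.indices(labels.shape)
    p_true = p[labels, rows, cols]  # probability of the true class per pixel
    return -np.sum(w * np.log(p_true))
```

With equal activations for K classes, every pixel gets probability 1/K, so the loss of an H×W image with unit weights equals H·W·log K; pixels with d1 = d2 = 0 receive the largest extra weight w0.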
Justification of the architecture and the mathematical apparatus used to implement the proposed CNN.
Our review of the literature [15,16] showed that the U-Net model demonstrates high efficiency for the semantic segmentation of images of objects of different shapes and positions.
The advantages of U-Net and neural networks based on it are:
- high efficiency for solving the problems of segmentation of medical images [14,15];
- information from large scales (upper layers) allows the model to classify better;
- information from smaller scales (deep layers) helps the model segment better;
- increasing dimensionality by increasing the number of feature channels allows the CNN to propagate contextual information to layers of greater resolution;
- the symmetrical network strategy makes it possible to process large images (snapshots) such as aerial photographs, hyperspectral images, and images for orthophoto plans;
- good accuracy is obtained when training on a small number of images [15].
To solve the problem of the semantic segmentation of images of monitored objects in aerial photographs for 7 classes and to improve the efficiency of segmentation, it is proposed to use a modified wavelet layer as the input layer and the U-Net CNN as the basic model. The model was trained on a set of images prepared from aerial photographs.
The architecture of CNN (Basic U-Net) is shown in Fig. 2. The task solved by CNN is the semantic segmentation of images of monitored objects into 7 classes.
The convolutional layer is the main layer of a convolutional neural network. For each channel, the convolutional layer has its own filter, the convolution kernel, which processes the previous layer fragment by fragment (summing up the results of the element-by-element product for each fragment).
The normalization layer is necessary so that different elements in different places of the same feature map (the output of the convolution operation) are normalized in the same way.
The MaxPool layer is necessary to speed up the learning process and reduce the computing resources used.
The unifying layer combines the outputs of the neural network layers.
The output layer is the last layer of the neural network; it produces the output data (result) of the neural network.
The architecture of the proposed CNN is similar to U-Net; the difference lies in the dimensionality of the input and output layers of the network.
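The convolution and MaxPool operations described above can be sketched in NumPy for a single-channel image. This is only an illustration under simplifying assumptions (one channel, no padding, stride 1 for the convolution, a 2×2 pooling window with stride 2); the function names are hypothetical and are not taken from the frameworks used in the study.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding.

    Each output value is the sum of the element-by-element product of
    one image fragment and the kernel (no kernel flip, as is usual in CNNs).
    """
    image = np.asarray(image, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + k_h, j:j + k_w] * kernel)
    return out

def relu(x):
    """Rectified linear unit: negative values are set to zero."""
    return np.maximum(x, 0.0)

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; odd trailing rows/columns are dropped."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```

A 3×3 kernel applied to a 4×4 image yields a 2×2 output, which illustrates the loss of edge pixels with each unpadded convolution.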
As performance indicators that characterize the process of training and evaluating the effectiveness of the CNN, the following were chosen, as in [18]:
- accuracy, the ratio of correctly segmented objects to the total number of expected and true objects, formula (5), where N_TP is the number of correctly segmented objects in the aerial photograph; N_FP is the number of erroneously segmented objects in the aerial photograph; N_val is the number of aerial photographs in the test sample; t is the index of the current aerial photograph;
- sensitivity, the ratio of correctly segmented objects to the total number of objects in an aerial photograph, formula (6), where N_FN is the number of erroneously unsegmented objects in the aerial photograph.
We tested the CNN models on an ACPI x64-based computer (China) equipped with a Tesla GPU with 12 GB of memory and 8 GB of RAM.
To prepare aerial photographs for the training sample, the Image Labeler software from the mathematical programming environment MATLAB R2020b (USA) was used. The preparation (marking) of aerial photographs of objects "Truck", "Car" is shown in Fig. 3. The studies were carried out under the following assumptions and limitations:
- a digital camera is installed on board the UAV and shoots in the visible range in the daytime;
- an aerial photograph in digital form is transmitted through a communication channel to a ground control point;
- the process of the semantic segmentation of images of objects in an aerial photograph is carried out using a computer of the ground control point of the unmanned aerial system.
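The two indicators can be sketched as follows. Since the formulas themselves are not reproduced here, this is one plausible reading of the definitions, assuming that the per-photograph ratios N_TP/(N_TP+N_FP) and N_TP/(N_TP+N_FN) are averaged over the N_val photographs of the test sample; the function names and the sample counts are hypothetical.

```python
def mean_per_image_ratio(hits, misses):
    """Average hits / (hits + misses) over all photographs.

    hits and misses are sequences with one count per photograph;
    a photograph with no objects at all contributes a ratio of 0.
    """
    ratios = []
    for n_hit, n_miss in zip(hits, misses):
        total = n_hit + n_miss
        ratios.append(n_hit / total if total else 0.0)
    return sum(ratios) / len(ratios)

def accuracy(n_tp, n_fp):
    """Assumed reading of (5): mean of N_TP / (N_TP + N_FP) over the test sample."""
    return mean_per_image_ratio(n_tp, n_fp)

def sensitivity(n_tp, n_fn):
    """Assumed reading of (6): mean of N_TP / (N_TP + N_FN) over the test sample."""
    return mean_per_image_ratio(n_tp, n_fn)

# Hypothetical counts for two photographs.
print(accuracy([9, 8], [1, 2]))
print(sensitivity([9, 8], [3, 0]))
```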

Results of studying the effectiveness of segmenting images of monitored objects in aerial photographs using CNN
The neural network's layers (Fig. 5 [3,3]; strides= [1,1], padding='same', activation='sigmoid'). Accuracy and sensitivity were chosen as indicators of the effectiveness of the semantic segmentation of images of objects by the CNN. The parameters of CNN training are the duration of training (number of epochs), the optimization algorithm, and the learning rate (learning step). The meaning of the learning rate (learning step) of a CNN is set out in [18].
The parameters for training and testing the CNN models for the semantic segmentation of object images are shown in Table 2. The comparison was carried out for three neural networks: PSPsmall, U-Netaverage, and U-Net. For modeling, the Terra AI framework and the MATLAB R2020b mathematical modeling environment were used.
Fig. 6 shows the accuracy plots of the PSPsmall, U-Netaverage, and U-Net models on the test sample. The U-Net model demonstrates the best accuracy (91 %); after epoch 20, its accuracy varies in the range from 90 % to 91 %.
Fig. 7 shows the sensitivity plots of the PSPsmall, U-Netaverage, and U-Net models on the test sample. The U-Net model demonstrates the best sensitivity (87 %), which stabilizes after epoch 10 and varies in the range from 84 % to 87 %.
Fig. 8 shows the result of the semantic segmentation of "airplane" images by the U-Net model in the Terra AI framework; two areas are highlighted during segmentation: "airplane" and "sky".
Our analysis of the results reveals that the best performance indicators are shown by the U-Net model: accuracy (91 %), sensitivity (87 %), maximum error value (0.232), minimum error value (0.0132).

2. Evaluation of the effectiveness of segmentation of MO images in aerial photographs by the proposed U-NetWavelet model
To investigate the effectiveness of the segmentation of MO images in aerial photographs, aerial photographs for the training and test samples were prepared. 100 aerial photographs were used as a training sample, and 80 aerial photographs were used as a test sample. The total number of classes for semantic segmentation was 7 (helicopter, airplane, tank, tractor, truck, car, bus). The training and test samples are of the same type: aerial photographs of 6,000×4,000 pixels in JPEG format.
The segmentation of images of monitored objects in aerial photographs using the CNN was carried out at a ground control point. For the shooting, a UAV equipped with a Sony ILCE-7M2 camera was used. The camera took aerial photographs in the following mode:
- shutter speed, 1/1,600 s;
- focal length, 55 mm;
- aerial image size (pixels), 6,000×4,000 (24M).
An aerial photograph taken by the Sony ILCE-7M2K digital camera aboard the UAV at an altitude of 1,100 meters is shown in Fig. 9.
The study procedure (modeling), using the proposed U-NetWavelet model as an example:
Step 1. Download aerial photographs: 6,000×4,000×3 pixels.
Step 3. Apply a wavelet layer to a snapshot of 1,000×1,000×3 (implemented as a modified Haar transform: the values of two adjacent pixels are summed and divided by two) and adapt the result to the dimension of 512×512×3.
Step 4. Split data into training and validation datasets.
Step 5. Training and validation of the network.
Step 6. Segmentation of verification sample snapshots.
Step 7. Evaluate the accuracy of the segmentation of the test sample.
Step 8. Evaluate the sensitivity of the model on a test sample.
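The averaging used by the wavelet layer in Step 3 can be illustrated with a short NumPy sketch. It assumes that adjacent pixel pairs are averaged first along the width and then along the height, which halves both dimensions; the order of the axes and the function names are assumptions, and the subsequent adaptation to 512×512×3 is not covered here.

```python
import numpy as np

def pairwise_average(image, axis):
    """Sum each pair of adjacent pixels along one axis and divide by two."""
    image = np.asarray(image, dtype=np.float64)
    length = image.shape[axis]
    if length % 2 != 0:
        raise ValueError("axis length must be even")
    first = np.take(image, np.arange(0, length, 2), axis=axis)
    second = np.take(image, np.arange(1, length, 2), axis=axis)
    return (first + second) / 2.0

def haar_average_layer(image):
    """Halve the height and width of an (H, W, C) image by pairwise averaging."""
    return pairwise_average(pairwise_average(image, axis=1), axis=0)

# A 1,000x1,000x3 snapshot becomes 500x500x3 after one pass.
snapshot = np.zeros((1000, 1000, 3))
print(haar_average_layer(snapshot).shape)
```

Only the low-pass (averaging) part of the Haar transform is shown, in line with the description in Step 3.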
We trained the proposed U-NetWavelet model using the optimal parameter values, which were obtained experimentally:
- learning rate, 0.001;
- training duration (the number of epochs), 60;
- batch size, 20;
- optimization algorithm, Adam.
As a result, a new model with the proposed name U-NetWavelet was built. The results of checking the accuracy and sensitivity of this neural network are shown in Fig. 10.
Fig. 11 shows a fragment of the segmented aerial photograph using the U-NetWavelet model; two types of objects are distinguished during segmentation: "passenger car" and "truck".
The new U-NetWavelet model was compared with the FCN and SegNet models. 80 aerial photographs were used as a test sample to evaluate the U-NetWavelet model for convergence, adequacy, and validity.
Convergence. A CNN converges provided that the error decreases with each epoch. The convergence of a CNN model is influenced by three components: the completeness of the database (aerial photographs); the correct choice of architecture; the selection of the CNN training parameters. Fig. 12 shows the U-NetWavelet convergence score on a test sample. Our analysis of Fig. 12 reveals that the proposed U-NetWavelet model converges.
Adequacy. A neural network is adequate if the learning outcomes converge to close values; a necessary condition is that there is a dependence between the output and input data, which the CNN implements.
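The convergence criterion stated above (the error decreases with each epoch) can be checked with a few lines of Python. The function name and the optional tolerance, which allows small fluctuations between epochs, are assumptions added for illustration; the error values in the usage lines are hypothetical and are not results from this study.

```python
def error_decreases(errors, tolerance=0.0):
    """Return True if every epoch's error is below the previous one.

    errors: sequence of per-epoch error values;
    tolerance: allowed increase between consecutive epochs (0 means strict).
    """
    return all(later < earlier + tolerance
               for earlier, later in zip(errors, errors[1:]))

# Hypothetical error curves.
print(error_decreases([0.9, 0.5, 0.3, 0.2]))                # strictly decreasing
print(error_decreases([0.30, 0.35, 0.20]))                  # one increase
print(error_decreases([0.30, 0.35, 0.20], tolerance=0.1))   # increase within tolerance
```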
The most recommended way to test a CNN model for adequacy is to compare the results with known models.
The results of checking on the test sample (80 aerial photographs) are shown in Table 3. Table 3 shows that, in comparison with the FCN and SegNet models, the proposed U-NetWavelet model demonstrates the best efficiency indicators: accuracy (89 %), sensitivity (83 %), maximum error value (0.451), minimum error value (0.102).

Discussion of results of studying the semantic segmentation of images of objects in aerial photographs using CNN
It is proposed to use the U-Net CNN [15,16] to segment images of objects in aerial photographs. To improve the efficiency of the neural network, this model was trained on a set of aerial photographs (Fig. 9) with the selection of optimal parameters (learning rate (learning step), number of epochs, batch size, optimization algorithm). As a result, a new model with the proposed name U-NetWavelet was constructed (Fig. 2).
Due to the use of a modified wavelet layer, the size of the aerial photograph adapts to the parameters of the input layer of the neural network; the efficiency of segmentation of images in aerial photographs increases. The use of the U-NetWavelet CNN makes it possible to increase the performance and automate the process of semantic segmentation of MO images.
Using the proposed model makes it possible to address the following issues [8][9][10][11][12][13][14]:
- the computational complexity of the segmentation of MO images in aerial photographs obtained from UAVs;
- the lack of neural network models that solve the task of segmenting MO images in aerial photographs.
Note the following limitations of the proposed model:
- the segmentation of MO images in aerial photographs is carried out within 7 classes (Table 1);
- the orientation of MOs in the images is not taken into consideration;
- the resolution of aerial photographs for the classification of MOs is 6,000×4,000 pixels;
- the CNN's translation invariance is not taken into consideration;
- aerial photography is carried out in the visible range in the daytime.
The proposed model is constrained in that it is adapted to segment objects in an aerial photograph into seven classes. The CNN was trained on aerial photographs of high contrast and clarity (Fig. 6). The shooting was carried out in the daytime, in summer. Therefore, high values of accuracy and sensitivity of the segmentation of object images were obtained (Table 3). For other types of images of objects (shooting conditions), the accuracy and sensitivity of the segmentation of MO images by class may vary, which requires additional research.
It is planned to advance the proposed model by:
- increasing the base of marked (segmented) aerial photographs for a training sample;
- exploring the proposed and other models [19][20][21] (PSPNet, DenseNet, DeepLab, DilatedNet, etc.) for different conditions of aerial photography;
- optimizing the proposed model in terms of computational complexity to increase performance;
- building a method for counting the number of objects in aerial photographs by class;
- devising a method for detecting and identifying objects in the video stream received by the UAV video camera.
Our model is proposed for use at a ground control point of a UAV when processing aerial photographs and orthophoto plans; in systems with artificial intelligence; in MO control systems; when designing robots; in unmanned vehicle systems.

Conclusions
1. The indicators of efficiency of PSPsmall, U-Netaverage, U-Net models have been studied. Verification of the effectiveness of these models was carried out on the basis of images of aircraft (800 images in a training sample, 140 in a test sample). It has been established that the best indicators are shown by the U-Net model: accuracy (91 %), sensitivity (87 %), maximum error value (0.232), minimum error value (0.0132). The lowest accuracy (84 %) and sensitivity (81 %) are shown by the U-Netaverage model.
2. The effectiveness of the proposed U-NetWavelet model (based on images prepared from aerial photographs) was evaluated. The model has the best efficiency indicators in comparison with the FCN, SegNet models: accuracy (89 %), sensitivity (83 %), maximum error value (0.451), minimum error value (0.102). The obtained values of the performance indicators of the U-NetWavelet model allow us to assert the correctness of the choice of the CNN architecture and the selection of its training parameters: the learning rate is 0.001; the duration of training (number of epochs) is 60; the optimization algorithm is Adam.