IMPROVING A MODEL OF OBJECT RECOGNITION IN IMAGES BASED ON A CONVOLUTIONAL NEURAL NETWORK



Introduction
Image processing is extremely important in modern science and practice, so it is constantly evolving and improving. Image processing can be used in many industries, namely precision farming (agricultural monitoring), safety systems, quality control, etc. The given areas employ vision systems, robotic complexes, unmanned aerial vehicles (UAVs), video surveillance systems, web services, and mobile applications for identification and search.
One type of image processing is the recognition of objects in images, which is widely used in industry, art, medicine, space technology, process management, automation, and many other fields [1]. Recognition of objects in images involves assigning the source data to a certain class by highlighting significant features. These features distinguish the initial data from the general array of non-essential information.
There are many methods for recognizing objects in images, among which Random Forests techniques, boosting methods, and neural network procedures, specifically convolutional ones [2–6], are the most common.
Certain requirements are put forward for object recognition methods, namely:
– correspondence of the recognized object to the real object;
– high performance;
– resistance to errors;
– high accuracy.
Therefore, it becomes necessary to analyze the methods of object recognition in images and to choose the optimal one according to the above requirements, particularly high accuracy. It is also worth considering the parameters that characterize these methods, changes in which directly affect the precision, performance, and overall efficiency of the object recognition process.
A modern relevant industrial area is the development of precision agriculture, which is based on the results from agricultural monitoring. These data, acquired from UAV video cameras, make it possible to assess the harvested crop, control the routes of movement of agricultural machinery, predict yields, etc. In this case, an important criterion is the UAV's ability to avoid collisions with close objects, determine the position in space, direction, and trajectory of the flight by receiving input data on the recognized objects.
The effectiveness of these systems is determined by the precision of object recognition, whose evaluation requires experimental research.

Literature review and problem statement
A Random Forests method for recognizing many classes of objects is considered in paper [2]; it is characterized by high accuracy and resistance to overfitting, and is easily accelerated using parallel computations. However, unresolved issues remain related to the lack of visual interpretation of the process and the complexity of explaining its decisions, as well as high sensitivity to noise in images. That causes difficulties associated with strict requirements for the absence of noise in the images and the inability to obtain an explanation of the result.
Works [3, 4] show the results of object recognition in images using boosting methods, specifically AdaBoost. High speed and efficiency of operation, as well as adaptability to a specific application, are shown. However, there are difficulties associated with overfitting in the presence of noise in the input data and a large number of image features, as well as the need for a significant amount of data for the training sample. This makes research costly and limits the use of these methods when working with low-quality images.
Work [5] reports the results of object recognition in images using neural network methods. The ability to train the system to highlight key characteristics of objects from the training sample is shown. However, these methods require the use of an ensemble of neural networks and auxiliary methods for selecting the relevant part of the image, and their architectures are extremely sensitive to external influences. The reason for this may be difficulties associated with the computational complexity and the quality of preprocessing of the initial and working data. That makes the use of these methods ineffective for certain tasks.
Work [6] provides the results of solving real-world object recognition problems with neural network methods. It is shown that input data can be presented in any order, which does not affect the learning objective. However, these methods require taking into consideration a large number of parameters since images in real recognition tasks have large dimensionality. The reason for this may be difficulties caused by the requirement for a larger training sample. That increases the time and computational complexity of the learning process, which limits the application of this method.
An option to overcome the above difficulties associated with insufficient accuracy, efficiency, and performance may be the use of methods for recognizing objects in images based on convolutional neural networks [7, 8]. This is the approach used in work [9], which employs multispectral data acquired from a satellite, while UAV video cameras provide multispectral data as well. In addition, a similar principle is implemented in work [10], where training parameters are analyzed and recommendations for changing the neural network architecture are provided; however, these recommendations are general in nature, without analysis of specific applications, particularly recognition tasks.
All this gives reason to assert that it is advisable to conduct a study into improving the effectiveness of training a neural network, which could significantly improve the precision of object recognition in images.

The aim and objectives of the study
The purpose of this work is to improve a convolutional neural network model for recognizing objects in images and to select learning parameters for this network. That would make it possible to obtain a new neural network with increased precision of object recognition in images, which could be used as a pre-trained neural network for other tasks.
To accomplish the aim, the following tasks have been set:
– to investigate neural network models based on the INRIA image set;
– to evaluate the Inria-9 model.

The study materials and methods
We studied the recognition of objects in images by using appropriate methods based on convolutional neural networks, taking into consideration the parameters of neural network learning. To test the effectiveness of these methods, the INRIA set was employed, which contains a large number of images with marked groups of pixels and defined classes of objects. INRIA contains images acquired from video cameras attached to UAVs shooting from a height of several hundred meters [11]. The study was carried out in the DIGITS programming environment together with the Caffe framework, designed for deep training of neural networks with an emphasis on speed and modularity in model development. The combination of these environments makes it possible to quickly train neural networks with deep layers and is used for the tasks of classification, segmentation of images [12], and recognition of objects in them. DIGITS contains a pre-trained GoogLeNet model, which is characterized by parameters adapted for recognizing objects in images (Tables 1, 2) and has a flexible architecture (Fig. 1).
The GoogLeNet architecture consists of 22 layers (27 layers when taking the merge layers into consideration), part of which forms 9 Inception modules. Moreover, their parameters may change in the learning process. An image with an RGB palette of 224×224 pixels is sent to the input. The filter size of the first layer is 7×7. A kernel size of 1×1×256 is used. The output activation function is Softmax and, in the hidden layers, ReLU, which makes it possible to increase performance by 6 times. Compared to similar models [13, 14], GoogLeNet contains 12 times fewer parameters, while the network depth is increased to 22 layers without additional computing resources [15].
Thus, the GoogLeNet architecture was used as the base to build a specialized FCN-GoogLeNet model by adding a fully convolutional layer, making the following changes in DIGITS:
– we added a data layer that receives training images and labels, and a conversion layer that applies real-time data augmentation;
– we added a data normalization layer;
– we added a fully convolutional network (FCN), which extracts the characteristics and predicts object classes and field boundaries for a grid square;
– we added an error layer, which simultaneously measures two forecast values;
– after determining the size of the input image, a random number is set, which determines how much the input image should be reduced;
– we added data augmentation parameters, which determine to what extent random transformations (pixel shifts, image flipping, etc.) should be applied to input images;
– we added a layer that uses a linear combination of two separate loss functions to calculate the total loss function for optimization;
– we deleted the input and output data layers and a pooling layer [16].
The choice of the FCN-GoogLeNet model optimization algorithm is determined by the features of object recognition in images, which require good convergence of the algorithm and, for practical use, high performance. The comparison of algorithms [17] reveals that, for the task of recognizing objects in images, Adam shows the best performance results (an increase of 10-50 %). That algorithm also demonstrates good convergence.

Table 2 GoogLeNet model layer parameters

The main indicators of neural network training effectiveness, which were determined during our study, were the following characteristics [18]:
– precision – the ratio of correctly recognized objects to the total number of predicted objects:

Precision = N_TP / (N_TP + N_FP), (1)

where N_TP is the number of correctly recognized objects in the image; N_FP is the number of erroneously recognized objects;
– recall – the ratio of correctly recognized objects to the total number of objects in the images:

Recall = N_TP / (N_TP + N_FN), (2)

where N_FN is the number of erroneously unrecognized objects;
– mean average precision (the mean accuracy estimate) – a simplified assessment of the mathematical expectation based on the product of precision and recall, which shows how sensitive the network is to the right objects and how resistant it is to errors:

mAP = Precision · Recall. (3)

To assess the effectiveness of neural network training, optimal neural network parameters are determined. These parameters are the duration of training (the number of epochs), the optimization algorithm (adaptive moment estimation, Adam), the type of change in the learning rate, the gamma or power coefficient, the learning rate (learning step), and the pre-trained model. The combinations of parameters in the process of training six models are summarized in Table 3.
The Adam algorithm shows good optimization results, particularly in the duration of training, but does not always demonstrate satisfactory convergence [19]. Therefore, different values of the learning step were used to train the model with a balance between good convergence and duration of training (Table 3). With good convergence of the model, the values of characteristics (1) to (3) are stable. Otherwise, the risk of overfitting increases, and the values of characteristics (1) to (3) change dramatically in the learning process, which complicates the practical use of the model. Therefore, models with frequent and sharp drops in the values of characteristics (1) to (3) are not to be used as pre-trained models. For additional verification of the selected learning step values, to ensure satisfactory convergence and the absence of overfitting, the model is to be tested on another set of images that were not used for the test sample. The values of characteristics (1) to (3) on the new set should not differ significantly from the values obtained for the verification set of images, which also indicates the adequacy of the resulting model.
Our study was conducted on a test sample, which is a set of marked INRIA images. Features: 2 classes of objects; color images with a resolution of 0.3 m and a total coverage of 810 km², of which 405 km² are for training and 405 km² for verification.
The values of precision, recall, and the mean accuracy estimate on the test sample should gradually increase. These characteristics, and especially the mean accuracy estimate, which combines precision and recall, characterize the adequacy of the model, that is, the correctness of neural network training and the absence of overfitting. Validation of learning outcomes could be defined as a gradual increase in precision, recall, and the mean accuracy estimate on the test sample. The number of training epochs is selected from the condition of obtaining the highest precision, recall, and mean accuracy estimate on the test sample in the absence of significant fluctuations in numerical values. The expediency criterion for increasing the number of learning epochs is a gradual increase in precision, recall, and the mean accuracy estimate on the test sample. The beginning of a drop in precision, recall, and the mean accuracy estimate on the test sample is a criterion of overfitting, the absence of which is a condition for validating the model.
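The training-effectiveness characteristics (1) to (3) can be sketched in Python. The product form of the mean accuracy estimate and the example counts below follow the verbal definitions above; they are an illustration, not the authors' code:

```python
# Characteristics (1) to (3) from the verbal definitions in the text.
# Counts are assumed positive; no zero-division handling for brevity.

def precision(n_tp: int, n_fp: int) -> float:
    """(1) Ratio of correctly recognized objects to all predicted objects."""
    return n_tp / (n_tp + n_fp)

def recall(n_tp: int, n_fn: int) -> float:
    """(2) Ratio of correctly recognized objects to all true objects."""
    return n_tp / (n_tp + n_fn)

def mean_accuracy_estimate(n_tp: int, n_fp: int, n_fn: int) -> float:
    """(3) Simplified estimate as the product of precision and recall."""
    return precision(n_tp, n_fp) * recall(n_tp, n_fn)

# Hypothetical counts for illustration:
p = precision(80, 20)                  # 0.80
r = recall(80, 30)                     # ~0.727
m = mean_accuracy_estimate(80, 20, 30)
```

With these definitions, a model is compared across epochs simply by recomputing the three values on the verification sample after each epoch.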

Results of studying the recognition of objects in images using convolutional neural networks

1. Investigating neural network models for the recognition of objects in images from the INRIA set
Our study was conducted on pixel-labeled images from the INRIA set. Since the dimensions of the images in the set differ, testing the model on a new image requires marking up the existing objects. In addition, precision calculation is carried out in soft real time, which requires high performance from a neural network with limited memory while ensuring high accuracy. The neural network receives an RGB input image (256 levels per channel) no larger than 5,000×5,000 pixels at a resolution of 30 cm, which corresponds to a surface area of up to 1,500×1,500 m. The output image is formed in TIFF or GeoTIFF format. The batch size is 32, with the number of threads equal to 4. Models are imported in prototxt or protobuf format. The recognition time should not exceed 50 ms for a Full HD image.
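The input-size constraint above implies that a large aerial image must be processed piecewise by a network with a fixed 224×224 input. The following is a minimal sketch of one possible tiling scheme; the scheme itself is an assumption for illustration, not part of the described pipeline:

```python
# Hypothetical tiling of a large aerial image (up to 5,000x5,000 px)
# into 224x224 patches for a fixed-input network.

def tile_grid(width, height, tile=224, stride=224):
    """Return (x, y) top-left corners of tiles covering the image.
    One extra tile per axis is shifted inward so the right/bottom
    edges are covered without stepping outside the image."""
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    if xs[-1] + tile < width:
        xs.append(width - tile)   # shifted edge tile on the x axis
    if ys[-1] + tile < height:
        ys.append(height - tile)  # shifted edge tile on the y axis
    return [(x, y) for y in ys for x in xs]

tiles = tile_grid(5000, 5000)  # 23 x 23 = 529 tiles for a 5,000 px side
```

Each tile would then be fed to the network independently, and the per-tile predictions stitched back into the output raster.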
The model performance check is illustrated by charts that were constructed automatically in the DIGITS programming environment based on the specified parameters given in Table 3. The Caffe environment was used for hardware acceleration of training. The number of values of precision, recall, and the mean accuracy estimate is equal to the number of learning epochs.

Table 3 Combinations of parameters for the training process

Fig. 2 shows the Inria-1 performance test chart: the values of precision, recall, and the mean accuracy estimate gradually increase and acquire their maximum values at learning epoch 16, reaching 69.91 %, 51.01 %, and 37.79 %, respectively.
Fig. 3 shows the Inria-2 performance test chart: the values increase and acquire their maximum values at learning epoch 23 for precision and at epoch 22 for recall and the mean accuracy estimate; for epoch 22, they are 79.65 %, 70.90 %, and 57.80 %, respectively.
Fig. 4 shows the Inria-3 performance test chart, and Fig. 7 shows the Inria-6 performance test chart; in the latter, the precision, recall, and mean accuracy estimate values also increase up to their maxima.
Table 4 shows that the highest mean accuracy estimate in the absence of sharp jumps in the indicators is demonstrated by the Inria-3 model, 55.41 %, over 30 learning epochs at a learning rate of 0.00005.
Thus, Inria-3 was used as the basis for training the new Inria-7 model over 30 epochs with an exponential change in the learning rate (0.000025), a gamma coefficient of 0.99, and the Adam optimization type. Fig. 8 shows the Inria-7 performance test chart. This model demonstrates good growth of the mean accuracy estimate and stable results, so it was used to train the Inria-8 and Inria-9 models (with a polynomial change in the learning rate), while the learning duration was increased to 100 epochs (Table 5).

Table 4 Results of exploring the effectiveness of models with different parameters

Table 5 Parameters that changed during the learning process

Fig. 9 shows the Inria-8 performance test chart: the values of precision, recall, and the mean accuracy estimate gradually increase and acquire their maximum values at learning epoch 97, reaching 84.15 %, 74.00 %, and 63.22 %, respectively.
Fig. 10 shows the Inria-9 performance test chart. The research findings with the results of verifying the effectiveness of the three models are given in Table 6. Table 6 shows that the Inria-9 model demonstrates the highest mean accuracy estimate in the absence of sharp jumps in the indicators. This is observed over 100 epochs with a polynomial change in the learning rate, which is 0.00005, a power factor of 3, and the Adam type of optimization. For this model, we managed to increase the mean accuracy estimate from 60.77 % to 65.70 %.

Table 6 Results of exploring the effectiveness of models with different parameters
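The exponential and polynomial learning-rate schedules mentioned for Inria-7 through Inria-9 can be sketched as follows, assuming Caffe's "exp" and "poly" policy formulas; the exact schedule formulas of the authors' setup are not stated, so this is an illustration only:

```python
# Learning-rate schedules following Caffe's conventions:
#   "exp":  lr = base_lr * gamma ** step
#   "poly": lr = base_lr * (1 - step / max_steps) ** power

def lr_exponential(base_lr: float, gamma: float, step: int) -> float:
    """Exponential decay, e.g. Inria-7: base_lr=0.000025, gamma=0.99."""
    return base_lr * gamma ** step

def lr_polynomial(base_lr: float, power: float, step: int,
                  max_steps: int) -> float:
    """Polynomial decay, e.g. Inria-9: base_lr=0.00005, power=3,
    max_steps=100 epochs."""
    return base_lr * (1.0 - step / max_steps) ** power
```

With power > 1 the polynomial schedule drops quickly at first and flattens near the end of training, whereas the exponential schedule decays by a constant factor per step; this difference is one plausible reason the schedules behave differently over long runs.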

2. Assessing the Inria-9 trained model for object recognition in images
In practice, object recognition in images is part of environmental monitoring with UAVs, which requires high accuracy for control and orientation in space. Therefore, the model with the highest mean accuracy estimate, Inria-9, should be additionally trained using the parameter values defined as optimal based on our study (Table 3):
– learning rate: 0.000025;
– duration of learning (number of epochs): 100;
– optimization algorithm: Adam;
– type of change in the learning rate: polynomial;
– power factor: 0.25;
– pre-trained model: Inria-7.
Thus, the Inria-10 model was built; the results of testing it are shown in Fig. 11. Fig. 11 shows that the values of precision, recall, and the mean accuracy estimate gradually increase and acquire their maximum values at learning epoch 97, reaching 85.95 %, 79.26 %, and 68.78 %, respectively.
Our findings showing the results of the effectiveness test of all ten models are given in Table 7. Table 7 shows that, among the ten trained models, Inria-10 demonstrated the highest mean accuracy estimate. For this model, it was possible to increase the mean accuracy estimate from 55.41 % (Inria-3) to 60.77 % (Inria-7), then to 65.70 % (Inria-9) and, finally, to 68.78 %. A further increase could be achieved through experiments with the neural network architecture, more careful selection of images from the set, and a combination of training cycles on different data sets; that, however, requires significant computing resources. Fig. 12 shows the recognition of buildings in images from the UAV video camera by the Inria-10 model in the DIGITS programming environment.
The example (Fig. 12) allows us to conclude that the network recognizes almost all buildings. Structures such as sheds, greenhouses, and unfinished buildings remained unrecognized, as did buildings that were only partially present in the photo, were partially covered by trees or, due to their close location, were recognized as one building. The number of unrecognized buildings is consistent with the experimental accuracy of about 70 %. At the same time, there was no mistaken assignment of objects that are not buildings to the "building" class.
Thus, the operational quality of the Inria-10 model depends significantly on the visual dimensions of the desired object, the lighting, the shooting angle, and the presence of objects that obstruct the view. However, in close-up shots with good lighting, the buildings are almost guaranteed to be recognized. Therefore, this model could be used to control farms, build orthophotomaps, draw up field maps, monitor territories, solve tasks related to cadastre and land management, etc.
The achieved performance values make it possible to compare the Inria-10 model with others [2–10], but it makes sense to compare it with models close in architecture that were trained on a similar base. Therefore, the comparison was carried out according to the mean accuracy estimate criterion with some well-known GoogLeNet-based models, developed with similar parameters and trained on images acquired from UAV cameras. GoogLeNet-like (Switzerland), InceptionResNetV2 (Turkey), and U-Net InceptionResNetV2 (Turkey) were chosen as such models [20, 21]. The results of assessing the mean accuracy of the models are given in Table 8. Table 8 shows that the highest mean accuracy estimate, namely 75 %, is demonstrated by the Inria-10 model, as the others were trained on images acquired from UAV cameras that were not part of the INRIA set.
We estimated the adequacy, reliability, and convergence of the Inria-10 model relative to the others [20, 21]. To this end, recognition was performed on 100 manually marked images from the NVidia Aerial Drone Dataset (USA) [22]. This set was selected because its images were acquired under different shooting conditions than those in the INRIA set. The calculations were carried out in the Jupyter Notebook environment in the Python language. The averaged results are summarized in Table 9. Table 9 shows that the highest mean accuracy estimate, namely 67 %, is demonstrated by the developed Inria-10 model.
This indicates the high reliability of the Inria-10 model.

Table 9 Results of testing the reliability, adequacy, and convergence of models

Objects in the verification images were marked with certain classes, according to which the model recognizes the specified objects (buildings) in the image. In the experimental verification of the models, there were false positive results (the presence of a certain class reported for an image where it is absent) and false negative results (the absence of a certain class reported for an image where it is present). The share of images with incorrectly marked classes is 15 %, and the share of images with correctly marked classes is 85 %. These results make it possible, using formulas (1) to (3), to calculate precision, recall, and the mean accuracy estimate. The precision value is 87 %, which indicates the convergence of the models. The recall value of 81 % indicates the reliability of the model. The mean accuracy estimate of 67 % indicates the adequacy of the model.
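The image-level verification described above, where each test image is marked with the classes it contains and the model's predicted classes are compared against that markup, can be sketched as follows; the helper function and the toy data are hypothetical:

```python
# Hypothetical image-level check: per-image ground-truth class sets
# versus per-image predicted class sets.

def image_level_counts(truth, predicted):
    """truth, predicted: lists of sets of class labels, one set per image.
    Returns (tp, fp, fn) summed over all images."""
    tp = fp = fn = 0
    for t, p in zip(truth, predicted):
        tp += len(t & p)   # classes present and predicted
        fp += len(p - t)   # predicted but absent (false positive)
        fn += len(t - p)   # present but missed (false negative)
    return tp, fp, fn

# Toy example with four images and one class of interest:
truth = [{"building"}, {"building"}, set(), {"building"}]
pred = [{"building"}, set(), {"building"}, {"building"}]
tp, fp, fn = image_level_counts(truth, pred)  # (2, 1, 1)
```

Feeding these counts into formulas (1) to (3) gives the precision, recall, and mean accuracy estimate reported for the verification set.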

Discussion of results of studying the recognition of objects in images using convolutional neural networks
The results of our study show that the trained Inria-10 model demonstrates high accuracy of object recognition in images (Fig. 11). This is due to the choice of optimal parameters for the neural network, as well as the introduction of a convolutional layer into the standard neural network architecture.
Inria-10 is based on Inria-9. That model demonstrated the best mean accuracy estimate values (Table 6) during training on the INRIA set (Fig. 10), which is explained by the choice of a polynomial change in the learning rate (Table 5). Therefore, it was Inria-9 that was chosen for additional training with optimal neural network parameters. The Inria-10 model built in this way could be used to recognize objects in real images (Fig. 12), the high accuracy of which determines the effectiveness of UAV control systems.
The results of comparing the mean accuracy estimate of object recognition in images for Inria-10 and other similar models are given in Table 8. Compared to the others, Inria-10 demonstrates high values of the mean accuracy estimate, which indicates the adequacy of this model and no need to retrain it.
The accuracy and performance of the developed Inria-10 neural network model are higher than those of similar models reported in [20], by 2-4 % and 20-50 %, respectively. In this case, the recognition process does not require significant computing resources at the stage of using the model. Compared to [21], this model has a 3 % higher precision of object recognition in images. That was achieved by adding the layers for data, conversion, normalization, and error, the calculation of the mean error, the data augmentation parameters, and the FCN, as well as by deleting the input/output data layers and a pooling layer. The reliability, adequacy, and convergence of the developed Inria-10 neural network model are comparable (Table 9) to those of other models [20, 21] and are not inferior to them. The precision value is greater by 3-6 %, which indicates the convergence of the model. The recall value is greater by 2-5 %, which indicates the reliability of the model. The mean accuracy estimate value is higher by 2-4 %, which indicates the adequacy of the model.
Since the neural network model was trained for images from the INRIA set, high recognition precision values are typical of the images obtained from a drone's camera, usually due to the high contrast of pixel groups. For other types of images, precision probably won't be as high. That requires additional research.
The disadvantages include the cost of time and computing resources at the stage of training the neural network. This disadvantage could be overcome by using parallel GPU computing based on CUDA technology and by employing a more compact neural network, for example, MobileNet, as the pre-trained network.
Further advancement of this model may involve increasing its precision and performance, as well as decreasing the required computing resources. That would require sophisticated mathematical modeling, taking into consideration the subject area of application, and the development of software modules for a particular system.

Conclusions
1. We have investigated the Inria-1 through Inria-9 neural network models based on the INRIA set. It was found that the largest mean accuracy estimate among them, 65.70 %, is demonstrated by the Inria-9 model at a learning rate of 0.00005, based on Adam, with a polynomial change in the learning rate with a power coefficient of 3. The lowest mean accuracy estimate is 37.79 % for the Inria-1 model, which uses an exponential change in the learning rate. That means that greater learning accuracy is provided by a polynomial change in the learning rate.
2. A mean accuracy estimate has been obtained for the Inria-10 model, built on the basis of the pre-trained Inria-9 model for the recognition of objects in images from the INRIA set with the parameters defined during our study. This value is quite high: it gradually increases and acquires its maximum at learning epoch 97. The precision, recall, and mean accuracy estimate values are 85.95 %, 79.26 %, and 68.78 %, respectively. The resulting values make it possible to assert the correctness of the choice of network architecture and the selection of parameters. That allows this model to be used for practical tasks of recognizing objects in images, for example, in autopilots, in collision avoidance systems for UAVs, in machine vision, in the analysis of agricultural infrastructure, etc.