IMPROVING THE MODEL OF OBJECT DETECTION ON AERIAL PHOTOGRAPHS AND VIDEO IN UNMANNED AERIAL SYSTEMS

This paper reports the improvement of a model of object detection on aerial photographs and video using a neural network in unmanned aerial systems. The development of artificial intelligence and computer vision systems for unmanned systems (drones, robots) requires the improvement of models for detecting and recognizing objects in images and video streams. The results of video and aerial photography in unmanned aircraft systems are currently processed by the operator manually, but there are objective difficulties associated with the operator's processing of a large number of videos and aerial photographs, so it is advisable to automate this process. An analysis of neural network models revealed that the YOLOv5x model (USA) is the most suitable basic model for the task of object detection on aerial photographs and video. The Microsoft COCO dataset (USA), which contains more than 200,000 images across 80 categories, is used to train this model. To improve the YOLOv5x model, the neural network was trained on the VisDrone 2021 image set (China) with the following optimal training parameters: the SGD optimization algorithm, an initial learning rate (step) of 0.0005, and 25 epochs. As a result, a new model of object detection on aerial photographs and video, with the proposed name VisDroneYOLOv5x, was obtained. The effectiveness of the improved model was studied using aerial photographs and videos from the VisDrone 2021 set. The main indicators chosen to assess the effectiveness of the model were precision, recall, and the estimate of mean average precision. Using a convolutional neural network has made it possible to automate the process of object detection on aerial photographs and video in unmanned aerial systems.


Introduction
Safety is a paramount human need. Ensuring security involves the active use of unmanned aerial vehicles (UAVs) to monitor military sites as well as critical infrastructure facilities. The latter include energy facilities, chemically hazardous industries, and other strategic objects whose disrupted functioning may threaten vital state interests. A concept for devising an integrated information and analytical system of decision support under the conditions of anthropogenic emergencies has been proposed in [1]. The main factors threatening the safety of a monitored object (MO) include fires (explosions), emissions of hazardous substances, radiation, and unauthorized entry of persons into the MO territory. With the help of computer vision, MO reconnaissance is carried out by analyzing aerial photographs and a video stream. Artificial intelligence methods enable real-time monitoring of traffic flows [2] and real-time detection of vehicles [3], and underpin unmanned aircraft systems (UAS), unmanned vehicle systems, and robot control. The development of artificial intelligence systems should be accompanied by the improvement of models for their implementation. One option is to use convolutional neural networks (CNNs) to detect MOs in aerial photographs and videos.
Therefore, it is a relevant task to improve existing models of object detection in aerial photographs and videos using CNNs.

Literature review and problem statement
Work [4] describes a video system for detecting violations of traffic rules. The proposed model implements the detection of three classes of objects in a video sequence: a pedestrian crossing, a car, and a person at a pedestrian crossing. The model also makes it possible to track the trajectory of the vehicle and person at the pedestrian crossing and to determine a violation of traffic rules over a certain period. To detect objects in real time, the YOLOv3 neural network was used. The disadvantages of that model are its high computational complexity and its lack of adaptation to the detection of objects in the video stream acquired from an unmanned aerial vehicle (UAV).
A traffic video surveillance system is proposed in [5]. The project addresses the concept of vehicle detection with the support of a computer vision algorithm in real time. The proposed system uses the YOLOv4 architecture for faster detection of objects in real time. That model has been tested in a variety of conditions such as rain, low visibility, daylight, snow, and night. The system can automate the process of detecting accidents in real time. However, the task of object detection in the video stream acquired from UAV remains unresolved.
Paper [6] shows that deep learning technique has led to a significant increase in the accuracy of object detection. In many applications, object detection is performed on video data consisting of a sequence of two-dimensional image frames. It is shown that the accuracy of object detection can be significantly improved by using a temporal structure in the sequence of images at the stage of object detection. A new model for object detection is proposed, which takes into consideration the trajectory of movement from neighboring frames, as well as spatial-temporal characteristics. The disadvantage of the model is the inability to use it to detect objects when processing a video stream in UAS.
A prototype of the implementation of a threat detector based on artificial intelligence for video surveillance cameras is considered in [7]. The proposed CNN model processes the stream of images directly from the webcam on the site, classifies objects, and displays the results to the user through a convenient graphical interface. The motion detection module is designed to automatically capture images from video when new motion is detected. Experimental results showed that the average overall accuracy of forecasts for the test set date was 94 %. The disadvantage of the approach used is the lack of practical application for object detection on aerial photographs and video acquired from the optical system of UAV.
Work [8] shows that object detection is closely related to the analysis of streaming video. Owing to the rapid development of deep learning neural networks, new CNN models are emerging that are able to resolve this task. The disadvantage of the approach used is the lack of practical application for object detection when processing a video stream in UAS.
Paper [9] proposes a new approach to the detection of YOLO objects (You Only Look Once). Object detection is considered as a regression problem for spatially separated bounding frames and related probabilities of classes. The neural network predicts bounding boxes and probabilities of classes directly from complete images in a single estimate. The model is superior to other detection models such as DPM and R-CNN. Despite this, the issues of automation of the process of object detection in aerial photographs and streaming video in UAS were not considered.
In work [10], a method for recognizing images of monitored objects using a convolutional neural network is proposed. The effectiveness of image recognition by the improved method was tested on a convolutional neural network trained on images of 300 monitored objects. In that case, the decision-making time for the proposed method decreased on average by 0.7 to 0.84 s compared with the ResNet and ConvNets artificial neural networks. The disadvantage of the proposed model is the lack of practical application for object detection when processing a video stream in UAS.
Paper [11] proposes a model of YOLO9000 object detection in real time, which can detect more than 9,000 categories of objects. The advanced YOLOv2 model corresponds to the latest technology in standard detection tasks such as PASCAL VOC and COCO. A method of joint training in the detection and classification of objects is proposed. The YOLO9000 is trained on both the COCO discovery dataset and the ImageNet classification dataset. YOLO9000 predicts detections for 200 classes and 9,000 different categories of objects and works in real time.
Our review of the scientific literature [4][5][6][7][8][9][10][11] has revealed the following shortcomings of known models:
- the lack of CNN models that solve the task of object detection on aerial photographs and streaming video in UAS;
- the process of object detection on aerial photographs and videos in UAS is not automated.
All this suggests that it is expedient to conduct a study on improving the model of object detection on aerial photographs and video in unmanned aircraft systems.

The aim and objectives of the study
The purpose of this study is to improve the model of object detection on aerial photographs and video in unmanned aircraft systems using CNN and choosing the parameters of its training. This would make it possible to automate the process of object recognition in aerial photographs and videos.
To accomplish the aim, the following tasks have been set:
- to investigate the effectiveness of object detection on aerial photographs using CNN;
- to evaluate the effectiveness of object detection on aerial photographs and video streams with the improved model VisDroneYOLOv5x.

The study materials and methods
A video camera is installed on board the UAV. The video stream is transmitted through the communication channel to the ground control point. To simplify understanding and processing, each frame of the video stream is treated as one digital image (Fig. 1). The task of object detection is to assign to each image P one object (or set of objects) B of a certain class:

B = {b_k}, b_k = (x_k, y_k, w_k, h_k, s_k, c_k),

where x_k, y_k, w_k, h_k are the coordinates and dimensions of the bounding box, s_k ∈ R is the reliability, c_k is the class of objects (person, plane, car, truck, bus, etc.), and k is the number of classes. In fact, the task of finding the location of an object in a frame is a detection task.
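In code, one element of the set B can be represented as a simple record. The sketch below is illustrative only; the field names follow the notation above, and the example values are made up:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object b_k = (x, y, w, h, s, c)."""
    x: float  # bounding-box centre, pixels
    y: float
    w: float  # bounding-box width, pixels
    h: float  # bounding-box height, pixels
    s: float  # reliability (confidence) s_k in [0, 1]
    c: str    # object class c_k

# An image P is mapped to a set of objects B:
B = [Detection(320.0, 240.0, 60.0, 120.0, 0.91, "car"),
     Detection(500.0, 300.0, 25.0, 60.0, 0.78, "person")]
print(len(B), B[0].c)  # 2 car
```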
The proposed model can work with two types of data: aerial photography and streaming video. In the first case, an RGB JPEG aerial photograph is submitted to the CNN input, and an aerial photograph with marked classes (objects) in the form of rectangular frames is received at the output (Fig. 2).
In the second case, a video in MPEG-4 format is submitted to the CNN input, and a video with marked classes (objects) in the form of rectangular frames (similar to an aerial photograph) is received at the output. Table 1 gives the color of the rectangular frame of each class (object), using an example of 6 of the 80 classes of the YOLOv4 and YOLOv5x models.
Knowing the coordinates of objects in the image (video) makes it possible to solve various more complex problems:
- tracking (tracking of movement);
- prediction of actions;
- simultaneous localization and mapping (SLAM);
- estimation of distances to objects.
As the main performance indicators characterizing the detection process, the following were chosen: precision, recall, and mean average precision.
Precision P is the ratio of correctly detected objects to the total number of correctly and erroneously detected objects [12]:

P = TP / (TP + FP),

where TP (true positive) is the number of correctly detected objects in the image; FP (false positive) is the number of erroneously detected objects in the image.
Recall r is the ratio of correctly detected objects to the total number of objects present in the images [12]:

r = TP / (TP + FN),

where FN (false negative) is the number of objects missed against the background. The mean average precision (mAP) is the mean of the average precision (AP) over all classes in the training sample. The estimate of the mean average precision is determined from the formula:

mAP = (1/k) Σ_{i=1}^{k} AP_i,

where k is the number of classes and AP_i is the average precision for class i. Features of the architecture and principles of implementation of the YOLOv4 CNN are considered in [13, 14]. The YOLOv5 model [15] shows high efficiency for detecting objects of various shapes and positions both in digital images and in video. YOLOv5 includes models of different sizes: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), YOLOv5x (extra large).
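The indicators above can be computed directly from the counts. A minimal sketch (the counts and per-class AP values in the usage lines are illustrative, not measured):

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP): the share of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """r = TP / (TP + FN): the share of ground-truth objects that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_average_precision(ap_per_class):
    """mAP: the mean of per-class average precision over k classes."""
    return sum(ap_per_class) / len(ap_per_class)

# Illustrative counts: 80 correct detections, 20 false alarms, 40 missed objects
print(precision(80, 20))         # 0.8
print(round(recall(80, 40), 3))  # 0.667
print(mean_average_precision([0.787, 0.55]))  # made-up per-class AP values
```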
The advantages of YOLOv5. YOLOv5 is the first model in the YOLO family to be implemented in the PyTorch framework. Previous models were written in Darknet, the framework of the architecture's creator. Darknet loses to PyTorch in performance, configuration capabilities, and ease of model deployment. In Colab Notebooks with a Tesla P100, YOLOv5 produces predictions at a rate of 0.007 seconds per image, equivalent to about 140 frames per second. For comparison, YOLOv4 operates at 50 frames per second.
YOLO is a modern real-time object detector, and YOLOv5 (Fig. 3) builds on YOLOv1-YOLOv4 [15]. Continuous improvements have made it possible to achieve the highest performance on two official object detection datasets: Pascal VOC (Visual Object Classes) [16] and Microsoft COCO (Common Objects in Context) [17]. The architecture of the YOLOv5 CNN (Fig. 3) is discussed in [15]; the task that the CNN solves is to detect objects from 80 classes (Microsoft COCO).
The architecture of the YOLOv5 network consists of three parts. The data are first fed into CSPDarknet to extract features and then passed to PANet for feature fusion. Finally, the YOLO layer outputs the detection results (class, score, location, size).
To solve the problem of automation and increase the efficiency of object detection on aerial photographs and videos in UAS, it is proposed to use the YOLOv5x CNN as the basic model. This model is the most accurate of the YOLOv5 line. Improvement of the model of object detection in aerial photographs and in the video stream was carried out by training the YOLOv5x neural network on the VisDrone 2021 image set (China) [18]. This set includes 10 classes: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, motor. YOLOv5x training was conducted with the following optimal training parameters: the SGD optimization algorithm, an initial learning rate (step) of 0.0005, and 25 epochs. The VisDrone 2021 set consists of 400 video files comprising 265,228 frames and 10,209 images acquired from the camera installed on a UAV. A feature of this set is that it was assembled under different lighting conditions, object densities, and weather conditions. Details of the operation (source code) of YOLOv5, its principles, and full documentation on training, testing, and deployment of the model are given in the official repository [19]. That repository links to tutorials, the main ones being: training on custom data; tips for achieving the best training results; Weights & Biases logging; multi-GPU training; model ensembling; model pruning and sparsity; hyperparameter evolution; transfer learning with frozen layers.
The study of object detection on aerial photographs and videos, and the plotting of graphs, was conducted in the Python 3.7 programming language in the Colab Notebooks cloud service (machine learning modeling environment), PRO (paid) version, with a Tesla T4 GPU runtime with 15,110 MB of memory. An ACPI X64 computer (China) was used with the Windows 10 Pro operating system, an AMD Ryzen 3 1200 Quad-Core 3.10 GHz processor, a GTX 1050 2 GB GPU, and 16 GB of RAM. When studying the model, aerial photographs and videos obtained from a UAV were used. According to the NATO classification (STANAG 4670 (ATP 3.3.7)), the UAV used belongs to class I (≤150 kg), category small (>15 kg). To obtain streaming video, a USG-212 EO/IR multi-sensor gyrostabilized gimbal is used, designed for use on UAVs and small manned aircraft. The gimbal is equipped with a Sony Full-HD block camera with 30x optical zoom and a high-quality infrared camera. It is hermetically sealed and can be operated in all weather conditions. Optionally, a built-in image processing unit is available, adding functions such as target tracking, coordinate acquisition, and video stabilization. The anti-vibration damping system eliminates the vibrations of the UAV hull; even at 30x zoom, the image remains clear and stable.
The studies were carried out under the following assumptions and limitations:
- a digital photo and video camera (a digital camera capable of shooting both photos and video may be used) is installed on board the UAV; it shoots in the visible range during daytime;
- an aerial photograph (video) in digital form is transmitted through the communication channel to the ground control point;
- the process of object detection on aerial photographs (video) is carried out on the computer of the ground control point of the UAS.

1. Studying the effectiveness of object detection on aerial photographs using CNN
The efficiency of object detection on aerial photographs was investigated using CNNs of the following models: YOLOv4, YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, trained on the Microsoft COCO and VisDrone 2021 datasets. The comparison was carried out in the Colab Notebooks cloud service with a Tesla T4 15,110 MB runtime. Fig. 4 shows the original aerial photo from the VisDrone 2021 set.
At the first stage, a study was conducted on the detection of objects on an aerial image (from the VisDrone set (VisDrone2019-DET-test-challenge)) in the Colab Notebooks cloud environment using models YOLOv4, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, trained on the Microsoft COCO dataset. The result of object detection for the model YOLOv4, YOLOv5x is shown in Fig. 5, 6, respectively.
The validation parameters of the YOLOv4, YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x models trained on the Microsoft COCO dataset are given in Table 2. Analysis of the results in Table 2 shows that the best detection time for aerial photographs with a size of 1,360 × 765 is demonstrated by the YOLOv5n model (0.0096 s), while the longest detection time is demonstrated by the YOLOv4 model (0.166 s). Table 3 gives the result of checking the YOLOv4 and YOLOv5 models on the Microsoft COCO 2017 validation set (5,000 validation images, image size 640 × 640, 36,335 labels).
Analysis of the results in Table 3 reveals that the best average precision is demonstrated by the YOLOv5x model: mAP 0.5 = 0.683, mAP 0.5…0.95 = 0.496. An example of checking the YOLOv5x model on the Microsoft COCO 2017 validation set using Colab Notebooks in the Tesla T4 15,110 MB GPU runtime is shown in Fig. 7.
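The thresholds in mAP 0.5 and mAP 0.5…0.95 refer to the intersection over union (IoU) between a predicted and a ground-truth box: a detection counts as a true positive at mAP 0.5 when IoU ≥ 0.5, while mAP 0.5…0.95 averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. A minimal IoU sketch (the boxes in the usage line are illustrative):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

# Two partially overlapping boxes: intersection 25, union 175
print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429
```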

2. Evaluation of the effectiveness of object detection on aerial photographs and video stream with the improved model VisDroneYOLOv5x
For training, validation, and testing of the YOLOv4, YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x models, aerial photographs from the VisDrone 2021 set were used: a training sample of 6,471, a validation sample of 548, and a test sample of 1,610 RGB aerial photographs in JPEG format, 640 × 640 pixels. The total number of classes for object detection on aerial photographs and the video stream was 10 (pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, motor).
Step 1. Loading (cloning) the YOLOv5 model, checking the PyTorch framework and the GPU.
To train on the new dataset, the architecture (code) of the YOLOv5 model is cloned into the Colab Notebooks machine learning modeling environment, PRO version (Fig. 8). The code can also be cloned into other machine learning modeling environments (frameworks): Kaggle; DockerHub; Amazon Web Services (AWS) Deep Learning; Google Cloud Platform (GCP), and others.
Step 2. Training of the proposed YOLOv5x model was carried out using the optimal values of the parameters, which were obtained experimentally:
- image size: 640 (640 × 640);
- batch size: 8;
- duration of training (number of epochs): 25;
- data set: VisDrone;
- model: YOLOv5x.
The training parameters (image size, batch size, number of epochs, data set, model) are entered as shown in Fig. 9.
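Under the repository's command-line interface, the parameters above map onto a `train.py` invocation; the sketch below assembles such a command. The file names (`VisDrone.yaml`, `yolov5x.pt`) are assumptions about the local setup, and the initial learning rate (0.0005) and SGD optimizer are configured in the hyperparameter YAML rather than on this command line:

```python
def build_train_command(img: int = 640, batch: int = 8, epochs: int = 25,
                        data: str = "VisDrone.yaml",
                        weights: str = "yolov5x.pt") -> str:
    """Assemble a YOLOv5 train.py command line for the parameters used here."""
    return (f"python train.py --img {img} --batch {batch} "
            f"--epochs {epochs} --data {data} --weights {weights}")

print(build_train_command())
```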
Step 3. Loading the VisDrone data set (Fig. 10). This step also unzips and converts the data.
Step 4. Loading the YOLOv5x model (Fig. 11). The total number of model parameters was 86 million (86,278,375) for the VisDrone dataset.
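In PyTorch, such a total is obtained by summing `p.numel()` over `model.parameters()`. As a framework-free illustration of where such counts come from, the sketch below counts the parameters of individual convolution layers, (k·k·c_in + 1)·c_out with bias; the layer shapes are a hypothetical small backbone, not the real YOLOv5x layout:

```python
def conv2d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Learnable parameters in one 2-D convolution layer:
    k*k weights per input channel for each output filter, plus one bias
    per filter when bias=True."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

# Hypothetical first layers: (in channels, out channels, kernel size)
layers = [(3, 32, 6), (32, 64, 3), (64, 128, 3)]
total = sum(conv2d_params(ci, co, k) for ci, co, k in layers)
print(total)
```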
Step 5. Caching the training and validation data set (Fig. 12).
Step 6. Training the model on the training sample and checking precision (P), recall (R), and mean average precision (mAP 0.5, mAP 0.5…0.95) on the validation sample of 548 images (Fig. 13).
Step 7. Evaluation of precision, recall, average precision of the model on the validation sample.
As a result, a new model with the proposed name VisDroneYOLOv5x was obtained. The results of checking the precision and recall of this neural network are shown in Fig. 14, which shows that for the VisDroneYOLOv5x model, the greatest precision was achieved for the car class: mAP 0.5 = 0.787.
The weight of each category is related to the number of labels shown in Fig. 15 for the VisDrone 2021 set when examining the VisDroneYOLOv5x model. Fig. 15 shows that for the VisDrone 2021 dataset, the largest number of labels belongs to the car class, followed by the pedestrian class. Fig. 16 shows an example of object detection on the test sample by the proposed VisDroneYOLOv5x model. Fig. 16 demonstrates that even with a high density of objects, the proposed VisDroneYOLOv5x model copes with the task of detecting the 10 classes. Fig. 17 shows an example of object detection by the VisDroneYOLOv5x model on video acquired from a UAV (the distance to the objects is 2.38 km, the height is 333.4 m, the zoom is 30x).
The proposed VisDroneYOLOv5x model was compared with the YOLOv4 and YOLOv3 models. To evaluate the VisDroneYOLOv5x model for convergence, adequacy, and validity, 548 aerial photographs from the VisDrone 2021 set were used as a validation sample.
Convergence. A CNN shows convergence provided that the error decreases with each epoch. The convergence of a CNN model is influenced by three components: the completeness of the database (aerial photographs); the correct choice of architecture; and the selection of CNN training parameters. Adequacy. A neural network is adequate if the learning outcomes converge to close values; a necessary condition is that there is a dependence between the output and input data that is implemented by the neural network.
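The convergence criterion stated here, that the error decreases with each epoch, can be checked mechanically over a recorded loss history; a minimal sketch (the loss values are illustrative, not measured):

```python
def is_converging(epoch_losses):
    """True if the training error strictly decreases from epoch to epoch."""
    return all(later < earlier
               for earlier, later in zip(epoch_losses, epoch_losses[1:]))

print(is_converging([0.92, 0.61, 0.44, 0.38, 0.35]))  # True
print(is_converging([0.92, 0.61, 0.70]))              # False
```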
The most recommended way to test a neural network model for adequacy is to compare the results with known models. The results of the test on the validation sample (548 aerial photographs) are given in Table 4, which demonstrates that, in comparison with the YOLOv3 and YOLOv4 models, the proposed VisDroneYOLOv5x model shows the best performance: precision P = 0.510; recall r = 0.403; average precision mAP 0.5 = 0.403, mAP 0.5…0.95 = 0.235.

Discussion of results of studying object detection in aerial photographs and videos using CNN
A study of the efficiency of object detection on aerial photographs (Table 2) using CNNs of the models YOLOv4 (Fig. 5), YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x (Fig. 6) showed that the best detection time for aerial photographs with a size of 1,360 × 765 is demonstrated by the YOLOv5n model (0.0096 s). The longest detection time (Table 2) was shown by the YOLOv4 model (0.166 s). The best average precision is demonstrated by the YOLOv5x model (Fig. 7): mAP 0.5 = 0.683, mAP 0.5…0.95 = 0.496 (Table 3).
It is proposed to use the YOLOv5x CNN (Fig. 11) to detect objects in aerial photographs (Fig. 6) and video (Fig. 17).
To increase the efficiency of the neural network, this model was trained (Fig. 13) by the VisDrone set with the selection of optimal parameters (Fig. 9): the duration of training (number of epochs) -25; batch size -8; initial learning rate -0.0005; optimization algorithm -SGD. As a result, a new model with the proposed name VisDroneYOLOv5x was obtained.
The use of the VisDroneYOLOv5x CNN makes it possible to automate the process of object detection on aerial photographs and videos.
Using the proposed model makes it possible to address the following problems [4][5][6][7][8][9][10][11]:
- the computational complexity of object detection on aerial photographs and videos acquired from UAVs;
- the lack of neural network models that solve the problem of object detection on aerial photographs and videos.
Limitations of the proposed model include:
- the detection of objects on aerial photographs and videos is carried out within 10 classes;
- the orientation of objects on aerial photographs is not taken into consideration;
- the CNN's translation invariance is not taken into consideration.
The proposed model is adapted to detect objects in aerial photography and video for ten classes. CNN training was conducted on aerial photographs of high contrast and clarity; for other aerial photographs, the precision and recall of object detection by class may vary, which requires additional research.
To advance the proposed model, it is planned:
- to increase the base of labeled aerial photographs for the training sample [20, 21];
- to explore the proposed and other models (YOLOX, YOLOP, etc.) under different aerial photography conditions;
- to optimize the proposed model for computational complexity and to increase its speed.
The VisDroneYOLOv5x model is proposed for use at the ground control point of a UAS when processing aerial photographs, orthophoto maps, and videos. The model could also be used in systems with artificial intelligence, in facility control systems, when designing artificial intelligence in robots, and in unmanned vehicle systems.

Conclusions
1. The indicators of efficiency of models YOLOv4, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x have been studied. The verification of the effectiveness of these models involved the Microsoft COCO 2017 validation set (the number of images for validation is 5,000). It has been established that the best performance is shown by the YOLOv5x model: