A Vehicle Detection Method for Aerial Image Based on YOLO

With the application of UAVs in intelligent transportation systems, vehicle detection in aerial images has become a key engineering technology and a subject of academic research. This paper presents a vehicle detection method for aerial images based on the YOLO deep learning algorithm. The method builds an aerial image dataset suitable for YOLO training by processing and integrating three public aerial image datasets. Experiments show that the trained model performs well on unseen aerial images, especially for small objects, rotated objects, and compact, dense objects, while meeting real-time requirements.


Introduction
In recent years, with the rapid development of information technology, intelligent transportation systems have become an important means of modern traffic management and an inevitable trend. As a key technology of intelligent transportation systems, vehicle detection is the basis for many important functions [1], such as the measurement and statistics of traffic parameters (for example, traffic flow and density), vehicle location and tracking, and traffic data mining.
At the same time, as UAV (Unmanned Aerial Vehicle) technology matures and becomes widely available, UAVs, which are lightweight, flexible, and cheap, show great advantages in aerial photography for scenarios such as traffic information collection and traffic emergency response.
In summary, vehicle detection in aerial images plays an important role in engineering applications. In addition, the technology relies on machine vision, artificial intelligence, image processing, and other disciplines, and is a typical application of interdisciplinary research. Therefore, it also has important academic research significance.
Based on the YOLO deep learning algorithm and three public aerial image datasets, this paper presents a vehicle detection method for aerial images.

Related Work
The commonly used vehicle detection methods proposed by domestic and foreign scholars fall into three main categories: methods based on motion information, methods based on features, and methods based on template matching. Cheng et al. use background subtraction and registration to detect moving vehicles [2], and Azevedo et al. use a median background difference method to detect vehicles in aerial images [3]. These two methods detect moving objects; however, because aerial video has complex scenes and diverse objects, they cannot achieve the desired accuracy for vehicle detection, and false and missed detections are serious. Sivaraman et al. combine Haar features and Adaboost to detect vehicles on highways [4], and Tehrani et al. propose a vehicle detection method based on HOG features and SVM for urban roads [5]. These two methods improve detection accuracy, but since traditional machine learning methods only support training on a small amount of data, they still fall short in handling the diversity of vehicles.
In recent years, with advances in computer hardware, especially GPU technology, deep learning algorithms have developed rapidly for problems in pattern recognition and image processing, and are more efficient and precise than traditional algorithms. Therefore, this paper uses a deep learning algorithm, YOLO, to achieve vehicle detection.

YOLO Deep Learning Object Detection Algorithm
YOLO, proposed by Joseph Redmon and others in 2015 [6], is a real-time object detection system based on CNN (Convolutional Neural Network). At CVPR (Conference on Computer Vision and Pattern Recognition) 2017, Joseph Redmon and Ali Farhadi released YOLO v2, which improved the algorithm's accuracy and speed [7]. In April this year, Joseph Redmon and Ali Farhadi proposed the latest version, YOLO v3, which further improved object detection performance [8]. This chapter introduces the basic principles of the YOLO algorithm following its update history.

2) Network structure
The YOLO network borrows from GoogLeNet; the difference is that YOLO uses a 1 × 1 convolutional layer (for cross-channel information integration) +

YOLO v2
Compared with region-proposal-based methods such as Fast R-CNN, YOLO v1 has a larger localization error and a lower recall rate. Therefore, the main improvements of YOLO v2 aim to enhance recall and localization ability, and include:

1) Batch normalization (BN)
BN has been a popular training technique since 2015. By adding a BN layer after each convolutional layer, each batch of data is normalized to a mean of 0 and a variance of 1, which helps prevent vanishing and exploding gradients and makes the network converge faster.

2) Anchor boxes
In YOLO v1, fully connected layers are used to predict the bounding box (bbox) coordinates directly after the convolutional layers. YOLO v2 removes the fully connected layers, borrowing the idea of Faster R-CNN, and adds anchor boxes, which effectively improves the recall rate.
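The normalization step described under batch normalization above can be illustrated with a minimal sketch. It is not the YOLO v2 implementation: it normalizes a single list of activations, omits the learnable scale and shift parameters that real BN layers also apply, and the function name `batch_normalize` and the value of `eps` are assumptions chosen for illustration.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Normalize a batch of activations to (approximately) mean 0, variance 1.

    eps is a small constant (assumed value) that avoids division by zero.
    """
    n = len(values)
    mean = sum(values) / n
    # Population variance of the batch.
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_normalize(batch)

# Check the statistics of the normalized batch.
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print(norm_mean, norm_var)
```

Because of `eps`, the resulting variance is very slightly below 1 rather than exactly 1.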

YOLO v3
The YOLO v3 model is much more complex than YOLO v2, and it detects small objects, as well as compact, dense, or highly overlapping objects, very well. The main improvements include:

1) Loss
YOLO v3 replaces the Softmax loss of YOLO v2 with a logistic loss. When the predicted object classes are complex, especially when there are many overlapping labels in the dataset, logistic regression is more effective.

2) Anchor
YOLO v3 uses nine anchors instead of the five anchors of YOLO v2, which improves the IoU.

3) Detection
YOLO v2 uses only one detection scale, while YOLO v3 uses three, which greatly improves the detection of small objects.
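The IoU (Intersection over Union) mentioned under the anchor improvement measures how well a predicted or anchor box overlaps a ground-truth box. A minimal sketch follows, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max); the function name `iou` and the example boxes are illustrative and not taken from the YOLO code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max).
    """
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 4, 4), (2, 2, 6, 6))   # intersection 4, union 28
same = iou((0, 0, 4, 4), (0, 0, 4, 4))      # identical boxes
disjoint = iou((0, 0, 1, 1), (5, 5, 6, 6))  # no overlap
print(overlap, same, disjoint)
```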

Public Datasets for YOLO Training
A classifier trained on conventional datasets performs poorly on aerial images, because aerial images have the following special features:

1) Scale diversity
The shooting height of UAVs ranges from tens of meters to kilometers, so similar objects on the ground appear at a wide range of sizes.

2) Perspective specificity
Aerial images are mostly taken looking down from high altitude, while most conventional datasets are taken from ground level.

3) Small object
Objects in aerial images generally occupy only a few dozen or even a few pixels, so they carry little information.

4) Multidirectional
Aerial images are taken from a bird's-eye view, so the orientation of objects is uncertain (whereas object orientation in conventional datasets tends to be predictable; for example, pedestrians are generally upright).

5) High background complexity
Aerial images have a large field of view (usually covering a few square kilometers).

For the above reasons, it is often difficult to train an ideal classifier on conventional datasets for object detection tasks on aerial images. Therefore, a specialized aerial image dataset is needed. In this paper, three public aerial image datasets are processed and combined into a new aerial image dataset suitable for YOLO training. This chapter introduces the three datasets.

VEDAI Dataset
The VEDAI (Vehicle Detection in Aerial Imagery) dataset was made by Sebastien Razakarivony and Frederic Jurie of the University of Caen [9]; its original material is from the public Utah AGRC database. The raw images have 4 uncompressed color channels (three visible color channels and one near-infrared channel). The authors first split the original large-field satellite images into 1024 × 1024 pixel JPEG images, then created a visible color channel dataset and a near-infrared channel dataset, and finally downsampled both datasets to 512 × 512 pixels, so VEDAI contains 4 subsets. In this paper, only the first subset of VEDAI (1024 × 1024, 3 RGB channels) is used. All images in VEDAI were taken from the same height, and the GSD (Ground Sampling Distance) of the 1024 × 1024 images is 12.5 cmpp (cm per pixel). VEDAI contains a total of 1250 images, manually annotated with nine classes of vehicle ("plane", "boat", "camping car", "car", "pick-up", "tractor", "truck", "van", and "other"), for a total of 2950 samples. The annotation of each sample includes the sample class, the coordinates of the GT's center point, the direction, and the coordinates of the GT's 4 corners.
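The relationship between GSD and object size in pixels can be made concrete with a short sketch. The GSD of 12.5 cmpp and the 1024 to 512 downsampling come from the text above; the car length of 4.5 m is an assumed, illustrative value, and the function names are hypothetical.

```python
def length_in_pixels(length_m, gsd_cm_per_pixel):
    """Number of pixels covered by an object of the given length in meters."""
    return length_m * 100.0 / gsd_cm_per_pixel


def gsd_after_resize(gsd_cm_per_pixel, original_side, new_side):
    """GSD after resizing a square image from original_side to new_side pixels."""
    return gsd_cm_per_pixel * original_side / new_side


CAR_LENGTH_M = 4.5  # assumed typical car length, not from the paper

# VEDAI 1024 x 1024 subset: 12.5 cm per pixel.
car_px_1024 = length_in_pixels(CAR_LENGTH_M, 12.5)

# VEDAI 512 x 512 subset: downsampling doubles the GSD.
gsd_512 = gsd_after_resize(12.5, 1024, 512)
car_px_512 = length_in_pixels(CAR_LENGTH_M, gsd_512)

print(car_px_1024, gsd_512, car_px_512)
```

Under these assumptions a car spans a few dozen pixels at full resolution and about half that after downsampling, which is consistent with the small-object property described earlier.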

COWC Dataset
The COWC (Cars Overhead with Context) dataset was made by T. Nathan Mundhenk and others of Lawrence Livermore National Laboratory [10]; its original material is from six public websites. COWC contains a total of 53 images in TIFF format, with image sizes between 2000 × 2000 and 19,000 × 19,000 pixels. COWC images cover six geographic locations, namely Toronto (Canada), Selwyn (New Zealand), Potsdam and Vaihingen (Germany), and Columbus and Utah (United States); the images of Vaihingen and Columbus are grayscale, while the others are in RGB color. The GSD of the images is 15 cmpp, so vehicle sizes are mostly between 24 and 48 pixels. COWC is manually annotated with one class of positive samples ("car"), totaling 32,716 samples, as well as four classes of negative samples ("boats", "trailers", "bushes", and "A/C units") that are easily confused with vehicles, totaling 58,247 samples. The annotation of each sample includes the sample class and the coordinates of the GT's center point.

DOTA Dataset
For DOTA [11], in order to eliminate the deviation caused by different sensors, the original material comes from multiple platforms (such as Google Earth). DOTA is characterized by multiple sensors and multiple resolutions; that is, the GSDs of its images vary. DOTA contains a total of 2806 images of about 4000 × 4000 pixels, manually annotated with 15 classes of samples ("plane", "ship", "storage tank", "baseball diamond", "tennis court", "swimming pool", "ground track field", "harbor", "bridge", "large vehicle", "small vehicle", "helicopter", "roundabout", "soccer ball field", and "basketball court"), for a total of 188,282 samples. The annotation of each sample includes the sample class and the coordinates of the GT's 4 corners (arranged in clockwise order, starting from the top left corner).
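A four-corner annotation of the kind described above can be reduced to a center point, width, and height. The sketch below takes the axis-aligned rectangle that encloses the four corners; this is one plausible choice and an assumption, not necessarily the formula used in this paper, and the function name `corners_to_box` is hypothetical.

```python
def corners_to_box(corners):
    """Convert four (x, y) corners to (center_x, center_y, width, height).

    The result is the axis-aligned rectangle enclosing all four corners,
    so a rotated vehicle yields a box larger than the vehicle itself.
    """
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    return (
        (x_min + x_max) / 2,
        (y_min + y_max) / 2,
        x_max - x_min,
        y_max - y_min,
    )


# An axis-aligned sample and a sample rotated by 45 degrees (illustrative values).
upright = [(10, 20), (58, 20), (58, 44), (10, 44)]
rotated = [(10, 0), (20, 10), (10, 20), (0, 10)]

upright_box = corners_to_box(upright)
rotated_box = corners_to_box(rotated)
print(upright_box, rotated_box)
```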

A Vehicle Detection Method for Aerial Image Based on YOLO
In this paper, we first process and integrate the above three public aerial image datasets, and then modify the network parameters of the YOLO algorithm appropriately to train a model. On this basis, we propose a vehicle detection method for aerial images. The specific steps are as follows.

Make Standard Datasets for YOLO Training
The standard dataset for YOLO training consists mainly of two parts, images and labels, where images are in JPEG format and labels are txt documents.
Labels and images are in one-to-one correspondence. Each label records the annotations of the samples in the corresponding image. The annotation format is: class, the GT's center point coordinates (x, y), and the GT's width and height (w, h), where (x, y, w, h) are normalized values. When there are multiple samples in one image, each sample is written on its own line. Since the input dimension of the YOLO v3 training network is 416 × 416 × 3, the images used for training should not be too large; otherwise, the features of the samples may be seriously lost after resizing. The basic information of the three public aerial image datasets described in Chapter 4 is shown in Table 1.
Table 1. Basic information of the three public aerial image datasets.
After processing, the information of the new datasets is shown in Table 2.
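The label format described in this section (class followed by normalized x, y, w, h, one sample per line) can be sketched as follows. The function name `to_yolo_line`, the class index 0 for "car", and the sample values are assumptions for illustration; normalization divides horizontal values by the image width and vertical values by the image height.

```python
def to_yolo_line(class_id, cx, cy, w, h, img_w, img_h):
    """Format one sample as a label line with normalized coordinates.

    cx, cy, w, h are in pixels; img_w and img_h are the image size in pixels.
    """
    return "{} {:.6f} {:.6f} {:.6f} {:.6f}".format(
        class_id, cx / img_w, cy / img_h, w / img_w, h / img_h
    )


# Two hypothetical "car" samples (class 0) in a 1024 x 1024 image.
samples = [
    (0, 512, 256, 48, 32),
    (0, 128, 768, 32, 48),
]

# One line per sample, as required when an image contains multiple samples.
label_text = "\n".join(
    to_yolo_line(c, cx, cy, w, h, 1024, 1024) for c, cx, cy, w, h in samples
)
print(label_text)
```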

Conclusion
In this paper, a vehicle detection method for aerial images based on the YOLO deep learning algorithm is presented. The method builds an aerial image dataset suitable for YOLO training by processing and integrating three public datasets. The trained model gives good test results, especially for small objects, rotated objects, and compact, dense objects, and meets real-time requirements. Next, we will integrate more public aerial image datasets to increase the number and diversity of training samples and, at the same time, optimize the YOLO algorithm to further improve detection accuracy.

3 × 3 convolutional layer instead of the Inception module. The YOLO v1 network structure consists of 24 convolutional layers and 2 fully connected layers, as shown in Figure 1.
YOLO v3 replaces the darknet-19 network of YOLO v2 with the darknet-53 network, which improves the accuracy of object detection by deepening the network. This paper uses the latest YOLO v3 model to achieve vehicle detection in aerial images.

We process the above three datasets separately.

1) VEDAI
a) The image size is suitable and does not need to be processed;
b) Delete the annotations of the three classes "plane", "boat", and "other" in the labels;
c) Delete the "direction" field in the annotations;
d) According to the coordinates of the GT's 4 corners, calculate the width and height: [...]

2) COWC
a) [...] the grayscale images;
b) Delete the annotations of negative samples, leaving only the positive sample class "car";
c) Split the images of COWC into 416 × 416 tiles and convert them to JPEG format. When splitting, the coordinates of each sample's center point are converted accordingly to ensure that its position in the new image is correct. The remaining images smaller than 416 × 416 are padded with black;
d) According to the GSD of COWC, the size of a vehicle in the image is assumed to be uniformly 48 × 48 pixels; therefore, [...]

3) DOTA
a) [...] for "large vehicle" and "small vehicle", delete all the annotations of the other 13 classes in the labels; "large vehicle" and "small vehicle" are unified to "car";
b) Split the images of DOTA into 1024 × 1024 tiles. When splitting, the coordinates of the GT's 4 corners are converted accordingly to ensure that their positions in the new images are correct. Abandon the remaining images smaller than 1024 [...]
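The tile splitting and coordinate conversion used for COWC and DOTA can be sketched as below. This is a minimal illustration under stated assumptions: tiles are cut on a regular grid without overlap, a sample is assigned to the tile containing its center point, and all function names are hypothetical. It handles only coordinates and leaves out image I/O and padding of pixel data.

```python
def tile_origins(img_w, img_h, tile):
    """Top-left corners of all tiles on a regular, non-overlapping grid."""
    return [
        (x0, y0)
        for y0 in range(0, img_h, tile)
        for x0 in range(0, img_w, tile)
    ]


def content_size(x0, y0, img_w, img_h, tile):
    """Size of the real image content inside the tile at (x0, y0).

    Edge tiles contain less than tile x tile pixels; the rest would be
    padded with black (COWC) or the tile would be discarded (DOTA).
    """
    return (min(tile, img_w - x0), min(tile, img_h - y0))


def assign_to_tile(cx, cy, tile):
    """Return the origin of the tile containing (cx, cy) and the point's
    coordinates relative to that tile."""
    x0 = (cx // tile) * tile
    y0 = (cy // tile) * tile
    return (x0, y0), (cx - x0, cy - y0)


# Illustrative 1000 x 1000 image split into 416 x 416 tiles.
origins = tile_origins(1000, 1000, 416)
edge_size = content_size(832, 832, 1000, 1000, 416)
tile_origin, local_center = assign_to_tile(900, 450, 416)
print(len(origins), edge_size, tile_origin, local_center)
```

A sample whose box straddles a tile border would need extra handling (clipping or dropping), which this sketch does not attempt.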

Figure 2 (the original images are from the Internet; please inform us if there is any infringement).

Figure 2 (left) shows that the trained model detects small objects well. The vehicles in Figure 2 (middle) are mostly rotated rather than horizontal or vertical; the test result shows that the model performs well on rotated objects. In particular, the leftmost vehicle in the image closely resembles the background and might be missed by manual inspection, yet the model detects it correctly. Figure 2 (right) shows that the model is outstanding at detecting compact and dense objects: more than 95% of the vehicles are correctly detected, except for those in the shadow at the far left.

Figure 2. Test of the trained model on unknown images.
DOTA (Dataset for Object detection in Aerial images) is an aerial image dataset made by Xia Guisong of Wuhan University, Bai Xiang of Huazhong University of Science and Technology, and others [11].
J. Y. Lu et al. DOI: 10.4236/jcc.2018.611009, Journal of Computer and Communications

Table 2. Information on the processed datasets.

Table 3. Test results of the trained model.