Research on the Application of Helmet Detection Based on YOLOv4

Safety helmets are an important measure for protecting construction workers. Because failing to wear a helmet as required can cause serious harm, helmet-wearing detection has attracted growing attention. At present, helmet detection relies mainly on the YOLO series of algorithms. Existing work often focuses only on detection accuracy and ignores a practical requirement of deployment, namely a balance between accuracy and speed. This paper therefore proposes a helmet detection application based on the YOLOv4 algorithm combined with the MobileNet network, which achieves good results in both detection accuracy and speed. Through transfer learning and parameter tuning, the method reaches an mAP of 94.47% and a speed of 27.36 FPS on the public safety helmet dataset, exceeding the results of some similar papers. This paper also combines YOLOv4 with MobileNetv3 to propose a MobileNet-based YOLOv4 helmet detection application. Its mAP and speed are 91.47% and 42.58 FPS, respectively, which meet the accuracy and real-time requirements of current hardware deployment.


Introduction
With the continuous development of computer vision, its applications are becoming part of everyday life. Safety helmets are an important safeguard for workers in construction, production plants, and other areas where risks exist. This paper therefore uses target detection to check whether workers on construction sites are wearing safety helmets and to remind them in real time, which helps protect their lives.
At present, target detection algorithms can be divided into traditional algorithms and deep learning algorithms. This paper takes deep learning target detection as its baseline and explores its development and application. Deep learning detectors are further divided into two-stage and one-stage algorithms. Because helmet detection on construction sites must run in real time, this paper mainly studies one-stage algorithms. Typical one-stage detectors include SSD, YOLO, RetinaNet, CenterNet, and EfficientDet. Among them, YOLO has attracted many researchers because it performs well in both accuracy and real-time speed. This paper likewise takes the YOLO series as its main line and examines its performance in helmet detection applications.
Safety helmet detection is an active research direction within target detection. Weihong Wu [1] proposed an application scenario for helmet detection in security surveillance, replacing the backbone network of YOLOv3 with ResNet and adding attention mechanisms; the resulting YOLOv3-B algorithm improved the accuracy of helmet detection. Yuxin Huang [2] et al. proposed a portable and reliable helmet detection system by combining YOLOv2 with an embedded device. Shuai Li [3] et al. improved YOLOv4 with image enhancement techniques, redesigned the anchor sizes with K-means, and added dilated convolution and label smoothing, improving both small-target detection and speed. Zhao Rui [4] et al. proposed an improved YOLOv5s algorithm that replaces the slicing (Focus) structure of the YOLOv5 backbone network with DenseBlock and adds the SE attention mechanism to the neck network, which greatly improved detection accuracy. Yufang Jin [5] et al. improved small-target detection by adding a 128 × 128 feature layer output to the output feature layers of YOLOv4, and further improved helmet detection by introducing dense connections to enhance feature reuse. Chenglong Wang [6] et al. proposed a helmet detection method that differs from all of the above: their algorithm combines facial features with neural networks, using VGG networks for helmet detection to protect construction workers. Sun [7] et al. improved small-target detection by adding a self-attention mechanism to the Faster R-CNN framework and by complementary anchor enhancement that focuses the framework further on small targets, achieving good results in helmet detection. Beyond these detection frameworks, helmet detection can also be performed with OpenCV: Zhao Zhen [8] implemented OpenCV-based safety helmet detection, which in practice also reminded construction workers to wear safety helmets and reduced unsafe factors.
This paper proposes using the YOLOv4 target detection algorithm to detect helmet wearing on construction sites in real time, so that construction workers can be reminded promptly to wear helmets. In order to improve the accuracy and real-time

YOLOv3
YOLOv3 [9] is the latest version proposed by Joseph Redmon, following YOLOv1 and YOLOv2. YOLOv3 uses DarkNet53 as the backbone network with the Leaky ReLU activation function, builds a structure similar to feature pyramid networks to enhance feature extraction, and uses the YOLO Head to obtain the final prediction results. YOLOv3 detects targets at three scales (large, medium, and small): the image is divided into 13 × 13, 26 × 26, and 52 × 52 grids, and each feature point corresponds to three prior anchors. The prior anchors must be obtained from the training dataset with K-means before training; because each feature point at each of the three scales uses three prior anchors, nine prior anchors are needed in total. YOLOv3 is the most important version in the development of the YOLO series, and many subsequent versions of YOLO are improvements built on YOLOv3 (Figure 1).
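The K-means search for nine prior anchors can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: it assumes the labelled boxes are given as (width, height) pairs, uses 1 − IOU as the clustering distance (a common choice for YOLO anchors that the text does not specify), and updates each centre with the cluster mean. The function names `iou_wh` and `kmeans_anchors` and the random boxes are hypothetical.

```python
import random

def iou_wh(box, anchor):
    """IOU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) boxes into k anchors, using 1 - IOU as the distance."""
    rng = random.Random(seed)
    anchors = rng.sample(boxes, k)  # start from k randomly chosen boxes
    for _ in range(iters):
        # Assign every box to the anchor it overlaps most (smallest 1 - IOU).
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, anchors[i]))
            clusters[best].append(b)
        # Move each anchor to the mean width and height of its cluster.
        new_anchors = []
        for i, members in enumerate(clusters):
            if not members:  # keep an anchor that attracted no boxes
                new_anchors.append(anchors[i])
                continue
            w = sum(m[0] for m in members) / len(members)
            h = sum(m[1] for m in members) / len(members)
            new_anchors.append((w, h))
        if new_anchors == anchors:  # assignments have stopped changing
            break
        anchors = new_anchors
    # Sort by area so the anchors can be split across the three scales.
    return sorted(anchors, key=lambda a: a[0] * a[1])

# Illustrative data only: random box sizes, not the helmet dataset.
rng = random.Random(1)
boxes = [(rng.uniform(10, 300), rng.uniform(10, 300)) for _ in range(500)]
anchors = kmeans_anchors(boxes, k=9)
small, medium, large = anchors[0:3], anchors[3:6], anchors[6:9]
```

In YOLOv3 the three smallest anchors are normally assigned to the finest grid (52 × 52) and the three largest to the coarsest grid (13 × 13), which is why the result is sorted by area.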

YOLOv4
YOLOv4 [10] is not a completely new version; more precisely, it adds a collection of tricks to YOLOv3 to increase the accuracy of target detection. YOLOv4 changes the backbone network of YOLOv3, as shown in Table 1 below. In terms of training, YOLOv4 uses Mosaic data enhancement, Label Smoothing, CIOU, and a cosine annealing learning rate decay, small tricks that improve its detection performance. The Mish activation function is calculated as follows (Figure 2):

$$\mathrm{Mish}(x) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right)$$
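Three of the tricks named above have compact standard definitions, sketched here in plain Python. This is an illustration under stated assumptions, not the training code of this paper: Mish is taken as x · tanh(softplus(x)), label smoothing as the usual mix of the one-hot label with a uniform distribution, and the cosine schedule as a single decay without restarts. The function names and the default values of `eps`, `lr_max`, and `lr_min` are hypothetical.

```python
import math

def mish(x):
    """Mish(x) = x * tanh(ln(1 + e^x)), using a numerically stable softplus."""
    softplus = math.log1p(math.exp(-abs(x))) + max(x, 0.0)
    return x * math.tanh(softplus)

def smooth_labels(one_hot, eps=0.01):
    """Label smoothing: shift a small share eps of the mass to all classes."""
    k = len(one_hot)
    return [y * (1.0 - eps) + eps / k for y in one_hot]

def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Cosine decay from lr_max at step 0 down to lr_min at total_steps."""
    cos = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

# Example values (illustrative settings only).
activation = mish(1.0)
targets = smooth_labels([0.0, 1.0], eps=0.1)
schedule = [cosine_annealing_lr(s, 100) for s in (0, 50, 100)]
```

Unlike Leaky ReLU, Mish is smooth everywhere and lets small negative values pass through, which is the usual motivation given for using it in the backbone.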

YOLOv5
YOLOv5 is another improved version of the YOLO series. No official paper has been published, but the code is open source. YOLOv5 comes in five versions: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Similar to YOLOv4, it adopts the CSP structure, its neck uses the FPN + PAN structure, and it uses the newer Focus structure. YOLOv5 introduces adaptive anchor calculation, so the anchor sizes are computed automatically instead of being generated with k-means before training, which reduces the complexity of the algorithm to some extent. YOLOv5 uses GIOU Loss as the bounding box loss function, which makes box regression more stable.
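The GIOU loss mentioned above can be illustrated with a short sketch. It follows the standard definition of generalized IOU (IOU minus the share of the smallest enclosing box not covered by the union) and assumes boxes in (x1, y1, x2, y2) corner format with positive area. The function names are hypothetical, and the code is not taken from the YOLOv5 repository.

```python
def giou(a, b):
    """Generalized IOU of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (zero area when the boxes do not overlap).
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box that encloses both a and b.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    enclose = cw * ch
    return inter / union - (enclose - union) / enclose

def giou_loss(a, b):
    """Loss is 0 for identical boxes and approaches 2 for distant boxes."""
    return 1.0 - giou(a, b)

same = giou((0, 0, 1, 1), (0, 0, 1, 1))      # identical boxes
apart = giou((0, 0, 1, 1), (2, 0, 3, 1))     # disjoint boxes
```

For disjoint boxes the plain IOU is 0 regardless of distance, whereas GIOU becomes more negative as the boxes move apart, so the loss still provides a useful signal for regression.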

YOLOX
YOLOX [11] is an improved version of the latest YOLO series proposed by

MobileNet Series
While pursuing model accuracy, some researchers have turned to the balance between detection accuracy and speed and have focused on lightweight networks, among which MobileNet performs quite well. There are three versions of MobileNet. MobileNetv1 [12] proposes depthwise separable convolution, which splits a standard convolution into a depthwise convolution and a pointwise convolution, greatly reducing the number of parameters and the computation of the network while keeping a good balance between parameter count and accuracy. The core of MobileNetv2 [13] is the inverted residual block, which reduces the number of parameters and increases the accuracy of the network. MobileNetv3 [14] adds an attention mechanism (SE module) and is designed with the help of the neural architecture search (NAS) algorithm [15] to further improve the performance of the network.
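The parameter saving from splitting a standard convolution into a depthwise and a pointwise convolution can be checked with simple arithmetic. The sketch below counts weights only (biases are ignored), and the layer sizes (3 × 3 kernel, 32 input channels, 64 output channels) are illustrative assumptions, not values from this paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

k, c_in, c_out = 3, 32, 64     # illustrative layer sizes
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(std, sep, round(sep / std, 4))   # 18432 2336 0.1267
```

The ratio equals 1/c_out + 1/k², so with 3 × 3 kernels a depthwise separable layer needs roughly one eighth to one ninth of the weights of the standard layer.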

Metrics for Model Evaluation
The main metrics for the evaluation of target detection are accuracy (mean average precision, mAP) and speed (frames per second, FPS). Two important quantities in the calculation of accuracy are precision and recall. Precision is the proportion of correct predictions among all predictions, calculated as

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall is the proportion of correct predictions among all true samples, calculated as

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives.
Another metric commonly used to calculate the accuracy of target detection is IOU (Intersection over Union), which measures how accurately an object is located in a given dataset. The IOU represents the overlap between the candidate box and the ground-truth box, that is, the ratio of their intersection to their union. The greater the overlap, the larger the value. In the ideal case the two boxes overlap completely and the ratio is 1. It is calculated as

$$\mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|}$$

where A is the candidate box and B is the original marked box. YOLOv4 uses CIOU, which is calculated as

$$\mathrm{CIOU} = \mathrm{IOU} - \frac{\rho^{2}(b, b^{gt})}{c^{2}} - \alpha v$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - \mathrm{IOU}) + v}$$

where b and b^{gt} are the centre points of A and B, ρ is the Euclidean distance between them, c is the diagonal length of the smallest box enclosing A and B, and w, h and w^{gt}, h^{gt} are the widths and heights of A and B.
FPS is another important performance metric for target detection algorithms. It is the number of images that can be processed per second. Real-time detection is possible only at high speed, which is extremely important for some application scenarios.
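The metrics in this section can be computed with a few lines of Python. The sketch below uses the standard definitions of precision, recall, IOU, and CIOU, and assumes boxes in (x1, y1, x2, y2) corner format with positive width and height. The function names and the small constant `eps`, which only guards against division by zero for identical boxes, are hypothetical and not taken from this paper.

```python
import math

def precision(tp, fp):
    """Share of predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of ground-truth objects that were found."""
    return tp / (tp + fn)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def ciou(a, b, eps=1e-9):
    """Complete IOU: IOU minus a centre-distance and an aspect-ratio penalty."""
    i = iou(a, b)
    # Squared distance between the two box centres.
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / ((1 - i) + v + eps)
    return i - rho2 / c2 - alpha * v

def fps(num_images, elapsed_seconds):
    """Images processed per second."""
    return num_images / elapsed_seconds

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))     # 1/7: intersection 1, union 7
penalised = ciou((0, 0, 2, 2), (1, 1, 3, 3))  # 1/7 - 2/18: centres are offset
```

When CIOU is used as a regression loss, the quantity that is minimised is usually 1 − CIOU.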

Description of Data
To facilitate comparison with other papers, the dataset used in this paper is the Safety Helmet Wearing Dataset (SHWD), which provides data for safety helmet wearing and human head detection with a total of 7851 images, including

Experimental Comparison
The effect of target detection is influenced by many factors: different experimental parameters, datasets, detection targets, and detection algorithms can all lead to different results, and the experimental parameter settings and laboratory hardware conditions are especially important. The experiment was conducted on a desktop with 32 GB of RAM, an i7-10700K processor, and an NVIDIA GeForce RTX 3070 graphics card. Constrained by these hardware conditions, all input images are 640 × 640 in size, and the batch size is set to 4 before the network freeze and 2 after the freeze. The target detection algorithms used in this paper are CenterNet, RetinaNet, EfficientDet, SSD, YOLOv3, YOLOv4, YOLOv5, and YOLOX. Their specific detection results are shown in Table 2 below, and the score threshold for all performance metrics in the table is 0.5.

Results Presentation and Analysis (Figure 3)
Table 2 shows that there are some similarities in the reasons for the poor detection results of EfficientDet and RetinaNet: their AP values for Hat are 83.01% and 89.4%, but their AP values for Person are only 45.84% and 34.83%, which is not satisfactory. A possible reason is that EfficientDet and RetinaNet do not extract facial features effectively. Because this paper mainly studies the YOLO series of target detection algorithms, this issue is not explored further.
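To see how a low AP on one class pulls down the overall score, mAP can be computed as the mean of the per-class AP values. The sketch below uses the AP values quoted above under two assumptions that the text does not confirm: that Hat and Person are the only classes, and that the values pair up in the order listed (first Hat value with first Person value). The results are therefore illustrative and should not be read as the mAP values reported in Table 2. The function and variable names are hypothetical.

```python
def mean_average_precision(ap_per_class):
    """mAP as the unweighted mean of the per-class AP values (in percent)."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# AP values from the text, paired by listing order (an assumption).
first_model = {"Hat": 83.01, "Person": 45.84}
second_model = {"Hat": 89.4, "Person": 34.83}

map_first = mean_average_precision(first_model)
map_second = mean_average_precision(second_model)
```

Under these assumptions the model with the higher Hat AP ends up with the lower mAP, because its Person AP is much weaker.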
As target detection algorithms develop, their detection results keep improving, and some researchers want to deploy detection programs on hardware. Limited by the performance bottleneck of hardware resources, however, larger networks cannot be deployed on devices with limited capabilities, so lightweight neural networks have attracted growing attention. Better-performing lightweight networks include ShuffleNet [16], CondenseNet [17], MobileNet, Xception [18], and SqueezeNet [19], which have achieved good results in both detection accuracy and speed. One of the better-performing and most popular of these networks is MobileNet. be applied to construction sites in order to remind construction workers to wear helmets in a timely manner in real time, which protects their lives to a certain