Research on Pedestrian Detection Technology Based on MSR and Faster R-CNN
1. Introduction
Pedestrian detection is the problem of determining whether pedestrians are present in a given scene and, if so, locating each of them. In recent years it has been widely used in video surveillance, driver assistance, and intelligent robots. Because pedestrians are easily affected by occlusion, background clutter, lighting, and other factors, pedestrian detection remains a challenging and active research topic, and improving it is of practical importance.
Pedestrian detection can be roughly divided into three parts: feature extraction, classification, and non-maximum suppression [1]. Many techniques have been applied to it. Dalal et al. [2] proposed the Histogram of Oriented Gradients (HOG) feature, combined with a linear SVM classifier, for pedestrian detection. Another important model of recent years is the Deformable Part Model (DPM) proposed by Felzenszwalb et al. [3]. Because DPM takes the internal structure of the target into account, it can detect pedestrians in different postures and distinguish targets from the background. Although these methods improve object detection to varying degrees, hand-designed features are not robust to the diversity of targets in complex scenes. The main strength of the Convolutional Neural Network (CNN) is that it learns object features automatically from large amounts of data and feeds them to a classifier to obtain excellent classification performance. Sermanet et al. [4] applied convolutional neural networks to pedestrian detection. The image features extracted by deep learning are far superior to those extracted by traditional methods. The Faster R-CNN model proposed by Ren et al. [5] in 2015 achieved state-of-the-art object detection accuracy at the time. Compared with R-CNN and Fast R-CNN, Faster R-CNN implements a truly end-to-end detection framework, which further reduces the time needed to generate bounding boxes. In [6], a cascaded boosted forest is trained directly on the deep convolutional features produced by the region proposal network, which achieves good results in pedestrian detection. In [7], the region proposal network is used to generate proposals, and the Faster R-CNN framework is used to detect pedestrians in nighttime infrared images. Building on the Faster R-CNN model, [8] incorporates the dark channel dehazing algorithm to improve pedestrian detection in harsh environments.
The methods above show that deep learning has greatly improved pedestrian detection compared with traditional methods. However, for targets that are captured in low light and appear small in the image, the Region-of-Interest (RoI) pooling layer cannot discriminate between features extracted at low resolution. In addition, the Faster R-CNN model uses the traditional NMS algorithm (Greedy-NMS) to eliminate redundant detection boxes. This algorithm follows a greedy strategy: any box whose overlap with a higher-scoring box exceeds the preset threshold is discarded, so an object that lies within that overlap may go undetected.
To address pedestrian detection under low light, the main work of this paper is as follows: 1) To ease feature extraction in low-light environments, the samples are pre-processed with the multi-scale Retinex image enhancement method before the Faster R-CNN model is trained; 2) To address inaccurate bounding-box localization, the Soft-NMS algorithm is used in place of Greedy-NMS to improve detection accuracy; 3) To verify the performance of the proposed algorithm, it is trained and evaluated on an annotated data set of low-light pedestrian images.
2. Algorithm Principle
2.1. Multi-Scale Retinex Image Enhancement Algorithm (MSR)
Illumination is an important factor in pedestrian detection performance because it directly affects feature extraction, and detection results differ under different illumination conditions. In particular, for relatively small pedestrians in low light, the feature map extracted by the RoI pooling layer of the Faster R-CNN algorithm has a relatively low resolution. This paper therefore pre-processes the samples in the data set with a multi-scale Retinex image enhancement method so that pedestrian features become easier to distinguish, thereby improving the accuracy of pedestrian detection.
Retinex theory aims to reduce or remove the influence of incident light on the image and to recover the reflection characteristics of the objects in the scene, retaining only the information in the original image that reflects the essential features of the objects, and thereby enhancing the image. Retinex theory can be expressed mathematically as:
$$I(x, y) = L(x, y) \cdot R(x, y) \tag{1}$$
where $I(x, y)$ represents the initial image captured by the camera, $L(x, y)$ represents the illumination component of the incident light in the scene, and $R(x, y)$ represents the reflection component, which carries the essential information of the image.
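The human eye's perception of brightness is logarithmically, and therefore nonlinearly, related to the actual brightness of the acquired image. Taking the logarithm of formula (1) separates the reflection component $R(x, y)$ from the illumination component $L(x, y)$ of the captured image $I(x, y)$ [9]:
$$\log R(x, y) = \log I(x, y) - \log L(x, y) \tag{2}$$
Estimating the illumination component by smoothing the image with a surround function $F(x, y)$ gives:
$$\log R(x, y) = \log I(x, y) - \log\left[ F(x, y) * I(x, y) \right] \tag{3}$$
where $*$ represents a convolution operation and $F(x, y)$ is the Gaussian surround function:
$$F(x, y) = \lambda \exp\left( -\frac{x^2 + y^2}{c^2} \right) \tag{4}$$
In formula (4), $c$ represents the Gaussian scale, and $\lambda$ is a normalization factor chosen to satisfy $\iint F(x, y)\,dx\,dy = 1$.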
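As a minimal sketch of the Gaussian surround function in formula (4), the following NumPy code samples the function on a discrete grid and picks the normalization factor so that the samples sum to 1, which is the discrete counterpart of the unit-integral condition. The function name `gaussian_surround`, the `radius` parameter, and the example values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_surround(c, radius):
    """Sample F(x, y) = lam * exp(-(x^2 + y^2) / c^2) on a (2*radius+1)^2 grid."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    unnormalized = np.exp(-(x ** 2 + y ** 2) / float(c) ** 2)
    # Discrete counterpart of the unit-integral condition: samples sum to 1.
    lam = 1.0 / unnormalized.sum()
    return lam * unnormalized, lam

# Hypothetical scale and support, chosen only for illustration.
kernel, lam = gaussian_surround(c=2.0, radius=6)
print(kernel.shape, float(kernel.sum()), lam)
```

A larger scale spreads the weight over a wider neighborhood, so the illumination estimate becomes smoother.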
To preserve rich image feature information with low color distortion, this paper uses the multi-scale Retinex algorithm (MSR) [10], which can be expressed as:
$$R_v(x, y) = \sum_{j=1}^{J} w_j \left\{ \log I_v(x, y) - \log\left[ F_j(x, y) * I_v(x, y) \right] \right\} \tag{5}$$
where $v$ indexes the color channels (this paper uses the three channels R, G, and B of a color image), $I_v(x, y)$ is the $v$-th channel of the input image, $R_v(x, y)$ represents the processing result for channel $v$, $J$ represents the number of scales, that is, the number of Gaussian surround functions $F_j(x, y)$, and $w_j$ represents the weight corresponding to the $j$-th scale. In general, $J$ is set to 3, because a larger value increases the execution time of the algorithm without significantly improving the result. Finally, the pixel values of the reflection image of each channel obtained by the MSR algorithm are normalized to the range 0 to 255 using formula (6):
$$R'_v(x, y) = 255 \cdot \frac{R_v(x, y) - R_{\min}}{R_{\max} - R_{\min}} \tag{6}$$
where $R_{\max}$ and $R_{\min}$ represent the maximum and minimum values of the reflection information for each channel.
The processing flow of the MSR algorithm is shown in Figure 1.
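The MSR pipeline described above (log of the image minus log of its Gaussian-smoothed version, summed over scales as in formula (5), then stretched to 0 to 255 as in formula (6)) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the scales `(2.0, 5.0, 10.0)`, the equal weights, the `+ 1.0` offset that avoids `log(0)`, the edge padding, and the kernel truncation at three times the scale are all choices made here because the paper does not specify them.

```python
import numpy as np

def gaussian_blur(channel, c):
    """Smooth one channel with exp(-(x^2 + y^2) / c^2), normalized to sum to 1.

    The 2-D Gaussian is separable, so two 1-D passes are used.
    """
    radius = max(1, int(3 * c))  # assumed truncation of the kernel support
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (c ** 2))
    k /= k.sum()
    padded = np.pad(channel, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="valid"), 0, rows)

def multi_scale_retinex(image, scales=(2.0, 5.0, 10.0), weights=None):
    """Apply MSR per color channel and rescale each channel to 0..255."""
    image = image.astype(np.float64) + 1.0  # assumed offset to avoid log(0)
    if weights is None:
        weights = [1.0 / len(scales)] * len(scales)  # assumed equal weights
    out = np.zeros_like(image)
    for v in range(image.shape[2]):
        channel = image[:, :, v]
        for w, c in zip(weights, scales):
            out[:, :, v] += w * (np.log(channel) - np.log(gaussian_blur(channel, c)))
        # Per-channel min-max normalization to the range 0..255.
        r_min, r_max = out[:, :, v].min(), out[:, :, v].max()
        out[:, :, v] = 255.0 * (out[:, :, v] - r_min) / (r_max - r_min + 1e-12)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# Synthetic dark image standing in for a low-light sample.
rng = np.random.default_rng(0)
dark = rng.integers(0, 40, size=(24, 24, 3)).astype(np.uint8)
enhanced = multi_scale_retinex(dark)
print(dark.mean(), enhanced.mean())
```

For full-size images the scales used in practice are usually much larger than in this toy example, and a library Gaussian filter would be considerably faster than the pure NumPy convolution used here.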
2.2. Faster R-CNN
The main process by which the Faster R-CNN algorithm detects pedestrians is as follows: the network takes an image as input, generates a series of proposals through the Region Proposal Network (RPN), and then sends the image and the proposals together to Fast R-CNN, which outputs the final pedestrian detection results. The detection process of Faster R-CNN is shown in Figure 2.
The algorithm consists of two major modules:
1) Region Proposal Network
RPN is a fully convolutional network consisting of convolutional layers, an intermediate layer, a classification layer, and a regression layer. The convolutional layers are shared with Fast R-CNN, and the input of the intermediate layer is fully connected to a $3 \times 3$ region of the feature map of the last convolutional layer. The network slides a $3 \times 3$ window over the convolutional feature map, encoding each location as a low-dimensional feature vector. This paper uses the VGG-16 network structure, for which this vector is 512-dimensional, as shown in Table 1. Each window position corresponds to $k$ anchors of different scales and aspect ratios, sampled simultaneously. In this paper, $k = 9$. The outputs of the network are a classification layer and a regression layer, which give the class score of each image region and the position correction of its bounding box.
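To make the role of the k = 9 anchors concrete, the sketch below generates anchors for one sliding-window position and then replicates them over a feature map. The paper only states k = 9; the three scales, the three aspect ratios, and the feature stride of 16 used here are the defaults commonly associated with Faster R-CNN on VGG-16 and are assumptions, as are the function names.

```python
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors as (x1, y1, x2, y2) around one window."""
    center = (base_size - 1) / 2.0
    anchors = []
    for ratio in ratios:  # ratio = height / width
        for scale in scales:
            area = float(base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([center - w / 2, center - h / 2,
                            center + w / 2, center + h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_h, feat_w, stride=16):
    """Place the base anchors at every position of a feat_h x feat_w feature map."""
    shift_x, shift_y = np.meshgrid(np.arange(feat_w) * stride,
                                   np.arange(feat_h) * stride)
    shifts = np.stack([shift_x.ravel(), shift_y.ravel(),
                       shift_x.ravel(), shift_y.ravel()], axis=1)
    return (shifts[:, None, :] + anchors[None, :, :]).reshape(-1, 4)

base = generate_anchors()
all_anchors = shift_anchors(base, feat_h=2, feat_w=3)
print(base.shape, all_anchors.shape)
```

Each of these anchors receives an objectness score from the classification layer and four offsets from the regression layer.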
Figure 2. Faster R-CNN detection flow chart.
Table 1. VGG-16 network structure table.
2) Fast R-CNN
The proposals output by the RPN are used as the input of Fast R-CNN. The RoI pooling layer uses each proposal window to extract the corresponding features from the feature map and sends them to the subsequent fully connected layers and softmax for classification. After this classification and a second round of bounding-box correction, the final detection result is obtained.
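The RoI pooling step can be illustrated with a small NumPy sketch: the proposal window is cut out of the feature map, divided into a fixed grid of bins, and each bin is max-pooled, so that proposals of any size yield a feature of the same shape. The function name, the inclusive feature-map coordinates, and the default 7 x 7 output size (the value usually paired with VGG-16) are assumptions, not details given in the paper.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool the region roi = (x1, y1, x2, y2) of an (H, W, C) feature map.

    Coordinates are inclusive and given in feature-map cells.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2 + 1, x1:x2 + 1, :]
    h, w, c = region.shape
    out_h, out_w = output_size
    ys = np.linspace(0, h, out_h + 1)  # bin edges along the height
    xs = np.linspace(0, w, out_w + 1)  # bin edges along the width
    pooled = np.zeros((out_h, out_w, c), dtype=feature_map.dtype)
    for i in range(out_h):
        y_lo = int(np.floor(ys[i]))
        y_hi = max(int(np.ceil(ys[i + 1])), y_lo + 1)  # at least one row per bin
        for j in range(out_w):
            x_lo = int(np.floor(xs[j]))
            x_hi = max(int(np.ceil(xs[j + 1])), x_lo + 1)  # at least one column per bin
            pooled[i, j] = region[y_lo:y_hi, x_lo:x_hi].max(axis=(0, 1))
    return pooled

features = np.arange(64, dtype=np.float64).reshape(8, 8, 1)
print(roi_max_pool(features, (0, 0, 7, 7), output_size=(2, 2))[:, :, 0])
```

When the proposal covers only a few feature-map cells, as with small pedestrians, several output bins read from the same cell, which is one way to see why such features carry little distinguishing information.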
2.3. Non-Maximum Suppression (NMS)
An initial detection pass usually finds multiple candidate targets in an image, and because the image is traversed at several scales, a single target is often covered by several bounding boxes (BB). It is therefore necessary to suppress the extra bounding boxes and find the best detection location. Non-maximum suppression is thus a post-processing step for pedestrian detection whose purpose is to remove redundant bounding boxes and retain the best one. However, the biggest problem with the traditional NMS algorithm is that it forces the scores of adjacent bounding boxes to zero. As a result, if another pedestrian appears in the overlapping area, that pedestrian will be missed, which reduces the average precision (AP).
For this problem with NMS, we introduce the Soft-NMS [11] algorithm to suppress redundant information in the bounding box.
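The traditional NMS algorithm and Soft-NMS rescore each bounding box $b_i$ according to formulas (7) and (8) respectively, where $M$ is the bounding box with the current highest score, $s_i$ is the detection score of $b_i$, and $N_t$ is the overlap threshold:
$$s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t \end{cases} \tag{7}$$
$$s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t \\ s_i \left( 1 - \mathrm{IoU}(M, b_i) \right), & \mathrm{IoU}(M, b_i) \ge N_t \end{cases} \tag{8}$$
In formula (7), when the IoU is less than the threshold $N_t$, the detection score remains $s_i$; when the IoU is greater than or equal to $N_t$, the score is set to 0. This process is applied recursively to the remaining bounding boxes. Soft-NMS attenuates the detection score of a non-maximum bounding box instead of removing it completely: once the IoU reaches the threshold $N_t$, the score $s_i$ becomes $s_i \left( 1 - \mathrm{IoU}(M, b_i) \right)$. This is a simple change to the traditional NMS algorithm that requires no additional parameters, and it can improve the detection accuracy by about 1.2% without reducing the detection speed.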
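The difference between the two rescoring rules can be sketched in a few lines of NumPy. The function below implements both the hard rule (scores of overlapping boxes set to 0) and the linear Soft-NMS decay (scores multiplied by 1 minus the IoU). The names `iou` and `soft_nms`, the default threshold of 0.3, and the small cut-off `score_min` used to drop boxes whose score has decayed to almost nothing are assumptions introduced for this illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, n_t=0.3, score_min=0.001, soft=True):
    """Return kept indices (in selection order) and the rescored scores.

    soft=False reproduces traditional NMS; soft=True applies the linear decay.
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64).copy()
    remaining = [i for i in range(len(scores)) if scores[i] >= score_min]
    keep = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])  # current highest-scoring box M
        remaining.remove(m)
        keep.append(m)
        if not remaining:
            break
        rest = np.array(remaining)
        overlaps = iou(boxes[m], boxes[rest])
        over = overlaps >= n_t
        if soft:
            scores[rest[over]] *= 1.0 - overlaps[over]  # attenuate instead of removing
        else:
            scores[rest[over]] = 0.0                    # traditional NMS
        remaining = [i for i in remaining if scores[i] >= score_min]
    return keep, scores

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(soft_nms(boxes, scores, soft=False)[0])
print(soft_nms(boxes, scores, soft=True)[0])
```

In this toy example the second box overlaps the first heavily. Traditional NMS discards it outright, while Soft-NMS keeps it with a reduced score, which is what allows a second pedestrian standing close to the first to survive post-processing.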
3. Experiments
3.1. Data Set
In the experiments, this paper uses images collected under low light as the data set. The data set covers a variety of scenes, including low-light images of pedestrians in subway stations, on roads, and in shopping malls. It contains 8000 images, of which 6400 are randomly selected as the training set and the remaining 1600 are used as the test set. The images are named according to a uniform format, the ground-truth bounding boxes in each image are marked with a program written in Python, and the bounding-box coordinates of each label are saved to an .xml file. An example of the experimental data set is shown in Figure 3.
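The data preparation steps can be sketched with the Python standard library: a random 6400/1600 split of 8000 image names, and an .xml annotation holding the bounding-box coordinates. The paper does not give the file naming format or the XML schema, so the zero-padded names, the Pascal VOC-like tag names, and both function names below are hypothetical.

```python
import random
import xml.etree.ElementTree as ET

def split_dataset(image_names, n_train=6400, seed=0):
    """Randomly split image names into a training list and a test list."""
    names = list(image_names)
    random.Random(seed).shuffle(names)
    return names[:n_train], names[n_train:]

def annotation_xml(filename, boxes, label="pedestrian"):
    """Serialize ground-truth boxes (xmin, ymin, xmax, ymax) as an XML string.

    The tag layout is an assumed Pascal VOC-like schema.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    for box in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), box):
            ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical naming format: six-digit, zero-padded indices.
names = [f"{i:06d}.jpg" for i in range(8000)]
train, test = split_dataset(names)
print(len(train), len(test))
print(annotation_xml(names[0], [(10, 20, 50, 120)]))
```

Fixing the random seed keeps the split reproducible across runs, which matters when two models are compared on the same test set.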
3.2. Experimental Setup and Evaluation Criteria
This paper builds on the Faster R-CNN model, a state-of-the-art detection method, implemented in the popular deep learning framework TensorFlow. Both the RPN and Fast R-CNN are initialized with a VGG-16 network pre-trained on ImageNet, and the system is then trained on the data set. In the first training stage, the earliest layers usually extract very similar pixel-level features and need no adjustment, so only conv3_1 and higher layers of the VGG-16 network are trained. In the second stage, only conv5_3 and higher layers in the RPN and the fully connected layers in Fast R-CNN are tuned. The system is trained using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. The layer parameters are updated with an initial learning rate of 0.001. After 50,000 iterations the learning rate is lowered to 1/10 of its current value, and the total number of iterations is 100,000.
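The optimization settings above can be written out as a short NumPy sketch: a step learning-rate schedule (0.001, divided by 10 after 50,000 iterations) and one SGD update with momentum 0.9 and weight decay 0.0005. The exact update rule differs slightly between frameworks, so the formulation below (weight decay added to the gradient, velocity scaled by the learning rate) is an assumption rather than the rule used in the paper's TensorFlow implementation, and the function names are illustrative.

```python
import numpy as np

def learning_rate(iteration, base_lr=0.001, step=50000, gamma=0.1):
    """Step schedule: base_lr before `step` iterations, base_lr * gamma afterwards."""
    return base_lr * (gamma if iteration >= step else 1.0)

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum; weight decay is applied as an L2 gradient term."""
    velocity = momentum * velocity - lr * (grad + weight_decay * w)
    return w + velocity, velocity

# Toy parameters and gradient, only to show the call pattern.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for it in (0, 49999, 50000, 99999):
    w, v = sgd_momentum_step(w, grad=np.array([0.5, -0.5]), velocity=v,
                             lr=learning_rate(it))
    print(it, learning_rate(it), w)
```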
We refer to the model obtained by this training as model 1. We then continue training on the sample set processed by the MSR image enhancement algorithm to obtain model 2, and compare models 1 and 2 on the pedestrian test set. In addition, we introduce Soft-NMS into the model and compare it with the traditional NMS algorithm. The experimental flow chart is shown in Figure 4.
3.3. Experimental Results and Analysis
Figure 5 compares images before and after processing with the multi-scale Retinex image enhancement algorithm.
It can be observed from the figure that the pedestrian features in the image processed by the MSR algorithm are more prominent, and the picture quality is significantly improved in both brightness and contrast.
Experiments were carried out to measure the improvement of Soft-NMS over NMS under different overlap thresholds. As the overlap threshold and the recall increase, Soft-NMS yields a larger improvement in precision, as shown in Figure 6.
Finally, our improved model achieves an average accuracy of 89.74% in pedestrian detection under low light, as shown in Table 2, which meets the expected performance on the test set. Compared with the original Faster R-CNN, the improved algorithm raises accuracy by 1.5%. This is because pedestrian features are easier to extract once the MSR image enhancement algorithm is applied before the Faster R-CNN network. In addition, the introduction of the Soft-NMS algorithm effectively improves the detection rate of overlapping pedestrians and makes the pedestrian positions more accurate. Some examples of pedestrian detection using the proposed method are shown in Figure 7.
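For readers who want to reproduce a precision-recall analysis of this kind, the sketch below computes precision, recall, and average precision (AP) from a ranked list of detections. The paper does not state which evaluation protocol produced its figures, so the all-point interpolated AP used here, the function name, and the toy inputs are assumptions; each detection is taken to have already been marked as a true or false positive by IoU matching against the ground truth.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """All-point interpolated AP from detection scores and true-positive flags."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))  # rank by descending score
    tp = np.asarray(is_true_positive, dtype=np.float64)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    # Replace each precision by the best precision at any equal or higher recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Sum precision over the recall increments.
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))

# Toy example: three detections, two ground-truth pedestrians.
print(average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_ground_truth=2))
```

Because Soft-NMS keeps overlapping boxes with lowered scores rather than deleting them, it mainly affects the high-recall end of such a curve.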
Figure 6. Precision vs. Recall at multiple overlap thresholds (Ot).
Table 2. Accuracy of different methods.
Figure 7. Examples of pedestrian detection.
4. Conclusion
The improved model presented in this paper has a higher accuracy than the original Faster R-CNN model. This is because, after the image is enhanced by the multi-scale Retinex algorithm, its contrast is improved, the difference between the target and the background is more obvious, and the target outline is clearer, so the network can better extract pedestrian features. In addition, adding the Soft-NMS algorithm to Faster R-CNN yields higher accuracy for pedestrians that overlap heavily or appear at small scale. The results show that the model detects pedestrians more effectively, but the detection speed still needs to be improved, and this is the main direction of our future research.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.