Application of Dual-Energy X-Ray Image Detection of Dangerous Goods Based on YOLOv7

X-ray security equipment is currently one of the most commonly used tools for detecting dangerous goods. As the security workload grows, using target detection technology to assist security personnel has become an inevitable trend. With the development of deep learning, object detection technology is becoming increasingly mature, and object detection frameworks based on convolutional neural networks have been widely used in industrial, medical and military fields. To improve the efficiency of security staff and reduce the risk of missed detection of dangerous goods, this paper takes data collected by X-ray security equipment and inserts dangerous goods into empty packages, which balances the data across dangerous goods categories and expands the data set. The high- and low-energy images are then combined using a high-low energy feature fusion method. Finally, a dangerous goods detection model based on YOLOv7 is trained. With these methods, the detection accuracy is improved by 6% compared with training directly on the original data set, and the speed is 93 FPS, which meets the requirements of an online security system, greatly improves the work efficiency of security personnel, and reduces the security risks caused by missed detection.


Introduction
In order to protect people's personal and property safety when taking public transport, safety inspection has become a necessary means to ensure the safety of ment angle and the uncertainty of the types of dangerous goods, dangerous goods may be ignored due to shielding, which is not conducive to the judgment of security personnel. Coupled with the high-intensity work of security inspectors and other factors, missed and false detections often occur, causing security risks or reducing work efficiency. In addition, due to the high mobility of China's population and the large passenger flow, the demands placed on security inspection technology are also increasing.
With the development of deep learning, object detection technology based on convolutional neural networks has been widely used in the industrial field. Target detection is a further development of classification: it not only predicts the target category, but also gives the location and confidence of the target. At present, the popular object detection frameworks are based on convolutional neural networks, which are mainly composed of convolutional layers, pooling layers and fully connected layers. The convolutional neural network reduces the complexity and training difficulty of the network model through three strategies: local receptive fields, weight sharing and downsampling. It is fairly robust to small affine transformations of the image, such as translation, scaling and rotation, and has a strong feature extraction ability. Current target detection technologies are mainly divided into two categories: two-stage and one-stage. Two-stage detectors mainly include R-CNN [1], Fast R-CNN [2] and Faster R-CNN [3]. They use heuristic methods or convolutional neural networks to generate candidate boxes, and then perform classification and regression on those boxes. One-stage detectors mainly include SSD [4] and the YOLO [5] series, which extract features and predict the location and category of the target directly in a single pass through the network. Two-stage detection requires multiple processing steps and a large amount of calculation: its precision is high but its speed is low, so it cannot meet the real-time requirements of the security inspection system. One-stage detection is faster, at some cost in detection accuracy.
In order to meet the speed requirements of the security inspection system, this paper uses a one-stage target detection framework based on YOLOv7 [6]. At the same time, to meet the accuracy requirements, this paper expands the dangerous goods data set using Threat Image Projection (TIP) technology [7]. In addition, a high- and low-energy image fusion method is developed to enhance the images and make full use of the two energy X-ray images produced by the dual-energy X-ray security detector. The experimental results show that the approach achieves satisfactory results on the data set collected by Shanghai Wuying Technology Co., Ltd. and meets the real-time requirements of the security check system, so it can be integrated into such a system.

Dangerous Goods Detection Algorithm in Traditional X-Ray Images
Before target detection network models based on convolutional neural networks were widely used, the research on X-ray dangerous goods target detection is
Wang et al. [9] proposed a detection method based on the Scale Invariant Feature Transform (SIFT) [10] and the Implicit Shape Model (ISM). The SIFT algorithm is used to extract the key points of the target, and the ISM model of the target is constructed. In the detection process, the extracted SIFT descriptors of the target are matched with the visual descriptors in the ISM model, and a voting mechanism is used to determine whether the target is a dangerous item.

X-Ray Image Dangerous Goods Detection Method Based on Deep Learning
Since the Convolutional Neural Network (CNN) was proposed, it has been widely used in various fields and has become an indispensable part of object detection models in computer vision. Krizhevsky et al. [11] proposed an image classification method based on convolutional neural networks, which achieved record-breaking results in image classification at the time and finished far ahead of second place in the ILSVRC-2012 competition. This network was a breakthrough for methods based on convolutional neural networks in the field of computer vision.
Akcay [12] first applied convolutional neural networks to the field of X-ray images, and discussed the applicability and effectiveness of the traditional sliding-window-based convolutional neural network detection pipeline and of region-based object detection technology for object detection in X-ray security images. Based on this, Akcay et al. [13] proposed the use of deep convolutional neural networks and transfer learning [14] to solve the problem of image classi-
Lu et al. [15] proposed a detection algorithm for dangerous goods in security check packages based on an improved YOLOv3 [16], which reduced the original prediction of three bounding boxes per grid cell in YOLOv3 to two bounding boxes. K-means clustering was used to calculate the prior boxes from the data set, and data enhancement and a multi-scale input training strategy were adopted. The detection speed and accuracy were improved to some extent.
Wu et al. [17] proposed an improved detection method for dangerous goods in X-ray security images that combines atrous convolution and transfer learning with YOLOv4 [18]. By increasing the receptive field, multi-scale context information is aggregated; the initial candidate boxes are obtained by the K-means clustering algorithm, and the learning rate is scheduled by cosine annealing to accelerate model training. The method effectively reduces the false detection rate of dangerous goods and improves the detection of small targets.
Liu et al. [19] proposed an object detection method for X-ray images. First, a color-based foreground-background segmentation method was used to outline the object to be detected; then Faster R-CNN, an object detection framework based on deep convolutional neural networks, was applied, achieving a mAP of 77%.
Zhang et al. [20] proposed an improved SSD [4] algorithm and applied it to subway security inspection. The convolution operation on each scale feature in the SSD algorithm leaves the feature size unchanged; the corresponding features before and after convolution are fused in a lightweight network to generate a new pyramid feature layer, and a detection unit based on the residual module is added to avoid increasing the capacity and computational complexity of the network model. Their work aims to solve the problem of missed detections and low detection accuracy for small targets.
Han [21] proposed a detection and tracking algorithm for dangerous goods in X-ray images based on deep learning. On the one hand, a deep learning detection network based on an improved single-shot multi-box detection method is designed to improve the detection accuracy. On the other hand, a tracker driven by the detection results is implemented, and real-time detection and tracking are achieved through the cooperation between the tracker and the detector.
In various studies, the object detection algorithm based on YOLO [5] can reduce the amount of calculation and speed up the training of the model under the condition of ensuring accuracy. Because the X-ray security inspection system

Method
The experiment is mainly divided into the following steps: 1) performing Threat Image Projection on empty packages, that is, inserting separately collected dangerous goods into empty packages to create synthetic dangerous goods data; 2) performing high- and low-energy image fusion, combining the high- and low-energy images into one image; 3) training the YOLOv7 object detection model and using it for prediction.

Threat Image Projection
Collecting data on parcels containing dangerous goods is time-consuming and labor-intensive. The dangerous goods in each collected parcel are placed by hand, and cannot cover all the positions and angles at which dangerous goods appear in real life. To reduce the labor and time cost of collecting such parcel data, and to increase the complexity and variability of dangerous goods placement, this paper uses the Threat Image Projection (TIP) [7] method, which can insert different dangerous goods into different luggage packages at various angles and positions. While expanding the training data set, the number of TIP insertions can therefore be increased for dangerous goods categories with fewer samples, which addresses sample imbalance. After TIP insertion, the number of samples across dangerous goods categories is balanced.
TIP is a baggage screening technique used to train security agents and automatic threat recognition algorithms. Each dangerous item is placed alone in a tray for acquisition, and the acquired image is processed with a simple threshold to separate a clean image of the item from the background. An affine transformation with random rotation or scaling is then applied to the dangerous goods image, so that dangerous goods at different angles and of different sizes are covered. Thresholding and morphological operations are applied to the collected baggage image A, which contains no dangerous goods, to obtain the image of the baggage area B. The coordinates of the minimum bounding rectangle of the dangerous goods image are then checked against the baggage area to limit the insertion range and ensure that the inserted dangerous goods lie entirely within the baggage area. At the selected valid position M(i, j), the dangerous goods image D is superimposed on the baggage image B to generate a composite image C.
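The placement constraint described above can be illustrated with a short sketch. It assumes that the baggage area B is available as a boolean mask and that the height and width of the dangerous item's bounding rectangle are already known; the function name `valid_positions` and the integral-image search are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def valid_positions(bag_mask: np.ndarray, h: int, w: int) -> np.ndarray:
    """Return the (row, col) top-left corners at which an h-by-w rectangle
    lies entirely inside the baggage area (True pixels of bag_mask).

    Hypothetical helper: names and approach are assumptions for illustration.
    """
    H, W = bag_mask.shape
    if h < 1 or w < 1 or h > H or w > W:
        return np.empty((0, 2), dtype=np.int64)
    # Integral image: integral[r, c] = number of True pixels in mask[:r, :c].
    integral = np.zeros((H + 1, W + 1), dtype=np.int64)
    integral[1:, 1:] = bag_mask.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    # Number of baggage pixels inside every candidate h-by-w window.
    counts = (integral[h:, w:] - integral[:-h, w:]
              - integral[h:, :-w] + integral[:-h, :-w])
    # A window is valid only if all of its h*w pixels belong to the baggage.
    return np.argwhere(counts == h * w)

# Example: a 3x3 baggage region inside a 5x5 image.
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True
candidates = valid_positions(mask, 2, 2)
```

One of the returned corners can then be drawn at random (for example with `numpy.random.default_rng().choice`) to obtain the insertion position.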
To ensure the reliability of the synthesized TIP image C, two parameters are introduced in image fusion. The parameter α controls the transparency of the original image A (α = 0.9). The other parameter is the dangerous goods pixel threshold T, which ensures that the original image and the target image B are consistent in image contrast. The purpose of T is to remove high-value pixels from the inserted dangerous goods. T is calculated by Equation (1) from x, the normalized average intensity of the insertion area in B. In the image composition formula, A(i, j) represents the pixel value at row i and column j of image A, and the other terms are defined in the same way. Since the T value calculated by Equation (1) lies in the range 0.5 to 0.95, any pixel in image A higher than 255 × T is ignored during image synthesis.
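The role of α and T can be illustrated with a small sketch. Neither Equation (1) nor the composition formula is reproduced in this text, so both functions below are assumptions: the threshold uses the form min(exp(x^5) − 0.5, 0.95), chosen only because it yields the stated 0.5 to 0.95 range, and the blend is a multiplicative attenuation mixed with the baggage image by α. The threshold is applied here to the inserted dangerous goods patch, which is one reading of the description.

```python
import numpy as np

def threat_threshold(x: float) -> float:
    """ASSUMED form of the pixel threshold T (not taken from the paper).

    x is the normalized mean intensity (0..1) of the insertion region;
    the result always lies in [0.5, 0.95].
    """
    return float(min(np.exp(x ** 5) - 0.5, 0.95))

def tip_compose(bag: np.ndarray, threat: np.ndarray, top: int, left: int,
                alpha: float = 0.9) -> np.ndarray:
    """Insert a grayscale threat patch into a grayscale bag image (uint8)."""
    h, w = threat.shape
    if top < 0 or left < 0 or top + h > bag.shape[0] or left + w > bag.shape[1]:
        raise ValueError("threat patch does not fit inside the bag image")
    out = bag.astype(np.float64)               # astype returns a copy
    region = out[top:top + h, left:left + w]   # view into `out`
    t = threat_threshold(region.mean() / 255.0)
    d = threat.astype(np.float64)
    keep = d <= 255.0 * t                      # skip bright background pixels
    # ASSUMED blend: multiplicative attenuation, mixed with the bag by alpha.
    attenuated = region * d / 255.0
    region[keep] = alpha * attenuated[keep] + (1.0 - alpha) * region[keep]
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# Example: a 2x2 patch with two dark (object) and two white (background) pixels.
bag = np.full((6, 6), 200, dtype=np.uint8)
patch = np.array([[50, 255], [255, 50]], dtype=np.uint8)
composite = tip_compose(bag, patch, top=2, left=2)
```

In this sketch the white background pixels of the patch leave the baggage image untouched, while the dark object pixels darken it, which is the qualitative behaviour the two parameters are meant to provide.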
Figure 1 shows 1) genuinely collected dangerous goods package data and 2) a realistic composite image containing dangerous goods, generated by using the TIP method to insert separately collected dangerous goods into a package that contains none. In this method, dangerous goods are randomly inserted into the image at several angles, which offsets the high cost of collecting dangerous goods data, especially for items that are hard to obtain (such as drugs and explosives).

High-and Low-Energy Image Fusion
In this paper, a method of high- and low-energy image fusion is proposed to enhance the fine detail of X-ray images, make their outlines clearer, and help the network model learn the characteristics of dangerous goods in X-ray images. The dual-energy X-ray image is composed of a high-energy image and a low-energy image. The ray first strikes the low-energy detector, which obtains the low-energy signal value. A copper sheet is used between the low-energy detector and the high-energy detector to filter out
The value of each pixel after the fusion of the high- and low-energy images is calculated from the following quantities: Low is the low-energy image shown in Figure 2(a), Max(Low) is the maximum value in the low-energy image, and factor is the ratio of the current pixel value to the maximum value in the low-energy image. High denotes the high-energy image shown in Figure 2(b), i denotes the ith row of the image, and j denotes the jth column of the image. The gray value of the low-energy image is small and its penetration is poor, so its gray value is assigned a large weight coefficient. The high-energy image has a large gray value and good penetration, so its gray value is assigned a small weight coefficient.
Using the above formula, the high- and low-energy images are fused to obtain an image with richer feature information and clearer contours.
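A per-pixel weighting of this kind can be sketched as follows. Only the definition factor = Low(i, j) / Max(Low) comes from the description; the fusion formula itself is not reproduced in this text, so the convex combination used below (weight 1 − factor on the low-energy pixel and factor on the high-energy pixel) is an assumption chosen for illustration, and the function name `fuse_high_low` is hypothetical.

```python
import numpy as np

def fuse_high_low(low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Fuse a low-energy and a high-energy image pixel by pixel.

    factor follows the text: factor[i, j] = low[i, j] / max(low).
    The combination rule is an ASSUMPTION, not the paper's formula.
    """
    if low.shape != high.shape:
        raise ValueError("low and high images must have the same shape")
    low_f = low.astype(np.float64)
    high_f = high.astype(np.float64)
    max_low = low_f.max()
    # Guard against an all-zero low-energy image.
    factor = low_f / max_low if max_low > 0 else np.zeros_like(low_f)
    # Dark low-energy pixels (small factor) keep mostly the low-energy value;
    # bright pixels (factor near 1) take mostly the high-energy value.
    return (1.0 - factor) * low_f + factor * high_f

# Example with small synthetic images.
low_img = np.array([[0, 50], [100, 100]], dtype=np.uint16)
high_img = np.array([[10, 200], [150, 300]], dtype=np.uint16)
fused = fuse_high_low(low_img, high_img)
```

The result is kept in floating point so that the subsequent dynamic-range conversion can decide how to quantize it.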
Since the flat panel detector acquires X-ray images with a high dynamic range, a tone mapping algorithm based on a multi-scale local edge-preserving (LEP) filter was used to convert the high-low-energy fusion image into a low-dynamic-range image [22], so that it can be shown on a common display device. The high-low-energy fusion image was passed through the LEP filter to obtain the base layer image representing the approximate information. Then the gray value of the corresponding position of the base layer image is different, and the

Object Detection Network Model Structure
The dangerous goods package data set is obtained through the above two methods, and the YOLOv7 [6] model is used for training. YOLOv7 is a one-stage target detection algorithm that is reported to outperform previous versions in both accuracy and speed [6]. Since X-ray dangerous goods detection must consider both accuracy and speed, YOLOv7 is selected as the detection model for this dangerous goods data set. YOLOv7 is mainly composed of four parts: input, backbone, neck, and head, which are introduced separately below.
After TIP dangerous goods insertion and high- and low-energy image fusion, the data set is input into the network model. The data collected by X-ray security equipment has several characteristics: 1) the size of the image is fixed, and the size of each item does not change across repeated acquisitions; 2) the whole X-ray image is grayscale, without color and texture information, so only the internal structure information of the material is retained; 3) unlike optical images, X-ray images are not affected by illumination, so there is no significant color change in the image. Therefore, the input stage applies Mosaic data enhancement, adaptive image scaling and other preprocessing: four X-ray images are randomly scaled and then randomly spliced together, which greatly enriches the detection data set; in particular, the random scaling adds many small targets. This lets the model learn potentially valuable information, improves its generalization ability so that it can handle more complex application scenarios, and makes the network more robust.
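The splicing step of Mosaic enhancement can be sketched in a few lines. This is a simplified illustration for grayscale images only: the scale range, the nearest-neighbour resize, and the function names are assumptions, and the bounding boxes, which a real training pipeline must scale and shift along with the pixels, are omitted.

```python
import numpy as np

def resize_nearest(img: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Nearest-neighbour resize of a 2-D image using index arithmetic."""
    rows = np.arange(new_h) * img.shape[0] // new_h
    cols = np.arange(new_w) * img.shape[1] // new_w
    return img[rows[:, None], cols]

def mosaic(images, size=640, rng=None):
    """Splice four randomly scaled grayscale images into one size x size image."""
    if len(images) != 4:
        raise ValueError("mosaic needs exactly four images")
    rng = rng if rng is not None else np.random.default_rng()
    canvas = np.zeros((size, size), dtype=images[0].dtype)
    # Random split point that divides the canvas into four cells.
    cy = int(rng.integers(size // 4, 3 * size // 4))
    cx = int(rng.integers(size // 4, 3 * size // 4))
    cells = [(0, cy, 0, cx), (0, cy, cx, size),
             (cy, size, 0, cx), (cy, size, cx, size)]
    for img, (y0, y1, x0, x1) in zip(images, cells):
        scale = rng.uniform(0.5, 1.5)          # assumed scale range
        h = max(1, int(img.shape[0] * scale))
        w = max(1, int(img.shape[1] * scale))
        scaled = resize_nearest(img, h, w)
        # Crop the scaled image to the cell; smaller images leave zero padding.
        ch, cw = min(h, y1 - y0), min(w, x1 - x0)
        canvas[y0:y0 + ch, x0:x0 + cw] = scaled[:ch, :cw]
    return canvas

# Example: four constant images so that each cell is easy to recognise.
tiles = [np.full((100, 100), v, dtype=np.uint8) for v in (1, 2, 3, 4)]
spliced = mosaic(tiles, size=64, rng=np.random.default_rng(0))
```

Scaling factors below 1 shrink the objects in a tile, which is how this kind of augmentation adds small targets to the training data.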
The Backbone part is shown in Figure 3. After preprocessing, the data is sent

Experimental Data
The data set was collected using ICT6040, a security inspection equipment in-

Experimental Parameter Setting
Given the hardware environment, the input images were uniformly scaled to 640 × 640, the training batch size was set to 8, the number of training epochs was set to 300, the initial learning rate was set to 0.001, and Adam was used as the optimizer.

Experimental Evaluation Metrics
• Evaluation Metrics
In the experiment, accuracy rate (precision), recall rate, F1-score, mAP0.5, mAP0.95 and FPS were used as evaluation indexes to evaluate the performance of the model. The accuracy rate and recall rate are computed as
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
where TP indicates a sample that is actually positive and is predicted to be positive; FP indicates a sample that is actually negative but is predicted to be positive; FN indicates a sample that is actually positive but is predicted to be negative; and FPS indicates the inference speed of the model in frames per second.
The F1-score, also known as the balanced F score, takes into account both accuracy and recall, and is the harmonic mean of the two. The calculation formula is as follows:
$$\text{F1} = \frac{2\,TP}{2\,TP + FP + FN}$$
• Evaluation Metrics Results Analysis
Figure 6 shows the accuracy curve, Figure 7 shows the recall rate curve, and Figure 8 shows the F1-score curve. It can be seen from the figures that when the confidence is greater than 0.4, the accuracy is close to 1; that is, the higher the confidence, the more likely it is that a sample predicted as positive in the test set is a real positive sample. When the confidence is less than 0.6, the recall rate is close to 1; that is, nearly all the real positive samples in the test set are detected. To measure the accuracy rate and recall rate together, the F1-score is introduced to balance the two. As shown in Figure 8, when the confidence is between 0.4 and 0.6, both the accuracy rate and the recall rate are high. The closer the F1-score is to 1, the better the model performs.
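The three count-based metrics can be computed directly from TP, FP and FN. The sketch below is a minimal illustration; the function name `detection_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions, and the counts in the example are made up for demonstration rather than taken from the experiments.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) from true/false positive and
    false negative counts. Empty denominators yield 0.0 by convention."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1

# Illustrative counts only: 90 correct detections, 10 false alarms, 30 misses.
precision, recall, f1 = detection_metrics(tp=90, fp=10, fn=30)
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the lower of the two, which is why it is a useful single number when choosing a confidence threshold.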

Comparison Experiment
Under the same operating environment and model initialization parameters, four groups of comparison experiments were carried out. As shown in Table 1, the first group trained YOLOv7 directly on the data set, and both mAP0.5 and mAP0.95 exceeded 90%.
The second group was trained after the TIP method was introduced. The results show that the number of model parameters increased and the mAP increased by 6.6%. The third group was trained after the high-low energy image fusion method was introduced, and the number of model parameters decreased compared with the TIP-only model. At a threshold of 0.5, the mAP was slightly lower than with TIP only, while the mAP averaged over thresholds from 0.5 to 0.95 in steps of 0.05 was higher than with TIP only; that is, this model performs better at stricter thresholds. The fourth group combined the two methods for training: the number of model parameters decreased significantly, and mAP0.5 and mAP0.95 increased by 0.06 and 0.057 respectively compared with training without the two methods. Although the number of images processed per second decreased, it still meets the requirements of a real-time detection system.
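The thresholds behind mAP0.5 and mAP0.95 can be made concrete with a short sketch. It assumes, as is conventional for this metric, that the threshold is applied to the intersection over union (IoU) between a predicted box and a ground-truth box, with boxes given as (x1, y1, x2, y2); the function names are hypothetical and the boxes in the example are made up.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_thresholds():
    """Thresholds from 0.5 to 0.95 in steps of 0.05 (ten values)."""
    return [round(0.5 + 0.05 * k, 2) for k in range(10)]

def hits_per_threshold(pred_box, true_box):
    """For each threshold, report whether the prediction counts as a match."""
    overlap = iou(pred_box, true_box)
    return {t: overlap >= t for t in iou_thresholds()}

# Illustrative boxes: the prediction covers 80% of the union with the truth.
hits = hits_per_threshold((0, 0, 10, 8), (0, 0, 10, 10))
```

A prediction that matches at 0.5 may fail at stricter thresholds, so a model with tighter boxes gains more on the averaged metric than on mAP0.5 alone.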

Test Section
Based on the above results, the model trained with YOLOv7 + TIP + Map was evaluated on the test set to assess its quality. The experiment used package data containing bullets, grenades, kitchen knives, vernier calipers, carving knives, screwdrivers, wrenches, and razors; the results are shown in Table 2.
Across all dangerous goods categories, the average precision at a threshold of 0.5 is above 0.99. The textures are clear and the feature information is easy to learn, which is likely because the collected data is simple and the backgrounds are not particularly complex. Among the categories, hand grenades, kitchen knives and vernier calipers are recognized best. The hand grenade has rich structural information, and its shape is highly consistent in X-ray images at any angle and position, so it is easy to identify.
Compared with other dangerous goods, the structures of the kitchen knife and vernier caliper are relatively simple, and their projected texture information is easier to learn. Where recognition accuracy is lower, the analysis attributes it to large changes in imaging morphology across angles. For example, the images of a knife's face and of its back are completely different, so the shape of a knife in the X-ray image varies greatly and the target is difficult for the model to identify. In such cases the recognition accuracy is not high, but the other categories are above 99%, and the detection accuracy meets the demand.

Prediction Section
Inference is performed using the best weights obtained from model training. As shown in Figure 9, even if the grenade overlaps with other items, it can still be accurately identified thanks to its distinctive structural information. For inorganic objects, such as knives, kitchen knives, and axes, the pixel values in the X-ray image are much smaller than those of other substances, and after high-low energy image fusion, the outlines of such objects are more distinct. This helps the network model learn the characteristic information of such objects and recognize them.

Conclusion
In this paper, TIP [7] and a high- and low-energy image fusion method are applied to the collected X-ray dangerous goods data set to expand the data set and enhance its features. The YOLOv7 model, which is superior to previous target detectors in both accuracy and speed, is used for training, and its training accuracy rate is more than 99%. Compared with direct training without TIP and high- and low-energy fusion, the accuracy is improved by 6%, and the speed is 93 FPS. The experimental results show that applying TIP and high-low energy fusion and then training with YOLOv7 can meet the real-time detection requirements of X-ray security equipment, and the detection accuracy is much higher than the industry detection accuracy standards. The method can be integrated into X-ray security equipment to assist security staff, improve work efficiency and reduce security risks. Since the categories of dangerous goods in real life far outnumber those collected in this experiment, the model may miss some unknown dangerous goods in practical application.
In the next step, more dangerous goods categories will be collected for training, and the model will be trained with open data sets to improve the robustness of the model and enable the model to identify unlabeled dangerous goods. In addition, although the detection speed of the model has met the requirements of the real-time detection system, there is still a lot of room for improvement. The next step will be to comprehensively improve the performance of the model by optimizing the structure of the network model, and reducing the number of parameters and calculation amount of the model under the premise of ensuring accuracy.

Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.