An Improved YOLOv3 Model for Asian Food Image Recognition and Detection

The detection and recognition of food images has become an emerging application of computer vision. However, because the differences between food categories are small while the differences within a category are large, missed detections and false detections occur during detection and recognition. To address these problems, an improved YOLOv3 model for Asian food detection is proposed. First, a top-down fusion path is added to form a circular fusion structure, making full use of shallow and deep features. Second, a convolutional residual module replaces the ordinary convolutional layer to increase the gradient correlation and non-linearity of the network. Third, the CBAM (Convolutional Block Attention Module) attention mechanism is introduced to improve the network's ability to extract effective features. Finally, the CIoU (Complete-IoU) loss is used to improve the convergence efficiency of the model. Experimental results show that the proposed improved model achieves better detection results on the Asian food UECFOOD100 data set.


Introduction
With the development of science and technology and the improvement of living standards, food object detection plays an important role in fields such as digital retail services, smart homes, healthy eating, and dietary self-monitoring.
More and more research based on food image analysis is emerging [1] [2].
In recent years, the classification and detection of food images have attracted widespread attention. Most researchers start from the structural characteristics of food and base their work on the properties of the food itself. In 2009, Joutou [3] proposed an automatic food image recognition system that uses Multiple Kernel Learning (MKL) to integrate multiple image features for food classification. In 2015, Bettadapura [4] combined food pictures taken by a camera with contextual information such as the restaurant's location to classify food with support vector machines (SVM). However, traditional machine learning methods are less robust in real scenarios. With the development and application of deep learning, introducing deep models into food image target detection has become the mainstream approach. In 2016, Chen [5] exploited the mutual and fuzzy relationships between foods and proposed a deep network structure that simultaneously learns ingredient recognition and food classification. In 2018, Aguilar [6] focused on the cafeteria environment and studied automatic food analysis, integrating food localization, recognition, and segmentation. In 2019, Zhang et al. [7] proposed a food image recognition method based on a diffusion graph convolutional network and transfer learning, while Min et al. [8] used rich ingredient information to localize food images at multiple scales, achieving recognition from the category level down to the ingredient level. In 2020, Lu [9] proposed a novel system based on artificial intelligence (AI) that accurately estimates nutrient intake by simply processing RGB-Depth (RGB-D) image pairs captured before and after meal consumption.
Because the texture and color information contained in food is extremely rich, it can confuse the model, making food object detection one of the more challenging tasks in machine learning. In addition, most food pictures come from dining-table scenes, which inevitably contain background information such as tableware, other dishes, and sundries that can easily degrade detection. Asian food in particular is diverse in shape and structure, and its appearance varies greatly with the cooking method. These factors significantly increase the difficulty of detection. Moreover, relying on extensive additional information such as raw materials and geographic location slows detection and hurts real-time performance, and when such information is unavailable or unreliable, detection accuracy suffers greatly.
In order to solve the above problems in Asian food target detection, further explore the correlation between the inherent characteristics of Asian food and the detection results, and improve the accuracy of Asian food detectors, this paper improves the YOLOv3 [10] algorithm and applies it to Asian food target detection under complex background conditions. The improved network structure is shown in Figure 1. First, a top-down fusion path is added to form a ring fusion, making full use of shallow and deep features. Second, the convolutional residual module [11] is introduced to reshape the feature output layer, effectively increasing the gradient correlation and non-linearity of the network. Third, the CBAM (Convolutional Block Attention Module) attention mechanism [12] is introduced to improve the network's ability to extract effective features. Finally, the CIoU loss is used to improve the convergence efficiency of the model.

Annulus-FPN
In response to the above problems, this paper redesigns the feature fusion method of FPN, called "Annulus-FPN", as shown in Figure 2.
The red frame in Figure 2 shows the original FPN feature fusion structure, and the blue frame shows the improved Annulus-FPN feature fusion structure.
Food image features are first extracted by the DarkNet-53 backbone. The fusion operation used in Annulus-FPN is element-wise addition, which increases the amount of information in each dimension of the fused features while leaving their dimensionality unchanged, so the feature fusion network uses fewer parameters and achieves higher computational efficiency. By enhancing the fusion path, the network can deliver more reliable semantic and location information, and further exploit feature information to achieve feature enhancement.
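The exact layer layout of Annulus-FPN is given in Figure 2 and is not reproduced here; the following is a minimal PyTorch sketch of ring-shaped, add-based fusion under assumed channel widths: a top-down pass followed by an extra bottom-up pass that closes the loop, with 1 × 1 lateral convolutions aligning channels so that element-wise addition is valid.

```python
import torch
import torch.nn as nn

class AnnulusFPN(nn.Module):
    """Minimal sketch of a ring-shaped feature fusion neck (assumed topology).

    Fuses three DarkNet-53 feature maps with element-wise addition:
    a top-down pass (as in FPN) followed by an extra bottom-up pass,
    closing the fusion loop. Channel widths are illustrative.
    """

    def __init__(self, channels=(256, 512, 1024), out_ch=256):
        super().__init__()
        # 1x1 convs align every level to a common width so "add" is valid.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.downsample = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        p3, p4, p5 = (lat(c) for lat, c in zip(self.lateral, (c3, c4, c5)))
        # Top-down path: deep semantics flow into shallow maps (element-wise add).
        p4 = p4 + self.upsample(p5)
        p3 = p3 + self.upsample(p4)
        # Extra bottom-up path: shallow localization flows back up, closing the ring.
        p4 = p4 + self.downsample(p3)
        p5 = p5 + self.downsample(p4)
        return p3, p4, p5

# Shapes for a 416x416 input through DarkNet-53: 52x52, 26x26, 13x13.
neck = AnnulusFPN()
c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in [(256, 52), (512, 26), (1024, 13)])
p3, p4, p5 = neck(c3, c4, c5)
```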

The Convolutional Block
In the YOLOv3 network structure, to further extract features and obtain the network's predictions, the detection layers obtained after FPN feature fusion perform 3 × 3 and 1 × 1 convolution operations; the 3 × 3 convolution operation is shown in Figure 3.
In general, the greater the non-linearity of a convolutional neural network, the better its performance, and the degree of non-linearity is closely tied to the activation functions used. An ordinary 3 × 3 convolution uses only one Rectified Linear Unit (ReLU) activation function, which is far from sufficient, and some information is inevitably lost as features pass through the layer. In addition, although stacking ordinary 3 × 3 convolutions deepens the network and helps extract feature information, it also degrades the correlation between gradients during backpropagation.
To address these problems, a convolutional block is introduced to replace the ordinary 3 × 3 convolution for the output features; its structure is shown in Figure 4.
The convolutional residual structure includes a 1 × 1 convolution [14] [15] and a 3 × 3 convolution, with a shortcut connection through which the input is passed directly to the output to protect the integrity of the feature information. In addition, introducing the convolutional block keeps the backpropagated gradient consistent with the forward propagation, which not only deepens the network but also maintains gradient correlation.
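As a rough illustration (the actual block layout is in Figure 4; the hidden width and LeakyReLU activation below are assumptions, following common YOLOv3 practice), the following PyTorch sketch shows a residual convolutional block of this kind: a 1 × 1 and a 3 × 3 convolution with a shortcut, giving two activations where a plain 3 × 3 convolution has one.

```python
import torch
import torch.nn as nn

class ConvResidualBlock(nn.Module):
    """Sketch of a residual block replacing a plain 3x3 conv (assumed layout).

    A 1x1 conv reduces channels, a 3x3 conv restores them, and a shortcut
    passes the input straight to the output. Two activations instead of one
    raise the non-linearity; the identity path keeps gradients correlated.
    """

    def __init__(self, channels):
        super().__init__()
        hidden = channels // 2  # illustrative bottleneck width
        self.conv1 = nn.Conv2d(channels, hidden, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.conv2 = nn.Conv2d(hidden, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)  # shortcut protects feature integrity
```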

Regression Loss Function-CIoU
In addition, on the basis of the above improvements to the YOLOv3 model, and in view of the large deviation between the predicted box and the real box, the CIoU regression loss function [16] [17] is introduced to replace the L2-norm loss function. The CIoU loss is defined as

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b_p, b_g)}{d_e^2} + \alpha v$$

where $\rho(b_p, b_g)$ is the distance between the centers $b_p$ and $b_g$ of the predicted and real boxes, and $d_e$ is the diagonal length of the smallest closed box covering the two boxes. The term $\alpha v$ is a penalty factor that steers the width and height of the prediction box to quickly fit those of the real box. Here $\alpha$ is a trade-off parameter, defined as

$$\alpha = \frac{v}{(1 - IoU) + v}$$

and $v$ measures the consistency of the aspect ratio, defined as

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w_g}{h_g} - \arctan\frac{w_p}{h_p}\right)^2$$

where $w_g$ and $h_g$ are the width and height of the real box, and $w_p$ and $h_p$ are the width and height of the prediction box. A schematic diagram of the CIoU loss function is shown in Figure 6.
CIoU loss directly minimizes the normalized distance between the two boxes and takes into account three geometric properties: overlap area, center distance, and aspect ratio, which improves the stability of box regression and the convergence accuracy of the model.
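Under these definitions, a CIoU loss can be sketched directly in PyTorch; the (x1, y1, x2, y2) box format and the epsilon constants below are implementation assumptions, not taken from the paper.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Sketch of the CIoU loss: 1 - IoU + rho^2/d_e^2 + alpha*v.

    pred, target: tensors of shape (..., 4) in (x1, y1, x2, y2) format.
    """
    px1, py1, px2, py2 = pred.unbind(-1)
    gx1, gy1, gx2, gy2 = target.unbind(-1)

    # Overlap area and IoU.
    iw = (torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(0)
    ih = (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(0)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)

    # rho^2: squared distance between the centers of the two boxes.
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4

    # d_e^2: squared diagonal of the smallest box enclosing both boxes.
    cw = torch.max(px2, gx2) - torch.min(px1, gx1)
    ch = torch.max(py2, gy2) - torch.min(py1, gy1)
    de2 = cw ** 2 + ch ** 2 + eps

    # v: aspect-ratio consistency; alpha: trade-off parameter.
    v = (4 / math.pi ** 2) * (
        torch.atan((gx2 - gx1) / (gy2 - gy1 + eps))
        - torch.atan((px2 - px1) / (py2 - py1 + eps))
    ) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / de2 + alpha * v
```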

Introduction to the Data Set
The data set used is the public UEC-FOOD100 Asian food data set, which contains 12,741 pictures covering 100 Asian foods such as rice, grilled chicken, and sweet and sour pork. The UEC-FOOD100 data set is characterized by small differences between categories, large differences within categories, many cooking methods, and rich color and texture information, which accurately reflect the complex and diverse characteristics of Asian food. During the experiments, the UECFOOD100 data set was converted to VOC2007 format: the annotated center point, width and height, target category, and other information were saved in the xml files required by the VOC format. In UECFOOD100, each picture contains one or more foods, and a picture with multiple foods originally corresponds to multiple xml files. In the experiments, the multiple xml files of each picture were merged so that every picture corresponds to exactly one xml file, making the format more concise and the training faster. The data set was randomly divided into training, validation, and test sets in proportions of 80%, 10%, and 10%.
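A possible implementation of the xml-merging step described above, sketched with Python's standard ElementTree; the per-object file naming scheme and directory layout are hypothetical, since the paper does not describe its exact file organization.

```python
import glob
import os
import xml.etree.ElementTree as ET

def merge_voc_xmls(xml_paths, out_path):
    """Merge several single-object VOC xml files for one image into one file.

    Keeps the first file's header (filename, size, ...) and appends every
    <object> node from the remaining files.
    """
    tree = ET.parse(xml_paths[0])
    root = tree.getroot()
    for path in xml_paths[1:]:
        for obj in ET.parse(path).getroot().iter("object"):
            root.append(obj)
    tree.write(out_path)

# Hypothetical layout: one xml per food item, named <image_id>_<n>.xml.
image_ids = {os.path.basename(p).split("_")[0] for p in glob.glob("annos/*.xml")}
for image_id in image_ids:
    merge_voc_xmls(sorted(glob.glob(f"annos/{image_id}_*.xml")),
                   f"merged/{image_id}.xml")
```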

Data Enhancement
Analysis of the UECFOOD100 data set shows that a large number of pictures have a resolution of about 800 × 600 pixels and an aspect ratio of about 4:3.
When the data are fed into the 416 × 416 YOLOv3 network, the network resizes the input while maintaining its aspect ratio, so the vacant part is filled with pure gray pixels, as shown in Figure 7. An image whose original size is 800 × 600 actually occupies only 416 × 312 of the network input; the rest is pure gray pixels.
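The resize behavior described above can be sketched as a letterbox function; the gray value 128 and the centered placement below are assumptions.

```python
import numpy as np
import cv2  # assumes OpenCV is available

def letterbox(img, new_size=416, pad_value=128):
    """Resize while keeping aspect ratio; fill the vacant area with gray.

    An 800x600 image becomes a 416x312 image placed on a 416x416 gray
    canvas, matching the behaviour described in the text.
    """
    h, w = img.shape[:2]
    scale = new_size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (nw, nh))
    canvas = np.full((new_size, new_size, 3), pad_value, dtype=img.dtype)
    top, left = (new_size - nh) // 2, (new_size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas
```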
After the resize operation, the targets become too small and are hard to detect, and about one-third of the network's computation is spent on the padded gray pixels, wasting considerable computing resources. In addition, the pictures in the data set follow a long-tailed distribution: the largest category has as many as 723 pictures, while the smallest has only about 100, as shown in Figure 8. For a target detection task, training a deep learning model on only about 100 pictures per category is not sufficient.
To increase the amount of data and improve model training, Mosaic data enhancement [18] is used to simulate more data samples during the experiments. A food image after Mosaic enhancement is shown in Figure 9. As can be seen, the enhanced picture has no solid-color border; during the cropping process the targets are not scaled so small that detection becomes harder, and more sample data can be simulated.
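A simplified sketch of Mosaic augmentation follows: it resizes each of four samples into a quadrant around a random center rather than cropping, which is a simplification of the published scheme [18]; box format and the gray fill value are assumptions.

```python
import random
import numpy as np
import cv2  # assumes OpenCV is available

def mosaic(samples, out_size=416):
    """Tile four (image, boxes) samples around a random center point.

    Each tile is resized to its quadrant; boxes (x1, y1, x2, y2, cls) in
    pixels are scaled and shifted accordingly.
    """
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    canvas = np.full((out_size, out_size, 3), 128, dtype=np.uint8)
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    all_boxes = []
    for (img, boxes), (x1, y1, x2, y2) in zip(samples, regions):
        h, w = img.shape[:2]
        tw, th = x2 - x1, y2 - y1
        canvas[y1:y2, x1:x2] = cv2.resize(img, (tw, th))
        sx, sy = tw / w, th / h  # per-tile scale factors for the boxes
        for bx1, by1, bx2, by2, cls in boxes:
            all_boxes.append((bx1 * sx + x1, by1 * sy + y1,
                              bx2 * sx + x1, by2 * sy + y1, cls))
    return canvas, all_boxes
```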

Experimental Environment Configuration and Model Training
The experiments use an Intel(R) Core(TM) i5-8500 CPU, an NVIDIA GeForce GTX 1660 graphics card, 16 GB of system memory, and 6 GB of graphics memory. Based on the PyTorch deep learning framework, the original YOLOv3 network and the improved YOLOv3 network were trained and analyzed separately under the Windows operating system.

First, the UECFOOD100 Asian food data set was used to train the YOLOv3 pre-trained model. During training, a dynamic learning-rate adjustment strategy in step mode is adopted. At the start of training, because of the limited Graphics Processing Unit (GPU) memory, the initial learning rate is set to 1e−3, the gamma coefficient to 0.92, and the batch size to 8. At the 50th iteration the model no longer converges, so the learning rate is set to 1e−4 and the batch size to 4 to continue training. As training proceeds, the loss value continues to decrease; the loss curve is shown in Figure 10.
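The schedule described above might be set up as follows in PyTorch; the optimizer choice (SGD) and the stand-in model are assumptions, and the training loop body is elided.

```python
import torch

# Stand-in model and optimizer; the actual network is the improved YOLOv3.
model = torch.nn.Conv2d(3, 16, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# "Step" mode: multiply the learning rate by gamma=0.92 after every epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.92)

for epoch in range(50):
    # ... one training epoch at batch size 8 ...
    scheduler.step()

# After the 50th iteration the model no longer converges:
# restart at lr=1e-4 (batch size reduced to 4) and continue training.
for group in optimizer.param_groups:
    group["lr"] = 1e-4
```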

Evaluation Index
In the experiments, Average Precision (AP) and Mean Average Precision (mAP) are used as evaluation indices. Assuming there are K categories with K > 1, the mAP is calculated as shown in Equation (5):

$$mAP = \frac{1}{K}\sum_{i=1}^{K} AP_i \qquad (5)$$
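Equation (5) and the underlying per-class AP can be sketched as follows; the all-point interpolation shown is the common VOC-style definition and is an assumption, since the paper does not state which interpolation it uses.

```python
import numpy as np

def average_precision(recall, precision):
    """AP: area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]  # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]        # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """Equation (5): the mean of per-class APs over K (> 1) categories."""
    return sum(ap_per_class) / len(ap_per_class)
```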

Initial Assessment
After training on 416 × 416 images, the performance of the original YOLOv3 was evaluated on the test set. Recall and precision are combined to evaluate the trained network, and the evaluation indices are considered under two conditions: IoU = 0.5 (mAP@.5) and IoU = 0.75 (mAP@.75). The confidence-score threshold (confidence score × category probability) for a specific category is set to 0.3 when generating predicted bounding boxes. Table 1 summarizes the mAP@.5 and mAP@.75 results of the original YOLOv3.
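The thresholding step might look like the following sketch; the tensor shapes are assumptions.

```python
import torch

def filter_predictions(objectness, class_probs, threshold=0.3):
    """Keep boxes whose confidence (objectness x class probability) > 0.3.

    objectness: (N,) tensor; class_probs: (N, C) tensor. Returns the indices,
    class ids, and scores of the surviving predictions.
    """
    scores, classes = (objectness.unsqueeze(1) * class_probs).max(dim=1)
    keep = scores > threshold
    return keep.nonzero(as_tuple=True)[0], classes[keep], scores[keep]
```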
In Table 1, Experiment 1 trains on the original data set, and Experiment 2 adds Mosaic data enhancement. Under both the IoU = 0.5 and IoU = 0.75 indices, Mosaic enhancement significantly improves test accuracy.

Experimental Results and Analysis
To evaluate the proposed improvements, four ablation experiments were designed; the results are shown in Table 2. Under the IoU = 0.5 index, compared with the original YOLOv3 (Experiment 2 in Table 1) trained with Mosaic data enhancement, Experiment 1 replaces the feature fusion network with the annulus feature fusion network and raises mAP by 1.43%, indicating that annulus feature fusion makes better use of deep and shallow features and effectively improves detection accuracy. Experiment 2 replaces the ordinary 3 × 3 convolution with the convolutional block and improves mAP by 0.94%, effectively improving gradient correlation.

Under the IoU = 0.5 index, the average precision of the original YOLOv3 is 70.54%, while the improved YOLOv3 reaches 77.60%. The average-precision curves of the model before and after the improvement during training are shown in Figure 11: the left figure is the curve at IoU = 0.5, and the right figure is the curve at IoU = 0.75. Since average precision only reflects the overall level of the model and cannot show the performance of each category, a scatter plot of the per-category AP before and after the improvement is drawn in Figure 12. It can be seen that, with the exception of a few categories, most categories show a clear AP improvement. The comparison of evaluation indices between the improved and original YOLOv3 demonstrates the effectiveness of Annulus-FPN, the convolutional block, the CBAM attention mechanism, the CIoU loss function, and the data enhancement.
As shown in Figure 13, the heat maps (a) and (b) output by the network before and after the improvement show that the improved network is more targeted in its feature extraction and covers a larger target area.
From Figure 13(c) and Figure 13(d), it can also be seen that the improved network obtains higher confidence on a variety of food targets, with fewer false and missed detections, and consistently outperforms the original YOLOv3.
To further verify the effectiveness of the proposed algorithm on the Asian food data set, it is compared against four representative or state-of-the-art target detection algorithms: SSD [20], Faster R-CNN [21], the original YOLOv3, and EfficientDet-d2 [22]. The performance comparison is shown in Table 3. Under the IoU = 0.5 index, the proposed algorithm achieves higher detection accuracy on Asian food than the other mainstream detectors, with an mAP of 77.60%, an increase of 7.06% over the original YOLOv3.
Figure 14 compares the detection results of the improved YOLOv3 with those of the original YOLOv3; the improved model inherits the strengths of the original. From Figure 14(b) and Figure 14(c), it can be found that the original YOLOv3 detects some targets with low confidence, whereas the improved YOLOv3 detects them reliably. A comparison with the SSD, Faster R-CNN, and EfficientDet algorithms is shown in Figure 15.
From the different performance of each model on the same images in Figure 15, it can be seen that, compared with SSD, Faster R-CNN, EfficientDet-d2, and the original YOLOv3, the proposed algorithm shows a significant improvement in detection confidence and more accurate target localization. In summary, the improved YOLOv3 model is better suited to Asian food target detection: its performance is significantly improved, and it holds a clear advantage over other conventional target detection algorithms.