Asian Food Image Classification Based on Deep Learning

To improve Asian food image classification accuracy, a method that combined the Convolutional Block Attention Module (CBAM) with MobileNetV2, VGG16, and ResNet50 was proposed for Asian food image classification. Additionally, we proposed to use the Mixup data enhancement algorithm to give the models a smoother discrimination ability. The effects of introducing the attention mechanism (CBAM) and of using the Mixup data enhancement algorithm were shown separately through experimental comparison. Combining the two, the final Top-1 accuracy rate on the test set reached 87.33%. Moreover, the information emphasized by CBAM was reflected through heat-map visualization. The results confirmed the classification method's effectiveness and provided new ideas for improving Asian food image classification accuracy.


Introduction
A good diet provides humans with the nutrients needed by the body, making it a basis for human survival. With the continuous improvement of living standards, people have gradually begun to pay attention to the nutritional balance of their daily diet. People have started using computer vision technology to classify and recognize food images, which provides a new, fast, and low-cost method for analyzing food composition and nutritional components. Therefore, food image classification technology has gradually become a research hotspot in the field of computer vision. Traditional machine learning methods solve the food image classification problem by extracting food image features manually and designing classifiers. For instance, Yuji Matsuda et al. proposed a technique that involved dividing the entire food picture into different candidate regions, analyzing each candidate region, and identifying multiple foods simultaneously. It was proved that the proposed method was effective for the recognition of multiple food images [1]. Lukas Bossard et al. proposed a method that used random forests for food image recognition. Their method outperformed alternative classification methods, including SVM classification on Improved Fisher Vectors and existing discriminative part-mining algorithms. On the challenging MIT-Indoor dataset, their method compared favourably to other state-of-the-art component-based classification methods [2]. On the other hand, Marc Bolaños and others proposed a food image recognition algorithm. The core of this algorithm is first to analyze the food feature map of the input food picture and then use it to predict the kind of the input food by looking for the most similar food features. They proved that, compared with the most similar existing problem, object localization, their method was able to obtain high precision and reasonable recall levels with only a few bounding boxes. Furthermore, they also showed that it was applicable to both conventional and egocentric images [3].
Further, Shulin Yang et al. proposed a method of performing pairwise statistics on the feature relationship between different components of food and analyzing the statistical information to come up with a classifier for food identification. Their experiments showed that the proposed representation was significantly more accurate at identifying food than existing methods [4].
However, traditional machine learning classification technology relies on manual feature extraction and classifier selection. Manually extracted features are restricted by a variety of factors and usually struggle to express the real meaning of a picture, which results in low classification accuracy. Deep learning methods instead learn food features automatically through deep neural networks and couple the learned features closely with the classifier, which overcomes many of the shortcomings of manual feature extraction and hand-designed classifiers. Because of these factors, food image classification based on deep learning has been receiving increasing attention.
Some of the most common neural networks in food image classification tasks include MobileNet, VGG, ResNet, etc. Fu Z. et al. introduced a 1000-category food data set, ChinFood 1000, and proposed a simple and effective baseline method that involved using the ResNet model to conduct research on the ChinFood 1000 data set. The baseline approach was evaluated on the three most widely used food data sets and achieved the best performance on all of them, and it was also applied to the ChinFood 1000 data set with promising accuracy [5]. On the other hand, M. Taskiran and others used the Food 101 data set to train models such as MobileNet, VGG, and ResNet and proposed a method of comparing correlation coefficients within categories [6]. A further study proposed a food image classification model that transplanted the MobileNet network to the Raspberry Pi to calculate the nutritional content of food. Their network on the Raspberry Pi 3 produced good prediction accuracy but ran slowly; after introducing PeachPy to speed up the network, it could run at 3.3 seconds per food image [7].
There is a variety of Asian foods, and these foods have different characteristics. Therefore, this makes it challenging to classify Asian food images manually. It then becomes essential to come up with a method for Asian food image classification. In this article, we used an algorithm that was based on deep learning with three convolutional neural networks that included MobileNetV2 [8], VGG16 [9], and ResNet50 [10] as the baseline network. Further, we used CBAM [11] (Convolutional Block Attention Module) to improve the baseline network and Mixup [12] data enhancement algorithm to expand the training set. Our method verified the performance improvement effect of the CBAM attention mechanism and the Mixup data enhancement algorithm on Asian food image classification.

Neural Networks
To realize the classification method of deep learning for Asian food pictures, we selected MobileNetV2, VGG16, and ResNet50 as the experimental baseline network.

MobileNetV2 Network
MobileNetV2 is a lightweight deep neural network. Its core is adding a 1 × 1 convolution before the depthwise separable convolution, enabling the depthwise separable convolution to perform feature extraction on higher-dimensional channels. Depthwise separable convolution factors an ordinary convolution into a layer-wise (depthwise) convolution and a pointwise convolution. Moreover, it uses Batch Normalization (BN) to prevent overfitting and the ReLU function as the activation function. Additionally, to curb the problem of feature loss during training, the ReLU after the third, pointwise convolution layer is replaced with a linear activation function.
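The depthwise separable convolution described above can be sketched in PyTorch (a minimal illustration, not the paper's exact implementation; the depthwise stage is realized with the `groups` argument of `nn.Conv2d`):

```python
import torch
import torch.nn as nn

# A depthwise separable convolution factors a standard convolution into a
# per-channel (depthwise) 3x3 convolution followed by a pointwise 1x1
# convolution, each followed here by BatchNorm and ReLU.
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),  # groups=in_ch -> per-channel conv
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),  # 1x1 channel mixing
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 56, 56)
y = DepthwiseSeparableConv(32, 64)(x)
```

Compared with a standard 3 × 3 convolution, this factorization reduces the parameter count from 9·C_in·C_out to 9·C_in + C_in·C_out.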
To make full use of depthwise separable convolution, MobileNetV2 builds an inverted residual block structure. This structure draws on the residual structure of ResNet; the residual connection is used only when the stride is 1. The block structure of MobileNetV2 is shown in Figure 1.
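A minimal sketch of the inverted residual block (expand with a 1 × 1 convolution, filter with a depthwise 3 × 3 convolution, project back with a linear 1 × 1 convolution; the skip connection is applied only when the stride is 1 and the channel counts match):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of MobileNetV2's inverted residual block."""

    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),             # 1x1 expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                # 3x3 depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),            # linear 1x1 project
            nn.BatchNorm2d(out_ch),                              # no activation here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

x = torch.randn(2, 32, 28, 28)
y = InvertedResidual(32, 32, stride=1)(x)
```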

VGG16 Network
The Visual Geometry Group proposed the VGG model in 2014. The most outstanding feature of the VGG models is their simplicity. The hierarchical structure includes convolutional layers, pooling layers, and fully connected layers. The convolutional layers extract features at different locations in an image. The pooling layers reduce the dimensionality of the features extracted by the convolution kernels. The fully connected layers are equivalent to the classifier in machine learning, since they classify the extracted features. The VGG16 model consists of 13 convolutional layers and 3 fully connected layers, and its input image size is 224 × 224. The convolution kernel size is 3 × 3, and the pooling layers are 2 × 2 max-pooling. The structure diagram of VGG16 is shown in Figure 2.
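The convolutional part of VGG16 can be built from its well-known layer configuration (a sketch consistent with the description above; torchvision's implementation follows the same configuration):

```python
import torch
import torch.nn as nn

# VGG16 configuration: each number is the output-channel count of a 3x3
# convolution (followed by ReLU); 'M' marks a 2x2 max-pooling layer.
CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

def make_vgg16_features():
    layers, in_ch = [], 3
    for v in CFG:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

features = make_vgg16_features()
n_conv = sum(1 for m in features if isinstance(m, nn.Conv2d))  # 13 conv layers
# Five 2x2 poolings halve the spatial size five times: 32 -> 1 here
# (224 -> 7 for the paper's input size).
out = features(torch.randn(1, 3, 32, 32))
```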

ResNet50 Network
The ResNet model was proposed by Kaiming He and colleagues in 2015. Through the identity mapping of the residual block, ResNet solves the problems of gradient explosion and slow network convergence caused by excessive network depth. It is more practical in deep neural networks than the VGG models. The ResNet residual block is shown in Figure 3.
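The identity mapping can be sketched as a minimal residual block (an illustration of the principle, not ResNet50's bottleneck block): the output is F(x) + x, so gradients always have a direct path through the skip connection.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""

    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Identity mapping: the input is added back to the residual branch.
        return self.relu(self.body(x) + x)

x = torch.randn(2, 64, 14, 14)
y = BasicResidualBlock(64)(x)
```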

CBAM Attention Mechanism
The attention mechanism in human vision is a signal processing mechanism. People scan the global image with the naked eye to find a target of attention and then devote more attention to that target to obtain more detailed information, while reducing attention to the rest of the scene. Through the attention mechanism, the speed and accuracy of human visual information processing are improved. The attention mechanism in deep learning draws on the attention mechanism in human vision. It adjusts and adapts to the learned features by changing their weights, which improves the accuracy of image classification. We introduced the Convolutional Block Attention Module (CBAM) to improve the convolution models. CBAM includes channel attention and spatial attention. The channel attention module learns what is in the picture; its structure is shown in Figure 4.
In the channel attention mechanism, an input feature map F of size C × H × W (C is the number of channels, H the height, and W the width of the feature map) is compressed by average-pooling and max-pooling into two C-dimensional descriptors. These descriptors are passed through a shared multilayer perceptron, and the summed outputs generate the final weight parameter M_c. Multiplying the feature map by the weight parameter M_c yields the refined feature F′. The weight parameter of the channel attention module can be expressed by Equation (1):

M_c(F) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)))  (1)
Among them, σ represents the Sigmoid activation function, W_0 represents the weight of the hidden fully connected layer of the MLP, and W_1 represents the weight of its output layer. Spatial attention learns where the informative content of the input image is located; the structure is shown in Figure 5.
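The channel attention module of Equation (1) can be sketched in PyTorch as follows (a minimal sketch; the reduction ratio of 16 is CBAM's usual default, assumed here):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: average- and max-pooled channel descriptors share
    one MLP (W_0, W_1); the summed outputs pass through a Sigmoid to give
    one weight per channel, which rescales the input feature map."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W_0 (dimension reduction)
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W_1 (restore dimension)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average-pooling descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max-pooling descriptor
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # M_c
        return x * w

x = torch.randn(2, 64, 8, 8)
y = ChannelAttention(64)(x)
```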
The spatial attention mechanism compresses the input feature map along the channel dimension by global max-pooling and global average-pooling and obtains a feature map with 2 channels by concatenation. This map is then compressed by a 7 × 7 convolution kernel to a single channel, and the spatial weight map M_s is generated through the Sigmoid activation function. Finally, the final feature is obtained by multiplying the weight parameter with the input feature map F of the module. The weight parameter of the spatial attention module can be expressed by Equation (2):

M_s(F) = σ(f^{7×7}([F^s_avg; F^s_max]))  (2)
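A matching sketch of the spatial attention module of Equation (2):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise average- and max-pooled maps are
    concatenated (2 channels) and compressed by a 7x7 convolution into a
    single-channel Sigmoid weight map that rescales the input."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)  # f^{7x7}

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)          # average over channels
        mx = x.amax(dim=1, keepdim=True)           # max over channels
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s
        return x * w

x = torch.randn(2, 64, 8, 8)
y = SpatialAttention()(x)
```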
The combination of spatial attention and channel attention enables the neural network model to locate quickly and focus on the image's local key information for a better adaptive effect. Moreover, the channel attention mechanism is implemented by an MLP, and the pooling layers introduce no extra parameters, which keeps the parameter count of CBAM small; thus CBAM is a lightweight module.

CBAM Combination Method
Because the structure of CBAM includes channel attention and spatial attention, to maximize the effect of CBAM, we analyzed three different combinations of channel attention and spatial attention. First, the channel attention and spatial attention were connected in series. The input feature first paid attention to the feature content through the channel attention mechanism and then paid attention to the feature location through the spatial attention mechanism. We called this structure "channel before space". The structure is shown in Figure 6.
Secondly, the channel attention and spatial attention were connected in series. The input feature located the feature position through the spatial attention mechanism and then focused on the channel attention mechanism's feature content. We called this structure "space before channel". The structure is shown in Figure 7.
Finally, the channel attention and spatial attention were connected in parallel. The input features went through the channel attention mechanism and the spatial attention mechanism separately, attending to the feature content and the feature location respectively, and the structure then merged the newly output features generated by these two mechanisms. We called this structure "parallel structure". The structure is shown in Figure 8.
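The three wirings can be illustrated with toy, parameter-free stand-ins for the two attention modules (hypothetical gates, not the real CBAM modules; the merge-by-addition in the parallel case is an assumption, since the text does not specify the merge operation):

```python
import torch

# Toy gates: each simply rescales the input with a Sigmoid of a pooled
# statistic, standing in for the real channel/spatial attention modules.
def channel_att(x):
    return x * torch.sigmoid(x.mean(dim=(2, 3), keepdim=True))

def spatial_att(x):
    return x * torch.sigmoid(x.mean(dim=1, keepdim=True))

x = torch.randn(2, 8, 4, 4)
y_cs = spatial_att(channel_att(x))        # "channel before space"
y_sc = channel_att(spatial_att(x))        # "space before channel"
y_par = channel_att(x) + spatial_att(x)   # "parallel structure" (merged by addition)
```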

CBAM-Based Network Model
Firstly, we introduced CBAM to MobileNetV2. Since MobileNetV2 is a lightweight network and its inverted residual structure simplifies the model's learning goal and reduces the difficulty of training, we added the CBAM attention mechanism after the last convolutional layer in each block with a stride of 1. The CBAM attention mechanism adjusted the convolutional layer's weights so that the weighted features were transmitted farther back through the network. The specific approach is shown in Figure 9. Secondly, we introduced CBAM to VGG16. Since the features obtained after all convolution operations of VGG16 retain important local feature information, the convolutional layers of VGG16 were used as the backbone, and we added the CBAM attention mechanism in between to enhance the original feature map's expressive ability and improve the classification accuracy. The specific approach is shown in Figure 10.
Finally, we introduced CBAM to ResNet50. The features obtained by the first layer of convolution contain more local key information. Therefore, the CBAM structure was added after the first layer of ResNet50 to capture the first convolutional layer's detailed features. When the model performed identity mapping, the important features learned through the CBAM structure could be transmitted farther back to improve the classification effect. The specific approach is shown in Figure 11.

Mixup Data Enhancement
To improve the classification accuracy, we introduced the Mixup data enhancement algorithm. Mixup constructs virtual training samples by linearly interpolating pairs of training images and their labels, which can be expressed by Equation (3):

x̃ = λx_i + (1 − λ)x_j,  ỹ = λy_i + (1 − λ)y_j  (3)

where (x_i, y_i) and (x_j, y_j) are two samples drawn at random from the training set and the mixing coefficient λ ∈ [0, 1] is drawn from a Beta(α, α) distribution. Training on the resulting virtual samples encourages linear behaviour between training examples and gives the models a smoother discrimination ability.
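A minimal Mixup sketch in plain Python (toy flattened vectors stand in for images; in the actual training pipeline the interpolation is applied to image tensors and one-hot labels batch-wise):

```python
import random

# Mixup: a virtual sample is a convex combination of two training examples
# and of their one-hot labels, with lambda drawn from Beta(alpha, alpha).
def mixup(x1, y1, x2, y2, alpha=0.4, lam=None):
    if lam is None:
        lam = random.betavariate(alpha, alpha)  # sample the mixing coefficient
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Two toy "images" with one-hot labels, mixed at a fixed lam = 0.25.
x, y, lam = mixup([0.0, 1.0], [1, 0], [1.0, 0.0], [0, 1], lam=0.25)
```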

Picture Segmentation
We used the UECFOOD100 data set [1], created by Yoshiyuki Kawano from the University of Electro-Communications.
To avoid the interference of irrelevant features such as the background of the food image, we combined the label information documents of the food subject in the data set and wrote a Python script to perform image segmentation as preprocessing. Taking rice as an example, the food detection frame was restored from the label information file, and the food subject in the detection frame was segmented, as shown in Figure 14. Among them, column (a) is the original picture, column (b) shows the information of the food detection frame, and column (c) shows the rendering of the content of the detection frame. It can be observed that the image segmentation method we proposed reduces the influence of the image background and avoids the interference of irrelevant features such as the background.
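The segmentation step can be sketched as a simple bounding-box crop (a sketch assuming Pillow and a box in (x1, y1, x2, y2) form read from the label file; the field layout of the actual label files may differ):

```python
from PIL import Image

# Crop the food subject out of the image using its detection-frame box,
# discarding most of the background before training.
def crop_food(img, box):
    x1, y1, x2, y2 = box  # assumed (left, top, right, bottom) pixel coordinates
    return img.crop((x1, y1, x2, y2))

img = Image.new("RGB", (640, 480))           # placeholder image
crop = crop_food(img, (100, 50, 420, 370))   # hypothetical detection frame
```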

Experimental Evaluation Index
In image classification experiments, the Top-1 and Top-5 accuracy are often used to measure the effectiveness of the model. The Top-1 and Top-5 accuracy can be expressed by Equation (4):

Top-1 = n_1 / n,   Top-5 = n_5 / n  (4)

Among them, n_1 is the number of test pictures whose correct label matches the label with the highest classification probability, n_5 is the number of test pictures whose correct label appears among the five labels with the highest classification probabilities, and n is the total number of test pictures.
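Equation (4) generalizes to a Top-k accuracy, which can be computed as follows (plain Python, with toy probabilities):

```python
# Top-k accuracy: a prediction counts as correct for Top-k if the true
# label is among the k classes with the highest predicted probability.
def topk_accuracy(probs, labels, k):
    hits = 0
    for p, label in zip(probs, labels):
        topk = sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

# Three toy test samples over three classes.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
labels = [0, 2, 1]
top1 = topk_accuracy(probs, labels, 1)  # only the first sample is Top-1 correct
top2 = topk_accuracy(probs, labels, 2)  # all three are Top-2 correct
```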

Experimental Process and Result Analysis
Our experiment used the PyTorch framework and was carried out on an NVIDIA GeForce RTX GPU. To further improve the classification accuracy, we used common data enhancement techniques, including random cropping and random flipping, to expand the training set. During training, each picture was scaled to 224 × 224 pixels; during testing, each test picture was scaled to 256 × 256 and then centre-cropped to 224 × 224 as input.

The Effect of the Combination Mode of CBAM on Classification
We set the initial learning rate to 0.01, the momentum to 0.9, and the batch size to 32, and trained for 90 epochs. At epoch 30 the learning rate decayed to 0.001, and at epoch 60 it decayed further to 0.0001. We selected SGD as the optimizer so that gradients could be computed quickly and the models could converge at a faster speed. The experimental results are shown in Table 1, which lists the classification accuracy of the three CBAM structures. The classification accuracy of "channel before space" is higher than that of "space before channel" and that of the "parallel structure": on VGG16 it is higher by 0.92% and 1.83%, respectively; on MobileNetV2 by 0.28% and 1.69%; and on ResNet50 by 0.21% and 0.21%.
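The step learning-rate schedule described above can be written as a small helper (a sketch; the actual training loop would use PyTorch's SGD optimizer with an equivalent milestone scheduler):

```python
# Multiply the learning rate by gamma at each milestone epoch
# (epochs are 0-indexed here).
def step_lr(epoch, base_lr=0.01, milestones=(30, 60), gamma=0.1):
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Learning rate before, at, and after each decay point.
lrs = [step_lr(e) for e in (0, 29, 30, 60, 89)]
```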
The line chart of Top-1 training accuracy is shown in Figure 15, and the line chart of Top-1 testing accuracy is shown in Figure 16. The x-axis in the figure represents epoch, and the y-axis represents the accuracy of Top-1 training or testing.
Because the learning rate is reduced by a factor of 10 at epochs 30 and 60, the accuracy curves show large jumps at those points. Figure 15 and Figure 16 further confirm that the classification effect of "channel before space" is better than the other two CBAM structures and that "channel before space" is therefore more suitable for Asian food image classification.

The Effect of CBAM on Classification
To further verify the effectiveness of the proposed CBAM mechanism, we combined CBAM with MobileNetV2, VGG16, and ResNet50 and selected "channel before space" as the CBAM structure. Our models used ImageNet pre-training weights. We set the initial learning rate to 0.01, the momentum to 0.9, and the batch size to 32, and trained for 160 epochs. At epoch 90 the learning rate decayed to 0.001, and at epoch 120 it decayed further to 0.0001. As before, we selected SGD as the optimizer so that gradients could be computed quickly and the models could converge at a faster speed. The results are shown in Table 2. It can be seen from Table 2 that the CBAM models have better classification performance on the Asian food data set than the benchmark models. The Top-1 training and testing accuracy curves are shown in Figure 17 and Figure 18; the x-axis represents the epoch, and the y-axis represents the Top-1 training or testing accuracy. Because the learning rate is reduced by a factor of 10 at epochs 90 and 120, the curves show large jumps at those points. Figure 17 and Figure 18 show that the classification accuracy of the models after introducing the CBAM structure is improved compared with the original models. This shows that the CBAM attention mechanism is effective for the problem of Asian food image classification.

The Effect of Mixup on Classification
In addition, we performed Mixup data enhancement preprocessing on the Asian food data set with the hyperparameter α set to 0.4; the experimental results are shown in Table 3. It can be seen from Table 3 that the classification performance of the Mixup-preprocessed models is better than that of the benchmark models on the Asian food data set. The line chart of Top-1 training accuracy is shown in Figure 19, and the line chart of Top-1 testing accuracy is shown in Figure 20; the x-axis represents the epoch, and the y-axis represents the Top-1 training or testing accuracy. Although the virtual samples reduce training accuracy, the models trained on them achieve higher classification accuracy on the test samples. This shows that the Mixup data enhancement preprocessing method yields good classification results on multiple models and effectively helps solve the problem of Asian food image classification. Table 4 shows the comparison between our models and previous research on UECFOOD100. It can be seen from Table 4 that our method is superior to the other methods in classification effect. In particular, the Top-1 accuracy of the VGG16 + CBAM + Mixup model is 7.80% higher than that of the FV + DeepFoodCam method [13] (Top-5 is 2.26% higher), 8.85% higher than that of the DeepFood method [14] (Top-5 is 2.51% higher), and 6.38% higher than that of the DCNN-FOOD method [15] (Top-5 is 1.96% higher). This shows that our method is effective in dealing with the problem of Asian food image classification and improves the classification accuracy.

Visual Comparison
To visually display the key information that our CBAM models focus on, we introduced Grad-CAM [16] for visualization experiments. Heat maps were used to display the Asian food picture information that the convolutional neural network and CBAM focus on. We took ResNet50 as an example and randomly selected 3 pictures; the generated heat-map comparison is shown in Figure 21. Among them, column (a) is the original picture, column (b) is the effect picture of the benchmark network, and column (c) is the effect picture after adding the CBAM attention mechanism. It can be observed that, after the introduction of the CBAM module, the models pay more attention to the features in the food area of the images, thereby greatly improving the classification effect.
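The Grad-CAM computation can be sketched on a toy CNN (a minimal illustration of the technique, not the paper's visualization code: the target layer's activations are weighted by the spatially averaged gradients of the target class score, summed over channels, and passed through ReLU):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Toy CNN that also returns its last convolutional feature map."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.fc = nn.Linear(8, 5)

    def forward(self, x):
        feat = F.relu(self.conv(x))
        return self.fc(feat.mean(dim=(2, 3))), feat

def grad_cam(model, x, class_idx):
    logits, feat = model(x)
    feat.retain_grad()                        # keep gradients of the feature map
    logits[:, class_idx].sum().backward()
    weights = feat.grad.mean(dim=(2, 3), keepdim=True)  # global-average gradients
    cam = F.relu((weights * feat).sum(dim=1))           # weighted channel sum
    return (cam / (cam.amax() + 1e-8)).detach()         # normalise to [0, 1]

cam = grad_cam(TinyNet(), torch.randn(1, 3, 16, 16), class_idx=2)
```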

Conclusion
To improve Asian food image classification accuracy, we designed a method that uses the three convolutional neural networks MobileNetV2, VGG16, and ResNet50 as the baseline networks and improves them with the CBAM attention mechanism. In addition, we used the Mixup data enhancement algorithm to expand the training set. Comparative experiments on multiple models show that the method can effectively improve the accuracy of Asian food image classification, with an accuracy higher than that of Asian food image classification work in recent years, and the heat maps further verify the effectiveness of the CBAM attention mechanism. Our method provides a new idea for solving the problem of Asian food image classification, in line with the purpose of introducing the CBAM attention mechanism and Mixup data enhancement.