Face recognition is a biometric technology that identifies people from their faces. Early machine face recognition was slow and less accurate than manual recognition. With the rapid development of deep learning and the application of Convolutional Neural Networks (CNNs) to face recognition, accuracy has improved greatly. FaceNet is a deep learning framework commonly used for face recognition in recent years. FaceNet uses the deep learning model GoogLeNet, which achieves high accuracy in face recognition. However, its network structure is large, which makes FaceNet slow to run. To improve the running speed without significantly affecting recognition accuracy, this paper proposes a lightweight FaceNet model based on MobileNet. Based on an analysis of FaceNet's low running speed and of the principle of MobileNet, the model reduces the overall computation of the network by using depthwise separable convolutions. The model is trained on the CASIA-WebFace and VGGFace2 datasets and tested on the LFW dataset. Experimental results show that the model substantially reduces the number of network parameters while largely preserving accuracy, and therefore increases computing speed. The model can also perform face recognition on a specific person in a video.

With the advent of the era of big data, artificial intelligence has been developing more rapidly. Artificial intelligence involves many fields, including deep learning [

Computer vision is an application of machine learning to visual data and an important part of artificial intelligence. Its purpose is to collect pictures or videos, analyze them, and obtain the required information. Computer vision is widely used nowadays: video surveillance, autonomous driving, medical treatment, face-based attendance check-in, and face-based payment are all supported by computer vision. Computer vision can be studied from the perspectives of object vision and spatial vision. Object vision aims to determine the type of an object, while spatial vision aims to determine its position and shape. At present, there are some tasks such as image classification [

With the development of the times, face recognition has gradually evolved from manual recognition to machine recognition, and the accuracy of machine recognition has surpassed that of human beings. Face recognition is a biometric technology that recognizes identities through facial features. Compared with other biometrics, face recognition has the advantages of naturalness, uniqueness, and non-contact operation. Other biometric methods such as fingerprint recognition and iris recognition are less natural and require pressure sensors or other equipment. Face recognition is used not only in video surveillance and finance, but also shows broad application potential in many scenarios, such as transportation, education, medical care, and e-commerce. The rapid development of deep learning has led to the wide use of deep learning models in face recognition. Since its introduction, the deep neural network model has been widely used in computer vision tasks such as image classification and face detection [

In face recognition, pose and illumination have long been challenging problems. The traditional face recognition method based on convolutional neural networks uses a CNN Siamese (twin) network [

FaceNet has two different deep network structures, both of which are deep convolutional networks. The first structure is based on the Zeiler & Fergus model [

Szegedy et al., which uses mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. These models can reduce the number of FLOPS required for comparable performance. Zhenyao et al. used a deep network to "warp" faces into a canonical frontal view and then learned a CNN that classifies each face as belonging to a known identity. For face verification, principal component analysis on the network output is used in combination with an ensemble of SVMs. Taigman et al. proposed a multi-stage approach that aligns faces to a general 3D model, and trained a multi-class network to perform face recognition on more than 4,000 identities. The authors also experimented with a Siamese (twin) network, in which they directly optimized the L1 distance between two face features. Their best performance on LFW comes from an ensemble of three networks using different alignments and color channels, whose predicted distances are combined using a nonlinear SVM (based on the χ² kernel). The new generation of FaceNet uses the Inception-ResNet-v2 network, which combines the residual connections of Microsoft's ResNet with Google's original Inception series of networks [

The main contributions of this article are as follows:

1) A Fast-FaceNet model based on MobileNet is proposed to reduce the overall calculation of the network.

2) Fast-FaceNet is applied to video face recognition to improve the recognition speed while maintaining a certain level of recognition accuracy.

This paper is divided into five parts. Section 1 introduces the relevant background and the related work of recent years, and summarizes the main work and organizational structure of this paper. Section 2 introduces the relevant basic theory. Sections 3 and 4 are the core of this paper and present the model architecture and the analysis of experimental results. The final part summarizes the entire article.

The FaceNet system directly maps face images to a compact Euclidean space, where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification, and clustering can be implemented easily with standard techniques, using FaceNet embeddings as feature vectors. An advantage of this model is that an image needs only a small amount of processing before it can be used as input. At the same time, the model achieves very high accuracy on the datasets. FaceNet can be widely used for face recognition on mobile terminals. Its network structure is shown in

The FaceNet network consists of a batch input layer and a deep convolutional network, followed by L2 normalization, which yields the face embedding. Finally, the triplet loss is computed so that the distance between faces of the same identity is as small as possible and the distance between faces of different identities is as large as possible. FaceNet uses a deep convolutional neural network to learn a Euclidean embedding for each image and trains the network so that the squared L2 distance in the embedding space directly corresponds to face similarity. FaceNet trains the neural network directly with a triplet-based loss function derived from LMNN (large margin nearest neighbor) classification, and the network output is a 128-dimensional embedding vector. Each selected triplet contains two matching face thumbnails and one non-matching face thumbnail. The goal of the loss function is to separate the positive pair from the negative by a distance margin.
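The role of L2 normalization and the squared L2 distance can be illustrated with a minimal sketch. The 4-dimensional vectors below are hypothetical stand-ins for raw network outputs (a FaceNet embedding has 128 dimensions), and the helper names `l2_normalize` and `squared_l2_distance` are chosen here for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def squared_l2_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical raw outputs for three face thumbnails (illustrative values only).
anchor = l2_normalize([0.9, 0.1, 0.3, 0.2])
same_person = l2_normalize([0.8, 0.2, 0.3, 0.1])
other_person = l2_normalize([0.1, 0.9, 0.2, 0.7])

d_same = squared_l2_distance(anchor, same_person)
d_other = squared_l2_distance(anchor, other_person)

# A smaller squared distance is read as "more similar".
print(d_same < d_other)
```

Because the embeddings are normalized to unit length, every squared distance lies between 0 and 4, which makes a single distance threshold meaningful across image pairs.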

The deep neural network in the classic FaceNet system is GoogLeNet which uses the Inception module, so it is also called the Inception network.

The original Inception module contains several convolutions of different sizes, namely 1 × 1 convolution, 3 × 3 convolution and 5 × 5 convolution, and also includes a 3 × 3 maximum pooling layer. The features obtained by these convolutional layers and pooling layers are aggregated together as the final output, which is also the input of the next module. The original Inception module is shown in

However, the original Inception module uses large convolution kernels, so its computational cost is high, which limits the number of feature channels that can be used. GoogLeNet therefore uses 1 × 1 convolutions as an optimization: 1 × 1 convolutions first reduce the channel dimension, and convolutions at multiple sizes are then performed and aggregated. The dimension-reduction Inception module is shown in
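The saving from a 1 × 1 reduction can be checked with simple arithmetic. The sketch below counts multiplications for a 5 × 5 branch with and without a 1 × 1 bottleneck; the feature-map size and channel counts (28 × 28 × 192 input, 32 output channels, a 16-channel bottleneck) are assumptions picked for illustration, and `conv_mults` is a hypothetical helper.

```python
def conv_mults(height, width, kernel, in_ch, out_ch):
    """Multiplications for a stride-1 convolution that keeps the spatial size."""
    return height * width * kernel * kernel * in_ch * out_ch


H = W = 28               # assumed feature-map size
IN_CH, OUT_CH = 192, 32  # assumed input and output channels of the 5 x 5 branch
REDUCED_CH = 16          # assumed bottleneck width of the 1 x 1 convolution

# 5 x 5 convolution applied directly to all input channels.
direct = conv_mults(H, W, 5, IN_CH, OUT_CH)

# 1 x 1 reduction first, then the 5 x 5 convolution on the reduced channels.
reduced = conv_mults(H, W, 1, IN_CH, REDUCED_CH) + conv_mults(H, W, 5, REDUCED_CH, OUT_CH)

print(direct, reduced, round(direct / reduced, 2))
```

Under these assumed sizes the bottleneck version needs roughly a tenth of the multiplications, which is why the reduced module can afford more feature channels.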

The entire GoogLeNet network is formed by stacking Inception modules. The entire network has a total of 22 layers. The specific network and parameter configuration are shown in

The modular structure (Inception structure) adopted by GoogLeNet is easy to add and modify. At the end of the network, the average pooling is used to replace the fully connected layer, which can improve the accuracy. However, GoogLeNet’s network model is relatively large, and the calculation speed is also slow.

Type | Size-in | Size-out | Kernel (stride) |
---|---|---|---|
Conv | 224 × 224 × 3 | 112 × 112 × 64 | 7 × 7 × 3 (2) |
Max pool | 112 × 112 × 64 | 56 × 56 × 64 | 3 × 3 × 64 (2) |
Inception (2) | 56 × 56 × 64 | 56 × 56 × 192 | 3 × 3 × 192 (1) |
Max pool | 56 × 56 × 192 | 28 × 28 × 192 | 3 × 3 × 192 (2) |
Inception (3a) | 28 × 28 × 192 | 28 × 28 × 256 | 1 × 1 × 64 (1), 3 × 3 × 128 (1), 5 × 5 × 32 (1), 1 × 1 × 32 (1) |
Inception (3b) | 28 × 28 × 256 | 28 × 28 × 320 | 1 × 1 × 64 (1), 3 × 3 × 128 (1), 5 × 5 × 64 (1), 1 × 1 × 64 (1) |
Inception (3c) | 28 × 28 × 320 | 14 × 14 × 640 | 3 × 3 × 256 (2), 5 × 5 × 64 (2) |
Inception (4a) | 14 × 14 × 640 | 14 × 14 × 640 | 1 × 1 × 256 (1), 3 × 3 × 192 (1), 5 × 5 × 64 (1), 1 × 1 × 128 (1) |
Inception (4b) | 14 × 14 × 640 | 14 × 14 × 640 | 1 × 1 × 224 (1), 3 × 3 × 224 (1), 5 × 5 × 64 (1), 1 × 1 × 128 (1) |
Inception (4c) | 14 × 14 × 640 | 14 × 14 × 640 | 1 × 1 × 192 (1), 3 × 3 × 256 (1), 5 × 5 × 64 (1), 1 × 1 × 128 (1) |
Inception (4d) | 14 × 14 × 640 | 14 × 14 × 640 | 1 × 1 × 160 (1), 3 × 3 × 288 (1), 5 × 5 × 64 (1), 1 × 1 × 128 (1) |
Inception (4e) | 14 × 14 × 640 | 7 × 7 × 1024 | 3 × 3 × 256 (2), 5 × 5 × 128 (2) |
Inception (5a) | 7 × 7 × 1024 | 7 × 7 × 1024 | 1 × 1 × 384 (1), 3 × 3 × 384 (1), 5 × 5 × 128 (1), 1 × 1 × 128 (1) |
Inception (5b) | 7 × 7 × 1024 | 7 × 7 × 1024 | 1 × 1 × 384 (1), 3 × 3 × 384 (1), 5 × 5 × 128 (1), 1 × 1 × 128 (1) |
Avg Pool | 7 × 7 × 1024 | 1 × 1 × 1024 | 7 × 7 × 3 (1) |
FC | 1 × 1 × 1024 | 1 × 1 × 128 | 1024 × 128 (1) |

MobileNet [

The core layer of MobileNet is the depthwise separable filter. Depthwise separable convolution is a form of factorized convolution. A standard convolution filters the input and combines the features into a new set of outputs in a single step. Depthwise separable convolution splits this process into two layers: a depthwise convolution, which filters each input channel separately, and a pointwise convolution, which uses a 1 × 1 convolution to combine the outputs of the depthwise layer. This factorization significantly reduces the computation and the model size.
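The two-layer split can be made concrete with a deliberately naive sketch in plain Python (nested lists, no padding, stride 1). It is meant only to show which sums each layer computes, not how a framework implements them; the function names and the tiny 3 × 3 × 2 input are illustrative assumptions.

```python
def depthwise_conv(x, dw_kernels):
    """Filter each channel with its own K x K kernel.

    x: H x W x M nested list; dw_kernels: K x K x M nested list.
    No padding, stride 1, so the output is (H-K+1) x (W-K+1) x M.
    """
    h, w, m = len(x), len(x[0]), len(x[0][0])
    k = len(dw_kernels)
    out_h, out_w = h - k + 1, w - k + 1
    out = [[[0.0] * m for _ in range(out_w)] for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for c in range(m):
                # Only channel c of the input contributes to channel c of the output.
                out[i][j][c] = sum(
                    x[i + di][j + dj][c] * dw_kernels[di][dj][c]
                    for di in range(k)
                    for dj in range(k)
                )
    return out


def pointwise_conv(x, pw_weights):
    """1 x 1 convolution: mix the M channels of every pixel into N outputs.

    x: H x W x M nested list; pw_weights: M x N nested list.
    """
    n = len(pw_weights[0])
    return [
        [
            [sum(px[c] * pw_weights[c][o] for c in range(len(px))) for o in range(n)]
            for px in row
        ]
        for row in x
    ]


# Toy input: 3 x 3 spatial size, 2 channels (channel 0 is all 1.0, channel 1 is all 2.0).
x = [[[1.0, 2.0] for _ in range(3)] for _ in range(3)]
dw = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]  # 3 x 3 kernels of ones
pw = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]                  # 2 input -> 3 output channels

features = depthwise_conv(x, dw)       # per-channel filtering
result = pointwise_conv(features, pw)  # cross-channel combination
print(result)
```

The depthwise step never mixes channels and the pointwise step never looks at neighbouring pixels, which is exactly the separation that removes the K × K × M × N product of a standard convolution.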

Suppose the size of the input feature map is D_{F} × D_{F} × M, M is the number of input channels, N is the number of output channels, the parameters of a standard convolutional layer are D_{K} × D_{K} × M × N, and D_{K} is the size of the convolution kernel. If the space size of the output feature map remains unchanged, the calculation cost of standard convolution is shown in Equation (1):

$$D_{K} \times D_{K} \times M \times N \times D_{F} \times D_{F} \tag{1}$$

The MobileNet model uses depthwise separable convolutions to break the interaction between the number of output channels and the size of the kernel, which greatly reduces the computational cost. The computational cost of the depthwise convolution is shown in Equation (2):

$$D_{K} \times D_{K} \times M \times D_{F} \times D_{F} \tag{2}$$

Although depthwise convolution is much more efficient than standard convolution, it only filters the input channels and does not combine them to generate new features. An additional 1 × 1 convolution is required to combine the outputs of these filters into new multi-channel features. The computational cost of the complete depthwise separable convolution is the sum of the depthwise and pointwise costs, as shown in Equation (3):

$$D_{K} \times D_{K} \times M \times D_{F} \times D_{F} + M \times N \times D_{F} \times D_{F} \tag{3}$$

By factorizing the standard convolution into a depthwise convolution and a pointwise convolution, the computation is reduced by the ratio shown in Equation (4):

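$$\frac{D_{K} \times D_{K} \times M \times D_{F} \times D_{F} + M \times N \times D_{F} \times D_{F}}{D_{K} \times D_{K} \times M \times N \times D_{F} \times D_{F}} = \frac{1}{N} + \frac{1}{D_{K}^{2}} \tag{4}$$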

With 3 × 3 depthwise separable convolutions, MobileNet uses 8 to 9 times less computation than standard convolutions, which greatly improves the running speed. Therefore, this article uses MobileNet to replace the deep learning model in FaceNet.
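Equations (1), (3), and (4) can be checked numerically. The following sketch evaluates both costs for one layer; the sizes (3 × 3 kernel, 512 input and output channels, 14 × 14 feature map) are assumed for illustration, and the function names are hypothetical.

```python
def standard_cost(dk, m, n, df):
    """Equation (1): multiplications of a standard convolution."""
    return dk * dk * m * n * df * df


def depthwise_separable_cost(dk, m, n, df):
    """Equation (3): depthwise cost plus pointwise (1 x 1) cost."""
    return dk * dk * m * df * df + m * n * df * df


# Assumed layer sizes: D_K = 3, M = N = 512, D_F = 14.
dk, m, n, df = 3, 512, 512, 14

ratio = depthwise_separable_cost(dk, m, n, df) / standard_cost(dk, m, n, df)
closed_form = 1 / n + 1 / dk ** 2  # right-hand side of Equation (4)

print(f"ratio = {ratio:.4f}, closed form = {closed_form:.4f}, speed-up = {1 / ratio:.2f}x")
```

For this assumed layer the ratio is dominated by the 1/D_K² term, so the reduction lands between 8 and 9 times, consistent with the figure quoted for 3 × 3 kernels.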

The original FaceNet network is relatively complex, which affects the size and speed of the model. To make FaceNet easier to deploy on mobile terminals without significantly affecting face recognition accuracy, this paper replaces GoogLeNet with MobileNet and proposes a Fast-FaceNet model based on MobileNet. Its network structure is shown in

In

The percentage of the total parameters and the total calculation amount of each operation of MobileNet in Fast-FaceNet is shown in

It can be seen from

The results of comparing the parameters of MobileNet and GoogLeNet with the amount of calculation are shown in

Type | Calculation (share of total) | Parameters (share of total) |
---|---|---|
Conv 1 × 1 | 94.86% | 74.59% |
Conv DW 3 × 3 | 3.06% | 1.06% |
Conv 3 × 3 | 1.19% | 0.02% |
Fully Connected | 0.18% | 24.33% |

The comparison shows that MobileNet is smaller than GoogLeNet, has fewer parameters, and requires more than 2.5 times less computation. Using MobileNet to improve FaceNet is therefore effective.

The parameter configuration of each network layer of Fast-FaceNet is shown in

Model | Calculation (million Mult-Adds) | Parameters (million) |
---|---|---|
MobileNet | 569 | 4.2 |
GoogLeNet | 1600 | 7.5 |

Type | Size-in | Size-out | Kernel (stride) |
---|---|---|---|
Conv | 224 × 224 × 3 | 112 × 112 × 32 | 3 × 3 × 3 × 32 (2) |
Conv dw | 112 × 112 × 32 | 112 × 112 × 32 | 3 × 3 × 32 dw (1) |
Conv | 112 × 112 × 32 | 112 × 112 × 64 | 1 × 1 × 32 × 64 (1) |
Conv dw | 112 × 112 × 64 | 56 × 56 × 64 | 3 × 3 × 64 dw (2) |
Conv | 56 × 56 × 64 | 56 × 56 × 128 | 1 × 1 × 64 × 128 (1) |
Conv dw | 56 × 56 × 128 | 56 × 56 × 128 | 3 × 3 × 128 dw (1) |
Conv | 56 × 56 × 128 | 56 × 56 × 128 | 1 × 1 × 128 × 128 (1) |
Conv dw | 56 × 56 × 128 | 28 × 28 × 128 | 3 × 3 × 128 dw (2) |
Conv | 28 × 28 × 128 | 28 × 28 × 256 | 1 × 1 × 128 × 256 (1) |
Conv dw | 28 × 28 × 256 | 28 × 28 × 256 | 3 × 3 × 256 dw (1) |
Conv | 28 × 28 × 256 | 28 × 28 × 256 | 1 × 1 × 256 × 256 (1) |
Conv dw | 28 × 28 × 256 | 14 × 14 × 256 | 3 × 3 × 256 dw (2) |
Conv | 14 × 14 × 256 | 14 × 14 × 512 | 1 × 1 × 256 × 512 (1) |
5× (Conv dw + Conv) | 14 × 14 × 512 | 14 × 14 × 512 | 3 × 3 × 512 dw (1), 1 × 1 × 512 × 512 (1) |
Conv dw | 14 × 14 × 512 | 7 × 7 × 512 | 3 × 3 × 512 dw (2) |
Conv | 7 × 7 × 512 | 7 × 7 × 1024 | 1 × 1 × 512 × 1024 (1) |
Conv dw | 7 × 7 × 1024 | 7 × 7 × 1024 | 3 × 3 × 1024 dw (1) |
Conv | 7 × 7 × 1024 | 7 × 7 × 1024 | 1 × 1 × 1024 × 1024 (1) |
Avg Pool | 7 × 7 × 1024 | 1 × 1 × 1024 | 7 × 7 × 3 (1) |
FC | 1 × 1 × 1024 | 1 × 1 × 1024 | 1024 × 1000 (1) |

This paper trains the neural network with the triplet-based loss function derived from large margin nearest neighbor (LMNN) classification. The network directly outputs a 128-dimensional embedding vector.

A triplet is a group of three images, so the loss function is computed from three inputs: Anchor, Positive, and Negative. Anchor is the reference image, Positive is an image of the same category (identity) as the Anchor, and Negative is an image of a different category from the Anchor.

The loss function makes the feature distance between images of the same identity as small as possible and the feature distance between images of different identities as large as possible. Therefore, the Euclidean distance between the features of two images directly indicates whether the two images are similar. The process of Triplet Loss is shown in

As shown in

$$L = \sum_{i}^{N} \left[ \left\| f(x_{i}^{a}) - f(x_{i}^{p}) \right\|_{2}^{2} - \left\| f(x_{i}^{a}) - f(x_{i}^{n}) \right\|_{2}^{2} + \alpha \right]_{+} \tag{5}$$

where the first squared L2 term is the intra-class distance (Anchor to Positive), the second squared L2 term is the inter-class distance (Anchor to Negative), and α is a constant margin. Equation (5) only optimizes the triplets that do not yet satisfy the margin condition; triplets that already satisfy it contribute zero loss and are ignored. During optimization, gradient descent is used to decrease the loss function continuously, so that the intra-class distance keeps decreasing and the inter-class distance keeps increasing.
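Equation (5) can be written out in a few lines of plain Python. This is a minimal sketch for illustration, not the training code used in the paper: the 2-dimensional unit-length embeddings are made-up values, and `triplet_loss` and `squared_l2` are hypothetical helper names.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Sum of hinge terms [d(a, p) - d(a, n) + alpha]_+ over all triplets."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(squared_l2(a, p) - squared_l2(a, n) + alpha, 0.0)
    return total


# Two hypothetical triplets sharing the same anchor and positive.
anchors = [(1.0, 0.0), (1.0, 0.0)]
positives = [(0.8, 0.6), (0.8, 0.6)]
negatives = [
    (-1.0, 0.0),  # far from the anchor: margin satisfied, contributes 0
    (0.8, -0.6),  # as close as the positive: margin violated, contributes alpha
]

loss = triplet_loss(anchors, positives, negatives, alpha=0.2)
print(round(loss, 4))
```

Only the second triplet produces a non-zero term, which matches the statement that triplets already satisfying the margin are ignored.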

The choice of triplets is crucial to the convergence of the model. In actual training, it is unrealistic to compute the maximum and minimum distances between images across all training samples, and incorrectly labeled images would also make convergence difficult. Therefore, this article uses mini-batches of 64 samples and selects triplets online within each mini-batch. In each mini-batch, two face pictures of a single individual are selected as the positive pair, and other face pictures are randomly selected as negative samples. To avoid premature convergence caused by improper selection of negative samples, this paper uses Equation (6) to filter negative samples:

$$\left\| f(x_{i}^{a}) - f(x_{i}^{p}) \right\|_{2}^{2} < \left\| f(x_{i}^{a}) - f(x_{i}^{n}) \right\|_{2}^{2} \tag{6}$$
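The filter in Equation (6) can be sketched as a function that keeps only the candidate negatives lying farther from the anchor than the positive does. The optional `margin` argument, which additionally requires the negative to fall inside the margin, is an assumption added here for illustration and is not stated in the text; the embeddings are made-up 2-dimensional values.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_negatives(anchor, positive, candidates, margin=None):
    """Return indices of candidates that satisfy Equation (6).

    If margin is given (an assumption, not part of Equation (6)), a candidate
    must also lie within d(anchor, positive) + margin of the anchor.
    """
    d_ap = squared_l2(anchor, positive)
    kept = []
    for idx, cand in enumerate(candidates):
        d_an = squared_l2(anchor, cand)
        if d_ap < d_an and (margin is None or d_an < d_ap + margin):
            kept.append(idx)
    return kept


anchor = (1.0, 0.0)
positive = (0.8, 0.6)      # squared distance to anchor: about 0.4
candidates = [
    (0.96, 0.28),          # about 0.08: closer than the positive, filtered out
    (0.6, 0.8),            # about 0.8
    (0.0, 1.0),            # 2.0
    (-1.0, 0.0),           # 4.0
]

by_equation_6 = select_negatives(anchor, positive, candidates)
inside_margin = select_negatives(anchor, positive, candidates, margin=0.5)
print(by_equation_6, inside_margin)
```

In a mini-batch of 64 samples, a filter of this kind would be applied to the embeddings computed in the current forward pass, so the selection tracks the network as it trains.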

To verify the proposed model, the Fast-FaceNet model is trained on the CASIA-WebFace and VGGFace2 datasets, and the trained model is tested on the LFW dataset. All experiments in this article use TensorFlow, Google's open-source deep learning platform, which expresses computations as data-flow graphs over tensors. The platform is mainly used to build and run neural network models and is easy to use.

In this paper, the MobileNet model is trained by stochastic gradient descent with the AdaGrad optimizer. The learning rate is 0.02 and the margin α is set to 0.2. After 300 hours of training on a CPU cluster, the loss decreases significantly. FaceNet needs only a small amount of image processing (cropping the face area, without additional preprocessing such as 3D alignment) before an image can be used as input to the model. In this article, we therefore first run a face detector (implemented with MTCNN) on each image to generate a tight bounding box around each face, and then resize these face thumbnails to 224 × 224 as input.
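For readers unfamiliar with AdaGrad, the textbook update rule is sketched below with the learning rate of 0.02 given above. This is a generic illustration on a one-parameter toy objective, not the training loop of this paper; framework implementations (including TensorFlow's) may differ in details such as the initial accumulator value, and `adagrad_step` is a hypothetical helper.

```python
import math


def adagrad_step(params, grads, accum, lr=0.02, eps=1e-8):
    """One in-place AdaGrad update.

    accum holds the running sum of squared gradients per parameter, so
    parameters with a large gradient history take smaller steps.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3), starting from w = 0.
params = [0.0]
accum = [0.0]

adagrad_step(params, [2 * (params[0] - 3)], accum)
first_step = params[0]  # the first step has magnitude close to lr

for _ in range(200):
    adagrad_step(params, [2 * (params[0] - 3)], accum)

print(first_step, params[0])
```

The first step has a magnitude of about the learning rate regardless of the gradient scale, and later steps shrink as squared gradients accumulate.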

Although the basic MobileNet is already very small and has low latency, this article tests whether MobileNet can be reduced further, so that Fast-FaceNet runs even faster when MobileNet replaces the original GoogLeNet in FaceNet. To this end, it introduces a parameter called the width multiplier θ, whose function is to thin every layer of the network uniformly. For a given layer and width multiplier θ, the number of input channels becomes θM and the number of output channels becomes θN, where θ ∈ (0, 1]. Fast-FaceNet is trained with θ = 0.25, 0.5, 0.75, and 1 to obtain different network widths and evaluated on the LFW dataset. The results are shown in
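The effect of the width multiplier on one layer follows from substituting θM and θN into Equation (3). The sketch below evaluates that cost for the four values of θ; the layer sizes (3 × 3 kernel, 512 channels, 14 × 14 feature map) are assumed for illustration, and whole-network figures will differ because the first layer and the classifier scale differently.

```python
def separable_cost(dk, m, n, df, theta=1.0):
    """Equation (3) with theta * M input channels and theta * N output channels."""
    m_t, n_t = int(theta * m), int(theta * n)
    return dk * dk * m_t * df * df + m_t * n_t * df * df


# Assumed layer sizes: D_K = 3, M = N = 512, D_F = 14.
dk, m, n, df = 3, 512, 512, 14
full = separable_cost(dk, m, n, df, theta=1.0)

relative = {}
for theta in (0.25, 0.5, 0.75, 1.0):
    relative[theta] = separable_cost(dk, m, n, df, theta) / full
    print(f"theta = {theta:.2f}: {relative[theta]:.3f} of the full-width cost")
```

Because the pointwise term dominates and scales with θ², the per-layer cost falls roughly quadratically with θ.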

As shown in

As can be seen in

To test the effect of Fast-FaceNet on video face recognition, a clip of a film and television video was taken from the Internet. For two objects respectively input as shown in

As shown in

As can be seen in

Model | Parameters (million) | Calculation (million Mult-Adds) | Accuracy | CPU processing time for a picture |
---|---|---|---|---|
0.25Fast-FaceNet | 0.9 | 62 | 79.53% | 26 |
0.50Fast-FaceNet | 1.6 | 181 | 88.02% | 45 |
0.75Fast-FaceNet | 3.1 | 392 | 97.74% | 88 |
1.00Fast-FaceNet | 4.8 | 684 | 98.63% | 137 |

Model | Training dataset | Accuracy | CPU processing time for a picture |
---|---|---|---|
FaceNet | CASIA-WebFace | 99.05% | 245 |
FaceNet | VGGFace2 | 99.65% | 245 |
0.75Fast-FaceNet | CASIA-WebFace | 97.25% | 88 |
0.75Fast-FaceNet | VGGFace2 | 97.74% | 88 |
1.00Fast-FaceNet | CASIA-WebFace | 98.15% | 137 |
1.00Fast-FaceNet | VGGFace2 | 98.63% | 137 |

Metric | FaceNet | 0.75Fast-FaceNet | 1.0Fast-FaceNet |
---|---|---|---|
Precision | 0.8784 | 0.8314 | 0.8549 |
Recall | 0.7563 | 0.7095 | 0.7388 |
F1-score | 0.8128 | 0.7656 | 0.7926 |
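The F1-scores in the table are the harmonic mean of the reported precision and recall, which can be confirmed with a short check. The sketch assumes the standard definition F1 = 2PR / (P + R); the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for video face recognition.
reported = {
    "FaceNet": (0.8784, 0.7563),
    "0.75Fast-FaceNet": (0.8314, 0.7095),
    "1.0Fast-FaceNet": (0.8549, 0.7388),
}

f1 = {model: round(f1_score(p, r), 4) for model, (p, r) in reported.items()}
for model, value in f1.items():
    print(f"{model}: F1 = {value}")
```

All three recomputed values agree with the reported F1-scores to four decimal places.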

Based on the classic FaceNet, this paper introduced the lightweight model MobileNet and proposed a lightweight FaceNet based on MobileNet. The paper first introduced the classic FaceNet model, then introduced MobileNet and proposed Fast-FaceNet. Fast-FaceNet was trained on the CASIA-WebFace and VGGFace2 datasets and tested on the LFW dataset. Finally, Fast-FaceNet was applied to video face recognition. The experiments show that Fast-FaceNet greatly improves the recognition speed while maintaining a certain level of recognition accuracy.

This work was supported by the National Natural Science Foundation of China (61976217), the Fundamental Research Funds for the Central Universities (No. 2019XKQYMS87), and the Opening Foundation of Key Laboratory of Opto-technology and Intelligent Control, Ministry of Education (KFKT2020-3).

The authors declare no conflicts of interest regarding the publication of this paper.

Xu, X.Z., Du, M., Guo, H.X., Chang, J.Y. and Zhao, X.Y. (2021) Lightweight FaceNet Based on MobileNet. International Journal of Intelligence Science, 11, 1-16. https://doi.org/10.4236/ijis.2021.111001