Lightweight FaceNet Based on MobileNet

Face recognition is a biometric technology that identifies people by their faces. Early machine face recognition was slow and less accurate than manual recognition. With the rapid development of deep learning and the application of Convolutional Neural Networks (CNNs) to face recognition, accuracy has improved greatly. FaceNet is a deep learning framework commonly used for face recognition in recent years. It uses the deep learning model GoogLeNet, which achieves high accuracy in face recognition. However, its network structure is large, which makes FaceNet slow to run. Therefore, to improve running speed without affecting the recognition accuracy of FaceNet, this paper proposes a lightweight FaceNet model based on MobileNet. The main work is as follows: based on an analysis of the low running speed of FaceNet and the principle of MobileNet, a lightweight FaceNet model based on MobileNet is proposed. The model reduces the overall computation of the network by using depthwise separable convolutions. The model is trained on the CASIA-WebFace and VGGFace2 datasets and tested on the LFW dataset. Experimental results show that the model greatly reduces the number of network parameters while maintaining accuracy, and thus increases computing speed. The model can also recognize the face of a specific person in video.

ing [1], reinforcement learning [2], cluster analysis [3], support vector machines [4] and other branches. As an emerging field of artificial intelligence, deep learning aims to use computers to simulate how the human brain thinks and learns. Its rapid development in computer vision [5], natural language processing [6], data mining and robotics in recent years has opened a new chapter in the history of artificial intelligence.
Computer vision is an application of machine learning to vision and an important part of artificial intelligence. Its purpose is to collect pictures or videos, analyze them, and obtain the required information. Computer vision is widely used today: video surveillance, autonomous driving, medical treatment, face-based attendance check-in, and payment are all supported by it. Computer vision can be studied from the perspectives of object vision and space vision. Object vision determines the type of an object, while space vision determines its position and shape. Current tasks in computer vision include image classification [7], face recognition [8] [9] [10] and object detection [11]. Among them, face recognition has always been of great significance.
Over time, face recognition has gradually evolved from manual recognition to machine recognition, and the accuracy of machine recognition has long surpassed that of humans. Face recognition is a biometric technology that identifies people by their facial features.
Compared with other biometrics, face recognition has the advantages of naturalness, uniqueness, and inconsistency. Other biometric methods such as fingerprint recognition and iris recognition are not natural, and require pressure sensors and other equipment. Face recognition is used not only in video surveillance and finance, but also in many other scenarios, such as transportation, education, medical care, and e-commerce. The rapid development of deep learning has made deep learning models widely used in face recognition. Since their introduction, deep neural network models have been widely used in computer vision tasks such as image classification and face detection [12], and have achieved very good results. The appearance of deep learning models such as the Convolutional Neural Network (CNN) [13] and neural networks based on probabilistic decision-making greatly improved the accuracy of face recognition. With the rapid development of deep learning, face recognition technology continues to reach new heights, and the proposal of FaceNet [14] [15] raised the recognition accuracy on the LFW dataset to more than 99%. Face recognition problems can generally be divided into face detection and face recognition. Face detection not only determines whether there is a face in the photo, but also removes the unrelated parts of the picture. In the early days, face detection and face recognition could only be achieved separately through different algorithm frameworks. To realize face detection and face recognition [...] FaceNet can be used for face detection, recognition and clustering [16].
Pose and illumination have long been problems in face recognition. The traditional CNN-based face recognition method uses a Siamese (twin) CNN [17] to extract face features and then classifies them with a Support Vector Machine (SVM) or similar methods. FaceNet, in contrast, directly learns a mapping from images to points in Euclidean space, so that the distance between the features of two images directly indicates whether the two images are similar. The Euclidean distance between image features is shown in Figure 1, where the numbers are the Euclidean distances between the features of each pair of images. A Euclidean distance of 1.1 is used as the threshold: when the distance is greater than 1.1, the faces in the two images are judged to belong to different people, and when it is less than 1.1, they are judged to belong to the same person.
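The threshold rule described above can be illustrated with a minimal Python sketch. The function names and the toy 4-dimensional vectors are hypothetical and chosen only for illustration; real FaceNet embeddings are 128-dimensional, and the text does not say how a distance of exactly 1.1 is treated.

```python
import numpy as np

THRESHOLD = 1.1  # decision threshold on the Euclidean distance, as stated in the text


def euclidean_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Euclidean (L2) distance between two embedding vectors."""
    return float(np.linalg.norm(emb_a - emb_b))


def same_person(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = THRESHOLD) -> bool:
    """Return True when the distance is below the threshold.

    Treating a distance exactly equal to the threshold as "different"
    is an assumption; the text only covers the strict cases.
    """
    return euclidean_distance(emb_a, emb_b) < threshold


# Toy embeddings (hypothetical values, not taken from Figure 1).
a = np.array([0.5, 0.5, 0.5, 0.5])
b = np.array([0.6, 0.4, 0.5, 0.5])    # close to a
c = np.array([-0.5, -0.5, 0.5, 0.5])  # far from a

print(same_person(a, b), same_person(a, c))
```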
FaceNet has two different deep network structures, both of which are deep convolutional networks. The first structure is based on the Zeiler & Fergus model [18], which consists of multiple interleaved layers: convolutional layers, nonlinear activation layers, local response normalization layers, and max pooling layers. The second structure is based on the Inception model of [...]
The main contributions of this article are as follows:
1) A Fast-FaceNet model based on MobileNet is proposed to reduce the overall computation of the network.
2) Fast-FaceNet is applied to video face recognition to improve recognition speed while maintaining a certain level of recognition accuracy.
This paper is divided into five parts. Section 1 introduces the relevant background and the related work of recent years, and summarizes the main work and organization of this paper. Section 2 introduces the relevant basic theory. Sections 3 and 4 form the core of the paper: the model architecture and the analysis of experimental results. The final part summarizes the entire article.

FaceNet Basic Structure
The FaceNet system directly maps face images to a compact Euclidean space, where distances directly correspond to a measure of face similarity. Once this space has been produced, standard techniques can use the FaceNet embeddings as feature vectors to perform tasks such as face recognition, verification, and clustering. An advantage of this model is that images need only a small amount of processing before they can be used as input. At the same time, the model achieves very high accuracy on the dataset. FaceNet can be widely used for face recognition on mobile terminals. Its network structure is shown in Figure 2.
The FaceNet network consists of a batch input layer and a deep convolutional network, followed by L2 normalization, which yields the face embedding. Finally, the triplet loss is computed so that the distance between faces of the same person is as small as possible and the distance between faces of different people is as large as possible. The network uses a deep convolutional neural network to learn a Euclidean embedding of each image, and is trained so that the squared L2 distance in the embedding space directly corresponds to face similarity. FaceNet trains the neural network directly with a triplet-based loss function derived from LMNN (Large Margin Nearest Neighbor) classification, and the network output is a 128-dimensional embedding vector. Each selected triplet contains two matching face thumbnails and one non-matching face thumbnail. The objective of the loss function is to separate the positive pair from the negative by a distance margin.
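The L2 normalization step and the squared L2 distance can be sketched as follows. This is only an illustration under stated assumptions: the random array stands in for the outputs of the deep convolutional network, and the helper names are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of x to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)  # eps guards against division by zero


def squared_l2(u: np.ndarray, v: np.ndarray) -> float:
    """Squared Euclidean distance between two vectors."""
    return float(np.sum((u - v) ** 2))


rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 128))  # stand-in for the CNN outputs of three face crops
emb = l2_normalize(raw)          # 128-dimensional embeddings on the unit hypersphere

print(np.linalg.norm(emb, axis=1))    # every row now has norm 1
print(squared_l2(emb[0], emb[1]))     # lies in [0, 4] for unit vectors
```

Because the embeddings have unit length, the squared L2 distance between any two of them is bounded by 4, which is what makes a fixed margin or threshold meaningful across images.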

GoogLeNet
The deep neural network in the classic FaceNet system is GoogLeNet which uses the Inception module, so it is also called the Inception network.
The original Inception module contains convolutions of several sizes, namely 1 × 1, 3 × 3 and 5 × 5 convolutions, together with a 3 × 3 max pooling layer. The features produced by these convolutional and pooling layers are concatenated to form the final output, which is also the input of the next module. The original Inception module is shown in Figure 3. However, the larger convolution kernels in the original module are computationally expensive, which limits the number of feature channels that can be used. GoogLeNet therefore uses 1 × 1 convolutions as an optimization: 1 × 1 convolutions first reduce the channel dimension, and convolutions of multiple sizes are then applied in parallel and aggregated. The Inception module with dimension reduction is shown in Figure 4.
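The saving from the 1 × 1 reduction can be made concrete with a small worked calculation. The layer sizes below are illustrative assumptions rather than values taken from Table 1, and the helper function is hypothetical; it counts multiply-accumulate operations for a stride-1 convolution that preserves the spatial size.

```python
def conv_mult_adds(kernel: int, in_ch: int, out_ch: int, out_h: int, out_w: int) -> int:
    """Multiply-accumulate count of a stride-1, size-preserving convolution (bias ignored)."""
    return kernel * kernel * in_ch * out_ch * out_h * out_w


# Illustrative sizes (assumptions): a 28 x 28 feature map with 192 input channels,
# a 5 x 5 branch producing 32 channels, and a 1 x 1 reduction to 16 channels.
H = W = 28
IN_CH, REDUCED_CH, OUT_CH = 192, 16, 32

# 5 x 5 convolution applied directly to all input channels.
direct = conv_mult_adds(5, IN_CH, OUT_CH, H, W)

# 1 x 1 reduction first, then the 5 x 5 convolution on the reduced channels.
reduced = (conv_mult_adds(1, IN_CH, REDUCED_CH, H, W)
           + conv_mult_adds(5, REDUCED_CH, OUT_CH, H, W))

print(direct, reduced, direct / reduced)
```

Under these assumed sizes the reduced branch needs roughly a tenth of the operations of the direct 5 × 5 convolution, which is why the reduction allows more feature channels within the same budget.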
The entire GoogLeNet network is formed by stacking Inception modules. The entire network has a total of 22 layers. The specific network and parameter configuration are shown in Table 1.
The modular Inception structure adopted by GoogLeNet is easy to extend and modify. At the end of the network, average pooling replaces the fully connected layer, which can improve accuracy. However, the GoogLeNet model is relatively large, and its computation is slow.

MobileNet
MobileNet [23] [24] [25] is a lightweight deep neural network with a streamlined architecture built from depthwise separable convolutions. To reach a given level of accuracy in face recognition, the network used by FaceNet is relatively complex, and this complexity affects the size and speed of the model. For example, when the model is used in autonomous driving or criminal detection, the real-time requirements of the visual task must be considered because of the limited computing power of the platform. MobileNet proposes an efficient architecture with hyperparameters that make the model smaller and faster, which is very practical for face recognition systems.
The core layer of MobileNet is the depthwise separable filter. Depthwise separable convolution is a form of factorized convolution. A standard convolution filters the input and combines the filtered features into a new set of outputs in a single step. Depthwise separable convolution splits this process into two layers: a depthwise convolution, which extracts features from each input channel separately, and a pointwise convolution, which uses a 1 × 1 convolution to combine the outputs of the depthwise step. This factorization significantly reduces computation and model size.
The MobileNet model uses depthwise separable convolutions to break the interaction between the number of output channels and the kernel size, which greatly reduces the computational cost. The computational cost of depthwise convolution is given in Equation (2). Although depthwise convolution is much more efficient than standard convolution, it only filters the input channels and does not combine them to generate new features. An additional 1 × 1 convolution is therefore required to combine the filtered features into new multi-channel features.
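The two-step factorization can be sketched in NumPy as below. This is a naive illustration, not MobileNet's implementation: it assumes stride 1, no padding, and no batch normalization or activation, and all function names are hypothetical. As in deep learning frameworks, "convolution" here means cross-correlation. The final comparison checks that the depthwise plus pointwise pair equals a standard convolution whose kernel is the product of the two sets of weights.

```python
import numpy as np


def depthwise_conv(x: np.ndarray, dw: np.ndarray) -> np.ndarray:
    """Depthwise convolution, stride 1, no padding.

    x:  input of shape (H, W, M)
    dw: one K x K filter per input channel, shape (K, K, M)
    """
    h, w, m = x.shape
    k = dw.shape[0]
    out = np.zeros((h - k + 1, w - k + 1, m))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]                  # (K, K, M)
            out[i, j, :] = np.sum(patch * dw, axis=(0, 1))  # each channel filtered separately
    return out


def pointwise_conv(x: np.ndarray, pw: np.ndarray) -> np.ndarray:
    """1 x 1 convolution: mixes channels with a weight matrix of shape (M, N)."""
    return x @ pw


def standard_conv(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Standard convolution, stride 1, no padding, kernel of shape (K, K, M, N)."""
    h, w, _ = x.shape
    k, n = kernel.shape[0], kernel.shape[3]
    out = np.zeros((h - k + 1, w - k + 1, n))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]
            out[i, j, :] = np.einsum("klm,klmn->n", patch, kernel)
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 3))    # H = W = 6, M = 3 input channels
dw = rng.normal(size=(3, 3, 3))   # K = 3 depthwise filters
pw = rng.normal(size=(3, 4))      # N = 4 output channels

separable = pointwise_conv(depthwise_conv(x, dw), pw)

# Equivalent standard kernel: product of depthwise and pointwise weights.
equivalent_kernel = dw[:, :, :, None] * pw[None, None, :, :]
reference = standard_conv(x, equivalent_kernel)

print(separable.shape, np.allclose(separable, reference))
```

The separable layer stores K·K·M + M·N weights instead of K·K·M·N, at the price of being able to represent only kernels of this factorized form.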
The computational cost of the complete depthwise separable convolution is the sum of the costs of the depthwise and pointwise convolutions, as given in Equation (3). Factorizing the standard convolution into a depthwise convolution and a pointwise convolution reduces the computation by the ratio given in Equation (4). With 3 × 3 kernels, MobileNet's depthwise separable convolutions use 8 to 9 times less computation than standard convolutions, which greatly improves running speed. Therefore, this article uses MobileNet to replace the deep learning model in FaceNet.
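For reference, the costs discussed above take the following form in the original MobileNet formulation. The symbols are assumptions borrowed from that formulation rather than taken from this text: D_K is the kernel width, D_F the width of the output feature map, M the number of input channels, and N the number of output channels. The numbering is assumed to match Equations (1) to (4), with (1) being the cost of a standard convolution.

```latex
\begin{align}
C_{\text{std}} &= D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \tag{1}\\
C_{\text{dw}}  &= D_K \cdot D_K \cdot M \cdot D_F \cdot D_F \tag{2}\\
C_{\text{sep}} &= D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \tag{3}\\
\frac{C_{\text{sep}}}{C_{\text{std}}} &= \frac{1}{N} + \frac{1}{D_K^{2}} \tag{4}
\end{align}
```

With D_K = 3 the ratio is 1/N + 1/9, so for layers with many output channels the cost falls to a little over one ninth of the standard convolution, which agrees with the 8 to 9 times reduction quoted above.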

Network model design
The original FaceNet network is relatively complex, and this complexity affects the size and speed of the model. So that FaceNet can be deployed more easily on mobile terminals without affecting face recognition accuracy, this paper replaces GoogLeNet with MobileNet and proposes a Fast-FaceNet model based on MobileNet to improve the practicality of FaceNet. Its network structure is shown in Figure 8.
In Figure 8, Batch refers to the input face image samples, which have been located by face detection and cropped to a fixed size. Features are then extracted by the lightweight MobileNet model and L2-normalized. Finally, the network is trained with the Triplet loss function so that the feature distance between images of the same identity is as small as possible and the feature distance between images of different identities is as large as possible.
The percentage of the total parameters and the total calculation amount of each operation of MobileNet in Fast-FaceNet is shown in Table 2.
It can be seen from Table 2 that MobileNet spends 95% of its computation time in 1 × 1 convolutions, which also contain 75% of the parameters; almost all the remaining parameters are in the fully connected layer. A 1 × 1 convolution requires no reordering in memory and can be implemented directly as a general matrix multiplication (GEMM), which improves running speed.
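The remark about matrix multiplication can be illustrated with a short NumPy sketch. The function name and array sizes are hypothetical; the point is only that a 1 × 1 convolution over an (H, W, M) feature map is a single (H·W, M) by (M, N) matrix product, with no patch extraction step.

```python
import numpy as np


def pointwise_conv_gemm(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Apply a 1 x 1 convolution as a single matrix multiplication.

    x:       feature map of shape (H, W, M)
    weights: 1 x 1 kernel stored as an (M, N) matrix
    """
    h, w, m = x.shape
    flat = x.reshape(h * w, m)  # for a contiguous array this is a view, not a copy
    return (flat @ weights).reshape(h, w, weights.shape[1])


rng = np.random.default_rng(0)
x = rng.normal(size=(7, 7, 8))        # illustrative sizes
weights = rng.normal(size=(8, 16))

out = pointwise_conv_gemm(x, weights)
reference = np.einsum("hwm,mn->hwn", x, weights)  # per-pixel channel mixing

print(out.shape, np.allclose(out, reference))
```

Larger kernels, by contrast, usually need an intermediate rearrangement of input patches before they can be handed to a matrix-multiplication routine.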
The parameters and computation of MobileNet and GoogLeNet are compared in Table 3. The comparison shows that MobileNet is smaller than GoogLeNet, has fewer parameters, and requires more than 2.5 times less computation. Using MobileNet to improve FaceNet is therefore effective.
The parameter configuration of each network layer of Fast-FaceNet is shown in Table 4.

Selection of Loss Function
This paper trains the neural network with a triplet-based loss function derived from the large margin nearest neighbor (LMNN) classification algorithm. The network directly outputs a 128-dimensional embedding vector. A triplet consists of three elements: Anchor, Positive, and Negative. Anchor is the reference image, Positive is an image of the same category as the Anchor, and Negative is an image of a different category from the Anchor.
The loss function makes the feature distance between images of the same identity as small as possible and the feature distance between images of different identities as large as possible. As a result, the distance between the points that two images map to in the Euclidean feature space directly indicates whether the two images are similar. The process of Triplet Loss is shown in Figure 9.
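In the notation of the original FaceNet formulation, the requirement described above and the resulting loss are commonly written as follows. The symbols are assumptions introduced here for illustration: f(·) is the embedding function, x_i^a, x_i^p and x_i^n are the Anchor, Positive and Negative images of the i-th triplet, α is the margin, and [·]_+ denotes max(·, 0).

```latex
\begin{gather}
\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 + \alpha < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 \\
L = \sum_{i} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+
\end{gather}
```

Triplets that already satisfy the inequality contribute zero to the sum, so only violating triplets produce a gradient.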
As shown in Figure 9, the purpose of Triplet Loss is to embed the face images according to formula (5), where the L2 term on the left is the intra-class distance, the L2 term on the right is the inter-class distance, and α is a constant. The meaning of formula (5) is to optimize the triplets that do not meet the condition and to set aside and ignore those that already meet it. During optimization, gradient descent is used to decrease the loss function continuously, that is, to keep decreasing the intra-class distances and increasing the inter-class distances. The choice of triplets is crucial to the convergence of the model. In actual training, it is unrealistic to compute the maximum and minimum distances between images across all training samples, and mislabeled images also make convergence difficult. Therefore, this article takes every 64 samples as a mini-batch and selects triplets online within each mini-batch. In each mini-batch, two face images of a single individual are selected as the positive pair, and other face images are randomly selected as negative samples. To avoid premature convergence caused by improper selection of negative samples, this paper uses Equation (6) to filter negative samples.
Figure 9. Triplet loss process.
International Journal of Intelligence Science
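The loss and the negative filtering step can be sketched in NumPy as follows. This is a sketch under stated assumptions, not the paper's code: the hinge form of the loss follows the original FaceNet formulation, and the filter assumes that Equation (6) is the "semi-hard" condition from that formulation (negatives farther from the Anchor than the Positive, but still inside the margin). All function names and toy vectors are hypothetical.

```python
import numpy as np

ALPHA = 0.2  # margin; the value used in the experiments of this paper


def squared_l2(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance along the last axis."""
    return np.sum((u - v) ** 2, axis=-1)


def triplet_loss(anchor, positive, negative, alpha: float = ALPHA) -> float:
    """Hinge-style triplet loss summed over a batch of (N, D) embeddings."""
    pos_dist = squared_l2(anchor, positive)
    neg_dist = squared_l2(anchor, negative)
    # Triplets that already satisfy the margin contribute zero.
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))


def semi_hard_negatives(anchor, positive, candidates, alpha: float = ALPHA) -> np.ndarray:
    """Indices of candidates farther than the positive but inside the margin (assumed filter)."""
    pos_dist = squared_l2(anchor, positive)
    neg_dist = squared_l2(anchor, candidates)  # broadcasts over the candidate rows
    mask = (neg_dist > pos_dist) & (neg_dist < pos_dist + alpha)
    return np.flatnonzero(mask)


# Toy 2-dimensional example (hypothetical values).
anchor = np.array([0.0, 0.0])
positive = np.array([0.3, 0.0])   # squared distance 0.09
candidates = np.array([
    [0.2, 0.0],   # 0.04: closer than the positive (hard negative)
    [0.4, 0.0],   # 0.16: farther than the positive, inside the margin (semi-hard)
    [1.0, 0.0],   # 1.00: already satisfies the margin (easy negative)
])

selected = semi_hard_negatives(anchor, positive, candidates)
print(selected)
print(triplet_loss(anchor[None], positive[None], candidates[selected]))
```

Easy negatives give zero loss and therefore no gradient, while the hardest negatives can destabilize early training, which is the usual motivation for filtering to the band in between.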

Experimental Results and Analysis
To verify the proposed model, the CASIA-WebFace and VGGFace2 datasets are used to train Fast-FaceNet, and the trained model is tested on the LFW dataset. All experiments in this article run on TensorFlow, Google's open-source deep learning platform, which expresses numerical computation as data flow graphs. The platform is mainly used to analyze and process neural network models and is easy to use. The MobileNet model is trained by stochastic gradient descent with the AdaGrad optimizer and a learning rate of 0.02. The margin α is set to 0.2, and after 300 hours of training on the CPU cluster the loss decreases significantly. FaceNet needs only a small amount of image processing before an image can be used as model input: the face region is cropped, with no additional preprocessing such as 3D alignment. In this article, we therefore first run a face detector (implemented with MTCNN) on each image to generate a tight bounding box around each face, and then resize these face thumbnails to 224 × 224 as input.
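The AdaGrad update rule mentioned above can be sketched in a few lines of NumPy. This illustrates the general algorithm only, not the TensorFlow optimizer used in the experiments: the zero initial accumulator, the epsilon value, and the toy objective are assumptions, and library implementations differ in these details.

```python
import numpy as np

LEARNING_RATE = 0.02  # value reported in the text


def adagrad_step(params, grads, accum, lr=LEARNING_RATE, eps=1e-8):
    """One AdaGrad update; returns the new parameters and the new accumulator."""
    accum = accum + grads ** 2                             # running sum of squared gradients
    params = params - lr * grads / (np.sqrt(accum) + eps)  # per-parameter scaled step
    return params, accum


# Toy problem (hypothetical): minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
acc = np.zeros_like(w)

w1, acc1 = adagrad_step(w, 2 * w, acc)
print(w1, acc1)
```

Because each coordinate is divided by the root of its own accumulated squared gradients, the effective step size shrinks over time and differs per parameter.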
Although the basic MobileNet is already very small and has very low latency, we test whether it can be reduced further, and whether Fast-FaceNet can run faster, when MobileNet replaces the original GoogLeNet in FaceNet. To this end, this article introduces a parameter called the width multiplier θ, whose function is to thin every layer of the network uniformly. For a given layer and width multiplier θ, the number of input channels M becomes θM and the number of output channels N becomes θN, where θ ∈ (0, 1]. Fast-FaceNet is trained with network widths θ = 0.25, 0.5, 0.75 and 1 and evaluated on the LFW dataset. The results are shown in Table 5.
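The effect of the width multiplier on the cost of a single depthwise separable layer can be illustrated with a short calculation. The layer sizes and the function name are illustrative assumptions, not values from Table 4 or Table 5, and the cost model counts multiply-accumulate operations only.

```python
def separable_cost(kernel: int, in_ch: int, out_ch: int, fmap: int, theta: float = 1.0) -> int:
    """Multiply-accumulates of one depthwise separable layer with width multiplier theta."""
    m = int(theta * in_ch)    # thinned number of input channels
    n = int(theta * out_ch)   # thinned number of output channels
    depthwise = kernel * kernel * m * fmap * fmap
    pointwise = m * n * fmap * fmap
    return depthwise + pointwise


# Illustrative layer (assumed sizes): 3 x 3 kernel, 512 -> 512 channels, 14 x 14 feature map.
baseline = separable_cost(3, 512, 512, 14)
for theta in (0.25, 0.5, 0.75, 1.0):
    cost = separable_cost(3, 512, 512, 14, theta)
    print(f"theta={theta:<4} cost={cost:>10} relative={cost / baseline:.3f}")
```

Since the pointwise term dominates and scales with θM·θN, the cost of the layer falls roughly in proportion to θ squared.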
As shown in Table 5, the recognition accuracy and speed of the whole system on the LFW dataset change with the width multiplier. Considering both running speed and recognition accuracy, the system performs best when the width multiplier is 0.75 or 1. Fast-FaceNet with width multipliers of 0.75 and 1 is therefore compared with the original FaceNet system; the experimental results on the LFW dataset are shown in Table 6.
As can be seen in Table 6, with a width multiplier of 1, Fast-FaceNet is slightly less accurate than the original FaceNet, but its computation time is greatly reduced. With a width multiplier of 0.75, the recognition accuracy of Fast-FaceNet is 0.9% lower than with a width multiplier of 1, but the computation speed is greatly improved.
To test Fast-FaceNet on video face recognition, a clip of film and television footage was taken from the Internet. For the two target persons given as input in Figure 10, the results of Fast-FaceNet recognition are shown in Figure 11. As shown in Figure 11, the person on the left in Figure 10 is successfully identified in the video by Fast-FaceNet and marked with a red frame, and the person on the right is marked with a yellow frame. The video face detection results of FaceNet and Fast-FaceNet are compared, with the F1-score used to evaluate the experimental results. The results are shown in Table 7.
As can be seen in Table 7, although the F1-score of Fast-FaceNet is slightly lower than that of the original FaceNet, Fast-FaceNet still maintains a certain level of recognition accuracy.
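For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the counts are hypothetical and are not results from Table 7, and how detections are matched to ground truth in the video experiment is not specified in the text.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """F1-score: harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # avoids division by zero when nothing was correctly identified
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical frame-level counts for one target person.
example = f1_score(true_positives=80, false_positives=10, false_negatives=20)
print(example)
```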

Conclusion
Based on the classic FaceNet, this paper introduced the lightweight model MobileNet and proposed a lightweight FaceNet based on MobileNet. The paper first described the classic FaceNet model, then introduced MobileNet and proposed Fast-FaceNet. Fast-FaceNet was trained on the CASIA-WebFace and VGGFace2 datasets and tested on the LFW dataset. Finally, Fast-FaceNet was applied to video face recognition. The experiments show that Fast-FaceNet greatly improves recognition speed while maintaining a certain level of recognition accuracy.