Hierarchical Representations Feature Deep Learning for Face Recognition

Most modern face recognition and classification systems rely mainly on hand-crafted image feature descriptors. In this paper, we propose a novel deep learning algorithm combining unsupervised and supervised learning, named deep belief network embedded with Softmax regression (DBNESR), as a natural source of additional, complementary hierarchical representations, which helps relieve us of the complicated hand-crafted feature-design step. DBNESR first learns hierarchical representations of features by greedy layer-wise unsupervised learning in a feed-forward (bottom-up) and back-forward (top-down) manner, and then performs more efficient recognition with Softmax regression by supervised learning. As a comparison with algorithms based only on supervised learning, we also propose and design several kinds of classifiers: BP, HBPNNs, RBF, HRBFNNs, SVM and a multiple classification decision fusion classifier (MCDFC), the hybrid HBPNNs-HRBFNNs-SVM classifier. The conducted experiments validate: first, the proposed DBNESR is optimal for face recognition, with the highest and most stable recognition rates; second, the algorithm combining unsupervised and supervised learning has a better effect than all purely supervised learning algorithms; third, hybrid neural networks have a better effect than single-model neural networks; fourth, ordered from largest to smallest, the average recognition rates of these algorithms are DBNESR, MCDFC, SVM, HRBFNNs, RBF, HBPNNs, BP, and their variances are BP, RBF, HBPNNs, HRBFNNs, SVM, MCDFC, DBNESR; finally, the capability of modeling hard artificial intelligence tasks reflects the hierarchical feature representations learned by DBNESR.


Introduction
Face recognition (FR) is one of the main areas of investigation in biometrics and computer vision. It has a wide range of applications, including access control, information security, law enforcement and surveillance systems. FR has attracted great attention from a large number of research groups and has achieved great development in the past few decades [1] [2] [3]. However, FR suffers from difficulties caused by varying illumination conditions, different poses, disguise, facial expressions and so on [4] [5] [6]. Plenty of FR algorithms have been designed to alleviate these difficulties [7] [8] [9]. FR includes three key steps: image preprocessing, feature extraction and classification. Image preprocessing is an essential process before feature extraction and an important step in FR. Feature extraction gives an effective representation of each image, which reduces the computational complexity of the classification algorithm and enhances the separability of the images to obtain a higher recognition rate. Classification distinguishes the extracted features with a good classifier. Therefore, an effective face recognition system greatly depends on an appropriate representation of human face features and a well-designed classifier [10].
After extracting the features, the following task is to design an effective classifier. Classification aims to obtain the face class for the input signal. Commonly used classification approaches include polynomial functions, HMM [21] [22], GMM [23], K-NN [23], SVM [24] and the Bayesian classifier [25]. In addition, the random weight network (RWN) has been proposed in several articles [26] [27], and other kinds of neural networks have also been used as classifiers for FR [28] [29].
In this paper, we first perform image preprocessing to eliminate the interference of noise and redundant information, reduce the effects of environmental factors on the images and highlight their important information. At the same time, because of the huge computational cost, the original face images usually need to be well represented instead of being input into the classifier directly; to compensate for the deficiency of geometric features, PCA and 2D-PCA are used to extract geometric features from the preprocessed images, reduce their dimensionality and attain a higher level of separability. Finally, we propose a novel deep learning algorithm combining unsupervised and supervised learning, named deep belief network embedded with Softmax regression (DBNESR), to learn hierarchical representations for FR; as a comparison with algorithms based only on supervised learning, we also design many other kinds of classifiers and conduct experiments to validate the effectiveness of the algorithm.
The proposed DBNESR has several important properties, summarized as follows: 1) Through layer-wise learning, DBNESR can provide effective hierarchical representations [30]. For example, it can capture the intuition that if a certain image feature (or pattern) is useful in some locations of the image, the same feature can also be useful in other locations; it can also capture higher-order statistics such as corners and contours, and can be tuned to the statistics of the specific object classes being considered (e.g., faces). 2) DBNESR acts as a composition of multiple nonlinear mappings, which can extract complex statistical dependencies from high-dimensional sensory inputs (e.g., faces). The conducted experiments validate: first, the proposed DBNESR is optimal for face recognition with the highest and most stable recognition rates; second, the deep learning algorithm combining unsupervised and supervised learning has a better effect than all purely supervised learning algorithms; third, hybrid neural networks have a better effect than single-model neural networks; fourth, ordered from largest to smallest, the average recognition rates of these algorithms are DBNESR, MCDFC, SVM, HRBFNNs, RBF, HBPNNs, BP, and their variances are BP, RBF, HBPNNs, HRBFNNs, SVM, MCDFC, DBNESR; finally, the capability of modeling hard artificial intelligence tasks reflects the hierarchical feature representations learned by DBNESR.
The remainder of this paper is organized as follows. Section 2 reviews the images preprocessing. Section 3 introduces the feature extraction methods. Section 4 designs the classifiers of supervised learning. Section 5 gives and designs the classifier combining unsupervised and supervised learning proposed by us. Experimental results are presented and discussed in Section 6. Section 7 gives the concluding remarks.

Images Preprocessing
Images often suffer from low contrast, blur and similar degradations during their generation, acquisition and input, due to environmental factors such as the imaging system, noise and lighting conditions. Therefore image preprocessing is needed. The purpose of preprocessing is to eliminate the interference of noise and redundant information, reduce the effects of environmental factors on the images and highlight their important information [31]. Image preprocessing usually includes graying, filtering, gray equalization, standardization and compression (or dimensionality reduction) of images [32]. The process of image preprocessing is as follows.

1) Face images filtering
We use median filtering to smooth and denoise the images. This method not only suppresses noise effectively but also preserves boundaries very well. The median filter is a nonlinear operation: it sorts a pixel and all other pixels within its neighborhood by gray value and sets the median of the sequence as the gray value of that pixel, as shown in the equation g(x, y) = median_{(i, j) ∈ S} f(i, j), where S is the filter window. A 3 × 3 template is used for the median filtering in the later experiment.
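As a minimal sketch of the 3 × 3 median filtering described above (an illustrative NumPy implementation, not the authors' MATLAB code), the window is slid over a replication-padded image and the median of each neighborhood replaces the center pixel:

```python
import numpy as np

def median_filter_3x3(img):
    """3x3 median filter; edges are handled by replicating border pixels."""
    padded = np.pad(img, 1, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            # median of the 3x3 window S centered at (y, x)
            out[y, x] = np.median(padded[y:y + 3, x:x + 3])
    return out
```

A single impulse-noise pixel surrounded by uniform neighbors is removed entirely, which is why median filtering suppresses salt-and-pepper noise while keeping edges sharp.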

2) Histogram equalization
The purpose of histogram equalization is image enhancement: it improves the visual effect of the images, reduces redundant information after preprocessing and highlights the important information of the images.
Let the gray range of the image be [0, L − 1]. Normalizing the histogram gives the probability density of each gray value: p_r(r_k) = n_k / n, k = 0, 1, ..., L − 1, where r_k is the kth gray level, n_k is the number of pixels with gray level r_k, and n is the total number of pixels in the image. The cumulative probability distribution function is s_k = T(r_k) = Σ_{j=0}^{k} n_j / n. To keep s within the range [0, L − 1], each level is scaled as s_k' = round((L − 1) · s_k), which is the discrete gray transformation function. We make the histogram equalization experiment for the images later.
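The transformation above can be sketched directly in NumPy (an illustrative implementation, assuming 8-bit images with L = 256): the histogram, its cumulative sum and the resulting lookup table follow the formulas one-to-one.

```python
import numpy as np

def hist_equalize(img, L=256):
    """Map gray levels through the normalized cumulative histogram."""
    hist = np.bincount(img.ravel(), minlength=L)      # n_k
    cdf = np.cumsum(hist) / img.size                  # s_k = sum(n_j) / n
    lut = np.round((L - 1) * cdf).astype(img.dtype)   # s_k' = round((L-1) * s_k)
    return lut[img]
```

Because the mapping is the scaled CDF, frequently occurring gray levels are spread apart, which is exactly the contrast-stretching effect described in the text.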

3) Compression of images (or dimensionality-reduced)
It is well known that the original face images often need to be well represented instead of being input into the classifier directly because of the huge computational cost. As one of the popular representations, geometric features are often extracted to attain a higher level of separability. Here we employ multi-scale two-dimensional wavelet transform to generate the initial geometric features for representing face images.
We make the multi-scale two-dimensional wavelet transform experiment for the images in the back.
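To illustrate one level of the two-dimensional wavelet decomposition used above (a hypothetical stand-alone Haar sketch, not the authors' multi-scale implementation), the image is split into 2 × 2 blocks and combined into approximation (LL) and detail sub-bands:

```python
import numpy as np

def haar_dwt2(img):
    """One level of the 2-D Haar wavelet transform (image sides must be even)."""
    a = img[0::2, 0::2].astype(float)  # the four pixels of each 2x2 block
    b = img[0::2, 1::2].astype(float)
    c = img[1::2, 0::2].astype(float)
    d = img[1::2, 1::2].astype(float)
    LL = (a + b + c + d) / 2           # approximation, kept as the feature sub-graph
    LH = (a - b + c - d) / 2           # horizontal detail
    HL = (a + b - c - d) / 2           # vertical detail
    HH = (a - b - c + d) / 2           # diagonal detail
    return LL, LH, HL, HH
```

The LL sub-band is a quarter-size smoothed copy of the image, which is why it keeps most of the energy and can be fed to the follow-up feature extraction.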

Feature Extraction
There are two main purposes of feature extraction. One is to extract characteristic information from the face images with which all the samples can be classified; the other is to reduce the redundant information of the images and reduce the dimensionality of the data representing human faces as far as possible, so as to speed up subsequent operations. It is well known that image features are usually classified into four classes: statistical-pixel features, visual features, algebraic features and geometric features (e.g. transform-coefficient features).

1) Extract features with PCA
All N training samples can be expressed as X = {x_1, x_2, ..., x_N}. The average face of all sample images is m = (1/N) Σ_{i=1}^{N} x_i. The difference faces, namely the difference of each face from the average face, are d_i = x_i − m, i = 1, 2, ..., N. Therefore, the image covariance matrix C can be represented as C = (1/N) Σ_{i=1}^{N} d_i d_i^T. Sorting the eigenvalues λ_1 ≥ λ_2 ≥ ... of C in descending order, we keep the first t eigenvectors such that (Σ_{i=1}^{t} λ_i) / (Σ_i λ_i) ≥ a, usually with a = 90%, which gives the eigenface subspace U = [u_1, ..., u_t]. All the samples are projected onto the subspace U: y_i = U^T d_i. Therefore, using the first t principal components instead of the original vector x not only reduces the dimension of the facial feature parameters but also loses little feature information of the original images.

For 2D-PCA, let A_i denote the ith image matrix and Ā the mean image. The image covariance matrix G can be represented as G = (1/N) Σ_{i=1}^{N} (A_i − Ā)^T (A_i − Ā), and the generalized total scatter criterion J(X) can be expressed as J(X) = X^T G X. Let X_opt be the unit vector that maximizes the generalized total scatter criterion J(X). In general, there is more than one optimal solution; we usually select a set of vectors X_1, ..., X_t subject to orthonormal constraints and maximizing the criterion J(X), where t is smaller than the dimension of the coefficient matrix. In fact, they are the orthonormal eigenvectors of the matrix G corresponding to the t largest eigenvalues. For each sub-band coefficient matrix S_i, we compute the principal components Y_i = S_i [X_1, ..., X_t] and obtain its reduced feature matrix. We extract features with PCA and 2D-PCA respectively and compare their effects in the later experiments.
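The PCA steps above can be sketched in NumPy (an illustrative eigenface computation under the usual "snapshot" trick of eigendecomposing the small N × N matrix D D^T rather than the pixel-space covariance; the 90% energy threshold `energy` stands in for the constant a):

```python
import numpy as np

def pca_features(X, energy=0.90):
    """X: (n_samples, n_pixels). Keep eigenfaces covering `energy` of variance."""
    mean_face = X.mean(axis=0)
    D = X - mean_face                          # difference faces d_i
    # eigendecompose the small (n x n) matrix D D^T instead of the pixel covariance
    vals, vecs = np.linalg.eigh(D @ D.T)
    order = np.argsort(vals)[::-1]             # descending eigenvalues
    vals, vecs = vals[order], vecs[:, order]
    # smallest t whose cumulative eigenvalue fraction reaches the threshold
    t = int(np.searchsorted(np.cumsum(vals) / vals.sum(), energy)) + 1
    U = D.T @ vecs[:, :t]                      # map back to pixel space
    U /= np.linalg.norm(U, axis=0)             # orthonormal eigenfaces
    return (X - mean_face) @ U, U, mean_face   # projected features
```

The projected features have only t coordinates per face, which is the dimensionality reduction the text relies on before classification.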

Designing the Classifiers of Supervised Learning
Usually, classifiers based on supervised learning are used for FR. In this paper we design two types of classifiers: one type consists of supervised learning classifiers, and the other is a classifier combining unsupervised and supervised learning [33].

1) BP neural network
The BP neural network is a multilayer feed-forward network trained with the error back-propagation algorithm and is currently one of the most widely used neural network models [34]. Recognition and classification of face images is an important application of BP neural networks in the field of pattern recognition and classification.
The network consists of L layers as shown in Figure 1. Its training algorithm consists of three steps, illustrated as follows [35].

2) Hybrid BP neural networks (HBPNNs)
When the number of face images is not large, the generalization ability and running time of a single-model BP neural network are acceptable; but as the number of identity classes increases, the structure of the BP network becomes more complicated, which leads to longer training time, slower convergence, a tendency to fall into local minima, poorer generalization ability and so on.
In order to eliminate these problems, we design hybrid BP neural networks (HBPNNs) composed of multiple single-model BP networks to replace the complex BP network for FR. Hybrid networks have better fault tolerance and generalization than a single-model network, and can implement distributed computing to greatly shorten the training time of the network [36].
The core idea of the hybrid-network classifier is to divide a K-class pattern classification into K independent 2-class pattern classifications, that is, to decompose a complex classification problem into several simple ones. In this paper, multiple single-model BP networks are combined into a hybrid network classifier: K BP networks with multiple inputs and a single output are integrated, each BP network being a child network responsible only for identifying one of the K model categories, with the subnets parallel to each other, as shown in Figure 2.
A BP neural network with only one hidden layer and sufficiently many hidden neurons is sufficient for approximating the input-output relationship [37]. Therefore, we select a standard three-layer BP neural network as the subnet of the hybrid networks. For each subnet, the number of input-layer neurons corresponds to the dimension of the extracted face features, and the number of output-layer neurons is 1. The number of hidden-layer neurons is calculated by the empirical formula h = sqrt(m + n) + a, where m is the number of output-layer neurons, n is the number of input-layer neurons, and a is a constant between 1 and 10 [38]. If the dimension of the extracted face features is X, each subnet of the hybrid networks therefore has the structure X input neurons, h hidden neurons and 1 output neuron, whereas a single-model BP network needs K output neurons. The structure of a subnet is thus simpler than that of a single-model BP neural network. When the network structure is complex, every additional neuron greatly increases the training time. In addition, as the network gradually grows, the more and more complex structure tends to converge slowly, fall into local minima and generalize poorly. By contrast, hybrid networks built from subnets yield more stable and efficient classifiers within a shorter training time.
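The K-way decomposition and the empirical hidden-layer rule can be sketched as follows (an illustrative helper, with the constant a = 5 chosen arbitrarily from the stated 1-10 range):

```python
import math
import numpy as np

def hidden_neurons(n_in, n_out, a=5):
    """Empirical rule h = sqrt(n_in + n_out) + a, with a in 1..10 (here a=5)."""
    return round(math.sqrt(n_in + n_out)) + a

def one_vs_rest_targets(labels, K):
    """Decompose a K-class problem into K binary target vectors, one per subnet."""
    labels = np.asarray(labels)
    return [(labels == k).astype(int) for k in range(K)]
```

Each binary target vector trains one subnet, so the K subnets can be trained independently (and in parallel), which is the source of the shortened training time claimed above.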

3) RBF neural network
The Radial Basis Function (RBF) network simulates the structure of the neural network of the human brain, in which receptive fields adjust and cover each other; it can approximate any continuous function with arbitrary precision, learns fast and does not get stuck in local minima.
The expression of an RBF is φ(x) = φ(‖x − c‖), where x, c ∈ R^n and ‖x − c‖ is the Euclidean distance from x to c [39]. The radial basis function most commonly used in RBF neural networks is the Gaussian function φ(x) = exp(−‖x − c‖² / (2σ²)), where σ is the width of the function. Radial basis functions are often used to construct the mapping f(x) = Σ_{i=1}^{M} w_i φ(‖x − c_i‖), where the center c_i and the weight w_i differ for each radial basis function.
The concrete process of training RBF is as follows.
For a given set of sample data, we use Equation (27) with M hidden nodes to classify the samples. The number of hidden nodes is initially chosen to be a small integer. If the training error is not satisfactory, we increase the number of hidden nodes to reduce it; considering the testing error at the same time, a proper number of hidden nodes exists in applications. The model figure of RBF is shown in Figure 3.
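A minimal RBF-network sketch (illustrative only; it fixes the centers and solves the linear output weights by least squares, one common training scheme, rather than the authors' exact procedure) looks like this:

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Phi[i, j] = exp(-||x_i - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def train_rbf(X, Y, centers, sigma):
    """With centers fixed, the output weights are a linear least-squares fit."""
    Phi = rbf_design(X, centers, sigma)
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return W

def rbf_predict(X, centers, sigma, W):
    return rbf_design(X, centers, sigma) @ W
```

Because only the output weights are trained and the problem is linear in them, there is a single global optimum, which matches the text's claim that RBF training does not fall into local minima.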

4) Hybrid RBF neural networks (HRBFNNs)
The hybrid RBF neural networks (HRBFNNs) are composed of multiple RBF networks and replace a single RBF network for FR. Hybrid networks have better fault tolerance, a higher convergence rate and stronger generalization than a single-model network, and can implement distributed computing to greatly shorten the training time of the network [40].
If the dimension of the extracted face features is n, each subnet of the hybrid networks has a simpler structure than a single RBF network covering all classes. In addition, when the network structure is complex, every additional neuron greatly increases the training time.

The support vector machine (SVM) is another supervised classifier we design. For linearly non-separable data, slack variables ξ_i are introduced, and a constant G controls the penalty on misclassified training examples. This is a quadratic programming problem; using the Lagrange multiplier method under the KKT conditions, the optimal classification function for the above problem is f(x) = sgn(Σ_i a_i* y_i (x_i · x) + b*), where a_i* and b* are the parameters determining the optimal classification surface and (x_i · x) is the dot product of two vectors.
For nonlinear problems, SVM maps the data into a high-dimensional space by a nonlinear mapping and solves for the optimal classification surface there; the original problem thereby becomes linearly separable. As can be seen from Equation (32), if we know the dot-product operation of the feature space, the optimal classification surface can be obtained by simple calculation. According to Mercer's theory, any symmetric function K(x_i, x_j) satisfying Mercer's condition is the dot product of some transformed space, so Equation (32) becomes f(x) = sgn(Σ_i a_i* y_i K(x_i, x) + b*). This is the SVM. There are a number of choices for the kernel function K(x_i, x_j), as shown in Figure 5. SVM is essentially a binary classifier; solving multiple-classification problems requires constructing a more appropriate classifier. There are two main methods for SVM to build a multi-class classifier. One is the direct method: modify the objective function so that a single optimization problem solves all the multi-class parameters; this method has high computational complexity. The other is the indirect method: combine multiple binary classifiers into a multi-class classifier. The indirect method has two ways:
- One-Against-One: build a hyperplane between every pair of classes; a k-class problem needs k(k − 1)/2 classification planes.
- One-Against-the-Rest: build a classification plane between each class and all the others; a k-class problem needs only k classification planes.
We will use two methods of "One-Against-One" and "One-Against-the-Rest" for the experiment and choose the method with better effect to construct the multiple classification classifiers of SVM.
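The One-Against-One bookkeeping can be sketched as follows (an illustrative helper for counting pairwise classifiers and aggregating their votes; the binary classifiers themselves are assumed to exist elsewhere):

```python
from itertools import combinations

def ovo_pairs(k):
    """One-Against-One needs k(k-1)/2 binary classifiers, one per class pair."""
    return list(combinations(range(k), 2))

def ovo_vote(pair_winners):
    """pair_winners maps each (i, j) pair to the class its binary classifier
    preferred; the final label is the class with the most pairwise wins."""
    votes = {}
    for winner in pair_winners.values():
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```

For the 40-class ORL database used later, One-Against-One therefore needs 40 × 39 / 2 = 780 binary SVMs, versus only 40 for One-Against-the-Rest; the experiments decide which trade-off recognizes better.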

5) Multiple classification decision fusion classifier (MCDFC): hybrid HBPNNs-HRBFNNs-SVM classifier
Different classifiers have different performance. Fusing multiple classifiers to integrate their respective characteristics can further improve the classification effect and robustness.
Feature fusion and decision fusion are the two main methods of classifier fusion. Feature fusion involves heavy computation and is not easy to achieve; therefore, we adopt decision fusion. The model figure of MCDFC is shown in Figure 6.
We use weighted voting for the decision fusion of the classifiers: each classifier casts a vote for its predicted class with a weight reflecting its reliability, and the class with the largest total weight is taken as the fusion result.
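A minimal sketch of that weighted vote (illustrative; the per-classifier weights would in practice come from validation accuracy, which is an assumption here, not stated in the paper):

```python
import numpy as np

def weighted_vote(predictions, weights, n_classes):
    """predictions: one predicted class label per classifier.
    weights: one reliability weight per classifier.
    Returns the class with the largest total weighted vote."""
    scores = np.zeros(n_classes)
    for label, w in zip(predictions, weights):
        scores[label] += w
    return int(np.argmax(scores))
```

With weights (0.5, 0.3, 0.3) and predictions (0, 1, 1), the two weaker classifiers together outvote the stronger one, illustrating how fusion can correct an individual classifier's mistake.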

Designing the Classifier Combining Unsupervised and Supervised Learning
Supervised learning systems are domain-specific, and annotating a large-scale corpus for each domain is very expensive [46]. Recently, semi-supervised learning, which uses a large amount of unlabeled data together with labeled data to build better learners, has attracted more and more attention in pattern recognition and classification [47]. In this paper we design a novel semi-supervised classifier combining unsupervised and supervised learning, the deep belief network embedded with Softmax regression (DBNESR), for FR. DBNESR first learns hierarchical representations of features by greedy layer-wise unsupervised learning in a feed-forward (bottom-up) and back-forward (top-down) manner [48], and then performs more efficient classification with Softmax regression by supervised learning. The deep belief network (DBN) is a representative deep learning algorithm with a deep architecture composed of multiple levels of non-linear operations [49]; it is expected to perform well in semi-supervised learning because of its capability of modeling hard artificial intelligence tasks [50]. Softmax regression is a generalization of logistic regression to multi-class problems.

1) Problem formulation
The dataset is represented as a matrix X = [x_1, x_2, ..., x_{N+M}] ∈ R^{D×(N+M)}, where N is the number of training samples, M is the number of test samples and D is the number of feature values in the dataset. Each column of X corresponds to one sample, viewed as a vector in R^D whose jth coordinate corresponds to the jth feature. Let Y be the set of labels corresponding to the L labeled training samples, Y ∈ R^{C×L}, where C is the number of classes. Each column of Y is a vector in R^C whose jth coordinate corresponds to the jth class: the jth coordinate is 1 if the sample belongs to class j and 0 otherwise. We intend to seek the mapping function X → Y using all the samples, in order to determine Y when a new X arrives.

2) Softmax regression
Softmax regression is a generalization of logistic regression to multi-class problems [51]. Logistic regression handles binary classification, y ∈ {0, 1}, with the hypothesis function h_θ(x) = 1 / (1 + exp(−θ^T x)), whose parameter vector θ is obtained by training. Softmax regression handles C-class problems with class labels y ∈ {1, 2, ..., C}. For each given sample x, it uses the hypothesis function to estimate the probability p(y = j | x) of each class j: h_θ(x) = [p(y = 1 | x; θ), ..., p(y = C | x; θ)]^T, with p(y = j | x; θ) = exp(θ_j^T x) / Σ_{l=1}^{C} exp(θ_l^T x). The cost function is J(θ) = −(1/m) Σ_{i=1}^{m} Σ_{j=1}^{C} 1{y_i = j} log p(y_i = j | x_i; θ), where 1{·} denotes the indicator function: its value is 1 if the expression is true and 0 if the expression is false. There is no closed-form solution minimizing this cost function at present, so we use an iterative optimization algorithm (for example, gradient descent or L-BFGS). Differentiating gives the gradient ∇_{θ_j} J(θ) = −(1/m) Σ_{i=1}^{m} x_i (1{y_i = j} − p(y_i = j | x_i; θ)), and the update operation is θ_j := θ_j − α ∇_{θ_j} J(θ), where α denotes the learning rate.
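The gradient-descent training just described can be sketched in NumPy (an illustrative batch implementation; the learning rate and epoch count are arbitrary choices, not the paper's settings):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)        # subtract max for numerical stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.5, epochs=200):
    """Batch gradient descent on the softmax cross-entropy cost."""
    n, d = X.shape
    Theta = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]                    # one-hot indicator 1{y_i = j}
    for _ in range(epochs):
        P = softmax(X @ Theta)                  # p(y = j | x_i)
        grad = -X.T @ (Y - P) / n               # gradient of the cost
        Theta -= lr * grad                      # update step
    return Theta
```

The cost is convex in θ, so plain gradient descent converges to the global minimum; subtracting the row maximum before exponentiating avoids overflow without changing the probabilities.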

3) Deep belief network embedded with Softmax regress (DBNESR)
DBN uses a Markov random field, the Restricted Boltzmann Machine (RBM) [52] [53], as the unsupervised building block of the multi-layer learning system, and uses the supervised BP (back-propagation) algorithm for fine-tuning after pre-training. Its architecture is shown in Figure 7. The deep architecture is a fully interconnected directed belief net with one input layer, several hidden layers and one label layer [54].
The semi-supervised learning method based on the DBN architecture can be divided into two stages. First, the DBN architecture is constructed by greedy layer-wise unsupervised learning using RBMs as building blocks; all samples are utilized to find the parameter space W with N layers. Second, the DBN architecture is trained according to the log-likelihood using gradient descent. As it is difficult to optimize a deep architecture using supervised learning directly, the unsupervised learning stage abstracts the hierarchical feature representations effectively and prevents over-fitting of the supervised training. The BP algorithm is used to pass the error top-down for fine-tuning after pre-training.
For unsupervised learning, we define the energy of the joint configuration (h^{k−1}, h^k) of an RBM as E(h^{k−1}, h^k; θ) = −Σ_i b_i h_i^{k−1} − Σ_j c_j h_j^k − Σ_{i,j} h_i^{k−1} w_{ij}^k h_j^k, where θ = (w, b, c) are the model parameters. The probability that the model assigns to h^{k−1} is P(h^{k−1}) = (1/Z(θ)) Σ_{h^k} exp(−E(h^{k−1}, h^k; θ)), where Z(θ) denotes the normalizing constant. The conditional distributions over h^k and h^{k−1} factorize. The probability of turning on unit j is a logistic function of the states of h^{k−1} and w_{ij}^k: p(h_j^k = 1 | h^{k−1}) = σ(c_j + Σ_i h_i^{k−1} w_{ij}^k); the probability of turning on unit i is a logistic function of the states of h^k and w_{ij}^k: p(h_i^{k−1} = 1 | h^k) = σ(b_i + Σ_j w_{ij}^k h_j^k), where the logistic function chosen is the sigmoid σ(x) = 1 / (1 + e^{−x}). The derivative of the log-likelihood with respect to the model parameter w^k is ∂log P / ∂w_{ij}^k = ⟨h_i^{k−1} h_j^k⟩_data − ⟨h_i^{k−1} h_j^k⟩_model, where ⟨·⟩ denotes the expectation. Because sampling from the model distribution is expensive, contrastive divergence approximates it by running the Gibbs chain for only r steps to correct the parameter θ. The training process of RBM is shown in Figure 8. From the training process of RBM using contrastive divergence we obtain Δw_{ij}^k = η(⟨h_i^{k−1} h_j^k⟩_0 − ⟨h_i^{k−1} h_j^k⟩_r), where η is the learning rate. The parameter can then be adjusted through Δw_{ij}^k(t) = μ Δw_{ij}^k(t−1) + η(⟨h_i^{k−1} h_j^k⟩_0 − ⟨h_i^{k−1} h_j^k⟩_r), where μ is the momentum. The above discussion trains the parameters between adjacent layers with one sample x. For unsupervised learning, we construct the deep architecture using all samples by inputting them one by one from layer h^0 and training the parameters between h^0 and h^1. Then h^1 is constructed: its value is calculated from h^0 and the trained parameters between h^0 and h^1. We can use it in turn to construct the next layer h^2, and so on; the deep architecture is constructed layer by layer from bottom to top, and each time the parameter space W^k is trained by the data calculated in the (k−1)th layer. For supervised learning, the DBN architecture is trained with the labeled data of the C classes.
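One contrastive-divergence update (CD-1, i.e. r = 1) for a binary RBM can be sketched as follows (an illustrative NumPy version without the momentum term, which would be added exactly as in the update rule above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, rng, eta=0.1):
    """One CD-1 update for a binary RBM.
    v0: (n_samples, n_visible) batch; W: (n_visible, n_hidden); b, c: biases."""
    ph0 = sigmoid(v0 @ W + c)                   # p(h = 1 | v0), the data statistics
    h0 = (rng.random(ph0.shape) < ph0) * 1.0    # sample binary hidden states
    pv1 = sigmoid(h0 @ W.T + b)                 # one-step reconstruction p(v = 1 | h0)
    ph1 = sigmoid(pv1 @ W + c)                  # hidden probabilities after 1 Gibbs step
    # <v h>_0 - <v h>_1, averaged over the batch
    W += eta * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += eta * (v0 - pv1).mean(axis=0)
    c += eta * (ph0 - ph1).mean(axis=0)
    return W, b, c
```

Stacking such RBMs layer by layer, with each trained layer's hidden activations feeding the next, is exactly the bottom-up construction described in the text.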
The optimization problem is formulated as minimizing the cross-entropy −Σ_k p_k log p̂_k, where p_k denotes the real label probability and p̂_k denotes the model label probability.
The greedy layer-wise unsupervised learning is just used to initialize the parameter of deep architecture, the parameters of the deep architecture are updated based on Equation (58). After initialization, real values are used in all the nodes of the deep architecture. We use gradient-descent through the whole deep architecture to retrain the weights for optimal classification.

Experimental Results and Analysis

1) Face Recognition Databases
We selected some typical image databases, for example the ORL Face Database, which consists of 10 different images for each of 40 distinct individuals. Each person was imaged with different facial expressions and facial details under varying lighting conditions at different times. All the pictures were captured against a dark background with the individuals in an upright, frontal position; the facial gestures are not identical, and expression, position, angle and scale differ somewhat. Depth rotation and in-plane rotation can be up to 20˚, and the scale of the faces varies by as much as 10%. For each such face database, we randomly choose a part of the images as training data and the remainder as testing data. In this paper, in order to reflect the universality and efficiency of all the classification algorithms, we randomly choose about 50% of each individual's images as training data and the rest as testing data. All images are first preprocessed and their features extracted.
All the experiments are carried out in the MATLAB R2010b environment running on a desktop with an Intel Core™2 Duo CPU T6670 @ 2.20 GHz and 4.00 GB RAM.
2) Relevant experiments

Experiment 1. In this experiment, we use median filtering for smoothing and denoising in image preprocessing and obtain the sample images in Figure 9. Comparing the face images, the filtered images eliminate most of the noise interference.

Experiment 2. In this experiment, we apply histogram equalization in image preprocessing and obtain the sample figures. From Figure 10 and Figure 11 we can see that after histogram equalization, the distribution of the image histogram is more uniform, the gray range increases somewhat and the contrast becomes stronger. In addition, the equalized image basically eliminates the influence of illumination, expands the representation range of pixel gray, improves the contrast of the image, makes the facial features more evident and is conducive to the follow-up feature extraction and FR.

Experiment 3. In this experiment, we employ the multi-scale two-dimensional wavelet transform to generate the initial geometric features for representing face images. From Figure 12 we can see that although the information capacity of the LL sub-graph decreases somewhat through compression (dimensionality reduction), it still has very high resolution and the energy in the wavelet domain does not decrease a lot; the LL sub-graph is well suited for the follow-up feature extraction.

Experiment 4. In this experiment, we extract features with PCA and 2D-PCA respectively and compare their effects. From Figure 13 we can see that the contribution rates of the first several principal components extracted with 2D-PCA are higher than those extracted with PCA, and Figure 14 shows the comparison when 20 principal components are extracted.

Experiment 5. This experiment compares the recognition rates of the methods based on PCA and WT + PCA with the BP and HBPNNs classifiers; the experimental results are shown in Table 1.
As shown in Table 1, the recognition rates of HBPNNs are greatly improved compared to BP, and for the same classifier (BP or HBPNNs) the recognition rates of the methods based on WT + PCA are higher than those based on PCA.

Experiment 6. This experiment compares the recognition rates of the methods based on WT + 2D-PCA + RBF and WT + 2D-PCA + HRBFNNs.
The experiment is repeated for many times and takes the average recognition rate. The experimental results are shown in Table 2.
As shown in Table 2, the recognition rates of HRBFNNs are greatly improved compared to RBF; therefore, HRBFNNs are more feasible for FR.

Experiment 7. Because SVM is essentially a binary classifier, solving multiple-classification problems requires constructing a more appropriate classifier. We use the two methods "One-Against-One" and "One-Against-the-Rest" and choose the one with the better effect to construct the multi-class SVM classifier. The experiment is repeated 20 times and the average recognition rate is taken; the results are shown in Table 3. As shown in Table 3, "One-Against-One" SVM has a higher recognition rate than "One-Against-the-Rest" SVM and at the same time fewer misclassifications. Therefore, we use "One-Against-One" to construct the SVM classifier for FR.

Experiment 8. In this paper we construct the multiple classification decision fusion classifier (MCDFC), the hybrid HBPNNs-HRBFNNs-SVM classifier. In this experiment, to show the efficiency of MCDFC, we first run recognition experiments based on HBPNNs, HRBFNNs and SVM separately, then use the decision function to fuse the classification results of the three classifiers and obtain the classification results of MCDFC. The experiment is repeated 20 times and the results are shown in Table 4 and Figure 16.
As shown in Figure 16, the recognition effect of MCDFC is never lower than the average level of the other three classifiers, and in almost all cases MCDFC is optimal.
To eliminate the error of a single experiment and greatly reduce random uncertainty, Table 5 lists the average recognition rate over 20 runs and the variance of each classifier. It can be seen from the experimental results that the multiple classification decision fusion classifier (MCDFC), the hybrid HBPNNs-HRBFNNs-SVM classifier, has the best effect for FR, has the minimum variance, can effectively improve the generalization ability and has high stability.

Experiment 9. In this experiment, in order to validate that our proposed algorithm DBNESR is optimal for FR, we compare it with the other methods: BP, HBPNNs, RBF, HRBFNNs, SVM and MCDFC. In the experiment we set up different numbers of hidden layers, each with different numbers of neurons. The architecture of DBNESR is similar to that of DBN, but with a different loss function introduced for the supervised learning stage. For greedy layer-wise unsupervised learning we train the weights of each layer independently over different numbers of epochs, and we also run the fine-tuning supervised learning for different numbers of epochs. All DBNESR structures and learning epochs used in this experiment are shown in Table 6. The number of units in the input layer equals the feature dimension of the dataset.
Almost all the recognition rates of these DBNESR structures are above 90%; in particular, the models 500-1000-40 and 1000-500-40 perform best. The recognition rates of the different recognition methods over 20 runs are listed in Table 7. As shown in Table 7, Table 8 and Figures 17-19, our proposed algorithm DBNESR is optimal for FR: in almost all cases its recognition rate is the highest and most stable, i.e., it has the largest average recognition rate and the smallest variance.

Conclusion
The conducted experiments validate that the proposed DBNESR algorithm is optimal for face recognition, with the highest and most stable recognition rates; that is, it successfully implements hierarchical-representation feature deep learning for face recognition. Its capability of modeling hard artificial intelligence tasks suggests that the hierarchical representations learned by DBNESR can also benefit other such tasks, which we plan to explore in future work.