
Most modern face recognition and classification systems rely mainly on hand-crafted image feature descriptors. In this paper, we propose a novel deep learning algorithm that combines unsupervised and supervised learning, named deep belief network embedded with Softmax regress (DBNESR), as a natural source of additional, complementary hierarchical representations, which relieves us of the complicated hand-crafted feature-design step. DBNESR first learns hierarchical feature representations by greedy layer-wise unsupervised learning in a feed-forward (bottom-up) and back-forward (top-down) manner, and then performs recognition more efficiently with Softmax regression by supervised learning. For comparison with algorithms based only on supervised learning, we also design several classifiers: BP, HBPNNs, RBF, HRBFNNs, SVM and a multiple classification decision fusion classifier (MCDFC), which is a hybrid HBPNNs-HRBFNNs-SVM classifier. The experiments show the following. First, the proposed DBNESR performs best for face recognition, with the highest and most stable recognition rates. Second, the algorithm combining unsupervised and supervised learning outperforms all of the purely supervised learning algorithms. Third, hybrid neural networks outperform single-model neural networks. Fourth, ordered from largest to smallest, the average recognition rates rank as DBNESR, MCDFC, SVM, HRBFNNs, RBF, HBPNNs, BP, and the variances rank as BP, RBF, HBPNNs, HRBFNNs, SVM, MCDFC, DBNESR. Finally, the results reflect the capability of the hierarchical feature representations learned by DBNESR to model hard artificial intelligence tasks.

Face recognition (FR) is one of the main areas of investigation in biometrics and computer vision. It has a wide range of applications, including access control, information security, law enforcement and surveillance systems. FR has attracted great attention from many research groups and has developed considerably over the past few decades [

To select the features that can highlight classification, many kinds of feature selection methods have been presented, such as: spectral feature selection (SPEC) [

After extracting the features, the following work is to design an effective classiﬁer. Classification aims to obtain the face type for the input signal. Typically used classification approaches include polynomial function, HMM [

In this paper, we first preprocess the images to eliminate the interference of noise and redundant information, reduce the effects of environmental factors on the images and highlight their important information. At the same time, to compensate for the deficiency of geometric features, and because original face images usually need to be well represented rather than input into the classifier directly owing to the huge computational cost, PCA and 2D-PCA are used to extract geometric features from the preprocessed images, reduce their dimensionality and attain a higher level of separability. Finally, we propose a novel deep learning algorithm combining unsupervised and supervised learning, named deep belief network embedded with Softmax regress (DBNESR), to learn hierarchical representations for FR. For comparison with algorithms based only on supervised learning, we also design several other classifiers and run experiments to validate the effectiveness of the algorithm.

The proposed DBNESR has several important properties, which are summarized as follows: 1) Through special learning, DBNESR can provide effective hierarchical representations [

The analysis and experiments focus on the recognition rate of face recognition. The experiments show the following. First, the proposed DBNESR performs best for face recognition, with the highest and most stable recognition rates. Second, the deep learning algorithm combining unsupervised and supervised learning outperforms all of the purely supervised learning algorithms. Third, hybrid neural networks outperform single-model neural networks. Fourth, ordered from largest to smallest, the average recognition rates rank as DBNESR, MCDFC, SVM, HRBFNNs, RBF, HBPNNs, BP, and the variances rank as BP, RBF, HBPNNs, HRBFNNs, SVM, MCDFC, DBNESR. Finally, the results reflect the capability of the hierarchical feature representations learned by DBNESR to model hard artificial intelligence tasks.

The remainder of this paper is organized as follows. Section 2 reviews image preprocessing. Section 3 introduces the feature extraction methods. Section 4 designs the supervised learning classifiers. Section 5 presents the proposed classifier combining unsupervised and supervised learning. Experimental results are presented and discussed in Section 6. Section 7 gives the concluding remarks.

During generation, acquisition and input, images often suffer from problems such as low contrast and lack of clarity, owing to environmental factors such as the imaging system, noise and lighting conditions. Image preprocessing is therefore needed. The purpose of preprocessing is to eliminate the interference of noise and redundant information, reduce the effects of environmental factors on images and highlight the important information of images [

1) Face images filtering

We use median filtering to smooth and denoise the images. This method not only suppresses noise effectively but also preserves edges well. The median filter is a nonlinear operation: it sorts a pixel and all other pixels within its neighborhood by gray value and sets the median of the sequence as the gray value of that pixel, as shown in Equation (1).

$$f'(i,j) = \underset{s}{\mathrm{Med}}\,\{ f(i,j) \} \tag{1}$$

where $s$ is the filter window. A 3 × 3 template is used for median filtering in the experiments reported later.
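As an illustration of Equation (1), the following is a minimal Python sketch of a 3 × 3 median filter on a small gray-level image stored as nested lists. The function name `median_filter` and the border handling (using only the neighbours that fall inside the image) are assumptions made for this example; the paper does not state how borders are treated.

```python
from statistics import median

def median_filter(image, size=3):
    """Apply a size x size median filter to a 2D list of gray values.

    Border pixels use only the neighbours inside the image
    (an assumption; the paper does not specify border handling).
    """
    rows, cols = len(image), len(image[0])
    half = size // 2
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Collect the gray values inside the filter window s.
            window = [
                image[r][c]
                for r in range(max(0, i - half), min(rows, i + half + 1))
                for c in range(max(0, j - half), min(cols, j + half + 1))
            ]
            out[i][j] = median(window)
    return out

# A flat image with one impulse-noise pixel (made-up data).
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
]
filtered = median_filter(noisy)
```

In this toy case the isolated bright pixel is replaced by the gray value of its neighbourhood, which is the noise-suppression behaviour described above.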

The purpose of histogram equalization is to enhance the images: it improves their visual effect, reduces the redundant information remaining after preprocessing and highlights the important information of the images.

Let the gray range of image $A(x,y)$ be $[0, L]$ and its histogram be $H_A(r)$. The total number of pixels is then:

$$A_0 = \int_0^L H_A(r)\,dr \tag{2}$$

Normalizing the histogram gives the probability density function of each gray value:

$$p(r) = \frac{H_A(r)}{A_0} \tag{3}$$

The probability distribution function is:

$$P(r) = \int_0^r p(r)\,dr = \frac{1}{A_0}\int_0^r H_A(r)\,dr \tag{4}$$

Let the gray transformation function of histogram equalization be a continuously differentiable, non-decreasing function $s = T(r)$ with finite slope. Applying it to $A(x,y)$ gives the output $B(x,y)$. With $H_B(s)$ the histogram of the output image, we get

$$H_B(s)\,ds = H_A(r)\,dr \tag{5}$$

$$H_B(s) = \frac{H_A(r)}{ds/dr} = \frac{H_A(r)}{T'(r)} \tag{6}$$

where $T'(r) = ds/dr$. Therefore, when the numerator and denominator of $H_B(s)$ differ only by a proportionality constant, $H_B(s)$ is constant. Namely

$$T'(r) = \frac{C}{A_0} H_A(r) \tag{7}$$

$$s = T(r) = \frac{C}{A_0}\int_0^r H_A(r)\,dr = C\,P(r) \tag{8}$$

To make the range of $s$ equal to $[0, L]$, we take $C = L$. In the discrete case the gray transformation function is as follows:

$$s = T(r_k) = C\,P(r_k) = C\sum_{i=0}^{k} p(r_i) = C\sum_{i=0}^{k} \frac{n_i}{n} \tag{9}$$

where $r_k$ is the $k$th gray level, $n_k$ is the number of pixels with gray level $r_k$, $n$ is the total number of pixels in the image and $k$ ranges over $[0, L-1]$.

The histogram equalization experiment on the images is reported later in the paper.
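The discrete mapping of Equation (9) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `equalize_histogram`, the toy image and the choice of scaling constant (`levels - 1`, so that outputs stay within the valid gray levels) are assumptions of this example rather than details taken from the paper.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a 2D list of integer gray values in [0, levels - 1]."""
    # Histogram: n_k = number of pixels with gray level k.
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)  # n, the total number of pixels

    # Cumulative distribution P(r_k) = sum_{i<=k} n_i / n, as in Equation (9).
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running / total)

    scale = levels - 1  # plays the role of C (assumed choice for this sketch)
    mapping = [round(scale * p) for p in cdf]
    return [[mapping[value] for value in row] for row in image]

# A dark toy image that uses only the lowest 3 of 8 gray levels.
dark = [[0, 0, 1], [1, 1, 1], [1, 1, 2]]
equalized = equalize_histogram(dark, levels=8)
```

For this toy input the gray levels 0, 1 and 2 are mapped to 2, 6 and 7, so the output spreads over a much wider part of the available range, which is the contrast enhancement described above.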

It is well known that the original face images often need to be well represented instead of being input into the classiﬁer directly because of the huge computational cost. As one of the popular representations, geometric features are often extracted to attain a higher level of separability. Here we employ multi-scale two-dimensional wavelet transform to generate the initial geometric features for representing face images.

The multi-scale two-dimensional wavelet transform experiment on the images is reported later in the paper.

Feature extraction has two main purposes. One is to extract characteristic information from the face images that can be used to classify all the samples. The other is to reduce the redundant information in the images, so that the dimensionality of the data representing human faces is reduced as far as possible and the subsequent operations run faster. It is well known that image features are usually classified into four classes: statistical-pixel features, visual features, algebraic features and geometric features (e.g. transform-coefficient features).

Suppose that there are $N$ facial images $\{X_i\}_{i=1}^{N}$, where each $X_i$ is a column vector of dimension $M$. All samples can be expressed as follows:

$$X = (X_1, X_2, \cdots, X_N)^T \tag{10}$$

Calculate the average face of all sample images as follows:

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i \tag{11}$$

Calculate the difference faces, namely the difference between each face and the average face, as follows:

$$d_i = X_i - \bar{X}, \quad i = 1, 2, \cdots, N \tag{12}$$

Therefore, the covariance matrix $C$ of the images can be represented as follows:

$$C = \frac{1}{N}\sum_{i=1}^{N} d_i d_i^T = \frac{1}{N} A A^T, \quad A = (d_1, d_2, \cdots, d_N) \tag{13}$$

Using the singular value decomposition (SVD) theorem, we calculate the eigenvalues $\lambda_i$ and orthonormal eigenvectors $v_i$ of $A^T A$; the eigenvectors $u_i$ of the covariance matrix $C$ are then obtained through Equation (14).

$$u_i = \frac{1}{\sqrt{\lambda_i}} A v_i, \quad i = 1, 2, \cdots, N \tag{14}$$

Sort all the eigenvalues $[\lambda_1, \lambda_2, \cdots, \lambda_N]$ in descending order and choose $t$ through the following formula:

$$t = \min_k \left\{ k : \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{N} \lambda_j} > \alpha,\ k \le N \right\} \tag{15}$$

where $\alpha$ is usually set to 90%. This gives the eigenface subspace $U = (u_1, u_2, \cdots, u_t)$. All the samples are projected onto the subspace $U$ as follows:

$$Z = U^T X \tag{16}$$

Therefore, using the first $t$ principal components instead of the original vector $X$ not only reduces the dimension of the facial feature parameters but also avoids losing too much feature information of the original images.
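The selection of $t$ by cumulative contribution rate can be illustrated with a short Python sketch. It assumes that the rule of Equation (15) is applied to the eigenvalues sorted in descending order, as the text describes; the function name `num_components` and the eigenvalues are made up for the example.

```python
def num_components(eigenvalues, alpha=0.90):
    """Smallest t whose leading eigenvalues exceed a share alpha of the total."""
    ordered = sorted(eigenvalues, reverse=True)  # descending order
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total > alpha:
            return k
    return len(ordered)

eigenvalues = [0.5, 6.0, 1.0, 2.5]  # made-up values for illustration
t = num_components(eigenvalues)      # cumulative shares: 0.60, 0.85, 0.95
```

With these values the first two components explain 85% of the total and the first three explain 95%, so the sketch selects three components for $\alpha = 90\%$.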

Suppose the sample set is $\{ S_j^i \in \mathbb{R}^{m \times n},\ i = 1, 2, \cdots, N;\ j = 1, 2, \cdots, M \}$, where $i$ indexes the category, $j$ indexes the sample within the $i$th category, $N$ is the total number of categories, $M$ is the number of samples in each category and $K = N \cdot M$ is the number of all samples.

Let $\bar{S}$ be the average of all samples:

$$\bar{S} = \frac{1}{K}\sum_{i=1}^{N}\sum_{j=1}^{M} S_j^i \tag{17}$$

The covariance matrix $G$ of the images can then be represented as follows:

$$G = \frac{1}{K}\sum_{i=1}^{N}\sum_{j=1}^{M} (S_j^i - \bar{S})^T (S_j^i - \bar{S}) \tag{18}$$

and the generalized total scatter criterion $J(X)$ can be expressed by:

$$J(X) = X^T G X \tag{19}$$

Let $X_{opt}$ be the unitary vector that maximizes the generalized total scatter criterion $J(X)$, that is:

$$X_{opt} = \arg\max_X J(X) \tag{20}$$

In general, there is more than one optimal solution. We usually select a set of optimal solutions $\{X_1, \cdots, X_t\}$ subject to the orthonormal constraints and the maximizing criterion $J(X)$, where $t$ is smaller than the dimension of the coefficient matrix. In fact, they are the orthonormal eigenvectors of the matrix $G$ corresponding to the $t$ largest eigenvalues.

Now, for each sub-band coefficient matrix $S_i$, compute the principal components of $S_i$ as follows:

$$y_{ij} = S_i X_j, \quad j = 1, 2, \cdots, t \tag{21}$$

Then we get its reduced feature matrix $Y_i = [y_{i1}, \cdots, y_{it}]$, $i = 1, 2, \cdots, m$.

In the experiments reported later, we extract features with PCA and with 2D-PCA and compare their effects on the images.

Classifiers based on supervised learning are commonly used for FR. In this paper we design two types of classifiers: one is the type of supervised learning classifiers and the other is the type of classifiers combining unsupervised and supervised learning [

1) BP neural network

The BP neural network is a multilayer feed-forward network trained with the error back-propagation algorithm, and is currently one of the most widely used neural network models [

The network consists of L layers as shown in

2) Hybrid BP neural networks (HBPNNs)

When the number of face images is small, the generalization ability and running time of a single-model BP neural network are satisfactory. As the number of classes to identify increases, however, the structure of the BP network becomes more complicated, which leads to longer training times, slower convergence, a tendency to fall into local minima and poorer generalization ability.

To eliminate these problems, we design hybrid BP neural networks (HBPNNs), composed of multiple single-model BP networks, to replace the complex BP network for FR. Hybrid networks have better fault tolerance and generalization than a single-model network, and can be computed in a distributed manner to greatly shorten the training time [

The core idea of designing a hybrid network classifier is to divide a K-class pattern classification problem into K independent 2-class pattern classification problems, that is, to decompose a complex classification problem into several simple ones. In this paper, multiple single-model BP networks are combined into a hybrid network classifier: K BP networks, each with multiple inputs and a single output, are integrated. Each BP network is a subnet responsible for identifying only one of the K categories, and the different subnets run in parallel with each other. In reference of

BP neural network only having a hidden layer and with sufficient hidden neurons is sufficient for approximating the input-output relationship [

$$h = \sqrt{n + m} + a \tag{22}$$

where $m$ is the number of neurons in the output layer, $n$ is the number of neurons in the input layer and $a$ is a constant between 1 and 10 [

$$X \to \left(\sqrt{X + 1} + a\right) \to 1 \tag{23}$$

The structure of the BP neural network is as follows:

$$X \to \left(\sqrt{X + K} + a\right) \to K \tag{24}$$

The structure of the subnets is simpler than that of the single-model BP neural network. When the network structure is complex, each additional neuron greatly increases the training time. In addition, as the network grows larger, the increasingly complex structure tends to converge slowly, fall into local minima and generalize poorly. By contrast, hybrid networks built from subnets yield more stable and efficient classifiers in a shorter training time.

The Radial Basis Function (RBF) network simulates the neural structure of the human brain, in which receptive fields adjust locally and overlap one another, and it can approximate any continuous function with arbitrary precision. It learns fast and does not get trapped in local minima.

The expression of RBF is as follows [

$$\phi(x) = \phi(\|x - c\|) \tag{25}$$

where $x, c \in \mathbb{R}^n$ and $\|x - c\|$ is the Euclidean distance from $x$ to $c$. The radial basis function most commonly used in RBF neural networks is the Gaussian function:

$$\phi(x) = \exp\left(-\frac{\|x - c\|^2}{\sigma^2}\right) \tag{26}$$

where $\sigma$ is the width of the function. Radial basis functions are often used to construct a function of the following form:

$$y(x) = \sum_{i=1}^{M} w_i\,\phi(\|x - c_i\|) \tag{27}$$

The center $c_i$ and the weight $w_i$ differ for each radial basis function. The concrete process of training RBF is as follows.

For the set of sample data $\{(x_i, d_i)\}_{i=1}^{N}$, we use Equation (27) with $M$ hidden nodes to classify the samples.

In applications, the number of hidden nodes is initially chosen to be a small integer. If the training error is too large, we can add hidden nodes to reduce it. When the testing error is also taken into account, there is a proper number of hidden nodes for each application. The model figure of RBF is shown in
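To make Equations (26) and (27) concrete, the following Python sketch evaluates the output of a small Gaussian RBF network. The centers, weights and width used here are hypothetical values chosen for illustration, and the function names are not from the paper; training of the centers and weights is not shown.

```python
import math

def gaussian_rbf(x, center, sigma):
    """Gaussian basis of Equation (26): exp(-||x - c||^2 / sigma^2)."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / sigma ** 2)

def rbf_output(x, centers, weights, sigma):
    """Weighted sum of M Gaussian hidden nodes, as in Equation (27)."""
    return sum(w * gaussian_rbf(x, c, sigma) for w, c in zip(weights, centers))

centers = [(0.0, 0.0), (1.0, 1.0)]  # hypothetical centers c_i
weights = [1.0, -1.0]               # hypothetical weights w_i
y = rbf_output((0.0, 0.0), centers, weights, sigma=1.0)
```

At the first center the first basis function contributes 1 and the second contributes exp(-2), so the output is 1 - exp(-2); adding hidden nodes simply adds further terms to this sum.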

The hybrid RBF neural networks (HRBFNNs) are composed of multiple RBF networks and replace the single RBF network for FR. Hybrid networks have better fault tolerance, a higher convergence rate and stronger generalization than a single-model network, and can be computed in a distributed manner to greatly shorten the training time [

If the dimension of the extracted face features is $n$, the structure of each subnet of the hybrid networks is as follows:

$$n \to m \to 1 \tag{29}$$

The structure of the RBF neural network is as follows:

$$n \to m \to k \tag{30}$$

The structure of the subnets is simpler than that of the RBF neural network. In addition, when the network structure is complex, each additional neuron greatly increases the training time and the amount of calculation. The model figure of the HRBFNNs is shown in

SVM is a novel machine learning technique based on the statistical learning theory that aims at ﬁnding the optimal hyper-plane among different classes (usually to solve binary classiﬁcation problem) of input data or training data in high dimensional feature space, and new test data can be classiﬁed by the separating hyper-plane [

Suppose there are two classes of examples (positive and negative), where the label of a positive example is $+1$ and that of a negative example is $-1$. The numbers of positive and negative examples are $n$ and $m$ respectively. The set $\{x_i\}_{i=1}^{n+m}$ contains the positive and negative examples given for training. The set $\{y_i\}_{i=1}^{n+m}$ contains the labels of the $x_i$, with $\{y_i = +1\}_{i=1}^{n}$ and $\{y_i = -1\}_{i=n+1}^{n+m}$. SVM learns a decision function to predict the label of an example. The optimization formulation of SVM is:

$$\min \frac{\|w\|^2}{2} + G\sum_{i=1}^{n+m} \xi_i, \quad \text{s.t.} \quad w x_i + b \ge 1 - \xi_i,\ i = 1, \cdots, n; \quad w x_i + b \le -1 + \xi_i,\ i = n+1, \cdots, n+m \tag{31}$$

where the $\xi_i$ are the slack variables and $G$ controls the fraction of misclassified training examples. This is a quadratic programming problem; using the Lagrange multiplier method and the KKT conditions, we obtain the optimal classification function for the above problem:

$$f(x) = \operatorname{sgn}\{ w \cdot x + b^* \} = \operatorname{sgn}\left\{ \sum_{i=1}^{n} a_i^* y_i (x_i \cdot x) + b^* \right\} \tag{32}$$

where $a_i^*$ and $b^*$ are the parameters that determine the optimal classification surface and $(x_i \cdot x)$ is the dot product of two vectors.

For a nonlinear problem, SVM maps the data into a high-dimensional space through a nonlinear function and solves for the optimal classification surface there, where the original problem becomes linearly separable. As can be seen from Equation (32), if we know the dot product operation in the feature space, the optimal classification surface can be obtained by simple calculation. According to Mercer's theorem, for any $\varphi(x) \ne 0$, if:

$$\int \varphi^2(x)\,dx < \infty \quad \text{and} \quad \iint K(x_i, x_j)\,\varphi(x_i)\,\varphi(x_j)\,dx_i\,dx_j > 0 \tag{33}$$

then the arbitrary symmetric function $K(x_i, x_j)$ is the dot product in a certain transformed space. Equation (32) then corresponds to:

$$f(x) = \operatorname{sgn}\left\{ \sum_{i=1}^{n} a_i^* y_i K(x_i, x) + b^* \right\} \tag{34}$$

This is SVM. There are several categories of the kernel function $K(x, x_i)$:

- The linear kernel function $K(x, x_i) = (x \cdot x_i)$;
- The polynomial kernel function $K(x, x_i) = (s(x \cdot x_i) + c)^d$, where $s$, $c$ and $d$ are parameters;
- The radial basis kernel function $K(x, x_i) = \exp(-\gamma |x - x_i|^2)$, where $\gamma$ is the parameter;
- The Sigmoid kernel function $K(x, x_i) = \tanh(s(x \cdot x_i) + c)$, where $s$ and $c$ are parameters.
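The four kernel functions listed above can be written directly in Python. The sketch below is illustrative only: the default parameter values are arbitrary choices for the example, not the settings used in the experiments.

```python
import math

def dot(x, z):
    """Dot product (x . z) of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, s=1.0, c=1.0, d=2):
    return (s * dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, s=1.0, c=0.0):
    return math.tanh(s * dot(x, z) + c)

x, z = (1.0, 2.0), (3.0, 4.0)
values = {
    "linear": linear_kernel(x, z),          # 1*3 + 2*4 = 11
    "polynomial": polynomial_kernel(x, z),  # (11 + 1)^2 = 144
    "rbf": rbf_kernel(x, z),                # exp(-0.5 * 8)
    "sigmoid": sigmoid_kernel(x, z),        # tanh(11)
}
```

Each function takes the place of the dot product $(x_i \cdot x)$ in the decision function, which is how the kernel replaces an explicit mapping into the high-dimensional space.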

The model figure of SVM [

SVM is essentially a two-class classifier, so solving multi-class problems requires constructing a more appropriate classifier. There are two main methods for building a multi-class classifier with SVM. One is the direct method, which modifies the objective function so that a single optimization problem solves for the parameters of all classes; this method has high computational complexity. The other is the indirect method, which combines multiple two-class classifiers into a multi-class classifier. The indirect method has two variants:

- One-Against-One: build a hyper-plane between any two classes; a problem with $k$ classes needs $k \times (k - 1) / 2$ classification planes.
- One-Against-the-Rest: build the classification plane between one category and all the other categories; a problem with $k$ classes needs only $k$ classification planes.

We will use two methods of “One-Against-One” and “One-Against-the-Rest” for the experiment and choose the method with better effect to construct the multiple classification classifiers of SVM.
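The difference in the number of binary classifiers required by the two schemes can be checked with a small Python sketch. The helper names are hypothetical, and the class count of 40 is taken from the ORL database used in the experiments; no SVM is trained here, only the two-class sub-problems are enumerated.

```python
from itertools import combinations

def one_against_one_pairs(classes):
    """Every unordered pair of classes gets its own two-class problem."""
    return list(combinations(classes, 2))

def one_against_rest_tasks(classes):
    """Each class is separated from the union of all the other classes."""
    return [(c, [o for o in classes if o != c]) for c in classes]

classes = list(range(40))  # 40 subjects, as in the ORL database
n_ovo = len(one_against_one_pairs(classes))   # k * (k - 1) / 2
n_ovr = len(one_against_rest_tasks(classes))  # k
```

For 40 classes this gives 780 pairwise problems against 40 one-against-the-rest problems, which shows the trade-off between the number of classifiers and the size of each training problem.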

Different classifiers perform differently. Fusing multiple classifiers, so as to integrate their respective characteristics, can further improve classification accuracy and robustness.

Feature fusion and decision fusion are the two main methods of classifier fusion. Feature fusion requires a large amount of computation and is not easy to realize; we therefore adopt decision fusion. The model figure of MCDFC is shown in

We use weighted voting for the decision fusion of the classifiers:

$$w_i = \begin{cases} \log\left(\dfrac{1 - \varepsilon_i}{\varepsilon_i}\right), & \varepsilon_i \le 0.5 \\ 0, & \varepsilon_i > 0.5 \end{cases} \tag{35}$$

where $w_i$ is the weight of each classifier's vote on the classification result and $\varepsilon_i$ is a variable. The final classification result is obtained from the classifiers according to the following weighted voting formula:

$$f_t(x) = \arg\max_{y \in Y} \sum_{i=1}^{n} w_i \,[f_i(x) = y] \tag{36}$$

where $f_t(x)$ is the final classification result, corresponding to the category $y$ with the maximum weighted vote, $f_i(x)$ is the classification result of the $i$th classifier, $x$ is the input, $y \in Y$ and $Y$ is the category set. $[f_i(x) = y]$ indicates that the classification result of the $i$th classifier is the category $y$, in which case the voting weight $w_i$ of that classifier is counted.
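A minimal Python sketch of the weighted voting of Equations (35) and (36) follows. It assumes that $\varepsilon_i$ is interpreted as the error rate of the $i$th classifier, which the text does not state explicitly; the predictions and error rates are made-up values, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def vote_weight(error_rate):
    """Weight of Equation (35): log((1 - e) / e) if e <= 0.5, else 0."""
    if error_rate <= 0.0:
        raise ValueError("error_rate must be positive")
    if error_rate > 0.5:
        return 0.0
    return math.log((1.0 - error_rate) / error_rate)

def fuse(predictions, error_rates):
    """Equation (36): pick the label with the largest total vote weight."""
    scores = defaultdict(float)
    for label, err in zip(predictions, error_rates):
        scores[label] += vote_weight(err)
    return max(scores, key=scores.get)

predictions = ["A", "B", "B"]     # outputs f_i(x) of three classifiers
error_rates = [0.05, 0.30, 0.40]  # assumed epsilon_i for each classifier
fused = fuse(predictions, error_rates)
```

In this example the single accurate classifier (weight log 19) outweighs the two less accurate ones (weights log(7/3) and log 1.5), so the fused result is "A" even though "B" has the majority of unweighted votes.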

Supervised learning systems are domain-speciﬁc and annotating a large-scale corpus for each domain is very expensive [

1) Problem formulation

The dataset is represented as a matrix:

$$X = [X_1, X_2, \cdots, X_{N+M}] = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^{N+M} \\ x_2^1 & x_2^2 & \cdots & x_2^{N+M} \\ \vdots & \vdots & \ddots & \vdots \\ x_D^1 & x_D^2 & \cdots & x_D^{N+M} \end{bmatrix} \tag{37}$$

where $N$ is the number of training samples, $M$ is the number of test samples and $D$ is the number of feature values in the dataset. Each column of $X$ corresponds to a sample. A sample with all its features is viewed as a vector in $\mathbb{R}^D$, where the $j$th coordinate corresponds to the $j$th feature.

Let $Y^L$ be the set of labels corresponding to the $L$ labeled training samples, denoted as:

$$Y^L = [Y_1, Y_2, \cdots, Y_L] = \begin{bmatrix} y_1^1 & y_1^2 & \cdots & y_1^L \\ y_2^1 & y_2^2 & \cdots & y_2^L \\ \vdots & \vdots & \ddots & \vdots \\ y_C^1 & y_C^2 & \cdots & y_C^L \end{bmatrix} \tag{38}$$

where $C$ is the number of classes. Each column of $Y^L$ is a vector in $\mathbb{R}^C$, where the $j$th coordinate corresponds to the $j$th class:

$$y_j = \begin{cases} 1 & \text{if } X \in j\text{th class} \\ 0 & \text{if } X \notin j\text{th class} \end{cases} \tag{39}$$

We intend to seek the mapping function $X \to Y^L$ using all the samples, in order to determine $Y$ when a new $X$ arrives.

2) Softmax regression

Softmax regression is a generalization of logistic regression to multi-class classification problems [

$$h_\phi(X) = \frac{1}{1 + \exp(-\phi^T X)} \tag{40}$$

The model parameter vector $\phi \in \mathbb{R}^{D+1}$ is trained to minimize the cost function:

$$J(\phi) = -\frac{1}{L}\left[ \sum_{i=1}^{L} Y^{(i)} \log h_\phi(X^{(i)}) + (1 - Y^{(i)}) \log\left(1 - h_\phi(X^{(i)})\right) \right] \tag{41}$$
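The logistic hypothesis of Equation (40) and the cost of Equation (41) can be evaluated with the following Python sketch. It assumes that each sample carries a leading 1 so that the bias is part of the parameter vector (consistent with its $D + 1$ entries); the samples, labels and parameter values are made up for illustration.

```python
import math

def hypothesis(phi, x):
    """Equation (40): sigmoid of phi^T x."""
    z = sum(p * xi for p, xi in zip(phi, x))
    return 1.0 / (1.0 + math.exp(-z))

def cost(phi, samples, labels):
    """Equation (41): mean cross-entropy over the labeled samples."""
    total = 0.0
    for x, y in zip(samples, labels):
        h = hypothesis(phi, x)
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(samples)

# Each sample has a leading 1 for the bias term (an assumption of this sketch).
samples = [(1.0, 0.0), (1.0, 2.0)]
labels = [0, 1]
untrained = cost([0.0, 0.0], samples, labels)  # every prediction is 0.5
better = cost([-2.0, 2.0], samples, labels)    # predictions move toward labels
```

With all parameters at zero the cost equals log 2, and it decreases once the parameters push each prediction toward its label, which is what the minimization of the cost function aims for.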

Softmax regression handles multi-class classification problems, where the class label $Y^{(i)} \in \{1, 2, \cdots, k\}$. For each given sample $X$, it uses the hypothesis function to estimate the probability value

where,

where,

At present there is no closed-form solution that minimizes the cost function in Equation (43). Therefore, we use an iterative optimization algorithm (for example, the gradient descent method or L-BFGS). After derivation, the gradient formula is as follows:

Then make the following update operation:

where,

3) Deep belief network embedded with Softmax regress (DBNESR)

DBN uses a Markov random ﬁeld Restricted Boltzmann Machine (RBM) [

The semi-supervised learning method based on the DBN architecture can be divided into two stages. First, the DBN architecture is constructed by greedy layer-wise unsupervised learning using RBMs as building blocks; all samples are used to find the parameter space W with N layers. Second, the DBN architecture is trained according to the log-likelihood using the gradient descent method. Because it is difficult to optimize a deep architecture directly with supervised learning, the unsupervised learning stage extracts hierarchical feature representations effectively and prevents over-fitting of the supervised training. After pre-training, the BP algorithm is used to pass the error top-down for fine-tuning.

For unsupervised learning, we define the energy of the joint configuration

where,

The probability that the model assigns to a

where,

The probability of turning unit j is a logistic function of the states

The probability of turning unit i is a logistic function of the states of

where the logistic function chosen is the sigmoid function:

The derivative of the log-likelihood with respect to the model parameter

where,

where,

We can get Equation (57) by training process of RBM using contrastive divergence:

where,

where,

The above discussion is based on the training of the parameters between hidden layers with one sample x. For unsupervised learning, we construct the deep architecture using all samples by inputting them one by one from layer

For supervised learning, the DBN architecture is trained by C labeled data. The optimization problem is formulated as:

namely, to minimize cross-entropy. Where,

The greedy layer-wise unsupervised learning is used only to initialize the parameters of the deep architecture; the parameters are then updated based on Equation (58). After initialization, real values are used in all the nodes of the deep architecture. We use gradient descent through the whole deep architecture to retrain the weights for optimal classification.
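As a rough illustration of the unsupervised stage, the following Python sketch performs one-step contrastive divergence (CD-1) updates for a single binary RBM. It is a textbook form of CD-1 written under stated assumptions, not necessarily the exact variant used in this paper: the function names, layer sizes, learning rate and training pattern are all hypothetical, and the stacking of several RBMs and the supervised fine-tuning are not shown.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(c))]

def visible_probs(h, W, b):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b))]

def sample(probs, rng):
    """Draw binary states from unit-wise Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_step(v0, W, b, c, lr, rng):
    """One in-place CD-1 update for a single binary sample v0."""
    ph0 = hidden_probs(v0, W, c)   # positive phase
    h0 = sample(ph0, rng)
    v1 = sample(visible_probs(h0, W, b), rng)  # one Gibbs step
    ph1 = hidden_probs(v1, W, c)   # negative phase
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b[i] += lr * (v0[i] - v1[i])
    for j in range(len(c)):
        c[j] += lr * (ph0[j] - ph1[j])

rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[rng.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden
pattern = [1, 1, 0, 0]  # made-up binary training sample
for _ in range(100):
    cd1_step(pattern, W, b, c, lr=0.1, rng=rng)
reconstruction = visible_probs(hidden_probs(pattern, W, c), W, b)
```

In a deep architecture, the hidden activations of one trained RBM would serve as the input data of the next layer, which is the greedy layer-wise scheme described above.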

1) Face Recognition Databases

We selected some typical image databases, for example the ORL Face Database, which consists of 10 different images for each of 40 distinct individuals. Each person is imaged with different facial expressions and facial details, under varying lighting conditions and at different times. All the pictures are captured against a dark background with the individuals in an upright, frontal position; the facial gestures are not identical, and expression, position, angle and scale vary somewhat. The in-depth and in-plane rotation can be up to 20˚, and the scale of the faces varies by as much as 10%. For each face database, we randomly choose a part of the images as training data and the rest as testing data. In this paper, to reflect the universality and efficiency of all the classification algorithms, we randomly choose about 50% of the images of each individual as training data and the rest as testing data. All images first undergo preprocessing and feature extraction.

All the experiments are carried out in the MATLAB R2010b environment running on a desktop with an Intel® Core™2 Duo CPU T6670 @ 2.20 GHz and 4.00 GB RAM.

2) Relevant experiments

Experiment 1. In this experiment, we use median filtering to make smoothing denoising for images preprocessing and get the sample

The comparison of the face images shows that filtering eliminates most of the noise interference.

Experiment 2. In this experiment, we make histogram equalization for the images preprocessing and get the sample figures as following:

From

made the facial features more evident and is conducive to follow-up feature extraction and FR.

Experiment 3. In this experiment, we employ multi-scale two-dimensional wavelet transform to generate the initial geometric features for representing face images. By the experiment we get the sample figures as following:

From

Experiment 4. In this experiment, we extract features respectively with PCA and 2D-PCA and compare their effects as following:

From

contribution rate of 2D-PCA is greater than 90%, while the principal component contribution rate of PCA is less than 80%. Accordingly, 2D-PCA can use less principal component to better describe the image than PCA.

Experiment 5. In this experiment, we compare the recognition rates of the methods based on PCA + BP, WT + PCA + BP, PCA + HBPNNs and WT + PCA + HBPNNs. The experiment is repeated many times and the average recognition rate is taken. The experimental results are shown in

As shown in

Experiment 6. This experiment compares the recognition rates of the methods based on WT + 2D-PCA + RBF and WT + 2D-PCA + HRBFNNs. The experiment is repeated many times and the average recognition rate is taken. The experimental results are shown in

As shown in

Serial number | Recognition method | Recognition rate/% |
---|---|---|
1 | PCA + BP | 66.2 |
2 | WT + PCA + BP | 67.29 |
3 | PCA + HBPNNs | 91.7 |
4 | WT + PCA + HBPNNs | 93.3 |

Serial number | Recognition method | Recognition rate/% |
---|---|---|
1 | WT + 2D-PCA + RBF | 90.5 |
2 | WT + 2D-PCA + HRBFNNs | 95.5 |

Experiment 7. Because SVM is essentially a two-class classifier, solving multi-class problems requires constructing a more appropriate classifier. We use the two methods “One-Against-One” and “One-Against-the-Rest” in the experiment and choose the one with the better effect to construct the multi-class SVM classifier. The experiment is repeated 20 times and the average recognition rate is taken. The experimental results are shown in

As shown in

Experiment 8. In the paper we construct the multiple classification decision fusion classifier (MCDFC)—hybrid HBPNNs-HRBFNNs-SVM classifier. In this experiment, in order to show the efficiency of MCDFC, we first make recognition experiment respectively based on HBPNNs, HRBFNNs and SVM, then use the decision function to make fusions for classification results of three classifiers and get classification results of MCDFC. The experiment is repeated for 20 times and the experimental results are shown in

As shown in

To eliminate the error of single experiment and greatly reduce the random uncertainty,

Experiment 9. In this experiment, in order to validate the performance of our proposed algorithm—DBNESR is optimal for FR, we compare our proposed algorithm with some other methods such as BP, HBPNNs, RBF, HRBFNNs, SVM and MCDFC.

Serial number | Recognition method | Recognition rate/% | Wrong number |
---|---|---|---|
1 | One-Against-One SVM | 95.05 | 9.9 |
2 | One-Against-the-Rest SVM | 90.45 | 19.1 |

Algorithm | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HBPNNs | 0.92 | 0.915 | 0.905 | 0.875 | 0.905 | 0.905 | 0.93 | 0.95 | 0.91 | 0.9 | 0.935 | 0.9 | 0.9 | 0.91 | 0.91 | 0.91 | 0.9 | 0.925 | 0.925 | 0.91 |
HRBFNNs | 0.935 | 0.905 | 0.945 | 0.9 | 0.95 | 0.94 | 0.97 | 0.935 | 0.945 | 0.935 | 0.95 | 0.945 | 0.94 | 0.95 | 0.92 | 0.94 | 0.935 | 0.95 | 0.935 | 0.945 |
SVM | 0.93 | 0.92 | 0.945 | 0.91 | 0.945 | 0.915 | 0.955 | 0.945 | 0.94 | 0.915 | 0.95 | 0.955 | 0.92 | 0.955 | 0.935 | 0.935 | 0.945 | 0.945 | 0.94 | 0.92 |
MCDFC | 0.93 | 0.93 | 0.945 | 0.915 | 0.95 | 0.94 | 0.97 | 0.94 | 0.95 | 0.935 | 0.955 | 0.94 | 0.94 | 0.955 | 0.925 | 0.94 | 0.94 | 0.95 | 0.94 | 0.94 |

Serial number | Recognition method | Average recognition rate/% | Variance |
---|---|---|---|
1 | HBPNNs | 91.2 | 0.0002537 |
2 | HRBFNNs | 93.85 | 0.0002476 |
3 | SVM | 93.6 | 0.0002147 |
4 | MCDFC | 94.15 | 0.0001424 |
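The summary statistics in the table above can be reproduced from the 20 per-run recognition rates. The following Python sketch does this for the HBPNNs row; it assumes the reported variance is the sample variance (divisor 19 for 20 runs), which is what matches the tabulated value.

```python
from statistics import mean, variance

# The 20 per-run recognition rates of HBPNNs, copied from the results table.
hbpnns = [0.92, 0.915, 0.905, 0.875, 0.905, 0.905, 0.93, 0.95, 0.91, 0.9,
          0.935, 0.9, 0.9, 0.91, 0.91, 0.91, 0.9, 0.925, 0.925, 0.91]

avg = mean(hbpnns)      # average recognition rate as a fraction
var = variance(hbpnns)  # sample variance (divides by n - 1)
print(f"average = {100 * avg:.2f}%, variance = {var:.7f}")
```

The result agrees with the tabulated 91.2% and 0.0002537, and the same two calls applied to the other rows give the remaining entries.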

In this experiment we set up different numbers of hidden layers, each with a different number of neurons. The architecture of DBNESR is similar to that of DBN, but a different loss function is introduced for the supervised learning stage. For greedy layer-wise unsupervised learning, we train the weights of each layer independently for different numbers of epochs; we also run supervised fine-tuning for different numbers of epochs. All DBNESR structures and learning epochs used in this experiment are shown in

Almost all of the recognition rates of these DBNESR structures exceed 90%; in particular, the results of the 500-1000-40 and 1000-500-40 models are

Serial number | DBNESR structures | Unsupervised learning epochs | Supervised learning epochs |
---|---|---|---|
1 | 400-200-100-50-20-40 | 10 | 1000 |
2 | 400-200-100-100-50-40 | 50 | 100 |
3 | 400-200-300-100-50-40 | 100 | 20 |
4 | 400-200-300-100-40 | 50 | 50 |
5 | 400-200-300-200-40 | 100 | 20 |
6 | 200-200-300-400-40 | 100 | 100 |
7 | 200-300-400-40 | 100 | 100 |
8 | 400-300-200-40 | 100 | 200 |
9 | 400-200-300-40 | 200 | 100 |
10 | 500-400-40 | 200 | 200 |
11 | 500-1000-40 | 200 | 200 |
12 | 1000-500-40 | 200 | 200 |
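The greedy layer-wise unsupervised stage applied to the structures in the table above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each layer is a restricted Boltzmann machine trained with one-step contrastive divergence (CD-1) on full batches, that the last number in a structure string is the output layer size, and that inputs are scaled to [0, 1]. The function names, the 64-pixel toy input and the learning rate are hypothetical, and the supervised Softmax fine-tuning stage is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def parse_structure(structure):
    """Split e.g. '1000-500-40' into hidden sizes [1000, 500] and output size 40."""
    sizes = [int(s) for s in structure.split("-")]
    return sizes[:-1], sizes[-1]


def train_rbm(data, n_hidden, epochs, lr=0.1, rng=None):
    """Train one RBM with CD-1 and return its weights and hidden biases."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_samples, n_visible = data.shape
    weights = 0.01 * rng.standard_normal((n_visible, n_hidden))
    vis_bias = np.zeros(n_visible)
    hid_bias = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities and a binary sample given the data.
        h_prob = sigmoid(data @ weights + hid_bias)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: reconstruct the visible layer, then the hidden layer.
        v_recon = sigmoid(h_sample @ weights.T + vis_bias)
        h_recon = sigmoid(v_recon @ weights + hid_bias)
        # CD-1 update: data statistics minus reconstruction statistics.
        weights += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        vis_bias += lr * (data - v_recon).mean(axis=0)
        hid_bias += lr * (h_prob - h_recon).mean(axis=0)
    return weights, hid_bias


def greedy_pretrain(data, hidden_sizes, epochs, seed=0):
    """Train one RBM per hidden layer; each layer's output feeds the next."""
    rng = np.random.default_rng(seed)
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        weights, hid_bias = train_rbm(layer_input, n_hidden, epochs, rng=rng)
        layers.append((weights, hid_bias))
        layer_input = sigmoid(layer_input @ weights + hid_bias)
    return layers, layer_input


# Toy usage: 20 hypothetical images of 64 pixels each, values in [0, 1].
hidden_sizes, n_classes = parse_structure("1000-500-40")
toy_faces = np.random.default_rng(1).random((20, 64))
layers, features = greedy_pretrain(toy_faces, hidden_sizes, epochs=5)
print(features.shape)  # (20, 500)
```

The top-layer features would then be passed to a Softmax output layer with `n_classes` units and the whole network fine-tuned with labels for the number of supervised epochs listed in the table.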

Algorithm | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BP | 0.65 | 0.655 | 0.68 | 0.675 | 0.645 | 0.645 | 0.805 | 0.64 | 0.665 | 0.635 | 0.635 | 0.68 | 0.625 | 0.625 | 0.7 | 0.8 | 0.635 | 0.65 | 0.628 | 0.74 |
HBPNNs | 0.92 | 0.915 | 0.905 | 0.875 | 0.905 | 0.905 | 0.93 | 0.95 | 0.91 | 0.9 | 0.935 | 0.9 | 0.9 | 0.91 | 0.91 | 0.91 | 0.9 | 0.925 | 0.925 | 0.91 |
RBF | 0.905 | 0.9 | 0.9 | 0.875 | 0.88 | 0.88 | 0.915 | 0.92 | 0.92 | 0.915 | 0.93 | 0.935 | 0.9 | 0.905 | 0.895 | 0.895 | 0.93 | 0.85 | 0.91 | 0.94 |
HRBFNNs | 0.935 | 0.905 | 0.945 | 0.9 | 0.95 | 0.94 | 0.97 | 0.935 | 0.945 | 0.935 | 0.95 | 0.945 | 0.94 | 0.95 | 0.92 | 0.94 | 0.935 | 0.95 | 0.935 | 0.945 |
SVM | 0.93 | 0.92 | 0.945 | 0.91 | 0.945 | 0.915 | 0.955 | 0.945 | 0.94 | 0.915 | 0.95 | 0.955 | 0.92 | 0.955 | 0.935 | 0.935 | 0.945 | 0.945 | 0.94 | 0.92 |
MCDFC | 0.93 | 0.93 | 0.945 | 0.915 | 0.95 | 0.94 | 0.97 | 0.94 | 0.95 | 0.935 | 0.955 | 0.94 | 0.94 | 0.955 | 0.925 | 0.94 | 0.94 | 0.95 | 0.94 | 0.94 |
DBNESR | 0.95 | 0.95 | 0.96 | 0.965 | 0.945 | 0.95 | 0.95 | 0.96 | 0.965 | 0.96 | 0.95 | 0.965 | 0.945 | 0.95 | 0.96 | 0.965 | 0.95 | 0.96 | 0.96 | 0.965 |

Serial number | Recognition method | Average recognition rate/% | Variance |
---|---|---|---|
1 | BP | 67.06 | 0.0028 |
2 | HBPNNs | 91.2 | 0.0002537 |
3 | RBF | 90.5 | 0.0005 |
4 | HRBFNNs | 93.85 | 0.0002476 |
5 | SVM | 93.6 | 0.0002147 |
6 | MCDFC | 94.15 | 0.0001424 |
7 | DBNESR | 95.63 | 0.0000523 |
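To read the comparison off the summary table above, the methods can be ordered by average recognition rate and by variance. The short sketch below does this with the tabulated values only; the dictionary name `summary` and the variable names are illustrative.

```python
# (average recognition rate in %, variance) copied from the summary table.
summary = {
    "BP": (67.06, 0.0028),
    "HBPNNs": (91.2, 0.0002537),
    "RBF": (90.5, 0.0005),
    "HRBFNNs": (93.85, 0.0002476),
    "SVM": (93.6, 0.0002147),
    "MCDFC": (94.15, 0.0001424),
    "DBNESR": (95.63, 0.0000523),
}

# Both orderings run from the largest value to the smallest.
by_rate = sorted(summary, key=lambda m: summary[m][0], reverse=True)
by_variance = sorted(summary, key=lambda m: summary[m][1], reverse=True)

print("By average rate:", ", ".join(by_rate))
print("By variance:    ", ", ".join(by_variance))
```

With these values DBNESR comes first by average rate and last by variance, that is, it has both the highest and the most stable recognition rate among the seven methods.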

best and most stable. Therefore, the DBNESR structure used in this experiment is 1000-500-40, meaning that the output layer has 40 units and the two hidden layers have 1000 and 500 units, respectively. The learning rate is dynamic: it is initialized to 0.1 and decreases as the training error decreases. The experimental results are shown in
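The text states only that the learning rate starts at 0.1 and shrinks as the training error shrinks; the exact rule is not given. One possible schedule, shown purely as an assumption, scales the rate in proportion to the current training error relative to the initial error. The function name, the proportional rule and the floor `min_lr` are all hypothetical.

```python
def dynamic_learning_rate(error, initial_error, initial_lr=0.1, min_lr=1e-4):
    """Hypothetical schedule: learning rate proportional to the remaining error.

    Returns initial_lr while the error is at or above its initial value and
    decays linearly towards min_lr as the error approaches zero.
    """
    if initial_error <= 0:
        raise ValueError("initial_error must be positive")
    scale = min(1.0, max(0.0, error / initial_error))
    return max(min_lr, initial_lr * scale)


# Example: the rate halves when the training error halves.
for err in (2.0, 1.0, 0.5, 0.0):
    print(err, dynamic_learning_rate(err, initial_error=2.0))
```

The floor keeps the rate from reaching zero, so fine-tuning can still make small updates once the training error is close to zero.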

As shown in

The conducted experiments validate that the proposed algorithm, DBNESR, is optimal for face recognition, with the highest and most stable recognition rates; that is, it successfully implements hierarchical-representation feature deep learning for face recognition. The results also suggest that the hierarchical feature representations learned by DBNESR are capable of modeling other artificial intelligence tasks, which we plan to investigate in future work.

This research was funded by the National Natural Science Foundation (grant numbers 61171141, 61573145), the Public Research and Capacity Building of Guangdong Province (grant number 2014B010104001), the Basic and Applied Basic Research of Guangdong Province (grant number 2015A030308018), the Main Project of the Natural Science Fund of Jiaying University (grant number 2017KJZ02), the key research bases being jointly built by provinces and cities for humanities and social science of regular institutions of higher learning of Guangdong Province (grant number 18KYKT11), the cooperative education program of the Ministry of Education (grant number 201802153047), and the college characteristic innovation project of the Education Department of Guangdong Province in 2019 (grant number 2019KTSCX169). The authors are grateful for these grants.

1) Funding

This study was funded by the National Natural Science Foundation (grant numbers 61171141, 61573145), the Public Research and Capacity Building of Guangdong Province (grant number 2014B010104001), the Basic and Applied Basic Research of Guangdong Province (grant number 2015A030308018), the Main Project of the Natural Science Fund of Jiaying University (grant number 2017KJZ02), the key research bases being jointly built by provinces and cities for humanities and social science of regular institutions of higher learning of Guangdong Province (grant number 18KYKT11), the cooperative education program of the Ministry of Education (grant number 201802153047), and the college characteristic innovation project of the Education Department of Guangdong Province in 2019 (grant number 2019KTSCX169).

2) Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Hai-Jun Zhang declares that he has no conflict of interest. Ying-hui Chen declares that she has no conflict of interest.

Zhang, H.J. and Chen, Y.H. (2020) Hierarchical Representations Feature Deep Learning for Face Recognition. Journal of Data Analysis and Information Processing, 8, 195-227. https://doi.org/10.4236/jdaip.2020.83012