Study of Similarity Measures with Linear Discriminant Analysis for Face Recognition

Face recognition systems have been in the active research in the area of image processing for quite a long time. Evaluating the face recognition system was carried out with various types of algorithms used for extracting the features, their classification and matching. Similarity measure or distance measure is also an important factor in assessing the quality of a face recognition system. There are various distance measures in literature which are widely used in this area. In this work, a new class of similarity measure based on the Lp metric between fuzzy sets is proposed which gives better results when compared to the existing distance measures in the area with Linear Discriminant Analysis (LDA). The result points to a positive direction that with the existing feature extraction methods itself the results can be improved if the similarity measure in the matching part is efficient.


Introduction
The concept of similarity is fundamentally important in almost every scientific field.For example, in mathematics, geometric methods for assessing similarity are used in studies of congruence and homothetic as well as in allied fields such as trigonometry.Face recognition has been studied extensively for more than 40 years.Now it is one of the most imperative subtopics in the domain of face research [1]- [4].Face recognition is a technology which recognizes the human by his/her face image.Face recognition has gained much attention in recent years and has become one of the most successful applications of image analysis and understanding.
A fundamental challenge in face recognition lies in determining which steps are important in the recognition of faces.Several studies [5]- [7] have indicated the significance of certain steps in this regard, particularly preprocessing and feature extraction.Measuring similarity or distance between two data points is a core requirement for several data mining and knowledge discovery tasks that involve distance computation.For continuous data sets, the Minkowski Distance [8] is a general method used to compute distance between two multivariate points.In particular, the Minkowski Distance of order 1 (Manhattan) and order 2 (Euclidean) are the two most widely used distance measures for continuous data.The simplest way to address similarity between two categorical attributes is to assign a similarity of 1 if the values are identical and a similarity of 0 if the values are not identical.For two multivariate categorical data points, the similarity between them will be directly proportional to the number of attributes in which they match.Various similarity measure functions are enumerated in the literature [9] [10].
In this paper, we propose a novel approach to derive the similarity between two images using new similarity measure, by representing each numerical value of their feature vectors as a fuzzy set, instead of a single value.This representation takes into account the uncertainty presents in the extraction process of features and consequently, increases the precision rate in the image retrieval process.The results obtained by the proposed approach present higher performance than the traditional ones.
The rest of this paper is organized as follows: Section 2 presents the mathematical foundations similarity measures on digital images.Section 3 describes the various similarity measures and Section 4 describes about the proposed novel similarity measure with its properties which can be used for effective face recognition.In Section 5, Linear Discriminant Analysis is explained briefly.Experimental details and results on real data are outlined in Section 6, and finally, the conclusions are given in Section 7.

Similarity Measure
The similarity of two images is obtained by computing the similarity (or dissimilarity) between their feature vectors [11].Several measures have been proposed to measure the similarity between fuzzy sets or images [12]- [16].There is no generic method for selecting a suitable similarity measure or a distance measure.However, a prior information and statistics of features can be used in selection or to establish a new measure.Van der Weken and et al. [17] gave an overview of similarity measures, originally introduced to express the degree of comparison between fuzzy sets, which can be applied to images.These similarity measures are all pixel-based, and have therefore not always satisfactory results.To cope with this drawback, in [18] they proposed similarity measures based on neighborhoods, so that the relevant structures of the images are observed better.Some of the distance or similarity functions frequently used for information retrieval from the databases are listed below [19].

Euclidean Distance
The Euclidean distance between two points p and q is the length of the line segment connecting them ( ) , , , n p p p p =  and ( ) , , , n q q q q =  are two points in Euclidean n-space, then the distance from p to q, or from q to p is given by:

Squared Euclidean Distance
The standard Euclidean distance can be squared in order to place progressively greater weight on objects that are farther apart.In this case, the equation becomes:

City Block Distance
The City block distance between two points a, b with k dimensions is calculated as follows: ( ) The City block distance is always greater than or equal to zero.The measurement would be zero for identical points and high for points that show little similarity.

Minkowski Distance
The Minkowski distance of order p between two points ( ) , , , n P x x x =  and ( ) , , ,

Mahalanobis Distance
Formally, the Mahalanobis distance of a multivariate vector ( )  from a group of values with mean ( ) , , , N µ µ µ µ =  and covariance matrix Sis defined as:

Chebyshev Distance
The Chebyshev distance between two vectors or points p and q, with standard coordinates i p and i q , respec- tively, is

Cosine Correlation
The Cosine correlation between two points a and b, with k dimensions is calculated as where ( ) The cosine correlation ranges from +1 to −1 where +1 is the highest correlation.Complete opposite points have correlation −1.

Canberra Distance
The Canberra distance, CAD d , between two vectors , p q in an n-dimensional real vector space is given as follows: where , , , n p p p p =  and ( ) , , , n q q q q =  are vectors.

Proposed Similarity Measure
Similarity measure is defined as the distance between various data points.The performance of many algorithms depends upon selecting a good distance function over input data set.While, similarity is a amount that reflects the strength of relationship between two data items, dissimilarity deals with the measurement of divergence between two data items [20] [21].

Properties of Similarity Measure
The theory of fuzzy sets F(X) was proposed by Zadeh [21].A fuzzy set A in a universe , which associates with every element x in X by a degree of membership ( ) x ( ) F X be the class of all fuzzy sets of X , ( ) There is no unique definition for the similarity measure, but the most common used definition is the following [22] [23].
, is said to be a measure between Fuzzy Sets ( ) ( ) , S A B is a similarity measure which defined as above, then is a distance measure between A and B .Definition 3.3.(Minkowski distance): The distance between two sets A and B ( p L metric), is given by: is a distance measure between A and B .Family of distance based similarity measures presented by Sanitini in the following definition [23]: Definition 3.4.Let r d is the distance between two fuzzy sets A and B , then are similarity measures.
Based on this definition a class of similarity measures can be given in the following.Definition 3.5.Let

Formulation of the Proposed Similarity
In this section, we introduce the new classes of similarity measure ( ) S A B based on p L metrics.To prove the proposed new classes of similarity measure, the following lemma will be needed: is a similarity measure.Generally, for Proof: For the first partition of this theorem, it is quite easy to verify that the conditions ( ) ( ) -SP SP are sa- tisfied and for ( ) SP , the above lemma 3.1 can be used to prove.For the second partition of the theorem, the induction is used as the following: where the left hand side is similarity measure as in the first partition.So the theory was done.

Theorem 4.2 Let ( )
, S A B be a similarity measure between ( ) is dissimilarity measure.Generally, the th n -formula for ( ) ( )

S A B S A B S A B S A B
, , , S S S and 4 S according to the above relation in theo- rem 4.1.
( ) From the above similarity measures, 1 2 3 , , S S S and 4 S , the classes using the relation in Equation ( 14) can be driven: It can extend to the following classes (practically these classes are satisfies all previous conditions of similarity measures):

S A B S A B
S A B = + ……etc.

Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) [24] finds the vectors in the underlying space that best discriminate among classes.For all samples of all classes the between-class scatter matrix S B and the within-class scatter matrix S W are defined by: ( )( ) where M i is the number of training samples in class i, c is the number of distinct classes, i µ is the mean vector of samples belonging to class i and X i represents the set of samples belonging to class i with x k being the k-th image of that class.S W represents the scatter of features around the mean of each face class and S B represents the scatter of features around the overall mean for all face classes.For more details about LDA, see [25].LDA uses the class information and finds a set of vectors that maximize the between-class scatter while minimizing the within-class scatter.Beveridge et al. [26] claim in their tests that LDA performed uniformly worse than PCA.Martinez [27] states that LDA is better for some tasks, and Navarrete et al. [28] claim that LDA outperforms PCA on all tasks in their tests (for more than two samples per class in training phase).The Figure 1 shows the flow of a general subspace face recognition system.

Initial Steps
The data of the image folder are unpacked as follows: for each image, the intensity values of all pixels are read continuously starting from pixel position (1,1) and going column width until a new row, then the pixels values of this row are read and finally all the values are made into a row vector.All images are unpacked in this fashion gives a data matrix X of 400 × 10,304 dimensions.(i.e.400 images and each image having 112 × 92 = 10,304 pixels).Hence, each image is considered a vector of 10,304 dimensional space.In the case of Georgia Tech database, the data matrix will be 400 × 40,000 and each image can be considered a vector in 40,000 dimensional space.For statistical analysis, the pixels (the data matrix columns) are considered as random variables and the different images (the data matrix rows) are considered as samples of these variables.It is expected that the variables will be highly correlated.This is because most of pixels which form the background will have the same intensity in the image.And thus, they would be highly correlated.Similarly, a lot of other pixels would be correlated.So, there is a lot of redundancy in the data.

Algorithm in Details
For a given face image, correctly identify the individual is the main objective.To this end, we use the Linear discriminant analysis (LDA) assuming multivariate normality of features given groups and common covariance.
The individuals based on their face images are classified by using Linear Discriminant Analysis (LDA).Each 10,304 (Consider the dataset used is ORL to show the calculations) pixel intensity vector considers a vector of features and each individual considers a class that the features are classified into.Even with the assumptions already made, LDA cannot be performed on the raw pixel intensity values (features) as the covariance matrix of large dimensions (10,304 × 10,304).Computing the covariance matrix and its inverse are both computationally not feasible.
To overcome this problem, Principal Component Analysis (PCA) is performed on the pixel intensity data.But, to get the principal components of X which is 400 × 10,304, the covariance matrix which is 10,304 × 10,304 dimensions should be obtained and then its eigenvectors are found.So, the problem still remains.Nevertheless, we have to note that the rank of X is only 400, which means there will only be 400 eigenvectors that have non-zero eigen values.
A simple way of finding these eigenvectors is proposed by (Turk and Pentland [31]).If v is an eigenvector of Multiplying both sides by T X we get Hence T X ν is an eigenvector of T X X .Note that X, as used here, is a mean centered.The mean of X (beforemean centering) i.e. the mean of all faces is called the average face.The average face is shown in Figure 2.
Also, note that the eigen values not change i.e. the eigenvalues of T XX are the same as those of T XX .The calculated eigenvectors by this way are normalized.The order of the covariance matrix from which eigenvectors have to be estimated is reduced from 10,304 × 10,304 to 400 × 400 (i.e. from square of pixels to square of number of images).The eigenvectors and eigen values are estimated using MATLAB.The larger an eigen value the more important the eigenvector, in the sense that the variance of the data set in that direction is more compared to the variance in the direction of eigenvectors with lesser eigen values.
These eigenvectors are called eigen faces [31].The components of each eigenvector can be regarded as a weight of the corresponding pixel in forming the total image.The components of the eigenvectors calculated, generally do not lie between 0 -255, which is required to visualize an image.Hence, the eigenvectors are transformed so that the components lie between the 0 -255 range, for visualization.These eigen faces can be regarded as the faces that make up all the faces in the database, i.e. a linear combination of these faces can be used to represent any face in the data set.Also, at least one face can be represented by a linear combination of all the eigenfaces.The first few eigenfaces (after transformation) are shown in Figure 3.The eigen faces represent vectors in the 10304 dimensional space spanned by the variables (pixels).If the new vectors are considered to be new axes that represent the data, then these new axes will be the principal components we are seeking for.They are just a rotation of the original axes to more meaningful directions.To illustrate how principal components are closer to normality, the normal probability plots some of the first few principal components are given.To check the multivariate normality of the principal components, the uni_variate normality of individual Principal Components (PC's) is checked.

Projections (Reconstruction of an Image)
The projection of a vector v onto a vector u is given by: ( ) If v is any image (after subtracting the average face) and u is any eigen face, then the above formula gives the projection of the face von the eigen face u.Any image can be projected onto each of the first few eigen faces.All the projected vectors are added to get a final image (reconstructed).After this, the average face has to be added again to get the actual projected face onto the lower dimensional space spanned by the eigen faces.For example, a face and its projection of 5, 20, 50, 150, 400 dimensions (8 faces) are shown respectively as Figure 4.
Any image in the database can be completely reconstructed by using all the eigen faces.The last image above is a complete reconstruction of the original image.

Performing LDA and Experimental Results
With the above basic frame work and with the additional assumption of equivalence of the covariance matrices for each group, LDA for classification can be used.The assumption of equivalence of covariance matrices is quite strong.No statistical test has been performed to check this assumption, as most of these tests are not robust enough (e.g.Box's test).Instead, two-fold cross validation has been performed to check the error rate on the final classification and the assumptions are validated based on these results.The covariance matrix has been estimated using a spooled estimator.The prior probabilities have been considered to be equal.With this spooled estimator and the means of the individual groups, the y σ 's and y β 's for each group have been calculated by The classification rule (Bayes's minimum cost rule) using equal losses then becomes, ( x I j y P y x P y x where x is a feature vector and J is the total number of classes (equal to 40).The obtained classification depends on the number of principal components.
We can evaluate error rates by means of a training sample (to construct the discrimination surf surface) and a test sample.An optimistic error rate is obtained by reclassifying the design set: this is known as the apparent error rate APER.So that APER is equal to proportion of items in the training set that are misclassified.If an independent test sample is used for classifying, we arrive at the true error rate AER.The number of principal components to be used for classification is chosen so that the APER (Apparent Error Rate) is almost 0%.Using 23 principal components, it was found that the APER is 0.02%.The APER is 0% when 60 or more principal components are considered.A closer estimate of the AER (actual error rate) can be calculated using two-fold cross validation.These were calculated based on the confusion matrices.The confusion matrices are not shown here as they are of 40 × 40 dimensions.Instead the final cross validated error rates are shown.
Table 1 shows the number of principal components used and the AER.
It can be seen that the error rate does not decrease much with the increase in the number of principal components.In fact, when 100 PC's are used, the error rate is more than that when only 23 PC's are used.This can be attributed to the larger rounding errors when dealing with inverses of large dimension covariance matrices.When using 100 PC's the spooled covariance matrix becomes 100 × 100.Because of the error rate does not decrease much with increasing the number of PC's, so 23 PC's can be considered to be optimal.The state of 23 PC's explaining the variance of the data set up to 75%.The APER as the AER is quite small, so the assumptions of multivariate normality of features that given a group and the equivalence of covariance matrices are not unreasonable.
In the stage of classification, various similarity measures for the recognition purpose have been analyzed.Euclidean, City block, Mahanalobis, and Cosine similarity measure are used together with the proposed distance measure.In the proposed similarity measures, Equation ( 9) is used for the experimental purpose.The results are divided into two sets.Various performance metrics for the two data sets and the graphical representation or various plots that show the performance and error details are explained as the following: In the performance metrics, six measures are used.They are Rank one recognition rate, equal error rate, minimal half total error rate, verification rate at 1%, 0.1% and 0.01% of False Acceptance Rate (FAR).For all the five distance measures, these metrics are calculated to both datasets.It is mentioned in Table 2.The performance metrics show that the proposed distance outperforms all the other four measures.An average of 90.6% of recognition rate is achieved which far better than the other measures.

Conclusion
Analysis of various similarity measures was carried out in the context of face recognition.Feature extraction was done with the renowned Linear Discriminant Analysis method.A new similarity measure based on fuzzy Minkowski's theorem is introduced in this work.Various similarity measures such as cosine similarity, Euclidean distance, City Block distance and Mahanalobis distance measures were compared with the proposed similarity measure.This novel similarity measure outperformed all the other distance measures.These results provide us the strong guideline in developing future face recognition using an efficient similarity measure.Also, we can use this measure as an important factor for evaluating a face recognition system on tested datasets.
x χ in the fuzzy set A .In the following, let the fuzzy sets A and B respectively, where i a and i b are membership values any finite integer n, is defined a class of dissimilarity measures.Proof:It is noted that: a several similarity measures which are a combination of the similarity measures, 1 2 3

Figure 1 .
Figure 1.A general subspace face recognition system.

Table 1 .
Cross validated error rates.

Table 2 .
Performance metrics for the ORL and Georgia-tech face databases using different distance measures.