An Integrated Face Tracking and Facial Expression Recognition System

This article proposes a feature extraction method for integrated face tracking and facial expression recognition in real-time video. The method proposed by Viola and Jones [1] is used to detect the face region in the first frame of the video. A rectangular bounding box is fitted over the face region, and the detected face is tracked in successive frames using a cascaded support vector machine (SVM) and a cascaded radial basis function neural network (RBFNN). Haar-like features are extracted from the detected face region and used to build the cascaded SVM and RBFNN classifiers. Each stage of the SVM and RBFNN classifiers rejects non-face regions and passes face regions to the next stage in the cascade, thereby tracking the face efficiently. Tracking performance is evaluated using one hour of video data, and the cascaded SVM is compared with the cascaded RBFNN. The experimental results show that the proposed cascaded SVM classifier performs better than the RBFNN and better than the single-SVM-classifier method described in the literature [2]. While the face is being tracked, features are extracted from the mouth region for expression recognition. The features are modelled using a multi-class SVM, which finds optimal hyperplanes to distinguish the facial expressions with an accuracy of 96.0%.


Introduction
The proposed work focuses mainly on recognizing the facial expressions of a user interacting with a computer. Many applications, such as video conferencing, intelligent tutoring systems, measuring advertisement effectiveness, behavioural science, e-commerce, video games and home/service robotics, require efficient facial expression recognition in order to achieve the desired results. Tracking a face in the video is the first step in an automated facial expression recognition system. This paper deals with the problem of tracking a face in a video using a cascaded SVM and a cascaded RBFNN. Tracking requires a method for face detection, which determines the image location of a face in the first frame of a video sequence. A number of methods have been proposed for face detection, including neural networks [3,4], wavelet basis functions [5] and Bayesian discriminating features [6]. Many methods have also been proposed for face tracking, including active contours [7], robust appearance filter [8], probabilistic tracking [9], adaptive active appearance model [10] and active appearance model [11]. In face tracking applications, boosting and cascading detectors have gained great popularity due to their efficiency in selecting features for face detection. In this work, the Adaboost algorithm [1] is used for face detection.
In the Adaboost algorithm [1], faces are detected in three steps. First, the image is converted to a new representation called the "integral image", which allows the features to be computed quickly. In the second step, a learning algorithm selects critical visual features from a large number of features. This yields a large number of efficient classifiers. In the third step, the classifiers are combined in a cascade that discards background regions from the object regions of interest. A rectangular bounding box is fitted using these features for the face region. Once the face is detected, it is tracked in consecutive frames of the video sequence.
Face tracking follows the face region through the video sequence. It can be done in two ways. The first method tracks the face by detecting the face region in each and every frame. The second method detects the face region in the first frame and then tracks it in the consecutive frames by classifying face regions from non face regions. Many models have been proposed for classification. Support vector machine (SVM) and radial basis function neural network (RBFNN) models are popular due to their good generalization capacity and their ability to build non-linear classifiers.
In this work, a method is proposed that uses haar-like features for feature selection, and a cascade of SVMs and a cascade of RBFNNs for classification. The detected face and non face regions are modelled using cascaded SVMs and cascaded RBFNNs. The trained models are used to track the face region in the video sequence.
While the face is being tracked, the mouth region is also tracked. Since the mouth plays a prominent role in expressing emotions, the mouth features are used for classifying expressions. Many methods have been proposed in the literature for expression recognition, including the active appearance model [12], Adaboost classifier [1,13], hidden Markov models [14], neural networks [15], Bayesian networks [16,17], fuzzy systems [18] and support vector machines [19][20][21]. In this work, a multi-class SVM is used to model the mouth features and classify the facial expressions.
Section 2 explains the face detection process. Facial feature extraction and face and non face modelling using cascaded SVM and cascaded RBFNN are presented in Section 3. Section 4 describes the modelling of the mouth region for expression classification. Experimental results are given in Section 5. Section 6 concludes the paper.

Face Detection
Detecting faces automatically from an intensity or colour image is an essential task for many applications such as person authentication and video indexing. Face detection is done in three steps. In step 1, simple rectangle features as shown in Figure 1 are used.
These features are reminiscent of Haar basis functions [19]. The rectangle features are computed quickly using an intermediate representation of the image called the integral image. The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y') \tag{1}$$

where ii(x, y) is the integral image and i(x', y') is the original image. The following pair of recurrences can be used to compute the integral image in one pass over the original image:

$$s(x, y) = s(x, y - 1) + i(x, y) \tag{2}$$

$$ii(x, y) = ii(x - 1, y) + s(x, y) \tag{3}$$

where s(x, y) is the cumulative row sum, s(x, -1) = 0, and ii(-1, y) = 0. The set of rectangle features used provides a rich image representation that supports effective learning.
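The one-pass computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `integral_image` and `rect_sum` are hypothetical, and the image is assumed to be a plain list of rows of gray values.

```python
def integral_image(img):
    """Compute ii(x, y) in one pass using the two recurrences.

    img is a list of rows; img[y][x] is the pixel i(x, y).
    """
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for x in range(width):
        s = 0  # cumulative sum s(x, y), starting from s(x, -1) = 0
        for y in range(height):
            s += img[y][x]                       # s(x, y) = s(x, y-1) + i(x, y)
            left = ii[y][x - 1] if x > 0 else 0  # ii(-1, y) = 0
            ii[y][x] = left + s                  # ii(x, y) = ii(x-1, y) + s(x, y)
    return ii


def rect_sum(ii, x0, y0, x1, y1):
    """Sum of the pixels in the inclusive rectangle [x0..x1] x [y0..y1]."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total


image = [[1, 2], [3, 4]]
ii = integral_image(image)
```

Once the integral image is available, any rectangle sum needs at most four array lookups regardless of the rectangle size, which is what makes the rectangle features cheap to evaluate.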
In the second step, the Adaboost learning algorithm is used to select a small set of features to train the classifier. About 180,000 rectangle features are associated with each image sub-window. Out of this large number of features, a very small number are combined to form an effective classifier.
In the third step, a cascade of classifiers, as shown in Figure 2, is constructed, which increases the detection performance while reducing the computation time.
A positive result from the first classifier triggers the evaluation of the second classifier, a positive result from the second classifier triggers the third classifier, and so on. A negative result at any stage leads to rejection of the sub-window. The stages in the cascade are constructed using the Adaboost algorithm, and the threshold of each stage is adjusted to minimize false negatives. Stages are added until the targets for false positive rate and detection rate are achieved. A rectangular bounding box is placed over the face region. The detected face region is shown in Figure 3.
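The control flow of such a cascade can be illustrated with a short sketch. The stage classifiers here are toy stand-ins (the built-in `sum` and `max` applied to a list of numbers), not the boosted classifiers of [1]; the names `cascade_accepts` and `detect` and the thresholds are assumptions made for illustration only.

```python
def cascade_accepts(window, stages):
    """Return True only if every stage accepts the sub-window.

    stages is a list of (score_fn, threshold) pairs, evaluated in order.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # negative result: reject without running later stages
    return True


def detect(windows, stages):
    """Keep the sub-windows that survive all stages of the cascade."""
    return [w for w in windows if cascade_accepts(w, stages)]


# Toy stages: stage 1 checks the sum, stage 2 checks the maximum value.
stages = [(sum, 10), (max, 5)]
windows = [[1, 2, 3], [4, 4, 4], [2, 3, 6]]
survivors = detect(windows, stages)
```

In this toy run the first window is rejected at stage 1 and the second at stage 2, so only the third reaches the end. Because most sub-windows are rejected early, the later and more expensive stages run on only a small fraction of the image.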

Modelling of Face and Nonface Regions for Face Tracking

Facial Feature Extraction
One of the main issues in constructing a face tracker is to extract facial features that are invariant to the size of the face. The three types of haar-like features shown in Figure 4 are used to extract the facial features. Each haar-like feature is calculated as the sum of gray values in the black area subtracted from the sum of gray values in the white area. To extract the facial features, each haar-like feature is moved over the face region, which is marked as a rectangular window as shown in Figure 5. Non facial features are extracted in the same way by moving each haar-like feature over the non face regions. Figure 5 also shows the scaling of features. Scaling is done by increasing the size of the feature window along the vertical and horizontal directions while moving it over the face and non face regions.
For training the cascaded SVM, a set of 1700 feature vectors (120 from face and 1580 from non face regions), each of dimension 60, is extracted using the first type of haar-like features, 1700 feature vectors (120 from face and 1580 from non face regions) using the second type, and 700 feature vectors (120 from face and 580 from non face regions) using the third type.
For training the cascaded RBFNN, a set of 700 feature vectors (120 from face and 580 from non face regions), each of dimension 60, is extracted using the first type of haar-like features, 700 feature vectors (120 from face and 580 from non face regions) using the second type, and 700 feature vectors (120 from face and 580 from non face regions) using the third type.
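The sliding and scaling of a haar-like feature over a window can be sketched as follows. This is an illustrative sketch under stated assumptions: the exact shapes of the three feature types in Figure 4 are not reproduced here, so a generic two-rectangle feature (white left half, black right half) is assumed, and the names `area_sum`, `two_rect_feature` and `extract_features` are hypothetical. It does not attempt to reproduce the 60-dimensional vectors used in the paper.

```python
def area_sum(gray, x, y, w, h):
    """Sum of gray values in the w x h rectangle with top-left corner (x, y)."""
    return sum(gray[r][c] for r in range(y, y + h) for c in range(x, x + w))


def two_rect_feature(gray, x, y, w, h):
    """White (left half) sum minus black (right half) sum; w must be even."""
    half = w // 2
    white = area_sum(gray, x, y, half, h)
    black = area_sum(gray, x + half, y, half, h)
    return white - black


def extract_features(gray, sizes, step=1):
    """Slide the feature over the window once for each (w, h) size."""
    height, width = len(gray), len(gray[0])
    features = []
    for w, h in sizes:  # scaling: the feature window grows in both directions
        for y in range(0, height - h + 1, step):
            for x in range(0, width - w + 1, step):
                features.append(two_rect_feature(gray, x, y, w, h))
    return features


# Toy 4 x 4 window: bright left half, dark right half.
window = [[10, 10, 0, 0] for _ in range(4)]
vector = extract_features(window, sizes=[(2, 1), (4, 2)])
```

A practical implementation would replace `area_sum` with lookups into an integral image so that each feature costs a constant number of operations.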

Support Vector Machine
The support vector machine (SVM) is a learning machine [22,2,19] based on the principle of structural risk minimization (SRM). For two-class linearly separable data, the SVM finds a decision boundary which is a hyperplane. This hyperplane, defined by the support vectors, separates the data by maximizing the margin of separation. For linearly inseparable data, the SVM maps the input pattern space X into a high-dimensional feature space Z using a nonlinear function Φ(x). It then finds an optimal hyperplane as the decision surface to separate the examples of the two classes in the feature space. Cover's theorem states that "a complex pattern classification problem cast in a high dimensional space nonlinearly is more likely to be linearly separable than in a low dimensional space." An example of mapping two-dimensional data into three-dimensional space using the function $\Phi(x) = \{x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\}$ is shown in Figure 6. This shows that data which is linearly inseparable in two-dimensional space can be linearly separable in three-dimensional space.
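The effect of such a mapping can be checked numerically with a small sketch. It assumes the third component of the map is scaled by the square root of 2, as in the usual degree-2 polynomial map; the toy data (points inside and outside the unit circle) and the hand-picked hyperplane are illustrative assumptions, not the data of Figure 6, and no SVM training is performed.

```python
import math


def phi(x1, x2):
    """Map a 2-D point into 3-D: (x1^2, x2^2, sqrt(2) * x1 * x2)."""
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_decision(z, w, b):
    """Side of the hyperplane w . z + b = 0 on which the mapped point z lies."""
    return 1 if dot(w, z) + b > 0 else -1


# Toy data: class -1 lies inside the unit circle, class +1 outside it,
# so no straight line separates the classes in the original 2-D space.
inner = [(0.0, 0.0), (0.5, 0.0), (0.0, -0.5), (-0.4, 0.4)]
outer = [(2.0, 0.0), (0.0, 2.0), (-1.5, -1.5), (1.5, -1.5)]

# In the mapped space the circle x1^2 + x2^2 = 1 becomes the plane z1 + z2 - 1 = 0.
w, b = (1.0, 1.0, 0.0), -1.0
inner_labels = [linear_decision(phi(*p), w, b) for p in inner]
outer_labels = [linear_decision(phi(*p), w, b) for p in outer]
```

With this scaling the inner product in the mapped space equals the squared inner product in the input space, which is why a polynomial kernel can be evaluated without forming Φ(x) explicitly.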
The support vector machine can be used for modelling the face and non face regions. The feature vectors are extracted from face and non face regions and given as input to the SVM to find the optimal hyperplane separating the two. The proposed method uses three SVM classifiers to construct the cascade, as shown in the cascade figure. The first classifier is tested with the first type of haar-like features; it discriminates possible face regions from the background regions and passes them to the second classifier. The second classifier is tested with the second type, which in turn discriminates the most likely face regions and passes them to the third one. Finally, the third classifier, tested with the third type, detects the exact face region in the image. Since the cascade of classifiers rejects non face regions at each stage, the face region is efficiently detected at the final stage, and the proposed method performs better than the single SVM classifier.

Radial Basis Function Neural Networks
The radial basis function neural network (RBFNN) is a variant of the artificial neural network (ANN). It has a feedforward architecture [23][24][25] with an input layer, a hidden layer and an output layer. It is applied to supervised learning problems and is associated with radial basis functions. An RBFNN trains faster than a multilayer perceptron. It can be applied to fields such as control engineering, time-series prediction, electronic device parameter modelling, speech recognition, image restoration, motion estimation and data fusion. The architecture of the RBFNN is shown in Figure 8. Radial basis functions are embedded into a two-layer feedforward neural network. Such a network is characterized by a set of inputs and a set of outputs. Between the inputs and outputs there is a layer of processing units called hidden units, each of which implements a radial basis function. The input layer of this network has $n_i$ units for an $n_i$-dimensional input vector. The input units are fully connected to the $n_h$ hidden layer units, which are in turn fully connected to the $n_c$ output layer units.

The activation functions of the hidden layer were chosen to be Gaussians, and are characterized by their mean vectors (centers) $\mu_i$ and covariance matrices $C_i$, $i = 1, 2, \ldots, n_h$. For simplicity, it is assumed that the covariance matrices are of the form $C_i = \sigma_i^2 I$, $i = 1, 2, \ldots, n_h$. Then the activation function of the $i$th hidden unit for an input vector $x_j$ is given by

$$g_i(x_j) = \exp\left(-\frac{\lVert x_j - \mu_i \rVert^2}{2\sigma_i^2}\right) \tag{5}$$

The $\mu_i$ and $\sigma_i^2$ are calculated using a suitable clustering algorithm. Here the k-means clustering algorithm is employed to determine the centers. The algorithm is composed of the following steps: 1) Randomly initialize the samples to k means (clusters). 2) Classify the n samples according to the nearest $\mu_k$. 3) Recompute $\mu_k$. 4) Repeat steps 2 and 3 until there is no change in $\mu_k$.
The number of activation functions in the network and their spread influence the smoothness of the mapping. The assumption $\sigma_i^2 = \sigma^2$ is made, and $\sigma^2$ is given in (6) to ensure that the activation functions are not too peaked or too flat.

$$\sigma^2 = \frac{\eta d^2}{2} \tag{6}$$

In the above equation, d is the maximum distance between the chosen centers, and η is an empirical scale factor which serves to control the smoothness of the mapping function. Therefore, Equation (5) is written as

$$g_i(x_j) = \exp\left(-\frac{\lVert x_j - \mu_i \rVert^2}{\eta d^2}\right) \tag{7}$$

The hidden layer units are fully connected to the $n_c$ output layer units through weights $w_{ik}$. The output units are linear, and the response of the $k$th output unit for an input $x_j$ is given by

$$y_k(x_j) = \sum_{i=0}^{n_h} w_{ik}\, g_i(x_j), \quad k = 1, 2, \ldots, n_c \tag{8}$$

where $g_0(x_j) = 1$. Given $n_t$ feature vectors from $n_c$ classes, training the RBFNN involves determining the centers $\mu_i$, the parameter d and the weights $w_{ik}$.
Conventionally, the unsupervised k-means clustering algorithm can be applied to find $n_h$ clusters from $n_t$ training vectors. However, the training vectors of a class may not fall into a single cluster. In order to obtain clusters only according to class, the k-means clustering may be used in a supervised manner. Training feature vectors belonging to the same class are clustered into $n_h/n_c$ clusters using the k-means clustering algorithm. This is repeated for each class, yielding $n_h$ clusters for $n_c$ classes. These cluster means are used as the centers $\mu_i$ of the Gaussian activation functions in the RBFNN. The parameter d is then computed by finding the maximum distance between the $n_h$ cluster means.
Determining the weights $w_{ik}$ between the hidden and output layer: Given that the Gaussian function centers and widths are computed from $n_t$ training vectors, (8) may be written in matrix form as

$$Y = GW \tag{9}$$

where Y is an $n_t \times n_c$ matrix with elements $Y_{ij} = y_j(x_i)$, G is an $n_t \times (n_h + 1)$ matrix with elements $G_{ij} = g_j(x_i)$, and W is an $(n_h + 1) \times n_c$ matrix of unknown weights. W is obtained from the standard least squares solution as given by

$$W = (G^T G)^{-1} G^T Y \tag{10}$$

To solve for W from (10), G is completely specified by the clustering results, and the elements of Y are specified as

$$Y_{ij} = \begin{cases} 1, & \text{if } x_i \in \text{class } j \\ 0, & \text{otherwise} \end{cases}$$

The radial basis function neural networks can be used for modelling the face and non face regions.
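The center selection and hidden-layer activations described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the width is tied to the centers as σ² = ηd²/2, so that each activation is exp(-||x - μ||² / (ηd²)), and the helper names (`kmeans`, `rbf_centers`, `max_center_distance`, `activations`), the seed and the toy data are all hypothetical. The least squares step for the output weights is left out.

```python
import math
import random


def kmeans(samples, k, seed=0, max_iter=100):
    """Plain k-means: returns k cluster means for a list of vectors."""
    rng = random.Random(seed)
    means = [list(s) for s in rng.sample(samples, k)]  # step 1: random start
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for s in samples:  # step 2: assign each sample to its nearest mean
            nearest = min(range(k), key=lambda i: math.dist(s, means[i]))
            clusters[nearest].append(s)
        # Step 3: recompute the means (an empty cluster keeps its old mean).
        new_means = [
            [sum(col) / len(c) for col in zip(*c)] if c else means[i]
            for i, c in enumerate(clusters)
        ]
        if new_means == means:  # step 4: stop when the means no longer change
            break
        means = new_means
    return means


def rbf_centers(samples_by_class, clusters_per_class, seed=0):
    """Supervised use of k-means: cluster each class separately."""
    centers = []
    for samples in samples_by_class:
        centers.extend(kmeans(samples, clusters_per_class, seed))
    return centers


def max_center_distance(centers):
    """d: the maximum distance between any two centers."""
    return max(math.dist(a, b) for a in centers for b in centers)


def activations(x, centers, d, eta):
    """g_0 = 1 followed by the Gaussian activation of each hidden unit."""
    return [1.0] + [
        math.exp(-math.dist(x, mu) ** 2 / (eta * d * d)) for mu in centers
    ]


# Toy data: two well-separated classes, one cluster per class.
face = [(0, 0), (0, 1), (1, 0), (1, 1)]
non_face = [(10, 10), (10, 11), (11, 10), (11, 11)]
centers = rbf_centers([face, non_face], clusters_per_class=1)
d = max_center_distance(centers)
g = activations((0.5, 0.5), centers, d, eta=1.0)
```

Stacking the activation vectors of all training samples row by row gives the matrix G, from which the output weights follow by least squares.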

The first classifier (RBFNN1) is tested with the first type of haar-like features; it discriminates possible face regions from the background regions and passes them to the second classifier. The second classifier (RBFNN2) is tested with the second type, which in turn discriminates the most likely face regions and passes them to the third one. Finally, the third classifier (RBFNN3), which is tested with the third type, detects the exact face region in the image. Since the cascade of classifiers rejects non face regions at each stage, the face region is efficiently detected at the final stage, and the proposed method with cascaded RBFNN performs better than the single RBFNN classifier.
The hidden layer nodes are assigned weights of σ², where 10 clusters (2 clusters for face and 8 clusters for non face) are formed. In this work, the output layer consists of 2 nodes, where the patterns are classified as belonging to face or non face.

Modelling of Mouth Region for Expression Classification
The mouth region in a face plays an important role in classifying the basic facial expressions. A set of 100 × 16 feature vectors from 100 frames in the video is extracted for each expression from the facial region while the face is being tracked. A multiclass SVM is used to model these features. It is trained initially with 400 × 16 feature vectors. While testing, the mouth region is tracked in a video and the features are extracted and given as input to the multiclass SVM models for classification.

Experimental Results
The images from which faces are to be tracked are captured at the rate of 10 frames/sec using a Logitech Quickcam Pro5000 web camera on a PC with a 2.00 GHz Intel Core 2 Duo processor. The performance is evaluated by tracking the face in a video of one hour duration using SVM and RBFNN. Once the face is detected in the first frame, it is tracked in consecutive frames. The tracked face region in consecutive frames is indicated by a rectangle drawn over the face region, as shown in Figures 12 and 13. While tracking the face region, the mouth region is also tracked to classify the facial expressions. The tracking performance is compared for different SVM kernel functions and RBFNN. The SVM with polynomial kernel performs better than the linear kernel, the radial basis function kernel and the RBFNN. While the face is being tracked, mouth features are extracted from about 100 frames for each of the four expressions separately. Hence a total of 400 × 16 feature vectors are used to train a multiclass SVM. While testing, the SVM is tested on 400 frames (100 consecutive frames for each of the four expressions separately).
Table 2 shows the performance of the proposed method in recognizing facial expressions.

Figure 14. Accuracy rates for facial expression recognition
The confusion matrix given in Table 3 shows that out of 100 frames with fear expressions, 96 frames have been correctly classified and 4 frames have been misclassified.

Conclusions and Future Work
This paper proposes a method for face tracking and facial expression recognition using cascaded SVM and cascaded RBFNN classifiers. A face is detected in the first frame of the video using the face detection process proposed in [1]. The face is then tracked in consecutive frames by classifying the face and non face regions using the cascaded SVM and cascaded RBFNN. Since the non face regions are rejected at each stage, the face region is effectively tracked at the final stage even under varying illumination, scale and pose. The RBFNN classifier is trained with a maximum of 700 feature vectors, whereas the SVM is trained with a maximum of 1700 feature vectors. A drawback of the RBFNN is that its performance degrades with a large amount of data. Hence the system tracks the face with better performance using the SVM with polynomial kernel than using the RBFNN. The system using a cascaded SVM achieves a precision of 91.4%, which is better than the system using a single SVM classifier [2] in the literature, whose maximum precision is 72.8% with the same polynomial kernel.
The proposed method is also compared with a multistage face tracker [26], in which the ratio template algorithm was used. The ratio template algorithm was modified with the inclusion of a better spatial template for facial features. This hybrid face tracker was able to locate the face in only 89% of the images in the sequence, which is lower than the proposed method.
Among the four facial expressions, the proposed system is able to recognize the fear expression with the highest accuracy of 96.0%. The system can be extended to track multiple faces in the video sequence with a larger number of expressions.

Figure 4. Types of haar-like features for feature selection.

The gray values in the black and white areas are obtained by converting the RGB image into a gray-level image I.

Figure 5. Extraction of scaled type of haar-like features.

Figure 8. Structure of Radial basis function neural network.
Here, $n_c$ is the number of output classes.

The mouth region is used for classifying the basic facial expressions, which are shown in Figure 10. To extract the features from the mouth region, it is first located in the face region. The mouth location is determined relative to the coordinates of the face detection window, as shown in Figure 11. Once the mouth is located, it is divided into sixteen regions. The average of the gray values in each of the sixteen regions is extracted to model the facial expressions. Feature vectors are extracted from the video for each of the expressions separately. In this work, four facial expressions (normal, smile, anger and fear) are considered, and hence a total of 400 × 16 feature vectors are extracted.
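The sixteen-region averaging can be sketched as below. The text does not say how the sixteen regions are laid out, so a 4 × 4 grid is assumed here; the function name `mouth_features` is hypothetical, and the input is assumed to be the already-cropped mouth region as a list of rows of gray values.

```python
def mouth_features(gray, grid=4):
    """Average gray value in each cell of a grid x grid partition.

    gray covers only the mouth region; its height and width are assumed
    to be at least `grid` so that no cell is empty.
    """
    height, width = len(gray), len(gray[0])
    features = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell))
    return features


# Toy 8 x 8 mouth region in which every 2 x 2 cell holds its own cell index.
mouth = [[(y // 2) * 4 + (x // 2) for x in range(8)] for y in range(8)]
vector = mouth_features(mouth)
```

Each frame then contributes one 16-dimensional vector, so 100 frames per expression and four expressions give the 400 × 16 training set mentioned in the text.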

Figure 11. Location of mouth region.

Table 3 shows the confusion matrix for the test examples. The accuracy rates of facial expression recognition measured with different numbers of consecutive frames are shown in Figure 14. From Table 2, it is observed that the fear expression is recognized better than the other expressions.