Locality Preserving Discriminant Projection for Speaker Verification

In this paper, a manifold subspace learning algorithm based on locality preserving discriminant projection (LPDP) is used for speaker verification. LPDP can overcome the deficiency of total variability factor analysis and locality preserving projection (LPP). LPDP can effectively use the speaker label information of speech data. Through optimization, LPDP can maintain the inherent local manifold structure of the speech data samples of the same speaker by reducing the distance between them. At the same time, LPDP can enhance the discriminability of the embedding space by expanding the distance between the speech data samples of different speakers. The proposed method is compared with LPP and total variability factor analysis on the NIST SRE 2010 telephone-telephone core condition. The experimental results indicate that the proposed LPDP can overcome the deficiency of LPP and total variability factor analysis and can further improve the system performance.


Introduction
Speaker verification is a subtask of speaker recognition, whose purpose is to verify whether a segment of speech is spoken by a designated speaker [1] [2]. Total variability factor analysis has been widely used in speaker verification [3] [4] [5] [6]. In total variability factor analysis, the speaker and the channel variabilities are contained simultaneously in a low-dimensional space which is referred to as the total variability space. By this space mapping, the useful information can be obtained by reducing the dimensionality of the mean supervector of the Gaussian mixture model (GMM), and the latent variables can be estimated using limited data. The low-dimensional variable characterizing the speaker's identity is called the total variability factor vector, or i-vector. A support vector machine (SVM) can be used as a classifier for i-vectors [7] [8].
As an application of probabilistic principal component analysis (PPCA), total variability factor analysis only analyzes the speech data from a global perspective [9] [10]. To compensate for the deficiency, we introduced locality preserving projection (LPP) [11], neighborhood preserving embedding (NPE) [12], and discriminant neighborhood embedding (DNE) [13] to speaker verification. By constructing a graph containing the neighborhood information of the speech data, the inherent local neighborhood relationship of the speech data is optimally preserved. Combined with total variability factor analysis, the performance of speaker verification is improved [14] [15]. Here, LPP is an unsupervised learning algorithm [11] [16] that is not concerned with the speaker label information in the dimensionality-reduction process and does not make use of the discriminative information between the speech data of different speakers. However, the speaker label information of the training data and the discriminative information of the speech data are of great importance in speaker verification.
In view of the above shortcomings of LPP, we apply the locality preserving discriminant projection (LPDP) algorithm in speaker verification. LPDP can bring in the speaker label information from the speech data and, through optimization, preserve the inherent local manifold structure of the speech data samples from the same speaker to reduce the distance between them. At the same time, the distance between the speech data samples from different speakers is enlarged to enhance the discriminative ability of the embedding space.
The remainder of this paper is organized as follows. The LPP algorithm based on i-vector is introduced in Section 2. The LPDP algorithm is proposed in Section 3. The experiment and results are presented in Section 4. The conclusion is given in Section 5.

Total Variability Factor Analysis
Based on the total variability space, the GMM mean supervector containing speaker and channel information in the speech data can be expressed as

M = m + Tw, (1)

where m is the mean supervector of the universal background model (UBM) independent of the speaker and channel; T is the total variability space, which is defined by the total variability matrix; and w is a low-dimensional latent variable that obeys the normal distribution, known as the total variability factor vector, or identity vector (i-vector). Total variability factor analysis can be regarded as a feature-extraction module. It projects the speech data into the low-rank total variability space T to obtain the i-vector w. The training method of T and the extraction process of the i-vector have been described previously [4] [8].
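Equation (1) is a low-rank decomposition of the supervector. A minimal numpy sketch follows, with toy random m and T; the least-squares estimate below is only a stand-in for the true i-vector extractor, which computes the posterior mean of w from Baum-Welch statistics collected against the UBM:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: supervector dimension D and total-variability rank R.
D, R = 512, 40
m = rng.normal(size=D)            # UBM mean supervector (m in Eq. 1)
T = rng.normal(size=(D, R))       # total variability matrix (T in Eq. 1)

w_true = rng.normal(size=R)       # latent i-vector
M = m + T @ w_true                # Eq. (1): M = m + Tw

# Least-squares point estimate of the i-vector.  This is a simplification:
# the real extractor weights the statistics by UBM component posteriors.
w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)
```

In this noise-free toy setting the least-squares fit recovers the latent variable exactly, since M - m lies in the column space of T.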
The intersession compensation can be carried out in a low-dimensional space where the i-vector lies. The linear discriminant analysis (LDA) approach [17] and within class covariance normalization (WCCN) approach [18] are often used for intersession compensation. After the intersession compensation, modeling and scoring are made using SVM.
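A minimal sketch of the WCCN step is given below (LDA, which would typically precede it, is omitted for brevity; the function name and the toy data are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def wccn(ivectors, labels):
    """WCCN sketch: whiten the average within-speaker covariance.

    B is chosen so that B^T W B = I, where W is the mean of the
    per-speaker covariance matrices.
    """
    X = np.asarray(ivectors, float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    dim = X.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        W += np.cov(X[labels == c], rowvar=False, bias=True)
    W /= len(classes)
    B = np.linalg.cholesky(np.linalg.inv(W))   # W^{-1} = B B^T
    return X @ B, B

# Toy data: 3 speakers, 100 samples each, shared within-class scatter.
rng = np.random.default_rng(1)
means = rng.normal(scale=5.0, size=(3, 4))
X = np.vstack([mu + rng.normal(size=(100, 4)) for mu in means])
y = np.repeat([0, 1, 2], 100)
Xw, B = wccn(X, y)
```

After the transform, the average within-speaker covariance of Xw is the identity, so within-speaker (session) variability no longer dominates any direction of the space.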

LPP Algorithm
The speaker verification system framework, in which the LPP algorithm based on i-vector is used, is presented in Figure 1. The dashed boxes from left to right refer to Enrollment, Training and Testing, respectively.
On the basis of the i-vector, the LPP algorithm is used to achieve an effective combination of the total variability factor analysis technique and the LPP algorithm, which retains both the global and local neighborhood structures of the speech data and thereby significantly improves system performance [11]. However, the known speaker label information of the speech data is not used in the dimensionality-reduction process of the LPP algorithm. As a result, although the locality-preserving projection space matrix P has a strong descriptive ability, its discriminative ability is not strong, which to a certain degree affects the recognition performance of the system.

LPDP Algorithm
LPDP is an effective manifold learning method that has been successfully applied in face recognition [19]. The basic idea of LPDP is to divide the nearest neighbor graph in the LPP algorithm into in-class and out-of-class graphs. LPDP can maintain the local neighborhood relationship of the same speaker's speech data samples and reduce the distance between them. At the same time, LPDP emphasizes the discrimination information between speakers and expands the distance between their speech data. Combined with total variability factor analysis, the algorithm can analyze the feature structure of the speech data more comprehensively, both globally and locally, while reflecting between-speaker differences and enhancing the discriminative ability of the embedding space.
The idea of applying LPDP to speaker verification is similar to that of LPP, as shown in Figure 1. The i-vectors corresponding to N given items of training speech data with speaker labels constitute a vector set {w_1, w_2, ..., w_N}. The purpose of LPDP is to find an optimal locality preserving discriminant projection space matrix A that embeds the i-vectors of the speech from the space R^D into the feature space R^K (K < D). In the R^K space, the i-vector w_i is transformed to y_i = A^T w_i. The steps to train the locality preserving discriminant projection space matrix A are as follows.
Step 1: Determine the neighborhood of the i-vector w_i, which consists of all the i-vectors whose distance from w_i is less than the average, i.e.,

NB(w_i) = { w_j | d(w_i, w_j) < MS(w_i), j ≠ i },

where MS(w_i) is the average distance between w_i and all the N i-vectors of the training speech data, and NB(w_i) represents the neighborhood i-vectors of w_i.
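Step 1 can be sketched as follows (a toy illustration; Euclidean distance is an assumption, and the function name is ours):

```python
import numpy as np

def neighborhoods(ivectors):
    """Step 1 sketch: NB(w_i) = { j != i : d(w_i, w_j) < MS(w_i) },
    where MS(w_i) is the average distance from w_i to the other
    training i-vectors (Euclidean distance is an assumption here)."""
    X = np.asarray(ivectors, float)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)  # pairwise distances
    MS = D.sum(axis=1) / (n - 1)       # average over the other i-vectors
    return [np.flatnonzero((D[i] < MS[i]) & (np.arange(n) != i))
            for i in range(n)]

# Two well-separated toy clusters: each point's neighbors should be
# exactly the other points of its own cluster.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
NB = neighborhoods(X)
```

Because the threshold adapts to each i-vector's own average distance, no global neighborhood size parameter has to be tuned.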
Step 2: Construct two subgraphs of the neighborhood graph: the in-class graph G_in and the out-of-class graph G_out. In both graphs, the i-th node corresponds to the i-vector w_i. For the in-class graph G_in, we put a directed edge from node i to node j if i-vector w_j is in the neighborhood of i-vector w_i and is from the same class as w_i. For the out-of-class graph G_out, we put a directed edge from node i to node j if i-vector w_j is in the neighborhood of i-vector w_i but is from a different class than w_i.
Step 3: Calculate the weights of the edges in G in and G out , and obtain their respective weight matrices, W in and W out .
1) Denote the weight of the edge between i-vector w_i and i-vector w_j in G_in as W_ij^in and choose its value as

W_ij^in = exp(-||w_i - w_j||^2 / t), if w_j ∈ NB(w_i) and spk(w_j) = spk(w_i); W_ij^in = 0, otherwise.

2) Denote the weight of the edge between i-vector w_i and i-vector w_j in G_out as W_ij^out and choose its value as

W_ij^out = exp(-||w_i - w_j||^2 / t), if w_j ∈ NB(w_i) and spk(w_j) ≠ spk(w_i); W_ij^out = 0, otherwise.

Here, spk(w_i) represents the speaker label information of i-vector w_i, and t is the mean distance of all the i-vectors for the training speech data.
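Steps 2 and 3 can be sketched together (the heat-kernel weight form is the usual LPDP choice and is an assumption here, as is the toy all-pairs neighborhood):

```python
import numpy as np

def lpdp_weights(ivectors, labels, NB):
    """Steps 2-3 sketch: heat-kernel weights on the in-class (G_in) and
    out-of-class (G_out) neighborhood graphs, with kernel width t set
    to the mean pairwise distance of the training i-vectors."""
    X = np.asarray(ivectors, float)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    t = D[np.triu_indices(n, 1)].mean()        # mean pairwise distance
    W_in = np.zeros((n, n))
    W_out = np.zeros((n, n))
    for i in range(n):
        for j in NB[i]:
            w = np.exp(-D[i, j] ** 2 / t)
            if labels[i] == labels[j]:
                W_in[i, j] = w                 # same speaker  -> G_in
            else:
                W_out[i, j] = w                # other speaker -> G_out
    return W_in, W_out

# Toy example: 4 i-vectors, 2 speakers, everyone in everyone's neighborhood.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = [0, 0, 1, 1]
NB = [[j for j in range(4) if j != i] for i in range(4)]
W_in, W_out = lpdp_weights(X, y, NB)
```

Each pair of neighbors thus contributes to exactly one of the two graphs, depending on whether the speaker labels agree.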
Step 4: Calculate the locality preserving discriminant projection matrix A. The idea of LPDP is that, in the embedding space, the i-vectors from the same speaker have the smallest in-class divergence after projection, i.e., the distance between the same speaker's i-vectors is as small as possible. Conversely, the i-vectors from different speakers have the largest between-class divergence after projection, i.e., they are as far from each other as possible. To achieve these goals, they are integrated into the following two optimization problems [20]:

min_A sum_{i,j} ||A^T w_i - A^T w_j||^2 W_ij^in, max_A sum_{i,j} ||A^T w_i - A^T w_j||^2 W_ij^out, (7)

where W^in and W^out are the weight matrices obtained in Step 3. Defining the graph Laplacians

L_in = D_in - W^in, L_out = D_out - W^out, (8)

in which D_in and D_out are the diagonal degree matrices with D_ii = sum_j W_ij, the two problems can be further transformed to a generalized eigenvalue problem,

X L_out X^T a = λ X L_in X^T a, (9)

where X = [w_1, w_2, ..., w_N]. By solving Equation (9), the locality preserving discriminant projection space A = [a_1, a_2, ..., a_K] can be obtained, where a_1, a_2, ..., a_K are the eigenvectors corresponding to the largest K eigenvalues of the above problem.
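Step 4 can be sketched as a direct numpy solution of the generalized eigenvalue problem (the symmetrization of the directed graphs and the small ridge term on the in-class scatter are assumptions made so the toy problem stays well conditioned):

```python
import numpy as np

def lpdp_projection(ivectors, W_in, W_out, K):
    """Step 4 sketch: form Laplacians L = D - W for both graphs and
    solve X L_out X^T a = lambda X L_in X^T a, keeping the eigenvectors
    of the K largest eigenvalues as the columns of A."""
    X = np.asarray(ivectors, float).T           # columns are i-vectors
    def laplacian(W):
        Ws = (W + W.T) / 2                      # symmetrize directed graph
        return np.diag(Ws.sum(axis=1)) - Ws
    S_in = X @ laplacian(W_in) @ X.T + 1e-8 * np.eye(X.shape[0])
    S_out = X @ laplacian(W_out) @ X.T
    vals, vecs = np.linalg.eig(np.linalg.solve(S_in, S_out))
    order = np.argsort(vals.real)[::-1]         # largest eigenvalues first
    return vecs.real[:, order[:K]]              # A, shape (D, K)

# Toy graphs: same-speaker pairs differ along axis 0, different-speaker
# pairs differ along axis 1, so the best 1-D projection is axis 1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W_in = np.zeros((4, 4)); W_out = np.zeros((4, 4))
W_in[0, 1] = W_in[1, 0] = W_in[2, 3] = W_in[3, 2] = 1.0      # same speaker
W_out[0, 2] = W_out[2, 0] = W_out[1, 3] = W_out[3, 1] = 1.0  # different
A = lpdp_projection(X, W_in, W_out, K=1)
Y = X @ A                                       # embedded i-vectors
```

In the toy example the learned direction collapses same-speaker pairs onto one point while keeping the two speakers apart, which is exactly the trade-off Equation (7) encodes.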

Experimental Setup
Experiments were carried out on the core test set of the NIST SRE 2010 telephone-telephone condition.

Experimental Results
To verify the performance of the proposed LPDP algorithm, we experimentally compared it with the traditional total variability factor analysis and LPP algorithms. Table 1 shows the performance comparison of the three algorithms without channel compensation. It is observed that applying the LPDP algorithm to the i-vector effectively combines total variability factor analysis with locality preserving projection, maintaining both the global and local neighborhood structures of the speech data. Compared to total variability factor analysis, which can only preserve the global structure of the speech data, LPP and LPDP significantly improve system performance. LPDP also makes effective use of the speaker label information of the speech data and, through optimization, maintains the intrinsic local manifold structure of the same speaker's speech data. In addition, LPDP expands the distance between the speech data of different speakers, enhancing the discriminative ability of the embedding space and further improving system performance. Compared with LPP, LPDP achieves a relative improvement of 16.36% in EER and 13.04% in minDCF on the male testing dataset, and 29.33% in EER and 8.67% in minDCF on the female testing dataset.
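The relative improvements quoted above follow the usual convention for error metrics; a short sketch (the EER values below are hypothetical placeholders chosen for illustration, not the paper's actual numbers):

```python
def relative_improvement(baseline, proposed):
    """Relative improvement in percent: how much the proposed system
    reduces an error metric (EER or minDCF) versus the baseline."""
    return (baseline - proposed) / baseline * 100.0

# Hypothetical illustration only: an EER drop from 5.50% to 4.60%
# corresponds to a relative improvement of about 16.36%.
improvement = relative_improvement(5.50, 4.60)
```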

Conclusion
On the basis of LPP, this paper introduced LPDP to speaker verification. LPDP makes full use of the speaker label information of the speech data to categorize and differentiate the neighborhood. It can overcome the shortcomings of the total variability factor analysis method, maintain the intrinsic local neighborhood relationship of in-class (same-speaker) speech data, and more comprehensively reflect the global and local structure of the speech data. It can also address the inadequacy of LPP and maximize the distance between out-of-class (different-speaker) speech data to obtain the most discriminative feature vector and enhance the discriminative ability of the projection space, thereby improving the recognition performance of the system. Our future work will be devoted to enhancing the discrimination of the embedding space and further improving the recognition performance of the system.