Linear Dimension Reduction for Multiple Heteroscedastic Multivariate Normal Populations

For the case where all multivariate normal parameters are known, we derive a new linear dimension reduction (LDR) method to determine a low-dimensional subspace that preserves or nearly preserves the original feature-space separation of the individual populations and the Bayes probability of misclassification. We also give necessary and sufficient conditions that yield the smallest reduced dimension essentially retaining, in the reduced space, the Bayes probability of misclassification from the original full-dimensional space. Moreover, our new LDR procedure requires no computationally expensive optimization. Finally, for the case where parameters are unknown, we devise an LDR method based on our new theorem and compare it with three competing LDR methods using Monte Carlo simulations and a parametric bootstrap based on real data.


Introduction
The fact that the Bayes probability of misclassification (BPMC) of a statistical classification rule does not increase as the dimension of the feature space increases, provided the class-conditional probability densities are known, is well known. However, in practice, when parameters are estimated and the feature-space dimension is large relative to the training-sample sizes, the performance of a sample discriminant rule may be considerably degraded. This phenomenon gives rise to the paradoxical behavior that [1] has called the curse of dimensionality.
An exact relationship between the expected probability of misclassification (EPMC), the training-sample sizes, the feature-space dimension, and the actual parameters of the class-conditional densities is challenging to obtain. In general, as the classifier becomes more complex, the ratio of sample size to dimensionality must increase at an exponential rate to avoid the curse of dimensionality. The authors of [2] have suggested using at least ten times as many training samples per class as feature dimensions. Hence, as the number of feature variables p becomes large relative to the training-sample sizes n_i, i = 1, …, m, where m is the number of classes, one might wish to use a smaller number of the feature variables to improve classifier performance or computational efficiency. This approach is called feature subset selection.
Another effective approach for obtaining a reduced dimension and avoiding the curse of dimensionality is linear dimension reduction (LDR). Perhaps the most well-known LDR procedure for the m-class problem is linear discriminant analysis (LDA) from [3], which is a generalization of the linear discriminant function (LDF) derived in [4] for the case m = 2. The LDA LDR method determines a vector a that maximizes the ratio of between-class scatter to average within-class scatter in the lower-dimensional space. The sample within-class scatter matrix is

S_W = Σ_{i=1}^{m} α_i S_i,

where S_i is the estimated sample covariance matrix for the ith class, and the sample between-class scatter matrix is

S_B = Σ_{i=1}^{m} α_i (x̄_i − x̄)(x̄_i − x̄)′,

where x̄_i is the sample mean vector for class Π_i, α_i is its a priori probability of class membership, x̄ is the estimated overall mean, and i = 1, …, m. For m = 2 with α_1 = α_2, [4] determined the vector a that maximizes the criterion function

J(a) = (a′ S_B a) / (a′ S_W a),

which is achieved by an eigenvalue decomposition of S_W^{−1} S_B. An attractive feature of LDA as an LDR method is that it is computationally appealing; however, it attempts to maximally separate the class means and does not incorporate the discriminatory information contained in the differences of the class covariance matrices. Many alternative approaches to the LDA LDR method have been proposed. For example, canonical variables can be viewed as an extension of the LDF when m > 2 and q > 1 (see [5]). Other straightforward extensions of the LDF include those in [6] [7].
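The eigendecomposition step above can be sketched as follows. This is a minimal illustration in Python with NumPy (the paper's own simulations used R, which is not reproduced here); the function name and interface are ours, not the paper's.

```python
import numpy as np

def lda_directions(X, y, r):
    """Leading LDA directions via the eigendecomposition of S_W^{-1} S_B.

    X is an (n, p) data matrix and y holds integer class labels; the
    function returns an (r, p) LDR matrix. Names are illustrative.
    """
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                 # estimated a priori probabilities
    p = X.shape[1]
    xbar = X.mean(axis=0)                    # estimated overall mean
    Sw = np.zeros((p, p))                    # within-class scatter S_W
    Sb = np.zeros((p, p))                    # between-class scatter S_B
    for k, a in zip(classes, priors):
        Xk = X[y == k]
        Sw += a * np.cov(Xk, rowvar=False)
        d = (Xk.mean(axis=0) - xbar)[:, None]
        Sb += a * (d @ d.T)
    # eigenvectors of S_W^{-1} S_B, sorted by decreasing eigenvalue
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)
    return vecs.real[:, order[:r]].T         # (r, p) LDR matrix
```

As the text notes, this criterion separates class means only; differences among the class covariance matrices do not enter S_W or S_B beyond their average.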
Extensions of LDA that incorporate information on the differences in covariance matrices are known as heteroscedastic linear dimension reduction (HLDR) methods. The authors of [8] have proposed an eigenvalue-based HLDR approach for m > 2 that utilizes the so-called Chernoff criterion, and they have extended the well-known LDA method using directed distance matrices that can be considered a generalization of (2). Additional HLDR methods have been proposed by authors such as [9]-[10].
Using results by [13] that characterize linear sufficient statistics for multivariate normal distributions, we develop an explicit LDR matrix. Using the Bayes classification procedure, in which we assume equal costs of misclassification and that all class parameters are known, we determine the smallest reduced dimension q < p for which there exists an LDR matrix B ∈ ℝ^{q×p} that preserves all of the classification information originally contained in the p-dimensional feature space. We then derive a linear transformation such that the p-dimensional Bayes classification rule assigns x to Π_k if and only if the corresponding q-dimensional Bayes classification rule assigns Bx to Π_k. We refer to this method as the SY LDR procedure.
Moreover, we use Monte Carlo simulations to compare the classification efficacy of the BE method, sliced inverse regression (SIR), and sliced average variance estimation (SAVE), found in [14]-[16], respectively, with the SY method.
The remainder of this paper is organized as follows. We begin with a brief introduction to the Bayes quadratic classifier in Section 2 and introduce some preliminary results that we use to prove our new LDR method in Section 3. In Section 4, we provide conditions under which the Bayes quadratic classification rule is preserved in the low-dimensional space and derive a new LDR matrix. We establish an SVD-based approximation to our LDR procedure, along with an example of low-dimensional graphical representations, in Section 5. We describe the four LDR methods that we compare using Monte Carlo simulations in Section 6. We present five Monte Carlo simulations in which we compare the competing LDR procedures for various population parameter configurations in Section 7. In Section 8, we compare the four methods using bootstrap simulations for a real data example, and, finally, we offer a few concluding remarks in Section 9.

The Bayes Quadratic Classifier
The Bayesian statistical classifier discriminates on the basis of the class-conditional probability density functions p(x|Π_i), i = 1, …, m, of each class. The Bayes classifier is optimal in the sense that it maximizes the a posteriori probability of class membership, provided all class distributions and corresponding parameters are known. That is, suppose we have m classes; the Bayes decision rule assigns an unlabeled observation x to the class with the largest a posteriori probability. This decision rule partitions the measurement or feature space into m disjoint regions. Using Bayes' rule, the a posteriori probabilities of class membership can be expressed in terms of the class-conditional densities and the a priori probabilities.
This decision rule is known as the Bayes classification rule. Let Π_i be modeled as a p-dimensional multivariate normal distribution. The Bayes decision rule (4) is then to classify the unlabeled observation x into the class k that maximizes the a posteriori probability. The classification rule defined by (4) is known as the quadratic discriminant function (QDF), or the quadratic classifier.
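The QDF rule for known parameters can be sketched as follows; this is a minimal illustration assuming equal misclassification costs, and the helper name and interface are ours, not the paper's.

```python
import numpy as np

def qdf_classify(x, means, covs, priors):
    """Assign x to the class maximizing the quadratic discriminant score
    log a_i - (1/2) log|S_i| - (1/2)(x - m_i)' S_i^{-1} (x - m_i),
    i.e., the Bayes (QDF) rule under equal misclassification costs.
    """
    scores = []
    for m, S, a in zip(means, covs, priors):
        d = x - m
        _, logdet = np.linalg.slogdet(S)     # numerically stable log-determinant
        scores.append(np.log(a) - 0.5 * logdet
                      - 0.5 * d @ np.linalg.solve(S, d))
    return int(np.argmax(scores))            # index of the assigned class
```

When all class covariance matrices are equal, the quadratic terms cancel in pairwise comparisons and the rule reduces to a linear one, which is the special case discussed in Section 4.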

Preliminary Results
The following notation will be used throughout the remainder of the paper. The proof of the main result for the derivation of our new LDR method requires the following notation and lemmas. Let T = FG in (5) be a full-rank decomposition. We now state and prove three lemmas, concerning properties of the factors F and G of (5), that we use in the proof of our main result; the statements and proofs involve matrix identities that follow from (5) and (6).

Linear Dimension Reduction for Multiple Heteroscedastic Nonsingular Normal Populations
We now derive a new LDR method that is motivated by results on linear sufficient statistics derived by [13] and by a linear feature selection theorem given in [17]. The theorem provides necessary and sufficient conditions under which a low-dimensional linear transformation of the original data preserves the BPMC of the original feature space. The theorem also provides an explicit representation of the LDR matrix. Theorem 1 Let Π_i be a p-dimensional multivariate normal population with a priori probability α_i > 0, mean μ_i, and covariance matrix Σ_i, i = 1, …, m. Finally, let M = FG be a full-rank decomposition of M. Then, the p-dimensional Bayes procedure assigns x to Π_k if and only if the q-dimensional Bayes procedure assigns the transformed observation to Π_k. Proof. By Lemma 3, the pairwise discriminant functions are preserved under the transformation. Recall that for i, j = 1, …, m with i ≠ j, the p-variate Bayes procedure defined by (3) assigns x to Π_j if and only if the corresponding pairwise discriminant inequality holds.

w u y y u y y y y
Therefore, the original p-variate Bayes classification assignment is preserved by the linear transformation. Theorem 1 is important in that, if its conditions hold, we obtain an LDR matrix for the reduced q-dimensional subspace such that the BPMC in the q-dimensional space equals the BPMC in the original p-dimensional feature space. With the following corollary, we demonstrate that for two multivariate normal populations with Σ_1 = Σ_2 = Σ and μ_1 ≠ μ_2, the LDR matrix derived in Theorem 1 reduces to the LDF of [4].

Corollary 1 Assuming we have two multivariate normal populations with common covariance matrix Σ_1 = Σ_2 = Σ and means μ_1 ≠ μ_2, the proposed LDR matrix in Theorem 1 reduces to the LDF of [4]. Proof. The proof is immediate from (7).

Low-Dimensional Graphical Representations for Heteroscedastic Multivariate Normal Populations

When the conditions of Theorem 1 do not hold, one cannot use Theorem 1 to directly obtain a q × p LDR matrix that preserves the full-feature BPMC. Also, even when the required conditions of Theorem 1 do hold, we may desire a low-dimensional representation with dimension less than q, say r, where 1 ≤ r < q < p. Thus, we seek to construct an r-dimensional representation that preserves as much of the original p-dimensional BPMC as possible.
One method of obtaining an r-dimensional LDR matrix, 1 ≤ r ≤ q < p, is the SVD approximation to M in (7). We use the following theorem from [17] to determine such an r-dimensional LDR matrix.

Theorem 2 Let C denote the class of all s × t real matrices of rank r, where 1 ≤ r < q < p, and let ‖A‖ denote the usual Euclidean or Frobenius norm of a matrix A. Then the matrix in C nearest to A in this norm is obtained from the SVD of A by retaining the r largest singular values.

Using Theorems 1 and 2, we now construct a linear transformation for projecting high-dimensional data onto a low-dimensional subspace when all class distribution parameters are known. Let M = UD_p V′ be the SVD of the matrix M, and define F = UD_p. Then M_r = UD_r V′ is a rank-r approximation of M, and, therefore, a rank-r approximation to F is F_r = UD_r. Thus, F_r′ is an r × p LDR matrix that yields an r-dimensional representation of the original p-dimensional class models. One can also use F_r to construct low-dimensional representations of high-dimensional class densities. We next provide an example to demonstrate the efficacy of Theorems 1 and 2 for determining low-dimensional representations for multiple multivariate normal populations with known mean vectors and covariance matrices.
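The SVD-based construction of the r × p LDR matrix F_r′ can be sketched as follows; this is a minimal illustration assuming M is supplied with p rows, and the names follow the paper only loosely.

```python
import numpy as np

def rank_r_ldr(M, r):
    """SVD-based rank-r LDR matrix: with M = U D V', retain the r largest
    singular values, so that F_r = U D_r and F_r' is the r x p LDR matrix.

    Assumes M has p rows; names follow the paper loosely.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # s sorted descending
    Fr = U[:, :r] * s[:r]        # U D_r: scale the r leading columns of U
    return Fr.T                  # r x p LDR matrix F_r'
```

A one-dimensional representation, as in the example below, corresponds to `r = 1`, i.e., keeping only the leading singular value and left singular vector.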
In the example, we display the simplicity of Theorem 1 by formulating a low-dimensional representation for three populations with unequal covariance matrices and original dimension p = 6. Note that, unlike the low-dimensional representation of [18], our Theorem 1 does not restrict the reduced dimension to be r = 1.

Example
Consider a configuration of three six-dimensional multivariate normal populations N(μ_i, Σ_i), i = 1, 2, 3. By Theorem 1, the six-dimensional multivariate normal densities can be compressed to dimension q = 2 without increasing the BPMC, and Theorem 1 yields an optimal two-dimensional representation space; the resulting bivariate normal densities are depicted in Figure 1.
We can also determine a one-dimensional representation of the three multivariate normal populations by applying the SVD described in Theorem 2 to the matrix M given in (7). A one-dimensional representation space is column one of the matrix F in (8), and the graphical representation of this configuration of univariate normal densities is depicted in Figure 2.

Four LDR Methods for Statistical Discrimination
In this section, we present and describe the four LDR methods that we wish to compare and contrast in Sections 7 and 8.

The SY Method
In Theorem 1, we assume the parameters μ_i and Σ_i, i = 1, …, m, are known, but in practice this assumption rarely holds. In a sampling situation, we can replace the columns of the matrix M in (7) by their sample estimators, yielding the estimator M̂. Our estimator M̂, along with Theorems 1 and 2, yields an LDR technique based on the selection of an r-dimensional hyperplane determined from a rank-r approximation to the full-rank matrix M̂. Thus, using the SVD, we let M̂ = UD_p V′ and let F̂ = UD_p. Also, let D_r be the diagonal matrix of the r largest singular values, with λ_j ≥ λ_l for 1 ≤ j < l ≤ p and 1 ≤ r < p. From Theorem 2, we have that M̂_r = UD_r V′ is a rank-r approximation of M̂, and a rank-r approximation to F̂ is F̂_r = UD_r.

The BE Method
A second LDR method, presented by [14], is based on a criterion matrix U defined for 1 ≤ i < j ≤ m, provided all multivariate normal population parameters are known. For the unknown-parameter case, an estimator Û of U is computed from the sample means and covariance matrices. The LDR matrix derived in [14], which we refer to as the BE matrix, is likewise obtained from a low-rank approximation to Û using the SVD. That is, let Û = RD_p S′ be the SVD of Û, with λ_j ≥ λ_l for 1 ≤ j < l ≤ p, and let Ĥ = RD_p. Define D_r as in (9) with λ_j ≥ λ_l for 1 ≤ j ≤ r. Then Û_r = RD_r S′ is a rank-r approximation of Û, and Ĥ_r = RD_r is a rank-r approximation of Ĥ. Thus, the BE LDR matrix is Ĥ_r′, or, equivalently, R_r′, where the jth eigenvector corresponds to the jth largest singular value, j = 1, …, r. The BE LDR approach is based on the rotated differences in the means. The LDR matrix (10) uses a type of pooled covariance matrix estimator for the precision matrices. However, the BE method does not incorporate all of the information contained in the different individual covariance matrices. Another disadvantage of the BE LDR approach is that its reduced dimension is limited by the number of classes, m. For m = 2, BE allows one to reduce the data to only one dimension, regardless of the full-feature vector dimension. Therefore, one may lose some discriminatory information with the application of the BE LDR method when the covariance matrices are considerably different.

Sliced Inverse Regression (SIR)
The next LDR method we consider is sliced inverse regression (SIR), which was proposed in [16]. Assuming all population parameters are known, we first define Σ_W as the within-group covariance matrix and Σ_B as the between-group covariance matrix computed about the overall population mean. As its criterion matrix, the SIR LDR method uses Γ^{−1} Σ_B, where Γ is the marginal covariance matrix of x. When the population parameters must be estimated, an estimator of (12) is obtained by substituting S_W and S_B, given in (1) and (2), respectively. Let M̂_SIR = GD_p P′ be the SVD of (13), where G is composed of the eigenvectors of (13) such that the jth eigenvector corresponds to the jth largest singular value of (13), j = 1, …, p. Then a rank-r approximation K̂_r of K̂ yields the r × p SIR LDR matrix K̂_r′, which is composed of the eigenvectors corresponding to the r largest singular values of K̂.
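The sample SIR directions can be sketched as follows. This is a hedged illustration: we estimate Γ by the sample marginal covariance of x and use a prior-weighted between-group scatter, which is our reading of the (garbled) source and should be checked against [16]; the function name is ours.

```python
import numpy as np

def sir_directions(X, y, r):
    """Sample SIR directions: the r leading eigenvectors of Gamma^{-1} S_B,
    with Gamma estimated by the marginal covariance of x and S_B the
    prior-weighted between-group scatter about the overall mean.
    """
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    p = X.shape[1]
    xbar = X.mean(axis=0)                    # estimated overall mean
    Sb = np.zeros((p, p))                    # between-group scatter
    for k, a in zip(classes, priors):
        d = (X[y == k].mean(axis=0) - xbar)[:, None]
        Sb += a * (d @ d.T)
    Gamma = np.cov(X, rowvar=False)          # marginal covariance estimate
    vals, vecs = np.linalg.eig(np.linalg.solve(Gamma, Sb))
    order = np.argsort(-vals.real)           # sort by decreasing eigenvalue
    return vecs.real[:, order[:r]].T         # (r, p) SIR LDR matrix
```

Because S_B has rank at most m − 1, at most m − 1 of these directions are informative, which is the dimension limitation discussed in Section 7.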

Sliced Average Variance Estimation (SAVE)
The last LDR method we consider is sliced average variance estimation (SAVE), which has been proposed in [19]-[20]. The SAVE method uses a p × p criterion matrix. We use a form of SAVE given in [21], computed with S_W and S_B given in (1) and (2), respectively. Next, let M̂_SAVE = BD_p Q′ be the SVD of (15), where B is composed of the eigenvectors of (15) such that the jth column of B corresponds to the jth largest singular value, j = 1, …, p. Then a rank-r approximation L̂_r of L̂ yields the r × p SAVE LDR matrix L̂_r′. An alternative representation of the r-dimensional SAVE LDR matrix is B_r, which is composed of the eigenvectors corresponding to the r largest singular values.

A Monte Carlo Comparison of Four LDR Methods for Statistical Classification
Here, we compare our new SY LDR method derived above to the BE, SIR, and SAVE LDR methods. Specifically, we evaluate the classification efficacy, in terms of the EPMC, of the SY, BE, SIR, and SAVE LDR methods using Monte Carlo simulations for five different configurations of multivariate normal populations with p = 10. We generated 10,000 training-sample and test datasets from the appropriate multivariate normal distributions for each parameter configuration. The test data were assigned to population Π_1, Π_2, or Π_3 (when m = 3) using the sample QDF corresponding to (4). We applied the four competing LDR matrices F̂_r′, Ĥ_r′, K̂_r′, and L̂_r′ and calculated the EPMCs for the full-dimensional QDF and for the four reduced-dimensional QDFs by averaging the estimated conditional error rate over all training samples. We examined the effect of the training-sample sizes on the four LDR methods for several sample-size choices. For the SY and SAVE LDR approaches, we reduce the dimension to r = 1, 2, 3. In general, for m populations, we remark that the BE LDR method can reduce the feature vector to at most a dimension determined by the column dimension of the matrix Û. This limitation is potentially a major drawback when the ratio m/p is small, and especially when m = 2. In particular, for the BE and SIR LDR approaches, we can reduce only to the dimension r = 1 when m = 2. When m = 3, the BE LDR method can reduce the dimension to at most r = 2, 3, and the SIR LDR method can reduce the number of original features to at most r = m − 1. However, the SY LDR method avoids this shortcoming and allows one to reduce the original feature vector to any choice of reduced dimension r, where 1 ≤ r < p, for any finite number of populations m. The five Monte Carlo simulations were generated using the programming language R.
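The simulation loop described above can be sketched as follows; the sample sizes, seed, and helper name are illustrative (the paper's simulations used R and 10,000 replications), and the sample QDF is refit in the reduced space z = Bx for each replication.

```python
import numpy as np

def estimate_epmc(means, covs, priors, B, n_train, n_test, reps, seed=0):
    """Monte Carlo EPMC of the sample QDF in the reduced space z = B x:
    per replication, estimate reduced-space means and covariances from
    fresh training data, classify fresh test data, and average the
    conditional error rates over replications.
    """
    rng = np.random.default_rng(seed)
    m = len(means)
    errs = []
    for _ in range(reps):
        mhat, Shat = [], []
        for i in range(m):                   # reduced-space training estimates
            Z = rng.multivariate_normal(means[i], covs[i], n_train) @ B.T
            mhat.append(Z.mean(axis=0))
            Shat.append(np.cov(Z, rowvar=False))
        wrong, total = 0, 0
        for i in range(m):                   # conditional error on test data
            Z = rng.multivariate_normal(means[i], covs[i], n_test) @ B.T
            for z in Z:
                scores = [np.log(priors[j])
                          - 0.5 * np.linalg.slogdet(Shat[j])[1]
                          - 0.5 * (z - mhat[j]) @ np.linalg.solve(Shat[j], z - mhat[j])
                          for j in range(m)]
                wrong += (int(np.argmax(scores)) != i)
                total += 1
        errs.append(wrong / total)
    return float(np.mean(errs))
```

The same loop serves all four LDR methods; only the matrix B (F̂_r′, Ĥ_r′, K̂_r′, or L̂_r′) changes.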
Table 1 gives a description of the number of populations and the theoretically optimal rank of the four LDR criterion matrices for each configuration. In Figures 3-7, we display the corresponding EPMCs of the four competing LDR methods for the various population configurations and values of n_i and r, where i = 1, …, m. The estimated standard error of all EPMCs in the following tables was less than 0.001. For each configuration, we also calculated M̂, Û, K̂, and L̂, along with their respective singular values, using the SVD. These singular values convey the amount of discriminatory information available in each reduced dimension. In the subsequent subsections, we use the following notation:
When n_i = 25, the EPMC was reduced by the SY, BE, and SIR LDR methods but not by the SAVE LDR method. This effect occurred because the training-sample sizes n_i = 25 are small relative to the full-feature dimensionality p = 10, and, therefore, insufficient data were available to accurately estimate the EPMC for all four LDR methods as r increased. In addition, the SIR LDR method did not utilize the discriminatory information contained in the differences of the covariance matrices when n_i = 25. For the larger training-sample sizes, all four LDR methods yielded essentially the same EPMC, except for SAVE when r = 1.

In this configuration, the population means are unequal but relatively close. Moreover, Σ_2 and Σ_3 are unequal but notably more similar to one another than to Σ_1. However, the variance of the third feature in Π_3 is significantly greater than the population variances of the third feature in either Π_1 or Π_2. As a result of the markedly different covariance matrices, the BE and SIR LDR methods are not ideal because both methods aggregate the sample covariance matrices. The SY LDR method, however, estimates each individual covariance matrix for all three populations, and its use of this information made SY the superior LDR procedure. Table 3 gives the singular values for each of the LDR methods considered here.
For the larger sample-size scenario, n_i = 50, the SAVE LDR method was more competitive with SY than either BE or SIR when r > 1. However, the SY LDR remained the preferred LDR method. In Figure 4, choosing r > 2 added noise to the SY method, while choosing r = 1 eliminated essential discriminatory information for the SY procedure. Thus, the optimal choice of the reduced dimension for the SY LDR method was r = 2, regardless of the training-sample size n_i, i = 1, 2, 3. This significant reduction from EPMC_10 demonstrated the benefit of dimension reduction in general and of the SY LDR method in particular.
For the fourth configuration, the covariance matrices are considerably different from one another, which benefits both the SY and SAVE LDR methods. Specifically, the SY LDR procedure uses the information contained in the unequal covariance matrices to determine classificatory information; accordingly, in Table 5, the first three singular values of M̂ were relatively large. However, the pooling of the sample covariance matrices used in BE and SIR tended to obscure the classificatory information in the sample covariance matrices. The only improvement over the full dimension that we found for the values of r and n_i considered here was for the SY method when r = 3 and n_i = 25, i = 1, 2, 3.

Configuration 5: m = 3 with Diverse Population Covariance Matrices
In Configuration 5, we have three multivariate normal populations with three considerably different covariance matrices. For this population configuration, the SY method outperformed the three other LDR methods for the reduced dimensions r = 1, 2, 3. Examining the singular values of M̂, we see that most of the discriminatory information was contained in the first transformed dimension. This fact was reflected in how EPMC_r(SY) and r were directly related, regardless of the training-sample size n_i, i = 1, 2, 3. The BE and SIR LDR methods did not perform as well here because of the pooling of the highly diverse estimated covariance matrices. While the SAVE LDR procedure was the least effective LDR method for n_i = 25, EPMC(SAVE) and n_i were inversely related. While each of the four LDR methods decreased the error rate from EPMC_10 for some combination of n_i and r, the largest reduction in EPMC was achieved by the SY LDR method. This result implies that the SY LDR method is especially useful when n_i/p is relatively small and the covariance matrices are considerably different.

A Parametric Bootstrap Simulation
In the following parametric bootstrap simulation, we use a real dataset to obtain the population means and covariance matrices for multivariate normal populations. The chosen dataset comes from the University of California at Irvine Machine Learning Repository and describes diagnoses from cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient in the study is classified into one of two categories: normal or abnormal. The dataset contains 267 SPECT image sets of patients, and each observation consists of 44 continuous features. However, for our simulation, we chose only ten of the 44 features: F2R, F6R, F7S, F9S, F11R, F11S, F14S, F16R, F17R, and F19S. Hence, we performed the parametric bootstrap Monte Carlo simulation with two populations. For this dataset, Table 7 gives the singular values for each of the LDR methods considered here, and Figure 8 gives the EPMCs for the reduced dimensions 1, 2, and 3 for each of the LDR methods. In this example, a considerable amount of discriminatory information is contained in the covariance matrices, which are extremely different. Hence, not surprisingly, neither the BE nor the SIR LDR method performed well for this example.

Indeed, for an appropriate choice of r, we found EPMC_r(SY) ≈ EPMC_10. This particular example verified that one can implement the SY method and obtain excellent results even when the conditions of Theorem 1 do not hold exactly. It also illustrated that, with a judicious choice of r combined with the appropriate LDR method, one can significantly reduce the feature dimension while essentially preserving the EPMC.

Discussion
In this paper, for the case where all population parameters are known, we have presented a simple and flexible algorithm for a low-dimensional representation of data from multiple multivariate normal populations with different parametric configurations. Also, we have given necessary and sufficient conditions for attaining the subspace of smallest dimension q < p that preserves the original Bayes classification assignments. We have provided a constructive proof for obtaining a low-dimensional representation space when certain population parameter conditions are satisfied. For the special case of the two-class multivariate normal problem with equal nonsingular covariance structures, our proposed LDR transformation is the LDF of [4]. Moreover, we have extended the concept proposed in Theorem 1 to cases where the conditions in Theorem 1 are not satisfied through the application of the SVD given in Theorem 2.
We have presented several advantages of our proposed low-dimensional representation method. First, our method is not restricted to a one-dimensional representation regardless of the number of populations, unlike the transformation introduced by [14]. Second, our method allows for equal and unequal covariance structures. Third, the original feature dimension p does not significantly impact the computational complexity. Also, under certain conditions, one-dimensional representations of populations with unequal covariance structures can be accomplished without an appreciable increase in the EPMC.
Furthermore, we have derived an LDR method for realistic cases where the class population parameters are unknown and the linear sufficient matrix M in (7) must be estimated using training data. Using Monte Carlo simulation studies, we have compared the performance of the SY LDR method with three other LDR procedures derived by [14] [16] [19]. In these studies, we have demonstrated that the full-dimension EPMC can sometimes be decreased by implementing the SY LDR method when the training-sample size is small relative to the total number of estimated parameters.
Finally, we have extended the concept proposed in Theorem 1, through the application of the SVD given in Theorem 2, to cases where one might wish to considerably reduce the original feature dimension. Our new LDR approach can yield excellent results provided the population covariance matrices are sufficiently different.
Let FG be a full-rank decomposition of T; then the result follows from (ii) above, and parts (b) and (c) follow directly from (a).

Figure 1.The optimal two-dimensional representation for normal densities from the example in Section 5.2.

Figure 2. The optimal one-dimensional representation for normal densities from the example in Section 5.2.
EPMC_r(SY), EPMC_r(BE), EPMC_r(SIR), and EPMC_r(SAVE) denote the estimated EPMCs for the SY, BE, SIR, and SAVE LDR methods, respectively, for each of the appropriate reduced dimensions.

Figure 3. Simulation results from Section 7.1. The horizontal bar in each graph represents the EPMC_10.

Figure 4. Simulation results from Configuration 2. The horizontal bar in each graph represents the EPMC_10.

Figure 5. Simulation results from Configuration 3. The horizontal bar in each graph represents the EPMC_10.

Figure 6. Simulation results from Configuration 4. The horizontal bar in each graph represents the EPMC_10.

Figure 7. Simulation results from Section 7.5. The horizontal bar in each graph represents the EPMC_10.

Figure 8. Simulation results from Section 8. The horizontal bar in each graph represents the EPMC_10.

The goal of statistical decision theory is to obtain a decision rule that assigns an unlabeled observation x to one of the classes Π_1, …, Π_m, with assumed known a priori probabilities α_1, …, α_m, respectively. Also, let p(·|Π_i) denote the p-dimensional multivariate normal density corresponding to population Π_i.

Table 2 .
Summary of singular values for the four competing LDR methods for Configuration 1.

Table 3 .
Summary of singular values for the four competing LDR methods applied to Configuration 2.

Table 4 .
Summary of singular values for the four competing LDR methods for Configuration 3.

Table 5 .
Summary of singular values for the four competing LDR methods for Configuration 4.

Table 6 .
Summary of singular values for the four competing LDR methods for Configuration 5.

From Table 7, we see that one reason EPMC_r(SY) increased as r decreased is that the singular values are relatively large up to r = 7. Thus, when we chose r = 1, 2, 3, we discarded some necessary discriminatory information. For n_i = 25, SAVE slightly outperformed the other three LDR methods, though the SY LDR approach was very competitive.

Table 7 .
Summary of singular values for the four competing LDR methods for the parametric bootstrap example.