A Comparison of Two Linear Discriminant Analysis Methods That Use Block Monotone Missing Training Data

We revisit a comparison of two discriminant analysis procedures, namely the linear combination classifier of Chung and Han (2000) and the maximum likelihood estimation substitution classifier, for the problem of classifying unlabeled multivariate normal observations with equal covariance matrices into one of two classes. Both classes have matching block monotone missing training data. Here, we demonstrate that for intra-class covariance structures with at least small correlation between the variables with missing data and the variables without block missing data, the maximum likelihood estimation substitution classifier outperforms the Chung and Han (2000) classifier regardless of the percentage of missing observations. Specifically, we examine the differences in the estimated expected error rates of these classifiers using a Monte Carlo simulation, and we compare the two classifiers on two real data sets with monotone missing data via parametric bootstrap simulations. Our results contradict the conclusion of Chung and Han (2000) that their linear combination classifier is superior to the MLE classifier for block monotone missing multivariate normal data.


Introduction
We consider the problem of classifying an unlabeled observation vector $\mathbf{x}$ into one of two distinct multivariate normally distributed populations $\Pi_i : N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$, $i = 1, 2$, when monotone missing training data are present, where $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}$ are the $i$th population mean vector and common covariance matrix, respectively. Here, we re-compare two linear classification procedures for block monotone missing (BMM) training data: one classifier is from [1], and the other classifier employs the maximum likelihood estimator (MLE).
Monotone missing data occur for an observation vector $\mathbf{x}_j$ when, if $x_{ji}$ is missing, then $x_{jk}$ is missing for all $k > i$. The authors [1] claim that their "linear combination classification procedure is better than the substitution methods (MLE) as the proportion of missing observations gets larger" when block monotone missing data are present in the training data. Specifically, [1] performed a Monte Carlo simulation and concluded that their classifier performs better in terms of the expected error rate (EER) than the MLE substitution (MLES) classifier formulated by [2] as the proportion of missing observations increases. However, we demonstrate that for intra-class covariance training data with at least small correlations among the variables, the MLES classifier can significantly outperform the classifier from [1], which we refer to as the C-H classifier, in terms of their respective EERs. This phenomenon occurs regardless of the proportion of the variables missing in each observation with missing data (POMD) in the training data set.
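As a concrete illustration of this pattern (our own, not from [1]), the following Python snippet constructs a small training matrix with a block monotone missing pattern: every observation records the first $k$ features, and the remaining rows are missing the entire trailing block.

```python
import numpy as np

# 6 observations on p = 4 features; the last 2 rows are missing features
# 3 and 4, so whenever x_{ji} is missing, x_{jk} is missing for all k > i.
Y = np.arange(24, dtype=float).reshape(6, 4)
k = 2                # number of features observed for every observation
Y[4:, k:] = np.nan   # impose the block monotone missing (BMM) pattern
print(Y)
```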
Throughout the remainder of the paper, we use the notation $\mathbb{R}^{m \times n}$ to represent the space of all $m \times n$ matrices over the real field $\mathbb{R}$. Also, we let the symbol $\mathbb{R}^{n}_{>}$ represent the cone of all $n \times n$ positive definite matrices in $\mathbb{R}^{n \times n}$. Moreover, $\mathbf{A}'$ represents the transpose of a matrix $\mathbf{A}$.

The author [3] has considered the problem of missing values in discriminant analysis where the dimension and the training-sample sizes are very large. Additionally, [4] has examined the probability of correct classification for several methods of handling data values that are missing at random, using the EER as the criterion to weigh the relative quality of supervised classification methods. Moreover, [5] has examined missing observations in statistical discrimination for a variety of population covariance matrices. Also, [6] has applied recursive methods for handling incomplete data and has verified asymptotic properties for the recursive methods.
We have organized the remainder of the paper as follows. In Section 2, we describe the C-H classifier, and we describe the MLES linear discriminant procedure when the training data from both classes contain identical BMM data patterns. In Section 3, we describe and report the results of Monte Carlo simulations that examine the differences in the estimated EERs of the C-H and MLES classifiers for various parameter configurations, training-sample sizes, and missing-data sizes, and we summarize our simulation results graphically. In Section 4, we compare the C-H and MLES linear classifiers using a parametric bootstrap estimator of the EER difference (EERD) on two actual data sets. We summarize our results and conclude with some brief comments in Section 5.

The C-H Classifier for Monotone Missing Training Data

The authors [1] have derived a linear combination of a discriminant function composed from the complete data and a second discriminant function determined from the BMM data; the C-H classifier uses Anderson's linear discriminant function (LDF) for the subset of complete data.

Suppose we have two random training samples of sizes $N_i$,

$$\mathbf{Y}_i = \left[ \mathbf{Y}_{i1} \;\; \mathbf{Y}_{i2} \right], \quad i = 1, 2, \qquad (1)$$

where $\mathbf{Y}_{i1}$ denotes the $n_i$-observation complete submatrix, and $\mathbf{Y}_{i2}$ is the partial-observation submatrix whose first $k$ measurements are non-missing, $k < p$. We denote a complete observation vector by $\mathbf{y} \in \mathbb{R}^{p \times 1}$. From the training samples we compute the sample mean vectors in (5), where $\bar{\mathbf{y}}_i$ denotes the sample mean for the first $n_i$ observations and the first $k$ features from $\mathbf{Y}_{i1}$ in (1), and the pooled sample covariance matrix $\mathbf{S}$ for the incomplete training data in (4), where $\mathbf{Y}_{it}$, $t = 1, 2$, represent the subsets of (1) with non-missing data and BMM data, respectively, for $i = 1, 2$.

The authors [1] have proposed the linear combination statistic

$$W_c(\mathbf{y}) = c\,W_u + (1 - c)\,W_y, \qquad (7)$$

where $0 \le c \le 1$ and $W_u$ and $W_y$ are Anderson-type LDFs computed from the two training-data subsets in (1). One classifies an unlabeled observation vector $\mathbf{y}$ into $\Pi_1$ if $W_c(\mathbf{y}) \ge 0$, and into $\Pi_2$, otherwise (8). The conditional error rate (CER) for classifying an unlabeled vector from $\Pi_i$ into $\Pi_j$, $i, j = 1, 2$, $i \ne j$, is a function of the training data through the quantities $f$ and $h$ in (10) and (11) and the sample means $\bar{\mathbf{y}}_i$ defined in (5). Thus, using (9) and assuming equal a priori probabilities, one obtains the CER for (8) and, averaging over the distribution of the training data, the EER of misclassifying an unlabeled observation vector from $\Pi_i$ into $\Pi_j$, $i \ne j$, again assuming equal a priori probabilities.

In choosing $c$ in (7), [1] have utilized the fact that the CER and EER depend on the Mahalanobis distances for the complete and partial training observations and the corresponding training-sample sizes, $N_i$ and $n_i$, $i = 1, 2$. Usually, when one has small CERs, at least one of the sample Mahalanobis distances $D_u^2$ and $D_y^2$ will be large. While $n_i$ and $D_u^2$ determine the performance of $W_u$, the quantities $N_i$ and $D_y^2$ dictate the performance of $W_y$. Hence, [1] have chosen $c$ in relation to the training-sample sizes and the Mahalanobis distances for the complete and incomplete training-data sets, noting in particular the circumstances where the partial data in (1) contribute largely to the discriminatory information, and use these quantities to determine the linear combination classification statistic (7).
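To make the structure of (7) concrete, here is a minimal Python sketch of a linear combination of two Anderson LDFs for BMM training data. It is our own illustration, not the authors' code: we assume the convention that $W_y$ is built from the complete cases on all $p$ features and $W_u$ from all observations on the first $k$ features, and we treat the mixing weight $c$ as given, whereas [1] choose $c$ from the training-sample sizes and sample Mahalanobis distances.

```python
import numpy as np

def anderson_ldf(x, m1, m2, S):
    """Anderson's LDF: W(x) = (x - (m1 + m2)/2)' S^{-1} (m1 - m2)."""
    d = np.linalg.solve(S, m1 - m2)
    return (x - 0.5 * (m1 + m2)) @ d

def pooled_cov(X1, X2):
    """Pooled (bias-corrected) sample covariance of two samples."""
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def ch_statistic(y, Y1, Y2, k, c):
    """Linear combination statistic W_c(y) = c*W_u + (1 - c)*W_y.

    Y1, Y2: (N_i x p) training matrices with NaNs marking the BMM block,
    k     : number of leading features observed for every training vector,
    c     : mixing weight in [0, 1], assumed given here.
    """
    # LDF from the complete cases on all p features
    C1 = Y1[~np.isnan(Y1).any(axis=1)]
    C2 = Y2[~np.isnan(Y2).any(axis=1)]
    W_y = anderson_ldf(y, C1.mean(0), C2.mean(0), pooled_cov(C1, C2))
    # LDF from all N_i observations on the first k features
    U1, U2 = Y1[:, :k], Y2[:, :k]
    W_u = anderson_ldf(y[:k], U1.mean(0), U2.mean(0), pooled_cov(U1, U2))
    return c * W_u + (1 - c) * W_y  # assign y to Pi_1 when >= 0, as in (8)
```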

A Maximum Likelihood Substitution Classifier for Monotone Missing Training Data
The authors [7] have derived an MLE method for estimating parameters in a multivariate normal distribution with BMM data. The estimator of $\boldsymbol{\Sigma}$ in the MLES classifier of [7] is a pooled estimator of the two individual MLEs of $\boldsymbol{\Sigma}$. Below, we state the MLEs for two multivariate normal distributions having unequal means and a common covariance matrix with identical BMM-data patterns in both training samples.
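These MLEs follow the classical factorization for two-block monotone data. The following Python function is a hedged single-sample sketch in the spirit of [7] (all names are ours; the pooled two-class estimator combines two such single-sample fits): the marginal parameters of the always-observed block use all $N$ rows, while the regression of the trailing block on the leading block uses only the $n$ complete rows.

```python
import numpy as np

def monotone_mle(Y, k):
    """MLE of (mu, Sigma) for one N(mu, Sigma) sample with a two-block
    monotone pattern: all N rows observe features 0..k-1; only the first
    n rows observe features k..p-1 (remaining entries are NaN).
    """
    head = Y[:, :k]                      # block observed for every row
    comp = Y[~np.isnan(Y).any(axis=1)]   # the n complete rows
    n, N = len(comp), len(Y)

    # Marginal MLE for the first block (divide by N, not N - 1)
    mu1 = head.mean(0)
    S11 = (head - mu1).T @ (head - mu1) / N

    # Regression of the tail block on the head block, complete rows only
    xb1, xb2 = comp[:, :k].mean(0), comp[:, k:].mean(0)
    A11 = (comp[:, :k] - xb1).T @ (comp[:, :k] - xb1) / n
    A21 = (comp[:, k:] - xb2).T @ (comp[:, :k] - xb1) / n
    A22 = (comp[:, k:] - xb2).T @ (comp[:, k:] - xb2) / n
    B = A21 @ np.linalg.inv(A11)         # regression coefficients
    S22_1 = A22 - B @ A21.T              # conditional covariance

    # Assemble the MLEs of the full mean vector and covariance matrix
    mu2 = xb2 + B @ (mu1 - xb1)
    S21 = B @ S11
    S22 = S22_1 + B @ S11 @ B.T
    mu = np.concatenate([mu1, mu2])
    Sigma = np.block([[S11, S21.T], [S21, S22]])
    return mu, Sigma
```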

Monte Carlo Simulations
The authors [1] claim that "it can be shown that the linear combination classification statistic is invariant under nonsingular linear transformations when the data contain missing observations" and assume this invariance is also true for the MLES classifier. While their assertion might be true for the C-H classifier, it is not necessarily true for the MLES classifier. Because [1] do not consider covariance structures with moderate to high correlation, their results are biased toward the C-H classifier. Here, we show that the MLES classifier can considerably outperform the C-H classifier, depending on the degree of correlation among the variables with missing data and the variables without missing data.
Next, we present a description and results of a Monte Carlo simulation we performed to evaluate the EERD between the MLES and C-H classifiers for two multivariate normal configurations, $N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$, $i = 1, 2$, using various training-sample sizes, dimensions, features with block missing data, differences in means, values of correlation among variables, and missing-data proportions. For the simulations, we define $p$ to be the total number of feature dimensions and $r$ to be the number of missing features, so that $r < p$. Also, $N_i$ denotes the total training-sample size from population $\Pi_i$, $i = 1, 2$, and

$$\boldsymbol{\Sigma} = (1 - \rho)\mathbf{I}_p + \rho \mathbf{J}_p$$

is the intraclass covariance matrix, where $\rho$ denotes the common population correlation among the features and $\mathbf{J}_p \in \mathbb{R}^{p \times p}$ denotes a matrix of ones. The simulation was performed in SAS 9.2 (SAS Institute Inc., Cary, NC, USA) using the RANDNORMAL command in PROC IML to generate 10,000 training-sample sets of size $N_i$, $i = 1, 2$, for each parameter configuration. Next, the MLES and C-H classifiers were computed, and their CERs were calculated for each training-sample set. Then, the differences between the CERs for the classifiers were averaged over the 10,000 CER differences for each parameter configuration involving $N_i$, $p$, $r$, $\boldsymbol{\Sigma}$, $\boldsymbol{\mu}_i$, and POMD for the $r$ features with monotone missing data, where $i = 1, 2$; this average is the estimated EERD, $\widehat{\mathrm{EERD}}$, for the C-H and MLES classifiers. We chose these specific values of $p$ and $r$ to evaluate $\widehat{\mathrm{EERD}}$ when the proportion of variables with missing data was both small and large relative to $p$. The choice of $r$ and $N_i$ depended on the value of $p$, and we provide the values of $p$, $r$, and $N_i$ used in the Monte Carlo simulation in Table 1.
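As an aside on the mechanics of such a simulation, the conditional error rate of any trained linear rule is available in closed form under the known simulation parameters, so a Monte Carlo EERD estimate can average exact CERs rather than empirical misclassification counts. Below is a small Python sketch of this computation under our assumptions (equal priors, unit-variance intraclass covariance); all names are ours.

```python
import numpy as np
from scipy.stats import norm

def intraclass_cov(p, rho):
    """Sigma = (1 - rho) I_p + rho J_p: unit variances, common correlation rho."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

def linear_rule_cer(a, b, mu1, mu2, Sigma):
    """Exact CER of 'assign to Pi_1 iff a'x + b >= 0' with equal priors.

    Given trained coefficients (a, b), a'X + b is univariate normal under
    each class, so each conditional error is a normal tail probability.
    """
    s = np.sqrt(a @ Sigma @ a)
    err1 = norm.cdf(-(a @ mu1 + b) / s)        # Pi_1 point sent to Pi_2
    err2 = 1.0 - norm.cdf(-(a @ mu2 + b) / s)  # Pi_2 point sent to Pi_1
    return 0.5 * (err1 + err2)
```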
Lastly, we chose $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2 \in \mathbb{R}^{p \times 1}$ such that the non-zero elements of $\boldsymbol{\mu}_2$ equal $d_j$, with $d_1 = 0.5$ and $d_2 = 3$, to assess $\widehat{\mathrm{EERD}}$ for both small and large between-class separation. These values for $\boldsymbol{\mu}_i$, $i = 1, 2$, given in (22) and (23), were chosen because they are similar to the population means used in the simulation in [1]. Furthermore, we contrasted (8) and (19) using POMD = 0.5, 0.8 for the $r$ covariates with BMM data, and, as in [1], we chose $N_i > p$ to avoid singularity of the estimated covariance matrices. The comparison criterion $\widehat{\mathrm{EERD}}$ is plotted against $\rho$ for various combinations of $p$, $r$, $d_j$, $N_i$, and POMD in Figure 1 and Figure 2. Because the graphs for $p = 20$ are similar to the plots for $p = 10$ and $p = 40$, we omit them; they can be obtained from the authors. Figure 1 and Figure 2 illustrate that the $\widehat{\mathrm{EERD}}$ is consistently positive for the values of $p$, $r$, $N_i$, $\rho$, $d_j$, and POMD examined here. Moreover, the figures indicate that the primary parameters that influence the dominance of the MLES classifier are $\rho$ and $d_j$, $j = 1, 2$. For all feature dimensions considered here, the C-H and MLES classifiers were competitive for $\rho = 0.1$. More importantly, for $\rho > 0.1$, $\widehat{\mathrm{EERD}}$ increased as $\rho$ increased for all $p$, $r$, $N_i$, $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and POMD considered here. The most noteworthy increase in the $\widehat{\mathrm{EERD}}$ was for $0.7 \le \rho \le 0.9$ when $d_1 = 0.5$, where $\widehat{\mathrm{EERD}}$ increased by approximately 0.10. This increase occurred for all specified values of $p$, $r$, $N_i$, and POMD and, thus, supported the superiority of the MLES classifier in terms of EERD for these configurations. Additionally, we noted that $\widehat{\mathrm{EERD}} \approx 0.20$ when $\rho = 0.9$. The MLES classifier especially outperformed the C-H classifier when $d_1 = 0.5$ for $\rho > 0$, as compared to when $d_2 = 3$; the smaller values of $\widehat{\mathrm{EERD}}$ for $d_2 = 3$ can be attributed to the fact that for a relatively large between-class separation both classifiers attain small error rates. We remark that the standard errors for the $\widehat{\mathrm{EERD}}$ in the [1] simulations are not sufficiently small to conclude a difference in the EERs of the two competing classifiers. Hence, their claim that the C-H classifier outperforms the MLES classifier as the percent of missing observations increases is questionable.
We also performed a second Monte Carlo simulation whose results are not presented here. In this simulation, all fixed parameter values were identical to those of the first simulation except for $\boldsymbol{\mu}_2$ in (23), where we chose 0.80 of the elements of $\boldsymbol{\mu}_2$ to be non-zero. Consequently, we obtained slightly different results from those of our first simulation. However, the MLES classifier still outperformed the C-H classifier for all parameter configurations when $\rho \ge 0.1$. These results suggest that for classification problems with equal intra-class covariance matrices, the MLES classifier is superior to the C-H classifier when at least small correlation exists among the features with missing data and the features without missing data.

Bootstrap Expected Error Rate Estimators for the C-H and MLE Classifiers
In this section, we compare the parametric bootstrap estimated EERs of the C-H and MLES classifiers for two real data sets, each having two approximately multivariate normal populations with different population means and equal covariance matrices. First, we define the bootstrap EER estimator for the C-H classifier. Let $\hat{\boldsymbol{\mu}}_1$, $\hat{\boldsymbol{\mu}}_2$, and $\hat{\boldsymbol{\Sigma}}$ be the MLEs of $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$, respectively, defined in Theorem 1. Also, let $\hat{\boldsymbol{\mu}}_1^*$, $\hat{\boldsymbol{\mu}}_2^*$, and $\hat{\boldsymbol{\Sigma}}^*$ be the bootstrap estimates of $\hat{\boldsymbol{\mu}}_1$, $\hat{\boldsymbol{\mu}}_2$, and $\hat{\boldsymbol{\Sigma}}$, respectively, calculated from the parametric bootstrap training-sample data generated from $N_p(\hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}})$, $i = 1, 2$, in (24). The bootstrap CERs for the C-H classifier are defined as in (9) and (10), respectively, except that we use the bootstrap multivariate normal data in (24); thus, assuming equal a priori probabilities of belonging to $\Pi_i$, $i = 1, 2$, for an unlabeled observation, the bootstrap CER for the C-H classifier is given in (25), where $W_c^*$, $h^*$, and $f^*$ are similar in definition to $W_c$, $h$, and $f$ in (7), (11), and (10), respectively. Hence, the estimated parametric bootstrap EERD for the C-H and MLES classifiers is given in (27), where $j$ denotes the $j$th simulated training-data set. We use (27) to compare the C-H and MLES classifiers for the two real data sets given in the following subsections.
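The following Python sketch outlines the parametric bootstrap loop behind (27) under our assumptions; it is an illustration, not the authors' code. The classifier-fitting callables fit_ch and fit_mles and the exact-CER routine cer (for example, linear_rule_cer above) are hypothetical stand-ins for (8) and (19).

```python
import numpy as np

rng = np.random.default_rng(2024)

def bootstrap_eerd(mu1, mu2, Sigma, N1, N2, n1, n2, k, B,
                   fit_ch, fit_mles, cer):
    """Parametric bootstrap estimate of the EERD (names and signatures ours).

    mu1, mu2, Sigma : MLEs from the observed BMM training data;
    n_i of the N_i generated rows keep all p features, and the rest lose
    the trailing block, reproducing the observed BMM pattern;
    fit_ch, fit_mles: callables returning linear-rule coefficients (a, b);
    cer             : exact CER of a rule (a, b) under the fitted normals.
    """
    diffs = np.empty(B)
    for j in range(B):
        # generate bootstrap training samples from the fitted normal models
        Y1 = rng.multivariate_normal(mu1, Sigma, size=N1)
        Y2 = rng.multivariate_normal(mu2, Sigma, size=N2)
        Y1[n1:, k:] = np.nan        # re-impose the BMM pattern
        Y2[n2:, k:] = np.nan
        # refit both classifiers on the bootstrap data and compare exact CERs
        diffs[j] = (cer(*fit_ch(Y1, Y2, k), mu1, mu2, Sigma)
                    - cer(*fit_mles(Y1, Y2, k), mu1, mu2, Sigma))
    return diffs.mean(), diffs.std(ddof=1) / np.sqrt(B)
```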

A Comparison of the C-H and MLE Classifiers for UTA Admissions Data
The first data set was supplied by the Admissions Office at the University of Texas at Arlington and was used as an example in [1]. The two populations for the UTA data are the Success Group, consisting of students who received their master's degrees ($\Pi_1$), and the Failure Group, consisting of students who did not complete their master's degrees ($\Pi_2$). Each training sample is composed of ten foreign students and ten United States students. Five variables were recorded for each foreign student: X1 = undergraduate GPA, X2 = GRE verbal, X3 = GRE quantitative, X4 = GRE analytic, and X5 = TOEFL score. For each observation in both data sets, variables $X_1$, $X_2$, $X_3$, and $X_4$ are complete; however, $X_5$ contains monotone missing data. The UTA data set, as given in [1], appears in Table 2.
Also, the common estimated correlation matrix for the UTA data is

$$\hat{\mathbf{R}} = \begin{bmatrix}
1.000 & 0.145 & 0.066 & 0.199 & 0.373 \\
0.145 & 1.000 & 0.404 & 0.494 & 0.767 \\
0.066 & 0.404 & 1.000 & 0.129 & 0.493 \\
0.199 & 0.494 & 0.129 & 1.000 & 0.392 \\
0.373 & 0.767 & 0.493 & 0.392 & 1.000
\end{bmatrix}. \qquad (28)$$

We remark that only one sample correlation coefficient in the last column of (28) has a magnitude exceeding 0.50, which reflects relatively low correlation between the four features without BMM data and the one feature having BMM data.
To estimate the EERD for the C-H classifier (8) and the MLES classifier (19) for the UTA Admissions data, we determine $\widehat{\mathrm{EERD}}_{Boot}$, given in (27), using 10,000 bootstrap simulation iterations, with $s.e.(\widehat{\mathrm{EERD}}_{Boot}) = 0.001$; the result indicated that the C-H classifier yielded slightly better discriminatory performance than the MLES classifier for the UTA data. The fact that the C-H procedure slightly outperformed the MLES classifier for the UTA data set in terms of EERD is not surprising. In the UTA data set, relatively little correlation exists among many of the features, and the C-H classifier does not require or use information in the correlation between the features with no missing data and the features with missing data. However, the MLES classifier does require at least a moderate degree of correlation between some features with no missing data and the feature with missing data to yield a more effective supervised classifier than the C-H classifier.

A Comparison of the C-H and MLE Classifiers on the Partial Iris Data
The second real data set on which we compare the C-H and MLES classifiers is a subset of the well-known Iris data, which is one of the most popular data sets in the pattern recognition literature and was first analyzed by R. A. Fisher (1936). The data used here are given in Table 3.
The University of California, Irvine, Machine Learning Repository provides the original data set, which contains 150 observations (50 in each class) on four variables: X1 = sepal length (cm), X2 = sepal width (cm), X3 = petal length (cm), and X4 = petal width (cm). The data set has three classes: Iris-setosa ($\Pi_1$), Iris-versicolor ($\Pi_2$), and Iris-virginica ($\Pi_3$). We have used a subset of the original Iris data set by taking only the first 20 observations from $\Pi_1$ and $\Pi_2$ and omitting the Iris-virginica group ($\Pi_3$). We emphasize that the variables in the partial Iris data are much more highly correlated than the variables in the UTA data.

Figure 1. Graphs of the $\widehat{\mathrm{EERD}}$ versus $\rho$ for fixed values of $N_i$, $r$, $d_j$, POMD, and $p = 10$.

Figure 2. Graphs of the $\widehat{\mathrm{EERD}}$ versus $\rho$ for fixed values of $N_i$, $r$, $d_j$, POMD, and $p = 40$.

Table 1. Dimensions and sample sizes for the Monte Carlo simulation.