Model-Free Feature Screening Based on Gini Impurity for Ultrahigh-Dimensional Multiclass Classification
1. Introduction
Ultrahigh-dimensional data are common in a wide range of scientific research and applications. Feature screening plays an essential role in analyzing such data; Fan and Lv [1] first proposed sure independence screening (SIS) in their seminal paper. For linear regression, they showed that an approach based on Pearson correlation learning possesses the sure screening property. That is, even if the number of predictors p grows much faster than the number of observations n with
for some
, all relevant predictors can be selected with probability tending to one [2].
Many model-based and model-free feature screening approaches have been developed in recent years. For example, Wang [3] proposed forward regression for ultrahigh-dimensional data. Fan and Song [4] applied the maximum marginal likelihood estimate (or maximum marginal likelihood) to ultrahigh-dimensional screening in generalized linear models. Fan et al. [5] further extended correlation learning to marginal nonparametric learning. Zhu et al. [6] proposed a model-free feature screening approach for ultrahigh-dimensional data. Li et al. [7] proposed a robust rank correlation screening method for ultrahigh-dimensional data based on the Kendall
correlation coefficient. Li et al. [8] applied distance correlation to the sure independence screening procedure. He et al. [9] proposed a quantile-adaptive framework for nonlinear variable screening with high-dimensional heterogeneous data. Fan et al. [10] proposed nonparametric independence screening, which selects variables by ranking a measure of the nonparametric marginal contribution of each covariate given the exposure variable. Liu et al. [11] proposed a feature screening procedure for varying coefficient models based on the conditional correlation coefficient. Nandy et al. [12] proposed covariate information number sure independence screening, which uses a marginal utility connected to the traditional notion of Fisher information. Pouyap et al. [13] proposed merging feature selection methods to identify the most relevant features in the texture of vibration signal images.
To address ultrahigh-dimensional feature screening in classification problems, Fan and Fan [14] proposed the t-test statistic for the two-sample mean problem as a marginal utility for feature screening and established its theoretical properties. Mai and Zou [15] applied the Kolmogorov filter to ultrahigh-dimensional binary classification. Cui et al. [16] proposed a screening procedure using empirical conditional distribution functions. Lai et al. [17] proposed a feature screening procedure based on the expected conditional Kolmogorov filter for binary classification problems. However, the above screening methods assume that the data are continuous. For categorical covariates, Huang et al. [18] constructed a model-free discrete feature screening method based on the Pearson chi-square statistic and showed that it fulfills the sure screening property (Fan et al. [2]). When all the covariates are binary, Ni and Fang [19] proposed a model-free feature screening procedure based on information entropy theory for multi-class classification. Ni et al. [20] further proposed a feature screening procedure based on the weighted adjusted Pearson chi-square for multi-class classification. Sheng and Wang [21] proposed a new model-free feature screening method based on the classification accuracy of marginal classifiers for ultrahigh-dimensional classification. Anzarmou et al. [22] proposed a new model-free interaction screening method, termed the Kendall Interaction Filter (KIF), for classification in high-dimensional settings.
Based on the above study of classification models, in this paper we propose a model-free feature screening procedure for ultrahigh-dimensional multi-class classification with both categorical and continuous covariates. The proposed method uses the Gini impurity to evaluate the prediction power of covariates. The Gini impurity is an impurity-based attribute splitting index, proposed by Breiman et al. [23], which has been widely used in decision tree algorithms such as CART and SPRINT. For categorical covariate screening, we apply the purity gain index, which plays the same role as the information gain [19]. Similar to Ni and Fang [19], continuous covariates can be sliced via standard normal quantiles. Since the proposed procedure is based on the purity gain, we refer to it as Purity Gain sure independence screening (PG-SIS). Theoretically, PG-SIS is rigorously proven to enjoy the sure screening property proposed by Fan and Lv [1], which ensures that all important features are retained. Practically, the simulation results show that PG-SIS performs well compared with existing feature screening methods.
This paper is organized as follows. Section 2 describes the proposed PG-SIS method in detail. Section 3 establishes its sure screening property. In Section 4, numerical simulations and a real data example are given to assess the sure screening property of our method. Some concluding remarks are given in Section 5, and all proofs are given in the Appendix.
2. Feature Screening Procedure
We first introduce Gini impurity and purity gain, and then propose the screening procedure based on purity gain.
2.1. Gini Index and Purity Gain
Suppose that $Y$ is a categorical response with $R$ classes $\{1, \ldots, R\}$, and the covariate $X = (X_1, \ldots, X_p)^{\top}$ is a $p$-dimensional vector, where each component $X_k$ is categorical with $J_k$ categories $\{1, \ldots, J_k\}$, $k = 1, \ldots, p$. To introduce the Gini impurity and purity gain, define $p_r = P(Y = r)$ as the probability function of the response variable, $w_{kj} = P(X_k = j)$ as the probability function of the covariate, and $p_{r \mid kj} = P(Y = r \mid X_k = j)$ as the conditional probability function of the response given the covariate, where $r = 1, \ldots, R$ and $j = 1, \ldots, J_k$. The marginal Gini impurities of $Y$ and $X_k$ are defined respectively as

$$\mathrm{Gini}(Y) = 1 - \sum_{r=1}^{R} p_r^2, \qquad (1)$$

$$\mathrm{Gini}(X_k) = 1 - \sum_{j=1}^{J_k} w_{kj}^2. \qquad (2)$$

The conditional Gini impurity of $Y$ given $X_k$ is defined as

$$\mathrm{Gini}(Y \mid X_k) = \sum_{j=1}^{J_k} w_{kj} \Big( 1 - \sum_{r=1}^{R} p_{r \mid kj}^2 \Big). \qquad (3)$$

Similar to the information gain, the purity gain is defined as

$$\mathrm{PG}(Y, X_k) = \mathrm{Gini}(Y) - \mathrm{Gini}(Y \mid X_k). \qquad (4)$$
In Equation (1), $\mathrm{Gini}(Y)$ is non-negative and attains its maximum $1 - 1/R$ if and only if $p_1 = \cdots = p_R = 1/R$, by Jensen's inequality [24]. The quantity $\mathrm{Gini}(Y \mid X_k)$ in Equation (3) is the conditional Gini impurity of $Y$ given $X_k$. Further support is given by the following proposition.
Proposition 2.1. When $X_k$ is a categorical covariable, we have $\mathrm{PG}(Y, X_k) \ge 0$, and $X_k$ and $Y$ are independent if and only if $\mathrm{PG}(Y, X_k) = 0$.
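To make the definitions concrete, the Gini impurity and purity gain of a categorical covariate can be estimated by replacing the probabilities with sample frequencies. The following is a minimal Python sketch; the function names `gini` and `purity_gain` are ours, not the paper's.

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def purity_gain(x, y):
    # PG(Y, X) = Gini(Y) - sum_j P(X = j) * Gini(Y | X = j),
    # with all probabilities replaced by sample frequencies.
    cond = 0.0
    for j in np.unique(x):
        mask = (x == j)
        cond += mask.mean() * gini(y[mask])
    return gini(y) - cond
```

For a balanced binary response, `gini` returns 0.5, its maximum $1 - 1/R$ with $R = 2$; a covariate independent of the response gives purity gain 0, while a perfectly informative covariate attains $\mathrm{Gini}(Y)$.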
For a continuous $X_k$, the conditional Gini impurity cannot be calculated directly, so the purity gain is obtained by slicing $X_k$ into several categories. For a fixed integer $J$, let $q_{kj}$ be the $(j/J)$th percentile of $X_k$, $j = 1, \ldots, J-1$, with $q_{k0} = -\infty$ and $q_{kJ} = +\infty$. Replacing $w_{kj}$ and $p_{r \mid kj}$ in Equation (3) respectively by $P(q_{k,j-1} < X_k \le q_{kj})$ and $P(Y = r \mid q_{k,j-1} < X_k \le q_{kj})$, we define the conditional Gini impurity and purity gain for continuous covariates:

$$\mathrm{Gini}_J(Y \mid X_k) = \sum_{j=1}^{J} P(q_{k,j-1} < X_k \le q_{kj}) \Big( 1 - \sum_{r=1}^{R} P(Y = r \mid q_{k,j-1} < X_k \le q_{kj})^2 \Big), \qquad (5)$$

$$\mathrm{PG}_J(Y, X_k) = \mathrm{Gini}(Y) - \mathrm{Gini}_J(Y \mid X_k). \qquad (6)$$
Proposition 2.2. When $X_k$ is a continuous covariable, we have $\mathrm{PG}_J(Y, X_k) \ge 0$, and $X_k$ and $Y$ are independent if and only if $\mathrm{PG}_J(Y, X_k) = 0$ for every $J$.
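For intuition, the slicing scheme behind Equations (5) and (6) can be mimicked empirically: cut the covariate at its sample $j/J$-th percentiles and compute the purity gain of the discretized covariate. A minimal sketch under our own naming, not the paper's code:

```python
import numpy as np

def sliced_purity_gain(x, y, J=4):
    # Discretize x at its j/J-th sample percentiles (j = 1, ..., J-1),
    # then compute the purity gain of the resulting J-category covariate.
    x, y = np.asarray(x), np.asarray(y)
    edges = np.quantile(x, np.arange(1, J) / J)
    cats = np.searchsorted(edges, x)          # slice labels 0, ..., J-1

    def gini(v):
        _, c = np.unique(v, return_counts=True)
        p = c / c.sum()
        return 1.0 - float((p ** 2).sum())

    cond = sum((cats == j).mean() * gini(y[cats == j]) for j in np.unique(cats))
    return gini(y) - cond
```

When the response is determined by whether x exceeds its median, slicing at the median (J = 2) recovers the full marginal Gini impurity of a balanced binary response, 0.5.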
2.2. Feature Screening Procedure Based on Purity Gain
First, we select a medium scale of simplified model which can almost fully contain D, where
functionally depends on
for some
, we use an adjusted purity gain index for each pair
is as follows:
(7)
where
,
and
when
is categorical,
represents the number of categories of
. When
is a continuous covariate,
represents the number of slices applied to
,
,
and
represents j/Jth percentile of
.
In the original definition in Equation (4), covariates with more categories may be associated with larger purity gain regardless of whether they are important, especially when the number of categories differs across covariates. To solve this problem, Ni and Fang [19] used
to construct the information gain ratio, where each category of
is the same. Similarly, when each category of
is the same, for Equation (7) we apply
to build an adjusted purity gain index that addresses the problem, which also applies to a continuous
. However, when the categories of
differ,
is defined as an adjustment factor, motivated by splitting
into several categories via the decision tree algorithm.
For sample data
,
,
can be easily estimated by
(8)
When
is categorical,
and
.
When
is continuous,
and
where
is the j/Jth sample normal percentile of
. In either case,
.
We suggest selecting a sub-model:
, where both c and
are predetermined thresholds established via Condition (C2) in Section 3. In practice, we can choose a model:
where
.
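The whole procedure — score every covariate, rank, and keep the top covariates — can be sketched as follows. This is our own illustration, not the paper's implementation: it uses the unadjusted purity gain of Equation (4) on percentile-sliced covariates (the adjustment factor of Equation (7) is omitted), and it uses the common threshold choice $d = [n/\log n]$.

```python
import numpy as np

def pg_sis(X, y, J=4, d=None):
    # Screen covariates by ranking a purity gain index and keeping the
    # top d, with d = [n / log n] as a conventional threshold choice.
    # Sketch only: uses the unadjusted purity gain on covariates sliced
    # at their sample percentiles, without Equation (7)'s adjustment.
    n, p = X.shape
    d = int(n / np.log(n)) if d is None else d

    def gini(v):
        _, c = np.unique(v, return_counts=True)
        q = c / c.sum()
        return 1.0 - float((q ** 2).sum())

    def score(x):
        edges = np.quantile(x, np.arange(1, J) / J)
        cats = np.searchsorted(edges, x)
        cond = sum((cats == j).mean() * gini(y[cats == j]) for j in np.unique(cats))
        return gini(y) - cond

    scores = np.array([score(X[:, k]) for k in range(p)])
    keep = np.argsort(scores)[::-1][:d]   # indices of the selected sub-model
    return keep, scores
```

In a toy example where only the first column drives the response, that column receives the largest score and is retained in the sub-model.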
3. Feature Screening Property
In this section, we establish the sure screening property of PG-SIS. Following the sure independence screening theory proposed by Ni and Fang [19], the following conditions are assumed.
Condition 1 (C1). There exist two positive constants
and
such that,
,
,
and
for every
,
and
.
Condition 2 (C2). There exist a positive constant
and
such that
.
Condition 3 (C3).
,
, where
,
and
.
Condition 4 (C4). There exist a positive constant
, such that
for any
, and x in the domain of
, where
is the Lebesgue density function of
conditional on
.
Condition 5 (C5). There exist a positive constant
and
such that
for any
and x in the domain of
, where
is the Lebesgue density function of
. Furthermore,
is continuous in the domain of
.
Condition 6 (C6).
,
, where
and
.
Condition 7 (C7).
, where
is a constant.
Condition (C1) guarantees that the proportion of each class of variables cannot be either extremely small or extremely large. Similar assumption is also made in condition (C1) in Huang et al. [18] and Cui et al. [16]. According to Fan and Lv [1] and Cui et al. [16], Condition (C2) allows the minimum true signal to disappear to zero in the order of
as the sample size goes to infinity. According to [19] Condition (C3) provides the covariates to diverge with a certain order and the number of classes for the response, and Condition (C6) modifies Condition (C3) a little bit. To ensures the sample percentiles are close to the true percentiles, Condition (C4) rules out the extreme case that some
put heavy mass in a small range. Condition (C5) asks for the
as lower bound to the density. According to [16] and Zhu et al. [6] proposed ranking consistency property, we need to assume the inactive covariate subset
, then Condition (C7) is established.
Theorem 3.1. (Sure screening property) Under conditions (C1) to (C3), if all the covariates are categorical, we get:
Theorem 3.2. (Sure screening property) Under conditions (C4) to (C6), when the covariates are composed of continuous and categorical variables, we get:
where b is a positive constant. If
and
, PG-SIS has the sure screening property.
Theorem 3.3. (Ranking consistency property) Under conditions (C1), (C4), (C5) and (C7), if
and
, then
, a.s.
Theorem 3.3 shows that the proposed screening index can separate active and inactive covariates well at the sample level.
4. Numerical Studies
4.1. Simulation Results
In this subsection, we carry out three simulation studies to demonstrate the finite-sample performance of the screening method described in Section 2. We compare the performance of PG-SIS with IG-SIS [19] and APC-SIS via the following evaluation criteria: MMS, the minimal model size required to contain all active covariates, for which we report the 5%, 25%, 50%, 75% and 95% quantiles; and CP1, CP2 and CP3, which respectively denote the probabilities that the given model sizes
,
and
cover all active covariates, while CPa indicates whether the selected model covers all active covariates.
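In code, MMS and a coverage probability for a single replication can be computed from the vector of screening scores. A sketch with our own function names, where `active` lists the indices of the truly active covariates:

```python
import numpy as np

def mms(scores, active):
    # Minimal model size: smallest number of top-ranked covariates whose
    # set contains every active covariate (i.e. the worst active rank).
    order = np.argsort(scores)[::-1]              # descending by score
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = highest score
    return int(max(ranks[k] for k in active))

def coverage(scores, active, size):
    # CP: 1.0 if the top-`size` covariates contain all active ones, else 0.0.
    top = set(np.argsort(scores)[::-1][:size].tolist())
    return float(set(active) <= top)
```

Averaging `coverage` over replications at the three reference model sizes yields CP1 to CP3, and the quantiles of `mms` over replications give the reported MMS summary.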
Model 1: categorical covariates and binary response
We first consider response variables of different categories. Following [19], we assume a model whose response
is binary in which
, and all the covariates are categorical. We consider two distributions for
:
1) Balanced,
;
2) Unbalanced,
with
.
The true model is defined at
with
. Conditional on
, the latent variable is generated as
, where
,
. Then, we construct active covariates:
1) If
, then
;
2) If
and
, then
;
3) If
and
, then
.
Next, we apply the quantiles of the standard normal distribution to generate covariates. The specific approach is as follows:
1) When k is odd, that is
;
2) When k is even, that is
;
where the αth percentile of the standard normal distribution is
.
Thus, amongst all p covariates, covariates with two categories and with five categories each account for half. Similar to [20], we consider
and
in this model.
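The exact cut points for Model 1 appear in the formulas above, but the general mechanism — turning a latent normal variable into a categorical covariate by cutting at standard normal quantiles — can be sketched as follows (equally spaced quantile cuts are assumed here purely for illustration):

```python
import numpy as np
from statistics import NormalDist

def discretize_by_normal_quantiles(z, n_categories):
    # Cut a latent (approximately standard normal) variable at the
    # alpha-th standard normal percentiles, alpha = j / n_categories,
    # producing a categorical covariate with n_categories levels.
    cuts = [NormalDist().inv_cdf(j / n_categories) for j in range(1, n_categories)]
    return np.searchsorted(cuts, z)
```

For example, `n_categories = 2` cuts at 0, and `n_categories = 5` cuts at the 20%, 40%, 60% and 80% normal percentiles, matching the two- and five-category covariates used in Model 1.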
Table 1. Simulation results for Example 1.
Table 1 reports the evaluation criteria over 100 simulations for Model 1. We can see the following: the results show that the proposed PG-SIS works quite well. As the sample size n increases, the MMS of PG-SIS approaches the true model size, and the coverage probabilities increase to 1. Comparing responses of different structures, the MMS under the unbalanced response is better than under the balanced response. The performances of PG-SIS and IG-SIS are quite close, and PG-SIS is slightly better than APC-SIS, with higher coverage probabilities.
Model 2: categorical covariates and multi-class response
We next consider covariates with more categories, and a response
that is multi-class with
. We consider
under two distributions:
1) Balanced,
;
2) Unbalanced,
with
.
Among the
covariates, the minimal active covariate set is
with the number of active covariates
. Conditional on
, the latent variable is generated as
, for covariates
,
, where
and
represents the quantile function of the standard normal distribution. Then, we construct active covariates by defining
:
1) If
and
, then
;
2) If
, then
;
Next, we apply the
to generate covariates, and take
,
in this model. The specific approach is as follows:
1) For
, then
;
2) For
, then
;
3) For
, then
;
4) For
, then
;
5) For
, then
;
Thus, amongst all the p covariates, the covariates of two categories, four categories, six categories, eight categories and ten categories account for one-fifth each.
Table 2. Simulation results for Example 2.
Table 2 reports the evaluation criteria over 100 simulations for Model 2. We can see the following: the methods perform worse under Model 2 than under Model 1, since the model is more intricate, yet the performance of PG-SIS remains close to that of IG-SIS. In particular, PG-SIS and IG-SIS have slightly smaller MMS under small sample sizes n. As the sample size n increases, the MMS of PG-SIS approaches the true model size, and the coverage probabilities increase to 1. The four coverage probability indexes of APC-SIS are worse than those of PG-SIS when the sample size is
. Comparing responses of different structures, the MMS under the unbalanced response is better than under the balanced response. Furthermore, PG-SIS and IG-SIS are more robust, as the fluctuation range of their MMS is small.
Model 3: continuous and categorical covariates
Finally, we consider a more complex example in which the covariates include both continuous and categorical variables, and the response
is multi-class with
. We consider
under two distributions:
1) Balanced,
;
2) Unbalanced,
with
.
In this model, we take
,
. The true model is defined at
with
. Conditional on
,
the latent variable is generated as
. For covariates
,
, where
when
and
. According to Ni and Fang [19],
is given in Table 3.
when
. To generate
:
For
, then
, if
;
For
, then
, if
;
For
, then
.
Thus, amongst all p covariates, covariates with four categories and with ten categories each account for one-fifth, and the remaining covariates are continuous. Similarly, among the active covariates there are five with four categories and five with ten categories, and the remaining active covariates are continuous, accounting for half. For continuous covariates, we applied different numbers of slices
. The corresponding approaches are denoted PG-SIS-4, IG-SIS-4, APC-SIS-4, PG-SIS-8, IG-SIS-8, APC-SIS-8, PG-SIS-10, IG-SIS-10 and APC-SIS-10. Table 4 and Table 5 show the simulation results over 100 simulations for the balanced and unbalanced cases, respectively. We can see the following: as the sample size n increases, the MMS of PG-SIS approaches the true model size, and the coverage probabilities increase to 1. The coverage probability of PG-SIS is close to that of IG-SIS in the five indexes, which demonstrates that PG-SIS possesses the feature screening property. Comparing responses of different structures, the MMS under the unbalanced response is better than under the balanced response. Furthermore, PG-SIS and IG-SIS are robust, as the fluctuation range of MMS is small for both types of responses.
Table 3. Parameter specification of Model 3.
Table 4. Simulation results for Example 3: balanced Y.
Table 5. Simulation results for Example 3: unbalanced Y.
Table 6. Analysis results for the real data example.
When different numbers of slices are applied to the continuous covariates, PG-SIS and IG-SIS perform better in the five coverage probability indexes and in MMS. Therefore, the performance of the three methods is essentially independent of the number of slices.
4.2. Real Data
In this subsection, we analyse a real data set from the feature selection database of Arizona State University (http://featureselection.asu.edu/). The GLIOMA biological data set includes 50 samples and 4434 features and is unbalanced in the response variable: the class sizes are 14, 7, 14 and 15. The covariates are not only continuous but also multi-class. We randomly divided the data into two parts, with 90 percent as training data and 10 percent as test data. The sample sizes of the training data and test data are respectively
and
. The dimension of both the training data and the test data is
.
We apply ten-fold cross-validation to eliminate the model accuracy problems caused by different training data. For PG-SIS, IG-SIS and APC-SIS, we apply three classification approaches, Support Vector Machine (SVM) [25], Random Forest (RF) and Decision Tree (DT) [26], to the chosen active covariates.
For both the training data and the test data, we use the G-mean and F-measure [27] as evaluation criteria. The performance of PG-SIS on the unbalanced data is reported in Table 6. Under all classification methods, PG-SIS performs best, with a G-mean closer to 1 than those of the other two methods. In a word, the proposed PG-SIS performs better.
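For reference, the G-mean (geometric mean of per-class recalls) and a macro-averaged F-measure can be computed as below; these are the standard definitions we assume here, and [27] should be consulted for the exact variants used in the paper.

```python
import numpy as np

def g_mean(y_true, y_pred):
    # Geometric mean of per-class recalls, suited to unbalanced data.
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def f_measure(y_true, y_pred):
    # Macro-averaged F1: harmonic mean of precision and recall per class.
    classes = np.unique(y_true)
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        prec = tp / max(np.sum(y_pred == c), 1)
        rec = tp / max(np.sum(y_true == c), 1)
        f1s.append(0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec))
    return float(np.mean(f1s))
```

Both measures equal 1 only under perfect classification, and the G-mean in particular collapses toward 0 when any single class is poorly recalled, which is why it is favored for unbalanced responses.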
5. Conclusions
In practice, data often contain both continuous and categorical covariates with a categorical response, yet applicable screening methods are very limited. We propose the PG-SIS procedure based on Gini impurity to screen covariates effectively. PG-SIS is model-free and enjoys the sure screening property and the ranking consistency property theoretically. Whether the numbers of categories of the covariates are the same or different, PG-SIS performs quite similarly to IG-SIS, as shown in the simulations.
Feature screening also faces difficulties with missing data. In the future, based on classification models, we intend to propose a new feature screening procedure that can handle missing covariates or a missing response.
Acknowledgements
The work was supported by National Natural Science Foundation of China [grant number 71963008].
Appendix
Proof of Proposition 2.1. To prove Proposition 2.1, we define
, following a proof close to that of Ni and Fang [19]. By Jensen's inequality [24],
then
The above equation holds if and only if
, for any
and
. That is,
and Y are independent.
Proof of Proposition 2.2. From the same proof as Proposition 2.1, we can get
holds if and only if
for any
and
, that is
. So when
and Y are independent,
. On the other hand, when
for any J, we need to show that
does not depend on x in the domain of X. Proposition 2.2 has been proved by [19], so the proof is omitted here.
Lemma 1 (Bernstein inequality). If
are independent random variables with mean zero and bounded support
, then the following inequality holds:
where
Lemma 2. For discrete covariates
and discrete response Y, we have the following three inequalities:
1)
2)
3)
Proof of Lemma 2. The three inequalities have similar proofs, where inequality (1) and inequality (2) have been given in [27]. The following is the proof of inequality (3).
The expectation of
is
Let
,
is known, then
According to the Bernstein inequality, the formula holds.
Lemma 3. With regard to discrete covariates
and discrete response Y, for any
, under condition (C1), we have
,
where
represents a positive constant.
Proof of Lemma 3. By
and
in Section 2.2, we have
Since
, we have
For
, we have
For
, we have
For
, we have
For
and
, we have
In a word, we have the inequality,
where
represents a positive constant.
Proof of Theorem 3.1. By Conditions (C1) to (C3) and Lemma 3, we can get
where b is a positive constant.
Lemma 4 (Lemma A.2 [19]). For any continuous covariate
satisfying conditions (C4) and (C5), let
be the cumulative distribution function of
and
be the empirical cumulative distribution function, we have
for any
,
and
, where
and
are two positive constants.
Lemma 5 (Lemma A.3 [19]). Under Conditions (C1), (C4) and (C5), for any
, and for continuous
, we have
, where
is a positive constant.
Proof of Theorem 3.2. According to Lemma 4 and Lemma 5, the proof of Theorem 3.2 is the same as Theorem 3.1 and hence is omitted.
Proof of Theorem 3.3. According to Lemmas 3 and 5 and under Conditions (C1), (C4), (C5) and (C7), Theorem 3.3 has been proved by [19] and hence the proof is omitted.