Model-Free Feature Screening Based on Gini Impurity for Ultrahigh-Dimensional Multiclass Classification

It is quite common that both categorical and continuous covariates appear in the data, yet most feature screening methods for ultrahigh-dimensional classification assume that the covariates are continuous, and the applicable feature screening methods are very limited. To handle this non-trivial situation, we propose a model-free feature screening procedure for ultrahigh-dimensional multi-class classification with both categorical and continuous covariates. The proposed method is based on Gini impurity to evaluate the prediction power of covariates. Under certain regularity conditions, we prove that the proposed screening procedure possesses the sure screening property and the ranking consistency property. We demonstrate the finite sample performance of the proposed procedure by simulation studies and illustrate it with a real data analysis.


Introduction
Ultrahigh-dimensional data are commonly encountered in a wide range of scientific research and applications. Feature screening plays an essential role in the analysis of ultrahigh-dimensional data, and Fan and Lv [1] first proposed sure independence screening (SIS) in their seminal paper. For linear regressions, they showed that the approach based on Pearson correlation learning possesses a sure screening property. That is, even if the number of predictors $p$ grows exponentially with the sample size, in the sense that $\log p = O(n^{\alpha})$ for some $\alpha \in (0, 1/2)$, all relevant predictors can be selected with probability tending to one [2].
Many model-based and model-free feature screening approaches have been developed in recent years. For example, Wang [3] proposed forward regression for ultrahigh-dimensional data. Fan and Song [4] applied the maximum marginal likelihood estimates or the maximum marginal likelihood to ultrahigh-dimensional screening in generalized linear models. Fan et al. [5] further extended correlation learning to marginal nonparametric learning. Zhu et al. [6] proposed a model-free feature screening approach for ultrahigh-dimensional data. Li et al. [7] proposed a robust rank correlation screening method for ultrahigh-dimensional data based on the Kendall $\tau$ correlation coefficient.
Li et al. [8] applied distance correlation to the sure independence screening procedure. He et al. [9] proposed a quantile-adaptive framework for nonlinear variable screening with high-dimensional heterogeneous data. Fan et al. [10] proposed nonparametric independence screening, which selects variables by ranking a measure of the nonparametric marginal contribution of each covariate given the exposure variable. Liu et al. [11] proposed a feature screening procedure for varying coefficient models based on the conditional correlation coefficient. Nandy et al. [12] proposed covariate information number sure independence screening, which uses a marginal utility connected to the notion of traditional Fisher information. Pouyap et al. [13] proposed merging feature selection methods in order to identify the most relevant features in the texture of vibration signal images.
To address ultrahigh-dimensional feature screening in classification problems, Fan and Fan [14] proposed the two-sample t-test statistic as a marginal utility for feature screening and established its theoretical properties. Mai and Zou [15] applied the Kolmogorov filter to ultrahigh-dimensional binary classification. Cui et al. [16] proposed a screening procedure based on empirical conditional distribution functions. Lai et al. [17] proposed a feature screening procedure based on the expected conditional Kolmogorov filter for binary classification. However, these screening methods assume that the covariates are continuous. For categorical covariates, Huang et al. [18] constructed a model-free discrete feature screening method based on Pearson Chi-square statistics and showed that it fulfils the sure screening property (Fan et al. [2]). When all the covariates are binary, Ni and Fang [19] proposed a model-free feature screening procedure based on information entropy theory for multi-class classification. Ni et al. [20] further proposed a feature screening procedure based on the adjusted Pearson Chi-square for multi-class classification. Sheng and Wang [21] proposed a new model-free feature screening method based on the classification accuracy of marginal classifiers for ultrahigh-dimensional classification; see also Anzarmou et al. [22]. Building on these studies of classification models, in this paper we propose a model-free feature screening procedure for ultrahigh-dimensional multi-class classification with both categorical and continuous covariates. The proposed method is based on Gini impurity to evaluate the prediction power of covariates. Gini impurity is an impurity-based attribute splitting index, which was proposed by Breiman et al. [23] and has been widely used in decision tree algorithms such as CART and SPRINT. For categorical covariates, we can apply the index of purity gain, which plays the same role as information gain [19]. Similar to Ni and Fang [19], continuous covariates can be sliced via standard normal quantiles. The proposed feature screening procedure is based on purity gain and is referred to as Purity Gain sure independence screening (PG-SIS).
Theoretically, PG-SIS is rigorously proven to enjoy the sure screening property introduced by Fan and Lv [1], which ensures that all important features are retained with probability tending to one.
Practically, the simulation results show that PG-SIS compares favorably with existing feature screening methods and empirically satisfies the sure screening property.
This paper is organized as follows. Section 2 describes the proposed PG-SIS method in detail. Section 3 establishes its sure screening property. In Section 4, numerical simulations and a real data example are given to assess the finite sample performance and the sure screening property of our method. Some concluding remarks are given in Section 5, and all proofs are given in the Appendix.

Feature Screening Procedure
We first introduce Gini impurity and purity gain, and then propose the screening procedure based on purity gain.

Gini Index and Purity Gain
Suppose that $Y$ is a categorical response with $R$ classes $\{1, 2, \ldots, R\}$, and let $p_r = P(Y = r)$, $r = 1, \ldots, R$. The Gini impurity of $Y$ is defined as
$$Gini(Y) = 1 - \sum_{r=1}^{R} p_r^{2}. \qquad (1)$$
For a categorical covariate $X_k$ with $J_k$ classes $\{1, \ldots, J_k\}$, the conditional Gini impurity is defined as
$$Gini(Y \mid X_k) = \sum_{j=1}^{J_k} P(X_k = j)\left[1 - \sum_{r=1}^{R} P(Y = r \mid X_k = j)^{2}\right]. \qquad (2)$$
Similar to the information gain, the purity gain is defined as
$$PG(Y \mid X_k) = Gini(Y) - Gini(Y \mid X_k). \qquad (3)$$
In Equation (1), $Gini(Y)$ is non-negative, and by Jensen's inequality [24] it attains its maximum $1 - 1/R$ when all classes are equally likely. Further support is given by the following proposition.
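To make these definitions concrete, the following minimal Python sketch estimates $Gini(Y)$, $Gini(Y \mid X_k)$ and the purity gain from observed samples of a categorical response and a categorical covariate. The function names and the toy data are ours; empirical frequencies simply replace the population probabilities above.

```python
import numpy as np

def gini_impurity(y):
    """Empirical Gini impurity: 1 - sum_r p_r^2, with p_r the class frequencies."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def conditional_gini(y, x):
    """Empirical conditional Gini impurity of y given a categorical covariate x."""
    impurity = 0.0
    for level in np.unique(x):
        mask = (x == level)
        impurity += mask.mean() * gini_impurity(y[mask])
    return impurity

def purity_gain(y, x):
    """Purity gain PG(Y | X) = Gini(Y) - Gini(Y | X)."""
    return gini_impurity(y) - conditional_gini(y, x)

# Toy illustration: x1 is informative about y, x2 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=500)                              # response with R = 3 classes
x1 = np.where(rng.random(500) < 0.8, y, rng.integers(0, 3, size=500))
x2 = rng.integers(0, 5, size=500)                             # irrelevant covariate, 5 categories
print(purity_gain(y, x1), purity_gain(y, x2))                 # the first should be clearly larger
```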
For a continuous $X_k$, the conditional Gini impurity cannot be calculated directly, so the purity gain is obtained by slicing $X_k$ into several categories. For a fixed integer $J \ge 2$, let $q_{(j)}$ be the $j/J$-th percentile of $X_k$, $j = 1, \ldots, J-1$, and set $q_{(0)} = -\infty$ and $q_{(J)} = +\infty$. The conditional Gini impurity based on a continuous covariate is then
$$Gini(Y \mid X_k) = \sum_{j=1}^{J} P\left(q_{(j-1)} \le X_k < q_{(j)}\right)\left[1 - \sum_{r=1}^{R} P\left(Y = r \mid q_{(j-1)} \le X_k < q_{(j)}\right)^{2}\right],$$
and the purity gain is defined as before with this sliced version of $X_k$.
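For continuous covariates, a small sketch of the slicing step described above: the covariate is cut at its empirical $j/J$-th percentiles and the same purity gain computation is applied to the resulting categories. The helper name `slice_continuous` is illustrative only, and the example reuses `purity_gain` from the previous sketch.

```python
import numpy as np

def slice_continuous(x, J=4):
    """Discretize a continuous covariate into J slices at its j/J-th percentiles."""
    # Interior cut points q_(1), ..., q_(J-1); the two outer bins are open-ended.
    cuts = np.quantile(x, [j / J for j in range(1, J)])
    return np.digitize(x, cuts)   # slice labels 0, 1, ..., J-1

# Example: purity gain of a sliced continuous covariate.
rng = np.random.default_rng(1)
y = rng.integers(0, 3, size=500)
x_cont = y + rng.normal(scale=1.0, size=500)   # continuous covariate related to y
print(purity_gain(y, slice_continuous(x_cont, J=4)))
```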

Feature Screening Procedure Based on Purity Gain
First, we aim to select a moderately sized simplified model that can almost fully contain all the active covariates. For each pair $(Y, X_k)$ we use an adjusted purity gain index,
$$e_k = \frac{PG(Y \mid X_k)}{\log J_k},$$
where $J_k$ is the number of categories of $X_k$ (or the number of slices when $X_k$ is continuous). In the original definition of the purity gain, covariates with more categories tend to be associated with larger purity gain regardless of whether they are important, especially when the number of categories differs across covariates. Ni and Fang [19] used $\log J_k$ to construct the information gain ratio to address this problem when every covariate has the same number of categories. Similarly, we apply $\log J_k$ to build the adjusted purity gain index, which is also applied to continuous $X_k$. When the numbers of categories differ across covariates, an adjustment factor is defined instead, motivated by splitting $X_k$ into several categories via the decision tree algorithm.
When $X_k$ is continuous, the index is computed on the sliced version of $X_k$. We suggest selecting the sub-model
$$\widehat{\mathcal{A}} = \left\{ 1 \le k \le p : \hat{e}_k \text{ ranks among the largest } d_n \right\},$$
where $\hat{e}_k$ is the sample estimate of $e_k$ and $d_n$ is a pre-specified model size, for example $d_n = [n / \log n]$.
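Putting the pieces together, the following sketch ranks all $p$ covariates by the adjusted purity gain index and keeps the top $d_n$. It assumes the helpers `purity_gain` and `slice_continuous` from the previous sketches, and the default $d_n = \lfloor n/\log n \rfloor$ is only one conventional choice rather than the paper's prescribed threshold.

```python
import numpy as np

def pg_sis(X, y, continuous_cols=(), J=4, d_n=None):
    """Rank covariates by the adjusted purity gain index and keep the top d_n.

    X is an (n, p) array; columns listed in continuous_cols are sliced before scoring.
    Relies on purity_gain and slice_continuous defined in the sketches above.
    """
    n, p = X.shape
    if d_n is None:
        d_n = int(np.floor(n / np.log(n)))            # one conventional choice of model size
    cont = set(continuous_cols)
    scores = np.empty(p)
    for k in range(p):
        xk = slice_continuous(X[:, k], J=J) if k in cont else X[:, k]
        J_k = len(np.unique(xk))                       # observed number of categories/slices
        scores[k] = purity_gain(y, xk) / np.log(J_k) if J_k > 1 else 0.0
    ranked = np.argsort(scores)[::-1]                  # descending adjusted purity gain
    return ranked[:d_n], scores
```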

Feature Screening Property
In this section, we establish the sure screening property of PG-SIS. Following the sure independence screening theory of Ni and Fang [19], the following conditions are assumed.
Condition 1 (C1). There exist two positive constants $c_1$ and $c_2$ such that $c_1/R \le P(Y = r) \le c_2/R$ for every $r = 1, \ldots, R$.

Condition 2 (C2). There exist a positive constant $c > 0$ and a constant $0 \le \tau < 1/2$ such that $\min_{k \in \mathcal{A}} e_k \ge 2 c n^{-\tau}$, where $\mathcal{A}$ denotes the active covariate set.

Condition 3 (C3). The number of response classes $R$ and the numbers of covariate categories $J_k$ are allowed to diverge with the sample size $n$ at most at a certain polynomial order.

Condition 4 (C4). There exists a positive constant $c_3$ bounding the probability mass that any continuous $X_k$ can place in a small range.

Condition 5 (C5). There exist a positive constant $c_4$ and a constant $0 \le \rho < 1/2$ such that the densities of the continuous covariates are bounded below by a quantity of order $n^{-\rho}$.

Condition (C1) guarantees that the proportion of each class of the response can be neither extremely small nor extremely large; a similar assumption is made in condition (C1) of Huang et al. [18] and Cui et al. [16]. Following Fan and Lv [1] and Cui et al. [16], Condition (C2) allows the minimum true signal to vanish to zero at the order of $n^{-\tau}$ as the sample size goes to infinity. Following [19], Condition (C3) allows the number of covariate categories and the number of response classes to diverge with a certain order, and Condition (C6) modifies Condition (C3) slightly. To ensure that the sample percentiles are close to the true percentiles, Condition (C4) rules out the extreme case that some $X_k$ puts heavy mass in a small range. Condition (C5) requires a lower bound of order $n^{-\rho}$ on the density. Following [16] and the ranking consistency property proposed by Zhu et al. [6], we also need an assumption on the inactive covariate subset.
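To make the target of the analysis concrete, the following display sketches the generic form that a sure screening result takes in this literature (cf. Fan and Lv [1]). It is an illustrative template only, not the exact statement of our theorem; the constants $c$, $\tau$ and $\kappa$ are placeholders for the quantities appearing in the conditions above.

```latex
% Generic form of a sure screening statement (illustrative template, not the exact theorem):
% with probability tending to one, the selected sub-model contains all active covariates.
\begin{equation*}
  P\left( \mathcal{A} \subseteq \widehat{\mathcal{A}} \right)
  \;\ge\; 1 - O\!\left( p \, \exp\!\left\{ -c\, n^{1 - 2\tau - \kappa} \right\} \right)
  \;\longrightarrow\; 1, \qquad n \to \infty .
\end{equation*}
```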

Simulation Results
In this subsection, we carry out three simulation studies to demonstrate the finite sample performance of the screening method described in Section 2.
We compare the performance of PG-SIS with IG-SIS [19] and APC-SIS via the models below.
Model 1: categorical covariates and binary response. Following [19], we assume a model whose response $y_i$ is binary, i.e. $R = 2$, and all covariates are categorical. Two distributions are considered for $y_i$: a balanced one and an unbalanced one. The true model is defined through a small set of active covariates. Next, we apply the quantiles of the standard normal distribution to generate the covariates; the construction differs between odd $k$ and even $k$, where $z(\alpha)$ denotes the $\alpha$-th percentile of the standard normal distribution.
Thus, amongst all $p$ covariates, the covariates with two categories and with five categories each account for half. Similar to [20], we consider $p = 1000, 5000$ and $n = 200, 400$ in this model.
Model 2: categorical covariates and multi-class response. We consider more covariate categories, and the response $y_i$ is multi-class with $R = 10$. Two distributions are considered for $y_i$: 1) balanced and 2) unbalanced. Amongst all $p$ covariates, the covariates with two, four, six, eight and ten categories each account for one fifth. In this model, we take $p = 5000$ and $n = 400, 600, 800$.
Model 3: both categorical and continuous covariates. The true model is defined similarly. Conditional on $y_i$, a latent variable is generated for each $k \in D$, where the parameters $\theta_{rk}$ are given in Table 3 according to Ni and Fang [19], and the covariates $X_k$ are then generated from the latent variables. Amongst all $p$ covariates, the covariates with four categories and with ten categories each account for one fifth, and the remaining covariates are continuous.
Similarly, there are respectively 5 active covariates with four categories and 5 with ten categories in the active set. Table 4 and Table 5 show the simulation results over 100 replications for the balanced and unbalanced cases, respectively.
Table 3. Parameter specification of Model 3.
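Since the exact data-generating equations of Models 1-3 are given in the original displays and Table 3, we only sketch, under simplified assumptions of our own, how such a replication study can be organised: generate data, run the screening, and record whether all active covariates are retained. The toy model below is not the paper's Model 1, and `pg_sis` is the sketch from Section 2.

```python
import numpy as np

def one_replication(n=200, p=500, n_active=10, d_n=None, seed=0):
    """Toy replication: binary response, categorical covariates, a few of them active."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.integers(0, 5, size=(n, p))            # noise covariates with 5 categories
    for k in range(n_active):                      # make the first n_active covariates informative
        flip = rng.random(n) < 0.3
        X[:, k] = np.where(flip, rng.integers(0, 5, size=n), y)
    selected, _ = pg_sis(X, y, d_n=d_n)
    return set(range(n_active)).issubset(set(selected))   # sure screening indicator

# Proportion of replications in which the selected sub-model contains all active covariates
# (p and the number of replications are kept small here purely to keep the sketch fast).
coverage = np.mean([one_replication(seed=s) for s in range(50)])
print("proportion of replications retaining all active covariates:", coverage)
```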

Real Data
In this subsection, we analyse a real data set from the feature selection database. We apply ten-fold cross-validation to reduce the influence of different training data splits on model accuracy. For PG-SIS, IG-SIS and APC-SIS, we apply three classification approaches, namely Support Vector Machine (SVM) [25], Random Forest (RF) and Decision Tree (DT) [26], to the selected active covariates.
On the training data, we use the G-mean and F-measure [27] for evaluation, and the same holds for the test data. The performance of PG-SIS on the unbalanced data is reported in Table 6. Under all classification methods, PG-SIS performs best, with a G-mean closer to 1 than that of the other two methods. In a word, the proposed PG-SIS performs better.
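As an illustration of this evaluation protocol, the following sketch applies the three classifiers to the covariates retained by a screening step within ten-fold cross-validation and reports the G-mean and macro F-measure. The G-mean is computed by hand as the geometric mean of per-class recalls (one common definition), the data loading is left out because the original data source is not reproduced here, and `screen` can be the `pg_sis` sketch above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, recall_score

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (a common multi-class G-mean definition)."""
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def evaluate(X, y, screen, d_n=50):
    """Ten-fold CV: screen on the training fold, then fit and score each classifier."""
    models = {"SVM": SVC(), "RF": RandomForestClassifier(), "DT": DecisionTreeClassifier()}
    scores = {name: [] for name in models}
    for train, test in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        keep, _ = screen(X[train], y[train], d_n=d_n)     # selected covariate indices
        for name, model in models.items():
            model.fit(X[train][:, keep], y[train])
            pred = model.predict(X[test][:, keep])
            scores[name].append((g_mean(y[test], pred),
                                 f1_score(y[test], pred, average="macro")))
    return {name: np.mean(vals, axis=0) for name, vals in scores.items()}
```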

Conclusions
It is very common in practice that the data contain both continuous and categorical covariates while the response is categorical, yet the applicable screening methods are very limited. We propose the PG-SIS procedure based on Gini impurity to evaluate the prediction power of covariates for ultrahigh-dimensional multi-class classification.