Model-Free Feature Screening via Maximal Information Coefficient (MIC) for Ultrahigh-Dimensional Multiclass Classification
1. Introduction
With the advancement of data acquisition tools and the improvement of computer storage capacity, ultrahigh-dimensional data have become widespread in many fields of scientific research, especially genomics, tumor classification, and machine learning. Traditional data screening methods are no longer applicable, and the existing feature screening methods for ultrahigh-dimensional data have their own limitations. In particular, when the covariates include both categorical and continuous variables, few methods are available and their screening performance needs improvement, so there is an urgent need to develop new theoretical and statistical methods for ultrahigh-dimensional data. The pioneering work of Fan and Lv (2008) [1] introduced the concept of sure independence screening (SIS). Specifically, for linear regressions, they demonstrated that the approach based on Pearson correlation learning exhibits a sure screening property: even when the number of predictors p grows at a much faster rate than the number of observations n, with $\log p = O(n^{\xi})$ for some $\xi > 0$, all relevant predictors can be selected with probability approaching one (Fan et al., 2009) [2].
Numerous approaches have been developed in recent years for feature screening in ultrahigh-dimensional data. Wang (2009) [3] introduced forward regression as a method for handling such data. Fan and Song (2010) [4] applied maximum marginal likelihood estimates to ultrahigh-dimensional screening in generalized linear models. Fan et al. (2011) [5] extended correlation learning to marginal nonparametric learning. Li et al. (2012) [6] presented a robust rank correlation screening method based on the Kendall correlation coefficient. He et al. (2013) [7] developed a quantile-adaptive framework for nonlinear variable screening in high-dimensional heterogeneous data. Fan et al. (2014) [8] introduced nonparametric independence screening, which selects variables based on the nonparametric marginal contributions of each covariate given the exposure variable. Nandy et al. (2022) [9] introduced covariate information number sure independence screening, which incorporates a marginal utility connected to the traditional Fisher information. Tong et al. (2022) [10] proposed a model-free conditional feature screening method for ultrahigh-dimensional data based on false discovery rate (FDR) control, which does not require a specific functional form of the regression function and is robust to heavy-tailed responses and predictors.
To tackle the challenge of ultrahigh-dimensional feature screening in classification problems, Fan and Fan (2008) [11] introduced the t-test statistic for the two-sample mean problem as a marginal utility for feature screening and established its theoretical properties. Mai and Zou (2013) [12] applied the Kolmogorov filter to ultrahigh-dimensional binary classification. Cui et al. (2015) [13] proposed a screening procedure that utilizes empirical conditional distribution functions. Lai et al. (2017) [14] developed a feature screening procedure based on the expected conditional Kolmogorov filter for binary classification problems.
However, the aforementioned screening methods assume that the data are continuous. For categorical covariates, Huang et al. (2014) [15] devised a model-free discrete feature screening method based on Pearson chi-square statistics and demonstrated its sure screening property, as mentioned in Fan et al. (2009) [2]. When all the covariates are binary, Ni and Fang (2016) [16] proposed a model-free feature screening procedure based on information entropy theory for multi-class classification. Ni et al. (2017) [17] further extended this by introducing a feature screening procedure based on a weighted adjusted Pearson chi-square statistic for multi-class classification. Sheng and Wang (2020) [18] introduced a novel model-free feature screening method based on the classification accuracy of marginal classifiers for ultrahigh-dimensional classification. Anzarmou et al. (2022) [19] presented a new model-free interaction screening method called the Kendall Interaction Filter (KIF) for classification in high-dimensional settings.
Based on the aforementioned research on classification models, this paper introduces a model-free feature screening approach for ultrahigh-dimensional multiclass classification that accommodates both categorical and continuous covariates. The proposed method uses the maximal information coefficient (MIC) to evaluate the predictive power of the covariates; for categorical covariates, the MIC index is equivalent to information gain [16]. The feature screening procedure proposed in this paper is referred to as Maximal Information Coefficient Sure Independence Screening (MIC-SIS). MIC can be applied directly to continuous data without any cut-off preprocessing, and it overcomes several disadvantages of information gain: the difficulty of estimating joint probabilities, the lack of normalization, and the inability to compare results across different data sets. MIC first finds an optimal discretization of the data and then normalizes the resulting mutual information into a proper measure whose value lies in [0, 1]; its screening results are robust and its computational complexity is low. MIC thus enjoys a wide range of applicability, low computational complexity, and strong robustness. The MIC-SIS method is rigorously proven to possess the sure screening property originally proposed by Fan and Lv [1], ensuring that all significant features can be identified. Simulation results confirm that MIC-SIS satisfies the sure screening property and compares favorably with existing feature screening methods.
The paper is organized as follows: Section 2 provides a detailed description of the proposed MIC-SIS method. Section 3 establishes the sure screening property of the method. In Section 4, numerical simulations and a real data analysis example are presented to assess the sure screening property of our approach. Concluding remarks are provided in Section 5, and all proofs are included in the Appendix.
2. Feature Screening Procedure
Firstly, we introduce the concept of the maximal information coefficient (MIC), and subsequently, we propose a screening procedure that is based on the maximal information coefficient.
2.1. Maximal Information Coefficient (MIC)
The fundamental principle underlying the maximal information coefficient (MIC) is based on the concept of mutual information. Mutual Information (MI) [20] is a valuable measure in information theory, quantifying the amount of information contained in one random variable regarding another random variable. It represents the reduction in uncertainty of a random variable due to the knowledge of another random variable. In decision trees, mutual information and information gain (IG) are essentially equivalent.
The mutual information between two random variables X and Y is defined based on their joint probability distribution $p(x, y)$ as follows:
$$MI(X, Y) = \sum_{x}\sum_{y} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)}.$$
$MI(X, Y)$ is always nonnegative, and $MI(X, Y) = 0$ if and only if X and Y are independent.
When covariate X is continuous and Y is a categorical response with R classes $y_1, \ldots, y_R$,
$$MI(X, Y) = \sum_{r=1}^{R}\int f(x, y_r)\log\frac{f(x, y_r)}{f(x)\,p_r}\,\mathrm{d}x,$$
where $f(x, y_r)$ is the joint density of X and the event $\{Y = y_r\}$, $f(x)$ is the marginal density of X, and $p_r = P(Y = y_r)$. In practice, the probability functions are usually estimated using Gaussian kernel functions.
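To make the kernel-based estimate concrete, here is a minimal Python sketch; it is not the authors' implementation, the helper name `mi_continuous_categorical` and the integration grid are our own choices, and each class is assumed to contain enough observations for a kernel fit.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_continuous_categorical(x, y, grid_size=512):
    """Plug-in estimate of MI(X, Y) for continuous x and categorical y."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                          # p_r = P(Y = y_r)
    grid = np.linspace(x.min(), x.max(), grid_size)
    step = grid[1] - grid[0]
    f_x = gaussian_kde(x)(grid)                       # marginal density f(x)
    mi = 0.0
    for cls, p_r in zip(classes, priors):
        f_joint = gaussian_kde(x[y == cls])(grid) * p_r   # f(x, y_r) = f(x | y_r) p_r
        ok = (f_joint > 0) & (f_x > 0)
        ratio = np.ones_like(grid)
        ratio[ok] = f_joint[ok] / (f_x[ok] * p_r)
        mi += np.sum(f_joint * np.log(ratio)) * step  # Riemann-sum integration
    return mi
```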
When the covariate $X = (X_1, \ldots, X_p)^{\mathrm T}$ is a vector of dimension p whose components are categorical, where $X_k$ takes $J_k$ categories $\{1, \ldots, J_k\}$, and Y is also categorical with R classes $y_1, \ldots, y_R$,
$$MI(X_k, Y) = \sum_{j=1}^{J_k}\sum_{r=1}^{R} p_{jr}\log\frac{p_{jr}}{p_{j\cdot}\,p_{\cdot r}},$$
where $p_{jr} = P(X_k = j, Y = y_r)$, $p_{j\cdot} = P(X_k = j)$, and $p_{\cdot r} = P(Y = y_r)$.
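In the fully categorical case, the plug-in estimate reduces to a contingency-table computation; a minimal sketch under the same caveats (the helper name `mi_categorical` is ours):

```python
import numpy as np

def mi_categorical(x, y):
    """Plug-in estimate of MI(X_k, Y) from the empirical contingency table."""
    xs, x_idx = np.unique(x, return_inverse=True)
    ys, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (x_idx, y_idx), 1.0)             # cell counts n_{jr}
    joint /= len(x)                                   # p_hat_{jr}
    px = joint.sum(axis=1, keepdims=True)             # p_hat_{j.}
    py = joint.sum(axis=0, keepdims=True)             # p_hat_{.r}
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())
```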
The concept behind MIC is to discretize the relationship between two variables and represent it in two-dimensional space using a scatterplot. A data set consisting of data points with two attributes is distributed in a two-dimensional space. An $a \times b$ grid is used to divide the data space, and the frequency of data points falling into each cell serves as an estimate of the corresponding joint probability, which greatly reduces the computational complexity of the joint probability and successfully resolves the difficulty of estimating joint probabilities in mutual information.
Since there are several ways to partition the data points using an $a \times b$ grid, our goal is to find the partition that maximizes the mutual information. The mutual information values are normalized using a normalization factor that maps them to the $[0, 1]$ interval. Finally, the grid resolution that maximizes the normalized mutual information is taken as the MIC measure. The formula for MIC is expressed in the following equation:
$$MIC(X, Y) = \max_{ab < B(n)}\frac{MI^{*}(X, Y; a, b)}{\log_2\min(a, b)},$$
where $MI^{*}(X, Y; a, b)$ denotes the largest mutual information achievable over all $a \times b$ grids. In the above equation, a and b are the numbers of grid cells in the x and y directions (essentially the grid distribution), and $B(n)$ bounds the grid resolution. According to Reshef et al. [21], the grid resolution is typically limited to $ab < B(n)$, where $B(n)$ is often chosen to be approximately the 0.6 power of the data volume, i.e., $B(n) = n^{0.6}$. $MIC(X, Y)$ is always nonnegative, and $MIC(X, Y) = 0$ if and only if X and Y are independent.
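The following simplified Python sketch illustrates the idea: it searches equal-frequency $a \times b$ grids with $ab$ below $n^{0.6}$ and returns the best normalized mutual information. It is a toy version for illustration only; a full implementation (e.g., the `minepy` package) also optimizes the cell boundaries within each resolution.

```python
import numpy as np

def naive_mic(x, y, alpha=0.6):
    """Simplified MIC over equal-frequency a-by-b grids with a*b below n**alpha."""
    n = len(x)
    B = n ** alpha
    best = 0.0
    for a in range(2, int(B) + 1):
        for b in range(2, int(B / a) + 1):
            # Equal-frequency bin assignment on each axis.
            xb = np.searchsorted(np.quantile(x, np.linspace(0, 1, a + 1)[1:-1]), x)
            yb = np.searchsorted(np.quantile(y, np.linspace(0, 1, b + 1)[1:-1]), y)
            joint = np.zeros((a, b))
            np.add.at(joint, (xb, yb), 1.0)
            joint /= n
            px = joint.sum(axis=1, keepdims=True)
            py = joint.sum(axis=0, keepdims=True)
            mask = joint > 0
            mi = (joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum()
            best = max(best, mi / np.log2(min(a, b)))
    return best
```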
2.2. An Independence Ranking and Screening Procedure
We propose a novel model-free sure independence screening method utilizing the maximal information coefficient for analyzing ultrahigh-dimensional data. In this context, Y represents the categorical response variable with support $\{y_1, \ldots, y_R\}$, and $X = (X_1, \ldots, X_p)^{\mathrm T}$ denotes the predictor vector, where p is significantly larger than the sample size n. Without specifying a particular regression model, we define the subset of active predictor indices as follows:
$$D = \left\{k : F(y \mid X)\ \text{functionally depends on}\ X_k\ \text{for some}\ y\right\},$$
where $F(y \mid X)$ is the conditional distribution function of Y given X, and define the subset of inactive predictor indices by $D^{c} = \{1, \ldots, p\} \setminus D$. Using the notation mentioned above, we can define the active predictors as $X_D = \{X_k : k \in D\}$ and the inactive predictors as $X_{D^c} = \{X_k : k \in D^c\}$. Our primary objective is to accurately identify the subset of active predictor indices D.
The MI marginal measure can be estimated by replacing the unknown densities and probabilities with their sample counterparts. When covariate $X_k$ is continuous and Y is a categorical response with R classes,
$$\widehat{MI}(X_k, Y) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{\hat f(X_{ik}, Y_i)}{\hat f(X_{ik})\,\hat p_{Y_i}},$$
where $\hat f$ denotes a Gaussian kernel density estimate and $\hat p_r = n^{-1}\sum_{i=1}^{n} I(Y_i = y_r)$. Consider a covariate vector $X = (X_1, \ldots, X_p)^{\mathrm T}$ of dimension p, where each component $X_k$ takes on $J_k$ categories, represented by $\{1, \ldots, J_k\}$. Furthermore, the response variable Y is also categorical, with R classes denoted by $y_1, \ldots, y_R$. Then
$$\widehat{MI}(X_k, Y) = \sum_{j=1}^{J_k}\sum_{r=1}^{R}\hat p_{jr}\log\frac{\hat p_{jr}}{\hat p_{j\cdot}\,\hat p_{\cdot r}},$$
where $\hat p_{jr} = n^{-1}\sum_{i=1}^{n} I(X_{ik} = j, Y_i = y_r)$, $\hat p_{j\cdot} = n^{-1}\sum_{i=1}^{n} I(X_{ik} = j)$, and $\hat p_{\cdot r} = n^{-1}\sum_{i=1}^{n} I(Y_i = y_r)$.
The MIC marginal measure can be estimated by plugging $\widehat{MI}(X_k, Y)$ into the definition of MIC. Then
$$\hat\omega_k = \widehat{MIC}(X_k, Y) = \max_{ab < B(n)}\frac{\widehat{MI}(X_k, Y)}{\log_2\min(a, b)}.$$
Our goal is to calculate the maximal information coefficient between each predictor and the response variable, denoted as $\omega_k = MIC(X_k, Y)$ for $k = 1, \ldots, p$. Note that $\omega_k = 0$ if and only if $MI(X_k, Y) = 0$; this also indicates that predictor $X_k$ is statistically independent of Y. Therefore, the MIC index can be utilized as a measure of dependence to screen the predictors. The MIC-based approach is considered model-free because it solely relies on the marginal and joint densities of the random variables. This index can effectively capture both linear and nonlinear relationships between the response and predictors.
In ultrahigh-dimensional data analysis, the primary objective of feature screening is to identify a reduced model, with a small number of predictors, that still contains the true model D with high probability. According to the sure screening property proposed by Fan and Lv [2], the probability that the selected model contains the true model must converge to 1 as the sample size n tends to infinity, which guarantees that the screened covariates are valid. In this paper, we propose utilizing the $\hat\omega_k$ index to select a moderate-sized model
$$\hat D = \left\{k : \hat\omega_k \ge cn^{-\tau},\ 1 \le k \le p\right\},$$
where c and τ are predetermined positive values. In practice, we often select the reduced model using another formula:
$$\hat D = \left\{k : \hat\omega_k\ \text{is among the largest}\ d\ \text{values of}\ \hat\omega_1, \ldots, \hat\omega_p\right\},$$
where d can be taken as $[n/\log n]$, as suggested by Fan and Lv [1]. It is evident that the set of predictors in $\hat D$ represents the most likely relevant predictors associated with the response variable. Consequently, we can employ the predictors in $\hat D$ to estimate the true model. To simplify the description, we refer to the aforementioned procedure as the MIC-SIS (Maximal Information Coefficient Sure Independence Screening) procedure.
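Putting the pieces together, here is a minimal sketch of the MIC-SIS procedure, reusing the `naive_mic` helper sketched in Section 2.1; the default $d = [n/\log n]$ follows the convention above, and all names are our own.

```python
import numpy as np

def mic_sis(X, y, d=None):
    """MIC-SIS sketch: rank predictors by MIC with the response and keep the top d."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))                  # common default model size [n / log n]
    # naive_mic is the simplified helper from Section 2.1; y is assumed to be
    # numerically coded (for categorical Y the y-axis grid can be fixed at its R classes).
    omega_hat = np.array([naive_mic(X[:, k], y) for k in range(p)])
    ranked = np.argsort(omega_hat)[::-1]        # indices in decreasing MIC order
    return ranked[:d], omega_hat
```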
3. Feature Screening Property
In the subsequent sections, we will establish the theoretical properties of the proposed independence screening procedure. Previous studies by Fan and Lv [1] , Ji and Jin [22] , Zhou and Wang [20] and Ni and Fang [16] have demonstrated that the sure screening property ensures the effectiveness of the independence screening procedure. Hence, it is crucial to establish the sure screening property for MIC-SIS. The following conditions are assumed to guarantee the sure screening property of the MIC-SIS procedure. Although they may not be the weakest conditions, they are primarily imposed to facilitate the technical proofs.
(C1) Let $\{(X_i, Y_i), i = 1, \ldots, n\}$ be an independent sample, where each $X_{ik}$ is drawn from an unknown distribution $F_k$. Each distribution $F_k$ has an unknown Lebesgue probability density function (pdf) $f_k$, $k = 1, \ldots, p$. The conditions specified in Lemma 3 in the Appendix apply to these distributions.

(C2) There exists a positive constant $M$ such that $\sup_x \max_{1\le k\le p} f_k(x) \le M$.
(C3) There exist a positive constant $c > 0$ and $0 \le \tau < 1/2$ such that the minimum MIC of the active predictors satisfies $\min_{k \in D}\omega_k \ge 2cn^{-\tau}$.

(C4) Both X and Y exhibit sub-exponential tail probabilities that hold uniformly in p. Specifically, there exists a positive constant $s_0$ such that, for all $0 < s \le 2s_0$, the following condition holds:
$$\sup_{p}\max_{1\le k\le p} E\left\{\exp\left(sX_k^2\right)\right\} < \infty.$$

(C5) There exist two positive constants $c_1$ and $c_2$ such that $c_1/R \le P(Y = y_r) \le c_2/R$, $c_1/J_k \le P(X_k = j) \le c_2/J_k$, and $c_1/(RJ_k) \le P(X_k = j, Y = y_r) \le c_2/(RJ_k)$ for every $1 \le r \le R$, $1 \le j \le J_k$, and $1 \le k \le p$.
(C6) There exists a positive constant $c_3$ such that $f_k(x \mid y_r) \le c_3$ for any $1 \le r \le R$ and x in the domain of $X_k$, where $f_k(x \mid y_r)$ is the Lebesgue density function of $X_k$ conditional on $Y = y_r$.

(C7) There exists a positive constant $c_4$ such that $f_k(x) \ge c_4$ for any $1 \le k \le p$ and x in the domain of $X_k$, where $f_k(x)$ is the Lebesgue density function of $X_k$. Furthermore, $f_k(x)$ is continuous in the domain of $X_k$.

(C8) $\min_{k \in D}\omega_k - \max_{k \notin D}\omega_k \ge \delta$, where $\delta > 0$ is a constant.
According to Ji and Jin [22] and the lemmas in the Appendix, conditions (C1) and (C2) ensure that the estimated probabilities converge strongly and uniformly to the true probabilities. According to Fan and Lv [1] and Cui et al. [13], condition (C3) allows the minimum true signal to vanish to zero at the order of $n^{-\tau}$ as the sample size goes to infinity. Condition (C4) matches the setting of the sure screening property established by Zhou and Wang [20]. Condition (C5) guarantees that the proportion of each class of variables is neither extremely small nor extremely large; a similar assumption is made in condition (C5) of Huang et al. [15] and Cui et al. [13]. To ensure that the sample percentiles are close to the true percentiles, condition (C6) excludes the extreme case that some conditional densities put heavy mass in a small range. Condition (C7) requires a positive lower bound on the marginal densities. According to Cui et al. [13], it is easy to show that $\omega_k > 0$ for $k \in D$ and $\omega_k = 0$ for $k \notin D$ naturally holds. Thus, condition (C8) is established, and the MIC index is able to separate active and inactive predictors well at the population level.
Theorem 1 (Sure screening property). Under conditions (C1)-(C4), there exists a positive constant $b > 0$ such that
$$P\left(\max_{1\le k\le p}\left|\hat\omega_k - \omega_k\right| \ge cn^{-\tau}\right) \le O\left(p\exp\left\{-bn^{1-2\tau}\right\}\right).$$
Further, we have that
$$P\left(D \subseteq \hat D\right) \ge 1 - O\left(|D|\exp\left\{-bn^{1-2\tau}\right\}\right) \to 1,$$
where $|D|$ represents the cardinality of D. According to Theorem 1, we can handle the ultrahigh-dimensional scenario where $\log p$ is of the order $n^{\xi}$, with $\xi \in (0, 1 - 2\tau)$.
Theorem 2 (Ranking consistency property). Under conditions (C5)-(C8), if $\log p = o(n)$, then
$$\liminf_{n\to\infty}\left\{\min_{k \in D}\hat\omega_k - \max_{k \notin D}\hat\omega_k\right\} > 0,\quad \text{a.s.}$$
Theorem 2 demonstrates that the proposed screening index effectively distinguishes between active and inactive covariates at the sample level.
4. Numerical Studies
4.1. Simulation Results
In this subsection, we conduct three simulation studies to demonstrate the finite-sample performance of the screening method described in Section 2. We compare the performance of MIC-SIS with that of IG-SIS [16] and APC-SIS [17] using the following evaluation criteria:
1) MMS (minimal model size): the smallest model size that includes all active covariates. We report the 5%, 25%, 50%, 75%, and 95% quantiles of MMS over the replications.
2) CP1, CP2, and CP3: the probabilities that a model of a given size, specifically $d_1 = [n/\log n]$, $d_2 = 2[n/\log n]$, and $d_3 = 3[n/\log n]$, respectively, covers all active covariates.
3) CPa: the probability that the selected model covers all active covariates.
By comparing these evaluation criteria, we can assess and compare the performance of MIC-SIS, IG-SIS, and APC-SIS in terms of their ability to identify and include active covariates in the model.
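As an illustration, these criteria can be computed from the ranked screening results over replications roughly as follows (a sketch; helper names are ours):

```python
import numpy as np

def evaluate_screening(ranked_lists, active_set, n):
    """MMS quantiles and coverage probabilities CP1-CP3 across replications."""
    d1 = int(n / np.log(n))
    mms, cps = [], np.zeros(3)
    for ranked in ranked_lists:                 # one ranked index array per replication
        positions = [int(np.where(ranked == k)[0][0]) + 1 for k in active_set]
        mms.append(max(positions))              # smallest size covering all actives
        for m, d in enumerate((d1, 2 * d1, 3 * d1)):
            cps[m] += max(positions) <= d
    cps /= len(ranked_lists)
    return np.quantile(mms, [0.05, 0.25, 0.5, 0.75, 0.95]), cps
```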
Model 1: categorical covariates and binary response
We begin by examining response variables with different categories. Following Ni and Fang [16], we consider a binary response model with $R = 2$, in which all covariates are categorical. We consider two distributions for the response variable Y:
1) Balanced, $P(Y = 1) = P(Y = 2) = 1/2$;
2) Unbalanced, with unequal class probabilities $P(Y = 1) \ne P(Y = 2)$.
The true model is defined as $D = \{1, 2, \ldots, 10\}$ with $d_0 = 10$. Conditional on $Y = r$, a latent variable $Z_k$ is generated as a normal random variable whose mean depends on the class r, for $k \in D$. We then construct each active covariate by discretizing its latent variable: $X_k$ is assigned the category of the interval, between successive cut points, into which $Z_k$ falls.
Next, we generate the covariates by applying quantiles of the standard normal distribution. The specific approach is as follows:
1) When k is an odd number, $X_k$ is cut into two categories at the standard normal median;
2) When k is an even number, $X_k$ is cut into five categories at the standard normal quintiles.
Here $z_\alpha$ denotes the αth percentile of the standard normal distribution.
Therefore, out of all p covariates, half of them belong to two categories, while the other half belong to five categories. Following the approach in Ni and Fang [17], we consider two choices of the dimension p and two sample sizes n in this model.
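The exact parameter values follow Ni and Fang [16]; purely as an illustration of the generation mechanism, the sketch below draws a balanced binary response and discretizes class-dependent latent normals into alternating 2- and 5-category covariates. All numeric choices here (the mean shift, p, and $d_0$) are our own illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def generate_model1(n=300, p=200, d0=10, shift=0.5):
    """Illustrative generator: binary Y, categorical X via latent-normal quantile cuts."""
    y = rng.integers(1, 3, size=n)                   # balanced binary response
    z = rng.standard_normal((n, p))
    z[:, :d0] += shift * (y[:, None] == 2)           # class-dependent mean for actives
    X = np.empty((n, p), dtype=int)
    for k in range(p):
        n_cat = 2 if k % 2 == 0 else 5               # alternate 2- and 5-category covariates
        cuts = norm.ppf(np.linspace(0, 1, n_cat + 1)[1:-1])  # standard normal percentiles
        X[:, k] = np.searchsorted(cuts, z[:, k]) + 1 # category = interval index
    return X, y
```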
Table 1 shows the evaluation criteria for Model 1 based on 100 simulations. The results demonstrate the effectiveness of the proposed MIC-SIS method. As the sample size n increases, the MMS of MIC-SIS approaches the true model size $d_0$, and the coverage probabilities increase toward 1. Comparing the two response structures, MMS performs better under the unbalanced response than under the balanced one. MIC-SIS and IG-SIS show similar performance, with MIC-SIS slightly outperforming APC-SIS at higher coverage probabilities.
Model 2: categorical covariates and multi-class response
We further investigate classification with more covariate categories, where the response variable Y has multiple classes ($R > 2$). We consider two distributions for Y:
1) Balanced, $P(Y = r) = 1/R$ for $r = 1, \ldots, R$;
2) Unbalanced, with unequal class probabilities across $r = 1, \ldots, R$.
Table 1. Simulation results for Model 1.

Out of the p covariates, the minimal set of active covariates is $D = \{1, \ldots, d_0\}$, with a total of $d_0$ active covariates. Conditional on $Y = r$, a latent variable $Z_k$ is generated, where $Z_k$ follows a normal distribution whose mean depends on the class r for each active covariate. Each covariate $X_k$ is then defined by discretizing $Z_k$ with cut points obtained from $\Phi^{-1}$, the quantile function of the standard normal distribution. Based on this, we construct the active covariates by assigning $X_k$ the category of the interval, between successive cut points, into which $Z_k$ falls.
Next, we generate the covariates by applying the quantile function $\Phi^{-1}$ to the defined parameters. We consider sample sizes $n = 300$ and $n = 500$ for this model. The specific approach is as follows:
1) For the first fifth of the covariates, $X_k$ is cut into 2 categories at the standard normal median;
2) For the second fifth, $X_k$ is cut into 4 categories at the standard normal quartiles;
3) For the third fifth, $X_k$ is cut into 6 categories at the standard normal sextiles;
4) For the fourth fifth, $X_k$ is cut into 8 categories at the standard normal octiles;
5) For the last fifth, $X_k$ is cut into 10 categories at the standard normal deciles.
Among the p covariates, covariates with 2, 4, 6, 8, and 10 categories thus each constitute one-fifth of the total.
Table 2 shows the results of the evaluation criteria for 100 simulations of Model 2. The following conclusions can be drawn: all methods perform worse in the more complex Model 2 than in Model 1, and MIC-SIS and IG-SIS perform similarly. As the sample size n increases, the MMS of MIC-SIS approaches the true model size $d_0$ and the coverage probability approaches 1. When the sample size is 300, the coverage probability of APC-SIS is lower than that of MIC-SIS. Comparing response structures, the unbalanced response yields better MMS performance than the balanced one, and the performance of MIC-SIS and IG-SIS is more stable, with smaller fluctuations in MMS. In conclusion, these results highlight the effectiveness and robustness of MIC-SIS and IG-SIS in dealing with Model 2.
Model 3: continuous and categorical covariates
Finally, we consider a more complex example in which the response variable Y is multi-class. We examine two distributions for Y:
1) Balanced, $P(Y = r) = 1/R$ for $r = 1, \ldots, R$;
2) Unbalanced, with unequal class probabilities across $r = 1, \ldots, R$.
Table 2. Simulation results for Model 2.

In this model, we consider both categorical and continuous covariates with several sample sizes n. The true model is defined as $D = \{1, \ldots, d_0\}$ with $d_0$ active covariates. Conditioned on $Y = r$, the latent variable is generated as $Z_k = \mu_{rk} + \varepsilon_k$, where $\varepsilon_k \sim N(0, 1)$ for $k = 1, \ldots, p$. The class-dependent means $\mu_{rk}$, specified in Table 3 following Ni and Fang [16], determine the active covariates: specifically, $X_k$ is active when $\mu_{rk}$ varies with r.
An active covariate is then established by discretizing $Z_k$ or retaining it as continuous:
1) Some active covariates are cut into four categories at the standard normal quartiles;
2) Some active covariates are cut into ten categories at the standard normal deciles;
3) The remaining active covariates are kept continuous, $X_k = Z_k$.
Table 3. Parameter specification of Model 3.
Among all the p covariates, four-category and ten-category covariates each account for one-fifth, while the remaining covariates are continuous. Similarly, there are 5 active covariates with four categories and 5 with ten categories, and the remaining active covariates are continuous, accounting for half of the active set. For the continuous covariates, we apply different numbers of slices, $K = 4$, 8, and 10. Accordingly, we define the corresponding approaches as MIC-SIS-4, IG-SIS-4, APC-SIS-4, MIC-SIS-8, IG-SIS-8, APC-SIS-8, MIC-SIS-10, IG-SIS-10, and APC-SIS-10.
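For the methods that require slicing continuous covariates, a minimal quantile-slicing helper might look as follows (the paper does not state its exact slicing rule, so equal-frequency bins are our assumption):

```python
import numpy as np

def slice_continuous(x, n_slices=4):
    """Discretize a continuous covariate into equal-frequency categories."""
    cuts = np.quantile(x, np.linspace(0, 1, n_slices + 1)[1:-1])
    return np.searchsorted(cuts, x) + 1              # categories 1, ..., n_slices
```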
Table 4 and Table 5 present the simulation results based on 100 simulations for the balanced and unbalanced cases, respectively. The following observations can be made. As the sample size n increases, the MMS of MIC-SIS approaches the true model size $d_0$, and the coverage probabilities approach 1. The coverage probability of MIC-SIS is similar to that of IG-SIS for all five indices, demonstrating the sure screening property of MIC-SIS. For MMS, the unbalanced response outperforms the balanced response when comparing response structures. In addition, both MIC-SIS and IG-SIS exhibit robust performance, as evidenced by the small range of variation in MMS for both response types. Across the different slice numbers for the continuous covariates, MIC-SIS and IG-SIS outperform the other methods in both coverage probability and MMS.
Table 4. Simulation results for Model 3 (balanced Y).

Table 5. Simulation results for Model 3 (unbalanced Y).

4.2. Real Data

In this subsection, we analyze a real dataset obtained from the feature selection database of Arizona State University (http://featureselection.asu.edu/). The dataset, the GLIOMA biological data, consists of 50 samples and 4434 features. The data are unbalanced in the response variable, with class sizes of 14, 7, 14, and 15. The covariates in this dataset are both continuous and multiclass. We randomly divide the data into two parts, with 90% used as training data and 10% as test data: 45 training samples and 5 test samples. The dimensionality of both the training and test data is $p = 4434$.

To assess the performance of MIC-SIS, PG-SIS, IG-SIS, and APC-SIS, we employ three classification approaches: support vector machine (SVM) [23], random forest (RF), and decision tree (DT). We use ten-fold cross-validation to mitigate the effect of varying training data on model accuracy. These classifiers are applied to the active covariates selected by the aforementioned screening methods. Common evaluation metrics include accuracy, recall, F-measure, and G-mean; in this paper, we specifically use G-mean and F-measure [24] to assess model performance on both the training and test data. The performance of MIC-SIS for the unbalanced data is presented in Table 6.
Table 6. Analysis results for real data example.
Table 6 shows that, across all classification methods, MIC-SIS consistently outperforms the others, exhibiting G-mean and F-measure values closer to 1. In summary, the proposed MIC-SIS method demonstrates superior performance.
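For reference, here is a minimal sketch of multiclass G-mean and F-measure computed from predicted labels; the paper does not specify its averaging convention, so the macro-averaged F-measure below is our assumption:

```python
import numpy as np

def gmean_fmeasure(y_true, y_pred):
    """Multiclass G-mean (geometric mean of recalls) and macro-averaged F-measure."""
    classes = np.unique(y_true)
    recalls, f1s = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        recall = tp / np.sum(y_true == c)
        precision = tp / max(np.sum(y_pred == c), 1)
        recalls.append(recall)
        f1s.append(0.0 if precision + recall == 0
                   else 2 * precision * recall / (precision + recall))
    return float(np.prod(recalls) ** (1 / len(classes))), float(np.mean(f1s))
```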
5. Conclusions
In practical scenarios, it is common to encounter datasets containing both continuous and categorical covariates together with categorical responses, yet the available screening methods for such cases are limited. To address this problem, we introduce a new method called MIC-SIS (Maximal Information Coefficient Sure Independence Screening), which does not require continuous variables to be sliced and can be applied directly to a wide range of variable types, overcoming shortcomings of existing methods. In this paper, we demonstrate that MIC-SIS possesses theoretical properties such as sure screening and ranking consistency, and that no model assumptions are required. Through numerical simulations, we find that MIC-SIS screens covariates effectively, with better screening performance and lower computational complexity than existing methods. It also performs well in the empirical analysis of the GLIOMA data.
One of the current challenges in covariate screening is missing data. It is common to have missing or incorrect data during the data collection process, which can affect the variable screening results. In future work, we aim to develop a new approach to feature screening that can either deal with missing variables prior to screening, or use classification models to screen features based on response variables.
Acknowledgements
The work was supported by National Natural Science Foundation of China [grant number 71963008].
Availability of Data
The GLIOMA biological data that support the findings of this study are available from the feature selection database of Arizona State University (http://featureselection.asu.edu/).
Appendix
Proof of Theoretical Result.
To establish the validity of Theorem 1, we rely on three accompanying lemmas. The first two lemmas furnish us with exponential inequalities, and their detailed proofs can be found in [25] .
Lemma 1. Let $X_1, \ldots, X_n$ be independent random variables with $P(a \le X_i \le b) = 1$ for each i, and let $\mu = E(\bar X_n)$. Then
$$P\left(\left|\bar X_n - \mu\right| \ge \epsilon\right) \le 2\exp\left\{-\frac{2n\epsilon^2}{(b-a)^2}\right\}.$$

Lemma 2. Let $h(x_1, \ldots, x_m)$ be a kernel function for the U-statistic $U_n$, and $\theta = E\,h(X_1, \ldots, X_m)$. If $a \le h \le b$ holds, then for any $\epsilon > 0$ and $n \ge m$, we have the following inequality:
$$P\left(U_n - \theta \ge \epsilon\right) \le \exp\left\{-\frac{2[n/m]\epsilon^2}{(b-a)^2}\right\},$$
where $[n/m]$ denotes the integer part of $n/m$.

Lemma 2 represents the one-sided tail inequality of $U_n$. As a result of its symmetry, we can readily derive the two-sided tail inequality of $U_n$.
Lemma 3 (The asymptotic property of nonparametric density estimators). Suppose that $f''$ exists and is bounded, and that the bandwidth satisfies $h \to 0$ and $nh \to \infty$; then
$$\hat f(x) - f(x) = O\left(h^2\right) + O_p\left((nh)^{-1/2}\right).$$
From the above equation, $E\hat f(x) \to f(x)$ and $\operatorname{Var}\{\hat f(x)\} \to 0$.

Lemma 3 directly implies that $\hat f(x) \stackrel{p}{\longrightarrow} f(x)$. Under certain additional stringent conditions, we can achieve strong uniform convergence of $\hat f$. For more detailed information regarding the strong uniform convergence, one can refer to references such as [26] or [27]. These sources provide further insights and explanations on the topic.
Proof of Theorem 1. We begin by demonstrating that the following inequality holds for each k:
$$P\left(\left|\hat\omega_k - \omega_k\right| \ge cn^{-\tau}\right) \le O\left(\exp\left\{-bn^{1-2\tau}\right\}\right).$$
Since $\hat\omega_k$ is obtained by normalizing $\widehat{MI}(X_k, Y)$ without altering its properties, we only need to establish the corresponding inequality for $\widehat{MI}(X_k, Y)$. The difference $\widehat{MI}(X_k, Y) - MI(X_k, Y)$ can be decomposed into two terms. Using Lemma 3 and the strong law of large numbers, we observe the convergence of the first term.

Next, our goal is to establish an upper bound for the second term. Define $h$ as the kernel of the U-statistic $U_n$ associated with this term, with $\theta = E\,h$. By applying Markov's inequality, we can ensure that, for any $t > 0$,
$$P\left(U_n - \theta \ge \epsilon\right) \le \exp(-t\epsilon)\,E\exp\left\{t\left(U_n - \theta\right)\right\}.$$
Following the approach utilized by Li et al. (2012) [28] to handle U-statistics, and considering condition (C2), we can immediately deduce an exponential bound on the moment-generating function. By selecting a suitable t, we obtain a one-sided exponential tail bound for $U_n - \theta$. Consequently, taking into account the symmetry of U-statistics, we can derive the bilateral tail inequality. By utilizing the relationship between $\widehat{MI}(X_k, Y)$ and $U_n$, we can demonstrate that the same exponential bound holds for $\widehat{MI}(X_k, Y) - MI(X_k, Y)$.

Under condition (C2), we can choose a sufficiently large $n_1$ such that the desired bound holds whenever $n \ge n_1$. Similarly, employing the same technique and selecting a larger $n_2$, we can ensure that the bound for the remaining terms holds when $n \ge n_2$. This directly implies the claimed inequality for each k. Let $\epsilon = cn^{-\tau}$. By combining these bounds with Lemma 1 and Bonferroni's inequality, we can deduce that
$$P\left(\max_{1\le k\le p}\left|\hat\omega_k - \omega_k\right| \ge cn^{-\tau}\right) \le O\left(p\exp\left\{-bn^{1-2\tau}\right\}\right).$$
This establishes the first part of Theorem 1.
Next, we prove the second part of Theorem 1. If $D \not\subseteq \hat D$, then there must exist some $k \in D$ such that $\hat\omega_k < cn^{-\tau}$. By utilizing condition (C3), $\omega_k \ge 2cn^{-\tau}$ for all $k \in D$, so if $D \not\subseteq \hat D$ holds, then $\left|\hat\omega_k - \omega_k\right| > cn^{-\tau}$ also holds for some $k \in D$. Thus, the event $\left\{D \not\subseteq \hat D\right\}$ is a subset of the event $\left\{\left|\hat\omega_k - \omega_k\right| > cn^{-\tau}\ \text{for some}\ k \in D\right\}$. Taking the complement on both sides, we obtain that $\left\{\max_{k \in D}\left|\hat\omega_k - \omega_k\right| \le cn^{-\tau}\right\}$ is a subset of $\left\{D \subseteq \hat D\right\}$. Therefore, we have:
$$P\left(D \subseteq \hat D\right) \ge 1 - P\left(\left|\hat\omega_k - \omega_k\right| > cn^{-\tau}\ \text{for some}\ k \in D\right) \ge 1 - O\left(|D|\exp\left\{-bn^{1-2\tau}\right\}\right).$$
In the above equation, $|D|$ is the cardinality of D.
Now we prove Theorem 2. The proofs for Lemma 4 and Lemma 5 can be found in Ni and Fang (2016) [16].

Lemma 4. For a categorical covariate $X_k$ and response Y, under condition (C5), for any $\epsilon > 0$ we have $P\left(\left|\hat\omega_k - \omega_k\right| \ge \epsilon\right) \le O\left(\exp\left\{-c_5 n\epsilon^2\right\}\right)$, where $c_5$ represents a positive constant.

Lemma 5. For a continuous covariate $X_k$ and response Y, under conditions (C5), (C6) and (C7), for any $\epsilon > 0$ we have $P\left(\left|\hat\omega_k - \omega_k\right| \ge \epsilon\right) \le O\left(\exp\left\{-c_6 n\epsilon^2\right\}\right)$, where $c_6$ represents a positive constant.
Under conditions (C5)-(C8) and by Lemmas 4 and 5, if $\log p = o(n)$, we get
$$P\left(\max_{1\le k\le p}\left|\hat\omega_k - \omega_k\right| \ge \frac{\delta}{2}\right) \le O\left(p\exp\left\{-c_7 n\delta^2\right\}\right),$$
where $c_7$ is a positive constant. Since $\log p = o(n)$, there exists a positive constant $c_8 < c_7\delta^2$ such that $p \le \exp(c_8 n)$ for large n. Also, condition (C8) implies that
$$\min_{k \in D}\hat\omega_k - \max_{k \notin D}\hat\omega_k \ge \delta - 2\max_{1\le k\le p}\left|\hat\omega_k - \omega_k\right|$$
for large n. Then there exists a positive constant $c_9$ such that $\sum_{n} P\left(\max_{1\le k\le p}\left|\hat\omega_k - \omega_k\right| \ge \delta/2\right) \le \sum_{n} O\left(\exp\left\{-c_9 n\right\}\right) < \infty$. According to Ni and Fang (2016) [16] and by the Borel-Cantelli lemma, we can get
$$\liminf_{n\to\infty}\left\{\min_{k \in D}\hat\omega_k - \max_{k \notin D}\hat\omega_k\right\} > 0,\quad \text{a.s.}$$
This is the end of the proof.