Inconsistency of Classical Penalized Likelihood Approaches under Endogeneity

With the rapid development of information technology, datasets from a wide variety of fields have become extremely large. In many of them the number of features far exceeds the sample size; such data are called high dimensional. In statistics, variable selection approaches are required to extract useful information from high dimensional data. The most popular approach adds a penalty function, coupled with a tuning parameter, to the log-likelihood function; this is called the penalized likelihood method. However, almost all penalized likelihood approaches consider only noise accumulation and spurious correlation while ignoring endogeneity, which also appears frequently in high dimensional settings. In this paper, we explore the causes of endogeneity and its influence on penalized likelihood approaches. Simulations based on five classical penalized approaches are provided to demonstrate their inconsistency under endogeneity. The results show that while the positive selection rate of all five approaches increases gradually with the sample size, the false selection rate does not consistently decrease when endogenous variables exist; that is, the approaches do not satisfy selection consistency.


Introduction
Along with the rapid progress of information technology and the electronics industry, more and more data have been collected in biomedicine, econometrics, and other fields. To extract valid information from such massive data, high-dimensional variable selection has become an active topic in statistics. Variable selection refers to selecting the important variables from a candidate feature space and eliminating the redundant ones. High dimension indexes the

The Origin and Cause of Endogeneity
The concept of endogeneity originated in economics. Under the linear regression model Y = β_0 + β_1 X_1 + β_2 X_2 + … + β_p X_p + ε, it means that some explanatory variable correlates with the error term, namely cov(X_j, ε) ≠ 0. The causes of endogeneity in variable selection can be roughly divided into three categories: omitted variables, measurement errors, and simultaneity bias. These are elaborated below under the most commonly used linear regression model.

Omitted variables means that some important variables affecting the response Y are left out of the set of explanatory variables. If these omitted variables are related to the retained explanatory variables, endogeneity occurs. To be more specific, suppose the true regression model is Y = Xβ + X*β* + ε but the variable X* is omitted, so the model is mistakenly specified as Y = Xβ + u. The omitted variable is then absorbed into the error term, that is, u = X*β* + ε. If X* is related to X_j, then u is related to X_j, which leads to endogeneity.

When a variable is measured inexactly, the measurement error enters the error term of the regression equation as part of the regression error. Measurement error comes not only from incorrectly recorded values but also from the inevitable conceptual gap between a commonly used proxy variable and the true variable, and it can arise in both the explanatory variables and the response variable. For example, suppose the true regression model is Y = Xβ + ε but only the contaminated covariate X + t is observed, where t is the measurement error. If t is related to the explanatory variables, endogeneity will occur.
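The omitted-variable mechanism above is easy to illustrate numerically. The following sketch (the coefficient values and correlation strength are illustrative, not taken from the paper) generates data from a true model with two correlated covariates, omits one of them, and checks that the composite error term u = X*β* + ε correlates with the retained covariate, biasing the least squares slope:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True model: Y = 1.0*X + 0.8*Xstar + eps, with X correlated with Xstar.
x_star = rng.normal(size=n)
x = 0.6 * x_star + rng.normal(size=n)   # cov(X, Xstar) != 0
eps = rng.normal(size=n)
y = 1.0 * x + 0.8 * x_star + eps

# Misspecified model omits Xstar, so u = 0.8*Xstar + eps becomes the error.
u = 0.8 * x_star + eps
print(np.corrcoef(x, u)[0, 1])          # clearly nonzero -> endogeneity

# The slope of y on x alone is biased away from the true value 1.0.
slope = np.cov(x, y)[0, 1] / np.var(x)
print(slope)
```

With these numbers the population correlation between X and u is about 0.32 and the biased slope is about 1.35, so the effect is visible even with abundant data.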
In addition to omitted variables and measurement errors, explanatory variables and response variables may also affect each other. The causality is then not one-way, which produces simultaneity bias and hence endogeneity. Take resident income X and resident consumption Y as an example. In general income and consumption interact, and the process of mutual influence cannot be observed directly, so the information about X and Y is essentially mixed together. More precisely, cov(X, ε) ≠ 0 and endogeneity occurs.
In the analysis of high-dimensional data, endogeneity is almost inevitable. This is mainly because, not knowing the true model, researchers tend to collect as many potentially relevant explanatory variables as possible to avoid omitting important ones, and these high-dimensional variables are usually aggregated from multiple data sources. Unintentionally, some explanatory variables may then be associated with the residual, leading to endogeneity. Put differently, the more variables and the higher the data dimension, the greater the probability of endogeneity.

Penalized Likelihood Method and Its Development
High-dimensional variable selection is one of the most popular techniques in statistics for extracting information from large volumes of complex data. Variable selection has two main goals: selection consistency, that is, identifying the important variables with probability tending to 1; and prediction accuracy, that is, estimating the coefficients as accurately as if the true model were known in advance. A method achieving both goals simultaneously is said to possess the oracle property. However, because of over-fitting in high-dimensional space, it is difficult to achieve both goals at once, and selection consistency is usually considered more important. For example, in disease gene mapping, the main concern is which genes are pathogenic, not the precise sizes of their effects.
In the high-dimensional linear model, the penalized likelihood method, which adds a penalty function to the log-likelihood to shrink the estimates and trade off variance against bias, is the most common variable selection tool. More specifically, consider a linear regression model with main effects only; minimizing the penalized likelihood function

Q(β) = (1/2n)‖Y − Xβ‖² + Σ_{j=1}^{p} p_λ(|β_j|)

yields a certain number of non-zero coefficients, and the corresponding variables become the candidate variables. A variety of penalty functions have been proposed, including Lasso [4], SCAD [5], Adaptive Lasso (ALasso) [6], MCP [7], and Sequential Lasso (SLasso) [8].
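As a concrete illustration of this penalized least squares formulation, the sketch below fits a Lasso (penalty p_λ(|β|) = λ|β|) with scikit-learn, whose `Lasso` objective is exactly (1/2n)‖Y − Xβ‖² + α‖β‖₁; the data dimensions, coefficients, and the value of λ are made up for the example:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]           # only 3 truly important variables
y = X @ beta + rng.normal(size=n)

# alpha plays the role of the tuning parameter lambda:
# larger alpha -> stronger shrinkage -> sparser candidate model.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(selected)
```

The non-zero coefficients define the candidate model; sweeping α over a grid produces the sequence of candidate models that a selection criterion (Section on tuning parameters) must then rank.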

Lasso and Improvements
Lasso was the first, choosing the most basic penalty function p_λ(|β|) = λ|β|, and has been widely cited. It is convenient and easy to compute, since its entire regularization path can be obtained at the cost of a single least squares fit. In high-dimensional space the Lasso estimator is biased, but it satisfies selection consistency under conditions such as the neighborhood stability condition [9], the irrepresentable condition [10], and the mutual incoherence condition [11]. However, all of these conditions require weak correlation between the unimportant and the important variables, which is difficult to achieve in practice; that is, Lasso performs poorly when the variables are highly correlated. In fact, for a group of highly pairwise-correlated variables, Lasso tends to select only one variable from the group, and which one it selects is essentially arbitrary.
Many classical feature selection methods have been proposed on the basis of Lasso. The elastic net [12] combines Lasso with ridge regression by defining p_λ(|β|) = λ_1|β| + λ_2 β², and it outperforms Lasso under high correlation and in prediction accuracy. It also exhibits the grouping effect: highly correlated variables tend to be selected into the model, or excluded from it, together.
ALasso [6] uses the weighted penalty p_λ(|β_j|) = λ w_j|β_j| and is proved to satisfy both selection consistency and prediction accuracy given a reasonable initial estimator. Another significant improvement of Lasso, SLasso [8], selects variables stepwise, adding an L1 penalty only to the variables not selected in previous stages. This ensures that variables selected early are not dropped later in the selection process. SLasso also possesses the oracle property and is computationally more attractive than approaches such as the elastic net.
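ALasso is straightforward to implement on top of an ordinary Lasso solver: with weights w_j = 1/|β̂_j^init|, rescaling each column of X by 1/w_j turns the weighted problem into an unweighted one. A sketch under illustrative assumptions (OLS as the initial estimator, an arbitrary λ, made-up data):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, p = 300, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [2.0, -1.5]
y = X @ beta + rng.normal(size=n)

# Step 1: an initial estimate (here OLS, valid since n > p) gives
# data-driven weights w_j = 1/|b_j|: small initial coefficients are
# penalized heavily, large ones lightly.
b_init = LinearRegression().fit(X, y).coef_
w = 1.0 / np.abs(b_init)

# Step 2: solve the weighted L1 problem by rescaling columns X_j -> X_j/w_j,
# fitting a plain Lasso, then undoing the scaling on the coefficients.
fit = Lasso(alpha=0.1).fit(X / w, y)
beta_alasso = fit.coef_ / w
print(np.flatnonzero(beta_alasso))
```

The adaptive weights are what restore the oracle property that plain Lasso lacks: noise variables receive large weights and are driven exactly to zero.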

Tuning Parameter
In addition to the choice of penalty function, the determination of the tuning parameter λ is another key point of penalized likelihood approaches. If λ is varied over a set of values, a series of candidate models is generated. The penalized likelihood method should therefore be used in conjunction with a model selection criterion: the former generates the candidate models, and the latter picks the optimal one. Classical model selection criteria include AIC [16] and BIC [17].
However, these traditional criteria are no longer suitable in high-dimensional space, because they select too many useless variables. To adapt AIC to the high-dimensional setting, modifications have been proposed [18], such as replacing the factor 2 with a constant term C [19]. For BIC, more effort has been devoted to modifying the prior probability, as in the modified BIC (mBIC) [20] and the extended BIC (EBIC) [21]. By assigning different values to the parameter γ, EBIC is essentially a family of criteria; BIC and mBIC can be regarded as special cases of EBIC with γ = 0 and γ = 1, respectively. The properties of EBIC under different high-dimensional models have been studied extensively: it is consistent for the linear model [21], the generalized linear model [22], the Cox model [23], etc.
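For a linear model with residual sum of squares RSS and k selected variables out of p candidates, one common form of EBIC is n log(RSS/n) + k log n + 2γ k log p, which reduces to ordinary BIC at γ = 0. A small sketch (the RSS values are made up) showing why the extra log p term matters in high dimensions:

```python
import numpy as np

def ebic(rss, n, k, p, gamma):
    """One common form of EBIC for a linear model with k selected
    variables out of p candidates (gamma = 0 recovers ordinary BIC)."""
    return n * np.log(rss / n) + k * np.log(n) + 2 * gamma * k * np.log(p)

# Toy comparison: a sparser model with slightly larger RSS still wins under
# EBIC, because each extra variable costs log(n) + 2*gamma*log(p).
n, p = 100, 1000
print(ebic(rss=90.0, n=n, k=3, p=p, gamma=1.0))   # small model
print(ebic(rss=85.0, n=n, k=10, p=p, gamma=1.0))  # larger model
```

Since log p grows with the dimension, the size penalty automatically strengthens as more candidate variables are screened, which is what curbs the over-selection that plain BIC suffers from when p is large.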

Inconsistency under Endogeneity
When the penalized likelihood method is used for variable selection, some basic conditions must be met to achieve the desired properties. These include restrictions on the explanatory variables [8], joint conditions on the explanatory variables and regression coefficients [24], or restrictions on the likelihood function [25]. However, when endogeneity exists, even if only a single endogenous variable is present, these necessary conditions are hard to meet. In that case there is an irreducible gap between the estimated regression coefficients and their true values, which undermines selection consistency.
Next, we use a simulation to show the effect of endogeneity. Two settings are considered; the difference between them is that in the former only unimportant variables are endogenous, while in the latter all the endogenous variables are important ones. Both are compared with the exogenous case X_j = Z_j for all j to isolate the impact of endogeneity. Here Z ~ N(0, Σ) and is independent of ε. For the covariance matrix Σ we consider only two common structures: Σ_ij = 0.5 for i ≠ j with Σ_ii = 1, and Σ_ij = 0.5^|i−j|, denoted S1 and S2 respectively. The extended Bayesian information criterion EBIC, with γ = 1 − log n/(4 log p), is used to select the tuning parameter and determine the optimal model.
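The design can be sketched as follows. The exact contamination mechanism that makes some X_j endogenous is not spelled out above, so the code uses one simple possibility, adding a multiple of ε to a few columns, purely for illustration; the dimensions and which columns are contaminated are likewise made up:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 50

# S1 covariance: unit variances, all pairwise covariances 0.5.
Sigma = np.full((p, p), 0.5) + 0.5 * np.eye(p)
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
eps = rng.normal(size=n)

# Exogenous design: X_j = Z_j for every j, so cov(X_j, eps) = 0.
X_exo = Z.copy()

# Endogenous design (illustrative recipe): a few unimportant columns
# are contaminated by the error term.
X_endo = Z.copy()
for j in (10, 11, 12):
    X_endo[:, j] = Z[:, j] + 0.5 * eps

# Response depends on the first 5 (exogenous) columns only.
beta = np.zeros(p)
beta[:5] = 1.0
y = Z @ beta + eps

print(np.corrcoef(X_endo[:, 10], eps)[0, 1])  # nonzero -> endogeneity
```

Running each penalized method on (X_exo, y) versus (X_endo, y) over a grid of λ, and scoring the EBIC-selected model by PDR and FDR, reproduces the kind of comparison summarized in the tables.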

Results and Interpretation
The simulation results are summarized in Tables 1-4.
Tables 1-4 show that when there is no endogeneity, PDR tends to 1 with an upward trend, FDR tends to 0 with a downward trend, and the number of selected variables tends to the true number of important variables, although the initial performance of the various methods differs. In other words, the asymptotic consistency of these classical penalized likelihood approaches is satisfied. When endogeneity exists, however, whether the endogenous variables are unimportant or important, the picture changes as the sample size increases: PDR still shows a rising trend, though not necessarily a pronounced one, while FDR and the number of selected variables do not behave as their asymptotic theory predicts. Wrong variables keep being selected, which means selection consistency no longer holds in the presence of endogeneity. In addition, the tables reveal differences in robustness among the penalized likelihood methods: when switching from the exogenous to the endogenous case, SCAD is the most robust and SLasso the least, which suggests some implications for subsequent research on endogenous feature selection.