An Alternative Approach to AIC and Mallow's Cp Statistic-Based Relative Influence Measures (RIMs) in Regression Variable Selection

Outlier detection is an important form of data screening. A relative influence measure (RIM) is a mechanism of outlier detection that quantifies the contribution of individual data points to a regression model. This work develops a BIC-based RIM that simultaneously detects influential data points and selects optimal predictor variables. It adds to the existing literature both an alternative to the AIC and Mallow's Cp statistic-based RIMs and explicit conditions characterizing no influence, some influence, and a single perfectly outlying data point in an entire data set. The method is implemented in R by an algorithm that iterates over all data points, deleting one at a time while computing BICs and selecting optimal predictors alongside RIMs. In analyses of evaporation data comparing the proposed method with the existing methods, the data cases identified as highly influential by the two existing methods are also identified by the proposed method. The three methods show the same performance; hence the relevance of the BIC-based RIM cannot be undermined.


Introduction
Model selection (variable selection) in regression has received great attention in the literature in recent times. A large number of predictors is usually introduced at the initial stage of modeling to attenuate possible modeling biases [1]. As noted by [2], inference under models with too few parameters (variables) can be biased, while models with too many parameters (variables) may suffer poor precision or spurious identification of effects. The need for a balance between under- and over-fitted models is thus the problem of variable selection.
An influential observation is a special case of an outlier. In the simplest sense, outlying or extreme values are observations that are well separated from the remainder of the data. Outliers result from either (1) errors of measurement or (2) intrinsic variability (mean shift, inflation of variances, or others), and appear either as (i) a change in the direction of the response (Y) variable, (ii) deviation in the space of explanatory variables (deviated points in the X-direction, called leverage points), or (iii) a change in both directions (the direction of the explanatory variable(s) and the response variable). Such outlying observations may involve large residuals and often have dramatic effects on the fitted least squares regression function. The influence of an individual case (data point) in a regression model can be adverse, causing a significant shift (upward or downward) in the values of the model parameters and in turn reducing the predictive power of the model. Only a few papers dealing with the influence of individual data cases in regression explicitly take an initial variable selection step into account. This problem is handled by [3]-[6].
One objective of regression variable selection is to reduce the predictors to some optimal subset of the available regressors [3]. Several approaches to variable selection exist in the literature, including stepwise deletion and subset selection. Stepwise deletion covers regression models in which the choice of predictive variables is carried out by an automatic procedure. Usually this takes the form of a sequence of F-tests or t-tests, but other techniques are possible, such as adjusted R-square, AIC, BIC, Mallow's Cp statistic, PRESS, or the false discovery rate [7]. [8] proposed the coefficient of determination ratio (CDR), based on the coefficient of determination (R²) of the linear regression model. [9] developed an outlier detection and robust selection method that combined robust least angle regression with least trimmed squares regression on jack-knife subsets; when the detected outliers are removed, standard least angle regression is applied to the cleaned data to robustly sequence the predictor variables in order of importance. [3] proposed a method called the relative influence measure, using the Mallow's Cp and AIC statistics. These measures are dimensionally consistent, computationally efficient, and able to identify influential cases, though they fail to be asymptotically consistent. [10], comparing the BIC and AIC, stated that the AIC is not consistent: as the number of observations n grows very large, the probability that AIC recovers a true low-dimensional model does not approach unity [11]. [12] supported the same argument, noting that the BIC has the advantage of being asymptotically consistent: as n → ∞, BIC will select the correct model.
Hence, the specific objective of this paper is to propose a relative influence measure that indicates whether the fit of the selected model improves or deteriorates owing to the presence of an observation (case), and that retains asymptotic consistency, thereby not violating the sampling properties of the model parameters.

Cook's Distance and the Influence Measure
Let V be the set of indices corresponding to the predictor variables selected from the full data set, and let ŷ(V) be the prediction vector based on the selected variables and calculated from the full data set. Also let ŷ_{-i}(V) be the prediction vector based on the variables corresponding to V, but calculated from the data set without case i. [3] noted that the conditional influence of case i can be measured by ||ŷ(V) − ŷ_{-i}(V)||, appropriately scaled; here ||·|| denotes the Euclidean norm. Repeating the variable selection using the data without case i, as pointed out by [12], yields a subset V_{-i} and the unconditional measure ||ŷ(V) − ŷ_{-i}(V_{-i})||, similarly scaled. Since the unconditional version explicitly takes the selection effect into account, [13] argued that it is preferable. More generally, a measure M calculated from the complete data set can also be calculated from the reduced data set as M_{-i}, and the influence of case i can then be quantified in terms of a function of the difference in the value of the selection criterion before and after omitting case i. This difference may be divided by M in order to obtain the relative change in the selection criterion. As proposed by [3], the influence measure for the i-th case when Cook's distance is used is based on the scaled squared distance ||ŷ(V) − ŷ_{-i}(V_{-i})||², in the spirit of the familiar Cook's distance D_i = ||ŷ − ŷ_{-i}||² / (v σ̂²).
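The leave-one-out computation underlying these measures can be sketched in code. The following is a minimal Python illustration of the familiar Cook's-distance form, comparing full-data predictions with predictions from a fit that omits case i (the paper's own implementation is in R; the function names here are illustrative):

```python
import numpy as np

def fit_predict(X, y):
    """Least squares fit; return fitted values and the coefficient vector."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta, beta

def cooks_distance(X, y):
    """Cook's distance D_i = ||yhat - yhat_{-i}||^2 / (p * s^2), where
    yhat_{-i} are predictions for ALL n cases from the fit without case i."""
    n, p = X.shape
    yhat, _ = fit_predict(X, y)
    s2 = np.sum((y - yhat) ** 2) / (n - p)   # usual residual variance estimate
    d = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i             # delete case i
        _, beta_i = fit_predict(X[mask], y[mask])
        yhat_i = X @ beta_i                  # predict for the full data set
        d[i] = np.sum((yhat - yhat_i) ** 2) / (p * s2)
    return d
```

A case whose deletion shifts the fitted regression surface substantially receives a large D_i, which is exactly the notion of influence the measures above build on.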

Mallow's Cp Estimate and the Influence Measure
Let Y be an n × 1 vector of responses in a linear regression with corresponding n × p design matrix X of explanatory variables. A traditional model is Y = Xβ + ε, where β and σ_ε² are unknown parameters; usually, β is estimated by least squares.

Let RSS(V) be the residual sum of squares from the least squares fit using only the regressors corresponding to the indices in V, together with an intercept. The corresponding C_p statistic is

C_p(V) = RSS(V)/σ̂² − n + 2v,    (5)

where v is the number of indices in V and σ̂² is the error variance estimate from the full model. Variable selection based on (5) entails calculating C_p(V) for the candidate subsets and selecting the variables corresponding to the subset minimizing (5). This approach is based on the fact that, for a given V, (5) yields an estimate of the expected squared error when a (multiple) linear regression function based on the variables corresponding to V is used to predict Y*, a new (future) observation of the response random vector Y. Therefore, choosing V to minimize (5) is equivalent to selecting the variables which minimize the estimated expected prediction error. As proposed by [3], the influence measure for the i-th case when the C_p criterion is used compares C_p(V) with C_{p,-i}(V_{-i}), where C_{p,-i} is calculated as in (5) but with the i-th case omitted. In calculating C_{p,-i}(V_{-i}), the estimator of the error variance σ̂² is obtained from the full data set.
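Selection by (5) can be illustrated with an exhaustive search over candidate subsets. The Python sketch below is an assumption-laden illustration, not the paper's implementation: column 0 of X is taken to be the intercept, the function names are invented for the example, and v is counted here as the total number of fitted parameters (a constant offset of 2 relative to counting only the indices in V, which does not change the minimizing subset since the intercept is always included):

```python
import numpy as np
from itertools import combinations

def mallows_cp(X, y, subset, sigma2_full):
    """C_p(V) = RSS(V)/sigma^2 - n + 2*v for the regressors in `subset`
    (0-based indices into the non-intercept columns); intercept always kept."""
    n = len(y)
    cols = [0] + [j + 1 for j in subset]       # column 0 is the intercept
    Xv = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xv, y, rcond=None)
    rss = np.sum((y - Xv @ beta) ** 2)
    return rss / sigma2_full - n + 2 * len(cols)

def best_subset_cp(X, y):
    """Exhaustive search for the subset minimizing C_p; sigma^2 is
    estimated from the full model, as required by (5)."""
    n, p = X.shape                              # X includes the intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.sum((y - X @ beta) ** 2) / (n - p)
    k = p - 1                                   # number of candidate regressors
    return min(
        (s for r in range(1, k + 1) for s in combinations(range(k), r)),
        key=lambda s: mallows_cp(X, y, s, sigma2),
    )
```

Exhaustive search is exponential in the number of regressors, so in practice it is feasible only for modest k; the same criterion can drive stepwise procedures instead.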

The AIC Estimate and the Influence Measure
The AIC is based on the maximized log-likelihood function of the model under consideration. Suppose ε ~ N(0, σ_ε²); then, ignoring constant terms, the maximized log-likelihood for the model corresponding to a subset V is given by −(n/2) log(RSS(V)/n). This is a non-decreasing function of the number of selected regressors.
[13] therefore included a penalty term, viz. 2v, where v equals the number of parameters which have to be estimated. Multiplying the resulting expression by −2 yields

AIC(V) = n log(RSS(V)/n) + 2v.    (8)

See [14] for details. It is known that AIC(V) does not perform well when the number of parameters to be estimated is large compared to the sample size; in such a case, a modified (corrected) AIC is used. The BIC is obtained by approximating the marginal likelihood of a model M, integrated over its parameters θ for fixed observed data y. Under the assumption that the model errors or disturbances are independent and identically distributed according to a normal distribution, and the boundary condition that the derivative of the log-likelihood with respect to the true variance is zero, this becomes (up to an additive constant, which depends only on n and not on the model)

BIC(V) = n log(σ̂_e²) + v log n,

where σ̂_e² is the error variance, here defined as σ̂_e² = (1/n) Σ (y_i − ŷ_i)², which is a biased estimator of the true variance. In terms of the residual sum of squares, the BIC is thus defined as

BIC(V) = n log(RSS(V)/n) + v log n.    (14)

The BIC is an increasing function of the error variance σ̂_e² and an increasing function of v. That is, unexplained variation in the dependent variable and additional explanatory variables both increase the value of BIC; hence a lower BIC implies fewer explanatory variables, a better fit, or both.
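In code, (8) and (14) differ only in the penalty term, and BIC's penalty exceeds AIC's as soon as log n > 2, i.e. n > e² ≈ 7.4. A small Python sketch (the helper name is illustrative):

```python
import numpy as np

def aic_bic(rss, n, v):
    """AIC(V) = n*log(RSS(V)/n) + 2v        -- Eq. (8)
       BIC(V) = n*log(RSS(V)/n) + v*log(n)  -- Eq. (14)
    Additive constants that do not depend on the model are dropped, as in
    the text, so only differences between candidate models are meaningful."""
    gof = n * np.log(rss / n)   # common goodness-of-fit term
    return gof + 2 * v, gof + v * np.log(n)
```

Adding one regressor costs 2 under AIC but log n under BIC, which is why BIC tends to select sparser models for moderate to large n.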
Based on (14), the proposed influence measure for the i-th case when the BIC criterion is used compares BIC(V), computed from the full data set, with BIC_{-i}(V_{-i}), computed as in (14) but with the i-th case omitted (15). By invoking the trichotomy law of real numbers, (15) can be rewritten in terms of three conditions (16, 17): if RIM_i = 1, then observation i has no influence; if RIM_i = 0, then observation i is the only outlier in the data; and if 0 < RIM_i < 1, then observation i is influential. The values in (15)-(17) are obtained by using (14) but with the i-th case omitted. Steel and Uys (2007) claimed that an influence measure can be calculated for any selection criterion that is a combination of some sort of goodness-of-fit measure and a penalty function (such a penalty function usually includes the number of predictors of the particular selected model as one of its components) [19]. Closely evaluating (14), it is clear that v log n is a heavy penalty term compared to the penalty terms in (5) and (8), and hence it gives a good model fit of the data set.
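The overall procedure described in the text — delete each case in turn, redo the BIC-based subset selection on the reduced data, and record a relative influence measure — can be sketched as follows. This is a Python illustration only (the paper's implementation is in R); in particular, the ratio RIM_i = BIC_{-i}(V_{-i}) / BIC(V) used below is an assumed form chosen to be consistent with the trichotomy conditions above, not necessarily the paper's exact formula:

```python
import numpy as np
from itertools import combinations

def best_bic(X, y):
    """Exhaustive subset search by BIC(V) = n*log(RSS(V)/n) + v*log(n),
    Eq. (14).  Column 0 of X is the intercept and is always included."""
    n, p = X.shape
    best_val, best_sub = np.inf, ()
    for r in range(1, p):
        for s in combinations(range(1, p), r):     # non-intercept columns
            Xv = X[:, (0,) + s]
            beta, *_ = np.linalg.lstsq(Xv, y, rcond=None)
            rss = np.sum((y - Xv @ beta) ** 2)
            bic = n * np.log(rss / n) + (r + 1) * np.log(n)
            if bic < best_val:
                best_val, best_sub = bic, s
    return best_val, best_sub

def bic_rims(X, y):
    """For each case i: delete it, redo the BIC-based selection on the
    reduced data, and record RIM_i = BIC_{-i}(V_{-i}) / BIC(V).
    (Illustrative ratio form; values near 1 indicate little influence.)"""
    n = len(y)
    bic_full, _ = best_bic(X, y)
    rims = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        bic_i, _ = best_bic(X[mask], y[mask])
        rims[i] = bic_i / bic_full
    return rims
```

With this ratio form, deleting a gross outlier collapses the residual sum of squares and drives its RIM well away from 1, while deleting an unremarkable case leaves the criterion, and hence its RIM, essentially unchanged.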

Results
The results in Table 1 show that the method detects cases 33 and 41 as having high influence on the model, their respective RIMs being relatively larger than the others, just as the AIC and Mallow's Cp statistic-based RIMs detected. The method proposed here for simultaneously detecting influential data points and selecting variables detects outliers one at a time. Further study could be undertaken to detect multiple influential data points simultaneously while selecting optimal predictor variables.
The problems of masking and swamping were not covered in this study. Masking occurs when one outlier is not detected because of the presence of others; swamping occurs when a non-outlier is wrongly identified owing to the effect of some hidden outliers. Further studies could therefore detect influential outliers and simultaneously select optimal predictor variables while incorporating solutions to the problems of masking and swamping. Again, because a test of convergence, through which the computational costs of the three methods could be compared, was not initially intended, this work avoided the re-sampling task carried out by Steel and Uys (2007). Steel and Uys (2007) did not run any test of convergence after bootstrapping; rather, they calculated the estimated average prediction error to substantiate their results. Their additional re-sampling step thus repeats results already achieved with their methods, and as a result it is not necessary in this study. These existing methods could be extended by adding a test of convergence after re-sampling.

Conclusion
Two things are unique about this paper: a new approach to detecting influential outliers, and the conditions for interpreting the result. The latter is achieved by invoking the trichotomy law of real numbers. The proposed method penalizes models heavily as the sample size becomes very large and hence has a greater likelihood of choosing a better model while detecting influential data cases one at a time.