
Outlier detection is an important type of data screening. A relative influence measure (RIM) is a mechanism of outlier detection that quantifies the contribution of individual data points to a regression model. In this work we develop a BIC-based RIM that simultaneously detects influential data points and selects optimal predictor variables. It adds to the existing literature both an alternative to the AIC- and Mallows' Cp Statistic-based RIMs and explicit conditions distinguishing no influence, some influence, and a perfectly single outlying data point in an entire data set. The method is implemented in R by an algorithm that iterates over all data points, deleting them one at a time while computing BICs, selecting optimal predictors, and computing RIMs. Analyses of evaporation data comparing the proposed method with the two existing methods show that the data cases flagged as highly influential by the existing methods are also flagged by the proposed method. The three methods perform equally well; the BIC-based RIM is therefore a credible alternative.

Model selection (variable selection) in regression has received great attention in the literature in recent times. A large number of predictors is usually introduced at the initial stage of modeling to attenuate possible modeling biases [

An influential observation is a special case of an outlier. In the simplest sense, outlying or extreme values are observations that are well separated from the remainder of the data. Outliers result from either (1) errors of measurement or (2) intrinsic variability (a mean shift, inflation of the variance, or similar), and appear as (i) a change in the direction of the response variable (Y), (ii) a deviation in the space of the explanatory variables (deviating points in the X-direction, called leverage points), or (iii) a change in both directions (the direction of the explanatory variable(s) and of the response variable). These outlying observations may involve large residuals and often have dramatic effects on the fitted least-squares regression function. The influence of an individual case (data point) in a regression model can be adverse, causing a significant shift (upward or downward) in the values of the model parameters and in turn reducing the predictive power of the model. Only a few papers dealing with the influence of individual data cases in regression explicitly take an initial variable-selection step into account. This problem is handled by [

One objective of regression variable selection is to reduce the predictors to some optimal subset of the available regressors [

[ ] considered the error variance (σ²) of the linear regression model. [

Hence, the specific objective of this paper is to propose a relative influence measure that indicates whether the fit of the selected model improves or deteriorates owing to the presence of an observation (case), and that retains asymptotic consistency, thereby preserving the sampling properties of the model parameters.

Let V be the set of indices corresponding to the predictor variables selected from the full data set and let

approximately scaled. Here,

approximately scaled. Since the unconditional version explicitly takes the selection effect into account [

Let Y be an n × 1 vector of responses and X an n × p matrix of potential predictor variables, and consider the linear model

Y = Xβ + ε, (4)

where β is a p × 1 vector of unknown coefficients and ε is an n × 1 vector of errors assumed independent and identically distributed as N(0, σ²). If RSS_V denotes the residual sum of squares of the OLS fit of (4), then Mallows' Cp for a candidate subset V of {1, 2, …, p} is

C_p(V) = RSS_V / σ̂² − n + 2v, (5)

where v is the number of indices in V and σ̂² is the error-variance estimate from the full model. Variable selection based on (5) entails calculating the criterion for each subset V of {1, 2, …, p} and selecting the subset that minimizes C_p(V).

The AIC is based on the maximized log-likelihood function of the model under consideration. Suppose L̂_V denotes the maximized likelihood of the model containing the predictors indexed by V; then the AIC for the subset V is given by

AIC(V) = −2 ln L̂_V + 2v, (8)

which, under Gaussian errors, equals n ln(RSS_V / n) + 2v up to an additive constant.

[

See [

Variable selection based on (5) and (8) entails calculating the criterion for each subset V of {1, 2, …, p}. The value of the criterion is computed for every candidate subset, and the subset attaining the minimum is selected as optimal.
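In residual-sum-of-squares form the AIC is what an implementation actually computes under Gaussian errors; a small sketch (the function name and numbers are my own):

```python
import math

def aic(rss_v, n, v):
    """Gaussian-errors AIC, up to an additive constant:
    AIC(V) = n * ln(RSS_V / n) + 2 * v."""
    return n * math.log(rss_v / n) + 2 * v

# Between two subsets achieving the same RSS, the smaller one wins;
# a larger subset must reduce RSS enough to offset the penalty of 2
# per extra parameter.
print(aic(27.0, 20, 3) < aic(27.0, 20, 4))  # True: same fit, fewer parameters
```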

A popular alternative to the AIC was proposed by [

The BIC is formally defined as

BIC = −2 ln L̂ + k ln n,

where L̂ is the maximized value of the likelihood function of the model, k is the number of free parameters to be estimated, and n is the sample size. The BIC arises as an asymptotic approximation to −2 ln p(Y | M), where p(Y | M) is the integral of the likelihood function over the model parameters θ weighted by their prior:

p(Y | M) = ∫ p(Y | θ, M) π(θ | M) dθ.

Under the assumption that the model errors or disturbances are independent and identically distributed according to a normal distribution, and the boundary condition that the derivative of the log-likelihood with respect to the true variance is zero, this becomes (up to an additive constant, which depends only on n and not on the model)

BIC = n ln(σ̂_e²) + k ln n,

where σ̂_e² = (1/n) Σᵢ (yᵢ − ŷᵢ)² is the error variance, which is a biased estimator for the true variance. In terms of the residual sum of squares, the BIC is defined thus:

BIC = n ln(RSS/n) + k ln n. (14)

The BIC is an increasing function of the error variance σ̂_e² and an increasing function of k; a lower BIC therefore implies fewer explanatory variables, better fit, or both.
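The RSS form in (14) is direct to compute. The sketch below (function name is my own) also shows why the BIC punishes large models more heavily as n grows: for n > e² ≈ 7.4 the per-parameter penalty ln n exceeds the AIC's flat penalty of 2.

```python
import math

def bic(rss_v, n, k):
    """BIC in residual-sum-of-squares form, as in (14):
    BIC = n * ln(RSS_V / n) + k * ln(n)."""
    return n * math.log(rss_v / n) + k * math.log(n)

# For n = 100, each extra parameter costs ln(100) ~ 4.6 under BIC
# versus a flat 2 under AIC, so BIC is the stricter criterion here.
print(math.log(100) > 2)  # True
```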

Based on (14), the proposed influence measure for the i-th data case compares the BIC of the optimal model selected from the full data set with the BIC of the optimal model selected after deleting case i; (2.15) can take the form of a relative difference between these two values. Suppose BIC denotes the criterion value for the full data set and BIC₍ᵢ₎ the value with case i omitted; the magnitude of their relative difference then quantifies the influence of case i on the fitted model. The values of the proposed measure for the evaporation data, together with the variables selected at each deletion, are reported in the table below.
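The authors' implementation is in R and the exact form of the proposed measure is given in (2.15), which does not survive in this text. Purely to illustrate the leave-one-out scheme described above, the following Python sketch does exhaustive best-subset selection by BIC with each case deleted in turn, using a relative BIC difference as a stand-in influence measure (all function names, the k convention, and the relative-difference form itself are my own assumptions, not the paper's definitions):

```python
import math
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def rss(X, y):
    """Residual sum of squares of the OLS fit with an intercept,
    via the normal equations."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    ZtZ = [[sum(r[i] * r[j] for r in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(r[i] * yi for r, yi in zip(Z, y)) for i in range(p)]
    beta = solve(ZtZ, Zty)
    return sum((yi - sum(b * zi for b, zi in zip(beta, r))) ** 2
               for r, yi in zip(Z, y))

def bic(X, y):
    """BIC in RSS form; here k counts slopes + intercept + error variance
    (a convention choice, not taken from the paper)."""
    n, k = len(y), len(X[0]) + 2
    return n * math.log(rss(X, y) / n) + k * math.log(n)

def best_subset(X, y):
    """Exhaustive best-subset selection by BIC; returns (BIC, indices)."""
    p = len(X[0])
    best = None
    for size in range(1, p + 1):
        for V in combinations(range(p), size):
            XV = [[row[j] for j in V] for row in X]
            score = bic(XV, y)
            if best is None or score < best[0]:
                best = (score, V)
    return best

def influence(X, y):
    """Leave-one-out loop: re-select the model with each case deleted and
    report a relative BIC difference (an assumed stand-in for (2.15))."""
    full_bic, _ = best_subset(X, y)
    out = []
    for i in range(len(y)):
        loo_bic, V = best_subset(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        out.append((i, V, abs(full_bic - loo_bic) / abs(full_bic)))
    return out
```

Cases whose deletion changes the selected subset or produces a comparatively large relative difference are the candidates for high influence, mirroring the pattern reported for cases 33 and 41 in the evaporation data.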

The results above show that the proposed BIC-based RIM flags the same influential data cases that the AIC- and Mallows' Cp Statistic-based RIMs detected. The method proposed here for simultaneously detecting influential data points and selecting variables detects outliers one at a time. However, further study can be undertaken to detect multiple influential data points at a time while selecting optimal predictor variables.

The problems of masking and swamping were not covered in this study. Masking occurs when one outlier is not detected because of the presence of others; swamping occurs when a non-outlier is wrongly identified owing to the effect of some hidden outliers. Therefore, further studies can be carried out to detect influential outliers and simultaneously select optimal predictor variables while incorporating the solutions to problems of masking and swamping.

Case Omitted | Variables Selected | Influence Measure (BIC) | Remark
---|---|---|---
1 | 1, 3, 6, 9 | 0.01797914 |
2 | 1, 3, 6, 9 | 0.03826104 |
14 | 1, 3, 6, 9 | 0.0179898 |
15 | 1, 3, 6, 9 | 0.01751532 |
31 | 1, 3, 4, 8, 9 | 0.03268965 |
32 | 1, 3, 6, 9 | 0.02016968 |
33 | 6, 9, 10 | 0.05645203 | ***High influence measure, comparable to the Steel & Uys (2007) results using Cp and AIC
34 | 1, 3, 6, 9 | 0.01791702 |
40 | 6, 9, 10 | 0.03512789 |
41 | 1, 3, 6, 9 | 0.06042516 | ***High influence measure, comparable to the Steel & Uys (2007) results using Cp and AIC
42 | 1, 3, 6, 9 | 0.02053766 |
45 | 1, 3, 6, 9 | 0.01754306 |
46 | 1, 3, 6, 9 | 0.01905877 |

Again, because a test of convergence, through which the computational cost of the three methods could be compared, was not initially intended, this work avoided the re-sampling task carried out by Steel and Uys (2007). Steel and Uys (2007) did not run a test of convergence after bootstrapping; rather, they calculated the estimated average prediction error to substantiate their results. Their additional re-sampling step therefore merely repeats the results already achieved with their methods, and it is not necessary in this study. One could further implement these existing methods by adding a test of convergence after re-sampling.

Two things are unique about this paper, namely a new approach to detecting influential outliers and the conditions for interpreting the result. The latter is achieved by invoking the trichotomy law of real numbers. The proposed method penalizes models heavily as the sample size becomes very large and hence has a greater likelihood of choosing a better model while detecting influential data cases one at a time.

Umeh, Edith Uzoma and Obulezi, Okechukwu Jeremiah (2016) An Alternative Approach to AIC and Mallow's Cp Statistic-Based Relative Influence Measures (RIMs) in Regression Variable Selection. Open Journal of Statistics, 6, 70-75. doi: 10.4236/ojs.2016.61009