On Identifying Influential Observations in the Presence of Multicollinearity
1. Introduction
It is well understood that not all observations in a data set play an equal role when fitting a regression model. We occasionally find that a single observation, or a small subset of the data, exerts a disproportionate influence on the fitted regression model; that is, parameter estimates or predictions may depend more on the influential subset than on the majority of the data. Belsley et al. [1] defined an influential observation as one which, either individually or together with several other observations, has a demonstrably larger impact on the calculated values of various estimates than is the case for most of the other observations. An influential observation in either the dependent or an independent variable can be the result of data error or other problems; for example, influential data points in the dependent variable can arise from skewness in an independent variable or from differences in the data-generating process for a small subset of the sample. Obviously, outliers, which are observations in a data set that appear to be inconsistent with the remainder of the data [2], need not be influential observations affecting the regression equation [3]. Andrews and Pregibon [4] highlighted the need to find the outliers that matter: not all outliers are harmful in the sense of having undue influence on, for instance, the estimation of the parameters of the regression model. If not all outliers matter, examining residuals alone might not lead to the detection of influential observations. Thus, other ways of detecting influential observations are needed.
Regression diagnostics comprise a collection of methods used to identify influential points and multicollinearity [1]. These include methods of exploratory data analysis for influential points and for identifying violations of the least-squares assumptions. When the assumption of the Ordinary Least Squares (OLS) method that the explanatory variables are not linearly correlated is violated, the result is a multicollinearity problem, which should be controlled before attempting to measure influence [1]. One of the most popular methods of controlling multicollinearity is Ridge Regression (RR), suggested by Hoerl and Kennard [5]. The idea in the RR method is to add a small positive number (k > 0) to the diagonal elements of the matrix $X'X$ in order to obtain the ridge regression estimator

$\hat{\beta}_R = (X'X + kI)^{-1}X'y$ (1)

Though the ridge estimator is biased, it yields a smaller Mean Squared Error (MSE) than the OLS estimator. If k = 0, $\hat{\beta}_R$ becomes the unbiased OLS estimator $\hat{\beta} = (X'X)^{-1}X'y$. The choice of the ridge parameter k has always been a problem when using RR to address multicollinearity, hence methods of estimating the value of k have been suggested by several authors, among them Hoerl and Kennard [5], Hoerl et al. [6], Lawless and Wang [7], Nomura [8], Khalaf and Shukur [9], Dorugade [10], Al-Hassan [11], Dorugade and Kashid [12], Saleh and Kibria [13], Kibria [14], Zang and Ibrahim [15], Alkhamisi et al. [16], Al-Hassan [17], Muniz and Kibria [18], Khalaf and Mohamed [19], and Uzuke et al. [20].
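Equation (1) can be sketched numerically. The following is an illustrative Python/NumPy sketch (the paper's own code is in R; the synthetic data and the value k = 0.1 here are hypothetical), showing that the ridge estimator shrinks the coefficient vector relative to OLS:

```python
import numpy as np

def ridge_estimator(X, y, k):
    """Ridge estimator (X'X + kI)^{-1} X'y; k = 0 gives the OLS estimator."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Two nearly collinear predictors (synthetic data)
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)

beta_ols = ridge_estimator(X, y, k=0.0)    # unbiased but unstable under collinearity
beta_ridge = ridge_estimator(X, y, k=0.1)  # biased, smaller variance
```

For any k > 0 the ridge solution has a smaller Euclidean norm than the OLS solution, which is the variance-for-bias trade the paper describes.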
Several diagnostic methods have been developed to detect influential observations. Cook [21] introduced Cook's distance ($D_i$), which is based on deleting the observations one after another and measuring their effect on the fitted linear regression model. Other measures developed from the idea of Cook's distance include the modified Cook's distance, DFFITS, Hadi's measure, the Pena statistic, DFBETAS, COVRATIO, etc.
Thus, multicollinearity and influential observations both affect regression estimates remarkably. Moreover, when Ridge Regression is used to mitigate multicollinearity, there is always the problem of which method to use to estimate the ridge parameter (k) so that the reduction in variance is larger than the increase in bias. Furthermore, one may want to know whether multicollinearity affects the identification of influential observations.
2. Methodology
The influence of an observation is measured by the effect it produces on the fit when it is deleted from the fitting process. This deletion is done one point at a time. Let $\hat{\beta}_{(i)}$ denote the vector of regression coefficients obtained when the ith observation is deleted. Similarly, let $\hat{y}_{(i)}$ and $\hat{\sigma}^2_{(i)}$ be the predicted values and the residual mean square, respectively, when the ith observation is dropped. Note that

$\hat{y}_{m(i)} = x_m' \hat{\beta}_{(i)}$ (2)

is the fitted value for observation m when the fitted equation is obtained with the ith observation deleted. Influence measures look at the differences produced in quantities such as $\hat{\beta} - \hat{\beta}_{(i)}$ or $\hat{y}_i - \hat{y}_{i(i)}$. Several diagnostic methods have been developed from this idea, beginning with Cook's distance [21], followed by the modified Cook's distance, DFFITS, Hadi's influence measure, the Pena statistic, DFBETAS, COVRATIO, etc. This work adopted the following influence measures:
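The leave-one-out quantities $\hat{\beta}_{(i)}$, $\hat{y}_{m(i)}$ and $\hat{\sigma}^2_{(i)}$ can be obtained by simply refitting without the ith row. A minimal Python/NumPy sketch on synthetic data (all names and data here are illustrative, not from the paper):

```python
import numpy as np

def loo_fit(X, y, i):
    """Coefficients, fitted values and residual mean square with row i deleted."""
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    beta_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
    resid = yi - Xi @ beta_i
    n, p = Xi.shape
    sigma2_i = resid @ resid / (n - p)   # residual mean square without row i
    yhat_m_i = X @ beta_i                # Equation (2): fitted values for all m
    return beta_i, yhat_m_i, sigma2_i

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=30)
beta_full = np.linalg.solve(X.T @ X, X.T @ y)
# influence of observation 0 shows up in the difference beta_full - beta_0
beta_0, yhat_0, s2_0 = loo_fit(X, y, 0)
```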
1) Cook’s Distance
Cook [21] proposed this widely used measure. Cook's distance measures the difference between the fitted values obtained from the full data and the fitted values obtained by deleting the ith observation. Cook's distance is defined as

$D_i = \dfrac{(\hat{y} - \hat{y}_{(i)})'(\hat{y} - \hat{y}_{(i)})}{(p+1)\,\hat{\sigma}^2}$ (3)

which can also be expressed as

$D_i = \dfrac{r_i^2}{p+1} \cdot \dfrac{h_{ii}}{1 - h_{ii}}$ (4)

Thus, Cook's distance is the product of two quantities. The first term in Equation (4) involves the square of the standardized residual $r_i$, given as $r_i = e_i / \sqrt{\hat{\sigma}^2 (1 - h_{ii})}$, and the second term, $h_{ii}/(1 - h_{ii})$, is called the potential function, where $h_{ii}$, the leverage of the ith observation, is the ith diagonal element of the hat matrix $H = X(X'X)^{-1}X'$.
If a point is influential, its deletion causes large changes and the value of $D_i$ will be large. Therefore, a large value of $D_i$ indicates that the point is influential. It has also been suggested that points with a $D_i$ value greater than the 50% point of the F distribution with p + 1 and (n − p − 1) degrees of freedom be classified as influential points.
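Equation (4) can be computed for all observations at once from the hat matrix and the residuals. An illustrative Python/NumPy sketch on synthetic data with one deliberately corrupted response (the data and the shift of +10 are hypothetical):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D via Equation (4): (r_i^2 / (p+1)) * h_ii / (1 - h_ii)."""
    n, p1 = X.shape                               # p1 = number of parameters
    H = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix
    h = np.diag(H)
    e = y - H @ y                                 # residuals
    sigma2 = e @ e / (n - p1)
    r2 = e**2 / (sigma2 * (1 - h))                # squared standardized residuals
    return (r2 / p1) * h / (1 - h)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(25), rng.normal(size=25)])
y = 1 + 2 * X[:, 1] + rng.normal(size=25)
y[0] += 10                                        # plant one aberrant response
D = cooks_distance(X, y)
# the corrupted point should stand out with a comparatively large Cook's distance
```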
2) Welsch and Kuh Measure
Welsch and Kuh [22] developed a measure similar to Cook's distance, named DFFITS and defined as

$\mathrm{DFFITS}_i = \dfrac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{\hat{\sigma}^2_{(i)}\, h_{ii}}}$ (5)

which is the scaled difference between the ith fitted value obtained from the full data and the ith fitted value obtained by deleting the ith observation. $\mathrm{DFFITS}_i$ can as well be written as

$\mathrm{DFFITS}_i = r_i^{*} \sqrt{\dfrac{h_{ii}}{1 - h_{ii}}}$ (6)

where $r_i^{*}$ is the externally studentized residual defined as $r_i^{*} = e_i / \sqrt{\hat{\sigma}^2_{(i)} (1 - h_{ii})}$. Points with $|\mathrm{DFFITS}_i| > 2\sqrt{(p+1)/n}$ are usually classified as influential points.
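Equation (6) needs the deleted residual mean square $\hat{\sigma}^2_{(i)}$, which can be obtained without n refits via the standard identity $(n-p-1)\hat{\sigma}^2_{(i)} = \mathrm{SSE} - e_i^2/(1-h_{ii})$. An illustrative Python/NumPy sketch (synthetic data; names are hypothetical):

```python
import numpy as np

def dffits(X, y):
    """DFFITS via Equation (6): r*_i * sqrt(h_ii / (1 - h_ii))."""
    n, p1 = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    e = y - H @ y
    sse = e @ e
    # deleted residual mean square, computed without refitting n times
    s2_del = (sse - e**2 / (1 - h)) / (n - p1 - 1)
    r_star = e / np.sqrt(s2_del * (1 - h))        # externally studentized residual
    return r_star * np.sqrt(h / (1 - h))

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = 2 + X[:, 1] + rng.normal(size=30)
d = dffits(X, y)
cutoff = 2 * np.sqrt(X.shape[1] / X.shape[0])     # the 2*sqrt((p+1)/n) rule
```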
3) Hadi’s Influence Measure
Hadi [23] proposed a measure of the influence of the ith observation based on the fact that influential observations are outliers in the response variable, in the predictors, or in both. Accordingly, the influence of the ith observation can be measured by

$H_i = \dfrac{h_{ii}}{1 - h_{ii}} + \dfrac{p+1}{1 - h_{ii}} \cdot \dfrac{d_i^2}{1 - d_i^2}$ (7)

where $d_i = e_i / \sqrt{\mathrm{SSE}}$ is the normalized residual. $H_i$ is an additive function: the first term is the potential function, which measures outlyingness in the X-space, and the second term is a function of the residual, which measures outlyingness in the response variable. Observations with large $H_i$ are influential in the response and/or the predictor variables. Although the measure $H_i$ does not focus on a specific regression result, it can be thought of as an overall general measure of influence which depicts observations that are influential on at least one regression result.
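The additive structure of Equation (7) is straightforward to compute once the hat diagonals and the normalized residuals are in hand. An illustrative Python/NumPy sketch on synthetic data (names are hypothetical):

```python
import numpy as np

def hadi_measure(X, y):
    """Hadi's H_i (Equation (7)): leverage term plus residual term."""
    n, p1 = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    e = y - H @ y
    d2 = e**2 / (e @ e)                 # squared normalized residual d_i^2
    return h / (1 - h) + (p1 / (1 - h)) * d2 / (1 - d2)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=30)
Hi = hadi_measure(X, y)                 # both terms are positive, so Hi > 0
```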
4) DFBETAS [1]
DFBETAS measures the difference in each parameter estimate with and without the influential data point. It is an influence measure used to ascertain which observations influence a specific regression coefficient:

$\mathrm{DFBETAS}_{j(i)} = \dfrac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{\hat{\sigma}^2_{(i)} \left[(X'X)^{-1}\right]_{jj}}}$ (8)

where $\hat{\beta}_{j(i)}$ is the jth regression coefficient obtained when the ith observation is deleted from the fitting process and $\hat{\beta}_j$ is the jth coefficient obtained from the full data, with the ith observation used in the fitting process.
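Equation (8) can be computed by brute-force deletion, one row at a time, yielding one DFBETAS value per observation and per coefficient. An illustrative Python/NumPy sketch (synthetic data; names are hypothetical):

```python
import numpy as np

def dfbetas(X, y):
    """DFBETAS_{j(i)} by brute-force deletion, scaled as in Equation (8)."""
    n, p1 = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    out = np.empty((n, p1))
    for i in range(n):
        Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
        beta_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
        ri = yi - Xi @ beta_i
        s2_i = ri @ ri / (n - 1 - p1)            # deleted residual mean square
        out[i] = (beta - beta_i) / np.sqrt(s2_i * np.diag(XtX_inv))
    return out

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = 1 + 2 * X[:, 1] + rng.normal(size=30)
DFB = dfbetas(X, y)   # one row per observation, one column per coefficient
```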
5) Kuh and Welsch Ratio (COVRATIO)
The COVRATIO statistic measures the change in the determinant of the covariance matrix of the estimates when the ith observation is deleted. This influence measure is given as

$\mathrm{COVRATIO}_i = \dfrac{\det\!\left[\hat{\sigma}^2_{(i)} \left(X'_{(i)}X_{(i)}\right)^{-1}\right]}{\det\!\left[\hat{\sigma}^2 (X'X)^{-1}\right]}$ (9)

which can also be expressed as

$\mathrm{COVRATIO}_i = \dfrac{\left(\hat{\sigma}^2_{(i)}/\hat{\sigma}^2\right)^{p'}}{1 - h_{ii}}$ (10)

where n is the sample size, p' = p + 1 is the number of estimated parameters and $h_{ii}$ is the ith diagonal element of the hat matrix.
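Equation (10) avoids the determinants in Equation (9) entirely, using only the hat diagonals and the deleted residual mean squares. An illustrative Python/NumPy sketch (synthetic data; names are hypothetical):

```python
import numpy as np

def covratio(X, y):
    """COVRATIO_i via Equation (10): (s2_(i)/s2)^{p'} / (1 - h_ii)."""
    n, p1 = X.shape                               # p1 plays the role of p'
    H = X @ np.linalg.solve(X.T @ X, X.T)
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p1)
    s2_del = (e @ e - e**2 / (1 - h)) / (n - p1 - 1)
    return (s2_del / s2)**p1 / (1 - h)

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = 1 + X[:, 1] + rng.normal(size=30)
cr = covratio(X, y)     # values far from 1 flag influential observations
```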
The ridge parameter estimators which were selected to control multicollinearity (their formulas are listed in Appendix I) are:
a) Hoerl and Kennard [5]
b) Kibria [14]
c) Alkhamisi et al. [16]
d) Muniz and Kibria [18]
e) Muniz and Kibria [18]
f) Muniz and Kibria [18]
g) Muniz and Kibria [18]
h) Dorugade [10]
i) Uzuke et al. [20]
j) OLS (k = 0)
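Of the estimators above, the Hoerl-Kennard choice $\hat{k} = \hat{\sigma}^2/\hat{\alpha}^2_{\max}$ is the simplest to illustrate. A Python/NumPy sketch (the paper's computations are in R; the synthetic data and names here are hypothetical), using the canonical coefficients $\hat{\alpha} = D'\hat{\beta}$ described in Appendix I:

```python
import numpy as np

def k_hoerl_kennard(X, y):
    """Hoerl-Kennard ridge parameter: sigma2_hat / max(alpha_hat_i^2)."""
    n, p = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    sigma2 = e @ e / (n - p)                  # residual mean square
    _, D = np.linalg.eigh(X.T @ X)            # eigenvectors of X'X
    alpha = D.T @ beta                        # canonical coefficients
    return sigma2 / np.max(alpha**2)

rng = np.random.default_rng(7)
x1 = rng.normal(size=40)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=40)])  # collinear pair
y = X @ np.array([1.0, 1.0]) + rng.normal(size=40)
k_hk = k_hoerl_kennard(X, y)                  # a small positive ridge parameter
```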
3. Illustration
This illustration uses the Nigeria Economic Indicators (1980-2010) data from the Central Bank of Nigeria (CBN) Statistical Bulletin 2010. The data consist of Gross Domestic Product as the dependent variable (y) and ten (10) independent variables, namely Money Supply (x1), Credit to Private Sector (x2), Exchange Rate (x3), External Reserve (x4), Agricultural Loan (x5), Foreign Reserve (x6), Oil Import (x7), Non-oil Export (x8), Oil Export (x9), and Non-oil Export (x10), shown in Appendix III.
Table 1 shows the presence of multicollinearity in the data, since most of the independent variables have VIF > 10, eigenvalues close to zero (0), tolerance T < 0.1 and condition number CN > 5. The correlation matrix of the data set also shows the presence of multicollinearity.
Identification of Influential Observations
Using five different influence measures (Cook's distance, DFFITS, Hadi's influence measure, DFBETAS and COVRATIO), influential observations in the real data are identified using the criteria of Table 2, both when multicollinearity is not controlled (OLS: k = 0) and when it is controlled using the selected ridge parameter estimators. The values for the measure criteria are presented in Table 2.
The influential observations identified by the five influence measures in the presence of multicollinearity, and when it is controlled using the selected ridge parameters (k), are presented in Table 3. When compared with the values of Table 2,
Table 1. Result of test for multicollinearity.
Table 2. Influential measures, calculated measure criteria and values obtained.
Table 3. Influential observations identified.
any observation whose calculated influence measure is greater than the criterion value obtained is identified as an influential observation or data point. Cook's distance and Hadi's influence measure performed alike: they fail to identify influential data points when ridge estimators were used to control multicollinearity. The DFFITS and COVRATIO measures identified the single observation 25 both under OLS and when multicollinearity was controlled, while DFBETAS identified data point 29 as well.
4. Summary and Conclusion
The ridge estimator affects which influential observations are identified. Cook's distance and Hadi's influence measure were able to identify several influential data points in the presence of multicollinearity but failed to identify any data point once the multicollinear effect had been controlled. DFFITS, DFBETAS and COVRATIO identified the same single data point both in the presence of multicollinearity and when it had been controlled. Cook's distance and Hadi's influence measure are very sensitive in the presence of multicollinearity, which made them identify several influential data points, but they are less sensitive when multicollinearity is controlled, where they fail to identify any data point. DFFITS, DFBETAS and COVRATIO perform better and should be used when multicollinearity is controlled.
Appendix I
Algorithm for the R Programme
The model: $y = X\beta + \varepsilon$
Using the unit length scaling shown below:
$z_{ij} = \dfrac{x_{ij} - \bar{x}_j}{\sqrt{S_{jj}}}$ and $y_i^{0} = \dfrac{y_i - \bar{y}}{\sqrt{S_{yy}}}$,
where $\bar{y}$ is the mean of Y, $\bar{x}_j$ is the mean of $x_j$, $S_{jj} = \sum_i (x_{ij} - \bar{x}_j)^2$, and $S_{yy} = \sum_i (y_i - \bar{y})^2$,
such that each scaled column has unit length,
we obtain the model in correlation form.
Obtain $A = Z'Z$
Eigenvalues of A = $t_j$
Eigenvectors of A = D
Confirm that $D'AD = \operatorname{diag}(t_1, \ldots, t_p)$
Confirm that $DD' = I$
Obtain $\hat{\alpha} = D'\hat{\beta}$
Obtain $\hat{\sigma}^2$, the residual mean square
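The scaling and eigen-decomposition steps above can be sketched as follows (an illustrative Python/NumPy sketch on synthetic data; the paper's own implementation is in R). The two "confirm" steps become numerical checks:

```python
import numpy as np

def unit_length_scale(M):
    """Centre each column and divide by its length, so Z'Z has unit diagonal."""
    Mc = M - M.mean(axis=0)
    return Mc / np.sqrt((Mc**2).sum(axis=0))

rng = np.random.default_rng(8)
X = rng.normal(size=(30, 3))
Z = unit_length_scale(X)
A = Z.T @ Z                        # correlation matrix of the predictors
t, D = np.linalg.eigh(A)           # eigenvalues t_j and eigenvectors D
# the "confirm" steps: D'AD is diagonal with the eigenvalues, and DD' = I
```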
Methods of estimating ridge parameter k
1) $\hat{k}_{HK} = \dfrac{\hat{\sigma}^2}{\hat{\alpha}^2_{\max}}$, Hoerl and Kennard (1970), where $\hat{\sigma}^2$ is the residual mean square estimate of $\sigma^2$ and $\hat{\alpha}_i$ is the ith element of $\hat{\alpha} = D'\hat{\beta}$, which is an unbiased estimator of $\alpha$, where D is the matrix of eigenvectors of the matrix $X'X$.
2)
Kibria (2003)
3)
Alkhamisi et al. (2006)
where
is the ith eigenvalue of the matrix
and
4)
Muniz and Kibria [18]
5)
6)
7)
where
8)
Dorugade [10]
9)
Uzuke et al. [20]
where the weight
10) OLS (k = 0)
Methods of detecting influential observation
Method 1 (Cook's distance)
,
The criterion is given as
where
, and
Method 2 (DFFITs)
The criterion is given as
where
is the R-student (externally studentized) residual defined as
and
Method 3 (Hadi measure)
where
is called the normalized residual.
Method 4 (DFBETAS)
The criterion is given as
Method 5 (COVRATIO)
The criterion is given as
where
, and
Appendix II
R Codes for Detecting Influential Observation for Different k Values
# Assumes the following are defined earlier in the script: rr (data frame with
# response V1), xx (design matrix), k (vector of ridge parameters), and the
# full-data quantities c (coefficient vector), r (residuals), sig (residual
# mean square), ssr (residual sum of squares), n = 30 and p = 11.
library(lmridge)
for(m in 1:9){
# hat matrix of the ridge fit for the m-th ridge parameter
h=matrix(hatr(lmridge(V1~.,rr,K=k[m])),30,30)
C=NULL
DF9=NULL
H=NULL
DFB=NULL
COV=NULL
for(i in 1:30){
b1=coefficients(lm(V1~.,rr[-i,])) # fit with observation i deleted
r1=c(residuals(lm(V1~.,rr[-i,])))
sig1=(sum(r1^2))/(n-p) # deleted residual mean square
num=c[3]-b1[3] # change in the 3rd coefficient (for DFBETAS)
hh=solve(t(xx[-i,])%*%(xx[-i,]))
denom=sqrt(sig1*hh[3,3])
C=rbind(C,(((r[i]^2/((sig)*(1-h[i,i]))))/(11))*(h[i,i]/(1-h[i,i]))) # Cook's distance, Equation (4)
DF9=rbind(DF9,r[i]/(sqrt(sig1*(1-h[i,i])))*sqrt(h[i,i]/(1-h[i,i]))) # DFFITS, Equation (6)
d2=r[i]^2/ssr # squared normalized residual
H=rbind(H,(h[i,i]/(1-h[i,i]))+(11/(1-h[i,i]))*(d2/(1-d2))) # Hadi's measure, Equation (7)
DFB=rbind(DFB,num/denom) # DFBETAS, Equation (8)
COV=rbind(COV,((sig1/sig)^11)/(1-h[i,i])) # COVRATIO, Equation (10)
}
}
Appendix III
Table A1. Nigerian economic indicator (1980-2010) data.