Choosing an Appropriate Regression Model in the Presence of Multicollinearity

This work is concerned with detecting and resolving the problem of multicollinearity in regression analysis. The Variance Inflation Factor (VIF) and the Condition Index (CI) were used as detection measures. Ridge Regression (RR) and Principal Component Regression (PCR) were the two approaches used in modelling, in addition to conventional linear regression. Simulated data were used to compare the two methods, and the effectiveness of each method was assessed by its mean square error. From the results, we found that Ridge Regression (RR) performs better than Principal Component Regression (PCR) when multicollinearity exists among the predictors.


Introduction
Regression analysis is a statistical tool used to study whether a relationship of any form, linear or nonlinear, exists between two variables, subject to certain constraints, such that one of the two variables can serve to predict the other. Our focus in this study is on the linear form of such relationships. Thus, when we speak of regression, we consider only linear regression, which may be simple, multiple, or multivariate in nature, depending on the number of variables on either side of the equation. When a single dependent variable is related to a single independent variable, the regression is said to be simple, giving simple (linear) regression. If one dependent variable is related to more than one independent variable, the regression is multiple in form, giving multiple (linear) regression. Multivariate regression (which is outside the scope of this research) arises only when more than one dependent variable is related to two or more independent variables.

Fundamental Principles
Consider the multiple regression model:

Y_i = β_0 + β_1 X_{1i} + β_2 X_{2i} + ... + β_p X_{pi} + ε_i. (1)

In matrix form, (1) becomes:

Y = Xβ + ε, (2)

where in (1), E(ε_i) = 0, Cov(ε_i, ε_j) = 0 for i ≠ j, and Var(ε_i) = σ², implying randomness, independence and homoscedasticity of the error terms, respectively. It is further assumed that there is no specification bias, that is, the model is correctly specified, and that there is no exact linear relationship between any two predictor variables. Violating any of these assumptions brings about serious problems in regression analysis. Hence multicollinearity, which constitutes a major problem, sets in as a result of violation of the said assumptions [1]; this forms the pivot of our discussion in this section.
Multicollinearity is an important concept in regression analysis, given the serious threat it poses to the validity and predictive strength of the regression model. It is usually regarded as a problem arising from violation of the assumption that the explanatory variables are linearly independent. It is a phenomenon that arises in regression, especially multiple regression, when there is a high level of inter-correlation or inter-association among the independent variables.
It is therefore a type of disturbance in the regression model which, if left unaddressed, renders the statistical inferences made about the model misleading, simply because the estimates of the regression coefficients become faulty or unreliable. Multicollinearity is a condition in multiple regression models whereby two or more covariates become redundant. The redundancy implies that what one independent variable (X) explains about the dependent variable (Y) is essentially what another independent variable explains. In this case, the estimates of the regression coefficients for such redundant predictor variables would be completely erroneous.

Possible Causes of Multicollinearity
1) Multicollinearity generally occurs when two or more explanatory variables are directly and highly correlated with each other.
2) It may also set in when one or more of the predictors are multiples of, or computed from, other predictor variables in the same equation.
3) It may also be experienced when nearly the same predictor variable is included more than once in the same model.
4) It may as well occur, in the case of nominal variables, when the dummy variables are not properly used.
However, the following have been identified as the primary sources of multicollinearity: 1) the regression model is over-defined, that is, it includes more predictor variables than necessary; 2) the data collection method is faulty, or an inappropriate sampling scheme was used for data collection or generation; 3) a spurious or unnecessary constraint is placed on the model or on the population; 4) the regression model is wrongly specified.

How Is Multicollinearity Detected?
There are a number of ways in which multicollinearity may be detected in a multiple regression model, including: 1) when the correlation coefficients in the correlation matrix of the predictor variables become so high as to approach one; that is, the correlation coefficient between two highly correlated predictor variables is close to one;
2) when the coefficient of determination (R²) obtained by regressing a particular predictor variable on the other independent variables is so close to unity that the variance inflation factor (VIF) becomes very large [2];
3) when one or more eigenvalues of the correlation matrix become so small as to be close to zero; 4) another rule of thumb for detecting the presence of multicollinearity is that one or more eigenvalues of the predictor variables become very small, close to zero, while the corresponding condition number (ϕ) becomes very large [3] [4] [5];
5) comparing the decisions made using the overall F-test and the individual t-tests may also provide some indication of the presence of multicollinearity. For instance, when the overall significance of the model is good by the F-test, but the individual coefficients are not significant by their t-tests, the model might suffer from multicollinearity.
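The VIF and condition-index diagnostics above can be sketched numerically. The snippet below is a minimal illustration on simulated data (not the paper's dataset): the VIFs are obtained as the diagonal of the inverse correlation matrix, and the condition indices from the eigenvalues of that matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate three predictors where x3 is nearly a copy of x1,
# deliberately inducing multicollinearity (assumed toy data).
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.05, size=n)   # almost collinear with x1
X = np.column_stack([x1, x2, x3])

# Correlation matrix of the predictors.
R = np.corrcoef(X, rowvar=False)

# VIF_j is the j-th diagonal element of the inverse correlation matrix,
# equivalent to 1 / (1 - R_j^2) from regressing X_j on the others.
vif = np.diag(np.linalg.inv(R))

# Condition indices: sqrt(largest eigenvalue / each eigenvalue).
eigvals = np.linalg.eigvalsh(R)
cond_index = np.sqrt(eigvals.max() / eigvals)

print("VIF:", np.round(vif, 1))
print("Condition indices:", np.round(cond_index, 1))
```

With this design, the VIFs for x1 and x3 far exceed the usual cut-off of 10, while the VIF for the unrelated x2 stays near 1, mirroring rules of thumb 2) and 4) above.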

Possible Effects of Multicollinearity
The effects of multicollinearity in regression are of concern because it gives rise to circumstances whereby: 1) the partial contribution of each explanatory variable remains confounded, leading to difficulties in interpreting the model; 2) the variances of the coefficients of the predictor variables become unduly inflated, making precise estimation of the parameters impossible; 3) the presence of multicollinearity gives rise to a considerably high mean square error, paving the way for committing a type-I error; 4) the ordinary least squares (OLS) estimators as well as their standard errors may be sensitive to small changes in the data; in other words, the results will not be robust; and 5) finally, [6] observes that as multicollinearity increases, interpretation of the model is complicated because it becomes more difficult to ascertain the effect of any single variable, owing to the interrelationships among the variables.
Since obtaining robust estimates of regression coefficients poses significant difficulties for the ordinary least squares (OLS) method, in this research we explore two other regression models, principal component regression (PCR) and ridge regression (RR), as alternative techniques for estimating the model parameters. The aim is to enhance precision in estimating the regression parameters when multicollinearity is suspected among the predictors, without the need to drop any of the variables, and to determine which of the two methods performs better based on the mean square errors of the two models. Meanwhile, the condition number as well as the variance inflation factor is used to check whether multicollinearity exists among the covariates after estimation via the OLS approach.
The remainder of this paper is organized as follows: Section 2 briefly discusses the methodologies adopted, while Section 3 presents the results and a general discussion of the findings of the research.

Methodology
This section discusses the statistical techniques which are applied and compared with the Ordinary Least Squares (OLS) method in multiple linear regression. These methods are Principal Component Regression (PCR) and Ridge Regression (RR); their formulations as well as the underlying assumptions governing each of them are discussed.

Principal Component Regression (PCR)
This is one of the methods for solving the problem of multicollinearity, such that better estimates of the model parameters, and consequently better and more robust predictions, can be made compared with ordinary least squares. With this method, the original variables are transformed into a new set of orthogonal, or uncorrelated, variables called the principal components of the correlation matrix.
This transformation ranks the new orthogonal variables in order of their importance, and the procedure then involves eliminating some of the principal components to effect a reduction in variance. The major goals of PCR include variable reduction, selection, classification, as well as prediction. It should be recalled that this method is two-step in application: first, principal component analysis is applied; then the set of k uncorrelated or orthogonal component factors is used to replace the original set of p predictor variables. According to [7], PCR is a two-step procedure: in the first step, one computes the principal components, which are linear combinations of the explanatory variables, while in the second step the response variable is regressed on the selected principal components. Combining both steps in a single method maximizes the relation to the response variable.
Let the random vector of the predictor variables be X = (X_1, X_2, ..., X_p)′, with covariance matrix Σ and eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0. The principal components are the linear combinations

Z_j = p_j′X, j = 1, 2, ..., p, (3)

subject to the constraints p_j′p_j = 1 and Cov(Z_i, Z_j) = 0 for i ≠ j; that is, they are those uncorrelated linear combinations whose variances are as large as possible. Looking closely at the model, suppose X′X = Σ is rewritten as PΛP′, with Λ the (p × p) diagonal matrix of the eigenvalues of the design (variance-covariance) matrix Σ, and where Z = XP and Φ = P′β; thus

Z′Z = P′X′XP = P′PΛP′P = Λ.

The columns of Z, defined as the linearly uncorrelated components of the original random vector X, are now the new set of orthogonal predictor variables, also known as the principal components. These now serve as the new covariates in the regression model, called the principal component regression. Obtaining the estimates of the model using OLS, we have:

Φ̂ = (Z′Z)⁻¹Z′Y = Λ⁻¹Z′Y,

such that the covariance of Φ̂ is Cov(Φ̂) = σ²(Z′Z)⁻¹ = σ²Λ⁻¹. Because the retained components correspond to the large eigenvalues, this variance is expected to be small, which eventually leads to improved and reliable estimates of the parameters, thus supporting robust decisions. According to [8], one of the simplest ways in which the collinearity problem can be rectified in practice is by the use of Principal Component Regression (PCR); from experience, PCR usually gives a much better result than least squares for prediction purposes.
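The two-step procedure described above can be sketched as follows. The simulated data, the eigenvalue threshold for dropping components, and the variable names are illustrative assumptions on our part, not the paper's actual computation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy collinear design (assumed for illustration): x3 is nearly x1.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Step 1: principal components of the centred predictors.
Xc = X - X.mean(axis=0)
cov = (Xc.T @ Xc) / (n - 1)
eigvals, P = np.linalg.eigh(cov)           # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, P = eigvals[order], P[:, order]

# Drop components with near-zero eigenvalues (caused by collinearity).
k = int(np.sum(eigvals > 1e-2))
Z = Xc @ P[:, :k]                          # scores on retained components

# Step 2: regress the (centred) response on the retained components.
phi = np.linalg.solve(Z.T @ Z, Z.T @ (y - y.mean()))

# Map the component coefficients back to the original predictors.
beta_pcr = P[:, :k] @ phi
print("retained components:", k)
print("PCR coefficients:", np.round(beta_pcr, 2))
```

Because x1 and x3 are nearly identical, the near-zero eigenvalue is discarded and the PCR estimate splits the effect of x1 between the two collinear predictors instead of producing wildly inflated opposite-signed coefficients.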

Ridge Regression Method
This method was originally suggested by [9] as a procedure for investigating the sensitivity of least-squares estimates based on data exhibiting near-extreme multicollinearity, where small perturbations in the data may produce large changes in the magnitude of the estimated coefficients.
The ridge regression estimate of the coefficients β_j, j = 1, 2, ..., p, is

β̂_R = (X′X + rI)⁻¹X′Y,

where r ≥ 0 is a constant called the biasing factor, which needs to be set by the researcher, such that when r = 0 the ridge regression automatically reduces to ordinary least squares. This implies that ridge regression is an improved form of OLS with a minor transformation. Thus:

E(β̂_R) = (X′X + rI)⁻¹X′Xβ ≠ β for r > 0.

This result points to the fact that the ridge estimator is a biased estimator of β, which is the price paid for getting around the problem of estimating the model parameters.
Meanwhile, the variance-covariance matrix of β̂_R is obtained as:

Var(β̂_R) = σ²(X′X + rI)⁻¹X′X(X′X + rI)⁻¹,

giving rise to the mean square error

MSE(β̂_R) = tr[Var(β̂_R)] + [bias(β̂_R)]′[bias(β̂_R)].

According to [10], ridge regressions are known to have favourable properties: as shown by [9], β̂_R has a smaller mean square error than the ordinary least squares estimator β̂, provided σ² is small enough that the validity of the regression model holds. [11] [12] also pointed out that ridge regression is known as a shrinkage estimator.
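A minimal sketch of the estimator β̂_R = (X′X + rI)⁻¹X′Y follows, again on assumed toy data rather than the paper's dataset. It shows that r = 0 recovers OLS and that a positive r shrinks the coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(2)

# Near-collinear toy design (assumed): x3 is nearly x1.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

def ridge(X, y, r):
    """Ridge estimate beta_R = (X'X + rI)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + r * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)    # r = 0 reduces to ordinary least squares
beta_r = ridge(X, y, 10.0)     # a positive biasing factor shrinks the estimate

print("OLS  :", np.round(beta_ols, 2))
print("Ridge:", np.round(beta_r, 2))
```

The shrinkage property noted by [11] [12] holds here by construction: for any r > 0 the norm of the ridge estimate never exceeds that of the OLS estimate.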

Results and Discussion
In this work, we illustrate with an example predicting gas productivity (Y) using density (X1), volumetric temperature (X2), sulphur content (X3), feedback flow (X4), output feedback temperature (X5), catalyst temperature in the regenerator system (X6) and catalyst/feedback ratio (X7) as the independent variables. Since some of the variables are significantly related, as shown in Table 1, it becomes impossible to determine which of the variables accounts for the variation in the dependent variable. This is because of high correlation among the predictor variables, resulting in less stability in the estimates of the regression parameters [13]. The correlation matrix showed highly significant relationships between X1 and X3 (r = 0.96, P-value = 0.004) and between X3 and X7 (r = 0.66, P-value = 0.004). These results show the presence of multicollinearity among these independent variables.

Multicollinearity Diagnostic
The existence of multicollinearity was investigated using the Variance Inflation Factor (VIF), the variance proportions and the condition index. The results obtained are shown in Table 2. It can be confirmed that X1 and X3 have VIF values greater than 10, which indicates a collinearity problem.
Variance Inflation Factor (VIF). The VIF for each independent variable Xj is computed as

VIF(j) = 1 / (1 − Rj²),

where Rj² is the coefficient of determination obtained by regressing Xj on the remaining independent variables. The VIF results revealed the presence of multicollinearity, as VIF(1) and VIF(3) are greater than 10. This result confirmed a high level of multicollinearity among the independent variables. In the notation used: SSE is the sum of squared errors of the linear regression model; n − p is the degrees of freedom; n is the number of data points; and p is the number of parameters in the model.
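The simulation comparison reported in the abstract can be sketched along the following lines. The design, sample sizes, biasing factor and number of retained components below are our own illustrative assumptions, not the paper's actual settings; the sketch estimates the mean squared estimation error of OLS, ridge and PCR under induced collinearity.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo comparison under multicollinearity (toy simulation).
n, reps = 100, 200
beta_true = np.array([2.0, -1.0, 0.0])

def fit_ridge(X, y, r):
    return np.linalg.solve(X.T @ X + r * np.eye(X.shape[1]), X.T @ y)

def fit_pcr(X, y, k):
    eigvals, P = np.linalg.eigh(X.T @ X)
    Pk = P[:, np.argsort(eigvals)[::-1][:k]]   # top-k components
    Z = X @ Pk
    return Pk @ np.linalg.solve(Z.T @ Z, Z.T @ y)

sq_err = {"OLS": [], "Ridge": [], "PCR": []}
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    x3 = x1 + rng.normal(scale=0.05, size=n)   # induce collinearity
    X = np.column_stack([x1, x2, x3])
    y = X @ beta_true + rng.normal(scale=1.0, size=n)
    sq_err["OLS"].append(np.sum((fit_ridge(X, y, 0.0) - beta_true) ** 2))
    sq_err["Ridge"].append(np.sum((fit_ridge(X, y, 5.0) - beta_true) ** 2))
    sq_err["PCR"].append(np.sum((fit_pcr(X, y, 2) - beta_true) ** 2))

for name, errs in sq_err.items():
    print(f"{name:5s} mean squared estimation error: {np.mean(errs):.3f}")
```

In this toy setup both shrinkage methods comfortably beat OLS; which of ridge and PCR wins depends on the design and tuning, which is precisely the comparison the study carries out on its simulated data.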

Table 1. Correlation matrix between independent variables.