
This work is geared towards detecting and solving the problem of multicollinearity in regression analysis. The Variance Inflation Factor (VIF) and the Condition Index (CI) were used as detection measures, while Ridge Regression (RR) and Principal Component Regression (PCR) were the two approaches used in modelling apart from conventional least squares regression. Simulated data were used for the purpose of comparing the two methods, the task being to ascertain the effectiveness of each method on the basis of its mean square error. From the results, we found that the Ridge Regression (RR) method is better than Principal Component Regression when multicollinearity exists among the predictors.

Regression analysis is a statistical tool for studying whether a relationship, linear or nonlinear, exists between two variables, subject to certain constraints, such that one of the variables can serve to predict the other. Our focus in this study is on the linear form of such relationships. Thus, when we speak of regression here we mean linear regression, which may be simple, multiple or multivariate in nature depending on the number of variables on either side of the equation. When a single dependent variable is compared with a single independent variable, the regression is said to be simple, so we have simple (linear) regression. If one dependent variable is compared with more than one independent variable, the regression is said to be multiple in form, and thus we have multiple (linear) regression. Multivariate regression (which is outside the scope of this research) only comes into play when more than one dependent variable is compared with two or more independent variables.

Consider the multiple regression model:

Y_{i} = β_{0} + β_{1}X_{1i} + β_{2}X_{2i} + ⋯ + β_{p}X_{pi} + e_{i} (1)

In matrix form, (1) becomes:

Y = Xβ + ε (2)

where in (1), the β_{j}, ∀ j = 1, 2, ⋯, p, are the regression coefficients; Y is the (n × 1) vector of the outcome (response or dependent) variable; the X_{i}, ∀ i = 1, 2, ⋯, n, are the explanatory (predictor or independent) variables, which are fixed; and e_{i} is the error term, which together with the Y_{i} is assumed to be random. The assumptions are that E(e_{i}) = 0; E(e_{i}e_{j}) = 0 ∀ i ≠ j; and E(e_{i}e_{i}) = σ^{2}, which imply randomness, independence and homoscedasticity of the error terms respectively, indicating e_{i} ~ IIDN(0, σ^{2}). A further assumption is zero covariance between e_{i} and each of the X variables, i.e.

Cov(e_{i}, X_{1}) = Cov(e_{i}, X_{2}) = ⋯ = Cov(e_{i}, X_{p}) = 0;

no specification bias (the model should be correctly specified); and no exact linear relationship between any two predictor variables. However, violating any of these assumptions brings about serious problems in regression analysis. In particular, multicollinearity, which constitutes a major problem, sets in as a result of violation of the said assumptions [

Multicollinearity is an important concept in regression analysis, given the serious threat it poses to the validity, or the predictive strength, of a regression model. It is usually regarded as a problem arising from violation of the assumption that the explanatory variables are linearly independent. It is a phenomenon that arises in regression, especially multiple regression, when there is a high level of inter-correlation or inter-association among the independent variables.

It is therefore a type of disturbance in the regression model which, if allowed to persist, renders the statistical inferences made about the model misleading, simply because the estimates of the regression coefficients become faulty or unreliable. Multicollinearity is a condition in multiple regression models whereby two or more covariates become redundant. The redundancy implies that what one independent variable (X) explains about the dependent variable (Y) is exactly what another independent variable explains. In this case, the estimates of the regression coefficients for such redundant predictor variables would be completely erroneous.

1) Multicollinearity generally occurs when two or more explanatory variables are directly and highly correlated with each other.

2) It may also set in when one or more of the predictors are multiples of, or are computed from, other predictor variables in the same equation.

3) It may also be experienced when almost the same predictor variable is repeated in the same model.

4) It may as well occur, in situations involving nominal variables, when the dummy variables are not properly used.

However, the following have been identified as the primary sources of multicollinearity:

1) The regression model is over-defined; that is, more predictor variables than necessary are included in the model;

2) The data collection method is faulty; that is, an inappropriate sampling scheme was chosen for data collection or generation;

3) A spurious or unnecessary constraint is placed on the model or on the population;

4) The regression model is wrongly specified.

There are a number of ways by which multicollinearity may be detected in a multiple regression model, which include:

1) when the correlation coefficients in the correlation matrix of the predictor variables become very high; that is, the correlation coefficient between two highly correlated predictor variables is close to one;

2) when the coefficient of determination (R^{2}) obtained by regressing a particular predictor variable on the other independent variables is so close to unity that the variance inflation factor, VIF = 1/(1 − R^{2}), becomes very large [

3) when one or more eigenvalues of the correlation matrix become so small as to be close to zero, multicollinearity is at work;

4) another rule of thumb for detecting the presence of multicollinearity is that when one or more eigenvalues of the predictor variables become very small (close to zero), the corresponding condition number (ϕ) becomes very large [

5) comparing the decisions made using the overall F-test and the individual t-tests might provide some indication of the presence of multicollinearity. For instance, when the overall significance of the model is good under the F-test, but individually the coefficients are not significant under the t-test, then the model might suffer from multicollinearity.
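Diagnostics 2) and 4) above can be sketched in a few lines of code. The following minimal example (synthetic data, not the study's data set) computes each predictor's VIF by regressing it on the others, and the condition index as the eigenvalue ratio of the predictors' correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Synthetic predictors: X3 is almost a copy of X1 (deliberate collinearity).
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
X3 = X1 + 0.01 * rng.normal(size=n)
X = np.column_stack([X1, X2, X3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the rest."""
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), Z])        # add an intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]

# Condition index as used later in the paper: the ratio of the largest
# to the smallest eigenvalue of the predictors' correlation matrix.
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
cond_index = eigvals.max() / eigvals.min()

print(vifs)         # VIFs for X1 and X3 are far above 10; X2 stays near 1
print(cond_index)   # far above the rule-of-thumb threshold
```

On this construction, the redundant pair X1/X3 is flagged by both diagnostics, while the independent X2 is not.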

The effects of the existence of multicollinearity in regression are of concern because it gives rise to circumstances whereby:

1) The partial contribution of each explanatory variable remains confounded, leading to difficulties in interpreting the model;

2) The variances as well as the coefficients of the predictor variables become unduly inflated, making precise estimation of the parameters impossible;

3) The presence of multicollinearity gives rise to a considerably high mean square error, paving the way for committing a Type I error;

4) The ordinary least squares (OLS) estimators as well as their standard errors may be sensitive to small changes in the data; in other words, the results will not be robust; and

5) Finally, [

Since obtaining robust estimates of the regression coefficients poses significant difficulties for the ordinary least squares (OLS) method, in this research we explore two alternative regression techniques, principal component regression (PCR) and ridge regression (RR), for estimating the model parameters. The aim is to improve the precision of the parameter estimates when multicollinearity is suspected among the predictors, without the need to drop any of the variables, and to determine which of the two methods performs better on the basis of the mean square errors of the two models. Meanwhile, the condition number and the variance inflation factor are used to check whether multicollinearity exists among the covariates after estimation via the OLS approach.

The remaining part of this paper is organized as follows: Section 2 briefly discusses the methodologies adopted, while Section 3 presents the results and a general discussion of the findings of the research.

This section discusses the statistical techniques which are applied and compared with the Ordinary Least Squares (OLS) method in multiple linear regression. These methods are principal component regression (PCR) and ridge regression (RR); their formulations as well as the underlying assumptions governing each of them are discussed.

This is one of the methods for solving the problem of multicollinearity, such that better estimates of the model parameters, and consequently better and more robust predictions, can be obtained compared to ordinary least squares. With this method, the original variables are transformed into a new set of orthogonal (uncorrelated) variables called the principal components of the correlation matrix. The transformation ranks the new orthogonal variables in order of their importance, and the procedure then involves eliminating some of the principal components to effect a reduction in variance. The major goals of PCR include variable reduction, selection, classification and prediction. Note that the method is a two-step procedure: first, principal component analysis is applied; then the set of k uncorrelated (orthogonal) components replaces the original set of p predictor variables. According to [

Let the random vector of the predictor variable be:

X′ = [X_{1}, X_{2}, X_{3}, ⋯, X_{p}], with covariance matrix Σ and eigenvalues λ_{1} ≥ λ_{2} ≥ λ_{3} ≥ ⋯ ≥ λ_{p} ≥ 0. The linear combinations:

Y_{i} = e_{i}′X (3)

have variance and covariance:

Var(Y_{i}) = e_{i}′Σe_{i}, ∀ i = 1, 2, ⋯, p (4)

Cov(Y_{i}, Y_{j}) = e_{i}′Σe_{j}, ∀ i, j = 1, 2, ⋯, p; i ≠ j (5)

The principal components are those uncorrelated linear combinations Y_{1}, Y_{2}, Y_{3}, ⋯, Y_{p} whose variances are as large as possible.
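Equations (3)-(5) can be illustrated numerically. The sketch below (synthetic data, with the sample covariance matrix standing in for Σ) extracts the eigenvectors e_{i}, forms the components Y_{i} = e_{i}′X, and checks that their variances equal the eigenvalues and that they are mutually uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic correlated predictors (any invertible mixing matrix will do).
A = rng.normal(size=(3, 3))
X = rng.normal(size=(n, 3)) @ A
Xc = X - X.mean(axis=0)                  # centre the variables

Sigma = np.cov(Xc, rowvar=False)         # sample covariance matrix for Σ
lam, P = np.linalg.eigh(Sigma)           # eigh returns ascending eigenvalues
lam, P = lam[::-1], P[:, ::-1]           # reorder so λ_1 ≥ λ_2 ≥ ... ≥ λ_p

Y = Xc @ P                               # component scores, Y_i = e_i' X

# Check (4)-(5): Var(Y_i) = λ_i and the components are uncorrelated.
C = np.cov(Y, rowvar=False)
print(np.allclose(np.diag(C), lam))      # True
print(np.allclose(C, np.diag(lam)))      # True: off-diagonals vanish
```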

Looking closely at model (2), suppose X′X = Σ is rewritten as PΛP′, with Λ the (p × p) diagonal matrix of the eigenvalues of the design (variance-covariance) matrix Σ, and P = [e_{1}, e_{2}, e_{3}, ⋯, e_{p}] the matrix of normalized eigenvectors associated with the eigenvalues of Σ, such that:

PP′ = P′P = I (the identity matrix) (6)

Inserting (6) into (2) gives Y = XPP′β + ε, which then becomes:

Y = ZΦ + ε (7)

where Z = XP and Φ = P′β; thus, Z′Z = P′X′XP = P′ΣP = P′PΛP′P = Λ. The columns of Z, defined as the linearly uncorrelated components of the original random vector X, are now the new set of orthogonal predictor variables, also known as the principal components. These serve as the new covariates in the regression model, which is then called principal component regression. Obtaining the estimates of the model using OLS, we have:

Φ̂ = (Z′Z)^{-1}Z′Y = Λ^{-1}Z′Y (8)

such that the covariance of Φ̂ is

V(Φ̂) = σ^{2}(Z′Z)^{-1} = Λ^{-1}σ^{2} = σ^{2} diag(λ_{1}^{-1}, λ_{2}^{-1}, ⋯, λ_{k}^{-1}) (9)

Since, by (9), the variance of each coefficient is inversely proportional to its eigenvalue, eliminating the components with the smallest eigenvalues removes the inflated variances, eventually leading to improved and reliable estimates of the parameters and thus robust decisions. According to [
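A minimal PCR sketch along these lines (synthetic data; here the component scores are written Z = XP, and the smallest-eigenvalue component is dropped before regressing, with the coefficients then mapped back to the original predictor scale):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Two nearly collinear predictors plus an independent one.
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=n)

Xc = X - X.mean(axis=0)                  # centre predictors and response
yc = y - y.mean()

lam, P = np.linalg.eigh(Xc.T @ Xc)       # eigendecomposition of X'X
lam, P = lam[::-1], P[:, ::-1]           # descending eigenvalues

k = 2                                    # drop the smallest-eigenvalue component
Z = Xc @ P[:, :k]                        # component scores; Z'Z = diag(λ_1, λ_2)
alpha = (Z.T @ yc) / lam[:k]             # OLS on the components: Λ^{-1} Z'y
beta_pcr = P[:, :k] @ alpha              # map back to the original predictors

print(beta_pcr)
```

Because the retained scores are orthogonal, the coefficient estimates reduce to elementwise divisions by the eigenvalues, exactly as in the Λ^{-1}Z′Y form above.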

This method was originally suggested by [

The ridge regression estimates of the coefficients β_{j}, j = 1, 2, ⋯, k, are given by

β̂_{R} = (X′X + rI)^{-1}X′Y (10)

where r ≥ 0 is a constant called the biasing factor, which needs to be set by the researcher; when r = 0, ridge regression automatically reduces to ordinary least squares. This implies that ridge regression is a modified form of OLS, obtained by a minor transformation.
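The reduction of (10) to OLS at r = 0, and the shrinkage induced by larger r, can be checked numerically. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

def ridge(X, y, r):
    """Equation (10): beta_R = (X'X + rI)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + r * np.eye(X.shape[1]), X.T @ y)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(ridge(X, y, 0.0), beta_ols))                        # True
print(np.linalg.norm(ridge(X, y, 10.0)) < np.linalg.norm(beta_ols))   # True
```

In the eigenbasis of X′X each coefficient component is scaled by λ/(λ + r) < 1, so the norm of the ridge estimate strictly decreases as r grows.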

Thus:

E(β̂_{R}) = (X′X + rI)^{-1}X′E(Y) = (X′X + rI)^{-1}(X′X)β = P_{r}β (11)

where P_{r} = (X′X + rI)^{-1}X′X. This result points to the fact that ridge regression is a biased estimator of β; accepting this bias is the price paid for getting around the problem of estimating the model parameters under multicollinearity.

Meanwhile, the variance-covariance matrix of β̂_{R} is obtained as:

Var(β̂_{R}) = (X′X + rI)^{-1}(X′X)(X′X + rI)^{-1}σ^{2} (12)

Giving rise to the mean square error:

MSE(β̂_{R}) = Bias^{2} + Variance; i.e. (Bias(β̂_{R}))^{2} + Var(β̂_{R}) (13)
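Equations (11)-(13) allow this bias-variance trade-off to be computed analytically for a given design matrix. The sketch below (synthetic collinear predictors, with σ^{2} and β assumed known) shows the total MSE falling when a small biasing factor r is introduced:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 50, 3, 1.0

# A deliberately collinear design matrix; beta and sigma^2 assumed known.
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 1.0, 1.0])
XtX = X.T @ X

def ridge_mse(r):
    """Total MSE of eq. (13): squared bias (from eq. 11) plus trace of eq. (12)."""
    M = np.linalg.inv(XtX + r * np.eye(p))
    bias = M @ XtX @ beta - beta          # E(beta_R) - beta
    var = sigma2 * M @ XtX @ M            # variance-covariance matrix (12)
    return bias @ bias + np.trace(var)

print(ridge_mse(0.0))   # OLS: unbiased but with inflated variance
print(ridge_mse(1.0))   # a little bias buys a large drop in variance
```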

According to [

In this work, we illustrate with an example of predicting gas productivity (Y) using density (X_{1}), volumetric temperature (X_{2}), sulphur content (X_{3}), feedback flow (X_{4}), output feedback temperature (X_{5}), catalyst temperature in the regenerator system (X_{6}) and catalyst/feedback ratio (X_{7}) as the independent variables.

Now since some of the variables are significantly related as shown in

| | X_{1} | X_{2} | X_{3} | X_{4} | X_{5} | X_{6} | X_{7} |
|---|---|---|---|---|---|---|---|
| X_{1} | 1 | | | | | | |
| X_{2} | −0.14 | 1 | | | | | |
| X_{3} | 0.96** | −0.24 | 1 | | | | |
| X_{4} | −0.25 | 0.05 | −0.28 | 1 | | | |
| X_{5} | 0.10 | −0.20 | 0.19 | 0.33 | 1 | | |
| X_{6} | −0.44 | −0.49* | −0.46 | −0.67 | 0.16 | 1 | |
| X_{7} | 0.67** | −0.17 | 0.66** | −0.13 | −0.09 | −0.30 | 1 |

*P-value < 0.05, significantly correlated at 5%, **P-value < 0.01, significantly correlated at 1%.

significant possible relationships exist between X_{1} and X_{3} (r = 0.96, P-value = 0.004) and between X_{3} and X_{7} (r = 0.66, P-value = 0.004). These results show the presence of multicollinearity among these independent variables.

The existence of multicollinearity was investigated using the Variance Inflation Factor (VIF), variance proportions and the condition index. The results obtained show that X_{1} and X_{3} have VIF greater than 10, which indicates a collinearity problem.

Variance Inflation Factor (VIF)

The following VIF values were obtained from each of the Independent Variables:

VIF(X_{1}) = 20.74, VIF(X_{2}) = 3.639, VIF(X_{3}) = 38.95, VIF(X_{4}) = 2.58, VIF(X_{5}) = 2.82, VIF(X_{6}) = 5.58, VIF(X_{7}) = 2.24

The VIF results reveal the presence of multicollinearity, as VIF(X_{1}) and VIF(X_{3}) are greater than 10. This confirms a high level of multicollinearity among the independent variables.

The Condition Index: ϕ = λ_{max}/λ_{min}, the ratio of the maximum eigenvalue to the minimum eigenvalue. Here, ϕ = 7.492/0.00001 = 749,200. Since ϕ > 1000 (749,200 > 1000), this result also supports that obtained from the VIF.

Mean Squared Error (MSE)

For any given regression model:

MSE = SSE / (n − p) (14)

where SSE is the sum of squared errors of the linear regression model; n − p is the degrees of freedom; n is the number of data points; and p is the number of parameters in the model.
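A minimal check of Equation (14) on simulated data (counting the intercept among the p parameters), where the MSE should land close to the true error variance used in the simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 3                         # p counts the intercept as a parameter

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sse = resid @ resid                  # sum of squared errors
mse = sse / (n - p)                  # equation (14)
print(mse)                           # close to the true error variance of 1
```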

| OLS | PCR | RR |
|---|---|---|
| 0.70554 | 0.70553 | 0.68624 |

Ordinary Least Square (OLS)

GP = 126.121 − 175.414 DTY + 0.037 VT − 2.121 SC + 0.064 FF − 0.042 FT + 0.085 CT + 1.718 FR

Principal Component Regression (PCR)

GP = 51.400 − 2.336 DTY + 0.336 VT − 0.385 SC + 0.552 FF − 0.480 FT + 0.674 CT − 0.049 FR

Ridge Regression (RR)

GP = 54.1387 − 28.0014 DTY + 0.0029 VT + 0.5137 SC + 0.0045 FF + 0.0087 FT + 0.1236 CT − 0.62374 FR

Computing the mean square error for each of the models, we obtain the following results. The results can be summarized in

Having fitted the respective regression models to the available data, we investigated the adequacies of the three models using their mean square errors (see

Meanwhile, given the results obtained from the analyses, one may conclude that there is not much difference in the error values, especially between PCR and OLS; this may be due to the nature of the data used for the analysis. However, based on the results presented in

| VIF | Eigenvalue | Condn. Index | Intercept | X_{1} | X_{2} | X_{3} | X_{4} | X_{5} | X_{6} | X_{7} |
|---|---|---|---|---|---|---|---|---|---|---|
| - | 7.492 | 1.0 | 0.00 | 0.000 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.0000 |
| 20.736 | 0.507 | 3.846 | 0.00 | 0.000 | 0.000 | 0.026 | 0.000 | 0.000 | 0.000 | 0.002 |
| 3.639 | 0.001 | 112.352 | 0.00 | 0.000 | 0.0003 | 0.0177 | 0.0137 | 0.0110 | 0.00 | 0.680 |
| 38.953 | 0.003 | 153.825 | 0.00 | 0.0004 | 0.0067 | 0.0005 | 0.3735 | 0.0001 | 0.000 | 0.030 |

Further, we observe that despite the small quantitative difference, each of the two methods has its advantages and disadvantages. The advantages of the RR method over PCR are that it is easier to compute and that it provides a more stable way of moderating the model's degrees of freedom than dropping variables [

The authors declare no conflicts of interest regarding the publication of this paper.

Raheem, M.A., Udoh, N.S. and Gbolahan, A.T. (2019) Choosing Appropriate Regression Model in the Presence of Multicolinearity. Open Journal of Statistics, 9, 159-168. https://doi.org/10.4236/ojs.2019.92012