Suppression effects in multiple regression analysis may be more common in research than is currently recognized. We review the literature on the concept and types of suppressor variables, and we highlight systematic ways to identify suppression effects in multiple regression using statistics such as R<sup>2</sup>, sums of squares and regression weights, and by comparing zero-order correlations with the Variance Inflation Factor (VIF). We also establish that the suppression effect is a function of multicollinearity; however, a suppressor variable should only be allowed in a regression analysis if its VIF is less than five (5).

When selecting a set of study variables, researchers frequently test correlations between the outcome variables (i.e. dependent variables) and theoretically relevant predictor variables (i.e. independent variables) [

Questions such as this often do not receive the attention they deserve. In multiple regression equations, suppressor variables increase the magnitude of the regression coefficients associated with other independent variables or sets of variables [

Stepwise regression is a common technique for eliminating variables when the relationship of each predictor variable with an outcome variable is tested separately for statistical significance. Predictor variables that are not significantly related to the outcome variable are often eliminated at the bi-variate level. Bi-variate results provide only partial information about the relationship between a predictor and an outcome variable, and are therefore an improper basis for selecting variables for a multiple regression model. Some researchers have reported that when a predictor variable that is uncorrelated with the outcome variable in a bi-variate model is added to a multiple regression model, it sometimes significantly improves the explained variance [

Collinearity is a linear association between two explanatory (predictor) variables. Two regressor variables are perfectly collinear if there is an exact linear relationship between the two.

Multicollinearity refers to a situation in which two or more explanatory (predictor) variables in a multiple regression model are correlated with each other and likewise related to the response variable. Multicollinearity is perfect if, for example, the correlation between two independent variables is equal to 1 or −1. In practice, perfect multicollinearity rarely occurs in a data set. More commonly, the issue arises when there is an approximate linear relationship among two or more independent variables.

In regression analysis, we look at the correlations between one or more input variables, or factors, and a response to visualize the strength and direction of association between them. But in practice, the number of potential factors you may include in a regression model is limited only by your imagination and your capacity to actually gather the desired data of interest.

Multicollinearity inflates the standard errors of the coefficients. Increased standard errors in turn mean that the coefficients of some independent variables may be found not to be significantly different from 0. In other words, by inflating the standard errors, multicollinearity makes some variables statistically insignificant when they should be significant. Without multicollinearity (that is, with lower standard errors), those coefficients might be significant.

A little multicollinearity is not necessarily a serious problem. Severe multicollinearity, however, is a major problem because it increases the variance of the regression coefficients, making them unstable. The more variance they have, the more difficult the coefficients are to interpret. Warning signs that multicollinearity is affecting a multiple regression analysis include:

・ A regression coefficient is not significant even though, in the real sense, that variable is highly correlated with Y.

・ When you add or delete a predictor variable, the regression coefficients change dramatically.

・ Having a negative regression coefficient when the response should increase along with X.

・ Having a positive regression coefficient when the response should decrease as X increases.

One way to measure multicollinearity is the variance inflation factor (VIF), which assesses how much the variance of an estimated regression coefficient increases when predictors are correlated. If no predictors are correlated, the VIFs will all be 1, indicating no multicollinearity among the regressors. A VIF between 1 and 5 indicates that the regressors may be moderately correlated. A VIF between 5 and 10 indicates high correlation that may be problematic, and a VIF above 10 suggests that the regression coefficients are poorly estimated due to multicollinearity, which should be handled accordingly. If multicollinearity is a problem in a multiple regression model, that is, if the VIF for a predictor is near or above 5, the solution may be simply to:

・ Remove highly correlated predictors from the model: if one or more factors have a high VIF, one of them should be removed from the model. Because they supply redundant information, removing one of the correlated factors usually doesn’t drastically reduce the R-squared.

However, instead of treating multicollinearity as a disadvantage in multiple linear regression, we view it as an advantage, in the sense that predictors acting as suppressor variables rely on the presence of multicollinearity among the independent variables: a predictor that has a zero (or near-zero) zero-order correlation with the response variable can be retained in the model only if it is significantly correlated with one or more of the other predictor variables under study. Having studied the concept and effect of multicollinearity, we can say that a suppressor variable should be allowed in a regression model if and only if its variance inflation factor (VIF) is below 5, that is, if the multicollinearity in the model is not strong enough to render other predictors redundant (less significant when they should be practically significant) [
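For reference, the VIF thresholds quoted above can be related to the standard textbook definition of the VIF. The notation below is introduced here for illustration and is not taken from the original text: R_j^2 denotes the coefficient of determination obtained when the j-th predictor is regressed on the remaining predictors, and σ² the error variance.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}},
\qquad
\operatorname{Var}\bigl(\hat{\beta}_j\bigr)
  = \frac{\sigma^{2}}{\sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)^{2}} \cdot \mathrm{VIF}_j
```

Under this definition, VIF_j = 1 when R_j^2 = 0, a VIF of 5 corresponds to R_j^2 = 0.8, and a VIF of 10 corresponds to R_j^2 = 0.9. The standard error of the j-th coefficient is inflated by the square root of VIF_j relative to the uncorrelated case.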

Since the introduction of the concept of suppression, many authors have expanded the definition of these variables (see for example, [

There are four types of suppressor variables: the classic suppressor, the negative suppressor, the reciprocal suppressor, and the absolute and relative suppressor. We briefly introduce each type below.

Classic suppression in multiple regression analysis was introduced first and was later demonstrated mathematically. Although the zero-order correlation between a suppressor and the outcome variable is zero, the prediction of the outcome variable improves when a suppressor variable is added to the equation, simply because the suppressor is correlated with another predictor (or set of predictors) that is correlated with the outcome (dependent) variable. In this case, the suppressor variable removes irrelevant predictive variance from the other predictor (or set of predictors) and increases that predictor’s regression weight, thus increasing overall model predictability. Sometimes the suppressor variable also receives a nonzero regression weight with a negative sign. However, a variable is a suppressor only for those variables whose regression weights are increased. Thus, a suppressor is not defined by its own regression weight but rather by its effects on other variables in a regression system [
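As a minimal sketch of classic suppression, the snippet below uses the standard two-predictor formulas for standardized regression weights. The correlations (0.5, 0.0 and 0.5) are hypothetical values chosen only for illustration, and the function name `two_predictor_fit` is not from the original text.

```python
def two_predictor_fit(r_y1, r_y2, r_12):
    """Standardized weights and R^2 for y regressed on x1 and x2,
    computed from the three pairwise correlations."""
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared


# Hypothetical correlations: x2 is uncorrelated with y but correlated with x1
r_y1, r_y2, r_12 = 0.5, 0.0, 0.5
beta1, beta2, r_squared = two_predictor_fit(r_y1, r_y2, r_12)

print(f"x1 alone: beta = {r_y1:.3f}, R^2 = {r_y1 ** 2:.3f}")
print(f"x1 + x2 : beta1 = {beta1:.3f}, beta2 = {beta2:.3f}, R^2 = {r_squared:.3f}")
```

With these assumed values, adding x2 raises the weight of x1 above its bi-variate value and raises R², while x2 itself receives a negative weight, which is the pattern described in the paragraph above.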

Consider an example involving two predictor variables,

Here

The beta weight

The coefficient of determination

Thus, in this example, even though

Negative suppression was introduced and later explained mathematically by [

Reciprocal suppression was introduced by [

Absolute and relative suppression was originally introduced by [

The use of suppressor variables in multiple regressions is more common than currently recognized [

Similarly [

Having reviewed the relevant literature on the nature, implications, behavior and identification of suppressor variables and their effect on the validation and reporting of multiple regression results, we can say that the concept of suppression in multiple regression has long existed but has not been in the limelight, because suppressor variables are not a special category of predictor (independent) variable in regression analysis. Rather, a suppressor can be any predictor that is not necessarily correlated with the outcome (response) variable but is linearly correlated with some or all of the other predictors.

Because researchers are in a perpetual search for substantive relationships between variables, they usually try to use predictors that they believe will be highly correlated with the response variable. For this reason, suppressor variables are usually not consciously sought out for inclusion in a regression equation. Fortunately, suppressor variables can be incorporated into a study without the researcher’s knowledge. In these situations, even variables that would not be considered theoretically reasonable as direct predictors may act as suppressors [

Another complication in detecting suppressor variables is that they may simply be overlooked because of their low zero-order correlations or non-correlation with the response variable [

One final problem in detecting suppressor variables is the type of statistical analysis employed. The only analysis that has been discussed to this point is that of linear regression where the predictors are inter-correlated [

We reviewed the scientific literature and various databases to understand the concepts of multicollinearity and suppressor variables in regression analysis. We then examined the link between multicollinearity and the suppression effect, keeping in mind the role of multicollinearity in over- or underestimating regression inferences. Next, we designed a sample study to illustrate the drawbacks of excluding a suppressor variable from a regression analysis without first obtaining its variance inflation factor (VIF).

Data Source and Type

Solely for the purpose of illustration, we used simulated data generated with MINITAB (14) and Microsoft Excel (2007). The data set contains five variables, to which we assigned arbitrary names: Grain Yield, Plant Heading, Plant Height, Tiller Count and Panicle Length. A limitation of this analysis is that it is difficult to obtain a set of data in which variables have exactly zero zero-order correlation. Our objective, however, is to show the limitation of stepwise selection in selecting a variable with zero or nearly zero zero-order correlation with the response variable, and to show that the suppression effect cannot be discussed without discussing multicollinearity. We therefore require a set of predictor variables that exhibits the effect we intend to show, that is, independent variables that have near-zero or very weak correlation with the outcome (dependent) variable alongside other predictor variables that have a non-zero correlation with the response variable. We have ignored the limitations inherent in the use of such data. Readers should ignore all other implications of our findings, taking away from this exercise only the discussion that pertains to the limitation of stepwise selection and the advantage of multicollinearity as regards the suppression effect.

The statistical packages used for this study are R (version 3.2.2), Statgraphics (version 17), Minitab (version 14) and Microsoft Excel 2010. These packages were chosen as a matter of preference.

Quite a number of authors have proposed understanding suppressor variables by evaluating regression weights [

From the simulated data, we hypothesized that the Grain Yield of wheat is solely dependent on Plant Heading, Plant Height, Tiller Count and Panicle Length. Specifically, we examined the following hypotheses:

・ The grain yield of wheat depends on the plant heading;

・ The grain yield of wheat depends on plant height;

・ The grain yield of wheat depends on tiller count;

・ The grain yield of wheat depends on panicle length.

We picked five (5) variables from the simulated wheat grain yield data: 1) Grain Yield; 2) Plant Heading; 3) Plant Height; 4) Tiller Count; 5) Panicle Length. We treated Plant Heading, Plant Height, Tiller Count and Panicle Length as predictor (independent) variables and Grain Yield as the response (dependent) variable.

The first step of analysis involves a Pearson zero order correlation of the five variables that is, Grain Yield, Plant Heading, Plant Height, Tiller Count and Panicle Length. From

The second analytic step is to clearly outline the correlated predictors. To this end, we check for multicollinearity among these four independent (predictor) variables. Therefore, from

・ Plant Heading and Plant Height, Tiller Count, Panicle Length (

・ Plant Height and Tiller Count and Panicle Length (

・ Tiller Count and Panicle Length (

The third step involves assessing Tiller Count as a possible suppressor variable. Tiller Count is not significantly related to the outcome variable Grain Yield, but it is significantly associated with another predictor variable (Plant Heading), which suggests Tiller Count as a potential suppressor variable.

Before investigating the presence of suppression among the predictor variables, it is useful to apply the existing methods of variable selection in regression analysis, to get a clear picture of which variables those methods identify as potentially relevant and so to further support our point.

Stepwise Regression: Grain Yield versus Plant Heading, Plant Height, Tiller Count and Panicle Length. Response is Grain Yield on 4 predictors, with N = 50. From


Variable | Grain Yield | Plant Heading | Plant Height | Tiller Count | Panicle Length |
---|---|---|---|---|---|
Grain Yield | 1 | | | | |
Plant Heading | 0.5917 | 1 | | | |
P-Value | 0.000 | | | | |
Plant Height | 0.0393 | 0.0936 | 1 | | |
P-Value | 0.786 | 0.518 | | | |
Tiller Count | 0.0016 | −0.3264 | 0.0070 | 1 | |
P-Value | 0.991 | 0.021 | 0.961 | | |
Panicle Length | 0.7675 | 0.2618 | 0.1745 | 0.1792 | 1 |
P-Value | 0.000 | 0.066 | 0.225 | 0.213 | |
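The significance of each zero-order correlation with Grain Yield can be checked by hand with the usual t statistic for a Pearson correlation. The sketch below is an illustration rather than the authors' procedure: it copies the correlations from the table above, takes N = 50 from the text, and assumes a tabulated two-sided 5% critical value of about 2.0106 for 48 degrees of freedom.

```python
import math

N = 50               # sample size reported for the simulated data
T_CRITICAL = 2.0106  # assumed tabulated two-sided 5% critical value, t with 48 df

# Zero-order correlations with Grain Yield, copied from the correlation table
r_with_yield = {
    "Plant Heading": 0.5917,
    "Plant Height": 0.0393,
    "Tiller Count": 0.0016,
    "Panicle Length": 0.7675,
}


def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


t_stats = {name: correlation_t(r, N) for name, r in r_with_yield.items()}
significant = [name for name, t in t_stats.items() if abs(t) > T_CRITICAL]

for name, t in t_stats.items():
    print(f"{name:15s} t = {t:6.2f}")
print("Significant at the 5% level:", significant)
```

Because the correlations in the table are rounded to four decimals, the t statistics agree with the T-Values of the bi-variate regressions reported later only up to rounding.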

0.05 (α) as the significant variables to be included in the model as suggested by the correlation result in


From the three methods of variable selection (Tables 2-4) (that is, forward selection, backward elimination

Step | 1 | 2 |
---|---|---|
Constant | 253.38 | −95.54 |
Panicle Length | 0.807 | 0.691 |
T-Value | 8.30 | 8.76 |
P-Value | 0.000 | 0.000 |
Plant Heading | | 0.406 |
T-Value | | 5.59 |
P-Value | | 0.000 |
R-Sq | 58.91 | 75.30 |
R-Sq (Adj) | 58.06 | 74.25 |

Step | 1 | 2 | 3 |
---|---|---|---|
Constant | 144.30 | 155.43 | −95.54 |
Plant Heading | 0.421 | 0.412 | 0.406 |
T-Value | 5.37 | 5.76 | 5.59 |
P-Value | 0.000 | 0.000 | 0.000 |
Plant Height | −1.11 | −1.11 | |
T-Value | −1.62 | −1.64 | |
P-Value | 0.112 | 0.108 | |
Tiller Count | 0.4 | | |
T-Value | 0.31 | | |
P-Value | 0.759 | | |
Panicle Length | 0.704 | 0.711 | 0.691 |
T-Value | 8.50 | 9.06 | 8.76 |
P-Value | 0.000 | 0.000 | 0.000 |
R-Sq | 76.71 | 76.67 | 75.30 |
R-Sq (Adj) | 74.64 | 75.14 | 74.25 |

Step | 1 | 2 |
---|---|---|
Constant | 253.38 | −95.54 |
Panicle Length | 0.807 | 0.691 |
T-Value | 8.30 | 8.76 |
P-Value | 0.000 | 0.000 |
Plant Heading | | 0.406 |
T-Value | | 5.59 |
P-Value | | 0.000 |
R-Sq | 58.91 | 75.30 |
R-Sq (Adj) | 58.06 | 74.25 |

and stepwise selection) above, we deduce that Plant Heading and Panicle Length are the potentially relevant variables to be included in the model, as suggested by all three variable selection methods. It is against this backdrop that we suggest the presence of a suppressor variable, based on the zero-order correlations of the four predictor (independent) variables and having identified the existence of multicollinearity among them. To this end, we identify Tiller Count as a potential suppressor variable because of its significant correlation with another predictor (Plant Heading), which indicates multicollinearity between these variables.
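The selection result can be approximated from the published correlation matrix alone, since R² for any subset of predictors depends only on the correlations. The following is a sketch of forward selection under stated assumptions, not a reproduction of the Minitab procedure: the entry rule (a partial F of at least 4.0, `F_TO_ENTER`) is an assumption, and the helper names are hypothetical.

```python
NAMES = ["Plant Heading", "Plant Height", "Tiller Count", "Panicle Length"]
N = 50
F_TO_ENTER = 4.0  # assumed entry threshold, not taken from the paper

# Predictor intercorrelations and correlations with Grain Yield (correlation table)
R_XX = [
    [1.0000, 0.0936, -0.3264, 0.2618],
    [0.0936, 1.0000, 0.0070, 0.1745],
    [-0.3264, 0.0070, 1.0000, 0.1792],
    [0.2618, 0.1745, 0.1792, 1.0000],
]
R_XY = [0.5917, 0.0393, 0.0016, 0.7675]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(subset):
    """R^2 of Grain Yield on the predictors whose indices are in subset."""
    if not subset:
        return 0.0
    a = [[R_XX[i][j] for j in subset] for i in subset]
    b = [R_XY[i] for i in subset]
    betas = solve(a, b)  # standardized regression weights
    return sum(beta * r for beta, r in zip(betas, b))


def forward_selection():
    """Add, one at a time, the predictor with the largest R^2 gain
    until the partial F of the best candidate drops below F_TO_ENTER."""
    selected = []
    remaining = list(range(len(NAMES)))
    while remaining:
        current = r_squared(selected)
        best = max(remaining, key=lambda j: r_squared(selected + [j]))
        new = r_squared(selected + [best])
        df_resid = N - len(selected) - 2  # residual df after adding the candidate
        f_enter = (new - current) / ((1.0 - new) / df_resid)
        if f_enter < F_TO_ENTER:
            break
        selected.append(best)
        remaining.remove(best)
        print(f"enter {NAMES[best]:15s} F = {f_enter:6.2f}  R^2 = {new:.4f}")
    return [NAMES[j] for j in selected]


chosen = forward_selection()
print("Selected:", chosen)
```

Under the assumed threshold, the sketch stops after Panicle Length and Plant Heading, in line with the selection tables; Tiller Count, with its near-zero correlation with Grain Yield, is never a candidate for entry.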

The fourth analytic step is to run a regression of the variables both in the bi-variate and multiple variable cases to explicitly reveal the suppression effect of the Tiller Count variable as the potential classic suppressor in the regression model.

The Bi-variate Case

Regression Analysis: Grain Yield versus Plant Heading

The regression equation is Grain Yield = 553.8 + 0.5727 Plant Heading

Regression Analysis: Grain Yield versus Plant Height

The regression equation is Grain Yield = 1152.5 + 0.368 Plant Height

Regression Analysis: Grain Yield versus Tiller Count

The regression equation is Grain Yield = 1245.85 + 0.025 Tiller Count

Regression Analysis: Grain Yield versus Panicle Length

The regression equation is Grain Yield = 253.4 + 0.80689 Panicle Length

The results in Tables 5-12, that is, the regression analyses in the bi-variate cases, show that the significant predictors among the four predictor variables are Plant Heading and Panicle Length. This is in agreement with the correlation result obtained in

Predictor | Coef | SE Coef | T-Value | P-Value |
---|---|---|---|---|
Constant | 553.8 | 136.2 | 4.06 | 0.000 |
Plant Heading | 0.5727 | 0.1126 | 5.09 | 0.000 |

Source | Df | Sum of Squares | Mean Square | F-Ratio | P-Value |
---|---|---|---|---|---|
Regression | 1 | 13,980 | 13,980 | 25.86 | 0.000 |
Residual Error | 48 | 25,948 | 541 | | |
Total | 49 | 39,928 | | | |

Predictor | Coef | SE Coef | T-Value | P-Value |
---|---|---|---|---|
Constant | 1152.5 | 343.9 | 3.35 | 0.002 |
Plant Height | 0.368 | 1.346 | 0.27 | 0.786 |

Source | Df | Sum of Squares | Mean Square | F-Ratio | P-Value |
---|---|---|---|---|---|
Regression | 1 | 61.9 | 61.9 | 0.07 | 0.786 |
Residual Error | 48 | 39,866.3 | 830.5 | | |
Total | 49 | 39,928.2 | | | |

Predictor | Coef | SE Coef | T-Value | P-Value |
---|---|---|---|---|
Constant | 1245.85 | 49.30 | 25.27 | 0.000 |
Tiller Count | 0.025 | 2.174 | 0.01 | 0.991 |

Source | Df | Sum of Squares | Mean Square | F-Ratio | P-Value |
---|---|---|---|---|---|
Regression | 1 | 0.01 | 0.01 | 0.00 | 0.991 |
Residual Error | 48 | 39,928.1 | 831.8 | | |
Total | 49 | 39,928.2 | | | |

Predictor | Coef | SE Coef | T-Value | P-Value |
---|---|---|---|---|
Constant | 253.4 | 119.7 | 2.12 | 0.040 |
Panicle Length | 0.80689 | 0.09728 | 8.30 | 0.000 |

Source | Df | Sum of Squares | Mean Square | F-Ratio | P-Value |
---|---|---|---|---|---|
Regression | 1 | 23,523 | 23,523 | 68.83 | 0.000 |
Residual Error | 48 | 16,405 | 342 | | |
Total | 49 | 39,928 | | | |
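The summary statistics of each bi-variate model follow directly from its analysis-of-variance table. The sketch below recomputes R², adjusted R² and the F-ratio from the sums of squares printed in the tables above, with N = 50 and one predictor per model. The function name `anova_summary` is introduced here for illustration.

```python
def anova_summary(ss_regression, ss_error, n, n_predictors):
    """R^2, adjusted R^2 and the overall F ratio from an ANOVA decomposition."""
    ss_total = ss_regression + ss_error
    df_error = n - n_predictors - 1
    r2 = ss_regression / ss_total
    adj_r2 = 1.0 - (ss_error / df_error) / (ss_total / (n - 1))
    f_ratio = (ss_regression / n_predictors) / (ss_error / df_error)
    return r2, adj_r2, f_ratio


# (regression SS, residual SS) copied from the bi-variate ANOVA tables
models = {
    "Plant Heading": (13980.0, 25948.0),
    "Plant Height": (61.9, 39866.3),
    "Tiller Count": (0.01, 39928.1),
    "Panicle Length": (23523.0, 16405.0),
}

summaries = {
    name: anova_summary(ssr, sse, 50, 1) for name, (ssr, sse) in models.items()
}
for name, (r2, adj_r2, f_ratio) in summaries.items():
    print(f"{name:15s} R^2 = {r2:6.1%}  adj R^2 = {adj_r2:6.1%}  F = {f_ratio:6.2f}")
```

The adjusted R² values for Plant Heading and Panicle Length appear to correspond to the percentages of explained variance quoted for models 1 and 4 in the discussion of these results.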

Regression Analysis: Grain Yield versus Plant Heading, Panicle Length

The regression equation is Grain Yield = −95.5 + 0.40602 Plant Heading + 0.69141 Panicle Length

Regression Analysis: Grain Yield versus Plant Heading, Tiller Count and Panicle Length

The regression equation is

From the four regression analyses in the bi-variate case: in model 1, we regressed our outcome variable Grain Yield on the predictor variable Plant Heading, which was significant and accounted for 33.7% of the variance in the outcome variable. Plant Heading was positively associated with Grain Yield in the bi-variate correlation (

In model 2, Grain Yield versus Plant Height, the predictor was insignificant, as suggested by the correlation result in

In model 3, Grain Yield versus Tiller Count, the predictor was insignificant, as suggested by the correlation result in

In model 4, Grain Yield versus Panicle Length, the predictor was significant as expected and accounted for about 58.1% of the variance in the outcome variable. Panicle Length, which was positively associated with Grain Yield, has (

Multicollinearity in regression is usually viewed as a disadvantage, as it inflates the standard errors of the regression coefficients. From the discussion of the Variance Inflation Factor (VIF), we know that a VIF of 5 or above is undesirable in a regression model because it may render other significant variables redundant. Therefore, from our equation in

Predictor | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | −95.5 | 112.7 | −0.85 | 0.401 | |
Plant Heading | 0.40602 | 0.07270 | 5.59 | 0.000 | 1.1 |
Panicle Length | 0.69141 | 0.07896 | 8.76 | 0.000 | 1.1 |

Source | Df | Sum of Squares | Mean Square | F-Ratio | P-Value |
---|---|---|---|---|---|
Regression | 2 | 30,068 | 15,034 | 71.66 | 0.000 |
Residual Error | 47 | 9,860 | 210 | | |
Total | 49 | 39,928 | | | |

Source | Df | Sum of Squares | Mean Square | F-Ratio | P-Value |
---|---|---|---|---|---|
Regression | 3 | 30,089 | 10,030 | 46.89 | 0.000 |
Residual Error | 46 | 9,840 | 214 | | |
Total | 49 | 39,928 | | | |

Predictor | Coef | SE Coef | T-Value | P-Value | VIF |
---|---|---|---|---|---|
Constant | 94.6 | 119.3 | 0.89 | 0.376 | |
Plant Heading | 0.41584 | 0.07984 | 5.21 | 0.000 | 1.3 |
Tiller Count | 0.381 | 1.219 | 0.31 | 0.756 | 1.2 |
Panicle Length | 0.69910 | 0.08331 | 8.21 | 0.000 | 1.2 |

Length variable have the same VIF; that is to say, Tiller Count serves in the model as a classic suppressor for Panicle Length. Having studied this effect, we view multicollinearity not as a disadvantage but as an advantage, since suppressor variables rely on multicollinearity among variables to act. That is to say, the suppression effect is a function of multicollinearity. We therefore conclude that a suppressor variable should be allowed a place in a multiple regression model if its VIF is less than five (5).
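The VIF values of the three-predictor model can be checked against the published correlation table, because each VIF depends only on the correlations among the predictors. The sketch below assumes the standard definition VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the others. The helper names are hypothetical and the computation is an illustration, not the authors' software output.

```python
NAMES = ["Plant Heading", "Tiller Count", "Panicle Length"]

# Intercorrelations of the three predictors, copied from the correlation table
R_XX = [
    [1.0000, -0.3264, 0.2618],
    [-0.3264, 1.0000, 0.1792],
    [0.2618, 0.1792, 1.0000],
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def vif(j):
    """VIF of predictor j, with R_j^2 from regressing j on the other predictors."""
    others = [k for k in range(len(NAMES)) if k != j]
    a = [[R_XX[p][q] for q in others] for p in others]
    b = [R_XX[p][j] for p in others]
    betas = solve(a, b)
    r2_j = sum(beta * r for beta, r in zip(betas, b))
    return 1.0 / (1.0 - r2_j)


vifs = {name: vif(j) for j, name in enumerate(NAMES)}
for name, value in vifs.items():
    flag = "below 5" if value < 5 else "5 or above"
    print(f"{name:15s} VIF = {value:.2f} ({flag})")
```

Rounded to one decimal place, the computed values match the VIF column of the three-predictor coefficient table, and all three fall well below the threshold of five proposed in the text.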

Having identified Tiller Count as a suppressor variable, that is, Classic suppressor, from the correlation result in

Therefore from the above illustration we have been able to show a clear case of classic suppression in the regression model for Grain Yield versus Plant Heading, Tiller Count and Panicle Length.

The final analytic step is to check for a reciprocal suppression effect in the overall model alongside classic suppression. By the definition of reciprocal suppression, both predictor variables (here Tiller Count and Plant Heading) have a positive correlation with the outcome (response) variable but a negative zero-order correlation with each other. When the response variable is regressed on these two variables, they will suppress some of their irrelevant information, increase the regression weight of each other, and thus improve model

In this section, we discuss some of the advantages of accurately identifying suppression effects and the benefits of using suppressor variables in multiple regression analysis. Using suppressor variables in multiple regressions will yield three positive outcomes: determining more accurate regression coefficients associated with independent variables; improving overall predictive power of the model; and enhancing accuracy of theory building.

First, the risks associated with excluding a relevant variable are much greater than the risks associated with including an irrelevant variable. The regression weight of an independent variable may change depending upon its correlation with other independent variables in the model. If a suppressor variable that should have been in the model is missing, that omission may substantially alter the results, including an underestimated regression coefficient of the suppressed variable, a higher model error sum of squares, and lower predictive power of the model, as shown in the analysis above. An incomplete set of independent variables may not only underestimate regression coefficients but, in some instances, will also increase the probability of making a Type II error by failing to reject the null hypothesis when it is false. In contrast, although including irrelevant variables in a model can contribute to multicollinearity and loss of degrees of freedom, those variables will not affect the predictive power of the model. Hence, the risk of excluding a relevant variable outweighs the risk of including an irrelevant variable. To avoid underestimating the regression coefficient of a particular independent variable, it is important to understand the nature of its relationship with other independent variables. The concept of suppression provokes researchers to think about the presence of outcome-irrelevant variation in an independent variable that may mask that variable’s genuine relationship with the outcome variable.

Only when a predictor variable that is uncorrelated with the other predictors is included in a multiple regression will the regression weights of the other predictor variables remain stable. However, in most research, explanatory variables are inter-correlated, and regression coefficients are calculated after adjusting for all the bi-variate correlations between independent variables. When a multiple regression model is altered by adding a variable that is correlated with other predictor variables, the usual outcome is that the added variable reduces the regression weight of the other predictor variable(s). The impact will be different if the added variable (or set of variables) is a suppressor variable. The suppressor variable will account for irrelevant predictive variance in some predictors and, therefore, will yield an increase in the regression weight of those predictors. Moreover, the regression weight of the suppressor may improve, thus improving the overall predictive power of the model [

Our example using the simulated Wheat Grain Yield data illustrates that regression weights may change substantially when potential suppressor variables are included in models. If the regression weights of included variables improve dramatically due to the presence of a variable that was insignificant at the bi-variate level, then one or more of the independent variables may be acting as a suppressor. In our example, the presence of Tiller Count improved the regression weights of Plant Heading and Panicle Length. Also, Plant Heading and Tiller Count served as reciprocal suppressors in the overall model, clearing out the outcome-irrelevant variance in each other and thus improving each other's weights.

Michael Olusegun Akinwande, Hussaini Garba Dikko, Agboola Samson (2015) Variance Inflation Factor: As a Condition for the Inclusion of Suppressor Variable(s) in Regression Analysis. Open Journal of Statistics, 05, 754-767. doi: 10.4236/ojs.2015.57075