
In deriving a regression model, analysts often have to use variable selection, despite the problems introduced by data-dependent model building. Resampling approaches have been proposed to handle some of the critical issues. In order to assess and compare several strategies, we conduct a simulation study with 15 predictors and a complex correlation structure in the linear regression model. Using sample sizes of 100 and 400 and estimates of the residual variance corresponding to R^{2} of 0.50 and 0.71, we consider 4 scenarios with varying amounts of information. We also consider two examples with 24 and 13 predictors, respectively. We discuss the value of cross-validation, shrinkage and backward elimination (BE) with varying significance level. We assess whether 2-step approaches using global or parameterwise shrinkage (PWSF) can improve selected models, and we compare the results to models derived with the LASSO procedure. Besides the MSE we use model sparsity and further criteria for model assessment. The amount of information in the data has an influence on the selected models and on the comparison of the procedures. None of the approaches was best in all scenarios. The performance of backward elimination with a suitably chosen significance level was not worse than that of the LASSO, and the models selected by BE were much sparser, an important advantage for interpretation and transportability. Compared to global shrinkage, PWSF had better performance. Provided that the amount of information is not too small, we conclude that BE followed by PWSF is a suitable approach when variable selection is a key part of data analysis.

In deriving a suitable regression model, analysts are often faced with many predictors which may have an influence on the outcome. We consider the low-dimensional situation with about 10 to 30 variables; the much more difficult task of analyzing ‘omics’ data with thousands of measured variables is beyond our scope. Even for 10+ variables, selection of a more relevant subset may have advantages, as it results in simpler models which are easier to interpret and often more useful in practice. However, variable selection can introduce severe problems, such as biases in estimates of regression parameters and corresponding standard errors, instability of the set of selected variables, and an overoptimistic estimate of the predictive value [1-4].

To overcome some of these difficulties, several proposals have been made during the last decades. To assess the predictive value of a regression model, cross-validation is often recommended […].

When building regression models it has to be distinguished whether the only interest is a model for prediction, or whether an explanatory model is required, in which it is also important to assess the effect of each individual covariate on the outcome. Whereas the mean squared error of prediction (MSE) is the main criterion in the former situation, it is important to consider further quality criteria for a selected model in the latter case. At least interpretability, model complexity and practical usefulness are relevant […].

We will focus on a simple linear regression model

Y = β_{0} + β^{T}X + ε,

with X a p-dimensional covariate vector. Let there be n observations (x_{i}, y_{i}) used to obtain estimates b_{0} and b = (b_{1}, …, b_{p}) of the regression parameters.

The standard approach without variable selection is classic ordinary least squares (OLS). In a simulation study we investigate how much model building can be improved by variable selection and cross-validation based shrinkage. The paper reviews and extends early work by the authors [2,4,9]. Elements added are a thorough reflection on the value of cross-validation and a comparison with Tibshirani’s LASSO [5].

The paper is structured in the following way. Section 2 describes the design of the simulation study. Section 3 reviews the role of cross-validation in assessing the prediction error of a regression model and studies its behavior in the simulation study. Section 4 reviews global and parameterwise shrinkage and assesses the performance of cross-validation based shrinkage in the simulation data. Sections 5 and 6 discuss the effect of model selection by BE and the usefulness of cross-validation and shrinkage after selection. Section 7 compares the performance of post-selection shrinkage with the LASSO. Two real-life examples are given in Section 8. Finally, the findings of the paper are summarized and discussed in Section 9.

The properties of the different procedures are investigated by simulation, using the same design as in earlier work by the authors. The 15 covariates are generated with a complex correlation structure that includes R_{1,5} = 0.7, R_{1,10} = 0.5, R_{2,6} = 0.5 and R_{4,8} = −0.7; some covariates, such as X_{3}, are uncorrelated with all other ones. The regression coefficients (apart from the intercept β_{0}) are taken to be β_{1} = β_{2} = β_{3} = 0, β_{4} = −0.5, β_{5} = β_{6} = β_{7} = 0.5, β_{8} = β_{9} = 1 and a nonzero β_{10}, while β_{11} = … = β_{15} = 0.

The variance of the linear predictor β^{T}X in the model equals β^{T}C_{X}β, where C_{X} is the covariance matrix of the X’s; in this design it equals 6.250. The residual variances are taken to be σ^{2} = 6.25 or σ^{2} = 2.5. The corresponding values of the multiple correlation coefficient R^{2} are 0.50 and 0.71, respectively. Sample sizes are n = 100 or n = 400. For each of the four combinations, called scenarios, 10,000 samples are generated and analyzed. The scenarios are ordered by the amount of information they carry on the regression coefficients: scenario 1 is the combination (n = 100, σ^{2} = 6.25), scenario 2 is (n = 100, σ^{2} = 2.5), scenario 3 is (n = 400, σ^{2} = 6.25) and scenario 4 is (n = 400, σ^{2} = 2.5).
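The design above can be sketched in code. The following is a minimal sketch only, assuming unit-variance normal covariates and filling in just the four correlations stated in the text; the remaining correlations of the original design are not recoverable here and are left at zero, and the unrecoverable value of β_{10} is replaced by a hypothetical stand-in of 1.0. The resulting variance of the linear predictor therefore differs from the 6.250 of the original design.

```python
import numpy as np

# Sketch of one simulated data set (scenario 1: n = 100, sigma^2 = 6.25).
# Only the correlations stated in the text are filled in; all other
# off-diagonal entries of the original design are unknown and left at 0.
rng = np.random.default_rng(1)

p = 15
R = np.eye(p)
for i, j, r in [(0, 4, 0.7), (0, 9, 0.5), (1, 5, 0.5), (3, 7, -0.7)]:
    R[i, j] = R[j, i] = r          # R_{1,5}, R_{1,10}, R_{2,6}, R_{4,8}

# beta_10 is nonzero in the original design; 1.0 below is a hypothetical stand-in.
beta = np.array([0, 0, 0, -0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 0, 0, 0, 0, 0])

n, sigma2 = 100, 6.25
X = rng.multivariate_normal(np.zeros(p), R, size=n)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

var_lp = beta @ R @ beta           # variance of the linear predictor, beta' C_X beta
```

With the stand-in coefficients, `var_lp` equals 4.7 rather than the 6.250 of the full original design, which illustrates how much the omitted correlations and the true β_{10} contribute.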

Since the covariates are not independent, the contribution of X_{j} to the variance of the linear predictor is not simply equal to β_{j}^{2}Var(X_{j}). Moreover, the regression coefficients have no absolute meaning, but depend on which other covariates are in the model. To demonstrate this, it is studied how dropping one of the covariates influences the optimal regression coefficients of the other covariates, the variance of the linear predictor and the increase of the residual variance, which is equal to the decrease of the explained variance. This is only done for the covariates which have non-zero coefficients in the full model. The results are shown in the corresponding table. Covariates X_{1}, X_{2}, X_{3}, X_{11}, …, X_{15} have β = 0; dropping them will not affect the β’s of the model under “none”.

The table also shows the resulting R^{2}. Apparently, the effect of each covariate is partly “inherited” by some of the other covariates. A simple pattern of inheritance is seen for X_{6}. It only correlates with X_{2}, and the pair is independent of the rest. If X_{6} is dropped, X_{2} gets the regression coefficient 0.25. This saves a little bit of the variance of the linear predictor: it drops from 6.250 to 6.063, while it would have dropped to 6.000 if X_{6} were independent of the other predictors. A more complicated pattern is seen for X_{7}. If that one is dropped, X_{8} and X_{14} inherit the effects; these covariates show up because they are directly correlated with X_{7}. Covariate X_{4} shows up because it is correlated with X_{8}. The variance of the linear predictor drops from 6.250 to 6.107.

Covariates that are independent of all other covariates, such as X_{3}, cannot inherit effects. However, correlated covariates such as X_{1} can partly substitute the effects of others, although they have zero coefficients in the full model.

Cross-validation is often recommended as a robust way of assessing the predictive value of a statistical model. The simplest approach is leave-one-out cross-validation, in which each observation is predicted from a model fitted to all the other observations. The generalization is K-fold cross-validation, in which the observations are randomly divided into K “folds” of approximately equal size and the observations in one fold are predicted using the observations in the other folds. In this paper leave-one-out cross-validation will be used, but the formulas presented apply more generally. Let ŷ_{−i}(x_{i}) be the prediction for observation i obtained in the cross-validation subset in which observation i is not included. The cross-validation based estimate of the prediction error is defined as

PE_{cv} = (1/n) Σ_{i=1}^{n} (y_{i} − ŷ_{−i}(x_{i}))^{2}.
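For OLS, the leave-one-out estimate can be computed without refitting n models, using the standard hat-matrix identity e_{−i} = e_{i}/(1 − h_{ii}). A sketch with plain numpy (an illustration, not the authors’ code):

```python
import numpy as np

def press_loo(X, y):
    """Leave-one-out cross-validated mean squared prediction error (PRESS / n)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])        # design matrix with intercept
    H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)    # hat matrix
    resid = y - H @ y                            # ordinary residuals
    # Identity: the leave-one-out residual is e_i / (1 - h_ii)
    loo_resid = resid / (1.0 - np.diag(H))
    return np.mean(loo_resid ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 15))
y = X[:, 0] + rng.normal(size=100)
pe_cv = press_loo(X, y)
```

The identity makes leave-one-out as cheap as a single fit, which is why it is attractive in a simulation with thousands of replications.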

The true prediction error of the model with estimates b_{0} and b from the “original” model using all n observations is defined as

PE = E[(Y_{new} − b_{0} − b^{T}X_{new})^{2}],

with (X_{new}, Y_{new}) a new observation from the same model. In the simulation study it is given by

PE = σ^{2} + (b_{0} − β_{0})^{2} + (b − β)^{T}C_{X}(b − β),

with the covariates taken to have mean zero.

The results in the simulation study, using all covariates without any selection, are given in the corresponding table.

The results show that PE_{cv} does a good job in estimating the mean value of PE over all simulations. However, since the correlation between PE_{cv} and PE over all simulation runs is virtually equal to zero, it must be concluded that it does a very poor job in estimating the prediction error of the individual models.

Notice that the standard deviation of PE_{cv} is much larger than that of PE. The explanation is that a lot of the variation in PE_{cv} is due to the estimation of the unknown σ^{2}. Cross-validation might do a better job in picking up the systematic part of the prediction error caused by the error in the estimated β’s. That can be checked by studying the behavior of

PE_{cv} − σ̂^{2},

which is an estimate of the systematic part PE − σ^{2}. Here σ̂^{2} is the usual unbiased estimator of σ^{2}. The results are shown in the corresponding table.

Means are very similar, but the standard deviations of the CV estimates are much smaller. CV somehow shrinks the estimate of the systematic error towards the mean. The table shows that the correlations between the estimate and the true value are still very low. The warning issued in Section 4 of […] therefore applies: cross-validation does not estimate the prediction error of the model at hand, but the average over all possible “training sets”. However, it might be helpful in selecting procedures that reduce the prediction error.

Finally, it should be pointed out that the cross-validation results are in close agreement with the model based estimates of the prediction error as discussed in the same section of […].

As argued by [2,11], the predictive performance of the resulting model can be improved by shrinkage of the model towards the mean. This gives the predictor

ŷ_{shrunk}(x) = ȳ + c·(ŷ(x) − ȳ),

with shrinkage factor c, 0 < c ≤ 1. In the following, c will be called the global shrinkage factor. Under the assumption of homoskedasticity, the optimal value for c can be estimated as

ĉ = 1 − p·σ̂^{2}/ESS,

with ESS the explained sum of squares, σ̂^{2} the estimate of the residual variance and p the number of predictors.

A model-free estimator can be obtained by means of cross-validation. Let ŷ_{−i}(x_{i}) be the prediction for observation i obtained in the cross-validation subset in which observation i is not included; then c can be estimated by minimizing

Σ_{i} (y_{i} − ȳ − c·(ŷ_{−i}(x_{i}) − ȳ))^{2},

resulting in

ĉ = Σ_{i} (y_{i} − ȳ)(ŷ_{−i}(x_{i}) − ȳ) / Σ_{i} (ŷ_{−i}(x_{i}) − ȳ)^{2}.

This estimate can be obtained by regressing y_{i} − ȳ on ŷ_{−i}(x_{i}) − ȳ in a model without an intercept. It differs slightly from the one obtained by regressing y_{i} on ŷ_{−i}(x_{i}) as proposed in […].
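The cross-validation estimate of the global shrinkage factor can be sketched as follows, with the leave-one-out predictions computed by explicit refitting for clarity (numpy only; an illustration, not the authors’ implementation):

```python
import numpy as np

def global_shrinkage(X, y):
    """CV estimate of the global shrinkage factor: no-intercept regression of
    y - ybar on the leave-one-out predictions minus ybar."""
    n = len(y)
    yhat_loo = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        Xd = np.column_stack([np.ones(n - 1), X[mask]])
        coef, *_ = np.linalg.lstsq(Xd, y[mask], rcond=None)
        yhat_loo[i] = coef[0] + X[i] @ coef[1:]   # prediction for left-out obs.
    u = y - y.mean()
    v = yhat_loo - y.mean()
    return (u @ v) / (v @ v)                      # slope without intercept

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 15))
y = X[:, 0] + rng.normal(size=100)    # moderate signal: expect c somewhat below 1
c = global_shrinkage(X, y)
```

With n = 100 and 15 predictors of which only one carries signal, the estimated factor typically comes out below one, in line with the shrinkage needed for noisy data.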

The table shows that global shrinkage can help to reduce the prediction error if the amount of information in the data is low. For scenario 1 the mean of the shrinkage factor is 0.84 and the mean reduction of the prediction error is 0.14. The corresponding values for scenario 4 are 0.98 and 0.001; here all shrinkage factors are close to one and the predictors with and without shrinkage are nearly identical. However, the positive correlation between the shrinkage factor c and the reduction in prediction error is counter-intuitive. To get more insight, the data for scenario 1, with a small amount of information, are shown in the corresponding figure.

The relation between the reduction in prediction error due to shrinkage and the prediction error of the OLS models is shown for three categories of the shrinkage factor c (small, intermediate and close to one). The frequencies of these categories among the 10,000 simulations are 1754, 7740 and 506, respectively. The upper panel shows the apparent (estimated) prediction errors based on cross-validation and the apparent reduction achieved by global shrinkage. The differences between the three categories are small, but they are in line with the intuition that the largest reduction is achieved when the shrinkage factor is small. The quartiles (25%, 50%, 75%) of the apparent reduction are 0.09, 0.15, 0.27 for the smallest shrinkage factors and −0.07, −0.02, 0.10 for factors close to one. The middle panel shows the actual (true) prediction error based on our knowledge of the true model. Here, the picture is completely different. Reduction of the prediction error only occurs when the shrinkage factor is close to one and the OLS prediction error is large. Substantial shrinkage tends to increase the prediction error. The quartiles of the true reduction are −0.29, −0.13, 0.04 for c < 0.8, then 0.05, 0.18, 0.31 for the intermediate category and 0.19, 0.28, 0.38 for factors close to one. The lower panel shows the relation between the apparent and the actual reduction. At first sight the results are counter-intuitive. This phenomenon is extensively discussed in […].

A refinement is parameterwise shrinkage (PWSF), in which each regression coefficient gets its own shrinkage factor. The corresponding predictor is

ŷ(x) = b_{0} + (c ∘ b)^{T}x.

Here, c = (c_{1}, …, c_{p}) is a vector of shrinkage factors, one factor c_{j} for each coefficient b_{j}, and “∘” stands for coordinate-wise multiplication: c ∘ b = (c_{1}b_{1}, …, c_{p}b_{p}). This way of regularization is in the spirit of Breiman’s Garrote […]. In analogy with the global shrinkage factor, the vector c can be estimated by cross-validation; this could be obtained by regression without intercept of y_{i} − ȳ on the cross-validated component contributions b_{j,−i}x_{ij}.
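The description above can be sketched as follows. This is only one illustrative reading of the procedure, assuming mean-zero covariates; the exact handling of intercept and centering in the original implementation is not specified here:

```python
import numpy as np

def pwsf(X, y):
    """Parameterwise shrinkage factors: no-intercept regression of y - ybar on
    the leave-one-out per-covariate contributions b_j * x_ij."""
    n, p = X.shape
    Z = np.empty((n, p))
    for i in range(n):
        mask = np.arange(n) != i
        Xd = np.column_stack([np.ones(n - 1), X[mask]])
        coef, *_ = np.linalg.lstsq(Xd, y[mask], rcond=None)
        Z[i] = coef[1:] * X[i]            # cross-validated contributions b_j * x_ij
    c, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)  # no intercept
    return c                              # one shrinkage factor per covariate

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(size=100)        # one strong covariate, four redundant
c = pwsf(X, y)
```

As the text goes on to discuss, the factor for the strong covariate tends to be close to one, while the factors for redundant covariates behave erratically.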

Although this is against the advice of […], we first investigate parameterwise shrinkage in the full model without any selection.

Using PWSF in this way, the average prediction error increases when compared with the OLS predictor. The increase is large (about 10%) in scenario 1. In scenario 4 the increase is moderate, but still present. Moreover, the estimated prediction error obtained from the cross-validation fit is far too optimistic (data not shown). The explanation is that parameterwise shrinkage is not able to handle the redundant covariates with no effect at all. This can be seen from the box plots in the corresponding figure.

For the redundant covariates the shrinkage factors are all over the place. Even variables with a weak effect sometimes have negative PWSF values. For the strong covariates the factors are quite well-behaved, despite the erratic behavior for the other ones. The conclusion must be that in models with many predictors, parameterwise shrinkage without prior selection of the stronger predictors cannot be recommended. The behavior is a bit better if negative shrinkage values are set to zero, but altogether such a constraint is not sufficient. This is completely in line with Sauerbrei’s original suggestion.

Following […], backward elimination (BE) with significance levels α = 0.01, 0.05 and 0.157 is used for variable selection.
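A minimal sketch of BE driven by OLS t-test p-values follows; it is a simplified stand-in for the procedure used in the paper, not the authors’ code:

```python
import numpy as np
from scipy import stats

def backward_elimination(X, y, alpha=0.05):
    """Repeatedly drop the covariate with the largest p-value until all
    remaining covariates have p-values below alpha. Returns column indices."""
    selected = list(range(X.shape[1]))
    while selected:
        n = len(y)
        Xd = np.column_stack([np.ones(n), X[:, selected]])
        coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        resid = y - Xd @ coef
        df = n - Xd.shape[1]
        s2 = resid @ resid / df                      # residual variance estimate
        cov = s2 * np.linalg.inv(Xd.T @ Xd)          # covariance of coefficients
        t = coef[1:] / np.sqrt(np.diag(cov)[1:])     # t-statistics (no intercept)
        pvals = 2 * stats.t.sf(np.abs(t), df)
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:
            break                                    # everything is significant
        selected.pop(worst)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=400)
sel = backward_elimination(X, y, alpha=0.01)
```

With n = 400 and two strong effects, the two informative covariates survive essentially always, while each noise covariate is retained with probability near α.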

The “softest” definition of model selection is the selection of covariates to be used in further research. The “optimal” model contains only the important covariates, i.e. the ones that have an influence on the outcome in the full model. In the simulation those are the covariates X_{4}, …, X_{10}. If these are selected, the other ones are redundant. However, if one of the important covariates is not selected, other non-important covariates can come to the rescue if they are correlated with the non-selected important covariate(s). In the simulation data, non-important covariates that can play such a role are, for example, X_{1} (correlated with X_{5} and X_{10}) and X_{2} (correlated with X_{6}), as can be seen from the corresponding table.

In contrast, the full model keeps all covariates and treats them in a similar way, even when their effects are very small. Such a model is impractical, therefore “not clinically useful” and likely to be “quickly forgotten” […].

From the table it can be seen that this indeed happens for the pair X_{1} and X_{5}: in 19.4% of the simulations X_{1} is selected while the important variable X_{5} is not selected.

The effect of selection at the three levels is shown in Figures 4-6.

The comparison of Figures 5 and 6 nicely shows the balance between allowing redundant covariates (covariates without effect in the selected model) and allowing loss of explained variation. One figure shows the number of redundant covariates; the other shows the loss of explained variation, measured relative to the total variance, which is equal to 12.5 in scenarios 1 and 3 (σ^{2} = 6.25) and 8.75 in scenarios 2 and 4 (σ^{2} = 2.5).

Generally speaking, the loss of R^{2} depends on both the significance level used in the selection process and the amount of information in the data, reflected by the residual variance and the sample size. In scenario 4, with a high amount of information, the loss of R^{2} is negligible and the number of redundant covariates can be controlled by taking a strict level such as α = 0.01. In scenario 1 there is a substantial loss of R^{2} if the selection is too strict, and a less strict level such as α = 0.157 might be more appropriate.

As mentioned above and seen in the table, selection works well if the model includes X_{5} and excludes X_{1}, which is the case in 69% of the replications. Here the mean loss is smallest. X_{5} is erroneously excluded in about 28% of the models. In about half of them the correlated variable X_{1} is included, reducing the R^{2} loss caused by the exclusion of X_{5} substantially.

The prediction error of a particular model depends on the number of redundant covariates and the loss of explained variation. The average reduction of prediction error, compared with no selection, is shown in the corresponding figure.

It nicely shows that the optimal level depends on the amount of information in the data. It also shows that moderate selection at α = 0.157, in a univariate situation equivalent to AIC or Mallows’ C_{p}, can do very little harm. Even stricter selection always gives better results than no selection. For the small sample size the relative reduction in prediction error is small. For the large sample size, elimination of several variables reduces the relative prediction error substantially under strict selection. The average number of selected variables is 7.02 for scenario 3 and 7.40 for scenario 4.

The prediction errors of the selected models can be estimated by cross-validation in such a way that the whole selection procedure is carried out within each cross-validation data set. As in Section 3, this yields a correct estimate of the average prediction error. Thus, it could be used to select the “optimal” significance level in general, but it will not necessarily yield the best procedure for the actual data set at hand.

A common error in model selection is to use cross-validation after selection to estimate the prediction error and to select the best procedure. As pointed out in […], this is flawed. Comparing Figures 7 and 9 shows that cross-validation after selection is far too optimistic about the reduction of the prediction error and is not able to notice the poor performance of selection at a strict level in scenarios 1 and 2. Moreover, the same over-optimism can be seen in the corresponding table.

While cross-validation after selection is not able to select the best model, it might be of interest to see whether cross-validation based shrinkage after selection can help to improve the model. The results are shown in the corresponding table.

On average, parameterwise shrinkage gives better predictions than global shrinkage when applied after selection. An intuitive explanation is that small effects that just survive the selection are more prone to selection bias and can therefore profit from shrinkage. In contrast, selection bias plays no role for large effects and shrinkage is not needed; see […].

This can also be investigated by looking at the mean squared estimation errors of the regression coefficients, conditional on the selection of covariates in the model. The reference values are the optimal regression coefficients in the selected model. As discussed in Section 2, these can differ from the coefficients of the full model if there is correlation between the covariates.

It is clear that parameterwise shrinkage helps to reduce the effect of redundant covariates that only enter by chance, while global shrinkage is not able to make that distinction. The precise mechanism is not quite clear yet. To get a better feeling for what is going on, scatterplots of the parameterwise shrinkage factors versus the OLS estimates are shown for the covariates X_{3} (no effect), X_{6} (weak effect) and X_{9} (strong effect). These covariates are chosen because the optimal parameter value, when selected, does not depend on which other covariates are selected as well: X_{3} is independent of all other variables, and the other two variables are only correlated with one variable without influence. Therefore, the parameter estimates are theoretically equal to the true values in the full model. If the optimal value varied with the selection, the graphs would be a bit harder to interpret.

For X_{3}, the covariate without effect and without correlation, the inclusion frequency is close to the type I error, and the parameter estimates are positive in about half of these cases and negative in the other half. The variable is selected in replications in which the estimated regression coefficient is by chance most heavily overestimated (in absolute terms) compared to the true (null) effect. One would hope that PWSF would correct these chance inclusions by a rule like “the larger the absolute effect, the smaller the shrinkage factor”. Although most shrinkage factors are much lower than one, no such clear relation is visible.

A similar observation transfers to the plot for X_{9}, which is selected in all replications; selection bias is therefore no issue for this covariate. The hope is that PWSF would move the estimate close to the true value β = 1.0. Generally speaking that does not happen: c increases slowly with b. Most values are close to 1, indicating that shrinkage is not required. The only “hoped for” observation can be made for the cases where the correlated variable X_{13} is included (plot not shown). X_{13} has no effect, and its selection frequency (5.9%) agrees well with the type I error. If X_{13} is included, the shrinkage factors for X_{9} show a decreasing trend, with c-values clearly below 1 if the estimate overestimates the true value and values around 1 if it does not. One might say that parameterwise shrinkage helps to correct for chance inclusions of X_{13}, but not for estimation errors.

X_{6} is a covariate with a weak effect. It is not included in 19% of the replications, presumably cases in which the true effect was underestimated by chance. The overall picture for the cases where it is included shows a stronger increasing trend (compared with X_{9}) if b is large. Here, X_{2} plays the role of a confounding redundant covariate. In cases where it is included (plot not shown), the shrinkage factor for X_{6} is rather stable.

Some understanding can be obtained from the observation in […].

The conclusion so far is that coordinate-wise shrinkage is helpful after selection. However, it is not clear how to select the significance level. In a real analysis, the level should be determined subjectively by the aim of the study […].

The simulations discussed above were compared with the results of the LASSO, with cross-validation based selection of the penalty parameter. Because the LASSO is quite time-consuming, it was only applied to the first 2000 data sets for each combination of n and σ^{2}.
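As an illustration, scikit-learn’s `LassoCV` can serve as a stand-in for LASSO with cross-validation based choice of the penalty; the data-generating settings below merely echo a subset of the simulation design (three of the nonzero coefficients and the scenario-1 residual standard deviation), not the full correlated design:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical toy data: 15 independent covariates, three nonzero effects.
rng = np.random.default_rng(0)
n, p = 100, 15
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[3, 7, 8]] = [-0.5, 1.0, 1.0]            # echoes beta_4, beta_8, beta_9
y = X @ beta + rng.normal(scale=2.5, size=n)  # sigma^2 = 6.25 as in scenario 1

# 10-fold CV chooses the penalty; coefficients at exactly 0 are "not selected".
lasso = LassoCV(cv=10, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
```

Because the same penalty both selects and shrinks, the set `selected` typically contains some noise covariates alongside the true effects, which is exactly the behavior discussed in the text.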

The corresponding table shows the inclusion frequencies for the different covariates. Relevant variables are nearly always included, but the LASSO is not able to exclude redundant covariates if there is much signal in the other ones. For example, in scenario 4 the inclusion frequencies are 52% and 54% for the two uncorrelated variables without influence. The probable reason is that selection and shrinkage are controlled by the same penalty term. The phenomenon is also nicely illustrated by the corresponding figure.

Finally, the question how the prediction error of the LASSO compares with the models based on selection and shrinkage is answered by the corresponding figure.

The conclusion must be that the LASSO is no panacea. Concerning prediction error, it seems to be adequate for noisy data (scenarios 1 and 2), but it is beaten by variable selection followed by some form of shrinkage if the data are less noisy (scenario 4). Most likely, that is caused by the inclusion of too many variables without effect. Variable selection combined with parameterwise shrinkage performs quite well. The choice of a suitable significance level seems to depend on the amount of information in the data. Whereas a strict level such as α = 0.01 has the best performance in scenario 4, this level seems to be too low in the other scenarios; there, selection with α = 0.05 or α = 0.157 has better prediction performance. Using post-selection shrinkage slightly reduces the prediction errors, with an advantage for parameterwise shrinkage.

For illustration, we consider one specific aspect of a study on ozone effects on school children’s lung growth. The study was carried out from February 1996 to October 1999 in Germany on 1101 school children in the first and second primary school classes (6 - 8 years). For more details see […].

First, the whole data set is analyzed using backward elimination in combination with global and parameterwise shrinkage, and using the LASSO. Selected variables with corresponding parameter estimates are given in the corresponding table.

Mean squared prediction errors for the full model and the BE models were obtained through double cross-validation, in the sense that for each cross-validated prediction the shrinkage factors were determined by cross-validation within the cross-validation training set. The prediction error for the LASSO is based on single cross-validation because double cross-validation turned out to be too time-consuming; the LASSO prediction error might therefore be too optimistic.
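The double (nested) cross-validation idea, in which every tuning decision is re-made inside each outer training fold so that the outer-loop errors remain untouched by the tuning, can be sketched generically. Here the inner tuning step is the LASSO penalty, purely as an example; this is not the authors’ exact procedure:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = X[:, 0] + rng.normal(size=120)

errs = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Inner CV (on the training fold only) picks the tuning parameter;
    # the test fold never influences that choice.
    inner = LassoCV(cv=5, random_state=0).fit(X[train], y[train])
    pred = inner.predict(X[test])
    errs.extend((y[test] - pred) ** 2)
mse_double_cv = float(np.mean(errs))
```

The same pattern applies when the inner step is the estimation of shrinkage factors, as in the analysis described above: the outer prediction errors then honestly reflect the whole model-building procedure.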

MSE is very similar for all models, irrespective of applying shrinkage or not (range 0.449 - 0.475; the full model with PWSF is the only exception), but the number of variables in the model is very different. BE(0.01) selects a model with 4 variables, and the corresponding PWSF are all close to 1. Three variables are added if 0.05 is used as significance level. Using 0.157 selects a model with 12 variables; two of them have a very low PWSF (below 0.3), indicating that these variables may better be excluded. The LASSO selects a complex model with 17 variables. R^{2} equals 0.67 in the full model and drops only slightly to 0.64 for the BE(0.01) model; for the LASSO model it is 0.66.

Although they carry relevant information, the double cross-validation results for the full data set lack the intuitive appeal of the split-sample approach. To get closer to that intuition, the following “dynamic” analysis scheme is applied. First the data are sorted randomly; next the first n_{train} observations are used to derive a prediction model, which is used to predict the remaining observations. This is done for n_{train} = 100, 150, 200, 250, 300, 350 and repeated 100 times. In that way an impression is obtained of how the different approaches behave with growing information. The results are shown in graphs. For n_{train} = 350, the LASSO selects on average 14.4 variables, whereas BE(0.01) selects only 4.0 variables.

In a second example we illustrate some issues in a study with one dominating variable. The data were first analysed in […].

Excluding the dominating variable results in the inclusion of other variables for all approaches. As in the ozone data, the LASSO hardly eliminates any variable, but its MSE is not better than that of BE(0.01) followed by PWSF. The PWSF of all variables selected by BE(0.01) are close to 1, whereas the variables selected additionally by BE(0.157) all have PWSF values below 0.9 and sometimes around 0.6. This example confirms that BE(0.01) followed by PWSF gives similar prediction MSEs, but includes a much smaller number of variables.

Building a suitable regression model is a challenging task if a larger number of candidate predictors is available. Having a situation with about 10 to 30 variables in mind, the full model is often unsuitable and some type of variable selection is required. Obviously, subject matter knowledge has to play a key role in model building, but often it is limited […].

In agreement with [21], we consider backward elimination a suitable approach, provided the sample size is not too small and the significance level is sensibly chosen according to the aim of the study. For a more detailed discussion see chapter 2 of […].

In a simulation study and two examples we discuss the value of cross-validation, assess a global and a parameterwise cross-validation based shrinkage approach, both without and with variable selection, and compare the results with the LASSO procedure, which combines variable selection with shrinkage [2,4,5]. As discussed in the introduction, it is often necessary to derive a suitable explanatory model, which means that the effects of individual variables are important. In this respect a sparse model has advantages, both from a statistical and from a clinical point of view. Similar arguments have been made in a related context […].

Our simulation design was used before for investigations of different issues […]. The amount of information in the data has an influence on the selected models and therefore on the comparison between different procedures.

The findings of Section 3 confirm that cross-validation does not estimate the performance of the model at hand, but the average performance over all possible “training sets”. The results of Section 4 confirm that global shrinkage can help to improve prediction performance in data with little information [2,11], as in the first scenario with n = 100 and σ^{2} = 6.25. However, the results show that the actual value of the global shrinkage factor is hard to interpret […].

Our results confirm that it does not make any sense to use parameterwise shrinkage in the full model […].

Most often, prediction error is the main criterion for comparing results from variable selection procedures. This implies that a suitable predictor with favorable statistical properties is the main (or only) interest of an analysis. In contrast to such a statistically guided analysis, researchers often aim to derive a suitable explanatory model and are willing to accept a model with slightly inferior prediction performance […].

With the hype surrounding high-dimensional data, the LASSO approach [5] has become very popular.

From the large number of newer proposals combining variable selection and shrinkage simultaneously, we considered only the LASSO in this work. A comparison of approaches in low-dimensional survival settings is given in […].

Like the choice of the penalty parameter in the LASSO, the choice of the significance level in the variable selection is crucial. Double cross-validation might be helpful in selecting it, but reflection is needed about the criterion to be used. Prediction error is the obvious choice, but it does not reflect the need for a sparse model […].

We thank Gabi Ihorst for providing the ozone study data.