Cross-Validation, Shrinkage and Variable Selection in Linear Regression Revisited

In deriving a regression model, analysts often have to use variable selection, despite the problems introduced by data-dependent model building. Resampling approaches have been proposed to handle some of the critical issues. In order to assess and compare several strategies, we conduct a simulation study with 15 predictors and a complex correlation structure in the linear regression model. Using sample sizes of 100 and 400 and estimates of the residual variance corresponding to R² of 0.50 and 0.71, we consider 4 scenarios with varying amounts of information. We also consider two examples with 24 and 13 predictors, respectively. We discuss the value of cross-validation, shrinkage and backward elimination (BE) with varying significance level. We assess whether 2-step approaches using global or parameterwise shrinkage (PWSF) can improve selected models, and we compare the results to models derived with the LASSO procedure. Besides the MSE, we use model sparsity and further criteria for model assessment. The amount of information in the data has an influence on the selected models and on the comparison of the procedures. None of the approaches was best in all scenarios. The performance of backward elimination with a suitably chosen significance level was not worse than that of the LASSO, and the models selected by BE were much sparser, an important advantage for interpretation and transportability. Compared to global shrinkage, PWSF had better performance. Provided that the amount of information is not too small, we conclude that BE followed by PWSF is a suitable approach when variable selection is a key part of data analysis.


Introduction
In deriving a suitable regression model, analysts are often faced with many predictors which may have an influence on the outcome. We will consider the low-dimensional situation with about 10 to 30 variables; the much more difficult task of analyzing 'omics' data with thousands of measured variables will be ignored. Even for 10+ variables, selection of a more relevant subset of these variables may have advantages, as it results in simpler models which are easier to interpret and which are often more useful in practice. However, variable selection can introduce severe problems, such as biases in estimates of regression parameters and corresponding standard errors, instability of selected variables or an overoptimistic estimate of the predictive value [1][2][3][4].
To overcome some of these difficulties, several proposals have been made during the last decades. To assess the predictive value of a regression model, cross-validation is often recommended [2]. For models whose main interest is a good predictor, the LASSO of [5] has gained some popularity. By minimizing the residuals under a constraint, it combines variable selection with shrinkage. It can be regarded, in a wider sense, as a generalization of an approach by [2], who propose to improve predictors with respect to the average prediction error by multiplying the estimated effect of each covariate with a constant, an estimated shrinkage factor. As the bias caused by variable selection is usually different for individual covariates, [4] extends this idea by proposing parameterwise shrinkage factors. The latter approach is intended as a post-estimation shrinkage procedure after selection of variables. To estimate shrinkage factors, the latter two approaches use cross-validation calibration, and both can also be used for GLMs and regression models for survival data.
When building regression models it has to be distinguished whether the only interest is a model for prediction or whether an explanatory model is required, in which it is also important to assess the effect of each individual covariate on the outcome. Whereas the mean square error of prediction (MSE) is the main criterion in the former situation, it is important to consider further quality criteria for a selected model in the latter case. At least interpretability, model complexity and practical usefulness are relevant [6]. For the low-dimensional situation we consider backward elimination (BE) the most suitable variable selection procedure. Advantages compared to other stepwise procedures were given by [7]. For a more general discussion of issues in variable selection, and for arguments to favor BE over other stepwise procedures and over subset selection procedures using various penalties (e.g. AIC and BIC), see [4] and [8]. To handle the important issue of model complexity we will use different nominal significance levels for BE. The two post-estimation shrinkage approaches mentioned above will be used to correct the parameter estimates of models selected by BE. There are many other approaches to model building. Despite their enormous practical importance, hardly any properties are known and the number of informative simulation studies is limited. As a result, many issues are poorly understood, guidance for building multivariable regression models is limited and a large variety of approaches is used in practice.
We will focus on a linear regression model with a p-dimensional covariate vector X and n independent observations. The standard approach without variable selection is ordinary least squares (OLS). In a simulation study we will investigate how much model building can be improved by variable selection and cross-validation based shrinkage. The paper reviews and extends earlier work by the authors [2,4,9]. Elements added are a thorough reflection on the value of cross-validation and a comparison with Tibshirani's LASSO [5]. With an interest in deriving explanatory models, we will not use the MSE as the only criterion, but will also consider model complexity and the effects of individual variables. Two larger studies, analyzed several times in the literature, will also be used to illustrate some issues and to compare the results of the procedures considered.
The paper is structured in the following way. Section 2 describes the design of the simulation study. Section 3 reviews the role of cross-validation in assessing the prediction error of a regression model and studies its behavior in the simulation study. Section 4 reviews global and parameterwise shrinkage and assesses the performance of cross-validation based shrinkage in the simulation data. The next Sections 5 and 6 discuss the effect of model selection by BE and the usefulness of cross-validation and shrinkage after selection. Section 7 compares the performance of post-selection shrinkage with the LASSO. Two real-life examples are given in Section 8. Finally, the findings of the paper are summarized and discussed in Section 9.

Simulation Design
The properties of the different procedures are investigated by simulation using the same design as in [10]. In that design the number of covariates is p = 15, and the covariates have a multivariate normal distribution with means 0 and standard deviations 1. The covariates X3 and X15 are uncorrelated with all the other ones. The intercept is taken to be β0 = 0; the regression coefficients and the correlation structure are given in Table 1. The variance of the linear predictor, β'C_Xβ with C_X the covariance matrix of the X's, equals 6.250. The residual variance is taken to be σ² = 6.25 or σ² = 2.5; the corresponding values of the multiple correlation coefficient are R² = 0.50 and R² = 0.71. Table 1 also shows the resulting coefficients for the case that single covariates are dropped from the model. Apparently, the effect of each covariate is partly "inherited" by some of the other covariates. A simple pattern of inheritance is seen for X6, which only correlates with X2: if X6 is dropped, X2 takes over part of its effect. This saves a little bit of the variance of the linear predictor; it drops from 6.250 to 6.063, while it would have dropped to 6.000 if X6 were independent of the other predictors. A more complicated pattern is seen for X7. If that one is dropped, X14, X8 and X4 inherit the effects. The covariates X14 and X8 show up because they are directly correlated with X7; covariate X4 shows up because it is correlated with X8. The variance of the linear predictor drops from 6.250 to 6.107. Since X3 and X15 are independent of the other covariates, they cannot inherit effects.
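Data of this general type can be generated in a few lines. The following sketch uses a hypothetical AR(1)-style correlation matrix and placeholder coefficients; the actual correlations and coefficients of the design are those of Table 1, not these:

```python
import numpy as np

# Illustrative stand-in for a design of this kind (NOT the exact design of
# [10]): 15 standard-normal covariates, 7 with non-zero effects, a
# hypothetical AR(1)-style correlation structure, and two covariates made
# independent of all the others.
rng = np.random.default_rng(2024)
p, n, rho = 15, 100, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
for j in (2, 14):                       # decorrelate X3 and X15 (0-based)
    Sigma[j, :] = Sigma[:, j] = 0.0
    Sigma[j, j] = 1.0
beta = np.zeros(p)                      # placeholder coefficients
beta[[3, 4, 5, 6, 8, 9, 13]] = [0.5, 0.5, 0.5, 1.0, 1.0, 0.5, 1.0]
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
lp_var = float(beta @ Sigma @ beta)     # variance of the linear predictor
sigma2 = lp_var                         # residual variance giving R^2 = 0.5
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
```

Choosing the residual variance equal to the variance of the linear predictor reproduces the R² = 0.50 situation; halving it would move towards the R² = 0.71 situation.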

The Value of Cross-Validation
Cross-validation is often recommended as a robust way of assessing the predictive value of a statistical model. The simplest approach is leave-one-out cross-validation, in which each observation is predicted from a model using all the other observations. The generalization is k-fold cross-validation, in which the observations are randomly divided into k "folds" of approximately equal size and the observations in one fold are predicted using the observations in the other folds. In this paper leave-one-out cross-validation (k = n) will be used, but the formulas presented apply more generally.

Let b0^(−i) and b^(−i) be the estimates obtained in the cross-validation subset in which observation i is not included. The cross-validation based estimate of the prediction error is defined as Err_CV = (1/n) Σ_i (y_i − b0^(−i) − x_i'b^(−i))². The true prediction error of the model with estimates b0 and b from the "original" model using all n observations is defined as Err = E(Y − b0 − X'b)², taken over a new observation (X, Y). In the simulation study it is given by Err = σ² + (b0 − β0)² + (b − β)'C_X(b − β). The results in the simulation study using all covariates without any selection are given in Table 2. Cross-validation does a good job in estimating the mean of Err, but the correlation between Err_CV and the true Err is close to zero; Err_CV must be interpreted as an estimate of the average prediction error over all possible "training sets". However, it might be helpful in selecting procedures that reduce the prediction error.
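The leave-one-out estimate Err_CV described above can be sketched as follows; the demonstration data are a minimal synthetic example, not the simulation design of the paper:

```python
import numpy as np

def loo_cv_error(X, y):
    """Leave-one-out cross-validated squared prediction error for OLS.

    Each observation is predicted from the model fitted on the other n-1
    observations; the average squared residual estimates Err_CV."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Z = np.column_stack([np.ones(n - 1), X[keep]])
        b = np.linalg.lstsq(Z, y[keep], rcond=None)[0]
        pred = b[0] + X[i] @ b[1:]
        errs[i] = (y[i] - pred) ** 2
    return float(errs.mean())

# Small synthetic check with residual variance 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=50)
err_cv = loo_cv_error(X, y)
```

Because each fit uses n − 1 observations, Err_CV is slightly above the residual variance, reflecting the estimation error in the coefficients.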
Finally, it should be pointed out that the cross-validation results are in close agreement with the model based estimates of the prediction error as discussed in the same section of [2].

Global Shrinkage
As argued by [2,11], the predictive performance of the resulting model can be improved by shrinkage of the model towards the mean. This gives the predictor ŷ = ȳ + c(b0 + x'b − ȳ), with shrinkage factor c. In the following, c will be called the global shrinkage factor. Under the assumption of homoskedasticity, the optimal value of c can be estimated as ĉ_heur = (SS_exp − p·s²)/SS_exp, with SS_exp the explained sum of squares, s² the estimate of the residual variance and p the number of predictors.
A model-free estimator can be obtained by means of cross-validation. Let η^(−i)(x_i) = b0^(−i) + x_i'b^(−i) be the linear predictor obtained in the cross-validation subset in which observation i is not included; then c can be estimated by minimizing Σ_i (y_i − ȳ − c(η^(−i)(x_i) − η̄))² with respect to c, where η̄ is the mean of the cross-validated linear predictors.
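This cross-validation calibration can be sketched as a centered no-intercept regression of the outcome on the leave-one-out linear predictor; the demonstration data are illustrative only:

```python
import numpy as np

def global_shrinkage(X, y):
    """Cross-validation calibrated global shrinkage factor.

    The leave-one-out linear predictor is computed for each observation;
    c is the slope of the no-intercept regression of the centered outcome
    on the centered cross-validated predictor."""
    n = len(y)
    eta = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Z = np.column_stack([np.ones(n - 1), X[keep]])
        b = np.linalg.lstsq(Z, y[keep], rcond=None)[0]
        eta[i] = b[0] + X[i] @ b[1:]
    ec, yc = eta - eta.mean(), y - y.mean()
    return float((ec @ yc) / (ec @ ec))

# Noisy illustrative data: 8 predictors, only 2 with an effect.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
beta = np.zeros(8)
beta[:2] = 1.0
y = X @ beta + rng.normal(scale=2.0, size=60)
c_hat = global_shrinkage(X, y)
```

With this noisy setup the estimated c is typically well below 1, in line with the heuristic ĉ_heur of the text.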
This estimate can be obtained by regressing the centered outcome on the centered cross-validated linear predictor without intercept. In the simulation study, shrinkage helps most when the amount of information in the data is low. For scenario 1 the mean of the shrinkage factor is 0.84 and the mean reduction of prediction error is 0.14. The corresponding values for scenario 4 are 0.98 and 0.001; there all shrinkage factors are close to one and the predictors with and without shrinkage are nearly identical. However, the positive correlation between the shrinkage factor c and the reduction in prediction error is counter-intuitive. To get more insight, the replications of scenario 1, with a small amount of information, are split into the categories c < 0.8, 0.8 ≤ c < 0.9 and c ≥ 0.9; the results for the 10,000 simulations are shown in Figure 1. The upper panel shows the apparent (estimated) prediction errors based on cross-validation and the apparent reduction achieved by global shrinkage. The differences between the three categories are small, but they are in line with the intuition that the largest reduction is achieved when the shrinkage factor is small. The middle panel shows the actual (true) prediction errors: for small shrinkage factors, shrinkage even tends to increase the prediction error. The quartiles (25%, 50%, 75%) of the true reduction are −0.29, −0.13, 0.04 for c < 0.8 and 0.19, 0.28, 0.38 for c ≥ 0.9, with the middle category in between. The lower panel shows the relation between the apparent and the actual reduction. At first sight the results look counter-intuitive; this phenomenon is extensively discussed in [9]. What happens can be understood from the heuristic shrinkage factor ĉ_heur.
If b is "large" by random fluctuation, the observed explained sum of squares SS_exp is large, so ĉ_heur stays close to 1 and does not "push" b in the direction of the true β. If b is "small" by random fluctuation, SS_exp is small, so ĉ_heur will be close to 0 and might "push" b in the wrong direction. This explains the overall negative correlation (r = −0.253) between the apparent and the actual reduction of the prediction error. It must be concluded that it is impossible to predict from the data whether shrinkage will be helpful for a particular data set or not. The chances are given under "frac.pos." in Table 4. They are quite high in noisy data, but that gives no guarantee for a particular data set.

Parameterwise Shrinkage
[4] suggested covariate-specific shrinkage factors, coined parameterwise shrinkage factors (PWSF), giving the predictor ŷ = b0 + Σ_j c_j b_j x_j. This way of shrinking is reminiscent of Breiman's garrote [12]; see also [9,13]. Sauerbrei suggested to apply parameterwise shrinkage after model selection and to estimate the vector c by cross-validation. As for the global shrinkage factor, this can be obtained by a regression without intercept on the cross-validated component predictors. Although it is against the advice of [4], parameterwise shrinkage was also applied in the simulation data in models without selection, ignoring the restrictions 0 ≤ c_j ≤ 1 for j = 1, …, p. A summary of the results is given in Table 5. Using PWSF, the average prediction error increases when compared with the OLS predictor. The increase is large (about 10%) in scenario 1; in scenario 4 the increase is moderate, but still present. Moreover, the estimated prediction error obtained from the cross-validation fit is far too optimistic (data not shown). The explanation is that parameterwise shrinkage is not able to handle the redundant covariates with no effect at all. This can be seen from the box plots in Figure 2.
For the redundant covariates the shrinkage factors are all over the place. Even variables with a weak effect sometimes have negative PWSF values. For the strong covariates the shrinkage factors are quite well-behaved, despite the erratic behavior for the other ones. The conclusion must be that in models with many predictors, parameterwise shrinkage cannot be recommended without prior selection of the stronger predictors. The behavior is a bit better if negative shrinkage values are set to zero, but altogether such a constraint is not sufficient. This is completely in line with Sauerbrei's original suggestion.
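For completeness, the cross-validation calibration of the c_j can be sketched as follows. The function ignores the 0 ≤ c_j ≤ 1 restriction, and the demonstration data (few, strong predictors) are illustrative only; as discussed above, the procedure is intended for use after variable selection:

```python
import numpy as np

def pwsf(X, y):
    """Parameterwise shrinkage factors via cross-validation calibration.

    For each left-out observation i, the per-covariate contributions
    b_j^(-i) * x_ij are computed; the c_j are the coefficients of a
    no-intercept regression of the offset-corrected outcome on these
    p component columns (sketch, without the 0 <= c_j <= 1 restriction)."""
    n, p = X.shape
    Z = np.empty((n, p))
    b0 = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        W = np.column_stack([np.ones(n - 1), X[keep]])
        b = np.linalg.lstsq(W, y[keep], rcond=None)[0]
        b0[i] = b[0]
        Z[i] = b[1:] * X[i]
    return np.linalg.lstsq(Z, y - b0, rcond=None)[0]

# Illustrative data with three strong effects: all c_j should be near 1.
rng = np.random.default_rng(4)
X = rng.normal(size=(80, 3))
y = X @ np.array([2.0, 1.5, 1.0]) + rng.normal(size=80)
c = pwsf(X, y)
```

With only strong effects the c_j stay close to one; the erratic behavior shown in Figure 2 appears when redundant covariates are added.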

Number of Variables Selected
The "softest" definition of model selection is the selection of covariates to be used in further research. The "optimal" model contains only the important covariates; if these are selected, the other ones are redundant. However, if one of the important covariates is not selected, other non-important covariates can come to the rescue if they are correlated with the non-selected important covariate(s). In the simulation data, non-important covariates that can play such a role are X1, X2, X8 and X14, as can be seen from Table 1. The effect of omitting important covariates is the loss of explained variation in the optimal model after selection or, equivalently, the introduction of additional random error. If no selection takes place there is no loss of explained variation, but there is a large number of non-important covariates, leading to larger estimation errors. Even more important is a severe loss in the general usability of such predictors [4,6]. Consider, for example, a prognostic model comprising many variables. All constituent variables would have to be measured in an identical or at least in a very similar way, even when their effects were very small. Such a model is impractical, therefore "not clinically useful" and likely to be "quickly forgotten" [14].
From Figure 3 we learn that in the easy situation with a lot of information (scenario 4) all important variables are selected in nearly all replications. For the 8 variables without an influence, the selection frequencies agree closely with the nominal significance level. Among the non-important covariates, X1 has the largest selection frequency; the reason is its strong correlation with X5. In 19.4% of the simulations, X1 is selected while the important variable X5 is not selected. The effect of selection at the three significance levels is shown in Figures 4-6.
Figure 4 summarizes the number of included variables for the different scenarios; the results depend on the amount of information in the data, as reflected by the residual variance σ² and the sample size n. In the simple scenario 4 all seven relevant variables are selected in nearly all replications. As mentioned above and seen in Table 1, the variable X1 (correlation coefficient 0.7) can partly take over the role of X5 if the latter is not selected. In the scenarios with less information, X5 is erroneously excluded in about 28% of the models; in about half of them the correlated variable X1 is included, which substantially reduces the loss of R² caused by excluding X5.
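Backward elimination at a nominal significance level can be sketched as follows; for simplicity a normal approximation replaces the exact t distribution in the p-values, which is adequate for the sample sizes considered here, and the demonstration data are illustrative only:

```python
import numpy as np
from math import erfc, sqrt

def backward_elimination(X, y, alpha=0.05):
    """Backward elimination for OLS: repeatedly drop the covariate with the
    largest p-value until all remaining covariates are significant at the
    nominal level alpha. Returns 0-based indices of the selected covariates.
    P-values use a normal approximation to the t distribution."""
    selected = list(range(X.shape[1]))
    while selected:
        Z = np.column_stack([np.ones(len(y)), X[:, selected]])
        b = np.linalg.lstsq(Z, y, rcond=None)[0]
        resid = y - Z @ b
        df = len(y) - Z.shape[1]
        s2 = resid @ resid / df
        se = np.sqrt(np.diag(s2 * np.linalg.inv(Z.T @ Z)))[1:]
        t = np.abs(b[1:]) / se
        pvals = np.array([erfc(ti / sqrt(2.0)) for ti in t])  # two-sided
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break                       # everything remaining is significant
        selected.pop(worst)
    return selected

# Illustrative data: one strong predictor, four pure-noise predictors.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + rng.normal(size=100)
sel = backward_elimination(X, y, alpha=0.05)
```

The strong predictor survives essentially always, while each noise predictor is retained with probability roughly equal to the nominal level, mirroring the selection frequencies discussed above.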

Assessment of Prediction Error
The prediction error of a particular model depends on the number of redundant covariates and the loss of explained variation. Figure 7 shows the average reduction of prediction error when compared with no selection: selection always gives better results than no selection. For the small sample size the relative reduction in prediction error is small. For the large sample size, the elimination of several variables reduces the relative prediction error substantially for α = 0.05; the average number of selected variables is 7.02 for scenario 3 and 7.40 for scenario 4. Figure 8 shows the prediction error ranking of the different selection levels in the individual simulation data sets. Mean ranks were used in replications where the same model was selected. As observed before, the variation is very big and there is no outspoken winner, but there are some outspoken losers. In all scenarios it is bad not to select at all. However, for a small sample size it is even worse to use a very small selection level.
The prediction errors of the selected models can be estimated by cross-validation in such a way that in each cross-validation data set the whole selection procedure is carried out. As in Section 3, this will yield a correct estimate of the average prediction error. Thus, it could be used to select the "optimal" significance level in general, but it will not necessarily yield the best procedure for the actual data set at hand.

Post-Selection Cross-Validation
A common error in model selection is to use cross-validation after selection to estimate the prediction error and to select the best procedure. As pointed out by [15], this is a bad thing to do; that is exemplified by Figures 9 and 10. Selection at the 0.157 level is equivalent to using AIC, which is very close to using the cross-validated prediction error if the normal model holds true.

Post-Selection Shrinkage
While cross-validation after selection is not able to select the best model, post-selection shrinkage might still improve the predictions. On average, parameterwise shrinkage gives better predictions than global shrinkage when applied after selection. An intuitive explanation is that small effects that just survive the selection are more prone to selection bias and can therefore profit from shrinkage. In contrast, selection bias plays no role for large effects and shrinkage is not needed; see [16] and chapter 2 of [8]. Whereas global shrinkage "averages" over all effects, parameterwise shrinkage aims to shrink according to these individual needs.
This can also be investigated by looking into the mean squared estimation errors of the regression coefficients. After selection, the estimated coefficients can differ from the β_j of the full model if there is correlation between the covariates. Figure 12 shows the mean squared errors of the shrinkage based estimators relative to the mean squared errors of the OLS estimators for sample size n = 100, scenarios 1 and 2; sample size n = 400 is not shown, because post-selection shrinkage has hardly any effect there. It is clear that parameterwise shrinkage helps to reduce the effect of redundant covariates that only enter by chance, while global shrinkage is not able to make that distinction. The precise mechanism is not quite clear yet. To get a better feeling for what is going on, scatterplots of the parameterwise shrinkage factors versus the OLS estimator are shown in Figure 13 for the covariates X3 (no effect), X6 (weak effect) and X9 (strong effect). These covariates were chosen because their optimal parameter value when selected does not depend on which other covariates are selected as well: X3 is independent of all other variables, and the other two are only correlated with one variable without influence. Therefore the parameter estimates are theoretically equal to the true values in the full model. If the optimal value varies with the selection, the graphs are a bit harder to interpret.

For X3, the covariate without effect and without correlation, the inclusion frequency is close to the type I error, and in about half of these cases the parameter estimates are positive, in the other half negative. The variable is selected in replications in which the estimated regression coefficient is by chance most heavily overestimated (in absolute terms) compared to the true (null) effect. One would hope that PWSF corrects these chance inclusions by a rule like "the larger the absolute effect b, the smaller the shrinkage factor c". Although most shrinkage factors are much lower than one, such a clear rule is not visible in Figure 13. A similar observation transfers to the plot for X9, which is selected in all replications; selection bias is therefore no issue for this covariate. The hope is that PWSF moves the estimate close to the true value β = 1.0. X9 is only correlated with the redundant variable X13; X13 has no effect, and its selection frequency (5.9%) agrees well with the type I error. If X13 is included (plot not shown), the shrinkage factor for X9 shows a decreasing trend, with c-values clearly below 1 if the estimate overestimates the true value β = 1 and values around 1 if b ≤ 1. One might say that parameterwise shrinkage helps to correct for chance inclusions, but not for random estimation errors. X6 is a covariate with a weak effect. It is not included in 19% of the replications, certainly cases in which the true effect was underestimated by chance. The overall plot for the cases where it is included shows a stronger increasing trend (compared with X9), tending to a value of about c = 0.96 if b is large. Here, X2 plays the role of a confounding redundant covariate; in the cases where X2 is included (plot not shown), the shrinkage factor for X6 is rather stable with a median value of about c = 0.88. Some understanding is obtained from the observation in [2] that in the univariate case the optimal shrinkage factor is given by ĉ = (t² − 1)/t². Note that |t| ≈ 2 is roughly the cut-point for inclusion and that the PWSF tends to increase with t for the included variables. For large absolute t-values, the PWSFs are close to one, whereas they drop to about 0.8 for t close to 2. The relation between PWSF and t is similar for all three covariates X3, X6 and X9; the difference between the covariates is the size of the effect and, correspondingly, the range of t after selection.
The conclusion so far is that parameterwise shrinkage is helpful after selection. However, it is not clear how to select the significance level; in a real analysis, the level should be determined subjectively by the aim of the study [4]. In the following we will compare backward elimination with the LASSO, a procedure that combines selection, shrinkage and fine-tuning.

Comparison with LASSO
The simulations discussed above were compared with the results of the LASSO, with cross-validation based selection of the penalty parameter λ. Because the LASSO is quite time-consuming, it was only applied to the first 2000 data sets for each combination of n and σ². Figure 14 shows the distribution of the cross-validation based λ's for the different scenarios. The variation in the penalty parameter λ, even in the simple situation of scenario 4, is surprisingly large. There is some correlation with the estimated variance in the full model, but that does not explain the huge variation. Figure 15 shows the inclusion frequencies for the different covariates. Relevant variables are nearly always included, but the LASSO is not able to exclude redundant covariates if there is much signal in the other ones. For example, in scenario 4 the inclusion frequencies are 52% for X3 and 54% for X15, the two uncorrelated variables without influence. The probable reason is that selection and shrinkage are controlled by the same penalty parameter λ. The phenomenon is also nicely illustrated by Figure 16.
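A cross-validation based choice of λ of this kind can be reproduced with scikit-learn's LassoCV, where the penalty parameter is called alpha; the data below are illustrative, not the paper's design:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Illustrative data: 15 covariates, 7 with effects (placeholder values).
rng = np.random.default_rng(1)
n, p = 100, 15
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:7] = [1.5, 1.0, 1.0, 0.5, 0.5, 0.5, -0.5]
y = X @ beta + rng.normal(scale=2.0, size=n)

# LassoCV picks the penalty by k-fold cross-validation over a grid;
# sklearn's `alpha` plays the role of the lambda in the text.
lasso = LassoCV(cv=10).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))
```

Rerunning with different random seeds shows the large run-to-run variation of the cross-validated penalty discussed above, and the selected models typically include a number of the pure-noise covariates.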
Finally, the question of how the prediction error of the LASSO compares with that of the models based on selection and shrinkage is answered by Figure 17.
The conclusion must be that the LASSO is no panacea. Concerning prediction error, it seems to be fine for noisy data (scenarios 1 and 2), but it is beaten by variable selection followed by some form of shrinkage if the data are less noisy (scenario 4). Most likely, that is caused by the inclusion of too many variables without effect. Variable selection combined with parameterwise shrinkage performs quite well. The choice of a suitable significance level seems to depend on the amount of information in the data.

Ozone Data
For illustration, we consider one specific aspect of a study on ozone effects on school children's lung growth. The study was carried out from February 1996 to October 1999 in Germany on 1101 school children in the first and second primary school classes (6 - 8 years). For more details see [17]. As in [18], we use a subpopulation of 496 children and 24 variables. None of the continuous variables exhibited a strong non-linear effect, allowing us to assume a linear effect for the continuous variables in our analyses.
First, the whole data set is analyzed using backward elimination in combination with global and parameterwise shrinkage, and with the LASSO. The selected variables with their corresponding parameter estimates are given in Table 7, and the mean squared prediction errors are shown in Table 8. The t-values are only given for the full model, to illustrate the relation between the t-value and the parameterwise shrinkage factor. For variables with very large t-values, the PWSFs are close to 1. In contrast, the PWSFs are all over the place if t is small, a good indication that such variables should be eliminated.
Mean squared prediction errors for the full model and the BE models were obtained through double cross-validation, in the sense that for each cross-validated prediction the shrinkage factors were determined by cross-validation within the cross-validation training set. The prediction error for the LASSO is based on single cross-validation because double cross-validation turned out to be too time-consuming; therefore, the LASSO prediction error might be too optimistic.
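The double cross-validation scheme can be sketched as follows, with an inner leave-one-out loop calibrating a global shrinkage factor inside each outer training fold, so that the held-out fold never influences the shrinkage factor; the data are illustrative only:

```python
import numpy as np

def double_cv_error(X, y, k=5, seed=0):
    """Double cross-validation sketch: in each outer fold the entire
    procedure (OLS fit plus inner leave-one-out calibration of a global
    shrinkage factor) is repeated on the training part only."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % k
    sq = []
    for f in range(k):
        tr = folds != f
        Xt, yt = X[tr], y[tr]
        m = len(yt)
        # Inner LOO CV on the training part: cross-validated linear predictor.
        eta = np.empty(m)
        for i in range(m):
            keep = np.arange(m) != i
            Z = np.column_stack([np.ones(m - 1), Xt[keep]])
            b = np.linalg.lstsq(Z, yt[keep], rcond=None)[0]
            eta[i] = b[0] + Xt[i] @ b[1:]
        ec, yc = eta - eta.mean(), yt - yt.mean()
        c = float((ec @ yc) / (ec @ ec))       # global shrinkage factor
        # Final shrunken model on the whole training part.
        Z = np.column_stack([np.ones(m), Xt])
        b = np.linalg.lstsq(Z, yt, rcond=None)[0]
        test_lp = b[0] + X[~tr] @ b[1:]
        pred = yt.mean() + c * (test_lp - (Z @ b).mean())
        sq.extend((y[~tr] - pred) ** 2)
    return float(np.mean(sq))

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 6))
y = X @ np.array([1.0, 0.5, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=100)
err = double_cv_error(X, y)
```

Extending the inner loop to redo variable selection as well would give the full double cross-validation used for the BE models.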
MSE is very similar for all models, irrespective of applying shrinkage or not (range 0.449 - 0.475; the full model with PWSF is the only exception), but the number of selected variables differs substantially. Although they carry relevant information, the double cross-validation results for the full data set lack the intuitive appeal of the split-sample approach. To get closer to that intuition, the following "dynamic" analysis scheme is applied. First the data are sorted randomly; next the first n_train observations are used to derive a prediction model, which is used to predict the remaining n − n_train observations. This is done for n_train = 150, 200, 250, 300, 350 and repeated 100 times. In that way an impression is obtained of how the different approaches behave with growing information. Figure 18 shows the mean number of covariates included. More variables are included with increasing sample size (larger power) and the differences between the procedures are substantial: for n_train = 350 the LASSO selects on average 14.4 variables, whereas BE(0.01) selects 5.0 variables.

Body Fat Data
The second example is the body fat data analysed in [19] and later used in many papers. 13 continuous covariates (age, weight, height and 10 body circumference measurements) are available as predictors for the percentage of body fat. As in the book of [8], we excluded case 39 from the analysis; the data are available on the website of the book. In Table 9 we give mean squared prediction errors for several models and shrinkage approaches. Furthermore, we give these estimates for models excluding X6, the dominating predictor. This analysis assumes that X6 would not have been measured or would not have been publicly available. A related analysis is presented in chapter 2.7 of [8], with the aim to illustrate the influence of a dominating predictor and to raise the issue of the term "full model" and whether a full model approach has advantageous properties compared with variable selection procedures. Using all variables, the MSEs of the models are very similar, with a range from 18.76 - 20.80. Excluding X6 leads to a severe increase, but the differences between the models are still negligible (25.87 - 27.47); the results agree with those for the full model of the ozone data. For BE(0.157), BE(0.01) and the LASSO, the parameter estimates are given in Table 10. Sparsity of a model can be considered from a statistical point of view and from a clinical point of view; in a related context, [22] refer to parameter sparsity and practical sparsity.

Design of Simulation Study
Our simulation design was used before for investigations of different issues [10]. We consider 15 covariates, seven of them having an effect on the outcome. In addition, the multicollinearity between the variables introduces inheritance of effects when variables are dropped. Rules of thumb concerning the number of observations per variable are often mentioned as a lower boundary to derive an explanatory model [23]; with 15 covariates, a sample size of n = 100 (6.7 observations per variable) is low. As a more realistic scenario we also consider n = 400. Concerning the residual variance, we have chosen two scenarios resulting in R² = 0.5 and R² = 0.71. The 4 scenarios reflect situations with a low to a large(r) amount of information in the data. As expected from related studies, the amount of information in the data has an influence on the results. Our results confirm that it does not make sense to use parameterwise shrinkage in the full model [4]: estimated shrinkage factors are not able to handle redundant covariates with no effect at all and can therefore take erratic values.

Variable Selection and Post Selection Shrinkage
Most often, the prediction error is the main criterion to assess procedures aiming to derive a suitable predictor. With an interest in an explanatory model, further criteria become important, in particular the interpretability and practical usefulness of the model. Analysts whose data analysis is guided by subject-matter knowledge have to aim for a suitable explanatory model and may accept a slightly inferior prediction performance: excluding variables with an effect results in a loss of R², while the inclusion of variables without effect complicates the model unnecessarily and usually increases the variance of a predictor. We used several criteria to compare the full model, models derived by backward elimination with several values of the nominal significance level (the key quantity determining the complexity of a selected model) and the LASSO procedure. The number of variables is a key criterion for the interpretability and the practical usefulness of a predictor. BE(0.01) always selects the sparsest model, but such a low significance level may be dangerous for studies with a low amount of information. Our results confirm that BE(0.01) is very well suited if a lot of information is available: all stronger predictors are always included and only a small number of irrelevant variables is selected. Altogether, significance levels of 0.05 or 0.157 select reasonable models. For studies with a smaller sample size the loss in R² is acceptable and more than compensated when using the prediction error criterion. Obviously, considering several statistical criteria, the full model does not have advantages, and the simulation study illustrates that some selection is always sensible. The results on post-selection shrinkage show that parameterwise shrinkage after selection can help to improve the predictive performance by correcting the regression coefficients that are borderline significant.

Comparison with LASSO and Similar Procedures
With the hype for high-dimensional data, the LASSO approach [5] became popular. In contrast to backward elimination, the LASSO selected more variables, but the MSE was similar. The simultaneous combination of variable selection with shrinkage is often considered as an important advantage of the LASSO and some of its followers, such as SCAD [24] or the elastic net [25]. We compare the LASSO results with post-selection shrinkage procedures, in principle two-step procedures combining variable selection and shrinkage. In contrast to the LASSO and related procedures, post-selection shrinkage is not based on optimizing any criterion under a given constraint, and the approaches are somewhat ad hoc. Using cross-validation, a global shrinkage approach was proposed [2] and later extended to parameterwise shrinkage [4]. These post-selection shrinkage approaches can easily be used in all types of GLMs or regression models for survival data after the selection of a model. Whereas global shrinkage "averages" over all effects, parameterwise shrinkage aims to shrink according to individual needs caused by the selection bias. Our results confirm that parameterwise shrinkage helps to reduce the effect of redundant covariates that only enter by chance, while global shrinkage is not able to make that distinction. This explains the better performance of parameterwise shrinkage compared to global shrinkage with respect to the individual effects. The PWSF results confirm observations from another type of study on the use of cross-validation to reduce bias caused by model building [16]. Small effects that just survived the selection are shrunken most heavily. More reflection is needed about the criterion to be used: prediction error is the obvious choice but does not reflect the need for a sparse model [27]. In order to improve research on selection procedures for high-dimensional data, several approaches to determine a more suitable penalty parameter, or to use two penalty parameters, have been proposed during the last years. It would be important to investigate whether they can improve model building in the easier low-dimensional situations. As mentioned above, the approach of this paper can easily be applied to generalized linear models, such as logistic regression, and to survival analysis. It would be interesting to see such applications.
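The two post-selection shrinkage factors can be sketched in a few lines of code. The following is our own illustration, not the authors' implementation (all function names are ours): the global shrinkage factor is estimated as the calibration slope of the cross-validated linear predictor, and the parameterwise factors come from a joint regression of the outcome on the componentwise out-of-fold predictors.

```python
# Sketch (ours) of cross-validation-based global and parameterwise
# post-selection shrinkage for a linear model with design X and outcome y.
import numpy as np

def cv_linear_predictors(X, y, k=5, seed=1):
    """For each fold, fit OLS on the other folds and return, per
    observation, the out-of-fold componentwise predictors X_j * beta_j."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    folds = rng.permutation(n) % k
    Z = np.empty((n, p))              # componentwise out-of-fold predictors
    Xc = np.column_stack([np.ones(n), X])
    for f in range(k):
        tr, te = folds != f, folds == f
        beta = np.linalg.lstsq(Xc[tr], y[tr], rcond=None)[0]
        Z[te] = X[te] * beta[1:]      # drop the intercept for calibration
    return Z

def shrinkage_factors(X, y, k=5):
    """Return (global factor, vector of parameterwise factors)."""
    Z = cv_linear_predictors(X, y, k)
    n = len(y)
    # global shrinkage: slope of y on the summed out-of-fold predictor
    lp = Z.sum(axis=1)
    c_global = np.linalg.lstsq(np.column_stack([np.ones(n), lp]),
                               y, rcond=None)[0][1]
    # parameterwise shrinkage: joint regression of y on each component
    c_pw = np.linalg.lstsq(np.column_stack([np.ones(n), Z]),
                           y, rcond=None)[0][1:]
    return c_global, c_pw
```

In practice the factors would be computed on the covariates retained after selection; here they are shown for a full model, where factors near 1 indicate effects that need little shrinkage.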

Acknowledgements
We thank Gabi Ihorst for providing the ozone study data.
increase of the residual variance, which is equivalent to a decrease of the non-zero coefficients in the full model. The results are shown in Figure 1.

Figure 1. Reduction of prediction error achieved by global shrinkage for different categories of the shrinkage factor c; data from scenario 1 (n = 100, σ² = 6.25). The upper panel shows the apparent prediction errors obtained through cross-validation, the middle panel shows the actual (true) prediction errors and the lower panel shows the relation between the apparent and the actual reduction.

Figure 2. Box plot of parameterwise shrinkage factors in models without selection. Results from scenario 4.

Following [10], models are selected by backward elimination at varying significance levels α. The relevant covariates are the ones that have an influence on the outcome in the full model; in the simulation those are the covariates with non-zero coefficients.


and the sample size n. In scenario 4, with a high amount of information, the loss of R² is negligible and the number of redundant covariates can be controlled by taking α = 0.01. It has to be kept in mind that other variables are also deleted, making a direct comparison difficult. Concerning these two variables, the correct model, which includes X5 and excludes the other, is selected in 69% of the replications. In scenario 1 there is a substantial loss of R² if the selection level is too strict, and α = 0.157 might be more appropriate.
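Backward elimination at a given significance level α can be sketched as follows (our own minimal illustration based on OLS t-tests; function and variable names are ours):

```python
# Minimal backward-elimination sketch: repeatedly drop the least
# significant covariate until every remaining p-value is below alpha.
import numpy as np
from scipy import stats

def backward_eliminate(X, y, alpha=0.05):
    """Return the indices of the columns of X retained by BE."""
    keep = list(range(X.shape[1]))
    while keep:
        n = len(y)
        Xc = np.column_stack([np.ones(n), X[:, keep]])
        beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
        resid = y - Xc @ beta
        df = n - Xc.shape[1]
        s2 = resid @ resid / df                     # residual variance
        cov = s2 * np.linalg.inv(Xc.T @ Xc)         # OLS covariance matrix
        t = beta / np.sqrt(np.diag(cov))
        pvals = 2 * stats.t.sf(np.abs(t[1:]), df)   # skip the intercept
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:
            break                                   # all survivors significant
        keep.pop(worst)
    return keep
```

Smaller α (e.g. 0.01) yields sparser models at the risk of losing weak true effects; larger α (e.g. 0.157, the AIC-equivalent level) retains more variables.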

Figure 6. Loss of R² in comparison to the full model. The bars show mean ± 1 standard deviation over the replications.

Comparing Figures 9 and 10 with the true prediction errors shows that cross-validation after selection is far too optimistic about the reduction of the prediction error and is not able to reveal the poor performance of selection at α = 0.01 in scenarios 1 and 2. Moreover, as can be seen from Figure 10, post-selection cross-validation tends to favor selection.
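This optimism arises because the selection step has already used the data that cross-validation later treats as fresh. A toy demonstration (ours, not the paper's simulation design) with pure-noise data makes the mechanism visible: selecting covariates once on the full sample and then cross-validating the reduced model yields a smaller error estimate than the honest procedure that repeats the selection inside every fold.

```python
# Toy demonstration of post-selection cross-validation optimism.
import numpy as np

def cv_mse(X, y, k=5):
    """Plain K-fold cross-validated MSE of an OLS fit."""
    n = len(y); folds = np.arange(n) % k
    err = 0.0
    for f in range(k):
        tr, te = folds != f, folds == f
        Xtr = np.column_stack([np.ones(tr.sum()), X[tr]])
        Xte = np.column_stack([np.ones(te.sum()), X[te]])
        beta = np.linalg.lstsq(Xtr, y[tr], rcond=None)[0]
        err += ((y[te] - Xte @ beta) ** 2).sum()
    return err / n

def top_k(X, y, k_vars=3):
    """Indices of the k_vars covariates most correlated with y."""
    corr = np.abs(np.corrcoef(X, y, rowvar=False)[-1, :-1])
    return np.argsort(corr)[-k_vars:]

rng = np.random.default_rng(0)
n, p = 60, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                  # pure noise: no true signal

naive = cv_mse(X[:, top_k(X, y)], y)    # selection done once, outside CV

honest = 0.0                            # selection redone inside each fold
folds = np.arange(n) % 5
for f in range(5):
    tr, te = folds != f, folds == f
    sel = top_k(X[tr], y[tr])
    Xtr = np.column_stack([np.ones(tr.sum()), X[tr][:, sel]])
    Xte = np.column_stack([np.ones(te.sum()), X[te][:, sel]])
    beta = np.linalg.lstsq(Xtr, y[tr], rcond=None)[0]
    honest += ((y[te] - Xte @ beta) ** 2).sum()
honest /= n

print(naive, honest)  # the naive estimate understates the error
```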

Figure 9. Reduction of estimated prediction error, obtained through post-selection cross-validation, after backward elimination with three selection levels.

Figure 10. Ranks of estimated prediction error obtained from post-selection cross-validation for different selection levels; rank = 1 is best, rank = 4 is worst. The ranks of the true prediction error are shown in Figure 8.

Figure 13 shows a different cloud: "the larger b, the larger c".

Figure 11. Reduction in prediction error obtained through shrinkage after selection. Error bars show mean ± 1 standard deviation.

Figure 12.

Figure 13. Parameterwise shrinkage factors versus OLS estimates from selected models for covariates 3, 6 and 9; α = 0.05 and scenario 2; reference lines refer to the true value of the parameter.

If the effect is small, this quantity is very hard to estimate. If it is large, it could be estimated by c = 1 − 1/t_uni². The parameterwise shrinkage factor behaves very similarly, as could be seen from plotting PWSF against t (graphs not shown). Both approaches reduce prediction errors, with an advantage for parameterwise shrinkage.
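A quick numeric illustration of the heuristic c = 1 − 1/t_uni² (our reading of the garbled formula; treat the exact form as an assumption) shows why borderline effects call for strong shrinkage while large effects need almost none:

```python
# Heuristic shrinkage factor c = 1 - 1/t^2 for a range of t-statistics.
# Effects that barely reach significance (t near 2) are shrunken strongly;
# clearly significant effects (t >= 5) are left almost unchanged.
for t in (2.0, 3.0, 5.0, 10.0):
    c = 1.0 - 1.0 / t ** 2
    print(f"t = {t:4.1f}  ->  c = {c:.3f}")
```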

Figure 16. LASSO: distribution of the number of redundant covariates. There are eight redundant covariates in the design.
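The corresponding LASSO count can be sketched as follows (our own illustration with a cross-validated penalty; the design mirrors the paper's 15 predictors with 8 redundant ones, but the coefficient values, correlation structure and noise level are our assumptions, not the paper's simulation design):

```python
# Sketch: fit the LASSO with a CV-chosen penalty and count how many
# redundant (truly zero) covariates receive a non-zero coefficient.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 15
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:7] = [1.5, 1.2, 1.0, 0.8, 0.6, 0.4, 0.3]   # last 8 covariates redundant
y = X @ beta + rng.normal(scale=2.0, size=n)

fit = LassoCV(cv=5, random_state=0).fit(X, y)
redundant_selected = int(np.count_nonzero(fit.coef_[7:]))
print("redundant covariates selected:", redundant_selected)
```

Repeating this over many replications gives the distribution summarized in Figure 16; the tendency of the LASSO to admit redundant covariates is what makes the comparison with sparser BE models interesting.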

Figure 17. Average prediction errors for different strategies.

Table 2. The results show that ĉ_CV …

Table 6. … for scenario 2 and α = 0.05.

Table 3. It nicely shows that the optimal selection level depends on the amount of information in the data.

Cross-Validation and Shrinkage without Selection
became popular. However, the results in our simulation study and in the two examples are … From 24 candidate variables, 17 were included in the model derived for the ozone data.