Cross-Validation, Shrinkage and Variable Selection in Linear Regression Revisited


In deriving a regression model, analysts often have to use variable selection despite the problems introduced by data-dependent model building. Resampling approaches have been proposed to handle some of the critical issues. To assess and compare several strategies, we conduct a simulation study with 15 predictors and a complex correlation structure in the linear regression model. Using sample sizes of 100 and 400 and estimates of the residual variance corresponding to R² of 0.50 and 0.71, we consider four scenarios with varying amounts of information. We also consider two examples with 24 and 13 predictors, respectively. We discuss the value of cross-validation, shrinkage and backward elimination (BE) with varying significance levels. We assess whether 2-step approaches using global or parameterwise shrinkage factors (PWSF) can improve selected models, and we compare the results to models derived with the LASSO procedure. Besides MSE, we use model sparsity and further criteria for model assessment. The amount of information in the data influences both the selected models and the comparison of the procedures. None of the approaches was best in all scenarios. The performance of backward elimination with a suitably chosen significance level was no worse than that of the LASSO, and the models selected by BE were much sparser, an important advantage for interpretation and transportability. PWSF performed better than global shrinkage. Provided that the amount of information is not too small, we conclude that BE followed by PWSF is a suitable approach when variable selection is a key part of data analysis.
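To make the second step of the 2-step approach concrete, the following is a minimal sketch of how global and parameterwise shrinkage factors might be estimated by cross-validation calibration. It is an illustration under stated assumptions, not the authors' implementation: the function names `ols` and `cv_shrinkage`, the use of 10 folds, and the choice not to repeat variable selection within each fold are all assumptions made here, and `X` is assumed to contain only the variables already selected (for example by BE).

```python
import numpy as np


def ols(X, y):
    """Least-squares fit with an intercept; returns (intercept, slopes)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]


def cv_shrinkage(X, y, n_folds=10, seed=0):
    """Estimate shrinkage factors by K-fold cross-validation calibration.

    Assumption of this sketch: X holds the already-selected predictors,
    and selection is NOT repeated inside each fold.
    Returns (global_factor, parameterwise_factors).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    folds = rng.permutation(n) % n_folds  # random, near-equal fold labels

    # partial[i, j] = x_ij * beta_j, with beta_j fitted without i's fold.
    # Fold-specific intercepts are ignored; the calibration regressions
    # below fit their own intercept.
    partial = np.empty((n, p))
    for k in range(n_folds):
        test = folds == k
        _, beta = ols(X[~test], y[~test])
        partial[test] = X[test] * beta

    # Global factor: slope of y on the cross-validated linear predictor.
    _, c_global = ols(partial.sum(axis=1, keepdims=True), y)
    # Parameterwise factors: one slope per cross-validated partial predictor.
    _, c_param = ols(partial, y)
    return float(c_global[0]), c_param
```

In this sketch, the shrunken coefficients would be obtained by multiplying the full-data estimates either by the single global factor or elementwise by the parameterwise factors. Factors for variables with very weak effects can be unstable, which is consistent with the caveat that the amount of information should not be too small.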

Share and Cite:

H. Houwelingen and W. Sauerbrei, "Cross-Validation, Shrinkage and Variable Selection in Linear Regression Revisited," Open Journal of Statistics, Vol. 3 No. 2, 2013, pp. 79-102. doi: 10.4236/ojs.2013.32011.

Conflicts of Interest

The authors declare no conflicts of interest.



Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.