A Review on High-Dimensional Frequentist Model Averaging

Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By suitably weighting several competing statistical models, model averaging attempts to achieve stable and improved prediction. To provide a better understanding of the available model averaging methods, their properties, and the relationships between them, this paper reviews recent progress in high-dimensional model averaging from the frequentist perspective. Some future research topics are also discussed.


Introduction
With the advent of high-throughput technologies, high-dimensional data are frequently generated to further the understanding of biological processes such as disease occurrence and cancer progression. Motivated by these important applications, there has been dramatic development in the statistical analysis of high-dimensional data; see [1] and [2] and the examples therein.
Model selection and model averaging are two approaches used to improve estimation and prediction in regression problems. Model selection assigns a weight of 1 to a single optimal model and weights of 0 to all other candidate models, so that a parsimonious and compact representation of the data can be obtained. In recent years, shrinkage methods have become popular because they achieve simultaneous model selection and parameter estimation. Such methods include, but are not limited to, the least absolute shrinkage and selection operator (LASSO, Tibshirani [3]), the smoothly clipped absolute deviation (SCAD, Fan and Li [4]), the elastic net (Zou and Hastie [5]), and the minimax concave penalty (MCP, Zhang [6]).
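The mechanism by which the LASSO performs simultaneous selection and estimation can be illustrated in the special case of an orthonormal design, where the LASSO solution is simply the soft-thresholded ordinary least squares estimate. A minimal sketch (the example coefficients are illustrative, not from any reference above):

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO estimate under an orthonormal design: shrink each OLS
    coefficient in z toward zero by lam, setting small entries exactly to 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Coefficients below the threshold are removed from the model entirely,
# while large coefficients are kept but shrunken.
ols = np.array([3.0, -0.4, 1.2, 0.1])
print(soft_threshold(ols, 0.5))
```

This is why the LASSO, unlike ridge regression, produces exact zeros and hence a selected submodel.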
However, the process of model selection ignores the additional uncertainty introduced by the selection step, or even introduces bias, and therefore often underestimates the variance [7]. In addition, different selection methods or criteria may yield different best models. Hence, inference based on the final selected model can be seriously misleading.
Instead of relying on only one model, model averaging compromises across a set of competing models by assigning them different weights. In doing so, model uncertainty is incorporated into the conclusions about the unknown parameters. Moreover, if the weights are properly determined, prediction performance can be enhanced [8].
Regarding model averaging techniques, frequentist model averaging (FMA) and Bayesian model averaging (BMA) are the two main approaches in the literature. Compared with FMA, BMA has an extensive literature, in which a prior probability is assigned to each candidate model to account for model uncertainty; for an overview of BMA, see [9]. The FMA approach, whose estimators are determined entirely by the data, has received growing attention over the last decade, as it avoids problems such as how to specify priors and how to handle priors when they are in conflict.
The aim of this paper is to review current FMA methods for high-dimensional linear models. FMA estimation methods are surveyed in Section 2, and some future research topics are discussed in Section 3.

High-Dimensional FMA
So far, most model averaging approaches have been developed for the classical setting in which the number of observations is greater than the number of predictors, with the main focus on determining the weights of the individual models. These approaches include Akaike information criterion model averaging (AIC, Akaike [10]), Bayesian information criterion model averaging (BIC, Hoeting et al. [11]), Mallows model averaging (Hansen [12]; Wan et al. [13]), and jackknife model averaging (Hansen and Racine [14]; Zhang et al. [15]), to name but a few.
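For the information-criterion approaches above, a common weighting scheme assigns model $k$ a weight proportional to $\exp(-\mathrm{IC}_k/2)$, the so-called smoothed AIC/BIC weights. A minimal sketch (the IC values below are illustrative):

```python
import numpy as np

def smoothed_ic_weights(ic_scores):
    """Model weights proportional to exp(-IC_k / 2), computed stably
    by subtracting the smallest (best) score before exponentiating."""
    ic = np.asarray(ic_scores, dtype=float)
    w = np.exp(-(ic - ic.min()) / 2.0)
    return w / w.sum()

# Three candidate models: the one with the lowest AIC gets the largest weight.
print(smoothed_ic_weights([102.3, 100.1, 107.8]))
```

Note that subtracting the minimum before exponentiating leaves the normalized weights unchanged but avoids numerical underflow when the IC values are large.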
In the high-dimensional setting, however, model averaging has only recently been studied. This setting is very different from the fixed-dimensional case, because many fixed-dimensional model averaging procedures either do not work at all or require theoretical or computational adjustments for their implementation.
Given a dataset of $n$ observations, a linear regression model takes the form

$$y_i = x_i^{\mathrm{T}} \beta + \varepsilon_i, \quad i = 1, \ldots, n,$$

where $y_i$ is the response in the $i$th trial, $x_i = (x_{i1}, \ldots, x_{ip})^{\mathrm{T}}$ is the vector of predictors, $\beta$ is the unknown coefficient vector, and $\varepsilon_i$ is the random error. For data in which the number of predictors $p$ is much greater than the number of observations $n$, Ando and Li [16] proposed a two-stage model averaging procedure. The procedure first divides the $p$ predictors into $M+1$ groups according to the absolute marginal correlations between the predictors and the response. Let model $M_k$ consist of the regressors whose marginal correlations fall into the $k$th group. The first group contains the highest values, while the $(M+1)$th group has values closest to 0 and is discarded, so the number of candidate models is $M$. Each candidate model can be written in matrix form as $y = X_k \beta_k + \varepsilon$. Since each candidate model has fewer predictors than the sample size, its regression coefficients are estimated by the usual least-squares method, $\hat{\beta}_k = (X_k^{\mathrm{T}} X_k)^{-1} X_k^{\mathrm{T}} y$, with predicted values $\hat{u}_k = X_k \hat{\beta}_k$. After the candidate models and their corresponding least-squares predictions are obtained, the second stage of the procedure of [16] determines the model weights. Let $\tilde{u}_k = (\tilde{u}_{1,k}, \ldots, \tilde{u}_{n,k})^{\mathrm{T}}$ be the $n$-dimensional vector whose $i$th element $\tilde{u}_{i,k}$ is the predicted value of the $i$th observation from $M_k$ computed using the data without the $i$th observation. The optimal weight vector $w = (w_1, \ldots, w_M)^{\mathrm{T}}$ is then obtained by minimizing the delete-one cross-validation criterion

$$CV(w) = \Big\| y - \sum_{k=1}^{M} w_k \tilde{u}_k \Big\|^2, \quad w_k \in [0, 1].$$

Finally, the model averaging predicted value is $\hat{u} = \sum_{k=1}^{M} w_k \hat{u}_k$.

There are several contributions in Ando and Li [16]. One notable feature of the method is the relaxation of the constraint on the total model weight: the standard requirement that the model weights sum to 1 is relaxed so that each weight may vary freely between 0 and 1, and this relaxation is shown to help lower the prediction error. Furthermore, the algorithm is computationally feasible for high-dimensional data, since each candidate model and its corresponding weight are first determined in a low-dimensional setting and then combined. Theoretically, the proposed method is proved to asymptotically achieve the lowest possible prediction loss, an important optimality property for prediction.
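The two-stage procedure can be sketched schematically as follows. This is an illustrative toy implementation, not the authors' code: the group sizes and data are made up, the leave-one-out predictions are computed by brute-force refitting, and a simple box-constrained coordinate descent stands in for whatever optimizer [16] actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 60, 120, 4            # toy high-dimensional data: p > n
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:6] = [2, -1.5, 1, 1, -1, 0.5]
y = X @ beta + rng.standard_normal(n)

# Stage 1: rank predictors by |marginal correlation| with y and split the
# top-ranked ones into M groups; each group forms one candidate model
# (the lowest-correlation remainder plays the role of the discarded group).
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
order = np.argsort(corr)[::-1]
size = 10                        # predictors per candidate model (illustrative)
models = [order[k * size:(k + 1) * size] for k in range(M)]

# Delete-one (leave-one-out) least-squares predictions for each model.
Z = np.zeros((n, M))
for k, idx in enumerate(models):
    Xk = X[:, idx]
    for i in range(n):
        keep = np.arange(n) != i
        bhat, *_ = np.linalg.lstsq(Xk[keep], y[keep], rcond=None)
        Z[i, k] = Xk[i] @ bhat

# Stage 2: minimize the delete-one CV criterion ||y - Z w||^2 over
# w in [0, 1]^M (weights need not sum to one), by coordinate descent
# with clipping to the box constraint.
w = np.zeros(M)
for _ in range(200):
    for k in range(M):
        r = y - Z @ w + Z[:, k] * w[k]        # residual excluding model k
        w[k] = np.clip((Z[:, k] @ r) / (Z[:, k] @ Z[:, k]), 0.0, 1.0)

print("weights:", np.round(w, 3))
```

Because the descent starts from $w = 0$ and each coordinate update can only decrease the objective, the fitted criterion is never worse than ignoring all models.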
Following [16], Ando and Li [17] further extended model averaging to high-dimensional generalized linear models. Still allowing the weights to vary between 0 and 1, [17] uses the Kullback-Leibler divergence in place of the squared error as the risk measure, in order to overcome several technical and theoretical challenges.
Nevertheless, Lin et al. [18] showed through a simulated example that the two-stage model averaging procedure of [16] tends to have high variance and may cause the final estimator to overfit. They argued that the increase in variance is due to reusing the same data in both steps, for generating the candidate models and for estimating the model weights.
To reduce the variance of the estimators, Lin et al. [18] proposed a random splitting approach. For $b = 1, \ldots, B$, the original dataset is first randomly divided into a training set $D_{\mathrm{train}}^{b}$ and a test set $D_{\mathrm{test}}^{b}$. For each $D_{\mathrm{train}}^{b}$, the variable selection method LASSO is applied to determine a candidate model $\hat{M}_{\lambda_k}$ for each candidate tuning parameter $\lambda_k$, $k = 1, \ldots, M$, and the coefficients of each candidate model are estimated on the training set. In the next step, the second-level data $z_{ik}$ is constructed by averaging, over the splits in which observation $i$ falls into the test set, the predictions for observation $i$ from candidate model $k$:

$$z_{ik} = \frac{1}{|I_i|} \sum_{b \in I_i} \hat{y}_{i,k}^{b},$$

where $I_i$ is the set of indices of the test datasets $D_{\mathrm{test}}^{b}$ that contain observation $i$. After $z_{ik}$ is determined, the optimal weight vector $w$ is estimated by minimizing

$$\sum_{i=1}^{n} \Big( y_i - \sum_{k=1}^{M} w_k z_{ik} \Big)^2.$$

Finally, the model averaging predicted value takes the form $\hat{u}_i = \sum_{k=1}^{M} \hat{w}_k z_{ik}$. The procedure of [18] selects candidate models and estimates coefficients using the training sets, while the optimal weights are found using only the test sets, which successfully avoids model overfitting and improves prediction accuracy by combining models from multiple random splits. The main price one pays for the random splitting, however, is significantly increased computational complexity.
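The random-splitting scheme can be sketched as below. This is a toy illustration under stated simplifications: a plain correlation-screening selector over a grid of model sizes stands in for the LASSO path over tuning parameters, the data are synthetic, and the weights are constrained to $[0, 1]$ as in [16]; none of the specific numbers come from [18].

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, B = 80, 150, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = [1.5, -1.0, 1.0, -0.8, 0.6]
y = X @ beta + rng.standard_normal(n)

# Candidate models: a grid of model sizes, standing in for a grid of
# LASSO tuning parameters lambda_k.
sizes = [3, 5, 10, 20]
M = len(sizes)

Zsum = np.zeros((n, M))          # accumulated test-set predictions
cnt = np.zeros(n)                # times each observation fell in a test set
for b in range(B):
    perm = rng.permutation(n)
    train, test = perm[:n // 2], perm[n // 2:]
    # Variable screening on the training half only (LASSO stand-in):
    corr = np.abs(X[train].T @ (y[train] - y[train].mean()))
    order = np.argsort(corr)[::-1]
    for k, s in enumerate(sizes):
        idx = order[:s]
        bhat, *_ = np.linalg.lstsq(X[np.ix_(train, idx)], y[train], rcond=None)
        Zsum[test, k] += X[np.ix_(test, idx)] @ bhat
    cnt[test] += 1

Z = Zsum / cnt[:, None]          # second-level data z_ik

# Weights minimize sum_i (y_i - sum_k w_k z_ik)^2 over w in [0, 1]^M,
# using only test-set predictions, never the training fits themselves.
w = np.zeros(M)
for _ in range(200):
    for k in range(M):
        r = y - Z @ w + Z[:, k] * w[k]
        w[k] = np.clip((Z[:, k] @ r) / (Z[:, k] @ Z[:, k]), 0.0, 1.0)

print("weights:", np.round(w, 3))
```

The double loop over splits and candidate models makes the increased computational cost mentioned above concrete: the selection-plus-fitting step is repeated $B \times M$ times rather than once.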

Conclusion and Discussion
In this paper, we have reviewed the development of the FMA approach for high-dimensional linear regression models. The performance of FMA procedures depends heavily on how the weights are chosen in estimation, since different weights result in different risks and asymptotic properties. Consequently, the choice of weights remains a central issue for future research on high-dimensional model averaging.