A Universal Selection Method in Linear Regression Models

In this paper we consider a linear regression model with fixed design. A new rule for the selection of a relevant submodel is introduced on the basis of parameter tests. One particular feature of the rule is that a subjective grading of the model complexity can be incorporated. We provide bounds for the mis-selection error. Simulations show that, by using the proposed selection rule, the mis-selection error can be controlled uniformly.


Introduction
In this paper we consider a linear regression model with fixed design and deal with the problem of selecting, from a family of models, one that fits the data well. The restriction to linear models is made for the sake of transparency. In applications the analyst is very often interested in simple models because these can be interpreted more easily. Thus a more precise formulation of our goal is to find the simplest model which fits the data reasonably well. We establish a principle for selecting this "best" model.
Over time the problem of model selection has been studied by a large number of authors. The papers [1,2] by Akaike and Mallows inspired statisticians to think about comparisons of models fitted to a given dataset. Akaike, Mallows and later Schwarz (in [3]) developed information criteria which may be used for such comparisons and which, in particular, may be applied to non-nested sets of models. The basic idea is to assess the trade-off between the improved fit of a larger model and its increased number of parameters. Akaike's approach is to penalise the maximised log-likelihood by twice the number of parameters in the model. The resulting quantity, the so-called AIC, is then optimised with respect to the parameters and the models. The disadvantage of this procedure is that it is not consistent; more precisely, the probability of overfitting the model tends to a positive value. Subsequently, many other criteria have been developed. In a series of papers the consistency of procedures based on several information criteria (BIC, GIC, MDL, for example) was shown. The MDL method was introduced by Rissanen in [4]. In the 1990s a new class of model selection methods came into focus. The FDR procedure of Benjamini and Hochberg (see [5]) uses ideas from multiple testing and attempts to control the false discovery rate, which we will call the mis-selection rate in this paper. More recent papers in this direction were published by Bunea et al.
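Akaike's trade-off is easy to state concretely: for a Gaussian linear model the criterion reduces, up to an additive constant, to n·log(RSS/n) + 2k, and the candidate with the smaller value is preferred. A minimal sketch (the data, coefficients and candidate models below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(0.0, 1.0, n)
# Illustrative truth: a quadratic regression function with Gaussian noise.
y = 1.0 + 2.0 * x + 5.0 * x ** 2 + rng.normal(scale=0.5, size=n)

def gaussian_aic(y, X):
    """Akaike's criterion for a Gaussian linear model, up to an additive
    constant: n * log(RSS / n) plus twice the number of parameters."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return len(y) * np.log(rss / len(y)) + 2 * X.shape[1]

a_lin = gaussian_aic(y, np.column_stack([np.ones(n), x]))
a_quad = gaussian_aic(y, np.column_stack([np.ones(n), x, x ** 2]))
print(a_lin, a_quad)  # the quadratic model wins despite its extra parameter
```

Here the improved fit of the quadratic model outweighs the penalty of one extra parameter, so its criterion value is smaller.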
[6], and by Benjamini and Gavrilov [7]. Surveys of the theory and existing results may be found in [8][9][10][11]. In a large number of papers the consistency and loss efficiency of selection procedures are shown and the signal-to-noise ratio is calculated for the criterion under consideration. Among these papers we refer to [12][13][14][15][16], where consistency is proved in a rather general framework. A method for submodel selection using graphs is studied in [17]. Leeb and Pötscher examine several aspects of post-model-selection inference in [9,18,19]. These authors point out and illustrate the important distinction between asymptotic results and small-sample performance. Shao introduced in [20] a generalised information criterion which includes many popular criteria or is asymptotically equivalent to them. In that paper Shao proved convergence rates for the probability of mis-selection. In [21] a rather general approach using a penalised maximum likelihood criterion was considered for nested models.
Edwards and Havránek proposed in [22] a selection procedure aimed at finding the set of simplest models that are accepted by a test, such as a goodness-of-fit test. Unfortunately, it is not possible to use the typical statistical tests of linear models in Edwards and Havránek's procedure, since assumption (b) in Section 2 of their paper is not fulfilled (cf. Section 4 of their paper).
In this paper we develop a new universal method for selecting a significant submodel of a linear regression model with fixed design, where the selection is made from the set of all submodels. We point out several new features of our approach: 1) A new selection procedure based on parameter tests is introduced. The procedure is not comparable with methods based on information criteria, and it is different from Efroymson's algorithm of stepwise variable selection in [23].
2) We derive convergence rates for the probability of mis-selection which are better than those proved in papers on information criteria, e.g. in [20].
3) Subjective grading of the model complexity can be incorporated.
Concerning 1), we consider tests on a set of parameters, in contrast to FDR methods, where several tests on only one parameter are applied. Moreover, with respect to 2), many authors do not analyse the behaviour of mis-selection probabilities. Results on bounds or convergence rates for these probabilities are more informative than consistency alone. Aspect 3) is of special interest from the point of view of model building. Typically, model builders have some preference rules in mind when selecting the model. They prefer simple models with linear functions to models with more complex functions (exponential or logarithmic, for example). The crucial idea is to assign to each submodel a specific complexity number.
We do not assume that the errors are normally distributed. This ensures a wide-ranging applicability of the approach, but only asymptotic distributions of the test statistics are available. From the examples in Section 2 it can be seen that applications are possible in several directions, for instance to the one-factor ANOVA model. The simulations show an advantage of the proposed method in that it controls the frequency of mis-selection uniformly. For models with a large number of regressors, the problem of establishing an effective selection algorithm is not discussed in this paper; we refer to the paper [24].
The paper is organised as follows: In Section 2 we introduce the regression model and several versions of submodels. The asymptotic behaviour of the basic statistic is also studied there. Section 3 is devoted to the model selection method. We provide convergence rates for the probability that the procedure selects the wrong model (mis-selection). We see that the behaviour is similar to that in the case of hypothesis testing. The results of simulations are discussed in Section 4. The reader finds the proofs in Section 5.

Models
Let us introduce the master model

Y = Xβ + ε,   (1)

where X is the n × k design matrix, β ∈ R^k is the parameter vector, and ε = (ε_1, …, ε_n)^T is a vector of independent errors with E ε_i = 0. This leads to the residual sum of squares S = min_β ||Y − Xβ||^2, where ||·|| is the Euclidean vector norm. The aim is to select model (1) or an appropriate submodel which fits the data well. Moreover, we search for a reasonably simple model. In the following we define the submodels of (1). The submodel with index ν has the parameter vector β_ν and the model equation

Y = X_ν β_ν + ε.   (2)

Next we give several versions of the definition of submodels in different situations.
Example 1. We consider all submodels in which some components of β are zero. More precisely, for index ν ∈ {1, …, 2^k}, the digits "1" in the binary representation of ν − 1 give the indices of the parameters β_j available in the submodel ν; the submodel indices ν = 1 and ν = 2^k correspond to the model function equal to zero (no parameters) and to the full model, respectively. The matrix X_ν in (2) consists of the columns of the design matrix X corresponding to the parameters present in submodel ν. □
Example 2. Instead of all 2^k submodels, the selection may be restricted to a few candidate submodels fixed in advance, say submodel 1 and submodel 2. □
Example 3. We consider the one-factor ANOVA model Y_{gi} = μ_g + ε_{gi}, i = 1, …, n_g, g = 1, …, G, where μ = (μ_1, …, μ_G)^T is the parameter vector and ε_{11}, …, ε_{Gn_G} are independent random variables. The submodels are characterised by the fact that several of the means μ_g coincide. Example 3 shows that the model selection problem occurs also in the context of ANOVA. □
In submodel (2) with index ν, the least squares estimator β̂_ν and the residual sum of squares S_ν are given by

β̂_ν = argmin_{β_ν} ||Y − X_ν β_ν||^2,   S_ν = ||Y − X_ν β̂_ν||^2.

What is an appropriate statistic for model selection?
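Example 1's indexing can be sketched in code. The convention below (the binary digits of ν − 1 mark the active parameters, so ν runs from 1 to 2^k) is one assumed reading of the text:

```python
import numpy as np

def active_params(nu, k):
    """Indices of the parameters beta_j present in submodel nu (Example 1):
    the '1' digits in the binary representation of nu - 1 mark the active
    parameters (assumed convention; nu runs from 1 to 2**k)."""
    bits = format(nu - 1, f"0{k}b")
    return [j for j, b in enumerate(bits) if b == "1"]

def design_submatrix(X, nu):
    """Columns of the master design matrix X belonging to submodel nu."""
    return X[:, active_params(nu, X.shape[1])]

k = 3
subsets = [active_params(nu, k) for nu in range(1, 2 ** k + 1)]
print(subsets)  # nu = 1 is the empty model, nu = 2**k the full model

X = np.arange(12.0).reshape(4, 3)
print(design_submatrix(X, 2).shape)  # only one column is active
```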
Here we consider a quantity M_n(ν), which is similar to the F-statistics known from hypothesis testing in linear regression models with normal errors. The main difference to classical F-statistics is that the estimator of the model variance in submodel ν appears in the denominator; the quantity S_ν/n is the proper estimator under the hypothesis of submodel ν. Classical F-statistics are used in Efroymson's algorithm of stepwise variable selection (see [23]).
In the remainder of this section we study the asymptotic behaviour of the statistic M_n(ν) when β_0 is the true parameter of the model (1). For this reason, we first introduce some assumptions.
In a wide range of applications, the entries x_{ij} of the design matrix are uniformly bounded. The assumption may be weakened in some ways, but we use it to reduce the technical effort. Depending on whether the true parameter β_0 belongs to submodel ν or not, the statistic M_n(ν) has a different asymptotic behaviour. In the first case, it has an asymptotic χ²-distribution. In the second case it tends to infinity in probability with rate n. Therefore, the statistic M_n(ν) is suitable for model selection. In the next section a selection procedure is introduced with M_n(ν) serving as the fundamental statistic.

The New Selection Rule
In this section we propose a selection rule which is based on the statistic (4). We introduce a measure d_ν of the complexity of submodel ν, taking integer values with 1 ≤ d_ν ≤ d_max. With this quantity it is possible to incorporate a subjective grading of the model complexity. The restriction to integers is made for simpler handling in the selection algorithm. The following examples illustrate the applicability of the complexity measure.
Example 4. We consider the polynomial regression model, where the regressor is observed at the measurement points x_1, …, x_n. (a) One choice is to let d_ν be the number of parameters β_j available in the submodel. (b) Alternatively, d_ν is the degree of the polynomial plus 1. The latter choice has the advantage that a polynomial of higher degree always gets a higher complexity number. □
Example 5. For a quasilinear model built from constant, linear and logarithmic regression functions, one may assign complexity 1 to the submodels containing only a constant or a linear function, complexity 2 to the submodel containing the function ln x, intermediate complexities to submodels combining these functions, and complexity 5 to the full model. This choice takes into account that the logarithm is a more complex function in comparison to constants or linear functions. □
Next we need restricted parameter sets. Here c_n is the quantile of order 1 − α_n of the asymptotic distribution of M_n(ν); the quantity α_n will play the role of an asymptotic type-1 error probability later. A submodel ν is referred to as admissible if M_n(ν) ≤ c_n is satisfied, which in turn corresponds to the non-rejection of the hypothesis that the parameter belongs to the space Θ_ν of the submodel. The generalised information criterion introduced by Shao (see [20]) penalises the residual sum of squares S_ν by a term proportional to the number of parameters. We next show that there is a relationship between the two approaches: a submodel ν is admissible precisely when its value of the generalised information criterion satisfies a corresponding inequality.
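The two gradings of Example 4 can be written down directly; the helpers below are illustrative (encoding a submodel as the list of its active powers is an assumption):

```python
def poly_complexity_a(active):
    """Example 4(a): complexity = number of parameters beta_j in the submodel."""
    return len(active)

def poly_complexity_b(active):
    """Example 4(b): complexity = degree of the polynomial plus 1, so a
    polynomial of higher degree always gets a higher complexity number.
    `active` lists the active powers, e.g. [0, 2] for b0 + b2 * x**2."""
    return max(active) + 1 if active else 0

# Under (a) the submodels {x} and {x^3} look equally complex;
# under (b) the cubic term is graded as more complex.
print(poly_complexity_a([1]), poly_complexity_a([3]))  # 1 1
print(poly_complexity_b([1]), poly_complexity_b([3]))  # 2 4
```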
Moreover, note that our selection procedure is completely different from Shao's. Whereas the penalty constant in information criteria is typically free of choice, the quantity c_n is well-defined and motivated. Let F_l be the distribution function of the χ²_l-distribution. We introduce the following rule for the selection: select a model ν* of minimal complexity among the admissible submodels. The central idea is to prefer any admissible model with lower complexity. If there is more than one admissible model with the same minimum complexity, then we take the model with the maximum p-value of M_n(ν). The next step is to analyse the asymptotic behaviour of the probability that the wrong model is selected, i.e. the probability of mis-selection (PMS). Let ν_0 denote the index of the true model. The following cases of mis-selection can occur: (m1) the true model ν_0 is not admissible; (m2) a wrong model with the same complexity d_{ν_0} is selected; (m3) a model with complexity smaller than d_{ν_0} is selected. The probability of mis-selection in case (m2) may be decreased by reducing the number of submodels having the same complexity. Theorem 3.1 below provides bounds for the selection error.
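A compact sketch of the rule, under stated assumptions: admissibility is checked via the p-value of M_n(ν) against an asymptotic χ² law, minimal complexity wins, and ties are broken by the largest p-value. The function `chi2_sf` is a hand-rolled χ² survival function for small degrees of freedom, and the statistics and complexities fed in are made up for illustration:

```python
import math

def chi2_sf(x, df):
    """Survival function 1 - F_df(x) of the chi-square distribution,
    hand-rolled for df in {1, 2, 3} (enough for this illustration)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("df not supported in this sketch")

def select(models, alpha=0.05):
    """Selection rule sketch. `models` is a list of tuples
    (name, M, df, complexity): M is the statistic M_n(nu), df the degrees of
    freedom of its asymptotic chi-square law, complexity the number d_nu.
    A submodel is admissible when its p-value exceeds alpha; among admissible
    models the minimal complexity wins, ties broken by the largest p-value."""
    admissible = [(chi2_sf(M, df), d, name) for name, M, df, d in models
                  if chi2_sf(M, df) > alpha]
    if not admissible:
        return None  # no submodel passes; fall back to the full model
    d_min = min(d for _, d, _ in admissible)
    return max((p, name) for p, d, name in admissible if d == d_min)[1]

models = [
    ("intercept", 412.3, 2, 1),  # rejected: statistic far beyond the quantile
    ("linear",      1.7, 1, 2),  # admissible
    ("log",         2.9, 1, 2),  # admissible, same complexity, smaller p-value
]
print(select(models))  # -> linear
```

With these numbers both one-term models are admissible and share the minimal complexity, so the rule prefers the one with the larger p-value.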
Theorem 3.1. Assume that the Assumption is fulfilled. Then: 1) The PMS of case (m1) behaves like a type-1 error in a statistical test; under the assumptions of part 1) it asymptotically approaches α_n, up to an additional term of smaller order. 2) The PMS of cases (m2) and (m3) tends to zero for all n large enough, with explicit rates. These rates of PMS are rather fast. They are better than in comparable cases in [20] (a_n and α_n can be considered to have the same rate). One reason is that in this paper alternative techniques, such as the Fuk-Nagaev inequality, are employed to obtain the convergence rates. The results of Theorem 3.1 recommend the selection rule above from the theoretical point of view. The behaviour in practice is discussed in the next section.

Simulations
Here we consider the polynomial model y_i = β_1 + β_2 x_i + … + β_k x_i^{k−1} + ε_i, where the x_i ∈ [0,1] are the observations of the regressor variable and the ε_i's are i.i.d. random variables. For simplicity, we consider the case x_i = i/n. The complexity is measured as given in Example 4(b). We compare the selection method of the previous section with procedures based on Schwarz's Bayesian information criterion (BIC, see [3]) and the Hannan-Quinn criterion (HQIC, see [25]). Tables 1-3 show the frequencies of mis-selection. The results are based on 10^6 replications of the model. We choose several different error distributions.
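A reduced-scale sketch of this simulation setup (BIC as the baseline criterion, Gaussian errors, and far fewer replications than the 10^6 behind Tables 1-3; the coefficients and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def rss(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

def bic_choice(y, designs):
    """Pick the design minimising BIC = n*log(RSS/n) + (#params)*log(n)."""
    n = len(y)
    scores = [n * np.log(rss(y, X) / n) + X.shape[1] * np.log(n)
              for X in designs]
    return int(np.argmin(scores))

n, reps = 100, 200
x = np.arange(1, n + 1) / n                 # design points x_i = i/n
true_deg = 1                                # true model: b1 + b2 * x
designs = [np.column_stack([x ** j for j in range(d + 1)]) for d in range(4)]

miss = 0
for _ in range(reps):
    y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
    miss += bic_choice(y, designs) != true_deg
rate = miss / reps
print("BIC mis-selection frequency:", rate)
```

Counting how often the selected degree differs from the true one reproduces, on a small scale, the mis-selection frequencies reported in the tables.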

Proofs
By C we denote positive constants which may vary from place to place. Throughout this section, we assume that the Assumption is fulfilled. In the following we prove auxiliary statements which are used later in the proofs of the theorems.

   
Applying Fuk-Nagaev's inequality (see [26]) we obtain the first assertion of the lemma. A further application of Fuk-Nagaev's inequality from [26], together with Lemma 5.1 for n large enough, yields the remaining bound; a combination of Inequalities (5)-(7) yields the lemma. For n large enough this also implies assertion 2) of the lemma. An application of the central limit theorem and the Cramér-Wold device leads to the following lemma. In the second part of this section we provide the proofs of Proposition 2 and Theorem 3.1.
Moreover, the required identity holds in view of the Assumption. An application of Lemma 5.4 and the Cochran theorem then shows assertion 1) of Proposition 2.

  
Hence, by (8) and Lemma 5.5, we obtain a contradiction to the assumption; this proves the lemma.
Proof of Theorem 3.1: 1) One shows the required identity for the PMS of case (m1), and analogously for the complementary event. Since the relevant set is convex for all a > 0, we can apply Bhattacharya's theorem on a multivariate Berry-Esseen inequality (see [27]). Combining these identities with (11) and (12) we obtain assertion 1).
2) One can show that the PMS of cases (m2) and (m3) tends to zero for n large enough, with upper bounds whose constants do not depend on ν and n; the same upper bound holds for the remaining terms.

Using Lemmas 5.1 and 5.2, we deduce the corresponding bounds; furthermore, K_i → 0 by Lemma 5.5, and by Lemma 5.1 we obtain the assertion.

References
[1] H. Akaike, "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, Vol. 19, 1974, pp. 716-723. doi:10.1109/TAC.1974.1100705