Revisiting Akaike’s Final Prediction Error and the Generalized Cross Validation Criteria in Regression from the Same Perspective: From Least Squares to Ridge Regression and Smoothing Splines
1. Introduction
In many branches of activity, the data analyst is confronted with the need to model a continuous numeric variable Y (the response) in terms of one or more other explanatory variables (called covariates or predictors or regressors) in a population Ω through a model

Y = Φ(X) + ε   (1)

where¹:
• X is the vector of covariates;
• Φ is a function from the space of covariate values to ℝ, generally unknown, called the regression function;
• ε is an unobserved error term, also called residual error in the model (1).
In our developments in this paper, and contrary to popular tradition, we will not need the variables X and ε to be stochastically independent. However, the usual minimal assumption for (1) is that the variables X and ε satisfy:

(A₁). E(ε | X) = 0.
Though sometimes far more debatable, we will also admit that the postulated regression model (1) satisfies the homoscedasticity assumption for the residual error variance:

(A₂). Var(ε | X) = σ², with 0 < σ² < ∞.
Now, as is well known, Assumption (A₁) implies that

Φ(X) = E(Y | X)   (2)
However, that conditional expectation function can almost never be analytically computed in practical situations. The aim of the regression analysis of Y | X (i.e. “Y given X”) is rather to estimate the unknown regression function Φ in (1) based on some observed data

(x₁, y₁), …, (x_n, y_n)   (3)

collected on a sample of size n drawn from Ω. If achieved, this will result in a computable function Φ̂ so that the final practical model used to express the response variable Y in terms of the vector of predictors X will be:

Y = Φ̂(X) + ε̂   (4)

where ε̂ is the residual error in the modeling. But once we get such a fit for the regression model (1), an obvious question arises: how can one measure the accuracy of that computed fit (its so-called goodness-of-fit)?
For some specific regression models (generally parametric ones and, most notably, the LM fitted by OLS with a full column rank design matrix), various measures of accuracy of their fit to given data, with closed form formulas, have been developed. But such specific and easily computable measures of accuracy are not universally applicable to all regression models, so they cannot be used to compare two arbitrarily fitted models. In contrast, in the late 1960s, Akaike introduced an approach (in a much broader context including time series modeling) which has that desirable universal feature [1], and is thus now generally recommended for comparing different regression fits for the same data, albeit, sometimes, at some computational cost. It is based on estimating the prediction error of the fit. But various definitions and measures of that error have been used. By far the most popular is the Mean Squared Prediction Error (MSPE), but there appears to be a myriad of ways of defining it and/or of constructing theoretical estimates of it. A detailed lexicon on the matter is given in [2] and the references therein.
These various definitions and theoretical estimates of the MSPE are, undoubtedly, insightful in their respective motivations and aims, but they ultimately make the subject of prediction error assessment utterly complicated and confusing for most non-expert users. This is compounded by the fact that many of these definitions and discussions of prediction error liberally mix sample-based estimates (i.e. finite averages over sample items) with intrinsic numeric characteristics of the whole population (expressed through expectations of some random variable). In that respect, while both are aimed at estimating the MSPE, the respective classical derivations of Akaike’s Final Prediction Error (FPE) and Craven and Wahba’s Generalized Cross Validation (GCV) stem from two quite different perspectives [1] [3]. This makes it hard, for the non-expert user, to grasp that these two selection criteria might even be related in the first place.
The first purpose of this paper is to settle on the definition of the MSPE most commonly known by users to assess the prediction power of any fitted model, be it for regression or otherwise. Then, in that framework, we shall provide a conceptually simpler derivation of the FPE as an estimate of that MSPE in a LM fitted by OLS when the design matrix is of full column rank. Secondly, we build on that to derive generalizations of the FPE for the LM fitted by other well known methods under various scenarios, generalizations seldom accessible from the traditional derivation of the FPE. Finally, we show that, in that same unified framework, a minor variation in the derivation of the MSPE estimates yields the well known formula of the GCV for all these various LM fitters. For the latter selection criterion, previous attempts [4] [5] have been made to provide a derivation of it different from the classical one as an approximation of the leave-one-out cross validation (LOO-CV) estimate of the MSPE. We view our approach as a more straightforward and inclusive derivation of the GCV score.
To achieve that, we start, in Section 2, by reviewing the prediction error viewpoint in assessing how a fitted regression model performs, and by settling on the definition of the MSPE most commonly known by users to measure that performance (while briefly recalling the alternative one most given in regression textbooks). Then, in Section 3, focusing specifically on the LM, we provide theoretical expressions of that MSPE measure valid for any arbitrary LM fitting method, be it under random or non random designs. In the next sections, these expressions are successively specialized to some of the best known LM fitters to deduce, for each, computable estimates of the MSPE, a class of which yields the FPE, while a slight variation yields the GCV. Those LM fitters include: OLS, with or without a full column rank design matrix (Section 4), and Ridge Regression, both Ordinary and Generalized (Section 5), the latter embedding smoothing splines fitting. Finally, Section 6 draws a conclusion and suggests some needed future work.
As customary, we summarize the data (3) through the design matrix and the response vector:

X = (x₁, …, x_n)ᵀ ∈ ℝ^{n×p},  Y = (y₁, …, y_n)ᵀ ∈ ℝⁿ   (5)

It is important to emphasize that, technically, each observed response yᵢ is a realization of a random variable. Consequently, the same is true of the vector Y. On the other hand, the matrix X is considered as fixed or random, depending on whether the covariates x₁, …, x_n are viewed as fixed or random. According to which one holds, one talks of a fixed design or a random design. Our presentation will encompass both scenarios. For convenience, we will use the same symbol to denote a random variable and its realization. Nonetheless, the distinction between the two will be apparent from context.
2. Prediction Error of a Fitted Regression Model
2.1. The Prediction Error Viewpoint for Assessing a Fitted Regression Model
Using the data (3), assume we got an estimate Φ̂ of the unknown regression function Φ in the regression model (1). Thus, we fitted the model (1) to express the response Y in terms of the vector of covariates X in the population under study Ω. From a predictive perspective, assessing the goodness-of-fit of that fitted model amounts to answering the question: how well is the fitted model likely to predict the response variable Y on a future individual based on its covariate values?

To try to formalize an answer, consider a new member of Ω, drawn independently from the sample of Ω which produced the data (3), and assume we have observed its covariates vector x₀, but not its response value y₀. Nonetheless, from the former, we want to get an idea of the unobserved value of the latter. This is the prediction problem of Y | X (“Y given X”) on that individual. With model (4) fitted to the response variable Y, it seems natural to predict the unknown value of y₀ on that individual by:

ŷ₀ = Φ̂(x₀)   (6)

But there is, obviously, an error attached to such a prediction of the value y₀ by ŷ₀, called the prediction error or predictive risk of ŷ₀ w.r.t. y₀.
2.2. The Mean Squared Prediction Error
The prediction error of ŷ₀ w.r.t. y₀ needs to be assessed beforehand to get an idea of the quality of the fit provided by the estimated regression model (4) in the population under study. To that aim, the global measure mostly used is the Mean Squared Prediction Error (MSPE), but one meets a bewildering number of ways of defining it in the literature [2], creating a bit of confusion in the mind of the daily user of Statistics. Yet, most users would accept as definition of the MSPE:

MSPE(Φ̂) = E[(y₀ − ŷ₀)²] = E[(y₀ − Φ̂(x₀))²]   (7)

an intrinsic characteristic of the population which we need to estimate from the available data (3).

In this paper, an unconditional expectation like the r.h.s. of (7) is meant to integrate over all the random variables involved in the integrand. A conditional expectation, on the other hand, acts the same, except for the conditioning random variable(s). As such, en route to getting MSPE(Φ̂), we will pass successively through:

MSPE(Φ̂ | X, Y, x₀) = E[(y₀ − Φ̂(x₀))² | X, Y, x₀]   (8a)
MSPE(Φ̂ | X, Y) = E[(y₀ − Φ̂(x₀))² | X, Y]   (8b)
MSPE(Φ̂ | X) = E[(y₀ − Φ̂(x₀))² | X]   (8c)

each representing the MSPE conditioned on some information, which might be relevant in its own right. In particular, MSPE(Φ̂ | X) is the relevant definition of the MSPE in the case of a fixed design. But, as a consequence of moving successively through (8a)-(8c) to get (7), handling the fixed design case will not need a special treatment because computing (8c) will be a necessary intermediate step.
Obviously, for most fitted regression models, trying to compute MSPE(Φ̂) or its conditional versions analytically is a hopeless task. However, it is known that K-fold cross validation (CV) can be used to estimate these quantities in a nearly universal manner, imposing no distributional assumption and using a quasi-automatic algorithm [6]. Nonetheless, cross validation has its own defects, such as a high extra computational cost², the impact of the choice of K and, generally, an upward bias. As for the latter defect, Borra and Di Ciaccio [2] showed in an extensive simulation study that Repeated Corrected CV, a little-publicized correction to K-fold cross validation developed by Burman [7], performed quite well and outperformed, on some well known nonlinear regression fitters and under a variety of scenarios, all the other theoretically more involved measures of the MSPE alluded to above.
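For concreteness, the K-fold CV estimate of the MSPE just mentioned can be sketched as follows (a minimal illustration; the `fit`/`predict` callables and the fold-splitting scheme are our own choices for the sketch, not prescribed by the text):

```python
import numpy as np

def kfold_mspe(X, y, fit, predict, K=10, seed=0):
    """K-fold cross-validation estimate of the MSPE of a regression fitter.

    fit(X_train, y_train) -> fitted object; predict(obj, X_test) -> predictions.
    Both callables are assumptions of this sketch, not a fixed API.
    """
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    sq_err = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        sq_err.extend((y[test] - predict(model, X[test])) ** 2)
    return float(np.mean(sq_err))

# example with an OLS fitter on simulated data (sigma^2 = 1)
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(200)
est = kfold_mspe(X, y,
                 fit=lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0],
                 predict=lambda m, A: A @ m)
```

Each of the K model refits is what makes the procedure costly; the closed form criteria derived in the sections below avoid it entirely for the LM.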
Fortunately, after fitting a Linear Model to given data by a chosen method, there is, at least for the most common methods, no need to shoulder the computational cost attached to CV to estimate the MSPE of the fitted model. It is the purpose of this article to show that, for the most well known LM fitters, one can transform (7)-(8c) into a form allowing one to deduce, in a straightforward manner, two estimates of the MSPE computable from the data, which turn out to coincide, respectively, with the FPE and the GCV. This, therefore, yields a new way to obtain these two selection criteria in linear regression, completely different from how they are usually motivated and derived.
2.3. Prediction Error and Sources of Randomness in the Regression Process
The rationale behind settling on (7) or (8c) as definition of the MSPE lies in the fact that, from our standpoint, any global measure of the prediction error in the population computed from a sample collected in it must account both for the randomness in drawing that sample and for that in drawing a future individual. Consequently, each of the expectations in (7)-(8c) should be understood as integrating over all the possible sources of randomness entering the process of computing the prediction ŷ₀ of y₀, save the conditioning variables, if any:

1) the realized, but unobserved, random n-vector ε = (ε₁, …, ε_n)ᵀ of sample residual errors in the model (1) for the data (3). This vector results from the fact that, for the observed sample of n individuals, the model (1) implies that

yᵢ = Φ(xᵢ) + εᵢ, i = 1, …, n   (9)

2) x₀, the vector of covariates for the potential newly and independently drawn individual for which the unknown value y₀ of the response Y is to be predicted;

3) ε₀, the error of the model (1) for that new individual, i.e. y₀ = Φ(x₀) + ε₀;

4) and, in case of a random design, the entries of the design matrix X.

The key assumption to assess the prediction error of a regression model is then:

(A₀). The random couple (x₀, ε₀) satisfies the same assumptions as (X, ε) in the model (1) and is independent of the sample errors (ε₁, …, ε_n).
2.4. Measuring the Prediction Error in Regression: Sample Based Definition
While the GCV score is classically derived, indeed, as an estimate of the MSPE as given by (7), through a two-stage process where the intermediate step is the well known LOO-CV estimate of that MSPE, the traditional derivation of the FPE criterion stems from a completely different angle. Actually, the latter angle is the one presented in most textbooks on regression [8] [9]. In it, with the regression function Φ in (1) estimated through Φ̂, a function computed from the data (3), the prediction error of that fit is rather measured by how well the vector of fitted responses on the sample items, (Φ̂(x₁), …, Φ̂(x_n))ᵀ, estimates the vector of exact responses on those items, (Φ(x₁), …, Φ(x_n))ᵀ. In that respect, one considers the mean average squared error, also called risk,

R(Φ̂) = (1/n) Σᵢ E[(Φ̂(xᵢ) − Φ(xᵢ))² | X]   (10)

with bias-variance decomposition

R(Φ̂) = (1/n) Σᵢ {b(xᵢ)² + Var[Φ̂(xᵢ) | X]}   (11)

where b(xᵢ) = E[Φ̂(xᵢ) | X] − Φ(xᵢ) is the conditional bias, given X, in estimating Φ(xᵢ) by Φ̂(xᵢ).

But the relation to the more obvious definition (8c) of the conditional MSPE is better seen when one considers, instead, the average predicted squared error ([8], Chapter 3, page 42) or prediction risk ([9], Chapter 2, page 29):

PSE(Φ̂) = (1/n) Σᵢ E[(yᵢ* − Φ̂(xᵢ))² | X]   (12)

where yᵢ* = Φ(xᵢ) + εᵢ*, i = 1, …, n, are putative responses assumed generated at the respective predictor values x₁, …, x_n through model (1), but with respective errors εᵢ* independent from the initial ones ε₁, …, ε_n. Nonetheless, there is a simple relation between (10) and (12):

PSE(Φ̂) = σ² + R(Φ̂)   (13)

hence minimizing PSE(Φ̂) w.r.t. Φ̂ is the same as doing so for R(Φ̂).
In its classical derivation for linear regression (see, e.g., [10], pages 19-20), the FPE selection criterion is an estimate of the PSE (12). Now, with the terminology elaborated in [5], measuring the prediction error by the latter amounts to adopting a Fixed-X viewpoint, as opposed to the Random-X one adopted when measuring it instead through the MSPE (7). But an even more common terminology to distinguish these two approaches to estimating the predictive performance of a regression method qualifies the first as in-sample prediction and the second as out-of-sample prediction. It should be said that while, in the past, the prediction error was mostly evaluated using the in-sample paradigm, facing the complexity of data met in modern statistics, noticeably high dimensional data, many researchers in regression have advocated or used the out-of-sample viewpoint, though this might be through either (7), (8b), or (8c), depending on the author(s). In that respect, in addition to the aforementioned paper [5], we may cite, e.g., Breiman and Spector [11], Leeb [12], Dicker [13], Dobriban and Wager [14].
Note, however, that the prediction error viewpoint in assessing the quality of a regression fit is not without its own demerits. Indeed, in [15] and [16], it is highlighted that in the specific case of smoothing splines regression, one can find a fit which is optimal from the prediction error viewpoint, but which clearly undersmooths the data, resulting in a wiggly curve. But the argument appears to be more a matter of visual aesthetics, because the analysis in those papers targets the regression function Φ, which is, indeed, probably the main objective of many users of univariate nonparametric regression. Nonetheless, when measuring the prediction error through the MSPE, the target is rather the response Y, i.e. Φ + error, in which the wiggliness is inherently embedded. One should not forget that when formulating his Final Prediction Error, Akaike was working on real world engineering problems [17], hence his interest in targeting the actual output in his predictions.
3. MSPE of a Linear Model Fitter
From now on, we focus attention on the generic LM, with β = (β₁, …, β_p)ᵀ ∈ ℝ^p an unknown parameter vector:

Y = Xᵀβ + ε   (14)

to be fitted to the data (3). Because of its tractability and ease of manipulation, the LM is the most popular approach to estimating the regression function Φ in (1). It is mostly implemented by estimating β through the Ordinary Least Squares criterion. However, several other approaches have been designed to estimate β on various grounds, such as Weighted Least Squares, Total Least Squares, Least Absolute Deviations, LASSO [18] and Ridge Regression [19]. Furthermore, some generally more adaptive nonparametric regression methods proceed by first nonlinearly transforming the data to a scale where they can be fitted by a LM. Due to its more than ever central role in statistical modeling and practice, several books have been and continue to be fully devoted to the presentation of the LM and its many facets, such as [20] [21] [22] and [23].
3.1. Some Preliminaries
For the sample of n individuals with recorded data (3), the general regression Equation (9) becomes, in the case of the LM (14):

yᵢ = xᵢᵀβ + εᵢ, i = 1, …, n   (15)

or, better, globally summarized in matrix form,

Y = Xβ + ε   (16)

Then Assumptions (A₁) and (A₂) respectively imply here:

E(ε | X) = 0 and Var(ε | X) = σ²Iₙ   (17)

with Iₙ the n-by-n identity matrix. Consequently,

E(Y | X) = Xβ and Var(Y | X) = σ²Iₙ   (18)
Fitting the LM (14) boils down to estimating the p-vector of parameters β. We call Linear Model fitter, or LM fitter, any method allowing one to achieve that. First, we consider an arbitrarily chosen such method. It uses the data (3) to compute β̂, an estimate of β. The precision of that estimate³ can be assessed through its Mean Squared Error matrix:

M(β̂) = E[(β̂ − β)(β̂ − β)ᵀ]   (19a)

But to reach M(β̂) generally requires passing through the conditional Mean Squared Error matrix of β̂ given the design matrix X:

M(β̂ | X) = E[(β̂ − β)(β̂ − β)ᵀ | X]   (19b)

The relationship between the two is:

M(β̂) = E[M(β̂ | X)]   (19c)

Those two matrices will play a key role in our MSPE derivations to come.
The precision of an estimate of a vector parameter like β is easier to assess when its Mean Squared Error matrix coincides with its covariance matrix. Hence our interest in:

Definition 1. In the LM (14), β̂ is an unbiased estimate of β, conditionally on X, if E(β̂ | X) = β.

Then one has: M(β̂ | X) = Var(β̂ | X), the conditional covariance matrix of β̂ given X.

Since E(β̂) = E[E(β̂ | X)], it is immediate that if β̂ is an unbiased estimate of β, conditionally on X, then E(β̂) = β, i.e. β̂ is an unbiased estimate of β (unconditionally). So the former property is stronger than the latter, but is more useful in this context. Note also that it implies: M(β̂) = Var(β̂), the covariance matrix of β̂. On the other hand, when β̂ is a biased estimate of β, the bias-variance decomposition of M(β̂ | X) might be of interest:

M(β̂ | X) = Var(β̂ | X) + b(β̂ | X) b(β̂ | X)ᵀ   (20)

where b(β̂ | X) = E(β̂ | X) − β.
3.2. MSPE of a Linear Model Fitter: Base Results
In the prediction setting of Section 2.1 applied to the LM (14), one has, for the new individual:

y₀ = x₀ᵀβ + ε₀   (21)

with x₀ observed, but y₀ unknown. It would then be natural to predict the response value y₀ by

ŷ₀ = x₀ᵀβ   (22)

were the exact value of the parameter vector β available. Since that is typically not the case, one rather predicts y₀ by the computable quantity

ŷ₀ = x₀ᵀβ̂   (23)

The goal here is to find expressions, in this context, for the MSPE of ŷ₀ w.r.t. y₀ as given by (7)-(8c), manageable enough to allow the derivation of computable estimates of that MSPE for the most common LM fitters. The starting point to get such MSPE expressions is the base result:
Theorem 1. For any β̂ estimating β in the LM (14), one has, under (A₀), (A₁), (A₂), with Σ₀ = E(x₀x₀ᵀ):

MSPE(β̂ | X, Y, x₀) = σ² + x₀ᵀ(β̂ − β)(β̂ − β)ᵀx₀   (24a)
MSPE(β̂ | X, Y) = σ² + tr[Σ₀(β̂ − β)(β̂ − β)ᵀ]   (24b)
MSPE(β̂ | X) = σ² + tr[Σ₀ M(β̂ | X)]   (24c)
MSPE(β̂) = σ² + tr[Σ₀ M(β̂)]   (24d)

Proof. From (21) and (23), we first get:

y₀ − ŷ₀ = ε₀ + x₀ᵀ(β − β̂)   (25)

Now, from (8a) and using Assumption (A₀),

MSPE(β̂ | X, Y, x₀) = E[(ε₀ + x₀ᵀ(β − β̂))² | X, Y, x₀] = σ² + [x₀ᵀ(β − β̂)]²   (26)

the latter because E(ε₀ | X, Y, x₀) = 0 and Var(ε₀ | X, Y, x₀) = σ², so the cross term vanishes. On the other hand, [x₀ᵀ(β − β̂)]² = x₀ᵀ(β̂ − β)(β̂ − β)ᵀx₀ because x₀ᵀ(β − β̂) is a scalar. This, inserted in (26), yields (24a).

Thanks to (24a) and identity (8b), taking the expectation over x₀ and using x₀ᵀBx₀ = tr(Bx₀x₀ᵀ), we get (24b). Likewise, (24b) and (8c) give (24c), since E[(β̂ − β)(β̂ − β)ᵀ | X] = M(β̂ | X). Finally, from relation (24c) and (7), one gets (24d) by taking the expectation over X and using (19c). □
The above result is interesting in that it imposes no assumption on β̂, hence it is valid for any LM fitter. But an immediate important subcase is provided in:

Corollary 2. If, conditionally on X, β̂ estimates β unbiasedly in the LM (14), then, under Assumptions (A₀), (A₁) and (A₂):

MSPE(β̂ | X) = σ² + tr[Σ₀ Var(β̂ | X)]   (27a)
MSPE(β̂) = σ² + tr[Σ₀ Var(β̂)]   (27b)
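Identity (27a) is easy to check numerically. The sketch below simulates, for one fixed design and the conditionally unbiased OLS estimate β̂ = (XᵀX)⁻¹XᵀY of Section 4 (for which Var(β̂ | X) = σ²(XᵀX)⁻¹), the average squared prediction error over fresh draws of the sample errors and of the new couple (x₀, ε₀). Drawing x₀ ~ N(0, I_p), so that Σ₀ = I_p, is our own choice for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, reps = 50, 3, 1.0, 50_000

X = rng.standard_normal((n, p))          # one fixed design
beta = np.array([1.0, -2.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

# theory: (27a) with Sigma_0 = I_p and Var(beta_hat | X) = sigma^2 (X'X)^{-1}
mspe_theory = sigma**2 * (1.0 + np.trace(XtX_inv))

# Monte Carlo: redraw the sample errors, the new covariates x0 and the new error eps0
E = sigma * rng.standard_normal((n, reps))              # training error vectors
beta_hat = XtX_inv @ X.T @ (X @ beta[:, None] + E)      # p x reps OLS estimates
x0 = rng.standard_normal((reps, p))
eps0 = sigma * rng.standard_normal(reps)
pred_err = eps0 + np.sum(x0 * (beta[None, :] - beta_hat.T), axis=1)  # (25)
mspe_mc = float(np.mean(pred_err**2))
```

Under (A₀)-(A₂), the two numbers agree up to Monte Carlo error.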
4. MSPE When Fitting the LM by Ordinary Least Squares
By far, the most popular approach to estimating the parameter vector
in the LM (14) is through minimizing the Ordinary Least Squares (OLS) criterion using the observed data (3):
(28)
The properties of
as an estimate of
depend on whether the design matrix
has full column rank or not. This remains true when studying the corresponding MSPE as well.
4.1. MSPE in the LM Fitted by OLS with X of Full Column Rank
4.1.1. LM Fitted by OLS with X Full Column Rank
Here, we consider the LM (14) fitted under:

(A₃). rank(X) = p, i.e. X is a full column rank matrix.

That assumption is known to be equivalent to saying that the square matrix XᵀX is nonsingular, thus implying that the OLS problem (28) has the unique solution

β̂ = (XᵀX)⁻¹XᵀY   (29)

Furthermore, given Assumptions (A₁)-(A₂), (18) holds, so

E(β̂ | X) = β and Var(β̂ | X) = σ²(XᵀX)⁻¹   (30)

We also recall that under these assumptions, with ε̂ = Y − Xβ̂ the residual response vector and ‖·‖ the Euclidean norm, a computable unbiased estimate of the residual variance σ² is:

σ̂² = ‖ε̂‖²/(n − p) = RSS/(n − p)   (31)

with RSS = ‖Y − Xβ̂‖² the sum of squared residuals in the OLS fit to the data.

Now, from the first identity in (30), we deduce that when X is of full column rank, β̂ is an unbiased estimate of β, conditionally on X. Hence, combining the second identity in (30) with Corollary 2 yields:
Theorem 3. In the LM (14) fitted by OLS, with Assumptions (A₀), (A₁), (A₂) and (A₃),

MSPE(β̂ | X) = σ²{1 + tr[Σ₀(XᵀX)⁻¹]}   (32a)
MSPE(β̂) = σ²{1 + tr[Σ₀ E((XᵀX)⁻¹)]}   (32b)

Proof. From (27a) and the second identity in (30),

MSPE(β̂ | X) = σ² + tr[Σ₀ σ²(XᵀX)⁻¹] = σ²{1 + tr[Σ₀(XᵀX)⁻¹]}.

Now, taking the expectation of (32a) over X, one gets (32b). □
4.1.2. The FPE and the GCV in the LM Fitted by OLS with X of Full Column Rank
From (32b) in Theorem 3, we deduce a closed form computable estimate of MSPE(β̂), using data (3), by estimating, respectively:
• the residual variance σ² by σ̂² given by (31);
• the p × p expectation matrix E[(XᵀX)⁻¹] by the observed (XᵀX)⁻¹;
• the p × p expectation matrix Σ₀ = E(x₀x₀ᵀ) by (given that x₀, x₁, …, x_n are i.i.d.):

Σ̂₀ = (1/n) Σᵢ xᵢxᵢᵀ = XᵀX/n   (33)

Therefore, one estimates MSPE(β̂) by

FPE = σ̂²{1 + tr[Σ̂₀(XᵀX)⁻¹]} = σ̂²(1 + p/n) = ((n + p)/(n − p)) σ̂²_ML   (34)

with

σ̂²_ML = RSS/n = ‖Y − Xβ̂‖²/n   (35)

the usual Maximum Likelihood Estimator of the residual variance σ² in the LM (14) when the residual error ε is assumed to follow a Gaussian distribution.
We see that the final estimate (34) obtained for MSPE(β̂) coincides with Akaike’s Final Prediction Error (FPE) goodness-of-fit criterion for the LM (14) fitted by OLS [1]. The main difference between the derivation above and the traditional one is that the latter uses the sample viewpoint of the prediction error reviewed in Section 2.4. That viewpoint excludes the possibility that covariate values on a future individual might be completely unrelated to the observed xᵢ’s in the sample (3). In particular, it does not account for any potential random origin of the design matrix X, a situation often encountered in certain areas of application of the LM, such as econometrics.

On the other hand, estimating, instead, the matrix Σ₀ by XᵀX/(n − p) yields as estimate of MSPE(β̂):

GCV = σ̂²(1 + p/(n − p)) = (RSS/n)/(1 − p/n)²   (36)

the traditional GCV estimate of the MSPE in the LM fitted by OLS with X of full column rank.
Remark 1. Note that the very way the two estimates (34) and (36) of MSPE(β̂) were derived above implies that they can also validly serve, each, as an estimate of the conditional MSPE(β̂ | X) given by (32a). This will remain true for all the estimates derived for the MSPE under the other scenarios examined in this paper.
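In code, both criteria follow directly from the residual sum of squares; a minimal sketch, assuming X of full column rank:

```python
import numpy as np

def ols_fpe_gcv(X, y):
    """FPE (34) and GCV (36) estimates of the MSPE for an OLS fit,
    assuming the design matrix X has full column rank."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    fpe = (n + p) / (n - p) * rss / n        # (34)
    gcv = (rss / n) / (1.0 - p / n) ** 2     # (36)
    return fpe, gcv

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.standard_normal(100)
fpe, gcv = ols_fpe_gcv(X, y)
```

Since (n + p)(n − p) < n², one always has FPE < GCV here, both inflating the naive in-sample estimate RSS/n.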
4.2. MSPE in the LM Fitted by OLS with X Not of Full Column Rank
4.2.1. LM Fitted by OLS with X Column Rank Deficient
Although Assumption (A₃) is routinely admitted by most people when handling the LM, one actually meets many concrete instances of data sets where it does not hold. Fortunately, with the formalization by Moore [24], Penrose [25] and, especially, Rao [26] of the notion of generalized inverse (short: g-inverse) of an arbitrary matrix, it became possible to handle least squares estimation in the LM without having to assume the design matrix X to be of full column rank.
To begin with, it is shown in most textbooks on the LM that, whatever the rank of the design matrix X, a vector β̂ is a solution to the OLS minimization problem (28) if, and only if, it is a solution to the so called normal equations:

XᵀXb = XᵀY   (37)

When Assumption (A₃) holds, the unique solution to the normal equations is clearly β̂ = (XᵀX)⁻¹XᵀY, given by (29). When that is not the case, the square matrix XᵀX is singular, hence does not have a regular inverse. Nevertheless, even then, it can be shown that the normal Equations (37) are always consistent. But the apparent negative thing is that they then have infinitely many solution vectors, actually all vectors β̂ of the form:

β̂ = (XᵀX)⁻XᵀY   (38)

where (XᵀX)⁻ is any g-inverse of XᵀX in the sense of Rao. Given that multitude of possible OLS estimates of β in this case, one may worry that this may hinder any attempt to get a meaningful estimate of the MSPE in the fitted LM. But we are going to show that such a worry is not warranted.
When Assumption (A₃) does not hold, in spite of there being as many solutions β̂ to the normal Equations (37) as there are g-inverses (XᵀX)⁻ of XᵀX, i.e. infinitely many, it is a remarkable and well known fact that the fitted response vector

Ŷ = Xβ̂ = X(XᵀX)⁻XᵀY = HY, with H = X(XᵀX)⁻Xᵀ   (39)

is the same whichever g-inverse (XᵀX)⁻ of XᵀX is used to compute β̂ through (38). This stems from the hat matrix H being equal to the matrix of the orthogonal projection of ℝⁿ onto the range space of X ([22], Appendix A). Therefore, the residual response vector

ε̂ = Y − Ŷ = (Iₙ − H)Y   (40)

does not vary with the g-inverse (XᵀX)⁻ either. Then we will need the result:

Lemma 4. With r = rank(X), one has: H² = H and tr(H) = r.
Proof. On the one hand, one has:

H² = X(XᵀX)⁻(XᵀX)(XᵀX)⁻Xᵀ   (41)

Now, (XᵀX)⁻ being a g-inverse of XᵀX, it is known that ([22], Appendix A, page 509):

X(XᵀX)⁻XᵀX = X   (42)

Relations (41) and (42) give H² = H. On the other hand, H being idempotent,

tr(H) = rank(H)   (43)

Now, H being the matrix of the orthogonal projection of ℝⁿ onto the range space of X, then

rank(H) = rank(X) = r   (44)

From (43) and (44), we get tr(H) = r. □

Under Assumptions (A₁), (A₂), and thanks to the known identity giving the expectation of a quadratic form ([27], Appendix B, page 170), one has:

E(‖ε̂‖² | X) = σ² tr(Iₙ − H) = σ²(n − r)

An unbiased estimate of the LM residual variance σ² in this case is thus known to be:

σ̂² = ‖ε̂‖²/(n − r) = RSS/(n − r)   (45)
We will also need the mean vector and covariance matrix of β̂. First,

E(β̂ | X) = (XᵀX)⁻XᵀXβ   (46a)

which shows that β̂ is, in general, a biased estimator of β when Assumption (A₃) does not hold. But, in spite of that, note that from (39) and (42),

E(Ŷ | X) = X E(β̂ | X) = X(XᵀX)⁻XᵀXβ = Xβ   (46b)

On the other hand,

Var(β̂ | X) = σ²(XᵀX)⁻XᵀX[(XᵀX)⁻]ᵀ   (47a)

the symmetric and positive semi-definite matrix (XᵀX)⁻XᵀX[(XᵀX)⁻]ᵀ also being a g-inverse of XᵀX. Then

Var(Ŷ | X) = Var(Xβ̂ | X) = σ²HHᵀ = σ²H   (47b)

again independent of the g-inverse (XᵀX)⁻ of XᵀX used to compute β̂.
4.2.2. Preliminary for the MSPE in the LM Fitted by OLS without Assumption (A₃)
Our first aim here is to examine the MSPE in the LM when fitted by OLS under the assumption that the design matrix X might not have full column rank. So β has been estimated through β̂ given by (38), and we are interested in the MSPE of ŷ₀ = x₀ᵀβ̂, taken as prediction of Y on an independently sampled new individual for whom x₀ would have been observed, but not y₀. We are going to use the results of Section 3.2. First note that since, from the above, β̂ is a biased estimator of β, Corollary 2 does not apply here. Nonetheless, from Theorem 1, we get:

MSPE(β̂ | X) = σ² + tr[Σ₀ M(β̂ | X)]   (48a)
MSPE(β̂) = σ² + tr[Σ₀ M(β̂)]   (48b)

Our estimation of the MSPE in this case will be based on those two identities and:

Lemma 5. In the LM (14) with Assumptions (A₀)-(A₂),

tr[XᵀX M(β̂ | X)] = σ² r   (49)

Proof. For M(β̂ | X) = E[(β̂ − β)(β̂ − β)ᵀ | X], one has

tr[XᵀX M(β̂ | X)] = E[‖X(β̂ − β)‖² | X] = tr[Var(Xβ̂ | X)] + ‖E(Xβ̂ | X) − Xβ‖² = σ² tr(H),

thanks to (46b) and (47b). Hence (49), thanks to Lemma 4. □
4.2.3. The FPE and the GCV in the LM Fitted by OLS with X Column Rank Deficient
Given (48a), we first estimate Σ₀ by Σ̂₀ = XᵀX/n as in (33), which entails the following preliminary estimate of MSPE(β̂ | X) in the present case, using the last lemma:

σ² + (1/n) tr[XᵀX M(β̂ | X)] = σ²(1 + r/n)   (50)

But since MSPE(β̂) = E[MSPE(β̂ | X)], (50) also gives a preliminary estimate of MSPE(β̂). Then, estimating σ² by σ̂² given by (45), our final estimate of the MSPE in this case, computable from data, is:

FPE = σ̂²(1 + r/n) = ((n + r)/(n − r)) · RSS/n   (51)

which is also an estimate of MSPE(β̂ | X). It is denoted FPE because it generalizes (34) in assessing the goodness-of-fit of OLS in the LM when the design matrix X is column rank deficient. The remarkable feature is that this estimate is the same whichever g-inverse (XᵀX)⁻ of XᵀX was used to get the estimate β̂ of β in (38).

Estimating, instead, Σ₀ by XᵀX/(n − r) yields as estimate of the MSPE:

GCV = σ̂²(1 + r/(n − r)) = (RSS/n)/(1 − r/n)²   (52)

the GCV estimate of the MSPE in the LM fitted by OLS when X is not of full column rank.
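A sketch of (51)-(52), using the Moore-Penrose pseudoinverse as one particular g-inverse; the fitted values, hence RSS, r = rank(X), and both criteria, are invariant to that choice:

```python
import numpy as np

def ols_fpe_gcv_rank_deficient(X, y):
    """FPE (51) and GCV (52) when X may be column rank deficient."""
    n = X.shape[0]
    r = np.linalg.matrix_rank(X)
    beta_hat = np.linalg.pinv(X) @ y      # one solution (38) of the normal equations
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    sigma2_hat = rss / (n - r)            # unbiased variance estimate (45)
    fpe = sigma2_hat * (1 + r / n)        # (51)
    gcv = (rss / n) / (1 - r / n) ** 2    # (52)
    return fpe, gcv

# design with an exactly redundant column: rank 2, p = 3
rng = np.random.default_rng(0)
B = rng.standard_normal((60, 2))
X = np.column_stack([B, B[:, 0] + B[:, 1]])
y = X @ np.array([1.0, 1.0, 0.0]) + rng.standard_normal(60)
fpe, gcv = ols_fpe_gcv_rank_deficient(X, y)
```

Any other solver of the normal equations (e.g. `np.linalg.lstsq`) gives the same fitted vector Ŷ, as (39) guarantees, and therefore the same two criteria.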
5. MSPE when Fitting the LM by Ridge Regression
The design matrix X being column rank deficient means that its p columns are linearly dependent, or almost so. This happens when there is multicollinearity among the p regressors, so that some of them are redundant with some others. When this occurs, computing an OLS estimate β̂ of β, given by (38), in a numerically stable manner is not easy and requires carefully designed Numerical Linear Algebra routines. The difficulty stems from the fact that this requires, at least implicitly, finding, along the way, the exact rank of X, which is difficult to achieve precisely because of the multicollinearity among its columns. It can then be of much interest to have a method which can fit the LM without having to bother about the exact rank of X. This is precisely what Ridge Regression (RR) tries to achieve.
Hoerl and Kennard presented two variants of Ridge Regression [19]. In the initial one (the default), which some have termed Ordinary Ridge Regression (ORR), to fit the LM (14), one estimates β through regularizing the OLS criterion (28) by a ridge constraint, yielding:

β̂_λ = argmin_b {‖Y − Xb‖² + λ‖b‖²}   (53)

for some λ > 0, a penalty parameter to choose appropriately. The unique solution to (53) is known to be (whatever the rank of X):

β̂_λ = (XᵀX + λI_p)⁻¹XᵀY   (54)

Hoerl and Kennard presented ORR in [19] assuming that the design matrix X was of full column rank (i.e. our Assumption (A₃)), which also requires that n ≥ p. But because the symmetric matrix XᵀX is always at least positive semi-definite, imposing λ > 0 entails that the p × p matrix XᵀX + λI_p is symmetric and positive definite (SPD), whatever the rank of X and the ranking between the integers n and p. That is why, in what follows, we do not impose any rank constraint on X (apart from it being trivially ≥ 1 because X is nonzero). No specific ranking either is assumed between n and p.
However, hereafter, since it does not require extra work, we directly consider the extended setting of Generalized Ridge Regression (GRR) which, to fit the LM (14), estimates β through solving the minimization problem:

β̂_λ = argmin_b {‖Y − Xb‖² + λbᵀΩb}   (55)

where λ > 0 is as in ORR and Ω is a p × p symmetric and semi-positive definite (SSPD) matrix, both given. The solution of (55) is still of the form (54), but now with

G_λ = XᵀX + λΩ   (56)

in place of XᵀX + λI_p, i.e. β̂_λ = G_λ⁻¹XᵀY, under the assumption that G_λ is SPD. Since smoothing splines fitting can be cast in the GRR form (55), what follows applies to that hugely popular nonparametric regression method as well.
5.1. The MSPE Issue for Ridge Regression
More than for Least Squares, since the Ridge Regression fit of the LM depends on the unspecified parameter λ, it is critical to assess that fit for given data in order to be able to select the best λ value, i.e. the one ensuring the best fit. From the prediction error point of view adopted in this article, this amounts to choosing the λ for which the RR fit has the smallest MSPE. It thus requires estimating MSPE(β̂_λ) for any given λ value, where x₀ and y₀ are as before, while ŷ₀ = x₀ᵀβ̂_λ. Traditionally, estimating the MSPE in this context is mostly done using the Generalized Cross Validation (GCV) criterion, initially developed by Craven and Wahba for selecting the best value of the smoothing parameter in a smoothing spline [3]. That GCV is obtained as a variation of the LOO-CV. Here, to estimate MSPE(β̂_λ), we take a different route. We first note that the fitted sample response vector is

Ŷ_λ = Xβ̂_λ = A_λY, with A_λ = XG_λ⁻¹Xᵀ   (57)

On the other hand, from (54)-(56),

E(β̂_λ | X) = G_λ⁻¹XᵀXβ = B_λβ, with B_λ = G_λ⁻¹XᵀX ≠ I_p   (58)

So, again, Corollary 2 does not apply. But, using Theorem 1,

MSPE(β̂_λ | X) = σ² + tr[Σ₀ M(β̂_λ | X)]   (59a)
MSPE(β̂_λ) = σ² + tr[Σ₀ M(β̂_λ)]   (59b)

For estimating those quantities, we will need some preliminary results.
5.2. Preliminary Results for Estimating the MSPE in Ridge Regression
First, two simple, but remarkable, identities about the matrices A_λ and B_λ given in (57) and (58).

Lemma 6. XB_λ = A_λX and tr(B_λ) = tr(A_λ).

Proof. The first identity is easily got from (57) and (58). Indeed, one has:

XB_λ = XG_λ⁻¹XᵀX = (XG_λ⁻¹Xᵀ)X = A_λX.

As for the second one, tr(B_λ) = tr(G_λ⁻¹XᵀX) = tr(XG_λ⁻¹Xᵀ) = tr(A_λ). □

Next, a key preliminary about the Mean Squared Error matrix of β̂_λ as an estimate of β:

Lemma 7. Under Assumptions (A₀)-(A₂), one has:

tr[XᵀX M(β̂_λ | X)] = σ² tr(A_λ²) + ‖(Iₙ − A_λ)Xβ‖²   (60)

Proof. Inserting (20) in tr[XᵀX M(β̂_λ | X)] gives:

tr[XᵀX M(β̂_λ | X)] = tr[XᵀX Var(β̂_λ | X)] + ‖X b(β̂_λ | X)‖²

with Var(β̂_λ | X) = σ²G_λ⁻¹XᵀXG_λ⁻¹ and b(β̂_λ | X) = (B_λ − I_p)β. Now,

tr[XᵀX Var(β̂_λ | X)] = σ² tr(XG_λ⁻¹XᵀXG_λ⁻¹Xᵀ) = σ² tr(A_λ²),
X b(β̂_λ | X) = X(B_λ − I_p)β = (A_λ − Iₙ)Xβ,

the last identities using (57), (58) and Lemma 6. □

With the above lemma, we are now in a position to estimate the MSPE in Ridge Regression. We examine, hereafter, two paths for achieving that: one leads to the FPE, the other to the GCV.
5.3. Estimating the MSPE in Ridge Regression by the FPE
Here, we first estimate Σ₀ by Σ̂₀ = XᵀX/n in (59a). Then, given (60), this suggests the preliminary estimate of MSPE(β̂_λ | X) in RR:

σ² + (1/n)[σ² tr(A_λ²) + ‖(Iₙ − A_λ)Xβ‖²]   (61)

which is also, therefore, a preliminary estimate of MSPE(β̂_λ). It is only a preliminary estimate because, even given λ, it still depends on the two unknowns σ² and β. Interestingly, it can be shown ([8], Chapter 3, page 46) that (61) coincides with the PSE given by (12) in the present setting.

It is useful to note that (61) depends on β only through the squared bias term ‖(Iₙ − A_λ)Xβ‖². To estimate the latter, let ε̂_λ = Y − Ŷ_λ = (Iₙ − A_λ)Y, the vector of sample residuals, and RSS(λ) = ‖ε̂_λ‖², the sum of squared residuals in the Ridge Regression fit. Then, thanks to a well known identity ([9], Chapter 2, page 38),

E[RSS(λ) | X] = σ² tr[(Iₙ − A_λ)²] + ‖(Iₙ − A_λ)Xβ‖²

implying that RSS(λ) − σ² tr[(Iₙ − A_λ)²] is an unbiased estimate of ‖(Iₙ − A_λ)Xβ‖². Hence a general formula for computing an estimate of the MSPE in Ridge Regression:

FPE_σ̂²(λ) = σ̂² + (1/n){σ̂² tr(A_λ²) + RSS(λ) − σ̂² tr[(Iₙ − A_λ)²]} = RSS(λ)/n + (2 tr(A_λ)/n) σ̂²   (62)

where σ̂² is a chosen estimate of σ², possibly computed from the RR fit, thus dependent on λ.

We denoted FPE_σ̂² the estimate of the MSPE given by (62) for the reason to follow. Indeed, probably the most popular estimate of the residual variance σ² from an RR fit is the one proposed by Wahba in the context of smoothing splines [28]:

σ̂²_W = RSS(λ)/(n − tr(A_λ))   (63)

Now, if one uses σ̂² = σ̂²_W in (62), an algebraic manipulation easily leads to:

FPE(λ) = (RSS(λ)/n) · (n + tr(A_λ))/(n − tr(A_λ))   (64)

recovering the classical formula of the FPE for this setting, but this time as an estimate of the MSPE rather than the PSE.
5.4. Estimating the MSPE in Ridge Regression by the GCV
Here, we estimate Σ₀ in (59a) rather by Σ̂₀ = XᵀX/(n − tr(A_λ)). With (60), this suggests a preliminary estimate of MSPE(β̂_λ) in RR corresponding to (61), where the denominators n have been replaced by n − tr(A_λ). Using the same unbiased estimate of ‖(Iₙ − A_λ)Xβ‖² as in the previous section and an estimate σ̂² of σ², we get another general formula for computing an estimate of the MSPE in RR:

GCV_σ̂²(λ) = σ̂² + (1/(n − tr(A_λ))){σ̂² tr(A_λ²) + RSS(λ) − σ̂² tr[(Iₙ − A_λ)²]}   (65)

We denoted it GCV_σ̂² because, again taking σ̂² = σ̂²_W, one gets:

GCV(λ) = (RSS(λ)/n)/(1 − tr(A_λ)/n)²   (66)

the well known formula of Craven and Wahba’s GCV [3] for this setting, but this time derived without any reference to Cross Validation.
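For ORR (Ω = I_p), both criteria are cheap functions of λ, and the best λ can be chosen by minimizing either over a grid; a minimal sketch (the grid is illustrative, not prescriptive):

```python
import numpy as np

def ridge_fpe_gcv(X, y, lam):
    """FPE(lambda) (64) and GCV(lambda) (66) for ordinary ridge regression (Omega = I_p)."""
    n, p = X.shape
    G = X.T @ X + lam * np.eye(p)            # the matrix (56) with Omega = I_p
    A = X @ np.linalg.solve(G, X.T)          # influence matrix A_lambda of (57)
    rss = float(np.sum((y - A @ y) ** 2))    # RSS(lambda)
    t = float(np.trace(A))                   # plays the role of p in (34)/(36)
    fpe = (rss / n) * (n + t) / (n - t)      # (64)
    gcv = (rss / n) / (1 - t / n) ** 2       # (66)
    return fpe, gcv

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 5))
y = X @ np.array([2.0, 0.0, 0.0, -1.0, 0.5]) + rng.standard_normal(80)
grid = 10.0 ** np.arange(-4, 3, dtype=float)
lam_gcv = min(grid, key=lambda lam: ridge_fpe_gcv(X, y, lam)[1])
```

As λ → 0 with X of full column rank, tr(A_λ) → p and both criteria reduce to their OLS counterparts (34) and (36).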
6. Conclusion and Perspectives
In this work, the goal was not to derive new and better selection criteria to assess the goodness-of-fit in regression, but rather to show how one can derive the well known Akaike’s FPE and Craven and Wahba’s GCV [3] as direct estimates of the measure of prediction error most commonly known to users, which is not how they are traditionally derived. We achieved this for some of the best known linear model fitters, the two derivations differing only slightly for each of them. But, nowadays, in regression, use of the FPE criterion is generally not recommended because much better performing criteria are known [29], while the GCV has its own shortcomings in certain settings (e.g. small sample sizes), though it is hugely popular and almost the best in some difficult high dimensional situations [12]. It is then our hope that, in the future, one can, through the same unified framework used in this paper, derive new and better selection criteria, different from the proposals already available for the same setting, among which we can cite the AICc [30], the modified GCV [31] [32], the modified RGCV and R1GCV [33] [34], and the criteria proposed in [5].
NOTES
¹By default, in this paper, vectors and random vectors are column vectors, unless specified otherwise; and Aᵀ denotes the transpose of the matrix (or vector) A.
²However, that defect is more and more mitigated these days, thanks to the availability of increasingly user-friendly parallel programming environments in popular statistical software systems, provided one has a computer allowing to exploit such possibilities, e.g. a laptop with a multi-core processor.
3Keeping in line with our stated convention of denoting a random variable and its realization by the same symbol, whereas, technically, an estimate of a parameter is a realization of a random variable called estimator of that parameter, we use the term estimate here for both. So when an estimate appears inside an expectation or a covariance notation, it is definitely the estimator which is meant.