Built upon an iterative process of resampling without replacement and out-of-sample prediction, the delete-d cross validation statistic CV(d) provides a robust estimate of forecast error variance. To compute CV(d), a dataset consisting of n observations of predictor and response values is systematically and repeatedly partitioned (split) into subsets of size n − d (used for model training) and d (used for model testing). Two aspects of CV(d) are explored in this paper. First, estimates for the unknown expected value E[CV(d)] are simulated in an OLS linear regression setting. Results suggest general formulas for E[CV(d)] dependent on σ^{2} (“true” model error variance), n − d (training set size), and p (number of predictors in the model). The conjectured E[CV(d)] formulas are connected back to theory and generalized. The formulas break down at the two largest allowable d values (d = n − p − 1 and d = n − p, the 1 and 0 degrees of freedom cases), and numerical instabilities are observed at these points. An explanation for this distinct behavior remains an open question. For the second analysis, simulation is used to demonstrate how the previously established asymptotic conditions {d/n → 1 and n − d → ∞ as n → ∞} required for optimal linear model selection using CV(d) for model ranking are manifested in the smallest sample setting, using either independent or correlated candidate predictors.

Cross validation (CV) is a model evaluation technique that utilizes data splitting. To describe CV, suppose that each data observation consists of a response value (the dependent variable) and corresponding predictor values (the independent variables) that will be used in some specified model form for the response. The data observations are split (partitioned) into two subsets. One subset (the training set) is used for model parameter estimation. Using these parameter values, the model is then applied to the other subset (the testing set). The model predictions determined for the testing set observations are compared to their corresponding actual response values, and an “out-of-sample” mean squared error is computed. For delete-d cross validation, all possible data splits with testing sets that contain d observations are evaluated. The aggregated mean squared error statistic that results from this computational effort is denoted CV(d).
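For concreteness, the splitting procedure described above can be sketched in Python for the simplest possible case, the mean (intercept-only) model, where the training-set parameter estimate is just the mean of the training responses. This is an illustrative sketch, not code from the paper (which used MATLAB); the function name `cv_d` is ours:

```python
from itertools import combinations

def cv_d(y, d):
    """Delete-d cross validation, CV(d), for the mean model.

    Every size-d subset of observations is held out once; the model
    "prediction" for each held-out value is the mean of the remaining
    n - d training values.  Returns the aggregated out-of-sample
    mean squared error over all C(n, d) splits.
    """
    n = len(y)
    total, count = 0.0, 0
    for holdout in combinations(range(n), d):
        train = [y[i] for i in range(n) if i not in holdout]
        y_hat = sum(train) / len(train)  # OLS fit of the mean model
        total += sum((y[i] - y_hat) ** 2 for i in holdout)
        count += d
    return total / count

print(cv_d([0.0, 2.0], 1))       # two splits, each squared error 4 -> 4.0
print(cv_d([0.0, 0.0, 3.0], 1))  # (2.25 + 2.25 + 9) / 3 -> 4.5
```

The same structure applies to any OLS model: only the line computing `y_hat` changes, replacing the training mean with a prediction from parameters fit on the training rows.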

To define the CV(d) statistic used in ordinary least squares (OLS) linear regression, let p < n be positive integers, and let I_{k} denote the k-by-k identity matrix. Let X be the n-by-p matrix of predictor values and Y the n-by-1 vector of response values, and assume Y = Xβ + ε with ε ~ N(0, σ^{2}I_{n}), the standard linear statistical model for predicting Y using X. Each row of the matrix [X Y] corresponds to a data observation (p predictors and one response), and each column of X corresponds to a particular predictor. Assume that each p-row submatrix of X has full rank, a necessary condition for CV(d) to be computable for all allowable d. Let N = {1, 2, …, n} index the observations. For a holdout (testing) set S ⊂ N containing d observation indices, let S^{C} = N\S. Let X_{S} denote the row subset of X indexed by S, and define Y_{S}, X_{S^C}, and Y_{S^C} analogously. Define β̂_{S^C} = (X_{S^C}′X_{S^C})^{−1}X_{S^C}′Y_{S^C}, the OLS parameter estimate computed from the training set indexed by S^{C}.

Let ‖·‖ denote the L^{2} norm and let C(n,d) = n!/(d!(n − d)!) denote the number of size-d subsets of N. The delete-d cross validation statistic is

CV(d) = [C(n,d)·d]^{−1} Σ_{S⊂N, |S|=d} ‖Y_{S} − X_{S}β̂_{S^C}‖^{2}. (1)

This equation can be found in [

Three popular papers provided some of the early groundwork for cross validation. Allen [

Numerous authors have discussed and examined the properties of CV(1) specifically in the context of model selection (e.g., [

Asymptotic equivalence of CV(1) to the delete-1 jackknife, the standard bootstrap, and other model selection statistics such as Mallows’s C_{p} [

Unfortunately, none of these asymptotic theoretical developments provide practitioners with specific guidance or information helpful for making a judicious choice for d in an arbitrary small sample setting, for either forecast error variance estimation or model ranking for model selection. For this study, two questions are addressed regarding CV(d) that are relevant to the small sample setting. First, expressions are developed for E[CV(d)] for the X-random case using simulation, which are linked back to theory and generalized. Second, a model selection simulation is used to illustrate how Shao’s conditions {d/n → 1 and n − d → ∞ as n → ∞} are manifested in the smallest sample setting.

For the case with X-fixed, Shao & Tu [ ] provide the approximation

E[CV(d)] ≈ σ^{2}(1 + p/(n − d)). (2)

This expression implies that CV(d) provides an estimate for the model error variance σ^{2}, inflated by a factor that accounts for parameter estimation error when training on n − d observations.

A theorem introduced in this section establishes that (2) applies to the case of the mean (intercept) model. As previously noted, we are interested in the X-random case. Based on an extremely tight correspondence with simulation results, expressions are conjectured for E[CV(d)] when the linear model contains at least one random valued predictor, for cases with and without an intercept. Building from work described in Miller [ ], the conjectured expressions are then connected back to established theory and generalized.

Begin with the simplest case, where the only predictor is the intercept. Suppose X = 1^{n×1} (an n-vector of ones), so that the linear regression model under investigation is the mean model (so called because the OLS prediction is simply the mean of the training responses).

THEOREM 1: Suppose X = 1^{n×1}, and let 1 ≤ d ≤ n − 1. Then E[CV(d)] = σ^{2}(1 + 1/(n − d)), in agreement with (2) for p = 1.

Proof: Define β̂_{S^C} = ȳ_{S^C}, the mean of the n − d training responses indexed by S^{C}. The d-by-1 vector of prediction errors for holdout set S is Y_{S} − 1^{d×1}ȳ_{S^C}, where each y_{j} is a “deleted” observation (entry in Y_{S}) and y_{j} and ȳ_{S^C} are independent with common mean. Therefore

E[(y_{j} − ȳ_{S^C})^{2}] = Var(y_{j}) + Var(ȳ_{S^C}) = σ^{2} + σ^{2}/(n − d).

Since y_{j} is an arbitrary element of holdout set Y_{S}, we have

E[d^{−1}‖Y_{S} − 1^{d×1}ȳ_{S^C}‖^{2}] = σ^{2}(1 + 1/(n − d)).

Because of the linearity of expectation, averaging over all C(n,d) holdout sets S preserves this value, giving E[CV(d)] = σ^{2}(1 + 1/(n − d)). ∎
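The mean-model expectation σ^{2}(1 + 1/(n − d)), which follows from the independence of each held-out observation and the training mean, can be checked numerically with a small Monte Carlo sketch (illustrative only; the paper's simulations used MATLAB). With σ^{2} = 1, n = 8, and d = 4, the simulated mean of CV(d) should approach 1 + 1/(n − d) = 1.25:

```python
import random
from itertools import combinations

def cv_d(y, d):
    # CV(d) for the mean model: average squared out-of-sample error
    # over all size-d holdout sets, predicting with the training mean.
    n = len(y)
    total, count = 0.0, 0
    for holdout in combinations(range(n), d):
        train = [y[i] for i in range(n) if i not in holdout]
        y_hat = sum(train) / len(train)
        total += sum((y[i] - y_hat) ** 2 for i in holdout)
        count += d
    return total / count

random.seed(0)
n, d, reps = 8, 4, 2000
sims = [cv_d([random.gauss(0.0, 1.0) for _ in range(n)], d) for _ in range(reps)]
estimate = sum(sims) / reps
print(estimate)  # close to 1 + 1/(n - d) = 1.25
```

Note that the derivation does not use normality, only independent errors with common mean and variance, so the same check passes for other second-order error distributions.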

Different results appear when simulating models that include at least one random valued predictor. Using the random number generator in MATLAB® to simulate data sets {X,Y}, values for CV(d) were simulated for numerous cases with Y_{j} = X_{j}β + ε_{j}, with independent errors ε_{j} and predictor values drawn from a truncated standard normal distribution (values outside the interval [−3,3] were snapped to the appropriate interval endpoint). Simulations using an intercept were also examined, in which case the first column of X was populated with ones rather than random values.

For a particular (n,p), after simulating at least 20,000 CV(d) values for each possible d, the resulting sample means normalized by σ^{2} were found to follow rational number sequences clear enough to conjecture general formulas for E[CV(d)] dependent on n − d, p, and σ^{2}.

An apparently related outcome is the identification of a two-point region of numerical instability at the end of the E[CV(d)] error curve for any tested model that includes a random valued predictor. Specifically, simulation results reveal two points of increasing instability in E[CV(d)], at d = d_{max} − 1 and d = d_{max}, where d_{max} = n − p. The term “increasing instability” is apt because the coefficient of variation (=standard deviation/mean) calculated for the simulated CV(d) values is stable for d < d_{max} − 1, but increasingly blows up (along with E[CV(d)]) at d = d_{max} − 1 and d = d_{max} (results not shown). The reason for this phenomenon is, at present, an interesting open question. This exceptional behavior is not incompatible with the conjectured formulas for E[CV(d)] because the formulas break down at the two largest allowable d values.

To gauge the accuracy of the conjectured formulas for E[CV(d)], the author used an absolute percent error statistic defined by

APE = 100 × |simulated E[CV(d)] − predicted E[CV(d)]| / predicted E[CV(d)]. (4)

To provide a simple gauge for rounding error magnitude, values were simulated for the expected error of regression, given by

REG = ‖Y − Xβ̂‖^{2}/(n − p), (5)

where β̂ is the OLS parameter estimate computed from the full sample.

REG has the property that E[REG] = σ^{2}.
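As a sketch of why REG serves as a rounding-error gauge, the following Monte Carlo checks E[REG] = σ^{2} for the mean model (p = 1), assuming REG denotes the residual mean square SSE/(n − p) consistent with the stated unbiasedness; for p = 1 this reduces to the usual unbiased sample variance. This is an illustrative sketch, not the paper's code:

```python
import random

def reg(y):
    # Residual mean square for the mean model: SSE / (n - p) with p = 1,
    # i.e., the unbiased sample variance of y.
    n = len(y)
    y_bar = sum(y) / n
    return sum((v - y_bar) ** 2 for v in y) / (n - 1)

random.seed(1)
n, reps, sigma = 10, 5000, 1.0
avg = sum(reg([random.gauss(0.0, sigma) for _ in range(n)])
          for _ in range(reps)) / reps
print(avg)  # close to sigma**2 = 1.0
```

The spread of such Monte Carlo averages around σ^{2} gives a baseline for how much of the APE between simulated and conjectured E[CV(d)] is attributable to simulation noise alone.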

Two findings are notable: (a) distinct but related patterns for E[CV(d)] emerge when considering linear models consisting entirely of random valued predictors and those that use an intercept; and (b) two points of increasing instability appear at d = d_{max} − 1 and d = d_{max} in E[CV(d)]. The existence of the two-point instability appears to be robust to increasing dimensionality. Result (a) is expressed in Conjectures 1 and 2, which do not conflict with the exceptional behavior noted in (b). For Conjectures 1 and 2, suppose that Y = Xβ + ε with independent errors having mean 0 and variance σ^{2}.

CONJECTURE 1: Let X be the n-by-p design matrix where p < n − 2 and the predictors in X are multivariate normal. Then, for 1 ≤ d ≤ n − p − 2,

E[CV(d)] = σ^{2}(1 + p/(n − d − p − 1)) = σ^{2}(n − d − 1)/(n − d − p − 1). (6)

In the search for this equation, the author scrutinized simulated values for E[CV(d)], examining a variety of cases. Using the approximation in (2) as a starting point for exploring possible forms for the RHS of (6), the author eventually arrived at (6) through trial and error.

At d = d_{max} − 1 the denominator of (6) vanishes, and at d = d_{max}, (6) takes the nonsensical value (1 − p)σ^{2}. Compare this to the maximum E[CV(d)] value observed in simulation, realized at d = d_{max}, and note the disparity between simulated E[CV(d)] and predicted E[CV(d)] from the approximation provided in (2). Also note the blowup in simulated E[CV(d)] at the two largest d values, reflecting the previously described two-point instability of the E[CV(d)] error curve when at least one random valued predictor is used in the model.

CONJECTURE 2: Let X be the n-by-p design matrix where p < n − 2, the first column of X is an intercept, and the other predictors in X are multivariate normal. Then, for 1 ≤ d ≤ n − p − 2,

E[CV(d)] = σ^{2}(n − d + 1)(n − d − 2)/[(n − d)(n − d − p − 1)]. (7)

In the search for this equation, the author once again scrutinized simulated values for E[CV(d)], examining a variety of cases. This time, (6) was used as a starting point for exploring possible forms for the RHS of (7). Specifically, the author reasoned that substitution of an intercept for a random valued predictor reduces model complexity, suggesting that the E[CV(d)] expression for models that include an intercept might take the form of a dampened version of (6). Indeed, after much trial and error, this was found to be the case once the RHS of (7) was “discovered”.

Like Equation (6), (7) breaks down at the two largest allowable d values: at d = d_{max} − 1 its denominator vanishes, and at d = d_{max} it takes a nonsensical value.

These simulation results using independent normal predictors and errors provide strong evidence for the validity of Conjectures 1 and 2. Graphical evidence for this assertion can be seen in Figures 1-3. APE values (4) computed comparing (6) and (7) to corresponding simulated E[CV(d)] were generally O(10^{−2}) to O(10^{−1}). To provide a gauge for these error magnitudes, E[REG] values (5) were simulated and compared to the known value of σ^{2}. APE values from this comparison also were generally O(10^{−2}) to O(10^{−1}), indicating that rounding error was solely responsible for the slight differences observed between simulated E[CV(d)] and predicted E[CV(d)] from (6) and (7). To justify the more general assumptions in the conjectures using multivariate normal predictors and second-order errors, we use the following theoretical connection.

Equation (2) was examined because it was the only explicitly stated estimate for E[CV(d)] found in the literature. This expression gives the expected mean squared error of prediction (MSEP) for using a linear regression model to make a prediction for some future observation at a design point. However, (2) provides an inaccurate characterization for CV(d) in any arbitrary small sample setting where there are substantially more possible design point values than observations. In this situation, the random subset design used for making “out-of-sample” predictions when computing the CV(d) statistic more logically is associated with the expected MSEP for using a linear regression model to make a prediction for some future observation at a random X value.

In Miller [ ], an expression is given for the expected MSEP of an OLS linear regression model (with an intercept and multivariate normal predictors) used to predict at a random X value:

E[MSEP] = σ^{2}(1 + 1/n + (1 + 1/n)(p − 1)/(n − p − 1)), (8)

a dilation of the model error variance σ^{2}. Miller credits this result to [ ].

The “1/n” term in the dilation factor accounts for the variance of the intercept parameter estimated in the model. The other term in the dilation factor is developed using a Hotelling T^{2}-statistic, which is a generalization of Student’s t-statistic that is used in multivariate hypothesis testing. If we replace n − d with n in (7), then it is easy to show that (8) and (7) are equivalent.

Following Miller’s derivation for the case using a model with an intercept, we can also derive an expression equivalent to (6) for the “no intercept” case. The “1/n” term in (8) is not needed because all predictor and response variables are distributed with 0 mean, and no intercept is used in the model. It is a straightforward exercise to show that the other term in the dilation factor becomes p/(n − p − 1), giving

E[MSEP] = σ^{2}(1 + p/(n − p − 1)) = σ^{2}(n − 1)/(n − p − 1). (9)

(9) is identical to (6) if we substitute n for n − d in (6).

In support of the error generalization, simulations were examined using non-normal (second-order) error distributions; in support of the predictor generalization, simulations were examined drawing predictor values from a uniform distribution.

For example, with uniformly distributed predictors, simulated E[CV(d)] values deviated from the conjectured formulas (for d ≤ d_{max} − 2), but simulated E[REG] values were unchanged (as expected, since E[REG] is independent of predictor distribution). Therefore, unlike some of the more general properties for OLS linear regression, E[CV(d)] appears to depend on predictor distribution. It is worth noting that the two-point instability phenomenon persisted in the uniformly distributed X case and thus appears to be robust to predictor distribution.

Recall the asymptotic model selection conditions from [ ]: {d/n → 1 and n − d → ∞ as n → ∞}.

To use CV(d) for model selection in a manner that is consistent with Shao’s setup, one begins with a pool of candidate predictors and evaluates all possible linear models defined by non-empty subsets of this predictor pool for the purpose of estimating some response. The optimal model contains only and all of the predictors that contribute to the response. CV(d) is evaluated for each candidate model, and the model exhibiting the smallest CV(d) value is selected. Define the optimal d value (d_{opt}) to correspond with the CV(d) that exhibits the highest rate of selecting the optimal model. If Shao’s result has relevance in the small sample setting, one would expect d_{opt} to generally be among the larger allowable d values. Further, we would expect d_{opt} to increase nearly at the same rate as n, while at the same time also observing growth in n − d_{opt}.
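The selection procedure above can be sketched in Python for a deliberately simplified setting with just two candidate models, an intercept-only ("mean") model and a simple linear regression ("slope") model; this is an illustrative sketch of the ranking rule, not the paper's predictor pool or simulation code:

```python
from itertools import combinations

def cv_d(x, y, d, model):
    """CV(d) for two candidate models fit by OLS on each training set:
    'mean' (intercept only) or 'slope' (intercept + one predictor)."""
    n = len(y)
    total, count = 0.0, 0
    for holdout in combinations(range(n), d):
        tr = [i for i in range(n) if i not in holdout]
        xt, yt = [x[i] for i in tr], [y[i] for i in tr]
        xm, ym = sum(xt) / len(xt), sum(yt) / len(yt)
        if model == "mean":
            a, b = ym, 0.0
        else:  # simple linear regression, closed form
            b = (sum((u - xm) * (v - ym) for u, v in zip(xt, yt))
                 / sum((u - xm) ** 2 for u in xt))
            a = ym - b * xm
        total += sum((y[i] - (a + b * x[i])) ** 2 for i in holdout)
        count += d
    return total / count

# Response truly depends on x (noise-free for illustration), so the
# slope model achieves zero out-of-sample error and should be selected.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 * xi for xi in x]
best = min(["mean", "slope"], key=lambda m: cv_d(x, y, 1, m))
print(best)  # -> slope
```

Scaling this up to a pool of predictors means evaluating `cv_d` for every non-empty subset of the pool at each allowable d and tallying how often the minimizer is the optimal model.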

For this simulation, define the predictor pool {1, X_{1}, X_{2}} (an intercept and two random valued predictors), yielding seven candidate models corresponding to the non-empty subsets of the pool. For each simulated dataset, CV(d) was computed for every candidate model at each allowable d (1 ≤ d ≤ d_{ub}), with the upper bound (d_{ub}) determined by the largest allowable d for the full model (which uses all three predictors). For each d, the model with the smallest CV(d) value was identified, allowing for a rate of optimal model selection to be estimated across the iterations. For comparison, REG values also were computed and evaluated for optimal model selection rate.

Optimal model selection rates using model selectors CV(d) and REG (which is plotted at d = 0 for convenience) are shown in Figures 4(a)-(e) for the five unique optimal model cases, followed by the average optimal model selection rate across all cases.

At the extreme cases (the simplest and most complex optimal models), d_{opt} = d_{ub} and d_{opt} = 1, respectively, for all examined n. For cases in between (Figures 4(b)-(d)), d_{opt} varies with n in a generally logical progression. Ultimately we are interested in the behavior of d_{opt} in the arbitrary case, where the optimal model can be any one of the seven candidate models. In this averaged situation, d_{opt} does appear to exhibit behavior not unlike that implied by Shao’s conditions, suggesting that the essence of Shao’s asymptotic result may well have applicability in this most elementary of model selection scenarios, and perhaps the small sample model selection setting in general.

Additional simulation results using correlated candidate predictors X_{1} and X_{2} exhibit similar behavior, but with lower and flatter CV(d) rate curves (i.e., reduced capability for optimal model selection and less distinction for d_{opt}) and attenuated growth in d_{opt}. Though this situation does not precisely conform to Shao’s setup, it is valuable nonetheless because model selection situations using real data frequently involve correlated predictors.

The first objective of this research was to examine values of E[CV(d)] in a small sample setting using simulation. This effort resulted in Conjectures 1 and 2, which constitute the first explicitly stated, generally applicable formulas for E[CV(d)]. The link established between (6) and (7) and the random-X MSEP described in Miller [ ] connects the conjectured formulas back to established theory.

Revelation of the two-point numerical instability at the end of the E[CV(d)] error curve, which was not incompatible with the conjectured formulas for E[CV(d)] because of their breakdown at these values, was an unexpected outcome. This phenomenon, which did not appear to depend on predictor distribution, suggests the curious result that OLS linear regression models fit using just 1 or 0 degrees of freedom must be unique in some way compared to models fit using 2 or more degrees of freedom. Theoretical investigation of this exceptional behavior might best begin by examining the development of the Hotelling T^{2}-statistic for the multivariate normal X case that provides the basis for the X-random MSEP in (8).

For the second objective, an elementary model selection simulation with candidate predictors 1 (intercept), X_{1}, and X_{2} was performed. When the optimal model was the full model (i.e., the most complex model), CV(1) was the best model ranking statistic (d_{opt} = 1). When the optimal model was the mean model (i.e., the simplest model), then CV(d_{ub}) was the best model ranking statistic (d_{opt} = d_{ub}). For cases in between, d_{opt} was observed to vary with n in a generally logical progression dependent on optimal model complexity. Ultimately we are interested in the arbitrary optimal model case, which was simulated by averaging all of the specific optimal model selection rates. For the arbitrary optimal model case, d_{opt} and optimal model selection rate demonstrated behavior reflective of the conditions prescribed in [ ]: (i) d_{opt} increased as n increased, (ii) d_{opt} was generally among the larger allowable d values, (iii) d_{opt} increased at nearly the same rate as n, and (iv) growth occurred in n − d_{opt}. These behaviors persisted in a dampened fashion when correlated predictors were used, with faster growth observed in n − d_{opt}.

For practitioners, the analyses presented in this paper shed new light on computed CV(d) values, especially for small sample model selection and forecast error variance estimation problems where little is known about the behavior of CV(d). With the conjectured formulas for E[CV(d)] (which appear to be exact for the multivariate normal case), theoreticians can formulate more precise series expansions for CV(d) to facilitate the furthering of our mathematical understanding of this interesting and useful statistic.

Jude H. Kastens (2015). Small Sample Behaviors of the Delete-d Cross Validation Statistic. Open Journal of Statistics, 5, 382-392. doi: 10.4236/ojs.2015.55040