Variable Selection in Randomized Block Design Experiment

Abstract

In experimental settings, researchers often need to select the best subset model and obtain the best model estimates simultaneously. Selecting the best subset of variables improves prediction accuracy because noninformative variables are removed, and a model with high prediction accuracy can then be used for future forecasting. In this paper, we investigate the differences between various variable selection methods. The aim is to compare a frequentist method (backward elimination), a penalised shrinkage method (the Adaptive LASSO) and Least Angle Regression (LARS) for selecting the active variables in data produced by a blocked design experiment. The results of the comparative study support the use of the LARS method for the statistical analysis of data from blocked experiments.

Citation:

Aljeddani, S. (2022) Variable Selection in Randomized Block Design Experiment. American Journal of Computational Mathematics, 12, 216-231. doi: 10.4236/ajcm.2022.122013.

1. Introduction

Variable selection and statistical modelling are extremely useful in two respects: prediction accuracy and clear interpretation. Prediction accuracy can be improved by removing non-significant variables or shrinking them toward zero so that they have almost no effect on the response variable. This reduces the variance of the predicted values and improves overall prediction accuracy. Moreover, we often prefer a smaller and more meaningful model for the sake of interpretation. Ensuring that the selected model has the highest prediction accuracy allows the model to be used for future predictions. In parallel, selecting the simplest model by avoiding uninformative variables, which typically do not influence the response variable, is also crucial to sound scientific analysis. By removing the noninformative, or nonactive, variables, the predictive ability of models can be improved, and the relationship between the informative, or active, variables and the response variable can be described parsimoniously. The variable selection problem refers to obtaining an adequate subset of variables for the model. Variable selection in linear regression analysis is a long-standing subject. The experimenter may initially be uncertain about the most appropriate structure of the model: it might be unclear whether all the variables should be included or whether only some of them have significant effects on the response variable. The variable selection procedure therefore builds a regression model with an appropriate subset of variables.

Among earlier studies in this field, [1] and [2] proposed shrinkage techniques. They defined regularisation methods by adding a penalty function to the residual sum of squares; minimisation or maximisation of the penalised function with respect to the coefficients yields penalised likelihood estimators. Moreover, [3] discussed penalised methods in the context of supersaturated designs, which are very cost-effective with respect to the number of runs and therefore very desirable in industrial experiments. They introduced a nonconvex penalised least squares approach. However, their study added the penalty function to the least squares estimator, even though supersaturated experiments are expected to produce correlated data. In [4] and [5], the problem of correlated data and variable selection in a linear mixed model context was overcome by adding the penalty function to the generalised least squares (GLS) estimator, in which the two variance components are estimated by the restricted maximum likelihood (REML) estimator. Furthermore, [6] developed a method of block deletion and block addition to fit high-dimensional data sets, together with a threshold method. In their approach, a threshold value is set using an approximation error evaluated from the data. While the variable selection process is running, the error is approximated and updated for the new model; the next step then uses the updated threshold value and runs the variable selection again. This updating prevents the deletion of useful variables. The aim of this paper is to compare the analysis of the frequentist methodology (backward elimination), a penalised shrinkage method (the Adaptive LASSO) and Least Angle Regression (LARS) for selecting the active variables in data produced by a blocked design experiment. A simulation study using the design of the blocked experiment was also carried out to support the comparative study.

The work by [4] differs from this work in that they did not apply the Adaptive LASSO to the pastry dough experiment. Also, the work by [5] did not study the blocked design and the pastry dough experiment.

2. Linear Mixed Model and Analysis

The model for the block experiments includes two types of errors: block error and residual error. Hence, linear mixed models (LMMs) are used to analyse responses from the blocked experiments.

Linear mixed-effects models (LMMs) introduce correlations between observations through random effects. This leads to the use of generalised least squares (GLS) estimation, combined with restricted maximum likelihood (REML) estimation of the variance components, as discussed below. This type of analysis is used in most design of experiments textbooks that deal with blocked designs. In matrix notation, the model corresponding to a blocked design is written as

$$Y = X\beta + Z\gamma + \epsilon, \qquad (1)$$

where $Y$ is the $n \times 1$ vector of observations on the response of interest, $X$ is the $n \times p$ model design matrix containing the polynomial expansions of the $m$ factor levels at the $n$ experimental runs, $\beta$ is the $p \times 1$ vector of unknown fixed parameters, and $Z$ is an $n \times b$ random-effects design matrix representing the allocation of the runs to blocks, whose $(i, j)$th element is one when the $i$th observation belongs to the $j$th block, and zero otherwise. If the runs of the experiment are grouped per block, then $Z$ is of the form

$$Z = \operatorname{diag}\left[\mathbf{1}_{k_1}, \mathbf{1}_{k_2}, \ldots, \mathbf{1}_{k_b}\right], \qquad (2)$$

where $\mathbf{1}_k$ is a $k \times 1$ vector of ones, and $k_1, k_2, \ldots, k_b$ are the block sizes. The random effects of the $b$ blocks are contained in the $b \times 1$ vector $\gamma$, and the random errors are contained in the $n \times 1$ vector $\epsilon$. It is assumed that $\gamma$ and $\epsilon$ are independent and normally distributed, i.e. $\operatorname{Cov}(\gamma, \epsilon) = 0_{b \times n}$, where $0_{b \times n}$ is the $b \times n$ matrix of zeros. Hence, $\gamma \sim N(0_b, \sigma_\gamma^2 I_b)$ and $\epsilon \sim N(0_n, \sigma_\epsilon^2 I_n)$, where $0_b$ and $0_n$ are the $b$- and $n$-dimensional column vectors of zeros, and $I_b$ and $I_n$ are the $b$-dimensional and $n$-dimensional identity matrices, respectively.

Under these assumptions, Y is a normally distributed random variable with mean E ( Y ) = X β , and the variance-covariance matrix of the response Y can be written as

$$\begin{aligned}
V = \operatorname{Var}(Y) &= \operatorname{Var}(X\beta + Z\gamma + \epsilon) &\quad (3)\\
&= \operatorname{Var}(Z\gamma) + \operatorname{Var}(\epsilon) &\quad (4)\\
&= Z\operatorname{Var}(\gamma)Z' + \sigma_\epsilon^2 I_n &\quad (5)\\
&= \sigma_\gamma^2 ZZ' + \sigma_\epsilon^2 I_n. &\quad (6)
\end{aligned}$$

V can be written as a block diagonal matrix,

$$V = \begin{bmatrix} V_1 & 0 & \cdots & 0 \\ 0 & V_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & V_b \end{bmatrix},$$

where

$$V_i = \sigma_\epsilon^2 I_{k_i} + \sigma_\gamma^2 \mathbf{1}_{k_i}\mathbf{1}_{k_i}',$$

and

$$V_i = \begin{bmatrix} \sigma_\epsilon^2 + \sigma_\gamma^2 & \sigma_\gamma^2 & \cdots & \sigma_\gamma^2 \\ \sigma_\gamma^2 & \sigma_\epsilon^2 + \sigma_\gamma^2 & \cdots & \sigma_\gamma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_\gamma^2 & \sigma_\gamma^2 & \cdots & \sigma_\epsilon^2 + \sigma_\gamma^2 \end{bmatrix}.$$

As a result, the variance-covariance matrix V i of all observations within one block is compound symmetric: the main diagonal of the matrix contains the variances of the observations, while the off-diagonal elements are covariances. However, V i can be rewritten as

$$V_i = \sigma_\epsilon^2\left(I_{k_i} + \frac{\sigma_\gamma^2}{\sigma_\epsilon^2}\,\mathbf{1}_{k_i}\mathbf{1}_{k_i}'\right), \qquad (7)$$

so that, for the full response vector,

$$V = \sigma_\epsilon^2\left(I_n + \eta\, ZZ'\right), \qquad (8)$$

where $\eta = \sigma_\gamma^2 / \sigma_\epsilon^2$ is a measure of the extent to which observations within the same block are correlated. The larger this variance ratio, the more strongly observations within the same block are correlated.
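To make the structure of $Z$ and $V$ concrete, the following minimal sketch (in Python with NumPy; the original analysis was carried out in Matlab) builds the block design matrix of Equation (2) and the covariance matrix of Equation (6) for hypothetical block sizes and variance components.

```python
import numpy as np

def block_design_matrix(block_sizes):
    """Build Z = diag[1_{k_1}, ..., 1_{k_b}] as in Equation (2)."""
    n = sum(block_sizes)
    Z = np.zeros((n, len(block_sizes)))
    row = 0
    for j, k in enumerate(block_sizes):
        Z[row:row + k, j] = 1.0   # the k_j runs of block j
        row += k
    return Z

# Hypothetical values, for illustration only.
block_sizes = [4, 4, 4]
sigma2_eps, sigma2_gamma = 1.0, 2.0

Z = block_design_matrix(block_sizes)
n = Z.shape[0]
V = sigma2_gamma * Z @ Z.T + sigma2_eps * np.eye(n)   # Equation (6)
eta = sigma2_gamma / sigma2_eps                       # within-block correlation ratio
```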

When the random error terms as well as the group effects are normally distributed, the maximum likelihood estimate of the unknown model parameter $\beta$ in Equation (1) is the generalised least squares (GLS) estimate. Obtaining the estimator $\hat\beta$ of $\beta$ requires minimising

$$(y - X\beta)'V^{-1}(y - X\beta) = y'V^{-1}y - 2\beta'X'V^{-1}y + \beta'X'V^{-1}X\beta \qquad (9)$$

with respect to $\beta$, which is equivalent to finding $\hat\beta$ such that

$$(X'V^{-1}X)\hat\beta = X'V^{-1}y. \qquad (10)$$

Therefore, the generalised least squares (GLS) estimator of β is

$$\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}Y, \qquad (11)$$

and the variance-covariance matrix of the estimators is given by

$$\begin{aligned}
\operatorname{Var}(\hat\beta) &= \operatorname{Var}\!\left((X'V^{-1}X)^{-1}(X'V^{-1}Y)\right)\\
&= (X'V^{-1}X)^{-1}X'V^{-1}\operatorname{Var}(Y)\left((X'V^{-1}X)^{-1}X'V^{-1}\right)'\\
&= (X'V^{-1}X)^{-1}X'V^{-1}VV^{-1}X(X'V^{-1}X)^{-1}\\
&= (X'V^{-1}X)^{-1}(X'V^{-1}X)(X'V^{-1}X)^{-1}\\
&= (X'V^{-1}X)^{-1}. \qquad (12)
\end{aligned}$$

Often, the variances σ γ 2 and σ ϵ 2 are not known and therefore, Equation (11) and Equation (12) cannot be used directly. Instead, the estimates of the variance components, σ ^ γ 2 and σ ^ ϵ 2 , are substituted in the GLS estimator as in Equation (11), yielding

$$\hat\beta = (X'\hat V^{-1}X)^{-1}X'\hat V^{-1}Y, \qquad (13)$$

where

$$\hat V = \hat\sigma_\epsilon^2 I_n + \hat\sigma_\gamma^2 ZZ'. \qquad (14)$$

In that case, the variance-covariance matrix in Equation (12) can be approximated by

$$\operatorname{Var}(\hat\beta) = (X'\hat V^{-1}X)^{-1}. \qquad (15)$$
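As a minimal sketch of Equations (11), (12) and (15), assuming the model matrix X, the response Y and the (estimated) covariance matrix V are available as NumPy arrays, the GLS estimate and its variance-covariance matrix can be computed as follows.

```python
import numpy as np

def gls_fit(X, Y, V):
    """GLS estimate (Equation (11)) and its variance-covariance matrix (Equation (12))."""
    Vinv = np.linalg.inv(V)
    cov_beta = np.linalg.inv(X.T @ Vinv @ X)   # (X' V^{-1} X)^{-1}
    beta_hat = cov_beta @ (X.T @ Vinv @ Y)     # (X' V^{-1} X)^{-1} X' V^{-1} Y
    return beta_hat, cov_beta
```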

The generalised least squares (GLS) estimator is unbiased, meaning that $E(\hat\beta) = \beta$, and it is equal to the maximum likelihood estimator (MLE). The likelihood function is defined as the joint probability density function of the observed data, viewed as a function of the parameters. Hence, the likelihood function for $Y$ in Equation (1) is

$$L(\beta \,|\, Y) = (2\pi)^{-n/2}\,|V|^{-1/2}\exp\!\left[-\tfrac{1}{2}(Y - X\beta)'V^{-1}(Y - X\beta)\right], \qquad (16)$$

where $(2\pi)^{-n/2}$ is a constant that does not depend on $\beta$. The maximum likelihood estimator (MLE) is the estimator that maximises the likelihood function, which is equivalent to finding $\hat\beta$ such that

$$\frac{\partial}{\partial\beta} L(\hat\beta \,|\, Y) = 0, \qquad (17)$$

which is equivalent to

$$\frac{\partial}{\partial\beta} \ln L(\hat\beta \,|\, Y) = 0, \qquad (18)$$

where $\ln L(\hat\beta \,|\, Y)$ is the log-likelihood function. Since Equation (9) equals, up to an additive constant and a factor of $-2$, the log of Equation (16), maximising the likelihood is equivalent to minimising Equation (9), and the GLS estimator in Equation (11) therefore satisfies Equation (17) and Equation (18).

Moreover, V can be estimated when observed data is obtained. In this work, we used the Restricted Maximum Likelihood (REML) estimator to estimate V . According to [7], “REML requires the transformation of the response to remove the influence of the other model parameters followed by the maximisation of the likelihood for these transformed responses”. The likelihood in REML includes knowledge about the variance components yet does not include knowledge about the fixed effects [8].

The restricted log-likelihood used to estimate $\sigma_\epsilon^2$ and $\sigma_\gamma^2$ is

$$l_{\mathrm{REML}}(\sigma_\epsilon^2, \sigma_\gamma^2; Y) = -\tfrac{1}{2}\ln|V| - \tfrac{1}{2}\ln|X'V^{-1}X| - \tfrac{1}{2}(Y - X\hat\beta)'V^{-1}(Y - X\hat\beta), \qquad (19)$$

where $\hat\beta$ is defined in Equation (13). The restricted log-likelihood $l_{\mathrm{REML}}(\sigma_\epsilon^2, \sigma_\gamma^2; Y)$ is maximised with respect to the variance components $\sigma_\epsilon^2$ and $\sigma_\gamma^2$ to obtain approximately unbiased estimates of the variance components. In this work, this is done by minimising the negative restricted log-likelihood using the function "fmincon" in Matlab.
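The original analysis minimises the negative restricted log-likelihood with Matlab's "fmincon"; the sketch below shows an equivalent approach in Python with SciPy, assuming X, Y and Z are NumPy arrays. It is an illustration of Equations (13) and (19), not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def neg_restricted_loglik(theta, X, Y, Z):
    """Negative of the restricted log-likelihood in Equation (19)."""
    s2_eps, s2_gamma = theta
    n = len(Y)
    V = s2_eps * np.eye(n) + s2_gamma * Z @ Z.T
    Vinv = np.linalg.inv(V)
    XtVinvX = X.T @ Vinv @ X
    beta_hat = np.linalg.solve(XtVinvX, X.T @ Vinv @ Y)   # Equation (13)
    resid = Y - X @ beta_hat
    _, logdet_V = np.linalg.slogdet(V)
    _, logdet_XVX = np.linalg.slogdet(XtVinvX)
    return 0.5 * (logdet_V + logdet_XVX + resid @ Vinv @ resid)

def reml_estimates(X, Y, Z, start=(1.0, 1.0)):
    """Maximise the restricted log-likelihood by minimising its negative,
    with the variance components constrained to be positive."""
    res = minimize(neg_restricted_loglik, x0=np.array(start),
                   args=(X, Y, Z), method="L-BFGS-B",
                   bounds=[(1e-8, None), (1e-8, None)])
    return res.x  # (sigma2_eps_hat, sigma2_gamma_hat)
```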

3. The Backward Elimination Method

Backward elimination starts with the full model; the least significant variable, i.e. the one with the highest p-value above a significance level α, is then dropped. At each step the reduced model is re-fitted, and the procedure continues until all remaining variables are statistically significant, meaning that each corresponding variable has a p-value ≤ α.

We applied backward elimination based on the generalised least squares estimator to compare with the other methods. Backward elimination starts with the full model and eliminates one variable at a time based on the Wald (Wa) test statistic. The Wald test statistic is suitable for comparing nested models when the variance-covariance matrix changes after each drop. We estimate the variance components $\sigma_\epsilon^2$ and $\sigma_\gamma^2$ by REML using the full model, so the Wald test statistic is treated as if the variance components were known. We used the Wald test statistic at the 5% significance level, and we compare nested models throughout the backward elimination process. Assuming that $y$ is normally distributed and the variance components are known, we test the hypothesis:

$$H_0: \beta_i = 0 \quad \text{vs} \quad H_1: \beta_i \neq 0.$$

The Wald (Wa) test statistic will follow an F distribution with 1 and r degrees of freedom.

$$W_a = \frac{\hat\beta_i^2}{\sigma_{ii}^2} \sim F(1, r), \qquad (20)$$

where $\sigma_{ii}^2$ is the $(i, i)$th element of $\operatorname{Var}(\hat\beta)$. The degrees of freedom in a statistical calculation represent the number of values in the calculation that are free to vary; they can be defined as the number of independent observations minus the number of estimated parameters. The degrees of freedom are needed to guarantee the statistical accuracy of test statistics such as chi-square tests, t-tests and F-tests, which are used to compare the observed data with the data that would be expected under a particular hypothesis. The sketch below illustrates this elimination procedure.
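The following is a minimal sketch of Wald-based backward elimination for a mixed model with the variance components treated as known, written in Python with NumPy and SciPy. The residual degrees of freedom r and the columns protected from elimination (e.g. the intercept) are left as user-supplied inputs, since the paper does not fix them here; those choices are assumptions of this illustration.

```python
import numpy as np
from scipy.stats import f as f_dist

def backward_elimination(X, Y, V, resid_df, alpha=0.05, keep=(0,)):
    """Wald-based backward elimination for a mixed model with known V.

    X        : n x p model matrix (column 0 assumed to be the intercept)
    resid_df : denominator degrees of freedom r for the F(1, r) reference
    keep     : column indices never dropped (e.g. the intercept)
    """
    active = list(range(X.shape[1]))
    Vinv = np.linalg.inv(V)
    while True:
        Xa = X[:, active]
        cov = np.linalg.inv(Xa.T @ Vinv @ Xa)            # Equation (12)
        beta = cov @ (Xa.T @ Vinv @ Y)                   # Equation (11)
        wald = beta**2 / np.diag(cov)                    # Equation (20)
        pvals = f_dist.sf(wald, 1, resid_df)
        # positions of terms that may still be dropped
        droppable = [j for j, idx in enumerate(active) if idx not in keep]
        if not droppable:
            break
        worst = max(droppable, key=lambda j: pvals[j])
        if pvals[worst] <= alpha:
            break                                        # all remaining terms significant
        del active[worst]                                # drop least significant term
    return active, beta
```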

4. Adaptive LASSO Shrinkage Method (ALASSO)

To overcome the drawbacks of the classical approaches to variable selection, [1] and [2] proposed regression modelling by regularisation. This technique prevents overfitting by restricting the model, typically to reduce its complexity. Regularisation methods are based on shrinkage penalties, where penalty functions are added to the residual sum of squares or subtracted from the log-likelihood, and minimisation or maximisation of the penalised functions with respect to the coefficients yields penalised likelihood estimators. The shrinkage penalty can be understood as a cost for any nonzero estimate when the sum of squared residuals is minimised: the penalty shrinks the estimated coefficients toward zero by placing a constraint on the size of the regression coefficients [9] [10]. Shrinkage methods do not explicitly select variables; instead, they minimise the sum of squared residuals while penalising the size of the estimated coefficients, and so they have the advantage of selecting variables and estimating the coefficients simultaneously. A further advantage of shrinkage methods is that they often improve prediction accuracy and help to select a more parsimonious model, although there is a trade-off between the bias and the variance of the final model (see [11] [12]).

Depending on the size of the estimated coefficients, the penalised estimates may only be shrunk in size, whereas small estimated coefficients are more likely to be set exactly to zero. Hence, the choice of the shrinkage parameter λ is sensitive and important. Unlike traditional subset selection, penalised regression is a continuous process, as it shrinks the size of the coefficients and yields stable models with low prediction errors. However, because some shrinkage penalty functions shrink the coefficients towards zero but not exactly to zero when the estimated coefficients are large or the shrinkage parameter λ is very small, the resulting models can suffer from complexity and overfitting.

For linear mixed models, penalised generalised least squares (PGLS) estimates were discussed by [3]; they can be found by minimising

$$Q_{\mathrm{PGLS}}(\beta) = \frac{1}{2n}(Y - X\beta)'\hat V^{-1}(Y - X\beta) + \sum_{j=1}^{d} p_\lambda(|\beta_j|), \qquad (21)$$

with respect to $\beta$, where $\hat V = \hat\sigma_\epsilon^2 I_n + \hat\sigma_\gamma^2 ZZ'$ and $d$ is the number of model coefficients $\beta$. The expression $p_\lambda(\cdot)$ is a penalty function and the shrinkage parameter $\lambda$ is an unknown, strictly positive thresholding parameter, which is often selected using an information criterion after setting a grid for $\lambda$. In this work, we set a grid from 0 to 3; researchers can set any grid, and this choice is suitable for the data that we used and the effect sizes assumed in the simulation. It is assumed that $K = 1 + p_t + v_c$, where $p_t$ is the number of active fixed parameters in the fitted penalised least squares model [13]. The quantity $p_t$ can be computed as $p_t = \operatorname{tr}\!\left[X\{X'\hat V^{-1}X + \hat W\}^{-1}X'\hat V^{-1}\right]$, where $\hat W$ is a penalty matrix [3], and $v_c$ is the number of variance components used in fitting the penalised model. The following information criterion can be used to select the shrinkage parameter $\lambda$. The Akaike Information Criterion (AIC) is given by

$$\mathrm{AIC} = 2K - 2\ln\hat L.$$

We illustrate the algorithm of the PGLS as follows:

1) Let β ( 0 ) be the generalised least squares estimator β ^ GLS as in Equation (13), for the full model, a model fitted by all the variables in the experiment, and σ ^ ϵ 2 , σ ^ γ 2 be the REML estimates of the variance components for this model.

2) (a) Set a grid for $\lambda$ with $l$ values $\lambda_1, \lambda_2, \ldots, \lambda_l$.

(b) For each $i = 1, 2, \ldots, l$ of the grid, use $\lambda_i$ to estimate the model parameters for that value of the tuning parameter.

(c) Choose the $\lambda_i$ that minimises $\mathrm{AIC}(\lambda_i)$.

(d) Return $\lambda$.

3) (a) Set $\beta^{(0)}$ to the GLS estimator and use the $\lambda$ returned from the previous loop.

(b) We use $\mathrm{SE}(\hat\beta)$ as our proposed threshold to decide whether the estimates are statistically significant, where $\mathrm{SE}(\hat\beta_j) = \sqrt{[\operatorname{Var}(\hat\beta)]_{jj}}$, with $\operatorname{Var}(\hat\beta)$ as in Equation (15), is the standard error of $\hat\beta_j$. Thus, we set all values of $\hat\beta$ with $|\hat\beta_j| \le \mathrm{SE}(\hat\beta_j)$ equal to zero.

(c) All other (nonzero) coefficients are jointly updated using

$$\hat\beta^{(1)} = \left\{X'\hat V^{-1}X + W^{(0)}\right\}^{-1}X'\hat V^{-1}Y,$$

where $\hat\beta^{(1)}$ is the vector collecting all nonzero coefficients, and $W^{(0)}$ is a penalty matrix for the initial values $\hat\beta^{(0)}$, which can be defined as

$$W^{(0)} = \operatorname{diag}\left\{\frac{p_\lambda'(|\hat\beta_1^{(0)}|)}{|\hat\beta_1^{(0)}|}, \ldots, \frac{p_\lambda'(|\hat\beta_{d^*}^{(0)}|)}{|\hat\beta_{d^*}^{(0)}|}\right\},$$

where $d^*$ is the total number of nonzero model coefficients.

(d) Any elements of $\hat\beta^{(1)}$ whose absolute value is $\le \mathrm{SE}(\hat\beta)$ are set to zero, and the nonzero coefficients are jointly updated along with the matrix $W^{(0)}$.

(e) Steps (c) to (d) are repeated until convergence takes place and no more factors can be removed.

4) Denote by $\hat\beta$ the final estimates of the nonzero model coefficients and by $\hat W$ the corresponding estimated penalty matrix.

The covariance of the nonzero parameter estimates can then be obtained from the sandwich formula [3]:

$$\widehat{\operatorname{cov}}(\hat\beta) = \left(X'\hat V^{-1}X + \hat W\right)^{-1} X'\hat V^{-1}X \left(X'\hat V^{-1}X + \hat W\right)^{-1}. \qquad (22)$$

The nonsignificant variables are removed from the model; however, their indices are retained and their coefficients set to zero, indicating that the variables have been removed. A code sketch of these PGLS iterations is given below.
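The following is a minimal sketch of steps 3(b) to 3(e) for a single value of λ, in Python with NumPy. It assumes the initial GLS estimates beta0, their standard errors se0, the model matrix X, the response Y and the inverse covariance matrix Vinv are available, and that the penalty derivatives $p_\lambda'(|\beta_j|)$ are supplied as a vector dpen_vec (for the ALASSO penalty of the next subsection these are simply λ times the adaptive weights). It is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def pgls_iterate(X, Y, Vinv, beta0, se0, dpen_vec, tol=1e-6, max_iter=100):
    """Iterative PGLS estimation with SE-based thresholding (steps 3(b)-3(e))."""
    beta = beta0.copy()
    active = np.abs(beta) > se0                      # step 3(b): threshold at SE
    for _ in range(max_iter):
        if not active.any():                         # everything was thresholded out
            break
        beta_old = beta.copy()
        Xa = X[:, active]
        ba = beta[active]
        # step 3(c): W = diag{ p'_lambda(|b_j|) / |b_j| } over the nonzero coefficients
        W = np.diag(dpen_vec[active] / np.abs(ba))
        ba = np.linalg.solve(Xa.T @ Vinv @ Xa + W, Xa.T @ Vinv @ Y)
        beta = np.zeros_like(beta0)
        beta[active] = ba
        active = np.abs(beta) > se0                  # step 3(d): re-threshold
        if np.allclose(beta, beta_old, atol=tol):    # step 3(e): stop at convergence
            break
    return beta, active
```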

[14] proposed the Adaptive LASSO (ALASSO) to obtain consistency in variable selection together with good prediction accuracy. ALASSO is based on a weighted $L_1$ (LASSO) penalty, with weights determined by an initial estimator. Recalling the PGLS criterion in Equation (21), the ALASSO penalty term is

$$p_\lambda(\beta) = \lambda \sum_{j=1}^{d} \widehat{wg}_j\, |\beta_j|,$$

where the vector of weights $\widehat{wg} = (\widehat{wg}_1, \widehat{wg}_2, \ldots, \widehat{wg}_d)$ contains adaptive, data-driven weights, with $\widehat{wg}_j = |\hat\beta_j^{(0)}|^{-\psi}$ for $j = 1, \ldots, d$, where $\psi$ is a positive constant and $\hat\beta^{(0)}$ is $\hat\beta_{\mathrm{GLS}}$ [14].

To construct the adaptive weights $\widehat{wg}$, [14] suggested picking a $\psi > 0$. We take a fixed $\psi$ such that $\psi > \frac{2\mu}{1-\mu}$, where $0 \le \mu < 1$. In our numerical studies, we let $\psi = \left[\frac{2\mu}{1-\mu}\right] + 1$, as used by [15], to avoid tuning $\psi$. In their study, they concluded that for any $0 \le \mu < 1$ an appropriate $\psi$ can be chosen to construct the adaptive weights, and the oracle property holds as long as $\psi > \frac{2\mu}{1-\mu}$.

These weights cause less shrinkage of the more important predictors, which leads to consistent variable selection results. In this work, we set $\psi = 0.8$, a choice guided by our simulation study: among the values investigated during our research, it provided the lowest Type I and Type II error rates.
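A hypothetical wiring of the adaptive weights into the PGLS iterations sketched above, assuming beta_gls and its standard errors se_gls come from the GLS/REML fit; the grid of 31 λ values over [0, 3] is an arbitrary choice for illustration.

```python
import numpy as np

psi = 0.8                                    # value of psi used in this work
weights = np.abs(beta_gls) ** (-psi)         # adaptive weights w_j = |beta_j^(0)|^(-psi)

lam_grid = np.linspace(0.0, 3.0, 31)         # grid for lambda over [0, 3] as in Section 4
fits = [pgls_iterate(X, Y, Vinv, beta_gls, se_gls, lam * weights)
        for lam in lam_grid]
# The lambda whose fitted model minimises AIC = 2K - 2 ln(L-hat) would then be selected.
```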

5. Least Angle Regression Method (LARS)

Least Angle Regression (LARS) is a related method proposed by [12] and can be viewed as a kind of "democratic" version of forward stepwise regression. Forward stepwise regression builds a model sequentially, adding one variable at a time. At each step, it identifies the best variable to add to the active set, which contains only the variables considered statistically significant, and then updates the least squares fit to include all the active variables.

The LARS algorithm is described in [12]. Let $X$ be the $n \times p$ design matrix for the $p$ factors, let $\hat\mu = X\hat\beta$ be the LARS estimate of the response, and let $\hat\beta$ be the generalised least squares estimate of the vector of coefficients. The LARS procedure works as follows:

1) Begin at μ ^ 0 = 0 and set all coefficients to zero.

2) Find the variable, say x 1 , which is most correlated with the response.

3) Fit the model using the generalised least square estimator in the direction of x 1 (or u 1 , the unit vector along x 1 ) until another variable, say x 2 , has as much correlation with the current residual as x 1 does.

4) At this point, the LARS estimate is updated to $\hat\mu_1 = \hat\mu_0 + \hat\gamma_1 u_1$, where $\hat\gamma_1$ is chosen such that the current residual $y - \hat\mu_1$ bisects the angle between $x_1$ and $x_2$.

5) Instead of continuing along x 1 , LARS proceeds in the direction of u 2 , the unit bisector of the two variables x 1 and x 2 , until a third variable x 3 earns its way into the most correlated set.

6) Now the LARS estimate is updated to $\hat\mu_2 = \hat\mu_1 + \hat\gamma_2 u_2$, where $\hat\gamma_2$ is chosen such that the current residual $y - \hat\mu_2$ has equal angles with $x_1$, $x_2$, and $x_3$.

7) LARS then proceeds along u 3 , the equiangular unit vector, i.e. along the least angle direction, until a fourth variable enters, etc. LARS builds up estimates in successive steps, in each step adding one variable to the model, so only p steps are required for the full set of solutions, where p is the number of variables.

The LARS algorithm in [12] provides ordinary least squares estimates. In [5], LARS was modified to deal with mixed-effects models as follows: the researcher either weights (whitens) the data as $X^* = CX$ and $Y^* = CY$, where $C$ satisfies $C'C = V^{-1}$, and applies the lars function in the "lars" package in R; or modifies the Gram matrix to $G = X'V^{-1}X$ so that the lars function in Matlab computes the generalised least squares (GLS) estimator.
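A sketch of the whitening route in Python using scikit-learn's Lars (the paper itself uses the R "lars" package or a modified Matlab implementation, so this is only an illustrative stand-in): the data are premultiplied by a matrix C with C'C = V^{-1}, obtained here from the Cholesky factor of V, so that the Gram matrix of the transformed data equals X'V^{-1}X.

```python
import numpy as np
from sklearn.linear_model import Lars

def lars_mixed(X, Y, V, max_terms=None):
    """Run LARS on whitened data so that its Gram matrix is X' V^{-1} X."""
    L = np.linalg.cholesky(V)             # V = L L'
    C = np.linalg.inv(L)                  # then C'C = V^{-1}
    X_star, Y_star = C @ X, C @ Y         # X* = CX, Y* = CY
    model = Lars(fit_intercept=False,
                 n_nonzero_coefs=max_terms or X.shape[1])
    model.fit(X_star, Y_star)
    return model.coef_path_               # coefficients at each LARS step
```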

6. Real Data Application

In this work, we demonstrate the differences between the variable selection methods under study on a nonorthogonally blocked response surface experiment. The real dataset from the pastry dough experiment was used to apply the backward elimination, Adaptive LASSO, and LARS approaches. The estimated coefficients and their standard errors were obtained for all methods, and the variance components estimated by REML, as in Equation (19), are reported as well. We apply the methods of Sections 3, 4, and 5 using Equation (20), Equation (21), and the LARS algorithm to estimate the coefficients and the variance components. We also computed the GLS estimator using Equation (11). The standard errors are computed using the sandwich formula in Equation (22).

7. Analysis of the Pastry Dough Experiment

The design of the pastry dough experiment is a nonorthogonally blocked response surface design; the design and the responses are described in [16]. The factors investigated in the experiment were the feed flow rate (x1), the initial moisture content (x2), and the screw speed of the mixing process for pastry dough (x3). The purpose of the experiment was to understand how the various properties of the dough depend on these three factors and how to develop an overall control scheme based on the experimental results. It was decided that seven days of experimentation were affordable, so that 28 runs could be performed. We apply our three variable selection methods to the response that measures the light transmission in a band of the spectrum. The data of the experiment are given in Table 1. A full second-order model in the three explanatory variables was used to explain the behaviour of the response. The jth observation within the ith block can be expressed as

$$y_{ij} = \beta_0 + \sum_{i=1}^{3}\beta_i x_i + \sum_{i=1}^{3}\sum_{j=i+1}^{3}\beta_{ij} x_i x_j + \sum_{i=1}^{3}\beta_{ii} x_i^2 + \gamma_i + \epsilon_{ij}.$$

The variance components were estimated using GLS-REML in the work by [17] as $\hat\sigma_\epsilon^2 = 0.09695$ and $\hat\sigma_\gamma^2 = 0.9703$. The estimates of the ten parameters of the full quadratic model for the response are displayed in Table 2. The table contains the estimates obtained using the generalised least squares estimator, backward elimination, the Adaptive LASSO, and the least angle regression estimators. In this work, REML was used to estimate the variance components from the full quadratic model; we found $\hat\sigma_\epsilon^2 = 0.09695$ and $\hat\sigma_\gamma^2 = 0.9702$, similar to the results given by [17]. We note that the LARS method yields a model subset similar to that obtained with GLS in the original work by [17].

Table 1. Data for the pastry dough mixing experiment.

Table 2. Estimated coefficients and standard errors (in parentheses) for the pastry dough experiment for y1, obtained by backward elimination, the Adaptive LASSO (ALASSO) and LARS. The last row gives the estimated coefficients and p-values (in parentheses) from [17].

8. Simulation Study

To examine our methods, we ran simulation studies to find out how the resulting models compare with the true model used in the simulation. In the simulation, we set $\sigma_\epsilon^2 + \sigma_\gamma^2 = 10$ and set the variance component ratio to two different levels, $\eta = 1$ and $\eta = 10$. Similar values of $\eta$ have been used in the analysis of data from many blocked and split-plot experiments (see, for instance, [18] and [16]). We generate 1000 datasets using the design structure of the motivating pastry dough experiment in Table 1. Given the assumed true model, we compare the performance of backward elimination, ALASSO, and LARS. We focus on the properties of the estimated models by investigating the following:

1) Consistency in variable selection (frequency in selecting the active/nonactive variable).

2) Prediction performance.

For point 1, at the 5% significance level, we report the Type I error rate (the rate at which an effect that is truly not present is estimated to be significant). We also report the Type II error rate (the rate at which an effect that is truly present is estimated to be not significant).

For point 2, following [9] and [19], prediction accuracy is measured by computing the model error for each penalised estimate $\hat\beta_\lambda$ as

$$ME(\hat\beta_\lambda) = (X\hat\beta_\lambda - X\beta)'(X\hat\beta_\lambda - X\beta).$$

The relative model error (RME) is the ratio of the model error of the penalised estimates to the model error for the GLS estimates of the fixed effects,

$$RME = \frac{ME(\hat\beta_\lambda)}{ME(\hat\beta_{\mathrm{GLS}})},$$

where $\hat\beta_{\mathrm{GLS}}$ in Equation (13) is the generalised least squares estimator of $\beta$. The median of the relative model error (MRME) over the 1000 simulated data sets is reported. MRME values greater than one indicate that a method's estimates perform worse than the GLS estimates, values near one indicate that they perform similarly to the GLS estimates, and values less than one indicate that they perform better than the GLS estimates.
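A minimal sketch of the ME and RME computations, in Python with NumPy, assuming the simulated design matrix X, the true coefficient vector beta_true and the fitted coefficient vectors are available.

```python
import numpy as np

def model_error(X, beta_hat, beta_true):
    """ME(beta_hat) = (X beta_hat - X beta)'(X beta_hat - X beta)."""
    diff = X @ (beta_hat - beta_true)
    return float(diff @ diff)

def relative_model_error(X, beta_hat, beta_gls, beta_true):
    """RME: model error of a penalised fit relative to the GLS fit."""
    return model_error(X, beta_hat, beta_true) / model_error(X, beta_gls, beta_true)

# The MRME reported in the figure would be the np.median of the RME
# values over the 1000 simulated datasets.
```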

9. Simulation Study Using the Design of the Pastry Dough Experiment

A simulation study was performed to examine the performance of the backward, ALASSO and LARS estimates. Using the design of the pastry dough experiment from Table 1, the response variable was generated given the true model

$$E(Y) = 4x_1 + 2x_2 - 4x_1x_2 + 4x_1^2 + 2x_2^2.$$

In this experiment, five active variables ($x_1$, $x_2$, $x_1x_2$, $x_1^2$ and $x_2^2$) and four nonactive variables ($x_3$, $x_1x_3$, $x_2x_3$ and $x_3^2$) were assumed. We assumed this model because we wanted to check a model with a variety of factor types. The Type I and Type II error rates for the pastry dough design obtained using the backward elimination, ALASSO, and LARS methods are given in Table 3 and Table 4.

Table 3. Type I error rate for the pastry dough design.

Table 4. Type II error rate for the pastry dough design.

Firstly, with regard to the Type I error rate, the rates for the main effects lie between 0.040 and 0.050 at both levels of $\eta$, which is acceptable. The rates for the interaction effects are noticeably larger and run in the range 0.046 to 0.091. The LARS method recorded the lowest Type I error rates at $\eta = 1$ and $\eta = 10$ for all effects. The quadratic effects yield large Type I error rates at both $\eta = 1$ and $\eta = 10$, as this type of effect was hard for all methods to identify as nonactive; however, LARS yields the lowest Type I error rate for the quadratic effect, with a rate of 0.175 at $\eta = 1$ and 0.096 at $\eta = 10$. The quadratic effects are not as nearly orthogonal to the blocks as the main effects and the two-factor interaction effects, which made them difficult to estimate.

Secondly, with respect to the Type II error rate, the larger main and interaction effects were estimated correctly with almost no errors by all methods. In contrast, the smaller main and quadratic effects were hard to detect; however, the LARS method yields smaller Type II error rates than the other methods. We also notice that as the value of $\eta$ increases, the Type II error rates are reduced. This can be explained by the fact that a large value of $\eta$ corresponds to a small value of $\sigma_\epsilon^2$, since $\eta = \sigma_\gamma^2/\sigma_\epsilon^2$, which yields smaller standard errors for each factor effect.

We found that the blocked design with correlated data affects the performance of the methods as well as the trade-off between the Type I and Type II error rates. The analysis of the simulation study, which used the nonorthogonal design from the pastry dough experiment, showed that LARS, followed by ALASSO, could control the Type II error rate at both $\eta = 1$ and $\eta = 10$ better than backward elimination. Furthermore, the Type I error rate was hard to control by backward elimination and ALASSO at $\eta = 1$, and largely also at $\eta = 10$, for all nonactive variables, especially for the quadratic effect $x_3^2$. LARS at $\eta = 10$ performs better than the other methods with respect to controlling the Type I error rate.

Figure 1. Median relative model error (MRME) for the pastry dough design.

Finally, with regard to the median relative model error (MRME), from Figure 1 we note that all methods have a lower MRME than the GLS estimator at $\eta = 10$; the LARS method has the lowest MRME. This indicates that LARS performs better than the GLS estimates. We conclude that the LARS method has the best performance among all the methods used in the simulation.

10. Conclusion

This paper provided an analysis of data from blocked experiments using a motivating example from an industrial environment. Specifically, we recommend the use of the modified LARS method for variable selection in blocked experiments. In our results, we observed that LARS can identify the active variables (linear, two-factor interaction and quadratic) much better than the traditionally used GLS method combined with backward elimination and the Adaptive LASSO shrinkage method, although, as expected, this comes at the expense of slightly higher Type I error rates. We also observed better prediction performance for the models chosen by LARS compared with the models chosen by the other two methods.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Hoerl, A.E. and Kennard, R.W. (1970) Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12, 55-67.
https://doi.org/10.1080/00401706.1970.10488634
[2] Breiman, L., et al. (1996) Heuristics of Instability and Stabilization in Model Selection. The Annals of Statistics, 24, 2350-2383.
https://doi.org/10.1214/aos/1032181158
[3] Li, R. and Lin, D.K.J. (2003) Analysis Methods for Supersaturated Design: Some Comparisons. Journal of Data Science, 1, 249-260.
https://doi.org/10.6339/JDS.2003.01(3).134
[4] Mylona, K. and Goos, P. (2011) Penalized Generalized Least Squares for Model Selection under Restricted Randomization. Manuscript Submitted for Isaac Newton Institute for Mathematical Sciences; NI 11032.
[5] Aljeddani, S. (2019) Statistical Analysis of Data from Experiments Subject to Restricted Randomisation. PhD Thesis, University of Southampton, Southampton.
[6] Ozawa, S., Nagatani, T. and Abe, S. (2010) Fast Variable Selection by Block Addition and Block Deletion. Journal of Intelligent Learning Systems and Applications, 2, 200-211.
https://doi.org/10.4236/jilsa.2010.24023
[7] Matthews, E.S. (2015) Design of Factorial Experiments in Blocks and Stages. PhD Thesis, University of Southampton, Southampton.
[8] Corbeil, R.R. and Searle, S.R. (1976) Restricted Maximum Likelihood (Reml) Estimation of Variance Components in the Mixed Model. Technometrics, 18, 31-38.
https://doi.org/10.2307/1267913
[9] Fan, J.Q. and Li, R.Z. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American statistical Association, 96, 1348-1360.
https://doi.org/10.1198/016214501753382273
[10] Hastie, T., Tibshirani, R. and Friedman, J. (2009) Unsupervised Learning. In The Elements of Statistical Learning, Springer, New York, 485-585.
https://doi.org/10.1007/978-0-387-84858-7_14
[11] Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58, 267-288.
https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
[12] Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al. (2004) Least Angle Regression. The Annals of Statistics, 32, 407-499.
https://doi.org/10.1214/009053604000000067
[13] Wahba, G. (1980) Spline Bases, Regularization, and Generalized Cross Validation for Solving Approximation Problems with Large Quantities of Noisy Data. Proceedings of the International Conference on Approximation Theory in Honor or George Lorenz, Austin, 8-11 January 1980.
[14] Zou, H. (2006) The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101, 1418-1429.
https://doi.org/10.1198/016214506000000735
[15] Zou, H. and Zhang, H.H. (2009) On the Adaptive Elastic-Net with a Diverging Number of Parameters. Annals of Statistics, 37, 1733.
https://doi.org/10.1214/08-AOS625
[16] Gilmour, S.G. and Trinca, L.A. (2000) Some Practical Advice on Polynomial Regression Analysis from Blocked Response Surface Designs. Communications in Statistics-Theory and Methods, 29, 2157-2180.
https://doi.org/10.1080/03610920008832601
[17] Goos, P. (2002) The Optimal Design of Blocked and Split-Plot Experiments. Springer Science & Business Media, New York.
https://doi.org/10.1007/978-1-4613-0051-9
[18] Letsinger, J.D., Myers, R.H. and Lentner, M. (1996) Response Surface Methods for Birandomization Structures. Journal of Quality Technology, 28, 381-397.
https://doi.org/10.1080/00224065.1996.11979697
[19] Ibrahim, J.G., Zhu, H.T., Garcia, R.I. and Guo, R.X. (2011) Fixed and Random Effects Selection in Mixed Effects Models. Biometrics, 67, 495-503.
https://doi.org/10.1111/j.1541-0420.2010.01463.x
