Bayesian Variable Selection for Mixture Process Variable Design Experiment
1. Introduction
Bayesian methods are important approaches due to their ability to quantify uncertainty. In such an approach, prior distributions that represent subjective beliefs about parameters are assigned to the regression coefficients. By applying Bayes’ rule, prior beliefs are updated by the data and transformed into posterior distributions, on which all inference is based.
An important task in regression model building is to determine which variables should be included in the model. The principle of Bayesian variable selection is to distinguish the large effects from the small ones: an effective prior concentrates mass around zero and distributes the remaining mass over the parameter space. Such a prior reflects the fact that there are small coefficients close to zero on one hand and larger coefficients on the other hand. These priors can be built as a combination of two distributions: a narrow normal distribution centred at zero with a small variance, called the “spike”, and a flat normal distribution with a large variance that spreads over a wider range of parameter values, called the “slab”. This type of prior is called a “spike-and-slab” prior (see Figure 1). Practically speaking, such priors are beneficial for variable selection because they permit the classification of the regression coefficients into two groups: one group with large, significant effects, and the other group with small, negligible effects.
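A minimal sketch, assuming illustrative values for the inclusion probability and for the spike and slab variances (none of which are taken from the experiment analysed later), of how such a two-component Gaussian mixture prior can be evaluated is:

```python
import numpy as np
from scipy.stats import norm

def spike_slab_density(beta, omega=0.25, spike_var=0.01, slab_var=9.0):
    """Density of a Gaussian spike-and-slab prior at the values in `beta`.

    omega     : prior probability that a coefficient is active (slab component).
    spike_var : small variance of the narrow component centred at zero.
    slab_var  : large variance of the flat component.
    """
    beta = np.asarray(beta, dtype=float)
    spike = norm.pdf(beta, loc=0.0, scale=np.sqrt(spike_var))
    slab = norm.pdf(beta, loc=0.0, scale=np.sqrt(slab_var))
    return (1.0 - omega) * spike + omega * slab

# Example: the prior concentrates mass near zero but keeps heavy shoulders.
print(spike_slab_density([0.0, 0.5, 3.0]))
```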
As reviewed by [1], [2] introduced Bayesian variable selection via “spike-and-slab” prior distributions. The spike prior that they used was a probability mass at zero to remove the non-significant variables. Their slab was a uniform distribution with a large symmetric range in order to keep the significant variables. Following their work, many priors were proposed to implement the spike-and-slab property. [3] proposed Stochastic Search Variable Selection (SSVS), in which the coefficients are sampled from a mixture of two normal distributions with different variances: the spike part is the distribution with a small variance, while the slab part is the distribution with a much larger variance. Also, [4] proposed a positive mass at zero for the spike part and a normal distribution for the slab part. In addition, [5] proposed a Bayesian approach for model selection in fractionated split-plot experiments with application to robust parameter design. In their work, they extend the SSVS algorithm of [6] to account for the split-plot error structure. They derive an expression for the posterior probability of a model that requires the computation of, at most, two one-dimensional integrals, and employ this quantity for model selection. They were able to integrate out the coefficients and the variance components from the joint posterior distribution of all parameters because they used the conjugate normal-inverse-gamma prior for these parameters. The integrals are computed with Gaussian quadrature, and global and local search algorithms are used to find models with high posterior probabilities.

Figure 1. Gaussian mixture prior for a regression coefficient $\beta_j$.
The above review summarised former research and analysed the problems in the Bayesian variable selection field. On the other hand, we note that a frequentist analysis of this kind of experimental design suffers from a high Type I error rate [7]. The difficulty of using frequentist analysis to estimate the variance components is explained in [8]. To overcome this issue, we apply Bayesian variable selection, as this method introduces a prior distribution for the variance components in the linear mixed model.
The novel contribution of this paper to Bayesian variable selection is motivated by a very specific experimental design: data from experiments subject to restricted randomisation. There are two different levels of experimental units, one for the whole plots and the other for the subplots in the split-plot design; see, for example, [9]. To address this issue, we adapt the SSVS algorithm so that the subplot coefficients are sampled from a mixture of normal posterior distributions with a slab variance different from the slab variance used in the mixture of normal posterior distributions for the whole-plot coefficients. This method reduces Type I and II error rates, as well as the prediction error, for split-plot designs compared with applying the SSVS algorithm in which all coefficients are sampled from a mixture of normal posterior distributions with a single slab variance. We call the proposed method the Stochastic Search Variable Selection for Split-Plot Design (SSVS-SPD). The frequentist analysis depends on the estimates of the variance components, yet these estimates cannot be computed precisely because of the deficiency of degrees of freedom for the random effects in the split-plot design. This issue was discussed by [8]. Introducing a prior distribution for the variance components in the linear mixed model provides additional information to overcome the problem of variance estimation.
This paper differs from [10] in that the dataset used here is the vinyl thickness experiment, which had not previously been used for variable selection. It also differs in the choice of the shape parameters of the prior for the correlation parameter used to fit these data. A preprint has previously been published [11].
2. Split-Plot Design and Sample Model
The model for blocked experiments includes two types of errors: block error and residual error. Hence, linear mixed models (LMMs) are used to analyse responses from blocked experiments.
Linear mixed-effects models (LMMs) introduce correlations between observations through random effects. This leads to the use of generalised least squares (GLS) estimation, combined with restricted maximum likelihood (REML) estimation of the variance components, as will be discussed. This type of analysis is used by most design-of-experiments textbooks that deal with blocked designs. In matrix notation, the model corresponding to a blocked design is written as
$$Y = X\beta + Z\gamma + \varepsilon, \qquad (1)$$
where $Y$ is the $n \times 1$ vector of observations on the response of interest, $X$ is the $n \times p$ model design matrix containing the polynomial expansions of the $m$ factor levels at the $n$ experimental runs, $\beta$ is the $p \times 1$ vector of unknown fixed parameters, $Z$ is an $n \times b$ random design matrix which represents the allocation of the runs to blocks, and whose $(i, j)$th element is one if the $i$th observation belongs to the $j$th block, and zero otherwise. If the runs of the experiment are grouped per block, then $Z$ is of the form
$$Z = \operatorname{diag}\!\left(1_{k_1}, 1_{k_2}, \ldots, 1_{k_b}\right), \qquad (2)$$
where $1_k$ is a $k \times 1$ vector of ones, and $k_1, \ldots, k_b$ are the block sizes. The random effects of the $b$ blocks are contained within the $b \times 1$ vector $\gamma$, and the random errors are contained within the $n \times 1$ vector $\varepsilon$. It is assumed that $\gamma$ and $\varepsilon$ are independent and normally distributed, i.e. $\operatorname{Cov}(\gamma, \varepsilon) = 0_{b \times n}$, where $0_{b \times n}$ is the $b \times n$ matrix of zeros. Hence, $\gamma \sim N\!\left(0_b, \sigma_\gamma^2 I_b\right)$ and $\varepsilon \sim N\!\left(0_n, \sigma_\varepsilon^2 I_n\right)$, where $0_b$ and $0_n$ are the $b$- and $n$-dimensional column vectors of zeros respectively, and $I_b$ and $I_n$ are the $b$-dimensional and $n$-dimensional identity matrices respectively.
Under these assumptions, $Y$ is a normally distributed random variable with mean $X\beta$, and the variance-covariance matrix of the response $Y$ can be written as
$$V = \operatorname{Var}(Y) \qquad (3)$$
$$= \operatorname{Var}(X\beta + Z\gamma + \varepsilon) \qquad (4)$$
$$= Z\operatorname{Var}(\gamma)Z' + \operatorname{Var}(\varepsilon) \qquad (5)$$
$$= \sigma_\gamma^2 ZZ' + \sigma_\varepsilon^2 I_n. \qquad (6)$$
$V$ can be written as a block diagonal matrix, $V = \operatorname{diag}(V_1, \ldots, V_b)$, where $V_i = \sigma_\varepsilon^2 I_{k_i} + \sigma_\gamma^2 1_{k_i} 1_{k_i}'$. As a result, the variance-covariance matrix $V_i$ of all observations within one block is compound symmetric: the main diagonal of the matrix contains the variances of the observations, while the off-diagonal elements are covariances. However, $V_i$ can be rewritten as
$$V_i = \sigma_\varepsilon^2 \left(I_{k_i} + \eta\, 1_{k_i} 1_{k_i}'\right), \qquad (7)$$
where
$$\eta = \frac{\sigma_\gamma^2}{\sigma_\varepsilon^2} \qquad (8)$$
is a measure of the extent to which observations within the same block are correlated. The larger this variance ratio, the more strongly observations within the same block are correlated.
When the random error terms as well as the group effects are normally distributed, the maximum likelihood estimate of the unknown model parameter $\beta$ in Equation (1) is the generalised least squares (GLS) estimate. Finding the estimator $\hat{\beta}$ of $\beta$ requires minimising
$$(Y - X\beta)' V^{-1} (Y - X\beta) \qquad (9)$$
with respect to $\beta$, which is tantamount to finding $\hat{\beta}$ so that
$$\frac{\partial}{\partial \beta}\,(Y - X\beta)' V^{-1} (Y - X\beta)\bigg|_{\beta = \hat{\beta}} = 0. \qquad (10)$$
Therefore, the generalised least squares (GLS) estimator of $\beta$ is
$$\hat{\beta}_{\mathrm{GLS}} = \left(X' V^{-1} X\right)^{-1} X' V^{-1} Y, \qquad (11)$$
and the variance-covariance matrix of the estimators is given by
$$\operatorname{Var}\!\left(\hat{\beta}_{\mathrm{GLS}}\right) = \left(X' V^{-1} X\right)^{-1}. \qquad (12)$$
Often, the variances $\sigma_\gamma^2$ and $\sigma_\varepsilon^2$ are not known and therefore Equation (11) and Equation (12) cannot be used directly. Instead, the estimates of the variance components, $\hat{\sigma}_\gamma^2$ and $\hat{\sigma}_\varepsilon^2$, are substituted in the GLS estimator as in Equation (11), yielding
$$\hat{\beta}_{\mathrm{GLS}} = \left(X' \hat{V}^{-1} X\right)^{-1} X' \hat{V}^{-1} Y, \qquad (13)$$
where
$$\hat{V} = \hat{\sigma}_\gamma^2 ZZ' + \hat{\sigma}_\varepsilon^2 I_n. \qquad (14)$$
In that case, the variance-covariance matrix in Equation (12) can be approximated by
$$\operatorname{Var}\!\left(\hat{\beta}_{\mathrm{GLS}}\right) \approx \left(X' \hat{V}^{-1} X\right)^{-1}. \qquad (15)$$
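For concreteness, a small numpy sketch of the feasible GLS computations in Equations (11)-(15) is given below; the function name and arguments are illustrative, and the variance-component estimates are assumed to have been obtained beforehand (e.g. by REML).

```python
import numpy as np

def gls_estimate(X, y, Z, sigma2_gamma, sigma2_eps):
    """Feasible GLS estimate of beta and its approximate covariance matrix.

    V-hat = sigma2_gamma * Z Z' + sigma2_eps * I_n  (Equation (14)).
    """
    n = len(y)
    V = sigma2_gamma * Z @ Z.T + sigma2_eps * np.eye(n)
    Vinv_X = np.linalg.solve(V, X)      # V^{-1} X
    Vinv_y = np.linalg.solve(V, y)      # V^{-1} y
    XtVinvX = X.T @ Vinv_X
    beta_hat = np.linalg.solve(XtVinvX, X.T @ Vinv_y)   # Equation (13)
    cov_beta = np.linalg.inv(XtVinvX)                   # Equation (15)
    return beta_hat, cov_beta
```

Because $\hat{V}$ is block diagonal, in practice it can also be inverted block by block rather than as a full $n \times n$ matrix.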
The generalised least squares (GLS) estimator is unbiased, meaning that $E\!\left(\hat{\beta}_{\mathrm{GLS}}\right) = \beta$, and is equal to the maximum likelihood estimator (MLE). The likelihood function is defined as the joint probability density function of the observed data, examined as a function of the parameters. Hence, the likelihood function for $Y$ in Equation (1) is
$$L(\beta, V \mid Y) = c\,|V|^{-1/2} \exp\!\left\{-\tfrac{1}{2}(Y - X\beta)' V^{-1} (Y - X\beta)\right\}, \qquad (16)$$
where $c$ is a constant which does not depend on $\beta$. The maximum likelihood estimator (MLE) is the estimator that maximises the likelihood function, which is tantamount to finding $\hat{\beta}_{\mathrm{MLE}}$ as
$$\hat{\beta}_{\mathrm{MLE}} = \arg\max_{\beta}\, L(\beta, V \mid Y), \qquad (17)$$
which is equal to
$$\hat{\beta}_{\mathrm{MLE}} = \arg\max_{\beta}\, \ell(\beta, V \mid Y), \qquad (18)$$
where $\ell(\beta, V \mid Y) = \log L(\beta, V \mid Y)$ is the log-likelihood function. As minimising Equation (9) is equivalent to maximising the log of Equation (16) for fixed $V$, the GLS estimator in Equation (11) is the result of Equation (17) and Equation (18).
The restricted maximum likelihood (REML) criterion used to estimate $\sigma_\gamma^2$ and $\sigma_\varepsilon^2$ is
$$\ell_R\!\left(\sigma_\gamma^2, \sigma_\varepsilon^2\right) = -\tfrac{1}{2}\log|V| - \tfrac{1}{2}\log\left|X' V^{-1} X\right| - \tfrac{1}{2}\left(Y - X\hat{\beta}_{\mathrm{GLS}}\right)' V^{-1} \left(Y - X\hat{\beta}_{\mathrm{GLS}}\right), \qquad (19)$$
where $\hat{\beta}_{\mathrm{GLS}}$ is defined in Equation (13). The restricted log-likelihood $\ell_R$ is maximised with respect to the variance components $\sigma_\gamma^2$ and $\sigma_\varepsilon^2$ to obtain unbiased estimates of the variance components.
3. The Bayesian Methodology
3.1. Bayesian Framework
The essential philosophy behind Bayesian inference is to update a prior distribution for an unknown parameter to a posterior distribution by Bayes’ theorem. Bayes’ theorem can be used to estimate the conditional distributions. While the frequentist approach treats the parameters as unknown and fixed, the Bayesian approach regards them as random variables. We can define the prior distribution $p(\theta)$ as the probability density (or mass) function which reflects our beliefs about $\theta$ in the parameter space $\Theta$. For given data $y$, the likelihood function $p(y \mid \theta)$ can then be defined given the parameter $\theta$ for the data $y$. Also, we can define the posterior density (or mass) function $p(\theta \mid y)$, which represents our updated belief about $\theta$ given the observed data $y$.
Using Bayes’ theorem, the posterior density of $\theta$ given $y$ is:
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}. \qquad (20)$$
Bayesian inference continues from this distribution. The denominator of Equation (20) is the marginal likelihood of $y$, and it often does not need to be calculated because it is independent of $\theta$. Bayes’ rule can then be written as:
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta). \qquad (21)$$
Equation (21) defines the unnormalised posterior density; the posterior is proportional to the likelihood times the prior. For more details on Bayesian inference, see [12] and [13].
A prior distribution can be selected based on past information or experimental practice. It can be informative or uninformative. An informative prior provides numerical information that is used in estimating the parameter of concern. An uninformative prior reflects balance among the possible outcomes when only weak information about the parameter is available. There are two types of uninformative priors: proper and improper. The density of a proper prior distribution integrates to 1, whereas the integral of the density of an improper distribution is not finite. If the prior integrates to any positive finite value, it is called an unnormalised density and can be renormalised (multiplied by a constant) to integrate to 1 [12] [13].
3.2. Markov Chain Monte Carlo (MCMC) Methods
Markov chain Monte Carlo simulation is a general method based on drawing values of $\theta$ from approximate distributions, and then correcting those draws to better approximate the target posterior distribution $p(\theta \mid y)$ [12] [13] [14]. A Markov chain is a sequence of random variables $\theta^{(1)}, \theta^{(2)}, \ldots$ for which, for any iteration $t$, the distribution of $\theta^{(t)}$ depends only on the most recent value $\theta^{(t-1)}$ [12] [13] [14]. A Markov chain is generated by sampling $\theta^{(t)} \sim p\!\left(\theta \mid \theta^{(t-1)}, y\right)$. This $p$ is called the transition kernel of the Markov chain. Therefore, $\theta^{(t)}$ depends only on $\theta^{(t-1)}$, not on $\theta^{(1)}, \ldots, \theta^{(t-2)}$.
As $t \to \infty$, the sampling from the Markov chain converges to the posterior for the right choice of transition kernel $p$. Thus, we should run the simulation long enough so that the distribution of the current draws is close enough to $p(\theta \mid y)$.
3.2.1. Metropolis-Hastings Sampling
Metropolis-Hastings sampling was proposed by [15] and [16]. The theory of this sampling is based on rejection sampling: the acceptance-rejection method is a technique for obtaining samples from a distribution of unknown form. The Metropolis-Hastings algorithm is a common expression for a family of Markov chain simulation methods. It is worth describing the Metropolis algorithm first, then broadening it to the Metropolis-Hastings algorithm. Let $p(\theta \mid y)$ be the conditional posterior distribution from which we want to sample. Let $\theta^{(t-1)}$ be the current parameter value, and let $J\!\left(\theta^{*} \mid \theta^{(t-1)}\right)$ be the proposal density. The proposal density is much like a conventional transition operator for a Markov chain; the proposal distribution depends only on the previous state in the chain. However, the transition operator for the Metropolis algorithm has an additional step that assesses whether or not the target distribution has a sufficiently large density near the proposed state to warrant accepting it as the next state of the chain.
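A minimal sketch of a random-walk Metropolis sampler for a generic one-dimensional target, specified only through its log-density (not the specific posterior used later in this paper), is:

```python
import numpy as np

def random_walk_metropolis(log_target, theta0, n_iter=5000, step=0.5, seed=0):
    """Random-walk Metropolis for a 1-D target given by its log-density."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_iter)
    theta = theta0
    log_p = log_target(theta)
    for t in range(n_iter):
        proposal = theta + rng.normal(0.0, step)   # symmetric proposal
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p(proposal)/p(current)).
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        draws[t] = theta
    return draws

# Example: sample from a standard normal target.
samples = random_walk_metropolis(lambda th: -0.5 * th**2, theta0=0.0)
print(samples.mean(), samples.std())
```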
3.2.2. Gibbs Sampling
A Gibbs sampler is the simplest of the Markov chain simulation algorithms, and it is used to sample from the conditional conjugate models, where we can directly sample from each conditional posterior [12] [13]. It is rare to find all the conditional posteriors in a model in known forms. One may find some conditional posterior distributions that are possible to directly sample from. Furthermore, one may find some of the conditional posteriors that cannot be straightforwardly sampled from. Therefore, the procedure for this issue is to update the parameters one at a time with the Gibbs sampler used where possible, and one-dimensional Metropolis updating where necessary. This process is called the Metropolis-Hastings within Gibbs sampling and will be used in this work.
4. A Hierarchical Mixture Model for Variable Selection
The linear mixed model fitted to data from a split-plot experiment with $n$ responses is
$$y = \beta_0 1_n + X\beta + Z\gamma + \varepsilon, \qquad (22)$$
where $y$ is the $n \times 1$ vector of random responses, $\beta_0$ is the intercept, $1_n$ is an $n \times 1$ vector of ones, $X$ is the $n \times p$ model matrix without the column of the intercept, $\beta$ is the $p \times 1$ vector of fixed effect parameters, and $V = \operatorname{Var}(y) = \sigma_\gamma^2 ZZ' + \sigma_\varepsilon^2 I_n$, where $Z$ is the random effect design matrix. As $\sigma_\gamma^2 = \rho\sigma^2$ and $\sigma_\varepsilon^2 = (1 - \rho)\sigma^2$, where $\sigma^2$ is the total variance and $\rho$ is the correlation between observations within the same whole plot, $V$ can be written as
$$V = \sigma^2\left[\rho\, ZZ' + (1 - \rho) I_n\right] = \sigma^2 \Sigma, \qquad \text{with } \Sigma = \rho\, ZZ' + (1 - \rho) I_n. \qquad (23)$$
We need to find the highest posterior probability of an indicator vector $\delta = (\delta_1, \ldots, \delta_p)'$ such that $\delta_j \in \{0, 1\}$ for $j = 1, \ldots, p$. When $\delta_j = 1$ the $j$th term is assumed to be active and will be included in the model, and when $\delta_j = 0$ the term is assumed to be non-active and will not be included in the model.
Following [6] and [5], we assume that $\beta \mid \delta, c, \sigma^2 \sim N\!\left(0, \sigma^2 D_\delta\right)$, where $\delta$ is the indicator vector, $c$ is the prior variance of the slab distribution, and $D_\delta$ is a diagonal matrix whose $j$th diagonal element equals $c$ when $\delta_j = 1$ and $d$ when $\delta_j = 0$. The parameters $\delta$, $\omega$ and $c$ will be given prior distributions, and the parameter $d$ is assumed to be a small fixed non-negative number because we want the spike distribution to have a smaller variance than the slab distribution. Formally, the prior construction of $\delta$ is the following: for every coefficient $\beta_j$, a Bernoulli variable $\delta_j$ is defined taking values 1 and 0 with probability of inclusion $\omega$, as $P(\delta_j = 1) = \omega$ and $P(\delta_j = 0) = 1 - \omega$. Often, the $\delta_j$’s are taken as independent Bernoulli($\omega$) random variables, where $0 < \omega < 1$. It is common to fix $\omega$ in the normal mixture; however, we shall treat $\omega$ as a parameter in order to investigate different values of $\omega$, and sample it from the Beta distribution, as will be explained in Section 6.3.
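A minimal sketch of the diagonal prior covariance construction described above, with hypothetical values for the slab variance $c$ and the fixed spike variance $d$, is:

```python
import numpy as np

def prior_cov_matrix(delta, c=4.0, d=1e-4):
    """Diagonal matrix D_delta for the coefficient prior.

    delta : 0/1 indicator vector; the j-th diagonal element is the slab
            variance c when delta_j = 1 and the small spike variance d
            when delta_j = 0.
    """
    delta = np.asarray(delta)
    return np.diag(np.where(delta == 1, c, d))

print(prior_cov_matrix([1, 0, 1]))
```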
5. Prior Distributions
1) Following the prior distributions used by [5], we assume that the prior distribution for the fixed effects is $\beta \mid \delta, c, \sigma^2 \sim N\!\left(0, \sigma^2 D_\delta\right)$, as given in Section 4.
2) The prior distribution for the total variance $\sigma^2$ is an inverse gamma distribution. For this work, we chose its hyperparameters following [5], as this yields the common non-informative prior for $\sigma^2$, $p(\sigma^2) \propto 1/\sigma^2$. This prior is improper; however, we will sample from the full conditional posterior distribution, which is a proper inverse gamma distribution.
3) The prior distribution for the correlation parameter $\rho$ is a Beta distribution with shape parameters $a$ and $b$, chosen following [8]. According to [8], a Beta prior distribution for a correlation parameter can be interpreted as indicating a prior point estimate of the correlation, with this prior information being worth a given number of observations. Our prior was selected to be centred at $\rho = 0.5$ and to be worth four observations in each block; for an experiment with that many observations, the posterior distribution would give equal weight to the prior and the likelihood [8]. A similar choice of prior is made in [6], where the shape parameters are set so that the Beta prior is symmetric with a mode of 0.5. This has the effect of pulling the posterior mode of $\rho$ towards 0.5 when the data are scarce. The prior density for $\rho$ is
$$p(\rho) \propto \rho^{a-1}(1 - \rho)^{b-1}, \qquad 0 < \rho < 1.$$
4) The prior distribution for the elements of the indicator vector is $\delta_j \mid \omega \sim \text{Bernoulli}(\omega)$, $j = 1, \ldots, p$, where $\omega$ is the prior probability that $\beta_j$ is active, following [5].
5) The prior distribution for the probability of inclusion is $\omega \sim \text{Beta}(a_\omega, b_\omega)$. [5] made a specific choice for this prior; however, we select $a_\omega$ and $b_\omega$ such that the prior of $\omega$ has a mode of 0.25. This choice of $a_\omega$ and $b_\omega$ results in a prior with a mode of 0.25 and an upper 5% tail percentile of 0.66, meaning there is a 5% prior probability that $\omega$ is greater than or equal to 0.66.
6) The prior distribution for the slab variance $c$ is a discrete uniform prior distribution with support points $T = \{1/4, 9/16, 1, 4, 9, 16, 25\}$, as given by [5]. They found that large values of $c$ tend to favour sparse models with large effects, in which case small effects will be missed. On the other hand, small values of $c$ tend to favour less sparse models, and very small values of $c$ tend to favour sparse models again. They selected the support points in $T$ such that they cover both small and large values of $c$. The prior distribution for $c$ is
$$P(c = t) = \frac{1}{7}, \qquad t \in T.$$
6. Full Conditional Distributions
We use the prior distributions presented in Section 5 to derive the full conditional distributions. The likelihood of the data depends on $\beta$, $\sigma^2$ and $\rho$, so we can derive the conditional distribution for $\beta$ using the prior distribution $\beta \mid \delta, c, \sigma^2 \sim N\!\left(0, \sigma^2 D_\delta\right)$ and the likelihood for the model (22),
$$p\!\left(y \mid \beta, \sigma^2, \rho\right) \propto \left|\sigma^2 \Sigma\right|^{-1/2} \exp\!\left\{-\frac{1}{2\sigma^2}(y - X\beta)' \Sigma^{-1} (y - X\beta)\right\},$$
with $\Sigma = \rho\, ZZ' + (1 - \rho) I_n$ as in Equation (23). Note that we standardise both $X$ and $y$, so the fixed effect vector $\beta$ does not include the intercept. The conditional distribution for $\beta$ can be expressed as:
$$p\!\left(\beta \mid y, \delta, \omega, \sigma^2, \rho, c\right) \propto p\!\left(y \mid \beta, \sigma^2, \rho\right)\, p\!\left(\beta \mid \delta, c, \sigma^2\right).$$
The key to deriving the joint posterior distribution is to rewrite the expression in the exponential part in a more convenient form. This can be done by using the multivariate completion of squares:
$$u'Au - 2\alpha'u = \left(u - A^{-1}\alpha\right)'A\left(u - A^{-1}\alpha\right) - \alpha'A^{-1}\alpha,$$
where $A$ is a symmetric positive definite (hence invertible) matrix. We take $u = \beta$, $A = X'\Sigma^{-1}X + D_\delta^{-1}$, and $\alpha = X'\Sigma^{-1}y$.
The conditional distribution for $\beta$ can then be written as:
$$p\!\left(\beta \mid y, \delta, \omega, \sigma^2, \rho, c\right) \propto \exp\!\left\{-\frac{1}{2\sigma^2}\left(\beta - A^{-1}\alpha\right)'A\left(\beta - A^{-1}\alpha\right)\right\}.$$
Thus, we can sample $\beta$ from the conditional posterior $N\!\left(\mu_\beta, \Phi_\beta\right)$, where
$$\mu_\beta = \left(X'\Sigma^{-1}X + D_\delta^{-1}\right)^{-1} X'\Sigma^{-1}y, \qquad \Phi_\beta = \sigma^2\left(X'\Sigma^{-1}X + D_\delta^{-1}\right)^{-1}. \qquad (24)$$
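A minimal sketch of drawing $\beta$ from a multivariate normal full conditional of the form (24), using the Cholesky factor of the covariance matrix, is given below; the function name and arguments mirror the reconstruction above and are illustrative.

```python
import numpy as np

def sample_beta(X, y, Sigma, D_delta, sigma2, rng):
    """Draw beta from N(mu, Phi) with
    mu  = (X' Sigma^{-1} X + D_delta^{-1})^{-1} X' Sigma^{-1} y and
    Phi = sigma2 * (X' Sigma^{-1} X + D_delta^{-1})^{-1}."""
    Sinv_X = np.linalg.solve(Sigma, X)
    Sinv_y = np.linalg.solve(Sigma, y)
    A = X.T @ Sinv_X + np.linalg.inv(D_delta)
    A_inv = np.linalg.inv(A)
    mu = A_inv @ (X.T @ Sinv_y)
    Phi = sigma2 * A_inv
    L = np.linalg.cholesky(Phi)            # Phi = L L'
    return mu + L @ rng.standard_normal(len(mu))
```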
6.1. The Conditional Distribution for ρ
The likelihood of the data depends on $\rho$, so the conditional distribution for $\rho$ can be derived from
$$p\!\left(\rho \mid y, \beta, \sigma^2\right) \propto p\!\left(y \mid \beta, \sigma^2, \rho\right)\, p(\rho).$$
We note here that the likelihood depends on $\rho$ through $\Sigma$ as in Equation (23), so we can express the conditional distribution as a function of $\rho$,
$$p\!\left(\rho \mid y, \beta, \sigma^2\right) \propto \left|\Sigma\right|^{-1/2} \exp\!\left\{-\frac{1}{2\sigma^2}(y - X\beta)' \Sigma^{-1} (y - X\beta)\right\} \rho^{a-1}(1 - \rho)^{b-1}.$$
The conditional distribution for $\rho$ is a non-standard distribution that cannot be sampled directly. Therefore, we use Metropolis-Hastings (M-H) rejection sampling. Our correlation parameter is $\rho \in (0, 1)$ and has a Beta$(a, b)$ prior. We apply the random-walk Metropolis-Hastings algorithm and select a log-normal proposal distribution for the variance ratio $\eta = \rho/(1 - \rho)$, centred at the current value of $\eta$ at iteration $t$ and with log-scale variance $s^2$. The choice of $s^2$ affects the jumping rule in the random walk proposal distribution. As we have one parameter to be updated in the random walk algorithm, namely $\eta$, we follow [12] and [17]: the most efficient jump has a scale of approximately $2.4/\sqrt{h}$, where $h$ is the number of parameters to be updated. In this work, $h = 1$, and we set $s$ following [17], as this yields an appropriate acceptance rate and near-independent draws as judged from the ACF plot.
We can use $\eta = \rho/(1 - \rho)$ as a transformation between $\rho$ and $\eta$, so that $\rho = \eta/(1 + \eta)$, and the Jacobian of the transformation is $\mathrm{d}\rho/\mathrm{d}\eta = 1/(1 + \eta)^2$.
We draw a proposal value $\eta^{*}$ from a log-normal$\!\left(\log \eta^{(t-1)}, s^2\right)$ distribution, and the probability of accepting $\eta^{*}$ is the minimum of 1 and the ratio $r$, where $r$ is the ratio of the target density at the proposed value to the target density at the current value, multiplied by the proposal ratio. Our proposal ratio is
$$\frac{J\!\left(\eta^{(t-1)} \mid \eta^{*}\right)}{J\!\left(\eta^{*} \mid \eta^{(t-1)}\right)} = \frac{\eta^{*}}{\eta^{(t-1)}}.$$
The ratio $r$ can therefore be expressed as
$$r = \frac{p\!\left(y \mid \beta, \sigma^2, \rho^{*}\right) p\!\left(\rho^{*}\right)}{p\!\left(y \mid \beta, \sigma^2, \rho^{(t-1)}\right) p\!\left(\rho^{(t-1)}\right)} \times \frac{\left(1 + \eta^{(t-1)}\right)^2}{\left(1 + \eta^{*}\right)^2} \times \frac{\eta^{*}}{\eta^{(t-1)}}, \qquad (25)$$
where $\rho^{*} = \eta^{*}/\left(1 + \eta^{*}\right)$, and $\rho^{(t-1)} = \eta^{(t-1)}/\left(1 + \eta^{(t-1)}\right)$.
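A minimal sketch of this accept/reject step, assuming a log-normal random-walk proposal for $\eta$ and a Beta$(a, b)$ prior on $\rho$ as above (the log-likelihood is supplied as a function of $\rho$, and the default values of $a$, $b$ and the step size $s$ are illustrative), is:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def update_eta(eta_curr, log_lik_rho, a=2.0, b=2.0, s=2.4, rng=None):
    """One random-walk M-H update of eta = rho / (1 - rho).

    log_lik_rho : function returning the log-likelihood of the data at rho.
    a, b        : shape parameters of the Beta prior on rho (assumed values).
    s           : log-scale standard deviation of the log-normal random walk.
    """
    rng = np.random.default_rng() if rng is None else rng
    eta_prop = np.exp(np.log(eta_curr) + rng.normal(0.0, s))  # log-normal RW
    rho_curr = eta_curr / (1.0 + eta_curr)
    rho_prop = eta_prop / (1.0 + eta_prop)

    def log_target(rho, eta):
        # log posterior of rho plus log Jacobian |d rho / d eta| = 1/(1+eta)^2
        return (log_lik_rho(rho) + beta_dist.logpdf(rho, a, b)
                - 2.0 * np.log1p(eta))

    # proposal correction for the log-normal random walk: q-ratio = eta*/eta
    log_r = (log_target(rho_prop, eta_prop) - log_target(rho_curr, eta_curr)
             + np.log(eta_prop) - np.log(eta_curr))
    if np.log(rng.uniform()) < log_r:
        return eta_prop
    return eta_curr
```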
6.2. The Conditional Distribution for σ²
The likelihood of the data depends on $\sigma^2$, so we can express the conditional distribution of $\sigma^2$ as
$$p\!\left(\sigma^2 \mid y, \beta, \delta, \rho, c\right) \propto p\!\left(y \mid \beta, \sigma^2, \rho\right)\, p\!\left(\beta \mid \delta, c, \sigma^2\right)\, p\!\left(\sigma^2\right).$$
We know that $p(\sigma^2) \propto 1/\sigma^2$, so the conditional posterior for $\sigma^2$ can be written as
$$p\!\left(\sigma^2 \mid y, \beta, \delta, \rho, c\right) \propto \left(\sigma^2\right)^{-(n + p)/2 - 1} \exp\!\left\{-\frac{1}{2\sigma^2}\left[(y - X\beta)'\Sigma^{-1}(y - X\beta) + \beta' D_\delta^{-1}\beta\right]\right\}.$$
This is the inverse gamma distribution with shape parameter $a^{*}$ and scale parameter $b^{*}$ such that
$$a^{*} = \frac{n + p}{2}, \qquad b^{*} = \frac{1}{2}\left[(y - X\beta)'\Sigma^{-1}(y - X\beta) + \beta' D_\delta^{-1}\beta\right]. \qquad (26)$$
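A minimal sketch of the $\sigma^2$ update, drawing from the inverse gamma distribution in (26) by taking the reciprocal of a gamma draw (the two quadratic forms are assumed to be computed beforehand), is:

```python
import numpy as np

def sample_sigma2(resid_quad_form, beta_quad_form, n, p, rng):
    """Draw sigma^2 from InvGamma(shape, scale) with
    shape = (n + p) / 2 and
    scale = (resid_quad_form + beta_quad_form) / 2,
    where resid_quad_form = (y - X beta)' Sigma^{-1} (y - X beta) and
    beta_quad_form  = beta' D_delta^{-1} beta."""
    shape = 0.5 * (n + p)
    scale = 0.5 * (resid_quad_form + beta_quad_form)
    return scale / rng.gamma(shape)   # scale / Gamma(shape, 1) ~ InvGamma
```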
The indicator vector can be drawn conditionally on the regression coefficients, and computation of the marginal likelihood is not required. The prior probabilities for $\delta_j$ are
$$P(\delta_j = 1 \mid \omega) = \omega, \qquad P(\delta_j = 0 \mid \omega) = 1 - \omega,$$
where $\omega$ is the prior probability that $\beta_j$ is active. The joint conditional posterior distribution for $\delta$ has mass function
$$p\!\left(\delta \mid \beta, \omega, \sigma^2, c\right) \propto p\!\left(\beta \mid \delta, c, \sigma^2\right)\, p(\delta \mid \omega).$$
The conditional density for $\beta_j$ given $\delta_j$ is
$$\beta_j \mid \delta_j = 1 \sim N\!\left(0, c\sigma^2\right), \qquad \beta_j \mid \delta_j = 0 \sim N\!\left(0, d\sigma^2\right).$$
The conditional distribution for the $j$th component given $\beta_j$, $\omega$, $\sigma^2$ and $c$ is
$$p\!\left(\delta_j \mid \beta_j, \omega, \sigma^2, c\right) \propto p\!\left(\beta_j \mid \delta_j, \sigma^2, c\right)\, p(\delta_j \mid \omega).$$
The conditional posterior probabilities for $\delta_j$ are therefore
$$P\!\left(\delta_j = 1 \mid \beta_j, \omega, \sigma^2, c\right) = \frac{\omega\, \phi\!\left(\beta_j; 0, c\sigma^2\right)}{\omega\, \phi\!\left(\beta_j; 0, c\sigma^2\right) + (1 - \omega)\, \phi\!\left(\beta_j; 0, d\sigma^2\right)}, \qquad (27)$$
and
$$P\!\left(\delta_j = 0 \mid \beta_j, \omega, \sigma^2, c\right) = 1 - P\!\left(\delta_j = 1 \mid \beta_j, \omega, \sigma^2, c\right),$$
where $\phi(\cdot\,; 0, v)$ denotes the normal density with mean zero and variance $v$.
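A minimal sketch of the elementwise indicator update implied by (27), assuming normal spike and slab densities with variances $d\sigma^2$ and $c\sigma^2$ as reconstructed above, is:

```python
import numpy as np
from scipy.stats import norm

def sample_indicators(beta, omega, sigma2, c, d, rng):
    """Draw each delta_j as Bernoulli with success probability given by (27)."""
    beta = np.asarray(beta, dtype=float)
    slab = omega * norm.pdf(beta, scale=np.sqrt(c * sigma2))
    spike = (1.0 - omega) * norm.pdf(beta, scale=np.sqrt(d * sigma2))
    prob_active = slab / (slab + spike)
    return rng.binomial(1, prob_active)
```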
6.3. The Conditional Distribution for ω
The probability of inclusion $\omega$ can be drawn conditionally on the indicator vector, and computation of the marginal likelihood is not required. Hence the conditional distribution for $\omega$ is
$$p(\omega \mid \delta) \propto p(\delta \mid \omega)\, p(\omega) \propto \omega^{\sum_{j}\delta_j + a_\omega - 1}\,(1 - \omega)^{p - \sum_{j}\delta_j + b_\omega - 1}.$$
Hence,
$$\omega \mid \delta \sim \text{Beta}\!\left(a_\omega + \sum_{j=1}^{p}\delta_j,\; b_\omega + p - \sum_{j=1}^{p}\delta_j\right). \qquad (28)$$
6.4. The Conditional Distribution for c
The prior distribution for $c$ is a discrete uniform distribution with support points $T = \{1/4, 9/16, 1, 4, 9, 16, 25\}$, and $c$ can be drawn conditionally on the regression coefficients; computation of the marginal likelihood is not required. Hence, the conditional distribution for $c$ is
$$p\!\left(c \mid \beta, \delta, \sigma^2\right) \propto p\!\left(\beta \mid \delta, c, \sigma^2\right)\, p(c).$$
Therefore, for each support point $t \in T$,
$$p\!\left(c = t \mid \beta, \delta, \sigma^2\right) \propto \prod_{j : \delta_j = 1} \phi\!\left(\beta_j; 0, t\sigma^2\right),$$
since only the active coefficients depend on the slab variance. The conditional posterior probabilities are obtained by normalising over the support points, so that
$$P\!\left(c = t \mid \beta, \delta, \sigma^2\right) = \frac{\prod_{j : \delta_j = 1} \phi\!\left(\beta_j; 0, t\sigma^2\right)}{\sum_{t' \in T} \prod_{j : \delta_j = 1} \phi\!\left(\beta_j; 0, t'\sigma^2\right)}, \qquad t \in T. \qquad (29)$$
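A minimal sketch of the discrete update for the slab variance $c$ over the support $T$, following the normalisation in (29), is:

```python
import numpy as np
from scipy.stats import norm

T = np.array([1/4, 9/16, 1, 4, 9, 16, 25])

def sample_slab_variance(beta, delta, sigma2, rng):
    """Draw c from its discrete full conditional over the support T."""
    active = np.asarray(beta)[np.asarray(delta) == 1]
    # log-weight of each support point, up to an additive constant
    log_w = np.array([norm.logpdf(active, scale=np.sqrt(t * sigma2)).sum()
                      for t in T])
    w = np.exp(log_w - log_w.max())       # stabilise before normalising
    return rng.choice(T, p=w / w.sum())
```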
7. Bayesian Variable Selection Algorithms
In this section, we introduce the computational algorithms which we use in the Bayesian analysis for variable selection using the SSVS and the SSVS-SPD. In this work, we choose an asymmetric proposal distribution, the log-normal density. We apply the Metropolis-Hastings algorithm to sample the variance ratio $\eta = \sigma_\gamma^2/\sigma_\varepsilon^2 = \rho/(1 - \rho)$, where $\rho$ is the correlation parameter. This is because, in our experiments, observations from different subplots within the same whole plot are positively correlated, as $\rho > 0$, while observations from different whole plots are independent [5].
7.1. The Stochastic Search Variable Selection (SSVS) Algorithm
We process the MCMC estimation of the parameters $\beta$, $\delta$, $\omega$, $\sigma^2$, $\rho$ and $c$. We use the priors of all these parameters as in Section 5. The following Metropolis-Hastings within Gibbs sampling algorithm can be implemented. Let $y$ be the $n \times 1$ vector of random responses, $X$ the $n \times p$ model matrix without the column of the intercept, and $\beta$ the $p \times 1$ vector of fixed effect parameters, where $p$ is the number of fixed effect parameters that need to be estimated. We set initial values $\beta^{(0)}$, $\delta^{(0)}$, $\omega^{(0)}$, $\sigma^{2(0)}$, $\rho^{(0)}$, $\eta^{(0)}$ and $c^{(0)}$ for the parameters. Starting at the $t$th iteration, $t = 1, \ldots, M$, where $M$ is the total number of MCMC iterations, and conditioning on the draws from iteration $t - 1$, the sampling algorithm is:
1) For $j = 1, \ldots, p$, sample $\delta_j^{(t)}$ of the indicator vector $\delta$ using (27) for $\beta_j^{(t-1)}$, $\omega^{(t-1)}$, $\sigma^{2(t-1)}$, and $c^{(t-1)}$.
2) Sample the mixture weight $\omega^{(t)}$ using (28) for $\delta^{(t)}$.
3) Sample the regressor coefficients $\beta^{(t)}$ using (24) for $X$, $y$, $\delta^{(t)}$, $\sigma^{2(t-1)}$, $\rho^{(t-1)}$, $c^{(t-1)}$, and $d$.
4) Sample the total variance $\sigma^{2(t)}$ using (26) for $X$, $y$, $Z$, $\beta^{(t)}$, and $\rho^{(t-1)}$.
5) a) Sample $\eta^{*}$ from log-normal$\!\left(\log \eta^{(t-1)}, s^2\right)$;
b) Calculate $r$ using (25) for $X$, $y$, $\beta^{(t)}$, $\sigma^{2(t)}$, and $\eta^{*}$;
c) Sample $u$ from Uniform$(0, 1)$;
d) If $u \leq \min(1, r)$, then set $\eta^{(t)} = \eta^{*}$, otherwise set $\eta^{(t)} = \eta^{(t-1)}$; in either case $\rho^{(t)} = \eta^{(t)}/\left(1 + \eta^{(t)}\right)$.
6) Sample $c^{(t)}$ from the set $T$ with probability mass function given in (29) for $\beta^{(t)}$ and $\delta^{(t)}$.
7.2. The Stochastic Search Variable Selection for Split-Plot Design (SSVS-SPD) Algorithm
We adapt the SSVS for the analysis of data from split-plot designs by taking into account the two types of factors, i.e. the whole-plot factors and the subplot factors, which are expected to have different effect sizes in the two strata of the split-plot design [9]. We refer to this approach as the Stochastic Search Variable Selection for Split-Plot Design (SSVS-SPD). While the SSVS samples all parameters from a posterior distribution with a single slab variance, the SSVS-SPD samples the whole-plot parameters and the subplot parameters from two posterior distributions with different slab variances, given that the whole-plot and the subplot effects might have different sizes. We use the same priors as in the SSVS for all the parameters of interest, as in Section 5. Basically, the SSVS-SPD can be seen as running the SSVS twice in one process, once for the subplot factors and once for the whole-plot factors. The algorithm can be explained as follows:
We process the MCMC estimation of the parameters $\beta$, $\delta_S$, $\delta_W$, $\omega_S$, $\omega_W$, $\sigma^2$, $\rho$, $c_S$ and $c_W$. The following Metropolis-Hastings within Gibbs sampling algorithm can be implemented. Let $y$ be the $n \times 1$ vector of random responses and $X$ the $n \times p$ model matrix without the column of the intercept. Let X.S be the $n \times p_S$ model matrix for the subplot factors, where $p_S$ is the number of subplot fixed effect parameters, and let X.W be the $n \times p_W$ model matrix for the whole-plot factors, where $p_W$ is the number of whole-plot fixed effect parameters. The $p \times 1$ vector of fixed effect parameters is $\beta$, where $p$ is the number of fixed effect parameters that need to be estimated; $\beta_S$ is the $p_S \times 1$ vector of subplot effect parameters, and $\beta_W$ is the $p_W \times 1$ vector of whole-plot effect parameters.
We set initial values $\beta_S^{(0)}$ and $\beta_W^{(0)}$ for the coefficients, and initial values $\delta_S^{(0)}$ and $\delta_W^{(0)}$ for the indicator vectors of the subplot and whole-plot factors. Also, we set initial values $\sigma^{2(0)}$, $\rho^{(0)}$ and $\eta^{(0)}$. Initial values $c_S^{(0)}$ and $c_W^{(0)}$ are set for the slab variances of the subplot and whole-plot factors, and initial values $\omega_S^{(0)}$ and $\omega_W^{(0)}$ for the weights of the subplot and whole-plot factors.
Starting at the $t$th iteration, $t = 1, \ldots, M$, where $M$ is the total number of MCMC iterations, and conditioning on the draws from iteration $t - 1$, the sampling algorithm is:
1) For $j = 1, \ldots, p_S$ and $k = 1, \ldots, p_W$, sample $\delta_{S,j}^{(t)}$ and $\delta_{W,k}^{(t)}$ of the indicator vectors $\delta_S$ and $\delta_W$ using (27) for $\beta_S^{(t-1)}$, $\beta_W^{(t-1)}$, $\omega_S^{(t-1)}$, $\omega_W^{(t-1)}$, $\sigma^{2(t-1)}$, $c_S^{(t-1)}$, and $c_W^{(t-1)}$.
2) Allocate $\delta^{(t)} = \left(\delta_S^{(t)\prime}, \delta_W^{(t)\prime}\right)'$.
3) Sample the mixture weights $\omega_S^{(t)}$ and $\omega_W^{(t)}$ using (28) for $\delta_S^{(t)}$ and $\delta_W^{(t)}$.
4) Allocate $\omega^{(t)} = \left(\omega_S^{(t)}, \omega_W^{(t)}\right)$.
5) Sample the regressor coefficients $\beta_S^{(t)}$ and $\beta_W^{(t)}$ using (24) for $X$, $y$, $\delta_S^{(t)}$, $\delta_W^{(t)}$, $\sigma^{2(t-1)}$, $\rho^{(t-1)}$, $c_S^{(t-1)}$, $c_W^{(t-1)}$, and $d$, where $D_{\delta_S}$ is a diagonal matrix whose $j$th diagonal element equals $c_S$ when $\delta_{S,j} = 1$ and $d$ when $\delta_{S,j} = 0$, and $D_{\delta_W}$ is a diagonal matrix whose $k$th diagonal element equals $c_W$ when $\delta_{W,k} = 1$ and $d$ when $\delta_{W,k} = 0$.
6) Allocate $\beta^{(t)} = \left(\beta_S^{(t)\prime}, \beta_W^{(t)\prime}\right)'$ and $D_\delta^{(t)} = \operatorname{diag}\!\left(D_{\delta_S}, D_{\delta_W}\right)$.
7) Sample the total variance $\sigma^{2(t)}$ using (26) for $X$, $y$, $Z$, $\beta^{(t)}$, and $\rho^{(t-1)}$.
8) a) Sample $\eta^{*}$ from log-normal$\!\left(\log \eta^{(t-1)}, s^2\right)$;
b) Calculate $r$ using (25) for $X$, $y$, $\beta^{(t)}$, $\sigma^{2(t)}$, and $\eta^{*}$;
c) Sample $u$ from Uniform$(0, 1)$;
d) If $u \leq \min(1, r)$, then set $\eta^{(t)} = \eta^{*}$, otherwise set $\eta^{(t)} = \eta^{(t-1)}$; in either case $\rho^{(t)} = \eta^{(t)}/\left(1 + \eta^{(t)}\right)$.
9) Sample $c_S^{(t)}$ and $c_W^{(t)}$ from the set $T$ with probability mass function given in (29) for $\beta_S^{(t)}$, $\beta_W^{(t)}$, $\delta_S^{(t)}$, and $\delta_W^{(t)}$.
10) Allocate $c^{(t)} = \left(c_S^{(t)}, c_W^{(t)}\right)$.
8. Real Data Application
[18] and [19] described an experiment on the production of vinyl for automobile seat covers. The experiment has 28 runs and is a modified version of an example in [19]. In the experiment, the effects of five factors on the thickness of the vinyl are investigated. Three of the factors are mixture components and two of them are so-called process variables. As in ordinary mixture experiments, the component proportions sum to one. In this example, the response of interest does not depend only on these proportions, but also on the effects of the process variables. The mixture components in the experiment are three plasticizers whose proportions are represented by $x_1$, $x_2$ and $x_3$. The two process variables studied are the rate of extrusion ($z_1$) and the temperature of drying ($z_2$). The experiment was conducted in a split-plot format. The process variables are the whole-plot factors of the experiment, whereas the mixture components are the subplot factors. The data are shown in Table 1.
Table 1. Data for the vinyl thickness three-component mixture experiment with two process variables [20].
A main effects plus two-factor interaction model was assumed for the process variables $z_1$ and $z_2$. For the mixture components, the quadratic mixture model was used. The main effects of the process variables were crossed with the linear blending terms only, so the model estimated in [18] is given by
$$E(y) = \sum_{i=1}^{3}\beta_i x_i + \sum_{i<j}\beta_{ij} x_i x_j + \sum_{i=1}^{3}\sum_{l=1}^{2}\alpha_{il}\, x_i z_l + \lambda\, z_1 z_2.$$
They computed the variance component estimates $\hat{\sigma}_\gamma^2$ and $\hat{\sigma}_\varepsilon^2$ in R. The factor effect estimates will be displayed in the next section for comparison with the methods proposed in this paper.
9. Analysis of the Vinyl Thickness Experiment
The real dataset from the vinyl thickness experiment has been used to apply the Bayesian variable selection approach. The model involves the 5 main variables ($x_1$, $x_2$, $x_3$, $z_1$, $z_2$) and the two-factor interaction variables. We used the prior distributions above. Following [21], in a Bayesian framework the final model could be the median probability model, consisting of those variables whose posterior inclusion probability is at least 0.5. The posterior probability of parameter $\beta_j$ being active is approximated by
$$P\!\left(\delta_j = 1 \mid y\right) \approx \frac{1}{M}\sum_{t=1}^{M}\delta_j^{(t)}, \qquad (30)$$
where $\delta_j^{(t)}$ is $\delta_j$ sampled at iteration $t$ of the Metropolis-Hastings within Gibbs sampling algorithm.
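A minimal sketch of approximating the posterior inclusion probabilities in (30) from stored indicator draws, and of applying the median probability model rule of [21], is:

```python
import numpy as np

def inclusion_probabilities(delta_draws, burn_in=0):
    """Approximate P(delta_j = 1 | y) by averaging the stored indicator draws.

    delta_draws : (n_iterations, p) array of 0/1 indicator draws.
    """
    kept = np.asarray(delta_draws)[burn_in:]
    return kept.mean(axis=0)

def median_probability_model(delta_draws, threshold=0.5, burn_in=0):
    """Indices of terms whose posterior inclusion probability is >= threshold."""
    return np.flatnonzero(inclusion_probabilities(delta_draws, burn_in) >= threshold)
```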
We summarise the results of applying Bayesian variable selection to the data from the vinyl thickness experiment. The model which will be used is the 13-parameter model given in Section 8. The estimates of the 13 parameters of the model have been obtained using the SSVS and the SSVS-SPD for the response $y$ displayed in Table 1, as well as by the generalised least squares (GLS) estimator for comparison purposes.
Figure 2 shows a comparison between SSVS and SSVS-SPD with respect to the resulting approximate posterior probabilities for the vinyl thickness experiment. Six of the parameters have the highest posterior probability of being active under both SSVS and SSVS-SPD, which indicates that the six variables associated with these terms play a significant role in this experiment. Following these terms, we find a group of parameters with approximate posterior probabilities of about 0.5 and 0.6 under both SSVS and SSVS-SPD. We note that SSVS and SSVS-SPD tend to consider these terms significant, at approximate posterior probabilities of 0.5 and 0.6, while they are not significant by the GLS method. All methods consider the remaining terms to be non-significant, with low approximate posterior probabilities in this experiment. The Bayesian analysis of the real data from the vinyl thickness experiment yielded 10 significant variables. In contrast, the p-values for the GLS estimates in Table 2 show that there are 7 significant variables, as the GLS analysis excludes the three variables associated with the additional coefficients. This means that the SSVS and SSVS-SPD yielded 3 extra significant variables in the model. Table 2 presents the posterior means of the coefficients and standard deviations from both the SSVS and the SSVS-SPD methods, together with the GLS estimates and their p-values for the vinyl thickness experiment. Table 3 shows the posterior means of the correlation $\rho$ and the total variance $\sigma^2$ for both the SSVS and SSVS-SPD methods. Figure 3 shows the ACF plots for the Markov chains produced by the Metropolis-Hastings within Gibbs sampling algorithm used by SSVS and SSVS-SPD to sample the correlation $\rho$ and the variance $\sigma^2$.
Figure 2. Approximate posterior probabilities for the vinyl thickness experiment.
Table 2. The estimated coefficients and standard deviations (in parentheses) for the vinyl thickness experiment by SSVS and SSVS-SPD. The first row gives the estimated coefficients and p-values (in parentheses) for the GLS.
Table 3. Posterior means of the correlation $\rho$ and the total variance $\sigma^2$ by the SSVS and SSVS-SPD for the vinyl thickness experiment.
Figure 3. ACF plots for the Markov chains formed by sampling the total variance $\sigma^2$ and the correlation $\rho$ by SSVS and SSVS-SPD for the vinyl thickness experiment.
10. Simulation Study Using the Design of the Vinyl Thickness Experiment
We performed a simulation study by generating 1000 datasets, where each dataset was run for 10,000 MCMC iterations. The SSVS and the SSVS-SPD were applied at two settings of the variance components. We assume that the whole-plot and residual variance components are equal, so the true value of the total variance $\sigma^2$ follows from the assumed variance components and the true value of $\rho$ is 0.5. We calculate Type I and Type II error rates using the indicator vector $\delta$ and the approximate posterior probability in (30). If a true variable is active but the algorithm yields a corresponding approximate posterior probability of less than 0.5, this variable contributes to the Type II error rate. Likewise, if a true variable is non-active but the algorithm yields a corresponding approximate posterior probability larger than or equal to 0.5, this variable contributes to the Type I error rate. We also assess the precision of the point estimates by computing the median relative model error (MRME) of the SSVS and SSVS-SPD estimates. We focus on the properties of the estimated models by investigating the following properties:
1) consistency in variable selection (frequency in selecting the active/non-active variable), and
2) prediction performance.
For point 1, at the 5% significance level, we report the Type I error rate (an effect that is truly not significant but which the corresponding procedure estimates to be significant). We also report the Type II error rate (an effect that is truly present but which the corresponding procedure estimates to be not significant).
For point 2, following [22] and [23], prediction accuracy is measured by computing the model error for each estimate $\hat{\beta}$ as
$$\mathrm{ME}(\hat{\beta}) = (\hat{\beta} - \beta)'\, X'X\, (\hat{\beta} - \beta).$$
The relative model error (RME) is the ratio of the model error of the estimates to the model error of the GLS estimates of the fixed effects,
$$\mathrm{RME} = \frac{\mathrm{ME}(\hat{\beta})}{\mathrm{ME}(\hat{\beta}_{\mathrm{GLS}})},$$
where $\hat{\beta}_{\mathrm{GLS}}$ in Equation (13) is the generalised least squares estimator of $\beta$. The median of the relative model errors (MRME) over the 1000 simulated datasets is reported. MRME values greater than one indicate that the method’s estimates perform worse than the GLS estimates, values near one indicate that the method’s estimates perform similarly to the GLS estimates, and values less than one indicate that the method’s estimates perform better than the GLS estimates.
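A minimal sketch of this model-error comparison, assuming the model error is the quadratic form in $(\hat{\beta} - \beta)$ weighted by $X'X$ as reconstructed above (the precise weighting used in [22] and [23] may differ), is:

```python
import numpy as np

def model_error(beta_hat, beta_true, X):
    """Quadratic-form model error (beta_hat - beta)' X'X (beta_hat - beta)."""
    diff = np.asarray(beta_hat) - np.asarray(beta_true)
    return diff @ (X.T @ X) @ diff

def median_relative_model_error(method_estimates, gls_estimates, beta_true, X):
    """Median over simulated datasets of ME(method) / ME(GLS)."""
    rme = [model_error(b, beta_true, X) / model_error(g, beta_true, X)
           for b, g in zip(method_estimates, gls_estimates)]
    return float(np.median(rme))
```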
We perform a simulation study to examine the performance of the SSVS and SSVS-SPD methods. Using the design of the vinyl thickness experiment, we generate the response from the true model
$$y = X\beta + Z\gamma + \varepsilon, \qquad (31)$$
where $\beta$ contains the true values of the coefficients, only a subset of which are set to be active.
Type I and Type II error rates are displayed in Table 4 and Table 5 for the two simulation settings. Also, Table 6 presents the estimated posterior means of $\rho$ and $\sigma^2$ by the SSVS and SSVS-SPD from the simulation using the design of the vinyl thickness experiment. Figure 4 shows the MRME values for both settings. We note that in one of the two settings the SSVS has an MRME greater than one, which indicates that the SSVS estimates are worse than the GLS estimates, while in the other setting the SSVS performs better than the GLS estimates. The SSVS-SPD performs similarly to the GLS estimates at both settings.
Type II error rates at both settings are generally low, indicating that the active effects are easy to detect by both SSVS and SSVS-SPD. Furthermore, the Type II error rates of the SSVS-SPD are lower than those of the SSVS. With regard to the Type I error rate, at both settings the SSVS-SPD has lower Type I error rates than the SSVS. Detecting active variables is easier for both SSVS and SSVS-SPD than detecting the non-active variables. In summary, we found that the SSVS-SPD detects the active and non-active effects with lower error rates than the SSVS. The analysis of the median relative model error (MRME) for both SSVS and SSVS-SPD using the vinyl thickness design shows behaviour similar to the GLS, except for the SSVS in one of the settings, where detecting the active effects was harder than with the SSVS-SPD.
Table 4. Type I error rates for the simulation using the design of the vinyl thickness experiment.
Table 5. Type II error rates for the simulation using the design of the vinyl thickness experiment.
Figure 4. Median relative model error (MRME) for the simulation using the design of the vinyl thickness experiment.
Table 6. Posterior means of the correlation $\rho$ and the total variance $\sigma^2$ by the SSVS and SSVS-SPD for the simulation using the design of the vinyl thickness experiment.
11. Conclusion
This paper provided an analysis of data from split-plot mixture process variable experiments using a motivating example from an industrial environment. Specifically, we recommend the use of the SSVS-SPD method for Bayesian variable selection. In our results, we observed that the SSVS-SPD can identify the active variables (linear and two-factor interaction terms) much better than the SSVS and the traditionally used GLS method. However, as expected, this comes at the expense of slightly higher Type I error rates. We also observed better prediction performance for the models chosen by the SSVS-SPD compared to the models chosen by the SSVS method.