Bayesian Variable Selection for Mixture Process Variable Design Experiment

This paper discusses Bayesian variable selection methods for models from split-plot mixture designs, using samples drawn by a Metropolis-Hastings within Gibbs sampling algorithm. Bayesian variable selection is easy to implement thanks to improvements in computing via MCMC sampling. We describe the Bayesian methodology by introducing the Bayesian framework and explaining Markov chain Monte Carlo (MCMC) sampling. Metropolis-Hastings within Gibbs sampling is used to draw dependent samples from the full conditional distributions, which are derived here. In mixture experiments with process variables, the response depends not only on the proportions of the mixture components but also on the effects of the process variables. In many such mixture-process variable experiments, constraints such as time or cost prohibit selecting treatments completely at random. In these situations, restrictions on the randomisation force the level combinations of one group of factors to be fixed while the combinations of the other group of factors are run; then a new level of the first factor group is set and the combinations of the other factors are run again. We discuss the computational algorithm for Stochastic Search Variable Selection (SSVS) in linear mixed models, and we extend it to fit models from split-plot mixture designs by introducing the Stochastic Search Variable Selection for Split-Plot Design (SSVS-SPD) algorithm. The motivation for this extension is that the split-plot mixture design has two different levels of experimental units: one for the whole plots and one for the subplots.


Introduction
Bayesian methods are important approaches due to their ability to quantify uncertainty. In such an approach, prior distributions that represent subjective beliefs about the unknown parameters are specified, including priors for the variance components in the linear mixed models.
The novel contribution of this paper to Bayesian variable selection is motivated by a very specific experimental design: data from experiments subject to restricted randomisation. There are two different levels of experimental units, one for the whole plots and the other for the subplots in the split-plot design; see, for example, [9]. To address this issue, we adapt the SSVS algorithm so that the subplot coefficients are sampled from a mixture of normal posterior distributions with a slab variance different from the slab variance used in the mixture of normal posterior distributions for the whole-plot coefficients. This method reduces the Type I and II error rates, as well as the prediction error, for the split-plot design relative to applying the SSVS algorithm in which all coefficients are sampled from a mixture of normal posterior distributions with a single slab variance. We call the proposed method the Stochastic Search Variable Selection for Split-Plot Design (SSVS-SPD). The frequentist analysis depends on the estimates of the variance components, yet these estimates cannot be calculated precisely because of the deficiency of degrees of freedom for the random effects in the split-plot design; this issue was discussed by [8]. Introducing a prior distribution for the variance components in the linear mixed model provides additional information to overcome the problem of variance estimation.
This paper differs from [10] in that the dataset used here is the vinyl thickness experiment, which had not previously been used in variable selection. It also differs in the choice of the shape parameters of the prior for the correlation parameter, as in this paper we used a = b = 2 to fit these data. A preprint has previously been published [11].

Split-Plot Design and Sample Model
The model for blocked experiments includes two types of errors: block error and residual error. Hence, linear mixed models (LMMs) are used to analyse responses from blocked experiments.
Linear mixed models introduce correlations between observations through random effects. This leads to the use of generalised least squares (GLS) estimation, combined with restricted maximum likelihood (REML) estimation of the variance components, as will be discussed. This type of analysis is used by most design-of-experiments textbooks that deal with blocked designs.
In matrix notation, the model corresponding to a blocked design is written as Y = Xβ + Zγ + ε, where Y is the n × 1 vector of observations on the response of interest, X is the n × p model matrix containing the polynomial expansions of the m factor levels at the n experimental runs, β is the p × 1 vector of unknown fixed parameters, and Z is an n × b random-effects design matrix representing the allocation of the runs to blocks, whose (i, j)th element is one if the ith observation belongs to the jth block, and zero otherwise. If the runs of the experiment are grouped per block, then Z is block diagonal with columns 1_{k_1}, ..., 1_{k_b}, where 1_k is a k × 1 vector of ones and k_1, k_2, ..., k_b are the block sizes. The random effects of the b blocks are contained in the b × 1 vector γ, and the random errors are contained in the n × 1 vector ε. It is assumed that γ and ε are independent and normally distributed, i.e. γ ~ N(0, σγ² I_b) and ε ~ N(0, σε² I_n). Under these assumptions, Y is a normally distributed random variable with mean E(Y) = Xβ, and the variance-covariance matrix of the response Y can be written as V = σγ² ZZ′ + σε² I_n, which is block diagonal. The variance ratio η = σγ²/σε² is a measure of the extent to which observations within the same block are correlated: the larger this ratio, the more strongly observations within the same block are correlated.
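As a numerical illustration of this covariance structure, the sketch below builds Z and V for a hypothetical design with three blocks of sizes 2, 3 and 2; the block sizes and variance values are assumptions chosen for illustration, not the experiment's actual settings.

```python
import numpy as np

# Assumed example: b = 3 blocks of sizes k = [2, 3, 2].
k = [2, 3, 2]
n, b = sum(k), len(k)
sigma_g2, sigma_e2 = 1.0, 2.0  # illustrative block and residual variances

# Z allocates runs to blocks: Z[i, j] = 1 if run i belongs to block j.
Z = np.zeros((n, b))
row = 0
for j, kj in enumerate(k):
    Z[row:row + kj, j] = 1.0
    row += kj

# V = sigma_g2 * Z Z' + sigma_e2 * I_n is block diagonal.
V = sigma_g2 * Z @ Z.T + sigma_e2 * np.eye(n)

# Variance ratio eta measures the within-block correlation strength.
eta = sigma_g2 / sigma_e2
```

Observations in the same block (e.g. runs 0 and 1) share covariance σγ², while observations in different blocks are uncorrelated.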
When the random error terms as well as the group effects are normally distributed, the maximum likelihood estimate of the unknown model parameter β in Equation (1) is the generalised least squares (GLS) estimate. Finding the estimator β̂ of β requires minimising (Y − Xβ)′ V⁻¹ (Y − Xβ) with respect to β. Therefore, the generalised least squares (GLS) estimator of β is β̂ = (X′V⁻¹X)⁻¹ X′V⁻¹Y (11), and the variance-covariance matrix of the estimator is given by var(β̂) = (X′V⁻¹X)⁻¹ (12). Often, the variances σγ² and σε² are not known, and therefore Equation (11) and Equation (12) cannot be evaluated directly; the variances are replaced by their estimates, yielding the estimated covariance matrix V̂ in Equation (14). In that case, the variance-covariance matrix in Equation (12) can be approximated by (X′V̂⁻¹X)⁻¹. The generalised least squares (GLS) estimator is unbiased, meaning that E(β̂) = β, and is equal to the maximum likelihood estimator (MLE). The likelihood function is defined as the joint probability density function for the observed data, examined as a function of the parameters. Hence, the likelihood function for Y in Equation (1) is L(β) ∝ |V|^(−1/2) exp{−(1/2)(Y − Xβ)′V⁻¹(Y − Xβ)} (16), up to a constant that does not depend on β. The maximum likelihood estimator (MLE) is the estimator that maximises the likelihood function, which is tantamount to maximising the log-likelihood. As Equation (9) is proportional to the log of Equation (16), the GLS estimator in Equation (11) is the result of Equations (17) and (18).
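A minimal numerical sketch of the GLS estimator in Equation (11), using simulated data and an assumed diagonal V; all values here are illustrative, not taken from the paper's design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: beta_hat = (X'V^-1 X)^-1 X'V^-1 Y, Equation (11).
n, p = 12, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
V = np.diag(np.linspace(1.0, 2.0, n))  # assumed heteroscedastic covariance
y = X @ beta_true                       # noise-free response for illustration

Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)  # Equation (11)
cov_beta = np.linalg.inv(X.T @ Vinv @ X)                    # Equation (12)
```

With a noise-free response the GLS estimate recovers β exactly, whatever positive-definite V is used.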
Restricted maximum likelihood (REML) is used to estimate the variance components σγ² and σε², where β̂ is as defined in Equation (13). The restricted log-likelihood is maximised with respect to the variance components.

Bayesian Framework
The essential philosophy behind Bayesian inference is to update a prior distribution for an unknown parameter to a posterior distribution by Bayes' theorem.
Bayes' theorem can be used to derive the conditional distributions. While the frequentist approach treats the parameters as unknown but fixed, the Bayesian approach regards them as random variables. We define the prior distribution p(θ) as the probability density (or mass) function which reflects our beliefs about θ in the parameter space Θ. For given data y = (y₁, y₂, ..., y_n)′, the likelihood is p(y | θ), and the posterior distribution is obtained by Bayes' theorem as p(θ | y) ∝ p(y | θ) p(θ). In most models this posterior cannot be evaluated in closed form, so we sample from it using MCMC. Thus, we should run the simulation long enough that the distribution of the current draws is close enough to the posterior distribution p(θ | y).
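As a toy illustration of prior-to-posterior updating, a conjugate Beta-Bernoulli example; the prior and the data below are assumptions chosen purely for illustration.

```python
# Beta(1, 1) prior on a success probability theta, Bernoulli likelihood.
a_prior, b_prior = 1.0, 1.0
y = [1, 0, 1, 1, 0, 1]               # assumed observed data

# Posterior: Beta(a + #successes, b + #failures).
a_post = a_prior + sum(y)
b_post = b_prior + len(y) - sum(y)
post_mean = a_post / (a_post + b_post)
```

The posterior mean 5/8 sits between the prior mean 1/2 and the sample proportion 4/6, showing how the data update the prior belief.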

Metropolis-Hastings Sampling
Metropolis-Hastings sampling was proposed by [15] and [16]. The theory of this sampling is based on rejection sampling. The acceptance-rejection method is a technique for obtaining samples from a distribution of unknown form. The transition operator for the Metropolis algorithm has an additional step that assesses whether or not the target distribution has a sufficiently large density near the proposed state to warrant accepting the move.
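A minimal sketch of one Metropolis step, assuming a standard normal target and a symmetric random-walk proposal; both choices are illustrative, not the target used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # Unnormalised log density of a standard normal (assumed toy target).
    return -0.5 * x * x

def mh_step(x, step=1.0):
    # Symmetric random-walk proposal; accept with probability min(1, p(x*)/p(x)).
    x_star = x + step * rng.standard_normal()
    log_r = log_target(x_star) - log_target(x)
    return x_star if np.log(rng.uniform()) < log_r else x

# Run a short chain; after burn-in, draws approximate the target.
chain = [0.0]
for _ in range(5000):
    chain.append(mh_step(chain[-1]))
sample_mean = float(np.mean(chain[1000:]))  # discard burn-in
```

Because the proposal is symmetric, the proposal densities cancel in the ratio; the asymmetric (Hastings) version used later for the variance ratio needs an extra correction term.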

Gibbs Sampling
A Gibbs sampler is the simplest of the Markov chain simulation algorithms; it is used for conditionally conjugate models, where we can sample directly from each conditional posterior [12] [13]. It is rare for all the conditional posteriors in a model to have known forms. Some conditional posterior distributions can be sampled from directly, while others cannot be sampled from straightforwardly. The procedure in this case is to update the parameters one at a time, with the Gibbs sampler used where possible and one-dimensional Metropolis updating where necessary. This process is called Metropolis-Hastings within Gibbs sampling and will be used in this work.
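A minimal sketch of Metropolis-Hastings within Gibbs for a toy normal model; the data, priors and tuning constant are assumptions for illustration. The mean is updated by a direct Gibbs draw, while the variance (treated here as having a non-standard conditional for the sake of the example) is updated by a Metropolis step on the log scale.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(3.0, 1.0, size=200)   # synthetic data (assumed example)
n, ybar = len(y), float(np.mean(y))

def log_cond_sigma2(s2, mu):
    # Log conditional for sigma2 up to a constant, with prior p(sigma2) ~ 1/sigma2.
    return -(n / 2 + 1) * np.log(s2) - np.sum((y - mu) ** 2) / (2 * s2)

mu, s2 = 0.0, 1.0
mus = []
for _ in range(2000):
    # Gibbs step: mu | sigma2, y is conjugate normal (flat prior on mu).
    mu = rng.normal(ybar, np.sqrt(s2 / n))
    # Metropolis step for sigma2: log-normal random walk with Jacobian correction.
    s2_star = s2 * np.exp(0.3 * rng.standard_normal())
    log_r = (log_cond_sigma2(s2_star, mu) - log_cond_sigma2(s2, mu)
             + np.log(s2_star) - np.log(s2))   # proposal asymmetry term
    if np.log(rng.uniform()) < log_r:
        s2 = s2_star
    mus.append(mu)
post_mean_mu = float(np.mean(mus[500:]))
```

Each iteration cycles through the parameters once, exactly the pattern used for the full model in the sections that follow.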

A Hierarchical Mixture Model for Variable Selection
The linear mixed model fitted to data from a split-plot experiment with n responses is y = β₀1_n + Xβ + Zγ + ε, where y is the n × 1 vector of random responses, β₀ is the intercept, 1_n is an n × 1 vector of ones, X is the n × p model matrix without the column for the intercept, β is the p × 1 vector of fixed effect parameters, Z is the random-effects design matrix, and V is the resulting variance-covariance matrix of y. We need to find the indicator vector ν = (ν₁, ..., ν_p)′ with the highest posterior probability. Following [6] and [5], we assume that β | ν, σ², c ~ N(0, σ² D_{ν,c}), where ν is the indicator vector, c is the prior variance of the slab distribution, and D_{ν,c} is a diagonal matrix whose jth diagonal element equals c if ν_j = 1 and d if ν_j = 0. The parameters σ², ν and c will be given prior distributions, and the parameter d is assumed to be a small fixed non-negative number because we want the spike distribution to have a smaller variance than the slab distribution. Formally, the prior construction of β is as follows: the Bernoulli variable ν_j takes values 1 and 0 with probability of inclusion ω. Often, the ν_j's are taken as independent Bernoulli(ω) random variables, where 0 < ω < 1. It is common to fix ω in the normal mixture; however, we treat ω as a parameter in order to investigate different values of ω, and sample it from a Beta distribution, as will be explained in Section 6.3.
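The spike-and-slab prior construction above can be sketched as follows; the values of p, ω, c, d and σ² are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed values: p = 6 coefficients, omega = 0.5, slab variance c = 9,
# spike variance d = 0.01, total variance sigma2 = 1.
p, omega, c, d, sigma2 = 6, 0.5, 9.0, 0.01, 1.0

# Indicator vector: nu_j ~ Bernoulli(omega).
nu = rng.binomial(1, omega, size=p)

# jth diagonal element of D_{nu,c}: c when active, d when inactive.
diag_D = np.where(nu == 1, c, d)

# beta | nu, sigma2, c ~ N(0, sigma2 * D_{nu,c}).
beta = rng.normal(0.0, np.sqrt(sigma2 * diag_D))
```

Active coefficients are drawn from the wide slab, inactive ones from the narrow spike, so the prior itself encodes variable selection.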

Prior Distributions
1) Following the prior distributions used by [5], we assume the normal mixture prior given above for the fixed effects. 2) The prior distribution for the total variance σ² has hyperparameters a and b. For this work, we used a = 0 and b = 0 following [5], as this yields the common non-informative prior for σ². This prior is improper; however, we sample from the posterior distribution, which is a proper gamma distribution.
3) The prior distribution for the correlation parameter ρ is a Beta distribution, following [8]. This has the effect of pulling the posterior mode of ρ towards 0.5 when the data are scarce. 4) The prior distribution for the elements of the indicator vector ν is independent Bernoulli(ω), where ω is the prior probability that β_j is active, following [5].
5) The prior distribution for the probability of inclusion ω is a Beta distribution. 6) The prior distribution for the slab variance c is a discrete uniform distribution with support points T = {1/4, 9/16, 1, 4, 9, 16, 25}, as given by [5].
They found that large values of c tend to favour sparse models with large effects, in which case small effects will be missed. On the other hand, small values of c tend to favour less sparse models, while very small values of c tend to favour sparse models again. The support points in T are selected so that they cover both small and large values of c. The prior distribution for c is therefore p(c = t) = 1/7 for each t in T.

Full Conditional Distributions
We use the prior distributions presented in Section 5 to derive the full conditional distributions. The likelihood of the data depends on β, so we can derive the conditional distribution for β using the prior distribution p(β | ν, σ², c). Note that we standardise both X and y, so the fixed effect vector β does not include the intercept. The key to deriving the conditional distribution is to rewrite the expression in the exponential part in a more convenient form, using the multivariate completion of squares. The result is a multivariate normal distribution, so we can sample β directly from its conditional posterior.
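A sketch of one draw from this normal full conditional, assuming the model y | β ~ N(Xβ, σ²Σ) with prior β ~ N(0, σ²D), which gives β | rest ~ N(A X′Σ⁻¹y, σ²A) with A = (X′Σ⁻¹X + D⁻¹)⁻¹; the data, Σ and D below are illustrative assumptions, not the experiment's values.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p, sigma2 = 20, 3, 1.0
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.standard_normal(n)
Sigma = np.eye(n)                 # correlation structure (identity for illustration)
D = np.diag([9.0, 0.01, 9.0])     # assumed slab/spike variances from D_{nu,c}

# Completion of squares yields the normal full conditional:
#   beta | rest ~ N(A X' Sigma^-1 y, sigma2 * A),  A = (X'Sigma^-1 X + D^-1)^-1.
Sinv = np.linalg.inv(Sigma)
A = np.linalg.inv(X.T @ Sinv @ X + np.linalg.inv(D))
mean = A @ X.T @ Sinv @ y
beta_draw = rng.multivariate_normal(mean, sigma2 * A)
```

The spike variance on the second coefficient shrinks its conditional mean strongly towards zero, while the slab coefficients are left close to their GLS values.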

The Conditional Distribution for ρ
The likelihood of the data depends on ρ, so the conditional distribution for ρ can be derived from the likelihood and the prior. We note that the likelihood depends on ρ through V, as in (4.6), so we can express V as a function of ρ. The scale s² affects the jumping rule in the random walk proposal distribution. As we have one parameter to update in the random walk algorithm, namely ρ, we follow [12] and [17] and set s² = g²Σ.
The most efficient jump has a scale g = 2.4/√h, where h is the number of parameters to be updated. In this work, we set g = 2.4 and h = 1 following [17], and we set Σ = 100, as this yields an appropriate acceptance rate and nearly independent samples as judged from the ACF plot. We draw a proposal value η* from a log-normal(η_t, s²) distribution, and the probability of accepting or rejecting η* is the minimum of 1 and the ratio r of the conditional posterior densities, corrected for the asymmetry of the log-normal proposal.
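The log-normal random-walk update can be sketched as follows; the target density (a Gamma(2, 1), standing in for the actual conditional of η) and the tuning value s² are assumptions for illustration. The ratio includes the correction η*/η_t required by the asymmetric log-normal proposal.

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(eta):
    # Assumed positive-valued target: log of a Gamma(2, 1) density, up to a constant.
    return np.log(eta) - eta

s2 = 0.5          # proposal variance s^2 (illustrative tuning constant)
eta = 1.0
draws = []
for _ in range(20000):
    # Proposal eta* ~ log-normal(log eta_t, s^2).
    eta_star = np.exp(rng.normal(np.log(eta), np.sqrt(s2)))
    # log r includes the Jacobian-type term log(eta*/eta) for the asymmetric proposal.
    log_r = (log_target(eta_star) - log_target(eta)
             + np.log(eta_star) - np.log(eta))
    if np.log(rng.uniform()) < min(0.0, log_r):
        eta = eta_star
    draws.append(eta)
post_mean = float(np.mean(draws[2000:]))   # Gamma(2, 1) has mean 2
```

Omitting the η*/η_t correction would bias the chain, since the log-normal proposal density is not symmetric in its arguments.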

The Conditional Distribution for σ 2
The likelihood of the data depends on σ², so we can derive its full conditional distribution by combining the likelihood with the prior from Section 5. The indicator vector ν can be drawn conditionally on the regression coefficients, and computation of the marginal likelihood is not required. The prior probabilities are independent Bernoulli, where ω is the prior probability that β_j is active. The joint conditional posterior distribution for ν has a product Bernoulli mass function. The conditional distribution for the jth component of β given ν_j is a normal distribution with slab or spike variance, and the conditional posterior probabilities for ν_j therefore follow from Bayes' theorem.
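The conditional posterior probability that ν_j = 1 can be computed directly from the two mixture components; the values of β_j, ω, c and d below are illustrative assumptions.

```python
import numpy as np

def norm_pdf(x, var):
    # Density of N(0, var) evaluated at x.
    return np.exp(-x * x / (2 * var)) / np.sqrt(2 * np.pi * var)

def prob_active(beta_j, omega, c, d, sigma2=1.0):
    # P(nu_j = 1 | beta_j) from the slab/spike mixture, assuming
    # beta_j ~ N(0, sigma2*c) under the slab and N(0, sigma2*d) under the spike.
    slab = omega * norm_pdf(beta_j, sigma2 * c)
    spike = (1 - omega) * norm_pdf(beta_j, sigma2 * d)
    return slab / (slab + spike)

p_large = prob_active(3.0, omega=0.5, c=9.0, d=0.01)   # large coefficient
p_small = prob_active(0.01, omega=0.5, c=9.0, d=0.01)  # near-zero coefficient
```

A coefficient far from zero is almost certainly assigned to the slab, while a near-zero coefficient is attributed to the spike, which is exactly how the sampler switches variables in and out of the model.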

The Conditional Distribution for ω
The probability of inclusion ω can be drawn conditionally on the indicator vector, and computation of the marginal likelihood is not required. Hence, the conditional distribution for ω is a Beta distribution.
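Assuming a Beta(a_ω, b_ω) prior, the independent Bernoulli indicators yield the conjugate update ω | ν ~ Beta(a_ω + Σν_j, b_ω + p − Σν_j); a minimal sketch, with hyperparameters and ν chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Assumed Beta(1, 1) prior and an assumed current indicator vector.
a_w, b_w = 1.0, 1.0
nu = np.array([1, 0, 1, 1, 0, 0, 1, 0])

p, n_active = len(nu), int(nu.sum())
omega_draw = rng.beta(a_w + n_active, b_w + p - n_active)  # conjugate draw
post_mean = (a_w + n_active) / (a_w + b_w + p)
```

With 4 of 8 indicators active and a uniform prior, the posterior mean of ω is exactly 0.5.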

The Conditional Distribution for c
The prior distribution for c is a discrete uniform distribution with support points T = {1/4, 9/16, 1, 4, 9, 16, 25}, and c can be drawn conditionally on the regression coefficients; computation of the marginal likelihood is not required.
Hence, the conditional distribution for c is proportional, at each support point in T, to the conditional prior density of β given ν, σ² and c.

Bayesian Variable Selection Algorithms
In this section, we introduce the computational algorithms which we use in the Bayesian analysis for variable selection: the SSVS and the SSVS-SPD. In this work, we choose an asymmetric proposal distribution, the log-normal density, and apply the Metropolis-Hastings algorithm to sample the variance ratio η.

The Stochastic Search Variable Selection (SSVS) Algorithm
We process the MCMC estimation of the parameters ρ, σ², ω, β, ν and c.

The Stochastic Search Variable Selection for Split-Plot Design (SSVS-SPD) Algorithm
We adapt the SSVS for the analysis of data from split-plot designs by taking into account the two types of factors, i.e. the whole-plot factors and the subplot factors, which are expected to have different effect sizes in the two strata of the split-plot design [9]. We refer to this approach as the Stochastic Search Variable Selection for Split-Plot Design (SSVS-SPD). While the SSVS samples all parameters from a posterior distribution with a single slab variance, the SSVS-SPD samples the whole-plot parameters and the subplot parameters from two posterior distributions with different slab variances, given that the whole-plot and subplot effects might have different sizes. We use the same priors as in the SSVS for all the parameters of interest, as in Section 5. Basically, the SSVS-SPD can be seen as running the SSVS twice in one process: once for the subplot factors and once for the whole-plot factors. We process the MCMC estimation of the parameters ρ, σ², ω, β, ν and c.
The following Metropolis-Hastings within Gibbs sampling algorithm can be implemented. Let y be the n × 1 vector of random responses and X the n × p model matrix. Table 1. Data for the vinyl thickness for a three-component mixture experiment with two process variables [20].
The Bayesian analysis for the real data of the vinyl thickness experiment yielded 10 significant variables. In contrast, the p values for the GLS estimates in Table 2 show 7 significant variables, as the GLS analysis excludes the variables associated with the coefficients β_{s1s2}, β_{s1s3} and β_{s2s3}. This means that the SSVS and SSVS-SPD yielded 3 extra significant variables in the model. Table 2 presents the posterior means and standard deviations of the coefficients from both the SSVS and the SSVS-SPD methods, together with the GLS estimates and their p values, for the vinyl thickness experiment. Table 3 shows the posterior mean of the correlation ρ and the posterior mean of the total variance σ² for both the SSVS and SSVS-SPD methods. Figure 3 shows the ACF plots for the Markov chains produced by the Metropolis-Hastings within Gibbs sampling algorithm used by the SSVS and SSVS-SPD to sample the correlation ρ and the variance σ².

Simulation Study Using the Design of the Vinyl Thickness Experiment
We performed a simulation study by generating 1000 datasets, where each dataset was run for 10,000 MCMC iterations. The SSVS and the SSVS-SPD were applied at two levels of the variance ratio, η = 1 and η = 10. We assume that σγ² + σε² = 10; thus, the true value of the total variance is σ² = 10. Also, the true value of ρ is 0.5. We calculate Type I and Type II error rates using the indicator vector ν and the approximate posterior probability in (30). If a variable is truly active but the algorithm yields a corresponding approximate posterior probability of less than 0.5, that variable counts towards the Type II error rate. Likewise, if a variable is truly non-active but the algorithm yields a corresponding approximate posterior probability larger than or equal to 0.5, that variable counts towards the Type I error rate. We also assess the precision of the point estimates from the SSVS and SSVS-SPD by computing the median relative model error (MRME) of their estimates. We focus on the properties of the estimated models by investigating the following: 1) consistency in variable selection (frequency of selecting the active/non-active variables), and 2) prediction performance.
For point 1, at the 5% significance level, we report the Type I error rate (an effect that is truly not significant but which the corresponding procedure estimates as significant). We also report the Type II error rate (an effect that is truly present but which the corresponding procedure estimates as not significant).
For point 2, following [22] and [23], prediction accuracy is measured by computing the model error for each penalised estimate β̂_λ. The relative model error (RME) is the ratio of the model error of the penalised estimate to the model error of the GLS estimate of the fixed effects, where β̂ in Equation (13) is the generalised least squares estimator of β.
The median of the relative model error (MRME) over the 1000 simulated datasets was reported. MRME values greater than one indicate that a method's estimates perform worse than the GLS estimates; values near one indicate that the method's estimates perform similarly to the GLS estimates; and values less than one indicate that the method's estimates perform better than the GLS estimates.
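A minimal sketch of the MRME computation, assuming a simple quadratic model error ME(β̂) = (β̂ − β)′(β̂ − β); the estimates below are fabricated toy values for illustration only.

```python
import numpy as np

def model_error(beta_hat, beta_true):
    # Assumed quadratic model error (beta_hat - beta)'(beta_hat - beta).
    diff = np.asarray(beta_hat) - np.asarray(beta_true)
    return float(diff @ diff)

def mrme(method_estimates, gls_estimates, beta_true):
    # RME per dataset = ME(method) / ME(GLS); MRME = median over datasets.
    rmes = [model_error(m, beta_true) / model_error(g, beta_true)
            for m, g in zip(method_estimates, gls_estimates)]
    return float(np.median(rmes))

beta_true = [1.0, 0.0, -1.0]
method = [[1.1, 0.0, -1.0], [0.9, 0.1, -1.0], [1.0, 0.0, -0.8]]
gls = [[1.2, 0.1, -1.0], [0.8, 0.2, -1.1], [1.0, 0.2, -0.6]]
value = mrme(method, gls, beta_true)
```

Here `value` is below one, i.e. on these toy datasets the method's estimates beat the GLS estimates by the MRME criterion.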
We perform a simulation study to examine the performance of the SSVS and SSVS-SPD methods, using the design of the vinyl thickness experiment to generate the response from the specified true model. Type I and Type II error rates are displayed in Table 4 and Table 5 for the two settings η = 1 and η = 10. Also, Table 6 presents the estimated posterior means of σ² and ρ by the SSVS and SSVS-SPD from the simulation using the design of the vinyl thickness experiment. Figure 4 shows the MRME values at η = 1 and η = 10. We note that the SSVS at η = 1 has an MRME greater than one, which indicates that the estimates by the SSVS are worse than the GLS estimates, while at η = 10 the SSVS performs better than the GLS estimates. The SSVS-SPD performs similarly to the GLS estimates at both levels of η.
Type II error rates at both levels of η are generally low, indicating that the active variables are easy to detect by both the SSVS and the SSVS-SPD. Furthermore, the Type II error rates of the SSVS-SPD are lower than those of the SSVS. With regard to Type I error rates, for both levels of η, the SSVS-SPD has lower Type I error rates than the SSVS. Detecting active variables is easier for both the SSVS and SSVS-SPD than detecting the non-active variables. In summary, we found that the SSVS-SPD detects the active and non-active effect factors with lower error rates than the SSVS. The analysis of the median relative model error (MRME) for both the SSVS and SSVS-SPD using the vinyl thickness design shows behaviour similar to the GLS, except at η = 10 for the SSVS, for which detecting active effect factors was harder than for the SSVS-SPD.

Conclusion
This paper provided an analysis of data from split-plot mixture process experiments using a motivating example from the industrial environment. Specifically, we recommend the use of the SSVS-SPD method for Bayesian variable selection.
In our results, we observed that the SSVS-SPD can identify the active variables (linear terms and two-factor interactions) much better than the SSVS and the traditionally used GLS method. However, as expected, this comes at the expense of slightly higher Type I error rates. We also observed better prediction performance for the models chosen by the SSVS-SPD compared to the models chosen by the SSVS method.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.