
This paper presents a novel variable selection method for additive nonparametric regression models. The work is motivated by the need to select both the number of nonparametric components and the number of variables within each component. The proposed method uses a combination of hard and soft shrinkage to separately control the number of additive components and the variables within each component. An efficient algorithm is developed to assess the importance of variables and estimate the interaction network. Excellent performance is obtained in simulated and real data examples.

Variable selection has played a pivotal role in scientific and engineering applications, such as biochemical analysis [

Popular classical sparse-regression methods such as the least absolute shrinkage and selection operator (LASSO [

Smoothing-based non-additive nonparametric regression methods [

Nonparametric variable selection based on kernel methods has become increasingly popular over the last few years. Liu et al. [

To circumvent this bottleneck, Yang et al. [

To overcome the computational challenge faced in Yang et al. [

The rest of the paper is organized as follows. Section 2 presents the additive Gaussian process model. Section 3 describes the two-level regularization and the prior specifications. The posterior computation is detailed in Section 4 and the variable selection and interaction recovery approach are presented in Section 5. The simulation study results are presented in Section 6. A couple of real data examples are considered in Section 7. We conclude with a discussion in Section 8.

For observed predictor-response pairs $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$, where $i = 1, 2, \cdots, n$ (i.e. $n$ is the sample size and $p$ is the dimension of the predictors), an additive nonparametric regression model can be expressed as

$$y_i = F(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2),$$
$$F(x_i) = \phi_1 f_1(x_i) + \phi_2 f_2(x_i) + \cdots + \phi_k f_k(x_i). \tag{1}$$

The regression function $F$ in (1) is a sum of $k$ regression functions, with the relative importance of each function controlled by the set of non-negative parameters $\phi = (\phi_1, \phi_2, \cdots, \phi_k)^T$. Typically the unknown parameter $\phi$ is assumed to be sparse to prevent $F$ from over-fitting the data.

Gaussian process (GP) [

We assume that the response vector $y = (y_1, y_2, \cdots, y_n)$ in (1) is centered and scaled. Let $f_l \sim \mathrm{GP}(0, c_l)$ with

$$c_l(x, x') = \exp\Big\{ -\sum_{j=1}^{p} \kappa_{lj} (x_j - x'_j)^2 \Big\}. \tag{2}$$
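As a concrete illustration, the component kernel in (2) can be computed as follows. This is a minimal sketch: the function name and the example values of $\kappa_{lj}$ are our own choices, not part of the paper.

```python
import numpy as np

def component_kernel(X1, X2, kappa):
    """Kernel c_l of Eq. (2): c_l(x, x') = exp(-sum_j kappa_lj (x_j - x'_j)^2).

    X1 is (m1, p), X2 is (m2, p); kappa holds the p non-negative
    inverse squared length-scales kappa_l1, ..., kappa_lp for component l.
    Returns the (m1, m2) kernel matrix.
    """
    diff = X1[:, None, :] - X2[None, :, :]                       # (m1, m2, p)
    sq = (np.asarray(kappa)[None, None, :] * diff**2).sum(axis=-1)
    return np.exp(-sq)
```

Note that a variable $j$ with $\kappa_{lj} = 0$ drops out of component $l$ entirely, which is what makes a shrinkage prior on $\kappa_{lj}$ perform variable selection within each component.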

In the next section, we discuss appropriate regularization on $\phi$ and $\{\kappa_{lj}, l = 1, \cdots, k;\ j = 1, \cdots, p\}$. A shrinkage prior on $\{\kappa_{lj}, j = 1, \cdots, p\}$ facilitates the selection of variables within component $l$ and allows adaptive local smoothing. An appropriate regularization on $\phi$ allows $F$ to adapt to the degree of additivity in the data without over-fitting.

A full Bayesian specification would require placing prior distributions on both $\phi$ and $\kappa$. However, such a specification requires tedious posterior sampling algorithms to sample from the posterior distribution, as seen in [

Conditional on $f_1, \cdots, f_k$, (1) is linear in $\phi_l$, and $\phi_l > 0$. Hence we impose $L_1$ regularization on $\phi_l$ by minimizing

$$\frac{1}{n} \sum_{i=1}^{n} \Big\{ y(x_i) - \sum_{l=1}^{k} \phi_l f_l(x_i) \Big\}^2 + \lambda \sum_{l=1}^{k} \phi_l. \tag{3}$$

In the algorithm, $\phi_l$ is updated using the least absolute shrinkage and selection operator (LASSO) [

The proposed model contains the number of components, $k$, which determines how many components are used to fit the data and build the prediction. We propose using LASSO to choose $k$. First, we start with a large value of $k$. As $\phi_j$ is updated with the LASSO algorithm, unnecessary Gaussian processes $f_l$ are pruned. The value of $k$ is then updated to the number of components which are not pruned.
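One way to realize this pruning step is a non-negative lasso fit, sketched here with scikit-learn. The helper name and penalty value are hypothetical, and the paper's own update uses least angle regression rather than this solver; this is only an illustration of the mechanism.

```python
import numpy as np
from sklearn.linear_model import Lasso

def update_phi(y, F_mat, lam):
    """Update phi by minimizing Eq. (3): non-negative L1-penalized
    least squares, where column l of F_mat holds f_l(x_1), ..., f_l(x_n).
    Components whose phi_l is shrunk to zero are pruned, and k is
    reset to the number of surviving components."""
    fit = Lasso(alpha=lam, positive=True, fit_intercept=False)
    fit.fit(F_mat, y)
    phi = fit.coef_
    active = np.flatnonzero(phi > 0)   # indices of components kept
    return phi, active, len(active)    # len(active) is the updated k
```

Starting from a deliberately large $k$, repeated calls to such an update shrink the coefficients of redundant components to exactly zero, which is how the number of additive components is selected.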

The parameters $\kappa_{lj}$ control the effective number of variables within each component. For each $l$, $\{\kappa_{lj}, j = 1, \cdots, p\}$ are assumed to be sparse. As opposed to the two-component mixture prior on $\kappa_{lj}$ in [

$$\kappa_{lj} \sim N(0, \psi_{lj} \tau_l), \quad \tau_l \sim f_g, \quad \psi_{lj} \sim f_l, \tag{4}$$

for each fixed $l$, where $f_g$ and $f_l$ are densities on the positive real line. In (4), $\tau_l$ controls global shrinkage towards the origin while the local parameters $\{\psi_{lj}, j = 1, \cdots, p\}$ allow local deviations in the degree of shrinkage for each predictor. Special cases include the Bayesian lasso [
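For instance, taking both $f_g$ and $f_l$ to be half-Cauchy densities (a horseshoe-type choice; this particular choice is our assumption, and other positive densities fit the same template), one draw from the prior (4) for a single component can be sketched as:

```python
import numpy as np

def draw_kappa_prior(p, rng):
    """One draw of (kappa_l1..kappa_lp, tau_l, psi_l1..psi_lp) from the
    global-local prior (4), with f_g and f_l taken to be half-Cauchy
    densities (a horseshoe-type special case; an assumption here)."""
    tau = abs(rng.standard_cauchy())              # global shrinkage tau_l
    psi = np.abs(rng.standard_cauchy(size=p))     # local scales psi_lj
    kappa = rng.normal(0.0, np.sqrt(psi * tau))   # kappa_lj ~ N(0, psi_lj * tau_l)
    return kappa, tau, psi
```

The heavy tails of the local scales let a few $\kappa_{lj}$ escape the global shrinkage, which is exactly the "local deviations" behavior described above.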

In this section, we develop a fast algorithm, a combination of $L_1$ optimization and conditional MCMC, to estimate the parameters $\phi_l$, $\psi_{lj}$, and $\tau_l$ for $l = 1, \cdots, k$ and $j = 1, \cdots, p$. Conditional on $\kappa_{lj}$, (1) is linear in $\phi_l$ and hence we resort to the least angle regression procedure [

The least angle regression procedure is an efficient approach which exploits the special structure of the lasso problem and provides an efficient way to compute the solutions. Next, we describe the conditional MCMC to sample from $\kappa_{lj}$ and $F(x^*)$ at a new point $x^*$ conditional on the parameters $\phi_l$. For two collections of vectors $X_v$ and $Y_v$ of sizes $m_1$ and $m_2$ respectively, denote by $c(X_v, Y_v)$ the $m_1 \times m_2$ matrix $\{c(x, y)\}_{x \in X_v, y \in Y_v}$. Let $X = \{x_1, x_2, \cdots, x_n\}$ and let $c(X, X)$, $c(x^*, X)$, $c(X, x^*)$ and $c(x^*, x^*)$ denote the corresponding matrices. For a random variable $q$, we denote by $q \mid -$ the conditional distribution of $q$ given the remaining random variables.

Observe that the algorithm does not necessarily produce samples which are approximately distributed as the true posterior distribution. The combination of optimization and conditional sampling is similar to stochastic EM [

1) Compute the kernel matrices $c(X, X)$, $c(X, x^*)$, $c(x^*, X)$ and $c(x^*, x^*)$ with the kernel formula $c(x, x') = \exp(-\gamma_{dj} \| x - x' \|^2)$.

2) Compute $f_{-l}(x_i) = \sum_{j \neq l} \phi_j f_j(x_i)$. Compute the predictive mean

$$\mu_l^* = c(x^*, X) \left[ c(X, X) + \sigma^2 I \right]^{-1} (y - f_{-l}) \tag{5}$$

3) Compute the predictive variance

$$\Sigma_l^* = c(x^*, x^*) - c(x^*, X) \left[ c(X, X) + \sigma^2 I \right]^{-1} c(X, x^*). \tag{6}$$

4) Sample $f_l \mid -, y \sim N(\mu_l^*, \Sigma_l^*)$.

5) Compute the prediction

$$F(x^*) = \phi_1 f_1^* + \phi_2 f_2^* + \cdots + \phi_k f_k^*. \tag{7}$$

6) Update $\psi_{lj}$ by sampling from the following posterior distribution $p(\psi_{lj} \mid -, y)$:

$$p(\psi_{lj} \mid -, y) \propto \frac{\exp\big\{ -\tfrac{1}{2} y^T \left[ c(X,X) + \sigma^2 I \right]^{-1} y \big\}}{\left| c(X,X) + \sigma^2 I \right|^{1/2}} \, p(\psi_{lj}). \tag{8}$$

7) Update $\tau_l$, $l = 1, \cdots, k$, by sampling from the following posterior distribution $p(\tau_l \mid -, y)$:

$$p(\tau_l \mid -, y) \propto \frac{\exp\big\{ -\tfrac{1}{2} y^T \left[ c(X,X) + \sigma^2 I \right]^{-1} y \big\}}{\left| c(X,X) + \sigma^2 I \right|^{1/2}} \, p(\tau_l). \tag{9}$$

8) Update $\gamma_{dj}$ using the formula $\gamma_{dj} = \tau_j \psi_{dj}$.

9) Update the vector Γ with the LASSO estimation.

10) Update $\kappa_{lj}$ by sampling

$$\kappa_{lj} \sim N(0, \psi_{lj} \tau_l). \tag{10}$$

11) Update $\phi_j$ and prune unnecessary $f_j$, $j \neq l$, with the LASSO algorithm.

The MCMC algorithm above is illustrated with the following flow-chart.
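Steps 2) - 4) amount to a standard Gaussian-process conditional update. A minimal NumPy sketch follows; the function and variable names are ours, and a small jitter term is added for numerical stability:

```python
import numpy as np

def conditional_component_draw(C_XX, C_sX, C_ss, y, f_minus_l, sigma2, rng):
    """Steps 2)-4) for component l: predictive mean (5), predictive
    variance (6), then a Gaussian draw of f_l at the new points.

    C_XX = c(X, X) is (n, n); C_sX = c(x*, X) is (m, n);
    C_ss = c(x*, x*) is (m, m); f_minus_l holds
    sum_{j != l} phi_j f_j(x_i) at the training points."""
    A = C_XX + sigma2 * np.eye(len(y))
    resid = y - f_minus_l                               # y - f_{-l}
    mu = C_sX @ np.linalg.solve(A, resid)               # Eq. (5)
    Sigma = C_ss - C_sX @ np.linalg.solve(A, C_sX.T)    # Eq. (6)
    f_l = rng.multivariate_normal(mu, Sigma + 1e-8 * np.eye(len(mu)))
    return mu, Sigma, f_l
```

Each component is thus updated against the residual left by the other components, a backfitting-style structure that keeps every conditional draw an ordinary GP regression.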

In the MCMC algorithm above, the conditional distributions of $\tau_l$ and $\psi_{lj}$ are not available in closed form. Therefore, we sample them using the Metropolis-Hastings algorithm [

1) Propose $\log \tau_l^* \sim N(\log \tau_l^t, \sigma_\tau^2)$.

2) Compute the Metropolis ratio:

$$p = \min\left[ \frac{p(\tau_l^* \mid -)}{p(\tau_l^t \mid -)}, 1 \right] \tag{11}$$

3) Sample $u \sim U(0, 1)$. If $u < p$ then $\log \tau_l^{t+1} = \log \tau_l^*$; else $\log \tau_l^{t+1} = \log \tau_l^t$.

The flowchart for the above Metropolis-Hastings algorithm is as follows:

The proposal variance $\sigma_\tau^2$ is tuned to ensure that the acceptance probability is between 20% and 40%. We use a similar Metropolis-Hastings algorithm to sample from the conditional distribution of $\psi_{lj} \mid -$.
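The three steps above can be sketched as a single random-walk update. The function name is ours, `log_post` stands for any caller-supplied unnormalized log conditional, and the acceptance ratio follows Eq. (11) exactly as stated:

```python
import numpy as np

def mh_update_log_tau(tau_t, log_post, sigma_tau, rng):
    """One random-walk Metropolis-Hastings step for tau_l on the log
    scale, following steps 1)-3): propose log tau* ~ N(log tau_t,
    sigma_tau^2), then accept with probability min(ratio, 1) where
    the ratio is that of Eq. (11), computed in log space."""
    tau_star = np.exp(rng.normal(np.log(tau_t), sigma_tau))   # step 1
    log_ratio = log_post(tau_star) - log_post(tau_t)          # step 2
    if np.log(rng.uniform()) < min(log_ratio, 0.0):           # step 3
        return tau_star, True
    return tau_t, False
```

In practice $\sigma_\tau$ is adjusted until the empirical acceptance rate over many such steps lands in the 20% - 40% band mentioned above.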

In this section, we first state a generic algorithm to select important variables based on the samples of the parameter vector $\gamma$. This algorithm is independent of the prior on $\gamma$ and, unlike other variable selection algorithms, requires few tuning parameters, making it suitable for practical purposes. The idea is to find the most probable set of variables from the median of the $\gamma$ samples. Since the distribution of the number of important variables is more stable and largely unaffected by the Metropolis-Hastings algorithm, we find the mode $H$ of the distribution of the number of important variables. Then, we select the $H$ largest coefficients from the posterior median of $\gamma$.

In this algorithm, we use the k-means algorithm [ ] with two clusters. At the $t^{th}$ iteration, the number of non-zero signals $h(t)$ is estimated by the smaller cluster size out of the two clusters. We take the mode over all the iterations to obtain the final estimate $H$ of the number of non-zero signals, i.e. $H = \mathrm{mode}(h(t))$. The $H$ largest entries of the posterior median of $|\gamma|$ are identified as the non-zero signals.
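A minimal version of this estimate, using a hand-rolled two-cluster 1-d k-means so the sketch stays dependency-free (the function names are ours):

```python
import numpy as np

def count_signals(gamma_abs):
    """Two-cluster k-means on |gamma| for one iteration: h(t) is the
    size of the smaller cluster, as described in the text."""
    centers = np.array([gamma_abs.min(), gamma_abs.max()], dtype=float)
    labels = np.zeros(len(gamma_abs), dtype=int)
    for _ in range(50):  # Lloyd iterations
        labels = np.abs(gamma_abs[:, None] - centers[None, :]).argmin(axis=1)
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = gamma_abs[labels == c].mean()
    return int(np.bincount(labels, minlength=2).min())

def estimate_H(gamma_samples):
    """H = mode of h(t) over the retained MCMC iterations; the H
    largest entries of the posterior median of |gamma| are then
    declared non-zero."""
    hs = [count_signals(np.abs(g)) for g in gamma_samples]
    vals, counts = np.unique(hs, return_counts=True)
    return int(vals[counts.argmax()])
```

Because the cluster-size statistic $h(t)$ is integer-valued and stable across iterations, its mode is far less sensitive to Monte Carlo noise than any single posterior draw of $\gamma$.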

We run the algorithm for 5000 iterations with a burn-in of 2000 to ensure convergence. Based on the remaining iterates, we apply the algorithm to $\kappa_{jl}$ for each component $f_l$ to select important variables within each $f_l$ for $l = 1, \cdots, k$. Using this approach, we select the important variables within each function. We define the inclusion score of a variable as the proportion of functions (out of $k$) which contain that variable. Next, we apply the algorithm to $\phi$ and select the important functions. Let us denote by $A_f$ the set of active functions, obtained from the LASSO algorithm as discussed in Section 3.2. The interaction score between a pair of selected variables is defined as the proportion of functions within $A_f$ in which the selected pair appears together. Using these interaction scores, we can find the interactions between important variables with an optimal number of active components. Observe that the inclusion and interaction scores are not functionals of the posterior distribution and are purely a property of the additive representation. Hence, we do not require the sampler to converge to the posterior distribution. As illustrated in Section 6, these inclusion and interaction scores provide an excellent indication of a variable or an interaction being present or absent in the model. An illustration of both variable selection and interaction recovery is displayed in Section 6.
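Concretely, given the per-component variable selections and the active set $A_f$, the two scores can be computed as follows (an illustrative sketch; the data structures and function name are our choices):

```python
import numpy as np

def inclusion_interaction_scores(selected, active, p):
    """selected[l] is the set of variable indices chosen within
    component f_l (l = 0..k-1); active lists the indices of A_f.
    Inclusion score of j: proportion of all k functions containing j.
    Interaction score of (i, j): proportion of functions in A_f in
    which i and j appear together."""
    k = len(selected)
    incl = np.array([sum(j in s for s in selected) / k for j in range(p)])
    inter = np.zeros((p, p))
    for l in active:
        for i in selected[l]:
            for j in selected[l]:
                if i != j:
                    inter[i, j] += 1.0 / len(active)
    return incl, inter
```

Both scores are simple counting functionals of the fitted additive representation, which is why no convergence of the sampler to the posterior is needed to interpret them.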

In this section, we consider eight different simulation settings with 50 replicated datasets each and test the performance of our algorithm with respect to variable selection, interaction recovery, and prediction. To generate the simulated data, we draw $x_{ij} \sim \mathrm{Unif}(0, 1)$ and $y_i \sim N(f(x_i), \sigma^2)$, where $1 \le i \le n$, $1 \le j \le p$ and $\sigma^2 = 0.02$.

We compute the Inclusion score for each variable in each simulated dataset and then provide the bar plots in Figures 1-4 below.

| Simulated Dataset | n | p | Equation (Non-interaction Data) | SNR |
|---|---|---|---|---|
| 1 | 100 | 10 | | 37.3274 |
| 2 | 100 | 100 | | 36.9188 |
| 3 | 100 | 20 | | 41.1118 |
| 4 | 100 | 100 | | 41.6303 |

| Simulated Dataset | n | p | Equation (Interaction Data) | SNR |
|---|---|---|---|---|
| 1 | 100 | 10 | | 41.9095 |
| 2 | 100 | 100 | | 42.1258 |
| 3 | 100 | 20 | | 43.0888 |
| 4 | 100 | 100 | | 44.4024 |

From these histograms, we rank the Inclusion score values. Based on our ranking, we select a threshold value to identify the signals among the top Inclusion score values. From our ranking, the selected threshold value is 0.1. Ranking-based variable selection has been discussed in Guyon and Elisseeff [

Based on the results in

In order to capture the interaction network, we compute the probability of interaction between two variables by calculating the proportion of functions in which both the variables jointly appear. Since we are interested in capturing the interaction between selected variables, we plot interaction heat map for selected variables with their probability of interaction values, for each dataset for both the non-interaction and interaction cases.

Based on Figures 5-8, it is evident that the estimated interaction probabilities for the non-interacting variables are smaller than the corresponding numbers for the interacting variables. With these heat map values, we plot the interaction

| Dataset | Non-interaction FPR | Non-interaction FNR | Interaction FPR | Interaction FNR |
|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.05 |
| 4 | 0.0 | 0.01 | 0.0 | 0.01 |

network in

Based on the interaction network in

We randomly partition each dataset into training (50%) and test (50%) observations. We apply our algorithm to the training data and compare the performance on the test dataset. For the sake of brevity, we plot the predicted vs. the observed test observations only for a few cases in

From

Bayesian Additive Regression Tree (BART; [

Since BART is well-known to deliver excellent prediction results, its performance in terms of variable selection and interaction recovery in a high-dimensional setting is worth investigating. In this section, we compare our method with BART in all three aspects: variable selection, interaction recovery and predictive performance. For comparison with BART, we used the same simulation settings as in

We used 50 replicated datasets and computed the average inclusion probabilities for each variable. Similar to Section 6.1, we ranked the Inclusion scores and chose a threshold value of 0.1 to find the selected variables. Then, we computed the false positive and false negative rates for both algorithms as in

In

According to

| Dataset | p | n | Ours: Non-int. FPR | Ours: Non-int. FNR | Ours: Int. FPR | Ours: Int. FNR | BART: Non-int. FPR | BART: Non-int. FNR | BART: Int. FPR | BART: Int. FNR |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10 | 100 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100 | 100 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 20 | 100 | 0.0 | 0.05 | 0.0 | 0.05 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 100 | 100 | 0.0 | 0.01 | 0.0 | 0.01 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 150 | 100 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 150 | 100 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 200 | 100 | 0.01 | 0.0 | 0.0 | 0.0 | NA | NA | NA | NA |
| 4 | 200 | 100 | 0.0 | 0.0 | 0.0 | 0.0 | NA | NA | NA | NA |

as n, BART fails to run, while our algorithm still provides excellent results in variable selection. Overall, our algorithm performs significantly better than BART in terms of variable selection.

In this section, we demonstrate the performance of our method on two real data sets. We use the Boston housing and concrete slump test datasets obtained from the UCI machine learning repository. Both datasets have been used extensively in the literature.

In this section, we used the Boston housing data to compare the performance between BART and our algorithm. The Boston housing data [

MEDV is chosen as the response and the remaining variables are included as predictors. We ran our algorithm for 5000 iterations; the prediction results for both algorithms are shown in

Although our algorithm has a prediction error comparable with BART, we argue below that it gives a more convincing result in terms of variable selection. We display the Inclusion score barplot in

| Variable | Abbreviation | Description |
|---|---|---|
| 1 | CRIM | Per capita crime rate |
| 2 | ZN | Proportion of residential land zoned for lots over 25,000 square feet |
| 3 | INDUS | Proportion of non-retail business acres per town |
| 4 | CHAS | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| 5 | NOX | Nitric oxides concentration (parts per 10 million) |
| 6 | RM | Average number of rooms per dwelling |
| 7 | AGE | Proportion of owner-occupied units built prior to 1940 |
| 8 | DIS | Weighted distances to five Boston employment centers |
| 9 | RAD | Index of accessibility to radial highways |
| 10 | TAX | Full-value property-tax rate per $10,000 |
| 11 | PTRATIO | Pupil-teacher ratio by town |
| 12 | B | |
| 13 | LSTAT | Percentage of lower status of the population |
| 14 | MEDV | Median value of owner-occupied homes in $1000’s |

Based on the histograms, we chose a threshold value of 0.1 to easily compare BART and our algorithm. From the ranking and the chosen threshold value, BART selected only NOX and RM, while our algorithm selected CRIM, ZN, NOX, DIS, B and LSTAT. In order to compare the performance, we looked at Savitsky et al. [

In this section we consider an engineering application to compare our algorithm against BART. The concrete slump test dataset records the results of two tests executed on concrete to study its behavior [

The first test is the concrete slump test, which measures concrete’s plasticity. Since concrete is a composite material with a mixture of water, sand, rocks and cement, the first test determines whether the change in the ingredients of concrete is consistent. The first test records the change in the slump height and the flow of water. If there is a change in slump height, the flow must be adjusted to keep the ingredients in the concrete homogeneous to preserve structural integrity. The second test is the “Compressive Strength Test”, which measures the capacity of concrete to withstand axially directed pushing forces. The second test records the compressive pressure on the concrete.

The concrete slump test dataset has 103 instances. The data is split into 53 instances for training and 50 instances for testing. There are seven continuous input variables, the seven ingredients used to make concrete, and three outputs: slump height, flow height and compressive pressure. Here we only consider the slump height as the output. The description of each variable and output is summarized in

The predictive performance is illustrated in

Similar to the Boston housing dataset, our algorithm performs close to BART in prediction. Next, we investigated the performance in terms of variable selection. We plotted the bar-plot of the Inclusion score for each variable in

Yurugi et al. [

| Variable | Ingredient / Output | Unit |
|---|---|---|
| 1 | Cement | kg |
| 2 | Slag | kg |
| 3 | Fly ash | kg |
| 4 | Water | kg |
| 5 | Super-plasticizer (SP) | kg |
| 6 | Coarse Aggregation | kg |
| 7 | Fine Aggregation | kg |
| 8 | Slump | cm |
| 9 | Flow | cm |
| 10 | 28-day Compressive Strength | MPa |

measure the plasticity of concrete, coarse aggregation is a critical variable in the concrete slump test. According to

In this section we consider a dataset, which has more than 100 predictors to compare our algorithm against BART. Therefore, we chose the Community and Crime dataset. This dataset describes the socio-economic, law enforcement, and crime data in communities of the United States in 1995 [

In this data, there are about 124 predictors, 5 non-predictors, and 18 response values. The details for each response value can be found at University of California, Irvine (UCI) Machine Learning Database [

Since BART and our algorithm have different Inclusion score scales, we cannot pick a common threshold value to identify variables for comparison. Since our algorithm selects only 10 predictors, we decided to rank the predictors in BART by their Inclusion score. Then, we chose BART’s top 10 predictors with the highest Inclusion scores to compare with ours.

According to Blumstein and Rosenfeld [

| Variable | Our Algorithm | BART |
|---|---|---|
| 1 | % household with social security income | % African-American |
| 2 | % Mom and kids under labor force | income per capita for Asian heritage |
| 3 | % immigrants in the last 8 years | % employed in manufacturing |
| 4 | % immigrants in the last 3 years | % kids in two parents family |
| 5 | % housing occupied | % of working mom |
| 6 | % vacant housing more than 6 months | % kids in unmarried families |
| 7 | Number of housing occupied in upper quantile | % immigrants in the last 8 years |
| 8 | Number of sworn full time police officer | number of unit house built |
| 9 | Number of sworn police officer in operation | number of housing without plumbing facilities |
| 10 | Total request for police per police officer | % people living in the same city since 1985 |

is economic condition: variables 1, 2, 5, 6, and 7. The second category is demographic change: variables 3 and 4. The third category is policing: variables 8, 9, and 10. Similarly, the selected variables in BART can be grouped into three categories. The first category is economic condition: variables 3, 2, 5 and 8. The second category is demographic change: variables 1 and 7. The third category is socialization and social service: variable 6. Based on this grouping, one can see that our selected variables are more agreeable with the study of Blumstein and Rosenfeld [

In this paper, we propose a novel Bayesian nonparametric approach for variable selection and interaction recovery, with excellent performance in both simulated and real datasets. Our method obviates the computational bottleneck in recent unpublished work [

Although such sparse additive models are well known to adapt to the underlying true dimension of the covariates [

Vo, G. and Pati, D. (2017) Sparse Additive Gaussian Process with Soft Interactions. Open Journal of Statistics, 7, 567-588. https://doi.org/10.4236/ojs.2017.74039