A Bayesian Approach for Penalized Splines with Hierarchical Penalty
1. Introduction
Penalized splines have become a popular nonparametric smoothing technique [1] [2], using a global smoothing parameter to control the amount of smoothing in a function. Flexible methods that allow the smoothing parameter to vary over the domain as a penalty function of the independent variables have been proposed [3] [4]. Adaptive smoothing methods have also become increasingly important in statistics; they allow the smoothing parameter, the bandwidth, or the placement of knots to vary across the domain, adapting to changes in roughness [5] - [14]. Modelling the smoothing parameter as a penalty function of the independent variable can also be used to achieve adaptiveness. This involves formulating adaptive smoothing as a minimization problem with a new penalty function, in which the estimate has the same form as the smoothing spline, so that methods developed for classical smoothing splines can be used. [4] derived the reproducing kernels for a generic penalty function and suggested modelling it by B-splines. [3] studied the solution of the penalized least squares estimate in which the smoothing parameter is a varying function across the domain, under the Reproducing Kernel Hilbert Space approach. [15] proposed modelling the penalty function by a step function whose segmentation is data driven and estimating it by maximizing the generalized likelihood; a complexity penalty was added to the generalized likelihood when selecting the best step function from a collection of candidates. That approach is computationally expensive, owing to the large number of candidate models and the proposed search algorithm, and thus has a serious limitation. In this research, we aim at developing a hierarchical penalty model using P-splines, which results in more adaptive smoothing. In penalized regression splines, [2] modelled the penalty function by linear interpolation on the logarithmic scale, [16] modelled the penalty function from a fully Bayesian approach and used Markov chain Monte Carlo for computation, and [17] developed a fast and simple algorithm for the Bayesian P-spline based on a Laplace approximation for the marginal likelihood. To accommodate a flexible class of functions, a number of works have been built on the smoothing spline estimator of the non-parametric distribution function; these include [18] [19] [20]. [4] [21] considered a mixed-effects model with a fixed smoothing parameter.
The study's primary objective was to assess, by simulation, the properties of the mixed model with the penalized hierarchical penalty developed in [22] under a Bayesian approach, and to compare it with the mixed model without the penalized hierarchical penalty. The study's main contribution is therefore an investigation of the fit of the developed model when a hierarchical penalty is added to the mixed model.
The paper is organized as follows: Section 2 describes the methods of the study; Section 3 presents the main results, including the simulation study; and Section 4 presents the conclusion and suggestions for further research.
2. Methods
In applied statistics, data play a central role. However, data can have complex shapes that make parametric modeling unsuitable. Nonparametric modeling provides a solution: the shape of the functional relationship is not fixed in advance but is driven by the data, so the fit can adjust to capture unusual or unexpected features. The penalized spline is a nonparametric regression technique commonly used for scatter plot smoothing. Suppose we consider smoothing a scatter plot where the data are denoted $(x_i, y_i)$ for $i = 1, \ldots, n$. The underlying trend is a function $m$ of the form $m(x) = \mathbf{B}(x)^{\top}\boldsymbol{\beta}$, where $\mathbf{B}(x)$ is the vector of known spline basis functions and $\boldsymbol{\beta}$ is the vector of unknown coefficients. A P-spline is a low-order basis spline with a penalty to avoid under-smoothing. Such splines are typically not spatially adaptive and hence have trouble when the function varies rapidly.
2.1. Bayesian Model
The Bayesian approach allows the use of objective data or subjective opinion in specifying a prior distribution. Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations. It is a process of using observed data to infer properties of the statistical distributions that generated those data. Given data $(x_i, y_i)$, $i = 1, \ldots, n$, where $x_i$ is univariate, our model is defined by

$$y_i = m(x_i) + \varepsilon_i,$$

where $m$ is an unknown function and the $\varepsilon_i$ are independent conditional on the $x_i$ and normally distributed with mean zero and variance $\sigma_{\varepsilon}^2$. To estimate $m$ we use regression P-splines. The basis functions are piecewise polynomial functions whose highest-order derivative takes jumps at fixed "knots":

$$m(x) = \beta_0 + \beta_1 x + \cdots + \beta_q x^q + \sum_{k=1}^{r} b_k (x - \kappa_k)_+^q,$$

where $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_q, b_1, \ldots, b_r)^{\top}$ is a vector of regression coefficients and $\kappa_1 < \cdots < \kappa_r$ are fixed knots.
To model the unknown function $m$ we can use a regression spline of any degree, but here the study used degree 2 for convenience, so that

$$m(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \sum_{k=1}^{r} b_k (x - \kappa_k)_+^2, \quad (1)$$

where $r$, the number of knots, is taken to be large but much smaller than the number of data points $n$. We use knots that are equally spaced sample quantiles of $X$. The number of knots is specified by the user; although the choice is not crucial [2], a minimum number of knots is needed to capture the spatial variability in the data.
Equation (1) can be interpreted as a Bayesian linear model and rewritten as

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad (2)$$

where $\mathbf{y} = (y_1, \ldots, y_n)^{\top}$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, b_1, \ldots, b_r)^{\top}$ is a $(3 + r) \times 1$ vector of regression coefficients, $\boldsymbol{\varepsilon}$ is an $n \times 1$ error vector, and the design matrix $X$ is defined as

$$X = \begin{bmatrix} 1 & x_1 & x_1^2 & (x_1 - \kappa_1)_+^2 & \cdots & (x_1 - \kappa_r)_+^2 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & (x_n - \kappa_1)_+^2 & \cdots & (x_n - \kappa_r)_+^2 \end{bmatrix}.$$
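As an illustration, the following is a minimal Python sketch (not the authors' code) of how the design matrix $X$ in Equation (2) can be built for a quadratic truncated power basis with knots at equally spaced sample quantiles; the function name design_matrix and the default $r = 20$ are our own choices.

# Minimal sketch (not the authors' code): build the design matrix X of
# Equation (2) for a quadratic truncated power basis with r knots placed
# at equally spaced sample quantiles of x.
import numpy as np

def design_matrix(x, r=20, degree=2):
    """Return (X, knots) for the truncated power basis of Equation (2)."""
    # Knots at equally spaced sample quantiles, excluding the extremes.
    probs = np.linspace(0, 1, r + 2)[1:-1]
    knots = np.quantile(x, probs)
    # Polynomial part: 1, x, x^2 (the fixed effects).
    poly = np.vander(x, degree + 1, increasing=True)
    # Truncated power functions (x - kappa_k)_+^2 (the random effects).
    trunc = np.clip(x[:, None] - knots[None, :], 0, None) ** degree
    return np.hstack([poly, trunc]), knots

# Example: n = 100 points, r = 20 knots -> X has 3 + 20 = 23 columns.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100))
X, knots = design_matrix(x)
print(X.shape)  # (100, 23)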
Suppose $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed $N(0, \sigma_{\varepsilon}^2)$. The terms involving $\beta_0$, $\beta_1$ and $\beta_2$ can be considered fixed effects in the model. A normal prior on them with mean 0 and variance 100 was assumed; this effectively acts as a non-informative prior on the fixed effects. The random variables in the vector of spline coefficients are assumed a priori independent and normally distributed,

$$b_k \sim N(0, \sigma_b^2), \quad k = 1, \ldots, r,$$

where $\sigma_b^2$ is the smoothing parameter that shrinks all the coefficients of the spline, hence a global smoothing parameter. Viewed as a function $\sigma_b^2(\kappa_k)$ of the knots, it is constant in $k$, so the smoothing is not spatially adaptive. In order to develop a spatially adaptive technique we need to model $\sigma_b^2(\kappa_k)$ so as to capture the spatial heterogeneity of the data, because different smoothing parameters lead to different amounts of smoothing in different regions.
The study developed a hierarchical model for $\sigma_b^2(\kappa_k)$, where $\sigma_b^2(\cdot)$ is a function evaluated at the knots $\kappa_1, \ldots, \kappa_r$. We take the functional form of $\log \sigma_b^2(\cdot)$ to be another linear regression spline, i.e.

$$\log \sigma_b^2(x) = \gamma_0 + \gamma_1 x + \sum_{l=1}^{t} c_l (x - \tilde{\kappa}_l)_+, \quad (3)$$

which, evaluated at the knots, is rewritten as

$$\log \sigma_b^2(\kappa_k) = \gamma_0 + \gamma_1 \kappa_k + \sum_{l=1}^{t} c_l (\kappa_k - \tilde{\kappa}_l)_+, \quad k = 1, \ldots, r, \quad (4)$$

where $\tilde{\kappa}_1 < \cdots < \tilde{\kappa}_t$ are fixed sub-knots. The number of sub-knots $t$ is user specified and typically less than $r$ (the number of knots in the original spline). The sub-knots $\tilde{\kappa}_l$ are also taken to be equally spaced quantiles of $\kappa_1, \ldots, \kappa_r$. Rewriting Equation (4) as a Bayesian linear model we have

$$\log \boldsymbol{\sigma}_b^2 = Z\boldsymbol{\gamma}, \quad (5)$$

where $\log \boldsymbol{\sigma}_b^2 = \big(\log \sigma_b^2(\kappa_1), \ldots, \log \sigma_b^2(\kappa_r)\big)^{\top}$, $\boldsymbol{\gamma} = (\gamma_0, \gamma_1, c_1, \ldots, c_t)^{\top}$ is a $(2 + t) \times 1$ vector, and $Z$ is the design matrix given by

$$Z = \begin{bmatrix} 1 & \kappa_1 & (\kappa_1 - \tilde{\kappa}_1)_+ & \cdots & (\kappa_1 - \tilde{\kappa}_t)_+ \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \kappa_r & (\kappa_r - \tilde{\kappa}_1)_+ & \cdots & (\kappa_r - \tilde{\kappa}_t)_+ \end{bmatrix}.$$

The random variables in Equation (4) are also assumed a priori independent and normally distributed, i.e.

$$c_l \sim N(0, \sigma_c^2), \quad l = 1, \ldots, t,$$

and the parameters $\gamma_0$ and $\gamma_1$ are also independent and normally distributed with mean zero and large variance.
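The hierarchical penalty can be made concrete with a short sketch. Under the linear-spline form reconstructed in Equations (3)-(5), the matrix $Z$ maps the coefficients $\boldsymbol{\gamma}$ to the knot-specific log variances; hierarchy_design and knot_variances below are hypothetical helper names, and the sketch assumes the sub-knots are quantiles of the knots, as stated above.

# Minimal sketch of the hierarchical penalty, Equations (3)-(5):
# Z maps gamma = (gamma_0, gamma_1, c_1, ..., c_t) to log sigma_b^2(kappa_k).
import numpy as np

def hierarchy_design(knots, t=15):
    """Return (Z, subknots): linear truncated power basis at the r knots."""
    probs = np.linspace(0, 1, t + 2)[1:-1]
    subknots = np.quantile(knots, probs)          # sub-knots: quantiles of the knots
    poly = np.vander(knots, 2, increasing=True)   # columns 1, kappa_k
    trunc = np.clip(knots[:, None] - subknots[None, :], 0, None)  # degree-1 terms
    return np.hstack([poly, trunc]), subknots

def knot_variances(Z, gamma):
    """sigma_b^2(kappa_k) = exp(Z gamma): one prior variance per spline coefficient."""
    return np.exp(Z @ gamma)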
2.2. MCMC Methods
MCMC has become a very important computational tool in Bayesian statistics, since it allows for Monte Carlo approximation of complex posterior distributions where analytical or numerical integration techniques are not applicable. These methods are commonly used to explore the joint posterior distribution. A Markov chain is a sequence of random variables generated from a transition distribution $q(\cdot)$ constructed such that the distribution of the next random variable depends only on the current state of the chain [23]. The aim is to construct a Markov chain that converges to a stationary distribution which is the target posterior distribution. The Markov chain is then simulated until the simulated draws resemble draws from the target distribution.
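To make the idea concrete, here is an illustrative random-walk Metropolis sampler in Python; this is a generic textbook construction, not the sampler used in the paper. The proposal kernel $q(\cdot)$ is a symmetric Gaussian step, so the chain's stationary distribution is the target posterior.

# Illustrative random-walk Metropolis sampler (generic, not the paper's sampler).
import numpy as np

def metropolis(log_post, theta0, n_iter=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    chain = np.empty((n_iter, theta.size))
    lp = log_post(theta)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)  # symmetric q()
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept with prob min(1, ratio)
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain

# Example target: a standard normal posterior.
chain = metropolis(lambda th: -0.5 * np.sum(th**2), theta0=[3.0])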
2.3. Model Diagnostics
The models were compared using the Deviance Information Criterion (DIC). The best fitting model is the one with the smallest DIC, given by

$$\mathrm{DIC} = \bar{D} + p_D,$$

in which $\bar{D}$ is the posterior mean of the deviance, which measures goodness of fit, and $p_D$ gives the effective number of parameters in the model, which penalizes model complexity. Lower values of $\bar{D}$ indicate a better fit, while small values of $p_D$ indicate a parsimonious model. How large the difference in DIC values between two models needs to be before one model can be declared better than the other is not clear-cut. Studies have described a DIC difference of less than 3 between two models as difficult to distinguish, while a DIC difference greater than 3 can be loosely distinguished [24].
3. Main Results
Simulation Study
The model in Equation (1) was fitted using the Bayesian approach, and the simulation was done using WinBUGS. Four chains were run with a burn-in of 1000 iterations and a thinning interval of one step (in order to reduce the auto-correlation of the chain) until a sample of 16,000 was obtained. The mean function $m(\cdot)$ for the model (5) was simplified to

$$m(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \sum_{k=1}^{p} b_k (x - \kappa_k)_+^2, \quad (6)$$

where $\boldsymbol{\beta}$ is the vector of regression coefficients and $\kappa_1 < \cdots < \kappa_p$ are fixed knots. The spline degree was $q = 2$, the number of knots $p = 20$, the numbers of sub-knots $t = 10, 15, 17$, and the sample sizes $N = 50, 100$ and $500$; the prior distributions described in Section 2.1 were assumed for the model parameters. To prevent over-fitting we minimize

$$\sum_{i=1}^{n} \{y_i - m(x_i)\}^2 + \lambda \boldsymbol{\beta}^{\top} P \boldsymbol{\beta}, \quad (7)$$

where $\lambda$ is the smoothing parameter and $P$ is a known positive semi-definite penalty matrix.
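For a fixed smoothing parameter, the minimizer of Equation (7) is available in closed form, $\hat{\boldsymbol{\beta}} = (X^{\top}X + \lambda P)^{-1}X^{\top}\mathbf{y}$. The sketch below assumes, as one common choice, that $P$ is an identity block penalizing only the truncated power coefficients, leaving the fixed effects unpenalized.

# Minimal sketch of minimizing Equation (7) for fixed lambda:
# beta_hat = (X'X + lambda P)^{-1} X'y, with P shrinking only the spline part.
import numpy as np

def penalized_fit(X, y, lam, n_fixed=3):
    """Ridge-type solution; first n_fixed columns (1, x, x^2) are unpenalized."""
    P = np.eye(X.shape[1])
    P[:n_fixed, :n_fixed] = 0.0              # assumed form: no shrinkage on fixed effects
    beta = np.linalg.solve(X.T @ X + lam * P, X.T @ y)
    return beta, X @ beta                    # coefficients and fitted values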
The joint posterior distribution of the parameters for the model is proportional to the likelihood multiplied by the priors:

$$p(\boldsymbol{\beta}, \boldsymbol{\gamma}, \sigma_{\varepsilon}^2 \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma_{\varepsilon}^2)\, p(\boldsymbol{\beta} \mid \boldsymbol{\gamma})\, p(\boldsymbol{\gamma})\, p(\sigma_{\varepsilon}^2).$$

For each model run, 5000 Markov chain Monte Carlo (MCMC) iterations were run, with the initial 1000 discarded as the burn-in period and every tenth sample value kept thereafter. The remaining iterations were used for assessing convergence of the MCMC and for parameter estimation. We assessed MCMC convergence of all model parameters by checking trace plots of the MCMC output.
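As a minimal sketch of this convergence check, the following Python function discards the burn-in, thins to every tenth draw, and draws a trace plot for each parameter; the function name trace_plots is our own.

# Minimal sketch of the convergence check: discard burn-in, thin, plot traces.
import numpy as np
import matplotlib.pyplot as plt

def trace_plots(chain, names, burn_in=1000, thin=10):
    kept = chain[burn_in::thin]
    fig, axes = plt.subplots(kept.shape[1], 1, figsize=(8, 2 * kept.shape[1]))
    for j, ax in enumerate(np.atleast_1d(axes)):
        ax.plot(kept[:, j], lw=0.5)   # a flat, well-mixed band suggests convergence
        ax.set_ylabel(names[j])
    ax.set_xlabel("iteration (after burn-in and thinning)")
    plt.tight_layout()
    plt.show()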
Table 1 presents model diagnostics for the fitted models with and without the hierarchy penalty. For the non-hierarchical model (2) with mean function (6), M1 had the smallest Deviance Information Criterion (DIC = 493.216) compared to M2 (DIC = 963.612) and M3 (DIC = 4785.9) when the sample size was 50; however, as the sample size increases, the DIC becomes larger. M1 provides the better fit. The credible interval for the mean function $m(x)$ at sample size $N = 50$ was notably wide.
For the hierarchical model represented by Equation (5), Table 1 shows that the model with the hierarchy penalty and 15 sub-knots had the lowest DIC (196.209) compared to the models with 10 sub-knots (DIC = 202.321) and 17 sub-knots (DIC = 207.852), indicating that the model with the hierarchy penalty and 15 sub-knots provides a better fit for the data. The credible interval for the hierarchical regression function $m(\cdot)$ was 243.5 - 289.0 (Table 2).
Table 1. DIC for models with and without the hierarchy penalty, with sample sizes 50, 100 and 500.
Table 2. 95% credible intervals for the model mean functions for models with and without the hierarchy penalty, with sample sizes 50, 100 and 500.
Figure 1. Trace plots for model parameters. (a) Beta parameter in the non-hierarchical model; (b) Rho parameter in the non-hierarchical model.
Figure 2. Trace plots for model parameters. (a) Gamma parameter in the hierarchical model; (b) h parameter in the hierarchical model.
Convergence of the Parameters
In this section we checked the convergence of the model parameters by examining their respective trace plots. The parameters $\beta$ and $\rho$ of the model without the hierarchy penalty showed acceptable convergence to the stationary distribution, as shown by the trace plots in Figure 1.
Figure 2 shows that the parameters $\gamma$ and $h$ of the model with the hierarchy penalty also converged acceptably to the stationary distribution.
4. Conclusion and Suggestions
This study used a Bayesian technique to model the penalized spline smoothing regression model with a hierarchical penalty, building on the existing models of Eilers [25] and Ruppert [2]. The model in this study is based on the penalized regression spline, within a semi-parametric modeling paradigm. Many situations arise where the relationship between the response variable and the covariates is non-linear. Owing to the high complexity and intractability of this non-parametric model, maximum likelihood approaches are not viable for estimation; consequently, Bayesian inference using MCMC techniques is strongly favoured. Based on the simulation results, penalized spline smoothing with a hierarchical penalty provides a better fit than penalized spline smoothing without the hierarchy; this was shown by the rapid convergence of the posterior parameters and the lower DIC value of the model. From the simulation results, the study therefore concluded that the hierarchical model with fifteen sub-knots provides the best fit to the data.
Suggestions
Although the proposed model provides a better fit than the penalized spline without the hierarchy penalty, the mean function density becomes rough as the sample size increases; the study therefore proposes further research on the nature of an appropriate smoothing parameter for the model.
Acknowledgements
Sincere thanks to my supervisors Prof. Samuel Mwalili and Prof. Leo Odongo for their professional guidance, and special thanks to my husband and parents for their moral support.