Hierarchical Penalized Mixed Model

Penalized splines have been a popular method for estimating an unknown function in nonparametric regression because their low-rank spline bases make computation tractable. However, their performance is poor when estimating functions that vary rapidly in some regions and are smooth in others. This is due to the use of a global smoothing parameter, which applies a constant amount of smoothing across the whole function. To make the spline spatially adaptive, we introduce hierarchical penalized splines, obtained by modelling the global smoothing parameter as another spline.

KEYWORDS

1. Introduction

Nonparametric smoothing involves letting the data determine the amount of smoothing. Classical smoothing splines use a global smoothing parameter to control the amount of smoothing in a function. When homogeneity of smoothness cannot reasonably be assumed across the whole domain of the function, a natural extension is to allow the smoothing parameter to vary over the domain as a penalty function of the independent variable, adapting to changes in roughness. Adaptive smoothing has been an active topic in statistics; it involves allowing the smoothing parameter, the bandwidth or the placement of knots to vary across the domain, adapting to the change of roughness. In penalized regression splines, the penalty function has been modelled by linear interpolation on the logarithmic scale, by a fully Bayesian approach using Markov chain Monte Carlo for computation, and by a fast and simple algorithm for the Bayesian P-spline based on a Laplace approximation to the marginal likelihood. Modelling the smoothing parameter as a penalty function of the independent variable can also achieve adaptiveness. This involves formulating adaptive smoothing as a minimization problem with a new penalty function, in which the estimate has the same form as the smoothing spline, so that methods developed for classical smoothing splines can be used. Reproducing kernels have been derived for a generic penalty function, with the suggestion of modelling it by B-splines; the penalized least-squares estimate in which the smoothing parameter varies across the domain has been studied under the reproducing kernel Hilbert space approach; and the penalty function has been modelled by a step function with data-driven segmentation, estimated by maximizing the generalized likelihood, with a complexity penalty added to the generalized likelihood for selecting the best step function from a collection of candidates.
This last approach is computationally expensive, owing to the large number of candidate models and the proposed search algorithm, and thus has a serious limitation. In this research we aim at developing a hierarchical penalty model using P-splines that yields more adaptive smoothing.

2. Modeling Approach

2.1. Penalized Splines

P-splines are low-rank spline smoothers with a penalty to avoid undersmoothing. They are typically not spatially adaptive and hence have trouble when functions vary rapidly. Regression splines approximate functions using a small number of basis functions. These splines can lack smoothness, and various strategies have been proposed to attain it; for example, penalized regression splines achieve smoothness by penalizing the sum of squares or the likelihood through a single penalty parameter. The penalty parameter and the P-spline fit are easy to compute using mixed model technology and are not sensitive to the choice of knots. A penalized spline can be seen as a compromise between smoothing and regression splines, combining the attractive attributes of both. It is essentially a regression spline in which the penalty is applied directly to the coefficients of the piecewise polynomial. One can therefore retain a large number of knots and constrain their effect through a penalty to avoid overfitting. The number of knots defining the spline is larger than the data would justify but smaller than the number of observations; such smoothers are thus referred to as low-rank, which significantly reduces the numerical effort. The degree of overfitting is controlled by a roughness penalty on the curve, most commonly based on the integral of a squared derivative of the spline. To avoid the drawbacks of regression splines and optimize the fit, one can choose a large number of knots, e.g. min(n/4, 40) as has been suggested, and prevent overfitting by penalizing the spline coefficients. That is, one finds

$\underset{\beta ,d}{\mathrm{min}}{‖Y-X\beta -Sd‖}^{2}$ (1)

subject to ${‖d‖}^{2}\le a$ for a nonnegative constant a, where Y is the response vector, β and d are the fixed and random effects vectors, and X and S are the design matrices associated with the fixed and random effects vectors. Using a Lagrange multiplier, this minimization can be written as

$\underset{\beta ,d}{\mathrm{min}}{‖Y-X\beta -Sd‖}^{2}+\omega {d}^{\text{T}}d=\underset{\beta ,d}{\mathrm{min}}{‖y-C\theta ‖}^{2}+\omega {\theta }^{\text{T}}D\theta$ (2)

with $C=\left[X,S\right]$, $\theta ={\left({\beta }^{\text{T}},{d}^{\text{T}}\right)}^{\text{T}}$, $D=\text{blockdiag}\left\{{0}_{\left(p+1\right)×\left(p+1\right)},{I}_{K}\right\}$ for K knots, and $\omega \ge 0.$

The resulting estimate is given by

$\stackrel{^}{y}=C{\left({C}^{\text{T}}C+\omega D\right)}^{-1}{C}^{\text{T}}y$ (3)

The smoothness of this estimate varies continuously as a function of the global smoothing parameter ω: the larger the value of ω, the more the fit shrinks towards a polynomial fit, while small values of ω result in an overfitted estimate. The penalized spline can be seen as a generalization of the smoothing spline with a more flexible choice of bases, penalties and knots: one chooses a spline basis on a sufficiently large number of knots and penalizes unnecessary structure. This spline possesses a number of good properties. It shows no boundary effect, as many kernel smoothers do, i.e. the spreading of a fitted curve or density outside the domain of the data, generally accompanied by bending towards zero. It is a straightforward extension of (generalized) linear regression models. It conserves moments (means, variances) of the data, i.e. given a P-spline of degree q + 1 and a penalty of order q + 1 or higher,

${\sum }_{j=1}^{k}{x}_{j}^{q}{y}_{j}={\sum }_{j=1}^{k}{x}_{j}^{q}{\stackrel{^}{y}}_{j}$ (4)

for all values of the smoothing parameter ω, where ${\stackrel{^}{y}}_{j}$ are the fitted values. This property is very useful in density smoothing, where the mean and variance of the estimated density are the same as those of the data for any amount of smoothing. The P-spline also has a polynomial curve fit as its limit: for a penalty of order q and large values of the smoothing parameter ω, the fitted function approaches a polynomial of degree $q-1$, provided the degree of the P-spline is q or higher. Finally, the computations, including those for cross validation, are relatively cheap and can easily be incorporated into standard software.
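The penalized fit of Equations (1)-(3) can be sketched numerically. The following is a minimal NumPy illustration, not code from the paper: the function name, the equally spaced interior knots and the default values are all illustrative choices. It builds the combined design matrix $C=[X,S]$ from a truncated power basis and applies the ridge-type estimator of Equation (3).

```python
import numpy as np

def penalized_spline_fit(x, y, q=2, num_knots=20, omega=1.0):
    """Penalized truncated-polynomial spline fit (Equation (3)).

    Builds C = [X | S], with X the polynomial part and S the truncated
    power basis, then returns y_hat = C (C'C + omega D)^{-1} C' y,
    where D penalizes only the spline coefficients d.
    """
    x = np.asarray(x, dtype=float)
    # equally spaced interior knots over the range of the x's
    knots = np.linspace(x.min(), x.max(), num_knots + 2)[1:-1]
    X = np.vander(x, q + 1, increasing=True)                # 1, x, ..., x^q
    S = np.maximum(x[:, None] - knots[None, :], 0.0) ** q   # (x - tau_k)_+^q
    C = np.hstack([X, S])
    # D = blockdiag{0_(q+1), I_K}: only the K spline coefficients are shrunk
    D = np.diag(np.r_[np.zeros(q + 1), np.ones(len(knots))])
    theta = np.linalg.solve(C.T @ C + omega * D, C.T @ y)
    return C @ theta
```

As ω grows, the spline coefficients are shrunk towards zero and the fit approaches the ordinary degree-q polynomial least-squares fit, in line with the limiting behaviour described above.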

2.2. Mixed Models

Mixed models are regression models with both fixed effects and random effects. They correspond to a hierarchy of levels, with repeated, correlated measurements occurring among all the lower-level units within each particular upper-level unit. The standard linear mixed model has the form

$Y=X\beta +Sd+\epsilon$ (5)

where Y is a vector of observed responses, β is an unknown vector of fixed effects, d is an unknown vector of random (subject-specific) effects with mean zero and covariance matrix W, X and S are the design matrices associated with the fixed effects vector β and the random effects vector d respectively, and ε is a vector of residual errors with zero mean and covariance matrix P. The dimensions of the design matrices X and S must conform to the length of the observation vector Y and to the numbers of fixed and random effects respectively. It is generally assumed that the elements of d are uncorrelated with the elements of ε, in which case the joint covariance matrix of the random effects and residual errors is block diagonal:

$var\left(\begin{array}{c}d\\ \epsilon \end{array}\right)=\left[\begin{array}{cc}W& 0\\ 0& P\end{array}\right]$ (6)

The matrices S and W will themselves be block diagonal if the data arise from a hierarchical structure, where a fixed number of random effects common to observations within a single higher-level unit are assumed to vary across the units at a given level of the hierarchy. Typically the residual errors are taken to be independent and identically distributed, so that $P={\sigma }_{\epsilon }^{2}I$, where ${\sigma }_{\epsilon }^{2}$ is the residual variance. The covariance matrix W of the random effects vector d is often assumed to have a structure that depends on a series of unknown variance component parameters, which must be estimated in addition to the residual variance ${\sigma }_{\epsilon }^{2}$ and the vector of fixed effects β.

The usual estimators of the fixed and random effects are the best linear unbiased estimator (BLUE) $\stackrel{^}{\beta }$ of β and the best linear unbiased predictor (BLUP) $\stackrel{^}{d}$ of d. These can be recovered as the solution of the mixed model equations,

$\left[\begin{array}{cc}{X}^{\text{T}}{P}^{-1}X& {X}^{\text{T}}{P}^{-1}S\\ {S}^{\text{T}}{P}^{-1}X& {S}^{\text{T}}{P}^{-1}S+{W}^{-1}\end{array}\right]\left(\begin{array}{c}\stackrel{^}{\beta }\\ \stackrel{^}{d}\end{array}\right)=\left[\begin{array}{c}{X}^{\text{T}}{P}^{-1}Y\\ {S}^{\text{T}}{P}^{-1}Y\end{array}\right]$ (7)

A mixed model is of the form,

$Y=X\beta +Sd+\epsilon$

Assuming that d and ε are multivariate normal;

$\left[\begin{array}{c}d\\ \epsilon \end{array}\right]~N\left(\left[\begin{array}{c}0\\ 0\end{array}\right],\left[\begin{array}{cc}W& 0\\ 0& P\end{array}\right]\right)$ (8)

and taking $H=\text{var}\left(Y\right)$, we have $Y~N\left(X\beta ,H\right)$, which results in the density

$f\left(y;\beta ,H\right)={\left(2\pi \right)}^{-\frac{n}{2}}{\left|H\right|}^{-\frac{1}{2}}\mathrm{exp}\left\{-\frac{1}{2}{\left(y-X\beta \right)}^{\text{T}}{H}^{-1}\left(y-X\beta \right)\right\}$

Since all n observations enter through this single multivariate density, the likelihood function becomes

$L\left(\beta ,H;y\right)={\left(2\pi \right)}^{-\frac{n}{2}}{\left|H\right|}^{-\frac{1}{2}}\mathrm{exp}\left\{-\frac{1}{2}{\left(y-X\beta \right)}^{\text{T}}{H}^{-1}\left(y-X\beta \right)\right\}$ (9)

The log-likelihood becomes

$l\left(y;\beta ,H\right)=-\frac{n}{2}\mathrm{log}2\pi -\frac{1}{2}\mathrm{log}\left|H\right|-\frac{1}{2}{\left(y-X\beta \right)}^{\text{T}}{H}^{-1}\left(y-X\beta \right)$ (10)

where $H=var\left(Y\right)=SW{S}^{\text{T}}+P$. Assuming that the parameters defining the covariance matrices W and P are known, the MLE $\stackrel{^}{\beta }$ of β is

$\stackrel{^}{\beta }={\left({X}^{\text{T}}{H}^{-1}X\right)}^{-1}{X}^{\text{T}}{H}^{-1}y$ (11)

which, although not obvious algebraically, must also satisfy the mixed model equations given earlier, since one way of deriving those equations is directly from the multivariate normality assumption. Typically W and P are not known; they can be estimated by substituting the expression for $\stackrel{^}{\beta }$ back into $l\left(\beta ,W,P\right)$ and maximizing the result over the parameters defining W and P. Once estimates of W and P have been determined, we can return to the mixed model equations and determine the BLUP $\stackrel{^}{d}$ of the random effects vector d as the vector that minimizes the expected mean squared error of prediction,
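The profiling step above, substituting $\stackrel{^}{\beta }$ from Equation (11) into the log-likelihood (10), can be sketched for the common special case $W={\sigma }_{\rho }^{2}I$ and $P={\sigma }_{\epsilon }^{2}I$. This is an illustrative NumPy sketch, not the paper's implementation; the function name and parametrization are assumptions.

```python
import numpy as np

def profile_loglik(y, X, S, sigma_rho2, sigma_eps2):
    """Log-likelihood (Equation (10)) evaluated at the MLE of beta
    (Equation (11)), assuming W = sigma_rho2 * I and P = sigma_eps2 * I."""
    n = len(y)
    # H = var(Y) = S W S' + P
    H = sigma_rho2 * (S @ S.T) + sigma_eps2 * np.eye(n)
    Hinv = np.linalg.inv(H)
    # beta_hat = (X' H^{-1} X)^{-1} X' H^{-1} y   (Equation (11))
    beta = np.linalg.solve(X.T @ Hinv @ X, X.T @ Hinv @ y)
    resid = y - X @ beta
    _, logdet = np.linalg.slogdet(H)  # log|H|, numerically stable
    return -0.5 * (n * np.log(2 * np.pi) + logdet + resid @ Hinv @ resid)
```

Maximizing this function over $\left({\sigma }_{\rho }^{2},{\sigma }_{\epsilon }^{2}\right)$, e.g. over a grid, then yields the variance component estimates; the smoothing parameter follows as $\omega ={\sigma }_{\epsilon }^{2}/{\sigma }_{\rho }^{2}$.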

$E\left\{{\left(\stackrel{^}{d}-d\right)}^{\text{T}}\left(\stackrel{^}{d}-d\right)\right\}$ (12)

The BLUP of d can be expressed as the posterior expectation of the random effects given the data, $\stackrel{^}{d}=E\left(d|Y\right)$, which can be evaluated explicitly under the normality assumption to yield

$\stackrel{^}{d}=W{S}^{\text{T}}{H}^{-1}\left(Y-X\stackrel{^}{\beta }\right)$ (13)
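Equations (11) and (13) can be computed directly once W and P are fixed, and the result can be checked against the mixed model equations (7). The sketch below is illustrative (the function name is an assumption), assuming W and P are known and invertible.

```python
import numpy as np

def blue_blup(Y, X, S, W, P):
    """BLUE of beta (Equation (11)) and BLUP of d (Equation (13)),
    assuming the covariance matrices W and P are known."""
    H = S @ W @ S.T + P                                   # var(Y)
    Hinv = np.linalg.inv(H)
    beta = np.linalg.solve(X.T @ Hinv @ X, X.T @ Hinv @ Y)
    d = W @ S.T @ Hinv @ (Y - X @ beta)                   # posterior mean of d
    return beta, d
```

Although it is not obvious algebraically, the pair $\left(\stackrel{^}{\beta },\stackrel{^}{d}\right)$ returned here coincides with the solution of the mixed model equations (7), which provides a useful numerical sanity check.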

2.3. Hierarchical Penalized Mixed Model

Assuming

${Y}_{i}~N\left(m\left({X}_{i}\right),{\sigma }_{i}^{2}\right),i=1,2,\cdots ,n$ (14)

where $m\left({X}_{i}\right)$ is modelled as a truncated polynomial spline

$m\left(X\right)={\beta }_{0}+{\beta }_{1}X+\cdots +{\beta }_{q}{X}^{q}+{\sum }_{k=1}^{r}{\rho }_{k}{\left(X-{\tau }_{k}\right)}_{+}^{q}$ (15)

where ${\tau }_{1},{\tau }_{2},\cdots ,{\tau }_{r}$ are the knots covering the range of x’s and

${\left(X-{\tau }_{k}\right)}_{+}^{q}=\left\{\begin{array}{ll}{\left(X-{\tau }_{k}\right)}^{q},& X>{\tau }_{k}\\ 0,& \text{otherwise}\end{array}$

The knots are placed over the range of the x’s and the number of knots r is chosen generously. In penalized splines the approach is to put a penalty on the coefficients ${\rho }_{k}$. The standard approach is to minimize the sum of squares plus the quadratic penalty $\omega {\rho }^{\text{T}}D\rho$, where ω is the penalty parameter and D is a square penalty matrix. For the truncated polynomial basis, D is an identity matrix and the penalty is $\omega {\rho }^{\text{T}}\rho$; for a B-spline basis the penalty is constructed from differences between neighbouring spline coefficients. An important feature of the penalized spline is its link to the linear mixed model. Due to this link we assume

$\rho ~N\left(0,{\sigma }_{\rho }^{2}{D}^{-1}\right)$

where ρ is a vector of spline coefficients, ${\sigma }_{\rho }^{2}={\sigma }_{\epsilon }^{2}/\omega$ and ${D}^{-1}$ is a generalized inverse of D.

In this approach a single parameter ${\sigma }_{\rho }^{2}$ is used to shrink all the spline coefficients, which can be a limitation when the underlying function is locally varying: it fails to fully capture the features of functions that exhibit strong heterogeneity. One way to avoid this is to allow the coefficients ${\rho }_{1},\cdots ,{\rho }_{r}$ to have knot-specific prior variances, ${\rho }_{k}~N\left(0,{\sigma }_{\rho }^{2}\left({\tau }_{k}\right)\right)$, and to assume that the shrinkage variance process ${\sigma }_{\rho }^{2}\left({\tau }_{k}\right)$ is a smooth function modelled as a log-penalized spline

${\sigma }_{\rho }^{2}\left({\tau }_{k}\right)=\text{exp}\left[{\gamma }_{0}+{\gamma }_{1}{\tau }_{k}+\cdots +{\gamma }_{p}{\tau }_{k}^{p}+{\sum }_{j=1}^{l}{h}_{j}{\left({\tau }_{k}-{\alpha }_{j}\right)}_{+}^{p}\right]$ (16)

where ${\alpha }_{1},{\alpha }_{2},\cdots ,{\alpha }_{l}$ is a second layer of knots covering the range of ${\tau }_{1},{\tau }_{2},\cdots ,{\tau }_{r}$; in practice l is much smaller than r. The hierarchical penalized smoothing model is completed by the shrinkage assumption ${h}_{j}~N\left(0,{\sigma }_{h}^{2}\right),j=1,2,\cdots ,l$, with ${\sigma }_{h}^{2}$ constant.

Thus our hierarchical smoothing model can be written as

$Y|\rho ,h={X}_{\rho }\beta +{S}_{\rho }\rho +\epsilon ,\text{\hspace{0.17em}}\epsilon ~N\left(0,{\sigma }_{\epsilon }^{2}{I}_{n}\right)$

$\rho |h~N\left(0,{\Sigma }_{\rho }\right)$

${\Sigma }_{\rho }=\text{diag}\left\{\mathrm{exp}\left({X}_{h}\gamma +{Z}_{h}h\right)\right\}$

$h~N\left(0,{\sigma }_{h}^{2}{I}_{l}\right)$

where:

$y=\left(\begin{array}{c}{y}_{1}\\ ⋮\\ {y}_{n}\end{array}\right),{X}_{\rho }=\left[\begin{array}{ccc}1& \cdots & {x}_{1}^{q}\\ ⋮& \ddots & ⋮\\ 1& \cdots & {x}_{n}^{q}\end{array}\right],{S}_{\rho }=\left[\begin{array}{ccc}{\left({x}_{1}-{\tau }_{1}\right)}_{+}^{q}& \cdots & {\left({x}_{1}-{\tau }_{r}\right)}_{+}^{q}\\ ⋮& \ddots & ⋮\\ {\left({x}_{n}-{\tau }_{1}\right)}_{+}^{q}& \cdots & {\left({x}_{n}-{\tau }_{r}\right)}_{+}^{q}\end{array}\right]$

${X}_{h}=\left[\begin{array}{ccc}1& \cdots & {\tau }_{1}^{p}\\ ⋮& \ddots & ⋮\\ 1& \cdots & {\tau }_{r}^{p}\end{array}\right],{Z}_{h}=\left[\begin{array}{ccc}{\left({\tau }_{1}-{\alpha }_{1}\right)}_{+}^{p}& \cdots & {\left({\tau }_{1}-{\alpha }_{l}\right)}_{+}^{p}\\ ⋮& \ddots & ⋮\\ {\left({\tau }_{r}-{\alpha }_{1}\right)}_{+}^{p}& \cdots & {\left({\tau }_{r}-{\alpha }_{l}\right)}_{+}^{p}\end{array}\right]$

$\beta ={\left({\beta }_{0},\cdots ,{\beta }_{q}\right)}^{\text{T}},\rho ={\left({\rho }_{1},\cdots ,{\rho }_{r}\right)}^{\text{T}},\gamma ={\left({\gamma }_{0},\cdots ,{\gamma }_{p}\right)}^{\text{T}},h={\left({h}_{1},\cdots ,{h}_{l}\right)}^{\text{T}}$
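The matrices of the hierarchical model can be assembled directly from these definitions. The NumPy sketch below is illustrative (function names and knot vectors are assumptions, not from the paper); it builds ${X}_{\rho }$, ${S}_{\rho }$, ${X}_{h}$ and ${Z}_{h}$ and evaluates the diagonal shrinkage covariance ${\Sigma }_{\rho }$ of Equation (16), with one row of ${X}_{h}$ and ${Z}_{h}$ per first-layer knot ${\tau }_{k}$.

```python
import numpy as np

def trunc_power(u, knots, deg):
    """Truncated power basis: one column (u - knot)_+^deg per knot."""
    return np.maximum(u[:, None] - knots[None, :], 0.0) ** deg

def hierarchical_design(x, tau, alpha, q, p):
    """Design matrices of the hierarchical model:
    X_rho (n x (q+1)), S_rho (n x r), X_h (r x (p+1)), Z_h (r x l),
    where tau are the r first-layer knots and alpha the l coarser
    second-layer knots."""
    X_rho = np.vander(x, q + 1, increasing=True)
    S_rho = trunc_power(x, tau, q)
    X_h = np.vander(tau, p + 1, increasing=True)
    Z_h = trunc_power(tau, alpha, p)
    return X_rho, S_rho, X_h, Z_h

def shrinkage_cov(X_h, Z_h, gamma, h):
    """Sigma_rho = diag{exp(X_h gamma + Z_h h)}: the knot-specific
    shrinkage variances sigma_rho^2(tau_k) of Equation (16)."""
    return np.diag(np.exp(X_h @ gamma + Z_h @ h))
```

With ${\gamma}=0$ and $h=0$ the shrinkage variances reduce to a constant, recovering the non-adaptive penalized spline as a special case.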

3. Results and Conclusion

Penalized splines are very common in nonparametric regression, but they have one major drawback: they are not spatially adaptive, owing to the use of a global smoothing parameter across the whole of a heterogeneous function. In this research we aimed at constructing a spatially adaptive penalized spline by introducing hierarchical splines. This was achieved by modelling the global smoothing parameter ω, normally held constant in classical smoothing, as another spline.

Conflicts of Interest

The authors declare no conflicts of interest.

Cite this paper

Ndung’u, A. , Mwalili, S. and Odongo, L. (2019) Hierarchical Penalized Mixed Model. Open Journal of Statistics, 9, 657-663. doi: 10.4236/ojs.2019.96042. 