
Penalized splines have been a popular method for estimating an unknown function in nonparametric regression because their low-rank spline bases make computation tractable. However, their performance is poor when estimating functions that vary rapidly in some regions and are smooth in others. This is caused by the use of a global smoothing parameter, which applies a constant amount of smoothing across the whole function. To make the spline spatially adaptive, we introduce hierarchical penalized splines, obtained by modelling the global smoothing parameter as another spline.

Nonparametric smoothing lets the data determine the amount of smoothing. Classical smoothing splines use a global smoothing parameter to control the amount of smoothing of a function. When homogeneity of smoothness cannot reasonably be assumed across the whole domain of the function, a natural extension is to allow the smoothing parameter to vary over the domain as a penalty function of the independent variable, adapting to local changes in roughness [

P-splines are low-rank basis splines with a penalty that guards against undersmoothing. They are typically not spatially adaptive and hence struggle with functions that vary rapidly. Regression splines approximate functions using a small number of basis functions. These splines can lack smoothness, and various strategies have been proposed to attain it, e.g. penalized regression splines [

$$\min_{\beta, d} \| Y - X\beta - Sd \|^2 \quad (1)$$

subject to $\|d\|^2 \le a$ for a non-negative constant $a$, where $Y$ is the response vector, $\beta$ and $d$ are the fixed- and random-effects vectors, and $X$ and $S$ are the design matrices associated with the fixed and random effects respectively. Using a Lagrange multiplier, this minimization can be written as

$$\min_{\beta, d} \| Y - X\beta - Sd \|^2 + \omega\, d^T d = \min_{\theta} \| y - C\theta \|^2 + \omega\, \theta^T D \theta \quad (2)$$

with $\theta = (\beta^T, d^T)^T$, $C = [X\ S]$, $D = \mathrm{blockdiag}\{ 0_{(p+1)\times(p+1)},\, I_K \}$, and $\omega \ge 0$.

The resulting estimate is given by

$$\hat{y} = C ( C^T C + \omega D )^{-1} C^T y \quad (3)$$
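As an illustration, the estimate in Eq. (3) can be computed directly with a truncated polynomial basis. The function below is a minimal sketch; the degree, knot count, smoothing parameter and test signal are illustrative choices, not values from the paper.

```python
import numpy as np

def pspline_fit(x, y, q=3, num_knots=20, omega=1.0):
    """Penalized-spline fit y_hat = C (C'C + omega D)^{-1} C' y,
    where C = [X | S] and D penalizes only the truncated-power coefficients."""
    x = np.asarray(x, dtype=float)
    # interior knots spread over the range of x
    knots = np.quantile(x, np.linspace(0, 1, num_knots + 2)[1:-1])
    X = np.vander(x, q + 1, increasing=True)            # 1, x, ..., x^q
    S = np.maximum(x[:, None] - knots[None, :], 0.0) ** q
    C = np.hstack([X, S])
    # block-diagonal penalty: zero block for the polynomial part, identity for S
    D = np.diag([0.0] * (q + 1) + [1.0] * num_knots)
    theta = np.linalg.solve(C.T @ C + omega * D, C.T @ y)
    return C @ theta, theta

# usage: smooth noisy observations of a sine curve (made-up data)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
y_hat, _ = pspline_fit(x, y, omega=1.0)
```

Larger values of `omega` shrink the fit towards the degree-q polynomial; smaller values track the data more closely.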

The smoothness of this estimate varies continuously as a function of the global smoothing parameter $\omega$. The larger the value of $\omega$, the more the fit shrinks towards a polynomial fit, while small values of $\omega$ produce an overfitted estimate. Penalized splines can be seen as a generalization of smoothing splines with a more flexible choice of bases, penalties and knots: one chooses a spline basis on a sufficiently large number of knots and penalizes unnecessary structure. These splines possess a number of good properties. They show no boundary effect, as many kernel smoothers do, i.e. the spreading of a fitted curve or density outside the domain of the data, generally accompanied by bending towards zero. They are a straightforward extension of (generalized) linear regression models. They also conserve moments (means, variances) of the data: given a p-spline of degree q + 1 with a penalty of order q + 1 or higher,

$$\sum_{j=1}^{k} x_j^q y_j = \sum_{j=1}^{k} x_j^q \hat{y}_j \quad (4)$$

for all values of the smoothing parameter $\omega$, where $\hat{y}_j$ are the fitted values. This property is very useful in density smoothing, where the mean and variance of the estimated density are the same as those of the data for any amount of smoothing. Penalized splines also have polynomial curve fits as their limits: for a penalty of order q and large values of the smoothing parameter $\omega$, the fitted function approaches a polynomial of degree q − 1, provided the degree of the p-spline is q or higher. Finally, the computations, including those for cross-validation, are relatively cheap and can easily be incorporated into standard software [
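The polynomial limit can be checked numerically. With the truncated-power basis and an identity penalty on the spline coefficients only, letting $\omega \to \infty$ forces those coefficients to zero, so the fit collapses to the unpenalized degree-q polynomial least-squares fit. The sketch below uses made-up data and arbitrary settings to illustrate this.

```python
import numpy as np

# Made-up data: noisy observations of exp(x) on [0, 1]
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 100)
y = np.exp(x) + rng.normal(0, 0.1, x.size)

q, num_knots, omega = 2, 15, 1e10       # huge omega: approach the limit
knots = np.linspace(x.min(), x.max(), num_knots + 2)[1:-1]
C = np.hstack([np.vander(x, q + 1, increasing=True),
               np.maximum(x[:, None] - knots, 0.0) ** q])
D = np.diag([0.0] * (q + 1) + [1.0] * num_knots)
y_hat = C @ np.linalg.solve(C.T @ C + omega * D, C.T @ y)

# Plain degree-q polynomial least-squares fit for comparison
poly_fit = np.polyval(np.polyfit(x, y, q), x)
max_dev = np.max(np.abs(y_hat - poly_fit))   # essentially zero
```

Note this basis leaves the polynomial part of degree q unpenalized, so the limit here is a degree-q polynomial; with a difference penalty of order q on a B-spline basis the limit is a polynomial of degree q − 1, as stated above.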

Mixed models are regression models with both fixed effects and random effects. They correspond to a hierarchy of levels, with repeated, correlated measurements occurring among all the lower-level units within each particular upper-level unit. The standard linear mixed model has the form

$$Y = X\beta + Sd + \varepsilon \quad (5)$$

where $Y$ is a vector of observed responses, $\beta$ is an unknown vector of fixed effects, $d$ is an unknown vector of random (subject-specific) effects with mean zero and covariance matrix $W$, $X$ and $S$ are the design matrices associated with $\beta$ and $d$ respectively, and $\varepsilon$ is a vector of residual errors with zero mean and covariance matrix $P$. The dimensions of $X$ and $S$ must conform to the length of the observation vector $Y$ and the numbers of fixed and random effects respectively. It is generally assumed that the elements of $d$ are uncorrelated with the elements of $\varepsilon$, in which case the joint covariance matrix of the random effects and residual errors is block diagonal:

$$\mathrm{var}\begin{pmatrix} d \\ \varepsilon \end{pmatrix} = \begin{bmatrix} W & 0 \\ 0 & P \end{bmatrix} \quad (6)$$

The matrices $S$ and $W$ will themselves be block diagonal if the data arise from a hierarchical structure, where a fixed number of random effects common to observations within a single higher-level unit are assumed to vary across the units at a given level of the hierarchy. Typically the residual errors are taken to be independent and identically distributed, so that $P = \sigma_\varepsilon^2 I$, where $\sigma_\varepsilon^2$ is the residual variance. The covariance matrix $W$ of the random-effects vector $d$ is often assumed to have a structure that depends on a series of unknown variance-component parameters that must be estimated in addition to the residual variance $\sigma_\varepsilon^2$ and the vector of fixed effects $\beta$.

The universal estimators of the fixed and random effects are the best linear unbiased estimator (BLUE) $\hat{\beta}$ of $\beta$ and the best linear unbiased predictor (BLUP) $\hat{d}$ of $d$. These can be recovered as the solution of the mixed model equations,

$$\begin{bmatrix} X^T P^{-1} X & X^T P^{-1} S \\ S^T P^{-1} X & S^T P^{-1} S + W^{-1} \end{bmatrix} \begin{pmatrix} \hat{\beta} \\ \hat{d} \end{pmatrix} = \begin{bmatrix} X^T P^{-1} Y \\ S^T P^{-1} Y \end{bmatrix} \quad (7)$$
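The mixed model equations (7) can be assembled and solved directly. The sketch below uses made-up dimensions, design matrices and variance components purely for illustration; its solution coincides with the generalized least-squares estimator of $\beta$, a well-known identity.

```python
import numpy as np

# Made-up mixed model: n observations, p+1 fixed effects, K random effects
rng = np.random.default_rng(2)
n, p, K = 50, 2, 5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
S = rng.normal(size=(n, K))
W = 0.5 * np.eye(K)                       # var(d), assumed known here
P = 0.2 * np.eye(n)                       # var(eps), assumed known here
d = rng.multivariate_normal(np.zeros(K), W)
Y = X @ np.array([1.0, 2.0, -1.0]) + S @ d \
    + rng.multivariate_normal(np.zeros(n), P)

# Assemble and solve the mixed model equations (7)
Pi = np.linalg.inv(P)
lhs = np.block([[X.T @ Pi @ X, X.T @ Pi @ S],
                [S.T @ Pi @ X, S.T @ Pi @ S + np.linalg.inv(W)]])
rhs = np.concatenate([X.T @ Pi @ Y, S.T @ Pi @ Y])
sol = np.linalg.solve(lhs, rhs)
beta_hat, d_hat = sol[:p + 1], sol[p + 1:]
```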

A mixed model is of the form,

$$Y = X\beta + Sd + \varepsilon$$

Assuming that d and ε are multivariate normal;

$$\begin{bmatrix} d \\ \varepsilon \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} W & 0 \\ 0 & P \end{bmatrix} \right) \quad (8)$$

and taking $H = \mathrm{var}(Y)$, we have $Y \sim N( X\beta, H )$, which yields the density

$$f( y; \beta, H ) = (2\pi)^{-\frac{n}{2}} |H|^{-\frac{1}{2}} \exp\left\{ -\tfrac{1}{2} ( y - X\beta )^T H^{-1} ( y - X\beta ) \right\}$$

Since $Y$ is a single multivariate observation, the likelihood is this density itself:

$$L( y; \beta, H ) = (2\pi)^{-\frac{n}{2}} |H|^{-\frac{1}{2}} \exp\left\{ -\tfrac{1}{2} ( y - X\beta )^T H^{-1} ( y - X\beta ) \right\} \quad (9)$$

The log-likelihood becomes

$$l( y; \beta, H ) = -\tfrac{1}{2} \left\{ n \log 2\pi + \log |H| + ( y - X\beta )^T H^{-1} ( y - X\beta ) \right\} \quad (10)$$

where $H = \mathrm{var}(Y) = S W S^T + P$. Assuming that the parameters defining the covariance matrices $W$ and $P$ are known, the MLE $\hat{\beta}$ of $\beta$ is

$$\hat{\beta} = ( X^T H^{-1} X )^{-1} X^T H^{-1} y \quad (11)$$

which, although not obvious algebraically, must also satisfy the mixed model equations given earlier, since one way of deriving those equations is directly from the multivariate normality assumption. Typically $W$ and $P$ are not known; they can be estimated by substituting the expression for $\hat{\beta}$ back into $l( \beta; W, P )$ and maximizing the result over the parameters defining $W$ and $P$. Once estimates of $W$ and $P$ have been obtained, we can return to the mixed model equations and determine the BLUP $\hat{d}$ of the random-effects vector $d$ as the vector that minimizes the expected mean squared error of prediction

$$E\left\{ ( \hat{d} - d )^T ( \hat{d} - d ) \right\} \quad (12)$$

The BLUP of $d$ can be expressed as the posterior expectation of the random effects given the data, $\hat{d} = E( d \mid Y )$, which can be evaluated explicitly under the normality assumption to yield

$$\hat{d} = W S^T H^{-1} ( Y - X\hat{\beta} ) \quad (13)$$
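The closed forms (11) and (13) can be checked against the mixed model equations (7) numerically; they are algebraically equivalent, so the two routes give identical answers. The setup below is illustrative (arbitrary dimensions and variance components).

```python
import numpy as np

# Made-up model with known W and P
rng = np.random.default_rng(3)
n, K = 40, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
S = rng.normal(size=(n, K))
W, P = 0.8 * np.eye(K), 0.3 * np.eye(n)
Y = rng.normal(size=n)

# Closed forms: Eq. (11) for beta_hat, Eq. (13) for the BLUP d_hat
H = S @ W @ S.T + P                        # var(Y)
Hi = np.linalg.inv(H)
beta_hat = np.linalg.solve(X.T @ Hi @ X, X.T @ Hi @ Y)
d_hat = W @ S.T @ Hi @ (Y - X @ beta_hat)

# The mixed model equations (7) reproduce the same solution
Pi = np.linalg.inv(P)
lhs = np.block([[X.T @ Pi @ X, X.T @ Pi @ S],
                [S.T @ Pi @ X, S.T @ Pi @ S + np.linalg.inv(W)]])
rhs = np.concatenate([X.T @ Pi @ Y, S.T @ Pi @ Y])
sol = np.linalg.solve(lhs, rhs)
```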

Assuming

$$Y_i \sim N( m( X_i ), \sigma_i^2 ), \quad i = 1, 2, \cdots, n \quad (14)$$

where $m( X_i )$ is modelled as a truncated polynomial spline

$$m(X) = \beta_0 + \beta_1 X + \cdots + \beta_q X^q + \sum_{k=1}^{r} \rho_k ( X - \tau_k )_+^q \quad (15)$$

where $\tau_1, \tau_2, \cdots, \tau_r$ are the knots covering the range of the $x$'s and

$$( X - \tau_k )_+^q = \begin{cases} ( X - \tau_k )^q, & X > \tau_k \\ 0, & \text{otherwise} \end{cases}$$

The knots are placed over the range of the $x$'s and the number of knots $r$ is chosen generously. In penalized splines the approach is to put a penalty on the coefficients $\rho_k$. The standard approach is to minimize the sum of squares plus a quadratic penalty $\omega \rho^T D \rho$, where $\omega$ is the penalty parameter and $D$ is a square penalty matrix. For the truncated polynomial basis $D$ is the identity matrix and the penalty is $\omega \rho^T \rho$; for a B-spline basis the penalty is constructed from differences between neighbouring spline coefficients [

$$\rho \sim N( 0, \sigma_\rho^2 D^{-1} )$$

where $\rho$ is the vector of spline coefficients, $\sigma_\rho^2 = \sigma_\varepsilon^2 / \omega$, and $D^{-1}$ is a generalized inverse of $D$.
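The equivalence between the ridge penalty $\omega \rho^T \rho$ and the Gaussian prior with $\sigma_\rho^2 = \sigma_\varepsilon^2 / \omega$ can be verified numerically: the penalized least-squares estimate equals the BLUP under that prior. The snippet below is a sketch with an arbitrary design matrix; all dimensions and values are made up.

```python
import numpy as np

# Arbitrary design matrix C and response y (made-up data)
rng = np.random.default_rng(5)
n, m = 60, 8
C = rng.normal(size=(n, m))
y = rng.normal(size=n)
omega, sigma_eps2 = 2.5, 1.0

# (a) Ridge-penalized least squares with identity penalty on all coefficients
theta_pen = np.linalg.solve(C.T @ C + omega * np.eye(m), C.T @ y)

# (b) BLUP with prior var(rho) = (sigma_eps^2 / omega) I and no fixed effects
W = (sigma_eps2 / omega) * np.eye(m)
H = C @ W @ C.T + sigma_eps2 * np.eye(n)   # var(y)
theta_blup = W @ C.T @ np.linalg.inv(H) @ y
```

By the push-through identity $(C^T C + \omega I)^{-1} C^T = C^T (C C^T + \omega I)^{-1}$, the two estimates coincide exactly.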

In this approach a single parameter $\sigma_\rho^2$ is used to shrink all the spline coefficients, which can be a limitation when the underlying function is locally varying, i.e. it fails to fully capture the features of functions that exhibit strong heterogeneity. One way to avoid this is to allow the coefficients $\rho_1, \cdots, \rho_r$ to have knot-specific prior variances, $\rho_k \sim N( 0, \sigma_\rho^2( \tau_k ) )$, and to assume that the shrinkage variance process $\sigma_\rho^2( \tau_k )$ is a smooth function modelled as a log-penalized spline

$$\sigma_\rho^2( \tau_k ) = \exp\left[ \gamma_0 + \gamma_1 \tau_k + \cdots + \gamma_p \tau_k^p + \sum_{j=1}^{l} h_j ( \tau_k - \alpha_j )_+^p \right] \quad (16)$$

where $\alpha_1, \alpha_2, \cdots, \alpha_l$ is a second layer of knots covering the range of $\tau_1, \tau_2, \cdots, \tau_r$; in practice $l$ is much smaller than $r$. The hierarchical penalized smoothing model is completed by the shrinkage assumption $h_j \sim N( 0, \sigma_h^2 )$, $j = 1, 2, \cdots, l$, with $\sigma_h^2$ constant.

Thus our hierarchical smoothing model can be written as

$$Y \mid \rho, h = X_\rho \beta + S_\rho \rho + \varepsilon, \quad \varepsilon \sim N( 0, \sigma_\varepsilon^2 I_n )$$

$$\rho \mid h \sim N( 0, \Sigma_\rho )$$

$$\Sigma_\rho = \mathrm{diag}\left\{ \exp( X_h \gamma + Z_h h ) \right\}$$

$$h \sim N( 0, \sigma_h^2 I_l )$$

where:

$$y = ( y_1, \cdots, y_n )^T, \quad X_\rho = \begin{bmatrix} 1 & \cdots & x_1^q \\ \vdots & \ddots & \vdots \\ 1 & \cdots & x_n^q \end{bmatrix}, \quad S_\rho = \begin{bmatrix} ( x_1 - \tau_1 )_+^q & \cdots & ( x_1 - \tau_r )_+^q \\ \vdots & \ddots & \vdots \\ ( x_n - \tau_1 )_+^q & \cdots & ( x_n - \tau_r )_+^q \end{bmatrix}$$

$$X_h = \begin{bmatrix} 1 & \cdots & \tau_1^p \\ \vdots & \ddots & \vdots \\ 1 & \cdots & \tau_r^p \end{bmatrix}, \quad Z_h = \begin{bmatrix} ( \tau_1 - \alpha_1 )_+^p & \cdots & ( \tau_1 - \alpha_l )_+^p \\ \vdots & \ddots & \vdots \\ ( \tau_r - \alpha_1 )_+^p & \cdots & ( \tau_r - \alpha_l )_+^p \end{bmatrix}$$

$$\beta = ( \beta_0, \cdots, \beta_q )^T, \quad \rho = ( \rho_1, \cdots, \rho_r )^T$$
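The design matrices of the hierarchical model can be constructed directly from these definitions. The sketch below uses made-up data; the degrees and knot counts ($q$, $p$, $r$, $l$) are illustrative choices.

```python
import numpy as np

# Illustrative dimensions: n observations, degree-q first-layer spline with
# r knots, degree-p second-layer spline with l knots
rng = np.random.default_rng(4)
n, q, p, r, l = 100, 3, 1, 20, 6
x = np.sort(rng.uniform(0, 1, n))
tau = np.linspace(x.min(), x.max(), r + 2)[1:-1]     # first-layer knots
alpha = np.linspace(tau[0], tau[-1], l + 2)[1:-1]    # second-layer knots

X_rho = np.vander(x, q + 1, increasing=True)         # n x (q+1): 1, x, ..., x^q
S_rho = np.maximum(x[:, None] - tau, 0.0) ** q       # n x r truncated powers
X_h = np.vander(tau, p + 1, increasing=True)         # r x (p+1)
Z_h = np.maximum(tau[:, None] - alpha, 0.0) ** p     # r x l truncated powers

# Knot-specific shrinkage variances Sigma_rho = diag{exp(X_h gamma + Z_h h)},
# with gamma and h drawn arbitrarily here just to show the construction
gamma = rng.normal(size=p + 1)
h = rng.normal(scale=0.3, size=l)
Sigma_rho = np.diag(np.exp(X_h @ gamma + Z_h @ h))
```

Fitting then proceeds as a mixed model in which the log-variances of the first-layer coefficients are themselves a penalized spline over the knots $\tau_k$.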

Penalized splines are very common in nonparametric regression, but they have one major drawback: they are not spatially adaptive, because a single global smoothing parameter is applied across the whole, possibly heterogeneous, function. In this research we aimed to produce a spatially adaptive penalized spline by introducing hierarchical splines. This was achieved by modelling the global smoothing parameter $\omega$ used in classical smoothing as another spline.

The authors declare no conflicts of interest regarding the publication of this paper.

Ndung’u, A.W., Mwalili, S. and Odongo, L. (2019) Hierarchical Penalized Mixed Model. Open Journal of Statistics, 9, 657-663. https://doi.org/10.4236/ojs.2019.96042