In this paper, we consider the problem of variable selection and model detection in additive models with longitudinal data. Our approach is based on a spline approximation of the components, aided by two Smoothly Clipped Absolute Deviation (SCAD) penalty terms, and performs model selection (identifying both zero and linear components) and estimation simultaneously. With an appropriate choice of the tuning parameters, we show that the proposed procedure is consistent in both variable selection and linear-component selection. Besides being theoretically justified, the proposed method is easy to understand and straightforward to implement. Extensive simulation studies and a real dataset illustrate its performance.

Longitudinal data arise frequently in biological and economic applications. A challenge in analyzing longitudinal data is that the likelihood function is difficult to specify or formulate for non-normal responses with large cluster sizes. To allow richer and more flexible model structures, an effective semiparametric regression tool is the additive model introduced by [

where

method for variable selection and model detection in model (1.1), and show that the proposed method can correctly select the nonzero components with probability approaching one as the sample size goes to infinity.

Statistical inference of additive models with longitudinal data has also been considered by some authors. By extending the generalized estimating equations approach, [

In the next section, we propose the two-fold SCAD penalization procedure based on the QIF, together with a computational algorithm, and present its theoretical properties. In particular, we show that the procedure selects the true model with probability approaching one, and that the newly proposed method estimates the nonzero function components with the same optimal mean squared convergence rate as standard spline estimators. Simulation studies and an application of the proposed method to a real data example are given in Sections 3 and 4, respectively. Technical lemmas and proofs are collected in the Appendix.

Consider a longitudinal study with

vector

is observed and can be modelled as

where

At the start of the analysis, we do not know which component functions in model (1.1) are linear or actually zero. We adopt the centered B-spline basis, where

article for simplicity of proof. Other regular knot sequences can also be used, with similar asymptotic results. Suppose that
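As a concrete illustration of the spline approximation described above, the following sketch builds a centered cubic B-spline basis on equally spaced knots over [0, 1]. It is a minimal sketch, not code from the paper: the function name and knot count are illustrative, and SciPy's `BSpline` is used for the evaluation.

```python
import numpy as np
from scipy.interpolate import BSpline

def centered_bspline_basis(x, n_interior=5, degree=3):
    """Evaluate a centered cubic B-spline basis on [0, 1].

    Interior knots are equally spaced; the columns are centered so each
    basis function has mean zero over the sample, which removes the
    intercept confounding between additive components.
    """
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    # Clamped knot sequence: repeat each boundary knot degree+1 times.
    t = np.concatenate([[0.0] * (degree + 1), interior, [1.0] * (degree + 1)])
    n_basis = len(t) - degree - 1
    # Identity coefficients evaluate all basis functions at once:
    # result has shape (len(x), n_basis).
    B = BSpline(t, np.eye(n_basis), degree)(x)
    return B - B.mean(axis=0, keepdims=True)
```

Because the uncentered basis is a partition of unity on [0, 1], the centered rows sum to zero, so the fitted additive components are identified up to the overall intercept.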

To simplify notation, we first assume equal cluster size

where

where

The vector

where
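The QIF construction can be sketched as follows for a marginal linear model with equal cluster size. This is a minimal illustration of the quadratic inference function of Qu, Lindsay and Li (2000), not the paper's implementation: the names are ours, and the basis-matrix expansion shown covers a compound-symmetry working structure.

```python
import numpy as np

def qif(beta, X, y, m):
    """Quadratic inference function for a marginal linear model.

    X: stacked design matrix, shape (n*m, p); y: stacked responses of
    length n*m; m: common cluster size.  The inverse working correlation
    is modelled as a linear span of basis matrices M1 = I and M2 (ones
    off the diagonal), which covers compound symmetry; an AR(1)
    structure would use a tridiagonal M2 instead.
    """
    n = len(y) // m
    p = X.shape[1]
    M1 = np.eye(m)
    M2 = np.ones((m, m)) - M1
    g = np.zeros((n, 2 * p))          # extended scores, one row per cluster
    for i in range(n):
        Xi = X[i * m:(i + 1) * m]
        ri = y[i * m:(i + 1) * m] - Xi @ beta
        g[i, :p] = Xi.T @ M1 @ ri
        g[i, p:] = Xi.T @ M2 @ ri
    gbar = g.mean(axis=0)
    C = g.T @ g / n                   # sample covariance of the scores
    return n * gbar @ np.linalg.pinv(C) @ gbar
```

Minimizing this objective over beta avoids estimating the nuisance correlation parameters directly, which is the main computational appeal of the QIF over GEE.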

Our main goal is to find both zero components (i.e.,

function). The former can be achieved by shrinking

where
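For reference, the SCAD penalty of Fan and Li (2001) that appears in both penalty terms can be evaluated as below; a = 3.7 is the conventional choice, and the function name is illustrative.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001).

    Linear (lasso-like) near zero, quadratic in between, and constant
    beyond a*lam, so large coefficients are not shrunk at all.
    """
    t = np.abs(np.asarray(t, dtype=float))
    small = lam * t
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    large = lam**2 * (a + 1) / 2
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, large))
```

The flat tail beyond a*lam is what gives SCAD its oracle behavior: sufficiently large group norms incur a constant penalty and are therefore estimated without asymptotic bias.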

To study the rate of convergence for

for

(A1) The covariates

(A2) Let

(A3) For each

(A5) Let

(A6) The matrix

Theorem 1. Suppose that the regularity conditions A1-A5 hold and the number of knots

For

Theorem 2. Under the same assumptions as Theorem 1, and if the tuning parameter

a)

b)

Theorem 2 also implies that the additive model selection above possesses the consistency property. The results in Theorem 2 are similar to those for semiparametric estimation of the additive quantile regression model in [

Finally, in the same spirit as [

Theorem 3. Suppose that the regularity conditions A1-A5 hold and the number of knots

as assumed in Theorem 1, the parameters

In this section, we conduct Monte Carlo studies for the following longitudinal data and additive model. The continuous responses

where

tion with mean 0, a common marginal variance

The predictors

To illustrate the effect on estimation efficiency, we compare the penalized QIF approach in [
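The two working correlation structures used throughout the tables, compound symmetry (CS) and AR(1), can be generated as follows. This is a minimal sketch with illustrative names, drawing Gaussian cluster errors with mean 0 and a common marginal variance as in the simulation design.

```python
import numpy as np

def working_correlation(m, rho, structure="CS"):
    """Exchangeable (CS) or AR(1) correlation matrix of size m."""
    if structure == "CS":
        return (1 - rho) * np.eye(m) + rho * np.ones((m, m))
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def longitudinal_errors(n, m, rho, sigma2=1.0, structure="CS", seed=0):
    """Draw n clusters of m correlated Gaussian errors with marginal
    variance sigma2, mirroring the simulation design of this section."""
    rng = np.random.default_rng(seed)
    cov = sigma2 * working_correlation(m, rho, structure)
    return rng.multivariate_normal(np.zeros(m), cov, size=n)  # (n, m)
```

Errors are independent across clusters but correlated within a cluster, which is exactly the setting the QIF's basis-matrix expansion is designed to exploit.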

n | Correlation | Method |  |  |  |  |  |
---|---|---|---|---|---|---|---|---
100 | CS | PQIF | 0.32 | 0.42 | 0.30 | 0.29 | 0.31 | 0.26
 |  | TFPQIF | 0.30 | 0.46 | 0.28 | 0.25 | 0.23 | 0.22
 |  | ORACLE | 0.14 | 0.14 | 0.15 | 0.15 | 0.13 | 0.12
 | AR(1) | PQIF | 0.36 | 0.39 | 0.32 | 0.30 | 0.29 | 0.25
 |  | TFPQIF | 0.29 | 0.39 | 0.35 | 0.20 | 0.25 | 0.22
 |  | ORACLE | 0.13 | 0.15 | 0.22 | 0.14 | 0.12 | 0.10
250 | CS | PQIF | 0.25 | 0.29 | 0.25 | 0.24 | 0.19 | 0.15
 |  | TFPQIF | 0.22 | 0.31 | 0.26 | 0.14 | 0.16 | 0.15
 |  | ORACLE | 0.12 | 0.11 | 0.19 | 0.097 | 0.098 | 0.09
 | AR(1) | PQIF | 0.28 | 0.24 | 0.31 | 0.33 | 0.28 | 0.19
 |  | TFPQIF | 0.20 | 0.20 | 0.27 | 0.24 | 0.14 | 0.15
 |  | ORACLE | 0.10 | 0.11 | 0.13 | 0.21 | 0.10 | 0.096
500 | CS | PQIF | 0.15 | 0.14 | 0.25 | 0.23 | 0.20 | 0.17
 |  | TFPQIF | 0.15 | 0.30 | 0.26 | 0.11 | 0.12 | 0.10
 |  | ORACLE | 0.09 | 0.13 | 0.12 | 0.07 | 0.07 | 0.07
 | AR(1) | PQIF | 0.18 | 0.30 | 0.26 | 0.13 | 0.12 | 0.14
 |  | TFPQIF | 0.16 | 0.23 | 0.25 | 0.09 | 0.09 | 0.09
 |  | ORACLE | 0.08 | 0.13 | 0.12 | 0.077 | 0.081 | 0.07

n | Method | CS: NCC | NNT | NLC | NLT | AR(1): NCC | NNT | NLC | NLT
---|---|---|---|---|---|---|---|---|---
100 | PQIF | 5.96 | 2 | 0 | 0 | 5.83 | 2 | 0 | 0
 | TFPQIF | 2.64 | 2 | 2.58 | 2.36 | 2.52 | 2 | 2.63 | 2.46
 | ORACLE | 2 | 2 | 3 | 3 | 2 | 2 | 3 | 3
250 | PQIF | 5.63 | 2 | 0 | 0 | 5.45 | 2 | 0 | 0
 | TFPQIF | 2.34 | 2 | 2.66 | 2.65 | 2.41 | 2 | 2.59 | 2.50
 | ORACLE | 2 | 2 | 3 | 3 | 2 | 2 | 3 | 3
500 | PQIF | 5.35 | 2 | 0 | 0 | 5.20 | 2 | 0 | 0
 | TFPQIF | 2.04 | 2 | 2.93 | 2.93 | 2.10 | 2 | 2.89 | 2.86
 | ORACLE | 2 | 2 | 3 | 3 | 2 | 2 | 3 | 3

with the one-penalty QIF when the errors are Gaussian. We also list the oracle model as a benchmark; the oracle model is only available in simulation studies, where the true information is known in advance.

In Table 2, we conduct additional simulations to evaluate the finite-sample performance of the proposed method. Let

estimator
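One natural way to compute the mean squared error entries reported in Table 1 is to average the squared deviations of the centered fitted component from the centered truth over a grid. A hypothetical sketch (the function name and centering convention are our illustration, not the paper's code):

```python
import numpy as np

def component_mse(f_hat, f_true, x_grid):
    """Empirical MSE of an estimated component function over a grid.

    Both curves are centered before comparison, since additive
    components are identified only up to an additive constant.
    """
    fh = f_hat(x_grid) - np.mean(f_hat(x_grid))
    ft = f_true(x_grid) - np.mean(f_true(x_grid))
    return np.mean((fh - ft) ** 2)
```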

In this subsection, we analyze data from the Multi-Center AIDS Cohort Study. The dataset contains the human immunodeficiency virus (HIV) status of 283 homosexual men who were infected with HIV during the follow-up period between 1984 and 1991. All individuals were scheduled to have their measurements taken at semi-annual visits. Here

In our analysis, the response variable is the CD4 cell percentage of a subject at distinct time points after HIV infection. We take four covariates for this study:

a partially linear additive model instead of an additive model because of the binary variable

depletes rather quickly at the beginning of HIV infection, but the rate of depletion appears to slow down four years after infection. This result agrees with previous findings [

In summary, we have presented a two-fold penalty variable selection procedure, which can select the linear components and the significant covariates and estimate the unknown component functions simultaneously. The simulation studies show that the proposed model selection method is consistent in both variable selection and linear-component selection. Besides being theoretically justified, the proposed method is easy to understand and straightforward to implement. A topic for further study is how to use a multi-fold penalty to solve model selection and variable selection in generalized additive partially linear models with longitudinal data.

Liugen Xue’s research was supported by the National Natural Science Foundation of China (11171012), the Science and Technology Project of the Faculty Adviser of Excellent PhD Degree Theses of Beijing (20111000503), and the Beijing Municipal Education Commission Foundation (KM201110005029).

For convenience and simplicity, let

Lemma 1. Under conditions (A1)-(A6), minimizing the unpenalized QIF

Proof: According to [

Proof of Theorem 1. Let

be the objective function in (2.7), where

This implies that

Since

By the definition of SCAD penalty function, removing the regularizing terms in (A2)

with

and

where

Proof of Theorem 2. We only show part (b) as an illustration; the proof of part (a) is similar. Suppose for some

As in the proof of Theorem 1, we have

Proof of Theorem 3. For any regularization parameters

CASE 1:

Since true

CASE 2:

CASE 3:

QIF (2.6)

CASE 4: