Automatic Variable Selection for High-dimensional Linear Models with Longitudinal Data

High-dimensional longitudinal data arise frequently in biomedical and genomic research. It is important to select relevant covariates when the dimension of the parameters diverges as the sample size increases. We consider the problem of variable selection in high-dimensional linear models with longitudinal data. A new variable selection procedure is proposed using the smooth-threshold generalized estimating equation and quadratic inference functions (SGEE-QIF) to incorporate correlation information. The proposed procedure automatically eliminates inactive predictors by setting the corresponding parameters to be zero, and simultaneously estimates the nonzero regression coefficients by solving the SGEE-QIF. The proposed procedure avoids the convex optimization problem and is flexible and easy to implement. We establish the asymptotic properties in a high-dimensional framework where the number of covariates n p increases as the number of cluster n increases. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed variable selection procedure.


Introduction
Longitudinal data arise frequently in biomedical and health studies in which repeated measurements form the same subject are correlated.A major aspect of longitudinal data is the within subject correlation among the repeated measurements.Ignoring this within subject correlation causes a loss of efficiency in general problems.One of the commonly used regression methods for analyzing longitudinal data is generalized estimating equations (Liang and Zeger, [1]).Generalized estimating equations (GEE) using a working correlation matrix with nuisance parameters estimate regression parameters consistently even when the correlation structure is misspecified.However, under such misspecification, the estimator can be inefficient.For this reason, Qu et al. [2] proposed a method of quadratic inference functions (QIF).It avoids estimating the nuisance correlation structure parameters by assuming that the inverse of working correlation matrix can be approximated by a linear combination of several known basis matrices.The QIF can efficiently take the within subject correlation into account and is more efficient than the GEE approach when the working correlation is misspecified.The QIF estimator is also more robust against contamination when there are outlying observations (Qu and Song, [3]).
High-dimensional longitudinal data, which consist of repeated measurements on a large number of covariates, arise frequently from health and medical studies.Thus, it is important for statisticians to develop a new statistical methodology and theory of variable selection and estimation for high-dimensional longitudinal data, which can reduce the modeling bias.Generally speaking, most of the variable selection procedures are based on pena-lized estimation using penalty functions.Such as, q L penalty (Frank and Friedman, [4]), LASSO penalty (Tibshirani, [5]), SCAD penalty (Fan and Li, [6]), and so on.In the longitudinal data framework, Pan [7] proposed an extension of the Akaike information criterion (Akaike, [8]) by applying the quasi-likelihood to the GEE, assuming independent working correlation.Wang and Qu [9] developed a Bayesian information type of criterion (Schwarz,[10]) based on the quadratic inference functions.Fu [11] applied the bridge penalty model to the GEE and Xu et al. [12] introduced the adaptive lasso for the GEE setting, and references therein.These methods are able to perform variable selection and parameter estimation simultaneously.However, most of the theory and implementation is restricted to a fixed dimension of parameters.
Despite the importance of variable selection in high-dimensional settings (Fan and Li, [13]; Fan and Lv, [14]), variable selection for longitudinal data that take into consideration the correlation information is not well studied when the dimension of parameters diverges.In this paper we use the smooth-threshold generalized estimating equation based on quadratic inference functions (SGEE-QIF) to the high-dimensional longitudinal data.The proposed procedure automatically eliminates the irrelevant parameters by setting them as zero, and simultaneously estimates the nonzero regression coefficients by solving the SGEE-QIF.Compared to the shrinkage methods and the existing research findings reviewed above, our method offers the following improvements: 1) the proposed procedure avoids the convex optimization problem; 2) the proposed SGEE-QIF approach is flexible and easy to implement; 3) the proposed method is easy to deal with the longitudinal correlation structure and extend the estimating equations approach to high-dimensional longitudinal data.
The rest of this paper is organized as follows.In Section 2 we first propose variable selection procedures for high-dimensional linear models with longitudinal data, and asymptotic properties of the resulting estimators.In Section 3 we give the computation of the estimators as well as the choice of the tuning parameters.In Section 4 we carry out simulation studies to assess the finite sample performance of the method.Some assumptions and the technical proofs of all asymptotic results are provided in the Appendix.

Model and Notation
We consider a longitudinal study with n subjects and i m observations over time for the i th subject ( ) for a total of taken from the i th subject at time ij t .We assume that the full data set , is observed and can be modelled as where β is a 1 n p × vector of unknown regression coefficients, and n p diverging as the sample size in- creases.ij ε is random error with ( ) In addition, we give assumptions on the first two moments of , where ( ) v ⋅ is a known variance function.

Quadratic Inference Functions
Denote ( ) and write i X in a similar fashion.Following Liang and Zeger [1], a GEE can be used to estimate the regression parameters, β , ( ) where i V is the covariance matrix of i Y .The matrix i V is often modelled as ( ) R α is a com- mon working correlation depending on a set of unknown nuisance parameters α .Based on the estimation theory associated with the working correlation structure, the GEE estimator of the regression coefficient proposed by Liang and Zeger [1] is consistent if consistent estimators of the nuisance parameters α can be obtained.For suggested methods for estimating α , see Liang and Zeger [1].However, even in some simple cases, consistent estimators of α do not always exist (Crowder [15]).To avoid this drawback, Qu et al. [2] suggested that the inverse of the working correlation matrix, ( ) is approximated by a linear combination of basis matrices, , 1, , where 1 , , s M M  are known matrices, and 1 , , s a a  are unknown constants.This is a sufficiently rich class that accommodates, or at least approximates, the correlation structures most commonly used.Details of utilizing a linear combination of some basic matrices to model the inverse of working correlation can be found in Qu et al. [2].
Substituting (2.3) to (2.2), we get the following class of estimating functions: Instead of estimating parameters directly, they recognized that a GEE (2.4) is equivalent to solving the linear combination of a vector of estimating equations: However, (2.5) does not work because the dimension of ( ) n g β is obviously greater than the number of un- known parameters.Using the idea of generalized method of moments (Hansen,[16]), Qu et al. [2] defined the quadratic inference functions (QIF), ( ) ( ) ( ) ( ) where Note that n Ω depends on β .The QIF estimate ˆn β is then given by ( ) Then, based on (2.6), according to Qu et al. [2], the corresponding estimating equation for β is ( ) where n g  is the

Smooth-Threshold Generalized Estimating Equations Based on QIF
Variable selection is an important topic in high dimensional regression analysis and most of the variable selection procedures are based on penalized estimation using penalty functions.Because of these variable selection procedures using penalty function have a singularity at zero.So, these procedure require convex optimization, which incurs a computational burden.To overcome this problem, Ueki [17] developed a automatic variable selection procedure that can automatically eliminate irrelevant parameters by setting them as zero.The method is easily implemented without solving any convex optimization problems.Motivated by this idea we propose the following smooth-threshold generalized estimating equations based on quadratic inference functions (SGEE-QIF) ( ) ( )

OPEN ACCESS OJS
where ∆ is the diagonal matrix whose diagonal elements are , and n p I is the n p dimen- sional identity matrix.Note that the j th SGEE-QIF with 1 j δ = reduces to 0 j β = .Therefore, SGEE-QIF (2.8) can yield a sparse solution.Unfortunately, we cannot directly obtain the estimator of β by solving (2.8).This is because the SGEE-QIF involves j δ , which need to be chosen using some data-driven criteria.For the choice of , Ueki [17] suggested that j δ may be determined by the data, and can be chosen by with an initial estimator ( ) β .The initial estimator ( ) 0 ˆj β can be obtained by solving the QIF (2.6) for the full model.Note that this choice involves two tuning parameters ( ) , λ γ .In Section 4, fol- lowing the idea of Ueki [17], we use the BIC-type criterion to select the tuning parameters.

Asymptotic Properties
We next study the asymptotic properties of the smooth-threshold estimator.Let 0 β be the fixed true value of β .Denote , where 0 β is the true value of β .Following Fan and Peng [18] and Wang [19], it is possible to prove the oracle properties for the SGEE-QIF estimators, including n n p consistency, variable selection consistency and asymptotic normality.
To obtain the asymptotic properties in the paper, we require the following regularity conditions: (C1).The parameter space S is compact, and 0 β is an interior point of S . (C2).
( ) ( ) ( ) ( ) which is the set of indices of nonzero parameters, where The following theorem gives the consistency of the SGEE-QIF estimators.
Theorem 1.Under conditions C1-C8, for any positive λ and γ , such that ( ) There exists a sequence β of the solutions of (2.9) such that 0 ˆn p p O n .
Furthermore, we show that such consistent estimators must possess the sparsity property and the estimators for nonzero coefficients have the same asymptotic distribution as that based on the correct submodel.
Remark: Theorems 1 and 2 imply that the proposed SGEE-QIF procedure is consistent in variable selection, it can identify the zero coefficients with probability to 1.By appropriate tuning parameters, the SGEE-QIF estimators have the oracle property; that is, the asymptotic variances for the SGEE-QIF estimators are the same as what we would have if we knew in advance the correct submodel.

Algorithm
Next, we propose the iterative algorithm to implement the procedures as follows: Step 1, Calculate the initial estimates ( ) 0 β of β by solving the initial QIF (2.5) estimator.Let Step 2, By using the current estimate ( )  ˆk β , we choose the tuning parameters ( ) , λ γ by the BIC criterion.Step 3, Update the estimator of β as follows: { } Step 4, Iterate Steps 2-3 until convergence, and denote the final estimators of β as the SGEE-QIF estimator.

Choosing the Tuning Parameters
To implement the procedures described above, we need to choose the tuning parameters ( ) , λ γ .Following Ueki

Choosing the Basis Matrices
The choice of basis matrices M is a matrix with 0 on the diagonal and 1 off the diagonal.More details about choosing the basis matrices can be seen in Zhou and Qu [20].

Simulation Studies
In this section we conduct a simulation study to assess the finite sample performance of the proposed procedures.In the simulation study, the performance of estimator β will be assessed by using the average the mean square error (AMSE), defined as .
While the remaining coefficients, corresponding to the irrelevant variables, are given by zeros.In addition, let  ( ) , , ~0, Corr , ( ) Corr , i ε α is a known correlation matrix with parameter α used to determine the strength of with-subject dependence.Here we consider ij ε has the first- order autoregressive (AR-1) or compound symmetry (CS) correlation (i.e.exchangeable correlation) structure with 0.7 α = .In the following simulations, we make 500 simulation runs and take 60 n = and 120 .In the simulation study, for each simulated data set, we compare the estimation accuracy and model selection properties of the SGEE-QIF method, the SCAD-penalized QIF and the Lasso-penalized QIF for two different working correlations.For each of these methods, the average of zero coefficients over the 500 simulated data sets is reported in Tables 1 and 2. Note that "Correct" in tables means the average number of zero regression coefficients that are correctly estimated as zero, and "Incorrect" depicts the average number of non-zero regression coefficients that are erroneously set to zero.At the same time, we also examine the effect of using a misspecified correlation structure in the model, which are also reported in Tables 1 and 2. From Tables 1 and 2, we can make the following observations.1) For the parametric component, the performances of variable selection procedures become better and better as n increases.For example, the values in the column labeled "Correct" become more and more closer to the true number of zero regression coefficients in the models.
2) Compared with the penalized QIF based on Lasso and SCAD, SGEE-QIF performs satisfactory in terms of variable selection.
3) It is not surprised that the performances of variable selection procedures based on the correct correlation structure work better than based on the incorrect correlation structure.However, we also note that the performance does not significantly depend on working covariance structure.

Discussion
In this paper, we have proposed the smooth-threshold generalized estimating equation based on quadratic inference function (SGEE-QIF) to the high-dimensional longitudinal data.Our method incorporates the within subject correlation structure of the longitudinal to automatically eliminate the irrelevant parameters by setting them as zero, and simultaneously estimates the nonzero regression coefficients.As a future research topic, it is interesting to consider the variable selection for the high/ultra-high dimension varying coefficient models with longitudinal data.

Appendix. Proof of Theorems
for n enough.This will imply that there exists a local solution to the equation ( ) According to condition (C7), consider the k th ( ) where jv i σ is the ( ) , hence, we have ( )   ) where ( )

∑
Therefore, ni Z satisfies the conditions of the Lindeberg-Feller central limit theorem.Hence, the proof of Theorem 2 is completed.

s
= Α the number of true nonzero parameters.s may be fixed or grow with n .We assume, under the regularity conditions, the initial QIF estimator 0 β obtained by solving the QIF (2.6) satisfies 0 0 ).The weighting matrix ( )n β Ωconverges almost surely to a constant matrix 0 Ω , where 0 Ω is in- vertible.Furthermore, the first and second partial derivatives of n Ω in β are all a twice differentiable function of β .Furthermore the third derivatives of

[ 17 ]
, we use BIC-type criterion to choose these two parameters.That is, we choose ( )

kM 1 M is the identity matrix, 2 M 3 M 1 M
in (2.2) is not difficult, especially for those special correlation structures which are frequently used.If we assume ( ) R α is the first-order autoregressive correlation matrix.The exact inversion ( ) 1 R α − can be written as a linear combination of three basis matrices, they are 1 has 1 on two main off-diagonals and 0 elsewhere, and has 1 on the corners (1, 1) and (m, m), and 0 elsewhere.Suppose ( ) R α is an exchangeable working correlation matrix, it has 1 on the diagonal, and α everywhere off the diagonal.Then as a linear combination of two basis matrices, is the identity matrix and 2

B
denote the s th components k B , by some elementary calcu- lation, we obtain complete the proof of 1).Next, we will prove 2).As shown in 1), ˆ0 j β = for 0 j A ∈ with probability tending to 1.At the same time, with probability tending to 1, 0 ˆA β satisfies the smooth threshold generalized equations based on quadratic inference functions (SGEE-QIF)


To establish the asymptotic normality, it suffices to check the Lindeberg condition, i.e., for any ε ,