Joint Variable Selection of Mean-Covariance Model for Longitudinal Data

In this paper we reparameterize covariance structures in longitudinal data analysis through the modified Cholesky decomposition of itself. Based on this modified Cholesky decomposition, the within-subject covariance matrix is decomposed into a unit lower triangular matrix involving moving average coefficients and a diagonal matrix involving innovation variances, which are modeled as linear functions of covariates. Then, we propose a penalized maximum likelihood method for variable selection in joint mean and covariance models based on this decomposition. Under certain regularity conditions, we establish the consistency and asymptotic normality of the penalized maximum likelihood estimators of parameters in the models. Simulation studies are undertaken to assess the finite sample performance of the proposed variable selection procedure.


Introduction
In recent years, the method of joint modeling of mean and covariance on the general linear model with multivariate normal errors, was heuristically introduced by Pourahmadi [1,2].The key advantages of such models include the convenience in statistical interpretation and computational ease in parameter estimation, which is described in Section 2. On the other hand, the estimation of the covariance matrix is important in a longitudinal study.A good estimator for the covariance can improve the efficiency of the regression coefficients.Furthermore, the covariance estimation itself is also of interest [3].A number of authors have studied the problem of estimating the covariance matrix.Pourahmadi [1,2] considered generalized linear models for the components of the modified Cholesky decomposition of the covariance matrix.Fan et al. [4] and Fan and Wu [5] proposed to use a semiparametric model for the covariance function.Recently, Rothman et al. [6] proposed a new regression interpretation of the Cholesky factor of the covariance matrix by parameterizing itself and guaranteed the positivedefiniteness of the estimated covariance at no additional computational cost.Furthermore, based on this decomposition [6], Zhang and Leng [7] proposed efficient maximum likelihood estimates for joint mean-covariance analysis.
As is well known, as a part of modeling strategy, variable selection is an important topic in most statistical analysis, and has been extensively explored over the last three decades.In a traditional linear regression setting, many selection criteria (e.g., AIC and BIC) have been extensively used in practice.Nevertheless, those selection methods suffer from expensive computational costs.As computational efficiency is more desirable in many situations, various shrinkage methods have been developed, which include but are not limited to: the nonnegative garrotte [8], the LASSO [9], the bridge regression [10], the SCAD [11], and the one-step sparse estimator [12].Recently, Zhang and Wang [13] proposed a new criterion, named PICa, to simultaneously select explanatory variables in the mean model and variance model in heteroscedastic linear models based on the model structure.Zhao and Xue [14] presented a variable selection procedure by using basis function approximations and a partial group SCAD penalty for semiparametric varying coefficient partially linear models with longitudinal data.
In this paper we show that the modified Cholesky decomposition of the covariance matrix, rather than its inverse, also has a natural regression interpretation, and therefore all Cholesky-based regularization methods can be applied to the covariance matrix itself instead of its inverse to obtain a sparse estimator with guaranteed positive definiteness.Furthermore, we aim to develop an efficient penalized likelihood based method to select important explanatory variables that make a significant contribution to the joint modelling of mean and covariance structures for longitudinal data.With proper choices of the penalty functions and the tuning parameters, we establish the consistency and asymptotic normality of the resulting estimator.Simulation studies are used to illustrate the proposed methodologies.Compared with existing methods, our procedure offers the following differences and improvements.Firstly, Zhang and Leng [7] discussed efficient maximum likelihood estimates and model selection for joint mean-covariance analysis based BIC.As is well known, BIC selection method would suffer from expensive computational costs.However, our method can select significant variables and obtain the parameter estimators simultaneously in the joint modelling of mean and covariance structures for longitudinal data, that implies that our method can avoid the heavy computational burden.Secondly, in this paper we assume the covariates may be of high dimension, which become increasingly common in many health studies, and our method also can select the important subsets of the covatiates.Thirdly, we reparameterize covariance structures in longitudinal data analysis through the modified Cholesky decomposition of itself, which is brought closer to time series analysis, for which the moving average model may provide an alternative, equally powerful and parsimonious representation.

 
The rest of this paper is organized as follows.In Section 2 we first describe a reparameterization of covariance matrix itself through the modified Cholesky decomposition and introduce the joint mean and covariance models for longitudinal data.We then propose a variable selection method for the joint models via penalized likelihood function.Asymptotic properties of the resulting estimators are considered in Section 3. In Section 4 we give the computation of the penalized likelihood estimator as well as the choice of the tuning parameters.In Section 5 we carry out simulation studies to assess the finite sample performance of the method.

Variable Selection for Joint
Mean-Covariance Model

Modified Cholesky Decomposition of the Covariance Matrix
Suppose that there are n independent subjects and the ith subject has m i repeated measurements.Specifically, denote the response vector for the ith subject, , which are observed at time  .We assume that the response vector is . As a tool for regularizing the inverse covariance matrix, Pourahmadi [1] suggested using the modified Cholesky factorization of i   .To parametrize , Pourahmadi [1] first proposed to decompose it as i i i i .The lower triangular matrix i is unique with 1's on its diagonal and the below diagonal entries of i T are the negative autoregressive parameters The diagonal entries of are the innovation variances as According to the idea of the proposed decomposition in Rothman et al. [6], we let i i , a lower triangular matrix with 1's on its diagonal, we can write i i i i .We actually use a new statistically meaningful representation that reparameterizes the covariance matrices by the modified Cholesky decomposition advocated by Rothman et al. [6].The entries ijk in i can be interpreted as the moving average coefficients in   g Here  is a monotone and differentiable known link function, and ij x , ijk and ij are the p × 1, q × 1 and d × 1 vectors of covariates, respectively.The covariate ij z h x and ij are the usual covariates used in regres- sion analysis, while ijk is usually taken as a polynomial of time difference . We further refer to  as moving average coefficients and  as innovation coefficients.In this paper we assume that the covariates ij x , ijk and ij may be of high dimension and we would select the important subsets of the covariates ij z h x , ijk and ij , simultaneously.We first assume all the explanatory variables of interest, and perhaps their interactions as well, are already included into the initial models.Then, we aim to remove the unnecessary explanatory variables from the models.z h

Penalized Maximum Likelihood for JMVGLRM
Many traditional variable selection criteria can be considered as a penalized likelihood which balances modelling biases and estimation variances [11].Let     denote the log-likelihood function.For the JMVGLRM, we propose the penalized likelihood function where , , , , ; , , is a given penalty function with the tuning parameter .The tuning parameters can be chosen by a data-driven criterion such as cross validation (CV), generalized crossvalidation (GCV) [9], or the BIC-type tuning parameter selector [16] which is described in Section 4.Here we use the same penalty function for all the regression coefficients but with different tuning parameters for the mean parameters, moving average parameters and log-innovation variances, respectively.Note that the penalty functions and tuning parameters are not necessarily the same for all the parameters.For example, we wish to keep some important variables in the final model and therefore do not want to penalize their coefficients.In this paper, we use the smoothly clipped absolute deviation (SCAD) penalty whose first derivative satisfies for some [11].Following the convention in [11], we set in our work.The SCAD penalty is a spline function on an interval near zero and constant outside, so that it can shrink small value of an estimate to zero while having no impact on a large one.
The penalized maximum likelihood estimator of  , denoted by  , maximizes the function   With appropriate penalty functions, maximizing   L  with respect to  leads to certain parameter estimators vanishing from the initial models so that the corresponding explanatory variables are automatically removed.
Hence, through maximizing   L  we achieve the goal of selecting important variables and obtaining the parameter estimators, simultaneously.In Section 4, we provide the technical details and an algorithm for calculating the penalized maximum likelihood estimator  .

Asymptotic Properties
We next study the asymptotic properties of the resulting penalized likelihood estimate.We first introduce some notations.Let 0  denote the true values of  .Furthermore, let For ease of presentation and without loss of generality, it is assumed that 0  consists of all nonzero components of Here we denote  as n  to emphasize its dependence on sample size .n    , depending on whether To obtain the asymptotic properties in the paper, we require the following regularity conditions: (C1): The covariate vectors ij ijk x z i m and ij h are fixed.Also, for each subject the number of repeated measurements, , is fixed (C2): The parameter space is compact and the true value 0 is in the interior of the parameter space. (C3): The design matrices i X and i H in the joint models are all bounded, meaning that all the elements of the matrices are bounded by a single finite real number.
Theorem 1 Assume   . Under the conditions (C1)-(C3), with probability tending to 1 there must exist a local maximizer n  of the penalized likelihood function The following theorem gives the asymptotic normality property of ˆn  .Let 1 where 0 j is the jth component of .
with probability tending to 1.
where "   " stands for the convergence in distribution;   The proofs of the Theorems 1 and 2 are similar to [11].To save space, the proofs are omitted.

Because
 is irregular at the origin, the commonly used gradient method is not applicable.Now, we develop an iterative algorithm based on the local quadratic approximation of the penalty function as in [11].
Also, given an initial value 0 we can approximate the penalty function by a quadratic function [11] for .

 
L  leads to a solution iterated by Secondly, as the data are normally distributed the Therefore, the resulting score functions are is the derivative of the inverse of the link function    1 g   , and , where 11 1 Finally, by using the Fisher information matrix to approximate the observed information matrix, the following algorithm summarizes the computation of penalized maximum likelihood estimators of the parameters in JMVGLRM.
Step 2. Given the current values Step 3. Repeat Step 2 above until certain convergence criteria are satisfied.

Choosing the Tuning Parameters
The penalty function l  involves the tuning parameters that controls the amount of penalty.Many selection criteria, such as CV, GCV, AIC and BIC selection can be used to select the tuning parameters.Wang et al. [16] suggested using the BIC for the SCAD estimator in linear models and partially linear models, and proved its model selection consistency property, i.e., the optimal parameter chosen by BIC can identify the true model with probability tending to one.Hence, we use their suggestion throughout this paper.So the BIC will be used to choose the optimal . Nevertheless, in or real application, how to simultaneously select a total of s shrinkage parameters   , 1, , i s To bypass this difficulty, we follow the idea of [12,16,17], and simplify the tuning parameters as 1)       can be selected according to the fol- lowing BIC-type criterion is simply the number of nonzero co-where   .

efficients of
From our simulation study, we found that this method works well.

Simulation Studies
In this section we conduct simulation studies to assess the small sample performance of the proposed procedures.We consider the sample size n = 100, 200, and 400 respectively.Each subject is supposed to be measured by i times with i .In the simulation study, 1000 repetitions of random samples are generated by using the above data generation procedure.For each simulated data set, the proposed variable selection procedures for finding out penalized maximum likelihood estimators with SCAD and adaptive lasso (ALASSO) penalty functions [17] are considered.The unknown tuning parameters , ,

Example 1: Linear Mean Model for JMVGLRM
In this example, we first consider the linear model for mean parameters as a special JMVGLRM.We choose the true values of the mean parameters, moving average parameters and log-innovation variances to be with   , 2 0.5 , and , , , respectively, while the remaining coefficients, corresponding to the irrelevant variables, are given by zeros.In the models with x is generated from a multivariate normal distribution with mean zero, marginal variance 1 and all correlations 0.5.We take and The average number of the estimated zero coefficients for the parametric components, with 1000 simulation runs, is reported in Table 1.Note that "Correct" in Table 1 means the average number of zero regression coefficients that are correctly estimated as zero, and "Incorrect" depicts the average number of non-zero regression coefficients that are erroneously set to zero.
From Table 1, we can make the following observations.Firstly, the performances of variable selection procedures with different penalty functions become better and better as n increases.For example, the values in the column labeled "Correct" become more and more closer to the true number of zero regression coefficients in the models.Secondly, the SCAD and ALASSO penalty methods perform similarly in the sense of correct variable selection rate, which significantly reduces the model uncertainty and complexity.Thirdly, for the designed settings, the overall performance of the variable selection procedure is satisfactory.
Next, we compare the two decomposition methods under two data generating processes, autoregressive (AR) decomposition [1] and moving average (MA) decomposition [6].The main measurements for comparison are differences between the fitted mean i  and the true mean i  , and the fitted covariance matrix i  to the true i  .In particular, we define two relative errors as Here A denotes the largest singular value of A. We compute the averages of these two relative errors for 1000 replications with n = 100 and 200.Table 2 gives the averages of relative errors for the MA decomposition and AR decomposition, when the data are generated from our model under different true covariance matrix.In Ta- ble 2, "MA.data" ("AR.data")means that the true covariance matrix follows the moving average structure (autoregressive structure)."MA.fit" ("AR.fit")means we We use the settings in example 1 to assess the performance of the proposed variable selection procedures, and the simulation results are reported in Table 3.
The results in Table 3 show that under different sample size, the proposed variable selection methods have the desired performance, which is substantively similar to the previous example.

Example 3: High-Dimensional Setup for JMVGLRM
In this example, we discuss how the proposed variable selection procedures can be applied to the "large n, diverging s" setup for JMVGLRM.We consider the following high-dimensional logistic mean model in JMVGLRM: where 0  is a p-dimensional vector of parameters with  is a q-dimensional vector of parameters with  denotes the largest integer not greater than u.In addition, x with is generated from a multivariate normal distribution with mean zero, marginal variance 1 and all correlations 0.5.We take   The summary of simulation results are reported in Table 4.
It is easy to see from Table 4 that, the proposed variable selection method is able to correctly identify the true submodel, and works remarkably well, even if it is the "large n, diverging s" setup for JMVGLRM.
the first two derivatives of the log-likelihood function    are continuous.Around a given point 0  , the log-likelihood function can be approximated by ith element, jth element and kth element of the unpenalized estimate penalty functions are chosen by BIC criterion in the simulation.The performance of estimator   1, 2, 3 l  ˆ ,  and  will be assessed by the mean square error (MSE), defined as Copyright © 2013 SciRes.OJS n = 100, 200 and 400, and    0 m denotes a m-vector of 0's.Using these values, the mean i  and covariance matrix i  are constructed through the modified Cholesky decomposition described in Section 2.Then, the responses i are then drawn from the multivariate normal distribution y Funding Project of Science and Technology