Concave Group Selection of Nonparametric Additive Accelerated Failure Time Model
1. Introduction
With the development of the Internet, high-dimensional data are widely collected, especially in medical research and finance, where the outcomes or responses are often censored; the study of high-dimensional censored data is therefore meaningful. However, because of the "curse of dimensionality", the analysis of high-dimensional data becomes extremely difficult, and special methods must be adopted. As the number of dimensions increases, the performance of high-dimensional data structures declines rapidly. In low-dimensional spaces we often use the Euclidean distance to measure the similarity between observations, but in high-dimensional spaces this notion of similarity breaks down, which makes mining high-dimensional data severely challenging. On the one hand, the performance of index-based data mining algorithms deteriorates; on the other hand, many mining methods based on distances over the whole space fail. By reducing the number of dimensions, the data can be mapped from a high-dimensional to a low-dimensional space, where low-dimensional processing methods apply. Therefore, the study of effective dimensionality reduction methods is significant in statistics.
In many studies, the main outcomes or responses of survival data are censored. Survival analysis is another important theme of statistics and has been widely used in medical research and finance, so the study of survival data has attracted much attention. The Cox proportional hazards (PH) model [1] is the most commonly used regression model for survival data. An alternative to the PH model is the accelerated failure time (AFT) model, which directly relates the logarithm of the failure time to the covariates; it resembles the traditional linear model and is easier to interpret than the PH model. [2] considers both the Lasso and threshold-gradient-directed regularization for estimation and variable selection in high-dimensional AFT models. [3] uses partial least squares (PLS) and Lasso methods to select variables in AFT models with high-dimensional covariates. [4] proposed a robust weighted least absolute deviation method to estimate the high-dimensional AFT model. [5] uses the COSSO penalty for variable selection in the nonparametric AFT model. [6] uses a reproducing kernel Hilbert space norm penalty for estimation in the high-dimensional nonparametric AFT model and proposes a new boosting algorithm for censored time data; the algorithm is suitable for fitting parametric accelerated failure time models. [7] studied the elastic net method for variable selection under the Cox proportional hazards model and the AFT model with high-dimensional covariates. [8] developed a robust prediction model for event-time outcomes through LASSO regularization, aimed at the Gehan estimator of the AFT model with high-dimensional predictors. [9] extends rank-based Lasso estimation to estimation and variable selection in the high-dimensional partially linear accelerated failure time model. [10] uses the bridge penalty for regularized estimation and parameter selection in high-dimensional AFT models.
Based on the high-dimensional semiparametric accelerated failure time model, [11] proposed a doubly penalized Buckley-James method, which performs variable selection and parameter estimation at the same time. [12] developed a method for fast variable selection and shrinkage estimation in AFT models with high-dimensional predictors; the model is related to the relevance vector machine (RVM) and relies on maximum a posteriori estimation to obtain sparse estimates quickly. [13] proposes a semiparametric regression model whose covariate effects contain parametric and nonparametric parts; the parametric covariates are selected by an iterative LASSO method, the nonparametric components are estimated using the sieve method [14], and, based on Kullback-Leibler geometry [15], an empirical model selection tool for the nonparametric components is obtained. However, some theoretical issues remain unresolved. [16] considers estimation and variable selection with LASSO and MCP in AFT models with high-dimensional covariates. [17] implements L1/2 regularization in the high-dimensional AFT model for variable selection. [18] proposed a covariate-adjusted screening and variable selection procedure under the accelerated failure time model, which appropriately adjusts for low-dimensional confounding factors to achieve more accurate estimation of the regression coefficients. [19] proposed the adaptive elastic net and weighted elastic net for variable selection in the AFT model with censored, high-dimensional data. [20] proposed applying a tensor recurrent neural network architecture to extract latent representations from entire patient medical records for the high-dimensional AFT model.
[21] considers a novel Sparse L2 Boosting algorithm for model prediction and variable selection in a semiparametric varying-coefficient accelerated failure time model with right-censored survival data and high-dimensional covariates. [22] developed a variable selection method for AFT models with high-dimensional predictors, consisting of a set of algorithms that combine two widely used techniques for variable selection in survival analysis: the Buckley-James method and the Dantzig selector.
In this article, based on potential predictors, we apply the GMCP (Group Minimax Concave Penalty; MCP, [23]) method for the first time to a high-dimensional nonparametric accelerated failure time additive regression model (2.1). The weighted least squares solution of the model under the GMCP penalty is given. We also derive the group coordinate descent algorithm used to compute the GMCP estimate in this model. Our simulation results show that weighted least squares estimation with the GMCP penalty works well in the high-dimensional nonparametric accelerated failure time additive regression model and is superior to the GLasso (Group Least Absolute Shrinkage and Selection Operator) penalty.
The rest of the paper is organized as follows. In Section 2, we describe the nonparametric accelerated failure time additive regression (NP-AFT-AR) model and our methods. In Section 3, we give the asymptotic oracle property of the GMCP estimator. Simulation results are presented in Section 4, and an application to real data in Section 5. Section 6 concludes.
2. Models and Methods
2.1. Model
In this paper, we study the following nonparametric accelerated failure time additive regression (NP-AFT-AR) model to describe the relationship between the predictors or covariates $X_j$ and the failure time $T$:

$$\log T = \mu + \sum_{j=1}^{p} f_j(X_j) + \varepsilon, \qquad (2.1)$$

where $\mu$ is the intercept, $X = (X_1, \ldots, X_p)^{T}$ is a $p \times 1$ vector of covariates, the $f_j$'s are unknown smooth functions with zero means, i.e., $E f_j(X_j) = 0$, $j = 1, \ldots, p$, and $\varepsilon$ is the random error term with mean zero and a finite variance $\sigma^2$. We consider the setting where the sample size is small relative to the dimension, assuming that some of the additive components $f_j$ are zero. The main purpose of our research is to separate the nonzero components from the zero components; the second goal is to find the specific functional form of the nonzero components in order to propose a more parsimonious model. In this study, we apply the GMCP penalty in the proposed NP-AFT-AR model for component selection and estimation. We use B-splines to parameterize the nonparametric components and then invoke the inverse probability-of-censoring weighted least squares method to achieve these goals. We treat the spline approximation of each component as a group of variables subject to selection. With the GMCP penalty approach, we show that the proposed method can select significant component functions by choosing the nonzero spline basis functions.
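As a concrete illustration of the data structure behind model (2.1), the following sketch generates toy data with two nonzero additive components and uniform right censoring. The component functions, parameter values, and the name `simulate_np_aft` are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_np_aft(n=100, p=50, censor_upper=6.0):
    """Toy data from log T = mu + sum_j f_j(X_j) + eps with right censoring.
    Only the first two components are nonzero (illustrative choices)."""
    X = rng.uniform(0, 1, size=(n, p))
    f1 = np.sin(2 * np.pi * X[:, 0]); f1 -= f1.mean()   # centered component
    f2 = X[:, 1] ** 2;                f2 -= f2.mean()   # centered component
    log_T = 1.0 + f1 + f2 + rng.normal(0, 0.5, n)       # mu = 1 (assumed)
    log_C = rng.uniform(0, censor_upper, n)             # censoring times
    Y = np.minimum(log_T, log_C)                        # observed log time
    delta = (log_T <= log_C).astype(int)                # event indicator
    return X, Y, delta

X, Y, delta = simulate_np_aft()
```

The observed triplets `(Y, delta, X)` are exactly the data form used in Section 2.2 below.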
2.2. Weighted Least Squares Estimation
We define $T_i$ as the $i$th subject's survival time, let $C_i$ denote the censoring time, and let $\delta_i = 1\{T_i \le C_i\}$ denote the event indicator, which takes value 1 if the event time is observed and 0 if it is censored. Define $Y_i = \min(T_i, C_i)$ as the minimum of the survival time and the censoring time. Then the observed data are of the form $(Y_i, \delta_i, X_i)$, $i = 1, \ldots, n$, which are assumed to be an independent and identically distributed (i.i.d.) sample from $(Y, \delta, X)$.

Let $Y_{(1)} \le \cdots \le Y_{(n)}$ be the order statistics of the $Y_i$'s, and let $\delta_{(1)}, \ldots, \delta_{(n)}$ and $X_{(1)}, \ldots, X_{(n)}$ be the associated censoring indicators and covariates. Let $F$ be the distribution of $T$ and $\hat F_n$ its Kaplan-Meier estimator $\hat F_n(y) = \sum_{i=1}^{n} w_{ni} 1\{Y_{(i)} \le y\}$, where the $w_{ni}$'s are Kaplan-Meier weights ([24]) calculated by

$$w_{n1} = \frac{\delta_{(1)}}{n}, \qquad w_{ni} = \frac{\delta_{(i)}}{n - i + 1} \prod_{j=1}^{i-1} \left( \frac{n-j}{n-j+1} \right)^{\delta_{(j)}}, \quad i = 2, \ldots, n.$$

[4] showed that the weights $w_{ni}$ are the jumps in the Kaplan-Meier estimator. They are equivalent to the inverse probability-of-censoring weights ([25] [26]), where the weighting uses $\hat G_n$, the Kaplan-Meier estimator of $G$, the distribution function of $C$. Stute's weighted least squares loss function for the NP-AFT-AR model (2.1) is defined as

$$L_n(\mu, f) = \frac{1}{2} \sum_{i=1}^{n} w_{ni} \Big( \log Y_{(i)} - \mu - \sum_{j=1}^{p} f_j(X_{(i)j}) \Big)^2. \qquad (2.2)$$
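The Kaplan-Meier weights can be computed directly from the sorted data using the standard product formula $w_{ni} = \delta_{(i)}/(n-i+1) \prod_{j<i} ((n-j)/(n-j+1))^{\delta_{(j)}}$. A minimal sketch (the function name and interface are our own):

```python
import numpy as np

def kaplan_meier_weights(y, delta):
    """Stute/Kaplan-Meier weights for the sample sorted by y.
    Returns (weights, sort order); censored observations get weight 0."""
    order = np.argsort(y)
    d = delta[order].astype(float)
    n = len(y)
    i = np.arange(1, n + 1)
    # factors ((n-j)/(n-j+1))^delta_(j); the product for w_1 is empty
    factors = ((n - i) / (n - i + 1.0)) ** d
    cumprod = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    w = d / (n - i + 1.0) * cumprod
    return w, order
```

As a sanity check, when no observation is censored every weight reduces to $1/n$, the ordinary least squares case.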
Here, we use B-spline basis functions to approximate the unknown functions $f_j$. For each functional component, assume that the support of $X_j$ is a bounded interval. The basis functions are determined by the order $l$ (degree $l - 1$) and the number of interior knots $m$, so the total number of B-spline basis functions for each functional component is $m_n = m + l$. For identifiability, since $E f_j(X_j) = 0$, we take the total number of basis functions to be $d_n = m_n - 1$ only and center all the basis functions at their means. Then the B-spline approximation of each functional component is

$$f_j(x) \approx \sum_{k=1}^{d_n} \beta_{jk} \psi_{jk}(x),$$

where the $\psi_{jk}$'s are the (centered) B-spline basis functions and $\beta_j = (\beta_{j1}, \ldots, \beta_{jd_n})^{T}$ is the corresponding coefficient parameter vector. Let $Z_j$ denote the $n \times d_n$ design matrix of the B-spline basis of the $j$th predictor and let $z_{ij}$ be its $i$th row vector corresponding to the sorted data. Denote the $n \times p d_n$ design matrix as $Z = (Z_1, \ldots, Z_p)$, the $i$th row of $Z$ as $z_i$, and the corresponding parameter vector as $\beta = (\beta_1^{T}, \ldots, \beta_p^{T})^{T}$. Then we have

$$\mu + \sum_{j=1}^{p} f_j(X_{(i)j}) \approx \mu + z_i^{T} \beta. \qquad (2.3)$$

By plugging Equation (2.3) into Equation (2.2), we get the new loss function

$$L_n(\mu, \beta) = \frac{1}{2} \sum_{i=1}^{n} w_{ni} \big( \log Y_{(i)} - \mu - z_i^{T} \beta \big)^2. \qquad (2.4)$$

By centering $\log Y_{(i)}$ and $z_i$ with their $w_n$-weighted means, the intercept becomes 0. Denote $\tilde y_i = \sqrt{w_{ni}} (\log Y_{(i)} - \bar y_w)$ and $\tilde z_i = \sqrt{w_{ni}} (z_i - \bar z_w)$, where $\bar y_w = \sum_i w_{ni} \log Y_{(i)} / \sum_i w_{ni}$ and $\bar z_w = \sum_i w_{ni} z_i / \sum_i w_{ni}$. Let $\|v\|_2$ denote the $L_2$ norm of any vector $v$. For simplicity, we write $\tilde y = (\tilde y_1, \ldots, \tilde y_n)^{T}$ and $\tilde Z = (\tilde z_1, \ldots, \tilde z_n)^{T}$. Then we can rewrite Stute's weighted least squares loss function (2.4) as

$$L_n(\beta) = \frac{1}{2} \| \tilde y - \tilde Z \beta \|_2^2. \qquad (2.5)$$
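A centered B-spline design matrix of the kind described above can be built, for instance, with SciPy. This sketch assumes evenly spaced interior knots on the observed range, drops one basis function and centers the columns for identifiability; the function name is our own.

```python
import numpy as np
from scipy.interpolate import BSpline

def centered_bspline_design(x, n_interior=5, degree=3):
    """n x d_n design matrix of B-splines with evenly spaced interior knots,
    one column dropped and all columns mean-centered (identifiability)."""
    a, b = float(x.min()), float(x.max())
    interior = np.linspace(a, b, n_interior + 2)[1:-1]
    # clamped knot vector: degree+1 copies of each boundary knot
    t = np.r_[[a] * (degree + 1), interior, [b] * (degree + 1)]
    n_basis = len(t) - degree - 1      # = n_interior + degree + 1
    B = np.empty((len(x), n_basis))
    for k in range(n_basis):
        c = np.zeros(n_basis)
        c[k] = 1.0
        B[:, k] = BSpline(t, c, degree)(x)   # evaluate k-th basis function
    B = B[:, :-1]                      # drop one basis function
    return B - B.mean(axis=0)          # center each column at its mean
```

With cubic splines (`degree=3`) and five interior knots this yields $m_n = 9$ raw basis functions and $d_n = 8$ columns, matching the choice used in Section 4.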
2.3. Weighted Least Squares Estimation with the GMCP Penalty

The B-spline approximation of the unknown functions turns the nonparametric regression into a parametric regression, which makes variable selection and parameter estimation easier to solve. Meanwhile, the grouped variables in $Z_j$, i.e., the columns of $Z_j$ for each $j$, are all derived from the single covariate $X_j$, so we can treat the B-spline basis functions of each nonparametric function $f_j$ as one group. Instead of selecting the significant nonparametric functions directly, our task converts to choosing the significant groups of B-spline basis functions from $Z$, or equivalently the nonzero coefficient groups in $\beta$.
In order to carry out variable selection at the group and individual variable levels simultaneously, we adopt the GMCP penalty. In our case, the GMCP penalty function is

$$P_{\lambda, \gamma}(\beta) = \sum_{j=1}^{p} \rho\big( \| \beta_j \|_2 ; \lambda, \gamma \big), \qquad (2.6)$$

where $\gamma$ is a parameter that controls the concavity of $\rho$ and $\lambda$ is the penalty parameter. Here $\rho(t; \lambda, \gamma) = \lambda \int_0^{t} \big( 1 - x/(\gamma\lambda) \big)_{+} \, dx$. We require $\lambda \ge 0$ and $\gamma > 1$. The term MCP comes from the fact that it minimizes the maximum concavity measure defined at (2.2) of [23], subject to conditions on unbiasedness and the selection feature. The MCP can be easily understood by considering its derivative

$$\dot\rho(t; \lambda, \gamma) = \big( \lambda - t/\gamma \big)_{+}, \quad t \ge 0, \qquad (2.7)$$

where $(x)_{+} = \max(x, 0)$ and, for any vector $v$, $\|v\|_1 = \sum_k |v_k|$ denotes the $L_1$ norm; $\lambda$ is the penalty tuning parameter and $\gamma > 1$. In our case, each $\beta_j$ represents the coefficients of the $j$th group of basis functions; the basis functions for one nonparametric function $f_j$ may differ from those for another function $f_k$, and when $j \ne k$ we assume there is no overlap between groups. Now, combining the objective function in Equation (2.5) and the penalty function in Equation (2.6), we have the penalized weighted least squares objective function for the proposed NP-AFT-AR model:

$$Q(\beta; \lambda, \gamma) = \frac{1}{2} \| \tilde y - \tilde Z \beta \|_2^2 + \sum_{j=1}^{p} \rho\big( \| \beta_j \|_2 ; \lambda, \gamma \big). \qquad (2.8)$$
We can conduct group or component selection and estimation by minimizing $Q(\beta; \lambda, \gamma)$: if $\hat\beta_j = 0$, the function component $f_j$ is deleted; otherwise, it is selected. Further, the individual basis functions within a group can be selected.
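The penalty and its derivative can be written down directly. A small sketch, assuming the standard MCP form $\rho(t;\lambda,\gamma)=\lambda\int_0^t (1-x/(\gamma\lambda))_+\,dx$ (the helper names are our own):

```python
import numpy as np

def mcp(t, lam, gamma):
    """MCP: lam*t - t^2/(2*gamma) for t <= gamma*lam, constant afterwards."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_grad(t, lam, gamma):
    """Derivative (lam - t/gamma)_+ for t >= 0: Lasso-like near 0, flat beyond."""
    return np.maximum(lam - np.asarray(t, dtype=float) / gamma, 0.0)
```

The derivative makes the key feature visible: the penalization rate starts at $\lambda$ and decreases linearly to 0 at $t = \gamma\lambda$, beyond which large coefficients are left unpenalized.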
2.4. Computation
We derive a group coordinate descent algorithm for computing the GMCP estimator $\hat\beta$. This algorithm is a natural extension of the standard coordinate descent algorithm ([27]). It has also been used to compute penalized estimates based on concave penalty functions ([28]).

The group coordinate descent algorithm optimizes a target function with respect to a single group at a time, iteratively cycling through all groups until convergence is reached. It is particularly suitable for computing the GMCP estimate, since the solution has a simple closed-form expression in a single-group model; see (2.11) below.

We write $\tilde Z_j^{T} \tilde Z_j = R_j^{T} R_j$ for a $d_n \times d_n$ upper triangular matrix $R_j$ via the Cholesky decomposition. Let $b_j = R_j \beta_j$ and $U_j = \tilde Z_j R_j^{-1}$. Simple algebra shows that

$$\frac{1}{2} \| \tilde y - \tilde Z_j \beta_j \|_2^2 = \frac{1}{2} \| \tilde y - U_j b_j \|_2^2. \qquad (2.9)$$

Note that $U_j^{T} U_j = I_{d_n}$, and

$$\hat\beta_j = R_j^{-1} \hat b_j. \qquad (2.10)$$

Let $u = U_j^{T} \tilde y$. For $\gamma > 1$, it can be verified that the value that minimizes the single-group objective $\frac{1}{2}\| \tilde y - U_j b \|_2^2 + \rho(\|b\|_2; \lambda, \gamma)$ is

$$M(u; \lambda, \gamma) = \begin{cases} \dfrac{\gamma}{\gamma - 1} \Big( 1 - \dfrac{\lambda}{\|u\|_2} \Big)_{+} u, & \text{if } \|u\|_2 \le \gamma\lambda, \\[4pt] u, & \text{if } \|u\|_2 > \gamma\lambda. \end{cases} \qquad (2.11)$$

In particular, when $\gamma \to \infty$, we have $M(u; \lambda, \gamma) \to (1 - \lambda/\|u\|_2)_{+} u$, which is the GLasso estimate for a single-group model ([29]).
The group coordinate descent algorithm can now be implemented as follows. Suppose the current values of the group parameters $b_k$, $k \ne j$, are given. We want to minimize the objective with respect to $b_j$. Let

$$\tilde y_{-j} = \tilde y - \sum_{k \ne j} U_k b_k \qquad (2.12)$$

denote the partial residual, where $U_k = \tilde Z_k R_k^{-1}$ is the orthonormalized group design matrix from the Cholesky factorization $\tilde Z_k^{T} \tilde Z_k = R_k^{T} R_k$, and write the single-group objective as $\frac{1}{2}\| \tilde y_{-j} - U_j b_j \|_2^2 + \rho(\| b_j \|_2; \lambda, \gamma)$. Let $\hat b_j$ be its minimizer. Since $U_j^{T} U_j = I_{d_n}$, we have $\hat b_j = M(U_j^{T} \tilde y_{-j}; \lambda, \gamma)$, where $M$ is defined in (2.11).

For any given $(\lambda, \gamma)$, we use (2.11) to cycle through one group at a time. Let $b^{(0)} = (b_1^{(0)T}, \ldots, b_p^{(0)T})^{T}$ be the initial value. The proposed coordinate descent algorithm is as follows.

Initialize the vector of residuals $r = \tilde y - \sum_{j=1}^{p} U_j b_j^{(0)}$. For $s = 0, 1, 2, \ldots$, carry out the following calculation until convergence. For $j = 1, \ldots, p$, repeat the following steps.

Step 1: Calculate $u_j = U_j^{T} r + b_j^{(s)}$.

Step 2: Update $b_j^{(s+1)} = M(u_j; \lambda, \gamma)$.

Step 3: Update $r \leftarrow r - U_j \big( b_j^{(s+1)} - b_j^{(s)} \big)$.
The last step ensures that r holds the current values of the residuals. Although the objective function is not necessarily convex, it is convex with respect to a single group when the coefficients of all the other groups are fixed.
3. Asymptotic Oracle Properties of GMCP
Let $|A|$ denote the cardinality of any set $A \subseteq \{1, \ldots, p\}$, and let $A_0 = \{ j : f_j \ne 0 \}$ denote the index set of the nonzero components. Here $Z_A$ denotes the $n \times d_n|A|$-dimensional sub-design matrix corresponding to the variables in $A$. We make the following assumptions.
Similar to [5], we assume:

(C1) $T$ and $C$ are independent.

(C2) $P(T \le C \mid T, X) = P(T \le C \mid T)$.

(C3) $\log T$ and the covariates have finite second moments.

(C4) Denote $\tau_T$ and $\tau_C$ as the least upper bounds of the supports of $T$ and $C$, respectively. Then $\tau_T < \tau_C$, or $\tau_T = \tau_C$ with the event-observation probability bounded away from zero.

(C5) The class of candidate component functions has a finite envelope function.

(C6) $E f_j(X_j) = 0$ for $j = 1, \ldots, p$.

These assumptions correspond to the conditions in [30]. In the random censorship model, (C1) is a basic assumption. (C2) states that, given the failure time $T$, the censoring indicator is independent of the covariates. (C3) provides the second moments needed in least squares estimation. (C4) assumes the probability of an event being observed is greater than zero, which guarantees the consistency of the estimator. (C5) is a fundamental condition for the consistency and convergence rate in the proofs and is used in the entropy calculation. (C6) guarantees that the additive decomposition is identifiable.
(C7) (Sparse Riesz condition) There exist constants $0 < c_* \le c^* < \infty$ and an integer $q$ such that

$$c_* \le \frac{\| Z_A u \|_2^2}{n \| u \|_2^2} \le c^*$$

for every $A$ with $|A| \le q$ and every nonzero vector $u$ of matching dimension.

(C8) There is a small constant $\eta \ge 0$ such that $\sum_{j \notin A_0} \| f_j \|_2 \le \eta$.

(C9) The random errors $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed with mean zero and variance $\sigma^2$; moreover, the tail probabilities satisfy $P(|\varepsilon_i| > x) \le C \exp(-K x^2)$ for all $x \ge 0$ and some constants $C$ and $K$.

(C10) There exists a positive constant $M$ such that the predictors satisfy $\max_{1 \le j \le p} |X_j| \le M$.

(C7) is the sparse Riesz condition (SRC) formulated for the nonparametric AFT model (2.1), which controls the range of the eigenvalues of submatrices of $Z$. This condition was introduced by [31] to study the properties of the Lasso in the linear regression model. (C8) assumes that the unimportant components are small in the $L_2$ sense, but need not be exactly zero. If $\eta = 0$, (C8) becomes $f_j = 0$ for all $j \notin A_0$, and the problem of variable selection is equivalent to distinguishing nonzero functions from zero functions. (C9) assumes that the distribution of the error terms has sub-Gaussian tails; this condition holds when the error distribution is normal. (C10) assumes that all the predictors are uniformly bounded, which is satisfied in many practical situations.
In this subsection, we simply write $\hat\beta$ for the GMCP estimator. Let $A_0 = \{ j : \| \beta_{0j} \|_2 \ne 0 \}$ and set $\hat\beta^{o} = 0$ if $A_0$ is empty. Define

$$\hat\beta^{o} = \arg\min_{\beta} \Big\{ \tfrac{1}{2} \| \tilde y - \tilde Z \beta \|_2^2 : \beta_j = 0 \text{ for } j \notin A_0 \Big\}. \qquad (3.1)$$

This is the oracle least squares estimator. Of course, it is not a real estimator, since the oracle set is unknown.

We first consider the case where the 2-norm GMCP objective function is convex. This necessarily requires $\gamma c_{\min} > 1$, where $c_{\min}$ is the smallest eigenvalue of $\tilde Z^{T} \tilde Z / n$. As in [32], define the function $h(t, k)$ through

(3.2)

This function arises from an upper bound for the tail probabilities of the chi-square distributions given in Lemma A.2 in the Appendix. It is derived from an exponential inequality for chi-square random variables of [33].
Theorem 3.1. Suppose $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed as in (C9) and that (C1)-(C10) hold. Then, for any $(\lambda, \gamma)$ satisfying $\gamma c_{\min} > 1$, the probability that the GMCP estimator differs from the oracle least squares estimator is bounded by sums of chi-square tail terms expressed through the function $h$.

We give the proof of Theorem 3.1 in the Appendix. It provides an upper bound on the probability that $\hat\beta$ is not equal to the oracle estimator in terms of the tail probability function $h$ in (3.2). The key condition $\gamma c_{\min} > 1$ ensures that the 2-norm GMCP criterion is strictly convex. Nonetheless, this result is a starting point for a similar result in the high-dimensional case. The following corollary is an immediate consequence of Theorem 3.1.
Corollary 1. Suppose that the conditions of Theorem 3.1 are satisfied, and suppose in addition that the tail bounds of Theorem 3.1 vanish as $n \to \infty$. Then $P(\hat\beta = \hat\beta^{o}) \to 1$.

By Corollary 1, the 2-norm GMCP estimator equals the oracle least squares estimator with probability converging to one, which implies that it is group selection consistent. We now consider the high-dimensional case where $p$ may be much larger than $n$. Under condition (C7), define the quantities used in the following theorem.
Theorem 3.2. Suppose the errors are independent and identically distributed as in (C9) and that $\tilde Z$ satisfies the sparse Riesz condition in (C7). Then, for $(\lambda, \gamma)$ satisfying conditions analogous to those of Theorem 3.1, the probability that the GMCP estimator differs from the oracle estimator obeys upper bounds of the same form.
Corollary 2. Suppose that the conditions of Theorem 3.2 are satisfied, and suppose in addition that the corresponding tail bounds vanish as $n \to \infty$. Then $P(\hat\beta = \hat\beta^{o}) \to 1$.
Theorem 3.2 and Corollary 2 provide sufficient conditions for the asymptotic oracle property of the global 2-norm GMCP estimator in high-dimensional situations. Here we allow the dimension to grow with the sample size, so $p$ can be greater than $n$. The convexity condition here is stronger than the corresponding condition in Theorem 3.5 of [34]. It ensures that the GMCP criterion is convex in any $q$-dimensional subspace and is stronger than the minimal sufficient condition for convexity in $q$-dimensional subspaces. This is the price we need to pay in searching for a lower-dimensional space that contains the true model.
4. Numerical Simulation
In this section, we conduct simulation studies to evaluate the performance of the GMCP and GLasso penalties in the high-dimensional NP-AFT-AR model with limited samples. We focus on comparisons of the group selection methods with the tuning parameter $\lambda$ selected by the BIC ([35]),

$$\mathrm{BIC}(\lambda) = \log\big( \mathrm{RSS}_{\lambda} / n \big) + \mathrm{df}_{\lambda} \cdot \log(n) / n,$$

where $\mathrm{RSS}_{\lambda}$ is the sum of squared residuals and $\mathrm{df}_{\lambda}$ is the number of selected variables for the given $\lambda$. The concavity parameter $\gamma$ is chosen from an increasing sequence of candidate values. For any given $\gamma$, we choose $\lambda$ from a decreasing sequence of 100 values from $\lambda_{\max}$ down to a small fraction of $\lambda_{\max}$, where $Z_j$ is the $n \times d_n$ "design" matrix corresponding to the covariate $X_j$ and $\lambda_{\max}$ is the smallest penalty value that compresses all estimated coefficients to zero.
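A BIC search over a $\lambda$ path as described above might look like the following sketch. The solver is passed in as a callable, and the function names and interface are our own reading of the text, not a verified implementation.

```python
import numpy as np

def bic_path(X_groups, y, lams, gamma=3.0, fit=None):
    """Scan a lambda path and return (best BIC, best lambda, coefficients).
    BIC(lam) = log(RSS/n) + df * log(n)/n, with df counting selected
    coefficients.  `fit(X_groups, y, lam, gamma)` returns group coefficients."""
    n = len(y)
    best = (np.inf, None, None)
    for lam in lams:
        b = fit(X_groups, y, lam, gamma)
        resid = y - sum(Xj @ bj for Xj, bj in zip(X_groups, b))
        rss = float(resid @ resid)
        df = sum(bj.size for bj in b if np.linalg.norm(bj) > 0)
        bic = np.log(max(rss, 1e-12) / n) + df * np.log(n) / n
        if bic < best[0]:
            best = (bic, lam, b)
    return best
```

Any group-penalized solver can be plugged in as `fit`, e.g. the group coordinate descent sketch from Section 2.4.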
We compute the empirical prediction mean square error (MSE) to assess estimation accuracy. Let $\hat f$ be the estimator of the regression function $f$; we define

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big( \hat f(X_i) - f(X_i) \big)^2.$$
Three scenarios are considered in the following, where some nonzero components are linear and the response variable is subject to various censoring rates. A total of 100 simulation runs is used for each setting. The logarithm of the censoring time, $\log C$, is generated from a uniform distribution $U(c_1, c_2)$, where $c_1$ and $c_2$ are determined by a Monte-Carlo method to achieve censoring rates of 35% and 40%, respectively. For example, the censoring rate is approximated by

$$\mathrm{cr} = \frac{1}{M} \sum_{m=1}^{M} 1\{ \log T_m > \log C_m \},$$

where $\log T_m$ is drawn from the proposed model (2.1), $\log C_m$ is drawn from $U(c_1, c_2)$, and $M$ is the number of Monte-Carlo runs used to compute cr. We choose $(c_1, c_2)$ so that cr is close to the desired censoring rate. To balance computational efficiency and accuracy, we use cubic B-splines with five evenly spaced interior knots for all the functions $f_j$, which gives $m_n = 9$ basis functions for each nonparametric component. Due to the identifiability constraint, the actual number of basis functions used is 8. This choice is made because our simulation studies indicated that using a larger number of knots does not improve the finite sample performance (results not shown).
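The Monte-Carlo calibration of $(c_1, c_2)$ can be sketched as a simple grid search over the upper bound; the error distribution, grid, and function name below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def calibrate_censoring(sample_log_T, target_cr, c2_grid, c1=0.0, M=5000):
    """Grid search for the Uniform(c1, c2) upper bound whose Monte-Carlo
    censoring rate cr = (1/M) * sum 1{log T_m > log C_m} is nearest target_cr."""
    draws = sample_log_T(M)                    # M draws of log T
    best_c2, best_gap = None, np.inf
    for c2 in c2_grid:
        log_C = rng.uniform(c1, c2, M)         # M draws of log C
        cr = np.mean(draws > log_C)            # empirical censoring rate
        gap = abs(cr - target_cr)
        if gap < best_gap:
            best_c2, best_gap = c2, gap
    return best_c2, best_gap

# illustrative: log T ~ N(1, 1); a larger c2 yields less censoring
c2, gap = calibrate_censoring(lambda m: rng.normal(1.0, 1.0, m), 0.35,
                              c2_grid=np.linspace(1.0, 10.0, 40))
```

In practice one would draw `log T` from the fitted scenario model (2.1) rather than the stand-in normal used here.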
4.1. Scenario 1 (Covariates Are Independent)
In this scenario, we consider independent covariates and set the intercept to a fixed value. The logarithms of the failure times, $\log T_i$, are generated from model (2.1) with six nonzero component functions, and the predictors are sampled independently from a common distribution. We consider several sample sizes to examine the performance of the proposed methods as the sample size increases. The penalty parameters are selected as described above.

The results for the GMCP, GSCAD and GLasso methods are given in Table 1 and Table 2, based on 100 replications. The columns in Table 1 include the average number of variables selected (NV), the model error (ER), the percentage of occasions on which the correct variables are included in the selected model (%IN), and the percentage of occasions on which exactly the correct variables are selected (%CS), with standard errors in parentheses. Table 2 summarizes the mean square errors for the six important functions, with standard errors in parentheses.

Several observations can be made from Tables 1-4. The model selected by the GMCP is better than the one selected by the GLasso in terms of model error, the percentage of occasions on which the true variables are selected, and the mean square errors of the important coefficient functions. The GMCP includes the correct variables with high probability. As the sample size increases, the performance of both methods improves, as expected. To examine the estimated nonparametric functions from the concave group selection methods, we plot the GMCP estimates along with the true function components in Figure 1 and Figure 2, based on one run with $n = 100$.
Table 1. Simulation results. NV, number of selected variables; ER, model error; IN%, percentage of occasions on which the correct variables are included in the selected model; CS%, percentage of occasions on which exactly correct variables are selected, averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 2. Simulation results. Mean Square errors for the important coefficient functions based on 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 3. Simulation results. NV, number of selected variables; ER, model error; IN%, percentage of occasions on which the correct variables are included in the selected model; CS%, percentage of occasions on which exactly correct variables are selected, averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 4. Simulation results. Mean Square errors for the important coefficient functions based on 100 replications. Enclosed in parentheses are the corresponding standard errors.
Figure 1. Estimated nonparametric components: the solid black line is the true function and the dotted red line is the GMCP estimate; CR = 35%.
Figure 2. Estimated nonparametric components: the solid black line is the true function and the dotted red line is the GMCP estimate; CR = 40%.
The estimated nonparametric functions fit the true functions well, which is consistent with the mean square errors for the functions reported in Table 2 and Table 4.
4.2. Scenario 2 (Covariates Are Correlated)
In this scenario, we consider correlated covariates and set the intercept to a fixed value. The logarithms of the failure times, $\log T_i$, are generated from model (2.1), where the covariates are generated as $X_j = (W_j + U)/2$ with the $W_j$'s and $U$ i.i.d. uniform random variables. This provides a design with a correlation coefficient of 0.5 between all of the covariates.
The simulation study results are reported in Tables 5-8. The conclusions for Scenario 2 are very similar to those for Scenario 1. When the censoring rate increases, the estimation and selection performance of all methods decreases. The results in Table 6 and Table 8 show that the GMCP estimator is more accurate than the GLasso estimator for both the individual component functions and the full
Table 5. Simulation results. NV, number of selected variables; ER, model error; IN%, percentage of occasions on which the correct variables are included in the selected model; CS%, percentage of occasions on which exactly correct variables are selected, averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 6. Simulation results. Mean Square errors for the important coefficient functions based on 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 7. Simulation results. NV, number of selected variables; ER, model error; IN%, percentage of occasions on which the correct variables are included in the selected model; CS%, percentage of occasions on which exactly correct variables are selected, averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 8. Simulation results. Mean Square errors for the important coefficient functions based on 100 replications. Enclosed in parentheses are the corresponding standard errors.
model, since the MSE under the GMCP approach is always smaller than that under the GLasso approach. The results in Table 5 and Table 7 show that the GMCP method conducts component selection more precisely than the GLasso method, which selects many zero component functions as nonzero. To examine the estimated nonparametric functions from the GMCP, we plot them along with the true function components in Figure 3 and Figure 4, from one run. Although the estimation and selection accuracy decrease when covariates are correlated, the estimated curves under the GMCP method remain closer to the true curves than those under the GLasso method.
5. Application to the NP-AFT-AR Model

In this section, we use the lung adenocarcinoma data of Shedden et al. (2008) to illustrate the proposed method; see [36] for details. Retrospective data on 442 lung adenocarcinoma patients were collected at multiple sites, including survival time, other clinical and demographic variables, and the expression levels of 22,283 genes measured on tumor samples. However, most genes show little variation across samples. Therefore, in our application, we randomly select 321 samples and use the first 500 and 1000 genes, and the survival rate is 35.8%.
Figure 3. Estimated nonparametric components: the solid black line is the true function, the dotted red line is the group MCP estimate, the dotted blue line is the group SCAD estimate, and the black line is the group Lasso estimate; CR = 35%.
Figure 4. Estimated nonparametric components: the solid black line is the true function, the dotted red line is the group MCP estimate, the dotted blue line is the group SCAD estimate, and the black line is the group Lasso estimate; CR = 40%.
Figure 5. Estimated functions based on GLasso and GMCP for the probe 200746_s_at using the same covariate, where the red dotted line is the GMCP estimate and the gray dotted line is the GLasso estimate.
Here, we are interested in the effect of tumor gene expression levels on the survival time of lung adenocarcinoma patients. Since the linearity assumption is questionable in high dimensions, the proposed method may be better suited to feature selection problems with nonlinear effects. In our analysis, we use the same spline basis for each gene. The proposed method under GMCP selects one gene (namely 200746_s_at). In contrast, the GLasso penalized regression selects 6 and 10 genes for the two dimensions considered, respectively.

From Figure 5, we find that the larger the dimension, the worse the GLasso estimate becomes, while the dimension has little effect on the GMCP estimate. The analysis of the real data therefore shows that the GMCP penalty outperforms the GLasso penalty, with higher accuracy at essentially the same computational cost. Under the same conditions, the GMCP method is more suitable than the GLasso.
6. Concluding Remarks
In this paper, we study the weighted least squares estimation and selection properties of the GMCP in the NP-AFT-AR model with high-dimensional data. Our simulation results show that the GLasso tends to select some unimportant variables, whereas the GMCP possesses the asymptotic oracle property and, consequently, selection consistency.
Appendix: Proofs

Lemma 1. Let $\chi_k^2$ be a random variable with a chi-square distribution with $k$ degrees of freedom. Its tail probability is bounded by the function $h$ defined in (3.2).

This lemma is a restatement of the exponential inequality for chi-square distributions of [33].
Proof of Theorem 3.1. Since $\hat\beta^{o}$ is the oracle least squares estimator, it minimizes the unpenalized loss over the groups in $A_0$ and has zero coefficients outside $A_0$. If $\| \hat\beta^{o}_j \|_2 \ge \gamma\lambda$ for all $j \in A_0$, then by the definition of the MCP the penalty term is constant in a neighborhood of $\hat\beta^{o}$. Since $\gamma c_{\min} > 1$, the criterion (2.8) is strictly convex. By the Karush-Kuhn-Tucker (KKT) conditions, the equality $\hat\beta = \hat\beta^{o}$ holds in the intersection of the events

$$\Omega_1 = \big\{ \| \hat\beta^{o}_j \|_2 \ge \gamma\lambda \text{ for all } j \in A_0 \big\} \quad \text{and} \quad \Omega_2 = \big\{ \| \tilde Z_j^{T} (\tilde y - \tilde Z \hat\beta^{o}) \|_2 \le \lambda \text{ for all } j \notin A_0 \big\}.$$

We first bound $P(\Omega_1^c)$. By (A.1) of [34], the relevant quadratic form is distributed as a chi-square random variable with the appropriate degrees of freedom, so that, for suitable thresholds,

(6.1)

where we used Lemma 1 in the third line. Now consider $P(\Omega_2^c)$. If the gradient condition holds for all $j \notin A_0$, then the KKT conditions hold outside $A_0$. Let $E_j$ be a block matrix with an identity matrix in the $j$th block and 0's elsewhere. The corresponding quadratic form is distributed as a chi-square random variable with $q$ degrees of freedom. Therefore, similarly to (6.1), we have

(6.2)

Combining (6.1) and (6.2) bounds $P( \hat\beta \ne \hat\beta^{o} )$, which completes the proof.
For any $A \subseteq \{1, \ldots, p\}$, define the projection onto the column space of the corresponding sub-design matrix.

Lemma 2. Suppose (C7) holds. Then the maximal projection of the noise onto subsets $A$ with $|A| \le q$ satisfies a tail bound of the form (6.3).

Proof. For any such $A$, the associated projection $P_A$ is a projection matrix, so the projected squared norm is distributed as a chi-square random variable with at most $d_n |A|$ degrees of freedom. Since there are $\binom{p}{|A|}$ ways to choose $A$ from $\{1, \ldots, p\}$, a union bound together with Lemma A.2 implies

(6.3)

where we used a standard bound on the binomial coefficients. This completes the proof.
Define $I$ as any set that satisfies the inclusion conditions required below.

Lemma 3. Suppose that $(\lambda, \gamma)$ satisfies the conditions of Theorem 3.2 and let $q$ be as in (C7). Then the number of groups selected by the GMCP estimator is at most $q$.

Proof. This lemma can be proved along the lines of the proof of Lemma 1 of [23] and is omitted.

Proof of Theorem 3.2. By Lemma 3, in the event

(6.4)

the original model with $p$ groups reduces to a model with at most $q$ groups. In this reduced model, the conditions of Theorem 3.2 imply the conditions of Theorem 3.1. By Lemma 2,

(6.5)

Therefore, combining (6.5) and Theorem 3.1 completes the proof of Theorem 3.2.