Concave Group Selection of Nonparametric Additive Accelerated Failure Time Model
1. Introduction
With the development of the Internet, high-dimensional data are widely collected, especially in medical research and finance, where the outcomes or responses are often censored; the study of high-dimensional censored data is therefore meaningful. However, because of the "curse of dimensionality", the analysis of high-dimensional data becomes extremely difficult, and special methods must be adopted. As the number of dimensions increases, the performance of high-dimensional data structures declines rapidly. In low-dimensional spaces we often use the Euclidean distance to measure the similarity between observations, but in high-dimensional spaces this notion of similarity breaks down, which makes mining high-dimensional data severely challenging. On the one hand, the performance of index-based data mining algorithms deteriorates; on the other hand, many mining methods based on distances over the whole space fail. By reducing the number of dimensions, the data can be mapped from a high-dimensional to a low-dimensional space, where low-dimensional processing methods apply. Therefore, the study of effective dimensionality reduction methods is significant in statistics.
In many studies, the main outcomes or responses of survival data are censored. Survival analysis is another important theme of statistics and has been widely used in medical research and finance, so the study of survival data has attracted much attention. The Cox proportional hazards (PH) model [1] is the most commonly used regression model for survival data. An alternative to the PH model is the accelerated failure time (AFT) model, which directly relates the logarithm of the failure time to the covariates; it resembles the traditional linear model and is easier to interpret than the PH model. [2] considers both the Lasso and threshold-gradient-directed regularization for estimation and variable selection in high-dimensional AFT models. [3] uses partial least squares (PLS) and Lasso methods to select variables in AFT models with high-dimensional covariates. [4] proposed a robust weighted least absolute deviation method to estimate the high-dimensional AFT model. [5] uses the COSSO penalty for variable selection in the nonparametric AFT model. [6] uses a reproducing kernel Hilbert space norm penalty for estimation in the high-dimensional nonparametric AFT model and proposes a new boosting algorithm for censored time data; the algorithm is suitable for fitting parametric accelerated failure time models. [7] studied the elastic net method for variable selection under the Cox proportional hazards model and the AFT model with high-dimensional covariates. [8] developed a robust prediction model for event-time outcomes through LASSO regularization, aimed at the Gehan estimator of the AFT model with high-dimensional predictors. [9] extends rank-based Lasso estimation to estimation and variable selection in the high-dimensional partially linear accelerated failure time model. [10] uses the bridge penalty for regularized estimation and parameter selection in high-dimensional AFT models.
Based on the high-dimensional semiparametric accelerated failure time model, [11] proposed a doubly penalized Buckley-James method, which performs variable selection and parameter estimation at the same time. [12] developed a method for fast variable selection and shrinkage estimation in AFT models with high-dimensional predictors; the model is related to the relevance vector machine (RVM) and relies on maximum a posteriori estimation to obtain sparse estimates quickly. [13] proposes a semiparametric regression model whose covariate effects contain parametric and nonparametric parts; the parametric covariates are selected by an iterative LASSO method, the nonparametric components are estimated using the sieve method [14], and, based on Kullback-Leibler geometry [15], an empirical model selection tool for the nonparametric components is obtained. However, some theoretical issues remain unresolved. [16] considers estimation and variable selection with LASSO and MCP in AFT models with high-dimensional covariates. [17] implements L1/2 regularization in the high-dimensional AFT model for variable selection. [18] proposed a covariate-adjusted screening and variable selection procedure under the accelerated failure time model, which appropriately adjusts for low-dimensional confounding factors to achieve more accurate estimation of the regression coefficients. [19] proposed the adaptive elastic net and weighted elastic net for variable selection in the AFT model with censored, high-dimensional data. [20] proposed applying a tensor recurrent neural network architecture to extract latent representations from entire patient medical records for the high-dimensional AFT model.
[21] considers a novel Sparse L2 Boosting algorithm for model prediction and variable selection in a semiparametric varying-coefficient accelerated failure time model with right-censored survival data and high-dimensional covariates. [22] developed a variable selection method for AFT models with high-dimensional predictors, consisting of a set of algorithms that combine two widely used techniques for variable selection in survival analysis: the Buckley-James method and the Dantzig selector.
In this article, based on potential predictors, we apply the GMCP (Group Minimax Concave Penalty; MCP, [23]) method for the first time to a high-dimensional nonparametric accelerated failure time additive regression model (2.1). The weighted least squares solution of the model under the GMCP penalty is given. We also derive the group coordinate descent algorithm used to compute the GMCP estimate in this model. Our simulation results show that weighted least squares estimation with the GMCP penalty works well in the high-dimensional nonparametric accelerated failure time additive regression model and is superior to the GLasso (Group Least Absolute Shrinkage and Selection Operator) penalty.
The rest of the paper is organized as follows. In Section 2, we describe the nonparametric accelerated failure time additive regression (NP-AFT-AR) model and our methods. In Section 3, we give the asymptotic oracle property of the GMCP estimator. Simulation results are presented in Section 4, and an application to real data in Section 5. Section 6 concludes.
2. Models and Methods
2.1. Model
In this paper, we study the following nonparametric accelerated failure time additive regression (NP-AFT-AR) model to describe the relationship between the predictors or covariates $X_j$ and the failure time $T$:

$$\log T = \mu + \sum_{j=1}^{p} f_j(X_j) + \varepsilon, \qquad (2.1)$$

where $\mu$ is the intercept, $X = (X_1, \ldots, X_p)^{T}$ is a $p \times 1$ vector of covariates, the $f_j$'s are unknown smooth functions with zero means, i.e., $E f_j(X_j) = 0$, $j = 1, \ldots, p$, and $\varepsilon$ is the random error term with mean zero and a finite variance $\sigma^2$. We consider the setting where the sample size is small relative to the dimension, assuming that some of the additive components $f_j$ are zero. The main purpose of our research is to separate the nonzero components from the zero components; the second goal is to find the specific functional form of the nonzero components in order to propose a more parsimonious model. In this study, we apply the GMCP penalty in the proposed NP-AFT-AR model for component selection and estimation. We use B-splines to parameterize the nonparametric components and then invoke the inverse probability-of-censoring weighted least squares method to achieve these goals. We treat the spline approximation of each component as a group of variables subject to selection. With the GMCP penalty approach, we show that the proposed method can select significant component functions by choosing the nonzero spline basis functions.
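As a concrete illustration of the data structure behind model (2.1), the following sketch generates toy data with two nonzero additive components and uniform right censoring. The component functions, parameter values, and the name `simulate_np_aft` are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_np_aft(n=100, p=50, censor_upper=6.0):
    """Toy data from log T = mu + sum_j f_j(X_j) + eps with right censoring.
    Only the first two components are nonzero (illustrative choices)."""
    X = rng.uniform(0, 1, size=(n, p))
    f1 = np.sin(2 * np.pi * X[:, 0]); f1 -= f1.mean()   # centered component
    f2 = X[:, 1] ** 2;                f2 -= f2.mean()   # centered component
    log_T = 1.0 + f1 + f2 + rng.normal(0, 0.5, n)       # mu = 1 (assumed)
    log_C = rng.uniform(0, censor_upper, n)             # censoring times
    Y = np.minimum(log_T, log_C)                        # observed log time
    delta = (log_T <= log_C).astype(int)                # event indicator
    return X, Y, delta

X, Y, delta = simulate_np_aft()
```

The observed triplets `(Y, delta, X)` are exactly the data form used in Section 2.2 below.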
2.2. Weighted Least Squares Estimation
We define $T_i$ as the $i$th subject's survival time, let $C_i$ denote the censoring time, and let $\delta_i = 1\{T_i \le C_i\}$ denote the event indicator, which takes value 1 if the event time is observed and 0 if it is censored. Define $Y_i = \min(T_i, C_i)$ as the minimum of the survival time and the censoring time. Then the observed data are of the form $(Y_i, \delta_i, X_i)$, $i = 1, \ldots, n$, which are assumed to be an independent and identically distributed (i.i.d.) sample from $(Y, \delta, X)$.

Let $Y_{(1)} \le \cdots \le Y_{(n)}$ be the order statistics of the $Y_i$'s, and let $\delta_{(1)}, \ldots, \delta_{(n)}$ and $X_{(1)}, \ldots, X_{(n)}$ be the associated censoring indicators and covariates. Let $F$ be the distribution of $T$ and $\hat F_n$ its Kaplan-Meier estimator $\hat F_n(y) = \sum_{i=1}^{n} w_{ni} 1\{Y_{(i)} \le y\}$, where the $w_{ni}$'s are Kaplan-Meier weights ([24]) calculated by

$$w_{n1} = \frac{\delta_{(1)}}{n}, \qquad w_{ni} = \frac{\delta_{(i)}}{n - i + 1} \prod_{j=1}^{i-1} \left( \frac{n-j}{n-j+1} \right)^{\delta_{(j)}}, \quad i = 2, \ldots, n.$$

[4] showed that the weights $w_{ni}$ are the jumps in the Kaplan-Meier estimator. They are equivalent to the inverse probability-of-censoring weights ([25] [26]), where the weighting uses $\hat G_n$, the Kaplan-Meier estimator of $G$, the distribution function of $C$. Stute's weighted least squares loss function for the NP-AFT-AR model (2.1) is defined as

$$L_n(\mu, f) = \frac{1}{2} \sum_{i=1}^{n} w_{ni} \Big( \log Y_{(i)} - \mu - \sum_{j=1}^{p} f_j(X_{(i)j}) \Big)^2. \qquad (2.2)$$
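The Kaplan-Meier weights can be computed directly from the sorted data using the standard product formula $w_{ni} = \delta_{(i)}/(n-i+1) \prod_{j<i} ((n-j)/(n-j+1))^{\delta_{(j)}}$. A minimal sketch (the function name and interface are our own):

```python
import numpy as np

def kaplan_meier_weights(y, delta):
    """Stute/Kaplan-Meier weights for the sample sorted by y.
    Returns (weights, sort order); censored observations get weight 0."""
    order = np.argsort(y)
    d = delta[order].astype(float)
    n = len(y)
    i = np.arange(1, n + 1)
    # factors ((n-j)/(n-j+1))^delta_(j); the product for w_1 is empty
    factors = ((n - i) / (n - i + 1.0)) ** d
    cumprod = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    w = d / (n - i + 1.0) * cumprod
    return w, order
```

As a sanity check, when no observation is censored every weight reduces to $1/n$, the ordinary least squares case.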
Here, we use B-spline basis functions to approximate the unknown functions $f_j$. For each functional component, assume that the support of $X_j$ is a bounded interval. The basis functions are determined by the order $l$ (degree $l - 1$) and the number of interior knots $m$, so the total number of B-spline basis functions for each functional component is $m_n = m + l$. For identifiability, since $E f_j(X_j) = 0$, we take the total number of basis functions to be $d_n = m_n - 1$ only and center all the basis functions at their means. Then the B-spline approximation of each functional component is

$$f_j(x) \approx \sum_{k=1}^{d_n} \beta_{jk} \psi_{jk}(x),$$

where the $\psi_{jk}$'s are the (centered) B-spline basis functions and $\beta_j = (\beta_{j1}, \ldots, \beta_{jd_n})^{T}$ is the corresponding coefficient parameter vector. Let $Z_j$ denote the $n \times d_n$ design matrix of the B-spline basis of the $j$th predictor and let $z_{ij}$ be its $i$th row vector corresponding to the sorted data. Denote the $n \times p d_n$ design matrix as $Z = (Z_1, \ldots, Z_p)$, the $i$th row of $Z$ as $z_i$, and the corresponding parameter vector as $\beta = (\beta_1^{T}, \ldots, \beta_p^{T})^{T}$. Then we have

$$\mu + \sum_{j=1}^{p} f_j(X_{(i)j}) \approx \mu + z_i^{T} \beta. \qquad (2.3)$$

By plugging Equation (2.3) into Equation (2.2), we get the new loss function

$$L_n(\mu, \beta) = \frac{1}{2} \sum_{i=1}^{n} w_{ni} \big( \log Y_{(i)} - \mu - z_i^{T} \beta \big)^2. \qquad (2.4)$$

By centering $\log Y_{(i)}$ and $z_i$ with their $w_n$-weighted means, the intercept becomes 0. Denote $\tilde y_i = \sqrt{w_{ni}} (\log Y_{(i)} - \bar y_w)$ and $\tilde z_i = \sqrt{w_{ni}} (z_i - \bar z_w)$, where $\bar y_w = \sum_i w_{ni} \log Y_{(i)} / \sum_i w_{ni}$ and $\bar z_w = \sum_i w_{ni} z_i / \sum_i w_{ni}$. Let $\|v\|_2$ denote the $L_2$ norm of any vector $v$. For simplicity, we write $\tilde y = (\tilde y_1, \ldots, \tilde y_n)^{T}$ and $\tilde Z = (\tilde z_1, \ldots, \tilde z_n)^{T}$. Then we can rewrite Stute's weighted least squares loss function (2.4) as

$$L_n(\beta) = \frac{1}{2} \| \tilde y - \tilde Z \beta \|_2^2. \qquad (2.5)$$
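A centered B-spline design matrix of the kind described above can be built, for instance, with SciPy. This sketch assumes evenly spaced interior knots on the observed range, drops one basis function and centers the columns for identifiability; the function name is our own.

```python
import numpy as np
from scipy.interpolate import BSpline

def centered_bspline_design(x, n_interior=5, degree=3):
    """n x d_n design matrix of B-splines with evenly spaced interior knots,
    one column dropped and all columns mean-centered (identifiability)."""
    a, b = float(x.min()), float(x.max())
    interior = np.linspace(a, b, n_interior + 2)[1:-1]
    # clamped knot vector: degree+1 copies of each boundary knot
    t = np.r_[[a] * (degree + 1), interior, [b] * (degree + 1)]
    n_basis = len(t) - degree - 1      # = n_interior + degree + 1
    B = np.empty((len(x), n_basis))
    for k in range(n_basis):
        c = np.zeros(n_basis)
        c[k] = 1.0
        B[:, k] = BSpline(t, c, degree)(x)   # evaluate k-th basis function
    B = B[:, :-1]                      # drop one basis function
    return B - B.mean(axis=0)          # center each column at its mean
```

With cubic splines (`degree=3`) and five interior knots this yields $m_n = 9$ raw basis functions and $d_n = 8$ columns, matching the choice used in Section 4.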
2.3. Weighted Least Squares Estimation with the GMCP Penalty

The B-spline approximation of the unknown functions turns the nonparametric regression into a parametric regression, which makes variable selection and parameter estimation easier to solve. Meanwhile, the grouped variables in $Z_j$, i.e., the columns of $Z_j$ for each $j$, are all derived from the single covariate $X_j$, so we can treat the B-spline basis functions of each nonparametric function $f_j$ as one group. Instead of selecting the significant nonparametric functions directly, our task converts to choosing the significant groups of B-spline basis functions from $Z$, or equivalently the nonzero coefficient groups in $\beta$.
In order to carry out variable selection at the group and individual variable levels simultaneously, we adopt the GMCP penalty. In our case, the GMCP penalty function is

$$P_{\lambda, \gamma}(\beta) = \sum_{j=1}^{p} \rho\big( \| \beta_j \|_2 ; \lambda, \gamma \big), \qquad (2.6)$$

where $\gamma$ is a parameter that controls the concavity of $\rho$ and $\lambda$ is the penalty parameter. Here $\rho(t; \lambda, \gamma) = \lambda \int_0^{t} \big( 1 - x/(\gamma\lambda) \big)_{+} \, dx$. We require $\lambda \ge 0$ and $\gamma > 1$. The term MCP comes from the fact that it minimizes the maximum concavity measure defined at (2.2) of [23], subject to conditions on unbiasedness and the selection feature. The MCP can be easily understood by considering its derivative

$$\dot\rho(t; \lambda, \gamma) = \big( \lambda - t/\gamma \big)_{+}, \quad t \ge 0, \qquad (2.7)$$

where $(x)_{+} = \max(x, 0)$ and, for any vector $v$, $\|v\|_1 = \sum_k |v_k|$ denotes the $L_1$ norm; $\lambda$ is the penalty tuning parameter and $\gamma > 1$. In our case, each $\beta_j$ represents the coefficients of the $j$th group of basis functions; the basis functions for one nonparametric function $f_j$ may differ from those for another function $f_k$, and when $j \ne k$ we assume there is no overlap between groups. Now, combining the objective function in Equation (2.5) and the penalty function in Equation (2.6), we have the penalized weighted least squares objective function for the proposed NP-AFT-AR model:

$$Q(\beta; \lambda, \gamma) = \frac{1}{2} \| \tilde y - \tilde Z \beta \|_2^2 + \sum_{j=1}^{p} \rho\big( \| \beta_j \|_2 ; \lambda, \gamma \big). \qquad (2.8)$$
We can conduct group or component selection and estimation by minimizing $Q(\beta; \lambda, \gamma)$: if $\hat\beta_j = 0$, the function component $f_j$ is deleted; otherwise, it is selected. Further, the individual basis functions within a group can be selected.
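The penalty and its derivative can be written down directly. A small sketch, assuming the standard MCP form $\rho(t;\lambda,\gamma)=\lambda\int_0^t (1-x/(\gamma\lambda))_+\,dx$ (the helper names are our own):

```python
import numpy as np

def mcp(t, lam, gamma):
    """MCP: lam*t - t^2/(2*gamma) for t <= gamma*lam, constant afterwards."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_grad(t, lam, gamma):
    """Derivative (lam - t/gamma)_+ for t >= 0: Lasso-like near 0, flat beyond."""
    return np.maximum(lam - np.asarray(t, dtype=float) / gamma, 0.0)
```

The derivative makes the key feature visible: the penalization rate starts at $\lambda$ and decreases linearly to 0 at $t = \gamma\lambda$, beyond which large coefficients are left unpenalized.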
2.4. Computation
We derive a group coordinate descent algorithm for computing the GMCP estimator $\hat\beta$. This algorithm is a natural extension of the standard coordinate descent algorithm ([27]). It has also been used to compute penalized estimates based on concave penalty functions ([28]).

The group coordinate descent algorithm optimizes a target function with respect to a single group at a time, iteratively cycling through all groups until convergence is reached. It is particularly suitable for computing the GMCP estimate, since the solution has a simple closed-form expression in a single-group model; see (2.11) below.

We write $\tilde Z_j^{T} \tilde Z_j = R_j^{T} R_j$ for a $d_n \times d_n$ upper triangular matrix $R_j$ via the Cholesky decomposition. Let $b_j = R_j \beta_j$ and $U_j = \tilde Z_j R_j^{-1}$. Simple algebra shows that

$$\frac{1}{2} \| \tilde y - \tilde Z_j \beta_j \|_2^2 = \frac{1}{2} \| \tilde y - U_j b_j \|_2^2. \qquad (2.9)$$

Note that $U_j^{T} U_j = I_{d_n}$, and

$$\hat\beta_j = R_j^{-1} \hat b_j. \qquad (2.10)$$

Let $u = U_j^{T} \tilde y$. For $\gamma > 1$, it can be verified that the value that minimizes the single-group objective $\frac{1}{2}\| \tilde y - U_j b \|_2^2 + \rho(\|b\|_2; \lambda, \gamma)$ is

$$M(u; \lambda, \gamma) = \begin{cases} \dfrac{\gamma}{\gamma - 1} \Big( 1 - \dfrac{\lambda}{\|u\|_2} \Big)_{+} u, & \text{if } \|u\|_2 \le \gamma\lambda, \\[4pt] u, & \text{if } \|u\|_2 > \gamma\lambda. \end{cases} \qquad (2.11)$$

In particular, when $\gamma \to \infty$, we have $M(u; \lambda, \gamma) \to (1 - \lambda/\|u\|_2)_{+} u$, which is the GLasso estimate for a single-group model ([29]).
The group coordinate descent algorithm can now be implemented as follows. Suppose the current values of the group parameters $b_k$, $k \ne j$, are given. We want to minimize the objective with respect to $b_j$. Let

$$\tilde y_{-j} = \tilde y - \sum_{k \ne j} U_k b_k \qquad (2.12)$$

denote the partial residual, where $U_k = \tilde Z_k R_k^{-1}$ is the orthonormalized group design matrix from the Cholesky factorization $\tilde Z_k^{T} \tilde Z_k = R_k^{T} R_k$, and write the single-group objective as $\frac{1}{2}\| \tilde y_{-j} - U_j b_j \|_2^2 + \rho(\| b_j \|_2; \lambda, \gamma)$. Let $\hat b_j$ be its minimizer. Since $U_j^{T} U_j = I_{d_n}$, we have $\hat b_j = M(U_j^{T} \tilde y_{-j}; \lambda, \gamma)$, where $M$ is defined in (2.11).

For any given $(\lambda, \gamma)$, we use (2.11) to cycle through one group at a time. Let $b^{(0)} = (b_1^{(0)T}, \ldots, b_p^{(0)T})^{T}$ be the initial value. The proposed coordinate descent algorithm is as follows.

Initialize the vector of residuals $r = \tilde y - \sum_{j=1}^{p} U_j b_j^{(0)}$. For $s = 0, 1, 2, \ldots$, carry out the following calculation until convergence. For $j = 1, \ldots, p$, repeat the following steps.

Step 1: Calculate $u_j = U_j^{T} r + b_j^{(s)}$.

Step 2: Update $b_j^{(s+1)} = M(u_j; \lambda, \gamma)$.

Step 3: Update $r \leftarrow r - U_j \big( b_j^{(s+1)} - b_j^{(s)} \big)$.
The last step ensures that r holds the current values of the residuals. Although the objective function is not necessarily convex, it is convex with respect to a single group when the coefficients of all the other groups are fixed.
3. Asymptotic Oracle Properties of GMCP
Let $|A|$ denote the cardinality of any set $A \subseteq \{1, \ldots, p\}$, and let $A_0 = \{ j : f_j \ne 0 \}$ denote the index set of the nonzero components. Here $Z_A$ denotes the $n \times d_n|A|$-dimensional sub-design matrix corresponding to the variables in $A$. We make the following assumptions.
Similar to [5], we assume:

(C1) $T$ and $C$ are independent.

(C2) $P(T \le C \mid T, X) = P(T \le C \mid T)$.

(C3) $\log T$ and the covariates have finite second moments.

(C4) Denote $\tau_T$ and $\tau_C$ as the least upper bounds of the supports of $T$ and $C$, respectively. Then $\tau_T < \tau_C$, or $\tau_T = \tau_C$ with the event-observation probability bounded away from zero.

(C5) The class of candidate component functions has a finite envelope function.

(C6) $E f_j(X_j) = 0$ for $j = 1, \ldots, p$.

These assumptions correspond to the conditions in [30]. In the random censorship model, (C1) is a basic assumption. (C2) states that, given the failure time $T$, the censoring indicator is independent of the covariates. (C3) provides the second moments needed in least squares estimation. (C4) assumes the probability of an event being observed is greater than zero, which guarantees the consistency of the estimator. (C5) is a fundamental condition for the consistency and convergence rate in the proofs and is used in the entropy calculation. (C6) guarantees that the additive decomposition is identifiable.
(C7) (Sparse Riesz condition) There exist constants $0 < c_* \le c^* < \infty$ and an integer $q$ such that

$$c_* \le \frac{\| Z_A u \|_2^2}{n \| u \|_2^2} \le c^*$$

for every $A$ with $|A| \le q$ and every nonzero vector $u$ of matching dimension.

(C8) There is a small constant $\eta \ge 0$ such that $\sum_{j \notin A_0} \| f_j \|_2 \le \eta$.

(C9) The random errors $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed with mean zero and variance $\sigma^2$; moreover, the tail probabilities satisfy $P(|\varepsilon_i| > x) \le C \exp(-K x^2)$ for all $x \ge 0$ and some constants $C$ and $K$.

(C10) There exists a positive constant $M$ such that the predictors satisfy $\max_{1 \le j \le p} |X_j| \le M$.

(C7) is the sparse Riesz condition (SRC) formulated for the nonparametric AFT model (2.1), which controls the range of the eigenvalues of submatrices of $Z$. This condition was introduced by [31] to study the properties of the Lasso in the linear regression model. (C8) assumes that the unimportant components are small in the $L_2$ sense, but need not be exactly zero. If $\eta = 0$, (C8) becomes $f_j = 0$ for all $j \notin A_0$, and the problem of variable selection is equivalent to distinguishing nonzero functions from zero functions. (C9) assumes that the distribution of the error terms has sub-Gaussian tails; this condition holds when the error distribution is normal. (C10) assumes that all the predictors are uniformly bounded, which is satisfied in many practical situations.
In this subsection, we simply write $\hat\beta$ for the GMCP estimator. Let $A_0 = \{ j : \| \beta_{0j} \|_2 \ne 0 \}$ and set $\hat\beta^{o} = 0$ if $A_0$ is empty. Define

$$\hat\beta^{o} = \arg\min_{\beta} \Big\{ \tfrac{1}{2} \| \tilde y - \tilde Z \beta \|_2^2 : \beta_j = 0 \text{ for } j \notin A_0 \Big\}. \qquad (3.1)$$

This is the oracle least squares estimator. Of course, it is not a real estimator, since the oracle set is unknown.

We first consider the case where the 2-norm GMCP objective function is convex. This necessarily requires $\gamma c_{\min} > 1$, where $c_{\min}$ is the smallest eigenvalue of $\tilde Z^{T} \tilde Z / n$. As in [32], define the function $h(t, k)$ through

(3.2)

This function arises from an upper bound for the tail probabilities of the chi-square distributions given in Lemma A.2 in the Appendix. It is derived from an exponential inequality for chi-square random variables of [33].
Theorem 3.1. Suppose $\varepsilon_1, \ldots, \varepsilon_n$ are independent and identically distributed as in (C9) and that (C1)-(C10) hold. Then, for any $(\lambda, \gamma)$ satisfying $\gamma c_{\min} > 1$, the probability that the GMCP estimator differs from the oracle least squares estimator is bounded by sums of chi-square tail terms expressed through the function $h$.

We give the proof of Theorem 3.1 in the Appendix. It provides an upper bound on the probability that $\hat\beta$ is not equal to the oracle estimator in terms of the tail probability function $h$ in (3.2). The key condition $\gamma c_{\min} > 1$ ensures that the 2-norm GMCP criterion is strictly convex. Nonetheless, this result is a starting point for a similar result in the high-dimensional case. The following corollary is an immediate consequence of Theorem 3.1.
Corollary 1. Suppose that the conditions of Theorem 3.1 are satisfied, and suppose in addition that the tail bounds of Theorem 3.1 vanish as $n \to \infty$. Then $P(\hat\beta = \hat\beta^{o}) \to 1$.

By Corollary 1, the 2-norm GMCP estimator equals the oracle least squares estimator with probability converging to one, which implies that it is group selection consistent. We now consider the high-dimensional case where $p$ may be much larger than $n$. Under condition (C7), define the quantities used in the following theorem.
Theorem 3.2. Suppose the errors are independent and identically distributed as in (C9) and that $\tilde Z$ satisfies the sparse Riesz condition in (C7). Then, for $(\lambda, \gamma)$ satisfying conditions analogous to those of Theorem 3.1, the probability that the GMCP estimator differs from the oracle estimator obeys upper bounds of the same form.
Corollary 2. Suppose that the conditions of Theorem 3.2 are satisfied, and suppose in addition that the corresponding tail bounds vanish as $n \to \infty$. Then $P(\hat\beta = \hat\beta^{o}) \to 1$.
Theorem 3.2 and Corollary 2 provide sufficient conditions for the asymptotic oracle property of the global 2-norm GMCP estimator in high-dimensional situations. Here we allow the dimension to grow with the sample size, so $p$ can be greater than $n$. The convexity condition here is stronger than the corresponding condition in Theorem 3.5 of [34]. It ensures that the GMCP criterion is convex in any $q$-dimensional subspace and is stronger than the minimal sufficient condition for convexity in $q$-dimensional subspaces. This is the price we need to pay in searching for a lower-dimensional space that contains the true model.
4. Numerical Simulation
In this section, we conduct simulation studies to evaluate the performance of the GMCP and GLasso penalties in the high-dimensional NP-AFT-AR model with limited samples. We focus on comparisons of the group selection methods with the tuning parameter $\lambda$ selected by the BIC ([35]),

$$\mathrm{BIC}(\lambda) = \log\big( \mathrm{RSS}_{\lambda} / n \big) + \mathrm{df}_{\lambda} \cdot \log(n) / n,$$

where $\mathrm{RSS}_{\lambda}$ is the sum of squared residuals and $\mathrm{df}_{\lambda}$ is the number of selected variables for the given $\lambda$. The concavity parameter $\gamma$ is chosen from an increasing sequence of candidate values. For any given $\gamma$, we choose $\lambda$ from a decreasing sequence of 100 values from $\lambda_{\max}$ down to a small fraction of $\lambda_{\max}$, where $Z_j$ is the $n \times d_n$ "design" matrix corresponding to the covariate $X_j$ and $\lambda_{\max}$ is the smallest penalty value that compresses all estimated coefficients to zero.
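A BIC search over a $\lambda$ path as described above might look like the following sketch. The solver is passed in as a callable, and the function names and interface are our own reading of the text, not a verified implementation.

```python
import numpy as np

def bic_path(X_groups, y, lams, gamma=3.0, fit=None):
    """Scan a lambda path and return (best BIC, best lambda, coefficients).
    BIC(lam) = log(RSS/n) + df * log(n)/n, with df counting selected
    coefficients.  `fit(X_groups, y, lam, gamma)` returns group coefficients."""
    n = len(y)
    best = (np.inf, None, None)
    for lam in lams:
        b = fit(X_groups, y, lam, gamma)
        resid = y - sum(Xj @ bj for Xj, bj in zip(X_groups, b))
        rss = float(resid @ resid)
        df = sum(bj.size for bj in b if np.linalg.norm(bj) > 0)
        bic = np.log(max(rss, 1e-12) / n) + df * np.log(n) / n
        if bic < best[0]:
            best = (bic, lam, b)
    return best
```

Any group-penalized solver can be plugged in as `fit`, e.g. the group coordinate descent sketch from Section 2.4.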
We compute the empirical prediction mean square error (MSE) to assess estimation accuracy. Let $\hat f$ be the estimator of the regression function $f$; we define

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big( \hat f(X_i) - f(X_i) \big)^2.$$
Three scenarios are considered in the following, where some nonzero components are linear and the response variable is subject to various censoring rates. A total of 100 simulation runs is used for each setting. The logarithm of the censoring time, $\log C$, is generated from a uniform distribution $U(c_1, c_2)$, where $c_1$ and $c_2$ are determined by a Monte-Carlo method to achieve censoring rates of 35% and 40%, respectively. For example, the censoring rate is approximated by

$$\mathrm{cr} = \frac{1}{M} \sum_{m=1}^{M} 1\{ \log T_m > \log C_m \},$$

where $\log T_m$ is drawn from the proposed model (2.1), $\log C_m$ is drawn from $U(c_1, c_2)$, and $M$ is the number of Monte-Carlo runs used to compute cr. We choose $(c_1, c_2)$ so that cr is close to the desired censoring rate. To balance computational efficiency and accuracy, we use cubic B-splines with five evenly spaced interior knots for all the functions $f_j$, which gives $m_n = 9$ basis functions for each nonparametric component. Due to the identifiability constraint, the actual number of basis functions used is 8. This choice is made because our simulation studies indicated that using a larger number of knots does not improve the finite sample performance (results not shown).
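The Monte-Carlo calibration of $(c_1, c_2)$ can be sketched as a simple grid search over the upper bound; the error distribution, grid, and function name below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def calibrate_censoring(sample_log_T, target_cr, c2_grid, c1=0.0, M=5000):
    """Grid search for the Uniform(c1, c2) upper bound whose Monte-Carlo
    censoring rate cr = (1/M) * sum 1{log T_m > log C_m} is nearest target_cr."""
    draws = sample_log_T(M)                    # M draws of log T
    best_c2, best_gap = None, np.inf
    for c2 in c2_grid:
        log_C = rng.uniform(c1, c2, M)         # M draws of log C
        cr = np.mean(draws > log_C)            # empirical censoring rate
        gap = abs(cr - target_cr)
        if gap < best_gap:
            best_c2, best_gap = c2, gap
    return best_c2, best_gap

# illustrative: log T ~ N(1, 1); a larger c2 yields less censoring
c2, gap = calibrate_censoring(lambda m: rng.normal(1.0, 1.0, m), 0.35,
                              c2_grid=np.linspace(1.0, 10.0, 40))
```

In practice one would draw `log T` from the fitted scenario model (2.1) rather than the stand-in normal used here.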
4.1. Scenario 1 (Covariates Are Independent)
In this scenario, we consider independent covariates and set the intercept to a fixed value. The logarithms of the failure times, $\log T_i$, are generated from model (2.1) with six nonzero component functions, and the predictors are sampled independently from a common distribution. We consider several sample sizes to examine the performance of the proposed methods as the sample size increases. The penalty parameters are selected as described above.

The results for the GMCP, GSCAD and GLasso methods are given in Table 1 and Table 2, based on 100 replications. The columns in Table 1 include the average number of variables selected (NV), the model error (ER), the percentage of occasions on which the correct variables are included in the selected model (%IN), and the percentage of occasions on which exactly the correct variables are selected (%CS), with standard errors in parentheses. Table 2 summarizes the mean square errors for the six important functions, with standard errors in parentheses.

Several observations can be made from Tables 1-4. The model selected by the GMCP is better than the one selected by the GLasso in terms of model error, the percentage of occasions on which the true variables are selected, and the mean square errors of the important coefficient functions. The GMCP includes the correct variables with high probability. As the sample size increases, the performance of both methods improves, as expected. To examine the estimated nonparametric functions from the concave group selection methods, we plot the GMCP estimates along with the true function components in Figure 1 and Figure 2, based on one run with $n = 100$.
Table 1. Simulation results. NV, number of selected variables; ER, model error; IN%, percentage of occasions on which the correct variables are included in the selected model; CS%, percentage of occasions on which exactly correct variables are selected, averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 2. Simulation results. Mean Square errors for the important coefficient functions based on 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 3. Simulation results. NV, number of selected variables; ER, model error; IN%, percentage of occasions on which the correct variables are included in the selected model; CS%, percentage of occasions on which exactly correct variables are selected, averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 4. Simulation results. Mean Square errors for the important coefficient functions based on 100 replications. Enclosed in parentheses are the corresponding standard errors.
Figure 1. Estimated nonparametric components: the solid black line is the true function and the dotted red line is the GMCP estimate; CR = 35%.
Figure 2. Estimated nonparametric components: the solid black line is the true function and the dotted red line is the GMCP estimate; CR = 40%.
The estimated nonparametric functions fit the true functions well, which is consistent with the mean square errors for the functions reported in Table 2 and Table 4.
4.2. Scenario 2 (Covariates Are Correlated)
In this scenario, we consider correlated covariates and set the intercept to a fixed value. The logarithms of the failure times, $\log T_i$, are generated from model (2.1), where the covariates are generated as $X_j = (W_j + U)/2$ with the $W_j$'s and $U$ i.i.d. uniform random variables. This provides a design with a correlation coefficient of 0.5 between all of the covariates.
The simulation study results are reported in Tables 5-8. The conclusions for Scenario 2 are very similar to those for Scenario 1. When the censoring rate increases, the estimation and selection performance of all methods decreases. The results in Table 6 and Table 8 show that the GMCP estimator is more accurate than the GLasso estimator for both the individual component functions and the full
Table 5. Simulation results. NV, number of selected variables; ER, model error; IN%, percentage of occasions on which the correct variables are included in the selected model; CS%, percentage of occasions on which exactly correct variables are selected, averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 6. Simulation results. Mean Square errors for the important coefficient functions based on 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 7. Simulation results. NV, number of selected variables; ER, model error; IN%, percentage of occasions on which the correct variables are included in the selected model; CS%, percentage of occasions on which exactly correct variables are selected, averaged over 100 replications. Enclosed in parentheses are the corresponding standard errors.
Table 8. Simulation results. Mean Square errors for the important coefficient functions based on 100 replications. Enclosed in parentheses are the corresponding standard errors.
model, since the MSE under the GMCP approach is always smaller than that under the GLasso approach. The results in Table 5 and Table 7 show that the GMCP method conducts component selection more precisely than the GLasso method, which selects many zero component functions as nonzero. To examine the estimated nonparametric functions from the GMCP, we plot them along with the true function components in Figure 3 and Figure 4, from one run. Although the estimation and selection accuracy decrease when covariates are correlated, the estimated curves under the GMCP method remain closer to the true curves than those under the GLasso method.
5. Application to the NP-AFT-AR Model

In this section, we use the lung adenocarcinoma data of Shedden et al. (2008) to illustrate the proposed method; see [36] for details. Retrospective data on 442 lung adenocarcinoma patients were collected at multiple sites, including survival time, other clinical and demographic variables, and the expression levels of 22,283 genes measured on tumor samples. However, most genes show little variation across samples. Therefore, in our application, we randomly select 321 samples and use the first 500 and 1000 genes, and the survival rate is 35.8%.
Figure 3. Estimated nonparametric components: the solid black line is the true function, the dotted red line is the group MCP estimate, the dotted blue line is the group SCAD estimate, and the black line is the group Lasso estimate; CR = 35%.
Figure 4. Estimated nonparametric components: the solid black line is the true function, the dotted red line is the group MCP estimate, the dotted blue line is the group SCAD estimate, and the black line is the group Lasso estimate; CR = 40%.
Figure 5. Estimated functions based on GLasso and GMCP for the probe 200746_s_at using the same covariate, where the red dotted line is the GMCP estimate and the gray dotted line is the GLasso estimate.
Here, we are interested in the effect of tumor gene expression levels on the survival time of lung adenocarcinoma patients. Since the linearity assumption is questionable in high dimensions, the proposed method may be better suited to feature selection problems with nonlinear effects. In our analysis, we use the same spline basis for each gene. The proposed method under GMCP selects one gene (namely 200746_s_at). In contrast, the GLasso penalized regression selects 6 and 10 genes for the two dimensions considered, respectively.

From Figure 5, we find that the larger the dimension, the worse the GLasso estimate becomes, while the dimension has little effect on the GMCP estimate. The analysis of the real data therefore shows that the GMCP penalty outperforms the GLasso penalty, with higher accuracy at essentially the same computational cost. Under the same conditions, the GMCP method is more suitable than the GLasso.
6. Concluding Remarks
In this paper, we study the weighted least squares estimation and selection properties of the GMCP in the NP-AFT-AR model with high-dimensional data. Our simulation results show that the GLasso tends to select some unimportant variables, whereas the GMCP possesses the asymptotic oracle property and, consequently, selection consistency.
Appendix: Proofs

Lemma 1. Let $\chi_k^2$ be a random variable with a chi-square distribution with $k$ degrees of freedom. Its tail probability is bounded by the function $h$ defined in (3.2).

This lemma is a restatement of the exponential inequality for chi-square distributions of [33].
Proof of Theorem 3.1. Since $\hat\beta^{o}$ is the oracle least squares estimator, it minimizes the unpenalized loss over the groups in $A_0$ and has zero coefficients outside $A_0$. If $\| \hat\beta^{o}_j \|_2 \ge \gamma\lambda$ for all $j \in A_0$, then by the definition of the MCP the penalty term is constant in a neighborhood of $\hat\beta^{o}$. Since $\gamma c_{\min} > 1$, the criterion (2.8) is strictly convex. By the Karush-Kuhn-Tucker (KKT) conditions, the equality $\hat\beta = \hat\beta^{o}$ holds in the intersection of the events

$$\Omega_1 = \big\{ \| \hat\beta^{o}_j \|_2 \ge \gamma\lambda \text{ for all } j \in A_0 \big\} \quad \text{and} \quad \Omega_2 = \big\{ \| \tilde Z_j^{T} (\tilde y - \tilde Z \hat\beta^{o}) \|_2 \le \lambda \text{ for all } j \notin A_0 \big\}.$$

We first bound $P(\Omega_1^c)$. By (A.1) of [34], the relevant quadratic form is distributed as a chi-square random variable with the appropriate degrees of freedom, so that, for suitable thresholds,

(6.1)

where we used Lemma 1 in the third line. Now consider $P(\Omega_2^c)$. If the gradient condition holds for all $j \notin A_0$, then the KKT conditions hold outside $A_0$. Let $E_j$ be a block matrix with an identity matrix in the $j$th block and 0's elsewhere. The corresponding quadratic form is distributed as a chi-square random variable with $q$ degrees of freedom. Therefore, similarly to (6.1), we have

(6.2)

Combining (6.1) and (6.2) bounds $P( \hat\beta \ne \hat\beta^{o} )$, which completes the proof.
For any $A \subseteq \{1, \ldots, p\}$, define the projection onto the column space of the corresponding sub-design matrix.

Lemma 2. Suppose (C7) holds. Then the maximal projection of the noise onto subsets $A$ with $|A| \le q$ satisfies a tail bound of the form (6.3).

Proof. For any such $A$, the associated projection $P_A$ is a projection matrix, so the projected squared norm is distributed as a chi-square random variable with at most $d_n |A|$ degrees of freedom. Since there are $\binom{p}{|A|}$ ways to choose $A$ from $\{1, \ldots, p\}$, a union bound together with Lemma A.2 implies

(6.3)

where we used a standard bound on the binomial coefficients. This completes the proof.
Define $I$ as any set that satisfies the inclusion conditions required below.

Lemma 3. Suppose that $(\lambda, \gamma)$ satisfies the conditions of Theorem 3.2 and let $q$ be as in (C7). Then the number of groups selected by the GMCP estimator is at most $q$.

Proof. This lemma can be proved along the lines of the proof of Lemma 1 of [23] and is omitted.

Proof of Theorem 3.2. By Lemma 3, in the event

(6.4)

the original model with $p$ groups reduces to a model with at most $q$ groups. In this reduced model, the conditions of Theorem 3.2 imply the conditions of Theorem 3.1. By Lemma 2,

(6.5)

Therefore, combining (6.5) and Theorem 3.1 completes the proof of Theorem 3.2.