
In this paper, we study the nonparametric accelerated failure time (AFT) additive regression model, in which the covariates have nonparametric effects on the response, for high-dimensional censored data. We establish the asymptotic properties of the penalized estimator based on the group minimax concave penalty (GMCP) in the nonparametric AFT model.

With the development of the Internet, high-dimensional data are now widely collected. In fields such as medical research and finance, the outcomes or responses are often censored, so the study of high-dimensional censored data is of practical importance. However, because of the “curse of dimensionality”, the analysis of high-dimensional data is extremely difficult and requires special methods. As the number of dimensions increases, the performance of methods designed for low-dimensional data deteriorates rapidly. In low-dimensional spaces, Euclidean distance is commonly used to measure the similarity between observations; in high-dimensional spaces, this notion of similarity becomes far less informative, which makes data mining on high-dimensional data very challenging. On the one hand, the performance of mining algorithms based on index structures degrades; on the other hand, many mining methods based on distance functions over the entire space fail. By reducing the number of dimensions, the data can be mapped from a high-dimensional to a low-dimensional space and then handled with low-dimensional methods. Therefore, the study of effective dimension reduction methods is important in statistics.

In many studies, the main results or responses of survival data are censored. Survival analysis is another important theme of statistics, and it has been widely used in medical research and finance. Therefore, the study of survival data has attracted a lot of attention. The Cox model [_{1/2} for variable selection. [_{2} Boosting algorithm, which is based on a semiparameter variable coefficient accelerated failure time model of right-censored survival data with high-dimensional covariates model prediction and variable selection. [

In this article, based on potential predictors, we applied the GMCP (Group Minimax Concave Penalty) penalty method for the first time to the study of a high-dimensional nonparametric accelerated failure time additive regression model (2.1) (MCP, [

The rest of the paper is organized as follows. In Section 2, we describe the nonparametric accelerated failure time additive regression (NP-AFT-AR) model and our estimation method. In Section 3, we give the asymptotic oracle property of the GMCP estimator. Simulation results are given in Section 4. An analysis of real data is given in Section 5. The conclusion is given in Section 6.

In this paper, we study the following nonparametric accelerated failure time additive regression (NP-AFT-AR) model to describe the relationship between the independent predictors or covariates X_{j}’s and the failure time T:

$$T = \exp\Big(\eta_0 + \sum_{j=1}^{p} f_j(X_j) + \varepsilon\Big) \qquad (2.1)$$

where $\eta_0$ is the intercept, $X = (X_1, \dots, X_p)$ is a $p \times 1$ vector of covariates, the $f_j$'s are unknown smooth functions with zero means, i.e., $E f_j(X_j) = 0$, and $\varepsilon$ is the random error term with mean zero and finite variance $\sigma^2$. We consider the setting in which the sample size is smaller than the number of covariates, $n < p$, and assume that some of the additive components $f_j$ are zero. The main purpose of our research is to distinguish the nonzero components from the zero components; the second goal is to estimate the functional form of the nonzero components, so as to obtain a more parsimonious model. In this study, we apply the GMCP penalty to the proposed NP-AFT-AR model for component selection and estimation. We use B-splines to parameterize the nonparametric components and then apply the inverse probability-of-censoring weighted least squares method. We treat the spline approximation for each component as a group of variables subject to selection. With the GMCP penalty, we show that the proposed method can select significant component functions by choosing the nonzero spline basis functions.

We define $T_i$ as the $i$th subject's survival time, and let $C_i$ denote the censoring time and $\delta_i$ the event indicator, i.e., $\delta_i = I(T_i \le C_i)$, which takes the value 1 if the event time is observed and 0 if it is censored. Define $Y_i$ as the logarithm of the minimum of the survival time and the censoring time, i.e., $Y_i = \log(\min(T_i, C_i))$. Then the observed data are of the form $(Y_i, \delta_i, X_i)$, $i = 1, \dots, n$, which are assumed to be an independent and identically distributed (i.i.d.) sample from $(Y, \delta, X)$.

Let $Y_{(1)} \le \dots \le Y_{(n)}$ be the order statistics of the $Y_i$'s, and let $\delta_{(1)}, \dots, \delta_{(n)}$ and $X_{(1)}, \dots, X_{(n)}$ be the associated censoring indicators and covariates. Let $F$ be the distribution of $T$ and $\hat F_n$ its Kaplan-Meier estimator, $\hat F_n(y) = \sum_{i=1}^{n} \omega_{ni} 1(Y_{(i)} \le y)$, where the $\omega_{ni}$'s are the Kaplan-Meier weights ( [

$$\omega_{n1} = \frac{\delta_{(1)}}{n}, \qquad \omega_{ni} = \frac{\delta_{(i)}}{n-i+1} \prod_{j=1}^{i-1} \Big(\frac{n-j}{n-j+1}\Big)^{\delta_{(j)}}, \quad i = 2, \dots, n$$
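As a concrete illustration of the weight formula above, the following minimal sketch computes the Kaplan-Meier weights from censoring indicators that are already sorted by $Y$. The function name `kaplan_meier_weights` is hypothetical and is not taken from the paper.

```python
def kaplan_meier_weights(delta_sorted):
    """Kaplan-Meier weights for event indicators sorted by increasing Y.

    delta_sorted[i - 1] is the indicator delta_(i) of the i-th smallest Y.
    """
    n = len(delta_sorted)
    weights = []
    # Running product over j < i of ((n - j) / (n - j + 1)) ** delta_(j).
    running_product = 1.0
    for i in range(1, n + 1):
        d = delta_sorted[i - 1]
        weights.append(d / (n - i + 1) * running_product)
        running_product *= ((n - i) / (n - i + 1)) ** d
    return weights
```

Censored observations receive weight zero, and without censoring every observation receives weight $1/n$, so the weighted loss reduces to ordinary least squares.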

[

$$Q_n = \frac{1}{2} \sum_{i=1}^{n} n\,\omega_{ni} \Big\{ Y_{(i)} - \eta_0 - \sum_{j=1}^{p} f_j(X_{(i)j}) \Big\}^2 \qquad (2.2)$$

Here, we use B-spline basis functions to approximate the unknown functions $f_j$. For every function component, we assume that $X_j$ is bounded and that $E\{f_j(X_j)\} = 0$, $j = 1, \dots, p$. The basis functions are determined by the order $(d + 1)$, where $d$ denotes the spline degree (not to be confused with the number of covariates $p$), and the number of interior knots $\kappa$. The total number of B-spline basis functions for each function component would be $d + \kappa + 1$. For identifiability, i.e., to satisfy $E f_j(X_j) = 0$, we take the total number of basis functions to be $M_n = d + \kappa$ only and center all the basis functions at their means. Then the B-spline approximation for each function component $f_j(X_j)$, $j = 1, \dots, p$, is given by

$$f_j(X_j) \approx \sum_{k=1}^{M_n} \beta_{jk} B_{jk}(X_j)$$

where the $B_{jk}(X_j)$ are the B-spline basis functions and $\beta_j = (\beta_{j1}, \dots, \beta_{jM_n})^T$ is the corresponding coefficient vector. Let $B_j$ denote the $n \times M_n$ design matrix of the B-spline basis for the $j$th predictor, and let $B_{(i)j}$ be its $i$th row, corresponding to the sorted data. Denote the $n \times pM_n$ design matrix by $B = (B_1, B_2, \dots, B_p)$, the $i$th row of $B$ by $B_{(i)}$, and the corresponding parameter vector by $\beta = (\beta_1^T, \dots, \beta_p^T)^T$. Then we have

$$\sum_{j=1}^{p} f_j(X_{(i)j}) = f_1(X_{(i)1}) + \dots + f_p(X_{(i)p}) \approx \sum_{k=1}^{M_n} \beta_{1k} B_{1k}(X_{(i)1}) + \dots + \sum_{k=1}^{M_n} \beta_{pk} B_{pk}(X_{(i)p}) = \sum_{j=1}^{p} B_{(i)j} \beta_j \qquad (2.3)$$
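To make the construction of one group of basis functions concrete, here is a minimal sketch that evaluates a clamped B-spline basis with the Cox-de Boor recursion and then reduces it to centered columns. The function names, the clamped knot vector, and the choice to drop the first basis function before centering are assumptions made for illustration; the paper only states that one basis function is removed and the rest are centered at their means.

```python
import numpy as np


def bspline_basis(x, interior_knots, degree, lower, upper):
    """Evaluate all B-spline basis functions at x (Cox-de Boor recursion)."""
    x = np.asarray(x, dtype=float)
    # Clamped knot vector: each boundary knot is repeated (degree + 1) times.
    t = np.concatenate([np.repeat(lower, degree + 1),
                        np.asarray(interior_knots, dtype=float),
                        np.repeat(upper, degree + 1)])
    n_int = len(t) - 1
    # Degree-0 basis: indicators of the half-open knot intervals.
    B = np.zeros((len(x), n_int))
    for i in range(n_int):
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
    # Assign the right endpoint to the last non-empty interval.
    last = np.nonzero(t[:-1] < t[1:])[0].max()
    B[x == upper, last] = 1.0
    for d in range(1, degree + 1):
        B_next = np.zeros((len(x), n_int - d))
        for i in range(n_int - d):
            left = t[i + d] - t[i]
            right = t[i + d + 1] - t[i + 1]
            if left > 0:
                B_next[:, i] += (x - t[i]) / left * B[:, i]
            if right > 0:
                B_next[:, i] += (t[i + d + 1] - x) / right * B[:, i + 1]
        B = B_next
    return B  # shape: (len(x), degree + 1 + number of interior knots)


def centered_basis(x, interior_knots, degree, lower, upper):
    """Drop one basis function (assumption: the first) and center the rest."""
    B = bspline_basis(x, interior_knots, degree, lower, upper)[:, 1:]
    return B - B.mean(axis=0)
```

With a cubic spline and five interior knots, `bspline_basis` returns nine columns and `centered_basis` returns eight, which matches the counts used later in the simulation section.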

Plugging Equation (2.3) into Equation (2.2) gives the new loss function:

$$Q_n(\eta_0, \beta) = \frac{1}{2} \sum_{i=1}^{n} n\,\omega_{ni} \Big\{ Y_{(i)} - \eta_0 - \sum_{j=1}^{p} B_{(i)j} \beta_j \Big\}^2 \qquad (2.4)$$

By centering $B_{(i)j}$ and $Y_{(i)}$ at their $\omega_{ni}$-weighted means, the intercept becomes 0. Denote $\tilde B_{(i)j} = (n\omega_{ni})^{1/2}(B_{(i)j} - \bar B_{j\omega})$ and $\tilde Y_{(i)} = (n\omega_{ni})^{1/2}(Y_{(i)} - \bar Y_{\omega})$, where $\bar B_{j\omega} = \sum_{i=1}^{n} \omega_{ni} B_{(i)j} / \sum_{i=1}^{n} \omega_{ni}$ and $\bar Y_{\omega} = \sum_{i=1}^{n} \omega_{ni} Y_{(i)} / \sum_{i=1}^{n} \omega_{ni}$. Let $\|a\|_2 = (\sum_{j=1}^{m} |a_j|^2)^{1/2}$ denote the $L_2$ norm of any vector $a \in \mathbb{R}^m$. For simplicity, we write $\tilde B_j = (\tilde B_{(1)j}, \dots, \tilde B_{(n)j})^T$ and $\tilde Y = (\tilde Y_{(1)}, \dots, \tilde Y_{(n)})^T$. Then we can rewrite Stute's weighted least squares loss function in Equation (2.4) as

$$Q_n(\beta) = \frac{1}{2} \sum_{i=1}^{n} \Big\{ \tilde Y_{(i)} - \sum_{j=1}^{p} \tilde B_{(i)j} \beta_j \Big\}^2 = \frac{1}{2} \Big\| \tilde Y - \sum_{j=1}^{p} \tilde B_j \beta_j \Big\|_2^2 \qquad (2.5)$$
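The passage from the weighted loss (2.4) to the unweighted form (2.5) amounts to a weighted centering followed by a row-wise rescaling. A minimal sketch of that transformation is given below; the function name `weighted_center_scale` and its argument layout are illustrative assumptions, with `w` holding the Kaplan-Meier weights in the same (sorted) row order as `B` and `y`.

```python
import numpy as np


def weighted_center_scale(B, y, w):
    """Center B and y at their w-weighted means, then scale rows by sqrt(n * w)."""
    B = np.asarray(B, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    w_sum = w.sum()
    B_bar = (w[:, None] * B).sum(axis=0) / w_sum  # weighted column means
    y_bar = (w * y).sum() / w_sum                 # weighted mean of y
    scale = np.sqrt(n * w)                        # zero for censored rows
    B_tilde = scale[:, None] * (B - B_bar)
    y_tilde = scale * (y - y_bar)
    return B_tilde, y_tilde
```

Rows with zero weight (censored observations) become zero rows after the transformation, so they contribute nothing to the least squares criterion.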

The B-spline approximation of the unknown functions transforms the nonparametric regression into a parametric regression, which makes variable selection and parameter estimation easier. Meanwhile, the columns of $\tilde B_j$, i.e., $\tilde B_{jk}$, $k = 1, \dots, M_n$, for each $j = 1, \dots, p$, are all related to the variable $X_j$, so we can treat the B-spline basis functions for each nonparametric function $f_j$ as a group. Instead of selecting the significant nonparametric functions directly, our task becomes choosing the significant groups of B-spline basis functions $\tilde B_j$, or equivalently the nonzero coefficient vectors $\beta_j$.

To carry out variable selection at the group and individual-variable levels simultaneously, we use the GMCP penalty. In our case, the GMCP penalty function is

$$\rho_\gamma(\|\beta_j\|_{A_j}, \lambda) = \lambda \int_0^{\|\beta_j\|_{A_j}} \Big(1 - \frac{x}{\gamma\lambda}\Big)_+ dx \qquad (2.6)$$

where γ is a parameter that controls the concavity of ρ and λ is the penalty parameter. Here x + = x 1 { x ≥ 0 } . We require λ ≥ 0 and γ > 1 . The term MCP comes from the fact that it minimizes the maximum concavity measure defined at (2.2) of [

$$\dot\rho_\gamma(\|\beta_j\|_{A_j}, \lambda) = \lambda \Big(1 - \frac{\|\beta_j\|_{A_j}}{\gamma\lambda}\Big)_+ \qquad (2.7)$$

where, for any $m \times 1$ vector $a$, $\|a\|_1$ is the $L_1$ norm, $\|a\|_1 = |a_1| + \dots + |a_m|$, $\lambda > 0$ is the penalty tuning parameter, and $A_j = \{k : \beta_{jk} \in \beta_j\}$. In our case, each $A_j$ represents the $j$th group of basis functions, i.e., $\tilde B_{jk}$, $k = 1, \dots, M_n$. The values of the basis functions for a nonparametric function $f_j$ may differ from those for another function $f_{j'}$, $j \ne j'$, and we assume there is no overlap between groups. Combining the objective function in Equation (2.5) with the penalty function in Equation (2.6), we obtain the penalized weighted least squares objective function for the proposed NP-AFT-AR model:

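$$Q_{n\lambda}(\beta) = \frac{1}{2} \Big\| \tilde Y - \sum_{j=1}^{p} \tilde B_j \beta_j \Big\|_2^2 + \sum_{j=1}^{p} \lambda \int_0^{\|\beta_j\|_{A_j}} \Big(1 - \frac{x}{\gamma\lambda}\Big)_+ dx \qquad (2.8)$$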

We can conduct group (component) selection and estimation by minimizing $Q_{n\lambda}(\beta)$. If $\|\beta_j\|_{A_j} = 0$, the function component $f_j$ is deleted; otherwise, it is selected. Furthermore, the individual basis functions within a group can be selected.
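The integral in the penalty (2.6) can be evaluated in closed form: for a group norm $t \ge 0$ it equals $\lambda t - t^2/(2\gamma)$ when $t \le \gamma\lambda$ and the constant $\gamma\lambda^2/2$ when $t > \gamma\lambda$, so the penalty stops growing once the group norm exceeds $\gamma\lambda$. The sketch below implements this closed form together with the derivative (2.7); the function names are hypothetical and the scalar argument `t` stands for $\|\beta_j\|_{A_j}$.

```python
def mcp_penalty(t, lam, gamma):
    """Closed form of lam * integral_0^t (1 - x / (gamma * lam))_+ dx, t >= 0."""
    if lam == 0:
        return 0.0
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # flat beyond gamma * lam


def mcp_derivative(t, lam, gamma):
    """Derivative of the penalty with respect to t, as in (2.7)."""
    if lam == 0:
        return 0.0
    return lam * max(0.0, 1.0 - t / (gamma * lam))
```

Because the derivative vanishes for $t > \gamma\lambda$, large groups are not shrunk, which is the property that distinguishes the concave penalty from the group Lasso penalty $\lambda t$.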

We derive a group coordinate descent algorithm for computing β . This algorithm is a natural extension of the standard coordinate descent algorithm ( [

The group coordinate descent algorithm optimizes a target function with respect to a single group at a time, iteratively cycling through all groups until convergence is reached. It is particularly suitable for computing β , since it has a simple closed form expression for a single-group model, see (2.11) below.

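We write $A_j = n^{-1} \tilde B_j' \tilde B_j = R_j' R_j$ for an $M_n \times M_n$ upper triangular matrix $R_j$ obtained via the Cholesky decomposition, where $A_j$ here denotes the Gram matrix of the $j$th group, so that $\|\beta_j\|_{A_j} = (\beta_j' A_j \beta_j)^{1/2}$. Let $\theta_j = R_j \beta_j$ and $\hat B_j = \tilde B_j R_j^{-1}$. Simple algebra shows that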

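$$Q(\theta, \lambda, \gamma) = \frac{1}{2} \Big\| \tilde Y - \sum_{j=1}^{p} \hat B_j \theta_j \Big\|_2^2 + \sum_{j=1}^{p} \lambda \int_0^{\|\theta_j\|} \Big(1 - \frac{x}{\gamma\lambda}\Big)_+ dx \qquad (2.9)$$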

Note that $n^{-1} \hat B_j' \hat B_j = (R_j')^{-1} (n^{-1} \tilde B_j' \tilde B_j) R_j^{-1} = I_{M_n}$. Let $\tilde Y_j = \tilde Y - \sum_{k \ne j} \hat B_k \theta_k$ and

$$Q_j(\theta_j, \lambda, \gamma) = \frac{1}{2} \| \tilde Y_j - \hat B_j \theta_j \|_2^2 + \lambda \int_0^{\|\theta_j\|} \Big(1 - \frac{x}{\gamma\lambda}\Big)_+ dx \qquad (2.10)$$

Let $\eta_j = (\hat B_j' \hat B_j)^{-1} \hat B_j' \tilde Y_j$. For $\gamma > 1$, it can be verified that the value that minimizes $Q_j(\theta_j, \lambda, \gamma)$ is

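$$\tilde\theta_{j,GM}(\lambda, \gamma) = M(\eta_j; \lambda, \gamma) \equiv \begin{cases} 0, & \text{if } \|\eta_j\| \le \lambda, \\ \dfrac{\gamma}{\gamma - 1} \Big(1 - \dfrac{\lambda}{\|\eta_j\|}\Big) \eta_j, & \text{if } \lambda < \|\eta_j\| \le \gamma\lambda, \\ \eta_j, & \text{if } \|\eta_j\| > \gamma\lambda. \end{cases} \qquad (2.11)$$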

In particular, when γ = ∞ , we have

$$\tilde\theta_{j,GL} = \Big(1 - \frac{\lambda}{\|\eta_j\|}\Big)_+ \eta_j,$$

which is the GLasso estimate for a single-group model ( [
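The two single-group solutions can be written as short thresholding functions. The following is a minimal sketch of the operator $M(\eta_j; \lambda, \gamma)$ in (2.11) and of its $\gamma = \infty$ limit; the function names are hypothetical, and `eta` is assumed to be a one-dimensional NumPy-compatible vector.

```python
import numpy as np


def group_mcp_threshold(eta, lam, gamma):
    """Group MCP thresholding operator M(eta; lam, gamma) for gamma > 1."""
    eta = np.asarray(eta, dtype=float)
    norm = np.linalg.norm(eta)
    if norm <= lam:
        return np.zeros_like(eta)          # whole group set to zero
    if norm <= gamma * lam:
        return gamma / (gamma - 1.0) * (1.0 - lam / norm) * eta
    return eta.copy()                      # no shrinkage for large groups


def group_lasso_threshold(eta, lam):
    """Group Lasso (soft) thresholding, the gamma -> infinity limit."""
    eta = np.asarray(eta, dtype=float)
    norm = np.linalg.norm(eta)
    if norm <= lam:
        return np.zeros_like(eta)
    return (1.0 - lam / norm) * eta
```

At $\|\eta_j\| = \gamma\lambda$ the middle branch gives $\frac{\gamma}{\gamma-1}(1 - 1/\gamma)\eta_j = \eta_j$, so the operator is continuous across the two upper branches.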

The group coordinate descent algorithm can now be implemented as follows. Suppose the current values for the group parameter θ ˜ k ( s ) , k ≠ j are given. We want to minimize Q ( θ , λ , γ ) with respect to θ j . Let

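$$Q_j(\theta_j, \lambda, \gamma) = \frac{1}{2} \Big\| \tilde Y - \sum_{k \ne j} \hat B_k \tilde\theta_k^{(s)} - \hat B_j \theta_j \Big\|_2^2 + \lambda \int_0^{\|\theta_j\|} \Big(1 - \frac{x}{\gamma\lambda}\Big)_+ dx \qquad (2.12)$$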

and write Y ˜ j = ∑ k ≠ j B ^ k θ ˜ k ( s ) and η j = n − 1 B ^ ′ j ( Y ˜ − Y ˜ j ) . Let θ ˜ j be the minimizer of Q j ( θ j , λ , γ ) . When γ > 1 , we have θ ˜ j = M ( η j , λ , γ ) , where M is defined in (2.11).

For any given ( λ , γ ) , we use (2.11) to cycle through one component at a time. Let θ ˜ ( 0 ) = ( θ ˜ 1 ( 0 ) ′ , ⋯ , θ ˜ p ( 0 ) ′ ) ′ be the initial value. The proposed coordinate descent algorithm is as follows.

Initialize the vector of residuals $r = \tilde Y - \sum_{j=1}^{p} \hat B_j \tilde\theta_j^{(0)}$. For $s = 0, 1, \dots$, carry out the following calculation until convergence. For $j = 1, \dots, p$, repeat the following steps.

Step 1: Calculate $\tilde\eta_j = n^{-1} \hat B_j' r + \tilde\theta_j^{(s)}$.

Step 2: Update $\tilde\theta_j^{(s+1)} = M(\tilde\eta_j; \lambda, \gamma)$.

Step 3: Update $r = r - \hat B_j (\tilde\theta_j^{(s+1)} - \tilde\theta_j^{(s)})$ and $j = j + 1$.

The last step ensures that r holds the current values of the residuals. Although the objective function is not necessarily convex, it is convex with respect to a single group when the coefficients of all the other groups are fixed.
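The three steps above can be sketched as follows. This is an illustrative implementation rather than the authors' code: the function names are hypothetical, the starting value is taken to be zero, the stopping rule (maximum coefficient change below `tol`) is an assumption, and the least squares term is assumed to be scaled by $1/(2n)$, which is the scaling under which Step 1 combined with (2.11) gives the exact single-group minimizer for groups satisfying $n^{-1} \hat B_j' \hat B_j = I_{M_n}$.

```python
import numpy as np


def group_mcp_threshold(eta, lam, gamma):
    """Thresholding operator M(eta; lam, gamma) from (2.11)."""
    norm = np.linalg.norm(eta)
    if norm <= lam:
        return np.zeros_like(eta)
    if norm <= gamma * lam:
        return gamma / (gamma - 1.0) * (1.0 - lam / norm) * eta
    return eta.copy()


def orthonormalize_group(B_j):
    """Return (B_hat, R) with B_hat = B_j R^{-1} and B_hat' B_hat / n = I."""
    n = B_j.shape[0]
    L = np.linalg.cholesky(B_j.T @ B_j / n)  # L @ L.T is the Gram matrix; R = L.T
    B_hat = np.linalg.solve(L, B_j.T).T      # equals B_j @ inv(R)
    return B_hat, L.T


def group_coordinate_descent(groups, y, lam, gamma, max_sweeps=200, tol=1e-8):
    """Cycle through orthonormalized groups until the coefficients stabilize."""
    n = len(y)
    thetas = [np.zeros(B.shape[1]) for B in groups]
    r = np.asarray(y, dtype=float).copy()    # residuals for theta = 0
    for _ in range(max_sweeps):
        max_change = 0.0
        for j, B in enumerate(groups):
            eta = B.T @ r / n + thetas[j]                      # Step 1
            new_theta = group_mcp_threshold(eta, lam, gamma)   # Step 2
            r -= B @ (new_theta - thetas[j])                   # Step 3
            change = float(np.max(np.abs(new_theta - thetas[j])))
            max_change = max(max_change, change)
            thetas[j] = new_theta
        if max_change < tol:
            break
    return thetas, r
```

Coefficients on the original scale can be recovered as $\beta_j = R_j^{-1} \theta_j$ using the matrix returned by `orthonormalize_group`.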

Let $|A|$ denote the cardinality of any set $A \subseteq \{1, \dots, p\}$ and $d_A = |A| M_n$. Define

$$B_A = (\tilde B_{jk},\ k = 1, \dots, M_n;\ j \in A) \quad \text{and} \quad \Sigma_A = \frac{B_A' B_A}{n}$$

Here $B_A$ is the $n \times d_A$ sub-design matrix corresponding to the variables in $A$. Denote $\|f_j\|_2 = [E f_j^2(X_j)]^{1/2}$. We make the following assumptions.

Similar to [

(C1) $T$ and $C$ are independent.

(C2) $\Pr(T \le C \mid T, X) = \Pr(T \le C \mid T)$.

(C3) $E(T^2) < \infty$ and $E(\varepsilon \mid X) = 0$.

(C4) Denote by $\tau_T$ and $\tau_C$ the least upper bounds of $T$ and $C$, respectively. Then $\tau_T < \tau_C$ or $\tau_T = \tau_C = \infty$.

(C5) $f^2$ has a finite envelope function.

(C6) $E\{f(X) - f_0(X)\}^2 > 0$ for $f \ne f_0$.

These assumptions correspond to the conditions in [

(C7) There exist constants $q^* > 0$ and $0 < c_1 \le c_2 < \infty$ such that

$$c_1 \le \frac{\|B_A \nu\|_2^2}{n} \le c_2, \quad \forall\, |A| = q^*,\ \|\nu\|_2 = 1 \text{ and } \nu \in \mathbb{R}^{d_A}$$

(C8) There is a small constant $\eta_1 \ge 0$ such that $\sum_{j \in A_0} \|f_j\|_2 \le \eta_1$.

(C9) The random errors $\varepsilon_i$, $i = 1, \dots, n$, are independent and identically distributed as $\varepsilon$, where $E(\varepsilon) = 0$ and $E(\varepsilon^2) = \sigma^2 < \infty$; moreover, the tail probabilities satisfy $P(|\varepsilon| > x) \le K \exp(-C x^2)$ for $x > 0$ and some constants $C$ and $K$.

(C10) There exists a positive constant M such that | x i k | ≤ M , i = 1 , ⋯ , n ; k = 1 , ⋯ , p .

(C7) is the sparse Riesz condition (SRC) formulated for the nonparametric AFT model (2.1), which controls the range of eigenvalues of the matrix $\Sigma_A$. This condition was introduced to study the properties of the Lasso for the linear regression model by [

In this subsection, we write $\hat f_j(X_j) = \sum_{k=1}^{M_n} \hat\beta_{jk} B_{jk}(X_j)$ for the GMCP estimator of $f_j$. Let $\beta_*^o = \min\{\|\beta_j^o\|_2,\ j \in A_0^c\}$ and set $\beta_*^o = \infty$ if $A_0^c$ is empty. Define

$$\hat\beta^o = \arg\min_{b} \Big\{ \frac{1}{2} \Big\| \tilde Y - \sum_{j=1}^{p} \tilde B_j b_j \Big\|_2^2 :\ \|b_j\|_2 = 0,\ \forall j \notin A_0^c \Big\} \qquad (3.1)$$

and

$$\hat f_j^o(X_j) = \sum_{k=1}^{M_n} \hat\beta_{jk}^o B_{jk}(X_j)$$

This is the oracle least squares estimator. Of course, it is not a real estimator, since the oracle set is unknown.

We first consider the case where the 2-norm GMCP objective function is convex. This necessarily requires $c_{\min} > 0$, where $c_{\min}$ is the smallest eigenvalue of $\Sigma = n^{-1} B' B$. As in [

$$h(t, k) = \exp\Big(-k \big(\sqrt{2t - 1} - 1\big)^2 / 4\Big), \quad t > 1,\ k = 1, 2, \dots \qquad (3.2)$$
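For a sense of the magnitudes involved, the bound can be evaluated numerically. The sketch below reads (3.2) as $h(t, k) = \exp(-k(\sqrt{2t - 1} - 1)^2/4)$, i.e., it assumes the square-root form, and computes a term of the form $(p - q)\,h(n\lambda^2/\sigma^2, M_n)$; the function names and the numerical inputs in the usage note are hypothetical.

```python
import math


def tail_bound_h(t, k):
    """Evaluate h(t, k) = exp(-k * (sqrt(2t - 1) - 1)^2 / 4) for t > 1."""
    if t <= 1:
        raise ValueError("h(t, k) is defined for t > 1")
    return math.exp(-k * (math.sqrt(2.0 * t - 1.0) - 1.0) ** 2 / 4.0)


def eta_1n(p, q, n, lam, sigma2, m_n):
    """Evaluate (p - q) * h(n * lam^2 / sigma2, m_n)."""
    return (p - q) * tail_bound_h(n * lam ** 2 / sigma2, m_n)
```

For example, `eta_1n(500, 6, 400, 0.2, 1.0, 8)` evaluates the bound with $t = n\lambda^2/\sigma^2 = 16$; the bound decreases rapidly as $t$ or $k$ grows.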

This function arises from an upper bound for the tail probabilities of the chi-square distributions given in Lemma A.2 in Appendix. This is derived from an exponential inequality for chi-square random variables of [

Theorem 3.1. Suppose $\varepsilon_1, \dots, \varepsilon_n$ are independent and identically distributed as $N(0, \sigma^2)$ and (C1)-(C10) hold. Then for any $(\lambda, \gamma)$ satisfying $\gamma > 1/c_{\min}$, $\beta_*^o > \lambda\gamma$ and $n\lambda^2 > \sigma^2$, we have

$$P(\hat\beta(\lambda, \gamma) \ne \hat\beta^o) \le \eta_{1n}(\lambda) + \eta_{2n}(\lambda)$$

and

$$P(\hat f \ne \hat f^o) \le \eta_{1n}(\lambda) + \eta_{2n}(\lambda)$$

where $\eta_{1n}(\lambda) = (p - q)\, h(n\lambda^2/\sigma^2, M_n)$ and $\eta_{2n}(\lambda) = q\, h(c_1 n (\beta_*^o - \gamma\lambda)^2/\sigma^2, M_n)$.

We give the proof of Theorem 3.1 in the Appendix. The theorem provides an upper bound on the probability that $\hat f$ differs from the oracle estimator in terms of the tail probability function $h$ in (3.2). The key condition $\gamma > 1/c_{\min}$ ensures that the 2-norm GMCP criterion is strictly convex. Nonetheless, this result is a starting point for a similar result in the $p > n$ case. The following corollary is an immediate consequence of Theorem 3.1.

Corollary 1. Suppose that the conditions of Theorem 3.1 are satisfied. Also suppose that $\beta_*^o \ge \gamma\lambda + a_n \tau_n$ for $a_n \to \infty$ as $n \to \infty$. If $\lambda \ge a_n \lambda_n$, then

P ( β ^ ( λ , γ ) ≠ β ^ o ) → 0 as n → ∞

and

P ( f ^ ≠ f ^ o ) → 0 as n → ∞

where

$$\lambda_n = \sigma \sqrt{\frac{2 \log(\max\{p - q, 1\})}{n M_n}} \quad \text{and} \quad \tau_n = \sigma \sqrt{\frac{2 \log(\max\{q, 1\})}{n c_1 M_n}}$$

By Corollary 1, the 2-norm GMCP estimator equals the oracle least squares estimator with probability converging to one. This implies it is group selection consistent. We now consider the high-dimensional case where p > n . Under condition (C7), let K ∗ = c ¯ − 1 / 2 , m ∗ = K ∗ q and ξ = 1 / ( 4 c * M n ) . Define

$$\eta_{3n}(\lambda) = (p - q)^{m^*} \frac{e^{m^*}}{(m^*)^{m^*}}\, h\Big(\frac{\xi n \lambda^2}{\sigma^2 M_n},\ m^* M_n\Big)$$

Theorem 3.2. suppose ε 1 , ⋯ , ε n are independent and identically distributed as N ( 0, σ 2 ) and B satisfies the S R C ( q * , c 1 , c 2 ) in (C7) with q * ≥ ( c ¯ − 1 / 2 ) , m ∗ = K ∗ q and ξ = 1 / ( 4 c * M n ) , we have

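$$P(\hat\beta(\lambda, \gamma) \ne \hat\beta^o) \le \eta_{1n}(\lambda) + \eta_{2n}(\lambda) + \eta_{3n}(\lambda)$$

and

$$P(\hat f \ne \hat f^o) \le \eta_{1n}(\lambda) + \eta_{2n}(\lambda) + \eta_{3n}(\lambda)$$

where $\eta_{1n}(\lambda) = (p - q)\, h(n\lambda^2/\sigma^2, M_n)$ and $\eta_{2n}(\lambda) = q\, h(c_1 n (\beta_*^o - \gamma\lambda)^2/\sigma^2, M_n)$.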

Corollary 2. Suppose that the conditions of Theorem 3.2 are satisfied. Also suppose that $\beta_*^o \ge \gamma\lambda + a_n \tau_n$ for $a_n \to \infty$ as $n \to \infty$. If $\lambda \ge a_n \lambda_n^*$, then

P ( β ^ ( λ , γ ) ≠ β ^ o ) → 0 as n → ∞

and

P ( f ^ ≠ f ^ o ) → 0 as n → ∞

where $\lambda_n^* = 2\sigma \sqrt{2 c_2 M_n \log(p - q)/n}$.

Theorem 3.2 and Corollary 2 provide sufficient conditions for the asymptotic oracle property of the global 2-norm GMCP estimator in the p > n situations. Here we allow p − | A 0 c | = exp { O ( n / ( c 2 M n ) ) } . So p can be greater than n. The condition n λ 2 ξ > σ 2 M n is stronger than the corresponding condition n λ 2 > σ 2 in Theorem 3.5 ( [

In this section, we conduct simulation studies to evaluate the performance of the GMCP and GLasso penalties in a high-dimensional NP-AFT-AR model with limited samples. We therefore focus on the comparisons of the group selection methods with only the BIC ( [

$$\mathrm{BIC}(\lambda, M_n) = \log(\mathrm{RSS}_{\lambda, M_n}) + \frac{\log(n)\, df_{\lambda, M_n}}{n}$$

where RSS is the sum of squared residuals and $df$ is the number of selected variables for the given $(\lambda, M_n)$. We choose $M_n$ from the increasing sequence in Section 5. For any given value of $M_n$, we choose $\lambda$ from a sequence of 100 values ranging from $0.01\lambda_{\max}$ to $\lambda_{\max}$, where $\lambda_{\max} = \max_{1 \le j \le p} \|\tilde B_j' \tilde Y\|_2 / \sqrt{M_n}$ and $\tilde B_j$ is the $n \times M_n$ “design” matrix corresponding to the covariate $X_j$, $j = 1, \dots, p$. Here $\lambda_{\max}$ is the maximum penalty value, which shrinks all estimated coefficients to zero.
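The tuning step can be sketched as below. The BIC formula follows the display above; the grid is assumed to be equally spaced on the log scale, since the text only specifies the endpoints and the number of values, and the function names are hypothetical.

```python
import math


def bic(rss, df, n):
    """BIC = log(RSS) + log(n) * df / n for one (lambda, M_n) pair."""
    return math.log(rss) + math.log(n) * df / n


def lambda_grid(lam_max, n_values=100, ratio=0.01):
    """Grid from ratio * lam_max up to lam_max (log spacing is an assumption)."""
    lo = ratio * lam_max
    step = (math.log(lam_max) - math.log(lo)) / (n_values - 1)
    return [lo * math.exp(step * i) for i in range(n_values)]
```

In use, one would fit the model at each grid value, record the residual sum of squares and the number of selected variables, and keep the value of $\lambda$ with the smallest BIC.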

We compute the empirical prediction mean square error (MSE) to reveal the estimation accuracy. Let f ^ j be the estimator of f j , j = 1, ⋯ , p ; and we define MSE as

$$\mathrm{MSE}_{f_j} = \frac{1}{n} \sum_{i=1}^{n} \big| \hat f_j(X_{ij}) - f_j(X_{ij}) \big|^2$$

Three scenarios are considered in the following, where some nonzero components are linear and the response variable is subject to various censoring rates. The sample sizes are $n = 400$ and $n = 200$, and a total of 100 simulation runs are used. The logarithm of the censoring time $C_i$ is generated from a uniform distribution $U(c_1, c_2)$, where $c_1$ and $c_2$ are determined by a Monte Carlo method to achieve censoring rates of 35% and 40%, respectively. For example, the censoring rate $cr = \Pr(T > C)$ is approximated by $\hat{cr} = \sum_{i=1}^{M} I(T_i > C_i)/M$, where $T_i$ is drawn from the proposed model (2.1), $C_i$ is drawn from the uniform censoring distribution, and $M$ is the number of Monte Carlo draws used to compute $cr$. When we choose $c_1 = 0$ and $c_2 = 4$, $\hat{cr} \approx 40\%$, which is taken as the desired censoring rate. To balance computational efficiency and accuracy, we use cubic B-splines with five evenly spaced interior knots for all the functions $f_j$, $j = 1, \dots, p$, which gives $3 + 1 + 5 = 9$ basis functions for each nonparametric component. Because of the identifiability constraint $E\{f_j(X_j)\} = 0$, the actual number of basis functions used is 8. This choice is made because our simulation studies indicated that using a larger number of knots does not improve the finite-sample performance (results not shown).
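The Monte Carlo calibration of the censoring bounds described above can be sketched as follows. The function name, the seed handling, and the callable `draw_log_t` (which should return one draw of the log failure time from the model under study) are assumptions for illustration; a standard normal log failure time is used in the usage note only as a placeholder.

```python
import random


def estimate_censoring_rate(draw_log_t, c_low, c_high, n_draws=10000, seed=0):
    """Monte Carlo estimate of Pr(T > C) with log C ~ Uniform(c_low, c_high)."""
    rng = random.Random(seed)
    censored = 0
    for _ in range(n_draws):
        log_t = draw_log_t(rng)              # one log failure time
        log_c = rng.uniform(c_low, c_high)   # one log censoring time
        if log_t > log_c:
            censored += 1
    return censored / n_draws
```

For instance, `estimate_censoring_rate(lambda rng: rng.gauss(0.0, 1.0), 0.0, 4.0)` estimates the rate for a placeholder model; in practice the bounds would be adjusted until the estimate is close to the target rate.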

In this scenario, we consider independent covariates and set the intercept $\eta_0 = 0$. The failure times $T_i$, $i = 1, \dots, n$, are generated from

$$T = \exp\Big( f_1(X_1) + f_2(X_2) + f_3(X_3) + f_4(X_4) + f_5(X_5) + f_6(X_6) + \sum_{j=7}^{p} f_j(X_j) + \varepsilon \Big)$$

where

$$f_1(X_1) = 2\big(\sin(0.25\pi X_1)\big)^3, \quad f_2(X_2) = 2\sin(2X_2), \quad f_3(X_3) = X_3^2 - \tfrac{3}{4},$$

$$f_4(X_4) = 1.2X_4, \quad f_5(X_5) = \exp(-X_5) - \tfrac{25}{12},$$

$$f_6(X_6) = \tfrac{1}{4} X_6^3, \quad f_7(X_7) = \dots = f_p(X_p) \equiv 0.$$

The predictors are sampled from the N ( 0,1 ) .

We set $p = 500$ and consider the cases $n = 400$ and $n = 200$ to see how the performance of the proposed methods changes as the sample size increases. The penalty parameters are selected using BIC as described above.

The results for the the GMCP, GSCAD and GLasso methods are given in

Several observations can be made from Tables 1-4. The model selected by the GMCP is better than the one selected by the GLasso in terms of model error, the percentage of occasions on which the true variables are selected, and the mean square errors for the important coefficient functions. The GMCP includes the correct variables with high probability. As the sample size increases, the performance of both methods improves, as expected. To examine the estimated nonparametric functions from the concave group selection methods, we plot the GMCP estimates along with the true function components in

Results for high dimension, p = 500

| Method | NV | ER | IN% | CS% |
|---|---|---|---|---|
| n = 400 (CR = 35%) | | | | |
| Group Lasso | 6.0 (0.00) | 0.0004 (0.0003) | 100.0 (0.00) | 100.0 (0.00) |
| Group SCAD | 6.0 (0.00) | 0.0001 (0.0001) | 100.0 (0.00) | 100.0 (0.00) |
| Group MCP | 6.0 (0.00) | 0.00009 (0.00009) | 100.0 (0.00) | 100.0 (0.00) |
| n = 200 (CR = 35%) | | | | |
| Group Lasso | 8.0 (1.83) | 0.0015 (0.0018) | 97.0 (0.171) | 97.0 (0.171) |
| Group SCAD | 6.1 (0.35) | 0.0007 (0.0014) | 99.0 (0.100) | 99.0 (0.100) |
| Group MCP | 6.2 (1.62) | 0.0005 (0.0010) | 99.0 (0.100) | 98.0 (0.140) |

| Method | $f_1(X_1)$ | $f_2(X_2)$ | $f_3(X_3)$ | $f_4(X_4)$ | $f_5(X_5)$ | $f_6(X_6)$ |
|---|---|---|---|---|---|---|
| n = 400 (CR = 35%) | | | | | | |
| Group Lasso | 0.109 (0.054) | 0.312 (0.087) | 0.324 (0.105) | 0.150 (0.065) | 0.682 (0.584) | 0.624 (0.299) |
| Group SCAD | 0.076 (0.049) | 0.227 (0.086) | 0.262 (0.089) | 0.114 (0.068) | 0.683 (0.626) | 0.519 (0.319) |
| Group MCP | 0.073 (0.048) | 0.226 (0.085) | 0.258 (0.094) | 0.111 (0.064) | 0.644 (0.540) | 0.516 (0.319) |
| n = 200 (CR = 35%) | | | | | | |
| Group Lasso | 0.259 (0.101) | 0.803 (0.268) | 1.399 (0.558) | 0.415 (0.168) | 0.864 (0.610) | 0.711 (0.313) |
| Group SCAD | 0.202 (0.125) | 0.378 (0.274) | 0.724 (0.374) | 0.175 (0.142) | 0.603 (0.533) | 0.584 (0.662) |
| Group MCP | 0.200 (0.117) | 0.365 (0.262) | 0.720 (0.367) | 0.162 (0.123) | 0.639 (0.687) | 0.547 (0.646) |

Results for high dimension, p = 500

| Method | NV | ER | IN% | CS% |
|---|---|---|---|---|
| n = 400 (CR = 40%) | | | | |
| Group Lasso | 6.6 (1.16) | 0.0003 (0.0003) | 100.0 (0.0) | 100.0 (0.0) |
| Group SCAD | 6.1 (0.29) | 0.0001 (0.0001) | 100.0 (0.0) | 100.0 (0.0) |
| Group MCP | 6.1 (0.37) | 0.00009 (0.00009) | 100.0 (0.0) | 100.0 (0.0) |
| n = 200 (CR = 40%) | | | | |
| Group Lasso | 8.4 (2.31) | 0.0016 (0.0031) | 96.0 (0.196) | 95.0 (0.219) |
| Group SCAD | 6.1 (2.04) | 0.0010 (0.0031) | 97.0 (0.171) | 95.0 (0.219) |
| Group MCP | 6.2 (2.23) | 0.0007 (0.0027) | 97.0 (0.171) | 96.0 (0.196) |

| Method | $f_1(X_1)$ | $f_2(X_2)$ | $f_3(X_3)$ | $f_4(X_4)$ | $f_5(X_5)$ | $f_6(X_6)$ |
|---|---|---|---|---|---|---|
| n = 400 (CR = 40%) | | | | | | |
| Group Lasso | 0.111 (0.055) | 0.247 (0.102) | 0.176 (0.140) | 0.132 (0.074) | 0.750 (0.702) | 0.666 (0.313) |
| Group SCAD | 0.077 (0.051) | 0.202 (0.100) | 0.110 (0.059) | 0.109 (0.087) | 0.681 (0.592) | 0.563 (0.357) |
| Group MCP | 0.074 (0.050) | 0.202 (0.100) | 0.113 (0.074) | 0.107 (0.087) | 0.655 (0.552) | 0.555 (0.345) |
| n = 200 (CR = 40%) | | | | | | |
| Group Lasso | 0.392 (0.144) | 0.746 (0.343) | 0.777 (0.456) | 0.543 (0.272) | 1.304 (0.834) | 0.439 (0.197) |
| Group SCAD | 0.133 (0.159) | 0.441 (0.357) | 0.271 (0.369) | 0.217 (0.283) | 0.894 (0.769) | 0.297 (0.244) |
| Group MCP | 0.122 (0.148) | 0.428 (0.346) | 0.286 (0.462) | 0.197 (0.240) | 0.916 (1.125) | 0.289 (0.243) |

nonparameter f j ( X j ) , j = 1, ⋯ ,6 , fit the true functions well, which are consistent with the mean square errors for the functions reported in

In this scenario, we consider correlated covariates and set the intercept $\eta_0 = 0$. The failure times $T_i$, $i = 1, \dots, n$, are generated from

$$T = \exp\Big( f_1(X_1) + f_2(X_2) + f_3(X_3) + f_4(X_4) + f_5(X_5) + f_6(X_6) + \sum_{j=7}^{p} f_j(X_j) + \varepsilon \Big)$$

$$f_1(X_1) = 1.2X_1, \quad f_2(X_2) = 2\sin(2X_2), \quad f_3(X_3) = X_3^2 - \tfrac{3}{4},$$

$$f_4(X_4) = \exp(-X_4) - \tfrac{25}{12}, \quad f_5(X_5) = \sin(0.5\pi X_5),$$

$$f_6(X_6) = 2\big(\sin(0.25\pi X_6)\big)^3, \quad f_7(X_7) = \dots = f_p(X_p) \equiv 0.$$

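where the covariates $X = (X_1, X_2, \dots, X_p)$ are generated as $X_j = (W_j + 0.5U)/1.5$, $j = 1, \dots, p$, with $W_1, \dots, W_p$ and $U$ i.i.d. $N(0, 1)$. This provides a design in which every pair of covariates is positively correlated; with the weights as written, the pairwise correlation is $0.5^2/(1 + 0.5^2) = 0.2$, and a pairwise correlation of 0.5 would correspond to $X_j = (W_j + U)/2$.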
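The correlated design can be simulated and checked numerically. The sketch below generates covariates of the form $X_j = (W_j + tU)/(1 + t)$ and computes the pairwise correlation this construction implies, $t^2/(1 + t^2)$; the function names, the parameter `t`, and the seed handling are illustrative assumptions.

```python
import numpy as np


def correlated_covariates(n, p, t=0.5, seed=0):
    """Draw an n x p matrix with columns X_j = (W_j + t * U) / (1 + t)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, p))   # independent parts W_1, ..., W_p
    U = rng.standard_normal((n, 1))   # shared component U
    return (W + t * U) / (1.0 + t)


def implied_correlation(t):
    """Population correlation between any two columns of the design."""
    return t ** 2 / (1.0 + t ** 2)
```

The empirical correlations from `np.corrcoef(correlated_covariates(20000, 3), rowvar=False)` can be compared with `implied_correlation(0.5)` to confirm the strength of dependence actually induced by a given weight.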

The simulation study results are reported in Tables 5-8. The conclusions for Scenario 2 are very similar to those for Scenario 1. When the censoring rate increases, the estimation and selection performance decreases for all methods. The results in

Results for high dimension, p = 500

| Method | NV | ER | IN% | CS% |
|---|---|---|---|---|
| n = 400 (CR = 35%) | | | | |
| Group Lasso | 9.0 (2.05) | 0.0024 (0.0011) | 100.0 (0.00) | 99.0 (0.10) |
| Group SCAD | 7.8 (1.58) | 0.0015 (0.0009) | 100.0 (0.00) | 100.0 (0.00) |
| Group MCP | 8.3 (2.07) | 0.0013 (0.0007) | 100.0 (0.00) | 100.0 (0.00) |
| n = 200 (CR = 35%) | | | | |
| Group Lasso | 12.9 (3.47) | 0.0043 (0.0033) | 86.5 (0.343) | 86.0 (0.347) |
| Group SCAD | 8.3 (1.63) | 0.0033 (0.0031) | 93.5 (0.247) | 92.0 (0.271) |
| Group MCP | 8.6 (1.72) | 0.0024 (0.0024) | 93.5 (0.247) | 93.5 (0.247) |

| Method | $f_1(X_1)$ | $f_2(X_2)$ | $f_3(X_3)$ | $f_4(X_4)$ | $f_5(X_5)$ | $f_6(X_6)$ |
|---|---|---|---|---|---|---|
| n = 400 (CR = 35%) | | | | | | |
| Group Lasso | 0.149 (0.059) | 0.173 (0.082) | 0.224 (0.218) | 0.998 (0.166) | 0.073 (0.026) | 0.124 (0.069) |
| Group SCAD | 0.086 (0.047) | 0.117 (0.082) | 0.191 (0.527) | 0.757 (0.139) | 0.032 (0.017) | 0.114 (0.122) |
| Group MCP | 0.070 (0.042) | 0.133 (0.089) | 0.177 (0.479) | 0.715 (0.132) | 0.028 (0.013) | 0.113 (0.124) |
| n = 200 (CR = 35%) | | | | | | |
| Group Lasso | 0.404 (0.149) | 0.597 (0.264) | 0.586 (0.143) | 1.406 (0.304) | 0.233 (0.109) | 0.256 (0.065) |
| Group SCAD | 0.221 (0.162) | 0.365 (0.503) | 0.441 (0.326) | 0.956 (0.338) | 0.119 (0.125) | 0.206 (0.120) |
| Group MCP | 0.175 (0.138) | 0.374 (0.496) | 0.363 (0.337) | 0.849 (0.275) | 0.082 (0.1044) | 0.177 (0.106) |

Results for high dimension, p = 500

| Method | NV | ER | IN% | CS% |
|---|---|---|---|---|
| n = 400 (CR = 40%) | | | | |
| Group Lasso | 9.0 (1.71) | 0.0019 (0.0010) | 100.0 (0.00) | 98.0 (0.14) |
| Group SCAD | 7.7 (1.18) | 0.0012 (0.0007) | 100.0 (0.00) | 98.0 (0.14) |
| Group MCP | 8.4 (1.54) | 0.0009 (0.00055) | 100.0 (0.00) | 98.0 (0.14) |
| n = 200 (CR = 40%) | | | | |
| Group Lasso | 13.0 (3.57) | 0.0044 (0.0031) | 89.0 (0.313) | 85.0 (0.357) |
| Group SCAD | 8.2 (1.66) | 0.0033 (0.0030) | 95.0 (0.218) | 85.0 (0.357) |
| Group MCP | 8.4 (1.50) | 0.0024 (0.0023) | 95.0 (0.218) | 86.0 (0.347) |

| Method | $f_1(X_1)$ | $f_2(X_2)$ | $f_3(X_3)$ | $f_4(X_4)$ | $f_5(X_5)$ | $f_6(X_6)$ |
|---|---|---|---|---|---|---|
| n = 400 (CR = 40%) | | | | | | |
| Group Lasso | 0.104 (0.071) | 0.192 (0.103) | 0.157 (0.0844) | 0.781 (0.191) | 0.071 (0.036) | 0.152 (0.058) |
| Group SCAD | 0.070 (0.037) | 0.103 (0.076) | 0.132 (0.160) | 0.737 (0.269) | 0.037 (0.030) | 0.100 (0.065) |
| Group MCP | 0.065 (0.037) | 0.099 (0.074) | 0.127 (0.171) | 0.740 (0.298) | 0.034 (0.028) | 0.081 (0.059) |
| n = 200 (CR = 40%) | | | | | | |
| Group Lasso | 0.414 (0.134) | 0.578 (0.232) | 0.495 (0.132) | 1.466 (0.332) | 0.213 (0.087) | 0.231 (0.069) |
| Group SCAD | 0.224 (0.176) | 0.262 (0.246) | 0.389 (0.301) | 1.213 (0.489) | 0.115 (0.109) | 0.211 (0.172) |
| Group MCP | 0.176 (0.148) | 0.204 (0.190) | 0.351 (0.365) | 1.141 (0.527) | 0.084 (0.099) | 0.207 (0.166) |

model, since the MSE under the GMCP approach is always smaller than that under the GLasso approach. The results in

In this section, we will use Shedden 2008 (for short) to conduct an empirical analysis of part of the collected lung adenocarcinoma data to illustrate the proposed method. For more information, see [

Here, we are interested in the effect of tumor gene expression levels on the survival time of lung adenocarcinoma patients. Since the linearity assumption is usually implicit and hard to verify in high dimensions, the proposed method may be more suitable for feature selection problems in which nonlinear effects are considered. In our analysis, we set the number of spline basis functions to $M_n = 5$ for each gene. The proposed method selects 1 gene locus under GMCP (i.e., 200746_s_at). However, when $p = 500$ and $p = 1000$, GLasso penalized regression selects 6 and 10 genes, respectively.

From

In this paper, we study weighted least squares estimation and the selection properties of the GMCP in the NP-AFT-AR model with high-dimensional data. Our simulation results show that GLasso tends to select some unimportant variables. In contrast, the GMCP enjoys the asymptotic oracle property, which implies that it is also selection consistent.

The author declares no conflicts of interest regarding the publication of this paper.

Zhu, L. (2021) Concave Group Selection of Nonparameter Additive Accelerated Failure Time Model. Open Journal of Statistics, 11, 137-161. https://doi.org/10.4236/ojs.2021.111008

Lemma 1. Let χ k 2 be a random variable with chi-square distribution with k degrees of freedom. For t > 1 , P ( χ k 2 ≥ k t ) ≤ h ( t , k ) , where h ( t , k ) is defined in (3.2).

This lemma is a restatement of the exponential inequality for chi-square distributions of [

Proof of Theorem 3.1. Since $\hat\beta^o$ is the oracle least squares estimator, we have $\hat\beta_j^o = 0$, $j \in A_0$, and

− B ˜ ′ j ( Y ˜ − B ˜ β ^ o ) / n = 0, ∀ j ∈ A 0 c

If ‖ β ^ j o ‖ 2 / M n ≥ λ γ , then by the definition of the MCP, ρ ˙ ( ‖ β ^ j o ‖ 2 ; M n λ , γ ) = 0 . Since c min > 1 / γ , the criterion (2.8) is strictly convex. By the Karush-Kuhn-Tucker (KKT) conditions, the equality β ^ ( λ , γ ) = β ^ o holds in the intersection of the events

Ω 1 ( λ ) = { max j ∈ A 0 ‖ n − 1 B ˜ ′ j ( Y ˜ − B ˜ ) β ^ o ‖ 2 / M n ≤ λ }

and

Ω 2 ( λ ) = { min j ∈ A 0 c ‖ β ^ j o ‖ 2 ≥ γ λ }

We first bound 1 − P ( Ω 1 ( λ ) ) . Let β ^ A 0 c = ( β ^ j , j ∈ A 0 c ) ′ . By (A.1) [

β ^ A 0 c o = Σ A 0 c − 1 B ˜ ′ A 0 c Y ˜ / n = β A 0 c o + Σ A 0 c − 1 B ˜ ′ A 0 c ε / n

It follows that n − 1 B ˜ j ( Y ˜ − B ˜ β ^ o ) = n − 1 B ˜ ′ k ( I n − P A 0 c ) ε , where P A 0 c = n − 1 B ˜ A 0 c Σ A 0 c − 1 B ˜ ′ A 0 c , Because B ˜ ′ j B ˜ j = I M n , ‖ B ˜ j ( I n − P A 0 c ) ε ‖ 2 2 / σ 2 is distributed as a χ 2 distribution with M n degrees of freedom. We have, for n λ 2 / σ 2 ≥ 1

1 − P ( Ω 1 ( λ ) ) = P ( max j ∈ A 0 ‖ n 1 / 2 B ˜ ′ j ( I n − P A 0 c ) ε ‖ 2 2 / ( M n σ 2 ) > n λ 2 / σ 2 ) ≤ ∑ j ∈ A 0 P ( ‖ n − 1 / 2 B ˜ ′ j ( I n − P A 0 c ) ε ‖ 2 2 / ( M n σ 2 ) > M n n λ 2 / σ 2 ) ≤ ∑ j ∈ A 0 h ( n λ 2 / σ 2 , M n ) ≤ ( p − q ) h ( n λ 2 / σ 2 , M n ) = η 1 n ( λ ) (6.1)

where we used Lemma 1 in the third line. Now consider $\Omega_2(\lambda)$. Recall $\beta_*^o = \min_{j \in A_0^c} \|\beta_j^o\|_2$. If $\|\hat\beta_j^o - \beta_j^o\|_2 / M_n \le \beta_*^o - \gamma\lambda$ for all $j \in A_0^c$, then $\min_{j \in A_0^c} \|\hat\beta_j^o\|_2 / M_n \ge \gamma\lambda$. This implies

1 − P ( Ω 2 ( λ ) ) ≤ P ( max j ∈ A 0 c ‖ β ^ j o − β j o ‖ 2 / M n > β ∗ o − γ λ )

Let B ˜ j be a M n × M n q matrix with a M n × M n identity matrix I M n in the pth block and 0’s elsewhere. Then n 1 / 2 ( β ^ j o − β j o ) = n − 1 / 2 B ˜ j Σ A 0 c − 1 B ˜ ′ A 0 c ε . Note that

‖ n − 1 / 2 B ˜ j Σ A 0 c − 1 B ˜ ′ A 0 c ε ‖ 2 ≤ ‖ B ˜ j ‖ 2 ‖ Σ A 0 c − 1 / 2 ‖ 2 ‖ n − 1 / 2 Σ A 0 c − 1 / 2 B ˜ ′ A 0 c ε ‖ 2 ≤ c 1 − 1 / 2 ‖ n − 1 / 2 Σ A 0 c − 1 / 2 B ˜ ′ A 0 c ε ‖ 2

and $\|n^{-1/2} \Sigma_{A_0^c}^{-1/2} \tilde B_{A_0^c}' \varepsilon\|_2^2 / \sigma^2$ is distributed as a $\chi^2$ distribution with $q$ degrees of freedom. Therefore, similar to $\eta_{1n}(\lambda)$, we have, for $c_1 n (\beta_*^o - \gamma\lambda)/\sigma^2 > 1$,

1 − P ( Ω 2 ( λ ) ) = P ( max j ∈ A 0 c n − 1 / 2 ‖ B ˜ j Σ A 0 c − 1 B ˜ ′ A 0 c ε ‖ 2 / M n > n ( β ∗ o − γ λ ) ) ≤ P ( max j ∈ A 0 c ‖ n − 1 / 2 Σ A 0 c − 1 / 2 B ˜ ′ A 0 c ε ‖ 2 2 / ( M n σ 2 ) > c 1 n ( β ∗ o − γ λ ) 2 / σ 2 ) ≤ q h ( c 1 n ( β ∗ o − γ λ ) 2 / σ 2 , M n ) = η 2 n ( λ ) (6.2)

Combining η 1 n ( λ ) and η 2 n ( λ ) , we have

P ( β ^ ( λ , γ ) ≠ β ^ o ) ≤ 1 − P ( Ω 1 ( λ ) ) + 1 − P ( Ω 2 ( λ ) ) ≤ η 1 n ( λ ) + η 2 n ( λ )

Since f ^ ( x ) = B β ^ , we can obtain P ( f ^ ≠ f ^ o ) ≤ η 1 n ( λ ) + η 2 n ( λ ) . This completes the proof.

For any Q ⊂ { 1, ⋯ , p } and m ≥ 1 , define

ζ ( ν ; m , B ) = max { ‖ ( P A − P B ) ν ‖ 2 ( m n ) 1 / 2 : Q ⊆ A ⊆ { 1, ⋯ , p } , d A = m + d B }

Lemma 2. Suppose ξ n λ 2 > σ 2 M n . We have

P ( 2 c 2 M n ζ ( Y ˜ ; m , A 0 c ) > λ ) ≤ ( p − q ) m e m m m exp ( − m ξ n λ 2 / 16 )

Proof. For any $A \supseteq A_0^c$, we have $(P_A - P_{A_0^c}) \tilde B_{A_0^c} \hat\beta_{A_0^c} = 0$. Thus

( P A − P A 0 c ) Y ˜ = ( P A − P A 0 c ) ( B ˜ A 0 c β A 0 c + ε ) = ( P A − P A 0 c ) ε

Therefore,

P ( 2 c 2 M n ζ ( Y ˜ ; m , A 0 c ) > λ ) = P ( max A ⊇ A 0 c , | A | = m + q ‖ ( P A − P A 0 c ) ε ‖ 2 2 / σ 2 > ξ m n λ 2 )

Since $P_A - P_{A_0^c}$ is a projection matrix, $\|(P_A - P_{A_0^c}) \varepsilon\|_2^2 / \sigma^2 \sim \chi^2_{m_A}$, where $m_A = \sum_{j \in A - A_0^c} M_n \le m M_n$ for $A \supseteq A_0^c$. Since there are $\binom{p - q}{m}$ ways to choose $A$ from $\{1, \dots, p\}$, we have

P ( 2 c 2 M n ζ ( Y ˜ ; m , A 0 c ) > λ ) ≤ ( p − q m ) P ( χ m M n 2 > ξ m n λ 2 ) .

This and Lemma A.2 imply that

P ( 2 c 2 M n ζ ( Y ˜ ; m , A 0 c ) > λ ) ≤ ( p − q m ) h ( ξ n λ 2 / M n , m M n ) ≤ ( p − q ) m e m m m h ( ξ n λ 2 / M n , m M n ) (6.3)

Here we used the inequality $\binom{p - q}{m} \le (p - q)^m e^m / m^m$. This completes the proof.

Define I as any set that satisfies

A 0 c ∪ { j : ‖ β ^ j ‖ 2 ≠ 0 } ⊆ I ⊆ A 0 c ∪ { j : n − 1 B ˜ ′ ( Y ˜ − B ˜ β ^ ) = ρ ˙ ( ‖ β ^ j ‖ 2 ; M n λ , γ ) M n β ^ j / ‖ β ^ j ‖ 2 }

Lemma 3. Suppose that B ˜ satisfies that S R C ( q * , c 1 , c 2 ) , q * ≥ ( K ∗ + 1 ) m n q , and γ ≥ c 1 − 1 4 + c ¯ . Let m ∗ = K ∗ q . Then for any Y ˜ ∈ R n with λ ≥ 2 c 2 M n ζ ( Y ˜ ; m ∗ , A 0 c ) , we have | I | ≤ ( K ∗ + 1 ) q .

proof This lemma can be proved along the line of the proof of Lemma 1 of [

2 c 2 M n ζ ( Y ˜ ; m ∗ , A 0 c ) ≤ λ (6.4)

we have $|I| \le (K_* + 1) q$. Thus, in the event (6.4), the original model with $p$ groups reduces to a model with at most $(K_* + 1) q$ groups. In this reduced model, the conditions of Theorem 3.2 imply that the conditions of Theorem 3.1 hold. By Lemma 2,

P ( 2 c 2 M n ζ ( Y ˜ ; m ∗ , A 0 c ) > λ ) ≤ η 3 n ( λ ) (6.5)

Therefore, combining (6.5) and Theorem 3.1, we have P ( β ^ ( λ , γ ) ≠ β ^ o ) ≤ η 1 n ( λ ) + η 2 n ( λ ) + η 3 n ( λ ) , since f ^ ( x ) = B β ^ , we can obtain P ( f ^ ≠ f ^ o ) ≤ η 1 n ( λ ) + η 2 n ( λ ) + η 3 n ( λ ) . This proves Theorem 3.2.