This paper studies the re-adjusted cross-validation (RCV) method for a semiparametric regression model called the varying index coefficient model (VICM). We use the profile spline modal estimator (PSME) to estimate the coefficients of the parametric part of the VICM, while the unknown functions are expanded in a B-spline basis. We then combine these two estimation methods under a high-dimensional data setting. Simulation results and an empirical analysis show that, for the varying index coefficient model, the re-adjusted cross-validation method is more accurate and more stable than traditional methods based on ordinary least squares.

In this paper, the variance estimate refers to the residual variance of the model. Variance estimation has been studied extensively in statistical modeling. Most existing approaches are simple two-stage methods: in the first stage, the important variables in the model are chosen by a variable selection method; in the second stage, the variance is estimated by ordinary least squares. In the first stage, traditional variable selection relies on two criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These two traditional methods use the empirical likelihood method to select the model with the smallest AIC or BIC value, and the variables contained in that model are taken as the optimal variables. However, this kind of variable selection is neither continuous nor ordered, so the model variance estimated by the traditional method tends to be large. Moreover, with the development of technology, high-dimensional data now arise in all aspects of life. As the number of variables grows, the cost of computing the two criteria grows exponentially, so the above method cannot be applied to high-dimensional data.

In past research, many important variable selection methods, such as LASSO (Least Absolute Shrinkage and Selection Operator) and SCAD (Smoothly Clipped Absolute Deviation), have been proposed. LASSO was proposed by Tibshirani (1996) [

A well-behaved variance estimation method can improve the prediction accuracy of the model and better explain socio-economic phenomena. However, choosing a suitable regression model is even more important, and there is a large literature on regression models. When the data dimension is low, parametric and nonparametric models are sufficient to solve the problem. But as the dimension increases, a more flexible semiparametric model is more suitable. The literature on semiparametric models mostly focuses on the introduction of new models, such as linear models, additive models, and so on. Hastie and Tibshirani (1993) [

The rest of the paper is organized as follows. In Section 2, we briefly introduce the varying index coefficient model, including the estimation method, the statistical inference of the coefficients, and the RCV estimation of the model. In Section 3, simulation studies are conducted to evaluate the finite sample performance of the proposed methods. In Section 4, a real data set is analyzed to compare the proposed methods with the existing methods. A discussion is given in Section 5.

Semiparametric models are widely used in regression, especially the varying coefficient model (VCM) proposed by Hastie and Tibshirani in 1993, which has been widely applied to real data. An important feature of the varying coefficient model is that the coefficients of its covariates are governed by smooth functions, which can capture nonlinear interactions. The varying coefficient model has the following form:

$$Y = \sum_{l=1}^{d} m_l(Z) X_l + \varepsilon \tag{2.1}$$

where $Y$ is a response variable, $X = (X_1, \cdots, X_d)^T$ and $Z \in [0, 1]$ (for simplicity) are explanatory covariates, $m(\cdot) = (m_1(\cdot), \cdots, m_d(\cdot))^T$ is a $d$-dimensional vector of unknown coefficient functions, and the model error $\varepsilon$ is independent of $(X, Z)$ with mean zero and finite variance $\sigma^2$. The varying coefficient model (2.1) faces two challenges with today's complex data. First, the variable $Z$ may have little effect on $Y$, so the interaction between $Z$ and $X$ is difficult to detect. Second, in many complex situations $Z$ is multi-dimensional, for example when studying the effects between chemical constituents, so the coefficient function $m_l(Z)$ in the VCM suffers from the curse of dimensionality. To overcome these two problems, Ma and Song proposed the varying index coefficient model (VICM) in 2015. The varying index coefficient model is as follows:

$$Y = m(Z, X, \beta) + \varepsilon = \sum_{l=1}^{d} m_l(Z^T \beta_l) X_l + \varepsilon \tag{2.2}$$

where $\beta_l = (\beta_{l1}, \cdots, \beta_{lp})^T$ is the coefficient vector of the variable $Z$ and $\beta_{lk}$ is the coefficient of $Z_k$ in $Z$. The varying index coefficient model was motivated by Ma and Song's study of a biomedical project on factors that affect children's growth rates.

Estimating the varying index coefficient model involves two parts: the parametric part $\beta$ and the function coefficients $m_l(u_l)$ of the nonparametric part. In this paper, the unknown coefficient $\beta$ is estimated by the profile spline modal estimator (PSME) method. Once $\beta$ is fixed, the unknown function coefficients $m_l(u_l)$ are estimated using B-splines. The estimation procedure is as follows.

Let $\{(X_i, Z_i, Y_i), 1 \le i \le n\}$ be independent and identically distributed samples from model (2.2). Our main interest is to estimate the coefficient vectors $\beta_l$ and the nonparametric functions $m_l(\cdot)$ for $l = 1, \cdots, d$. The estimation of $\beta_l$ and $m_l(\cdot)$ in the VICM is equivalent to maximizing

$$\frac{1}{n} \sum_{i=1}^{n} \phi_{h_1}\Big\{ Y_i - \sum_{l=1}^{d} m_l(Z_i^T \beta_l) X_{il} \Big\} \tag{2.3}$$

subject to the constraints $\|\beta_l\| = 1$ and $\beta_{l1} > 0$, where $\phi_{h_1}(t) = h_1^{-1} \phi(t / h_1)$, $\phi(t)$ is a kernel density function symmetric about 0, and $h_1$ is a bandwidth that determines the degree of robustness of the estimate. We use the standard normal density for $\phi(t)$ throughout this paper to simplify the calculation. We use a basis approximation to estimate the nonparametric functions. That is, we approximate $m_l(\cdot)$ by B-spline basis functions because they have bounded support and are numerically stable. More specifically, let $B_q(u) = (B_{1,q}(u), \cdots, B_{J_n,q}(u))^T$ be the B-spline basis functions of order $q$ ($q \ge 2$), where $J_n = N_n + q$ and $N_n$ is the number of interior knots for a knot sequence

$$\xi_1 = \cdots = \xi_q = 0 < \xi_{q+1} < \cdots < \xi_{N_n+q} < 1 = \xi_{N_n+q+1} = \cdots = \xi_{N_n+2q},$$

where $N_n$ increases with the sample size $n$. Consider the distance between two neighboring knots, $H_i = \xi_i - \xi_{i-1}$, and let $H = \max_{1 \le i \le N_n+1} \{H_i\}$. Then there exists a constant $C_0$ such that $H / \min_{1 \le i \le N_n+1} \{H_i\} < C_0$ and $\max_{1 \le i \le N_n} \{H_{i+1} - H_i\} = o(N_n^{-1})$. Let $U_l(\beta_l) = Z^T \beta_l$; without loss of generality, we assume that $U_l(\beta_l)$ is confined to the compact set $[0, 1]$. Then the nonparametric functions $m_l(u_l)$ can be approximated by

$$m_l(u_l) \approx B_q(u_l)^T \lambda_l(\beta), \quad l = 1, \cdots, d \tag{2.4}$$

where $\lambda_l(\beta) = (\lambda_{s,l}(\beta) : 1 \le s \le J_n)^T$. Let $\lambda(\beta) = (\lambda_1(\beta)^T, \cdots, \lambda_d(\beta)^T)^T$. Based on the above approximation, the objective function (2.3) becomes

$$\frac{1}{n} \sum_{i=1}^{n} \phi_{h_2}\Big\{ Y_i - \sum_{l=1}^{d} \sum_{s=1}^{J_n} B_{s,q}(U_{il}(\beta_l)) \lambda_{s,l} X_{il} \Big\}. \tag{2.5}$$
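The B-spline basis $B_q(u)$ used in (2.4) and (2.5) can be evaluated with the Cox-de Boor recursion. The following is a minimal sketch, assuming equally spaced interior knots on $[0, 1]$; the function names `make_knots`, `bspline_basis`, and `spline_value` are illustrative and are not taken from the paper's R code.

```python
def make_knots(n_interior, q):
    """Knot sequence on [0, 1] with q-fold boundary knots and
    n_interior equally spaced interior knots (an assumption of this sketch)."""
    interior = [i / (n_interior + 1) for i in range(1, n_interior + 1)]
    return [0.0] * q + interior + [1.0] * q


def bspline_basis(u, knots, q):
    """Return the J_n = len(knots) - q B-spline basis values of order q at u."""
    # Order 1: indicator functions of the half-open knot intervals.
    b = [0.0] * (len(knots) - 1)
    for j in range(len(knots) - 1):
        if knots[j] <= u < knots[j + 1]:
            b[j] = 1.0
    if u == knots[-1]:
        # Close the right end: assign u to the last non-empty interval.
        for j in range(len(knots) - 2, -1, -1):
            if knots[j] < knots[j + 1]:
                b[j] = 1.0
                break
    # Cox-de Boor recursion from order 2 up to order q.
    for k in range(2, q + 1):
        new = [0.0] * (len(knots) - k)
        for j in range(len(knots) - k):
            left_den = knots[j + k - 1] - knots[j]
            right_den = knots[j + k] - knots[j + 1]
            left = (u - knots[j]) / left_den * b[j] if left_den > 0 else 0.0
            right = (knots[j + k] - u) / right_den * b[j + 1] if right_den > 0 else 0.0
            new[j] = left + right
        b = new
    return b


def spline_value(u, lam, knots, q):
    """Spline approximation B_q(u)^T lambda, as in (2.4)."""
    return sum(bs * ls for bs, ls in zip(bspline_basis(u, knots, q), lam))
```

With `q = 4` (cubic splines) and three interior knots, `bspline_basis` returns $J_n = 7$ non-negative values that sum to one at every point of $[0, 1]$.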

Subsequently, we estimate the parameter vectors β l and the nonparametric functions m l ( ⋅ ) in two steps below.

Step 1. Given $\beta$, we obtain the estimate $\hat{\lambda}(\beta)$ of $\lambda(\beta)$ by maximizing the objective function (2.5). Then, the estimator of $m_l(u_l)$ can be obtained by

$$\hat{m}_l(u_l, \beta) = \sum_{s=1}^{J_n} B_{s,q}(u_l) \hat{\lambda}_{s,l}(\beta) = B_q(u_l)^T \hat{\lambda}_l(\beta). \tag{2.6}$$
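The text does not say how the kernel objective is maximized numerically. As an illustration only, the sketch below maximizes the same kind of objective, $n^{-1} \sum_i \phi_h(Y_i - \mu)$, for a single location parameter $\mu$, using a fixed-point reweighting iteration of the kind commonly used in modal regression. This is an assumption of the sketch, not the authors' stated algorithm; with spline coefficients the same reweighting would lead to a weighted least-squares step. The names `normal_kernel`, `modal_objective`, and `modal_location` are hypothetical.

```python
import math


def normal_kernel(t, h):
    """phi_h(t) = phi(t / h) / h with phi the standard normal density."""
    return math.exp(-0.5 * (t / h) ** 2) / (h * math.sqrt(2 * math.pi))


def modal_objective(mu, ys, h):
    """Kernel objective (1/n) * sum_i phi_h(y_i - mu)."""
    return sum(normal_kernel(y - mu, h) for y in ys) / len(ys)


def modal_location(ys, h, start, tol=1e-10, max_iter=500):
    """Fixed-point iteration: mu <- kernel-weighted mean of the data."""
    mu = start
    for _ in range(max_iter):
        weights = [normal_kernel(y - mu, h) for y in ys]
        new_mu = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


# Toy data (made up for illustration): five clustered points and one outlier.
ys = [-0.2, -0.1, 0.0, 0.1, 0.2, 10.0]
mu_hat = modal_location(ys, h=0.5, start=0.05)
sample_mean = sum(ys) / len(ys)
```

On this toy sample the modal estimate stays near the cluster around zero, while the sample mean is pulled toward the outlier; a smaller bandwidth makes the estimate less sensitive to distant points, which is the sense in which the bandwidth controls robustness.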

In order to obtain efficient estimators of $\beta$, the “remove-one-component” method is employed. Specifically, for $\beta_l = (\beta_{l1}, \cdots, \beta_{lp})^T$, let $\beta_{l,-1} = (\beta_{l2}, \cdots, \beta_{lp})^T$ be the $(p-1)$-dimensional vector obtained by removing the first component $\beta_{l1}$ of $\beta_l$, for all $1 \le l \le d$. Then $\beta_l$ can be rewritten as

$$\beta_l = \beta_l(\beta_{l,-1}) = \Big( \sqrt{1 - \|\beta_{l,-1}\|_2^2},\ \beta_{l,-1}^T \Big)^T, \quad \|\beta_{l,-1}\|_2^2 < 1. \tag{2.7}$$

Thus, $\beta_l$ is infinitely differentiable with respect to $\beta_{l,-1}$ and the Jacobian matrix is

$$J_l = \frac{\partial \beta_l}{\partial \beta_{l,-1}} = \begin{pmatrix} -\beta_{l,-1}^T / \sqrt{1 - \|\beta_{l,-1}\|_2^2} \\ I_{p-1} \end{pmatrix}, \tag{2.8}$$

where $I_p$ is the $p \times p$ identity matrix. We denote $\beta_{-1} = (\beta_{1,-1}^T, \cdots, \beta_{d,-1}^T)^T$ and reformulate the parameter space of $\beta_{-1}$ as follows:

$$\Theta_{-1} = \big\{ \beta_{-1} = (\beta_{l,-1}^T : 1 \le l \le d)^T : \|\beta_{l,-1}\|_2^2 < 1,\ \beta_{l,-1} \in R^{p-1} \big\}. \tag{2.9}$$
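The re-parametrization (2.7) and its Jacobian (2.8) are easy to check numerically. The following sketch maps a reduced vector $\beta_{l,-1}$ back to the unit-norm $\beta_l$ and builds the $p \times (p-1)$ matrix $J_l$; the function names `beta_from_reduced` and `jacobian` are illustrative only.

```python
import math


def beta_from_reduced(beta_minus1):
    """Map beta_{l,-1} to beta_l = (sqrt(1 - ||beta_{l,-1}||^2), beta_{l,-1}^T)^T."""
    squared_norm = sum(b * b for b in beta_minus1)
    if squared_norm >= 1.0:
        raise ValueError("beta_{l,-1} must have squared norm below 1")
    return [math.sqrt(1.0 - squared_norm)] + list(beta_minus1)


def jacobian(beta_minus1):
    """p x (p-1) Jacobian of beta_l with respect to beta_{l,-1}, as in (2.8)."""
    root = math.sqrt(1.0 - sum(b * b for b in beta_minus1))
    first_row = [-b / root for b in beta_minus1]
    size = len(beta_minus1)
    identity = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    return [first_row] + identity
```

For example, `beta_from_reduced([1 / math.sqrt(14), 3 / math.sqrt(14)])` recovers a first component of $2/\sqrt{14}$, so the constraints $\|\beta_l\| = 1$ and $\beta_{l1} > 0$ hold by construction.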

Let $\beta = \beta(\beta_{-1})$ with $\beta_l = \beta_l(\beta_{l,-1})$ for $1 \le l \le d$. Since the estimation procedure for $\beta$ requires estimates of both $m_l$ and its first-order derivative $\dot{m}_l$, we can adopt spline functions of one order lower than that of $m_l$ to approximate $\dot{m}_l$. Following Ma and Song (2014), a spline estimator of $\dot{m}_l$ is given by

$$\hat{\dot{m}}_l(u_l, \beta) = \sum_{s=1}^{J_n} \dot{B}_{s,q}(u_l) \hat{\lambda}_{s,l}(\beta) = \sum_{s=2}^{J_n} B_{s,q-1}(u_l) \hat{\omega}_{s,l}(\beta) \tag{2.10}$$

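where $\hat{\omega}_{s,l}(\beta) = (q-1) \{ \hat{\lambda}_{s,l}(\beta) - \hat{\lambda}_{s-1,l}(\beta) \} / (\xi_{s+q-1} - \xi_s)$ for $2 \le s \le J_n$. Thus, one has

$$\hat{\dot{m}}_l(u_l, \beta) = B_{q-1}(u_l)^T D_1 \hat{\lambda}_l(\beta),$$

where $B_{q-1}(u_l) = (B_{s,q-1}(u_l) : 2 \le s \le J_n)^T$ and

$$D_1 = (q-1) \begin{bmatrix} \dfrac{-1}{\xi_{q+1} - \xi_2} & \dfrac{1}{\xi_{q+1} - \xi_2} & 0 & \cdots & 0 \\ 0 & \dfrac{-1}{\xi_{q+2} - \xi_3} & \dfrac{1}{\xi_{q+2} - \xi_3} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \dfrac{-1}{\xi_{N_n+2q-1} - \xi_{N_n+q}} & \dfrac{1}{\xi_{N_n+2q-1} - \xi_{N_n+q}} \end{bmatrix}_{(J_n - 1) \times J_n}.$$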
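The derivative formula (2.10) can be verified numerically: the differenced coefficients $\hat{\omega}_{s,l}$ combined with the order $q-1$ basis should match a finite-difference derivative of the order $q$ spline. The sketch below assumes equally spaced interior knots and uses illustrative names (`bspline_basis`, `spline_value`, `spline_derivative`); it is not the paper's R implementation.

```python
def bspline_basis(u, knots, q):
    """Cox-de Boor evaluation of the len(knots) - q basis functions of order q."""
    b = [1.0 if knots[j] <= u < knots[j + 1] else 0.0 for j in range(len(knots) - 1)]
    for k in range(2, q + 1):
        new = []
        for j in range(len(knots) - k):
            left_den = knots[j + k - 1] - knots[j]
            right_den = knots[j + k] - knots[j + 1]
            left = (u - knots[j]) / left_den * b[j] if left_den > 0 else 0.0
            right = (knots[j + k] - u) / right_den * b[j + 1] if right_den > 0 else 0.0
            new.append(left + right)
        b = new
    return b


def spline_value(u, lam, knots, q):
    """Order-q spline sum_s B_{s,q}(u) * lambda_s."""
    return sum(bs * ls for bs, ls in zip(bspline_basis(u, knots, q), lam))


def spline_derivative(u, lam, knots, q):
    """Derivative via (2.10): sum over s = 2..J_n of B_{s,q-1}(u) * omega_s."""
    lower = bspline_basis(u, knots, q - 1)  # J_n + 1 values; first and last are zero
    total = 0.0
    for s in range(2, len(lam) + 1):  # s is the 1-based index used in the text
        gap = knots[s + q - 2] - knots[s - 1]  # xi_{s+q-1} - xi_s in 1-based notation
        omega = (q - 1) * (lam[s - 1] - lam[s - 2]) / gap
        total += lower[s - 1] * omega
    return total


q = 4
knots = [0.0] * q + [0.25, 0.5, 0.75] + [1.0] * q
lam = [0.5, -1.0, 2.0, 0.3, 1.5, -0.7, 0.9]  # arbitrary illustrative coefficients
```

Each term of the loop corresponds to one row of $D_1$ applied to the coefficient vector, so `spline_derivative` is the scalar form of $B_{q-1}(u_l)^T D_1 \hat{\lambda}_l(\beta)$.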

Step 2. After this re-parametrization, combining the estimators $\hat{m}_l$ and $\hat{\dot{m}}_l$ for $l = 1, \cdots, d$, we can construct the profile spline modal objective function for the parametric components. Then, we can obtain the estimator $\hat{\beta}_{-1}$ of $\beta_{-1}$ by maximizing $L_n(\beta(\beta_{-1}))$ over $\beta_{-1} \in \Theta_{-1}$, where

$$L_n(\beta(\beta_{-1})) = \frac{1}{n} \sum_{i=1}^{n} \phi_{h_2}\Big\{ Y_i - \sum_{l=1}^{d} \sum_{s=1}^{J_n} B_{s,q}(U_{il}(\beta_l)) \lambda_{s,l}(\beta) X_{il} \Big\}, \tag{2.11}$$

which is equivalent to solving the following estimating equations:

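$$\frac{\partial L_n(\beta(\beta_{-1}))}{\partial \beta_{-1}} = -\frac{1}{n} \sum_{i=1}^{n} \dot{\phi}_{h_2}\Big\{ Y_i - \sum_{l=1}^{d} \sum_{s=1}^{J_n} B_{s,q}(U_{il}(\beta_l)) \hat{\lambda}_{s,l}(\beta) X_{il} \Big\} \times \begin{pmatrix} \hat{\dot{m}}_1(U_{i1}(\beta_1), \beta) X_{i1} J_1^T Z_i + \big( \partial \hat{\lambda}(\beta)^T / \partial \beta_{1,-1} \big) D_i(\beta) \\ \vdots \\ \hat{\dot{m}}_d(U_{id}(\beta_d), \beta) X_{id} J_d^T Z_i + \big( \partial \hat{\lambda}(\beta)^T / \partial \beta_{d,-1} \big) D_i(\beta) \end{pmatrix} = 0 \tag{2.12}$$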

where $D_i(\beta) = (D_{i,sl}(\beta_l), 1 \le s \le J_n, 1 \le l \le d)^T$ with $D_{i,sl}(\beta_l) = B_{s,q}(U_{il}(\beta_l)) X_{il}$, $\hat{\dot{m}}_l(\cdot, \beta)$ is given in (2.10), and $\dot{\phi}_{h_2}$ is the first derivative of $\phi_{h_2}$. We obtain the estimate of $\beta_{-1}$, say $\hat{\beta}_{-1}$, and then obtain $\hat{\beta}$ via the transformation (2.7). We call the estimator $\hat{\beta}$ the profile spline modal estimator (PSME).

In this section, we conduct simulation studies to evaluate the finite sample performance of the proposed methodology. We generate data from the following VICM:

$$Y_i = m(Z_i, X_i, \beta) + \varepsilon_i = m_1(Z_i^T \beta_1) X_{i1} + m_2(Z_i^T \beta_2) X_{i2} + m_3(Z_i^T \beta_3) X_{i3} + \varepsilon_i \tag{3.1}$$

with $X_i = (X_{i1}, X_{i2}, X_{i3})^T$, where $X_{i1}$ is generated from Bernoulli($p = 0.5$), and $(X_{i2}, X_{i3})^T$ is drawn from a bivariate normal distribution with mean 0, variance 1, and covariance 0.2. To generate $Z_i = (Z_{i1}, Z_{i2}, Z_{i3})^T$, we first sample $(Z_{i1}^*, Z_{i2}^*, Z_{i3}^*)^T$ from a multivariate normal distribution with mean 0, variance 1, and covariance 0.2, and then let $Z_{ik} = \Phi(Z_{ik}^*) - 0.5$, $k = 1, 2, 3$, where $\Phi(\cdot)$ is the CDF of the standard normal distribution. The true loading parameters are set as $\beta_1 = \frac{1}{\sqrt{14}} (2, 1, 3)^T$, $\beta_2 = \frac{1}{\sqrt{14}} (3, 2, 1)^T$, $\beta_3 = \frac{1}{\sqrt{14}} (2, 3, 1)^T$. Set

$$m_l(u_l) = m_l^*(u_l) - E\{ m_l^*(u_l) \}, \quad l = 1, 2, 3$$

where $m_1^*(u_1) = 10 \exp(5 u_1) / \{ 1 + \exp(5 u_1) \}$, $m_2^*(u_2) = 5 \sin(\pi u_2)$, and $m_3^*(u_3) = 3 \{ \sin(\pi u_3) + \cos(2 \pi u_3 - 4\pi/3) \}$. Finally, $Y_i$, $1 \le i \le n$, are generated from the VICM (3.1), where $\beta = (\beta_1^T, \beta_2^T, \beta_3^T)^T$, and the errors $\varepsilon_i$ follow $N(0, \sigma^2(Z_i, X_i))$ with $\sigma^2(Z_i, X_i) = \{ 100 - m(Z_i, X_i, \beta) \} / \{ 100 + m(Z_i, X_i, \beta) \}$.
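The data-generating process above can be sketched as follows. This is an illustration under two assumptions that are not stated in the text: the equicorrelated normals are built from a shared factor (which gives unit variances and pairwise covariance 0.2), and $E\{m_l^*(u_l)\}$ is replaced by the sample mean of $m_l^*$ over the generated indices. The names `simulate`, `m_star`, and `equicorrelated_normals` are hypothetical.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF
BETA = [[c / math.sqrt(14) for c in v] for v in ([2, 1, 3], [3, 2, 1], [2, 3, 1])]


def m_star(l, u):
    """Uncentred coefficient functions m_1*, m_2*, m_3* of the simulation design."""
    if l == 0:
        return 10 * math.exp(5 * u) / (1 + math.exp(5 * u))
    if l == 1:
        return 5 * math.sin(math.pi * u)
    return 3 * (math.sin(math.pi * u) + math.cos(2 * math.pi * u - 4 * math.pi / 3))


def equicorrelated_normals(k, rho, rng):
    """k normals with variance 1 and pairwise covariance rho (shared-factor construction)."""
    shared = rng.gauss(0, 1)
    return [math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1) for _ in range(k)]


def simulate(n, seed=0):
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        x1 = 1.0 if rng.random() < 0.5 else 0.0
        x2, x3 = equicorrelated_normals(2, 0.2, rng)
        z = [PHI(v) - 0.5 for v in equicorrelated_normals(3, 0.2, rng)]
        u = [sum(zk * bk for zk, bk in zip(z, b)) for b in BETA]
        draws.append(([x1, x2, x3], z, u))
    # Assumption: centre each m_l* by its sample mean instead of the exact expectation.
    centres = [sum(m_star(l, d[2][l]) for d in draws) / n for l in range(3)]
    data = []
    for x, z, u in draws:
        mean = sum((m_star(l, u[l]) - centres[l]) * x[l] for l in range(3))
        variance = (100 - mean) / (100 + mean)  # heteroscedastic error variance
        data.append((mean + rng.gauss(0, math.sqrt(variance)), x, z))
    return data
```

Calling `simulate(200)` returns a list of `(y, x, z)` triples matching the sample size used in the reported estimates.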

Although the estimation process of the varying index coefficient model is introduced in Section 2.2, it is still difficult to solve (2.12) directly. Therefore, an iterative algorithm is needed to estimate the unknown parameters and the unknown function coefficients. The algorithm is divided into the following two steps:

Step 1. The initial value $(\beta_1, \beta_2, \beta_3)$ of $\beta$ is obtained in the following four steps:

1) Assume that each unknown function $m_l$ is linear, so that $m(Z_i, X_i, \beta) = \sum_{l=1}^{d} \{ a_l + b_l (Z_i^T \beta_l) \} X_{il}$.

2) Writing $v_l = b_l \beta_l$, obtain the estimate $(\hat{a}_l, \hat{v}_l)$ of $(a_l, v_l)$ by minimizing $\sum_{i=1}^{n} \{ Y_i - \sum_{l=1}^{d} (a_l + v_l^T Z_i) X_{il} \}^2$, and set $\hat{\beta}_l^0 = (\hat{v}_l / \|\hat{v}_l\|_2)\, \mathrm{sgn}(\hat{v}_{1l})$, where $\hat{v}_{1l}$ is the first component of $\hat{v}_l$.

3) Let $\hat{U}_l = Z_i^T \hat{\beta}_l^0$, then obtain the initial estimate $\hat{m}_l^{\mathrm{ini}}(\cdot)$ of the unknown function from the varying coefficient model $Y = \sum_{l=1}^{d} m_l(\hat{U}_l) X_l + \varepsilon$.

4) Obtain the initial value $\beta^{\mathrm{ini}}$ by minimizing $2^{-1} \sum_{i=1}^{n} \{ Y_i - \sum_{l=1}^{d} \hat{m}_l^{\mathrm{ini}}(Z_i^T \beta_l) X_{il} \}^2$.
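The normalization in sub-step 2) can be illustrated in a few lines. The sketch assumes the sign is taken from the first component of $\hat{v}_l$, so that the constraints $\|\beta_l\| = 1$ and $\beta_{l1} > 0$ hold for the starting value; the function name `initial_direction` is hypothetical.

```python
import math


def initial_direction(v):
    """Scale v to unit Euclidean norm and flip its sign so the first entry is positive."""
    norm = math.sqrt(sum(x * x for x in v))
    sign = 1.0 if v[0] >= 0 else -1.0
    return [sign * x / norm for x in v]


# Illustrative input: a least-squares slope vector pointing in the "wrong" direction.
beta0 = initial_direction([-2.0, -1.0, -3.0])
```

Here `beta0` is proportional to $(2, 1, 3)^T$ with unit length, the same direction as the true $\beta_1$ of the simulation design.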

Step 2. Iterative calculations are performed by the asymptotic properties of the large sample parameter estimates and the theorems given by Ma and Song (2015) [

$$\sqrt{n} \big( \hat{\beta}_{-1} - \beta_{-1}^0 \big) = \Big\{ n^{-1} \sum_{i=1}^{n} \Phi(X_i, Z_i, \beta^0)^{\otimes 2} \Big\}^{-1} \times \Big\{ n^{-1/2} \sum_{i=1}^{n} \big( Y_i - m(Z_i, X_i) \big) \Phi(X_i, Z_i, \beta^0) \Big\} + o_p(1) \tag{3.2}$$

where $\Phi(X_i, Z_i, \beta^0) = [ \{ \dot{m}_l(U_l(\beta_l^0), \beta_l^0) X_l J_l^T \tilde{Z} \}^T, 1 \le l \le d ]$ and $\tilde{Z} = Z - P(Z)$. Here $\tilde{Z}$ can be estimated by $\tilde{Z} = Z - P_n(Z)$, where

$$P_n(Z_k) = \sum_{l=1}^{d} \hat{g}_{1J}^{0}(U_l(\hat{\beta}), \hat{\beta}) X_l. \tag{3.3}$$

The estimation procedure for $\hat{g}_{1J}^{0}(\cdot, \hat{\beta})$ in Equation (3.3) is similar to that for the unknown function $\hat{m}_l(\cdot, \hat{\beta})$, except that the response variable $Y$ is replaced by $Z_k$ in the iterative estimation process. The asymptotic expansion (3.2) yields an updating equation, which we use for the iterative calculations. The iteration stops when the absolute difference (dif) between the current and the previous parameter estimate is less than $10^{-4}$ or the iteration number (iter) is greater than or equal to 100.
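The stopping rule can be written as a small generic loop. The sketch below assumes that dif is the largest absolute componentwise change between successive estimates, which the text does not specify; `update` stands in for the VICM updating equation and is replaced here by a toy contraction purely so that the loop can be run.

```python
def iterate_until_converged(update, beta_start, tol=1e-4, max_iter=100):
    """Apply `update` repeatedly; stop when dif < tol or after max_iter iterations."""
    beta = list(beta_start)
    n_iter = 0
    dif = float("inf")
    while n_iter < max_iter:
        new_beta = update(beta)
        n_iter += 1
        # Assumption: dif is the maximum absolute change over all components.
        dif = max(abs(a - b) for a, b in zip(new_beta, beta))
        beta = new_beta
        if dif < tol:
            break
    return beta, n_iter, dif


# Toy update (Newton's iteration for sqrt(2)); it stands in for the real updating step.
estimate, n_iter, dif = iterate_until_converged(lambda b: [0.5 * (x + 2.0 / x) for x in b], [1.0])
```

Returning `n_iter` and `dif` alongside the estimate makes it possible to tell whether the loop ended by convergence or by reaching the iteration cap.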

Following the algorithm above, we wrote four functions in R (64-bit): Design matrix, transform, Jac, and vicmest. Among them, vicmest is the main program for estimating the unknown parameters, and the other three are intermediate conversion functions. First, we calculate the initial value $\beta_1, \beta_2, \beta_3$ of $\beta$ through the first step. The results are shown in

Initial value | $\beta_1$ | $\beta_2$ | $\beta_3$ |
---|---|---|---|
1 | 0.715 | 0.951 | 0.722 |
2 | 0.345 | 0.287 | 0.897 |
3 | 0.933 | 0.747 | 0.311 |

n = 200 | $\beta_{11}$ | $\beta_{12}$ | $\beta_{13}$ | $\beta_{21}$ | $\beta_{22}$ | $\beta_{23}$ | $\beta_{31}$ | $\beta_{32}$ | $\beta_{33}$ |
---|---|---|---|---|---|---|---|---|---|
True | 0.534 | 0.267 | 0.801 | 0.801 | 0.534 | 0.267 | 0.534 | 0.801 | 0.267 |
$\hat{\beta}$ | 0.537 | 0.316 | 0.782 | 0.798 | 0.538 | 0.271 | 0.461 | 0.833 | 0.305 |
Bias | 0.003 | 0.049 | −0.019 | −0.003 | 0.003 | 0.004 | −0.074 | 0.032 | 0.038 |

The main program vicmest returns not only the estimate $\hat{\beta}$ but also gamm0, the vector of coefficients of the B-spline basis expansion. Substituting the calculated $\hat{\beta}$ and the coefficients gamm0 into Formula (2.6) gives the value of the unknown function $\hat{m}_l(\cdot, \beta)$ and the predicted value of the response variable $Y$. The results of the gamm0 coefficient are shown in

In this section, we numerically simulate the variance estimation of the varying index coefficient model in high-dimensional conditions.

The profile spline modal estimator (PSME) gives good variance estimates under low-dimensional data settings. However, with high-dimensional data it suffers from the curse of dimensionality, and the bias of the estimated variance grows as the dimension increases. The re-adjustment cross-validation method proposed by Fan et al. (2012) [

Coefficient | $B_{i1}$ | $B_{i2}$ | $B_{i3}$ | $B_{i4}$ | $B_{i5}$ | $B_{i6}$ |
---|---|---|---|---|---|---|
$B_{1i}$ | −4.444 | −4.808 | −3.0887 | 3.912 | 5.004 | 3.973 |
$B_{2i}$ | −2.083 | −6.179 | −4.543 | 4.683 | 5.185 | 4.876 |
$B_{3i}$ | −6.294 | −2.710 | 3.652 | −5.970 | 7.624 | 5.766 |

Metric | $\hat{m}_1(\cdot, \beta)$ | $\hat{m}_2(\cdot, \beta)$ | $\hat{m}_3(\cdot, \beta)$ |
---|---|---|---|
RMSE | 0.349 | 0.222 | 0.258 |

We first perform a simulation. The true model is the same as model (3.1), with only the dimension of the covariate $Z$ changed; that is, the first type of covariate $Z$ is set to a high-dimensional variable, where $Z_{ik} = \Phi(Z_{ik}^*) - 0.5$, $k = 1, 2, \cdots, d$, so that $Z$ is a $d$-dimensional covariate. The setting of $m_l(u_l)$ and the error term is also the same as in model (3.1).

In the case of high dimensional data, the independent variables are often highly correlated. However, not all independent variables are related to the dependent variable Y. In fact, only a small number of covariates are associated with the dependent variable Y. For the selection of such high-dimensional variables, Fan et al. (2008) [

We use the SIS method proposed by Fan et al. (2008) [

It can be seen from

As shown in

In fact, this phenomenon is not difficult to explain. In the variable selection phase of the SIS method, we select the variables whose correlation coefficients rank in the top twenty (in descending order). Since the fixed value 20 is small relative to the number of covariates, the probability of selecting all important variables is relatively low. From the data in the RCV-SIS column in

In this section, we use data collected by the Mayo Clinic. The data come from the Mayo Clinic trial in primary biliary cirrhosis (PBC) conducted from 1974 to 1984 and are available in the survival package for R. The data set includes 424 PBC patients who were referred to the Mayo Clinic during that decade and who met the eligibility criteria for the randomized placebo-controlled trial.

In the data set, the first 312 patients participated in the randomized trial, while the other 112 patients did not participate in the clinical trial but agreed to have basic measurements recorded and to follow the medical recommendations. Six of these patients were lost to follow-up shortly after diagnosis, so the data contain 106 such cases in addition to the 312 randomized participants. We preprocessed the data set in R.

Dimension of Z | Naïve-SIS | RCV-SIS |
---|---|---|
d = 50 | 8.171 | 4.838 |
d = 100 | 8.273 | 5.199 |
d = 500 | 6.692 | 5.043 |
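The two estimators compared in this table differ in how the sample is used. The sketch below illustrates the general idea with a plain linear model standing in for the VICM, which is a simplifying assumption of this example: the naive estimator screens variables and fits least squares on the same data, while the RCV estimator screens on one half of the sample, estimates the residual variance on the other half, swaps the roles, and averages. All function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ols_variance(rows, y, cols):
    """Residual variance of least squares (with intercept) on the chosen columns."""
    design = [[1.0] + [r[j] for j in cols] for r in rows]
    p = len(cols) + 1
    xtx = [[sum(d[a] * d[b] for d in design) for b in range(p)] for a in range(p)]
    xty = [sum(d[a] * yi for d, yi in zip(design, y)) for a in range(p)]
    coef = solve(xtx, xty)
    rss = sum((yi - sum(c * v for c, v in zip(coef, d))) ** 2 for d, yi in zip(design, y))
    return rss / (len(y) - p)


def sis(rows, y, k):
    """Keep the k columns with the largest absolute marginal correlation with y."""
    n = len(y)
    ybar = sum(y) / n
    syy = sum((yi - ybar) ** 2 for yi in y)

    def abs_corr(j):
        col = [r[j] for r in rows]
        cbar = sum(col) / n
        sxy = sum((c - cbar) * (yi - ybar) for c, yi in zip(col, y))
        sxx = sum((c - cbar) ** 2 for c in col)
        return abs(sxy) / (sxx * syy) ** 0.5

    return sorted(sorted(range(len(rows[0])), key=abs_corr, reverse=True)[:k])


def naive_variance(rows, y, k):
    """Screen and fit on the same data."""
    return ols_variance(rows, y, sis(rows, y, k))


def rcv_variance(rows, y, k, seed=0):
    """Screen on one half, estimate the variance on the other half, swap, and average."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    halves = [idx[: len(idx) // 2], idx[len(idx) // 2:]]
    estimates = []
    for a, b in ((0, 1), (1, 0)):
        selected = sis([rows[i] for i in halves[a]], [y[i] for i in halves[a]], k)
        estimates.append(ols_variance([rows[i] for i in halves[b]], [y[i] for i in halves[b]], selected))
    return sum(estimates) / 2
```

Because the variables used in each fit are chosen on data independent of the fit, spuriously correlated covariates picked up during screening no longer absorb noise in the variance estimate, which is the motivation for the second-half refit.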

We first remove some samples with missing values. The number of samples that were eventually brought into the calculation after deletion was 276. The specific variables are described in

As can be seen from _{1} (Status), the patient’s age X_{2} (Age), and the patient’s gender X_{3} (Sex). Here we need to explain why we want to add gender variables. The gender variable was added because Huang Siyu (1985) [

$$\ln(Y_i) = m(Z, X, \beta) + \varepsilon_i = \sum_{l=1}^{3} m_l\Big( \sum_{d=1}^{15} Z_{id}^{*} \beta_{ld} \Big) X_{il} + \varepsilon_i. \tag{4.1}$$

Serial number | Variables | Description |
---|---|---|
1 | Time (Y) | number of days between registration and the earlier of death, transplantation, or study analysis in July 1986 |
2 | Status (X1) | status at end point, 0/1/2 for censored, transplant, dead |
3 | Age (X2) | in years |
4 | Sex (X3) | m/f |
5 | Trt (Z1) | 1/2/NA for D-penicillamine, placebo, not randomised |
6 | Ascites (Z2) | presence of ascites |
7 | Hepato (Z3) | presence of hepatomegaly or enlarged liver |
8 | Spiders (Z4) | blood vessel malformations in the skin |
9 | Edema (Z5) | 0 no edema, 0.5 untreated or successfully treated, 1 edema despite diuretic therapy |
10 | Bili (Z6) | serum bilirubin (mg/dl) |
11 | Chol (Z7) | serum cholesterol (mg/dl) |
12 | Albumin (Z8) | serum albumin (g/dl) |
13 | Copper (Z9) | urine copper (µg/day) |
14 | alk.phos (Z10) | alkaline phosphatase (U/liter) |
15 | Ast (Z11) | aspartate aminotransferase, once called SGOT (U/ml) |
16 | Trig (Z12) | triglycerides (mg/dl) |
17 | Platelet (Z13) | platelet count |
18 | Protime (Z14) | standardised blood clotting time |
19 | Stage (Z15) | histologic stage of disease (needs biopsy) |

Since the components of the covariate Z have different physical meanings and different units, fitting the model to the raw data is not meaningful. We therefore need to eliminate the effect of the different units through a transformation: these 15 variables are standardized with the Z-score before the specific calculation.
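Z-score standardization subtracts each variable's mean and divides by its standard deviation. A minimal sketch follows, assuming the sample standard deviation (divisor $n - 1$) is used; the function name `zscore_columns` and the toy values are illustrative and are not taken from the PBC data.

```python
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardize each column of a list of rows to mean 0 and sample sd 1."""
    columns = list(zip(*rows))
    centres = [mean(col) for col in columns]
    scales = [stdev(col) for col in columns]
    return [[(v - c) / s for v, c, s in zip(row, centres, scales)] for row in rows]


# Two made-up columns on very different scales.
toy = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
standardized = zscore_columns(toy)
```

After the transformation both toy columns take the values −1, 0, and 1, so variables measured in different units contribute on a comparable scale.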

From previous studies, we know that variables such as serum bilirubin (Z6), albumin (Z8), urinary copper (Z9), alkaline phosphatase (Z10), and prothrombin time (Z14) have a strong relationship with the response variable Y. We first use the SIS method to select the eight covariates with the largest correlations, and then use the simple two-stage method and the re-adjusted cross-validation (RCV) two-stage method to estimate the coefficient β of the covariate Z and the model variance. The results are shown in

As can be seen from

Item | N-SIS | RCV-SIS |
---|---|---|
AMS (variables selected) | Z2, Z4, Z5, Z6, Z8, Z9, Z12, Z15 | First stage: Z4, Z5, Z6, Z7, Z9, Z11, Z12, Z15; second stage: Z2, Z5, Z6, Z8, Z9, Z12, Z14, Z15 |
MSE | 2.184 | 1.661 |
VARIANCE | 2.674 | 1.495 |

In this paper, we study a new class of semiparametric regression models: varying index coefficient models. The unknown coefficient β is estimated by the profile spline modal estimator (PSME) method, while the unknown nonparametric functions are expanded in a B-spline basis. After studying the asymptotic properties of the coefficients, we estimate the coefficient β using an iterative method. In simulations, we found that the estimate of β obtained by this method has a small bias and that the B-spline estimates of the unknown functions fit well. Finally, under a high-dimensional data setting, we carried out a two-stage RCV estimation of the varying index coefficient model and found that the variance and mean square error estimated by the RCV method are superior to those of the simple two-stage method. In the empirical analysis, we originally intended to model the PBC data using a survival model (a semiparametric varying coefficient additive risk model). However, the literature shows that the gender and status variables are closely related to the survival time of patients with primary biliary cirrhosis, and the variable Z is related to the three variables X (status, gender, and age). Therefore, we used the varying index coefficient model to model the PBC data, and found that the variance and mean square error of the RCV method are better than those of the simple two-stage method.

Further research on the proposed method is needed. First, the asymptotic properties of the proposed method remain to be investigated. Second, this paper only estimates the variance and mean square error of the varying index coefficient model; it does not study the estimation of the coefficient β in the parametric part or of the nonparametric functions. Therefore, more robust estimation methods can be studied in the future. In addition, more attention can be given to the asymptotic properties of the nonparametric part of the varying index coefficient model.

The authors declare no conflicts of interest regarding the publication of this paper.

Wang, M., Lv, H. and Wang, Y.C. (2019) Variance Estimation for High-Dimensional Varying Index Coefficient Models. Open Journal of Statistics, 9, 555-570. https://doi.org/10.4236/ojs.2019.95037