Variance Estimation for High-Dimensional Varying Index Coefficient Models

Miao Wang*, Hao Lv, Yicun Wang
Department of Statistics, School of Economics, Jinan University, Guangzhou, China

Open Journal of Statistics (OJS), ISSN 2161-718X, Scientific Research Publishing. Vol. 9, No. 5 (2019), pp. 555-570. DOI: 10.4236/ojs.2019.95037. Article ID: OJS-95536. Subject area: Physics & Mathematics.
Dates: 11 September 2019; 5 October 2019; 8 October 2019.
© Copyright 2014 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/

This paper studies the re-adjusted cross-validation method for a semiparametric regression model called the varying index coefficient model (VICM). We use the profile spline modal estimator (PSME) to estimate the coefficients of the parametric part of the VICM, while the unknown functions are expanded in a B-spline basis. We then combine these two estimation methods in the setting of high-dimensional data. Simulation results and an empirical analysis show that, for the varying index coefficient model, the re-adjusted cross-validation method is more accurate and more stable than traditional methods based on ordinary least squares.

Keywords: High-Dimensional Data; Refitted Cross-Validation; Varying Index Coefficient Models; Variance Estimation
1. Introduction

The variance estimate considered in this paper is the residual variance of the model. Variance estimation has been studied extensively in statistical modeling. Most existing methods are simple two-stage procedures: in the first stage, the important variables are chosen by a variable selection method; in the second stage, the variance is estimated by ordinary least squares. For the first stage, the traditional approach relies on two criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both criteria are likelihood-based, and the model with the smallest AIC or BIC value is selected; the variables contained in that model are taken as the optimal variables. However, this kind of variable selection is neither continuous nor ordered, so the model variance estimated in this way tends to be large. Moreover, with the development of technology, high-dimensional data now arise in all aspects of life. As the number of variables grows, the number of candidate models, and with it the cost of evaluating the two criteria, grows exponentially, so the above approach cannot be applied to high-dimensional data.

In past research, many important variable selection methods have been proposed, such as LASSO (Least Absolute Shrinkage and Selection Operator) and SCAD (Smoothly Clipped Absolute Deviation). LASSO was proposed by Tibshirani (1996). For more details, see Fan and Peng (2004), Zhao and Yu (2006), Bunea (2007), Zhang and Huang (2008), Lv and Fan (2009), Fan and Lv (2011), and Kim (2008). In this method, a penalty term is added to the ordinary least squares criterion so that some coefficients are shrunk to 0 and the corresponding variables are excluded from the model. Another variable selection tool is the DS (Dantzig Selector). This method was first proposed by Candes and Tao (2005) and can easily be recast as a linear program. Fan and Lv (2008) rank the marginal correlations between each covariate and the response variable, and then select the first few variables with the largest correlation coefficients to complete the variable selection. This method is called SIS (Sure Independence Screening). In later studies, SIS was extended to the iterative SIS (ISIS) method: a regression is fitted using the response and the variables selected by SIS, the response is then replaced by the regression residuals, and SIS is applied again for a new round of variable selection. These steps are repeated until all the important variables have been selected. For details, see Fan et al. (2009). After the important variables have been screened out, the second step of the simple two-stage method is generally carried out by least squares. However, in order to overcome the curse of dimensionality, many scholars have studied variance estimation for high-dimensional data. Fan et al. proposed a re-adjusted (refitted) cross-validation method (RCV) in 2012 to improve the simple two-stage approach, and proved that the variance estimated by this method is stable and accurate. Zhao et al. (2014) studied the variance estimation of linear models under certain assumptions. Reid, Tibshirani, and Friedman (2016) studied residual variance estimation in LASSO regression and performed a large number of simulations. They concluded that the variance estimator based on the residual sum of squares with an adaptively selected regularization parameter has good finite-sample properties.
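The screening step of SIS described above can be illustrated with a short sketch. The Python code below is only an illustration under simplifying assumptions (a linear signal and the marginal Pearson correlation as the ranking statistic); the function names `sis_screen` and `toy_data` and the toy data are hypothetical and are not taken from the cited papers.

```python
import random

def sis_screen(rows, y, k):
    """Return the indices of the k columns with the largest absolute
    marginal correlation with y (the screening step of SIS)."""
    n = len(y)
    y_bar = sum(y) / n
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        c_bar = sum(col) / n
        sxy = sum((c - c_bar) * (v - y_bar) for c, v in zip(col, y))
        sxx = sum((c - c_bar) ** 2 for c in col)
        # |sxy| / sqrt(sxx) is proportional to |corr(x_j, y)| because the
        # factor sqrt(syy) is the same for every column.
        scores.append(abs(sxy) / sxx ** 0.5)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def toy_data(n=200, p=50, seed=0):
    """Hypothetical data: only columns 0 and 3 carry signal."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3 * r[0] - 2 * r[3] + rng.gauss(0, 0.5) for r in rows]
    return rows, y

rows, y = toy_data()
print(sorted(sis_screen(rows, y, 2)))
```

In ISIS, the same screening function would be applied again with `y` replaced by the residuals of a regression on the variables selected so far.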

A well-behaved variance estimation method can improve the prediction accuracy of a model and help explain socio-economic phenomena. However, choosing a suitable regression model is even more important, and there is a large literature on regression models. When the data dimension is low, parametric and nonparametric models are sufficient to solve the problem, but as the dimension increases, a more flexible semiparametric model is more suitable. The literature on semiparametric models has mostly focused on introducing new models, such as linear models, additive models, and so on. Hastie and Tibshirani (1993) proposed the Varying Coefficient Model (VCM), which has been widely used in practice. In addition, some scholars studied the single index coefficient model (SICM). The Varying-Coefficient Single-Index Model (VICSIM) was proposed by Wong et al. (2008). Ma and Song (2014) proposed the varying index coefficient model for the first time, which overcomes problems that the varying coefficient model cannot solve. Most scholars apply variable selection methods such as SIS, LASSO, and SCAD to parametric models, while the methods used in nonparametric estimation are kernel estimation, local linear kernel estimation, and spline functions. For semiparametric regression models, such as the partially linear regression model, the varying coefficient model, and the single-index model, the parametric part is estimated by Profile Least Squares Estimation (PLSE), and the nonparametric part is still estimated by the nonparametric methods above. For example, Xue and Liang (2010) used the PLSE method with kernel estimation when estimating the nonparametric part of the single-index model. However, there is little literature on the varying index coefficient model proposed in 2015, and the related estimation algorithms mainly use profile least squares estimation with B-splines. Lv et al. (2016) improved the PLSE by proposing a robust estimation procedure that combines modal regression with B-splines, and established the large-sample properties of the parameter estimates. The unknown coefficient $\beta$ is estimated by the profile spline modal estimator (PSME) method. Moreover, in order to obtain the asymptotic distribution of the unknown function $m_l(Z)$, they also proposed a two-stage method based on local linear kernel estimation.

The rest of the paper is organized as follows. In Section 2, we briefly introduce the varying index coefficient model, including the estimation method, the statistical inference of the coefficients, and the RCV estimation of the model. In Section 3, simulation studies are conducted to evaluate the finite sample performance of the proposed methods. In Section 4, a real data set is analyzed to compare the proposed methods with the existing methods. A discussion is given in Section 5.

2. Methodology

2.1. Varying Index Coefficient Models

Semiparametric models are widely used in regression analysis, especially the varying coefficient model (VCM) proposed by Hastie and Tibshirani in 1993, which has been widely applied to real data. An important feature of the varying coefficient model is that the coefficients of its covariates are governed by smooth functions, which allows nonlinear effects. The varying coefficient model has the following form:

$$Y = \sum_{l=1}^{d} m_l(Z) X_l + \varepsilon \tag{2.1}$$

where $Y$ is a response variable, $X = (X_1, \ldots, X_d)^T$ and $Z \in [0, 1]$ (for simplicity) are explanatory covariates, $m(\cdot) = (m_1(\cdot), \ldots, m_d(\cdot))^T$ is a $d$-dimensional vector of unknown coefficient functions, and the model error $\varepsilon$ is independent of $(X, Z)$ with mean zero and finite variance $\sigma^2$. The varying coefficient model (2.1) faces two challenges with today's complex data. First, when the variable $Z$ has only a weak effect on $Y$, the interaction between $Z$ and $X$ is difficult to detect. Second, in many complex situations $Z$ is multi-dimensional, for example when studying the effects between chemical constituents, so the coefficient function $m_l(Z)$ in the VCM suffers from the curse of dimensionality. To overcome these two problems, Ma and Song proposed the Varying Index Coefficient Model (VICM) in 2015. The varying index coefficient model is as follows:

$$Y = m(Z, X, \beta) + \varepsilon = \sum_{l=1}^{d} m_l(Z^T \beta_l) X_l + \varepsilon \tag{2.2}$$

where $\beta_l = (\beta_{l1}, \ldots, \beta_{lp})^T$ is the coefficient vector of the variable $Z$ and $\beta_{lk}$ is the coefficient of $Z_k$ in $Z$. The varying index coefficient model was motivated by Ma and Song's biomedical study of factors that affect children's growth rates.

2.2. Estimation Procedure for the VICM

The estimation of the varying index coefficient model has two main aspects: one is the estimation of the parametric part $\beta$, and the other is the estimation of the coefficient functions $m_l(u_l)$ of the nonparametric part. In this paper, the unknown coefficient $\beta$ is estimated by the profile spline modal estimator (PSME) method. Once $\beta$ is fixed, the unknown coefficient functions $m_l(u_l)$ are estimated using B-splines. The specific estimation process is as follows.

Let $\{(X_i, Z_i, Y_i), 1 \le i \le n\}$ be independent and identically distributed samples from model (2.2). Our main interest is to estimate the coefficient vectors $\beta_l$ and the nonparametric functions $m_l(\cdot)$ for $l = 1, \ldots, d$. The estimation of $\beta_l$ and $m_l(\cdot)$ in the VICM is equivalent to maximizing

$$\frac{1}{n} \sum_{i=1}^{n} \phi_{h_1}\Big\{ Y_i - \sum_{l=1}^{d} m_l(Z_i^T \beta_l) X_{il} \Big\} \tag{2.3}$$

subject to the constraints $\|\beta_l\| = 1$ and $\beta_{l1} > 0$, where $\phi_{h_1}(t) = h_1^{-1} \phi(t / h_1)$, $\phi(t)$ is a kernel density function symmetric about 0, and $h_1$ is a bandwidth that determines the degree of robustness of the estimate. We use the standard normal density for $\phi(t)$ throughout this paper to simplify the calculation. We use a basis approximation to estimate the nonparametric functions. That is, we approximate $m_l(\cdot)$ by B-spline basis functions because they have bounded support and are numerically stable. More specifically, let $B_q(u) = (B_{1,q}(u), \ldots, B_{J_n,q}(u))^T$ be the B-spline basis functions of order $q$ ($q \ge 2$), where $J_n = N_n + q$ and $N_n$ is the number of interior knots for a knot sequence

$$\xi_1 = \cdots = \xi_q = 0 < \xi_{q+1} < \cdots < \xi_{N_n+q} < 1 = \xi_{N_n+q+1} = \cdots = \xi_{N_n+2q},$$

where $N_n$ increases along with the sample size $n$. Consider the distance between two neighboring knots, $H_i = \xi_i - \xi_{i-1}$, and let $H = \max_{1 \le i \le N_n+1} \{H_i\}$. Then there exists a constant $C_0$ such that $H / \min_{1 \le i \le N_n+1} \{H_i\} < C_0$ and $\max_{1 \le i \le N_n} \{H_{i+1} - H_i\} = o(N_n^{-1})$. Let $U_l(\beta_l) = Z^T \beta_l$; without loss of generality, we assume that $U_l(\beta_l)$ is confined to the compact set $[0, 1]$. Then the nonparametric functions $m_l(u_l)$ can be approximated by

$$m_l(u_l) \approx B_q(u_l)^T \lambda_l(\beta), \quad l = 1, \ldots, d \tag{2.4}$$

where $\lambda_l(\beta) = (\lambda_{s,l}(\beta) : 1 \le s \le J_n)^T$. Let $\lambda(\beta) = (\lambda_1(\beta)^T, \ldots, \lambda_d(\beta)^T)^T$. Based on the above approximation, the objective function (2.3) becomes

$$\frac{1}{n} \sum_{i=1}^{n} \phi_{h_2}\Big\{ Y_i - \sum_{l=1}^{d} \sum_{s=1}^{J_n} B_{s,q}(U_{il}(\beta_l)) \lambda_{s,l} X_{il} \Big\}. \tag{2.5}$$
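The basis expansion in (2.4) can be made concrete with a small sketch. The Python code below evaluates the B-spline basis of order $q$ by the standard Cox-de Boor recursion on a knot sequence with $q$ repeated boundary knots, and then forms $B_q(u)^T \lambda_l$. The function names, the knot positions, and the coefficient values are hypothetical choices for illustration and are not taken from the paper.

```python
def bspline_basis(u, knots, q):
    """Values B_{1,q}(u), ..., B_{J,q}(u) of the order-q B-spline basis
    (degree q - 1) on `knots`, computed by the Cox-de Boor recursion."""
    m = len(knots)
    # Order 1: indicator functions of the half-open knot intervals.
    basis = [1.0 if knots[s] <= u < knots[s + 1] else 0.0 for s in range(m - 1)]
    for k in range(2, q + 1):
        nxt = []
        for s in range(m - k):
            left = right = 0.0
            d_left = knots[s + k - 1] - knots[s]
            d_right = knots[s + k] - knots[s + 1]
            if d_left > 0:   # skip terms with zero-width support
                left = (u - knots[s]) / d_left * basis[s]
            if d_right > 0:
                right = (knots[s + k] - u) / d_right * basis[s + 1]
            nxt.append(left + right)
        basis = nxt
    return basis  # length J = len(knots) - q = N + q

def spline_value(u, knots, q, lam):
    """Spline approximation B_q(u)^T lambda, as in (2.4)."""
    return sum(b * c for b, c in zip(bspline_basis(u, knots, q), lam))

# Hypothetical setting: cubic splines (q = 4) with N = 3 interior knots.
q = 4
knots = [0.0] * q + [0.25, 0.5, 0.75] + [1.0] * q
lam = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.3]   # J = N + q = 7 coefficients
print(len(bspline_basis(0.3, knots, q)), spline_value(0.3, knots, q, lam))
```

Because the indicator intervals are half-open, this sketch is meant for evaluation points in $[0, 1)$; the right endpoint would need separate handling.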

Subsequently, we estimate the parameter vectors $\beta_l$ and the nonparametric functions $m_l(\cdot)$ in the two steps below.

Step 1. Given $\beta$, we obtain the estimate $\hat\lambda(\beta)$ of $\lambda(\beta)$ by maximizing the objective function (2.5). Then, the estimator of $m_l(u_l)$ is obtained as

$$\hat m_l(u_l, \beta) = \sum_{s=1}^{J_n} B_{s,q}(u_l) \hat\lambda_{s,l}(\beta) = B_q(u_l)^T \hat\lambda_l(\beta). \tag{2.6}$$
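The text does not state how the kernel objective (2.5) is maximized. One common choice for modal objectives with a normal kernel is an EM-type iteration (often called modal EM), in which each observation is weighted by the kernel value of its current residual and the coefficients are then updated by weighted least squares. The Python sketch below applies this idea to a simple straight-line model rather than to the spline design of (2.5); that simplification, the function names, and the toy data are assumptions made for illustration only.

```python
import math
import random

def normal_kernel(t, h):
    """phi_h(t) = phi(t / h) / h with phi the standard normal density."""
    return math.exp(-0.5 * (t / h) ** 2) / (h * math.sqrt(2.0 * math.pi))

def weighted_line(x, y, w):
    """Weighted least squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def modal_line(x, y, h, iters=50):
    """Maximize (1/n) sum_i phi_h(y_i - a - b x_i) by an EM-type iteration."""
    a, b = weighted_line(x, y, [1.0] * len(x))   # ordinary least squares start
    for _ in range(iters):
        # E-step: weights proportional to the kernel at the current residuals.
        w = [normal_kernel(yi - a - b * xi, h) for xi, yi in zip(x, y)]
        # M-step: with a normal kernel this is a weighted least squares fit.
        a, b = weighted_line(x, y, w)
    return a, b

# Hypothetical data: a line with small noise and 10% gross outliers.
rng = random.Random(0)
x = [rng.random() for _ in range(200)]
y = [1 + 2 * xi + rng.gauss(0, 0.1) for xi in x]
y = [yi + 10 if i % 10 == 0 else yi for i, yi in enumerate(y)]
print(modal_line(x, y, h=0.5))
```

The bandwidth `h` plays the role of $h_2$ in (2.5): a small value downweights large residuals strongly, which is the source of the robustness mentioned above.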

In order to obtain efficient estimators of $\beta$, the "remove-one-component" method is employed. Specifically, for $\beta_l = (\beta_{l1}, \ldots, \beta_{lp})^T$, let $\beta_{l,-1} = (\beta_{l2}, \ldots, \beta_{lp})^T$ be the $(p-1)$-dimensional vector obtained by removing the first component $\beta_{l1}$ from $\beta_l$, for all $1 \le l \le d$. Then $\beta_l$ can be rewritten as

$$\beta_l = \beta_l(\beta_{l,-1}) = \Big( \sqrt{1 - \|\beta_{l,-1}\|_2^2}, \; \beta_{l,-1}^T \Big)^T, \quad \|\beta_{l,-1}\|_2^2 < 1. \tag{2.7}$$

Thus, $\beta_l$ is infinitely differentiable with respect to $\beta_{l,-1}$, and the Jacobian matrix is

$$J_l = \frac{\partial \beta_l}{\partial \beta_{l,-1}} = \begin{pmatrix} -\beta_{l,-1}^T \big/ \sqrt{1 - \|\beta_{l,-1}\|_2^2} \\ I_{p-1} \end{pmatrix}, \tag{2.8}$$

where $I_p$ is the $p \times p$ identity matrix. We denote $\beta_{-1} = (\beta_{1,-1}^T, \ldots, \beta_{d,-1}^T)^T$ and reformulate the parameter space of $\beta_{-1}$ as follows:

$$\Theta_{-1} = \big\{ \beta_{-1} = (\beta_{l,-1}^T : 1 \le l \le d)^T : \|\beta_{l,-1}\|_2^2 < 1, \; \beta_{l,-1} \in R^{p-1} \big\}. \tag{2.9}$$
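The re-parametrization (2.7) and its Jacobian (2.8) are simple enough to write out directly. The Python sketch below is an illustration only; the function names are hypothetical, and the example vector is the first loading vector $(2, 1, 3)^T / \sqrt{14}$ used later in the simulation study.

```python
import math

def full_from_reduced(beta_minus):
    """Map beta_{l,-1} to beta_l = (sqrt(1 - ||beta_{l,-1}||^2), beta_{l,-1})."""
    s = sum(b * b for b in beta_minus)
    if s >= 1.0:
        raise ValueError("beta_{l,-1} must satisfy ||beta_{l,-1}||^2 < 1")
    return [math.sqrt(1.0 - s)] + list(beta_minus)

def jacobian(beta_minus):
    """p x (p - 1) Jacobian of beta_l with respect to beta_{l,-1}, as in (2.8)."""
    s = sum(b * b for b in beta_minus)
    root = math.sqrt(1.0 - s)
    k = len(beta_minus)
    first_row = [-b / root for b in beta_minus]
    identity = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    return [first_row] + identity

reduced = [1 / math.sqrt(14), 3 / math.sqrt(14)]
beta_l = full_from_reduced(reduced)
J_l = jacobian(reduced)
print(beta_l)
print(J_l)
```

By construction the reconstructed vector has unit norm and a positive first component, so the constraints $\|\beta_l\| = 1$ and $\beta_{l1} > 0$ hold automatically and the optimization over $\beta_{l,-1}$ is unconstrained inside the unit ball.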

Let $\beta = \beta(\beta_{-1})$ with $\beta_l = \beta_l(\beta_{l,-1})$ for $1 \le l \le d$. Since the estimation procedure for $\beta$ requires estimates of both $m_l$ and its first-order derivative $\dot m_l$, we adopt spline functions of one order lower than that used for $m_l$ to approximate $\dot m_l$. Following Ma and Song (2014), a spline estimator of $\dot m_l$ is given by

$$\hat{\dot m}_l(u_l, \beta) = \sum_{s=1}^{J_n} \dot B_{s,q}(u_l) \hat\lambda_{s,l}(\beta) = \sum_{s=2}^{J_n} B_{s,q-1}(u_l) \hat\omega_{s,l}(\beta) \tag{2.10}$$

where $\hat\omega_{s,l}(\beta) = (q-1) \{ \hat\lambda_{s,l}(\beta) - \hat\lambda_{s-1,l}(\beta) \} / (\xi_{s+q-1} - \xi_s)$ for $2 \le s \le J_n$. Thus, one has

$$\hat{\dot m}_l(u_l, \beta) = B_{q-1}(u_l)^T D_1 \hat\lambda_l(\beta),$$

where $B_{q-1}(u_l) = (B_{s,q-1}(u_l) : 2 \le s \le J_n)^T$ and

$$D_1 = (q-1) \begin{bmatrix} \dfrac{-1}{\xi_{q+1} - \xi_2} & \dfrac{1}{\xi_{q+1} - \xi_2} & 0 & \cdots & 0 \\ 0 & \dfrac{-1}{\xi_{q+2} - \xi_3} & \dfrac{1}{\xi_{q+2} - \xi_3} & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & \dfrac{-1}{\xi_{N_n+2q-1} - \xi_{N_n+q}} & \dfrac{1}{\xi_{N_n+2q-1} - \xi_{N_n+q}} \end{bmatrix}_{(J_n - 1) \times J_n}.$$
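The derivative formula (2.10) and the matrix $D_1$ can be checked numerically. The Python sketch below builds $D_1$ from a knot sequence, evaluates $B_{q-1}(u)^T D_1 \lambda$, and compares it with a central finite difference of the spline itself. The helper names, knots, and coefficients are hypothetical; the basis is computed by the standard Cox-de Boor recursion.

```python
def bspline_basis(u, knots, q):
    """Order-q B-spline basis values at u (Cox-de Boor recursion)."""
    m = len(knots)
    basis = [1.0 if knots[s] <= u < knots[s + 1] else 0.0 for s in range(m - 1)]
    for k in range(2, q + 1):
        nxt = []
        for s in range(m - k):
            left = right = 0.0
            if knots[s + k - 1] > knots[s]:
                left = (u - knots[s]) / (knots[s + k - 1] - knots[s]) * basis[s]
            if knots[s + k] > knots[s + 1]:
                right = (knots[s + k] - u) / (knots[s + k] - knots[s + 1]) * basis[s + 1]
            nxt.append(left + right)
        basis = nxt
    return basis

def derivative_matrix(knots, q):
    """(J - 1) x J matrix D_1 such that D_1 lambda = omega, as in (2.10)."""
    J = len(knots) - q
    D = [[0.0] * J for _ in range(J - 1)]
    for r in range(J - 1):
        j = r + 1                                  # 0-based index of s = r + 2
        scale = (q - 1) / (knots[j + q - 1] - knots[j])
        D[r][r], D[r][r + 1] = -scale, scale
    return D

def spline_derivative(u, knots, q, lam):
    """Derivative of the spline B_q(u)^T lam via B_{q-1}(u)^T D_1 lam."""
    omega = [sum(d * c for d, c in zip(row, lam)) for row in derivative_matrix(knots, q)]
    lower = bspline_basis(u, knots, q - 1)[1:len(lam)]   # B_{s,q-1}, s = 2..J
    return sum(b * w for b, w in zip(lower, omega))

q = 4
knots = [0.0] * q + [0.25, 0.5, 0.75] + [1.0] * q
lam = [1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.3]

def spline(u):
    return sum(b * c for b, c in zip(bspline_basis(u, knots, q), lam))

eps = 1e-6
finite_diff = (spline(0.37 + eps) - spline(0.37 - eps)) / (2 * eps)
print(finite_diff, spline_derivative(0.37, knots, q, lam))
```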

Step 2. After this re-parametrization, combined with the estimators $\hat m_l$ and $\hat{\dot m}_l$ for $l = 1, \ldots, d$, we can construct the profile spline modal objective function for the parametric components. We then obtain the estimator $\hat\beta_{-1}$ of $\beta_{-1}$ by maximizing $L_n(\beta(\beta_{-1}))$ over $\beta_{-1} \in \Theta_{-1}$, where

$$L_n(\beta(\beta_{-1})) = \frac{1}{n} \sum_{i=1}^{n} \phi_{h_2}\Big\{ Y_i - \sum_{l=1}^{d} \sum_{s=1}^{J_n} B_{s,q}(U_{il}(\beta_l)) \hat\lambda_{s,l}(\beta) X_{il} \Big\}, \tag{2.11}$$

which is equivalent to solving the following estimating equations:

$$\frac{\partial L_n(\beta(\beta_{-1}))}{\partial \beta_{-1}} = -\frac{1}{n} \sum_{i=1}^{n} \dot\phi_{h_2}\Big\{ Y_i - \sum_{l=1}^{d} \sum_{s=1}^{J_n} B_{s,q}(U_{il}(\beta_l)) \hat\lambda_{s,l}(\beta) X_{il} \Big\} \times \begin{pmatrix} \hat{\dot m}_1(U_{i1}(\beta_1), \beta) X_{i1} J_1^T Z_i + (\partial \hat\lambda(\beta)^T / \partial \beta_{1,-1}) D_i(\beta) \\ \vdots \\ \hat{\dot m}_d(U_{id}(\beta_d), \beta) X_{id} J_d^T Z_i + (\partial \hat\lambda(\beta)^T / \partial \beta_{d,-1}) D_i(\beta) \end{pmatrix} = 0 \tag{2.12}$$

where $D_i(\beta) = (D_{i,sl}(\beta_l), 1 \le s \le J_n, 1 \le l \le d)^T$ with $D_{i,sl}(\beta_l) = B_{s,q}(U_{il}(\beta_l)) X_{il}$, $\hat{\dot m}_l(\cdot, \beta)$ is given in (2.10), and $\dot\phi_{h_2}$ is the first derivative of $\phi_{h_2}$. We obtain the estimate of $\beta_{-1}$, say $\hat\beta_{-1}$, and then obtain $\hat\beta$ via the transformation (2.7). We call the estimator $\hat\beta$ the profile spline modal estimator (PSME).

3. Simulation Studies

3.1. Results in Finite Sample

In this section, we conduct simulation studies to evaluate the finite sample performance of the proposed methodology. We generate data from the following VICM:

$$Y_i = m(Z_i, X_i, \beta) + \varepsilon_i = m_1(Z_i^T \beta_1) X_{i1} + m_2(Z_i^T \beta_2) X_{i2} + m_3(Z_i^T \beta_3) X_{i3} + \varepsilon_i \tag{3.1}$$

with $X_i = (X_{i1}, X_{i2}, X_{i3})^T$, where $X_{i1}$ is generated from a Bernoulli distribution ($p = 0.5$), and $(X_{i2}, X_{i3})^T$ is drawn from a bivariate normal distribution with mean 0, variance 1, and covariance 0.2. To generate $Z_i = (Z_{i1}, Z_{i2}, Z_{i3})^T$, we first sample $(Z_{i1}^*, Z_{i2}^*, Z_{i3}^*)^T$ from a multivariate normal distribution with mean 0, variance 1, and covariance 0.2, and then let $Z_{ik} = \Phi(Z_{ik}^*) - 0.5$, $k = 1, 2, 3$, where $\Phi(\cdot)$ is the CDF of the standard normal distribution. The true loading parameters are set as $\beta_1 = \frac{1}{\sqrt{14}} (2, 1, 3)^T$, $\beta_2 = \frac{1}{\sqrt{14}} (3, 2, 1)^T$, $\beta_3 = \frac{1}{\sqrt{14}} (2, 3, 1)^T$. Set

$$m_l(u_l) = m_l^*(u_l) - E\{ m_l^*(u_l) \}, \quad l = 1, 2, 3$$

where $m_1^*(u_1) = 10 \exp(5 u_1) / \{ 1 + \exp(5 u_1) \}$, $m_2^*(u_2) = 5 \sin(\pi u_2)$, and $m_3^*(u_3) = 3 \{ \sin(\pi u_3) + \cos(2 \pi u_3 - 4\pi/3) \}$. Finally, $Y_i$, $1 \le i \le n$, are generated from the VICM (3.1), where $\beta = (\beta_1^T, \beta_2^T, \beta_3^T)^T$, and the errors $\varepsilon_i$ follow $N(0, \sigma^2(Z_i, X_i))$ with $\sigma^2(Z_i, X_i) = \{ 100 - m(Z_i, X_i, \beta) \} / \{ 100 + m(Z_i, X_i, \beta) \}$.
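The data-generating process of model (3.1) can be sketched in a few lines. The Python code below is an illustration, not the authors' R code. Two choices in it are assumptions: the equicorrelated normal vectors are produced with a shared-factor construction (which gives variance 1 and covariance 0.2), and the centering constant $E\{m_l^*(u_l)\}$ is replaced by the sample mean over the simulated indices. The names `simulate`, `BETA`, and `M_STAR` are hypothetical.

```python
import math
import random

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def equicorrelated_normals(rng, k, rho):
    """k standard normals with pairwise covariance rho (shared-factor trick)."""
    common = rng.gauss(0, 1)
    return [math.sqrt(rho) * common + math.sqrt(1 - rho) * rng.gauss(0, 1)
            for _ in range(k)]

# True loading vectors beta_1, beta_2, beta_3 of the simulation design.
BETA = [[c / math.sqrt(14) for c in row] for row in ([2, 1, 3], [3, 2, 1], [2, 3, 1])]

M_STAR = [
    lambda u: 10 * math.exp(5 * u) / (1 + math.exp(5 * u)),
    lambda u: 5 * math.sin(math.pi * u),
    lambda u: 3 * (math.sin(math.pi * u) + math.cos(2 * math.pi * u - 4 * math.pi / 3)),
]

def simulate(n, seed=0):
    rng = random.Random(seed)
    X, Z = [], []
    for _ in range(n):
        x1 = 1.0 if rng.random() < 0.5 else 0.0            # Bernoulli(0.5)
        x2, x3 = equicorrelated_normals(rng, 2, 0.2)
        X.append([x1, x2, x3])
        Z.append([std_normal_cdf(z) - 0.5 for z in equicorrelated_normals(rng, 3, 0.2)])
    U = [[sum(zk * bk for zk, bk in zip(z, b)) for b in BETA] for z in Z]
    # Assumption: centre m*_l by its sample mean as a stand-in for E{m*_l(u_l)}.
    centre = [sum(M_STAR[l](u[l]) for u in U) / n for l in range(3)]
    Y = []
    for x, u in zip(X, U):
        m = sum((M_STAR[l](u[l]) - centre[l]) * x[l] for l in range(3))
        var = (100 - m) / (100 + m)      # heteroscedastic variance; |m| << 100 here
        Y.append(m + rng.gauss(0, math.sqrt(var)))
    return X, Z, Y

X, Z, Y = simulate(200, seed=1)
print(len(Y), X[0], Z[0])
```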

Although the estimation process of the varying index coefficient model is introduced in Section 2.2, it is still difficult to solve the estimating equations (2.12) directly. Therefore, an iterative algorithm is needed to estimate the unknown parameters and the unknown coefficient functions. The algorithm is divided into the following two steps:

Step 1. The initial value $(\beta_1, \beta_2, \beta_3)$ of $\beta$ is obtained in the following four steps:

1) Assume that each unknown function $m_l$ is linear, so that $m(Z_i, X_i, \beta) = \sum_{l=1}^{d} \{ a_l + b_l (Z_i^T \beta_l) \} X_{il}$.

2) Writing $v_l = b_l \beta_l$, the estimate $(\hat a_l, \hat v_l)$ of $(a_l, v_l)$ is obtained by minimizing $\sum_{i=1}^{n} \{ Y_i - \sum_{l=1}^{d} (a_l + v_l^T Z_i) X_{il} \}^2$, and thus $\hat\beta_l^0 = (\hat v_l / \|\hat v_l\|_2) \, \mathrm{sgn}(\hat v_{1l})$ is obtained, where $\hat v_{1l}$ is the first component of $\hat v_l$.

3) Let $\hat U_l = Z_i^T \hat\beta_l^0$, then obtain the initial unknown functions $\hat m_l^{ini}(\cdot)$ from the varying coefficient model $Y = \sum_{l=1}^{d} m_l(\hat U_l) X_l + \varepsilon$.

4) Obtain $\beta^{ini}$, i.e. the initial value, by minimizing $2^{-1} \sum_{i=1}^{n} \{ Y_i - \sum_{l=1}^{d} \hat m_l^{ini}(Z_i^T \beta_l) X_{il} \}^2$.
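Sub-steps 1) and 2) above amount to an ordinary least squares fit followed by a normalization. The Python sketch below is a minimal illustration under the linear working assumption, written with a small hand-rolled normal-equations solver so that it needs no external libraries; the function names are hypothetical and the code is not the authors' R implementation.

```python
import math
import random

def solve_least_squares(A, y):
    """Least squares coefficients via the normal equations and
    Gaussian elimination with partial pivoting."""
    k = len(A[0])
    M = [[sum(r[a] * r[b] for r in A) for b in range(k)]
         + [sum(r[a] * v for r, v in zip(A, y))] for a in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    coef = [0.0] * k
    for c in reversed(range(k)):
        coef[c] = (M[c][k] - sum(M[c][j] * coef[j] for j in range(c + 1, k))) / M[c][c]
    return coef

def initial_beta(X, Z, Y):
    """Fit Y ~ sum_l (a_l + v_l' Z) X_l and return v_l / ||v_l|| * sgn(v_l1)."""
    d, p = len(X[0]), len(Z[0])
    # Design row: for each l, the block [X_il, Z_i1 X_il, ..., Z_ip X_il].
    A = [[t for l in range(d) for t in [x[l]] + [zk * x[l] for zk in z]]
         for x, z in zip(X, Z)]
    coef = solve_least_squares(A, Y)
    betas = []
    for l in range(d):
        v = coef[l * (p + 1) + 1:(l + 1) * (p + 1)]
        norm = math.sqrt(sum(t * t for t in v))
        sign = 1.0 if v[0] >= 0 else -1.0      # enforce a positive first component
        betas.append([sign * t / norm for t in v])
    return betas

# Hypothetical noiseless check with truly linear m_l(u) = a_l + b_l u.
rng = random.Random(0)
true_beta = [[c / math.sqrt(14) for c in row] for row in ([2, 1, 3], [3, 2, 1])]
a, b = [1.0, -0.5], [2.0, -3.0]
X = [[1.0 if rng.random() < 0.5 else 0.0, rng.gauss(0, 1)] for _ in range(100)]
Z = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(100)]
Y = [sum((a[l] + b[l] * sum(zk * bk for zk, bk in zip(z, true_beta[l]))) * x[l]
         for l in range(2)) for x, z in zip(X, Z)]
print(initial_beta(X, Z, Y))
```

The sign correction is what makes the direction identifiable: $v_l = b_l \beta_l$ determines $\beta_l$ only up to sign, and the constraint $\beta_{l1} > 0$ removes that ambiguity.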

Step 2. The iterative calculation is based on the large-sample asymptotic properties of the parameter estimates and the theorems given by Ma and Song (2015). Under certain assumptions, the estimated parameters satisfy the following asymptotic expansion:

$$\sqrt{n} (\hat\beta_{-1} - \beta_{-1}^0) = \Big\{ n^{-1} \sum_{i=1}^{n} \Phi(X_i, Z_i, \beta^0)^{\otimes 2} \Big\}^{-1} \times \Big\{ n^{-1/2} \sum_{i=1}^{n} \big( Y_i - m(Z_i, X_i, \beta^0) \big) \Phi(X_i, Z_i, \beta^0) \Big\} + o_p(1) \tag{3.2}$$

where $\Phi(X_i, Z_i, \beta^0) = [ \{ \dot m_l(U_l(\beta_l^0), \beta_l^0) X_l J_l^T \tilde Z \}^T, 1 \le l \le d ]$, and $\tilde Z = Z - P(Z)$ in the above expression. Here $\tilde Z$ can be estimated by $\tilde Z = Z - P_n(Z)$, where

$$P_n(Z_k) = \sum_{l=1}^{d} \hat g_{1J}^{0}(U_l(\hat\beta), \hat\beta) X_l. \tag{3.3}$$

The estimation procedure for $\hat g_{1J}^{0}(\cdot, \hat\beta)$ in Equation (3.3) is similar to that for the unknown function $\hat m_l(\cdot, \hat\beta)$, except that the response variable $Y$ is replaced by $Z_k$ in the iterative estimation process. The asymptotic expansion (3.2) yields an updating equation, which we use for the iterative calculations. The iteration stops when the absolute difference (dif) from the previously calculated parameter value is less than $10^{-4}$ or the iteration number (iter) is greater than or equal to 100.

Following the algorithm above, we wrote four functions in R (64-bit): Design matrix, transform, Jac, and vicmest. Among them, vicmest is the main program for estimating the unknown parameters, and the other three are intermediate conversion functions. First, we calculate the initial values $\beta_1, \beta_2, \beta_3$ of $\beta$ using Step 1. The results are shown in Table 1. Table 1 shows that the initial values calculated by Step 1 follow the same trend as the true values, but their deviation from the true values is still large. Therefore, the estimates need to be refined by Step 2. Running the four programs, the main program stops after 64 iterations, at which point $dif = 8.45267 \times 10^{-6}$. The estimated values $\hat\beta$ of $\beta$ and their deviations are shown in Table 2. Table 2 shows that the value of $\beta$ estimated by the profile spline modal estimator (PSME) is better: the deviation from the true value (Bias) is smaller, and the mean deviation is less than 5%.

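Table 1. The initial values of $\beta$ calculated by Step 1.

| Initial value | $\beta_1$ | $\beta_2$ | $\beta_3$ |
|---|---|---|---|
| 1 | 0.715 | 0.951 | 0.722 |
| 2 | 0.345 | 0.287 | 0.897 |
| 3 | 0.933 | 0.747 | 0.311 |

Table 2. The estimated value of $\beta$ and its deviation from the true value.

| n = 200 | $\beta_{11}$ | $\beta_{12}$ | $\beta_{13}$ | $\beta_{21}$ | $\beta_{22}$ | $\beta_{23}$ | $\beta_{31}$ | $\beta_{32}$ | $\beta_{33}$ |
|---|---|---|---|---|---|---|---|---|---|
| True | 0.534 | 0.267 | 0.801 | 0.801 | 0.534 | 0.267 | 0.534 | 0.801 | 0.267 |
| $\hat\beta$ | 0.537 | 0.316 | 0.782 | 0.798 | 0.538 | 0.271 | 0.461 | 0.833 | 0.305 |
| Bias | 0.003 | 0.049 | −0.019 | −0.003 | 0.003 | 0.004 | −0.074 | 0.032 | 0.038 |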

The main program vicmest returns not only the estimate $\hat\beta$ but also gamm0, the vector of coefficients of the B-spline basis expansion. Substituting the calculated $\hat\beta$ and the coefficients gamm0 into Formula (2.6) gives the values of the unknown functions $\hat m_l(\cdot, \beta)$ and the predicted values of the response variable $Y$. The gamm0 coefficients are shown in Table 3. The estimates $\hat m_l(\cdot, \beta)$ are plotted in Figure 1, where the red curve represents the estimate and the black curve represents the true function. Figure 1 shows that the B-spline expansion fits very well: the general trend of each unknown nonparametric function is preserved and the accuracy of the estimate is relatively high. As shown in Table 4, we further calculate the root mean square error (RMSE) of $\hat m_l(\cdot, \beta)$. Table 4 shows that the estimated functions deviate little from the true ones, with RMSE values between 0.222 and 0.349, which indicates that the estimation works well. Moreover, $Y$ can be calculated once the estimates $\hat\beta$ and $\hat m_l(\cdot, \beta)$ have been obtained. Finally, the error variance of model (3.1) is calculated to be 5.777.

3.2. Results in High-Dimensional Case

In this section, we numerically simulate the variance estimation of the varying index coefficient model in high-dimensional conditions.

The profile spline modal estimator (PSME) estimates the variance well in low-dimensional settings. With high-dimensional data, however, it suffers from the curse of dimensionality, and the deviation of the estimated variance increases with the dimension. The re-adjusted cross-validation method proposed by Fan et al. (2012) has been shown, through theoretical proof and simulation, to be an effective way to overcome the curse of dimensionality in high-dimensional problems. This paper therefore applies the re-adjusted cross-validation method (RCV) to the high-dimensional varying index coefficient model for the first time. There are two types of covariates in the varying index coefficient model (2.2). The first type is $Z$. If $Z$ consists of a single variable, its relationship with the covariate $X$ is more difficult to detect. The second type is the covariate $X$. We are concerned with estimating the variance of the varying index coefficient model when the first type of covariate, $Z$, is high-dimensional.

We first perform data simulation. The setting of the true model is the same as model (3.1), with only the dimension of the covariate $Z$ changed; that is, the first type of covariate, $Z$, is set to be a high-dimensional variable, with $Z_{ik} = \Phi(Z_{ik}^*) - 0.5$, $k = 1, 2, \ldots, d$, so that $Z$ is a $d$-dimensional covariate. The settings of $m_l(u_l)$ and the error term are also the same as in model (3.1).

Table 3. The coefficients of the B-spline basis functions.

| | $B_{i1}$ | $B_{i2}$ | $B_{i3}$ | $B_{i4}$ | $B_{i5}$ | $B_{i6}$ |
|---|---|---|---|---|---|---|
| $B_{1i}$ | −4.444 | −4.808 | −3.0887 | 3.912 | 5.004 | 3.973 |
| $B_{2i}$ | −2.083 | −6.179 | −4.543 | 4.683 | 5.185 | 4.876 |
| $B_{3i}$ | −6.294 | −2.710 | 3.652 | −5.970 | 7.624 | 5.766 |

Table 4. Root mean square error (RMSE) of $\hat m_l(\cdot, \beta)$.

| | $\hat m_1(\cdot, \beta)$ | $\hat m_2(\cdot, \beta)$ | $\hat m_3(\cdot, \beta)$ |
|---|---|---|---|
| RMSE | 0.349 | 0.222 | 0.258 |

In high-dimensional data, the independent variables are often highly correlated. However, not all independent variables are related to the dependent variable $Y$; in fact, only a small number of covariates are associated with it. For selecting among such high-dimensional variables, Fan et al. (2008) proposed the SIS method, which is based on a correlation criterion, has the sure screening property, and can first reduce the dimension $d$ to a relatively small number while retaining all important variables with high probability. Lower-dimensional model selection methods such as SCAD, the Dantzig selector, LASSO, or adaptive LASSO can then be used. With these methods, some smaller coefficients are shrunk to zero, thereby removing the extraneous variables that passed the SIS screening. The idea of SIS makes high-dimensional model selection feasible, greatly speeds up variable selection, and makes model selection efficient and modular. The SIS variable selection method can be used in conjunction with any model selection technique. Fan et al. (2010) applied the SIS method to the Cox proportional hazards model. The Cox proportional hazards model is similar to the varying index coefficient model considered in this paper: both are nonlinear models, and both require estimating a nonparametric function as well as the coefficients of a parametric part. Therefore, we believe that for high-dimensional data it is feasible to use the SIS method for the first-stage variable selection in the varying index coefficient model.

We use the SIS method proposed by Fan et al. (2008) to select variables. The number of selected variables is tentatively set to 20. The calculations are carried out in R. We wrote the functions VicmRCV and vicmest for the RCV estimation process. The simulation was repeated 100 times, and a box plot of the estimated variances is shown in Figure 2. In the figure, naïve denotes the simple two-stage approach, while rcv denotes the re-adjusted cross-validation method.
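The contrast between the naive two-stage estimator and RCV can be illustrated on a linear model. The Python sketch below follows the general idea of Fan et al. (2012): split the sample into two halves, select variables on one half, refit by least squares on the other half, and average the two resulting variance estimates. It is an assumption-laden illustration, not the authors' VicmRCV function: the model is linear rather than a VICM, screening uses marginal correlations, an intercept is included in each refit, and all function names and the toy data are hypothetical.

```python
import random

def sis_screen(rows, y, k):
    """Indices of the k columns with the largest absolute marginal correlation."""
    n = len(y)
    y_bar = sum(y) / n
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        c_bar = sum(col) / n
        sxy = sum((c - c_bar) * (v - y_bar) for c, v in zip(col, y))
        sxx = sum((c - c_bar) ** 2 for c in col)
        scores.append(abs(sxy) / sxx ** 0.5)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

def ols_rss(rows, y, cols):
    """Residual sum of squares of y regressed on an intercept and `cols`."""
    X = [[1.0] + [r[j] for j in cols] for r in rows]
    k = len(X[0])
    M = [[sum(x[a] * x[b] for x in X) for b in range(k)]
         + [sum(x[a] * v for x, v in zip(X, y))] for a in range(k)]
    for c in range(k):                       # Gaussian elimination, partial pivoting
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    b = [0.0] * k
    for c in reversed(range(k)):
        b[c] = (M[c][k] - sum(M[c][j] * b[j] for j in range(c + 1, k))) / M[c][c]
    return sum((v - sum(bj * xj for bj, xj in zip(b, x))) ** 2 for x, v in zip(X, y))

def naive_variance(rows, y, k):
    """Select and fit on the same data (simple two-stage estimator)."""
    cols = sis_screen(rows, y, k)
    return ols_rss(rows, y, cols) / (len(y) - k - 1)

def rcv_variance(rows, y, k):
    """Select on one half, refit on the other half, and average both directions."""
    h = len(y) // 2
    r1, y1, r2, y2 = rows[:h], y[:h], rows[h:], y[h:]
    s1 = ols_rss(r2, y2, sis_screen(r1, y1, k)) / (len(y2) - k - 1)
    s2 = ols_rss(r1, y1, sis_screen(r2, y2, k)) / (len(y1) - k - 1)
    return 0.5 * (s1 + s2)

def toy_data(n=200, p=100, seed=1):
    """Hypothetical data with true error variance 1 and two active columns."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3 * r[0] - 2 * r[3] + rng.gauss(0, 1) for r in rows]
    return rows, y

rows, y = toy_data()
print(naive_variance(rows, y, 5), rcv_variance(rows, y, 5))
```

Because the refit in RCV uses data that played no part in the selection, spuriously correlated variables picked up in the first stage do not systematically shrink the residual sum of squares, which is the motivation for the method.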

Figure 2 shows that when the dimension $d$ of the variable $Z$ is as high as 100 and the sample size is only 200, the variance estimated by the re-adjusted cross-validation (RCV) two-stage method is better than that of the simple two-stage method. However, the estimated error variance is still large, which suggests that, to some extent, the estimation method for the varying index coefficient model described in Section 3.1 is not sufficiently accurate or robust.

As shown in Table 5, changing the values of $d$ and $n$ gives more simulation results. Table 5 compares the ordinary two-stage method (Naive-SIS) with the RCV two-stage method (RCV-SIS) at $n = 100$ and $d = 50, 100, 500$. Comparing the mean square error (MSE) of the two estimation methods, we find that the MSE of the RCV two-stage estimate is smaller in every dimension than that of the ordinary two-stage method. That is, the model estimated by the RCV method is more accurate. Table 5 also reveals another pattern. Conventionally, the estimation accuracy decreases as the dimension $d$ increases, so the MSE becomes larger. However, the results in Table 5 show that this rule does not hold for the ordinary two-stage method. At the largest dimension ($d = 500$), the MSE is the smallest, with a value of 6.692, while at $d = 100$ the MSE is the largest, with a value of 8.273. In short, the ordering is irregular, and the MSE does not grow with the dimension as would generally be expected.

In fact, this phenomenon is not difficult to explain. In the variable selection phase, the SIS method selects the variables whose correlation coefficients rank in the top twenty (in descending order). Since the fixed value 20 is small relative to the number of covariates, the probability of selecting all important variables is relatively low. The RCV-SIS column in Table 5 shows that the SIS method is much more stable when combined with RCV. At $d = 50$, the estimated MSE is the smallest, with a value of 4.838. In all three dimensions, the error estimated by the RCV method is smaller than the MSE estimated by the ordinary two-stage method.

4. Real Data Analysis

In this section, we use data collected by the Mayo Clinic. These data were obtained from a trial in primary biliary cirrhosis (PBC) conducted by the Mayo Clinic from 1974 to 1984. The data can be found in the survival package in R. The data set includes 424 PBC patients who were referred to the Mayo Clinic during the decade between 1974 and 1984 and who met the eligibility criteria for the randomized placebo-controlled trial.

In the data set, the first 312 patients participated in the randomized trial, while the other 112 patients did not participate in the clinical trial but agreed to have basic measurements recorded and to follow the medical recommendations. Six of these patients were lost to follow-up shortly after diagnosis, so the data set contains 106 such cases in addition to the 312 randomized participants. We preprocessed the data set in R.

Table 5. Mean square error (MSE) for two different estimation methods at n = 100.

| | Naïve-SIS | RCV-SIS |
|---|---|---|
| d = 50 | 8.171 | 4.838 |
| d = 100 | 8.273 | 5.199 |
| d = 500 | 6.692 | 5.043 |

We first remove the samples with missing values. After deletion, 276 samples remain for the calculation. The variables are described in Table 6.

As can be seen from Table 6, the response variable $Y$ is the survival time of the patient. Since the values of the response variable $Y$ vary widely, we take the logarithm of the time $Y$ to reduce the error. There are three covariates in $X$: the patient's status $X_1$ (Status), the patient's age $X_2$ (Age), and the patient's sex $X_3$ (Sex). The sex variable is included because Huang Siyu (1985) found that the incidence of primary biliary cirrhosis (PBC) in men is much lower than in women, which indicates that sex is strongly related to PBC. The other type of covariate, $Z$, has a total of 15 variables, including albumin (albumin), alkaline phosphatase (alk.phos), triglycerides (Trig), and platelet count (Platelet). Therefore, the varying index coefficient model constructed in this section is as follows:

$$\ln(Y_i) = m(Z_i, X_i, \beta) + \varepsilon_i = \sum_{l=1}^{3} m_l\Big( \sum_{d=1}^{15} Z_{id} \beta_{ld} \Big) X_{il} + \varepsilon_i. \tag{4.1}$$

Table 6. Interpretation of experimental variables in primary biliary cirrhosis (PBC).

| Serial number | Variables | Description |
|---|---|---|
| 1 | Time (Y) | number of days between registration and the earlier of death |
| 2 | Status (X1) | status at end point, 0/1/2 for censored, transplant, dead |
| 3 | Age (X2) | in years |
| 4 | Sex (X3) | m/f |
| 5 | Trt (Z1) | 1/2/NA for D-penicillamine, placebo, not randomised |
| 6 | Ascites (Z2) | presence of ascites |
| 7 | Hepato (Z3) | presence of hepatomegaly or enlarged liver |
| 8 | Spiders (Z4) | blood vessel malformations in the skin |
| 9 | Edema (Z5) | 0 no edema, 0.5 untreated or successfully treated, 1 edema despite diuretic therapy |
| 10 | Bili (Z6) | serum bilirubin (mg/dl) |
| 11 | Chol (Z7) | serum cholesterol (mg/dl) |
| 12 | Albumin (Z8) | serum albumin (g/dl) |
| 13 | Copper (Z9) | urine copper (ug/day) |
| 14 | alk.phos (Z10) | alkaline phosphatase (U/liter) |
| 15 | Ast (Z11) | aspartate aminotransferase, once called SGOT (U/ml) |
| 16 | Trig (Z12) | triglycerides (mg/dl) |
| 17 | Platelet (Z13) | platelet count |
| 18 | Protime (Z14) | standardised blood clotting time |
| 19 | Stage (Z15) | histologic stage of disease (needs biopsy) |

Since the covariates in Z have different physical meanings and are measured in different units, fitting the model to the raw values would not be meaningful. We therefore remove the effect of scale by applying Z-score standardization to these 15 variables before any computation.
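As a minimal sketch of the Z-score standardization described above, each column is centered at its sample mean and divided by its sample standard deviation. The helper name `zscore_columns` and the simulated matrix are assumptions used only for illustration; they are not the PBC data.

```python
import numpy as np

def zscore_columns(Z):
    """Center each column at 0 and scale it to unit sample standard deviation."""
    Z = np.asarray(Z, dtype=float)
    mean = Z.mean(axis=0)
    std = Z.std(axis=0, ddof=1)          # sample standard deviation
    if np.any(std == 0):
        raise ValueError("a constant column cannot be standardized")
    return (Z - mean) / std

# Hypothetical stand-in: 276 rows, three columns on very different scales.
rng = np.random.default_rng(0)
Z_raw = rng.normal(loc=50.0, scale=[1.0, 10.0, 100.0], size=(276, 3))
Z_std = zscore_columns(Z_raw)
```

After the transformation every column has mean 0 and standard deviation 1, so no single covariate dominates the index because of its measurement unit.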

Previous studies indicate that variables such as serum bilirubin (Z6), albumin (Z8), urine copper (Z9), alkaline phosphatase (Z10), and prothrombin time (Z14) are strongly related to the response variable Y. We first use the SIS method to select the 8 covariates most strongly correlated with the response, and then use the simple two-stage method and the re-adjusted cross-validation (RCV) two-stage method to estimate the coefficient β of the covariate Z and the model variance. The results are shown in Table 7.
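The two procedures compared here can be sketched as follows. This is only an illustration under simplifying assumptions: it uses an ordinary linear model fitted by least squares as a stand-in for the varying index coefficient model, simulated data instead of the PBC data, and hypothetical function names. The simple two-stage method screens and estimates the variance on the same data, whereas RCV screens on one half of the sample and refits on the other half, then swaps the roles and averages the two variance estimates.

```python
import numpy as np

def sis_select(X, y, k):
    """Indices of the k columns with the largest absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.sort(np.argsort(-np.abs(corr))[:k])

def ols_residual_variance(X, y, cols):
    """Residual variance of a least-squares fit (with intercept) on the given columns."""
    design = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return resid @ resid / (len(y) - design.shape[1])

def naive_two_stage_variance(X, y, k):
    """Simple two-stage estimate: screen and refit on the same data."""
    return ols_residual_variance(X, y, sis_select(X, y, k))

def rcv_variance(X, y, k, rng):
    """RCV estimate: screen on one half, refit on the other, swap, and average."""
    perm = rng.permutation(len(y))
    half = len(y) // 2
    a, b = perm[:half], perm[half:]
    sel_a = sis_select(X[a], y[a], k)    # variables chosen from half a
    sel_b = sis_select(X[b], y[b], k)    # variables chosen from half b
    var_b = ols_residual_variance(X[b], y[b], sel_a)
    var_a = ols_residual_variance(X[a], y[a], sel_b)
    return 0.5 * (var_a + var_b)

# Simulated stand-in data (assumption): 3 active covariates, noise variance 1.
rng = np.random.default_rng(1)
n, p, k = 276, 500, 8
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.standard_normal(n)

selected = sis_select(X, y, k)
var_naive = naive_two_stage_variance(X, y, k)
var_rcv = rcv_variance(X, y, k, rng)
print(selected, var_naive, var_rcv)
```

Refitting on data that played no part in the selection is what protects the RCV estimate from spurious correlations picked up during screening, which is the motivation for using it in the high-dimensional setting.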

As can be seen from Table 7, important variables such as Z6 and Z9 are selected in all three SIS selections (the N-SIS selection and both RCV stages). These are among the variables that earlier analysis identifies as strongly related to the response, which suggests that the SIS method retains the important variables with high probability. Because RCV carries out variable selection more than once, a variable missed in the first selection can be recovered and a superfluous variable selected the first time can be dropped. The model variance estimated by the RCV-SIS two-stage method (1.495) is markedly smaller than that of the N-SIS simple two-stage method (2.674), and its mean square error is also smaller (1.661 versus 2.184). In the high-dimensional case, the re-adjusted cross-validation method (RCV) therefore performs better for the varying index coefficient model. On this basis, the RCV-SIS two-stage method is more accurate in predicting the survival time of patients and can provide more reasonable guidance for follow-up medical treatment.

Table 7. PBC dataset estimation results

| | N-SIS | RCV-SIS |
|---|---|---|
| AMS (variables selected) | Z2, Z4, Z5, Z6, Z8, Z9, Z12, Z15 | First stage: Z4, Z5, Z6, Z7, Z9, Z11, Z12, Z15. Second stage: Z2, Z5, Z6, Z8, Z9, Z12, Z14, Z15 |
| MSE | 2.184 | 1.661 |
| Variance | 2.674 | 1.495 |

5. Discussion

In this paper, we study a new class of semiparametric regression models: varying index coefficient models. The unknown coefficient β is estimated by the profile spline modal estimator method (PSME), while the unknown nonparametric function part is expanded with B-splines. After studying the asymptotic properties of the coefficients, we estimate the coefficient β using an iterative method. In the simulations, we found that the β estimated by this method has a small bias, and that the B-spline estimate of the unknown function part fits well. Finally, in the high-dimensional setting, we carried out a two-stage RCV estimation of the varying index coefficient model and found that the variance and mean square error estimated by the RCV method are smaller than those of the simple two-stage method. For the empirical analysis, we originally intended to model the PBC data using a survival model (a semiparametric varying coefficient additive risk model). However, the literature indicates that the sex and status variables are closely related to the survival time of patients with primary biliary cirrhosis, and that the variables Z are related to the three variables X (status, sex and age). Therefore, we used the varying index coefficient model to model the PBC data, and found that the variance and mean square error of the RCV method are again smaller than those of the simple two-stage method.

Further research on the proposed method is needed. First, the asymptotic properties of the proposed method remain to be investigated. Second, this paper only estimates the variance and mean square error of the varying index coefficient model; it does not study the estimation of the coefficient β in the parametric part or of the nonparametric functions. More robust estimation methods can therefore be studied in the future. In addition, more attention can be given to the asymptotic properties of the nonparametric part of the varying index coefficient model.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

Cite this paper

Wang, M., Lv, H. and Wang, Y.C. (2019) Variance Estimation for High-Dimensional Varying Index Coefficient Models. Open Journal of Statistics, 9, 555-570. https://doi.org/10.4236/ojs.2019.95037

References

Tibshirani, R. (1996) Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society (Series B), 58, 267-288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Fan, J. and Peng, H. (2004) On Nonconcave Penalized Likelihood with Diverging Number of Parameters. Annals of Statistics, 32, 928-961. https://doi.org/10.1214/009053604000000256
Zhao, P. and Yu, B. (2006) On Model Selection Consistency of Lasso. Journal of Machine Learning Research, 7, 2541-2563.
Bunea, F., Tsybakov, A. and Wegkamp, M. (2007) Sparsity Oracle Inequalities for the Lasso. Electronic Journal of Statistics, 64, 330-332. https://doi.org/10.1214/07-EJS008
Zhang, C.H. and Huang, J. (2008) The Sparsity and Bias of the Lasso Selection in High-Dimensional Linear Regression. Annals of Statistics, 36, 1567-1594. https://doi.org/10.1214/07-AOS520
Lv, J. and Fan, Y. (2009) A Unified Approach to Model Selection and Sparse Recovery Using Regularized Least Squares. Annals of Statistics, 37, 3498-3528. https://doi.org/10.1214/09-AOS683
Fan, J. and Lv, J. (2011) Nonconcave Penalized Likelihood with NP-Dimensionality. IEEE Transactions on Information Theory, 57, 5467-5484. https://doi.org/10.1109/TIT.2011.2158486
Kim, Y., Choi, H. and Oh, H.S. (2008) Smoothly Clipped Absolute Deviation on High Dimensions. Journal of the American Statistical Association, 103, 1665-1673. https://doi.org/10.1198/016214508000001066
Candes, E. and Tao, T. (2005) The Dantzig Selector: Statistical Estimation When p Is Much Larger than n. Annals of Statistics, 35, 2313-2351. https://doi.org/10.1214/009053606000001523
Fan, J. and Lv, J. (2008) Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society, 70, 849-911. https://doi.org/10.1111/j.1467-9868.2008.00674.x
Fan, J., Samworth, R. and Wu, Y. (2009) Ultrahigh Dimensional Feature Selection: Beyond the Linear Model. Journal of Machine Learning Research, 10, 2013-2038.
Zhao, S.D., Cai, T.T. and Li, H. (2014) Variance Estimation in High-Dimensional Linear Models. Biometrika, 2, 269-284. https://doi.org/10.1093/biomet/ast065
Reid, S., Tibshirani, R. and Friedman, J. (2016) A Study of Error Variance Estimation in Lasso Regression. Statistica Sinica, 26, 35-67. https://doi.org/10.5705/ss.2014.042
Hastie, T. and Tibshirani, R. (1993) Varying-Coefficient Models. Journal of the Royal Statistical Society (Series B), 55, 757-796. https://doi.org/10.1111/j.2517-6161.1993.tb01939.x
Wong, H., Ip, W.C. and Zhang, R. (2008) Varying-Coefficient Single-Index Model. Statistics, 52, 1458-1476. https://doi.org/10.1016/j.csda.2007.04.008
Ma, S. and Song, X.K. (2015) Varying Index Coefficient Models. Journal of the American Statistical Association, 110, 341-356. https://doi.org/10.1080/01621459.2014.903185
Xue, L. and Liang, H. (2008) Polynomial Spline Estimation for a Generalized Additive Coefficient Model. Scandinavian Journal of Statistics, 37, 26-46. https://doi.org/10.1111/j.1467-9469.2009.00655.x
Lv, J., Yang, H. and Guo, C. (2016) Robust Estimation for Varying Index Coefficient Models. Computational Statistics, 31, 1-37. https://doi.org/10.1007/s00180-015-0595-5
Fan, J., Guo, S. and Hao, N. (2012) Variance Estimation Using Refitted Cross-Validation in Ultrahigh Dimensional Regression. Journal of the Royal Statistical Society (Series B), 74, 37-65. https://doi.org/10.1111/j.1467-9868.2011.01005.x
Fan, J., Yang, F. and Wu, Y. (2010) High-Dimensional Variable Selection for Cox’s Proportional Hazards Model. Statistics, 105, 205-217. https://doi.org/10.1214/10-IMSCOLL606
Huang, S.Y. (1985) Is There a Difference in the Severity of Primary Biliary Liver Hardening? International Journal of Digestive Diseases, No. 3, 186-187.