Variance Estimation for High-Dimensional Varying Index Coefficient Models

This paper studies the re-adjusted cross-validation (RCV) method for a semiparametric regression model, the varying index coefficient model (VICM). We estimate the coefficients of the parametric part of the VICM with the profile spline modal estimator (PSME) method, and approximate the unknown functions of the nonparametric part by B-spline expansions. We then combine these two estimation methods in the high-dimensional setting. Simulation results and an empirical analysis show that, for the varying index coefficient model, the re-adjusted cross-validation method is more accurate and more stable than traditional methods based on ordinary least squares.

tinuous nor ordered. Therefore, the variance of the model estimated by the traditional method will be large. Moreover, with the development of technology, high-dimensional data now arise in all aspects of life. The number of variables increases exponentially, and the computational cost of the above two criteria grows exponentially as well, so these methods cannot be applied to high-dimensional data.
In past research, many important variable selection methods have been proposed, such as LASSO (Least Absolute Shrinkage and Selection Operator) and SCAD (Smoothly Clipped Absolute Deviation). LASSO was proposed by Tibshirani (1996) [1]. For more details, see Fan and Peng (2004), Zhao and Yu (2006), Bunea (2007), Zhang and Huang (2008), Lv and Fan (2009), Fan and Lv (2011), and Kim (2008) [2]-[8]. These methods add a penalty term to the ordinary least squares criterion so that some coefficients are shrunk to exactly 0, which excludes the corresponding variables from the model. Another variable selection tool is the DS (Dantzig Selector), first proposed by Candes and Tao (2005) [9], which can easily be recast as a linear program. Fan and Lv (2008) [10] rank the covariates by their marginal correlation with the response variable and then select the few covariates with the largest correlation coefficients. This method is called SIS (Sure Independence Screening). Later studies extended SIS to the iterative SIS (ISIS) method: the response is regressed on the variables selected by SIS, the regression residuals replace the response variable, and SIS is applied again for a new round of variable selection. These steps are repeated until all the important variables are selected. For details, see Fan et al. (2009) [11].
After screening out the important variables, the second step of the simple two-stage method is generally carried out by the least squares method. However, to overcome the curse of dimensionality, many scholars have studied variance estimation for high-dimensional data. Fan et al. proposed a re-adjusted cross-validation method (RCV) in 2012 to improve the simple two-stage approach. It is proved that the variance estimated by this method is stable and accurate. Zhao et al. (2014) [12] studied the variance estimation of linear models under certain assumptions. Reid, Tibshirani and Friedman (2016) [13] studied the estimation of the error variance in LASSO regression and performed a large number of simulations. They concluded that the variance estimator based on the residual sum of squares with an adaptively selected regularization parameter has good finite-sample properties.
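The screening step of SIS described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of Fan and Lv: the function names `pearson` and `sis`, the toy data-generating model, and the choice of five retained variables are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis(X, y, k):
    """Return the indices of the k columns of X with the largest
    absolute marginal correlation with y, in descending order."""
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    return sorted(range(p), key=lambda j: -scores[j])[:k]


# Toy data (assumed for illustration): only columns 0 and 5 carry signal.
rng = random.Random(0)
n, p = 200, 50
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * row[0] - 2 * row[5] + rng.gauss(0, 0.5) for row in X]

selected = sis(X, y, 5)
print("selected columns:", selected)
```

Because the number of retained variables is fixed in advance, the remaining slots are filled by covariates that are only spuriously correlated with the response, which is the situation the variance estimators discussed next must cope with.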
A well-behaved variance estimation method can improve the prediction accuracy of the model and better explain socio-economic phenomena. However, choosing a suitable regression model is even more important, and there is a large literature on regression models. When the data dimension is low, parametric and nonparametric models are sufficient to solve the problem. But as the dimension increases, a more flexible semi-parametric  [18] improved the PLSE, proposed a robust estimation procedure combining the logarithmic regression and the B-spline, and established the large-sample properties of the parameter estimator.
The unknown coefficient $\beta$ is estimated by the profile spline modal estimator (PSME) method. Moreover, to obtain the asymptotic distribution of the unknown function $m_l(Z)$, they also proposed a two-stage method based on local linear kernel estimation.
The rest of the paper is organized as follows. In Section 2, we briefly introduce the varying index coefficient model, including the estimation method, the statistical inference of the coefficients, and the RCV estimation of the model. In Section 3, simulation studies are conducted to evaluate the finite sample performance of the proposed methods. In Section 4, a real data set is analyzed to compare the proposed methods with the existing methods. A discussion is given in Section 5.

Varying Index Coefficient Models
The semiparametric model is widely used in regression analysis, especially the varying coefficient model (VCM) proposed by Hastie and Tibshirani in 1993, which has been widely applied to real data. An important feature of the varying coefficient model is that the coefficients of its covariates are governed by smooth functions, which can capture nonlinear interactions. The form of the varying coefficient model is as follows:
$$Y = \sum_{l=1}^{d} m_l(Z) X_l + \varepsilon, \tag{2.1}$$
where $Y$ is a response variable, $X = (X_1, \ldots, X_d)^{\mathrm T}$ and $Z$ are covariates, $m(\cdot) = (m_1(\cdot), \ldots, m_d(\cdot))^{\mathrm T}$ is the vector of the unknown coefficient functions, and the model error $\varepsilon$ is independent of $(X, Z)$ with mean zero and finite variance $\sigma^2$. The varying coefficient model of Equation (2.1) faces two challenges with today's complex data.
First, the variable Z has little effect on Y, so the interaction between the variables Z and X is difficult to detect; second, in many complex situations Z is multi-dimensional, for example when studying the effects between chemical constituents. Thus, the coefficient function

Estimation Procedure for the VICM
The estimation of the varying index coefficient model has two main aspects: the estimation of the parametric part $\beta$, and the estimation of the coefficient functions $m_l(u_l)$ of the nonparametric part. In this paper, the unknown coefficient $\beta$ is estimated by the profile spline modal estimator (PSME) method. Once $\beta$ is fixed, the unknown function coefficient subject to the constraint 1 where $N_n$ increases along with the sample size $n$. Consider the distance between two neighboring knots Subsequently, we estimate the parameter vectors $\beta_l$ and the nonparametric functions $m_l(\cdot)$ in two steps below.
Step 1. Given $\beta$, we obtain estimate $\lambda(\beta)$ of ( ) Thus, $\beta_l$ is infinitely differentiable with respect to $\beta_{l,-1}$ and the Jacobian matrix is 2 , 1
Step 2. After this re-parametrization, combined with the estimators of $m_l$ for $l = 1, \ldots, d$, we can construct the profile spline modal objective function for the parametric components. Then, we can obtain the estimator, which is equivalent to solving the following estimating equations: We obtain the estimate of $\beta_{-1}$, say $\hat{\beta}_{-1}$, and then obtain $\hat{\beta}$ via the transformation (2.7). Thus, we call the estimator $\hat{\beta}$ the profile spline modal estimator (PSME).
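The B-spline expansion used for the unknown functions can be illustrated with a small Python sketch. It evaluates a clamped cubic B-spline basis with the Cox-de Boor recursion and forms an approximation of the form $m(u) \approx \sum_s \lambda_s B_s(u)$. The interval [0, 1], the knot placement, the degree, and the coefficient values in `lam` are illustrative assumptions rather than the paper's settings, and the function name `bspline_basis` is hypothetical.

```python
def bspline_basis(u, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at u,
    using the Cox-de Boor recursion on a non-decreasing knot vector."""
    m = len(knots) - 1
    # Degree-0 basis: indicator of the half-open knot interval containing u.
    B = [1.0 if knots[i] <= u < knots[i + 1] else 0.0 for i in range(m)]
    # Close the last non-empty interval so the right endpoint is covered.
    if u == knots[-1]:
        for i in reversed(range(m)):
            if knots[i] < knots[i + 1]:
                B[i] = 1.0
                break
    for d in range(1, degree + 1):
        nxt = []
        for i in range(m - d):
            left = right = 0.0
            den1 = knots[i + d] - knots[i]
            if den1 > 0:
                left = (u - knots[i]) / den1 * B[i]
            den2 = knots[i + d + 1] - knots[i + 1]
            if den2 > 0:
                right = (knots[i + d + 1] - u) / den2 * B[i + 1]
            nxt.append(left + right)
        B = nxt
    return B


# Clamped cubic basis on [0, 1] with three interior knots (assumed values).
degree = 3
interior = [0.25, 0.5, 0.75]
knots = [0.0] * (degree + 1) + interior + [1.0] * (degree + 1)

# Hypothetical spline coefficients; in practice they would be estimated.
lam = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0]


def m_hat(u):
    """Spline approximation: sum over s of lam[s] * B_s(u)."""
    return sum(c * b for c, b in zip(lam, bspline_basis(u, knots, degree)))


grid = [0.0, 0.1, 0.5, 0.99, 1.0]
basis_rows = [bspline_basis(u, knots, degree) for u in grid]
print([round(m_hat(u), 4) for u in grid])
```

With three interior knots and degree three, the basis has seven functions that sum to one at every point of [0, 1]. Given the index parameters, fitting the coefficient functions then reduces to estimating the finite vector of spline coefficients.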

Results in Finite Sample
In this section, we conduct simulation studies to evaluate the finite sample performance of the proposed methods, with $\beta = (\beta_1^{\mathrm T}, \beta_2^{\mathrm T}, \beta_3^{\mathrm T})^{\mathrm T}$ and errors $\varepsilon_i$. Although the estimation process of the varying index coefficient model is introduced in Section 2.2, it is still difficult to estimate (2.12) directly. Therefore, an iterative algorithm is needed to estimate the unknown parameters and the unknown function coefficients. The algorithm consists of the following two steps: Step 1. The initial value ( )

3) Let
in the above expression. Here Z  can be estimated by The estimation procedure of ( )  Table 1. It can be seen from Table 1 that the initial value calculated in Step 1 follows the trend of the actual value, but its deviation from the actual value is still large. Therefore, the estimate must be refined in the second step. Running the four programs, including vicmest, the main program stops after 64 iterations, and  Table 2. It can be seen from Table 2 that the value of a estimated by the profile spline modal estimator (PSME) is better: the deviation from the actual value (Bias) is smaller, and the mean deviation is less than 5%.    Table 3. An estimate of $m_l(\cdot, \beta)$ can be seen in Figure 1, where the red curve represents the estimate and the black curve represents the actual value. Figure 1 shows that the B-spline expansion fits very well: the general trend of the unknown nonparametric function is well preserved, and the accuracy of the estimation is relatively high. As shown in Table 4, we further calculate the root mean square error of the coefficient of $m_l(\cdot, \beta)$. It can be seen from

Results in High-Dimensional Case
In this section, we numerically simulate the variance estimation of the varying index coefficient model in high-dimensional conditions.
The profile spline modal estimator (PSME) estimates the variance well in low-dimensional settings. However, with high-dimensional data it suffers from the curse of dimensionality, and the deviation of the estimated variance increases with the dimension.
They are all nonlinear models, and both require estimating the coefficients of the nonparametric functions and of the parametric part. Therefore, we believe that in the case of high-dimensional data, it is feasible to use the SIS method for the first-stage variable selection in the varying index coefficient model.
We use the SIS method proposed by Fan and Lv (2008) [10] to select variables. The number of variables selected is tentatively set to 20. The simulation is carried out in R. We wrote the functions VicmRCV and vicmest for the RCV estimation procedure. The simulation was repeated 100 times, and the box plot of the estimated variances shown in Figure 2 was obtained. In the figure, naïve denotes the simple two-stage approach, while rcv denotes the re-adjusted cross-validation method.
It can be seen from Figure   As shown in Table 5, changing the values of p and n gives more simulation results. Table 5 compares the ordinary two-stage method (Naive-SIS) with the RCV two-stage method (RCV-SIS) at $n = 100$ and $d = 50, 100, 500$. Comparing the root mean square errors of the two estimation methods, we find that the mean square error (MSE) of the RCV two-stage estimate is smaller in each dimension than that of the ordinary two-stage method. That is, the model estimated by the RCV method is more accurate. Table 5 also reveals another pattern. Conventionally, the estimation accuracy decreases as the dimension p increases, so the root mean square error becomes larger. However, the results in Table 5 show that this rule does not hold for the ordinary two-stage method. At the largest dimension ($d = 500$), the root mean square error is the smallest, with a value of 6.692, while at $d = 100$ the MSE is the largest, with a value of 8.273. In short, there is no monotone pattern: the mean square error does not grow with the dimension as it generally would.
In fact, this phenomenon is not difficult to explain. In the variable selection phase, the SIS method selects the variables whose correlation coefficients rank in the top twenty (in descending order). Since the fixed value 20 is small relative to the number of covariates, the probability of selecting all important variables is relatively low. The RCV-SIS column of Table 5 shows that the SIS method is much more stable when combined with RCV. At d = 50, the estimated MSE is the smallest, with a value of 4.838. In all three dimensions, the mean square error of the RCV method is smaller than that of the ordinary two-stage method.
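The contrast between the naive and the RCV two-stage variance estimators can be sketched for an ordinary linear model. The Python code below is a simplified illustration, not the authors' VicmRCV or vicmest functions. It assumes the RCV scheme of Fan et al. (2012), commonly called refitted cross-validation: split the sample in two halves, select variables by SIS on one half, estimate the error variance by least squares on the other half, swap the roles, and average. The toy design is a null model in which the response is pure noise with variance 1; this design, all function names, and the values of n, p, k and the number of replications are assumptions made for illustration.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def sis(X, y, k):
    """Indices of the k columns of X with the largest |correlation| with y."""
    p = len(X[0])
    score = [abs_corr([row[j] for row in X], y) for j in range(p)]
    return sorted(range(p), key=lambda j: -score[j])[:k]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def ols_variance(X, y, cols):
    """RSS / (n - |cols|) from least squares of y on the chosen columns."""
    n, s = len(y), len(cols)
    Z = [[row[j] for j in cols] for row in X]
    gram = [[sum(z[a] * z[b] for z in Z) for b in range(s)] for a in range(s)]
    rhs = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(s)]
    coef = solve(gram, rhs)
    rss = sum((yi - sum(c * v for c, v in zip(coef, z))) ** 2
              for z, yi in zip(Z, y))
    return rss / (n - s)


def naive_variance(X, y, k):
    """Naive two-stage estimate: select and refit on the same sample."""
    return ols_variance(X, y, sis(X, y, k))


def rcv_variance(X, y, k):
    """RCV estimate: select on one half, refit on the other, swap, average."""
    h = len(y) // 2
    X1, y1, X2, y2 = X[:h], y[:h], X[h:], y[h:]
    v1 = ols_variance(X2, y2, sis(X1, y1, k))
    v2 = ols_variance(X1, y1, sis(X2, y2, k))
    return (v1 + v2) / 2


rng = random.Random(1)
n, p, k, reps = 100, 200, 10, 20
naive, rcv = [], []
for _ in range(reps):
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]  # null model, true variance is 1
    naive.append(naive_variance(X, y, k))
    rcv.append(rcv_variance(X, y, k))
mean_naive = sum(naive) / reps
mean_rcv = sum(rcv) / reps
print(f"naive: {mean_naive:.3f}  RCV: {mean_rcv:.3f}  (truth: 1.0)")
```

In this toy setting the naive estimator reuses the data that produced the spurious correlations, so it tends to understate the error variance, whereas the RCV estimator refits on data that played no part in the selection.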

Real Data Analysis
In this section, we use data collected by the Mayo Clinic. These data were obtained from a trial in primary biliary cirrhosis (PBC) conducted by the Mayo Clinic from 1974 to 1984, and are available in the R survival package. The data set covers 424 PBC patients who were referred to the Mayo Clinic during this decade and met the eligibility criteria for the randomized placebo-controlled trial.
In the data set, the first 312 patients participated in the randomized trial, while the other 112 patients did not participate in the clinical trial but agreed to have basic measurements recorded and to follow the medical recommendations. Six of the latter were lost to follow-up shortly after diagnosis. Thus there are 106 such cases in addition to the 312 randomized participants. We preprocessed the data set in R. These 15 variables are standardized (Z-score standardization) before the specific calculation.
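Z-score standardization subtracts the sample mean from each variable and divides by its sample standard deviation. A minimal Python sketch follows; the function name `z_score` and the input values are made up for illustration and are not taken from the PBC data. The sample standard deviation with divisor n - 1 is assumed here, which matches the default behaviour of R's `scale`.

```python
import math


def z_score(values):
    """Standardize a list of numbers to mean 0 and sample variance 1."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation with divisor n - 1 (an assumption).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


# Hypothetical measurements of one covariate (not real PBC values).
raw = [1.4, 0.7, 14.5, 1.1, 3.4]
z = z_score(raw)
print([round(v, 3) for v in z])
```

Standardizing puts covariates measured in different units on a common scale, so that the correlations ranked by SIS and the fitted coefficients are comparable across variables.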
From previous studies, we know that variables such as serum bilirubin content (Z6), albumin content (Z8), urinary copper content (Z9), alkaline phosphatase content (Z10) and prothrombin time (Z14) have a strong relationship with the response variable Y. We first use the SIS method to select the 8 covariates with the largest correlations, and then use the simple two-stage method and the re-adjusted cross-validation (RCV) two-stage method to estimate the coefficient β of the covariate Z and the model variance. The results are shown in Table 7.
As can be seen from Table 7, the RCV-SIS two-stage method is significantly better than the N-SIS simple two-stage method. In the high-dimensional case, the re-adjusted cross-validation method (RCV) performs better for the varying index coefficient model: its root mean square error and estimated variance are smaller than those of the simple two-stage estimate. Therefore, the RCV-SIS two-stage method is more accurate in predicting the survival time of patients, and can provide more reasonable guidance and advice for follow-up medical treatment.

Discussion
In this paper, we study a new class of semiparametric regression models: varying index coefficient models. The unknown coefficient β is estimated by the profile spline modal estimator (PSME) method, while the unknown nonparametric functions are expanded in B-splines. After studying the asymptotic properties of the coefficients, we estimate β using an iterative method. The simulations show that the estimate of β has a small bias and that the B-spline estimate of the unknown functions fits well. Finally, in the high-dimensional setting, we carried out a two-stage RCV estimation of the varying index coefficient model and found that the variance and mean square error estimated by the RCV method are better than those of the simple two-stage method. In the empirical analysis, we originally intended to model the PBC data with a survival model (a semi-parametric varying coefficient additive risk model). However, the literature indicates that the gender and status variables are closely related to the survival time of patients with primary biliary cirrhosis, and the variable Z has a certain relationship with the three variables X (status, gender and age). Therefore, we used the varying index coefficient model for the PBC data, and found that the variance and mean square error of the RCV method are better than those of the simple two-stage method.
Further research on the proposed method is needed. First, the asymptotic properties of the proposed method need further investigation. Second, this paper only estimates the variance and mean square error of the varying index coefficient model, and does not study the estimation of the coefficient β in the parametric part or of the nonparametric functions. Therefore, more robust estimation methods can be studied in the future. In addition, we can focus more on the asymptotic properties of the nonparametric part of the varying index coefficient model.