Estimation of Nonparametric Regression Models with Measurement Error Using Validation Data

We consider the problem of estimating the regression function g in a nonparametric regression model when some of the covariates are measured with error, with the assistance of validation data. Without specifying any error-model structure between the surrogate and the true covariates, we propose an estimator that combines orthogonal series estimation with a truncated series approximation method. Under general regularity conditions, we derive the convergence rate of this estimator. Simulations demonstrate the finite-sample properties of the new estimator.


Introduction
Consider the following nonparametric regression model of a scalar response $Y$ on the covariates $(X, Z)$:

$$Y = g(X, Z) + \varepsilon, \qquad (1)$$

where $g(\cdot)$ is an unknown function and $\varepsilon$ is a noise variable satisfying $E(\varepsilon \mid X, Z) = 0$ (see [1]).
The relationship between the true variable and the surrogate variable can be rather complicated. Misspecification of this relationship may lead to a serious misinterpretation of the data. A common solution is to use the help of validation data to infer the missing information. To be specific, one observes independent replicates $(W_i, Z_i, Y_i)$, $1 \le i \le N$, rather than $(X_i, Z_i, Y_i)$, where the relationship between $W_i$ and $X_i$ may or may not be specified. If not, the missing information for the statistical inference is taken from a sample $(X_j, W_j, Z_j)$, $N+1 \le j \le N+n$, of so-called validation data, independent of the primary (surrogate) sample. We aim at estimating the unknown function $g(\cdot)$ by using the surrogate data $(W_i, Z_i, Y_i)$, $1 \le i \le N$, and the validation data $(X_j, W_j, Z_j)$, $N+1 \le j \le N+n$.

Recently, statistical inference based on surrogate data and a validation sample has attracted considerable attention (see [2]-[13]), and the above referenced authors developed suitable methods for different models. However, these works are mostly concerned with parametric or semiparametric relationships between covariates and responses, and their approaches are difficult to generalize to the nonparametric regression model. [14] and [15] proposed two nonparametric estimators for nonparametric regression models with measurement error using validation data, but their methods are not applicable to our problem: [14] assumes the response rather than the covariable is measured with error, and the method proposed by [15] applies to a one-dimensional explanatory variable only.
This article is organized as follows. In Section 2 we propose a regularization-based method. Under general regularity conditions, we give the convergence rate of our estimator in Section 3. Section 4 provides some numerical results from simulation studies, and proofs of the theorems are presented in the Appendix.

Description of the Estimator
Recall model (1) and the assumptions below it. We assume that $X$, $W$ and $Z$ are all real-valued random variables. The extension to random vectors complicates the notation but does not affect the main ideas and results. Without loss of generality, let the supports of $X$, $W$ and $Z$ all be contained in $[0,1]$ (otherwise, one can carry out monotone transformations of $X$, $W$ and $Z$).
Let $f_{XWZ}$ and $f_{WZ}$ denote, respectively, the joint density of $(X, W, Z)$ and the marginal density of $(W, Z)$. Then, according to (2), we have

$$r_z(w) := E(Y \mid W = w, Z = z)\, f_{WZ}(w, z) = \int g(x, z)\, f_{XWZ}(x, w, z)\, \mathrm{d}x =: (T_z g)(w), \qquad (4)$$

where $T_z$ denotes the integral operator acting on $g(\cdot, z)$ for each fixed $z$. According to Equation (4), the function $g$ is the solution of a Fredholm integral equation of the first kind, and this inverse problem is known to be ill-posed and needs a regularization method. A variety of regularization schemes are available in the literature (see e.g. [16]), but we focus in this paper on the Tikhonov regularized solution

$$g_\alpha(\cdot, z) = \arg\min_{\phi \in L^2[0,1]} \left\{ \| T_z \phi - r_z \|^2 + \alpha \| \phi \|^2 \right\}, \qquad (5)$$

where $\alpha > 0$ in the penalization term $\alpha \| \phi \|^2$ is the regularization parameter.
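To make the role of the regularization parameter concrete, the following sketch discretizes a Fredholm equation of the first kind on a grid and compares the naive least-squares solution with the Tikhonov solution; it is purely illustrative, and the kernel, the true function and the grid size are made-up placeholders rather than quantities from this paper.

```python
import numpy as np

# Illustrative sketch: Tikhonov regularization for a discretized
# Fredholm integral equation of the first kind, (T g)(w) = r(w).
m = 100                                  # grid size on [0, 1]
x = np.linspace(0.0, 1.0, m)
w = np.linspace(0.0, 1.0, m)
dx = x[1] - x[0]

f = np.exp(-30.0 * (w[:, None] - x[None, :]) ** 2)   # stand-in for f_XWZ(., ., z) at a fixed z
T = f * dx                                           # discretized integral operator
g_true = np.sin(2.0 * np.pi * x)                     # stand-in for g(., z)
rng = np.random.default_rng(0)
r = T @ g_true + 1e-3 * rng.normal(size=m)           # noisy right-hand side

# Naive least-squares solution: unstable because the problem is ill-posed.
g_naive = np.linalg.lstsq(T, r, rcond=None)[0]

# Tikhonov solution: g_alpha = (alpha I + T' T)^{-1} T' r.
alpha = 1e-4
g_tik = np.linalg.solve(alpha * np.eye(m) + T.T @ T, T.T @ r)

print("naive L2 error:   ", np.linalg.norm(g_naive - g_true) * np.sqrt(dx))
print("Tikhonov L2 error:", np.linalg.norm(g_tik - g_true) * np.sqrt(dx))
```

Increasing $\alpha$ stabilizes the solution at the price of additional bias; balancing these two effects is precisely the role of the rate analysis in Section 3.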
We define the adjoint operator $T_z^{*}$ of $T_z$ with respect to the $L^2[0,1]$ inner product. Then the regularized solution (5) is equivalently

$$g_\alpha(\cdot, z) = (\alpha I + T_z^{*} T_z)^{-1} T_z^{*} r_z.$$

To obtain the estimator of $g(x, z)$, we combine the orthogonal series method and the kernel method. Under the regularity conditions of Section 3, for each fixed $z$ the joint density $f_{XWZ}(\cdot, \cdot, z)$ may be approximated by a truncated orthogonal series

$$f_{XWZ}(x, w, z) \approx \sum_{k=1}^{K} \sum_{l=1}^{K} c_{kl}(z)\, \varphi_k(x)\, \varphi_l(w),$$

where $\{\varphi_k\}$ is an orthonormal basis of $L^2[0,1]$, which may be trigonometric, polynomial, spline, wavelet, and so on. A discussion of different bases and their properties can be found in the literature (see e.g. [17], [18]). To be specific, here and in what follows we use the normalized Legendre polynomials on $[0,1]$, which can be obtained through the Rodrigues formula

$$\varphi_k(x) = \frac{\sqrt{2k+1}}{k!}\, \frac{\mathrm{d}^k}{\mathrm{d}x^k} \bigl(x^2 - x\bigr)^k, \qquad k = 0, 1, 2, \ldots$$

The integer $K$ is a truncation point, which is the main smoothing parameter in the approximating series. Let $h_N$ and $h_n$ denote the bandwidths used for the primary and the validation sample, respectively. We then consider the corresponding estimators $\hat f_{XWZ}(x, w, z)$, $\hat f_{WZ}(w, z)$ and $\hat r_z(w)$, obtained by combining the truncated series in $(x, w)$ with kernel smoothing in $z$.
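Since the Rodrigues formula makes the basis easy to construct, the following short check (purely illustrative, not part of the estimation procedure) builds the first few normalized shifted Legendre polynomials and verifies their orthonormality on $[0,1]$ numerically.

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial import polynomial as P

def legendre01(k):
    """Coefficients (low order first) of the normalized shifted Legendre
    polynomial phi_k on [0, 1], built from the Rodrigues formula."""
    c = np.array([1.0])
    for _ in range(k):                      # build (x^2 - x)^k
        c = P.polymul(c, np.array([0.0, -1.0, 1.0]))
    if k > 0:
        c = P.polyder(c, k)                 # k-th derivative
    return sqrt(2 * k + 1) / factorial(k) * c

# Numerical check of orthonormality on [0, 1].
xs = np.linspace(0.0, 1.0, 2001)
for j in range(3):
    for k in range(3):
        prod = P.polyval(xs, legendre01(j)) * P.polyval(xs, legendre01(k))
        print(j, k, round(float(np.trapz(prod, xs)), 3))
```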
The operators $T_z$ and $T_z^{*}$ can then be estimated by

$$(\hat T_z \phi)(w) = \int \phi(x)\, \hat f_{XWZ}(x, w, z)\, \mathrm{d}x \quad \text{and} \quad (\hat T_z^{*} \psi)(x) = \int \psi(w)\, \hat f_{XWZ}(x, w, z)\, \mathrm{d}w,$$

and plugging $\hat T_z$, $\hat T_z^{*}$ and $\hat r_z$ into the regularized solution above yields the estimator $\hat g_\alpha(x, z)$.
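To see how the pieces fit together, here is a schematic plug-in estimator under simplifying assumptions: the operator is represented by a $K \times K$ coefficient matrix estimated from the validation sample with a kernel weight in $z$, the coefficients of $\hat r_z$ come from the primary sample, and the Tikhonov solution is computed in coefficient space. The particular weighting, kernel and parameter values are illustrative choices and need not coincide with the estimator defined above.

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial import polynomial as P

def phi(k, x):
    """Normalized shifted Legendre polynomial of degree k evaluated at x."""
    c = np.array([1.0])
    for _ in range(k):
        c = P.polymul(c, np.array([0.0, -1.0, 1.0]))
    if k > 0:
        c = P.polyder(c, k)
    return sqrt(2 * k + 1) / factorial(k) * P.polyval(x, c)

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def g_hat(x_eval, z0, W, Z, Y, Xv, Wv, Zv, K=5, h_N=0.1, h_n=0.1, alpha=1e-3):
    """Schematic estimate of g(., z0); (W, Z, Y) is the primary sample and
    (Xv, Wv, Zv) the validation sample (all numpy arrays)."""
    # Coefficients of r_z0 in the basis, from the primary sample:
    # d_k = (1 / (N h_N)) * sum_i Y_i * phi_k(W_i) * Kern((Z_i - z0) / h_N)
    wp = gauss((Z - z0) / h_N) / (len(Y) * h_N)
    d = np.array([np.sum(Y * phi(k, W) * wp) for k in range(K)])
    # Matrix of the operator T_z0 in the basis, from the validation sample:
    # C_kl = (1 / (n h_n)) * sum_j phi_k(W_j) * phi_l(X_j) * Kern((Z_j - z0) / h_n)
    wv = gauss((Zv - z0) / h_n) / (len(Xv) * h_n)
    C = np.array([[np.sum(phi(k, Wv) * phi(l, Xv) * wv) for l in range(K)]
                  for k in range(K)])
    # Tikhonov solution in coefficient space: b = (alpha I + C'C)^{-1} C' d.
    b = np.linalg.solve(alpha * np.eye(K) + C.T @ C, C.T @ d)
    return sum(b[l] * phi(l, x_eval) for l in range(K))
```

In practice $K$, $h_N$, $h_n$ and $\alpha$ would be chosen data-dependently, as in the cross-validation procedure of Section 4.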

Theoretical Properties
In this section, we introduce the assumptions that will be used below to study the statistical properties of the estimator. We shall consider the following assumptions:

(A1) (i) The support of $(X, W, Z)$ is contained in $[0,1]^3$, and the joint density $f_{XWZ}(\cdot, \cdot, z)$ is square integrable for each $z$.

Assumption (A1) is a sufficient condition for $T_z$ to be a Hilbert-Schmidt operator and therefore to be compact (see [19]). Note that the regularization bias is $g_\alpha - g$, where $g_\alpha(\cdot, z) = (\alpha I + T_z^{*} T_z)^{-1} T_z^{*} T_z\, g(\cdot, z)$.
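The link between Assumption (A1) and compactness is the classical fact that an integral operator with a square-integrable kernel is Hilbert-Schmidt; as a sketch of that standard argument (not the paper's proof), in the present notation

$$\|T_z\|_{HS}^2 = \int_0^1 \!\! \int_0^1 f_{XWZ}^2(x, w, z)\, \mathrm{d}x\, \mathrm{d}w < \infty,$$

and every Hilbert-Schmidt operator on $L^2[0,1]$ is compact.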
In order to control the speed of convergence to zero of the regularization bias $g_\alpha - g$, we introduce a regularity space $\Psi_\beta$ for $\beta > 0$, defined in terms of the spectrum of $T_z^{*} T_z$. We then obtain the following result.
Therefore, as the regularization parameter $\alpha$ is pushed towards zero, the smoother the function $g$ of interest is (i.e., $g \in \Psi_\beta$ for larger $\beta$), the faster the regularization bias converges to zero.
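For orientation, a standard bound of this type for Tikhonov regularization (see, e.g., [16]) reads as follows; the precise definition of $\Psi_\beta$ used here may differ, so this is only an illustrative statement. If $\{\lambda_k, \varphi_k\}$ denotes the singular system of $T_z$ and

$$g \in \Psi_\beta, \qquad \Psi_\beta = \Bigl\{ \phi \in L^2[0,1] : \sum_{k \ge 1} \frac{\langle \phi, \varphi_k \rangle^2}{\lambda_k^{2\beta}} < \infty \Bigr\},$$

then

$$\| g_\alpha - g \|^2 = O\bigl( \alpha^{\min(\beta, 2)} \bigr),$$

the exponent being capped at $2$ because Tikhonov regularization has qualification two.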


The proofs of all the results are reported in the Appendix.

Simulation Studies
In this section, we briefly illustrate the finite-sample performance of the estimator discussed above. We compare our estimator to the standard Nadaraya-Watson estimator (denoted $\hat g_{NW}$) based on the primary dataset with the true covariates, $(X_i, Z_i, Y_i)$, $1 \le i \le N$. In fact, $\hat g_{NW}$ is a gold standard in the simulation study, even if it is practically unachievable due to measurement errors. Moreover, the performance of an estimator $g_{est}$ is assessed by the square root of average squared errors (RASE), computed for $\hat g$ and $\hat g_{NW}$, respectively, as

$$\mathrm{RASE}(g_{est}) = \left\{ \frac{1}{n_{grid}} \sum_{k=1}^{n_{grid}} \bigl[ g_{est}(x_k, z_k) - g(x_k, z_k) \bigr]^2 \right\}^{1/2},$$

where $\{(x_k, z_k)\}_{k=1}^{n_{grid}}$ are evaluation points in the support of $(X, Z)$. We generate 500 datasets for each sample-size combination $(N, n)$. Throughout, we used the normalized Legendre polynomials as basis and the standard normal kernel; for $\hat g_{NW}$, the bandwidth was selected by the generalized cross-validation (GCV) approach. For our estimator $\hat g$, we used the cross-validation approach to choose the four parameters $h_N$, $h_n$, $K$ and $\alpha$. For this purpose, $h_N$, $h_n$ and $(K, \alpha)$ are selected separately, as follows.
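As a concrete reference for the benchmark and the error criterion, the following sketch implements a bivariate Nadaraya-Watson estimator with a product standard-normal kernel and the RASE criterion on an evaluation grid; the bandwidths and the grid are placeholders, not the values used in the study.

```python
import numpy as np

def nw(x0, z0, X, Z, Y, hx, hz):
    """Bivariate Nadaraya-Watson estimate at (x0, z0) with a product normal kernel."""
    k = np.exp(-0.5 * ((X - x0) / hx) ** 2) * np.exp(-0.5 * ((Z - z0) / hz) ** 2)
    return np.sum(k * Y) / np.sum(k)

def rase(est_vals, true_vals):
    """Square root of the average squared error over the evaluation grid."""
    est_vals, true_vals = np.asarray(est_vals), np.asarray(true_vals)
    return float(np.sqrt(np.mean((est_vals - true_vals) ** 2)))
```

In the simulations the benchmark is evaluated with the true $X_i$, which is why it is unattainable when only the surrogate $W_i$ is observed.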

Define the leave-one-out versions of the estimators as follows. Here, we adopt the cross-validation (CV) approach to estimate $h_n$ by minimizing a leave-one-out criterion, where the subscript $-j$ denotes the estimator constructed without using the $j$-th observation. Similarly, we get $\hat h_N$. After obtaining $\hat h_N$ and $\hat h_n$, we then select $(K, \alpha)$ by minimizing an analogous leave-one-out criterion, where the subscript $-i$ denotes the estimator constructed without using the $i$-th observation.
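A generic leave-one-out bandwidth search of the kind described above might look like the sketch below; the prediction target (a one-dimensional Nadaraya-Watson smoother) and the candidate grid are assumptions, since the exact CV criteria are not reproduced here.

```python
import numpy as np

def loo_cv_bandwidth(X, Y, candidates):
    """Return the bandwidth minimizing the leave-one-out prediction error of a
    Nadaraya-Watson smoother of Y on X (illustrative criterion only)."""
    best_h, best_err = None, np.inf
    for h in candidates:
        err = 0.0
        for j in range(len(Y)):
            k = np.exp(-0.5 * ((X - X[j]) / h) ** 2)
            k[j] = 0.0                      # leave out the j-th observation
            err += (Y[j] - np.sum(k * Y) / np.sum(k)) ** 2
        if err < best_err:
            best_h, best_err = h, err
    return best_h

# Example usage: h_hat = loo_cv_bandwidth(Z, Y, np.linspace(0.05, 0.5, 10))
```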
Appendix

In the proofs we repeatedly use the fact that $f_{XWZ}$ is uniformly bounded on $[0,1]^2$. We conclude that the claimed bound holds. By the triangle inequality and Jensen's inequality, we can bound the remaining term, and under Assumption (A2)(i) we can show that it is of the required order (see Lemma A1 of [20]). By construction of the estimator, we have the stated identity, where the last equality is due to (10). The desired result follows immediately.

Proof of Theorem 3.
(i) The $r$-th order partial or mixed partial derivatives of $f_{XWZ}$ with respect to $(x, w)$, and the $r$-th order partial derivative of $f_{XWZ}$ with respect to $z$, are both continuous on $[0,1]^3$, with $c$ denoting some finite constant.

The basis functions $\{\varphi_k\}$ are orthonormal and complete in $L^2([0,1])$. The task that remains is to establish the order of the remaining term. Similar to the proof of Lemma 6.1, under Assumptions (A2)(ii), (A3) and (A4), it is easy to show that the required bound holds. It is not uncommon that $Z$ is measured exactly but $X$ is measured with error, so that instead of $X$ only its surrogate variable $W$ is observed; this is always satisfied if, for example, $W$ is a function of $X$ and some independent error.