Estimation of Nonparametric Multiple Regression Measurement Error Models with Validation Data
Received 3 November 2015; accepted 27 December 2015; published 30 December 2015
1. Introduction
We consider the following nonparametric regression model of a scalar response $Y$ on an explanatory variable $X$:

$$Y = g(X) + \varepsilon, \qquad (1)$$

where $g(\cdot)$ is assumed to be a smooth, continuous but unknown nonparametric regression function and $\varepsilon$ is a noise variable with $E(\varepsilon \mid X) = 0$ and $E(\varepsilon^2) < \infty$. It is not uncommon that the explanatory variable $X$ is measured with error and instead only its surrogate variable $W$ can be observed. In this case, one observes independent replicates $(Y_i, W_i)$, $i = 1, \ldots, N$, of $(Y, W)$ rather than $(Y, X)$, where the relationship between $X$ and $W$ may or may not be specified. If not, the missing information for statistical inference is taken from a sample $(X_j, W_j)$, $j = 1, \ldots, n$, of so-called validation data independent of the primary (surrogate) sample. The objective of this manuscript is to estimate the unknown function $g$ via the surrogate data $\{(Y_i, W_i)\}_{i=1}^{N}$ and the validation data $\{(X_j, W_j)\}_{j=1}^{n}$.
A number of problems of similar type have attracted considerable attention in the research literature over the past two decades (see [1]-[6]). For instance, a quasi-likelihood method is intensively studied by [7]. A regression calibration approach is developed by [8] [9], and [10] [11] propose a method based on simulation-extrapolation (SIMEX) estimation. Other related methods include Bayesian approaches (see [12]), semi-parametric methods (see [13] [14]), the empirical likelihood method (see [15]) and the instrumental variable method (see [16]). Unfortunately, all these works mostly assume some parametric relationship between covariates and responses. Recently, nonparametric estimators of $g$ have been developed by [17] and [18]. [17] develops a kernel-based approach for nonparametric regression function estimation with surrogate data and validation sampling. However, his method is not applicable to model (1) since it assumes that the response, but not the covariate, is measured with error. [18] proposes a nonparametric estimator which integrates local linear regression and a Fourier transformation method when both the explanatory and surrogate variables are scalars. Nonetheless, their method cannot be extended to multidimensional problems in which the explanatory variable vector may consist of variables measured both with and without errors. For additional references and relevant topics on nonparametric regression models with measurement errors, one may consult [19] and the references therein.
In practice, nonparametric estimation of $g$ may not be an easy task since, as explained in Section 2, the relation that identifies $g$ is a Fredholm equation of the first kind, i.e.

$$\int g(x)\, f_{X,W}(x, w)\, dx = m(w), \qquad (2)$$

where $f_{X,W}$ denotes the joint density of $(X, W)$ and $m(w) = E(Y \mid W = w)\, f_W(w)$,
which may lead to an ill-posed inverse problem. Ill-posed inverse problems related to nonparametric regression models have received considerable attention recently: [20] [21] consider kernel-based estimators while [22] and [23] develop series or sieve estimators. However, their methods require an instrumental variable and assume that the explanatory variable $X$ is directly observable without errors. In this article, we propose a nonparametric estimation approach which consists of two major steps. First, we propose estimators of the generalized Fourier coefficients of $T$ and $m$ based on the surrogate and validation data. Second, we replace the infinite-dimensional operator $T$ by a finite-dimensional approximation to avoid higher-order coefficient estimation, and hence develop an estimator of $g$. Furthermore, we extend this method to the case where only some of the covariates are measured with errors. Under mild conditions, the consistency of the resulting estimators is established and the convergence rates are also derived.
This article is arranged as follows. In Section 2, we first describe our estimation approach for the case where the covariates are all measured with errors; the extension to the case where only some of the covariates are measured with errors is discussed as well. We derive the convergence rates of our estimators under some regularity conditions in Section 3. Section 4 presents numerical results from simulation studies. A brief discussion is given in Section 5. Proofs of the theorems are presented in the Appendix.
2. Methodology
We first describe our estimation approach for the case where the covariates are all measured with errors. In addition to the independent and identically distributed (i.i.d.) primary observations $\{(Y_i, W_i)\}_{i=1}^{N}$ from model (1), assume that i.i.d. validation data $\{(X_j, W_j)\}_{j=1}^{n}$ are also available. We shall suppose that $X$ and $W$ are both $d$-dimensional random vectors. Without loss of generality, let the supports of $X$ and $W$ both be contained in $[0,1]^d$ (otherwise, one can carry out monotone transformations of $X$ and $W$).
In the following we let $f_{X,W}$, $f_X$, $f_W$ denote respectively the joint density of $(X, W)$ and the marginal densities of $X$ and $W$. Then, provided $E(\varepsilon \mid W) = 0$, we have

$$E(Y \mid W = w)\, f_W(w) = E\{g(X) \mid W = w\}\, f_W(w) = \int g(x)\, f_{X,W}(x, w)\, dx. \qquad (3)$$
According to Equation (3), $g$ is actually the solution to an integral equation called a Fredholm equation of the first kind. Let

$$m(w) = E(Y \mid W = w)\, f_W(w)$$

and

$$t(x, w) = f_{X,W}(x, w).$$

Define the operator $T: L^2[0,1]^d \to L^2[0,1]^d$ by

$$(Th)(w) = \int h(x)\, t(x, w)\, dx.$$

Hence, Equation (3) is equivalent to the operator equation

$$Tg = m. \qquad (4)$$
For the unknown smooth function $g$, we assume that $g \in \mathcal{H}_c$, where

$$\mathcal{H}_c = \big\{ h \in W_2^s[0,1]^d : \|h\|_{s,2} \le c \big\},$$

where $c$ is a positive and finite constant. $W_2^s[0,1]^d$ denotes the Sobolev space of smoothness $s$, that is,

$$W_2^s[0,1]^d = \big\{ h : D^{\lambda} h \in L^2[0,1]^d \text{ for all } |\lambda| \le s \big\},$$

where $\lambda = (\lambda_1, \ldots, \lambda_d)$ is a vector of nonnegative integers, $|\lambda| = \lambda_1 + \cdots + \lambda_d$, and the derivatives $D^{\lambda} h = \partial^{|\lambda|} h / (\partial x_1^{\lambda_1} \cdots \partial x_d^{\lambda_d})$ are understood in the weak sense. Given an integer $s$, the norm $\|\cdot\|_{s,2}$ is

$$\|h\|_{s,2} = \Big( \sum_{|\lambda| \le s} \|D^{\lambda} h\|^2 \Big)^{1/2};$$

here $\|\cdot\|$ denotes the norm on $L^2[0,1]^d$.
An estimator of $g$ can then be obtained by replacing $T$ and $m$ by their series estimators based on the surrogate data and validation data, and solving the resultant empirical version of (4). As before, let $\{\phi_k\}_{k=1}^{\infty}$ denote a complete, orthonormal sequence for $L^2[0,1]^d$. Hence, we can write

$$m = \sum_{k=1}^{\infty} m_k \phi_k, \qquad T\phi_l = \sum_{k=1}^{\infty} a_{kl} \phi_k,$$

where $m_k = \langle m, \phi_k \rangle$ and $a_{kl} = \langle T\phi_l, \phi_k \rangle$ represent the generalized Fourier coefficients of $m$ and $T$, respectively. Noting that $m_k = E\{Y \phi_k(W)\}$ and $a_{kl} = E\{\phi_l(X) \phi_k(W)\}$, we can intuitively obtain the estimators of $m_k$, $k = 1, \ldots, q$, and $a_{kl}$, $k, l = 1, \ldots, q$, by

$$\hat{m}_k = \frac{1}{N} \sum_{i=1}^{N} Y_i\, \phi_k(W_i)$$

and

$$\hat{a}_{kl} = \frac{1}{n} \sum_{j=1}^{n} \phi_l(X_j)\, \phi_k(W_j),$$

respectively, where the integer $q$ is a truncation point which is the main smoothing parameter in the approximating Fourier series. The operator $T$ can then be consistently estimated by the operator $\hat{T}$ with kernel

$$\hat{t}(x, w) = \sum_{k=1}^{q} \sum_{l=1}^{q} \hat{a}_{kl}\, \phi_l(x) \phi_k(w).$$

Define the subset $\mathcal{H}_q$ of $\mathcal{H}_c$:

$$\mathcal{H}_q = \Big\{ h = \sum_{k=1}^{q} b_k \phi_k : \|h\|_{s,2} \le c \Big\}.$$

Writing $\hat{m} = \sum_{k=1}^{q} \hat{m}_k \phi_k$, the estimator of $g$ can be computed by

$$\hat{g}_q = \arg\min_{h \in \mathcal{H}_q} \big\| \hat{T} h - \hat{m} \big\|. \qquad (5)$$
Remark 1. Let $\Phi_0$ be the $N \times q$ matrix whose $(i, k)$ element is $\phi_k(W_i)$ and $\mathbf{Y} = (Y_1, \ldots, Y_N)^{\top}$ be the $N \times 1$ observed vector of $Y$ based on the surrogate data $\{(Y_i, W_i)\}_{i=1}^{N}$. Let $\Phi_1$ and $\Phi_2$, respectively, denote the $n \times q$ matrices whose $(j, k)$ elements are $\phi_k(X_j)$ and $\phi_k(W_j)$ based on the validation data. If the $q \times q$ matrix $\hat{A} = \Phi_2^{\top} \Phi_1 / n$, whose $(k, l)$ element is $\hat{a}_{kl}$, is nonsingular and the norm constraint defining $\mathcal{H}_q$ is not binding, then the solution to (5) assumes the following form:

$$\hat{g}_q(x) = \sum_{k=1}^{q} \hat{b}_k \phi_k(x), \qquad (6)$$

where $\hat{b} = (\hat{b}_1, \ldots, \hat{b}_q)^{\top}$ is given by $\hat{b} = \hat{A}^{-1} \Phi_0^{\top} \mathbf{Y} / N$.
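To make Remark 1 concrete, the following minimal Python sketch implements the closed form (6) for scalar data rescaled to $[0,1]$, using the cosine basis adopted later in Section 4. The function names (`phi`, `fit_g`) and the toy data-generating step at the end are illustrative assumptions, not part of the paper.

```python
import numpy as np

def phi(x, q):
    """q x len(x) matrix of the cosine basis: phi_1(x) = 1,
    phi_k(x) = sqrt(2) * cos((k - 1) * pi * x) for k >= 2."""
    k = np.arange(q)[:, None]                      # k - 1 = 0, 1, ..., q - 1
    return np.where(k == 0, 1.0, np.sqrt(2.0) * np.cos(k * np.pi * x[None, :]))

def fit_g(Y, W, Xv, Wv, q):
    """Series estimator (6): (Y, W) is the primary (surrogate) sample,
    (Xv, Wv) the validation sample; returns a callable estimate of g."""
    m_hat = phi(W, q) @ Y / len(Y)                 # hat m_k = mean of Y_i phi_k(W_i)
    A_hat = phi(Wv, q) @ phi(Xv, q).T / len(Xv)    # hat a_kl = mean of phi_k(W_j) phi_l(X_j)
    b_hat = np.linalg.solve(A_hat, m_hat)          # hat b = A^{-1} Phi_0' Y / N
    return lambda x: b_hat @ phi(np.atleast_1d(np.asarray(x, dtype=float)), q)

# Toy usage with an illustrative g(x) = sin(pi x) and additive normal error:
rng = np.random.default_rng(0)
X = rng.uniform(size=500)
W = np.clip(X + 0.1 * rng.normal(size=500), 0.0, 1.0)
Y = np.sin(np.pi * X) + 0.1 * rng.normal(size=500)
Xv = rng.uniform(size=200)
Wv = np.clip(Xv + 0.1 * rng.normal(size=200), 0.0, 1.0)
g_hat = fit_g(Y, W, Xv, Wv, q=5)
print(g_hat([0.25, 0.50, 0.75]))
```

Note that only the basis evaluated at $W_j$ and $X_j$ from the validation sample enters $\hat{A}$, so no error model linking $W$ to $X$ is ever specified.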
Next, we extend the estimator in (5) to nonparametric regression measurement error models with multiple covariates, that is,

$$Y = g(X, Z) + \varepsilon, \qquad (7)$$

where $X$ is measured with error with $W$ being its observed surrogate variable, and $Z$ is measured without error. Let $\{(Y_i, W_i, Z_i)\}_{i=1}^{N}$ be a random sample from model (7), and let $\{(X_j, W_j, Z_j)\}_{j=1}^{n}$ be i.i.d. validation observations. We assume that $X$ and $W$ are supported on $[0,1]^d$, and $Z$ is supported on $[0,1]^{d_1}$.

Let $f_{X,W|Z}$, $f_{X|Z}$ and $f_{W|Z}$ denote respectively the joint density of $(X, W)$ and the marginal densities of $X$ and $W$, all conditioning on $Z = z$. Similar to (3), for any $z \in [0,1]^{d_1}$, we have

$$m(w, z) = \int g(x, z)\, t_z(x, w)\, dx, \qquad (8)$$

where $m(w, z) = E(Y \mid W = w, Z = z)\, f_{W|Z}(w \mid z)$ and $t_z(x, w) = f_{X,W|Z}(x, w \mid z)$, and the operator $T_z$ is defined by

$$(T_z h)(w) = \int h(x)\, t_z(x, w)\, dx,$$

where $h$ is any function in $L^2[0,1]^d$.
To obtain the estimator of $g(\cdot, z)$, we set $K_h(u) = K(u/h)$, where $K$ is a kernel function and $h > 0$ is a bandwidth. Let $h_1$ and $h_2$ denote the bandwidths used for the primary and validation samples, respectively. We consider the following estimators:

$$\hat{m}_k(z) = \frac{ \sum_{i=1}^{N} Y_i\, \phi_k(W_i)\, K_{h_1}(Z_i - z) }{ \sum_{i=1}^{N} K_{h_1}(Z_i - z) }$$

and

$$\hat{a}_{kl}(z) = \frac{ \sum_{j=1}^{n} \phi_l(X_j)\, \phi_k(W_j)\, K_{h_2}(Z_j - z) }{ \sum_{j=1}^{n} K_{h_2}(Z_j - z) }.$$

Then we have

$$\hat{m}(\cdot, z) = \sum_{k=1}^{q} \hat{m}_k(z)\, \phi_k.$$

Define the operator $\hat{T}_z$ by its kernel

$$\hat{t}_z(x, w) = \sum_{k=1}^{q} \sum_{l=1}^{q} \hat{a}_{kl}(z)\, \phi_l(x) \phi_k(w)$$

for any $z \in [0,1]^{d_1}$.

Then, for any $z \in [0,1]^{d_1}$, the estimator of $g(\cdot, z)$ is

$$\hat{g}_q(\cdot, z) = \arg\min_{h \in \mathcal{H}_q} \big\| \hat{T}_z h - \hat{m}(\cdot, z) \big\|. \qquad (9)$$
Remark 2. Denote $K_{1i}(z) = K_{h_1}(Z_i - z)$ and $K_{2j}(z) = K_{h_2}(Z_j - z)$. Let $\hat{m}(z) = (\hat{m}_1(z), \ldots, \hat{m}_q(z))^{\top}$ and let $\hat{A}(z)$ be the $q \times q$ matrix whose $(k, l)$ element is $\hat{a}_{kl}(z)$. If $\hat{A}(z)$ is nonsingular and the norm constraint defining $\mathcal{H}_q$ is not binding, then the solution to (9) has the following form:

$$\hat{g}_q(x, z) = \sum_{k=1}^{q} \hat{b}_k(z)\, \phi_k(x), \qquad (10)$$

where $\hat{b}(z) = (\hat{b}_1(z), \ldots, \hat{b}_q(z))^{\top}$ is given by $\hat{b}(z) = \hat{A}(z)^{-1} \hat{m}(z)$.

Remark 3. If $Z$ is discretely distributed with finite support, then $g(\cdot, z)$ can be estimated by (9) with the kernel weights $K_{1i}(z)$ and $K_{2j}(z)$ being replaced by $I(Z_i = z)$ and $I(Z_j = z)$, where $I(\cdot)$ is the indicator function.
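Building on the previous sketch, a local version of (10) for a scalar $Z$ can be obtained by kernel-weighting both samples. Here `epanechnikov` matches the kernel used in Section 4, while the interface (`fit_g_local`, the bandwidth arguments `h1`, `h2`) is again an illustrative assumption.

```python
def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def fit_g_local(Y, W, Z, Xv, Wv, Zv, q, h1, h2, z):
    """Local series estimator (10) at a fixed point z (scalar Z case),
    reusing phi() from the previous sketch."""
    k1 = epanechnikov((Z - z) / h1)                      # primary-sample weights
    k2 = epanechnikov((Zv - z) / h2)                     # validation-sample weights
    m_hat = phi(W, q) @ (k1 * Y) / k1.sum()              # hat m_k(z)
    A_hat = (phi(Wv, q) * k2) @ phi(Xv, q).T / k2.sum()  # hat a_kl(z)
    b_hat = np.linalg.solve(A_hat, m_hat)                # hat b(z) = A(z)^{-1} m(z)
    return lambda x: b_hat @ phi(np.atleast_1d(np.asarray(x, dtype=float)), q)
```

For a discrete $Z$ as in Remark 3, the two kernel weight vectors would simply be replaced by the indicators `Z == z` and `Zv == z`.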
3. Theoretical Properties
In this section, we study the asymptotic properties of the estimators proposed in Section 2. We define $\tau_q$ ($q \ge 1$) as a sieve measure of ill-posedness (see [23]):

$$\tau_q = \sup_{h \in \mathcal{H}_q,\ h \neq 0} \frac{\|h\|}{\|T h\|}.$$

First, we investigate the large-sample properties of the estimator $\hat{g}_q$. For this purpose, we present the following regularity conditions, which are mild and can be found in [24] and [23].
A1. (i) The support of $(X, W)$ is contained in $[0,1]^{2d}$; (ii) the joint probability measure of $(Y, W)$ is absolutely continuous with respect to the product probability measure of $Y$ and $W$; and (iii) the support of $W$ is a Cartesian product of compact connected intervals on which $W$ has a probability density function that is bounded away from zero.

A2. For each $(x, w) \in [0,1]^{2d}$, the function $t(x, w)$ is bounded by $c$.

A3. (i) Equation (4) has a solution $g \in \mathcal{H}_c$ with smoothness index $s > d/2$ and $\|g\|_{s,2} \le c$; and (ii) $m$ belongs to $W_2^s[0,1]^d$ with $\|m\|_{s,2} \le c$.

A4. The set of functions $\{\phi_k\}_{k=1}^{\infty}$ is an orthonormal, complete basis for $L^2[0,1]^d$, and is bounded uniformly over $k$.

A5. (i) $E(Y^2 \mid W = w) \le c_0$ for some constant $c_0 < \infty$; and (ii) $q \to \infty$, $\tau_q \sqrt{q/N} \to 0$, $\tau_q \sqrt{q/n} \to 0$ as $N \to \infty$, $n \to \infty$.
Theorem 1. Under conditions A1-A5, as $N \to \infty$ and $n \to \infty$, we have

$$\big\| \hat{g}_q - g \big\| = O_p\Big( q^{-s/d} + \tau_q \sqrt{q/N} + \tau_q \sqrt{q/n} \Big), \qquad (11)$$

where $\|\cdot\|$ denotes the $L^2$ norm, $\|h\| = \big( \int h^2(x)\, dx \big)^{1/2}$ for any $h \in L^2[0,1]^d$.
In (11), the term $q^{-s/d}$ arises from the bias of $\hat{g}_q$ caused by truncating the series approximation of $g$. The truncation bias decreases as $s$ increases and $g$ becomes smoother; therefore, the smoother $g$ is, the faster the rate of convergence of $\hat{g}_q$. The terms $\tau_q \sqrt{q/N}$ and $\tau_q \sqrt{q/n}$ are respectively induced by the random surrogate sampling errors and the random validation sampling errors in the estimates of the generalized Fourier coefficients. When $X$ is measured without error, the convergence rate of the sieve estimator of $g$ is $O_p\big( q^{-s/d} + \sqrt{q/N} \big)$. Comparing this rate to that in (11), we note that the bias part $q^{-s/d}$ is of the same order; however, the standard deviation part blows up from $\sqrt{q/N}$ to $\tau_q \big( \sqrt{q/N} + \sqrt{q/n} \big)$.
A more precise behaviour of the estimator can be obtained, but it depends on $\tau_q$ which, as discussed in [23], can be classified into a mildly ill-posed case and a severely ill-posed case. In the next corollary, we obtain the rates for these two particular cases.
Corollary 1. Suppose the assumptions of Theorem 1 are satisfied.

(i) Let $\tau_q \asymp q^{\varsigma/d}$ (mildly ill-posed case) with $\varsigma > 0$, and $q \asymp \min(N, n)^{d/(2s + 2\varsigma + d)}$; we have

$$\big\| \hat{g}_q - g \big\| = O_p\Big( \min(N, n)^{-s/(2s + 2\varsigma + d)} \Big).$$

(ii) Let $\tau_q \asymp \exp\big( q^{\varsigma/d} \big)$ (severely ill-posed case) with $\varsigma > 0$, and $q \asymp \big( \log \min(N, n) / \ell_{N,n} \big)^{d/\varsigma}$; we have

$$\big\| \hat{g}_q - g \big\| = O_p\Big( \big( \log \min(N, n) \big)^{-s/\varsigma} \Big),$$

where the function $\ell_{N,n}$ goes to $\infty$ slowly, such that $\tau_q \sqrt{q / \min(N, n)} \to 0$ for all such choices of $q$.
Remark 4. According to Corollary 1(i), when $N \asymp n$, the convergence rate becomes $O_p\big( n^{-s/(2s + 2\varsigma + d)} \big)$. This is slower than that of the sieve estimator of a conditional mean function, which can achieve the rate of convergence $O_p\big( n^{-s/(2s + d)} \big)$; the loss is the price paid for the ill-posedness of the inverse problem.
Next, we study the large-sample properties of the estimator $\hat{g}_q(\cdot, z)$. For each $z \in [0,1]^{d_1}$, let $\tau_q(z)$ denote the sieve measure of ill-posedness defined as above with $T$ replaced by $T_z$. For this purpose, we make the following assumptions.

B1. (i) The support of $(X, W)$ is contained in $[0,1]^{2d}$, and $Z$ is supported on $[0,1]^{d_1}$; (ii) conditioning on $Z = z$, the joint probability measure of $(Y, W)$ is absolutely continuous with respect to the product probability measure of $Y$ and $W$; and (iii) conditioning on $Z = z$, the support of $W$ is a Cartesian product of compact connected intervals on which $W$ has a conditional probability density function that is bounded away from zero.

B2. For each $z \in [0,1]^{d_1}$, $t_z(x, w)$ is bounded by $c$.

B3. (i) For each $z \in [0,1]^{d_1}$, (8) has a solution $g(\cdot, z) \in \mathcal{H}_c$ with smoothness index $s > d/2$ and a constant $c$ that does not depend on $z$; and (ii) for each $z \in [0,1]^{d_1}$, $m(\cdot, z)$ belongs to $W_2^s[0,1]^d$ with $\|m(\cdot, z)\|_{s,2} \le c$.

B4. (i) The set of functions $\{\phi_k\}_{k=1}^{\infty}$ is an orthonormal, complete basis for $L^2[0,1]^d$, and is bounded uniformly over $k$; and (ii) the kernel function $K$ is a symmetric, twice continuously differentiable function on $[-1, 1]$, and $\int_{-1}^{1} K(u)\, du = 1$, $\int_{-1}^{1} u K(u)\, du = 0$ and $\int_{-1}^{1} u^2 K(u)\, du \le c_K$, with $c_K$ being some finite constant.

B5. (i) $N$, $n$, $h_1$, $h_2$ satisfy the conditions that $N h_1^{d_1} \to \infty$ and $n h_2^{d_1} \to \infty$; (ii) $h_1 \asymp N^{-\gamma_1}$ and $h_2 \asymp n^{-\gamma_2}$, where $\gamma_1$ and $\gamma_2$ are constants with $0 < \gamma_1, \gamma_2 < 1$; and (iii) the density of $Z$ is bounded away from zero on $[0,1]^{d_1}$, with an upper bound $c_Z$ for some constant $c_Z < \infty$.

B6. (i) $E(Y^2 \mid W = w, Z = z) \le c_0$ for some constant $c_0 < \infty$; and (ii) $q \to \infty$, $\tau_q(z) \sqrt{q/(N h_1^{d_1})} \to 0$, $\tau_q(z) \sqrt{q/(n h_2^{d_1})} \to 0$ uniformly in $z$ as $N \to \infty$, $n \to \infty$.
Theorem 2. Suppose assumptions B1-B6 are satisfied. For each $z \in [0,1]^{d_1}$, with $q$, $h_1$ and $h_2$ chosen as in B5 and B6, we have

$$\big\| \hat{g}_q(\cdot, z) - g(\cdot, z) \big\| = O_p\Big( q^{-s/d} + \tau_q(z) \big( \sqrt{q/(N h_1^{d_1})} + \sqrt{q/(n h_2^{d_1})} + h_1^2 + h_2^2 \big) \Big).$$
The proofs of all the theorems are reported in the Appendix.
4. Numerical Properties
In this section, we conduct a simulation study of the finite-sample performance of the proposed estimators. First, we choose the cosine sequence with $\phi_1(x) = 1$ and $\phi_k(x) = \sqrt{2}\, \cos\{(k-1)\pi x\}$, $k \ge 2$, as the complete orthonormal basis for $L^2[0,1]$, and then obtain our estimators (denoted by $\hat{g}_1$ and $\hat{g}_2$) following (6) and (10). For comparison, we consider the method of [18] (denoted by $\hat{g}_3$) and the naive estimator $\hat{g}_{naive}$, which applies the standard Nadaraya-Watson estimator with an Epanechnikov kernel to the primary dataset as if the surrogate were error-free. It should be pointed out that the analogous estimator computed from the true observations $\{(Y_i, X_i)\}$ can serve as a gold standard in the simulation study, even though it is practically unachievable due to measurement errors. The performance of an estimator $\hat{g}$ is assessed by using the average of the integrated squared errors (MISE), $\mathrm{ISE}(\hat{g}) = \frac{1}{n_{\mathrm{grid}}} \sum_{r=1}^{n_{\mathrm{grid}}} \{\hat{g}(u_r) - g(u_r)\}^2$, where $\{u_r,\ r = 1, \ldots, n_{\mathrm{grid}}\}$ are grid points at which $\hat{g}$ is evaluated.
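A trivial helper matching this criterion, assuming the estimators are represented as callables over a grid as in the sketches of Section 2:

```python
def ise(g_hat, g_true, grid):
    """Integrated squared error of an estimate over an equidistant grid,
    e.g. grid = np.linspace(0.0, 1.0, 201) as in Table 1."""
    return np.mean((g_hat(grid) - g_true(grid)) ** 2)
```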
Example 1: We consider model (1) with a fixed smooth regression function $g$ and with $\varepsilon$ distributed as a centred normal. To perform this simulation, we generate $X$ from a standard normal distribution, that is, $X \sim N(0, 1)$, and assume that $W = X + U$, where $U \sim N(0, \sigma_u^2)$ is independent of $(X, \varepsilon)$ and $\sigma_u$ is the standard deviation of the measurement error. Then $X$ and $W$ are trimmed to a bounded interval and rescaled to $[0,1]$, respectively. Only results for two representative values of $\sigma_u$ are reported here. Simulations were run with different validation and primary data sizes $(n, N)$, with $n$ and $N$ increasing proportionally at fixed ratios $n/N$. For each case, 1000 simulated data sets were generated for each sample size $(n, N)$.
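A possible data-generating sketch for this design is given below; the regression function `g`, the noise scale and the trimming interval $[-2.5, 2.5]$ are illustrative placeholders, since the exact choices are fixed in the original displays.

```python
def generate_example1(N, n, sigma_u, g, noise_sd, rng):
    """Primary sample (Y, W) and validation sample (Xv, Wv) with X ~ N(0, 1)
    and W = X + U, U ~ N(0, sigma_u^2). Trimming to [-2.5, 2.5] before
    rescaling to [0, 1] is an illustrative choice."""
    def draw(size):
        X = rng.normal(size=size)
        W = X + sigma_u * rng.normal(size=size)
        keep = (np.abs(X) <= 2.5) & (np.abs(W) <= 2.5)  # trim both variables
        return (X[keep] + 2.5) / 5.0, (W[keep] + 2.5) / 5.0
    X, W = draw(N)
    Y = g(X) + noise_sd * rng.normal(size=X.size)       # model (1) on the [0, 1] scale
    Xv, Wv = draw(n)
    return Y, W, Xv, Wv
```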
It is interesting to compare our estimator $\hat{g}_1$ with the estimators $\hat{g}_3$ and $\hat{g}_{naive}$. Since our estimator $\hat{g}_1$ involves the regularization parameter $q$, we select it by the following cross-validation (CV) criterion:

$$\widehat{q} = \arg\min_{q} \frac{1}{N} \sum_{i=1}^{N} \big\{ Y_i - \hat{g}_{1,(-i)}(W_i) \big\}^2,$$

where the subscript $(-i)$ means that the estimator was constructed without using the $i$th observation $(Y_i, W_i)$. For $\hat{g}_3$, [18] proposed an automatic way of choosing its smoothing parameters. For $\hat{g}_{naive}$, the CV approach is used for choosing the bandwidth.
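The selection rule can be expressed as a generic leave-one-out driver; how the fitted value for the held-out pair is formed is supplied as a callback (`predict(i, q)`), since it is defined by the criterion above.

```python
def select_q(Y, predict, q_grid):
    """Leave-one-out CV for the truncation point q. predict(i, q) must return
    the fitted value for the held-out observation i, as prescribed by the
    criterion displayed above."""
    def cv(q):
        return sum((Y[i] - predict(i, q)) ** 2 for i in range(len(Y))) / len(Y)
    return min(q_grid, key=cv)
```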
Figure 1 shows the true regression function curve $g$, together with the fitted curves of $\hat{g}_1$, $\hat{g}_3$ and $\hat{g}_{naive}$ achieving the median MISE over the 1000 replications, for a fixed measurement error level and under different sample sizes. From Figure 1, both $\hat{g}_1$ and $\hat{g}_3$ successfully capture the patterns of the true regression curve and have smaller bias than $\hat{g}_{naive}$. As expected, $\hat{g}_{naive}$ fails to produce accurate function curve estimates. In addition, it is obvious that the quality of our proposed estimator improves with increasing sample sizes.
Table 1 compares, for various sample sizes, the results obtained for estimating the curve $g$ under the two values of $\sigma_u$. The estimated MISEs, evaluated on a grid of 201 equidistant values of $x$ in $[0,1]$, are presented. Our results show that the estimators $\hat{g}_1$ and $\hat{g}_3$ outperform $\hat{g}_{naive}$. It is noteworthy that our proposed estimator $\hat{g}_1$ generally performs better than the estimator proposed by [18], as the resultant MISEs of $\hat{g}_1$ are usually smaller. Also, the performance of $\hat{g}_1$ improves considerably (i.e., the corresponding MISEs decrease) as the sample sizes increase. For any nonparametric method in a measurement error regression problem, the quality of the estimator also depends on the discrepancy of the observed sample; that is, the performance of the estimator depends on the variance of the measurement error. Here, we compare the results for different values of $\sigma_u$. As expected, Table 1 shows that the effect of this variance on estimator performance is evident: the MISEs increase with $\sigma_u$.
Example 2: We consider model (7) with a fixed smooth regression function $g(x, z)$ and with $\varepsilon$ distributed as a centred normal. The covariate vector $(X, Z)$ was generated from a bivariate normal distribution with standard normal marginals and correlation coefficient between $X$ and $Z$ equal to 0.6, and $W = X + U$ with $U \sim N(0, \sigma_u^2)$. Then $X$, $W$ and $Z$ are trimmed to a bounded interval and rescaled to $[0,1]$, respectively. Results for two representative values of $\sigma_u$ are reported. Simulations were run with different validation and primary data sizes $(n, N)$, with $n$ and $N$ increasing proportionally at fixed ratios $n/N$. For each case, 1000 simulated data sets were generated for each sample size $(n, N)$.
Here, we only compare our estimator $\hat{g}_2$ with the naive estimator $\hat{g}_{naive}$, which is the multivariate kernel regression estimator based on the primary dataset $\{(Y_i, W_i, Z_i)\}_{i=1}^{N}$, since the method of [18] cannot be applied to multivariate cases. We used the Epanechnikov kernel $K(u) = 0.75 (1 - u^2) I(|u| \le 1)$ for the univariate smoothing over $Z$ in $\hat{g}_2$, and the product kernel $K(u_1, u_2) = K(u_1) K(u_2)$ for the bivariate smoothing in $\hat{g}_{naive}$. For the naive estimator $\hat{g}_{naive}$, the bandwidth selection rules considered by [25] were used. For our estimator $\hat{g}_2$, we used the cross-validation approach to choose the three parameters $h_1$, $h_2$ and $q$. For this purpose, $h_2$ and $(h_1, q)$ are selected separately as follows.
Define the leave-one-out analogue $\hat{a}_{kl,(-j)}(z)$ of $\hat{a}_{kl}(z)$, computed from the validation sample with the $j$th observation removed. Here, we adopt the cross-validation (CV) approach to estimate $h_2$ by

$$\widehat{h}_2 = \arg\min_{h_2} \sum_{j=1}^{n} \sum_{k=1}^{q} \sum_{l=1}^{q} \big\{ \phi_l(X_j)\, \phi_k(W_j) - \hat{a}_{kl,(-j)}(Z_j) \big\}^2,$$

where the subscript $(-j)$ denotes the estimator being constructed without using the $j$th observation (the minimization is carried out for each candidate value of $q$). After obtaining $\widehat{h}_2$, we then select $(h_1, q)$ by

$$\big( \widehat{h}_1, \widehat{q} \big) = \arg\min_{(h_1, q)} \sum_{i=1}^{N} \big\{ Y_i - \hat{g}_{2,(-i)}(W_i, Z_i) \big\}^2,$$

where the subscript $(-i)$ denotes the estimator being constructed without using the $i$th observation $(Y_i, W_i, Z_i)$.
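The two-step search can be organized as below, with the leave-one-out quantities again supplied as callbacks; all names are illustrative.

```python
from itertools import product

def select_h2(n_valid, loo_sq_error, h2_grid):
    """Step 1: choose h2 on the validation sample. loo_sq_error(j, h2) is the
    squared leave-one-out discrepancy for validation observation j."""
    return min(h2_grid,
               key=lambda h2: sum(loo_sq_error(j, h2) for j in range(n_valid)))

def select_h1_q(Y, predict, h1_grid, q_grid, h2):
    """Step 2: with h2 fixed, choose (h1, q) on the primary sample;
    predict(i, h1, q, h2) is the leave-one-out fitted value for Y_i."""
    def cv(pair):
        h1, q = pair
        return sum((Y[i] - predict(i, h1, q, h2)) ** 2 for i in range(len(Y)))
    return min(product(h1_grid, q_grid), key=cv)
```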
We compute the MISE at equidistant grid points of $(x, z)$ ranging over $[0,1]^2$. Table 2 reports the MISEs for estimating the curve $g$ under the two values of $\sigma_u$ for various sample sizes. Table 2 shows that our proposed estimator substantially outperforms the naive kernel estimator: $\hat{g}_2$ has much smaller MISE than $\hat{g}_{naive}$ throughout.
5. Discussion
In this paper, we propose a new method for estimating nonparametric regression measurement error models using surrogate data and validation sampling. The covariates are measured with errors, while we do not assume any error model structure between the true covariates and the surrogate variables. Most importantly, our proposed method can be readily extended to the multi-covariate model, say $Y = g(X, Z) + \varepsilon$, where $X$ is measured with error but $Z$ is measured exactly. Numerical results show that the new estimators are promising in terms of correcting the bias arising from the errors-in-variables, and they generally perform better than the approach proposed by [18].
Acknowledgements
This work was supported by NSFC grants 11301245 and 11501126, and by the Natural Science Foundation of Jiangxi Province of China under grant number 20142BAB211018.
Appendix
Proof of Theorem 1
Let $T^*$ denote the adjoint operator of $T$. Under assumption A1(ii), the self-adjoint operators $T^* T$ and $T T^*$ have the same eigenvalue sequence $\{\lambda_k\}_{k=1}^{\infty}$ with $\lambda_1 \ge \lambda_2 \ge \cdots > 0$. Moreover, we assume that the corresponding eigenfunctions of the operators $T^* T$ and $T T^*$ are also the orthonormal basis $\{\phi_k\}_{k=1}^{\infty}$, and for all $k \ge 1$,

$$T \phi_k = \lambda_k^{1/2} \phi_k, \qquad T^* \phi_k = \lambda_k^{1/2} \phi_k.$$

Define

$$g_q = \sum_{k=1}^{q} g_k \phi_k, \qquad g_k = \langle g, \phi_k \rangle.$$

Let $T_q$ be the operator whose kernel is

$$t_q(x, w) = \sum_{k=1}^{q} \sum_{l=1}^{q} a_{kl}\, \phi_l(x) \phi_k(w);$$

then $T_q h = T h$ for any $h \in \mathcal{H}_q$. By the definition of $\tau_q$, we have $\tau_q = \lambda_q^{-1/2}$.
Lemma 1. Under conditions A1 and A3(i) and the sieve space $\mathcal{H}_q$, we have

1) $\|g_q - g\| = O\big( q^{-s/d} \big)$;

2) $\|h\| \le \tau_q \|T h\|$ for all $h \in \mathcal{H}_q$.
Lemma 2. Under conditions A1, A3(ii) and A4, we have

$$\|\hat{m} - m_q\| = O_p\big( \sqrt{q/N} \big), \qquad \|(\hat{T} - T_q)\, g_q\| = O_p\big( \sqrt{q/n} \big),$$

where $m_q = \sum_{k=1}^{q} m_k \phi_k$. By some modifications of the proof of Theorem 2 in [23] and an application of Theorem 7 in [24], the proofs of Lemma 1 and Lemma 2 are straightforward and are omitted.
Proof of Theorem 1. By the triangle inequality, we have

$$\|\hat{g}_q - g\| \le \|\hat{g}_q - g_q\| + \|g_q - g\|.$$

By the definition of $g_q$ and condition A3(i), we have

$$\|g_q - g\| = O\big( q^{-s/d} \big); \qquad (12)$$

see e.g. [26] for Fourier series.

Next, by the definition of $\hat{g}_q$ as the minimizer in (5) and the triangle inequality, we have

$$\|\hat{T} \hat{g}_q - \hat{m}\| \le \|\hat{T} g_q - \hat{m}\| \le \|(\hat{T} - T_q)\, g_q\| + \|T_q g_q - \hat{m}\|.$$

We now analyze the term $\|T_q g_q - \hat{m}\|$. By the triangle inequality, we have

$$\|T_q g_q - \hat{m}\| \le \|T_q g_q - m_q\| + \|m_q - \hat{m}\|.$$

By conditions A2, A4 and the central limit theorem, we can show that $\|m_q - \hat{m}\| = O_p(\sqrt{q/N})$. From condition A3(ii), we have $\|T_q g_q - m_q\| = O(q^{-s/d})$. Hence, $\|T_q g_q - \hat{m}\| = O_p\big( \sqrt{q/N} + q^{-s/d} \big)$. In addition, by the definition of $\hat{T}$ and the triangle inequality, we have

$$\|T(\hat{g}_q - g_q)\| \le 2\, \|\hat{T} g_q - \hat{m}\| + \|(\hat{T} - T)(\hat{g}_q - g_q)\|.$$

These bounds, Lemma 2 and the asymptotic negligibility of the last term imply

$$\|T(\hat{g}_q - g_q)\| = O_p\big( \sqrt{q/N} + \sqrt{q/n} + q^{-s/d} \big).$$

This and Lemma 1 imply

$$\|\hat{g}_q - g_q\| = O_p\Big( \tau_q \big( \sqrt{q/N} + \sqrt{q/n} \big) + q^{-s/d} \Big). \qquad (13)$$

The theorem follows immediately from (12) and (13). $\blacksquare$
Proof of Theorem 2
Lemma 3. For each $z \in [0,1]^{d_1}$, define

$$g_q(\cdot, z) = \sum_{k=1}^{q} g_k(z)\, \phi_k, \qquad g_k(z) = \langle g(\cdot, z), \phi_k \rangle.$$

Let $T_{q,z}$ be the operator whose kernel is

$$t_{q,z}(x, w) = \sum_{k=1}^{q} \sum_{l=1}^{q} a_{kl}(z)\, \phi_l(x) \phi_k(w);$$

then $T_{q,z} h$ agrees with the projection of $T_z h$ onto the span of $\{\phi_1, \ldots, \phi_q\}$ for any $h \in \mathcal{H}_q$. By the definition of $\tau_q(z)$, we have $\|h\| \le \tau_q(z) \|T_z h\|$ for all $h \in \mathcal{H}_q$.
Proof of Theorem 2. For each $z \in [0,1]^{d_1}$, by the triangle inequality, we have

$$\|\hat{g}_q(\cdot, z) - g(\cdot, z)\| \le \|\hat{g}_q(\cdot, z) - g_q(\cdot, z)\| + \|g_q(\cdot, z) - g(\cdot, z)\|.$$

By assumption B3(i), it is easy to show that $\|g_q(\cdot, z) - g(\cdot, z)\| = O(q^{-s/d})$.

Similar to the proof of Theorem 1, we have

$$\|\hat{T}_z \hat{g}_q(\cdot, z) - \hat{m}(\cdot, z)\| \le \|(\hat{T}_z - T_{q,z})\, g_q(\cdot, z)\| + \|T_{q,z}\, g_q(\cdot, z) - \hat{m}(\cdot, z)\|.$$

According to assumptions B2, B3(ii), B4 and B5(i), we can show that

$$\hat{m}_k(z) - m_k(z) = O_p\big( (N h_1^{d_1})^{-1/2} + h_1^2 \big), \qquad k = 1, \ldots, q.$$

In addition, by some modifications of the proof of Lemma 2, under assumptions B1, B3(ii), B4, B5(i) and B6, we have

$$\|(\hat{T}_z - T_{q,z})\, g_q(\cdot, z)\| = O_p\big( \sqrt{q/(n h_2^{d_1})} + h_2^2 \big).$$

For the term $\|\hat{g}_q(\cdot, z) - g_q(\cdot, z)\|$, under assumptions B1, B3(i) and the sieve space $\mathcal{H}_q$, we have

$$\|\hat{g}_q(\cdot, z) - g_q(\cdot, z)\| = O_p\Big( \tau_q(z) \big( \sqrt{q/(N h_1^{d_1})} + \sqrt{q/(n h_2^{d_1})} + h_1^2 + h_2^2 \big) + q^{-s/d} \Big).$$

Combining all these results, we complete the proof. $\blacksquare$