A Multiplicative Bias Correction for Nonparametric Approach and the Two Sample Problem in Sample Survey
1. Introduction
Sometimes two separate surveys gather related information on a variable of interest in a population, U, perhaps with distinct designs and modes of sampling. The question of how to combine the data from the two surveys then becomes very important.
Take, for example, the students of the Sub-regional Institute of Statistics and Applied Economics (ISSEA) and those of the polytechnic institute, who both collect data on unemployment in Cameroon, in different ways and with different priorities. Researchers at the National Institute of Statistics (Cameroon) face the following problem: how can the data from these two distinct surveys be joined into a single data set that better represents the population?
Researchers have studied this problem for several years and have approached it in different ways. One approach obtains estimates from the two surveys separately and combines them using the inverses of the estimated variances as weights, as seen in [1]. The authors of [2] went further and used the empirical likelihood method to combine information from multiple surveys. Another option consists of pooling the two data sets into a single data set, taking into account the weights of the individual sample units. Some of these methods are developed in [3]; they include the pseudo-likelihood, the missing information principle and the iterated post-stratified estimator. After simulations on two different populations, it was concluded that in neither population did the design-based ways of combining data yield the best results. The iterated post-stratified estimator appears to be a very promising non-parametric way to combine data from two sources.
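To illustrate the first approach mentioned above, the following minimal sketch combines two separate survey estimates using inverse-variance weights. The function name and the numbers are hypothetical and are not taken from [1]. The sketch assumes that the two estimates are independent and that their variances can be treated as known.

```python
def combine_inverse_variance(est1, var1, est2, var2):
    """Combine two independent estimates, weighting each by 1/variance."""
    w1 = 1.0 / var1
    w2 = 1.0 / var2
    combined = (w1 * est1 + w2 * est2) / (w1 + w2)
    # Variance of the combined estimate; valid only under independence
    # and with the two variances treated as known (an assumption here).
    combined_var = 1.0 / (w1 + w2)
    return combined, combined_var


# Hypothetical numbers, for illustration only.
total, variance = combine_inverse_variance(100.0, 4.0, 110.0, 1.0)
```

The more precise estimate (the one with the smaller variance) receives the larger weight, so the combined value lies closer to it.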
More recently, [4] used nonparametric regression, which is the model-based sampler’s method of choice when there is serious doubt about the suitability of a linear or other simple parametric model for the survey data at hand. Nonparametric regression removes the need for design weights and standard design-based weights. Recognizing this is especially helpful in sampling situations where design weights are missing or questionable.
This study made use of kernel smoothers, especially the Nadaraya-Watson smoother. However, estimators based on Nadaraya-Watson smoothing weights are normally biased in small samples and at boundary points.
Alternative techniques for reducing the bias exist; for a detailed review see [5] - [11]. These methods improve the performance of nonparametric regression at points of large curvature. In this paper, we consider a multiplicative bias correction approach to nonparametric regression in order to obtain an estimate with a smaller bias than the existing ones.
Outline of the Paper
The remaining part of this paper is organized as follows. In Section 2, a multiplicative bias corrected estimator for the finite population total is proposed. In Section 3, the asymptotic properties of the proposed estimator are derived. In Section 4, an empirical study of the derived properties is presented. Section 5 concludes the paper.
2. Proposed Estimator
Consider a finite population,
and let
represent the combined random sample drawn from the population using different sampling techniques. Suppose that to each of these
, there is an auxiliary information
.
Let us consider the following model:
(1)
(2)
where
and
are twice continuously differentiable functions (that is, Lipschitz continuous). With these assumptions on
and
, one can estimate
and
non-parametrically.
Let
be i.i.d. with zero mean, and variance
. We refer to this set-up as the weak model. In this scheme, we can ignore which of the original samples the
come from.
Usually, the finite population total is computed as
$$T = \sum_{i \in s} y_i + \sum_{i \in r} y_i \qquad (3)$$
where s refers to the sample and r refers to the nonsampled part of the population. Since the values in the sample are known, estimating the finite population total is equivalent to predicting the total of the nonsampled part of the population.
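The split into a known sample part and a predicted nonsample part can be sketched as follows. This is only an illustration: `predicted_total` and the toy predictor are hypothetical names, and any fitted regression function could be passed as `predict`.

```python
def predicted_total(sample_y, nonsample_x, predict):
    """Known sample values plus model predictions for nonsampled units."""
    observed_part = sum(sample_y)
    predicted_part = sum(predict(x) for x in nonsample_x)
    return observed_part + predicted_part


# Hypothetical example: two sampled units, two nonsampled units,
# and a toy predictor standing in for a fitted regression function.
estimate = predicted_total([1.0, 2.0], [3.0, 4.0], lambda x: 2.0 * x)
```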
To do this, the multiplicative bias corrected technique is employed in which case the proposed estimator of the population total is now defined as
(4)
where
is the inclusion probability
is the multiplicative bias corrected estimator.
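The principal objective of the multiplicative bias correction technique is to remedy the main shortcoming of the kernel smoother, namely its bias at the boundaries. Given a pilot smoother of the regression function,
$$\tilde{m}_n(x) = \sum_{j=1}^{n} w_j(x)\, y_j, \qquad (5)$$
the inverse relative estimation error of the smoother at each of the observations is given by
$V_j = m(x_j) / \tilde{m}_n(x_j)$.
A noisy estimate of the ratio
$V_j$ is given by
$$\hat{V}_j = \frac{y_j}{\tilde{m}_n(x_j)}. \qquad (6)$$
Smoothing the noisy estimate
$\hat{V}_j$ leads to
$$\hat{\alpha}_n(x) = \sum_{j=1}^{n} w_j(x)\, \hat{V}_j. \qquad (7)$$
This gives a better estimate of the inverse relative estimation error at each observation and can therefore be used as a multiplicative correction of the pilot smoother:
$$\hat{m}_n(x) = \hat{\alpha}_n(x)\, \tilde{m}_n(x). \qquad (8)$$
For both
$\tilde{m}_n$ and
$\hat{\alpha}_n$, we use the same weighting scheme:
$$w_j(x) = \frac{K\left(\frac{x - x_j}{h}\right)}{\sum_{k=1}^{n} K\left(\frac{x - x_k}{h}\right)}, \qquad (9)$$
where
h is the bandwidth,
K is a probability density function, symmetric about zero,
n is the sample size.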
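The construction described above (pilot smoother, noisy ratios, smoothed ratios, multiplicative correction) can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the study. It assumes a Gaussian kernel, Nadaraya-Watson weights, a single bandwidth h for both smoothing steps, and a strictly positive pilot fit. All function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: a kernel symmetric about zero."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_weights(x, xs, h):
    """Nadaraya-Watson weights at x (assumes the kernel sum is nonzero)."""
    k = [gaussian_kernel((x - xj) / h) for xj in xs]
    total = sum(k)
    return [kj / total for kj in k]


def nw_smooth(x, xs, ys, h):
    """Pilot smoother: kernel-weighted average of the responses."""
    return sum(w * y for w, y in zip(nw_weights(x, xs, h), ys))


def mbc_smooth(x, xs, ys, h):
    """Multiplicative bias corrected smoother at the point x."""
    # Pilot fit at each observation (assumed strictly positive).
    pilot_at_obs = [nw_smooth(xj, xs, ys, h) for xj in xs]
    # Noisy estimates of the inverse relative estimation error.
    ratios = [y / p for y, p in zip(ys, pilot_at_obs)]
    # Smooth the ratios with the same weights, then correct the pilot fit.
    alpha = nw_smooth(x, xs, ratios, h)
    return alpha * nw_smooth(x, xs, ys, h)


# Noise-free toy data on [0, 1]; the true value at the boundary x = 0 is 1.0.
xs = [i / 20 for i in range(21)]
ys = [1.0 + x for x in xs]
pilot_at_0 = nw_smooth(0.0, xs, ys, 0.2)
mbc_at_0 = mbc_smooth(0.0, xs, ys, 0.2)
```

In this toy example the corrected value at the boundary should lie closer to the true value than the pilot value does, which is the behaviour the correction is designed to produce.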
Bandwidth Selection Techniques
● Biased cross-validation (bcv).
● Unbiased cross-validation (ucv).
● A rule of thumb for choosing the bandwidth of a Gaussian kernel density estimator (ndr0).
● A more common variation of this rule of thumb given by Scott (1992) (ndr).
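The two rule-of-thumb selectors in the list above can be sketched as follows. The labels ndr0 and ndr resemble the R bandwidth selectors `bw.nrd0` and `bw.nrd`, but this correspondence is an assumption. The sketch uses the usual rule-of-thumb constants 0.9 and 1.06, and the function names are hypothetical. The cross-validation selectors are not shown.

```python
import statistics


def _spread(data):
    """min(standard deviation, IQR / 1.34), falling back to the sd if IQR is 0."""
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return min(sd, iqr / 1.34) if iqr > 0 else sd


def rule_of_thumb_bandwidth(data, constant=0.9):
    """Rule-of-thumb bandwidth: constant * spread * n^(-1/5)."""
    return constant * _spread(data) * len(data) ** (-0.2)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
h_rule = rule_of_thumb_bandwidth(x, constant=0.9)    # constant 0.9
h_scott = rule_of_thumb_bandwidth(x, constant=1.06)  # constant 1.06
```

Both rules shrink the bandwidth at the rate n^(-1/5); they differ only in the leading constant.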
3. Properties of Proposed Estimator
3.1. Assumptions
The following assumptions are made in the estimation of
.
● The regression function is bounded and strictly positive, that is,
for all x
● The regression function is twice continuously differentiable everywhere.
●
has finite fourth moments and has a symmetric distribution around zero.
● The bandwidth h is such that,
,
and
as
3.2. Asymptotic Unbiasedness of the Proposed Estimator
We want to show that
as
. Under the model-based approach, the bias of the estimator
is defined as follows;
(10)
Now, we have the expected value of the proposed estimator for the finite population total given by;
(11)
(12)
(13)
is obtained by analysing the individual terms of the stochastic approximation of
. Let us then establish the stochastic approximation of
as shown by (Hengartner 2009).
From (8),
(14)
(15)
(16)
Let us define
then we can express
as.
Through the series expansion,
is an approximation of the quantity R.
Replacing both
and
in (16), we obtain
Using the assumption
the remainder term converges to zero in probability and the expression reduces to
To solve Equation (16), we need to find
hence,
since
(17)
Hence,
(18)
The above expression can be simplified by considering a truncated Taylor series expansion of
about a point x. Hence
(19)
Now, substituting the first two terms in (18) gives
(20)
But
and
, therefore
(21)
Furthermore,
Hence the asymptotic bias of the estimator is given by
The bias of
will be of order
. Thus it converges to zero at a faster rate compared to the existing non-parametric estimators which generally converge at the rate
.
3.3. Asymptotic Variance of the Proposed Estimator
The variance of the finite population total is given by;
Firstly,
(22)
Using the assumption
, the remainder terms converge to zero in probability. Therefore
and Equation (22) reduces to
(23)
Truncating the binomial expansion at the first term yields
The above expression can be simplified by considering the first two terms of the Taylor series of
. So we obtain
(24)
Therefore,
(25)
Thus the asymptotic variance is given by
(26)
This implies that
is more efficient than the usual non-parametric regression estimator proposed by Dorfman (1992).
3.4. Asymptotic Mean Square Error
The asymptotic mean square error of the estimator
is given by
(27)
(28)
As
and
, the
tends to 0, indicating that the proposed estimator is statistically consistent.
4. Empirical Study
4.1. Population
In this section, the theory developed in the previous section is tested using a set of simulation studies, with a mix of survey designs and various approaches to selecting the bandwidth. We employ a population U of countries of the world of size N = 188, with auxiliary variable x = gross national income (GNI) and variable of interest y = human development index (HDI). The quantity of interest is the population total of the HDI.
Figure 1 shows the scatter diagram of the population, with HDI on the vertical axis and GNI on the horizontal axis. The relationship between the two variables appears to be quadratic.
We suppose, for each run of the experiment that two samples are taken:
Sample 1: simple random sampling without replacement (srswor).
Sample 2: stratified simple random sampling (stratsrs), with four strata of equal size and 8 units taken at random in each, so that the sample contains 32 units. The total experiment consists of 500 runs of pairs of samples. Table 1 gives the estimators considered.
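One run of the two sampling designs described above might be drawn as in the sketch below. Several details are assumptions made only for illustration: the units are taken to be ordered by the auxiliary variable, the four strata are formed as consecutive blocks of that ordering, and the size of Sample 1 is set to 32. The function names are hypothetical.

```python
import random


def draw_srswor(units, n, rng):
    """Simple random sample of n units, without replacement."""
    return rng.sample(units, n)


def draw_stratified(units_sorted, n_strata, n_per_stratum, rng):
    """Equal-sized consecutive strata; srswor of n_per_stratum in each.

    Any units left over when the population size is not a multiple of
    n_strata are ignored (188 / 4 = 47, so none are left over here).
    """
    size = len(units_sorted) // n_strata
    sample = []
    for k in range(n_strata):
        stratum = units_sorted[k * size:(k + 1) * size]
        sample.extend(rng.sample(stratum, n_per_stratum))
    return sample


rng = random.Random(2024)        # fixed seed for reproducibility
population = list(range(188))    # unit labels, assumed ordered by x
s1 = draw_srswor(population, 32, rng)        # sample size 32 is an assumption
s2 = draw_stratified(population, 4, 8, rng)  # 4 strata, 8 units in each
```

Repeating this pair of draws 500 times, with a different random state each time, would give the runs of the experiment.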
For an estimator
we considered three measures of relative success across the 500 runs:
i) Unconditional relative bias measured as ratio of mean value (across runs) to target
ii) Unconditional relative root mean square error divided by target
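The two measures listed above can be computed as in the following sketch. The exact formulas used in the study are not reproduced in the text. The sketch therefore assumes that the relative bias is the mean error across runs divided by the target, and that the relative root mean square error is the root of the mean squared error across runs divided by the target.

```python
import math


def relative_bias(estimates, target):
    """Mean error across runs, divided by the target (assumed definition)."""
    mean_error = sum(e - target for e in estimates) / len(estimates)
    return mean_error / target


def relative_rmse(estimates, target):
    """Root mean square error across runs, divided by the target."""
    mse = sum((e - target) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / target


# Hypothetical estimates from three runs, for illustration only.
runs = [98.0, 102.0, 104.0]
rb = relative_bias(runs, 100.0)
rrmse = relative_rmse(runs, 100.0)
```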
4.2. Results
Results obtained are tabulated in Table 2.
From the results obtained, we observe that the unbiased cross-validation approach is a viable means of selecting the bandwidth, as it gives the lowest bias and root mean square error across all the estimators. The proposed estimator for the two-sample problem gives better estimates of the population total than the estimators proposed in [12] and [4].
Furthermore, we study the conditional performance of the selected estimators. The 500 samples obtained were sorted by the mean of the auxiliary variable and put into 25 groups, each containing 20 samples. We then computed the bias and root mean square error within each group and plotted these conditional performances against the group average of the sample mean of the auxiliary variable. We report the behaviour of the conditional bias for the different bandwidths.
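The grouping procedure described above can be sketched as follows. The function name and the toy inputs are hypothetical. With 500 runs and `n_groups=25`, each group would contain 20 runs.

```python
import math


def conditional_performance(xbar, estimates, target, n_groups):
    """Sort runs by sample mean of x, split into equal groups, and
    return (group mean of xbar, bias, RMSE) for each group."""
    order = sorted(range(len(xbar)), key=lambda i: xbar[i])
    size = len(order) // n_groups
    results = []
    for g in range(n_groups):
        idx = order[g * size:(g + 1) * size]
        errors = [estimates[i] - target for i in idx]
        mean_xbar = sum(xbar[i] for i in idx) / size
        bias = sum(errors) / size
        rmse = math.sqrt(sum(e * e for e in errors) / size)
        results.append((mean_xbar, bias, rmse))
    return results


# Hypothetical toy input: four runs split into two groups of two.
groups = conditional_performance(
    xbar=[3.0, 1.0, 2.0, 4.0],
    estimates=[12.0, 9.0, 11.0, 14.0],
    target=10.0,
    n_groups=2,
)
```

Plotting the second and third entries of each tuple against the first would give conditional bias and conditional root mean square error curves of the kind discussed in this section.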
Figure 2 and Figure 3 show the conditional bias and the conditional root mean square error respectively, with each plot drawn for a different bandwidth. The population mean of the auxiliary variable x was found to be 1.701. In the conditional bias plots, the proposed estimator outperforms the two currently used estimators, especially with the unbiased cross-validation and biased cross-validation methods of selecting the bandwidth. This trend persists for the conditional root mean square error.
5. Conclusion
The aim of this study was to develop an estimator of the finite population total with the lowest bias, using the multiplicative bias corrected approach to nonparametric regression. The study reveals that the proposed estimator is more efficient than the modified nonparametric estimator (NPT). With a suitable bandwidth selection method (ucv), the proposed estimator has the smallest bias and root mean square error values. It has therefore proven to be effective in resolving the boundary problem associated with the existing nonparametric smoothers.
Acknowledgements
My first appreciation goes to my supervisors, Professor Odhiambo and Doctor Mageto, for accompanying me through this work. Many thanks also go to the African Union for providing for this scientific research and placing such confidence in its youth. Last but not least, I thank my family for their support.