Local Polynomial Regression Estimator of the Finite Population Total under Stratified Random Sampling: A Model-Based Approach

In this paper, auxiliary information is used to determine an estimator of the finite population total using nonparametric regression under stratified random sampling. To achieve this, a model-based approach is adopted in which local polynomial regression is used to predict the nonsampled values of the survey variable $y$. The performance of the proposed estimator is investigated against some design-based and model-based regression estimators. The simulation experiments show that the resulting estimator exhibits good properties. Generally, good confidence intervals are seen for the nonparametric regression estimators, and use of the proposed estimator leads to relatively smaller values of RE compared to other estimators.


Introduction
The main objective of sample surveys is to obtain information about the population and then use that information to make inferences about population quantities. The information most often sought is an aggregate value of some population characteristic, the total number of units, or the proportion of units having certain attributes. The information can be collected either by sampling methods or by a census. One approach to using auxiliary information in the construction of estimators is to assume a working model that describes the relationship between the survey variable and the auxiliary variable. Estimators are then derived based on this model. At this stage, estimators are sought that have good efficiency given that the model is true. In most cases, a linear model is assumed. Generalized regression estimators [1] [2], including linear regression estimators and ratio estimators [3], best linear unbiased estimators [4] [5] and post-stratification estimators [6] are all derived under the assumption of a linear model.
Sometimes the linear model fails, and the resulting estimators then do not beat the purely design-based estimators. As a result, [7] proposed a class of estimators in which the working model is a nonlinear parametric model. Improving the efficiency of such estimators, however, requires prior information about the exact parametric population structure. Because of these concerns, several researchers have considered nonparametric models for the superpopulation model $\xi$. Nonparametric regression may be used in the estimation of unknown finite population quantities such as population totals, means, proportions or averages. The idea of nonparametric regression traces its origin to the works of [8] and [9]. Nonparametric-based estimation is often more robust and flexible than inference based on parametric regression models or design probabilities (as in design-based inference) [10]. In sample surveys, auxiliary information is used at the estimation stage of finite population quantities (the population total or mean, say) to increase the precision of estimators of such population quantities [11] [12] [13].
A variety of approaches exist for constructing more efficient estimators of the population total or mean, including model-based and design-based methods.
The model-based approach in sample surveys rests on superpopulation models: the population under study is assumed to be a realization of a random variable following a superpopulation model $\xi$. This model $\xi$ is used to predict the nonsampled values of the population, and hence the finite population quantities, the total $Y$ or the mean $\bar{Y}$ [13].
[14] first considered nonparametric models for $\xi$ within a model-assisted approach and obtained a local polynomial regression estimator as a generalization of the ordinary generalized regression estimator. Their simulation study shows that the proposed estimator performs relatively better than the parametric estimators. [13] improved on the estimator of [14] and developed a model-based local polynomial regression estimator applicable to direct sampling designs such as simple random sampling and systematic sampling. Their estimator performs better than the model-assisted estimator of [14] and also beats the parametric estimators.
In this paper, auxiliary information is used to determine an estimator of the finite population total using nonparametric regression under stratified random sampling. To achieve this, a model-based approach is adopted in which local polynomial regression is used to predict the nonsampled values of the survey variable $y$. Stratified estimators of the finite population total $Y$ or mean $\bar{Y}$ have proved to be better than those resulting from simple random sampling [15] [16].
Additionally, it has been shown in the literature that the local polynomial approximation method has several attractive features, including satisfactory boundary behaviour, easy interpretability, applicability to a variety of design circumstances and good minimax properties (see [17] [18] and [19]).

Proposed Estimator
Consider a population consisting of $N$ units. Suppose this population is divided into $H$ strata, with $N_h$ units in the $h$th stratum. Let $x_{hj}$, $j = 1, \dots, N_h$, be the auxiliary measurement positively correlated with the survey measurement $y_{hj}$. From each stratum, a simple random sample of size $n_h$ is selected without replacement, where $n_h$ is sufficiently large with respect to $N_h$. Let $s_h$ be the sample in the $h$th stratum and $r_h$ be the nonsampled set in the $h$th stratum.
The population total is defined as
$$Y = \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj},$$
which can be rewritten as
$$Y = \sum_{h=1}^{H} \sum_{j \in s_h} y_{hj} + \sum_{h=1}^{H} \sum_{j \in r_h} y_{hj}. \tag{1}$$
Once the sample has been observed, the problem of estimating $Y$ becomes the problem of predicting the sum of the nonsampled $y_{hj}$'s. Usually, inference is made using the known sample and the model $\xi$. The first component in Equation (1) is known, while the second requires prediction, which is the focus of this paper. Local polynomial regression will be used to predict the unknown $y_{hj}$'s, $\forall j \in r_h$.
Suppose the distribution generating the $y_{hj}$'s is given by the superpopulation model $\xi$, in which
$$y_{hj} = m(x_{hj}) + e_{hj},$$
where the $e_{hj}$'s are independently distributed random variables with mean 0 and variance $\sigma^2(x_{hj})$, so that
$$E_{\xi}(y_{hj}) = m(x_{hj}), \qquad \mathrm{Cov}_{\xi}(y_{hj}, y_{h'j'}) = \begin{cases} \sigma^2(x_{hj}), & \text{for } h = h' \text{ and } j = j', \\ 0, & \text{otherwise,} \end{cases}$$
where $\sigma(x)$ and $m(x)$ are assumed to be continuous and twice differentiable functions of $x$. In practice, the values of $m(x)$ are unknown and so require prediction. Adopting the ideas of [13] [14] and [20], we make use of local polynomial regression of degree $p$, which is a generalization of kernel smoothing, to predict the unobserved $y_{hj}$'s in Equation (1). Let $K_b(u) = b^{-1} K(u/b)$, where $K$ denotes a continuous kernel function and $b$ is the bandwidth.
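The role of the scaled kernel $K_b$ can be illustrated with a short Python sketch. It assumes the Epanechnikov kernel in its common form $K(u) = \frac{3}{4}(1 - u^2)$ on $|u| \le 1$ (the kernel named in the simulation study; the exact form used there is not recoverable from the text), and the sample points, target point and bandwidth are illustrative values only.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel (assumed form): 0.75 * (1 - u^2) on |u| <= 1, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def scaled_kernel(u, b):
    """Scaled kernel K_b(u) = K(u / b) / b for a bandwidth b > 0."""
    return epanechnikov(np.asarray(u, dtype=float) / b) / b

# Illustrative values (not from the paper): weights that sample points x_s
# receive when predicting at the point x0 with bandwidth b.
x_s = np.array([0.10, 0.30, 0.45, 0.50, 0.80])
x0, b = 0.5, 0.25
weights = scaled_kernel(x_s - x0, b)
print(weights)  # points farther than b from x0 get weight zero
```

Only sample points within one bandwidth of the target point contribute, which is why a small $b$ gives a very local fit and a large $b$ a nearly global one.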

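Then a model-based local polynomial regression estimator of the nonsampled $y_{hj}$'s in the $h$th stratum is given by
$$\hat{m}_{hj} = e_1^{T} \left( X_{hj}^{T} W_{hj} X_{hj} \right)^{-1} X_{hj}^{T} W_{hj}\, y_{s_h}, \tag{6}$$
where $e_1 = (1, 0, \dots, 0)^{T}$ is a vector of length $p + 1$, $y_{s_h}$ is the vector of sampled values $y_{hi}$, $i \in s_h$, $X_{hj}$ is the $n_h \times (p + 1)$ matrix with rows $\left( 1, x_{hi} - x_{hj}, \dots, (x_{hi} - x_{hj})^{p} \right)$, $i \in s_h$, and $W_{hj} = \mathrm{diag}\left\{ K_b(x_{hi} - x_{hj}) \right\}_{i \in s_h}$. Equation (6) holds as long as $X_{hj}^{T} W_{hj} X_{hj}$ is invertible. Denote the estimator of the finite population total by $\hat{Y}_{LP}$ and the estimator within the $h$th stratum by $\hat{Y}_{hLP}$. In stratum $h$, the estimator of the population total based on local polynomial regression is
$$\hat{Y}_{hLP} = \sum_{j \in s_h} y_{hj} + \sum_{j \in r_h} \hat{m}_{hj}, \tag{7}$$
and the estimator of the finite population total is
$$\hat{Y}_{LP} = \sum_{h=1}^{H} \hat{Y}_{hLP}. \tag{8}$$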
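The prediction approach described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' R implementation: it assumes the usual weighted least squares form of the local polynomial fit, an Epanechnikov kernel, and hypothetical function names (`local_poly_predict`, `stratum_total_lp`, `population_total_lp`). Each stratum total is the sum of the observed sample values plus the predictions for the nonsampled units.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel (assumed form): 0.75 * (1 - u^2) on |u| <= 1, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_poly_predict(x0, x_s, y_s, b, p=1):
    """Degree-p local polynomial prediction at x0 from the sample (x_s, y_s)."""
    d = np.asarray(x_s, dtype=float) - x0
    X = np.vander(d, N=p + 1, increasing=True)  # columns 1, d, ..., d^p
    w = epanechnikov(d / b) / b                  # kernel weights K_b(x_i - x0)
    XtW = X.T * w                                # X^T W with W diagonal
    # Solving the weighted normal equations fails if X^T W X is singular.
    beta = np.linalg.solve(XtW @ X, XtW @ np.asarray(y_s, dtype=float))
    return float(beta[0])                        # intercept = fitted value at x0

def stratum_total_lp(x, y, sampled, b, p=1):
    """Observed sample sum plus predicted values for the nonsampled units."""
    x_s, y_s = x[sampled], y[sampled]
    preds = [local_poly_predict(x0, x_s, y_s, b, p) for x0 in x[~sampled]]
    return float(y_s.sum() + np.sum(preds))

def population_total_lp(strata, b, p=1):
    """Sum of the stratum estimators; strata is a list of (x, y, sampled)."""
    return sum(stratum_total_lp(x, y, s, b, p) for x, y, s in strata)

# Illustrative check on exactly linear data, where a local linear fit
# reproduces the mean function and the estimate matches the true total.
x1, x2 = np.linspace(0.0, 0.5, 50), np.linspace(0.5, 1.0, 50)
mask = np.arange(50) % 2 == 0                    # every other unit is sampled
strata = [(x1, 1 + 2 * x1, mask), (x2, 1 + 2 * x2, mask)]
Y_true = sum(float(y.sum()) for _, y, _ in strata)
Y_lp = population_total_lp(strata, b=0.1, p=1)
```

In the sketch only the values of $y$ at sampled units enter the fit; the nonsampled $y$ values are used solely to compute the true total for comparison.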

Properties of Proposed Estimator
In this section, properties of estimator (8) that may be important in practice are studied. In doing so, the following assumptions are made:
1) The regression function $m(x)$ has a bounded second derivative.
2) The marginal density, ( )
3) The conditional variance is bounded and continuous.
4) The kernel density function $K(x)$ is bounded and continuous satisfying the following:  .
These conditions on $K(\cdot)$ were imposed and used in [18] and are purely for the convenience of technical arguments; they can therefore be relaxed.

$\hat{Y}_{LP}$ Is Asymptotically Model-Unbiased
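Now consider the difference between the estimator and the true total, written in terms of the local polynomial predictor $\hat{m}_{hj}$ of $y_{hj}$:
$$\hat{Y}_{LP} - Y = \sum_{h=1}^{H} \sum_{j \in r_h} \left( \hat{m}_{hj} - y_{hj} \right),$$
and taking expectation yields
$$E\left( \hat{Y}_{LP} - Y \right) = \sum_{h=1}^{H} \sum_{j \in r_h} E\left( \hat{m}_{hj} - y_{hj} \right),$$
which is the bias associated with $\hat{Y}_{LP}$.
Approximating $m_{hj}$ by a Taylor series expansion about the point $x_{hj}$ and applying expectations, Theorem 3 of [21] allows that, under conditions (1)-(4), the expectation of each term in this sum vanishes asymptotically. It implies that $E\left( \hat{Y}_{LP} - Y \right) \to 0$, and thus $\hat{Y}_{LP}$ is asymptotically model-unbiased.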

Mean Square Error (MSE) of $\hat{Y}_{LP}$
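The estimator (8) has the MSE
$$\mathrm{MSE}\left( \hat{Y}_{LP} \right) = E\left( \hat{Y}_{LP} - Y \right)^2,$$
which can be decomposed as
$$\mathrm{MSE}\left( \hat{Y}_{LP} \right) = \mathrm{Var}\left( \hat{Y}_{LP} - Y \right) + \left[ E\left( \hat{Y}_{LP} - Y \right) \right]^2.$$
Theorem 1 of [18] gives, under Condition (1), the asymptotic form of this MSE, Equation (24). Observe that Equation (24) tends to zero if $b \to 0$ and $n_h b \to \infty$, and thus $\mathrm{MSE}\left( \hat{Y}_{LP} \right) \to 0$. This shows that $\hat{Y}_{LP}$ is statistically consistent and thus useful.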

Simulation Study
In this section, a study is carried out on the practical performance of several estimators (see Table 1 and Table 2 for the estimators).
The first estimator is design-based, the second one is parametric and model-based while the last two are nonparametric and model-based.

Description of the Population
The working model is taken to be the superpopulation model $\xi$ defined earlier, with $E(y_{hj}) = m(x_{hj})$ and $\mathrm{Cov}(y_{hj}, y_{h'j'})$ as specified there. In this study, four populations are considered, generated from the regression model $y_i = m(x_i) + e_i$ with four mean functions, referred to in the results as the Linear, Bump, Sine and Jump populations; the latter three represent various deviations from the linear model, $m_1$. These populations are plotted in Figure 1. For more on these populations, see [13] and [14].
The errors are assumed to be independent and identically distributed (i.i.d.) normal random variables with mean 0 and standard deviation $\sigma = 0.1$. Each population contains 2000 units, and the population values $x_i$ are simulated as i.i.d. uniform random variables. The Epanechnikov kernel is used for kernel smoothing on each of the populations. In each case, bandwidth values chosen as in [20] (with [15]) are considered.
The data simulation, the estimators and all computations were carried out in R on a desktop computer.
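The set-up just described can be mimicked with the following Python sketch (the study itself used R). Several ingredients are assumptions made only for illustration and are not taken from the paper: the two mean functions `m_linear` and `m_sine`, the uniform range $[0, 1]$, the number of strata `H`, the stratum sample size `n_h`, and the formation of strata from quantiles of $x$. The sketch draws a simple random sample without replacement within each stratum and computes the stratified expansion estimate of the total as a design-based baseline.

```python
import numpy as np

rng = np.random.default_rng(2024)
N, sigma = 2000, 0.1          # population size and error s.d. from the text

def m_linear(x):
    """Hypothetical linear mean function (illustration only)."""
    return 1.0 + 2.0 * (x - 0.5)

def m_sine(x):
    """Hypothetical sine-shaped mean function (illustration only)."""
    return 2.0 + np.sin(2.0 * np.pi * x)

# Auxiliary variable and survey variable for one simulated population.
x = rng.uniform(0.0, 1.0, size=N)
y = m_sine(x) + rng.normal(0.0, sigma, size=N)

# Assumed stratification: H strata formed from quantiles of x.
H, n_h = 4, 50
edges = np.quantile(x, np.linspace(0.0, 1.0, H + 1))
stratum = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, H - 1)

# Simple random sampling without replacement within each stratum.
sampled = np.zeros(N, dtype=bool)
for h in range(H):
    idx = np.flatnonzero(stratum == h)
    sampled[rng.choice(idx, size=n_h, replace=False)] = True

# Stratified expansion estimator: sum over strata of N_h times the sample mean.
Y_true = float(y.sum())
Y_ht = float(sum((stratum == h).sum() * y[sampled & (stratum == h)].mean()
                 for h in range(H)))
```

Repeating the sampling step many times for a fixed population gives the replicated estimates on which bias and efficiency measures are based.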
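To analyze the performance of the proposed estimator against the other estimators, two measures are computed over the replications: the relative absolute bias (RAB), which measures the absolute deviation of the estimates from $Y$ relative to $Y$, and the relative efficiency (RE) with respect to the Horvitz-Thompson (HT) estimator, which compares the mean square error of an estimator with that of the HT estimator. Here $\hat{\theta}$ is the estimator of the finite population total being considered, $Y$ is the true population total and $R$ is the number of replications.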
The relative efficiency (RE) is meant to examine the robustness of the various estimators against the proposed estimator.
The confidence intervals (CI) and the average lengths (AL) of the confidence intervals of the various estimators are also computed, where $CI_U$ and $CI_L$ are the upper and lower confidence limits respectively, AL is the average of $CI_U - CI_L$ over the $R$ replications, and $\hat{\theta}$ and $R$ are as defined earlier.
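As a rough guide to how such performance measures can be computed from replicated estimates, the Python sketch below uses one plausible set of definitions. These are assumptions, since the exact formulae are not recoverable from the text: RAB is taken as the mean of $|\hat{\theta}_r - Y| / |Y|$ over replications, RE as the ratio of the simulated MSE of an estimator to that of the HT estimator, and AL as the mean of upper minus lower confidence limits. The function names and the numbers in the example are hypothetical.

```python
import numpy as np

def relative_absolute_bias(estimates, Y):
    """Assumed definition: mean over replications of |theta_r - Y| / |Y|."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(np.abs(estimates - Y)) / abs(Y))

def simulated_mse(estimates, Y):
    """Mean squared deviation of the replicated estimates from the true total."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - Y) ** 2))

def relative_efficiency(estimates, ht_estimates, Y):
    """Assumed definition: MSE of the estimator divided by MSE of HT."""
    return simulated_mse(estimates, Y) / simulated_mse(ht_estimates, Y)

def average_length(lower, upper):
    """Mean length of the confidence intervals over the replications."""
    return float(np.mean(np.asarray(upper, dtype=float)
                         - np.asarray(lower, dtype=float)))

# Made-up replicated estimates for a true total of 100 (illustration only).
Y = 100.0
theta = np.array([99.0, 101.0, 100.0, 102.0])
theta_ht = np.array([98.0, 102.0, 97.0, 103.0])
rab = relative_absolute_bias(theta, Y)
re = relative_efficiency(theta, theta_ht, Y)
al = average_length(theta - 2.0, theta + 2.0)
```

Under these definitions an RE below one means the estimator has a smaller simulated MSE than the HT estimator.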

Results
The results of this simulation study are summarized in Table 3 and Table 4. For each population, the performance of each estimator is analyzed using the RAB and RE. The RAB measures how close the estimator being considered is to the actual value, while the RE is used to check the robustness of the estimator. For instance, an estimator $\hat{\theta}_1$ is said to be "better" than, or preferable to, another estimator $\hat{\theta}_2$ if its RE is smaller, that is, if $RE(\hat{\theta}_1) < RE(\hat{\theta}_2)$.
Table 2. Summary of the formulae used in computing the respective population totals of the various estimators.
The confidence intervals and the average length of the intervals are also measured for each case. A smaller length is better because it implies that the true population total is captured within a smaller range, and the results are therefore more precise.
The estimators $\hat{Y}_{PE}$ and $\hat{Y}_{LP}$ are tested under the same bandwidth choices. Results of this simulation are shown in Table 3 and Table 4 below.
Table 3 shows the RABs and REs of the various estimators with respect to the Horvitz-Thompson estimator ($\hat{Y}_{HT}$). Table 4 shows the confidence intervals and their average lengths.
In most scenarios, $\hat{Y}_{LP}$ is better than the parametric estimators, but the parametric estimator $\hat{Y}_{REG}$ performs best when the model is correctly specified, as Table 3 shows. This occurs in both the Linear and the Bump populations: in the former, a strong linear relationship holds between the variables, while in the latter, the function is linear over most of its range despite a "bump" over a small part of the range of the $x_{hi}$'s.
When the model is completely misspecified, as in the Sine and Jump populations, greater efficiency can be achieved by the nonparametric regression estimators. This can be seen in Table 3 for the Sine and Jump populations: the nonparametric estimators ($\hat{Y}_{LP}$ and $\hat{Y}_{PE}$) are more efficient than their parametric competitor, $\hat{Y}_{REG}$. When the underlying superpopulation model is completely unknown, a reasonable choice for finite population total estimation would be nonparametric estimators such as $\hat{Y}_{LP}$ and $\hat{Y}_{PE}$ with small bandwidths. This can be seen in Table 3 and Table 4.
In this study, $\hat{Y}_{LP}$ sometimes performs much better than $\hat{Y}_{PE}$ and never worse, and hence the proposed estimator $\hat{Y}_{LP}$ emerges as the best performing among the nonparametric estimators considered here (see Table 3). A good overall performance is observed for the proposed estimator, with smaller values of RAB and RE than the model-based competitor $\hat{Y}_{PE}$ for every population and fixed bandwidth under consideration.
Although $\hat{Y}_{LP}$ is relatively the best estimator, its performance is significantly affected by the bandwidth choice. As the bandwidth increases, some efficiency is lost (see Table 3).

Conclusion
In this study, the performance of the proposed estimator has been investigated against some design-based and model-based regression estimators. The RE values of the proposed estimator are in general close to one. It has been shown that, for whichever bandwidth is considered, $\hat{Y}_{LP}$ essentially dominates $\hat{Y}_{REG}$ for all the populations except the Linear and Bump populations, where $\hat{Y}_{REG}$ is competitive. Further, $\hat{Y}_{LP}$ essentially dominates $\hat{Y}_{HT}$ for all populations except the Jump population, where $\hat{Y}_{HT}$ dominates all estimators being considered. Generally, good confidence intervals are seen for the nonparametric regression estimators, and use of the proposed estimator leads to relatively smaller values of RE compared to other estimators. We conclude that the nonparametric regression approach under stratified random sampling using the proposed estimator yields good results.

Table 1. Estimators being compared in the simulation study.

Table 3. Relative absolute bias (RAB) and relative efficiency (RE) based on 1000 replications of simple random sampling within strata from four fixed populations of size 2000.

Table 4. Estimated lower and upper confidence limits and corresponding average lengths based on 1000 replications of simple random sampling within strata from four fixed populations of size 2000 (LCL is the Lower Confidence Limit, UCL is the Upper Confidence Limit and AL is the Average Length).
Additionally, a close look at the estimated totals in Table 3 shows that, as the bandwidth increases, the local linear regression estimator $\hat{Y}_{LP}$ becomes equivalent to the linear regression estimator $\hat{Y}_{REG}$. This shows that the bandwidth has an effect on the mean square error of $\hat{Y}_{LP}$. In particular, for every bandwidth considered in this study, $\hat{Y}_{LP}$ essentially dominates $\hat{Y}_{REG}$ for all the populations except the Linear and Bump populations, where $\hat{Y}_{REG}$ is competitive. Further, $\hat{Y}_{LP}$ essentially dominates $\hat{Y}_{HT}$ for all populations except the Jump population, where $\hat{Y}_{HT}$ dominates all estimators being considered. The overall performance of $\hat{Y}_{LP}$ is consistently good as long as the bandwidth remains small in this particular study.
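The remark that a large bandwidth makes the local linear fit behave like an ordinary linear regression can be checked numerically. The Python sketch below is illustrative only: it assumes an Epanechnikov kernel, a hypothetical sine-shaped mean function and arbitrary bandwidths (0.1 and 50), and compares a local linear prediction at one point with the ordinary least squares line fitted to the same sample.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel (assumed form): 0.75 * (1 - u^2) on |u| <= 1, else 0."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_linear(x0, x_s, y_s, b):
    """Local linear prediction at x0 by kernel-weighted least squares."""
    d = x_s - x0
    X = np.column_stack([np.ones_like(d), d])
    w = epanechnikov(d / b) / b
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y_s)
    return float(beta[0])

# Simulated sample from a hypothetical nonlinear mean function.
rng = np.random.default_rng(7)
x_s = np.sort(rng.uniform(0.0, 1.0, 200))
y_s = 2.0 + np.sin(2.0 * np.pi * x_s) + rng.normal(0.0, 0.1, 200)

# Ordinary least squares line; np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x_s, y_s, 1)
x0 = 0.25
ols_pred = intercept + slope * x0

gap_small = abs(local_linear(x0, x_s, y_s, 0.1) - ols_pred)   # local fit
gap_large = abs(local_linear(x0, x_s, y_s, 50.0) - ols_pred)  # near-global fit
print(gap_small, gap_large)
```

With the large bandwidth the kernel weights are almost constant across the sample, so the weighted fit is close to the unweighted linear regression; with the small bandwidth the prediction follows the curvature that the straight line misses.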