Approximation of Finite Population Totals Using Lagrange Polynomial
1. Introduction
This study approximates the finite population total using the Lagrange polynomial, a technique that, unlike the local polynomial regression estimator, does not require the selection of a bandwidth. Lagrange polynomials are used for polynomial interpolation and extrapolation. For a given set of points $(x_j, y_j)$ with distinct $x_j$, the Lagrange polynomial is the polynomial of lowest degree that takes the value $y_j$ at each $x_j$ (i.e. the functions coincide at each point). Although named after Joseph Louis Lagrange, who published it in 1795, the method was first discovered in 1779 by Edward Waring. It is also an easy consequence of a formula published in 1783 by Leonhard Euler. How the method works is shown in Section 2.
In the context of using auxiliary information from survey data to estimate the population total, [1] defined $U = \{1, 2, \dots, N\}$ as the set of labels for the finite population. Let $(y_i, x_i)$ be the respective values of the study variable y and the auxiliary variable x attached to the ith unit. Of interest is the estimation of the population total $Y_t = \sum_{i \in U} y_i$ using the known population total $X_t = \sum_{i \in U} x_i$ at the estimation stage. Let $s \subset U$ be the set of sampled units under a general sampling design p, and let $\pi_i = P(i \in s)$ be the first order inclusion probabilities. In 1940, Cochran made an important contribution to modern sampling theory by suggesting methods of using auxiliary information at the estimation stage in order to increase the precision of the estimates [2]. He developed the ratio estimator to estimate the population mean or the total of the study variable y. The ratio estimator of the population mean $\bar{Y}$ is of the form
$$\bar{y}_R = \bar{y}\,\frac{\bar{X}}{\bar{x}}$$
where $\bar{y}$ and $\bar{x}$ are the sample means of y and x, and $\bar{X}$ is the known population mean of x. The aim of this method is to use the ratio of sample means of two characters, which would be almost stable under sampling fluctuations and, thus, would provide a better estimate of the true value. It is a well-known fact that $\bar{y}_R$ is more efficient than the sample mean estimator $\bar{y}$, where no auxiliary information is used, if $\rho_{yx}$, the coefficient of correlation between y and x, is greater than half the ratio of the coefficient of variation of x to that of y, that is, if
$$\rho_{yx} > \frac{1}{2}\,\frac{C_x}{C_y} \qquad (1.0)$$
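The ratio estimator and condition (1.0) can be illustrated with a short Python sketch. It assumes the usual form of the estimator (the sample mean of y multiplied by the known population mean of x and divided by the sample mean of x). The function names and the toy data are hypothetical and are not taken from the study.

```python
from statistics import mean


def ratio_estimate(y_sample, x_sample, x_pop_mean):
    """Ratio estimator of the population mean of y: ybar * Xbar / xbar."""
    return mean(y_sample) * x_pop_mean / mean(x_sample)


def ratio_beats_sample_mean(rho_yx, cv_x, cv_y):
    """Condition (1.0): rho_yx must exceed half of C_x / C_y."""
    return rho_yx > 0.5 * cv_x / cv_y


# Hypothetical toy sample in which y is exactly twice x.
y_s = [4.0, 6.0, 8.0]
x_s = [2.0, 3.0, 4.0]

# With a known population mean of x equal to 4, the estimate is 6 * 4 / 3.
print(ratio_estimate(y_s, x_s, x_pop_mean=4.0))

# Equal coefficients of variation: the threshold for rho_yx is 0.5.
print(ratio_beats_sample_mean(rho_yx=0.9, cv_x=0.5, cv_y=0.5))
```

When y is close to proportional to x, the sample ratio is nearly constant from sample to sample, which is the stability the text refers to.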
Thus, if the information on an auxiliary variable is either already available or can be obtained at no extra cost, and it has a high positive correlation with the main character, one would certainly prefer the ratio estimator. Subsequent work has sought to develop more and more superior techniques to reduce bias and also to obtain unbiased estimators with greater precision by modifying either the sampling schemes or the estimation procedures or both. [3] further extended the work of [4] on systematic sampling. [5] also dealt with the problem of estimation using a priori information. Contrary to the situation of the ratio estimator, if variables y and x are negatively correlated, then the product estimator of the population mean $\bar{Y}$ is of the form
$$\bar{y}_P = \bar{y}\,\frac{\bar{x}}{\bar{X}} \qquad (1.1)$$
which was proposed by [6]. It has been observed that the product estimator gives higher precision than the sample mean estimator $\bar{y}$ if
$$\rho_{yx} < -\frac{1}{2}\,\frac{C_x}{C_y} \qquad (1.2)$$
The expressions for the bias and mean square errors of $\bar{y}_R$ and $\bar{y}_P$ have been derived by [7].
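[8] made use of the known value of $\bar{X}$ to define the difference estimator
$$\bar{y}_D = \bar{y} + \beta(\bar{X} - \bar{x}) \qquad (1.3)$$
where β is a constant. The choice of β that minimizes the variance of the estimator is
$$\beta = \frac{S_{yx}}{S_x^2} \qquad (1.4)$$
which is the population regression coefficient of y on x, with $S_{yx}$ the population covariance of y and x and $S_x^2$ the population variance of x. Since β is generally unknown in practice, it is estimated by the sample regression coefficient
$$b = \frac{s_{yx}}{s_x^2} \qquad (1.5)$$
Using the sample regression coefficient b, Watson defined the simple linear regression estimator as
$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) \qquad (1.6)$$
This estimator is biased, but the bias is negligible for large samples.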
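The regression estimator can be sketched in Python as follows. This is a minimal illustration, assuming the standard form in which the sample mean of y is adjusted by b times the difference between the known population mean of x and the sample mean of x, with b the sample regression coefficient of y on x. The function name and the toy data are hypothetical.

```python
from statistics import mean


def regression_estimate(y_sample, x_sample, x_pop_mean):
    """Simple linear regression estimator: ybar + b * (Xbar - xbar)."""
    ybar, xbar = mean(y_sample), mean(x_sample)
    # Sample covariance over sample variance; the (n - 1) factors cancel.
    s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(x_sample, y_sample))
    s_xx = sum((x - xbar) ** 2 for x in x_sample)
    b = s_xy / s_xx
    return ybar + b * (x_pop_mean - xbar)


# Hypothetical toy sample lying exactly on the line y = 2x + 1.
y_s = [3, 5, 7]
x_s = [1, 2, 3]

# With a known population mean of x equal to 4, the estimate is 2 * 4 + 1.
print(regression_estimate(y_s, x_s, x_pop_mean=4))
```

Setting b to a fixed constant instead of estimating it from the sample gives the difference estimator.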
The most common way of defining a class of estimators more efficient than the usual ratio (product) and sample mean estimators is to include one or more unknown parameters in the estimators, whose optimum choice is made by minimizing the corresponding mean square error or variance. Sometimes, such modifications or generalizations are made by mixing two or more estimators with unknown weights, whose optimum values are then determined and generally depend upon population parameters. In order to propose efficient classes of estimators, [9] suggested a one-parameter family of factor-type (F-T) ratio estimators defined as
(1.7)
where
,
,
,
,
. The literature on survey sampling describes a great variety of techniques for using auxiliary information to obtain more efficient estimators. Accordingly, a large number of authors have turned their attention to the formulation of modified ratio and product estimators using information on an auxiliary variate; see, for instance, [10] and Singh et al. [11].
Suppose n is large and
. We assume that
and
are quite close such that
so that the bias of
becomes quite small.
The concept of nonparametric models within a model assisted framework was first introduced by [12] in estimating population parameters like population total and mean. The estimator was based on local polynomial smoothing. For a population of size N and where values for y are fully observed, they proposed the following estimator for population total of the variable y. The estimator could also be written as
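$$\hat{Y} = \sum_{i \in s} \frac{y_i}{\pi_i} + \left( \sum_{i \in U} \hat{m}_i - \sum_{i \in s} \frac{\hat{m}_i}{\pi_i} \right) \qquad (1.8)$$
where $\hat{m}_i$ denotes the local polynomial fit at $x_i$. The first term in (1.8) is a design estimator while the second is a model component. When the sample comprises the whole population, the model component reduces to zero since $\pi_i = 1$ and $s = U$, and we therefore have the actual population total. [13] proposed the superpopulation model ξ, such that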
where
is a known function of $x_i$. They proposed the model calibration estimator for the population total $Y_t$ to be
In local polynomial regression, a lower-order weighted least squares (WLS) regression is fitted at each point of interest x, using data from some neighborhood around x. Following the notation from [14], let the $(X_i, Y_i)$ be ordered pairs such that
$$Y_i = m(X_i) + \sigma(X_i)\,\varepsilon_i \qquad (1.9)$$
where $\varepsilon_i$ has mean 0 and variance 1, $\sigma^2(X_i)$ is the variance of $Y_i$ at the point $X_i$, and $X_i$ comes from some distribution f. In some cases, homoscedastic variance is assumed, so we let $\sigma^2(X_i) = \sigma^2$. It is typically of interest to estimate m(x). Using a Taylor expansion about a point $x_0$:
$$m(x) \approx m(x_0) + m'(x_0)(x - x_0) + \cdots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p \qquad (1.91)$$
We can estimate these terms using weighted least squares by solving the following for β:
$$\min_{\beta} \sum_{i=1}^{n} \Big[ Y_i - \sum_{j=0}^{p} \beta_j (X_i - x_0)^j \Big]^2 K_h(X_i - x_0) \qquad (1.92)$$
In (1.92), h controls the size of the neighborhood around $x_0$, and $K_h(\cdot)$ controls the weights, where $K_h(\cdot) = K(\cdot/h)/h$ and K is a kernel function. Denote the solution to (1.92) as $\hat{\beta}$. Then the estimated derivatives are $\hat{m}^{(r)}(x_0) = r!\,\hat{\beta}_r$. [15] proposed to use a nonparametric method to obtain $\hat{m}(x)$. However, this estimator faces the problem of how to determine the optimal degree of the local polynomial. A higher degree polynomial yields a smoother $\hat{m}(x)$ but worsens the boundary variance [16]. Such estimators are also challenging to employ in cases of multiple covariates and when data are sparse. Another challenge is how to incorporate categorical covariates. It is therefore necessary to consider other methods to recover the fitted values, such as splines. The term spline originally referred to a tool used by draftsmen to draw curves. According to [17], splines are piecewise regression functions that are constrained to join at points called knots.
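The weighted least squares fit in (1.92) can be illustrated for the local linear case. The sketch below assumes degree p = 1 and a Gaussian kernel, neither of which is fixed by the text, and solves the two normal equations in closed form. The function name, the bandwidth and the toy data are hypothetical.

```python
import math


def local_linear(x0, xs, ys, h):
    """Local linear estimate of m(x0) with a Gaussian kernel and bandwidth h."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        # Kernel weight K_h(d), up to a constant that cancels in the ratio.
        w = math.exp(-0.5 * (d / h) ** 2) / h
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    # Intercept of the weighted least squares line, i.e. the fit at x0.
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


# Hypothetical data lying exactly on y = 3x + 1; a local linear fit
# reproduces a straight line whatever the bandwidth.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0 * x + 1.0 for x in xs]
print(local_linear(2.5, xs, ys, h=1.0))
```

The result depends on h whenever the data are not exactly linear, which is the bandwidth selection problem that the Lagrange approach avoids.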
The Horvitz-Thompson (HT) estimator, originally discussed by [18], does not make use of the auxiliary information $x_i$ but instead uses only the study variable $y_i$ to obtain the population total.
Consider a population of size N with units $i = 1, 2, \dots, N$. Suppose we want to select a sample s of size $n_s$.
Let $\pi_i$ be the probability of including the ith unit of the population in sample s. This is called the inclusion probability or first order inclusion probability of the ith unit in the sample.
Let $\pi_{ij}$ be the probability of including both the ith and jth units in the sample. This is called the joint inclusion probability or second order inclusion probability.
When the sample is obtained from a probability sampling design, an unbiased estimator for the total $Y_t = \sum_{i=1}^{N} y_i$ is given by
$$\hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i} \qquad (1.93)$$
$\hat{Y}_{HT}$ is unbiased under the design-based approach [19].
Variance
$$\mathrm{Var}(\hat{Y}_{HT}) = \sum_{i=1}^{N} \sum_{j=1}^{N} (\pi_{ij} - \pi_i \pi_j)\,\frac{y_i}{\pi_i}\,\frac{y_j}{\pi_j}, \qquad \pi_{ii} = \pi_i$$
The variance of this estimator is minimized when $\pi_i \propto y_i$. That is, if the first order inclusion probability is proportional to $y_i$, the resulting HT estimator under this sampling design will have zero variance. However, in practice, we cannot construct such a design because we do not know the value of $y_i$ at the design stage. If there is a good auxiliary variable $x_i$ which is believed to be closely related to $y_i$, then a sampling design with $\pi_i \propto x_i$ can be very efficient. Even so, the HT estimator itself does not use the auxiliary information $x_i$ at the estimation stage; it uses only the study variable $y_i$.
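The HT estimator can be sketched in a few lines of Python. This is an illustrative sketch, assuming the usual form in which each sampled value is weighted by the inverse of its first order inclusion probability. The function name and the toy data are hypothetical.

```python
def horvitz_thompson_total(y_sample, incl_probs):
    """Horvitz-Thompson estimate of the total: sum of y_i / pi_i over the sample."""
    return sum(y / p for y, p in zip(y_sample, incl_probs))


# Hypothetical simple random sample of n = 2 units from N = 5 without
# replacement, so every unit has inclusion probability n / N = 0.4.
print(horvitz_thompson_total([10.0, 20.0], [0.4, 0.4]))

# If every unit is sampled with probability 1 (a census), the estimate
# is the actual total.
print(horvitz_thompson_total([10.0, 20.0, 30.0], [1.0, 1.0, 1.0]))
```

Under simple random sampling the weights are all N / n, so the estimate reduces to N times the sample mean.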
The research literature has revealed that the ratio estimator performs better than the local linear polynomial estimator when the population is linear, no matter which variance is used. The local linear polynomial regression estimator becomes the better estimator when the population is either quadratic or exponential, especially as the sample size increases, which increases the likelihood of outliers in the sample.
One of the most useful and well-known classes of functions mapping the set of real numbers into itself is the algebraic polynomials, the set of functions of the form
$$P_n(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$
where n is a non-negative integer and $a_0, a_1, \dots, a_n$ are real constants. One reason for their importance is that they uniformly approximate continuous functions. By this we mean that given any function, defined and continuous on a closed and bounded interval, there exists a polynomial that is as “close” to the given function as desired [20].
Section 2 briefly introduces the Lagrange polynomial, and Section 2.1 defines it. Section 2.2 discusses properties of polynomial approximations and gives a proof of the Weierstrass theorem. Section 3 presents the main results using real population census data from the Kenya National Bureau of Statistics: Section 3.2 shows how to calculate missing values via interpolation, and Sections 3.3 and 3.4 extrapolate the population totals for 2009 and 2019 respectively. Section 4 concludes that the best approximating polynomial for quick convergence must be a linear one in order to give a better extrapolation.
2. Approximation of Finite Population Totals
In this section, we introduce the Lagrange polynomial as an approximator of finite population totals.
2.1. Proposed Lagrange Polynomial
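Consider a finite population $U = \{1, 2, \dots, N\}$ of N units. Let (y, x) be the (total, year) variables taking non-negative real values $(y_i, x_i)$ respectively on the unit $i$ $(i = 1, 2, \dots, N)$. From the population U, a simple random sample of size n is drawn without replacement. Then, the Lagrange interpolating polynomial is the polynomial p(x) of degree ≤ (n − 1) that passes through the n points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, and is given by:
$$p(x) = \sum_{j=1}^{n} L_j(x),$$
where
$$L_j(x) = y_j \prod_{k=1,\, k \neq j}^{n} \frac{x - x_k}{x_j - x_k}.$$
Written explicitly,
$$p(x) = \frac{(x - x_2)(x - x_3)\cdots(x - x_n)}{(x_1 - x_2)(x_1 - x_3)\cdots(x_1 - x_n)}\,y_1 + \frac{(x - x_1)(x - x_3)\cdots(x - x_n)}{(x_2 - x_1)(x_2 - x_3)\cdots(x_2 - x_n)}\,y_2 + \cdots + \frac{(x - x_1)(x - x_2)\cdots(x - x_{n-1})}{(x_n - x_1)(x_n - x_2)\cdots(x_n - x_{n-1})}\,y_n.$$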
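The definition of p(x) translates directly into code. The following Python sketch evaluates the Lagrange interpolating polynomial through n points with distinct x-values, using exact rational arithmetic and integer inputs. The function name and the toy points are hypothetical and are not the census data.

```python
from fractions import Fraction


def lagrange(points, x):
    """Evaluate the Lagrange interpolating polynomial through `points` at x.

    `points` is a list of (x_j, y_j) pairs with distinct integer x_j.
    """
    total = Fraction(0)
    for j, (xj, yj) in enumerate(points):
        term = Fraction(yj)
        for k, (xk, _) in enumerate(points):
            if k != j:
                term *= Fraction(x - xk, xj - xk)
        total += term  # add the j-th term of the sum
    return total


# Three hypothetical points on y = x**2; the degree-2 interpolant
# reproduces the parabola, including outside the sampled range.
pts = [(1, 1), (2, 4), (3, 9)]
print(lagrange(pts, 2))  # interpolation at a node
print(lagrange(pts, 4))  # extrapolation beyond the last node
```

With n = 2 the polynomial is the straight line through the two sampled points, which is the linear case used in Section 3.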
2.2. Asymptotic Properties of Polynomial Approximations
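Polynomial Approximation of Functions:
Weierstrass Theorem:
Let $f : [a, b] \to \mathbb{R}$ be continuous.
Then there exists a sequence of polynomials $P_n(x)$ such that
$$\sup_{x \in [a, b]} |P_n(x) - f(x)| \to 0$$
as n → ∞.
Proof of Theorem:
Without loss of generality, let $f : [0, 1] \to \mathbb{R}$ be continuous. Define
$$B_n(f)(x) = \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1 - x)^{n-k}$$
(Bernstein Polynomial)
We are going to consider three functions: $f_0(x) = 1$, $f_1(x) = x$ and $f_2(x) = x^2$ and show convergence.
Hence, by the binomial theorem,
$$B_n(f_0)(x) = 1, \qquad B_n(f_1)(x) = x.$$
Also,
$$B_n(f_2)(x) = x^2 + \frac{x(1 - x)}{n},$$
so that
$$\sum_{k=0}^{n} \left(\frac{k}{n} - x\right)^2 \binom{n}{k} x^k (1 - x)^{n-k} = \frac{x(1 - x)}{n} \le \frac{1}{4n}.$$
Let ε > 0, let $M = \sup_{[0,1]} |f|$, and choose δ > 0 such that $|f(s) - f(t)| < \varepsilon/2$ whenever $|s - t| < \delta$, which is possible because f is uniformly continuous on [0, 1]. Splitting the sum over k according to whether $|k/n - x| < \delta$ or not gives
$$|B_n(f)(x) - f(x)| \le \frac{\varepsilon}{2} + \frac{M}{2 n \delta^2} \quad \text{for all } x \in [0, 1].$$
Hence $B_n(f) \to f$ uniformly on [0, 1] as n → ∞.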
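The convergence asserted by the theorem can be checked numerically with Bernstein polynomials. The sketch below assumes the standard Bernstein construction on [0, 1]. The function names, the evaluation grid and the test function are hypothetical choices for illustration.

```python
from math import comb


def bernstein(f, n, x):
    """Value at x of the degree-n Bernstein polynomial of f on [0, 1]."""
    return sum(
        f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1)
    )


def max_error(f, n, grid=101):
    """Largest absolute error of the Bernstein polynomial on an even grid."""
    xs = [i / (grid - 1) for i in range(grid)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in xs)


def square(x):
    return x * x


# For f(x) = x**2 the error is x(1 - x)/n, which peaks at 1/(4n) when x = 0.5.
for n in (10, 100):
    print(n, max_error(square, n))
```

The error shrinks like 1/n for this function, so the convergence is uniform but slow.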
In order to obtain a best approximating polynomial with a small error, one needs to choose the linear interpolating points that are closest to the target point.
3. Main Results
3.1. Data Exploration
The plot shows an upward trend in the population of Kenya. This could be attributed to good health services, which reduce maternal deaths and deaths from disease outbreaks, together with a boost in socio-economic growth and political stability (Figure 1).
Figure 1. Kenya population census data from 1969 to 2009, plotted in green to show the behaviour of the data.
We then selected samples of size two from the 1969 to 2009 population censuses by simple random sampling without replacement, giving ten possible samples in total. The linear polynomial through each sample pair was plotted, two per chart, to approximate the function f(x), shown in green.
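The ten samples of size two can be enumerated directly. The sketch below assumes the census years are the five decennial ones from 1969 to 2009, as the figure captions indicate; the variable names are hypothetical.

```python
from itertools import combinations

# Decennial census years covered by the data (assumed from the figure captions).
census_years = [1969, 1979, 1989, 1999, 2009]

# Every sample of size two drawn without replacement.
pairs = list(combinations(census_years, 2))

print(len(pairs))
for first, second in pairs:
    print(f"[{first}, {second}]")
```

Each pair defines one linear Lagrange polynomial, and consecutive pairs in this listing correspond to the two lines drawn on each of the charts in Figures 2 to 6.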
The chart in Figure 2 comprises two linear polynomials intended to approximate the function in green and so give a better approximation of the population total in 2019. As can be seen, neither linear plot approximates f(x) well enough to help us extrapolate the population total in 2019.
The linear polynomials in Figure 3, in red and blue, are likewise used to approximate the function f(x), in green, so as to help us extrapolate the population total in 2019. Both show high variation in the approximation, although the blue line appears to be better than the red at the end point.
Similarly, the approximating linear polynomials in Figure 4, in red and dotted green, are used to approximate the function f(x), in green. Unfortunately, the two approximating lines are not suitable for extrapolating the population total in 2019.
The approximating linear polynomials in Figure 5 are used to approximate the function f(x), in green, which represents the trend of the entire population. As seen on the chart, the black line performs better at the end point than the blue one but still shows some variation.
Finally, the approximating linear polynomials in Figure 6 are used to approximate the function f(x), which represents the total population trend per year. The chart clearly shows that the black dotted line gives the best approximation on its entire interval, [1999, 2009], making this the interval for the Best Approximating Polynomial (BAP) to approximate the function f(x) uniformly to any degree of accuracy.
Figure 2. Linear polynomials fitted on the intervals [1969, 1979] (yellow) and [1969, 1989] (blue), with the function f(x) in green.
Figure 3. Linear polynomials fitted on the intervals [1969, 1999] (red) and [1969, 2009] (blue), with the function f(x) in green.
Figure 4. Linear polynomials fitted on the intervals [1979, 1989] (green dotted line) and [1979, 1999] (red), with the function f(x) in green.
Figure 5. Linear polynomials fitted on the intervals [1979, 2009] (black) and [1989, 1999] (blue), with the function f(x) in green.
Figure 6. Linear polynomials fitted on the intervals [1989, 2009] (red dotted line) and [1999, 2009] (black dotted line), with the function f(x) in green.
3.2. Calculating Missing Values via Interpolation
and
;
and
Interpolated population totals by year:
1999: 28,686,607
2000: 29,678,956
2001: 30,671,305
2002: 31,663,654
2003: 32,656,003
2004: 33,648,352
2005: 34,640,701
2006: 35,633,050
2007: 36,625,399
2008: 37,617,748
2009: 38,610,097
where i ≥ 2 and h = annual step size
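The interpolated values can be reproduced with a two-point (linear) Lagrange polynomial. The sketch below assumes that the two nodes are the 1999 and 2009 totals, which are the first and last entries of the listing above, and that the step size is one year. The function and variable names are hypothetical.

```python
from fractions import Fraction


def linear_lagrange(x, p0, p1):
    """Two-point Lagrange polynomial through p0 and p1, evaluated at x."""
    (x0, y0), (x1, y1) = p0, p1
    return y0 * Fraction(x - x1, x0 - x1) + y1 * Fraction(x - x0, x1 - x0)


# End points assumed to be the 1999 and 2009 totals from the listing above.
p1999 = (1999, 28_686_607)
p2009 = (2009, 38_610_097)

# One interpolated total per year (annual step size h = 1).
table = {
    year: int(linear_lagrange(year, p1999, p2009)) for year in range(1999, 2010)
}

for year, total in table.items():
    print(year, f"{total:,}")
```

Because the polynomial is linear, consecutive values differ by a constant annual increment of 992,349.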
3.3. Approximation of Population Total in 2009
and
given
and
approximated
and
approximated
Approximated value = L9 + L10
Approximated population total = 38,610,097
Error = 0
3.4. Extrapolation of 2019 Population Total
and
and
Approximated value = L19 + L20
Approximated population total = 48,533,587
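The two-term sums in Sections 3.3 and 3.4 can be reproduced by stepping forward one year at a time. The sketch below assumes that L9 and L10 are the two Lagrange terms built on the ninth and tenth interpolated values (2007 and 2008), and that L19 and L20 are built in the same way on the values for 2017 and 2018. The function and variable names are hypothetical.

```python
def two_point_step(y_prev2, y_prev1):
    """Extrapolate one step ahead from two equally spaced values.

    With nodes at x - 2h and x - h, the two Lagrange terms evaluated at x
    are -y_prev2 and 2 * y_prev1, so their sum is 2 * y_prev1 - y_prev2.
    """
    return -y_prev2 + 2 * y_prev1


# Interpolated totals for 2007 and 2008, taken from Section 3.2.
totals = {2007: 36_625_399, 2008: 37_617_748}

# March forward annually; each new value uses the two preceding ones.
for year in range(2009, 2020):
    totals[year] = two_point_step(totals[year - 2], totals[year - 1])

print(totals[2009])  # 38610097
print(totals[2019])  # 48533587
```

Under these assumptions the 2009 value matches the census exactly because the 2007 and 2008 inputs were themselves interpolated on the line through the 1999 and 2009 censuses, and the 2019 figure is the continuation of that same line.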
4. Conclusion
In this work, the Lagrange polynomial has proven to be a good technique for approximating the population total from data obtained from the Kenya National Bureau of Statistics (KNBS). The research revealed that subsequent population totals are better approximated using a sample closest to the target population being approximated. Therefore, the best approximating polynomial must be linear in form in order to obtain convergence with diminishing variation in a given interval. The precision of this technique is best measured by the outcome of the interpolation of missing values shown in the results above, which was used to extrapolate the population total in 2009 and gave a value equal to the exact population obtained in that census. We therefore conclude that the population of Kenya at the 2019 census will be forty-eight million, five hundred and thirty-three thousand, five hundred and eighty-seven (48,533,587).
Acknowledgements
We are grateful to the authors for their numerous and valuable contributions to this work, most especially the first author.
Conflict of Interest
The author(s) declare(s) that there is no conflict of interest regarding the publication of this paper.