Approximation of Finite Population Totals Using Lagrange Polynomial

Abstract

Approximation of finite population totals in the presence of auxiliary information is considered. An approximating polynomial based on the Lagrange polynomial is proposed. Like the local polynomial regression, Horvitz-Thompson and ratio estimators, this technique uses annual population totals to fit the best approximating polynomial over a given period of time (in years). The proposed technique is shown to be unbiased under a linear polynomial. Application to real data indicates that the polynomial is efficient and approximates well even when the data are unevenly spaced.

Share and Cite:

Kabareh, L. , Mageto, T. and Muema, B. (2017) Approximation of Finite Population Totals Using Lagrange Polynomial. Open Journal of Statistics, 7, 689-701. doi: 10.4236/ojs.2017.74048.

1. Introduction

This study approximates the finite population total using the Lagrange polynomial, a technique that, unlike the local polynomial regression estimator, does not require the selection of a bandwidth. Lagrange polynomials are used for polynomial interpolation and extrapolation. For a given set of points $(x_j, y_j)$ with distinct $x_j$, the Lagrange polynomial is the polynomial of lowest degree that takes the value $y_j$ at each $x_j$ (i.e. the polynomial and the data coincide at each point). Although named after Joseph Louis Lagrange, who published it in 1795, the method was first discovered in 1779 by Edward Waring. It is also an easy consequence of a formula published in 1783 by Leonhard Euler. How the method works is shown in Section 2.

In the context of using auxiliary information from survey data to estimate the population total, [1] defined $U_1, U_2, \ldots, U_N$ as the set of labels for the finite population. Let $(y_i, x_i)$ be the respective values of the study variable $y$ and the auxiliary variable $x$ attached to the $i$th unit. Of interest is the estimation of the population total $Y_t = \sum_{i=1}^{N} y_i$ using the known population total $X_t = \sum_{i=1}^{N} x_i$ at the estimation stage. Let $s_1, s_2, \ldots, s_n$ be the set of sampled units under a general sampling design $p$, and let $\pi_i = p(i \in s)$ be the first-order inclusion probabilities. In 1940, Cochran made an important contribution to modern sampling theory by suggesting methods of using auxiliary information at the estimation stage to increase the precision of the estimates [2]. He developed the ratio estimator to estimate the population mean or total of the study variable $y$. The ratio estimator of the population mean $\bar{Y}$ is of the form

$$\bar{y}_r = \frac{\bar{y}}{\bar{x}} \bar{X}; \quad \bar{x} \neq 0$$

The aim of this method is to use the ratio of the sample means of two characters, which would be almost stable under sampling fluctuations and would thus provide a better estimate of the true value. It is well known that $\bar{y}_r$ is more efficient than the sample mean estimator $\bar{y}$, which uses no auxiliary information, if $\rho_{yx}$, the coefficient of correlation between $y$ and $x$, is greater than half the ratio of the coefficient of variation of $x$ to that of $y$, that is, if

$$\rho_{yx} > \frac{1}{2}\left(\frac{C_x}{C_y}\right)$$ (1.0)

Thus, if information on an auxiliary variable is either already available or can be obtained at no extra cost, and it has a high positive correlation with the main character, one would certainly prefer the ratio estimator. Efforts have also been made to develop superior techniques that reduce bias and yield unbiased estimators with greater precision by modifying the sampling scheme, the estimation procedure, or both. [3] further extended the work of [4] on systematic sampling. [5] also dealt with the problem of estimation using a priori information. In contrast to the ratio estimator, if the variables $y$ and $x$ are negatively correlated, the product estimator of the population mean $\bar{Y}$ is of the form

$$\bar{y}_q = \frac{\bar{y}}{\bar{X}} \bar{x}; \quad \bar{X} \neq 0$$ (1.1)

that was proposed by [6]. It has been observed that the product estimator gives higher precision than the sample mean estimator $\bar{y}$ if

$$\rho_{yx} < -\frac{1}{2}\left(\frac{C_x}{C_y}\right)$$ (1.2)

The expressions for the bias and mean square error of $\bar{y}_r$ and $\bar{y}_q$ have been derived by [7].

[8] made use of the known value of $\bar{X}$ to define the difference estimator

$$\bar{y}_d = \bar{y} + \beta(\bar{X} - \bar{x})$$ (1.3)

where $\beta$ is a constant. The choice of $\beta$ that minimizes the variance of the estimator is

$$\beta = \frac{S_{yx}}{S_x^2}$$ (1.4)

which is the population regression coefficient of $y$ on $x$. Since $\beta$ is generally unknown in practice, it is estimated by the sample regression coefficient

$$b = \frac{s_{yx}}{s_x^2}$$ (1.5)

Using the sample regression coefficient $b$, Watson defined the simple linear regression estimator as

$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x})$$ (1.6)

This estimator is biased, the bias being negligible for large samples.
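The ratio, product and regression estimators above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy sample are hypothetical and do not come from the paper, and the sample covariance and variance use the usual n − 1 divisor, which is an assumption since the text does not state it.

```python
from statistics import mean


def ratio_estimator(y, x, X_bar):
    """Ratio estimator of the population mean: (y_bar / x_bar) * X_bar."""
    x_bar = mean(x)
    if x_bar == 0:
        raise ValueError("the sample mean of x must be non-zero")
    return mean(y) / x_bar * X_bar


def product_estimator(y, x, X_bar):
    """Product estimator of the population mean: y_bar * x_bar / X_bar."""
    if X_bar == 0:
        raise ValueError("the population mean of x must be non-zero")
    return mean(y) * mean(x) / X_bar


def regression_estimator(y, x, X_bar):
    """Linear regression estimator: y_bar + b * (X_bar - x_bar)."""
    n = len(y)
    y_bar, x_bar = mean(y), mean(x)
    # Sample covariance and variance (n - 1 divisor is an assumption).
    s_yx = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b = s_yx / s_xx  # sample regression coefficient
    return y_bar + b * (X_bar - x_bar)


# Hypothetical toy sample in which y = 2x + 1 exactly.
x = [2.0, 4.0, 6.0]
y = [5.0, 9.0, 13.0]
X_bar = 5.0  # assumed known population mean of x

print(ratio_estimator(y, x, X_bar))
print(product_estimator(y, x, X_bar))
print(regression_estimator(y, x, X_bar))
```

Because the toy data lie exactly on a line, the regression estimator returns the value of that line at the assumed population mean of x, whereas the ratio and product estimators do not.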

The most common way of defining a class of estimators more efficient than the usual ratio (product) and sample mean estimators is to include one or more unknown parameters in the estimator, whose optimum values are chosen by minimizing the corresponding mean square error or variance. Sometimes such modifications or generalizations are made by mixing two or more estimators with unknown weights, whose optimum values generally depend on population parameters. In order to propose efficient classes of estimators, [9] suggested a one-parameter family of factor-type (F-T) ratio estimators defined as

$$\bar{y}_f = \bar{y}\left[\frac{(A + C)\bar{X} + fB\bar{x}}{(A + fB)\bar{X} + C\bar{x}}\right]$$ (1.7)

where $A = (d-1)(d-2)$, $B = (d-1)(d-4)$, $C = (d-2)(d-3)(d-4)$, $d > 0$ and $f = n/N$.

The literature on survey sampling describes a great variety of techniques for using auxiliary information to obtain more efficient estimators. Accordingly, a large number of authors have turned their attention to the formulation of modified ratio and product estimators using information on an auxiliary variate; see, for instance, [10] and Singh et al. [11].

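Let $\hat{R} = \bar{y}/\bar{x}$ denote the sample ratio and $R = \bar{Y}/\bar{X}$ the population ratio. Suppose $n$ is large and $MSE(\hat{R}) = Var(\hat{R})$. We assume that $\bar{x}$ and $\bar{X}$ are quite close, such that

$$\hat{R} - R = \frac{\bar{y} - R\bar{x}}{\bar{x}} \approx \frac{\bar{y} - R\bar{x}}{\bar{X}}$$

so that the bias of $\hat{R}$ becomes quite small.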

The concept of nonparametric models within a model-assisted framework was first introduced by [12] for estimating population parameters such as the population total and mean. The estimator was based on local polynomial smoothing. For a population of size $N$ in which the values of $y$ are fully observed, they proposed the following estimator for the population total of the variable $y$:

$$\hat{Y}_{gen} = \sum_{i \in s} \frac{y_i}{\pi_i} + \left(\sum_{j=1}^{N} \hat{\mu}(x_j) - \sum_{i \in s} \frac{\hat{\mu}(x_i)}{\pi_i}\right)$$ (1.8)

The first term in (1.8) is a design estimator, while the second is a model component. When the sample comprises the whole population, the model component reduces to zero, since $\pi_i = 1$ for every unit and $s$ is the whole population; we therefore obtain the actual population total. [13] proposed the superpopulation model $\xi$ such that $E_{\xi}(y_i) = \mu(x_i)$, where $\mu(x_i)$ is a known function of $x_i$. They proposed the model calibration estimator for the population total $Y_t$ to be

$$\tilde{Y} = \sum_{i \in s} \frac{y_i}{\pi_i}$$

In local polynomial regression, a low-order weighted least squares (WLS) regression is fit at each point of interest $x$, using data from some neighborhood around $x$. Following the notation of [14], let $(X_i, Y_i)$ be ordered pairs such that

$$Y_i = m(X_i) + \sigma(X_i)\varepsilon_i$$ (1.9)

where $\varepsilon_i \sim N(0, 1)$, $\sigma^2(X_i)$ is the variance of $Y_i$ at the point $X_i$, and $X_i$ comes from some distribution $f$. In some cases homoscedastic variance is assumed, so that $\sigma^2(X) = \sigma^2$. It is typically of interest to estimate $m(x)$. Using a Taylor expansion about $x_0$:

$$m(x) \approx m(x_0) + m'(x_0)(x - x_0) + \cdots + \frac{m^{(n)}(x_0)}{n!}(x - x_0)^n$$ (1.91)

We can estimate these terms using weighted least squares by minimizing the following with respect to $\beta$:

$$\sum_{i=1}^{n} \left[Y_i - \sum_{j=0}^{q} \beta_j (X_i - x_0)^j\right]^2 K_h(X_i - x_0)$$ (1.92)

In (1.92), $h$ controls the size of the neighborhood around $x_0$, and $K_h(\cdot)$ controls the weights, where $K_h(\cdot) \equiv K(\cdot/h)/h$ and $K$ is a kernel function. Denote the solution to (1.92) by $\hat{\beta}$. The estimated $v$th derivative is then $\hat{m}^{(v)}(x_0) = v!\,\hat{\beta}_v$. [15] proposed using a nonparametric method to obtain $\mu(\cdot)$. However, this estimator faces the problem of how to determine the optimal degree of the local polynomial. A higher-degree polynomial yields a smoother $\hat{\mu}(\cdot)$ but worsens the boundary variance [16]. Such estimators are challenging to employ in cases of multiple covariates and when data are sparse. Another challenge is how to incorporate categorical covariates. It is therefore necessary to consider other methods of recovering the fitted values, such as splines. The term spline originally referred to a tool used by draftsmen to draw curves. According to [17], splines are piecewise regression functions that are constrained to join at points called knots.
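As an illustration of the weighted least squares problem (1.92), the following minimal sketch fits a local linear polynomial (q = 1) at a single point. The Gaussian kernel, the bandwidth value and the toy data are assumptions made for the example; the paper does not specify any of them, and the function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_linear_fit(xs, ys, x0, h):
    """Minimise sum_i [y_i - b0 - b1 (x_i - x0)]^2 K_h(x_i - x0) over (b0, b1).

    Returns (b0, b1); b0 estimates m(x0) and b1 estimates m'(x0).
    """
    if h <= 0:
        raise ValueError("the bandwidth h must be positive")
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(xs, ys):
        d = xi - x0
        w = gaussian_kernel(d / h) / h  # K_h(d) = K(d / h) / h
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    # Solve the 2 x 2 weighted normal equations by Cramer's rule.
    det = s0 * s2 - s1 * s1
    if det == 0.0:
        raise ValueError("degenerate design: cannot fit a local line")
    b0 = (s2 * t0 - s1 * t1) / det
    b1 = (s0 * t1 - s1 * t0) / det
    return b0, b1


# Hypothetical data lying exactly on the line y = 3 + 2x.
xs = [float(i) for i in range(10)]
ys = [3.0 + 2.0 * x for x in xs]
m_hat, slope_hat = local_linear_fit(xs, ys, x0=4.5, h=1.5)
print(m_hat, slope_hat)
```

On exactly linear data the local linear fit recovers the line regardless of the bandwidth; on curved data the result depends on h, which is the bandwidth-selection issue that the Lagrange approach avoids.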

The Horvitz-Thompson (HT) estimator, originally discussed by [18], does not make use of the auxiliary information $x_i$ but instead uses only the study variable $y_i$ to obtain the population total.

Consider a population of size $N$ with units $y_1, y_2, y_3, \ldots, y_N$. Suppose we want to select a sample $s$ of size $n_s$.

Let πi be the probability of including ith unit of the population in sample s. This is called the inclusion probability or first order inclusion probability of ith unit in the sample.

Let πij be the probability of including ith and jth units in the sample. This is called the joint inclusion probability or second order inclusion probability.

When the sample is obtained from a probability sampling design, an unbiased estimator of the total $Y = \sum_{i=1}^{N} y_i$ is given by

$$\hat{Y}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i} = \sum_{i \in s} y_i \pi_i^{-1}$$ (1.93)

$\hat{Y}_{HT}$ is unbiased under the design-based approach [19].

Variance

$$V(\hat{Y}_{HT}) = \sum_{i=1}^{N} \sum_{j=1}^{N} (\pi_{ij} - \pi_i \pi_j) \frac{y_i y_j}{\pi_i \pi_j}$$

The variance of this estimator is minimized when $\pi_i \propto y_i$. That is, if the first-order inclusion probability is proportional to $y_i$, the resulting HT estimator under this sampling design has zero variance. In practice, however, we cannot construct such a design because the values of $y_i$ are not known at the design stage. If there is a good auxiliary variable $x_i$ that is believed to be closely related to $y_i$, then a sampling design with $\pi_i \propto x_i$ can be very efficient. Note that the HT estimator itself does not use the auxiliary information $x_i$ at the estimation stage; it uses only the study variable $y_i$ to obtain the population total.
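The design-unbiasedness of the HT estimator can be checked numerically on a small example. The sketch below assumes simple random sampling without replacement, for which the first-order inclusion probability of every unit is n/N; the four-unit population and the function name are hypothetical.

```python
from itertools import combinations


def horvitz_thompson_total(y_sample, pi_sample):
    """HT estimator: sum of y_i / pi_i over the sampled units."""
    if any(p <= 0 or p > 1 for p in pi_sample):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return sum(y / p for y, p in zip(y_sample, pi_sample))


# Hypothetical population of N = 4 units; the true total is 20.
population = [3.0, 5.0, 8.0, 4.0]
n = 2
pi = n / len(population)  # inclusion probability under SRSWOR

# Enumerate every possible sample of size n and average the HT estimates.
estimates = [
    horvitz_thompson_total(sample, [pi] * n)
    for sample in combinations(population, n)
]
mean_estimate = sum(estimates) / len(estimates)

print(mean_estimate, sum(population))
```

Averaging over all equally likely samples reproduces the true total, which is what design-unbiasedness means; individual estimates still vary from sample to sample.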

Research literature has revealed that the ratio estimator performs better than the local linear polynomial estimator when the population is linear no matter which variance is used. The local linear polynomial regression estimator becomes a better estimator when the population used is either quadratic or exponential especially with an increase in the sample size which increases the likelihood of outliers in the sample.

One of the most useful and well-known classes of functions mapping the set of real numbers into itself is the algebraic polynomials, the set of functions of the form

$$P_n(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$

where $n$ is a non-negative integer and $a_0, \ldots, a_n$ are real constants. One reason for their importance is that they uniformly approximate continuous functions. By this we mean that, given any function defined and continuous on a closed and bounded interval, there exists a polynomial that is as “close” to the given function as desired [20].

Section 2 briefly introduces the Lagrange polynomial, which is defined in Section 2.1. Section 2.2 discusses the properties of polynomial approximations and a proof of the Weierstrass theorem. Section 3 presents the main results, using real population census data from the Kenya National Bureau of Statistics. Section 3.2 shows how to calculate missing values via interpolation, and Sections 3.3 and 3.4 approximate the population totals for 2009 and 2019, respectively. Section 4 concludes that the best approximating polynomial for quick convergence must be a linear one in order to give a better extrapolation.

2. Approximation of Finite Population Totals

In this section we introduce the Lagrange polynomial as an approximator of finite population totals.

2.1. Proposed Lagrange Polynomial

Consider a finite population $U = \{U_1, U_2, \ldots, U_N\}$ of $N$ units. Let $(y, x)$ be the (total, year) variables taking non-negative real values $(y_i, x_i)$, respectively, on the unit $U_i$ $(i = 1, 2, \ldots, N)$. From the population $U$, a simple random sample of size $n$ is drawn without replacement. Then the Lagrange interpolating polynomial is the polynomial $P(x)$ of degree $\le (n - 1)$ that passes through the $n$ points $(x_1, y_1 = f(x_1)), (x_2, y_2 = f(x_2)), \ldots, (x_n, y_n = f(x_n))$ and is given by:

$$P(x) = \sum_{j=1}^{n} P_j(x),$$

where

$$P_j(x) = y_j \prod_{k=1,\, k \neq j}^{n} \frac{x - x_k}{x_j - x_k}.$$

Written explicitly,

$$P(x) = \frac{(x - x_2)(x - x_3)\cdots(x - x_n)}{(x_1 - x_2)(x_1 - x_3)\cdots(x_1 - x_n)}\, y_1 + \frac{(x - x_1)(x - x_3)\cdots(x - x_n)}{(x_2 - x_1)(x_2 - x_3)\cdots(x_2 - x_n)}\, y_2 + \cdots + \frac{(x - x_1)\cdots(x - x_{n-1})}{(x_n - x_1)\cdots(x_n - x_{n-1})}\, y_n$$
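The definition above translates directly into code. The following minimal sketch evaluates the Lagrange interpolating polynomial at a point; the function name and the three sample points are hypothetical and serve only to illustrate the formula.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate P(x) = sum_j y_j * prod_{k != j} (x - x_k) / (x_j - x_k)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(set(xs)) != len(xs):
        raise ValueError("the interpolation nodes must be distinct")
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        term = yj
        for k, xk in enumerate(xs):
            if k != j:
                term *= (x - xk) / (xj - xk)
        total += term
    return total


# Hypothetical points taken from f(x) = x^2; three points give a quadratic,
# so the interpolant reproduces f exactly.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 4.0, 9.0]
print(lagrange_interpolate(xs, ys, 2.5))
```

With n points the polynomial has degree at most n − 1, so two points give the linear polynomials used in Section 3.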

2.2. Asymptotic Properties of Polynomial Approximations

Polynomial Approximation of Functions:

Weierstrass Theorem:

Let $f : [a, b] \to \mathbb{R}$ be continuous.

Then there exists a sequence of polynomials $P_n(x)$ such that $\|f - P_n\|_\infty = \max_{x \in [a, b]} |f(x) - P_n(x)| \to 0$ as $n \to \infty$.

Proof of Theorem:

Let $f : [a, b] = [0, 1] \to \mathbb{R}$ be continuous, and define

$$P_n(x) = B_n(f)(x) = \sum_{k=0}^{n} \frac{n!}{k!(n-k)!}\, f\!\left(\frac{k}{n}\right) x^k (1 - x)^{n-k}$$

(Bernstein polynomial). We show that

$$\|f - P_n\|_\infty \to 0 \text{ as } n \to \infty.$$

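We are going to consider three functions, $f(x) = 1$, $f(x) = x$ and $f(x) = x^2$, and show convergence for each. Since the Bernstein operators $B_n$ are positive and linear, convergence for these three test functions is enough to give convergence for every continuous $f$ on $[0, 1]$ (Korovkin's theorem).

For $f(x) = 1$:

$$B_n(f)(x) = \sum_{k=0}^{n} \frac{n!}{k!(n-k)!}\, x^k (1 - x)^{n-k} = (x + 1 - x)^n = 1, \quad n \ge 0$$

Hence

$$\|f - B_n(f)\|_\infty = 0$$

Also, for $f(x) = x$:

$$B_n(f)(x) = \sum_{k=0}^{n} \frac{n!}{k!(n-k)!}\, \frac{k}{n}\, x^k (1 - x)^{n-k} = \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!(n-k)!}\, x^k (1 - x)^{n-k}$$

Let $L = k - 1$:

$$= x \sum_{L=0}^{n-1} \frac{(n-1)!}{L!(n-1-L)!}\, x^L (1 - x)^{n-1-L} = x$$

Hence

$$\|B_n(f) - f\|_\infty = 0, \quad n \ge 1$$

For $f(x) = x^2$:

$$B_n(f)(x) = \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!(n-k)!}\, \frac{k - 1 + 1}{n}\, x^k (1 - x)^{n-k} = \sum_{k=2}^{n} \frac{(n-1)!}{(k-2)!(n-k)!}\, \frac{1}{n}\, x^k (1 - x)^{n-k} + \frac{1}{n} \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!(n-k)!}\, x^k (1 - x)^{n-k}$$

$$B_n(f)(x) = \frac{n-1}{n}\, x^2 \sum_{k=2}^{n} \frac{(n-2)!}{(k-2)!(n-k)!}\, x^{k-2} (1 - x)^{n-k} + \frac{1}{n}\, x = \frac{n-1}{n}\, x^2 + \frac{1}{n}\, x$$

$$\|f - B_n(f)\|_\infty = \max_{x \in [0, 1]} \frac{x(1 - x)}{n} = \frac{1}{4n} \to 0 \text{ as } n \to \infty$$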
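The three cases above can be checked numerically. The sketch below evaluates the Bernstein polynomial of a function and estimates the maximum error on a grid over [0, 1]; the grid size and the function names are choices made for this illustration, not part of the paper.

```python
from math import comb


def bernstein(f, n, x):
    """Evaluate B_n(f)(x) = sum_k C(n, k) f(k / n) x^k (1 - x)^(n - k)."""
    return sum(
        comb(n, k) * f(k / n) * x**k * (1 - x) ** (n - k)
        for k in range(n + 1)
    )


def sup_error(f, n, grid_size=1001):
    """Approximate max |f(x) - B_n(f)(x)| over an even grid on [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in grid)


n = 10
print(sup_error(lambda x: 1.0, n))    # zero up to rounding
print(sup_error(lambda x: x, n))      # zero up to rounding
print(sup_error(lambda x: x * x, n))  # close to 1 / (4 n)
```

For f(x) = x² the grid maximum is attained at x = 1/2, in agreement with the bound 1/(4n) derived above.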

To obtain a best approximating polynomial with small error, one needs to choose linear interpolating points that are closest to the target point.

3. Main Results

3.1. Data Exploration

The plot showed an upward growth in the population of Kenya. This could be attributed to good health services causing a reduction in the maternal death, deaths as a result of disease outbreak, a boost in socio-economic growth and political stability (Figure 1).

Figure 1. Kenya population census data from 1969 to 2009, plotted in green to show the behaviour of the data.

We then selected samples of size two from the 1969 to 2009 population censuses by simple random sampling without replacement, giving a total of ten samples. The pairs of linear polynomials obtained from these samples were plotted on the same charts as the function f(x), shown in green, as described below for each chart.
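The count of ten samples can be reproduced by enumerating all pairs of census years. The sketch assumes the census years are 1969, 1979, 1989, 1999 and 2009, which is how they appear in the figure captions; the variable names are hypothetical.

```python
from itertools import combinations

# Census years as they appear in the figure captions (an assumption here).
census_years = [1969, 1979, 1989, 1999, 2009]

# Every unordered pair of distinct years is one possible sample of size two.
samples = list(combinations(census_years, 2))

for start, end in samples:
    print(f"[{start}, {end}]")
print(len(samples))
```

Each pair defines one linear Lagrange polynomial, and the charts show two of these polynomials at a time.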

The chart in Figure 2 comprises two linear polynomials used to uniformly approximate the function in green, with a view to approximating the population total in 2019. As can be seen, neither of the two linear plots approximates the function f(x) in green well enough to help us extrapolate the population total in 2019.

The linear polynomials in Figure 3, in red and blue, are used to uniformly approximate the function f(x) in green so as to help us extrapolate the population total in 2019. The approximation clearly shows high variation. The blue line appears to be better than the red at the end point.

Similarly, the approximating linear polynomials in red and green in Figure 4 are used to approximate the function f(x) in green. Unfortunately, the two approximating lines are not suitable for extrapolating the population total in 2019.

The approximating linear polynomials shown in Figure 5 are used to uniformly approximate the function f(x) in green, which represents the trend of the entire population. As seen on the chart, the black line appears to perform better at the end point than the blue, but shows some variation.

Finally, the approximating linear polynomials in Figure 6 are used to uniformly approximate the function f(x) representing the total population trend per year. The chart clearly shows that the black dotted line gives the best approximation on its entire interval, [1999, 2009], which is therefore taken as the interval for the Best Approximating Polynomial (BAP) to approximate the function f(x) uniformly to any degree of accuracy.

Figure 2. Linear polynomials obtained from the samples [1969, 1979] (yellow) and [1969, 1989] (blue), together with the function f(x) (green).

Figure 3. Linear polynomials obtained from the samples [1969, 1999] (red) and [1969, 2009] (blue), together with the function f(x) (green).

Figure 4. Linear polynomials obtained from the samples [1979, 1989] (green dotted line) and [1979, 1999] (red), together with the function f(x) (green).

Figure 5. Linear polynomials obtained from the samples [1979, 2009] (black) and [1989, 1999] (blue), together with the function f(x) (green).

Figure 6. Linear polynomials obtained from the samples [1989, 2009] (red dotted line) and [1999, 2009] (black dotted line), together with the function f(x) (green).

3.2. Calculating Missing Values via Interpolation

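The census totals at the end points are given: $x[1] = 1999$ with $y[1] = 28{,}686{,}607$, and $x[11] = 2009$ with $y[11] = 38{,}610{,}097$. The annual totals obtained by interpolation are:

i     x[i]    y[i]
1     1999    28,686,607
2     2000    29,678,956
3     2001    30,671,305
4     2002    31,663,654
5     2003    32,656,003
6     2004    33,648,352
7     2005    34,640,701
8     2006    35,633,050
9     2007    36,625,399
10    2008    37,617,748
11    2009    38,610,097

$$y[i] = y[i-1] + \frac{y[11] - y[i-1]}{h}$$

where $i \ge 2$ and $h$ is the number of annual steps remaining from $x[i-1]$ to $x[11]$, that is, $h = 12 - i$; this is the value of $h$ that reproduces the totals tabulated above.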
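The interpolated totals can be reproduced with a short script. This sketch assumes that h in the recurrence is the number of annual steps remaining to 2009 (h = 12 − i for eleven points), which is the reading that matches the tabulated values; the function name is hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def fill_annual_totals(y_start, y_end, n_points):
    """Apply y[i] = y[i-1] + (y[n] - y[i-1]) / h for i = 2, ..., n.

    h is taken as the number of steps remaining from x[i-1] to x[n]
    (an assumption about the meaning of h in the text).
    """
    totals = [Fraction(y_start)]
    for i in range(2, n_points + 1):  # i is the 1-based index
        h = n_points + 1 - i
        totals.append(totals[-1] + (Fraction(y_end) - totals[-1]) / h)
    return totals


# Census totals for 1999 and 2009 as given in the text.
totals = fill_annual_totals(28_686_607, 38_610_097, 11)
for year, total in zip(range(1999, 2010), totals):
    print(year, int(total))
```

With this choice of h every annual increment equals 992,349, so the recurrence is equivalent to linear interpolation between the 1999 and 2009 census totals.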

3.3. Approximation of Population Total in 2009

$x[11] = 2009$ and $y[11] = 38{,}610{,}097$ (given)

$x[10] = 2008$ and $y[10] = 37{,}617{,}748$ (approximated)

$x[9] = 2007$ and $y[9] = 36{,}625{,}399$ (approximated)

$$L_9 = \frac{x[11] - x[10]}{x[9] - x[10]}\, y[9]$$

$$L_{10} = \frac{x[11] - x[9]}{x[10] - x[9]}\, y[10]$$

Approximated value = $L_9 + L_{10}$

Approximated population total = 38,610,097

Error = 0
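For reference, the arithmetic behind this result can be written out, using the two-point Lagrange form with nodes 2007 and 2008 evaluated at 2009 and the totals listed above:

```latex
\begin{align*}
L_9    &= \frac{2009 - 2008}{2007 - 2008} \times 36{,}625{,}399 = -36{,}625{,}399 \\
L_{10} &= \frac{2009 - 2007}{2008 - 2007} \times 37{,}617{,}748 = 75{,}235{,}496 \\
L_9 + L_{10} &= 75{,}235{,}496 - 36{,}625{,}399 = 38{,}610{,}097
\end{align*}
```

The sum agrees with the 2009 census total, which is why the reported error is zero.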

3.4. Extrapolation of 2019 Population Total

$x[11] = 2009$ and $y[11] = 38{,}610{,}097$

$x[10] = 2008$ and $y[10] = 37{,}617{,}748$

$$L_{19} = \frac{2019 - x[11]}{x[10] - x[11]}\, y[10]$$

$$L_{20} = \frac{2019 - x[10]}{x[11] - x[10]}\, y[11]$$

Approximated value = $L_{19} + L_{20}$

Approximated population total = 48,533,587
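The extrapolation can be reproduced with a two-point (linear) Lagrange polynomial. The sketch below uses exact rational arithmetic and assumes integer years and totals; the function name is hypothetical.

```python
from fractions import Fraction


def linear_lagrange(x0, y0, x1, y1, x):
    """Evaluate the degree-1 Lagrange polynomial through (x0, y0) and (x1, y1).

    Inputs are assumed to be integers so that the result is exact.
    """
    if x0 == x1:
        raise ValueError("the two nodes must be distinct")
    term0 = Fraction(x - x1, x0 - x1) * y0  # basis polynomial for node x0
    term1 = Fraction(x - x0, x1 - x0) * y1  # basis polynomial for node x1
    return term0 + term1


# Nodes 2008 and 2009 with the totals given in the text.
estimate_2019 = linear_lagrange(2008, 37_617_748, 2009, 38_610_097, 2019)
print(int(estimate_2019))
```

Because the two nodes are one year apart, the result equals the 2009 total plus ten times the annual increment of 992,349.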

4. Conclusion

In this work, the Lagrange polynomial has proved to be a good technique for approximating the population total from data obtained from the Kenya National Bureau of Statistics (KNBS). The research revealed that subsequent population totals are better approximated using a sample closest to the target being approximated. Therefore, the best approximating polynomial must be linear in order to obtain convergence with diminishing variation in a given interval. The precision of this technique is illustrated by the interpolated missing values shown in the results above, which were used to approximate the population total in 2009 and gave a value equal to the exact total obtained in that census. We therefore conclude that the population of Kenya for the 2019 census will be 48,533,587 (forty-eight million five hundred and thirty-three thousand five hundred and eighty-seven).

Acknowledgements

We are grateful to the authors for their numerous and valuable contributions to this work, most especially the first author.

Conflict of Interest

The author(s) declare(s) that there is no conflict of interest regarding the publication of this paper.

References

[1] Deville, J.-C. and Sarndal, C.-E. (1992) Calibration Estimators in Survey Sampling. Journal of the American Statistical Association, 87, 376.
https://doi.org/10.1080/01621459.1992.10475217
[2] Cochran, W.G. and Goulden, C.H. (1940) Methods of Statistical Analysis. Journal of the Royal Statistical Society, 103, 250.
https://doi.org/10.2307/2980420
[3] Cochran, W.G. (1946) Graduate Training in Statistics. The American Mathematical Monthly, 53, 193.
https://doi.org/10.2307/2305269
[4] Nadaraya, E.A. (1964) On Estimating Regression. Theory of Probability and Its Applications, 9, 141-142.
https://doi.org/10.1137/1109020
[5] Singh, V.K., Singh, H.P., Singh, H.P. and Shukla, D. (1994) A General Class of Chain Estimators for Ratio and Product of Two Means of a Finite Population. Communications in Statistics Theory and Methods, 23, 1341-1355.
https://doi.org/10.1080/03610929408831325
[6] Searls, D.T. (1964) The Utilization of a Known Coefficient of Variation in the Estimation Procedure. Journal of the American Statistical Association, 59, 1225.
https://doi.org/10.1080/01621459.1964.10480765
[7] Wu, C. and Sitter, R.R. (2001) A Model-Calibration Approach to Using Complete Auxiliary Information from Survey Data. Journal of the American Statistical Association, 96, 185-193.
https://doi.org/10.1198/016214501750333054
[8] Johnson, A.A., Breidt, F.J. and Opsomer, J.D. (2008) Estimating Distribution Functions from Survey Data Using Nonparametric Regression. Journal of Statistical Theory and Practice, 2, 419-431.
https://doi.org/10.1080/15598608.2008.10411884
[9] Sukhatme, V. (1984) Future Dimensions of World Food and Population. Economic Development and Cultural Change, 32, 892-897.
https://doi.org/10.1086/451435
[10] Watson, G. (1964) Smooth Regression Analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26, 359-372.
[11] Solanki, R.S., Singh, H.P. and Pal, S.K. (2014) Improved Ratio-Type Estimators of Finite Population Variance Using Quartiles. Hacettepe Journal of Mathematics and Statistics, 45, 1.
https://doi.org/10.15672/HJMS.2014448247
[12] Lairez, P. (2016) A Deterministic Algorithm to Compute Approximate Roots of Polynomial Systems in Polynomial Average Time. Foundations of Computational Mathematics.
https://doi.org/10.1007/s10208-016-9319-7
[13] Godambe, V.P. and Thompson, M.E. (1986) Parameters of Super Population and Survey Population: Their Relationships and Estimation. International Statistical Review/Revue International de Statistique, JSTOR, 127-138.
[14] Hansen, M.H., Hurwitz, W.N. and Madow, W.G. (1953) Sample Survey Methods and Theory. Vol. 1, Wiley, New York.
[15] Robson, D.S. (1957) Applications of Multivariate Polykays to the Theory of Unbiased Ratiotype Estimation. Journal of the American Statistical Association, 52, 511-522.
https://doi.org/10.1080/01621459.1957.10501407
[16] Montanari, G.E. and Ranalli, M.G. (2003) On Calibration Methods for Design Based Finite Population Inferences. Bulletin of the International Statistical Institute, 54th Session 60.
[17] Madow, W.G. and Madow, L.H. (1944) On the Theory of Systematic Sampling, I. The Annals of Mathematical Statistics, 15, 1-24.
https://doi.org/10.1214/aoms/1177731312
[18] Keele, L.J. (2008) Semiparametric Regression for the Social Sciences. John Wiley and Sons.
[19] Singh, H.P., Pal, S.K. and Mehta, V. (2016) A Generalized Class of Ratio-Cum-Dual to Ratio Estimators of Finite Population Mean Using Auxiliary Information in Sample Surveys. Mathematical Sciences Letters, 5, 203-211.
https://doi.org/10.18576/msl/050215
[20] Burden, R.L. and Faires, J.D. (2001) Numerical Analysis. Brooks/Cole.

Copyright © 2024 by authors and Scientific Research Publishing Inc.

Creative Commons License

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.