Effect of Correlation Level on the Use of Auxiliary Variable in Double Sampling for Regression Estimation

Although auxiliary information in double sampling increases the precision of an estimate and addresses the bias caused by non-response in sample surveys, the question remains whether the level of correlation between the auxiliary variable x and the study variable y helps to accomplish the objectives of double sampling. In this research, an empirical study was conducted to ascertain how important the correlation level between the auxiliary variable and the study variable is for realising the full benefit of auxiliary variable(s) in double sampling. Based on the statistical criteria employed, namely minimum variance, coefficient of variation and relative efficiency, it was established that the higher the correlation between the study and auxiliary variable(s), the better the estimator.


Introduction
In sampling theory, auxiliary information may be utilized at any of three stages, or by combining two or all three of them. These stages are: (1) the pre-selection or design stage of the survey, in stratifying the population; (2) the sample selection stage; and (3) the post-selection or estimation stage. In every case, the use of auxiliary information in a sample survey is better than the case where no auxiliary information is utilized. Ratio, regression, product and difference estimators take advantage of auxiliary information at the estimation stage. However, when the population information on the auxiliary variable is not known, the double sampling method becomes necessary for estimation. [1] is of the opinion that the required parameters can be estimated efficiently with ratio and regression methods of estimation under two-phase (double) sampling. Double sampling for ratio estimation is preferred over double sampling for regression estimation if the data under consideration are well fitted by a straight line through the origin [2]. Among the authors who have recently contributed to the use of auxiliary variable(s) to establish various estimators for the population parameters are [3][4][5]. However, for ratio and regression estimation alike, with or without double sampling, there must exist a positive correlation between the auxiliary variable x and the study variable y. This article empirically investigates the importance of the correlation level in the use of an auxiliary variable when estimating the population parameter by double sampling for regression estimation.

Methodology

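Let $y_1, y_2, \ldots, y_n$ and $x_1, x_2, \ldots, x_n$ be the sample values of the main character y and the auxiliary character x respectively, obtained by simple random sampling without replacement (SRSWOR) of sample size n from a population of size N. The linear regression estimator of the mean as given by [6] is

$$\bar{y}_l = \bar{y} + b(\bar{X} - \bar{x}),$$

where $\bar{y}$ and $\bar{x}$ are the sample means, $\bar{X}$ is the population mean of x, and $b = s_{xy}/s_x^2$ is the estimated regression coefficient. Similarly, the estimated mean square error (MSE) of $\bar{y}_l$ is given as

$$\widehat{MSE}(\bar{y}_l) = \frac{1-f}{n}\left(s_y^2 - b^2 s_x^2\right), \qquad f = \frac{n}{N}. \quad (4)$$

Expressing Equation (4) in terms of the correlation coefficient (where $\hat{\rho} = s_{xy}/(s_x s_y)$) gives

$$\widehat{MSE}(\bar{y}_l) = \frac{1-f}{n}\, s_y^2\left(1 - \hat{\rho}^2\right).$$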
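As a minimal sketch of the single-phase regression estimator described above, the following Python function computes $\bar{y}_l = \bar{y} + b(\bar{X} - \bar{x})$ together with the large-sample variance estimate $\frac{1-f}{n}s_y^2(1-\hat{\rho}^2)$. The function name, argument names and the toy data are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean


def regression_estimate(y, x, pop_mean_x, pop_size):
    """Return (estimate, estimated variance) of the population mean of y.

    Assumes SRSWOR and a known population mean of x (pop_mean_x).
    """
    n = len(y)
    y_bar, x_bar = mean(y), mean(x)
    # Sample covariance and variances (divisor n - 1)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    b = s_xy / s_x2                       # estimated regression coefficient
    rho2 = s_xy ** 2 / (s_x2 * s_y2)      # squared sample correlation
    estimate = y_bar + b * (pop_mean_x - x_bar)
    var_hat = (1 - n / pop_size) / n * s_y2 * (1 - rho2)
    return estimate, var_hat


# Hypothetical data: x = household size, y = monthly food expenditure
x = [2, 3, 4, 5, 6, 7]
y = [21, 26, 30, 37, 41, 44]
est, var_hat = regression_estimate(y, x, pop_mean_x=4.8, pop_size=500)
print(round(est, 3), round(var_hat, 4))
```

Because the variance estimate is proportional to $1 - \hat{\rho}^2$, it shrinks towards zero as the sample correlation approaches 1.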

Double Sampling for Regression Estimator
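When double sampling for regression estimation is to be used, the regression line must have a non-zero intercept on the study-variable axis of the scatter diagram. The double sampling linear regression estimator of the population mean is given as

$$\bar{y}_{dl} = \bar{y} + \hat{\beta}(\bar{x}' - \bar{x}),$$

where $\hat{\beta}$ is the estimated simple linear regression coefficient, $\bar{x}'$ is the sample mean of x at the first phase, and $\bar{y}$ and $\bar{x}$ are the second-phase sample means. Reference [7] presented the estimated variance of $\bar{y}_{dl}$ as

$$\hat{V}(\bar{y}_{dl}) = \frac{s_{y.x}^2}{n} + \frac{s_y^2 - s_{y.x}^2}{n'} - \frac{s_y^2}{N}, \quad (7)$$

where $n'$ and $n$ are the first- and second-phase sample sizes, N is the population size and $s_{y.x}^2$ is the residual variance about the regression line. Equation (7) can be expressed in terms of $\hat{\rho}$ (where $s_{y.x}^2 = s_y^2(1 - \hat{\rho}^2)$) as

$$\hat{V}(\bar{y}_{dl}) = \frac{s_y^2\left(1 - \hat{\rho}^2\right)}{n} + \frac{\hat{\rho}^2 s_y^2}{n'} - \frac{s_y^2}{N}.$$

Similarly, [7] presented the optimum variance of the double sampling regression estimator, denoted $\hat{V}_{opt}(\bar{y}_{dl})$.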
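The double sampling regression estimator can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' computation: the variance is written in the correlation form $s_y^2(1-\hat{\rho}^2)/n + \hat{\rho}^2 s_y^2/n' - s_y^2/N$, and the function name, argument names and data are hypothetical.

```python
from statistics import mean, variance


def double_sampling_regression(x_first, x_second, y_second, pop_size):
    """Return (estimate, estimated variance) for the double sampling
    regression estimator of the population mean of y.

    x_first  : auxiliary values observed on the first-phase sample (size n')
    x_second : auxiliary values on the second-phase subsample (size n)
    y_second : study-variable values on the second-phase subsample
    """
    n_first, n = len(x_first), len(y_second)
    x_first_bar = mean(x_first)
    x_bar, y_bar = mean(x_second), mean(y_second)
    s_x2, s_y2 = variance(x_second), variance(y_second)
    s_xy = sum((xi - x_bar) * (yi - y_bar)
               for xi, yi in zip(x_second, y_second)) / (n - 1)
    beta_hat = s_xy / s_x2                 # estimated regression coefficient
    rho2 = s_xy ** 2 / (s_x2 * s_y2)       # squared sample correlation
    estimate = y_bar + beta_hat * (x_first_bar - x_bar)
    var_hat = (s_y2 * (1 - rho2) / n
               + rho2 * s_y2 / n_first
               - s_y2 / pop_size)
    return estimate, var_hat


# Hypothetical first-phase household sizes and a second-phase subsample
x_first = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8]
x_second = [2, 4, 5, 7]
y_second = [20, 31, 36, 47]
print(double_sampling_regression(x_first, x_second, y_second, pop_size=500))
```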

Correlation Coefficient and Coefficient of Determination
The simplest method for measuring the relationship between two variables (one dependent variable and one independent variable) is correlation and regression analysis [8]. The correlation coefficient determines the degree of relationship between variables. The relationship is linear when all points (x, y) on a scatter diagram lie near a straight line, and nonlinear when all points lie near a curve. This work focuses on linear correlation. Correlation between variables can be measured with different indices (coefficients). The three most popular of these indices are Pearson's product-moment correlation coefficient, Spearman's rank correlation coefficient and Kendall's tau coefficient. Kendall's tau, established by [9], can be used as an alternative to Spearman's rank correlation coefficient for ranked data. [10] analysed the properties of Kendall's coefficient and states that "the coefficient we have introduced provides a kind of average measure of the agreement between pairs of numbers ("agreement", that is to say, in respect of order) and thus has evident recommendation as a measure of the concordance between two rankings" and "In general, $\rho$ is an easier coefficient to calculate than $\tau$. We shall see... that from most theoretical points of view $\tau$ is preferable to $\rho$". It should be noted that Kendall uses $\rho$ to represent Spearman's rank correlation coefficient and $\tau$ to represent Kendall's tau correlation coefficient. [11] declared that nowadays the calculation of Kendall's coefficient poses no problem. Kendall's coefficient is equivalent to Spearman's rank coefficient in terms of the underlying assumptions, but the two are not identical in magnitude, since their underlying logic and computational formulae are quite different. Similarly, Kendall's coefficient and Spearman's rank correlation coefficient imply different interpretations. [12,13] examined the use of Pearson's product-moment correlation coefficient and Spearman's rank correlation coefficient for geographical data (that is, map data that are spatially correlated).
Spearman's rank correlation coefficient is a nonparametric (that is, distribution-free) rank statistic proposed as a measure of the strength of the association between two variables, in contrast to Pearson's product-moment coefficient, which is a parametric statistic. Similarly, [14] clarified that Spearman's rank correlation is not a measure of the linear relationship between two variables, as some statisticians have declared. It assesses how well an arbitrary monotonic function can describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables. Unlike Pearson's product-moment coefficient, it does not require the assumption that the relationship between the variables is linear, nor does it require the variables to be measured on interval scales. [14] confirmed that Pearson's product-moment correlation coefficient (represented by r) was the first formal correlation measure and is still the most widely used measure of relationship.
The idea of this paper is to use the correlation coefficient to determine the level of relationship between the auxiliary and study variables, after which the data will be analysed with the double sampling regression-type estimator to find out which correlation level contributes most to the objective of using an auxiliary variable. Having considered all the correlation measures above, this paper will use Pearson's product-moment correlation coefficient.
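To make the contrast between the three indices concrete, the sketch below implements them with the Python standard library only. It is an illustration rather than a reference implementation: Spearman's coefficient is computed as Pearson's coefficient on average ranks, Kendall's coefficient is the simple tau-a form (no correction for ties), and all function names and data are assumptions introduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    num = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    den = sqrt(sum((a - x_bar) ** 2 for a in x)
               * sum((b - y_bar) ** 2 for b in y))
    return num / den


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical monotonic but nonlinear relationship
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y), kendall_tau_a(x, y))
```

For a strictly increasing but curved relationship such as this one, the two rank-based coefficients equal 1 while Pearson's r stays below 1, which is why Pearson's coefficient is the appropriate choice when the interest is in linear correlation.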

Pearson's Product-Moment Correlation Coefficient and Its Coefficient of Determination
Pearson first developed the mathematical formula for this important measure in 1895. [12] presented correlation in Equation (11) as the "function of raw scores and mean":

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}. \quad (11)$$

Equation (11) describes r as the centred and standardized sum of cross-products of two variables. Using the Cauchy-Schwarz inequality, [15] claim that it can be shown that the absolute value of the numerator is less than or equal to the denominator; therefore, $-1 \le r \le 1$. [14] further presented Pearson's product-moment correlation coefficient as a standardized covariance. The correlation coefficient is a rescaled covariance, presented as

$$r = \frac{s_{xy}}{s_x s_y},$$

where $s_{xy}$ = sample covariance of x and y, $s_x$ = sample standard deviation of x, and $s_y$ = sample standard deviation of y. When the covariance is divided by the two standard deviations, its range is rescaled to the interval between $-1$ and $+1$; thus the interpretation of correlation follows as in the case of Equation (11).
Correlation is sometimes criticized as having no clinical interpretation or meaning [16]. This criticism is mitigated by taking the square of the correlation coefficient, which is called the coefficient of determination. [17] expressed the coefficient of determination as the proportion of common variation in the two variables (that is, the "strength" or "magnitude" of the relationship). He emphasized that it is important to know this magnitude or strength in order to evaluate the correlation between variables. The squared index is interpreted as the proportion of variation in one variable accounted for by differences in the other variable. According to [16], the coefficient of determination is $r^2$, where $0 \le r^2 \le 1$.
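The two presentations of Pearson's coefficient (raw scores and means, and rescaled covariance) give the same value, and squaring it gives the coefficient of determination. A short Python check, using hypothetical data and helper names introduced only for this illustration:

```python
from math import sqrt
from statistics import mean, stdev


def pearson_raw(x, y):
    """r as the centred, standardized sum of cross-products."""
    x_bar, y_bar = mean(x), mean(y)
    num = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    den = sqrt(sum((a - x_bar) ** 2 for a in x)
               * sum((b - y_bar) ** 2 for b in y))
    return num / den


def pearson_cov(x, y):
    """r as the sample covariance divided by both standard deviations."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    return s_xy / (stdev(x) * stdev(y))


# Hypothetical household sizes and monthly food expenditure
x = [2, 3, 4, 5, 6]
y = [20, 24, 33, 35, 41]
r = pearson_raw(x, y)
r_squared = r ** 2  # coefficient of determination
print(round(r, 4), round(pearson_cov(x, y), 4), round(r_squared, 4))
```

The $(n-1)$ divisors in the covariance and in the two standard deviations cancel, which is why the two forms agree.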

Error and Interpretation in Correlation Coefficient
The most common error associated with correlation and regression analysis, as emphasized by [16], is confusion in interpreting the correlation coefficient. In particular, the most common interpretive error is to conclude that changes in one variable cause changes in the other. A correlation coefficient indicates only that characteristics vary together or in opposite directions. Failing to interpret the correlation coefficient at all is another common error. [16] claims that the coefficient must be interpreted in light of the relationship under study, and [18] has given different ways to interpret and estimate the coefficient of determination, although these depend on the underlying theory.
For the purpose of this investigation, this article will use the interpretation criteria established by [19] (as seen in Table 1), with amended boundaries (as in Table 2).

Comparison of Estimators
This section describes how the empirical comparison will be carried out. Minimum variance, coefficient of variation and relative efficiency are the statistical measures that will be used to compare the estimated variance and the standard deviation of the double sampling regression-type estimator at three levels of correlation, termed high, moderate and low positive linear correlation (see Table 2 for details of the correlation levels).
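The minimum-variance criterion can be previewed numerically. The sketch below evaluates the correlation form of the double sampling regression variance, $s_y^2\left[(1-\rho^2)/n + \rho^2/n' - 1/N\right]$, at three correlation values. The sample sizes 40 and 120 are those quoted for the empirical study; the correlation values, the variance $s_y^2 = 25$ and the use of 574 as the population size are assumptions made purely for illustration.

```python
def dl_variance(s_y2, rho, n_second, n_first, pop_size):
    """Variance of the double sampling regression estimator in terms of rho."""
    return s_y2 * ((1 - rho ** 2) / n_second
                   + rho ** 2 / n_first
                   - 1 / pop_size)


# Hypothetical correlation levels (not the paper's Table 2 boundaries)
levels = {"high": 0.9, "moderate": 0.6, "low": 0.3}
for label, rho in levels.items():
    v = dl_variance(s_y2=25.0, rho=rho, n_second=40, n_first=120, pop_size=574)
    print(f"{label:9s} rho={rho:.1f} variance={v:.4f}")
```

Since the second-phase sample is smaller than the first-phase sample, the coefficient of $\rho^2$ in this expression is negative, so the variance falls as the correlation rises.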

Coefficient of Variation (CV)
The coefficient of variation is a statistical measure that will be used to assess the level of variability at each of these levels of correlation. [2] defines the coefficient of variation of an estimator $\bar{y}$ as a measure of its relative variability. Mathematically, it is presented as

$$CV(\bar{y}) = \frac{SE(\bar{y})}{\bar{y}},$$

where $\bar{y}$ = sample mean; $SE(\bar{y})$ = standard error of the estimator $\bar{y}$; and $\bar{y} \neq 0$. The estimated coefficient of variation is the standard error expressed as a percentage of the mean.
This can also be presented as

$$CV(\bar{y}) = \frac{SE(\bar{y})}{\bar{y}} \times 100\%. \quad (16)$$

In this article, Equation (16) will be used to compute the coefficient of variation at the different levels of correlation, after which a tabular comparison will be made.
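A minimal Python sketch of the percentage form of the coefficient of variation follows. The function name and the numbers used are hypothetical; the standard error is taken as the square root of the estimated variance.

```python
from math import sqrt


def cv_percent(estimate, variance):
    """Coefficient of variation: standard error as a percentage of the estimate."""
    if estimate == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return 100 * sqrt(variance) / estimate


# Hypothetical estimate of the mean and its estimated variance
print(cv_percent(estimate=50.0, variance=25.0))
```

A smaller value indicates a more precise estimate, which is how the three correlation levels are ranked on this criterion.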

Relative Efficiency
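Relative efficiency is another statistical measure, used here to compare the efficiency of one estimator with that of another. The relative efficiency of estimator "a" to estimator "b" is expressed as

$$RE(a, b) = \frac{\hat{V}(b)}{\hat{V}(a)}, \quad (19)$$

so that a value greater than 1 indicates that estimator "a" is more efficient than estimator "b".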

Relative Efficiency of High Positive Linear Correlation to Medium Positive Linear Correlation
This measures the efficiency of the double sampling regression estimator under high positive linear correlation relative to the same estimator under medium positive linear correlation. It is obtained by applying the relative efficiency expression to the estimated variances at these two correlation levels.

Relative Efficiency of High Positive Linear Correlation to Low Positive Linear Correlation
This measures the efficiency of the double sampling regression estimator under high positive linear correlation relative to the same estimator under low positive linear correlation. It is obtained by applying the relative efficiency expression to the estimated variances at these two correlation levels.

Relative Efficiency of Medium Positive Linear Correlation to Low Positive Linear Correlation
This measures the efficiency of the double sampling regression estimator under medium positive linear correlation relative to the same estimator under low positive linear correlation. It is obtained by applying the relative efficiency expression to the estimated variances at these two correlation levels.
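The three pairwise comparisons can be sketched in Python as below. The convention assumed here is the usual one in which the relative efficiency of estimator "a" to estimator "b" is the variance of "b" divided by the variance of "a"; the variance values are hypothetical and are not the paper's results.

```python
def relative_efficiency(var_a, var_b):
    """Relative efficiency of estimator a to estimator b (assumed: var_b / var_a)."""
    return var_b / var_a


# Hypothetical estimated variances at the three correlation levels
variances = {"high": 4.0, "medium": 6.0, "low": 8.0}
pairs = [("high", "medium"), ("high", "low"), ("medium", "low")]
for a, b in pairs:
    re_ab = relative_efficiency(variances[a], variances[b])
    print(f"RE({a}, {b}) = {re_ab:.3f}")
```

Under this convention, a value above 1 means the first-named correlation level yields the more efficient estimator.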

Empirical Comparison
This research uses primary data obtained from five hundred and seventy-four (574) questionnaires distributed to the staff and students of the Nursing School, the Perioperative Nursing School, the School of Midwifery and the Occupational Health School, all in University College Hospital (UCH) in Oyo State, Nigeria. The double sampling uses the household monthly average expenditure on food consumption (in thousands of Naira) as the study variable y and the household size as the auxiliary variable x. The first- and second-phase sample sizes are obtained at five different levels, as presented in Table 3. Here $n'$ is the sample size at the first phase and $n$ is the sample size at the second phase, while $n'_{opt}$ and $n_{opt}$ are the optimum sample sizes for the first and second phases respectively. This investigation obtains all required quantities at these optimum sample sizes.

At High Positive Correlation
Where there exists high positive correlation, Figure 1 shows a positive linear relationship between the auxiliary and the study variables at $n' = 120$. SPSS software was used to perform simple linear regression analysis on the data, and the model obtained is presented in Equation (22). From Equation (22) and Figure 1, the intercept on the y axis is not zero; hence, these data are suitable for double sampling regression-type estimation. Similarly, the correlation result indicates the proportion of the variation in household expenditure that is associated with household size.

At Medium Positive Correlation
Where there exists medium positive correlation, Figure 2 shows an approximately positive linear relationship between the auxiliary and the study variables at the optimum sample sizes. SPSS software was used to perform simple linear regression analysis on the data, and the model obtained is presented in Equation (23).

At Low Positive Correlation
Where there exists low positive correlation, Figure 3 shows an approximately positive linear relationship between the auxiliary and the study variables at $n' = 120$ and $n = 40$. SPSS software was used to perform simple linear regression analysis on the data, and the model obtained is presented in Equation (24).
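The regression models above were fitted in SPSS. As a rough stand-in, the Python sketch below fits a simple linear regression by least squares and reports the intercept, the slope and the coefficient of determination, which are the quantities used to judge suitability for regression-type estimation. The function name and the data are hypothetical, and the sketch only reports the estimated intercept; it does not perform a significance test on it.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope, r2)."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    r2 = s_xy ** 2 / (s_xx * s_yy)  # coefficient of determination
    return intercept, slope, r2


# Hypothetical second-phase data: household size and food expenditure
x = [2, 3, 4, 5, 6, 7, 8]
y = [19, 25, 28, 36, 39, 47, 50]
intercept, slope, r2 = fit_line(x, y)
print(round(intercept, 3), round(slope, 3), round(r2, 4))
```

An estimated intercept clearly away from zero points towards the regression estimator rather than the ratio estimator, which assumes a line through the origin.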

Computation of the Coefficient of Variation
As proposed in Equation (17), the coefficient of variation for each correlation level is obtained and interpreted in Table 5. The results obtained for the relative efficiency, as derived in Equations (25)-(27), are tabulated in Table 6.

Conclusion
This paper examines the effect of correlation level on the use of an auxiliary variable in double sampling for regression estimation. The findings revealed that double sampling for regression with a high correlation coefficient (between the auxiliary and study variables) has the minimum variance (variance = 5.6671) and hence is the most efficient estimator. Double sampling for regression with a medium correlation coefficient performs next best, while the least efficient estimator is double sampling for regression with a low correlation level. Thus, the higher the correlation coefficient (between the auxiliary and the study variables), the smaller the variance (as seen in Table 4). Similarly, it was discovered that double sampling for regression with a high correlation coefficient has the highest precision, followed by double sampling for regression with a medium correlation coefficient, and finally double sampling for regression with a low correlation coefficient, which has the least precision (CV = 14.6%). Hence, the higher the correlation coefficient (between the auxiliary and the study variables), the higher the precision of the estimate (as revealed in Table 5). Finally, Table 6 reports the relative efficiency of double sampling for regression with a high correlation coefficient over double sampling for regression with a medium correlation coefficient, and likewise over double sampling for regression with a low correlation coefficient. Hence, the higher the correlation coefficient (between the auxiliary and the study variables), the more efficient the estimator.
Although auxiliary information in a double sampling procedure increases the precision of an estimate, this paper suggests that researchers should be aware that the correlation level between the study variable and the auxiliary variable contributes to the efficiency of the estimator under study. In addition, this result can be generalised to all sample survey methodologies that use an auxiliary variable to increase the precision of the estimator.

Figure 1. Scatter plot of y against x at high correlation level.

Figure 2. Scatter plot of y against x at medium correlation level.

From Equation (8), $\hat{V}_{opt}(\bar{y}_{dl}) = 7.5596$, and the corresponding standard error $SE(\bar{y}_{dl})$ is its square root. A summary of the various computations at the three correlation levels is presented in a table.

Figure 3. Scatter plot of y against x at low correlation level.

Table 3. Summary of the first- and second-phase sample sizes at different levels.
