Data Fusion Using Empirical Likelihood

We propose a robust semi-parametric empirical likelihood method to integrate all available information from multiple samples with a common center of measurements. Two different sets of estimating equations are used to improve the classical likelihood inference on the measurement center. The proposed method does not require knowledge of the functional forms of the probability density functions of the related populations. The advantages of the proposed method are demonstrated through extensive simulation studies that compare the mean squared errors, coverage probabilities, and average lengths of confidence intervals with those from the classical likelihood method. Simulation results suggest that our approach provides more informative and efficient inference than the conventional maximum likelihood estimator when a certain structural relationship exists among the parameters of the relevant samples.


Introduction
A common problem in clinical trials and other medical research is how to estimate parameters of interest accurately and efficiently when the current sample size is too small because of cost and time constraints. Often there exist surrogate populations with low sampling cost that could provide relevant information about the population of direct inferential interest. In this article, we propose a robust semi-parametric method that integrates related information from different sources to improve on the classical likelihood method.
The classical likelihood approach is arguably the most widely used method in statistical inference. It has been routinely applied in almost all statistical applications. Despite its great success and excellent asymptotic properties, the classical likelihood has known limitations when making inference from small samples. Consider the following thought experiment. Suppose that a random experiment is to toss a coin twice. The parameter of interest, denoted by $\theta_1$, is the probability that this coin turns up heads. The maximum likelihood estimator (MLE) of $\theta_1$ is denoted by $\hat{\theta}_1$. If the coin is fair, the MLE satisfies

$$P(\hat{\theta}_1 = 0 \text{ or } 1) = \frac{1}{2} \quad \text{and} \quad P\left(\hat{\theta}_1 = \frac{1}{2}\right) = \frac{1}{2}.$$

Thus, the MLE has a 50% chance of producing a nonsensical estimate when the sample size is only two. In addition, suppose that for some reason we cannot use this coin any more but can flip another coin instead. In this situation, the classical likelihood approach would not consider the second experiment, since it comes from a different population, unless a functional relationship between the two parameters is known. If the second population is related to the first one through some connection between these two parameters, one should be able to exploit this connection and make better statistical inference.
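The two probabilities in the coin example can be checked by enumerating the four equally likely outcomes of two tosses. The short Python sketch below is only an illustration; the function name `mle_distribution` is introduced here and does not come from the paper.

```python
from fractions import Fraction
from itertools import product


def mle_distribution(n_tosses, p_heads):
    """Exact sampling distribution of the MLE (number of heads / number of tosses)."""
    dist = {}
    for outcome in product([0, 1], repeat=n_tosses):
        heads = sum(outcome)
        # Probability of this particular sequence of tosses.
        prob = p_heads ** heads * (1 - p_heads) ** (n_tosses - heads)
        estimate = Fraction(heads, n_tosses)
        dist[estimate] = dist.get(estimate, 0) + prob
    return dist


# Fair coin tossed twice, as in the thought experiment.
dist = mle_distribution(2, Fraction(1, 2))
for estimate, prob in sorted(dist.items()):
    print(f"P(MLE = {estimate}) = {prob}")
```

The estimates 0 and 1 together carry probability 1/2, which is the 50% chance of a degenerate estimate mentioned in the text.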
Different statistical methodologies have been proposed in the literature to integrate information from different sources (or populations) in a very general setting; see [1] and [2], and the references therein. Most of these methods, however, face the challenge of accurately validating or evaluating the relevance of all related information in order to guard against introducing a significant bias or contaminating the current sample. In other words, the magnitude of integration must be controlled carefully, and the likelihood weights must be chosen judiciously, in order to achieve any desired improvement in statistical inference. We propose to tackle this difficult problem with a robust semi-parametric empirical likelihood method that achieves more accurate and robust inferential results.
Empirical likelihood, which was first introduced in [3], is a nonparametric method of inference based on a data-driven likelihood ratio function. It allows the statistician to employ likelihood methods without specifying a parametric model for the data. It enjoys both the flexibility of nonparametric methods and the efficiency of parametric likelihood. As shown in [4], empirical likelihood is an efficient tool for estimating parameters by incorporating estimating equations into the constrained maximization of the empirical likelihood function. In the problem we consider, the relevant information from different sources can be used by incorporating an extra set of estimating equations in the empirical likelihood framework.

 


To obtain robust estimates, we use the median rather than the mean as the estimate of the center. We propose two different kinds of estimating equations: one uses the median and the other uses a smoothed version of the median. The smoothing technique is the one proposed in [5] to improve the coverage accuracy. Our method can be easily generalized to multiple samples with relevant information. Without loss of generality, we consider data from two populations in the sequel.
The rest of the paper is organized as follows. The methodology framework, the proposed empirical likelihood approach, and its theoretical properties are presented in Section 2. Results of simulation studies demonstrating the empirical performance are provided in Section 3. Conclusion and some discussion are provided in Section 4.

Methodology
Suppose there are two groups of data from different populations that share the same parameter of interest. The population of the second group of data might be different from the first population.

Our goal is to estimate $\theta$ by using both samples. Directly using the log-likelihood might yield a biased estimate because of the difference between the two populations.
We propose a semi-parametric empirical likelihood method which only requires the independence of these two samples. To combine the second sample with the first one, we use a semi-parametric empirical likelihood in which $g$ is an estimating function. From empirical likelihood theory, the maximizing weights are expressed through $\lambda$, the Lagrange multiplier, which allows the log-likelihood function to be rewritten as a log profile likelihood function. The advantage of the log profile likelihood function is that it does not depend on the likelihood weights, which could be difficult to choose. In our proposed method, we do not require that the probability density function of the second population be identical to that of the first population. By using the empirical likelihood method, we do not even need to specify the functional form of the underlying distribution of the second population. Therefore, we gain robust estimates in the sense that the model mis-specification problem is avoided. Consequently, our method can be employed in a relatively wide range of applications in which the functional form of the probability density function is not known.
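For orientation, the construction described above can be written out in the standard empirical likelihood form with estimating equations. The display below is a sketch in assumed notation, not necessarily the paper's own: the labels $x_i$ ($i = 1, \dots, n$) for the first sample, $y_j$ ($j = 1, \dots, m$) for the second sample, $f$ for the parametric density of the first population, and $p_j$ for the empirical likelihood weights are introduced here for illustration.

```latex
% Semi-parametric log-likelihood: parametric part for x, empirical part for y
\ell(\theta) = \sum_{i=1}^{n} \log f(x_i;\theta) + \sum_{j=1}^{m} \log p_j,
\qquad p_j \ge 0,\quad \sum_{j=1}^{m} p_j = 1,\quad \sum_{j=1}^{m} p_j\, g(y_j,\theta) = 0 .

% Maximizing over the weights for fixed theta (Lagrange multiplier lambda)
p_j = \frac{1}{m}\,\frac{1}{1 + \lambda^{\top} g(y_j,\theta)},
\qquad \sum_{j=1}^{m} \frac{g(y_j,\theta)}{1 + \lambda^{\top} g(y_j,\theta)} = 0 .

% Resulting log profile likelihood, free of the weights p_j
\ell_P(\theta) = \sum_{i=1}^{n} \log f(x_i;\theta)
  - \sum_{j=1}^{m} \log\bigl\{ 1 + \lambda^{\top} g(y_j,\theta) \bigr\} - m \log m .
```

Under these assumptions the weights $p_j$ disappear from the profile likelihood, which is the property highlighted in the text.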
In the following, we present the theoretical properties of the proposed RSPELE. For clarity, all proofs are postponed to the Appendix. Theorem 1 below shows that, under some regularity conditions, the RSPELE $\hat{\theta}$ is consistent for $\theta_0$. The conditions parallel those in Shao (2003) on the normality of the maximum likelihood estimator in parametric models. Let $\theta_0$ be the true parameter. We further assume that A1) There exists a matrix such that > 0

 
is sufficiently small and is continuous at

A3)
Then it follows that $\hat{\theta} \to \theta_0$ in probability. exists with probability one and the set of its discontinuity points has zero probability.
The asymptotic distribution of the test statistic is given in Theorem 3.
Theorem 3 Assume that the assumptions made in Theorem 2 hold. The limiting distribution of We note that other test statistics, for example, a test statistic based on Theorem 2, may also be used.
Estimating equations provide a very flexible way to specify how the parameters of a statistical model should be estimated. They serve as constraints in maximizing the empirical likelihood. It was shown in [5] that the empirical likelihood method is an efficient tool for point estimation through estimating equations. In this section, we consider two different kinds of estimating equations that use information about the median, since the median is robust to outliers. First, one may use the median-based estimating function $g$ built from the second group of data. Second, we may use the smoothed version of the constraint, which was motivated by Shi and Lau (1999).

To define the estimating equation for the smoothed empirical likelihood, we take a kernel of a given order (Shi and Lau, 1999) together with a smoothing parameter $h$. The kernel is a symmetric probability density with bounded and compact support. The resulting estimating function is continuous with respect to $y$, but it is not a fixed function, since it changes as the smoothing parameter varies. See Shi and Lau (1999) for details. In addition, by arguments similar to those in Theorems 1-3 and in Shi and Lau (1999), similar asymptotic results can be obtained.
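To make the two kinds of estimating functions concrete, here is a minimal Python sketch. It assumes the usual indicator form for the median, $I(y \le \theta) - 1/2$, and replaces the indicator by an integrated kernel for the smoothed version. The Epanechnikov kernel and all function names are assumptions chosen for illustration; the source does not state which kernel or sign convention the authors used.

```python
import numpy as np


def median_score(y, theta):
    """Unsmoothed median estimating function: I(y <= theta) - 1/2."""
    y = np.asarray(y, dtype=float)
    return (y <= theta).astype(float) - 0.5


def epanechnikov_cdf(t):
    """Integral of K(u) = 0.75 * (1 - u^2) from -1 to t (assumed kernel)."""
    t = np.clip(np.asarray(t, dtype=float), -1.0, 1.0)
    return 0.5 + 0.75 * t - 0.25 * t ** 3


def smoothed_median_score(y, theta, h):
    """Smoothed version: the indicator is replaced by an integrated kernel."""
    y = np.asarray(y, dtype=float)
    return epanechnikov_cdf((theta - y) / h) - 0.5


y = np.array([-1.0, 0.2, 0.9])
print(median_score(y, 0.0))                # jumps between -0.5 and 0.5
print(smoothed_median_score(y, 0.0, 0.5))  # continuous in theta and y
```

As the smoothing parameter `h` shrinks, the smoothed score approaches the indicator-based score at every observation that differs from `theta`, which is why the smoothed function changes with `h` rather than being fixed.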

Data Fusion with Conventional Empirical Likelihood
Simulation studies are carried out by performing data fusion when two samples are available. The first sample is generated from the standard normal distribution, and the second sample Y is generated from a normal, double exponential, or t-distribution. The first sample has size 10, and the size of the second sample is 10, 20, or 30.
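A simulation of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the first sample contributes a normal log-likelihood with the variance fixed at its MLE, the second sample contributes an empirical log-likelihood under the indicator-based median constraint (for which the optimal weights have a closed form), and the maximizer is found by a simple grid search. The function names, the grid, the Laplace scale, and the number of replications are all choices made here.

```python
import numpy as np


def el_median_loglik(y, theta):
    """Log empirical likelihood of y under the constraint median = theta.

    With g = I(y <= theta) - 1/2, the optimal weights are 1/(2k) for the k
    points <= theta and 1/(2(m - k)) for the remaining m - k points.
    """
    m = y.size
    k = int(np.sum(y <= theta))
    if k == 0 or k == m:
        return -np.inf  # constraint cannot be met when theta is outside the data
    return -k * np.log(2 * k) - (m - k) * np.log(2 * (m - k))


def fused_estimate(x, y, grid):
    """Grid maximizer of normal log-likelihood (x) plus median-constrained EL (y)."""
    s2 = np.var(x)  # MLE of the variance, plugged in (assumption)
    profile = np.array([
        -np.sum((x - theta) ** 2) / (2 * s2) + el_median_loglik(y, theta)
        for theta in grid
    ])
    return grid[int(np.argmax(profile))]


def mse_ratio(n=10, m=30, reps=200, seed=0):
    """MSE of the fused estimate divided by MSE of the sample mean (true center 0)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-3.0, 3.0, 601)
    err_fused, err_mle = [], []
    for _ in range(reps):
        x = rng.normal(0.0, 1.0, size=n)   # first sample: standard normal
        y = rng.laplace(0.0, 1.0, size=m)  # second sample: double exponential
        err_fused.append(fused_estimate(x, y, grid) ** 2)
        err_mle.append(np.mean(x) ** 2)
    return np.mean(err_fused) / np.mean(err_mle)


print(mse_ratio())
```

A ratio below one would indicate that borrowing information from the second sample reduced the mean squared error in this toy setting; the value depends on the seed and on the assumptions listed above.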
First of all we use the median constraint. In the log-likelihood function of the simulation model, $s$ is the MLE variance. We present the mean squared error (MSE) ratio of RSPELE to MLE based on 1000 replications in Table 1.
The simulation results show that RSPELE performs well except when the second population is normally distributed with variation as large as that of the first one. As the second sample size increases, the RSPELE becomes more accurate. Moreover, the MSE of RSPELE is smaller when the data of the second population are more concentrated around the center, as for the double exponential distribution.

Smoothed Empirical Likelihood
In this section we demonstrate the smoothed version of the estimating equations. The kernel we chose is the same as the one used in [6]. The simulation model is identical to the first experiment. Four values of the smoothing parameter are used: $2^{-1}$, $2^{-3/4}$, $2^{-1/2}$, and $2^{-1/4}$. We provide the MSE ratio of RSPELE to MLE based on 1000 replications in Table 2. Results of the smoothed version are slightly better than the results of the median version, no matter which smoothing parameter is chosen. When the underlying distribution of the second population is not the same as the first population, the RSPELE

Confidence Intervals
In this subsection, we construct the confidence interval for the median by bootstrapping. In this simulation study, the first sample X is generated from the standard normal distribution, and the second sample Y is generated from a normal, double exponential, or t-distribution. The sample size of X is 10, and that of Y is 10, 20, or 30. The size of the bootstrapped sample is 200 and the number of iterations is set to 1000. First of all, we use the median estimating equation and record the coverage probabilities and the simulated average confidence interval lengths (AL) in Table 3 for nominal levels of 80, 90, 95, and 99 percent. The coverage probabilities and the AL obtained with the smoothed version of the estimating equation are recorded in Tables 4 and 5 for the smoothing parameters $2^{-1}$ and $2^{-1/2}$. We report the results of MLE in Table 6. Since the results of MLE do not depend on the second population, we further compare the coverage probabilities and AL in Tables 3-5 with those in Table 6.

The results of the smoothed version are better than those of the median version in terms of coverage probabilities. The coverage probabilities of RSPELE and MLE are very close, but the confidence intervals of RSPELE are about 10% narrower than those of MLE. The results of RSPELE when the underlying distribution of the second population is either the t or the double exponential distribution are better than the results when the underlying distribution is normal. That is because the normal distribution is flatter than the t and double exponential distributions. Consequently, if the second population provides good information about the center, we can use it to obtain better estimates.
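The coverage and average-length computations described above can be sketched as follows. The sketch makes several assumptions: it uses a percentile bootstrap interval (the source does not say which bootstrap interval was used), it reads "200" as the number of bootstrap resamples, and it applies the plain sample median to a single normal sample rather than the RSPELE, purely to show the mechanics. All names are introduced here.

```python
import numpy as np


def bootstrap_percentile_ci(sample, estimator, level=0.95, n_boot=200, rng=None):
    """Percentile bootstrap confidence interval for estimator(sample)."""
    rng = np.random.default_rng() if rng is None else rng
    sample = np.asarray(sample, dtype=float)
    # Re-estimate on n_boot resamples drawn with replacement.
    stats = np.array([
        estimator(rng.choice(sample, size=sample.size, replace=True))
        for _ in range(n_boot)
    ])
    alpha = 1.0 - level
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lower, upper


def coverage_and_length(level=0.95, n=10, reps=200, seed=0):
    """Empirical coverage of the true center 0 and average interval length."""
    rng = np.random.default_rng(seed)
    hits, lengths = 0, []
    for _ in range(reps):
        x = rng.normal(0.0, 1.0, size=n)
        lo, hi = bootstrap_percentile_ci(x, np.median, level, n_boot=200, rng=rng)
        hits += int(lo <= 0.0 <= hi)
        lengths.append(hi - lo)
    return hits / reps, float(np.mean(lengths))


cov, avg_len = coverage_and_length()
print(f"coverage = {cov:.3f}, average length = {avg_len:.3f}")
```

Replacing `np.median` with a function that returns a fused estimate from both samples would give the analogous quantities for a two-sample estimator, with both samples resampled independently.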

Discussion
In this paper, we propose a robust semi-parametric empirical likelihood method for a multiple-sample model with a common measurement of center. We use two different kinds of estimating equations that carry information about the median. Simulation studies comparing several commonly used performance measures show that the second population can provide very useful information on the parameter of interest.

Proof of Theorem 1
We rewrite Equation (1) as , and .
Case 1. In view of the proof of Theorem 4.17 of [6], it follows that for any , for large . By Assumptions (A1)-(A3), applying an approach similar to that in [7], it can be shown that . In light of the arguments in [4,8], and under Assumptions (A1)-(A3), it follows that for any , for large .
Combining (3) with (4), we have for any for large n. Therefore, there exists We denote , by applying Taylor's expansion, it follows that which can be written as

Proof of Theorem 3
Assume that the assumptions made in Theorem 3 hold. A statistic for testing the null hypothesis is given by .

Combining Equations (5)-(7), it follows that, under the null hypothesis, the limiting distribution of the test statistic is as stated in Theorem 3.

We call $\hat{\theta}$ the robust semi-parametric empirical likelihood estimate (RSPELE).

the and is shown in Theorem 2.

Theorem 2 In addition to the conditions of Theorem 1, we assume that the ratio of the sample sizes converges to $b$, where $b$ is a constant. We also assume that

Case 2 can be shown similarly. The details are omitted.

Proof of Theorem 2