Smoothed Empirical Likelihood Inference for ROC Curves with Missing Data

The receiver operating characteristic (ROC) curve has been widely used in scientific research fields. After using the random hot deck imputation, we propose the smoothed empirical likelihood ratio statistic for the ROC curve with missing data. Its asymptotic distribution is a scaled chi-square distribution and empirical likelihood confidence intervals for ROC curves are constructed. The simulation study shows that the proposed interval estimates perform well based on the coverage probability for different sample sizes and response rates.


Introduction
The receiver operating characteristic (ROC) curve has been extensively used to evaluate diagnostic tests. The ROC curve is usually defined as a graphical plot of the sensitivity versus (1 - specificity). It is an appealing way to summarize the accuracy of predictions in a diagnostic test. In recent years, ROC curves have been widely applied in medical research, diagnostic medicine and many other scientific research fields (Zhou, McClish and Obuchowski [1], Pepe [2]).
Empirical likelihood (EL) is a useful nonparametric statistical inference method which does not need to assume a known family of distributions (see Owen [3]). Owen [4,5] originally proposed EL confidence regions for the population mean parameter in the complete data setting. Chen and Hall [6] introduced smoothed EL confidence intervals for quantiles. To improve the performance of normal approximation methods for the ROC curve for small sample sizes, the EL based method has been used to estimate the ROC curve. Claeskens, Jing, Peng and Zhou [7] developed smoothed EL confidence intervals for ROC curves and Su et al. [8] proposed plug-in EL for the ROC curve. Liang and Zhou [9] developed semi-parametric EL confidence intervals for ROC curves with right censoring.
In recent years, the missing data problem has received much attention in biomedical studies, population surveys and many other related fields. Some of the responses may not be obtained due to information loss (see Qin and Qian [10], Wang and Rao [11]). There is no existing inference procedure for the ROC curve with missing data. Recently, Qin and Qian [10] proposed smoothed EL interval estimation for the difference of two quantiles with missing data. Motivated by their idea, we propose an empirical likelihood ratio for the ROC curve with missing data and prove that the resulting EL ratio has a scaled chi-square limiting distribution. This approach is a natural extension of Claeskens et al. [7] to missing data.
The rest of the paper is organized as follows. In Section 2, adopting the approach of Qin and Qian [10], which in turn follows Claeskens et al. [7], we propose the smoothed empirical likelihood ratio statistic, derive its limiting distribution and construct the empirical likelihood confidence interval for the ROC curve. In Section 3, we conduct a simulation study to evaluate the finite sample performance of the empirical likelihood interval estimation. The conclusion is given in Section 4. The proofs are given in the Appendix.

Main Results
In the following, we adopt the same notation and terminology as Qin and Qian [10]. Suppose there are two independent populations [...], otherwise. We assume that x and y are missing completely at random, i.e., [...]. We consider i.i.d. samples of missing data [...].
It is of interest to study two populations, one with disease and one without. Suppose that the distribution function of the diseased population X is F(t) and the distribution function of the non-diseased population Y is G(t). At a threshold t, the sensitivity and specificity of a continuous-scale diagnostic test are 1 - F(t) and G(t), respectively. At a given level q, the ROC curve is expressed as $1 - F(G^{-1}(1 - q))$.
As in Qin and Qian [10], let the bandwidth [...], and the kernel functions [...] and [...]. We adopt the smoothed EL approach of Qin and Qian [10] and define the profile EL ratio statistic at [...]: [...] satisfy the following equations: [...]. By using the Lagrange multiplier method, we have that [...], and [...], where [...] satisfy the following score equations: [...]. Suppose that [...] is the true value of [...]. In this paper, we assume the same regularity conditions (i)-(v) as in Qin and Qian [10], with condition (ii) modified as follows: [...]. Recall condition (iii): n/m → k as m + n → ∞. We now state the main result about confidence intervals for the ROC curve.
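With sensitivity 1 - F(t) and specificity G(t), the ROC value at level q is obtained by choosing the threshold t with G(t) = 1 - q and reading off the sensitivity there. The sketch below evaluates this quantity in two ways: exactly under an assumed normal model, and by plugging in empirical distribution functions from complete samples. The normal model (X ~ N(1, 1), Y ~ N(0, 1)) and the function names `roc_normal`, `empirical_quantile` and `roc_empirical` are assumptions made only for illustration; this is a plain plug-in estimate, not the smoothed EL procedure.

```python
import math
from statistics import NormalDist


def roc_normal(q, mu_x=1.0, mu_y=0.0, sigma=1.0):
    """ROC value 1 - F(G^{-1}(1 - q)) for X ~ N(mu_x, sigma^2), Y ~ N(mu_y, sigma^2).

    The normal model is an illustrative assumption, not taken from the paper.
    """
    F = NormalDist(mu_x, sigma)      # diseased population X
    G = NormalDist(mu_y, sigma)      # non-diseased population Y
    threshold = G.inv_cdf(1.0 - q)   # specificity G(t) equals 1 - q at this t
    return 1.0 - F.cdf(threshold)    # sensitivity at that threshold


def empirical_quantile(sample, p):
    """Smallest order statistic whose empirical distribution function is >= p."""
    ordered = sorted(sample)
    # round() guards against floating-point noise such as 0.7 * 10 = 7.000000000000001
    k = max(1, math.ceil(round(p * len(ordered), 9)))
    return ordered[k - 1]


def roc_empirical(q, x, y):
    """Plug-in estimate: F and G replaced by empirical distribution functions."""
    threshold = empirical_quantile(y, 1.0 - q)
    return sum(1 for xi in x if xi > threshold) / len(x)


true_value = roc_normal(0.1)
estimate = roc_empirical(0.2, [5, 7, 9, 11], list(range(1, 11)))
```

Under the assumed normal model, `roc_normal(0.1)` is approximately 0.3891, which agrees with the value quoted for q = 0.1 in the simulation section; this only shows that the assumed model is consistent with that number, not that it is the model used there.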
Theorem 1. Under the regularity conditions (i)-(v), as m + n → ∞, there exists a root [...]
We know that k is estimated by n/m, and [...] and [...] can be consistently estimated by [...], respectively.
Copyright © 2012 SciRes. OJS. Y. H. AN
Qin and Qian [10] showed that [...]. Plugging in the above consistent estimators, we obtain a consistent estimator of [...]. Let [...] be the upper α-quantile of $\chi^2_1$. Thus, it follows from Theorem 1 that the EL confidence interval for [...] is given by [...]
Remark: The asymptotic distribution of the EL statistic is a standard $\chi^2_1$ distribution for complete data since [...]
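As a rough illustration of how an interval is read off from a chi-square calibration, the sketch below computes the upper α-quantile of the chi-square distribution with one degree of freedom (the square of a standard normal quantile) and collects the grid points at which a scaled statistic does not exceed it. The callable `scaled_stat` is a hypothetical stand-in for the rescaled smoothed EL ratio statistic, and the toy quadratic used in the example is not the EL statistic.

```python
from statistics import NormalDist


def chi2_1_upper_quantile(alpha):
    """Upper alpha-quantile of chi-square with 1 df: (z_{1 - alpha/2})^2."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


def interval_from_statistic(scaled_stat, grid, alpha=0.05):
    """Return (min, max) of the grid points where scaled_stat(theta) <= critical value.

    scaled_stat is a hypothetical callable; None is returned if no point is accepted.
    """
    crit = chi2_1_upper_quantile(alpha)
    accepted = [theta for theta in grid if scaled_stat(theta) <= crit]
    if not accepted:
        return None
    return min(accepted), max(accepted)


def toy_stat(theta):
    """Toy stand-in: Wald-type statistic centred at 0.40 with standard error 0.05."""
    return ((theta - 0.40) / 0.05) ** 2


grid = [i / 1000 for i in range(1001)]   # candidate ROC values in [0, 1]
interval = interval_from_statistic(toy_stat, grid)
```

A grid scan is used here only because it is easy to read; a root finder on each side of the point estimate would be a natural alternative when the statistic is expensive to evaluate.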

Simulation Studies
In this section, we carry out extensive simulation studies to evaluate the performance of the EL method for the ROC curve in terms of the coverage probability and the average length of confidence intervals with different response rates and sample sizes. The simulation setting is similar to that of Qin and Qian [10]. The diseased population X is distributed as [...] and the bandwidths [...] 200, 150 [...]. We generate 1000 random samples of the data. The proposed EL confidence intervals for the ROC curve are constructed at q = 0.1, 0.3, 0.5, and 0.7. The nominal level of the confidence intervals is 1 - α = 95%. From Tables 1-4 we have the following findings:
1) All response rates in the simulation study are larger than 0.6. For each fixed response rate and sample size, the coverage probability of the confidence intervals for the ROC curve is close to the nominal level 95%. Even when the sample size is (50, 50) and q is small, q = 0.1 (Γ = 0.3891), the coverage probability is still good.
2) In almost all cases in the simulation study, as the response rates increase, the coverage probabilities of the confidence intervals get closer to 95%, i.e., they are more accurate, and the average length of the confidence intervals decreases, because higher response rates mean that more of the data are observed.
3) Similarly, when the sample sizes increase, the coverage probabilities of confidence intervals are more accurate, and the average length of the confidence intervals decreases.
4) Across q = 0.1, 0.3, 0.5, and 0.7, the EL confidence intervals maintain good coverage probability, and the coverage is very stable.
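The summary measures used in such a study, namely the coverage probability (CP), the average left endpoint (LE), the average right endpoint (RE) and the average length (AL), can be computed from the simulated intervals as in the following sketch. The function name `summarize_intervals`, the four example intervals and the use of 0.3891 as the true value are hypothetical and serve only to show the arithmetic.

```python
def summarize_intervals(intervals, true_value):
    """Summarize simulated confidence intervals given as (left, right) pairs."""
    n = len(intervals)
    # An interval covers the truth when the true value lies between its endpoints.
    covered = sum(1 for left, right in intervals if left <= true_value <= right)
    le = sum(left for left, _ in intervals) / n
    re = sum(right for _, right in intervals) / n
    return {
        "CP(%)": 100.0 * covered / n,
        "LE": le,
        "RE": re,
        "AL": re - le,  # the mean of (right - left) equals mean(right) - mean(left)
    }


# Hypothetical intervals from four replications; a real study would use many more.
example = summarize_intervals(
    [(0.25, 0.50), (0.30, 0.55), (0.40, 0.65), (0.20, 0.45)],
    true_value=0.3891,
)
```

In this toy input, three of the four intervals contain the true value, so the reported coverage is 75%.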

Discussion
In this paper, we developed the smoothed empirical likelihood method for the ROC curve with missing data, which is a natural extension of Claeskens et al. [7]. The key technique used to impute the missing data is the random hot deck imputation procedure. Under imputation, the proposed smoothed EL statistic converges to a scaled chi-square distribution. In addition, we carried out simulation studies to evaluate the finite sample performance of the proposed EL interval estimation for the ROC curve. For both smaller and larger q, the EL confidence intervals for the ROC curve have coverage probabilities that are close to the nominal level. In summary, the proposed EL interval estimation is a reliable and useful tool for ROC curve analysis with missing data. In the future, we will consider other imputation methods to further improve the interval estimation.
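Random hot deck imputation, as commonly defined, fills each missing value with a draw made with replacement from the observed values of the same sample. A minimal sketch follows, in which `None` marks a missing value; the function name `random_hot_deck`, the `seed` argument and the example data are illustrative assumptions rather than part of the paper.

```python
import random


def random_hot_deck(sample, seed=None):
    """Replace each missing entry (None) by a random draw, with replacement,
    from the observed entries of the same sample."""
    rng = random.Random(seed)                        # local generator for reproducibility
    donors = [v for v in sample if v is not None]    # the respondents
    if not donors:
        raise ValueError("no observed values to impute from")
    return [v if v is not None else rng.choice(donors) for v in sample]


# Hypothetical incomplete sample from one population.
x_incomplete = [1.2, None, 0.7, 2.4, None, 1.9]
x_completed = random_hot_deck(x_incomplete, seed=0)
```

Each population would be imputed separately from its own respondents. Because imputed values repeat observed ones, the completed sample carries less information than a fully observed sample of the same size, which is one way to see why the limiting distribution of the EL statistic needs a scale adjustment.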

Appendix. Proof of Theorem 1
To prove Theorem 1, we need some additional lemmas similar to those in Qin and Qian [10]. We only give an outline of the proofs since they follow arguments similar to those of Qin and Qian [10].
Lemma A.1. Under the regularity conditions of Theorem 1, as m + n → ∞, we have [...]
Proof of Lemma A.1. We follow the same lines as Qin and Qian [10]. Let [...] and [...]. As in Qin and Qian [10], we have that [...]. As in Qin and Qian [10], we also have that [...]. The rest of Lemma A.1 can be proved along the same lines and is omitted.
Proof of Lemma A.2. We follow the same arguments as Qin and Qian [10]. The proof is omitted.
Lemma A.3. (Qin and Qian [10]). Assume that [...]
Proof of Theorem 1. It is similar to the proof of Theorem 1 in Qin and Qian [10] and is omitted.
$m_x = m - r_x$ and $m_y = n - r_y$. Qin and Qian [10] use $s_{rx}$ and $s_{ry}$ to denote the sets of respondents with respect to x and y, and use $s_{mx}$ and $s_{my}$ to denote the sets of nonrespondents.

[...] and it coincides with the conclusion of Claeskens et al. (2003) [7].

We draw random samples x and y from the populations X and Y. The response rates for x and y are chosen as [...]. [...] and $c_0$ are defined in Lemma A.1 and Theorem 1.

Table 1. Empirical likelihood confidence intervals for the ROC curve at [...]
CP(%): coverage probability; LE: the average left endpoint; RE: the average right endpoint; AL: the average length of the interval.