Measurement Error for Age of Onset in Prevalent Cohort Studies

Prevalent cohort studies involve screening a sample of individuals from a population for disease, recruiting affected individuals, and prospectively following the cohort to record the occurrence of disease-related complications or death. This design features a response-biased sampling scheme, since individuals living a long time with the disease are preferentially sampled, so naive analysis of the time from disease onset to death will over-estimate survival probabilities. Unconditional and conditional analyses of the resulting data can yield consistent estimates of the survival distribution subject to the validity of their respective model assumptions. The time of disease onset is retrospectively reported by sampled individuals; however, such reports are often subject to measurement error. In this article we present a framework for studying the effect of measurement error in disease onset times in prevalent cohort studies, report on empirical studies of the effect in each framework of analysis, and describe likelihood-based methods to address such measurement error.


Introduction
Prevalent cohort studies of chronic diseases involve screening populations and sampling individuals with the condition of interest for prospective follow-up [1]. Examples of such studies include cancer screening trials [2], studies of HIV prevalence [3] and studies of dementia [4] [5]. The prevalent cohort design is both more efficient and more practical than the incident cohort design [6], in which a cohort of disease-free individuals is followed for disease onset and only the subset of individuals developing the disease yields information on the time from disease onset to death. The prevalent cohort design features a form of response-dependent sampling, however, in the sense that diseased individuals with long survival times are preferentially selected for inclusion in the cohort [1] [2] [7]; some authors refer to the resulting data as "length-biased". Valid statistical inference depends critically on adequately addressing the sampling scheme in the likelihood construction, and there are two broad frameworks for analysis, both of which make use of the retrospectively reported time of disease onset recorded at the time of sampling.
Analysis in the conditional framework is based on the fact that individuals who died before the time of screening cannot be sampled, so the survival times among sampled individuals are left-truncated by the time from disease onset to enrollment. The unconditional framework is based on the density of the survival times derived under the prevalent cohort sampling scheme. That is, if the disease incidence is stationary, the onset times follow a time-homogeneous Poisson process and the resulting left-truncation times have a constant density. If the probability that an individual is sampled is proportional to their survival time, the density of times subject to this sampling scheme can be derived and used for likelihood construction.
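To make the sampling effect concrete, the following sketch (a hypothetical simulation, not from the paper; all parameter values are illustrative) draws onset times from a stationary process, keeps only individuals still alive at the screening time, and shows that the naive mean of the sampled durations is roughly double the population mean when survival is exponential.

```python
import random

random.seed(1)

TRUE_MEAN = 2.0        # population mean survival time (exponential model)
A, R = 0.0, 100.0      # accrual window [A, R]; screening occurs at R

sampled = []
for _ in range(200000):
    onset = random.uniform(A, R)             # stationary (uniform) onset times
    t = random.expovariate(1.0 / TRUE_MEAN)  # survival time from disease onset
    if onset + t > R:                        # alive at screening -> recruited
        sampled.append(t)

naive_mean = sum(sampled) / len(sampled)
# Length-biased sampling: E[T | sampled] = E[T^2] / E[T] = 2 * TRUE_MEAN
# for the exponential, so naive analysis roughly doubles the apparent
# mean survival with the disease.
print(round(naive_mean, 1))
```

The inflation factor of two is specific to the exponential; for other survival models the bias factor is E[T^2]/E[T]^2, but the direction of the bias is the same.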
For the conditional approach, parametric, nonparametric and semiparametric methods are relatively straightforward and have seen considerable application [3] [8] [11]. Wang [10] proposed a product-limit estimator for left-truncated survival times which maximizes the conditional likelihood and loses no information when the distribution of the truncation time variable is unspecified. For semiparametric Cox models, the partial likelihood approach can be adopted for left-truncated data with an adjusted risk set [8] [11] [12]. Wang et al. [12] argued that the nonparametric and semiparametric estimators are efficient when the distribution of the truncation time is unspecified but can be inefficient when the distribution of the truncation time is parameterized.
Unconditional analyses [5] [13]-[16] are based on the joint distribution of the backward recurrence time (time from disease onset to sampling) and the forward recurrence time (time from sampling to death). Vardi [13] [14] and Asgharian et al. [5] developed the nonparametric maximum likelihood estimator (NPMLE) for right-censored length-biased survival times, but this NPMLE does not have closed form and its limiting distribution is intractable [15] [16]. Huang and Qin [17] derived a new closed-form nonparametric estimator that incorporates the information from the length-biased sampling. Wang [18] proposed a pseudo-likelihood for length-biased failure times under the Cox proportional hazards model, but this method cannot be applied to right-censored failure times. Luo and Tsai [19] and Tsai [20] derived pseudo-partial-likelihood estimators for right-censored length-biased data which have closed form and retain high efficiency. Shen et al. [21] considered modeling covariate effects for length-biased data under transformation and accelerated failure time models. Qin and Shen [22] recently proposed two estimating equations for fitting the Cox proportional hazards model, formulated from different weighted risk sets.
Both conditional and unconditional analyses make use of the retrospectively reported times of disease onset, with the latter further based on the assumption of a stationary (Poisson) incidence process. However, there is often considerable error and uncertainty in the retrospectively reported onset times. This is particularly true for onset times related to diseases featuring cognitive impairment or mental health disorders. In some settings the reported times may better represent times of symptom onset rather than the actual start of the disease process, which can lead to underestimation of disease duration. In other settings the errors may lead to earlier or later reported onset times.
The purpose of this article is to examine the impact of measurement error in the retrospectively reported onset time for both the conditional and unconditional frameworks.The remainder of the paper is organized as follows.
In Section 2, we introduce notation and likelihood construction for prevalent cohort data.The impact of misspecification of the disease onset time is explored in Section 3 by simulation for the unconditional and conditional approaches, and methods for correcting for this measurement error are described in Section 4. General remarks and topics for further research are given in Section 5.

Notation and Likelihood Construction
Consider a population and a chronic disease such that at any time an individual in the population is in one of three states: alive and disease-free ($D_0$), alive with disease ($D_1$), and dead ($D_2$). For individuals who develop the disease, the path is $D_0 \to D_1 \to D_2$, and interest often lies in the distribution of the survival time with the disease, or equivalently the sojourn time distribution for state $D_1$. For individual $i$, let $V_{0i}$ be the calendar time of disease onset and $V_{1i}$ be the calendar time of death (time of entry to state $D_2$). Consider a study starting at calendar time $R$ (recruitment time), when individuals in the population are screened for the disease of interest and those who are diseased are to be recruited into the study. Figure 1 shows a hypothetical situation in the prevalent cohort study, where calendar time is represented on the horizontal axis. Individuals who are sampled must have developed the disease of interest at some point over the calendar time interval $[A, R]$ and still be alive at the recruitment time $R$. Those who develop the disease over $[A, R]$ but die before the recruitment time cannot, of course, be selected for inclusion in the sample. Those who develop the disease after the recruitment time are also not eligible for recruitment. The times $t_i = R - v_{0i}$ are the left-truncation times; we observe $x_i = \min(T_i, C_i)$ and the event indicator $\delta_i = I(T_i \le C_i)$, where $T_i = V_{1i} - V_{0i}$ is the survival time and $C_i$ is the censoring time measured from onset. The conditional likelihood for right-censored left-truncated survival data is
$$L_C(\theta) = \prod_{i=1}^{n} \frac{f(x_i;\theta)^{\delta_i}\, S(x_i;\theta)^{1-\delta_i}}{S(R - v_{0i};\theta)}, \qquad (2)$$
assuming $v_{0i}$ is recorded correctly. By conditioning on the observed truncation time, it is not necessary to model the distribution of the onset times.
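As a minimal numerical sketch of the conditional likelihood, the code below generates left-truncated, right-censored prevalent cohort data under an exponential survival model (chosen for brevity; the paper's simulations use a Weibull) and maximizes the conditional likelihood by a crude grid search. All parameter values and variable names are illustrative assumptions.

```python
import math
import random

random.seed(2)

LAM, A, R = 0.5, 0.0, 100.0   # true exponential rate; accrual window; screening time

# left-truncated, right-censored prevalent-cohort data (t_i, x_i, delta_i)
data = []
for _ in range(200000):
    v0 = random.uniform(A, R)
    T = random.expovariate(LAM)
    if v0 + T > R:                          # alive at screening -> sampled
        t = R - v0                          # left-truncation time
        c = t + random.uniform(1.0, 2.0)    # 1-2 time units of prospective follow-up
        data.append((t, min(T, c), T <= c))

D = sum(d for _, _, d in data)              # number of observed deaths
W = sum(x - t for t, x, _ in data)          # total observation time beyond truncation

def cond_loglik(lam):
    # log of prod_i f(x_i)^delta S(x_i)^(1-delta) / S(t_i) for the exponential model,
    # which reduces to D*log(lam) - lam*W
    return D * math.log(lam) - lam * W

# crude grid maximization of the conditional likelihood
lam_hat = max((k / 1000.0 for k in range(100, 2000)), key=cond_loglik)
print(round(lam_hat, 2))
```

Note that for the exponential the conditional MLE has the closed form D/W; the grid search is kept only to mirror the general likelihood-maximization recipe.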
If the disease onset process is a stationary Poisson process, the truncation times are uniformly distributed and, letting $A \to -\infty$, the sampled survival times have the length-biased density $t f(t;\theta)/\mu(\theta)$, where $\mu(\theta) = \int_0^\infty u f(u;\theta)\,\mathrm{d}u$; the resulting sample is a right-censored length-biased sample. If the distribution of the onset time is known and can be parameterized, the conditional approach may be inefficient and it is natural to want to make use of the information contained in the onset process.
We now consider the distribution of the onset times over the interval $[A, R]$ in the target population. Let $\rho(v)$ denote the intensity of the onset process, so that $q(v) = \rho(v)/\int_A^R \rho(u)\,\mathrm{d}u$ is the probability density of an onset time over $[A, R]$. We assume $V_0 \perp T$, so that the distribution of the survival time since disease onset does not depend on the onset time. We also define the sample onset time density for individuals who satisfy the inclusion criterion,
$$q^*(v;\theta) = \frac{q(v)\, S(R - v;\theta)}{\int_A^R q(u)\, S(R - u;\theta)\,\mathrm{d}u}. \qquad (3)$$
When the onset process is stationary, as $A \to -\infty$, the sample density function for the onset time (3) can be simplified to
$$q^*(v;\theta) = \frac{S(R - v;\theta)}{\mu(\theta)}, \qquad (4)$$
where $\mu(\theta)$ is the population mean survival time with disease. From (3) and (4), one can see that the onset time among sampled individuals contains information regarding the survival distribution. The unconditional likelihood utilizing this information is based on the joint distribution of $(V_0, X)$, and the full likelihood is the product of the conditional likelihood and the marginal likelihood of the sample onset times,
$$L_F(\theta) = L_C(\theta) \prod_{i=1}^{n} q^*(v_{0i};\theta). \qquad (5)$$
Under the assumption of a stationary disease process and based on (4), the unconditional likelihood for a right-censored length-biased sample can be written as
$$L_F(\theta) = \prod_{i=1}^{n} \frac{f(x_i;\theta)^{\delta_i}\, S(x_i;\theta)^{1-\delta_i}}{\mu(\theta)}. \qquad (6)$$
Thus the unconditional approach exploits information in the disease onset times to improve efficiency over the conditional approach, but it does so by making a stationarity assumption for the disease onset process, which makes it less robust.
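The unconditional likelihood can be sketched numerically as well. For an uncensored exponential model, the length-biased density $t f(t;\theta)/\mu(\theta)$ is $\lambda^2 t e^{-\lambda t}$, i.e. a Gamma(2, $1/\lambda$) distribution, which the sketch below uses to generate data directly; the grid maximization of the unconditional log-likelihood recovers the closed-form MLE $2n/\sum x_i$. Parameter values are illustrative assumptions.

```python
import math
import random

random.seed(3)

LAM = 0.5    # true exponential rate; population mean survival 1/LAM = 2

# Under stationarity, the sampled (uncensored) durations have density
# t f(t) / mu; for an exponential this is Gamma(shape=2, scale=1/LAM).
sample = [random.gammavariate(2.0, 1.0 / LAM) for _ in range(5000)]

n = len(sample)
S = sum(sample)
LOGS = sum(math.log(x) for x in sample)

def uncond_loglik(lam):
    # sum_i log( x_i f(x_i) / mu ) = sum_i log( lam^2 x_i e^(-lam x_i) )
    return 2 * n * math.log(lam) + LOGS - lam * S

lam_hat = max((k / 1000.0 for k in range(100, 2000)), key=uncond_loglik)
print(round(lam_hat, 2))   # closed-form MLE is 2 * n / S
```

Because every sampled duration contributes both its own density and the selection weight, the unconditional estimator here uses twice the "effective" information on the rate, which is the efficiency gain over the conditional analysis referred to in the text.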
The estimators $\hat\theta_C$ and $\hat\theta_F$ can be found by maximizing the conditional (2) or unconditional (5) likelihoods respectively when parametric models are applied. Further, the resulting estimators have an asymptotic normal distribution,
$$\sqrt{n}\,(\hat\theta_C - \theta) \to N\big(0, \mathcal{I}_C^{-1}\big), \qquad \sqrt{n}\,(\hat\theta_F - \theta) \to N\big(0, \mathcal{I}_F^{-1}\big),$$
where $\mathcal{I}_C$ and $\mathcal{I}_F$ are the Fisher information matrices for the conditional and unconditional likelihoods.

Nonparametric Estimation of the Survival Function
Nonparametric methods are often more appealing than parametric methods when there is limited knowledge regarding the distribution of survival times. Wang et al. [23] and Wang [10] derived the product-limit estimator for left-truncated survival data. Let $Y_i(u) = I(t_i \le u \le x_i)$ indicate whether individual $i$ has been recruited into the study, is under observation at time $u$, and is at risk of an event, where $t_i = R - v_{0i}$ is the left-truncation time. Let $\mathrm{d}N_i(u) = \delta_i I(x_i = u)$ be the event indicator, and let $N(u) = \sum_{i=1}^n N_i(u)$ and $Y(u) = \sum_{i=1}^n Y_i(u)$. Then the logarithm of the likelihood for left-truncated data (2) can be rewritten as
$$\log L_C = \sum_{i=1}^n \int_0^\infty \Big[ \{\log \mathrm{d}\Lambda(u)\}\, \mathrm{d}N_i(u) - Y_i(u)\, \mathrm{d}\Lambda(u) \Big],$$
where $\Lambda(\cdot)$ is the cumulative hazard function. The nonparametric maximum likelihood estimator (NPMLE) of the survivor function for right-censored left-truncated data is the product-limit estimator
$$\hat S(t) = \prod_{u \le t} \left\{ 1 - \frac{\mathrm{d}N(u)}{Y(u)} \right\}.$$
The conditional NPMLE is consistent, but a more efficient estimator can be obtained when the onset process is stationary. Vardi [14] proposed a nonparametric maximum likelihood estimator of the survival distribution function $G(t)$ based on a length-biased sample under multiplicative censoring; the NPMLE of $G(t)$ is found by an expectation-maximization algorithm which maximizes a likelihood of the form (8). Vardi [14] also argued that, based on renewal theory, the joint density of $(V_0, T)$ under length-biased sampling is $f(t)/\mu$. Hence the density function for the observed length-biased event time is $g(t) = t f(t)/\mu$, and the survivor function for the event time in the population is $S(t) = \mu \int_t^\infty u^{-1} g(u)\,\mathrm{d}u$. The full likelihood (6) under length-biased sampling can be rewritten in exactly the form of Vardi's (8), so the Vardi [14] algorithm can be used to obtain the NPMLE of the survivor function. Qin et al.
[24] developed an expectation-maximization algorithm for the analysis of length-biased data by constructing a complete data likelihood using the Turnbull [9] approach and considering contributions from "ghosts"; these are individuals not sampled into the cohort because they died before the screening assessment. Unlike the Vardi [14] method, their likelihood function is derived from the unbiased distribution of the event time, and the EM algorithm directly estimates $\mathrm{d}F(t)$, which allows one to impose model and parameter constraints on this distribution function.
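The truncation-adjusted product-limit estimator described above can be sketched in a few lines; the simulation settings below are illustrative assumptions, not the paper's. The only change from the ordinary Kaplan-Meier computation is that the risk set at time $s$ contains only individuals whose truncation time has been passed ($t_i \le s \le x_i$).

```python
import random

random.seed(4)

# simulated left-truncated, right-censored prevalent-cohort data (t_i, x_i, delta_i)
data = []
for _ in range(100000):
    v0 = random.uniform(0.0, 100.0)
    T = random.expovariate(0.5)
    if v0 + T > 100.0:                      # alive at screening -> sampled
        t = 100.0 - v0
        c = t + random.uniform(1.0, 2.0)
        data.append((t, min(T, c), T <= c))

event_times = sorted({x for _, x, d in data if d})

surv = []            # (time, S_hat(time)) pairs of the product-limit estimator
S = 1.0
for s in event_times:
    at_risk = sum(1 for t, x, _ in data if t <= s <= x)  # truncation-adjusted risk set
    deaths = sum(1 for t, x, d in data if d and x == s)
    S *= 1.0 - deaths / at_risk
    surv.append((s, S))
# Note: the estimator can be unstable where the early risk sets are small,
# a well-known feature of product-limit estimation under left truncation.
```

A production implementation would aggregate the risk-set counts in a single sorted pass rather than rescanning the data at each event time; the quadratic version above is kept for readability.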

Measurement Error in Disease Onset Times
Both the conditional and unconditional analyses make use of the reported onset time, and the latter requires the additional assumption of a stationary disease incidence process.For individuals determined to have the disease at the time of assessment, the disease may have begun several years earlier, making accurate recall of the onset time difficult.There may therefore be considerable uncertainty about the reported onset time and the difference between the true onset time and the reported onset time represents recall, reporting, or measurement error; we will henceforth use the term measurement error.
Both the conditional and unconditional approaches to the analysis of prevalent cohort data will in general lead to biased estimators in the presence of measurement error.We therefore investigate the impact of this measurement error in both the conditional and unconditional frameworks for parametric and nonparametric settings.

The Classical Measurement Error Model
In retrospective studies, selected patients must recall their disease onset times. The recalled times are very likely to differ from the exact disease onset times, even if the two are close. Consider disease incidence over $[A, R]$, with a sample of the prevalent cohort selected at recruitment time $R$. Let $V_0$ be the exact disease onset time, which is not observed, and $U_0$ be the retrospectively reported disease onset time. A classical error model (Carroll et al. [25]) gives
$$U_0 = V_0 + \epsilon, \qquad (9)$$
where $\epsilon \sim N(0, \sigma^2)$ is random measurement error independent of $(V_0, T)$. The data obtained in this case are $\{(u_{0i}, x_i, \delta_i),\ i = 1, \ldots, n\}$, where $x_i$ is the observed event or censoring time and $\delta_i$ is the censoring indicator. Notice that diseased individuals who are still alive at the recruitment time and selected into the study must report their onset time retrospectively, and the reported onset time must also satisfy the condition $A \le U_0 \le R$. In this case the sample distribution of $U_0$ given $V_0$ becomes a truncated normal distribution, with density function
$$p(u \mid v) = \frac{\sigma^{-1}\phi\{(u - v)/\sigma\}}{\Phi\{(R - v)/\sigma\} - \Phi\{(A - v)/\sigma\}}, \qquad A \le u \le R,$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ denote the standard normal density and distribution functions.
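A small sketch of this truncated error model, with illustrative parameter values, is given below: a rejection sampler for the reported onset time and the corresponding truncated-normal density, checked by numerical integration over $[A, R]$.

```python
import math
import random

random.seed(5)

A, R, SIGMA = 0.0, 100.0, 1.0   # accrual window and error s.d. (assumed values)

def phi(z):      # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):      # standard normal distribution function, via erf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def reported_onset_density(u, v):
    # density of U0 given true onset V0 = v, truncated to [A, R]
    norm = Phi((R - v) / SIGMA) - Phi((A - v) / SIGMA)
    return phi((u - v) / SIGMA) / (SIGMA * norm)

def draw_reported_onset(v):
    # rejection sampling of U0 = v + eps restricted to [A, R]
    while True:
        u = random.gauss(v, SIGMA)
        if A <= u <= R:
            return u

# numerical check: the truncated density integrates to 1 over [A, R]
v, n = 98.5, 100000
step = (R - A) / n
total = sum(reported_onset_density(A + (k + 0.5) * step, v) * step for k in range(n))
print(round(total, 3))
```

The truncation matters most for onsets near the boundary of the window (such as the $v = 98.5$ used in the check), where a non-negligible share of the untruncated error distribution would fall outside $[A, R]$.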

Empirical Study of Measurement Error
If we ignore the measurement error and treat $U_0$ as the true onset time, both the left-truncation time and the survival time will be in error, and conditional and unconditional parametric analyses will lead to biased estimators of the parameters of interest. To examine this impact, we conducted a simulation study following the strategy of Huang and Qin [17] for generating length-biased data. The true disease onset time $V_0$ was generated under a stationary onset process over $[A, R]$, the underlying survival time $T$ was independently generated from a Weibull distribution, and the reported onset times were generated from the truncated classical measurement error model of Section 3.1, so that the recorded truncation and survival times are $R - U_0$ and $V_1 - U_0$, respectively. We set the sample size as n = 500 and simulated nsim = 1000 data sets. To examine the impact of measurement error in the disease onset time, naive, conditional and unconditional parametric and nonparametric approaches were applied to the resulting data, all of which involved treating $U_0$ as the "true" onset time. Table 1 summarizes the average bias (EBIAS), empirical standard error (ESE), average model-based standard error (ASE), and empirical 95% coverage probability of estimators based on the naive (NAIVE), conditional (COND) and unconditional (UNCOND) likelihoods.
From Table 1, we see that all three likelihood methods lead to biased estimators, since they all ignore the measurement error in the disease onset time. Although the ESE and ASE agree with each other, the empirical coverage probability is far from the nominal value. Further, when the variance of the measurement error becomes smaller, the biases of the estimators decrease substantially and the empirical coverage probabilities improve. This makes sense because the smaller the variance of the measurement error, the closer the reported onset time is to the true onset time, which reduces the impact of using the reported onset time. Table 2 and Table 3 summarize the nonparametric estimates of the survivor functions and percentiles based on the naive, conditional and unconditional approaches, along with the estimates based on parametric models for comparison. Similar conclusions can be drawn about the effect of measurement error in the disease onset time for nonparametric analyses. One point worth noting is that even when the variance of the measurement error becomes smaller, the biases remain quite large for the naive approach under both parametric and nonparametric analyses. This is because the naive approach treats the recruited sample as a representative sample of the population and does not correct for the selection bias of left-truncated or length-biased data.
To illustrate the importance of correcting for measurement error in the disease onset time for prevalent cohort samples, we plot the true survivor function against the estimated survivor functions based on the naive, conditional and unconditional likelihoods without correcting for measurement error; both parametric and nonparametric models are considered. Figure 2 shows that, when the measurement error in the onset time is ignored, both the conditional and unconditional likelihoods lead to biased estimates of the survivor function.

Corrected Parametric Conditional Likelihood
A "correct" likelihood approach can be used to account for the measurement error in the onset time and will yield unbiased estimators of the parameters of interest if the component model assumptions are correctly specified. Such a likelihood should be based on the reported onset time and the (possibly censored) survival time, which requires explicit modeling of the measurement error process. Let $h(v_1 \mid u_0)$ be the density function of the calendar time of death given the reported onset time, i.e.
$$h(v_1 \mid u_0) = \int f(v_1 - v;\theta)\, p(v \mid u_0)\,\mathrm{d}v, \qquad (10)$$
where $p(v \mid u_0)$ is the conditional density of the true onset time given the reported onset time, and let $H(v_1 \mid u_0) = \int_{v_1}^{\infty} h(s \mid u_0)\,\mathrm{d}s$. The "correct" conditional likelihood for right-censored left-truncated data $\{(u_{0i}, x_i, \delta_i),\ i = 1, \ldots, n\}$ conditions on survival to the recruitment time $R$:
$$L_C^*(\psi) = \prod_{i=1}^{n} \frac{h(v_{1i} \mid u_{0i}; \psi)^{\delta_i}\, H(v_{1i} \mid u_{0i}; \psi)^{1-\delta_i}}{H(R \mid u_{0i}; \psi)}, \qquad (12)$$
where $\psi = (\theta', \phi)'$ collects the survival parameters and the measurement error parameter. Similarly, the joint density of the observed onset time and the calendar time of death is the product of $h(v_1 \mid u_0)$ and the marginal density of the reported onset time among sampled individuals, where the latter factorization is derived from (10).
The "correct" unconditional likelihood can then be constructed as
$$L_F^*(\psi) = L_C^*(\psi)\, L_M^*(\psi), \qquad (14)$$
where $L_M^*(\psi)$ is the marginal likelihood of the reported onset times among sampled individuals. Since $L_M^*$ may contain information about the parameters of interest, the "correct" unconditional likelihood may be more efficient than the "correct" conditional likelihood. Further, when the underlying onset process is stationary, we can take the onset intensity to be constant and let $A \to -\infty$ to obtain both "correct" likelihoods for length-biased data.
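The key building block of the corrected likelihoods is the density of the duration measured from the reported onset, $V_1 - U_0 = T - \epsilon$, obtained by integrating the survival density over the latent error. The sketch below evaluates this integral numerically for an exponential survival time and normal error, ignoring the interval truncation of the reported onset for simplicity; in this special case the integral has a closed exponentially-modified-Gaussian form, which is used to check the quadrature. All parameter values are illustrative assumptions.

```python
import math

LAM, SIGMA = 0.5, 1.0     # exponential survival rate, error s.d. (assumed values)

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def h_numeric(x):
    # density of X* = T - eps at x: integrate f(x + w) * phi_sigma(w) over
    # the latent error w = U0 - V0, keeping only w with true duration x + w > 0
    lo, hi, n = -8.0 * SIGMA, 8.0 * SIGMA, 4000
    step = (hi - lo) / n
    total = 0.0
    for k in range(n):
        w = lo + (k + 0.5) * step          # midpoint rule over the error range
        if x + w > 0:
            total += LAM * math.exp(-LAM * (x + w)) * phi(w / SIGMA) / SIGMA * step
    return total

def h_closed(x):
    # exponentially-modified-Gaussian form of the same density
    return LAM * math.exp(0.5 * (LAM * SIGMA) ** 2 - LAM * x) * Phi(x / SIGMA - LAM * SIGMA)

print(round(h_numeric(2.0), 4), round(h_closed(2.0), 4))
```

In the paper's setting the integral in (10) additionally carries the truncated-normal weighting over $[A, R]$, so a closed form is generally unavailable and the likelihood is evaluated by quadrature as in `h_numeric`.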
The maximum likelihood estimators $\hat\theta_C^*$ and $\hat\theta_F^*$ under the "correct" conditional and unconditional likelihoods can be found by maximizing (12) and (14) respectively, and both are asymptotically normal as $n \to \infty$, with asymptotic variances given by the inverses of the corresponding Fisher information matrices.

Empirical Study of Corrected Likelihood
To examine the performance of the "correct" likelihoods in the presence of measurement error in the disease onset time, we use the same strategy as in Section 3.2 to generate length-biased survival data with measurement error in the disease onset times. The "correct" likelihood is considered in two scenarios: the log standard deviation of the measurement error, $\phi = \log \sigma$, is known or unknown. Figure 3 shows the estimated survivor functions based on the conditional and unconditional likelihood approaches which ignore the measurement error, and on the "correct" conditional and unconditional likelihood approaches based on (12) and (14). From this figure, we find that the proposed "correct" likelihood approach adjusts for the measurement error well and leads to better estimates of the survivor functions. Table 4 summarizes the empirical properties of the estimates based on the naive parametric conditional likelihood, the "correct" parametric conditional likelihood, the naive parametric unconditional likelihood, and the "correct" parametric unconditional likelihood. For the corrected likelihoods we maximize (12) and (14) both with respect to $\psi$ (i.e. when $\phi$ is treated as unknown) and with respect to $\theta$ when $\phi$ is fixed at the true value. Whether the error variance ($\phi$) is known or unknown, the "correct" likelihood approach reduces the bias of the estimates, and the resulting empirical coverage probabilities are all within the acceptable range. These simulations therefore provide empirical support for the claim that the "correct" likelihood approach adjusts for the measurement error and yields consistent estimators. Notable is the only modest increase in the empirical or average standard errors of the parameter estimates when the variance of the measurement error distribution is estimated, especially for the shape parameter $\kappa$. The "correct" likelihood approach also provides a good estimator of $\phi$; for example, when $\phi = \log 1 = 0$, the empirical bias of the estimator of $\phi$ is 0.03 with standard error 0.27 for the conditional analysis and 0.01 with standard error 0.11 for the unconditional analysis.

Discussion
Statistical models and methods for the analysis of prevalent cohort data have been reviewed here from both the conditional and unconditional frameworks. It is well known that naive analyses which ignore the selection bias lead to overestimation of the survivor probabilities. The conditional likelihood based on the density of left-truncated event times can be used to correct for this selection bias. The unconditional likelihood approach, based on the joint density of the backward and forward recurrence times, yields more efficient estimators by incorporating the information contained in the onset times; the typical assumption required to formulate the associated model is a stationary disease incidence process. Since both approaches make use of the onset time information to correct for selection effects, misspecification of the retrospectively reported disease onset time can have serious implications for estimation. We investigated the impact of measurement error in the disease onset time for prevalent cohort samples and proposed "correct" conditional and unconditional likelihoods to account for the measurement error.
The methods we proposed to correct for measurement error in this paper are based on parametric models.


Figure 1. Diagram of calendar times and study times of disease onset, left-truncation and survival.

Simulation details: the censoring time, measured from the time of recruitment, is independently and uniformly distributed over $[1, 2]$, which leads to a 30% true censoring rate. To incorporate measurement error in the onset time, we adopt the classical measurement error model (9), with error variances chosen to reflect mild and strong measurement error, respectively. In the presence of measurement error, although the ascertainment criterion is still $V_1 > R$ to form a prevalent cohort sample, both the left-truncation time and the survival time are affected by the random error and are recorded as $R - U_0$ and $V_1 - U_0$.

Figure 2. Nonparametric and parametric estimates of the survivor function based on the naive, conditional and unconditional likelihoods in the presence of measurement error in the disease onset time, when the measurement error is ignored; n = 5000. (a) σ = 1; (b) σ = 0.5.
