An Empirical Bayes Approach to Robust Variance Estimation : A Statistical Proposal for Quantitative Medical Image Testing

The current standard for measuring tumor response using X-ray, CT and MRI is based on the response evaluation criterion in solid tumors (RECIST) which, while providing simplifications over previous (WHO) 2-D methods, stipulate four response categories: CR (complete response), PR (partial response), PD (progressive disease), SD (stable disease) based purely on percentage changes without consideration of any measurement uncertainty. In this paper, we propose a statistical procedure for tumor response assessment based on uncertainty measures of radiologist’s measurement data. We present several variance estimation methods using time series methods and empirical Bayes methods when a small number of serial observations are available on each member of a group of subjects. We use a publically available database which contains a set of over 100 CT scan images on 23 patients with annotated RECIST measurements by two radiologist readers. We show that despite the bias in each individual reader’s measurements, statistical decisions on tumor change can be made on each individual subject. The consistency of the two readers can be established based on the intra-reader change assessments. Our proposal compares favorably with the RECIST standard protocol, raising the hope that, statistically sound decision on change analysis can be made in the future based on careful variability and measurement uncertainty analysis.


Introduction
Currently there is much interest in treating quantitative medical imaging as a biomarker, employing medical imaging tools to assess tumor change, especially in the assessment of response to medical therapy.In order to use medical imaging effectively as quantitative measurement tools, a number of questions are raised regarding quantifying cancerous tumor changes over course of time, such as  What measures should be used in quantifying meaningful change or response of suspicious tumor objects from images, whether it is based on volume (3D) which has attracted a lot of current interest, or the WHO 1 (2D) or RECIST (1D) [1,2]? What is the basic variability in these measures, in-cluding both intrinsic measurement variability (e.g.repeatability and reproducibility, and effects from different instrument settings), and expert bias in marking up these measures, or biological variability? A critical but related question is-given the variability in imaging acquisition analysis, what is the minimum change that can be detected for a given imaging modality and a chosen image processing method and sizing measure.For example, one would like to know with credible statistical accuracy how large a measured size change must be in order to be declared "significant" in a single individual?
In this paper, we present preliminary steps toward a statistical methodology for variance estimation that will help address these questions.Because there are typically few measurements available on each individual subject, even in a longitudinal study, it is crucial that individual variance estimates for many patients be pooled together to arrive at stable individual estimates.We apply the empirical Bayes method of Herbert Robbins [3] on estimating many variances for this purpose.Our approach is based on the following intuitive rationale: First, we want to have an empirical variance estimation which is not biased (upward) by the presence of signals, while on the other hand we want to avoid underestimation due to failure to account for additional sources of uncertainty.Secondly, because there are only a few observations for each patient, the variance estimation, whatever method being used, is going to be highly variable due to the low degree of freedom, and it is imperative that more stabilized variance estimates be applied in order to achieve higher power.We propose to use time-series-based robust variance estimators and rejection rules based on a series of measurements on each patient for a given reader so that the effect of real trend in the measurements in an individual subject's progress over time may be minimized.The empirical Bayes variance estimation approach [3,4] can then be applied to individual variance estimates by pooling information across subjects (patients) on which a reader (radiologist) has made observations, providing an indirect way of incorporating intrareader measurement uncertainty.Finally, a statistical decision rule of change analysis can be developed for a single individual, even if an individual reader may commit systematic bias in his or her measurements.
Currently the most common quantitative measure of tumor nodule size is based on the RECIST technique [1], a set of protocols based on the endpoint defined as the sum of largest diameters of all "target" lesions.In addition, RECIST also recommends the following percentage-based decision rules: Partial Response (PR) in which there occurs at least a 30% decrease from initial baseline measurement, Progressive Disease (PD) where there is at least a 20% increase relative to the smallest value of measurement after treatment initiation, and Stable Disease (SD) where there is neither sufficient shrinkage to qualify as PR or sufficient increase to qualify as PD.
There are at least two concerns from the statistical point of view in applying RECIST guidelines to practice: First, the guide fails to address the uncertainty that is associated with the RECIST measurement, such as the effect of various slice thickness or spacing, and effects from experimental factors [6]; Secondly, the guide fails to clarify or ignores the importance of intra-observer and inter-observer variability by radiologists.Several recent studies have indicated the significant variance contribution of the second source and its important effect on the RECIST decisions [7][8][9].By focusing on a case study of a small set of CT scan images from the RIDER [10] database on 23 patients on which two expert radiologists have made a series of markings on some single nodule of RECIST measurements, we demonstrate that a variance estimation approach works reasonably well in providing a statistical alternative to the RECIST percentage-threshold decision rules, and in providing an assessment of the reliability of the two observers.For example, we find that even if there is a clear systematic bias between the two observers, statistical decision on the change analysis can be made reliably based on the serial observations from a single observer, by combining information across different subjects, and that the two observers agree with each other more often than expected from a random guess.The results of the statistical decision rules compare favorably with the categorical percentage-based RECIST method.Thus, variance analysis in quantitative imaging measures can provide informed decisions on clinical image change analysis using a statistical approach based on variance estimation and measurement uncertainty analysis.

Statistical Methodology for Variance Estimation
Imagine that there are a number of patients under observation at some discrete time points in a given timespan, as in a typical longitudinal therapeutic study.The data can either be some derived measures of nodule volume, area (WHO) or diameter (RECIST), provided by computer-assisted or manual readings by radiologists, and we denote them for a given patient as a time series, , assuming that they are taken at equally spaced time points for each patient, though our methodology does not require equally spaced observational times.Specifically we may write i i X y t  , i = 1, •••, N. If time is the only covariate of interest-though any other information serves as a covariate-we can assume that the data (by one reader, one computer algorithm) for each patient consist of: where f(t) models the change (signal) which may reflect growth as well as effects from clinical treatments,  denote the systematic bias, denotes the repeatability variance component, and t  is the measurement error with zero mean and unit variance.Both regression model f(t) and variance function in (1) can be extended to include more covariates and even past observations.Such models are widely used in financial volatility modeling; see for example [11].
Our focus is on how estimation of can be made in the presence of f(t), which is usually unknown.The first case, is to assume that f t c , some unknown constant.Then if we assume that   , constant variance over time, an obvious estimator is which is exactly the same estimator as estimator, as suggested in [12].However, estimators   are valid only under the assumption that there is no change, or f(t) is a constant, and will be heavily biased if f(t) changes with time.We present an alternative estimator, 1 This difference-based variance estimator can be justified based on the assumption that f is slowly-varying, or locally constant.In addition, some robust statistics measures may be desirable, due to the fact that they are useful for small data set and are resistant to potential outliers in data.For example, we consider the variance estimator, also called the Gini mean difference.Also related is the Median Absolute Deviation (MAD) measure, defined as i X median , ˆ1.4826 median 1, , References [13,14] gave extensive discussion of the strengths of robust estimation in practice.
Consequently, we propose as a robust version of (3).A few comments on comparing the different variance estimators are in order.
1) The Gini mean difference in (4) Gini  is more robust than (2) and has less variance than MAD in (5).
2) In order to reduce potential bias when there is change (or when f(t) is not a constant), we recommend a time-differenced based estimators in (3) and (6).
It should be emphasized that variance estimators like (6) are proposed here to address variance estimation when "no change conditions" can be met in incremental time steps.If this condition cannot be met, estimators like (3) or ( 6) can still contain significant bias due to signals, and this should be adjusted according procedures suggested in Section 4 by pooling information from other patients.
Once reliable variance estimation becomes available, one can use them to make inference on change analysis, we can define t-statistics-like quantity such as which gives the standardized overall change for a patient in the study period 1 p t t    and may be compared to standard statistical inference procedure such as significance test or power analysis.Here  can be one of the variance estimators proposed here, such as (6).However, we recommend more stabilized variance estimates by pooling information from other patients, as discussed in Section 3.
If there are m patients being monitored over i time intervals, , and the number of readers (radiologists) is p, we can generalize model (1) to individual-based model as: The reader difference due to multiple readers or radiologists is modeled by the bias β j (t) and variance In our formulation, to simplify, we have ignored the actual time recordings and treated the time series data as if they are observed on equally spaced time intervals.Because typically there are only a few observations (over a period of 4 to 5 visits by a patient), the variance functions associated with model ( 8) are assumed to be independent of time t (homoscedastic), and the reader bias is assumed to be constant over time for each reader.In the following section we discuss how variance estimates from different subjects can be combined to provide an improved and more stable statistical test.

Pooled Variance Estimates
Recall that we may use a variance estimator like (6) which, however, requires the longitudinal growth to be slowly-varying.We discuss bias reduction by using information from data sets on other patients.If there are many variances to be estimated, the main issue is how information from similar data sets can be combined to obtain improved and more reliable variance estimates.
Robbins [3] discussed a linear empirical Bayes approach to estimation of many variances which share some common mean.Specifically, if we are given a number of data sets to estimate respectively many means and variances, simultaneously.Let ij x be independent and where denotes the nonnegative part.Then one of the ways of estimating the i  by linear empirical Bayes method (abbreviated as l.e.B) is to use Equation ( 49) of [3], see also [4].In our applications, for readings of each patient, the robust variance estimate RTS  will be computed and used in place of2 i s in (10).It is noted that in Robbins's approach, the signal is assumed to be constant over time.As discussed in Section 3, this assumption can be relaxed when variance estimator based on ( 6) is used, since the latter is still valid when the underlying f is locally constant.However, if the latter assumption cannot be met, the variance estimate can be inflated due to the bias from the signals.Bias adjustment procedure can be easily devised by "borrowing" information from variance estimates across many subjects.An implementation is illustrated within a real data example in the next section.
Given the availability of reliable variance estimation, we can define a statistical procedure for change analysis based on the z-type ratio quantity for comparing means of two normal random variables: where defines the overall change in the study period for patient i, for i = 1, •••, m, and the ratio can be compared to the standard normal distribution for significance test.Here i  is the final variance estimate for a given patient based on annotation data from a given radiologist.A change analysis decision rule can be based on (11), using, say the standard normal distribution as reference for significance test whether there is an increase, or a decrease in the serial measurements of nodule diameters.This statistical proposal for deciding change is in contrast to the recommended RECIST practice [1] which is based on the percentage change if the measurement at the entry time point is taken as the baseline for patient i.We approximate the RECIST guideline by classifying progressive disease (PD) or partial response (PR) based on whether the measured tumor percentage change ( 12) is a greater than or equal to 20% increase (i.e., PD), or shows shrinkage by 30% or more (i.e., PR).

Analysis of the Bias and Corroboration of Expert Annotations in the RIDER Database
The annotation data consisting of single tumor diameter measurements by two expert radiologists on 23 patient cases based on over 100 CT image scans contained from the National Cancer Institute RIDER image archive (NCIA) [10] is the focus of this statistical case study.The RIDER medical image archive [15] is a large collection of CT images of patients undergoing treatment for nonsmall cell lung cancer.CT scans, de-identified for patient privacy, had their cancer masses measured by RECIST guidelines at approximately 12 week repeated intervals to track tumor response during the course of therapy.The images were acquired by state of the art 16-row multidetector spiral CT scanners at adjacent 5 mm slice thicknesses and stored in standard DICOM data format 2 .The cases were viewed and the tumor masses measured at each time interval on a standard picture-archiving system (PACS) viewing workstation (Cedara, Merge Healthcare 3 ).These time-sequence RECIST readings by multiple radiologists provide a candidate "ground truth" nodule size behavior on each patient.There are 90 observations in total for 23 patients, with longitudinal observations ranging from 2 to 7 visits per patient.
Figure 1 shows the plot of the raw data.on ( 6) for each patient.The bottom figure, the test statistics is defined similarly as in (11) but with variance estimate as computed given in the bottom figure in Figure 2. One can conclude from Figure 3 that the test results in the bottom figure significantly improve over original results in top figure.For patient number 2, the new test does not find significance, while original raw-variance based test finds strong significance due to a low variance estimate.The new test seems to be more consistent with visual appearance of patient number 2 in Figure 1.Similar comments apply to data for patient number 11.The opposite is observed for patient number 18, and patient number 19.Original tests based on raw variance estimate do not find significance due to inflated variance estimates, and this is corrected in the new test.As a result, significant change is observed for both patient numbers 18 and 19 using the improved test.It is found that the two readers agree with each other on their assessment in the direction of change on 19 out of 23 patients.(They disagreed on patient number 1, 20, 21, and 23).A summary of decision results based on the statistical tests is provided in detail in Table 1, where we use 10% as the threshold for significant increase and 5% level for significant decrease.
Interestingly, one may compare the statistical test results with the RECIST guidelines (a time-sequence increase of at least 20% defines "progressive disease" (PD), while a decrease of at least 30% defines "partial response" (PR)), which can be inferred from the relative percentage change data plotted in Figure 4. Summary of the RECIST analysis is given in Table 2.In short, on 4 out of 23 patients, the two experts have given percentage changes of opposite directions (cf.patient numbers 1, 20, 21, and 23).The two experts differ in their computed percentage changes with a mean average difference of 21%.In terms of RECIST decision results based on the radiologists' individual assessments, in addition to the 4 patients on which they totally disagree, they agree on 15 Variance estimation after bias adjustment We define the following symbols: x for significant increase at 10% level, --for significant decrease, o for non-significant decrease, and % for non-significant increase.

Reader 1
Partial Recovery (PR) 0 0 0 2  patients, and agree partially on 4 patients (patient numbers 8, 9, 17, and 22) in their categorical classifications (progressive disease (PD), partial response (PR)).In terms of the Kappa measure [16], the statistical tests give . . . . 1 0.5861 where θ ij denotes one of the entries in Table 1, and θ .iand θ i. denote the row and column sums, divided by the total (23).The Kappa number for the statistical test indicates slightly better agreement between the two readers than the Kappa measure (0.5027) based on RECIST-based results in Table 2 (However, both approaches indicate there is moderate agreement among the two readers [17]).
At the minimum we can ask whether there is any corroboration or dependence between the two readers' assessments.If we only use the signs of the categorical score measurements (so the variance estimation has no impact), there are 7 concordant "increases", 10 concordant "decreases", 4 discordant decisions for "increase" by one reader and "decrease" by another reader, and vice versa.The Chi-squared test for independence by the two readers using the contingency table gives a value of 5.5457 with 1 df, and P-value of 0.0185.Using Fisher's exact test, P-value is 0.0092 for two-sided alternative.

Discussion
We believe that there is a strong need to study the reliability and statistical performance of RECIST, or any other time-sequence tumor size measurement regimes such as WHO or 3D volume metrics.Statistical methods suggested in this paper are used to demonstrate the potential of medical decision making by taking into account explicitly the uncertainty in the markings by expert radiologists, and a statistical decision rule for change could potentially be available for the future based on realistic measurement quantification along the lines of [6,18].In addition, there is a critical need for establishing measurement uncertainty, such as accommodating the effects of protocols and instrument settings [6].Statistics-based decision rule can easily incorporate the different facets of uncertainty components in therapy response decision making.There are needs to study biological variability and to study the algorithmic factors of computer-assisted measurements in other size measures such as volume metric which is mainly useful for thin slice CT scans (1.0 mm or less) [6].
Partly due to the observation that there is measurement bias in the absolute nodule size measurements, alternative procedures have been investigated for direct change measurements (e.g.[19,20]).However, we caution the readers that the latter approach raises additional issues with the uncertainty in the change measurements themselves and there are still issues on how to assess measurement uncertainty in change-measurement data such as for small nodules.Though there are many developments with RECIST, this important topic has received little attention in the statistical literature (an exception is [21]), we believe there are ample opportunities for statisticians to be engaged in this important medical image decision analysis concerned with assessing therapeutic effectiveness.
may have his or her own variance function uncertainty and his or her own change function i f t .

Figure 2 3 .Figure 1 .
Figure 1.Plots of RECIST readings versus time index for each patient by two radiologists (denoted 1, 2) in a longitudinal study.The RECIST markup data here is the largest diameter of one identified nodule, in millimeters (mm).

Figure 3 .Figure 2 .
Figure 3. Z-ratio test statistics for change based on readings from two experts (denoted by symbols 1, 2) in a longitudinal study involving 23 Patients.Top: Based on raw variance estimate.Bottom: Based on pooled and stabilized variance estimation as given in Figure 2. The solid lines (black) denote the 0.05 significance test threshold for positive or negative change in the mean, and dashed line (red) denote the 0.10 significance test threshold for change.

Figure 4 .
Figure 4. RECIST percentage-based interpretation of annotated RIDER data.X-axis: Patient number from 1 through 23 for 23 patients in the annotated database.Y-axis: Computed percentage changes (measurement at final time minus at entry time, divided by measurement at entry time) for two experts (1s for reader one, 2s for reader two).Data are the RECIST annotations (largest diameter of nodule among all slices at a given time) at different times for all 23 patients by two radiologists.The two solid lines denote the 20% increase and 30% decrease thresholds.