New Tests for Assessing Non-Inferiority and Equivalence from Survival Data

We propose a new nonparametric method for assessing non-inferiority of an experimental therapy compared to a standard of care. The ratio $\mu_E/\mu_R$ of true median survival times is the parameter of interest. This is of considerable interest in clinical trials of generic drugs. We think of the ratio $m_E/m_R$ of the sample medians as a point estimate of the ratio $\mu_E/\mu_R$. We use the Fieller-Hinkley distribution of the ratio of two normally distributed random variables to derive an unbiased level-α test of the inferiority null hypothesis, which is stated in terms of the ratio $\mu_E/\mu_R$ and a pre-specified fixed non-inferiority margin δ. We also explain how to assess equivalence and non-inferiority using bootstrap equivalent confidence intervals on the ratio $\mu_E/\mu_R$. The proposed new test does not require the censoring distributions for the two arms to be equal and it does not require the hazard rates to be proportional. If the proportional hazards assumption holds, the proposed new test is more attractive. We also discuss sample size determination. We claim that our test procedure is simple and attains adequate power for moderate sample sizes. We extend the proposed test procedure to stratified analysis. We propose a “two one-sided tests” approach for assessing equivalence.


Introduction
Non-inferiority and equivalence trials aim to show that the experimental therapy is not clinically worse than (non-inferiority) or clinically similar to (equivalence) an active control therapy. As the statistical formulation is one-sided, non-inferiority trials are also called one-sided equivalence trials. ICH E10 [1] is the official guidance document on the choice of controls in non-inferiority clinical trials. The active control, which is also called a reference, is usually a standard of care. As noted in [1], most active-control equivalence trials are really non-inferiority trials intended to establish the efficacy of a new therapy. A non-inferiority trial is conducted to evaluate the efficacy of an experimental therapy compared to an active control when it is hypothesized that the experimental therapy may not be superior to a proven effective therapy, but is clinically and statistically not inferior in effectiveness. If the experimental therapy has a better safety profile, is easier to administer, and/or costs less, then non-inferiority trials are considered appropriate [2].
Confidence intervals on hazard ratios are used to assess equivalence and non-inferiority from survival data. The concept of a hazard ratio is elusive, and clinicians find it hard to understand. Koch [3] says that though it is straightforward to construct confidence intervals on hazard ratios, they can be awkward to interpret. Wellek [4] proposed a log-rank test for equivalence of two survivor functions. According to Wellek, the survivor functions are considered equivalent if the absolute difference between the two survival curves is less than a pre-specified margin $\delta > 0$ over the whole range of values of event time. His test is carried out in terms of the regression coefficient for a dummy covariate indexing the trial arms. Though Wellek's paper is remarkable in its technical content, the test procedure is not used in practice. A possible reason is that his equivalence criterion is conceptually difficult for clinicians to understand. Moreover, this formulation of the problem requires that the survival curves belong to the same proportional hazards model. The proportional hazards assumption is often inappropriate. We would like to point out that if the proportional hazards assumption holds, the tests for non-inferiority (and equivalence) in terms of medians would be more attractive.
Because the distribution of survival times tends to be positively skewed, the median is the preferred summary measure of the location of the distribution. Also, the median is straightforwardly informative to clinicians. Efron [5] put it well: "The median is often favored as a location estimate in censored data problems because, in addition to its usual advantage of easy interpretability, it least depends upon the right tail of the Kaplan-Meier curve, which can be highly unstable if censoring is heavy." Simon [6] emphasizes the importance of confidence intervals on median survival times. He writes: "For exponential survival distributions, the hazard ratio equals the ratio of medians. Exponential survival means that the survival curve is a straight line on a semilogarithmic scale (log survival probability over time). Because exponential distributions are good approximations to the survival curves seen in many kinds of advanced cancer, confidence intervals for the hazard ratio are often interpreted as confidence intervals for the ratio of medians." Simon also explains how to calculate a confidence interval on the ratio of median survivals when the survival distributions are exponential. As a result, it has become common practice in clinical trial reporting to give point and interval estimates for the median survival time. This motivated us to consider testing for equivalence and non-inferiority of an experimental therapy compared to a reference therapy in terms of their median survival times. As assessing non-inferiority in terms of the difference between median survival times is trivial, we focus on their ratio.
Rubinstein et al. [7] were probably the first to consider the problem of testing the null hypothesis that the median survival times are equal against an alternative that the median survival time for the experimental treatment exceeds that of the control arm. They assumed exponential distributions for survival data. Bristol [8] presents a modification to Rubinstein's procedure for situations where it is desired to show that the experimental treatment is not much worse than the control. As noted by Berger and Hsu [9], and Hauschke and Hothorn [10], testing for non-inferiority in terms of the ratio of the averages often reflects the clinical rationale better than testing in terms of the difference between the averages. Bristol wants to test the null hypothesis that the ratio of medians is less than or equal to a fixed margin $\delta$ against the alternative that the ratio exceeds $\delta$. To simplify the matter, he assumes that failure times have exponential distributions. Bristol's real interest is in testing the ratio hypothesis $H_0^1$ stated in (3.1) in Section 3. However, he uses a log transformation of the ratio to derive an asymptotic test. We circumvent this problem by introducing the Fieller-Hinkley (hereafter abbreviated as F-H) distribution of the ratio of two normally distributed random variables. Moreover, we do not assume that failure times follow an exponential or any other parametric distribution.

One Sample Survival Model, Median Estimate and Standard Error
We develop the tests under the framework of a randomly right-censored survival model. The basic quantity employed to describe time-to-event phenomena is the survivor function $S(t) = 1 - F(t)$, where $F$ is the distribution function of the event time and $\mu$ denotes its median. The median estimate is given by $m = \min\{t : \hat{S}(t) \le 0.5\}$, where $\hat{S}$ is the product-limit estimate of $S$. That is, the median survival time is estimated from the product-limit estimate to be the first time that the survival curve falls to 0.5 or below. The sample median $m$ is asymptotically normally distributed with mean $\mu$. The variance of $m$ is mathematically intractable. The SAS LIFETEST procedure provides an estimate of the survivor function accompanied by its standard error [11]. By default, the SAS LIFETEST procedure uses the Kaplan-Meier method. It also produces a point estimate of the median $\mu$ of $F$ and the 95% confidence interval derived by Brookmeyer and Crowley [12]. Brookmeyer and Crowley obtained the confidence intervals by inverting a generalization of the sign test for censored data. They did not need the standard error of the sample median. Consequently, the SAS LIFETEST procedure does not provide the standard error of the sample median $m$. One form of the asymptotic variance of the median, given in (2.1), is found using Greenwood's formula [13]. A slightly different version, given in (2.2), is provided in [14]. As the density $f$ is unknown, the variance given in either (2.1) or (2.2) is of no practical use in making inference on the population median time $\mu$ [15]. We propose to estimate the standard error of $m$ using Efron's bootstrap [5], which does not make any distributional assumptions. In a single-sample setting, Efron's bootstrap may be described as follows. We draw a bootstrap sample by sampling with replacement from the observed data, compute its Kaplan-Meier median, and repeat this $B$ times; the resulting bootstrap variance estimate of $m$ is given in (2.3). One may set $B$ equal to 1000. This is called the "model-free" bootstrap or Efron's bootstrap procedure II. The University of Texas at Austin [16] has provided some introductory SAS code needed to resample a SAS dataset.
See Efron [5]. We suppress the subscript BOO of the estimated variance in (2.3). In fact, Keaney and Wei [17], among others, have used the bootstrap to find the standard error of $m$.
What indicates an unstable median or heavy censoring is a crucial question. As observed in [12], if the survival curve is relatively flat in the neighborhood of 50% survival, there can be a great deal of variability in the estimated median. It would then be more appropriate to cite a confidence interval for the median. We propose a simple rule of thumb. If the upper limit of a 95% confidence interval on the median is not available, one may conclude that the median is unstable and/or censoring is heavy. Therefore, the proposed tests should work efficiently when the Brookmeyer-Crowley upper limit of a 95% confidence interval on the median is available. This also minimizes the number of bootstrap samples whose Kaplan-Meier curves do not reach 0.5 survival probability. In addition, asymptotic normality requires that $m$ be at least twice its standard error.
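The bootstrap standard error of the Kaplan-Meier median described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SAS procedure referred to in the text: the function names `km_median` and `bootstrap_se_median` are hypothetical, observations are resampled as (time, event indicator) pairs, and bootstrap samples whose Kaplan-Meier curve never reaches 0.5 are simply skipped and counted, which is an assumption rather than a rule stated in the paper.

```python
import random
import statistics


def km_median(times, events):
    """First time the Kaplan-Meier curve falls to 0.5 or below (None if it never does).

    events[i] is 1 for an observed event and 0 for a censored time.
    """
    data = sorted(zip(times, events))
    n = len(data)
    surv = 1.0
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i          # subjects still under observation just before t
        deaths = 0
        j = i
        while j < n and data[j][0] == t:
            deaths += data[j][1]
            j += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            if surv <= 0.5 + 1e-12:   # small tolerance for floating-point rounding
                return t
        i = j
    return None


def bootstrap_se_median(times, events, B=1000, seed=0):
    """Bootstrap standard error of the Kaplan-Meier median.

    Returns (standard error, number of bootstrap samples that had a median).
    """
    rng = random.Random(seed)
    pairs = list(zip(times, events))
    n = len(pairs)
    medians = []
    for _ in range(B):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        m = km_median([p[0] for p in sample], [p[1] for p in sample])
        if m is not None:        # skip samples whose curve never reaches 0.5
            medians.append(m)
    if len(medians) < 2:
        raise ValueError("too few bootstrap samples reached 0.5 survival")
    return statistics.stdev(medians), len(medians)
```

If many bootstrap samples are skipped, that is itself a sign of the unstable-median situation discussed above, and the resulting standard error should be treated with caution.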

Null and Alternative Hypotheses
Let $T_E$ and $T_R$ denote the times to event for the experimental and reference treatment groups, respectively. We use $S_E$ and $S_R$ to denote the survival functions, and $\mu_E$ and $\mu_R$ to denote the medians of $T_E$ and $T_R$, respectively. Depending on the application one may test

$$H_0^1: \mu_E/\mu_R \le \delta_L \quad \text{versus} \quad H_A^1: \mu_E/\mu_R > \delta_L. \qquad (3.1)$$

Here $\delta_L < 1$, and large median values point to large positive effects. For example, the null and alternative hypotheses in (3.1) are appropriate if non-inferiority as measured by the overall survival of patients is desired. In some other applications, small median values may point to large positive effects, in which case, for proving non-inferiority, one may test

$$H_0^2: \mu_E/\mu_R \ge \delta_U \quad \text{versus} \quad H_A^2: \mu_E/\mu_R < \delta_U, \qquad (3.2)$$

where $\delta_U > 1$. For example, if duration of anemia (or time to response) is the clinical endpoint, it is appropriate to consider the null and alternative hypotheses in (3.2). Testing hypotheses of this kind is of considerable interest in clinical trials of generic drugs.
Henceforth, we assume that two independent samples of right-censored event times are given. We use T to represent the data. The sample sizes $n_E$ and $n_R$ are sufficiently large. The censoring proportion in each arm is moderate. That is, the trial is designed to have a long enough follow-up time so that more than one half of the subjects in each arm have had the event. The proportional hazards assumption is not required. However, we assume that each treatment group has a survival curve that is not relatively flat in the neighborhood of 50 percent survival. We also assume that each median estimate is at least two times larger than its standard error. Then the ratio $W = m_E/m_R$ of the sample medians approximately follows the F-H distribution, whose density is expressed in terms of $\phi$, the standard normal density function.
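To make the role of the ratio distribution concrete, the sketch below evaluates a standard normal approximation to the distribution function of a ratio of two independent normal variables whose denominator is positive with high probability: $P(W \le w) \approx \Phi\big((w\mu_R - \mu_E)/\sqrt{\sigma_E^2 + w^2\sigma_R^2}\big)$. This is offered as an assumption-laden illustration and is not a reproduction of the paper's own expression (4.1); the function name `ratio_cdf` and its arguments are hypothetical.

```python
import math
from statistics import NormalDist


def ratio_cdf(w, mu_e, mu_r, se_e, se_r):
    """Approximate P(W <= w) for W = m_E / m_R.

    Assumes m_E ~ N(mu_e, se_e^2) and m_R ~ N(mu_r, se_r^2) are independent
    and that m_R is positive with high probability (mu_r well above se_r).
    """
    return NormalDist().cdf((w * mu_r - mu_e) / math.sqrt(se_e ** 2 + (w * se_r) ** 2))


# With mu_E = 8 and mu_R = 10 the approximate median of W is 0.8.
print(ratio_cdf(0.8, 8.0, 10.0, 1.0, 1.0))
```

The requirement that each median be at least twice its standard error is what keeps the denominator safely away from zero, so that the event $W \le w$ can be treated as $m_E - w\,m_R \le 0$.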

The Distribution of the Ratio
As the ratio of median survival times is always positive, we suppress the superscript.
Koti used the F-H distribution to derive non-inferiority tests under the analysis of variance setting [20]. Koti also used the F-H distribution to derive tests for the null hypothesis of a non-unity ratio of proportions [21]. In this paper, his test procedure is extended to survival data analysis.

Test for the Lower Inequality
In this section we consider testing the null hypothesis $H_0^1$ against the alternative hypothesis $H_A^1$, which are stated in (3.1). Under the null hypothesis, we need to find a cutoff point $w$ that satisfies equation (5.2), where $z_a$ is the 100a-th percentile of the standard normal distribution. The cutoff point $w$ satisfying (5.2) defines the rejection region for a given value of $\mu_R$.
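As an illustration of how a cutoff point can be obtained from a quadratic equation, the sketch below solves for $w$ at the null boundary $\mu_E = \delta_L\mu_R$ under the normal approximation $P(W \le w) \approx \Phi\big((w\mu_R - \mu_E)/\sqrt{\sigma_E^2 + w^2\sigma_R^2}\big)$. It assumes that $\mu_R$ and both standard errors are given, so it does not reproduce the paper's treatment of the unknown parameters in (5.2) and (5.3); the function name `cutoff_lower_test` is hypothetical.

```python
import math
from statistics import NormalDist


def cutoff_lower_test(delta_l, mu_r, se_e, se_r, alpha=0.025):
    """Cutoff w such that P(W > w) is approximately alpha when mu_E = delta_l * mu_r.

    Squaring mu_r * (w - delta_l) = z * sqrt(se_e^2 + w^2 * se_r^2) gives
    a*w^2 + b*w + c = 0; the larger root is the one exceeding delta_l.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    a = mu_r ** 2 - (z * se_r) ** 2
    if a <= 0:
        raise ValueError("mu_r must exceed z times its standard error")
    b = -2.0 * mu_r ** 2 * delta_l
    c = (mu_r * delta_l) ** 2 - (z * se_e) ** 2
    disc = b * b - 4.0 * a * c
    return (-b + math.sqrt(disc)) / (2.0 * a)


# Reject the null hypothesis when the observed ratio exceeds this cutoff.
w_crit = cutoff_lower_test(delta_l=0.8, mu_r=10.0, se_e=1.0, se_r=1.0)
print(w_crit)
```

The condition `a > 0` mirrors the requirement that the median be roughly at least twice its standard error when $\alpha = 0.025$.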

Note that (5.2) involves unknown parameters, for which confidence intervals are used. These confidence intervals should be as wide as possible. We describe the region in (5.3) as a rectangular parameter space. Here $\lambda_1$ represents the parameter space under the simple null hypothesis. For a point of this space and a value $w$ that satisfy (5.2), we obtain a quadratic equation in $w$. The cutoff point is a root of this quadratic equation.

p-Value and Power of the Test
The p-value for the test is computed at the observed ratio $w$. The power of the proposed test is the probability that the null hypothesis $H_0^1$ will be rejected when the alternative hypothesis $H_A^1$ is true. We define the power function in (5.6). The power in (5.6) may be called the minimum power.
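A p-value and a power value for the lower-inequality test can be sketched as below, again under the normal approximation for the ratio of two independent normal variables rather than the paper's exact expressions. The p-value is evaluated at the null boundary $\mu_E = \delta_L\mu_R$; the names `p_value_lower` and `power_lower`, and the choice to pass the standard errors in directly, are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def ratio_cdf(w, mu_e, mu_r, se_e, se_r):
    """Approximate P(W <= w) for a ratio of independent normals (denominator > 0)."""
    return NormalDist().cdf((w * mu_r - mu_e) / math.sqrt(se_e ** 2 + (w * se_r) ** 2))


def p_value_lower(w_obs, delta_l, mu_r, se_e, se_r):
    """P(W >= w_obs) evaluated at the null boundary mu_E = delta_l * mu_r."""
    return 1.0 - ratio_cdf(w_obs, delta_l * mu_r, mu_r, se_e, se_r)


def power_lower(w_crit, ratio_alt, mu_r, se_e, se_r):
    """P(W > w_crit) when the true ratio mu_E / mu_R equals ratio_alt."""
    return 1.0 - ratio_cdf(w_crit, ratio_alt * mu_r, mu_r, se_e, se_r)


print(p_value_lower(w_obs=1.1, delta_l=0.8, mu_r=10.0, se_e=1.0, se_r=1.0))
print(power_lower(w_crit=1.0, ratio_alt=1.2, mu_r=10.0, se_e=1.0, se_r=1.0))
```

In this approximation the power increases as the true ratio moves away from the margin, so its smallest value over the alternative is approached at the boundary.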

The Test Is Unbiased
The type-I error probability is at most $\alpha$ and the power of the test is at least $\alpha$. Thus, the test is unbiased.

Test for the Upper Inequality
Next, we discuss testing the null hypothesis $H_0^2$ against the alternative hypothesis $H_A^2$, which are stated in (3.2).

p-Value and Power of the Test
The p-value for the test is computed at the observed ratio $w$. The power in (6.4) may be called the minimum power. The test is unbiased.

Bootstrap Equivalent Confidence Interval
In the one-sample case, for the randomly right-censored survival model, Efron has considered using the bootstrap to estimate the sampling distribution of the sample median, working with the bootstrap Kaplan-Meier estimate (denoted by the subscript $b$). See [5,22]. Equivalence between two treatments is often assessed by constructing a confidence interval for the parameter of interest and comparing the constructed confidence interval with the pre-specified equivalence range [9]. In this paper, we use the distribution of the bootstrap ratio to construct the confidence interval in (7.2) on the ratio $\mu_E/\mu_R$. The interval in (7.2) may be used in two ways. If it lies entirely within the equivalence limits, then the two groups are considered equivalent. In order to demonstrate non-inferiority, this interval should lie entirely on the positive side of the non-inferiority margin. That is, if the confidence interval in (7.2) excludes the non-inferiority margin, then non-inferiority is demonstrated.
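The two uses of a confidence interval on the ratio can be sketched as follows. The interval here is a Fieller-type interval obtained by plugging the sample medians and their bootstrap standard errors into the normal approximation for the ratio and solving the resulting quadratic; this is an assumption about the general shape of such an interval and is not claimed to be identical to (7.2). The names `ratio_confidence_interval` and `assess` are hypothetical, and the non-inferiority rule shown is for the case where large medians are favorable.

```python
import math
from statistics import NormalDist


def ratio_confidence_interval(m_e, m_r, se_e, se_r, alpha=0.025):
    """Fieller-type limits: roots of (w*m_r - m_e)^2 = z^2 * (se_e^2 + w^2 * se_r^2)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    a = m_r ** 2 - (z * se_r) ** 2
    if a <= 0:
        raise ValueError("m_r must exceed z times its standard error")
    b = -2.0 * m_e * m_r
    c = m_e ** 2 - (z * se_e) ** 2
    root = math.sqrt(b * b - 4.0 * a * c)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)


def assess(ci, delta_l, delta_u):
    """Compare the interval with the margins delta_l < 1 < delta_u."""
    lo, hi = ci
    return {
        "non_inferior": lo > delta_l,                    # interval excludes the lower margin
        "equivalent": delta_l < lo and hi < delta_u,     # interval inside the equivalence range
    }


ci = ratio_confidence_interval(m_e=9.0, m_r=10.0, se_e=1.0, se_r=1.0)
print(ci, assess(ci, delta_l=0.5, delta_u=1.5))
```

A narrower interval, and hence a decision in favor of equivalence, requires smaller standard errors of the two medians, which is what ties this approach to the number of events.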

Sample Size Determination
In the current setting, the standard error of the sample median is not explicitly expressed in terms of the number of events. Therefore, we assume an exponential model for sample size calculation. That is, we assume that the event times in each arm follow an exponential distribution.
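One way to see how an exponential model links the standard error to the number of events is sketched below. It rests on several assumptions that are not taken from the paper: with $r$ events in an arm, the standard error of the estimated median is taken to be approximately the median divided by $\sqrt{r}$ (the large-sample behavior of the exponential maximum likelihood estimate), both arms have the same number of events, and power is computed from the normal approximation for the ratio. The function names and the default target power of 0.8 are hypothetical, and the result is not expected to reproduce Table 1.

```python
import math
from statistics import NormalDist


def power_exponential(r, delta_l, ratio_alt, alpha):
    """Approximate power of the lower-inequality test with r events per arm.

    The reference median is scaled to 1; SE(median) is assumed to be median / sqrt(r).
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    se_r = 1.0 / math.sqrt(r)
    se_e_null = delta_l / math.sqrt(r)
    a = 1.0 - (z * se_r) ** 2
    if a <= 0:
        raise ValueError("r is too small for the approximation")
    b = -2.0 * delta_l
    c = delta_l ** 2 - (z * se_e_null) ** 2
    w_crit = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    se_e_alt = ratio_alt / math.sqrt(r)
    stat = (w_crit - ratio_alt) / math.sqrt(se_e_alt ** 2 + (w_crit * se_r) ** 2)
    return 1.0 - NormalDist().cdf(stat)


def events_per_arm(delta_l, ratio_alt=1.0, alpha=0.025, target=0.8, r_max=100000):
    """Smallest r (by linear search) whose approximate power reaches the target."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    r = max(2, math.floor(z * z) + 1)
    while r <= r_max:
        if power_exponential(r, delta_l, ratio_alt, alpha) >= target:
            return r
        r += 1
    raise ValueError("target power not reached within r_max events")


print(events_per_arm(delta_l=0.8))
```

Because the medians enter only through their ratio under this scaling, the required number of events depends on the margin, the assumed true ratio, the level, and the target power, but not on the unit of time.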

Power Approach
We assume that $m_R$ is given. That is, $\hat{\mu}_R$ is known. We then calculate the optimal number of events $r$ per arm.

Bootstrap Confidence Interval Approach
In this setting, to find an optimal sample size, we use the distribution function of the ratio and solve for the number of events.
For the stratified analysis, we take the null hypothesis to be the one containing the equality statement. It is then possible to restate the null and alternative hypotheses in terms of the sums of strata medians.

Concluding Remarks
We deal with the ratio $\mu_E/\mu_R$ directly, and therefore our approach is easy for clinicians to understand. Existing test procedures for assessing non-inferiority and equivalence require the hazard rates under the two treatment arms to be proportional. The test proposed in this paper is free of this requirement and therefore has wider applicability.
The power definitions in (5.6) and (6.4) may be considered as alternatives to the power definitions in [20,21].
It may be recalled here that the Mantel-Haenszel test [23] is often called an average partial association statistic.
Here we have a parallel situation. The null hypothesis in (9.1) may be rewritten in terms of the averages of the strata medians. Therefore, the procedure in Section 9 tests the null hypothesis on the ratio of averages of strata medians.
Copyright © 2013 SciRes. OJS

The alternative hypotheses in (3.1) and (3.2) indicate that the experimental therapy is not inferior to the reference therapy. The lower and upper bounds $\delta_L$ and $\delta_U$ defining non-inferiority are called non-inferiority margins. The selection of the non-inferiority margin $\delta_L$ (or $\delta_U$) depends upon a combination of statistical reasoning and clinical judgment. For a discussion on the choice of a non-inferiority margin, reference is made to the ICH E10 document [1].

Let $\hat{S}_E$ and $\hat{S}_R$ denote the product-limit survival estimates and $m_E$ and $m_R$ denote the median time estimates for the experimental and reference groups, respectively. As mentioned in Section 2, we assume that the bootstrap variances given by (2.3) are the de facto variances of $m_E$ and $m_R$. We think of the ratio $W = m_E/m_R$ as an estimate of the ratio $\mu_E/\mu_R$, and we intend to use the distribution $G$ of the ratio $W$ to make inference on $\mu_E/\mu_R$. As usual, $w$ denotes an observed value of $W$.

Figure 1. Two DFs of W both with a median of 0.8.
Usually, in designing a clinical trial, one aims to have a power over 0.5.


The cutoff point defines the critical region of the test.
We plug in the sample estimates in $G(w)$ of (4.1) to get an asymptotic distribution of the bootstrap ratio $W_b$. Note that the distribution function $\hat{G}_B(w)$ is completely specified. To obtain the confidence limits, one may write down quadratic equations of the type shown in (5.2) and (6.2) and solve them. See Section 8 for an illustration.

We describe the sample size determination for the test for the upper inequality. Here $d$ is a pre-specified constant. Ideally, the choice of $d$ should depend on the width of the relevant interval.

Figure 2. Overview of the equivalence test.

Table 1. Optimal numbers of events r* per arm for α = 0.025 and d = 0.45. Columns: $\mu_R$ (median) and $\mu_E/\mu_R$.
To assess equivalence, we carry out two one-sided tests [9] and then combine the results according to the intersection-union principle. We have already outlined the two one-sided tests in Sections 5 and 6 above. The null hypothesis $H_0$ is rejected in favor of $H_A$ at level $\alpha$ if both hypotheses $H_0^1$ and $H_0^2$ are rejected at level $\alpha$. As indicated by Berger and Hsu [9], this test can be quite conservative. We evaluate the power of the test at the alternative $H_A$ of (10.1).
However, this power may be low in some cases. Then one may use Table 1 or Table 2 for sample size determination.
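The two one-sided tests procedure, combined through the intersection-union principle, can be sketched as follows: equivalence is declared at level $\alpha$ only if both one-sided null hypotheses are rejected at level $\alpha$. The p-values below come from the normal approximation for the ratio of two independent normal variables evaluated at each null boundary, which is an assumption made for illustration; the function name `tost_equivalence` is hypothetical.

```python
import math
from statistics import NormalDist


def ratio_cdf(w, mu_e, mu_r, se_e, se_r):
    """Approximate P(W <= w) for a ratio of independent normals (denominator > 0)."""
    return NormalDist().cdf((w * mu_r - mu_e) / math.sqrt(se_e ** 2 + (w * se_r) ** 2))


def tost_equivalence(w_obs, delta_l, delta_u, mu_r, se_e, se_r, alpha=0.025):
    """Return (equivalent, p_lower, p_upper) for the two one-sided tests."""
    # Lower test: null ratio <= delta_l, rejected for large observed ratios.
    p_lower = 1.0 - ratio_cdf(w_obs, delta_l * mu_r, mu_r, se_e, se_r)
    # Upper test: null ratio >= delta_u, rejected for small observed ratios.
    p_upper = ratio_cdf(w_obs, delta_u * mu_r, mu_r, se_e, se_r)
    # Intersection-union principle: both tests must reject at level alpha.
    return max(p_lower, p_upper) <= alpha, p_lower, p_upper


print(tost_equivalence(w_obs=1.0, delta_l=0.8, delta_u=1.25, mu_r=10.0, se_e=0.5, se_r=0.5))
```

Because each component test is carried out at the full level $\alpha$ and both must reject, the overall procedure can be conservative, in line with the remark attributed to Berger and Hsu [9].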