Comparative Analysis of Group Sequential Designs Tests for Randomized Controlled Clinical Trials: A Model Study on Two-Sided Tests for Comparing Two Treatments

Clinical trials are usually long-term studies, and it is rarely possible to enroll all required subjects at the same time. Performing interim analyses and monitoring results may allow early termination of a trial once significant results are obtained. The aim of this study is to compare group sequential tests with respect to sample size reduction and early termination. Four test types used in group sequential designs were compared with the fixed sample size design test and with each other. Comparisons were based on two-sided tests for comparing two treatments. In total, 1080 models were evaluated. The models covered 2 Type I error rates, 2 powers, 5 numbers of analyses, 6 effect sizes and 9 variances. All test types increased the maximum sample size relative to the fixed sample size design, each to a different degree. Each test had different critical values for rejecting the null hypothesis (H0) at the same Type I error rate and number of analyses. The choice of test type in a group sequential design depends on a few characteristics, such as sample size reduction, early termination and detection of a minimal effect size. Test performance is strongly related to the selected Type I error rate, power and number of analyses. In addition to these statistical characteristics, researchers should choose the test type with respect to other trial conditions, such as the subject of the trial, whether subjects are easy to reach, and the importance of early termination.


Introduction
Clinical trials are designed to detect differences between treatments with a certain power and Type I error rate. Investigators should ensure that clinical trials are designed with adequate statistical power and sample size. It is rarely possible to enroll the required number of subjects at the same time, so data accumulate periodically over the course of the trial. It may thus take a few years to enroll enough subjects to meet the sample size determined at the beginning of the trial. This is particularly difficult in clinical trials that carry a risk of death or other potential harm, where reaching the required sample size may prolong the trial further. Investigators therefore have a strong interest in analyzing the accumulated data at specific intervals and evaluating the results. Performing interim analyses and monitoring results may allow early termination of a trial once significant results are obtained that indicate superiority, inferiority or equivalence of the new treatment relative to the standard treatment.
Clinical trials can be classified into two groups in terms of sample size: fixed sample size designs and sequential designs [1]. In fixed sample size designs, the sample size is calculated at the beginning of the trial, and data are analyzed once, after all required subjects are enrolled. In sequential designs, the sample size is likewise calculated at the beginning of the trial, but data are analyzed periodically in interim analyses as the trial goes on, and a final analysis is done at the end of the trial if required. The results of each interim analysis are evaluated to decide whether to stop or continue the trial, and in this way the trial is monitored [2-4].
Sequential designs were initially developed for economic reasons. Early termination of a trial with a positive result means that a new product can be used sooner. If the trial has negative results, early termination saves resources. Sequential designs typically yield savings in sample size, time and cost compared with fixed sample size designs [5-7].
There are several reasons for monitoring a trial and deciding whether to stop or continue it. In medical research, possible side effects, quality of life, cost or the availability of alternative treatments cannot be known at the beginning of the trial [5]. The most important reason is to stop treating subjects with an ineffective treatment once results show that the test treatment is superior, inferior or equivalent to the standard treatment.
Sequential designs are categorized into three groups: fully sequential designs, group sequential designs and flexible sequential designs [1,5]. In group sequential designs, interim analyses are done periodically at times determined at the beginning of the trial. Group sequential designs require the number and timing of interim analyses to be determined at the beginning of the trial and to remain constant. Interim analyses must be done at equal intervals [1].
Group sequential designs are based on evaluating the results of interim analyses of data collected from successive patient groups of predetermined size [1]. There are many statistical criteria that control the Type I error rate across periodic analyses. At each interim analysis, the test statistic is calculated and compared with the critical value of the test. These critical values vary with the number of interim analyses and the selected Type I error rate. Commonly used group sequential tests were suggested by Pocock and by O'Brien & Fleming [6,7]. They have been extended to common test statistics used to compare means, medians, proportions or survival curves.
The aim of this study is to compare four group sequential tests (Pocock, O'Brien & Fleming, Wang & Tsiatis and Haybittle-Peto) used to compare the means of two treatments. The comparisons were made with respect to sample size reduction, potential for early termination and detection of minimum differences between treatments under the same conditions.

Two-Sided Tests for Comparing Two Treatments
In two-sided hypothesis tests, the null hypothesis (H0), "there is no statistically significant difference between two treatments", is tested against the alternative hypothesis (H1), "there is a statistically significant difference between two treatments". When the treatment means are normally distributed with a known variance, the test statistic is Z. In fixed sample size designs, the null hypothesis (H0) is rejected when the absolute value of the calculated Z statistic is equal to or larger than a value c, called the "critical value", and it is accepted when the value is smaller. The critical value is determined by the selected Type I error rate. The Type I error rate is usually set to 0.05, while 0.01 or 0.001 is selected when the study carries a risk of death, irreversible harm or other potential risks. The Z statistic for comparing the means of two treatments A and B, each normally distributed with a known variance and including n subjects, follows [8,9]. The required sample size in each treatment group for comparing two independent groups, n_f in Equation (2.1.2), is calculated following [5,8-11].
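The formulas for the Z statistic and the fixed sample size did not survive in this copy of the text. As a minimal sketch, the code below uses the standard textbook forms for two normal means with known common variance, Z = (mean_A - mean_B) / sqrt(2σ²/n) and n_f = 2σ²(z_{α/2} + z_β)²/d², on the assumption that they correspond to the paper's Equations (2.1.1) and (2.1.2). The function names are illustrative and do not come from the source.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(mean_a, mean_b, n, sigma2):
    """Z for comparing two means with n subjects per arm and known variance.

    Assumed standard form: (mean_a - mean_b) / sqrt(2 * sigma2 / n).
    """
    return (mean_a - mean_b) / sqrt(2.0 * sigma2 / n)


def fixed_sample_size(alpha, power, d, sigma2):
    """Per-arm sample size n_f for a two-sided fixed sample size test.

    Assumed standard form: 2 * sigma2 * (z_{alpha/2} + z_beta)**2 / d**2,
    where d is the difference in means to be detected.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return 2.0 * sigma2 * (z_alpha + z_beta) ** 2 / d ** 2


# A large effect size gives a very small sample size.
n_f = fixed_sample_size(alpha=0.05, power=0.80, d=2.0, sigma2=1.0)
print(round(n_f, 2))  # 3.92
```

Under these assumed forms, α = 0.05, power 0.80, d = 2.0 and σ² = 1 give a per-arm sample size of about 3.9, which is the order of magnitude the text reports for large effect sizes.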

Group Sequential Designs
In group sequential designs for comparing two treatments, the number of analyses (K) and the required sample size (m) in each treatment group for each analysis are determined initially. The total number of analyses in a group sequential design is K, consisting of K - 1 interim analyses and a final analysis. The maximum number of subjects enrolled in the study is 2mK. The Z_k statistic for comparing the two means, where A and B are normally distributed with a known variance and each analysis adds 2m subjects, follows [5,9] (Equation (2.2.1)). The maximum sample sizes differ between test types and are calculated by multiplying n_f in Equation (2.1.2) by a special factor R that varies with the test type and the number of analyses.
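The Z_k formula is missing from this copy, so the sketch below assumes the usual cumulative form for known variance: after k groups there are n = mk subjects per arm and Z_k = (sum of A responses - sum of B responses) / sqrt(2nσ²). It also assumes that the per-arm group size is m = R · n_f / K, which follows from a maximum per-arm sample size of R · n_f spread over K equal groups. All function names are hypothetical.

```python
from math import sqrt


def z_k(obs_a, obs_b, sigma2):
    """Cumulative Z_k from all responses accrued so far in arms A and B."""
    if len(obs_a) != len(obs_b):
        raise ValueError("equal per-arm sample sizes are assumed")
    n = len(obs_a)  # n = m * k subjects per arm after k groups
    return (sum(obs_a) - sum(obs_b)) / sqrt(2.0 * n * sigma2)


def group_size(n_f, r_factor, n_analyses):
    """Per-arm group size m = R * n_f / K (round up in practice)."""
    return r_factor * n_f / n_analyses


def max_total_subjects(m, n_analyses):
    """Maximum number of subjects over both arms: 2 * m * K."""
    return 2 * m * n_analyses


# Illustrative numbers only; R = 1.25 is not taken from the paper's tables.
m = group_size(n_f=100.0, r_factor=1.25, n_analyses=5)
print(m, max_total_subjects(m, 5))  # 25.0 250.0
```

The factor R is always at least 1, so the maximum 2mK exceeds the fixed design's 2n_f; the saving comes from stopping before the final analysis.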
In group sequential designs, since more than one statistical analysis is performed, the total Type I error rate (α) must be protected by allocating it across the analyses, giving a Type I error rate (α_k) for each analysis. The critical values c_k are determined according to these Type I error rates (α_k) [1-3,5,12]. Interim analyses are done periodically after data are collected from each group of 2m subjects, and the Z_k test statistic is calculated for each analysis. When the absolute value of the calculated Z_k statistic is equal to or larger than the critical value c_k, the null hypothesis (H0) is rejected and the trial is terminated; this is referred to as a "positive result". If it is less than c_k, the trial continues by adding a new group of 2m subjects. If there is no positive result before the final analysis and the statistic is still below c_k at the final analysis, the trial is terminated accepting the null hypothesis; this is referred to as a "negative result" [1,3,5,9].
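The stopping rule and the reason for adjusting the critical values can be illustrated with a small simulation. This is a sketch under stated assumptions: equal group sizes, known variance, and the null hypothesis true, so that each group's standardized increment is N(0, 1) and Z_k is the running sum divided by sqrt(k). The helper names are illustrative.

```python
import random
from math import sqrt


def monitor(z_values, critical_values):
    """Apply the two-sided stopping rule; return (decision, analysis number)."""
    for k, (z, c) in enumerate(zip(z_values, critical_values), start=1):
        if abs(z) >= c:
            return "reject H0", k  # positive result, trial stops at analysis k
    return "accept H0", len(critical_values)  # negative result at final analysis


def simulate_null_path(n_analyses, rng):
    """Z_1..Z_K under H0 with equal group sizes: Z_k = S_k / sqrt(k)."""
    total, path = 0.0, []
    for k in range(1, n_analyses + 1):
        total += rng.gauss(0.0, 1.0)
        path.append(total / sqrt(k))
    return path


def overall_type1_error(critical_values, n_sim=20000, seed=1):
    """Monte Carlo estimate of the overall Type I error rate of a boundary."""
    rng = random.Random(seed)
    k_max = len(critical_values)
    rejections = sum(
        monitor(simulate_null_path(k_max, rng), critical_values)[0] == "reject H0"
        for _ in range(n_sim)
    )
    return rejections / n_sim


# Reusing the fixed-design value 1.96 at each of five analyses inflates alpha.
naive = overall_type1_error([1.96] * 5)
print(naive)
```

With the unadjusted value 1.96 at every analysis, the estimated overall Type I error rate is well above the nominal 0.05, which is why each group sequential test uses larger critical values c_k.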

Pocock Test
The value of the Z statistic after each interim analysis and the final analysis is calculated with the formula given in Equation (2.2.1).
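The Pocock test compares every Z_k with the same critical value. As a rough sketch, and not the iterative numerical method cited in the paper, that constant can be approximated by Monte Carlo as the (1 - α) quantile of max |Z_k| over the K analyses under the null hypothesis. Equal group sizes and known variance are assumed, and the function name is hypothetical.

```python
import random
from math import sqrt


def pocock_constant(n_analyses, alpha, n_sim=20000, seed=1):
    """Approximate the constant two-sided Pocock critical value by simulation."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sim):
        total, max_abs_z = 0.0, 0.0
        for k in range(1, n_analyses + 1):
            total += rng.gauss(0.0, 1.0)  # standardized increment of group k
            max_abs_z = max(max_abs_z, abs(total / sqrt(k)))
        maxima.append(max_abs_z)
    maxima.sort()
    # Empirical (1 - alpha) quantile of the maximum absolute statistic.
    index = min(n_sim - 1, int((1.0 - alpha) * n_sim))
    return maxima[index]


c_pocock = pocock_constant(n_analyses=5, alpha=0.05)
print(round(c_pocock, 2))
```

For K = 5 and α = 0.05 the estimate should land near the value of about 2.41 found in standard tables, compared with 1.96 for a single analysis. The simulation error is on the order of 0.01 at this number of replications.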

O'Brien & Fleming Test
The value of the Z statistic after each interim analysis and the final analysis is calculated in the same way, with the formula given in Equation (2.2.1). The factor R used to obtain the maximum sample size varies based on the total Type I error rate (α), the Type II error rate (β) and the number of analyses (K). The number of subjects for each treatment at each interim analysis is calculated following [5,9].

Wang & Tsiatis Test
The factor R for the Wang & Tsiatis test also varies with the value of Δ. The number of subjects for each treatment at each interim analysis is calculated following [5,9].
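The boundary formula for the Wang & Tsiatis family is missing from this copy. The sketch below assumes the standard textbook form c_k = C_WT(K, α, Δ) · (k/K)^(Δ - 1/2). In that form Δ = 0.5 gives a constant boundary of the Pocock type and Δ = 0 gives a boundary of the O'Brien & Fleming type, C · sqrt(K/k). The constant passed in is a placeholder: the real C_WT depends on K, α and Δ and must be taken from tables or computed numerically.

```python
def wang_tsiatis_boundaries(constant, n_analyses, delta):
    """Critical values c_k = C * (k / K) ** (delta - 1/2) for k = 1..K."""
    return [
        constant * (k / n_analyses) ** (delta - 0.5)
        for k in range(1, n_analyses + 1)
    ]


# The same placeholder constant is used for every delta to show shape only.
for delta in (0.0, 0.25, 0.5):
    shape = wang_tsiatis_boundaries(constant=2.0, n_analyses=5, delta=delta)
    print(delta, [round(c, 2) for c in shape])
```

Smaller Δ values make the early critical values much larger than the final one, while Δ = 0.5 keeps them flat. In every case the final critical value equals the constant itself.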
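
Haybittle-Peto Test
In the Haybittle-Peto test, the critical value is the same for all interim analyses and differs only for the final analysis. The critical value for the final analysis varies based on the total Type I error rate (α) and the number of analyses (K) [5,9]. The maximum sample size for the Haybittle-Peto test is calculated by multiplying n_f in Equation (2.1.2) by a factor that varies based on the total Type I error rate (α), the Type II error rate (β) and the number of analyses (K). The number of subjects for each treatment at each interim analysis is calculated following [5,9].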
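The interim critical value of the Haybittle-Peto test is not legible in this copy. The sketch below assumes the conventional choice of 3 at every interim analysis and estimates, by Monte Carlo under the null hypothesis, the final critical value that brings the overall two-sided Type I error rate to α. It is an illustration with a hypothetical function name, not the paper's computation.

```python
import random
from math import sqrt


def haybittle_peto_final_value(n_analyses, alpha, interim_value=3.0,
                               n_sim=20000, seed=1):
    """Estimate the final critical value given a constant interim value."""
    rng = random.Random(seed)
    early_rejections = 0
    final_abs_z = []
    for _ in range(n_sim):
        total, crossed = 0.0, False
        for k in range(1, n_analyses + 1):
            total += rng.gauss(0.0, 1.0)
            z = total / sqrt(k)
            if k < n_analyses and abs(z) >= interim_value:
                crossed = True  # rejected at an interim analysis
                break
        if crossed:
            early_rejections += 1
        else:
            final_abs_z.append(abs(z))  # |Z_K| for trials reaching the end
    # Rejections still allowed at the final analysis to spend alpha in total.
    remaining = round(alpha * n_sim) - early_rejections
    if remaining <= 0:
        raise ValueError("interim analyses already use up alpha")
    final_abs_z.sort(reverse=True)
    return final_abs_z[remaining - 1]


c_final = haybittle_peto_final_value(n_analyses=5, alpha=0.05)
print(round(c_final, 2))
```

Because very little of α is spent at the strict interim analyses, the estimated final value stays close to the fixed design's 1.96, in line with the observation that the Haybittle-Peto final critical value is the nearest to the fixed sample size design.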

Models
Sample sizes calculated for large effect sizes were too small, and critical values for the different numbers of analyses can be calculated by iteration [5,13], so 1080 of these 10080 models were used. In these models, 2 Type I error rates (α), 2 powers (1 - β), 5 numbers of analyses (K), 6 effect sizes (d) and 9 variances (σ²) were considered; the Type I error rates were 0.05 and 0.01 [5,13]. Results are summarized in tables; some level combinations are not shown. In addition, sample sizes calculated for larger effect sizes were very small, such as 3.9, and some of them are not feasible in practice. As a result, some of these sample sizes were similar and not comparable.
When K = 5, the Pocock test had lower critical values, which might detect smaller effect sizes, than the Wang & Tsiatis test for all Δ values in the first two or three interim analyses, while the Wang & Tsiatis test had lower critical values in the last two interim analyses and the final analysis. This change was observed at the 5th-7th interim analysis when K = 10, at the 7th-10th interim analysis when K = 15 and at the 8th-11th interim analysis when K = 20. The analysis at which this change was observed varied with the value of Δ. The Haybittle-Peto test behaved differently, having a constant critical value for all interim analyses. This critical value lay between the critical values of the Pocock and O'Brien & Fleming tests for early interim analyses and became higher than both after a few interim analyses. This change was observed at the 3rd interim analysis when K = 5, at the 5th interim analysis when K = 10, at the 8th interim analysis when K = 15 and at the 11th interim analysis when K = 20. Only at the final analysis did the Haybittle-Peto test have a different critical value, which was the nearest to that of the fixed sample size design test (Tables 1-4).

All of the group sequential tests required a larger sample size than the fixed sample size design (Tables 5-10). This increase depended on the effect size. The difference between the sample sizes for the group sequential tests and the fixed sample size design was minimal when the effect size was large; the sample sizes were even similar for effect sizes of d ≥ 2.0. This was mainly caused by the smallness of the sample sizes, such as 3.9. Because the sample sizes were so small, the differences between the tests could not be observed and the sample sizes appeared similar. The advantages and disadvantages of each test in terms of sample size can therefore be compared for low effect sizes. The expected value of the test statistic increases with the number of analyses, so the power of the test is achieved mainly at later analyses [4,5].
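The remark that power is reached mainly at later analyses can be made concrete with the expected value of the test statistic. Assuming the standard cumulative form of Z_k for known variance (mk subjects per arm after k groups), a true mean difference d gives E(Z_k) = d · sqrt(mk / (2σ²)), which grows with sqrt(k). The function name and the numbers below are illustrative.

```python
from math import sqrt


def expected_z(d, m, k, sigma2):
    """E(Z_k) = d * sqrt(m * k / (2 * sigma2)) for a true difference d."""
    return d * sqrt(m * k / (2.0 * sigma2))


# Illustrative values: d = 1, m = 8 subjects per arm per group, sigma2 = 1.
drift = [round(expected_z(d=1.0, m=8, k=k, sigma2=1.0), 2) for k in range(1, 6)]
print(drift)  # [2.0, 2.83, 3.46, 4.0, 4.47]
```

Since the expected statistic keeps rising while O'Brien & Fleming critical values fall, that test rejects mostly at later analyses. The flat Pocock boundary can be crossed earlier.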

Discussion
The maximum sample size required in group sequential designs increases as the number of analyses increases. However, the basic goal of group sequential designs is to exploit the advantage of early stopping through interim analyses [6,7,13]. It should be kept in mind that the maximum sample size is only required when there is no positive result at any interim analysis and the trial goes on to the final analysis.
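The point that the maximum sample size is needed only when no interim analysis stops the trial can be illustrated by estimating the expected total sample size. This is a sketch under the same assumptions as the standard model (equal groups of m subjects per arm, known variance, two-sided stopping rule). The boundary values and function name are illustrative, not taken from the paper's tables.

```python
import random
from math import sqrt


def expected_total_sample_size(critical_values, m, d, sigma2,
                               n_sim=5000, seed=1):
    """Monte Carlo estimate of the expected number of subjects (both arms)."""
    rng = random.Random(seed)
    # Mean of each group's standardized increment when the true difference is d.
    drift = d * sqrt(m / (2.0 * sigma2))
    total_analyses = 0
    for _ in range(n_sim):
        s = 0.0
        for k, c in enumerate(critical_values, start=1):
            s += rng.gauss(drift, 1.0)
            if abs(s / sqrt(k)) >= c:
                break  # positive result: stop at analysis k
        total_analyses += k  # k is the last analysis actually performed
    return 2 * m * total_analyses / n_sim


boundary = [2.4] * 5  # illustrative constant boundary, K = 5
under_null = expected_total_sample_size(boundary, m=10, d=0.0, sigma2=1.0)
strong_effect = expected_total_sample_size(boundary, m=10, d=3.0, sigma2=1.0)
print(under_null, strong_effect)
```

With no true difference the expected size stays close to the maximum 2mK = 100, while a strong effect stops the trial at the first analysis in almost every replication, so the expected size is close to 2m = 20.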
The ordering of the group sequential tests by critical value was not fixed; it changed from one interim analysis to the next. In addition, the ordering of the tests in terms of critical values changed with the number of analyses (K). The Pocock test had the lowest critical values for early interim analyses, while the O'Brien & Fleming test had the lowest critical values for later analyses. The interim analysis at which this change occurred varied with the number of analyses (K). The number of analyses performed is as important as the test type used. In some conditions, 1 or 2 interim analyses may be effective for decreasing the sample size, and generally 4 or 5 interim analyses are sufficient. Accordingly, for a group sequential design using the O'Brien & Fleming test, K = 10 analyses was the optimum. In addition, for a group sequential design using the Pocock test, K > 5 analyses seems unreasonable.
As a result, these four test types have several advantages and disadvantages. In a group sequential trial, the choice of test type for analyzing the trial data is based on a few criteria: 1) whether early termination is important; 2) reducing the sample size; 3) the subject of the trial; 4) whether subjects are easy to reach; 5) detecting minimal effect sizes. When subjects are hard to reach, or when a smaller sample size must be studied because of high risk, the test that detects smaller treatment differences at the first interim analyses can be preferred.

The critical values of the O'Brien & Fleming test, which are compared with the value of the Z statistic, vary based on the total Type I error rate (α) and the number of analyses (K), and are different for each interim analysis and the final analysis [5,9]. The maximum sample size for the O'Brien & Fleming test is calculated by multiplying n_f in Equation (2.1.2) by a factor whose values vary based on the total Type I error rate (α) and the Type II error rate (β).
The critical values of the O'Brien & Fleming test in the initial interim analyses increase because the (K/k) values increase, so the difference from the critical values of the other tests in the initial interim analyses grows as the planned total number of analyses (K) increases. The (K/k) values decrease from the first analysis to the final analysis, and the critical values decrease with them, so they are lower than those of the other tests in later analyses. The Wang & Tsiatis test shows the same pattern for its critical values. The difference from the other tests is large in the initial analyses, decreases as the analyses go on, and the critical values end up lower than those of the other tests in later analyses.
The critical values of the Pocock test are compared with the value of the Z statistic at each analysis.
For the Wang & Tsiatis test, the value of the Z statistic after each interim analysis and the final analysis is likewise calculated with the formula given in Equation (2.2.1). Unlike the other tests, the Wang & Tsiatis test has a parameter Δ, and certain values of this parameter make the Wang & Tsiatis test identical to the Pocock and O'Brien & Fleming tests. The critical value of the Wang & Tsiatis test is C_WT(K, α, Δ) for the final analysis. For interim analyses, the critical values are obtained by multiplying C_WT(K, α, Δ) by a multiplier specific to each analysis. The R_WT(K, α, β, Δ) values vary based on the total Type I error rate (α), the Type II error rate (β), the number of analyses (K) and the value of Δ.
Critical values for rejecting the null hypothesis (H0) and the maximum sample sizes required for all test types were determined for each combination. In each combination, the four test types were compared with each other and with the fixed sample size design test, and the advantages and disadvantages of the tests were examined under the same conditions.
Among the group sequential tests, the critical value at the final analysis nearest to that of the fixed sample size design was obtained for the Haybittle-Peto test, and the furthest was obtained for the Pocock test. This did not change with the number of analyses (Tables 1-4). When K = 5, the Pocock test had the lowest critical values, which might detect smaller effect sizes, in the first three interim analyses, while the O'Brien & Fleming test had the lowest critical values at the 4th interim analysis and the final analysis. This change was observed at the 7th interim analysis when K = 10, at the 10th interim analysis when K = 15 and at the 13th interim analysis when K = 20. The Wang & Tsiatis test always lay between these two tests. It was close to the O'Brien & Fleming test for small Δ values and moved toward the Pocock test with increasing Δ values. The critical values of the Wang & Tsiatis test for early interim analyses came closer to those of the Pocock test as Δ increased, and they decreased for later interim analyses. Similarly, the critical values of the Wang & Tsiatis test came closer to those of the O'Brien & Fleming test as Δ decreased, and they decreased for later interim analyses in parallel with the critical values of the O'Brien & Fleming test.

Table 10. Sample sizes for α = 0.05, (1 - β) = 0.80, and d = 1.5. K: number of analyses; n: sample sizes for tests; n_P: Pocock; n_B: O'Brien & Fleming; n_WT: Wang & Tsiatis; n_HP: Haybittle-Peto.
The Pocock test required the largest sample size among the group sequential test types. The O'Brien & Fleming and Haybittle-Peto tests had the sample sizes nearest to the fixed sample size design, and the order of these two tests changed with the number of analyses. The sample size of the Wang & Tsiatis test was close to that of the O'Brien & Fleming test for small Δ values and approached that of the Pocock test with increasing Δ values. This pattern did not vary with the number of interim analyses or with the other parameters involved in the sample size calculation, such as power, Type I error rate, effect size and variance. Sample sizes were almost the same for increasing effect sizes.
The results for sample size were almost the same in all combinations. The Haybittle-Peto and O'Brien & Fleming tests required much smaller sample sizes than the other test types. The Pocock test required the largest sample size in all combinations. The Wang & Tsiatis test always required a sample size between those of the O'Brien & Fleming and Pocock tests. The reason the O'Brien & Fleming test requires a smaller sample size than the Pocock test can be understood from the formula for calculating the Z statistic, Equation (2.2.1). It can be seen there how the effect size enters the expected value of the test statistic, E(Z_k).