Randomized clinical trials are designed with stopping boundaries to guide data monitoring committees with their decision making concerning ongoing trials. In particular, when extremely positive results are seen and a boundary is crossed, the data monitoring committee may recommend releasing the results earlier to the public than at the definitive final analysis time specified in the protocol. For trials that are still accruing, this also means stopping accrual. Because the information about treatment efficacy is more limited in an early analysis than in a final analysis, questions have been raised about the appropriateness of incorporating early stopping for positive results in trial designs. In particular, there are concerns that treatment effects seen early may not be real or may be overly optimistic. To examine this issue, we collected information about treatment efficacy on National Cancer Institute Cooperative Group trials that were stopped early for positive results (information both at the time the trial was stopped/released and at times of further follow-up). Twenty-seven such trials were located. For 17 of 18 of these trials with sufficient follow-up information, the treatment effect was similar or only slightly smaller at last follow-up compared with the stopping/release time. We critically evaluate reasons why one might be concerned about early stopping for positive results. We conclude that for trials with well-designed interim monitoring plans, the ability to stop early for positive results is an important component of the trial design, allowing the public to benefit as soon as possible from the study conclusions.
Interim monitoring plans, including formal guidelines for stopping trials early (stopping accrual and/or releasing results early) for compelling results, are a standard part of the designs of randomized clinical trials (RCTs). The monitoring guidelines are designed to limit the probability of a false-positive result (type I error) while allowing trials to stop early. Although the benefits to the public in releasing compelling positive trial results early are obvious, there have been concerns expressed1–5 about the correctness of early stopping of RCTs for positive results. To address the concerns raised, we performed a review of all treatment RCTs performed by the National Cancer Institute (NCI) Cooperative Groups and published from 1990 to 2005, thus providing empiric data on the issues as has been suggested.6 We also examined frequently cited reasons why one should be cautious about using early stopping for positive results and found some of these reasons to be correct but others lacking in statistical validity.
We located NCI Cooperative Group phase III treatment RCTs whose accrual was stopped early or whose results were released early for positive results with the first publication appearing in the years 1990 to 2005; we included National Cancer Institute of Canada (NCIC) Clinical Trials Group trials partially supported by NCI. Using a list from PDQ (http://www.cancer.gov/cancertopics/pdq/cancerdatabase), all trial publications appearing in Journal of Clinical Oncology, Journal of the National Cancer Institute, New England Journal of Medicine, The Lancet, or Blood or as an American Society of Clinical Oncology or American Society of Hematology abstract were examined for evidence of early stopping/release for positive results. We located 27 such trials (Table 1);7–60 it is possible that a small number of trials were missed if they were not published in the searched publications or if no mention was made of the early stopping/release in the publications. (Once a relevant trial was located, all publications associated with that trial were examined.) The end points that crossed interim monitoring boundaries for the 27 trials were overall survival (OS; seven trials), progression-free survival (PFS; six trials), disease-free survival (DFS; six trials), event-free survival (EFS; three trials), complete response rate (two trials), failure-free survival (one trial), continuous complete remission rates (one trial), and response rate (one trial). End point definitions can be found in Appendix Table A1 (online only). Accrual was complete at the time of stopping/release for 70% of the trials (19 of 27 trials) and ranged from 66% to 97% complete for the other eight trials.
Table 2 lists efficacy statistics for these trials at three time points: when they crossed their interim monitoring boundary, when they were first published, and when further follow-up information was published (when available). Efficacy data are given as reported in the publications (eg, hazard ratio or the 3-year survival in each arm), but all P values given here are two-sided. The follow-up information is an attempt to assess, in hindsight, how accurate or inaccurate the early positive results were. Regardless of these findings, we do not intend to imply that the data monitoring committees making these particular stopping decisions made incorrect decisions given the protocol interim monitoring guidelines and the information they had at the time. We focus here on the results at the time of interim monitoring boundary crossing and at the last follow-up available (which could be when the results were first published if there was no additional follow-up). When the trial results crossed the boundary, the ratio of observed events to events required at the final analysis (information fraction) ranged from 15% to 90%, with a median of approximately 60%.
Although all 27 trials met their prespecified study objectives when their positive results crossed their boundaries, to provide a summary characterization of Table 2, we first focused on the 18 trials that had follow-up information of at least 80% (an arbitrary figure representing trials for which planned final analysis results could be considered available). For 14 of the 18 trials (North Central Cancer Treatment Group NCCTG-844652; Radiation Therapy Oncology Group RTOG-8501; Cancer and Leukemia Group B CLB-9011; Eastern Cooperative Oncology Group ECOG-2491; Children's Cancer Group CCG-1882; RTOG-9001; NCCTG-9741; Eastern Cooperative Oncology Group E1496; Eastern Cooperative Oncology Group E-E1A00; CCG-1961; E3200; E2997; National Surgical Adjuvant Breast and Bowel Project B NSABP-B-31/NCCTG-N9831; ECOG-4599), the treatment effect was similar at early stopping/release and last follow-up. For Southwest Oncology Group SWOG-8814 and Eastern Cooperative Oncology Group ECOG-2100, the treatment effect became slightly smaller, although with the same statistical significance. For Eastern Cooperative Oncology Group EST-3189, the treatment effect became slightly smaller with the statistical significance much weaker (although still statistically significant based on the protocol specification). For RTOG-9413, the treatment effect disappeared; at the time of early release, 90% of the required events were observed, and the 4-year PFS rates were 56% v 46% (P = .014). When long-term follow-up became available (156% of the required events), the 4-year PFS rates were 54% in both arms. Because in this trial the results were released so close to the protocol-specified final analysis, one can argue that this is an illustration of a study with the final analysis reversed by longer follow-up.
Although not one of the 18 trials with at least 80% information follow-up, the release of data in POG-9006 (with accrual 97% complete) may seem problematic. The trial reported 2-year EFS rates of 84% v 75% (P = .006) with 54% information (when the data crossed the interim monitoring boundary, the 2-year EFS rates were 82% v 70.8%, P = .0016, with 38% information). With further follow-up and 77% information, the reported 4-year EFS rates were 70.6% v 64% (P = .21). We will return to discussion of this trial in the next section. For the other eight trials that did not have at least 80% information follow-up, one trial had no further follow-up (NCIC-MA21), and the other seven trials retained statistical significance, with two having a smaller treatment effect (CCG-5942 and SWOG-S9701) and five showing a similar treatment effect (SWOG-8892, SWOG-8797, CLB-9344, SWOG-9133, and NCIC-MA17).
We discuss a number of reasons that have been used to suggest that early stopping for extremely positive results may not be appropriate.
The designated primary end point of an RCT is the one that is used to make the definitive statement concerning treatment effectiveness. There can be controversy about the appropriate primary end point, with different end points yielding different required sample sizes for the trial. For example, a trial that demonstrates a positive treatment effect at its conclusion for PFS may not have a sufficient number of deaths at that time to evaluate conclusively OS benefits. The possibility of early stopping exacerbates this potential problem in that there may be little information available about the nonprimary end points if the trial is stopped early. For example, Cannistra61 questioned the decision to close and report early the results of SWOG-S9701 and NCIC-MA17 based on PFS and DFS end points, respectively, when the OS data were immature. If one believes that an improvement in a non-OS end point results in direct patient benefit, then there should be little argument against stopping a trial early based on extremely positive results for that end point. For example, the NCIC-MA17 investigators62 considered DFS an important clinical end point for their adjuvant breast cancer setting. However, sometimes a non-OS end point is used not because it directly represents patient benefit, but because it is a surrogate for OS. As a surrogate for OS, it may have more statistical power because events accumulate faster and because it may be less susceptible to potential confounding by treatment crossovers to the experimental arm after non-OS events in the control arm. In this case, it is not as clear that one would need to stop a trial early for positive non-OS treatment effects, unless the surrogate is uniformly accepted in the clinical community.
When a non-OS primary end point does not directly represent clinical benefit, a reasonable strategy is to use OS for the interim analysis for positive effects even though the primary end point is different. To do this, one would have to be comfortable continuing a trial that showed extremely positive results in the non-OS primary end point provided that OS differences were not large. Other possibilities include requiring extremely positive results for early stopping/release or starting the interim monitoring for positive effects at a late enough time point that accrual will be complete or almost complete. This may allow evaluation of the OS effect with further follow-up. For example, in Table 3, we see that seven of 16 trials that stopped based on non-OS end points in Table 2 eventually achieved OS treatment effects that were large enough and precise enough to attain statistical significance (P < .05); four trials did not have OS results available. A potentially useful strategy to ameliorate the dilution of treatment effect as a result of crossovers is to censor the OS data of the control arm patients at the time when positive results are released; see Bukowski et al63 for an example.
Depending on the shape of the experimental and control treatment survival curves, early release of data can lead to different conclusions than with additional follow-up. Figure 1 displays hypothetical curves, with the experimental treatment being better on average. Note that although the control treatment curve drops faster than the experimental treatment curve during the first 5 years, the opposite is true for years 6 to 10. This is known as a case of crossing hazards and does not imply that the experimental treatment is worse than the control treatment in the later years (see the Appendix, online only). An important implication of crossing hazards for a conventionally designed RCT is that the trial may have more power to reject the null hypothesis with less follow-up. For example, if the true survival curves are as displayed in Figure 1, then an RCT randomly assigning 750 patients per arm accruing uniformly over 4 years with 2 years of follow-up would have 85% power to reject the null hypothesis (one-sided type I error = 0.025, log-rank statistic). (All power calculations are derived from simulation of 10,000 data sets.) The same trial with 5 years of follow-up would have 54% power. (In special circumstances where one expects the survival curves to come together, alternatives to the log-rank test that weight the earlier data more heavily may be appropriate.)
An implication of crossing hazards for the topic of early stopping is that an early highly statistically significant result leading to stopping the trial may become less statistically significant (or even not statistically significant) with further follow-up. (This could also happen with additional long-term follow-up of a trial that was not stopped early.) For example, suppose the RCT described earlier with 5 years of follow-up had an interim analysis after 2 years of follow-up that would release the trial results early if P < .0025. This would happen 57% of the time if the true survival curves were as in Figure 1, and with 3 years of additional follow-up, 22% of these times the results would no longer be statistically significant (P > .025). Note that these occurrences would not be false positives because the null hypothesis that the survival curves are identical is not true. However, if one believed that it would be misleading to the clinical community to see only the first 6 years of the curves in Figure 1, then releasing the results early would be a mistake regardless of the statistical significance of the early results. In practice, one is unlikely to know beforehand whether the curves will come back together as in Figure 1 or keep separating. This suggests that interim monitoring is appropriate, but additional follow-up after the early release of extremely positive results is advisable.
Empirical evidence of crossing hazards is suggested by Figures 2 and 3, which display the OS curves for EST-3189 and the complete continuous remission curves for POG-9006, respectively. For EST-3189, the P value went from .0025 at the time of interim monitoring boundary crossing to .10 with 2 years of additional follow-up. For POG-9006, the P value went from .0016 at the time of interim monitoring boundary crossing to .22 with 4 years of additional follow-up.
Although RCTs are designed to answer a treatment question definitively, they are not perfect. In particular, they will infrequently lead to a rejection of the null hypothesis when it is true (a type I error) or the nonrejection of the null hypothesis when the treatments are truly different (a type II error). A design parameter for RCTs is the type I error rate, which is frequently set at 0.05. This means that if the null hypothesis is true, then there is, at most, a 5% chance that the trial will result in a statistically significant outcome. It is important to note that, in a properly designed trial, the type I error rate encompasses both type I errors that occur when the trial is stopped early for positive results and type I errors that occur with a positive conclusion at the regularly scheduled trial end. Therefore, there is not an excess of type I errors as a result of the possibility of early stopping with appropriately designed interim monitoring boundaries.
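As a rough check of this point, the following Python sketch simulates trials under the null hypothesis with three equally spaced analyses and counts how often an efficacy boundary is crossed. The boundary values are an assumption (classical O'Brien-Fleming z-boundaries built from the tabulated constant 2.004 for three looks at one-sided 0.025), as are the function names, the seed, and the number of simulated trials; none of these are taken from the trials reviewed here.

```python
import math
import random

K = 3  # number of analyses (two interim looks plus the final analysis)
# Assumed classical O'Brien-Fleming z-boundaries: 2.004 * sqrt(K / k).
BOUNDS = [2.004 * math.sqrt(K / k) for k in range(1, K + 1)]


def first_crossing(drift, rng):
    """Return the 1-based analysis at which z first reaches its boundary, or None."""
    score = 0.0
    for k in range(1, K + 1):
        # Independent increment of the score process over 1/K of the information.
        score += rng.gauss(drift / K, math.sqrt(1 / K))
        z = score / math.sqrt(k / K)
        if z >= BOUNDS[k - 1]:
            return k
    return None


def simulate_null(n_trials=100_000, seed=1):
    """Estimate the per-analysis boundary-crossing rates when there is no effect."""
    rng = random.Random(seed)
    counts = [0] * K
    for _ in range(n_trials):
        k = first_crossing(0.0, rng)
        if k is not None:
            counts[k - 1] += 1
    return [c / n_trials for c in counts]


rates = simulate_null()
overall = sum(rates)
print("crossing rate by analysis:", rates)
print("overall type I error estimate:", overall)
```

Under these assumptions, the summed crossing rate should be close to 0.025, with most of the false-positive crossings occurring at the final analysis rather than at the interim looks.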
An alternative way to consider type I errors vis-à-vis concerns about early stopping is to calculate the probability that a positive conclusion is a false positive, given that the trial stopped early. A standard application of Bayes' theorem allows this calculation as a function of the prior probability that the null hypothesis is true.64 Such calculations show that, for a positive trial, the probability that the trial is a false positive is lower if the trial crossed an interim monitoring boundary than if it did not. As a simple example, consider a trial designed with 90% power for a specified alternative, with one-sided type I error of 0.025, and where the true treatment effect is null 80% of the time and equal to the specified alternative 20% of the time. Without the possibility of early stopping, the probability that a trial with a positive outcome (one-sided P < .025) is a false positive is 10%. If an O'Brien-Fleming interim monitoring boundary65 is used with two equally spaced interim looks, then the probability that a trial that crosses this boundary at the first interim look (33% information) is a false positive is 1.2%, and the probability that a trial that first crosses the boundary at the second interim look (67% information) is a false positive is 4.4%; the overall false-positive rate remains at 10%.
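The fixed-design figure in this example can be checked with a short Python sketch of the Bayes' theorem calculation. The function and argument names are hypothetical. The interim-look percentages are not recomputed here because they depend on the exact boundary and on the probabilities of crossing it under each hypothesis, which would have to be supplied separately.

```python
def prob_false_positive(prior_null, p_positive_given_null, p_positive_given_alt):
    """P(null is true | positive result), by Bayes' theorem."""
    false_pos = prior_null * p_positive_given_null
    true_pos = (1 - prior_null) * p_positive_given_alt
    return false_pos / (false_pos + true_pos)


# No early stopping: type I error 0.025, power 0.90, null true 80% of the time.
fixed_design = prob_false_positive(0.80, 0.025, 0.90)
print(round(fixed_design, 3))  # 0.02 / (0.02 + 0.18) = 0.1
```

The same function applies to a given interim look once the probabilities of first crossing the boundary at that look, under the null and under the alternative, are supplied in place of the type I error and the power.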
It has been noted66,67 that a treatment effect observed for a trial that stops early for positive results will be, on average, higher than the true treatment effect (ie, is biased upward). Some2,3,5 use this to argue against stopping trials early for positive effects. However, it is also true that the observed treatment effect for a trial that concludes at its regularly scheduled end with a significantly positive result will be biased upward (although not as high as one that has stopped early).68 In particular, for interim analyses occurring with half or more of the total planned events, the upward bias as a result of early stopping is comparable to the upward bias seen in similarly positive trials not stopped early.69 It is important to note that even though the treatment effect is biased upward when estimated when a trial stops early for positive results, there is only a small probability (the type I error) that the treatment effect is not positive. Therefore, concerns about treating future patients with the best treatment may outweigh concerns about not knowing exactly how much better the better treatment is. The empirical data in Table 2 suggest that the potential bias as a result of early stopping is not a major problem.
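The direction of these two biases can be illustrated with a small simulation, sketched below in Python. Everything in it is an illustrative assumption rather than a value from the article: a single interim look at half the information, an interim z threshold of 2.8, a final threshold of 1.96, and a true standardized effect of 3.24 (roughly 90% power at one-sided 0.025). It does not attempt to reproduce the comparison of similarly positive trials cited above.

```python
import math
import random
from statistics import fmean


def simulate_estimates(theta=3.24, interim_t=0.5, interim_z=2.8,
                       final_z=1.96, n_trials=50_000, seed=7):
    """Mean effect estimates for trials stopped early and for positive fixed designs.

    theta is the expected z-statistic at full information; estimates are
    reported on the same scale. All defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    early, fixed_positive = [], []
    for _ in range(n_trials):
        # Score process: S(t) ~ N(theta * t, t) with independent increments.
        s_interim = rng.gauss(theta * interim_t, math.sqrt(interim_t))
        s_final = s_interim + rng.gauss(theta * (1 - interim_t),
                                        math.sqrt(1 - interim_t))
        # Trial stops at the interim look if its z-statistic reaches the threshold.
        if s_interim / math.sqrt(interim_t) >= interim_z:
            early.append(s_interim / interim_t)  # estimate = score / information
        # Same data analysed as a fixed design with no interim look.
        if s_final >= final_z:
            fixed_positive.append(s_final)
    return fmean(early), fmean(fixed_positive)


mean_early, mean_fixed = simulate_estimates()
print("true effect: 3.24")
print("mean estimate, stopped at interim:", round(mean_early, 2))
print("mean estimate, positive fixed design:", round(mean_fixed, 2))
```

Under these assumptions, both conditional means exceed the true effect, and the mean for trials stopped at the interim look is the larger of the two, which matches the qualitative statement in the text.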
It has been suggested70 that if the magnitude of the treatment effect at an interim monitoring look is implausible, then one should not stop the trial at this point (implying that one should stop for a smaller observed effect). However, not stopping a trial for extremely positive results but stopping it for less extreme positive results runs counter to both common sense and statistical thinking71; see Clayton and Wheatley72 for an alternative point of view. Whether or not a trial is stopped early, if one has prior information about the magnitude of the treatment effect, then a Bayesian analysis73 may be useful in providing an attenuated estimator of an extremely positive treatment effect.
The vast majority of NCI Cooperative Group phase III trials that crossed an interim monitoring boundary for positive results led to the early release of treatment effect data to the public that, in retrospect, was appropriate and beneficial. Concerns about excess false positives as a result of the early stopping are not supported by statistical theory or the empirical evidence presented here. Concerns about biased treatment effects as a result of the early stopping are statistically valid but may not be practically important; the bias may not be much larger than would be seen for a positive trial not stopped early, and releasing information early about an effective treatment may be more important than knowing the exact magnitude of the benefit. However, concerns about early stopping/release limiting the ability to estimate long-term survival curves (and potentially identify crossing hazards) or to estimate OS curves (when the stopping is based on a non-OS end point) are statistically valid and practically important. An important consideration in this situation is whether the survival curves can be accurately estimated with additional follow-up after the early stopping/release. If the accrual was not complete at the time of early stopping or many patients could be expected to cross over to the experimental treatment when the positive results are released, then it may be impossible even with additional follow-up to estimate what the survival curves would have looked like if there had been no early stopping/release. In this situation, the interim monitoring plan could be conservative during accrual if the monitoring end point is not OS or there is strong interest in the long-term survival curves.
The NCI Cooperative Group trials that we have considered had well-designed interim monitoring plans. The choice of end point and monitoring plan needs to be carefully considered before a trial starts; trial investigators should be comfortable with the predictable stopping and not stopping decisions that will occur under different accruing data scenarios. The ability to stop a trial and release positive data early is an important component of phase III trial design, allowing the public to benefit as soon as possible from the study conclusions.
We thank W. Barlow, M. Devidas, R.J. Gray, P.Y. Liu, M. LeBlanc, B. Peterson, N.L. Seibel, R. Sposto, and K. Winter for providing some statistical details concerning previous trial analyses and R. Gore-Langton for his help in locating publications.
We give a heuristic explanation of how crossing hazards, as in Figure 1, can occur. Consider 100 patients treated with the control treatment. One would expect 30 of these patients to die in the first 5 years (hazard = 30/100 = 30%) and 20 to die in the last 5 years (hazard = 20/(100 – 30) = 29%). Suppose that, compared with the control treatment, the experimental treatment delays for 5 years one third of the deaths (ie, n = 10) that would have occurred in the first 5 years and has no effect on deaths that would have occurred in the last 5 years. One would expect, for 100 patients treated with the experimental therapy, 20 deaths in the first 5 years (hazard = 20/100 = 20%) and 30 deaths in the last 5 years (hazard = 30/(100 – 20) = 37.5%). Notice that the control treatment hazard is higher than the experimental treatment hazard in the first 5 years and vice versa in the last 5 years.
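The arithmetic in this heuristic can be reproduced with a short Python sketch. The function and variable names are hypothetical; the death counts are those given in the paragraph above.

```python
def interval_hazards(deaths_by_interval, n_start):
    """Proportion of patients at risk at the start of each interval who die in it."""
    at_risk = n_start
    hazards = []
    for deaths in deaths_by_interval:
        hazards.append(deaths / at_risk)
        at_risk -= deaths
    return hazards


n = 100
control_deaths = [30, 20]                 # years 0-5, years 5-10
delayed = 10                              # one third of the early control deaths
experimental_deaths = [30 - delayed, 20 + delayed]

control_h = interval_hazards(control_deaths, n)            # 30/100, 20/70
experimental_h = interval_hazards(experimental_deaths, n)  # 20/100, 30/80

print([round(h, 3) for h in control_h])       # [0.3, 0.286]
print([round(h, 3) for h in experimental_h])  # [0.2, 0.375]
```

In this heuristic, both arms have 50 patients alive at 10 years, so the survival curves separate during the first 5 years and come back together by year 10 even though the experimental treatment harms no one.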
|Trial ID||Primary End Point As Defined in the Primary Publication|
|POG-9006||Continuous complete remission rate: the time from achievement of a complete remission to failure (death, relapse, or second malignancy)|
|CLB-9011||Complete response: absence of constitutional symptoms and of lymphadenopathy, splenomegaly, and hepatomegaly on physical examination; an absolute neutrophil count of at least 1,500/μL, a platelet count of at least 100,000/μL, a hemoglobin level higher than 11 g/dL (without transfusion), and an absolute lymphocyte count of less than 4,000/μL; and bone marrow of normal cellularity, with less than 30% lymphocytes and no lymphoid nodules|
|SWOG-8892||Progression-free survival: time from registration to the date of first observation of progressive disease or death as a result of any cause|
|ECOG-2491||Disease-free survival: the time from the beginning of complete remission to relapse or death as a result of any cause|
|CCG-1882||Event-free survival: the time from random assignment to relapse at any site, a second malignant neoplasm, or death during remission|
|SWOG-8814||Disease-free survival: the time from random assignment to recurrence or death (the definition is from the study protocol; primary publication is an abstract)|
|CLB-9344||Disease-free survival: the time from study entry to first locoregional recurrence, first distant metastasis, or death as a result of any cause|
|CCG-5942||Event-free survival: the time to disease relapse, progression, occurrence of a second malignant neoplasm, or death from any cause|
|SWOG-9133||Failure-free survival: the time from random assignment to the date of disease progression or death|
|RTOG-9413||Progression-free survival: time to the first occurrence of local progression, regional nodal failure, distant failure, biochemical (PSA) failure, or death as a result of any cause|
|SWOG-S9701||Progression-free survival: the time from registration to the date of first recurrence or death|
|NCCTG-N9741||Progression-free survival: the time from study entry to disease progression. Deaths occurring within 30 days of treatment discontinuation were considered disease progression. Without contradictory data, patients who died or were lost to follow-up were assumed to have experienced progression at the time they were last known to be progression free. (This end point was referred to as “time to progression” in the primary publication)|
|NCIC-MA17||Disease-free survival: the time from random assignment to the recurrence of the primary disease (in the breast, chest wall, or nodal metastatic sites), or the development of a new primary breast cancer in the contralateral breast; secondary cancer or death without a recurrence, or a diagnosis of contralateral breast cancer were not included as events|
|E-1496||Progression-free survival: the time from maintenance random assignment to progression or death (the definition is from the study protocol; primary publication is an abstract)|
|E-E1A00||Response rate: best response within four cycles of treatment (4 months from the start of treatment). Standard ECOG response criteria were used. An objective response was defined as a 50% or higher decrease in the serum and urine monoclonal protein levels from baseline. Patients with measurable disease only in the urine needed to have a greater than 90% reduction in 24-hour urine monoclonal protein excretion to be considered as having a response|
|CCG-1961||Event-free survival: time to relapse at any site, death during remission, or a second malignant neoplasm|
|ECOG-2997||Complete response: response was evaluated according to NCI Working Group Criteria (Cheson BD et al, Blood 87:4990-4997, 1996)|
|NSABP-B-31/NCCTG-N9831||Disease-free survival: time to local, regional, and distant recurrence; contralateral breast cancer, including ductal carcinoma in situ; other second primary cancers; and death before recurrence or a second primary cancer|
|ECOG-2100||Progression-free survival: the time from randomization to disease progression or death from any cause|
|NCIC-MA21||Disease-free survival: the time from random assignment to the time of recurrence of the primary disease. Local or nodal recurrence and metastatic disease are considered a recurrence of the primary tumor. Patients who have contralateral breast cancer or a second primary malignancy or die as a result of some cause other than disease will be censored as event free at the time of death (the definition is from the study protocol; primary publication is an abstract)|
Abbreviations: NCCTG, North Central Cancer Treatment Group; RTOG, Radiation Therapy Oncology Group; POG, Pediatric Oncology Group; CLB, Cancer and Leukemia Group B; SWOG, Southwest Oncology Group; ECOG, Eastern Cooperative Oncology Group; CCG, Children's Cancer Group; PSA, prostate-specific antigen; NCIC, National Cancer Institute of Canada; NCI, National Cancer Institute; NSABP, National Surgical Adjuvant Breast and Bowel Project.
Authors' disclosures of potential conflicts of interest and author contributions are found at the end of this article.
After this article was accepted for publication, another trial came to our attention that was stopped early for positive results.74 The trial was not identified in our search because the abstract reporting the initial results75 made no mention that the trial was stopped early.
The author(s) indicated no potential conflicts of interest.
Conception and design: Edward L. Korn, Boris Freidlin, Margaret Mooney
Collection and assembly of data: Edward L. Korn, Boris Freidlin
Data analysis and interpretation: Edward L. Korn, Boris Freidlin
Manuscript writing: Edward L. Korn, Boris Freidlin, Margaret Mooney
Final approval of manuscript: Edward L. Korn, Boris Freidlin, Margaret Mooney