To balance patient interests against the need for acquiring evidence, ongoing randomized clinical trials are formally monitored for early convincing indication of benefit or lack of benefit. In lethal diseases like cancer, where new therapies are often toxic and may have limited preliminary efficacy data, monitoring for lack of benefit is particularly important. We review the complex nature of stopping a randomized trial for lack of benefit and argue that many cancer trials could be improved by a more aggressive approach to monitoring. On the other hand, we caution that some commonly used monitoring guidelines may result in stopping for lack of benefit even when a nontrivial beneficial effect is observed.
The ultimate goal of a phase III randomized clinical trial (RCT) is to provide evidence on the benefit-to-risk ratio that is sufficiently compelling to affect medical practice. For a study that is designed to show that a new therapy is better than the current standard, this implies convincing the medical community either that the new therapy has tangible therapeutic benefit or that the benefit-to-risk ratio does not support use of the therapy. To ensure that the study does not continue longer than necessary, most RCTs prospectively specify interim monitoring guidelines for the data monitoring committee (DMC) that allow stopping the study early for so-called positive or negative results. (Comprehensive discussions of the statistical methodology for monitoring RCTs are available.1–3) It should be emphasized that the benefit of early stopping goes beyond minimizing the number of the study participants receiving suboptimal care; even after enrollment has been completed and all patients are off study therapy, the wider clinical community still benefits from timely access to the study results.
In this article, we focus on monitoring studies for early evidence of lack of benefit (also called futility monitoring). We argue that many, but not all, RCTs could be improved by a more aggressive approach to monitoring for lack of benefit and illustrate this by reviewing several recently reported trials. On the other hand, we note that some commonly used futility rules are too aggressive in the second halves of trials and may therefore result in stopping for lack of benefit even with a nontrivial beneficial effect of the experimental treatment. Because most cancer clinical trials use time-to-event end points (eg, overall survival [OS], disease-free survival [DFS], or progression-free survival [PFS]), our presentation will be based on this model.
Most RCTs are designed to show that a new therapy improves a relevant clinical outcome compared with the control arm that contains either active standard therapy or no active therapy (placebo or observation). Therefore, the evidence required to conclude that the new therapy is not beneficial generally need not be as strong as the evidence required to prove its benefit.4 The interim monitoring guidelines should reflect this asymmetry in the underlying therapeutic question (ie, less stringent evidence of lack of benefit of the new therapy is often sufficient for stopping early for futility than would be required to show sufficient benefit for stopping early for efficacy).5–8 What constitutes sufficient and compelling evidence of lack of benefit depends on the nature of the intervention, context of the disease, and the standard of care.
In advanced disease cancer settings, most RCTs are designed to evaluate new agents with limited preliminary efficacy data. A high proportion of these new agents will be ineffective, and many of these agents have serious toxicities. First, consider designs that compare a new therapy with placebo or observation. (In a lethal disease like cancer, this implies that no active therapy is available.9) There is no scientific or therapeutic value in showing that the new treatment is significantly worse than no treatment.10 Therefore, observing no improvement in the primary outcome at some point into the study is often considered sufficient to demonstrate lack of benefit. The same rationale applies to designs that evaluate addition of a new therapy to the standard therapy (eg, standard plus new therapy v standard therapy plus placebo). A simple guideline is to stop the study if after observing 50% of the design-specified total number of events (50% of the total information), the hazard ratio (HR) of the new over the standard treatment is greater than 1 (ie, the experimental treatment is doing worse).11,12 However, if the experimental therapy is toxic, then waiting until 50% of the events have occurred (and when the accrual may almost be complete) may not be appropriate. In this case, monitoring should commence earlier (approximately 25% to 30% of the total information). 
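As a concrete illustration, the simple guideline described above can be sketched in code. The function names are ours, and the event-count ratio is only a crude stand-in for a hazard-ratio estimate (a real interim analysis would use a Cox model or log-rank statistic); this is a sketch of the rule's logic, not any protocol's implementation.

```python
def observed_hr(events_experimental: int, events_control: int) -> float:
    """Crude hazard-ratio estimate (experimental over control) from event
    counts under 1:1 randomization with comparable follow-up. Illustrative
    only; a real analysis would use a Cox model or log-rank statistic."""
    return events_experimental / events_control

def simple_futility_check(events_experimental: int, events_control: int,
                          total_planned_events: int,
                          start_fraction: float = 0.5) -> bool:
    """Stop for lack of benefit if, at or after `start_fraction` of the
    design-specified total events, the estimated HR exceeds 1 (ie, the
    experimental arm is doing worse)."""
    observed = events_experimental + events_control
    if observed / total_planned_events < start_fraction:
        return False  # monitoring has not yet commenced
    return observed_hr(events_experimental, events_control) > 1.0

# 130 of 260 planned events (50% information), 70 on the experimental arm:
print(simple_futility_check(70, 60, 260))  # prints True (HR ≈ 1.17)
```

For a toxic experimental therapy, the text's recommendation to start monitoring earlier corresponds to lowering `start_fraction` to roughly 0.25 to 0.30.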
For example, Danish Head and Neck Cancer Study 10 was an RCT comparing darbepoetin alfa versus placebo in patients with advanced head and neck cancer receiving radiotherapy.13 This study was stopped at the first scheduled interim analysis performed at 50% of total information (158 events, 522 patients, 87% of planned accrual) after observing a significantly worse outcome in terms of the primary end point of locoregional control (P = .01); 5-year OS rates were 38% and 51% on darbepoetin and control, respectively (P = .08).13,14 The trial was stopped even though the primary end point had apparently not crossed the protocol-specified symmetric monitoring boundary.15 In retrospect, it seems this study could have benefited from an asymmetric interim analysis plan scheduled to start at approximately 25% to 30% of information.
In designs where the experimental arm does not contain an active standard therapy to which it is being compared (new agent v standard active therapy), monitoring for lack of benefit requires some additional considerations (especially in an aggressive disease setting). The monitoring is often influenced by the consideration that accumulating evidence that the new therapy has similar activity to an established active treatment is valuable for the treatment development,16 especially if the new therapy has a favorable toxicity profile. However, in the presence of an active standard therapy, there is a heightened pressure to stop the trial for lack of benefit, because ineffectiveness of the new agent implies that patients in the experimental arm are receiving an inferior treatment. (By contrast, in designs with placebo/observation control, there is usually no expectation that the new therapy will result in a nontrivial detriment in clinical outcome relative to the control, with the main concern being treatment toxicity.) Consider an RCT17 that compared the matrix metalloproteinase inhibitor BAY12-9566 against gemcitabine in advanced pancreatic cancer. The trial was designed using a symmetric boundary, with the first comparative interim analysis scheduled at 50% of the total information. The study was stopped at the first comparative interim analysis (140 deaths observed, 277 patients accrued, 80% of planned accrual) on the basis of observing median survival of 3.2 months on the BAY12-9566 arm and 6.4 months on the gemcitabine arm (P = .0001). Gemcitabine is an approved therapy with documented survival and palliation benefit in this poor-prognosis population.17 Therefore, for this setting, we would recommend a lack-of-benefit rule that is scheduled to commence at approximately 25% of information and requires less dramatic evidence of harm (relative to the standard active therapy) for stopping.
After the start of monitoring, interim analyses should be repeated reasonably frequently (eg, every 10% to 20% of information, or every 6 or 12 months). Once interim analyses have started, frequent interim looks cost little in terms of the statistical power of the trial and reduce the possibility that the trial continues longer than necessary.8 Consider the Evaluation of Neo-Recormon on Outcome in Head and Neck Cancer in Europe trial18 that compared epoetin beta versus placebo in patients with head and neck cancer undergoing radiotherapy; the study completed accrual (351 patients) in April 2001. The publication18 was unclear regarding who was performing the interim monitoring and stated that the study sponsor decided to omit the second of the two planned interim analyses. In April 2003, 2 years after the scheduled date of the omitted interim analysis, the final analysis was conducted. It revealed a significant impairment in cancer control and survival for erythropoietin versus placebo: locoregional PFS (primary end point) HR = 1.62 (P = .0008) and OS HR = 1.39 (P = .02).18 In theory, monitoring rules with frequent interim analyses using asymmetric boundaries could have made the negative results available earlier. This would have been important for a number of ongoing and planned trials of this agent (or similar agents). For example, in response to the publication of the Evaluation of Neo-Recormon on Outcome in Head and Neck Cancer in Europe study results, a similarly designed study, Radiation Therapy Oncology Group 99-03 (radiotherapy with or without epoetin alfa in head and neck cancer), performed an unscheduled interim analysis.19 On the basis of observing a negative trend in patient outcome for the epoetin alfa arm, the Radiation Therapy Oncology Group DMC closed the trial after only 148 of 372 planned patients were enrolled.19
In settings where survival events accumulate slowly relative to patient accrual, even well-designed survival-based futility rules are unlikely to stop the study before accrual is completed. Incorporating a reliable intermediate end point (eg, PFS) into the futility rule may lessen the number of patients treated with ineffective therapy and speed up evaluation of new agents.20,21 Note that to be used for interim futility monitoring, an intermediate end point does not generally have to be validated as a surrogate end point for the primary outcome; often, a consistent association between the interim and primary end points is considered sufficient.21
In certain cases, an RCT is conducted when reliable evidence of treatment benefit in one or more similar settings is available or the therapy is already in limited use in the community. The trial is therefore designed to provide definitive proof to support widespread use of the new treatment. In these situations, stopping the trial early for lack of benefit may require strong refutation of benefit. Another situation where a conservative approach to monitoring (requiring more mature data) is often appropriately adopted is when there is concern that the treatment effect may be delayed.
As an example, consider the second part of National Surgical Adjuvant Breast and Bowel Project B14, in which patients with breast cancer after 5 years of tamoxifen were randomly assigned to 5 additional years of tamoxifen versus placebo.22 Interim analysis results are presented in Table 1. At the time of the second interim analysis, a negative trend was observed, but the DMC decided against recommending stopping the study. The study was stopped at the third interim analysis at 76% of the planned total information, with a significant negative trend.22 In assessing the rationale behind the interim monitoring decisions in this trial, it is important to keep in mind that at the time there was considerable enthusiasm for use of tamoxifen based on strong evidence of tamoxifen benefit when used for up to 5 years;23 there was also evidence that longer tamoxifen treatment was more effective.23 Moreover, there was concern that the benefit from the additional 5 years of tamoxifen might be delayed by a carry-over effect because both arms received tamoxifen for the first 5 years.24 This example represents a somewhat extreme situation where considerable evidence of harm was required to convince the medical community that the experimental therapy was not beneficial. Although some still believed that the study had been stopped prematurely,24 further follow-up confirmed the study conclusion.25
To accommodate the complex nature of interim monitoring for lack of benefit, several formal stopping guidelines (boundaries) have been developed.26–28 However, their implementation requires careful consideration. These guidelines presume that the trial is designed to target the minimal clinically meaningful effect (ie, the smallest therapeutic effect corresponding to an acceptable benefit-to-risk ratio). If a treatment effect smaller than the one used to design the study is still clinically important, these stopping guidelines may have unreasonable boundaries that recommend stopping the trial for lack of benefit even when a meaningful positive trend is observed.29 For example, for a trial with 80% power (α = 0.05 two-sided) targeting the HR of 0.67 (of the new over the standard treatment), a common implementation of Pampallona-Tsiatis approach27 with four equally spaced interim analyses would suggest stopping the study for lack of benefit at the fourth interim analysis if an HR of 0.81 favoring the experimental arm is observed (the triangular test approach26 would stop with HR = 0.79).
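The design calculation behind the example above, and the mapping between a boundary on the test-statistic scale and an observed HR, can be sketched as follows. This is not a reproduction of the Pampallona-Tsiatis or triangular boundaries themselves; it only shows the standard approximations (Schoenfeld event formula; ln HR with standard error 2/sqrt(events) under 1:1 randomization) that translate such boundaries into HR values like those quoted in the text. Function names are ours.

```python
import math
from statistics import NormalDist

def required_events(hr_alt: float, alpha: float = 0.05,
                    power: float = 0.80) -> float:
    """Schoenfeld approximation: events needed for a two-sided level-alpha
    log-rank test with 1:1 randomization to have the given power against
    the alternative hazard ratio `hr_alt`."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return 4 * (z_a + z_b) ** 2 / math.log(hr_alt) ** 2

def hr_at_boundary(z: float, events: float) -> float:
    """Observed HR corresponding to a log-rank z-value at a given number
    of events (1:1 randomization): z ≈ ln(HR) * sqrt(events) / 2."""
    return math.exp(2 * z / math.sqrt(events))

d = required_events(0.67)   # design in the text: 80% power, alpha = .05
print(round(d))             # prints 196

# At 80% information, even a mildly favorable z-value of -1.0 corresponds
# to an observed HR below 1 favoring the experimental arm:
print(round(hr_at_boundary(-1.0, 0.8 * d), 2))  # prints 0.85
```

This mapping makes the text's concern concrete: near the end of such a trial, a futility boundary expressed on the z scale can correspond to observed HRs in the 0.79 to 0.85 range, which may still represent a clinically meaningful benefit.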
In practice, because of feasibility considerations, many RCTs are designed with target treatment effects exceeding the minimal clinically meaningful one.30 For example, median OS for patients with advanced pancreatic cancer has been relatively unchanged over the last decade (approximately 6 months). In an informal sample of 10 recently published phase III RCTs17,31–39 evaluating the effect of new regimens on OS in this setting, the target treatment effect (HR) ranged from 0.77 to 0.57 (half of the trials targeted effects of 0.67 or better). At the same time, the most recent drug approved for this indication had an HR of 0.81.39 Therefore, application of either Pampallona-Tsiatis or triangular approaches to the five trials targeting HR of 0.67 or better would have resulted in boundaries inconsistent with the clinical setting.
An underlying principle behind RCTs is that the patients in the trial participate in advancing clinical science while receiving the best possible medical care. To maintain the delicate balance between individual and collective interests, interim monitoring needs to be carefully calibrated to account for specifics of the trial. A summary of the main considerations in identifying an appropriate approach to monitoring for lack of benefit is presented in Table 2. In studies with no active therapy in the control arm (placebo or observation), the aggressiveness of the futility monitoring should depend on the degree of morbidity of the experimental arm: the more morbidity, the more aggressive the monitoring (rows 1 and 2 of Table 2). An intuitively appealing and flexible approach to monitoring for lack of benefit is based on rejecting the alternative hypothesis at some prespecified significance level.7,40 The desired degree of aggressiveness is achieved by varying the nominal P value required for stopping: with larger P values, it is easier to stop for lack of benefit, thus making the boundary more aggressive. The resulting stopping boundaries correspond approximately to observing a negative trend during the first half of the study (with < 50% of information) and observing no benefit or even a small positive trend in the second half.
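The "reject the alternative hypothesis" rule can be sketched as follows, assuming a time-to-event trial with 1:1 randomization and the usual normal approximation for the log hazard ratio (standard error approximately 2/sqrt(events)). The function name and defaults are ours; `hr_alt` = 0.67 is taken from the design example discussed earlier, and `p_stop` is the nominal P value that tunes the rule's aggressiveness.

```python
import math
from statistics import NormalDist

def futility_stop(hr_hat: float, events: int,
                  hr_alt: float = 0.67, p_stop: float = 0.05) -> bool:
    """Stop for lack of benefit if the design alternative H1: HR = hr_alt
    is rejected at nominal one-sided level `p_stop`.

    Larger `p_stop` makes the rule more aggressive (easier to stop).
    Uses se(ln HR) ≈ 2 / sqrt(events) under 1:1 randomization.
    """
    se = 2 / math.sqrt(events)
    # z > 0 means the observed data look worse than the targeted alternative
    z = (math.log(hr_hat) - math.log(hr_alt)) / se
    p_value = 1 - NormalDist().cdf(z)  # one-sided test of H1
    return p_value < p_stop

# At ~50% information for a 196-event design (98 events):
print(futility_stop(1.00, 98))  # prints True: no observed benefit -> stop
print(futility_stop(0.80, 98))  # prints False: a modest trend continues
```

As the example output illustrates, with these settings the rule stops when no benefit is observed at 50% information but lets the trial continue when a nontrivial positive trend (HR = 0.80) is present, consistent with the behavior described above.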
In studies that compare a new, less toxic therapy to an established active control, monitoring for lack of benefit should take into consideration that a marginal improvement over the active control may still result in a valuable contribution to the treatment of the disease. In this setting, an appropriately aggressive stopping boundary should still be used in the first half of the trial to guard against unnecessary exposure of trial participants to suboptimal medical care (row 3 of Table 2). However, in the second half of the trial, stopping for lack of benefit if a small improvement is observed (as suggested by some commonly used futility rules) may be counterproductive (especially in diseases with few effective therapies like pancreatic cancer). Instead, a conservative boundary that allows the study to continue unless the new therapy seems worse is recommended. In fact, for new agents with favorable toxicity profiles that are expected to provide only a marginal improvement over the standard therapy, so-called hybrid designs that allow testing for noninferiority as well as superiority of the new therapy may be more appropriate.41
It should be noted that some RCTs designed to assess therapies intended to improve quality-of-life–related outcomes include mortality only as a secondary end point (or not at all). Even if one does not expect these therapies to make the underlying disease worse, the RCTs should incorporate formal stopping guidelines for detriment in survival42 (or any other efficacy outcome that captures relevant clinical events, eg, DFS).
It is widely recognized that monitoring of RCTs is best performed by an independent DMC.43–45 To guide the DMC decisions, it is important that an interim analysis plan including appropriate futility (and efficacy) boundaries be prospectively specified in the protocol. The boundaries should be examined for consistency with the specific clinical setting to ensure that the trial does not continue longer than necessary and, at the same time, does not stop prematurely without addressing the study goal.
Authors' disclosures of potential conflicts of interest and author contributions are found at the end of this article.
The author(s) indicated no potential conflicts of interest.
Conception and design: Boris Freidlin, Edward L. Korn
Financial support: Boris Freidlin, Edward L. Korn
Administrative support: Boris Freidlin, Edward L. Korn
Provision of study materials or patients: Boris Freidlin, Edward L. Korn
Collection and assembly of data: Boris Freidlin, Edward L. Korn
Data analysis and interpretation: Edward L. Korn
Manuscript writing: Boris Freidlin, Edward L. Korn
Final approval of manuscript: Boris Freidlin, Edward L. Korn