We therefore need other strategies in our toolkit to speed up the process of getting reliable answers. A strategy may be considered useful if it can satisfy the following principles: 1) it is better than separate single-arm phase 2 trials in deciding whether to continue testing a new treatment; 2) it will test many new promising treatments at the same time so that the probability of finding a successful new treatment is increased; 3) it has the potential to discontinue unpromising arms quickly and reliably; and 4) it bases major decisions on randomized evidence.
One approach that addresses all of the above principles is the multi-arm multi-stage (MAMS) randomized trial. In this approach, several agents are assessed simultaneously against a single control group in a randomized fashion. In the early stages of the trial, each of the experimental arms is compared in a pairwise manner with the control arm using an “intermediate” outcome measure that is required to be related to the primary outcome measure but does not have to be a true “surrogate” outcome measure [for definitions of surrogacy see (19)]. Recruitment to experimental arms that do not show sufficient promise with the intermediate outcome measures is discontinued. Recruitment to the control arm and to the promising experimental arms continues until sufficient numbers of patients have been entered to assess the impact of the experimental treatments on the primary outcome measure.
A hypothetical example is a randomized trial with four experimental arms and one control arm, run in two stages (Figure 1). The intermediate and primary outcome measures are progression-free survival and overall survival, respectively. When a prespecified number of intermediate outcome events have been observed in the control arm, a pairwise comparison is made between each experimental arm and the control arm. If the observed effect size does not cross a predefined critical value, then consideration is given to not randomly assigning additional patients to that experimental arm. Accrual to the trial, however, continues while the analysis is conducted. After the analysis, patients continue to be randomly assigned to those experimental treatments that do cross the critical value and also to the control arm until the prespecified number of events on the primary outcome measure have been observed. The predefined critical value depends on four components: 1) the null hypothesis for the intermediate outcome measure (usually taken to be no difference), 2) the alternative hypothesis for the intermediate outcome measure, 3) the probability of continuing to the next stage should the null hypothesis be true, and 4) the probability of continuing to the next stage should the alternative hypothesis be true. The critical value is calculated for each stage by considering whether we can reject the null hypothesis (at the level of the probability of continuing to the next stage should the null hypothesis be true). Technical details are given in (20), and the practical specification of these parameters is displayed in the examples below.
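The stage I guideline described above can be sketched in a few lines of Python. This is only an illustration: the arm labels, the observed hazard ratios, and the critical value of 0.90 are hypothetical and are not taken from any trial, and an observed value exactly on the critical value is treated here as not crossing it. In practice the rule is a guideline for the monitoring committee, not an automatic decision.

```python
def arms_to_continue(observed_hr, critical_hr):
    """Return the experimental arms whose observed HR lies below the critical value.

    observed_hr maps an arm label to its observed hazard ratio versus control
    (HR < 1 favors the experimental arm). Arms at or above the critical value
    are the ones for which stopping accrual would be considered.
    """
    return sorted(arm for arm, hr in observed_hr.items() if hr < critical_hr)


# Hypothetical stage I results for four experimental arms (illustration only).
stage1_hr = {"arm 2": 0.81, "arm 3": 0.97, "arm 4": 1.05, "arm 5": 0.88}

continuing = arms_to_continue(stage1_hr, critical_hr=0.90)
print(continuing)
```

With these made-up numbers, arms 2 and 5 would continue to accrue alongside the control arm.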
Figure 1 Hypothetical randomized trial showing a multi-arm, two-stage design. Arm 1 is the control arm and arms 2–5 are the experimental arms. At the end of stage I, each experimental arm is compared against the control arm in a pairwise manner using the ...
A general explanation of an intermediate outcome measure used in this way is as follows. If there is no effect on the intermediate outcome measure (ie, if the null hypothesis is true), then it is very likely that there will be no effect on the primary outcome measure. The intermediate outcome measure is therefore required to have high negative predictive value. However, if the alternative hypothesis is true for the intermediate outcome, this will not necessarily mean that the alternative hypothesis will be true for the primary outcome measure. There is no requirement for the intermediate outcome measure to have a high positive predictive value. In trials of cancer treatment, typical intermediate outcome measures might be progression-free survival or response to treatment and a typical primary outcome measure might be overall survival. Extension of this model to more than two stages is shown in the examples below.
In the MAMS design, a randomized comparison is initiated as soon as possible, although there still remains a role for single-agent phase 2 trials to prioritize new therapies for feeding into MAMS trials (Figure 2). One of the first advantages of the MAMS design is that many new treatments are considered at once, involving fewer patients, a shorter time, and lower costs than assessing each of the agents in large-scale separate two-arm trials. The multi-arm nature also improves the likelihood of a “positive” trial. For example, if a two-arm phase 3 trial in oncology has a 40% chance of showing a “positive” result (14), and if we assume that the probability of success of each of the new experimental arms in a MAMS trial is approximately independent, then for a five-arm cancer trial with four new experimental therapies, the probability of at least one successful arm in the trial increases to 87%.
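The 87% figure follows from the independence assumption: the chance that all four experimental arms fail is 0.6 to the fourth power, so the chance that at least one succeeds is one minus that. A minimal Python check, in which the function name is ours:

```python
def prob_at_least_one_success(p_single: float, n_arms: int) -> float:
    """Chance that at least one of n_arms independent arms is 'positive'."""
    return 1.0 - (1.0 - p_single) ** n_arms


# 40% chance per arm, one to four experimental arms.
for k in range(1, 5):
    print(k, round(prob_at_least_one_success(0.40, k), 3))
```

For four arms this gives 1 - 0.6^4 = 0.8704, which rounds to the 87% quoted above. If the arms' results were positively correlated (for example, because they share a control arm), the true probability would be somewhat lower.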
Figure 2 Where do multi-arm multi-stage (MAMS) trials fit into the phase 1, 2, and 3 setup? A) The traditional approach. Three new agents, R1, R2, and R3, enter and pass three single-agent single-arm phase 2 trials and also three separate single-arm combination ...
We are aware of three trials that have used the MAMS design. These are the Systemic Therapy in Advancing or Metastatic Prostate Cancer: Evaluation of Drug Efficacy (STAMPEDE) trial (21); a collaborative trial, GOG-182/ICON5, involving the Gynecologic Oncology Group (GOG) and the International Collaborative Ovarian Neoplasm Studies Group (ICON) (22); and ICON6 (23).
Examples of multi-arm, multi-stage trials (protocols for these trials)*
STAMPEDE (21) is a six-arm, five-stage trial of different therapies for men who are starting hormone therapy for advanced prostate cancer. Such men will typically have disease that has spread beyond the prostate, and thus it is standard care to treat their disease systemically with hormonal therapy. Approximately 85% of patients initially respond well to such hormone therapy, but the disease progresses in virtually all patients, with a median time to progression of approximately 24 months. A number of treatments, when added to hormone therapy, could potentially improve these outcomes. STAMPEDE is a trial of three of these therapies, together with some combinations of them.
In STAMPEDE, patients are randomly assigned to either the control arm or one of five experimental arms (Figure 3A). The five stages of the trial include a pilot stage, three intermediate activity stages, and a final efficacy stage. The randomization ratio of the control arm to the five experimental arms is 2:1:1:1:1:1. The control arm is used in all the pairwise comparisons, and this imbalance in randomization facilitates a more reliable estimate of the event rates in the control arm at any given time. Moreover, for a given total number of patients to be randomly assigned to the trial, the imbalance increases the power slightly for each pairwise comparison with the control arm.
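The effect of the 2:1 allocation on each pairwise comparison can be illustrated with a rough calculation. The sketch below assumes that the variance of a log hazard ratio is approximately 1/(control events) + 1/(experimental events), and that events in each arm are proportional to the number of patients allocated to it, as under the null hypothesis. These simplifications are ours, not the trial's formal design calculation.

```python
def relative_variance(control_weight: float, n_experimental: int) -> float:
    """Variance of one pairwise log HR, up to the factor 1 / (total events).

    With allocation control_weight:1:...:1, the control arm receives a share
    control_weight / (control_weight + n_experimental) of all events, and each
    experimental arm receives a share 1 / (control_weight + n_experimental).
    """
    total_weight = control_weight + n_experimental
    return total_weight * (1.0 / control_weight + 1.0)


equal = relative_variance(1, 5)    # 1:1:1:1:1:1 allocation
double = relative_variance(2, 5)   # 2:1:1:1:1:1 allocation, as in STAMPEDE
print(equal, double, 1 - double / equal)
```

Under these assumptions, doubling the control allocation lowers the variance of each comparison from 12 to 10.5 units, a reduction of 12.5%, which is consistent with the slight gain in power described above.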
Figure 3 Two multi-arm multi-stage trials. A) Systemic Therapy in Advancing or Metastatic Prostate Cancer: Evaluation of Drug Efficacy (STAMPEDE) trial with six arms (A–F). B) Gynecologic Oncology Group/International Collaborative Ovarian Neoplasm Studies ...
The pilot phase was planned to include 210 patients, with the aim of confirming the safety of the five experimental treatments, particularly in the two arms with treatment combinations of zoledronic acid plus docetaxel and zoledronic acid plus celecoxib that had not been tested before in men with prostate cancer. There was no a priori reason to suspect that any of the experimental treatments would produce unacceptable toxic effects.
The three intermediate activity stages were designed to compare each experimental arm pairwise with the control arm on the intermediate outcome measure of failure-free survival (FFS, including prostate-specific antigen–defined progression). At each of these stages, a guideline critical value has been set for the observed hazard ratio (HR). These critical values are 1.00, 0.92, and 0.89 for stages I, II, and III, respectively, and analyses will be performed when 115, 225, and 355 FFS events, respectively, have been observed in the control arm. The final stage has the primary outcome measure of overall survival. The key operating characteristics, at each stage and overall, are the probability of continuing to the next stage should the null hypothesis be true, the overall type I error, and the power.
How were the hurdles chosen? First, if an experimental arm is as effective as specified in the alternative hypothesis, then we require a high probability that it will continue to the next stage. This probability is set at 95% for stages I to III inclusive. To achieve this probability and still have an opportunity to stop an experimental arm for lack of benefit, we need to take a more “relaxed” approach to continuing to the next stage when the null hypothesis is true. An error in this direction can be considered to be “conservative.” For STAMPEDE, at the end of the first stage, we have set a 50% probability of continuing each experimental arm to the next stage when the null hypothesis is true.
After the first stage, as the control arm events continue to accumulate and the information in the trial increases, this probability can be reduced. Thus, at the end of the second stage, the probability of continuing when the null hypothesis is true is reduced to 25%, and at the third stage it is reduced further, to 10%. The power at the end of stage IV for the outcome of overall survival is set at the traditional 90%, with a (one-sided) type I error of 2.5%. Overall, across all stages, each pairwise comparison retains good power of 84%, with an overall type I error of 1.7%. The boundaries and probabilities of stopping, assuming we were to observe an estimate from the trial exactly on the critical HR for that stage, are best displayed graphically (Figure 5).
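The link between the stage-wise probabilities of continuing under the null hypothesis (50%, 25%, and 10%), the control arm event targets (115, 225, and 355), and the critical HRs (1.00, 0.92, and 0.89) can be checked approximately. The sketch below is not the method of reference (20). It assumes that the log HR is approximately normal with variance 1/(control events) + 1/(experimental events), and that under the null hypothesis an experimental arm has half as many events as the control arm because of the 2:1 allocation.

```python
from math import exp, sqrt
from statistics import NormalDist


def critical_hr(control_events: float, p_continue_null: float,
                allocation_ratio: float = 2.0) -> float:
    """Approximate critical HR for one pairwise comparison at one stage.

    control_events:   events observed in the control arm at the analysis
    p_continue_null:  probability of continuing if the null (HR = 1) is true
    allocation_ratio: control:experimental allocation (assumed 2 for STAMPEDE)
    """
    experimental_events = control_events / allocation_ratio  # assumption under the null
    se_log_hr = sqrt(1.0 / control_events + 1.0 / experimental_events)
    z = NormalDist().inv_cdf(p_continue_null)  # zero or negative for p <= 0.5
    return exp(z * se_log_hr)


for stage, events, p_null in [("I", 115, 0.50), ("II", 225, 0.25), ("III", 355, 0.10)]:
    print(stage, round(critical_hr(events, p_null), 3))
```

Under these assumptions the approximation gives values close to the guideline critical HRs quoted above. Small differences are to be expected, because the formal design calculation differs in its details.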
Design characteristics of the STAMPEDE trial*
Figure 5 Stopping guidelines on the hazard ratio scale for the Systemic Therapy in Advancing or Metastatic Prostate Cancer: Evaluation of Drug Efficacy (STAMPEDE) trial. CI = confidence interval; HR = hazard ratio; Stop = stopping of accrual (rather than termination ...
Using a uniform distribution to model the accrual rate means that at the end of these three stages, we anticipate 1200, 1800, and 2400 patients to be randomly assigned in the entire trial. For each experimental arm, these numbers will correspond to 172, 272, and 392 patients being entered into each arm (remaining) under the assumption that five experimental arms will accrue in the first stage, four in the second, and three in the third. This trial recruited its first patient on October 17, 2005, and is anticipated to be completed within 7 years. By June 4, 2008, 582 patients had been entered. The pilot phase had been completed successfully, and all arms had been continued into the next stage.
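The per-arm numbers follow from the 2:1:...:1 allocation applied to the patients accrued in each stage. The short Python sketch below reproduces the arithmetic. The function name is ours, and rounding up at each stage is our assumption about how the published figures were obtained.

```python
from math import ceil


def patients_per_experimental_arm(cumulative_totals, arms_open, control_weight=2):
    """Cumulative patients in each continuing experimental arm after each stage.

    cumulative_totals: total patients randomized by the end of each stage
    arms_open:         number of experimental arms accruing during each stage
    control_weight:    allocation weight of the control arm (experimental arms have 1)
    """
    per_arm = 0.0
    previous_total = 0
    result = []
    for total, k in zip(cumulative_totals, arms_open):
        newly_randomized = total - previous_total
        per_arm += newly_randomized / (control_weight + k)  # one share per experimental arm
        result.append(ceil(per_arm))
        previous_total = total
    return result


per_arm_counts = patients_per_experimental_arm([1200, 1800, 2400], [5, 4, 3])
print(per_arm_counts)
```

The first stage contributes 1200/7 patients per experimental arm, and the second and third stages add 600/6 and 600/5, respectively, giving the 172, 272, and 392 patients quoted above.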
GOG-182/ICON5 is an MAMS trial with five arms and two stages. Women with advanced ovarian cancer were randomly assigned to one of five different combination chemotherapy regimens, consisting of four experimental arms and one control arm (Figure 3B). Separate pilot trials (24) were conducted before GOG-182/ICON5, the main aim of which was to confirm the feasibility and safety of the new combination regimens before launching a randomized controlled trial; activity was not a major outcome measure. The first stage analysis of GOG-182/ICON5, using progression-free survival, was planned when 240 progressions or deaths in the control arm had been observed. The second stage of the trial was designed to focus on overall survival. At both stages, each of the four experimental arms was to be compared in a pairwise manner with the control arm.
The trial started accruing patients on February 7, 2001, and, with an anticipated entry rate of 500 patients per year, the 240 progressions or deaths were predicted to be observed approximately 4 years into the trial. At the outset, the guideline critical value of the hazard ratio for each pairwise comparison of progression-free survival after stage I was set at 0.87 (HR < 1 favors the experimental over the control arm). Thus, if the observed HR was greater than 0.87 (ie, closer to 1.00), then the Data Monitoring Committee (DMC) should consider recommending stopping further accrual to that particular experimental arm; if HR was less than 0.87, then accrual to the arm should be continued. Assuming that the experimental regimen was truly effective (ie, that it had a real underlying HR of 0.75), then the probability that it would be observed to be better than 0.87 was 93%, with a 5% probability that the trial would continue inappropriately.
The observed accrual rate was exceptionally high, with more than 1200 patients per year being entered into the trial worldwide over 3 years. The first stage analysis was triggered in May 2004, when 3836 patients had been randomly assigned and 272 events (progressions or deaths) had been reported in the control arm. Such a fast accrual rate gave the opportunity to relax the intermediate hurdle. Thus, the DMC considered not only the hurdle of 0.87 but also the hurdle of 0.94. This additional hurdle was introduced without knowledge of the results. This change means that if an experimental regimen was truly effective (ie, had a real underlying HR of 0.75), then the probability that it would jump this new hurdle was greater than 99.9%, with a 5% probability of continuing to the next stage, should the null hypothesis be true. This conservative and small change in the hurdle had very little impact on the overall power and type I error for the trial as a whole.
The statistical report provided to the DMC presented data on progression-free survival, toxicity, and deaths due to treatment. Overall survival data were also presented for context, although data for this outcome were inevitably limited. In accordance with the prespecified guidelines, the DMC saw no justification to extend accrual to any of the arms and thus indicated that the trial be closed to accrual of further patients. This conclusion was endorsed by the International Steering Committee for the trial, and hence accrual was closed on September 1, 2004. The mature results on overall survival presented in June 2006 (22) confirm that the decision to not accrue additional patients was a good one.
Estimated treatment hazard ratios (HRs) for progression-free survival and overall survival (ratio of experimental to control) for the first stage analysis of GOG-182/ICON5 presented to the Data Monitoring Committee in May 2004*
Updated treatment hazard ratios (HRs) for progression-free and overall survival (ratio of experimental to control) for the first stage analysis of GOG-182/ICON5 presented at the American Society of Clinical Oncology in June 2006*
The GOG-182/ICON5 trial clearly displays the practical value of the MAMS design. Unfortunately, none of the new treatment approaches showed enough potential on the intermediate outcome measure of progression-free survival to justify continuation to the second and final stage of accrual. It was more appropriate to focus resources on assessing new approaches. However, we obtained reliable answers to these four questions in 3.5 years (from start of accrual to the planned first stage analysis), which is considerably faster than we have been able to do before. The MAMS nature of the trial saved some 20 years when compared with an alternative approach of four consecutive two-arm trials each with overall survival as the primary and only outcome measure.
ICON6 (23) is a three-arm, three-stage double-blind placebo-controlled multicenter randomized phase 3 trial for women with relapsed ovarian cancer. The three arms of ICON6 are chemotherapy alone, chemotherapy plus cediranib given during chemotherapy, and chemotherapy plus cediranib during chemotherapy and further cediranib alone for a maximum of 18 months. The primary outcome measures at the three stages are safety at the first stage, progression-free survival at the second stage, and overall survival at the third stage.