We demonstrated that non-randomized consolidation trials in the second-line setting, designed to show an improvement in PFS over a historical control, can be underpowered for the primary endpoint or can yield biased estimates that cannot be compared with results from other studies. A single-arm consolidation trial may be underpowered because efficacy estimates such as PFS include the duration of second-line therapy, which dilutes the effect of the investigational treatment. We showed that study power depends on both the duration of second-line therapy and the starting time of IT, each of which can vary widely in practice. The longer the time on SLT, the larger the sample size, or the greater the clinical benefit, required to show an improvement. We recommend that the design of consolidation trials account for the duration of SLT, either by defining PFS from the start of IT or by restricting SLT duration per protocol. This is not a purely statistical decision, since both approaches raise clinical and logistical issues.
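To make the dilution concrete, the following minimal sketch assumes, for illustration only, that patients do not progress during SLT or the TFI (they enter consolidation in CR). Under that assumption, PFS measured from the start of SLT is PFS from the start of IT shifted by the SLT duration plus the TFI. The medians (6 months without IT, 9 months with IT), the 1-month TFI, and the function name are hypothetical.

```python
def diluted_median_ratio(slt_months, tfi_months, median_control, median_target):
    """Ratio of target to control median PFS when PFS is measured from the
    start of SLT. Assumes (hypothetically) no progression before IT starts,
    so each median is shifted by the same SLT + TFI offset."""
    offset = slt_months + tfi_months
    return (offset + median_target) / (offset + median_control)


# Hypothetical medians (months) for PFS measured from the start of IT.
MEDIAN_CONTROL = 6.0
MEDIAN_TARGET = 9.0
TFI = 1.0  # hypothetical treatment-free interval, months

for slt in (0.0, 3.0, 4.5, 6.0):
    ratio = diluted_median_ratio(slt, TFI, MEDIAN_CONTROL, MEDIAN_TARGET)
    print(f"SLT {slt:>4.1f} mo: ratio of medians from start of SLT = {ratio:.3f}")
print(f"From start of IT: ratio = {MEDIAN_TARGET / MEDIAN_CONTROL:.3f}")
```

With these inputs the ratio of medians falls from 1.50 when PFS starts at IT to about 1.23 after 6 months of SLT. The same 3-month gain therefore corresponds to a smaller relative effect, which requires more patients or a larger benefit to detect.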
It is acknowledged that whether IT is efficacious is best answered by a definitive Phase III trial comparing two randomized arms, namely SLT alone and SLT with consolidation therapy added (SLT+IT). Although less powerful, randomized Phase II trials would also provide a head-to-head comparison with a concurrent control and remove the dependence on historical estimates (14). However, to design randomized studies we need meaningful PFS estimates for the control arm and for the expected improvement on which to base the required sample size. These estimates are always based on smaller Phase II trials. Furthermore, when a larger randomized Phase II trial is not feasible, single-arm consolidation trials remain a viable option for identifying active agents before committing to a larger confirmatory trial.
Our focus has been second-line therapy, but the questions of what constitutes a clinically meaningful improvement and when PFS should start apply to consolidation trials in other lines of treatment. In primary therapy these issues are less critical because the duration of first-line therapy is fairly uniform, typically 6–8 cycles, whereas the duration of second-line therapy is more variable. For example, in the Phase III trial SWOG S9761/GOG 178 (15), patients with advanced-stage OC in complete response to platinum/taxane therapy were randomized to 3 or 12 cycles of monthly paclitaxel. PFS significantly favored 12 cycles (median PFS 22 vs 14 months; p = 0.006), with PFS measured from the start of first-line therapy and front-line therapy restricted to 5–6 cycles. In contrast, the Oregovomab trial (16) randomized patients with advanced OC to maintenance immunotherapy or placebo 4 to 12 weeks after the end of front-line therapy. It showed no improvement (median PFS 10.3 months with oregovomab vs 12.9 months with placebo; p = 0.2), with PFS measured from randomization. The GOG 178 estimate includes the time on front-line therapy but appropriately restricts that therapy per protocol. The Oregovomab trial, by contrast, excludes the time on front-line therapy by starting PFS at randomization, allowing a TFI of 4 to 12 weeks before randomization. Although the two trials report PFS differently, their results may be compared because the duration of primary therapy is relatively consistent. In non-randomized consolidation trials in the second-line setting, however, the starting point of PFS is not uniformly defined, and the duration of SLT is neither restricted nor consistent. This limits the ability to compare studies.
To minimize this variability, we propose eligibility restrictions for non-randomized trials evaluating agents in the second-line consolidation setting. One approach would be to restrict both the time on SLT and the TFI. The duration of SLT cannot be strictly fixed, because patients may achieve CR at different time points, but we suggest a design allowing 5–6 cycles of SLT. Similarly, when IT starts after SLT, the TFI should be restricted to at most 2 months from the completion of SLT to the start of IT. If these restrictions are not feasible, another approach would be to exclude SLT from the definition of PFS by calculating PFS from the start of IT; we have shown that this yields clear savings in sample size and resources. However, comparisons with historical data must then be made cautiously. When PFS is calculated from the start of SLT, the estimates apply to all patients enrolled after primary recurrence. When the duration of SLT is excluded from the PFS definition, the estimates are less prone to bias because they measure the efficacy of the investigational treatment alone. They apply, however, only to patients who achieved CR after completing SLT, and the literature for this population is less robust.
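The two PFS definitions and the proposed eligibility restrictions can be expressed directly on per-patient interval data. In this sketch the record layout and field names are hypothetical, and durations are in months. The limits encode the suggested caps of 6 SLT cycles and a 2-month TFI. Both definitions share the event indicator, since only the time origin changes.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """Hypothetical per-patient intervals (months) for a consolidation trial."""
    slt_cycles: int            # cycles of second-line therapy received
    slt_months: float          # start of SLT to end of SLT
    tfi_months: float          # end of SLT to start of IT
    it_to_event_months: float  # start of IT to progression/death or censoring
    progressed: bool           # True if an event was observed, False if censored


MAX_SLT_CYCLES = 6    # upper limit from the suggested 5-6 SLT cycles
MAX_TFI_MONTHS = 2.0  # TFI of up to 2 months from end of SLT to start of IT


def eligible(p: Patient) -> bool:
    """Apply the proposed restrictions on SLT cycles and TFI."""
    return p.slt_cycles <= MAX_SLT_CYCLES and p.tfi_months <= MAX_TFI_MONTHS


def pfs_from_slt_start(p: Patient) -> float:
    """PFS including SLT and TFI; valid for all patients starting SLT."""
    return p.slt_months + p.tfi_months + p.it_to_event_months


def pfs_from_it_start(p: Patient) -> float:
    """PFS excluding SLT; applies only to patients who go on to IT."""
    return p.it_to_event_months


# Two hypothetical patients.
cohort = [
    Patient(6, 4.2, 1.0, 8.0, True),
    Patient(9, 6.3, 3.5, 5.0, False),
]
for p in cohort:
    print(eligible(p), pfs_from_slt_start(p), pfs_from_it_start(p), p.progressed)
```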
Our study addresses the effect of SLT duration on the final PFS estimates under specific assumptions. Our sample size and power calculations considered a specific difference in median PFS based on our experience and on estimates reported in the literature. We assumed that PFS follows an exponential distribution with a constant hazard within each treatment interval. Although this assumption may not hold for real data, it is appealing for sample size calculations because of its simplicity and interpretability, and it is commonly used (17). Power estimates may differ under other distributions, and such an evaluation is beyond the scope of this paper. However, our conclusions about the importance of defining the starting times of IT and PFS apply in general.
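A constant hazard within each treatment interval corresponds to a piecewise exponential model. The sketch below computes the cumulative hazard and the median PFS measured from the start of SLT. It takes the interval end points (end of SLT, end of TFI) and one hazard per interval as inputs. The median has a closed form because the cumulative hazard is linear within each interval. The function names, interval lengths, and hazard values are illustrative assumptions, not the parameters used in our calculations.

```python
import math


def _check(cut_points, hazards):
    if len(hazards) != len(cut_points) + 1:
        raise ValueError("need one hazard per interval, including the open-ended last one")
    if hazards[-1] <= 0:
        raise ValueError("the final hazard must be positive for the median to exist")


def cumulative_hazard(t, cut_points, hazards):
    """H(t) for a piecewise-constant hazard; survival is S(t) = exp(-H(t))."""
    _check(cut_points, hazards)
    if t < 0:
        raise ValueError("t must be non-negative")
    total, start = 0.0, 0.0
    for end, rate in zip(list(cut_points) + [math.inf], hazards):
        if t <= end:
            return total + rate * (t - start)
        total += rate * (end - start)
        start = end
    raise ValueError("t must be a number")


def median_pfs(cut_points, hazards):
    """Time t at which H(t) = ln 2, i.e. S(t) = 0.5."""
    _check(cut_points, hazards)
    target = math.log(2)
    total, start = 0.0, 0.0
    for end, rate in zip(list(cut_points) + [math.inf], hazards):
        segment = rate * (end - start)
        if total + segment >= target:
            # H is linear within this interval, so solve for t directly.
            return start + (target - total) / rate
        total += segment
        start = end


# Hypothetical example: 6 months of SLT, a 1-month TFI, then IT.
cuts = [6.0, 7.0]
control = [0.02, 0.05, math.log(2) / 6]  # IT-phase median of 6 months
with_it = [0.02, 0.05, math.log(2) / 9]  # IT-phase median of 9 months
print(median_pfs(cuts, control), median_pfs(cuts, with_it))
```

Setting the first two hazards to zero reduces the model to a pure shift of the IT-phase median by the SLT and TFI durations. Nonzero hazards during SLT and the TFI shorten the median measured from the start of SLT.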
We evaluated various treatment strategies and endpoints currently used in consolidation trials and examined the effect of the duration of second-line therapy on power and sample size requirements. Selecting the appropriate patient population and choosing the endpoint are the two major challenges in designing consolidation trials that permit valid comparisons with historical estimates. We recommend that the individual intervals, namely time on second-line therapy, treatment-free interval, and time on investigational therapy, be reported in future trials so that historical estimates can be obtained and used to design single-arm consolidation trials. Informative, unbiased comparisons with the results of other single-arm Phase II studies will depend on greater uniformity of SLT.