The results of single-arm trials are typically interpreted relative to data from historical control subjects, past study participants with similar characteristics to the study population in question. The appropriateness of using historical control subjects is highly dependent on the endpoint that is being used (ie, the metric of success), as well as the patient population that is being studied. For example, a single-arm phase II trial may be appropriate in a disease setting for which there are no active therapies and for which the metric of success is a high rate of marked tumor shrinkage (ie, response rate [RR] according to Response Evaluation Criteria in Solid Tumors [RECIST]). In other words, a drug meeting this endpoint may be worthy of further investigation, because there is sufficient confidence that the historical control subjects had essentially no RECIST responses with observation alone.
In most cases, however, we are comparing a new drug to historical control subjects who were treated with some type of active therapy, or we are using endpoints with much greater historical variability, such as the proportion of patients who are progression free at an arbitrary time point or time-to-event endpoints (progression-free survival [PFS] or overall survival). In such cases, the validity of conclusions from single-arm trials based on historical control subjects is limited by two classic epidemiological factors: selection bias and confounding. Selection bias refers to the phenomenon that current study participants may differ from historical control subjects in ways that affect the outcome of interest. Differences that might bias toward a positive result include baseline patient factors, such as younger age or better performance status; baseline disease factors, such as smaller tumor burden or less aggressive tumor biology; or provider factors, such as size and other characteristics of the treating centers. The same factors, if different in the opposite direction, would bias toward a negative result. Korn et al. (10) validated this concept by identifying a number of patient-specific and trial-specific prognostic variables (performance status, visceral metastases, sex, and exclusion for brain metastases) that influenced the 1-year overall survival rate in phase II trials of metastatic melanoma and by showing that between-trial variability could be essentially eliminated by controlling for these variables.
Confounding refers to the phenomenon that current study participants may have a different (better or worse) outcome than historical control subjects because of factors during the treatment period that are unrelated to the quality of the intervention. For example, if drug X is actually no better than the standard of care but patients receive better supportive care during the treatment period, the results with respect to the primary endpoint may appear better than those of historical control subjects. The impact of supportive care should not be underestimated: it was recently shown that overall survival was statistically significantly longer among patients receiving early palliative care along with chemotherapy for non–small cell lung cancer (11). Another potential confounder is the availability of subsequent effective treatments, as exemplified by the recent success with v-raf murine sarcoma viral oncogene homolog B1 (BRAF) inhibitors in BRAF-mutated melanoma. Patients with BRAF mutations who are initially treated with therapies other than BRAF inhibitors would be expected to have longer overall survival than prior historical control subjects because of the subsequent benefit from the BRAF inhibitor. Patients without BRAF mutations are obviously not representative of the overall population, and the survival model proposed by Korn et al. (10) for screening new agents did not evaluate this important covariate, making the model invalid in the era of targeted therapy for melanoma. The impacts of selection bias and confounding are impossible to quantify when making comparisons with historical control subjects and can even be difficult to identify, because most published reports of single-arm trials do not clearly specify the historical data that were used to formulate the null hypotheses (12). Acknowledging the shortcomings of historical control subjects, some investigators now conduct single-arm phase II trials that include a simultaneous but smaller control arm, such that both arms are compared with historical control subjects but not with each other (because of inadequate statistical power). Although this approach is somewhat reassuring if the control arm and the historical control subjects have similar outcomes, the precision is not sufficient to ensure comparability (13), and it is unclear what to do if the outcomes differ markedly.
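The precision problem with a small simultaneous control arm can be made concrete numerically. The sketch below is a minimal Python illustration, with hypothetical trial numbers not drawn from any cited study: it computes an approximate 95% confidence interval for an outcome rate observed in a 20-patient control arm.

```python
from math import sqrt

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wald confidence interval for a binomial proportion."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical small control arm: 6 of 20 patients progression free at 6 months.
low, high = wald_ci(6, 20)
print(f"observed rate 30%, 95% CI {low:.0%} to {high:.0%}")  # roughly 10% to 50%
```

With only 20 control patients, the interval spans most plausible historical control rates, so observing a point estimate similar to the historical value provides only weak reassurance of comparability.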
The use of historical control subjects leads to a high risk of “false positives”: single-arm phase II trials that appear promising but are followed by negative randomized phase III trials. Although examples can be found in all tumor types, there is perhaps no better collective example than the field of advanced pancreatic cancer. In the last decade, eight drugs were studied in combination with gemcitabine in single-arm phase II trials and found to be promising compared with historical control subjects who received gemcitabine alone (14). Definitive phase III trials, however, have been disappointingly negative for every one of these combinations (22), leading to no appreciable change in the standard of care over this period. The only exception, erlotinib, was never studied in combination with gemcitabine in a dedicated phase II trial, and the benefit of the combination in the phase III trial (a 0.33-month increase in overall survival compared with gemcitabine alone) is of dubious clinical relevance and has never been replicated (30). These nine phase III trials in pancreatic cancer also demonstrate the problem of variability among historical control subjects: the median overall survival for patients receiving gemcitabine alone ranged between 5.4 and 7.2 months, despite the trials being conducted in very similar patient populations. Walter et al. (31) have pointed out that acute myeloid leukemia is another disease in which progress has been impeded by a high false-positive rate in phase II trials and have proposed that randomization is key to solving this problem.
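The mechanism behind these false positives can be illustrated with a small simulation. The sketch below is a minimal Python example with hypothetical design parameters: a single-arm trial's success threshold is calibrated to a null response rate taken from historical controls, and the realized false-positive rate is then estimated when the historical data understate the true control rate.

```python
import random

def single_arm_positive(n: int, true_rate: float, threshold: int) -> bool:
    """One simulated single-arm trial: positive if responses reach the threshold."""
    responses = sum(random.random() < true_rate for _ in range(n))
    return responses >= threshold

def false_positive_rate(n: int, true_rate: float, threshold: int,
                        sims: int = 20_000) -> float:
    """Fraction of simulated trials declared positive for an inactive drug."""
    random.seed(0)  # reproducible
    return sum(single_arm_positive(n, true_rate, threshold)
               for _ in range(sims)) / sims

# Hypothetical design: 40 patients, historical (null) response rate of 20%,
# success declared at >= 13 responses (about a 5% one-sided type I error).
print(false_positive_rate(40, 0.20, 13))  # near the nominal 0.05
# If historical controls understated the true control rate (25% rather than 20%),
# the same design's false-positive rate inflates severalfold.
print(false_positive_rate(40, 0.25, 13))
```

A 5-percentage-point error in the historical rate, of the kind produced by selection bias or confounding, turns a nominally 5% design into one with a false-positive rate several times higher.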
Because single-arm and randomized phase II trials are rarely conducted simultaneously or sequentially for the same drug or combination, investigators have used simulation techniques to compare the two designs. Tang et al. (32) simulated and compared error rates in single-arm vs randomized phase II trials, using both statistical models and individual patient data from a large phase III trial in colorectal cancer. For single-arm trials, they found that random and systematic variation in historical control data could increase the type I (false positive) error rate by two- to fourfold. They also found that the statistical power of single-arm trials was sensitive to unanticipated factors, such as the selection of patients from high-volume vs mid- or low-volume treatment centers. In a similar type of study, our group (33) resampled data from a large phase III trial in renal cell cancer to simulate and compare various phase II designs and endpoints based on a computed tomography scan at 6 weeks. We found that randomized phase II designs with a continuous change-in-tumor-size endpoint had greater predictive power for the known phase III result than a conventional single-arm design. Stewart et al. (34) used survival times from patients with non–small cell lung cancer to simulate randomized trials of hypothetical novel therapies that quintupled or doubled survival in only the 10% of patients who express a specific target, with no effect on the remaining 90% of patients. They found that randomized trials with a large number of unselected patients would incorrectly conclude that the drug had no benefit, whereas randomized trials with a small number of patients selected for the target would correctly conclude that the drug had a benefit in those patients. Although Stewart et al. (34) did not simulate any single-arm designs, one implication of their work is that single-arm phase II trials could easily detect drug benefit if the patient population is preselected for the drug’s target, as illustrated by examples () of first-in-class targeted drugs that have shown remarkably high RRs in phase Ib or phase II studies (35). The promise of such drugs is evident even in comparison with historical control subjects, and single-arm phase II studies would adequately demonstrate their benefit in these biomarker-defined populations or subpopulations. We would caution, however, that the relevant target of a drug, or a reliable predictive biomarker, is often unknown while the drug is being developed and studied in clinical trials (eg, sorafenib, originally developed as a raf inhibitor). There are many examples of drugs for which selection of patients based on a therapeutic target has not resulted in dramatic RRs suggestive of clinical efficacy, for example, fms-related tyrosine kinase 3 (FLT3) inhibitors for acute myeloid leukemia patients harboring activating FLT3 mutations (39
[Table: First-in-class targeted drugs that have resulted in high response rates in phase Ib or phase II trials]