While retrospective validation may be acceptable as a marker validation strategy in selected circumstances, the gold standard for predictive marker validation continues (appropriately) to be a prospective RCT. Several designs have been proposed and utilized in the field of cancer biomarkers for validation of predictive markers. We classify these designs broadly into three categories: targeted or enrichment designs; unselected or all-comers designs, which can further be categorized into sequential testing strategy designs and marker-based designs; and hybrid designs. We discuss the salient features of these designs, along with pertinent examples.
Targeted or Enrichment Designs
This design is based on the paradigm (when there is compelling preliminary evidence) that not all patients will benefit from the study treatment under consideration, but rather that the benefit will be restricted to a subgroup of patients who express (or do not express) a specific molecular feature. Consequently, all patients are screened for the presence or absence of a marker or a panel of markers, and only those with (or without) certain molecular features are included in the trial. This design therefore results in a stratification of the study population, with a goal of understanding the safety, tolerability, and clinical benefit of a treatment in the subgroup of the patient population defined by a specific marker status.
An enrichment design strategy of enrolling only human epidermal growth factor receptor 2 (HER2)–positive patients demonstrated that trastuzumab (Herceptin; Genentech, South San Francisco, CA) combined with paclitaxel after doxorubicin and cyclophosphamide significantly improved disease-free survival (DFS) among women with surgically removed HER2-positive breast cancer.18
The combined analysis using data from both phase III trials with over 1,600 patients in each of the control and treatment arms provided 90% power to detect a 25% reduction in the hazard rate for DFS. In this case, the enrichment strategy clearly succeeded in identifying a subgroup of patients who received a significant benefit from this therapy. However, subsequent analyses have raised the possibility of a beneficial effect of trastuzumab in a more broadly defined patient population than that defined in the two phase III trials.19,20
While multiple possible explanations exist, two questions were left unanswered because only biomarker-defined subgroups were included in the two phase III clinical trials: whether trastuzumab therapy may benefit a potentially larger group than the approximately 20% of patients defined as HER2 positive in these two trials, and whether HER2 status is assessed reproducibly by local versus central testing.21
Two key lessons may be learned from the HER2/trastuzumab example regarding the appropriateness of targeted or enrichment designs. First, before the launching of any trial, particularly one with an enrichment design strategy, assay reproducibility and accuracy must be well established. Second, there should be compelling preliminary evidence to suggest that patients with or without that marker profile do not benefit from the treatments in question. As a general guideline, targeted designs are appropriate when therapies have modest absolute benefit in the unselected population, but cause significant toxicity; when in the absence of selection, therapeutic results are similar whereby a selection design (even if incorrect) would not hurt; and when an unselected design is ethically impossible based on previous studies.
Unselected or All-Comers Designs
In this design, all patients meeting the eligibility criteria (which do not include the status of a biomarker characteristic) are entered into the trial. We note that the ability to provide adequate tissue may be an eligibility criterion for these designs, but not the specific biomarker result. These designs can be broadly classified into sequential testing strategy designs, marker-based designs, or hybrid designs, which are differentiated from each other by the protocol-specified approach to the prespecified type I and type II error rates (influencing sample size), analysis plans (including a single hypothesis test, multiple tests, or sequential tests), and randomization schema. The key features of these designs, along with examples of clinical trials that have utilized them, are discussed.
Sequential testing strategy designs.22,23
Sequential testing designs are similar in principle to a standard RCT design with a single primary hypothesis that is tested either in the overall population first and then in a prospectively planned subset, or in the marker-defined subgroup first and then in the entire population if the subgroup analysis is significant. The first approach is recommended in cases where the experimental treatment is hypothesized to be broadly effective and the subset analysis is ancillary. The latter (also known as the closed testing procedure) is recommended when there are strong preliminary data to support that the treatment effect is strongest in the marker-defined subgroup, and that the marker has sufficient prevalence that the power for testing the treatment effect in the subgroup is adequate. Both these approaches appropriately control for the type I error rates associated with multiple testing. A modification to this approach, taking into account potential correlation arising from testing the overall treatment effect and the treatment effect within the marker-defined subgroup, has also been proposed.24
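The subgroup-first (closed testing) strategy can be illustrated with a minimal sketch. The function name and the use of already-computed P values as inputs are assumptions made for illustration; the sketch shows only the gatekeeping logic, not any trial's actual analysis plan.

```python
def subgroup_first_testing(p_subgroup, p_overall, alpha=0.05):
    """Fixed-sequence sketch: the overall population is tested only if the
    marker-defined subgroup test is significant at level alpha.

    Because the second test is never reached unless the first one rejects,
    both tests can use the full alpha without inflating the overall
    type I error rate.
    """
    subgroup_significant = p_subgroup < alpha
    # The overall test is gated by the subgroup result.
    overall_significant = subgroup_significant and (p_overall < alpha)
    return {"subgroup": subgroup_significant, "overall": overall_significant}


# Hypothetical P values, for illustration only.
print(subgroup_first_testing(p_subgroup=0.01, p_overall=0.03))
print(subgroup_first_testing(p_subgroup=0.20, p_overall=0.001))
```

In the second call the overall comparison is not declared significant despite its small P value, because the gatekeeping subgroup test did not reject.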
The approach of first testing in the subgroup defined by marker status has been implemented in the ongoing US-based phase III trial testing cetuximab in addition to infusional fluorouracil, leucovorin, and oxaliplatin as adjuvant therapy in stage III colon cancer (N0147). While the trial has now been amended to accrue only patients with KRAS–wild-type tumors, approximately 800 patients with KRAS-mutant tumors have already been enrolled. In this trial, the primary analysis will be conducted at the 0.05 level in the patients with wild-type KRAS. A sample size of 1,035 patients with wild-type KRAS per arm will result in 515 total events, providing 90% power to detect a hazard ratio of 1.33 for this comparison using a two-sided log-rank test at a significance level of 0.05. If this subset analysis is statistically significant at P = .05, then the efficacy of the regimen in the entire population will also be tested at level 0.05, as this is a closed testing procedure. This comparison using all 2,910 patients will have 90% power to detect a hazard ratio of 1.27 comparing the two treatment arms, based on a total of 735 events.
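The event counts and hazard ratios quoted for N0147 can be checked approximately with the Schoenfeld formula for the log-rank test. This is a sketch under stated assumptions (1:1 allocation, proportional hazards, no adjustment for interim analyses); the trial's own calculation may have used a different method, and the function name is ours.

```python
from math import log, sqrt
from statistics import NormalDist


def logrank_power(events, hazard_ratio, alpha_two_sided=0.05):
    """Approximate power of a two-sided log-rank test (Schoenfeld formula),
    assuming equal allocation to the two arms."""
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    # Expected standardized effect: sqrt(d)/2 * |log HR| for 1:1 allocation.
    z_effect = sqrt(events) / 2 * abs(log(hazard_ratio))
    return NormalDist().cdf(z_effect - z_alpha)


# Wild-type KRAS comparison: 515 events, hazard ratio 1.33.
print(round(logrank_power(515, 1.33), 2))  # 0.9
# Entire population: 735 events, hazard ratio 1.27.
print(round(logrank_power(735, 1.27), 2))  # 0.9
```

Both comparisons come out at approximately 90% power, in line with the figures stated for the trial.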
Marker-based designs.
Designs that fall under this classification are the marker-by-treatment-interaction design and the marker-based strategy design. A formal comparison of these two designs in the setting of a binary marker is discussed.
The marker-by-treatment-interaction design uses the marker status as a stratification factor (ie, assumes that the overall population can be split into marker-defined subgroups) and randomly assigns patients to treatments within each marker subgroup. This is similar to conducting two independent RCTs, one in each marker-based subgroup, except that both are conducted under one large RCT umbrella. However, this design differs from a single large RCT in four essential characteristics: only patients with a valid marker result are allowed to be randomized, the sample size is prospectively specified separately within each marker-based subgroup, the randomization is stratified by marker status, and this design is clearly a prospective (and a definitive) marker validation trial.
The marker-based strategy design, on the other hand, randomly assigns patients to have their treatment either based on or independent of the marker status. A downside of this design is that it fundamentally includes patients treated with the same regimen on both the marker-based and the non–marker-based arms, resulting in a significant overlap (driven by the prevalence of the marker) in the number of patients receiving the same treatment regimen in both arms. As a consequence, the overall detectable difference in outcomes between the two arms is reduced (depending on the marker prevalence), thus resulting in a comparatively larger trial.
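The dilution of the detectable difference in the marker-based strategy design can be sketched numerically. The model below is an assumption made for illustration: a binary response end point, a benefit confined to marker-positive patients who receive the experimental regimen on the marker-based arm, and standard treatment for everyone else. The response rates are hypothetical, and the sample-size formula is the usual normal approximation for two proportions.

```python
from statistics import NormalDist


def per_arm_sample_size(rate_a, rate_b, alpha_two_sided=0.05, power=0.90):
    """Normal-approximation sample size per arm for comparing two proportions."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha_two_sided / 2) + z.inv_cdf(power)
    variance = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    return z_sum ** 2 * variance / (rate_b - rate_a) ** 2


def strategy_design_rates(prevalence, base_rate, benefit):
    """Expected response rates on the non-marker-based and marker-based arms.

    Assumed model (hypothetical): only marker-positive patients on the
    marker-based arm receive the experimental regimen and gain `benefit`;
    all other patients respond at `base_rate`.
    """
    non_marker_arm = base_rate
    marker_arm = base_rate + prevalence * benefit
    return non_marker_arm, marker_arm


BASE_RATE, BENEFIT = 0.30, 0.20  # hypothetical values, illustration only

for prevalence in (1.0, 0.6, 0.3):
    control, strategy = strategy_design_rates(prevalence, BASE_RATE, BENEFIT)
    n = per_arm_sample_size(control, strategy)
    print(f"prevalence={prevalence:.1f}  "
          f"difference between arms={strategy - control:.2f}  "
          f"n per arm ~ {n:.0f}")
```

Under these assumptions the difference between arms shrinks in proportion to the marker prevalence, so the required sample size grows roughly with the inverse square of the prevalence.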
An example of the marker-by-treatment-interaction design is the recently activated phase III biomarker validation study, also known as MARVEL (Marker Validation of Erlotinib in Lung Cancer), of second-line therapy in patients with advanced non–small-cell lung cancer (NSCLC) randomly assigned to pemetrexed or erlotinib (Fig 1). Motivated by the need to obtain prospective evidence to address the conflicting results from several retrospective analyses regarding the predictive role of epidermal growth factor receptor (EGFR) amplification by fluorescence in situ hybridization (FISH) in the setting of treatment with chemotherapy and EGFR tyrosine kinase inhibitors, and by the fact that EGFR FISH represents a poor prognostic factor in untreated NSCLC patients, MARVEL is designed to prospectively evaluate the clinical predictive utility of EGFR copy number as measured by FISH in advanced NSCLC.4,27–33
The FISH status of the patient will be assessed before randomization (to ensure an adequate number of patients with FISH+ and FISH− status) in a central location (to address issues regarding standardization of assay techniques and the reproducibility and interpretability of assay results). The primary comparisons will be progression-free survival of patients treated on the erlotinib arm compared with the pemetrexed arm within the FISH+ and FISH− subgroups (286 [30%] FISH+; 670 [70%] FISH−). An overview of the statistical hypothesis testing framework for the MARVEL trial is included in the online-only Appendix.
Fig 1. MARVEL (Marker Validation for Erlotinib in Lung Cancer) trial design. Each cycle of treatment is 21 days. Stratification factors: ECOG performance status, gender, smoking status, histology, best response to prior chemotherapy. EGFR, epidermal growth factor receptor.
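Randomization within marker-defined subgroups, as used in the marker-by-treatment-interaction design, can be sketched as follows. The permuted-block scheme, block size, and function names are assumptions for illustration; the source does not describe the randomization algorithm used in MARVEL, and the additional stratification factors listed in the figure legend are omitted here.

```python
import random


def make_marker_stratified_randomizer(arms=("erlotinib", "pemetrexed"),
                                      strata=("FISH+", "FISH-"),
                                      block_size=4, seed=0):
    """Return an assign(marker_status) function that randomizes patients in
    permuted blocks, separately within each marker-defined stratum."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    pending = {stratum: [] for stratum in strata}  # unused slots per stratum

    def assign(marker_status):
        # Only patients with a valid marker result can be randomized.
        if marker_status not in pending:
            raise ValueError(f"no valid marker result: {marker_status!r}")
        if not pending[marker_status]:
            # Start a new balanced block for this stratum and shuffle it.
            block = list(arms) * (block_size // len(arms))
            rng.shuffle(block)
            pending[marker_status] = block
        return pending[marker_status].pop()

    return assign


assign = make_marker_stratified_randomizer(seed=1)
for status in ("FISH+", "FISH-", "FISH+", "FISH+", "FISH-", "FISH+"):
    print(status, "->", assign(status))
```

Keeping a separate block sequence for each stratum is what balances the treatment arms within the FISH+ and FISH− subgroups rather than only in the trial as a whole.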
Hybrid Designs
In this design, only a certain marker-defined subgroup of patients is randomly assigned to have their treatment based on their marker status, whereas patients in the other marker-defined subgroups are assigned the standard-of-care treatment(s). This design is powered to detect differences in outcomes only in the marker-defined subgroup that is randomized to treatment choices based on the marker status, similar to an enrichment design strategy. However, unlike the enrichment design, the hybrid design provides additional value: since all patients are screened for marker status to determine whether they are randomly assigned or assigned the standard-of-care treatment(s), it seems prudent to include and collect specimens and follow-up from “all” patients in the trial to allow for future testing for other potential prognostic markers in this population. This design is an appropriate choice when there is compelling prior evidence demonstrating the efficacy of a certain treatment(s) for a marker-defined subgroup, thereby making it unethical to randomly assign patients with that particular marker status to other treatment options.
At least three large phase III marker validation trials have been recently launched with a hybrid trial design: the phase III randomized study of oxaliplatin, leucovorin calcium, and fluorouracil with versus without bevacizumab in patients with resected stage II colon cancer and at high risk for recurrence based on molecular markers (ECOG [Eastern Cooperative Oncology Group] 5202); the TAILORx (Trial Assigning Individualized Options for Treatment) trial designed to evaluate the Oncotype Dx (Genomic Health, Redwood City, CA), a 21-gene recurrence score in tamoxifen-treated patients with breast cancer;34 and the MINDACT (Microarray in Node-Negative Disease May Avoid Chemotherapy) trial for patients with node-negative breast cancer designed to evaluate MammaPrint (Agendia, Amsterdam, the Netherlands), the 70-gene expression profile discovered at the Netherlands Cancer Institute.35,36
ECOG (Eastern Cooperative Oncology Group) 5202 trial design. Cycle of treatment, 2 days every 2 weeks, repeat for 12 cycles.
TAILORx (Trial Assigning Individualized Options for Treatment) trial design. RS, risk score.
MINDACT (Microarray in Node-Negative Disease May Avoid Chemotherapy) trial design.
In ECOG 5202, patients with stage II colon cancer deemed to be at a high risk for recurrence after surgery (estimated 5-year survival rate of 60%) based on two molecular markers are randomly assigned to one of two treatment arms, whereas patients deemed to be at a low risk for recurrence after surgery (5-year survival rate estimate of 90%) will not receive any adjuvant therapy. The trial is expected to enroll approximately 3,500 patients. With 250 eligible patients per year accrued for 5.5 years (1,375 eligible high-risk patients randomly assigned; 3,438 eligible patients total) and 3 years of follow-up, there is at least 88% power to detect a 37% difference in median DFS (absolute difference of 5%, from 80% to 85%, at 3 years) using a one-sided stratified log-rank test at the 0.025 level, with stratification on stage and microsatellite instability status. Unfortunately, this design will not allow for a determination of the benefit of bevacizumab in the low-risk stratum; however, if the outcomes in the absence of treatment are as favorable as predicted in that group, no postsurgical therapy would generally be recommended.
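The correspondence between the 37% difference in median DFS and the absolute 3-year difference (80% to 85%) can be reproduced with a short calculation. The sketch assumes exponentially distributed DFS times, which the source does not state explicitly; variable and function names are ours.

```python
from math import log


def median_ratio_from_survival(s_control, s_experimental):
    """Ratio of median survival times (experimental / control), assuming
    exponential survival S(t) = exp(-rate * t).

    The median is ln(2) / rate and rate = -ln(S(t)) / t, so the ratio of
    medians is ln(S_control) / ln(S_experimental) at any common time t.
    """
    return log(s_control) / log(s_experimental)


ratio = median_ratio_from_survival(0.80, 0.85)
print(f"ratio of median DFS: {ratio:.3f}")  # 1.373, ie, a 37% difference

# Accrual arithmetic quoted for the randomized high-risk group.
high_risk_randomized = 250 * 5.5  # 1,375 patients
high_risk_share = high_risk_randomized / 3438
print(f"high-risk share of eligible patients: {high_risk_share:.0%}")  # 40%
```

The second figure indicates that roughly 40% of eligible patients are expected to fall in the randomized high-risk group.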
The TAILORx and MINDACT trials aim to validate two new prognostic and possibly predictive tools for breast cancer, and are the first to test the feasibility of a prognostic tool in clinical application. The TAILORx trial was activated in 2006 and will accrue approximately 10,000 women with hormone receptor–positive, lymph node–negative breast cancer.37–40
This study uses a noninferiority design (null hypothesis of no difference), with a larger type I error (one-sided, 10%) and a smaller type II error (5%) than usual, to determine whether patients with a recurrence score between 11 and 25 derive benefit from adjuvant chemotherapy. A decrease in the 5-year DFS rate from 90% with chemotherapy to 87.0% or lower on hormone therapy alone would be considered unacceptable. All patients in the TAILORx trial will provide blood samples for banking and future research. The MINDACT trial is expected to enroll 6,000 patients with node-negative breast cancer.36
The primary test will be for group A of the MINDACT trial design, where a null hypothesis rate of 92% for the 5-year distant metastases–free survival will be tested. With 6,000 patients overall, this group is expected to enroll 672 patients, thus providing 80% power to reject the null hypothesis if the true distant metastases–free survival is 95% using a one-sided test at the 0.025 significance level. Several other tests to compare the overall efficacy between the two prognosis methods within the randomized cohort as well as comparisons between treatment alternatives (based on a subsequent randomization) within specific subgroups will be performed. The 70-gene profile used in MINDACT previously demonstrated a 97% negative predictive value across all disease types, and a 38% positive predictive value, thus decreasing the likelihood of undertreatment of patients, but having a higher chance of overtreatment.41
The interlaboratory reproducibility of this gene profile and its discriminative ability was also independently validated in retrospective analyses.42,43
Complete genome arrays will be performed on all patient samples collected in this trial.
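The link between the predictive values quoted for the 70-gene profile and the risks of undertreatment and overtreatment can be made concrete with a small sketch. The counts below are hypothetical, chosen only so that the computed values match the quoted 97% and 38%; they are not data from any study.

```python
def predictive_values(true_pos, false_pos, true_neg, false_neg):
    """Positive and negative predictive values from classification counts."""
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical counts per 100 high-risk and 100 low-risk classifications.
ppv, npv = predictive_values(true_pos=38, false_pos=62,
                             true_neg=97, false_neg=3)

# Patients classified low risk who do have an event would be undertreated.
print(f"NPV = {npv:.0%}, potential undertreatment = {1 - npv:.0%}")
# Patients classified high risk who would not have an event are overtreated.
print(f"PPV = {ppv:.0%}, potential overtreatment = {1 - ppv:.0%}")
```

A high negative predictive value therefore limits undertreatment among patients classified as low risk, whereas a low positive predictive value implies that many patients classified as high risk would receive chemotherapy they may not need.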
In summary, these three trials are examples of prospective validation trials utilizing a hybrid trial design. They have the potential to substantially change the management of patients in the future, allowing for a better risk assessment and improved individualized treatment.