The validity of preclinical studies of candidate therapeutic agents has been questioned given their limited ability to predict the fate of those agents in clinical development, in part because of design flaws and reporting bias. In this study, we examined this issue in depth by conducting a meta-analysis of animal studies investigating the efficacy of the clinically approved kinase inhibitor, sorafenib. MEDLINE, Embase, and BIOSIS databases were searched for all animal experiments testing tumor volume response to sorafenib monotherapy in any cancer published until April 20, 2012. We estimated effect sizes from experiments assessing changes in tumor volume and conducted subgroup analyses based on prespecified experimental design elements associated with internal, construct, and external validity. The meta-analysis included 97 experiments involving 1761 animals. We excluded 94 experiments due to inadequate reporting of data. Design elements aimed at reducing internal validity threats were implemented only sporadically, with 66% reporting animal attrition and none reporting blinded outcome assessment or concealed allocation. Anticancer activity against various malignancies was typically tested in only a small number of model systems. Effect sizes were significantly smaller when sorafenib was tested against either a different active agent or combination arm. Trim and fill suggested a 37% overestimation of effect sizes across all malignancies due to publication bias. We detected a moderate dose-response in one clinically approved indication, hepatocellular carcinoma, but not in another approved malignancy, renal cell carcinoma, or when data were pooled across all malignancies tested. In support of other reports, we found that few preclinical cancer studies addressed important internal, construct and external validity threats, limiting their clinical generalizability. Our findings reinforce the need to improve guidelines for the design and reporting of preclinical cancer studies.
Several recent reports have raised questions about the reproducibility of preclinical studies in general, and of preclinical cancer studies in particular (1, 2). Various commentators have posited different reasons for this problem. One is the use of small sample sizes, which lead to high random variation of results (3). Another is the use of methods, like non-blinding of outcome assessment, that introduce validity threats (4). Such practices, when coupled with publication bias, would lead to especially exaggerated and non-reproducible estimates of effect sizes.
In a previous report, we investigated design, reporting and outcomes for tumor volume experiments contained within preclinical studies of the anticancer drug, sunitinib (5). We found that design practices that reduce the threat of bias and random variation, such as outcome assessment blinding, were rarely implemented. Our analysis suggested that effect sizes were inflated when sunitinib was tested in only one model system, but not necessarily when researchers failed to implement measures like randomization. We also reported evidence that effect sizes may have been overestimated by 45% due to publication bias. Last, we found little relationship between sunitinib properties in preclinical studies, and those that have been observed in humans. For instance, we were unable to detect a dose-response effect when we pooled all studies, and all malignancies responded significantly to sunitinib in preclinical studies, even though not all malignancies have responded in clinical trials.
However, this previous study concentrated on a single drug, and was not based on a prespecified protocol. The extent to which our findings generalize to other preclinical cancer studies is unclear. To explore the generalizability of our findings, we undertook a nearly identical systematic review of all preclinical monotherapy studies for the drug, sorafenib. Sorafenib (Nexavar®, BAY 43-9006) is, like sunitinib, a multikinase inhibitor. It is approved for use in renal cell carcinoma (RCC) (6), hepatocellular carcinoma (HCC) (7), and thyroid cancer (8). This drug was chosen because it has been tested against a large number of different malignancies, many of which have been tested in trials as well. It also provides years' worth of follow-up preclinical testing. In this report, we survey experimental design parameters for sorafenib preclinical studies and examine whether design features correlated with estimated effect sizes.
Studies were identified by searching MEDLINE, Embase, and BIOSIS databases on April 20, 2012 for trials using these search terms: “sorafenib,” or “Nexavar,” or variations on “BAY 43-9006,” and MeSH terms including “preclinical,” “animals,” or search terms for commonly used animal models. The full search strategy, adapted from Hooijmans et al. (9) and de Vries et al. (10) can be found in Supplementary Text 1. A PRISMA flow diagram (11) can be found in Supplementary Figure 1.
Inclusion criteria at the study level were a) primary data, b) full-text articles, c) English language, d) investigated anticancer efficacy, e) measured a treatment effect in live, non-human animals, and f) administered sorafenib monotherapy as a comparator or treatment arm. For inclusion at the experiment level and quantitative meta-analysis, additional criteria were g) tested sorafenib against a control arm (e.g. vehicle), h) measured variance as standard deviation of the mean (SD) or standard error of the mean (SEM), i) evaluated drug effect on primary tumor volume, and j) measured baseline tumor volume plus at least one common time measurement between control and treatment arms.
We extracted experimental design elements derived from a prior systematic review of preclinical research guidelines (12). These included the following at the study level: the names of the authors, the month and year of publication, the country associated with the corresponding author, the funding source(s), the conflict of interest statement, the molecular or physiological rationale for the experiment, and the authors’ recommendation of sorafenib in the clinical setting as monotherapy or in combination therapy.
The design elements extracted at the experiment level included the following: sample size of each arm, randomization of treatment allocation, blinding of outcome assessment, inferential statistical test used, removal of animals throughout experiment, species, sex, weight, age, strain, immune status, disease modeled, disease stage, method of tumor initiation, transplantation site, transplantation size, transplant identity, drug administration schedule, administration method, days from disease induction to treatment, day discontinuation of treatment, and presence of combination and comparator arms. We captured the treatment effect at baseline, day 14 (or closest point), last time point, last common time point between control and treatment arms, as well as the standard deviation of the mean (SD) or standard error of the mean (SEM).
We extracted experiments that measured treatment effect as tumor volume (usually in units of mm³) or used a reasonable proxy for tumor volume, including the following: caliper measurement (mm³), tumor weight (mg), optical measurement (photons·s⁻¹), and fold change in tumor volume between control and treatment arms. To account for the heterogeneity of scales in tumor volume measurements, we calculated effect sizes as standardized mean differences (SMDs) using Hedges’ g. The Hedges’ g statistic measures effect sizes in terms of the variability observed within individual studies, producing a standardized measure of treatment effect and allowing for the combination of results (13). We extracted, but did not analyze, survival information due to the paucity of data reported (Table 1).
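As an illustration of the statistic described above, the following minimal sketch computes Hedges’ g and its approximate variance from group summary data, using the standard textbook formulas. It is not the code used in this study (the analysis was run in dedicated meta-analysis software); the function names and the example numbers are hypothetical. Because some experiments report SEM rather than SD, a conversion helper is included.

```python
import math


def sem_to_sd(sem, n):
    """Convert a standard error of the mean to a standard deviation."""
    return sem * math.sqrt(n)


def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Return (g, variance of g) for a treatment arm versus a control arm.

    A negative g means the treated tumors were smaller than controls.
    """
    df = n_t + n_c - 2
    # Pooled within-group standard deviation
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    d = (mean_t - mean_c) / pooled_sd
    # Small-sample correction factor that turns Cohen's d into Hedges' g
    j = 1 - 3 / (4 * df - 1)
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    return j * d, j ** 2 * var_d


# Hypothetical experiment: tumor volumes in mm^3, 8 animals per arm
g, var_g = hedges_g(mean_t=400, sd_t=150, n_t=8, mean_c=900, sd_c=250, n_c=8)
print(f"g = {g:.3f}, SE = {math.sqrt(var_g):.3f}")
```

Because g is expressed in units of pooled standard deviation, experiments reporting volume, weight, or optical signal can be placed on a common scale.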
Graphical data were extracted using GraphClick digitizer software (Arizona Software). All extractions were performed by NM. After piloting, we identified eight extraction items that were prone to high inter-rater variability (Supplementary Text 2). These items were double-coded independently by JM and reconciled by discussion.
We calculated the effect sizes as SMDs using Hedges’ g and 95% confidence intervals. We used the statistical software OpenMeta[Analyst] (14) to calculate pooled effect sizes using the DerSimonian and Laird random-effects model (15) and to assess heterogeneity of data via I² statistics (16). Statistical significance was set at a p-value of 0.05; because of multiplicities, all testing was exploratory. We did not prospectively register a protocol for this meta-analysis; however, except where noted, hypothesis testing was prespecified and we followed the methods used in our previous meta-analysis of the anticancer drug, sunitinib (5).
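For readers unfamiliar with the pooling step, the sketch below shows the usual method-of-moments form of the DerSimonian and Laird random-effects estimator together with the I² heterogeneity statistic. It is a simplified illustration under the standard formulas, not a reproduction of the OpenMeta[Analyst] implementation; the function name, return keys, and input values are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool effect sizes with the DerSimonian-Laird random-effects model.

    effects   -- per-experiment effect sizes (e.g. Hedges' g)
    variances -- their within-experiment variances
    """
    k = len(effects)
    w = [1 / v for v in variances]                 # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-experiment variance, truncated at zero
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1 / (v + tau2) for v in variances]     # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "tau2": tau2,
        "i2_percent": i2,
    }


# Hypothetical effect sizes and variances from three experiments
print(dersimonian_laird([-1.2, -2.8, -3.5], [0.30, 0.45, 0.60]))
```

When experiments disagree by more than their sampling error would predict, tau² grows, the weights become more equal, and the confidence interval widens.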
For experiments testing multiple doses of sorafenib, we averaged the outcomes and created a pooled effect size for each experiment (except in dose-response analyses). Publication bias was evaluated using funnel plots (17) with Duval and Tweedie’s trim and fill method of estimating missing studies and adjusting the point estimate (18) using Comprehensive Meta-Analysis software (19). Funnel plots take advantage of the fact that smaller studies are prone to large random variation. Under-representation of smaller studies showing non-positive effects can suggest publication bias.
For the dose-response curves, we only evaluated experiments using continuous dosing schedules and measuring tumor volume at a fixed time point of 14 days after onset of dosing (±3 days allowed, Supplementary Figure 2B). We excluded all other experiments from this analysis because dosing schedule and time point choice would be expected to correlate with effect sizes.
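The eligibility rule for the dose-response analysis can be expressed as a simple filter. The sketch below is illustrative only; the dictionary keys (`schedule`, `measurement_day`) and the example records are hypothetical and do not correspond to the actual extraction sheet.

```python
def eligible_for_dose_response(experiment, target_day=14, window=3):
    """Keep experiments with continuous dosing measured at day 14 (+/- 3 days)."""
    on_schedule = experiment["schedule"] == "continuous"
    in_window = abs(experiment["measurement_day"] - target_day) <= window
    return on_schedule and in_window


# Hypothetical extracted records
experiments = [
    {"id": "A", "schedule": "continuous", "measurement_day": 14},
    {"id": "B", "schedule": "continuous", "measurement_day": 21},
    {"id": "C", "schedule": "intermittent", "measurement_day": 14},
    {"id": "D", "schedule": "continuous", "measurement_day": 11},
]
kept = [e["id"] for e in experiments if eligible_for_dose_response(e)]
print(kept)  # ['A', 'D']
```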
Our search captured 105 studies containing 191 experiments assessing tumor volume response to sorafenib monotherapy. Although all studies were included in qualitative analyses (Table 1), only 65 studies containing 97 experiments were included in our meta-analysis. A total of 94 experiments (49%) were excluded because they did not report elements required for our quantitative tests (e.g. sample size, a measure of dispersion or baseline tumor volume), 44 of which were reported in a single study (20). The 97 included experiments used 1761 animals, 96% of which were mice. The mean duration of experiments used in quantitative meta-analysis was 21 days (range 3–55 days). Anticancer efficacy experiments relied heavily on human xenograft models of disease (95%). Average sample size in each experiment was 7.74 and 7.78 in treatment and control arms, respectively (range 3–20 for both). There was high heterogeneity of data across all studies (I²=79%) (21). Most studies (98%) were published after sorafenib had received regulatory approval, reflecting the continued exploration of activity against various malignancies and delays in the publication process.
Several bodies have called for implementation of a suite of practices in preclinical testing, including randomization and blinding (22–26). A systematic review of preclinical design guidelines identified a consensus set of practices for improving clinical generalizability (12). We examined the reported implementation of these practices in sorafenib studies.
Experimental practices aimed at minimizing bias and strengthening causal inferences (internal validity) varied. Concealed allocation and blinded outcome assessment were never reported as used. Of the experiments in our sample, 10% evaluated dose-response (≥3 doses) of sorafenib. Moreover, 66% of experiments addressed, exhaustively or briefly, the attrition of animals during experiments.
Design elements aimed at maximizing the correspondence between experimental setup and clinical scenarios (construct validity) were also variable. Key parameters identified in preclinical study design guidelines include matching age of animals to patients, matching sex, matching stage of disease, and confirming mechanism of action. Studies relied disproportionately on younger, female animals with less advanced disease (Table 2)—variables that probably do not match most clinical scenarios. However, most studies (79%) probed for molecular or physiological evidence of mechanism of action.
Many guidelines recommend replication in different models of disease to rule out the possibility that treatment effects are attributable to idiosyncrasies in model systems (external validity) (12). We first constructed an index of external validity by counting the number of species and models used per malignancy. Hepatocellular carcinoma and high-grade glioma experiments employed the greatest variety of species (n=2) and models (n=2), yet as described below, these malignancies did not show significantly smaller effect sizes (Fig. 1A). Next (and on an ad hoc basis), we created a new index of external validity by pooling all graft studies for a malignancy type and determining the number of different cell lines used to test activity. We examined whether malignancies that were tested in more model systems tended to show more modest effect sizes. For most malignancies, sorafenib was tested against one or two tumor cell lines. Although malignancies in which sorafenib was tested using a single representative cell line (n=6) seemed to show larger effect sizes than those tested in more than one model, this difference was not significant (Fig. 1B).
Effect sizes, pooled by indication, are shown in Figure 2. The mean effect size across all malignancies was −2.396 (95% CI, −2.682, −2.110). Of the 97 included experiments, 76.3% reached statistical significance (p<0.05, Supplementary Figure 3) and 61% of papers concluded by recommending clinical testing. Each pooled malignancy demonstrated significant anticancer activity, except pancreatic cancer (n=2) and squamous cell carcinoma (n=2). Though a quantitative analysis was not possible at this time, malignancies that are known to respond clinically (e.g. RCC (6, 27)) did not suggest substantially larger preclinical pooled effect sizes (Fig. 2) than malignancies that show minimal clinical response to sorafenib monotherapy (e.g. melanoma (28), non-small-cell lung carcinoma (29–31), ovarian (32, 33), and breast cancer (34, 35)). Sorafenib is also approved for thyroid carcinoma, but tumor volume experiments in this indication were missing measurements of baseline tumor volume or variance and were excluded from quantitative meta-analysis.
We performed an exploratory analysis examining whether any experimental design parameters described above corresponded with smaller, and thus likely more realistic, effect sizes. With respect to internal validity practices, there were no clear trends between design and effect size (Fig. 3A). For construct validity, experiments that tested sorafenib as monotherapy against an active comparator, or against a sorafenib-containing combination, showed significantly smaller effect sizes than experiments testing against only an inactive control arm (Fig. 3B). Furthermore, experiments that reported a conflict of interest showed significantly smaller effect sizes than those declaring no conflict of interest (Fig. 3B).
One possible explanation for the preponderance of strongly positive studies was publication bias. We constructed funnel plots and performed trim and fill analysis to explore this possibility in our pooled sample. The asymmetric funnel plot for all malignancies (Fig. 4A) suggests the presence of publication bias. Trim and fill analysis suggests a 37% overestimation of effect size across all malignancies, with an adjusted SMD estimate of −1.753 (95% CI −2.073, −1.433) compared to an unadjusted SMD of −2.396 (95% CI −2.682, −2.110). We performed similar analyses for HCC (Fig. 4B) and RCC (Fig. 4C)—the two malignancies for which we had the greatest volume of experiments (n=29 and n=17, respectively). HCC showed no significant suggestion of publication bias. The analyses suggested a 25% overestimation of effect size for RCC, although this was not significant and limited by sample size.
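The 37% figure is consistent with expressing the unadjusted pooled estimate relative to the trim-and-fill adjusted estimate. The short sketch below shows that arithmetic; it assumes this is how the percentage was derived (the calculation is not spelled out in the text), and the function name is hypothetical.

```python
def percent_overestimation(unadjusted_smd, adjusted_smd):
    """Percent by which the unadjusted SMD exceeds the adjusted SMD in magnitude."""
    return (abs(unadjusted_smd) / abs(adjusted_smd) - 1) * 100


# Pooled estimates for all malignancies reported above
print(round(percent_overestimation(-2.396, -1.753)))  # 37
```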
In our sample, ten percent of the experiments performed dose-response curves (≥3 doses) for sorafenib. There is some evidence suggesting dose-response effects in human beings, although these studies are not decisive (36–39). However, preclinical studies that tested dose-response internally demonstrated an effect (Figs. 5B and 5C). As a simple measure of the ability of pooled preclinical studies in our sample to demonstrate causal relationships, we tested whether we could detect dose-response effects when all eligible experiments (n=91), as well as the indications with the largest volume of experiments (HCC, n=28 and RCC, n=17), were pooled. Using a standardized time point of 14 days after sorafenib administration and restricting our dataset to continuous (daily) dosing schedules, we did not observe a dose-response relationship across all malignancies (p=0.09) (Fig. 5A). Considering the subsets of approved malignancies, HCC experiments showed a moderate dose-response (p<0.001, R²=0.35) (Fig. 5B) while RCC experiments did not (p=0.86) (Fig. 5C).
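To make the R² values above concrete, the following sketch fits an unweighted straight line of effect size against dose and reports the proportion of variance explained. This is a deliberate simplification: the regression model behind Figure 5 is not specified in the text and may have been weighted, so the code should be read as an assumption-laden illustration. The function name and the dose and effect values are hypothetical.

```python
def linear_fit(doses, effects):
    """Ordinary least-squares fit of effect size on dose.

    Returns (slope, intercept, r_squared).
    """
    n = len(doses)
    mean_x = sum(doses) / n
    mean_y = sum(effects) / n
    sxx = sum((x - mean_x) ** 2 for x in doses)
    syy = sum((y - mean_y) ** 2 for y in effects)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(doses, effects))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


# Hypothetical doses (mg/kg/day) and effect sizes (Hedges' g)
slope, intercept, r2 = linear_fit([10, 30, 60, 100], [-1.1, -2.4, -2.0, -3.6])
print(f"slope = {slope:.4f} per mg/kg, R^2 = {r2:.2f}")
```

A negative slope indicates larger tumor volume reductions at higher doses; an R² near zero would correspond to the absence of a detectable pooled dose-response.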
Preclinical efficacy experiments are typically cited to justify the initiation of clinical trials. However, choice of models, experimental setup, and reporting practices may limit their clinical generalizability. Our report builds on previous findings that experimental practices in preclinical cancer research do not adequately attend to the effects of random variation, bias, and non-publication.
As in our previous study (5), we found that many experiments are reported so poorly that they are almost impossible to interpret. For instance, more than a third of our original sample could not be included in the meta-analysis due to missing information on sample size, measure of dispersion or baseline tumor volume. Similarly, we found limited attention to internal validity threats, as indicated by the general non-implementation and reporting of design elements such as concealed allocation, blinded outcome assessment, and animal attrition. With respect to construct validity, researchers generally relied on young, immunocompromised and female mice, as they had for sunitinib (5). In this study and in the previous one, however, we did not detect exaggerated effect sizes in studies harbouring internal or construct validity threats, as others have (40, 41). Our analysis suggested that experimental effect sizes were significantly smaller when sorafenib was tested against active comparators and/or combination arms—a finding that would be consistent with bias (since the purpose of such studies is to demonstrate that another drug or combination is even more effective) but that contradicts our sunitinib results (5).
Our analysis is suggestive of biases in reporting of sorafenib preclinical studies. First, 76.3% of experiments were statistically significant, a proportion that is surprising given that the mean sample size per arm was small (n=7.76). Second, as with sunitinib, almost all malignancies demonstrated statistically significant activity; the two that did not trended strongly towards positivity. If all malignancies respond to sorafenib, the value for trial planning of the type of in vivo testing used in experiments analyzed here is doubtful. Third, our trim and fill analysis suggested an overestimation of effect size due to publication bias across malignancies that is similar to what we observed for sunitinib (5). Our analysis did not find a strong dose-response relationship for pooled malignancies, nor for one of the malignancies currently approved for monotherapy. Fourth, similar to the results of our preclinical sunitinib report (5), our external validity analysis suggested that testing in more model systems (the number of grafts, species, and models used) results in more realistic (i.e. smaller) pooled effect sizes within malignancies, although this trend was non-significant. Last, our findings do not suggest a clear relationship between preclinical effect sizes and clinical outcomes across malignancies, though a more formal analysis including clinical effect sizes is still needed.
Our analysis and inferences about effect sizes have many limitations, not least of which is the hazard of combining effect sizes from an extremely heterogeneous sample of experiments. For example, toxicity of drug at high doses may have dampened the ability of xenograft studies to detect dose-responses (although this fails to explain why they are consistently reported internally) and cell line heterogeneity may mask dose-effects within malignancies. Although the administered dose was always reported, the lack of reported drug exposure data threatens the construct validity of the experiments in our sample. Second, our analysis was focused on only in vivo experiments embedded within preclinical reports. It is possible that tumor volume curves should only be interpreted in the context of additional mechanistic, pharmacokinetic, or in vitro experiments within reports. Third, our systematic review relied on what was published and reported in studies. It is possible studies may have used methodologies, like randomization, and not reported them. Fourth, our systematic review concerns a single drug. Although our findings are consistent with observations reported elsewhere (4, 5, 42, 43), it is possible more robust dose-response curves, or a better relationship between clinical and preclinical effects, would be apparent with other drugs. Fifth, our analysis only captures studies published before April 2012; however, we believe that extending our results to the current date would not reveal vastly different treatment outcomes or quality of reporting. Last, our results may reflect problems with using human xenograft tumor growth curves to make clinical inferences, particularly for a drug like sorafenib, which shows cytostatic properties in clinical trials (44, 45).
Data suggest that tumor shrinkage may not be a suitable efficacy endpoint for sorafenib; time-to-event data, including prolongation of progression-free survival (PFS) and overall survival (OS), indicate benefits from tumor stabilization despite modest radiographic response in pivotal trials (6, 7). However, our analysis does not include survival data due to the scarcity of reported survival curves in our sample (Table 1). We also note that mean effect sizes observed in sorafenib preclinical studies were much greater than those observed for sunitinib (−2.396 [95% CI, −2.682, −2.110] vs. −1.826 [95% CI, −2.052, −1.601], respectively), a drug associated with high objective response rates in trials. While the usefulness and reproducibility of the xenograft model have been questioned (46–48), many support its use in preclinical studies (49–51).
Our findings contribute to the literature on preclinical design and reporting in cancer, and reinforce our exploratory analysis for sunitinib (5). They also suggest that researchers—and physicians prescribing approved drugs off-label—should be cautious about using tumor curves to infer clinical value. Many xenograft studies do not adhere to basic tenets of reporting, such as describing sample sizes; few implement widely discussed design elements like blinding. It might be objected that cancer represents a “hard endpoint”—and hence is less susceptible to bias than other disease realms. However, measurements of tumor volume, assessments of moribundity for survival curves, or choices of whether to include anomalous measurements involve judgment, and just as in clinical research, such judgments can be affected by bias. We encourage the cancer research community to pursue a sustained discussion of guidelines for experimental setup and results reporting in preclinical research. We also encourage referees to scrutinize manuscripts for reporting. Above all, our findings suggest possibilities for reducing some of the burden and cost associated with unsuccessful translation efforts.
Funding statement: This work was funded by CIHR (EOG111391).
We thank Amanda Hakala and Benjamin Carlisle for their assistance with this manuscript.
Conflict of interest statement: The authors disclose no potential conflicts of interest.