The proportion of proposed new treatments that are ’successful’ is of ethical, scientific, and public importance. We investigated how often new, experimental treatments evaluated in randomized controlled trials (RCTs) are superior to established treatments.
Our main question was: “On average how often are new treatments more effective, equally effective or less effective than established treatments?” Additionally, we wanted to explain the observed results, i.e. whether the observed distribution of outcomes is consistent with the ’uncertainty requirement’ for enrollment in RCTs. We also investigated the effect of choice of comparator (active versus no treatment/placebo) on the observed results.
We searched the Cochrane Methodology Register (CMR) 2010, Issue 1 in The Cochrane Library (searched 31 March 2010); MEDLINE Ovid 1950 to March Week 2 2010 (searched 24 March 2010); and EMBASE Ovid 1980 to 2010 Week 11 (searched 24 March 2010).
Cohorts of studies were eligible for the analysis if they met all of the following criteria: (i) consecutive series of RCTs, (ii) registered at or before study onset, and (iii) compared new against established treatments in humans.
RCTs from four cohorts of RCTs met all inclusion criteria and provided data from 743 RCTs involving 297,744 patients. All four cohorts consisted of publicly funded trials. Two cohorts involved evaluations of new treatments in cancer, one in neurological disorders, and one for mixed types of diseases. We employed kernel density estimation, meta-analysis and meta-regression to assess the probability of new treatments being superior to established treatments in their effect on primary outcomes and overall survival.
The distribution of effects seen was generally symmetrical in the size of difference between new versus established treatments. Meta-analytic pooling indicated that, on average, new treatments were slightly more favorable both in terms of their effect on reducing the primary outcomes (hazard ratio (HR)/odds ratio (OR) 0.91, 99% confidence interval (CI) 0.88 to 0.95) and improving overall survival (HR 0.95, 99% CI 0.92 to 0.98). No heterogeneity was observed in the analysis based on primary outcomes or overall survival (I2 = 0%). Kernel density analysis was consistent with the meta-analysis, but showed a fairly symmetrical distribution of new versus established treatments indicating unpredictability in the results. This was consistent with the interpretation that new treatments are only slightly superior to established treatments when tested in RCTs. Additionally, meta-regression demonstrated that results have remained stable over time and that the success rate of new treatments has not changed over the last half century of clinical trials. The results were not significantly affected by the choice of comparator (active versus placebo/no therapy).
Society can expect that slightly more than half of new experimental treatments will prove to be better than established treatments when tested in RCTs, but few will be substantially better. This is an important finding for patients (as they contemplate participation in RCTs), researchers (as they plan design of the new trials), and funders (as they assess the ’return on investment’). Although we provide the current best evidence on the question of expected ’success rate’ of new versus established treatments consistent with a priori theoretical predictions reflective of ’uncertainty or equipoise hypothesis’, it should be noted that our sample represents less than 1% of all available randomized trials; therefore, one should exercise the appropriate caution in interpretation of our findings. In addition, our conclusion applies to publicly funded trials only, as we did not include studies funded by commercial sponsors in our analysis.
Random allocation to different groups to compare the effects of treatments is used in fair tests to find out which among the treatment options is preferable. Random allocation is only ethical, however, if there is genuine uncertainty about which of the treatment options is preferable. If a patient or their healthcare provider is certain which of the treatments being compared is preferable they should not agree to random allocation, because this would involve the risk that they would be assigned to a treatment they believed to be inferior. Decisions about whether to participate in randomized trials are made more difficult because of the widespread belief that new treatments must inevitably be superior to existing (standard) treatments. Indeed, it is understandable that people hope that this will be the case. If this was actually so, however, the ethical precondition of uncertainty would often not apply.

This Cochrane methodology review addresses this important question: “What is the likelihood that new treatments being compared to established treatments in randomized trials will be shown to be superior?” Four cohorts of consecutive, publicly funded, randomized trials, which altogether included 743 trials that enrolled 297,744 patients, met our inclusion criteria for this review.

We found that, on average, new treatments were very slightly more likely to have favorable results than established treatments, both in terms of the primary outcomes targeted and overall survival. In other words, when new treatments are compared with established treatments in randomized trials we can expect slightly more than half will prove to be better, and slightly less than half will prove to be worse than established treatments. This conclusion applies to publicly funded trials as we did not include studies funded by commercial sponsors in our analysis.
The results are consistent with the ethical preconditions for random allocation - when people are enrolled in randomized trials, the results cannot be predicted in advance as there is genuine uncertainty about which of the treatments being compared in randomized trials will prove to be superior.
When uncertainty exists about which among alternative treatments is preferable for a given health problem, a randomized controlled trial (RCT) is often proposed to resolve this dilemma. Indeed, Sir Austin Bradford Hill, one of the fathers of modern clinical trials methodology, suggested that when we are uncertain about the relative value of one treatment over another, it is time for a trial (Bradford Hill 1963).
Recognition of the importance of uncertainty in the design of RCTs has reached the status of a principle. This ’uncertainty principle’ states that patients should be enrolled in such trials only if there is substantial uncertainty (Atkins 1966; Bradford Hill 1963; Bradford Hill 1987; Edwards 1998; Freedman 1987; Peto 1998; Weijer 2000) about which of the trial treatments would be preferable. Some authors prefer the term equipoise to refer to the required uncertainty before the trial is conducted (Djulbegovic 2001; Weijer 2000). Although not identical, these concepts are similar (Lilford 2001); the main distinction relates to the locus of uncertainty, i.e. ’whose uncertainty is morally relevant’: researchers (clinical equipoise), community (community equipoise), patients (’indifference principle’), or patients and researchers (’uncertainty principle’) (Djulbegovic 2007; Djulbegovic 2011). In this review we will use the term ’uncertainty’ to refer to this fundamental scientific and ethical requirement for conducting randomized trials. This principle is important for this review, because we have previously hypothesized that there is a predictable relationship between the uncertainty, that is, the moral principle, upon which randomized trials are based, and the ultimate outcomes of randomized trials (Djulbegovic 2007; Djulbegovic 2009). That is, if the uncertainty requirement is observed, we would expect, over time, to find no significant difference between the proportion of randomized trials that favor new treatments and those that favor established treatments (Djulbegovic 2001; Djulbegovic 2008; Kumar 2005a; Soares 2005).
In 1997, one of the authors of this review, Chalmers asked “What is the prior probability of a proposed new treatment being superior to established treatments?” (Chalmers 1997). He referred to a small number of reports suggesting that new treatments assessed in randomized trials were just as likely to be inferior as they were to be superior to the established treatments. Since then, several additional studies have been reported which are relevant to this question (Colditz 1989; Djulbegovic 2000a; Djulbegovic 2008; Joffe 2004; Kumar 2005a; Machin 1997; Soares 2005). In an analysis of published reports of trials, Djulbegovic et al (Djulbegovic 2000a) found that, within research sponsored by government and not-for-profit organizations, the results showed a fairly even split: 44% of randomized trials favored established treatments while 56% of the trials favored new treatments. However, when research was sponsored by for-profit organizations, new treatments were significantly favored over established treatments (74% versus 26%; P = 0.004). The source of sponsorship appears to be associated with estimates of treatment effects (Lexchin 2003). Other research has indicated that methodological quality can also affect estimates of treatment effects (Gluud 2006).
In assessing whether new or established treatments are favored on average, an important potential bias that needs to be heeded relates to the fact that investigators frequently fail to publish their research findings (Dickersin 1997; Hopewell 2009; Krzyzanowska 2003). This, in itself, may not create a problem if research is randomly unpublished. In that case, there would simply be less information available, but that information would be unbiased (Dickersin 1997). However, failure to publish is not a random event; rather publication is dramatically influenced by the direction and strength of research findings (Dickersin 1997; Hopewell 2009). If one were to examine a distribution of outcomes from the cohorts of all trials from inception regardless of publication status, this would constitute an unbiased assessment of the effects of new versus established treatments. That is, the unbiased assessment of comparison of new versus established treatment (’treatment success’) can only be done if one has accurate data on both the numerator (estimates of treatment effect comparing new versus established treatment) and denominator (list of trials/comparisons) that were performed (Djulbegovic 2002).
Indeed, research over the past decade has identified several factors that may affect a trial’s results and their availability - publication rate (Dickersin 1992; Dickersin 1997; Hopewell 2007; Hopewell 2009), methodological quality (Altman 1994; Altman 1995; Higgins 2011; Schulz 1995; Wood 2008), and the choice of control interventions (Djulbegovic 2000c; Djulbegovic 2001; Djulbegovic 2003; Mann 2012). To address the question posed by Chalmers (Chalmers 1997), therefore, we need to try to account for all these factors.
We should note here that in this review we are not focused on the related but distinct question: “How often are new treatments, assessed in systematic reviews, better than established treatments” (Djulbegovic 2000b). Rather, we undertook a systematic review to identify studies that had assembled a set of consecutively conducted randomized trials (’cohort’) - by funder or trial registry or other mechanism that would avoid publication bias - and analyzed all trials irrespective of publication status. We will refer to the trials within these cohorts as the ’component trials’.
Cohort analyses of consecutive series of randomized trials, registered at onset, which compared new versus established treatments in humans were eligible for analysis. All other types of studies were ineligible for this review. Originally, we planned to include cohort analyses containing non-randomized component studies or component studies comparing two or more new treatments, but it soon became apparent that randomized comparisons of new versus established treatments could not be analyzed separately from non-randomized comparisons; therefore, these studies were not considered in our analysis. Likewise, all other studies in which the impact of publication bias could not be excluded were deemed ineligible for this review. Typically, these were studies that relied only on published studies (Lathyris 2010; Yanada 2007), and hence there was no way to ensure that the cohorts of studies were not affected by publication bias (unless the authors clearly took into consideration the results of unpublished studies in their report, in which case these studies would have been eligible for our review).
We also excluded the studies which were based on information from research protocols and other resources (e.g. studies that are based on trials’ registers) but which did not report outcomes on superiority of new versus established treatments (Chan 2004). Cohorts based on equivalence and non-inferiority trials would have also been ineligible and, in fact, the RCTs in all four cohorts that were analyzed in this review (see below) were all superiority trials.
We analyzed data on primary outcome and overall survival from randomized trials of any type of disease/intervention. Data on primary outcomes were chosen according to the authors’ definitions in published articles. Because we did not have the protocols available for three out of four cohorts, we did not attempt to verify whether the definitions of primary outcomes changed between the studies’ original design and their final reports (Dwan 2011).
We originally planned to assess the impact of the methodological quality on all results. However, we could extract data for one cohort only (Djulbegovic 2008), which detected no effect of methodological quality on the results. The study by Dent and Raftery (Dent 2011) also detected no impact of the quality on the results but these data were not available for pooling in this analysis. Given that all cohorts included in our review came from large public funders, in which trial protocol development passes several rigorous reviews (Soares 2004), we assumed the impact of methodological quality in other cohorts was also negligible and therefore did not formally include it in this review. However, we did evaluate the effect of comparator (active versus no therapy/placebo) on the distribution of the results.
Types of outcome measures included the direction, size and statistical significance of the results for the primary outcome and the most important outcomes (i.e. survival) reported in the cohort analyses (excluding surrogate outcomes). An outcome was considered to be a primary outcome if it met the following criteria, in hierarchical order: (i) it was explicitly defined as a primary or main outcome by the trialists, (ii) it was the outcome used for power and sample size calculation, or (iii) it was listed as the main outcome in the trial’s objectives.
We searched the following databases without time or language limits to identify relevant published cohort analyses of RCTs: Cochrane Methodology Register (CMR) 2010, Issue 1, part of The Cochrane Library (searched 31 March 2010); MEDLINE Ovid 1950 to March Week 2 2010 (searched 24 March 2010), and EMBASE Ovid 1980 to 2010 Week 11 (searched 24 March 2010). See Appendix 1 for the search strategies.
We also checked the reference lists of all included studies in this review, checked a Cochrane Review on publication bias (Hopewell 2009) for references that may have provided the appropriate comparison of new versus established treatments, and contacted people we deemed knowledgeable about our review question to try to obtain additional studies.
Given the large number of hits produced by the literature search, we divided the list of retrieved studies into manageable parts among several authors (BD, AK, PG, RP, HS, GV) who screened the titles and abstracts of all retrieved records to identify reports that should definitely be excluded. Every record that was not rejected was assessed by at least two of the authors independently to see if it was likely to meet the inclusion criteria. We finally had a conference call to review the list of all eligible studies. The final list of included studies was created through the discussion on the conference call held on 20 July 2011.
Our final data set consisted of four cohorts (see Results below). Data from two cohorts were already extracted for separate publications (Dent 2011; Djulbegovic 2008). Two authors (AK, TR) independently extracted data for the remaining cohorts (Johnston 2006; Machin 1997). Global checking of data extraction was performed by the first author (BD) and a statistician (RP) before data were ready for the final analysis.
We used the following criteria to assess the methodological quality of included studies:
See Table 1 for a summary of the study characteristics.
For each component study, we extracted the following data (see Table 1):
Originally, we planned to perform an assessment of methodological quality of individual studies for those domains that are known to affect results due to a variety of possible biases and random errors listed below, with a plan to assess the following domains to determine risk of bias:
We planned to use the following domains to address the issue of random error:
The same methodological approach has been used previously (Djulbegovic 2008; Soares 2004), paying particular attention to those factors that are shown to affect the results of randomized trials: publication bias (Hopewell 2009), methodologic quality (Higgins 2011; Juni 1999), and the choice of control intervention (Djulbegovic 2000c; Mann 2012).
The quality assessment from the appraisal of cohorts and individual component trials would have been combined in our overall quality evaluation, in order to provide judgments on the extent of potential bias that may have affected the results. As there is no agreed upon method for doing this, we hoped to approach this in two ways:
Unfortunately, as explained above, we could extract data for one cohort only (Djulbegovic 2008), in which no effect of methodological quality on the results was detected. Dent and Raftery (Dent 2011) also reported no impact of the methodological quality on their results, but these data were not available for the analysis performed herein.
Originally, we planned to report the success rate in the following ways:
Unfortunately, most subgroup analyses were not possible because of the limited domains and data of the available cohorts. In this review, we report the quantitative pooling (meta-analysis) of data according to primary outcomes and overall survival. Arguably, this is the least biased approach to answering the question of “how often new treatments are superior to established ones” (Chalmers 1997). Comparing effects of treatments according to statistical significance is based on ’vote counting methods’, in which effect size, number of patients, and time-to-event data are not taken into account (Hedges 1985). Assessing treatment success by attempting to deduce the original trialists’ views about the superiority of new versus established treatments, while useful, is also potentially fraught with bias, because such assessments cannot exclude the potential conflicts of interest of the original investigators (Als-Nielsen 2003). We used three methods to pool the data from the four cohorts of studies:
Our aim was to obtain a description of the empirical distribution of the primary outcome of a trial. We therefore estimated this distribution using Gaussian kernel density methods, which are based on smoothing a histogram given a predefined bandwidth, with the potential to give a different weight to each trial (similar to meta-analysis) (Silverman 1986). The choice of bandwidth is a compromise between obtaining a smooth density and identifying variations in the distribution’s peaks (e.g. multimodality). We constructed the probability density function for the odds or hazard ratios on the log scale using a two-stage adaptive weighted kernel density estimation (Gisbert 2003). We calculated the weights, following the random-effects assumption, as the inverse of the sum of the within-study variance for a trial plus the between-study variance Tau2 for all trials. We performed the estimation using the computational software Maple (version 14) (Maple 2009).
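The weighted kernel density idea can be illustrated with a minimal fixed-bandwidth sketch in Python (not the two-stage adaptive Maple estimation used in the review; the bandwidth value and the example inputs are arbitrary assumptions for illustration):

```python
import math

def weighted_kde(log_ratios, variances, tau2, grid, bandwidth=0.1):
    """Weighted Gaussian kernel density over log hazard/odds ratios.
    Weights follow the random-effects assumption: the inverse of the
    within-study variance plus the between-study variance tau^2.
    The fixed bandwidth is an assumed value, not the review's choice."""
    w = [1.0 / (v + tau2) for v in variances]
    sw = sum(w)

    def kernel(u):
        # standard Gaussian kernel
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

    density = []
    for x in grid:
        d = sum(wi * kernel((x - y) / bandwidth)
                for wi, y in zip(w, log_ratios))
        density.append(d / (sw * bandwidth))  # normalize to integrate to 1
    return density
```

A symmetrical density centered near a log ratio of 0 would correspond to the “fairly symmetrical distribution” reported in the Results.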
We used hazard or odds ratios (HRs/ORs) to summarize the overall study data, expressed with 99% confidence intervals (CIs). We used the more conservative 99% CIs to decrease the chance of random error. We used a random-effects model. The unit of analysis was the comparison within each trial. For studies with continuous outcome data, we converted the results into dichotomous data using standard methods (Higgins 2011). For trials/reports that included more than one new treatment group, we used the following approach: to avoid issues with correlations and double counting, we first excluded multi-arm comparisons from the main analysis and selected only the one comparison associated with the largest effect size favoring experimental treatments. This way we purposefully provide the best-case scenario in terms of treatment success favoring new treatments. In a sensitivity analysis, however, we included all comparisons (see Effects of methods). As can be seen, the results of these two analyses differ only marginally. Note that we could not apply other methods suggested in the literature for conducting meta-analysis with multiple comparisons, such as splitting a control arm to match corresponding experimental arms (Higgins 2011), because we did not have data on the number of patients and events in all cohorts.
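Random-effects pooling on the log scale can be sketched as follows. This is a minimal DerSimonian-Laird implementation; the review does not name its exact estimator, so that choice is an assumption, with z = 2.576 giving the 99% CIs used throughout:

```python
import math

def dersimonian_laird(log_ratios, ses, z=2.576):
    """Random-effects pooling of log hazard/odds ratios.
    Returns (pooled ratio, 99% CI lower, 99% CI upper, I^2 in %)."""
    # fixed-effect (inverse-variance) weights and pooled mean
    w = [1.0 / se ** 2 for se in ses]
    sw = sum(w)
    mu_fe = sum(wi * y for wi, y in zip(w, log_ratios)) / sw

    # Cochran's Q and the DerSimonian-Laird tau^2 estimate
    q = sum(wi * (y - mu_fe) ** 2 for wi, y in zip(w, log_ratios))
    df = len(log_ratios) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)

    # random-effects weights, pooled estimate, and I^2
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    mu = sum(wi * y for wi, y in zip(w_re, log_ratios)) / sum(w_re)
    se_mu = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return (math.exp(mu),
            math.exp(mu - z * se_mu),
            math.exp(mu + z * se_mu),
            i2)
```

With Q below its degrees of freedom, tau^2 is truncated at zero and I^2 is 0%, matching the absence of heterogeneity reported in the Results.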
Using the year of publication as a co-variate, we performed a meta-regression to assess the change in treatment effect over time.
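A minimal sketch of the idea behind such a meta-regression (weighted least squares of the log effect on publication year; a fixed-effect simplification of what would in practice be a random-effects meta-regression, with illustrative inputs only):

```python
import math

def meta_regression_slope(years, log_ratios, ses):
    """Inverse-variance weighted least-squares slope of the log
    effect estimate on publication year, with its standard error.
    A slope near zero indicates results stable over time."""
    w = [1.0 / se ** 2 for se in ses]
    sw = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, years)) / sw
    ybar = sum(wi * y for wi, y in zip(w, log_ratios)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, years))
    sxy = sum(wi * (x - xbar) * (y - ybar)
              for wi, x, y in zip(w, years, log_ratios))
    slope = sxy / sxx
    se_slope = math.sqrt(1.0 / sxx)
    return slope, se_slope
```

If the log effect estimates do not drift with year, the fitted slope is close to zero, which is the pattern of temporal stability the review reports.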
Trials which used placebo/no therapy as a comparator (see Table 1 for comparator) were included in the main analysis. The rationale for this is that placebo does not replace established treatments but, in fact, always represents an ’add-on’ intervention to the standard treatments (Senn 2000). As the mechanism for violation of the ’uncertainty principle’ relates to the choice of inferior comparator (Djulbegovic 2000c; Mann 2012), we also performed a sensitivity analysis by evaluating the results according to placebo/no therapy versus active control comparisons.
A total of 8792 records were retrieved. Figure 1 shows a flow diagram of all included studies. Table 1 shows the characteristics of the studies. In total, we identified 11 cohorts of RCTs, of which four were eligible for this review. Three papers reported results of smaller cohorts (Joffe 2004; Kumar 2005; Soares 2004) which were all included within a final, large analysis published by Djulbegovic and colleagues (Djulbegovic 2008) and hence were included in this review via this larger cohort. Two other papers were based on published trials only (Lathyris 2010; Yanada 2007) and therefore were excluded from our analysis. Two other cohorts which explored the effect of funding source on study outcome but only included data from published studies were also excluded (Bekelman 2003; Lexchin 2003).
The four eligible cohorts included data from 743 RCTs involving 297,744 patients (Dent 2011; Djulbegovic 2008; Johnston 2006; Machin 1997). Two cohorts addressed evaluation of new treatments in the cancer field (Djulbegovic 2008; Machin 1997), one in neurological disorders (Johnston 2006), and one for mixed types of diseases (Dent 2011). All four cohorts provided data for the primary outcome analysis (Dent 2011: 57 studies, Djulbegovic 2008: 698, Johnston 2006: 24, Machin 1997: 28), while only three provided data for the overall survival analysis (Djulbegovic 2008: 614 studies, Johnston 2006: 20, Machin 1997: 28).
Although the study selection process was not described in the publications of two cohorts that we included in our analysis (Johnston 2006; Machin 1997), it was rather obvious that both reports included all phase III trials whose outcomes the authors evaluated in their respective publications. That is, all four cohorts satisfied a key quality criterion for our analysis: they comprised a set of consecutively conducted randomized trials.
We deemed all cohorts to include high-quality RCTs with low risk for bias (Dent 2011; Djulbegovic 2008; Johnston 2006; Machin 1997). Nevertheless, as explained above, we could not investigate the effect of bias formally in this review. Two publications included a formal assessment of bias and found no impact of potential bias on the results (Dent 2011; Djulbegovic 2008). (See ’Sensitivity analysis’ below regarding the effect of comparator on the results).
Figure 2 and Figure 3 show kernel density estimation of the effects of new treatments compared to established ones for both primary outcomes (see Table 1 for the list of primary outcomes used in the included studies) and overall survival. The analysis according to primary outcomes is considered important as it reflects the original design and the trialists’ ’best bets’ that new treatments may prove to be superior to established ones (see also Discussion), while the analysis according to overall survival relates to pooling data on the outcomes most important to patients. As can be seen, there is a fairly symmetrical distribution of new versus established treatments centered near ’no effect’ (a log hazard ratio of 0), indicating that experimental treatments are about equally likely to be superior or inferior to standard treatments although, on average, new treatments are slightly superior to old ones.
Figure 4 and Figure 5 show the forest plots of estimates for primary outcomes and survival, respectively. New treatments are slightly more favored both in terms of their effect on primary outcomes (hazard ratio (HR)/odds ratio (OR) 0.91, 99% confidence interval (CI) 0.88 to 0.95) and overall survival (HR 0.95, 99% CI 0.92 to 0.98). No heterogeneity in treatment effects was observed in the analysis based on primary outcomes (I2 = 0%) (Figure 4) or survival outcomes (I2 = 0%) (Figure 5).
Table 2 and Table 3 show a meta-regression evaluating the effect of cohort and the year of publication on the stability of results. As can be seen, the results remain stable over time, indicating that new types of treatment tested in randomized controlled trials (RCTs) seem to continue to have about the same probability of being superior to established therapies.
Figure 6 and Figure 7 show kernel density estimation of the effects of new treatments compared to established ones for primary outcomes (see Table 1 for the list of primary outcomes used in the included studies) in trials using active therapy as the established treatment and placebo/no therapy as the established treatment, respectively. As can be seen, there is a fairly symmetrical distribution of new versus established treatments centered near ’no effect’ (a log hazard ratio of 0), indicating that experimental treatments are about equally likely to be superior or inferior to standard treatments although, on average, new treatments are slightly superior to old ones regardless of the comparator treatment used.
Figure 8 shows the forest plot of estimates for the primary outcome according to the type of established treatment used as comparator (active therapy or placebo/no therapy). New treatments are slightly more favored in trials which employed an active comparator (HR/OR 0.92, 99% CI 0.89 to 0.96), while in trials which used placebo/no therapy as a comparator new treatments resulted in an HR of 0.79 (99% CI 0.61 to 1.02). The test of interaction between the two subgroups was, however, not significant (P = 0.13). At the subgroup level, no heterogeneity in treatment effects was observed in the analysis based on primary outcomes in studies which used an active comparator (I2 = 0%). However, in studies which employed placebo/no therapy as a comparator, high heterogeneity in treatment effects was observed in the analysis based on primary outcomes (I2 = 69%) (Figure 8). The heterogeneity substantially decreased (from 69% to 40%) in this subgroup when the UK Health Technology Assessment (HTA) cohort (Dent 2011) was excluded from this analysis. This cohort, which included two true placebo comparators and 13 ’no treatment’ comparisons, evaluated a mixture of clinical and cost-effectiveness endpoints, typically without ’blinding’ patients or providers to patient outcomes; it is therefore not surprising that we observed relatively high inconsistency (I2 = 69%) in this subgroup.
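The test of interaction reported above can be approximated with a standard z-test on the two pooled log-scale subgroup estimates. This is a sketch: the standard errors below are back-calculated from the reported 99% CIs, so the result only approximates the published P = 0.13:

```python
import math
from statistics import NormalDist

def interaction_p(mu1, se1, mu2, se2):
    """Two-sided z-test comparing two pooled subgroup estimates
    on the log scale (a standard test of subgroup interaction)."""
    z = (mu1 - mu2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def se_from_99ci(lower, upper):
    """Back-calculate a log-scale standard error from a 99% CI
    (the CI spans 2 * 2.576 standard errors)."""
    return (math.log(upper) - math.log(lower)) / (2 * 2.576)

# Active comparator:     HR/OR 0.92 (99% CI 0.89 to 0.96)
# Placebo/no therapy:    HR    0.79 (99% CI 0.61 to 1.02)
p = interaction_p(math.log(0.92), se_from_99ci(0.89, 0.96),
                  math.log(0.79), se_from_99ci(0.61, 1.02))
print(round(p, 2))  # ≈ 0.13, close to the reported P value
```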
Table 4 and Table 5 show a meta-regression evaluating the effect of cohort and the year of publication on the stability of results in studies which used an active comparator and a placebo/no therapy comparator, respectively. As can be seen, the results did not change over time when the comparator was an active control. However, when the control was placebo/no therapy, a slight but significant drop in treatment success was observed, most likely due to the trial cohort effect. When the UK HTA cohort was excluded from the analysis, the association became non-significant (Table 6). As alluded to above, this cohort included patients with a variety of health-related problems and a variety of health interventions, which often involved assessing optimal aspects of clinical care and cost-effectiveness. Conceivably, the investigators may have been less uncertain about the superiority of a given clinical strategy (such as the uptake of HIV testing, or the usefulness of testing a change in quality of life; see Characteristics of included studies) in these pragmatic trials (Dent 2011) than about the efficacy of new cancer drugs. Even so, the results are far from predictable in advance: as displayed in Figure 6 and Figure 7, the observed distribution of the treatment effects is fairly symmetrical, with new treatments being only slightly superior to standard ones. Similar results were obtained when based on all comparisons (Appendix 2) (see also Table 7; Table 8; Table 9; Table 10; Table 11).
This comprehensive assessment of comparisons of new, experimental treatments against established therapies in randomized controlled trials (RCTs) shows that, while on average new treatments are associated with a 5% to 10% relative improvement in survival or primary outcomes (Figure 4; Figure 5), the effects seen are generally symmetrically distributed between new and established treatments (Figure 2; Figure 3). This near-symmetry indicates an unpredictability of new treatment effects, and suggests that investigators cannot predict trial results in advance. These results have shown remarkable stability over time (stretching over five decades), and are not influenced by the invention of new treatments or new chemical moieties. This stability is important to note, as many authors believe that results will become more predictable in the era of targeted therapy (Mandrekar 2009). While that is plausible, there is no historical evidence that improved understanding of disease biology leads to greater certainty of effects when treatments are tested in RCTs.
We believe that the observed results are not coincidental, but rather reflect the uncertainty requirement, or clinical equipoise, as a driver of discovery of new therapies as they undergo clinical testing (Djulbegovic 2001; Djulbegovic 2007; Djulbegovic 2009). According to this hypothesis, the higher the level of uncertainty before a RCT is undertaken, the less chance that the investigators will be able to predict the effects of treatment in advance (Djulbegovic 2001; Djulbegovic 2007; Djulbegovic 2009). As a result, sometimes new treatments will be better than standard therapies, sometimes the reverse will be true, and sometimes there will be no difference between two treatments (Djulbegovic 2001; Djulbegovic 2007; Djulbegovic 2009). However, the uncertainty hypothesis needs to be combined with the researchers’ preferences toward one of the alternative treatments (typically, new ones) that are being tested (Djulbegovic 2008). Investigators invest a lot of time and effort in the development and testing of new treatments. They bring their accumulated knowledge into the design of RCTs with the hope of proving that the new treatments will be successful. This probably partly explains why new therapies are, on average, superior to standard therapies. However, if this accumulated knowledge indicated that the proposed experimental treatment was clearly superior to established treatment (i.e. that there was no uncertainty about the competing treatment effects), then such an RCT would probably be impossible on ethical grounds: during the rigorous peer review process that these trials undergo, someone would probably object, at least in the publicly funded trials with which our analysis dealt.
It is this interplay between researchers’ hope that they have developed a treatment that is better than established treatments and the requirement for uncertainty to enroll patients in RCTs that can explain the results we observed (Djulbegovic 2007; Djulbegovic 2009; Djulbegovic 2011). Despite these strong theoretical predictions of the observed results, it should be noted that our sample represents less than 1% of all available randomized trials; one should therefore exercise appropriate caution in interpreting our findings.
We believe that the question asked by one of us almost 15 years ago (Chalmers 1997) can now be reliably answered, at least for treatments tested in publicly funded trials. Society can expect that when new experimental treatments are tested against established treatments in publicly funded RCTs, slightly more than half will prove to be better, and slightly less than half will prove to be worse. As we have discussed elsewhere (Djulbegovic 2008; Djulbegovic 2007; Djulbegovic 2009; Kumar 2005; Soares 2005), this finding represents good news. Achieving higher predictability in the results would likely lead to the collapse of the current RCT system, as most clinicians and patients would refuse randomization (with its typical 50:50 chance of allocation to the successful treatment) if investigators could be, say, 80% or more certain about the effects of the treatments they propose to test.
Our review has some limitations. First, we included only RCTs funded by public agencies. Commercially sponsored trials are believed to have higher success rates, either because industry invests heavily in treatment development and executes trials more meticulously (Fries 2004), or because their seemingly higher success rates derive from biased conduct linked to commercial interests (Gluud 2006; Lexchin 2003). To date, however, all reports on treatment outcomes in industry-sponsored trials have relied solely on published studies, making it impossible to discern the impact of publication bias on the results (Lexchin 2003). Second, we may have missed some eligible cohorts. However, we believe this is unlikely, given our extensive, broad literature search and our experience of investigating this question for almost 15 years; it is therefore improbable that we missed important published reports. Third, we have not addressed the ’efficiency’ of answering the questions, as some RCTs may have been inconclusive (Djulbegovic 2008). Nevertheless, while inconclusive results may represent a waste of resources, they still had about an equal chance of generating results in favor of the experimental therapy (Dent 2011; Djulbegovic 2008). Fourth, the distribution of observed outcomes could have been affected by bias, such as the choice of an inferior or suboptimal established treatment as comparator (Mann 2012), or other types of biases that may plague many randomized trials (Higgins 2011). However, as discussed in the Results section, we believe that all included trials were of high quality, with no evidence of comparator bias or other types of bias. Fifth, we analyzed data according to the year of publication. As there is always a delay between publication and the time when a study was conceived and recruited patients, the year of publication does not necessarily represent the uncertainty about treatment effects at the time the trial was designed.
Sixth, the limited domains and descriptive data in the available cohorts made most of our planned subgroup analyses (public versus commercial funding; specialty area; methodological quality) impossible. Indeed, the majority of the data come from publicly funded trials in oncology. Although the two non-cancer cohorts gave similar results (see Figure 2; Figure 3; Figure 4; Figure 5), we could not fully test the robustness of our conclusions across other disease domains. Finally, this review reflects a search last performed in March 2010. Originally, we planned to report the aggregate data as described in the cohorts of published trials. However, we soon realized that this would not allow us to generate a quantitative assessment of treatment success. We therefore extracted data from all individual trials in each of the four cohorts. This proved a very time-consuming task, with the result that our review reflects the best evidence at the time the search was completed. Nevertheless, as of this writing (August 2012) we are not aware of any new published cohorts of trials comparing the effects of new versus established treatments.
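The stability-over-time finding corresponds to a meta-regression of effect size on publication year. A minimal sketch of such a weighted regression follows; the years, effects, and standard errors below are hypothetical values for illustration, not the review's data:

```python
import numpy as np

# Hypothetical trial-level data: publication year, log(HR), standard error
years = np.array([1965, 1975, 1985, 1995, 2005], dtype=float)
log_hrs = np.array([-0.10, -0.08, -0.11, -0.09, -0.09])
ses = np.array([0.05, 0.04, 0.05, 0.04, 0.03])

# Weighted least-squares meta-regression of effect size on (centred) year,
# with inverse-variance weights
w = 1 / ses**2
X = np.column_stack([np.ones_like(years), years - years.mean()])
beta, *_ = np.linalg.lstsq(X * np.sqrt(w)[:, None],
                           log_hrs * np.sqrt(w), rcond=None)
print(f"slope per year: {beta[1]:+.5f}")  # a slope near zero => stable over time
```

A slope indistinguishable from zero is what "the success rate of new treatments has not changed over the last half century" amounts to in this framework, subject to the publication-delay caveat noted above.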
However, we believe that our results are generalizable, at least to publicly funded trials. This is because a central principle in the evaluation of the effects of new versus established therapies is that, under uncertainty, the investigators’ ’bets’ on the effect of treatment on primary outcomes will not predictably materialize in any individual RCT. That is, a similar distribution of treatment success should be observed regardless of the type of treatment, disease, or choice of primary outcomes. As repeatedly discussed, this applies only to analyses that are not affected by factors such as selection of an inferior comparator, poor methodological quality, or selective publication. Indeed, the requirement for a consecutive series of high-quality randomized trials in which publication and outcome reporting bias is accounted for is key to an accurate evaluation of the effects of new treatments compared with established treatments in randomized trials. As long as these requirements are met, we believe that our results are generalizable to all randomized trials, although further studies are needed to address the distribution of treatment success in commercially sponsored trials.
Society can expect that slightly more than half of new experimental treatments will prove to be better than established treatments when tested in randomized controlled trials (RCTs), but few will be substantially better. This is an important finding for patients (as they contemplate participation in RCTs), researchers (as they plan the design of new trials), and funders (as they assess the ’return on investment’). As our analysis did not include commercially sponsored studies, this conclusion applies to publicly sponsored trials.
Future research should focus on assessing the ’efficiency’ of answering the questions tested in RCTs, as well as the role of commercial sponsorship.
We thank Andy Oxman and Elizabeth Paulson for their help in writing the original protocol. We also thank Mike Clarke for detailed and constructive feedback on the earlier version of the review.
SOURCES OF SUPPORT
Partial support to BD.
Partial support to PPG.
Partial support to IC.
Kernel densities and cumulative kernel densities for all cohorts, using all comparisons for each study with extractable data for the primary outcome, with weights from the random-effects model (Figure 9).
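The random-effects weights referred to above come from a standard DerSimonian-Laird model. As a hedged sketch of this kind of pooling on the log ratio scale (the four cohort-level hazard ratios and standard errors below are hypothetical, not the review's data):

```python
import math

def pool_random_effects(log_effects, ses, z=2.576):
    """DerSimonian-Laird random-effects pooling of log ratio estimates.

    log_effects: per-cohort log(HR) or log(OR) values
    ses:         their standard errors
    z:           2.576 for a 99% confidence interval
    """
    w = [1 / se**2 for se in ses]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, log_effects)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_effects))
    df = len(log_effects) - 1
    c = sw - sum(wi**2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                      # between-cohort variance
    w_star = [1 / (se**2 + tau2) for se in ses]        # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, log_effects)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    lo, hi = pooled - z * se_pooled, pooled + z * se_pooled
    return math.exp(pooled), math.exp(lo), math.exp(hi)

# Hypothetical cohort-level hazard ratios and standard errors
hrs = [0.90, 0.93, 0.88, 0.95]
ses = [0.03, 0.04, 0.05, 0.06]
hr, lo, hi = pool_random_effects([math.log(h) for h in hrs], ses)
print(f"pooled HR {hr:.2f} (99% CI {lo:.2f} to {hi:.2f})")
```

Pooling on the log scale and exponentiating at the end keeps the confidence interval asymmetry correct for ratio measures; when the heterogeneity statistic Q does not exceed its degrees of freedom (I² = 0%, as observed here), the between-cohort variance estimate is zero and the random-effects weights reduce to the fixed-effect weights.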
| Outcome or subgroup title | No. of studies | No. of participants | Statistical method | Effect size |
| --- | --- | --- | --- | --- |
| 1 Primary outcome | 4 | | Odds / Hazard Ratio (Random, 99% CI) | 0.91 [0.88, 0.95] |
| 2 Overall survival | 3 | | Hazard Ratio (Random, 99% CI) | 0.95 [0.92, 0.98] |
| 3 Primary outcome | 4 | | Odds / Hazard Ratio (Random, 99% CI) | 0.88 [0.79, 0.97] |
| 3.1 Active comparator | 4 | | Odds / Hazard Ratio (Random, 99% CI) | 0.92 [0.89, 0.96] |
| 3.2 Placebo/no therapy comparator | 4 | | Odds / Hazard Ratio (Random, 99% CI) | 0.79 [0.61, 1.02] |
| Outcome or subgroup title | No. of studies | No. of participants | Statistical method | Effect size |
| --- | --- | --- | --- | --- |
| 1 Primary outcome | 4 | | Odds / Hazard Ratio (Random, 99% CI) | 0.90 [0.85, 0.94] |
| 2 Overall survival | 3 | | Hazard Ratio (Random, 99% CI) | 0.95 [0.93, 0.97] |
| 3 Primary outcome | 4 | | Odds / Hazard Ratio (Random, 99% CI) | 0.86 [0.77, 0.97] |
| 3.1 Active comparator | 4 | | Odds / Hazard Ratio (Random, 99% CI) | 0.93 [0.89, 0.96] |
| 3.2 Placebo/no therapy comparator | 4 | | Odds / Hazard Ratio (Random, 99% CI) | 0.78 [0.55, 1.09] |
CONTRIBUTIONS OF AUTHORS
BD, AO, HS, GV, and IC drafted the original protocol. PPG helped revise the protocol. PPG and RP screened studies for eligibility. RP, GCL, and BM performed statistical analyses. AK and TR extracted data. BD wrote the first draft of the paper, which was then revised by all authors. All authors approved the final version of the paper.
DECLARATIONS OF INTEREST
The corresponding author (BD) and some of the collaborators (AK, HPS, IC, LD, JR) have published studies that were included in this systematic review.
DIFFERENCES BETWEEN PROTOCOL AND REVIEW
The major difference between the protocol and the review is the introduction of the kernel density analyses to assess the distribution of treatment outcomes. Other differences, which reflect the lack of sufficient data in the included studies, are described in the Methods section above.