
Ann Oncol. 2015 August; 26(8): 1621–1628.

Published online 2015 May 15. doi: 10.1093/annonc/mdv238

PMCID: PMC4511222

Received 2015 January 18; Revised 2015 April 22; Accepted 2015 May 12.

Copyright © The Author 2015. Published by Oxford University Press on behalf of the European Society for Medical Oncology. All rights reserved. For permissions, please email: journals.permissions@oup.com


In recent years, various outcome adaptive randomization (AR) methods have been used to conduct comparative clinical trials. Rather than randomizing patients equally between treatments, outcome AR uses the accumulating data to unbalance the randomization probabilities in favor of the treatment arm that currently is superior empirically. This is motivated by the idea that, on average, more patients in the trial will be given the treatment that is truly superior, so AR is ethically more desirable than equal randomization. AR remains controversial, however, and some of its properties are not well understood by the clinical trials community.

Computer simulation was used to evaluate properties of a 200-patient clinical trial conducted using one of four Bayesian AR methods and compare them to an equally randomized group sequential design.

Outcome AR has several undesirable properties. These include a high probability, which may surprise nonstatisticians, of a sample size imbalance in the wrong direction, wherein many more patients are assigned to the inferior treatment arm, the opposite of the intended effect. Compared with an equally randomized design, outcome AR produces less reliable final inferences, including a greatly overestimated treatment effect difference and smaller power to detect a true treatment difference. This estimation bias becomes much larger if the prognosis of the accrued patients either improves or worsens systematically during the trial.

AR produces inferential problems that decrease potential benefit to future patients, and it may also decrease benefit to the patients enrolled in the trial. These problems should be weighed against its putative ethical benefit. For randomized comparative trials intended to provide confirmatory comparisons, designs with fixed randomization probabilities and group sequential decision rules appear preferable to AR, both scientifically and ethically.

In medical research, the randomized comparative trial (RCT) is the scientific gold standard for obtaining confirmatory treatment comparisons. While the statistical rationale for randomizing is not well understood by many practitioners, it is accepted widely that randomizing patients between treatments is the right thing to do if one wants to obtain an answer to the question, ‘Is one of these treatments better than the other?’

To see why randomization works when comparing two treatments, *B* and *A*, imagine one could make two copies of each patient, treat one copy with *B*, and treat the other copy with *A*. Suppose the primary outcome, response, is >50% shrinkage of a solid tumor within 12 weeks. If *Y*_{B} indicates response with *B* and *Y*_{A} indicates response with *A*, then comparing *Y*_{B} with *Y*_{A} would show which treatment is better for that patient. Because one cannot actually duplicate patients, at most one of these two outcomes can be observed for each patient, and randomization is the practical device that allows the two treatments to be compared without bias, on average, across patients.

To compare treatments for a patient population, the key parameters are *p*_{B} = Prob(response if a patient receives *B*) and *p*_{A} = Prob(response if a patient receives *A*), and the comparative treatment effect is the difference Δ_{B,A} = *p*_{B} − *p*_{A}.

If one randomizes patients equally between *B* and *A*, the difference between the sample means or proportions is an unbiased estimator of Δ_{B,A}. The idea of unbiased estimation sometimes is misunderstood. Because any statistical estimator is a random quantity, an unbiased estimator does not *equal* Δ_{B,A}, but rather follows a distribution having mean equal to Δ_{B,A}. Figure 1 illustrates distributions of two estimators of Δ_{B,A}, one obtained from an RCT and the other from observational data. The RCT estimator is unbiased because its distribution is centered around Δ_{B,A}, while the observational data estimator is biased because its distribution is shifted to the right, so it overestimates Δ_{B,A}.
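The distinction between an unbiased and a biased estimator can be seen in a small simulation. The sketch below uses hypothetical numbers; in particular, the 0.10 prognosis advantage for the observational *B* group is an assumption for illustration, not a value from this article:

```python
import numpy as np

rng = np.random.default_rng(0)
p_B, p_A = 0.45, 0.25            # true response probabilities, so the true effect is 0.20
n_rep, n_arm = 10_000, 100       # simulated trials and patients per arm

# RCT: randomization makes both arms samples from the same patient population.
rct_est = (rng.binomial(n_arm, p_B, n_rep) - rng.binomial(n_arm, p_A, n_rep)) / n_arm

# Observational data: suppose better-prognosis patients tended to receive B,
# adding a hypothetical 0.10 to its apparent response rate (a latent-variable effect).
obs_est = (rng.binomial(n_arm, p_B + 0.10, n_rep) - rng.binomial(n_arm, p_A, n_rep)) / n_arm

print(round(rct_est.mean(), 3))  # centered near the true effect 0.20
print(round(obs_est.mean(), 3))  # centered near 0.30: biased upward
```

Both estimators vary from replication to replication; only the randomized one has a distribution centered at the true effect.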

Many things can go wrong with treatment comparison if one does not randomize. A common approach is to estimate Δ_{B,A} by collecting observational data on patients treated with either *B* or *A*, and computing the difference between the empirical response rates or sample means as the statistical estimator. This avoids the time and expense of conducting a clinical trial, but easily can lead to incorrect conclusions. If the patients who received *B* on average had better prognosis than those who received *A*, this can lead to *estimation bias*, denoted by bias_{B,A}. The usual difference between the empirical response rates then does not estimate Δ_{B,A}, but rather the sum Δ_{B,A} + bias_{B,A}.

Sources of bias in observational data often are not obvious. One simply may not know that the patients who received treatment *B* tended to have better prognosis, and there often are unknown *latent variables* that affect outcome and are not balanced between the two treatment groups. Latent variables may arise from systematic differences between nurses, physicians, or medical centers, changes in supportive care over time, different drop-out rates due to differences in toxicity between the treatments, or bias in selecting each patient's treatment, which may or may not be intentional. Such latent variables easily may cause a statistical estimator to be biased.

Many clinical trialists claim that bias in observational data can be corrected by fitting a statistical regression model, such as a logistic model for response or a Cox model for survival time, to adjust for effects of patient prognostic covariates. Unfortunately, *fitting a standard regression model will not correct for bias in observational data*. Valid statistical methods to correct for bias in treatment comparison based on observational data include inverse probability of treatment weighting [1], stratification, or case matching [2–8].

While the unbiased comparison provided by randomization serves the needs of future patients, flipping a coin to choose each patient's treatment looks strange to many nonstatisticians. In medical practice, a physician chooses each patient's treatment based on the patient's prognostic characteristics. This is inherently biased, so an RCT may be considered the antithesis of routine medical practice. The physicians involved in a trial must have *equipoise*, in that they have no reason to favor one treatment over the other. If the goal is to maximize benefit to the patients enrolled in the trial, then it may seem that the best action is simply to give each new patient the treatment that currently has the larger estimated response rate or mean survival time. This is called a ‘greedy’, ‘play-the-winner (PTW)’, or ‘myopic’ algorithm. Perhaps counterintuitively, a PTW algorithm may not be best for the patients in the trial. Consider a comparison of *B* and *A* in terms of response probabilities *p*_{A} and *p*_{B}. If, by chance, early results with the truly superior treatment are poor, a greedy algorithm may settle on the inferior treatment and never gather enough data on the superior one to correct the error, a well-known flaw of greedy decision rules [9].

A compromise between equal randomization and PTW is adaptive randomization (AR), which uses the observed data from previous patients in the trial to compute treatment randomization probabilities for newly accrued patients. These sometimes are called ‘randomized play-the-winner’ designs [10, 11]. For example, if one response were observed in three patients treated with *A*, and no responses were observed in three patients treated with *B*, an AR method might randomize the next patient to *B* with probability 0.25 and to *A* with probability 0.75, thus allowing additional data to be obtained on treatment *B*. The usual argument motivating the use of AR in place of equal randomization is that it is ethically more desirable because, on average, more patients are given the superior treatment. The goal of AR is to obtain achieved sample sizes for the two arms, *N*_{A} and *N*_{B}, with more patients, on average, assigned to the truly superior treatment.
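As a concrete sketch of how such randomization probabilities can be computed, the code below assumes beta(1, 1) priors and uses the AR(*c*) transformation *r* = *q*^{c}/(*q*^{c} + (1 − *q*)^{c}) of the posterior probability *q* = Prob(*p*_{B} > *p*_{A} | data) [20]; the prior choice and the Monte Carlo approximation are illustrative assumptions, not the exact specification of this article:

```python
import numpy as np

rng = np.random.default_rng(1)

def ar_probability(resp_B, n_B, resp_A, n_A, c=1.0, n_draws=200_000):
    """Randomization probability for arm B under an AR(c) rule.

    Assumes beta(1, 1) priors, so each posterior is a beta distribution;
    q = Prob(p_B > p_A | data) is estimated by posterior sampling.
    """
    p_B = rng.beta(1 + resp_B, 1 + n_B - resp_B, n_draws)
    p_A = rng.beta(1 + resp_A, 1 + n_A - resp_A, n_draws)
    q = (p_B > p_A).mean()
    return q**c / (q**c + (1.0 - q)**c)

# The example above: 1/3 responses on A, 0/3 on B.
r_B = ar_probability(resp_B=0, n_B=3, resp_A=1, n_A=3)
print(round(r_B, 2))  # well below 0.5, favoring the currently better arm A
```

Setting c = 1/2, as in AR(1/2), shrinks the resulting probability toward 0.5 and tempers the adaptation.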

There are many ways to do AR [11, 16–19]. We will discuss simple Bayesian methods for binary response outcomes and for event time outcomes, two common cases arising in practice. For the binary case, our focus will be a trial with up to 200 patients to compare treatments *B* and *A* having response probabilities *p*_{B} and *p*_{A}.

Any type of random treatment assignment produces random sample sizes, *N*_{B} and *N*_{A}, so the achieved sample sizes vary from one trial to another.

For all simulations, each case was replicated 10 000 times. Table 1 gives simulation results for a trial run using each of AR(1), AR(1/2), and the Equal GS design. In the null case where *p*_{B} = *p*_{A} = 0.25, the designs differ greatly in the variability of their achieved sample sizes, with the AR methods producing far more dispersed distributions of *N*_{B} − *N*_{A} than equal randomization.

Comparison of designs for a binary response outcome with maximum trial sample size *N* = 200 and true *p*_{A} = 0.25

The means do not tell the whole story. For example, when *p*_{B} = 0.35 and *p*_{A} = 0.25, the distributions of the achieved sample size difference *N*_{B} − *N*_{A} (Figure 2) show that, with substantial probability, the imbalance goes in the wrong direction, with many more patients assigned to the inferior treatment *A*.

Distributions of the achieved sample size difference *N*_{B} − *N*_{A} for a 200-patient trial conducted using either AR(1) or AR(1/2), when *p*_{A} = 0.25 and *p*_{B} = 0.35.
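The wrong-direction imbalance can be explored with a small Monte Carlo sketch. The code below is an illustrative simplification, assuming beta(1, 1) priors, patient-by-patient AR(1) assignment, and no early stopping; it simulates complete trials and records the achieved difference *N*_{B} − *N*_{A}:

```python
import numpy as np
from math import lgamma, exp
from functools import lru_cache

rng = np.random.default_rng(2)

@lru_cache(maxsize=None)
def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def prob_B_beats_A(aB, bB, aA, bA):
    # Exact Prob(p_B > p_A) for independent beta posteriors (integer aB).
    return sum(
        exp(log_beta(aA + i, bA + bB) - log_beta(1 + i, bB) - log_beta(aA, bA)) / (bB + i)
        for i in range(aB)
    )

def one_trial(p_B=0.35, p_A=0.25, N=200):
    sB = fB = sA = fA = 0                # successes/failures on each arm
    for _ in range(N):
        q = prob_B_beats_A(1 + sB, 1 + fB, 1 + sA, 1 + fA)  # AR(1) probability for B
        if rng.random() < q:
            sB, fB = (sB + 1, fB) if rng.random() < p_B else (sB, fB + 1)
        else:
            sA, fA = (sA + 1, fA) if rng.random() < p_A else (sA, fA + 1)
    return (sB + fB) - (sA + fA)         # achieved N_B - N_A

diffs = np.array([one_trial() for _ in range(400)])
print(diffs.mean(), (diffs < 0).mean())  # mean favors B; the spread shows wrong-direction trials
```

Even though the allocation favors *B* on average, the simulated distribution of *N*_{B} − *N*_{A} is highly variable, which is the behavior illustrated in Figure 2.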

While all adaptive designs introduce bias in final estimators, the bias is much larger when AR methods are used. This involves the ideas of accuracy and precision. Higher accuracy of an estimator is quantified by smaller bias, whereas higher precision means smaller variability. Figure 3 illustrates distributions of the estimators of Δ_{B,A} for the Equal GS and AR(1/2) designs, when *p*_{A} = 0.25 and *p*_{B} = 0.45. The AR(1/2) estimator's distribution is shifted to the right of the true value Δ_{B,A} = 0.20, overestimating the treatment effect, while the Equal GS estimator's distribution is centered much closer to the true value.

Distributions of the final estimator of the *B*-versus-*A* treatment effect Δ_{B,A} = 0.20, for true response probabilities *p*_{A} = 0.25 and *p*_{B} = 0.45, from a 200-patient trial conducted using either a GS design with equal randomization or AR(1/2).

There is a substantial statistical literature on methods to correct for bias following a GS trial [23–26]. Unfortunately, these methods seldom are used in practice.

As explained in the Appendix, for these AR designs *B* is declared superior to *A* if the posterior probability Prob(*p*_{A} < *p*_{B} | data) is sufficiently large, with the decision cutoffs calibrated to control the overall type I error probability.

To quantify the actual advantage afforded by AR to the patients in the trial, Table 2 gives the means and 95% probability intervals (2.5th and 97.5th percentiles) of total sample size, number of successes, and number of failures for AR(1)*, AR(1/2)*, and the Equal GS design, in the case where the true *p*_{A} = 0.25 and *p*_{B} = 0.45.

A prominent argument against the use of AR is that it can lead to biased estimates in the presence of parameter drift. Drift occurs, for example, if *p*_{A} and *p*_{B} both increase over the course of the trial because the prognosis of the accrued patients improves systematically over time.

Comparison of the designs in Table 1 when *p*_{A} and *p*_{B} drift upward together over time, corresponding to improving prognosis of enrolled patients

With drift, the AR(1)* and AR(1/2)* designs have respective type I errors 0.20 and 0.10, compared with 0.08 for Equal GS, and both have power = 0.57 for *p*_{B} = 0.45, compared with power = 0.87 with Equal GS. All AR methods have bias that is at least triple that of Equal GS.

Since a total parameter drift of 0.20 over the course of a 200-patient trial may be considered large, we repeated the simulations with smaller total drift = 0.10, given in the lower portion of Table 3. With smaller overall drift, the inflation of type I error for the AR methods is smaller, but the power figures are much smaller, specifically 0.45 for AR(1)* and 0.49 for AR(1/2)* compared with 0.86 for Equal GS at *p*_{B} = 0.45, and the problems with large estimation bias and sample size imbalance in the wrong direction persist.

Because progression-free survival time or overall survival (OS) time often are used as the primary outcome in RCTs, it is useful to examine how similar AR methods applied with time-to-event outcomes behave compared with equal randomization. The following simulations focus on a trial comparing the median OS times of treatments *B* and *A*. We assume the commonly used Bayesian model in which the survival times in each arm follow exponential distributions, with medians *m*_{A} and *m*_{B} as the parameters of interest.
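Under an exponential model, the posterior quantities needed for such comparisons are easy to compute. The sketch below assumes a gamma prior on each arm's hazard rate, with illustrative hyperparameters and data that are not taken from this article; since the median is *m* = ln 2/rate, a gamma prior on the rate is equivalent to an inverse gamma prior on the median:

```python
import numpy as np

rng = np.random.default_rng(3)

def prob_longer_median(events_B, time_B, events_A, time_A,
                       a0=0.001, b0=0.001, n_draws=200_000):
    """Prob(m_B > m_A | data) for exponential survival in each arm.

    With a gamma(a0, b0) prior on each hazard rate, the posterior is
    gamma(a0 + events, b0 + total follow-up time). Since m = ln(2)/rate,
    m_B > m_A exactly when rate_B < rate_A. Hyperparameters are illustrative.
    """
    rate_B = rng.gamma(a0 + events_B, 1.0 / (b0 + time_B), n_draws)
    rate_A = rng.gamma(a0 + events_A, 1.0 / (b0 + time_A), n_draws)
    return (rate_B < rate_A).mean()

# Hypothetical interim data: B has 20 deaths in 500 patient-months of follow-up,
# A has 30 deaths in 450 patient-months.
q = prob_longer_median(20, 500.0, 30, 450.0)
print(round(q, 2))  # high posterior probability that B has the longer median
```

The same posterior probability can drive either an AR rule or a group sequential stopping rule for the survival-time case.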

Simulation results for the five designs comparing *A* with *B* in terms of OS times are summarized in Table 4. In the null case where both median OS times are *m*_{A} = *m*_{B} = 12 months, the designs behave much as in the binary outcome case, with the AR methods again producing highly variable achieved sample sizes.

Comparison of the three designs with survival time outcomes, maximum *N* = 200, accrual rate 5 patients per month, and true median survival *m*_{A} = 12 months with treatment *A*

As a final simulation study, we evaluated the sensitivity of the five designs to the accrual rate, for the case *m*_{A} = 12 months with a true median survival advantage for *B*.

It appears that practitioners may not fully understand the properties of the AR methods that they have used. This may be due to examining only average behavior, or not being aware of the large estimation bias and loss of power produced by AR. The use of AR methods in place of equal randomization in clinical trials remains controversial, and there has been an active debate in the medical literature in recent years [12–15, 20, 21, 27].

Our computer simulations have identified and quantified the following problems with AR, which have important ethical and scientific consequences.

- AR introduces much more variability into the distributions of the achieved sample sizes *N*_{A} and *N*_{B}, compared with that introduced by equal randomization.
- AR may have a high probability of a sample size imbalance in the wrong direction, which may be surprising to nonstatisticians.
- AR introduces substantial bias in the estimator of the comparative treatment effect Δ_{B,A} by overestimating any actual effect. This bias becomes larger if there is parameter drift.
- AR may produce a very large type I error rate. This problem becomes much more severe if there is parameter drift.
- If the decision cutoffs of AR are calibrated to ensure type I error 0.05, the resulting AR methods have much smaller power to detect true treatment differences compared with Equal GS.

The severity of each problem depends on the particular AR method used and the design parameters. These problems with AR may be mitigated by adopting a fix-up strategy, such as stratifying or blocking, doing an initial ‘burn-in’ using equal randomization before applying AR, modifying the AR probabilities to shrink them toward 0.50, as in AR(1/2), or correcting for bias in some *post hoc* manner. With such fix-ups, however, the resulting gain in the number of patients treated with the superior rather than the inferior treatment, if in fact Δ_{B,A} ≠ 0, becomes much smaller.
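As an illustration of how such fix-ups can be combined, the sketch below wraps a burn-in, shrinkage toward 0.50 via an AR(1/2)-style transform, and probability capping into a single rule; the burn-in length and the caps are hypothetical choices, not values from this article:

```python
def randomization_prob(n_enrolled, q, burn_in=40, c=0.5, cap=(0.1, 0.9)):
    """Probability of assigning the next patient to arm B.

    q is the current posterior probability Prob(p_B > p_A | data).
    Fix-ups: equal randomization during a hypothetical burn-in, then AR(c)
    shrinkage toward 0.5 (c = 1/2 gives AR(1/2)), with capped probabilities.
    """
    if n_enrolled < burn_in:
        return 0.5                        # burn-in: plain equal randomization
    r = q**c / (q**c + (1.0 - q)**c)      # AR(c) shrinkage of q toward 0.5
    return min(max(r, cap[0]), cap[1])    # cap to avoid extreme imbalance

print(randomization_prob(10, 0.9))   # 0.5 during burn-in
print(randomization_prob(100, 0.9))  # ~0.75: shrunk well below the raw q = 0.9
```

Note the trade-off this makes explicit: each fix-up moves the allocation back toward 0.5, so the putative ethical gain of AR shrinks accordingly.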

We have not considered the multiarm case with more than two treatment groups. This case is much more complex, potentially involving both selection and testing, two or more stages, including a control arm or not, and many possible inferential goals. The issues discussed here should be examined when designing a multiarm randomized trial.

In conclusion, it is important to consider the actual gain with AR in light of the problems discussed here. Given our simulation results, and the simulation studies of others, the scientific and ethical problems introduced by AR must be weighed against its putative advantage.

This research was partially supported by NIH/NCI grant RO1 CA 83932.

The authors have declared no conflict of interest.

We thank two anonymous referees for their detailed and constructive comments.

To define the adaptive randomization (AR) methods and statistical tests, denote the achieved sample size when an interim decision is made by *n*. For the case of binary outcomes, under a Bayesian model we assume that *p*_{A} and *p*_{B} follow independent beta priors, so that the posterior of each is also a beta distribution. Denoting *q*_{n} = Prob(*p*_{B} > *p*_{A} | data_{n}), the AR(*c*) method randomizes each newly accrued patient to *B* with probability

*r*_{n} = *q*_{n}^{c}/[*q*_{n}^{c} + (1 − *q*_{n})^{c}]

and to *A* with probability 1 − *r*_{n}, with *c* = 1 giving AR(1) and *c* = 1/2 giving AR(1/2).

The equally randomized group sequential design stops early and declares *B* superior to *A* if Prob(*p*_{A} < *p*_{B} | data) exceeds the decision cutoff, or declares *A* superior to *B* if the similar inequality holds with the roles of *A* and *B* reversed. If the trial is not stopped early, a final decision is made at *N* = 200. The decision cutoff parameters 0.95 and 0.80 were derived to ensure overall type I error probability <0.05. We refer to this design as ‘Equal GS’.

For the case of time-to-event outcomes, the medians *m*_{A} and *m*_{B} of the assumed exponential survival time distributions play the roles of *p*_{A} and *p*_{B}, and the trial is stopped early with *B* declared superior to *A* if Prob(*m*_{A} < *m*_{B} | data) is sufficiently large, or with *A* declared superior if the analogous inequality holds with the roles of the arms reversed,

with decision parameters 0.996 and 0.01 chosen to ensure overall type I error probability <0.05. Since, for each treatment arm *t* = *A*, *B*, the posterior of *m*_{t} is an inverse gamma distribution with shape parameter 2.144 + NE_{t}, where NE_{t} denotes the number of events observed in arm *t*, these posterior probabilities are computed easily.

1. Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000; 5: 561–570.

2. Rosenbaum P, Rubin D. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc 1984; 79: 516–524.

3. Morgan SL, Harding DJ. Matching estimators of causal effects: prospects and pitfalls in theory and practice. Sociol Methods Res 2006; 35: 3–60.

4. Austin PC. The performance of different propensity-score methods for estimating differences in proportions (risk differences or absolute risk reductions) in observational studies. Stat Med 2010; 29: 2137–2148.

5. Cole SR, Hernán MA, Robins JM et al. Effect of highly active antiretroviral therapy on time to acquired immunodeficiency syndrome or death using marginal structural models. Am J Epidemiol 2003; 158: 687–694.

6. Cook NR, Cole SR, Hennekens CH. Use of a marginal structural model to determine the effect of aspirin on cardiovascular mortality in the Physicians' Health Study. Am J Epidemiol 2002; 155: 1045–1054.

7. Curtis LH, Hammill BG, Eisenstein EL et al. Using inverse probability-weighted estimators in comparative effectiveness analyses with observational databases. Med Care 2007; 45(Suppl 2): S103–S107.

8. Wang L, Rotnitzky A, Lin X et al. Evaluation of viable dynamic treatment regimes in a sequentially randomized trial of advanced prostate cancer. J Am Stat Assoc 2012; 107: 493–508.

9. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.

10. Wei LJ, Durham S. The randomized play-the-winner rule in medical trials. J Am Stat Assoc 1978; 73: 840–843.

11. Zelen M. A new design for randomized clinical trials. N Engl J Med 1979; 300: 1242–1246.

12. Chappell R, Karrison T. Letter to the editor. Stat Med 2007; 26: 3046–3056.

13. Korn EL, Freidlin B. Outcome-adaptive randomization: is it useful? J Clin Oncol 2011; 29: 771–776.

14. Lee JJ, Chen N, Yin G. Worth adapting? Revisiting the usefulness of outcome-adaptive randomization. Clin Cancer Res 2012; 18: 4498–4507.

15. Yuan Y, Yin G. On the usefulness of outcome-adaptive randomization. J Clin Oncol 2011; 29: 390–392.

16. Rosenberger WF, Lachin JM. The use of response-adaptive designs in clinical trials. Control Clin Trials 1993; 14: 471–484.

17. Karrison TG, Huo D, Chappell R. A group sequential, response-adaptive design for randomized clinical trials. Control Clin Trials 2003; 24: 506–522.

18. Hu F, Rosenberger WF. The Theory of Response-Adaptive Randomization in Clinical Trials. Hoboken: Wiley; 2006.

19. Cheung YK, Inoue LYT, Wathen JK, Thall PF. Continuous Bayesian adaptive randomization based on event times with covariates. Stat Med 2006; 25: 55–70.

20. Thall PF, Wathen JK. Practical Bayesian adaptive randomization in clinical trials. Eur J Cancer 2007; 43: 860–867.

21. Thall PF, Fox P, Wathen JK. Some caveats for outcome adaptive randomization in clinical trials. In Sverdlov O (ed), Modern Adaptive Randomized Clinical Trials: Statistical, Operational, and Regulatory Aspects. Boca Raton, FL: Taylor & Francis; 2015; in press.

22. Thompson WR. On the likelihood that one unknown probability exceeds another in view of the evidence of the two samples. Biometrika 1933; 25: 285–294.

23. Emerson SS, Fleming TR. Parameter estimation following group sequential hypothesis testing. Biometrika 1990; 77: 875–892.

24. Liu A, Hall WJ. Unbiased estimation following a group sequential test. Biometrika 1999; 86: 71–78.

25. Tsiatis AA, Rosner GL, Mehta CR. Exact confidence intervals following a group sequential test. Biometrics 1984; 40: 797–803.

26. Whitehead J. On the bias of maximum likelihood estimation following a sequential test. Biometrika 1986; 73: 573–581.

27. Hey SP, Kimmelman J. Are outcome adaptive allocation trials ethical? Clin Trials 2015; 12: 102–106.
