In recent years, various outcome adaptive randomization (AR) methods have been used to conduct comparative clinical trials. Rather than randomizing patients equally between treatments, outcome AR uses the accumulating data to unbalance the randomization probabilities in favor of the treatment arm that currently is superior empirically. This is motivated by the idea that, on average, more patients in the trial will be given the treatment that is truly superior, so AR is ethically more desirable than equal randomization. AR remains controversial, however, and some of its properties are not well understood by the clinical trials community.
Computer simulation was used to evaluate properties of a 200-patient clinical trial conducted using one of four Bayesian AR methods and compare them to an equally randomized group sequential design.
Outcome AR has several undesirable properties. These include a high probability of a sample size imbalance in the wrong direction, wherein many more patients are assigned to the inferior treatment arm, the opposite of the intended effect; this may surprise nonstatisticians. Compared with an equally randomized design, outcome AR produces less reliable final inferences, including a greatly overestimated actual treatment effect difference and smaller power to detect a treatment difference. This estimation bias becomes much larger if the prognosis of the accrued patients either improves or worsens systematically during the trial.
AR produces inferential problems that decrease potential benefit to future patients, and may decrease benefit to patients enrolled in the trial. These problems should be weighed against its putative ethical benefit. For randomized comparative trials to obtain confirmatory comparisons, designs with fixed randomization probabilities and group sequential decision rules appear to be preferable to AR, both scientifically and ethically.
In medical research, the randomized comparative trial (RCT) is the scientific gold standard for obtaining confirmatory treatment comparisons. While the statistical rationale for randomizing is not well understood by many practitioners, it is accepted widely that randomizing patients between treatments is the right thing to do if one wants to obtain an answer to the question, ‘Is one of these treatments better than the other?’
To see why randomization works when comparing two treatments, B and A, imagine one could make two copies of each patient, treat one copy with B, and treat the other copy with A. Suppose the primary outcome, response, is >50% shrinkage of a solid tumor within 12 weeks. If YB indicates response with B and YA indicates response with A, then YB − YA is the causal effect of B versus A for that patient. The imaginary outcomes YB and YA are called counterfactuals. For example, if the copy of the patient that received B achieves a response and the copy that received A does not, then YB − YA = 1 − 0 = 1. The mean of all YB − YA differences over a sample of such patient pairs would estimate the causal effect of B versus A. While one cannot make copies of patients, in practice a physician's choice between B and A for a given patient often involves thinking about this imaginary experiment.
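The imaginary two-copies experiment can be sketched in a short simulation. The response rates below (pB = 0.5, pA = 0.3) are hypothetical, chosen only for illustration; the point is that the mean of the counterfactual differences YB − YA recovers the causal effect pB − pA.

```python
import random

def simulate_causal_effect(p_b, p_a, n_pairs, seed=0):
    """Monte Carlo sketch of the imaginary two-copies experiment:
    each patient pair yields counterfactual outcomes YB and YA,
    and the mean of YB - YA estimates the causal effect p_b - p_a."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_pairs):
        y_b = 1 if rng.random() < p_b else 0  # copy treated with B
        y_a = 1 if rng.random() < p_a else 0  # copy treated with A
        total += y_b - y_a
    return total / n_pairs

# With hypothetical rates pB = 0.5 and pA = 0.3, the estimate is near 0.2.
est = simulate_causal_effect(0.5, 0.3, 100_000)
```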
To compare treatments for a patient population, the key parameters are pB = Prob(response if a patient receives B) and pA = Prob(response if a patient receives A), and ΔB,A = pB − pA is the population B-versus-A treatment effect. For survival time, the parameters are mean survival times, mA and mB, and ΔB,A = mB − mA. While ΔB,A is a conceptual object, its meaning is clear, since large positive values correspond to superiority of B over A, and conversely. This paper is about statistical methods for conducting a RCT to estimate ΔB,A and decide whether one treatment is better than the other, and how randomizing adaptively rather than equally can affect such inferences.
If one randomizes patients equally between B and A, the difference between the sample means or proportions is an unbiased estimator of ΔB,A. The idea of unbiased estimation sometimes is misunderstood. Because any statistical estimator is a random quantity, an unbiased estimator does not equal ΔB,A, but rather follows a distribution having mean equal to ΔB,A. Figure 1 illustrates distributions of two estimators of ΔB,A, one obtained from a RCT and the other from observational data. The RCT estimator is unbiased because its distribution is centered around ΔB,A, while the observational data estimator is biased because its distribution is shifted to the right, so it overestimates ΔB,A.
Many things can go wrong with treatment comparison if one does not randomize. A common approach is to estimate ΔB,A by collecting observational data on patients treated with either B or A, and compute the difference between the empirical response rates or sample means as the statistical estimator. This avoids the time and expense of conducting a clinical trial, but easily can lead to incorrect conclusions. If the patients who received B on average had better prognosis than those who received A, this can lead to estimation bias, denoted by biasB,A. The usual difference between the empirical response rates does not estimate ΔB,A, but actually estimates ΔB,A + biasB,A. For example, if the actual response rates are pB = 0.20 and pA = 0.30, the actual treatment effect is ΔB,A = 0.20 − 0.30 = −0.10, so A is slightly better than B. If biasB,A = 0.30 due to better prognosis, on average, in the B patients, the difference between the estimated response probabilities will take on a value with mean ΔB,A + biasB,A = −0.10 + 0.30 = 0.20 rather than ΔB,A = −0.10. This likely would lead one to believe, incorrectly, that there is a substantive advantage of B over A, to be unrealistically optimistic about B, and to organize a large phase III trial yielding negative results.
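The confounding mechanism described above can be illustrated with a toy simulation. All numbers here are hypothetical: a latent good-prognosis indicator both raises the response probability and makes treatment B more likely to be chosen, while A is truly better by 0.10 for every patient. The naive rate difference then comes out near +0.20 even though the true effect is −0.10, a bias of +0.30, mirroring the example in the text.

```python
import random

def naive_observational_estimate(n=100_000, seed=1):
    """Toy confounded comparison (all probabilities hypothetical):
    good-prognosis patients respond more often AND are more likely
    to receive B, so the naive rate difference mixes the treatment
    effect with the prognosis effect."""
    rng = random.Random(seed)
    b_resp = b_n = a_resp = a_n = 0
    for _ in range(n):
        good = rng.random() < 0.5                         # latent prognostic variable
        treat_b = rng.random() < (0.8 if good else 0.2)   # biased treatment selection
        # response probability: prognosis effect, plus a true +0.10 advantage for A
        p = (0.55 if good else 0.05) + (0.0 if treat_b else 0.10)
        y = 1 if rng.random() < p else 0
        if treat_b:
            b_n += 1; b_resp += y
        else:
            a_n += 1; a_resp += y
    return b_resp / b_n - a_resp / a_n

# True effect of B versus A is -0.10, but the naive estimate is about +0.20.
est = naive_observational_estimate()
```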
Sources of bias in observational data often are not obvious. One simply may not know that the patients who received treatment B tended to have better prognosis, and there often are unknown latent variables that affect outcome and are not balanced between the two treatment groups. Latent variables may arise from systematic differences between nurses, physicians, or medical centers, changes in supportive care over time, different drop-out rates due to differences in toxicity between the treatments, or bias in selecting each patient's treatment, which may or may not be intentional. Such latent variables easily may cause a statistical estimator to be biased.
Many clinical trialists claim that bias in observational data can be corrected by fitting a statistical regression model, such as a logistic model for response or a Cox model for survival time, to adjust for effects of patient prognostic covariates. Unfortunately, fitting a standard regression model will not correct for bias in observational data. Valid statistical methods to correct for bias in treatment comparison based on observational data include inverse probability of treatment weighting, stratification, or case matching [2–8].
While the unbiased comparison provided by randomization serves the needs of future patients, flipping a coin to choose each patient's treatment looks strange to many nonstatisticians. In medical practice, a physician chooses each patient's treatment based on the patient's prognostic characteristics. This is inherently biased, so a RCT may be considered the antithesis of routine medical practice. The physicians involved in a trial must have equipoise in that they have no reason to favor one treatment over the other. If the goal is to maximize benefit to the patients enrolled in the trial, then it may seem that the best action is simply to give each new patient the treatment that currently has the larger estimated response rate or mean survival time. This is called a ‘greedy’, ‘play-the-winner (PTW)’, or ‘myopic’ algorithm. Perhaps counterintuitively, a PTW algorithm may not be best for the patients in the trial. Consider a comparison of B and A in terms of response probabilities pA and pB, when the true (unknown) values are pA = 0.30 and pB = 0.50. Suppose one begins by treating three patients on each arm and thereafter uses the PTW rule but, by chance, 0/3 responses are observed with B, which has probability 0.125, and at least one response is observed with A, which has probability 0.657. Since the empirical response rate with B is 0 and the empirical response rate with A is positive, thereafter all future patients in the trial will be given treatment A. Thus, due to random variation, the trial gets stuck giving the inferior treatment to all but three patients. This phenomenon with greedy decision rules has been well known for many years in sequential analysis.
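The probabilities quoted in the play-the-winner example follow directly from the binomial calculations below.

```python
# Probabilities in the play-the-winner example, with true response
# rates pB = 0.50 and pA = 0.30 and 3 initial patients per arm.
p_b, p_a, n = 0.50, 0.30, 3

p_zero_of_three_b = (1 - p_b) ** n        # P(0/3 responses on B) = 0.125
p_at_least_one_a = 1 - (1 - p_a) ** n     # P(>=1 response on A) = 0.657
# If both events occur, the greedy rule assigns A to every later patient.
p_trial_gets_stuck = p_zero_of_three_b * p_at_least_one_a
```

So even with these favorable true rates, there is roughly an 8% chance that the greedy rule locks onto the inferior arm after the first six patients.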
A compromise between equal randomization and PTW is adaptive randomization (AR), which uses the observed data from previous patients in the trial to compute treatment randomization probabilities for newly accrued patients. These sometimes are called ‘randomized play-the-winner’ designs [10, 11]. For example, if one response were observed in three patients treated with A, and no responses were observed in three patients treated with B, an AR method might randomize the next patient to B with probability 0.25 and to A with probability 0.75, thus allowing additional data to be obtained on treatment B. The usual argument motivating using AR in place of equal randomization is that it is ethically more desirable because, on average, more patients are given the superior treatment. The goal of AR is to obtain sample sizes for the two arms, NA and NB, unbalanced in favor of the arm having larger true success probability, to obtain NB > NA if B has the larger response rate or longer mean survival time. Many practitioners consider AR to be a panacea for the ethical dilemma posed by randomized trials. The use of AR in clinical trials remains controversial, however [12–15, 21, 27].
There are many ways to do AR [11, 16–19]. We will discuss simple Bayesian methods, for binary response outcomes and event time outcomes, two common cases arising in practice. For the binary case, our focus will be a trial with up to 200 patients to compare treatments B and A having response probabilities pB and pA. Our AR methods are similar to those studied previously [20, 21], with AR probabilities computed using the posterior probability that B has a higher response rate than A, denoted by qn(A < B). The original version of AR, introduced by Thompson, randomizes patients to B with probability qn(A < B) and to A with probability 1 − qn(A < B). We refer to this method as AR(1). A modification, which we will call AR(1/2), adjusts these probabilities so both are closer to 0.50. For example, if qn(A < B) = 0.75 and 1 − qn(A < B) = 0.25, AR(1/2) turns these values into randomization probabilities 0.63 and 0.37. As a comparator, we consider a Bayesian group sequential (GS) design that randomizes patients equally between B and A, with a treatment comparison rule applied at n = 50, 100, 150, and 200 patients, and patients randomized in blocks of size 8 to avoid interim sample size imbalance. The rule decides whether to stop and conclude B is superior to A, or A is superior to B, or continue the trial. We refer to this design as ‘Equal GS’. Details are given in the Appendix section.
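The AR(1/2) adjustment in the example above is consistent with the commonly used power transformation of the posterior probability, sketched here; AR(1) corresponds to the exponent c = 1.

```python
def ar_prob(q, c):
    """Randomization probability for arm B under the power transformation
    r = q^c / (q^c + (1 - q)^c), where q = qn(A < B).
    c = 1 gives Thompson's AR(1); c = 1/2 gives the shrunken AR(1/2)."""
    return q**c / (q**c + (1 - q)**c)

# qn(A < B) = 0.75: AR(1) randomizes to B with probability 0.75,
# while AR(1/2) shrinks this toward 0.50, giving about 0.63.
```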
Any type of random treatment assignment produces random sample sizes, NB and NA, with distributions depending on the trial design and randomization method used. Properties of AR methods derived from computer simulations often are reported in terms of the means of the sample size distributions. The mean, or ‘expected’, difference between the achieved sample sizes, E(NB − NA), often is used to quantify a putative advantage of an AR method over equal randomization, for which this mean is 0. If B actually is better than A (pB > pA), then AR should allocate more patients to B than to A. This could be quantified by E(NB − NA) being positive, but reporting only mean sample sizes is misleading, because AR produces much more disperse sample size distributions compared with equal randomization.
For all simulations, each case was replicated 10 000 times. Table 1 gives simulation results for a trial run using each of AR(1), AR(1/2), and the Equal GS design. In the null case where pB = pA = 0.25, by symmetry the false-positive (type I error) probability for each method is twice the tabled value ‘Probability conclude pB > pA’, so the false-positive probabilities are 0.18 for AR(1), 0.24 for AR(1/2), and 0.048 for Equal GS. The simulations show that both AR(1) and AR(1/2) have much larger false-positive rates compared with the Equal GS design, and smaller power when B is superior to A, with pB = 0.45. The means of NB − NA are quite large for the AR methods when pB = 0.35 or 0.45 and pA = 0.25. Large mean sample size differences, such as those in Table 1, are the most commonly used rationale for arguing that AR is ethically superior to equal randomization.
The means do not tell the whole story. For example, when pB = 0.35 and pA = 0.25, AR(1) has a 14% chance of producing a sample size imbalance of 20 patients or more in favor of the inferior treatment, the opposite of the claimed ethical advantage of AR. This apparently anomalous behavior is explained graphically by Figure 2, showing the distributions of NB − NA for AR(1) and AR(1/2) in this case. Because AR probabilities, which are computed from interim data, are highly variable, the distributions of NB − NA for AR(1) and AR(1/2) have large left tails below 0, corresponding to values of NB much smaller than NA. The practical and ethical point is that AR may behave pathologically in that it carries a nontrivial risk of creating a large sample size imbalance in favor of the inferior treatment. Reporting only the mean sample size differences, 66 for AR(1) and 37 for AR(1/2), thus is very misleading. This example illustrates two important points. Since NB and NA are highly disperse with AR, the mean of NB − NA does not adequately describe a given AR method's behavior, and the general claim that any AR method always is ethically superior to equal randomization is false.
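The dispersion of NB − NA under AR(1) can be explored with a simplified simulation. This sketch uses Thompson's rule with the beta(0.25, 0.75) priors described in the Appendix, exploiting the fact that drawing one posterior sample per arm and assigning B when its draw is larger randomizes to B with exactly probability qn(A < B). It omits the early-stopping rules, so its numbers only approximate those reported in the text.

```python
import random

def simulate_ar1_imbalance(p_a, p_b, n_patients=200, n_trials=300, seed=2):
    """Simplified AR(1) trial (no stopping rules): returns the fraction of
    simulated trials with NA - NB >= 20, i.e. a sample size imbalance of 20
    or more patients toward arm A even though arm B is assumed superior."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_trials):
        sa = fa = sb = fb = 0  # successes/failures on each arm
        for _ in range(n_patients):
            # One posterior draw per arm implements Thompson's AR(1) rule.
            draw_a = rng.betavariate(0.25 + sa, 0.75 + fa)
            draw_b = rng.betavariate(0.25 + sb, 0.75 + fb)
            if draw_b > draw_a:
                if rng.random() < p_b: sb += 1
                else: fb += 1
            else:
                if rng.random() < p_a: sa += 1
                else: fa += 1
        if (sa + fa) - (sb + fb) >= 20:
            wrong += 1
    return wrong / n_trials

# Illustrative run for the case pA = 0.25, pB = 0.35.
pi_wrong = simulate_ar1_imbalance(0.25, 0.35)
```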
While all adaptive designs introduce bias in final estimators, the bias is much larger when AR methods are used. This involves the ideas of accuracy and precision. Higher accuracy of an estimator is quantified by smaller bias, whereas higher precision means smaller variability. Figure 3 illustrates distributions of the estimators of ΔB,A for the Equal GS and AR(1/2) designs, when pA = 0.25 and pB = 0.45, so the true effect is ΔB,A = 0.20. The distributions in Figure 3 do not have the symmetric, bell shapes seen in Figure 1, but rather are highly asymmetric and shifted to the right, corresponding to overestimation of ΔB,A = 0.20. Both AR estimators are inaccurate. The estimator's distribution for AR(1/2) is more disperse, and less precise, than that obtained from Equal GS. The mean estimate of ΔB,A is 0.30 for AR(1/2), so its bias is 0.10 (50%); the mean estimate is 0.23 for the Equal GS design, so its bias is 0.03 (15%). Using AR(1/2) triples the bias in the final estimator of the comparative treatment effect, compared with the Equal GS design. Figure 3 also shows that the AR(1/2) estimator will be 0.40 or larger, more than double the true value ΔB,A = 0.20, with probability 0.25, compared with probability 0.05 with Equal GS. Thus, at the end of a trial conducted using AR(1/2), in this case there is a 25% chance that one will overstate the actual benefit of treatment B over A by twofold or more.
As explained in the Appendix, for these AR designs B is declared superior to A if the posterior probability Prob(pA < pB|datan) > 0.99, and A is declared superior to B if this probability is <0.01. It may be argued that, to make the comparisons to Equal GS more fair, the cutoffs used in these stopping rules should be calibrated to obtain false-positive rates smaller than the values 0.18 for AR(1) and 0.24 for AR(1/2) in Table 1. If the cutoff 0.99 is replaced by 0.995 for AR(1), this ensures that the design will have type I error probability 0.05. We denote this design AR(1)*. Similarly, changing the cutoff from 0.99 to 0.9985 for AR(1/2) ensures type I error probability 0.05, and we call this AR(1/2)*. Table 1 shows that AR(1)* and AR(1/2)* both have very low power, however. When the true pB = 0.45, the Equal GS design power = 0.86, AR(1)* has power = 0.35 and AR(1/2)* has power = 0.40. Thus, if either AR decision rule is calibrated to have a type I error comparable with that of the Equal GS design, then the resulting AR design has much lower power to detect an actual treatment difference. With these larger decision cutoffs, there are no changes in π20, and bias still is substantial, so these problems persist.
To quantify the actual advantage afforded by AR to the patients in the trial, Table 2 gives the means and 95% probability intervals (2.5th and 97.5th percentiles) of total sample size, number of successes, and number of failures for AR(1)*, AR(1/2)* and the Equal GS design, in the case where the true pA = 0.25 and pB = 0.35. On average, compared with Equal GS, these AR methods give five or six more successes, and one or two more failures. The distributions of all three statistics are highly variable for all methods, however, so considering only mean values is very misleading. The small advantage in mean number of successes should be considered in light of the very small power figures for AR(1)* and AR(1/2)* shown in Table 1, which imply that these designs are unlikely to identify a true treatment advance and thus unlikely to benefit future patients, and in light of their large estimation bias.
A prominent argument against the use of AR is that it can lead to biased estimates in the presence of parameter drift. Drift occurs, for example, if pA and pB both increase by the same amount over time due to improving prognosis of patients enrolled over the course of the trial, but the comparative treatment effect ΔB,A remains constant. Karrison et al. discuss drift and recommend blocking to reduce potential bias due to drift. To examine the effects of drift, we re-simulated the cases in Table 1, but with pA and pB both increasing linearly from their initial values to final values 0.20 larger at the end of the trial. In each case, outcomes for arm A were simulated from the true values pA(n) = 0.25 + 0.20(n/200) when n patients had been accrued. Thus, the treatment effect ΔB,A remained constant throughout the trial, while both probabilities pA and pB increased over time. For example, in each case where nominally pB = 0.35, outcomes for the B arm were simulated using the probabilities pB(n) = 0.35 + 0.20(n/200) when n patients had been accrued, so pB(0) = 0.35 at the start of the trial and this drifted up to pB(200) = 0.55 at the end of the trial. Similarly, for arm A, pA(0) = 0.25 at the start and pA(200) = 0.45 at the end, but the difference was pB(n) − pA(n) = 0.10 throughout. The simulation results are summarized in the upper portion of Table 3. Drift has little effect on either the sample size distributions or π20, but it increases estimation bias of both AR methods for true pB = 0.35 or 0.45. In contrast, the bias for Equal GS is very small and does not change when drift is introduced. The largest effect of drift on the AR methods is that it causes them to conclude that pB > pA with much higher probability in all cases. This increases the power when pB = 0.35 or 0.45, but it also has the disastrous effect in the null case, where ΔB,A = 0, of increasing the false-positive rate of AR(1) from 0.18 to 0.36, and of AR(1/2) from 0.24 to 0.32.
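The drift scenario re-simulated above can be written down directly: both arms' response probabilities rise linearly with accrual while their difference stays fixed.

```python
def drifted_probs(n, total_drift=0.20, n_max=200, p_a0=0.25, p_b0=0.35):
    """Linear parameter drift used in the re-simulation: both arms' response
    probabilities increase by total_drift over the 200-patient trial, so the
    treatment effect pB(n) - pA(n) remains constant throughout."""
    p_a = p_a0 + total_drift * (n / n_max)
    p_b = p_b0 + total_drift * (n / n_max)
    return p_a, p_b
```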
With drift, the AR(1)* and AR(1/2)* designs have respective type I errors 0.20 and 0.10, compared with 0.08 for Equal GS, and both have power = 0.57 for pB = 0.45, compared with power = 0.87 with Equal GS. All AR methods have bias that is at least triple that of Equal GS. In the presence of drift, the AR(1)* and AR(1/2)* designs, calibrated to have type I error probability comparable with that of the Equal GS design, have actual type I error probability either about the same or more than double that of the Equal GS design, and greatly reduced power to detect the large improvement pB = 0.45. The values of π20 and magnitudes of bias change only slightly, so the problems with sample size imbalance in the wrong direction and large bias persist.
Since a total parameter drift of 0.20 over the course of a 200-patient trial may be considered large, we repeated the simulations with smaller total drift = 0.10, given in the lower portion of Table 3. With smaller overall drift, inflation of type I error for the AR methods is smaller, but the power figures are much smaller, specifically 0.45 for AR(1)* and 0.49 for AR(1/2)* compared with 0.86 for Equal GS at pB = 0.45, and the problems with large π20 for true pB = 0.35 and very large overestimation bias persist. A reviewer has noted that, in the presence of drift, theoretically, one can minimize the drift-induced inflation of type I error (and the drift-induced bias) in an AR design by performing a block-stratified analysis. This would further reduce the power of the trial, however.
Because progression-free-survival time or overall survival (OS) time often are used as the primary outcome in RCTs, it is useful to examine how similar AR methods applied with time-to-event outcomes behave compared with equal randomization. The following simulations focus on a trial comparing the median OS times of treatments B and A. We assume the commonly used Bayesian model in which the survival times in each arm follow exponential distributions, and their medians mA and mB follow identical inverse gamma priors, with means 12 months and variances 1000. This prior is equivalent to two observed events in ~13 months of follow-up, so it is minimally informative. In the simulations, all patients are followed until they have an event, and patient arrival times are simulated from a completely random (Poisson) process with an assumed accrual rate of 5 patients per month. For AR(1) and AR(1/2), it is concluded that mB > mA if the posterior probability Prob(mB > mA|data) > 0.99, and that mA > mB if Prob(mA > mB|data) > 0.99. To control type I error at 0.05, the cutoff 0.9968 was used for both AR(1)* and AR(1/2)*. For the Equal GS design, the randomization was restricted to achieve perfect balance between the two sample sizes when 50, 100, 150, and 200 patients had been enrolled, where the comparative tests were carried out. Additional details are given in the Appendix section.
Simulation results for the five designs comparing A to B in terms of OS times are summarized in Table 4. In the null case where both median OS times are mA = mB = 12 months, both AR(1) and AR(1/2) have much larger false-positive rates of 0.14 compared with 0.04 for the Equal GS design. Unlike the binary case, for event time outcomes the AR methods produce slightly biased estimators in the null case, although this may be due to simulation variation. When B actually is superior to A, both AR methods produce large bias, roughly double that of the Equal GS design. For example, when mB = 16 months, on average AR(1/2) gives estimated median survival 18.10 months, a 13% overestimate, compared with 16.78 months with the Equal GS design, a 5% overestimate. Thus, as in the binary outcome case, AR greatly overstates any actual advantage of B over A. When mB = 16 months, the probability of a sample size imbalance of 20 patients or more in the wrong direction is π20 = 0.11 for AR(1) and π20 = 0.07 for AR(1/2). When B has a smaller advantage over A, with mB = 14 months versus mA = 12 months (not tabled), these figures are much larger, with π20 = 0.22 for AR(1) and π20 = 0.16 for AR(1/2). When type I error is controlled to ensure a more fair comparison, as in the binary outcome case AR(1)* and AR(1/2)* have greatly reduced power. For true medians mA = 12 months and mB = 20 months, AR(1)* has power = 0.46 and AR(1/2)* has power = 0.53, compared with power = 0.70 for Equal GS. The values of π20 are unchanged; both AR* methods have somewhat smaller bias, but they still overestimate the actual median survival for arm B by much more than Equal GS.
As a final simulation study, we evaluated the sensitivity of the five designs to accrual rate, for the case mA = 12 months and mB = 16 months. These simulations, summarized in Table 5, show that π20 increases with accrual rate for the AR methods. The AR(1)* and AR(1/2)* methods show a very similar pattern, but again with much smaller power. This is notable since more rapid accrual often is considered desirable in clinical trials.
It appears that practitioners may not fully understand the properties of the AR methods that they have used. This may be due to examining only average behavior, or not being aware of the large estimation bias and loss of power produced by AR. The use of AR methods in place of equal randomization in clinical trials remains controversial, and there has been an active debate in the medical literature in recent years [12–15, 20, 21, 27].
Our computer simulations have identified and quantified the following problems with AR, which have important ethical and scientific consequences.
The severity of each problem depends on the particular AR method used and the design parameters. These problems with AR may be mitigated by adopting a fix-up strategy, such as stratifying or blocking, doing an initial ‘burn-in’ using equal randomization before applying AR, modifying the AR probabilities to shrink them toward 0.50, as in AR(1/2), or correcting for bias in some post hoc manner. With such fix-ups, however, the resulting gain in the number of patients treated with the superior treatment over the inferior treatment, if in fact ΔB,A does not equal 0, becomes much smaller.
We have not considered the multiarm case with more than two treatment groups. This case is much more complex, potentially involving both selection and testing, two or more stages, including a control arm or not, and many possible inferential goals. The issues discussed here should be examined when designing a multiarm randomized trial.
In conclusion, it is important to consider the actual gain with AR in light of the problems discussed here. Given our simulation results and the simulation studies of others, the scientific and ethical problems introduced by AR must be weighed against its putative advantage.
This research was partially supported by NIH/NCI grant RO1 CA 83932.
The authors have declared no conflict of interest.
We thank two anonymous referees for their detailed and constructive comments.
To define the adaptive randomization (AR) methods and statistical tests, denote the achieved sample size when an interim decision is made by n. For the case of binary outcomes, under a Bayesian model we assume that pA and pB follow independent beta(0.25, 0.75) priors. Bayesian AR probabilities were computed based on the posterior probability qn(A < B) = Prob(pA < pB|datan). This is the posterior probability that, given the current data, B has a higher response rate than A. For a fixed constant c, either c = 1 or c = 1/2, the AR method randomizes patients to B with probability

rn,B = qn(A < B)^c / {qn(A < B)^c + [1 − qn(A < B)]^c},

and to A with probability rn,A = 1 − rn,B. The version with c = 1, which we call ‘AR(1)’, was introduced by Thompson and has AR probability rn,B = qn(A < B). The version with c = 1/2, which we call ‘AR(1/2)’, is used commonly to shrink the AR probabilities toward 1/2, to avoid values of rn,B near 0 or 1. For both AR methods, the trial is stopped with B declared superior to A if qn(A < B) > 0.99, or A declared superior to B if qn(B < A) = 1 − qn(A < B) > 0.99, and these rules are applied continuously. As a comparator, we used a group sequential (GS) Bayesian design that randomizes patients equally between B and A in blocks of size 8 to avoid interim sample size imbalance, with a treatment comparison rule, applied at n = 50, 100, 150, and 200 patients, that stops and concludes that B is superior to A if
or that A is superior to B if the similar inequality holds with the roles of A and B reversed. If the trial is not stopped early, a final decision is made at N = 200. The decision cutoff parameters 0.95 and 0.80 were derived to ensure overall type I error probability <0.05. We refer to this design as ‘Equal GS’.
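The posterior quantity qn(A < B), which drives both the AR probabilities and the stopping rules in the binary case, can be estimated by straightforward Monte Carlo using beta conjugacy: each arm's posterior is beta(0.25 + responses, 0.75 + nonresponses). This is a sketch; a closed-form or numerical-integration computation would serve equally well.

```python
import random

def q_n(resp_a, n_a, resp_b, n_b, n_draws=20_000, seed=4):
    """Monte Carlo estimate of qn(A < B) = Prob(pA < pB | data) under
    independent beta(0.25, 0.75) priors: sample each arm's conjugate
    beta posterior and count how often the draw for B exceeds that for A."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        p_a = rng.betavariate(0.25 + resp_a, 0.75 + n_a - resp_a)
        p_b = rng.betavariate(0.25 + resp_b, 0.75 + n_b - resp_b)
        hits += p_a < p_b
    return hits / n_draws

# With 1/3 responses on A and 0/3 on B, qn(A < B) is well below 0.5,
# so AR(1) would randomize the next patient to B with low probability.
q = q_n(1, 3, 0, 3)
```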
For the case of time-to-event outcomes, the medians mA and mB were assumed to have inverse gamma priors with parameters (2.144, 13.728), which implies that mA and mB have prior mean 12 months and variance 1000. The criterion used to define rn,B and rn,A = 1 − rn,B was qn(A < B) = Prob(mA < mB|datan). The Equal GS trial was conducted using the decision rule to stop and declare B superior to A if
with decision parameters 0.996 and 0.01 chosen to ensure overall type I error probability <0.05. Since, for each treatment arm t = A, B, the posterior of mt is an inverse gamma distribution with parameters (2.144 + NEt, 13.728 + TFUt), where NEt = number of events and TFUt = total follow-up time in arm t, the decision criterion at each analysis is computed using qn(A < B) based on the most recent values of (NEA, TFUA) and (NEB, TFUB).
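The inverse gamma prior and its conjugate update can be checked numerically. The closed-form mean β/(α − 1) and variance β²/[(α − 1)²(α − 2)] recover the stated prior mean 12 and variance 1000 from the parameters (2.144, 13.728), and qn(A < B) can be estimated by sampling the two posteriors. The event counts and follow-up times in the usage line are hypothetical, for illustration only.

```python
import random

ALPHA, BETA = 2.144, 13.728  # inverse gamma prior parameters from the Appendix

# Closed-form prior moments of InverseGamma(ALPHA, BETA).
prior_mean = BETA / (ALPHA - 1)                            # = 12 months
prior_var = BETA**2 / ((ALPHA - 1)**2 * (ALPHA - 2))       # = 1000

def sample_inv_gamma(rng, alpha, beta):
    """If X ~ Gamma(alpha, scale = 1/beta), then 1/X ~ InverseGamma(alpha, beta)."""
    return 1.0 / rng.gammavariate(alpha, 1.0 / beta)

def q_n_surv(ne_a, tfu_a, ne_b, tfu_b, n_draws=20_000, seed=3):
    """Monte Carlo estimate of qn(A < B) = Prob(mA < mB | data) using the
    stated conjugate posterior InverseGamma(ALPHA + NEt, BETA + TFUt) per arm,
    where NEt = number of events and TFUt = total follow-up time in arm t."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        m_a = sample_inv_gamma(rng, ALPHA + ne_a, BETA + tfu_a)
        m_b = sample_inv_gamma(rng, ALPHA + ne_b, BETA + tfu_b)
        hits += m_a < m_b
    return hits / n_draws

# Hypothetical interim data: 10 events per arm, but arm B has more
# follow-up time per event, so its posterior median survival is longer.
q = q_n_surv(10, 120.0, 10, 200.0)
```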