Dynamic allocation of participants to treatments in a clinical trial has been an alternative to randomization for nearly 35 years. Design-adaptive allocation is a particularly flexible kind of dynamic allocation. Every investigation of dynamic allocation methods has shown that they improve balance of prognostic factors across treatment groups, but there have been lingering doubts about their influence on the validity of statistical inferences. Here we report the results of a simulation study focused on this and similar issues. Overall, it is found that there are no statistical reasons, in the situations studied, to prefer randomization to design-adaptive allocation. Specifically, there is no evidence of bias, the number of participants wasted by randomization in small studies is not trivial, and when the aim is to place bounds on the prediction of population benefits, randomization is quite substantially less efficient than design-adaptive allocation. A new, adjusted permutation estimate of the standard deviation of the regression estimator under design-adaptive allocation is shown to be an unbiased estimate of the true sampling standard deviation, resolving a long-standing problem with dynamic allocations. These results are shown in situations with varying numbers of balancing factors, different treatment and covariate effects, different covariate distributions, and in the presence of a small number of outliers.
A substantial amount of research has been devoted to the issue of how study participants should be allocated to treatment groups. Although treatment randomization has become the standard in clinical trials, there has also been a persistent undercurrent of questioning whether a random design selection is really better than a selection structured to have some kind of favorable statistical properties with respect to the treatment effect estimate. Dynamic allocations that use all past allocations to influence the current allocation have been advocated as being superior to randomization.
The source of the skeptical literature on randomization is probably the work of Donald Taves (Taves 1974), who argued that using a method that intended to balance treatment groups with respect to potential prognostic factors (which he called minimization) would do considerably better than randomization, at least in small trials. Following Taves’ suggestion, a number of researchers rapidly established the truth of his claim (Pocock & Simon 1975, Simon 1977, Forsythe & Stitt 1977, Klotz 1978, Pocock 1979, Simon 1979, Begg & Iglewicz 1980, Aickin 1982, Birkett 1985; summarized in Scott et al. 2002). It seems to be well-established that virtually any sensible strategy for achieving balance for prognostic factors does far better than randomization. If balance were the only statistical criterion, then the case would be closed, and dynamic allocations of the sort recommended by many authors should be the standard for clinical trials.
But there has been a contrary strain of skepticism about dynamic allocation. Granted that dynamic allocation improves balance, it does so by connecting different participants through their covariates and allocations in a way that is hard to analyze, and so the distributional properties of treatment effect estimators, in situations where dynamic allocation has been used, might not be accurately portrayed by conventional statistical methods (see Ebbutt et al. 1997 for an example). This complaint needs to be taken seriously, because there is virtually no theory to cover the relatively complex dynamic allocation algorithms that have been proposed (Senn 1997). The critical point is that a dynamic allocation assigns treatment based on the current participant’s covariates and all of the covariates and allocations of previous participants, and it is unclear what the consequences of this are for the sampling distributions of conventional treatment effect estimators. This point has often been elided in both the theoretical and simulation literature (see Kalish & Begg 1987, Green et al. 2001, Atkinson 2002). See Tu (2000) for a discussion that sets dynamic allocation into a framework for dealing with covariate imbalances, and a simulation study of minimization in the case of binary outcomes.
The very earliest departure from pure randomization simply allowed the randomization probabilities to vary as the trial unfolded, in order to increase the probability of having equal numbers of participants at the end (Efron 1971). This method has been extended in the obvious way to strata defined by combinations of values of discrete covariates (Brown 1980), but the disadvantage of this strategy is the limitation imposed by the number of strata that can be permitted. Most balance-focused approaches have been restricted to discrete covariates (see Endo et al. 2006 for an exception) and marginal balance (all minimization algorithms) even though this has been shown to be technically unnecessary (Aickin 1982, Aickin 1983). The dynamic allocation method examined here applies equally to discrete and continuous covariates, and extends trivially to balancing of interactions.
In the light of these considerations, the aim of this article is to compare randomization and design-adaptive allocation with regard to their effects on the sampling distribution of the usual regression estimator of a difference between two treatment groups, in cases where the regression model holds. Since the issues are more salient in small trials, we limit the total sample size to 100. Further, we focus on estimation rather than hypothesis testing, and express efficiency comparisons in terms of empirical standard deviations or numbers of participants wasted (defined below). A further focus is on the extent to which design conditions (such as treatment effect, number of covariates, covariate effects, sample size, and distribution of covariates) have an impact on the results. Due to the challenges presented by theoretical approaches, all results are obtained by simulation.
Design-adaptive allocation (DAA) is a form of dynamic allocation which uses only the covariates (and not the outcomes) to make treatment allocations (Aickin 2001, 2002; see Rosenberger 1996 for a review of response-adaptive allocations). In the two-group situation studied here, DAA can be simply described. As each participant presents for allocation, fit two logistic regressions, with treatment group as the outcome and the balancing factors as the predictors. The only difference between the two fits is that the new participant’s allocation is switched. The treatment is selected to minimize the maximized log likelihood of the corresponding logistic regression model. That is, the allocation is made so as to minimize association between the balancing factors and the treatment group assignment. Since such a step requires that the logistic regression model can be fitted by maximum likelihood, and since this is not possible until a certain configuration of data has been obtained, the initial allocations are done according to a randomly selected, balanced permutation. This involves only a small number of initial participants, generally in the range of 4–6 in the simulations presented below. In some versions of DAA the actual treatment selection is made not to minimize the log likelihood, but instead by a random choice that is heavily biased toward choosing the minimum. The version of DAA without this random step has been used in this article. Further, the basic algorithm just described was modified to include balance in the sizes of the two groups, in such a way that when the target sample size is reached, both groups are equal.
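The allocation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the simulations: the function names (`solve`, `clamp_eta`, `max_loglik`, `daa_choice`, `daa_allocate`) are hypothetical, the logistic fit uses plain Newton-Raphson with a very small ridge term and a clamped linear predictor purely for numerical safety (neither is part of the method as described), ties are broken at random, and the modification that forces equal group sizes at the target sample size is omitted because its details are not given here.

```python
import math
import random

def solve(a, b):
    """Solve a d = b by Gauss-Jordan elimination with partial pivoting."""
    p = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(p):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[j][p] / m[j][j] for j in range(p)]

def clamp_eta(b, row):
    """Linear predictor, clamped so that exp() stays well behaved (assumption)."""
    return max(-30.0, min(30.0, sum(bj * rj for bj, rj in zip(b, row))))

def max_loglik(z_rows, x, ridge=1e-4, iters=50):
    """Log likelihood at the (lightly ridge-stabilised) logistic fit of
    treatment (+1 coded 1, -1 coded 0) on an intercept and the covariates."""
    rows = [[1.0] + [float(v) for v in r] for r in z_rows]
    t = [1.0 if xi == 1 else 0.0 for xi in x]
    p = len(rows[0])
    b = [0.0] * p
    for _ in range(iters):  # Newton-Raphson iterations
        grad = [-ridge * bj for bj in b]
        hess = [[ridge if j == k else 0.0 for k in range(p)] for j in range(p)]
        for r, ti in zip(rows, t):
            pr = 1.0 / (1.0 + math.exp(-clamp_eta(b, r)))
            for j in range(p):
                grad[j] += (ti - pr) * r[j]
                for k in range(p):
                    hess[j][k] += pr * (1.0 - pr) * r[j] * r[k]
        step = solve(hess, grad)
        b = [bj + sj for bj, sj in zip(b, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    return sum(ti * clamp_eta(b, r) - math.log1p(math.exp(clamp_eta(b, r)))
               for r, ti in zip(rows, t))

def daa_choice(z_seen, x_prev, rng):
    """Treatment for the newest participant (the last row of z_seen)."""
    ll_plus = max_loglik(z_seen, x_prev + [1])
    ll_minus = max_loglik(z_seen, x_prev + [-1])
    if abs(ll_plus - ll_minus) < 1e-12:
        return rng.choice([1, -1])        # tie: either choice is equally balanced
    return 1 if ll_plus < ll_minus else -1  # smaller log likelihood = less association

def daa_allocate(z_rows, n_init=4, rng=None):
    """Allocate participants in order of entry; returns a list of +1 / -1."""
    rng = rng or random.Random()
    start = [1, -1] * (n_init // 2)       # initial balanced block, randomly permuted
    rng.shuffle(start)
    x = start[:len(z_rows)]
    for i in range(len(x), len(z_rows)):
        x.append(daa_choice(z_rows[:i + 1], x, rng))
    return x

rng = random.Random(7)                    # arbitrary seed for the illustration
z = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
x = daa_allocate(z, n_init=4, rng=rng)
```

Adding squares or products of the balancing factors as extra columns of `z` would, in this sketch, be the way to request balance of variances or interactions.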
DAA is easily extended to cases with more than two groups. It can be shown that DAA tends to balance the means of the prognostic factors across treatment groups (Aickin 1983), and in multiple uses of the method in practical trials this has been observed. Moreover, including squares of prognostic factors tends to balance both their means and variances across treatment groups. Further, including the product of two factors balances their interactions across treatment groups. Thus, it is often possible to produce the same first and second-order statistics of the prognostic factors within treatment groups, which is a substantial advantage over the corresponding situation with discrete-variable methods such as minimization.
All of the simulations generated outcome y according to the regression model y = α + βx + γ1z1 + γ2z2 + γ3z3 + u. Here x takes the values −1 or 1, representing the treatment group assignment. The covariates (z’s) are generated to have various distributions, as described below. The u-term is the normal regression residual, always uncorrelated with x and the z’s, with mean 0 and variance 1. All chance variables were generated with the random number generator of Stata (version 9.1, College Station TX), with a different seed for each simulation condition.
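As a rough illustration of this data-generating model, the following Python sketch draws outcomes for given treatment codes and covariates. It is not the Stata code used in the study: the function name `simulate_outcomes`, the seed, and the particular coefficient values are assumptions chosen only for the example, and Python's built-in generator stands in for Stata's.

```python
import random

def simulate_outcomes(x, z, alpha, beta, gammas, rng):
    """Generate y_i = alpha + beta*x_i + sum_k gamma_k*z_ik + u_i, u_i ~ N(0, 1)."""
    ys = []
    for xi, zi in zip(x, z):
        linear = alpha + beta * xi + sum(g * v for g, v in zip(gammas, zi))
        ys.append(linear + rng.gauss(0.0, 1.0))  # independent standard normal residual
    return ys

rng = random.Random(12345)                 # arbitrary seed for the illustration
n = 20
z = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]  # normal covariates
x = [1] * (n // 2) + [-1] * (n // 2)       # treatment coded +1 / -1
rng.shuffle(x)
# Double-covariate condition: the third covariate effect is set to zero.
y = simulate_outcomes(x, z, alpha=0.0, beta=0.5, gammas=[0.3, 0.3, 0.0], rng=rng)
```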
Most of the simulation conditions used normally distributed z’s (mean 0, variance 1). The cases were single covariate (γ2=0=γ3), double covariate (γ3=0), and triple covariate. In one case the first covariate was taken to be observed and the second unobserved (the third was absent), with varying correlations between the observed and unobserved covariates. In the final normal case, in addition to a single covariate, between 1 and 4 outlier covariate values of ±3 were introduced. In this context, the designation 3,−3 means that two covariate outliers were introduced, with these values.
In addition to the normal cases, two conditions were investigated in which the covariate was binary, once with a single covariate and once with two covariates. This case was included for its interest in balancing binary factors, such as gender. Finally, two cases (one single covariate, one double covariate) were run with exponentially distributed covariates (mean 1, but then centered at zero). This was to obtain information about the cases in which the covariates had a highly skewed distribution.
All simulations were sample-balanced, in the sense that exactly half of the participants were in each treatment group, under both randomization (RAN) and DAA. The (total) sample sizes were n=20(20)100 (that is, starting at 20 and proceeding in steps of 20 to 100).
Altogether four different estimators of the regression slope were investigated. The first two were the usual regression estimators in the model y = α + βx + u. They are referred to as the marginal RAN and marginal DAA estimators, since they are derived from this marginal model, and differ only in how the x-values were determined. RAN used restricted randomization (a random permutation of a vector half 1 and half −1), and DAA was as described above. The second two estimators were the usual regression estimators in the correct model, including whichever covariates were present (and portrayed as observed) in the simulation condition. They are designated the joint RAN and joint DAA estimators. This basic approach is similar to that of Rovers et al. (2000), except that they investigated only binary covariates, and used a minimization-based version of dynamic allocation.
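To make the distinction between the marginal and joint estimators concrete, here is a minimal Python sketch of restricted randomization and of ordinary least squares with and without a covariate. The helper names `restricted_randomization` and `ols`, the seed, and the effect sizes are assumptions made for the illustration; the least-squares fit is done through the normal equations, which is adequate for a handful of well-scaled predictors.

```python
import random

def restricted_randomization(n, rng):
    """Random permutation of a vector that is half +1 and half -1 (n even)."""
    x = [1] * (n // 2) + [-1] * (n // 2)
    rng.shuffle(x)
    return x

def ols(columns, y):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [float(col[i]) for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations (X'X) b = X'y, held as an augmented matrix.
    a = [[sum(r[j] * r[k] for r in design) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(design, y))] for j in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[j][p] / a[j][j] for j in range(p)]

rng = random.Random(2024)                  # arbitrary seed for the illustration
n = 40
x = restricted_randomization(n, rng)
z1 = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + 0.6 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z1)]
beta_marginal = ols([x], y)[1]      # model y = alpha + beta*x + u
beta_joint = ols([x, z1], y)[1]     # model that also includes the covariate
```

With x coded ±1 and equal group sizes, the marginal estimate is simply half the difference between the two group means of y.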
The gross summary statistics for all simulations appear in Table 1. These figures should be taken as being representative of the simulation conditions that were actually used, and not as extrapolations to general biomedical applications. This does not mean that they are unrealistic, however, since both the standardized treatment effects and standardized covariate effects ranged uniformly from 0 to 0.90, which covers the values that are frequently seen in actual biomedical studies, with perhaps some over-representation of the larger values.
There is no evidence that either the joint or marginal, RAN or DAA estimators have any bias in any of the simulations. With respect to the empirical standard deviations of the estimators (SDE), the pattern is the same across all simulation conditions. The SDEs for RAN and DAA are very close in the joint model, the marginal DAA SDE is somewhat higher, and then the marginal RAN SDE is considerably higher. As would be expected, all SDEs rise with the number of covariates. Interestingly, in the case of one observed and one unobserved covariate the marginal DAA and both joint estimators do somewhat worse than in the case of three covariates, while the marginal RAN does slightly better than in the triple covariate case, though it remains the least accurate of all estimators.
While the same pattern of SDEs is seen in the two binary covariate situations, here the effect of adding the second covariate appears to be less than it was in the normal covariate case. Further, the degree to which marginal RAN is less accurate than the other estimators is much diminished. The first of these comments applies equally to the two exponential covariate simulations, but for the second comment the inaccuracy of RAN seems to lie between the binary and normal cases.
The simulations with outlying covariates are of interest, because there are two opposing arguments, one suggesting less estimation efficiency and the other more efficiency. The first argument would be that the occasional outlier would contribute to the variance of the results, reducing efficiency. But the second is that in regression the variance of the slope estimate decreases with more spread in the distribution of the covariate. The results in Table 1 suggest that the latter effect wins out, since the outlier cases look very much like the case of a single normal covariate without outliers.
Preliminary simulations confirmed what has been reported elsewhere, that the conventional regression estimate of the SD of the marginal DAA estimator is far too large. The dilemma this presents is that although the marginal DAA estimator is more efficient than the marginal RAN estimator, without an estimate of the SD of the former there is no way to realize the efficiency benefit. The problem was addressed here by developing an adjusted permutation variance estimate. Specifically, each observed y was replaced by $y - \hat{\beta}x$, where $\hat{\beta}$ was the regression estimate. The DAA procedure was then applied to 30 random permutations of these replaced values. Empirical estimates of the SDE of the marginal DAA estimator were then computed for 10, 20, and all 30 permutations. There is no reason why a larger number could not be used, except that it did not seem to be necessary. The idea behind this strategy is that the adjusted values are estimates of the y-values that would have been observed under the null hypothesis of no x-effect, together with the usual argument for the permutation distribution. The unadjusted permutation strategy was evidently suggested by Efron (1971), but has only been followed up rarely (for example Ohashi 1990). Simulations not reported here showed that the unadjusted permutation estimate performed quite poorly.
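One possible reading of the adjusted permutation procedure is sketched below in Python. Several things here are assumptions rather than details taken from the text: the names `marginal_slope` and `adjusted_permutation_sd` are hypothetical, a "permutation" is interpreted as a random re-ordering of the participants' order of entry (each participant keeping their own covariates and adjusted outcome), and the allocation procedure is passed in as a function `allocate(z_rows, rng)` so that any dynamic allocation routine can be plugged in.

```python
import random
import statistics

def marginal_slope(x, y):
    """OLS slope of y on x in the model y = alpha + beta*x + u."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def adjusted_permutation_sd(x, y, z_rows, allocate, n_perm=30, rng=None):
    """SD of the marginal slope over re-allocations of permuted, adjusted data."""
    rng = rng or random.Random()
    beta_hat = marginal_slope(x, y)
    # Remove the estimated treatment effect: approximate outcomes under no x-effect.
    y_adj = [yi - beta_hat * xi for xi, yi in zip(x, y)]
    order = list(range(len(y)))
    estimates = []
    for _ in range(n_perm):
        rng.shuffle(order)                       # new order of entry (assumption)
        z_perm = [z_rows[i] for i in order]
        y_perm = [y_adj[i] for i in order]
        x_new = allocate(z_perm, rng)            # re-run the allocation procedure
        estimates.append(marginal_slope(x_new, y_perm))
    return statistics.stdev(estimates)

def stand_in_allocation(z_rows, rng):
    """Placeholder allocation (restricted randomization), used only for this demo."""
    x = [1, -1] * (len(z_rows) // 2)
    rng.shuffle(x)
    return x

rng = random.Random(11)                          # arbitrary seed for the illustration
z = [[rng.gauss(0, 1)] for _ in range(20)]
x = stand_in_allocation(z, rng)
y = [0.5 * xi + 0.6 * zi[0] + rng.gauss(0, 1) for xi, zi in zip(x, z)]
sd_hat = adjusted_permutation_sd(x, y, z, stand_in_allocation, n_perm=30, rng=rng)
```

In an actual application the placeholder `stand_in_allocation` would be replaced by the design-adaptive allocation routine itself.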
A comparison of the conventional and adjusted permutation SDE estimators is given in Table 2. The overestimation of the conventional estimator should be interpreted in the light of the actual marginal DAA SDEs in Table 1, leading to the conclusion that except perhaps in the binary covariate cases it is quite substantial. In contrast, the adjusted permutation estimator has very little if any bias, and a considerably smaller SD. We conclude that in the simulation conditions studied here, there is an adequate estimator of the SDE of the marginal DAA estimator. It may be worth noting that the permutation estimates based on 10 and 20 permutations also have negligible bias, but with somewhat larger SDs (results not shown).
In small studies the efficiencies of estimators are most clearly expressed in terms of their effects on absolute sample size. Here we suppose that a RAN estimator has variance $\sigma_R^2$ when we use $n_R$ participants. Taking the variance of an estimator to be inversely proportional to the sample size, an equivalent experiment using a DAA estimator with variance $\sigma_D^2$ would require sample size $n_D$ satisfying
$$\frac{n_D}{n_R} = \frac{\sigma_D^2}{\sigma_R^2}.$$
Rewrite to get
$$n_R - n_D = n_R\left(1 - \frac{\sigma_D^2}{\sigma_R^2}\right).$$
The left side of this expression represents the number of participants wasted by RAN, in the sense that the RAN study of $n_R$ participants could be reduced by this number in order to obtain an equivalent DAA study with $n_D$ participants. If this number is negative, then it represents participants wasted by DAA. Although not evident from its definition, one of the attractive features of this measure is that it has a roughly normal distribution over simulation conditions. More importantly, however, it provides the clinical researcher with a practical assessment of the choice between RAN and DAA in terms directly relevant to the conduct and cost of the study. A similar approach to the comparison of allocation methods has been used previously (Atkinson 2002).
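As a small worked illustration, the number of participants wasted by RAN can be computed from the two estimator variances. The sketch assumes that the variance of each estimator is inversely proportional to the sample size, so that the wasted number is the RAN sample size multiplied by one minus the ratio of the DAA variance to the RAN variance; the function name and the numerical values are hypothetical.

```python
def participants_wasted_by_ran(n_ran, var_ran, var_daa):
    """Sample size reduction available by switching from RAN to DAA.

    Assumes estimator variance is inversely proportional to sample size;
    a negative value means participants are wasted by DAA instead.
    """
    return n_ran * (1.0 - var_daa / var_ran)

# Hypothetical example: SDs of 0.20 (RAN) and 0.19 (DAA) with 100 participants.
wasted = participants_wasted_by_ran(100, 0.20 ** 2, 0.19 ** 2)
```

In a simulation study the two variances would be the squared empirical standard deviations of the estimators under each allocation method.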
Results presented here are from the viewpoint that the entire distribution of wasted participants is important. The mean of this distribution is frequently portrayed as an adequate summary of the efficiency comparison of two estimators, but this ignores how variable the measure is, and in particular it gives no information about the penalty that can be expected among segments such as the 10% of studies with the worst participant wastage. It seems preferable to advise clinical researchers about not only the mean or median, but also the potential extreme inefficiencies that their design choices may entail.
The results are presented in Figure 1 as the cumulative distribution function of the number of participants wasted by RAN, where the joint RAN and DAA estimators are used. For the normal covariate cases it seems reasonable to posit that the median is about 0.7 times the number of covariates. This includes the case of one unobserved covariate, provided we take the number of covariates to be those observed, and it also includes cases with outliers. The mean is also proportional to the number of covariates, but by a higher factor somewhere between 0.8 and 0.9. Remarkably the distribution of participants wasted by RAN appears to be dependent only on the number of covariates, and not on the form of their distribution, since the binary and exponential cases mirror the normal cases.
It is important to note that the distributions shown in Figure 1 are essentially uninfluenced by any of the factors used to define the simulation conditions. Exhaustive regressions of this measure on linear and quadratic expressions in these factors over all simulations found at most trivial effects, most of which failed to reach statistical significance despite the enormous number of simulations. In particular, these distributions are not influenced by sample size, as has been reported for some time.
The corresponding results for the marginal RAN and DAA estimators are shown in Figure 2. The general message seems rather clear. Increasing numbers of participants are wasted by RAN as the sample size increases, and this situation worsens as the number of covariates rises. As in the joint case, the marginal results do not seem to be particularly sensitive to the distribution of the covariates. Again exhaustive regression and stratified analyses failed to find any simulation factor (other than sample size) that had any appreciable effect at all on these distributions.
The results reported here resolve several of the issues that have been raised about dynamic allocation, as they pertain to design-adaptive allocation, at least in the linear regression situation. First, there is no evidence of bias in either marginal or joint regression estimators using design-adaptive allocation, over a range of simulation conditions. Secondly, there is a confirmation of previous suggestions that conventional marginal (ignoring covariates) estimates of the sampling standard deviation of the design-adaptive estimator are too high, and further the overestimation is found to be unacceptably substantial. This phenomenon has formed the basis of the widely held belief that the marginal analysis should not be used with dynamic allocation (see Tu 2000 for a summary). Thirdly, a permutation-based estimate of the marginal sampling standard deviation of the design-adaptive estimator is found to be essentially unbiased, which suggests that the adjusted permutation strategy might generally solve this problem, and thus permit valid marginal analysis for use with design-adaptive allocation. Fourthly, the marginal randomization estimator shows high risk of being very inefficient. This is of concern because many clinical trials use this procedure to evaluate their primary results, and so the promotion of randomization has perhaps influenced researchers to believe that the search for prognostic factors is less important than it really is. Fifthly, the conventional line of argument, that using the joint (including covariates) model removes nearly all of the inefficiency penalty of randomization, is not substantiated, in that the fraction of trials with a relatively large number of participants wasted by randomization can be rather high. The fourth and fifth points do not seem to depend on any of the simulation parameters, so that at least in the regression situation there is a strong suggestion that they are general.
Sixth, the assertion that randomization is the only method that removes or reduces the effect of unmeasured prognostic factors is disproved, and in fact the cases with an unobserved covariate provide substantially the same results as the cases with all covariates measured. Seventh, the presence of a small number of outliers (between 1 and 4) does not have any appreciable effect on the above findings. Eighth, dynamic allocation is not invariably more efficient than randomization, since there is always a non-negligible probability that the former method will waste participants.
The overall conclusion from these simulations is that in regression situations, design-adaptive allocation provides a valid method of placing participants into two treatment groups, for the purpose of estimating treatment effects in an unbiased and efficient manner. If one wants to estimate future treatment benefits in a population, then the marginal estimation approach based on a randomized design is unacceptably worse than the same approach based on design-adaptive allocation. If one wants to estimate future benefits within segments of a population defined by specific values of the covariates, then the randomization penalty in a joint estimation approach is much smaller, while remaining problematic for small trials. Thus, in the settings covered by these simulations there appear to be no statistical reasons for preferring randomization to design-adaptive allocation.