Despite the potential benefits of sequential designs, studies evaluating treatments or experimental manipulations in preclinical experimental biomedicine almost exclusively use classical block designs. Our aim with this article is to bring the existing methodology of group sequential designs to the attention of researchers in the preclinical field and to clearly illustrate its potential utility. Group sequential designs can offer higher efficiency than traditional methods and are increasingly used in clinical trials. Using simulation of data, we demonstrate that group sequential designs have the potential to improve the efficiency of experimental studies, even when sample sizes are very small, as is currently prevalent in preclinical experimental biomedicine. When simulating data with a large effect size of d = 1 and a sample size of n = 18 per group, sequential frequentist analysis consumes in the long run only around 80% of the planned number of experimental units. In larger trials (n = 36 per group), additional stopping rules for futility lead to the saving of resources of up to 30% compared to block designs. We argue that these savings should be invested to increase sample sizes and hence power, since the currently underpowered experiments in preclinical biomedicine are a major threat to the value and predictiveness in this research domain.
Group sizes in preclinical research are seldom informed by statistical power considerations but rather are chosen on grounds of practicability [1, 2]. Typical sample sizes are small, around n = 8 per group (http://www.dcn.ed.ac.uk/camarades/), and are only sufficient to detect relatively large effects. Consequently, true positives are often missed (false negatives), and many statistically significant findings are due to chance (false positives). Such results lack reproducibility, and the effect sizes are often substantially overestimated (“Winner’s curse”) [2–5]. Therefore, various research bodies (e.g., National Institutes of Health, United Kingdom Academy of Medical Sciences) have called for increased sample sizes [5, 6], as well as other design improvements in preclinical research. Yet, such calls potentially conflict with the goal of minimizing burdens on animals. Here, we propose the use of sequential study designs to reduce the number of experimental animals required, as well as to increase the efficiency of current preclinical biomedical research. Moreover, our aim with this article is to bring the existing methodology of group sequential designs to the attention of researchers in the preclinical field and to clearly illustrate its potential utility.
Conventional study designs in experimental preclinical biomedicine use nonsequential approaches, in which group sizes are predetermined and fixed, and the decision to either accept the (alternative) hypothesis or fail to reject the null hypothesis is made after spending all experimental units in each group. In contrast, a group sequential design is a type of adaptive design that allows for early stopping of an experiment because of efficacy or futility, based on interim analyses before all experimental units are spent [7–9], thereby offering an increase in efficiency.
However, interim analyses come at a statistical cost, and special analysis methods and careful preplanning are required. Traditional frequentist statistics can be used to split the overall probability of type I error (α–error) to account for multiple testing [10, 11], but Bayesian methods are particularly suited, as they can incorporate information from earlier stages of the study. Moreover, Bayesian analysis enables the researcher to use prestudy information as a basis for the prior information about the measure of interest [8, 9]. As the prior is potentially subjective and the gained posteriors highly dependent not only on the data but also on the chosen prior, the practice of informed priors is hotly contested. Noninformative priors are an option to circumvent this concern [12, 13].
Group sequential designs are increasingly used in clinical research [8, 14]. So far, however, they are virtually nonexistent in preclinical experiments. We performed text-mining of the complete PubMed Central Open Access subset (time frame: 2010–2014) and found only one article explicitly describing an original study evaluating a treatment in rats or mice using a sequential design (S1 Text).
To explore the potential for group sequential designs to increase the efficiency of preclinical studies, we simulated data for two-group comparisons with different effect sizes and compared the “costs,” measured as the number of animals required, of different group sequential designs with those of a traditional nonsequential design (S1 Text).
We simulated a mouse experiment in which 36 animals are allocated to two groups. Currently, in most domains of preclinical medicine, group sizes of ten or fewer are prevalent, leading to grossly underpowered studies. A group size of 18 animals per group allows the detection of a standardized effect size of d = 1, given traditional constraints of alpha = 0.05 and beta = 0.20. A block design typically used in this type of study needs to include all animals before data analysis. In a group sequential design, an interim analysis is conducted, and a predefined set of rules determines whether the experiment should be continued or not (Fig 1).
Here, we demonstrate only some of many possible analysis approaches (frequentist sequential with O’Brien–Fleming boundaries, with Pocock boundaries [S1 Table], Bayes Factor, and Bayes credible intervals, Table 1). See Box 1 for other approaches and references.
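The link between group size, effect size, alpha, and beta stated above can be checked roughly with a normal approximation to the two-sample test. The sketch below is an illustration only: the function names are hypothetical, and the normal approximation is an assumption that gives somewhat smaller group sizes than a calculation based on the t distribution.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided two-group comparison.

    Uses n = 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2 (normal approximation).
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / d ** 2


def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-sided test with n animals per group."""
    shift = d * sqrt(n / 2)  # expected value of the z statistic under effect d
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(crit - shift)) + STD_NORMAL.cdf(-crit - shift)


if __name__ == "__main__":
    print("n per group for d = 1:", ceil(n_per_group(1.0)))
    print("approx. power, d = 1, n = 18:", round(approx_power(1.0, 18), 3))
    print("approx. power, d = 1, n = 8:", round(approx_power(1.0, 8), 3))
```

Under this approximation, the d = 1 case requires roughly 16 animals per group, so 18 per group comfortably meets the stated constraints, whereas a group size of 8 yields power only slightly above one half even for such a large effect.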
Is it feasible for the planned study:
Software for deriving and describing group sequential designs (including power considerations and sample size estimation):
The O’Brien–Fleming boundaries in the frequentist sequential approach keep the alpha level for the final analysis (stage 3) approximately as high as for the classical block design. Additionally, the same scenarios using Pocock boundaries can be found in S1 Table. It should be noted that the frequentist approaches refer to null hypothesis significance testing, whereas the Bayes Factor approach is essentially a model comparison, and the other Bayesian approach uses credible intervals for estimates. These are different methods that might answer different research questions, as outlined by Morey et al. Here, however, we used all methods for deriving stopping criteria and decisions about efficacy or futility.
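To make the two frequentist boundary types concrete, the following sketch builds three-stage critical values and checks their overall type I error by simulation. It rests on several assumptions that are not taken from the article: a z-test with known variance, equally sized stages, and the constants 2.289 (Pocock) and 2.004 (O’Brien–Fleming) as commonly tabulated for three looks at a two-sided alpha of 0.05. All function names are hypothetical.

```python
import random
from math import sqrt

K = 3             # number of analyses (two interim looks plus the final one)
POCOCK_C = 2.289  # tabulated Pocock constant, K = 3, two-sided alpha = 0.05
OBF_C = 2.004     # tabulated O'Brien-Fleming constant, same setting


def pocock_bounds(k_total=K, c=POCOCK_C):
    """Pocock: the same critical |z| at every analysis."""
    return [c] * k_total


def obf_bounds(k_total=K, c=OBF_C):
    """O'Brien-Fleming: very strict early, close to 1.96 at the final analysis."""
    return [c * sqrt(k_total / k) for k in range(1, k_total + 1)]


def type_one_error(bounds, n_sim=200_000, seed=7):
    """Monte Carlo estimate of the overall false-positive rate under the null."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        s = 0.0  # running sum of independent, standardized stage increments
        for k, bound in enumerate(bounds, start=1):
            s += rng.gauss(0.0, 1.0)
            if abs(s / sqrt(k)) >= bound:  # z statistic at analysis k
                hits += 1
                break
    return hits / n_sim


if __name__ == "__main__":
    print("O'Brien-Fleming bounds:", [round(b, 3) for b in obf_bounds()])
    print("Pocock bounds:", pocock_bounds())
    print("overall alpha (OBF):", type_one_error(obf_bounds()))
    print("overall alpha (Pocock):", type_one_error(pocock_bounds()))
```

The final O’Brien–Fleming bound stays close to the fixed-design value of 1.96, which is the property described above, whereas Pocock boundaries spend alpha evenly and therefore require a stricter final threshold.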
Our simulations showed that in an experimental setting typical for current experimental biomedicine, if the effect exists, group sequential designs have lower costs because of early stopping for futility or efficacy (Table 1). With a large true effect size (d = 1) and n = 18 per group, sequential analyses that stop for significance reduce the costs by up to 20%, while the power of these analyses does not differ from that of the traditional block design. Underpowered studies (d = 0.5 scenarios, Table 1) show only approximately 30% power for classical as well as sequential approaches, while the reduction in costs through sequential design is minor. This stresses the need for sufficiently powered studies even with sequential analyses. As expected, average effect sizes among successful experiments are overestimated in the traditional approach and slightly more so in the sequential design. Larger experiments that can stop for both success and futility show a similar pattern: sequential analysis has power similar to that of the traditional approach, while costs are reduced substantially.
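The trade-off between power and animals used can be explored with a small simulation. This is a simplified sketch rather than the simulation reported in S1 Text: it assumes normally distributed outcomes with known unit variance (a z-test instead of a t-test), three equally sized stages of six animals per group, stopping for efficacy only, and tabulated O’Brien–Fleming bounds for three looks. The function names are hypothetical.

```python
import random
from math import sqrt

# Tabulated two-sided O'Brien-Fleming z-bounds for 3 looks, alpha = 0.05 (assumed).
OBF_BOUNDS = (3.471, 2.454, 2.004)


def run_trial(d, per_stage, bounds, rng):
    """Simulate one experiment; return (rejected, animals used per group)."""
    sum_diff = 0.0  # running sum of treated-minus-control differences
    n = 0           # animals per group analysed so far
    for bound in bounds:
        for _ in range(per_stage):
            sum_diff += rng.gauss(d, 1.0) - rng.gauss(0.0, 1.0)
        n += per_stage
        z = (sum_diff / n) / sqrt(2.0 / n)  # z statistic with known sd = 1
        if abs(z) >= bound:
            return True, n  # stop early for efficacy
    return False, n


def simulate(d, per_stage=6, bounds=OBF_BOUNDS, n_sim=20_000, seed=1):
    """Return (power, mean fraction of the planned animals actually used)."""
    rng = random.Random(seed)
    n_max = per_stage * len(bounds)
    rejections = 0
    used = 0
    for _ in range(n_sim):
        rejected, n = run_trial(d, per_stage, bounds, rng)
        rejections += rejected
        used += n
    return rejections / n_sim, used / (n_sim * n_max)


if __name__ == "__main__":
    for effect in (1.0, 0.5):
        power, fraction = simulate(effect)
        print(f"d = {effect}: power = {power:.2f}, fraction of animals used = {fraction:.2f}")
```

Under these assumptions, the sketch reproduces the qualitative pattern described above: a large effect allows a noticeable saving at essentially unchanged power, while an underpowered scenario saves very little.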
The simulations above differ from the real-world setting in which, despite setting out to detect an effect beyond a certain (biological) threshold, we never know the true effect size a priori. In another set of simulations, we therefore assumed a specific distribution of true effect sizes within the universe of studies that can be performed. Such distributions may vary between fields of research. This is relevant because the predictive probability of a “statistically significant” signal, i.e., the probability that a significant result really reflects a true effect, depends on both the effect size distribution and the chance of stopping an experiment early. To understand the ability to predict in a real-world setting, we simulated analyses with two different distributions of effect estimates: one optimistic and one pessimistic (Fig 2, S1 Fig). Through these simulations, we estimated the probabilities of obtaining an effect of any size d > 0 or at least size d ≥ 0.5 for both the traditional frequentist approach and group sequential designs. Overall, there are no major differences in these probabilities between the traditional and sequential approaches, despite the fact that the latter uses fewer animals. More importantly, these results show that the main driver behind these probabilities is the a priori distribution of effect sizes (optimistic versus pessimistic).
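The dependence of the predictive probability on prior assumptions can be illustrated with a deliberately simplified two-point model, in which a tested hypothesis is either truly null or has a true effect that is detected with a given power. This is not the continuous effect size distribution used in the simulations above; the prior probabilities below are hypothetical values chosen only for illustration.

```python
def positive_predictive_value(prior_true, power, alpha=0.05):
    """Probability that a significant result reflects a true effect.

    prior_true: assumed share of tested hypotheses with a real effect
    power:      probability of a significant result when the effect is real
    alpha:      probability of a significant result when there is no effect
    """
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical "optimistic" and "pessimistic" research fields.
    for label, prior in (("optimistic", 0.5), ("pessimistic", 0.1)):
        for power in (0.8, 0.3):
            ppv = positive_predictive_value(prior, power)
            print(f"{label} prior = {prior}, power = {power}: PPV = {ppv:.2f}")
```

In this simplified model, changing the prior share of true effects moves the predictive probability at least as much as changing the power does, which is in line with the observation that the a priori distribution is the main driver.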
To the best of our knowledge, there are no groups or programs currently implementing sequential designs in preclinical experimental studies evaluating the efficacy of treatments or interventions. However, we are aware that the practice of interim analyses is applied informally when a statistically significant effect is desired but not found, and the analyses are rerun until significance has been achieved (a practice known as “p-hacking”). Clearly, this practice inflates false-positive rates, as it violates the preset type I error (α–error) probability by not accounting for multiple testing in these unplanned interim analyses.
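The inflation caused by unplanned interim looks is easy to demonstrate by simulation. The sketch below assumes a z-test with known variance and equally sized batches of data, and it tests at the unadjusted two-sided 5% level after every batch; the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_looks, alpha=0.05, n_sim=100_000, seed=11):
    """Share of null experiments declared significant at any of n_looks analyses."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # unadjusted critical value, about 1.96
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        s = 0.0  # running sum of standardized batch results under the null
        for k in range(1, n_looks + 1):
            s += rng.gauss(0.0, 1.0)
            if abs(s / sqrt(k)) >= crit:  # "significant" at look k, stop and report
                hits += 1
                break
    return hits / n_sim


if __name__ == "__main__":
    for looks in (1, 2, 3, 5):
        print(f"{looks} unadjusted look(s): false-positive rate = {false_positive_rate(looks):.3f}")
```

With a single analysis the rate stays near the nominal 5%, but it rises with every additional unadjusted look, which is why planned sequential designs use adjusted boundaries.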
Despite the benefits suggested by our simulations, sequential approaches have properties that may limit their application in preclinical experimental biomedicine. The clearest disadvantage of group sequential designs is that each next stage can only be started after the outcome of the preceding stage is fully assessed and analyzed. Sequential analysis may require additional resources to set up, regulate, and monitor the independence of interim analyses, as well as additional statistical expertise. Another consideration is that a step-by-step design might increase the impact of batch and learning effects. However, the largest obstacle might be lack of familiarity with these methods in the field and amongst animal ethics committees, editorial boards, and peers. With this paper, we aim to spur the discussion and stimulate others to consider using sequential designs to increase the efficiency of their studies. Moreover, if in vivo researchers are to get ethical approval for this approach from their various committees, this article might help persuade those committees.
We posit that a substantial number of experiments in preclinical biomedicine can be planned and executed with batch sizes and sufficiently short intervals between treatments and outcome assessments to render them amenable to group sequential design–based methods (for an example, see S2 Text). Sequential designs can lead to a substantial reduction in the use of animal resources. When these savings are invested in increased sample sizes (which, paradoxically, may not be higher than the current ones), sequential designs have the potential to increase the predictive ability of preclinical biomedical experiments and to reduce the current unacceptable levels of waste due to underpowered studies.
German Federal Ministry of Education and Research (BMBF) www.bmbf.de (grant number 01EO1301). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Provenance: Not commissioned; externally peer reviewed.