STAIR VI met in the aftermath of several failed stroke trials in which the preclinical data, collected partly after the initial STAIR preclinical recommendations, and the initial clinical trial results had appeared promising. Although there are undoubtedly numerous potential reasons for these disappointing outcomes, one question that STAIR VI addressed is whether applying externally derived standards to stroke research would improve the likelihood of identifying effective stroke therapies. A reasonable question is whether, as a group, stroke researchers need explicit standards to help ensure that their research is robust and reproducible.
In 2006, O’Collins et al2 performed a systematic review that extracted data on 1026 neuroprotective strategies tested in 8516 experiments relevant to stroke and published in ≈3500 articles between 1957 and 2003. This study used a simple checklist derived from STAIR I to provide an overview of the quality and breadth of data available for individual therapies. Only 5 of the 550 drugs reported to be effective in animal models of focal ischemia were tested in a manner that fully met this interpretation of the STAIR criteria. An initial assessment of the NXY-059 preclinical program suggested that it closely fulfilled the STAIR criteria, but a subsequent analysis showed that adherence was not absolute.3
One observation in the O’Collins et al2 systematic review was a relationship between increasing study quality score (based on adherence to the STAIR I criteria) and declining efficacy.2
It appeared that poor-quality studies overestimated efficacy, a phenomenon partially attributable to bias from lack of randomization and blinding. Similar striking observations have been made in some, but not all, of a series of detailed meta-analyses of the efficacy of individual drugs; the effect was particularly pronounced for FK506.4
Systematic review and meta-analysis of the data for 13 putative neuroprotectants revealed that the presence or absence of randomization to a treatment group, blinding of drug assignment during stroke induction, and blinding of outcome assessments were among the most powerful determinants of outcome.5
For example, studies of NXY-059 reported that efficacy was significantly lower in randomized studies (20.3% vs 52.8%) and in those that reported allocation concealment between cerebral ischemia induction and outcome assessment (25.1% vs 54.0%).6
In studies of hypothermia, these effects were less marked (37% vs 47% and 39% vs 47%, respectively) but still present.7
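To make the size of these biases concrete, the short Python sketch below expresses each comparison quoted above as a simple ratio of the less rigorous subgroup's reported efficacy to the more rigorous subgroup's. It uses only the percentages already cited; no additional data are assumed.

```python
# Worked arithmetic on the subgroup efficacies quoted above: how much larger
# is the reported efficacy in the less rigorous subgroup of each comparison?
pairs = {
    "NXY-059: nonrandomized vs randomized": (52.8, 20.3),
    "NXY-059: no concealment vs concealment": (54.0, 25.1),
    "Hypothermia: nonrandomized vs randomized": (47.0, 37.0),
    "Hypothermia: no concealment vs concealment": (47.0, 39.0),
}
for label, (less_rigorous, rigorous) in pairs.items():
    print(f"{label}: {less_rigorous / rigorous:.1f}-fold overestimate")
```

For NXY-059, the apparent efficacy more than doubles in the absence of randomization or allocation concealment, whereas the hypothermia estimates shift by roughly 20% to 30%.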
Perhaps because of the frustration engendered by the failure to translate apparently efficacious animal neuroprotectants into human stroke therapies, and in response to previous STAIR recommendations, stroke researchers are performing better-quality studies than in the past. However, stroke experimentalists still report random allocation to treatment group in only 36% of studies, allocation concealment in only 11%, and blinded assessment of outcome in only 29%.8
Sample size calculation illustrates the influence of these issues on experimental results. The probability of detecting a difference between groups is related to the magnitude of the difference, the variability in the outcome measures, and the number of times the population is sampled, in this case the number of animals per group. In systematic reviews of the preclinical stroke literature, only 3% of studies report using a sample size calculation.8
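To make the dependence on effect magnitude, variability, and group size concrete, here is a minimal sketch of a prospective sample size calculation for a two-group comparison, assuming a two-sample t-test. The means and standard deviation are illustrative assumptions, not figures from the cited reviews.

```python
# Minimal sketch: a priori sample size calculation for a two-group preclinical
# comparison (e.g., infarct volume in treated vs control animals), assuming a
# two-sample t-test. The means and SD below are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

mean_control = 100.0  # hypothetical mean infarct volume, control group
mean_treated = 70.0   # hypothetical mean under the anticipated treatment effect
sd = 30.0             # hypothetical common standard deviation

effect_size = (mean_control - mean_treated) / sd  # standardized effect (Cohen's d)

# Animals per group needed for 80% power at a two-sided alpha of 0.05.
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Required animals per group: {n_per_group:.1f}")  # ~17 for d = 1.0
```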
Consider a worst-case scenario: assume that most authors performed but did not report power calculations, used the minimum calculated sample size, and did not account for the falsely large effect size estimates that arise from failure to randomize allocation to treatment group. Under these assumptions, >60% of studies might have been underpowered to detect real differences between treatment and control groups. If lack of allocation concealment is also considered, the proportion of potentially underpowered studies increases to nearly 90%. If the sample size actually required to detect a particular effect is 20 animals but only 18 are used, then potentially all 18 have been wasted; if 22 are used, the extra 2 have still contributed useful data. Although such scenarios almost certainly do not apply to most of the papers evaluated, without appropriate reporting of sample size calculations it is impossible to know where they do.
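The same machinery illustrates how bias compounds the problem. As a sketch under stated assumptions (infarct volume with a control mean of 100 and an SD of 30, so that a reported efficacy of X% maps to an absolute difference of X volume units), a study planned from the inflated NXY-059 estimate in nonrandomized studies (52.8%) rather than the randomized estimate (20.3%) ends up with far less than 80% power against the smaller effect.

```python
# Sketch of the worst-case argument: a sample size planned from a biased
# (inflated) effect estimate leaves the study underpowered for the true effect.
# Assumptions: control mean 100, SD 30, so X% efficacy ~ X volume units.
from statsmodels.stats.power import TTestIndPower

sd = 30.0
d_inflated = 52.8 / sd  # effect size implied by nonrandomized NXY-059 studies
d_rigorous = 20.3 / sd  # effect size implied by randomized studies

analysis = TTestIndPower()

# Group size chosen when planning from the inflated estimate (80% power).
n_planned = analysis.solve_power(effect_size=d_inflated, alpha=0.05, power=0.8)

# Actual power of that group size against the smaller, less biased effect.
actual_power = analysis.power(
    effect_size=d_rigorous, nobs1=round(n_planned), alpha=0.05
)
print(f"Planned n/group: {n_planned:.1f}; power vs rigorous effect: {actual_power:.2f}")
```

Under these illustrative assumptions, the planned group size is in the single digits and its power against the randomized-study effect falls well below the conventional 80%.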
There is precedent for research standards being widely accepted and applied. Clinical trialists adhere to the Consolidated Standards of Reporting Trials (CONSORT) statement as a requirement for publication, which has led to substantial improvements in the reporting and conduct of clinical trials.9,10
On the basis of the available evidence, it would now seem prudent that preclinical studies of therapeutic efficacy in animal models of stroke adopt similar standards for conducting and reporting experiments, to ensure high-quality, unbiased data.8,11
However, several of the authors acknowledge that we have not complied entirely with such standards in the past. Recently, 2 journals that publish experimental therapeutic stroke studies have acted on this suggestion and will consider articles for publication only if the methodology section includes the criteria outlined below.12,13
These standards should not preclude publication of observational, pilot, or hypothesis-generating data, but the conclusions of such studies should reflect their preliminary nature. The tendency of journals to reject reports of negative results could be addressed by establishing central repositories of preclinical results, as has been done for clinical trial data.
Recommendations for Ensuring Good Scientific Inquiry
The following sections address specific issues related to animal models in stroke and their influence on the most current STAIR criteria.