High-profile failures in stroke clinical trials have discouraged clinical translation of neuroprotectants. While there are several plausible explanations for these failures, we believe that the fundamental problem lies in the way clinical and pre-clinical studies are designed and analyzed for heterogeneous disorders such as stroke, whose innate biological and methodological variability current methods cannot capture. Recent efforts to address pre-clinical rigor and design, while important, are unable to account for variability present even in genetically homogeneous rodents. Indeed, efforts to minimize variability may lessen the clinical relevance of preclinical models. We propose a new approach that recognizes the important role of baseline stroke severity and other factors in influencing outcome. Analogous to clinical trials, we propose reporting baseline factors that influence outcome and then adapting for the pre-clinical setting a method developed for clinical trial analysis in which the influence of baseline factors is mathematically modeled and the variance quantified. A new therapy’s effectiveness is then evaluated relative to the pooled outcome variance at its own baseline conditions. In this way, an objective threshold for robustness can be established that a therapy must overcome before its effectiveness can be suggested in broader populations outside of the controlled environment of the PI’s laboratory. The method is model neutral and subsumes sources of variance as reflected in baseline factors such as initial stroke severity. We propose that this new approach deserves consideration for providing an objective method to select agents worthy of the commitment of time and resources in translation to clinical trials.
Despite decades of research, tremendous progress in understanding the mechanisms of cell death, and identification of several neuroprotectant drugs that thwart those mechanisms in pre-clinical stroke models, to date only treatments targeted at enhancing recanalization have shown clinical success. With numerous neuroprotectant compounds tested and failed [2, 3], the inability to predict successful clinical outcomes from apparently positive pre-clinical and early-phase clinical studies is all too common. While there are likely several explanations for these failures, including lack of power [4, 5], insufficient rigor in design and execution, and a disconnection between the pre-clinical design and the human trial [6, 7], we believe there is a fundamental problem in the identification of effective agents due to the way clinical and pre-clinical studies are designed and analyzed for disorders heterogeneous with respect to the myriad factors that influence outcome, including innate biological, environmental, and methodological variance and experimental noise.
Failures in clinical translation are not unique to stroke, but have occurred in major trials in disorders such as Alzheimer’s disease, amyotrophic lateral sclerosis, and traumatic brain injury [9–11]. A common theme of these failures is that treatments appeared to be effective in pre-clinical models and showed some apparent signals of efficacy in early-phase trials, yet went on to fail in larger phase 3 trials. These conditions also share baseline factors, such as initial disease severity and complex subtypes, that influence outcome yet are frequently unevenly distributed between patient arms and across different settings.
Recognizing that imbalances in baseline factors can influence trial results, investigators frequently employ statistical methods to adjust or control for these factors, especially when investigating smaller subgroups. Most of these statistical methods are variations on multivariate regression approaches. For their valid application, all these methods require overlapping, normally distributed factors that are related to each other in a linear fashion. However, our prior work indicated that these assumptions are almost never verified, and indeed even in the landmark National Institute of Neurological Disorders and Stroke (NINDS) recombinant tissue plasminogen activator (rt-PA) trial, the distribution of baseline factors did not meet these stringent assumptions. Non-overlapping and non-linear relationships between baseline factors essentially invalidate the use of traditional multivariate correction methods [12, 13]. In the case of rt-PA, we have shown using alternative analytical methods that the strength of effect of rt-PA overcame the influence of imbalances, but this has not been the case for the majority of neuroprotective agents. Indeed, we recently reported that the analytical approach contributed to erroneous conclusions in 90 % of early-phase trials leading up to failed phase 3 trials. Furthermore, methodological variation inherent in measurement of functional outcome can contribute as much as 25 % error to outcome depending on the method used to quantify it. It is highly unlikely that variability, noise, and the errors they impose will be evenly distributed in smaller pre-pivotal phase trials. As a result, if the uneven distributions of factors that influence outcome and errors in measurement by chance favor a better response in a treatment group, a false positive result emerges. A false negative result can occur if those errors favored better outcomes in the placebo arm.
Statistical tests that compare the means and distributions of outcomes do not reflect underlying imbalances in the distribution of these factors.
Here, we contend that there are analogous issues in the current approach to pre-clinical stroke studies that hamper the ability to predict which agents will be reproducibly effective when tested outside of the controlled environment of the PI’s laboratory or when expanded to different stroke models or patients. Efforts primarily focused on increasing experimental rigor, as in the 2009 Stroke Therapy Academic Industry Roundtable (STAIR) meeting and publication and the Multi-PART consortium, which mimics a phase 3 clinical trial scheme in a pre-clinical setting [17–20], while clearly advances, will not, in our view, be sufficient to identify an effective therapy when the treatment is extended to a larger, more heterogeneous population. There are too many sources of variability to anticipate and control. Moreover, a focus on homogeneity may make pre-clinical studies less translationally relevant. Instead, we suggest here a dramatically different alternative: a new approach that, rather than attempting to eliminate variability, incorporates biological variability and noise into decision making.
In prior work, the authors developed a novel analytical method for clinical trial decision making that generates an objective threshold that any new therapeutic must exceed in order to predict efficacy when extended from an early phase to a broader, more representative population in late-stage trials. This method addresses the additional sources of errors and non-random variation as the study extends outside of the controlled environment of the early-phase trial to a broader population. Our clinical method, pPREDICTS (pooled Placebo REsponse DICtates Treatment Success), is applicable when baseline factors are predictive of outcome. For example, in stroke, initial severity explains approximately 80 % of functional outcome variance [22, 23]. There are additional contributions from the patient’s age [24, 25] and whether the patient underwent recanalization therapy. Baseline glucose is also highly associated with outcome, particularly when the patient has undergone recanalization, an association consistent between human and animal models. Collaterals, some genetically determined, are also known to play a role in stroke outcomes in both humans and animals and can be altered by acquired conditions such as cardiovascular risk factors [28–32]. Uneven distribution of these factors among treatment arms can independently influence outcome beyond that achieved by the treatment under study.
The net result of the uneven distribution of these sources of biological, experimental, and methodological variance is the need for a larger strength of effect than would be anticipated from individual laboratory results [8, 33]. At present, granting agencies and other entities interested in translating a new treatment can only guess at the robustness of treatment effect that needs to be achieved to increase confidence that a treatment will be positive when its use is extended to other populations.
Notably, biological, methodological, and measurement variance may be a more general phenomenon in clinical studies. For example, a recent report indicated that only 39 % of 100 social psychology studies selected for their importance in the field were replicated despite adopting methods identical to the original study. Indeed, several results were in the opposite direction. The only factor predictive of replication was the robustness of the initial effect size (not, for example, the reputation of the original authors). We developed our method specifically to generate an objective threshold for robustness that a treatment would need to overcome to raise confidence that the treatment can improve outcomes even when tested in a heterogeneous population.
pPREDICTS consists of a pooled outcome model generated from the placebo arms of randomized clinical trials (RCTs). We add the novel feature of multidimensional statistical or “prediction” intervals around this function, providing a threshold against which a new therapy can be evaluated at its own baseline characteristics. Unlike confidence intervals, prediction intervals take into account noise in a future observation as well as noise in the generation of the minimized function and the variability of the data used to model the function [35, 36]. Confidence intervals pertain to the function generated from available data; prediction intervals pertain to a future observation.
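The distinction can be illustrated with a simple one-predictor sketch (the data here are fabricated and the model is a straight line; the actual pPREDICTS function is multidimensional and nonlinear): for a least-squares fit, the prediction interval at a new baseline value adds the variance of a future observation to the variance of the fitted mean, so it is always wider than the confidence interval.

```python
import numpy as np
from scipy import stats

# Fabricated one-predictor data: baseline severity (x) vs outcome score (y)
rng = np.random.default_rng(0)
x = np.linspace(0, 20, 40)
y = 90 - 3.0 * x + rng.normal(0, 5, size=x.size)

n = x.size
b1, b0 = np.polyfit(x, y, 1)                              # least-squares line
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # residual std error
tcrit = stats.t.ppf(0.975, n - 2)
xbar, sxx = x.mean(), np.sum((x - x.mean()) ** 2)

def intervals(x0):
    """Fitted value, 95 % CI half-width, and 95 % PI half-width at x0."""
    fit = b0 + b1 * x0
    se_mean = s * np.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)      # confidence
    se_pred = s * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)  # prediction
    return fit, tcrit * se_mean, tcrit * se_pred

fit, ci, pi = intervals(10.0)
print(f"fit={fit:.1f}, CI=+/-{ci:.1f}, PI=+/-{pi:.1f}")
```

The extra `1` under the square root in `se_pred` is the variance of the future observation itself; a new study's outcome must clear this wider band, not merely the confidence band of the fitted function.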
Our current pPREDICTS model is based on the control arms from 47 studies representing 11,000 patients. As the number of published trials increased, we were able to develop newer versions based on more specific subgroups of patients, such as those differing in extent of rt-PA use and intracranial hemorrhage (ICH). pPREDICTS has thus far successfully identified, both retrospectively and prospectively, early-phase clinical trials that would go on to fail in phase 3, and it would have correctly predicted the two major positive stroke trials, intravenous rt-PA and endovascular stent retrievers. The halting for futility of ATACH2, involving aggressive treatment of hypertension in acute ICH, was also predicted by our models, as the prior ICH studies of aggressive hypertension management did not show any signal of efficacy in our model [38–40].
Clinical use of pPREDICTS is illustrated in Fig. 1a. The pPREDICTS middle surface shows the inverse relationship between the chances for a good functional outcome (z-axis: modified Rankin Scale of 0–2), baseline NIHSS (x-axis), and age (y-axis), with p = .05 prediction surfaces around the function that a treatment must overcome to be considered positive. To assess a study’s potential, outcomes are plotted at their respective baseline conditions, in this case, median NIHSS and age, and then visually inspected with respect to the outcome function (middle surface) and whether treatment outcome exceeds the p = .05 surfaces (Fig. 1b). Here, the NINDS trial phases 1 and 2 are superimposed on a pooled control model of patients who had not received rt-PA. The outcome of the treatment arms of the NINDS rt-PA trial is superimposed on the model surfaces plotted at their baseline NIHSS and age (phases 1 and 2 involved different treatment time windows). Both treatment arm results exceed the p = .05 threshold, indicating their effectiveness. Interestingly, the control arm results lie near the pooled control arm outcomes, indicating that outcomes achieved in this trial are still relevant to modern stroke trials, as all studies published since the NINDS trial are included in this model. As shown in Fig. 1c, both treatment phases of the failed SAINT trial of the spin-trap antioxidant NXY059 [42, 43] show no hint of improved outcome, as treatment arm outcomes lie on the pooled control arm function at their baseline NIHSS and age. This analysis suggests that SAINT-I was erroneously viewed as positive based on small differences between the treatment and control arms, but as shown here, those differences are well within the variance expected when a trial is extended to a broader population.
As conceived and applied here, this method, based on cumulative sources of variability around measurable factors that contribute to outcome, provides an objective threshold that a new therapy would need to exceed if it is expected to be effective when tested in a broader population. Had SAINT-I been analyzed with this method, it would have predicted futility of the agent, as ultimately shown in the larger, better balanced SAINT-II trial. In this paper, we suggest the development of an analogous method utilizing baseline data from pre-clinical stroke models to generate an expected outcome and to generate quantitative thresholds for assessing the likelihood that a treatment is truly positive. If successful, this method could potentially predict the ultimate effectiveness of therapies, taking innate biological and methodological variance into account, and thus contribute greatly to improved decision making and allocation of scarce pharmaceutical resources to develop new therapies.
While pre-clinical researchers have been discouraged at their failure to identify successful clinical therapeutics, they generally do not appreciate that most pilot clinical trials also have not predicted a positive result in late-phase/pivotal trials. This analogy between failures in early clinical and pre-clinical phases extends to potential sources of variance. Table 1 indicates potential analogous sources of biological and methodological variability in the clinical and pre-clinical setting. Despite genetic homogeneity of rodent models, most preclinical investigators have already identified factors associated with outcomes, such as extent of cerebral blood flow (CBF) reduction during occlusion, temperature, occurrence of seizures, and others. Investigators recognize that even with permanent middle cerebral artery occlusion (MCAO), rats demonstrate considerable variance. To try to reduce variance in outcome, investigators commonly monitor CBF and exclude rats that do not achieve a specific threshold of reduction. Animals are fasted overnight to reduce peri-procedure glucose and the incidence of seizures. Temperature is regulated to the extent possible. However, the overall contribution of these maneuvers to reducing variability has not been quantified to our knowledge. By limiting heterogeneity, such as restricting the range of baseline glucose due to fasting, the important contribution of hyperglycemia to stroke outcome is ignored [48–53] and the effectiveness of a therapy in a hyperglycemic setting will not be known prior to translation to humans, where as many as one third of stroke patients are hyperglycemic on admission.
Initial post-procedure stroke severity, while often reported, is not explicitly associated with outcome other than to screen for success of the stroke model or stratify subjects. Initial stroke severity in humans is the most important predictor of outcome, and imbalances in this key factor have caused tremendous controversy in the clinical setting [14, 55–57]. With regard to functional testing, most investigators recognize that rodents must be “pre-tested” prior to stroke to see if they are motivated to participate in behavioral and motor testing or trained to perform complex tasks [58, 59]. While this pre-training likely reduces variability of functional measures, there is no definitive threshold at which all rodents have the same motivation, such that any functional differences can be attributed to the stroke and/or treatment. Similarly, in humans, many factors outside of stroke severity can influence outcome, including depression, access and quality of rehabilitation, and adherence to secondary prevention [61–63]. With respect to outcome assessment, inter- and intra-rater reliability for these measures has not been rigorously tested. In human studies, variability remains despite extensive training or central adjudication and contributes considerable errors to outcome measures that require an increase in sample size to improve the chances that errors will be evenly distributed between study arms. To our knowledge, it is not known if there is similar variability in pre-clinical functional testing.
Looking at pre-clinical testing from this perspective leads us to conclude that it is assumed, but as far as we can tell not formally tested, that incorporating methods such as overnight fasting and pre-testing for behavioral effort will be sufficient to generate homogeneous strokes, so that any outcome differences observed among controls and treated animals will be due mostly to the treatment under study. We believe that the failure of these models to predict clinical efficacy argues against this assumption. Moreover, emphasis on homogeneity may make these models even less relevant to the human condition. Lack of an objective tool to quantify the effect size hampers rational decision making regarding selection of agents to proceed through the pre-clinical and clinical trial process.
We developed pPREDICTS for the clinical setting as a new approach that embraces variance while requiring no adjustment for imbalances. By generating a continuous function relating outcome to baseline factors from a pooled population and adding “prediction” intervals, the effect size necessary for a reproducible result is obtained. We propose here consideration of a similar approach in the pre-clinical setting. One example (Fig. 2) of how baseline data not typically reported can be used to generate a threshold for outcome in the preclinical stroke setting can be seen from data generated in our laboratory in a 90-min transient suture occlusion of the middle cerebral artery (MCAO) in male adult rats. To generate a spectrum of pre-stroke glucose levels and mimic the typical clinical trial situation, the pancreatic toxin streptozotocin was administered to some rats 2 days prior to stroke to increase blood glucose levels, and then an effort was made to maintain glucose under 300 mg/dL post-operatively; otherwise, all pre- and post-operative care was identical. Baseline glucose is plotted vs infarct size (corrected for edema), and a pPREDICTS function fitted to these data.
Detailed methods to generate the pPREDICTS function and prediction intervals are published. The model was developed by (1) fitting a function to outcome proportions in multiple dimensions using a Matlab routine (nlinfit) and the Freeman-Tukey modification of the arcsine square root of a proportion; (2) eliminating large outliers based on Studentized residuals; and (3) generating multidimensional prediction interval surfaces at the p < .05 level using a Jacobian matrix. Note that there are many potential scenarios by which baseline factors can be associated with outcome, including logistic/sigmoid, exponential, or linear functions.
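As an illustration only, the three steps might be sketched in a single dimension as follows. The data are fabricated, scipy's `curve_fit` stands in for Matlab's `nlinfit`, and the residual-screening threshold and starting values are assumptions of the sketch, not the published method's settings.

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# Fabricated pooled control-arm data: baseline severity vs proportion with
# good outcome, each point from a trial of 100 subjects.
sev = np.array([4, 6, 8, 10, 12, 14, 16, 18, 20, 22], dtype=float)
good = np.array([0.85, 0.80, 0.70, 0.60, 0.48, 0.40, 0.30, 0.22, 0.15, 0.40])
ntr = np.full(sev.size, 100.0)

def freeman_tukey(p, n):
    """Step 1a: Freeman-Tukey double-arcsine transform of a proportion."""
    x = p * n
    return 0.5 * (np.arcsin(np.sqrt(x / (n + 1))) +
                  np.arcsin(np.sqrt((x + 1) / (n + 1))))

def model(x, a, b, c):
    """Step 1b: sigmoid decline of (transformed) good outcome with severity."""
    return a / (1.0 + np.exp(b * (x - c)))

z = freeman_tukey(good, ntr)

def fit_screened(x, y, thresh=2.0):
    """Step 2: fit, drop points with large Studentized residuals, refit."""
    popt, _ = curve_fit(model, x, y, p0=[1.0, 0.2, 12.0], maxfev=10000)
    resid = y - model(x, *popt)
    keep = np.abs(resid / resid.std(ddof=3)) < thresh
    popt, pcov = curve_fit(model, x[keep], y[keep], p0=popt, maxfev=10000)
    return x[keep], y[keep], popt, pcov

x2, y2, popt, pcov = fit_screened(sev, z)

def prediction_interval(x0, alpha=0.05):
    """Step 3: delta-method p < .05 prediction band at a new baseline x0."""
    eps = 1e-6
    J = np.array([(model(x0, *(popt + eps * np.eye(3)[i])) -
                   model(x0, *popt)) / eps for i in range(3)])
    s2 = np.sum((y2 - model(x2, *popt)) ** 2) / (len(y2) - 3)
    tcrit = stats.t.ppf(1 - alpha / 2, len(y2) - 3)
    half = tcrit * np.sqrt(J @ pcov @ J + s2)   # fit noise + future observation
    return model(x0, *popt) - half, model(x0, *popt) + half

lo, hi = prediction_interval(15.0)
```

The same machinery extends to multiple baseline factors by letting `model` take a vector of predictors, which is what produces the surfaces of Fig. 1.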
By plotting baseline blood glucose vs infarct size (Fig. 2), we can see a highly correlated, sigmoid relationship that indicates that glucose explains considerable infarct size variance (r2 = .92). Generating p = .05 prediction intervals around this function indicates the threshold that an agent would need to exceed to overcome the variability not explained by this factor. Superimposed on this function is the mean ± confidence interval of infarct size in a group of streptozotocin-treated rats treated at the time of reperfusion with a novel high-capacity and broadly active carbon nanoparticle antioxidant, pegylated hydrophilic carbon clusters (PEG-HCCs), under investigation in the authors’ laboratory. Infarct size at the treatment group’s baseline glucose is smaller than predicted by the control function, suggesting that this treatment effectively reduces stroke size in a setting of hyperglycemia (p ≤ 0.003, comparing PEG-HCC mean ± confidence intervals vs function mean ± prediction intervals at the same baseline glucose). We propose that this outcome model can be expanded to include other baseline factors, such as initial stroke severity as determined either by post-operative neurological function or noninvasive imaging, and a multi-dimensional model developed analogous to that seen in clinical pPREDICTS (Fig. 1). Different outcomes, such as the percent that achieve a certain follow-up neurological endpoint [69, 70], analogous to the modified Rankin Scale, could be employed to provide a range of potential outcomes.
A suggested paradigm for pre-clinical testing that could be adapted to this analytical approach is shown in Fig. 3. In bold are the factors we consider the minimum necessary to utilize this method as an initial screen for a therapeutic agent. Because this approach would in theory be model-neutral, other potential conditions and factors, such as the alternative stroke models shown in phase 1, could serve to broaden the test of any agent to mimic what it might encounter in a clinical situation. This approach would be feasible as long as baseline factors such as post-stroke neurological deficit were associated with outcome. We propose that both pre-stroke and post-stroke measures, as illustrated in phases 2 and 3, could be subjected to modeling of a multi-dimensional relationship between baseline factors and outcome, with prediction intervals generated to incorporate the cumulative sources of errors discussed above.
Due to the paucity of work in this area, it is unclear if the models as conceived here express causal relationships between factors, although a causal relationship is not a requirement as long as the model is useful in predicting outcomes. Of considerable importance is the internal and external validity of this approach when broadened in its application [4, 71–73]. Poor internal validity has been identified as a major contributor to false positive results in pre-clinical studies [4, 5]. In our case, internal validity could be improved by including several laboratories in the initial model development and by using our iterative pPREDICTS method to exclude extreme outliers based on objective a priori criteria. External validity could be tested by developing the model on an initial cohort of data and then comparing this to a subsequent set. In the clinical stroke setting, pPREDICTS is continuously adapting to newer treatment approaches, and we envision in the preclinical setting that addition of different stroke models and conditions would contribute to a more broadly externally valid model. Of course, basic tenets of rigor as derived from STAIR, Multi-PART, and others would be essential to both types of validity.
If adopted by the larger stroke community, we anticipate that the approach proposed here could take a direction similar to what transpired for our clinical method. Our first study employed a published outcome model based on a few sites, yet was able to correctly predict lack of efficacy of pre-stent retriever endovascular interventions from early case series. As more studies were published and investigators provided us additional data, the model was expanded to stroke subtypes, populations, and treatments [1, 22, 37, 76]. We will encourage our colleagues to contribute similar data points obtained by them. Over time, a comprehensive pre-clinical pPREDICTS model may emerge that addresses a wider range of stroke models.
Individual investigators and granting agencies could employ the thresholds generated in this growing sample to identify the strength of effect a new therapy needs to achieve in order to suggest its benefit when extended to a clinical population. Investigators can superimpose the outcomes of their study, along with confidence intervals, onto the pPREDICTS model. Whether their control group is representative of a broader population can be tested by superimposing the outcomes of their control group onto the pooled model (analogous to Fig. 1b). If the outcome ± confidence intervals of their treatment arm are outside the 95 % prediction intervals of the pooled control arm, there is compelling reason to proceed to the next phase of investigation. As with any screening method, there is a chance of missing a potentially effective therapy (type II error) whose robustness may not be great enough to exceed other sources of variance. However, this may be a more general phenomenon related to the complexity of the disease being studied. Methods such as ours could target more narrow conditions and encourage use of more objective, less error-prone outcomes, all with the goal of narrowing the variance that needs to be exceeded.
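The screening decision described above can be sketched numerically. In this hypothetical example (all numbers fabricated; a straight line stands in for the pooled pPREDICTS function), a treatment group's mean infarct size and confidence interval are compared to the pooled control band at the group's own baseline glucose:

```python
import numpy as np
from scipy import stats

# Fabricated pooled control-arm data: baseline glucose (mg/dL) vs
# infarct volume (% hemisphere).
glc = np.array([90, 120, 150, 180, 220, 260, 300, 340], dtype=float)
inf = np.array([12, 16, 21, 25, 30, 33, 36, 38], dtype=float)

n = glc.size
b1, b0 = np.polyfit(glc, inf, 1)
resid = inf - (b0 + b1 * glc)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))
tcrit = stats.t.ppf(0.975, n - 2)
gbar, sxx = glc.mean(), np.sum((glc - glc.mean()) ** 2)

def control_pi(g0):
    """95 % prediction interval of the pooled control function at g0."""
    half = tcrit * s * np.sqrt(1 + 1 / n + (g0 - gbar) ** 2 / sxx)
    mid = b0 + b1 * g0
    return mid - half, mid + half

# Hypothetical new treatment group tested at its own baseline glucose:
g_treat, m_treat, ci_treat = 250.0, 18.0, 3.0   # mean infarct ± 95 % CI
lo, hi = control_pi(g_treat)
proceed = (m_treat + ci_treat) < lo   # entire CI below the control band
print(f"control PI at {g_treat:.0f} mg/dL: [{lo:.1f}, {hi:.1f}]; proceed={proceed}")
```

Only when the treatment group's entire confidence interval falls outside the pooled prediction band does the effect exceed the variance expected on extension to a broader population.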
Since errors may also result from a non-representative control group, investigators can compare their control results to this pooled population; as a control group grows larger, its outcomes would be expected to drift toward the pooled outcomes. Funnel plots, substituting the pPREDICTS outcome function for the expected outcomes, can be generated to detect publication bias [5, 75].
This method could also be used to estimate sample size. If an investigator has conducted a single trial and knows the variance of their study, the variance of the pooled model at the same baseline independent variable can be utilized to obtain the sample size requirements for the next phase [35, 77]. This approach obviates the need to assume equal variance. The power of a study can be determined in a similar manner.
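One way this could look, as a sketch only: the standard normal-approximation sample-size formula for a two-arm comparison, with the pooled model's outcome SD at the study's baseline substituted for the usual assumed common SD. The numbers are illustrative.

```python
import math
from scipy import stats

def n_per_group(delta, sd_pooled, alpha=0.05, power=0.80):
    """Per-group sample size for a two-arm comparison (normal approximation),
    using the pooled model's outcome SD at the same baseline conditions in
    place of an assumed common variance."""
    z_a = stats.norm.ppf(1 - alpha / 2)   # two-sided significance
    z_b = stats.norm.ppf(power)           # desired power
    return math.ceil(2 * ((z_a + z_b) * sd_pooled / delta) ** 2)

# Illustrative: detect an 8-point infarct reduction when the pooled model's
# SD at this baseline is 10 points.
n = n_per_group(delta=8.0, sd_pooled=10.0)
print(n)  # 25 per group
```

Halving the detectable effect roughly quadruples the required group size, which is why the robustness threshold matters for planning the next phase.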
High-profile failures in stroke clinical trials such as SAINT-II discouraged major pharmaceutical and governmental investments in clinical translation of cellular protective agents in stroke. Explanations for failure focused on lack of power [4, 5], rigor in design and execution, and a disconnection between the pre-clinical design and human trial [6, 7]. One effort to address this problem was the STAIR meeting and subsequent publication. STAIR has improved pre-clinical reproducibility but has not yet demonstrated improved prediction. While we applaud these efforts, we do not believe them to be sufficient to ensure reproducible results when extended outside a few laboratories and models. We propose here to incorporate contributions from STAIR (e.g. blinding, randomization, careful record keeping, and reporting) and from Multi-PART [4, 5, 78] (engaging multiple laboratories in investigating the same agent), and to extend these methods to include a consideration of the innate biological and methodological variability that an agent will ultimately encounter when translated to humans. We propose that this new approach deserves consideration as providing an objective measure of the robustness of treatment effect to aid in the selection of agents worthy of the commitment of time and resources in definitive clinical trials.
Funding TAK received funding from NIH R21NS084290-02 and R01NS094535-01.
Computing resources of the Computational and Integrative Biomedical Research Center at Baylor College of Medicine were utilized for this study (PM).
Conflict of Interest TAK and PM are copyright holders of pPREDICTS. TAK is a co-inventor on a patent application related to PEG-HCCs and owns shares in Acelerox, Inc., formed to pursue commercialization of nano-antioxidants.
Ethical Approval All applicable international, national, and/or institutional guidelines for the care and use of animals were followed. This article does not contain any studies with human participants.