The Alzheimer research community is actively pursuing novel biomarker and other biologic measures to characterize disease progression or to use as outcome measures in clinical trials. One product of these efforts has been a large literature reporting power calculations and estimates of sample size for planning future clinical trials and cohort studies with longitudinal rate of change outcome measures. Sample size estimates reported in this literature vary greatly depending on a variety of factors, including the statistical methods and model assumptions used in their calculation. We review this literature and suggest standards for reporting power calculation results. Regardless of the statistical methods used, studies consistently find that volumetric neuroimaging measures of regions of interest, such as hippocampal volume, outperform global cognitive scales traditionally used in clinical treatment trials in terms of the number of subjects required to detect a fixed percentage slowing of the rate of change observed in demented and cognitively impaired populations. However, statistical methods, model assumptions, and parameter estimates used in power calculations are often not reported in sufficient detail to be of maximum utility. We review the factors that influence sample size estimates, and discuss outstanding issues relevant to planning longitudinal studies of Alzheimer’s disease.
There is increasing interest in the potential utility of biomarkers as outcomes in clinical trials. For example, the joint industry/NIH funded Alzheimer’s Disease Neuroimaging Initiative (ADNI) was created expressly to investigate cerebrospinal fluid and volumetric neuroimaging measurements as diagnostic biomarkers of early Alzheimer’s disease (AD) and as potential endpoints for monitoring clinical trial treatment effects. ADNI has recruited and followed longitudinally approximately 200 AD cases, 400 subjects with mild cognitive impairment (MCI), and 200 age-matched cognitively normal controls. Additional novel biomarker endpoints, including MR spectroscopy and FDG-PET, have been proposed and are being actively pursued. The hope is that biomarkers will allow monitoring of treatment effects at earlier stages of disease, before traditional cognitive and functional endpoints are measurable. It is also becoming apparent that biomarker measurements, particularly volumetric neuroimaging measures, are substantially more precise than traditional cognitive and functional measures, to the point that clinical trials using volumetric neuroimaging measures may be possible with a tenth or less of the sample size of current trials. Freely accessible ADNI data have provided a natural laboratory for exploring these issues. While this literature has consistently described the improvement in statistical power of imaging outcomes relative to cognitive outcomes, there is little consistency across reports in estimated sample size requirements for any particular outcome measure. To characterize these discrepancies, we have reviewed the ADNI power calculation publications with an eye to the influence of statistical methods on sample size projections.
Articles were identified based on a search of published papers listed on the ADNI website (adniinfo.org/Scientists/ADNIScientistsHome/ADNIPublications.aspx) as of February 4, 2011. All papers containing search terms “power” or “sample size” were reviewed for reported sample size calculations by one of the authors (MCA). Of 143 papers searched, 17 contained abstractable reports of previously unpublished analytic sample size calculations [5–21]. These papers report required sample size for a future clinical trial to observe a stated treatment effect, with the magnitude of the treatment effect described in terms of percentage slowing of disease progression relative to placebo. An additional six papers reported on sample sizes required for various analyses (e.g., detection of correlations, differences between dementia types, or the presence of atrophy) using a non-analytic sample-reduction method in which subjects were randomly discarded from the pilot data set until it was no longer possible to reject the hypotheses in question [22–27]. The remaining search hits were all due to papers reporting retrospective power calculations, papers that only reported relative gains in sample sizes, and papers that cited previously published estimates.
Among the 17 papers that presented prospective analytic calculations, most reported the sample size required to detect a 25% reduction in the observed rate of change with 80% power. To facilitate comparisons across publications, sample size estimates based on a different percentage reduction were recalculated to a 25% reduction using the formula n_25 = (k/25)² · n_k, where k equals the percentage used in the original report. Sample size estimates for power other than 80% (typically 90%) were standardized to 80% using the formula n_0.8 = (z_{α/2} + z_{0.2})² n_p / (z_{α/2} + z_{1−p})², where the subscript p indicates the power of the trial expressed as a probability. The characterization of effect size as a fixed percentage reduction in observed rate of change is useful for comparing power calculations across studies, but is not intended to serve as a model for research practice. In practice, effect sizes used in trial design should always be determined based on the plausibility and clinical significance of the hypothetical outcomes under consideration.
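These two standardizations are simple to apply in code. A minimal sketch of the rescaling formulas above, using only the standard library for normal quantiles (function names are ours):

```python
from statistics import NormalDist

def z(p):
    """Standard normal quantile (inverse CDF)."""
    return NormalDist().inv_cdf(p)

def to_25pct(n_k, k):
    """Rescale a sample size powered for a k% slowing to a 25% slowing:
    n_25 = (k/25)^2 * n_k, since sample size scales with the inverse
    square of the effect size."""
    return (k / 25.0) ** 2 * n_k

def to_80pct_power(n_p, p, alpha=0.05):
    """Rescale a sample size computed at power p to 80% power,
    holding alpha and the effect size fixed."""
    za = z(1 - alpha / 2)
    return n_p * (za + z(0.80)) ** 2 / (za + z(p)) ** 2

# Example: a report of n = 100 per arm for a 50% slowing at 90% power
# translates to roughly 299 per arm at the 25% slowing / 80% power standard.
n_std = to_80pct_power(to_25pct(100, 50), 0.90)
```

Note that halving the targeted percentage slowing quadruples the required sample size, while moving from 90% to 80% power reduces it by about a quarter.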
Tables 1 and 2 summarize reported sample size estimates for trials in AD and MCI. Table 1 summarizes estimates of the sample size required to detect a 25% reduction in mean rate of decline on the standard cognitive outcome for AD treatment trials, the AD Assessment Scale - Cognitive Subscale (ADAS-Cog); Table 2 summarizes estimates of the sample size required to detect a 25% reduction in atrophy rate for a likely MRI outcome, hippocampal volume. Reported sample size estimates for each measure are widely divergent. The differences in estimates may be explained by a number of factors, which we review in sequence below.
Trial design, i.e., the length of the trial and the frequency of assessment, has a direct influence on statistical power. All other things being equal, longer trials, and to a lesser extent trials with finer assessment intervals, result in more precise estimates of rate of decline per arm and require fewer subjects to detect treatment effects. For example, Hua et al. report a 6-fold decrease in required sample size for a 2-year compared to a 6-month AD treatment trial using change on ADAS-Cog as the outcome variable (Table 1). Relatively noisy outcome measures, such as the global cognitive scales represented here by the ADAS-Cog, gain more precision and power from increased trial length or assessment frequency than relatively less noisy outcome measures such as volumetric imaging. For example, using change in hippocampal volume as the outcome measure, Hua et al. report only a 36% increase in the sample size required for a 6-month compared to a 2-year trial (Table 2). The influence of trial design on statistical power varies with different analysis plans. Within limits, longer trials and increased sampling frequency are associated with improved power for trials designed to detect changes in the trajectory of disease under treatment, although there are diminishing returns with longer trials as dropout rates increase and the linearity assumptions implicit in most statistical analysis plans become less tenable. Trials designed to detect acute, symptomatic treatment effects are unlikely to benefit from longer observation or increased frequency of sampling.
Effect size, the minimum treatment effect a trial is powered to detect, directly influences sample size requirements. For the power calculations reviewed here, the effect size is calculated as a percentage of the assumed mean rate of decline under the placebo condition. The various ADNI power calculation papers used different estimates of the placebo rate of decline in their calculations, and this explains to some degree the differences in required sample size reported. For example, for MCI treatment trials using the ADAS-Cog as the endpoint (Table 1), effect sizes used for power calculations range from 25% of 2.1 points per year to 25% of 1.0 points per year [8, 14]. The sample size estimate in the former (1183 for a 1-year trial) is substantially smaller than the sample size estimates in the latter (4000+ and 2175 for a 1-year trial). Several papers did not report the effect size powered for [9, 13, 19], and we can only speculate on the extent to which differences in sample size projections reported in these papers are attributable to differences in assumed effect size. However, in general, when defining effect size as a percentage reduction in mean rate of decline, a smaller assumed mean rate of decline under the placebo condition translates to smaller effect sizes powered for and larger required sample sizes.
A critical factor when setting the effect size for power calculations is defining the target population of the planned future clinical trial. For the most part, the power calculations reviewed here used estimates of mean rate of decline within the ADNI cohorts as the assumed trajectory of disease under the placebo condition. The implicit assumption is that subjects recruited into future trials will look much like the subjects recruited into the ADNI cohort study, a reasonable assumption given that the ADNI recruitment network and methods parallel those used by many multicenter trials. The differences in effect size (Tables 1 and 2) used by the various studies may follow in part from random variability in the data obtained when the ADNI data were accessioned. Differences in effect size may also follow from differences in the statistical methods used to calculate mean rate of decline, or differences in inclusion/exclusion criteria applied to the ADNI sample prior to estimation.
Regarding the effect of varying inclusion criteria, McEvoy et al. describe the effect of inclusion criteria intended to enrich the study population for subjects more likely to have the underlying neurodegenerative process that is the target of most planned therapies. For example, restricting recruitment to MCI subjects with baseline MRI atrophy patterns consistent with AD resulted in a cohort with a mean trajectory of decline on the ADAS-Cog of 2.3 points per year, compared to a mean decline of 1.5 points per year in the unrestricted cohort; sample size requirements correspondingly dropped by over one-half under this inclusion criterion, from 978 per arm to 458 per arm. Similarly, restricting recruitment to subjects with the APOE ε4 risk allele increased the mean rate of ADAS-Cog decline to 1.7 points per year and reduced the required sample size to 774 persons per arm. A limitation of trials with restrictive inclusion criteria is that findings generalize only to the subpopulation examined.
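As a rough check on these reported numbers: if enrichment changed only the mean rate of decline and left the within-group variance untouched, sample size would scale with the inverse square of the rate. The figures below are those reported above; the constant-variance assumption is ours, for illustration only:

```python
# Reported unrestricted-cohort figures (McEvoy et al., as cited above)
n_unrestricted = 978        # per arm, at 1.5 ADAS-Cog points/year decline
rate_unrestricted = 1.5
rate_enriched = 2.3         # mean decline after MRI-based enrichment

# Back-of-envelope: n scales as (rate ratio)^2 if variance is unchanged
n_enriched_approx = n_unrestricted * (rate_unrestricted / rate_enriched) ** 2
# ~416 per arm, in the same ballpark as the reported 458; the gap reflects
# the change in within-group variance under the enrichment criterion.
```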
Power calculations are specific to the analysis plan of the planned trial. Sample size formulas for two-group comparisons under normality assumptions are of the form:

n = 2σ²(z_{1−α/2} + z_{1−β})² / Δ²    (1)

where Δ is the treatment effect size under the alternative, σ² is the within-group variance of the outcome measure being compared across treatments, and z_{1−α/2} and z_{1−β} are the usual quantiles of the standard normal distribution, with α equal to the type I error rate of a two-sided test, typically set to 0.05, and (1 − β) equal to the power of the trial, typically set to 0.8 or 0.9. Treatment effect Δ and variance σ² are defined in terms of the outcome measure to be used in the planned trial. For example, for a trial with two observations per subject and an outcome measure of change from baseline to follow-up, Δ is the change in the treatment group minus the change in placebo, and σ² is the variance of change scores (e.g., Meinert, equation 9.14). In this example σ² can be estimated as the variance of change from baseline to follow-up observed in two-wave pilot data of comparable duration to the planned trial. For a trial with multiple observations per subject and an outcome measure of least squares slope of longitudinal trajectories, Δ is the difference in expected slopes in treatment versus placebo and σ² is the within-arm variance of least squares slopes. In this example σ² can be estimated from the variance of least squares slopes observed in pilot data of comparable design to the planned trial. These are examples of two-stage “summary measures” analyses, which require only the assumption that the summary measures (i.e., change scores or least squares slopes) are independent, identically distributed, asymptotically normal random variables. Several of the ADNI power calculation papers (Tables 1 and 2) use summary measures power formulas, although the exact statistical analysis and model assumptions used were not always stated in complete detail.
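Formula (1) can be evaluated directly. A minimal sketch with illustrative, hypothetical inputs (the delta and sigma values are not ADNI-derived estimates):

```python
from statistics import NormalDist

def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-group comparison of a normally
    distributed summary measure (change score or least-squares slope):
    n = 2 * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2."""
    q = NormalDist().inv_cdf
    z = q(1 - alpha / 2) + q(power)
    return 2 * sigma ** 2 * z ** 2 / delta ** 2

# Hypothetical inputs: detect a 25% slowing of a 2.0-point/year decline
# (delta = 0.5) with a summary-measure SD of 3.0.
n = n_per_arm(delta=0.5, sigma=3.0)   # roughly 565 per arm before rounding up
```

In practice the result is rounded up to the next integer, and the same function applies to either summary measure once delta and sigma are estimated from pilot data of comparable design.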
Several of the power calculation papers used formally parameterized longitudinal models and analysis plans as the basis of their power calculations. For example, McEvoy et al. based power calculations on a linear mixed effects model analysis assuming that longitudinal trajectories of decline are linear within subject and that the distribution of slopes and intercepts describing these trajectories is bivariate normal. Sample size requirements given this assumed model have been derived. For a balanced design (with all subjects observed at the same time points), the required sample size per arm is:

n = 2(σ_s² + σ_e²/Σ(t_i − t̄)²)(z_{1−α/2} + z_{1−β})² / Δ²    (2)

where σ_s² and σ_e² are parameters of the linear mixed effects model, and Σ(t_i − t̄)² is the “design term”, in which t_i indexes the times at which measures are made and t̄ is the mean of those times. For example, for a 12-month trial with observations at baseline, month 6, and month 12, Σ(t_i − t̄)² in units of years equals (0 − 0.5)² + (0.5 − 0.5)² + (1 − 0.5)² = 0.5. Here Δ is the difference in mean rate of decline in treatment versus control, σ_s² is the person-to-person variability in random slopes, and σ_e² is the residual error variance of the model. σ_s² and σ_e² can be estimated by fitting a linear mixed effects model to pilot data representative of the trial’s target population. For balanced design pilot data, estimates by formula (2) are algebraically identical to estimates by the power formula for a summary measures analysis comparing the mean of least squares slopes of treated subjects to the mean of least squares slopes of controls.
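The design term and formula (2) are easy to compute; the parameter values in the usage line below are placeholders, not ADNI estimates:

```python
from statistics import NormalDist

def design_term(times):
    """Sum of squared deviations of assessment times from their mean."""
    tbar = sum(times) / len(times)
    return sum((t - tbar) ** 2 for t in times)

def n_random_slopes(delta, sigma_s2, sigma_e2, times, alpha=0.05, power=0.80):
    """Per-arm sample size under the linear mixed effects model with
    random intercepts and random slopes, formula (2)."""
    q = NormalDist().inv_cdf
    z = q(1 - alpha / 2) + q(power)
    return 2 * (sigma_s2 + sigma_e2 / design_term(times)) * z ** 2 / delta ** 2

# The worked example above: baseline, month 6, month 12 (in years)
assert abs(design_term([0.0, 0.5, 1.0]) - 0.5) < 1e-12

# Placeholder parameters: longer trials grow the design term, shrinking the
# residual-error contribution (but not the slope-variance term) to n.
n_1yr = n_random_slopes(0.5, 1.0, 1.0, [0.0, 0.5, 1.0])
n_2yr = n_random_slopes(0.5, 1.0, 1.0, [0.0, 0.5, 1.0, 1.5, 2.0])
```

This mirrors the diminishing-returns point made earlier: once the design term is large, the person-to-person slope variance dominates and further lengthening the trial helps little.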
An alternative mixed effects model power formula is:

n = 2σ_e²(z_{1−α/2} + z_{1−β})² / (Δ² Σ(t_i − t̄)²)    (3)

This formula is appropriate under a mixed effects model in which subjects have random intercepts but identical rates of decline within arm (or equivalently, a marginal model with compound symmetric covariance structure). Formula (3) results in smaller sample size projections, but can be anti-conservative when the common within-arm rate of decline assumption does not hold. Formulas (1), (2), and (3) assume equal sample size per arm. Some trials use unequal allocation ratios to increase the likelihood of assignment to the active treatment arm and make the trial more attractive to study participants. Unequal allocation trials are slightly less efficient and require a modest adjustment in total sample size.
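On our reading, formula (3) is formula (2) with the slope-variance term set to zero, which makes the anti-conservatism easy to see numerically (parameter values below are placeholders):

```python
from statistics import NormalDist

def design_term(times):
    tbar = sum(times) / len(times)
    return sum((t - tbar) ** 2 for t in times)

def _z2(alpha=0.05, power=0.80):
    q = NormalDist().inv_cdf
    return (q(1 - alpha / 2) + q(power)) ** 2

def n_random_slopes(delta, sigma_s2, sigma_e2, times):
    """Formula (2): random intercepts and random slopes."""
    return 2 * (sigma_s2 + sigma_e2 / design_term(times)) * _z2() / delta ** 2

def n_random_intercepts(delta, sigma_e2, times):
    """Formula (3): random intercepts only (compound symmetry)."""
    return 2 * (sigma_e2 / design_term(times)) * _z2() / delta ** 2

# Any positive slope variance makes formula (3) smaller than formula (2),
# i.e., anti-conservative when subjects truly differ in rate of decline.
times = [0.0, 0.5, 1.0]
n2 = n_random_slopes(0.5, 0.2, 1.0, times)
n3 = n_random_intercepts(0.5, 1.0, times)
```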
Several of the papers reviewed here reported sample sizes using formula (3), either in lieu of, or in addition to [11, 17], formula (2). Sample size estimates derived using the random-intercepts model and formula (3) were generally smaller than estimates using the mixed effects model with random intercepts and random slopes and formula (2). Taken together these observations underscore the importance of model selection when powering trials.
For volumetric imaging outcomes in particular, sample size estimates can vary depending on the method of image analysis used. Image processing can be based on manual tracings, semi-automated methods, or fully automated methods. Even though each of these methods measures the same structure, they may have different signal-to-noise properties depending on the relative precision of the methods. For example, Leung et al. calculated hippocampal volume change by two different image processing methods and calculated sample size requirements for each outcome measure. While both methods led to sample size estimates that were considerably smaller than estimates typical of global cognitive measures like the ADAS-Cog, they found that the required sample size for the more efficient image processing method was 32–54% smaller than for the less efficient method (see Table 2). Characterizing the relative performance of various imaging technologies and processing techniques [10, 12, 13, 15, 16, 21, 23, 24] will be an important outcome of the ADNI exercise.
The results above suggest that the wide divergence of sample size estimates calculated from ADNI data can be explained by multiple factors beyond differences in trial design and target population, including differences in power calculation algorithms used, and, for neuroimaging outcomes, differences in the signal-to-noise profile of the different image processing algorithms. Additional factors relevant to power calculations for AD trials, and general recommendations for improved reporting of power calculations, are discussed below.
The validity of a power calculation depends in large part upon the accuracy of the (assumed known) parameter values used in its calculation. In practice these values are almost always calculated from pilot data, as is the case in the ADNI papers reviewed here, and hence contain some degree of random variability. The practical consequence of this randomness is potentially significant, especially when the pilot study used for parameter estimation is small. Several of the reviewed papers reported statistical tests or confidence intervals to characterize the variability inherent in sample size estimates [10, 11, 13, 15–17, 21, 26, 27]. For example, McEvoy et al. used a bootstrap procedure to calculate 95% confidence intervals around sample size estimates based on ADNI data. They found that even with the relatively large ADNI pilot data set these confidence intervals can be large, demonstrating the importance of confidence interval calculation as a sensitivity analysis when powering trials.
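A bootstrap sensitivity analysis of this kind can be sketched as follows. This is our own minimal illustration on simulated (not ADNI) pilot slopes, using the summary-measures formula (1) with the effect size set to 25% of the mean pilot slope:

```python
import random
from statistics import NormalDist

def n_per_arm(slopes, reduction=0.25, alpha=0.05, power=0.80):
    """Per-arm n to detect a `reduction` fraction slowing of the mean
    pilot slope, with variance estimated from the pilot slopes."""
    m = sum(slopes) / len(slopes)
    var = sum((s - m) ** 2 for s in slopes) / (len(slopes) - 1)
    delta = reduction * abs(m)
    q = NormalDist().inv_cdf
    z = q(1 - alpha / 2) + q(power)
    return 2 * var * z ** 2 / delta ** 2

def bootstrap_ci(slopes, n_boot=2000, seed=0):
    """Percentile bootstrap 95% CI for the sample size estimate."""
    rng = random.Random(seed)
    ests = sorted(
        n_per_arm([rng.choice(slopes) for _ in slopes]) for _ in range(n_boot)
    )
    return ests[int(0.025 * n_boot)], ests[int(0.975 * n_boot)]

# Simulated pilot data: 100 subjects declining 2 points/year (SD 1)
rng = random.Random(1)
slopes = [rng.gauss(-2.0, 1.0) for _ in range(100)]
lo, hi = bootstrap_ci(slopes)
# Even with 100 pilot subjects, the interval around n is notably wide.
```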
Some age-related cognitive decline and brain atrophy is experienced even by cognitively normal elderly. This is potentially relevant to the design of trials, as treatments that specifically target the Alzheimer neurodegenerative process may have no effect on non-Alzheimer related decline, and non-Alzheimer related decline may comprise a substantial fraction of the total decline that the sample size calculations described in Tables 1 and 2 are powered to detect. ADNI includes an age-matched, cognitively normal healthy control cohort from which the potential influence of non-treatment-responsive age-associated decline can be estimated. We illustrate this with sample power calculations for hypothetical trials of MCI subjects powered to detect a 50% slowing of disease progression as measured by various neuroimaging measures (Tables 3 and 4).
Table 3 summarizes estimated sample size requirements assuming a treatment that is effective at slowing both disease-specific atrophy and non-disease-specific age-associated atrophy. Table 3 is analogous to estimates summarized in Table 2 except that the effect size is set to 50% slowing of progression. Power calculations are by formula (2) with parameter estimation using longitudinal ADNI data. Ventricular volume was the most efficient outcome measure under this scenario, requiring an estimated 83 subjects per arm to detect a difference in rate of atrophy equal to one half the rate of atrophy observed in the ADNI pilot data. Mid-temporal cortical thinning, whole brain atrophy, and right hippocampal atrophy were slightly less efficient as potential endpoints (Table 3).
Table 4 summarizes sample size requirements to detect a 50% slowing of disease-specific atrophy, where disease-specific atrophy is defined as the atrophy experienced by MCI subjects above and beyond the atrophy experienced by age-matched cognitively normal ADNI subjects, and the effect size Δ is calculated as 50% of the difference between the normal and MCI rates of atrophy. For trials powered to detect a slowing of disease-specific atrophy (Table 4), middle temporal cortical thinning was the most statistically efficient outcome measure, requiring 252 (left mid-temporal cortex) to 319 (right mid-temporal cortex) subjects per arm to detect a difference in rate of atrophy equal to one half the rate attributable specifically to the Alzheimer degenerative process. Ventricular volume, the most efficient outcome for detecting non-disease-specific atrophy (Table 3), was the least efficient volumetric outcome for detecting Alzheimer’s disease-specific atrophy (Table 4).
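The contrast between the two effect-size definitions can be made concrete with hypothetical atrophy rates (the numbers here are illustrative and are not the ADNI estimates behind Tables 3 and 4):

```python
# Hypothetical annual atrophy rates
rate_mci = 2.0       # %/year in MCI subjects
rate_normal = 0.8    # %/year in age-matched cognitively normal controls

delta_total = 0.5 * rate_mci                      # Table 3-style effect size
delta_specific = 0.5 * (rate_mci - rate_normal)   # Table 4-style effect size

# Since n scales as 1/delta^2, powering for disease-specific slowing
# inflates the required sample size by (delta_total / delta_specific)^2,
# roughly 2.8-fold with these made-up rates.
inflation = (delta_total / delta_specific) ** 2
```

The inflation factor grows quickly as the normal-aging rate approaches the MCI rate, which is why outcomes with large non-specific components (such as ventricular volume in Table 3 versus Table 4) can rank very differently under the two definitions.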
Which sample size algorithm is most appropriate for a given trial? Table 3 is appropriate for treatments presumed to target both non-specific age-associated atrophy and Alzheimer’s disease-associated atrophy. Table 4 is appropriate for treatments presumed to target only Alzheimer’s disease-associated atrophy. Table 4 is conservative if one presumes that some age-associated atrophy observed in cognitively intact elderly is due to a preclinical Alzheimer’s disease neurodegenerative process, in which case sample sizes intermediate between Tables 3 and 4 would be sufficient. Further examples and discussion of this issue can be found in references [8, 11, 15–17, 21, 32].
As noted above, a number of the ADNI publications did not report the magnitude of treatment effect powered for or did not explicitly state the statistical analysis plan upon which power calculations were based. We suggest that, as a minimum standard for reporting power calculation findings, these two items be reported. Estimates of minimum sample size requirements are of little utility to readers if the algorithm used for power calculations and the methods for calculating the parameter estimates used in those calculations are not reported. Furthermore, if the power calculation formula and parameter estimates are published (e.g., McEvoy et al.), then outside investigators can use this information to inform sample size calculations for alternative designs (e.g., longer trials or trials with greater sampling frequency) or alternative treatment effect sizes.
Consideration of several additional issues can greatly improve the value of power calculation reports. Power calculation estimates are valid only if implicit model assumptions are true. Pilot data (e.g., ADNI data) can be used to test these implicit assumptions, and describing diagnostics to justify the proposed analysis plan and power calculation algorithm would greatly improve power calculation reports. As discussed above, parameters used in power calculations are estimated with some uncertainty, and a sensitivity analysis (reporting confidence intervals around sample size estimates) is also an important qualification of power calculation findings. Detailed descriptions of the cognitive and demographic characteristics of the pilot data increase the utility of power calculation reports as well. Covariate adjustment was not discussed in this review, but may be a means of improving the efficiency of clinical trials and deserves further consideration. Finally, we have also not addressed pragmatic issues such as adjusting sample size calculations to accommodate study subject dropout or loss to follow-up, which may vary as a function of the research protocol requirements of the various measurement methods.
We emphasize that we have focused exclusively on statistical issues in comparing published ADNI power calculation papers. A number of issues beyond statistical considerations are critical to planning clinical trials. Not the least of these is establishing the relative feasibility and practical significance of a given percentage slowing of progression on cognitive versus proposed volumetric imaging measures. The current Food and Drug Administration standard for approving Alzheimer treatments is demonstrated effectiveness in slowing cognitive and functional decline. The utility of neuroimaging outcomes, e.g., to demonstrate biological effect in phase 2 trials, or, ultimately, the acceptance of these biomarker measures as outcome measures for phase 3 trials, is yet to be established. Nonetheless, the papers reviewed here consistently demonstrate the potential utility of these outcomes from the statistical efficiency perspective. The increased statistical efficiency translates to shorter trials with substantially smaller sample sizes, meaning more drugs could be effectively tested for the same cost in terms of dollars and human subject burden. Shorter, smaller trials may also be amenable to adaptive trial designs, which would open new avenues for potential gain in trial efficiency.
Supported by NIH/NIA AG010483 (SDE), AG005131 (SDE, MCA), and AG034439 (SDE).