Rate of cerebral atrophy calculated from serially acquired MRI is increasingly used as an outcome measure for clinical trials in AD (Fox et al., 2005; Jack et al., 2003) and MCI (Jack et al., 2008). Attenuation of atrophy may provide a signal of a disease-modifying effect, and sample size requirements may be much lower than those using traditional clinical outcome scores (Jack et al., 2003). Required sample sizes are proportional to the variance of the measure used, and such variability is a combination of within- and between-subject variability. Within-subject variability may arise because of measurement error and physiological variability over time, and numerous approaches to reducing these sources of error have been employed, including improving the stability of scan acquisition; employing multiple scanning time-points (Schott et al., 2006); and developing novel and more accurate image analysis techniques, such as tensor-based morphometry (Hua et al., 2009).

Variation between individuals is likely to reflect several factors, including age, disease stage, differences in underlying pathological substrate (e.g. contribution from vascular disease and TDP-43 pathology (Josephs et al., 2008)), and other as yet unidentified epidemiological or genetic factors. Driving down these sources of variance, which have previously been estimated to contribute over 50% of the variance in whole brain atrophy rate over 1 year in patients with established AD (Schott et al., 2006) and are higher in MCI, is an alternative way to reduce sample sizes.

One method is to “enrich” trials by preselecting patients in an attempt to produce a more homogeneous group. This approach, however, potentially limits the wider applicability of the trial findings. An alternative approach is to include a broader range of individuals, but to predefine baseline characteristics that might be expected to explain inter-individual variation, and incorporate these into the analysis. Using this methodology, and incorporating baseline information routinely collected during the course of a clinical study, we have demonstrated that sample size reductions of 15–30% in established AD and 10–30% in MCI may be achieved.

The raw sample size estimates we have produced to provide 80% power to show a 25% reduction in rate of change for a 1 year study of AD (i.e. ~80 per arm using the KN–BSI; ~120 per arm using the VBSI; and ~90 per arm using semiautomated hippocampal measures) are in line with those suggested by previous work (Barnes et al., 2008; Leung et al., 2009; Schott et al., 2005). In the context of patient recruitment, retention and cost, the 10–30% reduction in sample size potentially achievable by adjusting for baseline covariates, all of which are commonly measured, is not insignificant. The raw sample sizes required for an MCI trial are much larger (i.e. ~150 per arm using the KN–BSI; ~230 per arm using the VBSI; and ~200 per arm using hippocampal measures), but the percentage gains to be made by adjustment are similar, leading to sample sizes that are within the scope of Phase II studies. Few studies have reported confidence intervals on the “raw” sample sizes as we have done (Holland et al., 2009; Schott et al., 2006). Reporting such intervals for sample size estimates is essential to indicate the precision with which they have been estimated.
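The dependence of the required sample size on the variance of the outcome can be illustrated with the standard two-arm formula under a normal approximation. The sketch below is illustrative only: the function name `n_per_arm`, the two-sided 5% significance level, and the example mean and standard deviation are assumptions, not values taken from this study.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(mean_change, sd_change, effect_fraction=0.25,
              power=0.80, alpha=0.05):
    """Subjects per arm needed to detect a proportional reduction in mean change.

    Standard two-sample formula with equal variances in both arms:
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / delta^2
    where delta = effect_fraction * mean_change.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    delta = effect_fraction * mean_change          # absolute effect to detect
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd_change ** 2 / delta ** 2)


# Hypothetical outcome: mean change 10 units/yr with SD 5 units/yr.
print(n_per_arm(mean_change=10, sd_change=5))
# Halving the SD quarters the variance and hence the required sample size.
print(n_per_arm(mean_change=10, sd_change=2.5))
```

Because the sample size scales with the square of the standard deviation, even modest reductions in unexplained variance translate directly into smaller trials.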

In this study, we have analyzed volume loss (or ventricular enlargement) in mL/yr, rather than as a percentage change. The approximate percentage changes we found are in keeping with prior studies (e.g. in AD ~1.5% whole brain atrophy/yr; ~5% hippocampal atrophy/yr). The whole brain atrophy rates were slightly smaller than in some previous studies (Fox et al., 2005; Schott et al., 2006), possibly reflecting that the ADNI cohort was slightly older, or had slightly milder disease, than the cohorts in these other studies.

We found that while adjusting for baseline ventricular volume significantly reduced variability of VBSI, there was relatively little effect of adjusting KN–BSI or HMAPS–HBSI for baseline brain or hippocampal volumes respectively. Thus, while those with greater baseline ventricular volume tended to have greater subsequent ventricular enlargement, there was no evidence that baseline whole brain or hippocampal volumes were associated with subsequent atrophy in the same region.

Our results suggest that certain core features contribute to the observed variance in atrophy rates and, when adjusted for, can significantly reduce the required sample sizes. Thus across all measures and in both AD and MCI, disease severity as measured using the ADAS-Cog is consistent in explaining some between-subject variance. Our results suggest that, for all three measures in AD and MCI subjects, CSF Aβ1–42 explains a moderate amount of variability in outcomes, with lower Aβ1–42 being associated with increased rates of atrophy; by contrast, differences in baseline phosphorylated or total tau explained little variability. These results, seen in both the MCI and AD groups, are perhaps surprising, as reduction of CSF Aβ1–42, reflecting deposition of fibrillar amyloid within plaques, is an early feature of AD and one that may begin to plateau in established disease (Jack et al., 2010). By contrast, elevation of CSF tau is thought to reflect ongoing neuronal degeneration, and thus might be expected to be a more sensitive measure of change throughout the course of the disease. Previous studies assessing the influence of CSF biomarkers on measures of atrophy have shown conflicting results.

De Leon et al. (2006) and Schuff et al. (2009) (the latter analyzing the ADNI dataset) reported higher rates of hippocampal loss in MCI to be associated with lower levels of Aβ1–42. Several studies found increased hippocampal atrophy rates to be associated with higher levels of p-tau in MCI (de Leon et al., 2006; Hampel et al., 2005; Henneman et al., 2009), while in established AD, a weak association between baseline p-tau and whole brain atrophy has been reported (Sluimer et al., 2008). In interpreting the results for individual covariates in explaining atrophy rates, it is important to note that the confidence intervals for the estimated reductions in sample sizes are wide. Furthermore, for two reasons we did not attempt to find the “optimal” subset of covariates to adjust for. First, the optimal subset is likely to vary depending on the particular population studied. Second, defining the meaning of such an optimal subset, and finding it, is highly challenging from both a statistical and substantive perspective, given that all covariates provide some predictive value and that the “cost” of obtaining them often differs between variables (e.g. age vs. CSF). Thus, while our results suggest that disease severity and CSF Aβ1–42 may explain a relatively large proportion of between-subject differences in rate of atrophy, a degree of caution must be used when attempting to estimate the extent of influence of any one measure. The covariates found to be most predictive in this dataset, while biologically plausible, should not automatically be assumed to exert the same effect in all other AD/MCI studies.

Adjustment for baseline covariates can be performed by fitting a regression model for the outcome, with treatment group and the baseline covariates as “independent variables”. If an adjusted analysis is to be used as the primary analysis of a trial, it is generally deemed essential to prespecify in the trial's protocol the regression model to be used and the covariates to be adjusted for, although methods have recently been proposed that allow covariates to be selected using the trial data itself in a way that does not lead to overestimated treatment effects (Tsiatis et al., 2008).
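As an illustration, a covariate-adjusted model of this kind can be written as follows. The notation is assumed for this sketch rather than taken from the study: $Y_i$ is the atrophy (or ventricular enlargement) outcome for subject $i$, $T_i$ is the treatment indicator, $X_{ij}$ are the $p$ baseline covariates, $\sigma^2$ is the unadjusted within-arm outcome variance, and $\rho^2$ is the squared multiple correlation between the covariates and the outcome.

```latex
% Covariate-adjusted (ANCOVA-type) regression model for subject i
Y_i = \beta_0 + \beta_T T_i + \sum_{j=1}^{p} \beta_j X_{ij} + \varepsilon_i,
\qquad \operatorname{Var}(\varepsilon_i) = \sigma^2 \, (1 - \rho^2)

% Approximate consequence for the per-arm sample size in a large trial
n_{\text{adjusted}} \approx (1 - \rho^2) \, n_{\text{raw}}
```

Under these assumptions the treatment effect is estimated by $\beta_T$, and the covariates reduce the residual variance, and hence the required sample size, by the fraction $\rho^2$ of outcome variance they explain.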

For continuous outcomes analyzed by linear regression models, the increase in statistical efficiency afforded through covariate adjustment depends on the strength of the associations between the covariates and outcome, and the size of the study (Cox and McCullagh, 1982). In large randomized studies, adjustment for a small number of baseline covariates incurs a negligible cost in degrees of freedom, because treatment group is independent of baseline covariates (a consequence of randomization). In smaller trials, where this cost is nonnegligible, the efficiency gain from covariate adjustment will be smaller, and adjustment may even be detrimental. The decision as to how many covariates are adjusted for should therefore be made in light of the size of the trial and the presumed strength of the associations between covariates and the outcome. In moderate to large trials, covariate adjustment is expected (approximately) to increase efficiency if the number of covariates is no more than ρ² (the population squared multiple correlation coefficient) times the number of subjects (Cox and McCullagh, 1982).
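The rule of thumb above can be expressed as a simple check. This is a minimal sketch: the function names and the example values of ρ² and trial size are hypothetical, and the rule itself is only approximate.

```python
from math import floor


def max_covariates(n_subjects, rho_squared):
    """Largest number of covariates for which adjustment is expected to help.

    Implements the approximate rule: number of covariates <= rho^2 * n,
    where rho^2 is the population squared multiple correlation coefficient.
    """
    if not 0.0 <= rho_squared <= 1.0:
        raise ValueError("rho_squared must lie between 0 and 1")
    return floor(rho_squared * n_subjects)


def adjustment_expected_to_help(n_covariates, n_subjects, rho_squared):
    """True if adjusting for n_covariates satisfies the rule of thumb."""
    return n_covariates <= rho_squared * n_subjects


# Hypothetical trial: 160 subjects in total, covariates explaining 20% of variance.
print(max_covariates(n_subjects=160, rho_squared=0.2))
print(adjustment_expected_to_help(n_covariates=5, n_subjects=160, rho_squared=0.2))
```

In practice ρ² is unknown at the design stage and must be presumed from earlier data, so a check of this kind is best treated as a guide to how many covariates to prespecify rather than a strict threshold.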

It is likely that maximum gain from neuroprotective agents will be achieved if these are given as early as possible in the disease process, and ideally at an asymptomatic stage even before fulfilment of criteria for MCI (Petersen, 2009). However, if clinical trials are to be powered appropriately, it is critically important that the effect of normal aging is not ignored. It is unlikely that any neuroprotective agent will slow the rate of atrophy to below that seen in normal aging, and as rates of atrophy in MCI are smaller than in AD and consequently closer to normal aging, studies of MCI that do not acknowledge normal aging as a floor effect are in danger of being underpowered. This is demonstrated in this study, where an absolute 25% reduction in atrophy rate equates to a relative reduction, accounting for the effects of normal aging, of ~35% in AD but as much as 50% in MCI, with consequent large increases in required sample sizes when normal aging is taken into account. Simply comparing sample sizes that do not take normal aging into account disadvantages outcomes that show little aging effect (e.g. some cognitive measures), and flatters those with relatively large changes in normal aging (e.g. atrophy).
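The arithmetic behind the floor effect can be sketched as follows. The function names and the example rates are hypothetical and chosen only so that the ratios match the approximate figures quoted above. The inflation factor assumes that the required sample size scales with the inverse square of the absolute effect to be detected.

```python
def relative_reduction(absolute_fraction, disease_rate, aging_rate):
    """Fraction of disease-specific (excess) atrophy removed by a treatment
    that reduces the total atrophy rate by `absolute_fraction`."""
    excess = disease_rate - aging_rate
    if excess <= 0:
        raise ValueError("disease_rate must exceed aging_rate")
    return absolute_fraction * disease_rate / excess


def sample_size_inflation(disease_rate, aging_rate):
    """Factor by which the sample size grows if the target effect is defined as
    a fraction of the excess over normal aging rather than of the total rate
    (assumes n is proportional to 1 / effect^2)."""
    excess = disease_rate - aging_rate
    if excess <= 0:
        raise ValueError("disease_rate must exceed aging_rate")
    return (disease_rate / excess) ** 2


# Hypothetical rates (arbitrary units): normal aging at half the disease rate.
print(relative_reduction(0.25, disease_rate=2.0, aging_rate=1.0))
print(sample_size_inflation(disease_rate=2.0, aging_rate=1.0))
```

The closer the disease-related rate is to the rate seen in normal aging, the larger both quantities become, which is why the penalty is greater in MCI than in established AD.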

This study suggests that, simply in terms of study power and using standard placebo/control designs, preliminary studies of disease-modifying drugs are more likely to show an effect when tested in patients with established AD. This conclusion, however, does not acknowledge that different disease processes may peak at different stages of the disease; that it may be more difficult to halt a widespread and advanced pathological process; and that there is more brain and cognition to be saved in early disease. Advances in accurate, early diagnosis of AD, and novel trial designs incorporating multiple scanning time-points, run-in periods (Frost et al., 2008) or cross-over designs (Cummings, 2009), may however be able to reduce within-subject variability still further and make early treatment studies more viable.

The strengths of this study include the use of a large, well validated, publicly available dataset, consisting of data from representative patients acquired at multiple sites and on different scanners (Petersen et al., 2010); robust statistical methodology; and a critical analysis of a range of different potential covariates in patients with MCI, AD and normal controls, using three different measures of structural change. We did not include PIB-PET measures (Jack et al., 2009) or other genetic haplotype data (Potkin et al., 2009), which might have been able to explain some of the large unexplained between-subject variability. Only ~55% of subjects had CSF results, potentially limiting the validity of our estimates of the benefit of adjusting for the CSF variables, and of adjusting for all the covariates. To deal with the missing CSF values, we used a principled statistical technique for dealing with missing data. This approach uses the relationship between CSF variables and the other variables, estimated in those who had CSF, to (implicitly) predict the missing CSF variables in those who did not have lumbar puncture. The resulting estimates are consistent provided the decision to have CSF did not depend on the unobserved CSF values (conditional on observed variables), which seems reasonable, and provided the underlying statistical model is correctly specified. A comparison of the distribution of fully observed baseline characteristics between those who had CSF and those who did not revealed no statistically significant differences. Furthermore, the estimates of percentage sample size reduction found using the subset of AD/MCI subjects for whom CSF was available were similar to those found using the available data from all subjects. Differences between the estimates may arise for several reasons (Sterne et al., 2009). First, estimates based on data from all subjects are, provided the modeling assumptions are valid, more precise than those based on the subset (~55% of each group) for whom CSF was available. Second, results may differ if the CSF data are not missing completely at random, although as noted there was little evidence against this assumption. We also note that in trials some outcome data are typically missing for some subjects, for a variety of potential reasons. Allowing for such missing values at both the design and analysis stage (e.g. through the use of linear mixed models or imputation methods) is essential.

The linear regression model used is based on a number of assumptions, including linearity of effects, no interactions, constant variance and normality of residuals. However, it has been shown that the covariate-adjusted treatment effect estimates are (in large samples) unbiased without requiring these assumptions (Tsiatis et al., 2008). Using the standard sample size formula, we have assumed that in a future trial the variance of the atrophy/ventricular enlargement outcome would be the same in the two treatment arms, equal to that estimated using the ADNI data. Our estimates of sample sizes with covariate adjustment are valid with the additional assumption that the covariances of the covariates with the atrophy/ventricular enlargement outcomes would be the same in the two treatment arms. The extent to which covariates can explain variability in the outcome, and hence reduce sample sizes, depends critically on the variability of the covariate in the sample. Strictly speaking, therefore, our estimates are applicable to future studies in which AD/MCI patients are recruited using the same criteria as those used in the ADNI study. In particular, the covariates may explain a larger proportion of variability between patients in the wider AD/MCI populations, since the covariates are likely to have greater variability than in the ADNI study. However, the ADNI dataset has been shown to be representative of patients who might be recruited for therapeutic studies (Petersen et al., 2010).

In summary, we have shown that useful reductions in sample sizes may be achieved in AD and MCI trials using measures of cerebral volume change as an outcome measure if baseline characteristics are used as covariates. Required sample sizes are substantially higher in MCI trials than in those carried out in patients with established AD, and accounting for normal aging as a floor below which atrophy rates cannot fall implies that significantly higher patient numbers will be needed for a given drug effect, particularly in MCI. It is critical that future trials of potentially disease-modifying therapies are appropriately powered so as not to miss a potential effect, and these data may help to inform such trial designs.