Serial structural brain changes in Alzheimer’s disease were visualized in patients with probable AD (N = 91) and MCI (N = 189), scanned at baseline and after intervals of 6, 12, 18, and 24 months. 3D maps showed the regional distribution of cumulative atrophy across the brain, relative to the baseline scan, and the resulting patterns, at all time-points, are consistent with many past reports that pinpoint areas most affected by AD (Frisoni et al., 2009; Scahill and Fox, 2007; Scahill et al., 2003; Thompson and Apostolova, 2007; Whitwell et al., 2007). The most prominent features detected by longitudinal MRI and TBM were ventricular expansion and temporal lobe atrophy (), consistent with earlier findings.
To improve the utility of voxel-based neuroimaging for clinical trials, which tend to deal with “single-number” outcome measures, we also derived a numeric summary to quantify the overall amount of temporal lobe atrophy in regions consistently affected by AD. After reducing a 3D map into this single atrophy score, we evaluated its statistical power versus standard clinical measures for detecting disease progression in AD and MCI. Sample size estimates were far smaller for neuroimaging measures than for clinical scores, suggesting that fewer patients or shorter observation times would still offer sufficient power in clinical trials using neuroimaging surrogate markers (Chen et al., 2009a, b; Jack et al., 2003; Reiman et al., 2008b; Schuff et al., 2009).
Probable AD patients, followed up at 6-, 12-, and 24-month intervals, showed successively greater cumulative brain atrophy and required progressively smaller sample sizes to detect a 25% reduction in the average change in a hypothetical clinical trial. MCI subjects also showed a gradual increase in the cumulative level of brain atrophy (versus baseline) in maps created at 6-, 12-, 18-, and 24-month intervals. The 18-month follow-up for MCI, when used on its own, did not yield a detectably better sample size estimate than a 12-month trial. Brain atrophy was detectable on MRI as early as 6 months in both the AD and MCI groups. Even so, changes had greater effect sizes at longer intervals, illustrating the trade-off between recruitment requirements (and therefore costs) and the required observation time: more patients are needed for shorter trials, but enrollment may be reduced if a longer observation time is acceptable. Regardless of the scanning interval, TBM is a useful neuroimaging marker that may help reduce costs for clinical trials.
Theoretical and practical considerations
Minimal sample sizes directly relate to effect sizes, which depend upon both the mean and standard deviation (SD) of the atrophy measures. For TBM measures derived from the statistically pre-defined region of interest, both the mean level of cumulative atrophy and its SD increased monotonically with longer inter-scan intervals (); as the mean rose faster than the SD, incrementally greater effect sizes (and smaller sample size estimates) were observed. According to one interpretation, most of the sources of measurement error in estimating the atrophic rate from serial MRI scans (e.g., scanner calibration, RF bias fields) may be roughly the same regardless of the scan interval, but the interval must always be long enough for sufficient systematic atrophy to accumulate and be detectable above the measurement noise.
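As an illustration of how the mean and SD jointly determine the sample size, the following Python sketch computes a per-arm estimate for a two-arm trial. It assumes the standard two-sample formula $n = 2\sigma^2 (z_{1-\alpha/2} + z_{\text{power}})^2 / (r\mu)^2$, with $r = 0.25$ for a 25% slowing, which may or may not match every detail of the power formula referred to in the Materials and methods section. The function name and all input values are hypothetical and are not results from this study.

```python
from statistics import NormalDist


def sample_size_per_arm(mean_change, sd_change, reduction=0.25,
                        alpha=0.05, power=0.80):
    """Per-arm sample size to detect a fractional slowing of mean change.

    Assumed formula (standard two-sample, two-sided test):
        n = 2 * sd^2 * (z_{1-alpha/2} + z_power)^2 / (reduction * mean)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    detectable_difference = reduction * mean_change
    return 2 * sd_change ** 2 * (z_alpha + z_power) ** 2 / detectable_difference ** 2


# Hypothetical values: the mean doubles between intervals while the SD grows
# by only 50%, so the effect size (mean / SD) rises and n falls.
n_short_interval = sample_size_per_arm(mean_change=2.0, sd_change=1.0)
n_long_interval = sample_size_per_arm(mean_change=4.0, sd_change=1.5)
print(round(n_short_interval, 1), round(n_long_interval, 1))
```

Under these assumptions the estimate depends on the data only through the ratio SD/mean, so the sample size falls whenever the mean grows proportionally faster than the SD.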
Even so, longer intervals need not always give the best effect sizes: they may not if the population variability increases drastically. Attrition effects are also relevant in practice. A long study may be useful to maximize the theoretical effect size, but only if patients are willing and able to stay in the trial to allow a longer follow-up. With attrition rates greater than 15–16% per year, shorter trials (e.g., 12-month) became a better choice (), probably because the added effect sizes from cumulative change over a longer period were outweighed by the exponentially growing attrition. Shorter trials may also be favored on cost grounds, as cost generally rises with longer follow-up periods. Even when considering the attrition rate, the 12-month trial is consistently better than the 6-month trial. Although a reasonable amount of change can be detected at 6 months (and differences between MCI and AD group rates of atrophy were detectable at 6 months), a 12-month trial might be the optimal trial duration in practice, assuming a typical attrition rate (20–30% per year) for clinical trials. Clearly, the optimal follow-up interval depends not only on the effect size, but also on the attrition rate, cost and other potentially unmodeled factors.
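The interplay between trial length and attrition can be sketched with a deliberately simple model. The Python example below assumes a constant annual attrition rate a, so that a fraction (1 - a)^t of enrolled subjects completes a trial lasting t years, and it assumes that only completers contribute data. This is not necessarily the attrition model used for the analysis above, and all function names and numbers are hypothetical.

```python
def enrollment_needed(n_completers, annual_attrition, years):
    """Subjects to enroll per arm so that n_completers remain at trial end.

    Assumes a constant annual attrition rate, i.e. the retained fraction
    after `years` is (1 - annual_attrition) ** years.
    """
    if not 0 <= annual_attrition < 1:
        raise ValueError("annual_attrition must be in [0, 1)")
    retained_fraction = (1 - annual_attrition) ** years
    return n_completers / retained_fraction


def break_even_attrition(n_12_month, n_24_month):
    """Annual attrition at which 12- and 24-month trials need equal enrollment.

    Solves n12 / (1 - a) = n24 / (1 - a)**2, giving a = 1 - n24 / n12.
    """
    return 1 - n_24_month / n_12_month


# Hypothetical completer requirements: 100 per arm at 12 months, 80 at 24 months.
for rate in (0.10, 0.20, 0.30):
    short_trial = enrollment_needed(100, rate, years=1)
    long_trial = enrollment_needed(80, rate, years=2)
    print(f"attrition {rate:.0%}: 12-month {short_trial:.1f}, 24-month {long_trial:.1f}")
```

In this toy model the longer trial needs fewer enrolled subjects only while annual attrition stays below the break-even rate; above it, the compounding loss of subjects outweighs the gain in effect size.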
In one recent clinical trial, we used 3D cortical mapping methods to assess cortical thickness at 3-, 6-, and 12-month follow-up intervals in patients with first-episode schizophrenia, randomized to two different medications (Thompson et al., 2008). Intriguingly, medication effects—favoring olanzapine over haloperidol treatment—were significant as early as 3 and 6 months, but were no longer significant at 12 months. This may be because, in that case, medications are most effective soon after disease onset, and less so later, or it may be because, in schizophrenia, atrophic rates are greater soon after disease onset, and less so later (the opposite appears to be the case in AD and MCI, where atrophic rates are thought to accelerate over time as the disease progresses).
In another MRI study measuring atrophy in AD (Schott, 2005), 38 AD patients and 19 controls were scanned at 0, 6, and 12 months, and the boundary shift integral (BSI) method was used to measure whole brain volume loss over time. Consistent with our findings, smaller sample sizes were needed to detect the same effect size at longer intervals. In Schott et al.’s study, only 154 subjects per arm were needed to detect a 20% reduction in the rate of atrophy when measurements were made after 1 year, versus 410 for a 6-month interval. Schott et al. also noted that smaller samples were needed when measuring rates of ventricular enlargement versus whole brain atrophy, if the trial duration was 6 months, but not at 12 months. They found that the variance in the estimated rate of whole brain atrophy was almost twice as high at 6 versus 12 months, but the variance in the ventricular enlargement rate, which can be measured more consistently, was about the same at 6 versus 12 months.
We also found that refining the search region can improve sample size estimates. Use of a statistically pre-defined ROI gave better power estimates at all time-points, compared to an anatomically defined ROI that included the entire temporal lobes ( and ). The use of a statistically-defined ROI based on an independent training sample was first proposed for positron emission tomography (PET) images (Chen et al., 2009a, b; Reiman et al., 2008a). We recently applied the same idea to MRI analysis, and found consistently higher power to detect disease-related brain atrophy (Hua et al., 2009a; Ho et al., in press). A Stat-ROI reduces the search regions to areas showing the most robust changes in AD or MCI, eliminating regions with greater variance or smaller effect sizes across subjects, and avoiding regions with changes in the opposite direction (e.g., ventricular enlargement or intra-sulcal CSF). The Stat-ROI is slightly smaller in size than an atlas-based definition of the temporal lobes () and, relative to the full temporal lobe ROI, it largely removes partial volumed voxels (containing mixtures of different tissues) at gray matter/CSF interfaces. At the interface of brain tissue and CSF, voxels with progressive tissue loss (in the gray matter) are partial volumed with expanding voxels (in the ventricular space) (Hua et al., 2008a, 2009b), which can give high variance across subjects and can reduce statistical power. The Stat-ROI retains regions changing the most in AD, and boosts power substantially.
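The logic of a statistically defined ROI can be sketched in a few lines of Python. The example below is a simplified illustration, not the pipeline used in this study: it assumes each change map is a flattened list of voxel-wise percent changes (negative values denote tissue loss), uses a one-sample t statistic across an independent training sample, and applies an arbitrary threshold of t ≤ -3. The function names, the threshold, and the toy data are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def stat_roi(training_maps, search_mask, t_threshold=-3.0):
    """Indices of voxels with consistent tissue loss in the training sample.

    training_maps: one flattened change map (list of floats) per subject.
    search_mask:   booleans marking the anatomical search region
                   (e.g. the temporal lobes).
    """
    n_subjects = len(training_maps)
    selected = []
    for voxel, in_search_region in enumerate(search_mask):
        if not in_search_region:
            continue
        values = [subject_map[voxel] for subject_map in training_maps]
        spread = stdev(values)
        if spread == 0:
            continue  # t statistic undefined; skip the voxel
        t_stat = mean(values) / (spread / sqrt(n_subjects))
        # Keep only voxels with strong loss; noisy or expanding voxels drop out.
        if t_stat <= t_threshold:
            selected.append(voxel)
    return selected


def atrophy_score(change_map, roi):
    """Single-number summary: mean change within the ROI for one subject."""
    return mean(change_map[voxel] for voxel in roi)


# Toy training data, 4 subjects x 4 voxels: consistent loss, noisy,
# expanding (CSF-like), and consistent loss outside the search region.
training = [
    [-2.0, -3.0, 3.0, -2.0],
    [-2.2, 2.5, 3.2, -2.1],
    [-1.8, -0.5, 2.9, -1.9],
    [-2.1, 1.0, 3.1, -2.2],
]
roi = stat_roi(training, search_mask=[True, True, True, False])
print(roi, atrophy_score([-1.5, 0.3, 2.0, -1.0], roi))
```

Because the ROI is fixed from the training sample, the summary score in the testing sample is not biased by voxel selection, and voxels with high variance or changes in the opposite direction no longer dilute the average.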
We set the temporal lobes as the default search region to compute the Stat-ROI because it is one of the earliest regions to become atrophic in AD (Braak and Braak, 1991; Braskie et al., 2008; Jack et al., 1997, 1998; Killiany et al., 1993; Thompson et al., 2003), showing the greatest differentiation between patients and healthy elderly (Hua et al., 2008a). The exploratory analysis in multiple anatomical regions () further confirmed that the temporal lobes were the best search region to use, when this TBM method is used to identify AD-associated brain degeneration. Several other regions performed similarly in terms of n80. The hippocampus did not show especially high power with TBM, perhaps because it is too small to be modeled accurately with TBM due to partial volume effects (Hua et al., 2009a). The Stat-ROI within the temporal lobe includes several regions with early pathological changes in AD such as the medial temporal lobe and entorhinal cortex (), but largely avoids the hippocampus due to the limitation of TBM in picking up changes with high effect sizes from small structures. In other work (Protas et al., 2009), we have combined data from multiple ROIs for diagnostic classification in AD and MCI, and a similar approach could be taken here. Even so, the temporal lobe statistical ROI almost always gave the best results when used alone, so we report data from that region here for simplicity of implementation and interpretation.
Stat-ROIs based on different diagnostic groups (AD vs. MCI) and based on different time points were very similar. They covered similar regions within the temporal lobes (), and power analysis was insensitive to these different choices of Stat-ROIs (). This suggests that a Stat-ROI derived from one time point may be reasonably applied to data from other time-points, making it easy to implement in clinical trials or other longitudinal studies. As picking a training set at each time point or for each diagnostic group is unnecessary, the number of scans in the testing set is preserved to maximize the sample size for power analysis.
Based on the power formula described in the Materials and methods section, the estimated minimum sample size for each arm is computed assuming a 25% slowing of the atrophy rate, indicated by a multiplier of 0.25 in the denominator. In reality, treatments may slow atrophy to different degrees, which may be denoted by k%, for different k. The sample size estimates required to detect a k% slowing of atrophy can be easily derived by multiplying the sample size estimates (n80 or n90) in this paper by $(25/k)^2$, as the numbers follow an inverse-square law. For example, 4 times as many subjects would be needed to detect a 12.5% slowing of atrophy (half of 25%), versus a 25% slowing of atrophy (Ho et al., in press). The quadratic relationship between the sample size estimates and the percentage atrophic rate is illustrated in , using the n80s for a 12-month trial derived from the Stat-ROI (), as an example. The results of this paper can be easily translated to studies aiming to detect a different level of treatment effect, and our findings remain unaffected as multiplying the variables by a constant $(25/k)^2$ does not alter the ranking of the effect sizes in the statistical tests (it is a monotone transformation, i.e., it preserves the rank order).
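The inverse-square rescaling described above is simple enough to express directly. The short Python sketch below applies the (25/k)^2 multiplier to a sample size estimated for a 25% slowing; the function name and the input value of 100 subjects are hypothetical and serve only to illustrate the arithmetic.

```python
def rescale_sample_size(n_at_25_percent, k):
    """Sample size for a k% slowing, given the estimate for a 25% slowing.

    Applies the inverse-square law: n_k = n_25 * (25 / k) ** 2.
    """
    if k <= 0:
        raise ValueError("k must be a positive percentage")
    return n_at_25_percent * (25.0 / k) ** 2


# Hypothetical n80 of 100 subjects per arm for a 25% slowing.
for k in (12.5, 25.0, 50.0):
    print(f"{k}% slowing: {rescale_sample_size(100, k):.0f} subjects per arm")
```

Halving the detectable treatment effect quadruples the required sample, and because the multiplier is the same for every measure, the ranking of competing measures is unchanged.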
Localization of changes with TBM
It may seem paradoxical that the brain changes with greatest effect sizes in TBM were generally found in large homogeneous regions of the white matter, when AD is widely accepted to be a predominantly hippocampal and cortical gray matter pathology. The predominant site of plaque and tangle accumulation in AD is the hippocampus and cortex, and the molecular hallmarks of AD spread through the cortex in a characteristic trajectory (Braak and Braak, 1991; Braskie et al., 2008). Volumetric atrophy is widespread in the cortical gray matter (GM). Our group, along with many others, has mapped the spatial-temporal trajectory of GM loss in AD, and it largely mirrors the Braak and Braak sequence of AD pathology (Dickerson et al., 2009; Thompson et al., 2003), and the cortical trajectory of plaque and tangle build-up tracked with [18F]FDDNP-PET (Braskie et al., 2008).
For some time, white matter (WM) changes in AD were relatively difficult to quantify in conventional MRI because of the lack of visible anatomical boundaries that would be required to parcellate white matter. Other than white matter hyperintensities, which are hallmark lesions of cerebral vascular disease (Brickman et al., 2009), the nature of WM degeneration in AD has not been well characterized until recently. Diffusion tensor imaging (DTI), relaxometry, and functional connectivity studies have now provided substantial evidence for diffuse WM abnormalities in AD (Buckner et al., 2009; Wozniak and Lim, 2006). Myelin breakdown and Wallerian degeneration both lead to WM atrophy, perhaps secondary to the effect of cortical neuronal loss in AD (Bartzokis, 2009; Bartzokis et al., 2006, 2007; Spires-Jones et al., 2009). Disease-related WM degeneration is detectable in the form of reduced fractional anisotropy on DTI (Huang et al., 2007; Naggara et al., 2006; Rose et al., 2000; Sandson et al., 1999; Wang et al., 2009), lowered relaxation rates in T2-based MRI relaxometry (Bartzokis et al., 2003, 2004), and disrupted connectivity observed using resting-state functional MRI (Agosta et al., 2009; Buckner et al., 2009; Hedden et al., 2009; Supekar et al., 2008; Wang et al., 2007; Zhou et al., 2008).
As both GM and WM changes are occurring, a key question is which of the MRI-derived measures is the most reliable for detecting group differences or dynamic changes over time, and which can resolve them with greatest effect sizes and accuracy in longitudinal studies. As a percentage, more cortical and hippocampal gray matter may be lost over time than white matter. Even so, the effect sizes for the changes in gray matter may be lower than expected, as these structures are convoluted and difficult to measure accurately. The cortex is thin and the hippocampus is narrow, often only a few voxels thick, and automated measures of the same structures using different algorithms can disagree substantially (Morra et al., 2009b), leading to calls for better harmonization of hippocampal measurement methods across studies (Frisoni et al., 2010).
This is especially the case in voxel-based maps, where changes may be greatest, as a percentage of their volume, in the cortex and hippocampus, but when pooling data across subjects voxel-by-voxel, the interiors of large white matter structures still tend to be better registered than the cortical and hippocampal boundaries once all the data are aligned. In the interiors of structures, such as the white matter, coherent patterns (such as atrophy) are more likely to be reinforced across all members of a group than at boundary voxels where loss patterns may be less well registered, even after nonlinear registration. This effect can be partially overcome using voxel-based morphometry (VBM; Ashburner and Friston, 2000) but that method multiplies gray matter “density” measures by registration-based estimates of changes, whereas TBM uses the registration field only. Cortical thickness measures also tend to have relatively poor reproducibility; in fact, different algorithms give mean cortical thickness values for normal subjects that differ by a factor of two (Aganj et al., 2009).
Relation to prior work
Our earlier TBM paper on the 12-month ADNI follow-up data (Hua et al., 2009) focused mainly on optimizing the TBM processing pipeline for statistical power. We studied how the results were affected by critical parameters such as the linear registration steps and the regularization parameters of the deformation model. Because so many parameter options and analysis choices could be made, we compared TBM designs with different linear and nonlinear registration parameters (including different regularizing functions) and identified the best-performing set of parameters from the standpoint of maximizing effect sizes. A secondary focus of that prior paper was to motivate and evaluate the use of a data-driven statistical ROI to compute power estimates. We found that the statistical power of tracking brain degeneration could be further enhanced by using a statistically-defined ROI within the temporal lobes, based on voxels found to change the most in an independent sample. That study served as a foundation for the current study, in which we used the best TBM design and parameters from the prior study.
The current study asks a different question about the optimal length of a clinical trial, with the intent of understanding factors that might influence power in a clinical trial. Expanding the temporal sampling significantly relative to the prior study, we analyzed brain scans collected at baseline, 6, 12, 18, and 24 months, for both AD patients and MCI subjects. Unlike our prior report, the current study used scans at 5 time-points, and from many more subjects, so we were able to gauge the tradeoff between trial duration and recruitment requirements, both with and without considering likely attrition rates. The resulting information offers an evaluation of a simple and readily implementable image analysis method as well as a guideline for trials involving a neuroimaging component.
Limitations and caveats
The current study has some limitations, and some qualifications are needed. As noted before, a 25% reduction in the atrophic rate may have a different functional significance for a patient than a 25% reduction in the rate of decline for clinical or cognitive test scores (and may in reality be either better or worse for the patient). As such, a head-to-head comparison of clinical and neuroimaging measures is useful, but in reality both measures are informative and will continue to be widely used. Secondly, longitudinal studies with more than two time-points can employ more advanced statistical designs that use all the data at once, such as random effects or mixed effects models, to estimate intra-subject variance and take advantage of the repeated measures (Fitzmaurice et al., 2004; Frost et al., 2004). In recent ADNI analyses, using all scans in the time-series has been shown to boost power estimates (Schuff et al., 2009). Some advocate the use of two scans taken on the same day from each subject, showing a net beneficial effect on sample size estimates, especially when the follow-up interval is short (Schott et al., 2006). Finally, it is reasonable that multiple complementary neuroimaging measures, from MRI, PET and other modalities (arterial spin labeling, diffusion imaging, and resting-state fMRI) may be combined in the future, with genetic and CSF biomarker data, to give more robust metrics of disease progression, and to obtain better predictive value in longitudinal models.