|Home | About | Journals | Submit | Contact Us | Français|
Tensor-based morphometry (TBM) is a powerful method to map the 3D profile of brain degeneration in Alzheimer’s disease (AD) and mild cognitive impairment (MCI). We optimized a TBM-based image analysis method to determine what methodological factors, and which image-derived measures, maximize statistical power to track brain change. 3D maps, tracking rates of structural atrophy over time, were created from 1030 longitudinal brain MRI scans (1-year follow-up) of 104 AD patients (age: 75.7 ± 7.2 years; MMSE: 23.3 ± 1.8, at baseline), 254 amnestic MCI subjects (75.0 ± 7.2 years; 27.0 ± 1.8), and 157 healthy elderly subjects (75.9 ± 5.1 years; 29.1 ± 1.0), as part of the Alzheimer’s Disease Neuroimaging Initiative (ADNI). To determine which TBM designs gave greatest statistical power, we compared different linear and nonlinear registration parameters (including different regularization functions), and different numerical summary measures derived from the maps. Detection power was greatly enhanced by summarizing changes in a statistically-defined region-of-interest (ROI) derived from an independent training sample of 22 AD patients. Effect sizes were compared using cumulative distribution function (CDF) plots and false discovery rate methods. In power analyses, the best method required only 48 AD and 88 MCI subjects to give 80% power to detect a 25% reduction in the mean annual change using a two-sided test (at α = 0.05). This is a drastic sample size reduction relative to using clinical scores as outcome measures (619 AD/6797 MCI for the ADAS-Cog, and 408 AD/796 MCI for the Clinical Dementia Rating sum-of-boxes scores). TBM offers high statistical power to track brain changes in large, multi-site neuroimaging studies and clinical trials of AD.
Alzheimer’s disease (AD) is the most common form of dementia in people over the age of 60 (Jellinger, 2006). The disease affects more than 26 million people worldwide, including over 5 million in the U.S. alone (with an estimated economic cost of 156 billion USD per year; Wimo et al., 2006). From onset to death, AD gradually erodes memory, language, and higher-order cognition over a time course of 10–15 years (DeKosky and Marek, 2003; Goldman et al., 2001; Jellinger, 2006; Price and Morris, 1999). People with amnestic mild cognitive impairment (MCI)—a preclinical stage of AD—convert to AD at a rate of 10–25% annually (Petersen, 2000; Petersen et al., 2001; Petersen et al., 1994). Increasing efforts are directed towards treating those with MCI (Jack et al., 2008b; Jack et al., 2005) and people at heightened genetic risk, e.g., those with amyloid precursor protein (APP) or presenilin mutations (Goate, 2006; Goate et al., 1991), apolipoprotein E (APOE) ε4 carriers (Reiman et al., 2001; Reiman et al.,1996; Reiman et al., 2004), or Down’s syndrome (Haier et al., 2003) (reviewed in Dickerson and Sperling, 2005). As new treatments are being developed, neuroimaging measures are urgently required to assess whether treatments slow disease progression in the brain. Ultimately, therapeutic agents aim to improve cognitive function (reduce symptoms), and also resist the advance of neurodegeneration that can be detected in vivo using neuroimaging (Clark and Karlawish, 2003; Fillit, 2008; Ivinson et al., 2008; Mueller et al., 2006).
Clinical or cognitive measures, such as the Alzheimer’s Disease Assessment Scale (ADAS-Cog) (Rosen et al.,1984), are most commonly used to evaluate treatment efficacy in clinical trials, but their low test–retest reliability makes it necessary to use large sample sizes and long observation times (Mueller et al., 2006). To address this, several neuroimaging consortia, including the Alzheimer’s Disease Neuroimaging Initiative (ADNI), are determining which combination of neuroimaging, CSF, genetic, and cognitive biomarkers can collectively provide a more accurate early diagnosis of AD and monitor disease progression with greatest statistical power (Mueller et al., 2005a; Mueller et al., 2005b). There is great interest in determining which brain imaging methods can track disease progression in AD and MCI most powerfully. Ideal neuroimaging measures, or so-called biomarkers of disease progression, would correlate with cognitive performance, predict future clinical decline, and track longitudinal progression with high reliability and statistical power (Halperin et al., 2009; Mueller et al., 2006; Mueller et al., 2005b; Shaw et al., 2007). Eventually, clinical trials will need to provide the evidence needed to determine whether the effects of established disease-slowing treatments on MRI or other biomarkers predict a clinical benefit for regulatory agencies to approve future treatments based solely on these biomarker surrogate endpoints (Reiman and Langbaum, in press).
Structural MRI, in particular, has been proposed and tested for monitoring treatment effects in AD (Grundman et al., 2002; Jack et al., 2008b; Jack et al., 2003; Scheltens et al., 2002). Neuroimaging measures, such as hippocampal volumes (Jack et al., 2002; Jack et al., 1999; Morra et al., 2009b; Morra et al., 2009c; Schuff et al., 2009), ventricular volumes (Carmichael et al., 2006; Chou et al., 2008, 2009a,b; Thompson et al., 2004a), and brain boundary shift integral (BBSI) measures (Fox et al., 2000), have been shown to differentiate patients from controls, correlate with changes over time in clinical and cognitive scores, correlate with pathologically confirmed neuronal loss, and predict future conversion from preclinical to symptomatic AD.
All of these neuroimaging approaches typically provide single numeric summaries (e.g., the volume of a structure) from each patient’s 3D image set. There is growing interest in whether 3D brain mapping methods, which provide a detailed image of brain differences, can provide greater power to track disease progression than a single number derived from a scan (e.g., hippocampal volume). Tensor-based morphometry (TBM), for example, is a promising image analysis technique that computes the locations and rates of tissue atrophy. Changes are determined by elastically or fluidly aligning successive MRI scans of the same subject, using a registration algorithm. Maps of local expansion or compression factors can be used to estimate the local rates of tissue loss or CSF space expansion at each voxel (Ashburner and Friston, 2003; Chung et al., 2001; Fox et al., 2001; Freeborough and Fox, 1998; Riddle et al., 2004; Studholme et al., 2001; Thompson et al., 2000). TBM can also measure volumetric differences in cross-sectional studies by nonlinearly registering individual brain scans to a common anatomical template (Ashburner and Friston, 2003). TBM-derived measures also have several characteristics of an ideal AD biomarker as they have been shown to correlate with cognitive performance, and predict conversion from MCI to AD, in both cross-sectional (Hua et al., 2008a; Hua et al., 2008b) and longitudinal studies (Leow et al., 2009).
In this study, we investigated which methodological parameters, and which image-derived measures, provided the greatest power for tracking AD and MCI progression using TBM. First, we examined TBM designs with different linear and nonlinear registration parameters (including different regularizing functions (Leow et al., 2005; Yanovsky et al., 2008a)), which control the way in which the images deform. Second, we tested whether statistical power was increased if we summarized the changes detected in the maps using a more restricted region-of-interest (ROI). This ROI was defined statistically as the voxels with highest effect sizes for change over time in an independent training dataset (an approach advocated by Reiman, Chen and their colleagues for PET imaging; (Chen et al., 2009; Reiman et al., 2008). As the ROI approach provided substantial benefits, we subsequently performed exploratory analyses to determine how the minimal sample size estimates depended on how the ROI was thresholded and applied to the data. Differences in the statistical maps due to variations in the TBM design were examined by (1) ranking cumulative distribution function (CDF) plots that assess effect sizes for voxel-based maps, and (2) assessing minimal sample sizes to detect slowing of AD and MCI.
We hypothesized that TBM would lead to greatly reduced sample size estimates for clinical trials compared to existing cognitive measures, and would perform comparably to, and in some cases better than, other imaging measures for which power estimates have been reported. In addition, we expected empirically-defined statistical ROIs to boost power even further.
Longitudinal brain MRI scans and associated clinical data were downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) public database (http://www.loni.ucla.edu/ADNI/Data/). ADNI is a large five-year study launched in 2004 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and nonprofit organizations, as a $60 million public–private partnership. The primary goal of ADNI has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessments acquired at multiple sites (as in a typical clinical trial), can replicate results from smaller single site studies measuring the progression of MCI and early AD. Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to monitor the effectiveness of new treatments, and lessen the time and cost of clinical trials. The Principal Investigator of this initiative is Michael W. Weiner, M.D., VA Medical Center and University of California, San Francisco.
Longitudinal brain MRI scans (1-year follow-up) of 515 subjects were analyzed in this study (i.e., 1030 scans total), including 104 AD patients (age at initial scan: 75.7 ± 7.2 years), 254 individuals with amnestic MCI (75.0 ± 7.2), and 157 healthy elderly subjects (75.9 ± 5.1). These subjects are from the same cohort as those in our prior TBM studies of 676 subjects analyzed at baseline (Hua et al., 2008b), and 100 subjects studied longitudinally (Leow et al., 2009). All subjects underwent thorough clinical and cognitive assessment at the time of scan acquisition. Cognitive tests examined here included the Alzheimer’s Disease Assessment Scale-cognitive subscale (ADAS-Cog), a 70-point scale designed to measure the severity of cognitive impairment, which is currently the most widely used cognitive measure in AD trials (Rosen et al., 1984). It consists of 11 tasks assessing learning and memory, language production and comprehension, constructional and ideational praxis, and orientation. The sum-of-boxes Clinical Dementia Rating (CDR-SB), ranging from 0 to 18, measures dementia severity by evaluating patients’ performance in six domains: memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care (Berg, 1988; Hughes et al., 1982; Morris, 1993). The Mini-Mental State Examination (MMSE) provides a global measure of mental status, evaluating five cognitive domains: orientation, registration, attention and calculation, recall, and language (Cockrell and Folstein, 1988; Folstein et al., 1975). The maximum MMSE score is 30; scores of 24 or lower are generally consistent with dementia. For both ADAS-Cog and CDR-SB, higher scores indicate poorer cognitive function, whereas for MMSE, higher scores denote better cognitive function. All AD patients met NINCDS/ADRDA criteria for probable AD (McKhann et al., 1984). Please refer to the ADNI protocol for detailed inclusion and exclusion criteria (Mueller et al., 2005a; Mueller et al., 2005b).
This dataset was downloaded on or before June 1, 2008, and reflects the status of the database at that point; as data collection is ongoing, we focused on analyzing all available baseline and 1-year follow-up scans, together with associated clinical and cognitive scores. The study was conducted according to the Good Clinical Practice guidelines, the Declaration of Helsinki and U.S. 21 CFR Part 50—Protection of Human Subjects, and Part 56—Institutional Review Boards. Written informed consent was obtained from all participants before experimental procedures, including cognitive tests, were performed.
All subjects were scanned with a standardized MRI protocol developed for ADNI (Jack et al., 2008a). Briefly, high-resolution structural brain MRI scans were acquired at 59 ADNI sites using 1.5 T MRI scanners. ADNI also collects a smaller subset of data at 3 T but it was not analyzed here. A direct comparison of 3 T versus 1.5 T field strengths, using a smaller subset of subjects scanned at both field strengths, is reported separately in a different publication (Ho et al., in press).
Using a sagittal 3D MP-RAGE scanning protocol, the typical 1.5 T acquisition parameters were repetition time (TR) of 2400 ms, minimum full TE, inversion time (TI) of 1000 ms, flip angle of 8°, 24 cm field of view, and 192 × 192 × 166 acquisition matrix in the x-, y-, and z-dimensions, yielding a voxel size of 1.25 × 1.25 × 1.2 mm3, later reconstructed to 1 mm isotropic voxels (Chou et al., 2009b).
Image corrections were applied using a processing pipeline at the Mayo Clinic, consisting of: (1) correction of geometric distortion due to gradient nonlinearity (Jovicich et al., 2006), i.e. “gradwarp”, (2) B1-correction for adjustment of image intensity inhomogeneity due to B1 nonuniformity (Chou et al., 2009b), (3) N3 bias field correction for reducing residual intensity inhomogeneity (Sled et al., 1998), and (4) geometrical scaling for removing scanner- and potential session-specific calibration errors using a phantom scan acquired for each subject (Gunter et al., 2006). All original image files as well as images with all of these corrections are available to the general scientific community at http://www.loni.ucla.edu/ADNI/Data/.
To adjust for linear drifts in position and scale within the same subject, the follow-up scan was linearly registered to its matching baseline scan using a 6-parameter (6P) or 9-parameter (9P) registration, driven by a mutual information (MI) cost function (Collins et al., 1994). The 9P registration allows scaling of the scan along 3 orthogonal axes, whereas the 6P registration is a rigid-body alignment and does not allow scaling. To account for global differences in brain scale across subjects, both mutually aligned scans were then linearly registered to the International Consortium for Brain Mapping template (ICBM-53) (Mazziotta et al., 2001), applying the same 9P transformation to both mutually aligned scans. Globally aligned images were resampled in an isotropic space of 220 voxels along x-, y- and z-dimensions with a final voxel size of 1 mm3. All scans were original images after image corrections thus no brain mask was used.
Using high-dimensional nonlinear registration, individual Jacobian maps (i.e., maps of local expansion or compression) were created to quantify 3D patterns of atrophic rates by warping the follow-up scan to match the baseline scan. All the elements of this process are automated; there is no user intervention apart from that required to select scans to analyze and to specify covariates and statistical models. Two separate methods were compared: one method (termed “3DMI”) (Leow et al., 2005), was driven by a mutual information cost function, and a regularizing term based on the linear elasticity operator. A second, alternative, algorithm (termed “sKL-MI”) used the symmetric Kullback–Leibler (sKL-MI) distance, another measure from information theory, with varying registration parameters, sigma (σ) and lambda (λ), denoted as S and L respectively to differentiate from the σ used in power analysis, that control the Jacobian field smoothness and weighting of the regularization, respectively (Yanovsky et al., 2009; Yanovsky et al., 2008a). Briefly, the registration method based on linear elasticity is more commonly used in brain imaging, and has been used in several of our prior TBM studies (e.g., (Gogtay et al., 2008; Hua et al., 2008b)). It models the deforming image as if it were embedded in a deforming physical medium obeying continuum-mechanical laws (Alexander et al., 2001; Broit, 1981; Leow et al., 2006a; Leow et al., 2006b; Shen and Davatzikos, 2002; Thompson et al., 2000). The sKL model causes images to deform in a slightly different way, such that the expansion factor or compression factor (Jacobian determinant) is as spatially uniform as possible in regions where both images have homogeneous intensity. This model, based on information theory, has been advocated as it can avoid statistical bias in the maps of brain change when no true changes are present (Leow et al., 2006a; Leow et al., 2007; Yanovsky et al., 2007a; Yanovsky et al., 2009; Yanovsky et al., 2008b; Yanovsky et al., 2007b). Essentially, each method tends to interpolate the deformation into white matter regions in slightly different ways, and as there is no ground truth (i.e., reference or “gold standard” method) to compute changes in those regions, it is of interest to see which method provides best power for detecting statistical effects on anatomy.
The sKL-MI method solves for the deformation (or, equivalently, for the 3D displacement field) minimizing a compound energy functional consisting of two terms, (1) the image matching term FMI, and (2) a regularization term based on the symmetric Kullback–Leibler divergence, RsKL. There are 2 free parameters (S and L) that can be set by the user, so we also examined their impact on the results. Briefly, the 3D vector-valued flow field u is obtained by driving image I2 into registration with image I1 using a force field (Yanovsky et al., 2009):
where λ(L) is the weight on the regularization term RsKL. The instantaneous velocity field v is obtained by convolving f with a Gaussian kernel Gσ of variance σ2 (S2):
Given the velocity vector field v, the displacement vector field u is obtained by advancing the following partial differential equation in time t:
For both methods, color-coded maps of the Jacobian determinants (the local expansion or compression factors) were used to illustrate regions of volume expansion (i.e., with det J(r)>1), or contraction (with det J(r)<1) (Ashburner and Friston, 2003; Chung et al., 2001; Freeborough and Fox, 1998; Riddle et al., 2004; Thompson et al., 2000; Toga, 1999) over time. These maps of tissue change rates were also spatially normalized across subjects by nonlinearly aligning all individual Jacobian maps to a minimal deformation template (MDT), for regional comparisons and group statistical analyses. The MDT was constructed based on images from 40 normal controls as detailed elsewhere (Hua et al., 2008a; Hua et al., 2008b).
To illustrate systematic differences in atrophic rates between groups, we constructed voxel-wise statistical maps based on the Student’s t statistic. We corrected for the multiple comparisons implicit in making a statistical map, by using permutation tests (Bullmore et al., 1999; Chiang et al., 2007a; Chiang et al., 2007b; Nichols and Holmes, 2002; Thompson et al., 2003). In brief, a null distribution for the group differences in Jacobian (atrophic rate) at each voxel was constructed using 5000 random permutations. For each test, the subjects’ diagnosis was randomly permuted and voxel-wise t-statistics were calculated. A ratio, describing the fraction of the time the t-statistic was more extreme in the randomized tests than the original test, was calculated to give a permutation-based p-value for the significance at each voxel. A “global P-value”, describing the fraction of time the suprathreshold volume (p<0.01, uncorrected) was greater in the randomized maps than the real effect (the original labeling), was calculated to determine whether any significant change could be detected across the brain. This procedure has been used in many prior reports (Braskie et al., 2008; Chiang et al., 2007a; Chou et al., 2009b). This is one of several standard ways to set up a permutation test and is sometimes called set-level inference. It deems a map significant when the total quantity of voxels with p-values lower than a fixed a priori threshold exceeds that obtained in 95% of random simulations.
CDF plots of the p-values were used to compare the effect sizes of (and therefore the power to detect) statistical effects when using (1) TBM designs with different regularizing functions (3DMI versus sKL-MI), (2) different sKL-MI nonlinear registration parameters (sigma = 6 or 9; lambda = 0.5, 1, 2, 5, 8, or 10), and (3) different choices for the pre-TBM linear registration step (6P versus 9P linear registration of follow-up and baseline scans). CDF plots are commonly generated when using false discovery rate methods to assign overall significance values to statistical maps (Benjamini and Hochberg, 1995; Genovese et al., 2002; Storey, 2002). CDFs can also be used to compare effect sizes of different methods based on the expected proportions of voxels with significance above any given statistical threshold under the null hypothesis. We have used this CDF ranking method in several prior reports to compare the power of different methods applied to the same set of images (Chou et al., 2008; Hua et al., 2008a; Lepore et al., 2008; Morra et al., 2009b).
The q-value refers to the point (corresponding value from the x axis) at which CDF plot intersects with the line y = 20x, and this represents the highest statistical threshold for which at most 5% false positives are expected in the map. Significance is declared if the CDF intersects the y = 20x line (other than at the origin), as this shows that the volume of supra-threshold statistics is more than 20 times the value that would be expected by chance alone, under null-hypothesis. The q-value is a measure of the overall significance of each p-value map, with q = 1 indicating full power to always control the false positive rate and q = 0 signifying that no thresholding of the data can adequately control the false positive rate (i.e., the findings are null). The empirical CDF of p-values is not to be confused with the more common FDR PP plot, which is the flip of this plot (i.e., the axes are interchanged). This procedure has been used in several of our prior publications (Chou et al., 2009a; Hua et al., 2008a; Morra et al., 2009c).
Both anatomically and statistically-defined ROIs were used in this study. First, a temporal lobe ROI, including the temporal lobes of both brain hemispheres, was manually delineated on the MDT template by a trained anatomist using the Brainsuite software program (Shattuck and Leahy, 2002). Secondly, an alternative statistically-defined ROI (subsequently called the “stat-ROI”) was defined based on voxels with significant atrophic rates over time (p<0.001) within the temporal lobes, in a nonoverlapping training set of 22 AD patients available at the time of image download, i.e., independent of the testing set used to compute the power numbers (see Fig. 4 for an example). Note that this is a somewhat arbitrary threshold approach to selecting an ROI; in principle the ROI could be chosen specifically to provide maximal power (see later, for more on this topic). A separate stat-ROI was generated and applied for each TBM design with different algorithms (3DMI versus sKL-MI) and different parameter settings (6 versus 9P linear registration and different S and L). A numeric summary—the mean atrophy rate for all voxels with in the ROI—was computed for each person, to summarize annual change within the ROI. We note that ROIs based on independently generated statistical maps have been advocated for use in PET studies by Reiman et al. (Chen et al., 2009; Reiman et al., 2008), and it is also common in fMRI studies to threshold statistical maps using regions identified in orthogonal statistical contrasts or based on independent analyses.
A power analysis was defined by the ADNI Biostatistics Core to estimate the sample size required to detect a 25% reduction in the mean annual rate of atrophy, using a two-sided test and standard significance level (α = 0.05) for a hypothetical two-arm study (treatment versus placebo). We used the individual mean atrophic rates computed within the stat-ROI for sample size calculations. In brief, β denotes the estimated annual change (average of the group) and σD refers to the standard deviation of the rate of atrophy across subjects. The estimated minimum sample size for each arm is computed from the formula:
Here zα is the value of the standard normal distribution for which P[Z<zα] = α and in this case we set α to its conventional value of 0.05 (Rosner, 1990). We estimated the sample size required to achieve 80% and 90% power (subsequently we will refer to these as n80 and n90).
Because we are testing a large number of methodological alternatives, it is necessary to take precautions to avoid inflating the false positive rate due to the large number of comparisons conducted. For example, it is not legitimate to compute effect sizes and minimal sample sizes for a method, and then vary the free parameters of the method until the effect sizes are as large as possible. The latter approach can be used for exploratory purposes but not for drawing statistical inferences regarding the sample sizes. As such, we declared one method at the outset as the “default” TBM method, namely using TBM with our most recent implementation of nonlinear registration. We therefore declared the default method to be the sKL-MI method with parameters set to default values used in our prior papers (S = 9, L = 5, after 9P linear registration (Yanovsky et al., 2009); we also thresholded the statistical ROI at an a priori statistical threshold of p = 0.001. The estimation of sample sizes for the default method may be considered as appropriate for purposes of statistical inference. We additionally explored the influence of varying free parameters (S = 6 or 9; L = 0.5, 1, 2, 5, 8, or 10) on sample size estimates, as post hoc tests to see if sample sizes could be further improved. We note that these power improvements should be regarded as post hoc findings while the best model would then provide the best parameter set and ROI combination for future studies and clinical trials. Particularly, we chose a subset of parameters that covered a dynamic range with relatively small increment to capture a potential nonlinear trend on statistics derived from the CDF and sample size estimates. We did not attempt to test all possible parameter combinations, as each test run required the computation of N = 515 Jacobian maps and permutationbased t tests with an average computation time of 144 h on the LONI computer cluster, which is equipped with 1152 dedicated Sun Microsystems 64-bit 2.2–2.4 GHz processors with 8–16 GB of memory per node.
First, CDF plots were used to compare the effect sizes of brain changes over time, as detected using different TBM designs. In Figs. 1 and and2,2, the CDFs of the p-values observed for the statistical comparison of atrophic rates of patients (AD or MCI) versus normal controls were plotted against the corresponding p-value that would be expected under the null hypothesis of no group difference. Greater effect sizes are represented by larger deviations (upswings) near the origin of these CDF plots. The curves show increasing effect sizes, in rank order from bottom to top, for detecting voxels with statistical differences between groups. In other words, the most powerful methods are shown at the top, but all curves that cross the y = 20x line are considered to have detected significant findings. By comparing AD and MCI with the control group, the greatest differentiation was achieved by the nonlinear registration based on the sKL-MI regularizing function with S = 6 and L = 8, though a few TBM designs attained very similar effect sizes, e.g., sKL-MI S9L8 and S9L10 (Fig. 1).
The 9P linear registration, used prior to the steps of TBM, yielded superior effect sizes compared to using a 6P rigid-body registration, regardless of the choice of regularizing function (Fig. 2). Group differences are therefore easier to detect if some global scaling is allowed when aligning the baseline and follow-up MRIs from the same subject. This is consistent with prior observations in longitudinal MRI studies by Paling et al. (2004). In summary, the optimal TBM design consists of using 9P linear registration of the follow-up to the baseline scans, with subsequent nonlinear registration regularized by sKL-MI, using parameter settings S = 6 and L = 8. The group difference maps and sample size estimates, reported from this point onwards, were based on the 9P linear registration, as it was part of the default method.
From the individual Jacobian maps computed with the optimal TBM design (sKL-MI S6L8), we derived maps of mean group differences in atrophic rates, and statistical maps of group differences based on the permutation-based t test, corrected for multiple comparisons. The 3D maps contrasting AD with normal controls revealed much higher rates of ongoing atrophy in the hippocampi, temporal lobes, and almost all of the white matter, accompanied by progressive CSF expansion in the lateral ventricles and circular sulcus of the insula (Fig. 3a). The only region where progressive white matter atrophy was not found to be faster in AD than in healthy subjects was the white matter of the prefrontal cortex, as seen in Fig. 3a. This is consistent with the anatomical trajectory of AD in the cortex, in which temporal and parietal areas typically undergo fastest atrophy in the earliest stages of the disease, and structural atrophy moves into the frontal poles only relatively late in the disease (Braak and Braak, 1991; Braskie et al., 2008; Thompson et al., 2003).
The MCI group displayed similar patterns and locations of ongoing atrophy as in AD, but the regions in which atrophic rates were faster than for controls was more anatomically restricted and was concentrated on the temporal lobes (Fig. 3b). Of course, there may be active atrophy in all brain regions; these maps only emphasize where the patient groups show detectably faster atrophy than normal controls. As expected, AD patients (global P<0.0001) and MCI subjects (global P = 0.0004) showed overall faster atrophic rates than controls (both P values were corrected for multiple comparisons using permutation testing).
For a two-arm study, the sample size required to detect a 25% reduction in the mean annual change with a two-sided test and α = 0.05, was estimated using a power analysis. The stat-ROI was used to further improve the detection power by concentrating on the signals within the regions that were already known to be undergoing detectable atrophy in a nonoverlapping sample of AD training subjects (Fig. 4). All power analyses (based on stat-ROIs) excluded the 22 AD training subjects, to ensure a fair comparison. We used the mean atrophy rates computed within the stat-ROI for sample size calculations.
As indicated by the mathematical formula and illustrated in Fig. 5, the sample size estimate depends on the mean annual change and its standard deviation in the sample analyzed. Clearly, the required sample sizes are smaller for methods for which the mean rate of change in the parameter measured is higher and/or its standard deviation is lower. The nonlinear registration parameter L controls the sKL-MI regularization. Increasing the value of L leads to more uniform Jacobian determinant values in regions of homogeneous intensity in the scan (e.g., across white matter regions). As a result, both the mean and standard deviation of the mean atrophy rate decreased monotonically when L was increased. The parameter S controls the fluid regularization or the Jacobian field smoothness. Increasing the value of S makes the Jacobian fields smoother, or less noisy. The effect of switching S from 6 to 9 had relatively little influence on sample size estimation, so was not considered further.
Based on the CDF rankings (Fig. 1), and the minimal sample size estimates (Fig. 5), we concluded that the best TBM design is based on sKL regularization with parameters set to S = 6 and L = 8, leading to 48 and 88 as sample size estimates for detecting clinically meaningful changes (25% reductions in the mean atrophic rate) in AD and MCI respectively, with 80% power (Fig. 5 and Tables 1, ,2).2). Detection power was greatly enhanced by using a stat-ROI, defined by an independent training sample of 22 AD patients, compared to an anatomically defined temporal lobe ROI (Table 1).
Although varying the sKL-MI parameters influenced the mean and standard deviation of the average atrophic rate, the sample size estimates remained relatively stable (Fig. 5). The results suggested that sKL-MI is a highly robust method, at least within a plausible range of parameters, and the results were relatively stable regardless of the choice of registration parameters.
The sample size estimates based on TBM (with sKL-MI and 3DMI registration methods) are presented together with those based on standard clinical and cognitive measures (rates of change in ADAS-Cog, MMSE, and CDR-SB) in Table 2. The TBM-derived numeric summaries are about 8 times more powerful than the best clinical assessments (in this case, CDR-SB), in terms of minimal sample size requirements. TBM-derived mean atrophy rates within the stat-ROI offered greatly reduced sample size estimates (i.e., gave superior statistical power) for a hypothetical clinical trial design that assesses reductions in the mean rate of atrophy.
In a post-hoc exploratory analysis, we examined how the group mean atrophy rate and sample size estimates (Fig. 6) were affected by the statistical threshold chosen to define the statistical ROI. Clearly, if a region is defined for signal averaging in an image, a very small region can be defined by setting the statistical threshold very high, but this might not give the highest effect size for the mean atrophic rate in new subjects, as different subjects show slight differences in anatomical localization for the location of greatest atrophy. As such, there is no immediate theoretical basis for choosing the extent or threshold for the statistical ROI; these depend on empirical factors (and the sample size of training images).
Reducing the statistical threshold used to define the ROI had little impact on the group mean atrophy rates and sample size estimates when sKL-MI was used to measure atrophic rates (Fig. 6a). However, sample size estimates were gradually improved in the scenario of 3DMI, with stable mean atrophy rate but lower standard deviation as the threshold was decreased and the ROI became smaller (Fig. 6b). This differential influence of the stat-ROI threshold was somewhat expected. In theory, the sKL-MI makes the Jacobian maps more spatially homogeneous in large areas with homogeneous intensity, e.g., the white matter. In a sense, the sKL statistical maps tend to have higher spatial autocorrelation (i.e., are spatially smoother), thus the exact value used for thresholding them has minimal impact. A more detailed understanding of these correlations in deformation maps can be gained by studying the 6-dimension Green’s function (spatial covariance function) of the differential operator governing the deformation mappings, a topic that we have investigated previously; the Green’s function can also be learned empirically from data (Brun et al., 2009a; Brun et al., 2009b; Fillard et al., 2007; Fillard et al., 2005; Gee, 1999; Grenander and Miller, 1998).
Here we found that a regional numeric summary, derived from TBM, provided a drastically reduced sample size estimate in a power analysis for detecting brain change, with 48 and 88 subjects required in AD and MCI groups respectively for a clinical trial designed to detect 25% improvement in the rate of decline with 80% power. In Table 2, sample size estimates using TBM-based methods were compared with those based on clinical scores in the same group of subjects (N = 515). The best clinical measure was CDR-SB, but even this measure required relatively large sample sizes (n80: 408 for AD, 796 for MCI). These sample sizes are around 8 times higher than those required by the TBM-derived imaging measure. Our power estimates for the clinical measures (MMSE, ADAS-Cog) are similar to those computed by Schuff et al., 2009, in other ongoing ADNI studies. As confirmed by other independent studies, the required sample sizes were drastically lower for structural MRI versus clinical measures (Fox et al., 2000; Jack et al., 2004; Schuff et al., 2009). Specifically, when TBM was used in combination with an empirically-defined statistical ROI, the resulting numeric summary was comparable to, or better than, other imaging measures for which power estimates have been reported (Beckett et al., 2008; Chen et al., 2009; Tosun et al., 2009). We note that for summarizing our TBM measures, the average of all voxels in the ROI was taken, as it was the simplest and most intuitive numerical summary that can be derived from an ROI. Other approaches are clearly possible, such as taking a weighted average—by weighting the voxels according to their effect sizes in the stat-ROI—or using a machine learning method to select the subset of most informative voxels when used jointly for group discrimination (see Sun et al., (in press) and Ferrarini et al., (2008) for examples of the latter approach). Other map-based numeric summaries will be the topic of future study.
In this large-scale analysis—to our knowledge, the largest longitudinal TBM study to date—we evaluated different TBM designs with optimized linear and nonlinear registration parameters, to maximize the power to detect progressive brain atrophy in AD and MCI. We tested twelve different combinations of TBM modules, with each test involving 515 high-dimensional nonlinear registrations and subsequent CDF-based statistical analyses. Even though computational requirements were immense, our findings of the best parameters and analysis choices can now be used in future studies to expedite them.
These results offer several insights regarding the choice of TBM design, including the choice of the regularizing function (sKL-MI versus 3DMI), selection of nonlinear registration parameters that affect the smoothness of the recovered changes (parameters S and L for the sKL method), as well as the type of linear registration (rigid or nonrigid) that spatially aligns the scans before nonlinear registration. Using CDF plots, we ranked different TBM designs in terms of their power to detect significant effects and compared q-values from an FDR analysis. The best power, represented by the top CDF curve and the highest q-value, was achieved with a TBM design with the sKL-MI regularization (with parameters S = 6, L = 8), and 9P-linear registration. The parameter settings chosen as defaults in our prior work (S9L5) gave very similar sample sizes (52 AD, 85 MCI) (Yanovsky et al., 2009). It is conceivable that the best settings may depend somewhat on whether AD or MCI is the target of the study, and they may depend somewhat on the inter-scan interval (here 1 year), with shorter follow-up periods giving noisier data and requiring even greater spatial regularization.
From the CDF plots (Fig. 1), we concluded that first, the elastic deformation model (3DMI) performed approximately as well as the deformation model based on information theory (sKL-MI), with sKL-MI performing slightly better or worse depending on how its parameters were set. When using the optimal combination of parameters (S = 6, L = 8), the sKL-based method demonstrated superior statistical power versus the 3DMI-based TBM design. The sKL method has been advocated in our prior work (Yanovsky et al., 2008a) as it avoids some of the drawbacks with standard image warping methods. For example, brain changes over time detected by most of the registration methods in the literature do not have a meanzero null distribution when no true biological changes are present, due to asymmetry in the registration process (see Leow et al., (2007) for a detailed discussion and mathematical derivations). Second, all deformation models successfully detected differences in atrophic rates between AD and controls, and MCI and controls, as is evident from the fact that all the CDFs rise sharply near the origin and all cross the y = 20x line. This means that for all methods and models, the results were significant in this very large study, but the effect sizes at each voxel differed slightly. Put another way, there is a range of statistical thresholds [0,q] that control the expected false discovery rate to be below 5%, and q is higher for the more powerful methods, meaning that a greater range of thresholds can be used to control the false discovery rate. Thirdly, and as expected, the curves showing MCI-control differences had lesser effect sizes than those showing AD-control differences, which make sense intuitively as atrophic rates in AD are the fastest of all, and atrophic rates in MCI are significantly faster than in controls.
The nonlinear registration parameter L controls the sKL-MI regularization. Increasing the value of L leads to lower mean and standard deviation of the atrophy rate (Fig. 5). This is because the heavier sKL-based regularizations (i.e., higher values of L) tend to reduce bias in homogeneous regions by producing uniform Jacobian determinant maps (e.g., in regions of white matter and ventricles where spatially coherent tissue loss and expansion are expected respectively). In a sense, sKL-based regularization is a type of spatial smoothing that is dominant only within inherently homogeneous regions. Therefore, it does not have the disadvantage of general smoothing which tends to average out signals from adjacent regions with distinct features as well as tissue change rates. For example, at the boundary of gray matter and CSF, it would be unfavorable to average or smooth the signals with opposite signs, i.e. loss of GM and expansion of CSF. Whereas in homogeneous regions of white matter and inside the ventricles, it is ideal to apply smoothing to generate uniform tissue change rates, in default of any other information that would lead to greater spatial resolution. This is at least conceptually related to the intensity consistent filtering approach designed to filter contractions and expansions in neighboring tissues separately so that anatomical boundaries are preserved (Studholme et al., 2003).
The type of linear registration (required as a pre-processing step before nonlinearly warping one scan onto the other) also affected the power to detect group differences in atrophic rates. Alignment of scans over time using scaling (9P linear registration) outperformed the rigid-body registration (6P), probably because the 9P registration can correct for scanner voxel size variations in large studies involving multiple sites, scanners and acquisition sequences. Prior longitudinal MRI studies found that using 9P scaling is advantageous for registration over time, even when significant cerebral atrophy is also occurring in the images (Paling et al., 2004). The concept of using 9P registration for serial MRI analysis was counterintuitive at first as one might suspect that important biological changes over time might be eliminated and incorrectly discounted, if scans were registered using 9P scaling. However, for studies of cerebral atrophy in dementia or other neurodegenerative diseases, the population of interest is relatively old and thus the skull size—one of the major driving forces in registration—is relatively stable. In TBM studies of early development (Gogtay et al., 2008; Hua et al., 2009), where there is substantial dynamic growth of the brain and skull, 6P registration may be more suitable, despite not correcting for possible geometric calibration errors due to scanner drift and effects of magnetic field re-shimming over time.
Some of the image correction steps in ADNI (e.g., gradwarp, and phantom-based image scaling) were not implemented in the study by Paling et al. (2004). Here we were able to achieve greater effect sizes using 9-parameter image registration instead of 6-parameter (rigid body) registration (Fig. 2). This suggests that standard image preprocessing pipeline in ADNI might not be sufficient to correct for all scanner variations, and applying 9P registration over time provided additional benefit. It is also worth noting that we did not apply brain mask prior to linear registration, unlike Paling et al. who delineated the brain region in their images using a semi-automated approach. Hence 9P scaling can be applied to longitudinal studies with and without nonbrain tissues, to correct for any remaining scanner voxel size variation that was not accounted for by image correction algorithms. Also, a very recent paper (Clarkson et al., 2009) suggests that 9-parameter registration was equally good or even superior to using phantom-based image corrections in the ADNI data specifically.
There is some debate as to whether it makes sense to compute minimal sample sizes from the changes in the patient group only, without taking into account the aging changes normally occurring in controls. Perhaps controversially, the consensus in the AD literature has been to base sample size estimates on the effect sizes for change computed in the patients only, even though such an approach can be biased in several respects. First, if changes that normally occur with normal aging are not considered, it is assumed that the goal of therapy is to reduce the patients’ atrophic rate by 25%, even if a more achievable target would be to reduce the patient–control difference (which is smaller) by 25%. Second, if any apparent shrinkage of the brain occurred due to a systematic (nonrandom) error in scanner calibration at some acquisition sites, or due to phantom errors, geometric drifts over time, or unforeseen biases in image processing, then the cause of the error may not be easy to detect in the absence of a control group, but would be taken into account in the statistics if a control group was also used, as their images would presumably also be prone to the same drift effects. To avoid these issues in the current study, we did not rely solely on sample size estimates (n80 and n90)—which were based on the patient groups only—but we also examined effect sizes in comparisons between patients and matched healthy controls. We computed CDFs from statistical maps of group differences between patients and controls, and we used FDR-based q-values to evaluate the power of different TBM designs. As seen in Figs.1 and and5,5, the TBM designs offering best sample size estimates (e.g., sKL-MI S9L1 for MCI) may not always best discriminate patients from controls, indicated by inferior CDF ranking and lower q-value. Clearly the former depends solely on the effect sizes for the change in the patients, while the latter also depends on the effect sizes for change observed in controls.
The idea of implementing a statistically-defined ROI based on an independent training sample was advocated by Reiman et al. (Chen et al., 2009; Reiman et al., 2008) for analysis of positron emission tomography (PET) images. This is a variant of a strategy that has long been used in functional MRI studies, where it is common to limit the search region for statistical effects to circumscribed anatomical regions or regions generally implicated in past studies. These regions can be based on maps derived from statistically orthogonal (and therefore independent) contrasts computed from the same sets of images. Here we showed that a statistically-defined ROI can be used to improve the detection power by concentrating the signals that change most in AD, as determined from a nonoverlapping sample of AD patients (Table 2).
The use of a statistically-based ROI was so much better than using an atlas-based ROI that it motivates further study of how best to use an ROI to optimize power. Different weights could be given to the component voxels, for example; rather than using a simple thresholding of the ROI to compute the mean value, the values in the ROI could be weighted using their effect sizes (e.g., t statistics derived from an independent sample). More sophisticated machine learning procedures could be used to select a subset of voxels in the ROI that are most informative when used jointly (for example using an adaptive boosting technique or a committee based learning method; see (Morra et al., 2009a) for a review of these methods). For example, we recently measured gray matter thickness at 64,000 cortical surface points in MRI data from 36 schizophrenia patients and 36 controls, and a machine learning algorithm (Sun et al., in press) picked a selection of points scattered across the cortex that optimally predicted disease status (patient or control). This achieved an excellent 86.1% classification rate when validated on independent, nonoverlapping, data. By distilling a whole image of features into those that are most predictive, dimension reduction is achieved; these features may also form a spatially distributed anatomical network revealing areas jointly involved in the disease. As the chosen features are all assessed simultaneously, changes in one region that are not necessarily strong enough on their own to be predictive are assessed in conjunction with all others, and systems of co-occurring changes can be identified (Zhang et al., 2008). This is the premise of several machine learning methods that have been applied to brain mapping, such as multivarate network analysis (Alexander et al., 2008), adaptive boosting, online learning, and their variants (see Morra et al., (2009a; 2009b), for a review).
We found that the hippocampus has a high rate of atrophy, and shows faster atrophy in MCI and AD than in controls when the full sample is analyzed (Fig. 3). However, in the stat-ROI (Fig. 4), which was defined in the training set of 22 AD subjects, the effect sizes for atrophic rates were great outside the hippocampus, in the rest of the temporal lobe. The effect size for detecting changes in the hippocampus can be negatively influence by the following three aspects. (1) The hippocampus is immediately adjacent to CSF voxels that are expanding, resulting in high variance and low statistical effect size at the boundaries. (2) Inherent variability in anatomy across subjects, which combined with the smallness of the structure, makes it difficult to align maps of change perfectly across subjects. (3) the effect size of stat-ROI is limited by the size of training set—only 22 AD patients were used to generate the stat-ROI. Adding more subjects to the training set will increase the effect size but at the cost of reducing the sample size of the testing set.
TBM has some specific advantages as a biomarker of disease progression in AD. The serial MRI protocol on which it is based tends to have high test/re-test reliability, and TBM can be used to assess, explore, or confirm the stability of serial MRI by computing changes across short intervals such as 2 weeks (Leow et al., 2006b), where biological changes are minimal. The effects of phantom-based corrections, image inhomogeneity corrections (Boyes et al., 2008), and other image processing steps on longitudinal stability can also be extensively explored, monitored, and optimized, as are effects of the field strength of the scanner (3 versus 1.5 T) (Ho et al., in press) and image reconstruction algorithms on statistical power.
In future studies, we plan to compare the power of several complementary MRI-derived measures, including volumes and maps of hippocampal and ventricular anatomy (Chou et al., 2009a; Morra et al., 2008; Morra et al., 2009c; Schuff et al., 2009), whole-brain and ventricular boundary shift integrals (Fox et al., 2001), automated volumetric parcellations (Fischl et al., 2002), voxel-based morphometry (Alexander et al., 2007; Ashburner and Friston, 2000; Good et al., 2001), abnormality scores derived from high-dimensional pattern classification (Misra et al., 2009), other tensor-based morphometry methods (Studholme et al., 2005) and cortical thickness mapping (Aganj et al., 2009; Thompson et al., 2003; Thompson et al., 2004b) on the same subset of ADNI data. Additionally, longitudinal declines during the different pre-clinical and clinical stages of AD may not necessarily be linear. Thus, one will need to see the extent to which it affects power in more severely affected AD patients or over different trial durations. At the same time, one could develop optimal statistical ROIs tailored to the subject characteristics of future subjects (e.g., mild versus moderately severe AD and the time-line of the proposed study).
It is unlikely that one single biomarker will outperform all others in terms of its power to track change, correlate with clinical decline, or predict outcomes. Furthermore, different biomarkers may differ in the extent to which they are modified by a treatment, predict a treatment’s clinical benefit, or are affected in a confounding way unrelated to the treatment’s therapeutic effects (e.g., increased volume loss measured by brain boundary shift integral (BBSI) was found in antibody responders compared with nonresponders in the clinical trial of AN1792 immunotherapy (Fox et al., 2005)). Use of genetic or CSF biomarkers, as well as other imaging measures (from PET or DTI) will be advantageous. In the meantime however, structural MRI measures offer a dramatic reduction in sample sizes required for clinical trials (an 8-fold difference relative to the best clinical measure in this study). Future efforts will focus on combining measures to reduce this sample size still further.
Data used in preparing this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative database (www.loni.ucla.edu/ADNI). Many ADNI investigators therefore contributed to the design and implementation of ADNI or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators is available at www.loni.ucla.edu/ADNI/Collaboration/ADNI_Citation.shtml. This work was primarily funded by the ADNI (Principal Investigator: Michael Weiner; NIH grant number U01 AG024904). ADNI is funded by the National Institute of Aging, the National Institute of Biomedical Imaging and Bioengineering (NIBIB), and the Foundation for the National Institutes of Health, through generous contributions from the following companies and organizations: Pfizer Inc., Wyeth Research, Bristol-Myers Squibb, Eli Lilly and Company, GlaxoSmithKline, Merck and Co. Inc., AstraZeneca AB, Novartis Pharmaceuticals Corporation, the Alzheimer’s Association, Eisai Global Clinical Development, Elan Corporation plc, Forest Laboratories, and the Institute for the Study of Aging (ISOA), with participation from the U.S. Food and Drug Administration. The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. Algorithm development for this study was also funded by the NIA, NIBIB, the National Library of Medicine, and the National Center for Research Resources (AG016570, EB01651, LM05639, RR019771 to PT).
Author contributions were as follows: XH, SL, IY, AL, YC, AH, BG, AT, and PT performed the image analyses; CJ, MB, ER, DH, JK, NS, GA, and MW contributed substantially to the image and data acquisition, study design, quality control, calibration and pre-processing, data-basing and image analysis. We thank Anders Dale for his contributions to the image pre-processing and the ADNI project.