This study had four main findings. First, a TBM method based on directly aligning group-averaged images was found to be problematic, as it did not correctly control for false positives. This problem was solved by aligning each subject to a single template, and analyzing individual maps. Second, we presented a CDF-based method that can help to decide which methodological choices affect power in TBM; linear (9 parameter) initial registration and larger samples were found to give higher effect sizes, and the dependency on sample size was explored. Third, analysis of voxels in large regions such as the temporal lobe was more powerful than using small regions such as the hippocampus, confirming that TBM is better suited to resolving distributed atrophy than very small-scale changes, at least when used in a cross-sectional design. Fourth, clinical measures of deterioration in brain function (MMSE, CDR scores) were tightly linked with both atrophy and ventricular expansion, but the atrophy measures gave higher effect sizes. The best TBM-based marker of neurodegeneration was temporal lobe atrophy, as this distinguished AD from controls better than other measures.
In our comparison of two types of TBM design, we first used the traditional method, which creates individual Jacobian maps for each subject by non-linearly aligning their MRIs to the normal MDT template. All the Jacobian maps share a common coordinate system defined by the normal MDT, so an average map of the group (normal, MCI or AD) was created by taking the arithmetic mean at each voxel (other possible approaches include using the geometric mean, matrix logarithm mean, Fréchet mean, or geodesic metrics on the deformation velocity (
Woods, 2003;
Avants and Gee, 2004;
Leow et al., 2006;
Aljabar et al., 2008;
Lepore et al., 2008)). Statistical parametric maps may then be computed to associate regional atrophy with predictors measured in each individual (diagnosis, clinical scores, etc.). By contrast, the direct method uses geometric centering to construct an average template that conforms to the group mean geometry, and then a single non-rigid transformation quantifies group differences. Both methods detect tissue loss in the temporal lobes, hippocampus, and thalamus, and widespread widening of sulcal and ventricular CSF spaces, congruent with prior studies (
Baron et al., 2001;
Callen et al., 2001;
Frisoni et al., 2002;
Busatto et al., 2003;
Gee et al., 2003;
Thompson et al., 2003;
Karas et al., 2004;
Testa et al., 2004;
Teipel et al., 2007;
Whitwell et al., 2007).
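The voxelwise averaging step described above can be illustrated with a minimal sketch. It assumes each subject's Jacobian map is available as a flat list of positive determinant values in the common template space; the function names and toy values are hypothetical and are not taken from the study's pipeline. It contrasts the arithmetic mean used here with the geometric mean mentioned as an alternative.

```python
import math

def arithmetic_mean_map(jacobian_maps):
    """Voxelwise arithmetic mean across subjects (maps are equal-length lists)."""
    n = len(jacobian_maps)
    return [sum(values) / n for values in zip(*jacobian_maps)]

def geometric_mean_map(jacobian_maps):
    """Voxelwise geometric mean: average the log-Jacobians, then exponentiate."""
    n = len(jacobian_maps)
    return [math.exp(sum(math.log(v) for v in values) / n)
            for values in zip(*jacobian_maps)]

# Toy data (hypothetical): 2 subjects, 3 voxels.
# At voxel 0 one subject shows a two-fold expansion, the other a two-fold contraction.
maps = [[2.0, 1.0, 0.8],
        [0.5, 1.0, 0.8]]

arith = arithmetic_mean_map(maps)
geom = geometric_mean_map(maps)
```

In this toy case the arithmetic mean at the first voxel is above one, whereas the geometric mean treats the doubling and halving as cancelling, which is one reason log-based averages are sometimes preferred for Jacobian data.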
The direct method has several limitations. First, it is difficult to covary for other variables measured at the individual level, such as age or sex, although this could be circumvented to some degree by matching samples for these variables. Second, it is computationally prohibitive to compute an empirical null distribution for deformations between group average templates, unless tens of thousands of templates are generated from permuted datasets. Null distributions for Jacobian maps based on individual registrations are faster to compute, but do not adequately control for false positives when null experiments are performed (such as aligning two control MDTs with no true difference). Further study is necessary to clarify how registration errors compare when registering individuals and templates to other templates. In a recent study,
Aljabar et al. (2008) computed maps of brain growth in 25 infants scanned one year apart, at one and two years of age, based on creating a mean template for baseline scans and directly aligning it to a mean template from follow-up scans. While they were not able to provide significance measures for the mapped changes, the overall growth factors for gray and white matter, computed from this direct registration, agreed with measures from independent segmentations, and the results were visually reasonable and in line with the neurodevelopmental literature. This suggests that the change rates observed with the direct method may be accurate, at least in a longitudinal study, but their significance is difficult to assess. If the direct method is used in a longitudinal study, it may be more robust than in a cross-sectional study, as the cohorts at each time point are by definition matched on all demographic variables other than time. In a cross-sectional study, any confounds in demographic matching of the groups may enter the maps of group differences, without a statistical means to adjust for them or estimate their effects.
Any TBM study is limited by the accuracy with which deformable registration can match anatomical boundaries between individual brains and corresponding regions on the template. Our mean deformation template (MDT) was created after rigorous nonlinear registration, and geometric centering. Several studies have suggested that registration bias can be reduced, and effect sizes increased, by using an unbiased group-average template of this kind (
Kovacevic et al., 2005;
Kochunov et al., 2002;
Good et al., 2001;
Lepore et al., 2007). Most anatomical features and boundaries are well-preserved in the MDT, and the hippocampus is sufficiently discernible to be labeled by hand on the MDT. Even so, it may not be possible to achieve accurate regional measurements of atrophy, especially in small regions such as the hippocampus, since that would assume a locally highly accurate registration. TBM is best for assessing differences at a scale greater than 3–4 mm (the resolution of the FFT used to compute the deformation field). For smaller-scale effects, direct modeling of the structure, e.g. using surface-based geometrical methods, may offer additional statistical power to detect subregional differences (e.g.,
Morra et al., submitted for publication).
As the ADNI initiative is a study of 200 AD, 400 MCI, and 200 controls, this study focused not just on AD but also on MCI. The focus in the AD field has shifted to MCI in recent years, in the hope of tracking disease progression and ultimately resisting it, before individuals progress to AD. It is useful to know what factors affect detection power or link with cognition in MCI versus AD, as factors that can enhance power in MCI may not be so relevant in a study of AD, and regions in which atrophy correlates with cognition in MCI may not be so relevant to cognition in AD, or in healthy aging. In this study, we therefore included power estimates and measures of effect sizes for TBM studies of both MCI and AD, revealing that sample requirements differ greatly for different effects of interest.
In this study, we did not (beyond multiple pair-wise comparisons) attempt to gain any insight into the shift in morphological changes from normal controls to MCI to AD. A strength of a TBM analysis would be to map all subjects to a common template, and then track the distribution of atrophy as it spreads anatomically over time (e.g.
Thompson et al., 2001) or with clinical progression (
Janke et al., 2001). As ADNI is a longitudinal study, we plan to fit longitudinal models to detect the shift in the location of greatest atrophy as the longitudinal data (e.g., 1 year follow-up scans) become available. This will require repeated-measures methods, which have not yet been validated for TBM, and specialized methods for creating longitudinal mean templates, which are emerging in the literature (see
Lorenzen et al., 2004,
2006).
The ROI-based analyses (Figs. and ) revealed patterns of atrophy in MCI and AD, but with relatively low significance levels. In future, we will see if statistical power can be improved by adjusting for the effects of the CSF signals on the overall estimates of atrophy, as the effects of CSF expansion partially oppose the contraction signal. Due to potential biases, we avoided analyzing effects from the contracting voxels only (i.e., voxels with Jacobian less than one), such as taking the average Jacobian in the contracting regions, or counting the numbers of contracting voxels. Such an approach could be biased, in that a group with greater variance in the Jacobian could have more contracting voxels while having the same mean level of atrophy. Also, an analysis of contracting voxels could be biased towards a group with a very small region of very high atrophy, which could occur, at least in principle.
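The bias described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of the study data: voxel Jacobians are drawn from Gaussian distributions with the same mean (0.98) but different spreads, and all names and parameter values are hypothetical.

```python
import random

def summarize(jacobians):
    """Return the overall mean plus two 'contracting voxels only' summaries."""
    contracting = [j for j in jacobians if j < 1.0]
    return {
        "mean": sum(jacobians) / len(jacobians),
        "fraction_contracting": len(contracting) / len(jacobians),
        "mean_of_contracting": sum(contracting) / len(contracting),
    }

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_voxels = 20000

# Two hypothetical groups with the same mean Jacobian but different variance.
low_var = [rng.gauss(0.98, 0.02) for _ in range(n_voxels)]
high_var = [rng.gauss(0.98, 0.10) for _ in range(n_voxels)]

low_summary = summarize(low_var)
high_summary = summarize(high_var)
```

Although the two simulated groups share the same mean level of atrophy, the fraction of contracting voxels and the average Jacobian within contracting voxels both differ between them. The direction of the difference in voxel counts depends on where the shared mean sits relative to one, so either summary can mislead.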
Use of CDF plots
In neuroscientific studies using TBM, it is vital to optimize statistical power for detecting anatomical differences, especially when evaluating the power of treatment to counteract degeneration, as in a drug trial, or in an epidemiological study to identify neuroprotective factors (
Lopez et al., 2007). Comparison of power across image analysis methods is of great interest, but some caveats are necessary regarding the use of CDF-based approaches, in which the ordered
p-values are plotted and compared to the expected 45-degree line under the null hypothesis of “no effect”. In highly sensitive methods, the departure of the early part of the curve from a 45-degree line will be large (showing a positive upswing). This assumption is supported by our plots (), in which successively larger sample sizes boost the effect size in statistical maps identifying group differences, for both MCI and AD. As shown in the CDF plots (), for all significance thresholds (values on the
x axis), the proportion of significant voxels, detecting group differences, increases dramatically as the group size is enlarged from
N=10 to 40. In prior work (
Lepore et al., 2008), we used this same CDF approach to note that the deviation of the statistics from the null distribution generally increases with the number of parameters included in the statistics, with multivariate TBM statistics on the full tensor typically outperforming scalar summaries of the deformation based on the eigenvalues, trace, or the Jacobian determinant. With this approach, we also found that effect sizes in TBM may be boosted, at least in some contexts, by using mean anatomical templates based on Lie group averaging (
Lepore et al., 2007) or by using deformation models based on information-theoretic Kullback-Leibler distances (
Leow et al., 2007), or using Riemannian fluid models, which regularize the deformation in a log-Euclidean manifold (
Brun et al., 2007).
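The CDF comparison described above can be sketched as follows. The p-values are synthetic: an evenly spaced set stands in for a null map, and squaring it is an arbitrary, hypothetical way to create a map with an excess of small p-values. The function name is also an assumption for illustration.

```python
import bisect

def pvalue_cdf(p_values, thresholds):
    """Fraction of voxels with p <= t, for each threshold t (the empirical CDF)."""
    sorted_p = sorted(p_values)
    n = len(sorted_p)
    return [bisect.bisect_right(sorted_p, t) / n for t in thresholds]

# Synthetic null map: p-values evenly spread over (0, 1), so the CDF follows y = x.
null_p = [(i - 0.5) / 100 for i in range(1, 101)]

# Synthetic "signal" map: squaring shifts p-values toward zero.
signal_p = [p ** 2 for p in null_p]

thresholds = [0.05, 0.10, 0.20]
null_cdf = pvalue_cdf(null_p, thresholds)      # tracks the 45-degree line
signal_cdf = pvalue_cdf(signal_p, thresholds)  # rises above it
```

Plotting `signal_cdf` against `thresholds` alongside the line y = x would show the upswing that is interpreted as greater sensitivity.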
Even so, we do not have ground truth regarding the extent and degree of atrophy or neurodegeneration in AD or MCI. So, although an approach that finds greater disease effect sizes is likely to be more accurate than one that fails to detect disease, it would be better to compare these models in a predictive design where ground truth regarding the dependent measure is known (i.e., morphometry predicting cognitive scores or future atrophic change; see e.g., (
Grundman et al., 2002)). We are collecting these data at present, and any increase in power for a predictive model may allow a stronger statement regarding the relative power of different models in TBM, or the relative power of one image analysis method versus another for tracking brain disease.
A second caveat is that just because a CDF curve is higher for one method than another in one experiment, it may not be true of all experiments. Without confirmation on multiple samples, it may not reflect a reproducible difference between methods. FDR and its variants (
Storey, 2002;
Langers et al., 2007) declare that a CDF shows evidence of a signal if it rises more than 20 times as sharply as the null distribution (i.e., if it reaches the line y = 20x, which corresponds to the conventional 5% false discovery rate), so a related criterion could be developed to compare two empirical mean CDFs after multiple experiments. As simple numeric summaries sacrifice much of the power of maps, and provide a rather limited view of the differences in sensitivity among voxel-based methods, additional work on CDF-based comparisons of methods seems warranted.
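The link between the FDR criterion and the slope of the CDF can be made concrete with the standard Benjamini-Hochberg step-up rule, used here as a stand-in; it is not necessarily the exact FDR variant applied in this study. With q = 0.05, the rule accepts the largest ordered p-value p(k) satisfying p(k) <= q·k/n, which is the same as requiring the empirical CDF value k/n to be at least 20 times p(k). The function name and toy p-values are hypothetical.

```python
def bh_threshold(p_values, q=0.05):
    """Largest p-value passing the Benjamini-Hochberg rule, or 0.0 if none does.

    p(k) <= q * k / n is equivalent to the empirical CDF at p(k), k / n,
    lying on or above the line y = x / q (slope 20 when q = 0.05).
    """
    sorted_p = sorted(p_values)
    n = len(sorted_p)
    threshold = 0.0
    for k, p in enumerate(sorted_p, start=1):
        if p <= q * k / n:
            threshold = p  # keep the largest p-value that satisfies the rule
    return threshold

# Toy example: three very small p-values among five voxels.
with_signal = bh_threshold([0.001, 0.002, 0.003, 0.5, 0.9])

# Toy example: p-values spread evenly, as expected under the null.
no_signal = bh_threshold([0.2, 0.4, 0.6, 0.8, 1.0])
```

A criterion for comparing two methods could, in principle, be built on the same idea by asking how far each method's CDF rises above this line, averaged over repeated experiments.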
Correlations with clinical measures
The corrected P values signify the overall significance levels of the correlations between atrophy and clinical scores within the whole brain. For MMSE, both the positive and two-tailed tests are significant, suggesting a correlation between the regions of volume reduction and lower MMSE scores. For global CDR and sum-of-boxes, we obtain robust results in both negative and two-tailed correlations. As higher CDR scores denote greater impairment, the negative correlation links lower brain volume with greater CDR scores. Based on , atrophy of brain tissue (gray and white matter) detected by TBM links better with cognition than volume expansion (e.g., of the ventricles), although each is significantly associated with both MMSE and CDR. Strictly speaking, the CSF expansion signal may offer less signal to noise than the atrophic signal as we are using statistical tests that depend on the total volume of regions that reach a certain threshold (supra-threshold volume and corrected q-values from FDR). It may be that, if the statistical tests had been formulated differently, e.g., as strict voxel-level comparisons (e.g., maximal t-statistics), they would detect CSF differences with greater effect sizes than atrophic effects.
Analysis of group size
It may seem odd to assess effect size in groups as small as 10 to 40 subjects per group when imaging studies such as ADNI now assess 200 or 400 subjects per group. Here a sample as low as 10 is merely included to show how power completely breaks down when the sample is minimal and not sufficiently powered to detect an effect with reasonable confidence. Although morphometric studies of 10–20 subjects per group were more common in studies five to ten years ago (e.g.
Thompson et al., 2001), most current MRI studies are designed to contrast patients in several categories (treatment versus placebo, MCI converters versus non-converters, ApoE4 carriers versus non-carriers), so it is common to have groups containing as few as 10 subjects for some statistical contrasts (given the low annual rate of conversion from MCI to AD, and the low incidence of certain risk genotypes). As seen with our CDF approach, for contrasts that are underpowered, it may have merit to plot the CDFs based on pilot samples, and assess the rate at which the CDFs are increasing (or not) with successive increments in the sample size. Although there is no widely accepted power analysis for morphometric studies using statistical maps as outcome measures, the CDF based methods, such as those advocated here, offer a means to study whether incrementing a small sample could yield sufficient power to reject a null hypothesis.
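The idea of tracking how the CDF rises with successive increments in sample size can be sketched with a toy simulation. Every ingredient here is an assumption made for illustration: independent voxels, unit-variance Gaussian noise, a fixed group difference of 0.5 standard deviations at every voxel, and a one-sided z-test with known variance in place of the permutation-based statistics used in the study.

```python
import random
from statistics import NormalDist

def fraction_significant(n_per_group, effect=0.5, n_voxels=1000,
                         alpha=0.05, seed=0):
    """Fraction of simulated voxels with one-sided p <= alpha."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference in group means (unit variance per subject).
    se = (2.0 / n_per_group) ** 0.5
    hits = 0
    for _ in range(n_voxels):
        patients = [rng.gauss(-effect, 1.0) for _ in range(n_per_group)]
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = sum(controls) / n_per_group - sum(patients) / n_per_group
        p = 1.0 - std_normal.cdf(diff / se)  # tests patients < controls
        if p <= alpha:
            hits += 1
    return hits / n_voxels

# One point on the CDF (at p = 0.05) for each hypothetical group size.
fractions = {n: fraction_significant(n) for n in (10, 20, 40)}
```

In a pilot setting, the analogous quantity would be read off the empirical CDFs of the real statistical maps; the simulation only illustrates why the proportion of supra-threshold voxels is expected to grow with N when a true effect is present.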
Single-subject analysis
Although these maps () are clearly of interest, several caveats are needed in interpreting them. First, in this case all of the variance used to assess abnormality comes from a statistic comparing the single subject with the normal group, so some covariation for age, sex, and possibly other factors, ideally based on multiple regression in a large sample, would be more appropriate to calibrate the level of age-adjusted atrophy. Second, lower tissue volumes in an individual are not always a sign of disease, so plotting regional volumes as a percentile relative to a normative population (which is essentially what the significance map is) may reflect a combination of disease-related atrophy, and some natural variation in brain volumes. These factors could be easier to disentangle in a longitudinal evaluation of the same patient over time. Finally, as noted by
Salmond et al. (2002), if a Gaussian distribution is assumed for the Jacobian statistics at each voxel, a significant number of false positives may still arise purely due to non-Gaussianity when comparing a single subject to a group. To ensure that the data are smooth enough for the residuals to be regarded as normally distributed, Salmond et al. suggested that the data be first heavily smoothed (using a 12 mm FWHM kernel); alternatively, a large control population could be used to establish a non-parametric reference distribution at each voxel, which is essentially the permutation approach taken here.
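A non-parametric reference distribution of the kind mentioned above can be sketched as a voxelwise percentile map. This is a simplified rank-based illustration, not the exact permutation procedure used in the study; the function name and toy Jacobian values are hypothetical, and no adjustment for age or sex is included.

```python
def percentile_map(subject_map, control_maps):
    """At each voxel, the fraction of controls whose Jacobian is <= the subject's.

    Low values flag voxels where the subject has less tissue than nearly
    all controls, without assuming a Gaussian distribution.
    """
    n_controls = len(control_maps)
    result = []
    for voxel, subject_value in enumerate(subject_map):
        at_or_below = sum(1 for c in control_maps if c[voxel] <= subject_value)
        result.append(at_or_below / n_controls)
    return result

# Toy data: 4 controls and 1 subject, 2 voxels each.
controls = [[1.00, 1.00],
            [0.90, 1.10],
            [1.10, 0.90],
            [1.05, 0.95]]
subject = [0.80, 1.00]

percentiles = percentile_map(subject, controls)
```

With only a handful of controls the attainable percentiles are coarse (multiples of 1/n), which is why a large normative sample is needed before such a map can support strong statements about an individual.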
Anatomical maps and prior work
The main contribution of this paper, relative to prior work using voxel-based morphometry (VBM) and tensor-based morphometry in AD or MCI, is to study the effects of different analysis choices within the framework of TBM, and how they affect the sensitivity for detecting disease effects. Our anatomical findings are largely in line with prior work using automated techniques to map patterns of brain atrophy at voxel-level. Initial formulations of VBM derived maps of structural differences by comparing the local composition of brain tissue types after global position and volumetric differences had been removed through spatial normalization (
Ishii et al., 2005;
Shiino et al., 2006;
Davatzikos et al., 2008;
Fan et al., 2008;
Karas et al., 2007;
Smith et al., 2007;
Vemuri et al., 2008). In contrast, TBM is a method based on high-dimensional image registration, which derives information on regional volumetric differences from the deformation field that aligns the images. Recent reformulations of VBM, termed ‘optimized VBM’ (
Davatzikos et al., 2001;
Good et al., 2001) modulate the voxel intensity of the spatially normalized gray matter maps by the local expansion factor of a 3D deformation field that aligns each brain to a standard brain template. As a result, the final modulated voxel contains the same amount of gray matter as in the native pre-registered gray matter map.
Chetelat et al. (2002) and
Karas et al. (2004) used VBM to analyze patterns of gray matter loss in MCI and AD. Relative to normal subjects,
Chetelat et al. (2002) found that MCI subjects showed significant atrophy in the hippocampus, temporal cortices, and cingulate gyri. Gray matter density in the posterior association cortex was significantly higher in MCI than AD.
Karas et al. (2004) found similar patterns of parietal atrophy in AD and MCI, but found active hippocampal atrophy in the transitional stage from MCI to AD. The authors suggested this discrepancy could be due to borderline significance or differences in disease severity between MCI populations. A very recent study (
Teipel et al., 2007) used the TBM method to study brain degeneration in MCI and AD. They used principal component analysis to extract spatially distributed anatomical features associated with the diagnosis of AD, and they focused on identifying features that may be useful in predicting the transition from MCI to AD. Future longitudinal TBM studies with the ADNI data are likely to reveal which aspects of atrophy are most predictive of future conversion to AD, and which voxel-based methods are optimal for detecting progression or correlations with cognition. As the sample size increases, it may be possible to detect and model effects of the MRI platform, field strength, or acquisition site, to determine whether the multi-site and dual MRI platform acquisition of the data contributed to reduced effect sizes, especially for the MCI group. Comparisons distinguishing MCI from controls may be more sensitive to these effects, whereas the AD versus control group comparison has an effect size so great that it overwhelms any increased variability due to multicenter acquisition. This potential source of variability, which is perhaps not typical of studies in general, will be evaluated in future.