Structural changes in neuroanatomical subregions can be measured using serial magnetic resonance imaging scans, and provide powerful biomarkers for detecting and monitoring Alzheimer's disease. The Alzheimer's Disease Neuroimaging Initiative (ADNI) has made a large database of longitudinal scans available, with one of its primary goals being to explore the utility of structural change measures for assessing treatment effects in clinical trials of putative disease-modifying therapies. Several ADNI-funded research laboratories have calculated such measures from the ADNI database and made their results publicly available. Here, using sample size estimates, we present a comparative analysis of the overall results that come from the application of each laboratory's extensive processing stream to the ADNI database. Obtaining accurate measures of change requires correcting for potential bias due to the measurement methods themselves; and obtaining realistic sample size estimates for treatment response, based on longitudinal imaging measures from natural history studies such as ADNI, requires calibrating measured change in patient cohorts with respect to longitudinal anatomical changes inherent to normal aging. We present results showing that significant longitudinal change is present in healthy control subjects who test negative for amyloid-β pathology. Therefore, sample size estimates as commonly reported from power calculations based on total structural change in patients, rather than change in patients relative to change in healthy controls, are likely to be unrealistically low for treatments targeting amyloid-related pathology. Of all the measures publicly available in ADNI, thinning of the entorhinal cortex quantified with the Quarc methodology provides the most powerful change biomarker.
Structural magnetic resonance imaging (MRI) is highly sensitive to the neurodegeneration that occurs in Alzheimer's disease (AD), even in prodromal stages [McEvoy et al., 2009; Vemuri et al., 2009]. Atrophy measures in neuroanatomical subregions correlate well with disease stage determined from histopathology [Vemuri et al., 2008], and with clinical measures of disease severity [Jack et al., 2004; McDonald et al., 2009]. They are predictive of clinical decline and conversion to AD in individuals with mild cognitive impairment (MCI) [Fan et al., 2008; Jack et al., 1999, 2004; Kovacevic et al., 2009; McEvoy et al., 2011; Vemuri et al., 2009]. Structural changes over time in neuroanatomical subregions can be quantified from serial MRI scans [Holland et al., 2009], and provide powerful biomarkers for tracking disease progression or slowing of progression with treatment. As a reflection of the progressive neurodegeneration that underlies the cognitive and functional decline in AD, anatomical change measures have high face validity as outcome measures for evaluating putative disease-modifying effects of new therapeutic interventions, and are being evaluated in clinical trial settings as potential surrogates for standard clinical or cognitive outcomes.
To be useful as primary or secondary outcome measures in clinical trials, longitudinal MRI analysis methods must be able to detect with high fidelity subtle structural changes over time. Multiple methodologies have been developed to address this challenge. The Alzheimer's Disease Neuroimaging Initiative (ADNI), a large-scale, multisite, longitudinal study of the natural history of AD [Mueller et al., 2005] was launched in 2003 with an overarching goal of determining the best set of in vivo biomarkers for early detection and tracking of AD (http://www.adni-info.org) [Mueller et al., 2005]. A related goal is to determine which methods provide maximum power for detecting treatment effects in clinical trials of potential disease-modifying therapies [Cummings, 2010]. A unique aspect of ADNI is that all raw data are being made available publicly as they are collected (http://adni.loni.ucla.edu). Research groups funded by ADNI [Jack et al., 2010] have made their derived data publicly available as well, enabling a direct comparison of the relative sensitivity of different methods for detecting and tracking neuropathological changes related to AD. Here, we use ADNI's publicly available derived data to present a comparative analysis of measures of whole-brain and subregional change, obtained from several widely used analysis methods. These methods amount to extensive processing streams and involve many factors, such as image exclusion decisions, quality control, the choice of using the pair of scans or choosing just the best single scan available for subject-timepoints, and the quality of gradient-field nonlinearity unwarping employed, and thus are composed of far more than change-measurement algorithms. Given all these differences, it is only practical to evaluate each methodology based on the overall results individually available from applying them to the same large ADNI database. 
To compare methods, we use estimated sample size requirements, as these have become a standard metric for evaluating biomarkers, and are directly relevant to clinical trial design. We further provide statistical significance results (P values) for differences in sample size estimates obtained in strict head-to-head pairwise comparisons among all measures.
To obtain realistic sample size estimates from longitudinal neuroimaging measures, it is essential to control for potential bias that can arise in image analysis [Thompson and Holland, 2011]. The problem of bias in image registration has been known since the early days of nonrigid morphometric methods [Ashburner and Friston, 2000; Christensen, 1999; Christensen and Johnson, 2001] and has received a great deal of attention recently, including the development of some general and implementation-specific solutions [Leow et al., 2007; Reuter et al., 2010; Yanovsky et al., 2009; Yushkevich et al., 2010]. Sources of bias include asymmetries in image smoothing and/or interpolation, and asymmetry in the image matching or regularization term in the cost function used in image registration. Such bias can be accentuated to varying degrees depending on the minimization scheme used [Yushkevich et al., 2010].
Another critical consideration when estimating sample sizes for treatment response is whether to include effects seen in normal aging as treatable effects. When performing power calculations based on a natural history (nonintervention) trial such as ADNI, sample size estimates are typically calculated for a hypothesized treatment effect expressed in terms of a percentage of the total disease-related effect, for example, a 25% slowing in rate of decline on the outcome measure.
Compared to clinical or cognitive measures, neuroimaging measures are very sensitive to changes that occur over time in cognitively healthy older adults [Fjell et al., 2009; Fotenos et al., 2005; Fox et al., 2000; Jack et al., 2008]. When effect sizes are estimated based on absolute change measures, for example, 25% reduction in the total atrophy rate of a given anatomical structure, the usually implicit and probably false [Herrup, 2010] assumption is that all change over time is due to AD. Several ADNI presentations, the ADNI-2 grant proposal (available at www.adni-info.org/Scientists/ADNIScientistsHome.aspx), and most ADNI studies evaluating or comparing sample size estimates for neuroimaging outcome measures make this assumption implicitly [Beckett et al., 2010; Cummings, 2010; Ho et al., 2010; Hua et al., 2009, 2010; Kohannim et al., 2010; Lorenzi et al., 2010; Nestor et al., 2008; Risacher et al., 2010; Schuff et al., 2009; Vemuri et al., 2010], with a few notable exceptions [Fox et al., 2000; Holland et al., 2009; Leung et al., 2010; McEvoy et al., 2010; Schott et al., 2010]. This is a critically important issue since sample size estimates will be substantially smaller if effect sizes are calculated based on absolute change measures rather than on the difference in change measures between patients and controls.
The use of absolute change measures is valid if all atrophy over time in cognitively healthy older individuals is due to AD, that is, if all cognitively healthy older individuals are in a preclinical state of AD. Current research suggests, however, that only 18% of cognitively healthy older individuals aged 60–69 show signs of amyloid-β (Aβ) pathology, one of the key necessary features for AD, rising to 65% in those over 80 years [Rowe et al., 2010]. Pathological levels of cortical Aβ can be assessed directly through positron emission tomography (PET) imaging of amyloid-sensitive ligands [Rabinovici and Jagust, 2009] or indirectly through cerebrospinal fluid levels of Aβ42 [Blennow et al., 2010]. CSF and PET measures of Aβ pathology correlate highly with each other [Fagan et al., 2009] and with measures of Aβ at autopsy [Ikonomovic et al., 2008]. Proposed models of the trajectories of different AD biomarkers [Aisen et al., 2010; Frisoni et al., 2010; Jack et al., 2010] postulate that Aβ pathology is the earliest detectable sign of AD pathology, and may be apparent a decade or more before other signs of AD occur, such as neurodegeneration and cognitive impairment. These models further postulate that neurodegenerative changes, reflected in atrophy on structural MRIs, are downstream events that occur closer in time to, and underlie, the functional and cognitive impairment that characterize AD—and indeed for familial AD, gradual atrophy acceleration has been found in the prodromal stages [Ridha et al., 2006].
According to these models, atrophy observed in individuals who do not show signs of Aβ pathology would not be due to AD, as Aβ pathology appears prior to, and presumably triggers [Hardy and Selkoe, 2002], the AD-related neurodegeneration. There would thus be no reason to expect that a treatment, such as an anti-amyloid therapy, aimed at slowing progression of AD pathology, would affect atrophy that stems from causes other than AD pathology. Therefore, determination of disease-specific effect sizes for neuroimaging outcome measures, based on a natural history trial of AD, would be best estimated as the difference in atrophy rates experienced by MCI or AD patients relative to atrophy rates observed in cognitively healthy individuals without Aβ pathology. It should be noted, however, that there is little evidence to date for differences in atrophy rates in AD-vulnerable regions between Aβ-positive and Aβ-negative healthy older adults (or healthy older controls, HCs) [Chetelat et al., 2010; Fjell et al., 2010], though using variants of Boundary Shift Integral (BSI) for whole brain, ventricles, and hippocampus, a significant difference was found between 65 Aβ-negative HCs (including two converters) and 40 Aβ-positive HCs (including four converters) [Schott et al., 2010]. Here we compare atrophy rates in ADNI's full HC group, and in HCs separated into two subgroups, those who test negative for Aβ pathology and those who test positive, based on CSF Aβ42 levels, and examine the implications of these findings for sample size estimation.
In this study, we analyze publicly available ADNI data from the application of five methodologies to serial brain scans to determine which method provides the most sensitive detection of anatomical change over time. These methodologies are: (1) Quarc (quantitative anatomical regional change, developed in our laboratory) [Holland and Dale, 2011; Holland et al., 2009], (2) FreeSurfer Longitudinal v.4.4, (3) FreeSurfer Cross-sectional v.4.3 [Dale and Sereno, 1993; Dale et al., 1999; Fischl et al., 1999, 2002; Jack et al., 2010], (4) BSI [Freeborough and Fox, 1997; Leung et al., 2010], and (5) Tensor-Based Morphometry (TBM) [Hua et al., 2008a, b, 2009, 2010]. (We note that “Tensor-Based Morphometry” is sometimes used to refer to any nonlinear registration method, even if the only tensors involved are rank 1, that is, vectors. “Tensor-Based Morphometry” is used in the title of several of the Hua et al. articles, referring to the full processing stream developed and implemented by those authors at LONI, UCLA; the tensors in question comprise the set of 3 × 3 Jacobian matrices, one at each voxel, which result from morphometric registration. The registration itself is not based on these tensors, but the analysis of structural change is based on statistical properties of the Jacobian field. In agreement with Hua et al., here we use “Tensor-Based Morphometry” and “TBM” as identifiers referring exclusively to the complete UCLA-LONI methodology.) Official ADNI data for Voxel-Based Morphometry [Alexander et al., 2010; Ashburner and Friston, 2000; Chetelat et al., 2005; Tzourio-Mazoyer et al., 2002] did not yield meaningful results and so were not included in our analysis. We assess the impact of measurement bias on sample size estimates derived from these neuroimage analysis methodologies and provide sample size estimates, with confidence intervals, for the bias-corrected data, along with P values for pairwise head-to-head comparisons.
We also evaluate the impact of failing to control for changes observed in healthy aging. Finally, to determine the sensitivity of neuroimaging variables as outcome measures, we compare sample size estimates derived using change in the neuroimaging measures to those derived using change on a standard clinical outcome measure, the Clinical Dementia Rating-Sum of Boxes (CDR-SB) score.
The five methodologies under discussion have some aspects in common, and some unique features. FreeSurfer-cross-sectional performs independent tissue segmentations at each timepoint for each subject, and is the only method considered here that does not use any form of registration between longitudinal images; FreeSurfer-longitudinal is conceptually an extension of the cross-sectional variant, where input from all available time points for a subject is used to update the segmentations at each timepoint. Quarc and TBM are nonlinear registration methods, and estimate change by integrating a volume-change field over anatomically predefined tissue ROIs or statistically defined ROIs; they differ significantly in their details, for example, image matching and regularization terms, minimization schemes, use or not of Jacobians, use or not of an atlas, intensity normalization, and so forth. BSI has evolved through several versions, but essentially estimates tissue (contrast) boundary displacements between pairs of affine registered images. Altogether, this is a rich set of methodologies that result in some remarkable similarities and differences, which we discuss below.
All data used in the preparation of this article were obtained from the ADNI database (www.loni.ucla.edu/ADNI). ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and nonprofit organizations, as a $60 million, 5-year public–private partnership. ADNI's goal is to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.
ADNI is the result of efforts of many coinvestigators from a broad range of academic institutions and private corporations. ADNI has recruited 227 cognitively normal individuals to be followed for 3 years, 396 people with MCI to be followed for 3 years, and 193 with mild AD to be followed for 2 years (see www.adni-info.org). The research protocol was approved by each local institutional review board and written informed consent is obtained from each participant.
The ADNI general eligibility criteria have been described elsewhere [Petersen et al., 2010]. Briefly, subjects are not depressed, have a modified Hachinski score of 4 or less, and have a study partner able to provide an independent evaluation of functioning. HC subjects have a Clinical Dementia Rating (CDR) of 0. Subjects with MCI have a subjective memory complaint, objective memory loss measured by education-adjusted scores on Wechsler Memory Scale Logical Memory II, a CDR of 0.5, preserved activities of daily living, and absence of dementia. Subjects with AD have a CDR of 0.5 or 1.0 and meet National Institute of Neurological Disorders and Stroke and Alzheimer's Disease and Related Disorders Association criteria for probable AD.
Analyses were performed on data sets available from www.loni.ucla.edu/ADNI/Data through April 16, 2011. These data sets comprise measures derived from longitudinal structural MRI processed with: Quarc; FreeSurfer-longitudinal (FS); FreeSurfer-cross-sectional (FS×); BSI; and TBM. The measures in these data sets are for various ROIs, both predefined tissue regions and data-driven regions, at baseline and follow-up (generally 6-months apart) through 36-months. Images for Quarc were preprocessed locally, similarly to the preprocessing at Mayo Clinic performed by ADNI for the other methodologies, but using both images per time point where available, and using image correction procedures for site-specific distortion effects updated for recent scanner changes. The other methodologies used only a single scan per timepoint—the best scan in the event of artifactual degradation in the other. Since one of the goals of ADNI is to identify biomarkers that are more powerful than current standard outcomes for tracking early disease progression, sample sizes were also determined using CDR-SB as an outcome variable. Change with respect to baseline was the measure used in all cases (follow-up images were directly registered with baseline). All data that passed quality control (as defined by the several methodologies) for all available time points were used. This provides for a global overview when comparing the full methodological processing streams, and implicitly takes into account differences in the methods' failure rates. We also carried out pairwise comparisons on identical subject-timepoint data sets for a more narrowly focused assessment of relative performance.
For the FreeSurfer-related methods, we focused primarily on the entorhinal, hippocampus, and whole brain, these being the ROIs traditionally of interest in AD studies, but we also provide results for other ROIs in Tables II and III; for BSI we used the whole-brain measure “KN-BSI,” described in [Leung et al., 2010]; and for TBM we used the “Stat-ROI,” the statistically defined ROI in the temporal lobe that was designed to undergo a high degree of change from AD, as described in [Hua et al., 2009]. An earlier analysis of the TBM Stat-ROI showed that the differences between 0 to 6 month and subsequent interval atrophy rates were highly significant for this measure for all diagnostic groups [Thompson and Holland, 2011], and an attempt to redress the issues raised [Hua et al., 2011] resulted in alternative sample size estimates, which we discuss below. Individual subject data from the modified method are not available on the ADNI website, but since a large number of publications have reported on the problematic TBM Stat-ROI results [Beckett et al., 2010; Cummings, 2010; Ho et al., 2010; Hua et al., 2008a, b, 2009a, b, 2010; Jack et al., 2010; Kohannim et al., 2010], and results from the modified method are similar, we provide a more detailed analysis here of this method, and compare results from the two approaches.
Quality control information, using acronymic nomenclature, was explicitly provided by the individual research groups in the publicly available ADNI spreadsheets for FS, FS×, BSI, and Quarc data, and used here for filtering out subject visits that did not have values as follows: FS and FS× OVERALLQC = “Pass” or “Partial”; BSI VENTACCEPT = 1, REGRATING ≤ 3, KMNREGRATING ≤ 3; and Quarc QCPASS = 1. The total numbers of remaining subjects, categorized by diagnosis (and CSF-Aβ status for HCs), for all methodologies and CDR-SB are shown in Table I.
All measures were evaluated for potential bias by estimating the intercept based on a linear fit to the 6- and 12-month timepoints [Yushkevich et al., 2010]. This linear fit was performed simultaneously across groups (AD, MCI, and HC), allowing different slopes for each group but requiring constant intercept, based on the assumption that additive bias, if arising from methodology, should equally affect measures from all groups.
More formally, we used the following linear model to fit for additive bias (intercept) b, and slopes (rates of change) sH for HCs, sM for MCI subjects, and sA for AD subjects: Y = sH × TH + sM × TM + sA × TA + b. Here, Y, TH, TM, and TA are vectors of length equal to the total number of all subject-visits at 6- and 12-months: Y is the vector of response measurements (percent volume change from baseline) for all subjects-visits; TH is the vector of times from baseline for all HC subject-visits, with zeros at positions corresponding to non-HCs; TM, and TA are similar vectors but for MCI and AD subjects, respectively. The general linear model was fit using Matlab, and the null hypothesis that the y-intercept is zero, indicating no bias, was tested.
Measures were corrected for bias by subtracting the estimated b at all follow-up timepoints. Bias-corrected measures were then used for subsequent power calculations.
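The shared-intercept fit and the subsequent bias subtraction can be illustrated with a minimal sketch. It assumes NumPy rather than the Matlab routines used in the study, the function name `fit_shared_intercept` and the toy data are hypothetical, times are taken to be in years, and the significance test on the intercept is omitted.

```python
import numpy as np

def fit_shared_intercept(t, y, group, groups=("HC", "MCI", "AD")):
    """Least-squares fit of y = s_g * t + b: one slope per group, one shared intercept b."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    # One time column per group (zero for visits belonging to other groups),
    # followed by a column of ones for the common additive bias b.
    cols = [np.where(group == g, t, 0.0) for g in groups]
    X = np.column_stack(cols + [np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes = dict(zip(groups, coef[:-1]))
    return slopes, coef[-1]

# Synthetic, noise-free illustration (not ADNI data): 6- and 12-month visits,
# group-specific rates in percent per year, and an additive bias of 1.3%.
t = np.array([0.5, 1.0, 0.5, 1.0, 0.5, 1.0])
group = np.array(["HC", "HC", "MCI", "MCI", "AD", "AD"])
true_slopes = {"HC": 0.5, "MCI": 1.0, "AD": 2.0}
y = np.array([true_slopes[g] * ti for g, ti in zip(group, t)]) + 1.3

slopes, b = fit_shared_intercept(t, y, group)
y_corrected = y - b  # bias-corrected change at every follow-up visit
```

Requiring a single intercept across groups is what separates a method-induced additive offset from genuine group differences in rate of change.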
Power calculations, modeling linear change over time, were performed for each methodology with standard methods briefly described in [Holland et al., 2009], using all available timepoints through 36-months for each subject. Since we are measuring change from baseline, in plots of measured change versus time of measurement all intercepts are zero at baseline. Each subject, however, is assumed to change at an independent rate. Thus we have a linear mixed-effects model (fixed intercepts, fixed group slopes, random individual subject slopes, random within-subject additive or observational error) where, for a specific diagnostic group, the measurement $Y_{ij}$ at time $t_{ij}$ for subject $i$ at follow-up timepoint $j$ is $Y_{ij} = m_i t_{ij} + \varepsilon_{ij}$. Here, $\varepsilon_{ij}$ is the within-subject error, assumed to be independent and identically normally distributed with zero mean and variance $\sigma_\varepsilon^2$; $m_i = m + \gamma_i$, where $m$ is the fixed effect slope (mean rate of change for the group) and $\gamma_i$ is the between-subject random effect slope with variance $\sigma_\gamma^2$. We use the Matlab (version R2009b) nlmefit function in the Statistics Toolbox (http://www.mathworks.com) to obtain maximum likelihood point estimates of $\sigma_\varepsilon^2$, $\sigma_\gamma^2$, and $m$. These fixed and random effects parameter estimates can be used in power calculations to obtain point estimates of the sample sizes $N$, per arm, required for a hypothetical placebo-controlled longitudinal study, as described in [Fitzmaurice et al., 2004]. This approach was used to calculate sample sizes required to detect a 25% slowing in mean rate of decline for a hypothetical disease-modifying treatment versus placebo for a 24-month, two-arm, equal allocation trial, with 6-month assessment intervals. Power calculations were performed with the requirement that the trial have 80% power to detect the treatment effect using a two-sided significance level of 5%.
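As a rough illustration of how the fitted parameters feed into a per-arm sample size, the sketch below uses a textbook two-arm comparison of mean slopes. It assumes each subject's slope is estimated through the origin, so its variance is the between-subject slope variance plus the error variance divided by the sum of squared visit times; this may differ in detail from the formula actually applied in the study. The function name, argument names, and numerical inputs are hypothetical.

```python
from statistics import NormalDist

def sample_size_per_arm(m, var_slope, var_error, times,
                        effect=0.25, alpha=0.05, power=0.80, m_control=0.0):
    """Per-arm N to detect a fractional slowing `effect` of the rate (m - m_control)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Variance of one subject's through-origin slope estimate under
    # Y_ij = m_i * t_ij + e_ij (assumption stated in the lead-in).
    var_total = var_slope + var_error / sum(t * t for t in times)
    delta = effect * (m - m_control)  # absolute difference in mean slope to detect
    return 2 * z ** 2 * var_total / delta ** 2

# Hypothetical parameter values (not taken from ADNI): rate in percent per year,
# visits every 6 months for 24 months.
visits = [0.5, 1.0, 1.5, 2.0]
n_absolute = sample_size_per_arm(m=1.0, var_slope=0.5, var_error=0.3, times=visits)
n_relative = sample_size_per_arm(m=1.0, var_slope=0.5, var_error=0.3, times=visits,
                                 m_control=0.5)
```

Because the detectable difference enters squared in the denominator, halving the treatable rate of change (here by subtracting a control rate of 0.5) quadruples the required sample size.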
When correcting for normal aging, the sample size estimates were calculated using the variance parameters (within-subject error variance $\sigma_\varepsilon^2$ and between-subject slope variance $\sigma_\gamma^2$) from the patient cohort, while the treatment effect size of interest was assumed to be 25% of the difference between the mean rates of change in the patient and healthy populations.
To determine 95% confidence intervals on the sample size estimates, the joint a posteriori probability density function for the mixed effects model parameters (the within-subject error variance $\sigma_\varepsilon^2$, the between-subject slope variance $\sigma_\gamma^2$, and the mean slope $m$) was computed based on the multivariate Gaussian likelihood function of the observed data, given the model parameters, evaluated at a regular mesh of points in the space spanned by the model parameters. The sample size values at the mesh points were then sorted, and the cumulative distribution calculated from the correspondingly sorted a posteriori probability values. The 95% confidence intervals were then computed from the cumulative distribution function (cdf) values of 0.025 and 0.975.
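The sort-and-accumulate step can be sketched as a weighted quantile computation. This is a minimal illustration only: it assumes the sample size and posterior probability at each mesh point are already available as two parallel lists, and the function name `weighted_interval` and the toy inputs are hypothetical.

```python
def weighted_interval(values, weights, lo=0.025, hi=0.975):
    """Interval whose endpoints are the first sorted values reaching cdf >= lo and >= hi."""
    pairs = sorted(zip(values, weights))      # sort sample sizes, carrying their weights
    total = sum(w for _, w in pairs)          # normalise in case weights do not sum to 1
    cum = 0.0
    lower = upper = None
    for v, w in pairs:
        cum += w / total                      # running cumulative distribution
        if lower is None and cum >= lo:
            lower = v
        if upper is None and cum >= hi:
            upper = v
            break
    if upper is None:                         # guard against floating-point shortfall
        upper = pairs[-1][0]
    return lower, upper

# Toy example: 100 equally probable mesh points with sample sizes 1..100.
ci = weighted_interval(list(range(1, 101)), [1.0] * 100)
```

On a real mesh the weights would be the a posteriori probabilities, so densely weighted regions of parameter space dominate the interval.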
To calculate P values for pairwise comparisons of sample sizes required for different measures, we carried out a two-tailed test for the null hypothesis of equal sample sizes ($N_1 - N_2 = 0$). We used the a posteriori probability distributions described above to compute the probability distribution for the difference between the sample sizes for the two measures; the latter is given by the convolution of the two sample size distributions, since $N_1$ and $N_2$ are independent random variables. Thus, $p = 2 \times \min[P(N_1 \le N_2), P(N_2 \le N_1)]$, where $P(N_1 \le N_2) = \mathrm{cdf}_{N_1 - N_2}(0)$ and $P(N_2 \le N_1) = 1 - \mathrm{cdf}_{N_1 - N_2}(0)$ [Casella and Berger, 2002].
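For discrete distributions on a mesh, the same test can be written without forming the convolution explicitly, by summing the joint probability over all pairs with $N_1 - N_2 \le 0$. The sketch below assumes independence as stated above; the function name `two_sided_p` and the toy distributions are hypothetical.

```python
def two_sided_p(values1, weights1, values2, weights2):
    """p = 2 * min(cdf(0), 1 - cdf(0)) for the difference N1 - N2 of independent variables."""
    s1, s2 = sum(weights1), sum(weights2)     # normalising constants
    cdf_at_zero = 0.0
    for n1, w1 in zip(values1, weights1):
        for n2, w2 in zip(values2, weights2):
            if n1 - n2 <= 0:                  # pair contributes to P(N1 - N2 <= 0)
                cdf_at_zero += (w1 / s1) * (w2 / s2)
    return 2 * min(cdf_at_zero, 1 - cdf_at_zero)

# Toy distributions (not ADNI results): clearly separated sample size estimates.
p_separated = two_sided_p([100, 110], [0.5, 0.5], [200, 210], [0.5, 0.5])
# Identical distributions give a large p value.
p_same = two_sided_p([1, 2, 3], [1, 1, 1], [1, 2, 3], [1, 1, 1])
```

The double loop is quadratic in the number of mesh points; for fine meshes, binning the sample sizes first would keep the cost manageable.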
The TBM Stat-ROI measure showed statistically significant bias, as illustrated in Figure 1. This figure shows the average cumulative atrophy detected by this method, as a percentage of baseline volume, for HC, MCI, and AD subjects up through 36-months, along with the additive bias estimate. Without accounting for bias, the cumulative atrophy plots show that all three groups undergo an initial high rate of change (an average of 1.64% for HC, 1.93% for MCI, and 2.28% for AD over the first 6-month interval of the study), with substantially lower rates of change after that (e.g., average change in the second 6-month period of the study is 0.34% for HC, 0.55% for MCI, and 0.76% for AD subjects; for the final 12-months of the study, the average change was 0.09% for MCI and −0.01% for HC, a pronounced deceleration that is indicative of higher-order contributions to bias in TBM). Based on the simultaneous linear fit to the 6- and 12-month timepoints of all three diagnostic groups, estimated additive bias was 1.31%, which is equal to 68% of the observed change in the MCI cohort and 57% of the observed change in the AD cohort at 6-months. The P value for the null hypothesis of no bias was, to the precision of the Matlab numerical libraries, 0. Although subject data from the modified TBM Stat-ROI method are not available, Hua et al. [2011] report a bias of 0.29% in the modified method, which is substantially reduced from the 1.31% reported above. However, the new measures of change for the Stat-ROI are also substantially reduced (average change measurement of 0.5% for HC, 0.6% for MCI, and 0.9% for AD over the first 6-month interval of the study), so that the bias as a percentage of 6-month change in MCI remains large: 48%. (We note that the bias reported in [Hua et al., 2011] would be slightly reduced if actual visit times were used, that is, not assuming that all visits happened exactly at 6-month intervals.)
The KN-BSI measure also showed significant bias, accounting for 18% of the change observed in MCI subjects and 12% of the change observed in AD subjects at 6-months (P = 0.042). None of the other methods showed significant bias.
Estimated sample sizes for absolute-change were smallest for the bias-uncorrected TBM Stat-ROI measure. For example, using 291 MCI subjects, the sample size estimate to detect 25% slowing in the MCI population was N = 84, and the 95% confidence interval was CI = [71 103]. After bias correction, however, this estimate more than tripled, to N = 287, CI = [223 395], rendering this method significantly less powerful for detecting change than bias-corrected KN-BSI (head-to-head comparison of 266 MCI subjects: TBM N = 319, CI = [239 457]; KN-BSI N = 147, CI = [117 197]; P value for difference in sample-size estimates = 0.0002), Quarc entorhinal (236 MCI subjects: TBM N = 233, CI = [179 327]; ERC N = 150, CI = [118 202]; P = 0.029), and Quarc hippocampus (Hipp. N = 156, CI = [122 210]; P = 0.047). Sample size estimates for all methods, after bias-correction, are shown in Table II for MCI and Table III for AD.
For the modified TBM Stat-ROI method, the sample size estimate reported for absolute change in MCI is N = 129; correcting this for the 0.29% bias (and using the new estimated MCI annual rate of change of 1%, along with the standard sample size formula, ibid.), gives N = 129/(1 − 0.29)² = 256, close to the value N = 287, CI = [223 395], reported above and in Table II for the original TBM Stat-ROI method.
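The correction above relies on the sample size scaling inversely with the square of the detectable rate of change, so removing a fraction f of the measured rate multiplies N by 1/(1 − f)². A minimal check of that arithmetic follows; the function name `rescale_sample_size` is hypothetical.

```python
def rescale_sample_size(n, removed_fraction):
    """Rescale N when a fraction of the measured rate of change is not treatable effect."""
    return n / (1.0 - removed_fraction) ** 2

# Bias of 0.29% out of a 1% annual rate of change removes a fraction 0.29 of the effect.
n_bias_corrected = rescale_sample_size(129, 0.29)
```

Rounded to the nearest integer, `n_bias_corrected` reproduces the value of 256 quoted in the text.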
To determine whether atrophy rates in HCs are due to preclinical AD (the implicit assumption in numerous published studies that use absolute rather than relative change measures in sample size calculations), we examined atrophy rates in the two ROIs affected earliest in AD (entorhinal cortex and hippocampus), as well as in the whole brain, in HCs who tested negative for Aβ pathology, based on the cut-off value of CSF Aβ42 levels >192 pg/mL as determined by Shaw et al., for all methodologies for which these ROIs are defined; we also similarly examined the TBM temporal lobe Stat-ROI (note that CSF measures were obtained only on a subset of ADNI subjects; see Table I). Figure 2 shows bias-corrected annual atrophy rates with 95% confidence intervals for the full HC group, the two HC subgroups who were, respectively, Aβ-negative and Aβ-positive, and the MCI group; these atrophy rates were calculated using a mixed-effects regression model on all baseline data and follow-up data available up to 3-years, as described in the Methods section.
As shown in Figure 2, longitudinal volumetric changes in entorhinal cortex, hippocampus, and whole brain are clearly present in Aβ-negative HC subjects, and the annual percentage changes in these subjects do not substantially differ from those observed in the full HC group. Atrophy rates in Aβ-negative HC subjects for all ROIs, regardless of methodology, are a substantial fraction (one-third to one-half) of the atrophy rates seen in MCI subjects. As the anatomical changes seen in Aβ-negative HC subjects are not likely to be due to Aβ pathology, there is no reason to expect that they would be affected by therapeutic agents designed specifically to target amyloid pathology. The potentially treatable effect therefore would be most realistically defined as the amyloid-negative aging-corrected rate of change in the patient cohort, that is, the difference in rates of change between the patient group and the Aβ-negative HC subjects. However, since there is little difference between the atrophic rates of change in the Aβ-negative HC group and the full HC group (compared with the differences between either of these groups and the MCI or AD groups), the potentially treatable effect could conservatively be defined as the aging-corrected rate of change in the patient cohort, that is, the difference in rates of change between the patient group and the full HC group–to take advantage of the larger N of the full control group.
The annual rates of change for the TBM Stat-ROI shown in Figure 2 are very similar to those for the whole brain measure KN-BSI, which in turn are fairly consistent with the whole brain measures for Quarc, FS, and FS×. (Note that the statistics for the HC(Aβ+) group are poorest, reflecting the relatively small number of subjects in that group—see Table I.) This is, on the face of it, an unexpected result because the TBM Stat-ROI was specifically designed to identify the brain subregion undergoing the highest rate of change, yet the resulting rates are essentially the same as those for whole brain change obtained by the other methods. The estimates reported in [Hua et al., 2011] imply even smaller rates of change; for example, the TBM Stat-ROI annual rate of change for MCI, when correcting for the remaining bias, is 1 – 0.29 = 0.71%, compared with 0.98%, CI = [0.90% 1.06%], for KN-BSI.
Sample size estimates using disease-specific (aging-corrected) and absolute (aging-uncorrected) rates of change for bias-corrected data are shown in Figure 3 for MCI subjects and Figure 4 for AD subjects, for representative measures for each methodology (numerical values for these and other neuroimaging measures are shown in Tables II and III). For reference, the sample size estimate using CDR-SB, the most sensitive standard clinical outcome measure, is also shown. P values for the significance of the difference in sample size estimates from all head-to-head pairwise comparisons of relative-change (aging-corrected) measures are in Tables IV and V, for MCI and AD, respectively. In the figures, data are arranged in ascending order for the conservative aging-corrected sample size estimate, “MCI-HC” or “AD-HC.” CDR-SB represents a demarcation for those neuroimaging measures that are competitive with the current clinical outcome measure in tracking longitudinal change over time in MCI and mild AD. In the case of MCI, only the entorhinal cortex (ERC) as measured by Quarc has a conservative disease-specific 95% confidence interval that is wholly below the CDR-SB confidence intervals (head-to-head comparison involving 311 MCI subjects and 181 HC subjects gives: Quarc ERC N = 297, CI = [216 439]; CDR-SB N = 582, CI = [417 884]; P value for difference in N's = 0.01). Head-to-head comparisons of the Quarc measures with measures from the other methodologies are shown in Figure 5 for MCI and Figure 6 for AD; P values presented in the first row in Table IV show that for MCI the Quarc entorhinal is significantly more powerful than all other measures of change, except the Quarc Hippocampus. From either the last column (MCI-HC) or last row (MCI-HC(Aβ−)) in Table IV, Quarc ERC is the only measure that is statistically significantly superior to CDR-SB.
From Figure 2, the measurement of rates of change for FS Hippocampus appears to be anomalous for the HC(Aβ−) and HC(Aβ+) groups, which show substantial and significant difference. This is in contrast with the much higher degree of similarity among these subject groups for all other measures. Furthermore, one would expect the point estimates for FS and FS× to be similar, as indeed they are for the hippocampus measured for all MCIs and all HCs, and for all subject groups for the whole brain. It should be noted that there are much fewer subjects in the HC(Aβ−) and, particularly, the HC(Aβ+) groups compared to all HCs (see Table I), so the estimates for these subgroups are not as robust.
Sample size estimates for the more realistic amyloid-negative aging-corrected rates of change, that is, for “MCI-HC(Aβ−)”, are generally lower compared with those for “MCI-HC”, reflecting the slightly smaller atrophic rates for HC(Aβ−) compared with those for HC in Figure 2. Note that the point estimates for both the Quarc and FS entorhinal remain unchanged.
Using the estimated annual rates of change reported in [Hua et al., 2011] for MCI, and noting that additive bias is essentially eliminated when calculating relative change between groups, the sample size estimate from the modified TBM Stat-ROI method for the change in MCI relative to that in all HCs is N = 129/(1 – 0.7)² = 1,433, close to the value N = 1,358, CI = [712 3624], reported in Table II for the original TBM Stat-ROI method.
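The arithmetic behind this estimate can be sketched as follows. The sketch assumes the usual power relationship in which the required sample size scales inversely with the square of the effect size, with the variance unchanged, so that moving from absolute change in patients to change relative to controls inflates N by 1/(1 − r)². Here r is taken to be the ratio of the HC rate of change to the MCI rate, which is an interpretation of the factor 0.7 quoted above rather than something stated explicitly; the function name is illustrative only.

```python
def relative_change_sample_size(n_absolute: float, control_to_patient_ratio: float) -> float:
    """Scale an absolute-change sample size to a relative-change sample size.

    Assumes N is proportional to 1 / (effect size)^2 with unchanged variance, so
    N_relative = N_absolute / (1 - r)^2, where r = control rate / patient rate
    (an assumed reading of the 0.7 factor in the text).
    """
    if not 0.0 <= control_to_patient_ratio < 1.0:
        raise ValueError("ratio must lie in [0, 1)")
    return n_absolute / (1.0 - control_to_patient_ratio) ** 2


# Values quoted in the text: N = 129 for absolute change, ratio 0.7.
n_relative = relative_change_sample_size(129, 0.7)
print(round(n_relative))  # 1433
```

Because the inflation factor grows as 1/(1 − r)², small uncertainties in r translate into large uncertainties in N when control and patient rates are close, which is consistent with the wide confidence interval reported for this measure.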
Sample size estimates for a trial involving mild AD patients, shown in Figure 4 and Table III, are substantially smaller than those for MCI, with most neuroimaging outcome measures yielding smaller estimated sample sizes than the CDR-SB (the significance of the differences is given in Table V). Sample size estimates based on absolute change are also shown for comparison. As expected, for any given neuroimaging measure, they are substantially smaller, with much tighter confidence intervals, than their disease-specific counterparts.
This study compared five widely used methodologies for quantifying longitudinal change measures from structural MRIs and examined two critical issues that significantly impact sample size estimation when neuroimaging measures are used as outcome variables: bias in image analysis and the definition of the potentially treatable effect. Failure to control for either of these factors can lead to dramatic underestimation of sample sizes needed to detect a potentially beneficial effect of a disease-modifying therapy.
Potential bias in image registration is a well-known problem in the analysis of serial MRIs that has received much attention in recent literature [Thompson and Holland, 2011]. Although most methodologies employ procedures to minimize bias, our results show that some commonly used methods, particularly TBM Stat-ROI, are significantly affected. For TBM Stat-ROI, correction for bias tripled (or doubled, using the alternative results from the modified method) the sample size estimates for detection of change in MCI subjects. Acquiring scans on the same day, where no deformation is expected, would provide ideal images for testing the presence of additive bias—though multiplicative bias would remain undetectable using such images. A simple way to eliminate bias due to registration methodology is to make the entire procedure symmetric by construction: register image A to image B, independently register image B to image A, and then combine the changes measured in both directions by algebraic or geometric averaging.
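The symmetrization described above can be sketched in a few lines. This is a minimal illustration and not the procedure used by any of the methods compared here: it assumes that each registration direction yields a volume ratio for a region of interest (V_B/V_A from registering A to B, and V_A/V_B from registering B to A), that any bias acts additively on the log volume ratio, and all names and numbers are hypothetical.

```python
import math


def symmetric_percent_change(ratio_a_to_b: float, ratio_b_to_a: float,
                             mean: str = "geometric") -> float:
    """Combine forward and reverse volume-ratio estimates into one percent change.

    ratio_a_to_b: estimate of V_B / V_A from registering image A to image B.
    ratio_b_to_a: estimate of V_A / V_B from registering image B to image A.
    """
    forward = ratio_a_to_b          # V_B / V_A measured directly
    backward = 1.0 / ratio_b_to_a   # V_B / V_A implied by the reverse registration
    if mean == "geometric":
        combined = math.sqrt(forward * backward)
    elif mean == "arithmetic":
        combined = 0.5 * (forward + backward)
    else:
        raise ValueError("mean must be 'geometric' or 'arithmetic'")
    return 100.0 * (combined - 1.0)


# Hypothetical example: 1% true volume loss, with the same bias toward apparent
# shrinkage in both registration directions.
true_ratio = 0.99
bias = 0.003  # assumed additive bias in log volume ratio
fwd = true_ratio * math.exp(-bias)
bwd = (1.0 / true_ratio) * math.exp(-bias)
print(symmetric_percent_change(fwd, bwd))  # close to -1.0: the bias cancels
```

Under these assumptions the geometric mean cancels the bias exactly, whereas the arithmetic mean cancels it only to first order.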
The definition of the potentially treatable effect is another critical factor that profoundly affects estimated sample sizes. The majority of publications in the ADNI literature, notably [Beckett et al., 2010; Cummings, 2010; Hua et al., 2010; Vemuri et al., 2010], implicitly define the potentially treatable effect as the absolute change from baseline, although as mentioned earlier there are exceptions [Fox et al., 2000; Holland et al., 2009; McEvoy et al., 2010; Schott et al., 2010]. The rationale for use of absolute change measures has not clearly been articulated in published reports, but presumably it arises from the assumption that atrophy in HCs is dominated by a subset of HCs who are in a preclinical stage of AD and experiencing disease-related elevated rates of decline. We directly tested this assumption by examining atrophy rates in HCs who tested negative for Aβ pathology on the basis of CSF Aβ42 levels and comparing these rates with those seen in all HCs. Since amyloid lesions neuropathologically partly characterize AD [Mirra et al., 1991], and CSF and PET measures of their prevalence are believed to be the earliest detectable signs of possible AD [Morris et al., 2010] (in which case clinical decline might not occur until a decade or so after the lesions become manifest [Price and Morris, 1999]), elderly individuals without Aβ pathology are highly unlikely to be in a preclinical stage of AD—and indeed since many elderly HCs have elevated plaque burden while remaining cognitively normal [Price et al., 2009], it might be that “Alzheimer's is not a part of normal aging any more than breaking your hip is a part of normal aging” [Herrup, 2010]. Our results show, however, that the Aβ-negative HCs experienced an approximately similar rate of whole brain, entorhinal, and hippocampal atrophy as the full, undifferentiated HC group. 
Furthermore, as can be seen in Figure 2, apart from the anomalous “FS Hipp” result discussed above, hippocampal and entorhinal atrophy rates are similar for HC(Aβ+) and HC(Aβ−). This result is in agreement with neuropathological studies that show no significant difference in total entorhinal [Price et al., 2001] and hippocampal [West et al., 2004] neuron counts (even in CA1, the hippocampal sector most affected by neuron loss in the early stages of AD) between cognitively normal subjects essentially free of amyloid pathology and those exhibiting significant amounts of amyloid deposition, to a degree consistent with a neuropathological diagnosis of possible AD. Thus, the preponderance of atrophy in HCs must arise from causes other than Aβ pathology, and it would not be reasonable to expect an AD therapy, in particular one targeting Aβ pathology, to reduce rates of atrophy that occurs in the absence of such pathology. Neuropathological analysis of HCs who do not fulfill criteria for the neuropathological diagnosis of AD or other neurodegenerative disease [Freeman et al., 2008] supports this conclusion. Atrophy in such individuals can persist over an age range of five decades and likely reflects loss of dendritic complexity in neuropil and/or changes in neuronal size but, in contrast with AD-related atrophy, preserves neuronal number. The neuropathological study shows further that in these subjects the presence of diffuse plaques did not correlate with cortical atrophy; that cortical atrophy correlates with age; and that, though neuritic plaque burden also correlates with age, the small number of plaques and tangles had no direct influence on cortical atrophy. Therefore, “cortical changes seen in aging are not simply the result of early AD changes but are related to aging itself” (ibid.).
It should be noted that in a clinical trial employing biomarkers for the natural history of the disease, care must be taken in assessing the disease modifying ability of the therapy [Citron, 2010; Salloway et al., 2008]. Correlation is not sufficient [Baker and Kramer, 2003]: the biomarker must also be in the causal pathway of the disease, and directly relate to clinically meaningful endpoints [Mani, 2004]. By the same token, therapy might affect atrophy in unexpected ways, as shown by the AN1792 Aβ immunotherapy trial [Gilman et al., 2005] where whole brain atrophy was greater in the approximately one-fifth of subjects who were antibody responders than in the placebo group, a result possibly due to brain hydration state related to therapy, or to negative effects of the vaccination on fiber or white matter volume, and that was not reflected in worsening cognitive performance [Fox et al., 2005]. Though clinical improvement in this trial largely may have been precluded because the patient cohort was at a relatively advanced stage of the disease [Holmes et al., 2008; Hyman, 2011], this outcome nevertheless argues in favor of analyzing subregional, in particular cortical, change rather than global change when monitoring disease-modifying effects of therapy.
Defining the potentially treatable effect based on absolute change from baseline is attractive in that it leads to small sample size requirements, often much smaller than those based on standard clinical outcome measures. This approach represents the most optimistic assessment of the potentially treatable effect. The more conservative approach of defining the potentially treatable effect relative to change experienced by all HCs, or the slightly more realistic approach of defining it relative to change experienced by Aβ-negative HCs, may represent a more achievable goal, particularly since most current therapies target amyloid pathology. Requiring the treatment to slow all change, even that unrelated to the targeted mechanism of the drug, is likely to result in a trial that is substantially underpowered to detect slowing of disease-specific atrophy. An additional advantage of using relative rather than absolute change measures is that any purely additive systematic bias in the results arising from errors in image acquisition or analysis methods will, by definition, be removed upon subtraction.
Consideration of these two factors together (bias correction, and defining the potentially treatable effect as a measure of relative rather than absolute change) substantially alters the conclusions regarding the relative sensitivity of different neuroimaging biomarkers from what has been published in the literature [Beckett et al., 2010; Cummings, 2010; Jack et al., 2010]. From data available on the ADNI website through April 16, 2011, estimates of change in subregional cortical areas, as determined by Quarc, produce the smallest estimated sample sizes when change relative to HCs is taken into account. In particular, for the entorhinal cortex, the area known to be first affected by AD pathology, N = 293, CI = [214 432] for MCI-HC and N = 74, CI = [54 113] for AD-HC; and, as can be seen in Table IV, sample size estimates based on the Quarc entorhinal are the only ones that are significantly smaller than those achieved using CDR-SB as the outcome measure for either MCI-HC or MCI-HC(Aβ−). Several other temporal lobe structures quantified with Quarc (Tables II and III) also provided powerful relative-change biomarkers, including the amygdala (N = 366, CI = [264 549] for MCI-HC; N = 132, CI = [94 205] for AD-HC), which suffers from a significant increase in the numbers of both neuritic plaques and NFTs when transitioning from HC to amnestic MCI, and again when transitioning from amnestic MCI to early AD [Markesbery et al., 2006]. Quarc was designed not only to capture large-scale structural change, but also to measure change in small regions with high precision [Holland and Dale, 2011], and generally provides significantly improved measures of change compared with those of the standard (cross-sectional) FreeSurfer and FreeSurfer-longitudinal, the only other methods that attempt to quantify change in cortical regions.
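The cancellation of additive bias under subtraction can be illustrated with a short sketch. All rates and the bias value below are hypothetical numbers chosen for illustration and are not taken from the ADNI results; the function name is likewise illustrative.

```python
def disease_specific_rate(patient_rate: float, control_rate: float) -> float:
    """Annual rate of change in patients relative to healthy controls (% per year)."""
    return patient_rate - control_rate


# Hypothetical true annual rates of change (% per year) and a hypothetical
# additive bias that affects every measurement equally.
true_patient, true_control = -1.0, -0.5
additive_bias = -0.3

measured_patient = true_patient + additive_bias
measured_control = true_control + additive_bias

# Absolute change inherits the bias; relative change does not.
print(round(measured_patient, 3))                                           # -1.3
print(round(disease_specific_rate(measured_patient, measured_control), 3))  # -0.5
```

A multiplicative bias would not cancel in this way, so the subtraction protects only against the additive component.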
The results presented here for Quarc were derived using both the back-to-back scans per timepoint (which should improve signal to noise); the other methods used a single scan per timepoint (choosing the best scan from each pair should reduce the degrading impact of image artifacts). Though these effects have not been assessed here, when considering future trials the requirement for acquiring two scans to achieve these sample sizes should be borne in mind. Also, sample size estimates have not been modeled to account for attrition due to QC. (The methods used in Quarc are fully documented in [Holland and Dale, 2011], and are available to other researchers on a not-for-profit recharge basis through the UCSD Multimodal Imaging Laboratory, mmil.ucsd.edu.)
Although the current study is primarily concerned with issues that affect the use of neuroimaging biomarkers as outcome measures in clinical trials, it is important to point out that longitudinal MRI measures as provided by ADNI are also being used in the comparative investigation of disease-related trajectories of various AD biomarkers [Caroli and Frisoni, 2010; Frisoni et al., 2010; Jack et al., 2010; Perrin et al., 2009; Trojanowski et al., 2010]. To ensure fidelity of the serial measurements, it has been proposed in a recent consensus article [Klein et al., 2009] that registration procedures be validated with respect to linearity (the inverse consistency of forward and reverse transformations between image pairs), and transitivity (e.g., the total change calculated when registering visit 3 to visit 1 should equal the sum calculated when registering visit 3 to visit 2, and visit 2 to visit 1). The methodologies used in ADNI, including Quarc, have not been validated in this respect. Therefore, caution is advised when considering the validity of the nonlinear trajectories that have been published based on ADNI data.
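The two validation criteria named above can be expressed as simple residual checks. The sketch below is an illustration under stated assumptions and not a description of how any ADNI methodology was or should be validated: change is represented as a log volume ratio so that successive changes add exactly, and the function names and volumes are hypothetical.

```python
import math


def inverse_consistency_error(change_fwd: float, change_bwd: float) -> float:
    """Residual when forward and reverse log volume changes are summed (ideal: 0)."""
    return change_fwd + change_bwd


def transitivity_error(change_1_to_2: float, change_2_to_3: float,
                       change_1_to_3: float) -> float:
    """Direct visit 1 -> 3 change minus the chained 1 -> 2 -> 3 change (ideal: 0)."""
    return change_1_to_3 - (change_1_to_2 + change_2_to_3)


# Hypothetical regional volumes (mm^3) at three visits, as an unbiased method would report.
v1, v2, v3 = 1000.0, 990.0, 975.0
c12 = math.log(v2 / v1)
c23 = math.log(v3 / v2)
c13 = math.log(v3 / v1)

print(abs(transitivity_error(c12, c23, c13)) < 1e-12)                     # True
print(abs(inverse_consistency_error(c12, math.log(v1 / v2))) < 1e-12)     # True
```

In practice each change would come from an independent registration, and residuals that differ systematically from zero across subjects would indicate that the method violates the corresponding criterion.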
ADNI has been revolutionary in its pioneering of the open source model of data sharing, making raw data and derived measures freely accessible to the scientific community and industry as soon as they become available. It has been highly successful in advancing research on biomarkers in AD. ADNI results are being used by the pharmaceutical industry to aid in decision-making on the choice of biomarkers for use as outcome measures, and for powering clinical trials. Ultimately, ADNI data and analyses may form the basis for regulatory qualification for imaging biomarkers.
It is thus essential that these biomarkers be validated and that the models used for power calculations be consistent with the biological mechanisms targeted by the therapy under investigation. It should be noted, however, that establishing imaging biomarkers as surrogates for clinical-cognitive outcomes cannot be achieved with natural history studies, but will require successful clinical trials, that is, ones where cognitive outcomes are improved and where there is a clear and cogent biological connection with the imaging measures [Carrillo et al., 2009; Katz, 2004]. Establishing the surrogacy of biomarkers will be all the more difficult if those biomarkers are not correctly calibrated for non-disease-related effects. For therapies targeting Aβ pathology, it is not reasonable to expect that they will affect atrophy rates observed in healthy individuals without evidence of Aβ pathology. Since such individuals show atrophy rates in AD-vulnerable structures equivalent to those of the larger HC group, the potentially treatable effect is best conservatively defined as change relative to that experienced by HCs. It is also essential to provide confidence intervals for any sample size estimates to enable selection of biomarkers that estimate change with the highest certainty [Holland et al., 2009; McEvoy et al., 2010; Schott et al., 2010].
As part of the stated goals of the study, ADNI has been charged with statistically evaluating and comparing the different biomarkers and analysis methods to inform clinical trial design. To date, results presented and published by ADNI have not taken these critical issues into account. They report sample size estimates based only on absolute change measures; they have not considered issues of potential bias in image registration; and they have not provided a clear index of uncertainty of the results. If left uncorrected, these findings could lead to the adoption of suboptimal biomarkers for outcome measures, and to trials that are substantially underpowered for detecting potential disease-modifying effects.
The authors thank Yoon Chung, Trevor Cooper, Rahul Desikan, Matt Erhart, Donald Hagler, Robin Jennings, Alan Koyama, and Chris Pung. This study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. Anders M. Dale is a founder and holds equity in CorTechs Labs, Inc, and also serves on its Scientific Advisory Board. The terms of this arrangement have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. Linda K. McEvoy's spouse is President of CorTechs Labs, Inc. A patent application for Quarc has been filed through the UCSD Technology Transfer Office.
Contract grant sponsor: NIH; Contract grant numbers: R01AG031224, R01AG22381, U54NS056883, P50NS22343, P50MH081755, U01 AG024904, P30 AG010129, K01 AG030514; Contract grant sponsor: NIA; Contract grant number: K01AG029218; Contract grant sponsors: Alzheimer's Disease Neuroimaging Initiative (ADNI) (Data collection and sharing for this project), National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, Abbott, AstraZeneca AB, Bayer Schering Pharma AG, Bristol-Myers Squibb, Eisai Global Clinical Development, Elan Corporation, Genentech, GE Healthcare, GlaxoSmithKline, Innogenetics, Johnson and Johnson, Eli Lilly and Co, Medpace, Inc., Merck and Co., Inc., Novartis AG, Pfizer Inc., F. Hoffman-La Roche, Schering-Plough, Synarc, Inc., Wyeth, the Alzheimer's Association and Alzheimer's Drug Discovery Foundation (with participation from the U.S. Food and Drug Administration), Foundation for the National Institutes of Health (www.fnih.org), the Northern California Institute for Research and Education, the Dana Foundation.
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (www.loni.ucla.edu/ADNI). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators is available at http://www.loni.ucla.edu/ADNI/Data/ADNI_Authorship_List.pdf