Neuroimaging centers and pharmaceutical companies are working together to evaluate treatments that might slow the progression of Alzheimer’s disease (AD), a common but devastating late-life neuropathology. Recently, automated brain mapping methods, such as tensor-based morphometry (TBM) of structural MRI, have outperformed cognitive measures in their precision and power to track disease progression, greatly reducing sample size estimates for drug trials. In the largest TBM study to date, we studied how sample size estimates for tracking structural brain changes depend on the time interval between the scans (6–24 months). We analyzed 1309 brain scans from 91 probable AD patients (age at baseline: 75.4±7.5 years) and 189 individuals with mild cognitive impairment (MCI; 74.6±7.1 years), scanned at baseline, 6, 12, 18, and 24 months. Statistical maps revealed 3D patterns of brain atrophy at each follow-up scan relative to the baseline; numerical summaries were used to quantify temporal lobe atrophy within a statistically-defined region-of-interest. Power analyses revealed superior sample size estimates over traditional clinical measures. Only 80, 46, and 39 AD patients were required for a hypothetical clinical trial, at 6, 12, and 24 months respectively, to detect a 25% reduction in average change using a two-sided test (α=0.05, power=80%). Correspondingly, 106, 79, and 67 subjects were needed for an equivalent MCI trial aiming for earlier intervention. A 24-month trial provides most power, except when patient attrition exceeds 15–16%/year, in which case a 12-month trial is optimal. These statistics may facilitate clinical trial design using voxel-based brain mapping methods such as TBM.
Alzheimer’s disease (AD) is a pathological condition associated with aging. It affects over 24 million individuals worldwide according to a Delphi consensus study in 2005 (Ferri et al., 2005), and the incidence doubles every 5 years after age 60 years (Brookmeyer et al., 1998; Jellinger, 2006; Jorm et al., 1987). AD has been described as the most feared illness in the elderly population (Harris and Interactive, 2006), even more so than cancer and cardiovascular diseases, mainly due to the lack of effective treatment or prevention methods. The optimal treatment window is in the non-symptomatic stage, during which misfolded proteins begin to aggregate into extracellular senile plaques and intracellular neurofibrillary tangles (Selkoe, 2004), followed by inflammatory damage and cell death in the central nervous system (Frank et al., 2003; Shaw et al., 2007). Mild cognitive impairment (MCI), especially the amnestic type (Petersen, 2003b; Petersen et al., 2001), is a transitional stage between normal aging and fully-developed AD. With an annual conversion rate of 10–25% to AD, MCI has attracted much attention for treatment trials targeting minimally symptomatic individuals (Grundman et al., 2004; Petersen, 2003a, 2007). Development of disease-modifying therapies is mainly hindered by the lack of robust biomarkers to aid early diagnosis and assess treatment efficacy (Shaw et al., 2007).
Longitudinal, structural magnetic resonance imaging (MRI) has substantial power to track subtle brain changes over time, and there has been great interest in which neuroimaging-derived measures, either used separately or jointly, offer greatest power to track AD. Several automated methods have been proposed to measure hippocampal atrophy (Morra et al., 2009a,b; Schuff et al., 2009), ventricular enlargement (Carmichael et al., 2006; Chou et al., 2008, 2009a,b; Nestor et al., 2008; Thompson et al., 2004), and whole brain atrophy assessed using methods such as the brain boundary shift integral (BBSI) (Fox et al., 2000) or SIENA (Ho et al., in press; Smith et al., 2002, 2004). These neuroimaging measures have several advantages as outcome measures as they can differentiate patients from controls (Davatzikos et al., 2008; Fox and Freeborough, 1997), correlate with clinical and cognitive decline (Fox et al., 1999; Hua et al., 2008b; Jack et al., 2009; Leow et al., 2009), correlate with pathologically confirmed neuronal loss (Vemuri et al., 2008; Whitwell et al., 2008), predict future conversion from preclinical to symptomatic AD (Apostolova et al., 2006; Hua et al., 2008b; Misra et al., 2009), and have a high test-retest reliability (Leow et al., 2006). Voxel-based morphometry studies, including the ones using TBM, have depicted the characteristic pattern of structural degeneration in AD, providing an automated and convenient platform for image-based disease characterization (Apostolova et al., 2007; Apostolova and Thompson, 2008; Busatto et al., 2008; Chetelat et al., 2005, 2008a,b; Davatzikos et al., 2001; Dickerson et al., 2009; Frisoni et al., 2007; Hua et al., 2008b; Ishii et al., 2005; Leow et al., 2009).
The Alzheimer’s Disease Neuroimaging Initiative (ADNI) is a large-scale multi-site study to define optimal neuroimaging methods for clinical trials. In prior studies, we found that tensor-based morphometry gave better sample size estimates than standard clinical scores for tracking disease-related changes over a one-year interval (Hua et al., 2009a); sample sizes were comparable for subjects scanned at 3 versus 1.5 Tesla (Ho et al., in press). As those analyses focused on people scanned with a 1-year follow-up interval, it is of great interest to know whether a shorter interval (6 months) or a longer one (24 months) would provide better power. In choosing a follow-up interval, clinical trial designers must consider the risk of waiting a long time for the disease to progress. This must be traded off against the more limited sensitivity to disease progression at shorter intervals, when biological changes have had less time to accumulate. To investigate these trade-offs, we analyzed the serial brain scans of 91 patients diagnosed with probable AD, followed up at 6, 12, and 24 months, and 189 MCI subjects followed up at 6, 12, 18, and 24 months. 3D maps of atrophy (i.e., volume loss, a widely-used neuroimaging measure of AD progression) were created, based on nonlinearly warping the follow-up scan to the baseline scan. These maps charted disease progression in great spatial detail for each individual. To boost power, a numerical summary was derived from a statistically-defined region-of-interest (Stat-ROI; Hua et al., 2009a; Chen et al., 2009a; Ho et al., in press) inside the temporal lobes to quantify the rate of atrophy. Using the TBM-derived neuroimaging measure, we computed sample size estimates for hypothetical AD and MCI clinical trials. We compared sample size requirements for trials with different scan intervals of 6, 12, 18, and 24 months, and explored the influence of using atlas-based versus statistically pre-defined ROIs.
We hypothesized that, at all time intervals examined, TBM-derived neuroimaging measures would lead to greatly reduced sample size estimates for clinical trials compared to traditional cognitive measures. We also expected that longitudinal studies with longer follow-up periods would require smaller sample sizes to detect treatment effects. Finally, we assessed whether the use of a statistically pre-defined ROI boosted power at each follow-up interval; in exploratory analyses, we determined whether the power estimates depended on how the statistically pre-defined ROI was generated (i.e., which diagnostic group and which time-points were used to generate it).
Longitudinal brain MRI scans and associated clinical data were downloaded from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) public database (http://www.loni.ucla.edu/ADNI/Data/). ADNI is a large five-year study launched in 2004 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a $60 million public-private partnership. The primary goal of ADNI has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessments acquired at multiple sites (as in a typical clinical trial), can replicate results from smaller single site studies measuring the progression of MCI and early AD. Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to monitor the effectiveness of new treatments, and lessen the time and cost of clinical trials. The Principal Investigator of this initiative is Michael W. Weiner, M.D., VA Medical Center and University of California, San Francisco.
Serial brain MRI scans were analyzed from 91 probable AD patients (age at baseline: 75.4±7.5 years) scanned at baseline and followed up at 6, 12, and 24 months, and 189 individuals with amnestic MCI (74.6±7.1) followed up at 6, 12, 18, and 24 months, amounting to a total of 1309 scans. These subjects are from the same cohort as those in our prior TBM studies, which examined a subset of 676 of these individuals at baseline, and 515 of them at 12-month follow-up, but not at other time-points (Hua et al., 2008a,b, 2009a). All subjects underwent thorough clinical and cognitive assessment at the time of scan acquisition. Cognitive tests examined here included the Alzheimer’s Disease Assessment Scale-cognitive subscale (ADAS-Cog), a 70-point scale designed to measure the severity of cognitive impairment, which is currently the most widely used cognitive measure in AD trials (Rosen et al., 1984). It consists of 11 tasks assessing learning and memory, language production and comprehension, constructional and ideational praxis, and orientation. The sum-of-boxes Clinical Dementia Rating (CDR-SB), ranging from 0 to 18, measures dementia severity by evaluating patients’ performance in six domains: memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care (Berg, 1988; Hughes et al., 1982; Morris, 1993). The Mini-Mental State Examination (MMSE) provides a global measure of mental status, evaluating five cognitive domains: orientation, registration, attention and calculation, recall, and language (Cockrell and Folstein, 1988; Folstein et al., 1975). The maximum MMSE score is 30; scores of 24 or lower are generally consistent with dementia. For both ADAS-Cog and CDR-SB, higher scores indicate poorer cognitive function, whereas for MMSE, higher scores denote better cognitive function. All AD patients met NINCDS/ADRDA criteria for probable AD (McKhann et al., 1984). 
The ADNI protocol lists more detailed inclusion and exclusion criteria (Mueller et al., 2005a,b), which can also be found online at http://www.alzheimers.org/clinicaltrials/fullrec.asp?PrimaryKey=208.
These data sets were downloaded on or before June 1, 2009, and reflect the status of the database at that point; as data collection is ongoing, we focused on analyzing all available baseline and follow-up scans up to 2 years, together with associated clinical and cognitive scores. The study was conducted according to the Good Clinical Practice guidelines, the Declaration of Helsinki and U.S. 21 CFR Part 50-Protection of Human Subjects, and Part 56-Institutional Review Boards. Written informed consent was obtained from all participants before experimental procedures, including cognitive tests, were performed.
All subjects were scanned with a standardized MRI protocol developed for ADNI (Jack et al., 2008). Briefly, high-resolution structural brain MRI scans were acquired at 59 ADNI sites using 1.5 Tesla MRI scanners (GE Healthcare, Philips Medical Systems, or Siemens). Using a sagittal 3D MP-RAGE scanning protocol, the typical 1.5T acquisition parameters were repetition time (TR) of 2400 ms, minimum full TE, inversion time (TI) of 1000 ms, flip angle of 8°, 24 cm field of view, 192 × 192 × 166 acquisition matrix in the x-, y-, and z-dimensions, yielding a voxel size of 1.25 × 1.25 × 1.2 mm³, later reconstructed to 1 mm isotropic voxels. The scan quality was evaluated by the ADNI MRI quality control center at the Mayo Clinic following standardized criteria.
Image corrections were applied using a processing pipeline at the Mayo Clinic, consisting of: (1) correction of geometric distortion due to gradient non-linearity (Jovicich et al., 2006), i.e., “gradwarp” (2) “B1-correction” for adjustment of image intensity inhomogeneity due to B1 non-uniformity (Jack et al., 2008), (3) “N3” bias field correction for reducing residual intensity inhomogeneity (Sled et al., 1998), and (4) geometrical scaling for removing scanner- and potential session-specific calibration errors using a phantom scan acquired for each subject (Gunter et al., 2009). All original image files as well as images with all of these corrections are available to the general scientific community at http://www.loni.ucla.edu/ADNI/Data/.
To adjust for linear drifts in position and scale within the same subject, the follow-up scan (6-, 12-, 18-, or 24-month) was linearly registered to its matching baseline scan using a 9-parameter (9P) registration, driven by a mutual information (MI) cost function (Collins et al., 1994). 9P linear registration was chosen over 6P rigid-body registration as previous studies have shown that 9P registration can correct for scanner voxel size variations in large longitudinal studies involving multiple sites, scanners and acquisition sequences, and consistently outperforms 6P registration (Hua et al., 2009a; Paling et al., 2004). Using the ADNI data, 9P linear registration was shown to achieve a similar level of scaling correction to phantom-based image correction (Clarkson et al., 2009), and to correct for any remaining scanner voxel size variation that was not accounted for by image correction algorithms (Hua et al., 2009a). To account for global differences in brain scale across subjects, the mutually aligned time-series of scans was then linearly registered to the International Consortium for Brain Mapping template (ICBM-53) (Mazziotta et al., 2001), applying the same 9P transformation to both mutually aligned scans. Globally aligned images were re-sampled in an isotropic space of 220 voxels along x-, y- and z-dimensions with a final voxel size of 1 mm³.
Individual Jacobian maps illustrating local expansion or compression were also created to estimate 3D patterns of structural brain change over time, by warping the follow-up scan to match the baseline scan. The nonlinear registration algorithm was driven by a mutual information cost function, and as a regularizing term, we used the symmetrized Kullback-Leibler (sKL-MI) distance, with the registration parameters of sigma=6 and lambda=8, chosen as the best parameter set from an earlier optimization study (Hua et al., 2009a). The parameters sigma and lambda control the Jacobian field smoothness and weighting of the regularization, respectively (Yanovsky et al., 2008, 2009) (see Hua et al., 2009a, for a detailed discussion of the sKL-MI registration parameters and their implications for detecting change). Color-coded maps of the Jacobian determinants were created to illustrate regions of volume expansion (i.e., with detJ (r)>1), or contraction (i.e., with detJ (r)<1) (Ashburner and Friston, 2003; Chung et al., 2001; Freeborough and Fox, 1998; Riddle et al., 2004; Thompson et al., 2000; Toga, 1999) over time. These maps of tissue change were also spatially normalized across subjects by nonlinearly aligning all individual Jacobian maps to a minimal deformation template (MDT), for regional comparisons and group statistical analyses. The MDT was constructed based on images from 40 normal controls as detailed elsewhere (Hua et al., 2008a,b).
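As a minimal illustration of how a Jacobian determinant encodes local volume change, the sketch below approximates detJ at a single point of a deformation using central finite differences. It is not the sKL-MI registration used in this study; the function name `jacobian_determinant`, the toy deformation `shrink`, and the step size are hypothetical choices made only for this example.

```python
def jacobian_determinant(deform, point, step=1e-3):
    """Approximate det J of `deform` at `point` using central differences.

    `deform` maps (x, y, z) to a deformed (x', y', z'); this is a toy
    stand-in for a registration-derived deformation field (an assumption).
    """
    rows = []
    for axis in range(3):
        plus, minus = list(point), list(point)
        plus[axis] += step
        minus[axis] -= step
        f_plus, f_minus = deform(*plus), deform(*minus)
        # Partial derivatives of all three output coordinates along this axis.
        rows.append([(f_plus[i] - f_minus[i]) / (2 * step) for i in range(3)])
    # `rows` is the transpose of J; the determinant is unchanged by transposition.
    (a, b, c), (d, e, f), (g, h, i) = rows
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def shrink(x, y, z):
    """Hypothetical deformation: uniform 10% contraction along each axis."""
    return (0.9 * x, 0.9 * y, 0.9 * z)


det = jacobian_determinant(shrink, (10.0, 20.0, 30.0))
print(f"detJ = {det:.3f} ({'contraction' if det < 1 else 'expansion'})")
```

A determinant below 1 marks local contraction (tissue loss) and a determinant above 1 marks expansion (for example, ventricular enlargement), matching the color-coding convention described above.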
To illustrate the average degree of atrophy, relative to the baseline image, at each follow-up time in AD and MCI groups, we constructed voxel-wise mean maps by taking the average at each voxel of the Jacobian maps, across subjects. These maps were color-coded to show the percentage of regional brain tissue loss and ventricular enlargement. The Jacobian maps of AD and MCI groups were compared at each time point using permutation-based two sample t tests, to assess the overall significance of group differences inside the whole brain as well as inside the temporal lobes, corrected for multiple comparisons (Bullmore et al., 1999; Chiang et al., 2007; Hua et al., 2009b; Nichols and Holmes, 2002; Thompson et al., 2003). In brief, a null distribution for the group differences in Jacobian determinant (atrophic rate) at each voxel was constructed using 10,000 random permutations of the data. For each test, the subjects’ diagnosis (AD vs. MCI) was randomly permuted and voxel-wise t tests were conducted to identify voxels more significant than p=0.05. The volume of voxels inside a mask (whole brain or temporal lobes) more significant than p=0.05 was computed for the real experiment and for the random assignments. Finally, a ratio, describing the fraction of the time the suprathreshold volume was greater in the randomized maps than the real effect (the original labeling), was calculated to give an overall p-value for the significance of the map (corrected for multiple comparisons by permutation). The correction is for the number of tests, so it quantifies the level of surprise in seeing the overall map within the mask or search region.
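The suprathreshold-volume permutation test described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the study: each subject is a short list of voxel values, a Welch t statistic is used at each voxel, and a fixed critical value `t_crit=2.0` stands in for the voxel-wise p=0.05 threshold. All function names are hypothetical, and ties with the observed volume are counted as exceedances (a conservative choice).

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample (Welch) t statistic for two lists of numbers."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def suprathreshold_count(group_a, group_b, t_crit):
    """Number of voxels with |t| above t_crit (a proxy for suprathreshold volume)."""
    n_vox = len(group_a[0])
    count = 0
    for v in range(n_vox):
        a = [subject[v] for subject in group_a]
        b = [subject[v] for subject in group_b]
        if abs(welch_t(a, b)) > t_crit:
            count += 1
    return count


def permutation_map_p(group_a, group_b, t_crit=2.0, n_perm=1000, seed=0):
    """Fraction of random relabelings whose suprathreshold volume is at
    least as large as the one observed with the true diagnostic labels."""
    rng = random.Random(seed)
    observed = suprathreshold_count(group_a, group_b, t_crit)
    pooled = group_a + group_b
    n_a = len(group_a)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign diagnosis labels
        if suprathreshold_count(pooled[:n_a], pooled[n_a:], t_crit) >= observed:
            exceed += 1
    return exceed / n_perm
```

Because a single p-value is computed for the whole map, the result is already corrected for the number of voxels tested inside the mask.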
To give a summary of the 3D map of brain atrophy for each subject, a single numerical measure was derived by computing an average within an ROI. Both anatomically and statistically-defined ROIs were used in this study. First, a temporal lobe ROI (Temp-ROI), including the temporal lobes of both brain hemispheres, was manually delineated on the MDT template by a trained anatomist using the Brainsuite software program (Shattuck and Leahy, 2002) (Fig. 2). Secondly, a statistically-defined ROI (Stat-ROI) was defined based on voxels with significant atrophic rates (p<0.001) within the temporal lobes, in a non-overlapping training set of 20 AD patients (age at baseline: 74.8±6.3 years; 7 men and 13 women) scanned at baseline and 12 months (Fig. 2). The method is detailed fully in two prior reports, Hua et al. (2009a) and Ho et al. (in press). The training set used to define the ROI was deliberately based on scans that were independent of (and not overlapping with) the testing set consisting of 91 AD and 189 MCI subjects, in order to maximize the available testing set (from which maps and sample sizes were computed) at all the time points. Next, we aimed to answer the following question: does a “customized” Stat-ROI, specifically made for each diagnostic group and time point, further improve the statistical power? We created separate Stat-ROIs based on 20 AD patients (age at baseline: 75.4±7.5 years; 12 men and 8 women) scanned at baseline and followed up at 6, 12, and 24 months. We also made a corresponding set of separate Stat-ROIs based on 20 MCI subjects (75.4±7.4 years; 12 men and 8 women) scanned at baseline and followed up at 6, 12, 18, and 24 months, selected from the common set consisting of 91 AD and 189 MCI subjects across all time points. To ensure that the training and testing sets did not overlap, all subjects chosen for the training set were subsequently removed from the testing set.
For example, when 20 AD patients were selected to compute the customized Stat-ROI at 6 months, the testing set at 6 months for AD was reduced to 71 subjects (91−20=71). This was done to comply with the rule in machine learning that evaluations should be made on data sets independent of those used to select the regions of interest.
To allow a simple statistical analysis, a numeric summary—the mean atrophy rate for all voxels within an ROI—was also computed for each follow-up scan, to summarize the overall amount of temporal lobe atrophy detected during an observation time of 6, 12, 18, and 24 months, respectively.
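The numeric summary can be illustrated with a small sketch that averages Jacobian determinants over the voxels of an ROI and expresses the result as a percentage change relative to baseline. The flat-list data layout, the function name `roi_mean_percent_change`, and the example values are assumptions for illustration only; the exact scaling used in the study's software is not specified here.

```python
from statistics import mean


def roi_mean_percent_change(jacobian_values, roi_mask):
    """Mean volume change (%) across voxels inside the ROI.

    jacobian_values: Jacobian determinants, one per voxel (1.0 = no change).
    roi_mask: booleans of the same length, True for voxels inside the ROI.
    """
    inside = [j for j, keep in zip(jacobian_values, roi_mask) if keep]
    if not inside:
        raise ValueError("ROI mask selects no voxels")
    return 100.0 * (mean(inside) - 1.0)


# Hypothetical 4-voxel map: the first two voxels lie inside the ROI.
summary = roi_mean_percent_change([0.98, 0.96, 1.05, 1.00],
                                  [True, True, False, False])
print(f"{summary:.1f}% change within ROI")  # negative values indicate atrophy
```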
A power analysis was defined by the ADNI Biostatistics Core to estimate the sample size required to detect a 25% reduction in the mean annual rate of atrophy, using a two-sided test and standard significance level (α=0.05) for a hypothetical two-arm study (treatment versus placebo). The estimated minimum sample size for each arm is computed from the formula below. Briefly, β̂ denotes the estimated annual change (average of the group) and σD refers to the standard deviation of the rate of atrophy across subjects.
$$n = \frac{2\,\sigma_D^{2}\,\left(z_{1-\alpha/2} + z_{\mathrm{power}}\right)^{2}}{\left(0.25\,\hat{\beta}\right)^{2}}$$
Here zα is the value of the standard normal distribution for which P[Z<zα]=α and in this case we set α to its conventional value of 0.05 (Rosner, 1990). The sample sizes required to achieve 80% and 90% power were computed in this study, subsequently referred to as n80 and n90. As the observation time ranged from 6 to 24 months, instead of converting the overall amount of change to an annual rate of atrophy, we computed the number of subjects required to detect a 25% reduction in the overall atrophy occurring over the interval (which gives the same result). These sample size estimates define how many patients would need to be recruited for clinical trials with the duration of 6, 12, 18, and 24 months respectively. We used the individual atrophy measures computed within the Stat-ROI or Temp-ROI for sample size calculations.
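A minimal sketch of this per-arm sample size calculation is given below, assuming the standard two-arm form n = 2σD²(z₁₋α/₂ + z_power)² / (0.25·β)², where β is the mean change and σD its standard deviation across subjects. The function name, the rounding up to a whole subject, and the example mean and SD are hypothetical and are not values from this study.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(mean_change, sd_change, reduction=0.25,
                        alpha=0.05, power=0.80):
    """Subjects per arm needed to detect a fractional `reduction` in the
    mean change, using a two-sided test at level `alpha`."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)          # about 0.84 for 80% power
    n = 2 * sd_change ** 2 * (z_alpha + z_power) ** 2 \
        / (reduction * mean_change) ** 2
    return ceil(n)  # rounding up is an assumption of this sketch


# Hypothetical atrophy summary: mean change 2.0%, SD 1.0%.
n80 = sample_size_per_arm(2.0, 1.0, power=0.80)
n90 = sample_size_per_arm(2.0, 1.0, power=0.90)
print(n80, n90)
```

Because the formula depends only on the ratio of the SD to the mean change, scaling both by the same factor (for example, converting a cumulative change to an annual rate) leaves the estimate unchanged.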
The 95% confidence interval for the n80 estimate was computed from 10,000 bootstrapped samples. At each bootstrapped resample, a subset of the numerical summaries (n=91 for AD and n=189 for MCI) was randomly drawn from the overall pool of numerical summaries within each group (AD or MCI). This was performed separately for the data at each follow-up interval. The confidence intervals were estimated with a bias corrected and accelerated percentile method (Davison and Hinkley, 1997; Efron and Tibshirani, 1993).
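The resampling idea can be sketched as follows. Note that this example uses the plain percentile method, which is simpler than the bias corrected and accelerated (BCa) method used in the study, so it should be read as an approximation. The function names and the synthetic atrophy values are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def n80_estimate(values, reduction=0.25, alpha=0.05, power=0.80):
    """Unrounded per-arm sample size from a list of per-subject changes."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return 2 * stdev(values) ** 2 * z_sum ** 2 / (reduction * mean(values)) ** 2


def bootstrap_percentile_ci(values, n_boot=10000, level=0.95, seed=0):
    """Percentile bootstrap CI for n80 (a simpler substitute for BCa)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping the group size fixed.
        resample = rng.choices(values, k=len(values))
        estimates.append(n80_estimate(resample))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_boot)]
    upper = estimates[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Synthetic example: 91 hypothetical atrophy summaries (mean 2.0, SD 1.0).
data_rng = random.Random(42)
atrophy = [data_rng.gauss(2.0, 1.0) for _ in range(91)]
low, high = bootstrap_percentile_ci(atrophy, n_boot=2000)
print(f"n80 = {n80_estimate(atrophy):.0f}, 95% CI [{low:.0f}, {high:.0f}]")
```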
As an exploratory analysis, we further computed the n80s derived from a Stat-ROI within the anatomical regions of the frontal, parietal, occipital lobes, CSF, whole brain, hippocampus, cerebral white matter, and cerebral gray matter (see Table 2). The hippocampus was delineated on the MDT template by investigators at University College, London using MIDAS (Medical Image Display and Analysis System) software (Freeborough et al., 1997). This delineation included the hippocampus proper, dentate gyrus, subiculum, and alveus (Fox et al., 1996; Scahill et al., 2003). The cerebral gray and white matter was classified using the partial volume classifier (PVC) in the Brainsuite software package (Shattuck and Leahy, 2002), after removing the brain stem and cerebellum from the whole-brain mask.
To determine if the sample size estimate was affected by the choice of ROI (Temp-ROI versus Stat-ROI, and various Stat-ROIs based on specific time point and diagnostic group), we conducted serial pairwise Student t tests. Because we were testing a large number of methodological alternatives to compare the effects of varying follow-up periods and Stat-ROIs (29 two-sample t tests: including 10 tests to compare rate of atrophy at different follow-up periods with Stat- and Temp-ROI, respectively, 7 tests to compare the Stat- versus Temp-ROI, and 12 tests to compare the effects of customized ROI based on diagnostic group and time point), we took the necessary precautions to avoid inflating the false positive rate due to multiple comparisons. We used Bonferroni correction to adjust each p-value, i.e., p=b·n, where b stands for the uncorrected p-value from each t test and n is the total number of tests (n=29). All the p-values in the Results section are corrected p-values after this Bonferroni correction.
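The Bonferroni adjustment p=b·n is simple enough to show directly. In the sketch below, capping corrected values at 1 is a common convention added here as an assumption, and the uncorrected p-values are hypothetical.

```python
def bonferroni(p_values, n_tests):
    """Multiply each uncorrected p-value by the number of tests,
    capping at 1.0 (the cap is a convention assumed in this sketch)."""
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical uncorrected p-values from three of the 29 t tests.
corrected = bonferroni([0.001, 0.01, 0.05], n_tests=29)
print([round(p, 3) for p in corrected])
```

With 29 tests, an uncorrected p-value must fall below roughly 0.0017 for the corrected value to remain under 0.05.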
In reality, participants have the right to drop out of a clinical trial or research study at any time, or they may be unable to return due to deteriorating health, disability, or death, or because they have moved out of the area or have lost interest in the study. The attrition rate for ADNI is around 7% per year, but in other studies with repeated assessments, attrition rates may be as high as 50% per additional visit (Thompson et al., 2009). Because of this, it is reasonable to build estimates of the attrition rate into sample size estimation, because larger samples may need to be recruited at the outset to ensure that an adequate sample remains later. In a simplified model, assuming that the participants who drop out are a random sub-sample, we adjusted the sample size estimates for attrition by using the formula: Adjusted n80 = (n80 from power estimate)/(1 − attrition rate)^T. Here T denotes the inter-scan interval or trial interval in years, e.g., 1.5 for an 18-month interval. Although the assumption of random drop-out might not hold up in real clinical trials, it allows approximate adjustments to be made for attrition.
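The attrition adjustment, n80/(1 − attrition rate)^T, and the attrition rate at which two trial durations require the same enrollment can be worked through in a short sketch. The function names are hypothetical; the unadjusted n80 values of 46 (12 months) and 39 (24 months) are the AD estimates quoted in the abstract, used here only as a worked example.

```python
def adjusted_n80(n80, attrition_rate, years):
    """Enrollment needed at baseline so that n80 subjects remain after
    `years`, assuming a constant annual rate of random drop-out."""
    return n80 / (1 - attrition_rate) ** years


def crossover_attrition(n_short, t_short, n_long, t_long):
    """Annual attrition rate at which the two trial durations need equal
    enrollment: n_short/(1-a)^t_short = n_long/(1-a)^t_long."""
    return 1 - (n_long / n_short) ** (1 / (t_long - t_short))


# AD estimates from the abstract: 46 subjects at 12 months, 39 at 24 months.
for rate in (0.10, 0.20):
    n12 = adjusted_n80(46, rate, 1.0)
    n24 = adjusted_n80(39, rate, 2.0)
    print(f"attrition {rate:.0%}: 12-month {n12:.1f}, 24-month {n24:.1f}")

print(f"crossover at about {crossover_attrition(46, 1.0, 39, 2.0):.1%} per year")
```

Under these assumptions the 24-month design needs fewer recruits at 10% annual attrition, the 12-month design needs fewer at 20%, and the two designs break even at roughly 15% per year.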
Individual Jacobian maps were averaged in AD and MCI groups to demonstrate the average amount of brain tissue loss (in blue colors) and ventricular enlargement (in red colors), characteristic of disease progression in AD, at a follow-up period of 6, 12, 18, and 24 months (Fig. 1). These tissue changes are shown as percentages, relative to the baseline scan, and are computed within each individual, before averaging across subjects in the group. Small but visible changes are noted at 6 months, the shortest follow-up period, in AD and MCI. As expected, greater progressive change, evidenced by ventricular expansion and temporal lobe atrophy, was detected at longer observation times. AD patients had progressed faster than MCI at all time points, as expected. The corrected p-values for comparing AD and MCI groups within a search region encompassing the entire brain were p=0.07, p=0.002, and p=0.004 at 6-, 12- and 24-month respectively; and the corrected p-values within the temporal lobes were p=0.02, p<0.0001, and p<0.0001 at 6, 12 and 24 months, respectively. This is in line with the intuition that AD versus MCI differences in atrophic rates are hardest to detect at the shortest follow-up interval (6 months), but can be detected so long as a temporal lobe region of interest is used.
As detailed in the Materials and methods section, the Temp-ROI was defined based on the average anatomy of the MDT (Fig. 2). The Stat-ROI was defined based on a non-overlapping training set of AD patients followed up for a 12-month period (Fig. 2). Since the training set was chosen to be independent of the evaluation set, sample size estimates (n80 and n90) were computed from the full sample of 91 AD patients and 189 MCI subjects. The same Stat-ROI was applied to compute all sample size estimates in Fig. 3. Across the same set of patients in AD, longer intervals led to greater effect sizes for the measurement of temporal lobe atrophy, resulting in smaller sample size estimates (n80 and n90 in Fig. 3a). MCI subjects showed the same trend as AD, with greater cumulative atrophy at longer follow-up intervals. The MCI subjects were examined at an additional time point (18 months), but this time-point, when used on its own, offered little extra benefit relative to a 12-month follow-up, showing comparable n80 and n90 at 12 and 18 months. Sample size estimates based on numeric summaries derived from the Stat-ROI consistently out-performed the ones derived from the Temp-ROI (Fig. 3) (all pair-wise comparisons, of the Stat-ROI versus the Temp-ROI, had corrected p-values less than 0.05). Additionally, all TBM-derived neuroimaging markers demonstrated drastic sample size reductions relative to standard clinical measures (ADAS-Cog, MMSE, and CDR-SB) at all follow-up periods (Table 1).
The exploratory analysis (Table 2) suggested that a statistically defined ROI generated within the temporal lobes is the best search region in terms of yielding the lowest n80s (smaller n80s are better), although several anatomical ROIs performed almost equally well. Summaries of atrophic rates derived within the cortical gray matter showed comparable statistical power to the ones derived from the temporal lobes. The performance of the statistical ROIs was somewhat insensitive to the choice of the search region, with the temporal region giving marginally better results, as would be predicted from the sequence of AD progression.
In theory, the statistically pre-defined region of interest could be based on any chosen set of AD or MCI subjects, even from a separate study, so long as they do not overlap with the testing (or evaluation) set. Apart from the Stat-ROI created from the non-common set (Figs. 2 and 3), we created 7 additional Stat-ROIs based on 20 AD patients and 20 MCI subjects chosen from the common set consisting of 91 AD patients and 189 MCI subjects, at each follow-up time. As shown in Fig. 4, the Stat-ROIs were anatomically very similar regardless of the diagnostic group and follow-up period that they were based on, indicating a consistent pattern of longitudinal change detected in both AD and MCI. Moreover, the choice of Stat-ROI did not significantly affect the sample size estimates in AD (Fig. 5a) and MCI (Fig. 5b) (all pair-wise comparisons had corrected p-values greater than 0.05).
In practice, the attrition rate would have to be considered in making sample size estimates. ADNI, as a research project, has a low attrition rate of 5–7%. Real clinical trials usually have a much higher attrition rate ranging from 20% to 30% per year. To illustrate this effect, we computed sample size estimates using TBM-derived neuroimaging measures with attrition rates built into the power analysis, assuming that the people who drop out are a random sub-sample of the overall sample (Figs. 6 and 7). With little to no attrition, longer trials are better in terms of sample size estimates. As the attrition rate goes up, the longer trials, i.e., 18- and 24-month, begin to fall behind the shorter trials, as the benefit of waiting longer for the disease to progress is outweighed by the number of people who fail to return for follow-up evaluation. The 12-month interval was the optimal trial duration (among those considered here) when the attrition rate was greater than 20% per year for both AD and MCI (Fig. 6). To determine the cut-off point, the adjusted n80s were plotted against the attrition rate (Fig. 7). We found that the 24-month trial gave the best sample size estimate so long as the attrition rate was below 15–16% per year. If the attrition rate rose further, the 12-month trial became the optimal choice, outperforming all other trial durations (Fig. 7). More complex models could be fitted if the attrition rate is itself variable over time, but prior studies have shown that a somewhat consistent proportion of subjects tends to drop out with each successive follow-up (Thompson et al., 2008).
Serial structural brain changes in Alzheimer’s disease were visualized in patients with probable AD (N = 91) and MCI (N=189), scanned at baseline and after intervals of 6, 12, 18, and 24 months. 3D maps showed the regional distribution of cumulative atrophy across the brain, relative to the baseline scan, and the resulting patterns, at all time-points, are consistent with many past reports that pinpoint areas most affected by AD (Frisoni et al., 2009; Scahill and Fox, 2007; Scahill et al., 2003; Thompson and Apostolova, 2007; Whitwell et al., 2007). The most prominent features detected by longitudinal MRI and TBM were ventricular expansion and temporal lobe atrophy (Fig. 1), consistent with earlier findings.
To improve the utility of voxel-based neuroimaging for clinical trials, which tend to deal with “single-number” outcome measures, we also derived a numeric summary to quantify the overall amount of temporal lobe atrophy in regions consistently affected by AD. After reducing a 3D map into this single atrophy score, we evaluated its statistical power versus standard clinical measures for detecting disease progression in AD and MCI. Sample size estimates were far smaller for neuroimaging measures than for clinical scores, suggesting that fewer patients or shorter observation times would still offer sufficient power in clinical trials using neuroimaging surrogate markers (Chen et al., 2009a,b; Jack et al., 2003; Reiman et al., 2008b; Schuff et al., 2009).
Probable AD patients, followed up after 6-, 12-, and 24-month intervals, demonstrated successively greater cumulative brain atrophy, and progressively smaller sample sizes were needed to detect a 25% reduction in the average change in a hypothetical clinical trial. MCI subjects also showed a gradual increase in the cumulative level of brain atrophy (versus baseline) in maps created at 6-, 12-, 18-, and 24-month intervals. The 18-month follow-up for MCI, when used on its own, did not exhibit a detectably improved sample size estimate compared to a 12-month trial. Brain atrophy was detectable on MRI as early as 6 months in both AD and MCI groups. Even so, changes had greater effect sizes at longer intervals, illustrating the trade-off between recruitment requirements (and therefore costs) and the required observation time. More patients are needed for shorter trials, but patient enrollment may be reduced if a longer observation time is acceptable. Regardless of the scanning interval, TBM appears to be a useful neuroimaging marker that may help reduce costs for clinical trials.
Minimal sample sizes directly relate to effect sizes, which depend upon both the mean and standard deviation (SD) of the atrophy measures. For TBM measures derived from the statistically pre-defined region of interest, both the mean level of cumulative atrophy and its SD increased monotonically with longer inter-scan intervals (Fig. 3); as the mean rose faster than the variance, incrementally greater effect sizes (and smaller sample size estimates) were observed. According to one interpretation, most of the sources of measurement error in estimating the atrophic rate from serial MRI scans (e.g., scanner calibration, RF bias fields) may be roughly the same regardless of the scan interval, but the interval must always be long enough for sufficient systematic atrophy to accumulate and be detectable above the noise that is inevitable whenever measurements are made.
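The dependence of sample size on the mean and SD can be made concrete with a standard two-sample power calculation. The sketch below assumes the common form n = 2σ²(z_{1−α/2} + z_{power})² / (0.25μ)², in which the 0.25 multiplier in the denominator represents a 25% reduction in the mean change; the formula actually used in this study is the one given in the Materials and methods section. The function name and the input values in the example are hypothetical and are not measurements from this study.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(mean_change, sd_change, reduction=0.25,
                        alpha=0.05, power=0.80):
    """Per-arm n to detect a fractional reduction in mean change.

    Assumes a two-sided test and equal variance in both arms.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    effect = reduction * mean_change  # absolute difference to detect
    n = 2.0 * sd_change ** 2 * (z_alpha + z_power) ** 2 / effect ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical numbers: the mean doubles while the SD grows by only 50%,
    # so the ratio SD / mean shrinks and fewer subjects are needed.
    short_interval = sample_size_per_arm(mean_change=1.0, sd_change=0.6)
    long_interval = sample_size_per_arm(mean_change=2.0, sd_change=0.9)
    print(short_interval, long_interval)
```

Because only the ratio of SD to mean enters the formula, a longer interval lowers the estimate whenever the mean grows proportionally faster than the SD.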
Even so, it is not logically necessary that long intervals must always give best effect sizes: they may not if the population variability increases drastically. Attrition effects are also relevant in practice. A long study may be useful to maximize the theoretical effect size, but only if patients are willing and able to stay in the trial to allow a longer follow-up. With attrition rates greater than 15–16% per year, shorter trials (e.g., 12-month) became a better choice (Fig. 7), probably because the added effect sizes from cumulative change over a longer period were outweighed by the exponentially growing attrition. Additionally, shorter trials may also be favored due to cost concerns, as the cost generally rises with longer follow-up periods. Even when considering the attrition rate, the 12-month trial is consistently better than the 6-month trial. Although a reasonable amount of change can be detected at 6 months (and differences between MCI and AD group rates of atrophy were detectable at 6 months), a 12-month trial might be the optimal trial duration in practice, assuming a typical attrition rate (20–30% per year) for clinical trials. Clearly, the optimal follow-up interval depends not only on the effect size, but also on the attrition rate, cost and other potentially unmodeled factors.
In one recent clinical trial, we used 3D cortical mapping methods to assess cortical thickness at 3-, 6-, and 12-month follow-up intervals in patients with first-episode schizophrenia, randomized to two different medications (Thompson et al., 2008). Intriguingly, medication effects—favoring olanzapine over haloperidol treatment—were significant as early as 3 and 6 months, but were no longer significant at 12 months. This may be because, in that case, medications are most effective soon after disease onset, and less so later, or it may be because, in schizophrenia, atrophic rates are greater soon after disease onset, and less so later (the opposite appears to be the case in AD and MCI, where atrophic rates are thought to accelerate over time as the disease progresses).
In another MRI study measuring atrophy in AD (Schott, 2005), 38 AD patients and 19 controls were scanned at 0, 6, and 12 months, and the boundary shift integral (BSI) method was used to measure whole brain volume loss over time. Consistent with our findings, smaller sample sizes were needed to detect the same effect size at longer intervals. In Schott et al.’s study, only 154 subjects per arm were needed to detect a 20% reduction in the rate of atrophy when measurements were made after 1 year, versus 410 for a 6-month interval. Schott et al. also noted that smaller samples were needed when measuring rates of ventricular enlargement versus whole brain atrophy, if the trial duration was 6 months, but not at 12 months. They found that the variance in the estimated rate of whole brain atrophy was almost twice as high at 6 versus 12 months, but the variance in the ventricular atrophy rate—which can be measured more consistently—was about the same at 6 versus 12 months.
We also found that refining the search region can improve sample size estimates. Use of a statistically pre-defined ROI gave better power estimates at all time-points, compared to an anatomically defined ROI that included the entire temporal lobes (Fig. 3 and Table 1). The use of a statistically-defined ROI based on an independent training sample was first proposed for positron emission tomography (PET) images (Chen et al., 2009a,b; Reiman et al., 2008a). We recently applied the same idea to MRI image analysis, and found consistently higher power to detect disease-related brain atrophy (Hua et al., 2009a; Ho et al., in press). A Stat-ROI reduces the search regions to areas showing the most robust changes in AD or MCI, eliminating regions with greater variance or smaller effect sizes across subjects, and avoiding regions with changes in the opposite direction (e.g., ventricular enlargement or intra-sulcal CSF). The Stat-ROI is slightly smaller in size than an atlas-based definition of the temporal lobes (Fig. 2) and, relative to the full temporal lobe ROI, it largely removes partial volumed voxels (containing mixtures of different tissues) at gray matter/CSF interfaces. At the interface of brain tissue and CSF, voxels with progressive tissue loss (in the gray matter) are partial volumed with expanding voxels (in the ventricular space) (Hua et al., 2008a, 2009b), which can give high variance across subjects and can reduce statistical power. The Stat-ROI retains regions changing the most in AD, and boosts power substantially.
We set the temporal lobes as the default search region to compute the Stat-ROI because they are among the earliest regions to become atrophic in AD (Braak and Braak, 1991; Braskie et al., 2008; Jack et al., 1997, 1998; Killiany et al., 1993; Thompson et al., 2003), showing the most differentiation between patients and healthy elderly (Hua et al., 2008a). The exploratory analysis in multiple anatomical regions (Table 2) further confirmed that the temporal lobes were the best search region to use, when this TBM method is used to identify AD-associated brain degeneration. Several other regions performed similarly in terms of n80. The hippocampus did not show especially high power with TBM, perhaps because it is too small to be modeled accurately with TBM due to partial volume effects (Hua et al., 2009a). The Stat-ROI within the temporal lobe includes several regions with early pathological changes in AD such as the medial temporal lobe and entorhinal cortex (Fig. 2), but largely avoids the hippocampus due to the limitation of TBM in picking up changes with high effect sizes from small structures. In other work (Protas et al., 2009), we have combined data from multiple ROIs for diagnostic classification in AD and MCI, and a similar approach could be taken here. Even so, the temporal lobe statistical ROI almost always gave the best results when used alone, so we report data from that region here for simplicity of implementation and interpretation.
Stat-ROIs based on different diagnostic groups (AD vs. MCI) and based on different time points were very similar. They covered similar regions within the temporal lobes (Fig. 4), and power analysis was insensitive to these different choices of Stat-ROIs (Fig. 5). This suggests that a Stat-ROI derived from one time point may be reasonably applied to data from other time-points, making it easy to implement in clinical trials or other longitudinal studies. As picking a training set at each time point or for each diagnostic group is unnecessary, the number of scans in the testing set is preserved to maximize the sample size for power analysis.
Based on the power formula described in the Materials and methods section, the estimated minimum sample size for each arm is computed assuming a 25% slowing of the atrophy rate, indicated by a multiplier of 0.25 in the denominator. In reality, treatments may slow atrophy to different degrees, which may be denoted by k%, for different k. The sample size estimates required to detect a k% slowing of atrophy can be easily derived by multiplying the sample size estimates (n80 or n90) in this paper by (25/k)², as the numbers follow an inverse-square law. For example, 4 times as many subjects would be needed to detect a 12.5% slowing of atrophy (half of 25%), versus a 25% slowing of atrophy (Ho et al., in press). The quadratic relationship between the sample size estimates and the percentage atrophic rate is illustrated in Fig. 8, using the n80s for a 12-month trial derived from the Stat-ROI (Table 1), as an example. The results of this paper can be easily translated to studies aiming to detect a different level of treatment effect, and our findings remain unaffected as multiplying the variables by a constant (25/k)² does not alter the ranking of the effect sizes in the statistical tests (it is a monotone transformation, i.e., it preserves the rank order).
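This rescaling can be written as a short helper. The sketch below applies the inverse-square factor, (25/k) squared, to the 12-month AD estimate of 46 subjects per arm quoted in the abstract; the function name is hypothetical, and rounding up to a whole subject is a choice made here for illustration.

```python
import math


def rescale_sample_size(n_at_25, k_percent):
    """Scale an estimate computed for a 25% slowing to a k% slowing."""
    if k_percent <= 0:
        raise ValueError("k_percent must be positive")
    return math.ceil(n_at_25 * (25.0 / k_percent) ** 2)


if __name__ == "__main__":
    n80_12_month_ad = 46  # per-arm estimate for a 25% slowing, from the abstract
    for k in (12.5, 20, 25, 50):
        print(k, rescale_sample_size(n80_12_month_ad, k))
```

Halving the target effect to 12.5% quadruples the estimate (46 becomes 184), while doubling it to 50% cuts the estimate to a quarter (11.5, rounded up to 12).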
It may seem paradoxical that the brain changes with greatest effect sizes in TBM were generally found in large homogeneous regions of the white matter, when AD is widely accepted to be a predominantly hippocampal and cortical gray matter pathology. The predominant site of plaque and tangle accumulation in AD is the hippocampus and cortex, and the molecular hallmarks of AD spread through the cortex in a characteristic trajectory (Braak and Braak, 1991; Braskie et al., 2008). Volumetric atrophy is widespread in the cortical gray matter (GM). Our group, along with many others, has mapped the spatial-temporal trajectory of GM loss in AD, and it largely mirrors the Braak and Braak sequence of AD pathology (Dickerson et al., 2009; Thompson et al., 2003), and the cortical trajectory of plaque and tangle build-up tracked with [18F]FDDNP-PET (Braskie et al., 2008).
For some time, white matter (WM) changes in AD were relatively difficult to quantify in conventional MRI because of the lack of visible anatomical boundaries that would be required to parcellate white matter. Other than white matter hyperintensities, which are hallmark lesions of cerebral vascular disease (Brickman et al., 2009), the nature of WM degeneration in AD has not been well characterized until recently. Diffusion tensor imaging (DTI), relaxometry, and functional connectivity studies have now provided substantial evidence for diffuse WM abnormalities in AD (Buckner et al., 2009; Wozniak and Lim, 2006). Myelin breakdown and Wallerian degeneration both lead to WM atrophy, perhaps secondary to the effect of cortical neuronal loss in AD (Bartzokis, 2009; Bartzokis et al., 2006, 2007; Spires-Jones et al., 2009). Disease-related WM degeneration is detectable in the form of reduced fractional anisotropy on DTI (Huang et al., 2007; Naggara et al., 2006; Rose et al., 2000; Sandson et al., 1999; Wang et al., 2009), lowered relaxation rates in T2-based MRI relaxometry (Bartzokis et al., 2003, 2004), and disrupted connectivity observed using resting-state functional MRI (Agosta et al., 2009; Buckner et al., 2009; Hedden et al., 2009; Supekar et al., 2008; Wang et al., 2007; Zhou et al., 2008).
As both GM and WM changes are occurring, a key question is which of the MRI-derived measures is the most reliable for detecting group differences or dynamic changes over time, and which can resolve them with greatest effect sizes and accuracy in longitudinal studies. As a percentage, more cortical and hippocampal gray matter may be lost over time than white matter. Even so, the effect sizes for the changes in gray matter may be lower than expected, as these structures are convoluted and difficult to measure accurately. The cortex is thin and the hippocampus is narrow, often only a few voxels thick, and automated measures of the same structures using different algorithms can disagree substantially (Morra et al., 2009b), leading to calls for better harmonization of hippocampal measurement methods across studies (Frisoni et al., 2010).
This is especially the case in voxel-based maps, where changes may be greatest, as a percentage of their volume, in the cortex and hippocampus, but when pooling data across subjects voxel-by-voxel, the interiors of large white matter structures still tend to be better registered than the cortical and hippocampal boundaries once all the data are aligned. In the interiors of structures, such as the white matter, coherent patterns (such as atrophy) are more likely to be reinforced across all members of a group than at boundary voxels where loss patterns may be less well registered, even after nonlinear registration. This effect can be partially overcome using voxel-based morphometry (VBM; Ashburner and Friston, 2000) but that method multiplies gray matter “density” measures by registration-based estimates of changes, whereas TBM uses the registration field only. Cortical thickness measures also tend to have relatively poor reproducibility; in fact, different algorithms give mean cortical thickness values for normal subjects that differ by a factor of two (Aganj et al., 2009).
Our earlier TBM paper on the 12-month ADNI follow-up data (Hua et al., 2009) was mainly focused on optimizing the TBM processing pipeline for statistical power. We studied the effects on the results of critical parameters such as the linear registration steps, and the regularization parameters of the deformation model. Because so many parameter options and analysis choices could be made, we compared TBM designs with different linear and nonlinear registration parameters (including different regularizing functions) and found the set of best-performing parameters from the standpoint of maximizing effect sizes. A secondary focus of that prior paper was to motivate and evaluate the use of a data-driven statistical ROI to compute power estimates. We found that the statistical power of tracking brain degeneration could be further enhanced by using a statistically-defined ROI within the temporal lobes, based on voxels found to change the most in an independent sample. That study served as a foundation for the current study, in which we used the best TBM design and parameters from the prior study.
The current study asks a different question about the optimal length of a clinical trial, with the intent of understanding factors that might influence power in a clinical trial. Expanding the temporal sampling significantly relative to the prior study, we analyzed brain scans collected at baseline, 6, 12, 18, and 24 months, for both AD patients and MCI subjects. Unlike our prior report, the current study used scans at 5 time-points, and from many more subjects, so we were able to gauge the tradeoff between trial duration and recruitment requirements, both with and without considering likely attrition rates. The resulting information offers an evaluation of a simple and readily implementable image analysis method as well as a guideline for trials involving a neuroimaging component.
The current study has some limitations, and some qualifications are needed. As noted before, a 25% reduction in the atrophic rate may have a different functional significance for a patient than a 25% reduction in the rate of decline for clinical or cognitive test scores (and may in reality be either better or worse for the patient). As such, a head-to-head comparison of clinical and neuroimaging measures is useful, but in reality both measures are informative and will continue to be widely used. Secondly, longitudinal studies with more than two time-points can employ more advanced statistical designs that use all the data at once, such as random effects or mixed effects models, to estimate intra-subject variance and take advantage of the repeated measures (Fitzmaurice et al., 2004; Frost et al., 2004). In recent ADNI analyses, using all scans in the time-series has been shown to boost power estimates (Schuff et al., 2009). Some advocate the use of two scans taken on the same day from each subject, showing a net beneficial effect on sample size estimates, especially when the follow-up interval is short (Schott et al., 2006). Finally, it is reasonable that multiple complementary neuroimaging measures, from MRI, PET and other modalities (arterial spin labeling, diffusion imaging, and resting-state fMRI) may be combined in the future, with genetic and CSF biomarker data, to give more robust metrics of disease progression, and to obtain better predictive value in longitudinal models.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abbott, AstraZeneca AB, Bayer Schering Pharma AG, Bristol-Myers Squibb, Eisai Global Clinical Development, Elan Corporation, Genentech, GE Healthcare, GlaxoSmithKline, Innogenetics, Johnson and Johnson, Eli Lilly and Co., Medpace, Inc., Merck and Co., Inc., Novartis AG, Pfizer Inc, F. Hoffman-La Roche, Schering-Plough, Synarc, Inc., and Wyeth, as well as non-profit partners the Alzheimer’s Association and Alzheimer’s Drug Discovery Foundation, with participation from the U.S. Food and Drug Administration. Private sector contributions to ADNI are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Disease Cooperative Study at the University of California, San Diego. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles. This research was also supported by NIH grants P30 AG010129, K01 AG030514, and the Dana Foundation. Algorithm development and image analysis for this study was funded by grants to P.T. from the NIBIB (R01 EB007813, R01 EB008281, R01 EB008432), NICHD (R01 HD050735), and NIA (R01 AG020098), and National Institutes of Health through the NIH Roadmap for Medical Research, Grants U54-RR021813 (CCB) (to AWT and PT).