This study examined the reliability of two fully automatic segmentation and labeling programs for measuring the volumes of subcortical and other brain structures in a group of normal subjects who were repeatedly scanned within a 7–9 day interval. Overall, scan–rescan variability was consistently present and of substantial magnitude, and it was exacerbated for small structures such as the amygdala, accumbens, and pallidum. The choice of software (FreeSurfer or FSL/FIRST) did not strongly influence reliability; however, FIRST produced higher reliability for the small structures measured here. Consistent with these main effects, sample size estimates for longitudinal studies were greatest for regions with poor or moderate rescan reliability, particularly when detecting small effects.
We found little difference in reliability between the 1 hour and 1 week interscan intervals, although for some brain structures, such as the right hippocampus, the 1 hour reliability (2A–2B; ICC = 0.82) was lower than the 1 week reliability (1B–2B; ICC = 0.89). Higher reliability might be expected for shorter interscan intervals because magnetic field instabilities or drift have less time to accumulate. Other sources of variance, such as the effect of subject repositioning, were similar for the 1 hour and 1 week interscan intervals. The Duke scanning site, where these images were obtained, uses rigorous and regular QA procedures (Friedman and Glover, 2006; Keator et al., 2008) that may have diminished some sources of scanner variability.
The goal of our article is to inform longitudinal studies in which a selected group is assessed at two time points and the volume data from the two time points are compared. If perfect scan–rescan reliability were achieved, then the true change in any structure (e.g., its volume) could be measured perfectly in a longitudinal design. We have approached this problem by examining the special case in which the true change in the structure is assumed to be zero and then measuring the departure from perfect reliability. The change observed in the selected group of cases can be compared with the longitudinal change in a control group to characterize the effects of the treatment or process (e.g., aging) in question. Although a longitudinal design has the advantage of limiting individual variability, with each subject acting as his or her own control, other sources of variability persist when the same subject is scanned repeatedly on the same scanner with the same acquisition parameters, even over a relatively short interval between scans. These sources of variability include small changes in image orientation, changes in prescan parameters, and instability in the magnetic field.
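To make this reasoning concrete, the standard reliability decomposition (our illustration, not a formula reproduced from this study's Methods) links a reported ICC to the measurement error that limits detection of longitudinal change when the true change is zero:

```latex
% Illustrative only: standard reliability decomposition, not taken from the study.
% \sigma_b^2 = between-subject variance, \sigma_e^2 = scan-rescan error variance,
% \sigma_{\mathrm{total}}^2 = \sigma_b^2 + \sigma_e^2.
\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2},
\qquad
\sigma_e = \sigma_{\mathrm{total}}\sqrt{1 - \mathrm{ICC}},
\qquad
\mathrm{SDC}_{95} \approx 1.96\,\sqrt{2}\,\sigma_e .
```

Here SDC95 is the smallest within-subject change distinguishable from measurement noise at the 95% level. For example, an ICC of 0.98 implies an error SD of about 14% of the total SD, whereas an ICC of 0.82 implies about 42%, roughly tripling the smallest change that can be detected in an individual.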
Previous work examined interscanner reliability, where field strength and manufacturer were varied (Jovicich et al., 2006; Reig et al., 2009; Schnack et al., 2004). Only two studies, to our knowledge, examined rescan (intrascanner) reliability with automated segmentation; the first was limited to basic tissue class segmentation (GM, WM, CSF) (Agartz et al., 2001). The second study, by Wonderlick et al. (2009), reported reliability with FreeSurfer (version 4.0.1) broadly similar to what we found in this study. For example, similar reliability was obtained in the amygdala (0.85 for Wonderlick et al. vs. 0.87 (left) and 0.82 (right) for this study), caudate (0.99 vs. 0.98 and 0.98), hippocampus (0.96 vs. 0.98 and 0.94), pallidum (0.87 vs. 0.92 and 0.91), putamen (0.95 vs. 0.97 and 0.96), and thalamus (0.97 vs. 0.98 and 0.97). However, the values obtained by Wonderlick et al. with FreeSurfer 4.0.1 were much higher than those we obtained with the cross-sectional stream of FreeSurfer (v4.4), as seen in Supporting Information Table S1. Several factors may have contributed to these differences, including differences in scanner manufacturer, scanner hardware (Siemens 3T TIM Trio in their study vs. GE 3T EXCITE in our study), headcoil (12 channel vs. 8 channel), pulse sequence (MP-RAGE vs. FSPGR with ASSET), sample size (11 vs. 23), age profile of participants (young and old subgroups vs. a young group), and interscan interval (2 weeks vs. 1 h and 1 week).
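For readers who wish to make this kind of comparison on their own scan–rescan data, the sketch below shows one conventional way to compute a single-measure, absolute-agreement intraclass correlation (the Shrout and Fleiss ICC(2,1) form) from a subjects-by-sessions matrix of volumes. The function name and toy numbers are ours and purely illustrative; we are not asserting that either study used exactly this estimator.

```python
import numpy as np

def icc_2_1(Y):
    """Two-way random-effects, absolute-agreement, single-measure ICC
    (Shrout & Fleiss ICC(2,1)) for an n_subjects x n_sessions matrix Y."""
    Y = np.asarray(Y, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)           # per-subject means
    col_means = Y.mean(axis=0)           # per-session means
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ((Y - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)          # between-subject mean square
    ms_cols = ss_cols / (k - 1)          # between-session mean square
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err
                                 + k * (ms_cols - ms_err) / n)

# Toy example: hippocampal volumes (mm^3) for 5 subjects, scan and rescan.
volumes = [[4300, 4260], [3900, 3950], [4750, 4710], [4100, 4180], [4500, 4470]]
print(round(icc_2_1(volumes), 3))
```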
We did not undertake an experimental study of the variables that may have contributed to the scan–rescan differences in our brain volumes. One factor that is difficult to control is the precise position of the subject's head within the head coil. A slightly different orientation can result in partial volume effects for different tissue types along the boundary of a brain structure, which can change the contrast of the surface boundary with neighboring structures. Such effects are most relevant for boundary voxels and of lesser consequence for voxels located in the interior of a structure. When the boundary is distinct, meaning there is minimal overlap in the probability distributions of signal intensity between adjacent structures, this variance has a minor effect on the resulting segmentation. However, when the boundary with a neighboring structure is less distinct, with a larger overlap in the probability distributions of signal intensity, this variance can derail automated segmentation and dramatically change the outcome. Reliability may also differ across brain structures because of variability in tissue contrast profiles and divergent modeling algorithms (e.g., cortical surface-based vs. voxel-based segmentation methods). A host of other factors specific to the segmentation algorithm and the atlas being used are likely to alter the outcome of segmentation and its vulnerability to MR signal variance in difficult-to-segment regions (Shattuck et al., 2008). Thus, depending on its location, a small change in MR signal may lead to a large and sometimes unpredictable difference in the outcome of the segmentation algorithm. The reliability measurements observed in this sample of young healthy adults are therefore unlikely to be limited by the atlases associated with FreeSurfer and FIRST, which encompass a wide range of demography and pathology. Additional concerns related to the participant sample are covered in the Limitations section that follows.
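A back-of-the-envelope calculation, entirely ours and based on idealized spherical geometry rather than any measurement from this study, illustrates why boundary-driven variance penalizes small structures: the fraction of a structure's volume affected by a fixed boundary displacement scales with its surface-to-volume ratio.

```python
import math

def boundary_sensitivity(volume_mm3, shift_mm=0.5):
    """Fractional volume change produced by displacing the surface of an
    idealized spherical structure outward by `shift_mm` (e.g., half a voxel
    of partial-volume ambiguity). Illustrative only: real structures are not
    spheres and segmentation errors are not uniform over the whole surface."""
    r = (3.0 * volume_mm3 / (4.0 * math.pi)) ** (1.0 / 3.0)   # equivalent radius
    surface = 4.0 * math.pi * r ** 2                           # surface area
    return surface * shift_mm / volume_mm3

# Roughly amygdala-sized vs. thalamus-sized structures (typical values, not study data).
for name, vol in [("amygdala-like, ~1.5 cm^3", 1500.0), ("thalamus-like, ~7 cm^3", 7000.0)]:
    print(f"{name}: ~{100 * boundary_sensitivity(vol):.1f}% volume change per 0.5 mm shift")
```

The absolute percentages overstate realistic errors, because the entire surface never shifts coherently, but the roughly 1.7-fold greater sensitivity of the amygdala-sized structure mirrors the pattern of lower reliability for smaller structures.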
Rescan reliability was investigated for manual tracing by Bartzokis et al. (1993), who reported slightly lower reliability than we found for automated segmentation in a number of regions, such as the hippocampus (0.91 for Bartzokis et al. vs. 0.98 (left) and 0.94 (right) for this study with FreeSurfer) and the amygdala (0.75 vs. 0.87 and 0.82 for this study with FreeSurfer). Similarly, rescan reliability of intracranial volume with manual tracing (ICC = 0.95) was slightly lower than intra-rater (same scan) reliability (ICC = 0.96) (Nandigam et al., 2007). Most volumetric studies that use manual tracing report high intra-rater reliability even in challenging regions such as the hippocampus and amygdala (ICC = 0.95) (Mervaala et al., 2000; Rojas et al., 2004; Whitwell et al., 2005). When a baseline manual segmentation is available, fluid registration can be used to attain highly reliable segmentation of repeat scans that is superior to a subsequent manual segmentation (Crum et al., 2001). This approach is especially advantageous when multiple repeat scans are acquired longitudinally because it requires manual segmentation of only the first scan. However, the appearance of lesions or rapid degeneration between scans can compromise the fluid registration approach. Thus, fluid registration may be preferred for regions, such as the amygdala, that are segmented unreliably by the fully automated methods.
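To make the mechanics of this approach concrete, the sketch below shows only the label-propagation step, under the assumption that a deformation field mapping follow-up voxels back to baseline space has already been produced by a fluid (or other nonlinear) registration tool; the function and array names are ours, and the registration itself, which is the demanding part, is not shown.

```python
from scipy.ndimage import map_coordinates

def propagate_labels(baseline_labels, coords_in_baseline):
    """Resample a baseline (manually traced) label volume onto a follow-up grid.

    baseline_labels    : integer NumPy label array from the baseline manual segmentation
    coords_in_baseline : array of shape (3, X, Y, Z) giving, for each follow-up
                         voxel, its (i, j, k) position in baseline space, as
                         estimated by a fluid/nonlinear registration
    Nearest-neighbour interpolation (order=0) keeps the labels integral.
    """
    return map_coordinates(baseline_labels, coords_in_baseline,
                           order=0, mode="nearest")

# Hypothetical usage, once a registration tool has supplied the deformation field:
# followup_labels = propagate_labels(baseline_labels, coords_in_baseline)
```

The follow-up volume of each structure is then simply the count of voxels carrying its label multiplied by the voxel volume.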
The exact sample size recommendations from our power analyses are specific to the hardware, software, segmentation method, pulse sequence, and other parameters used in this study. However, our procedures are typical of academic imaging at many major research institutions, and thus the relative effects across brain regions are likely to generalize. Multichannel imaging that combines T1-, T2-, and PD-weighted sequences to optimize automated segmentation may improve rescan reliability compared with single-channel acquisition. Such multichannel acquisitions also offer substantial invariance to acquisition parameters (Fischl et al., 2004). Pulse sequences such as high-bandwidth multiecho FLASH, which have a high signal-to-noise ratio and minimal image distortion from B0 effects, have also been shown to improve reliability. Wonderlick et al. (2009) examined the performance of scan–rescan segmentation with FreeSurfer using recent advances in MR acquisition, including high resolution (1 mm isotropic), parallel acquisition with a phased-array headcoil, and a multiecho T1-weighted sequence, with an MP-RAGE sequence (1.3 × 1.0 × 1.3 mm) acquired for comparison testing. Even when these advanced approaches were used, the effect of MR signal variance on automated segmentation was not eliminated.
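As a sketch of how rescan reliability propagates into such sample-size estimates (our own illustration with invented numbers, not the power-analysis code used in this study), consider a two-group comparison of longitudinal change in which the only within-subject variability is the scan–rescan measurement error implied by the ICC:

```python
import math
from scipy.stats import norm

def n_per_group(icc, sd_total, delta, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect a group difference `delta` in
    volume change, assuming the within-subject variability of the change score
    comes only from scan-rescan measurement error (normal approximation).

    icc      : scan-rescan intraclass correlation of the volume measure
    sd_total : observed between-subject SD of the volume (same units as delta)
    delta    : true group difference in volume change to detect
    Real designs also contain biological variability in change, so these
    counts should be read as illustrative lower bounds.
    """
    sigma_e = sd_total * math.sqrt(1.0 - icc)   # per-scan error SD
    sd_change = math.sqrt(2.0) * sigma_e        # SD of (rescan - scan)
    z = norm.ppf(1.0 - alpha / 2.0) + norm.ppf(power)
    return math.ceil(2.0 * (z * sd_change / delta) ** 2)

# Invented example: detect a 100 mm^3 group difference in change for a
# structure with a 400 mm^3 between-subject SD, at several reliabilities.
for icc in (0.98, 0.90, 0.82):
    print(icc, n_per_group(icc, sd_total=400.0, delta=100.0))
```

With these invented numbers the required group size grows from roughly a dozen at ICC = 0.98 to around ninety at ICC = 0.82, echoing the pattern that the least reliable regions demand the largest longitudinal samples.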
Our findings may be specific to FreeSurfer and FIRST, two popular noncommercial software programs used at many research institutions. It is important to emphasize that we did not evaluate the validity or accuracy of the measurements from these two programs. A segmentation obtained from a single scan may well be accurate, but accuracy on a single scan provides no information about how consistently a program segments the same brain when it is scanned repeatedly, and this consistency is generally not assessed in studies of segmentation accuracy. Indeed, we show that for certain regions the scan–rescan reliability of automatically segmented brain volumes is of concern.
Our sample of healthy young participants limits the generalizability of the present findings. This demographic is unlikely to contain extremes of the population distribution, and one might therefore expect higher scan–rescan reliability in our sample than in a sample representative of a more diverse population. The intriguing point is that despite the narrow demographic attributes of this group, reliability was surprisingly low in some instances, and it might be even lower in a sample that is more diverse in demography (e.g., age) or neuropsychiatric pathology. Similarly, our power analyses are likely to underestimate the number of subjects required for longitudinal studies in more diverse groups. Studies with a case-control design are likely to encounter still greater variance related to individual differences, which is avoided in a longitudinal design where each participant serves as his or her own control.