In this paper, we show that human subcortical volume estimates derived from brain structural MRI data are remarkably reproducible across a variety of data acquisition and analysis factors when using the publicly available FreeSurfer automated segmentation tool. Specifically, using a group of healthy older subjects (mean age 69.5 years, n=15) and two different groups of young subjects (both n=5, mean ages 34 and 36.5 years), we examined how the test-retest reproducibility of hippocampal, thalamic, caudate, putamen, pallidum, amygdala, ventricular and intracranial volumes is affected by scan session, structural MRI acquisition sequence, data preprocessing, subcortical segmentation analyses, major MRI system upgrades, and vendor and field-strength effects. We identified a number of factors that contribute little to within- or across-session variability, and others that contribute potentially important within- and across-session variability.
The segmentation errors reported in this work represent the best estimate we can give for the error of the method under the reported measurement conditions. The main factors that introduce errors in the final segmentation results are image quality (signal-to-noise and contrast-to-noise ratios) and brain anatomical variability relative to the probabilistic atlas. These factors are intermingled. Realistic brain anatomical simulations with pre-defined characteristics for subcortical structures and their spatial arrangements could be attempted to separate the contribution of segmentation errors from image quality and segmentation atlas factors. These issues are important but are beyond the scope of this manuscript. The closest to a ground truth currently available for assessing the accuracy of the FreeSurfer segmentation method is comparison with manual segmentations by a neuroanatomist, as validated in Fischl et al. (2002).
The segmentation results are comparable with previously reported results (The Internet Brain Volume Database, http://www.cma.mgh.harvard.edu/ibvd/). For most structures there is a fairly wide range of published estimates of normal volume, and our estimates fall within the typical range.
We expect the best possible volume reproducibility from data acquired within the same scanning session using identical acquisition sequences. For both the older and the young groups, within-session reproducibility was comparable to across-session reproducibility when data were acquired on the same MRI system. For the hippocampus, thalamus, caudate, putamen, lateral ventricles and intracranial volume, the reproducibility error across sessions on the same scanner was less than 4.3% in the older group and less than 2.3% in the young group. This difference is most likely due to the fact that older subjects tend to move more during scans, yielding suboptimal image quality (gray-white matter contrast-to-noise ratio) relative to the younger subjects. Smaller structures (pallidum, amygdala and inferior lateral ventricles) gave higher reproducibility errors (under 10.2% for the older group and under 10.4% for the young group). The reproducibility error is defined as 100*SD/MEAN, where SD is the standard deviation of the test-retest volume differences and MEAN is the mean volume within the group. For small structures MEAN decreases, so for a similar or worse SD the reproducibility error increases. The fact that the reproducibility of the young and older groups becomes more similar for smaller volumes indicates that the size effect (MEAN volume) dominates the SD differences between groups.
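As a minimal sketch of this definition (using hypothetical volumes, not the study data), the reproducibility error for a structure can be computed as:

```python
import statistics

def reproducibility_error(test_volumes, retest_volumes):
    """Percent reproducibility error: 100 * SD / MEAN, where SD is the
    standard deviation of the test-retest differences and MEAN is the
    mean volume within the group."""
    diffs = [t - r for t, r in zip(test_volumes, retest_volumes)]
    sd = statistics.stdev(diffs)
    mean_volume = statistics.mean(test_volumes + retest_volumes)
    return 100.0 * sd / mean_volume

# Hypothetical hippocampal volumes (mm^3) for five subjects scanned twice
test = [4100.0, 3950.0, 4300.0, 4010.0, 4200.0]
retest = [4150.0, 3900.0, 4280.0, 4060.0, 4180.0]
error = reproducibility_error(test, retest)  # roughly 1.1% for these numbers
```

Because MEAN appears in the denominator, a small structure with the same absolute SD yields a larger percentage error, which is the size effect discussed above.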
Having an adequate number of subjects is very important to minimize biases in the results, yet it can be challenging for reproducibility studies like the one described here because of the significant cost in scanner time. Each subject in each of the three datasets (dataset 1: 15 subjects, dataset 2: 5 subjects, dataset 3: 5 subjects) was scanned in four different 1-hour sessions, so the effective scanner-hour costs were 60, 20 and 20 for datasets 1, 2 and 3, respectively. The acquisition of the scanner upgrade data (datasets 2 and 3) posed additional practical challenges: scans had to be acquired within a short time before/after the upgrade. In particular, right before an upgrade scanner availability tends to be lower than normal because many projects need to complete acquisitions before the upgrade. For this reason datasets 2 and 3 ended up with fewer subjects and with gender imbalances that were hard to avoid. The jackknife bias analysis indicated that the number of subjects used in the older dataset gave a relatively low mean proportional bias across the structures investigated (1.8%), whereas the same value was substantially higher for the young group (46%). This indicates that the results derived from the young groups (scanner upgrade effects and B1 inhomogeneity correction effects) should be considered preliminary and in need of further validation with a larger dataset.
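The jackknife bias estimate referred to here can be sketched with the generic leave-one-out construction below; the input values are hypothetical, not the study's actual analysis:

```python
import statistics

def jackknife_bias(values, statistic):
    """Jackknife bias estimate of an arbitrary statistic:
    bias = (n - 1) * (mean of leave-one-out estimates - full-sample estimate)."""
    n = len(values)
    full = statistic(values)
    leave_one_out = [statistic(values[:i] + values[i + 1:]) for i in range(n)]
    return (n - 1) * (statistics.mean(leave_one_out) - full)

# Hypothetical test-retest volume differences (mm^3); we estimate the bias
# of their standard deviation and express it as a percentage of the estimate.
diffs = [-50.0, 50.0, 20.0, -50.0, 20.0]
bias = jackknife_bias(diffs, statistics.stdev)
proportional_bias = 100.0 * bias / statistics.stdev(diffs)
```

With few subjects, removing any single observation moves the statistic substantially, which inflates the bias estimate; this is consistent with the much higher proportional bias in the n=5 young groups than in the n=15 older group.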
In agreement with results obtained in a cortical thickness reproducibility study (Han et al., 2006
), we found that averaging two acquisitions yielded only minor improvements in the reproducibility of subcortical volumes. The acquisition of two MPRAGE volumes is still recommended, mainly for practical reasons. If both scans are good, they can either be averaged or the better-quality scan can be selected for segmentation. If one volume is bad (e.g. due to motion artifacts), the other can still be used for segmentation without averaging. Furthermore, the data acquired and analyzed in this study were collected under ideal circumstances, with cooperative volunteer participants and highly skilled scanner operators, and both of these factors may reduce the apparent added value of averaging multiple acquisitions. In addition, as the signal-to-noise ratio of a single acquisition diminishes (e.g., with parallel acquisition acceleration protocols), the added value of averaging multiple acquisitions may increase.
In the small sample of young subjects we found that the B1 inhomogeneity correction method tested did not significantly improve volume reproducibility, suggesting that the extra calibration scans and the inhomogeneity correction pre-processing step can be avoided when only data acquired on the same MRI system will be considered. Further, the standard automated FreeSurfer segmentation includes an intensity normalization step (Non-parametric Non-uniform intensity Normalization, N3), so our results suggest that the effects of the N3 correction are stronger than those introduced by our B1 correction. We did not have data to evaluate whether B1 correction improves reproducibility across MRI system vendors or field strengths, but such correction will likely be critical for large-N phased arrays or small coils in general.
The choice of imaging sequence (MPRAGE or multi-echo FLASH) with the corresponding brain atlas used for the automated segmentation analyses did not show significant differences in volume repeatability. This suggests that the segmentation algorithm is robust across a variety of similar image contrast properties, thus alleviating the need to create manually labeled probabilistic atlases for different acquisition methods, consistent with recent work (Han & Fischl, 2007
). The comparison between subcortical volumes derived from the MPRAGE and MEF sequences showed that for some structures (putamen, lateral ventricles, inferior lateral ventricles, and intracranial volume) there were significant biases in the mean volume difference given by the two methods. These differences may be due to the differential sensitivity (acquisition bandwidth) the sequences have to T2* effects (signal loss, geometric distortions). The MPRAGE sequence has the advantage of currently being more standard than multi-echo FLASH, making it easier to implement consistently in multi-center studies. It is also important to recognize that the MPRAGE and multi-echo FLASH sequences have very similar contrast properties; our findings may not apply to T1-weighted sequences with different contrast properties (e.g., SPGR) or to non-T1-weighted sequences.
Within the limits of our small sample size, we find that with major MRI system upgrades (Sonata-Avanto and Trio-TrioTIM), combining pre- and post-upgrade data does not significantly worsen the variance but may introduce a bias in the mean volume differences. Combining this with our segmentation atlas results suggests that it is safe to use the same brain atlas after a system upgrade, which is very convenient. For longitudinal studies we believe it is appropriate to plan a system upgrade calibration study as part of the design, with samples from the population under study scanned shortly before and immediately after the upgrade, to correctly estimate potential biases. An important practical issue is to know about the upgrade sufficiently far in advance to plan for the calibration study, and optimally to complete it prior to the upgrade. If the longitudinal study continues after the upgrade, it should ideally be balanced across relevant study groups with respect to the number of acquisitions before and after the upgrade, since subtle effects of interest in longitudinal studies may in fact fall within the small range of variance identified in this study (e.g., hippocampal volume differences of 2–5%).
We found that when data from different MRI systems are combined (same field strength but different vendors, same vendor but different field strengths, or different vendors and field strengths), the variance of the volume differences does not significantly change relative to the test-retest reproducibility of data acquired on a fixed MRI system, but biases of the mean volume differences may become significant. All data were segmented using an atlas built from a single MRI system, suggesting that image contrast differences arising from differences in hardware and field strength were strong enough to be detected by the segmentation algorithm. The spatial reproducibility results showed consistently high spatial overlap of the segmented volumes (average Dice coefficient of 0.88 ± 0.04) for a variety of test-retest conditions, ranging from no MR system or sequence changes to changes of system, sequence or field strength. The spatial overlap results are also in good agreement with a previous study (Han et al., 2007
) that compared Siemens Sonata segmentations of the same structures with manual segmentations, suggesting that both spatial reproducibility and accuracy are high.
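For reference, the Dice coefficient used for these spatial overlap comparisons can be sketched as follows; the voxel sets are hypothetical illustrations, not actual segmentation data:

```python
def dice_coefficient(voxels_a, voxels_b):
    """Dice overlap of two segmentations represented as sets of voxel
    coordinates: 2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(voxels_a), set(voxels_b)
    if not a and not b:
        return 1.0  # two empty segmentations overlap perfectly by convention
    return 2.0 * len(a & b) / (len(a) + len(b))

# Hypothetical labelings of one structure from two sessions: 100 voxels in
# session 1, of which 90 are also labeled in session 2.
session1 = {(x, y, 0) for x in range(10) for y in range(10)}
session2 = {(x, y, 0) for x in range(1, 10) for y in range(10)}
overlap = dice_coefficient(session1, session2)  # 2*90 / (100+90) ~ 0.947
```

A Dice value of 1 indicates voxel-for-voxel identical labelings, while equal-volume segmentations placed on disjoint voxels would score 0, which is why spatial overlap complements the pure volume comparisons.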
One extension of this work would be to test whether reproducibility differences across MRI platforms can be reduced by performing the subcortical segmentations with a probabilistic atlas constructed from manual segmentations of data acquired with the various MRI systems under consideration. Alternatively, statistical models could be used, given that they have proved successful in combining data whose acquisition sequences are too different for direct pooling (Fennema-Notestine et al., 2007
). Further, our cross-vendor comparison (Siemens Sonata MPRAGE – GE Signa MPRAGE) did not include potentially significant sources of variation that arise when each vendor uses its own product sequence, which can lead to image contrast differences. Therefore our results might underestimate the variance seen with cross-vendor switches that introduce strong sequence changes, as may occur in practice.
The fact that the test-retest reproducibility variance of the segmented volumes does not significantly change across platforms and field strengths (particularly for the hippocampus) implies that a multicenter study with these MRI systems does not necessarily require a much larger sample to detect a specific effect. Of course, this is under ideal circumstances with highly motivated, cognitively intact older adults. These conclusions may not generalize to other brain structures or to patient populations with cognitive impairment if there are reductions in raw data quality related to movement or other issues. Our power calculations for detecting a net 2.45% hippocampal volume reduction rate difference between hypothetical non-treated and treated AD groups resulted in an estimate of 49 subjects per group, which was not appreciably worsened by scanner upgrades or differences in scanner platform or field strength. These results differ from previous calculations (Jack et al., 2003
), which estimated, for the same treatment effect, 21 subjects per treatment arm with a 2.1% standard deviation. The differences may be due to the fact that Jack et al. used various data adjustments that were not applied in our analysis, including normalization for total intracranial volume, adjustments for age and gender, and corrections for skewed data distributions. These adjustments might help reduce the reproducibility error, thereby reducing the required sample size.
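Sample size estimates of this kind typically come from the standard two-sample normal-approximation formula. The sketch below uses hypothetical inputs (a 2.45% effect with an assumed 4.3% standard deviation), not the exact values used in either study:

```python
import math
from statistics import NormalDist

def subjects_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Two-sided z-approximation for the number of subjects per group
    needed to detect a mean difference `effect` when both groups share a
    common standard deviation `sd`:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / effect^2."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    n = 2.0 * (z_alpha + z_power) ** 2 * sd ** 2 / effect ** 2
    return math.ceil(n)

# Hypothetical: a 2.45% group difference with a 4.3% standard deviation
n = subjects_per_arm(2.45, 4.3)
```

The quadratic dependence on sd/effect shows why adjustments that shrink the measurement SD (intracranial volume normalization, age and gender covariates) can sharply reduce the required sample size.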
In addition to their volume, subcortical structures have started to be characterized also by their 3D shapes (Munn, 2007
; Wang, 2007
; Patenaude, 2006
; Miller, 2004
). Combining both volume and shape metrics might improve the power to detect cross-sectional differences across populations or longitudinal changes. An important extension of the reproducibility study presented here would be to examine the reproducibility of shape metrics. An important limitation of this study is the lack of quantification of spatial differences in voxel labeling; that is, different voxels may be labeled as the same structure in two different sessions, and if the resulting volumes do not differ, our volume analyses would not reveal this potentially important source of variance.
Knowledge of the degree to which different MRI instrument-related factors affect the reliability of metrics that characterize subcortical structures is essential for the interpretation of these measures in basic and clinical neuroscience studies. Furthermore, the knowledge of reproducibility is critical if these metrics are to find applications as biomarkers in clinical trials of putative treatments for neurodegenerative or other neuropsychiatric diseases, particularly with the growth of large sample multi-center studies (Jack et al., 2003
; Mueller et al., 2005
; Murphy et al., 2006
; Belmonte et al., 2007).
To conclude, our results suggest that, for the purpose of designing morphometric longitudinal studies at a single site, one structural MPRAGE acquisition segmented with the corresponding MPRAGE atlas can be optimal. Subcortical volumes derived from T1-weighted structural imaging data acquired at a single 1.5T site are reliable measures that can be pooled even when there are differences in image acquisition sequence or major system upgrades. However, MRI-instrument-specific factors should be considered when combining data from different MRI systems (vendors and/or field strengths). It should be noted that we do not report a random-effects study; therefore, the results should not be extrapolated to pulse sequences or scanners not included in this study.