We wished to evaluate a template-based automated brain extraction method (MAPS) and a number of well-established automated brain extraction methods relative to a conventional semi-automated method that involves time-consuming manual editing. We applied the four automated brain extraction methods (MAPS, BET, BSE and HWA) to over 800 scans from the ADNI database. This set of images included scans with a range of anatomy and atrophy: from healthy elderly subjects with little atrophy to MCI and AD subjects with very significant atrophy.
All four methods showed reasonable overlap (Jaccard index) with the semi-automated ‘gold-standard’ segmentation. Among the four methods, MAPS had the highest median accuracy and the smallest variability in accuracy. Both MAPS and HWA had low false negative and false positive rates, meaning that they preserved nearly all the brain voxels while removing most of the non-brain voxels. MAPS removed more non-brain voxels than HWA and was less variable than HWA in terms of the CR of the false positive rate and false negative rate. Although the median accuracy of BET was higher than that of BSE, the variability in accuracy of BSE was lower than that of BET. Of note, in the direct comparison, ‘undilated MAPS-brains’ were found to be very accurate, with a median Jaccard index of 0.980 in 1.5T segmentations. This is close to the mean Jaccard index of two different segmentations produced by the same segmentor (0.988) and of segmentations performed by different segmentors (0.989). Furthermore, MAPS KN-BSI was in excellent agreement with semi-automated KN-BSI, and the small mean (SD) difference of 0.02% (0.08%) between them was less than the mean (SD) difference of 0.05% (0.47%) in BSI between same-day scan pairs reported by Boyes et al. (2006) in a different study.
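For readers who want to reproduce this kind of overlap comparison, the following is a minimal sketch of the three measures. The Jaccard index is the standard intersection over union; the false positive and false negative rate formulas used here (FP/(FP+TN) and FN/(FN+TP)) are assumed conventional definitions, and the function name `overlap_metrics` and the toy cubic masks are hypothetical. Masks are represented as sets of voxel coordinates so that only the Python standard library is needed.

```python
from itertools import product


def overlap_metrics(auto, gold, all_voxels):
    """Compare an automated mask with a gold-standard mask.

    All three arguments are sets of (x, y, z) voxel coordinates;
    `all_voxels` is the full field of view and must contain both masks.
    """
    tp = len(auto & gold)   # brain voxels kept by the automated method
    fp = len(auto - gold)   # non-brain voxels wrongly kept
    fn = len(gold - auto)   # brain voxels wrongly removed
    tn = len(all_voxels) - tp - fp - fn
    return {
        "jaccard": tp / (tp + fp + fn),          # |A and G| / |A or G|
        "false_positive_rate": fp / (fp + tn),   # assumed definition
        "false_negative_rate": fn / (fn + tp),   # assumed definition
    }


def cube(lo, hi):
    """Toy mask: all voxels with every coordinate in [lo, hi)."""
    return set(product(range(lo, hi), repeat=3))


volume = cube(0, 10)   # 10 x 10 x 10 field of view
gold = cube(2, 8)      # hypothetical gold-standard brain
auto = cube(3, 9)      # hypothetical automated brain, shifted by one voxel
metrics = overlap_metrics(auto, gold, volume)
```

Note that, under this definition, the false positive rate depends on the size of the field of view, whereas the Jaccard index and the false negative rate do not.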
We compared the four automated brain extraction methods qualitatively using the false positive and false negative projection maps. While the false positive projection maps appear quite similar, with added dura surrounding the cerebellum, the false negative projection maps show that different methods failed to include tissues in different locations, as represented by different ‘hot spots’. BET tended to exclude temporal and frontal lobe tissues (consistent with the findings of Shattuck et al. (2009)) as well as cerebellar tissue. Both BSE and HWA appeared to erroneously exclude cerebellar tissue. However, Shattuck et al. (2009) did not find that HWA excluded much cerebellar tissue, which was likely due to the difference in the range of morphology and characteristics of the brain images in the datasets. The results of the quantitative comparison between BET, BSE and HWA are similar to those reported by Fennema-Notestine et al. (2006), Shattuck et al. (2009) and Sadananthan et al. (2010), with HWA being better at preserving brain voxels than BET and BSE, and BET and BSE being better at removing non-brain voxels than HWA.
Although the effect of scanner field strength on the accuracy of MAPS and HWA was minimal, the effect on the robustness of HWA was large: the CR of the false negative rate in 3T segmentations was 39 percentage points higher than in 1.5T segmentations. The median Jaccard index and false negative rate of BET and BSE were better in 1.5T segmentations than in 3T segmentations. Although there was no evidence of a difference in the variability in the Jaccard index of BET and BSE between 1.5T and 3T segmentations, the CR of the false negative rate of BSE in 3T segmentations was 40 percentage points higher than in 1.5T segmentations. Sadananthan et al. (2010) also found that the performance of the methods differed between their 1.5T and 3T datasets.
Despite the efforts put into trying to ensure that the characteristics of MR images in the ADNI dataset were similar across different scanner manufacturers and field strengths, there are inevitably significant differences and it is interesting that field strength significantly affected the accuracy and robustness of the automated brain extraction methods.
The effect of the diagnostic groups on the automated brain extraction methods was complicated. The accuracy of MAPS was similar in all the groups; however, MAPS produced slightly less robust results in controls. This is likely due to the 2-voxel dilation performed at the end of the processing, as the dilated brain region in controls is more likely to include non-brain tissues (e.g. dura) than in MCI or AD subjects. BET produced more accurate results in controls, with a higher median Jaccard index and a lower median false negative rate. On the other hand, there was little suggestion of the robustness of BET being different across diagnostic groups, except that at 3T the segmentations of AD subjects were more robust than those of control and MCI subjects. Although there was no evidence of a difference in the accuracy of BSE between diagnostic groups, it was surprising that the robustness of BSE was significantly better in MCI subjects in 1.5T segmentations. The accuracy of HWA in all the diagnostic groups was similar. Although there was no evidence of a difference in the robustness of HWA between diagnostic groups, the CR of the false positive rate of controls tended to be smaller than that of AD and MCI subjects.
Although we did not find any significant difference in the median Jaccard index of BSE and HWA between diagnostic groups, we found that BET produced significantly more accurate results in controls than in MCI and AD subjects in both 1.5T and 3T scans. This was similar to the finding of Fennema-Notestine et al. (2006) that the average Jaccard index of BET was higher in young normal controls than in AD subjects.
We previously found that STAPLE was the best method to combine multiple hippocampal segmentations in terms of the Jaccard index (Leung et al., 2010a). However, we found shape based averaging (SBA) to be better for whole brain segmentations. The best label fusion method is likely to be problem specific, consistent with the findings of Artaechevarria et al. (2009) that, depending on the characteristics of the images and regions, globally or locally weighted voting produced substantially better results than simple majority voting. It is interesting to note that the chosen parameters give similar results in the small subset and in our whole dataset, meaning that the 10 randomly chosen 1.5T images provided a good sample for parameter selection in MAPS. Given the excellent results in the 3T scans and the scans from SVE, the chosen parameters may also be suitable for scans acquired using different MR sequences and scanners - this potential generalisability (based on the range of anatomy included in the template library) is a possible advantage over those methods that require parameter selection based on a subset of scans. The oscillation in the accuracy of SBA may appear concerning in terms of performance; however, it is due to the discreteness of the 50% trimmed mean: the 50% trimmed mean discards equal or unequal numbers of segmentations from either side depending on the number of segmentations.
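The discreteness of the trimmed mean can be illustrated with a short sketch. It assumes one possible convention: half of the values (rounded down) are discarded in total, and when that total is odd the extra value is removed from the upper end. The function name `trim_counts` is hypothetical, and the exact rule used in the SBA implementation is not specified here, so this is only an illustration of why the split alternates as the number of segmentations grows.

```python
def trim_counts(n, proportion=0.5):
    """Return (low, high): how many of n sorted values are dropped per side.

    Assumed convention: drop int(n * proportion) values in total, split as
    evenly as possible, with any odd remainder taken from the upper end.
    """
    total = int(n * proportion)
    low = total // 2
    high = total - low
    return low, high


# Tabulate the split for a range of segmentation counts.
splits = {n: trim_counts(n) for n in range(4, 10)}
for n, (low, high) in splits.items():
    kind = "equal" if low == high else "unequal"
    print(f"{n} segmentations: drop {low} low, {high} high ({kind})")
```

Under this convention the split alternates between equal and unequal every two additional segmentations, which is the kind of behaviour that could produce an oscillating accuracy curve.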
For large studies and clinical trials, it is more important to minimise the human interaction time and expertise required to correct any sub-optimal segmentation (e.g. parameter fine-tuning or manual editing) than to minimise the computation time of the algorithm. Although the computation time of MAPS is much longer than that of BET, BSE and HWA, the robustness of MAPS was substantially higher than that of the other methods. Furthermore, the processing time of MAPS can be improved by (1) running the software on a computer cluster, (2) using fewer atlases in a trade-off between accuracy and computation time, or (3) running the non-rigid registration on a graphical processing unit (GPU) (Modat et al., 2010).
One of the strengths of this study is the large number of images of AD, MCI and control subjects acquired from scanners of different field strengths and manufacturers at multiple sites. To the best of our knowledge, this is the largest comparison of automated brain extraction methods in the literature. Another strength of this study is that all the data and software will be openly available to the public on the world wide web. All the scans can be downloaded from the ADNI website (http://www.adni-info.org). The semi-automated brain segmentations will be available on the ADNI website. BET, BSE and HWA are all available on the web (see Section 2). The registration and label fusion software used in MAPS can be downloaded at http://sourceforge.net/projects/niftyreg/. We will make all the MAPS brain regions available on-line at the ADNI website (http://adni.loni.ucla.edu/).
One of the limitations of this study is the lack of ground-truth whole brain segmentations in the method comparison. Instead, we used semi-automated segmentations which were then manually edited by trained expert segmentors. The segmentors followed a pre-defined segmentation protocol to ensure low intra- and inter-rater variability. Another limitation is that the amount of brain stem labelled as brain may not be consistent between the semi-automated and automated segmentations. Although the thresholding was designed to remove CSF from the automated segmentations to allow the comparison with semi-automated segmentations, it may remove some grey matter from the brains and lose some important information at the boundary of the brain. We also did not try to use other label fusion algorithms in MAPS (apart from vote, SBA and STAPLE), such as a local weighted voting method (Artaechevarria et al., 2009) or a selective and iterative method (Langerak et al., 2010). In addition, although we examined most of the parameters in BET, BSE and HWA using a subset of scans from our dataset, an expert user may be able to fine-tune other parameters or use a different subset to produce better results.
Although all the MAPS experiments were carried out in a leave-one-out fashion, MAPS may have an advantage over the other methods in the comparison because the definition of a brain region in the MAPS segmentations is likely to be more consistent with the semi-automated segmentations. Part of our motivation for developing and assessing MAPS was to replace the semi-automated segmentation - there is therefore some potential intrinsic advantage to MAPS (relative to BET, BSE and HWA). As such, we must be cautious about the conclusions. Nonetheless, the advantage is arguably minimal because of the following:
- The post-hoc analysis (Section 3.7) showed that MAPS performed well both in terms of accuracy and variability in accuracy on a different and independent dataset with gold-standard brain masks delineated using a different manual segmentation protocol (SVE). The comparison using SVE is not only independent but also involves a wide range of algorithms with parameters that have been fine-tuned either by the developers or Shattuck et al. (2009). Currently, SVE contains 118 sets of results from several algorithms (e.g. VBM8, BSE and brainwash2). We found that the evaluations using our semi-automated brain segmentations and the independent gold-standard segmentations from SVE are consistent with each other;
- The final step in MAPS involved a 2-voxel unconditional dilation. Although this step was designed to recover missing brain tissues, it also substantially reduces the similarity between the MAPS segmentations and the gold-standard segmentations. For example, using a randomly chosen brain segmentation in our template library, a 2-voxel dilation reduces the Jaccard index from 1 to 0.741;
- There is a substantial amount of manual intervention in the semi-automated segmentation, which includes the selection of the initial intensity thresholds and the editing of brain/non-brain tissues during various stages of the semi-automated segmentation;
- In order to reduce the influence of the amount of CSF included in the automated brain segmentations in the comparison, the Jaccard index and the false positive rate were calculated using thresholded brain segmentations as in Sadananthan et al. (2010) and Boesen et al. (2004). The threshold values were given by 60% of the mean brain intensity of the gold-standard segmentation. This thresholding step ensures consistent cut-off points at the CSF/GM interface in all the automated segmentations;
- The false positive rate and false negative rate maps of MAPS show errors near the inferior brain stem. This suggests that there is still inconsistency between the MAPS brain segmentations and gold-standard segmentations.
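The effect of an unconditional dilation on overlap, mentioned in the list above, can be illustrated with a toy example. The sketch below dilates a synthetic spherical mask by two voxels and computes the Jaccard index between the original and the dilated mask. The 6-connected structuring element, the spherical shape, its radius and the function names are all assumptions made for illustration; the structuring element used in MAPS is not stated here, and the resulting value should not be expected to match the figure reported for a real brain.

```python
from itertools import product

# Face-connected (6-neighbour) offsets; an assumed structuring element.
NEIGHBOURS_6 = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dilate(mask, iterations=1):
    """Unconditionally dilate a set of (x, y, z) voxels, one voxel per pass."""
    out = set(mask)
    for _ in range(iterations):
        out |= {(x + dx, y + dy, z + dz)
                for (x, y, z) in out
                for (dx, dy, dz) in NEIGHBOURS_6}
    return out


def jaccard(a, b):
    """Intersection over union of two voxel sets."""
    return len(a & b) / len(a | b)


def ball(radius):
    """Toy 'brain': all voxels within `radius` of the origin."""
    r = range(-radius, radius + 1)
    return {(x, y, z) for x, y, z in product(r, repeat=3)
            if x * x + y * y + z * z <= radius * radius}


brain = ball(20)
dilated = dilate(brain, iterations=2)
score = jaccard(brain, dilated)
print(f"Jaccard index after 2-voxel dilation: {score:.3f}")
```

Because the dilated mask contains the original, the Jaccard index here is simply the ratio of the two volumes, so even a thin added shell lowers it noticeably for an otherwise identical segmentation.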
The outputs of different brain extraction algorithms include different amounts of internal ventricular and external sulcal CSF. Therefore, we chose to use a consistent threshold to exclude low intensity voxels from all the brain segmentations, as suggested by Boesen et al. (2004) and Sadananthan et al. (2010), to try to compare different algorithms in as unbiased a manner as possible. However, we acknowledge that brain extraction is rarely used in isolation and that, depending on the subsequent processing steps and the ultimate outcome measure being assessed, the quality of segmentation and any errors included may or may not be important. The requirement for accuracy in brain extraction therefore varies with different uses of the masks. We also acknowledge that each of the other methods might well be fine-tuned to particular scan types and applications. Although we showed that the semi-automated KN-BSI and MAPS KN-BSI were very similar, future work should examine the suitability of a particular brain extraction method for the specific processing pipeline or application for which it is to be used.
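The consistent thresholding step can be sketched as follows, using a cut-off of 60% of the mean intensity inside the gold-standard mask. The function name `threshold_mask`, the dictionary-based image representation, the use of `>=` at the cut-off and the toy one-dimensional intensities are assumptions made for illustration only.

```python
def threshold_mask(mask, intensity, gold, fraction=0.6):
    """Drop low-intensity voxels (e.g. CSF) from a brain mask.

    mask, gold : sets of voxel identifiers
    intensity  : dict mapping each voxel identifier to its image intensity
    The cut-off is `fraction` times the mean intensity inside `gold`.
    """
    mean_gold = sum(intensity[v] for v in gold) / len(gold)
    cutoff = fraction * mean_gold
    return {v for v in mask if intensity[v] >= cutoff}


# Toy 1D "image": voxel 1 is dark (CSF-like), voxel 7 is bright tissue.
intensity = {0: 5, 1: 20, 2: 100, 3: 100, 4: 100,
             5: 100, 6: 100, 7: 90, 8: 10, 9: 5}
gold = {2, 3, 4, 5, 6}            # hypothetical gold-standard brain
auto = {1, 2, 3, 4, 5, 6, 7}      # hypothetical automated brain

auto_thresholded = threshold_mask(auto, intensity, gold)
gold_thresholded = threshold_mask(gold, intensity, gold)
```

Applying the same cut-off to every automated mask means that differences in how much CSF each method retains do not dominate the overlap measures, at the cost of possibly discarding some darker grey matter at the brain boundary.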
In conclusion, our results suggest that a template library approach (MAPS) is a relatively accurate and robust method of automated brain extraction. MAPS was similar to HWA in its ability to preserve brain tissues, but removed significantly more non-brain tissues than HWA. MAPS was also shown to be more robust than HWA. We suggest that fully automated brain extraction methods now approach the accuracy and reliability of time-consuming manual techniques and may be particularly valuable in large scale studies. Ultimately, the development and evaluation of accurate and robust brain segmentation methods that are able to equal or outperform more labour-intensive manual segmentation procedures will facilitate more efficient research.