Based on a training sample of 15 subjects, we found that the best method for generating a baseline hippocampal volume with our template library utilised non-linear registration (FFD) together with intensity thresholding, combining the eight best-matched segmentations using STAPLE, to which a Markov random field filter with a weighting of 0.2 was applied. This generated volumes whose means and SDs were similar to those produced using manual segmentation. The largest difference was in the AD group, where automated volumes were on average 81 mm3 lower than manual volumes, which corresponds to about 4% of the manual AD hippocampal volume. Overall, the mean difference between our automated volumes and the manual measurements was 27 mm3, or just over 1% of the mean of all volumes. The automated regions also agreed well (average JI = 0.81) with the manual regions delineated by a second expert segmentor, S2, who did not generate the manual regions in the template library. Using an HBSI measure on these regions, we were able to produce a rate of atrophy similar to manual atrophy rates, with the largest difference in means being in the AD group (5.4% manual vs. 6.5% automated) and the overall mean difference in rates being just under 0.5% per year.
We found that the accuracy of MAPS on unseen data was very high, having achieved a mean (SD) JI of 0.80 (0.04) when comparing the automated baseline hippocampal regions with manual regions delineated by the expert segmentor S1 from a set of 30 subjects (10 AD, 10 MCI and 10 controls). The SD of the automated volumes was similar to the manual volumes in all three groups. The means of automated volumes were smaller than manual volumes in all three groups, with the largest difference being in the AD group, with automated volumes on average being lower than manual by 182 mm3 which corresponds to about 10% of the manual AD hippocampal volume. Overall, the mean (SD) of differences in the manual and automated hippocampal volumes was 101 mm3 (144 mm3) (or 4.4% of mean volume) with manual > automated.
Application of MAPS and MAPS-HBSI to a large dataset showed the expected pattern of hippocampal volumes (AD<MCI<controls) and atrophy rates (AD>MCI>controls). Further to this, we found differences across MCI subgroups based on follow-up diagnosis determined up to 36 months from baseline, with hippocampal volumes statistically significantly lower in those subjects who progressed to a diagnosis of dementia at some point in the study compared to those who remained stable or “reverted” to normal. Atrophy rates from MAPS-HBSI also showed the expected pattern (MCI reverters < MCI stable < MCI converters), with converters having a hippocampal atrophy rate that was statistically significantly higher than the other groups and twice as high as the reverters. Although MCI reverters had higher mean volumes and lower rates than the MCI stable group these differences did not reach statistical significance, which is likely due to the small size of the reverter group (n=8).
The comparison of our volumes and indirect atrophy rates with those calculated using SNT (previously published by Schuff et al. (2009) and treated as the gold standard in the work of Wolz et al. (2010) and Lötjönen et al. (2010)) revealed that there was a marked difference in volumes, with MAPS having larger volumes than SNT. The difference in absolute volume is not surprising given that the conventions of anatomical boundaries of the hippocampus differ slightly between MAPS and SNT (Konrad et al., 2009), e.g. the alveus was included in our hippocampal regions while it was not included in regions from SNT. The mean indirect atrophy rate using MAPS was 0.95 percentage points higher than the mean using SNT in the MCI group and 1.55 percentage points lower than the mean using SNT in the AD group; however, the MAPS indirect atrophy rates had lower SDs in all three groups. The estimated sample size calculated based on MCI atrophy rates using MAPS was 285, which was 71% smaller than using SNT (981).
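The relative reduction quoted above follows directly from the two sample sizes, and the influence of the SD on sample size can be illustrated with the standard two-arm formula. The following is a minimal sketch and not the calculation used in this study: the 80% power, 5% two-sided significance level, 25% treatment effect and the function names are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def percent_reduction(n_new, n_ref):
    """Percentage by which n_new is smaller than n_ref."""
    return 100.0 * (n_ref - n_new) / n_ref


def sample_size_per_arm(mean_rate, sd_rate, effect=0.25, alpha=0.05, power=0.80):
    """Standard two-arm sample size for detecting a fractional reduction
    `effect` in the mean atrophy rate (all default values are assumptions)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    delta = effect * mean_rate                     # absolute difference to detect
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd_rate ** 2 / delta ** 2)


# Sample sizes reported above for MCI: MAPS (285) vs. SNT (981)
print(round(percent_reduction(285, 981)))  # 71
```

Because the required sample size scales with the square of the SD, a method with lower variability in atrophy rates needs substantially fewer subjects even when its mean rate is similar.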
Our final investigation, a comparison of the MAPS indirect atrophy rate and MAPS-HBSI, showed that the mean MAPS-HBSI was 0.88 percentage points lower than the MAPS indirect atrophy rate in the MCI group. More importantly, MAPS-HBSI had markedly lower SDs in all three groups. Possible reasons for the lower SD in rates include the use of a registration-based method to detect boundary shift of the hippocampus (HBSI) compared with segmentation of the hippocampus at the two time-points, as has been previously demonstrated in whole brain analysis (Frost et al., 2004). We have also shown previously that the use of HBSI with a manual baseline region results in lower SDs of rates in control groups compared with rates generated from our “gold standard” manual segmentation of baseline and repeat scans (Barnes et al., 2004; Barnes et al., 2007). Consequently, the estimated sample size calculated based on AD subjects alone using MAPS-HBSI was 68, which was 60% smaller than using MAPS indirect atrophy rates. There was evidence of correlation between MAPS-HBSI and 1-year change in AVLT score in MCI subjects, whereas no evidence of correlation was found between the 1-year change in AVLT score and MAPS (or SNT) indirect atrophy rates.
Our technical findings relate well to those of other research groups. We found non-linear registration to perform better than linear registration, as has been previously shown (Woods et al., 1998b). A combination of labels has also been shown to be useful (Rohlfing et al., 2004; Warfield et al., 2004), so our finding of overall improvement by including those best matches was expected. Overall, the different methods of combinations of labels (vote rule, SBA and STAPLE) produced similarly good results. It is interesting that the results of vote rule and STAPLE were more similar than those of SBA. STAPLE had slightly better accuracy than vote rule, which is consistent with a previous publication (Rohlfing et al., 2004). SBA had slightly lower accuracy than vote rule. This was consistent with previous results showing that the vote rule had slightly better accuracy when combining more than five segmentations (see Figure 7a of Rohlfing and Maurer, Jr., 2007).
The optimal numbers of segmentations for vote rule, SBA and STAPLE were 29, 29 and 8, respectively. Segmentation accuracy first increases and then approaches a plateau as more segmentations are combined. A similar plateau effect in the range of 20–30 segmentations was reported by Aljabar et al. (2009) and Collins et al. (2009) when fusing hippocampal segmentations using the vote rule. It should also be noted that the less accurate (i.e. lower rank) segmentations may introduce bias into the combined segmentation in all three methods if the segmentation errors are not randomly distributed. Aljabar et al. (2009) showed a gradual decrease in accuracy for hippocampal segmentation when more than about 30 ranked atlases are combined.
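To make the fusion step concrete, the vote rule applied to the best-ranked segmentations can be sketched as follows. This is an illustrative sketch only, not the implementation used in this study: it assumes binary segmentations flattened to equal-length lists in a common space, already sorted from best to worst match, and it resolves ties in favour of background; the function names are hypothetical.

```python
def vote_rule(segmentations):
    """Majority vote over binary label lists of equal length.
    A voxel is labelled 1 only if more than half of the inputs label it 1
    (ties go to background, which is an assumption of this sketch)."""
    n = len(segmentations)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*segmentations)]


def fuse_top_ranked(ranked_segmentations, n_best):
    """Fuse only the n_best highest-ranked segmentations (best match first)."""
    return vote_rule(ranked_segmentations[:n_best])


# Three toy segmentations of four voxels, ordered from best to worst match
ranked = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
print(fuse_top_ranked(ranked, 3))  # [1, 1, 0, 0]
```

Varying `n_best` in such a scheme is what produces the plateau behaviour described above: adding well-matched segmentations averages out random errors, whereas adding poorly matched ones can introduce systematic bias.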
Overall, our technique is most similar to that reported by Aljabar et al. (2009). It differs in the following ways: (1) Aljabar et al. used vote rule to fuse the segmentations, whereas we used STAPLE with MRF to combine the segmentations. Furthermore, in our data STAPLE with MRF performed slightly better than vote rule when used to fuse the segmentations from our technique; (2) Aljabar et al. ranked the atlases after nonrigidly registering all the images to the Montreal Neurological Institute (MNI) BrainWeb single subject simulated T1-weighted MR image, whereas we ranked the atlases after affinely registering all the images to a single control subject in the template library. Note that Klein et al. (2008) mentioned the need to initialise the STAPLE algorithm using a probabilistic segmentation; in the STAPLE implementation given to us by Warfield, this is performed internally by averaging the input segmentations to provide a global prior.
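The atlas ranking step can be illustrated with a short sketch of template selection by cross-correlation. It is a simplified illustration rather than the code used in this study: it assumes the target and template images have already been registered to a common space and flattened to equal-length intensity lists, and the function names are hypothetical.

```python
import math


def cross_correlation(a, b):
    """Normalised cross-correlation between two equal-length intensity lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def rank_templates(target, templates):
    """Return template indices ordered from best to worst match to the target."""
    scores = [cross_correlation(target, t) for t in templates]
    return sorted(range(len(templates)), key=lambda i: scores[i], reverse=True)


target = [1, 2, 3, 4]
templates = [[4, 3, 2, 1], [1, 2, 3, 5], [2, 4, 6, 8]]
print(rank_templates(target, templates))  # [2, 1, 0]
```

In practice the similarity would be evaluated over a region of interest around the hippocampus rather than the whole image, and the top-ranked templates would then be passed to the label fusion step.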
We have obtained one of the best accuracies reported to date for automated hippocampal segmentation when compared with gold standard manual segmentations from a set of 30 randomly chosen subjects (10 AD, 10 MCI and 10 controls) from ADNI. Expressing our JI (of 0.80 from the independent test data) as a Dice score equates to 0.89, with the previous highest Dice scores (N = number of hippocampi in the study) being 0.81 (N=100) (Pohl et al., 2007), 0.83 (N=550) (Aljabar et al., 2009), 0.83 (N=60) (Heckemann et al., 2006), 0.86 (N=54) (Barnes et al., 2008a), 0.86 (N=40) (Morra et al., 2008), 0.86 (N=14) (Fischl et al., 2002), 0.85 (N=30) (Powell et al., 2008), 0.86 (N=40) (van der Lijn et al., 2008), 0.87 (N=30) (Chupin et al., 2008), 0.88 (N=5) (Gousias et al., 2008) (from a cohort of 2-year-old children), 0.85 (N=364) (Wolz et al., 2010), 0.89 (N=120) (Lötjönen et al., 2010) and 0.89 (N=160) (Collins and Pruessner, 2009). Note that our inter- and intra-rater JI values correspond to Dice scores of 0.93 and 0.96 respectively. Comparing these to the results from using our automatic method with different training and test data (0.89) or with the same training data segmented by a different rater (0.90) or the same rater (0.91) suggests that the method has not been over-trained, and that there is potential to improve it further, ideally to approach the upper bounds of inter- or intra-rater agreement.
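The conversion between JI and Dice score used in the comparison above follows the standard identity Dice = 2 JI / (1 + JI). A minimal sketch of the conversion in both directions is given below; the function names are ours and are used for illustration only.

```python
def jaccard_to_dice(ji):
    """Convert a Jaccard index to the equivalent Dice score."""
    return 2 * ji / (1 + ji)


def dice_to_jaccard(dice):
    """Convert a Dice score to the equivalent Jaccard index."""
    return dice / (2 - dice)


# JI of 0.80 from the independent test data
print(round(jaccard_to_dice(0.80), 2))  # 0.89
```

Because the Dice score is always greater than or equal to the JI for the same pair of regions, overlap values from different studies should be converted to a common measure before being compared.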
Hippocampal atrophy rates using automated baseline regions and HBSI of 4.4% per year for AD (mean age = 75) and 1.1% per year for controls (mean age = 76) are similar to those reported in the literature. A recent meta-analysis of studies prior to ADNI estimated rates to be approximately 4.7% per year in AD subjects (mean age = 73) and 1.4% per year in healthy controls (mean age = 78) (Barnes et al., 2009). Mean hippocampal atrophy rates in ADNI have been reported to be 0.8% per year in controls (mean age = 76) and 4.4% per year in AD (mean age = 76) (Schuff et al., 2009). MCI atrophy rates have been shown to be between those of AD and controls, with those progressing to a diagnosis of dementia having higher rates than those remaining stable: median rates being 4.3% vs. 3.0% per year (mean age = 71 in MCI) (Henneman et al., 2009), 3.3% vs. 1.8% per year (mean age = 77 in converters and 76 in stable subjects) (Jack et al., 2004).
The strengths of this study include the large and multi-site nature of data collection (though training and testing subsets were relatively small) and the availability of follow-up on all subjects, enabling assessment of clinical change from baseline. A notable difference between the outcomes in ADNI and previous studies relates to MCI conversion to AD. One recent meta-analysis showed that in studies adhering to the Mayo clinic definition, allowing for dementia type, the conversion rate from MCI to AD was 8.1% per year (95% CI = 6.3–10.0%) (Mitchell and Shiri-Feshki, 2009). However, the higher rate of conversion in ADNI (~16% per year) is likely due to the stringent criteria used to recruit subjects with MCI, meaning they were likely to be further down the clinical spectrum (i.e. closer to an AD diagnosis) compared with other studies (Petersen et al., 2009).
We acknowledge a number of limitations to this study. We only performed inter-subject registration using ratio image uniformity as the cost function rather than evaluating several possible cost functions. Also, we used the cross-correlation to choose the best-matched images from the template library rather than evaluating other image similarity measures, since cross-correlation has been shown to be a good measure for the hippocampus (Aljabar et al., 2009). Although there is longitudinal follow-up on all subjects, there is no pathological confirmation of disease which would provide diagnostic certainty. However, this is the setting in which clinical trials must be conducted. Not only did hippocampal atrophy rates differ between AD and the other groups, but MCI subjects who progressed clinically also had higher rates than those who did not.
In general, the use of a template library in a multi-atlas method depends on a number of factors, such as the quality of template images (e.g. signal-to-noise ratio and contrast-to-noise ratio), anatomical differences between the subjects in the template library and target group, and the manual segmentation protocol. Although the template library used in this study was from a different cohort and included a spectrum of hippocampal volumes (both AD and controls), it did not contain subjects with MCI. However, we saw no evidence that the MCI subjects were more poorly segmented than the control or AD groups, as they had similar JIs compared with control and AD groups. This would be expected given the overlap in hippocampal volumes and morphology across the control-MCI-AD spectrum. Finally, we did not assess whether our algorithm differed in ability to segment hippocampi or detect change over time according to imaging site or field strength. However, one previous study reported no evidence of differing variability across sites (Schuff et al., 2009) and our estimation of HBSI parameters was performed on a subject-by-subject basis, which allows for some differences across the scanning sites.
In addition, the number of MCI reverters in this study was low (only 8 subjects) compared with the number of MCI stable subjects and converters. The MCI reverters were possibly subjects with small test-retest fluctuations in performance or genuine changes in cognitive ability. However, they met criteria for MCI at baseline and did not meet them thereafter. One would hypothesise that these subjects would have larger hippocampal volumes and lower atrophy rates, and with MAPS and MAPS-HBSI we found this to be so (although the differences were not significant when reverters and stable MCI subjects were compared).
We conclude that MAPS has a high level of accuracy for segmentation of the hippocampus and is robust to multi-site data. Our automatically obtained regions can be used to measure hippocampal volume change over time using boundary shift measures (MAPS-HBSI). These methods show expected patterns of volume difference (AD < MCI < control) and atrophy rate (AD > MCI > control), and show differences in volume and rate in MCI groups according to clinical follow-up, with MCI converters < MCI stable < MCI reverters for volumes, and MCI converters > MCI stable > MCI reverters for atrophy rates. MAPS and MAPS-HBSI may be useful in large-scale multi-centre trials to assess both baseline characteristics and disease progression.