This study assesses the performance of public-domain automated methodologies for MRI-based segmentation of the hippocampus in elderly subjects with Alzheimer’s Disease (AD) and mild cognitive impairment (MCI). Structural MR images of 54 age- and gender-matched healthy elderly individuals, subjects with probable AD, and subjects with MCI were collected at the University of Pittsburgh Alzheimer’s Disease Research Center. Hippocampi in subject images were automatically segmented by using AIR, SPM, FLIRT, and the fully-deformable method of Chen to align the images to the Harvard atlas, MNI atlas, and randomly-selected, manually-labeled subject images (“cohort atlases”). Mixed-effects statistical models analyzed the effects of side of the brain, disease state, registration method, choice of atlas, and manual tracing protocol on the spatial overlap between automated segmentations and expert manual segmentations. Registration methods that produced higher degrees of geometric deformation produced automated segmentations with higher agreement with manual segmentations. Side of the brain, presence of AD, choice of reference image, and manual tracing protocol were also significant factors contributing to automated segmentation performance. Fully-automated techniques can be competitive with human raters on this difficult segmentation task, but a rigorous statistical analysis shows that a variety of methodological factors must be carefully considered to ensure that automated methods perform well in practice. The use of fully-deformable registration methods, cohort atlases, and user-defined manual tracings is recommended for highest performance in fully-automated hippocampus segmentation.
Hippocampal atrophy has been proposed as a clinical marker for early AD because it is known to occur early in the course of the disease on a spatial scale large enough to be detectable with structural MR images. Visual, qualitative atrophy assessment has been hindered by the relative subtlety of atrophy early in the course of AD. However, the development of reliable, repeatable protocols for human raters to trace the hippocampus has led to the possibility of precise quantitation of AD-related atrophy, hippocampus-level quantification of activation in co-registered structural-functional images, and quantification of other hippocampal characteristics such as bilateral symmetry. Furthermore, tracing protocols have enabled the study of hippocampal morphometrics in subjects with mild cognitive impairment (MCI), a high-AD-risk clinical condition marked by minor deficits in one or more cognitive domains.
However, large-scale studies of AD-related hippocampal atrophy are often impractical because manual segmentations are labor-intensive and require training to ensure high repeatability between raters. Typical hippocampi take between 30 minutes and 2 hours to trace by hand; the tedious labor quickly causes fatigue. Semi-automated segmentation methods reduce manual labor by having the user identify a sparse set of image landmarks that constrain a subsequent automated segmentation process. However, we focus on fully-automated atlas-based techniques to eliminate the need for a user to manually process each image under study, and to eliminate the landmark-identification process as a source of variability between segmentations of the same image.
Atlas-based segmentation coregisters a subject image and a special reference image called the atlas image on which structures of interest have been manually traced (See Figure 1). The resulting spatial transformation maps the coordinates of the structures from the coordinate space of the atlas image to that of the subject image. Since this approach is posed in terms of image-to-image registration, atlas-based techniques take advantage of methodological advances in registration that are driven by a wide range of application areas such as visualization, image-guided surgery, and voxel-based morphometry. Furthermore, atlas-based approaches are among the easiest to implement since they only require the user to align the atlas and subject images.
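The label-transfer step can be illustrated with a minimal sketch. It assumes the registration has already been estimated and is available as a function from subject coordinates back to atlas coordinates (the inverse of the atlas-to-subject mapping described above); the function name `transfer_labels`, the set-of-voxels representation, and the nearest-neighbour lookup are illustrative choices, not part of any of the packages evaluated here.

```python
def transfer_labels(atlas_labels, subject_shape, subject_to_atlas):
    """Carry an atlas tracing into subject space.

    atlas_labels     : set of (x, y, z) atlas voxels traced as hippocampus
    subject_shape    : (nx, ny, nz) size of the subject image grid
    subject_to_atlas : hypothetical callable mapping a subject voxel to
                       (possibly fractional) atlas coordinates
    """
    nx, ny, nz = subject_shape
    subject_labels = set()
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                ax, ay, az = subject_to_atlas((x, y, z))
                # Nearest-neighbour lookup keeps the result a binary mask.
                nearest = (round(ax), round(ay), round(az))
                if nearest in atlas_labels:
                    subject_labels.add((x, y, z))
    return subject_labels


# Toy example: the subject is the atlas shifted by one voxel along x.
atlas_tracing = {(1, 1, 1), (2, 1, 1)}
shift_back = lambda p: (p[0] - 1, p[1], p[2])
automated = transfer_labels(atlas_tracing, (5, 3, 3), shift_back)
```

Looping over subject voxels and looking up the atlas label avoids holes in the propagated mask, which is one reason resampling is commonly done in this "pull" direction.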
The purpose of this study was to systematically compare the performance of several competing public-domain methodologies for atlas-based segmentation of AD-atrophied hippocampi. We validated several widely-disseminated automated image registration methods; in contrast, previous studies on atlas-based elderly hippocampus segmentation used a single, recently-developed, cutting-edge registration algorithm that lacked a widely-disseminated, standard software implementation. Furthermore, we examined the use of two widely-disseminated atlas images, as well as individual, manually-traced subject images, to serve as the reference, or cohort atlas image. Finally, we examined the impact of varying manual tracing protocols on atlas-based segmentation performance.
We gathered MR images of 20, 19, and 15 subjects in the AD, MCI, and control populations respectively. All subjects were enrolled in the University of Pittsburgh Alzheimer’s Disease Research Center between 1999 and 2004 and given a structural MR scan at time of enrollment. The spoiled gradient-recalled (SPGR) volumetric T1-weighted pulse sequence, acquired in the coronal plane, had the following parameters optimized for maximal contrast among gray matter, white matter, and CSF (TE = 5, TR = 25, flip angle = 40 degrees, NEX = 1, slice thickness = 1.5 mm/0 mm interslice). Along with the MR scan, subjects received a comprehensive battery of neuropsychological and clinical tests at time of enrollment and at yearly follow-up visits (see  for evaluation procedure). A consensus meeting of neuroradiologists, psychiatrists, neurologists, and psychologists assigned each subject to the MCI, AD, or control category.
We compared the performance of software modules in AIR, SPM, FLIRT, and Chen’s method as registration substrates for atlas-based segmentation of elderly hippocampi. While several algorithmic details vary between these registration techniques, they are chiefly distinguished from each other in terms of their geometric transformation model, that is, the mathematical equation that maps image coordinates between the atlas image and subject image. We partitioned the geometric transformation models into three categories in terms of the degree to which they allow the atlas image to spatially deform when it is aligned to the subject image. Affine methods apply the same linear transformation to all voxels in the entire atlas image; semi-deformable mappings deform the atlas image in a spatially smooth, gradual way to align it to the subject image; and fully-deformable methods produce image-to-image mappings that are essentially unconstrained spatially (see Figure 2 for an illustration and  for a detailed mathematical explanation). We considered one fully-deformable, three semi-deformable, and three affine registration methods. The AIR affine, SPM affine, and FLIRT affine methods estimate an affine transformation between images. The AIR semi-deformable method uses the transformation output by the AIR affine method as a starting point for estimation of a spatially-smooth deformation based on a polynomial transformation model; the SPM semi-deformable method uses the transformation output by the SPM affine method as the starting point for estimation of a smooth deformation based on a discrete cosine transform (DCT) transformation model; the Chen semi-deformable method estimates a piecewise-linear transformation. Finally, the Chen fully-deformable method takes the output of the Chen semi-deformable method as a starting point for estimation of an unconstrained, voxel-by-voxel deformation.
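To make the three categories concrete, the following toy sketch expresses each as a coordinate-mapping function. These are illustrative stand-ins only: `affine_map`, `smooth_map`, and `voxelwise_map` are hypothetical names, the single cosine term merely evokes a low-frequency basis function, and none of this reproduces the parameterizations actually used by AIR, SPM, FLIRT, or Chen's method.

```python
import math


def affine_map(matrix, offset):
    """Affine: one linear transformation shared by every voxel."""
    def mapping(p):
        return tuple(
            sum(matrix[r][c] * p[c] for c in range(3)) + offset[r]
            for r in range(3)
        )
    return mapping


def smooth_map(amplitude, extent):
    """Semi-deformable (toy): displacement varies slowly across the image."""
    def mapping(p):
        x, y, z = p
        dx = amplitude * math.cos(math.pi * x / extent)
        return (x + dx, y, z)
    return mapping


def voxelwise_map(displacements):
    """Fully-deformable (toy): each voxel carries its own displacement."""
    def mapping(p):
        dx, dy, dz = displacements.get(p, (0.0, 0.0, 0.0))
        return (p[0] + dx, p[1] + dy, p[2] + dz)
    return mapping


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shift_x = affine_map(identity, (1, 0, 0))
wobble = smooth_map(2.0, 10)
free = voxelwise_map({(1, 1, 1): (0.5, 0.0, 0.0)})
```

The number of free parameters grows from twelve for the affine map, to a handful of basis coefficients for the smooth map, to three per voxel for the voxel-wise map, which is the sense in which the categories differ in "deformability".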
We evaluated automated segmentations by comparing them to manual segmentations performed by a single expert rater, R1, who was blind to diagnosis, gender, age, and other clinical data at the time of tracing. Hippocampi were traced on contiguous coronal slices following the guidelines of Watson et al. , Schuff et al. , and Pantel et al. . The traced structure included the hippocampus proper, the subiculum, and the dentate gyrus. The image and tracing were viewed in all three orthogonal viewing planes during manual segmentation. Rater R1 traced hippocampi on all 54 subject images; additionally, we selected 2 AD, 2 MCI, and 2 control images from the pool of 54 subjects for tracing by two additional trained raters, R2 and R3, using the same protocol. These additional manual segmentations were used to compare automated-manual segmentation agreement to inter-rater agreement. All manual segmentations were digitized into binary volumes for analysis.
In the cohort atlas scenario, we selected an image from a subject population (AD, MCI, or control), manually traced left and right hippocampi on it, and treated it as a reference image that all other images in the subject population were registered to during atlas-based segmentation. We refer to the selected subject image as a cohort “atlas” image to emphasize its role as a reference image. Cohort atlas images were selected at random from the subject population; however, we note that a variety of more complex strategies for cohort atlas image selection are possible. For each image in each subject population, we considered a hypothetical situation in which that image is selected as the cohort atlas; all other images in the population were registered to the cohort atlas image and hippocampus segmentation results were evaluated. In other words, for a population of k images, we considered k different possible cohort atlases, each of which we registered to all k − 1 other images in the population, for a total of k(k − 1) trials per registration method.
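Because every image in a population of k images serves in turn as the cohort atlas for the k − 1 remaining images, the trials for one registration method are the ordered (atlas, subject) pairs with distinct members. A minimal sketch of that enumeration follows; the function name and image identifiers are hypothetical.

```python
from itertools import permutations


def cohort_trials(image_ids):
    """All ordered (cohort atlas, subject) pairs with atlas != subject."""
    return list(permutations(image_ids, 2))


# Four hypothetical images give 4 * 3 = 12 trials per registration method.
trials = cohort_trials(["img_a", "img_b", "img_c", "img_d"])
```

The pairs are ordered because registering image A as atlas to subject B is a different trial from registering B as atlas to subject A.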
In the standard atlas scenario, we began with an atlas image and manual hippocampus tracings, or manual atlas tracings, provided by an atlas institution (Harvard or MNI). We registered the atlas image to the subject image and used the resulting transformation to transfer a manual tracing of the hippocampus from the atlas image to the subject image. This automated segmentation was evaluated by comparing it to an independent manual subject tracing, that is, a manual tracing of the hippocampi on the subject image performed by rater R1. However, we recognized that the manual tracing protocol used by R1 may differ from that used by human tracers at MNI and Harvard, and that our evaluation risked confounding two factors that could have caused discrepancies between the automated segmentation and manual subject tracing: differences in hippocampus delineation between automated and manual techniques, and discrepancies in hippocampus boundary conventions between manual atlas and subject tracings. For this reason, rater R1 generated manual atlas tracings by tracing left and right hippocampi on the Harvard and MNI atlas images using the same manual tracing protocol used for tracing on the subject images. Experiments analyzed the effects of choice of atlas (MNI vs. Harvard) and manual atlas tracings (performed by R1 vs. performed by the atlas institution) on manual-automated segmentation agreement.
Cohort atlas images reflect possibly anomalous characteristics of a particular scan and subject, and their use is inherently more labor-intensive than standard atlases since they require the user to hand-label the structure of interest on the cohort atlas image. However, cohort-atlas-based segmentation has potential advantages over the more conventional standard-atlas-based approach. If the population of subject images is homogeneous with respect to factors such as sensor acquisition parameters, subject age, and subject disease state, then drawing a cohort atlas image from the population guarantees that these factors will not confound the registration process. Furthermore, hand-labeling the structure of interest on the cohort atlas image ensures that anatomical boundaries reflect the investigator's own conventions.
Performance of automated segmentation algorithms was measured in terms of manual-automated agreement, that is, agreement between automated segmentations and manual tracings performed by an expert rater. We compared manual-automated agreement to manual-manual agreement, or the agreement between manual tracings performed by pairs of expert human raters. In so doing, we assessed whether switching from manual to automated segmentation significantly increases the variability between the produced segmentation and one produced by an independent human rater. We selected 2 AD, 2 MCI, and 2 control images from our pool of subjects and had the hippocampi segmented manually by human raters R1, R2 and R3. Since R1 traced hippocampi on the full set of 54 subject images, we measured manual-automated agreement in terms of agreement between R1-rated manual tracings and the Chen fully-deformable automated technique. Manual-manual agreement was measured in terms of pairwise agreement between manual tracings by R1 and R2, R1 and R3, and R2 and R3. Manual-automated agreement for each subject image was summarized in terms of the average manual-automated agreement between its R1 segmentation and the automated segmentations from all cohort atlas images in its disease category. Experiments analyzed differences between manual-manual agreement and manual-automated agreement on the 6 multiply-manually-traced hippocampi. Note that this approach differs from the more common approach of measuring agreement between pairs of manual and/or automated segmentations in terms of hippocampal volumes; the key difference is that our approach quantifies agreement in terms of how well the segmentations overlap in the brain. We note that other approaches, based on estimating automated segmentation performance and a single estimate of the true, underlying structure mask, are also available .
We evaluated the agreement between an automated hippocampus segmentation estimate and a manual segmentation using a numerical criterion that measured the degree to which they overlap. We represented the automated segmentation $\hat{B}$ and its corresponding manual segmentation $B$ as binary 3D volumes in which voxels labeled as hippocampus had a value of 1. Let $V_{BOTH}$ be the set of voxels labeled as hippocampus by both $\hat{B}$ and $B$; set $V_{\hat{B}}$ has voxels labeled as hippocampus by $\hat{B}$ but not $B$; and set $V_B$ consists of voxels labeled as hippocampus by $B$ but not $\hat{B}$ (sets $V_{BOTH}$, $V_{\hat{B}}$, and $V_B$ are labeled in red, dark gray, and light gray in Figure 3d). The overlap ratio measures the degree of overlap between the automated and manual segmentations, specifically:
$$\mathrm{or}(B, \hat{B}) = \frac{|V_{BOTH}|}{|V_{BOTH}| + |V_{\hat{B}}| + |V_B|}$$
In other words, the overlap ratio measures the percentage of the combined volumes of $\hat{B}$ and $B$ that are both labeled as hippocampus. When $\hat{B}$ and $B$ overlap perfectly, $\mathrm{or}(B, \hat{B}) = 1$; when the masks do not overlap at all, $\mathrm{or}(B, \hat{B}) = 0$. We note that several authors have quantified manual-automated segmentation agreement using criteria similar to the overlap ratio.
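As a minimal sketch, the overlap ratio can be computed directly from the two sets of hippocampus voxels: the voxels labeled by both segmentations divided by the voxels labeled by either. Representing each binary volume as a set of (x, y, z) coordinates is an assumption made here for brevity, and the function name is hypothetical.

```python
def overlap_ratio(manual, automated):
    """Voxels labeled by both segmentations / voxels labeled by either."""
    manual, automated = set(manual), set(automated)
    either = manual | automated
    if not either:
        raise ValueError("both segmentations are empty")
    both = manual & automated
    return len(both) / len(either)


# Two 3-voxel masks sharing 2 voxels: 2 shared / 4 labeled overall.
manual_mask = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
automated_mask = {(1, 0, 0), (2, 0, 0), (3, 0, 0)}
ratio = overlap_ratio(manual_mask, automated_mask)
```

The measure is symmetric in its two arguments and penalizes both missed and spurious voxels, which is why it is stricter than comparing volumes alone.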
Overlap ratio was computed over the entire hippocampus. Furthermore, to characterize automated segmentations in terms of hippocampal sub-regions, we divided the hippocampus into sections and computed performance measures over the voxels in each section. Consider a bounding box $(x_{min}, x_{max}, y_{min}, y_{max}, z_{min}, z_{max})$ around all the hippocampus voxels in the automated segmentation $\hat{B}$ and the manual segmentation $B$ (i.e., the x coordinates of all voxels labeled as hippocampus by either segmentation are between $x_{min}$ and $x_{max}$, etc.). For each of the three cardinal directions, we partitioned the estimated and ground-truth hippocampi into k sections along that direction and computed overlap ratios in each of the sections. That is, for all i from 1 to k we computed $\mathrm{or}(B_x^i, \hat{B}_x^i)$, where
$$B_x^i(x, y, z) = B(x, y, z) \quad \text{if } x_{min} + (i-1)\frac{x_{max} - x_{min}}{k} \le x \le x_{min} + i\,\frac{x_{max} - x_{min}}{k}, \qquad B_x^i(x, y, z) = 0$$
for all other voxels, and $\hat{B}_x^i$ is defined from $\hat{B}$ in the same way. Similarly, we computed $\mathrm{or}(B_y^i, \hat{B}_y^i)$ and $\mathrm{or}(B_z^i, \hat{B}_z^i)$ for all i from 1 to k. See Figure 3e for an illustration. Figure 4 suggests that since the hippocampi all have similar gross orientations in the image, the sections can be interpreted as corresponding to rough anatomical regions on the hippocampus. For example, if we cut the shown hippocampi into sections using vertical lines, the sections to the left correspond to posterior regions, and sections to the right correspond to anterior regions.
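The per-section computation can be sketched as follows. This assumes the k sections are equal-width slabs of the joint bounding box along the chosen axis, which is one plausible reading of the partition described above rather than a confirmed detail; the function name is hypothetical, and a section containing no labeled voxels is reported as `None`.

```python
def section_overlap_ratios(manual, automated, axis, k):
    """Overlap ratio within each of k equal-width slabs along one axis.

    manual, automated : sets of (x, y, z) hippocampus voxels
    axis              : 0, 1, or 2 for the x, y, or z direction
    """
    manual, automated = set(manual), set(automated)
    either = manual | automated
    lo = min(v[axis] for v in either)
    hi = max(v[axis] for v in either)
    width = (hi - lo + 1) / k  # slab width in voxels

    def section_of(voxel):
        # Clamp so the far edge of the bounding box falls in the last slab.
        return min(int((voxel[axis] - lo) / width), k - 1)

    ratios = []
    for i in range(k):
        m = {v for v in manual if section_of(v) == i}
        a = {v for v in automated if section_of(v) == i}
        ratios.append(len(m & a) / len(m | a) if (m | a) else None)
    return ratios


# A manual tracing and an automated one shifted by one voxel along x.
manual_mask = {(x, 0, 0) for x in range(0, 4)}
automated_mask = {(x, 0, 0) for x in range(1, 5)}
by_section = section_overlap_ratios(manual_mask, automated_mask, axis=0, k=5)
```

In this toy case all disagreement falls in the two end sections, mirroring the way boundary shifts concentrate error at the extremities of a structure.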
We analyzed the effects of registration method, side of the brain, disease state, manual tracing protocol, and choice of atlas on overlap ratio through mixed-effects statistical models that properly accounted for fixed effects, random effects, and grouping in our data. The fixed effects, including disease state, side of the brain, and registration method, were modeled as additive offsets from a baseline value of the performance measure. Random effects, such as the random sampling of subjects from an overall patient population, were modeled as variance components. Each level of each fixed effect was assigned a coefficient representing the offset it produced from the baseline value. The overall significance of each fixed effect was evaluated through omnibus F tests. Furthermore, we analyzed differences between factor levels (for example, between control, MCI, and AD subjects) by using focused F tests to check for significant differences between their coefficients. Effect size for focused F tests was quantified by the contrast correlation rcontrast, which generalizes standard 2-group correlational effect size measures while properly accounting for degrees of freedom. In our analysis, between-group differences refer to differences in model coefficients between two factor levels. Mixed-effects models properly account for the random sampling of subject images and cohort atlas images from overall AD, MCI, and control populations, and properly account for repeated measures. All statistics were performed using R version 1.9.1. Mixed-effects models were fit using maximum likelihood estimation in the nlme package. In order to give multiple views of the complex ways in which overlap ratio varied with respect to fixed effects, we report significance values and effect sizes for between-group tests, as well as box-and-whisker plots that show the median, quartiles, and extreme values of overlap ratio within groups.
Experiments evaluate the degree to which segmentation results varied with respect to disease state, registration algorithm, atlases, manual tracings, and side of the brain. At the core of the experiments is the following sequence of actions: register the atlas image to the subject image; use the resulting transformation to transfer the manual atlas tracing onto the subject image; and compare the resulting automated segmentation with the manual subject tracing.
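The text does not spell out how the contrast correlation is calculated. A commonly used definition for a focused, single-degree-of-freedom contrast is the square root of F / (F + df_within), and the sketch below assumes that definition; both the formula choice and the function name are assumptions rather than details taken from this study.

```python
import math


def r_contrast(f_value, df_within):
    """Contrast correlation for a focused (1-df) F test.

    Assumes the definition sqrt(F / (F + df_within)); f_value is the
    contrast F statistic and df_within its error degrees of freedom.
    """
    if f_value < 0 or df_within <= 0:
        raise ValueError("F must be non-negative and df_within positive")
    return math.sqrt(f_value / (f_value + df_within))


# Hypothetical numbers: F = 4 on 96 error degrees of freedom.
effect = r_contrast(4.0, 96)
```

Under this definition a fixed F value yields a smaller effect size as the error degrees of freedom grow, which is how a highly significant p value can accompany a small rcontrast in a large design.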
We refer to the execution of these actions for a particular choice of atlas image, subject image, registration algorithm, and manual tracings as a segmentation trial. Our experimental results were obtained by performing a series of trials through which each of these 4 factors is varied systematically. In particular, for both of our standard atlases, we ran one trial for each possible combination of the 7 registration algorithms, 54 subject images, and 2 sets of manual tracings supplied with the atlas. For each disease state, and for each registration algorithm, we ran one trial for each possible cohort atlas image and subject image within the disease group.
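The size of this experimental design can be tallied with a short sketch. It uses the group sizes, method count, atlas count, and tracing-set count stated in the text, and assumes that each of the k cohort atlases in a disease group is registered to the k − 1 other images in that group; trials are counted per image pair, not per hemisphere, and the constant and function names are hypothetical.

```python
GROUP_SIZES = {"AD": 20, "MCI": 19, "control": 15}
N_METHODS = 7            # registration algorithms compared
N_STANDARD_ATLASES = 2   # Harvard and MNI
N_TRACING_SETS = 2       # tracings by R1 and by the atlas institution


def standard_atlas_trials():
    """One trial per atlas x method x subject x tracing set."""
    n_subjects = sum(GROUP_SIZES.values())
    return N_STANDARD_ATLASES * N_METHODS * n_subjects * N_TRACING_SETS


def cohort_atlas_trials():
    """One trial per method x ordered (atlas, subject) pair within a group."""
    pairs = sum(k * (k - 1) for k in GROUP_SIZES.values())
    return N_METHODS * pairs


n_standard = standard_atlas_trials()
n_cohort = cohort_atlas_trials()
```

Under these assumptions the cohort-atlas experiments dominate the registration workload, since the number of pairs grows quadratically with group size.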
For cohort-atlas-based segmentation, we fit a mixed-effects model in which disease state, side of the brain, and registration method were fixed effects; the subject and cohort atlas identity were random effects; and the performance measures were the dependent variables. The overall effects of side, disease, and method on overlap ratio were statistically significant (p < .0001, p = .0192, p < .0001).
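One way to write such a model down is sketched below. The notation is introduced here for illustration and is an assumption: the subscripts index subject $s$, cohort atlas $a$, side $h$, and method $m$; $d(s)$ is the disease state of subject $s$; and the random effects are taken to be independent crossed intercepts, a structure the text does not specify.

```latex
\mathrm{or}_{s,a,h,m} = \beta_0
  + \beta^{\mathrm{disease}}_{d(s)}
  + \beta^{\mathrm{side}}_{h}
  + \beta^{\mathrm{method}}_{m}
  + u_s + v_a + \varepsilon_{s,a,h,m},
\qquad
u_s \sim \mathcal{N}(0, \sigma_u^2),\quad
v_a \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{s,a,h,m} \sim \mathcal{N}(0, \sigma^2)
```

Here $\beta_0$ is the baseline overlap ratio, each remaining $\beta$ is the additive offset for one level of a fixed effect, and $u_s$ and $v_a$ are the variance components for subject and cohort atlas identity.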
Box plots showing how overlap ratio varies with disease state, side of the brain, registration method, and registration method category, are shown in Figure 5. Effect sizes and p values are shown in Table 1. Overlap ratio was significantly lower in AD compared to MCI and control groups, although no significant difference in overlap ratio was seen between MCI and control groups. No significant difference existed between the FLIRT affine and AIR affine methods. For all other pairs of methods, significant (but in many cases slight) differences in overlap ratio existed. The methods, ranked in decreasing order of overlap ratio, were as follows: Chen fully-deformable, AIR semi-deformable, Chen semi-deformable, SPM affine, SPM semi-deformable, FLIRT affine, AIR affine.
We grouped the registration methods into fully-deformable, semi-deformable, and affine categories and fit a mixed-effects model in which the fixed effects were the method category, disease state, and side of the brain; subject and atlas identity were random effects (see Figure 5 and Table 1). Fully-deformable methods had significantly higher overlap ratio than semi-deformable and affine methods. In turn, semi-deformable methods had significantly higher overlap ratio than affine methods, although the effect size was not as pronounced as in the comparison between fully-deformable and semi-deformable categories.
For standard-atlas-based segmentation, we fit a mixed-effects model in which the fixed effects were the standard atlas (Harvard or MNI), the source of the manual atlas tracings (R1 or Harvard/MNI), side of the brain, disease state, and registration method; subject identity was a random effect; and the performance measures were dependent variables. Figure 6 plots the overlap ratio as a function of atlas image and manual atlas tracing, registration method, side of the brain, and disease state. Results based on R1-traced atlas tracings are referred to as “Harvard By R1” and “MNI By R1”; results based on manual atlas tracings provided by the atlas institution are referred to as “Harvard By Harvard” and “MNI By MNI” respectively.
Overlap ratio was significantly higher for R1-traced manual atlas tracings than hippocampi traced by the atlas institution and was significantly higher for right sides of the brain compared to left. No significant difference in overlap ratio was seen between the MNI and Harvard atlases. Overlap ratio was significantly lower for AD subjects than MCI subjects and controls, but no significant difference was seen between the MCI and control groups. The registration methods, ranked in decreasing order of overlap ratio, were: Chen fully-deformable, AIR semi-deformable, Chen semi-deformable, SPM affine, AIR affine, SPM semi-deformable, FLIRT affine. The difference in overlap ratio between the SPM semi-deformable and FLIRT affine methods was not statistically significant, nor was the difference in overlap ratio between the Chen semi-deformable method and AIR semi-deformable method. Differences in overlap ratio between all other pairs of methods had statistically significant p values, although in some cases the effect sizes were not large.
We directly compared cohort-atlas-based segmentation to standard-atlas-based segmentation using the Chen fully-deformable registration method, which had shown the highest segmentation performance in experiments described above. We fit a mixed-effects model in which the atlas (MNI, Harvard, or cohort atlas), human tracer (R1 or the atlas institution), side of the brain, and disease state were fixed effects, subject identity was a random effect, and the dependent variable was the overlap ratio. Figure 7 plots the overlap ratio for the standard atlases and cohort atlases in this model, along with p values and effect sizes. The mean overlap ratio was significantly higher for cohort-atlas-based segmentation than standard-atlas-based-segmentation using manual atlas tracings by R1 along with the MNI or Harvard atlas images. Performance measures for standard atlases using manual atlas tracings from the atlas institution were significantly worse in each case.
For the six multiply-manually-traced subjects, we fit a mixed-effect model with overlap ratio as the dependent variable, the type of agreement (manual-manual or manual-automated) and side of the brain as fixed effects, and subject identity as a random effect. Manual-manual agreement was not significantly higher than manual-automated agreement in terms of overlap ratio, although we saw a trend toward slightly higher manual-manual agreement (p = .0916, rcontrast = .264). Box plots comparing the distribution of overlap ratio for manual-manual and manual-automated agreement are shown in Figure 8.
Figure 9 shows a representative plot of automated-manual mean overlap ratios and manual-manual mean overlap ratios for hippocampal sections taken along the three cardinal directions of our data set. Results are shown for cohort-atlas-based segmentation using the Chen fully-deformable registration method on the right hippocampus in MCI images; however, similar distributions of overlap are seen for both sides of the brain, all registration methods, and all disease states (see  for detailed plots). The three cardinal directions correspond roughly to the posterior-anterior, medial-lateral, and superior-inferior hippocampal axes, respectively (see Figure 4). The hippocampal sections most responsible for manual-automated disagreement were located at the extremities of the hippocampus, especially at the superior, inferior, medial, and lateral ends. With the exception of the most extreme sections, mean overlap ratio was generally higher toward the lateral extent of the hippocampus and lower toward the medial extent. Furthermore, with the exception of the most extreme sections, mean overlap ratio was relatively constant with respect to anterior-posterior position. Finally, moving from the superior to inferior extent, mean overlap ratio increased steadily, reached a peak at the central sections, and decreased toward the inferior end. These patterns of manual-automated overlap across sub-regions were similar to patterns of manual-manual overlap on the 6 selected images, although the human raters were relatively more consistent at the lateral extent.
This section summarizes our results in terms of which factors led to higher or lower performance measures in the atlas-based segmentation experiments. A “>” between two factor levels indicates that the overlap was higher for the first factor level compared to the second.
Our results confirm the intuition that methods making use of more highly-deformable geometric transformation models tended to fit the complex shape of the hippocampus more accurately than less-deformable geometric models. This agrees with earlier results that demonstrated that atlas-based hippocampus segmentation based on other highly-deformable registration methods can produce hippocampal volumes consistent with expected disease-related atrophy effects. We believe that the AIR semi-deformable technique performed better than competing semi-deformable methods because the “deformability” of its geometric transformation (i.e., the degree of its polynomial basis) was allowed to gradually increase over the course of optimization, while the geometric transformations for the Chen and SPM semi-deformable techniques were fixed in their spatial structure. Furthermore, as mentioned above, SPM is explicitly biased toward minimally-deforming transformations, which may steer its geometric transformation away from highly-accurate fit of the hippocampal surface. In a related technical report, a statistical analysis of the severity of segmentation errors shows a similar relationship between the performance characteristics of the three registration categories.
Results suggest a general trend toward higher manual-manual agreement compared to manual-automated agreement (see Figures 8 and 9), but the differences are not statistically significant. Thus, while there may be room for improvement of the automated methods, Chen’s fully-deformable method can be competitive with the human raters in terms of overlap ratio. These results, together with results from a related study of the severity of automated segmentation errors, suggest that automated methods may be competitive for elderly hippocampus segmentation applications, especially those that can tolerate minor errors in spatial localization. These results extend previous findings that atlas-based techniques can be competitive with manual tracing for other subject groups and brain structures. Furthermore, the automated results present a very promising starting point for further automated refinement by more complex shape-model-based segmentation techniques.
Overall performance measures were significantly lower among AD subjects than MCI or control subjects. One possible explanation for these results is that the degenerative processes of AD made image registration inherently more difficult and ambiguous by reducing tissue contrast and/or inducing a high degree of variability in the geometric characteristics of brain structures such as the hippocampus. Another possible explanation is that registering pairs of AD images was no more or less difficult than registering MCI or elderly control brains, but that standard software packages are not optimized for the task. Similarly, the fact that overlap ratios for MCI and control cases were similar could suggest that their image characteristics do not differ so significantly that they affected registration.
To further investigate how performance differences between disease groups were modulated by other algorithmic factors, we fit mixed-effects models similar to those described above, but with additional terms to model interactions between disease states and concurrent algorithmic factors. Specifically, for cohort-atlas-based segmentation, there were fixed effect terms in the model for disease state, side of the brain, registration method, and the interaction between disease state and registration method. For standard-atlas-based segmentation, we included fixed effect terms for the standard atlas, source of the manual atlas tracing, side of the brain, disease state, registration method, interaction between disease state and registration method, and interaction between disease state and standard atlas. For comparing standard atlases to cohort atlases using fully-deformable registration, fixed effects were the atlas, human tracer, side of the brain, disease state, and interaction between disease state and atlas. Cohort-atlas-based segmentation performance was significantly higher for control subjects with the SPM affine (p = .0232, rcontrast = .0206), AIR semi-deformable (p = .0035, rcontrast = .0265), and SPM semi-deformable methods (p < .0001, rcontrast = .0524); standard-atlas-based performance was higher in control subjects with the AIR semi-deformable (p = .0207, rcontrast = .0426) and SPM semi-deformable methods (p = .0128, rcontrast = .0458), and lower in MCI subjects with the Chen semi-deformable method (p = .0463, rcontrast = .0367); and no interaction terms were significant in the model comparing cohort-atlas-based to standard-atlas-based segmentation using the Chen fully-deformable method. While the effect size is relatively low for each interaction term, these results suggest that performance differences between AD, MCI and control groups are attributable more to AIR and SPM registration methods than other registration methods or atlas choices.
In terms of automated segmentation performance, a striking bilateral asymmetry was seen in all experiments, across all three disease groups. These results echo the slight bilateral asymmetry in atlas-based hippocampus segmentation results shown by Duchesne et al. However, a mixed-effects model fit to solely manual-manual agreement data did not show a statistically significant bilateral asymmetry in manual-manual overlap ratio (p = .12). Our initial calculations of hippocampal volumes did not show a significant volume asymmetry, echoing the findings of Bigler et al. Further investigation is needed to explain this bilateral effect.
Results from our mixed-effects models suggest that randomly selecting cohort atlas images from a population leads to higher automated segmentation performance than standard-atlas-based segmentation, independent of differences in manual segmentation protocols between institutions. This confirms our intuition that differences in brain morphology and image acquisition characteristics between young, healthy atlas-image brains and elderly, diseased subject-image brains can negatively impact performance of standard-atlas-based segmentation. In particular, differences in brain structure between the young, healthy individuals scanned for standard atlas images and the elderly subjects in our study could pose additional challenges to accurate image registration and segmentation. Future work should investigate the ways in which discrepancies in morphology, image acquisition parameters, and scanning equipment impact atlas-based segmentation results.
Segmentation errors were evenly distributed between posterior and anterior regions of the hippocampus, were more concentrated in the medial regions than the lateral regions, and were generally more highly concentrated toward the periphery than the center. One possible reason for the medial skew in errors is that CSF forms part of the lateral boundary of the structure over its entire anterior-posterior extent, while in some regions, the medial boundary consists entirely of subtle, ambiguous interfaces with other gray-matter compartments. We suggest that the sharp contrast between gray matter and CSF forms a strong visual cue that the automated methods take advantage of to more accurately localize the lateral boundary. Interestingly, our finding that agreement between pairs of human raters did not vary significantly along the anterior-posterior direction except at the extreme periphery contrasts with the inter-rater consistency maps shown by Thompson et al., which suggest that manual tracings are relatively more consistent in the anterior sections. A possible explanation for this discrepancy is that the consistency measure of Thompson et al. is based on agreement between raters in radial distances from the medial axis of the hippocampus to its surface, and therefore could be less sensitive in anterior sub-regions where radial distances are relatively large.
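One simple way to tabulate a medial-versus-lateral error distribution of this kind is sketched below in Python. This is not the procedure used in the study: it assumes each segmentation is a non-empty set of integer voxel coordinates, that the medial-lateral direction lies along a single image axis, and that the midline position `midline` is known. The function names are hypothetical.

```python
def disagreement(auto, manual):
    """Voxels labeled as hippocampus by exactly one of the two segmentations."""
    return auto ^ manual


def medial_lateral_counts(auto, manual, midline, axis=0):
    """Count disagreement voxels medial vs. lateral to the manual centroid.

    `auto` and `manual` are sets of (x, y, z) voxel tuples; `manual` must be
    non-empty. `midline` is the coordinate of the brain midline along `axis`.
    """
    errors = disagreement(auto, manual)
    centroid = sum(v[axis] for v in manual) / len(manual)
    centroid_dist = abs(centroid - midline)
    medial = lateral = 0
    for v in errors:
        # A voxel counts as medial if it is closer to the midline
        # than the centroid of the manual segmentation is.
        if abs(v[axis] - midline) < centroid_dist:
            medial += 1
        else:
            lateral += 1
    return medial, lateral
```

For example, with a manual tracing spanning x = 10 to 19 and an automated segmentation spanning x = 8 to 19 on the same row, both disagreement voxels fall on the medial side when the midline is at x = 0.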
Geuze et al. recently described a dizzying array of existing methods for manually tracing the hippocampus in MR. Our results (see Figure 6) indicate that discrepancies between these manual protocols can add a highly significant source of variation to what portion of the brain can be expected to be labeled as hippocampus, both in manual tracings and atlas-based automated methods. We emphasize that we are not suggesting that the manual segmentation protocol used by R1 is superior or inferior to those employed for the Harvard or MNI atlases; rather, we have shown that variations in the resulting hippocampi can be significant. Therefore, we recommend that researchers using standard atlas images for atlas-based segmentation examine the manual tracings and tracing protocols closely to be sure the delineation conventions employed match those of their own laboratory. If they do not, our results have shown that tracing the structure on the standard atlas image or a randomly-selected subject image leads to automated segmentations whose agreement with expert manual segmentations is competitive with manual-manual agreement.
Atlas-based segmentation is a simple, automated method for structure segmentation that can use public-domain tools to produce reasonable structure delineations in images of elderly controls and subjects with MCI and AD. While additional work may be needed to make these automated techniques truly competitive with expert human raters, their performance may be acceptable for image processing applications that can tolerate a small amount of hippocampus localization error. While standard digital atlases from MNI, Harvard, and other institutions allow investigators to apply atlas-based segmentation to their subject images with no need for manual labeling, care must be taken to ensure that hippocampus tracing protocols from the atlas institution coincide with those of the investigator.
This work was supported by NIH grants NS07391, MH064625, AG05133, DA01590001, MH01077, EB001561, RR019771, RR021813, AG016570,
1 While we use the original implementation of Chen’s method, its registration modules are highly similar to the FEM-based and Demons registration modules of the freely-available ITK software package.