The computational segmentation process for each of the 10 subjects was fully automated and took about 5 h per subject on a 2.83 GHz Intel Xeon E5440 processor. A qualitative comparison between the automated segmentation result in one subject and the corresponding manual delineation is shown in .
The top left of Figure 3 shows, for each structure of interest, the average Dice overlap between the automated and manual segmentations across all 10 subjects, with error bars indicating the standard error of the mean. As expected, larger structures score better than smaller ones: CA2/3 and subiculum, the largest structures with average volumes of 935 and 537 mm³, respectively, have an average Dice coefficient of around 0.74, whereas CA4/DG (on average 526 mm³) and presubiculum (on average 431 mm³) score around 0.68 each, and the smaller CA1 (on average 340 mm³) has a Dice coefficient of around 0.62. Automated segmentation of the fimbria and the hippocampal fissure, the smallest structures with an average volume of only 81 mm³ each, is considerably more challenging, with Dice coefficients of around 0.51 and 0.53, respectively.
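For readers who want to reproduce this kind of summary, the following is a minimal sketch of the Dice overlap and of the mean with its standard error. It assumes the standard definition, Dice = 2|A ∩ M| / (|A| + |M|), and represents each segmentation as a set of voxel coordinates; the function names are hypothetical and are not taken from the software used in this study.

```python
import math


def dice_overlap(auto_voxels, manual_voxels):
    """Dice overlap 2|A ∩ M| / (|A| + |M|) between two voxel sets.

    Each argument is an iterable of (x, y, z) voxel coordinates that
    belong to one structure (assumed representation, for illustration).
    """
    auto_set, manual_set = set(auto_voxels), set(manual_voxels)
    total = len(auto_set) + len(manual_set)
    if total == 0:
        raise ValueError("both segmentations are empty")
    return 2.0 * len(auto_set & manual_set) / total


def mean_and_standard_error(values):
    """Return the mean and the standard error of the mean of `values`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Unbiased sample variance (divides by n - 1).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)
```

Applying `dice_overlap` once per subject and structure, and then `mean_and_standard_error` over the 10 per-subject values, would give one bar and one error bar of the kind described above.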
FIGURE 3 Dice overlap measures (top left), average boundary distances (bottom left), and relative volume differences (top right) between automated and manual segmentations in 10 subjects. Also shown (bottom right) are the human intrarater Dice overlap measures ...
The bottom left of Figure 3 shows the mean and standard error of the average distance between the boundary of each structure’s manual segmentation and the boundary of the corresponding automated segmentation. This evaluation metric depends less on structure size than the Dice overlap coefficient does. The average boundary error is below the voxel size of 380 µm for most structures, including the fimbria. The exceptions are CA1, with an average boundary error of about 1.17 times the voxel size, and the hippocampal fissure, with an average error of almost twice the voxel size.
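The text does not spell out how the average boundary distance is computed, so the following is only one plausible sketch. It assumes isotropic 0.38 mm voxels, defines a boundary voxel as one with at least one 6-connected neighbor outside the structure, and averages the nearest-boundary distance symmetrically in both directions. All names are hypothetical, and the brute-force search is meant for illustration rather than for full-resolution volumes.

```python
import math

# Offsets of the six face-connected neighbors of a voxel.
FACE_NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                  (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def boundary_voxels(voxels):
    """Voxels of a structure that have at least one face neighbor outside it."""
    inside = set(voxels)
    return {
        (x, y, z)
        for (x, y, z) in inside
        if any((x + dx, y + dy, z + dz) not in inside
               for dx, dy, dz in FACE_NEIGHBORS)
    }


def mean_nearest_distance(source, target):
    """Mean distance (in voxels) from each source voxel to its nearest target voxel."""
    return sum(min(math.dist(s, t) for t in target) for s in source) / len(source)


def average_boundary_distance(auto_voxels, manual_voxels, voxel_size_mm=0.38):
    """Symmetric average boundary distance in mm (assumed definition)."""
    auto_b = boundary_voxels(auto_voxels)
    manual_b = boundary_voxels(manual_voxels)
    if not auto_b or not manual_b:
        raise ValueError("both segmentations must be non-empty")
    in_voxels = 0.5 * (mean_nearest_distance(auto_b, manual_b)
                       + mean_nearest_distance(manual_b, auto_b))
    return in_voxels * voxel_size_mm
```

Dividing the result by `voxel_size_mm` expresses the error in voxel units, which is how values such as "1.17 times the voxel size" would be read off.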
The top right of Figure 3 shows, for each structure, the volume differences between the automated and manual segmentations relative to their mean volumes. In terms of Pearson’s correlation coefficient, the automatically calculated volumes of CA2/3 and CA4/DG are strongly correlated with the manual ones, with correlation coefficients of 0.91 (P ≤ 0.0002) and 0.83 (P ≤ 0.0028), respectively. Subiculum volumes are also correlated to some degree (correlation coefficient 0.60), although the correlation does not reach statistical significance (P ≤ 0.066). Interestingly, despite the hippocampal fissure’s relatively low Dice overlap and large average boundary error, its automated volume measurements correlate better with the manual ones (correlation coefficient 0.50, P ≤ 0.14) than do those of some structures with much better segmentation evaluation metrics. The relatively poor segmentation evaluation scores for the hippocampal fissure are apparently caused by a systematic underestimation of its volume by the automated method.
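The distinction between volume agreement and volume correlation can be illustrated with a short sketch. It assumes that the relative volume difference is the signed difference divided by the mean of the automated and manual volumes, which is one reading of "relative to their mean volumes", and it uses the textbook Pearson formula. The function names and the numbers in the usage note are hypothetical.

```python
import math


def relative_volume_difference(auto_volume, manual_volume):
    """Signed difference (auto - manual) divided by the mean of the two volumes."""
    mean_volume = 0.5 * (auto_volume + manual_volume)
    if mean_volume == 0:
        raise ValueError("both volumes are zero")
    return (auto_volume - manual_volume) / mean_volume


def pearson_correlation(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)
```

With made-up manual volumes of 80, 90, and 100 mm³ and automated volumes of exactly half those values, `pearson_correlation` is 1 up to rounding even though every `relative_volume_difference` is strongly negative. A consistent underestimation therefore lowers overlap and boundary scores without necessarily lowering the correlation, which matches the pattern described for the hippocampal fissure.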
The bottom right of Figure 3 shows, for each structure, the average Dice overlap between the automated segmentation and the repeated manual delineations of two slices in the midbody of the hippocampus in five subjects (filled bars), along with the Dice overlap between the repeated manual segmentations themselves (empty bars). Qualitatively, the automated method performs similarly on these selected slices as on the whole volumes, except for the presubiculum (Dice coefficient decreased to 0.55) and the hippocampal fissure (Dice coefficient only 0.2). Compared with the intrarater variability of the human operator, the automated results differ systematically more from the manual segmentations than the repeated manual segmentations differ from one another, although considerable disagreement is apparent between the latter as well (mean intrarater Dice overlap over all structures: 0.79). With the exception of the hippocampal fissure, the structures that are most difficult for the human operator to segment reliably are also the most difficult for the automated method: the order in which the Dice scores decrease across structures is the same for both.