When validating a machine learning approach, it is essential to examine error metrics on both the training and testing sets. A test set independent of the training set is vital for showing the effectiveness of a classifier on data totally withheld from training. Since we used 21 hand-labeled brains to train the algorithm, we employed a leave-one-out analysis to guarantee a separation between the training and testing sets. To put our error metrics in context and decide whether they were acceptable for the application, we had a second independent expert rater trace the same 21 brains. We were then able to create a triangle of comparisons, shown schematically in the figure below, in which the algorithm's segmentations can be compared with those of the human rater who trained the algorithm (rater 1; A.G.) and with those of an independent rater (rater 2; C.A.) who did not train it.
A schematic description of the comparisons performed. For all of the tests performed in this paper, training was performed on rater 1's tracings.
To show agreement with a human expert not involved in training the algorithm, we trained our algorithm only on manual segmentations from rater 1 and were still able to achieve good segmentation results that agreed well with rater 2's manual tracings. We emphasize that the validation against rater 1 is also an independent validation, in the sense that our algorithm was classifying images that it was not trained on (i.e., the leave-one-out approach).
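The leave-one-out protocol can be sketched as follows; `train` and `segment` are hypothetical stand-ins for the actual classifier, which the text does not specify at this level of detail:

```python
def leave_one_out(brains, labels, train, segment):
    """Leave-one-out testing: for each of the N hand-labeled brains,
    train on the other N-1 brains and segment the single held-out brain,
    so the test image is never seen during training."""
    results = []
    for i in range(len(brains)):
        rest = [j for j in range(len(brains)) if j != i]
        model = train([brains[j] for j in rest], [labels[j] for j in rest])
        results.append(segment(model, brains[i]))  # segmentation of the held-out brain
    return results
```

With 21 brains this yields 21 segmentations, each produced by a model that never saw its own ground-truth tracing.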
Second, we further validated our approach using volumetric results of three kinds. We hypothesized that hippocampal volume would decrease with disease progression, and verified this by comparing mean volumes in groups of controls, MCI subjects, and AD patients. We also examined whether, in the full sample, hippocampal volume was correlated with clinical measures of cognitive impairment. Encouragingly, hippocampal measures from our segmentations correlated more strongly with cognition than measures from a popular technique for quantifying brain atrophy, tensor-based morphometry, which is closely related to voxel-based morphometry.
Finally, since longitudinal follow-up scans were available for the individuals tested in this paper, we used scans taken six months later to assess the longitudinal stability of segmentations of the same subjects. We showed that the amount of hippocampal volume change was consistent with prior reports in the literature.
To assess our segmentations' performance, we first define a number of error metrics. Let A be the ground-truth segmentation and B the testing segmentation, and let d(a,b) be the Euclidean distance between points a and b. Then:
- H1 = max_{a ∈ A} min_{b ∈ B} d(a,b)
- H2 = max_{b ∈ B} min_{a ∈ A} d(b,a)
- Mean = avg_{a ∈ A} min_{b ∈ B} d(a,b)
The Hausdorff distance reported below is the symmetric version, max(H1, H2).
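For two point sets A and B, these distance metrics can be computed directly. This is a brute-force sketch; production implementations typically use a distance transform or a KD-tree over the boundary voxels instead of the nested loops:

```python
import math

def directed_max(A, B):
    """H1 (or H2 with the arguments swapped): for each point of A, take the
    distance to its nearest neighbor in B, then return the worst case."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance: max(H1, H2)."""
    return max(directed_max(A, B), directed_max(B, A))

def mean_distance(A, B):
    """Mean over points a in A of the distance to the nearest point of B."""
    return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)
```

Note that H1 and H2 are generally unequal, which is why both directed distances are defined before taking their maximum.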
First, Table 2 presents our segmentation performance on the training set. For this analysis, we used all 21 brains as training data, and tested on all 21 brains. These performance results on the training set represent an upper bound for the expected accuracy on the testing set. Next, we used our leave-one-out approach to obtain testing metrics comparing our results to rater 1 (leg "b" in the schematic), shown in Table 3. Table 4 compares our method with rater 2 (leg "c" in the schematic), again using the leave-one-out technique. Finally, we compared the two human raters directly with one another (leg "a" in the schematic) in Table 5.
Table 2 Precision, recall, relative overlap (R.O.), similarity index (S.I.), Hausdorff distance, and mean distance are reported for the training set (N = 21). Note that lower values are better for the Hausdorff distance and mean error (reported here in millimeters).
Table 3 Precision, recall, relative overlap (R.O.), similarity index (S.I.), Hausdorff distance, and mean distance are reported for the leave-one-out analysis (N = 21), in which the algorithm was trained on 20 segmentations from rater 1 and tested on the single held-out brain.
Table 4 Precision, recall, relative overlap (R.O.), similarity index (S.I.), Hausdorff distance, and mean distance are reported for the leave-one-out analysis (N = 21) when trained on rater 1 and tested on rater 2.
Table 5 Precision, recall, relative overlap (R.O.), similarity index (S.I.), Hausdorff distance, and mean distance are reported between the two human raters (N = 21).
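The voxel-overlap metrics in these tables can be computed from two sets of foreground voxel coordinates. The formulas below are the standard definitions (relative overlap as the Jaccard index, similarity index as the Dice coefficient); the excerpt itself does not spell them out, so these are assumed:

```python
def overlap_metrics(truth, test):
    """Overlap metrics between a ground-truth voxel set and a test voxel set."""
    truth, test = set(truth), set(test)
    inter = len(truth & test)  # true-positive voxels
    return {
        "precision": inter / len(test),            # fraction of test voxels that are correct
        "recall": inter / len(truth),              # fraction of truth voxels recovered
        "relative_overlap": inter / len(truth | test),             # Jaccard index
        "similarity_index": 2 * inter / (len(truth) + len(test)),  # Dice coefficient
    }
```

Relative overlap is always the strictest of the four, which is worth remembering when comparing the ~75% R.O. values here with the ~85% precision/recall values.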
The first thing to note is that the error metrics from the training and test sets are very close to each other, with the testing metrics slightly worse than the training metrics, as expected. This shows that ACM is not memorizing the data, but instead learning the underlying structure of the hippocampus. Next, the disagreement between our algorithm and the raters is only slightly larger than the disagreement between the two human raters. Specifically, comparing Tables 4 and 5, the relative overlap between the two human raters is on average 74.9% for the left and 74.3% for the right hippocampus (Table 5), while the relative overlap between the algorithm and a rater not involved in training it was 75.4% for the left and 71.9% for the right hippocampus (Table 4). This shows that the errors in our algorithm are comparable to the differences between two raters. In terms of precision, the agreement between the two human raters is about 3% higher than the agreement between the algorithm and the rater not used to train it, with all values in the 83-89% range. For recall, the algorithm agrees with the 2nd rater at least as well as the 1st rater agrees with the 2nd rater, with all values in the 82-86% range. The only metric for which the human raters agree with each other more than they do with the algorithm is the mean error (see Tables 4 and 5), but even for that metric agreement is very high among all three, suggesting that any biases are very small.
To further compare the performance of our approach with other segmentation methods, in the table below we present error metrics from three other papers that report either fully or semi-automated hippocampal segmentations. We present these only to show that our results are within the same range as those of other automated approaches; since each study uses a different set of scans, an exact comparison is not possible.
This table reports hippocampal segmentation metrics for other semi- and fully automated approaches. Our results compare favorably to those reported here. A complete comparison is not possible without testing performance on the same set of brains.
Figure 4 shows an example brain from the test set, with the right and left hippocampi overlaid in yellow and green. There is good differentiation of the hippocampus from the surrounding amygdala, overlying CSF, and adjacent white matter, and the traces are spatially smooth, simply connected, and visually resemble manual segmentations by experts. This image was chosen at random from the test set, and is representative of the segmentation accuracy obtainable on the test images.
Figure 4 Automated segmentation results for an individual from the testing set. Here the right hippocampus is encircled in yellow, and the left hippocampus in green. Axial, coronal, and two sagittal slices through the hippocampus are shown.
The table below shows that the inter-rater r (intraclass correlation) between the two raters' hippocampal volumes is comparable to that between each rater's volumes and the volumes obtained from our algorithm's segmentations. Although the correlation between our approach and either rater is lower than the correlation between the two raters themselves, the intraclass correlation is high, and, as expected, statistically significant on both sides. For all of these tests, we trained the algorithm only on segmentations from rater 1, which is one reason why a slightly higher correlation is observed with rater 1 than with rater 2.
Inter-rater r when comparing the three sets of volumes. These volumes were obtained from the leave-one-out analysis, so they reflect a realistic testing scenario.
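The excerpt does not state which intraclass-correlation variant was used; as one common choice, a one-way random-effects ICC(1,1) for two volume measurements per subject can be sketched as:

```python
def icc_oneway(x, y):
    """One-way random-effects intraclass correlation, ICC(1,1), for two
    measurements per subject (e.g. rater 1's vs rater 2's volumes).
    This variant is an assumption; the source does not specify one."""
    n = len(x)
    means = [(a + b) / 2 for a, b in zip(x, y)]  # per-subject means
    grand = sum(means) / n                       # grand mean
    # Between-subject mean square (k = 2 ratings per subject)
    msb = 2 * sum((m - grand) ** 2 for m in means) / (n - 1)
    # Within-subject mean square
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(x, y, means)) / n
    return (msb - msw) / (msb + msw)
```

Unlike Pearson's r, the ICC penalizes systematic offsets between the two sets of volumes, which is why it is preferred for rater-agreement studies.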
Next, we present a disease-based validation technique, based on the premise that a necessary but not sufficient condition for a valid classifier is that it differentiates group mean hippocampal volumes between AD, MCI, and controls. Since it is well known that reductions in hippocampal volume are associated with declining cognitive function (Jack et al., 1999), we showed that our method accurately captures the known mean volumetric differences between subgroups at different stages of dementia (controls, MCI, and AD). Due to the limited sample size (N = 21), we pooled left and right hippocampal volumes together for some of these results. Volumetric summaries were computed using the segmentations obtained in the leave-one-out testing analysis.
The figure and table below show that there is a sequential reduction in volume from controls to MCI to AD subjects, consistent with many prior studies (Convit et al., 1997). This shows that the brain MRIs we are working with exhibit the expected profile of volumetric effects with disease progression, and that the segmentation approach measures hippocampal volumes with low enough methodological error to differentiate the 3 diagnostic groups, at least at the group level, in a very small sample.
Volumetric analysis for the three diagnostic groups. The error bars represent standard errors of the mean. Percent differences are tabulated in Table 8.
Table 8 Mean differences in hippocampal volume (as a percentage) are shown for the groups listed in the left column for all subjects. Even though this is a very small sample (N = 21; 7 of each diagnosis), there is a hippocampal volume reduction associated with increasing disease severity.
Table 9 shows strong and significant correlations between hippocampal volume and MMSE scores (r = 0.587 for the average of the left and right hippocampal volumes; p < 0.01), and between hippocampal volume and sum-of-boxes CDR scores, for the left, right, and mean hippocampal volumes (r = −0.642 for the mean volume; p < 0.01). Correlations are high (around 0.6 in magnitude) when the average of the left and right hippocampal volumes is used, suggesting that the hippocampal volumes explain a significant proportion of the variation in clinical decline. Although these associations are already known, their detection here provides evidence that the classifier error is low enough to allow them to be found in small samples. Each of these values is significant despite the very small sample size, further confirming that our method is capable of capturing disease-associated hippocampal degeneration.
Table 9 This table reports the correlations between hippocampal volumes and clinical covariates. A desirable but not sufficient condition for a hippocampal segmentation approach is that the methodological error is small enough for these correlations to be detected.
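The volume-cognition correlations above are plain correlation coefficients; a Pearson r can be sketched as below (an assumption about the exact statistic used, and the quoted p-values would additionally require a t-distribution lookup):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between paired samples,
    e.g. hippocampal volumes vs. MMSE scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mx) ** 2 for a in x) *
                     sum((b - my) ** 2 for b in y))
    return cov / norm
```

A positive r is expected against MMSE (higher score = better cognition) and a negative r against sum-of-boxes CDR (higher score = worse impairment), matching the signs reported above.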
In a previous cross-sectional study on the ADNI dataset, we used tensor-based morphometry (TBM) to analyze brain differences associated with different stages of disease progression (Hua et al., 2008). TBM is a method based on high-dimensional image registration, which derives information on regional volumetric differences from a deformation field that aligns the images. TBM and voxel-based morphometry (VBM; Ashburner and Friston, 2000) are closely linked, and each measures voxelwise expansion (or contraction) of the brain relative to a minimal deformation template, which represents the mean anatomy of the subjects (Lepore et al., 2007). Voxel-based morphometry (Davatzikos et al., 2001; Good et al., 2001) is a related approach that modulates the voxel intensity of a set of spatially normalized gray matter maps by the local expansion factor of a 3D deformation field that aligns each brain to a standard brain template.
Although TBM has proven useful in quantifying brain atrophy over time in 3D (Leow et al., 2005a; Studholme et al., 2004; Teipel et al., 2007), in cross-sectional studies TBM can be less effective for quantifying volumetric differences in small brain regions (such as the hippocampus) when the ROI is defined on the minimal deformation template. This is to be expected, as TBM may be considered a rudimentary hippocampal segmentation approach that works by fluidly deforming a mean anatomical template onto the target image; the criteria guiding accurate segmentations are typically limited to measures of agreement in image intensities, such as mutual information (Leow et al., 2005b; Viola and Wells, 1995). Table 10 shows the correlations between hippocampal volume as measured with TBM and both MMSE and sum-of-boxes CDR scores. None of these correlations is significant in this small sample, and they compare poorly with those in Table 9. This suggests that our direct segmentation of hippocampal anatomy via voxel-level classification is better correlated with cognition than the measures we previously obtained using a deformation-based morphometry method.
Table 10 This table reports the correlations between hippocampal volumes estimated using tensor-based morphometry (as reported in Hua et al., 2008) and clinical covariates. None of these correlations has a significant p-value.
Longitudinal Validation by Repeat Scanning
As a final validation approach, we segmented a set of six-month follow-up scans, acquired using an identical imaging protocol, for the individuals whose baseline scans were analyzed in this paper. At the time of writing, six-month follow-up scans were available for 18 of the 21 subjects, including 6 AD patients, 5 MCI patients, and 7 control subjects. Due to the very small sample size (especially in the AD and MCI groups) and the short interval, we present this analysis to show that our algorithm is reproducible, giving relatively consistent hippocampal volumes over a short interval in which minimal hippocampal volume loss is expected. Table 11 shows that there is minimal loss over 6 months, as expected. We note that the measured change represents a combination of biological change and the methodological errors in segmentation, which derive partly from the algorithm and partly from the fact that the image acquisition is not perfectly reproducible. As these sources of methodological error are expected to be small and additive, the fact that the mean change is near 1.5% for the left and 0% for the right hippocampus is in line with expectation. Given that some small biological change is also occurring, this suggests good longitudinal stability for the volume measurements obtained by our algorithm.
Table 11 This table reports the percent loss of hippocampal volume for all 18 subjects that had follow-up scans over a 6-month interval. For both hippocampi (and the mean volume), the mean percent loss is very small, indicating good longitudinal reproducibility of the volume measurements.
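The per-subject change statistic reported here is presumably the simple percent difference between follow-up and baseline volumes, which can be sketched as:

```python
def percent_change(baseline, followup):
    """Percent hippocampal volume change from baseline to 6-month follow-up;
    negative values indicate volume loss."""
    return 100.0 * (followup - baseline) / baseline

def mean_percent_change(baselines, followups):
    """Group mean of the per-subject percent changes."""
    changes = [percent_change(b, f) for b, f in zip(baselines, followups)]
    return sum(changes) / len(changes)
```

Because each subject's change is normalized by that subject's own baseline volume, the group mean is not dominated by subjects with larger hippocampi.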