The major results of analyses are summarized in .

Performance of methods based on volume overlap

As assessed by volume overlap (Dice's coefficient), FreeSurfer was superior to FIRST for hippocampal segmentation but not for amygdala segmentation as seen in ; significant interaction of region*method [*F*(1,19)=16.9; *p*<0.001]. The volume overlap for FreeSurfer and manual tracing was greater than for FIRST and manual tracing in the left and right hippocampus [*t*(40)=5.3, *p*<10^{−6}]. There were no corresponding differences between methods for amygdala segmentation [*t*(40)=0.05; *p*>0.5]. Percent volume overlap was greater for the hippocampus than the amygdala regardless of the measurement method [*t*(40)=12.4, *p*<10^{−24}].
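For readers unfamiliar with the overlap metric, Dice's coefficient for two segmentations A and B is 2|A∩B|/(|A|+|B|). The following is a minimal sketch, assuming each segmentation is represented as a set of voxel coordinates; the function name and the toy coordinates are hypothetical and are not taken from the study pipeline.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice's coefficient 2|A∩B| / (|A| + |B|) for two collections of voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty masks agree perfectly.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


# Toy voxel coordinates (hypothetical, not study data).
manual = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
automated = {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1)}

overlap = dice_coefficient(manual, automated)
print(f"Dice = {overlap:.2f}, percent overlap = {100 * overlap:.0f}%")
```

A value of 1 indicates identical segmentations and 0 indicates no shared voxels, so higher values reflect better agreement with manual tracing.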

Performance of methods based on volume difference

When comparing percent volume differences between automatic and manual measurements, FreeSurfer was superior to FIRST for hippocampal segmentation, but FIRST was superior to FreeSurfer for amygdala segmentation as seen in (significant interaction of region*method [*F*(1,19)=125.6; *p*<0.0001]). Note that lower values of percent volume difference indicate superior performance. The volume difference for FreeSurfer was less than FIRST in the hippocampus [*t*(40)=4.1, *p*<10^{−4}]. On the other hand, volume difference for FIRST was less than FreeSurfer in the left and right amygdala [*t*(40)=5.5, *p*<10^{−6}]. Volume difference was smaller in the hippocampus than the amygdala; main effect of region [*t*(40)=2.3, *p*<0.02].
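The exact formula for percent volume difference is not given in this section. As an illustration only, the sketch below assumes the absolute difference between automated and manual volumes expressed as a percentage of the manual volume; other normalizations (for example, dividing by the mean of the two volumes) are also in common use. The function name and the example volumes are hypothetical.

```python
def percent_volume_difference(auto_mm3, manual_mm3):
    """Absolute volume difference as a percentage of the manual volume.

    Assumption: manual tracing is the reference. The source does not state
    which normalization was actually used.
    """
    if manual_mm3 <= 0:
        raise ValueError("manual volume must be positive")
    return abs(auto_mm3 - manual_mm3) / manual_mm3 * 100.0


# Hypothetical volumes in mm^3 (not study data).
example = percent_volume_difference(auto_mm3=3850, manual_mm3=3500)
print(f"percent volume difference = {example:.1f}%")
```

Because the difference is taken in absolute value, this quantity measures the size of the disagreement and not its direction.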

A comparison of volume differences between automated and manual segmentation showed greater FreeSurfer-Manual volume difference in the L-amygdala than the R-amygdala [*t*(40)=2.2, *p*<0.05]. However, no FIRST-Manual volume difference was detected between the left and right amygdala [*t*(40)=0.16, *p*>0.8]. Comparing the left and right hippocampus, we failed to detect a FreeSurfer-Manual volume difference [*t*(40)=1.1, *p*=0.3] or a FIRST-Manual volume difference [*t*(38)=1.7, *p*=0.1].

Correlation of automated segmentation methods

The correlation of hippocampal volume between FreeSurfer and manual tracing (*R*=0.82, *p*<10^{−9}) was higher than the correlation between FIRST and manual tracing (*R*=0.66, *p*<10^{−5}) (). Both automated methods yielded larger hippocampal volumes relative to manual segmentation.

The correlation of amygdala volume between FreeSurfer and manual tracing (*R*=0.56, *p*<0.0005) was higher than the correlation between FIRST and manual tracing (*R*=0.24, *p*>0.13) (). Both automated methods yielded larger amygdala volumes relative to manual segmentation.

The Bland–Altman plots (see ) confirm that both automated methods generated systematically larger volumes than manual tracing. We also examined the extent to which the automated measures performed poorly on the same images by comparing subjects whose volume differences approached or exceeded 2 SDs. Across the four plots, there were five data points showing an overestimation of automated volume compared to manual volume. However, these five data points represented images from five unique subjects, confirming that the automated techniques were not giving their worst performance on the same images. Thus, we find no evidence that gross anatomic anomalies were unduly influencing results.
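The outlier check described above can be sketched as follows: compute the automated-minus-manual difference for each subject, flag subjects whose difference lies more than 2 SDs from the mean difference, and test whether the flagged subjects coincide across methods. This is a minimal sketch with hypothetical subject IDs and volumes; the function name, the data, and the strict "beyond 2 SDs" rule are assumptions, not the authors' procedure.

```python
from statistics import mean, stdev


def bland_altman_outliers(auto, manual, subject_ids, k=2.0):
    """Return (bias, lower, upper, flagged) for paired volume measurements.

    bias is the mean automated-minus-manual difference; lower and upper are
    bias -/+ k sample SDs; flagged holds subjects outside those limits.
    """
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = mean(diffs)
    sd = stdev(diffs)
    lower, upper = bias - k * sd, bias + k * sd
    flagged = {s for s, d in zip(subject_ids, diffs) if d < lower or d > upper}
    return bias, lower, upper, flagged


# Hypothetical data in mm^3 (not study data).
ids = [f"s{i:02d}" for i in range(1, 11)]
manual = [3000, 3100, 3200, 3300, 3400, 3500, 3600, 3700, 3800, 3900]
method_a = [m + 100 for m in manual]
method_a[9] = manual[9] + 1000  # one large overestimate, subject s10
method_b = [m + 150 for m in manual]
method_b[2] = manual[2] + 1200  # one large overestimate, subject s03

bias_a, _, _, flagged_a = bland_altman_outliers(method_a, manual, ids)
bias_b, _, _, flagged_b = bland_altman_outliers(method_b, manual, ids)
shared = flagged_a & flagged_b
print(f"bias A = {bias_a}, bias B = {bias_b}, shared outliers = {sorted(shared)}")
```

A positive bias corresponds to systematic overestimation relative to manual tracing, and an empty intersection means the two methods did not perform worst on the same subjects.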

Shape analysis of segmentations

Group averaged 3D shape renderings generated for shape analyses demonstrate that FreeSurfer had better overall performance than FIRST in segmenting the hippocampus, particularly in the head and tail regions that are especially challenging. Difference maps (with variance) and significance maps for the hippocampus show shape differences between FIRST and manual tracing (), and shape differences between FreeSurfer and manual tracing (). The difference map for FIRST reveals that the head and tail had the largest shape difference and greater variance, indicating that much of the inflation in the FIRST volume estimates originates from extended surface estimations in the head and tail regions. The difference map for FreeSurfer shows the anterior-medial surface had prominent shape difference and some increased variance, indicating that this region was the major source of the inflated volume estimates from FreeSurfer when compared to hand tracing. Though ellipsoids indicate increased variance in the tail region, the difference maps indicate that the mean difference between surfaces is relatively small (suggested by the green color of the tail section). Significance maps (permutation corrected, *p*<0.05) confirm prominent shape differences between FIRST and manual tracing as well as between FreeSurfer and manual tracing. The latter comparison revealed less widespread shape differences, providing additional evidence that FreeSurfer performed favorably relative to FIRST in the hippocampus.

Shape analysis results suggest FIRST had better overall performance than FreeSurfer in segmenting the amygdala. Difference maps and significance maps for the amygdala show shape differences between FIRST and manual tracing (), and shape differences between FreeSurfer and manual tracing (). The maps suggest a general increase in volume in the anterior and posterior surfaces generated by FreeSurfer that is less pronounced with FIRST. This finding is consistent with the larger FreeSurfer-Manual volume difference (8.3%) than FIRST-Manual volume difference (4.5%) as represented in . Notably, the ellipsoids for both methods reflect the greater overall variance for the amygdala compared to the hippocampus. This is consistent with the relative variances seen in and . Significance maps (permutation corrected, *p*<0.05) confirm prominent shape differences between FreeSurfer and manual tracing as well as between FIRST and manual tracing. The latter comparison revealed less widespread shape differences, providing additional evidence that FIRST performed favorably relative to FreeSurfer in the amygdala.

Sample size estimation

Based on the additional variance introduced by FreeSurfer and FIRST methods relative to manual tracing of the hippocampus, FreeSurfer requires a relatively small increase in sample size over hand tracing for the entire range of effect sizes. On the other hand, a substantial increase in sample size is required if FIRST is used for measurement compared to hand tracing. For example, an effect size of 0.9, which reflects a change in volume of 414 mm^{3} (ES×SD=0.9×459) or about 12% of typical hippocampal volume (414/3560), would require a per group sample size of *n*=12 for manual tracing, *n*=14 for FreeSurfer, and *n*=24 for FIRST. Sample size estimates for a range of effect sizes (power=0.8; alpha=0.05) are shown in for each method.

The sample size estimates for amygdala measurements show that both FreeSurfer and FIRST required considerably larger numbers of subjects relative to manual tracing. For example, an effect size of 0.9, which reflects a change in volume of 145 mm^{3} (ES×SD=0.9×161) or about 10% of typical amygdala volume (145/1389), would require a per group sample size of *n*=12 for manual tracing, *n*=23 for FreeSurfer, and *n*=24 for FIRST. Sample size estimates for a range of effect sizes are shown in for each method.

It is important to note that standard deviation was the determining factor in estimation of sample size. Both automated segmentation methods introduced additional variance in measures of total volume relative to manual segmentation. For hippocampal measures, FreeSurfer (SD=501) and manual tracing (SD=459) introduced less variance than FIRST (SD=688). For amygdala measures, approximately equal variance was introduced by FreeSurfer (SD=234) and FIRST (SD=237), but both were greater than manual (SD=161).
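To illustrate why standard deviation drives these estimates: for a fixed raw volume change, the standardized effect size is the change divided by the SD, and under the usual normal approximation the required sample size scales with the inverse square of the effect size. The sample size for a method therefore grows, relative to manual tracing, by roughly (SD_method/SD_manual)². The sketch below applies this rule of thumb to the SDs reported above. It is an approximation that ignores small-sample corrections and is not the authors' exact power calculation, so it will not reproduce the reported *n* values exactly; the function name is hypothetical.

```python
def sample_size_inflation(sd_method, sd_reference):
    """Approximate factor by which n grows when SD rises from reference to method.

    Normal-approximation rule of thumb: n is proportional to 1/d^2 and
    d = (raw change) / SD, so n is proportional to SD^2.
    """
    return (sd_method / sd_reference) ** 2


# SDs in mm^3 as reported in the text.
sds = {
    "hippocampus": {"manual": 459, "FreeSurfer": 501, "FIRST": 688},
    "amygdala": {"manual": 161, "FreeSurfer": 234, "FIRST": 237},
}

inflation = {
    region: {
        method: sample_size_inflation(sd, by_method["manual"])
        for method, sd in by_method.items()
    }
    for region, by_method in sds.items()
}

for region, by_method in inflation.items():
    for method, factor in by_method.items():
        print(f"{region:12s} {method:10s} n multiplier ~ {factor:.2f}")
```

Under this approximation the hippocampal multiplier stays near 1 for FreeSurfer and exceeds 2 for FIRST, while both automated methods roughly double the requirement for the amygdala, in line with the pattern of reported estimates.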

Utility of methods in major depression

Analysis of hippocampal and amygdala volumes in MDD showed that hippocampal volume, measured by FreeSurfer, was reduced in depressed patients relative to controls (*t*(55)=2.22, *p*<0.04) but not when these same hippocampi were measured by FIRST (*t*(55)=0.54, *p*>0.59) (see ). The difference in volume between the MDD and control groups as measured by FreeSurfer was about 9%, consistent with published studies (Videbech and Ravnkilde, 2004). Given the large disparity in sample size between MDD and control groups, we randomly selected 10 participants from the control group to serve as a secondary reference group (see ). Consistent results were obtained showing lower hippocampal volumes associated with depression using FreeSurfer (*t*(17)=2.14, *p*<0.05) but not FIRST (*t*(17)=0.57, *p*>0.57). The MDD and control groups did not differ in total cerebral volume (*t*(17)=0.05, *p*>0.96). Neither method showed differences in amygdala volume associated with depression.

| **Table 2** Demographic and clinical characteristics of MDD and control groups |