Although a wide range of approaches have been developed to automatically assess the volume of brain regions from MRI, the reproducibility of these algorithms across different scanners and pulse sequences, their accuracy in different clinical populations and their sensitivity to real changes in brain volume have not always been comprehensively examined. Firstly, we present a comprehensive testing protocol which comprises 312 freely available MR images to assess the accuracy, reproducibility and sensitivity of automated brain segmentation techniques. Accuracy is assessed in infants, young adults and patients with Alzheimer’s disease in comparison to gold standard measures by expert observers using a manual technique based on Cavalieri’s principle. The protocol determines the reliability of segmentation between scanning sessions, different MRI pulse sequences and 1.5T and 3T field strengths and examines their sensitivity to small changes in volume using a large longitudinal dataset. Secondly, we apply this testing protocol to a novel algorithm for segmenting the lateral ventricles and compare its performance to the widely used FSL FIRST and FreeSurfer methods. The testing protocol produced quantitative measures of accuracy, reliability and sensitivity of lateral ventricle volume estimates for each segmentation method. The novel algorithm showed high accuracy in all populations (intraclass correlation coefficient, ICC>0.95), good reproducibility between MRI pulse sequences (ICC>0.99) and was sensitive to age related changes in longitudinal data. FreeSurfer demonstrated high accuracy (ICC>0.95), good reproducibility (ICC>0.99) and sensitivity whilst FSL FIRST showed good accuracy in young adults and infants (ICC>0.90) and good reproducibility (ICC=0.98), but was unable to segment ventricular volume in patients with Alzheimer’s disease or healthy subjects with large ventricles. 
Using the same computer system, the novel algorithm and FSL FIRST processed a single MRI image in less than 10 minutes while FreeSurfer took approximately 7 hours. The testing protocol presented enables the accuracy, reproducibility and sensitivity of different algorithms to be compared. We also demonstrate that the novel segmentation algorithm and FreeSurfer are both effective in determining lateral ventricular volume and are well suited for multicentre and longitudinal MRI studies.
A range of automated segmentation algorithms are available for determining the volume of various local brain regions, including widely applied techniques such as FreeSurfer (Fischl et al., 2002), FSL FIRST (Patenaude et al., 2007), ANIMAL (Collins et al., 1999) and the LONI pipeline (Macdonald et al., 1994). Since their development these algorithms have been applied to neurological and psychiatric disorders such as Alzheimer’s disease (Cherubini et al., 2010), multiple sclerosis (Benedict et al., 2009) and schizophrenia (Kuperberg et al., 2003) and are also being used to investigate the developing brain in childhood and adolescence (Lenroot et al., 2007). However, early validation studies were limited to healthy young adults and did not report between session, pulse sequence or scanner reproducibility; measures of sensitivity to changes in regional brain volume were rarely presented. These issues are critically important for multi-centre and longitudinal studies, where segmentation algorithms should be sensitive to small changes in brain volume but insensitive to the use of different magnetic resonance imaging (MRI) scanners (reflecting differences in scanner hardware and software and performance differences between otherwise identical scanners). Another consideration is that some algorithms are not able to segment particular types of images, or require varying degrees of user intervention and therefore may become impractical for studies with large cohorts. 
These problems may explain why manual segmentation of brain regions is still commonplace in the literature (Doty et al., 2008; Dutt et al., 2009; Ettinger et al., 2007; Jack et al., 2008b). Recently, more rigorous studies have been published comparing segmentation algorithms in terms of accuracy (Babalola et al., 2009; Morey et al., 2009), test-retest reproducibility (Morey et al., 2010), sensitivity to changes in brain structure (Bergouignan et al., 2009), and the effect of MRI acquisition parameters on segmentation reproducibility in terms of global (de Boer et al., 2010; Shuter et al., 2008), subcortical and cortical volumes (Jovicich et al., 2009; Wonderlick et al., 2009). However, to our knowledge, no publicly available dataset exists that may be used to measure segmentation performance in terms of all the above parameters.
The aim of this paper is two-fold: a) to directly address this point by developing a comprehensive testing protocol to determine the accuracy, reproducibility and sensitivity of MRI neuroanatomical segmentation techniques using publicly available data which can be used by other investigators and b) to apply the testing protocol to assess lateral ventricle segmentation using a new fully automated technique and to compare this with two popular freely available packages, FreeSurfer and FSL FIRST.
Specifically with respect to lateral ventricle segmentation:
Our focus on the lateral ventricles is of clinical relevance and research interest because increased volume of this region has been implicated in a number of psychiatric and neurological disorders. Dilation of the lateral ventricles is one of the most consistent findings in both schizophrenia (Kempton et al., 2010; Wright et al., 2000) and bipolar disorder (Kempton et al., 2008). Although hippocampal volume reduction is the most prominent finding, ventricular volume increase is also a key sign of progression in Alzheimer’s disease (Zakzanis et al., 2003) and mild cognitive impairment (Carmichael et al., 2007).
The segmentation testing protocol is described, followed by a description of a novel algorithm used to segment the lateral ventricles. Finally we demonstrate how the segmentation testing protocol is applied to assess the novel algorithm, FSL FIRST and FreeSurfer.
To establish the accuracy of the segmentation algorithms, manually determined lateral ventricle regions of interest (ROIs) were used as the ‘gold standard’ in each of the three groups described below (one independent rater for each group). The ROI analysis was conducted on the basis of stereological techniques and the Cavalieri principle implemented in PC-based software (MEASURE) which has been validated (Barta et al., 1997) and extensively used in ROI studies (Keller et al., 2009; McAlonan et al., 2002; McDonald et al., 2006). MEASURE superimposes a grid on the image and grid points falling within the lateral ventricles were manually marked by a trained rater. The region comprised the entire lateral ventricular system including the temporal horns. The lateral ventricles are bordered medially by the corpus callosum, septum pellucidum and interventricular foramen, anteriorly by the frontal cortex and posteriorly by the occipital cortex. Head tilt was corrected using manual reorientation in MEASURE in all brains before measurements to align images along the anterior commissure–posterior commissure (AC-PC) line and the interhemispheric fissure. A grid setting of 1×1×1 was used so that one grid point fell on each voxel in the image. The software allows the user to zoom in and out and view the image in 3 orthogonal planes. When using a high zoom setting trilinear interpolation is applied to the images; however, the grid points are always displayed as one pixel. Raters were trained to use their judgement in classifying voxels which were affected by partial volume effects. For each of the three groups below the lateral ventricles were analyzed by the rater on two occasions to obtain intra-rater reliability estimates and a random selection of 5 images from each group was analyzed by an independent rater to obtain inter-rater reliability estimates. 
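The point-counting step behind the Cavalieri principle can be illustrated with a short sketch. This is not the MEASURE implementation; the function name, the per-slice counts and the choice of voxel size are hypothetical and serve only to show how marked grid points translate into a volume when one grid point falls on each voxel.

```python
def cavalieri_volume_ml(points_per_slice, grid_spacing_mm, slice_spacing_mm):
    """Estimate a volume from grid-point counts (Cavalieri principle).

    points_per_slice: number of marked grid points on each counted slice
    grid_spacing_mm:  (dx, dy) in-plane distance between grid points
    slice_spacing_mm: distance between consecutive counted slices
    """
    dx, dy = grid_spacing_mm
    area_per_point_mm2 = dx * dy            # area represented by one grid point
    total_points = sum(points_per_slice)
    volume_mm3 = total_points * area_per_point_mm2 * slice_spacing_mm
    return volume_mm3 / 1000.0              # 1 ml = 1000 mm^3


# Hypothetical counts for five slices of a 0.94 x 0.94 x 1.50 mm image
counts = [120, 340, 410, 380, 150]
volume = cavalieri_volume_ml(counts, (0.94, 0.94), 1.5)
print(f"estimated volume: {volume:.3f} ml")
```

With a 1×1×1 grid setting the estimate reduces to the number of marked voxels multiplied by the voxel volume, which is why the rater's judgement on partial volume voxels directly affects the result.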
Participants took part in this study in accordance with the Declaration of Helsinki and the procedures were approved by local ethics committees.
Seven young adults (mean ± SD, age = 23.8 ± 4.1 years) were scanned using a 1.5 Tesla GE Signa MRI scanner (General Electric, Milwaukee, WI). Images were acquired in the coronal plane using a three dimensional, T1 weighted, inversion-recovery prepared, steady state, spoiled gradient-echo pulse sequence (TR = 9.1 ms, TE = 2 ms, TI = 450 ms, flip angle = 20 degrees, slice thickness = 1.5 mm, matrix size = 256×256, voxel dimensions = 0.94×0.94×1.50 mm3, averages = 1, images available at http://sites.google.com/site/brainseg).
Nine patients with Alzheimer’s disease (age = 77.4 ± 2.4 years, 6 females, mini mental state exam (MMSE) score = 23.7 ± 3.5, clinical dementia rating = 1.1 ± 0.4) were scanned using a 1.5 T General Electric Signa HDx MRI scanner (General Electric, Milwaukee, WI). One patient’s diagnosis was subsequently changed to depression with memory problems. Data acquisition was designed to be compatible with the Alzheimer Disease Neuroimaging Initiative (ADNI) (Jack et al., 2008a). Following a three plane localizer, a high resolution sagittal 3D MP-RAGE dataset was acquired (TR = 8.6 ms, TE = 3.8 ms, TI = 1000 ms, flip angle = 8 degrees, slice thickness = 1.2 mm, matrix size = 256×256, voxel dimensions = 0.938×0.938×1.2 mm3, averages = 1, images available from http://sites.google.com/site/brainseg).
The infant dataset was collected by an independent research group (Gousias et al., 2008) and is available at http://www.brain-development.org/. We selected a subset of 10 structural MRIs (subjects 2, 4, 8, 11, 14, 18, 21, 22, 26 and 27) from the full sample of 32 two-year-old infants born prematurely (age = 24.8 ± 2.4 months, 16 females). Sagittal T1 weighted volumes were acquired from each subject (1.0T Philips HPQ scanner, TR = 23ms, TE = 6ms, slice thickness = 1.6 mm, matrix size = 256×256, voxel dimensions = 1.04×1.04×1.6 mm3 resliced to 1.04×1.04×1.04 mm3).
To assess the reproducibility of the segmentation algorithms in the same subjects using the same MRI scanner and pulse sequence, we used the Open Access Series of Imaging Studies (OASIS, www.oasis-brains.org) database which includes structural MRI scans from 20 subjects (age = 23.4 ± 4.0 years, 8 females) who were scanned using the same pulse sequence (1.5T Siemens Vision scanner, TR = 9.7 ms, TE = 4 ms, TI = 20 ms, flip angle = 10 degrees, slice thickness = 1.25 mm, matrix size = 256×256, voxel dimensions = 1×1×1.25 mm3 resliced to 1×1×1 mm3, averages = 1) on 2 occasions within 90 days (Marcus et al., 2007).
To determine the consistency of the segmentations when different MRI scanners and pulse sequences were used, 9 adults (age = 28 ± 8.5 years, 6 females) were each scanned using two MRI scanners (1.5T and 3.0T General Electric Signa HDx scanner) with 4 different pulse sequences in each scanner (8 images per subject in total, mean inter-scan interval between 1.5T and 3T scanner = 6.7 ± 4.2 days). The pulse sequences were all T1 weighted volumetric scans (see Table 1 for MRI sequence parameters, images available from http://sites.google.com/site/brainseg).
For an extreme test of between pulse sequence reproducibility we obtained T2 weighted images collected for clinical reporting and T1 weighted scans from the same 15 young adults (age = 36.3 ± 13.4 years, 9 females). The images were acquired using the 1.5T scanner above with an axial T2 weighted sequence (TR = 3000 ms, TE = 97 ms, flip angle = 90 degrees, slice thickness = 3mm, matrix size = 256×256, voxel dimensions = 0.94×0.94×3 mm3, averages = 1) and sagittal T1 weighted scans (pulse sequence A1, Table 1, images available from http://sites.google.com/site/brainseg).
Ventricular volume is known to increase with age in healthy adults from post-mortem (Hubbard and Anderson, 1981), CT (Schwartz et al., 1985) and MRI studies (Scahill et al., 2003). To examine the sensitivity of the algorithms to small changes in ventricular volume we used the longitudinal OASIS dataset (Marcus et al., 2009) which includes T1 weighted MR image pairs (1.5T Siemens Vision scanner, TR = 9.7 ms, TE = 4 ms, TI = 20 ms, flip angle = 10 degrees, slice thickness = 1.25 mm, matrix size = 256×256, voxel dimensions = 1×1×1.25 mm3, averages = 1) from the same healthy volunteers acquired at two time points (72 subjects, age at baseline scan = 75.4 ± 8.1 years, 50 females, mean inter-scan interval = 738 ± 249 days). The sensitivity of the algorithms was assessed by their ability to detect, i) a change in ventricular volume between the baseline and follow-up scan and ii) their ability to detect a correlation between the change in ventricular volume and the inter-scan interval.
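As a cross-check on the size of the protocol, the dataset descriptions above can be tallied. The sketch below is one reading of those descriptions (one image per subject in the accuracy groups, two per subject for the test-retest, T1/T2 and longitudinal sets, eight per subject for the scanner and pulse sequence set); the grouping labels are ours, and under this reading the total matches the 312 images quoted for the protocol.

```python
# (dataset label, number of subjects, images per subject)
datasets = [
    ("accuracy: young adults", 7, 1),
    ("accuracy: Alzheimer's disease", 9, 1),
    ("accuracy: infants", 10, 1),
    ("test-retest (OASIS)", 20, 2),
    ("inter-scanner / pulse sequence", 9, 8),
    ("T1 versus T2", 15, 2),
    ("longitudinal (OASIS)", 72, 2),
]

# Images contributed by each dataset, then the protocol total
images = {name: subjects * per_subject for name, subjects, per_subject in datasets}
total_images = sum(images.values())

for name, count in images.items():
    print(f"{name:32s} {count:4d}")
print(f"{'total':32s} {total_images:4d}")
```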
The average time taken for each algorithm to process one MR image was determined by processing 10 randomly chosen images from the OASIS dataset. All algorithms were run on a 2× Quad Core Xeon E5450 3.0GHz computer with 56Gb RAM using the CentOS 5.4 Linux operating system.
Our novel algorithm for segmentation of the lateral ventricles, named ALVIN (Automatic Lateral Ventricle delIneatioN), uses ‘unified segmentation’ in SPM8 (Ashburner and Friston, 2005). Unified segmentation produces gray matter, white matter and cerebrospinal fluid (CSF) images from MRI data but does not segment subcortical structures. ALVIN works by applying a binary mask to spatially normalised CSF segmented images produced using unified segmentation. As the segmented images already demarcate the main boundaries of the lateral ventricles, the purpose of the mask was to exclude CSF outside the lateral ventricles, such as the third ventricle, superior cistern and sulcal CSF.
There is large inter-subject variability in the size and shape of the lateral ventricles even after spatial normalisation into Montreal Neurological Institute (MNI) space. Therefore it was important that the boundaries of the mask were made with reference to a large representative population. We used the healthy control sample from the cross-sectional OASIS (www.oasis-brains.org) database which includes structural MRI scans from 316 healthy subjects aged 18 to 94 (Marcus et al., 2007). Images were averaged from 3 to 4 MP-RAGE scans (TR = 9.7 ms, TE = 4 ms, TI = 20 ms, flip angle = 10 degrees, slice thickness = 1.25 mm, matrix size = 256×256, voxel dimensions 1×1×1.25 mm3) obtained from the same subject on the same day. Modulated normalised CSF images were produced using unified segmentation in SPM8 with default options (Ashburner and Friston, 2005). Unified segmentation performs image registration, bias field correction, and tissue segmentation in one generative model. Images are spatially normalised into MNI space using affine transformations and non-linear basis functions; volume information at each voxel is conserved by multiplying tissue density values by the Jacobian determinant. We also applied a standard SPM algorithm (clean_gwc) which removes incorrectly segmented gray and white matter using an iterative conditional dilation and smoothing technique applied over combined gray and white matter maps. Of the 316 scans in the database 275 were successfully segmented by SPM. There was no significant difference in age or gender between subjects where segmentation had been successful compared to failed segmentations (both p>0.39). The face removal algorithm used by Marcus et al. (2007) to ensure subject anonymity may have increased the segmentation failure rate, as priors used by SPM8 include facial features. A mean CSF image from the 275 segmented images was produced to enable the delineation of the lateral ventricle mask. 
The outline of the mask was drawn using the ROI tool in MRIcro v1.40 (Rorden and Brett, 2000). To highlight all regions where CSF voxels existed in every subject, the mean CSF image was viewed using intensity window centre and width settings of 0.05 and 0.1 respectively (Figure 2). The mask boundaries were hand drawn to include the entire lateral ventricular system including the temporal horns. In a small number of regions/coordinates lateral ventricular CSF and non-ventricular CSF overlapped between subjects in normalised space (e.g. at the fornix boundary between the lateral ventricles and superior cistern, and in the occipital lobe between the posterior horn and parieto-occipital sulcus). For these regions the image contrast was reduced and the mask boundary was marked within the CSF signal local minimum to ensure the ventricular CSF was included for the majority of subjects at these particular coordinates.
Unified segmentation in SPM8 was applied to each test image to produce a modulated CSF image which was multiplied by the binary mask giving a three-dimensional image of the lateral ventricles in MNI space. As the data was modulated, absolute volume of the lateral ventricles was calculated by summing the intensity over the entire normalised image.
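The masking and summation step can be sketched as follows. This is a simplified illustration rather than the ALVIN code itself: images are represented as flat lists of voxel values, the function name and toy numbers are hypothetical, and the multiplication by the template voxel volume is our assumption about how the summed intensity is expressed in millilitres.

```python
def masked_volume_ml(modulated_csf, mask, voxel_volume_mm3):
    """Sum modulated CSF values inside a binary mask and convert to ml.

    modulated_csf:    flat sequence of modulated CSF values in template space
    mask:             flat sequence of 0/1 values with the same length
    voxel_volume_mm3: volume of one template-space voxel (assumed known)
    """
    if len(modulated_csf) != len(mask):
        raise ValueError("image and mask must have the same number of voxels")
    # Multiplying by the mask zeroes CSF outside the lateral ventricles
    masked_sum = sum(value * inside for value, inside in zip(modulated_csf, mask))
    return masked_sum * voxel_volume_mm3 / 1000.0


# Toy six-voxel image with hypothetical values; 1.5 mm isotropic voxels assumed
csf = [0.9, 1.2, 0.0, 0.4, 0.8, 0.1]
mask = [1, 1, 0, 0, 1, 0]
vol = masked_volume_ml(csf, mask, 1.5 ** 3)
print(f"masked CSF volume: {vol:.7f} ml")
```

Modulated values may exceed 1 because they carry the Jacobian scaling from normalisation, which is what allows a sum in template space to recover a native-space volume.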
The testing protocol was applied to the ALVIN algorithm described above, FSL (v4.1.7) FIRST and FreeSurfer (v4.5.0). Briefly the FSL FIRST algorithm performs subcortical volumetric and shape analysis using models constructed from manually segmented images (Patenaude, 2007). Initially the FIRST algorithm normalises the MR image into MNI space, after which the normalisation is checked manually. The spatial transformation is used to fit a subcortical mask to the image and a segmentation algorithm with a model of the left and right lateral ventricle is used to segment these structures. The algorithm requires the number of modes of variation as input, which is set to 40 for the lateral ventricles (as recommended by the authors of FSL FIRST). Finally a boundary correction algorithm which uses FSL’s segmentation tool, FAST is applied before the volume of the lateral ventricles is determined. The FreeSurfer package may be used to conduct subcortical segmentation and cortical surface parcellation. For the analysis used in this study the FreeSurfer pipeline (Fischl et al., 2002) performed intensity correction and skull stripping, followed by gray and white segmentation and segmentation of subcortical structures including the lateral ventricles using an atlas based approach. For each side of the brain FreeSurfer outputs two segmentations which are named ‘lateral ventricles’ and ‘inferior lateral ventricles’; the volumes of these regions were summed to obtain total lateral ventricle volume. The performance of the ALVIN algorithm was tested using SPM8, however to determine if the algorithm was compatible with SPM5 we also applied the entire testing protocol to ALVIN using both SPM versions. To determine spatial overlap in the segmentation produced by ALVIN and FSL FIRST and FreeSurfer, it was necessary to convert the segmented image produced by ALVIN from MNI space to native space. 
This was achieved by applying the inverse spatial normalisation parameters for each subject to the ALVIN binary mask of the lateral ventricles. The binary mask in native space was then used to mask a CSF segmented image in native space produced by the unified segmentation step. Finally the image of the lateral ventricles in native space was thresholded at 0.5 to produce a binary segmented image.
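A minimal sketch of the native-space binarisation, together with the Dice overlap it was prepared for, is given below. The flat-list image representation, the function names and the toy values are illustrative assumptions, as is the choice of a strict "greater than 0.5" comparison; the Dice coefficient is used in its standard form, 2|A∩B| / (|A| + |B|).

```python
def binarise(native_csf, native_mask, threshold=0.5):
    """Mask a native-space CSF image and threshold it to a 0/1 segmentation."""
    return [1 if value * inside > threshold else 0
            for value, inside in zip(native_csf, native_mask)]


def dice(seg_a, seg_b):
    """Dice coefficient for two binary segmentations of equal length."""
    overlap = sum(a * b for a, b in zip(seg_a, seg_b))
    size = sum(seg_a) + sum(seg_b)
    # Two empty segmentations are treated as perfectly overlapping
    return 2.0 * overlap / size if size else 1.0


# Hypothetical five-voxel example
csf = [0.9, 0.6, 0.3, 0.7, 0.2]
mask = [1, 1, 1, 0, 0]
seg_masked = binarise(csf, mask)       # fourth voxel is removed by the mask
seg_other = [1, 0, 0, 1, 0]            # segmentation from another method
overlap_score = dice(seg_masked, seg_other)
print(seg_masked, overlap_score)
```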
To quantify accuracy and reproducibility we used the Intraclass Correlation Coefficient (ICC) measure (single measure, 2-way mixed consistency) (McGraw and Wong, 1996; Yaffee, 1998). For accuracy results, the ICC quantifies how well the automated segmentations agree with the gold standard measures, for reproducibility measures the ICC value quantifies the consistency of the segmentations. ICC values were calculated after the exclusion of failed segmentations, and were not calculated if more than 50% of segmentations failed. Spatial overlap of segmentations was assessed using the Dice coefficient (Crum et al., 2005). For the sensitivity analysis a paired t-test was used and the result was converted to a Z-score and Pearson’s r was used to determine the correlation between volume change and inter-scan interval. Statistical calculations were performed with SPSS 15.0 (SPSS Inc.) except for power calculations which were carried out using GPOWER 3.0 (Faul et al., 2007).
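For readers who wish to reproduce the agreement statistic outside SPSS, the single measure, two-way, consistency ICC can be computed from the mean squares of a subjects × measurements table, following the ICC(C,1) definition of McGraw and Wong (1996). The sketch below is an independent illustration with hypothetical volumes, not the SPSS routine used in the study.

```python
def icc_consistency(ratings):
    """Single-measure, two-way, consistency ICC (McGraw and Wong ICC(C,1)).

    ratings: list of rows, one per subject; each row holds k measurements
             (for example manual and automated volumes, or scan 1 and scan 2)
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between methods
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical volumes (ml): second method reads exactly 5 ml higher
offset_only = [[10, 15], [20, 25], [30, 35], [40, 45]]
icc_offset = icc_consistency(offset_only)

# Hypothetical volumes with small unsystematic disagreement
noisy = [[10, 11], [20, 19], [30, 32], [40, 39]]
icc_noisy = icc_consistency(noisy)
print(icc_offset, round(icc_noisy, 4))
```

The first example shows why a consistency ICC can equal 1 even when one method systematically reports larger volumes: the between-method mean square is excluded from the denominator.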
By visually inspecting each segmentation we recorded the number of cases where the algorithms failed to segment the lateral ventricles (see Figure 3 for examples). For consistency we did not attempt to adjust the default parameters in each algorithm and re-run the segmentation step or manually adjust the images.
The intra-rater agreement (same rater), in terms of intra-class correlation coefficients (ICCs) for gold standard manual segmentation of the lateral ventricles was 0.994 for young adults, 0.999 in patients with Alzheimer’s disease and 0.973 in infants. Inter-rater reliability (independent raters) for adults, patients with Alzheimer’s disease and infants was 0.995, 0.991 and 0.993 respectively. All three algorithms demonstrated high accuracy compared to manual gold standard segmentation (Table 2). In terms of segmentation failures, the Alzheimer’s disease images were the most problematic for all 3 segmentation algorithms, particularly FSL FIRST which was unable to segment any of the images (Table 3). Overall FreeSurfer demonstrated the highest segmentation accuracy. Segmentation consistency was good between the three algorithms (Table 4). ICC and Dice coefficient measures both indicated that ALVIN and FreeSurfer most closely agreed, except for the young adults dataset where the latter measure suggested a closer agreement between ALVIN and FSL FIRST. In terms of absolute volume measures (Table 5) ALVIN reported a consistently higher volume than the other techniques.
ALVIN showed the highest test-retest and T1/T2 reproducibility (Table 2). FSL FIRST showed good reproducibility, but suffered from a reasonably high failure rate on the inter-scanner/ pulse-sequence dataset. FreeSurfer demonstrated good reproducibility in the test-retest dataset and the highest inter-scanner/pulse-sequence reproducibility, but was unable to process any of the clinical T2 weighted images. Absolute volume estimates (Table 6) also showed highly consistent values between paired scans, and demonstrated that ventricular volume estimates were higher from T2 weighted images.
ALVIN and FreeSurfer were able to detect a change in ventricular volume between baseline and follow-up scan, estimating an increase in volume of 2.7 ml and 2.5 ml respectively over the 2 year period. Both algorithms were also able to detect the expected correlation between the volume increase and interscan interval. FSL FIRST had a failure rate of 63% which precluded a sensitivity analysis. A power analysis suggests that ALVIN would require a sample size of 11 subjects, and FreeSurfer a sample size of 13 subjects to detect a significant change in lateral ventricle volume between the baseline and follow-up scan (two tailed, alpha=0.05, power=0.8). In terms of detecting a correlation between change in ventricular volume and interscan interval, ALVIN and FreeSurfer would require a sample size of 26 and 19 subjects, respectively (two tailed, alpha=0.05, power=0.8).
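The two sensitivity tests can be illustrated with a small self-contained sketch: a paired t statistic on baseline and follow-up volumes, and Pearson's r between volume change and inter-scan interval. All numbers below are hypothetical and are not taken from the OASIS data; the conversion of the t statistic to a Z-score and the power calculation are not reproduced here.

```python
import math


def paired_t(baseline, followup):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [f - b for b, f in zip(baseline, followup)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n), n - 1


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical ventricular volumes (ml) and inter-scan intervals (days)
baseline = [30.0, 42.0, 25.0, 51.0, 38.0]
followup = [32.0, 45.0, 26.0, 55.0, 40.0]
interval_days = [600, 800, 400, 1000, 700]

t_stat, dof = paired_t(baseline, followup)
change = [f - b for b, f in zip(baseline, followup)]
r = pearson_r(change, interval_days)
print(f"t({dof}) = {t_stat:.3f}, r = {r:.3f}")
```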
Segmentation failures were conspicuous during visual inspection and were characteristic for each algorithm. ALVIN failures most commonly occurred at the SPM unified segmentation stage where the scalp was incorrectly classified as CSF (example shown in Figure 3b). FSL FIRST failures occurred at the main segmentation stage and were revealed when the segmented lateral ventricles were overlaid on the MRI scan; as shown (Figure 3a) only small fragments of the lateral ventricles were segmented. FreeSurfer segmentation errors occurred at the normalisation or skull stripping stage and were recognised by the algorithm which terminated the procedure. ALVIN demonstrated the lowest segmentation failure rate of 3.2% over all images followed by FreeSurfer with a failure rate of 9.6% and FSL FIRST with a failure rate of 36.2%.
Manual segmentations took approximately 80 minutes per subject. ALVIN and FSL FIRST were both faster than manual segmentation taking 8 and 7 minutes respectively. FreeSurfer was an order of magnitude slower than the other algorithms taking approximately 7 hours, although during this time the software segmented a number of subcortical structures, as it was not possible to segment the lateral ventricles only (Table 2).
Lateral ventricles volumes obtained using ALVIN in SPM8 agreed well with those produced using SPM5 (ICC>0.999 over all images in the testing protocol) suggesting the ALVIN algorithm worked effectively with both versions of SPM.
The ALVIN algorithm which takes MRI images in native space as inputs and outputs ventricular volumes, is freely available as an SPM extension and may be downloaded from sites.google.com/site/brainseg. The images used in the testing protocol are freely available and may be downloaded from the websites listed in Table 7.
We have developed a testing protocol for assessing the accuracy, reproducibility and sensitivity of segmentation algorithms based on publicly available data and validated a conceptually simple technique for automatically extracting the lateral ventricles. The availability of the testing protocol will enable other researchers to validate future segmentation algorithms.
We primarily used intraclass correlation coefficients (ICC) to measure reproducibility and accuracy. Although segmentation may be assessed with metrics which measure the overlap of regions (Fischl et al., 2002) this was problematic with our data because the gold standard measures were made on the basis of the Cavalieri principle, and the software used did not produce volumetric image files representing manual segmentations. However, we were able to quantify segmentation overlap between the automated algorithms using the Dice coefficient. The ICC measure of agreement is a widely used statistic, spanning genetics (Gibert et al., 1998), functional neuroimaging (Caceres et al., 2009) and clinical rating scales (Nuechterlein et al., 2008), and is also the standard measure for assessing intra-rater and inter-rater reliability on manually drawn ROIs in structural MRI studies (DeLisi et al., 1997; Doty et al., 2008; McClure et al., 2006) and has previously been used to determine the reliability of FreeSurfer (Wonderlick et al., 2009) and FSL FIRST (Morey et al., 2010). We used an ICC which measures consistency rather than absolute agreement, thus high ICCs reported in this paper suggest the segmentation algorithms would give very similar statistical results when comparing two groups of subjects. However, as each algorithm is likely to give a systematic difference in volume (Table 5) it would not be possible to combine data produced by different algorithms in a single study.
The validation dataset within the testing protocol is comparable to the Internet Brain Segmentation Repository (http://www.cma.mgh.harvard.edu/ibsr/), a dataset which includes 18 MRI scans with manual segmentations of 43 individual structures. Our dataset includes manual segmentation of the lateral ventricles only, but includes infants and patients with Alzheimer’s disease to reflect a wider range of brain morphology. A related online resource, BrainWeb (Cocosco et al., 1997) (http://mouldy.bic.mni.mcgill.ca/brainweb/) allows the user to enter customizable MRI sequence parameters to produce a simulated MRI image of the brain and others have highlighted the importance of simulation for segmentation (Simmons et al., 1996). The BrainWeb tool has been used to validate a number of segmentation algorithms (Amato et al., 2003; Chao et al., 2009). Our dataset of 9 individuals scanned with 8 sequences at 1.5T and 3T could also be used to verify that segmentation algorithms are reproducible when applied to images using a range of MRI parameters.
To assess sensitivity we examined the impact of aging on lateral ventricle volume, as this is a reasonably robust effect (Scahill et al., 2003). However, not knowing the real change in ventricular volume is problematic in assessing the sensitivity of these algorithms. A different approach is to use simulated data, such as Camara et al. (2008) who used a deformation model to mimic atrophy in Alzheimer’s disease to assess algorithms which measure brain atrophy. The advantage of simulated data is that the investigator precisely knows the location and magnitude of the changes that have occurred; however, such an approach relies on the simulation accurately mirroring the effects of pathology on brain structure which may not always be possible and does not reflect small differences that might be caused by, for example, changes in head positioning, hydration and RF coil performance over time. Within our testing protocol it would have been preferable to use a larger group of validation images.
In this study our strategy was to use subgroups which reflected a wide range in brain morphology rather than one large healthy adult group. Our hope is that other investigators will add to the pool of publicly available manually segmented images allowing future algorithms to be validated against a larger number of healthy adults and patients with other neurological and psychiatric disorders. A valid criticism of a study which compares an investigator’s own algorithm against others is that there may be biases in selecting the testing protocol. However, in this study we have used data that is already publicly available where possible and made our own additional data and software freely available so that other researchers may verify the methods we have used.
Using the testing protocol we have validated ALVIN, our segmentation algorithm, in adults, patients with Alzheimer’s disease, and infants, and shown it to be reliable between MRI scanners and pulse sequences and sensitive to small changes in ventricular volume. In developing this technique we have built on existing neuroimaging software and datasets; ALVIN relies on the unified segmentation methodology developed by Ashburner and Friston (2005) and the ventricular mask was based on the representative OASIS MRI dataset (Marcus et al., 2007). The algorithm was comparable to FreeSurfer in terms of accuracy, reproducibility and sensitivity. ALVIN produced volume estimates which were higher than manual segmentation values and the other automated techniques. This was most marked in the infant dataset although the other two automated techniques also gave higher values than manual segmentation. Inspection of the infant dataset segmented with ALVIN revealed that in some cases small parts of the superior cistern and parieto-occipital sulcus were classified as ventricular CSF due to their relative position in normalised MNI space. Unfortunately altering the ventricular mask to improve segmentation in infants would adversely impact segmentation in older age groups due to increased size of the ventricles with aging. A possible improvement would be for ALVIN to automatically select different ventricular masks based on brain structure, or to use a more accurate spatial normalisation procedure to closely match ventricular size and shape to a standard template. Differences in absolute volume estimates between the manual and automated methods are also likely to arise from partial volume effects. Both manual and automated methods use intensity information when classifying voxels; however, small differences in the threshold applied may lead to different volume estimates particularly in structures with a large surface area to volume ratio. 
Clinical and research questions are usually concerned with volume differences, either between patients and controls, or baseline and follow-up scans where reproducibility may be more important than absolute volume. As highlighted previously it is not possible to combine data from different algorithms in a single study if the algorithms exhibit systematic differences. ICC values demonstrated that ALVIN closely agreed with manual measures in terms of the relative distribution of volumes in a group and was also sensitive to longitudinal changes in volume. In terms of processing speed, if a user required ventricular volumes, ALVIN was 50 times faster than FreeSurfer. However, an important limitation of ALVIN is that it is only able to segment a single region while both FSL FIRST and FreeSurfer are able to segment a number of cortical and subcortical regions.
In terms of previous reproducibility studies, Morey et al. (2010) reported test-retest ICC values of 0.993–0.999 for the lateral ventricles segmented using FreeSurfer, which compared well with our value of 0.998, and ICC values of 0.977–0.998 for FSL FIRST, which agreed with the value of 0.996 reported in this study. Our results also concur with Jovicich et al. (2009), who reported that, using FreeSurfer, different T1 weighted images had only a relatively small effect on segmented lateral ventricle volume compared to inter-subject variability.
The FreeSurfer algorithm was valid in all groups, while the FSL FIRST algorithm was valid in young adults and infants; both techniques demonstrated good reproducibility. The poor performance of FSL FIRST in patients with Alzheimer’s disease was surprising, especially as the training dataset used to develop the algorithm included patients with Alzheimer’s disease (Patenaude, 2007). Inspection of the FIRST segmented images revealed that in some cases only small sections of the ventricles were identified, leading to erroneous volume estimates. In addition, the ventricular model within FIRST did not appear to include the temporal horn, which contributed to a lower accuracy estimate. Examination of the data showed that FIRST had particular problems with large ventricles and was not able to segment any images with ventricles larger than 35 ml. Thus FIRST failure rates were particularly high in the Alzheimer dataset and in the OASIS longitudinal dataset used in the sensitivity analysis (Table 3), which included participants with a mean age of 75. The poor performance of FSL FIRST ventricular segmentation is unusual for FSL structural MRI processing tools. Indeed, a recent publication has shown that FSL FIRST accurately segments other subcortical structures (Patenaude et al., 2011), and in our own studies we have previously demonstrated that FSL SIENA was sensitive enough to detect small changes in brain morphology from acute dehydration (Kempton et al., 2009).
The ability of ALVIN and FSL FIRST to segment the lateral ventricles from clinical T2 weighted images is useful, as it demonstrates that the techniques may be used with lower resolution data, allowing the algorithms to be applied to older images or to studies where acquisition time must be kept to a minimum.
ALVIN and FreeSurfer are well suited to multicentre and/or longitudinal studies due to their relatively high inter-scanner reproducibility and sensitivity to changes in ventricular volume. For multicentre projects the different scanners would still need to be modelled at the statistical analysis stage; however, by using these algorithms, the inter-scanner variance would be efficiently accounted for, increasing sensitivity to small changes in ventricular size.
The authors acknowledge financial support from the National Institute for Health Research (NIHR) Specialist Biomedical Research Centre for Mental Health award to the South London and Maudsley NHS Foundation Trust and the Institute of Psychiatry, King’s College London. W.R. Crum acknowledges support from the King’s College London Centre of Excellence in Medical Engineering funded by the Wellcome Trust and EPSRC (WT 088641/Z/09/Z). We are grateful to the Open Access Structural Imaging Series for the use of this data and include their following grant numbers: P50 AG05681, P01 AG03991, R01 AG021910, P20 MH071616, U24 RR021382. Ulrich Ettinger acknowledges support from the Deutsche Forschungsgemeinschaft (ET 31/2-1).