Automated brain segmentation algorithms segment a structural magnetic resonance imaging (MRI) image into different tissue classes. In general, an MRI image is segmented into gray matter, white matter, and cerebrospinal fluid. Based on this segmentation, methods are available to calculate several neuroanatomical measures, for example gray matter volume, gray matter density, cortical thickness, or cortical curvature. Researchers use these measures to investigate differences in brain structure between groups or to investigate changes in brain structure over time. Phenomena that are investigated include learning processes, language lateralization, psychosis, mild cognitive impairment, aphasia, alexithymia, post-traumatic stress disorder, Huntington disease, depression, autism, and schizophrenia. The use of automated
segmentation algorithms is desirable, as these algorithms are (i) much faster than manual segmentation and (ii) user independent, that is, they do not depend on expert knowledge in neuroanatomy. However, significant challenges exist, as differences in brain structure between groups or changes within subjects are often very subtle. Therefore, it is crucial that (i) automated segmentation algorithms precisely determine the amount of, for example, gray matter tissue in an MRI image (accuracy), and that (ii) they produce similar results when applied to different images of the same person (reliability). At the moment, however, too little is known about the accuracy and reliability of current automated segmentation algorithms.
Clark et al. addressed the problem of the reliability of different automated segmentation algorithms. Combining different algorithms for intensity correction, skull-stripping, and segmentation, Clark et al. produced a large number of different processing pathways and tested these pathways on twenty MRI images taken from the same subject. They found that the most “optimal” processing pathway yielded volume estimates that were on average three times less variable than those estimates calculated by less “optimal” pathways. They also demonstrated that the choice of the segmentation algorithm had the greatest impact on the variability of the final segmentation, whereas intensity correction and skull-stripping algorithms had little effect on the overall tissue segmentation reliability. In contrast to those findings, Fein et al. showed that skull-stripping may greatly improve the power of structural brain analysis. Acosta-Cabronero et al. evaluated the impact of skull-stripping and intensity correction algorithms on the subsequent segmentation. In accordance with the findings of Fein et al., they reported a large influence of those preprocessing steps.
In 2009, Klauschen et al. conducted a systematic evaluation of different segmentation algorithms. They used simulated brain data that were generated based on varying brain anatomy and varying image quality, as well as real images from nine different individuals and test-retest images of 48 individuals. They tested the performance of three commonly used segmentation algorithms, provided by the software packages SPM5, FSL, and FreeSurfer. Within-segmenter analyses revealed volume differences greater than 15%. Between-segmenter comparisons showed an average discrepancy of 24% for real MRI images. The results of Klauschen et al. suggested that automated brain segmentation algorithms might be seriously limited in the fine discrimination of tissue classes. Most importantly, their study cast serious doubt on the capability of automated segmentation algorithms to detect changes in brain structure in longitudinal studies.
To provide information to the community regarding which gray matter segmentation procedure they can build upon, we present a systematic evaluation of the accuracy and reliability of gray matter segmentation algorithms. Whereas Clark et al. emphasized the comparison of different processing pipelines with permuted preprocessing steps, and Klauschen et al. tested the within- and between-segmenter reliability and accuracy of three software packages, our investigation expands on both by providing a comprehensive investigation of segmentation pipelines as well as within- and between-segmenter accuracy and reliability, using the latest versions of commonly used segmentation algorithms. Importantly, we provide measures of accuracy obtained from real T1 MRI images. To our knowledge, this has not been done before in a systematic manner. The fact that we tested the latest versions of the available segmentation procedures is also of particular importance because, up to now, all studies concerned with the evaluation of automated segmentation have used algorithms that have since undergone substantial development.
In the current study, we evaluated the segmentation algorithms provided by (i) SPM8, (ii) VBM8, (iii) FSL, and (iv) FreeSurfer separately and in combination with algorithms for intensity correction and skull-stripping. We determined accuracy in terms of the Dice coefficient computed for the comparison of ground truth images and corresponding gray matter segmentations in simulated and real T1 brain images. We evaluated reliability in terms of standard deviation, coefficient of variation, and reliability coefficient of gray matter segmentations on real T1 images. In comparison to previous studies, our focus was on the simultaneous investigation of accuracy and reliability in combination with a systematic evaluation of the influence of each processing step on segmentation quality. Thus, we were able to examine (a) which processing step has the largest influence on segmentation accuracy both in simulated and real T1 MRI images, (b) how accuracy and reliability are linked, (c) how results from simulated and real T1 images differ, and (d) how preprocessing steps and segmentation algorithms interact.
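As an illustration of the accuracy and reliability measures named above, the following minimal Python sketch computes a Dice coefficient between a binary gray matter segmentation and a ground truth mask, and the standard deviation and coefficient of variation of repeated gray matter volume estimates. It uses the standard textbook definitions, Dice = 2|A ∩ B| / (|A| + |B|) and CV = SD / mean. The function names, the toy data, and the use of the sample standard deviation (ddof=1) are assumptions made for this example and are not taken from the study. The reliability coefficient is omitted because this section does not define it.

```python
import numpy as np


def dice_coefficient(segmentation, ground_truth):
    """Dice overlap between two binary masks of identical shape."""
    seg = np.asarray(segmentation, dtype=bool)
    ref = np.asarray(ground_truth, dtype=bool)
    if seg.shape != ref.shape:
        raise ValueError("masks must have the same shape")
    denominator = seg.sum() + ref.sum()
    if denominator == 0:
        # Both masks are empty; treated here as perfect agreement (assumption).
        return 1.0
    overlap = np.logical_and(seg, ref).sum()
    return float(2.0 * overlap / denominator)


def volume_variability(volumes):
    """Standard deviation and coefficient of variation of repeated estimates."""
    v = np.asarray(volumes, dtype=float)
    sd = v.std(ddof=1)   # sample standard deviation (assumption)
    cv = sd / v.mean()   # coefficient of variation, unitless
    return float(sd), float(cv)


# Toy example: 2 x 3 masks and three hypothetical repeated volume estimates (ml).
seg_mask = [[1, 1, 0], [0, 1, 0]]
ref_mask = [[1, 0, 0], [0, 1, 1]]
dice = dice_coefficient(seg_mask, ref_mask)

sd, cv = volume_variability([600.0, 610.0, 590.0])
print(f"Dice = {dice:.3f}, SD = {sd:.1f} ml, CV = {cv:.4f}")
```

A Dice coefficient of 1 indicates perfect overlap with the ground truth and 0 indicates no overlap. A lower standard deviation and coefficient of variation across repeated scans of the same person indicate higher reliability.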