Automatic segmentation of medical images is an essential component of many applications, and considerable effort has been invested in finding reliable and accurate algorithms for this difficult problem. Many techniques have been proposed, with different levels of automation and ranges of applicability. However, merely proposing a new algorithm is not enough: a thorough evaluation of its performance is necessary, with quantifiable measures of its accuracy and variability.
The problem of measuring the performance of segmentation algorithms is the subject of this article. We investigate different techniques to assess the quality of multiple segmentation methods on a problem-specific data set. We are especially interested in cases where there is no ground truth available. We focus on the evaluation of brain tissue classifiers, although our framework can be applied to any segmentation problem.
Before we turn our attention to situations where no ground truth is available, we briefly review the key aspects of evaluation with a ground truth. In this scenario, the accuracy of the evaluation depends on two important components. First, one needs to have or design a suitable ground truth, and second, one needs to choose appropriate similarity metrics for the problem being evaluated.
Defining a ground truth in a medical context is not trivial, and several approaches have been proposed. A popular technique is to compare automatic methods against a group of human experts (Grau et al., 2004
; Rex et al., 2004
). In this framework, one assumes that human raters hold some prior knowledge of the ground truth, reflected in their manual tracings. Unfortunately, human raters make errors, and their own accuracy and variability must be addressed (Zijdenbos et al., 2002
). Another common technique is the use of phantoms. For segmentation problems, phantoms are usually synthetic images for which the true segmentation is known (Collins et al., 1998
; Zhang et al., 2001
; Ashburner and Friston, 2003
). A physical object can also be used as a phantom ground truth: the phantom is first measured, then imaged, and the true measurements are compared with measurements derived from the segmentation to assess performance (Klingensmith et al., 2000
). Studies with cadavers have also been completed in a similar fashion (Klingensmith et al., 2000
; Yoo et al., 2000
Unfortunately, it is exceedingly difficult to design phantoms that faithfully mimic in-vivo data, and post-mortem data differ from perfused, living tissue.
Once a ground truth is created, the key task of evaluation is to measure the similarity
between the reference and the automatic segmentation. It is still unclear whether a generic set of measurements can be used for all segmentation problems, although some measures have been popular. Differences in volume have often been used, possibly because volume is such a central measurement in MR imaging studies (Zijdenbos et al., 1994
). However, two objects with the same volume can be quite dissimilar, so alternative measures are needed. To capture shape differences, various distances between the boundaries of segmented objects have been used, a popular choice being the Hausdorff distance (Chalana and Kim, 1997
; Gerig et al., 2001
). Measures of spatial overlap have also been considered important as an alternative to volume differences (Zijdenbos et al., 1994
; Ashburner and Friston, 2003
; Grau et al., 2004
; Pohl et al., 2004
). We will investigate these in detail in section 2.1.
For many medical problems, as noted previously, phantom studies are considered insufficient for validation and manual tracings are simply not available. In the work presented here, we focus on the automatic classification of the brain into four major tissue classes: Gray Matter (GM), White Matter (WM), CerebroSpinal Fluid (CSF) and background (BG). For this specific problem, manual tracing of the entire data set, a total of forty cases, is simply infeasible. Nevertheless, if one were to start a new neuroimaging study, one would certainly like to evaluate the automatic classifiers on the entire population. We thus have to turn to methods that measure performance in situations where no ground truth is available. A rather intuitive approach is to perform such an evaluation based on common agreement
. That is, if nine out of ten algorithms classify voxel x
in subject i
as white matter, then one says there is a 90% chance that this voxel truly is white matter. This simple technique is appealing but limited: every algorithm has equal voting power, and ties can arise in which a voxel is equally likely to be assigned to several tissue classes. Nevertheless, this notion of common agreement is useful and can be quantified directly through measures such as the Williams’ index (Chalana and Kim, 1997
; Klingensmith et al., 2000
; Martin-Fernandez et al., 2005
). One can also create a reference by taking, at each voxel, the majority of the votes cast by the segmentations; this reference can then be used as a ground truth for further performance measurements. A more elaborate technique is the one developed by Warfield et al. (2004), which simultaneously estimates a reference standard and rater performance parameters through an Expectation-Maximization framework.
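These agreement-based ideas can be sketched on a toy example. The code below (NumPy only; the raters and data are invented for illustration) builds a majority-vote reference, computes a Williams’-index-style score from a pairwise agreement matrix, and runs a heavily simplified binary EM in the spirit of STAPLE with a fixed global prior. The full STAPLE formulation is more general; this is only a didactic reduction:

```python
import numpy as np

def majority_reference(D):
    """Voxelwise majority vote over binary segmentations D of shape (R, N)."""
    return (D.mean(axis=0) > 0.5).astype(int)

def williams_index(agree, j):
    """Williams' index of rater j: the average agreement of j with the other
    raters, divided by the average pairwise agreement among those others.
    agree is an (R, R) symmetric matrix of pairwise agreement scores."""
    others = [k for k in range(agree.shape[0]) if k != j]
    num = np.mean([agree[j, k] for k in others])
    den = np.mean([agree[k, l] for i, k in enumerate(others)
                   for l in others[i + 1:]])
    return num / den

def staple_binary(D, iters=25):
    """Toy binary STAPLE-like EM: estimates a per-voxel posterior W for the
    'true' label and per-rater sensitivity p / specificity q."""
    R, N = D.shape
    pi = D.mean()                  # fixed global prior on label 1
    p = np.full(R, 0.9)
    q = np.full(R, 0.9)
    for _ in range(iters):
        # E-step: posterior probability that the true label is 1 at each voxel
        a = pi * np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - pi) * np.prod(np.where(D == 1, 1 - q[:, None], q[:, None]), axis=0)
        W = a / (a + b)
        # M-step: re-estimate rater parameters (clipped for numerical stability)
        p = np.clip((D * W).sum(axis=1) / W.sum(), 1e-6, 1 - 1e-6)
        q = np.clip(((1 - D) * (1 - W)).sum(axis=1) / (1 - W).sum(), 1e-6, 1 - 1e-6)
    return W, p, q

# Five hypothetical raters labeling six voxels:
D = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [0, 1, 1, 0, 0, 1]])   # rater 4 is the noisy outlier
agree = np.array([[(D[i] == D[j]).mean() for j in range(5)] for i in range(5)])
print(majority_reference(D))         # [1 1 1 0 0 0]
print(williams_index(agree, 0))      # 1.25: agrees with the group more than it agrees internally
print(williams_index(agree, 4))      # 0.7: below 1 flags the outlier
W, p, q = staple_binary(D)
print((W > 0.5).astype(int))         # here the posterior matches the majority reference
```

Note how the Williams’ index captures the tie-prone voting problem mentioned above: the index is relative to the internal agreement of the group, so a rater is flagged only when it deviates more than the group deviates from itself.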
This evaluation approach based on common agreement is the foundation of the work we present here. Our MR brain segmentation problem suffers from the lack of a readily available database that contains both the type of input data we use and accurate reference classifications. Without a gold standard the problem is clearly ill-posed, and we believe common agreement is a sensible solution. Some care must be taken when analyzing the results, as one cannot state with certainty that one algorithm clearly outperforms the others purely on the basis of common agreement. Nevertheless, we can at least observe and study many aspects of segmentation performance, such as robustness and variability across cases, brain regions, or tissue classes. We can also infer how different the algorithms are and whether some techniques tend to behave similarly. One key requirement for the success of our study is that the input to the common agreement be unbiased: if a subset of the tested methods always behaves similarly, the agreement will be biased towards these methods and the evaluation may be incorrect. In our work, we selected 11 segmentation techniques that we believe represent a well-balanced set. We include a discussion of bias as part of our analysis in section 5.
In this article, we make several contributions. First, we present a framework in which one can assess segmentation performance purely on the basis of common agreement. Three methods form the basis of this framework: the Williams’ index, a technique we recently introduced (Martin-Fernandez et al., 2005
); STAPLE’s algorithm (Warfield et al., 2004
); and a novel visualization based on Multidimensional Scaling (MDS), a statistical tool to explore (dis)similarity data (Borg and Groenen, 1997
; Cox and Cox, 2000
). Second, we assess the validity of our results by comparing our framework (based purely on common agreement) with an evaluation against a set of manual segmentations used as ground truth. Our findings suggest that common-agreement evaluation provides almost the same information as evaluation against a ground truth, with respect to robustness, variability and even ranking. Nevertheless, we do observe that some of the information captured by human experts is not present in the automatic classifications; common agreement alone may thus not be sufficient to rank automatic segmentation algorithms accurately. Finally, as our experiments test eleven state-of-the-art segmentation algorithms on a real and rather large data set, we provide the community with useful new knowledge about the performance of these algorithms.
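The MDS visualization can be illustrated with a minimal classical (Torgerson) MDS implementation in NumPy, applied to a hypothetical dissimilarity matrix such as one minus the pairwise Dice overlap between methods (the matrix below is invented for illustration and does not come from our experiments):

```python
import numpy as np

def classical_mds(delta, k=2):
    """Classical (Torgerson) MDS: embed an (n, n) symmetric dissimilarity
    matrix delta into k dimensions, so that Euclidean distances between the
    returned points approximate the entries of delta."""
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (delta ** 2) @ J       # double-centered Gram matrix
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]         # keep the k largest
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Hypothetical dissimilarities (e.g. 1 - Dice) among four segmenters:
delta = np.array([[0.0, 0.1, 0.2, 0.7],
                  [0.1, 0.0, 0.2, 0.7],
                  [0.2, 0.2, 0.0, 0.6],
                  [0.7, 0.7, 0.6, 0.0]])
X = classical_mds(delta)                  # (4, 2) coordinates, ready to plot
# In the embedding, methods 0-2 should land near one another while method 3
# sits apart, making the "which techniques behave similarly" question visual.
print(X)
```

The design choice here is that MDS needs only pairwise dissimilarities, not a ground truth, which is exactly what a common-agreement setting provides.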
In the following section, we give a detailed description of the design of the evaluation framework. We start by introducing different similarity measures for comparing binary images. We then give detailed information on how the Williams’ index is computed, and present a brief review of STAPLE’s underlying principles and of how it is used in our experiments. We describe MDS in more depth, as we have not seen this technique used for evaluation elsewhere.
Section 3 describes our experimental setup: which data set is used, which algorithms are tested, and what kinds of tests are performed. This section starts with an experiment in which no ground truth whatsoever is available and only common agreement is used. We then create gold standards from human tracings of a small subset of the data to test whether common agreement is indeed a sensible approach. We analyze our results in section 4, and in section 5 discuss the feasibility, accuracy, robustness, scalability and significance of evaluating brain tissue classification algorithms purely on the basis of their common agreement. Section 6 concludes the paper, summarizing our results.