Comparison of a group of multiple-observer segmentations is known to be a challenging problem. A good segmentation evaluation method would allow different segmentations not only to be compared, but also to be combined to generate a “true” segmentation with higher consensus. Numerous multi-observer segmentation evaluation approaches have been proposed in the literature; STAPLE in particular probabilistically estimates the true segmentation by optimally combining the observed segmentations with a prior model of the truth. As an Expectation–Maximization (EM) algorithm, STAPLE's convergence to the desired local minimum depends on good initializations of the truth prior and the observer-performance prior. However, accurate modeling of the initial truth prior is nontrivial. Moreover, of the two priors, the truth prior always dominates, so that in scenarios where meaningful observer-performance priors are available, STAPLE cannot take advantage of that information. In this paper, we propose a Bayesian decision formulation of the problem that permits the two types of prior knowledge to be integrated in a complementary manner in four cases with differing application purposes: (1) with known truth prior; (2) with observer prior; (3) with neither truth prior nor observer prior; and (4) with both truth prior and observer prior. The third and fourth cases are not discussed (or are effectively ignored) by STAPLE, and in our research we propose a new method to combine multiple-observer segmentations based on the maximum a posteriori (MAP) principle, which respects the observer prior regardless of the availability of the truth prior. Based on the four scenarios, we have developed a web-based software application that implements this flexible segmentation evaluation framework for digitized uterine cervix images.
Experimental results show that our framework can flexibly and effectively integrate different priors for multi-observer segmentation evaluation, and that it generates results comparing favorably to those of the STAPLE algorithm and the Majority Vote Rule.
Segmentation is a fundamental problem in many pattern recognition and image processing applications. Segmentations can be generated by different automated computer methods or by human observers. Multiple-observer segmentation evaluation is helpful in many scenarios, for example: (a) evaluating the performance of multiple observers' segmentations simultaneously; (b) measuring segmentation complexity; (c) combining multiple observers' segmentations to generate the ground-truth segmentation. STAPLE is an algorithm proposed for the first scenario.
In STAPLE, two different kinds of prior knowledge can be integrated. One is the truth prior, which specifies the probability of each pixel being inside the segmentation. This information can be obtained by training a statistical atlas. The other is the observer-performance (or observer) prior, which specifies prior knowledge about the performance level of each observer, often quantified by two performance parameters, sensitivity and specificity (Section 2.2). However, the roles of the two priors are not balanced in the STAPLE algorithm. The truth prior is heavily depended on, and it almost always dominates the observer prior, so that the observer prior has little effect on the final evaluation result. Since the truth prior is often unknown and an estimated prior is used instead, the evaluation result is often not in agreement with the initial performance measures of the observers. As the STAPLE authors point out, if this were discovered in an application, it would indicate either the need to re-evaluate the global prior assumption or the need for improved training of the experts generating the segmentations. However, these recommendations do not address the lack of a truth prior, or the discrepancy caused by inconsistent truth and observer priors. In certain situations, the performance measures of multiple observers' segmentations are known in advance to some extent. For instance, if an observer is an automated segmentation algorithm, we may know that the algorithm tends to perform conservatively and thus has a low specificity. For manual segmentations, we can assume that segmentations made by experts have higher sensitivity and specificity than those by non-experts. In these situations, we would like evaluation results that are consistent with the known observer-performance priors.
Based on the above observations, we propose a different framework based on Bayesian Decision Theory and the MAP optimization principle for the multiple-observer segmentation evaluation problem. The framework is organized around different segmentation evaluation needs and the different prior knowledge available. One need is to estimate the ground-truth segmentation and observer performance levels, with or without the truth prior probability. The other need is to combine the segmentations from observers with different measures of performance. To address the first need, if the truth prior is unknown, the observers are treated equally as experts with high sensitivity and specificity; the truth prior probability is estimated by averaging all observer segmentations and then integrated into the MAP estimation. If a reliable truth prior is available, it is used directly. To address the second need, where we know a priori some observers' sensitivity and specificity, the MAP solution combines these performance measures to compute a ground truth map that is consistent with the known measures. The estimated ground truth can then be used to evaluate other observers whose performance measures are unknown. For validation purposes, a gold-standard ground-truth segmentation can be acquired in phantom experiments or by multiple-observer consensus, to compare with the estimated ground truth.
We developed an online software system to evaluate multi-observer segmentations for medical images such as those in the NCI/NLM medical repository of digital cervicographic images (cervigrams). A total of 939 images were collected as part of a study of the evolution of lesions related to cervical cancer conducted by the National Cancer Institute (NCI) together with the National Library of Medicine (NLM) through two major studies in Costa Rica and the United States, the Guanacaste and ALTS projects, respectively. In these studies, multiple observers (or raters) marked several regions on cervigrams that are of anatomical or clinical interest, including the cervix boundary and acetowhite regions. The observers were clinicians with expertise in colposcopy, identified by members of the Board of Directors of the American Society for Colposcopy and Cervical Pathology and by staff at the National Cancer Institute. They included 12 general gynecologists and 8 gynecologic oncologists; 18 of them work in academic settings and 2 in private practice, and their years of experience vary. In the studies, the total number of subjects was also 939 (one cervigram per subject). The cervix boundary defines the region of the uterine cervix, which is of anatomic interest within the cervigrams. The acetowhite regions are epithelium with a whitened appearance, which is visible for a short period of time following the application of 3% to 5% acetic acid. Some acetowhite regions correlate with uterine cervix cancer progression, and thus are of clinical significance. Examples of these marked regions are shown in Fig. 1. Each cervigram is associated with a number of manual markings varying from one to twenty. In this paper, we consider combining multiple observers' segmentations of the cervix boundary (yellow line in Fig. 1), and our software can be used to evaluate these multi-observer segmentations in different scenarios.
The remainder of this paper is organized as follows. In Section 2, we discuss previous work and our choice of multiple-observer segmentation measures. In Section 3, we discuss previous work on combining multiple-observer segmentations; we introduce the STAPLE algorithm and identify its limitation. We then describe our framework and algorithms for different multiple-observer segmentation evaluation scenarios in Section 4. The web-based multiple-observer segmentation evaluation software developed based on our method is presented in Section 5, and we demonstrate experimental results and comparisons with previous work in Section 6, using the multiple-observer manual segmentations. In Section 7, we demonstrate further experimental results, using the manual segmentations to evaluate our automatic segmentation method. Section 8 concludes the paper with a discussion of future work.
A number of metrics have been proposed to compare segmentations. Generally, evaluation methods for image segmentation can be classified into three categories: analytical methods, empirical goodness methods, and empirical discrepancy methods. Analytical methods judge not the performance of segmentation methods but their properties, principles, complexity, requirements, and so forth. Empirical goodness methods compute some form of “goodness” criterion such as uniformity within regions, contrast between regions, shape of segmented regions, and so forth. Empirical discrepancy methods evaluate segmentation methods by comparing the segmented image against a manually segmented reference image, often referred to as the ground truth, and computing error measures. The empirical discrepancy methods are the most commonly used methods for segmentation evaluation.
Reviewing work in the literature, one can find two kinds of empirical discrepancy methods: (1) region-based evaluation, which evaluates segmentation consensus in terms of the number of regions and the locations, sizes, and other statistics of the segmented regions, and (2) boundary-based evaluation, which evaluates segmentation in terms of both the location and shape accuracy of the extracted region boundaries. The segmentation performance-level criteria in region-based evaluation can be: (a) sensitivity and specificity, where sensitivity is defined as the “true positive fraction” and specificity as the “true negative fraction”, (b) correctness and completeness, or precision and recall [17, 18], where high completeness means that the segmented region covers the relevant pattern well, whereas high correctness implies that the segmented region does not contain many (incorrect) irrelevant patterns, (c) the number of misclassified pixels and their distances to the nearest correctly segmented pixels, (d) measures based on the Hamming distance between two segmentations, (e) local consistency error, which quantifies the consistency between image segmentations of differing granularities, (f) bidirectional consistency error, which penalizes dissimilarity between segmentations proportionally to the degree of region overlap, and (g) partition distance, which is defined as “given two partitions P and Q of S, the partition distance is the minimum number of elements that must be deleted from S, so that the two induced partitions (P and Q restricted to the remaining elements) are identical”.
On the other hand, the performance-level criteria in boundary-based evaluation can be: (a) distance of distribution signatures, which is based on the distance between distribution signatures that represent the boundary points of two segmentation masks, (b) precision–recall measurement, which uses precision and recall values to characterize the agreement between the oriented boundary edge elements of two segmentations' region boundaries, and (c) a new discrepancy measure which takes into account not only the consensus of the localized boundaries of the created segments but also under-segmentation and over-segmentation.
A good evaluation method would allow segmentations by different approaches not only to be compared, but also to be integrated to generate a segmentation with higher consensus. In our framework, we choose the region-based evaluation metrics sensitivity and specificity, which can be incorporated into our framework for combining multiple-observer segmentations.
In the NCI cervigram database, we have segmentations of regions marked by 20 observers. Since these markings can vary in size and location it is essential that we have measures to evaluate these multi-observer segmentations. We choose sensitivity p and specificity q to measure the performance level of each binary segmentation.
Sensitivity is the “true positive fraction”, defined as

p = TP / (TP + FN)    (Eq. 1)

where TP is the number of true positive pixels and FN is the number of false negative pixels. That is, sensitivity is the percentage of ground-truth foreground pixels that are properly included in the segmentation result.
Specificity is the “true negative fraction”, defined as

q = TN / (TN + FP)    (Eq. 2)

where TN is the number of true negative pixels and FP is the number of false positive pixels. That is, specificity is the percentage of pixels properly excluded from the segmentation result out of all pixels outside the ground truth.
The relationship of sensitivity p and specificity q in a binary segmentation can be easily understood through the diagram in Fig. 2. The pixels labeled 1 are inside the segmentation (foreground) and those labeled 0 are outside (background). Different observers may have different (p, q) values; for instance, medical experts have higher (p, q) values while inexperienced non-experts may have lower ones (Fig. 2).
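As a minimal illustration of Eqs. 1 and 2, the (p, q) values of one binary segmentation against a ground-truth mask can be computed as follows (a hypothetical helper for illustration, not part of the described software):

```python
import numpy as np

def sensitivity_specificity(segmentation, truth):
    """Compute (p, q) for a binary segmentation against a binary ground truth."""
    seg = np.asarray(segmentation, dtype=bool)
    gt = np.asarray(truth, dtype=bool)
    tp = np.sum(seg & gt)    # foreground pixels correctly included
    fn = np.sum(~seg & gt)   # foreground pixels missed
    tn = np.sum(~seg & ~gt)  # background pixels correctly excluded
    fp = np.sum(seg & ~gt)   # background pixels wrongly included
    p = tp / (tp + fn)       # sensitivity: true positive fraction (Eq. 1)
    q = tn / (tn + fp)       # specificity: true negative fraction (Eq. 2)
    return p, q

truth = np.array([[0, 1, 1], [0, 1, 1], [0, 0, 0]])
seg   = np.array([[0, 1, 0], [0, 1, 1], [1, 0, 0]])
p, q = sensitivity_specificity(seg, truth)
# p = 3/4 (one of four truth pixels missed), q = 4/5 (one of five background pixels included)
```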
Measures similar to sensitivity and specificity are correctness and completeness, also known as precision and recall, which are defined as follows:

correctness (precision) = TP / (TP + FP)
completeness (recall) = TP / (TP + FN)
There are a number of combination methods proposed in the literature for different cases of integrating multi-observer segmentations to derive a final segmentation. These include class-probability combining strategies such as the Min Rule, the Max Rule, the Median Rule, and the Majority Vote Rule. For instance, the Majority Vote Rule chooses the segmentation label for each pixel based on what the majority of observers agree on; this simple method, however, does not take into consideration the variability in quality or performance among the voters, and also does not incorporate prior knowledge regarding the segmentations. There are also combination strategies that assume each classifier has expertise in a subset of the decision domain [11–13], and strategies [14, 15] that can account for different confidence or uncertainty levels in segmentations.
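The Majority Vote Rule itself is straightforward to sketch (the function name and the tie-handling convention below are our own choices, not from the original methods):

```python
import numpy as np

def majority_vote(segmentations):
    """Combine R binary segmentations by per-pixel majority.

    segmentations: array of shape (R, H, W) with 0/1 labels.
    Ties (possible only for even R) are resolved toward foreground here;
    that choice is a convention, not part of the rule itself.
    """
    votes = np.mean(np.asarray(segmentations, dtype=float), axis=0)
    return (votes >= 0.5).astype(int)

obs = np.array([
    [[1, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 0], [0, 0, 1]],
])
combined = majority_vote(obs)
# combined = [[1, 1, 0], [0, 1, 1]] -- each pixel takes the label 2 of 3 observers chose
```

Note that the result depends only on the observer data: neither a truth prior nor observer performance levels enter the computation, which is exactly the shortcoming discussed above.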
The result of probabilistically combining multiple observer segmentations is usually presented as a multiple-observer ground truth map. One example is shown in Fig. 3. In Fig. 3a, two observers have marked the acetowhite regions (in red and in blue). Figure 3b shows the corresponding ground truth map after combining the two segmentations; in this map, each pixel is represented by a color indicating the probability that it belongs to the ground truth segmentation.
The STAPLE algorithm is a well-known method proposed by Warfield et al. for generating ground truth segmentation maps from the observations of multiple observers and for measuring the performance level of each observer.
Let us suppose there are N pixels in the image whose segmentations are being evaluated by a total of R observers. The following notations are used in describing the STAPLE algorithm.
STAPLE is an EM (Expectation–Maximization) algorithm that simultaneously estimates the true segmentation T and the performance levels of the observers, characterized by the parameters p and q in this case. It aims to maximize the complete data log likelihood:

(p̂, q̂) = arg max_{p, q} ln f(D, T | p, q)
Like other EM algorithms, the STAPLE algorithm has two steps: the Expectation (E) step and the Maximization (M) step. In the E step, it computes an expectation of the likelihood at each iteration k:
where the posterior probability of the true segmentation at each pixel is

W_i^(k) = f(T_i = 1) α_i / [f(T_i = 1) α_i + f(T_i = 0) β_i]

In the above definition for the posterior ground truth segmentation, f(T_i = 1) is the truth prior probability, α_i is the conditional data probability f(D_i | T_i = 1, p^(k−1), q^(k−1)), and β_i is the conditional data probability f(D_i | T_i = 0, p^(k−1), q^(k−1)):

α_i = ∏_{j: D_ij = 1} p_j^(k−1) ∏_{j: D_ij = 0} (1 − p_j^(k−1))
β_i = ∏_{j: D_ij = 0} q_j^(k−1) ∏_{j: D_ij = 1} (1 − q_j^(k−1))
In the M step, it estimates the observers’ performance level parameters, p and q, that maximize the conditional expectation of the complete data log likelihood function.
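Although the update formulas are not reproduced here, in the standard STAPLE derivation the M-step estimates take the closed form below (our transcription of the usual result, using the notation of this section, with $W_i^{(k)}$ the E-step posterior at pixel $i$):

```latex
p_j^{(k)} = \frac{\sum_{i:\,D_{ij}=1} W_i^{(k)}}{\sum_{i} W_i^{(k)}},
\qquad
q_j^{(k)} = \frac{\sum_{i:\,D_{ij}=0} \bigl(1 - W_i^{(k)}\bigr)}{\sum_{i} \bigl(1 - W_i^{(k)}\bigr)}
```

Each update is a posterior-weighted fraction of pixels the observer labeled correctly: the sensitivity update weights by the soft foreground $W$, and the specificity update by the soft background $1 - W$.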
If the difference between the (p, q) values at steps k−1 and k is small enough, the algorithm is considered to have converged. It then outputs the final (p, q) values and the ground truth map Wi. STAPLE requires three inputs: the multiple observer segmentations D, the initial (p, q) values for each segmentation, and the ground-truth segmentation prior probability f(Ti=1).
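The E/M loop described above can be sketched in a few lines (a simplified re-implementation for illustration only; the variable names and convergence test are our own, and the real STAPLE handles multi-label data and other details omitted here):

```python
import numpy as np

def staple(D, p0, q0, prior=None, tol=1e-6, max_iter=100):
    """Minimal STAPLE EM sketch for binary labels.

    D: (R, N) array of R observer decisions over N pixels.
    p0, q0: length-R initial sensitivity / specificity values.
    prior: f(T_i=1); defaults to the sample mean of the observer labels.
    Returns (W, p, q): posterior truth map and converged performance values.
    """
    D = np.asarray(D, dtype=int)
    R, N = D.shape
    p = np.asarray(p0, dtype=float).copy()
    q = np.asarray(q0, dtype=float).copy()
    g = np.full(N, D.mean()) if prior is None else np.broadcast_to(
        np.asarray(prior, dtype=float), (N,)).copy()
    for _ in range(max_iter):
        # E-step: alpha_i = f(D_i | T_i=1), beta_i = f(D_i | T_i=0)
        alpha = np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
        beta = np.prod(np.where(D == 0, q[:, None], 1 - q[:, None]), axis=0)
        W = g * alpha / (g * alpha + (1 - g) * beta + 1e-12)
        # M-step: re-estimate each observer's (p, q) from the soft truth W
        p_new = (D * W).sum(axis=1) / (W.sum() + 1e-12)
        q_new = ((1 - D) * (1 - W)).sum(axis=1) / ((1 - W).sum() + 1e-12)
        converged = (np.max(np.abs(p_new - p)) < tol
                     and np.max(np.abs(q_new - q)) < tol)
        p, q = p_new, q_new
        if converged:
            break
    return W, p, q

# Three observers over six pixels; two pixels are unanimously foreground,
# two unanimously background, two disputed.
D = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0]])
W, p, q = staple(D, p0=[0.9] * 3, q0=[0.9] * 3)
# Unanimous foreground pixels get W near 1; unanimous background near 0.
```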
As described above, besides the multiple observer segmentation data, two kinds of priors are necessary inputs to the STAPLE algorithm: the truth prior probability f(Ti=1), and the observer prior, represented by the initial (p, q) values for each observer's segmentation. However, as noted by the STAPLE authors and confirmed by us through extensive experiments, the truth prior almost always dominates the observer prior, so that the initial (p, q) observer performance-level values have little effect on the final posterior segmentation. Indeed, the converged (p, q) result from STAPLE often contradicts the initial (p, q) prior. We believe this discrepancy is caused by the independence assumption made by STAPLE: the ground truth T is assumed independent of the performance-level parameters, so that f(T, p, q)=f(T)f(p, q). It is obvious from the definitions of sensitivity p (Eq. 1) and specificity q (Eq. 2) that (p, q) are not independent of T. This independence assumption separates the influence of the truth prior from that of the observer performance-level prior. In practice, this means STAPLE cannot deal with the scenario in which the (p, q) values for each observer's segmentation are known. Moreover, the truth prior is often unknown and an estimated prior is used instead; if the estimated prior is far off from the ground truth segmentation, the negative effect of the missing prior is magnified. This limitation of the STAPLE algorithm can be seen from the following experiments. In the first group of experiments (Table 1 and Fig. 4), the truth prior probability is not available, and it is estimated as the average of the relative proportions of the labels (1 or 0) in the multiple-observer segmentations. Therefore the value of the prior probability is kept the same for all three experiments.
We vary the initial (p, q) values of the two observers across experiments: in Experiment 1, both observers are set as experts with high (p, q) values; in Experiment 2, Observer 1 is set as an expert and Observer 2 as a non-expert; the configuration in Experiment 3 is the reverse of that in Experiment 2. Using the sample mean of the multi-observer segmentations as the truth prior, one can see from Table 1 that: (1) the final estimated ground truth map by STAPLE is close to that generated by the Majority Vote Rule, (2) the differing observer-performance prior (p, q) values have little effect on the estimated ground truth map, and (3) the converged (p, q) values can deviate greatly from the initial (p, q) values, which indicates that the observer prior was overwhelmed by the truth prior (Table 1; Fig. 4).
In the second group of experiments (Table 2 and Fig. 5), we specify different truth prior probabilities with different (p, q) values across experiments. In Experiment 1, the truth prior probability is set closer to the segmentation by Observer 2 (γ=0.5), while in Experiments 2 and 3, the truth prior is closer to the segmentation by Observer 1 (γ=0.2). These experiments clearly show that the truth prior probability has a dominant effect on the estimated ground truth map at the converged local minimum of the STAPLE algorithm. Experiment 3 is particularly interesting: there, we set Observer 1 as a non-expert and Observer 2 as an expert, so the truth prior probability is not in agreement with the prior performance measures of the two observers' segmentations. The converged results from the STAPLE algorithm are consistent with the truth prior probability rather than with the observer performance-measure prior values. Experiment 3 clearly demonstrates that, even when reliable information about observer performance measures is available, we still have to supply the correct truth prior in order to obtain meaningful results from STAPLE (Table 2; Fig. 5).
As demonstrated above, STAPLE effectively ignores the observer performance-measure prior. Indeed, in the derivation of STAPLE, the observer performance prior probability f(p, q) was cancelled out by the independence assumption between T and the (p, q) values. The result of this cancellation is that there is no way to inject prior knowledge about an individual observer's performance level into the STAPLE framework. Furthermore, the ground truth prior probability has a dominant effect on the estimated posterior ground truth map and the estimated performance measures (Tables 1 and 2; Figs. 4 and 5), which is not always desirable because we often do not have reliable information about the truth prior. We argue that these limitations stem from the independence assumption: by either the standard definitions of sensitivity and specificity (Section 2.2) or the definitions in STAPLE (pj=Pr(Dij=1|Ti=1), qj=Pr(Dij=0|Ti=0)), p and q are fully dependent on D and T. That is, given the segmentation decisions D and the ground truth T, the performance measures pj and qj of any observer j are uniquely determined.
Based on the above analysis, we propose a new framework for multiple-observer segmentation evaluation which is more general than STAPLE. We explicitly take into account the different kinds of prior knowledge that are available and apply different methods in different scenarios. The two kinds of prior knowledge that can be injected into our framework are: the (ground) truth prior (γ=f(Ti=1)), and the observer performance-level prior (p, q) values. If a certain prior is unknown, it can be initialized with a uniform distribution or based on the observers' segmentation data.
The overall theoretical framework is based on Bayesian Decision Theory, which aims to make a decision based on the posterior probability distribution f(T|D). The standard maximum a posteriori (MAP) estimator can be applied to select the most probable ground truth T:

T̂ = arg max_T f(T | D)    (Eq. 8)
For pixel i, let

W_i = f(T_i = 1 | D)

where f(T_i = 1 | D) denotes the posterior probability of the true segmentation at pixel i being equal to one. It follows that the posterior background probability is f(T_i = 0 | D) = 1 − f(T_i = 1 | D). Thus the MAP estimator (Eq. 8) assigns pixel i the class label 1 (i.e. foreground, T_i = 1) if f(T_i = 1 | D) > 0.5, or the label 0 (i.e. background, T_i = 0) if f(T_i = 1 | D) < 0.5.
Next we discuss several scenarios with different prior knowledge available and different application purposes. The multi-observer segmentation evaluation algorithms in our framework are introduced for each scenario.
In this scenario, we simply apply Eqs. 10, 11 and 12 with these numbers to calculate f(Ti=1|D) and estimate the posterior ground truth segmentation map. This case cannot be handled by STAPLE, because the observer prior would be ignored and would not have the desired effect on the estimated ground truth segmentation.
In this scenario, we know the sensitivity and specificity of each observer and thus can distinguish observers of different performance levels, such as experts vs. non-experts. However, we do not know the truth prior probability f(Ti=1). In practice, such a situation is quite common. The sensitivity and specificity of each observer can be estimated from training data reflecting the observer's past experience (manual segmentations). Or, if an observer is an automated segmentation algorithm, the (p, q) values of the observer can be estimated based on the characteristics of the segmentation algorithm or on its performance on validation datasets. In this case, we want to obtain a ground truth consistent with the known (p, q) values of the observers. Therefore the sensitivity and specificity values cannot simply be used as initialization values in the EM-based STAPLE algorithm (Section 3.2). Instead, we follow the Bayesian decision framework and calculate f(Ti=1|D) directly using Eqs. 10, 11 and 12 with the known (p, q) values of the observers; the unknown truth prior probability is modeled in one of two ways:
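This direct, non-iterative MAP computation can be sketched as follows, under the usual pixel-independence assumption and with the truth prior estimated from the data when absent (function and variable names are illustrative, not from the described software):

```python
import numpy as np

def map_combine(D, p, q, prior=None):
    """Single-pass MAP combination with known, fixed observer (p, q) values.

    Unlike the EM approach, the known (p, q) are never re-estimated, so the
    resulting ground truth stays consistent with the observer prior.
    D: (R, N) observer decisions; p, q: length-R known performance values.
    prior: f(T_i=1); if None, it is estimated from the segmentation data.
    """
    D = np.asarray(D, dtype=int)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    g = D.mean() if prior is None else prior
    # Likelihoods of the observed decisions under T_i = 1 and T_i = 0
    like1 = np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
    like0 = np.prod(np.where(D == 0, q[:, None], 1 - q[:, None]), axis=0)
    post = g * like1 / (g * like1 + (1 - g) * like0)
    return post, (post > 0.5).astype(int)  # posterior map and MAP labels

# A single disputed pixel: the expert (p=q=0.99) says foreground,
# a weaker observer (p=q=0.7) says background.
D = np.array([[1], [0]])
post, labels = map_combine(D, p=[0.99, 0.7], q=[0.99, 0.7], prior=0.5)
# The expert's opinion dominates: post[0] is well above 0.5, so the label is 1.
```

The key design point is that the known (p, q) enter the likelihoods directly and are never updated, which is what makes the estimate respect the observer prior.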
Sometimes we have the performance measures of some observers but not others. Our approach in this situation is to use the above algorithm to estimate the ground truth, and then to evaluate the observers with unknown measures by comparing their segmentations to the estimated ground truth. Their (p, q) values are calculated using Eqs. 1 and 2.
In this case, the known truth prior is directly applied in Eq. 12, while the missing (p, q) values of each observer can be set in two ways:
In this scenario, initialization of the truth prior probability and the (p, q) values of each observer in the Bayesian framework is a combination of the initialization methods introduced in Sections 4.2 and 4.3: the initialization of f(Ti=1) (and f(Ti=0)) can follow either Sections 4.2A) or 4.2B); the initialization of individual observer’s (p, q) values can follow either Section 4.3A) or 4.3B).
Based on the framework we proposed, we developed a web-based software application. The software is developed in Java and the architecture of the software is shown in Fig. 6.
The system consists of three components: the web browser, the application, and the server. Through the web browser, users download and invoke the Java application; this is made possible by the Java Web Start technology. The Java application has the following features:
The software on the server side includes a Java servlet and algorithms. The Java servlet communicates with the application. It receives the image, observer segmentations, and prior information from the application and sends the results back to the application after the algorithms finish computing.
We carried out several experiments in the four scenarios described in Section 4, using a subset of images from the NCI/NLM database, which contains 939 cervigrams with multi-observer segmentation data. Each image is rescaled to half the size of the original, which is 2399×1636 pixels, and is segmented by one to twenty medical experts with varying performance levels. For clarity of presentation, we show our results on one image (Fig. 8a) that was segmented by three observers, and compare the results with those of the STAPLE algorithm and the Majority Vote Rule. Experimental and comparison results on other images in the database have shown similar trends. On the example image, two observers (green and blue lines) give similar segmentations while the third (red line) differs from the other two. The sensitivity and specificity values are calculated inside the bounding box of the ROI (region of interest) rather than over the whole image.
We initialize the truth prior probability and the observer (p, q) values as outlined in Section 4.4:
In this case, we initialize the prior probability (Section 4.2B) and (p, q) values (Section 4.3B) based on the observers' segmentation data. The resulting ground truth map (Fig. 9b) is similar to that of the Majority Vote Rule shown in Fig. 10d.
In STAPLE, since there is no prior knowledge about either the truth prior probability or the (p, q) values of each observer, the prior probability is estimated as the sample mean of the relative proportion of the labels in the segmentations (Eq. 3). This means that each observer is treated equally. Since the truth prior is the dominant prior in the STAPLE algorithm, the results generated by STAPLE are similar to those of the Majority Vote Rule (Fig. 10d). The initial (p, q) values for each observer have little effect on the results, as can be seen in Fig. 10b and c. The initial (p, q) values of each observer are listed in Table 5.
Using the Majority Vote Rule, the truth prior and the initial (p, q) values are irrelevant and the results are completely determined by majority of the data, which is a shortcoming of the rule (Table 5; Fig. 10).
In the first group of experiments, we consider the case in which the (p, q) prior values for all observers are known (Table 6). In Experiment 1, each observer has equal (p, q) values while in Experiment 2, Observer 1 is an expert and Observers 2 and 3 are non-experts. The results are consistent with the (p, q) values set for each observer. In Experiment 2, the result leans toward the segmentation by Observer 1, who is an expert (Fig. 11c; Table 6).
The other situation in this scenario can be that the measures of performance are known for some observers only. In the second group of experiments, we specify (p, q) values for some observers (Table 7). The other observers are evaluated by our framework (Section 4.2) and their measures of performance are shown in Table 7 (Fig. 12).
As discussed in Section 3.2, the limitation of the STAPLE algorithm is that the truth prior probability is dominant and the (p, q) prior values of each observer are ignored (see Tables 1, 2 and 5 and Figs. 4, 5 and 10). Thus the STAPLE algorithm does not apply to this scenario. The Majority Vote Rule generates results that depend on the observer data alone, without considering prior information, so it does not apply to this scenario either.
We initialize (p, q) values for each observer as outlined in Section 4.3:
In this group of experiments (Table 10), each observer has initial sensitivity and specificity calculated from the segmentation data. We clearly see the effect on the estimated ground truth probability map given changes in the truth prior probability.
In the second group of experiments, we set the (p, q) values for each observer lower while again varying the prior probability. The results show again that the truth prior dominates over the observer prior (p, q) in STAPLE (Fig. 17). Comparing the first and second groups of experiments, one can see that the (p, q) settings do not affect STAPLE's final results. For instance, the resulting ground truth map in Fig. 16b is exactly the same as in Fig. 17b, and Fig. 16c the same as Fig. 17c, even though the (p, q) values in these two groups of experiments are very different. This again shows STAPLE's limitation pointed out in Section 3.2 (Table 12; Fig. 17).
When we have reliable estimates of both the truth prior probability and the observer (p, q) values, our method coherently balances their effects and integrates them in a complementary manner. We carried out two groups of experiments in this case, one with higher (p, q) values for each observer and the other with lower (p, q) values. At the same time, we changed the value of the truth prior probability. As one can see, when the (p, q) values are very high (close to 1.0), the effect of the observer data dominates over the truth prior probability, while when the (p, q) values are lower, indicating low confidence in the observer data, the truth prior clearly shows its effect (Tables 13 and 14; Figs. 16 and 17).
The STAPLE algorithm cannot handle this Scenario since the truth prior probability dominates even with very high (p, q) values for each observer.
In this experiment, we use our framework to evaluate our automatic segmentation method. First, we use the automatic segmentation method to differentiate acetowhite (AW) tissue from non-AW tissue. Then we use the results from multiple observers' manual segmentations to evaluate the result of the automatic segmentation method.
We use a database-guided segmentation paradigm in which we apply machine learning techniques, such as support vector machines (SVMs), to learn, from a database with ground-truth annotations provided by experts, critical visual signs that correlate with important tissue types, and then use the learned classifier for tissue segmentation in unseen images. The SVM classifier has been successfully applied to detecting microcalcifications in mammograms and to various other medical classification problems. We use an SVM to perform color-based tissue classification in order to segment different tissue regions, in particular to separate the biomarker AW region from the rest of the cervix. The segmentation performance is optimized with respect to the feature color space and granularity. We evaluate color spaces including RGB, HSV, and L*a*b*. For feature granularity, we train AW and other tissue classifiers first using individual pixel sample colors and then using cluster features returned by the Mean Shift clustering algorithm. Cluster features greatly reduce the dimensionality of training, so that the SVM scales to larger training sets while producing results with comparable accuracy. Given a novel test image, the Mean Shift clustering algorithm partitions the image into clusters of similar color and/or texture, and the trained SVM classifier (trained on cluster features) is applied to classify the clusters in the test image. This ground-truth-database-guided segmentation method is flexible in the number of tissue classes, so we can perform either two-label or multi-label classification.
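A toy version of this color-based SVM tissue classification can be sketched with scikit-learn (the color values and clean class separation below are synthetic stand-ins, not taken from the cervigram database; the actual system trains on Mean Shift cluster features rather than raw pixels):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for the training data: AW tissue assumed whitish
# (high, similar R, G, B values), non-AW assumed reddish. These color
# distributions are illustrative only.
rng = np.random.default_rng(0)
aw = rng.normal([220, 215, 210], 10, size=(200, 3))      # whitish pixels
non_aw = rng.normal([180, 90, 100], 10, size=(200, 3))   # reddish pixels
X = np.vstack([aw, non_aw])
y = np.array([1] * 200 + [0] * 200)                      # 1 = acetowhite

# RBF-kernel SVM trained on the RGB color features
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)

# Classify novel colors (in practice: mean colors of Mean Shift clusters)
test = np.array([[225, 220, 215], [175, 85, 95]])
pred = clf.predict(test)
# pred = [1, 0]: the whitish sample is labeled AW, the reddish one non-AW
```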
We demonstrate our results in the scenario where no prior information is known. We use the segmentation data itself to initialize the unknown priors: the truth probability prior and the (p, q) values of the multiple observers. Table 15 shows the prior probability and (p, q) values of the multiple observers, while Fig. 18 shows the original image, the result of our automatic segmentation method, and the ground-truth map. In Experiments 1 and 2, our automatic method has lower sensitivity than specificity, partly because it excluded the os region of the cervix (Table 15; Fig. 18).
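One simple way to initialize the unknown priors from the segmentation data, as described above, is to take a majority vote as a provisional truth map and measure each observer's (p, q) = (sensitivity, specificity) against it. This is a hedged sketch of that idea with synthetic binary segmentations; it is not the paper's exact initialization procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hidden "truth" map (unknown in practice) and three observers who
# each agree with it at 90% of pixels; the image is flattened to 1-D.
truth = (rng.random(1000) < 0.3).astype(int)
observers = np.array([
    np.where(rng.random(1000) < 0.9, truth, 1 - truth)
    for _ in range(3)
])

# Majority vote serves as the provisional truth estimate.
consensus = (observers.sum(axis=0) >= 2).astype(int)

# Initial truth prior: fraction of consensus-foreground pixels.
prior = consensus.mean()

# Per-observer (p, q) measured against the consensus map.
ps, qs = [], []
for d in observers:
    ps.append((d[consensus == 1] == 1).mean())  # sensitivity p
    qs.append((d[consensus == 0] == 0).mean())  # specificity q
```

These consensus-derived values can then seed the evaluation framework when neither a truth prior nor an observer-performance prior is available.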
In this paper, we have proposed a new method for multiple-observer segmentation evaluation based on an analysis of the STAPLE algorithm. The analysis covers different scenarios in which different kinds of prior knowledge are available. We first identified a limitation of the STAPLE algorithm: the observer-performance prior is effectively ignored in its framework. We instead formulate a Bayesian decision framework that balances the roles of the ground-truth segmentation prior and the observer-performance prior according to their availability and the confidence in their estimation. We demonstrate multi-observer segmentation evaluation results of our framework in four scenarios with differing prior knowledge and application purposes, and the results compare favorably to those of the STAPLE algorithm and the Majority Vote Rule. The results also show the flexibility of our method in effectively integrating different priors for multi-observer segmentation evaluation. Although we illustrate the results only on cervigrams, our method applies to multi-observer segmentation of any images. Currently, our online software only allows users to submit segmentations to the server as contours, in order to save transfer time; we will extend the software to accept binary images and other formats in the future. Our framework also does not yet integrate constraints such as structure or shape constraints; incorporating such additional prior information should yield more accurate evaluation results.
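The MAP principle underlying the framework can be illustrated with a per-pixel decision rule that weighs a foreground prior against each observer's assumed performance. This is a minimal sketch of the general principle, not the paper's full formulation; the numbers are illustrative.

```python
import numpy as np

def map_label(decisions, p, q, prior_fg):
    """Per-pixel MAP decision for binary segmentation.

    decisions: each observer's label (1 = foreground) at this pixel.
    p, q: assumed sensitivity and specificity per observer.
    prior_fg: prior probability that the pixel is truly foreground.
    Returns 1 if the foreground posterior dominates, else 0.
    """
    d = np.asarray(decisions)
    # Likelihood of the observed decisions under each hypothesis,
    # weighted by the truth prior.
    post_fg = np.prod(np.where(d == 1, p, 1 - p)) * prior_fg
    post_bg = np.prod(np.where(d == 1, 1 - q, q)) * (1 - prior_fg)
    return int(post_fg > post_bg)

# Two reliable observers vote foreground, one weak observer disagrees:
# their performance priors outweigh the low truth prior.
label = map_label([1, 1, 0],
                  p=np.array([0.9, 0.9, 0.6]),
                  q=np.array([0.9, 0.9, 0.6]),
                  prior_fg=0.3)  # → 1
```

Because the observer terms enter the posterior multiplicatively alongside the truth prior, a confident observer-performance prior can influence the decision even when the truth prior is weak, which is exactly the balance the framework is designed to respect.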
Future work also includes the following directions: (a) extending our framework to multiple labels; (b) extending it to 3D, which is straightforward: voxels replace pixels, the ground-truth map becomes a 3D probability map, and all equations in our framework remain the same as in 2D; (c) taking a spatial prior into consideration, as STAPLE does; (d) extending the current method, which operates on a single image with multiple observers' segmentations, to evaluate each observer's performance over their segmentations of multiple images; (e) applying the method to evaluating the performance of automatic segmentation algorithms and to improving consensus in training; and (f) integrating the method into model-based segmentation frameworks to provide feedback on how to refine model parameters.
Yaoyao Zhu is currently a Ph.D. candidate in the Computer Science and Engineering Department at Lehigh University. She received a B.S. degree in electronics from Beijing University, China, in 1995 and an M.S. degree in computer engineering from the University of Cincinnati in 2001. Her research interests include machine learning, pattern recognition and medical image processing.
Xiaolei Huang received her doctoral and master's degrees in computer science from Rutgers, the State University of New Jersey, in 2006 and 2001 respectively, and her bachelor's degree in computer science from Tsinghua University, China, in 1999. Since 2006, she has been an Assistant Professor in the Computer Science and Engineering Department at Lehigh University. Her research interests include computer vision, biomedical image analysis, and computer graphics. In these areas, she has authored or co-authored over 30 publications including journal articles, book chapters, and refereed conference proceedings papers. A member of the Institute of Electrical and Electronics Engineers and the Biomedical Engineering Society, she has served on the program committees of several biomedical imaging and computer vision conferences and reviews papers for journals including IEEE Transactions on Pattern Analysis and Machine Intelligence, Graphical Models, Medical Image Analysis, and IEEE Transactions on Biomedical Engineering. She is the holder of 1 U.S. patent and has 5 U.S. patents pending.
Wei Wang received the B.S. degree in electronics and information engineering from Beihang University (Beijing University of Aeronautics and Astronautics) in 2003 and the M.S. degree in electrical engineering from Lehigh University in 2007. He is currently a Ph.D. candidate in the IDEA Lab at Lehigh University. His research interests are clustering, image segmentation, and medical applications.
Daniel Lopresti received his bachelor's degree from Dartmouth College, Hanover, NH in 1982 and his Ph.D. degree in computer science from Princeton University, Princeton, NJ in 1987. He spent several years with the Computer Science Department, Brown University, Providence, RI, and then went on to help found the Matsushita Information Technology Laboratory in Princeton. He later spent time at Bell Labs. Since 2003, he has been with the Computer Science and Engineering Department, Lehigh University, Bethlehem, PA, where he leads research examining fundamental algorithmic and systems-related questions in pattern recognition, document analysis, bioinformatics, and computer security. He has authored or co-authored over 100 publications in journals and refereed conference proceedings and is the holder of 21 U.S. patents.
Rodney Long is an electronics engineer for the Communications Engineering Branch at the National Library of Medicine, where he has worked since 1990. Prior to his current job, he worked for 14 years in industry as a software developer and as a systems engineer. His research interests are in telecommunications, image processing, and scientific/biomedical databases. He has an M.A. in applied mathematics from the University of Maryland. He is a member of the Mathematical Association of America and the IEEE.
Dr. Sameer Antani is a Staff Scientist with the Lister Hill National Center for Biomedical Communications, an intramural R&D division of the National Library of Medicine (NLM) at the U.S. National Institutes of Health (NIH). His research interests are in image and text data management for large biomedical and multimedia archives. His research includes content-based indexing and retrieval of biomedical images (CBIR), combining image and text retrieval, topics in advanced multimodal medical document retrieval, and next-generation interactive (multimedia-rich) documents. He earned his B.E. (Computer) degree from the University of Pune, India, in 1994, and his M.E. and Ph.D. degrees in Computer Science and Engineering from the Pennsylvania State University, USA, in 1998 and 2001, respectively. Dr. Antani is a member of the IEEE, the IEEE Computer Society, and SPIE. He serves on the steering committee for the IEEE Symposium on Computer-Based Medical Systems (CBMS).
Zhiyun Xue joined the Lister Hill National Center for Biomedical Communications at the National Library of Medicine (NLM) in 2006. Her research interests are in the areas of medical image analysis, computer vision, and pattern recognition. She received her Ph.D. degree in Electrical Engineering from Lehigh University in 2006, and her master's and bachelor's degrees in Electrical Engineering from Tsinghua University, China, in 1998 and 1996, respectively.
George R. Thoma received the B.S. from Swarthmore College, and the M.S. and Ph.D. from the University of Pennsylvania, all in electrical engineering. As the senior electronics engineer and Chief of the Communications Engineering Branch of the Lister Hill National Center for Biomedical Communications, a research and development division of the National Library of Medicine, he directs R&D programs in image processing, document image storage on digital optical disks, automated document image delivery, digital X-ray archiving, and high-speed image transmission. He has also conducted research in analog videodiscs, satellite communications, and video teleconferencing. Dr. Thoma is a Fellow of SPIE, the International Society for Optical Engineering.
Yaoyao Zhu, Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA.
Xiaolei Huang, Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA.
Wei Wang, Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA.
Daniel Lopresti, Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA.
Rodney Long, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Sameer Antani, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
Zhiyun Xue, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
George Thoma, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.