Our results confirm the observation of others that the level of agreement between operators marking hyperintense MS lesions, as measured by SI, depends on the lesion burden shown in the test scans. Using AICc, which accounts for the number of parameters a model uses, the DOEE method was significantly better than the mean value, linear, or quadratic fit of the SI values (p < .0001 for all three comparisons). Using the mean DE and OER values calculated across scans, our expression for SI in terms of MTA, SI = 1 – (1/2)(DE/MTA + OER), has a remarkable .83 linear correlation (p < .001) with the SI values calculated for each pair of ROIs associated with a scan. Likewise, the small residual errors indicate a very good fit (Figure ). Just as importantly, the values for DE and OER are not significantly correlated with MTA. Hence, if SI is used, operator agreement will appear poorest on low lesion load scans (i.e., MTA is small and DE/MTA is relatively large in the equation for SI directly above) and best on high lesion load scans (i.e., MTA is large and SI can be approximated as 1 – (1/2) OER). To summarize, while SI is commonly used to measure rater ability, it heavily reflects the lesion burden of the test set used. However, we are able to explain SI's dependence on lesion burden using just two parameters that do not themselves depend on lesion burden. We therefore propose these values (mean DE and mean OER) either as an addition or as an alternative to the reporting of mean SI values for assessing rater agreement.
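The dependence of SI on lesion burden described above can be illustrated with a short numerical sketch. It assumes the SI expression takes the form SI = 1 – (1/2)(DE/MTA + OER), which is consistent with the high-MTA limit of 1 – (1/2) OER stated in the text. The function name `si_estimate` and the numerical values of DE, OER, and MTA are hypothetical and are not taken from the study.

```python
def si_estimate(mta, de, oer):
    """Estimate SI from mean total area (MTA), mean DE, and mean OER.

    Assumed form (not confirmed by the source): SI = 1 - 0.5 * (DE / MTA + OER).
    """
    if mta <= 0:
        raise ValueError("MTA must be positive")
    return 1.0 - 0.5 * (de / mta + oer)


# Hypothetical values for illustration only (not results from the study).
MEAN_DE = 40.0   # mean detection error, in the same area units as MTA
MEAN_OER = 0.30  # mean outline error rate, dimensionless

if __name__ == "__main__":
    for mta in (100.0, 400.0, 1600.0, 6400.0):
        si = si_estimate(mta, MEAN_DE, MEAN_OER)
        print(f"MTA = {mta:6.0f}   estimated SI = {si:.3f}")
    # As MTA grows, the DE/MTA term vanishes and SI approaches 1 - 0.5 * OER.
    print(f"high-MTA limit = {1.0 - 0.5 * MEAN_OER:.3f}")
```

With DE and OER held fixed, the estimate rises steeply at small MTA and then levels off, so the same pair of raters would appear to agree better simply because the test scans show more lesion area.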
The SI values plotted against MTA (Figure ) show an initial steep rise followed by a leveling off at larger values of MTA. This general shape can be observed in graphs relating SI and lesion burden from other centers [13]. The rank correlation between SI and MTA was highly significant (p < .001). As values for SI are highly correlated with Kappa and JI, these latter indices would also be highly dependent on the lesion burden of the patients used in the test set.
Our approach divides operator differences into two types: DE and OE. These two types of errors have different characteristics. DE was predominantly constant across scans and had a non-significant rank correlation with MTA (p = .83). On the other hand, OE showed a strong linear relationship with MTA (Figure ). This led to our use of OER in our equation for SI, which has a low rank correlation with MTA (ρ .37). OE's direct dependence on MTA is reasonable. MTA increases when there are more lesions or when the average lesion size increases. In either case, we expect the outline error to increase. It may seem reasonable to assume a similar relationship for DE, that is, that more lesions would lead operators to have a larger absolute number of differences in detecting lesions. However, this is not the case. The predominant relationship is that DE is relatively constant across scans and MTA values (Figure ) and is well represented by a line with an intercept equal to the mean DE and a slope equal to zero. This relationship suggests that operators are more likely to agree on marking a small lesion (a lower rate of detection error) on a scan depicting high lesion burden than on one depicting low lesion burden. That is, even though raters must mark more lesions on scans depicting high lesion volume, they will likely have the same total difference in the detection of lesions (DE) as on a scan depicting low lesion burden. We believe that DE remaining relatively constant across a range of lesion loads indicates that the total size of “subtle” or ambiguous lesions remains relatively constant across scans. Outline error, on the other hand, can be well represented by a line with an intercept equal to zero and a slope equal to OER (Figure ).
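The two linear descriptions above (DE as a horizontal line, OE as a line through the origin) can be sketched as follows. This is a minimal illustration under stated assumptions: the slope through the origin is obtained by ordinary least squares, which may differ from how mean OER was computed in the study, and the function names and per-scan values are hypothetical.

```python
def fit_constant(values):
    """Best constant fit (zero slope): the mean of the observations."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)


def fit_slope_through_origin(x, y):
    """Least-squares slope of y = slope * x with the intercept fixed at zero."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("x must contain a non-zero value")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


# Hypothetical per-scan measurements (not data from the study).
mta = [100.0, 200.0, 400.0, 800.0]  # mean total area per scan
de = [38.0, 42.0, 41.0, 39.0]       # detection error per scan
oe = [30.0, 60.0, 120.0, 240.0]     # outline error per scan

if __name__ == "__main__":
    print("DE modelled as a constant:", fit_constant(de))
    print("OE modelled as slope * MTA, slope:", fit_slope_through_origin(mta, oe))
```

In this sketch the fitted constant plays the role of mean DE and the fitted slope plays the role of OER, the two lesion-burden-independent parameters proposed in the text.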
Detection error measurements, namely the total size (DE) and the number of missed ROIs (Cumulative Detection Error graph), are especially important in the analysis of longitudinal studies. For example, a goal of many ROI analyses is to establish the number of (typically small) lesions that may have newly appeared or disappeared with respect to a previous scan. In this regard, agreement measures such as SI, JI, or Kappa (or worse, operator agreement in measuring total lesion volume) are poorly suited to the task. This is especially true if the scans have a high lesion burden, since these measures are then dominated by the raters' agreement on the outlines of large lesions. If the analysis requires the detection of small lesions, we recommend the use of the Cumulative Detection Error graph to estimate the expected number of detection errors above a given threshold size. We then recommend that a lesion size threshold be chosen for the analysis such that the average number of disagreements is small.
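The threshold selection recommended above can be sketched in a few lines. This is an illustration only: it assumes the detection errors are available as a list of ROI sizes per scan, counts errors at or above the threshold, and uses hypothetical names, sizes, and tolerance. It is not the procedure used to build the Cumulative Detection Error graph in the study.

```python
def mean_errors_at_or_above(sizes_per_scan, threshold):
    """Average number of detection errors per scan with size >= threshold."""
    if not sizes_per_scan:
        raise ValueError("need at least one scan")
    count = sum(1 for sizes in sizes_per_scan for s in sizes if s >= threshold)
    return count / len(sizes_per_scan)


def smallest_acceptable_threshold(sizes_per_scan, candidates, max_mean_errors):
    """Smallest candidate threshold whose mean error count is within tolerance.

    Returns None when no candidate satisfies the tolerance.
    """
    for threshold in sorted(candidates):
        if mean_errors_at_or_above(sizes_per_scan, threshold) <= max_mean_errors:
            return threshold
    return None


# Hypothetical sizes of ROIs marked by only one rater, one list per scan.
detection_error_sizes = [
    [2, 3, 5, 12],
    [1, 4, 4],
    [3, 8, 20, 2],
]

if __name__ == "__main__":
    for t in (1, 5, 10, 15):
        mean_count = mean_errors_at_or_above(detection_error_sizes, t)
        print(f"threshold {t:2d}: {mean_count:.2f} errors per scan")
    print("chosen threshold:",
          smallest_acceptable_threshold(detection_error_sizes, (1, 5, 10, 15), 1.0))
```

Choosing the smallest threshold that meets the tolerance keeps as many small lesions in the analysis as possible while limiting the expected number of rater disagreements.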
OE is the major contributor of error by volume. While for low lesion burden the contributions of OE and DE were similar, OE was more than 5 times larger than DE for scans showing high lesion burden. As such, reducing OE (or OER) should have the greater impact on improving inter-rater agreement in measuring lesion volume. It is, therefore, not surprising that outlining lesions with semi-automated contouring methods has been shown to reduce inter-rater variability compared to manual outlining [6]. The correlation between individual CR union size and the intersection/union fraction was tested for 1131 CRs, and a near-zero correlation (p = .7883) was observed. This indicates that outline agreement behaves similarly for ROIs of all sizes. The presented Outline Error Distribution graph makes use of this fact and uses ROIs of all sizes for the distribution. Even with the above findings, it is still possible that lesions of similar size will have slightly different values for the intersection/union fraction depending on the overall lesion load of the scan from which the lesion was taken.
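The size-independence check described above can be sketched as a correlation between each CR's union size and its intersection/union fraction. The sketch uses a hand-written Pearson coefficient as an assumption, since the text does not specify the exact test. The function name and the (intersection, union) pairs are hypothetical and were chosen so that the fraction does not vary with size.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (intersection, union) sizes for four CRs, not study data.
regions = [(7, 10), (5, 10), (35, 50), (25, 50)]

unions = [float(u) for _, u in regions]
fractions = [i / u for i, u in regions]  # intersection/union per CR

if __name__ == "__main__":
    print("intersection/union fractions:", fractions)
    print("correlation with union size:", pearson_r(unions, fractions))
```

A coefficient near zero, as in this constructed example, is what would justify pooling ROIs of all sizes into a single outline error distribution.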
Breaking operator agreement into DE and OER allows an operator to be evaluated on either or both criteria according to the demands of the application. Our tests and observations provide an introduction to the tools we developed for comparing raters creating ROIs of MS lesions. Their development was driven by our testing of automated lesion detection methods. In that work, it quickly became apparent that the success of a method as measured by SI had little to do with the method itself and was instead largely driven by the lesion burden revealed by the images. Automated lesion detection methods are regularly reported in the literature, with their performance typically described in terms of JI, SI, or Kappa. Based on the results presented here, it is difficult for the reader to compare the results of different methods, since the lesion burden of the patients used to construct a test set of scans dominates how well a method performs in terms of SI. Had methods similar to ours been used, it would be relatively easy to assess the strengths and weaknesses of the different methods.
OE, DE, and SI measure only the difference between raters; they do not treat either rater as a gold standard, and so make no "False Positive" or "False Negative" distinctions [19]. However, our "Cumulative Detection Error" and "Outline Error Distribution" graphs provide an informative approach for examining whether biases exist between raters, and this approach is consistent with our division between detection and outline differences. The initial observations made here lead to many new questions and research areas. For instance, would the incorporation of lesion contrast, either with or in place of lesion size, provide a better variable for the functions measuring detection and outline agreement? Additionally, our approach demonstrated usefulness for comparing rater agreement across scanning modalities, which allows us to answer questions such as: “Do raters agree better when measuring ROIs on a 3 T scanner versus a 1.5 T scanner?” Used in this way, our method would be able to determine whether a hypothesized improvement is due to improved detection or improved outline agreement.
While we propose DE and OER as better measures for the comparison of raters' masks than SI, JI, or Kappa, these still do not strictly measure rater performance alone. In the case of comparing 1.5 T vs. 3 T scanning modalities, this can be used as an advantage. In general, we anticipate that raters will perform better on high quality images than on low quality images. However, our methods remove a significant confounding problem in the comparison of raters that afflicts the indices SI, JI, and Kappa. Our testing used ROI sets from two raters on 17 scans, which is more than would typically be used to evaluate a rater, and was sufficient to demonstrate the very strong correlation (r = .83, p < .001) between our estimate and true SI values. The full utility of our measures, as with SI or others, will have to be established over time, as they are used in a wider variety of applications.