In this section we show how measurements of recall and detection rates for two screening methods can be used to determine the equivalent relative utility (ERU): the relative utility of correct and incorrect decisions at which the two methods have equal decision-theoretic utility. We begin with a brief review of utility analysis for binary decision processes, then define the ERU measure, and show how it can be estimated from recall and cancer detection rates in matched populations without a separate estimate of disease prevalence.
2.1 Utility analysis of binary decisions
When screening mammography is regarded as a binary decision – with exam results dichotomized into categories of follow-up or no follow-up – utility is determined by the probabilities of the four possible outcomes. These are true positives (TP), where patients with disease are assigned to follow-up; false positives (FP), where patients who do not have disease are assigned to follow-up; true negatives (TN), where patients without disease are not assigned to follow-up; and false negatives (FN), where patients with disease are not assigned to follow-up. The basis for defining utility in binary decisions such as this is to assign a utility value to each of the outcomes ($U_{TP}$, $U_{FP}$, $U_{TN}$, $U_{FN}$) and then compute the total expected utility,

$$U = U_{TP}P(TP) + U_{FP}P(FP) + U_{TN}P(TN) + U_{FN}P(FN), \tag{1}$$

where $P(\cdot)$ indicates the probability of the outcome.
The various outcome probabilities can be decomposed into the TP and FP rates as well as the prevalence of the disease in the population, $\pi$. The true-positive rate, $R_{TP}$, is defined as the conditional probability of a positive finding given that disease is present, and the false-positive rate, $R_{FP}$, is the conditional probability of a positive finding given that disease is absent. These terms can be used to rewrite each of the outcome probabilities. Specifically, we see that $P(TP) = R_{TP}\pi$, $P(FN) = (1 - R_{TP})\pi$, $P(TN) = (1 - R_{FP})(1 - \pi)$, and $P(FP) = R_{FP}(1 - \pi)$. This reparameterization makes explicit the connection to ROC analysis, where $R_{TP}$ is plotted as a function of $R_{FP}$ for a diagnostic test.
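The decomposition above can be sketched in a few lines of Python. This is only an illustration: the function name and the utility values in the example are hypothetical and are not taken from the paper.

```python
def expected_utility(r_tp, r_fp, prevalence, u_tp, u_fp, u_tn, u_fn):
    """Total expected utility of a binary screening decision.

    r_tp, r_fp  -- true-positive and false-positive rates
    prevalence  -- disease prevalence in the screened population
    u_*         -- utility assigned to each of the four outcomes
    """
    p_tp = r_tp * prevalence
    p_fn = (1.0 - r_tp) * prevalence
    p_tn = (1.0 - r_fp) * (1.0 - prevalence)
    p_fp = r_fp * (1.0 - prevalence)
    # The four outcomes partition the population, so they sum to one.
    assert abs(p_tp + p_fn + p_tn + p_fp - 1.0) < 1e-12
    return u_tp * p_tp + u_fp * p_fp + u_tn * p_tn + u_fn * p_fn


# Hypothetical utilities, chosen only to exercise the function.
example = expected_utility(0.70, 0.05, 0.005,
                           u_tp=1.0, u_fp=-0.01, u_tn=0.0, u_fn=-1.0)
```

With these illustrative numbers the expected utility is dominated by the large true-negative and false-positive populations, which is why small per-case utilities for non-diseased patients still matter at screening prevalence.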
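Rearrangement of terms in Equation 1, with substitution of terms involving $R_{TP}$, $R_{FP}$, and $\pi$, results in the following iso-utility equation,

$$R_{TP} = \frac{(U_{TN} - U_{FP})(1 - \pi)}{(U_{TP} - U_{FN})\,\pi}\, R_{FP} + \frac{U - U_{FN}\pi - U_{TN}(1 - \pi)}{(U_{TP} - U_{FN})\,\pi}. \tag{2}$$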
The important feature of Equation 2 is that, for a fixed value of $U$, iso-utility curves in $R_{TP}$ and $R_{FP}$ take the form of a line with positive slope (under the reasonable assumption that correct decisions have greater utility than incorrect ones). This means that every pair ($R_{TP}$, $R_{FP}$) that satisfies Equation 2 for a given $U$ has equal utility. As we shall see, the slope of this line will play an important role in defining the ERU measure.
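As a quick numerical check of the iso-utility property, the following sketch computes the slope and intercept of the line through a given operating point and confirms that a second point on that line has the same expected utility. The helper names and the utility values are assumptions made for illustration.

```python
def expected_utility(r_tp, r_fp, pi, u_tp, u_fp, u_tn, u_fn):
    """Expected utility from outcome probabilities written via rates."""
    return (u_tp * r_tp * pi + u_fn * (1 - r_tp) * pi
            + u_tn * (1 - r_fp) * (1 - pi) + u_fp * r_fp * (1 - pi))


def iso_utility_line(u, pi, u_tp, u_fp, u_tn, u_fn):
    """Slope and intercept of the line R_TP = slope * R_FP + intercept
    on which every operating point has expected utility u."""
    denom = (u_tp - u_fn) * pi
    slope = (u_tn - u_fp) * (1 - pi) / denom
    intercept = (u - u_fn * pi - u_tn * (1 - pi)) / denom
    return slope, intercept


# Hypothetical utilities and operating point, for illustration only.
utils = dict(u_tp=1.0, u_fp=-0.01, u_tn=0.0, u_fn=-1.0)
pi = 0.005
u0 = expected_utility(0.70, 0.05, pi, **utils)
slope, intercept = iso_utility_line(u0, pi, **utils)

# Move along the line to a different false-positive rate.
r_fp_new = 0.10
r_tp_new = slope * r_fp_new + intercept
u1 = expected_utility(r_tp_new, r_fp_new, pi, **utils)
```

The slope is positive here because both $U_{TN} - U_{FP}$ and $U_{TP} - U_{FN}$ are positive, matching the assumption that correct decisions are worth more than incorrect ones.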
2.2 Relative utility and equivalent relative utility
We will follow Wagner et al. in defining the relative utility as the difference in utility between correct and incorrect decisions when the patient has disease, divided by this difference when the patient is not diseased,

$$U_{Rel} = \frac{U_{TP} - U_{FN}}{U_{TN} - U_{FP}}. \tag{3}$$
Note that the numerator and denominator in Equation 3 could be switched and the result would still provide a reasonable definition of relative utility. However, we believe Equation 3 is more intuitive for screening applications, since the utility of correctly diagnosing patients with disease is generally considered to be higher than that of correctly diagnosing normal patients, and thus relative utility should be large ($U_{Rel} > 1$). Using this definition of relative utility, the iso-utility line in Equation 2 is given by

$$R_{TP} = \frac{Q_\pi}{U_{Rel}}\, R_{FP} + \frac{U - U_{FN}\pi - U_{TN}(1 - \pi)}{(U_{TP} - U_{FN})\,\pi}, \tag{4}$$

where $Q_\pi$ is an odds term based on disease prevalence (the odds against disease), $Q_\pi = (1 - \pi)/\pi$. At typical estimates of prevalence in breast cancer screening ($\pi \approx 0.5\%$), this ratio is roughly 200.
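The quantities in this subsection are simple to compute. The sketch below uses hypothetical function names and hypothetical utility values; only the prevalence of 0.5% comes from the text.

```python
def prevalence_odds(pi):
    """Odds against disease, Q_pi = (1 - pi) / pi."""
    return (1.0 - pi) / pi


def relative_utility(u_tp, u_fp, u_tn, u_fn):
    """Utility gain of a correct decision in diseased patients divided
    by the gain of a correct decision in non-diseased patients."""
    return (u_tp - u_fn) / (u_tn - u_fp)


def iso_utility_slope(pi, u_rel):
    """Slope of the iso-utility line in ROC coordinates."""
    return prevalence_odds(pi) / u_rel


q = prevalence_odds(0.005)                       # close to 200
u_rel = relative_utility(1.0, -0.01, 0.0, -1.0)  # hypothetical utilities
slope = iso_utility_slope(0.005, u_rel)
```

Because $Q_\pi$ is so large at screening prevalence, the iso-utility slope is near 1 only when the relative utility is itself of the order of a few hundred.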
The accompanying figure summarizes the standard relationship between utility and the operating point of a diagnostic test [6]. The ROC curve specifies a set of possible operating points, with the utility of each point governed by Equation 2. Traditionally, utility has been used to derive the optimal operating point of the ROC curve, which is the point at which the ROC curve is tangent to an iso-utility line. The slope of the iso-utility line – and hence the tangent point on the ROC curve – is highly dependent on the relative utility. This has been considered a limitation of utility analysis, since there is no universally agreed upon value for this quantity [11]. The ERU metric we propose uses these same concepts to compare two screening systems in a way that does not require an a priori established relative utility.
Thus far we have considered the process of adopting a given utility structure and have seen its consequences in terms of the true-positive and false-positive rates that have equal utility. Now we reverse the situation: we start with the operating points of two diagnostic tests and derive the relative utility that makes them both fall on a single iso-utility line. Imagine that we seek to compare two well-characterized diagnostic tests. Test 1 has an operating point ($R_{TP,1}$, $R_{FP,1}$) and Test 2 has operating point ($R_{TP,2}$, $R_{FP,2}$). The slope of the line connecting these two operating points is given by

$$\text{Slope} = \frac{R_{TP,2} - R_{TP,1}}{R_{FP,2} - R_{FP,1}}. \tag{5}$$
We define the ERU between Test 1 and Test 2 as the relative utility needed for Test 1 and Test 2 to lie on an iso-utility line. Therefore the ERU can be found by equating the slopes in Equation 4 and Equation 5, and solving for the relative utility,

$$ERU = \frac{Q_\pi}{\text{Slope}} = Q_\pi\, \frac{R_{FP,2} - R_{FP,1}}{R_{TP,2} - R_{TP,1}}. \tag{6}$$
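A minimal sketch of this definition follows, assuming the operating points and the prevalence are known. The function name and the two operating points are hypothetical.

```python
def eru_from_operating_points(r_tp1, r_fp1, r_tp2, r_fp2, pi):
    """Relative utility at which two operating points lie on one
    iso-utility line: Q_pi divided by the slope between the points."""
    if r_tp2 == r_tp1:
        raise ValueError("identical true-positive rates: ERU is undefined")
    q = (1.0 - pi) / pi
    return q * (r_fp2 - r_fp1) / (r_tp2 - r_tp1)


# Hypothetical tests: test 2 is more sensitive but recalls more normals.
eru = eru_from_operating_points(0.70, 0.05, 0.80, 0.08, pi=0.005)
```

For these illustrative points the slope between the tests is about 3.3 and the prevalence odds are 199, so the ERU is a little under 60.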
2.3 Interpretation of ERU
We propose ERU as a measure for comparing screening methods. We therefore need to be able to interpret the result and say one method is better, assuming statistical significance is achieved. As a first step, we consider the case when one method increases the true-positive rate and reduces the false-positive rate. In this case one test is clearly superior to the other. We also note that a test with a lower recall rate and simultaneously higher detection rate will always have higher sensitivity and a lower false-positive rate. In this case, the ERU will be negative because the slope between the two operating points is negative. Thus the interpretation of a negative ERU is that one method is superior, and it would take a fundamentally flawed utility structure to make them equivalent. The superior method should be readily apparent from sensitivity/specificity data or alternatively recall/detection rate data.
Now let us assume without loss of generality that test 1 has the lower false-positive rate and a lower true-positive rate as well, as shown in the accompanying figure. We can say that for a putative relative utility greater than the ERU, test 2 is superior since it would reside on a better iso-utility line. Conversely, for a putative relative utility less than the ERU, test 1 is superior since it would then reside on a better iso-utility line. While the ultimate judgment of the systems still requires definition of the appropriate relative utility for interpretation, the ERU itself can be readily estimated without it. We will discuss this issue further in Section 3.
Figure: Interpretation of Equivalent Relative Utility (ERU)
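The decision rule described above can be sketched as a sign test on the utility difference between the two tests. Dividing that difference by the positive factor $(U_{TN} - U_{FP})\pi$ leaves $U_{Rel}\,\Delta R_{TP} - Q_\pi\,\Delta R_{FP}$, which is what the hypothetical function below evaluates; the operating points and putative relative utilities are illustrative assumptions.

```python
def preferred_test(r_tp1, r_fp1, r_tp2, r_fp2, pi, putative_u_rel):
    """Return 1 or 2 for the test with higher expected utility under a
    putative relative utility, or 0 if the two are exactly tied."""
    q = (1.0 - pi) / pi
    # Sign of (utility of test 2) - (utility of test 1), up to a
    # positive scale factor.
    score = putative_u_rel * (r_tp2 - r_tp1) - q * (r_fp2 - r_fp1)
    if score > 0:
        return 2
    if score < 0:
        return 1
    return 0


# Hypothetical operating points whose ERU is about 59.7 at pi = 0.005.
points = (0.70, 0.05, 0.80, 0.08)
high = preferred_test(*points, pi=0.005, putative_u_rel=200.0)
low = preferred_test(*points, pi=0.005, putative_u_rel=20.0)
```

A putative relative utility above the ERU favors the more sensitive test 2, and one below it favors test 1. The same rule also handles the dominance case, where one test wins for every positive relative utility.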
2.4 Determination of ERU from recall and detection rates
We have shown how ERU can be determined from true-positive and false-positive rates. However, as mentioned in the introduction, it is often easier to acquire recall and cancer detection rates in practical studies. In this section we describe how these measures can be used to find the ERU. As we shall see, a surprising result of using recall and detection rates to determine the ERU is that explicit reference to disease prevalence cancels, and thus a separate estimate of disease prevalence for the population is not required.
The cancer detection rate, $R_D$, is simply the probability of a true-positive outcome, and therefore it is related to the true-positive rate by

$$R_D = \pi R_{TP}. \tag{7}$$

The recall rate, $R_R$, is the combined rate of true-positive and false-positive outcomes, and is therefore related to the true-positive and false-positive rates by

$$R_R = \pi R_{TP} + (1 - \pi) R_{FP}. \tag{8}$$

From the recall and detection rates we can solve for the true-positive and false-positive rates by

$$R_{TP} = \frac{R_D}{\pi}, \qquad R_{FP} = \frac{R_R - R_D}{1 - \pi}. \tag{9}$$
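The two directions of this conversion can be sketched as a pair of functions and checked with a round trip. The function names and the example rates are assumptions for illustration.

```python
def recall_and_detection(r_tp, r_fp, pi):
    """Detection rate and recall rate implied by an operating point."""
    r_d = pi * r_tp
    r_r = r_d + (1.0 - pi) * r_fp
    return r_d, r_r


def operating_point(r_d, r_r, pi):
    """Invert the relations above to recover (R_TP, R_FP)."""
    r_tp = r_d / pi
    r_fp = (r_r - r_d) / (1.0 - pi)
    return r_tp, r_fp


r_d, r_r = recall_and_detection(0.70, 0.05, pi=0.005)
r_tp, r_fp = operating_point(r_d, r_r, pi=0.005)
```

Note that the inversion needs the prevalence explicitly, which is the dependence the rest of this subsection removes.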
Let us now assume a situation similar to that of Equation 5, except that now we have cancer-detection and recall rate data for two tests instead of true-positive and false-positive rate data. Let ($R_{D,1}$, $R_{R,1}$) be the detection rate and recall rate for test 1 and ($R_{D,2}$, $R_{R,2}$) be the corresponding measures for test 2. Using Equation 9 to determine ($R_{TP,1}$, $R_{FP,1}$) and ($R_{TP,2}$, $R_{FP,2}$), and then using these to determine the slope in Equation 5, yields

$$\text{Slope} = Q_\pi\, \frac{R_{D,2} - R_{D,1}}{(R_{R,2} - R_{D,2}) - (R_{R,1} - R_{D,1})}. \tag{10}$$

Substituting this into Equation 6 specifies the ERU between test 1 and test 2 as

$$ERU = \frac{(R_{R,2} - R_{D,2}) - (R_{R,1} - R_{D,1})}{R_{D,2} - R_{D,1}}. \tag{11}$$

Equation 11 shows that the ERU is defined by the difference in the rate of false-positive recalls ($R_R - R_D$) divided by the difference in the rate of detected cancers. Cancelling terms yields a final form in terms of the differences in recall and detection rates,

$$ERU = \frac{R_{R,2} - R_{R,1}}{R_{D,2} - R_{D,1}} - 1. \tag{12}$$
Note that in Equation 11 and Equation 12, all prevalence terms have canceled, so there is no need to know $\pi$ explicitly in order to determine the ERU. Viewed another way, the prevalence dependence of ERU is built implicitly into the recall and cancer-detection rates, and thus does not require separate measurement.
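The cancellation of prevalence can be checked numerically. The sketch below computes the ERU directly from recall and detection rates, then recomputes it through the true-positive and false-positive rates using several assumed prevalence values; all function names and rates are hypothetical.

```python
def eru_from_rates(r_d1, r_r1, r_d2, r_r2):
    """ERU from detection and recall rates alone (no prevalence)."""
    if r_d2 == r_d1:
        raise ValueError("identical detection rates: ERU is undefined")
    return (r_r2 - r_r1) / (r_d2 - r_d1) - 1.0


def eru_via_operating_points(r_d1, r_r1, r_d2, r_r2, pi):
    """Same quantity, computed the long way through R_TP and R_FP."""
    r_tp1, r_fp1 = r_d1 / pi, (r_r1 - r_d1) / (1.0 - pi)
    r_tp2, r_fp2 = r_d2 / pi, (r_r2 - r_d2) / (1.0 - pi)
    q = (1.0 - pi) / pi
    return q * (r_fp2 - r_fp1) / (r_tp2 - r_tp1)


# Hypothetical rates: detection 3.5 vs 4.0 per 1000, recall 5.325% vs 8.36%.
rates = (0.0035, 0.05325, 0.0040, 0.0836)
direct = eru_from_rates(*rates)
# The assumed prevalence drops out of the long-way calculation.
long_way = [eru_via_operating_points(*rates, pi=p) for p in (0.005, 0.01, 0.05)]
```

Every entry of `long_way` agrees with `direct` to within floating-point error, whichever prevalence is assumed.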
2.5 Estimation of ERU from measured detection and recall rates
Estimation of the ERU consists of replacing the recall and detection rates in Equation 12 with sample estimates $\hat{R}_D$ and $\hat{R}_R$. Let $N_1$ be the total sample size (i.e. the total number of patients evaluated by method 1), and let $N_{R,1}$ be the number of these patients recalled for follow-up and $N_{D,1}$ be the number of patients with detected cancer. The estimated recall and detection rates are determined from the sample proportions

$$\hat{R}_{R,1} = \frac{N_{R,1}}{N_1}, \qquad \hat{R}_{D,1} = \frac{N_{D,1}}{N_1}. \tag{13}$$

An analogous procedure is used to produce $\hat{R}_{D,2}$ and $\hat{R}_{R,2}$. These estimates of recall and detection rates can be used in Equation 12 to produce an estimate of the ERU.
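A plug-in estimate from raw counts might look like the following sketch. The function name and the counts are hypothetical and are not data from any study.

```python
def estimate_eru(n1, n_recalled1, n_detected1, n2, n_recalled2, n_detected2):
    """Plug-in ERU estimate from screening counts for two methods."""
    for n, n_r, n_d in ((n1, n_recalled1, n_detected1),
                        (n2, n_recalled2, n_detected2)):
        # Detected cancers are a subset of recalls, which are a subset
        # of all screened patients.
        if not 0 <= n_d <= n_r <= n or n == 0:
            raise ValueError("counts must satisfy 0 <= N_D <= N_R <= N, N > 0")
    r_r1, r_d1 = n_recalled1 / n1, n_detected1 / n1
    r_r2, r_d2 = n_recalled2 / n2, n_detected2 / n2
    if r_d2 == r_d1:
        raise ValueError("identical detection rates: ERU is undefined")
    return (r_r2 - r_r1) / (r_d2 - r_d1) - 1.0


# Hypothetical counts: 100,000 patients screened by each method.
eru_hat = estimate_eru(100_000, 10_000, 400, 100_000, 12_000, 600)
```

In this illustration the second method recalls 2,000 more patients and detects 200 more cancers, so each additional detected cancer costs nine additional false-positive recalls.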
However, as we shall see, ERU is difficult to estimate precisely, and hence it is probably more useful to describe it in terms of confidence intervals than a point estimate. In Appendix 1 we describe a posterior sampling method for computing Bayesian confidence intervals on ERU estimates computed from observed proportions as in Equation 13. For relative utilities within the confidence interval, the data do not determine which system is optimal.
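The details of the Appendix 1 procedure are not reproduced in this section, so the following is only a generic sketch of posterior sampling for a ratio of this kind, not the authors' method. It assumes that each patient falls into one of three outcomes (detected cancer, false-positive recall, not recalled), that the two cohorts are independent, and that each cohort has a uniform Dirichlet prior; the function names, prior, sample size, and counts are all hypothetical.

```python
import random


def sample_rates(rng, n, n_recalled, n_detected, prior=1.0):
    """Draw (R_D, R_R) from a Dirichlet posterior over three outcomes,
    using normalized gamma variates (assumed uniform prior)."""
    counts = (n_detected, n_recalled - n_detected, n - n_recalled)
    gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(gammas)
    r_d = gammas[0] / total
    r_r = (gammas[0] + gammas[1]) / total
    return r_d, r_r


def eru_interval(counts1, counts2, n_samples=4000, level=0.95, seed=0):
    """Equal-tailed posterior interval for the ERU.

    counts1, counts2 -- (N, N_R, N_D) tuples for methods 1 and 2.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        r_d1, r_r1 = sample_rates(rng, *counts1)
        r_d2, r_r2 = sample_rates(rng, *counts2)
        draws.append((r_r2 - r_r1) / (r_d2 - r_d1) - 1.0)
    draws.sort()
    tail = (1.0 - level) / 2.0
    lo = draws[int(tail * (n_samples - 1))]
    hi = draws[int((1.0 - tail) * (n_samples - 1))]
    return lo, hi


# Hypothetical counts (N, N_R, N_D); the plug-in ERU estimate is 9.
lo, hi = eru_interval((100_000, 10_000, 400), (100_000, 12_000, 600))
```

One design caveat: when the posterior for the difference in detection rates has appreciable mass on both sides of zero, the ERU draws change sign through infinity and a simple percentile interval like this one becomes hard to interpret. The example counts were chosen so that this does not happen.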
2.6 Limitations of the approach
As a final step in presenting the general methodology for evaluating and interpreting the ERU, we review some important limitations of the approach. The first is the critical assumption of equal disease prevalence in the two (or more) cohorts being evaluated. This issue is endemic to comparisons of recall and detection rates in general, since these depend on disease prevalence. As an example, consider two hypothetical tests with identical true-positive and false-positive rates of 70% and 5% respectively. Now assume that test 1 is evaluated in a cohort with a disease prevalence of 5/1000, and test 2 is evaluated in a cohort with a prevalence of 7/1000. The recall and detection rates will be (5.33%, 3.5/1000) for test 1 and (5.46%, 4.9/1000) for test 2. This results in a negative ERU, suggesting that test 2 is superior even though the two tests have identical operating points. Thus differences in underlying prevalence can bias the ERU. The ERU measure, as defined here, is only appropriate for comparison in cohorts that have been selected in a way that does not lead to a systematic mismatch in prevalence.
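The arithmetic of this example can be reproduced with a short sketch. The function name is hypothetical; the rates and prevalences are those of the example.

```python
def recall_and_detection(r_tp, r_fp, pi):
    """Recall rate and detection rate implied by an operating point."""
    r_d = pi * r_tp
    r_r = r_d + (1.0 - pi) * r_fp
    return r_r, r_d


# Identical tests (70% sensitivity, 5% false-positive rate) evaluated
# in cohorts with mismatched prevalence.
r_r1, r_d1 = recall_and_detection(0.70, 0.05, pi=0.005)
r_r2, r_d2 = recall_and_detection(0.70, 0.05, pi=0.007)

# ERU computed as if the cohorts were matched.
eru = (r_r2 - r_r1) / (r_d2 - r_d1) - 1.0
```

When the two tests are truly identical and only prevalence differs, the algebra reduces to an ERU of $-R_{FP}/R_{TP}$, about $-0.07$ here, which illustrates how a prevalence mismatch alone can produce an apparently decisive result.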
A second important issue is to recognize that the analysis here is based on decision utility, which is not equivalent to a cost/benefit analysis. A screening modality that has a favorable ERU may still be prohibitively costly.
A third potential limitation of the analysis is that it strongly links the screening modality with an operating point. A suboptimal operating point used in one modality may result in a poor ERU in the comparison. This may be the consequence of an unfamiliar new method leading to the adoption of an overly strict or lax decision criterion. For criterion-free analysis, ROC-type studies are more appropriate. The ERU is most appropriate for analyzing clinical performance when the operating point, as well as the technology or methodology, is under evaluation.