The receiver operating characteristic (ROC) curve, which is defined as a plot of test sensitivity as the y coordinate versus its 1-specificity or false positive rate (FPR) as the x coordinate, is an effective method of evaluating the performance of diagnostic tests. The purpose of this article is to provide a nonmathematical introduction to ROC analysis. Important concepts involved in the correct use and interpretation of this analysis, such as smooth and empirical ROC curves, parametric and nonparametric methods, the area under the ROC curve and its 95% confidence interval, the sensitivity at a particular FPR, and the use of a partial area under the ROC curve are discussed. Various considerations concerning the collection of data in radiological ROC studies are briefly discussed. An introduction to the software frequently used for performing ROC analyses is also presented.
The receiver operating characteristic (ROC) curve, which is defined as a plot of test sensitivity as the y coordinate versus its 1-specificity or false positive rate (FPR) as the x coordinate, is an effective method of evaluating the quality or performance of diagnostic tests, and is widely used in radiology to evaluate the performance of many radiological tests. Although one does not necessarily need to understand the complicated mathematical equations and theories of ROC analysis, understanding the key concepts of ROC analysis is a prerequisite for the correct use and interpretation of the results that it provides. This article is a nonmathematical introduction to ROC analysis for radiologists who are not mathematicians or statisticians. Important concepts are discussed along with a brief discussion of the methods of data collection to use in radiological ROC studies. An introduction to the software programs frequently used for performing ROC analyses is also presented.
Sensitivity and specificity, which are defined as the number of true positive decisions/the number of actually positive cases and the number of true negative decisions/the number of actually negative cases, respectively, constitute the basic measures of performance of diagnostic tests (Table 1). When the results of a test fall into one of two clearly defined categories, such as either the presence or absence of a disease, then the test has only one pair of sensitivity and specificity values. However, in many diagnostic situations, making a decision in a binary mode is both difficult and impractical. Image findings may not be obvious or clear-cut. There may be considerable variation in the diagnostic confidence levels between the radiologists who interpret the findings. As a result, a single pair of sensitivity and specificity values is insufficient to describe the full range of diagnostic performance of a test.
Consider an example of 70 patients with solitary pulmonary nodules who underwent plain chest radiography to determine whether the nodules were benign or malignant (Table 2). According to the biopsy results and/or follow-up evaluations, 34 patients actually had malignancies and 36 patients had benign lesions. Chest radiographs were interpreted according to a five-point scale: 1 (definitely benign), 2 (probably benign), 3 (possibly malignant), 4 (probably malignant), and 5 (definitely malignant). In this example, one can choose from four different cutoff levels to define a positive test for malignancy on the chest radiographs: viz. ≥2 (i.e., the most liberal criterion), ≥3, ≥4, and 5 (i.e., the most stringent criterion). Therefore, there are four pairs of sensitivity and specificity values, one pair for each cutoff level, and the sensitivities and specificities depend on the cutoff levels that are used to define the positive and negative test results (Table 3). As the cutoff level decreases, the sensitivity increases while the specificity decreases, and vice versa.
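The dependence of sensitivity and specificity on the cutoff level can be illustrated with a short Python sketch. The rating counts below are hypothetical (they are not the values in Table 2; only the group sizes of 36 benign and 34 malignant cases follow the example), and the function name sens_spec is an assumption made for illustration.

```python
# Hypothetical rating counts, NOT the values of Table 2.
# Index 0 holds rating 1 (definitely benign), index 4 holds rating 5 (definitely malignant).
benign = [15, 10, 6, 3, 2]      # 36 patients with benign lesions
malignant = [2, 3, 5, 10, 14]   # 34 patients with malignancies


def sens_spec(benign_counts, malignant_counts, cutoff):
    """Return (sensitivity, specificity) when ratings >= cutoff are called positive."""
    true_positives = sum(malignant_counts[cutoff - 1:])
    true_negatives = sum(benign_counts[:cutoff - 1])
    sensitivity = true_positives / sum(malignant_counts)
    specificity = true_negatives / sum(benign_counts)
    return sensitivity, specificity


# One (sensitivity, specificity) pair for each of the four possible cutoff levels.
operating_points = {c: sens_spec(benign, malignant, c) for c in range(2, 6)}

for c, (se, sp) in operating_points.items():
    print(f"cutoff >= {c}: sensitivity = {se:.3f}, specificity = {sp:.3f}")
```

With these assumed counts, lowering the cutoff raises the sensitivity and lowers the specificity, which is the trade-off described above.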
To deal with these multiple pairs of sensitivity and specificity values, one can draw a graph using the sensitivities as the y coordinates and the 1-specificities or FPRs as the x coordinates (Fig. 1A). Each discrete point on the graph, called an operating point, is generated by using a different cutoff level for a positive test result. An ROC curve can be estimated from these discrete points, by making the assumption that the test results, or some unknown monotonic transformation thereof, follow a certain distribution. For this purpose, the assumption of a binormal distribution (i.e., two Gaussian distributions: one for the test results of those patients with benign solitary pulmonary nodules and the other for the test results of those patients with malignant solitary pulmonary nodules) is most commonly made (1, 2). The resulting curve is called the fitted or smooth ROC curve (Fig. 1B) (1). The estimation of the smooth ROC curve based on a binormal distribution uses a statistical method called maximum likelihood estimation (MLE) (3). When a binormal distribution is used, the shape of the smooth ROC curve is entirely determined by two parameters. The first one, which is referred to as a, is the standardized difference in the means of the distributions of the test results for those subjects with and without the condition (Appendix) (2, 4). The other parameter, which is referred to as b, is the ratio of the standard deviations of the distributions of the test results for those subjects without versus those with the condition (Appendix) (2, 4). Another way to construct an ROC curve is to connect all the points obtained at all the possible cutoff levels. In the previous example, there are four pairs of FPR and sensitivity values (Table 3), and the two endpoints on the ROC curve are (0, 0) and (1, 1), where the two values of each pair correspond to the FPR and sensitivity, respectively. The resulting ROC curve is called the empirical ROC curve (Fig. 1C) (1).
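How the two parameters a and b determine a smooth binormal ROC curve can be sketched in Python. Under the binormal assumption, with a and b defined as in the Appendix, the sensitivity at a given FPR is Φ(a + b·Φ⁻¹(FPR)), where Φ is the standard normal cumulative distribution function. The parameter values a = 1.5 and b = 0.8 are hypothetical, the function name is an assumption, and the sketch only draws a curve from given parameters; it does not perform the maximum likelihood estimation itself.

```python
from statistics import NormalDist

_std_normal = NormalDist()  # standard normal distribution (mean 0, SD 1)


def binormal_sensitivity(fpr, a, b):
    """Sensitivity on the smooth binormal ROC curve at a given FPR."""
    # The inverse CDF is undefined at exactly 0 and 1, so the two
    # endpoints of the ROC curve are handled explicitly.
    if fpr <= 0.0:
        return 0.0
    if fpr >= 1.0:
        return 1.0
    return _std_normal.cdf(a + b * _std_normal.inv_cdf(fpr))


a, b = 1.5, 0.8  # hypothetical parameter values, for illustration only

# Points (FPR, sensitivity) on the smooth curve at FPR = 0.00, 0.05, ..., 1.00.
curve = [(i / 20, binormal_sensitivity(i / 20, a, b)) for i in range(21)]

for fpr, se in curve:
    print(f"FPR = {fpr:.2f}  sensitivity = {se:.3f}")
```

Setting a = 0 and b = 1 in this sketch reproduces the chance diagonal, because the two distributions then coincide.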
The ROC curve illustrates the relationship between sensitivity and FPR. Because the ROC curve displays the sensitivities and FPRs at all possible cutoff levels, it can be used to assess the performance of a test independently of the decision threshold (5).
Several summary indices are associated with the ROC curve. One of the most popular measures is the area under the ROC curve (AUC) (1, 2). AUC is a combined measure of sensitivity and specificity. It is a measure of the overall performance of a diagnostic test and is interpreted as the average value of sensitivity for all possible values of specificity (1, 2). It can take on any value between 0 and 1, since both the x and y axes have values ranging from 0 to 1. The closer AUC is to 1, the better the overall diagnostic performance of the test, and a test with an AUC value of 1 is one that is perfectly accurate (Fig. 2). The practical lower limit for the AUC of a diagnostic test is 0.5. The area under the line segment from (0, 0) to (1, 1) is 0.5 (Fig. 2). If we were to rely on pure chance to distinguish those subjects with versus those without a particular disease, the resulting ROC curve would fall along this diagonal line, which is referred to as the chance diagonal (Fig. 2) (1, 2). A diagnostic test with an AUC value greater than 0.5 is, therefore, at least better than relying on pure chance, and has at least some ability to discriminate between subjects with and without a particular disease (Fig. 2). Because sensitivity and specificity are independent of disease prevalence, AUC is also independent of disease prevalence (1, 5).
AUC can be estimated both parametrically, with the assumption that either the test results themselves or some unknown monotonic transformation of the test results follows a binormal distribution, and nonparametrically from the empirical ROC curve without any distributional assumption of the test results (Figs. 1B, C). Several nonparametric methods of estimating the area under the empirical ROC curve and its variance have been described (6-8). The nonparametric estimate of the area under the empirical ROC curve is the summation of the areas of the trapezoids formed by connecting the points on the ROC curve (Fig. 1C) (6, 7). The nonparametric estimate of the area under the empirical ROC curve tends to underestimate AUC when discrete rating data (e.g., the five-point scale in the previous example) are collected, whereas the parametric estimate of AUC has negligible bias except when extremely small case samples are employed (2, 4). For discrete rating data, the parametric method is, therefore, preferred (2). However, when discrete rating data are collected, if the test results are not well distributed across the possible response categories (e.g., in the previous example, those patients with actually benign lesions and those patients with actually malignant lesions tend to be rated at each end of the scale, 1 = definitely benign and 5 = definitely malignant, respectively), the data may be degenerate and, consequently, the parametric method may not work well (2, 4). Using the nonparametric method is an option in this case, but may provide even more biased results than it normally would (2). For continuous or quasi-continuous data (e.g., a percent-confidence scale from 0% to 100%), the parametric and nonparametric estimates of AUC will have very similar values and the bias is negligible (2). Therefore, using either the parametric or nonparametric method is fine in this case (2). 
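The two kinds of estimate can be contrasted in a brief Python sketch. The nonparametric function sums the trapezoids under the empirical curve, and the parametric function uses the standard closed form for a binormal curve, AUC = Φ(a / √(1 + b²)). The operating points and the function names are hypothetical and serve only as an illustration.

```python
from math import sqrt
from statistics import NormalDist


def trapezoidal_auc(points):
    """Nonparametric AUC: sum of the trapezoids under the empirical ROC curve.

    `points` holds (FPR, sensitivity) pairs and must include (0, 0) and (1, 1).
    """
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def binormal_auc(a, b):
    """Parametric AUC of a smooth binormal ROC curve with parameters a and b."""
    return NormalDist().cdf(a / sqrt(1 + b * b))


# Hypothetical operating points from a five-point rating scale, plus the endpoints.
points = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]
empirical_auc = trapezoidal_auc(points)

print(f"trapezoidal AUC = {empirical_auc:.4f}")
print(f"binormal AUC for a = 1.5, b = 0.8: {binormal_auc(1.5, 0.8):.4f}")
```

Because the trapezoids join the operating points with straight lines, the area they enclose lies below any concave smooth curve through the same points, which is one way to see why the nonparametric estimate tends to be lower for discrete rating data.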
In most ROC analyses of radiological tests, discrete rating scales with five or six categories (e.g., definitely absent, probably absent, possibly present, probably present and definitely present) are used, for which the parametric method is recommended unless there is a problem with degenerate data. Data collection in radiological ROC studies is further discussed in a later section.
AUC is often presented along with its 95% confidence interval (CI). An AUC of a test obtained from a group of patients is not a fixed, true value, but a value from a sample that is subject to statistical error. Therefore, if one performs the same test on a different group of patients with the same characteristics, the AUC which is obtained may be different. Although it is not possible to specifically define a fixed value for the true AUC of a test, one can choose a range of values in which the true value of AUC lies with a certain degree of confidence. The 95% CI gives the range of values in which the true value lies and the associated degree of confidence. That is to say, one can be 95% sure that the 95% CI includes the true value of AUC (9, 10). In other words, if one believes that the true value of AUC is within the 95% CI, there is a 5% chance of its being wrong. Therefore, if the lower bound of the 95% CI of AUC for a test is greater than 0.5, then the test is statistically significantly better (with a 5% chance of being wrong or a significance level of 0.05) than making the diagnostic decision based on pure chance, which has an AUC of 0.5.
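As a rough illustration of how such an interval can be obtained, the Python sketch below uses the widely cited standard error approximation of Hanley and McNeil together with a normal approximation for the interval. This is only one of several possible approaches and is not necessarily what a given software package computes; the AUC value of 0.85 and the function name are hypothetical.

```python
from math import sqrt


def auc_confidence_interval(auc, n_pos, n_neg, z=1.96):
    """Approximate CI for an AUC estimate (Hanley-McNeil standard error).

    n_pos and n_neg are the numbers of subjects with and without the condition.
    Returns (lower, upper, standard_error); z = 1.96 gives a 95% interval.
    """
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    se = sqrt(variance)
    # An AUC cannot fall outside [0, 1], so the interval is clipped to that range.
    return max(0.0, auc - z * se), min(1.0, auc + z * se), se


# Hypothetical AUC of 0.85 from 34 malignant and 36 benign cases.
lower, upper, se = auc_confidence_interval(0.85, n_pos=34, n_neg=36)
better_than_chance = lower > 0.5

print(f"SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
print("lower bound exceeds 0.5:", better_than_chance)
```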
Since AUC is a measure of the overall performance of a diagnostic test, the overall diagnostic performance of different tests can be compared by comparing their AUCs. The bigger its AUC is, the better the overall performance of the diagnostic test. When comparing the AUCs of two tests, equal AUC values mean that the two tests yield the same overall diagnostic performance, but do not necessarily mean that the ROC curves of the two tests are identical (3). Figure 3 illustrates two ROC curves with equal AUCs. The curves are obviously not identical. Although the AUCs and, therefore, the overall performances of the two tests are the same, test B is better than test A in the high FPR range (or high sensitivity range), whereas test A is better than test B in the low FPR range (or low sensitivity range) (Fig. 3). The equality of two ROC curves can instead be tested by using the two parameters, a and b. Because the shape of a binormal smooth ROC curve is completely specified by a and b, the equality of two ROC curves under the binormal assumption can be assessed by testing the equality of the two sets of parameters, i.e., by comparing the values of a and b from the two ROC curves. The null hypothesis and alternative hypothesis of the test are H0: a1 = a2 and b1 = b2 versus H1: a1 ≠ a2 or b1 ≠ b2, respectively, where 1 and 2 denote the two different ROC curves (2, 3). According to this method, the ROC curves and, consequently, the diagnostic performances of different tests are considered to be different, unless the ROC curves are identical: in other words, unless they yield equal sensitivities for every specificity between 0 and 1 or equal specificities for every sensitivity between 0 and 1 (4).
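For the simpler comparison of two AUCs, a minimal Python sketch of a z test is given below. It assumes that the two ROC curves come from two different groups of patients (unpaired data), so that the two estimates are independent; paired data would require an additional correlation term that is not shown. The AUC values, standard errors, and function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def unpaired_auc_z_test(auc1, se1, auc2, se2):
    """Two-sided z test for the difference between two independent AUC estimates."""
    z = (auc1 - auc2) / sqrt(se1 ** 2 + se2 ** 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical AUCs and standard errors for two tests in two different patient groups.
z, p_value = unpaired_auc_z_test(0.90, 0.03, 0.80, 0.04)
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```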
In some clinical settings, when comparing the performances of different diagnostic tests, one may be interested in only a small portion of the ROC curve and comparing the AUCs and the overall diagnostic performance may be misleading. When screening for a serious disease in a high-risk group (e.g., breast cancer screening), the cutoff range for a positive test should be chosen in such a way as to provide good sensitivity, even if the FPR is high, because false negative test results may have serious consequences. On the other hand, in screening for a certain disease, whose prevalence is very low and for which the subsequent confirmatory tests and/or treatments are very risky, a high specificity and low FPR is required. If the cutoff range for a positive test is not adjusted accordingly, almost all of the positive decisions will be false positive decisions, resulting in many unnecessary, risky follow-up examinations and/or treatments. In Figure 3, although the AUCs and overall performances of the two tests are the same, in the former diagnostic situation requiring high sensitivity, test B would be better than test A, whereas in the latter situation requiring a low FPR, test A would be better than test B. AUC, as a measure of the overall diagnostic performance, is not helpful in these specific diagnostic situations. The diagnostic performance of a test should be judged in the context of the diagnostic situation to which the test is applied. And, depending on the specific diagnostic situation, only a portion of the overall ROC curve may need to be considered.
One way to consider only a portion of an ROC curve is to use the ROC curve to estimate the sensitivity at a particular FPR, and to compare the sensitivities of different ROC curves at a particular FPR (Fig. 4). Another way is to use the partial area under the ROC curve (Fig. 4) (11, 12). Partial ROC area is defined as the area between two FPRs or between two sensitivities. The partial area under the ROC curve between two FPRs, FPR1 = e1 and FPR2 = e2, can be denoted as A(e1 ≤ FPR ≤ e2) (2). Unlike AUC, whose maximum possible value is always 1, the magnitude of the partial area under the ROC curve is dependent on the two FPRs chosen. Therefore, the standardization of the partial area by dividing it by its maximum value is recommended and Jiang et al. (12) referred to this standardized partial area as the partial area index. The maximum value of the partial area between FPR1 = e1 and FPR2 = e2 is equal to the width of the interval, e2 - e1. The partial area index is interpreted as the average sensitivity for the range of FPRs or specificities chosen (1, 2).
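The partial area and the partial area index can be sketched numerically in Python for a smooth binormal curve. The sketch integrates the sensitivity Φ(a + b·Φ⁻¹(FPR)) over the chosen FPR interval with a simple midpoint rule and then divides by the interval width e2 − e1. The function names, the parameter values, and the choice of numerical integration are assumptions made for illustration; this is not the algorithm of any particular program.

```python
from statistics import NormalDist

_std_normal = NormalDist()


def binormal_sensitivity(fpr, a, b):
    """Sensitivity on the smooth binormal ROC curve at an FPR strictly between 0 and 1."""
    return _std_normal.cdf(a + b * _std_normal.inv_cdf(fpr))


def partial_area(a, b, e1, e2, steps=2000):
    """Area under the binormal ROC curve for e1 <= FPR <= e2 (midpoint rule)."""
    width = (e2 - e1) / steps
    # Midpoints never coincide with FPR = 0 or 1, where the inverse CDF is undefined.
    total = sum(binormal_sensitivity(e1 + (k + 0.5) * width, a, b)
                for k in range(steps))
    return total * width


def partial_area_index(a, b, e1, e2, steps=2000):
    """Partial area divided by its maximum possible value, e2 - e1."""
    return partial_area(a, b, e1, e2, steps) / (e2 - e1)


# Hypothetical curve (a = 1.5, b = 0.8) evaluated over the low-FPR range 0 to 0.2.
print(f"partial area       = {partial_area(1.5, 0.8, 0.0, 0.2):.4f}")
print(f"partial area index = {partial_area_index(1.5, 0.8, 0.0, 0.2):.4f}")
```

On the chance diagonal (a = 0, b = 1), the index over the interval from 0 to 0.2 is 0.1, the average sensitivity of a test that performs no better than chance in that range.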
Unlike in the case of many laboratory tests, the interpretation of most radiological tests is qualitative and there are several ways to express the reader's confidence in the presence of a disease, namely a binary result which is either positive or negative for the disease, a discrete rating scale such as a five-point scale, and a continuous or quasi-continuous scale such as a percent-confidence scale from 0% to 100% (2). The first approach is inadequate for ROC analysis, whereas the second and third approaches are appropriate (2). In most of the ROC analyses of radiological tests which have been conducted to date, a discrete rating scale with five or six categories has been used. Rockette et al. (13) performed a study to assess how the estimates of performance on ROC curves are affected by the use of a discrete five-point scale versus a continuous percent-confidence scale. They compared the AUCs obtained with the two different scales in the case of abdominal CTs used for detecting abdominal masses and suggested that the results obtained with discrete rating and continuous scales are often not significantly different, so that the two scales can be used interchangeably in image-evaluation ROC studies. However, they recommended continuous scales for routine use in radiological ROC studies, because of their potential advantages in some situations (13). Having as many categories as possible or using a continuous or quasi-continuous scale is desirable theoretically (14) and has been shown to produce results essentially equivalent to those of discrete scales, when the latter produce well-distributed operating points (15).
Several software programs that are frequently used for ROC analysis are available on the Internet.
ROCKIT, which is available at http://xray.bsd.uchicago.edu/krl/roc_soft.htm (accessed December 31, 2003), is a program for parametric ROC analysis that combines the features of ROCFIT, LABROC, CORROC2, CLABROC and INDROC. It estimates the smooth ROC curve and its AUC, 95% CI of AUC, and the parameters a and b on the basis of a binormal distribution. ROCKIT tests the statistical significance of the differences between two paired (i.e., two ROC curves from the same group of patients), partially paired, or unpaired (i.e., two ROC curves from two different groups of patients, one curve from each group) ROC curves. The difference between two AUCs (i.e., the difference in the overall diagnostic performance of the two tests) is tested with the z test. Differences in the parameters a and b of two ROC curves (i.e., the equality of the two ROC curves) are tested using the bivariate chi-square test, as presented by Metz et al. (2, 4). ROCKIT also estimates the sensitivity at a particular FPR and tests the statistical significance of the difference between the two sensitivities on the two curves at a particular FPR by means of the z test.
PlotROC.xls, which is available at http://xray.bsd.uchicago.edu/krl/roc_soft.htm (accessed December 31, 2003), is a Microsoft Excel 5.0 (Microsoft, Redmond, WA, U.S.A.) macro sheet which takes the a and b parameter values based on the assumption of a binormal distribution to plot a smooth ROC curve.
MedCalc (MedCalc Software, Mariakerke, Belgium), which is available at http://www.medcalc.be (accessed December 31, 2003), is a statistical package that offers nonparametric ROC analysis. It provides the empirical ROC curve and nonparametric estimate of the area under the empirical ROC curve with its 95% CI, based on the method developed by Hanley et al. (7). A comparison between two paired ROC curves is available and the statistical significance of the difference between two AUCs is calculated with the z test, as described by Hanley et al. (16). SPSS version 10.0 (SPSS Inc., Chicago, IL, U.S.A.) also provides the empirical ROC curve and nonparametric estimate of the area under the empirical ROC curve and its 95% CI, which are calculated using a method similar to that of MedCalc. However, it does not provide a statistical comparison between ROC curves.
Partarea.for, which is available at http://www.bio.ri.ccf.org/Research/ROC (accessed December 31, 2003), is a FORTRAN program designed to estimate the partial area under the smooth ROC curve between two FPRs, based on the method developed by McClish (11). It also tests the statistical significance of the difference between the two partial areas of two ROC curves using the z test. This program should be used in conjunction with a parametric program such as ROCKIT. To estimate the partial area, it requires the a and b parameter estimates of an ROC curve, along with the variances of a and b and the covariance of a and b, which can be obtained by means of a parametric program. When comparing two partial areas of two ROC curves, it also requires the covariances of (a1, a2), (a1, b2), (b1, a2) and (b1, b2), which can be obtained using a parametric program (note: the subscripts 1 and 2 denote two different ROC curves). This program needs to be compiled before it can be used on a DOS or Windows-based computer.
The authors wish to thank Charles E. Metz, PhD at Kurt Rossmann Laboratories, Department of Radiology, University of Chicago, IL, USA for reviewing the manuscript and providing helpful comments, and Frank Schoonjans at MedCalc Software, Mariakerke, Belgium for providing the information on MedCalc.
If the data are actually binormal or if a known function can transform the data so that it follows a binormal distribution, parameters a (the standardized difference in the means of the distributions of the test results for those subjects with and without the condition) and b (the ratio of the standard deviations of the distributions of the test results for those subjects without versus those with the condition) can be estimated directly from the means and standard deviations of the distributions of those subjects with and without the condition. Thus, we will have
$$a = \frac{\mu_1 - \mu_0}{\sigma_1}, \qquad b = \frac{\sigma_0}{\sigma_1}$$
where $\mu_i$ is the mean and $\sigma_i$ is the standard deviation of the test results, $i = 0$ (without the condition), $1$ (with the condition).
For discrete rating data, we regard the discrete rating scale test results, $T_0$ (without the condition) and $T_1$ (with the condition), as a categorization of two latent continuous-scale random variables, $T^*_0$ and $T^*_1$, respectively, each of which has a normal distribution. For a discrete rating scale test result, $T_i$, which can take on one of $K$ ordered values, where $i = 0$ (without the condition) or $1$ (with the condition), we assume that there are $K - 1$ unknown decision thresholds $c_1, c_2, \ldots, c_{K-1}$, so that
$$T_i = \begin{cases} 1 & \text{if } T^*_i \le c_1 \\ j & \text{if } c_{j-1} < T^*_i \le c_j, \quad j = 2, 3, \ldots, K - 1 \\ K & \text{if } T^*_i > c_{K-1} \end{cases}$$
Because we assume that both $T^*_0$ and $T^*_1$ have normal distributions, then
$$T^*_0 \sim N(\mu_0, \sigma_0^2), \qquad T^*_1 \sim N(\mu_1, \sigma_1^2)$$
where $\mu_0, \mu_1$ are the means and $\sigma_0^2, \sigma_1^2$ are the variances of the normal distributions. Therefore, we will have
$$a = \frac{\mu_1 - \mu_0}{\sigma_1}, \qquad b = \frac{\sigma_0}{\sigma_1}$$
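The definitions in this Appendix can be illustrated with a short Python sketch that computes a and b from assumed means and standard deviations and applies the threshold rule to a latent value. All numerical values, the thresholds, and the function names are hypothetical; in practice the thresholds and distribution parameters are unknown and are estimated from the rating data.

```python
def binormal_parameters(mu0, sigma0, mu1, sigma1):
    """Return (a, b) from the means and SDs without (0) and with (1) the condition."""
    a = (mu1 - mu0) / sigma1
    b = sigma0 / sigma1
    return a, b


def categorize(latent_value, thresholds):
    """Map a latent continuous value to a rating 1..K using K - 1 ordered thresholds.

    Rating 1 if the value is <= c1, rating j if c_(j-1) < value <= c_j,
    and rating K if the value exceeds the last threshold.
    """
    return 1 + sum(latent_value > c for c in thresholds)


# Hypothetical latent distributions and thresholds for a five-point scale (K = 5).
a, b = binormal_parameters(mu0=0.0, sigma0=1.0, mu1=1.5, sigma1=1.25)
thresholds = [-0.5, 0.3, 1.0, 1.8]

print(f"a = {a:.2f}, b = {b:.2f}")
print("latent value 0.7 is reported as rating", categorize(0.7, thresholds))
```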