Metabolomics is increasingly being applied to the identification of biomarkers for disease diagnosis, prognosis and risk prediction. Unfortunately, among the many published metabolomic studies focusing on biomarker discovery, there is very little consistency and relatively little rigor in how researchers select, assess or report their candidate biomarkers. In particular, few studies report any measure of sensitivity or specificity, or provide receiver operating characteristic (ROC) curves with associated confidence intervals. Even fewer studies explicitly describe or release the biomarker model used to generate their ROC curves. This is surprising given that, for biomarker studies in most other biomedical fields, ROC curve analysis is generally considered the standard method for performance assessment. Because the ultimate goal of biomarker discovery is the translation of those biomarkers to clinical practice, it is clear that the metabolomics community needs to start “speaking the same language” in terms of biomarker analysis and reporting, especially if it wants to see metabolite markers being routinely used in the clinic. In this tutorial, we first introduce the concept of ROC curves and describe their use in single-biomarker analysis for clinical chemistry. This includes the construction of ROC curves, the meaning of the area under the ROC curve (AUC) and the partial AUC, and the calculation of confidence intervals. The second part of the tutorial focuses on biomarker analyses within the context of metabolomics. This section describes different statistical and machine learning strategies that can be used to create multi-metabolite biomarker models and explains how these models can be assessed using ROC curves. In the third part of the tutorial we discuss common issues and potential pitfalls associated with different analysis methods and provide readers with a list of nine recommendations for biomarker analysis and reporting.
To help readers test, visualize and explore the concepts presented in this tutorial, we also introduce a web-based tool called ROCCET (ROC Curve Explorer & Tester, http://www.roccet.ca). ROCCET was originally developed as a teaching aid, but it can also serve as a training and testing resource to help metabolomics researchers build biomarker models and conduct a range of common ROC curve analyses for biomarker studies.
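The core constructions covered in the first part of the tutorial (the empirical ROC curve, its trapezoidal AUC, and a percentile-bootstrap confidence interval) can be sketched in plain Python. This is an illustrative sketch, not ROCCET's implementation; the function names and bootstrap settings are our own choices.

```python
import random

def roc_points(scores_pos, scores_neg):
    """Empirical ROC curve: sweep the decision threshold over all observed scores."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)  # sensitivity
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)  # 1 - specificity
        pts.append((fpr, tpr))
    pts.append((1.0, 1.0))
    return pts

def auc(pts):
    """Trapezoidal area under the ROC polygon."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def bootstrap_auc_ci(scores_pos, scores_neg, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the AUC: resample each
    class with replacement and take the empirical quantiles of the AUCs."""
    rng = random.Random(seed)
    stats = sorted(
        auc(roc_points([rng.choice(scores_pos) for _ in scores_pos],
                       [rng.choice(scores_neg) for _ in scores_neg]))
        for _ in range(n_boot))
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

For example, two perfectly separated score lists yield an AUC of 1.0, while identical score distributions yield 0.5.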
Electronic supplementary material
The online version of this article (doi:10.1007/s11306-012-0482-9) contains supplementary material, which is available to authorized users.
Biomarker analysis; ROC curve; AUC; Confidence intervals; Optimal threshold; Sample size; Bootstrapping; Cross validation; Biomarker validation and reporting
The receiver operating characteristic (ROC) curve is an important tool for gauging the performance of classifiers. In certain high-throughput data analysis settings, the data are heavily class-skewed, i.e., most features tested belong to the true-negative class. In such cases, only a small portion of the ROC curve is relevant in practical terms, rendering the ROC curve and its area under the curve (AUC) insufficient for judging classifier performance. Here we define an ROC surface (ROCS) using the true positive rate (TPR), false positive rate (FPR), and true discovery rate (TDR). The ROC surface, together with the associated quantities, the volume under the surface (VUS) and the FDR-controlled area under the ROC curve (FCAUC), provides a useful approach for gauging classifier performance on class-skewed high-throughput data. The implementation as an R package is available at http://userwww.service.emory.edu/~tyu8/ROCS/.
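One way to picture the FCAUC is to walk down the ranked list, accumulating ROC area only while the implied false discovery rate FP/(TP+FP) (i.e., 1 - TDR) stays within a tolerance. The following Python sketch illustrates that idea under our own simplifying assumptions; the authors' R package may define the quantity differently.

```python
def fdr_controlled_auc(scores_pos, scores_neg, fdr_max=0.2):
    """Area under the ROC polygon restricted to the high-confidence region
    where the empirical FDR = FP/(TP+FP) stays at or below fdr_max.
    A rough sketch of the FCAUC idea, not the authors' implementation."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    prev = (0.0, 0.0)  # (FPR, TPR) at an infinitely strict threshold
    area = 0.0
    for t in thresholds:
        tp = sum(s >= t for s in scores_pos)
        fp = sum(s >= t for s in scores_neg)
        fdr = fp / (tp + fp) if tp + fp else 0.0
        if fdr > fdr_max:
            break  # stop once the FDR constraint is violated
        pt = (fp / len(scores_neg), tp / len(scores_pos))
        area += (pt[0] - prev[0]) * (prev[1] + pt[1]) / 2  # trapezoid
        prev = pt
    return area
```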
As in many other areas of science and technology, most important problems in bioinformatics rely on the proper development and assessment of binary classifiers. A generalized assessment of the performance of binary classifiers is typically carried out through the analysis of their receiver operating characteristic (ROC) curves. The area under the ROC curve (AUC) is a popular indicator of the performance of a binary classifier. However, assessing the statistical significance of the difference between any two classifiers based on this measure is not a straightforward task, since few freely available tools exist. Most existing software is either not free, difficult to use, or not easy to automate when a comparative assessment of the performance of many binary classifiers is intended. This is the typical scenario both for the optimization of parameters when developing new classifiers and for their performance validation through comparison with prior art.
In this work we describe and release new software to assess the statistical significance of the observed difference between the AUCs of any two classifiers for a common task estimated from paired data or unpaired balanced data. The software is able to perform a pairwise comparison of many classifiers in a single run, without requiring any expert or advanced knowledge to use it. The software relies on a non-parametric test for the difference of the AUCs that accounts for the correlation of the ROC curves. The results are displayed graphically and can be easily customized by the user. A human-readable report is generated and the complete data resulting from the analysis are also available for download, which can be used for further analysis with other software. The software is released as a web server that can be used in any client platform and also as a standalone application for the Linux operating system.
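The released software relies on a non-parametric analytic test. As a rough illustration of the underlying idea, the correlation between the two ROC curves can also be preserved by resampling cases jointly in a paired bootstrap; the Python below is our own stand-in sketch, not the released code.

```python
import random

def auc_mw(pos, neg):
    """AUC via the Mann-Whitney statistic: P(pos score > neg score), ties count 1/2."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_auc_diff_pvalue(labels, scores_a, scores_b, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for AUC(a) - AUC(b) on paired data.
    Cases are resampled jointly, so the correlation between the two
    classifiers' scores is preserved."""
    rng = random.Random(seed)
    idx = range(len(labels))

    def auc_diff(sample):
        pa = [scores_a[i] for i in sample if labels[i]]
        na = [scores_a[i] for i in sample if not labels[i]]
        pb = [scores_b[i] for i in sample if labels[i]]
        nb = [scores_b[i] for i in sample if not labels[i]]
        return auc_mw(pa, na) - auc_mw(pb, nb)

    observed = auc_diff(idx)
    extreme = 0
    for _ in range(n_boot):
        boot = [rng.choice(idx) for _ in idx]  # joint resampling of cases
        if not (any(labels[i] for i in boot) and any(not labels[i] for i in boot)):
            continue  # a resample must contain both classes
        if abs(auc_diff(boot) - observed) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (n_boot + 1)
```

For identical classifiers the observed difference is zero and the p-value is close to one, as expected.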
New software for the statistical comparison of ROC curves is released here as a web server and as a standalone application for the Linux operating system.
Rationale and Objectives
To examine the effects of the number of categories in the rating scale used in an observer experiment on the results of ROC analysis by a simulation study.
Materials and Methods
We have previously evaluated the effects of computer-aided diagnosis (CAD) on radiologists’ characterization of malignant and benign breast masses in serial mammograms. The evaluation of the likelihood of malignancy was performed on a quasi-continuous (0-100 points) confidence-rating scale. In this study, we simulated the use of discrete confidence-rating scales with fewer categories and analyzed the results with receiver operating characteristic (ROC) methodology. The observers’ estimates of the likelihood of malignancy were also mapped to BI-RADS assessments with 5 and 7 categories, and ROC analysis was performed. The area under the ROC curve and the partial area index obtained from ROC analysis of the different confidence-rating scales were compared.
The fitted ROC curves and the performance indices did not change significantly when the confidence-rating scale was varied from 6 to 101 points, provided that the estimated operating points obtained directly from the data were distributed relatively evenly over the entire range of true-positive fraction (TPF) and false-positive fraction (FPF). Mapping the likelihood-of-malignancy observer data to the 7-category BI-RADS assessment scale allowed reliable ROC analysis, whereas mapping to the 5-category BI-RADS scale could cause erratic ROC curve fitting because of the lack of operating points in the mid-range, or failure of ROC curve fitting because of data degeneration for some observers.
ROC analysis of discrete confidence rating scales with few but relatively evenly distributed data points over the entire FPF and TPF range is comparable to that of a quasi-continuous rating scale. However, ROC analysis of discrete confidence rating scales with few and unevenly distributed data points may cause unreliable estimations.
Computer-Aided Diagnosis; Continuous and Discrete Confidence Rating Scales; ROC Observer Study; Classification; Mammography
Syndromic surveillance systems can potentially detect a bioterrorist attack earlier than traditional surveillance, by virtue of their near real-time analysis of relevant data. Receiver operating characteristic (ROC) curve analysis using the area under the curve (AUC) as a comparison metric has been recommended as a practical evaluation tool for syndromic surveillance systems, yet traditional ROC curves do not account for timeliness of detection or subsequent time-dependent health outcomes.
Using a decision-analytic approach, we predicted outcomes, measured in lives, quality adjusted life years (QALYs), and costs, for a series of simulated bioterrorist attacks. We then evaluated seven detection algorithms applied to syndromic surveillance data using outcomes-weighted ROC curves compared to simple ROC curves and timeliness-weighted ROC curves. We performed sensitivity analyses by varying the model inputs between best and worst case scenarios and by applying different methods of AUC calculation.
The decision analytic model results indicate that if a surveillance system was successful in detecting an attack, and measures were immediately taken to deliver treatment to the population, the lives, QALYs and dollars lost could be reduced considerably. The ROC curve analysis shows that the incorporation of outcomes into the evaluation metric has an important effect on the apparent performance of the surveillance systems. The relative order of performance is also heavily dependent on the choice of AUC calculation method.
This study demonstrates the importance of accounting for mortality, morbidity and costs in the evaluation of syndromic surveillance systems. Incorporating these outcomes into the ROC curve analysis allows for more accurate identification of the optimal method for signaling a possible bioterrorist attack. In addition, the parameters used to construct an ROC curve should be given careful consideration.
Classification of a given observation to one of three classes is an important task in many decision processes or pattern recognition applications. A general analysis of the performance of three-class classifiers results in a complex six-dimensional (6D) receiver operating characteristic (ROC) space, for which no simple analytical tool exists at present. We investigate the performance of an ideal observer under a specific set of assumptions that reduces the 6D ROC space to 3D by constraining the utilities of some of the decisions in the classification task. These assumptions lead to a 3D ROC space in which the true-positive fraction (TPF) can be expressed in terms of the two types of false-positive fractions (FPFs). We demonstrate that the TPF is uniquely determined by, and therefore is a function of, the two FPFs. The domain of this function is shown to be related to the decision boundaries in the likelihood ratio plane. Based on these properties of the 3D ROC space, we can define a summary measure, referred to as the normalized volume under the surface (NVUS), that is analogous to the area under the ROC curve (AUC) for a two-class classifier. We further investigate the properties of the 3D ROC surface and the NVUS for the ideal observer under the condition that the three class distributions are multivariate normal with equal covariance matrices. The probability density functions (pdfs) of the decision variables are shown to follow a bivariate log-normal distribution. By considering these pdfs, we express the TPF in terms of the FPFs, and integrate the TPF over its domain numerically to obtain the NVUS. In addition, we performed a Monte Carlo simulation study, in which the 3D ROC surface was generated by empirical “optimal” classification of case samples in the multi-dimensional feature space following the assumed distributions, to obtain an independent estimate of NVUS. 
The NVUS value obtained by using the analytical pdfs was found to be in good agreement with that obtained from the Monte Carlo simulation study. We also found that, under all conditions studied, the NVUS increased when the difficulty of the classification task was reduced by changing the parameters of the class distributions, thereby exhibiting the properties of a performance metric analogous to the AUC. Our results indicate that, under the conditions that lead to our 3D ROC analysis, the performance of a three-class classifier may be analyzed by considering the ROC surface, and its accuracy characterized by the NVUS.
3-class classification; ROC analysis; performance index; ideal observer
This paper considers receiver operating characteristic (ROC) analysis for bivariate marker measurements. The research interest is to extend tools and rules from the univariate-marker to the bivariate-marker setting for evaluating the predictive accuracy of markers using a tree-based classification rule. Using an and-or classifier, an ROC function, a weighted ROC function (WROC) and their conjugate counterparts are proposed for examining the performance of bivariate markers. The proposed functions evaluate the performance of and-or classifiers among all possible combinations of marker values, and are ideal measures for understanding the predictability of biomarkers in the target population. Specific features of the ROC and WROC functions and other related statistics are discussed in comparison with the familiar properties of the univariate-marker case. Nonparametric methods are developed for estimating the ROC-related functions, the (partial) area under the curve and the concordance probability. With emphasis on the average performance of markers, the proposed procedures and inferential results are useful for evaluating marker predictability based on single or bivariate marker (or test) measurements with different choices of markers, and for evaluating different and-or combinations in classifiers. The inferential results developed in this paper also extend to multivariate markers with a sequence of arbitrarily combined and-or classifiers.
Concordance probability; Prediction accuracy; Tree-based classification; U-statistics
The receiver operating characteristic (ROC) curve, defined as a plot of test sensitivity as the y coordinate versus 1 - specificity, or the false positive rate (FPR), as the x coordinate, is an effective method of evaluating the performance of diagnostic tests. The purpose of this article is to provide a nonmathematical introduction to ROC analysis. Important concepts involved in the correct use and interpretation of this analysis, such as smooth and empirical ROC curves, parametric and nonparametric methods, the area under the ROC curve and its 95% confidence interval, the sensitivity at a particular FPR, and the use of a partial area under the ROC curve, are discussed. Various considerations concerning the collection of data in radiological ROC studies are briefly discussed, and an introduction to the software frequently used for performing ROC analyses is also presented.
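One of the pointwise summaries mentioned above, the sensitivity at a particular FPR, has a simple empirical estimate: take the largest sensitivity achievable at any threshold whose FPR does not exceed the target. A minimal Python sketch (our own illustration, not taken from the software discussed):

```python
def sensitivity_at_fpr(scores_pos, scores_neg, fpr_target=0.1):
    """Sensitivity of the empirical ROC curve at, or just below,
    a fixed false positive rate."""
    thresholds = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    best = 0.0
    for t in thresholds:
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        if fpr <= fpr_target:  # threshold satisfies the FPR constraint
            tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
            best = max(best, tpr)
    return best
```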
Diagnostic radiology; Receiver operating characteristic (ROC) curve; Software reviews; Statistical analysis
A common task in analyzing microarray data is to determine which genes are differentially expressed across two (or more) kinds of tissue samples, or samples subjected to different experimental conditions. Several statistical methods have been proposed to accomplish this goal, generally based on measures of distance between classes. It is well known that biological samples are heterogeneous because of factors such as molecular subtypes or genetic background that are often unknown to the experimenter. For instance, in experiments involving molecular classification of tumors, it is important to identify significant subtypes of cancer. Bimodal or multimodal distributions often reflect the presence of mixtures of subsamples. Consequently, there can be genes differentially expressed in sample subgroups that are missed if the usual statistical approaches are applied. In this paper we propose a new graphical tool that identifies not only genes with up- and down-regulation, but also genes with differential expression in different subclasses, which are usually missed by current statistical methods. This tool is based on two measures of distance between samples, namely the overlapping coefficient (OVL) between two densities and the area under the receiver operating characteristic (ROC) curve. The methodology proposed here was implemented in the open-source R software.
This method was applied to a publicly available dataset, as well as to a simulated dataset. We compared our results with those obtained using some of the standard methods for detecting differentially expressed genes, namely the Welch t-statistic, fold change (FC), rank products (RP), average difference (AD), weighted average difference (WAD), moderated t-statistic (modT), intensity-based moderated t-statistic (ibmT), significance analysis of microarrays (samT) and area under the ROC curve (AUC). On both datasets, the differentially expressed genes with bimodal or multimodal distributions were missed by all of the standard selection procedures. We also compared our results with (i) the area between the ROC curve and the rising diagonal (ABCR) and (ii) the test for not-proper ROC curves (TNRC). We found our methodology more comprehensive, because it detects both bimodal and multimodal distributions, and different variances can be considered in the two samples. Another advantage of our method is that the behavior of different kinds of differentially expressed genes can be analyzed graphically.
Our results indicate that the arrow plot represents a new flexible and useful tool for the analysis of gene expression profiles from microarrays.
Motivation: The performance of classifiers is often assessed using receiver operating characteristic (ROC) curves [or accumulation curves (AC), also called enrichment curves] and the corresponding areas under the curves (AUCs). However, in many fundamental problems ranging from information retrieval to drug discovery, only the very top of the ranked list of predictions is of any interest, and ROCs and AUCs are not very useful. New metrics, visualizations and optimization tools are needed to address this ‘early retrieval’ problem.
Results: To address the early retrieval problem, we develop the general concentrated ROC (CROC) framework. In this framework, any relevant portion of the ROC (or AC) curve is magnified smoothly by an appropriate continuous transformation of the coordinates with a corresponding magnification factor. Appropriate families of magnification functions confined to the unit square are derived and their properties are analyzed together with the resulting CROC curves. The area under the CROC curve (AUC[CROC]) can be used to assess early retrieval. The general framework is demonstrated on a drug discovery problem and used to discriminate more accurately the early retrieval performance of five different predictors. From this framework, we propose a novel metric and visualization—the CROC(exp), an exponential transform of the ROC curve—as an alternative to other methods. The CROC(exp) provides a principled, flexible and effective way for measuring and visualizing early retrieval performance with excellent statistical power. Corresponding methods for optimizing early retrieval are also described in the Appendix.
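The exponential magnification behind CROC(exp) maps the FPR axis through f(x) = (1 - exp(-ax)) / (1 - exp(-a)), which stretches the early-retrieval region near x = 0 while keeping the curve in the unit square. The sketch below assumes this functional form and a default magnification factor; the published package's exact parameterization may differ.

```python
import math

def croc_transform(fpr, alpha=7.0):
    """Exponential magnification of the FPR axis, as in CROC(exp):
    maps [0, 1] onto [0, 1] while expanding the region near 0."""
    return (1 - math.exp(-alpha * fpr)) / (1 - math.exp(-alpha))

def croc_auc(roc_pts, alpha=7.0):
    """Trapezoidal area under the CROC curve obtained by transforming
    the x coordinate of each ROC point."""
    pts = [(croc_transform(x, alpha), y) for x, y in roc_pts]
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A perfect ranker still scores 1.0 after the transform, while early mistakes are penalized much more heavily than in the plain AUC.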
Availability: Datasets are publicly available. Python code and command-line utilities implementing CROC curves and metrics are available at http://pypi.python.org/pypi/CROC/
Different methods of evaluating diagnostic performance when comparing diagnostic tests may lead to different results. We compared two such approaches, sensitivity and specificity versus the area under the receiver operating characteristic curve (ROC AUC), for the evaluation of CT colonography for the detection of polyps, either with or without computer-assisted detection.
In a multireader, multicase study of 10 readers and 107 cases, we compared sensitivity and specificity, based on radiological reporting of the presence or absence of polyps, with the ROC AUC calculated from confidence scores for the presence of polyps. Both methods were assessed against a reference standard. Here we focus on five readers, selected to illustrate issues in design and analysis. We compared the diagnostic measures within readers, showing that differences in results are due to the statistical methods.
Reader performance varied widely depending on whether sensitivity and specificity or the ROC AUC was used. There were several problems with the use of confidence scores: in assigning scores to all cases; in the use of zero scores when no polyps were identified; in the bimodal, non-normal distribution of scores; in fitting ROC curves due to extrapolation beyond the study data; and in the undue influence of a few false-positive results. Variation due to the use of different ROC methods exceeded the differences between test results for the ROC AUC.
The confidence scores recorded in our study violated many assumptions of ROC AUC methods, rendering these methods inappropriate. The problems we identified will apply to other detection studies using confidence scores. We found sensitivity and specificity were a more reliable and clinically appropriate method to compare diagnostic tests.
Laboratory observer performance measurements, receiver operating characteristic (ROC) and free-response ROC (FROC) differ from actual clinical interpretations in several respects, which could compromise their clinical relevance. The objective of this study was to develop a method for quantifying the clinical relevance of a laboratory paradigm and apply it to compare the ROC and FROC paradigms in a nodule detection task.
The original prospective interpretations of 80 digital chest radiographs were classified by the truth panel as correct (C=1) or incorrect (C=0), depending on correlation with additional imaging, and the average of C was interpreted as the clinical figure of merit. FROC data were acquired for 21 radiologists, and ROC data were inferred using the highest ratings. The areas under the ROC and alternative FROC curves were used as laboratory figures of merit. Bootstrap analysis was conducted to estimate conventional agreement measures between the laboratory and clinical figures of merit. Also computed was a pseudovalue-based image-level correctness measure of the laboratory interpretations, whose association with C, as measured by the area (rAUC) under an appropriately defined relevance ROC curve, serves as a measure of the clinical relevance of a laboratory paradigm.
Low correlations (e.g. κ=0.244) and near chance level rAUC values (e.g. 0.598), attributable to differences between the clinical and laboratory paradigms, were observed. The absolute width of the confidence interval was 0.38 for the interparadigm differences of the conventional measures and 0.14 for the difference of the rAUCs.
The rAUC measure was consistent with the traditional measures but was more sensitive to the differences in clinical relevance. A new relevance ROC method for quantifying the clinical relevance of a laboratory paradigm is proposed.
Logistic regression analysis (LRA), the support vector machine (SVM) and the neural network (NN) are commonly used statistical models in computer-aided diagnostic (CAD) systems for breast ultrasonography (US). The aim of this study was to clarify the diagnostic ability of these statistical models for future applications of CAD systems, such as three-dimensional (3D) power Doppler imaging, vascularity evaluation and the differentiation of solid masses.
Materials and Methods
A database that contained 3D power Doppler imaging pairs of non-harmonic and tissue harmonic images for 97 benign and 86 malignant solid tumors was utilized. The virtual organ computer-aided analysis-imaging program was used to analyze the stored volumes of the 183 solid breast tumors. LRA, an SVM and NN were employed in comparative analyses for the characterization of benign and malignant solid breast masses from the database.
The areas under the receiver operating characteristic (ROC) curve, referred to as Az values, for non-harmonic 3D power Doppler US with LRA, SVM and NN were 0.9341, 0.9185 and 0.9086, respectively. The Az values for harmonic 3D power Doppler US with LRA, SVM and NN were 0.9286, 0.8979 and 0.9009, respectively. The Az values of the six ROC curves for LRA, SVM and NN with non-harmonic or harmonic 3D power Doppler imaging were similar.
The diagnostic performances of the three models (LRA, SVM and NN) did not differ, as demonstrated by ROC curve analysis. Depending on the user's emphasis in interpreting the ROC curve findings, LRA appears to provide better sensitivity than the other statistical models.
Vascularization index; Flow index; Vascularization-flow index; Logistic regression analysis; Neural network; Support Vector Machine
The aim of this study was to compare the diagnostic performance of the three software packages 4DMSPECT (4DM), Emory Cardiac Toolbox (ECTb), and Cedars Quantitative Perfusion SPECT (QPS) for quantification of myocardial perfusion scintigram (MPS) using a large group of consecutive patients.
We studied 1,052 consecutive patients who underwent 2-day stress/rest 99mTc-sestamibi MPS studies. The reference (gold-standard) classifications for the MPS studies were obtained from three physicians, each with more than 25 years of experience in nuclear cardiology, who re-evaluated all MPS images. Automatic processing was carried out using the 4DM, ECTb, and QPS software packages. The total stress defect extent (TDE) and the summed stress score (SSS), based on a 17-segment model, were obtained from the software packages. Receiver operating characteristic (ROC) analysis was performed.
A total of 734 patients were classified as normal and the remaining 318 were classified as having infarction and/or ischemia. The performances of the software packages, calculated as the area under the SSS ROC curve, were 0.87 for 4DM, 0.80 for QPS, and 0.76 for ECTb (QPS vs. ECTb p = 0.03; other differences p < 0.0001). The areas under the TDE ROC curve were 0.87 for 4DM, 0.82 for QPS, and 0.76 for ECTb (QPS vs. ECTb p = 0.0005; other differences p < 0.0001).
There are considerable differences in performance between the three software packages, with 4DM showing the best performance and ECTb the worst. These differences should be taken into consideration when the software packages are used in clinical routine or in clinical studies.
myocardial perfusion imaging; SPECT; automatic quantification; software; coronary artery disease
Diagnostic tests are commonly characterized by their true positive (sensitivity) and true negative (specificity) classification rates, which rely on a single decision threshold to classify a test result as positive. A more complete description of test accuracy is given by the receiver operating characteristic (ROC) curve, a graph of the false positive and true positive rates obtained as the decision threshold is varied. A generalized regression methodology, which uses a class of ordinal regression models to estimate smoothed ROC curves, has been described. Data from a multi-institutional study comparing the accuracy of magnetic resonance (MR) imaging with computed tomography (CT) in detecting liver metastases, which are ideally suited for ROC regression analysis, are described. The general regression model is introduced, an estimate for the area under the ROC curve and its standard error based on the parameters of the ordinal regression model is given, and an analysis of the liver data highlighting the utility of the methodology in parsimoniously adjusting comparisons for covariates is presented.
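Under the standard binormal model that such ordinal regression methods typically estimate, with ROC(t) = Phi(a + b * Phi^{-1}(t)), the AUC has the closed form Phi(a / sqrt(1 + b^2)). A small Python illustration (the binormal parameterization is a standard assumption here, not a detail taken from the study):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binormal_auc(a, b):
    """AUC of a binormal ROC curve with intercept a and slope b:
    AUC = Phi(a / sqrt(1 + b^2))."""
    return norm_cdf(a / math.sqrt(1 + b * b))
```

With a = 0 the curve is the chance diagonal (AUC = 0.5); increasing a moves the AUC toward 1.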
AIM: To quantitatively assess the ability of double contrast-enhanced ultrasound (DCUS) to detect tumor early response to pre-operative chemotherapy.
METHODS: Forty-three patients with gastric cancer treated with neoadjuvant chemotherapy followed by curative resection between September 2011 and February 2012 were analyzed. Pre-operative chemotherapy regimens of fluorouracil + oxaliplatin or S-1 + oxaliplatin were administered in 2-4 cycles over 6-12 wk periods. All patients underwent contrast-enhanced computed tomography (CT) scan and DCUS before and after two courses of pre-operative chemotherapy. The therapeutic response was assessed by CT using the response evaluation criteria in solid tumors (RECIST 1.1) criteria. Tumor area was assessed by DCUS as enhanced appearance of gastric carcinoma due to tumor vascularity during the contrast phase as compared to the normal gastric wall. Histopathologic analysis was carried out according to the Mandard tumor regression grade criteria and used as the reference standard. Receiver operating characteristic (ROC) analysis was used to evaluate the efficacy of DCUS parameters in differentiating histopathological responders from non-responders.
RESULTS: The study population consisted of 32 men and 11 women, with a mean age of 59.7 ± 11.4 years. Neither age, sex, histologic type, tumor site, T stage, nor N stage was associated with pathological response. The responders had a significantly smaller mean tumor size than the non-responders (15.7 ± 7.4 cm vs 33.3 ± 14.1 cm, P < 0.01). According to Mandard’s criteria, 27 patients were classified as responders, with 11 (40.7%) showing decreased tumor size by DCUS. In contrast, only three (18.8%) of the 16 non-responders showed decreased tumor size by DCUS (P < 0.01). The area under the ROC curve was 0.64, with a 95%CI of 0.46-0.81. The effects of several cut-off points on the diagnostic parameters were calculated in the ROC curve analysis. By maximizing Youden’s index (sensitivity + specificity - 1), the best cut-off point for distinguishing responders from non-responders was determined, with an optimal sensitivity of 62.9% and specificity of 56.3%. Using this cut-off point, the positive and negative predictive values of DCUS for distinguishing responders from non-responders were 70.8% and 47.4%, respectively. The overall accuracy of DCUS for therapeutic response assessment was 60.5%, slightly higher than the 53.5% for CT response assessment with RECIST criteria, although the advantage was not statistically significant (P = 0.663), likely due to the small number of cases assessed. DCUS was able to identify decreased perfusion in responders who showed no morphological change by CT imaging, where the response can be obscured by such treatment effects as fibrosis and edema.
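The cut-off selection step described above, maximizing Youden's J = sensitivity + specificity - 1 over candidate thresholds, can be sketched as follows (illustrative Python with made-up scores, not the study's data):

```python
def youden_optimal_cutoff(scores_pos, scores_neg):
    """Return the threshold maximizing Youden's J = sensitivity + specificity - 1,
    together with the achieved J, scanning all observed score values."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores_pos) | set(scores_neg)):
        sens = sum(s >= t for s in scores_pos) / len(scores_pos)
        spec = sum(s < t for s in scores_neg) / len(scores_neg)
        j = sens + spec - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```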
CONCLUSION: DCUS may represent an innovative tool for more accurately predicting histopathological response to neoadjuvant chemotherapy before surgical resection in patients with locally-advanced gastric cancer.
Gastric cancer; Chemotherapy; Ultrasonic imaging; Predictive value of tests; Disease management
Laboratory receiver operating characteristic (ROC) studies, which are often used to evaluate medical imaging systems, differ from “live” clinical interpretations in several respects that could compromise their clinical relevance. The aim was to develop methodology for quantifying the clinical relevance of a laboratory ROC study. A simulator was developed to generate ROC ratings data and binary clinical interpretations, classified as correct or incorrect, for a common set of images interpreted under clinical and laboratory conditions. The area under the trapezoidal ROC curve was used as the laboratory figure of merit and the fraction of correct clinical decisions as the clinical figure of merit. Conventional agreement measures (Pearson, Spearman, Kendall and kappa) between the bootstrap-induced fluctuations of the two figures of merit were estimated. A jackknife pseudovalue transformation applied to the figures of merit was also investigated as a way to capture agreement existing at the individual-image level that could be lost at the figure-of-merit level. It is shown that the pseudovalues define a relevance ROC curve, the area under which (rAUC) measures the ability of the laboratory figure-of-merit-based pseudovalues to correctly classify incorrect vs. correct clinical interpretations, and is a measure of the clinical relevance of an ROC study. The conventional measures and the rAUC were compared under varying simulator conditions. It was found that design details of the ROC study, namely the number of bins, the difficulty level of the images, the ratio of disease-present to disease-absent images, and the unavoidable difference between laboratory and clinical performance levels, can lead conventional agreement measures to seriously underestimate the agreement, even for perfectly correlated data, whereas the rAUC showed high agreement and was relatively immune to these details.
At the same time rAUC was sensitive to factors such as intrinsic correlation between the laboratory and clinical decision variables and differences in reporting thresholds that are expected to influence agreement both at the individual image level and at the figure-of-merit level. Suggestions are made for how to conduct relevance ROC studies aimed at assessing agreement between laboratory and clinical interpretations.
observer performance; ROC; bootstrap; jackknife; agreement; correlation; statistical modeling; clinical relevance
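The two figures-of-merit compared in the study above can be sketched in a few lines of Python. This is only an illustration of the quantities involved (the trapezoidal AUC on laboratory ratings and the fraction of correct binary clinical calls), not the simulator or the bootstrap agreement machinery; the ratings and decisions below are hypothetical numbers.

```python
def trapezoidal_auc(neg, pos):
    # Empirical (trapezoidal) AUC: P(pos rating > neg rating), ties counted as 1/2
    wins = sum(1.0 if y > x else 0.5 if y == x else 0.0
               for x in neg for y in pos)
    return wins / (len(neg) * len(pos))

def fraction_correct(decisions, truth):
    # Clinical figure-of-merit: fraction of binary interpretations that are correct
    return sum(d == t for d, t in zip(decisions, truth)) / len(truth)

# Hypothetical 1-5 confidence ratings for 5 disease-absent and 5 disease-present images
neg = [1, 2, 2, 3, 3]
pos = [3, 4, 4, 5, 5]
lab_fom = trapezoidal_auc(neg, pos)            # laboratory figure-of-merit

# Hypothetical binary clinical calls: "disease present" when the rating is >= 3
truth = [0] * 5 + [1] * 5
decisions = [int(r >= 3) for r in neg + pos]
clin_fom = fraction_correct(decisions, truth)  # clinical figure-of-merit
```

Bootstrapping the cases and recomputing both figures-of-merit on each resample would yield the paired fluctuations whose agreement the study quantifies.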
The receiver operating characteristic (ROC) curve displays the capacity of a marker or diagnostic test to discriminate between two groups of subjects, cases versus controls. We present a comprehensive suite of Stata commands for performing ROC analysis. Nonparametric, semiparametric and parametric estimators are calculated. Comparisons between curves are based on the area or partial area under the ROC curve. Alternatively, pointwise comparisons between ROC curves or inverse ROC curves can be made. Options to adjust these analyses for covariates, and to perform ROC regression, are described in a companion article. We use a unified framework by representing the ROC curve as the distribution of the marker in cases after standardizing it to the control reference distribution.
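The unifying representation used in that suite (the ROC curve as the distribution of case marker values standardized to the control reference distribution) can be illustrated outside Stata with a minimal Python sketch. The function names and marker values below are hypothetical, chosen only to make the placement-value idea concrete.

```python
def placement_values(controls, cases):
    # Placement value of a case marker y: the fraction of controls at or above y,
    # i.e. the case value standardized to the control reference distribution
    n = len(controls)
    return [sum(x >= y for x in controls) / n for y in cases]

def roc_point(controls, cases, t):
    # ROC(t) is the CDF of the placement values evaluated at false-positive rate t
    pv = placement_values(controls, cases)
    return sum(v <= t for v in pv) / len(pv)

controls = [1, 2, 3, 4, 5]   # hypothetical marker values
cases = [3, 4, 5, 6, 7]
```

For example, `roc_point(controls, cases, 0.2)` gives the true-positive fraction achievable at a false-positive fraction of 0.2.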
The receiver operating characteristic (ROC) curve is a fundamental tool for assessing discriminant performance, not only for a single marker but also for a score function combining multiple markers. The area under the ROC curve (AUC) for a score function measures the intrinsic ability of the score function to discriminate between controls and cases. Recently, the partial AUC (pAUC) has received more attention than the AUC, because a suitable range of the false positive rate can be targeted according to the clinical situation. However, existing pAUC-based methods handle only a few markers and do not take nonlinear combinations of markers into consideration.
We have developed a new statistical method that focuses on the pAUC, based on a boosting technique. The markers are combined component-wise to maximize the pAUC in the boosting algorithm, using natural cubic splines or decision stumps (single-level decision trees) according to the values of the markers (continuous or discrete). We show that the resulting score plots are useful for understanding how each marker is associated with the outcome variable. We compare the performance of the proposed boosting method with those of other existing methods, and demonstrate its utility using real data sets. As a result, we obtain much better discrimination performance, in terms of the pAUC, in both simulation studies and real data analysis.
The proposed method addresses how to combine the markers after a pAUC-based filtering procedure in a high-dimensional setting. Hence, it provides a consistent way of analyzing data based on the pAUC, from marker selection to marker combination, for discrimination problems. The method can capture not only linear but also nonlinear associations between the outcome variable and the markers; such nonlinearity is known to be necessary, in general, for maximization of the pAUC. The method also puts importance on the accuracy of classification performance as well as interpretability of the association, by offering simple and smooth resultant score plots for each marker.
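The pAUC objective that the boosting method maximizes can be computed empirically with a trapezoidal rule over the restricted false-positive range. The sketch below shows only that objective, not the boosting algorithm itself; all marker values and function names are hypothetical.

```python
def roc_curve(controls, cases):
    # Empirical ROC points swept over all thresholds (higher score = more case-like)
    pts = [(0.0, 0.0)]
    for c in sorted(set(controls + cases), reverse=True):
        fpr = sum(x >= c for x in controls) / len(controls)
        tpr = sum(y >= c for y in cases) / len(cases)
        pts.append((fpr, tpr))
    if pts[-1] != (1.0, 1.0):
        pts.append((1.0, 1.0))
    return pts

def pauc(points, f0, f1):
    # Trapezoidal area under the ROC curve restricted to FPR in [f0, f1],
    # interpolating linearly where a segment crosses the range boundary
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        lo, hi = max(x0, f0), min(x1, f1)
        if hi <= lo:   # segment outside the range, or vertical
            continue
        y_lo = y0 + (y1 - y0) * (lo - x0) / (x1 - x0)
        y_hi = y0 + (y1 - y0) * (hi - x0) / (x1 - x0)
        area += (hi - lo) * (y_lo + y_hi) / 2
    return area

pts = roc_curve([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])   # hypothetical marker values
low_fpr_pauc = pauc(pts, 0.0, 0.2)   # pAUC restricted to FPR <= 0.2
```

A boosting procedure would iteratively add spline or stump components to a score function so as to increase this quantity.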
In order for spatiotemporal analysis to become a relevant clinical tool, it must be applied to human vocal fold vibration. Receiver operating characteristic (ROC) analysis will help assess the ability of spatiotemporal parameters to detect pathological vibration.
Materials and Methods
Spatiotemporal parameters of correlation length and entropy were extracted from high speed videos of 124 subjects, 67 without vocal fold pathology and 57 with either vocal fold polyps or nodules. Mann-Whitney rank sum tests were performed to compare normal vocal fold vibrations to pathological vibrations, and ROC analysis was used to assess the diagnostic value of spatiotemporal analysis.
A statistically significant difference was found between the normal and pathological groups in both correlation length (P < 0.001) and entropy (P < 0.001). ROC analysis showed area under the curve (AUC) of 0.85 for correlation length, 0.87 for entropy, and 0.92 when the two parameters were combined. A statistically significant difference was not found between the nodules and polyps groups in either correlation length (P = 0.227) or entropy (P = 0.943). ROC analysis showed AUC of 0.63 for correlation length and 0.51 for entropy.
Although they could not effectively distinguish the vibration of vocal folds with nodules from that of vocal folds with polyps, the spatiotemporal parameters correlation length and entropy can differentiate normal from pathological vocal fold vibration, and may in the future serve as a diagnostic tool for objectively detecting abnormal vibration, especially in neurological voice disorders and in vocal folds without a visible lesion.
Spatiotemporal analysis; vocal fold nodules; vocal fold polyps; ROC analysis
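The pairing of Mann-Whitney rank sum tests with ROC analysis in the study above is natural, because the empirical AUC equals the Mann-Whitney U statistic normalized by the number of case-control pairs. A minimal sketch of that equivalence, using hypothetical entropy values rather than the study's data:

```python
def auc_from_u(group0, group1):
    # Mann-Whitney U: number of (pathological, normal) pairs where the
    # pathological value is higher, ties counted as 1/2; the empirical AUC
    # is U divided by the total number of pairs
    u = sum(1.0 if y > x else 0.5 if y == x else 0.0
            for x in group0 for y in group1)
    return u / (len(group0) * len(group1))

# Hypothetical entropy values for illustration only
normal = [0.2, 0.3, 0.35, 0.4]
pathological = [0.5, 0.6, 0.3, 0.7]
auc = auc_from_u(normal, pathological)
```

An AUC near 0.5, as seen when comparing nodules with polyps, corresponds to a U statistic close to half the number of pairs, i.e. no rank separation.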
The ROC (Receiver Operating Characteristic) curve is the most commonly used statistical tool for describing the discriminatory accuracy of a diagnostic test. Classical estimation of the ROC curve relies on data from a simple random sample from the target population. In practice, estimation is often complicated because not all subjects undergo a definitive assessment of disease status (verification). Estimation of the ROC curve based only on data from subjects with verified disease status may be badly biased. In this work we investigate the properties of the doubly robust (DR) method for estimating the ROC curve under verification bias, originally developed by Rotnitzky et al. (2006) for estimating the area under the ROC curve. The DR method can be applied to continuous-scale tests and allows for a nonignorable process of selection to verification. We derive the estimator's asymptotic distribution and examine its finite-sample properties via a simulation study. We exemplify the DR procedure for estimating ROC curves with data collected on patients undergoing electron beam computed tomography, a diagnostic test for calcification of the arteries.
Diagnostic test; Nonignorable; Semiparametric model; Sensitivity analysis; Sensitivity; Specificity
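The full doubly robust estimator is beyond a short sketch, but the core idea of correcting verification bias by reweighting can be illustrated with a simplified inverse-probability-weighting step. Note the simplification: this assumes the verification probabilities are known and selection is ignorable, which is weaker than the nonignorable setting the DR method handles; the function name and data are illustrative only.

```python
def ipw_rates(scores, disease, verified, p_verify, threshold):
    # Weight each verified subject by 1 / P(verification); unverified subjects,
    # whose disease status is unobserved, contribute nothing directly
    tp = fp = w_pos = w_neg = 0.0
    for s, d, v, p in zip(scores, disease, verified, p_verify):
        if not v:
            continue
        w = 1.0 / p
        if d:
            w_pos += w
            tp += w * (s >= threshold)
        else:
            w_neg += w
            fp += w * (s >= threshold)
    return tp / w_pos, fp / w_neg   # (TPR, FPR) at this threshold

# Sanity check: with full verification the IPW rates reduce to the usual empirical ones
scores = [1, 2, 3, 4]
disease = [0, 0, 1, 1]
tpr, fpr = ipw_rates(scores, disease, [True] * 4, [1.0] * 4, threshold=3)
```

Sweeping the threshold traces out a bias-corrected ROC curve; the DR method additionally augments the weighting with a disease model so the estimator stays consistent if either model is correct.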
Purpose. To investigate the diagnostic accuracy of machine learning classifiers (MLCs) using retinal nerve fiber layer (RNFL) and optic nerve (ON) parameters obtained with spectral domain optical coherence tomography (SD-OCT). Methods. Fifty-seven patients with early to moderate primary open angle glaucoma and 46 healthy patients were recruited. All 103 patients underwent a complete ophthalmological examination, achromatic standard automated perimetry, and imaging with SD-OCT. Receiver operating characteristic (ROC) curves were built for RNFL and ON parameters. Ten MLCs were tested. Areas under ROC curves (aROCs) obtained for each SD-OCT parameter and MLC were compared. Results. The mean age was 56.5 ± 8.9 years for healthy individuals and 59.9 ± 9.0 years for glaucoma patients (P = 0.054). Mean deviation values were −1.4 dB for healthy individuals and −4.0 dB for glaucoma patients (P < 0.001). SD-OCT parameters with the greatest aROCs were cup/disc area ratio (0.846) and average cup/disc (0.843). aROCs obtained with classifiers varied from 0.687 (CTREE) to 0.877 (RAN). The aROC obtained with RAN (0.877) was not significantly different from the aROC obtained with the best single SD-OCT parameter (0.846) (P = 0.542). Conclusion. MLCs showed good accuracy but did not improve the sensitivity and specificity of SD-OCT for the diagnosis of glaucoma.
Evaluation of diagnostic performance is a necessary component of new developments in many fields including medical diagnostics and decision making. The methodology for statistical analysis of diagnostic performance continues to develop, offering new analytical tools for conventional inferences and solutions for novel and increasingly more practically relevant questions.
In this paper we focus on the partial area under the Receiver Operating Characteristic (ROC) curve, or pAUC. This summary index is considered to be more practically relevant than the area under the entire ROC curve (AUC), but because of several perceived limitations, it is not used as often. In order to improve interpretation, results for pAUC analysis are frequently reported using a rescaled index such as the standardized partial AUC proposed by McClish (1989).
We derive two important properties of the relationship between the “standardized” pAUC and the defined range of interest, which could facilitate a wider and more appropriate use of this important summary index. First, we mathematically prove that the “standardized” pAUC increases with increasing range of interest for practically common ROC curves. Second, using comprehensive numerical investigations we demonstrate that, contrary to common belief, the uncertainty about the estimated standardized pAUC can either decrease or increase with an increasing range of interest.
Our results indicate that the partial AUC could frequently offer advantages in terms of statistical uncertainty of the estimation. In addition, selection of a wider range of interest will likely lead to an increased estimate even for standardized pAUC.
evaluation of diagnostic performance; ROC; partial area under the Receiver Operating Characteristics; standardized pAUC; summary index; variance of standardized pAUC
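The McClish (1989) standardization discussed above rescales the pAUC over a false-positive range (e1, e2) onto the interval [0.5, 1.0], using the chance-line area as the floor and the full rectangle over the range as the ceiling. A sketch of the transformation as it is usually stated (the function name is my own):

```python
def standardized_pauc(pauc, e1, e2):
    # Area under the chance line y = x over (e1, e2): the uninformative floor
    a_min = (e2 ** 2 - e1 ** 2) / 2
    # Area of the full rectangle of height 1 over (e1, e2): the perfect ceiling
    a_max = e2 - e1
    # McClish rescaling: maps [a_min, a_max] linearly onto [0.5, 1.0]
    return 0.5 * (1 + (pauc - a_min) / (a_max - a_min))
```

Because a_min grows quadratically with the range width while a_max grows linearly, widening the range of interest changes both endpoints of the rescaling, which is why the standardized index behaves as the paper describes.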
The EUCAST and the CLSI have established different breakpoints for fluconazole and Candida spp. However, the reference methodologies employed to obtain the MICs provide similar results. The aim of this work was to apply supervised classification algorithms to analyze the clinical data used by the CLSI to establish fluconazole breakpoints for Candida infections and to compare these data with the results obtained with the data set used to set up EUCAST fluconazole breakpoints, where the MIC for detecting failures was >4 mg/liter, with a sensitivity of 87%, a false-positive rate of 8%, and an area under the receiver operating characteristic (ROC) curve of 0.89. Five supervised classifiers (J48 and CART decision trees, the OneR decision rule, the naïve Bayes classifier, and simple logistic regression) were used to analyze the original cohort of patients (Rex's data set), which was used to establish CLSI breakpoints, and a later cohort of candidemia (Clancy's data set), with which CLSI breakpoints were validated. The target variable was the outcome of the infections, and the predictor variable was the MIC or dose/MIC ratio. For Rex's data set, the MIC detecting failures was >8 mg/liter, and for Clancy's data set, the MIC detecting failures was >4 mg/liter, in close agreement with the EUCAST breakpoint (MIC > 4 mg/liter). The sensitivities, false-positive rates, and areas under the ROC curve obtained by means of CART, the algorithm with the best statistical results, were 52%, 18%, and 0.7, respectively, for Rex's data set and 65%, 6%, and 0.72, respectively, for Clancy's data set. In addition, the correlation between outcome and dose/MIC ratio was analyzed for Clancy's data set, where a dose/MIC ratio of >75 was associated with successes, with a sensitivity of 93%, a false-positive rate of 29%, and an area under the ROC curve of 0.83. 
This dose/MIC ratio of >75 was identical to that found for the cohorts used by EUCAST to establish their breakpoints (a dose/MIC ratio of >75, with a sensitivity of 91%, a false-positive rate of 10%, and an area under the ROC curve of 0.90).
To compare the diagnostic accuracy of the Matrix frequency-doubling technology (FDT) 24-2, first-generation FDT N-30 (FDT N-30), and standard automated perimetry (SAP) tests of visual function.
One eye of each of 85 glaucoma patients and 81 healthy controls from the Diagnostic Innovations in Glaucoma Study was included. Evidence of glaucomatous optic neuropathy on stereophotographs was used to classify the eyes. Matrix FDT 24-2, first-generation FDT N-30, and SAP-SITA 24-2 tests were performed on all participants within 3 months. Receiver operating characteristic (ROC) curves were generated and used to determine sensitivity levels at 80% and 90% specificity for mean deviation (MD), pattern standard deviation (PSD), number of total deviation (TD), and pattern deviation (PD) points triggered at less than 5% and 1%. The tests were compared using the best parameter for each test (that with the highest area under the ROC curve) and with the PSD.
The best parameters were MD for SAP (0.680), PSD for FDT N-30 (0.733), and number of TD less than 5% points for FDT 24-2 (0.774). Using the best parameter, the area under the ROC curve was significantly larger for FDT 24-2 than for SAP (P = 0.01). No statistically significant differences were observed between SAP and FDT N-30 (P = 0.21) and FDT N-30 and FDT 24-2 (P = 0.26). Similar results were obtained when the PSD was used to compare the tests, with the exception that the area under the ROC curve for the FDT N-30 test (0.733) was significantly larger than that of the SAP-SITA (0.641; P = 0.03).
The performance of the Matrix FDT 24-2 was similar to that of the first-generation FDT N-30. The Matrix FDT 24-2 test was consistently better than SAP at discriminating between healthy and glaucomatous eyes. Further studies are needed to evaluate the ability of the Matrix FDT 24-2 to monitor glaucoma progression.