The receiver operating characteristic (ROC) curve is an important tool to gauge the performance of classifiers. In certain situations of high-throughput data analysis, the data is heavily class-skewed, i.e. most features tested belong to the true negative class. In such cases, only a small portion of the ROC curve is relevant in practical terms, rendering the ROC curve and its area under the curve (AUC) insufficient for the purpose of judging classifier performance. Here we define an ROC surface (ROCS) using true positive rate (TPR), false positive rate (FPR), and true discovery rate (TDR). The ROC surface, together with the associated quantities, volume under the surface (VUS) and FDR-controlled area under the ROC curve (FCAUC), provide a useful approach for gauging classifier performance on class-skewed high-throughput data. The implementation as an R package is available at http://userwww.service.emory.edu/~tyu8/ROCS/.
Diagnostic tests commonly are characterized by their true positive (sensitivity) and true negative (specificity) classification rates, which rely on a single decision threshold to classify a test result as positive. A more complete description of test accuracy is given by the receiver operating characteristic (ROC) curve, a graph of the false positive and true positive rates obtained as the decision threshold is varied. A generalized regression methodology, which uses a class of ordinal regression models to estimate smoothed ROC curves has been described. Data from a multi-institutional study comparing the accuracy of magnetic resonance (MR) imaging with computed tomography (CT) in detecting liver metastases, which are ideally suited for ROC regression analysis, are described. The general regression model is introduced and an estimate for the area under the ROC curve and its standard error using parameters of the ordinal regression model is given. An analysis of the liver data that highlights the utility of the methodology in parsimoniously adjusting comparisons for covariates is presented.
Syndromic surveillance systems can potentially be used to detect a bioterrorist attack earlier than traditional surveillance, by virtue of their near real-time analysis of relevant data. Receiver operator characteristic (ROC) curve analysis using the area under the curve (AUC) as a comparison metric has been recommended as a practical evaluation tool for syndromic surveillance systems, yet traditional ROC curves do not account for timeliness of detection or subsequent time-dependent health outcomes.
Using a decision-analytic approach, we predicted outcomes, measured in lives, quality adjusted life years (QALYs), and costs, for a series of simulated bioterrorist attacks. We then evaluated seven detection algorithms applied to syndromic surveillance data using outcomes-weighted ROC curves compared to simple ROC curves and timeliness-weighted ROC curves. We performed sensitivity analyses by varying the model inputs between best and worst case scenarios and by applying different methods of AUC calculation.
The decision analytic model results indicate that if a surveillance system was successful in detecting an attack, and measures were immediately taken to deliver treatment to the population, the lives, QALYs and dollars lost could be reduced considerably. The ROC curve analysis shows that the incorporation of outcomes into the evaluation metric has an important effect on the apparent performance of the surveillance systems. The relative order of performance is also heavily dependent on the choice of AUC calculation method.
This study demonstrates the importance of accounting for mortality, morbidity and costs in the evaluation of syndromic surveillance systems. Incorporating these outcomes into the ROC curve analysis allows for more accurate identification of the optimal method for signaling a possible bioterrorist attack. In addition, the parameters used to construct an ROC curve should be given careful consideration.
The receiver operating characteristic (ROC) curve, which is defined as a plot of test sensitivity as the y coordinate versus its 1-specificity or false positive rate (FPR) as the x coordinate, is an effective method of evaluating the performance of diagnostic tests. The purpose of this article is to provide a nonmathematical introduction to ROC analysis. Important concepts involved in the correct use and interpretation of this analysis, such as smooth and empirical ROC curves, parametric and nonparametric methods, the area under the ROC curve and its 95% confidence interval, the sensitivity at a particular FPR, and the use of a partial area under the ROC curve are discussed. Various considerations concerning the collection of data in radiological ROC studies are briefly discussed. An introduction to the software frequently used for performing ROC analyses is also presented.
Diagnostic radiology; Receiver operating characteristic (ROC) curve; Software reviews; Statistical analysis
As in many different areas of science and technology, most important problems in bioinformatics rely on the proper development and assessment of binary classifiers. A generalized assessment of the performance of binary classifiers is typically carried out through the analysis of their receiver operating characteristic (ROC) curves. The area under the ROC curve (AUC) constitutes a popular indicator of the performance of a binary classifier. However, the assessment of the statistical significance of the difference between any two classifiers based on this measure is not a straightforward task, since not many freely available tools exist. Most existing software is either not free, difficult to use or not easy to automate when a comparative assessment of the performance of many binary classifiers is intended. This constitutes the typical scenario for the optimization of parameters when developing new classifiers and also for their performance validation through the comparison to previous art.
In this work we describe and release new software to assess the statistical significance of the observed difference between the AUCs of any two classifiers for a common task estimated from paired data or unpaired balanced data. The software is able to perform a pairwise comparison of many classifiers in a single run, without requiring any expert or advanced knowledge to use it. The software relies on a non-parametric test for the difference of the AUCs that accounts for the correlation of the ROC curves. The results are displayed graphically and can be easily customized by the user. A human-readable report is generated and the complete data resulting from the analysis are also available for download, which can be used for further analysis with other software. The software is released as a web server that can be used in any client platform and also as a standalone application for the Linux operating system.
A new software for the statistical comparison of ROC curves is released here as a web server and also as standalone software for the LINUX operating system.
Statistical evaluation of medical imaging tests used for diagnostic and prognostic purposes often employ receiver operating characteristic (ROC) curves. Two methods for ROC analysis are popular. The ordinal regression method is the standard approach used when evaluating tests with ordinal values. The direct ROC modeling method is a more recently developed approach that has been motivated by applications to tests with continuous values, such as biomarkers.
In this paper, we compare the methods in terms of model formulations, interpretations of estimated parameters, the ranges of scientific questions that can be addressed with them, their computational algorithms and the efficiencies with which they use data.
We show that a strong relationship exists between the methods by demonstrating that they fit the same models when only a single test is evaluated. The ordinal regression models are typically alternative parameterizations of the direct ROC models and vice-versa. The direct method has two major advantages over the ordinal regression method: (i) estimated parameters relate directly to ROC curves. This facilitates interpretations of covariate effects on ROC performance; and (ii) comparisons between tests can be done directly in this framework. Comparisons can be made while accommodating covariate effects and comparisons can be made even between tests that have values on different scales, such as between a continuous biomarker test and an ordinal valued imaging test. The ordinal regression method provides slightly more precise parameter estimates from data in our simulated data models.
While the ordinal regression method is slightly more efficient, the direct ROC modeling method has important advantages in regards to interpretation and it offers a framework to address a broader range of scientific questions including the facility to compare tests.
comparisons; covariates; diagnostic test; markers; ordinal regression; percentile values
Rationale and Objectives
Although an ideal observer’s receiver operating characteristic (ROC) curve must be convex — i.e., its slope must decrease monotonically — published fits to empirical data often display “hooks.” Such fits sometimes are accepted on the basis of an argument that experiments are done with real, rather than ideal, observers. However, the fact that ideal observers must produce convex curves does not imply that convex curves describe only ideal observers. This paper aims to identify the practical implications of non-convex ROC curves and the conditions that can lead to empirical and/or fitted ROC curves that are not convex.
Materials and Methods
This paper views non-convex ROC curves from historical, theoretical and statistical perspectives, which we describe briefly. We then consider population ROC curves with various shapes and analyze the types of medical decisions that they imply. Finally, we describe how sampling variability and curve-fitting algorithms can produce ROC curve estimates that include hooks.
We show that hooks in population ROC curves imply the use of an irrational decision strategy, even when the curve doesn’t cross the chance line, and therefore usually are untenable in medical settings. Moreover, we sketch a simple approach to improve any non-convex ROC curve by adding statistical variation to the decision process. Finally, we sketch how to test whether hooks present in ROC data are likely to have been caused by chance alone and how some hooked ROCs found in the literature can be easily explained as fitting artifacts or modeling issues.
In general, ROC curve fits that show hooks should be looked upon with suspicion unless other arguments justify their presence.
Receiver operating characteristic (ROC) analysis; proper ROC curve; maximum likelihood estimation (MLE); contaminated ROC model
Laboratory observer performance measurements, receiver operating characteristic (ROC) and free-response ROC (FROC) differ from actual clinical interpretations in several respects, which could compromise their clinical relevance. The objective of this study was to develop a method for quantifying the clinical relevance of a laboratory paradigm and apply it to compare the ROC and FROC paradigms in a nodule detection task.
The original prospective interpretations of 80 digital chest radiographs were classified by the truth panel as correct (C=1) or incorrect (C=0), depending on correlation with additional imaging, and the average of C was interpreted as the clinical figure of merit. FROC data were acquired for 21 radiologists and ROC data were inferred using the highest ratings. The areas under the ROC and alternative FROC curves were used as laboratory figures of merit. Bootstrap analysis was conducted to estimate conventional agreement measures between laboratory and clinical figures of merit. Also computed was a pseudovalue-based image-level correctness measure of the laboratory interpretations, whose association with C as measured by the area (rAUC) under an appropriately defined relevance ROC curve, is as a measure of the clinical relevance of a laboratory paradigm.
Low correlations (e.g. κ=0.244) and near chance level rAUC values (e.g. 0.598), attributable to differences between the clinical and laboratory paradigms, were observed. The absolute width of the confidence interval was 0.38 for the interparadigm differences of the conventional measures and 0.14 for the difference of the rAUCs.
The rAUC measure was consistent with the traditional measures but was more sensitive to the differences in clinical relevance. A new relevance ROC method for quantifying the clinical relevance of a laboratory paradigm is proposed.
Receiver operating characteristic (ROC) curve, plotting true positive rates against false positive rates as threshold varies, is an important tool for evaluating biomarkers in diagnostic medicine studies. By definition, ROC curve is monotone increasing from 0 to 1 and is invariant to any monotone transformation of test results. And it is often a curve with certain level of smoothness when test results from the diseased and non-diseased subjects follow continuous distributions. Most existing ROC curve estimation methods do not guarantee all of these properties. One of the exceptions is Du and Tang (2009) which applies certain monotone spline regression procedure to empirical ROC estimates. However, their method does not consider the inherent correlations between empirical ROC estimates. This makes the derivation of the asymptotic properties very difficult. In this paper we propose a penalized weighted least square estimation method, which incorporates the covariance between empirical ROC estimates as a weight matrix. The resulting estimator satisfies all the aforementioned properties, and we show that it is also consistent. Then a resampling approach is used to extend our method for comparisons of two or more diagnostic tests. Our simulations show a significantly improved performance over the existing method, especially for steep ROC curves. We then apply the proposed method to a cancer diagnostic study that compares several newly developed diagnostic biomarkers to a traditional one.
ROC curve; Smoothing spline; Bootstrap
The ROC (Receiver Operating Characteristic) curve is the most commonly used statistical tool for describing the discriminatory accuracy of a diagnostic test. Classical estimation of the ROC curve relies on data from a simple random sample from the target population. In practice, estimation is often complicated due to not all subjects undergoing a definitive assessment of disease status (verification). Estimation of the ROC curve based on data only from subjects with verified disease status may be badly biased. In this work we investigate the properties of the doubly robust (DR) method for estimating the ROC curve under verification bias originally developed by Rotnitzky et al. (2006) for estimating the area under the ROC curve. The DR method can be applied for continuous scaled tests and allows for a non ignorable process of selection to verification. We develop the estimator's asymptotic distribution and examine its finite sample properties via a simulation study. We exemplify the DR procedure for estimation of ROC curves with data collected on patients undergoing electron beam computer tomography, a diagnostic test for calcification of the arteries.
Diagnostic test; Nonignorable; Semiparametric model; Sensitivity analysis; Sensitivity; Specificity
Motivation: The performance of classifiers is often assessed using Receiver Operating Characteristic ROC [or (AC) accumulation curve or enrichment curve] curves and the corresponding areas under the curves (AUCs). However, in many fundamental problems ranging from information retrieval to drug discovery, only the very top of the ranked list of predictions is of any interest and ROCs and AUCs are not very useful. New metrics, visualizations and optimization tools are needed to address this ‘early retrieval’ problem.
Results: To address the early retrieval problem, we develop the general concentrated ROC (CROC) framework. In this framework, any relevant portion of the ROC (or AC) curve is magnified smoothly by an appropriate continuous transformation of the coordinates with a corresponding magnification factor. Appropriate families of magnification functions confined to the unit square are derived and their properties are analyzed together with the resulting CROC curves. The area under the CROC curve (AUC[CROC]) can be used to assess early retrieval. The general framework is demonstrated on a drug discovery problem and used to discriminate more accurately the early retrieval performance of five different predictors. From this framework, we propose a novel metric and visualization—the CROC(exp), an exponential transform of the ROC curve—as an alternative to other methods. The CROC(exp) provides a principled, flexible and effective way for measuring and visualizing early retrieval performance with excellent statistical power. Corresponding methods for optimizing early retrieval are also described in the Appendix.
Availability: Datasets are publicly available. Python code and command-line utilities implementing CROC curves and metrics are available at http://pypi.python.org/pypi/CROC/
The classification accuracy of a continuous marker is typically evaluated with the receiver operating characteristic (ROC) curve. In this paper, we study an alternative conceptual framework, the “percentile value.” In this framework, the controls only provide a reference distribution to standardize the marker. The analysis proceeds by analyzing the standardized marker in cases. The approach is shown to be equivalent to ROC analysis. Advantages are that it provides a framework familiar to a broad spectrum of biostatisticians and it opens up avenues for new statistical techniques in biomarker evaluation. We develop several new procedures based on this framework for comparing biomarkers and biomarker performance in different populations. We develop methods that adjust such comparisons for covariates. The methods are illustrated on data from 2 cancer biomarker studies.
Biomarker; Classification; Covariate adjustment; Percentile value; ROC; Standardization
An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease diagnosis and prognosis. Thus it is of interest to develop efficient statistical methods that can simultaneously identify important biomarkers from such high-throughput genomic data and construct appropriate classification rules. It is also of interest to develop methods for evaluation of classification performance and ranking of identified biomarkers.
The ROC (receiver operating characteristic) technique has been widely used in disease classification with low dimensional biomarkers. Compared with the empirical ROC approach, the binormal ROC is computationally more affordable and robust in small sample size cases. We propose using the binormal AUC (area under the ROC curve) as the objective function for two-sample classification, and the scaled threshold gradient directed regularization method for regularized estimation and biomarker selection. Tuning parameter selection is based on V-fold cross validation. We develop Monte Carlo based methods for evaluating the stability of individual biomarkers and overall prediction performance. Extensive simulation studies show that the proposed approach can generate parsimonious models with excellent classification and prediction performance, under most simulated scenarios including model mis-specification. Application of the method to two cancer studies shows that the identified genes are reasonably stable with satisfactory prediction performance and biologically sound implications. The overall classification performance is satisfactory, with small classification errors and large AUCs.
In comparison to existing methods, the proposed approach is computationally more affordable without losing the optimality possessed by the standard ROC method.
Rationale and Objectives
To examine the effects of the number of categories in the rating scale used in an observer experiment on the results of ROC analysis by a simulation study.
Materials and Methods
We have previously evaluated the effects of computer-aided diagnosis (CAD) on radiologists’ characterization of malignant and benign breast masses in serial mammograms. The evaluation of the likelihood of malignancy was performed on a quasi-continuous (0-100 points) confidence-rating scale. In this study, we simulated the use of discrete confidence-rating scales with fewer number of categories and analyzed the results with receiver operating characteristic (ROC) methodology. The observers’ estimates of the likelihood of malignancy were also mapped to BI-RADS assessments with 5 and 7 categories and ROC analysis was performed. The area under the ROC curve and the partial area index obtained from ROC analysis of the different confidence-rating scales were compared.
The fitted ROC curves and the performance indices do not change significantly when the confidence-rating scales were varied from 6 to 101 points if the estimated operating points obtained directly from the data are distributed relatively evenly over the entire range of true-positive fraction (TPF) and false-positive fraction (FPF). The mapping of the likelihood of malignancy observer data to the 7-category BI-RADS assessment scale allowed reliable ROC analysis, whereas mapping to the 5-category BI-RADS scale could cause erratic ROC curve fitting because of the lack of operating points in the mid-range or failure in ROC curve fitting because of data degeneration for some observers.
ROC analysis of discrete confidence rating scales with few but relatively evenly distributed data points over the entire FPF and TPF range is comparable to that of a quasi-continuous rating scale. However, ROC analysis of discrete confidence rating scales with few and unevenly distributed data points may cause unreliable estimations.
Computer-Aided Diagnosis; Continuous and Discrete Confidence Rating Scales; ROC Observer Study; Classification; Mammography
Receiver operating characteristics (ROC) analysis is a common quantitative method used to demonstrate the trade-off between sensitivity and specificity of various tests and other predictors of clinical outcomes. It is also used to compare the discriminatory ability of predictive models. Despite its important role in clinical research and its relevance to clinical practice, it is not a function that has been typically available in commercial statistical packages. For years, many investigators have used a stand alone program first written by Robert Centor in 1988. While an effective tool, this program did not leverage the power of the Web for ease of access, storage and display of ROC results. We present the background on the development and implementation of a functioning Web-based ROC analyzer that is freely available.
SpectraClassifier (SC) is a Java solution for designing and implementing Magnetic Resonance Spectroscopy (MRS)-based classifiers. The main goal of SC is to allow users with minimum background knowledge of multivariate statistics to perform a fully automated pattern recognition analysis. SC incorporates feature selection (greedy stepwise approach, either forward or backward), and feature extraction (PCA). Fisher Linear Discriminant Analysis is the method of choice for classification. Classifier evaluation is performed through various methods: display of the confusion matrix of the training and testing datasets; K-fold cross-validation, leave-one-out and bootstrapping as well as Receiver Operating Characteristic (ROC) curves.
SC is composed of the following modules: Classifier design, Data exploration, Data visualisation, Classifier evaluation, Reports, and Classifier history. It is able to read low resolution in-vivo MRS (single-voxel and multi-voxel) and high resolution tissue MRS (HRMAS), processed with existing tools (jMRUI, INTERPRET, 3DiCSI or TopSpin). In addition, to facilitate exchanging data between applications, a standard format capable of storing all the information needed for a dataset was developed. Each functionality of SC has been specifically validated with real data with the purpose of bug-testing and methods validation. Data from the INTERPRET project was used.
SC is a user-friendly software designed to fulfil the needs of potential users in the MRS community. It accepts all kinds of pre-processed MRS data types and classifies them semi-automatically, allowing spectroscopists to concentrate on interpretation of results with the use of its visualisation tools.
The receiver operating characteristic (ROC) curve, the positive predictive value (PPV) curve and the negative predictive value (NPV) curve are three measures of performance for a continuous diagnostic biomarker. The ROC, PPV and NPV curves are often estimated empirically to avoid assumptions about the distributional form of the biomarkers. Recently, there has been a push to incorporate group sequential methods into the design of diagnostic biomarker studies. A thorough understanding of the asymptotic properties of the sequential empirical ROC, PPV and NPV curves will provide more flexibility when designing group sequential diagnostic biomarker studies. In this paper we derive asymptotic theory for the sequential empirical ROC, PPV and NPV curves under case-control sampling using sequential empirical process theory. We show that the sequential empirical ROC, PPV and NPV curves converge to the sum of independent Kiefer processes and show how these results can be used to derive asymptotic results for summaries of the sequential empirical ROC, PPV and NPV curves.
Group Sequential Methods; Empirical Process Theory; Diagnostic Testing
Rational and Objectives
Receiver operating characteristic analysis (ROC) is often used to find the optimal combination of biomarkers. When the subject level covariates affect the magnitude and/or accuracy of the biomarkers, the combination rule should take into account of the covariate adjustment. The authors propose two new biomarker combination methods that make use of the covariate information.
Materials and Methods
The first method is to maximize the area under covariate-adjusted ROC curve (AAUC). To overcome the limitations of the AAUC measure, the authors further proposed the area under covariate standardized ROC curve (SAUC), which is an extension of the covariate-specific ROC curve. With a series of simulation studies, the proposed optimal AAUC and SAUC methods are compared with the optimal AUC method that ignores the covariates. The biomarker combination methods are illustrated by an example from Alzheimer's disease research.
The simulation results indicate that the optimal AAUC combination performs well in the current study population. The optimal SAUC method is flexible to choose any reference populations, and allows the results to be generalized to different populations.
The proposed optimal AAUC and SAUC approaches successfully address the covariate adjustment problem in estimating the optimal marker combination. The optimal SAUC method is preferred for practical use, because the biomarker combination rule can be easily evaluated for different population of interest.
Biomarker combination; covariate adjustment; AUC; covariate standardization
ROC analysis occupies an increasingly important role in technology assessment. ROC curves allow one to compare a set of ordinal estimates over the entire range of estimates. Sources of such estimates may include subjective probabilities, mathematical prediction models and empiric prediction models (like the APGAR score). The area under the ROC curve measures the ability of the estimation method to discriminate between two states (usually disease and non-disease). This paper discusses how one constructs ROC curves, what the area under the curve means, and how and why one compares two ROC curves. The computer program (ROC ANALYZER) allows easy performance of these analyses on MS-DOS compatible machines.
In contrast to the receiver operating characteristic (ROC) assessment paradigm, localization ROC (LROC) analysis provides a means to jointly assess the accuracy of localization and detection in an observer study. In a typical multireader, multicase (MRMC) evaluation, the data sets are paired so that correlations arise in observer performance both between readers and across the imaging conditions (e.g., reconstruction methods or scanning parameters) being compared. Therefore, MRMC evaluations motivate the need for a statistical methodology to compare correlated LROC curves. In this paper, we suggest a nonparametric strategy for this purpose. Specifically, we find that seminal work of Sen on U-statistics can be applied to estimate the covariance matrix for a vector of LROC area estimates. The resulting covariance estimator is the LROC analog of the covariance estimator given by DeLong et al. for ROC analysis. Once the covariance matrix is estimated, it can be used to construct confidence intervals and/or confidence regions for purposes of comparing observer performance across imaging conditions. In addition, given the results of a small-scale pilot study, the covariance estimator may be used to estimate the number of images and observers needed to achieve a desired confidence interval size in a full-scale observer study. The utility of our methodology is illustrated with a human-observer LROC evaluation of three image reconstruction strategies for fan-beam X-ray computed tomography.
Confidence intervals; image quality; receiver operating characteristic (ROC); U-statistics
The receiver operating characteristic (ROC) curve is used to evaluate a biomarker’s ability for classifying disease status. The Youden Index (J), the maximum potential effectiveness of a biomarker, is a common summary measure of the ROC curve. In biomarker development, levels may be unquantifiable below a limit of detection (LOD) and missing from the overall dataset. Disregarding these observations may negatively bias the ROC curve and thus J. Several correction methods have been suggested for mean estimation and testing; however, little has been written about the ROC curve or its summary measures. We adapt non-parametric (empirical) and semi-parametric (ROC-GLM [generalized linear model]) methods and propose parametric methods (maximum likelihood (ML)) to estimate J and the optimal cut-point (c*) for a biomarker affected by a LOD. We develop unbiased estimators of J and c* via ML for normally and gamma distributed biomarkers. Alpha level confidence intervals are proposed using delta and bootstrap methods for the ML, semi-parametric, and non-parametric approaches respectively. Simulation studies are conducted over a range of distributional scenarios and sample sizes evaluating estimators’ bias, root-mean square error, and coverage probability; the average bias was less than one percent for ML and GLM methods across scenarios and decreases with increased sample size. An example using polychlorinated biphenyl levels to classify women with and without endometriosis illustrates the potential benefits of these methods. We address the limitations and usefulness of each method in order to give researchers guidance in constructing appropriate estimates of biomarkers’ true discriminating capabilities.
Youden Index; ROC curve; Sensitivity and Specificity; Optimal Cut-Point
While the receiver operating characteristic (ROC) method is the acknowledged gold-standard for imaging system assessment, it ignores localization information and differentiation between multiple abnormalities per case. Since the free-response ROC (FROC) method uses localization information and more closely resembles the clinical reporting process, it is being increasingly used. A number of methods have been proposed to analyze the data that results from a FROC study: jackknife alternative FROC (JAFROC) and a variant termed JAFROC1, initial detection and candidate analysis (IDCA) and ROC analysis via the reduction of the multiple ratings on a case to a single rating. The focus of this paper was to compare JAFROC1, IDCA and the ROC analysis methods using a clinical FROC human data set. All methods agreed on the ordering of the modalities and all yielded statistically significant differences of the figures-of-merit, i.e., p<0.05. Both IDCA and JAFROC1 yielded much smaller p-values than ROC. The results are consistent with a recent simulation based validation study comparing these and other methods. In conclusion, IDCA or JAFROC1 analysis of FROC human data may be superior at detecting modality differences than ROC analysis.
Metabolomics is increasingly being applied towards the identification of biomarkers for disease diagnosis, prognosis and risk prediction. Unfortunately among the many published metabolomic studies focusing on biomarker discovery, there is very little consistency and relatively little rigor in how researchers select, assess or report their candidate biomarkers. In particular, few studies report any measure of sensitivity, specificity, or provide receiver operator characteristic (ROC) curves with associated confidence intervals. Even fewer studies explicitly describe or release the biomarker model used to generate their ROC curves. This is surprising given that for biomarker studies in most other biomedical fields, ROC curve analysis is generally considered the standard method for performance assessment. Because the ultimate goal of biomarker discovery is the translation of those biomarkers to clinical practice, it is clear that the metabolomics community needs to start “speaking the same language” in terms of biomarker analysis and reporting-especially if it wants to see metabolite markers being routinely used in the clinic. In this tutorial, we will first introduce the concept of ROC curves and describe their use in single biomarker analysis for clinical chemistry. This includes the construction of ROC curves, understanding the meaning of area under ROC curves (AUC) and partial AUC, as well as the calculation of confidence intervals. The second part of the tutorial focuses on biomarker analyses within the context of metabolomics. This section describes different statistical and machine learning strategies that can be used to create multi-metabolite biomarker models and explains how these models can be assessed using ROC curves. In the third part of the tutorial we discuss common issues and potential pitfalls associated with different analysis methods and provide readers with a list of nine recommendations for biomarker analysis and reporting. To help readers test, visualize and explore the concepts presented in this tutorial, we also introduce a web-based tool called ROCCET (ROC Curve Explorer & Tester, http://www.roccet.ca). ROCCET was originally developed as a teaching aid but it can also serve as a training and testing resource to assist metabolomics researchers build biomarker models and conduct a range of common ROC curve analyses for biomarker studies.
Electronic supplementary material
The online version of this article (doi:10.1007/s11306-012-0482-9) contains supplementary material, which is available to authorized users.
Biomarker analysis; ROC curve; AUC; Confidence intervals; Optimal threshold; Sample size; Bootstrapping; Cross validation; Biomarker validation and reporting
Receiver operating characteristic (ROC) curve analysis is a simple and effective means to compare the accuracies of indicator variables of bacterial beach water quality. The indicator variables examined in this study were previous day's Enterococcus density and antecedent rainfall at 24, 48, and 96 h. Daily Enterococcus densities and 15-min rainfall values were collected during a 5-year (1996 to 2000) study of four Boston Harbor beaches. The indicator variables were assessed for their ability to correctly classify water as suitable or unsuitable for swimming at a maximum threshold Enterococcus density of 104 CFU/100 ml. Sensitivity and specificity values were determined for each unique previous day's Enterococcus density and antecedent rainfall volume and used to construct ROC curves. The area under the ROC curve was used to compare the accuracies of the indicator variables. Twenty-four-hour antecedent rainfall classified elevated Enterococcus densities more accurately than previous day's Enterococcus density (P = 0.079). An empirically derived threshold for 48-h antecedent rainfall, corresponding to a sensitivity of 0.75, was determined from the 1996 to 2000 data and evaluated to ascertain if the threshold would produce a 0.75 sensitivity with independent water quality data collected in 2001 from the same beaches.
Comparison of two samples can sometimes be conducted on the basis of analysis of ROC curves. A variety of methods of point estimation and confidence intervals for ROC curves have been proposed and studied well. We develop smoothed empirical likelihood based confidence intervals for ROC curves when the samples are censored and generated from semiparametric models. The resulting empirical log-likelihood function is shown to be asymptotically chi-squared. Simulation studies illustrate that the proposed empirical likelihood confidence interval is advantageous over the normal approximation based confidence interval. A real data set is analyzed using the proposed method.
Estimating equation; confidence interval; coverage; Kaplan-Meier estimation; empirical likelihood ratio; empirical likelihood function