The area under a receiver operating characteristic (ROC) curve (AUC) is a commonly used index for summarizing the ability of a continuous diagnostic test to discriminate between healthy and diseased subjects. If all subjects have their true disease status verified, one can directly estimate the AUC nonparametrically using the Wilcoxon statistic. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Because estimators of the AUC based only on verified subjects are typically biased, it is common to estimate the AUC from a bias-corrected ROC curve. The variance of the estimator, however, does not have a closed-form expression and thus resampling techniques are used to obtain an estimate. In this paper, we develop a new method for directly estimating the AUC in the setting of verification bias based on U-statistics and inverse probability weighting. Closed-form expressions for the estimator and its variance are derived. We also show that the new estimator is equivalent to the empirical AUC derived from the bias-corrected ROC curve arising from the inverse probability weighting approach.
Diagnostic test; Inverse probability weighting; Missing at random; U-statistic
The area under the ROC curve (AUC) and partial area under the ROC curve (pAUC) are summary measures used to assess the accuracy of a biomarker in discriminating true disease status. The standard sampling approach used in biomarker validation studies is often inefficient and costly, especially when ascertaining the true disease status is costly and invasive. To improve efficiency and reduce the cost of biomarker validation studies, we consider a test-result-dependent sampling (TDS) scheme, in which subject selection for determining the disease state is dependent on the result of a biomarker assay. We first estimate the test-result distribution using data arising from the TDS design. With the estimated empirical test-result distribution, we propose consistent nonparametric estimators for AUC and pAUC and establish the asymptotic properties of the proposed estimators. Simulation studies show that the proposed estimators have good finite sample properties and that the TDS design yields more efficient AUC and pAUC estimates than a simple random sampling (SRS) design. A data example based on an ongoing cancer clinical trial is provided to illustrate the TDS design and the proposed estimators. This work can find broad applications in design and analysis of biomarker validation studies.
Area under ROC curve (AUC); Empirical likelihood; Nonparametric; Partial area under ROC curve (pAUC); Simple random sampling; Test-result-dependent sampling
To compare the diagnostic accuracy of two continuous screening tests, a common approach is to test the difference between the areas under the receiver operating characteristic (ROC) curves. After study participants are screened with both screening tests, the disease status is determined as accurately as possible, either by an invasive, sensitive and specific secondary test, or by a less invasive, but less sensitive approach. For most participants, disease status is approximated through the less sensitive approach. The invasive test must be limited to the fraction of the participants whose results on either or both screening tests exceed a threshold of suspicion, or who develop signs and symptoms of the disease after the initial screening tests.
The limitations of this study design lead to a bias in the ROC curves we call paired screening trial bias. This bias reflects the synergistic effects of inappropriate reference standard bias, differential verification bias, and partial verification bias. The absence of a gold reference standard leads to inappropriate reference standard bias. When different reference standards are used to ascertain disease status, it creates differential verification bias. When only suspicious screening test scores trigger a sensitive and specific secondary test, the result is a form of partial verification bias.
For paired screening tests with bivariate normally distributed scores, we give formulae and programs to quantify the effect of paired screening trial bias on a paired comparison of area under the curves. We fix the prevalence of disease, and the chance a diseased subject manifests signs and symptoms. We derive the formulas for true sensitivity and specificity, and those for the sensitivity and specificity observed by the study investigator.
The observed area under the ROC curves is quite different from the true area under the ROC curves. The typical direction of the bias is a strong inflation in sensitivity, paired with a concomitant slight deflation of specificity.
In paired trials of screening tests, when area under the ROC curve is used as the metric, bias may lead researchers to make the wrong decision as to which screening test is better.
Receiver operating characteristic (ROC) curve, plotting true positive rates against false positive rates as threshold varies, is an important tool for evaluating biomarkers in diagnostic medicine studies. By definition, ROC curve is monotone increasing from 0 to 1 and is invariant to any monotone transformation of test results. And it is often a curve with certain level of smoothness when test results from the diseased and non-diseased subjects follow continuous distributions. Most existing ROC curve estimation methods do not guarantee all of these properties. One of the exceptions is Du and Tang (2009) which applies certain monotone spline regression procedure to empirical ROC estimates. However, their method does not consider the inherent correlations between empirical ROC estimates. This makes the derivation of the asymptotic properties very difficult. In this paper we propose a penalized weighted least square estimation method, which incorporates the covariance between empirical ROC estimates as a weight matrix. The resulting estimator satisfies all the aforementioned properties, and we show that it is also consistent. Then a resampling approach is used to extend our method for comparisons of two or more diagnostic tests. Our simulations show a significantly improved performance over the existing method, especially for steep ROC curves. We then apply the proposed method to a cancer diagnostic study that compares several newly developed diagnostic biomarkers to a traditional one.
ROC curve; Smoothing spline; Bootstrap
The receiver operating characteristic (ROC) curve is a tool commonly used to evaluate biomarker utility in clinical diagnosis of disease, especially during biomarker development research. Emerging biomarkers are often measured with random measurement error and subject to limits of detection that hinder their potential utility or mask an ability to discriminate by negatively biasing the estimates of ROC curves and subsequent area under the curve. Methods have been developed to correct the ROC curve for each of these types of sources of bias but here we develop a method by which the ROC curve is corrected for both simultaneously through replicate measures and maximum likelihood. Our method is evaluated via simulation study and applied to two potential discriminators of women with and without preeclampsia.
ROC curve; limit of detection; measurement error; area under the curve; replicates
Receiver operating characteristic (ROC) curves can be used to assess the accuracy of tests measured on ordinal or continuous scales. The most commonly used measure for the overall diagnostic accuracy of diagnostic tests is the area under the ROC curve (AUC). A gold standard test on the true disease status is required to estimate the AUC. However, a gold standard test may sometimes be too expensive or infeasible. Therefore, in many medical research studies, the true disease status of the subjects may remain unknown. Under the normality assumption on test results from each disease group of subjects, using the expectation-maximization (EM) algorithm in conjunction with a bootstrap method, we propose a maximum likelihood based procedure for construction of confidence intervals for the difference in paired areas under ROC curves in the absence of a gold standard test. Simulation results show that the proposed interval estimation procedure yields satisfactory coverage probabilities and interval lengths. The proposed method is illustrated with two examples.
Area under the ROC curve; EM algorithm; bootstrap method; gold standard test; maximum likelihood estimation
Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified.
Diagnostic test; Model misspecification; Propensity score; Sensitivity; Specificity
We present a unified approach to nonparametric comparisons of receiver operating characteristic (ROC) curves for a paired design with clustered data. Treating empirical ROC curves as stochastic processes, their asymptotic joint distribution is derived in the presence of both between-marker and within-subject correlations. A Monte Carlo method is developed to approximate their joint distribution without involving nonparametric density estimation. The developed theory is applied to derive new inferential procedures for comparing weighted areas under the ROC curves, confidence bands for the difference function of ROC curves, confidence intervals for the set of specificities at which one diagnostic test is more sensitive than the other, and multiple comparison procedures for comparing more than two diagnostic markers. Our methods demonstrate satisfactory small-sample performance in simulations. We illustrate our methods using clustered data from a glaucoma study and repeated-measurement data from a startle response study.
Area under the receiver operating characteristic curve; Clustered data; Confidence band; Intersection-union tests; Longitudinal data; Multiple comparison; Paired design; Partial area under the receiver operating characteristic curve; Quantile process; Repeated measurement
Diagnostic tests commonly are characterized by their true positive (sensitivity) and true negative (specificity) classification rates, which rely on a single decision threshold to classify a test result as positive. A more complete description of test accuracy is given by the receiver operating characteristic (ROC) curve, a graph of the false positive and true positive rates obtained as the decision threshold is varied. A generalized regression methodology, which uses a class of ordinal regression models to estimate smoothed ROC curves has been described. Data from a multi-institutional study comparing the accuracy of magnetic resonance (MR) imaging with computed tomography (CT) in detecting liver metastases, which are ideally suited for ROC regression analysis, are described. The general regression model is introduced and an estimate for the area under the ROC curve and its standard error using parameters of the ordinal regression model is given. An analysis of the liver data that highlights the utility of the methodology in parsimoniously adjusting comparisons for covariates is presented.
The receiver operating characteristic (ROC) curve is used to evaluate a biomarker’s ability for classifying disease status. The Youden Index (J), the maximum potential effectiveness of a biomarker, is a common summary measure of the ROC curve. In biomarker development, levels may be unquantifiable below a limit of detection (LOD) and missing from the overall dataset. Disregarding these observations may negatively bias the ROC curve and thus J. Several correction methods have been suggested for mean estimation and testing; however, little has been written about the ROC curve or its summary measures. We adapt non-parametric (empirical) and semi-parametric (ROC-GLM [generalized linear model]) methods and propose parametric methods (maximum likelihood (ML)) to estimate J and the optimal cut-point (c*) for a biomarker affected by a LOD. We develop unbiased estimators of J and c* via ML for normally and gamma distributed biomarkers. Alpha level confidence intervals are proposed using delta and bootstrap methods for the ML, semi-parametric, and non-parametric approaches respectively. Simulation studies are conducted over a range of distributional scenarios and sample sizes evaluating estimators’ bias, root-mean square error, and coverage probability; the average bias was less than one percent for ML and GLM methods across scenarios and decreases with increased sample size. An example using polychlorinated biphenyl levels to classify women with and without endometriosis illustrates the potential benefits of these methods. We address the limitations and usefulness of each method in order to give researchers guidance in constructing appropriate estimates of biomarkers’ true discriminating capabilities.
Youden Index; ROC curve; Sensitivity and Specificity; Optimal Cut-Point
A common feature of diagnostic research is that results for a diagnostic gold standard are available primarily for patients who are positive for the test under investigation. Data from such studies are subject to what has been termed "verification bias". We evaluated statistical methods for verification bias correction when there are few false negatives.
A simulation study was conducted of a screening study subject to verification bias. We compared estimates of the area-under-the-curve (AUC) corrected for verification bias varying both the rate and mechanism of verification.
In a single simulated data set, varying false negatives from 0 to 4 led to verification bias corrected AUCs ranging from 0.550 to 0.852. Excess variation associated with low numbers of false negatives was confirmed in simulation studies and by analyses of published studies that incorporated verification bias correction. The 2.5th – 97.5th centile range constituted as much as 60% of the possible range of AUCs for some simulations.
Screening programs are designed such that there are few false negatives. Standard statistical methods for verification bias correction are inadequate in this circumstance.
The receiver operating characteristic (ROC) curve displays the capacity of a marker or diagnostic test to discriminate between two groups of subjects, cases versus controls. We present a comprehensive suite of Stata commands for performing ROC analysis. Non-parametric, semiparametric and parametric estimators are calculated. Comparisons between curves are based on the area or partial area under the ROC curve. Alternatively pointwise comparisons between ROC curves or inverse ROC curves can be made. Options to adjust these analyses for covariates, and to perform ROC regression are described in a companion article. We use a unified framework by representing the ROC curve as the distribution of the marker in cases after standardizing it to the control reference distribution.
ROC analysis occupies an increasingly important role in technology assessment. ROC curves allow one to compare a set of ordinal estimates over the entire range of estimates. Sources of such estimates may include subjective probabilities, mathematical prediction models and empiric prediction models (like the APGAR score). The area under the ROC curve measures the ability of the estimation method to discriminate between two states (usually disease and non-disease). This paper discusses how one constructs ROC curves, what the area under the curve means, and how and why one compares two ROC curves. The computer program (ROC ANALYZER) allows easy performance of these analyses on MS-DOS compatible machines.
Definitions of renal function in patients undergoing coronary artery bypass graft surgery (CABG) vary in the literature. We sought to investigate which method of estimating renal function is the best predictor of mortality after CABG.
We analysed the preoperative and postoperative renal function data from all patients undergoing isolated CABG from January 1998 through December 2007. Preoperative and postoperative renal function was estimated using serum creatinine (SeCr) levels, creatinine clearance (CrCl) determined by the Cockcroft-Gault formula and the glomerular filtration rate (e-GFR) estimated by the Modification of Diet in Renal Disease (MDRD) formula. Receiver operator characteristic (ROC) curves and area under the ROC curves were calculated.
In 9987 patients, CrCl had the best discriminatory power to predict early as well as late mortality, followed by e-GFR and finally SeCr. The odds ratios for preoperative parameters for early mortality were closer to 1 than those of the postoperative parameters.
Renal function determined by the Cockcroft-Gault formula is the best predictor of early and late mortality after CABG. The relationship between renal function and mortality is non-linear. Renal function as a variable in risk scoring systems such as the EuroSCORE needs to be reconsidered.
Coronary artery bypass grafts, CABG; Kidney, renal function; Statistics, regression analysis
As medical technology proliferates, we must have useful methods for evaluating that technology. One method - ROC curves - allows us to examine the spectrum of a test's usefulness. This paper discusses the concept of ROC curves and presents a simple method for estimating the area under the ROC curve. This measurement - the area under the ROC curve (AUC) - gives a useful estimate of a test's discriminatory ability. One can easily estimate the AUC using microcomputer spreadsheet software. The paper demonstrates how to develop the program and suggests that spreadsheets be adopted for other simple statistical uses.
Covariate-specific ROC curves are often used to evaluate the classification accuracy of a medical diagnostic test or a biomarker, when the accuracy of the test is associated with certain covariates. In many large-scale screening tests, the gold standard is subject to missingness due to high cost or harmfulness to the patient. In this paper, we propose a semiparametric estimation of the covariate-specific ROC curves with a partial missing gold standard. A location-scale model is constructed for the test result to model the covariates’ effect, but the residual distributions are left unspecified. Thus the baseline and link functions of the ROC curve both have flexible shapes. With the gold standard missing at random (MAR) assumption, we consider weighted estimating equations for the location-scale parameters, and weighted kernel estimating equations for the residual distributions. Three ROC curve estimators are proposed and compared, namely, imputation-based, inverse probability weighted and doubly robust estimators. We derive the asymptotic normality of the estimated ROC curve, as well as the analytical form the standard error estimator. The proposed method is motivated and applied to the data in an Alzheimer's disease research.
Alzheimer's disease; covariate-specific ROC curve; ignorable missingness; verification bias; weighted estimating equations
A need exists for an unbiased measure of the accuracy of feed-forward neural networks used for classification. Receiver Operating Characteristic (ROC) analysis is suited for this measure, and was used to assess the performance of several different network weights. The area under an ROC and its standard error were used to compare different network weight sets, and to follow the performance of a network during the course of training. The ROC is not sensitive to the prior probabilities of examples in the testing set nor to decision bias. The area under an ROC curve is a readily understood measure, and should be used to evaluate neural networks and to report results of learning experiments. Examples are provided from experiments with data from the biotechnology domain.
NEURAL NETWORKS; ROC ANALYSIS; PROTEIN STRUCTURE PREDICTION
An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease diagnosis and prognosis. Thus it is of interest to develop efficient statistical methods that can simultaneously identify important biomarkers from such high-throughput genomic data and construct appropriate classification rules. It is also of interest to develop methods for evaluation of classification performance and ranking of identified biomarkers.
The ROC (receiver operating characteristic) technique has been widely used in disease classification with low dimensional biomarkers. Compared with the empirical ROC approach, the binormal ROC is computationally more affordable and robust in small sample size cases. We propose using the binormal AUC (area under the ROC curve) as the objective function for two-sample classification, and the scaled threshold gradient directed regularization method for regularized estimation and biomarker selection. Tuning parameter selection is based on V-fold cross validation. We develop Monte Carlo based methods for evaluating the stability of individual biomarkers and overall prediction performance. Extensive simulation studies show that the proposed approach can generate parsimonious models with excellent classification and prediction performance, under most simulated scenarios including model mis-specification. Application of the method to two cancer studies shows that the identified genes are reasonably stable with satisfactory prediction performance and biologically sound implications. The overall classification performance is satisfactory, with small classification errors and large AUCs.
In comparison to existing methods, the proposed approach is computationally more affordable without losing the optimality possessed by the standard ROC method.
Rationale and Objectives
Estimation of ROC curves and their associated indices from experimental data can be problematic, especially in multi-reader, multi-case (MRMC) observer studies. Wilcoxon estimates of area under the curve (AUC) can be strongly biased with categorical data, whereas the conventional binormal ROC curve-fitting model may produce unrealistic fits. The “proper” binormal model (PBM) was introduced by Metz and Pan (1) to provide acceptable fits for both sturdy and problematic datasets, but other investigators found that its first software implementation was numerically unstable in some situations (2). Therefore, we created an entirely new algorithm to implement the PBM.
Materials and Methods
This paper describes in detail the new PBM curve-fitting algorithm, which was designed to perform successfully in all problematic situations encountered previously. Extensive testing was conducted also on a broad variety of simulated and real datasets. Windows, Linux, and Apple Macintosh OS X versions of the algorithm are available online at http://xray.bsd.uchicago.edu/krl/.
Plots of fitted curves as well as summaries of AUC estimates and their standard errors are reported. The new algorithm never failed to converge and produced good fits for all of the several million datasets on which it was tested. For all but the most problematic datasets, the algorithm also produced very good estimates of AUC standard error. The AUC estimates compared well with Wilcoxon estimates for continuously -distributed data and are expected to be superior for categorical data.
This implementation of the PBM is reliable in a wide variety of ROC curve-fitting tasks.
Receiver operating characteristic (ROC) analysis; receiver operating characteristic (ROC) curves; proper binormal model; maximum likelihood estimation (MLE); multi-reader; multi-case (MRMC) analysis
Receiver operating characteristic (ROC) curves are useful tools to evaluate classifiers in biomedical and bioinformatics applications. However, conclusions are often reached through inconsistent use or insufficient statistical analysis. To support researchers in their ROC curves analysis we developed pROC, a package for R and S+ that contains a set of tools displaying, analyzing, smoothing and comparing ROC curves in a user-friendly, object-oriented and flexible interface.
With data previously imported into the R or S+ environment, the pROC package builds ROC curves and includes functions for computing confidence intervals, statistical tests for comparing total or partial area under the curve or the operating points of different classifiers, and methods for smoothing ROC curves. Intermediary and final results are visualised in user-friendly interfaces. A case study based on published clinical and biomarker data shows how to perform a typical ROC analysis with pROC.
pROC is a package for R and S+ specifically dedicated to ROC analysis. It proposes multiple statistical tests to compare ROC curves, and in particular partial areas under the curve, allowing proper ROC interpretation. pROC is available in two versions: in the R programming language or with a graphical user interface in the S+ statistical software. It is accessible at http://expasy.org/tools/pROC/ under the GNU General Public License. It is also distributed through the CRAN and CSAN public repositories, facilitating its installation.
The receiver operating characteristic (ROC) curve is an important tool to gauge the performance of classifiers. In certain situations of high-throughput data analysis, the data is heavily class-skewed, i.e. most features tested belong to the true negative class. In such cases, only a small portion of the ROC curve is relevant in practical terms, rendering the ROC curve and its area under the curve (AUC) insufficient for the purpose of judging classifier performance. Here we define an ROC surface (ROCS) using true positive rate (TPR), false positive rate (FPR), and true discovery rate (TDR). The ROC surface, together with the associated quantities, volume under the surface (VUS) and FDR-controlled area under the ROC curve (FCAUC), provide a useful approach for gauging classifier performance on class-skewed high-throughput data. The implementation as an R package is available at http://userwww.service.emory.edu/~tyu8/ROCS/.
Receiver Operating Characteristic (ROC) curves are commonly used to summarize the classification accuracy of diagnostic tests. It is not uncommon in medical practice that multiple diagnostic tests are routinely performed or multiple disease markers are available for the same individuals. When the true disease status is verified by a gold standard test, a variety of methods have been proposed to combine such potential correlated tests to increase the accuracy of disease diagnosis. In this article, we propose a method of combining multiple diagnostic tests in the absence of a gold standard. We assume that the test values and their classification accuracies are dependent on covariates. Simulation studies are performed to examine the performance of the combination method. The proposed method is applied to data from a population-based aging study to compare the accuracy of three screening tests for kidney function and to estimate the prevalence of moderate kidney impairment.
Diagnostic test; Gold standard; Markov chain Monte-Carlo; Sensitivity; Specificity
The receiver operating characteristic (ROC) curve, which is defined as a plot of test sensitivity as the y coordinate versus its 1-specificity or false positive rate (FPR) as the x coordinate, is an effective method of evaluating the performance of diagnostic tests. The purpose of this article is to provide a nonmathematical introduction to ROC analysis. Important concepts involved in the correct use and interpretation of this analysis, such as smooth and empirical ROC curves, parametric and nonparametric methods, the area under the ROC curve and its 95% confidence interval, the sensitivity at a particular FPR, and the use of a partial area under the ROC curve are discussed. Various considerations concerning the collection of data in radiological ROC studies are briefly discussed. An introduction to the software frequently used for performing ROC analyses is also presented.
Diagnostic radiology; Receiver operating characteristic (ROC) curve; Software reviews; Statistical analysis
Bivariate random effect models are currently one of the main methods recommended to synthesize diagnostic test accuracy studies. However, only the logit-transformation on sensitivity and specificity has been previously considered in the literature. In this paper, we consider a bivariate generalized linear mixed model to jointly model the sensitivities and specificities, and discuss the estimation of the summary receiver operating characteristic curve (ROC) and the area under the ROC curve (AUC). As the special cases of this model, we discuss the commonly used logit, probit and complementary log-log transformations. To evaluate the impact of misspecification of the link functions on the estimation, we present two case studies and a set of simulation studies. Our study suggests that point estimation of the median sensitivity and specificity, and AUC is relatively robust to the misspecification of the link functions. However, the misspecification of link functions has a noticeable impact on the standard error estimation and the 95% confidence interval coverage, which emphasizes the importance of choosing an appropriate link function to make statistical inference.
meta-analysis; bivariate random effect models; sensitivity; specificity; receiver operating characteristic curve; area under the ROC curve
Receiver operating characteristic (ROC) analysis is a useful evaluative method of diagnostic accuracy. A Bayesian hierarchical nonlinear regression model for ROC analysis was developed. A validation analysis of diagnostic accuracy was conducted using prospective multi-center clinical trial prostate cancer biopsy data collected from three participating centers. The gold standard was based on radical prostatectomy to determine local and advanced disease. To evaluate the diagnostic performance of PSA level at fixed levels of Gleason score, a normality transformation was applied to the outcome data. A hierarchical regression analysis incorporating the effects of cluster (clinical center) and cancer risk (low, intermediate, and high) was performed, and the area under the ROC curve (AUC) was estimated.
Sensitivity; Specificity; Receiver operating characteristic analysis; Bayesian hierarchical models