Models for risk prediction are widely used in clinical practice to risk stratify and assign treatment strategies. The contribution of new biomarkers has largely been based on the area under the receiver operating characteristic curve, but this measure can be insensitive to important changes in absolute risk. Methods based on risk stratification have recently been proposed to compare predictive models. These include the reclassification calibration statistic, the net reclassification improvement (NRI), and the integrated discrimination improvement (IDI). This work demonstrates the use of reclassification measures, and illustrates their performance for well-known cardiovascular risk predictors in a cohort of women. These measures are targeted at evaluating the potential of new models and markers to change risk strata and alter treatment decisions.
The discovery and development of new biomarkers continues to be an exciting and promising field. Improving the prediction of disease risk is one of the key motivations for these pursuits. Appropriate statistical measures are necessary for drawing meaningful conclusions about the clinical usefulness of these new markers. In this review, we present several novel metrics proposed to serve this purpose. We use reclassification tables constructed based on clinically meaningful disease risk categories to discuss the concepts of calibration, risk separation, risk discrimination, and risk classification accuracy. We discuss the notion that the net reclassification improvement is a simple yet informative way to summarize information contained in risk reclassification tables. In the absence of meaningful risk categories, we suggest a ‘category-less’ version of the net reclassification improvement and integrated discrimination improvement as metrics to summarize the incremental value of new biomarkers. We also suggest that predictiveness curves be preferred to receiver-operating-characteristic curves as visual descriptors of a statistical model’s ability to separate predicted probabilities of disease events. Reporting of standard metrics, including measures of relative risk and the c statistic, is still recommended. These concepts are illustrated with a risk prediction example using data from the Framingham Heart Study.
reclassification; risk prediction; NRI; IDI; calibration; discrimination
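As an illustration of how a reclassification table is summarized by the category-based NRI, the following minimal sketch cross-classifies two sets of predicted risks and reports the event and nonevent components separately. The function names, the cutoffs of 5% and 20%, and the toy data are assumptions made for illustration only; they are not taken from any of the studies summarized here.

```python
from bisect import bisect_right

def risk_category(risk, cutoffs):
    """Index of the risk category for a predicted risk (cutoffs sorted ascending)."""
    return bisect_right(cutoffs, risk)

def categorical_nri(old_risk, new_risk, event, cutoffs):
    """Return (event NRI, nonevent NRI, overall NRI) for category-based reclassification."""
    up_e = down_e = up_n = down_n = 0
    n_events = sum(event)
    n_nonevents = len(event) - n_events
    for p_old, p_new, y in zip(old_risk, new_risk, event):
        move = risk_category(p_new, cutoffs) - risk_category(p_old, cutoffs)
        if y:
            up_e += move > 0      # events moving up are correctly reclassified
            down_e += move < 0
        else:
            up_n += move > 0
            down_n += move < 0    # nonevents moving down are correctly reclassified
    nri_events = (up_e - down_e) / n_events
    nri_nonevents = (down_n - up_n) / n_nonevents
    return nri_events, nri_nonevents, nri_events + nri_nonevents

# Hypothetical toy data: predicted risks under the old and new model, and outcomes.
old = [0.03, 0.10, 0.25, 0.04, 0.15, 0.30, 0.08, 0.02]
new = [0.07, 0.22, 0.30, 0.02, 0.04, 0.18, 0.09, 0.01]
event = [1, 1, 1, 0, 0, 0, 0, 1]
nri_e, nri_ne, nri = categorical_nri(old, new, event, cutoffs=[0.05, 0.20])
print(nri_e, nri_ne, nri)
```

Reporting the two components alongside the sum keeps the event and nonevent reclassification visible, which several of the abstracts in this collection recommend.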
For comparing the performance of a baseline risk prediction model with one that includes an additional predictor, a risk reclassification analysis strategy has been proposed. The first step is to cross-classify risks calculated according to the 2 models for all study subjects. Summary measures including the percentage of reclassification and the percentage of correct reclassification are calculated, along with 2 reclassification calibration statistics. The author shows that interpretations of the proposed summary measures and P values are problematic. The author's recommendation is to display the reclassification table, because it shows interesting information, but to use alternative methods for summarizing and comparing model performance. The Net Reclassification Index has been suggested as one alternative method. The author argues for reporting components of the Net Reclassification Index because they are more clinically relevant than is the single numerical summary measure.
biological markers; diagnosis; epidemiologic methods; prognosis; risk model
Risk reclassification methods have become popular in the medical literature as a means of comparing risk prediction models. In this issue of the Journal, Pencina et al. (Am J Epidemiol. 2012;176(6):492–494) present further results for continuous measures of model discrimination and describe their characteristics in nested models with normally distributed variables. Measures include the change in the area under the receiver operating characteristic curve, the integrated discrimination improvement, and the continuous net reclassification improvement. Although theoretically interesting, these continuous measures may not be the most appropriate to assess clinical utility. The continuous net reclassification improvement, in particular, is a measure of effect rather than model improvement and can sometimes exhibit erratic behavior, as illustrated in 2 examples. Caution is needed before using this as a measure of improvement. Further, the test of the continuous net reclassification improvement and that for the integrated discrimination improvement are similar to the likelihood ratio test in nested models and may be overinterpreted. Reclassification in risk strata, while requiring thresholds, may be more relevant clinically with its ability to examine potential changes in treatment decisions.
calibration; discrimination; model fit; risk prediction
The discrimination of a risk prediction model measures that model's ability to distinguish between subjects with and without events. The area under the receiver operating characteristic curve (AUC) is a popular measure of discrimination. However, the AUC has recently been criticized for its insensitivity in model comparisons in which the baseline model has performed well. Thus, 2 other measures have been proposed to capture improvement in discrimination for nested models: the integrated discrimination improvement and the continuous net reclassification improvement. In the present study, the authors use mathematical relations and numerical simulations to quantify the improvement in discrimination offered by candidate markers of different strengths as measured by their effect sizes. They demonstrate that the increase in the AUC depends on the strength of the baseline model, which is true to a lesser degree for the integrated discrimination improvement. On the other hand, the continuous net reclassification improvement depends only on the effect size of the candidate variable and its correlation with other predictors. These measures are illustrated using the Framingham model for incident atrial fibrillation. The authors conclude that the increase in the AUC, integrated discrimination improvement, and net reclassification improvement offer complementary information and thus recommend reporting all 3 alongside measures characterizing the performance of the final model.
area under curve; biomarkers; discrimination; risk assessment; risk factors
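The IDI and the continuous (category-free) NRI discussed above can be computed directly from paired predicted risks. The sketch below uses their usual sample definitions: the IDI is the mean change in predicted risk among events minus the mean change among nonevents, and the continuous NRI counts any upward or downward movement in predicted risk. Function names and data are hypothetical.

```python
from statistics import fmean

def idi(old_risk, new_risk, event):
    """Integrated discrimination improvement from paired predicted risks."""
    change_events = [n - o for o, n, y in zip(old_risk, new_risk, event) if y]
    change_nonevents = [n - o for o, n, y in zip(old_risk, new_risk, event) if not y]
    return fmean(change_events) - fmean(change_nonevents)

def continuous_nri(old_risk, new_risk, event):
    """Category-free NRI: any increase or decrease in predicted risk counts as movement."""
    def net_up(selected):
        ups = sum(n > o for o, n in selected)
        downs = sum(n < o for o, n in selected)
        return (ups - downs) / len(selected)
    events = [(o, n) for o, n, y in zip(old_risk, new_risk, event) if y]
    nonevents = [(o, n) for o, n, y in zip(old_risk, new_risk, event) if not y]
    # Net upward movement among events plus net downward movement among nonevents.
    return net_up(events) - net_up(nonevents)

# Hypothetical toy data.
old = [0.03, 0.10, 0.25, 0.04, 0.15, 0.30, 0.08, 0.02]
new = [0.07, 0.22, 0.30, 0.02, 0.04, 0.18, 0.09, 0.01]
event = [1, 1, 1, 0, 0, 0, 0, 1]
idi_value = idi(old, new, event)
cnri_value = continuous_nri(old, new, event)
print(idi_value, cnri_value)
```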
The performance of prediction models can be assessed using a variety of different methods and metrics. Traditional measures for binary and survival outcomes include the Brier score to indicate overall model performance, the concordance (or c) statistic for discriminative ability (or area under the receiver operating characteristic (ROC) curve), and goodness-of-fit statistics for calibration.
Several new measures have recently been proposed that can be seen as refinements of discrimination measures, including variants of the c statistic for survival, reclassification tables, net reclassification improvement (NRI), and integrated discrimination improvement (IDI). Moreover, decision–analytic measures have been proposed, including decision curves to plot the net benefit achieved by making decisions based on model predictions.
We aimed to define the role of these relatively novel approaches in the evaluation of the performance of prediction models. For illustration we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer (n=544 for model development, n=273 for external validation).
We suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the predictive model is to be used for making clinical decisions. Other measures of performance may be warranted in specific applications, such as reclassification metrics to gain insight into the value of adding a novel predictor to an established model.
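For a binary outcome, the two traditional measures named above have simple sample versions. The sketch below computes the Brier score as the mean squared difference between predicted risk and outcome, and the c statistic as the proportion of event/nonevent pairs in which the event has the higher predicted risk (ties counted as one half). It does not cover the survival variants of the c statistic; names and data are hypothetical.

```python
def brier_score(risk, event):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(risk, event)) / len(risk)

def c_statistic(risk, event):
    """Proportion of event/nonevent pairs ranked correctly; ties count one half."""
    events = [p for p, y in zip(risk, event) if y]
    nonevents = [p for p, y in zip(risk, event) if not y]
    concordant = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return concordant / (len(events) * len(nonevents))

# Hypothetical toy data.
risk = [0.9, 0.6, 0.4, 0.2]
event = [1, 0, 1, 0]
brier = brier_score(risk, event)
c_stat = c_statistic(risk, event)
print(brier, c_stat)
```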
Concerns have been raised about the use of traditional measures of model fit in evaluating risk prediction models for clinical use, and reclassification tables have been suggested as an alternative means of assessing the clinical utility of a model. Several measures based on the table have been proposed, including the reclassification calibration (RC) statistic, the net reclassification improvement (NRI), and the integrated discrimination improvement (IDI), but the performance of these in practical settings has not been fully examined. We used simulations to estimate the type I error and power for these statistics in a number of scenarios, as well as the impact of the number and type of categories, when adding a new marker to an established or reference model. The type I error was found to be reasonable in most settings, and power was highest for the IDI, which was similar to the test of association. The relative power of the RC statistic, a test of calibration, and the NRI, a test of discrimination, varied depending on the model assumptions. These tools provide unique but complementary information.
Calibration; Discrimination; Model accuracy; Prediction; Reclassification
Net reclassification and integrated discrimination improvements have been proposed as alternatives to the increase in the AUC for evaluating improvement in the performance of risk assessment algorithms introduced by the addition of new phenotypic or genetic markers. In this paper, we demonstrate that in the setting of linear discriminant analysis, under the assumptions of multivariate normality, all three measures can be presented as functions of the squared Mahalanobis distance. This relationship affords an interpretation of the magnitude of these measures in the familiar language of effect size for uncorrelated variables. Furthermore, it allows us to conclude that net reclassification improvement can be viewed as a universal measure of effect size. Our theoretical developments are illustrated with an example based on the Framingham Heart Study risk assessment model for high risk men in primary prevention of cardiovascular disease.
AUC; biomarker; c statistic; model performance; risk prediction; ROC
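To make the relationship with the squared Mahalanobis distance concrete, the sketch below uses the standard result for two multivariate normal groups with a common covariance matrix, where the AUC of the linear discriminant equals the standard normal distribution function evaluated at the square root of half the squared distance. It also assumes that an uncorrelated marker with standardized effect size d adds d squared to that distance. The function names and numerical values are hypothetical, and this is not the authors' own code.

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def auc_from_mahalanobis(m_squared):
    """AUC of the linear discriminant under multivariate normality, equal covariance."""
    return normal_cdf(sqrt(m_squared / 2.0))

def auc_gain(baseline_m_squared, effect_size):
    """Increase in AUC from adding an uncorrelated marker with the given effect size."""
    new_m_squared = baseline_m_squared + effect_size ** 2
    return auc_from_mahalanobis(new_m_squared) - auc_from_mahalanobis(baseline_m_squared)

# The same marker (d = 0.5) added to a weak and to a strong baseline model.
gain_weak = auc_gain(0.5, 0.5)
gain_strong = auc_gain(4.0, 0.5)
print(round(gain_weak, 3), round(gain_strong, 3))
```

Under these assumptions the same marker yields a smaller AUC increase when the baseline model is already strong, which is the dependence on baseline performance noted in several of the abstracts here.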
Purpose of review
We discuss two data analysis issues for studies that use binary clinical outcomes (whether or not an event occurred): the choice of an appropriate scale and transformation when biomarkers are evaluated as explanatory factors in logistic regression; and assessing the ability of biomarkers to improve prediction accuracy for event risk.
Biomarkers with skewed distributions should be transformed before they are included as continuous covariates in logistic regression models. The utility of new biomarkers may be assessed by measuring the improvement in predicting event risk after adding the biomarkers to an existing model. The area under the receiver operating characteristic (ROC) curve (C-statistic) is often cited; it was developed for a different purpose, however, and may not address the clinically relevant questions. Measures of risk reclassification and risk prediction accuracy may be more appropriate.
The appropriate analysis of biomarkers depends on the research question. Odds ratios obtained from logistic regression describe associations of biomarkers with clinical events; failure to accurately transform the markers, however, may result in misleading estimates. Whilst the C-statistic is often used to assess the ability of new biomarkers to improve the prediction of event risk, other measures may be more suitable.
biomarker analysis; odds ratio; ROC curve; risk prediction accuracy; C-statistic
Mortality among patients with heart failure (HF) is high. Though individual biomarkers have been investigated to determine their value in mortality risk prediction, the role of a multimarker strategy requires further evaluation.
Methods and Results
Olmsted County residents presenting with HF from July 2004 to September 2007 were recruited to undergo biomarker measurement. We investigated whether addition of C-reactive protein (CRP), B-type natriuretic peptide (BNP), and troponin T (TnT) to a model including established risk indicators improved 1-year mortality risk prediction using the c statistic, integrated discrimination improvement (IDI), and net reclassification improvement (NRI). Among 593 participants, the mean age was 76.4 years and 48% were men. After 1 year of follow-up, 122 (20.6%) participants had died. Patients with CRP (<11.8 mg/L), BNP (<350 pg/mL), and TnT (≤0.01 ng/mL) below the median had low 1-year mortality (3.3%), while those with two or three biomarkers above the median had markedly increased mortality (30.8% and 35.5%, respectively). The addition of two or more biomarkers to the model offered greater improvement in 1-year mortality risk prediction than use of a single biomarker. The combination of CRP and BNP resulted in an increase in the c statistic from 0.757 to 0.810 (p<0.001), an IDI gain of 7.1% (p<0.001), and an NRI of 22.1% (p<0.001). Use of all three biomarkers offered no incremental gain (IDI gain 0.7% vs. CRP+BNP, p=0.065).
Biomarkers improved 1-year mortality risk prediction beyond established indicators. The use of a two-biomarker combination was superior to a single biomarker in risk prediction, though addition of a third biomarker conferred no added benefit.
epidemiology; heart failure; prognosis; inflammation; community
Rigorous statistical evaluation of the predictive value of novel biomarkers is critical before they are introduced into routine standard care. It is important to identify factors that influence the performance of a biomarker in order to determine the optimal conditions for test performance. We propose a covariate-specific time-dependent PPV curve to quantify the predictive accuracy of a prognostic marker measured on a continuous scale with a censored failure time outcome. The covariate effect is accommodated within a semiparametric regression model framework. In particular, we adopt a smoothed survival time regression technique (Dabrowska, 1997) to account for the situation where the risk of disease occurrence and progression is likely to change over time. In addition, we provide asymptotic distribution theory and resampling-based procedures for making statistical inference on the covariate-specific positive predictive values. We illustrate our approach with numerical studies and a dataset from a prostate cancer study.
Biomarker evaluation; Negative predictive value; Positive predictive value; Semi-parametric survival analysis
This study compares inflammation-related biomarkers with established cardiometabolic risk factors in the prediction of incident type 2 diabetes and incident coronary events in a prospective case-cohort study within the population-based MONICA/KORA Augsburg cohort.
Methods and Findings
Analyses for type 2 diabetes are based on 436 individuals with and 1410 individuals without incident diabetes. Analyses for coronary events are based on 314 individuals with and 1659 individuals without incident coronary events. Mean follow-up times were almost 11 years. Areas under the receiver-operating characteristic curve (AUC), changes in Akaike's information criterion (ΔAIC), integrated discrimination improvement (IDI) and net reclassification index (NRI) were calculated for different models. A basic model consisting of age, sex and survey predicted type 2 diabetes with an AUC of 0.690. Addition of 13 inflammation-related biomarkers (CRP, IL-6, IL-18, MIF, MCP-1/CCL2, IL-8/CXCL8, IP-10/CXCL10, adiponectin, leptin, RANTES/CCL5, TGF-β1, sE-selectin, sICAM-1; all measured in nonfasting serum) increased the AUC to 0.801, whereas addition of cardiometabolic risk factors (BMI, systolic blood pressure, ratio total/HDL-cholesterol, smoking, alcohol, physical activity, parental diabetes) increased the AUC to 0.803 (ΔAUC [95% CI] 0.111 [0.092–0.149] and 0.113 [0.093–0.149], respectively, compared to the basic model). The combination of all inflammation-related biomarkers and cardiometabolic risk factors yielded a further increase in AUC to 0.847 (ΔAUC [95% CI] 0.044 [0.028–0.066] compared to the cardiometabolic risk model). Corresponding AUCs for incident coronary events were 0.807, 0.825 (ΔAUC [95% CI] 0.018 [0.013–0.038] compared to the basic model), 0.845 (ΔAUC [95% CI] 0.038 [0.028–0.059] compared to the basic model) and 0.851 (ΔAUC [95% CI] 0.006 [0.003–0.021] compared to the cardiometabolic risk model), respectively.
Inclusion of multiple inflammation-related biomarkers into a basic model and into a model including cardiometabolic risk factors significantly improved the prediction of type 2 diabetes and coronary events, although the improvement was less pronounced for the latter endpoint.
Many novel and emerging risk factors exhibit a significant association with cardiovascular disease, but have not been found to improve risk prediction. Statistical criteria used to evaluate such models and markers have largely relied on the receiver operating characteristic curve, which is an insensitive measure of improvement. Recently, new methods have been developed based on risk reclassification, or changes in risk strata following use of a new marker or model. Associated measures based on both calibration and discrimination have been proposed. This review describes previous methods used to evaluate models as well as the newly developed methods to evaluate clinical utility.
To assess the value of a continuous marker in predicting the risk of a disease, a graphical tool called the predictiveness curve has been proposed. It characterizes the marker’s predictiveness, or capacity to risk stratify the population, by displaying the distribution of risk endowed by the marker. Methods for making inference about the curve and for comparing curves in a general population have been developed. However, knowledge of a marker’s performance in the general population alone is not enough. Since a marker’s effect on the risk model and its distribution can both differ across subpopulations, its predictiveness may vary when applied to different subpopulations. Moreover, information about the predictiveness of a marker conditional on baseline covariates is valuable for individual decision making about whether to have the marker measured. Therefore, to fully realize the usefulness of a risk prediction marker, it is important to study its performance conditional on covariates. In this article, we propose semiparametric methods for estimating covariate-specific predictiveness curves for a continuous marker. Unmatched and matched case-control study designs are accommodated. We illustrate application of the methodology by evaluating serum creatinine as a predictor of risk of renal artery stenosis.
New markers may improve prediction of diagnostic and prognostic outcomes. We review various measures to quantify the incremental value of markers over standard, readily available characteristics. Widely used traditional measures include the improvement in model fit or in the area under the receiver operating characteristic (ROC) curve (AUC). New measures include the net reclassification index (NRI) and decision–analytic measures, such as the fraction of true positive classifications penalized for false positive classifications (‘net benefit’, NB).
For illustration we discuss a case study on the presence of residual tumor versus benign tissue in 544 patients with testicular cancer. We assessed 3 tumor markers (AFP, HCG, and LDH) for their incremental value over currently standard clinical predictors. AUC and R2 values suggested adding continuous LDH and AFP whereas NB only favored HCG as a potentially promising marker at a clinically defendable decision threshold of 20% risk. Results based on the NRI fell in the middle, suggesting reclassification potential of all three markers.
We conclude that improvement in standard discrimination measures, which focus on finding variables that might be promising across all decision thresholds, may not detect the most informative markers at a specific threshold of particular clinical relevance. When a marker is intended to support decision making, calculation of the improvement in a decision–analytic measure, such as NB, is preferable over an overall judgment as obtained from the AUC in ROC analysis.
prediction; logistic regression model; performance measures; incremental value
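The net benefit described above, the fraction of true positive classifications penalized for false positives, can be sketched as follows. False positives are weighted by the odds of the decision threshold, which is the usual decision-curve definition. The 20% threshold echoes the case study, but the function names and data are hypothetical.

```python
def net_benefit(risk, event, threshold):
    """Net benefit of treating everyone whose predicted risk is at or above the threshold."""
    n = len(risk)
    true_pos = sum(1 for p, y in zip(risk, event) if p >= threshold and y)
    false_pos = sum(1 for p, y in zip(risk, event) if p >= threshold and not y)
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)

def net_benefit_treat_all(event, threshold):
    """Reference strategy: classify every patient as positive."""
    prevalence = sum(event) / len(event)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Hypothetical toy data.
risk = [0.05, 0.10, 0.25, 0.30, 0.60, 0.15, 0.40, 0.08, 0.22, 0.70]
event = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]
nb_model = net_benefit(risk, event, threshold=0.20)
nb_all = net_benefit_treat_all(event, threshold=0.20)
print(nb_model, nb_all)
```

A marker adds value at a given threshold, in this sense, when the model including it has a higher net benefit than both the model without it and the treat-all and treat-none (net benefit of zero) strategies.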
To determine whether erectile dysfunction (ED) predicts cardiovascular disease (CVD) beyond traditional risk factors.
ED and CVD share pathophysiological mechanisms and often co-occur. It is unknown whether ED improves the prediction of CVD beyond traditional risk factors.
This was a prospective, population-based study of 1,709 men (of 3,258 eligible) aged 40–70 years. ED was measured by self-report. Subjects were followed for CVD for an average follow-up of 11.7 years. The association between ED and CVD was examined using the Cox proportional hazards regression model. The discriminatory capability of ED was examined using c statistics. The reclassification of CVD risk associated with ED was assessed using a method that quantifies net reclassification improvement.
A total of 1,057 men with complete risk factor data who were free of CVD and diabetes at baseline were included. During follow-up, 261 new cases of CVD occurred. ED was associated with CVD incidence controlling for age (hazard ratio [HR]: 1.42; 95% confidence interval [CI]: 1.05, 1.90), age and traditional CVD risk factors (HR: 1.41; 95% CI: 1.05, 1.90), as well as age and Framingham risk score (HR: 1.40; 95% CI: 1.04, 1.88). Despite these significant findings, ED did not significantly improve the prediction of CVD incidence beyond traditional risk factors.
Independent of established CVD risk factors, ED is significantly associated with increased CVD incidence. Nonetheless, ED does not improve the prediction of who will and will not develop CVD beyond that offered by traditional risk factors.
Aging; erectile dysfunction; cardiovascular disease; longitudinal studies; men
Risk prediction procedures can be quite useful for the patient’s treatment selection, prevention strategy, or disease management in evidence-based medicine. Often, potentially important new predictors are available in addition to the conventional markers. The question is how to quantify the improvement from the new markers for prediction of the patient’s risk in order to aid cost–benefit decisions. The standard method, using the area under the receiver operating characteristic curve, to measure the added value may not be sensitive enough to capture incremental improvements from the new markers. Recently, some novel alternatives to area under the receiver operating characteristic curve, such as integrated discrimination improvement and net reclassification improvement, were proposed. In this paper, we consider a class of measures for evaluating the incremental values of new markers, which includes the preceding two as special cases. We present a unified procedure for making inferences about measures in the class with censored event time data. The large sample properties of our procedures are theoretically justified. We illustrate the new proposal with data from a cancer study to evaluate a new gene score for prediction of the patient’s survival.
area under the receiver operating characteristic curve; C-statistic; Cox’s regression; integrated discrimination improvement; net reclassification improvement; risk prediction
To date, the only established model for assessing risk for nasopharyngeal carcinoma (NPC) relies on the sero-status of the Epstein-Barr virus (EBV). By contrast, the risk assessment models proposed here include environmental risk factors, family history of NPC, and information on genetic variants. The models were developed using epidemiological and genetic data from a large case-control study, which included 1,387 subjects with NPC and 1,459 controls of Cantonese origin. The predictive accuracy of the models was then assessed by calculating the area under the receiver-operating characteristic curve (AUC). To compare the discriminatory improvement of models with and without genetic information, we estimated the net reclassification improvement (NRI) and integrated discrimination improvement (IDI). Well-established environmental risk factors for NPC include consumption of salted fish and preserved vegetables and cigarette smoking (in pack years). The environmental model alone shows modest discriminatory ability (AUC = 0.68; 95% CI: 0.66, 0.70), which is only slightly increased by the addition of data on family history of NPC (AUC = 0.70; 95% CI: 0.68, 0.72). With the addition of data on genetic variants, however, our model’s discriminatory ability rises to 0.74 (95% CI: 0.72, 0.76). The improvements in NRI and IDI also suggest the potential usefulness of considering genetic variants when screening for NPC in endemic areas. If these findings are confirmed in larger cohort and population-based case-control studies, use of the new models to analyse data from NPC-endemic areas could well lead to earlier detection of NPC.
Fracture prediction models help identify individuals at high risk who may benefit from treatment. The area under the curve (AUC) is commonly used to compare prediction models. However, the AUC has limitations and may miss important differences between models. Novel reclassification methods quantify how accurately models classify patients who benefit from treatment and the proportion of patients above/below treatment thresholds. We applied two reclassification methods, using the NOF treatment thresholds, to compare two risk models: femoral neck BMD and age (“simple model”) and FRAX (“FRAX model”).
The Pepe method classifies based on case/non-case status and examines the proportion of each above and below thresholds. The Cook method examines fracture rates above and below thresholds. We applied these to the Study of Osteoporotic Fractures.
There were 6036 (1037 fractures) and 6232 (389 fractures) participants with complete data for major osteoporotic and hip fracture respectively. Both models for major osteoporotic fracture (0.68 vs. 0.69) and hip fracture (0.75 vs. 0.76) had similar AUCs. In contrast, using reclassification methods, each model classified a substantial number of women differently. Using the Pepe method, the FRAX model (vs. simple model), missed treating 70 (7%) cases of major osteoporotic fracture but avoided treating 285 (6%) non-cases. For hip fracture, the FRAX model missed treating 31 (8%) cases but avoided treating 1026 (18%) non-cases. The Cook method (both models, both fracture outcomes) had similar fracture rates above/below the treatment thresholds.
Compared with the AUC, new methods provide more detailed information about how models classify patients.
hip fracture; major osteoporotic fracture; FRAX; BMD; prediction
There are two popular statistical approaches to biomarker evaluation. One models the risk of disease (or disease outcome) with, for example, logistic regression. A marker is considered useful if it has a strong effect on risk. The second evaluates classification performance by use of measures such as sensitivity, specificity, predictive values, and receiver operating characteristic curves. There is controversy about which approach is more appropriate. Moreover, the two approaches can give contradictory results on the same data. The authors present a new graphic, the predictiveness curve, which complements the risk modeling approach. It assesses the usefulness of a risk model when applied to the population. Although the predictiveness curve relates to classification performance measures, it also displays essential information about risk that is not displayed by the receiver operating characteristic curve. The authors propose that the predictiveness and classification performance of a marker, displayed together in an integrated plot, provide a comprehensive and cohesive assessment of a risk marker or model. The methods are demonstrated with data on prostate-specific antigen and risk factors from the Prostate Cancer Prevention Trial, 1993–2003.
biological markers; classification analysis; diagnostic tests, routine; epidemiologic methods; predictive value of tests; prostate-specific antigen; risk assessment; risk model
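The predictiveness curve plots the risk quantile R(v) against the cumulative proportion v of the population. A minimal empirical sketch, assuming predicted risks from some fitted risk model are already available, is shown below; the published methods use model-based estimation and inference, so this only illustrates what the curve displays. Function names and data are hypothetical.

```python
def predictiveness_curve(risk):
    """Points (v, R(v)): cumulative population proportion and the corresponding risk quantile."""
    ordered = sorted(risk)
    n = len(ordered)
    return [((i + 1) / n, p) for i, p in enumerate(ordered)]

def fraction_above(risk, threshold):
    """Proportion of the population whose predicted risk exceeds a risk threshold."""
    return sum(p > threshold for p in risk) / len(risk)

# Hypothetical predicted risks for four subjects.
risk = [0.3, 0.1, 0.2, 0.4]
curve = predictiveness_curve(risk)
high_risk_fraction = fraction_above(risk, 0.25)
print(curve, high_risk_fraction)
```

A steeper curve indicates that the model spreads the population over a wider range of risks, which is the information about risk that a receiver operating characteristic curve does not display.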
Appropriate quantification of the added usefulness offered by new markers included in risk prediction algorithms is a problem of active research and debate. Standard methods, including statistical significance and the c statistic, are useful but not sufficient. Net reclassification improvement (NRI) offers a simple, intuitive way of quantifying the improvement offered by new markers and has been gaining popularity among researchers. However, several aspects of the NRI have not been studied in sufficient detail.
In this paper we propose a prospective formulation for the NRI which offers immediate application to survival and competing risk data as well as allows for easy weighting with observed or perceived costs. We address the issue of the number and choice of categories and their impact on NRI. We contrast category-based NRI with one which is category-free and conclude that NRIs cannot be compared across studies unless they are defined in the same manner. We discuss the impact of differing event rates when models are applied to different samples or definitions of events and durations of follow-up vary between studies. We also show how NRI can be applied to case-control data. The concepts presented in the paper are illustrated in a Framingham Heart Study example.
In conclusion, NRI can be readily calculated for survival, competing risk, and case-control data, is more objective and comparable across studies using the category-free version, and can include relative costs for classifications. We recommend that researchers clearly define and justify the choices they make when choosing NRI for their application.
discrimination; model performance; NRI; risk prediction; biomarker
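One way to read the prospective formulation is through Bayes' rule: the probability of moving up given an event can be rewritten in terms of the probability of an event given upward movement, which in survival data could be estimated by, for example, a Kaplan-Meier estimate. The sketch below checks the identity on complete binary data only. The function name, the coding of movement as +1, 0, or -1, and the data are assumptions for illustration, not the authors' implementation.

```python
def prospective_nri(moved, event):
    """Event and nonevent NRI written in terms of P(event | direction of movement).

    moved: +1 if the new model moves the subject up, -1 if down, 0 if unchanged.
    """
    n = len(event)
    p_event = sum(event) / n

    def joint(direction):
        # Returns P(event | direction) * P(direction) and P(nonevent | direction) * P(direction).
        outcomes = [y for m, y in zip(moved, event) if m == direction]
        if not outcomes:
            return 0.0, 0.0
        p_direction = len(outcomes) / n
        p_event_given = sum(outcomes) / len(outcomes)
        return p_event_given * p_direction, (1.0 - p_event_given) * p_direction

    event_up, nonevent_up = joint(1)
    event_down, nonevent_down = joint(-1)
    nri_events = (event_up - event_down) / p_event
    nri_nonevents = (nonevent_down - nonevent_up) / (1.0 - p_event)
    return nri_events, nri_nonevents

# Hypothetical toy data: two events move up, two nonevents move down.
moved = [1, 1, 0, 0, -1, -1, 0, 0]
event = [1, 1, 1, 0, 0, 0, 0, 1]
nri_e, nri_ne = prospective_nri(moved, event)
print(nri_e, nri_ne)
```

With complete binary data these values match the count-based definitions, (events up minus events down) divided by the number of events, and analogously for nonevents.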
Although the area under the receiver operating characteristic (ROC) curve (AUC) is the most popular measure of the performance of prediction models, it has limitations, especially when it is used to evaluate the added discrimination of a new risk marker in an existing risk model. Pencina et al. (2008) proposed two indices, the net reclassification improvement (NRI) and integrated discrimination improvement (IDI), to supplement the improvement in the AUC (IAUC). Their NRI and IDI are based on binary outcomes in case-control settings, which do not involve time-to-event outcomes. However, many disease outcomes are time-dependent and the onset time can be censored. Measuring the discrimination potential of a prognostic marker without considering time to event can lead to biased estimates. In this paper, we extended the NRI and IDI to time-to-event settings and derived the corresponding sample estimators and asymptotic tests. Simulation studies showed that the time-dependent NRI and IDI perform better than Pencina’s NRI and IDI for measuring the improved discriminatory power of a new risk marker in prognostic survival models.
Improved discrimination; Prognostic survival models; Time-dependent NRI; Time-dependent IDI
Our recently proposed point scoring model includes the widely-used Spetzler-Martin (SM)-5 variables, along with age, unruptured presentation, and diffuse border (SM-Supp). Here we evaluate the SM-Supp model performance compared to SM-5, SM-3, and Toronto prediction models using net reclassification index (NRI), which quantifies the correct movement in risk reclassification, and validate the model in an independent dataset.
Bad outcome was defined as worsening between preoperative and final postoperative modified Rankin Scale score. Point scores for each model were used as predictors in logistic regression, and predictions evaluated using NRI at varying thresholds (10–30%) and any threshold (continuous NRI>0). Performance was validated in an independent dataset (n=117).
Net gain in risk reclassification was better using the SM-Supp model over a range of threshold values (NRI=9–25%) and significantly improved overall predictions for outcomes in the development dataset, yielding a continuous NRI of 64% versus SM-5, 67% versus SM-3, and 61% versus Toronto (all P<0.001). In the validation dataset, the SM-Supp model again correctly reclassified a greater proportion of patients versus SM-5 (82%), SM-3 (85%), and Toronto models (69%).
The SM-Supp model demonstrated better discrimination and risk reclassification than several existing models and should be considered for clinical practice to estimate surgical risk in patients with brain arteriovenous malformation (BAVM).
receiver operator curve; Modified Rankin Scale; net reclassification
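The two forms of NRI used in the abstract above (at a single risk threshold, and at any threshold) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `nri`, the 20% threshold, and the toy predictions are hypothetical, and outcomes are assumed to be fully observed 0/1 values.

```python
def nri(y, p_old, p_new, threshold=None):
    """Net reclassification index for a binary outcome (illustrative sketch).

    With `threshold` set, a move counts only if it crosses that risk cut-off
    (two-category NRI). With threshold=None, any change in predicted risk
    counts (continuous NRI, often written NRI>0).
    """
    def direction(po, pn):
        # +1 = moved up, -1 = moved down, 0 = no (relevant) change
        if threshold is None:
            return (pn > po) - (pn < po)
        return int(pn >= threshold) - int(po >= threshold)

    moves = {1: [], 0: []}
    for yi, po, pn in zip(y, p_old, p_new):
        moves[yi].append(direction(po, pn))

    # Events should move up; non-events should move down.
    net_up_events = sum(moves[1]) / len(moves[1])
    net_down_nonevents = -sum(moves[0]) / len(moves[0])
    return net_up_events + net_down_nonevents


# Hypothetical toy data: four patients with a bad outcome, four without.
y = [1, 1, 1, 1, 0, 0, 0, 0]
p_old = [0.15, 0.25, 0.10, 0.30, 0.25, 0.15, 0.20, 0.10]
p_new = [0.25, 0.30, 0.05, 0.40, 0.15, 0.10, 0.25, 0.05]

print(nri(y, p_old, p_new, threshold=0.20))  # NRI at a 20% threshold
print(nri(y, p_old, p_new))                  # continuous NRI
```

Because it sums one net proportion for events and one for non-events, the NRI ranges from -2 to 2, so a value reported as a percentage is not the share of patients reclassified.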
Statistical prediction tools are increasingly common in contemporary medicine, but there is considerable disagreement about how they should be evaluated. Three tools (Partin tables, the European Society for Urological Oncology (ESUO) criteria and the Gallina nomogram) have been proposed for the prediction of seminal vesicle invasion (SVI) in patients with clinically localized prostate cancer. We aimed to determine which of these tools, if any, should be used clinically.
The independent validation cohort consisted of 2584 patients treated surgically for clinically localized prostate cancer between 2002 and 2007 at one of four North American tertiary-care referral centers. Traditional (area-under-the-receiver-operating-characteristic-curve (AUC), calibration plots, the Brier score, sensitivity and specificity, positive and negative predictive value) and novel (risk stratification tables, the net reclassification index, decision curve analysis and predictiveness curves) statistical methods quantified the predictive abilities of the three tested models.
Traditional statistical methods (receiver operating characteristic (ROC) plots and Brier scores), as well as two of the novel statistical methods (risk stratification tables and the net reclassification index), could not provide a clear distinction between the SVI prediction tools. For example, ROC plots and Brier scores seemed biased against the binary decision tool (ESUO criteria) and gave discordant results for the continuous predictions of the Partin tables and the Gallina nomogram. The results of the calibration plots were discordant with those of the ROC plots. Conversely, the decision curve clearly indicated that the Partin tables represent the ideal strategy for stratifying the risk of SVI.
Based on decision curve analysis results, surgeons should consider using the Partin tables to predict SVI. Decision curve analysis provided clinically meaningful comparisons between predictive models; other statistical methods for evaluation of prediction models gave inconsistent results that were difficult to interpret.
prostate; prostatic neoplasms; prostatectomy; seminal vesicles; algorithms; statistics
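The quantity plotted in a decision curve is the net benefit at each threshold probability. The sketch below uses the standard net benefit formula, true positives per patient minus false positives per patient weighted by the odds of the threshold. The function names, the 20% threshold, and the toy data are assumptions for illustration and are not taken from the study above.

```python
def net_benefit(y, p, pt):
    """Net benefit of treating patients whose predicted risk is >= pt."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= pt and yi == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * pt / (1 - pt)


def net_benefit_treat_all(y, pt):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(y) / len(y)
    return prevalence - (1 - prevalence) * pt / (1 - pt)


# Hypothetical toy cohort: 2 events among 10 patients.
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p = [0.6, 0.3, 0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

pt = 0.20
print(net_benefit(y, p, pt))            # model-guided strategy
print(net_benefit_treat_all(y, pt))     # treat-all strategy
# The treat-none strategy has net benefit 0 at every threshold.
```

A decision curve repeats this calculation over a range of `pt` values and plots each strategy; the strategy with the highest net benefit at the clinically relevant thresholds is preferred.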
The predictiveness curve shows the population distribution of risk endowed by a marker or risk prediction model. It provides a means for assessing the model’s capacity for stratifying the population according to risk. Methods for making inference about the predictiveness curve have been developed using cross-sectional or cohort data. Here we consider inference based on case-control studies, which are far more common in practice. We investigate the relationship between the ROC curve and the predictiveness curve. Insights about their relationship provide alternative ROC interpretations for the predictiveness curve and for a previously proposed summary index of it. In turn, the relationship motivates ROC-based methods for estimating the predictiveness curve. An important advantage of these methods over previously proposed methods is that they are rank invariant. In addition, they provide a way of combining information across populations that have similar ROC curves but varying prevalence of the outcome. We apply the methods to PSA, a marker for predicting risk of prostate cancer.
biomarker; classification; predictiveness curve; risk prediction; ROC curve; total gain
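For orientation, the predictiveness curve plots the v-th quantile of predicted risk, R(v), against v, and total gain summarizes it as the area between R(v) and the prevalence line. The sketch below is the simple empirical version for cohort-style data, not the ROC-based case-control estimator proposed in the abstract above. It assumes the predicted risks are well calibrated, so that their mean can stand in for the prevalence; the function names and toy risks are hypothetical.

```python
def predictiveness_curve(risks):
    """Empirical predictiveness curve as (v, R(v)) points.

    v is the cumulative proportion of the population and R(v) is the
    corresponding quantile of predicted risk.
    """
    ordered = sorted(risks)
    n = len(ordered)
    return [((i + 1) / n, r) for i, r in enumerate(ordered)]


def total_gain(risks):
    """Empirical total gain: mean absolute deviation of risk from prevalence.

    Assumption: risks are calibrated, so mean predicted risk is used
    as the prevalence.
    """
    prevalence = sum(risks) / len(risks)
    return sum(abs(r - prevalence) for r in risks) / len(risks)


# Hypothetical predicted risks for four individuals.
risks = [0.5, 0.1, 0.3, 0.1]
for v, r in predictiveness_curve(risks):
    print(f"v={v:.2f}  R(v)={r:.2f}")
print(total_gain(risks))
```

A marker with no predictive value gives a flat curve at the prevalence and a total gain of zero; a steeper curve indicates wider separation of the population into low-risk and high-risk groups.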