The performance of prediction models can be assessed using a variety of different methods and metrics. Traditional measures for binary and survival outcomes include the Brier score to indicate overall model performance, the concordance (or c) statistic for discriminative ability (or area under the receiver operating characteristic (ROC) curve), and goodness-of-fit statistics for calibration.
Several new measures have recently been proposed that can be seen as refinements of discrimination measures, including variants of the c statistic for survival, reclassification tables, net reclassification improvement (NRI), and integrated discrimination improvement (IDI). Moreover, decision–analytic measures have been proposed, including decision curves to plot the net benefit achieved by making decisions based on model predictions.
We aimed to define the role of these relatively novel approaches in the evaluation of the performance of prediction models. For illustration we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer (n=544 for model development, n=273 for external validation).
We suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the predictive model is to be used for making clinical decisions. Other measures of performance may be warranted in specific applications, such as reclassification metrics to gain insight into the value of adding a novel predictor to an established model.
From a research perspective, diagnosis and prognosis constitute a similar challenge: the clinician has some information and wants to know how this relates to the true patient state, whether this can be known currently (diagnosis) or only at some point in the future (prognosis). This information can take various forms, including a diagnostic test, a marker value, or a statistical model including several predictor variables. For most medical applications, the outcome of interest is binary and the information can be expressed as probabilistic predictions 1. Predictions are hence absolute risks, which go beyond assessments of relative risks, such as regression coefficients, odds ratios or hazard ratios 2.
There are various ways to assess the performance of a statistical prediction model. The traditional statistical approach is to quantify how close predictions are to the actual outcome, using measures such as explained variation (e.g. using $R^2$ statistics) and the Brier score 3. Performance can further be quantified in terms of calibration (do close to x of 100 patients with a risk prediction of x% have the outcome?), using e.g. the Hosmer-Lemeshow “goodness-of-fit” test 4. Furthermore, discrimination is essential (do patients who have the outcome have higher risk predictions than those who do not?), which can be quantified with measures such as sensitivity, specificity, and the area under the receiver operating characteristic curve (or concordance statistic, c) 1,5.
Recently, several new measures have been proposed to assess performance of a prediction model. These include variants of the c statistic for survival 6,7, reclassification tables 8, net reclassification improvement (NRI), and integrated discrimination improvement (IDI) 9, which are refinements of discrimination measures. The concept of risk reclassification has caused substantial discussion in the methodological and clinical literature 10,11,12,13,14. Moreover, decision–analytic measures have been proposed, including ‘decision curves’ to plot the net benefit achieved by making decisions based on model predictions 15. These measures have not yet widely been used in practice, which may partly be due to their novelty to applied researchers 16. In this paper, we aim to clarify the role of these relatively novel approaches in the evaluation of the performance of prediction models.
We first briefly discuss prediction models in medicine. Next, we review the properties of a number of traditional and relatively novel measures for the assessment of the performance of an existing prediction model, or extensions to a model. For illustration we present a case study of predicting the presence of residual tumor versus benign tissue in patients with testicular cancer.
We consider prediction models that provide predictions for a dichotomous outcome, since these are most relevant in medical applications. The outcome can be either an underlying diagnosis (e.g. presence of benign or malignant histology in a residual mass after cancer treatment), an outcome occurring within a relatively short time after making the prediction (e.g. 30-day mortality), or a long-term outcome (e.g. 10-year incidence of coronary artery disease, with censored follow-up of some patients).
At model development we aim for at least internally valid predictions, i.e. predictions that are valid for subjects from the underlying population 17. Preferably, the predictions are also generalizable to ‘plausibly related’ populations 18. Various epidemiologic and statistical issues need to be considered in a modeling strategy for empirical data 1,19,20. When a model is developed, it is obvious that we want some quantification of its performance, such that we can judge whether the model is adequate for its purpose, or better than an existing model.
We recognize that a key interest in contemporary medical research is whether a marker (e.g. molecular, genetic, imaging) adds to an existing model. Often, new markers are selected from a large set based on strength of association in a particular study. This poses a high risk of overoptimistic expectations of the marker’s performance 21,22. Moreover, we are only interested in the incremental value of a marker, on top of predictors that are readily accessible. Validation in fully independent, external data is the best way to compare the performance of a model with and without a new marker 21,23.
Prediction models can be useful for several purposes, such as for inclusion criteria or covariate adjustment in a randomized controlled trial 24,25,26. In observational studies, a prediction model may be used for confounder adjustment or case-mix adjustment in comparing outcome between centers 27. We concentrate on the usefulness of a prediction model for medical practice, including public health (e.g. screening for disease) and patient care (diagnosing patients, giving prognostic estimates, decision support).
An important role of prediction models is to inform patients on their prognosis, for example after a cancer diagnosis has been made 28. A natural requirement to a model for this situation is that predictions are well calibrated (or ‘reliable’) 29,30.
A specific situation may be that only limited resources are available, which hence need to be targeted to those with the highest expected benefit, such as those at highest risk. This situation calls for a well discriminating model which separates those at high risk from those at low risk.
Decision support is another important area, including decisions on the need for further diagnostic testing (tests may be burdensome or costly to a patient), and therapy (e.g. surgery with risks of morbidity and mortality) 31. Such decisions are typically binary and require the definition of clinically relevant decision thresholds.
We briefly consider some of the more traditionally used performance measures in medicine, without intending to be comprehensive (Table 1).
The distance between the predicted outcome and actual outcome is central to quantify overall model performance from a statistical modeler’s perspective 32. The distance is Y −Ŷ for continuous outcomes. For binary outcomes, with Y defined 0 – 1, Ŷ is equal to the predicted probability p, and for survival outcomes it is the predicted event probability at a given time (or as a function of time). These distances between observed and predicted outcomes are related to the concept of ‘goodness-of-fit’ of a model, with better models having smaller distances between predicted and observed outcomes. The main difference between goodness-of-fit and predictive performance is that the former is usually evaluated in the same data while assessment of the latter requires either new data or cross-validation.
Explained variation ($R^2$) is the most common performance measure for continuous outcomes. For generalized linear models, Nagelkerke’s $R^2$ is often used 1,33. This is a logarithmic scoring rule. For binary outcomes Y, we score a model with the logarithm of predictions p: $Y \log(p) + (1 - Y) \log(1 - p)$. Nagelkerke’s $R^2$ can also be calculated for survival outcomes, based on the difference in −2 log likelihood of a model without and a model with one or more predictors.
The Brier score is a quadratic scoring rule, where the squared differences between actual binary outcomes Y and predictions p are calculated: $(Y - p)^2$ 34. We can also write this similar to the logarithmic score: $Y (1 - p)^2 + (1 - Y) p^2$. The Brier score for a model can range from 0 for a perfect model to 0.25 for a non-informative model with a 50% incidence of the outcome. When the outcome incidence is lower, the maximum score for a non-informative model is lower, e.g. for 10%: $0.1 \times (1 - 0.1)^2 + (1 - 0.1) \times 0.1^2 = 0.090$. Similar to Nagelkerke’s approach to the LR statistic, we could scale Brier by its maximum score under a non-informative model: $\text{Brier}_{\text{scaled}} = 1 - \text{Brier} / \text{Brier}_{\max}$, where $\text{Brier}_{\max} = \text{mean}(p) \times (1 - \text{mean}(p))$, to let it range between 0% and 100%. This scaled Brier score happens to be very similar to Pearson’s $R^2$ statistic 35.
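The Brier score and its scaled version can be sketched in a few lines of Python. This is a minimal illustration, not the syntax from the paper's Appendix: the function names and the four-patient toy data are assumptions made here, and the maximum score uses mean(p) as in the formula above.

```python
def brier_score(y, p):
    """Mean of (Y - p)^2 over binary outcomes y (0/1) and predicted probabilities p."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def brier_max(mean_p):
    """Brier score of a non-informative model: mean(p) * (1 - mean(p))."""
    return mean_p * (1 - mean_p)


def brier_scaled(y, p):
    """Scaled Brier score: 1 - Brier / Brier_max, with Brier_max based on mean(p)."""
    mean_p = sum(p) / len(p)
    return 1 - brier_score(y, p) / brier_max(mean_p)


# Hypothetical toy data (not from the case study).
y_toy = [1, 1, 0, 0]
p_toy = [0.9, 0.6, 0.4, 0.1]

print(round(brier_score(y_toy, p_toy), 3))   # Brier score for the toy data
print(round(brier_scaled(y_toy, p_toy), 3))  # scaled Brier score for the toy data
print(round(brier_max(0.10), 3))             # maximum score at 10% incidence
```

The last line reproduces the 0.090 maximum quoted in the text for an outcome incidence of 10%.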
Calculation of the Brier score for survival outcomes is possible with a weight function, which considers the conditional probability of being uncensored during time 36,37,3. We can then calculate the Brier score at fixed time points, and create a time-dependent curve. It is useful to use a benchmark curve, based on the Brier score for the overall Kaplan-Meier estimator, which does not consider any predictive information 3. It turns out that overall performance measures comprise two important characteristics of a prediction model, discrimination and calibration, each of which can be assessed separately.
Accurate predictions discriminate between those with and those without the outcome. Several measures can be used to indicate how well we classify patients in a binary prediction problem. The concordance (c) statistic is the most commonly used performance measure to indicate the discriminative ability of generalized linear regression models. For a binary outcome, c is identical to the area under the Receiver Operating Characteristic (ROC) curve, which plots the sensitivity (true positive rate) against 1 – specificity (false positive rate) for consecutive cutoffs for the probability of an outcome.
The c statistic is a rank order statistic for predictions against true outcomes, related to Somers’ D statistic 1. As a rank order statistic, it is insensitive to systematic errors in calibration such as differences in average outcome. A popular extension of the c statistic with censored data can be obtained by ignoring the pairs that cannot be ordered 1. It turns out that this results in a statistic that depends on the censoring pattern. Gonen and Heller have proposed a method to estimate a variant of the c statistic which is independent of censoring, but holds only in the context of a Cox proportional hazards model 7. Furthermore, time-dependent c statistics have been proposed 6,38.
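For a binary outcome without censoring, the c statistic can be sketched as the proportion of pairs (one subject with and one without the outcome) in which the subject with the outcome has the higher prediction. The code below is an illustrative sketch under that definition; counting ties as one half and the toy data are assumptions made here. It also shows the rank-order property: halving all predictions, a gross miscalibration, leaves c unchanged.

```python
def c_statistic(y, p):
    """Concordance for a binary outcome: share of (event, non-event) pairs
    in which the event has the higher prediction; ties count as 1/2."""
    concordant = 0.0
    pairs = 0
    for y_event, p_event in zip(y, p):
        if y_event != 1:
            continue
        for y_non, p_non in zip(y, p):
            if y_non != 0:
                continue
            pairs += 1
            if p_event > p_non:
                concordant += 1.0
            elif p_event == p_non:
                concordant += 0.5
    return concordant / pairs


def somers_d(y, p):
    """Somers' D for a binary outcome, obtained from c as 2 * (c - 0.5)."""
    return 2 * (c_statistic(y, p) - 0.5)


# Hypothetical toy data (not from the case study).
y_toy = [1, 1, 0, 0]
p_toy = [0.9, 0.4, 0.6, 0.1]
p_halved = [pi / 2 for pi in p_toy]  # systematically too low, same ranking

print(c_statistic(y_toy, p_toy))     # concordance on the original predictions
print(c_statistic(y_toy, p_halved))  # identical: c ignores calibration
print(somers_d(y_toy, p_toy))
```

The double loop is quadratic in the number of subjects, which is fine for illustration but would be replaced by a rank-based calculation for large data sets.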
In addition to the c statistic, the discrimination slope can be used as a simple measure for how well subjects with and without the outcome are separated 39. It is calculated as the absolute difference in average predictions for those with and without the outcome. Visualization is readily possible with a box plot or a histogram, which will show less overlap between those with and those without the outcome for a better discriminating model. Extensions of the discrimination slope have not yet been made to the survival context.
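The discrimination slope described above amounts to a difference of two means. A minimal sketch, with a function name and toy data that are assumptions made here:

```python
def discrimination_slope(y, p):
    """Absolute difference in mean prediction between subjects with (y=1)
    and without (y=0) the outcome."""
    with_outcome = [pi for yi, pi in zip(y, p) if yi == 1]
    without_outcome = [pi for yi, pi in zip(y, p) if yi == 0]
    mean_with = sum(with_outcome) / len(with_outcome)
    mean_without = sum(without_outcome) / len(without_outcome)
    return abs(mean_with - mean_without)


# Hypothetical toy data (not from the case study).
y_toy = [1, 1, 0, 0]
p_toy = [0.9, 0.6, 0.4, 0.1]

print(round(discrimination_slope(y_toy, p_toy), 3))
```

Applying the same function to the predictions of two competing models and subtracting the results gives the difference in discrimination slopes.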
Calibration refers to the agreement between observed outcomes and predictions 29. For example, if we predict a 20% risk of residual tumor for a testicular cancer patient, the observed frequency of tumor should be approximately 20 out of 100 patients with such a prediction. A graphical assessment of calibration is possible with predictions on the x-axis, and the outcome on the y-axis. Perfect predictions should be on the 45° line. For linear regression, the calibration plot is a simple scatter plot. For binary outcomes, the plot contains only 0 and 1 values for the y-axis. Smoothing techniques can be used to estimate the observed probabilities of the outcome (p(y=1)) in relation to the predicted probabilities, e.g. using the loess algorithm 1. We may however expect that the specific type of smoothing may affect the graphical impression, especially in smaller data sets. We can also plot results for subjects with similar probabilities, and thus compare the mean predicted probability to the mean observed outcome. For example, we can plot observed outcome by decile of predictions, which makes the plot a graphical illustration of the Hosmer-Lemeshow goodness-of-fit test. A better discriminating model has more spread between such deciles than a poorly discriminating model. We note however that such grouping, though common, is arbitrary and imprecise.
The calibration plot can be characterized by an intercept a, which indicates the extent that predictions are systematically too low or too high (‘calibration-in-the-large’), and a calibration slope b, which should be 1 40. Such a recalibration framework was already proposed by Cox 41. At model development, a=0 and b=1 for regression models. At validation, calibration-in-the-large problems are common, as well as b smaller than 1, reflecting overfitting of a model 1. A value of b smaller than 1 can also be interpreted as reflecting a need for shrinkage of regression coefficients in a prediction model 42,43.
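One common way to estimate these two parameters, assumed here rather than taken from the text, is logistic recalibration on the linear predictor logit(p): the slope b comes from regressing the outcome on logit(p), and calibration-in-the-large a is the intercept of a model with the slope fixed at 1. The sketch below uses plain Newton-Raphson steps and simulated data in which the predictor effects are deliberately too extreme (true slope 0.7); all names and simulation settings are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def calibration_in_the_large(y, p, iterations=25):
    """Intercept a of logit(P(y=1)) = a + logit(p), slope fixed at 1 (offset)."""
    lp = [logit(pi) for pi in p]
    a = 0.0
    for _ in range(iterations):
        mu = [sigmoid(a + x) for x in lp]
        score = sum(yi - mi for yi, mi in zip(y, mu))
        info = sum(mi * (1 - mi) for mi in mu)
        a += score / info  # Newton-Raphson step for a single parameter
    return a


def calibration_slope(y, p, iterations=25):
    """Intercept and slope b of logit(P(y=1)) = a' + b * logit(p)."""
    lp = [logit(pi) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(iterations):
        mu = [sigmoid(a + b * x) for x in lp]
        w = [mi * (1 - mi) for mi in mu]
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * x for yi, mi, x in zip(y, mu, lp))
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, lp))
        h11 = sum(wi * x * x for wi, x in zip(w, lp))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det  # 2x2 Newton-Raphson update
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Simulated validation data (assumption): predictions are overfitted, so the
# true log-odds are only 0.7 times the predicted log-odds.
random.seed(1)
lp_pred = [random.gauss(0.0, 1.5) for _ in range(5000)]
p_sim = [sigmoid(x) for x in lp_pred]
y_sim = [1 if random.random() < sigmoid(0.7 * x) else 0 for x in lp_pred]

a_large = calibration_in_the_large(y_sim, p_sim)
intercept, slope = calibration_slope(y_sim, p_sim)
print(round(a_large, 2), round(slope, 2))
```

In this simulation the estimated slope should fall clearly below 1, which is the pattern the text associates with overfitting and the need for shrinkage.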
We now discuss some relatively novel performance measures, again without pretending to be comprehensive.
Cook proposed to make a ‘reclassification table’ to show how many subjects are reclassified by adding a marker to a model 8. For example, a model with traditional risk factors for cardiovascular disease was extended with the predictors ‘parental history of myocardial infarction’ and ‘CRP’. The increase in c statistic was minimal (from 0.805 to 0.808). However, when they classified the predicted risks into four categories (0–5, 5–10, 10–20, >20 per cent 10-year CVD risk), about 30% of individuals changed category when comparing the extended model with the traditional one. Change in risk categories, however, is insufficient to evaluate improvement in risk stratification; the changes must be appropriate. One way to evaluate this is to compare the observed incidence of events in the cells of the reclassification table to the predicted probability from the original model. Cook proposed a reclassification test as a variant of the Hosmer-Lemeshow statistic within the reclassified categories, leading to a chi-square statistic 44.
Pencina et al extended the reclassification idea by conditioning on the outcome: reclassification of subjects with and without the outcome should be considered separately 9. Any ‘upward’ movement in categories for subjects with the outcome implies improved classification, and any ‘downward movement’ indicates worse reclassification. The interpretation is opposite for subjects without the outcome. The improvement in reclassification was quantified as the sum of differences in proportions of individuals moving up minus the proportion moving down for those with the outcome, and the proportion of individuals moving down minus the proportion moving up for those without the outcome. This sum was labeled the Net Reclassification Improvement (NRI). Also, a measure that integrates the NRI over all possible cut-offs for the probability of the outcome was proposed (integrated discrimination improvement, IDI) 9. The IDI is equivalent to the difference in discrimination slopes of 2 models, and to the difference in Pearson $R^2$ measures 45, or the difference in scaled Brier scores.
Some performance measures imply that false negative and false positive classifications are equally harmful. For example, the calculation of error rates is usually made by classifying subjects as positive when their predicted probability of the outcome exceeds 50%, and as negative otherwise. This implies an equal weighting of false-positive and false-negative classifications.
In the calculation of the NRI, the improvement in sensitivity and the improvement in specificity are summed. This implies relatively more weight for positive outcomes if a positive outcome was less common, and less weight if a positive outcome was more common than a negative outcome. The weight is equal to the non-events odds: (1-mean(p)) / mean(p), where mean(p) is the average probability of a positive outcome. Accordingly, although weighting is not equal, it is not explicitly based on clinical consequences. Defining the best diagnostic test as the one closest to the top left hand corner of the ROC curve – that is, the test with the highest sum of sensitivity and specificity (the Youden index: Se + Sp – 1, 46) – similarly implies weighting by the non-events odds.
Vickers et al proposed decision curve analysis as a simple approach to quantify the clinical usefulness of a prediction model (or an extension to a model) 15. For a formal decision analysis, harms and benefits need to be quantified, leading to an optimal decision threshold 47. It may however often be difficult to define this threshold 15. Difficulties may lie at the population level, i.e. that we do not have sufficient data on harms and benefits. Moreover, the relative weight of harms and benefits may differ from patient to patient, necessitating individual thresholds. Hence, we may consider a range of thresholds for the probability of the outcome, similar to ROC curves that consider the full range of cut-offs rather than a single cut-off for a sensitivity/specificity pair.
The key aspect of decision curve analysis is that a single probability threshold can be used both to categorize patients as positive or negative and to weight false positive and false negative classifications 48. If we assume that the harm of unnecessary treatment (a false-positive decision) is relatively limited – such as antibiotics for infection – the cut-off should be low. In contrast, if overtreatment is quite harmful, such as extensive surgery, we should use a higher cut-off before a treatment decision is made. The harm to benefit ratio hence defines the relative weight w of false-positive decisions to true-positive decisions. For example, a cut-off of 10% implies that FP decisions are valued at 1/9th of a TP decision, and w = 0.11. The performance of a prediction model can then be summarized as a Net Benefit: $\text{NB} = (\text{TP} - w \cdot \text{FP}) / N$, where TP is the number of true positive decisions, FP the number of false positive decisions, N is the total number of patients and w is a weight equal to the odds of the cut-off ($p_t / (1 - p_t)$), or the ratio of harm to benefit 48. Documentation and software for decision curve analysis are publicly available (www.decisioncurveanalysis.org).
We may extend the calibration graph to a validation graph 20. This entails that the distribution of predictions in those with and without the outcome is plotted at the bottom of the graph, capturing information on discrimination, similar to what is shown in a box plot. Moreover, it is important to have 95% confidence intervals around deciles (or other quantiles) of predicted risk to indicate uncertainty in the assessment of validity. From the validation graph we can learn the discriminative ability of a model (e.g. study the spread in observed outcomes by deciles of predicted risks), the calibration (closeness of observed outcomes to the 45 degree line), and the clinical usefulness (how many predictions are above or below clinically relevant thresholds).
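The Net Benefit formula can be turned into a small decision curve calculation. The sketch below is illustrative only and is not the software referred to above: the function names and toy data are assumptions, and treating patients whose prediction is at or above the threshold (rather than strictly above) is a choice made here.

```python
def threshold_weight(threshold):
    """Weight w of a false positive relative to a true positive: odds of the cut-off."""
    return threshold / (1 - threshold)


def net_benefit(y, p, threshold):
    """NB = (TP - w * FP) / N when treating everyone with p >= threshold."""
    w = threshold_weight(threshold)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    return (tp - w * fp) / len(y)


def net_benefit_treat_all(y, threshold):
    """Reference strategy: treat every patient regardless of the prediction."""
    w = threshold_weight(threshold)
    events = sum(y)
    return (events - w * (len(y) - events)) / len(y)


# Hypothetical toy data (not from the case study).
y_toy = [1, 1, 0, 0]
p_toy = [0.9, 0.4, 0.6, 0.1]

print(round(threshold_weight(0.10), 2))  # weight at a 10% cut-off

# A decision curve is the Net Benefit evaluated over a range of thresholds.
for pt in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(pt, round(net_benefit(y_toy, p_toy, pt), 3),
          round(net_benefit_treat_all(y_toy, pt), 3))
```

Plotting the two columns against the threshold, together with the zero line for treating no one, gives the decision curve.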
Men with metastatic non-seminomatous testicular cancer can often be cured nowadays by cisplatin based chemotherapy. After chemotherapy, surgical resection is a generally accepted treatment to remove remnants of the initial metastases, since residual tumor may still be present. In the absence of tumor, resection has no therapeutic benefits, while it is associated with hospital admission, and risks of permanent morbidity and mortality. Logistic regression models were developed to predict the presence of residual tumor, combining well-known predictors, such as the histology of the primary tumor, pre-chemotherapy levels of tumor markers, and (reduction in) residual mass size 49.
We first consider a data set with 544 patients to develop a prediction model that includes 5 predictors (Table 2). We then extend this model with the pre-chemotherapy level of the tumor marker lactate dehydrogenase (LDH). This illustrates ways to assess the incremental value of a marker. LDH values were log transformed, after standardizing by dividing by the local upper levels of normal values, after examination of nonlinearity with restricted cubic spline functions 50. In a later study, we externally validated the 5 predictor model in 273 patients from a tertiary referral center, where LDH was not recorded 51. This illustrates ways to assess the usefulness of a model in a new setting.
A clinically relevant cut-off for the risk of tumor was based on a decision analysis, where estimates from literature and from experts in the field were used to formally weigh the harms of missing tumor against the benefits of resection in those with tumor 52. This analysis indicated that a risk threshold of 20% would be clinically reasonable.
Adding LDH to the 5 predictor model increased the model chi-square from 187 to 212 (LR statistic 25, p<0.001) in the development data set. LDH hence had statistically significant additional predictive value. Overall performance improved: Nagelkerke’s $R^2$ increased from 39% to 43%, and the Brier score decreased from 0.17 to 0.16 (Table 3). The discriminative ability showed a small increase (c rose from 0.82 to 0.84, Fig 1). Similarly, the discrimination slope increased from 0.30 to 0.34 (Fig 2). The IDI hence was 4%.
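The reported Nagelkerke values can be checked from the model chi-squares with the standard Nagelkerke formula, which rescales the Cox-Snell measure by its maximum. The sketch assumes that the model chi-square is the likelihood ratio statistic against an intercept-only model and uses the 299 patients with tumor among the 544 in the development set; the function name is hypothetical.

```python
import math


def nagelkerke_r2(lr_statistic, n, n_events):
    """Nagelkerke's R^2 for a binary outcome from the model LR statistic.

    Cox-Snell R^2 = 1 - exp(-LR / n), divided by its maximum
    1 - exp(2 * LL_null / n), where LL_null is the log likelihood of the
    intercept-only model.
    """
    incidence = n_events / n
    ll_null = n_events * math.log(incidence) + (n - n_events) * math.log(1 - incidence)
    cox_snell = 1 - math.exp(-lr_statistic / n)
    maximum = 1 - math.exp(2 * ll_null / n)
    return cox_snell / maximum


r2_without_ldh = nagelkerke_r2(187, 544, 299)
r2_with_ldh = nagelkerke_r2(212, 544, 299)
print(round(r2_without_ldh, 2), round(r2_with_ldh, 2))
```

Under these assumptions the two values round to the 39% and 43% given in the text.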
Using a cut-off of 20% for the risk of tumor led to classification of 465 and 469 patients as at high risk for residual tumor with the original and extended models respectively (Table 4). The extended model reclassified 19 of the 465 patients as low risk (4%). On the other hand, 23 of 79 were reclassified as high risk while initially classified as low risk (29%). The total reclassification was hence 7.7% (42/544). Based on the observed proportions, those who were reclassified were placed into more appropriate categories. Cook’s reclassification test was statistically significant (p=0.030), comparing predictions from the original model with observed outcomes in the 4 cells of Table 4. A more detailed assessment of the reclassification is obtained by a scatter plot with symbols by outcome (tumor or necrosis, Fig 3). We note especially that some patients with necrosis have higher predicted risks according to the model without LDH than according to the model with LDH (circles in right lower corner of the graph). The improvement in reclassification for those with tumor was 1.7% ((8−3)/299), and for those with necrosis 0.4% ((16−15)/245). The NRI hence was 2.1% [95% CI −2.9 to +7.0%], which is a much lower percentage than the 7.7% for all reclassified patients. The IDI was already estimated from the discrimination slopes in Fig 2 as 4%.
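The reclassification arithmetic in this paragraph can be retraced with a short calculation. The counts are those quoted in the text (8 patients with tumor moved up and 3 moved down; 16 patients with necrosis moved down and 15 moved up); the function and variable names are assumptions made for illustration.

```python
def net_reclassification_improvement(up_events, down_events, n_events,
                                     up_nonevents, down_nonevents, n_nonevents):
    """NRI = (up - down) / n among events + (down - up) / n among non-events."""
    gain_events = (up_events - down_events) / n_events
    gain_nonevents = (down_nonevents - up_nonevents) / n_nonevents
    return gain_events, gain_nonevents, gain_events + gain_nonevents


# Counts from the case study at the 20% threshold.
gain_tumor, gain_necrosis, nri = net_reclassification_improvement(
    up_events=8, down_events=3, n_events=299,
    up_nonevents=15, down_nonevents=16, n_nonevents=245,
)
total_reclassified = (19 + 23) / 544  # all patients who changed category

print(round(100 * gain_tumor, 1))          # % improvement among those with tumor
print(round(100 * gain_necrosis, 1))       # % improvement among those with necrosis
print(round(100 * nri, 1))                 # NRI in %
print(round(100 * total_reclassified, 1))  # total reclassification in %
```

The contrast between the last two numbers shows why the share of reclassified patients overstates the improvement: most movements in one direction are offset by movements in the other.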
A cut-off of 20% implies a relative weight of 1:4 for false-positive decisions against true-positive decisions. For the model without LDH, the Net Benefit was (TP − w*FP)/N = (284 − 0.25*(465−284))/544 = 0.439. If we performed resection in all patients, the NB would however be similar: (299 − 0.25*(544−299))/544 = 0.437. The model with LDH has a better NB: (289 − 0.25*(469−289))/544 = 0.449. Hence, at this particular cut-off, the model with LDH would be expected to lead to 1 more mass with tumor being resected per 100 patients at the same number of unnecessary resections of necrosis. The decision curve shows that the NB would be much larger for higher threshold values (Fig 4), i.e. patients accepting higher risks of residual tumor.
Overall model performance in the new cohort of 273 patients (197 with residual tumor) was less than at development, according to $R^2$ and scaled Brier scores (25% instead of 39% and 20% instead of 30% respectively). Also, the c statistic and discrimination slope were poorer. Calibration was on average correct (calibration-in-the-large coefficient close to zero), but the effects of predictors were on average smaller in the new setting (calibration slope 0.74). The Hosmer-Lemeshow test was of borderline significance. The Net Benefit was close to zero, which was explained by the fact that very few patients had predicted risks below 20% and that calibration was imperfect around this threshold (Figs 2 and 5).
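The three Net Benefit values can be recomputed directly from the counts given in the text (544 patients, 299 with tumor, 465 and 469 classified as high risk, 284 and 289 of them with tumor). The helper name is an assumption made for illustration.

```python
def net_benefit_from_counts(tp, fp, n, threshold):
    """NB = (TP - w * FP) / N with w the odds of the threshold."""
    w = threshold / (1 - threshold)
    return (tp - w * fp) / n


N = 544          # development set
TUMOR = 299      # patients with residual tumor
THRESHOLD = 0.20

nb_without_ldh = net_benefit_from_counts(284, 465 - 284, N, THRESHOLD)
nb_resect_all = net_benefit_from_counts(TUMOR, N - TUMOR, N, THRESHOLD)
nb_with_ldh = net_benefit_from_counts(289, 469 - 289, N, THRESHOLD)

print(round(nb_without_ldh, 3), round(nb_resect_all, 3), round(nb_with_ldh, 3))

# Difference in net true positives per 100 patients between the two models.
print(round(100 * (nb_with_ldh - nb_without_ldh)))
```

The difference of roughly 0.01 in Net Benefit is what translates into about 1 additional resected tumor per 100 patients at the same number of unnecessary resections.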
All analyses were done in R version 2.8.1 (R Foundation for Statistical Computing, Vienna, Austria), using the Design library. The syntax is provided in the Appendix.
This paper provided a framework for a number of traditional and relatively novel measures to assess the performance of an existing prediction model, or extensions to a model. Some measures relate to the evaluation of the quality of predictions, including overall performance measures such as explained variation and the Brier score, and measures for discrimination and calibration. Other measures quantify the quality of decisions, including decision-analytic measures such as the Net Benefit and decision curves, and measures related to reclassification tables (NRI, IDI).
Having a well discriminating model will commonly be most relevant for research purposes, such as covariate adjustment in a RCT. But a well discriminating model (e.g. c = 0.8) may be useless if the decision threshold for clinical decisions is outside the range of predictions provided by the model. And a poorly discriminating model (e.g. c = 0.6) may be clinically useful if the clinical decision is close to a “toss up” 53. This implies that the threshold is right in the middle of the distribution of predicted risks, which is for example the case for models in fertility medicine 54. For clinical practice, providing insight beyond the c statistic has been a motivation for some recent measures, especially in the context of extension of a prediction model with additional predictive information, e.g. from a biomarker 8,9,45. Many measures provide numerical summaries that may be difficult to interpret (see e.g. Table 3).
Evaluation of calibration is important if model predictions are used to inform patients or physicians to make decisions. The widely used Hosmer-Lemeshow test has a number of drawbacks, including limited power and poor interpretability 1,55. Instead, the recalibration parameters as proposed by Cox (intercept and calibration slope) are more informative 41. Validation plots with the distribution of risks for those with and without the outcome provide a useful graphical depiction, in line with previous proposals 45.
The net benefit, with visualization in a decision curve, is a simple summary measure to quantify clinical usefulness when decisions are to be supported by a prediction model 15. We recognize however that other measures may give additional insights instead of providing a single summary measure. If a threshold is clinically well accepted, such as the 10% and 20% 10-year risk thresholds for cardiovascular events, reclassification tables and their associated measures may be particularly useful. For example, Table 4 clearly illustrates that adding LDH places a few more subjects with tumor in the high risk category (289/299=97% instead of 284/299=95%) and one fewer subject without tumor in the high risk category (180/245=73% instead of 181/245=74%). This illustrates that key information for comparing performances of two models is contained in the margins of the reclassification tables 12.
In sum, we suggest that reporting discrimination and calibration will always be important for a prediction model. Decision-analytic measures should be reported if the predictive model is to be used for making clinical decisions. Other measures of performance may be warranted in specific applications, such as reclassification metrics to gain insight into the value of adding a novel predictor to an established model.
A key issue in the evaluation of the quality of decisions is that false-positive and false-negative decisions will usually have quite different weight in medicine. Using equal weights for false-positive and false-negative decisions is ‘absurd’ in many medical applications 56. Several measures of clinical usefulness have been proposed before which are consistent with decision-analytic considerations 48,31,57,58,59,60.
We recognize that binary decisions can fully be evaluated in a ROC plot. The plot may however be obsolete unless the predicted probabilities at the operating points are indicated. Optimal thresholds can be defined by the tangent line to the curve, defined by the incidence of the outcome and the relative weight of false-positive and false-negative decisions 58. If a prediction model is perfectly calibrated, the optimal threshold in the curve corresponds to the threshold probability in the Net Benefit analysis. The tangent is a 45 degree line if the outcome incidence is 50% and false-positive and false-negative decisions are weighted equally. We consider the Net Benefit and related decision curves preferable to graphical ROC curve assessment in the context of prediction models, although these approaches are obviously related 59.
Most performance measures can also be calculated for survival outcomes, which pose the challenge of dealing with censored observations. Naïve calculation of ROC curves for censored observations can be misleading, since some of the censored observations would have had events if follow-up were longer. Also, the weight of false-positive and false-negative decisions may change with the follow-up time considered. Another issue is to consider competing risks in survival analyses of non-fatal outcomes, such as failure of heart valves 61, or mortality due to different causes 62. Disregarding competing risks often leads to overestimation of absolute risk 63.
Any performance measure should be estimated with correction for optimism, as can e.g. be achieved with cross-validation or bootstrap resampling. To determine generalizability to other, plausibly related, settings, an external validation data set of sufficient size is required 18. Some statistical updating may then be necessary for parameters in the model 64. After repeated validation under different circumstances, an analysis of the impact of using a model for decision support should follow, which requires formulation of a model as a simple decision rule 65.
We have tried to sketch a framework for performance evaluation of predictions and decisions based on prediction models, both for newly developed or existing models, and for the situation of assessing the incremental value of a predictor such as a biomarker. Many more measures are available than discussed in this paper, which may have specific value in specific circumstances. The novel measures on reclassification and clinical usefulness can provide valuable additional insight on the value of prediction models and extensions to models, which goes beyond traditional measures of calibration and discrimination.
This paper was based on discussions at an international symposium “Measuring the accuracy of prediction models” (Cleveland, OH, Sept 29, 2008, http://www.bio.ri.ccf.org/html/symposium.html), which was supported by the Cleveland Clinic Department of Quantitative Health Sciences and the Page Foundation. We thank Dr Margaret Pepe and Jessie Gu (University of Washington, Seattle, WA) for their critical review and helpful comments, as well as two anonymous reviewers.