The simulation results comparing the correction methods for decision curve net benefits are shown in Table . For a threshold probability of 15%, the uncorrected estimate was over-optimistic in all scenarios; all correction methods gave an estimate lower than the best estimate of net benefit; repeated 10-fold cross-validation had the least bias in all but the scenario with 100 events, where the bootstrap estimate had slightly lower bias (-0.0001 vs. -0.0005). Similar results were obtained for threshold probabilities of 25% and 35%. For threshold probabilities of 60% and 80%, the bootstrap method had the least bias. The variability of the bootstrap and repeated 2-fold cross-validation methods was similar; however, the repeated 10-fold cross-validation method tended to have slightly less variability.
Simulation results for correction for over-fit.
A comparison of corrected net benefits from the bootstrap and 10-fold cross-validation is shown in Table . In all comparisons at all threshold probabilities except 60% and 80%, the absolute difference in the corrected estimates was less than 0.005, with a relative difference in net benefit of <6% (calculated as the difference in net benefit divided by the best estimate of net benefit). The 60% and 80% thresholds are near the tail of the decision curve and are subject to excess random noise. The properties of the decision curve near these thresholds are of minor interest because few men would require a ≥ 60% probability of cancer before they would accept biopsy; the superior properties of the bootstrap at these thresholds are therefore of little value. To examine the preferred correction method further, we plotted sample decision curves with correction for overfit from a data set with 30 events (Figures 3 and 4). One immediate attraction of repeated 10-fold cross-validation is that it has a smoothing effect on the decision curve. The curve remains unstable at very high threshold probabilities; however, these are rarely encountered in clinical practice (we rarely consider unnecessary treatment to be, say, 20 times worse than untreated disease). We therefore recommend repeated 10-fold cross-validation as a method to correct decision curves created using the same data as were used to generate the model.
Simulation results for correction for over-fit: "best estimate" of net benefit minus net benefit after correction.
Figure 3 Decision curve for a model predicting the outcome of prostate biopsy, with correction for overfit by cross-validation. The thin grey line is the net benefit of biopsying all men; the dashed black line is the net benefit of biopsying men on the basis of the statistical model.
Figure 4 Decision curve for a model predicting the outcome of prostate biopsy, with correction for overfit by bootstrap. The thin grey line is the net benefit of biopsying all men; the dashed black line is the net benefit of biopsying men on the basis of the statistical model.
That said, we saw very little optimism where the number of events per variable was greater than 20, and thus do not see a strong justification for correcting decision curves for overfit where studies are of sufficient size. This will likely be the case for the sort of studies typically appropriate for decision curve analysis: we do not analyze small, preliminary studies to determine whether a test, marker or model would be of clinical benefit; evaluation of clinical effects is normally reserved for larger and more robust data sets.
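Where correction is warranted, the repeated 10-fold cross-validation correction recommended above can be sketched as follows. This is a minimal illustration, not the authors' published code: the function names are ours, and the simple "binned" model stands in for whatever statistical model is actually being evaluated.

```python
import random

def net_benefit(y, p_hat, pt):
    """Net benefit at threshold pt: TP/n - FP/n * pt/(1 - pt)."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p_hat) if pi >= pt and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p_hat) if pi >= pt and yi == 0)
    return tp / n - fp / n * pt / (1 - pt)

def repeated_cv_net_benefit(x, y, pt, fit, k=10, repeats=20, seed=1):
    """Average net benefit over `repeats` runs of k-fold cross-validation.
    Within each run, every subject's prediction comes from a model fit
    on the folds that exclude that subject, which corrects for overfit."""
    rng = random.Random(seed)
    n = len(y)
    totals = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        p_hat = [0.0] * n
        for j in range(k):
            fold = set(idx[j::k])
            train = [i for i in range(n) if i not in fold]
            model = fit([x[i] for i in train], [y[i] for i in train])
            for i in fold:
                p_hat[i] = model(x[i])
        totals.append(net_benefit(y, p_hat, pt))
    return sum(totals) / repeats

# Toy stand-in "model": event rate above/below the training median of x.
def fit_binned(xs, ys):
    cut = sorted(xs)[len(xs) // 2]
    hi = [yi for xi, yi in zip(xs, ys) if xi >= cut]
    lo = [yi for xi, yi in zip(xs, ys) if xi < cut]
    p_hi = sum(hi) / max(len(hi), 1)
    p_lo = sum(lo) / max(len(lo), 1)
    return lambda xi: p_hi if xi >= cut else p_lo
```

Averaging over repeats is what produces the smoothing effect noted above: a single split of the data into folds is arbitrary, and repeating with fresh random splits averages that noise away.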
Confidence intervals for net benefits
A decision curve plot will have at least two curves and a straight line, and there will be many areas in which the curves overlap or are very close. Adding confidence bands to a plot is therefore likely to lead to a confusing graph that is difficult to interpret. Accordingly, the best way to present confidence intervals for a decision curve analysis is, first, to choose a limited number of key thresholds and, second, to report the 95% C.I. for the difference in net benefit for pairwise comparisons between models at each of these thresholds.
We propose bootstrap methods, which are widely used and simple to implement, to obtain confidence intervals for the net benefit at a particular threshold.
1. Choose a limited number of threshold probabilities.
2. Sample with replacement from the data set.
3. Fit the models of interest and compute the net benefits at the threshold probabilities specified in (1) using the sample in (2).
4. Repeat steps (2) to (3) n times (we recommend n ≥ 1000). The 95% confidence interval for the net benefit is given by the 2.5th – 97.5th percentiles across n replications.
It may be of interest to obtain the confidence interval for the difference in net benefit for two treatment strategies, for example, treating according to a model vs. treating all patients. In this case, in step 3 the difference in net benefit of those two treatment strategies should be computed.
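The procedure above can be sketched as follows. This is an illustrative implementation with our own function names; for brevity, the model's predictions are resampled together with the outcomes, whereas step (3) strictly calls for refitting the model on each bootstrap sample.

```python
import random

def net_benefit(rows, pt):
    """rows: (event, predicted probability) pairs; TP/n - FP/n * pt/(1 - pt)."""
    n = len(rows)
    tp = sum(1 for y, p in rows if p >= pt and y == 1)
    fp = sum(1 for y, p in rows if p >= pt and y == 0)
    return tp / n - fp / n * pt / (1 - pt)

def bootstrap_ci(rows, stat, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap: resample rows with replacement (step 2),
    recompute `stat` on each resample (step 3), and take the 2.5th and
    97.5th percentiles across the n_boot replications (step 4)."""
    rng = random.Random(seed)
    n = len(rows)
    stats = sorted(stat([rows[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[min(int(n_boot * (1 - alpha / 2)), n_boot - 1)]
    return lo, hi

# Difference in net benefit: model vs. treating all. Treating all is
# equivalent to classifying every patient positive, i.e. p = 1 for all.
def nb_difference(rows, pt=0.25):
    treat_all = [(y, 1.0) for y, _ in rows]
    return net_benefit(rows, pt) - net_benefit(treat_all, pt)
```

Passing `nb_difference` as the `stat` argument yields the confidence interval for the difference between two treatment strategies, as described above.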
Logistic regression was used to estimate the predicted probability of a prostate cancer diagnosis. We fit one model with total PSA as the only predictor (the base model) and another model with total PSA, free-to-total PSA ratio, age and digital rectal exam result as the predictors (the full model). We used bootstrap methods to compute the confidence interval for three strategies: treat all patients, treat according to the base model, and treat according to the full model. We also computed the confidence interval around the difference in net benefit for the full model vs. treating all and full model vs. the base model.
We obtained confidence intervals for the net benefits associated with threshold probabilities of 15, 25, 35, 60, and 80% using bootstrap methods with 2000 replications (Table ). Shown are the point estimates of the net benefit for the three treatment strategies and the differences for full vs. base and full vs. all. The lower bound of the confidence interval for the full model indicates a net benefit superior to both the base model and treating all for all threshold probabilities evaluated except 80%. We might therefore consider the value of the full model confirmed over the entire range of threshold probabilities that a man would typically require for a prostate biopsy.
Confidence intervals for the net benefits using bootstrap methods.
Application of decision curve analysis to censored data
Calculation of net benefit for a decision curve requires an estimate of the rate of true and false positives. For survival time data, this requires that survival time be converted to a binary endpoint at a prespecified landmark time, for example, whether the patient was alive at five years. However, survival data are typically subject to censoring: a man who entered a study, say, three years before the analysis was conducted and was alive at that time is "censored" because we know he lived more than three years, but not how much longer.
One solution is to exclude patients who were event free at last follow-up but whose follow-up time is less than the landmark time. This approach has two problems. First, informative data are removed from the analysis: a patient censored at 4 years and 11 months most likely survived to 5 years, yet is treated in the analysis identically to a patient followed for only one month. Second, removing censored patients from the analysis increases the apparent prevalence of the event, because patients followed for less than 5 years are counted if they die but not if they survive. Changing the prevalence matters because it affects the proportion of true and false positives, and therefore the net benefit.
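A stylized simulation illustrates the second problem. The rates are hypothetical, and we simplify by assuming all deaths before the landmark are observed:

```python
import random

# Hypothetical illustration: true 5-year death rate is 30%; 40% of
# event-free patients have follow-up shorter than 5 years. Excluding
# event-free patients with short follow-up inflates observed prevalence.
rng = random.Random(0)
kept_events = kept_total = 0
for _ in range(100_000):
    died = rng.random() < 0.30            # event by 5 years
    short_followup = rng.random() < 0.40  # follow-up ends before 5 years
    if died:
        kept_events += 1                  # deaths are counted regardless
        kept_total += 1
    elif not short_followup:
        kept_total += 1                   # survivor with full follow-up
    # event-free patients with short follow-up are dropped entirely
print(kept_events / kept_total)  # ≈ 0.42 rather than the true 0.30
```

The analytic value here is 0.30 / (0.30 + 0.70 × 0.60) ≈ 0.417: prevalence is inflated by roughly two-fifths simply by the exclusion rule.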
To calculate the net benefit for survival time data subject to censoring, we first define x = 1 if the patient has a predicted probability from the model ≥ pt (the threshold probability) and x = 0 otherwise; s(t) is the Kaplan-Meier survival probability at our chosen landmark time t, and n is the number of subjects in the data set. Using methods similar to Begg et al [11], the number of true positives is given by [1 − (s(t) | x = 1)] × P(x = 1) × n and the number of false positives by (s(t) | x = 1) × P(x = 1) × n. Naturally, one assumption of the method is that the mechanism of censoring is independent of the predictors used to create the model.
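Under these definitions, the calculation can be sketched as follows. This is an illustrative implementation, not the authors' code; the function names are ours, and true and false positives are expressed directly as proportions of n.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of S(t) from (time, event-indicator) data,
    where event = 1 for an observed event and 0 for censoring."""
    pts = sorted(zip(times, events))
    at_risk = len(pts)
    s = 1.0
    i = 0
    while i < len(pts) and pts[i][0] <= t:
        j, deaths = i, 0
        while j < len(pts) and pts[j][0] == pts[i][0]:
            deaths += pts[j][1]           # group tied event times
            j += 1
        if at_risk > 0:
            s *= 1 - deaths / at_risk
        at_risk -= j - i                  # remove deaths and censorings
        i = j
    return s

def net_benefit_censored(times, events, x, t, pt):
    """Net benefit at landmark t for censored data:
    TP = [1 - s(t | x=1)] * P(x=1) and FP = s(t | x=1) * P(x=1),
    both already divided by n."""
    n = len(x)
    pos = [i for i in range(n) if x[i] == 1]
    if not pos:
        return 0.0
    s_pos = km_survival([times[i] for i in pos],
                        [events[i] for i in pos], t)
    p_pos = len(pos) / n
    return (1 - s_pos) * p_pos - s_pos * p_pos * pt / (1 - pt)
```

Because the Kaplan-Meier estimate uses the partial follow-up of censored patients, neither of the problems of the exclusion approach arises.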
Heagerty et al [12] point out that this method can, in some instances, result in a non-monotone relationship between the predicted probability from the model and sensitivity or specificity. Yet there is no requirement that a decision curve be monotone in pt: there is no inherent contradiction in having net benefit increase above some pt and then decrease at some higher pt. Indeed, this is often what is seen in the right-hand tail of the decision curve, where there is a relatively limited number of cases and the curve is subject to excess sampling variation. Nonetheless, the rationale for decision curve analysis is to evaluate the clinical effects of a test, model or marker. Studies aiming to affect clinical practice tend to be large, and should be well populated across the threshold probabilities of interest. As such, we should expect the important parts of the decision curve to be monotone.
In time-to-event analyses where the failure event is something other than death, it is often important to consider the effects of competing risks. A competing risk is any event a subject could experience that would alter the likelihood of having the event of interest. The most common competing risk is death before the event of interest (such as recurrence of cancer), since a subject cannot experience the event of interest after death. In the presence of competing risks, the cumulative incidence function, which takes into account the probability of having the competing risk event, can be used to estimate the probability of having the event of interest [13]. To calculate the net benefit in the presence of competing risks, we denote the cumulative incidence of the event of interest by time t as I(t). The number of true positives is given by (I(t) | x = 1) × P(x = 1) × n and the number of false positives by [1 − (I(t) | x = 1)] × P(x = 1) × n. That is, we use the same formula as in the absence of competing risks, but with the estimate from the cumulative incidence function in place of the Kaplan-Meier estimate. The probability of the event calculated using Kaplan-Meier methods is known to be generally higher than when competing risks are taken into account [14]. We therefore expect that, in general, net benefit will be lower when competing risks are taken into account.
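The competing-risks version simply substitutes a cumulative incidence estimate for 1 − s(t). A sketch, illustrative only, using an Aalen-Johansen-type estimator with our own function names:

```python
def cumulative_incidence(times, causes, t, cause=1):
    """Cumulative incidence of `cause` by time t in the presence of
    competing risks (Aalen-Johansen-type estimate).
    causes: 0 = censored, 1 = event of interest, 2 = competing event."""
    pts = sorted(zip(times, causes))
    at_risk = len(pts)
    surv = 1.0    # probability of being free of *any* event just before t_i
    inc = 0.0
    i = 0
    while i < len(pts) and pts[i][0] <= t:
        j, d_cause, d_any = i, 0, 0
        while j < len(pts) and pts[j][0] == pts[i][0]:
            d_any += pts[j][1] != 0       # events of any cause at this time
            d_cause += pts[j][1] == cause
            j += 1
        if at_risk > 0:
            inc += surv * d_cause / at_risk
            surv *= 1 - d_any / at_risk
        at_risk -= j - i
        i = j
    return inc

def net_benefit_competing(times, causes, x, t, pt):
    """TP = (I(t) | x=1) * P(x=1), FP = [1 - (I(t) | x=1)] * P(x=1),
    both expressed as proportions of n."""
    n = len(x)
    pos = [i for i in range(n) if x[i] == 1]
    if not pos:
        return 0.0
    i_pos = cumulative_incidence([times[i] for i in pos],
                                 [causes[i] for i in pos], t)
    p_pos = len(pos) / n
    return i_pos * p_pos - (1 - i_pos) * p_pos * pt / (1 - pt)
```

With no competing events in the data, `cumulative_incidence` reduces to 1 minus the Kaplan-Meier estimate, so the two net benefit calculations agree.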
Simulation study without competing risks
We conducted a simulation study with 2000 replications to check the method for computing the net benefit for survival time data in the absence of competing risks. We simulated data with 5000 subjects, created a binary predictor x (1 if positive and 0 if negative), and generated an event time Ti for each subject i such that Ti was related to x. We then generated a uniform censoring time Ci for each subject i and defined the observed time for subject i as the minimum of Ti and Ci, denoted by Yi. We determined the coverage of the method described above for a given time t and for threshold probabilities of 15, 30, and 60%. Coverage was defined as the proportion of 95% confidence intervals, calculated using the bootstrap methods described above, that contained the true net benefit. To obtain the true net benefit, we simulated data in the same way but with an arbitrarily large data set and Yi equal to Ti. In the absence of censoring, the true positives are subjects with x = 1 and Ti < t, and the false positives are subjects with x = 1 and Ti > t. We conducted simulations for three time-points t and three relationships between the predictor and event: the predictor equally sensitive and specific, the predictor more specific, and the predictor more sensitive. Approximately 10%, 20%, and 30% of patients were censored before time-points 1, 2, and 3, respectively.
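The data-generating step can be sketched as follows. The hazard rates and the censoring window are our own illustrative choices, not the values used in the simulation study:

```python
import random

def simulate(n, seed=0, c_max=10.0):
    """Generate (x, observed time Yi, event indicator) triples: event time
    Ti is exponential with a hazard that depends on the binary predictor x,
    censoring time Ci is uniform on (0, c_max), and Yi = min(Ti, Ci)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random() < 0.5
        hazard = 0.30 if x else 0.10   # illustrative rates only
        ti = rng.expovariate(hazard)
        ci = rng.uniform(0.0, c_max)
        rows.append((int(x), min(ti, ci), int(ti <= ci)))
    return rows

rows = simulate(5000)
censored_fraction = sum(1 for _, _, e in rows if e == 0) / len(rows)
```

Varying `c_max` (or the landmark time t) tunes the fraction of subjects censored before each time-point, which is how censoring rates such as 10%, 20%, and 30% can be arranged.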
Results of simulation study without competing risks
Results of the simulation study where the predictor was equally sensitive and specific are given in Table . For all scenarios, there was little bias and coverage was excellent. For example, for a threshold probability of 15%, a predictor equally sensitive and specific to the event, and evaluation at time-point 1, the true net benefit was 0.0185 and the mean net benefit over 2000 replications was 0.0186. Similar results were obtained for the simulations where the predictor was more specific and where it was more sensitive (data not shown).
Simulation results for a survival-time endpoint.
A decision curve from a survival time data set with 30% censoring is shown in Figure 5. To create this figure, we used an uncensored survival time data set, created a binary outcome for survival at t, and calculated the net benefit for binary data. We then applied censoring as described above and calculated a second decision curve using the net benefit for censored data. The two curves are essentially overlapping, suggesting good properties of our method.
Figure 5 Decision curves for survival time data. The thick grey line is the net benefit for a strategy of treating all men; the thick black line is the net benefit of treating no men. A thin grey line is calculated from an uncensored data set for a binary variable.
Data sets with censoring and competing risks
We used two previously studied data sets to demonstrate the effects of competing risks on decision curve analysis. The first data set contained 4462 bladder cancer patients who underwent radical cystectomy [15]. The event of interest was recurrence (1068 events). Since bladder cancer patients tend to have significant comorbid conditions, 846 patients died from other causes without recurrence; death without recurrence was considered the competing risk event. We calculated the decision curve for a multivariable prediction model (the "bladder nomogram") with and without adjustment for competing risks [15]. Age is one of the predictors in the model and is strongly associated with death from other causes. We therefore expected the decision curves with and without adjustment for competing risks to differ.
The second data set contained 7765 prostate cancer patients treated by radical prostatectomy [16]. As in the bladder cancer data set, the event of interest was recurrence and the competing risk event was death from other causes without recurrence. Prostate cancer patients tend to be in otherwise good health: only 368 patients died without recurrence, while 1256 patients recurred. We calculated the decision curve for a multivariable model including PSA, stage, and grade. As the competing risk was rare, and the predictors for recurrence were unassociated with the competing risk of death, we expected the decision curves with and without adjustment for competing risk to be very similar.
Results for survival time data with competing risks
Decision curves with and without adjustment for competing risk are shown in Figures 6 and 7. In the bladder cancer example – where the incidence of the competing risk is high and the predictor is associated with the competing risk – we see, as expected, that adjustment for competing risk lowers the net benefit both for the model and for the strategy of "treat all". However, decisions about the value of the model are unlikely to be affected, because the model remains of value over a wide range of threshold probabilities. In the prostate cancer example – where the incidence of the competing risk is low and the predictor is unassociated with the competing risk – the decision curves with and without adjustment for competing risk are essentially overlapping.
Figure 6 Decision curve for survival time data with and without adjustment for competing risk, where the incidence of competing risks is high (bladder cancer data set). The thick grey line is the net benefit for a strategy of treating all patients, with (dashed line) and without (solid line) adjustment for competing risk.
Figure 7 Decision curve for survival time data with and without adjustment for competing risk, where the incidence of competing risks is low (prostate cancer data set). The thick grey line is the net benefit for a strategy of treating all men, with (dashed line) and without (solid line) adjustment for competing risk.
Application of decision curve analysis when outcome or predictor data are not available
We may sometimes want to calculate decision curves in the absence of outcome data. For example, a statistical model is published in the literature and shown to be well calibrated. An investigator wishes to know whether application of the model would be of clinical benefit, either because this was not reported by the original authors, or because the investigator believes that the properties of the model may differ in the population of interest, since the distribution of predictors can vary between populations. Meanwhile, the model predicts some future event, such as cancer incidence or recurrence, and the investigator's data set is relatively immature, with few patients followed for a sufficient period of time.
Alternatively, we may wish to calculate a decision curve in the absence of predictors. This would occur in a case-control study, where predictors are only measured on a proportion of patients without the event.
If a model is well calibrated, that is, if close to x% of a sample of patients with a predicted risk of x% have the event, true and false positives can be calculated directly from the predicted probabilities. Using p̂i as the predicted probability for the ith patient, where m > 0 patients have p̂i ≥ pt, net benefit is calculated as:

net benefit = (Σi p̂i)/n − [(m − Σi p̂i)/n] × pt/(1 − pt)

where the sum is taken over the m patients with p̂i ≥ pt. That is, the expected number of true positives among the patients above the threshold is the sum of their predicted risks, and the expected number of false positives is the remainder, m minus that sum.
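In code, the calculation from predicted probabilities alone can be sketched as follows (illustrative only; the function name is ours):

```python
def net_benefit_from_probs(p_hat, pt):
    """Net benefit computed from predicted probabilities alone, assuming
    the model is well calibrated: among the m patients with p >= pt, the
    expected number of true positives is sum(p) and the expected number
    of false positives is m - sum(p)."""
    n = len(p_hat)
    flagged = [p for p in p_hat if p >= pt]
    tp = sum(flagged)                 # expected true positives
    fp = len(flagged) - tp            # expected false positives
    return tp / n - fp / n * pt / (1 - pt)

# Example: three patients with predicted risks 0.9, 0.6, 0.2 at pt = 0.5
print(net_benefit_from_probs([0.9, 0.6, 0.2], 0.5))  # ≈ 0.333
```

No outcome data enter the calculation, which is why the resulting curve is free of sampling noise.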
A decision curve for the principal example, calculated using this formulation rather than outcome data, is given in Figure . The curve is not subject to sampling noise and so has a smooth shape.
Decision curve for complete data set calculated directly from predicted probabilities.
Statistical code for decision curve analysis
We have written statistical code to implement decision curve analysis and its extensions. In Stata, we have created two commands: dca for a binary outcome and stdca for a survival-time outcome; corresponding Stata ado and help files are available for both commands. For dca, the user inputs a binary outcome variable and one or more predictor variables. Within the command, the user has the option to plot the decision curve or save the points of the decision curve to a Stata data file. To calculate a decision curve in the absence of outcome data, the user specifies the predicted probability from the model as both the outcome and the predictor variable. For stdca, the user inputs the predictor variables (the data must already be declared as survival-time data using stset) and a time-point of interest. The output is similar to that of dca. In R, we have created two functions: dca.R and stdca.R. These are implemented similarly to the Stata commands; however, for stdca.R the user must also specify the time and failure variables as inputs. The Stata and R code can be found at http://www.decisioncurveanalysis.org along with tutorials on using the code (including survival time data, multivariable models, and joint and conditional models), discussions of how to interpret decision curves, and code to implement correction for overfit by repeated 10-fold cross-validation.