As noted in extensive philosophical discussions by Anscombe (1967) and Chatfield (1995) and realized extensively in medical practice, the only real validation of a prediction model is confirmation by a completely independent set of observations collected on a different set of patients from different centers by different investigators. Recently, Steyerberg *et al.* (2010) provided a clarifying review of evaluation methods for risk prediction models, separating evaluation metrics according to whether they measure discrimination, calibration, or both. The purpose of this section is to review the latest state of the art in validation principles for risk prediction tools and for comparing updated risk tools to existing ones.

Typically the way risk prediction tools, or diagnostic markers in general, are applied in practice is to choose an arbitrary cutpoint *c* and take further action if the prediction exceeds *c*, referred to as a positive test, and no action if the prediction falls below *c*, referred to as a negative test. The discrimination accuracy of the test is reported separately for cancer cases and controls, as the sensitivity (proportion of cancer cases testing positive) and specificity (proportion of controls testing negative), for various choices of the cutpoint *c*. The receiver operating characteristic (ROC) curve plots the sensitivity versus the false positive rate (1-specificity) for all cutpoints *c*. The area underneath the ROC curve (AUC) ranges from 0.50 for a test with no discriminative power to 1.0 for a test with 100% sensitivity at all possible cutpoints *c*. The AUC holds an alternative, intuitively appealing definition as the probability that for a randomly drawn cancer case and randomly drawn control, the case has a higher risk than the control. Defined as such, it is a rank-based metric equivalent to the non-parametric Wilcoxon test statistic for comparing distributions in two populations (`wilcox.test` in `R`), and can be implemented to test the null hypothesis that AUC = 0.50 versus the alternative that AUC > 0.50. The U-statistic approach of DeLong *et al.* (1988) can be used for a formal statistical test of the null hypothesis that two risk tools or markers have the same AUC on a validation set versus the two-sided alternative that they differ (`roc.test` function with `method="delong"` and `paired=TRUE` options in the `R` package `pROC`). The AUC is practically the ubiquitous endpoint in diagnostic medicine and has been nearly the sole performance criterion for evaluating the PCPTRC (Parekh *et al.*, 2007; Eyre *et al.*, 2009; Hernandez *et al.*, 2009). However, it only measures one dimension of performance, discrimination, and even there, recent statistical reports have criticized its use for placing too much weight on the clinically irrelevant portion of the ROC curve (Pencina *et al.*, 2008; Greenland, 2008).
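The rank-based definition of the AUC can be computed directly by comparing every case-control pair. The sketch below (illustrative Python with invented risk values, not part of the original analysis) counts the proportion of pairs in which the case receives the higher predicted risk, with ties counted as one half.

```python
from itertools import product

def auc_rank(case_risks, control_risks):
    """AUC as Pr(case risk > control risk) over all case-control pairs,
    counting ties as 1/2 -- the Wilcoxon/Mann-Whitney interpretation."""
    pairs = list(product(case_risks, control_risks))
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x, y in pairs)
    return wins / len(pairs)

# Hypothetical predicted risks on a validation set
cases = [0.42, 0.35, 0.61, 0.28]
controls = [0.22, 0.30, 0.18, 0.35, 0.10]
print(auc_rank(cases, controls))  # 0.875
```

The same quantity follows in `R` by dividing the U statistic reported by `wilcox.test` by the product of the two group sizes.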

Calibration assesses how closely predicted risks match actual risks in the population and can also be assessed among subgroups to better identify where the prediction model is failing. A formal test of calibration can be implemented by splitting a validation set into *k* groups, typically *k*=10 groups defined by deciles of the distribution of evaluated risks on the validation set, and using an approximation to Pearson's chi-square goodness-of-fit test recommended by Lemeshow and Hosmer (1982):

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - n_i \pi_i)^2}{n_i \pi_i (1 - \pi_i)},$$

where *O*_{i} is the observed number of cancer cases, *n*_{i} the number of individuals, and *π*_{i} the average risk for the *i*th group, for *i* = 1, …, *k*.
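A minimal sketch of this grouped chi-square statistic (illustrative Python; the group counts and average risks are invented for the example):

```python
def calibration_chisq(observed, group_sizes, avg_risks):
    """Sum over groups of (O_i - n_i*pi_i)^2 / (n_i*pi_i*(1 - pi_i))."""
    return sum((o - n * p) ** 2 / (n * p * (1 - p))
               for o, n, p in zip(observed, group_sizes, avg_risks))

# Three risk groups: observed cases, group sizes, average predicted risks
stat = calibration_chisq([5, 12, 30], [100, 100, 100], [0.05, 0.10, 0.25])
print(round(stat, 3))  # 1.778
```

Large values of the statistic relative to a chi-square reference distribution signal discrepancy between predicted and observed risks in one or more groups.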

Discrimination and calibration metrics objectively summarize accuracy but do not provide information as to which thresholds of a prediction model might be useful for basing clinical decisions. Towards this end, Vickers and Elkin (2006) proposed a measure of net benefit justified through a layman's decision analysis framework that does not rely on user-specified costs associated with various outcomes, as full-blown decision analyses typically do. The approach relies on assigning weights to the relative harms of false positive and false negative decisions and then evaluating the net benefit as the average benefit of the decisions. Some decision-theoretic arguments show that if a threshold *c* of a risk prediction is chosen for deciding to take action, such as to get a prostate biopsy, and the value of a true positive decision is set to 1 for identifiability, then the value of a false positive decision becomes −*c*/(1−*c*) (Vickers and Elkin, 2006). As with the other accuracy measures, net benefit is evaluated on a cohort external to the one on which the risk model was developed. The net benefit is defined as the average benefit value over the true positive (TP) and false positive (FP) counts:

$$\text{NB}(c) = \frac{\text{TP}(c)}{n} - \frac{\text{FP}(c)}{n} \times \frac{c}{1-c},$$

where for emphasis the dependency on the user-selected threshold *c* is included in the definition, and *n* is the number of individuals in the validation set. The expression for the net benefit can be rewritten to show that it is also a function of the discrimination measures sensitivity, TPR(*c*), and 1−specificity, FPR(*c*), evaluated on the external validation set and weighted by the proportions of cancer cases (%Cancer) and non-cancer cases (%Non-Cancer) and their benefit values in the validation set:

$$\text{NB}(c) = \text{TPR}(c) \times \%\text{Cancer} - \text{FPR}(c) \times \%\text{Non-Cancer} \times \frac{c}{1-c}.$$

The discrimination metrics TPR and FPR already tend to vary by validation set. As the above expression shows, net benefit further depends on the cancer prevalence in the validation set. In other words, for two validation sets with the same operating characteristics of a prediction model, the one with higher cancer prevalence will demonstrate higher net benefit.
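Counting true and false positives at a chosen threshold gives the net benefit directly. The following sketch (illustrative Python, invented data) applies the definition above:

```python
def net_benefit(risks, outcomes, c):
    """NB(c) = TP(c)/n - FP(c)/n * c/(1-c); outcomes are 1 for cancer, 0 otherwise."""
    n = len(risks)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= c and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= c and y == 0)
    return tp / n - fp / n * c / (1 - c)

# Hypothetical predicted risks and biopsy outcomes on a validation set
risks = [0.1, 0.3, 0.5, 0.7]
outcomes = [0, 0, 1, 1]
print(round(net_benefit(risks, outcomes, 0.25), 3))  # 0.417
```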

Vickers and Elkin (2006) suggested evaluating the net benefit over all possible thresholds *c* of the prediction model ranging from 0 to 1, but in the specific application of the case study here, deciding whether or not to proceed to prostate biopsy for detection of prostate cancer, Steyerberg and Vickers (2008) noted that most men would reasonably be uncertain as to the course of action for risks in the 10 to 40% range. Specific values of the net benefit can be difficult to interpret in isolation, so Vickers and Elkin (2006) also recommended overlaying decision curves for the strategies of referring no patients to biopsy or all patients to biopsy, regardless of the threshold *c* selected. For these curves the weighting term *c*/(1−*c*) remains the same, but the TPR and FPR are calculated for the test rule that assigns no patients to test positive (in other words, *c* > 1) and the rule that assigns all patients to test positive (in other words, *c* < 0). For referring no patients to biopsy, the TPR and FPR are identically 0, so the net benefit curve for this rule is the horizontal line at 0 across all thresholds *c*. For the decision rule referring all patients to prostate biopsy, the TPR and FPR are 1 and the net benefit curve becomes %Cancer − %Non-Cancer × *c*/(1−*c*).
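The two reference strategies reduce to simple closed forms. The sketch below (illustrative Python, with an assumed 25% prevalence) shows that the biopsy-everyone curve is positive for thresholds below the cancer prevalence and crosses zero exactly where the threshold equals it:

```python
def nb_treat_all(prevalence, c):
    """Net benefit of referring everyone: %Cancer - %Non-Cancer * c/(1-c)."""
    return prevalence - (1 - prevalence) * c / (1 - c)

def nb_treat_none(c):
    """Net benefit of referring no one: TPR = FPR = 0, so NB is identically 0."""
    return 0.0

print(round(nb_treat_all(0.25, 0.10), 3))  # 0.167
print(round(nb_treat_all(0.25, 0.25), 3))  # 0.0
```

A prediction model is only worth using at thresholds where its decision curve lies above both of these reference lines.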

For comparing risk predictions from a new model to risk predictions from an old model, Pencina *et al.* (2008) proposed the integrated discrimination index (IDI), which is simply the difference in discrimination slopes between the new and old predictions, as proposed by Yates (1982):

$$\text{IDI} = \left( \frac{\sum_{\text{events}} \hat{p}_{\text{new}}}{n_{\text{events}}} - \frac{\sum_{\text{nonevents}} \hat{p}_{\text{new}}}{n_{\text{nonevents}}} \right) - \left( \frac{\sum_{\text{events}} \hat{p}_{\text{old}}}{n_{\text{events}}} - \frac{\sum_{\text{nonevents}} \hat{p}_{\text{old}}}{n_{\text{nonevents}}} \right),$$

where *n*_{events} is the number of events, in this case prostate cancer cases, *n*_{nonevents} is the number of non-events, in this case non-cancer cases, and the summations run over the predicted probabilities from the new and old models among the cancer cases and non-cases, as indicated by the subscripts. The logic of the IDI is clear: a good prediction model should provide higher estimated risks among the cancer cases in the validation set than among the controls, and how much higher is measured by each model's discrimination slope. A positive IDI indicates that the new model has a better discrimination slope than the old one.
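The IDI amounts to four averages. The sketch below (illustrative Python, invented predicted risks) computes the two discrimination slopes and their difference:

```python
def discrimination_slope(risks_events, risks_nonevents):
    """Mean predicted risk among events minus mean among non-events (Yates, 1982)."""
    return (sum(risks_events) / len(risks_events)
            - sum(risks_nonevents) / len(risks_nonevents))

def idi(new_events, new_nonevents, old_events, old_nonevents):
    """Difference in discrimination slopes, new model minus old."""
    return (discrimination_slope(new_events, new_nonevents)
            - discrimination_slope(old_events, old_nonevents))

# Hypothetical predicted risks from two models on the same validation set
print(round(idi([0.6, 0.8], [0.2, 0.4], [0.5, 0.7], [0.3, 0.5]), 3))  # 0.2
```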

As a final note, all the measures defined above require no missing values for any covariates appearing in the risk prediction tool, but this is often not the reality for externally collected validation sets. Janssen *et al.* (2010) recently showed by simulation that imputation of missing covariates results in less biased estimates of validation metrics than the common practices of either excluding the entire patient from the analysis or dropping the covariate from the model. The current state of the art in imputation is based on specification of full conditional distributions for missing covariates and is termed Multivariate Imputation by Chained Equations (MICE), implementable with the `mice` package in `R` (van Buuren, 2007). MICE can be used without additional model specifications to impute missing data under a missing-at-random (MAR) mechanism, and with additional specifications to impute missing data assumed to be not-MAR (NMAR). Briefly, the method works by fitting an appropriate conditional model, such as a logistic model for dichotomous variables, for each variable with missing values conditional on all other variables, outcomes and covariates, in the model. The `R` `mice` package recommends using all measured variables in a dataset to build the imputation model, even if they are not part of the analysis.
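The chained-equations idea can be illustrated with a deliberately stripped-down sketch (Python, one incomplete variable, deterministic): regress the variable with missing values on a complete covariate, fill in the fitted values, and iterate until the fit stabilizes. Real MICE instead draws imputations from the fitted conditional distribution, cycles over every incomplete variable in turn, and produces multiple imputed datasets.

```python
import statistics as st

def impute_chained(x, y, n_iter=10):
    """Toy single-variable chained imputation: fill missing y (None)
    with fitted values from repeated least-squares fits of y on x."""
    y = list(y)
    missing = [i for i, v in enumerate(y) if v is None]
    observed = [v for v in y if v is not None]
    for i in missing:            # initialize with the observed mean
        y[i] = st.mean(observed)
    for _ in range(n_iter):      # cycle: refit the regression, re-impute
        mx, my = st.mean(x), st.mean(y)
        sxx = sum((a - mx) ** 2 for a in x)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        slope = sxy / sxx
        intercept = my - slope * mx
        for i in missing:
            y[i] = intercept + slope * x[i]
    return y

# Observed points lie on y = 2x, so the missing value converges toward 6
y = impute_chained([1, 2, 3, 4], [2, 4, None, 8])
print(round(y[2], 3))  # 6.0
```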