Owing to selective reporting and publication bias, almost all articles on cancer prognostic markers report statistically significant results [15]. The fact that a variable is statistically significant in a multivariate model does not necessarily imply that the variable improves the model's predictive accuracy [16]. It has been shown, for example, that a marker with an odds ratio of 3 is in fact a very poor classification tool and that an odds ratio of 30 or more is desirable. More generally, however, a single measure of association such as an odds ratio does not meaningfully describe a marker's ability to classify patients [17].
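To make this concrete, the short sketch below (with illustrative numbers, not figures from the cited studies) computes the odds ratio implied by a binary marker's sensitivity and specificity, showing that an odds ratio of 3 is compatible with quite mediocre classification performance:

```python
def odds_ratio(sensitivity, specificity):
    """Odds ratio of a binary marker from its sensitivity and specificity:
    OR = [sens / (1 - sens)] / [FPR / (1 - FPR)], where FPR = 1 - spec."""
    fpr = 1.0 - specificity
    return (sensitivity / (1.0 - sensitivity)) / (fpr / (1.0 - fpr))

# A marker with only 60% sensitivity and ~67% specificity already has OR = 3:
print(odds_ratio(0.60, 2 / 3))   # -> approximately 3.0
# Much stronger classification performance is needed before OR reaches 30+:
print(odds_ratio(0.90, 0.90))    # -> approximately 81.0
```

The first call illustrates why a "significant" odds ratio of 3 says little about a marker's usefulness for classifying individual patients.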

Thus, the important question to answer is *not* (1) whether the gene expression classifier by itself is significantly related to the outcome, or (2) whether the gene expression classifier is statistically significant in a multivariate model which also includes the classical clinical and pathological factors, or even (3) whether the gene expression classifier is more significant than the classical clinical and pathological factors in a multivariate model, but rather: “Does adding the gene expression classifier to an existing model which is based on the most important clinical and pathological factors improve the predictive accuracy of this model?”

Neither the statistical significance (*p*-value) of the classifier in the model nor the value of its hazard ratio can be used to assess the gain in predictive accuracy. The value of the hazard ratio depends on the measurement scale and cut-off used, the other variables that are included in the model and how they are coded [18].

The following methods can be used to assess whether a gene expression classifier improves the predictive accuracy of a standard classification or scoring system when applied to an *independent* data set.

Assessment of a gene classifier within the levels of a standard risk group classification system

When a classification system exists which divides patients into different risk groups based on standard clinical and/or pathological factors, one way to assess whether the gene classifier adds predictive accuracy to this system is to examine in a separate test set the outcome for the gene classifier within the risk groups of the standard system. The statistical significance of the contribution of the new classifier within the risk groups of the standard system can be used to assess whether it adds prognostic information.
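A minimal sketch of this idea, using entirely hypothetical patients, is to tabulate the observed event rate for each gene classifier label within each risk group of the standard system; in a real analysis a stratified test (e.g. a stratified log-rank test) would then assess significance within strata:

```python
from collections import defaultdict

# Hypothetical test-set records: (standard risk group, gene classifier label,
# event observed within follow-up: 1 = yes, 0 = no)
patients = [
    ("low", "good", 0), ("low", "good", 0), ("low", "poor", 1),
    ("low", "poor", 0), ("high", "good", 1), ("high", "good", 0),
    ("high", "poor", 1), ("high", "poor", 1),
]

# (risk group, classifier label) -> [number of events, number of patients]
counts = defaultdict(lambda: [0, 0])
for group, label, event in patients:
    counts[(group, label)][0] += event
    counts[(group, label)][1] += 1

for (group, label), (events, n) in sorted(counts.items()):
    print(f"risk={group:4s} classifier={label:4s} event rate={events / n:.2f}")
```

If the "poor" classifier label shows a consistently higher event rate *within* each standard risk group, the classifier carries prognostic information beyond the standard system.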

Predictive inaccuracy and proportion of explained variation

The predictive inaccuracy of a model is the average of the absolute difference between the observed outcome and the outcome predicted by the model, i.e. the absolute prediction error [19-22]. The predictive inaccuracy is assessed for four models:

- without any covariates (*D*_{0})
- with only the clinical and pathological factors (*D*_{C})
- with only the genomic classifier (*D*_{G})
- with both the clinical and pathological factors and the genomic classifier (*D*_{CG})

The proportion of variation explained by the models is then calculated. Similar to the multiple *R*^{2} for linear regression, the explained variation for a model with only the gene classifier is (*D*_{0}–*D*_{G})/*D*_{0}, while the relative gain in explained variation when the gene classifier is added to the model containing the clinical and pathological factors is (*D*_{C}–*D*_{CG})/*D*_{0}. Explained variation ranges from 0% to 100% and predictive inaccuracy is 0 for a perfect prediction. Standard errors for explained variation and predictive inaccuracy can be obtained via bootstrap resampling. The value of adding a genomic classifier to a model with clinical and pathological factors can then be determined based on the values of predictive inaccuracy and explained variation for models 2 and 4 above.
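For a binary endpoint these quantities are straightforward to compute; the sketch below uses hypothetical predicted probabilities for the four models (survival endpoints would instead require the time-dependent prediction-error methods of the cited references):

```python
def predictive_inaccuracy(observed, predicted):
    """Mean absolute difference between observed and predicted outcomes."""
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / len(observed)

y = [1, 0, 1, 1, 0, 0, 1, 0]                          # observed binary outcomes
p_null = [sum(y) / len(y)] * len(y)                   # model 1: no covariates
p_clin = [0.7, 0.3, 0.6, 0.6, 0.4, 0.3, 0.5, 0.4]     # model 2: clinical only
p_gene = [0.8, 0.2, 0.7, 0.6, 0.3, 0.3, 0.6, 0.3]     # model 3: classifier only
p_both = [0.9, 0.1, 0.8, 0.7, 0.2, 0.2, 0.7, 0.2]     # model 4: clinical + classifier

D0 = predictive_inaccuracy(y, p_null)
DC = predictive_inaccuracy(y, p_clin)
DG = predictive_inaccuracy(y, p_gene)
DCG = predictive_inaccuracy(y, p_both)

print(f"explained variation, classifier alone: {(D0 - DG) / D0:.2f}")
print(f"gain in explained variation from adding the classifier: {(DC - DCG) / D0:.2f}")
```

In this toy data the combined model has the smallest predictive inaccuracy, and the gain (*D*_{C}–*D*_{CG})/*D*_{0} quantifies what the classifier adds beyond the clinical and pathological factors.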

Several different modelling strategies exist, such as Cox regression, logistic regression, recursive partitioning and regression trees (CART) and artificial neural networks [23]. Considerable work has been done in defining measures of predictive accuracy and explained variation for the Cox proportional hazards regression model [20-22].

When patients die of unrelated causes before the endpoint of interest is observed, for example recurrence or progression to muscle-invasive disease, classical methods based on Kaplan–Meier estimates and Cox regression will overestimate the probability of the event. When such competing risks are present, cumulative incidence curves and special competing risk regression techniques must be used [24].
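A toy example with hypothetical data illustrates the overestimation (a real analysis would use the Aalen–Johansen estimator and handle independent censoring as well):

```python
# (time, cause): cause 1 = event of interest, cause 2 = competing death
data = [(1, 2), (2, 1), (3, 2), (4, 1)]
n = len(data)

# Naive 1 - Kaplan-Meier estimate: competing deaths treated as if censored,
# so the at-risk set shrinks but the survival curve is only updated at
# events of interest.
surv = 1.0
at_risk = n
for time, cause in sorted(data):
    if cause == 1:
        surv *= (at_risk - 1) / at_risk
    at_risk -= 1
naive = 1.0 - surv

# Cumulative incidence of cause 1 (with no independent censoring this is
# simply the observed proportion of cause-1 events).
cif = sum(1 for _, cause in data if cause == 1) / n

print(f"naive 1-KM estimate:  {naive:.2f}")  # 1.00: implies everyone has the event
print(f"cumulative incidence: {cif:.2f}")    # 0.50: the correct proportion
```

Here half the patients die of a competing cause, yet the naive Kaplan–Meier approach reports that the event of interest is certain, which is exactly the bias the competing-risk methods correct.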

In prognostic factor studies of survival, the absolute or relative predictive accuracy and the overall explained variation are often low even when there are prognostic factors that are highly statistically significant. An explained variation of 20% has been proposed as the minimum requirement for a gain in predictive accuracy which is worthwhile on an individual patient level [19].

Area under the curve, concordance index and concordance probability estimate

The most common measures of predictive accuracy for binary and time-to-event endpoints are, respectively, the area under the receiver operating characteristic (ROC) curve (AUC) (sensitivity versus 1 – specificity) and the c-index (concordance index), which are identical for binary data [16, 25]. The c-index, an index of predictive discrimination, gives the probability that, for two patients chosen at random, the model-predicted outcome and the observed outcome are in agreement, i.e. the patient with the worse outcome is predicted to have the worse outcome (a c-index of 0.50 represents agreement by chance). The c-index is calculated for the models described in the previous section to assess the improvement obtained by taking the gene classifier into account. It cannot be interpreted as a proportion of variation explained by the gene classifier.
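A minimal sketch of the pairwise definition for a binary endpoint, with hypothetical risk scores for a clinical model and for the clinical model plus the gene classifier:

```python
from itertools import combinations

def c_index(outcomes, scores):
    """Concordance: probability that, of two patients with different outcomes,
    the one with the event has the higher predicted risk score.
    Ties in score count as half-concordant; equals AUC for binary outcomes."""
    concordant, pairs = 0.0, 0
    for i, j in combinations(range(len(outcomes)), 2):
        if outcomes[i] == outcomes[j]:
            continue  # only pairs with one event and one non-event are usable
        pairs += 1
        worse, better = (i, j) if outcomes[i] > outcomes[j] else (j, i)
        if scores[worse] > scores[better]:
            concordant += 1.0
        elif scores[worse] == scores[better]:
            concordant += 0.5
    return concordant / pairs

y    = [1, 1, 1, 0, 0, 0]                 # 1 = event, 0 = no event
clin = [0.8, 0.4, 0.6, 0.5, 0.3, 0.2]     # clinical model risk scores
both = [0.9, 0.6, 0.7, 0.4, 0.3, 0.1]     # clinical + gene classifier scores

print(c_index(y, clin))  # 8 of 9 usable pairs concordant -> 0.888...
print(c_index(y, both))  # 1.0: perfect discrimination in this toy data
```

Comparing the c-index of the clinical model with that of the combined model gives a direct measure of the discrimination added by the classifier.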

One disadvantage of the c-index is that it tends to become more extreme as the amount of censoring increases. An alternative to the c-index which is not affected by patterns of censoring is the concordance probability estimate (CPE) based on the Cox model.

More recently, ROC methodology has been extended to time-to-event outcomes to produce time-dependent sensitivity, specificity and ROC curves along with a global concordance measure [26].