Standard Metrics

The most basic requirement put on a new marker under consideration is that it is *statistically significantly* associated with the outcome of the study. The purpose behind this is simple: we need to rule out the possibility that the observed association might be due to chance. Statistical significance is determined using a p-value, and the level of “less than 0.05” is universally accepted. However, what is frequently overlooked is that the p-value is a function of both the effect size (how strong the association is) and the sample size (how large our study is or how many events we observed). Thus, large studies are very likely to declare statistical significance, even though the magnitude of the effect might be minuscule. Hence, the concept of *clinical significance* has been put forth to reduce the impact of sample size. It is usually associated with the observed relative risk expressed using risk, odds or hazard ratios, or as an absolute risk expressed as risk difference, or as its inverse known as the number-needed-to-treat (23). In the context of risk prediction and adding new biomarkers to existing models, these risk metrics are estimated using models that already adjust for standard risk factors. The phrase *independent contribution* is used in clinical literature to emphasize that the analysis took into account existing risk factors. In simple terms, a new marker that is associated with the risk of outcome on its own may not be of much use in risk prediction if its correlation with standard factors diminishes its contribution to the overall model. In our example, the hazard ratio per one standard deviation increase in HDL cholesterol was 0.65 (95% confidence interval: 0.53, 0.80), and it was significant with p-value < 0.001.
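The dependence of the p-value on sample size can be made concrete with a minimal sketch. It assumes a two-sample z-test for a standardized mean difference with equal group sizes, which is an illustrative choice and not the analysis used in the HDL example; the function name `two_sided_p` and the effect size of 0.05 are hypothetical.

```python
from statistics import NormalDist


def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test.

    effect_size is the difference in means expressed in standard deviation
    units; both groups are assumed to have n_per_group subjects.
    """
    z = effect_size * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same tiny effect (0.05 standard deviations) at growing sample sizes.
for n in (100, 1_000, 10_000, 100_000):
    print(f"n per group = {n:>7}: p = {two_sided_p(0.05, n):.4f}")
```

With the effect size held fixed, the p-value drops below 0.05 once the groups are large enough, which is why a measure of magnitude is needed alongside it.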

While these standard risk metrics are commonly used and well understood and provide useful information about the magnitude of association, they may not give a complete assessment of the added contribution of the new marker in the context of risk prediction. It is generally true that the higher the magnitude of absolute or relative risk for a new biomarker in a model adjusted for standard risk factors, the more the ‘gain’ in model performance that can be expected. However, the questions of “how much ‘gain’ is good enough” and “how to compare contributions of markers evaluated on different scales (continuous, ordinal, binary)” have not been answered. The latter question has become even more relevant with flexible modeling techniques (splines etc.) increasing in popularity and application.

The area under the receiver-operating-characteristics (ROC) curve (AUC), often called the c statistic, has become the standard metric for assessing performance of models for binary outcomes. Hanley and McNeil (17) have shown that AUC is equal to the probability that given two subjects, one with and one without an event of interest, the one with event will have a higher model-based predicted probability of an event. In simpler terms this means that the model is more likely to assign higher risks to people with events, which is obviously a desirable property. On the other hand, the relationship between the AUC and the plot of ‘Sensitivity’ versus ‘1 – Specificity’ (the ROC curve) is appealing for the purpose of risk classification. Any paper attempting to propose a new risk prediction model is expected to report the value of the c statistic.
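The probabilistic interpretation of the AUC translates directly into a pairwise count. The sketch below is a minimal illustration using made-up predicted risks; counting tied pairs as one half is the usual Mann-Whitney convention and is an assumption here, as is the function name `c_statistic`.

```python
def c_statistic(risk_events, risk_nonevents):
    """Share of event/nonevent pairs in which the event has the higher risk.

    Tied pairs count as one half (assumed convention).
    """
    concordant = 0.0
    for risk_event in risk_events:
        for risk_nonevent in risk_nonevents:
            if risk_event > risk_nonevent:
                concordant += 1.0
            elif risk_event == risk_nonevent:
                concordant += 0.5
    return concordant / (len(risk_events) * len(risk_nonevents))


# Hypothetical predicted probabilities, for illustration only.
events = [0.30, 0.22, 0.08]
nonevents = [0.25, 0.10, 0.05, 0.02]
print(c_statistic(events, nonevents))
```

The double loop is quadratic in the number of subjects; it is written this way to mirror the definition, not for efficiency.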

Given the above, it seems natural to use the increment in the c statistic (IAUC) as a method of quantifying the added value offered by new biomarkers, in a manner similar to the increment in R-square used in linear regression. However, researchers have recently observed (based on empirical applications) that the IAUC is very small when the baseline model’s c is large (24). On one hand, this observation should not be surprising: good models are harder to improve upon. However, the extent to which IAUC depends on baseline c (rather than the effect size of association for a new biomarker) seems undesirably large.

The above empirical finding can be illustrated with the help of the following simple simulation. A baseline model for a binary outcome with the c statistic ranging from 0.50 to 0.99 was constructed using a single predictor with the necessary effect size, and then a new marker uncorrelated with the first one was added to that model. The effect sizes (ES) of the new marker of 0.18, 0.41 and 0.69 were selected to correspond to odds ratios of 1.2, 1.5 and 2.0, respectively. Then, the IAUCs between the baseline and new models were calculated and plotted against the baseline c; they are displayed in the accompanying figure.

We observe how quickly the IAUCs decline as a function of the baseline c. Baseline models with c above 0.75 cannot be improved by more than 0.05, even if the effect size is as large as 0.69 (OR per one standard deviation of 2.0). A marker with an odds ratio of 1.2 per standard deviation can add only a minuscule amount to the c statistic. Note that these results were obtained in the optimistic scenario of no correlation between the new marker and baseline model predictors. In case of correlation greater than zero between a new biomarker and standard risk factors, odds ratios of the magnitudes presented here will be even harder to attain. In our HDL example, the c statistics for the old and new models were 0.762 and 0.774, yielding a small IAUC = 0.012 (p-value = 0.092).
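The pattern described above can be approximated without simulation under a simplifying assumption that the text does not state: both predictors are normally distributed with equal variance among events and nonevents. In that case the c statistic equals Φ(d/√2), where d is the standardized separation between the two groups, and an uncorrelated marker with effect size ES raises d to √(d² + ES²). The sketch below uses this shortcut; the function name `iauc` is hypothetical, and the original simulation may have been set up differently.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def iauc(baseline_c: float, effect_size: float) -> float:
    """Increase in c when an uncorrelated marker is added (binormal assumption)."""
    # Standardized separation implied by the baseline c statistic.
    d_old = sqrt(2) * STD_NORMAL.inv_cdf(baseline_c)
    # Uncorrelated predictors: squared separations add up.
    d_new = sqrt(d_old**2 + effect_size**2)
    return STD_NORMAL.cdf(d_new / sqrt(2)) - baseline_c


# Effect sizes from the text (log odds ratios of 1.2, 1.5 and 2.0 per SD).
for effect_size in (0.18, 0.41, 0.69):
    row = [round(iauc(c, effect_size), 3) for c in (0.55, 0.65, 0.75, 0.85, 0.95)]
    print(effect_size, row)
```

Under this assumption the gain shrinks steadily as the baseline c grows, and for a baseline c of 0.75 it stays below 0.05 even at an effect size of 0.69, in line with the observation above.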

Another criticism of the c statistic and the IAUC came from the CVD risk prediction domain, where treatment guidelines rely on generally accepted risk categories defined by absolute 10-year predicted event rates. Since the c statistic is not influenced by absolute risk levels, it was argued that it cannot adequately capture the clinical usefulness of new markers in situations where treatment decisions are based on established categories (18). Reclassification tables were proposed instead (18), and appropriate ways of interpreting them were put forth (19).

Reclassification Table Metrics and their Continuous Counterparts

As mentioned above, some clinical decisions based on risk prediction algorithms are based on established categories of absolute risk. Primary prevention of CVD or CHD serves as the key example, where people with predicted 10-year risk above 20% are recommended for treatment, whereas those with risk below 10% (or more recently 6%) are considered ‘low risk’ (9, 10, 25). Adherence to such categories suggests that a lot can be learned about the usefulness of a new marker from simple cross-tabulation of risk categories (for example, ‘low risk’ 0–6%, ‘medium risk’ 6–20%, ‘high risk’ > 20%) based on predicted probabilities obtained using models without and with the new marker. Janes et al. (26) observe that key information is contained in the margins of the reclassification table and suggest looking at the following three characteristics (“checks”) that can be derived from such a table:

- Calibration;
- Risk stratification capacity;
- Classification accuracy.

We illustrate their meaning and interpretation using data published in (19) and described in the Materials and Methods section, looking at the added usefulness of HDL cholesterol in a 10-year risk prediction model of CHD. The reclassification table is given as Table 2 and consists of 3 parts: the first for individuals with CVD events, the second for individuals who do not experience CVD events (nonevents) and the third for all individuals combined. The first two have been presented in (19); the third one is added here to illustrate additional concepts.

**Table 2.** Reclassification table for models without and with HDL cholesterol

Good calibration means that model-based predicted event rates closely match those observed in practice. In the context of a reclassification table, the simplest calibration check looks at the observed event rates presented in the margins of the third part of the table, which shows data on all individuals combined, and determines whether they fall into the risk categories to which they correspond. In our example, event rates for the “old model” (i.e., the one without HDL) are given in the row margins as 2.5%, 10.6% and 19.7% for risk categories of 0–6%, 6–20% and >20%, respectively. For simplicity, we used crude event rates, but Kaplan-Meier rates could be presented instead, and would be even more appropriate. It is evident that the first two rates fall into the respective categories, while the third one is on the border (19.7% is very close to but technically outside the >20% category). The model with HDL does a little better, with respective rates of 2.0%, 10.8% and 25.4% well within their respective categories.
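The descriptive calibration check amounts to asking whether each observed event rate lies inside the risk interval of its own category. The following sketch applies that check to the rates quoted above; the treatment of a rate falling exactly on a cutoff is an assumption, and the names `CATEGORY_BOUNDS` and `rates_within_categories` are hypothetical.

```python
# Risk categories from the text: 0-6%, 6-20%, >20%.
# Assumption: a rate exactly on a cutoff belongs to the lower category.
CATEGORY_BOUNDS = [(0.0, 0.06), (0.06, 0.20), (0.20, 1.0)]

# Observed (crude) event rates by predicted risk category, as quoted above.
OBSERVED_RATES = {
    "old model (without HDL)": [0.025, 0.106, 0.197],
    "new model (with HDL)": [0.020, 0.108, 0.254],
}


def rates_within_categories(rates, bounds=CATEGORY_BOUNDS):
    """Return one boolean per category: does the observed rate fall inside it?"""
    checks = []
    for index, (rate, (lower, upper)) in enumerate(zip(rates, bounds)):
        if index == 0:
            checks.append(lower <= rate <= upper)
        else:
            checks.append(lower < rate <= upper)
    return checks


for model, rates in OBSERVED_RATES.items():
    print(model, rates_within_categories(rates))
```

The check flags only the highest category of the old model, where 19.7% falls just short of the 20% cutoff.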

The above ‘check’ is descriptive. A formal test would follow the Hosmer-Lemeshow (27) chi-square approach (or Nam and D’Agostino’s analog for survival data (28)), and one could argue that the more traditional, decile-based presentation is equally if not more meaningful. Another approach would look at the plot of predicted vs. observed risk; an illustration of this approach has been presented in a recent paper on reporting guidelines for biomarker studies in CVD risk prediction by Hlatky et al. (29).

It is important to note that using standard analytic methods, which include logistic and proportional hazards regressions, one is almost guaranteed to obtain a reasonable degree of model calibration. The largest violations are observed when the mean of model-based predictions differs from the incidence rate in the cohort being analyzed (this is referred to as bias by D’Agostino et al. (30)). However, the two regressions mentioned above introduce minimal bias in the sample in which the model was developed, and thus should lead to reasonably good calibration if the number of predictors is sufficient.

The second ‘check’ suggested by Janes et al. (26) relates to discrimination, understood as the model’s ability to spread predicted risk across individuals. A simple check coming from the reclassification table looks at the sizes of the three risk categories. Better models will tend to have more people in the lowest and highest risk groups. In our example, again looking at the third part of the reclassification table, these numbers are approximately identical for the old and new models, with 66%, 30% and 4% across the three risk groups. In particular, we observe very little shrinkage of the middle risk group.

A more general way of inspecting the amount of spread offered by a given model is the predictiveness curve introduced by Pepe et al. (21). It is constructed by ranking all predicted probabilities and plotting them against the quantile of risk. Ideally, the predictiveness curve remains very close to the horizontal axis and then increases very rapidly. If meaningful categories exist, they can be marked on the vertical axis and the corresponding quantiles of risk read off the horizontal axis to determine the size of the middle risk group, similarly to the reclassification table. Other useful properties of this curve have been described by Pepe et al. (21). The accompanying figure shows two predictiveness curves, one for the model without and one for the model with HDL.
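The construction of the predictiveness curve can be sketched in a few lines: sort the predicted probabilities and pair each with its quantile. The example below uses made-up predictions and the 6% and 20% cutoffs from the text; the function names and the handling of values exactly on a cutoff are assumptions.

```python
def predictiveness_curve(predicted):
    """Return (risk quantile, predicted probability) pairs, sorted by risk."""
    ranked = sorted(predicted)
    n = len(ranked)
    return [((rank + 1) / n, risk) for rank, risk in enumerate(ranked)]


def category_shares(predicted, low_cutoff=0.06, high_cutoff=0.20):
    """Shares of subjects in the low, middle and high risk groups.

    Assumption: values exactly on a cutoff fall into the middle group.
    """
    n = len(predicted)
    low = sum(risk < low_cutoff for risk in predicted) / n
    high = sum(risk > high_cutoff for risk in predicted) / n
    return low, 1 - low - high, high


# Hypothetical predicted 10-year risks, for illustration only.
risks = [0.04, 0.25, 0.01, 0.12, 0.03, 0.40, 0.05, 0.08, 0.02, 0.15]
curve = predictiveness_curve(risks)
print(curve[0], curve[-1])
print(category_shares(risks))
```

Reading the quantiles at which the curve crosses the two cutoffs gives the same group sizes as the margins of a reclassification table built from the same predictions.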

The curves are virtually overlapping. The plot leads to the same conclusion as the one derived from looking at the margins of the reclassification table: there is very little difference between the ‘spread-ability’ of the two models. In our experience, this conclusion is very common when examining the usefulness of new biomarkers. It is a question for future research to determine if this is caused by the inadequacy of the markers under consideration, or if it is an inherent property of the predictiveness curve itself.

While it is desirable ‘to spread’ the predicted risks as much as possible, the fact of adequate separation does not have to guarantee that events get higher predicted probabilities and non-events get low predicted probabilities, a condition necessary for a good prediction model. When significant predictors are used, it is very unlikely that a large separation would not imply good classification accuracy, but direct measures of the latter are clearly desirable.

In the context of reclassification tables, the net reclassification improvement (NRI) offers a simple yet meaningful check of classification accuracy (19). In its simplest form for binary outcomes, NRI is calculated by examining events and non-events separately (the first two parts of the reclassification table). Only people who change risk categories contribute to the NRI, since for those who do not, the models perform the same. It is desirable to increase the predicted probabilities of event for those who experience events, and hence any upward reclassification among events is considered beneficial. It is quantified by the (bold-face) numbers above the diagonal in the first (event) part of Table 2: we get (15+0+14) = 29. This gain is offset by people with events who move down in categories using the new model. Their number is below the diagonal, and can be calculated as (4+0+3) = 7. Hence the net improvement is (29−7) = 22 out of 183 people with events, which gives a percentage improvement of (22/183) = 12.0%. For nonevents, the reasoning is reversed: downward movement in categories is considered beneficial. Looking at the second part of the reclassification table, we see (148+1+25) = 174 individuals without events moving down (bold-face numbers below the diagonal) and (142+0+31) = 173 moving upwards, for a net gain of (174−173) = 1 person out of 3081, or 0.0%. The total NRI is calculated as the sum of the two, which in this case is 12.0%. The total NRI assumes an implicit weighting of importance proportional to the odds of nonevents; in this case (incidence rate = 6%) it is 0.94/0.06 = 15.7:1. This means that each misclassified event is 15.7 times more important than each misclassified ‘nonevent.’ Other weightings are also possible (cf. Vickers (31)). At least as informative as the total NRI, however, are its two components: the one calculated for events and the one for nonevents. There is a 12% improvement in reclassification of events, and no change in reclassification of nonevents. Thus, the improvement in reclassification of events comes only at the cost of measuring HDL cholesterol.
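The NRI arithmetic above can be reproduced from the counts of individuals who move between categories. The sketch below is a minimal illustration using the numbers quoted in the text; the function name `nri_components` and its argument names are hypothetical.

```python
def nri_components(up_events, down_events, n_events,
                   up_nonevents, down_nonevents, n_nonevents):
    """Return (event NRI, nonevent NRI, total NRI) from reclassification counts."""
    # Among events, upward moves are beneficial.
    nri_events = (up_events - down_events) / n_events
    # Among nonevents, downward moves are beneficial.
    nri_nonevents = (down_nonevents - up_nonevents) / n_nonevents
    return nri_events, nri_nonevents, nri_events + nri_nonevents


# Counts quoted in the text for the HDL example.
nri_ev, nri_nonev, nri_total = nri_components(
    up_events=15 + 0 + 14, down_events=4 + 0 + 3, n_events=183,
    up_nonevents=142 + 0 + 31, down_nonevents=148 + 1 + 25, n_nonevents=3081,
)
print(f"events: {nri_ev:.1%}, nonevents: {nri_nonev:.1%}, total: {nri_total:.1%}")

# Implicit relative weight of an event versus a nonevent at 6% incidence.
implicit_weight = 0.94 / 0.06
```

Keeping the two components separate in the return value reflects the point made above that they are at least as informative as their sum.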

Another useful look at the NRI focuses on a prospective interpretation and allows extension to survival data. We calculate the Kaplan-Meier event rates for individuals reclassified upwards and downwards, and compare them to the overall Kaplan-Meier rate. Ideally, the rate for those going up will be much higher than the overall rate, while the rate for those going down will be lower. In our example, the rate for those reclassified upwards is 15.1% and for those reclassified downwards it is 4.1% as compared to the rate of 6.0% for everyone combined, which fulfills the desired relationship: 15.1% > 6.0% > 4.1%. We conclude that people reclassified upwards are at higher risk than the average, while people reclassified downwards are at a lower risk.

Seeing the simple and informative nature of NRI and its components one wants to ask whether this approach can be generalized to situations where there are no categories. There are two ways in which NRI can be extended to situations where no meaningful categories exist. The first one defines any change in predicted probability as either upward or downward movement, depending on the direction. Since predicted probabilities are continuous if at least one of the predictors is, this implies that every person will be reclassified. In our example, this category-less NRI(>0) is 30.2%, with the corresponding event and nonevent components of 24.6% and 5.6%, respectively. These numbers are much larger than the ones we observed when categories were present – this is to be expected as presence of categories substantially reduces the amount of reclassification estimated. This example also illustrates the phenomenon of NRI increasing with the number of categories with the category-less NRI(>0) usually serving as the upper limit. However, the conclusions about reclassification of events and nonevents remain similar – much more improvement is seen for events than for nonevents.
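The category-less NRI(>0) can be sketched by treating any increase in predicted probability as an upward move and any decrease as a downward move. The data below are made up for illustration, and the function name `continuous_nri` is hypothetical; subjects whose predicted probability does not change contribute zero, which is an assumption about how exact ties are handled.

```python
def continuous_nri(old_risk, new_risk, is_event):
    """Return (event NRI, nonevent NRI, total NRI) with no risk categories."""
    net_events = net_nonevents = 0
    n_events = n_nonevents = 0
    for old, new, event in zip(old_risk, new_risk, is_event):
        direction = (new > old) - (new < old)  # +1 up, -1 down, 0 unchanged
        if event:
            n_events += 1
            net_events += direction       # upward moves are beneficial
        else:
            n_nonevents += 1
            net_nonevents -= direction    # downward moves are beneficial
    nri_events = net_events / n_events
    nri_nonevents = net_nonevents / n_nonevents
    return nri_events, nri_nonevents, nri_events + nri_nonevents


# Hypothetical predictions: four events followed by five nonevents.
old = [0.10, 0.20, 0.30, 0.05, 0.10, 0.20, 0.05, 0.15, 0.08]
new = [0.15, 0.25, 0.28, 0.09, 0.08, 0.22, 0.03, 0.10, 0.06]
event = [1, 1, 1, 1, 0, 0, 0, 0, 0]
print(continuous_nri(old, new, event))
```

Because every change counts regardless of its size, this version ranges from −2 to 2 and will typically exceed any category-based NRI computed from the same predictions.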

Another way to extend NRI to the case of no categories is to assign a weight to each movement that is equal to the difference of probabilities from the old and new models. This leads to another measure of the accuracy of a prediction model, which Pencina et al. (19) called the integrated discrimination improvement (IDI). They have also shown that IDI is equal to the difference in discrimination slopes as proposed by Yates (32). The discrimination slope has a nice intuitive definition: it is the difference in means of predicted probabilities for events and nonevents (in other words, it is a measure of separation in predicted probabilities for events and nonevents). In our example, these slopes for models with and without HDL are 0.0715 and 0.0630, yielding an IDI of 0.0085. Discrimination slopes and IDI depend on the incidence of the outcome of interest, and further research is needed to gain some intuition of what an IDI of 0.0085 really means beyond its statistical significance at p-value = 0.016. One potential direction is to look at the relative IDI (rIDI), defined as the increase in discrimination slopes divided by the slope of the old model (33). In our case it is 13.5% (0.0085/0.0630). The following heuristic argument might help assess the magnitude of rIDI. If every variable were to contribute equally to the discrimination slope, and the old model had 6 predictors, the average contribution would be 16.7% (1/6). This is the incremental contribution that would be expected from the new variable. HDL with 13.5% rIDI comes close to the expectation. Alternatively, treating age as a time-scale adjustment rather than a risk factor, one looks at the relative increase beyond age. The contribution of the five risk factors other than age to the old model’s slope is 0.0400, resulting in a non-age rIDI of 21%. The expected number based on 5 risk factors would be 20%, indicating that HDL offers a similar magnitude of improvement as the average of the risk factors used in the old model.
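The discrimination slope, IDI and rIDI follow directly from their definitions. The sketch below first defines them for arbitrary predictions and then reproduces the arithmetic from the slopes quoted in the text; the function names and the small toy data set are hypothetical.

```python
from statistics import mean


def discrimination_slope(risk, is_event):
    """Mean predicted probability among events minus that among nonevents."""
    events = [r for r, e in zip(risk, is_event) if e]
    nonevents = [r for r, e in zip(risk, is_event) if not e]
    return mean(events) - mean(nonevents)


def idi(old_risk, new_risk, is_event):
    """Integrated discrimination improvement: difference in discrimination slopes."""
    return (discrimination_slope(new_risk, is_event)
            - discrimination_slope(old_risk, is_event))


# Toy data (hypothetical) to show the functions in use.
toy_slope = discrimination_slope([0.3, 0.5, 0.1, 0.2], [1, 1, 0, 0])

# Slopes quoted in the text for the models with and without HDL.
slope_with_hdl, slope_without_hdl = 0.0715, 0.0630
idi_reported = slope_with_hdl - slope_without_hdl
ridi = idi_reported / slope_without_hdl
# Relative IDI against the part of the old slope not attributed to age.
ridi_non_age = idi_reported / 0.0400
print(f"IDI = {idi_reported:.4f}, rIDI = {ridi:.1%}, non-age rIDI = {ridi_non_age:.0%}")
```

Dividing by the old model's slope puts the IDI on a relative scale, which is what allows the comparison with the 1/6 and 1/5 benchmarks in the heuristic argument above.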