As noted previously, statistical tests of association, including likelihood ratio tests or odds ratio estimates, cannot determine whether markers or models will be useful in risk prediction or have clinical utility (
Harrell, 2001;
Pepe, et al., 2004). Model fit statistics, which examine discrimination, calibration, or overall model accuracy, can aid in determining the performance of risk prediction models. The simulations described above examined both classic measures of model fit as well as more recent ones based on reclassification to directly compare models. While the c-statistic, or area under the ROC curve, has often been used to quantify model fit, the difference between c-statistics is typically small, even for established predictors (
Pepe et al., 2004;
Cook, 2007). This was also true in the simulations, where the difference was usually no more than 0.01 for odds ratios of 1.5 or 2.0. Power for testing the difference in c-statistics, however, was relatively high for a continuous Y. While power was less than 50% for an ORY of 1.5, it reached 90% when ORY was 2 for a sample size of 5000 with a 10% probability of disease. Thus, even small changes in c were statistically significant. Power for the c-statistic with a binary predictor was lower when the prevalence was low.
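To make the quantity being compared concrete, the sketch below computes the c-statistic as the proportion of case-control pairs in which the case has the higher predicted risk, counting ties as one-half. The function name `c_statistic` and the toy data are assumptions for illustration only and are not taken from the simulations; no variance estimate or test is implemented here.

```python
def c_statistic(y, p):
    """Proportion of case-control pairs where the case has the higher
    predicted risk (ties count one-half). Hypothetical helper."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    # Compare every case with every control (fine for small illustrations)
    wins = sum((a > b) + 0.5 * (a == b) for a in cases for b in controls)
    return wins / (len(cases) * len(controls))

# Toy data (invented): outcomes and predicted risks from two models
y = [0, 0, 1, 0, 1, 1]
p_old = [0.1, 0.3, 0.2, 0.4, 0.6, 0.7]
p_new = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]

# Difference in c-statistics between the two models
delta_c = c_statistic(y, p_new) - c_statistic(y, p_old)
```

In this toy example the second model orders one additional case-control pair correctly, so the c-statistic moves from 7/9 to 8/9.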
We considered two methods of computing the variance of the difference between c-statistics for correlated data: that described by
DeLong et al. (1988) and another proposed by
Rosner and Glynn (2009). Both methods are based on ranks and the Wilcoxon statistic, but Rosner and Glynn use the bivariate normal distribution function to estimate joint probabilities given a shift alternative, while DeLong
et al. estimate these empirically, which is more computationally intensive. The variance estimates were similar for the two methods, along with the corresponding power, although the Rosner-Glynn estimate was usually a couple of percentage points lower. The advantage of the Rosner-Glynn estimate was that it was much quicker to run. With a very large data set, the difference in time could be substantial. Both methods, though, suffered from a conservative type I error.
The IDI statistic, as previously noted (
Pepe et al., 2008a;
Tjur, 2009), corresponded with the difference in R² based on the correlation between the observed outcome and predicted probabilities. Results were similar for the Nagelkerke version of R². While R² has been promoted as a means to assess fit in logistic models (
Ash and Shwartz, 1999), it remains difficult to interpret, especially since the size is typically low for binary outcomes.
Pepe et al. (2008a) have shown that a test of the IDI is asymptotically equivalent to a test of statistical significance. As expected given these theoretical considerations, the power for the IDI was relatively high and comparable to the likelihood ratio test for the regression coefficient throughout, indicating that the IDI is sensitive at detecting model differences. A significant regression coefficient, or even IDI, again may not be enough to determine clinical utility.
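As a rough sketch of the quantity discussed here, the IDI can be written as the change between two models in the difference of mean predicted probability between cases and controls. The function names and toy data below are assumptions for illustration and are not part of the simulations.

```python
def discrimination_slope(y, p):
    """Mean predicted risk in cases minus mean predicted risk in controls."""
    cases = [pi for yi, pi in zip(y, p) if yi == 1]
    controls = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

def idi(y, p_old, p_new):
    """Integrated discrimination improvement: change in discrimination slope."""
    return discrimination_slope(y, p_new) - discrimination_slope(y, p_old)

# Toy data (invented)
y = [0, 0, 1, 0, 1, 1]
p_old = [0.1, 0.3, 0.2, 0.4, 0.6, 0.7]
p_new = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]

idi_value = idi(y, p_old, p_new)
```

Here the mean risk in cases rises and the mean risk in controls falls under the second model, so the IDI is positive.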
The classic Hosmer-Lemeshow test was found to have low power in all scenarios as illustrated in , which has been previously demonstrated in extensive simulations (
Hosmer and Hjort, 2002). In fact, none of the twelve goodness-of-fit tests considered by Hosmer and Hjort had adequate power to detect an omitted covariate.
Le Cessie and van Houwelingen (1991) suggest that partitioning on the covariate or ‘x’ space as suggested by
Tsiatis (1980), rather than the outcome or ‘y’ space, may help circumvent the problem. Difficulties in partitioning remain, especially when considering multiple variables. The reclassification table alternatively considers and partitions on a model including the omitted variables. Partitioning can then be done in important categories of the ‘y’ space for models both with and without the variable(s) in question.
A perfect model should display agreement of the observed and expected risk throughout the distribution or however categories are defined (
Diamond, 1992;
Gail and Pfeiffer, 2005). The usual definition of calibration often refers to the expected risk conditional on variables in the model. Ideally, however, we would like a model to reflect the true underlying probability, regardless of whether all predictor variables are in the model (
Cook, 2010). While the Hosmer-Lemeshow test has little power to detect an omitted variable, the RC test can at least indirectly test this ‘unconditional’ calibration by comparing observed and expected proportions against an omitted variable.
Relevant determination of clinical significance can be found through reclassification measures since they more directly address changes in risk strata and potential treatment. The simple percent reclassified is not a useful measure on its own. An assessment of the accuracy of the reclassifications is necessary, whether through the RC test or the NRI. If the new categories are found to be more accurate, then the percents reclassified may be helpful for clinical purposes. The percent reclassified into the various categories could be used to evaluate the impact of screening strategies or could be used for cost-benefit analyses.
We found that the type I error rates for the NRI and the RC statistic were reasonable, but the relative power varied according to the model assumptions. The primary difference between these two tests lies in their interpretation. The NRI is a test of discrimination, or separation of cases and controls. It does not assess whether the predicted probabilities are accurate or calibrated, but whether predicted risks for cases are higher than those for controls in one model vs. another. In contrast, the RC statistic directly assesses calibration rather than discrimination within categories defined by both models. These tools thus appear to provide complementary but distinct information.
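The categorical NRI described above can be sketched as the net proportion of cases moving to a higher risk category plus the net proportion of controls moving to a lower one. The function names, cut points, and toy data below are assumptions chosen for illustration only.

```python
from bisect import bisect_right

def risk_category(p, cuts):
    """Index of the risk category for probability p given sorted cut points."""
    return bisect_right(cuts, p)

def nri(y, p_old, p_new, cuts):
    """Categorical net reclassification improvement (hypothetical helper)."""
    up = {0: 0, 1: 0}
    down = {0: 0, 1: 0}
    n = {0: 0, 1: 0}
    for yi, po, pn in zip(y, p_old, p_new):
        old_cat, new_cat = risk_category(po, cuts), risk_category(pn, cuts)
        n[yi] += 1
        up[yi] += new_cat > old_cat
        down[yi] += new_cat < old_cat
    nri_cases = (up[1] - down[1]) / n[1]      # cases should move up
    nri_controls = (down[0] - up[0]) / n[0]   # controls should move down
    return nri_cases + nri_controls

# Toy data (invented): 4 cases followed by 6 controls
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_old = [0.04, 0.08, 0.15, 0.25, 0.03, 0.07, 0.12, 0.08, 0.15, 0.22]
p_new = [0.06, 0.12, 0.15, 0.30, 0.02, 0.04, 0.12, 0.11, 0.08, 0.22]
cuts = [0.05, 0.10, 0.20]  # assumed category boundaries

nri_value = nri(y, p_old, p_new, cuts)
```

Note that nothing in this calculation compares predicted with observed risk, which is why the NRI addresses discrimination rather than calibration.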
As an estimate, the NRI is dependent on the particular categories used. For example, the use of three vs. four risk categories can lead to a change in the NRI, which is more pronounced than in the test of calibration (
Cook and Ridker, 2009). In the example in Section 2.4, the NRI ranged from 2.5% with two categories to 35.1% with cut points at each observation. The RC statistic is less sensitive to the number of categories because its degrees of freedom change accordingly. If a model accurately reflects the observed outcomes, it should do so however the categories are defined. As with the Hosmer-Lemeshow statistic upon which it is based, however, there will be variability of the RC statistic with changing category definition. We believe that this represents random variability rather than the systematic bias that is displayed by the NRI and other versions of the proportion reclassified. A sensible strategy in the absence of established clinical guidelines may be to use quartiles or multiples of the base probability, which yielded similar results here. In particular, it may be reasonable to use category cut points that are one-half, equal to, and double the average population risk.
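The suggested cut points at one-half, equal to, and double the average risk could be derived as in the short sketch below. It assumes that the average population risk is estimated by the observed event proportion; the function names and data are hypothetical.

```python
from bisect import bisect_right

def risk_cut_points(y, multiples=(0.5, 1.0, 2.0)):
    """Cut points at multiples of the observed event rate (an assumed
    stand-in for the average population risk)."""
    base = sum(y) / len(y)
    return [m * base for m in multiples]

def risk_category(p, cuts):
    """Index of the risk category for probability p given sorted cut points."""
    return bisect_right(cuts, p)

# Toy data (invented): 10% event rate, so cuts near 0.05, 0.10, 0.20
y = [1] * 10 + [0] * 90
cuts = risk_cut_points(y)
category = risk_category(0.07, cuts)  # falls between the first two cuts
```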
For clinical use, individuals who fall into the intermediate categories, or “gray zone,” may be of most interest. While the original cNRI is biased, the adjusted version can be used to determine the net improvement in this group. In the simulations, estimates were slightly higher than those for the NRI when four risk categories were used. For three risk categories, there was limited reclassification in the intermediate group, so this statistic is not as useful.
We examined three versions of the reclassification calibration statistic, and found that the adjusted version led to a conservative type I error and lower power when cells were restricted to those with at least 20 observations. It was more similar to the unadjusted RC statistic when the average expected values were at least five. This latter restriction, though, may ignore important information in the smaller categories, as in the example data. The J2 statistic was conservative throughout, likely due to the extra degree of freedom charged. While theoretically such adjustments could improve the fit to the chi-square distribution, this was found not to be the case. The unadjusted RC statistic exhibited good test characteristics and general adherence to the chi-square distribution in the null situation. Thus, we recommend using the unadjusted statistic with cell sizes of at least 20.
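A minimal sketch of the unadjusted RC statistic is given below, assuming a Hosmer-Lemeshow-type form: within each cell of the reclassification table with at least 20 observations, the observed proportion is compared with the mean predicted probability from the model being evaluated, and the result is referred to a chi-square distribution with the number of cells used minus two degrees of freedom. The exact formula, the degrees of freedom, the function name, and the data are assumptions for illustration and should be checked against the definition given earlier in the paper.

```python
from bisect import bisect_right
from collections import defaultdict

def rc_statistic(y, p_eval, p_old, p_new, cuts, min_n=20):
    """Assumed Hosmer-Lemeshow-type statistic over reclassification cells.
    p_eval holds the predictions of the model whose calibration is tested.
    Returns (statistic, degrees of freedom)."""
    cells = defaultdict(list)
    for yi, pe, po, pn in zip(y, p_eval, p_old, p_new):
        key = (bisect_right(cuts, po), bisect_right(cuts, pn))
        cells[key].append((yi, pe))
    stat, k = 0.0, 0
    for members in cells.values():
        n = len(members)
        if n < min_n:          # skip sparse cells
            continue
        observed = sum(yi for yi, _ in members) / n
        expected = sum(pe for _, pe in members) / n
        stat += n * (observed - expected) ** 2 / (expected * (1 - expected))
        k += 1
    return stat, k - 2

# Toy data (invented): three cells of the 2 x 2 table are populated
y = [1] * 2 + [0] * 38 + [1] * 8 + [0] * 32 + [1] * 4 + [0] * 16
p_old = [0.05] * 40 + [0.2] * 40 + [0.05] * 20
p_new = [0.05] * 40 + [0.2] * 40 + [0.2] * 20
cuts = [0.1]

stat_old, df = rc_statistic(y, p_old, p_old, p_new, cuts)
stat_new, _ = rc_statistic(y, p_new, p_old, p_new, cuts)
```

In this constructed example the old model underestimates risk in the reclassified cell, giving a large statistic, while the new model matches the observed proportions in every cell.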
A limitation to these analyses is that all of the models presented here were fit and compared in the same data and thus the overall estimates of performance will be optimistic. Since over-fitting will likely be somewhat greater in a larger model, the estimates of power for testing differences in models will also be somewhat over-estimated. However, the underlying or true models were theoretical and no selection rules were applied, so we would not expect the degree of optimism to be very large, as previously demonstrated (
Cook and Ridker, 2009).
Note that the majority of these simulations assumed no correlation between the predictors. If a new variable is highly correlated with those already in the model, it will add little to prediction, as demonstrated by lower amounts of reclassification, lower estimates of all measures of model difference, and lower power in our simulations. If predictors are suspected of being correlated, performance measures could alternatively be estimated using residuals given the other predictors and estimates of the conditional effect on the outcome.
While use of categories in assessing calibration has been discouraged (
Hosmer, et al., 1997), such risk strata serve two purposes. First, use of risk stratification tables can illustrate the potential for changes in treatment along important points on the continuum of risk. They help assess changes in risk that may prove useful to clinicians in guiding treatment decisions, and so are readily interpretable in terms of clinical impact or utility. Reclassification was originally proposed in a clinical setting where treatment guidelines were established according to levels of projected risk (
Cook et al., 2006). While such clinical categories are not always available, the categories should be ‘important’ such that changes in category should have potential clinical impact. Second, using categories is a rough means of weighting the importance of changes in prediction. In healthy cohorts, such as the Framingham study (
Pencina et al., 2008) or the Women’s Health Study (
Ridker et al., 2005) for example, the bulk of the cohort may fall at very low risk. Doubling or even tripling risk within these low-risk individuals may lead to small changes in absolute risk with little clinical impact. Weighting by the observed distribution may thus not be optimal for assessing clinical utility. Strata that give more weight to those at higher risk may be more relevant to decision-making, whether or not the particular categories used follow established clinical criteria.
This paper has focused on the statistical characteristics of the various performance measures, particularly type I error and power. The most powerful test, however, is not necessarily the best to determine clinical utility. If it were, we would use the test of association based on the estimated log odds ratio (
Pepe et al., 2004). Similarly the c-statistic was often found to have good power even when changes were very small. Whether small differences in the c-statistic translate to clinical utility, however, remains questionable (
Cook, 2007). The focus should be on how the model will be used in practice and a corresponding performance measure. Since treatment guidelines are sometimes based on predicted risk strata as in the ATP III report (2001), a method that compares the accuracy of the predicted strata for two models would seem to be appropriate in many situations.
Ultimately, what matters most in assessing predictive models should be related to actual clinical use. Will the new model change practice or change treatment decisions for physicians? Is the cost of additional screening, whether financial or otherwise, justified by an ultimate benefit to the patient? The NRI implicitly assigns a cost of 1/p for the cases and 1/(1−p) for the controls, but could be generalized to other costs (
Pencina, et al., 2009). Since costs may change over time or by personal preference, flexible decision rules such as net benefit (
Vickers and Elkin, 2006) or relative utility curves (
Baker, et al., 2009) may be helpful in determining the best model for a given cost structure. Such considerations could also assist in determining the optimum cut points for clinical use as well as comparing one-stage vs. two-stage testing (
Baker et al., 2009). While such cost considerations may ultimately be the criteria for adopting a model into clinical practice, determining model fit and whether treatments would change remain critical first steps in evaluating clinical usefulness (
Hlatky, et al., 2009).
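As one illustration of a flexible decision rule, net benefit at a given risk threshold is commonly written as the proportion of true positives minus the proportion of false positives weighted by the odds of the threshold. The sketch below assumes that form; the function name, the threshold, and the data are hypothetical.

```python
def net_benefit(y, p, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold,
    assuming the usual form TP/n - (FP/n) * threshold / (1 - threshold)."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

# Toy data (invented): 2 cases and 8 controls
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p = [0.30, 0.08, 0.25, 0.05, 0.04, 0.03, 0.06, 0.02, 0.07, 0.01]

nb = net_benefit(y, p, threshold=0.10)
```

Evaluating this quantity over a range of thresholds for each model would allow the comparison to reflect different cost structures.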