We define r(X) as a predictive model for a disease state D in which X corresponds to one or more established clinical and biological markers. If X* designates a new marker, our goal is to assess whether X* adds to the predictive accuracy that is already available from X. Thus, our question about the value of X* involves a comparison of r(X) to r(X, X*) rather than a comparison of r(X) to r(X*). That is, we wish to know the incremental increase in accuracy due to X*. The setting in which one wants to consider the addition of a new marker to an existing set of markers is known as a nested model.
There are two general approaches that are frequently used to assess the incremental effect of a new predictor in the context of a nested model. In the first, the novel marker is added to a regression model that includes the established markers. The new marker is accepted as having predictive value if it is significantly associated with outcome in the multivariable model, after adjusting for established markers; if so, the marker is commonly referred to as an "independent" predictor of outcome. In the second approach, the area under the receiver operating characteristic (ROC) curve (AUC) or c-index is calculated for each of the two predictive models, r(X) and r(X, X*), and the two AUCs are compared. The fact that expert commentaries on this issue advocate this comparison without reference to statistical testing suggests that they intended the comparison to represent an informal judgment of the increase in accuracy [2]; that is, the recommendation is to use a comparison of AUCs for estimation. Nonetheless, there are well-known and widely used statistical tests for comparing diagnostic accuracy [7], and so investigators have increasingly elected to use these as formal tests for incremental accuracy in the context of comparing predictive models [9
We have observed that it is often the case in such reports that the novel marker has been found to be statistically significant in the multivariable model including established markers, but has been reported to have a non-significant effect on predictive accuracy on the basis of tests comparing the two AUCs. For instance, Folsom et al. [15] looked at various novel markers to predict coronary heart disease, and tested whether these markers added incremental value to a standard predictive model that included age, race, sex, total cholesterol, high density lipoprotein, blood pressure, antihypertensive medications, smoking status, and diabetes. For each marker, the authors reported both the p-value for the marker in the multivariable model and a p-value for the difference in AUC between the standard model and the standard model plus the new marker. As an example, interleukin 6 was reported to be a statistically significant independent predictor of outcome (p=0.03), but the increase in AUC, from 0.773 to 0.783, was reported to be non-significant. Similarly, Gallina and colleagues [16] investigated whether body mass index could help predict high-grade prostate cancer. They reported that although body mass index was statistically significant (p=0.001) in a multivariable model including clinical stage, prostate volume, and total and free prostate specific antigen, the increase in AUC (from 0.718 for the standard predictors to 0.725 for the standard predictors plus body mass index) was non-significant (p
In considering the contribution of a new marker in the context of established markers we are interested in the incremental improvement in predictive accuracy that the new marker can deliver. What do we mean by incremental predictive accuracy? A new predictor can only provide additional information if it is associated with the outcome, conditional on the existing predictors. Consequently, we are fundamentally interested in testing for conditional independence between the new predictor, X*, and the outcome, Y, conditional on the established predictors, X. If X* and Y are associated, conditional on X, then there is information that can be potentially utilized to improve the prediction. In other words, in constructing a test for incremental information, the conceptual null hypothesis is that there is no useful information in X* for predicting Y once the information in X is taken into account.
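For concreteness, the conceptual null hypothesis can be written as a statement of conditional independence. The notation below is introduced here for illustration only and assumes a binary outcome Y; it is a restatement of the paragraph above, not an additional assumption about the data.

```latex
H_0:\; Y \perp X^{*} \mid X
\quad\Longleftrightarrow\quad
\Pr(Y = 1 \mid X, X^{*}) = \Pr(Y = 1 \mid X)
```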
In the construction of a specific statistical test, the actual null hypothesis used can differ, even though in our context all tests are targeted fundamentally at the preceding conceptual null hypothesis. When we approach the question of the value of X* using a regression model, such as logistic regression if the outcome is a binary event or proportional hazards regression for survival-type outcomes, we are comparing the fit of the data to two different models, a null regression model in which the outcome, after transformation, has a linear relationship with X versus a model in which the addition of a linear term involving X* improves the fit. If β is the parameter representing the coefficient of X* in this model, then the null hypothesis is that β=0. This might lead to a different result from, say, a Mantel-Haenszel test of association between X* and Y, stratified by X. However, both are essentially testing the same conceptual null hypothesis, the hypothesis that there is no conditional association between X* and Y, given X, and thus no potentially useful incremental information in X* for the purposes of predicting Y.
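As a minimal sketch of the regression approach for a binary outcome, the following Python fits the null logistic model (X only) and the full model (X and X*) by Newton-Raphson, then computes likelihood ratio and Wald p-values for the null hypothesis β=0. The function names, the hand-rolled linear solver, and the toy data are illustrative assumptions; in practice a statistical package would normally be used.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def score(design, y, beta):
    """Log-likelihood, gradient and information matrix of a logistic model."""
    p = len(beta)
    loglik, grad = 0.0, [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for xi, yi in zip(design, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        mu = 1.0 / (1.0 + math.exp(-eta))
        loglik += yi * eta - math.log1p(math.exp(eta))
        for j in range(p):
            grad[j] += (yi - mu) * xi[j]
            for k in range(p):
                info[j][k] += mu * (1.0 - mu) * xi[j] * xi[k]
    return loglik, grad, info


def fit_logistic(covariates, y, iters=50):
    """Newton-Raphson fit; an intercept column is added automatically."""
    design = [[1.0] + list(row) for row in covariates]
    beta = [0.0] * len(design[0])
    for _ in range(iters):
        _, grad, info = score(design, y, beta)
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    loglik, _, info = score(design, y, beta)
    return beta, loglik, info


def incremental_tests(x, x_new, y):
    """Likelihood ratio and Wald p-values for H0: coefficient of x_new is 0."""
    _, ll_null, _ = fit_logistic([[a] for a in x], y)
    beta, ll_full, info = fit_logistic([[a, b] for a, b in zip(x, x_new)], y)
    lr_stat = max(0.0, 2.0 * (ll_full - ll_null))
    # Variance of the last coefficient: last diagonal entry of info^-1.
    var_beta = solve(info, [0.0, 0.0, 1.0])[2]
    wald_stat = beta[2] ** 2 / var_beta

    def p_chi2_1df(stat):
        # Upper tail of a chi-square with 1 df: P(|Z| > sqrt(stat)).
        return math.erfc(math.sqrt(stat / 2.0))

    return p_chi2_1df(lr_stat), p_chi2_1df(wald_stat)


# Toy data (hypothetical): both markers shifted by 0.3 among events.
random.seed(1)
y = [1 if random.random() < 0.5 else 0 for _ in range(500)]
x = [random.gauss(0.3 * yi, 1.0) for yi in y]
x_new = [random.gauss(0.3 * yi, 1.0) for yi in y]
p_lr, p_wald = incremental_tests(x, x_new, y)
print(f"LR p = {p_lr:.4f}, Wald p = {p_wald:.4f}")
```

Both statistics are referred to a chi-square distribution with one degree of freedom, since the full model adds a single parameter to the null model.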
Consider now approaching this issue in the setting of an ROC analysis. Again, there are different options for formulating the null hypothesis. A logical choice is to construct a test of the hypothesis that the ROC curve mapped by the predictor from the model r(X) is identical to the ROC curve mapped by the predictor from the model r(X, X*). Indeed, tests of this nature are available [17]. However, by far the most common approach is to focus on the areas under the ROC curves from these two models [8]. The null hypothesis is that the areas, denoted AUC(X) and AUC(X, X*), are identical. These two null hypotheses are not the same, but they both conform to our conceptual null hypothesis, namely that X* does not add incremental information to the predictive ability of the model formed using X alone. Investigators who have used this approach have typically taken the patient-specific risk predictors from the two models, and used these as data elements both for estimating the ROC curves and as data for conducting the test comparing ROC areas.
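The area comparison described above can be sketched as follows. This is a minimal, unoptimized rendering of the usual nonparametric variance estimate for the difference between two correlated AUCs, taking the patient-specific predictors from two models as input. The function names and the four-patient toy data are hypothetical, and the sketch assumes at least two events and two non-events.

```python
import math
from statistics import mean, variance


def structural_components(scores, y):
    """Per-case (V10) and per-control (V01) placement values for one model."""
    cases = [s for s, yi in zip(scores, y) if yi == 1]
    controls = [s for s, yi in zip(scores, y) if yi == 0]

    def psi(a, b):
        # 1 if the case outranks the control, 0.5 for a tie, 0 otherwise.
        return 1.0 if a > b else 0.5 if a == b else 0.0

    v10 = [sum(psi(a, b) for b in controls) / len(controls) for a in cases]
    v01 = [sum(psi(a, b) for a in cases) / len(cases) for b in controls]
    return v10, v01


def auc_difference_test(scores_a, scores_b, y):
    """Return (AUC_a, AUC_b, two-sided p-value) for H0: AUC_a = AUC_b."""
    v10_a, v01_a = structural_components(scores_a, y)
    v10_b, v01_b = structural_components(scores_b, y)
    auc_a, auc_b = mean(v10_a), mean(v10_b)
    # The variance of the AUC difference is built from the paired
    # differences of the placement values, which accounts for the
    # correlation between the two models' predictions.
    d10 = [a - b for a, b in zip(v10_a, v10_b)]
    d01 = [a - b for a, b in zip(v01_a, v01_b)]
    var = variance(d10) / len(d10) + variance(d01) / len(d01)
    if var == 0.0:
        # Degenerate case: no variability in the paired differences.
        return auc_a, auc_b, (1.0 if auc_a == auc_b else 0.0)
    z = (auc_a - auc_b) / math.sqrt(var)
    return auc_a, auc_b, math.erfc(abs(z) / math.sqrt(2.0))


# Toy data (hypothetical): predictions from two models for four patients.
outcome = [0, 0, 1, 1]
pred_base = [0.10, 0.40, 0.35, 0.80]
pred_full = [0.10, 0.20, 0.30, 0.40]
auc_base, auc_full, p_value = auc_difference_test(pred_base, pred_full, outcome)
print(auc_base, auc_full, round(p_value, 4))
```

Note that the test statistic is referred to a standard normal distribution; whether that reference distribution is appropriate when the two sets of predictions come from nested models fitted to the same data is exactly the question examined in the simulations that follow.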
To our knowledge, little work has been done to estimate the power of regression models for detecting incremental predictive accuracy in comparison to the power of corresponding tests for the AUCs. We conducted a simulation study in which the two predictors, X and X*, were generated as standard normal variables with varying levels of predictive strength, represented by means that differed depending on the binary outcome Y. The differences in means between Y = 1 and Y = 0 for X and X* are represented by μ and μ*, respectively, and were varied between 0 (i.e., the null) and 0.3. X and X* were generated both independently (i.e., with a correlation of ρ = 0.0) and with correlations of ρ = 0.1, ρ = 0.3 and ρ = 0.5. The data sets were analyzed using logistic regression, and likelihood ratio and Wald tests for the incremental contribution of X* were performed. The patient-specific predictors for each of the models were then used as data for a test comparing the two AUCs, using the popular area test proposed by DeLong et al. [8]. The algorithm used for the simulation is provided in the Appendix. The results for a study with n=500 and an outcome prevalence of 0.5 are presented in the Table. The first set of rows represents test size, i.e., the setting in which X* contributes no incremental predictive accuracy (represented by μ*=0). Here we see that both the likelihood ratio and Wald tests have test size close to the nominal 5%. By contrast, the DeLong test of the AUCs is exceptionally conservative, with a test size far below nominal. Power comparisons in the rest of the table show that the likelihood ratio test and the Wald test have similar power, but both are far superior to the AUC test. Further, the likelihood ratio and Wald tests are largely unaffected by the underlying strength of the baseline predictive model (represented by μ), while the power of the area test diminishes as the underlying AUC increases (again represented by μ). Power for all tests increases with greater correlation between X and X*.
Simulation results for n=500, prevalence at 50%.
We repeated our analyses varying the prevalence (0.2 and 0.05) and sample size (n=100). Our results were essentially unaffected. Lowering the sample size or prevalence reduced power for all analyses, but the Wald and likelihood ratio tests always had far superior power to the AUC test.
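The data-generating step of such a simulation might be sketched as below. The authors' actual algorithm is given in their Appendix and is not reproduced here; this version rests on stated assumptions: Y is drawn as a Bernoulli variable with the given prevalence, and X and X* are unit-variance normal variables with correlation ρ whose means shift by μ and μ* when Y = 1. All function names are hypothetical.

```python
import math
import random


def simulate_dataset(n, prevalence, mu, mu_star, rho, rng):
    """Draw (y, x, x_star) with unit-variance markers correlated at rho."""
    y, x, x_star = [], [], []
    for _ in range(n):
        yi = 1 if rng.random() < prevalence else 0
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # z1 and rho*z1 + sqrt(1 - rho^2)*z2 are standard normal
        # with correlation rho; the outcome shifts their means.
        x.append(mu * yi + z1)
        x_star.append(mu_star * yi + rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        y.append(yi)
    return y, x, x_star


def group_mean(values, y, group):
    """Mean of values among subjects whose outcome equals group."""
    selected = [v for v, yi in zip(values, y) if yi == group]
    return sum(selected) / len(selected)


def within_group_corr(a, b, y, group):
    """Pearson correlation of a and b among subjects with outcome == group."""
    pairs = [(u, v) for u, v, yi in zip(a, b, y) if yi == group]
    ma = sum(u for u, _ in pairs) / len(pairs)
    mb = sum(v for _, v in pairs) / len(pairs)
    cov = sum((u - ma) * (v - mb) for u, v in pairs)
    va = sum((u - ma) ** 2 for u, _ in pairs)
    vb = sum((v - mb) ** 2 for _, v in pairs)
    return cov / math.sqrt(va * vb)


# One replicate under one of the settings described in the text.
rng = random.Random(2024)
y, x, x_star = simulate_dataset(500, 0.5, 0.3, 0.3, 0.3, rng)
print("observed prevalence:", sum(y) / len(y))
print("mean shift in X:", group_mean(x, y, 1) - group_mean(x, y, 0))
print("correlation among non-events:", within_group_corr(x, x_star, y, 0))
```

In a full simulation, each replicate would be passed to the regression tests and to the AUC test, and the proportion of replicates with p < 0.05 would estimate test size (when μ*=0) or power (when μ*>0).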