The integrated discrimination improvement (IDI) index is a popular tool for evaluating the capacity of a marker to predict a binary outcome of interest. Recent reports have proposed that the IDI is more sensitive than other metrics for identifying useful predictive markers. In this article, the authors use simulated data sets and theoretical analysis to investigate the statistical properties of the IDI. The authors consider the common situation in which a risk model is fitted to a data set with and without the new, candidate predictor(s). Results demonstrate that the published method of estimating the standard error of an IDI estimate tends to underestimate the error. The z test proposed in the literature for IDI-based testing of a new biomarker is not valid, because the null distribution of the test statistic is not standard normal, even in large samples. If a test for the incremental value of a marker is desired, the authors recommend the test based on the model. For investigators who find the IDI to be a useful measure, bootstrap methods may offer a reasonable option for inference when evaluating new predictors, as long as the added predictive capacity is large.
Various metrics have been proposed for quantifying the predictive ability of a classification model or quantifying the incremental value of a new biomarker or predictor (1). The most common single-number summary of the ability of a classification tool to discriminate between cases and controls is the area under the receiver operating characteristic curve (AUC), also known as the c index. To quantify the incremental value of a new marker, one can use the improvement in the AUC when the marker is added to an existing classification model. However, the AUC has been widely criticized because it does not measure a clinically meaningful quantity (2, 3). There is also concern that the AUC is “insensitive” and does not demonstrate the value of new markers that are useful for prediction (2). Recently, several investigators proposed measures of incremental value that examine the extent to which a new marker reclassifies subjects (2, 4). However, such measures can be sensitive to arbitrary boundaries delineating discrete categories of risk (5).
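Pencina et al. (4) proposed the integrated discrimination improvement (IDI) index as complementary to the AUC. The IDI is defined as

$$\mathrm{IDI} = (\mathrm{IS}_{\text{new}} - \mathrm{IS}_{\text{old}}) - (\mathrm{IP}_{\text{new}} - \mathrm{IP}_{\text{old}}). \tag{1}$$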
In this equation, IS is the integral of sensitivity over all possible cutoff values and IP is the corresponding integral of “1 minus specificity.” In equation 1, “new” refers to the classification model that includes the new biomarker and “old” refers to the classification model that does not. Pencina et al. (4) provide the following estimator for the IDI:

$$\widehat{\mathrm{IDI}} = \left(\bar{\hat{p}}_{\text{new,events}} - \bar{\hat{p}}_{\text{old,events}}\right) - \left(\bar{\hat{p}}_{\text{new,nonevents}} - \bar{\hat{p}}_{\text{old,nonevents}}\right). \tag{2}$$

In equation 2, each $\bar{\hat{p}}$ is an average of estimated probabilities of an event under the model named in its subscript. An average is taken over the people in the sample who experienced events (“events”), and an average is taken over those who did not experience an event (“nonevents”). In other words, events are cases and nonevents are controls. Use of the IDI can be motivated from multiple perspectives (3, 6–9). Perhaps the simplest motivation for the IDI is that a useful marker leads to increased estimated risks of disease for cases and decreased estimated risks for controls. If the new marker contributes to risk prediction, the first term of equation 2 will be large in the positive direction and the second term will be large in the negative direction; subtracting them produces a large IDI.
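As a minimal sketch of equation 2, the estimate can be computed directly from two vectors of fitted risks and the outcome indicator. The function name `idi_estimate` and the toy numbers below are illustrative assumptions, not part of the published method.

```python
import numpy as np

def idi_estimate(p_old, p_new, d):
    """Estimate the IDI from fitted risks, following equation 2.

    p_old, p_new: estimated event probabilities under the old and new models.
    d: 0/1 outcome indicator (1 = event, 0 = nonevent).
    """
    p_old = np.asarray(p_old, dtype=float)
    p_new = np.asarray(p_new, dtype=float)
    events = np.asarray(d).astype(bool)
    # Change in mean estimated risk among events (positive for a useful marker).
    gain_events = p_new[events].mean() - p_old[events].mean()
    # Change in mean estimated risk among nonevents (negative for a useful marker).
    gain_nonevents = p_new[~events].mean() - p_old[~events].mean()
    return gain_events - gain_nonevents

# Hypothetical toy data: the new model raises risks for events and lowers
# them for nonevents.
d = [1, 1, 0, 0]
p_old = [0.5, 0.5, 0.5, 0.5]
p_new = [0.7, 0.9, 0.2, 0.4]
idi_toy = idi_estimate(p_old, p_new, d)
print(idi_toy)
```

In the toy data the mean risk among events rises by 0.3 and the mean risk among nonevents falls by 0.2, so the estimate is about 0.5.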
Pencina et al. (4) give an example of using the IDI to evaluate the incremental value of a marker. Two regression models are fitted to a data set, with and without the new marker. Each regression model yields estimated risks of disease for every individual, case and control, in the data set. The estimated risks from the 2 fitted models are averaged appropriately, and $\widehat{\mathrm{IDI}}$ is computed for the data set using equation 2. Although Pencina et al. (4) do not use logistic regression in their example, we expect this to be a common choice in practice, and we use logistic regression throughout most of this paper.
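To test the null hypothesis that IDI = 0, Pencina et al. (4) provide the test statistic

$$z_{\mathrm{IDI}} = \frac{\widehat{\mathrm{IDI}}}{\sqrt{\widehat{\mathrm{se}}_{\text{events}}^{2} + \widehat{\mathrm{se}}_{\text{nonevents}}^{2}}}. \tag{3}$$

In equation 3, $\widehat{\mathrm{se}}_{\text{events}}$ is the standard error of paired differences of new and old model-based predicted probabilities among cases; $\widehat{\mathrm{se}}_{\text{nonevents}}$ is the corresponding standard error among controls. Pencina et al. (4) conjecture that zIDI is asymptotically standard normal under the null hypothesis that the new biomarker does not contribute to prediction.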
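The statistic in equation 3 can be sketched in a few lines. The function name `z_idi` and the toy inputs are assumptions for illustration, and the use of the sample standard deviation with `ddof=1` is one reasonable reading of "standard error of paired differences" rather than a detail confirmed by the text.

```python
import numpy as np

def z_idi(p_old, p_new, d):
    """Compute the z statistic of equation 3 from fitted risks."""
    diff = np.asarray(p_new, dtype=float) - np.asarray(p_old, dtype=float)
    events = np.asarray(d).astype(bool)
    idi = diff[events].mean() - diff[~events].mean()
    # Standard errors of the paired differences within each group
    # (ddof=1 is an assumption about the intended estimator).
    se_events = diff[events].std(ddof=1) / np.sqrt(events.sum())
    se_nonevents = diff[~events].std(ddof=1) / np.sqrt((~events).sum())
    return idi / np.sqrt(se_events**2 + se_nonevents**2)

# Hypothetical toy data with three events and three nonevents.
d = [1, 1, 1, 0, 0, 0]
p_old = [0.5] * 6
p_new = [0.6, 0.7, 0.8, 0.4, 0.3, 0.2]
z_toy = z_idi(p_old, p_new, d)
print(round(z_toy, 3))
```

This only reproduces the arithmetic of the statistic; whether it can be referred to a standard normal distribution is the question examined in the rest of the paper.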
Not all investigators agree that the IDI is a major improvement over the AUC as a measure of incremental value. Greenland (8) comments that the IDI, like the AUC, incorporates information that is irrelevant. That is, both measures summarize the entire receiver operating characteristic curve, including regions where false-positive or false-negative rates are unacceptable. Chi and Zhou (6) fault the IDI for putting equal weight on sensitivity and specificity, when the relative importance of sensitivity and specificity varies with the objective. Mihaescu et al. (10) comment that the IDI, like the AUC, is a measure of clinical validity rather than clinical utility. Without endorsing the AUC, we note that most researchers have enough experience with the AUC to interpret the measure and to know when an AUC value is “large.” It is not clear whether the same holds for the IDI. On the other hand, the IDI has become increasingly popular in predictive modeling research. In a scientific statement from the American Heart Association, Hlatky et al. noted that “the IDI test appears to be more powerful than the c index” for establishing that a new biomarker has positive incremental value (11, p. 2411). On February 17, 2011, 353 articles in the Science Citation Index referenced the article by Pencina et al. (4). Many of these authors used the IDI or the test statistic zIDI as supporting evidence in favor of a proposed biomarker.
In this article, we sidestep the debate on the inherent value of the IDI as a measure and focus instead on the statistical properties of the IDI. The popularity of the IDI warrants further investigation of its behavior, particularly in the common situation in which the “new” and “old” risk models are estimated using the same set of data. Pepe et al. (12) raised concerns that the denominator of equation 3 is an underestimate of the standard error of $\widehat{\mathrm{IDI}}$. We investigate this particular question, as well as the sampling distribution of $\widehat{\mathrm{IDI}}$. We provide empirical and theoretical evidence that $\widehat{\mathrm{IDI}}$ is approximately normal only for large values of the IDI. In particular, we show that the test statistic zIDI does not have a standard normal distribution under the null hypothesis that IDI = 0, and thus the test based on zIDI is not valid.
We used both simulation and statistical theory to explore the sampling distribution of $\widehat{\mathrm{IDI}}$ and the null distribution of zIDI. Throughout this paper, we consider the behavior of $\widehat{\mathrm{IDI}}$ in the common situation where “old” and “new” nested risk models are fitted to the same data set.
We employed multiple schemes for simulating data. We always use D to denote the binary variable indicating the outcome, that is, disease status. Y denotes established (“old”) predictors. Candidate (“new”) predictors are denoted with W, W1, or W2.
We simulated the log odds of disease according to a logistic risk model in which we think of age as the established predictor Y and cardiovascular disease as the outcome D. In our simplest simulation model, there is a single candidate predictor W:
We also consider scenarios in which there are 2 candidate predictors W1 and W2:
We simulated Y as N(65, 10) and independently simulated each of W, W1, and W2 as N(0, 1). These simulation parameters yield an event rate of approximately 5% when γ = 0 or γ1 = γ2 = 0. Using simulated data, we computed risks of disease using equation 4 or equation 5, and we simulated disease statuses from each risk independently using a Bernoulli distribution. If a γ parameter equals zero, then the corresponding W has no predictive value. If a γ parameter is not zero, then the corresponding W is predictive, although its incremental value depends, of course, on the magnitude of its coefficient.
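A minimal sketch of this simulation scheme follows. The intercept and the coefficient of Y (`B0`, `B1`) are hypothetical values chosen only so that the event rate is roughly 5% when γ = 0; they are not taken from the article. The code also assumes that the 10 in N(65, 10) is a standard deviation.

```python
import numpy as np

# Hypothetical coefficients (not from the article), chosen so that the
# event rate is roughly 5% when gamma = 0.
B0, B1 = -6.3, 0.05

def simulate_logistic(n, gamma, rng):
    """Simulate (Y, W, D) under logit P(D=1 | Y, W) = B0 + B1*Y + gamma*W."""
    y = rng.normal(65.0, 10.0, size=n)   # assumes 10 is the standard deviation
    w = rng.normal(0.0, 1.0, size=n)     # candidate predictor, independent of Y
    risk = 1.0 / (1.0 + np.exp(-(B0 + B1 * y + gamma * w)))
    d = rng.binomial(1, risk)            # independent Bernoulli draws
    return y, w, d

rng = np.random.default_rng(2011)
y, w, d = simulate_logistic(100_000, gamma=0.0, rng=rng)
print(f"event rate with gamma = 0: {d.mean():.3f}")
```

Setting `gamma` to a nonzero value makes W predictive, with incremental value that grows with the magnitude of the coefficient.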
The alternative logistic simulation model was designed to mimic situations in which the established predictor is not very predictive. The simulation model is similar to equation 4:
Y is randomly generated from a standard exponential distribution, and W is independently generated from a Poisson distribution with mean 4. As in the previous logistic simulation model, the prevalence is approximately 5% when γ = 0.
In this simulation, we start with a real data set from a clinical trial for prevention of mother-to-child transmission of human immunodeficiency virus (HIV) (13). Table 1 gives basic descriptive statistics for this data set. Of the 1,882 deliveries by HIV-infected women recorded in this data set, 8% of the infants had a positive HIV test at birth. There is an established predictor Y, which is maternal viral load at 20–24 weeks of gestation. A higher viral load means that the mother has more copies of the virus circulating in her blood and is modestly predictive of whether she will transmit HIV to her child during pregnancy or delivery. We consider the mother's age as the candidate predictor. One would not expect the age of an HIV-infected pregnant woman to predict whether her infant will be born with HIV infection. In each simulated data set, we randomly permute mother's age, ensuring no predictive ability of the “new” predictor W.
For simulated data sets, we fit logistic regression models to the data set with and without the “new” predictors. In other words, for a given simulated data set, we fit the model

$$\operatorname{logit} P(D = 1 \mid Y) = \beta_0 + \beta_1 Y \tag{7}$$

to estimate risks of disease using only the established predictor Y. If there is a single candidate predictor, we fit the model

$$\operatorname{logit} P(D = 1 \mid Y, W) = \beta_0 + \beta_1 Y + \beta_2 W$$

to estimate the risks of disease using both the established and candidate predictors. We used the “new” and “old” estimated risks to compute $\widehat{\mathrm{IDI}}$ and zIDI.
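The fitting step can be sketched without a statistics package by using Newton-Raphson for logistic regression. Everything named here (`fit_risks`, `nested_idi`, the coefficients `B0` and `B1`, and the choice γ = 1) is a hypothetical illustration under the assumptions stated in the comments, not the authors' code.

```python
import numpy as np

B0, B1 = -6.3, 0.05  # hypothetical simulation coefficients (not from the article)

def fit_risks(X, d, n_iter=30):
    """Fit logistic regression by Newton-Raphson; return fitted probabilities."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, X.T @ (d - p))
        beta += step
        if np.max(np.abs(step)) < 1e-9:
            break
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def nested_idi(y, w, d):
    """Fit the old (Y only) and new (Y and W) models on the same data; return the IDI estimate."""
    yc = (y - y.mean()) / y.std()        # rescaling Y does not change fitted risks
    X_old = np.column_stack([np.ones(len(d)), yc])
    X_new = np.column_stack([X_old, w])
    diff = fit_risks(X_new, d) - fit_risks(X_old, d)
    events = d == 1
    return diff[events].mean() - diff[~events].mean()

rng = np.random.default_rng(0)
n, gamma = 5000, 1.0                     # gamma = 1 makes W strongly predictive
y = rng.normal(65.0, 10.0, n)
w = rng.normal(0.0, 1.0, n)
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(B0 + B1 * y + gamma * w)))).astype(float)
idi_hat = nested_idi(y, w, d)
print(f"estimated IDI: {idi_hat:.4f}")
```

Both models are fitted to the same data set, which is the setting the paper studies.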
In the original IDI paper, Pencina et al. (4) proposed the IDI for comparing 2 nested models. In particular, the proposal does not limit the IDI to evaluating a single candidate marker. In fact, the IDI has been used to evaluate multiple markers as a set (14, 15). In a similar spirit, we used logistic simulations with 2 new predictors to compare equation 7 and the larger model

$$\operatorname{logit} P(D = 1 \mid Y, W_1, W_2) = \beta_0 + \beta_1 Y + \beta_2 W_1 + \beta_3 W_2. \tag{8}$$
For simulations with a single candidate predictor W, we also consider the following 2 nested prediction models:
Thus, in the larger model there is a single candidate predictor but the 2 fitted models differ by 2 degrees of freedom (df). As before, we used the “new” and “old” estimated risks to compute $\widehat{\mathrm{IDI}}$ and zIDI.
Using the logistic simulation model and setting γ = 0, we simulated data sets with a useful predictor of disease and a candidate predictor of disease that has no predictive capacity. Similarly, we used the HIV simulation model to generate data sets in which a candidate predictor had no incremental value.
First, we investigated the accuracy of the standard error estimate used in equation 3. For a given sample size, we simulated 10,000 data sets using the logistic simulation model with γ = 0 and computed $\widehat{\mathrm{IDI}}$. The standard deviation of $\widehat{\mathrm{IDI}}$ across these 10,000 simulations estimates the standard error of $\widehat{\mathrm{IDI}}$ under the null hypothesis that IDI = 0. For each simulated data set, we also computed the estimated standard error of $\widehat{\mathrm{IDI}}$ using the formula in the denominator of equation 3. In Figure 1, we compare our empirical estimate of the standard error of $\widehat{\mathrm{IDI}}$ with the estimate used in equation 3 by dividing the latter by the former. We see that the standard error estimate in equation 3 is, on average, about half as large as it should be. The magnitude of the bias is (perhaps remarkably) stable as sample size increases. Our results confirm the suspicion of Pepe et al. (12) that the standard error estimate used in equation 3 is an underestimate of the standard error of $\widehat{\mathrm{IDI}}$.
We also investigated more fully the sampling distribution of $\widehat{\mathrm{IDI}}$ and zIDI when the null hypothesis, IDI = 0, is true. The top row of Figure 2 shows the results for the logistic simulation model (γ = 0). The bottom row shows the results for the HIV simulation model. Results are based on 10,000 simulations of data sets of size n = 1,500 for the logistic model and n = 1,882 for the HIV simulation model. The null distribution of $\widehat{\mathrm{IDI}}$ is highly nonsymmetric, with a long right tail. The strong positive skewness in the distribution results from the fact that the 2 components of $\widehat{\mathrm{IDI}}$ have a strong negative correlation. Pepe et al. (12) also pointed out that the IDI is equal to the proportion of explained variation, which is either always or predominantly positive, depending on the type of regression model. That is, adding a new variable to a set of predictors rarely decreases the proportion of explained variation (and never decreases the proportion of explained variation in linear regression). The null distribution of zIDI is more symmetric but is centered away from zero and is not standard normal. Other simulation models gave very similar results (data not shown).
We also studied the sampling distribution of $\widehat{\mathrm{IDI}}$ and zIDI for 2-df IDIs. For the logistic simulation model, the larger model is equation 8, and for the HIV simulation model, the larger model is equation 9. Results are shown in Figure 3. Compared with Figure 2, Figure 3 shows that $\widehat{\mathrm{IDI}}$ is more prominently skewed toward positive values and the distribution of zIDI is further shifted to the right in comparison with a standard normal curve.
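The comparison of the formula-based standard error with the empirical one can be sketched as a small Monte Carlo experiment. This is a scaled-down illustration (hundreds of data sets rather than 10,000), with hypothetical simulation coefficients `B0` and `B1`, a hand-written Newton-Raphson fit, and helper names (`fit_risks`, `idi_and_se`, `se_ratio`) that are assumptions of this sketch.

```python
import numpy as np

B0, B1 = -6.3, 0.05  # hypothetical coefficients (not from the article)

def fit_risks(X, d, n_iter=30):
    """Fit logistic regression by Newton-Raphson; return fitted probabilities."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, X.T @ (d - p))
        beta += step
        if np.max(np.abs(step)) < 1e-9:
            break
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def idi_and_se(rng, n):
    """One null data set (gamma = 0): IDI estimate and the equation 3 standard error."""
    y = rng.normal(65.0, 10.0, n)
    w = rng.normal(0.0, 1.0, n)          # W has no predictive value
    d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(B0 + B1 * y)))).astype(float)
    X_old = np.column_stack([np.ones(n), (y - 65.0) / 10.0])
    X_new = np.column_stack([X_old, w])
    diff = fit_risks(X_new, d) - fit_risks(X_old, d)
    ev = d == 1
    idi = diff[ev].mean() - diff[~ev].mean()
    se = np.sqrt(diff[ev].var(ddof=1) / ev.sum() + diff[~ev].var(ddof=1) / (~ev).sum())
    return idi, se

def se_ratio(n_sims, n, seed):
    """Average formula-based SE divided by the empirical SD of the IDI estimates."""
    rng = np.random.default_rng(seed)
    res = np.array([idi_and_se(rng, n) for _ in range(n_sims)])
    return res[:, 1].mean() / res[:, 0].std(ddof=1)

ratio = se_ratio(n_sims=400, n=1000, seed=1)
print(f"formula SE / empirical SE: {ratio:.2f}")
```

A ratio well below 1 indicates that the formula understates the sampling variability. With so few replicates the ratio is itself noisy, so it should be read as a rough check of the pattern reported in Figure 1, not a reproduction of it.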
We have seen that the null distribution of zIDI is not standard normal (Figures 2 and 3). What is the implication for investigators attempting to use zIDI to evaluate a new biomarker? We used the logistic simulation model with γ = 0 to investigate the type I error (false-positive) rate of the zIDI test. Suppose an investigator uses zIDI to conduct a 2-sided hypothesis test of H0: IDI = 0 for a single biomarker and a 1-df difference between the “new” and “old” predictive models. It turns out that the zIDI test is slightly conservative. A nominal 5%-level test uses a cutoff of 1.96; the true size of the test is actually slightly smaller, approximately 3.9%.
The IDI is a measure of the improvement in prediction. As previously noted (14), a 2-sided hypothesis test is not appropriate when interest is in markers that improve prediction. If one uses an IDI-based hypothesis test to evaluate a new biomarker, an appropriate test is 1-sided—that is, H0: IDI = 0 vs. H1: IDI > 0. Performing the test by comparing zIDI with a standard normal distribution, the cutoff 1.96 nominally corresponds to a 2.5%-level 1-sided test. The actual type I or false-positive error rate is approximately 3.9%. An intended α level of 5% corresponds to an actual α level of approximately 9.3%.
We also considered the case in which 2 df separate the “new” and “old” predictive models. In this case, both 1-sided and 2-sided hypothesis tests are anticonservative, with higher false-positive rates than the nominal levels. Figure 4 illustrates the results described above.
Using the logistic simulation model, we also simulated data sets where the new predictor W has some predictive value by choosing γ ≠ 0. We examined a range of values of γ. As before, we computed $\widehat{\mathrm{IDI}}$ for each simulated data set.
Figure 5 shows estimated sampling distributions of $\widehat{\mathrm{IDI}}$ for a range of γ values. (Results are shown in 2 plots because of the drastically different scales for the distributions for small and large γ.) For small values of γ, $\widehat{\mathrm{IDI}}$ has a severe right skewness, as we saw in Figure 2. For larger values of γ, $\widehat{\mathrm{IDI}}$ has a fairly symmetric distribution. To help interpret these results, Table 2 (first row of data) provides the average P value for the estimated coefficient of W in the fitted logistic regression model. With γ = 0.4, W is a marginally significant predictor according to this metric.
The extremely nonnormal empirical distribution of $\widehat{\mathrm{IDI}}$ is surprising, so we investigated the distribution analytically in a simplified scenario to help explain the simulation findings. The formulation of the IDI by Pencina et al. (4) does not restrict how the risk models are to be fitted, so we examined the distribution that arose when the risk scores were fitted by linear regression. This would be an unusual choice in practice, but it is convenient here because it allows us to derive simple formulas for the risk scores. In contrast, logistic regression models are fitted to data using iterative algorithms, and there are not simple formulas for model parameters as a function of the data. However, since the computational algorithms for logistic regression use iterative weighted linear regression, we would expect the distribution of $\widehat{\mathrm{IDI}}$ based on linear regression to be a good guide to the distribution based on logistic regression (at least when prediction is weak). Our analytic results for linear regression explain both the asymmetric null distribution of $\widehat{\mathrm{IDI}}$ and the underestimation of its standard error.
Without loss of generality, the “old” model contains a single variable Y and the “new” model additionally includes a variable W that is independent of Y and with mean zero (otherwise, replace Y by Yβ and W by W – E[W|Y]). We show in the Appendix that
where $\hat{\gamma}$ is the estimated coefficient of W in the fitted model and ρ is the prevalence of disease. Under the strong null hypothesis that both marker Y and marker W have no predictive value,
where n is the sample size. Under the more general null hypothesis that Y is a useful predictor, we need to know Var[D|Y], which we denote σ². This is the scenario we are most interested in: the incremental value of W above and beyond an existing predictor Y. In this case, we have
Equation 10 allows us to use well-established results about parameter estimates in linear models to understand the distribution of $\widehat{\mathrm{IDI}}$. Under the alternative hypothesis (γ ≠ 0), $\widehat{\mathrm{IDI}}$, suitably scaled, has a noncentral chi-squared distribution with a noncentrality parameter increasing with n and γ. As the noncentrality parameter increases, the distribution gets closer to normal, but the normal approximation is only good in situations where the power of the test for γ = 0 is high. Since
for large γ or n, the distribution will eventually be centered around γ² with a normal distribution and with variance proportional to γ².
Our simulation studies show that these results for the linear model hold approximately for logistic regression. In the first part of the Results section, we saw that the null distribution of $\widehat{\mathrm{IDI}}$ has a chi-square-shaped distribution and the sampling distribution appears approximately normal for IDI away from zero.
An interesting result applies to a scenario in which risk models have been estimated using a separate set of training data. If a case-control validation sample is taken to estimate the IDI using the existing (fixed) risk models, then the formula for estimating the standard error of $\widehat{\mathrm{IDI}}$ provided by Pencina et al. (4) turns out to be correct (Appendix).
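The linear-regression case can be checked numerically. The sketch below fits both models by least squares and compares the IDI estimate with a closed form derived for this illustration, $\hat{\gamma}^2$ times the mean squared residual of W (after regressing W on an intercept and Y), divided by $\hat{\rho}(1 - \hat{\rho})$. The closed form is this sketch's own algebra under the stated assumptions and may differ in presentation from the authors' equation 10; the simulation settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.normal(size=n)                    # established predictor (standardized here)
w = rng.normal(size=n)                    # candidate predictor
d = rng.binomial(1, 0.2, size=n).astype(float)  # strong null: neither Y nor W predicts D

X_old = np.column_stack([np.ones(n), y])
X_new = np.column_stack([X_old, w])

# Least-squares fits of the old and new linear risk models.
b_old = np.linalg.lstsq(X_old, d, rcond=None)[0]
b_new = np.linalg.lstsq(X_new, d, rcond=None)[0]
diff = X_new @ b_new - X_old @ b_old

events = d == 1
idi_hat = diff[events].mean() - diff[~events].mean()

# Residualize W on (1, Y) so that it is centered and uncorrelated with Y in the sample.
w_res = w - X_old @ np.linalg.lstsq(X_old, w, rcond=None)[0]
gamma_hat = b_new[2]
rho_hat = d.mean()
closed_form = gamma_hat**2 * np.mean(w_res**2) / (rho_hat * (1.0 - rho_hat))

print(idi_hat, closed_form)
```

Because the closed form is a squared coefficient times positive quantities, the estimate cannot be negative in the linear case, which is consistent with the chi-square-shaped null distribution described above.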
Bootstrapping is a popular method with which to make inferences about a parameter using an estimator whose sampling distribution is not well characterized. Unfortunately, bootstrapping is not always a reliable method for making inferences about the IDI. Figure 5 and Figure 6 show why. The sampling distribution of $\widehat{\mathrm{IDI}}$ changes rapidly in shape and scale as IDI approaches zero. The bootstrap estimates the sampling distribution of $\widehat{\mathrm{IDI}}$ under conditions as they exist in the sample. If the true IDI in the population is zero, then $\widehat{\mathrm{IDI}}$ in the sample will typically be positive, and the sampling distribution under conditions in the sample will be substantially different from the sampling distribution under the true, zero, IDI. The bootstrap distribution will be more symmetrical, more spread out, and shifted to the right compared with the true sampling distribution.
The third and sixth rows in Table 2 show that the bootstrap has an anticonservative bias when the true incremental value of a marker is null. In particular, for the alternative logistic simulation model, the anticonservatism of the bootstrap was severe, with only 74.1% of nominal 95% bootstrap confidence intervals covering the true IDI value of zero. We obtained similar results when 2 new markers were simultaneously evaluated with the IDI, with unreliable, anticonservative inferences for small values of the IDI (Table 3).
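For readers who want to see what such a bootstrap looks like, here is a minimal percentile-interval sketch in which both models are refitted in every resample. The helper names, the simulation coefficients `B0` and `B1`, the choice γ = 1, and the number of resamples are all assumptions of this illustration. The example deliberately uses a strongly predictive marker, because the results above indicate that intervals of this kind are unreliable when the true IDI is near zero.

```python
import numpy as np

B0, B1 = -6.3, 0.05  # hypothetical simulation coefficients (not from the article)

def fit_risks(X, d, n_iter=30):
    """Fit logistic regression by Newton-Raphson; return fitted probabilities."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, X.T @ (d - p))
        beta += step
        if np.max(np.abs(step)) < 1e-9:
            break
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

def idi_from_data(y, w, d):
    """Refit the old and new models on the given data and return the IDI estimate."""
    yc = (y - y.mean()) / y.std()
    X_old = np.column_stack([np.ones(len(d)), yc])
    X_new = np.column_stack([X_old, w])
    diff = fit_risks(X_new, d) - fit_risks(X_old, d)
    ev = d == 1
    return diff[ev].mean() - diff[~ev].mean()

def bootstrap_ci(y, w, d, n_boot, rng, level=0.95):
    """Percentile bootstrap interval, resampling individuals with replacement."""
    n = len(d)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)
        reps[b] = idi_from_data(y[idx], w[idx], d[idx])
    alpha = (1.0 - level) / 2.0
    return np.quantile(reps, [alpha, 1.0 - alpha])

rng = np.random.default_rng(3)
n, gamma = 2000, 1.0                      # strongly predictive marker
y = rng.normal(65.0, 10.0, n)
w = rng.normal(0.0, 1.0, n)
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(B0 + B1 * y + gamma * w)))).astype(float)
lo, hi = bootstrap_ci(y, w, d, n_boot=200, rng=rng)
print(f"95% percentile interval for the IDI: ({lo:.3f}, {hi:.3f})")
```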
In this paper, we investigated IDI as a measure of the incremental value of a biomarker. In our simulation studies, the published formula for estimating the standard error of $\widehat{\mathrm{IDI}}$ tended to underestimate the true standard error by a factor of approximately 2. Moreover, the sampling distribution of $\widehat{\mathrm{IDI}}$ for a marker with no predictive value is strongly skewed toward positive values. We also considered testing the null hypothesis H0: IDI = 0. The null distribution of the proposed z statistic does not follow a standard normal distribution. For evaluating the incremental value of a single biomarker, 2-sided hypothesis testing using the z test is conservative. More appropriate 1-sided hypothesis testing is anticonservative, meaning that the IDI z test is prone to giving false-positive results.
Most of the empirical results we have presented involved fitting logistic regression models to data simulated under a logistic model. This is an idealized situation where the exactly correct model is fitted to the data and used to estimate risks and the IDI. The fact that the sampling distribution of $\widehat{\mathrm{IDI}}$ in such a highly idealized situation did not conform to the expectations set out by Pencina et al. (4) does not bode well for its behavior with real data.
Our empirical and theoretical results indicate that a valid test of H0: IDI = 0 that is based on $\widehat{\mathrm{IDI}}$ will be very difficult to develop. However, the hypothesis H0: IDI = 0 is equivalent to H0: P(D|Y, W) = P(D|Y), where W is the candidate biomarker and Y is the set of existing predictors (12). This is fortunate, because it means that an IDI-based test is unnecessary. Therefore, if a test of positive incremental value is desired, we recommend using a test based on the model. For example, if a regression function is used for risk modeling, then the likelihood ratio test for the coefficient of W in the risk model can be used to test the null hypothesis H0: P(D|Y, W) = P(D|Y). The likelihood ratio test is implemented in all major statistical packages, can be applied to single markers or sets of markers, and is the uniformly most powerful test.
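A likelihood ratio test for a single candidate marker in a logistic risk model can be sketched as follows. In practice one would use a statistical package; this hand-written version, with hypothetical helper names and simulation coefficients, is only meant to show the computation. For 1 df, the chi-squared tail probability equals erfc(sqrt(LR/2)), which avoids any dependency beyond the standard library and NumPy.

```python
import math
import numpy as np

B0, B1 = -6.3, 0.05  # hypothetical simulation coefficients (not from the article)

def fit_loglik(X, d, n_iter=30):
    """Fit logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, X.T @ (d - p))
        beta += step
        if np.max(np.abs(step)) < 1e-9:
            break
    eta = X @ beta
    return np.sum(d * eta - np.logaddexp(0.0, eta))

def lr_test_1df(y, w, d):
    """Likelihood ratio test of H0: coefficient of W = 0, given Y (1 df)."""
    yc = (y - y.mean()) / y.std()
    X_old = np.column_stack([np.ones(len(d)), yc])
    X_new = np.column_stack([X_old, w])
    lr = max(2.0 * (fit_loglik(X_new, d) - fit_loglik(X_old, d)), 0.0)
    p_value = math.erfc(math.sqrt(lr / 2.0))  # upper tail of chi-squared with 1 df
    return lr, p_value

rng = np.random.default_rng(7)
n, gamma = 3000, 1.0                      # W is strongly predictive in this example
y = rng.normal(65.0, 10.0, n)
w = rng.normal(0.0, 1.0, n)
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(B0 + B1 * y + gamma * w)))).astype(float)
lr_stat, p_val = lr_test_1df(y, w, d)
print(f"LR statistic = {lr_stat:.1f}, P = {p_val:.3g}")
```

For a set of k markers the same statistic would be compared with a chi-squared distribution with k df, which requires a general chi-squared tail function instead of the 1-df shortcut used here.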
In certain cases in practice, IDI-based tests of the predictiveness of a novel biomarker give small P values, whereas tests based on regression coefficients or the AUC are far from significant. For example, see Table III in the article by Criqui et al. (16), Table 2 in the article by Blankenberg et al. (17), or Table 3 in the article by Lin et al. (18). Since all tests evaluate the same null hypothesis, a tempting conclusion is that the IDI-based test is more powerful than the others (11). Unfortunately, the results in this paper lead to an alternate explanation, namely that IDI-based results are inconsistent with the other results because the test based on zIDI is not valid.
We remind readers that the value of hypothesis testing in evaluating new biomarkers is, at best, limited. The real challenge in biomarker research is to identify markers with a predictive capacity that is substantial enough to improve clinical practice. The motivation for the development of the IDI still stands: to find measures that quantify the incremental value in a meaningful way. For investigators who find the IDI to be a useful measure, bootstrapping to obtain confidence intervals may offer a reasonable option for inference, as long as the true IDI is well away from zero.
Author affiliations: Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington (Kathleen F. Kerr, Robyn L. McClelland, Elizabeth R. Brown, Thomas Lumley).
K. F. K. was supported by sabbatical funding from the University of Washington. E. R. B. was supported by National Institutes of Health grant R01 HL095126, and T. L. was supported by National Institutes of Health grant R01 HL080295.
Conflict of interest: none declared.
We consider adding a single new variable W that is fitted by linear regression. There is no loss of generality in assuming that the “old” model contains a single variable Y, so the “new” model contains W and Y. We can also assume that W has a mean value of zero and is uncorrelated with Y in the sample (otherwise, replace Y by Yβ and W by W − E[W|Y]).
The last equality holds because in any generalized linear model with an intercept, the residuals sum to zero.
where $\hat{\gamma}$ is the coefficient of the proposed marker W in the “new” model.
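The relationship between the IDI estimate and the coefficient of W can be checked with a few lines of algebra. The following is a sketch under the assumptions stated above (W has sample mean zero and is uncorrelated with Y in the sample, and both models are fitted by least squares with an intercept). The symbols $n_1$ and $n_0$ (numbers of cases and controls), $\bar{W}_1$ and $\bar{W}_0$ (means of W among cases and controls), $\hat{\rho} = n_1/n$, and $S_W = \sum_i W_i^2$ are introduced here for illustration and may not match the authors' notation.

```latex
\begin{align*}
\hat{p}_{\text{new},i} - \hat{p}_{\text{old},i} &= \hat{\gamma} W_i,
  \qquad
  \hat{\gamma} = \frac{\sum_i W_i D_i}{S_W} = \frac{n_1 \bar{W}_1}{S_W}, \\
\sum_i W_i = 0 \;&\Longrightarrow\; \bar{W}_0 = -\frac{n_1}{n_0}\,\bar{W}_1, \\
\widehat{\mathrm{IDI}} &= \hat{\gamma}\,\bigl(\bar{W}_1 - \bar{W}_0\bigr)
  = \hat{\gamma}\,\bar{W}_1\,\frac{n}{n_0}
  = \hat{\gamma}^{2}\,\frac{S_W\, n}{n_1 n_0}
  = \frac{\hat{\gamma}^{2}\, S_W / n}{\hat{\rho}\,(1 - \hat{\rho})}.
\end{align*}
```

Under these assumptions the estimate is a squared coefficient multiplied by positive quantities, so it cannot be negative, which is consistent with the skewed null distribution discussed in the text.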
Under the strong null hypothesis that neither the “old” marker Y nor the “new” marker W is predictive of disease, then .
Under the more general null hypothesis that Y is predictive but W is not, let σ² denote Var[D|Y]. We then have and
Under the alternative hypothesis, $\widehat{\mathrm{IDI}}$, suitably scaled, has a noncentral chi-squared distribution with a noncentrality parameter increasing with n and γ. As the noncentrality parameter increases, the distribution gets closer to normal, as shown in Figure 5.
We can explicitly demonstrate the failure of the bootstrap in the simplest case in which the models are linear and the “old” model is uninformative. The derivations above show that estimated integrated discrimination improvement (IDI) then has a scaled noncentral chi-squared distribution with noncentrality parameter nγ²/2, that is,
A bootstrap sample is a sample from a population in which , where is the estimate in the original data sample. The distribution of statistics IDI* computed on the bootstrap samples will correctly estimate the sampling distribution of IDI when —that is, in large samples, conditional on
When the new biomarker is uninformative, the sampling distribution of is a central chi-squared distribution, that is, but the conditional sampling distribution of the bootstrap replicates IDI* is Since does not converge to zero, the bootstrap distribution does not converge to the sampling distribution. As Figure 6 shows, the bootstrap distribution of IDI* actually varies according to the sample value of .
If γ ≠ 0, however, the distribution of $\hat{\gamma}^2$ is asymptotically normal with mean γ² and variance proportional to 1/n and depending smoothly on γ. The sampling distribution of $\widehat{\mathrm{IDI}}$ is approximately normal with mean
and variance proportional to 1/n and depending smoothly on γ.
The conditional distribution of the bootstrap replicates IDI* in a sample with estimate $\hat{\gamma}$ will thus have mean
which converges to the mean of $\widehat{\mathrm{IDI}}$, and since the variance depends smoothly on γ, it will converge to the variance of $\widehat{\mathrm{IDI}}$. Thus, the bootstrap gives the correct sampling distribution for $\widehat{\mathrm{IDI}}$ in large samples when γ ≠ 0.
To prove the result at the end of the “Sampling Distribution of $\widehat{\mathrm{IDI}}$: Theoretical Results” section (see text), assume 2 differentiable functions, $f_{\text{old}}(Y)$ and $f_{\text{new}}(Y, W)$. If we are conditioning on the training sample, these can be regarded as fixed functions. They produce a pair of random variables (P = $f_{\text{old}}(Y)$, Q = $f_{\text{new}}(Y, W)$), and we also have the outcome variable D. Because we are treating the 2 functions as fixed, the triples (P, Q, D) for each person are (conditionally) independent and identically distributed in the test sample.
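The IDI is estimated by

$$\widehat{\mathrm{IDI}} = \frac{\sum_i D_i\,(Q_i - P_i)}{\sum_i D_i} - \frac{\sum_i (1 - D_i)\,(Q_i - P_i)}{\sum_i (1 - D_i)}.$$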
This is a Hadamard-differentiable function of the empirical cumulative distribution function of (P, Q, D), as long as the proportion of cases is bounded away from 0 and 1, so it is asymptotically normal and bootstrappable and is consistent for the value defined by applying the IDI() functional to the true distributions of P, Q, and D (19).
The asymptotic variance of the estimated IDI will depend only on the uncertainty in the numerators and so is the variance of
Under prospective sampling, this is still larger than the formula given by Pencina et al. (4). However, under case-control sampling with prespecified numbers of cases and controls, the variance is the sum of variance contributions from the case (D = 1) and control (D = 0) strata; so under these circumstances, the asymptotic variance is
Therefore, the variance formula presented by Pencina et al. (4) is correct if one develops the prediction models in a separate sample, fixes the “old” and “new” risk models to be those estimated from those samples, and then estimates the IDI in a separate case-control validation sample.