Am J Epidemiol. 2011 August 1; 174(3): 364–374.

Published online 2011 June 14. doi: 10.1093/aje/kwr086

PMCID: PMC3202159

Received 2010 October 29; Accepted 2011 February 28.

Copyright American Journal of Epidemiology © The Author 2011. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

The integrated discrimination improvement (IDI) index is a popular tool for evaluating the capacity of a marker to predict a binary outcome of interest. Recent reports have proposed that the IDI is more sensitive than other metrics for identifying useful predictive markers. In this article, the authors use simulated data sets and theoretical analysis to investigate the statistical properties of the IDI. The authors consider the common situation in which a risk model is fitted to a data set with and without the new, candidate predictor(s). Results demonstrate that the published method of estimating the standard error of an IDI estimate tends to underestimate the error. The *z* test proposed in the literature for IDI-based testing of a new biomarker is not valid, because the null distribution of the test statistic is not standard normal, even in large samples. If a test for the incremental value of a marker is desired, the authors recommend the test based on the model. For investigators who find the IDI to be a useful measure, bootstrap methods may offer a reasonable option for inference when evaluating new predictors, as long as the added predictive capacity is large.

Various metrics have been proposed for quantifying the predictive ability of a classification model or quantifying the incremental value of a new biomarker or predictor (1). The most common single-number summary of the ability of a classification tool to discriminate between cases and controls is the area under the receiver operating characteristic curve (AUC), also known as the *c* index. To quantify the incremental value of a new marker, one can use the improvement in the AUC when the marker is added to an existing classification model. However, the AUC has been widely criticized because it does not measure a clinically meaningful quantity (2, 3). There is also concern that the AUC is “insensitive” and does not demonstrate the value of new markers that are useful for prediction (2). Recently, several investigators proposed measures of incremental value that examine the extent to which a new marker reclassifies subjects (2, 4). However, such measures can be sensitive to arbitrary boundaries delineating discrete categories of risk (5).

Pencina et al. (4) proposed the integrated discrimination improvement (IDI) index as complementary to the AUC. The IDI is defined as

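$$\text{IDI} = \left(\text{IS}_{\text{new}} - \text{IS}_{\text{old}}\right) - \left(\text{IP}_{\text{new}} - \text{IP}_{\text{old}}\right) \qquad (1)$$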

In this equation, IS is the integral of sensitivity over all possible cutoff values and IP is the corresponding integral of “1 minus specificity.” In equation 1, “new” refers to the classification model that includes the new biomarker and “old” refers to the classification model that does not. Pencina et al. (4) provide the following estimator for the IDI:

$$\widehat{\text{IDI}} = \left(\overline{\widehat{p}}_{\text{new,events}} - \overline{\widehat{p}}_{\text{old,events}}\right) - \left(\overline{\widehat{p}}_{\text{new,nonevents}} - \overline{\widehat{p}}_{\text{old,nonevents}}\right) \qquad (2)$$

In equation 2, each $\overline{\widehat{p}}$ is an average of estimated probabilities of an event. An average is taken over the people in the sample who experienced events (“events”), and an average is taken over those who did not experience an event (“nonevents”). In other words, events are cases and nonevents are controls. Use of the IDI can be motivated from multiple perspectives (3, 6–9). Perhaps the simplest motivation for the IDI is that a useful marker leads to increased estimated risks of disease for cases and decreased estimated risks for controls. If the new marker contributes to risk prediction, the first term of equation 2 will be large in the positive direction and the second term will be large in the negative direction; subtracting them produces a large IDI.
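The estimator just described can be sketched in a few lines of Python. This is a minimal illustration, assuming the fitted risks from the two models are already available as lists; the function name `idi_estimate` and the toy numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean


def idi_estimate(p_old, p_new, d):
    """Estimate the IDI from two sets of fitted risks.

    p_old, p_new: estimated event probabilities from the "old" and "new" models.
    d: outcome indicators (1 = event/case, 0 = nonevent/control).
    """
    events = [i for i, di in enumerate(d) if di == 1]
    nonevents = [i for i, di in enumerate(d) if di == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one nonevent")
    # Change in mean estimated risk among events (rises for a useful marker).
    delta_events = mean(p_new[i] for i in events) - mean(p_old[i] for i in events)
    # Change in mean estimated risk among nonevents (falls for a useful marker).
    delta_nonevents = (mean(p_new[i] for i in nonevents)
                       - mean(p_old[i] for i in nonevents))
    return delta_events - delta_nonevents


# Toy illustration with made-up risks (not data from the paper).
d = [1, 1, 0, 0]
p_old = [0.5, 0.5, 0.5, 0.5]
p_new = [0.7, 0.6, 0.4, 0.3]
idi = idi_estimate(p_old, p_new, d)
```

In the toy data the mean risk among events rises by 0.15 and the mean risk among nonevents falls by 0.15, so the estimate is 0.30.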

Pencina et al. (4) give an example of using the IDI to evaluate the incremental value of a marker. Two regression models are fitted to a data set, with and without the new marker. Each regression model yields estimated risks of disease $\widehat{p}$ for every individual, case and control, in the data set. The estimated risks from the 2 fitted models are averaged appropriately, and $\widehat{\text{IDI}}$ is computed for the data set using equation 2. Although Pencina et al. (4) do not use logistic regression in their example, we expect this to be a common choice in practice, and we use logistic regression throughout most of this paper.

To test the null hypothesis that IDI = 0, Pencina et al. (4) provide the test statistic

$$z_{\text{IDI}} = \frac{\widehat{\text{IDI}}}{\sqrt{\widehat{\text{SE}}_{\text{events}}^{2} + \widehat{\text{SE}}_{\text{nonevents}}^{2}}} \qquad (3)$$

In equation 3, $\widehat{\text{SE}}_{\text{events}}$ is the standard error of paired differences of new and old model-based predicted probabilities among cases; $\widehat{\text{SE}}_{\text{nonevents}}$ is the corresponding standard error among controls. Pencina et al. (4) conjecture that *z*_{IDI} is asymptotically standard normal under the null hypothesis that the new biomarker does not contribute to prediction.
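For concreteness, the statistic can be computed as in the sketch below. It assumes that each standard error is the sample standard deviation of the paired differences divided by the square root of the group size; the function name `z_idi` and the toy values are hypothetical. The sketch only shows the arithmetic of the statistic and says nothing about whether a test based on it is valid.

```python
from math import sqrt
from statistics import mean, stdev


def z_idi(p_old, p_new, d):
    """Compute the IDI z statistic from paired fitted risks (needs >= 2 per group)."""
    # Paired differences (new minus old) within each outcome group.
    diffs_events = [pn - po for po, pn, di in zip(p_old, p_new, d) if di == 1]
    diffs_nonevents = [pn - po for po, pn, di in zip(p_old, p_new, d) if di == 0]
    # The mean of paired differences equals the difference of group means.
    idi_hat = mean(diffs_events) - mean(diffs_nonevents)
    # Standard error of the mean paired difference in each group.
    se_events = stdev(diffs_events) / sqrt(len(diffs_events))
    se_nonevents = stdev(diffs_nonevents) / sqrt(len(diffs_nonevents))
    return idi_hat / sqrt(se_events ** 2 + se_nonevents ** 2)


# Toy illustration with made-up risks (not data from the paper).
d = [1, 1, 1, 0, 0, 0]
p_old = [0.5] * 6
p_new = [0.7, 0.6, 0.8, 0.4, 0.3, 0.2]
z = z_idi(p_old, p_new, d)
```

Note that the denominator treats the two group-specific terms as if they were independent.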

Not all investigators agree that the IDI is a major improvement over the AUC as a measure of incremental value. Greenland (8) comments that the IDI, like the AUC, incorporates information that is irrelevant. That is, both measures summarize the entire receiver operating characteristic curve, including regions where false-positive or false-negative rates are unacceptable. Chi and Zhou (6) fault the IDI for putting equal weight on sensitivity and specificity, when the relative importance of sensitivity and specificity varies with the objective. Mihaescu et al. (10) comment that the IDI, like the AUC, is a measure of clinical validity rather than clinical utility. Without endorsing the AUC, we note that most researchers have enough experience with the AUC to interpret the measure and to know when an AUC value is “large.” It is not clear whether the same holds for the IDI. On the other hand, the IDI has become increasingly popular in predictive modeling research. In a scientific statement from the American Heart Association, Hlatky et al. noted that “the IDI test appears to be more powerful than the *c* index” for establishing that a new biomarker has positive incremental value (11, p. 2411). On February 17, 2011, 353 articles in the Science Citation Index referenced the article by Pencina et al. (4). Many of these authors used the IDI or the test statistic *z*_{IDI} as supporting evidence in favor of a proposed biomarker.

In this article, we sidestep the debate on the inherent value of the IDI as a measure and focus instead on the statistical properties of the IDI. The popularity of the IDI warrants further investigation of its behavior, particularly in the common situation in which the “new” and “old” risk models are estimated using the same set of data. Pepe et al. (12) raised concerns that the denominator of equation 3 is an underestimate of the standard error of $\widehat{\text{IDI}}$. We investigate this particular question, as well as the sampling distribution of $\widehat{\text{IDI}}$. We provide empirical and theoretical evidence that $\widehat{\text{IDI}}$ is approximately normal only for large values of the IDI. In particular, we show that the test statistic *z*_{IDI} does not have a standard normal distribution under the null hypothesis that IDI = 0, and thus the test based on *z*_{IDI} is not valid.

We used both simulation and statistical theory to explore the sampling distribution of $\widehat{\text{IDI}}$ and the null distribution of *z*_{IDI}. Throughout this paper, we consider the behavior of $\widehat{\text{IDI}}$ in the common situation where “old” and “new” nested risk models are fitted to the same data set.

We employed multiple schemes for simulating data. We always use *D* to denote the binary variable indicating the outcome, that is, disease status. *Y* denotes established (“old”) predictors. Candidate (“new”) predictors are denoted with *W*, *W*_{1}, or *W*_{2}.

We simulated the log odds of disease according to a logistic risk model in which we think of age as the established predictor *Y* and cardiovascular disease as the outcome *D*. In our simplest simulation model, there is a single candidate predictor *W*:

(4)

We also consider scenarios in which there are 2 candidate predictors *W*_{1} and *W*_{2}:

(5)

We simulated *Y* as *N*(65, 10) and independently simulated each of *W*, *W*_{1}, and *W*_{2} as *N*(0, 1). These simulation parameters yield an event rate of approximately 5% when γ = 0 or γ_{1} = γ_{2} = 0. Using simulated data, we computed risks of disease using equation 4 or equation 5, and we simulated disease statuses from each risk independently using a Bernoulli distribution. If a γ parameter equals zero, then the corresponding *W* has no predictive value. If a γ parameter is not zero, then the corresponding *W* is predictive, although its incremental value depends, of course, on the magnitude of its coefficient.
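A minimal sketch of this data-generating scheme follows. The intercept and slope passed in the usage line are placeholders chosen for illustration, not the coefficients used in the paper, and the code assumes that the 10 in *N*(65, 10) is a standard deviation. The function name `simulate_logistic_data` is hypothetical.

```python
import math
import random


def simulate_logistic_data(n, beta0, beta1, gamma, seed=None):
    """Simulate (y, w, d) triples from a logistic risk model.

    beta0, beta1, gamma are supplied by the caller; the values used below
    are hypothetical placeholders.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.gauss(65, 10)  # established predictor (assumes SD = 10)
        w = rng.gauss(0, 1)    # candidate predictor, independent of y
        log_odds = beta0 + beta1 * y + gamma * w
        risk = 1.0 / (1.0 + math.exp(-log_odds))
        d = 1 if rng.random() < risk else 0  # Bernoulli draw from the risk
        rows.append((y, w, d))
    return rows


# gamma = 0 gives a candidate predictor with no predictive value.
data = simulate_logistic_data(1500, beta0=-7.0, beta1=0.06, gamma=0.0, seed=1)
event_rate = sum(d for _, _, d in data) / len(data)
```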

The alternative logistic simulation model was designed to mimic situations in which the established predictor is not very predictive. The simulation model is similar to equation 4:

(6)

*Y* is randomly generated from a standard exponential distribution, and *W* is independently generated from a Poisson distribution with mean 4. As in the previous logistic simulation model, the prevalence is approximately 5% when γ = 0.

In this simulation, we start with a real data set from a clinical trial for prevention of mother-to-child transmission of human immunodeficiency virus (HIV) (13). Table 1 gives basic descriptive statistics for this data set. Of the 1,882 deliveries by HIV-infected women recorded in this data set, 8% of the infants had a positive HIV test at birth. There is an established predictor *Y*, which is maternal viral load at 20–24 weeks of gestation. A higher viral load means that the mother has more copies of the virus circulating in her blood and is modestly predictive of whether she will transmit HIV to her child during pregnancy or delivery. We consider the mother's age as the candidate predictor. One would not expect the age of an HIV-infected pregnant woman to predict whether her infant will be born with HIV infection. In each simulated data set, we randomly permute mother's age, ensuring no predictive ability of the “new” predictor *W*.
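The permutation step can be sketched as follows, assuming each record is stored as a tuple of established predictor, candidate predictor, and outcome. The function name and the four records are made up for illustration and do not come from the trial data.

```python
import random


def permute_candidate(rows, seed=None):
    """Return a copy of rows with the candidate predictor shuffled across subjects.

    rows: list of (y, w, d) tuples. Only w is permuted, so any association
    between w and the outcome d (or the established predictor y) is broken,
    while the marginal distribution of w is preserved.
    """
    rng = random.Random(seed)
    w_values = [w for _, w, _ in rows]
    rng.shuffle(w_values)
    return [(y, w_new, d) for (y, _, d), w_new in zip(rows, w_values)]


# Made-up records: (log10 viral load, mother's age, infant HIV status).
rows = [(4.1, 23, 0), (5.2, 31, 1), (3.8, 27, 0), (4.9, 35, 0)]
permuted = permute_candidate(rows, seed=0)
```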

For simulated data sets, we fit logistic regression models to the data set with and without the “new” predictors. In other words, for a given simulated data set, we fit the model

$$\text{logit}\,P(D = 1 \mid Y) = \beta_{0} + \beta_{1}Y \qquad (7)$$

to estimate risks of disease using only the established predictor *Y*. If there is a single candidate predictor, we fit the model

$$\text{logit}\,P(D = 1 \mid Y, W) = \beta_{0} + \beta_{1}Y + \gamma W$$

to estimate the risks of disease using both the established and candidate predictors. We used the “new” and “old” estimated risks to compute $\widehat{\text{IDI}}$ and *z*_{IDI}.

In the original IDI paper, Pencina et al. (4) proposed the IDI for comparing 2 nested models. In particular, the proposal does not limit the IDI to evaluating a single candidate marker. In fact, the IDI has been used to evaluate multiple markers as a set (14, 15). In a similar spirit, we used logistic simulations with 2 new predictors to compare equation 7 and the larger model

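$$\text{logit}\,P(D = 1 \mid Y, W_{1}, W_{2}) = \beta_{0} + \beta_{1}Y + \gamma_{1}W_{1} + \gamma_{2}W_{2} \qquad (8)$$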

For simulations with a single candidate predictor *W*, we also consider the following 2 nested prediction models:

and

(9)

Thus, in the larger model there is a single candidate predictor but the 2 fitted models differ by 2 degrees of freedom (df). As before, we used the “new” and “old” estimated risks to compute $\widehat{\text{IDI}}$ and *z*_{IDI}.

Using the logistic simulation model and setting γ = 0, we simulated data sets with a useful predictor of disease and a candidate predictor of disease that has no predictive capacity. Similarly, we used the HIV simulation model to generate data sets in which a candidate predictor had no incremental value.

First, we investigated the accuracy of the standard error estimate used in equation 3. For a given sample size, we simulated 10,000 data sets using the logistic simulation model with γ = 0 and computed $\widehat{\text{IDI}}$. The standard deviation of $\widehat{\text{IDI}}$ across these 10,000 simulations estimates the standard error of $\widehat{\text{IDI}}$ under the null hypothesis that IDI = 0. For each simulated data set, we also computed $\widehat{\text{S}}\text{E}\left(\widehat{\text{IDI}}\right)$ using the formula in the denominator of equation 3. In Figure 1, we compare our empirical estimate of $\widehat{\text{S}}\text{E}\left(\widehat{\text{IDI}}\right)$ with the estimate used in equation 3 by dividing the latter by the former. We see that the standard error estimate in equation 3 is, on average, about half as large as it should be. The magnitude of the bias is (perhaps remarkably) stable as sample size increases. Our results confirm the suspicion of Pepe et al. (12) that the standard error estimate used in equation 3 is an underestimate of the standard error of $\widehat{\text{IDI}}$.

Figure 1. Bias in $\widehat{\text{S}}\text{E}\left(\widehat{\text{IDI}}\right)$. For each sample size, we simulated 10,000 data sets using the logistic simulation model with γ = 0 and computed $\widehat{\text{IDI}}$. The standard deviation of $\widehat{\text{IDI}}$ across these 10,000 simulations estimates the null standard error (SE) of ...

We also investigated more fully the sampling distribution of $\widehat{\text{IDI}}$ and *z*_{IDI} when the null hypothesis, IDI = 0, is true. The top row of Figure 2 shows the results for the logistic simulation model (γ = 0). The bottom row shows the results for the HIV simulation model. Results are based on 10,000 simulations of data sets of size *n* = 1,500 for the logistic model and *n* = 1,882 for the HIV simulation model. The null distribution of $\widehat{\text{IDI}}$ is highly nonsymmetric, with a long right tail. The strong positive skewness in the distribution results from the fact that the 2 components of $\widehat{\text{IDI}}$ have a strong negative correlation. Pepe et al. (12) also pointed out that the IDI is equal to the change in the proportion of explained variation, which is either always or predominantly positive, depending on the type of regression model. That is, adding a new variable to a set of predictors rarely decreases the proportion of explained variation (and never decreases the proportion of explained variation in linear regression). The null distribution of *z*_{IDI} is more symmetric but is centered away from zero and is not standard normal. Other simulation models gave very similar results (data not shown).

Figure 2. Null distribution of $\widehat{\text{IDI}}$ and *z*_{IDI} for 10,000 data sets simulated using the logistic model (top row; *n* = 1,500) and the human immunodeficiency virus simulation model (bottom row; *n* = 1,882). For *z*_{IDI}, a standard normal density curve is given for reference. ...

We also studied the sampling distribution of $\widehat{\text{IDI}}$ and *z*_{IDI} for 2-df IDIs. For the logistic simulation model, the larger model is equation 8, and for the HIV simulation model, the larger model is equation 9. Results are shown in Figure 3. Compared with Figure 2, Figure 3 shows that $\widehat{\text{IDI}}$ is more prominently skewed toward positive values and the distribution of *z*_{IDI} is further shifted to the right in comparison with a standard normal curve.

We have seen that the null distribution of *z*_{IDI} is not standard normal (Figures 2 and 3). What is the implication for investigators attempting to use *z*_{IDI} to evaluate a new biomarker? We used the logistic simulation model with γ = 0 to investigate the type I error (false-positive) rate of the *z*_{IDI} test. Suppose an investigator uses *z*_{IDI} to conduct a 2-sided hypothesis test of H_{0}: IDI = 0 for a single biomarker and a 1-df difference between the “new” and “old” predictive models. It turns out that the *z*_{IDI} test is slightly conservative. A nominal 5%-level test uses a cutoff of 1.96; the true size of the test is actually slightly smaller, approximately 3.9%.

The IDI is a measure of the *improvement* in prediction. As previously noted (14), a 2-sided hypothesis test is not appropriate when interest is in markers that improve prediction. If one uses an IDI-based hypothesis test to evaluate a new biomarker, an appropriate test is 1-sided—that is, H_{0}: IDI = 0 vs. H_{1}: IDI > 0. Performing the test by comparing *z*_{IDI} with a standard normal distribution, the cutoff 1.96 nominally corresponds to a 2.5%-level 1-sided test. The actual type I or false-positive error rate is approximately 3.9%. An intended α level of 5% corresponds to an actual α level of approximately 9.3%.
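The nominal levels quoted above follow directly from the standard normal distribution, as the short sketch below shows. It computes only the nominal levels implied by a given cutoff; the actual error rates reported in the text come from the simulations and cannot be reproduced this way. The function name `nominal_levels` is hypothetical.

```python
from statistics import NormalDist


def nominal_levels(cutoff):
    """Nominal 1-sided and 2-sided levels when z is compared with a standard normal."""
    upper_tail = 1.0 - NormalDist().cdf(cutoff)
    return {"one_sided": upper_tail, "two_sided": 2.0 * upper_tail}


levels = nominal_levels(1.96)
# Cutoff that an investigator would use for a nominal 5%-level 1-sided test.
one_sided_cutoff = NormalDist().inv_cdf(0.95)
```

The cutoff 1.96 gives a nominal 2-sided level of about 5% and a nominal 1-sided level of about 2.5%, and a nominal 5% 1-sided test uses a cutoff of about 1.645.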

We also considered the case in which 2 df separate the “new” and “old” predictive models. In this case, both 1-sided and 2-sided hypothesis tests are anticonservative, with higher false-positive rates than the nominal levels. Figure 4 illustrates the results described above.

Using the logistic simulation model, we also simulated data sets where the new predictor *W* has some predictive value by choosing γ ≠ 0. We examined a range of values of γ. As before, we computed $\widehat{\text{IDI}}$ for each simulated data set.

Figure 5 shows estimated sampling distributions of $\widehat{\text{IDI}}$ for a range of γ values. (Results are shown in 2 plots because of the drastically different scales for the distributions for small and large γ.) For small values of γ, $\widehat{\text{IDI}}$ has a severe right skewness, as we saw in Figure 2. For larger values of γ, $\widehat{\text{IDI}}$ has a fairly symmetric distribution. To help interpret these results, Table 2 (first row of data) provides the average *P* value for the estimated coefficient γ of *W* in the fitted logistic regression model. With γ = 0.4, *W* is a marginally significant predictor according to this metric.

The extremely nonnormal empirical distribution of $\widehat{\text{IDI}}$ is surprising, so we investigated the distribution analytically in a simplified scenario to help explain the simulation findings. The formulation of the IDI in equation 1 does not restrict how the risk models are to be fitted, so we examined the distribution that arose when the risk scores were fitted by linear regression. This would be an unusual choice in practice, but it is convenient here because it allows us to derive simple formulas for the risk scores. In contrast, logistic regression models are fitted to data using iterative algorithms, and there are no simple formulas for model parameters as a function of the data. However, since the computational algorithms for logistic regression use iterative weighted linear regression, we would expect the distribution of $\widehat{\text{IDI}}$ based on linear regression to be a good guide to the distribution based on logistic regression (at least when prediction is weak). Our analytic results for linear regression explain both the asymmetric null distribution of $\widehat{\text{IDI}}$ and the underestimation of its standard error.

Without loss of generality, the “old” model contains a single variable *Y* and the “new” model additionally includes a variable *W* that is independent of *Y* and with mean zero (otherwise, replace *Y* by *Y*β and *W* by *W* – *E*[*W*|*Y*]). We show in the Appendix that

(10)

where $\widehat{\gamma}$ is the estimated coefficient of *W* in the fitted model and ρ is the prevalence of disease. Under the strong null hypothesis that both marker *Y* and marker *W* have no predictive value,

where *n* is the sample size. Under the more general null hypothesis that *Y* is a useful predictor, we need to know Var[*D*|*Y*], which we denote by σ^{2}. This is the scenario we are most interested in: the incremental value of *W* above and beyond an existing predictor *Y*. In this case, we have

Equation 10 allows us to use well-established results about parameter estimates in linear models to understand the distribution of $\widehat{\text{IDI}}$. Under the alternative hypothesis (γ ≠ 0), ${\widehat{\gamma}}^{2}$ has a noncentral chi-squared distribution with a noncentrality parameter increasing with *n* and γ. As the noncentrality parameter increases, the distribution gets closer to normal, but the normal approximation is only good in situations where the power of the test for γ = 0 is high. Since

for large γ or *n*, the distribution will eventually be centered around γ^{2} with a normal distribution and with variance proportional to γ^{2}.
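The link between $\widehat{\text{IDI}}$ and ${\widehat{\gamma}}^{2}$ under linear regression can be checked numerically. The sketch below assumes the simplest case, in which the “old” model is intercept-only (the strong null hypothesis) and *W* is centered in the sample. It compares the estimate computed directly from fitted risks with the closed form ${\widehat{\gamma}}^{2}\,\widehat{\text{Var}}[W]/\{\widehat{\rho}(1-\widehat{\rho})\}$, which is our reading of the linear-regression identity and uses notation introduced here. The function name and simulation settings are hypothetical.

```python
import random
from statistics import mean


def linear_idi_check(n=500, prevalence=0.3, seed=0):
    """Return (direct IDI estimate, closed-form value) for a linear fit under the null."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(n)]
    # Strong null: the outcome is generated independently of w.
    d = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    w_bar = mean(w)
    wc = [wi - w_bar for wi in w]          # centered candidate predictor
    sxx = sum(x * x for x in wc)
    rho_hat = mean(d)                      # sample proportion of events
    gamma_hat = sum(x * di for x, di in zip(wc, d)) / sxx  # least-squares slope
    # Fitted risks: intercept-only "old" model, linear "new" model.
    p_old = [rho_hat] * n
    p_new = [rho_hat + gamma_hat * x for x in wc]
    ev = [i for i in range(n) if d[i] == 1]
    ne = [i for i in range(n) if d[i] == 0]
    direct = ((mean(p_new[i] for i in ev) - mean(p_old[i] for i in ev))
              - (mean(p_new[i] for i in ne) - mean(p_old[i] for i in ne)))
    closed_form = gamma_hat ** 2 * (sxx / n) / (rho_hat * (1 - rho_hat))
    return direct, closed_form


direct, closed_form = linear_idi_check()
```

The two values agree up to floating-point error, and the closed form makes clear why the estimate cannot be negative in this setting.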

Our simulation studies show that these results for the linear model hold approximately for logistic regression. In the first part of the Results section, we saw that the null distribution of $\widehat{\text{IDI}}$ has a chi-square-shaped distribution and the sampling distribution appears approximately normal for IDI away from zero.

An interesting result applies to a scenario in which risk models have been estimated using a separate set of training data. If a case-control validation sample is taken to estimate the IDI using the existing (fixed) risk models, then the formula for estimating $\text{Var}\left[\widehat{\text{IDI}}\right]$ provided by Pencina et al. (4) turns out to be correct (Appendix).

Bootstrapping is a popular method with which to make inferences about a parameter using an estimator whose sampling distribution is not well characterized. Unfortunately, bootstrapping is not always a reliable method for making inferences about the IDI. Figure 5 and Figure 6 show why. The sampling distribution of $\widehat{\text{IDI}}$ changes rapidly in shape and scale as IDI approaches zero. The bootstrap estimates the sampling distribution of $\widehat{\text{IDI}}$ under conditions as they exist in the sample. If the true IDI in the population is zero, then $\widehat{\text{IDI}}$ in the sample will typically be positive, and the sampling distribution under conditions in the sample will be substantially different from the sampling distribution under the true, zero, IDI. The bootstrap distribution will be more symmetrical, more spread out, and shifted to the right compared with the true sampling distribution.
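The mechanics of a resampling-subjects bootstrap can be sketched as below. To keep the example self-contained, it uses a deliberately simplified setting that differs from the simulations in this paper: the “old” model is intercept-only and the “new” model is a linear regression on a single marker, for which the estimate has a closed form. A percentile interval is one common choice and is an assumption here; all function names are hypothetical.

```python
import random
from statistics import mean


def linear_idi(rows):
    """IDI estimate for rows of (w, d): intercept-only old model vs. linear new model."""
    n = len(rows)
    w_bar = mean(w for w, _ in rows)
    rho = mean(d for _, d in rows)
    sxx = sum((w - w_bar) ** 2 for w, _ in rows)
    if sxx == 0 or rho in (0, 1):
        return 0.0  # degenerate resample: no usable variation
    gamma = sum((w - w_bar) * d for w, d in rows) / sxx
    return gamma ** 2 * (sxx / n) / (rho * (1 - rho))


def bootstrap_percentile_ci(rows, statistic, n_boot=1000, level=0.95, seed=None):
    """Percentile interval from resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    reps = sorted(
        statistic([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = reps[int((1 - level) / 2 * n_boot)]
    hi = reps[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi


# Made-up null data: the marker w is unrelated to the outcome d.
rng = random.Random(2)
rows = [(rng.gauss(0, 1), 1 if rng.random() < 0.3 else 0) for _ in range(200)]
lo, hi = bootstrap_percentile_ci(rows, linear_idi, n_boot=200, seed=1)
```

Each bootstrap replicate is computed from a resample in which the marker's apparent effect is whatever was estimated in the original sample, not zero. In this simplified linear setting the estimate is never negative, so the example illustrates only the resampling mechanics and not the coverage figures reported in the tables.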

Figure 6. Illustration of the fact that the resampling-subjects bootstrap does not provide valid inference for small values of the integrated discrimination improvement (IDI) index. We simulated 1,000 data sets of size 1,500 using the logistic simulation model ...

The third and sixth rows in Table 2 show that the bootstrap has an anticonservative bias when the true incremental value of a marker is null. In particular, for the alternative logistic simulation model, the anticonservatism of the bootstrap was severe, with only 74.1% of nominal 95% bootstrap confidence intervals covering the true IDI value of zero. We obtained similar results when 2 new markers were simultaneously evaluated with the IDI, with unreliable, anticonservative inferences for small values of the IDI (Table 3).

In this paper, we investigated IDI as a measure of the incremental value of a biomarker. In our simulation studies, the published formula for estimating the standard error of $\widehat{\text{IDI}}$ tended to underestimate the true standard error by a factor of approximately 2. Moreover, the sampling distribution of $\widehat{\text{IDI}}$ for a marker with no predictive value is strongly skewed toward positive values. We also considered testing the null hypothesis H_{0}: IDI = 0. The null distribution of the proposed *z* statistic does not follow a standard normal distribution. For evaluating the incremental value of a single biomarker, 2-sided hypothesis testing using the *z* test is conservative. More appropriate 1-sided hypothesis testing is anticonservative, meaning that the IDI *z* test is prone to giving false-positive results.

Most of the empirical results we have presented involved fitting logistic regression models to data simulated under a logistic model. This is an idealized situation where the exactly correct model is fitted to the data and used to estimate risks and the IDI. The fact that the sampling distribution of $\widehat{\text{IDI}}$ in such a highly idealized situation did not conform to the expectations set out by Pencina et al. (4) does not bode well for its behavior with real data.

Our empirical and theoretical results indicate that a valid test of H_{0}: IDI = 0 that is based on $\widehat{\text{IDI}}$ will be very difficult to develop. However, the hypothesis H_{0}: IDI = 0 is equivalent to H_{0}: *P*(*D*|*Y*, *W*) = *P*(*D*|*Y*), where *W* is the candidate biomarker and *Y* is the set of existing predictors (12). This is fortunate, because it means that an IDI-based test is unnecessary. Therefore, if a test of positive incremental value is desired, we recommend using a test based on the model. For example, if a regression function is used for risk modeling, then the likelihood ratio test for the coefficient of *W* in the risk model can be used to test the null hypothesis H_{0}: *P*(*D*|*Y*, *W*) = *P*(*D*|*Y*). The likelihood ratio test is implemented in all major statistical packages, can be applied to single markers or sets of markers, and is asymptotically the most powerful test.
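As a rough illustration of the recommended model-based test, the sketch below fits the nested logistic models by Newton-Raphson and compares their log-likelihoods. It is a teaching sketch under stated assumptions, not a replacement for an established statistical package: it handles a single added marker (1 df), uses a fixed number of iterations, and does not guard against separation. All function names and the simulated example data are hypothetical.

```python
import math
import random


def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = m[r][k] - sum(m[r][c] * x[c] for c in range(r + 1, k))
        x[r] = s / m[r][r]
    return x


def fit_logistic(xs, d, iterations=25):
    """Newton-Raphson fit; each row of xs must start with 1.0 for the intercept."""
    k = len(xs[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, di in zip(xs, d):
            eta = sum(b * xj for b, xj in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for j in range(k):
                grad[j] += (di - p) * x[j]
                for l in range(k):
                    hess[j][l] += p * (1.0 - p) * x[j] * x[l]
        beta = [b + s for b, s in zip(beta, _solve(hess, grad))]
    loglik = 0.0
    for x, di in zip(xs, d):
        eta = sum(b * xj for b, xj in zip(beta, x))
        loglik += di * eta - math.log1p(math.exp(eta))
    return beta, loglik


def likelihood_ratio_test(y, w, d):
    """LRT of the coefficient of w given y; returns (statistic, P value), 1 df."""
    _, ll_old = fit_logistic([[1.0, yi] for yi in y], d)
    _, ll_new = fit_logistic([[1.0, yi, wi] for yi, wi in zip(y, w)], d)
    stat = max(0.0, 2.0 * (ll_new - ll_old))
    # Upper tail of a chi-squared distribution with 1 df.
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Made-up example in which w is strongly predictive (hypothetical coefficients).
rng = random.Random(0)
y = [rng.gauss(0, 1) for _ in range(400)]
w = [rng.gauss(0, 1) for _ in range(400)]
d = [1 if rng.random() < 1 / (1 + math.exp(-(-1 + yi + 2 * wi))) else 0
     for yi, wi in zip(y, w)]
stat, p_value = likelihood_ratio_test(y, w, d)
```

For a set of several markers, the same statistic would be referred to a chi-squared distribution with as many degrees of freedom as added coefficients, which needs a different tail function from the 1-df shortcut used here.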

In certain cases in practice, IDI-based tests of the predictiveness of a novel biomarker give small *P* values, whereas tests based on regression coefficients or the AUC are far from significant. For example, see Table III in the article by Criqui et al. (16), Table 2 in the article by Blankenberg et al. (17), or Table 3 in the article by Lin et al. (18). Since all tests evaluate the same null hypothesis, a tempting conclusion is that the IDI-based test is more powerful than the others (11). Unfortunately, the results in this paper lead to an alternate explanation, namely that IDI-based results are inconsistent with the other results because the test based on *z*_{IDI} is not valid.

We remind readers that the value of hypothesis testing in evaluating new biomarkers is, at best, limited. The real challenge in biomarker research is to identify markers with a predictive capacity that is substantial enough to improve clinical practice. The motivation for the development of the IDI still stands: to find measures that quantify the incremental value in a meaningful way. For investigators who find the IDI to be a useful measure, bootstrapping to obtain confidence intervals may offer a reasonable option for inference, as long as the true IDI is well away from zero.

Author affiliations: Department of Biostatistics, School of Public Health, University of Washington, Seattle, Washington (Kathleen F. Kerr, Robyn L. McClelland, Elizabeth R. Brown, Thomas Lumley).

K. F. K. was supported by sabbatical funding from the University of Washington. E. R. B. was supported by National Institutes of Health grant R01 HL095126, and T. L. was supported by National Institutes of Health grant R01 HL080295.

Conflict of interest: none declared.

- AUC: area under the receiver operating characteristic curve
- HIV: human immunodeficiency virus
- IDI: integrated discrimination improvement

We consider adding a single new variable *W* that is fitted by linear regression. There is no loss of generality in assuming that the “old” model contains a single variable *Y*, so the “new” model contains *W* and *Y*. We can also assume that *W* has a mean value of zero and is uncorrelated with *Y* in the sample (otherwise, replace *Y* by *Y*β and *W* by *W* − *E*[*W*|*Y*]).

Now

The last equality holds because in any generalized linear model with an intercept, the residuals sum to zero.

Since

we have

and

where $\widehat{\gamma}$ is the coefficient of the proposed marker *W* in the “new” model.
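One way to see how the squared coefficient arises is sketched below. It assumes, as in this Appendix, that *W* is centered and uncorrelated with *Y* in the sample, so that adding *W* leaves the other fitted coefficients unchanged. The symbols $\overline{W}_{\text{events}}$ and $\overline{W}_{\text{nonevents}}$ (group means of *W*), $\widehat{\rho}$ (sample proportion of events), and $\widehat{\text{Var}}[W]$ (sample variance with divisor *n*) are notation introduced here for illustration.

$$
\begin{aligned}
\widehat{p}_{\text{new},i} - \widehat{p}_{\text{old},i} &= \widehat{\gamma}\,W_{i}, \\
\widehat{\text{IDI}} &= \widehat{\gamma}\left(\overline{W}_{\text{events}} - \overline{W}_{\text{nonevents}}\right), \\
\widehat{\gamma} &= \frac{\sum_{i} W_{i}D_{i}}{\sum_{i} W_{i}^{2}} = \frac{\widehat{\rho}\,(1-\widehat{\rho})\left(\overline{W}_{\text{events}} - \overline{W}_{\text{nonevents}}\right)}{\widehat{\text{Var}}[W]}, \\
\widehat{\text{IDI}} &= \frac{\widehat{\gamma}^{2}\,\widehat{\text{Var}}[W]}{\widehat{\rho}\,(1-\widehat{\rho})}.
\end{aligned}
$$

Because the estimate is a positive multiple of ${\widehat{\gamma}}^{2}$, it cannot be negative under a linear fit, which is consistent with the chi-squared-shaped null distribution described in the text.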

Under the strong null hypothesis that neither the “old” marker *Y* nor the “new” marker *W* is predictive of disease, then .

Therefore,

and

Under the more general null hypothesis that *Y* is predictive but *W* is not, let σ^{2} denote Var[*D*|*Y*]. We then have and

Under the alternative hypothesis, ${\widehat{\gamma}}^{2}$ has a noncentral chi-squared distribution with a noncentrality parameter increasing with *n* and γ. As the noncentrality parameter increases, the distribution gets closer to normal, as shown in Figure 5.

We can explicitly demonstrate the failure of the bootstrap in the simplest case, in which the models are linear and the “old” model is uninformative. The derivations above show that the estimated integrated discrimination improvement (IDI) then has a scaled noncentral chi-squared distribution with noncentrality parameter *n*γ^{2}/2, that is, ${\chi}_{1}^{2}(\lambda =n{\gamma}^{2}/2).$

A bootstrap sample is a sample from a population in which $\gamma =\widehat{\gamma}$, where $\widehat{\gamma}$ is the estimate in the original data sample. The distribution of statistics IDI^{*} computed on the bootstrap samples will correctly estimate the sampling distribution of IDI when $\gamma =\widehat{\gamma}$—that is, in large samples, conditional on $\widehat{\gamma},$

When the new biomarker is uninformative, the sampling distribution of $n\times \widehat{\text{IDI}}$ is a central chi-squared distribution, that is, ${\chi}_{1}^{2}(\lambda =0),$ but the conditional sampling distribution of the bootstrap replicates IDI^{*} is $n\times {\text{IDI}}^{*}\sim {\chi}_{1}^{2}(\lambda =n{\widehat{\gamma}}^{2}/2).$ Since $n{\widehat{\gamma}}^{2}$ does not converge to zero, the bootstrap distribution does not converge to the sampling distribution. As Figure 6 shows, the bootstrap distribution of IDI^{*} actually varies according to the sample value of $\widehat{\text{IDI}}$.

If γ ≠ 0, however, the distribution of ${\widehat{\gamma}}^{2}$ is asymptotically normal with mean γ^{2} and variance proportional to 1/*n* and depending smoothly on γ. The sampling distribution of $\widehat{\text{IDI}}$ is approximately normal with mean

and variance proportional to 1/*n* and depending smoothly on γ.

The conditional distribution of the bootstrap replicates

IDI^{*} in a sample with $\gamma =\widehat{\gamma}$ will thus have mean

which converges to the mean of $\widehat{\text{IDI}}$, and since the variance depends smoothly on $\widehat{\gamma}$, it will converge to the variance of the $\widehat{\text{IDI}}$. Thus, the bootstrap gives the correct sampling distribution for $\widehat{\text{IDI}}$ in large samples when γ ≠ 0.

To prove the result at the end of the “Sampling Distribution of $\widehat{\text{IDI}}$: Theoretical Results” section (see text), assume 2 differentiable functions $x\mapsto {f}_{\text{old}}\left(x\right)$ and $(x,w)\mapsto {f}_{\text{new}}(x,w)$. If we are conditioning on the training sample, these can be regarded as fixed functions. They produce a pair of random variables (*P* = *f*_{old}(*Y*), *Q* = *f*_{new}(*Y*, *W*)), and we also have the outcome variable *D*. Because we are treating the 2 functions as fixed, the triples (*P*, *Q*, *D*) for each person are (conditionally) independent and identically distributed in the test sample.

The IDI is estimated by

This is a Hadamard-differentiable function of the empirical cumulative distribution function of (*P*, *Q*, *D*), as long as the proportion of cases is bounded away from 0 and 1, so it is asymptotically normal and bootstrappable and is consistent for the value defined by applying the IDI() functional to the true distributions of *P*, *Q*, and *D* (19).

The asymptotic variance of the estimated IDI will depend only on the uncertainty in the numerators and so is the variance of

Under prospective sampling, this is still larger than the formula given by Pencina et al. (4). However, under case-control sampling with prespecified numbers of cases and controls, the variance is the sum of variance contributions from the case (*D* = 1) and control (*D* = 0) strata; so under these circumstances, the asymptotic variance is

Therefore, the variance formula presented by Pencina et al. (4) is correct if one develops the prediction models in a separate sample, fixes the “old” and “new” risk models to be those estimated from those samples, and then estimates the IDI in a separate *case-control* validation sample.
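The case-control result can be checked numerically with a short sketch, under stated assumptions: the risk models are fixed, the numbers of cases and controls are prespecified, and the published standard error is taken to be the square root of the sum of the squared standard errors of *Q* − *P* in the case and control strata. The function names and simulated values below are hypothetical and are not taken from the article.

```python
import numpy as np

def idi_with_stratified_se(p, q, d):
    """IDI estimate and a standard error built from separate case and control strata."""
    diff = np.asarray(q, dtype=float) - np.asarray(p, dtype=float)
    is_case = np.asarray(d).astype(bool)
    cases, controls = diff[is_case], diff[~is_case]
    idi = cases.mean() - controls.mean()
    # Sum of the variance contributions from the two strata.
    var = cases.var(ddof=1) / len(cases) + controls.var(ddof=1) / len(controls)
    return idi, np.sqrt(var)

def stratified_bootstrap_se(p, q, d, n_boot=2000, seed=0):
    """Bootstrap SE that resamples cases and controls separately, keeping their counts fixed."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(q, dtype=float) - np.asarray(p, dtype=float)
    is_case = np.asarray(d).astype(bool)
    cases, controls = diff[is_case], diff[~is_case]
    case_idx = rng.integers(0, len(cases), size=(n_boot, len(cases)))
    ctrl_idx = rng.integers(0, len(controls), size=(n_boot, len(controls)))
    reps = cases[case_idx].mean(axis=1) - controls[ctrl_idx].mean(axis=1)
    return reps.std(ddof=1)

# Simulated validation sample with 300 cases and 700 controls (assumed values).
rng = np.random.default_rng(3)
d = np.r_[np.ones(300), np.zeros(700)]
p = rng.uniform(0.2, 0.6, size=d.size)
q = np.clip(p + 0.1 * (d - 0.3) + rng.normal(0.0, 0.05, size=d.size), 0.01, 0.99)

idi, se_formula = idi_with_stratified_se(p, q, d)
se_boot = stratified_bootstrap_se(p, q, d)
print("IDI:", idi)
print("stratified formula SE:", se_formula)
print("stratified bootstrap SE:", se_boot)
```

In this fixed-model, fixed-count setting the two standard errors should agree closely. The sketch does not cover the situation in which the models are refitted in the same data used to estimate the IDI.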

1. Gu W, Pepe M. Measures to summarize and compare the predictive capacity of markers. Int J Biostat. 2009;5(1):Article 27. doi: 10.2202/1557-4679.1188.

2. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115(7):928–935.

3. Pepe MS, Janes HE. Gauging the performance of SNPs, biomarkers, and clinical factors for predicting risk of breast cancer. J Natl Cancer Inst. 2008;100(14):978–979.

4. Pencina MJ, D'Agostino RB, Sr, D'Agostino RB, Jr, et al. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27(2):157–172.

5. Dalton JE, Kattan MW. Recent advances in evaluating the prognostic value of a marker. Scand J Clin Lab Invest Suppl. 2010;242:59–62.

6. Chi YY, Zhou XH. The need for reorientation toward cost-effective prediction: comments on ‘evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by Pencina et al., *Statistics in Medicine* (DOI: 10.1002/sim.2929). Stat Med. 2008;27(2):182–184.

7. Cook NR. Comments on ‘evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al., *Statistics in Medicine* (DOI: 10.1002/sim.2929). Stat Med. 2008;27(2):191–195.

8. Greenland S. The need for reorientation toward cost-effective prediction: comments on ‘evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al., *Statistics in Medicine* (DOI: 10.1002/sim.2929). Stat Med. 2008;27(2):199–206.

9. Van Calster B, Van Huffel S. Integrated discrimination improvement and probability-sensitive AUC variants. Stat Med. 2010;29(2):318–319.

10. Mihaescu R, van Zitteren M, van Hoek M, et al. Improvement of risk prediction by genomic profiling: reclassification measures versus the area under the receiver operating characteristic curve. Am J Epidemiol. 2010;172(3):353–361.

11. Hlatky MA, Greenland P, Arnett DK, et al. Criteria for evaluation of novel markers of cardiovascular risk: a scientific statement from the American Heart Association. American Heart Association Expert Panel on Subclinical Atherosclerotic Diseases and Emerging Risk Factors and the Stroke Council. Circulation. 2009;119(17):2408–2416.

12. Pepe MS, Feng Z, Gu JW. Comments on ‘evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond’ by M. J. Pencina et al., *Statistics in Medicine* (DOI: 10.1002/sim.2929). Stat Med. 2008;27(2):173–181.

13. Taha TE, Brown ER, Hoffman IF, et al. A phase III clinical trial of antibiotics to reduce chorioamnionitis-related perinatal HIV-1 transmission. AIDS. 2006;20(9):1313–1321.

14. Chao C, Song Y, Cook N, et al. The lack of utility of circulating biomarkers of inflammation and endothelial dysfunction for type 2 diabetes risk prediction among postmenopausal women: the Women's Health Initiative Observational Study. Arch Intern Med. 2010;170(17):1557–1565.

15. Sandholt CH, Sparsø T, Grarup N, et al. Combined analyses of 20 common obesity susceptibility variants. Diabetes. 2010;59(7):1667–1673.

16. Criqui MH, Ho LA, Denenberg JO, et al. Biomarkers in peripheral arterial disease patients and near- and longer-term mortality. J Vasc Surg. 2010;52(1):85–90.

17. Blankenberg S, Zeller T, Saarela O, et al. Contribution of 30 biomarkers to 10-year cardiovascular risk estimation in 2 population cohorts: the MONICA, Risk, Genetics, Archiving, and Monograph (MORGAM) Biomarker Project. Circulation. 2010;121(22):2388–2397.

18. Lin HJ, Lee BC, Ho YL, et al. Postprandial glucose improves the risk prediction of cardiovascular death beyond the metabolic syndrome in the nondiabetic population. Diabetes Care. 2009;32(9):1721–1726.

19. van der Vaart AW. Asymptotic Statistics. Cambridge, United Kingdom: Cambridge University Press; 1998. Functional delta method; pp. 291–303.

Articles from American Journal of Epidemiology are provided here courtesy of **Oxford University Press**
