Int J Biostat. 2009 January 1; 5(1): 27.
Published online 2009 October 1. doi:  10.2202/1557-4679.1188
PMCID: PMC2827895

Measures to Summarize and Compare the Predictive Capacity of Markers


The predictive capacity of a marker in a population can be described using the population distribution of risk (Huang et al. 2007; Pepe et al. 2008a; Stern 2008). Virtually all standard statistical summaries of predictability and discrimination can be derived from it (Gail and Pfeiffer 2005). The goal of this paper is to develop methods for making inference about risk prediction markers using summary measures derived from the risk distribution. We describe some new clinically motivated summary measures and give new interpretations to some existing statistical measures. Methods for estimating these summary measures are described along with distribution theory that facilitates construction of confidence intervals from data. We show how markers and, more generally, how risk prediction models, can be compared using clinically relevant measures of predictability. The methods are illustrated by application to markers of lung function and nutritional status for predicting subsequent onset of major pulmonary infection in children suffering from cystic fibrosis. Simulation studies show that methods for inference are valid for use in practice.


1. Background

Let D denote a binary outcome variable, such as presence of disease or occurrence of an event within a specified time period, and let Y denote a set of predictive markers used to predict a bad outcome, D = 1, or a good outcome, D = 0. For example, elements of the Framingham risk score (age, gender, total and high-density lipoprotein cholesterol, systolic blood pressure, treatment for hypertension and smoking) are used to predict occurrence of a cardiovascular event within 10 years. We write the risk associated with marker value Y = y as risk(y) = P[D = 1|Y = y].

Huang et al. (2007) proposed the predictiveness curve to describe the predictive capacity of Y. It displays the population distribution of risk via the risk quantiles, R(ν) versus ν, where

P[risk(Y) ≤ R(ν)] = ν.

The inverse of the predictiveness curve is simply the cumulative distribution function (cdf) of risk(Y),

R^{-1}(p) = P[risk(Y) ≤ p] = F_risk(p),

and correspondingly

R(ν) = F_risk^{-1}(ν).
Gail and Pfeiffer (2005) noted that standard statistical measures used to quantify the predictive capacity of a risk prediction model can be calculated from the risk distribution function, F_risk(p). These include measures derived from the receiver operating characteristic (ROC) curve and the Lorenz curve, predictive values, misclassification rates, and measures of explained variation. Bura and Gastwirth (2001) used the risk quantiles, R(ν), to assess predictors in binary regression models. They proposed a summary index which they called the total gain.

Summary indices are often used to compare prediction models. The area under the ROC curve is widely used in practice for this purpose. However there is controversy about its use, particularly in the cardiovascular research community (Cook 2007; Pencina et al. 2008). This has motivated another approach to evaluating risk prediction markers that relies on defining categories of risk that are clinically meaningful. Several summary indices based on this notion have been proposed. The reclassification percent and the net reclassification index (NRI) are such summary measures derived from reclassification tables and they have recently gained popularity in the applied literature (Ridker et al. 2008; D’Agostino et al. 2008).

In this paper, we explicitly relate existing and new summary measures of prediction to the risk distribution, i.e. to the predictiveness curve. We contrast them qualitatively, paying particular attention to their clinical interpretations and relevance. We then derive distribution theory that can be used for making statistical inference. Note that rigorous methods for inference have not heretofore been available for several of the existing summary measures; rather, the measures are used informally in practice. Small sample performance of the new and existing summary measures is investigated with simulation studies.

The methods are illustrated with data from 12,802 children with cystic fibrosis disease. We describe the data and risk modelling methods in detail later in Section 7. Briefly, we compare the capacities of lung function and nutritional measures made in 1995 to predict onset of a pulmonary exacerbation event during the following year. Overall, 41% of children had a pulmonary exacerbation in 1996. Figure 1 displays predictiveness curves (estimated using methods described in Section 7) for two risk models, one based on lung function (FEV1) and one based on weight. We see from Figure 1 that lung function is more predictive in the sense that more subjects have lung function based risks at the high and low ends of the risk scale than is true for weight based risks. Since a good risk marker is one that is helpful to individuals making medical decisions, and because decisions are more easily made when an individual's risk is high or low than when it is in the middle, we conclude informally from the curves that lung function is a better predictor than weight. We next define formal summary indices that can be used for descriptive and comparative purposes and illustrate them with the cystic fibrosis data.

Figure 1.
Predictiveness curves for FEV1 (solid curve) and weight (dashed curve) as predictors of the risk of having at least one pulmonary exacerbation in the following year in children with cystic fibrosis. The horizontal line indicates the overall proportion of children with an event (ρ = 0.41).

2. Summary Indices Involving Risk Thresholds

In clinical practice, a subject's risk is calculated to assist in medical decision making. If his risk is high, he may be recommended for diagnostic, treatment or preventive interventions. If his risk is low, he may avoid interventions that are unlikely to benefit him. In certain clinical contexts, explicit treatment guidelines exist that are based on individual risk calculations. For example, the Third Adult Treatment Panel recommends that if a subject's 10 year risk of a cardiovascular disease exceeds 20% he should consider low density lipoprotein (LDL)-lowering therapy (Adult Treatment Panel III 2001). The risk threshold that leads one to opt for an intervention depends on anticipated costs and benefits. These may vary with individuals' perceptions and preferences (Vickers and Elkin 2006; Hunink et al. 2006). The choice of threshold may also vary with the availability of health care resources. In this section we discuss summary indices that depend on specifying a risk threshold. To be concrete, we suppose that the overall risk in the population, ρ = P[D = 1], is high and that the goal of the risk model is to identify individuals at low risk, risk(Y) < pL, where pL is the risk threshold that defines low risk in the specific clinical context. Analogous discussion would pertain to a low risk population in which a risk model is sought to identify a subset of individuals at high risk. Extensions to settings where multiple risk categories are of interest occur in practice when multiple treatment options are available, and will be discussed at the end of this section.

For illustration with the cystic fibrosis data, we choose the low risk threshold pL = 0.25, which contrasts with the overall incidence ρ = 0.41. Patients with cystic fibrosis now routinely receive inhaled antibiotic treatment to prevent pulmonary exacerbations, but this was not the case in the 1990s, when our data were collected. If subjects at low risk, risk(Y) < pL, in the absence of treatment could be identified, they could forego treatment and thereby avoid the inconveniences, monetary costs and potentially increased risk of developing therapy-resistant bacterial strains associated with inhaled prophylactic antibiotics.

2.1. Population Proportion at Low Risk

A simple, compelling summary measure is the proportion of the population deemed to be at low risk according to the risk model. This is R^{-1}(pL), the inverse function of the predictiveness curve, as noted earlier. A good risk prediction marker should identify more people at low risk so that more people can avoid the negative consequences of treatment that is unnecessary for them. That is, a better model will have larger values for R^{-1}(pL). In the cystic fibrosis example, we see from Figure 1 that 32% of subjects in the population are in the low risk stratum based on lung function measures while 11% are in the low risk stratum according to weight. A completely uninformative marker would put none in the low risk stratum since it assigns risk(Y) = ρ to all subjects.
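To make the calculation concrete, here is a minimal sketch (our own illustration, not the authors' code) of estimating the predictiveness curve and the proportion at low risk. It assumes a synthetic cohort with a binormal marker; the prevalence 0.41 and threshold 0.25 mirror the cystic fibrosis example, but all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n, p_L = 0.41, 50_000, 0.25   # prevalence, cohort size, low-risk threshold

# Synthetic cohort: marker Y ~ N(1,1) in cases (D=1), N(0,1) in controls (D=0)
d = rng.random(n) < rho
y = rng.normal(loc=d.astype(float), scale=1.0)

# For this binormal model the true risk is linear logistic in y:
# logit risk(y) = log(rho/(1 - rho)) + y - 0.5
risk = 1.0 / (1.0 + np.exp(-(np.log(rho / (1 - rho)) + y - 0.5)))

# Predictiveness curve R(v): the v-th quantile of risk(Y)
v = np.linspace(0.01, 0.99, 99)
R_curve = np.quantile(risk, v)

# Inverse curve R^{-1}(p_L): proportion of the population classified as low risk
prop_low = np.mean(risk < p_L)
```

An uninformative marker assigns risk(Y) = ρ to everyone, so prop_low would be 0 for any threshold below ρ.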

2.2. Cases and Controls Classified as Low Risk

Another important perspective from which to evaluate risk prediction markers is classification accuracy (Pepe et al. 2008a; Janes, Pepe and Gu 2008). This is characterized by the risk distribution in cases, subjects for whom D = 1, and in controls, subjects with a good outcome D = 0. Specifically, a better risk model will classify fewer cases and more controls as low risk (Pencina et al. 2008). This is desirable because cases should not forego treatment as they may benefit from it. On the other hand, treatment should be avoided for controls since they only suffer its negative consequences. Corresponding summary measures are termed true and false positive rates,

TPR(pL) = P[risk(Y) ≥ pL | D = 1]  and  FPR(pL) = P[risk(Y) ≥ pL | D = 0].
Higher TPR(pL) and lower FPR(pL) are desirable.

Figure 2 shows cumulative distributions of risk(Y) in cases and controls separately. From this, TPR(p) and FPR(p) can be gleaned for any value of p. We see that the proportion of controls in the low risk stratum is much larger when using lung function as the risk prediction marker than when using weight, 1-FPR(pL) = 46% for lung function as opposed to 15% for weight. However, the proportion of cases whose risks exceed pL is also lower for the lung function model (TPR(pL) = 87%) than for the weight model (TPR(pL) = 93%).
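The classification rates at a given threshold are simple tail proportions of the case and control risk distributions. A hedged sketch on synthetic binormal data (illustrative values only, not the study data):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n, p_L = 0.41, 50_000, 0.25
d = rng.random(n) < rho
y = rng.normal(d.astype(float), 1.0)
risk = 1.0 / (1.0 + np.exp(-(np.log(rho / (1 - rho)) + y - 0.5)))

tpr = np.mean(risk[d] >= p_L)     # proportion of cases classified high risk
fpr = np.mean(risk[~d] >= p_L)    # proportion of controls classified high risk
spared = 1.0 - fpr                # controls correctly spared the intervention
```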

Figure 2.
Cumulative distributions of risk based on FEV1 and weight in predicting the risk of having at least one pulmonary exacerbation in the following year in children with cystic fibrosis. Distributions are shown separately for subjects who had events (cases) and those who did not (controls).

Observe that TPR(pL) and FPR(pL) are indexed by the threshold pL. This contrasts with the display of TPR and FPR that constitutes the ROC curve. ROC curves (Figure 3) suppress the risk threshold values by showing TPR just as a function of FPR, not TPR and FPR as functions of risk threshold (Figure 2). When specific risk thresholds define clinically meaningful risk categories, the TPR and FPR associated with those risk category definitions are of intrinsic interest, more so than the TPR achieved at a fixed FPR value.

Figure 3.
ROC curves for FEV1 (solid curve) and weight (dashed curve) as predictors of risk of having at least one pulmonary exacerbation in the following year in children with cystic fibrosis. The solid and filled circles are the true and false positive rates ...

2.3. Event Rates in Risk Strata

Another pair of summary measures is the event rates in the two risk strata. These can be thought of as predictive values, PPV(pL) and 1-NPV(pL), defined as

PPV(pL) = P[D = 1 | risk(Y) ≥ pL]  and  NPV(pL) = P[D = 0 | risk(Y) < pL].  (1)
PPV(pL) is the event rate in the high risk stratum and 1-NPV(pL) is the event rate in the low risk stratum. For a good marker, the event rate PPV(pL) will be high and the event rate 1-NPV(pL) will be low.

By applying Bayes theorem to (1), PPV and NPV can be written in terms of TPR and FPR:

PPV(p) = ρTPR(p) / {ρTPR(p) + (1 − ρ)FPR(p)}  and  NPV(p) = (1 − ρ){1 − FPR(p)} / [(1 − ρ){1 − FPR(p)} + ρ{1 − TPR(p)}].
These expressions facilitate estimation of PPV(p) and NPV(p), which we discuss in section 4.
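The Bayes representation reproduces the stratum event rates exactly from (TPR, FPR, ρ). The following sketch (our own, on synthetic data) checks it against direct computation:

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n, p = 0.41, 100_000, 0.25
d = rng.random(n) < rho
y = rng.normal(d.astype(float), 1.0)
risk = 1.0 / (1.0 + np.exp(-(np.log(rho / (1 - rho)) + y - 0.5)))

# Event rates computed directly within each risk stratum
ppv_direct = np.mean(d[risk >= p])
npv_direct = np.mean(~d[risk < p])

# The same quantities via Bayes' theorem from TPR, FPR and prevalence
rho_hat = np.mean(d)
tpr = np.mean(risk[d] >= p)
fpr = np.mean(risk[~d] >= p)
ppv_bayes = rho_hat * tpr / (rho_hat * tpr + (1 - rho_hat) * fpr)
npv_bayes = ((1 - rho_hat) * (1 - fpr)
             / ((1 - rho_hat) * (1 - fpr) + rho_hat * (1 - tpr)))
```

With empirical plug-ins the two routes agree to machine precision, since both reduce to the same counts.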

Event rates are also functions of the predictiveness curve. Specifically, they average the curve over the ranges (νL, 1) and (0, νL), where νL = R^{-1}(pL).

PPV(pL) = ∫_{νL}^{1} R(ν) dν / (1 − νL)  and  1 − NPV(pL) = ∫_{0}^{νL} R(ν) dν / νL.
For the cystic fibrosis example, estimates of the event rates, 1-NPV(pL) and PPV(pL), are 17% and 53% for the risk strata defined by lung function. In contrast, the event rates are much closer to each other, 24% and 43%, in the two risk strata defined by weight. Again lung function appears to be the better predictor of low risk. Not only is R^{-1}(pL), the size of the low risk stratum, bigger when using lung function, but 1-NPV(pL), the event rate in the low risk stratum, is also smaller.

2.4. νth Risk Percentile

In the applied literature, variables are often categorized using quantiles. In this vein, categories of risk are sometimes defined using risk quantiles for which we have used the notation R(ν). For example, Ridker et al. (2000) used quartiles of risk and noted that high sensitivity c-reactive protein (hs-CRP) was more predictive of cardiovascular risk than standard lipid screening because the level of hs-CRP in the highest versus lowest quartile was associated with a much higher relative risk for future coronary events than was the case for standard lipid measurements.

Another context in which R(ν) is well motivated is when availability of medical resources is limited. Suppose resources are available to provide an intervention to a fraction 1 − ν of the population, those at highest risk. Since R(ν) is the corresponding risk quantile, subjects given the intervention have risks ≥ R(ν). A marker or risk model for which R(ν) is larger is preferable because it ensures that those receiving intervention are at greater risk of a bad outcome in the absence of the intervention.

In the cystic fibrosis example, suppose the 10% of the population deemed to be at highest risk will be treated. If lung function is used to calculate risk, subjects with risks at or above 0.76 receive treatment. On the other hand if weight is used to calculate risk, subjects whose risks are as low as 0.52 will be offered treatment.

2.5. Risk Threshold Yielding Specified TPR or FPR

In a diagnostic setting, it may be important to flag most people with disease as high risk so that they receive necessary treatment. In other words, we may require that the TPR equal a certain minimum value, TPR = t. The corresponding risk threshold is an important entity to report. We denote it by R(νT(t)). The decision rule that yields TPR = t requires people whose risks are as low as R(νT(t)) to undergo treatment. If the treatment is cumbersome or risky, the decision rule may be unacceptable or unethical if the threshold R(νT(t)) is low.

In screening healthy populations for a rare disease such as ovarian cancer, the false positive rate must be very low in order to avoid large numbers of subjects undergoing unnecessary medical procedures. The risk threshold that yields an acceptable FPR must also be acceptable for individuals as a threshold for deciding for or against medical procedures. To maintain a very low FPR, the risk threshold may be very high, in which case the decision rule would not be ethical. Reporting the risk threshold that yields a specified FPR = t is therefore often important in practice, and we denote that threshold by R(νF(t)).

Unlike other predictiveness summary measures, R(νT(t)) and R(νF(t)) may not be suited to the task of comparing markers. It is not clear that a specific ordering of thresholds is always preferable. In the cystic fibrosis example, the risk threshold that yields TPR = 0.85 is 0.27 when the calculation is based on lung function, but 0.32 when weight is used. Another consideration is the corresponding false positive rate, which is 0.50 for lung function and 0.72 for weight. If one wanted instead to control the false positive rate, at FPR = 0.15 say, the corresponding risk thresholds are 0.54 for lung function and 0.51 for weight. Observe that the lung function based risk threshold is lower than that for weight when controlling the TPR but higher when controlling the FPR.
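Because TPR(p) is one minus the cdf of the case risks, the threshold achieving TPR = t is the (1 − t) quantile of the case risk distribution, and analogously for FPR with the control risks. An illustrative sketch on synthetic data (our own code, not the study analysis):

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n, t = 0.41, 100_000, 0.85
d = rng.random(n) < rho
y = rng.normal(d.astype(float), 1.0)
risk = 1.0 / (1.0 + np.exp(-(np.log(rho / (1 - rho)) + y - 0.5)))

# Threshold achieving TPR = t: the (1 - t) quantile of the case risks
thr_tpr = np.quantile(risk[d], 1 - t)
# Threshold achieving FPR = 0.15: the 0.85 quantile of the control risks
thr_fpr = np.quantile(risk[~d], 0.85)

# Check the achieved classification rates
achieved_tpr = np.mean(risk[d] >= thr_tpr)
achieved_fpr = np.mean(risk[~d] >= thr_fpr)
```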

2.6. Risk Reclassification Measures

Several summary measures that rely on defined risk categories have been proposed recently. They were defined in the context of comparing a baseline risk model with one that adds a novel marker to the baseline predictors, using risk reclassification tables that involve 3 or more categories of risk. It is illuminating to consider these measures in our much simplified context, where only 2 risk categories defined by a single risk threshold pL are of interest and the baseline model involves no covariates at all, so that the baseline risk is equal to ρ for all subjects. We discuss the more complex setting later.

Cook (2007) proposes the reclassification percent to summarize predictive information in a model. In our context, all subjects are considered high risk under the baseline model because ρ > pL. The reclassification percent is therefore the proportion of subjects classified as low risk according to the risk model involving Y. This is exactly the summary index R^{-1}(pL) discussed earlier.

Pencina et al. (2008) criticize the reclassification percent because it does not distinguish between desirable risk reclassifications (up for cases and down for controls) and undesirable risk reclassifications (down for cases and up for controls). They propose the net reclassification improvement (NRI) summary statistic as an alternative. We use "up" and "down" to denote changes of one or more risk categories in the upward and downward directions, respectively, for a subject between their baseline and augmented risk values. The NRI is defined as

NRI = {P(up | D = 1) − P(down | D = 1)} + {P(down | D = 0) − P(up | D = 0)}.
In our simple context it is easy to see that

NRI = TPR(pL) − FPR(pL),
where TPR(pL) and FPR(pL) were discussed earlier. We see that in the 2 category setting the NRI statistic is equal to Youden's index (Youden 1950). Youden's index has been criticized because it implicitly weighs equally the consequences of classifying a case as low risk, i.e. a case failing to receive intervention, and classifying a control as high risk, i.e. a control subjected to unnecessary intervention. Most often the costs and consequences of these mistakes will differ greatly for cases and controls. Therefore we recommend reporting the two components of the NRI separately, TPR(pL) and FPR(pL). Values were reported for the cystic fibrosis study above. The corresponding NRI values are 0.33 = 0.87 − 0.54 for lung function and 0.08 = 0.93 − 0.85 for weight.
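In the two-category, no-covariate setting this algebra can be checked directly: with everyone high risk at baseline, the only possible moves are downward, and the NRI collapses exactly to Youden's index. A small sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
rho, n, p_L = 0.41, 50_000, 0.25
d = rng.random(n) < rho
y = rng.normal(d.astype(float), 1.0)
risk = 1.0 / (1.0 + np.exp(-(np.log(rho / (1 - rho)) + y - 0.5)))

# Baseline model with no covariates: everyone has risk rho > p_L (high risk),
# so upward moves are impossible and only downward reclassification can occur
down_cases = np.mean(risk[d] < p_L)      # undesirable moves (cases to low risk)
down_controls = np.mean(risk[~d] < p_L)  # desirable moves (controls to low risk)

# NRI = {P(up|case) - P(down|case)} + {P(down|control) - P(up|control)}
nri = (0.0 - down_cases) + (down_controls - 0.0)

# Youden's index TPR(p_L) - FPR(p_L)
youden = np.mean(risk[d] >= p_L) - np.mean(risk[~d] >= p_L)
```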

2.7. Extensions and Discussion

A key use of summary measures is to compare different risk models. One can quantify the difference in performance between two risk models by taking the difference between summary measures derived from the two models. In the cystic fibrosis example discussed here, the two risk models involve completely different markers. However, one could also entertain two models that involve some common predictors. The setting in which risk reclassification ideas have emerged is where one model involves standard baseline predictors and the other includes a novel marker in addition to the baseline predictors. Taking the difference in summary measures for the two models is a sensible way of assessing improvement in performance in this context too.

Recall that when only 2 risk categories (low versus high) exist, Cook's reclassification percent is equal to R^{-1}(pL) when the baseline risk does not depend on baseline covariates. However, the reclassification percent is not equal to the difference of values for R^{-1}(pL) between the baseline and augmented models when the baseline model does involve covariates. In general, even when two models have exactly the same predictive performance, the reclassification percent is typically non-zero. In fact it has been shown to vary dramatically with correlations between predictors in one model versus another (Janes et al. 2008). This measure therefore does not seem well suited for gauging the difference between the predictive capacities of two models. Instead we suggest that one simply focus on the difference in proportions of subjects classified as low (or high) risk with the two models, i.e. differences in R^{-1}(pL).

We represented the NRI statistic as TPR(pL) − FPR(pL) in the simple setting. It is easy to show that when two models involve covariates, the NRI statistic to compare the two models is the difference {TPR1(pL) − TPR2(pL)} − {FPR1(pL) − FPR2(pL)}, where subscripts 1 and 2 index the two models. In analogy with our earlier discussion, we recommend reporting the two comparative components separately, TPR1(pL) − TPR2(pL) and FPR1(pL) − FPR2(pL), rather than their difference, the NRI, because typically changes in TPR should be weighted differently than changes in FPR.

Summarizing data is difficult when more than two risk categories are involved. Statistics such as the NRI have been criticized because they do not distinguish between changes of one risk category and more than one risk category (Pepe et al. 2008b). In a similar vein, when 3 risk categories exist with specific treatment recommendations for each, misclassifying a case as being in the lowest risk level may be more serious than misclassifying him as in the middle category. Similarly, misclassifying a control as being in the highest risk level may be more serious than misclassifying him as being in the middle category. Without specifying utilities associated with different types of misclassifications, any accumulation of data across risk categories is difficult to justify. For these settings we propose use of a vector of summary statistics distinguished by the risk thresholds. For example, suppose we consider three risk categories for the cystic fibrosis study defined by two thresholds pL = 0.25 and pH = 0.75. We could report: the proportions of subjects in the highest and lowest categories, (1 − R^{-1}(pH), R^{-1}(pL)); the proportions of cases and controls in each category, (TPR(pH), 1 − TPR(pL)) and (FPR(pH), 1 − FPR(pL)); and so forth.

Although statistical summaries that depend on clinically meaningful risk thresholds are appealing, the choice of risk thresholds is often uncertain. Different clinicians or policy makers may choose different risk categorizations. This argues for displaying the risk distributions as continuous curves since one can then read from them summary indices described here using any risk threshold of interest to the reader.

3. Threshold Independent Summary Measures

Classic measures that describe the predictive strength of a model can be interpreted as summary indices for the predictiveness curve. We describe the relationships next. These measures can complement the display of risk distributions for several models when no specific risk thresholds are of key interest. In addition, formal hypothesis tests to compare predictiveness curves can be based on them.

3.1. Proportion of Explained Variation

The proportion of explained variation, also called R^2, is the most popular measure of predictive power for continuous outcomes and is popular for binary outcomes too. It is most commonly defined as

PEV = {var(D) − E[var(D|Y)]} / var(D).
But it can also be written as

PEV = var(risk(Y)) / var(D) = var(risk(Y)) / {ρ(1 − ρ)},
because var(D) = E(var(D|Y)) + var(E(D|Y)) and E(D|Y) = P(D = 1|Y) = risk(Y). PEV is a standardized measure of the variance in risk(Y), since ρ(1 − ρ) in the denominator is the risk variance for an ideal marker that predicts risk(Y) = 1 for cases and risk(Y) = 0 for controls. Hu et al. (2006) noted that PEV can also be written as the squared correlation between D and risk(Y).

An unintuitive but interesting and simple interpretation for PEV is as the difference between the averages of risk(Y) for cases and controls (Pepe, Feng and Gu 2008b),

PEV = E[risk(Y) | D = 1] − E[risk(Y) | D = 0].
In summary, for the cystic fibrosis data, PEV, calculated as 0.22 for the lung function measure and 0.05 for weight, can be interpreted as the variances of the risk distributions displayed in Figure 1 standardized by the ideal variance of 0.41 × (1 − 0.41) = 0.24, or as differences in means of the distributions shown in Figure 2. In Figure 1, var(risk(Y)) = 0.053 for lung function and 0.012 for weight, yielding 0.22 and 0.05 respectively when divided by 0.24. On the other hand, in Figure 2, case and control mean risks are 0.54 and 0.32 for lung function while they are 0.44 and 0.39 for weight, again yielding 0.54 − 0.32 = 0.22 and 0.44 − 0.39 = 0.05 for the PEV values calculated as the differences in means.
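The two interpretations of PEV, standardized risk variance and case-control difference in mean risk, agree up to sampling error. A quick numerical check on a synthetic binormal marker (our own sketch, not the study data):

```python
import numpy as np

rng = np.random.default_rng(5)
rho, n = 0.41, 200_000
d = rng.random(n) < rho
y = rng.normal(d.astype(float), 1.0)
risk = 1.0 / (1.0 + np.exp(-(np.log(rho / (1 - rho)) + y - 0.5)))

rho_hat = np.mean(d)
# PEV as the standardized variance of risk(Y)
pev_var = np.var(risk) / (rho_hat * (1 - rho_hat))
# PEV as the difference in mean risk between cases and controls
pev_diff = np.mean(risk[d]) - np.mean(risk[~d])
```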

Pencina et al. (2008) employ the PEV summary measure to gauge the improvement in risk prediction when clinically relevant risk thresholds do not exist. They do not recognize it as the proportion of explained variation but call it integrated discrimination improvement (IDI) and note that it has another interpretation as Youden’s index integrated uniformly over (0,1):

PEV = ∫_0^1 YI(p) dp,
where YI(p) = P(risk(Y) > p|D = 1) − P(risk(Y) > p|D = 0) is Youden's index for the binary decision rule that is positive when risk(Y) > p. In other words, PEV can also be interpreted as the difference between the integrated TPR(p) and FPR(p) functions defined earlier.

In a commentary on the Pencina et al. (2008) paper, Ware and Cai (2008) suggest that IDI, denoted here by PEV, does not depend on the overall event rate, ρ = P(D = 1). We disagree. To illustrate, suppose we have a single marker with risk function risk(Y) increasing in Y. Then

PEV = ∫_0^1 {P(Y > y | D = 1) − P(Y > y | D = 0)} dp,
where risk(y) = p. Here, the conditional probabilities, P(Y > y|D = 1) and P(Y > y|D = 0), are independent of prevalence, ρ, but the threshold y solving risk(y) = p is a function of ρ. To demonstrate, consider a simple linear logistic regression model,

logit P(D = 1|Y) = θ0 + θ1Y.  (3)
Note also that, by Bayes' theorem,

P(D = 1|Y) = [{P(Y|D = 1)/P(Y|D = 0)}{ρ/(1 − ρ)}] / [1 + {P(Y|D = 1)/P(Y|D = 0)}{ρ/(1 − ρ)}].

Since this expression clearly varies with ρ, so does its derivative. Figure 4 shows the relationship between PEV and ρ for a marker Y that is standard normally distributed in controls and normally distributed with mean 1 and variance 1 in cases. The risk is a simple linear logistic risk function (equation (3)). As ρ increases from 0 to 1, we see that PEV increases then decreases, with the maximum occurring at ρ = 0.5. Janssens et al. (2006) also demonstrated dependence of PEV on ρ through a simulation study.
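The dependence of PEV on ρ in this binormal setting can be reproduced by numerical integration; the following is an illustrative sketch of the calculation behind Figure 4 (the grid and function names are ours):

```python
import numpy as np

def pev(rho, grid=np.linspace(-8.0, 9.0, 20001)):
    """PEV = var(risk(Y)) / {rho*(1-rho)} for controls ~ N(0,1), cases ~ N(1,1)."""
    phi = lambda x: np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)
    f = rho * phi(grid - 1.0) + (1 - rho) * phi(grid)   # marginal density of Y
    risk = rho * phi(grid - 1.0) / f                    # P(D = 1 | Y = y) by Bayes
    dy = grid[1] - grid[0]
    var_risk = np.sum((risk - rho) ** 2 * f) * dy       # Riemann-sum integration
    return var_risk / (rho * (1 - rho))

rhos = np.linspace(0.05, 0.95, 19)
pevs = np.array([pev(r) for r in rhos])   # rises then falls, symmetric about 0.5
```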

Figure 4.
Relationship between the proportion of explained variation, PEV, and the prevalence. A linear logistic risk model with controls standard normally distributed and cases normally distributed with mean 1 and variance 1 was used to generate the data. The maximum occurs at ρ = 0.5.

The proportion of explained variation has been defined in other ways, notably based on notions of log likelihood (deviance). Gail and Pfeiffer (2005) note that these can also be calculated from the risk distribution. However, Zheng and Agresti (2000) make the point that these summary measures are difficult to interpret and we concur wholeheartedly. Therefore, we do not pursue them further in this paper but note that methods for inference could be developed in analogy with those we develop here for PEV.

3.2. Total Gain

Total gain, proposed by Bura and Gastwirth (2001), is defined as

TG = ∫_0^1 |R(ν) − ρ| dν.
This is the area sandwiched between the predictiveness curve and the horizontal line at ρ, which is the predictiveness curve for a completely uninformative marker assigning risk(Y) = ρ to all subjects. TG is appealing because it can be visualized directly from the predictiveness curve. For a perfect risk prediction model, the predictiveness curve is a step function rising from 0 to 1 at ν = 1 − ρ. The corresponding TG is 2ρ(1 − ρ).

Other interpretations can be made for TG. Huang and Pepe (2008b) have shown that TG is equivalent to the Kolmogorov-Smirnov measure of distance between risk distributions for cases and controls. This is an ROC summary index (Pepe 2003, page 80):

TG = 2ρ(1 − ρ) max_p {TPR(p) − FPR(p)}.  (6)

In fact we can write this more simply.

TG = 2ρ(1 − ρ){TPR(ρ) − FPR(ρ)}.  (7)


Let ν* be the point where R(ν*) = ρ. We have

TG = ∫_0^{ν*} {ρ − R(ν)} dν + ∫_{ν*}^1 {R(ν) − ρ} dν.
Furthermore, because ρ = ∫_0^{ν*} R(ν) dν + ∫_{ν*}^1 R(ν) dν and ρ = ∫_0^1 ρ dν, setting these two expressions equal it follows that

∫_0^{ν*} {ρ − R(ν)} dν = ∫_{ν*}^1 {R(ν) − ρ} dν.
Therefore TG can be written as

TG = 2 ∫_{ν*}^1 {R(ν) − ρ} dν = 2{ρTPR(ρ) − (1 − ν*)ρ},
because TPR(ρ) = ∫_{ν*}^1 R(ν) dν / ρ. Moreover, since 1 − R^{-1}(ρ) = ρTPR(ρ) + (1 − ρ)FPR(ρ), the above representation of TG can be further simplified to 2ρ(1 − ρ){TPR(ρ) − FPR(ρ)}.

This representation of TG is useful for estimation and for deriving asymptotic distribution theory. Interestingly, by equating (6) and (7), we find that the maximum value of TPR(p) − FPR(p) occurs at the risk threshold p = ρ. Another short proof follows by taking derivatives. In particular, since

TPR(R(ν)) − FPR(R(ν)) = (1/ρ) ∫_ν^1 R(u) du − {1/(1 − ρ)} ∫_ν^1 {1 − R(u)} du,
taking the derivative of the right side with respect to ν and setting it to 0, we have

−R(ν)/ρ + {1 − R(ν)}/(1 − ρ) = 0
at the solution. That is, the solution is at R(ν) = ρ. In the same illustrative setting used above, where ρ = 0.2, Y is standard normal in controls and normal with mean 1 and variance 1 in cases, we see from Figure 5 how TPR(p)-FPR(p) varies with p. The maximum value, 0.39, is achieved at p = 0.2, i.e. at p = ρ.
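These identities are easy to verify numerically. In the same binormal setting with ρ = 0.2 (synthetic data; a sketch, not the authors' code), TG computed as the area E|risk(Y) − ρ| matches 2ρ(1 − ρ){TPR(ρ) − FPR(ρ)}, and the maximum of TPR(p) − FPR(p) lands at p ≈ ρ:

```python
import numpy as np

rng = np.random.default_rng(6)
rho, n = 0.2, 200_000
d = rng.random(n) < rho
y = rng.normal(d.astype(float), 1.0)
risk = 1.0 / (1.0 + np.exp(-(np.log(rho / (1 - rho)) + y - 0.5)))

# TG as the area between the predictiveness curve and the line at rho:
# integral of |R(v) - rho| dv equals E|risk(Y) - rho|
tg_area = np.mean(np.abs(risk - rho))

# TG via the representation 2*rho*(1-rho)*{TPR(rho) - FPR(rho)}
tpr_rho = np.mean(risk[d] > rho)
fpr_rho = np.mean(risk[~d] > rho)
tg_rep = 2 * rho * (1 - rho) * (tpr_rho - fpr_rho)

# The maximum of TPR(p) - FPR(p) over thresholds occurs near p = rho
ps = np.linspace(0.01, 0.99, 99)
gap = np.array([np.mean(risk[d] > p) - np.mean(risk[~d] > p) for p in ps])
p_star = ps[np.argmax(gap)]
```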

Figure 5.
Association between TPR(p) − FPR(p) and p. A linear logistic risk model with controls standard normally distributed and cases normally distributed with mean 1 and variance 1 was used to generate the data. Overall prevalence of event ρ = 0.2.

Another appealing feature of TG is that after it is standardized by 2ρ(1 − ρ), the total gain for a perfect marker, it is functionally independent of ρ. Let TG* denote the standardized total gain,

TG* = TG / {2ρ(1 − ρ)},

so that TG* ∈ [0, 1]. We will focus on TG* here. It is independent of disease prevalence because of its interpretation as the Kolmogorov-Smirnov ROC summary index. Moreover, based on the results above, TG* is simply interpreted as the difference between the proportions of cases and controls with risks above the average, ρ = P(D = 1) = E(risk(Y)).

In the cystic fibrosis example, TG based on lung function is 0.20, while TG based on weight is 0.09. Since the overall event rate is ρ = 41%, the corresponding standardized total gain values are 0.42 for lung function and 0.20 for weight.

3.3. Area Under the ROC Curve and Further Discussion

The area under the ROC curve is widely used to summarize and compare predictive markers and models. It can be interpreted simply as the probability of correctly ordering subjects with and without events using risk(Y):

AUC = P[risk(Yi) > risk(Yj) | Di = 1, Dj = 0], where i and j index two randomly selected subjects.
However, it has been criticized widely for having little relevance to clinical practice (Cook 2007; Pepe and Janes 2008; Pepe et al. 2007). In particular, the task facing the clinician in practice is not to order risks for two individuals. Part of the appeal of the AUC, however, lies in the fact that it depends neither on prevalence, ρ, nor on risk thresholds. Yet in the context of risk prediction within a specific clinical population, these attributes may be weaknesses. In particular, when specific risk thresholds are of interest, the ROC curve hides them. In Figure 3, we plot the ROC curves for risk based on lung function and on weight. The AUC values are 0.771 and 0.639, respectively.
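As a hedged sketch (synthetic data, our own code), the AUC of the risk-based ordering can be computed with the rank (Wilcoxon/Mann-Whitney) formula rather than all pairwise comparisons; for this binormal marker the value should land near Φ(1/√2) ≈ 0.76:

```python
import numpy as np

rng = np.random.default_rng(7)
rho, n = 0.41, 20_000
d = rng.random(n) < rho
y = rng.normal(d.astype(float), 1.0)
risk = 1.0 / (1.0 + np.exp(-(np.log(rho / (1 - rho)) + y - 0.5)))

cases, controls = risk[d], risk[~d]

# AUC as the Mann-Whitney probability of correct ordering, from ranks:
# double argsort yields 0-based ranks of the pooled sample
ranks = np.argsort(np.argsort(np.concatenate([cases, controls]))) + 1
auc = (ranks[:len(cases)].sum() - len(cases) * (len(cases) + 1) / 2) \
      / (len(cases) * len(controls))
```

Because risk(Y) is monotone in Y here, the AUC of the risk equals the AUC of the raw marker, which is the binormal value Φ(1/√2).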

Interestingly all of the measures discussed here can be thought of as the mathematical distance between risk distributions for cases and controls (Figure 2) measured in different ways. The PEV is the difference in the means of case and control risk distributions. The TG is the Kolmogorov-Smirnov measure and we have shown that this is equal to the difference between the proportions of cases and controls with risks larger than ρ. The AUC is equivalent to the Wilcoxon measure of distance between risk distributions for cases and controls.

4. Estimation of Summary Measures

We now turn to estimation of summary indices from data. We focus on the scenario where Y is a single continuous marker. We also allow Y to be a predefined combination of multiple markers. For example, the score may be derived from a training dataset and our task is to evaluate the combination score using a test dataset.

We use the following notation: Y, YD and YD̄ are marker measurements from the general, case and control populations, respectively. Let F, FD and FD̄ be the corresponding distribution functions and let f, fD and fD̄ be the density functions. We assume the risk, risk(Y) = P(D = 1|Y), is monotone increasing in Y. Under this assumption we have R(ν) = P{D = 1|Y = F⁻¹(ν)}. Thus the curve R(ν) vs. ν is the same as the curve risk(Y) vs. F(Y), and the predictiveness curve can be obtained by first estimating the risk model risk(Y) and then the marker distribution F(Y). Let YDi, i = 1, ..., nD, be the nD independent identically distributed observations from cases, and YD̄i, i = 1, ..., nD̄, be the nD̄ independent identically distributed observations from controls. We write Yi, i = 1, ..., n, for {YD1, ..., YDnD, YD̄1, ..., YD̄nD̄}, where n = nD + nD̄.

Suppose the risk model is risk(Y) = P(D = 1|Y) = G(θ, Y), where


and h is some monotone increasing function of Y. This is a very general formulation. As a special case, logit{G(θ, Y)} could be as simple as θ0 + θ1Y with θ1 > 0, the ordinary linear logistic model. We consider estimation first under a cohort or cross-sectional design, and later discuss case-control designs, for which the logistic regression formulation is particularly helpful.

4.1. Cohort Design

Suppose we have n independent identically distributed observations (Yi, Di) from the population. Maximum likelihood estimates of θ, denoted θ̂, can be obtained, as well as empirical estimates of F, FD, FD̄ and ρ, denoted F̂, F̂D, F̂D̄ and ρ̂. We use these to calculate estimated summary indices. Summary measures that involve risk thresholds are the risk quantile, R(ν); the population proportion with risk below p, R⁻¹(p); the proportions of cases and controls with risks above p, TPR(p) and FPR(p); the event rates in risk strata, PPV(p) and 1 − NPV(p); and the risk thresholds yielding specified TPR or FPR, R(νT(t)) and R(νF(t)).
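As an illustration only (not the authors' code), the plug-in estimators of the threshold-based measures can be sketched as follows; the bare-bones Newton-Raphson logistic fit and all function and variable names are ours:

```python
import numpy as np

def fit_logistic(y, d, iters=25):
    # Newton-Raphson fit of the linear logistic model logit P(D=1|Y) = t0 + t1*Y.
    X = np.column_stack([np.ones_like(y), y])
    theta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        H = X.T @ (X * (p * (1 - p))[:, None])   # observed information matrix
        theta += np.linalg.solve(H, X.T @ (d - p))
    return theta

def cohort_summaries(y, d, p):
    # Plug-in estimates of the threshold-based summary measures at risk threshold p.
    t0, t1 = fit_logistic(y, d)
    risk = 1.0 / (1.0 + np.exp(-(t0 + t1 * y)))
    high = risk > p
    return {
        "R_inv": np.mean(~high),        # proportion with risk at or below p
        "TPR": np.mean(high[d == 1]),   # cases with risk above p
        "FPR": np.mean(high[d == 0]),   # controls with risk above p
        "PPV": np.mean(d[high]),        # event rate in the high-risk stratum
        "NPV": np.mean(1 - d[~high]),   # non-event rate in the low-risk stratum
    }

# Demo on simulated cohort data (controls ~ N(0,1), cases ~ N(1,1), prevalence 0.2).
rng = np.random.default_rng(1)
d = rng.binomial(1, 0.2, 2000).astype(float)
y = rng.normal(d, 1.0)
summaries = cohort_summaries(y, d, p=0.2)
```

For an informative marker, the estimated TPR(p) exceeds the estimated FPR(p), consistent with the risk distributions for cases lying to the right of those for controls.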

We plug θ̂ and F̂ into G to get estimators of R(ν) and R⁻¹(p):


Estimates of the proportions of cases and controls with risks above p are:


We write the event rates in risk strata in terms of TPR(p) and FPR(p) to facilitate their estimation:


In a cohort study, these estimates are equal to the empirical proportions of cases amongst those with estimated risks above and below p. However, the formulations here are valid in a case-control study too. Finally risk thresholds yielding specified TPR or FPR are obtained by first calculating the corresponding quantile of Y and then plugging it into the fitted risk model:


Summary measures that do not involve specific risk thresholds are the proportion of explained variation, PEV; the standardized total gain, TG; and the area under the ROC curve, AUC. Recall that the PEV is the difference between the mean risk in cases and in controls. Sample means of estimated risks yield an estimator of PEV:


On the other hand, TG can be expressed as the difference between the proportions of cases and controls with risks above ρ. We write:


Finally, the AUC is estimated as the proportion of case-control pairs in which the estimated risk for the case exceeds that of the control:


Since G(θ, Y) is increasing in Y, this is the same as the standard empirical estimator of the AUC based on Y,


4.2. Case-Control Design

Case-control studies are often conducted in the early phases of marker development (Pepe et al. 2001; Baker et al. 2002). Compared to cohort studies, they are smaller and more cost-efficient. Since early-phase studies dominate biomarker research, it is crucial that estimates of statistical measures of performance accommodate case-control designs. In this section, we describe estimation under a case-control design assuming that an estimate of prevalence, ρ̂, is available. The value ρ̂ may be derived either from a cohort that is independent of the case-control sample or from the parent cohort within which the case-control sample is nested. As a special case, one can assume ρ is known or fixed without sampling variability. In determining populations where risk markers may or may not be useful, predictiveness curves could be evaluated for various specified fixed values of ρ.

In case-control studies, we sample fixed numbers of cases and controls, nD and nD̄, respectively. As a consequence, the intercept of the logistic risk model is not estimable. But by adjusting the intercept, we can still estimate the true risk in the population. In particular, let S indicate case-control sampling. In the case-control study the risk model can be written as


where θ0S = θ0 − log{(nD̄/nD) · ρ/(1 − ρ)} and θ1S = θ1, and θ0 and θ1 are the population-based intercept and slope. Therefore, having calculated maximum likelihood estimates of θ0S and θ1S from the case-control study, we use θ̂0 = θ̂0S + log{(nD̄/nD) · ρ̂/(1 − ρ̂)} to estimate the population intercept θ0.

The marker distribution in the population, F, cannot be estimated directly because of the case-control sampling design. However, since the case and control samples are representative, empirical estimates of FD and FD̄ are valid, which we have denoted F̂D and F̂D̄. Therefore we estimate F with F̂ = ρ̂F̂D + (1 − ρ̂)F̂D̄.

Estimates of the predictiveness summary measures can then be obtained by plugging the corresponding values of θ̂, F̂, F̂D, F̂D̄ and ρ̂ into the expressions given earlier. These estimates are called semiparametric “empirical” estimates by Huang and Pepe (2008a) because FD and FD̄ are estimated empirically. The semiparametric likelihood framework also allows one to estimate FD and FD̄ using maximum likelihood (Qin and Zhang 1997, 2003; Qin 1998). Huang and Pepe (2008a) compared the performance of semiparametric “empirical” estimators of the predictiveness curve with semiparametric maximum likelihood estimators; gains in efficiency from maximum likelihood are typically small. We use empirical estimators of FD and FD̄ here because this approach is intuitive and easy to implement. Moreover, the estimators target meaningful quantities even when the risk model is misspecified. For example, TP̂R(p) is the proportion of cases whose calculated risks (calculated under the assumed model) exceed p, PÊV is the difference in mean calculated risk between cases and controls, and so forth.
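A minimal sketch of the two case-control adjustments described above (function names are ours): the intercept correction and the mixture estimate of the population marker distribution:

```python
import numpy as np

def adjust_intercept(theta0_cc, n_cases, n_controls, rho_hat):
    # Recover the population intercept from a case-control logistic fit:
    # theta0 = theta0_cc + log{(n_controls / n_cases) * rho_hat / (1 - rho_hat)}.
    return theta0_cc + np.log((n_controls / n_cases) * rho_hat / (1.0 - rho_hat))

def marker_cdf(t, y_cases, y_controls, rho_hat):
    # F_hat(t) = rho_hat * F_D_hat(t) + (1 - rho_hat) * F_Dbar_hat(t),
    # the prevalence-weighted mixture of the empirical case and control CDFs.
    return rho_hat * np.mean(y_cases <= t) + (1.0 - rho_hat) * np.mean(y_controls <= t)
```

With equal numbers of cases and controls and ρ̂ = 0.2, for example, the adjustment shifts the intercept down by log(0.25/1) ≈ −1.386, reflecting the over-representation of cases in the sample.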

5. Asymptotic Distribution Theory

In this section, we present asymptotic distribution theory for all of the summary measures defined in previous sections. Results for pointwise estimators of R(ν) and R⁻¹(p) were previously reported by Huang et al. (2007) and Huang and Pepe (2008a), but for completeness we restate them here. Theory for the empirical estimator of the AUC is not reported here since it is well established (Pepe 2003, page 105). Derivations of our results are provided in the Appendix. In addition, in the Appendix we detail the components of the asymptotic variance expressions separately for case-control and cohort study designs.

Assume the following conditions hold:

  1. G(s, Y) is a differentiable function with respect to s and Y at s = θ, Y = F⁻¹(ν).
  2. G⁻¹(s, p) is continuous, and ∂G⁻¹(s, p)/∂s exists at s = θ.

Theorem As n → ∞, each of the following random variables converges to a mean zero normal random variable: (i) √n(R̂⁻¹(p) − R⁻¹(p)), with variance


(ii) √n(TP̂R(p) − TPR(p)), with variance


(iii) √n(FP̂R(p) − FPR(p)), with variance


(iv) √n(PP̂V(p) − PPV(p)), with variance


where σρ² is the asymptotic variance of √n(ρ̂ − ρ) and




(v) √n(NP̂V(p) − NPV(p)), with variance


(vi) √n(R̂(ν) − R(ν)), with variance


(vii) √n(R̂(νT(t)) − R(νT(t))), where TPR = 1 − νT(t) is pre-specified, with variance


(viii) √n(R̂(νF(t)) − R(νF(t))), where FPR = 1 − νF(t) is pre-specified, with variance


(ix) √n(PÊV − PEV), with variance


(x) √n(TḠ̂ − TḠ), with variance






6. Simulation Studies

We performed simulation studies to investigate the validity of using large sample theory for making inference in finite sample studies, and to compare it with inference using bootstrap resampling. Data were simulated under a linear logistic risk model. Specifically, we employed a population prevalence of ρ = 0.2 and generated marker data according to YD̄ ~ N(0, 1) for controls and YD ~ N(1, 1) for cases. The correct form for G(θ, Y) was employed in fitting the risk model, namely a linear logistic model. For each simulated dataset, estimates of summary indices were calculated and their corresponding variances were estimated using the analytic formulae from the asymptotic theory. Variance estimates were also calculated using bootstrap resampling. Sample sizes ranged from 100 to 2000 and 5000 simulations were conducted for each scenario.
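For concreteness, one replicate of a simulation of this kind, using the empirical AUC as the summary index with a bootstrap standard error, might look like the following sketch (ours, not the authors' simulation code):

```python
import numpy as np

rng = np.random.default_rng(7)
rho, n = 0.2, 500

# One replicate of the cohort scenario:
# D ~ Bernoulli(0.2), Y | D=0 ~ N(0,1), Y | D=1 ~ N(1,1).
d = rng.binomial(1, rho, n)
y = rng.normal(d.astype(float), 1.0)

def empirical_auc(y, d):
    yc, yn = y[d == 1], y[d == 0]
    greater = (yc[:, None] > yn[None, :]).mean()
    ties = (yc[:, None] == yn[None, :]).mean()
    return greater + 0.5 * ties

auc_hat = empirical_auc(y, d)

# Bootstrap SE: resample (Y, D) pairs with replacement, mimicking the cohort design.
boot = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    if 0 < d[idx].sum() < n:
        boot.append(empirical_auc(y[idx], d[idx]))
se_boot = float(np.std(boot))
```

Under this scenario the true AUC is Φ(1/√2) ≈ 0.76, so a single replicate's estimate should fall nearby; repeating over many replicates gives the bias and coverage summaries reported in the tables.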

Simulation studies were conducted for case-control study designs as well as for cohort study designs. For the case-control scenario, we simulated nested case-control samples within the main study cohort, employing equal numbers of cases and controls and with the size of the cohort equal to five times that of the case-control sample. The estimator ρ̂ is calculated from the main study cohort, and the sampling variability in summary estimates due to ρ̂ is acknowledged in making inference. Separate resampling of cases and controls was done for the nested case-control scenarios.

A full reporting of our simulation results can be found in Gu and Pepe (2009a). Tables 1 and 2 display results for a subset of the summary indices under case-control study designs. Huang and Pepe (2008a) report extensive simulation results for estimates of points on the predictiveness curve, R(ν) and R⁻¹(p), so those are not repeated here. We found little bias in the estimated values. Moreover, estimated standard deviations based on asymptotic theory agree well with the actual standard deviations and with those estimated from bootstrap resampling. Coverage probabilities were excellent when sample sizes were moderate to large. We observed some under-coverage and some over-coverage with small sample sizes (n = nD + nD̄ = 100). Not surprisingly, this occurred primarily at the boundaries of the case and control distributions and was not an issue for the overall summary measures, PEV, TG and AUC. Generally, coverage based on percentiles of the bootstrap distribution is somewhat better than coverage based on assumptions of normality, but the difference shrinks for larger n.

Table 1.
Results of simulations to evaluate the application of inference based on asymptotic distribution theory and bootstrap resampling to finite sample studies. The study design employs case-control sampling from a parent cohort with prevalence 0.2. Marker ...
Table 2.
Results of simulations to evaluate the application of inference based on asymptotic distribution theory and bootstrap resampling to finite sample studies. Data are generated as described for Table 1. Shown are results for PEV, standardized total gain, ...

7. The Cystic Fibrosis Data

Cystic fibrosis is an inherited chronic disease that affects the lungs and digestive system. A defective gene and its protein product cause the body to produce unusually thick, sticky mucus, which clogs the lungs and leads to life-threatening lung infections, and also obstructs the pancreas, preventing natural enzymes from helping the body break down and absorb food. The main culminating event leading to death is acute pulmonary exacerbation, i.e., lung infection requiring intravenous antibiotics.

The data for analysis here are from the Cystic Fibrosis Registry, a database maintained by the Cystic Fibrosis Foundation that contains annually updated information on over 20,000 people diagnosed with CF and living in the USA. To illustrate our methodology, we consider FEV1, a measure of lung function, and weight, a measure of nutritional status, as measured in 1995, to predict the occurrence of pulmonary exacerbation in 1996. Data from 12,802 patients 6 years of age and older are analyzed; 5,245 subjects (41%) had at least one pulmonary exacerbation. A child’s weight is standardized for age and gender by reporting his/her placement value, which equals 1 minus his/her percentile in a healthy population of children of the same age and gender (Hamill et al. 1977). FEV1 is standardized for age, gender and height using explicit formulae provided by Knudson et al. (1983). We modelled the risk functions using logistic regression with weight and FEV1 entered as linear terms; both markers are negated so that increasing values are associated with increasing risk. Figure 1 shows the predictiveness curves for the entire cohort and Figure 2 shows the risk distributions separately for cases (those who had a pulmonary exacerbation) and for controls.

First, we use the entire cohort to estimate predictiveness summary measures for weight and lung function. Table 3 shows the point estimates discussed earlier in sections 2 and 3. Here we provide confidence intervals based on asymptotic distribution theory and on bootstrap resampling. Observe that standard deviations are all small and that corresponding confidence intervals are very tight. Bootstrap confidence intervals are almost identical to those based on asymptotic theory.

Table 3.
Point estimates and 95% confidence intervals for the summary indices using FEV1 and weight as markers of risk for subsequent pulmonary exacerbation in patients with cystic fibrosis. Results based on the entire cohort.

We used the summary indices as the basis for hypothesis tests to formally compare the predictive capacities of FEV1 and weight. The difference between estimates of the indices was calculated and standardized using a bootstrap estimate of the variance of the difference. P-values were calculated by comparing these test statistics with quantiles of the standard normal distribution. We see that the differences between lung function and weight as predictive markers are statistically significant, no matter which summary index is employed. Note, however, that each test relates to a different question about predictive performance, depending on the particular summary index on which it is based. Asking whether the PEVs for weight and lung function are equal is not the same as asking whether the proportions of subjects with risks below 0.25, R⁻¹(0.25), are equal, nor the same as asking whether the AUCs for risks associated with weight and FEV1 are equal.
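The comparison procedure can be sketched as follows, here using the AUC as the summary index; the resampling keeps the two marker values paired within a subject, and all function names are ours:

```python
import numpy as np
from math import erf, sqrt

def empirical_auc(marker, d):
    mc, mn = marker[d == 1], marker[d == 0]
    return (mc[:, None] > mn[None, :]).mean() + 0.5 * (mc[:, None] == mn[None, :]).mean()

def compare_markers(m1, m2, d, stat=empirical_auc, n_boot=500, seed=0):
    # Z-test for stat(m1) - stat(m2), with the variance of the difference
    # estimated by bootstrap resampling of subjects (marker pairing preserved).
    rng = np.random.default_rng(seed)
    n = len(d)
    diff = stat(m1, d) - stat(m2, d)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if 0 < d[idx].sum() < n:
            boot.append(stat(m1[idx], d[idx]) - stat(m2[idx], d[idx]))
    z = diff / np.std(boot)
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))  # two-sided normal p
    return diff, z, p_value
```

Substituting another summary statistic for `stat` yields the corresponding test, each of which, as noted above, answers a different question about predictive performance.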

Next, we randomly sampled 1,280 cases and 1,280 controls from the entire cohort to form a nested case-control study sample about one-fifth the size of the cohort. Table 4 presents results that use data on FEV1 and weight from the case-control subset and the estimate of the overall incidence of pulmonary exacerbation from the entire cohort, ρ̂ = 0.41. Estimates of summary indices are very close to the cohort estimates, but the corresponding confidence intervals are much wider. Nevertheless, the conclusions remain the same. This demonstrates that in a study where predictive markers are costly to obtain, the nested case-control design can yield considerable cost savings.

Table 4.
Point estimates and 95% confidence intervals for the summary indices using FEV1 and weight as markers of risk for subsequent pulmonary exacerbation in patients with cystic fibrosis. Results based on prevalence estimated from the entire cohort and marker ...

Some predictiveness summary measures, such as R(ν) and R⁻¹(p), are based on a single point on the predictiveness curve. Others, such as true and false positive rates and event rates, accumulate differences over part of the curve. Measures such as PEV, TG and AUC accumulate differences over the entire curve. One might conjecture that measures that accumulate differences would often be more powerful for testing whether predictiveness curves are equal. To investigate, we varied the case-control sample size and evaluated p-values associated with differences between the various summary measures. From Table 4, with a reasonably large case-control sample size (n = 2560), we concluded that almost all of the summary indices differ significantly between the two markers. However, we see from Table 5 that as the size of the case-control study shrinks from medium (n = 500) to small (n = 100), the point estimates of measures based on specific thresholds become much more variable, and p-values for differences between lung function and weight become statistically insignificant in most cases. In contrast, conclusions about the superiority of lung function as a predictive marker remained firm when PEV, TG or AUC were used as the basis of hypothesis tests about equality of curves, even with very small sample sizes (n = 100).

Table 5.
Point estimates and 95% confidence intervals for the summary indices using FEV1 and weight as markers of risk for subsequent pulmonary exacerbation in patients with cystic fibrosis. Results are based on prevalence estimated from the entire cohort and ...

8. Concluding Remarks

This paper presents some new clinically relevant measures that quantify the predictive performance of a risk marker. New measures formally defined include TPR(p), FPR(p), PPV(p), NPV(p), R(νT) and R(νF), although several of these are already used informally in the applied literature. We have previously argued for use of R(ν) and R⁻¹(p) in practice. In addition, we have provided new intuitive interpretations for some existing predictive measures, including the popular proportion of explained variation, PEV, which is called the IDI by Pencina et al. (2008); the standardized total gain, TG; and recently proposed risk reclassification measures, namely the NRI and the risk reclassification percent. We demonstrated that all of these measures are functions of the risk distribution, also known as the predictiveness curve. A fundamental initial step in the assessment of any risk model is to evaluate whether risks calculated according to the model reflect the probabilities P(D = 1|Y). The predictiveness curve can also be useful in making this assessment (Pepe et al. 2008) and is complemented by use of the Hosmer-Lemeshow statistic (Hosmer and Lemeshow 1989). In our cystic fibrosis example, the two linear logistic risk models, one for lung function and one for weight, both yield Hosmer-Lemeshow test p-values > 0.05, indicating that they fit the observed data reasonably well. When the risk model is misspecified, the summary measures relate to the distributions of ‘fitted’ risk. Performance summaries such as TPR, FPR, PPV, NPV, R(νT), R(νF), R(ν) and R⁻¹(p) will be of interest even when the fitted risk is not the true risk. For example, TPR, the proportion of cases detected by the (incorrect) risk calculator, will still be clinically relevant.
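For reference, the standard deciles-of-risk form of the Hosmer-Lemeshow statistic used for such calibration checks can be sketched as follows (our illustration, not the authors' code):

```python
import numpy as np

def hosmer_lemeshow(risk, d, n_groups=10):
    # Deciles-of-risk Hosmer-Lemeshow statistic: compare observed and expected
    # event counts within risk-ordered groups; approximately chi-square with
    # n_groups - 2 degrees of freedom under a well-calibrated model.
    order = np.argsort(risk)
    stat = 0.0
    for g in np.array_split(order, n_groups):
        obs = d[g].sum()       # observed events in the group
        exp = risk[g].sum()    # expected events under the model
        pbar = exp / len(g)
        stat += (obs - exp) ** 2 / (len(g) * pbar * (1.0 - pbar))
    return stat

# Demo: exactly calibrated risks should give a modest statistic.
rng = np.random.default_rng(11)
risk = rng.uniform(0.05, 0.95, 1000)
d = rng.binomial(1, risk).astype(float)
hl = hosmer_lemeshow(risk, d)
```

A large statistic (small p-value against the chi-square reference) signals miscalibration; the p-values > 0.05 reported above correspond to modest values of this statistic.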

A second contribution of this paper is to provide distribution theory for estimators of the summary indices. Such has not been available for most of the measures heretofore, including the popular PEV measure. Our methods can now be used to construct confidence intervals for the summary indices. Simulation studies indicate that the methods are valid for application in practice with finite samples.

We also demonstrated in an example how summary indices can be used to make formal, rigorous comparisons between markers. Previously this was possible only for comparisons based on the AUC or on point estimates of the predictiveness curve, R(ν) and R⁻¹(p) (Huang, Pepe and Feng 2007; Huang and Pepe 2008).

Our methods accommodate cohort or case-control study designs. The latter are particularly important in the early phases of biomarker development when biomarker assays are expensive or procurement of biological samples is difficult (Pepe et al. 2001). In such settings nested case-control studies are much more feasible (Baker et al. 2002; Pepe et al. 2008d). Our methodology is currently restricted to risk models that include a single marker or a predefined combination of markers. In practice studies often involve development of a marker combination and assessment of its performance. Compelling arguments have been provided in the literature for splitting a dataset into training and test subsets (Simon 2006; Ransohoff 2004). In these circumstances our methods apply to evaluation with the test data of the combination developed with the training data. It would be worthwhile however to explore use of cross validation techniques for simultaneous development and assessment of risk models using the summary indices we have described.

Which summary index should be recommended for use in practice? In our opinion, a summary index should not replace the display of the risk distributions (Figures 1 and 2) but should serve only to complement them. The choice of summary indices to report should be driven by the scientific objectives of the study. For example, if the objective is to risk stratify the population according to some risk threshold pL, below which treatment is not indicated and above which treatment is indicated, the corresponding proportions of the population that fall into the two risk strata, R⁻¹(pL) and 1 − R⁻¹(pL), would be key performance measures to report. Additional measures of interest in this setting should also be reported, including TPR(pL), FPR(pL), PPV(pL) and NPV(pL). When no risk thresholds have been defined as clinically relevant, PEV, TG or AUC could complement the displays of risk distributions and serve as the basis of test statistics for testing equality of the distances between case and control risk distributions.

The final stages of evaluating a model for use in practice should incorporate notions of costs and benefits (i.e., utilities) that may be associated with decisions based on risk(Y). However, specifying costs and benefits is typically very difficult in practice. Vickers and Elkin (2006) have recently proposed a standardized measure of expected utility for a prediction model that does not require explicit specification of costs and benefits. The net benefit at risk threshold p is defined as NB(p) = ρTPR(p) − (1 − ρ)FPR(p)p/(1 − p), and their decision curve plots NB(p) versus p. This weighted combination of true and false positive rates could complement descriptive plots of risk distributions. Moreover, the methods for inference that we have presented here give rise to methods for inference about decision curves too.
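A sketch of the net benefit calculation from fitted risks (our illustration; `risk_cases` and `risk_controls` are hypothetical arrays of fitted risks, and the function names are ours):

```python
import numpy as np

def net_benefit(risk_cases, risk_controls, rho, p):
    # NB(p) = rho * TPR(p) - (1 - rho) * FPR(p) * p / (1 - p).
    tpr = np.mean(risk_cases > p)
    fpr = np.mean(risk_controls > p)
    return rho * tpr - (1.0 - rho) * fpr * p / (1.0 - p)

def decision_curve(risk_cases, risk_controls, rho, thresholds):
    # One (threshold, net benefit) point per risk threshold p.
    return [(p, net_benefit(risk_cases, risk_controls, rho, p)) for p in thresholds]
```

Because NB(p) is built from ρ, TPR(p) and FPR(p), confidence intervals for it follow from the distribution theory for those quantities given above.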


To simplify notation, we suppose the risk model is logistic linear in Y :


I.  Cohort Design

In a cohort study the log likelihood function is


Let θ̂0, θ̂1 be the maximum likelihood estimators (MLEs) of θ0, θ1 based on (15). The following results are well known.

Result 1 Let


and A(θ) = A(θ, ∞). If A(θ)⁻¹ exists,


We can also write √n(θ̂ − θ) = (1/√n) Σi=1,...,n ℓθ(Yi) + op(1), where ℓθ(Yi) = A(θ)⁻¹lθ(Yi), i = 1, ..., n. And


Result 2 As n → ∞,


where η ≡ nD/n.

Lemma 1 Let


and use BD,0, BD,1, BD̄,0 and BD̄,1 for the limits at t = ∞.

Then we have



We prove the first result; the proofs of the other two follow from similar arguments.


Proof of Theorem items (i), (ii) and (iii) We show the proof for item (i). The proofs for items (ii) and (iii) follow similar arguments.




by equicontinuity of the process √n(F̂ − F). Earlier results and the delta method then imply:


The last equality follows from Result 2 (for the variance of F̂), Result 1 (for the variance of θ̂) and Lemma 1 (for the covariance of (F̂, θ̂)).

Proof of Theorem items (iv) and (v)

We write


The asymptotic distribution of √n(ρ̂ − ρ) is given in Result 2, as are the distributions of √n(TP̂R(p) − TPR(p)) and √n(FP̂R(p) − FPR(p)), because:


And similarly,


In the following, we calculate the covariances between (√n(TP̂R(p) − TPR(p)), √n(ρ̂ − ρ)), between (√n(FP̂R(p) − FPR(p)), √n(ρ̂ − ρ)) and between (√n(TP̂R(p) − TPR(p)), √n(FP̂R(p) − FPR(p))). The asymptotic variance of √n(PP̂V(p) − PPV(p)), σ4(p)², then follows from the delta method.

Consider the covariance between √n(TP̂R(p) − TPR(p)) and √n(FP̂R(p) − FPR(p)):


Because F̂D and F̂D̄ are uncorrelated, the above covariance can be written as



The last equality uses Result 1 (for the variance of θ̂) and Lemma 1 (for the covariances of (F̂D, θ̂) and (F̂D̄, θ̂)).

The second covariance (equation (10)) is between √n(ρ̂ − ρ) and √n(TP̂R(p) − TPR(p)):


where An = cov(√n(ρ̂ − ρ), √n(FD(G⁻¹(θ̂, p)) − FD(G⁻¹(θ, p)))) and Bn = cov(√n(ρ̂ − ρ), √n(F̂D(G⁻¹(θ, p)) − FD(G⁻¹(θ, p)))).

Observe that


where risk(Y) ≡ G(θ, Y) and Hn ≡ √n ∫ (G(θ̂, Y) − G(θ, Y)) d(F̂(Y) − F(Y)) = (1/√n) ∫ √n(G(θ̂, Y) − G(θ, Y)) d(√n(F̂(Y) − F(Y))). Both √n(G(θ̂, Y) − G(θ, Y)) and √n(F̂(Y) − F(Y)) are bounded in probability, and therefore Hn converges to 0 as n → ∞.

We next derive expressions for An and Bn.


And Bn is


where cov(√n(θ̂ − θ), √n(F̂D(G⁻¹(θ, p)) − FD(G⁻¹(θ, p)))) is given by Lemma 1.

Combining the two terms yields a value for cov2. The derivation of cov3 follows from a similar argument.

The proof of item (v) of the Theorem uses exactly the same techniques.

Proof of Theorem items (vi), (vii) and (viii) We prove Theorem item (vii) in the following; proofs of (vi) and (viii) are similar. The proof is based on Huang and Pepe (2008a), who derived the asymptotic distribution of √n(R̂(ν) − R(ν)) when R(ν) is estimated under a case-control design.

When the value of TPR is 1 − νT(t), by a Taylor series expansion,


It follows that the asymptotic variance is


The variances of √n(θ̂ − θ) and of √n(F̂D(FD⁻¹(νT(t))) − t) are provided in Result 2, and their covariance is provided in Lemma 1. Putting them all together, we have the following result,


Proof of Theorem item (ix)


where Pn = √n ∫ (G(θ̂, Y) − G(θ, Y)) d(F̂D − FD) + √n ∫ (G(θ̂, Y) − G(θ, Y)) d(F̂D̄ − FD̄). It is easy to see that Pn converges to zero as n → ∞, since √n(G(θ̂, Y) − G(θ, Y)) is bounded in probability and F̂D − FD (and F̂D̄ − FD̄) converges in probability to 0. We define An ≡ √n ∫ G(θ, Y) d(F̂D − FD), Bn ≡ √n ∫ (G(θ̂, Y) − G(θ, Y)) dFD, Cn ≡ √n ∫ G(θ, Y) d(F̂D̄ − FD̄) and Dn ≡ √n ∫ (G(θ̂, Y) − G(θ, Y)) dFD̄.

Now we have,




From a similar argument,




Because F̂D and F̂D̄ are independent, the covariance between An and Cn is zero. Observe also that from previous derivations we have


The asymptotic variance of √n(PÊV − PEV), σ9², can be obtained by combining var(An + Bn), var(Cn + Dn) and cov(An + Bn, Cn + Dn).

Proof of Theorem item (x)


The result in the Theorem follows. Now we derive expressions for the variance components in a cohort study. Observe that


where we define


and note that √n(F̂D(G⁻¹(θ̂, ρ̂)) − F̂D(G⁻¹(θ, ρ))) − √n(FD(G⁻¹(θ̂, ρ̂)) − FD(G⁻¹(θ, ρ))) → 0 as n → ∞ due to the equicontinuity of the process.


The variance of Bn follows from that of √n(θ̂ − θ) given in Result 1, and the variances of An and Cn both follow from Result 2. The covariance cov(An, Bn) follows from Lemma 1. Furthermore, cov(An, Cn) and cov(Bn, Cn) concern the covariances between (F̂D, ρ̂) and (θ̂, ρ̂), both of which can be found in the proof of Theorem item (iv), cov2 (see equations (19) and (20)).

Similarly, we have


The variance of √n(FP̂R(ρ̂) − FPR(ρ)), Σ2 (equation (13)), depends on the variances and covariances of the three terms


The variance of En can be obtained by using Result 1, and the variances of Dn and Fn can be found using Result 2. The covariance between Dn and En follows from Lemma 1. Covariances between (Dn, Fn) and (En, Fn) follow from an argument similar to that used in deriving cov(An, Cn) and cov(Bn, Cn) (equations (19) and (20)).

The asymptotic variance of √n(TḠ̂ − TḠ), σ10², is therefore


where ∑1,2 is


All of these covariance terms can be obtained using the corresponding Results and Lemmas: cov(An, En) and cov(Bn, Dn) from Lemma 1; cov(Bn, En) from Result 1; cov(Cn, Fn) from Result 2. cov(An, Fn) and cov(Bn, Fn) concern the covariances between (F̂D, ρ̂) and (θ̂, ρ̂), and expressions were derived in the proof of Theorem item (iv) above, cov2 (equations (19) and (20)), while cov(Cn, Dn) and cov(Cn, En) concern the covariances between (F̂D̄, ρ̂) and (θ̂, ρ̂) and are derived using a similar argument.

II.  Case-Control Design

Let ρ̂ be the estimate of disease prevalence ρ from a cohort independent of the case-control sample, or from the parent cohort within which the case-control sample is nested. Assume the size of the cohort is λ times the size of the case-control sample. Denote


The following results are well established.

Result 3 Let


where k ≡ nD/n and A = A(∞). If A⁻¹ exists,




A proof can be found in Prentice and Pyke (1979), Qin and Zhang (1997) and Zhang (2000).

The next set of results, Results 4–7, have been proved by Huang and Pepe (2008a).

Result 4 As n → ∞, √n(F̂*(t) − F*(t)) converges to a normal random variable with mean 0 and variance

Result 5


Result 6


Result 7


Most of the proofs in the following are exactly the same as those developed for a cohort study. Therefore we do not repeat the proofs that are the same. However, expressions for the components of the asymptotic variances that are different are provided. We will frequently refer to expressions in Results 4–7.

Proof of Theorem item (i), (ii) and (iii)

The proof is the same as the proof provided for cohort studies. Based on equation (16),


Expressions for the three individual components can all be found in Results 6 and 7. Proofs for items (ii) and (iii) of the Theorem follow similar arguments.

Proof of Theorem items (iv) and (v)

According to equation (17),


Results 6 and 7 provide expressions for the three individual terms.

However, the expressions for cov2 and cov3 are different from those under a cohort study design,




Item (v) of the Theorem follows from a similar argument.

Proof of Theorem (vi), (vii) and (viii) These all follow similar arguments. We use (vii) to illustrate.

According to equation (21),


The result follows by plugging in the corresponding expressions from Results 2, 6 and 7. Proofs of items (vi) and (viii) follow similar arguments.

Proof of Theorem item (ix) The proof of Theorem item (ix) is exactly the same as the proof for a cohort study. Equations (22), (24) and (26) define expressions for the components of the asymptotic variance of √n(PÊV − PEV), σ9². We only need to substitute l(θ) in the definitions of [P1 P2] and [Q1 Q2] (equations (23) and (25)) with the likelihood function based on case-control data, as defined by Prentice and Pyke (1979), Qin and Zhang (1997) and Zhang (2000).

Proof of Theorem item (x) According to equations (27), (28) and (29), the three components of var(√n(TḠ̂ − TḠ)), σ10², are


The following Results provide corresponding expressions for each of the individual components:

  • Result 2: var(An) and var(Dn).
  • Result 3: cov(Bn, En).
  • Result 6: var(Bn), var(Cn), var(En), cov(Bn, Cn), var(Fn), cov(En, Fn), cov(Bn, Fn), cov(Cn, En) and cov(Cn, Fn).
  • Result 7: cov(An, Bn), cov(Dn, En), cov(Bn,Dn) and cov(An, En).

Furthermore, cov(An, Cn)=cov(Dn, Fn)=cov(An, Fn)=cov(Cn,Dn)=0.


*This work is supported in part by grants from the National Institutes of Health (R01 GM054438 and U01 CA086368).


  • Baker SG, Kramer BS, Srivastava S. “Markers for Early Detection of Cancer: Statistical Guidelines for Nested Case-Control Studies,” BMC Medical Research Methodology. 2002;2:4. doi: 10.1186/1471-2288-2-4. [PMC free article] [PubMed] [Cross Ref]
  • Bura E, Gastwirth JL. “The Binary Regression Quantile Plot: Assessing the Importance of Predictors in Binary Regression Visually,” Biometrical Journal. 2001;43:5–21.
  • Cook NR. “Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction,” Circulation. 2007;115:928–935. [PubMed]
  • Cook NR, Buring JE, Ridker PM. “The Effect of Including C-Reactive Protein in Cardiovascular Risk Prediction Models for Women,” Annals of Internal Medicine. 2006;145:21–29. [PubMed]
  • D’Agostino RB, Sr, Vasan RS, Pencina MJ, Wolf PA, Cobain M, Massaro JM, Kannel WB. “General Cardiovascular Risk Profile for Use in Primary Care: The Framingham Heart Study,” Circulation. 2008;117:743–753. [PubMed]
  • Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults. “Executive Summary of the Third Report of the National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, and Treatment of High Blood Cholesterol in Adults (Adult Treatment Panel III),” The Journal of the American Medical Association. 2001;285(19):2486–2497. [PubMed]
  • Gail MH, Pfeiffer RM. “On Criteria for Evaluating Models of Absolute Risk,” Biostatistics. 2005;6(2):227–239. doi: 10.1093/biostatistics/kxi005. [PubMed] [Cross Ref]
  • Green DM, Swets JA. Signal Detection Theory and Psychophysics. New York: Wiley; 1966.
  • Gu W, Pepe MS. “Estimating the Capacity for Improvement in Risk Prediction with a Marker,” Biostatistics. 2009;10(1):172–186. doi: 10.1093/biostatistics/kxn025. [PMC free article] [PubMed] [Cross Ref]
  • Gu W, Pepe MS. 2009a. “Measures to Summarize and Compare the Predictive Capacity of Markers,” UW Biostatistics Working Paper Series, Working Paper 342.
  • Hamill PV, Drizd TA, Johnson CL, Reed RB, Roche AF. “NCHS Growth Curves for Children Birth–18 Years,” Vital and Health Statistics. Vol. 11. Washington, DC: 1977. pp. 1–74. [PubMed]
  • Hosmer DW, Lemeshow S. Applied Logistic Regression (section 522). New York: Wiley; 1989.
  • Hu B, Palta M, Shao J. “Properties of R2 Statistics for Logistic Regression,” Statistics in Medicine. 2006;25(8):1383–1395. doi: 10.1002/sim.2300. [PubMed] [Cross Ref]
  • Huang Y, Pepe MS. 2008a. “Semiparametric and Nonparametric Methods for Evaluating Risk Prediction Markers in Case-Control Studies,” UW Biostatistics Working Paper Series, Working Paper 333.
  • Huang Y, Pepe MS. 2008b. “A Parametric ROC Model Based Approach for Evaluating the Predictiveness of Continuous Markers in Case-Control Studies,” Biometrics, doi: 10.1111/j.1541-0420.2005.00454.x [PMC free article] [PubMed]
  • Huang Y, Pepe MS, Feng Z. “Evaluating the Predictiveness of a Continuous Marker,” Biometrics. 2007;63(4):1181–1188. [PMC free article] [PubMed]
  • Hunink M, Glasziou P, Siegel J, Weeks J, Pliskin J, Elstein A, Weinstein M. Decision making in health and medicine. Cambridge University Press; 2006.
  • Janes H, Pepe MS, Gu W. “Assessing the Value of Risk Predictions Using Risk Stratification Tables,” Annals of Internal Medicine. 2008;149(10):751–760. [PMC free article] [PubMed]
  • Janssens ACJW, Aulchenko YS, Elefante S, Borsboom GJJM, Steyerberg EW, van Duijn CM. “Predictive Testing for Complex Diseases Using Multiple Genes: Fact or Fiction?,” Genetics in Medicine. 2006;8(7):395–400. doi: 10.1097/01.gim.0000229689.18263.f4. [PubMed] [Cross Ref]
  • Knudson RJ, Lebowitz MD, Holberg CJ, Burrows B. “Changes in the Normal Maximal Expiratory Flow-Volume Curve with Growth and Aging,” The American Review of Respiratory Disease. 1983;127:725–734. [PubMed]
  • Pencina MJ, D’Agostino RB, Sr, D’Agostino RB, Jr, Vasan RS. “Evaluating the Added Predictive Ability of a New Marker: From Area Under the ROC Curve to Reclassification and Beyond,” Statistics in Medicine. 2008;27:157–172. [PubMed]
  • Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2003.
  • Pepe MS, Etzioni R, Feng Z, Potter JD, Thompson ML, Thornquist M, Winget M, Yasui Y. “Phases of Biomarker Development for Early Detection of Cancer,” Journal of the National Cancer Institute. 2001;93(14):1054–1061. doi: 10.1093/jnci/93.14.1054. [PubMed] [Cross Ref]
  • Pepe MS, Feng Z, Gu W. “Comments on ’Evaluating the Added Predictive Ability of a New Marker: From Area Under the ROC Curve to Reclassification and Beyond’,” Statistics in Medicine. 2008b;27:173–181. [PubMed]
  • Pepe MS, Feng Z, Huang Y, Longton GM, Prentice R, Thompson IM, Zheng Y. “Integrating the Predictiveness of a Marker with its Performance as a Classifier,” American Journal of Epidemiology. 2008a;167(3):362–368. doi: 10.1093/aje/kwm305. [PMC free article] [PubMed] [Cross Ref]
  • Pepe MS, Janes H. “Gauging the Performance of SNPs, Biomarkers and Clinical Factors for Predicting Risk of Breast Cancer,” Journal of the National Cancer Institute. 2008c;100(14):978–979. doi: 10.1093/jnci/djn215. [PMC free article] [PubMed] [Cross Ref]
  • Pepe MS, Feng Z, Janes H, Bossuyt P, Potter J. “Pivotal Evaluation of the Accuracy of a Biomarker Used for Classification or Prediction: Standards for Study Design,” Journal of the National Cancer Institute. 2008d;100(20):1432–1438. doi: 10.1093/jnci/djn326. [PubMed] [Cross Ref]
  • Pepe MS, Janes H, Gu W. “Letter by Pepe et al Regarding the Article, ’Use and Misuse of the Receiver Operating Characteristic Curve in Risk Prediction’,” Circulation. 2007;116:e132. doi: 10.1161/CIRCULATIONAHA.107.709253. [PubMed] [Cross Ref]
  • Prentice RL, Pyke R. “Logistic Disease Incidence Models and Case-Control Studies,” Biometrika. 1979;66(3):403–411. doi: 10.1093/biomet/66.3.403. [Cross Ref]
  • Qin J. “Inferences for Case-Control and Semiparametric Two-Sample Density Ratio Models,” Biometrika. 1998;85(3):619–630. doi: 10.1093/biomet/85.3.619. [Cross Ref]
  • Qin J, Lawless J. “Empirical Likelihood and General Estimating Equations,” The Annals of Statistics. 1994;22(1):300–325.
  • Qin J, Zhang B. “A Goodness of Fit Test for Logistic Regression Models Based on Case-Control Data,” Biometrika. 1997;84(3):609–618. doi: 10.1093/biomet/84.3.609. [Cross Ref]
  • Qin J, Zhang B. “Using Logistic Regression Procedures for Estimating Receiver Operating Characteristic Curves,” Biometrika. 2003;90(3):585–596. doi: 10.1093/biomet/90.3.585. [Cross Ref]
  • Ransohoff DF. “Rules of Evidence for Cancer Molecular-Marker Discovery and Validation,” Nature Reviews Cancer. 2004;4(4):309–314. doi: 10.1038/nrc1322. [PubMed] [Cross Ref]
  • Ridker PM, Hennekens CH, Buring JE, Rifai N. “C-Reactive Protein and Other Markers of Inflammation in the Prediction of Cardiovascular Disease in Women,” New England Journal of Medicine. 2000;342:836–843. [PubMed]
  • Ridker PM, Paynter NP, Rifai N, Gaziano JM, Cook NR. “C-Reactive Protein and Parental History Improve Global Cardiovascular Risk Prediction: The Risk Score for Men,” Circulation. 2008;118:2243–2251. [PMC free article] [PubMed]
  • Simon R. “Roadmap for Developing and Validating Therapeutically Relevant Genomic Classifiers,” Journal of Clinical Oncology. 2006;23(29):7332–7341. doi: 10.1200/JCO.2005.02.8712. [PubMed] [Cross Ref]
  • Stern RH. “Evaluating New Cardiovascular Risk Factors for Risk Stratification,” The Journal of Clinical Hypertension. 2008;10(6):485–488. doi: 10.1111/j.1751-7176.2008.07814.x. [PubMed] [Cross Ref]
  • Vickers AJ, Elkin EB. “Decision Curve Analysis: a Novel Method for Evaluating Prediction Models,” Medical Decision Making. 2006;26(6):565–574. doi: 10.1177/0272989X06295361. [PMC free article] [PubMed] [Cross Ref]
  • Ware JH, Cai T. “Comments on ‘Evaluating the Added Predictive Ability of a New Marker: From Area Under the ROC Curve to Reclassification and Beyond’,” Statistics in Medicine. 2008;27:185–187. [PubMed]
  • Youden WJ. “Index for Rating Diagnostic Tests,” Cancer. 1950;3:32–35. doi: 10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3. [PubMed] [Cross Ref]
  • Zheng B, Agresti A. “Summarizing the Predictive Power of a Generalized Linear Model,” Statistics in Medicine. 2000;19(13):1771–1781. doi: 10.1002/1097-0258(20000715)19:13<1771::AID-SIM485>3.0.CO;2-P. [PubMed] [Cross Ref]

Articles from The International Journal of Biostatistics are provided here courtesy of Berkeley Electronic Press