|Home | About | Journals | Submit | Contact Us | Français|
We propose and study two criteria to assess the usefulness of models that predict risk of disease incidence for screening and prevention, or the usefulness of prognostic models for management following disease diagnosis. The first criterion, the proportion of cases followed PCF(q), is the proportion of individuals who will develop disease who are included in the proportion q of individuals in the population at highest risk. The second criterion is the proportion needed to follow-up, PNF(p), namely the proportion of the general population at highest risk that one needs to follow in order that a proportion p of those destined to become cases will be followed. PCF(q) assesses the effectiveness of a program that follows 100q% of the population at highest risk. PNF(p) assess the feasibility of covering 100p% of cases by indicating how much of the population at highest risk must be followed. We show the relationship of those two criteria to the Lorenz curve and its inverse, and present distribution theory for estimates of PCF and PNF. We develop new methods, based on influence functions, for inference for a single risk model, and also for comparing the PCFs and PNFs of two risk models, both of which were evaluated in the same validation data.
Statistical models that predict disease incidence, disease recurrence or mortality following disease onset have broad public health and clinical applications. Disease incidence models are available for many diseases, including coronary heart disease (Wilson et al., 1998), colorectal cancer (Freedman et al., 2009) and other cancers (http://riskfactor.cancer.gov/cancer_risk_prediction/). Disease incidence models can be used to target preventive interventions to those with high enough risks to justify an intervention that has adverse effects (Gail al, 1999) and to identify high risk individuals for intensive screening to detect disease in its early stages. Prognostic models are useful for understanding the risk of cancer recurrence (e.g. Stephenson et al., 2006) or death following cancer diagnosis (e.g. Albertsen et al., 2005). Such information can inform choices for a treatment whose side effects may outweigh the risk of death from the initial disease.
Before a model can be recommended for practical use, its performance characteristics need to be understood. General criteria to evaluate prediction models for dichotomous outcomes include predictive accuracy, proportion of variation explained, calibration and discrimination (Gail and Pfeiffer, 2005; Hand, 1981; Pepe, 2003). Most recent validation studies have emphasized calibration and discrimination. A model is called well calibrated (or unbiased) when the predicted probabilities agree with observed risk in subsets of the population and overall. Discriminatory accuracy, which assesses how much the distributions of risk predictions differ among those who subsequently did or did not develop disease, is often defined as the area under the receiver operating characteristic (ROC) curve (AUC) (see Pepe, 2003, page 67). Non-parametric inference on the AUC is based on the Mann-Whitney-Wilcoxon test.
We propose two measures of concentration of risk that are directly relevant to public health decisions. Only if most of the total population risk is concentrated in a small high risk subset will a prevention strategy that focuses on high risk groups be effective (Rose, 1992). If risk is concentrated, AUC will typically be large, but the reverse is not necessarily true (see Discussion). We define the ’‘proportion of cases followed”, PCF(q), as the proportion of cases that would be followed in a program that followed the proportion q of the population at highest risk. If risk is concentrated in a small proportion of the population at highest risk, then PCF(q) will be high, even for small q. Pharoah et al. (2002) used this quantity to measure the usefulness of a risk model, but did not give methods of inference. We propose a new complementary criterion, the proportion needed to follow-up, PNF(p), namely the proportion of the general population at highest risk that one needs to follow in order that a proportion p of those destined to become cases will be followed. If risk is concentrated in a small proportion of the population at highest risk, PNF(p) will be small, even for large p.
By recognizing the relationship of PCF and PNF with the Lorenz curve of the distribution of risks in the population, we are able to draw on theory for the Lorenz curve and its inverse to derive their distribution theory. We develop new methods based on influence functions for inference for a single risk model, and also for comparing the correlated estimates of PCF and PNF for two risk models evaluated in the same validation data. In Section 2 we present the general framework, formally define PCF and PNF, and give nonparametric estimates for PCF and PNF with their asymptotic properties. We present results for simulated data (Section 3), a data example (Section 4) and discussion (Section 5).
We present notation and methods for disease incidence models, but they also apply to models of disease prognosis. We want to predict whether an individual will be diagnosed with a particular disease over a given time period, for example five years (Yi = 1) or not (Yi = 0). Individual i has a true, unknown probability of the event, πi = P(Yi = 1). Given a set of baseline predictors Xi for individual i, we approximate πi by a risk prediction model R(xi) = P(Yi = 1|Xi = xi), that is a mapping from the set Ω of possible values of X to [0, 1]. The distribution F of risk R in the population can be derived from the distribution of the covariates FX(x),
Because X includes continuous variables such as age, and length of the risk projection interval, we assume throughout that F is continuous, even though some results only require continuity at selected quantiles of F. From Gail and Pfeiffer (2005), we assume that the triplet (Yi, πi,Xi) arises from a superpopulation with unknown distribution that can be written as πY (1 − π)Y F(π|X)FX(X), where Y is conditionally independent of X given π, and F(π|X) is the unknown, and unobservable conditional distribution of the true, random probability of disease, π, given the observed covariates X. We assume the model is perfectly calibrated, i.e. R(x) = E(Y|x) = E(E(Y|x, π)|x) = E(π|x), as E(Y|x, π) = E(Y|π) = π. Thus, the overall disease prevalence satisfies and the distribution of risk in those who will and will not develop disease at the end of follow-up can be derived from the distribution F of risk in the population (Gail and Pfeiffer, 2005). Using that , the distribution of risk in diseased individuals (Y = 1) is given by
Similarly, the distribution of risk in non-diseased persons (Y = 0) is given by
Suppose the risk model R can be applied to an entire population to assign a disease risk r to each person. The model-based risks are then ranked, and a proportion of those individuals with highest risks of developing disease is followed up. Depending on the specific application, follow-up could consist of a diagnostic test, screening, or a preventive intervention. For example, the risk of colorectal cancer could be computed based on a risk model, and then a proportion of those with highest risk would receive more intensive screening with colonoscopy. Another example of “follow-up” arises in high risk preventive strategies (Rose, 1992) in which a proportion of the population at highest risk is given a preventive intervention, such as statins to prevent heart disease. In the context of prognosis after disease diagnosis, those at highest risk of dying might be followed-up by giving a special treatment.
The proportion of cases followed, PCF(q), is the proportion of individuals who will develop disease who are included in the proportion q of individuals in the population at highest risk. Let ξ1−q denote the (1−q)th quantile of the population distribution F, that is P(r > ξ1−q) = 1 − F(ξ1−q) = q. To obtain PCF(q) we apply (2) to the corresponding population quantile, ξ1−q = F−1(1 − q), and get
Note that after a change of variable,
which is also the Lorenz curve L of F evaluated at 1 − q. Thus PCF(q) = 1 − L(1 − q), where L(1 − q) is the ratio of the total population risk below the (1 − q)th quantile of F to the total risk in the population (Gastwirth, 1971; Goldie, 1977). Figure 1 (a) plots the Lorenz curve when F is a Beta(1,49) distribution for which PCF(0.1) = 0.325.
The proportion needed to follow-up, PNF(p), is the fraction of the general population at highest risks one needs to follow to assure that a fraction p of those destined to become cases will be followed. Hence the quantile ξ1−PNF = F−1(1 −PNF) satisfies
Thus, from (5),
Figure 1 (b) plots the Lorenz curve and shows the corresponding value of PNF(0.9) = 0.591 when F is Beta(1,49).
Assume that a validation cohort contains a random sample of individuals with measured covariates. Corresponding estimates r1,…,rn are a random sample from the continuous risk distribution F. We then can find nonparametric estimates of PCF and PNF as follows. Let r(1) ≤ … ≤ r(n) denote the order statistics of the estimated risks, and [x] be the largest integer less or equal to x. Let . An estimate of the Lorenz curve and thus PCF(q) is
Using Goldie’s (1977) result for the inverse function of the Lorenz curve, , for a fixed value of 1 − p, the PNF(p) is estimated as
In what follows, we often suppress the arguments q and p for ease of exposition.
For continuous F with finite mean, Ln(1−q) converges almost surely to L(1−q) and hence converges almost surely to PCF (Goldie, 1977). As 0 ≤ ri ≤ 1, F has finite second moments, and for a fixed q,
In the Web Appendix we derive the influence function ϕ(r) for Ln,
and use it to compute , that is given in the Web Appendix. By substituting empirical estimates Ln, Fn and = (1/n) Σri, we can estimate the influence of risk estimate ri as (ri). Then can be estimated as . In Section 3, however, we estimate using a bootstrap procedure.
Almost sure convergence of to PNF for a given p holds for continuous F with finite mean (Goldie, 1977). For a fixed p, and assuming that a condition on the variation of the quantile process F−1 near 0 (equation 2.23 in Csörgö and Yu, 1999) holds,
We evaluate by deriving the influence function
Motivated by the asymptotic theory, we study 95% confidence intervals with the bootstrap variance estimates, . The influence function representation is also valuable for computing the efficiency of the non-parametric estimates, compared to parametric estimates, and for comparing two models with respect to PCF and PNF, as shown next.
AUC values are often used to compare the discriminatory ability of two models. The model with larger AUC value better separates the distributions of risk in cases and non-cases. For small values of q and high values of p that are practically relevant, risk is more concentrated at high levels for a model that has larger PCF(q) or a smaller PNF(p).
We assume that risk models 1 and 2 are evaluated in the same population, and we observe paired risk projections for individuals i = 1, …, n in the population, from the joint distribution function of risks (R1, R2). We estimate from , i = 1, …, n, respectively.
To test whether, for fixed q, PCF1 = PCF2, we use the statistic
We denote the influence for subject i for risk model 1 by ϕi1, and the corresponding influence for risk model 2 by ϕi2. The influence function representation , and application of the central limit theorem prove asymptotic normality of , as var(ϕ1,i−ϕ2,i) exists because ϕ is bounded. We estimate PCF using a bootstrap in simulations and the data example. For each bootstrap sample we draw n bivariate observations with replacement from , i = 1, …, n, and then compute the empirical variance estimate based on the B bootstrap samples of the differences , b = 1, …, B. PCF can also be estimated as
where 1 and 2 are obtained as in Section 2.3. Using asymptotic normality, 95% confidence intervals for the difference can be computed as . Alternatively, confidence intervals for the differences could be computed based on the 0.025th and 0.0975th quantiles from the bootstrap distribution. Similar remarks apply to
where PNF estimates var(ψ1,i − ψ2,i).
Asymptotically both TPCF and TPNF have a central distribution under H0. Under the alternative, the non-centrality parameters are δPCF = n(PCF1−PCF2)2/var(ϕi1−ϕi2) and δPNF = n(PNF1 − PNF2)2/var(ψi1 − ψi2), respectively.
We used simulations to investigate the properties of the nonparametric estimates defined in Section 2.3. We also used the influence functions in Section 2.2 to compute the variances of and to compare their efficiencies to maximum likelihood estimates (MLEs) under a fully parametric model. We assumed that observed risks ri, i = 1, …, n are a random sample from F, a beta distribution with parameters α and β. The cumulative distribution function can be written as F(x) = B(x, α, β)/B(α, β), where B(α, β) denotes the beta function and B(x, α, β) stands for the incomplete beta-function with parameters α and β. In this setting the distribution of risk in the cases, G(x) = B(x, α + 1, β)/B(α + 1, β), and thus
To investigate the inefficiency of non-parametric estimates, we also estimated PCF and PNF by finding the MLEs and based on ri, i = 1, …, n and then computing PCF(, ) and PNF(, ). Using the corresponding Fisher information matrix I(α, β) and applying the delta method yields the asymptotic variances of the MLEs PCF(, ) and PNF(, ). To provide context for our parameter choices, we also compute the AUC. We study parameters α = 6.55, 4.23 and 1 with corresponding β = 320.95, 207.27, and 49; the expected risk E(R) = 0.02 for each (α, β) pair. A 2% risk might be found, for example, for colorectal cancer over a ten year period. The AUC values for these choices of parameters are 0.609, 0.635 and 0.753, respectively.
Table 1 gives results for 1000 independent simulations each based on a random sample of size n = 500. Bootstrap variance estimates were based on B = 1000 replicates. The mean estimates of were very close to the theoretical values, and the coverage of the confidence intervals was close to the nominal 0.95 value. The bootstrap estimates of the standard deviations of agreed closely with the theoretical estimates based on the influence functions. For example, for with (α, β) = (4.23, 207.27) and q = 0.10, the theoretical calculation and mean bootstrap estimates were 0.114 and 0.111 respectively. For q = 0.3 the theoretical and mean bootstrap estimates of standard deviation were both 0.144. Similar good agreement was found for . Very similar results were found for parameters that yielded the same AUC values as Table 1, but with E(R) = 0.2 (Web Table 1).
The model with the least discriminatory accuracy, (α, β) = (6.55, 320.95), detects only a fraction PCF = 0.178 of cases when the fraction q = 0.10 of those at highest risk is studied, whereas the model with the most discriminatory accuracy (α, β) = (1, 49), detects PCF = 0.325. Likewise, to follow a fraction p = 0.90 of cases requires that the fraction PNF = 0.808 of the general population at highest risk be followed for the least discriminating model and PNF = 0.591 for the most discriminating model studied.
The non-parametric estimates were less precise than corresponding parametric estimates, as assessed by the asymptotic relative efficiency (ARE), or variance ratio. For small q, has an ARE less than 0.5, or a variance more than twice as large as the parametric estimates. For example, ARE = 0.473 for (α, β) = (4.23, 207.27) for q = 0.10. The AREs of compared to the MLE ranged from 0.682 to 0.895 over the parameters and choices of q studied. Even though the non-parametric procedures are less efficient, they yield consistent estimates, whereas incorrectly specified parametric methods need not.
We examined the size and power of tests TPCF in equation (13) and TPNF in equation (15) for comparing risk models 1 and 2 with data from the same validation sample. To obtain risk estimates from each model that have a marginal beta distribution and are correlated, we first generated n bivariate normal random variables (Xi1,Xi2) ~ MVN((0, 0), Σ), i = 1, …, n, where Σ11 = Σ22 = 1 and Σ12 = Σ21 = ρ. We then computed where , k = 1, 2 denotes the inverse of the beta distribution function with parameters (αk, βk), and ϕ−1 is the inverse of the standard normal distribution. Based on the random sample (ri1, ri2), i = 1, …, n, we computed the non-parametric estimates of TPCF and TPNF with the bootstrap variance estimates based on B = 1000 bootstrap samples. Estimates of size and power were based on 1000 independent simulations for each choice of parameter values and use of TPCF and TPNF. Each simulation used a random sample of n = 500 bivariate risks.
The estimated correlations for the risk estimates (ri1, ri2) generated under this mechanism are close to the correlation used in the bivariate normal model (Table 2). For ρ = 0.25 the estimated correlations ranged from 0.19 to 0.29, and for ρ = 0.50 the estimated correlations ranged from 0.41 to 0.54 (Table 2).
Table 2 shows the size for TPCF and TPNF with nominal 5% significance level, for various choices of correlations, ρ, and common parameters for the two beta distributions. Theoretical size, Pt(TPCF > c) and Pt(TPNF > c), and empirical estimates, Pe(TPCF > c) and Pe(TPNF > c), are presented. Again, parameters were chosen so that ER1 = ER2 = 0.02. The tests had near nominal (0.05) size regardless of ρ or choice of beta distribution parameters (Table 2).
We compared the theoretical power Pt(TPCF > c) or Pt(TPNF > c) based on the influence-function calculation of non-centrality to power estimated from simulations for non-parametric estimates of TPCF and TPNF when risk estimates were generated from the normal-beta model. The estimated power agreed well with theoretical power estimates for TPCF and TPNF over a range of correlations ρ and marginal beta distributions of risk. Note that with a sample of n = 500 bivariate observations there is good power to detect subtle differences in PCF and PNF. For example, for ρ = 0.10, (α1 = 6.55, β1 = 320.95), (α2 = 9.50, β2 = 465.5), the estimated power was 0.834 to detect a difference in PCF of only PCF1 − PCF2 = 0.178 − 0.163 = 0.015. Likewise, for ρ = 0.25, the estimated power was 0.862 to detect a difference of PCF1 − PCF2 = 0.806 − 0.826 = −0.02.
For a few sets of parameters we compared the power of TPCF and TPNF with influence function based variance estimates to their power with the bootstrap variance estimates and found close agreement (unreported data).
Freedman et al. (2009) developed a model to predict colorectal cancer (CRC) incidence. We now compute PCF and PNF for that model in independent data on 108, 016 women from the National Institutes of Health (NIH)-AARP Diet and Health Study, a prospective cohort study (Park et al., 2009).
Let T denote the time to CRC. An individual’s CRC risk over the age interval (a, b) is . Here λ1 denotes the CRC hazard and λ2 the hazard for competing causes of death. As CRC incidence differed among three different sites in the colon, Freedman et al. modeled λ1 = λ11 + λ12 + λ13, as the sum of proximal (λ11), distal (λ12) and rectal (λ13) colon cancer hazard rates, with , i = 1, 2, 3, where denotes the age specific baseline hazard rate and RRi the relative risk, which was estimated from case-control data. The were obtained by multiplying the composite hazards from SEER cancer registries by one minus the attributable risk, which was estimated from the relative risks and the risk factor distribution in cases in the case-control study. Risk factors included an individual’s age, sex, sigmoidoscopy/colonoscopy history, polyps and other factors.
While the model was well calibrated overall among women in the AARP Diet and Health Study it showed lack of fit in some subgroups, possibly because the AARP questionnaire assessed sigmoidoscopy/colonoscopy history differently (Park et al., 2009). Because good estimates of PCF and PNF require good model calibration, we recalibrated the model. We first split the women in the AARP cohort randomly into a training and a test set, each comprised of 54,008 women, with 458 CRC cases in the training set and 467 in the test set. Following Cox (1958), we fit a logistic regression model with the log-transformed absolute risk estimate r as the independent variable to the training set. We then used logit(rc) = −4.7019 + 0.5335 log(r) to predict disease outcomes for women in the test set. Estimates rc were well calibrated overall and in groups defined by deciles of risk in the test set. The recalibration does not affect AUC, as rc is a monotone transformation of r.
In the test data, the model rc had . The modest discriminatory ability was also reflected in . For example, for q = 0.10, ; thus 17% of the cases were predicted to be in the 10% of the population at highest risk. From observed disease outcome in the test data, we computed that the proportion of cases with risks above the 90th percentile of F was 0.180 (95%CI : 0.145–0.215). Similarly, for PNF, for p = 0.90, , and in fact a fraction 0.914 with 95%CI(0.884, 0.940) of cases were found in the fraction of the population with highest risk. Thus were consistent with the observed outcome data in the test sample (Table 3).
We then compared the previous model (model 1) to model 2 that used a single hazard, . Model 2 was fit to the original case control data in Freedman et al. We applied the same recalibration to model 2, resulting in a logistic model, . Estimates were well calibrated in the test set.
Model 2 had in the test data, which was not significantly different from (p = 0.23). For q = 0.1, model 2 yielded a significantly lower , 95%CI(0.167, 0.168) than model 1 (, p < 0.0001) (Table 3). Likewise, for p = 0.9, model 2 resulted in a significantly larger , 95%CI(0.810, 0.811) than model 1 (, p < 0.0001). These small differences are statistically significant because the estimates are based on 54, 008 women in the test data, resulting in high precision.
Even though model 1 is statistically significantly superior to model 2, the practical improvement is small. For example, to assure that 90% of future cases are screened in the test cohort, model 1 would require screening of 0.808 × 54, 008 = 43, 638 women, compared to 0.811 × 54, 008 = 43, 800 for model 2.
We considered two criteria for evaluating risk models that measure the concentration of risk in a population and that have potential application in public health and clinical epidemiology. PCF(q), the proportion of cases who will be followed if the proportion q of the general population at highest risk is followed has been recommended by Pharoah et al. (2002). Hand and Henley (1997) used the “bad risk rate amongst accepts” that is analogous to 1− PCF, in credit risk models, but neither Pharoah et al. nor Hand and Henley described methods of inference. For models of disease incidence, PCF can determine the proportion of those destined to develop disease who will receive screening or preventive interventions. Following disease diagnosis, it can predict the proportion of all patients destined to have a bad outcome among patients selected for a treatment based on high risk. We introduced the criterion PNF(p), namely the proportion of the population at highest risk that needs to be followed in order that a proportion p of cases be followed as a complimentary guide to screening or intervention applications. PNF could be adapted to high risk preventive interventions (Rose, 1992), such as deciding what proportion of a population should be given statins to assure that at least 80% of all persons destined to have a myocardic infarction shall have received statins beforehand. The quantity 1−PCF(q) assesses what proportion of cases will fail to be followed if a proportion 1−q at lowest risk is not followed. Thus, PCF is useful for evaluating the effectiveness of an ongoing or proposed screening or prevention program by determining what proportion of future cases shall have participated. A “high risk strategy” requires that q be small, say 20% or less. Whether that strategy will be effective in covering cases depends on whether PCF(q) is large enough. PNF is useful in assessing the feasibility of the program required to cover a proportion p of future cases. Typically, one will want p to be 80% or more. Whether such a p is feasible depends on whether PNF(p) is small enough.
AUC, unlike PCF and PNF, does not measure risk concentration. For example, suppose that risk is zero in non-cases and uniformly distributed in [0, 1] in cases, and that half of the population develops disease. Then AUC = 1, but PCF(0.1) = 0.2 and PNF(0.9) = 0.45.
By relating PCF and PNF to the Lorenz curve of the risk distribution and its inverse, we adapted and developed the theory needed for inference on these quantities for risks from a single model applied to the distribution of covariates in an independent validation sample. We developed methods for testing whether PCF1 = PCF2 or PNF1 = PNF2 for two models evaluated on the same independent sample with bivariate risk estimates (r1i, r2i). Our methods allow for correlations between r1i and r2i. The work of Zheng and Cushing (2001) can be used to compare two PCFs with dependent data, but we are unaware of similar results for PNF. Methods are available for comparing two Lorenz curves from independent samples (e.g. Dardanoni and Forcina, 1999). When using such tests, care should be taken to assure that each model is well calibrated. Otherwise, PCF1 > PCF2 or PNF1 < PNF2 may simply reflect miscalibration of one or both models. We plan to explore the sensitivity of our methods to miscalibration in further simulations in future research.
The ideal data for estimating PCF and PNF are from a random sample from the distribution of covariates in an independent population of interest. This sample can be used to derive the distribution of risks for one or more models. If random samples of cases (Y = 1) and controls (Y = 0) are available from a population, with corresponding distributions of risk G and K, and if disease is rare so that K nearly represents F, then has a simple distribution theory because and Ĝ are independent (Greenhouse and Mantel, 1950). The same is true for . If the disease is not rare but the disease risk μ is known, one can use F = μG + (1−μ)K in expressions for PCF and PNF. However, the corresponding distribution theory is not developed.
Other criteria have been proposed to evaluate risk models, apart from calibration and the AUC. If loss functions can be specified, they can determine the optimal risk threshold for intervention, t*, and indicate which risk model has smaller expected losses (Vickers and Elkin, 2006; Gail and Pfeiffer, 2005) or larger PCF(q) for q = 1 − F(t*). Cook (2007) proposed reclassification criteria based on risk thresholds, and Pencina et al. (2008) proposed a related measure, the net reclassification index. Another criterion they recommended, the integrated discrimination improvement, is not tied to a particular risk threshold, and can be regarded as a global criterion, like the AUC. Pepe et al. (2008) point out that the ROC curve, which is a plot of 1−G(r) against 1−K(r) as r varies, suppresses information on r. Instead, they recommend use of the predictiveness curve, a plot r versus F(r), together with another plot with two curves: 1−G(r) versus F(r) and 1−K(r) versus F(r). Together, these three curves summarize the information on the risk distribution in the population, F(r), and the effects of various risk thresholds r on P(R > r|Y = 1) and P(R > r|Y = 0).
Huang and Pepe (2009) showed that ROC(q) = 1−G○K−1(1−q) and that dROC(q)/dq can be used to estimate the predictiveness curve and the ”total gain” statistic, introduced by Bura and Gastwirth (2001), if the disease prevalence μ is known. For rare diseases, F ≈ K, and PCF(q) approximates ROC(q), but more work is needed to prove the conjecture that PCF(q) and its derivative can be used to approximate the predictiveness curve and total gain.
The risk distribution F is central to the evaluation of risk models, because criteria such as expected loss, measures of relative dispersion of risk such as the Gini index, risk distributions G and K, and the AUC are functionals of F (Gail and Pfeiffer, 2005), and because of its value in displaying information in the ”integrated predictiveness and classification” plots of Pepe et al. (2008). We believe that two functionals of particular public health interest are PCF(q) and PNF(p).
We thank the reviewers for many helpful suggestions.
The Web Appendix referenced in Sections 2 and 3 is available under the Paper Information link at the Biometrics website http://www.tibs.org/biometrics.