Interest in targeted disease prevention has stimulated development of models that assign risks to individuals, using their personal covariates. We need to evaluate these models and quantify the gains achieved by expanding a model to include additional covariates. This paper reviews several performance measures and shows how they are related. Examples are used to show that appropriate performance criteria for a risk model depend upon how the model is used. Application of the performance measures to risk models for hypothetical populations and for US women at risk of breast cancer illustrates two additional points. First, model performance is constrained by the distribution of risk-determining covariates in the population. This complicates the comparison of two models when applied to populations with different covariate distributions. Second, all summary performance measures obscure model features of relevance to its utility for the application at hand, such as performance in specific subgroups of the population. In particular, the precision gained by adding covariates to a model can be small overall, but large in certain subgroups. We propose new ways to identify these subgroups and to quantify how much they gain by measuring the additional covariates. Those with the largest gains could be targeted for cost-efficient covariate assessment.
A risk model is a statistical procedure for assigning to an individual a probability of developing a future adverse outcome in a given time period. The assignment is made by combining his or her values for a set of risk-determining covariates with incidence and mortality data and published estimates of the covariates’ effects on the outcome. Such risk models are playing increasingly important roles in the practice of medicine, as clinical care becomes more tailored to individual characteristics and needs. The recent literature contains useful discussion of risk models and their evaluation [1-5].
The utility of a risk model is assessed with respect to two attributes. Its accuracy (also called its calibration) reflects how well the model predictions agree with outcome prevalences within subgroups of the population. Its precision (also called its discrimination or resolution) reflects its ability to discriminate those with different true risks. An important issue is to quantify how much a model's attributes can be improved by expanding it with additional covariates. We describe several scores used to assess and compare models with respect to the two attributes. The relevance of these scores depends strongly on how the models will be used, as shown by the following three applications.
The first application is to determine eligibility for a randomized clinical trial involving a treatment with serious side effects. For example, eligibility for the National Surgical Adjuvant Breast and Bowel Project (NSABP) trial of tamoxifen to prevent breast cancer was determined by the Breast Cancer Risk Assessment Tool (BCRAT). Because tamoxifen increases the risk of endometrial cancer and deep-vein thrombosis, and because its value in preventing breast cancer was unknown, eligibility was restricted to women whose breast cancer risks were deemed high enough to warrant risking these side effects. Thus eligible women were those whose BCRAT-assigned five-year breast cancer risk exceeded 1.67%. Here a good risk model is one that yields a low proportion of women whose assigned risk exceeds 1.67% but whose actual risk is lower. A model without this attribute could expose women needlessly to the side effects of tamoxifen.
The second application is to improve the cost-efficiency of a preventive intervention. For example, screening with breast magnetic resonance imaging (MRI) detects breast cancers earlier but costs more and produces more false positive scans, compared to mammography. Costs can be reduced by restricting MRI to women at high breast cancer risk. The risk model most useful for this application is one with high sensitivity and specificity when “high risk” is defined as having an assigned risk exceeding some fixed cut-off value.
The third application is to facilitate personal health care decisions. Consider, for example, a postmenopausal woman with osteopenia who must choose between two drugs, raloxifene and alendronate, to prevent hip fracture. Because she has a family history of breast cancer, raloxifene would seem a good choice, since it also reduces breast cancer risk. However she also has a family history of stroke, and raloxifene is associated with increased stroke risk. To make a rational decision, she needs a risk model that provides accurate and precise information on her own risks of developing three adverse outcomes (breast cancer, stroke, hip fracture), and the effects of the two drugs on these risks.
The first two applications involve dichotomizing risks into “high” and “low” categories; thus they require risk models with low false positive and/or false negative probabilities with respect to the outcome. In contrast, the third application involves balancing risks for multiple adverse outcomes, and thus it requires risk models whose assigned risks are accurate enough at the individual level to facilitate rational health care decisions.
This paper reviews and synthesizes several performance measures with respect to a model's accuracy and precision. We begin by formalizing the notions of a population risk distribution and a risk model. We then describe several measures of model performance developed for forecasting problems in meteorology, economics and psychology [10-13], and we use hypothetical data to illustrate their use in medical applications. We note important limitations of the measures, such as their loss of useful information on model performance in specific population subgroups. This is followed by a discussion of how the measures are used to compare performance for different risk models, particularly when one model expands another with additional covariates. We propose a new method for evaluating performance improvement within subgroups of the population, which helps identify the individuals who profit most from the additional covariates. We illustrate the method by application to data on risk of breast cancer, and include a glossary in the Appendix.
Consider a population of women at risk of developing breast cancer within a ten year period. We assume that each woman has an unknown probability p of this outcome, which depends on her values for a set of covariates z through an unknown relation p = ξ (z). The covariates z may include continuous and/or discrete components, and thus the risks p may assume continuous or discrete values. In practice, however, continuous covariates and risks are grouped into finitely many discrete categories, and here we shall represent all risks as discrete.
The distribution δ (z) of covariates in the population determines the population risk distribution f (p) via the relation

f (p) = Σ δ (z),   (1)

where the sum is taken over the set of covariates z such that ξ (z) = p.
The mean π of the risk distribution f specifies the prevalence of the outcome in the population. In symbols, π = E [Y ] = Pr (Y = 1), where the random variable Y is an indicator assuming the value one if a woman develops breast cancer in ten years, and zero if she dies from other causes or survives the period without breast cancer. The variance σ2 of f specifies the degree of risk heterogeneity in the population.
Plate 1 shows two risk distributions, both with mean 10%, that provide bounds on population risk heterogeneity for all populations with outcome prevalence π = 10%. At one extreme, the “constant” distribution c (left panel) assigns all mass to the mean risk π, with c (p) = 1 if p = π, and c (p) = 0 otherwise. Under this distribution, all in the population have the same risk π, and the risk variance σ2 = 0. The variance of this distribution gives a lower bound on that of any distribution with mean π.
At the other extreme, the deterministic distribution d (right panel) assigns mass 1 – π to p = 0 and mass π to p = 1, with d (p) = 1 – π if p = 0, and d (p) = π if p = 1. Under this distribution, the outcome occurs deterministically, with probability one in a fraction π of the population, and with probability zero in the remainder. The risk variance is σ2 = π (1 – π), which gives an upper bound on the variance of any risk distribution with mean π. When the mean risk is π = 10%, for example, the maximum standard deviation of risks is 30%. In summary, the variance σ2 of any risk distribution f of equation (1) with mean π is bounded below by the zero variance of the constant distribution, and bounded above by the variance π (1 – π) of the deterministic distribution:

0 ≤ σ2 ≤ π (1 – π).   (2)
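These bounds are easy to check numerically. A minimal sketch in Python (our own illustration, using the prevalence π = 10% discussed above; the variable names are ours):

```python
# Variance bounds for risk distributions with mean pi = 0.10 (10%).
pi = 0.10

# Constant distribution c: all mass at p = pi, so the risk variance is zero.
var_constant = 0.0

# Deterministic distribution d: mass 1 - pi at p = 0 and mass pi at p = 1.
# Its variance E[P^2] - (E[P])^2 = pi - pi^2 = pi * (1 - pi) is the upper bound.
var_deterministic = pi * (1 - pi)

print(round(var_deterministic, 2))         # 0.09
print(round(var_deterministic ** 0.5, 2))  # 0.3, the maximum SD of risks
```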
Moreover, the standard decomposition of a variance into the variance of a conditional expectation plus the expectation of a conditional variance gives the variance of outcomes as the sum

π (1 – π) = σ2 + [π (1 – π) – σ2].   (3)
The term π (1 – π) – σ2 in equation (3) is the nonlinear analogue of the residual sum of squares in linear regression. We shall see that this term produces an unknown and inflexible bound on the precision of any risk model.
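The decomposition just described holds for any discrete risk distribution. A minimal numerical check, with an illustrative distribution f of our own choosing (not the paper's examples):

```python
# Check: var(Y) = pi(1 - pi) splits into sigma^2 (variance of true risks)
# plus E[P(1-P)] (the residual, "binomial" part), for any discrete f(p).
risks   = [0.02, 0.05, 0.10, 0.30, 0.74]   # hypothetical support of f
weights = [0.30, 0.30, 0.20, 0.15, 0.05]   # f(p); the weights sum to 1

pi     = sum(w * p for w, p in zip(weights, risks))              # mean risk
sigma2 = sum(w * (p - pi) ** 2 for w, p in zip(weights, risks))  # var of f

# E[Var(Y | P)] = E[P(1 - P)], the residual part of the outcome variance.
resid = sum(w * p * (1 - p) for w, p in zip(weights, risks))

assert abs(pi * (1 - pi) - (sigma2 + resid)) < 1e-12
print(round(pi, 4), round(sigma2, 4), round(resid, 4))
```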
We don't know the complete covariates z that determine a woman's breast cancer risk, much less their population distribution δ (z) or the relation p = ξ (z) between covariates and risks. Instead, we use measured covariates x = k (z) and epidemiological data to develop a statistical model that assigns to all women with covariates x a risk γ (x) = r, 0 < r < 1. Thus a woman's unknown true risk p is approximated by her assigned risk r.
The unknown distribution δ (z) of complete covariates determines the distribution g (r) of a model's assigned risks by the relation g (r) = Σ δ (z), where the summation is taken over the set of covariates z such that γ [k (z)] = r. Similarly δ (z) determines the joint distribution fP,R (p, r) of true risk P and assigned risk R as fP,R (p, r) = Σ δ (z), with summation taken over the set of covariates z such that ξ (z) = p and γ [k (z)] = r. The distribution of true risks among individuals assigned risk r is

fP|R (p|r) = fP,R (p, r) / g (r).   (4)
The mean π (r) of this distribution is the outcome prevalence in the subgroup of individuals with assigned risk r; in symbols, π (r) = Pr (Y = 1 | R = r).
The performance of a risk model is best evaluated by applying it to a cohort of individuals independent of those used to develop the model. The outcome prevalences π (r) are estimated by the observed cumulative incidence of the outcome among subjects in the various assigned risk groups r. To simplify the presentation and focus on the probabilistic properties of the performance measures, I assume throughout that the cohorts are so large that the sampling error of the estimates can be ignored. Thus the estimated prevalence is identified with π (r).
What attributes do we want in a risk model? The answer depends on how the model will be used. If used to determine eligibility for clinical trials, a model should have high positive predictive power, to ensure that participants satisfy the eligibility criteria. If used to improve the cost-efficiency of a preventive intervention, high values for both positive and negative predictive power are needed. If used to facilitate personal health care decisions, the requirements are more stringent: we want the model to assign each individual a risk that agrees well with his or her true risk.
How do we measure accuracy when we don't know people's true risks? Since we can't assess accuracy at the individual level, we instead assess it at the group level, by measuring agreement between a woman's assigned risk r and the outcome prevalence π (r) (i.e., the mean true risk) among all those who are assigned the same risk r as she. Measures of model calibration describe this agreement.
However even a perfectly calibrated model has limited utility for any application if it assigns a single risk to a group of individuals whose true risks vary substantially. Such a group might consist of two subgroups, one containing individuals at high risk and the other individuals at low risk. Overall, their outcome prevalence might agree with their assigned risk, but important risk-determining covariates would not be reflected in the model. Thus the second desirable attribute of a model is its precision (also called resolution or discrimination), which reflects its ability to sort the population into subgroups with different true risks.
In considering a model's calibration and precision, it is useful to compare it to a theoretical, perfect model (Model P) that assigns each individual his or her true risk: r = p. The risks of Model P are those people would receive if we could measure their complete covariates z, and could correctly specify the relation p = ξ (z) between covariates and risk. The distribution of assigned risks for Model P is just g (r) = f (r).
A risk model's calibration describes how well it agrees with outcome prevalences in subgroups of the population. The calibration bias in a group of individuals assigned a given risk r is the difference r – π (r) between assigned risk and outcome prevalence in the subgroup. A risk model is said to be well-calibrated if its calibration bias is zero for all assigned risks r. Calibration biases are displayed in a calibration plot or attribute diagram, which is a plot of the points (r, π (r)) for given assigned risks r. For well-calibrated models, such as the perfect Model P, these points lie on the 45-degree line.
To illustrate calibration with a simple hypothetical example, consider the risk distributions in panels (a) and (b) of Plate 2 for two populations with different distributions of four risk-determining covariates z = z0, z1, z2, z3 (see Appendix I for details). Both risk distributions have mean π = 10%, and both sets of risks range from 2% to 74%. However the variance of risks is larger for Population B (σ2 = 2.83%) than for Population A (σ2 = 0.30%). Now consider a model that assigns risks using only the first two of the covariates, z0 and z1. Panels (c) and (d) show the distributions g (r) of outcome prevalences π (r) within the five risk groups of this model. The model can assign each of these five subgroups any risk between 0 and 1. However it is well-calibrated to a population only if it assigns each risk group a risk r that equals the outcome prevalence in that group. Thus a model that is well calibrated to Population A assigns the five risks shown on the abscissa of panel (c) of Plate 2. If these same risks were assigned to individuals in Population B, the model would show the calibration biases displayed in the attribute diagram in the upper panel of Plate 3. These biases occur because Populations A and B have different distributions of the two unmeasured risk-determining covariates z2, z3, and thus they have different outcome prevalences within assigned risk groups.
A model's accuracy is often summarized by its overall calibration bias, whose square is the average of the squared biases [r – π (r)]2 (vertical distances of points from the diagonal line in Plate 3), weighted by the proportions g (r) of individuals in the assigned risk groups:

Biasg2 = Σr g (r) [r – π (r)]2.   (5)
The overall calibration bias for the data in the upper panel of Plate 3 is Biasg = 8.9%. This value reflects the discrepancy between assigned risk and outcome prevalence in the entire population. In practice, risks are assigned to a sample of individuals who are followed for outcome occurrence. In this circumstance the Hosmer-Lemeshow (HL) statistic, which is based on an estimate of Biasg2, can be used to test the null hypothesis that the model is well-calibrated. Approximate significance levels can be obtained by referring the HL statistic to a chi-square distribution on n degrees of freedom, where n is the number of risk groups with at least 20 events. However a P-value for statistical significance (suggesting poor calibration) is less informative than an attribute diagram, such as those in Plate 3. Such a P-value depends strongly on the size of the sample, and it tells little about the overall magnitude of the bias and nothing about population subgroups for whom the model is particularly biased.
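The overall calibration bias and a grouped HL-type statistic can be sketched as follows. All numbers are hypothetical, and the particular HL form shown (a standard grouped version) is our assumption, not a formula from this paper:

```python
# Overall calibration bias and an HL-type test statistic for hypothetical
# risk groups (all numbers illustrative, not from the paper's tables).
import math

g          = [0.40, 0.30, 0.20, 0.08, 0.02]  # g(r): group proportions
assigned   = [0.02, 0.06, 0.12, 0.30, 0.60]  # assigned risks r
prevalence = [0.03, 0.05, 0.15, 0.25, 0.70]  # outcome prevalences pi(r)

# Squared overall bias: g(r)-weighted average of the squared group biases.
bias2 = sum(gi * (r - p) ** 2 for gi, r, p in zip(g, assigned, prevalence))
bias = math.sqrt(bias2)

# Grouped Hosmer-Lemeshow-type statistic for a cohort of n subjects:
# sum over groups of n_k (pi_hat_k - r_k)^2 / [r_k (1 - r_k)].
n = 10000
hl = sum(n * gi * (p - r) ** 2 / (r * (1 - r))
         for gi, r, p in zip(g, assigned, prevalence))
print(round(bias, 4), round(hl, 1))
```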
The overall precision (also called discrimination or resolution) of a model reflects its ability to discriminate among those with different true risks. Several closely related summary measures of precision have been proposed. Most are simple functions of the population outcome prevalence π and the model variance varR [π (r)]. The latter is the variance of outcome prevalences across subgroups assigned different risks r, i.e., the variance of the conditional risk distribution fP|R (p|r) of equation (4). (Note that a model's precision does not depend on the variance of the actual assigned risks, which, if biased, can vary quite differently from the outcome prevalences). As shown in Appendix II, the model variance has a useful heuristic interpretation as the covariance between the outcome prevalences within assigned risk groups and the individual outcome indicators Y:

varR [π (r)] = cov [π (R), Y].   (6)
A model's precision loss is the difference

PLg = π (1 – π) – varR [π (r)]   (7)
between the maximum risk variance π (1 – π) for a deterministic outcome and the model variance. The larger the model variance varR [π (r)], the smaller the precision loss.
Similar to a model's precision loss is its prevalence-outcome (PO) correlation coefficient ρg, which measures the extent of correlation between individual outcomes and the outcome prevalences within risk groups:

ρg = cov [π (R), Y] / {varR [π (r)] π (1 – π)}1/2 = {varR [π (r)] / [π (1 – π)]}1/2.   (8)
The second equality in (8) follows from (6). The squared correlation ρg2 is also called the Yates slope, and the fraction of outcome variation explained by the model, a generalization of R2 from linear to binary outcomes [14,19]. Equations (7) and (8) show that the precision loss is proportional to the fraction of outcome variation unexplained by the model:

PLg = π (1 – π) (1 – ρg2).   (9)
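The identities linking the model variance, the PO correlation and the precision loss can be verified numerically. A sketch with hypothetical risk groups (all values are ours):

```python
# Numerical check: the model variance equals cov[pi(R), Y], the squared PO
# correlation is var / [pi (1 - pi)], and the precision loss equals
# pi (1 - pi)(1 - rho^2). Hypothetical risk groups.
g    = [0.40, 0.30, 0.20, 0.08, 0.02]   # g(r): proportions in risk groups
prev = [0.03, 0.05, 0.15, 0.25, 0.70]   # pi(r): outcome prevalences

pi = sum(gi * p for gi, p in zip(g, prev))               # overall prevalence
var_model = sum(gi * (p - pi) ** 2 for gi, p in zip(g, prev))

# Since E[Y | R = r] = pi(r), E[pi(R) Y] = E[pi(R)^2], so
# cov[pi(R), Y] = E[pi(R)^2] - pi^2 = the model variance.
cov = sum(gi * p * p for gi, p in zip(g, prev)) - pi * pi
assert abs(cov - var_model) < 1e-12

rho2 = var_model / (pi * (1 - pi))   # squared PO correlation
pl = pi * (1 - pi) - var_model       # precision loss
assert abs(pl - pi * (1 - pi) * (1 - rho2)) < 1e-12
print(round(var_model, 5), round(rho2, 4), round(pl, 5))
```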
The most widely used summary precision measure is the concordance (also called the C-statistic or area under the receiver operating characteristic (ROC) curve (AUC)). To describe it, we introduce the conditional distributions of assigned risk among those who do and do not develop the outcome:

h1 (r) = Pr (R = r | Y = 1) = π (r) g (r) / π, h0 (r) = Pr (R = r | Y = 0) = [1 – π (r)] g (r) / (1 – π).   (10)
The concordance ζ is the probability that an assigned risk chosen from the distribution h1 exceeds one chosen from h0 [18,20]. Here H1 (r) and H0 (r) are the cumulative distributions corresponding to h1 (r) and h0 (r). As indicated by its AUC acronym, a model's concordance is the area under its ROC curve. This curve is a plot of points (x (r), y (r)) as r varies over the model's assigned risks, increasing from 0 to 1. The abscissa x (r) and ordinate y (r) are, respectively, the proportions 1 – H0 (r) and 1 – H1 (r) of individuals without and with the outcome who are assigned a risk exceeding r. The concordance is invariant under any rank-preserving transformation of the assigned risks. Thus any set of assigned risks whose ranks agree with those of the outcome prevalences π (r) has the same concordance as that of a well-calibrated risk model.
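The concordance can be computed directly from the conditional distributions h1 and h0. A sketch with hypothetical risk groups; counting ties between the two draws with weight 1/2 is a common convention and our assumption here:

```python
# Concordance (AUC) from the conditional assigned-risk distributions h1, h0;
# ties between the two draws get weight 1/2. All numbers are illustrative.
risks = [0.02, 0.06, 0.12, 0.30, 0.60]   # assigned risks r
g     = [0.40, 0.30, 0.20, 0.08, 0.02]   # g(r)
prev  = [0.03, 0.05, 0.15, 0.25, 0.70]   # pi(r)

pi = sum(gi * p for gi, p in zip(g, prev))
h1 = [p * gi / pi for gi, p in zip(g, prev)]              # given Y = 1
h0 = [(1 - p) * gi / (1 - pi) for gi, p in zip(g, prev)]  # given Y = 0

n = len(risks)
zeta = sum(h1[i] * h0[j] for i in range(n) for j in range(n)
           if risks[i] > risks[j])
zeta += 0.5 * sum(h1[i] * h0[i] for i in range(n))  # ties count half
print(round(zeta, 4))
```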
One yardstick for evaluating a summary measure concerns how easily it can be interpreted by patients and clinicians. As a function of the conditional distributions h1 (r) and h0 (r) of (10), the concordance is most easily interpreted as a retrospective assessment of how well a model discriminates the risks of those with and without the outcome. The other measures also can be interpreted this way. To see this, note from equations (10) and (8) that the squared PO correlation coefficient can be written as the mean outcome prevalence among the assigned risk subgroups of those who develop the outcome, minus the corresponding mean in those who do not: ρg2 = E [π (R) | Y = 1] – E [π (R) | Y = 0]. This representation has motivated the name Integrated Discrimination for ρg2. Equations (8) and (9) imply that as functions of ρg2, a model's variance and precision loss share this retrospective interpretation. Nevertheless the interpretation has limited utility for the applications considered earlier, particularly that of facilitating personal health care decisions [22-25]. To their advantage, the model variance, the precision loss and the PO correlation also have prospective interpretations as measures of the extent to which individuals with different true risks are grouped together and assigned a common risk.
Accuracy and precision measures can be combined into a single score for model performance. For example, a model's Brier score, also called its mean probability score, is the mean squared error between individual outcomes and assigned risks: BSg = E [(Y – R)2]. A decomposition of the Brier score, due to Murphy, shows that BSg = Biasg2 + PLg, where Biasg2 is the model's squared calibration bias (5), and PLg is its precision loss (7). However accuracy and precision reflect two different aspects of model performance, and some have argued that they should not be combined.
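Murphy's decomposition of the Brier score into squared overall bias plus precision loss can be checked numerically; the risk groups below are hypothetical:

```python
# Check of Murphy's decomposition: Brier score = squared overall bias +
# precision loss, for hypothetical assigned risks and group prevalences.
risks = [0.02, 0.06, 0.12, 0.30, 0.60]   # assigned risks r
g     = [0.40, 0.30, 0.20, 0.08, 0.02]   # g(r)
prev  = [0.03, 0.05, 0.15, 0.25, 0.70]   # pi(r)

pi = sum(gi * p for gi, p in zip(g, prev))

# Brier score E[(Y - R)^2]: average over groups, then over outcomes in group.
bs = sum(gi * (p * (1 - r) ** 2 + (1 - p) * r ** 2)
         for gi, r, p in zip(g, risks, prev))

bias2 = sum(gi * (r - p) ** 2 for gi, r, p in zip(g, risks, prev))
var_model = sum(gi * (p - pi) ** 2 for gi, p in zip(g, prev))
pl = pi * (1 - pi) - var_model

assert abs(bs - (bias2 + pl)) < 1e-12
print(round(bs, 5))
```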
Calibration measures have limited utility for evaluating a risk model designed to facilitate personal health care decisions. The woman who needs to know her breast cancer risk requires a model that accurately assesses her own risk, rather than the mean risk in the subgroup of women with the same assigned risk as she. In the example shown in Plate 2, for instance, the subgroup of Population A with assigned risk of 33% contains some individuals whose true risk is 26%, others whose true risk is 42% and still others whose true risk is 74%. These differences reflect unmeasured covariates omitted from the model. Moreover no model can be well-calibrated to two populations with different distributions of such unmeasured covariates.
Precision measures also have limitations. They are bounded by the unknown distribution of measured and unmeasured risk factors in the population under study. Indeed, as shown in Appendix II, a model's variance is bounded by the variance σ2 of the perfect model, which by (2) is itself bounded by the variance of a deterministic outcome:

varR [π (r)] ≤ σ2 ≤ π (1 – π).   (11)
The other precision measures of the previous section are similarly bounded. They approach their optimal levels as the risk model captures increasingly many of the risk-determining covariates. Nevertheless they all are constrained by the distribution of these covariates in the population to which the model is applied (which the investigator cannot control). Consider, for example, the risk model based on the first two of the four risk-determining covariates z0, z1, z2, z3 (Model 1) as applied to populations A and B in Plate 2. Table I shows that the model variance is varR [π (r)] = 0.24% in Population A, which by inequality (11) is bounded by the variance σ2 = 0.30% of true risks. In contrast, the model variance in Population B is varR [π (r)] = 2.57%, with the more relaxed upper bound of 2.83%. Table I also shows the values of other precision measures for the model. For example, the PO correlation coefficients for Model 1 (16.3% for Population A and 53.4% for Population B) are not far from their optimal values 18.1% and 56.1% for these two populations, given by the perfect model. Similarly, the concordance of Model 1 is 59.4% for Population A and 80.6% for Population B. For comparison, concordances of the perfect model for these two populations are 60.3% and 83.3%, respectively.
This example shows that a model's poor precision may be caused more by population covariate homogeneity than by loss of covariate information, a fact that has been noted in the literature, but that is not widely appreciated when precision is reported in medical applications. Indeed, a risk model that uses the same covariates in the same way can have very different precision when applied to populations with different covariate distributions. Because of this strong dependence of precision measures on the population covariate distribution, comparing summary measures of model performance in different populations requires caution. For example, Table I shows that if we applied Model 1 to Population B and Model 2 to Population A, we would conclude that Model 1 is superior to Model 2 by any precision measure, despite the consistently better precision of Model 2 when both models are applied to the same population.
Finally, all summary performance measures are aggregates that obscure important information on subgroups of the population, which may limit their utility for some applications. Suppose, for example, that a model is needed to improve the cost-efficiency of a preventive intervention. Then the goal is to split the population into individuals at high and low risk, with the risk cutpoint separating the two groups determined by considerations of efficacy and cost. A risk model's usefulness for this goal depends on the coordinates (x (r), y (r)) of its ROC curve at the predetermined risk cutpoint r of interest, rather than the area under the entire curve, as given by its concordance. In the following section we provide additional examples showing how summary performance measures can obscure model features that are important for particular subgroups of the population, and we suggest solutions to the problem.
It is often desirable to evaluate two risk models applied to the same population. Of special interest is the case when one model expands the other with additional covariates [27-28]. An important issue is how to compare the two models, which may differ in both calibration accuracy and precision, with respect to their utility for a given clinical application.
The overall calibration of two models can be compared by evaluating the difference or ratio of their squared biases, given by equation (5). Consider, for example, the two populations in Plate 2. Suppose we expand Model 1 by including one additional covariate z2 (see Appendix A.1.). Panels (e) and (f) of Plate 2 show the distributions of outcome prevalences within the seven assigned risk groups of the expanded Model 2. The overall calibration bias incurred by Model 2, if calibrated to Population A but applied to Population B, is 5.5%, an overall improvement in accuracy compared to the value 8.9% for Model 1. In general, however, expanding a model with additional covariates can actually decrease its accuracy. In population B, for instance, Model 2 might assign biased risks to each of the two subgroups with outcome prevalences 38.8% and 67.6% (panel (f) in Plate 2), whereas Model 1 might assign an accurate risk of 61.8% to the combined group in panel (d).
Comparison of the models’ attribute diagrams provides useful additional information about the extent of bias associated with specific assigned risks. The upper and lower panels of Plate 3 give attribute diagrams for the two models as calibrated to Population A and applied to Population B. It is evident that individuals with assigned risks exceeding the population mean of 10% enjoy the most bias reduction from Model 2 compared to Model 1.
A more informative comparison would classify individuals jointly according to the risks they receive by the two models, and determine the outcome prevalence in each of the resulting subgroups. For example, Table II shows outcome prevalences among 100,000 women from Population B who are classified according to the risk groups of two models that are poorly calibrated to this population because they were fit to the covariate distribution of Population A, as described in the Appendix A.1. Plate 4 compares these outcome prevalences to the risks assigned by the two models, which are shown in the row and column headings of the table. Both models exhibit substantial downward bias in the group with the highest outcome prevalence, with at most small bias in the five groups at lowest risk. Moreover compared to Model 1, the expanded Model 2 is less biased in groups 7 and 9, but more biased in groups 6 and 8. In practice, when risks are assigned to a sample of individuals who are followed for outcome occurrence, calibration can be tested using the Hosmer-Lemeshow (HL) statistic as applied to the subgroups determined by Model CC. In this context the HL statistic has been called the Reclassification Calibration Statistic, and Table II has been called a reclassification table.
Table II also shows that some individuals assigned different risks by Model 1 can receive the same risk by Model 2. Thus adding covariates to a model need not yield a more refined partition of the population; in this sense, nested sets of covariates need not yield nested sets of risks.
While expanding a model with additional covariates may actually decrease its accuracy, generally such expansion will increase its precision. The issue is whether the gain is large enough to warrant the cost of measuring the additional covariates.
Precision gains are usually measured as either the difference or ratio of model-specific precision scores, and most of these relative measures are closely related. For example, equations (7) and (9) show that the difference in precision loss for two models is just the difference between the models’ variances, and that both are proportional to the difference between the models’ squared PO correlation coefficients:

PLg1 – PLg2 = varR2 [π (r2)] – varR1 [π (r1)] = π (1 – π) (ρg22 – ρg12).   (12)
Here the subscripts 1 and 2 refer to the two risk models. The difference in squared PO correlations has been called the Integrated Discrimination Improvement (IDI). Equation (8) also shows that the ratio of squared PO correlations for the two models is just their variance ratio varR2 [π (r2)] / varR1 [π (r1)].
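The relations just described (difference of precision losses = difference of model variances, proportional to the IDI) can be illustrated with a minimal sketch in which Model 2 refines Model 1 by splitting one of its risk groups. All numbers are hypothetical:

```python
# Model 2 refines Model 1 by splitting Model 1's low-risk group in two.
g2    = [0.5, 0.2, 0.3]       # Model 2 group proportions (hypothetical)
prev2 = [0.02, 0.10, 0.30]    # outcome prevalences in Model 2 groups
g1    = [0.7, 0.3]            # Model 1 merges the first two groups
prev1 = [(0.5 * 0.02 + 0.2 * 0.10) / 0.7, 0.30]

def pi_var(g, prev):
    # overall prevalence and model variance for one set of risk groups
    pi = sum(gi * p for gi, p in zip(g, prev))
    return pi, sum(gi * (p - pi) ** 2 for gi, p in zip(g, prev))

pi, v1 = pi_var(g1, prev1)
_,  v2 = pi_var(g2, prev2)

pl1 = pi * (1 - pi) - v1           # precision losses of the two models
pl2 = pi * (1 - pi) - v2
idi = (v2 - v1) / (pi * (1 - pi))  # difference of squared PO correlations

assert abs((pl1 - pl2) - (v2 - v1)) < 1e-12
assert abs((v2 - v1) - pi * (1 - pi) * idi) < 1e-12
print(round(v2 - v1, 5))
```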
Recently Pencina et al proposed a new measure of precision gain called the Net Reclassification Index (NRI). The NRI is the probability that Model 2 provides a correct reclassification of risks (i.e., increasing the risk of one who develops the outcome or decreasing it for one who does not), minus the probability of an incorrect reclassification (increasing the risk of one who avoids the outcome or decreasing the risk of one who develops it). Specifically,

NRI = Pr (R2 > R1 | Y = 1) – Pr (R2 < R1 | Y = 1) + Pr (R2 < R1 | Y = 0) – Pr (R2 > R1 | Y = 0).   (13)
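The NRI just defined can be computed from a reclassification table. A sketch with hypothetical counts (the table below is our own illustration, not Table II):

```python
# NRI from a hypothetical reclassification table.
# cell[(r1, r2)] = (events, nonevents) among subjects assigned risk r1 by
# Model 1 and r2 by Model 2; all counts are illustrative.
cell = {
    (0.05, 0.05): (20, 700), (0.05, 0.15): (15, 100),
    (0.20, 0.05): (5, 60),   (0.20, 0.20): (30, 50), (0.20, 0.40): (12, 8),
}
events    = sum(e for e, n in cell.values())
nonevents = sum(n for e, n in cell.values())

# the four probabilities: up/down reclassification among events/nonevents
up_e   = sum(e for (r1, r2), (e, n) in cell.items() if r2 > r1) / events
down_e = sum(e for (r1, r2), (e, n) in cell.items() if r2 < r1) / events
up_n   = sum(n for (r1, r2), (e, n) in cell.items() if r2 > r1) / nonevents
down_n = sum(n for (r1, r2), (e, n) in cell.items() if r2 < r1) / nonevents

nri = (up_e - down_e) + (down_n - up_n)
print(round(nri, 3))
```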
Table I shows NRIs comparing a well-calibrated Model 1 to a well-calibrated expanded Model 2 and to the perfect Model P, for the example of Plate 2. For Model 2, the NRI is -0.102 in Population A and 0.806 in Population B. The corresponding values for Model P are 0.034 in Population A and 0.602 in Population B. These NRIs illustrate two anomalies of this index. First, as seen for Model 2 in Population A, the NRI can be negative even when the expanded model captures additional population variation in outcome prevalences. Second, the NRI does not increase monotonically with the variance of outcome prevalences across model risk groups: in Population B, for example, the NRI for Model 2 exceeds that of Model P, despite their variances of 2.70 × 10–2 and 2.83 × 10–2, respectively.
All of these comparative precision scores provide summary measures of the models’ relative ability to sort individuals into groups having different true risks. However comparison of model precision (no matter how measured) requires caution, for two reasons. First, the precision gained by one model relative to another depends on the covariate distribution in the population, and thus can vary across populations. This variation can be seen in Table I, where relative to Model 1, Model 2 provides an AUC increase of 1.5% for Population B but only 0.5% for Population A. Similarly, the NRI for Model 2 is 0.806 for Population B but -0.102 for Population A.
Second, the overall precision gain from adding new covariates can be negligible, even though the expanded model may be considerably more precise in certain subgroups of the population. This anomaly occurs because individuals with small risks, who typically comprise most of the population, tend to benefit little from increased precision, while the few with large risks have much to gain. Since the precision measures are averages over the entire population, the small gains for the majority tend to dominate the larger gains for a few. For example, Table I shows that the overall precision gain from expanding Model 1 with an additional covariate is small by any measure, and it tells us little about subgroups who may gain appreciable risk information by adding the new covariate.
This problem can be addressed by creating subgroup-specific measures of relative precision, where each subgroup consists of individuals assigned a common risk by a given model (say, Model 1). Consider for instance the outcome prevalences in Table II, which cross-classifies individuals according to risks assigned by Models 1 and 2. We wish to determine the Model 1 risk groups who gain most from the additional covariate used by Model 2. The last columns of Table II show several measures of precision gain within each of the five risk groups of Model 1 (rows), based on variation in outcome prevalences across risk groups of Model 2 (columns). These include ranges, standard deviations (SDs), NRIs and slopes of outcome prevalences. The SD for the ith Model 1 risk group is the square root of the variance of expected outcome prevalences πij across the columns, which differs from the sampling error variances of sample estimates. The NRI for the ith Model 1 risk group is obtained by conditioning the probabilities in (13) on membership in this risk group:

NRI (i) = Pr (R2 > R1 | Y = 1, R1 = ri) – Pr (R2 < R1 | Y = 1, R1 = ri) + Pr (R2 < R1 | Y = 0, R1 = ri) – Pr (R2 > R1 | Y = 0, R1 = ri).   (14)
Finally, the slope for the ith Model 1 risk group is defined as the rate of increase of cross-classified outcome prevalence per unit increase in Model 2 risk. (This measure differs from the slope suggested by Rosner et al., which is the rate of increase of log prevalence per decile of cross-classified risk.)
Table II shows that the standard deviation of outcome prevalences across the rows corresponding to the risks of Model 1 ranges from zero (in those assigned a risk of 10% by Model 1) to 11.5% (in those assigned the highest risk of 61.8%). Similarly, the NRI ranges from zero to 39%. According to these measures, the individuals at highest risk gain substantially more precision from the additional covariate than do others, and this information can be important for applications involving personal health decisions. If the additional covariates are difficult to measure, it might be cost-efficient to do so only for individuals in the risk groups of Model 1 with the most to gain, as suggested in the management of cardiovascular disease [31,32]. In contrast to the standard deviations and NRIs, the subgroup-specific slopes in Table II are similar, despite considerable variation in their risk variances. This similarity reflects a problem with the slope as a measure of subgroup-specific precision gain: it fails to reflect the range of Model 2 risks across a row. Thus two rows can have the same slope but represent substantially different variation in actual risk.
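The row-specific SD just described can be sketched as a weighted standard deviation of the outcome prevalences across the Model 2 columns of a single Model 1 row. The prevalences and membership weights below are hypothetical, not taken from Table II.

```python
import math

def row_sd(prevalences, weights):
    """Weighted SD of outcome prevalences pi_ij across Model 2 risk groups,
    within a single Model 1 risk group (one row of the reclassification table)."""
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(p * x for p, x in zip(probs, prevalences))
    var = sum(p * (x - mean) ** 2 for p, x in zip(probs, prevalences))
    return math.sqrt(var)

# A hypothetical Model 1 risk group whose members are spread by Model 2
# over three risk categories:
prevs = [0.05, 0.10, 0.30]   # outcome prevalences pi_ij across Model 2 columns
weights = [50, 30, 20]       # number of members in each column
print(round(row_sd(prevs, weights), 4))  # -> 0.095
```

A row whose members all land in one Model 2 column has SD zero: the additional covariate tells that subgroup nothing new.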
The row-specific SDs and NRIs in a reclassification table are closely related to the corresponding summary measures of precision gain for Model 2 compared to Model 1. For example, the overall difference varR2 [π (r2)] – varR1 [π (r1)] in model variances equals the difference between two sets of row-specific variances, each averaged over the relevant risk groups:

varR2 [π (r2)] – varR1 [π (r1)] = ER1 {varR1, R2|R1 [π (r1, r2) | r1]} – ER2 {varR1, R2|R2 [π (r1, r2) | r2]}.   (15)
Here, for example, varR1, R2|R1 [π (r1, r2) | r1] is the variance of outcome prevalences across the assigned risk groups of Model 2, within the group of individuals assigned risk r1 by Model 1 (see Appendix for a proof of equation (15)). Similarly, each of the four probabilities in the summary NRI of equation (13) is an average of the corresponding conditional probabilities in equation (14). For example,

P (up | event) = Σr1 P (up | event, r1) P (r1 | event).
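The relation stated above, that the difference in model variances equals a difference of subgroup-specific variances averaged over the relevant risk groups, can be checked numerically on a hypothetical cross-classification. The cell weights and prevalences below are illustrative only.

```python
import numpy as np

# Hypothetical cross-classification: Model 1 rows x Model 2 columns
w = np.array([[40., 10.,  0.],
              [15., 30., 10.],
              [ 2.,  8., 25.]])     # numbers of individuals in each cell
pi = np.array([[.02, .05, .05],
               [.04, .08, .15],
               [.05, .12, .30]])    # cell outcome prevalences pi_ij

W = w.sum()
# marginal prevalences: weighted averages of cell prevalences
pi_row = (w * pi).sum(axis=1) / w.sum(axis=1)   # Model 1 risk groups
pi_col = (w * pi).sum(axis=0) / w.sum(axis=0)   # Model 2 risk groups
mu = (w * pi).sum() / W

var_r1 = ((w.sum(axis=1) / W) * (pi_row - mu) ** 2).sum()
var_r2 = ((w.sum(axis=0) / W) * (pi_col - mu) ** 2).sum()

# variance of pi_ij within each Model 1 row, averaged over rows
within_rows = ((w / w.sum(axis=1, keepdims=True))
               * (pi - pi_row[:, None]) ** 2).sum(axis=1)
avg_within_rows = ((w.sum(axis=1) / W) * within_rows).sum()
# variance of pi_ij within each Model 2 column, averaged over columns
within_cols = ((w / w.sum(axis=0, keepdims=True))
               * (pi - pi_col[None, :]) ** 2).sum(axis=0)
avg_within_cols = ((w.sum(axis=0) / W) * within_cols).sum()

lhs = var_r2 - var_r1
rhs = avg_within_rows - avg_within_cols
print(np.isclose(lhs, rhs))  # -> True: the two sides agree
```

The identity holds exactly by the decomposition of a variance into the expectation of a conditional variance plus the variance of a conditional expectation, applied to the cross-classified model.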
We illustrate the precision measures of the previous sections by applying them to two models for risk of estrogen-receptor-positive (ER+) breast cancer among postmenopausal women. These models were developed by Rosner et al. using data from the prospective Nurses' Health Study (NHS). Model 1 is based on a woman's current age and breast cancer risk factors, and Model 2 includes one additional covariate representing estimated serum levels of endogenous estradiol. The investigators identified 1559 ER+ breast cancer cases in 746,590 person-years of follow-up among postmenopausal women, giving a crude annual incidence rate of 1559/746,590 = 0.0021 cases per woman per year. To illustrate the methods described here, we consider the ten-year breast cancer risks in a hypothetical cohort of postmenopausal women aged 50 years with the same covariate distribution and overall incidence as the NHS women. The annual death rate for US white women aged 50-59 years is 0.0053 deaths per woman per year. This rate, combined with the annual breast cancer incidence rate of 0.0021, yields a mean risk π = 2.02% of developing ER+ breast cancer within ten years. Rosner et al. cross-classified case counts, person-years of follow-up, and incidence rates by joint deciles of the risks assigned by Models 1 and 2.
Here we used these data to compute the ten-year breast cancer risks shown in Table III. We assume that the calibration bias is zero, i.e., that the probabilities in the joint risk groups of Table III represent the actual ten-year breast cancer prevalences. To evaluate the relative precision of the two models, we use the summary column and summary row of Table III to calculate standard deviations of 1.06% for the Model 1 risks and 1.26% for the Model 2 risks. These values correspond to PO correlation coefficients of 7.52% and 8.98%, respectively. The summary NRI comparing Model 2 to Model 1 is 44.07%, indicating that Model 2 classifies women more precisely with respect to their subsequent outcome occurrence.
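The quoted mean risk of 2.02% can be reproduced by treating the annual incidence and death rates as constant competing hazards over the ten years. This is a sketch of one standard actuarial calculation; the authors may have used a different life-table method.

```python
import math

def ten_year_risk(incidence, mortality, years=10):
    """Probability of developing the outcome within `years`, treating the
    annual incidence and death rates as constant competing hazards.
    Death removes a woman from risk before she can develop the outcome."""
    total = incidence + mortality
    return (incidence / total) * (1.0 - math.exp(-total * years))

# Annual ER+ breast cancer incidence and all-cause death rates from the text
risk = ten_year_risk(0.0021, 0.0053)
print(f"{100 * risk:.2f}%")  # -> 2.02%, matching the mean risk in the text
```

Ignoring competing mortality (setting the death rate to zero) would give a slightly larger risk, 1 − exp(−0.021) ≈ 2.08%, which shows why the death rate enters the calculation.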
Which risk groups of Model 1 benefit most from the addition of endogenous estradiol levels? The last columns of Table III show the standard deviation, NRI and slope of the cross-classified outcome prevalences within each of the Model 1 risk groups. (To calculate these quantities, we assumed that within each decile of Model 1 risk, women were distributed in the same proportions as the person-years of follow-up given by Rosner et al.) The row-specific SDs tend to increase with outcome prevalence; however, this trend is not seen in the row-specific NRIs or slopes. In summary, there are clear precision benefits associated with adding a serum estradiol measurement to the breast cancer risk assignments of Model 1 for postmenopausal women. These benefits are evident overall and in specific risk groups of Model 1.
We have reviewed several performance measures for risk models that use personal covariates to assign individual risks of a future adverse outcome. Application of the measures to risk models for hypothetical populations and for women at risk of breast cancer illustrates the following points. First, as noted by others [6,29], the characteristics needed in a risk model depend on how it will be used. Some applications use a model to split the population into subgroups at high and low risk of the outcome; these applications require risk models with low false positive and/or false negative probabilities. In contrast, other applications involve patients’ need to balance risks for multiple adverse outcomes, and thus they require models whose assigned risks are accurate enough at the individual level to facilitate rational health care decisions. This type of application places high demands on a risk model.
Second, performance measures have limitations that warrant consideration when using them to evaluate and/or compare models. For example, measures of both accuracy and precision depend on the distribution of risk factors in the population of interest. No model can be more precise than the ideal perfect model that assigns the true risks to all in the population. Because a model's precision bounds are unknown in practice, the precision gained by adding new covariates can be informative: a large precision gain indicates that the expanded model accounts for substantial variation in risk. However, a small precision gain is less informative, because it cannot tell us whether the original model was already near the precision bound or the added covariates simply carry little risk information.
Summary performance measures can obscure model features of importance to subsets of the population. This limitation can be addressed by focusing on subgroup-specific performance measures, which can help identify those with the most to gain by enlarging a model to include covariates that distinguish their risks. These individuals could then be targeted for the additional covariate assessment [31,32]. We have reviewed methods for evaluating subgroup-specific gains in both calibration and precision, and we have proposed a new method for evaluating the precision gain within each of the risk groups of the reduced model, by quantifying the spread of outcome prevalences across the joint risk groups of both models.
Evaluating performance in population subgroups also helps assess a model's value for facilitating personal health decisions. An individual who needs to know his own risk does not care how a model performs for others in the population; yet summary performance measures involve the distribution of covariates in the entire population to which he belongs. Restricting the measures to the subgroup with risks similar to his own provides performance measures more relevant to his needs.
Finally, sampling error issues have largely been suppressed here, in order to simplify the presentation and focus on the probabilistic properties of the performance measures. However, the bias and efficiency of estimated performance measures, and optimal designs for moderately sized cohorts, are areas in need of further research.
This research was supported by NIH grant CA094069. I am grateful to Joseph B. Keller for useful discussions, to Nicole Ng and Jerry Halpern for help with the calculations, and to the reviewers and an Associate Editor for comments that greatly improved earlier versions of the manuscript.
The risks shown in panels (a) and (b) of Plate 2 are determined by four covariates z0, z1, z2, z3, with z0 coded as –1, 0, 1, and zj coded as 0, 1, j = 1, 2, 3. These covariates determine risk according to the rule
The distributions f (p) in the two panels are determined by the relation (1), where the covariate distribution . Here τ (z0) = 0.8, 0.1, 0.1, for z0 = –1, 0, 1, respectively. The value α = 0.2 gives the distribution for Population A and α = 0.8 gives the one for Population B. The mean risk is 10% and the variance is 7.2 × 10–4 (1 + 3α)3.
Panels (c) and (d) of Plate 2 show outcome prevalences within population subgroups assigned common risks based on two of the four covariates: z0, z1. These prevalences are
Panels (e) and (f) of Plate 2 give outcome prevalences in the seven subgroups determined by expanding a risk model based on z0, z1 (Model 1) to include z2. These prevalences are
To see that the variance of any risk model is bounded above by the variance σ2 of the perfect model, we use the standard decomposition of a variance into the expectation of a conditional variance plus the variance of a conditional expectation:

σ2 = varP [p] = ER {varP|R [p | r]} + varR {EP|R [p | r]} = ER {varP|R [p | r]} + varR [π (r)] ≥ varR [π (r)],

which implies equation (11).
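The bound in equation (11) can be checked on a small hypothetical population: grouping individuals and replacing their true risks by group mean risks can only reduce the variance. The risks and grouping below are illustrative only.

```python
import statistics

# Hypothetical true risks for ten individuals (the "perfect model")
p = [0.01, 0.02, 0.02, 0.05, 0.05, 0.08, 0.10, 0.15, 0.20, 0.32]
# A coarser model assigns each individual to one of two risk groups
groups = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

sigma2 = statistics.pvariance(p)  # variance of the perfect model

# model variance: variance of group mean risks, weighted by group size
n = len(p)
mu = statistics.fmean(p)
model_var = 0.0
for g in set(groups):
    members = [x for x, gg in zip(p, groups) if gg == g]
    model_var += (len(members) / n) * (statistics.fmean(members) - mu) ** 2

print(model_var <= sigma2)  # -> True: the model variance never exceeds sigma2
```

The gap sigma2 − model_var is the averaged within-group variance of true risks, i.e., the precision lost by grouping.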
We next show that the variance of a risk model equals the covariance of its grouped outcome prevalences with the individual outcomes. Since ER [π (r)] = π = EY [y], we have

covR,Y [π (r), y] = ER,Y [π (r) y] – π2 = ER [π (r)2] – π2 = varR [π (r)].
We write the left side of (15) as

varR2 [π (r2)] – varR1 [π (r1)] = {varR1,R2 [π (r1, r2)] – varR1 [π (r1)]} – {varR1,R2 [π (r1, r2)] – varR2 [π (r2)]},

where varR1,R2 [π (r1, r2)] is the variance of outcome prevalences in the cross-classified model. We write this variance as

varR1,R2 [π (r1, r2)] = varR1 [π (r1)] + ER1 {varR1, R2|R1 [π (r1, r2) | r1]} = varR2 [π (r2)] + ER2 {varR1, R2|R2 [π (r1, r2) | r2]},

using again the decomposition of a variance into the expectation of a conditional variance plus the variance of a conditional expectation. Substituting these two expressions into the preceding display yields (15).
accuracy (also called calibration) – extent of agreement between assigned risks and outcome prevalences among
those assigned the same risk.
area under ROC curve (AUC) - see concordance
attribute diagram - plot of outcome prevalences versus assigned risks. If a model is perfectly calibrated, these
points lie on the diagonal line y=x.
bias - square root of the mean-squared difference between assigned and true risks.
Brier score - mean squared difference between outcomes and assigned risks.
calibration - see accuracy
concordance (also called AUC or C-statistic) - probability that the assigned risk of a randomly selected
individual who develops the outcome exceeds that of a randomly selected individual who does not.
cross-classified (CC) model - obtained by cross-classifying individuals into cells according to the risks assigned
by each of two models, and assigning each member of a cell the outcome prevalence among those in that cell.
deterministic outcome – one that occurs with probability one in some members of a population and
probability zero in the remaining members.
Hosmer-Lemeshow (HL) statistic - standardized sum of squared differences between assigned risks and
outcome prevalences. Under the null hypothesis that a model is perfectly calibrated, the statistic has an
approximate chi-squared distribution.
integrated discrimination - the square of the PO correlation coefficient, which equals the fraction R2 of the
Bernoulli outcome variance explained by the model.
integrated discrimination improvement (IDI) - difference in squared PO correlation coefficients between two
models when one is obtained from the other by including additional covariates.
model variance - variance of outcome prevalences across risk groups of the model.
net reclassification index (NRI) – probability that an expanded model correctly reclassifies a person's risk
(relative to outcome occurrence) minus the corresponding probability of an incorrect classification.
outcome prevalence – expected proportion of individuals in a group who develop the outcome, equal to the
mean risk of individuals in the group.
perfect model – assigns each individual the risk that would result if we knew all risk-determining covariates
and their joint effects on outcome probability.
precision (also called discrimination, resolution) – extent to which a model assigns different risks to individuals
with substantially different true risks.
precision loss – difference between Bernoulli variance of outcomes and the model variance.
prevalence-outcome (PO) correlation coefficient - correlation between actual outcomes and outcome
prevalences in assigned risk groups of a model.
reclassification calibration statistic – Hosmer-Lemeshow test statistic applied to the risk groups determined by
cross-classifying individuals according to risks assigned by two models.
receiver operating characteristic (ROC) curve - plot of points (x(r), y(r)) as r varies from 0 to 1, where x(r) is
the probability that the model assigns a risk exceeding r to individuals who do not develop the outcome, and
y(r) is the corresponding probability for those who do develop the outcome.
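To make two of the glossary definitions concrete, the following sketch computes the Brier score and the concordance (AUC) for hypothetical assigned risks and observed outcomes:

```python
# Hypothetical assigned risks and observed 0/1 outcomes for eight individuals
risks    = [0.05, 0.10, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80]
outcomes = [0,    0,    1,    0,    0,    1,    1,    1]

# Brier score: mean squared difference between outcomes and assigned risks
brier = sum((y - r) ** 2 for y, r in zip(outcomes, risks)) / len(risks)

# Concordance (AUC): probability that a randomly selected case's assigned risk
# exceeds that of a randomly selected non-case, counting ties as 1/2
cases = [r for r, y in zip(risks, outcomes) if y == 1]
noncases = [r for r, y in zip(risks, outcomes) if y == 0]
pairs = [(c, n) for c in cases for n in noncases]
auc = sum(1.0 if c > n else 0.5 if c == n else 0.0
          for c, n in pairs) / len(pairs)

print(round(brier, 4), round(auc, 3))  # -> 0.1891 0.844
```

Note that the Brier score rewards both calibration and precision, whereas the concordance depends only on the ranking of the assigned risks.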