Dr. Nancy R. Cook, Division of Preventive Medicine, Brigham and Women’s Hospital, 900 Commonwealth Avenue East, Boston, MA 02215
Dr. Paul M Ridker, Division of Preventive Medicine, Brigham and Women’s Hospital, 900 Commonwealth Avenue East, Boston, MA 02215
Models for risk prediction are widely used in clinical practice to risk stratify and assign treatment strategies. The contribution of new biomarkers has largely been based on the area under the receiver operating characteristic curve, but this measure can be insensitive to important changes in absolute risk. Methods based on risk stratification have recently been proposed to compare predictive models. These include the reclassification calibration statistic, the net reclassification improvement (NRI), and the integrated discrimination improvement (IDI). This work demonstrates the use of reclassification measures, and illustrates their performance for well-known cardiovascular risk predictors in a cohort of women. These measures are targeted at evaluating the potential of new models and markers to change risk strata and alter treatment decisions.
Risk prediction equations are used in a variety of fields for risk stratification and to determine cost-effective and appropriate courses of treatment. The Framingham risk score, for example, has been used by the Adult Treatment Panel III (ATP III) (1) in guidelines for use of cholesterol-lowering therapy. Whether new risk predictors can add to a score in terms of clinical utility is an important question in many areas of research.
Traditionally, risk models have been evaluated using the area under the receiver operating characteristic curve (2), but this has been criticized as being an insensitive measure in comparing models (3), and as having little direct clinical relevance (4). New methods have recently been proposed to evaluate and compare predictive risk models. These are based primarily on stratification into clinical categories based on risk, and attempt to assess the ability of new models to more accurately reclassify individuals into higher or lower risk strata (5).
Since its first description in 2006 (6), much interest has been generated in reclassification, and, though the approach is still in its infancy, there have been further methodologic developments (7-9). Researchers in the fields of breast cancer (10), diabetes (11, 12), and genetics (12-14), as well as clinical cardiology (15-18), have published papers using these techniques. The current paper is intended as a guide to understanding this research, including the strengths, known limitations, and differences between the various new methods. We apply these to known predictors of cardiovascular disease in a cohort of women to describe how the new methods perform relative to more traditional ones.
Data are from the Women’s Health Study, a large-scale nationwide cohort of US women aged 45 years and older, who were free of cardiovascular disease (CVD) and cancer at study entry beginning in 1992 (19). Women were followed annually for the development of CVD, with an average follow-up of 10 years through March 2004. All reported CVD outcomes, including myocardial infarction (MI), ischemic stroke, coronary revascularization procedures, and deaths from cardiovascular causes, were adjudicated by an endpoints committee after medical record review. During follow-up 766 cardiovascular events occurred. All study participants provided written informed consent, and the study protocol was approved by the institutional review board at the Brigham and Women’s Hospital in Boston, MA.
The baseline characteristics of the WHS sample have been described previously (20). Baseline blood samples were assayed for total, high-density lipoprotein cholesterol, and low-density lipoprotein cholesterol with direct-measurement assays (Roche Diagnostics, Basel, Switzerland), and for C-reactive protein with a validated, high-sensitivity assay (Denka Seiken, Tokyo, Japan). Women eligible for the current analysis had adequate baseline plasma samples and complete ascertainment of exposure data of interest, including age, blood pressure, current smoking, diabetes, and parental history of MI prior to age 60 (n=24,558); these women were also used in the development and assessment of the Reynolds Risk Score for women (20).
Models were fit using Cox proportional hazards models for CVD risk. Predictors included components of the Framingham risk score (age (years), systolic blood pressure (SBP, mm Hg), current smoking (yes/no), and total and high-density lipoprotein cholesterol (mg/dL)), as well as additional risk predictors included in the Reynolds Risk Score (hemoglobin A1c (%) among diabetics only, high-sensitivity C-reactive protein (mg/L), and parental history of MI before age 60 (yes/no)), all assessed at baseline. The natural logarithm transformation was used for SBP, total and high-density lipoprotein cholesterol, and C-reactive protein to linearize the relationship with outcome. We compared the full model to one without each of the risk predictors in turn, but including all other factors. Predicted probabilities were estimated as of eight years of follow-up, and observed rates were based on the Kaplan-Meier survival estimates at eight years. All rates were extrapolated to ten years for presentation.
Traditional measures of fit include measures of discrimination, or the accurate separation into cases and non-cases, measures of calibration, or how well the predicted probabilities compare to the observed (model-free) estimates, and global measures, combining both. These criteria can be assessed for binary outcomes, such as from logistic models, or for survival outcomes, such as from the Cox model. These will be illustrated for survival data in the example data, though an important limitation to some of the measures is that they do not currently incorporate censored data.
First, only predictors which are statistically significant, using, for example, a likelihood ratio test, are typically used in predictive models. Overall model fit can be assessed using Nagelkerke’s R2 (see Glossary), which is analogous to the percent of variation explained for linear models, and compared using the Bayes information criterion (see Glossary), a function of the log likelihood with an added penalty for the number of parameters. The latter tends to select parsimonious models. Discrimination is usually assessed using the c-statistic (see Glossary), or the area under the receiver operating characteristic curve. The c index is an analogous measure that incorporates censored data (3). Calibration within categories can be assessed using the Hosmer-Lemeshow goodness-of-fit statistic (see Glossary) (21), with categories formed by deciles or by intervals of risk (e.g., 0-<2%, 2-<4%, etc., up to 18%+).
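As an illustration of the discrimination measure described above, the following is a minimal Python sketch of the c-statistic for a binary outcome. It assumes complete follow-up (no censoring), so it corresponds to the area under the receiver operating characteristic curve rather than to the c index for survival data. The function name and the toy data are hypothetical and are not taken from the Women's Health Study.

```python
def c_statistic(risks, events):
    """Proportion of case/non-case pairs in which the case has the
    higher predicted risk; tied pairs count as one half."""
    case_risks = [r for r, e in zip(risks, events) if e == 1]
    control_risks = [r for r, e in zip(risks, events) if e == 0]
    if not case_risks or not control_risks:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for rc in case_risks:
        for rn in control_risks:
            if rc > rn:
                concordant += 1.0
            elif rc == rn:
                concordant += 0.5
    return concordant / (len(case_risks) * len(control_risks))


# Toy data (hypothetical): predicted risks and event indicators.
toy_risks = [0.02, 0.04, 0.08, 0.15, 0.03, 0.12]
toy_events = [0, 0, 1, 1, 0, 0]
toy_c = c_statistic(toy_risks, toy_events)
print(round(toy_c, 3))
```

The pairwise loop is quadratic in the sample size, which is acceptable for a sketch but would be replaced by a rank-based calculation in a cohort of this size.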
Table 1 shows the overall measures of fit for the full model and the model leaving out each risk predictor one at a time in the Women’s Health Study data. All variables were highly significant statistically as indicated by the likelihood ratio chi-square test. As expected, the R2 was highest and the BIC was lowest in the full model. The c-statistic for the model without age was 0.76; those for all other models ranged from 0.79 up to 0.80 for the full model. The differences in c were 0.01 or less for all variables except age, but were statistically significant for all except CRP and family history. When the optimism of this measure was assessed using 100 bootstrap samples (3), the value of the c index was generally decreased by 0.002 or less. The Hosmer-Lemeshow tests within deciles or 2% risk intervals (i.e., 0-<2%, 2-<4%, etc.) generally demonstrated adequate fit, although when the 2% risk intervals were used, some deviation from fit was suggested.
Risk reclassification for single factors can be examined using models with and without each risk factor in turn; that is, comparing a model without a given risk factor to the full model. For CVD, relevant strata are 0-<5%, 5-<10%, 10-<20% and >=20% ten-year risk. Table 2 illustrates the risk reclassification for models with and without SBP, but including all other risk factors. The model without SBP categorized 86% of women into the lowest risk group, with a 10-year risk of <5%, 10% into the 5-<10% risk stratum, 3% into the 10-<20% risk stratum, and 1% at 20% or higher risk. The same was approximately true for the model including SBP, such that the ‘marginal’ proportions were very similar.
A continuous analog of this table, which plots the predicted values from both models, along with the category cut points, is provided in Figure 1. The figure plots the values of the logarithm base 10 for the predicted risks from the two models, and shows the spread and difference in these, with the diagonal line denoting the line of identity. Ideally more cases will be above than below the line of identity. The dashed lines indicate the risk strata. The striated appearance is due to use of categories for SBP, here in 9 categories of 10 mmHg from <110 mmHg to 180 or more mmHg. The lines show by how much the predicted values can change when SBP increases or decreases by 10 mmHg units.
The overall percent reclassified gives some indication of how many individuals would change risk categories, and possibly treatment decisions, under the new model. Of the 24,558 women, 2,022 (8%) were classified into different risk strata. The overall percent, however, is heavily influenced by the incidence of disease in the population. In the WHS, the majority of women were in the lowest category under both models. Those who were in the intermediate categories may be more relevant clinically, and demonstrate more shift in risk category. For example, of those at 5-<10% risk in the model without SBP, 40% were reclassified into higher or lower categories. Of those at 10-<20% risk, 36% were reclassified.
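The cross-classification and the percent reclassified described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cut points are the 5%, 10%, and 20% ten-year risk thresholds named in the text, while the function names and the toy predicted risks are hypothetical.

```python
from bisect import bisect_right

# Ten-year risk cut points from the text: <5%, 5-<10%, 10-<20%, >=20%.
CUTS = [0.05, 0.10, 0.20]


def stratum(risk, cuts=CUTS):
    """Index of the risk stratum (0 = lowest); a risk equal to a cut
    point falls into the higher stratum, matching '5-<10%'."""
    return bisect_right(cuts, risk)


def reclassification_table(old_risks, new_risks, cuts=CUTS):
    """Rows: stratum under the old model; columns: under the new model."""
    size = len(cuts) + 1
    table = [[0] * size for _ in range(size)]
    for old, new in zip(old_risks, new_risks):
        table[stratum(old, cuts)][stratum(new, cuts)] += 1
    return table


def percent_reclassified(table):
    """Overall percent of individuals off the diagonal."""
    total = sum(sum(row) for row in table)
    moved = total - sum(table[i][i] for i in range(len(table)))
    return 100.0 * moved / total


def percent_reclassified_by_row(table):
    """Percent reclassified within each old-model stratum."""
    result = []
    for i, row in enumerate(table):
        n = sum(row)
        result.append(100.0 * (n - row[i]) / n if n else None)
    return result


# Toy predicted risks (hypothetical) from models without and with a marker.
old = [0.03, 0.04, 0.06, 0.07, 0.12, 0.25]
new = [0.02, 0.06, 0.04, 0.08, 0.21, 0.30]
toy_table = reclassification_table(old, new)
overall = percent_reclassified(toy_table)
by_row = percent_reclassified_by_row(toy_table)
```

Reporting the row-wise percentages alongside the overall figure reflects the point made above: when most individuals sit in the lowest stratum, the overall percent can mask substantial movement in the intermediate strata.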
More important for model fit than the simple percent reclassified, however, is a comparison of observed and expected rates of disease within each cross-classified category. This determines whether individuals are reclassified correctly, or whether the changes are due to chance. Observations in a reclassified cell are considered ‘correctly’ reclassified if the observed rate is closer to the new than to the old risk stratum. For example, 696 women were reclassified from <5% to 5-<10% 10-year risk. The observed 10-year risk based on a Kaplan-Meier estimate for these 696 women was 6.8%, which falls into the 5-<10% category. The average estimated risk for these women from the model without SBP was 4.0%, while that from the model with SBP was 6.1%, which is closer to the observed risk of 6.8%. Overall, 2,022 women were reclassified; 2,009 of these fell into cells with at least 20 women, for whom the observed rate could be computed. Of these 2,009 women, 1,932 (96%) were reclassified correctly.
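One way to make the 'correctly reclassified' rule concrete is sketched below in Python. The text does not spell out how closeness to a stratum is measured, so this sketch assumes it is the distance from the observed rate to the nearest boundary of the stratum (zero if the observed rate lies inside it). That operational definition, the function names, and the stratum bounds as coded are assumptions for illustration only.

```python
# Stratum bounds as (lower, upper) with the upper bound exclusive;
# these follow the <5%, 5-<10%, 10-<20%, >=20% strata in the text.
BOUNDS = [(0.0, 0.05), (0.05, 0.10), (0.10, 0.20), (0.20, 1.0)]


def distance_to_stratum(rate, bounds):
    """Distance from an observed rate to a stratum (assumed definition)."""
    low, high = bounds
    if rate < low:
        return low - rate
    if rate >= high:
        return rate - high
    return 0.0


def correctly_reclassified(observed_rate, old_stratum, new_stratum,
                           strata=BOUNDS):
    """True if the observed rate is closer to the new stratum than to
    the old one."""
    d_new = distance_to_stratum(observed_rate, strata[new_stratum])
    d_old = distance_to_stratum(observed_rate, strata[old_stratum])
    return d_new < d_old


# The cell discussed in the text: moved from <5% to 5-<10%,
# with an observed 10-year risk of 6.8%.
cell_ok = correctly_reclassified(0.068, old_stratum=0, new_stratum=1)
```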
Observed and average predicted rates, for cells with at least 20 observations, can be compared based on a chi-squared goodness-of-fit test within reclassified categories for each model separately (9). This is simply the familiar Hosmer-Lemeshow goodness-of-fit statistic, but applied to reclassified categories, and we refer to it as the Reclassification Calibration Statistic (see Glossary). It is calculated as
$$X^2 = \sum_{k=1}^{K} \frac{(O_k - n_k \bar{p}_k)^2}{n_k \bar{p}_k (1 - \bar{p}_k)}$$
where $n_k$ is the number in cell $k$, $O_k$ is the observed number of events in cell $k$, and $\bar{p}_k$ is the average predicted risk in cell $k$ for the model under consideration. Survival data can be incorporated by using the observed events and predicted risk as of a given time, such as 10 years. The Kaplan-Meier estimate of the observed risk can be used to accommodate censored data. The statistic follows an approximate chi-square distribution with K-2 degrees of freedom, where K is the number of cells with at least 20 observations in the table. In Table 2, K=11, and the degrees of freedom is 9. As with the usual Hosmer-Lemeshow test, a significant result indicates poor fit. The test for SBP found that the model without SBP suffered from a strong lack of fit (X2 = 68.3, p<0.001). That for the model with SBP was X2 = 22.9 (p=0.006), which still indicated some lack of fit but to a much lesser extent.
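The reclassification calibration statistic can be sketched in Python as follows. The sketch assumes the usual Hosmer-Lemeshow form, in which each cell contributes the squared difference between observed and expected events divided by n times the average predicted risk times one minus that risk. With censored data, the observed count would be taken as n times the Kaplan-Meier risk estimate in the cell. The function name and the cell values are hypothetical, and no p-value is computed, since that would require a chi-square distribution function from a statistics library.

```python
def reclassification_calibration(cells, min_size=20):
    """Chi-square statistic and degrees of freedom over the cells of a
    reclassification table.

    cells: iterable of (n, observed_events, mean_predicted_risk),
           one tuple per cross-classified cell.
    Cells with fewer than min_size observations are skipped, as in
    the text.
    """
    statistic = 0.0
    used = 0
    for n, observed, mean_risk in cells:
        if n < min_size:
            continue
        expected = n * mean_risk
        variance = n * mean_risk * (1.0 - mean_risk)
        statistic += (observed - expected) ** 2 / variance
        used += 1
    # Degrees of freedom: number of cells used minus 2.
    return statistic, used - 2


# Hypothetical cells: (n, observed events, average predicted risk).
toy_cells = [
    (100, 5, 0.04),
    (50, 6, 0.10),
    (30, 6, 0.25),
    (10, 3, 0.20),  # fewer than 20 observations, so excluded
]
toy_stat, toy_df = reclassification_calibration(toy_cells)
```

The statistic is computed for each model separately on the same set of cells, so the two values (for example, with and without SBP) can be compared against the same degrees of freedom.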
Table 3 examines risk reclassification from the initial reduced model eliminating each predictor individually compared to the full model. The overall percent reclassified ranged from 3% for models with and without parental history of MI up to 13% for models with and without age. The percents reclassified within the intermediate risk categories of 5-<10% and 10-<20% were much higher, ranging from at least 13% to 62% for age, suggesting more substantial changes within these risk strata. For each model comparison, over 95% of those reclassified were reclassified correctly when the variable was included. In comparing the observed to expected rates, the reclassification calibration statistic, a X2 statistic, showed significant lack of fit in models excluding each variable. While the full model sometimes demonstrated a lesser degree of lack of fit within these cross-classified categories, the full model provided better fit to the observed rates in each comparison. Thus, each of these variables improved the fit of the model to the observed rates of cardiovascular disease.
Some other proposed measures of improvement include the net reclassification improvement (NRI) (7) (see Glossary), and the integrated discrimination improvement (IDI) (7) (see Glossary). The NRI assesses risk reclassification and is the difference in proportions moving up and down risk strata among cases versus controls, i.e. those who did or did not develop the disease over follow-up, or
$$\mathrm{NRI} = [\Pr(\text{up} \mid \text{case}) - \Pr(\text{down} \mid \text{case})] - [\Pr(\text{up} \mid \text{control}) - \Pr(\text{down} \mid \text{control})]$$
The NRI is similar to the simple percent reclassified, but it distinguishes movements in the correct direction (up for cases and down for non-cases). Ideally the predicted probabilities would move higher (up a category) for cases and lower (down a category) for controls. The NRI can be rearranged to reflect improvement in both cases and controls as follows:
$$\mathrm{NRI} = [\Pr(\text{up} \mid \text{case}) - \Pr(\text{down} \mid \text{case})] + [\Pr(\text{down} \mid \text{control}) - \Pr(\text{up} \mid \text{control})] = \mathrm{RI}_{\text{cases}} + \mathrm{RI}_{\text{controls}}$$
where RI stands for relative improvement. The NRI is then the sum of improvements for cases and controls.
The data representation for SBP for cases and controls separately in the Women’s Health Study is also shown in Table 2. In this cohort study ‘controls’ were defined to be those who did not develop disease as of 8 years of follow-up. Of the cases, 38+36+1+24 = 99 (17.6%) correctly moved up a risk category and 22+7+11 = 40 (7.1%) incorrectly moved down when adding SBP to the model, resulting in a relative improvement for cases (percents moving up minus moving down) of 10.5%. For the non-cases, 821 (3.5%) correctly moved down whereas 992 (4.2%) incorrectly moved up, yielding an overall change of -0.7%, showing that there was a slight upward movement among non-cases also, or a worsening of classification for non-cases. The NRI is the sum of the two, or 9.8%. This means that compared to controls, cases were almost 10 percent more likely to move up a category than down. The results for other variables in the Women’s Health Study are shown in Table 3. The NRI was highest for age, at 19.5%, and ranged down to 3.2% for total cholesterol. All values were statistically significant in these data.
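The arithmetic in the preceding paragraph can be reproduced with a short Python sketch. The counts of women moving up and down are those quoted above for SBP, and the 560 cases are the events observed by eight years. The number of non-cases is not stated directly in the text; the value 23,611 used here is an assumption, obtained as 24,558 minus 560 cases minus 387 women censored before eight years. The function name is hypothetical.

```python
def net_reclassification_improvement(up_cases, down_cases, n_cases,
                                     up_controls, down_controls,
                                     n_controls):
    """Return (RI for cases, RI for controls, NRI) as proportions."""
    # Cases should move up: proportion up minus proportion down.
    ri_cases = (up_cases - down_cases) / n_cases
    # Controls should move down: proportion down minus proportion up.
    ri_controls = (down_controls - up_controls) / n_controls
    return ri_cases, ri_controls, ri_cases + ri_controls


# SBP example from the text. N_CONTROLS is an assumed value derived as
# 24558 - 560 - 387; it is not reported explicitly in the paper.
N_CASES = 560
N_CONTROLS = 24558 - 560 - 387

ri_cases, ri_controls, nri = net_reclassification_improvement(
    up_cases=99, down_cases=40, n_cases=N_CASES,
    up_controls=992, down_controls=821, n_controls=N_CONTROLS,
)
print(f"RI cases {ri_cases:.1%}, RI controls {ri_controls:.1%}, "
      f"NRI {nri:.1%}")
```

Returning the two components separately, rather than only their sum, keeps visible the pattern noted above: the improvement for SBP comes from the cases, while classification of non-cases worsens slightly.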
The IDI is the difference in Yates slopes between two models, where the Yates or discrimination slope is the mean difference in predicted probabilities between cases and controls. The IDI is defined as
$$\mathrm{IDI} = (\bar{\hat{p}}_{\text{new, cases}} - \bar{\hat{p}}_{\text{new, controls}}) - (\bar{\hat{p}}_{\text{old, cases}} - \bar{\hat{p}}_{\text{old, controls}})$$
where $\hat{p}$ is the predicted probability and an overbar denotes its average within the indicated group. The terms in parentheses are the Yates slopes for the two models; ideally we would like the cases to have a higher average probability than the controls. The difference in slopes is a measure of improvement in the model. The IDI can also be thought of as a percent of variance explained (8). In the model without SBP, the average predicted probability was 7.4% for the cases and 2.2% for the non-cases, yielding a slope of 5.2%. In the model including SBP, the averages were 7.9% for cases and 2.2% for non-cases, yielding a slope of 5.7%. The IDI for SBP is the difference in these or 0.5% (Table 3). This means that the difference in average predicted probabilities between cases and controls increased by 0.005 when SBP was added to the model. The IDI was 1.8% for the model leaving out age, 1.3% for hemoglobin A1c, and less than 0.6% for all other predictors.
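A minimal Python sketch of the IDI as a difference in Yates slopes is given below. It assumes a binary case/non-case indicator, as in the analyses here where status at eight years was used. The function names and the individual-level toy data are hypothetical. The final lines repeat the SBP calculation using only the group averages quoted in the text.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def yates_slope(risks, events):
    """Mean predicted risk among cases minus that among non-cases."""
    cases = [r for r, e in zip(risks, events) if e == 1]
    controls = [r for r, e in zip(risks, events) if e == 0]
    return mean(cases) - mean(controls)


def idi(old_risks, new_risks, events):
    """Integrated discrimination improvement: new slope minus old."""
    return yates_slope(new_risks, events) - yates_slope(old_risks, events)


# Toy individual-level data (hypothetical).
toy_events = [1, 1, 0, 0]
toy_old = [0.10, 0.06, 0.03, 0.01]
toy_new = [0.12, 0.06, 0.03, 0.01]
toy_idi = idi(toy_old, toy_new, toy_events)

# SBP example using the group averages quoted in the text:
# with SBP, 7.9% (cases) and 2.2% (non-cases);
# without SBP, 7.4% (cases) and 2.2% (non-cases).
sbp_idi = (0.079 - 0.022) - (0.074 - 0.022)
```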
Both the NRI and IDI condition on case-control status, and neither of these measures assess model calibration (7). Because they depend on outcome status, they are not currently available for censored data. Status as of eight years of follow-up was used in these analyses in an ad-hoc fashion, since most women had been followed for at least this length of time, and observations censored prior to eight years of follow-up were excluded. Of the total 766 cardiovascular events, 560 (73%) occurred by eight years of follow-up, and could be used for calculation of the NRI and IDI. An additional 387 women were censored prior to eight years and were excluded from calculation of these measures. This is an important current limitation for these measures.
When reclassification into risk strata is considered, the particular categories used can affect the estimates. For example, Supplement Table 1 (available at www.annals.org) shows the results when three categories were used with cut points of 5% and 20% only (7). The percents reclassified were lower, as would be expected. The value of the NRI, an adjusted percent reclassified, was also lower, and ranged from 14% for age down to 1.5% for parental history of MI. This was reduced further when only two categories (with cut point 10%) were used, with an NRI for age of only 9.2%. The reclassification X2 statistics were also reduced, but so were the corresponding degrees of freedom. In a 4×4 table the number of cells with more than 20 observations is often 10, leading to 8 degrees of freedom; for the 3×3 table, this is usually 7, with 5 degrees of freedom. The values of X2 divided by its expectation (the degrees of freedom), as well as the levels of statistical significance, were relatively similar whether three or four categories were used.
Whenever fit of a new model is evaluated, that found in the derivation data may be too optimistic. Use of a test dataset, bootstrapping, or cross-validation can adjust measures for optimism (3). When 10-fold cross-validation was applied to these data, there was little change in the estimated effects or in the test statistics (data not shown). The NRI for age increased from 19.5% to 20.5%, and that for family history decreased from 3.2% to 2.2%. The reclassification X2 statistics were also relatively similar.
These data illustrate the use of several newly proposed reclassification measures and demonstrate their magnitude and variation for well-known cardiovascular risk factors in a cohort of women. The NRI ranged from 3.2% for parental history of MI before age 60 to almost 20% for age in these data using clinically meaningful risk strata. Other studies have presented similar results for some of these variables (7, 9, 22, 23). Of note, a significant statistical association does not necessarily lead to an improvement in risk stratification. For example, although a polymorphism at the 9p21 gene was associated with increased risk of CVD, it did not improve calibration, and the estimated NRI was negative (13). As noted by Pepe et al (8), testing the IDI is equivalent to testing whether the regression coefficient in a model is equal to zero. It can be represented as a change in R2 or the proportion of variance explained. Whether this can translate to clinical utility thus remains questionable.
The NRI and the IDI both condition on case-control, or later disease, status. As such, they do not provide information on calibration of the estimated risks. As for the receiver operating characteristic curve (5), they do not measure how close the predicted observations fall to the actual probabilities. Alternatively, the reclassification calibration statistic directly compares the observed and predicted probabilities within each cell of the reclassification table, and assesses this calibration directly. In particular, the reclassification calibration statistic for the model excluding the variable of interest examines whether the model without this variable provides adequate fit; it is therefore a useful adjunct to measures of discrimination.
Since the NRI and IDI condition on outcome status, how to assess these measures with survival data is not yet clear. In the current analysis, some women were censored prior to eight years, and these were simply excluded from the calculations. The reclassification calibration measures, however, can use survival analysis to determine the observed rate within each cell while allowing for such censoring. They are thus more readily generalizable to survival or prospective data, particularly when length of follow-up differs for individuals.
A limitation of the NRI and other reclassification measures is that they depend on the particular categories used. The calibration test seems to depend somewhat less on the number of categories since the degrees of freedom adjust for the number of categories. The best choice of risk categories for reclassification tables remains an important question, however. For best interpretation, the choice should have clinical meaning. As Greenland suggests, “predictive values, costs, and cutpoints must be considered together to make informed decisions.” (24) For cost-effectiveness and public policy considerations, categories based on the absolute predicted risk, rather than, say, quantiles, would be of most interest. If a particular category is important, such as a 10% treatment threshold, one may wish to include categories both above and below this. In preventive cardiology, the cut points 5%, 10%, and 20% have been proposed as clinical choices for treatment decisions (1, 25), which ideally should be based on considerations of cost-effectiveness. Net benefit curves (26) or relative utility estimates (27) can help determine whether it is cost-effective to measure a new marker in a population or in a subset at intermediate risk.
Even if clinical or treatment categories are not widely available, reclassification measures, particularly the reclassification calibration statistic and NRI, may be useful in demonstrating the ability of new models and markers to change risk strata and alter treatment decisions. Though their statistical properties are still being explored, these are gaining popularity as other means of comparing the accuracy of absolute risk estimates.
Supported by grants from the Donald W. Reynolds Foundation (Las Vegas, NV) and the Leducq Foundation (Paris, France). The overall Women’s Health Study is supported by grants (HL-43851 and CA-47988) from the National Heart Lung and Blood Institute and the National Cancer Institute, both in Bethesda, MD.
The authors had full access to the data and take responsibility for its integrity. All authors have read and agree to the manuscript as written. All computations were done using SAS 9.1 (SAS Institute Inc., Cary, NC). SAS macros to compute the reclassification measures are available as a web appendix at www.annals.org.
Conflict of Interest Disclosures:
Dr. Ridker reports receiving grant support from AstraZeneca, Novartis, Merck, Abbott, Roche, and Sanofi-Aventis; consulting fees or lecture fees or both from AstraZeneca, Novartis, Merck, Merck-Schering-Plough, Sanofi-Aventis, Isis, Dade Behring, and Vascular Biogenics; and is listed as a coinventor on patents held by Brigham and Women’s Hospital that relate to the use of inflammatory biomarkers in cardiovascular disease, including the use of high-sensitivity C-reactive protein in the evaluation of patients’ risk of cardiovascular disease. These patents have been licensed to Dade Behring and AstraZeneca.