Among 4 elderly North American populations we found a consistent performance ranking of comorbidity scores in predicting 1-year mortality using claims data. This finding is reassuring for researchers who want to select the best performing standard score to control for comorbidity. Based on these data, the performance ranking can be generalized with greater certainty to other large general populations aged 65 years or older, or to large subgroups thereof, such as cardiovascular patients.
Comorbidity scores are useful because they are easy to apply and save time and resources, a major issue when analyzing massive health care databases and testing multiple hypotheses. They can also increase the efficiency of statistical inference, which may become an issue in claims data when analyzing small population subgroups or when comorbidities are modeled as time-varying covariates in longitudinal studies. However, adjusting for a score should not be regarded as successfully controlling for all confounding caused by comorbidity,37
because even scores with improved performance impose a functional relation between comorbidities and outcome. The more narrowly a subpopulation is defined, the more likely it is that a score developed for a more general population will perform poorly. Scores are useful in preliminary analyses to indicate the direction and magnitude of confounding, which can guide decisions about further adjustment. It remains unclear how much additional confounding can be controlled by using traditional multivariate modeling techniques to adjust for comorbidity.
We found that diagnosis-based scores consistently performed better than medication-based scores at predicting future mortality. This is consistent with earlier findings that sicker patients are less likely to be treated for comorbid conditions.38
In particular, medications with some preventive effects, such as oral antidiabetic agents39
or lipid-lowering drugs,40
are less frequently prescribed in very sick patients, causing them to appear artificially “healthier” in medication-based scores. Another limitation of medication-based scores is that their list of drugs must be updated regularly as new drugs and drug classes become available.
In earlier studies, we had found that the number of distinct medications received during the previous year was a better predictor of mortality than the CDS-1.16
This finding held consistently across our 4 large study populations. It is not surprising that the Ghali score did not perform as well in our sample of elderly beneficiaries as in the original publication, because its weights were developed in patients undergoing bypass surgery.1,12
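The simple medication count discussed above can be computed directly from pharmacy dispensing records. The following sketch is purely illustrative (the data layout and function name are hypothetical, not taken from the paper): it scores each patient by the number of distinct drugs dispensed during the one-year baseline period.

```python
# Illustrative sketch, not the authors' code: comorbidity scored as the
# number of distinct medications dispensed per patient in the prior year.
from collections import defaultdict

def distinct_drug_counts(dispensings):
    """dispensings: iterable of (patient_id, drug_code) pairs covering
    the one-year baseline period. Returns patient_id -> distinct count.
    Repeat fills of the same drug are counted once."""
    drugs = defaultdict(set)
    for patient_id, drug_code in dispensings:
        drugs[patient_id].add(drug_code)
    return {pid: len(codes) for pid, codes in drugs.items()}

# Toy dispensing records (hypothetical drug codes).
claims = [("A", "metformin"), ("A", "metformin"), ("A", "lisinopril"),
          ("B", "warfarin")]
print(distinct_drug_counts(claims))  # {'A': 2, 'B': 1}
```

In practice the drug identifier would be a dispensing code (e.g., a national drug code mapped to a generic entity), and the choice of granularity affects the count.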
Numeric differences in the performance between some scores were small because the c-statistic (or area under the ROC curve) has a limited sensitivity to detect additional improvements in prediction once a certain level is reached.41
Nevertheless, if researchers have the option to select one score, we would recommend choosing the numerically best performing score even if the absolute difference in c is small. In practice, often not all desirable data sources (5-digit ICD-9-CM diagnostic data plus pharmacy dispensing data) are available. Based on our ranking, researchers can pick the next best alternative and discuss the increased potential for residual confounding compared with the better performing scores.
A c of 0.77 means that, among all possible pairs in which one individual died and one lived, the model correctly attributed a higher risk of death to the person who died 77% of the time. It has been shown how predictive validity can be translated into confounder adjustment.16
From several examples42–44 and textbooks,35 it appears that a c above 0.7 indicates acceptable and a c above 0.8 excellent predictive performance; however, the meaningfulness of any level of performance depends on what is otherwise achievable using other methods. Large investments, such as additional data and analyses, yield only small numeric gains in c
above 0.75. Whether those gains are worth their price depends on the benefits of a “truer” analysis and the costs of error, issues that are unique to each problem studied.
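The pairwise interpretation of the c-statistic given above can be made concrete with a short computation. This sketch is illustrative only (the cohort data are invented, not from the study): it enumerates every (died, survived) pair and counts the fraction in which the model assigned the higher predicted risk to the person who died, with ties counted as one half.

```python
# Illustrative sketch of the c-statistic (area under the ROC curve)
# computed by direct pairwise comparison, as described in the text.
from itertools import product

def c_statistic(risks, died):
    """Fraction of (died, survived) pairs in which the higher predicted
    risk went to the person who died; tied risks count as 0.5."""
    dead  = [r for r, d in zip(risks, died) if d]
    alive = [r for r, d in zip(risks, died) if not d]
    pairs = list(product(dead, alive))
    score = sum(1.0 if rd > ra else 0.5 if rd == ra else 0.0
                for rd, ra in pairs)
    return score / len(pairs)

# Hypothetical cohort: predicted 1-year mortality risks and observed deaths.
risks = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
died  = [1,   0,   1,   0,   0,   0]
print(c_statistic(risks, died))  # 0.875: 7 of 8 pairs are concordant
```

With 2 deaths and 4 survivors there are 8 pairs; the one discordant pair (risk 0.6 for a death versus 0.8 for a survivor) lowers c below 1. For large cohorts the same quantity is computed more efficiently from ranks rather than by enumerating pairs.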
The data for our study are now several years old; however, we are not aware of important changes in diagnostic coding for Medicare patients or changes to the structure and interpretation of relevant data fields. In the unlikely event that there were significant changes, they would affect the absolute performance in predicting mortality (c-statistic) but should not change the ranking in any systematic way.
Our study was limited to predicting mortality, the ultimate outcome of care. However, the ranking of scores may be different when predicting other health care outcomes, such as annual expenditures or number of services. The study was further limited to health care utilization databases from 3 states/provinces that were available to the authors and originally requested for other purposes. Although this selection may not be representative of either country, it provides sufficient variability in patient characteristics, health care systems, and claims recording practices. The Pennsylvania database had more extensive coding of diagnoses related to ambulatory visits and services, and the British Columbia data had more discharge diagnoses available. This may have affected the absolute level of performance, which is why a relative ranking is the most valid approach to comparing the performance of scores across populations.
We recommend that investigators use these performance data as one important factor when selecting a comorbidity score for epidemiologic analyses of health care utilization data. The Charlson-based Romano and Deyo scores using published weights, combined with a simple count of the distinct prescription drugs received during the past year, appear to be well-performing comorbidity measures for epidemiologic studies.