In the present study, we demonstrated the benefits of adjusting surgical mortality rates for statistical reliability. For all three procedures, reliability adjustment greatly reduced apparent variation by eliminating statistical noise. However, when assessing the ability to forecast future performance, the impact of reliability adjustment varied across procedures, with the greatest benefit for the two operations with lower hospital caseloads, pancreatic resection and AAA repair. For these two procedures, mortality rankings based on reliability-adjusted mortality were superior at identifying the “best” hospitals (i.e., those likely to have the lowest mortality in future years). Because most surgical procedures are similar in frequency to pancreatic resection and AAA repair, reliability adjustment would likely improve the accuracy of hospital mortality reporting for most operations.
The importance of reliability adjustment is becoming increasingly recognized. Hofer et al. (1999)
popularized reliability adjustment by elucidating its benefits in profiling physician quality in diabetes care. In another application, Zaslavsky et al. (2000)
described the benefits of applying reliability adjustment to patient satisfaction reports in the Consumer Assessment of Health Plans Survey (CAHPS). Glance et al. (2006)
demonstrated the impact of hierarchical modeling on hospital and surgeon mortality rankings for cardiac surgery in New York State. Consistent with this prior work and other empirical studies, our study demonstrates the value of reliability adjustment. Our present study also adds to this body of work by demonstrating that reliability-adjusted hospital rankings are better at forecasting future performance.
As a result of these and other studies, several organizations are advocating the use of these techniques for quality measurement. The National Quality Forum (2005–2006)
expressed a preference for “hierarchical” modeling in their consensus standards for quality monitoring. Although they do not specifically state a preference for the use of empirical Bayes analysis, this is implied in their statement that this technique is especially good at evaluating quality in small hospitals. The AHRQ (2008)
also encourages similar methods for use with its Inpatient Quality Indicators. The AHRQ website includes downloadable software for creating so-called smoothed estimates of outcomes, which are created using methods similar to those presented in this study.
Reliability adjustment is already seeing real-world application in several quality reporting programs. Perhaps the most visible use of reliability adjustment is the CMS HospitalCompare website (2010), which reports reliability- and risk-adjusted mortality and readmission rates for medical conditions, such as acute myocardial infarction, heart failure, and pneumonia. Massachusetts, which publishes each hospital's mortality rate as part of an annual cardiac surgery report card, recently began emphasizing reliability adjustment (Shahian et al. 2005
). The Society of Thoracic Surgeons (STS), which maintains the largest clinical registry for cardiac surgery, adjusts process and outcome measures for reliability before combining them into a composite quality measure (O'Brien et al. 2007
). However, as we demonstrate in the present study, this technique is even more valuable for less common operations and would likely yield greater benefit if applied outside cardiac surgery.
In the context of public reporting in surgery, reliability adjustment offers clear advantages over traditional approaches. First, as demonstrated in this paper, the excess variation from statistical noise is greatly diminished. Traditionally, imprecision in measuring mortality is addressed with confidence intervals, or by comparison to a benchmark followed by testing for statistically significant differences. Unfortunately, confidence intervals are often misinterpreted and p-values are usually relegated to a footnote. In contrast, reliability adjustment directly addresses this problem by estimating the hospital's "true" mortality. Second, reliability adjustment improves the ability of historical mortality to forecast future performance. The ability to forecast future performance is particularly important for public reporting and value-based purchasing, because decisions made by patients and payers about where to have surgery are based on data from several years earlier (Birkmeyer, Dimick, and Staiger 2006).
Reliability adjustment also has potential disadvantages. In empirical Bayes methods, the mortality is “shrunk” back toward the average mortality, with the degree of shrinkage related to the reliability, or precision, with which mortality is measured (Morris 1983
). Hospitals with low caseloads and unreliable mortality rates are shrunk more toward the average. This technique has very different implications for hospitals on either side of the average. Hospitals with below-average mortality may see their rates rise after reliability adjustment; even hospitals with no deaths will have reliability-adjusted mortality rates greater than zero. Although this seems unfair at first, there is empirical evidence suggesting that low-volume hospitals with no deaths are actually no better, and perhaps worse, than average, the so-called zero-mortality paradox (Dimick and Welch 2008).
For hospitals above average, reliability adjustment tends to reduce apparent mortality. Because reliability is a function of sample size, small hospitals will be "shrunk" more toward the average than larger hospitals. Thus, this technique gives smaller hospitals the benefit of the doubt. Many critics of reliability adjustment correctly point out that this introduces bias, because small hospitals may not truly have average performance. This criticism is backed by a large body of evidence showing a relationship between lower volume and worse outcomes in health care, especially for high-risk surgical procedures (Birkmeyer et al. 2002
). However, there is an alternative approach for reliability adjustment that overcomes this potential bias. This alternative approach uses empirical Bayes techniques, but rather than shrinking back toward the average, the hospital's mortality is shrunk back toward the mortality expected given the hospital's volume (Dimick et al. 2009
). By taking into account the well-known relationship between lower volume and higher mortality, this approach avoids the bias introduced by assuming small hospitals have average performance. The Leapfrog Group, a large group of health care purchasers, has embraced this approach and will feature these measures in the next iteration of their evidence-based hospital referral initiative for high-risk surgery (The Leapfrog Group 2010).
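The alternative shrinkage target can be sketched the same way: the prior mean becomes a function of hospital volume rather than the overall average. The log-volume relationship below is a hypothetical stand-in for a fitted volume-outcome model, not the specification used by Dimick et al. (2009), and its coefficients are invented for illustration.

```python
# Sketch of shrinkage toward a volume-predicted expected mortality rather than
# the grand mean. The prior (mortality declining linearly in log volume) and
# all coefficients are hypothetical, chosen only to illustrate the mechanics.
import math

def expected_rate(n_cases, intercept=0.09, slope=-0.01):
    """Hypothetical volume-outcome prior: expected mortality falls with log volume."""
    return max(intercept + slope * math.log(n_cases), 0.005)

def shrink_to_volume(observed_rate, n_cases, between_var=0.0004):
    """Shrink toward the volume-specific expectation instead of the average."""
    target = expected_rate(n_cases)
    noise_var = target * (1 - target) / n_cases  # binomial sampling variance
    w = between_var / (between_var + noise_var)  # reliability weight
    return w * observed_rate + (1 - w) * target

# A 20-case hospital with zero deaths is now pulled toward the *higher*
# mortality expected at low volume, not toward the overall average.
print(shrink_to_volume(0.00, 20), shrink_to_volume(0.00, 500))
```

Under this prior, two hospitals with identical observed rates but different caseloads are shrunk toward different targets, which is precisely how the approach avoids crediting small hospitals with average performance they may not have.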
In this study, we found that reliability adjustment appeared to improve the ability to identify the "best" but not the "worst" hospitals. We also found much greater movement of hospitals out of the "best" quintile (and toward the middle) with reliability adjustment than out of the "worst" quintile. This finding is explained by the fact that mortality rates for these surgical procedures lie much closer to 0 percent than to 100 percent. As a result, a large number of small hospitals with no deaths cluster in the "best" quintile (zero-mortality hospitals), whereas only a few hospitals with very high mortality sit in the "worst" quintile. We would expect equal movement toward the middle from both tails only if the average mortality were 50 percent. Nonetheless, we believe reliability adjustment remains important for both tails of the distribution. For hospitals in the "worst" quintile, reliability-adjusted mortality is much lower than observed mortality, even though most remain ranked in that quintile; reliability adjustment is therefore important so these hospitals do not overestimate their "true" mortality. Further, reliability adjustment results in a net movement out of the "worst" quintile of a few small hospitals that should not be included there.
This study has several limitations. Because the Medicare population does not account for all patients undergoing these surgical procedures, we likely underestimate the true caseload at each hospital, and our analysis may therefore overestimate the importance of reliability adjustment. The impact of this limitation is likely small, however, because Medicare patients represent a large proportion of those undergoing each of these three procedures, and many existing quality reporting programs create reports based only on Medicare data. Another limitation of this study is the exclusive focus on mortality. Although this outcome is perhaps the most common quality metric in hospital report cards for high-risk procedures, other quality measures such as morbidity, processes of care, and patient satisfaction may be important for other procedures. However, the problem of statistical noise is not unique to mortality and is shared by all quality measures. Thus, the accuracy of other measures could be similarly improved using reliability adjustment.
Numerous stakeholders would benefit from better surgical quality measures. Publicly reported surgical outcomes should be adjusted for reliability to help patients choose the best hospitals, thereby improving their odds of surviving surgery. Quality measures used for value-based purchasing should be reliability adjusted to ensure payers and purchasers are steering patients toward hospitals that truly have superior performance. Finally, outcomes data fed back to hospitals should be adjusted for reliability to optimize the impact of provider-led quality improvement registries. Without reliability adjustment, hospitals may waste resources by responding to spuriously high mortality rates, or be lulled into a false sense of security by spuriously low mortality rates. Reliability adjustment is ready for immediate application and should become standard for reporting mortality and other outcomes.