We found that the median reliability of physician cost profiles, constructed to reflect typical approaches that insurers use, ranged from 0.05 for vascular surgery to 0.79 for gastroenterology and otolaryngology. Overall, the majority of physicians did not have cost profiles that met common thresholds of reliability. In an illustrative two-tiered classification system, one half of internists and two thirds of vascular surgeons were classified inaccurately as lower cost.
Sample size is one of three factors that determine reliability. We aggregated 2 years of data across four health plans that enrolled about 80% of commercially insured persons in Massachusetts to increase the number of potential episodes assigned to physicians. This strategy increased reliability for three of the four plans but reduced reliability slightly for the fourth. The lower reliability for the fourth plan in the aggregate data set, which resulted from higher physician-specific error estimates, might be seen as a reasonable compromise to make in order to achieve improved reliability in the other plans and to increase the potential for producing consistent scores across all plans [4,25].
Would adding more years of data increase reliability and decrease misclassification? We found that doubling the number of episodes for an average family physician would increase reliability for that physician from 0.61 to 0.76 and decrease his or her probability of misclassification from 17% to 15%. This modest change may not be acceptable because multiyear rolling averages make it difficult to rapidly detect improvements.
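The doubling figure above is consistent with the standard Spearman-Brown projection, which can be sketched as follows. This is an illustration rather than the estimation procedure used in the study: it assumes that physician-to-physician variance and per-episode error variance stay fixed as episodes are added, and the function names are hypothetical.

```python
def projected_reliability(current: float, factor: float) -> float:
    """Spearman-Brown projection of reliability when the number of
    episodes is multiplied by `factor`.

    Assumption (not from the study): the between-physician variance and
    the per-episode error variance do not change as episodes are added.
    """
    if not 0.0 < current < 1.0:
        raise ValueError("current reliability must lie strictly between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return factor * current / (1.0 + (factor - 1.0) * current)


def episode_multiplier_for(current: float, target: float) -> float:
    """Multiplier on the episode count needed to move reliability from
    `current` to `target`, under the same fixed-variance assumption."""
    if not (0.0 < current < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1.0 - current) / (current * (1.0 - target))


if __name__ == "__main__":
    # Doubling episodes for a profile with reliability 0.61.
    print(round(projected_reliability(0.61, 2.0), 2))  # 0.76
```

Under the same assumption, a profile with a reliability of 0.05 would need roughly 44 times as many episodes to reach 0.70, which illustrates why adding years of data alone is unlikely to be sufficient for low-reliability specialties.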
On the basis of our findings, we recommend that users of physician cost profiles directly assess reliability instead of relying on proxies of minimum sample size [7,16,26].
This approach will present some implementation challenges. Users will need to agree on a minimum acceptable reliability threshold, such as 0.70. Since only a minority of physicians had profiles that met the 0.70 threshold of reliability, users would have to decide how to classify physicians whose scores did not meet the threshold. Physicians with lower-reliability cost profiles could be classified in a lower tier or they could receive a designation indicating that there was not enough information to assess their performance. Since the surgical specialties in particular appear to have low reliability scores, providing incentives for patients to select lower-cost surgical specialists may have little effect in terms of reducing spending. However, physicians with median cost-profile reliabilities greater than or equal to 0.70 accounted for more than half of total costs across the plans, suggesting that opportunities for cost control still exist among physicians with more reliable scores.
The rates of misclassification for the one illustrative application that we examined were large enough to be cause for concern. Among the physicians who were classified as lower cost, 43% were not actually lower-cost performers, which suggests that there are serious threats to insurance plans’ abilities to achieve cost-control objectives and to patients’ expectations of receiving lower-cost care when they change physicians for that purpose.
Plans may want to consider how they could increase the reliability of cost profiles. Although sample size is a major contributor to reliability, we found that even substantial increases in sample sizes were not adequate to ensure reliability for many specialties. Adding public payers, particularly Medicare, could substantially increase the sample size for some specialties, but because the effects on physician-to-physician variation and on the error variance of the measure are uncertain, reliability might not improve. Episode mix will be difficult to change because it reflects the types of conditions typically managed by physicians in a given specialty. If current efforts to reduce variations in performance are successful, we can expect a decrease in reliability over time. The final option is to develop better measures of cost performance at the physician level. According to our analysis, this is the most promising avenue for further work.
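The interplay among the factors discussed above can be illustrated with a simplified signal-to-noise form of reliability. This is a sketch under stated assumptions, not the model estimated in the study: it assumes a single per-episode error variance shared by all episodes (ignoring episode mix), and the variance values in the example are hypothetical.

```python
def reliability(between_var: float, error_var: float, n_episodes: int) -> float:
    """Simplified reliability of a physician's mean cost score.

    between_var : variance of true cost performance across physicians
    error_var   : per-episode error variance (assumed equal for all
                  episodes, which ignores differences in episode mix)
    n_episodes  : number of episodes attributed to the physician
    """
    if between_var < 0.0 or error_var <= 0.0 or n_episodes <= 0:
        raise ValueError("variances and episode count must be positive")
    noise = error_var / n_episodes  # error variance of the physician's mean
    return between_var / (between_var + noise)


if __name__ == "__main__":
    # Hypothetical values chosen only to show the direction of the effect.
    baseline = reliability(between_var=1.0, error_var=10.0, n_episodes=10)
    narrowed = reliability(between_var=0.5, error_var=10.0, n_episodes=10)
    print(baseline, narrowed)
```

In this form, halving the physician-to-physician variance lowers reliability even though sample size and error variance are unchanged, which is why successful efforts to reduce variation in performance would be expected to decrease reliability.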
Our study has some limitations in terms of its generalizability. We tested reliability with the use of data from a single state and had access only to commercial claims. Although Massachusetts is unique in many ways, we believe that the pattern of results observed here is likely to be repeated in other data sets; this expectation should be tested. We tested only one commercially available software product for the purpose of constructing episodes; other tools may produce different results and should be evaluated.
These findings bring into question both the utility of cost-profiling tools for high-stakes uses, such as tiered health plan products, and the likelihood that their use will reduce health care spending. Consumers, physicians, and purchasers are all at risk of being misled by the results produced by these tools.