Performance measures are widely used to profile primary care physicians (PCPs), but their reliability is often limited by small sample sizes. We evaluated the reliability of individual PCP profiles and whether they can be improved by combining measures into composites or by profiling practice groups.
We performed a cross-sectional analysis of electronic health record data for patients with diabetes (DM), congestive heart failure (CHF), ischemic vascular disease (IVD), or eligible for preventive care services seen by a PCP within a large, integrated healthcare system between April 2009 and May 2010. We evaluated performance on 14 measures of DM care, 9 of CHF, 7 of IVD, and 4 of preventive care.
There were 51,771 patients seen by 163 physicians in 17 clinics. Few PCPs (0 to 60%) could be profiled with 80% reliability using single process or intermediate-outcome measures. Combining measures into single-disease composites improved reliability for DM and preventive care with 74.5% and 76.7% of PCPs having sufficient panel sizes, but composites remained unreliable for CHF and IVD. 85.3% of PCPs could be reliably profiled using a single overall composite. Aggregating PCPs into practice groups (3 to 21 PCPs per group) did not improve reliability in most cases due to little between-group practice variation.
Single measures rarely differentiate between individual PCPs or groups of PCPs reliably. Combining measures into single- or multi-disease composites can improve reliability for some common conditions, but not all. Assessing PCP practice groups within a single healthcare system, rather than individual PCPs, did not substantially improve reliability.
Amidst growing concerns about the quality of American healthcare, many health systems now measure physician performance to inform quality improvement efforts and to increase accountability for achieving quality goals.(1-3) Such initiatives often target primary care providers' (PCPs) care of patients with chronic illness, as this care has the potential to impact a growing proportion of morbidity and cost.(1, 4, 5) These initiatives often use performance measures, such as annual diabetic foot examinations or antiplatelet agent use in patients with known ischemic vascular disease. Each individual measure, which assesses a specific type of care for a single disease, produces a result that reflects a very narrow aspect of a provider's practice. While consumers sometimes value these granular data, composite measures may provide a broader perspective on quality. In addition, for optimal provider profiling, measures should be able to reliably detect meaningful variation between the individuals being compared. If the number of patients is inadequate or the variability between physicians is too low, performance profiles may mislead. The conventional expectation is that quality profiles should achieve at least 80% reliability when judging individuals (such as “grading” hospitals or doctors), and even higher levels are generally recommended when judgments have serious consequences (such as strong financial penalties).(6) Unfortunately, several studies suggest that existing profiling methods often lack this critically important level of reliability.(7-11)
The reliability of a quality profile could be improved with a larger sample size, such as by adding more measures or more patients, or by profiling an entire practice group rather than an individual physician. Reliability can also be improved by better identifying areas of between-physician variation, such as with more accurate quality measures or more reliable data sources. In this study, we use electronic health record (EHR) data to evaluate two possible ways to improve the reliability of performance profiles in a large, integrated healthcare system: combining measures into single-disease or multi-disease composites and combining PCPs into practice groups to increase sample size.(12-15)
We conducted this study within an academic-affiliated healthcare system comprised of 3 hospitals, 40 ambulatory care sites, and more than 1,600 physicians providing over 1.7 million clinic visits per year. There are 163 PCPs and geriatricians, most of whom are full-time clinicians practicing in community settings.
We identified all patients between ages 18 and 75 who were seen in a family medicine, general medicine, medicine/pediatrics, or geriatrics outpatient clinic between April 2009 and May 2010. We attributed patients to a specific PCP if they had been seen by the provider at least twice in the preceding 2 years, including once in the past 13 months. While geriatricians can function as consultants or PCPs, for the purposes of institutional disease registries and feedback reports, patients are assigned to geriatricians as PCPs if they meet the above attribution criteria.
We then limited the population to patients eligible for preventive care services and patients included in registries for diabetes (DM), congestive heart failure (CHF), and ischemic vascular disease (IVD). The registries are designed to measure core quality indicators for physician feedback, internal quality improvement, and reporting to external programs.
Patients with DM were identified by 1 of 3 criteria: (1) 2 diagnoses of diabetes in billing data in an ambulatory care setting in the past 3 years; (2) 1 diagnosis of diabetes in billing data in an acute care setting such as the emergency department or inpatient setting in the past 3 years; or (3) a diagnosis of diabetes in the EHR problem summary list (PSL). Validation of the diagnosis required 1 of the following criteria: (1) diabetes documented in a clinical note; (2) a prescription for hypoglycemic medication (excluding metformin alone); (3) a prescription for diabetic supplies; (4) hemoglobin A1c > 6.4%; or (5) 2 or more blood glucose levels ≥ 200 mg/dL.
Patients with CHF were identified by one of the following 4 criteria: (1) 2 diagnoses of heart failure in billing data in an ambulatory care setting; (2) 1 diagnosis of heart failure in an acute care setting; (3) a diagnosis of heart failure in the EHR PSL; or (4) 1 diagnosis of heart failure in an ambulatory care setting and either evidence of low ejection fraction or heart failure documented in EHR PSL. Criteria 1, 2, and 3 required validation by one of the following: admission in the past 2 years with a principal diagnosis of heart failure, a Brain Natriuretic Peptide (BNP) ≥ 100 in the past 2 years, an ejection fraction (EF) < 40% on echocardiogram or nuclear medicine test, heart failure recorded on the EHR PSL, or a reference to New York Heart Association classification in the EHR within the preceding 2 years. Patients were excluded if they had a history of heart transplant, left ventricular assist device, or congenital heart disease.
Patients with IVD were identified by 1 of the following criteria: inpatient encounter for coronary artery bypass grafting (CABG), percutaneous transluminal coronary angioplasty (PTCA), percutaneous transluminal coronary intervention (PCI), stroke, transient ischemic attack (diagnosed on the in-patient neurology service only), acute myocardial infarction (AMI) or an EHR PSL entry for CABG, PTCA, PCI or AMI. Patients were excluded if they had a history of heart transplant or pulmonary hypertension.
This study was approved by the University of Michigan Institutional Review Board.
We obtained detailed patient-level data including demographics from the health system's EHR and billing records. This included laboratory and immunization data, mammography findings, physical exam findings such as blood pressure and the date of the most recent eye and foot examinations and ejection fraction. We obtained medications including contraindications from the EHR medication summary list.
We identified performance measures from an environmental scan of existing publicly available physician measures (e.g., Healthcare Effectiveness Data and Information Set [HEDIS],(16) Rand Health Quality of Care Assessment Tool,(17) and American College of Cardiology/American Heart Association measures(18, 19)). When possible, we based measures on those currently used within the health system to provide feedback to providers and to inform internal quality improvement initiatives. The measure topics can be found in Table 1 (see Table, Supplemental Digital Content 1, for a more detailed description of the measure criteria). We converted all measures into dichotomous (adherent/not adherent) outcomes for analysis.
We analyzed the data using multilevel logistic regression techniques in Stata version 11.1. This method accounts for the clustered nature of the data, i.e., multiple measures for an individual patient, multiple patients cared for by a single physician, and several physicians practicing together at the same site (PCP group). First, we developed separate models for each measure. We started with unadjusted random intercept models, i.e., “empty” models, to partition variance in performance measures between the different levels of the hierarchy. Next, we added fixed-effect patient-level variables for case-mix adjustment, including patient age, sex, and number of eligible chronic illnesses (0 to 3). Variance estimates from these models allowed us to calculate the intraclass correlation coefficient (ICC). The ICC represents the fraction of total variability attributable to a particular level of the model (patient, PCP, or PCP group). Reliability is the ability of a measure to provide the same result on repeat testing. The ICC at the PCP level accounts for the correlation in patient measurements within an individual PCP as an estimate of the PCP effect on those measurements. The ICC at the PCP group level accounts for the correlation in patient measurements within an individual PCP, and for PCP measurements within a PCP group as an estimate of the PCP group effect on those measurements.
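As a rough illustration of how variance estimates translate into ICCs, the sketch below partitions variance across levels in Python (not the Stata code used for the study). It assumes the latent-variable formulation commonly applied to logistic random-intercept models, in which the patient-level residual variance is fixed at π²/3; the text does not state which formulation was used, and the function name and inputs are hypothetical.

```python
import math

# Assumption: latent-variable formulation for a logistic model, where the
# lowest-level (patient) residual variance is fixed at pi^2 / 3.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def variance_shares(var_pcp, var_group=0.0):
    """Return the fraction of total variance attributable to each level.

    var_pcp   -- estimated random-intercept variance between PCPs
    var_group -- estimated random-intercept variance between PCP groups
    """
    if var_pcp < 0 or var_group < 0:
        raise ValueError("variance components must be non-negative")
    total = var_pcp + var_group + LOGISTIC_RESIDUAL_VARIANCE
    return {
        "pcp": var_pcp / total,
        "pcp_group": var_group / total,
        "patient": LOGISTIC_RESIDUAL_VARIANCE / total,
    }


# Hypothetical variance components, for illustration only.
shares = variance_shares(var_pcp=0.20, var_group=0.05)
print({level: round(share, 3) for level, share in shares.items()})
```

Under this formulation a small random-intercept variance yields a small ICC even when the differences between physicians are real, which is why panel size matters so much for reliability.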
We then developed single-disease composite models using all measures related to a specific disease that had a non-zero case-mix-adjusted PCP-level ICC in the single-measure models. Since physicians vary in the types of patients they see and specific measures are associated with higher or lower physician performance, we included a fixed-effect measure-level variable: the proportion of eligible patients for whom the measure was achieved in the overall population. Finally, we developed an overall multi-disease composite model using all measures included in the single-disease composite models. For all composite measures, we equally weighted each individual measure observation as has been described previously for measures assumed to have equal validity and importance.(20)
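One way to picture the data layout behind the composite models is a long-format table with one row per eligible patient-measure observation, each row counting equally. The Python sketch below attaches to every row the overall proportion of eligible patients who achieved that measure, which corresponds to the measure-level covariate described above. The record fields (`pcp`, `measure`, `adherent`) and the function name are hypothetical, and the sketch only prepares data; it does not fit the multilevel model.

```python
from collections import defaultdict


def add_measure_rate(observations):
    """Attach the population-wide achievement rate of each measure.

    observations -- list of dicts with keys "pcp", "measure", and
                    "adherent" (1 if the measure was met, else 0)
    Returns new dicts with an extra "measure_rate" key; inputs are unchanged.
    """
    met = defaultdict(int)       # adherent observations per measure
    eligible = defaultdict(int)  # eligible observations per measure
    for obs in observations:
        met[obs["measure"]] += obs["adherent"]
        eligible[obs["measure"]] += 1
    return [
        dict(obs, measure_rate=met[obs["measure"]] / eligible[obs["measure"]])
        for obs in observations
    ]


# Hypothetical observations, for illustration only.
rows = add_measure_rate([
    {"pcp": "A", "measure": "foot_exam", "adherent": 1},
    {"pcp": "B", "measure": "foot_exam", "adherent": 0},
    {"pcp": "A", "measure": "a1c_tested", "adherent": 1},
    {"pcp": "B", "measure": "a1c_tested", "adherent": 1},
])
for row in rows:
    print(row)
```

Including the measure rate as a covariate lets the model separate a physician whose panel happens to be eligible for harder measures from one who performs worse on comparable measures.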
We used posterior empirical Bayes estimates from the single- and multi-disease composite models to predict PCP quality scores that reflect their expected performance on an average-difficulty measure for a 65 year-old female patient with 1 comorbidity. This produced case-mix-adjusted PCP profiles that were also “reliability-adjusted” (often referred to as “shrunken” estimates) to account for bias resulting from suboptimal reliability.(7, 21) To compare differences between physician profiles, we report results with confidence intervals of 1.4 standard errors (~83% confidence intervals). This is the level at which overlap of two estimates' confidence intervals indicates that the differences between the results are not statistically significant at the α=0.05 level.(22, 23)
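The choice of 1.4 standard errors can be checked with a short calculation. The Python sketch below assumes two independent estimates with equal standard errors and normally distributed errors, which is a simplification of the cited approach; the function names are hypothetical. Under those assumptions, two intervals of half-width m standard errors just touch when the estimates differ by 2m standard errors, while the difference is significant at level α when it exceeds z√2 standard errors, so m = z/√2.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def overlap_multiplier(alpha=0.05):
    """Half-width, in standard errors, at which non-overlap of two
    equal-SE intervals corresponds to a two-sided test at level alpha."""
    z = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    return z / sqrt(2)


def coverage(multiplier):
    """Nominal coverage of an interval of +/- multiplier standard errors."""
    return 2 * STANDARD_NORMAL.cdf(multiplier) - 1


def intervals_overlap(est_a, est_b, se, multiplier=1.4):
    """True if the two +/- multiplier*se intervals overlap (equal SE assumed)."""
    return abs(est_a - est_b) <= 2 * multiplier * se


print(round(overlap_multiplier(), 2))  # close to 1.4
print(round(coverage(1.4), 2))         # close to the ~83% quoted above
```

When standard errors differ between physicians, as they will with unequal panel sizes, the rule is only approximate.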
We applied the Spearman-Brown prophecy formula to determine the panel size needed to achieve a measurement reliability of 80%. This suggests that 80% of the variation in a profile is due to differences in practice and 20% is due to chance, which is often considered the minimum reliability needed to make decisions about individuals.(6, 7) We reported the percentage of providers who had a panel size that would fulfill this criterion.(6-8, 11)
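The Spearman-Brown relationship can be written as reliability = n·ICC / (1 + (n − 1)·ICC), where n is the number of patients in a panel. Solving for n at a target reliability R gives n = R(1 − ICC) / (ICC(1 − R)). The Python sketch below implements both directions; the function names and the example ICC are illustrative assumptions, not values from the study.

```python
import math


def reliability(n_patients, icc):
    """Spearman-Brown reliability of a profile based on n_patients."""
    return n_patients * icc / (1 + (n_patients - 1) * icc)


def panel_size_needed(icc, target=0.80):
    """Smallest whole-number panel size reaching the target reliability."""
    if not 0 < icc < 1:
        raise ValueError("icc must be strictly between 0 and 1")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    raw = target * (1 - icc) / (icc * (1 - target))
    # Subtract a tiny tolerance so floating-point noise on an exact
    # integer result does not round the answer up by one patient.
    return math.ceil(raw - 1e-9)


# Hypothetical ICC, for illustration only.
example_icc = 0.05
n = panel_size_needed(example_icc)
print(n, round(reliability(n, example_icc), 3))
```

For a target of 80%, the formula reduces to n = 4(1 − ICC)/ICC, so halving the ICC roughly doubles the panel size required. This is why measures with little between-physician variation demand panels larger than most PCPs have.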
The final study sample included 51,771 patients cared for in 17 clinics by 1 of 163 primary care providers (PCPs). PCPs had a mean number of 51 (range 1-219) patients eligible for the DM registry, 8 (1-34) for CHF, 19 (1-107) for IVD, and 220 (1-837) for preventive care. The number of PCPs per clinic ranged from 3 to 21. Table 1 summarizes the overall adherence with each performance measure across the population. The proportion of patients meeting a measure ranged from a low of 45.5% of patients with diabetes attaining a hemoglobin A1c of <7% to a high of 98.5% of CHF patients having ejection fraction (EF) assessed.
Variation between PCPs was statistically significant for all single measures of prevention and most of DM (Table 2). In contrast, there was no significant variation in performance between PCPs for any individual CHF or IVD measure. For prevention, approximately 60% of PCPs had sufficient patient panel sizes to be profiled with 80% reliability for both breast and cervical cancer screening whereas less than 20% could be reliably profiled using the pneumococcal and influenza immunization measures. Few PCPs (0 to 40%) had sufficient panel sizes to be profiled with 80% reliability on the basis of any individual DM process measure and none on the basis of any DM intermediate-outcome measure (i.e., hemoglobin A1c level or blood pressure control).
After combining measures into single-disease composites, variation between PCPs was statistically significant for all but the CHF composite. For diabetes and preventive care, 74.5% and 76.7% of PCPs respectively had sufficient panel sizes to be profiled with 80% reliability (Table 3). However, a mere 2.1% of PCPs had sufficient panel sizes for IVD care. After combining measures into an overall multi-disease composite, 85.3% of PCPs had sufficient patients to be profiled with 80% reliability (Figure 1).
Variation between PCP groups (practice sites) was sufficient to reliably distinguish their performance on measures of breast and cervical cancer screening. However, there was no significant variation in performance between PCP groups on single measures for DM, CHF, IVD, or immunizations (Table 2). All PCP groups had sufficient patient panel sizes to be profiled with 80% reliability for breast cancer screening and 94% for cervical cancer screening.
After combining measures into single-disease and an overall composite, 100% of PCP groups could be profiled with 80% reliability for the preventive care and overall composites, although the ICCs for PCP groups were consistently much lower than those for individual PCPs. In contrast, there was no significant variation between PCP groups for composites of DM, CHF, or IVD care (Table 4).
Our study provides valuable information on proposed strategies for improving the reliability of PCP profiling within a large, integrated healthcare system. We found that few PCPs or PCP groups could be reliably profiled using single process or intermediate-outcome measures of diabetes, congestive heart failure, ischemic vascular disease, and preventive care. When we combined individual measures into disease-based composites, reliability improved substantially, primarily due to the increased number of measures and/or eligible patients for the more common conditions of DM and preventive care. Composites remained unreliable for CHF and IVD because these conditions were less common, had fewer measures, and showed less variation between PCP scores. Although profiling PCP practice groups instead of individual PCPs can greatly increase sample size, with the exception of preventive care profiling, profiling at the group level generally lowered reliability because there was more variation between individual PCPs than there was between PCP practice groups.
Previous studies on the reliability of individual performance measures have yielded mixed results.(11) In some instances, measures of patient satisfaction, preventive care, or chronic care have demonstrated sufficient variability between physicians or groups for reliable profiling.(8, 24-26) More commonly, individual performance measures have proven unreliable.(5-10, 26-31) These disparate findings are not necessarily contradictory but rather demonstrate that the degree and sources of performance variation can vary by contextual factors such as practice type, setting, or even over time. Appropriate consideration of context is known to be critical for the successful spread of quality improvement interventions.(32, 33) Similarly, contextual factors are likely to influence the usefulness of performance profiles, highlighting the need to use methods, such as those used in this paper, to evaluate the reliability of measures for a specific use prior to linking profiles to rewards or penalties. A profiling system that has proven reliable in one setting may prove unreliable in another.
In some instances, combining measures into composites has improved reliability.(14, 29) In our study, single-disease composites dramatically increased our ability to distinguish between PCPs for both DM and preventive care, with approximately 75% of PCPs having enough patients with these conditions for reliable profiling. For PCP groups, 100% could be reliably profiled with the preventive care composite. In contrast, single-disease composites did not improve reliability for either CHF or IVD. Relative to DM and preventive care, physicians had relatively few patients with CHF or IVD, there were fewer measures for these diseases, and there was little variation between providers in achieving these measures (i.e., “topped-out” measures with high overall performance). This high performance may have been the result of intense measurement and quality improvement activities in these areas prior to our study. Patients with CHF or IVD may also be more likely to be co-managed by sub-specialty consultants, making the PCP or PCP group a less relevant focus of profiling efforts, although we were not able to evaluate this hypothesis directly. Regardless, given the burden of illness and attention received by major conditions like CHF and IVD, performance profiles are often used to drive care improvement. Our data provide a cautionary note, however, that the value of these profiles to differentiate between PCPs or practice groups may be limited by poor reliability.
Aggregating measures one step further by combining all measures included in the single-disease composites into an overall composite increased reliability at the PCP level, by increasing sample size and increasing the variation in scores between PCPs. Although the composite score also allowed reliable profiling of PCP practice groups, this was due to increased sample size, not due to greater variation between PCP practice groups. Further, the overall composite improved reliability mainly due to combining the two most reliable condition-specific measures, DM and preventive care. This improved reliability could be negated if performance on one measure or one disease poorly correlates with performance in other areas.(34) In this situation, a signal of poor performance could be masked by high performance in other areas. Despite these potential limitations of composites, they did improve the reliability of our quality profiles and may be useful for high-level comparisons between providers or groups. In contrast, the results from single measures may provide the granularity necessary for targeted quality improvement activities in situations where differentiating between physicians is not the primary intent and limitations in measure reliability can be taken into consideration.
In general, for individual measures, single-disease composites, and the overall composite, aggregating providers into practice groups did not improve reliability, despite the large increase in sample size, because between-group variation was insignificant for most measures. Since the provider groups practice at different sites within a single integrated healthcare system, they can be considered clinical microsystems.(35, 36) It is possible that broader institutional efforts have reduced variability between microsystems through access to similar resources (e.g., the same EHR), administrative structure, quality improvement initiatives, and comparative measurement targeted to the group level. While this may have resulted in greater consistency between practice sites, variation persists between individual providers who are not specifically targeted. Others studying more diverse populations have reported greater variability between practice groups,(8, 24-26) once again demonstrating the need to consider context and goals of measurement, and to determine prospectively whether a profiling system can meet those goals.
A significant strength of our study was the ability to use an electronic health record system, allowing us to pool data from patients regardless of health insurance and to obtain more detailed data than administrative datasets allow, including laboratory results, medication allergies, and other measure exclusions. Electronic health record data also improve attribution of patients to specific physicians, aid assignment of disease categories, and facilitate the assessment of quality measures.(37)
There are several limitations to our study. Although most of the physicians practice full-time in community settings, they were associated with a single academic healthcare system and were exposed to similar quality improvement activities including feedback reports, educational initiatives, and incentives to improve on the measures used in this study. These activities may have improved measure performance and reduced variation. Our results may also not be generalizable to a system with a less robust medical informatics infrastructure or to measures based on less detailed clinical data, such as those lacking exclusions for contraindications. Given the national push to extend both this infrastructure and quality measurement, however, our results may better inform future efforts to use electronic health records to assess and improve quality. As in many performance measurement studies, attributing patients to a specific PCP was often difficult and undoubtedly not always accurate. Also, patients who received care elsewhere during the study may not have had all results captured in the EHR. Although missing data could decrease measure performance in general, its impact on variability is uncertain. Most health systems and performance measurement efforts face similar challenges. In addition, our results relate to reliability and not necessarily validity. Validity requires that what is being measured truly represents differences in quality. Although we used standard measures in widespread use, there remains a need to further improve the clinical accuracy and salience of performance measurement as well as the ability to account for patient preferences or characteristics that could drive differential selection of providers or groups.(38, 39) Lastly, while our study cannot fully explain why aggregating measurement to the PCP group level or creating composites does not improve reliability for every condition, we offer several hypotheses that may be tested in future analyses.
Performance profiles are now a common component of programs to encourage informed patient decision-making and drive accountability for quality improvement. It is critical that these profiles provide valid and reliable assessments of quality. Our study demonstrates that single measures will frequently produce unreliable profiles for both individual PCPs and for PCP group practices within a single healthcare system. Combining these measures into single-disease or multi-disease composites may improve reliability for some common diseases (such as diabetes) or types of care (such as preventive care), but not for other important conditions (such as CHF and IVD).
Funding sources: This work was supported in part by the Robert Wood Johnson Foundation Clinical Scholars Program, the Department of Veterans Affairs Quality Enhancement Research Initiative (QUERI), and the Measurement Core of the Michigan Diabetes Research & Training Center (NIDDK of the National Institutes of Health [P60 DK-20572]).
Supplemental Digital Content: Supplemental Digital Content 1.docx