To investigate the feasibility, reliability, and validity of comprehensively assessing physician-level performance in ambulatory practice.
Ambulatory-based general internists in 13 states participated in the assessment.
We assessed physician-level performance, adjusted for patient factors, on 46 individual measures, an overall composite measure, and composite measures for chronic, acute, and preventive care. Between- versus within-physician variation was quantified by intraclass correlation coefficients (ICC). External validity was assessed by correlating performance with scores on a certification exam.
Medical records for 236 physicians were audited for seven chronic and four acute care conditions, and six age- and gender-appropriate preventive services.
Performance on the individual and composite measures varied substantially within (range 5–86 percent compliance on 46 measures) and between physicians (ICC range 0.12–0.88). Reliabilities for the composite measures were robust: 0.88 for chronic care and 0.87 for preventive services. Higher certification exam scores were associated with better performance on the overall (r = 0.19; p <.01), chronic care (r = 0.14, p = .04), and preventive services composites (r = 0.17, p = .01).
Our results suggest that reliable and valid comprehensive assessment of the quality of chronic and preventive care can be achieved by creating composite measures and by sampling feasible numbers of patients for each condition.
Several recent national research reports point to significant gaps and variability in the quality of care patients receive in the U.S. health care system (Institute of Medicine 2001; McGlynn et al. 2003; Agency for Healthcare Research and Quality [AHRQ] 2005; Nelson et al. 2007). It is not clear whether and to what extent these gaps are concentrated among subgroups of physicians or reflect widespread variability in physician-level quality. Our current understanding of physician-level quality is limited to performance on only one or a few conditions, and investigators have raised concerns about the ability to differentiate performance among physicians using current process and outcome measures (Hofer et al. 1999; Landon et al. 2003; Holmboe et al. 2006; Campbell et al. 2007; Young et al. 2007; Bodenheimer 2008; Landon and Normand 2008; Kaplan et al. 2009). Other methodological concerns include insufficient and inadequate quality measures, inadequate adjustments for patient factors, clustering of performance at the physician level requiring larger patient and physician sample sizes for adequate statistical power, difficulty in determining appropriate standards (thresholds) at the physician level, and lack of uniformity of data collection and audits (Landon et al. 2003; Lee 2007; Bodenheimer 2008; Landon and Normand 2008; Kaplan et al. 2009).
For example, pay for performance (P4P) is one increasingly popular approach to measure and reward individual physicians (Campbell et al. 2007; Lee 2007; Young et al. 2007). P4P programs generally target only one or a limited set of conditions and performance measures. For instance, the Centers for Medicare & Medicaid Services' (CMS) Physician Quality Reporting Initiative requires reporting on only three measures (CMS 2008), and the current criteria for qualification as a patient-centered medical home focus solely on practice-level structural and some process characteristics (National Committee for Quality Assurance [NCQA] 2010).
Against this backdrop, there is intense interest in new models of primary care, such as the patient-centered medical home. The medical home practice model is designed to provide patient-centered, longitudinal, coordinated, and perhaps most important, comprehensive care to meet the needs of its patients (Hofer et al. 1999; American College of Physicians [ACP] 2007). There is thus a pressing need for creating more robust performance measurement methods to effectively evaluate the comprehensive nature of this practice model (Palmer et al. 1996; ACP 2007). One study attempted to benchmark physicians using administrative claims data from nine health plans on 10 quality measures. The authors found the majority of physicians did not have a sufficient number of “quality events” for reliable assessment of physician quality at the individual measure level, and only 15–20 percent of physicians had enough events for a reliable overall composite measure (Scholle et al. 2008). In a more recent study of a multipayer primary collaborative within a single multispecialty group, sufficient reliability at the individual physician level could only be found for preventive care measures (Sequist et al. 2010). These studies highlight the difficulty in comprehensively assessing physicians' performance in practice across preventive and chronic care. To our knowledge, however, few if any studies have attempted to assess multiple conditions across chronic, acute, and preventive care using multiple measures per condition with existing, well-tested quality measures of the kind that will be needed to effectively evaluate a practice as a patient-centered medical home.
General internal medicine (GIM) practice is a logical setting for such performance assessment because many of the currently endorsed measures are applicable to general internists, and GIM practice is a major focus of the medical home practice model. Our primary objectives in this study were to (1) investigate the feasibility of comprehensively assessing the practice performance of general internists using existing, widely endorsed performance measures and (2) assess the reliability and validity of composite measures across chronic, acute, and preventive care. Our secondary objectives were to (1) examine the associations of physician-level performance on chronic, acute, and preventive care measures within the same physician's practice using composite measures and (2) compare the variation in performance between and within physicians.
From a pool of interested volunteer participants (N = 534), we recruited 254 general internists with time-limited board certification due to expire between 2007 and 2009 who agreed to undergo a comprehensive assessment of their practice through medical record audit and a self-report of their systems capability (Holmboe et al. 2010). The participants were drawn from 13 states sorted by the 2005 AHRQ quality ranking of medical care within groups (AHRQ 2005) so as to draw physicians from the United States with variable levels of practice size and performance.
Participants received a U.S.$1,000 incentive: U.S.$500 at the time of enrollment and U.S.$500 when they completed the entire project. The project was approved by the New England Institutional Review Board, and all physicians provided consent to participate. The sample is a convenience sample of diverse, volunteer general internists in ambulatory practice settings, recruited to test comprehensive assessment methods.
The eligibility criteria for patients were age between 18 and 90 years, participating physician was their primary provider, and the patient had been seen at least once by the physician between July 1, 2005 and June 30, 2006. The audit collected information on patient demographics (age, ethnicity, gender) and comorbidity using the Charlson comorbidity index (Charlson et al. 1994).
The practice was instructed to identify patients where the physician subject was designated as the primary provider with these specific conditions (using ICD-9 codes): 20 patients with diabetes, hypertension (without diabetes), heart disease (any combination of coronary artery disease, acute myocardial infarction, and/or congestive heart failure), osteoarthritis (knee or hip), acute infection (any combination of upper respiratory or urinary tract infection), and 10 patients with acute low back pain for a target total sample of 110 medical records. The index date for the study was June 30, 2006 and practices were instructed to use a retrospective sequential sampling strategy to identify the target groups of patients until they reached the required sample size per condition or reached July 1, 2005. Regardless of the primary indication for sampling, all applicable chronic and acute care measures, and preventive services measures were audited for each patient.
The medical record audit was performed by trained abstractors (Westat Inc., Rockville, MD). Westat field managers received training from ABIM staff on the use of a web-based data collection tool, the Comprehensive Care Practice Improvement Module, and on the use of a comprehensive abstraction manual covering detailed specifications for all performance measures, including measures such as functional status, and for eligible disease conditions. Field managers then trained abstractors using these materials and three training medical records. Abstractor performance was then assessed on agreement for 109 data elements used to calculate the 46 performance measures using six reference patient charts scored previously by ABIM physician experts following abstraction specifications. Abstractors had to obtain an average agreement per reference chart of at least 85 percent in six domains before they could begin actual chart abstractions in practices (the range of abstractor reliabilities per training chart was 0.43–0.96 using a kappa-based estimate). ABIM and Westat coordinators reviewed a 5 percent sample of the abstractors' audits from the physician offices for quality control purposes during field operations, and physician subjects were also instructed to inform the ABIM of any perceived irregularities or problems. One abstractor was subsequently removed after completing one office due to poor reliability and the physician's medical records were reaudited.
The audit targeted seven chronic conditions: diabetes, coronary artery disease, postacute myocardial infarction, congestive heart failure, hypertension, atrial fibrillation, and osteoarthritis; four acute care conditions: upper respiratory infection, urinary tract infection, low back pain, and acute depression; and six preventive services: smoking cessation counseling, influenza and pneumococcal immunizations, and mammography, colorectal cancer, and osteoporosis screening. These measures were selected through a consensus process by a panel of experts in quality measurement and improvement. The panel chose conditions that have high prevalence and health impact in the U.S. population, are commonly cared for by general internists, and have validated measures available.
We used National Quality Forum (NQF) endorsed measures for all the chronic care and preventive services measures (NQF 2008); the acute care measures were derived from the RAND Corporation's Quality Assessment Tools (Kerr et al. 2000). In all, 46 performance measures (Appendix SA2) were abstracted for this study. Scoring for all measures was dichotomous, based on whether the patient had achieved a threshold value for outcome measures (e.g., hemoglobin A1c <7.0 percent) or the process of care had been performed.
To summarize physician performance we calculated two types of composite performance scores: one for each specific disease condition (e.g., diabetes, osteoarthritis) and a more comprehensive composite created by aggregating conditions by care type (i.e., acute care, chronic care, and preventive services) for each physician. For each individual measure, a physician-level unadjusted score was computed as the percentage of patients who “met” the goal. The individual measures were also adjusted for patient factors, including age, race, ethnicity, and comorbidity (Charlson et al. 1994), using a random physician-intercept logistic model. We first calculated the expected logit values of individual measures using the random physician intercept term (shrinkage estimator), the estimated coefficients of the covariates from the risk-adjusted model, and the means of patient risk factors of the entire patient sample. Then we transformed the expected logit value onto the probability scale and used this to represent the adjusted score (Hasnain-Wynia et al. 2007). The composite measure score of a disease (e.g., diabetes) or a general care type (e.g., chronic care) was computed using the indicator average method (Reeves et al. 2007). For example, for each physician the composite score for preventive services was computed by averaging the percentages of its six individual measures (i.e., smoking cessation counseling, influenza immunizations, pneumococcal immunizations, mammography, colorectal cancer screening, and osteoporosis screening), giving each measure equal weight. We chose the indicator average method because we intended to assess quality with respect to the processes of care and to avoid potential overrepresentation of frequently triggered individual measures.
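The contrast between the indicator average and a pooled, opportunity-style score can be illustrated with a minimal sketch. The function names, measure keys, and patient counts below are hypothetical and are not taken from the study data; the pooled score is shown only as a point of comparison and is not a method used in this study.

```python
from statistics import mean

def indicator_average(measure_results):
    """Average the per-measure pass rates, giving each measure equal weight.

    measure_results maps a measure name to a list of 0/1 patient-level
    results for one physician. Measures with no eligible patients are skipped.
    """
    rates = [mean(results) for results in measure_results.values() if results]
    return mean(rates)

def pooled_score(measure_results):
    """Comparison only: pool all patient-level results across measures,
    so measures triggered more often carry more weight."""
    met = sum(sum(results) for results in measure_results.values())
    total = sum(len(results) for results in measure_results.values())
    return met / total

# Hypothetical physician: one frequently triggered and one rarely triggered measure.
example = {
    "influenza_immunization": [1] * 8 + [0] * 2,  # 8 of 10 patients met the goal
    "osteoporosis_screening": [1, 0],             # 1 of 2 patients met the goal
}

composite = indicator_average(example)  # (0.80 + 0.50) / 2
pooled = pooled_score(example)          # 9 / 12
```

In this illustration the frequently triggered measure pulls the pooled score upward, whereas the indicator average weights both measures equally, which is the overrepresentation concern described above.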
Characteristics of the 254 physicians who began the study and the 236 physicians who completed the chart audit were reviewed. Descriptive statistics, including means, standard errors, and ranges, were computed at patient and physician levels for each individual performance measure. For eligible patients, a process measure was scored as “not meeting goal” if it was not recorded on chart audit, and an outcome measure was scored as “not meeting goal” if the test had been performed within the index period but did not meet the goal, was not recorded on chart audit, or was out of date. If a test had a result out of plausible range but with a valid date, then the outcome measure was set as missing. The missing data rates for outcome measures ranged from 0.5 to 2 percent.
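The scoring rules for eligible patients can be sketched as follows. This is an illustration under stated assumptions rather than the abstraction software used in the study: the function names and the plausible range for hemoglobin A1c are hypothetical, and the "lower is better" threshold follows the A1c <7.0 percent example.

```python
from typing import Optional, Tuple

def score_process(recorded: bool) -> int:
    """Process measure: 1 if the process of care was recorded on chart audit, else 0."""
    return 1 if recorded else 0

def score_outcome(
    value: Optional[float],
    in_index_period: bool,
    threshold: float,
    plausible_range: Tuple[float, float],
) -> Optional[int]:
    """Outcome measure where lower values meet the goal (e.g., A1c < 7.0).

    Returns 1 (met goal), 0 (not meeting goal), or None (set to missing).
    """
    # Not recorded or out of date: scored as not meeting goal.
    if value is None or not in_index_period:
        return 0
    # Validly dated result outside the plausible range: set to missing.
    low, high = plausible_range
    if not (low <= value <= high):
        return None
    return 1 if value < threshold else 0

A1C_PLAUSIBLE = (2.0, 20.0)  # hypothetical range, for illustration only
```

For instance, `score_outcome(6.5, True, 7.0, A1C_PLAUSIBLE)` yields 1, whereas a validly dated value of 65.0 yields `None` and is excluded as missing.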
To quantify reliability of the performance estimates, we estimated intraclass correlation coefficients (ICC) for each performance measure. The ICC is the ratio of variation between physicians' practice means ($\sigma_p^2$) to the total measure variation, where the latter is defined as the sum of the within-practice ($\sigma_e^2$) and between-practice variation: $\mathrm{ICC} = \sigma_p^2 / (\sigma_p^2 + \sigma_e^2)$. An ICC close to one implies a reliable physician “thumbprint” in that the between-physician variation is relatively larger than the within-physician variation; measures having physician-level reliability >0.85 can be considered sufficiently reliable for comparing physicians scoring over or under a threshold value (Kaplan et al. 2009). We estimated the ICC using a multilevel random effect logistic model (NLMIXED procedure, SAS version 9.1.3). Ninety-five percent confidence intervals were calculated using the delta method (Reeves et al. 2007). We reestimated the random effects logistic model adjusting for patient factors, including age, comorbidity index, ethnicity, and gender, to obtain adjusted ICC estimates. Box plots of the performance scores and Pearson correlations across the risk-adjusted performance scores were computed. Composite score reliabilities were estimated by combining ICC estimates and variances from the component performance scores (Feldt and Brennan 1989). To assess external validity, we calculated Pearson correlation coefficients for the association among the composite scores and physician characteristics that included location of medical school training, initial Internal Medicine Board Certification scores, and number of attempts needed to pass the certification examination.
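The ICC definition above can be written as a short sketch. The first function implements the ratio exactly as defined. The second is an assumption: it uses the common latent-variable convention for random-intercept logistic models, in which the within-physician variance is fixed at π²/3, and the text does not state that this convention was the one used here. All names are hypothetical.

```python
import math

def icc(var_between: float, var_within: float) -> float:
    """ICC = between-physician variance / (between + within variance)."""
    if var_between < 0 or var_within <= 0:
        raise ValueError("variances must be non-negative, with within > 0")
    return var_between / (var_between + var_within)

# Assumption: latent-scale residual variance of the standard logistic distribution.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3

def icc_logistic_latent(var_between: float) -> float:
    """ICC on the latent (logit) scale for a random-intercept logistic model,
    given the estimated variance of the physician intercepts."""
    return icc(var_between, LOGISTIC_RESIDUAL_VARIANCE)

# Illustrative value only: a physician-intercept variance of 1.0 on the logit scale.
example_icc = icc_logistic_latent(1.0)
```

Under this convention, a physician-intercept variance equal to π²/3 corresponds to an ICC of 0.5, and larger intercept variances push the ICC toward one.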
Finally, we estimated the patient sample size per measure needed to achieve a high level of reliability (i.e., 0.85) and compared this number to the actual sample sizes. When the number of needed patients was less than or equal to the actual physician panel size, then the sample size was deemed adequate. All analyses were conducted using SAS version 9.1.3.
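One standard way to carry out this kind of calculation is the Spearman-Brown relationship between a single-patient ICC and the reliability of a physician mean based on n patients. The sketch below assumes that relationship; the text does not state which formula was used, and the function names and the ICC value are hypothetical.

```python
import math

def reliability_of_mean(icc: float, n_patients: int) -> float:
    """Spearman-Brown reliability of a physician-level mean over n patients."""
    return n_patients * icc / (1 + (n_patients - 1) * icc)

def patients_needed(icc: float, target: float = 0.85) -> int:
    """Smallest number of patients per physician giving reliability >= target."""
    if not 0 < icc < 1 or not 0 < target < 1:
        raise ValueError("icc and target must lie strictly between 0 and 1")
    exact = target * (1 - icc) / ((1 - target) * icc)
    # Subtract a small tolerance so floating-point noise does not add a patient.
    return math.ceil(exact - 1e-9)

# Illustrative ICC, not a value reported in Table 3.
needed = patients_needed(0.24)           # patients required for reliability 0.85
observed_cluster_size = 29               # hypothetical observed patients per physician
adequate = needed <= observed_cluster_size
```

With this illustrative ICC the required sample is 18 patients, so a physician with 29 eligible patients would be deemed to have an adequate sample for that measure.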
The mean age of the 236 physicians was 42 (range 33–69), 36 percent were female, and 37 percent were in solo practice, 28 percent in single specialty medical group or partnership, 23 percent in multispecialty group or partnership, and 7 percent in academic faculty practice (Table 1). This was consistent with our goal of sampling a diverse set of GIM practices in the United States.
Key patient characteristics, clustered at the physician level, showed substantial variability between physicians (Table 2). For example, the mean patient age ranged from 44 to 77 years, and the proportion of males ranged from 10 to 75 percent. The mean Charlson comorbidity score and other factors that can affect delivery of patient care also varied substantially between physicians (Table 2).
Overall, the mean number of medical record audits completed per physician was 95 (goal was 110 charts per physician; 86 percent of overall target). Only 72 physicians (31 percent) were able to identify the prespecified numbers for all the targeted conditions combined in the 1-year study frame. However, 193 physicians (82 percent) did meet the sampling targets for four conditions, and over 90 percent of physicians achieved the target sample for hypertension and diabetes.
Physicians performed better on the chronic care process measures that involved a laboratory test or medication than on measures involving medical history or examination (Table 3). For example, performance on completion of a foot exam with monofilament (10 percent), documented eye examination (19 percent) for diabetics, and level of pain assessment for osteoarthritis (16 percent) was low in each case. Performance on the acute care measures was substantially more variable compared with the chronic care measures. For example, while physicians appropriately avoided the use of inappropriate drugs like antidepressants in acute low back pain (86 percent), the quality of documentation for key aspects of the history was poor. Finally, none of the performance rates for the preventive services measures were >62 percent.
The ICCs for the outcome measures were all roughly 0.10 (Table 3). The ICCs for the process measures were more variable but on average substantially higher than those for the outcome measures. Table 3 indicates that for risk-adjusted measures, 14 out of 23 chronic care measures, 11 out of 15 acute care measures, and 8 out of 8 prevention measures had sufficient sample size. For example, 18 patients with diabetes are needed to reliably (ρ = 0.85) measure whether the annual A1c test was done; the observed cluster size was about 29.
Figure 1 shows the box plots and descriptive statistics of composite performance scores. The width of box plots indicates the number of physicians with a performance score available. Performance compliance scores for disease conditions ranged from 30 percent (upper respiratory infection) to 73 percent (congestive heart failure). The mean performance score for chronic conditions (59 percent) was the highest among the three care categories.
Table 4 displays the risk-adjusted Pearson correlation matrix for the mean physician-level performance across the chronic care, acute care, and preventive services measures (upper triangle) and the measure reliabilities (main diagonal). The majority of the coefficients between the chronic care conditions show moderate levels of correlation, and performance on the chronic care conditions correlated moderately with performance on the preventive services composite (r = 0.66). However, performance on the composite measure for the acute care conditions was less correlated with performance on either the chronic care (r = 0.32) or the preventive services composite measures (r = 0.24).
The overall performance (r = 0.19), chronic care (r = 0.14), and preventive care (r = 0.17) composites were all significantly (p <.05) correlated with higher scores on the internal medicine certification examination. We did not find significant associations between the acute care composite and performance on the examination.
This study, using well-tested measurement across multiple conditions, highlights the complexity in comprehensively measuring individual physician quality using existing performance measures. However, the high correlation among the chronic disease composite measures and consequent robust reliability of the physician-level composite for chronic diseases suggests that we may now be able to measure quality of chronic care feasibly and reliably. Our finding of significant associations between performance on a knowledge examination and the overall, chronic and preventive care composites provides validity evidence for these composites and is consistent with recent studies finding associations between certification and chronic and preventive care measures (Pham et al. 2005; Holmboe et al. 2008; Turchin et al. 2008).
Our findings also provide guidance on the utility of comprehensively measuring the effectiveness of new models of primary care such as the patient-centered medical home. Currently, practices are qualified based only on the presence or absence of systems components (ACP 2007; NCQA 2010). However, the public and policy makers will want evidence that new practice models are truly providing high-quality, cost-efficient care across a spectrum of clinical care. The overall, chronic and preventive care composite measures potentially provide very reliable and valid measures of physician and practice performance, and allow for a compensatory approach to measure performance given the substantial between- and within-physician variation.
There are several possible explanations for the physician-level variability on the individual measures, specific conditions, and for composite chronic, acute, and preventive care measures. First, physicians may simply possess varying degrees of knowledge and skill to care for the conditions examined in this study (Palmer et al. 1996). Second, lack of sufficient documentation for underuse measures may partially explain some of the findings (Luck et al. 2000; Peabody et al. 2000; Holmboe et al. 2010). However, there was clear evidence through documentation that physicians engaged in substantial overuse of procedures and therapies in the acute care domain. Third, some physicians' office systems may be more suitably configured to execute common tasks such as immunizations and test ordering for chronic conditions, but not well designed to provide care for the same patients needing acute care (Stewart 1995).
Fourth, unlike many of the chronic and preventive care measures, most of the acute care measures require a different level of physician clinical judgment (e.g., proper clinical evaluation and diagnosis to inform a decision to order an antibiotic for a respiratory infection). Patients also play a role, such as how insistent they may be in requesting a drug or imaging study; these interactions, however, require physician knowledge and skills to effectively negotiate and communicate with these patients (Kaplan et al. 1996; Duffy et al. 2004; Lipner et al. 2007). Given the importance of managing acute care conditions, the lack of association of the acute care composite with the other composites and its lack of any apparent relationship to physician competence in knowledge or time spent in practice highlight the urgent need for more research to understand how physicians and their offices deal with acute care conditions and how best to measure performance in this domain.
A few additional caveats are important to note. The likelihood that composite measures represent a singular or similar quality construct diminishes as more diverse conditions are included in the composite. Moreover, comprehensive composites may not be most useful for guiding quality improvement initiatives. Physicians will require their results on individual performance measures to guide quality improvement efforts.
What are the potential implications of this study? First, our results provide evidence on the feasibility of obtaining sufficient sample sizes across multiple conditions in a physician's practice. Our results in this limited and likely motivated cohort suggest that it is unlikely, at least for general internists, that a single “blueprint” for comprehensive practice assessment is feasible or even desired using existing performance measures; composite measures and measurement over longer periods of time may address some of these challenges if the goal is to differentiate between physicians. The most encouraging area for further development of composite measures, based on our results, is the quality of chronic and preventive care.
From a policy perspective, these results may have salience to several accountability approaches being used to improve care. First, P4P programs should consider evolving to more comprehensive assessment, including the use of composite measures but also targeting other important areas of care such as communication and coordination. The Bridges-to-Excellence program has taken a small step in this direction through its advanced medical home program that will require a level of performance in two of its clinical recognition programs (diabetes, heart/stroke, and spine care) and the provider practice connections systems assessment (Conway 2008). Other P4P providers should explore more comprehensive, compensatory models for their programs because a singular focus may fail to uncover other opportunities for improvement in a practice.
Early research also suggests that improvement in performance is concentrated on those measures that are actively monitored (Asch et al. 2004). A comprehensive assessment may help physicians identify the unintended consequences of focusing on only one condition for additional financial reward. Investigators and policy makers implementing and studying new primary care practice models should carefully consider how they will define “success” in these programs. Measurement of limited aspects of practice is not consistent with the goal of improving the delivery of high-quality comprehensive care, especially for patients with multiple conditions (Boyd et al. 2005).
This study has several limitations. First, the goal of this study was not to sample the “universe” of general internists but rather to investigate the feasibility, reliability, and validity of comprehensive assessment using existing and accepted performance measures. As such, the sample size is modest at 236 physicians, who were relatively young, were given compensation for their participation, and were all involved in a maintenance of certification program. Thus, the sample is not representative of all general internists across the career spectrum. However, we sampled broadly across the United States and our results did show substantial variability in the physicians' practice populations and practice performance.
Second, our use of a medical record audit to measure performance is dependent on the quality of documentation. However, medical record audit is a validated approach to measurement and allowed standardization of the audit process across practices regardless of the format of the medical record. We also used a highly rigorous process with several layers of quality control, and the effects of under- and overreporting help to offset each other (Peabody et al. 2000). The medical record audit was also labor intensive, making our approach difficult to implement outside of a research setting. However, we found a medical record audit approach produced high reliabilities (e.g., >0.85) for a number of single measures as well as our composites using very feasible sample sizes not seen in studies relying on claims data (Scholle et al. 2008; Sequist et al. 2010).
Our results strengthen the argument for the urgent need to build performance measurement into electronic medical records to enable comprehensive assessment of performance (DesRoches et al. 2008). However, some of the important measures used in this study, such as functional status, are currently difficult to generate electronically yet are clinically meaningful, necessitating some amount of “hand” audit for the foreseeable future if we truly want to assess the quality of care comprehensively. While the traditional medical record audit must give way to electronically generated data, it will be some time before health information technology can generate comprehensive performance assessment at the point of care. Perhaps more important, our study provides guidance on how data should be structured to ensure health information technology provides clinically meaningful data not just to policy makers, payers, and so forth, but to help physicians improve their care. For example, validated templates or structured questions that capture important aspects of care such as functional status, and so forth, should be built into the EMRs with appropriate links to key demographic data and tracked through an ongoing registry. EMRs should also investigate software used in qualitative research that can accurately search strings of key text.
In conclusion, comprehensive assessment of individual physicians is a challenging but important task as the country embarks on health care reform, especially in primary care. Our data demonstrate how difficult it can be for physicians to perform well across multiple conditions. However, and perhaps most important, our results suggest that the creation of composite measures, especially in chronic disease and preventive care, may be a more reliable, fair, and valid way to discriminate performance across heterogeneous physicians' patient populations if such performance measurement remains a part of P4P programs and is publicly reported. New approaches to comprehensive assessment are needed to better inform the public about the quality of the care they receive in primary care practices and should be sensitive to the changes that occur in the microsystems (ACP 2007; NCQA 2010). However, in the end what matters most are the outcomes experienced by patients, and measuring just delivery system changes is likely insufficient (Holmboe et al. 2010). This study shows that a broader sampling of practice across multiple conditions may be a reasonable first step, and our approach could be used to evaluate the effectiveness of patient-centered medical home demonstration projects and other primary care redesign initiatives.
Joint Acknowledgment/Disclosure Statement: Drs. Holmboe, Arnold, Weng, and Lipner and Ms. Hood are employed by the American Board of Internal Medicine, and Dr. Eric S. Holmboe, MD, receives partial salary support from the American Board of Internal Medicine Foundation. Drs. Kaplan, Greenfield, and Normand received financial support from the American Board of Internal Medicine Foundation.
Financial Support: American Board of Internal Medicine and American Board of Internal Medicine Foundation. Bridges-to-Excellence provided funding for an expert panel meeting related to this project.
Additional supporting information may be found in the online version of this article:
Appendix SA1: Author Matrix.
Appendix SA2: Measures for Comprehensive Care Project.