This study, using well-tested measurement across multiple conditions, highlights the complexity in comprehensively measuring individual physician quality using existing performance measures. However, the high correlation among the chronic disease composite measures and the consequent robust reliability of the physician-level composite for chronic diseases suggest that we may now be able to measure quality of chronic care feasibly and reliably. Our finding of significant associations between performance on a knowledge examination and the overall, chronic, and preventive care composites provides validity evidence for these composites and is consistent with recent studies finding associations between certification and chronic and preventive care measures (Pham et al. 2005; Holmboe et al. 2008; Turchin et al. 2008).
Our findings also provide guidance on the utility of comprehensively measuring the effectiveness of new models of primary care such as the patient-centered medical home. Currently, practices are qualified based only on the presence or absence of systems components (ACP 2007; NCQA 2010). However, the public and policy makers will want evidence that new practice models are truly providing high-quality, cost-efficient care across a spectrum of clinical care. The overall, chronic, and preventive care composite measures potentially provide very reliable and valid measures of physician and practice performance and allow for a compensatory approach to measuring performance given the substantial between- and within-physician variation.
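To illustrate what a compensatory approach means in practice, the following minimal Python sketch contrasts a compensatory composite (a simple average, in which strong performance on one measure can offset weaker performance on another) with a conjunctive, all-or-none rule. The measure names, rates, and threshold are hypothetical and are not study data, and the equal weighting is an assumption for illustration rather than the scoring method used in this study.

```python
from statistics import mean


def compensatory_composite(measure_rates):
    """Average the measure-level rates (each between 0 and 1).

    Equal weighting is an assumption made for this sketch.
    """
    if not measure_rates:
        raise ValueError("at least one measure rate is required")
    return mean(measure_rates.values())


def all_or_none(measure_rates, threshold):
    """Conjunctive alternative: pass only if every measure meets the threshold."""
    return all(rate >= threshold for rate in measure_rates.values())


# Hypothetical physician; illustrative numbers only.
rates = {"hba1c_testing": 0.95, "ldl_control": 0.55, "flu_vaccination": 0.90}

composite = compensatory_composite(rates)
passes_all = all_or_none(rates, threshold=0.75)
```

In this example the weak result on one measure lowers the composite but does not by itself fail the physician, whereas the all-or-none rule does fail the physician.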
There are several possible explanations for the physician-level variability on the individual measures, specific conditions, and composite chronic, acute, and preventive care measures. First, physicians may simply possess varying degrees of knowledge and skill to care for the conditions examined in this study (Palmer et al. 1996). Second, lack of sufficient documentation for underuse measures may partially explain some of the findings (Luck et al. 2000; Peabody et al. 2000; Holmboe et al. 2010). However, there was clear evidence through documentation that physicians engaged in substantial overuse of procedures and therapies in the acute care domain. Third, some physicians' office systems may be more suitably configured to execute common tasks such as immunizations and test ordering for chronic conditions, but not well designed to provide care for the same patients needing acute care (Stewart 1995).
Fourth, unlike many of the chronic and preventive care measures, most of the acute care measures require a different level of physician clinical judgment (e.g., proper clinical evaluation and diagnosis to inform a decision to order an antibiotic for a respiratory infection). Patients also play a role, such as how insistent they may be in requesting a drug or imaging study; these interactions, however, require physician knowledge and skills to effectively negotiate and communicate with these patients (Kaplan et al. 1996; Duffy et al. 2004; Lipner et al. 2007). Given the importance of managing acute care conditions, the lack of association of the acute care composite with the other composites and the absence of any apparent relationship to physician knowledge or time spent in practice highlight the urgent need for more research to understand how physicians and their offices deal with acute care conditions and how best to measure performance in this domain.
A few additional caveats are important to note. The likelihood that composite measures represent a single or similar quality construct diminishes as more diverse conditions are included in the composite. Moreover, comprehensive composites may not be the most useful tool for guiding quality improvement initiatives. Physicians will require their results on individual performance measures to guide quality improvement efforts.
What are the potential implications of this study? First, our results provide evidence on the feasibility of obtaining sufficient sample sizes across multiple conditions in a physician's practice. Our results in this limited and likely motivated cohort suggest that it is unlikely, at least for general internists, that a single “blueprint” for comprehensive practice assessment is feasible or even desirable using existing performance measures; composite measures and measurement over longer periods of time may address some of these challenges if the goal is to differentiate between physicians. Based on our results, the most encouraging area for further development of composite measures is the quality of chronic and preventive care.
From a policy perspective, these results may have salience to several accountability approaches being used to improve care. First, P4P programs should consider evolving to more comprehensive assessment, including the use of composite measures but also targeting other important areas of care such as communication and coordination. The Bridges-to-Excellence program has taken a small step in this direction through its advanced medical home program that will require a level of performance in two of its clinical recognition programs (diabetes, heart/stroke, and spine care) and the provider practice connections systems assessment (Conway 2008). Other P4P providers should explore more comprehensive, compensatory models for their programs because a singular focus may fail to uncover other opportunities for improvement in a practice.
Early research also suggests that improvement in performance is concentrated on those measures that are actively monitored (Asch et al. 2004). A comprehensive assessment may help physicians identify the unintended consequences of focusing on only one condition for additional financial reward. Investigators and policy makers implementing and studying new primary care practice models should carefully consider how they will define “success” in these programs. Measurement of limited aspects of practice is not consistent with the goal of improving the delivery of high-quality comprehensive care, especially for patients with multiple conditions (Boyd et al. 2005).
This study has several limitations. First, the goal of this study was not to sample the “universe” of general internists but rather to investigate the feasibility, reliability, and validity of comprehensive assessment using existing and accepted performance measures. As such, the sample size is modest at 236 physicians, who were relatively young, were compensated for their participation, and were all enrolled in a maintenance of certification program. Thus, the sample may not be representative of general internists across the career spectrum. However, we sampled broadly across the United States, and our results did show substantial variability in the physicians' practice populations and practice performance.
Second, our use of a medical record audit to measure performance is dependent on the quality of documentation. However, medical record audit is a validated approach to measurement and allowed standardization of the audit process across practices regardless of the format of the medical record. We also used a highly rigorous process with several layers of quality control, and the effects of under- and overreporting help to offset each other (Peabody et al. 2000). The medical record audit was also labor intensive, making our approach difficult to implement outside of a research setting. However, we found that a medical record audit approach produced high reliabilities (e.g., >0.85) for a number of single measures as well as our composites using very feasible sample sizes not seen in studies relying on claims data (Scholle et al. 2008; Sequist et al. 2010).
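The link between reliability and the number of patients sampled per physician can be sketched with the standard signal-to-noise formulation, in which reliability equals the between-physician variance divided by the sum of the between-physician variance and the within-physician variance scaled by sample size. The Python sketch below assumes this formulation and uses hypothetical variance components; it is not the estimator or the data from this study.

```python
import math


def physician_reliability(var_between, var_within, n_patients):
    """Reliability of a physician-level mean score based on n_patients charts."""
    if n_patients <= 0:
        raise ValueError("n_patients must be positive")
    return var_between / (var_between + var_within / n_patients)


def patients_needed(var_between, var_within, target):
    """Smallest number of patients per physician that reaches the target reliability."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    n = (target / (1 - target)) * (var_within / var_between)
    return math.ceil(n)


# Hypothetical variance components, chosen only for illustration.
VAR_BETWEEN = 1.0
VAR_WITHIN = 4.0

n_required = patients_needed(VAR_BETWEEN, VAR_WITHIN, target=0.85)
reliability_at_n = physician_reliability(VAR_BETWEEN, VAR_WITHIN, n_required)
```

The sketch makes the trade-off explicit: the smaller the between-physician variance relative to the within-physician variance, the more patients must be sampled per physician to reach a given reliability.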
Our results strengthen the argument for the urgent need to build performance measurement into electronic medical records (EMRs) to enable comprehensive assessment of performance (DesRoches et al. 2008). However, some of the important measures used in this study, such as functional status, are clinically meaningful yet currently difficult to generate electronically, necessitating some amount of “hand” audit for the foreseeable future if we truly want to assess the quality of care comprehensively. While the traditional medical record audit must give way to electronically generated data, it will be some time before health information technology can generate comprehensive performance assessment at the point of care. Perhaps more important, our study provides guidance on how data should be structured to ensure that health information technology provides clinically meaningful data not just to policy makers and payers but also to physicians seeking to improve their care. For example, validated templates or structured questions that capture important aspects of care such as functional status should be built into EMRs with appropriate links to key demographic data and tracked through an ongoing registry. EMR developers should also investigate software of the kind used in qualitative research that can accurately search for strings of key text.
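As a rough illustration of searching free text for key strings, the Python sketch below flags mentions of functional status in a clinical note using a keyword pattern. The term list and the sample note are hypothetical, and simple keyword matching of this kind is an assumption for illustration only; it does not handle negation, misspellings, or context, and any real term list would need clinical validation.

```python
import re

# Hypothetical keyword list; a real list would need clinical validation.
FUNCTIONAL_STATUS_TERMS = [
    "functional status",
    "activities of daily living",
    "ADLs",
    "walking distance",
]

# Case-insensitive pattern that matches any listed term on word boundaries.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in FUNCTIONAL_STATUS_TERMS) + r")\b",
    re.IGNORECASE,
)


def find_mentions(note):
    """Return every listed term found in the note, in order of appearance."""
    return [match.group(0) for match in PATTERN.finditer(note)]


# Made-up note text for illustration.
note = "Pt reports stable functional status; independent in ADLs."
mentions = find_mentions(note)
```

A note with no matches returns an empty list, which could prompt the kind of structured question or template described above rather than a hand audit.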
In conclusion, comprehensive assessment of individual physicians is a challenging but important task as the country embarks on health care reform, especially in primary care. Our data demonstrate how difficult it can be for physicians to perform well across multiple conditions. However, and perhaps most important, our results suggest that the creation of composite measures, especially in chronic disease and preventive care, may be a more reliable, fair, and valid way to discriminate performance across physicians' heterogeneous patient populations if such performance measurement remains a part of P4P programs and is publicly reported. New approaches to comprehensive assessment are needed to better inform the public about the quality of the care they receive in primary care practices, and these approaches should be sensitive to the changes that occur in the microsystems (ACP 2007; NCQA 2010). However, in the end what matters most are the outcomes experienced by patients, and measuring just delivery system changes is likely insufficient (Holmboe et al. 2010). This study shows that a broader sampling of practice across multiple conditions may be a reasonable first step, and our approach could be used to evaluate the effectiveness of patient-centered medical home demonstration projects and other primary care redesign initiatives.