To systematically rate measures of care quality for very low birth weight infants for inclusion into Baby-MONITOR, a composite indicator of quality.
Modified Delphi expert panelist process including electronic surveys and telephone conferences. Panelists considered 28 standard neonatal intensive care unit (NICU) quality measures and rated each on a 9-point scale taking into account pre-defined measure characteristics. In addition, panelists grouped measures into six domains of quality. We selected measures by testing for rater agreement using an accepted method.
Of 28 measures considered, 13 had median ratings in the high range (7 to 9). Of these, 9 met the criteria for inclusion in the composite: antenatal steroids (median (interquartile range)) 9(0), timely retinopathy of prematurity exam 9(0), late onset sepsis 9(1), hypothermia on admission 8(1), pneumothorax 8(2), growth velocity 8(2), oxygen at 36 weeks postmenstrual age 7(2), any human milk feeding at discharge 7(2) and in-hospital mortality 7(2). Among the measures selected for the composite, the domains of quality most frequently represented included effectiveness (40%) and safety (30%).
A panel of experts selected 9 of 28 routinely reported quality measures for inclusion in a composite indicator. Panelists also set an agenda for future research to close knowledge gaps for quality measures not selected for the Baby-MONITOR.
The release of the Institute of Medicine’s influential reports on patient safety and quality1,2 has invigorated the focus on improving quality of care. Quality deficits and variations have been documented in the neonatal intensive care unit (NICU) setting.3–7 These variations are important, as they are often associated with preventable morbidity and mortality.3–5,8
Reducing variation in quality and outcomes through measuring, reporting and rewarding quality of health-care delivery has become a national health policy priority.9 In other areas of medicine, multi-stakeholder initiatives and health-care payers are experimenting with comparative quality measurement (benchmarking),10 public release of performance information11,12 and financial incentives to improve quality.12–14
Composite indicators of care quality have been used in adult medicine to measure and track global provider performance.15–18 A composite performance measure combines two or more indicators into a single number to summarize multiple dimensions of provider performance and to facilitate comparisons.19 Composites have several potential desirable attributes, including improving communication among stakeholders through reduced complexity of data generated, providing global insights and trends on quality for internal and external benchmarking, and improving the reliability and efficiency of provider measurement.20–23 In addition, global measurement solutions via composite indicators have the potential to foster comprehensive approaches to quality improvement. Such approaches might change the current practice of addressing individual domains of quality (for example, use of a specific medication) to a more systems-based approach, which may improve care across multiple domains (for example, improving teamwork and safety culture24,25).
However, the process of developing composite indicators is complex, and developers have to make choices in their construction that may significantly influence performance ratings.26 For example, a previous report compared differences between composite scores derived from the hospital quality scorecard generated by the Centers for Medicare and Medicaid Services and from the quality scorecard generated by US News and World Report.27 Based on these composite scores, the authors found frequent discordance in hospital performance between the scorecards. This discrepancy highlights the importance of a standardized and explicit approach to scorecard and composite indicator development. A particularly explicit approach to composite indicator development has been described by the European Commission Joint Research Center (JRC).23
Specifically, the JRC advocates a purposeful 10-step approach to composite indicator development, which guides developers through various stages of conceptualization, computation, testing and dissemination.22 Although the JRC targets global measurement of social and economic domains, the methods are broadly applicable to the health-care setting.
Our overarching goal is to develop a composite indicator that could be used to assess the global quality of care provided by each NICU member of a quality improvement consortium. Initially, we are developing a composite indicator of quality delivered to very low birth weight (VLBW) infants, which we will name Measure Of Neonatal InTensive care Outcomes Research or Baby-MONITOR. Here, we report on an essential step for indicator development: selection of quality measures for inclusion in the Baby-MONITOR.
The framework for development of the Baby-MONITOR has been described in detail elsewhere.22 In brief, ideally the Baby-MONITOR would consist of six subpillars, each representing one of the domains of quality described by the Institute of Medicine (safety, effectiveness, efficiency, patient-centeredness, timeliness and equity).2 Each subpillar would combine several measures of quality. The Baby-MONITOR was specifically created using outcome and process measures routinely collected at the NICU level by the California Perinatal Quality Care Collaborative (CPQCC) and the Vermont Oxford Network (VON).
Our sampling framework attempts to compare similar populations of VLBW infants among hospitals while minimizing the number of patients that would be excluded. Exclusion criteria and their rationale are shown in Table 1. These criteria were vetted by a panel of experts described below and attempt to reconcile the competing demands of accuracy, fairness and generalizability.
From the CPQCC and the VON operations manuals, we pre-selected 28 candidate measures for expert vetting. The pre-selection process was designed for inclusiveness: it excluded only measures of surgical quality and was based on consensus among participating researchers. Measures included VLBW annual volume, antenatal steroids, temperature measured within 1 h of NICU admission, hypothermia at NICU admission, surfactant administration within 2 h of birth, timely retinopathy of prematurity (ROP) examination, severe ROP (>stage 2), ROP surgery, any intracranial hemorrhage, intracranial hemorrhage severity >grade 2, cystic periventricular leukomalacia, use of assisted ventilation, duration of assisted ventilation, pneumothorax, postnatal steroids for chronic lung disease, oxygen on day 28, oxygen at 36 weeks postmenstrual age, oxygen at initial discharge, discharge home or to long-term care facility on assisted ventilation, necrotizing enterocolitis, necrotizing enterocolitis surgery, feeding with human milk only at discharge, feeding any human milk at discharge, growth velocity, healthcare-associated infection, length of stay, 28-day mortality and mortality during NICU admission. Measure details are shown in Supplementary information 1.
A panel of 15 experts rated commonly reported measures of quality via a modified Delphi process. Panelists were selected based on peer recommendations using criteria to include recognized expertise in neonatal outcomes and quality research, and geography. Participating panelists are acknowledged below.
The rating exercise was conducted between March and August 2008. We provided each panelist with introductory materials including explicit definitions of the selected quality measures, rating sheets for each measure and instructions regarding the Delphi process.28 Panelists were instructed to consider the applicability of all measures at the population level rather than their relevance to the individual patient. Panelists were asked to put themselves in the position of a NICU administrator evaluating performance reports in the context of local improvement efforts with current and historical clinical performance. We clarified that the Baby-MONITOR could potentially be used for either internal quality improvement or external comparative measurement of quality of care.
In addition, we asked panelists whether any measure not selected for the composite should be used as a sentinel indicator. Sentinel indicators describe individual events that are undesirable and represent the extreme of poor performance, triggering further analysis. Measures so designated would be added to the CPQCC report card in case of significant performance deviation (>2 s.d.) from the group mean. A two-thirds majority was required for measure designation as a sentinel indicator.
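The sentinel-indicator trigger described above (a deviation of more than 2 s.d. from the group mean) can be illustrated with a minimal sketch. The function name, the example rates and the NICU labels are hypothetical, and the choices of the sample standard deviation and of flagging deviations in either direction are assumptions; the text does not specify them.

```python
from statistics import mean, stdev


def flag_sentinel_deviations(rates, threshold_sd=2.0):
    """Return units whose rate lies more than threshold_sd standard
    deviations from the group mean, in either direction.

    Assumption: the sample standard deviation is used, and both
    unusually high and unusually low rates are flagged.
    """
    values = list(rates.values())
    group_mean = mean(values)
    group_sd = stdev(values)
    if group_sd == 0:
        return []  # no variation across units, so nothing deviates
    return sorted(
        unit
        for unit, rate in rates.items()
        if abs(rate - group_mean) > threshold_sd * group_sd
    )


# Hypothetical rates of a sentinel measure across ten NICUs.
example_rates = {
    "NICU_A": 0.10, "NICU_B": 0.11, "NICU_C": 0.09, "NICU_D": 0.10,
    "NICU_E": 0.12, "NICU_F": 0.10, "NICU_G": 0.11, "NICU_H": 0.09,
    "NICU_I": 0.10, "NICU_J": 0.40,
}
flagged = flag_sentinel_deviations(example_rates)
```

In this illustrative data set only the unit with the markedly higher rate exceeds the threshold and would have the measure added to its report card.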
Measure rating sheets included a summary of each measure including its classification, description, rationale, risk adjustment method, numerator and denominator, and variable type. Measures were rated on a scale of 1 to 9 (9 being best) for importance, reliability, validity, scientific soundness, usability and overall score. The specific prompts used to clarify these criteria are shown in Table 2.
In the first round of ratings, panelists individually rated the 28 measures. Panelists then received both individual and group ratings and discussed each measure and its ratings during two conference calls in May and June 2008. In all, 12 of 15 panelists participated in at least one of the conference calls, and conference call minutes were distributed among all panelists. The panelists then re-rated the 28 measures for overall score. Potential differences in ratings between conference call participants and non-participants were evaluated using t-tests.
In addition, we asked experts to attribute measures to the quality of care domains defined by the Institute of Medicine:2 safety (the ability to provide care with minimal detrimental errors), effectiveness (the proper use of evidence-based care and the ability of that care to attain a specific therapeutic objective), efficiency (the non-wasteful use of health-care resources), patient-centeredness (the ability to prioritize patient desires and values for guidance of clinical decisions), timeliness (the ability to provide timely care) and equity (care delivery independent of patients’ gender, race or socioeconomic status). Panelists also summarized measures according to their applicability for benchmarking. Each measure was graded as level 1, 2 or 3 as defined in Table 3.
Measures were selected for the Baby-MONITOR if they passed three criteria adopted from accepted gold standard methods developed by researchers at RAND studying the appropriateness of medical care delivery.28,29 The first was a high median rating (7 to 9). The second criterion tested for agreement via the hypothesis that 80% of the ratings were within the high range (7 to 9). If the hypothesis could not be rejected at P<0.33, the measure was rated ‘with agreement.’ The third criterion tested for disagreement via the hypothesis that 90% of the ratings were within one of two ranges (1 to 6 or 4 to 9). If that hypothesis could be rejected on a binomial test at P<0.1, the measure was rated ‘with disagreement.’ These significance levels were selected by RAND to accommodate variable panel sizes and provide a statistical solution to measure selection.30 We validated the results of the RAND approach with a method used by multinational European panels carried out as part of the BIOMED Concerted Action on Appropriateness designed for 15 panel members.31
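As an illustration of how the three criteria could be applied to one measure's ratings, the following is a minimal sketch, not the RAND implementation. It assumes a one-sided, lower-tail exact binomial test for both hypotheses, assumes that the disagreement test uses whichever of the two wide ranges contains more ratings, and assumes that a selected measure must have a high median, be rated with agreement and not be rated with disagreement. All function names are hypothetical.

```python
from math import comb
from statistics import median


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def count_in_range(ratings, low, high):
    """Number of ratings between low and high, inclusive."""
    return sum(low <= r <= high for r in ratings)


def classify_measure(ratings):
    """Apply the three selection criteria to one measure's 1-9 ratings."""
    n = len(ratings)

    # Criterion 1: median rating in the high range (7 to 9).
    high_median = median(ratings) >= 7

    # Criterion 2: H0 "80% of ratings fall in 7-9"; the measure is rated
    # 'with agreement' if H0 cannot be rejected at P < 0.33.
    p_agree = binom_cdf(count_in_range(ratings, 7, 9), n, 0.80)
    with_agreement = p_agree >= 0.33

    # Criterion 3: H0 "90% of ratings fall in 1-6 or in 4-9"; the measure
    # is rated 'with disagreement' if H0 is rejected at P < 0.1.
    # Assumption: the range holding more ratings is the one tested.
    best = max(count_in_range(ratings, 1, 6), count_in_range(ratings, 4, 9))
    p_disagree = binom_cdf(best, n, 0.90)
    with_disagreement = p_disagree < 0.10

    return {
        "high_median": high_median,
        "with_agreement": with_agreement,
        "with_disagreement": with_disagreement,
        "selected": high_median and with_agreement and not with_disagreement,
    }


# Hypothetical panels of 15 raters.
unanimous = classify_measure([9] * 15)
split = classify_measure([1, 1, 1, 2, 2, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9])
```

The second hypothetical panel shows why a high median alone is insufficient: its median rating is 9, yet five low ratings are enough to fail the agreement test and trigger the disagreement test under these assumptions.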
We assessed clinician agreement with the Delphi panel by surveying a national sample of 46 neonatal intensive care practitioners. Study subjects were selected based on recommendation by the District Chairs of the American Academy of Pediatrics Perinatal Section on account of the following qualities: board certification, experience with quality improvement, peer respect, public/private mix and geographic distribution. Contacts for enrollment were limited to three attempts. Clinician participants received electronic surveys, which provided general instructions and detailed information on each quality measure, including attribute ratings by the Delphi panel. We then asked the clinician group to indicate their level of agreement with the Delphi panel for each measure using a 5-point scale (much too high, slightly too high, reasonable, slightly too low, much too low). In addition, we assessed whether clinicians would select the same measures as the Delphi panel for inclusion in the composite based on an up or down vote on each measure and a pre-specified two-thirds majority for inclusion. In this survey, we inquired about only 27 measures of quality, as one (duration of ventilation) had not been consistently recorded in the database.
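The two-thirds majority rule for the up or down vote can be expressed as a short check. This is a minimal sketch with a hypothetical function name; it assumes that exactly two-thirds of the votes is sufficient, which is consistent with the threshold of at least 10 of 15 panelist votes reported for sentinel indicators.

```python
def passes_two_thirds(yes_votes, total_votes):
    """Return True if yes_votes is at least two-thirds of total_votes.

    Integer arithmetic avoids floating-point rounding at the boundary.
    Assumption: exactly two-thirds counts as a majority.
    """
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    if not 0 <= yes_votes <= total_votes:
        raise ValueError("yes_votes must be between 0 and total_votes")
    return 3 * yes_votes >= 2 * total_votes


# Hypothetical tallies for a panel of 15 voters.
ten_of_fifteen = passes_two_thirds(10, 15)
nine_of_fifteen = passes_two_thirds(9, 15)
```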
Table 4 exhibits Delphi panelist ratings of measures according to importance, reliability, validity, scientific soundness, usability, and overall score in Round 1 and overall score in Round 2. Of 28 measures considered, 13 had high median ratings (7 to 9). Of these, 9 met the criteria for inclusion in the composite (bolded) using either the RAND or the European BIOMED study selection criteria.
Overall scores changed little between the two rounds of ratings. In general, measures not favored by panelists in the first round became less favored in the second round and the ones that were favored became more so. Discussions among panelists led to the inclusion of three measures (oxygen at 36 weeks postmenstrual age, any human milk at discharge and growth velocity) in the final round of rating that would not have been selected following the initial round of ratings. These new inclusions were the result of either an increased median score or decreased variability in scores from initial to final round ratings.
Among the four measures not selected for the final composite, one was too similar to another more highly rated measure (panelists preferred in-hospital mortality over 28-day mortality). The other three measures failed the agreement criterion. For example, although the median rating was high, the early surfactant measure generated substantial disagreement among the panel with regard to whether or not infants given a trial of continuous positive airway pressure should be included in the denominator. Therefore, following RAND methods, the measure was excluded. (A table summarizing the expert discussions for each measure is available in the electronic version as Supplementary information 2.)
Final ratings from panelists who did not participate in either of the two conference calls did not differ significantly from those of panelists who participated in at least one of the conference calls (P>0.1). Among measures not selected for the composite, severe intracranial hemorrhage (>grade 2), temperature measurement within the first hour of admission, postnatal steroids for chronic lung disease and length of stay were selected as ‘sentinel indicators’ based on ≥10 of 15 votes by panelists.
Panelist attributions of measures to domains of quality are presented in Figures 1 and 2. Most measures, whether selected for the Baby-MONITOR or not, map onto the domains of safety and effectiveness. Among the nine measures selected for the composite, panelists assigned four (timely ROP exam, hypothermia on admission, late onset infection and pneumothorax) to the domain of safety and five (antenatal steroids, oxygen at 36 weeks postmenstrual age, growth velocity, any human milk at discharge and in-hospital mortality) to the domain of effectiveness.
Of clinician nominees, 23 (47.8%) responded to the survey. We found high levels of agreement between clinical neonatologists and the Delphi panel. Survey participants selected the same nine measures for inclusion in the composite as the panel from the original study. For these nine measures, 74% of clinicians indicated that Delphi panel rating was reasonable; 18% thought Delphi panel ratings were slightly too high.
We report a systematic rating process to select measures of neonatal intensive care quality as candidates for inclusion into the Baby-MONITOR, a composite indicator. As composite indicators are being used increasingly for provider profiling, rigorous development methods are necessary to provide accurate feedback regarding performance. Multiple strategies for indicator selection have been developed. These strategies can be classified into two basic methodologies: participatory and statistical. Participatory methods, such as the modified Delphi method used in this study, optimize face validity. While this results in a composite indicator that may be acceptable to users, it may contain measures that contribute little to the measurement of overall quality of care. On the other hand, statistical methods (for example, factor analysis and principal component analysis) provide a more mathematically parsimonious indicator set, but may lack face validity.23 We favored using the participatory method for indicator selection to enhance acceptability of the Baby-MONITOR among neonatologists. While other participatory group methods exist (for example, consensus development conference), we preferred the widely used and generally accepted Delphi method because it does not require consensus among a conferencing group, and therefore, is less dictated by the opinions of dominant individuals.
The NICU setting may be ideal for the development of composite indicators, as many patients remain in a single physical location or a defined network and are under the control of one care group for the duration of the initial hospital stay. This allows for better attribution of responsibility for care quality to individual NICUs than is the case for other inpatient and ambulatory care settings, in which patients are treated by a multitude of providers in different locations. In addition, standardized clinical quality measures have been developed and are being collected by numerous NICUs. The NICU setting therefore provides a good framework to test whether global measurement of quality via composite indicators will support comprehensive improvements in care delivery.
This paper presents an explicit quantification of the quality of commonly recorded measures of neonatal intensive care. Panelists rated 9 of 28 measures sufficiently high and with agreement to satisfy criteria for inclusion in a composite indicator of quality. Our results have important implications for the neonatal quality improvement enterprise in that selected measures suggest areas of priority, having been rated as highly important, valid and amenable to improvement. In addition, our results guide the need for future research and measure refinement to ensure that data collection efforts yield measures of high value and little dispute among users. For example, ‘severe ROP’ did not meet the panel’s approval for inclusion in the composite, despite its clinical importance and prominence as a potential target for quality improvement. Panelists were concerned about transfer bias and lack of ascertainment after early discharge, which may provide undue credit to NICUs that transfer out their patients for higher level care or discharge them before the peak incidence of ROP. Concerns regarding transfer bias already led to efforts by CPQCC and the VON to ensure better linkage of patient outcomes with treating hospitals in order to avoid giving credit to hospitals that transfer their poor outcomes and punishing hospitals who receive them. However, our study indicates an urgent need to research the postdischarge conversion rate to severe ROP so that a true NICU-specific severe ROP rate can be calculated. The need for such longitudinal research also highlights the need for further integration of care services and the measurement thereof, so that quality of care can be evaluated and improved comprehensively.
It is notable that the most highly rated measure of quality, antenatal steroid administration, is really a measure of perinatal care quality. Panelists acknowledged this but affirmed neonatologists’ responsibility to influence their obstetrical colleagues’ care provision with respect to this therapy within their institutions. Some even suggested that an NICU’s sphere of influence should extend beyond its own walls to include its referral network (currently, outborn infants are excluded from analysis for this measure).
The panelists’ view is concordant with current health policy priorities, which aim at improving care coordination among specialties through Accountable Care Organizations (ACOs).32 An ACO is a health systems model that aims to integrate services along the continuum of care across different institutions and care settings. One way to promote the development of ACOs is to align quality measurement with underlying health policy intent. Specifically, longitudinal measurement of quality across different care settings may give providers an impetus to coordinate high-quality care delivery for patients for whom they share responsibility.
Safety and effectiveness were the primary domains of quality assigned to the selected nine measures. These results imply that in its first iteration, the Baby-MONITOR will contain only two rather than all six of the Institute of Medicine’s domains of quality. While safety and effectiveness reflect areas of health policy priority, our results highlight the need for additional research to develop new measures in other domains, or refine existing measures or data collection methods for existing measures. For example, one major concern regarding length of stay as an efficiency measure was the inability to assess the safety of earlier discharge due to the lack of data on postdischarge medical resource utilization. Once such issues are resolved, future revisions of the composite can accommodate additional measures of quality.
Our findings should be viewed within the context of the study design. The Baby-MONITOR is based on measures available through the CPQCC and the VON. Therefore, measures may not be entirely generalizable to other data sources. However, these consortia receive data from over 900 NICUs worldwide, representing a robust sample for indicator development. In addition, the quality measures are closely comparable to those collected nationally and internationally by other large consortia, such as Pediatrix Medical Group or the Australian and New Zealand Neonatal Network.
Measure ratings may vary between and even within groups of experts. However, the high level of agreement between academic researchers and clinical neonatologists with regard to selecting measures of quality for a composite index of neonatal intensive care quality provides important face validity for the Baby-MONITOR. Although a response rate of close to 50% is common among physician surveys, we cannot exclude bias in our survey response. However, the direction of any potential bias is not determined easily.
Panelist discussions and ratings may be dominated by the most vocal participants. We attempted to minimize this effect in several ways. The first panel discussion was co-moderated by an independent researcher experienced in quality of care. In addition, we allotted time for additional comments and assigned each participant a group of measures for introduction to the group.
Our initial exclusion criteria should not be regarded as normative. It is our intent to develop a data set with the smallest degree of systematic bias against any individual hospital type. Any decision to include or exclude certain patients may lead to biases. In addition, data collection may change over time and richer data sets may allow for exclusion criteria to be altered. Moreover, some variation in inclusion criteria will not significantly alter NICU performance. We have shown that NICU performance ratings are largely insensitive to variations in definitions of mortality with regard to inclusion or exclusion of delivery room deaths, deaths before 12 h of life, 28-day mortality and in-hospital mortality. While the positions of top and bottom performing hospitals are very stable, most of the rank switching occurs in the middle tier.33 These results are consistent with the findings of others.26 Therefore, while one can reasonably disagree regarding the exact definitions of quality measures, their effect on comparative performance is often marginal.33 Nevertheless, in the development of the Baby-MONITOR we will address the uncertainty in any given measure through sensitivity analysis at the stage of measure aggregation so that the extent of bias can be explored.34
In a modified Delphi experiment, a panel of 15 experts selected 9 of 28 measures of quality for inclusion into a composite indicator of neonatal intensive care quality delivered to VLBW infants. In future work, we will aggregate the individual measures and test whether the resulting composite is robust and valid. Our systematic and transparent approach to indicator construction may serve as a template for developers in other health-care settings.
Supplement 1. Measure Definitions
Supplement 2. Expert Panel Commentary
JP, JBG, JAFZ and LAP led the design, data analysis, interpretation of results and writing of the manuscript. JAFZ, JBG and ARS participated as panelists and helped with panelist recruitment. ARS also contributed to writing of the manuscript. KMW and MAK coordinated the study, prepared study materials, assisted with data acquisition and contributed to writing of the manuscript. MM and KP undertook data analysis and interpretation. EJT contributed to the study design, interpretation of results, writing of the manuscript and moderated panel discussions. JP is guarantor of the paper. We gratefully acknowledge the contributions and effort by all expert panelists listed here alphabetically: Judy L Aschner, MD (Vanderbilt University), Reese H Clark, MD (Pediatrix Medical Group), Edward F Donovan, MD (University of Cincinnati), William H Edwards, MD (Dartmouth University), Gabriel E Escobar (Kaiser Permanente Medical Care Program), Donald A Goldmann, MD (Harvard University and Institute for Healthcare Improvement), Jeffrey H Horbar, MD (Vermont Oxford Network and University of Vermont), Martin J McCaffrey, MD (University of North Carolina, Chapel Hill), Lu-Ann Papile, MD (Baylor College of Medicine), Roger F Soll, MD (Vermont Oxford Network and University of Vermont), Jon E Tyson, MD, MPH (University of Texas, Houston), Michele C Walsh, MD (Case Western University). The following experts co-authored this manuscript: Jeffrey B Gould, MD, MPH, Ann R Stark, MD, John A Zupancic, MD, ScD. We also acknowledge the contribution and support from Dr Carl Bose in facilitating the validation survey component of this research. Jochen Profit’s contribution is supported in part by the Eunice Kennedy Shriver National Institute of Child Health and Human Development K23 HD056298-01 (PI Jochen Profit, MD, MPH). Dr Petersen was a recipient of the American Heart Association Established Investigator Award (Grant number 0540043N) at the time this work was conducted. 
Drs Petersen, Pietz, and Mr Mei also receive support from a Veterans Administration Center Grant (VA HSR&D CoE HFP90-20). Dr Thomas is supported by Eunice Kennedy Shriver National Institute of Child Health and Human Development Grant K24 HD053771.
Conflict of interest: Drs Profit, Zupancic and Gould will serve as Expert Consultants with the Vermont Oxford Network’s NICQ 7 Quality Improvement Collaborative. Dr Gould is the principal investigator for the California Perinatal Quality Care Collaborative.
The results of this study were presented as a poster at the Pediatric Academic Societies’ Annual Meeting in Baltimore on 5 May 2009.
Supplementary Information accompanies the paper on the Journal of Perinatology website (http://www.nature.com/jp)