J Perinatol. Author manuscript; available in PMC 2014 March 24.
Published in final edited form as:
PMCID: PMC3963391

Do Practicing Clinicians Agree with Expert Ratings of Neonatal Intensive Care Unit Quality Measures?

Marc Kowalkowski, M.S.,1,2 Jeffrey B Gould, M.D., M.P.H.,3,4 Carl Bose, M.D.,5 Laura A Petersen, M.D., M.P.H.,1,2 and Jochen Profit, M.D., M.P.H.1,2,6



To assess the level of agreement between two panels when selecting quality measures for inclusion in a composite index of neonatal intensive care quality (Baby-MONITOR): one panel composed of academic researchers (Delphi) and another composed of academic and clinical neonatologists (Clinician).


In a modified Delphi process, a panel rated 28 quality measures. We assessed clinician agreement with the Delphi panel by surveying a sample of 48 neonatal intensive care practitioners. We asked the clinician group to indicate their level of agreement with the Delphi panel’s rating of each measure using a five-point scale (much too high, slightly too high, reasonable, slightly too low, and much too low). In addition, we asked clinicians to select measures for inclusion in the Baby-MONITOR by a yes-or-no vote, with a pre-specified two-thirds majority required for inclusion.


Twenty-three (47.9%) of the clinicians responded to the survey. We found high levels of agreement between the Delphi and clinician panels, particularly across measures selected for the Baby-MONITOR. Clinicians selected the same nine measures for inclusion in the composite as the Delphi panel. For these nine measures, 74% of clinicians indicated that the Delphi panel rating was ‘reasonable’.


Practicing clinicians agree with an expert panel on the measures that should be included in the Baby-MONITOR, enhancing face validity.

Keywords: infant, newborn, quality of health care, measurement, composite indicator


Composite indicators are being used increasingly to track provider performance in adult health care settings (1–3). Composite indicators aggregate individual measures into a single summary score. They can provide broad insights into quality and its trends for external benchmarking against other providers’ institutions, and they facilitate the tracking of quality improvement efforts within institutions. However, the process of developing composite indicators is complex, and developers have to make choices in their construction that may significantly influence performance ratings. Thus, a standardized and explicit approach to indicator development is necessary and has been described elsewhere (4;5).

Briefly, composite indicator development begins with a theoretical framework which provides the foundation for the selection and aggregation of variables. Data selection targets variables with strong analytical soundness, measurability, and relevance to the measured event. Additionally, developers examine the completeness, structure, and comparability of each selected variable, as well as the methods for weighting and aggregation. Finally, developers evaluate uncertainty (i.e., around aggregate ranks), transparency (i.e., explaining relative significance of individual domains), and linkages (i.e., to other indicators), and provide coherent visual tools for interpretation.
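
As a purely illustrative sketch of the aggregation step, and not the Baby-MONITOR method itself, the following Python standardizes each measure across units to z-scores and combines them with a weighted average. The unit data, the measure names, the use of z-scores, and the default of equal weights are all assumptions made for this example.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and population SD 1."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # No variation across units: every unit sits at the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

def composite_scores(measures, weights=None):
    """Combine per-unit measures into one weighted score per unit.

    measures: dict of measure name -> list of values (one per unit),
              all oriented so that higher means better.
    weights:  optional dict of measure name -> weight; equal if omitted.
    """
    names = list(measures)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    standardized = {name: z_scores(measures[name]) for name in names}
    n_units = len(next(iter(measures.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in names) / total_weight
        for i in range(n_units)
    ]

# Hypothetical values for three units (not data from any study).
example = {
    "measure_a": [0.90, 0.80, 0.70],
    "measure_b": [0.60, 0.90, 0.75],
}
scores = composite_scores(example)
```

Changing the weights, or the standardization, can reorder the units, which is one reason the choices made during construction need to be explicit.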

We are working to develop a composite indicator (CI) of the quality of care delivered to very low birth weight (VLBW) infants, specifically the Measure Of Neonatal InTensive care Outcomes Research, or Baby-MONITOR. In a previous study, a panel of neonatal outcome researchers utilized a modified Delphi method to assign scores from 1–9 (9 = best) to 28 measures of quality routinely collected at the NICU level by the California Perinatal Quality Care Collaborative (CPQCC) and Vermont Oxford Network (VON) (6). Based on the panel’s ratings, nine measures were selected for inclusion in the Baby-MONITOR. It is possible that a sample of largely academic researchers and quality improvement experts, which composed our original panel, would have a biased view regarding quality of care measurement. We address this concern by examining agreement between the Delphi panel’s selections of quality measures and those of clinical providers of neonatal intensive care.


Physician Nominations

Clinical neonatologists were nominated for study participation via executive channels within the American Academy of Pediatrics (AAP) Section on Perinatal Pediatrics. We invited the District Chairs from each of the ten AAP districts to provide three to four nominations of clinical neonatologists from within their district. Nominees were Board-certified in Neonatal-Perinatal Medicine and had experience in quality improvement. They were judged by the nominator to be respected by their peers. Nominees were selected to create a sample that represented both public and private hospitals, geographically distributed throughout the districts.

Nominees were contacted electronically. Consent was implied by those who responded with intent to participate. The study was approved by the Baylor College of Medicine Institutional Review Board.

Measure Selection

Details of the clinical measures and the process for selecting a candidate measure set have been previously described (6). Briefly, 28 routinely reported process and outcome quality measures were selected by consensus of a panel of quality improvement researchers. These measures were selected from CPQCC and VON operations manuals. The selection process was designed for inclusiveness and only excluded measures of surgical quality. For the current study, one of the 28 previously selected measures (duration of ventilation) was excluded because of inconsistent recording in the CPQCC database.

Survey Instrument

Participants were provided with a summary of each quality measure along with ratings from the Delphi panel. The Delphi panel’s ratings were presented as median scores across five measurement domains: importance, reliability, validity, scientific soundness, and usability (rated on a scale of 1–9; 9 = best). In addition, we provided the Delphi panel’s overall median rating.

Data Collection

Clinicians utilized the measure summary information, as well as the ratings generated by the previous Delphi panel, to assess their level of agreement with the panel’s ratings. Clinicians reported one measure of agreement, using a five-point scale (the measure is rated much too high, slightly too high, reasonable, slightly too low, or much too low), to represent their assessment of the panel’s full complement of ratings, including individual domain and overall scores. In addition to structured multiple-choice response options, respondents provided free-text feedback regarding the measure itself and their rationale for selecting a specific response option. After evaluating each measure independently, respondents designated which of the 27 measures to incorporate into the Baby-MONITOR. Prior to data collection, we specified that measures must receive support from a two-thirds majority to be selected for the composite. Lastly, clinicians provided information regarding practice setting (academic versus private), NICU-level designation (using established criteria (7)), years in practice, level of research participation, quality improvement involvement, and location of medical training (foreign versus US medical graduate).
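
The pre-specified inclusion rule can be sketched as follows. This is a minimal illustration, assuming the threshold is inclusive (a share of exactly two-thirds counts as support); the measure names and vote tallies are hypothetical and are not data from the study.

```python
from fractions import Fraction

# Pre-specified two-thirds majority; exact arithmetic avoids rounding issues.
THRESHOLD = Fraction(2, 3)

def is_selected(yes_votes, total_votes, threshold=THRESHOLD):
    """Return True when the share of 'yes' votes reaches the threshold."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    return Fraction(yes_votes, total_votes) >= threshold

# Hypothetical 'yes' tallies from a 23-member panel.
votes = {"measure_x": 16, "measure_y": 15, "measure_z": 23}
selected = [name for name, yes in votes.items() if is_selected(yes, 23)]
```

With 23 voters, 16 "yes" votes (about 70%) clear the threshold, whereas 15 (about 65%) do not.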

Statistical Evaluation

SAS version 9.2 (SAS Institute Inc., Cary, North Carolina, USA) was utilized for data analyses. Descriptive statistics were calculated to determine the distribution of characteristics in the Clinician panel. Additionally, frequency distributions were generated to determine the percentage of agreement with Delphi panel ratings for each measure. Because the information provided by the respondents was being used to determine the appropriateness of the measures included in the Baby-MONITOR, we defined “agreement” as all ratings that did not indicate a strong divergence from the expert panel’s opinions (i.e., much too high or much too low). Potential differences in the reported percentage of agreement were evaluated in comparisons of clinicians practicing in academic versus private settings, using chi-square tests. Additional comparisons were conducted of the percentage of agreement between measures selected versus not selected for the composite by the Delphi panelists.
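
The agreement definition and a two-group comparison can be illustrated in a few lines of Python. This is a sketch only: the analyses reported here were run in SAS, and the choice of an uncorrected Pearson chi-square on a 2×2 table is an assumption made for the example, as are the function names and the sample responses.

```python
import math

# Ratings that signal strong divergence from the Delphi panel.
EXTREME = {"much too high", "much too low"}

def percent_agreement(responses):
    """Percentage (0-100) of responses that are not an extreme rating."""
    if not responses:
        raise ValueError("responses must not be empty")
    agree = sum(1 for r in responses if r not in EXTREME)
    return 100.0 * agree / len(responses)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square, without continuity correction, for [[a, b], [c, d]].

    Returns (statistic, p_value) on 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical responses for one measure (not data from the study).
sample = ["reasonable", "slightly too high", "much too low", "reasonable"]
agreement = percent_agreement(sample)
```

In `chi_square_2x2`, rows could hold the practice setting (academic, private) and columns the outcome (agree, disagree). With cell counts as small as those in this survey, an exact test may be preferable to the chi-square approximation.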



District chairs nominated 48 neonatologists, representing all ten AAP districts. Of the 48 nominees, 30 clinicians consented to participate. At the completion of the study, 23 of the 30 consenting participants had completed and submitted responses. Nominations and responses were received over a period of nine months, from January to September 2009.
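
The participation figures above imply the following rates, shown here as a short worked calculation. The counts are those reported in the text; the helper name `percent` is introduced only for this illustration.

```python
def percent(part, whole):
    """Percentage of whole represented by part, rounded to one decimal."""
    return round(100.0 * part / whole, 1)

nominated, consented, completed = 48, 30, 23

consent_rate = percent(consented, nominated)     # 62.5
completion_rate = percent(completed, consented)  # 76.7
overall_rate = percent(completed, nominated)     # 47.9
```

The overall rate of 47.9% matches the response rate reported for the survey.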

Clinician characteristics are shown in Table 1. Frequency distributions for practice setting; years in practice since completion of subspecialty training; NICU level designation; involvement in quality improvement; clinical research participation; and foreign medical education are displayed. Respondents represented geographically diverse regions of the US, including nine out of ten districts from the AAP Section on Perinatal Pediatrics. None of the nominees from District 10, which includes Alabama, Florida, Georgia, and Puerto Rico, provided responses. Overall, a majority of the respondents were academic clinicians (67%); in practice for ten years or more (86%); involved in quality improvement (95%); and US medical graduates (100%). Approximately one half of respondents reported practicing medicine at a NICU with a Level 3C (7) subspecialty designation and participating in clinical research as either a principal or co-investigator. Similar to respondents, a majority of non-respondents were from academic hospitals (60%; data not shown).

Table 1
Characteristics of respondents

Clinician Agreement

Surveyed clinicians reported a high degree of agreement with the previous Delphi panel’s ratings (see Table 2). Overall, 62% of clinician responses across the 27 measures indicated the ratings provided by the Delphi panel were ‘reasonable’, and only 6% indicated extreme disagreement with the Delphi panel’s ratings (‘much too high’ or ‘much too low’). Furthermore, the Delphi panel’s ratings for one measure, timely ROP examination, received almost unanimous support from clinicians (96%). However, there were also measures that engendered disagreement between the panels. The three measures with the most discordant responses were feeding with human milk only at discharge; surfactant administration within two hours of birth; and intracranial hemorrhage severity > grade 2. For example, clinician opinion regarding the Delphi panel’s ratings of ‘feeding with human milk only at discharge’ was distributed across all response options, including ‘much too high’ and ‘much too low’ (‘reasonable’ = 46%). However, when slight agreement and disagreement are taken into account, even this most “controversial” measure achieved an overall agreement of 80%.

Table 2
Frequency Distribution of Clinician Responses for Each Measure Evaluated for Inclusion in Composite Indicator of NICU Quality of Care (Baby-MONITOR)

Appendix 1, available electronically as supplementary material, summarizes clinician feedback regarding aspects of each of the measures assessed by the panel. On average, there were seven comments provided for each measure. Comments were categorized by measure definition, usability, and validity. With the exception of duplicate content, all of the clinician comments were presented in Appendix 1.

Respondents reported several recurring criticisms of the measures under evaluation. Frequently, clinicians indicated that patient transfers in and out of the NICU may significantly impact inter-unit comparisons. Thus, to facilitate accurate assessments, definitions must appropriately adjust for patient transfers. Additionally, clinicians reported concerns about quality of care being measured on variables significantly influenced by factors outside of their control. Furthermore, respondents expressed apprehension around the variability in policies, practices, and techniques that may obscure comparisons between different units. Finally, respondents suggested that several measures were susceptible to “gaming” and that risk-adjustment models have not been sufficiently tested and validated.

Differences by Clinician Practice Setting

Clinicians from different practice settings responded similarly to the ratings for most measures (p>0.1). However, results indicate clinicians practicing in private settings viewed some of the panel’s ratings differently than those practicing in academic settings. Most notably, clinicians in private settings reported that surfactant administration was rated too high by the research panel while those affiliated with academic institutions reported the panel’s ratings were reasonable (p=0.05). Additionally, no private practice setting clinicians (n=7) voted to include surfactant in the composite. Conversely, compared to academic clinicians, those practicing in private settings tended to report that length of stay was rated too low (p=0.06) and to include the measure in the composite indicator more frequently than academic clinicians (p<0.1). Neither of these measures was included in the Baby-MONITOR by the Delphi panel.

Measure Selection

The Delphi panel had selected nine measures of quality for inclusion in the Baby-MONITOR. In the current study, we pre-specified a two-thirds majority of clinician responses as evidence for agreement. Using this criterion, the clinicians independently selected the same nine measures as the Delphi panel. The percentage agreement ranged from 66% to 100% across the selected measures. Clinicians reported significantly higher agreement with the Delphi panel’s ratings across the measures included in the composite, compared to measures not included (p<0.001). Table 3 presents the median ratings for each of the measures selected by the Delphi panel; the percentage of clinician agreement with the panel’s ratings; and the percentage of clinicians in favor of including the measure in the Baby-MONITOR.

Table 3
Clinician Agreement with Research Panel Ratings and Selection of Measures for Inclusion in Composite Indicator of NICU Quality of Care (Baby-MONITOR)


We report results from an electronic survey to assess agreement with the selection of measures of NICU quality-of-care for inclusion in the Baby-MONITOR. Our principal finding was a high level of agreement with the Delphi panel’s selection of measures into the Baby-MONITOR by a sample of clinical neonatologists.

Comparative performance measurement has become a national health policy priority in an attempt to promote better quality of care and decrease health care expenditures (8;9). Composite indicators have become popular because they are able to summarize otherwise complex information (5;10). However, the selection of quality measures into composite indicators must undergo a careful vetting process in order to avoid undue bias on the part of the developers. We have undertaken a careful, explicit, and structured measure selection process for the Baby-MONITOR (6). Results from the current study demonstrate a high degree of agreement between the clinical neonatologists surveyed and the Delphi panel’s ratings. The Delphi and clinical panels identified identical measure sets for the Baby-MONITOR. We, therefore, believe this measure set to represent a state-of-the-art selection of robust measures of neonatal intensive care quality. This measure set is based on available measures within the CPQCC and VON consortia. These organizations include more than nine hundred member institutions; therefore, the measure set has the potential for broad application.

The substantial agreement between the Delphi and Clinical panels among the measures selected for the Baby-MONITOR provides powerful support for its acceptability as a measurement tool. The panels also agreed strongly with regard to the measures least suitable to measure NICU quality of care (e.g., VLBW volume, any intracranial hemorrhage, oxygen at discharge), which calls into question the utility of allocating resources to modifying the outcomes described in these measures. Lastly, the disagreement observed across measures with moderate ratings (e.g., only human milk at time of discharge, surfactant within two hours of birth, severe intracranial hemorrhage) suggests that these measures may be functional but require modifications or additional research to enhance their value as quality measures.

Comments in Appendix 1 emphasize potential subjective weaknesses of each of the quality measures. However, empirical evidence will be needed to assess whether changes in measure definitions would improve ratings of measures not currently included in the Baby-MONITOR (11). Additionally, these comments highlight the struggle involved in the development of standardized quality measures. Inherent practice-based differences create challenges in balancing measurement uniformity with specificity and applicability. These factors support the utilization of longitudinal comparisons within organizations, rather than comparisons across organizations.

Clinicians from different practice settings agreed upon most of the ratings of the Delphi panel. Most importantly, clinicians from different practice settings reported agreement for all measures selected for inclusion in the Baby-MONITOR. However, in our sample, clinicians practicing in private hospitals indicated length of stay was rated too low by the previous researcher panel, compared to clinicians from academic hospitals. There is a large degree of variation in length of stay for VLBW newborns (12). This variation may be explained by a variety of factors not directly associated with quality of care (e.g., financial incentives, transfer bias) (8;13).

Clinicians also reported disagreement over the rating for surfactant administration, which may reflect the current lack of consensus regarding surfactant utilization. Studies conducted during the 1990s firmly established intubation and early surfactant administration as the best therapy for the treatment of respiratory distress syndrome in premature babies (14;15). However, more recently there has been a movement to reduce the negative long-term effects associated with mechanical ventilation, such as chronic lung disease and ventilator-associated pneumonia. Specifically, researchers have evaluated the benefits of avoiding intubation through the early use of nasal continuous positive airway pressure (16–19). A best-practice consensus has yet to be established, potentially explaining the variation in responses across academic and private settings.

The findings from the current study should be viewed within the context of the study design. We collected responses only from clinicians with membership in the AAP. While the majority of the country’s neonatologists are members of AAP, our results may not generalize to other samples. In addition, our sample size was relatively small, because the survey required time-consuming consideration and commentary that was not suitable for mass distribution. Our final sample included fewer clinicians from the private practice setting than intended. However, the current sample does represent a much more practice-focused group than the initial Delphi panel. Therefore, we think the strong agreement between the clinician and Delphi panels enhances the generalizability of our results. Overall, the Baby-MONITOR has now undergone an extensive formal measure selection process.


Our results indicate a high level of agreement between a panel of researchers and clinical neonatologists concerning the selection of appropriate quality measures for the Baby-MONITOR, enhancing its face validity.

Figure 1
Deviation among clinicians’ responses evaluating ratings of 27 measures of NICU care quality

Supplementary Material



We gratefully acknowledge the contributions and effort by all panelists listed here, alphabetically: Francis J. Bednarek, MD (UMass Memorial Health Care, University of Massachusetts Medical School); Richard E. Bell, MD (Northbay Medical Center); Carl J. Bodenstein, MD (Neonatology Associates Spokane, Sacred Heart Children’s Hospital); Robert Boyle, MD (University of Virginia); James A. Cook, MD (Geisinger Health System); Leandro Cordero, MD (The Ohio State University Medical Center); David Fisher, MD (Levine Children’s Hospital at Carolinas Medical Center); Ronnie Guillet, MD (University of Rochester Medical Center); Victor Herson, MD (Connecticut Children’s Medical Center); Michael J Horgan, MD (Children’s Hospital at Albany Medical Center); Priscilla Joe, MD (Children’s Hospital & Research Center Oakland); Jonathan Nedrelow, MD (Cook Children’s Medical Center); Mary Revenis, MD (Children’s National Medical Center); William D. Rhine, MD (Stanford University); Renate Savich, MD (University of New Mexico); Lawrence Skolnick, MD (MidAtlantic Neonatology Associates, Morristown Memorial Hospital); Susan F. Townsend, MD (Memorial Health System); Robert Ursprung, MD (Cook Children’s Medical Center). Note: additional panelists did not release their information to receive acknowledgement.

Funding support: Jochen Profit’s contribution is supported, in part, by the Eunice Kennedy Shriver National Institute of Child Health and Human Development #1 K23 HD056298-01 (PI: Profit). At the time this work was conducted, Dr. Petersen was a recipient of the American Heart Association Established Investigator Award (#0540043N). Drs. Petersen, Profit, and Mr. Kowalkowski also receive support from a Veterans Administration Center Grant (VA HSR&D CoE HFP90-20). The results of this study were presented as a poster at the Pediatric Academic Societies’ Annual Meeting in Vancouver, British Columbia on May 4, 2010.


CI: composite indicator
CLD: chronic lung disease
CPQCC: California Perinatal Quality Care Collaborative
GA: gestational age
ICH: intracranial hemorrhage
IQR: interquartile range
NEC: necrotizing enterocolitis
NICU: neonatal intensive care unit
ROP: retinopathy of prematurity
VLBW: very low birth weight
VON: Vermont Oxford Network


Conflict of interest

Drs. Profit and Gould are serving as Expert Consultants with the Vermont Oxford Network’s NICQ 7 Quality Improvement Collaborative. Dr. Gould is the Principal Investigator for the California Perinatal Quality Care Collaborative.

Author Contributions: MAK, JP, JBG, CB, and LAP led the design, data analysis, and interpretation of results. MAK coordinated the study, prepared study materials, and assisted with data acquisition. MAK and JP undertook data analysis and interpretation. MAK and JP wrote the first draft of the manuscript. CB, JBG, and LAP reviewed and edited the manuscript. JP is guarantor of the paper.

Reference List

1. Campbell SM, Roland MO, Middleton E, Reeves D. Improvements in quality of clinical care in English general practice 1998–2003: longitudinal observational study. BMJ. 2005;331:1121.
2. Premier Inc. Summary of the composite indicator scoring methodology. Available at: Accessed on 11.22.2007.
3. Roland M. Linking physicians’ pay to the quality of care: a major experiment in the United Kingdom. N Engl J Med. 2004;351:1448–54.
4. Nardo M, Saisana M, Saltelli A, Tarantolo S, Hoffman A, Giovanini E. Handbook on Constructing Composite Indicators: Methodology and User Guide. Paris, France: OECD Publishing; 2005.
5. Profit J, Typpo KV, Hysong SJ, Woodard LD, Kallen MA, Petersen LA. Improving benchmarking by using an explicit framework for the development of composite indicators: an example using pediatric quality of care. Implement Sci. 2010;5:13.
6. Profit J, Gould JB, Zupancic JA, Stark AR, Wall KM, Kowalkowski MA, et al. Formal selection of measures for a composite index of NICU quality of care: Baby-MONITOR. J Perinatol. 2011 Feb 24 [Epub ahead of print].
7. Committee on Fetus and Newborn. Levels of neonatal care. Pediatrics. 2004;114:1341–7.
8. Petersen LA, Woodard LD, Urech T, Daw C, Sookanan S. Does pay-for-performance improve the quality of health care? Ann Intern Med. 2006;145:265–72.
9. Rosenthal MB, Fernandopulle R, Song HR, Landon B. Paying for quality: providers’ incentives for quality improvement. Health Aff. 2004;23:127–41.
10. Freudenberg M. Composite indicators of country performance: a critical assessment. Available at: Accessed on 2.22.2006.
11. Profit J, Gould JB, Draper D, Zupancic JAF, Woodard LD, Pietz K, Petersen LA. Variations in definitions of mortality have little influence on NICU outcome performance ratings. Under review at JAMA.
12. Lemons JA, Bauer CR, Oh W, Korones SB, Papile LA, Stoll BJ, et al. Very low birth weight outcomes of the National Institute of Child Health and Human Development Neonatal Research Network, January 1995 through December 1996. NICHD Neonatal Research Network. Pediatrics. 2001;107:e1.
13. Profit J, Zupancic JA, Gould JB, Petersen LA. Implementing pay-for-performance in the neonatal intensive care unit. Pediatrics. 2007;119:975–82.
14. Yost CC, Soll RF. Early versus delayed selective surfactant treatment for neonatal respiratory distress syndrome. Cochrane Database Syst Rev. 2000:CD001456.
15. Soll RF, Morley CJ. Prophylactic versus selective use of surfactant in preventing morbidity and mortality in preterm infants. Cochrane Database Syst Rev. 2001:CD000510.
16. Morley CJ, Davis PG, Doyle LW, Brion LP, Hascoet JM, Carlin JB, et al. Nasal CPAP or intubation at birth for very preterm infants. N Engl J Med. 2008;358:700–8.
17. Rojas MA, Lozano JM, Rojas MX, Laughon M, Bose CL, Rondon MA, et al. Very early surfactant without mandatory ventilation in premature infants treated with early continuous positive airway pressure: a randomized, controlled trial. Pediatrics. 2009;123:137–42.
18. Sandri F, Plavka R, Ancora G, Simeoni U, Stranak Z, Martinelli S, et al. Prophylactic or early selective surfactant combined with nCPAP in very preterm infants. Pediatrics. 2010;125:e1402–e1409.
19. SUPPORT Study Group of the Eunice Kennedy Shriver NICHD Neonatal Research Network. Finer NN, Carlo WA, Walsh MC, Rich W, Gantz MG, et al. Early CPAP versus surfactant in extremely preterm infants. N Engl J Med. 2010;362:1970–9.