We evaluated outcomes for common operations in the STS Congenital Heart Surgery Database (STS-CHSDB) to provide contemporary benchmarks and examine variation between centers.
Patients undergoing surgery from 2005-2009 were included. Centers with >10% missing data were excluded. Discharge mortality and postoperative length of stay (PLOS) among patients discharged alive were calculated for eight benchmark operations of varying complexity. Power for analyzing between-center variation in outcome was determined for each operation. Variation was evaluated using funnel plots and Bayesian hierarchical modeling.
A total of 18,375 index operations at 74 centers were included in the analysis of eight benchmark operations. Overall discharge mortality (range) was: ventricular septal defect repair (VSD) 0.6% (0%–5.1%), tetralogy of Fallot repair (TOF) 1.1% (0%–16.7%), complete atrioventricular canal repair (AVC) 2.2% (0%–20%), arterial switch (ASO) 2.9% (0%–50%), ASO+VSD 7.0% (0%–100%), Fontan 1.3% (0%–9.1%), truncus repair 10.9% (0%–100%), Norwood 19.3% (2.9%–100%). Funnel plots showed that the numbers of centers characterized as outliers were: VSD=0, TOF=0, AVC=1, ASO=3, ASO+VSD=1, Fontan=0, Truncus=4, Norwood=11. Power calculations showed that statistically meaningful comparisons of mortality rates between centers could only be made for Norwood, for which the Bayesian-estimated range (95% Probability Interval) was 7.0% (3.7%-10.3%) to 41.6% (30.6%-57.2%). Between-center variation in PLOS was analyzed for all operations and was larger for more complex operations.
This analysis documents contemporary benchmarks for common pediatric cardiac surgical operations and the range of outcomes among centers. Variation was most prominent for the more complex operations. These data may aid in quality assessment and quality improvement initiatives.
The Congenital Heart Surgery Database of the Society of Thoracic Surgeons (STS–CHSDB) is the largest database in North America that tracks the outcomes of pediatric and congenital cardiac surgery [1,2,3]. As of January 1, 2011, participants in the STS–CHSDB include 96 of the estimated 122 congenital cardiac surgical programs in the United States. One of the major goals of the STS–CHSDB is to facilitate the improvement of quality in pediatric cardiac surgical programs in North America.
The purpose of this analysis is to document current outcomes for common operations in the STS–CHSDB to provide contemporary benchmarks and examine variation in outcomes between centers. In this manuscript, the terms “centers” and “participants” are used as synonyms to denote pediatric and congenital cardiac surgical programs that participate in STS–CHSDB. The approach of using benchmark operations to assess the quality of care of pediatric cardiac surgical operations has been previously described. The goal of the analysis was to describe discharge mortality and postoperative length of stay (PLOS) for eight common potential benchmark operations of varying complexity and to examine between-participant variation in these endpoints. A related goal was to assess the feasibility of comparing institutions with these endpoints.
The study population includes patients who underwent operations with one of the Primary Procedures listed in Table 1 and met the inclusion and exclusion criteria listed in Table 1. Patients undergoing ASO are in a separate cohort from those undergoing ASO+VSD because the outcomes of these two groups are quite different. Furthermore, the presence or absence of a VSD is a nonmodifiable variable that is an intrinsic characteristic of the patient. In the Fontan cohort, patients undergoing “Fontan revision or conversion (Re-do Fontan)” were excluded. Patients ≥7 years of age undergoing Fontan were also excluded because they were considered less likely to be undergoing a primary Fontan operation.
Outcome variables in this analysis are mortality prior to discharge from the hospital (“discharge mortality”) and PLOS among patients discharged alive. In this manuscript, the word “mortality” is used to represent “discharge mortality” [7,8]. Previous publications from the STS–CHSDB have used PLOS as one measure of operative morbidity [7,8,9]. In these prior analyses, prolonged PLOS was regarded as a very general proxy measure of morbidity.
For each type of procedure, the overall and participant-specific discharge mortality rates and the overall and participant-specific average PLOS were calculated. Participant-specific results were summarized by the mean, median (50th percentile), interquartile range (25th and 75th percentiles), and range (minimum and maximum).
Participant-specific unadjusted mortality rates were depicted graphically in relation to the participant’s number of eligible cases (i.e. the participant’s sample size). Lines depicting exact 95% binomial prediction limits were overlaid to make a “funnel plot”. For each individual participant, the probability of observing a mortality rate that falls on or outside of the plotted prediction limits is <5%, if the participant’s true mortality rate is equal to the overall aggregate mortality rate of all STS participants in the analysis.
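As a rough illustration of how such limits can be obtained, the following minimal Python sketch computes exact binomial prediction limits for a center with a given number of cases, assuming the center's true mortality rate equals the overall aggregate rate. The function names (`binom_cdf`, `prediction_limits`, `is_outlier`) are hypothetical, and the equal-tailed convention used here (at most 2.5% probability in each tail) is an assumption; the exact convention used for the published funnel plots is not stated in the text.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(min(k, n) + 1))

def prediction_limits(n, p, level=0.95):
    """Return death counts (lo, hi) with P(X < lo) <= tail and P(X > hi) <= tail.

    Assumed convention: equal tails of (1 - level) / 2 on each side.
    """
    tail = (1 - level) / 2
    lo = 0
    # Raise lo while the probability of strictly fewer deaths stays within the tail.
    while binom_cdf(lo, n, p) <= tail:
        lo += 1
    hi = n
    # Lower hi while the probability of strictly more deaths stays within the tail.
    while hi > 0 and 1 - binom_cdf(hi - 1, n, p) <= tail:
        hi -= 1
    return lo, hi

def is_outlier(deaths, n, p, level=0.95):
    """Flag a center whose observed death count falls outside the limits."""
    lo, hi = prediction_limits(n, p, level)
    return deaths < lo or deaths > hi

# Example: limits (as rates) for centers of different size at an overall rate of 19.3%.
for n_cases in (10, 50, 200):
    lo, hi = prediction_limits(n_cases, 0.193)
    print(n_cases, lo / n_cases, hi / n_cases)
```

Plotting `lo / n` and `hi / n` against `n` gives the characteristic funnel shape: the limits are wide for small programs and narrow as the number of cases grows.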
The feasibility of analyzing between-participant variation in mortality was assessed by counting the number of participants that met the sample size required to achieve 50% power to detect a two-fold increase in the mortality rate (vs. the overall aggregate mortality rate of all participants) using a one-sided type-I error rate of 0.05. For example, assuming an overall aggregate mortality rate of 7%, a sample size of 48 operations would be required to attain 50% power to detect a doubling of the mortality rate to 14%.
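The calculation can be sketched as follows, assuming a one-sample exact binomial test of a single center against a known aggregate rate. This is an assumption, since the text does not name the test, and the function names are hypothetical. Under that assumption, the search below returns 48 for an aggregate rate of 7%, in line with the example above.

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed on the log scale (0 < p < 1)."""
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log(1 - p))

def upper_tail(c, n, p):
    """P(X >= c) for X ~ Binomial(n, p); zero when c > n."""
    return sum(binom_pmf(k, n, p) for k in range(max(c, 0), n + 1))

def critical_value(n, p0, alpha=0.05):
    """Smallest death count c with P(X >= c | p0) <= alpha (n + 1 means never reject)."""
    c = 0
    while c <= n and upper_tail(c, n, p0) > alpha:
        c += 1
    return c

def power(n, p0, ratio=2.0, alpha=0.05):
    """Power of the one-sided exact test when the true rate is ratio * p0."""
    return upper_tail(critical_value(n, p0, alpha), n, ratio * p0)

def min_sample_size(p0, target=0.5, ratio=2.0, alpha=0.05, max_n=2000):
    """Smallest n whose power reaches the target, or None if not found by max_n."""
    if not (0 < p0 and ratio * p0 < 1):
        raise ValueError("need 0 < p0 and ratio * p0 < 1")
    for n in range(1, max_n + 1):
        if power(n, p0, ratio, alpha) >= target:
            return n
    return None

print(min_sample_size(0.07))
```

Because the death count is discrete, the power of an exact test is not strictly increasing in the sample size, so the sketch reports the first sample size that reaches the target rather than solving a closed-form equation.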
The feasibility of analyzing between-participant variation in PLOS was assessed by counting the number of participants that met the sample size required to achieve 50% power to detect a doubling of the mean PLOS with a one-sided 0.05-level test. For simplicity, power was calculated by assuming an exponential distribution for time to hospital discharge. (This assumption was only made for sample size calculations, not for the actual data analysis.)
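A minimal sketch of this calculation is shown below. It assumes a one-sample test of a center's total PLOS against a known aggregate mean, which is an assumption because the text does not specify the test. If each stay is exponential, the sum of n stays follows a gamma distribution with integer shape n, so the tail probability has a closed form. All function names are hypothetical. Under these assumptions the smallest sample size reaching 50% power is 5.

```python
from math import exp

def gamma_sf(x, n):
    """P(S > x) for S ~ Gamma(shape=n, scale=1) with integer n.

    Uses the identity P(S > x) = P(Poisson(x) <= n - 1).
    """
    term, total = exp(-x), 0.0
    for k in range(n):
        total += term
        term *= x / (k + 1)
    return total

def critical_sum(n, alpha=0.05):
    """Upper-alpha quantile of Gamma(n, 1), found by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_sf(hi, n) > alpha:
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if gamma_sf(mid, n) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

def power(n, ratio=2.0, alpha=0.05):
    """Power to detect a mean that is `ratio` times the null mean.

    Working in units of the null mean, the total under the alternative
    is `ratio` times a Gamma(n, 1) variable.
    """
    return gamma_sf(critical_sum(n, alpha) / ratio, n)

def min_sample_size(target=0.5, ratio=2.0, alpha=0.05, max_n=200):
    """Smallest n whose power reaches the target, or None if not found by max_n."""
    for n in range(1, max_n + 1):
        if power(n, ratio, alpha) >= target:
            return n
    return None

print(min_sample_size())
```

The result does not depend on the value of the aggregate mean PLOS, because doubling the mean only rescales the exponential distribution.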
Bayesian hierarchical modeling was used to estimate the distribution of true unadjusted and adjusted participant-specific mortality rates and average PLOS. For unadjusted mortality, the observed number of deaths was modeled as a binomial distribution with different probability parameters (log-odds) for each participant. The log-odds parameters were assumed to be normally distributed across participants. For unadjusted PLOS, the patient-level variable y=log(1+PLOS) was modeled as a normal distribution with a different mean parameter for each participant and a single variance parameter that was common to all participants. Similar to the mortality model, mean parameters were assumed to be normally distributed across participants. For analyzing risk-adjusted outcomes, a hierarchical logistic regression model was used for mortality, and a hierarchical linear regression model was used for the variable y=log(1+PLOS). Covariates in each model included age (linear and quadratic), weight-for-age-and-sex z-score, sex (male vs. female/other/missing), any preoperative risk factor (yes/no), and any noncardiac abnormality (yes/no). The STS–CHSDB contains standard definitions adopted in 2007 for preoperative risk factors and noncardiac abnormalities. In addition, each model included normally distributed participant-specific random intercepts. The Bayesian approach to data analysis requires the analyst to specify prior beliefs about unknown model parameters using a probability distribution. Because our prior knowledge was limited, we specified a vague proper prior distribution that consisted of independent normal distributions for regression coefficients and inverse gamma distributions for variances. Inferences were based on Markov Chain Monte Carlo (MCMC) simulations as implemented in WinBUGS version 1.4 software.
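For readers who prefer notation, the models described above can be written out as in the sketch below. The symbols are introduced here for illustration only and do not appear in the original text: $d_j$ and $n_j$ are the deaths and eligible cases at participant $j$, $y_{ij}$ is $\log(1+\mathrm{PLOS})$ for patient $i$ at participant $j$, and $x_{ij}$ is the covariate vector. Prior hyperparameters are left unspecified because the text does not report them.

```latex
% Unadjusted mortality: binomial counts with participant-specific log-odds
d_j \sim \mathrm{Binomial}(n_j,\, p_j), \qquad
\log\frac{p_j}{1 - p_j} = \theta_j, \qquad
\theta_j \sim \mathrm{N}(\mu_\theta,\, \tau_\theta^2)

% Unadjusted PLOS: normal on the log(1 + PLOS) scale, common residual variance
y_{ij} \sim \mathrm{N}(m_j,\, \sigma^2), \qquad
m_j \sim \mathrm{N}(\mu_m,\, \tau_m^2)

% Risk-adjusted models: covariates plus a participant-specific random intercept
\log\frac{\Pr(\text{death}_{ij})}{1 - \Pr(\text{death}_{ij})}
  = x_{ij}^{\top}\beta + b_j, \qquad b_j \sim \mathrm{N}(0,\, \tau_b^2)

y_{ij} \sim \mathrm{N}\!\left(x_{ij}^{\top}\gamma + c_j,\; \sigma^2\right), \qquad
c_j \sim \mathrm{N}(0,\, \tau_c^2)

% Vague proper priors (hyperparameters not reported in the text)
\beta_k,\ \gamma_k \sim \mathrm{Normal}, \qquad
\tau^2,\ \sigma^2 \sim \text{Inverse-Gamma}
```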
Bayesian point estimates (posterior means) and 95% probability intervals (PIs) were calculated using 420,000 MCMC iterations following a burn-in period of 5,000 iterations. To facilitate interpretation, parameters from the mortality models were converted to probabilities and parameters of the PLOS models were converted from the scale of log(1+PLOS) to the scale of untransformed PLOS. The risk-adjusted mortality rate was defined as the mortality rate that would be predicted for a patient with risk factor values that are equal to the STS population average. The risk-adjusted mean PLOS was defined similarly.
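The conversions described above can be sketched as follows. This is a simple illustration under stated assumptions: each MCMC draw is transformed first and then summarized, the interval is taken from empirical percentiles, and the PLOS conversion is the direct inverse of log(1+PLOS); the authors' exact conversion for the mean PLOS is not given in the text. The function names are hypothetical.

```python
from math import exp, expm1

def inv_logit(log_odds):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + exp(-log_odds))

def plos_from_log_scale(y):
    """Invert y = log(1 + PLOS) to recover PLOS in days."""
    return expm1(y)

def summarize(draws):
    """Posterior mean and an approximate 95% interval from a list of draws.

    Assumed convention: the interval endpoints are the order statistics
    nearest the 2.5th and 97.5th percentiles.
    """
    ordered = sorted(draws)
    n = len(ordered)
    mean = sum(ordered) / n
    lower = ordered[int(0.025 * (n - 1))]
    upper = ordered[int(0.975 * (n - 1))]
    return mean, lower, upper

# Example with made-up log-odds draws for one participant (not real data).
example_draws = [-2.0 + 0.01 * i for i in range(101)]
print(summarize([inv_logit(d) for d in example_draws]))
```

Transforming each draw before summarizing keeps the point estimate and the interval on the same scale, which matters because the transformations are nonlinear.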
The ratio of the maximum and minimum value was estimated for each endpoint to illustrate the scale of between-center differences. Also, the Gini index (GINI) was calculated for each operation as a measure of spread. GINI ranges from 0 to 1. A larger number means more variation between hospitals. GINI is one half of the average absolute difference of the mortality rates of two hospitals, averaging over all possible pairs of hospitals in the analysis, divided by the average mortality rate. We did not provide p-values because p-values are not used in Bayesian analyses. Instead, 95% Bayesian PIs are provided.
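The verbal definition of the Gini index above translates directly into a few lines of code. The sketch below assumes that "all possible pairs" means all distinct unordered pairs of hospitals; the function name `gini` is hypothetical, and the rates in the example are made up for illustration.

```python
from itertools import combinations

def gini(rates):
    """Half the mean absolute pairwise difference, divided by the mean rate.

    Assumption: pairs are distinct unordered pairs of hospitals.
    """
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two hospitals")
    mean_rate = sum(rates) / n
    if mean_rate == 0:
        raise ValueError("mean rate is zero; index is undefined")
    pairs = list(combinations(rates, 2))
    mean_abs_diff = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return 0.5 * mean_abs_diff / mean_rate

# Illustrative (made-up) mortality rates for five hospitals.
print(gini([0.10, 0.15, 0.20, 0.25, 0.40]))
```

With this definition, identical rates give an index of 0, and the index reaches 1 when a single hospital accounts for all events.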
All analyses were performed using SAS version 9.2, R version 2.8, and WinBUGS version 1.4.
This study was approved by the Duke University Health System Institutional Review Board. Because the data used in analysis represent a limited data set (no direct patient identifiers) that was originally collected for non-research purposes, and the investigators do not know the identity of individual patients, the analysis of these data was declared by the Duke University Health System Institutional Review Board to be research not involving human subjects.
From 2005–2009, inclusive, 85 centers (USA and Canada) submitted data to STS–CHSDB, and discharge mortality of index cardiac operations was 4.0% (3,418/86,297). For patients aged <18 years over the same period, 85 centers submitted data, and discharge mortality of index cardiac operations was 4.1% (3,309/81,062). A total of 18,375 index operations at 74 centers were included in the analysis of eight benchmark operations.
Table 2 summarizes overall aggregate and participant-specific results for mortality and PLOS for each operation. Mortality data are also displayed as funnel plots for these eight benchmark operations (Figure 1). These funnel plots demonstrate that for the majority of these benchmark operations, very few programs can be classified as outliers for discharge mortality, i.e., most programs fall within the 95% prediction limits. In fact, for some operations such as VSD, TOF, and Fontan, no programs are outliers. For other operations such as AVC, ASO, ASO+VSD, truncus, and Norwood, some participants are outliers. The numbers of “outliers” (based on two one-sided .025-level tests) were: VSD=0, TOF=0, AVC=1, ASO=3, ASO+VSD=1, Fontan=0, Truncus=4, Norwood=11. By design, approximately 5% of participants would be expected to have mortality rates that fall outside of the 95% prediction interval even if the true probability of mortality did not vary across centers. For each operation except Norwood, the number of centers falling outside of the 95% prediction interval was consistent with the number that would be expected under the null hypothesis of no between-center variation. However, the small number of outliers should not be interpreted as evidence of no between-center variation in mortality. Power for detecting between-center variation for low complexity operations was minimal, as described below.
The number of cases required to detect a two-fold increase in the mortality rate with at least 50% power ranged from 17 for Norwood to 599 for VSD repair (Table 3). In the Norwood group, 40 participants met this required sample size. (Power to detect a smaller 1.5-fold increase in Norwood mortality was at least 50% for 12 participants and at least 80% for 4 participants.) For procedures other than Norwood, at most 1 participant met the sample size required to detect a doubling of mortality with at least 50% power. Based on these results, between-participant variation in mortality was analyzed with Bayesian methodology only for Norwood. For Bayesian analyses of Norwood, all participants were included regardless of sample size.
The required sample size to detect a doubling of the mean PLOS is five operations (Table 3). Based on these results, between-participant variation in PLOS was analyzed for all operations. All participants were included regardless of sample size.
Table 4 documents unadjusted and risk-adjusted Bayesian estimation of between-participant variation for mortality and PLOS. The estimated 25th and 75th percentiles for Norwood mortality are 15.5% and 27.0%. We estimate that 25% of participants have a true mortality rate <15.5% and 75% of participants have a true mortality rate <27.0%. The estimated minimum and maximum true mortality rates are 7.3% and 47.0%. We estimate that the highest mortality rate is approximately 7-fold higher than the lowest. The 95% PI for the max/min ratio is 3.7–13.9, implying that we are highly confident that there is at least a 3.7-fold difference and no more than a 13.9-fold difference between the highest and lowest participant-specific true mortality rate. The between-center variation in mortality was only marginally attenuated when adjusting for case mix (estimated max/min ratio=6.5; 95% PI: 3.3–13.0). Variation in PLOS was also substantial, with a trend suggesting greater variation for higher-complexity operations. The estimated GINI index for adjusted PLOS ranged from 0.069 (95% PI: 0.056-0.082) for TOF to 0.142 (95% PI: 0.117-0.171) for Norwood.
The STS–CHSDB is the largest congenital heart surgery database in North America. This analysis documents (1) contemporary benchmarks for common pediatric cardiac surgical operations of varying levels of complexity, and (2) the degree of variation in outcome between centers. Variation in outcome was most prominent for the more complex operations. These data can aid in quality assessment and quality improvement initiatives. Variation in outcomes across centers demonstrates opportunities for multi-institutional collaboration to improve quality.
Knowledge of the distribution (e.g. percentiles) of adverse event rates across hospitals can be used to prioritize improvement efforts and establish benchmarks. However, estimation of hospital percentiles is not straightforward because the number of patients per hospital is often quite small. Percentiles calculated directly from the observed event rates are misleading because hospitals with a very small number of subjects are likely to have extreme event rates (e.g. 0% or 100%), and these rates may not be representative of their true long-run performance. Although the raw data are skewed towards having an unrealistically large amount of spread, a statistical model can be used to recover the true underlying distribution of hospital-specific probabilities. The Bayesian hierarchical modeling approach used in this article is particularly well suited for this purpose as it is designed explicitly to model true variation between units while accounting for purely random variation. In addition to estimating percentiles and other measures of between-hospital variation, the Bayesian approach also allows calculating an appropriate measure of uncertainty (95% probability intervals) for these estimates.
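The point about extreme raw rates can be illustrated with a small simulation. The sketch below is not part of the study's analysis; it uses made-up settings in which every center has exactly the same true mortality rate, so any spread in the observed rates is purely random. The function name and all numeric settings are hypothetical.

```python
import random

def simulate_raw_rates(true_rate, cases_per_center, n_centers, seed=0):
    """Observed mortality rates when all centers share one true rate."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_centers):
        # Each case is an independent Bernoulli trial with the shared true rate.
        deaths = sum(rng.random() < true_rate for _ in range(cases_per_center))
        rates.append(deaths / cases_per_center)
    return rates

# Same true rate (20%) everywhere; only the number of cases per center differs.
small = simulate_raw_rates(0.20, cases_per_center=5, n_centers=200)
large = simulate_raw_rates(0.20, cases_per_center=500, n_centers=200)

print("5 cases per center:   min", min(small), "max", max(small))
print("500 cases per center: min", min(large), "max", max(large))
```

With only a handful of cases per center, the observed rates spread widely even though there is no true between-center variation, which is the distortion a hierarchical model is meant to separate from real differences.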
Because Norwood was the only operation (of the 8 benchmark operations in this analysis) with more than one participant performing the minimum number of operations to detect a doubling of mortality, between-participant variation in mortality was analyzed with Bayesian methodology only for Norwood. With the other seven operations, the sample size (number of events) was too small to produce a valid estimate of the magnitude of between-center variation. In a Bayesian analysis, beliefs about unknown quantities are expressed using a probability distribution. One specifies a prior distribution (i.e. what one believes before seeing any data) and then one calculates the posterior distribution (i.e. what one believes after seeing the data) by using Bayes' theorem. In the other seven operations, with such a small sample size, the results of a Bayesian analysis would be largely driven by the prior distribution, rather than the data. Although sensitivity to the prior distribution is an issue specific to Bayesian analysis, no alternative method could produce a meaningful estimate of between-center variation for these seven operations.
It is apparent that even with 5 years of data, many individual operations are not performed frequently enough at any given institution to detect a doubling of mortality. A previous analysis of the potential to use mortality after “marker operations” to assess pediatric cardiac surgical performance concluded that, “There were relatively small data sets for individual hospitals and surgeons, which made statistical evaluation difficult. For setting standards, data from more departments for a longer period will be required. Statistical methods alone cannot be used as a sole arbiter of what is considered acceptable performance.” Nevertheless, the strategy of analyzing mortality using funnel plots can help to identify programs that are outliers with respect to mortality for specific operations. Since 2000, this strategy has been utilized by the United Kingdom Central Cardiac Audit Database and forms the basis of their public reporting initiative.
Our assessment of the feasibility of analyzing between-center variation in mortality and PLOS revealed that statistically meaningful comparisons of mortality between centers could only be made for Norwood, while variability in PLOS could be analyzed for every operation. PLOS has been used as a surrogate for morbidity and can help assess variation in outcome between centers. These data create an opportunity for inter-institutional collaboration in optimizing structure and process with a goal of improving overall quality of care and outcome.
Complexity stratification using the five STS-EACTS Categories allows the grouping of operations into similar strata of risk and therefore permits analysis of higher volumes of cases than using benchmark operations. An analysis of variation in outcomes of mortality and PLOS stratified by the five STS-EACTS Categories represents an important opportunity for future investigation. Such an analysis may create additional opportunities for inter-institutional sharing of structure and process in order to improve overall quality of care and outcome. Similarly, Bayesian estimation of between-participant variation based on the five STS-EACTS Categories represents an important area of future investigation.
This analysis documents (1) contemporary benchmarks for common pediatric cardiac surgical operations of varying levels of complexity, and (2) the range of outcomes among centers. Variation in outcome was most prominent for more complex operations. Even with the use of 5 years of data, because of the relatively small datasets for many operations at most centers, it is not possible to perform statistically meaningful comparisons between centers of mortality after benchmark operations. Funnel plots of mortality after benchmark operations can help to identify outliers. Grouping of operations into strata of similar complexity may further facilitate inter-institutional comparisons. These data can aid in quality assessment and quality improvement initiatives. Variation in outcomes across centers demonstrates opportunities for multi-institutional collaboration to improve quality.