The voluntary participation of trauma centers in collection and submission of registry data is motivated by a desire to define a measurable standard of care, as well as the desire of individual centers to demonstrate their achievement of the standard (within the bounds of inevitable random variation). The assumption is that properly adjusted institutional performance in the recent past can be used to predict institutional performance in the immediate future. These predictions can be compared with a “benchmark”, most often the similarly predicted average performance of all hospitals in the same class.
Once a measure of relative hospital performance has been developed, it may be used (or abused) for competitive or regulatory purposes, and even publicized in a “scorecard” or “report card”.(22) To produce the desired goal of quality improvement, the measure must therefore be perceived by the professional community as clinically and statistically valid. Diligent efforts to standardize data collection and quality are clearly important, but so is the implementation and explanation of a statistical method that properly accounts for random variation in outcomes.(1,9)
In the present study, we sought to determine whether the additional complexity of multilevel (ML) modeling might be worthwhile compared to standard methods of logistic regression. While theoretical considerations should lead us to expect fewer outliers using ML methods, this does not guarantee that the correct number of outliers will be identified. We therefore sought to test the methodologies in the practical sense of how well a model predicted future performance.
The structure of NIS and other hospitalization data is multilevel or hierarchical, with patients clustered within hospitals. An implicit assumption of such clustering is that patients in a given hospital tend to be more like one another, in characteristics important to the analysis, than they are like patients at other hospitals. Multilevel models explicitly acknowledge that a small sample of patients from a hospital will predict the hospital effect with a larger error, on average, than a larger sample will. This methodology potentially reduces prediction errors caused by small sample sizes and reduces the mislabeling of hospitals as outliers simply because they contributed few patients.
Even a zero mortality may not be an accurate predictor of future performance, especially if it is based on a relatively small number of cases.(23)
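The two points above (small samples estimate a hospital effect with more error, and zero observed deaths may be weak evidence) can be illustrated with a minimal sketch. It is not the model fitted in this study: the shrinkage formula is the textbook linear random-intercept weight, used here only as an analogue of what a multilevel logistic model does, and all function names and variance values are hypothetical.

```python
def zero_event_upper_bound(n_patients, confidence=0.95):
    """One-sided exact upper confidence bound on the true death rate
    when 0 deaths are observed among n_patients.

    Solves (1 - p) ** n_patients = 1 - confidence for p.
    """
    if n_patients <= 0:
        raise ValueError("n_patients must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_patients)


def shrinkage_weight(n_patients, between_var, within_var):
    """Weight given to a hospital's own data (0 = ignore it, 1 = trust it fully).

    Textbook linear random-intercept form (an assumption for illustration):
    between_var / (between_var + within_var / n_patients).
    """
    return between_var / (between_var + within_var / n_patients)


def shrunken_estimate(raw_rate, grand_mean, n_patients, between_var, within_var):
    """Pull a hospital's raw rate toward the grand mean; smaller samples are pulled harder."""
    w = shrinkage_weight(n_patients, between_var, within_var)
    return grand_mean + w * (raw_rate - grand_mean)


# Hypothetical inputs: 4% overall mortality, made-up variance components.
GRAND_MEAN = 0.04
BETWEEN_VAR = 0.0004   # assumed variance of true hospital rates
WITHIN_VAR = 0.04      # assumed patient-level variance, roughly p * (1 - p)

small = shrunken_estimate(0.0, GRAND_MEAN, 20, BETWEEN_VAR, WITHIN_VAR)
large = shrunken_estimate(0.0, GRAND_MEAN, 2000, BETWEEN_VAR, WITHIN_VAR)
```

Under these assumed values, a hospital with no deaths among 20 patients is still compatible with a true mortality of roughly 14% (the upper bound), and its shrunken estimate stays close to the grand mean. The same zero rate observed in 2000 patients is pulled toward the mean far less.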
Accumulating, submitting, and processing registry data at a state or national level typically takes at least two years, so we reasoned that a model constructed from data in Year N would typically be applied to predicting the performance of hospitals in Year N+2. The NIS offered the opportunity to see in retrospect how well the models would have worked if they had been applied for this purpose to hospitals that happened to be in the samples for both Year N and Year N+2.
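The Year N to Year N+2 comparison described above can be sketched as follows. This is a simplified illustration rather than the analysis code used in the study: the hospital identifiers, the boolean outlier flags, and the function name `predictive_agreement` are all hypothetical, and the summary shown (the fraction of predicted outliers that were again outliers two years later) is only one of several ways such agreement could be measured.

```python
def predictive_agreement(flags_year_n, flags_year_n2):
    """Compare outlier flags for hospitals sampled in both years.

    Each argument maps a hospital identifier to True (flagged as an outlier)
    or False. Hospitals present in only one year are ignored.
    """
    shared = sorted(set(flags_year_n) & set(flags_year_n2))
    predicted = [h for h in shared if flags_year_n[h]]
    confirmed = [h for h in predicted if flags_year_n2[h]]
    # Fraction of Year N outliers that were also outliers in Year N+2;
    # undefined (None) when Year N flagged no shared hospital.
    fraction = len(confirmed) / len(predicted) if predicted else None
    return {
        "n_shared": len(shared),
        "n_predicted": len(predicted),
        "n_confirmed": len(confirmed),
        "fraction_confirmed": fraction,
    }


# Made-up example: hospital "B" appears only in Year N, "E" only in Year N+2.
year_n = {"A": True, "B": False, "C": True, "D": False}
year_n2 = {"A": True, "C": False, "D": False, "E": True}
result = predictive_agreement(year_n, year_n2)
```

Restricting the comparison to hospitals present in both samples mirrors the constraint described in the text, since a prediction can only be checked for a hospital that was sampled again.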
Estimates of various patient cofactors were very similar in each year (), and were also similar to those obtained by Clark and Winchell(17) using other administrative data. In particular, we note again that the effect of increasing AIS is much more pronounced for head injuries than for injuries in other parts of the body. This finding argues against the use of a single calculated Injury Severity Score and in favor of some method that separates these effects by body region.
The fixed effects shown in were also nearly identical regardless of whether a standard or an ML model was estimated. It is therefore not surprising that Cohen et al.(11) found no shrinkage and no difference between the results from MLLR and standard LR when the hospital effects were “estimated using only the fixed portion of the model”. However, the determination of u0j and evaluation of the shrunken estimates are essential features of the ML approach, so we do not feel that their study should be taken as evidence that it has no advantages over standard LR.
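For orientation, a generic two-level random-intercept logistic model can be written as below. This is the standard textbook form, given here as an assumption about notation rather than as a restatement of the study's exact specification. Apart from $u_{0j}$, the symbols ($p_{ij}$, $\beta_k$, $x_{kij}$, $\sigma_u^2$) are introduced only for illustration.

```latex
\operatorname{logit}(p_{ij}) = \beta_0 + \sum_{k} \beta_k x_{kij} + u_{0j},
\qquad u_{0j} \sim N(0, \sigma_u^2)
```

Here $p_{ij}$ is the probability of death for patient $i$ in hospital $j$, the $\beta_k$ are the fixed effects of patient-level covariates, and $u_{0j}$ is the hospital-specific deviation whose shrunken estimate is the quantity of interest. Setting every $u_{0j}$ to zero leaves only the fixed portion of the model.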
To simplify this comparative study, we did not use any hospital-level predictors (e.g., teaching status), although these could easily be added. In particular, a registry-based comparison could include hospital-level variables indicating trauma center status (Level I, ACS Verification, etc.), which is not possible in standard regression. One caution when applying hospital-level variables in an ML model is that estimates will then be shrunken toward the mean of any subgroup sharing the same hospital-level predictors, which may or may not be the intention of the analysis.
A significant limitation of our study is the reliance on NIS data that were not specifically collected for quality improvement. Administrative data in general do not have the depth of descriptive detail that might be available in a registry or other dedicated database. Indeed, a previous comparison of ML and standard mortality models using New York clinical cardiac surgical data did not find that ML methods were preferable.(10) While this may reflect differences between cardiac and trauma patients, it may also be true that better methods of risk adjustment at the patient level would reduce the proportion of variability assigned to hospitals with differing patient populations.
Another important limitation of the NIS (and other databases recording only outcomes at the time of hospital discharge) is that differences in survival at the time of discharge may not reflect a true difference in survival, since earlier discharge from the hospital may lead to some deaths not being observed and recorded in hospital data. This is particularly true of older patients, who may be discharged to nursing or rehabilitation facilities earlier or later in different regions of the country.(24) Thus, small differences in hospital survival should not necessarily be interpreted as differences in overall survival, and the apparent consistency of a model to predict this outcome may be influenced by its ability to predict practice patterns rather than its ability to predict clinical quality.
Comparison of standard regression to ML regression must consider the assumptions made by each model, and the purposes for which they are applied. For the theoretical reasons discussed above and elsewhere, the ML approach should be expected to be more reliable, because it shrinks the observation from one year toward what would be expected in the future assuming that the hospitals are truly exchangeable. On the other hand, standard regression does not allow individual hospitals to “borrow” data from the group as a whole, and might be preferable if the purpose of a model is to determine which hospitals should be rewarded (or penalized) for their actual performance in a given year.
With our samples of patients and hospitals, standard regression predicted more low outliers, while ML regression predicted more high outliers. This difference may be attributed to the distribution of hospital means, which does not necessarily follow a normal distribution as assumed by the ML model. Standard logistic regression evaluates performance with respect to the overall mean probability of mortality, while ML regression evaluates performance with respect to the mean of the hospital effects. Other samples of patients and hospitals may or may not affect the results in the same way, and arguments may be advanced whether one or the other mean is more appropriate.
Regardless of the differences we found between the two methods, neither was very accurate in predicting outliers using data from two years earlier. In addition, the difference between observed and predicted rates in the validation samples was quite large in relation to the overall mortality rates. Hospital mortality is a relatively infrequent outcome, and its variability over two years in a given hospital may be too large to allow effective prediction by any statistical method, especially when small differences among hospital outcomes may result from variability in inclusion criteria or hospital length of stay. Further studies using other databases may be useful to demonstrate or refute the contribution of ML models to the process of hospital quality improvement for trauma or other conditions.