We have shown here the utility of hierarchical clustering as an unsupervised non-linear classification schema in the prediction of outcome in severely injured trauma patients. We obtained clusters that were enriched for patients who died, contracted an infection, and suffered multiple organ failure. These clusters were not merely dominated by a few specific patients with a particular outcome. Indeed, each of the clusters was made up of multiple patients' data, and each patient transitioned through multiple clusters during their ICU stay. Lastly, the prognostic information captured by the clustering results was not obtainable by traditional univariate statistical analysis, which could not predict any of these outcomes.
Despite the near-continuous monitoring of many physiologic variables and treatment parameters, traditional care in the ICU fails to use all these data efficiently. Currently, clinicians base their understanding of patient state, and of the appropriate manipulation of that state, on intermittent examination of patient variables (vital signs, labs, studies and physical examination). It has been shown, however, that more frequent data collection and analysis better defines patient physiology [6], and there has been much work in using continuous data, including the alarms built into the standard ICU bedside monitors [7, 8]. While these monitors are excellent as instant alarms regarding critical parameters, they do nothing to help predict long-term outcomes. Improvements in diagnosis and care have traditionally resulted from both improved clinical acumen and scientific advancement, mostly surrounding scientific examination of a single adjunct or a small group of adjuncts. Indeed, the critical care literature is full of examinations of monitors, scoring systems, measurements and biomarkers, all of which seek to define and predict the degree of injury, physiological insult and resuscitation [8, 9]. Despite this proliferation, multivariate understanding of resuscitation state and identification of occult hypoperfusion remain elusive and an open experimental question. Multivariate decision tools using supervised learning algorithms have been implemented to detect hypovolemia [10] and to generate alarms for critical care patients [8]. In contrast to our current work, this previous work used relatively few types of data (five and nine, respectively), giving a less complete picture of the patient's physiology. Additionally, multiple logistic regression models have been shown to predict MOF 12 hours post-injury [11], but these suffer from the inability to discover new physiology or make use of complex multivariate physiological relationships. In groundbreaking work in the mid-1990s, Rixen and colleagues utilized K-means clustering to define patient states based on 17 non-continuous variables. Through clustering and comparison to reference states (derived from non-injured controls), this group elegantly proposed that patient state could be defined in multidimensional state space [12, 13]. This work represented the first attempt at defining patient state as a multivariate entity. Here we extend these analyses using continuous data with no a priori understanding of the relationship between these data and outcome. We further extend them by tracing patient state through the state space over time.
The use of unsupervised learning with large multivariate data sets composed of continuous data represents a rarely used combination of techniques to predict and improve patient outcomes. Nelson et al. [14] used self-organizing maps to visualize patterns in microdialysis data from patients with traumatic brain injury, finding that individuals were likely to cluster together, in contrast to our results showing much movement among clusters. The work presented here extends previous observations from our group [15], which employed methods similar to those we report here but used aggregate data from each patient rather than minute-by-minute (q1 minute) data; our methods also provide predictions of outcome in addition to the clinical insights discussed in that work. To fully utilize our data, we required a technique to distill all variables into a single meaningful value: in this case, a patient state. This could then, in turn, be defined in terms of clinically relevant patient outcome or physiologic state, as we have done here by associating each cluster with the probability of an outcome. Instead of fixation on one or a few physiologic parameters, transformation of all data into a single reproducible and clinically relevant value allows all available data to be used simultaneously. Furthermore, the complex relationships among multiple variables are preserved and exploited. Our analysis has shown that, without inputting any prior knowledge, unsupervised algorithms are able to discern information (unobtainable by traditional statistics) that is indicative of death, infection, and MOF. With our data obtained every minute, the fact that patients transition through many clusters throughout their observation period attests to rapidly changing complex physiology. We have demonstrated our ability both to define patient state using hierarchical clustering and to track the progress of individual patients through these clusters over time. Indeed, patients tend to move between clusters during their stay, and we would expect most of them to experience under-resuscitation during part of their first 24 hours of care. Future analysis could reveal the potential of assigning transition probabilities between clusters based on physiology, which, combined with knowledge of the likelihood of death in each state, suggests potential methods of steering the physiology away from clusters with high mortality and towards clusters associated with safety. The ability to do this in real time would greatly improve patient care decisions, leading to potentially enormous gains in outcomes.
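As an illustration of the transition probabilities proposed above as future work, the following minimal sketch estimates them empirically from per-patient sequences of cluster labels. It is not part of the analysis reported here: the function name, the label sequences and the simple counting estimator are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next cluster | current cluster) from label sequences.

    `sequences` holds one list of cluster labels per patient, ordered in
    time. Transitions are counted only within a patient's own sequence.
    """
    counts = defaultdict(Counter)
    for labels in sequences:
        for current, nxt in zip(labels, labels[1:]):
            counts[current][nxt] += 1

    probs = {}
    for current, nexts in counts.items():
        total = sum(nexts.values())
        probs[current] = {nxt: n / total for nxt, n in nexts.items()}
    return probs


# Hypothetical minute-by-minute cluster labels for two patients
# (illustrative values only, not study data).
patient_sequences = [
    [1, 1, 2, 2, 2, 3],
    [2, 2, 1, 1, 3, 3],
]

P = transition_probabilities(patient_sequences)
for state in sorted(P):
    print(state, P[state])
```

Each row of the resulting table sums to one. Pairing such a table with the per-cluster likelihood of death would be one way to ask which transitions move a patient towards or away from high-mortality states.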
We acknowledge that our results are dependent on our choice of similarity measure and clustering method. Our choice of Euclidean distance is natural for the problem at hand, as we were interested in the similarity of all variables to each other, not in how they varied with each other. Though the techniques of traditional linear statistics (correlation and regression analyses) can reveal differences between groups or correlations between pairs of physiological variables, we have shown here that they do not easily define a state made up of many variables with complex interrelationships.
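The distinction between the two kinds of similarity can be made concrete with a small sketch. It compares Euclidean distance with a correlation-based distance (one minus the Pearson correlation) on two hypothetical measurement vectors; the function names and the values are assumptions chosen for illustration, not data or code from this study.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance: sensitive to the actual variable values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation_distance(a, b):
    """One minus the Pearson correlation: sensitive only to co-variation."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


# Two hypothetical vectors with the same pattern but very different levels
# (illustrative values only).
state_a = [0.0, 1.0, 2.0, 3.0]
state_b = [10.0, 11.0, 12.0, 13.0]

d_euclid = euclidean_distance(state_a, state_b)
d_corr = correlation_distance(state_a, state_b)
print(d_euclid, d_corr)
```

The two vectors vary together perfectly, so the correlation-based distance treats them as identical, while the Euclidean distance keeps them far apart because their values differ. The latter behavior is the one wanted when the values themselves define the state.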
There are several limitations to this preliminary study. First, the analysis here is based on a limited number of patients (17) and data points (52,000). Future studies should incorporate more patients (and more data) representing the primary outcomes. While a potential criticism is that a few clusters were dominated by the few patients with poor outcome, resulting in an overfit model, we stress that the clusters were defined in a way blind to patient outcome yet remained enriched for those outcomes.
Our results, while novel, represent a proof-of-concept study to show that cluster analysis can reveal complex patterns and predict outcome. Even so, we remain aware that to test the general applicability of these results, future studies will have to use a training data set to produce clusters/states that would then be applied to a test data set from separate patients. While we have tried to address the limitations of our single data set and the serial dependence of data points by using bootstrap analysis and by showing that each state was populated by data from many patients, future studies can conclusively address these concerns with separate training and test data sets. It also remains unclear how to select the correct number of clusters. As there is little guidance in the literature and these analyses have never been attempted in this manner, we selected 10 clusters as a trade-off between inter- and intra-cluster distance and a usable number of patient states for analysis. Future studies could easily compare the prognostic information obtained from more or fewer clusters, thereby discerning the correct number of states for a similar analysis.
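The training/test design described above can be sketched as follows. This is a hypothetical outline only: the patient identifiers, the 30% test fraction, the centroid values and the nearest-centroid assignment rule are all assumptions, and hierarchical clustering does not itself prescribe how new data points should be assigned to existing clusters.

```python
import math
import random


def split_patients(patient_ids, test_fraction=0.3, seed=0):
    """Split at the patient level so no patient contributes to both sets."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])


def nearest_state(point, centroids):
    """Index of the training-derived centroid closest in Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(point, centroids[k]))


# Hypothetical identifiers for 17 patients.
patients = [f"P{i:02d}" for i in range(1, 18)]
train_ids, test_ids = split_patients(patients)

# Hypothetical centroids, standing in for states learned from training data.
centroids = [(0.0, 0.0), (5.0, 5.0)]

# Assign one hypothetical test-patient data point to a state.
assigned = nearest_state((0.1, 0.2), centroids)
print(len(train_ids), len(test_ids), assigned)
```

Splitting by patient rather than by data point matters here because consecutive measurements from one patient are serially dependent; a point-level split would leak near-duplicate observations from the training set into the test set.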
Lastly, while our current work is limited to retrospective assignment of data to clusters, future work should include developing a single score that indicates both the patient's current state and their likelihood of dying during their hospital stay.