We sought to develop a simple point score that would accurately capture the risk of hospital death for patients with acute lung injury (ALI).
This is a secondary analysis of data from two randomized trials. Baseline clinical variables collected within 24 hours of enrollment were modeled as predictors of hospital mortality using logistic regression and bootstrap resampling to arrive at a parsimonious model. We constructed a point score based on regression coefficients.
Medical centers participating in the Acute Respiratory Distress Syndrome Clinical Trials network (ARDSnet).
Model development: 414 patients with non-traumatic ALI participating in the low tidal volume arm of the ARDSnet ARMA study. Model validation: 459 patients participating in the ARDSnet ALVEOLI study.
Variables comprising the prognostic model were: hematocrit <26% (1 point), bilirubin ≥ 2 mg/dl (1 point), fluid balance greater than 2.5 liters positive (1 point), and age (1 point for age 40–64, 2 points for age ≥ 65 years). Predicted mortality (95% confidence interval) for 0, 1, 2, 3, and 4+ point totals was 8% (5–14%), 17% (12–23%), 31% (26–37%), 51% (43–58%), and 70% (58–80%), respectively. There was excellent agreement between predicted and observed mortality in the validation cohort. Observed mortality for 0, 1, 2, 3, and 4+ point totals in the validation cohort was 12%, 16%, 28%, 47%, and 67%, respectively. Compared to the APACHE III score, areas under the receiver operating characteristic curve for the point score were greater in the development cohort (0.72 vs. 0.67, p=0.09) and lower in the validation cohort (0.68 vs. 0.75, p=0.03).
Mortality in ALI patients can be predicted using an index of four readily-available clinical variables with good calibration. This index may help inform prognostic discussions, but validation in non-clinical trial populations is necessary before widespread use.
Acute lung injury (ALI) is a devastating cause of respiratory failure associated with significant morbidity and mortality.1,2 Despite the wealth of existing knowledge about risk factors for death in this syndrome, providers remain unable to determine which patients with ALI will ultimately die during their hospital stay. The vast majority of patients with ALI who die do so in the context of a decision to forgo life sustaining treatment driven in large part by patient preferences.3–5
Prognostication in the intensive care unit (ICU) is an important part of communication with surrogates, and often plays a role in the decision to forgo life sustaining treatment.6,7 Incapacitated patients rely upon surrogates such as their family members to represent their wishes during ICU care, and surrogates often rely upon clinician estimates of the likelihood of survival and functional recovery from acute illness when deciding whether to forgo life sustaining treatment for their loved one.7 Documented cognitive and non-cognitive biases held by physicians may overly influence their prognostic estimates for a given patient and have the potential to misrepresent true risk of death.8–11 Objective prognostic models, such as the Acute Physiology and Chronic Health Evaluation (APACHE) III score12 and Simplified Acute Physiology Score (SAPS) III,13 can provide estimated probabilities of death for an individual patient in the ICU. However, experts recommend against use of these models for predicting outcomes for individual patients in part because of their inability to convey uncertainty in estimated probabilities of death for an individual patient and the complexity involved in their calculation.14
The goal of this study was to develop a simple, disease-specific multivariable predictive scorecard for mortality to be used at the bedside in patients with early ALI. Given the importance of well calibrated models for individual prognostication15, we sought to maximize the concordance between predicted and actual probabilities of hospital death across point strata for our model, and thus to arrive at a system that might classify patients into groups for planning patient care.
The model derivation population arose from the 861 patients participating in the ARDSNet low tidal volume study (ARMA).16 Briefly, intubated, mechanically ventilated patients meeting American European Consensus Conference (AECC)17 definition for ALI were randomized within 36 hours of meeting the last qualifying AECC criterion to receive tidal volumes of 6 mL/kg or 12 mL/kg predicted body weight. Demographics, comorbidities, ALI precipitating cause, physiology, radiographic and ventilator data were recorded within the 24 hours prior to change in ventilator settings for all enrolled patients. Vital status for each patient was determined at hospital discharge. We limited our development cohort to all patients randomized into the 6 mL/kg arm of the parent study to eliminate tidal volume as a predictive variable in the analysis since current best practice involves low tidal volume ventilation for this population (n=473). Patients with trauma as the primary risk factor for ALI were excluded due to the low mortality rate in this subgroup.18
Our general strategy to develop a predictive model for death consisted of three steps. First, we identified variables previously reported as associated with mortality or severity of illness in ALI. Baseline values were selected to minimize missing data and to allow for mortality prediction at the beginning of ALI. Next, we constructed a parsimonious multivariable model based on these predictors. Finally, we validated the final predictive model in an independent sample of patients.
When deciding which covariates to retain as candidate predictors for the multivariable model, we considered the clinical relevance and generalizability of each covariate; the amount of missing data (retaining the measure with the least missing data); and finally, the amount of spread in the covariate’s scale (retaining the measure with the most variability), in that order. We assessed the collinearity among the predictors using the Pearson correlation coefficient, χ2 tests, and ANOVA/t-tests. When highly correlated covariates quantified the same clinical information (e.g., A–a difference and PaO2), we selected the covariate that was more clinically relevant, had less missing data, and had more variability.
The resulting baseline clinically relevant covariates with minimal collinearity were entered into a multivariable logistic regression model. These variables included demographics (age, gender, race/ethnicity); weight; respiratory physiology (PaO2/FiO2, PaCO2, positive end-expiratory pressure [PEEP], number of opacified quadrants on frontal chest x-ray19, volume/pressure targeted ventilation, assist/control ventilation); primary ALI risk factor as coded by the clinical coordinator and physician investigator within 36 hours of ALI onset (pneumonia, sepsis, aspiration, other/none); timing of ALI onset (hospital days prior to ARDSnet screen, days with ALI prior to randomization); and physiologic and laboratory derangement (number of non-pulmonary organ failures, vasopressor use, net 24-hour fluid balance prior to enrollment, 24-hour urine output prior to enrollment, peak bilirubin, peak creatinine, lowest systolic blood pressure, lowest hematocrit). All peak and nadir values were identified during the 24 hour period prior to enrollment. We included continuous variables in categorical form to simplify point calculation from the final model. We determined cut points for continuous variables by assessing each variable’s functional form using generalized additive models.20 We evaluated two-way multiplicative interactions between covariates; interactions that were not statistically significant were excluded from the final model.
Variable selection in the multivariable regression framework utilized a bootstrap algorithm.21 We generated 1000 bootstrap samples from the original dataset. Each bootstrap sample was the same size as the original derivation sample; however, patients in each bootstrap sample were randomly drawn from the original data with replacement.21 Within each bootstrap sample, we performed stepwise logistic regression with thresholds of p=0.10 for variable selection and p=0.20 for variable elimination. Predictors present in at least 600 runs (i.e., 60% of the 1000 generated bootstrap samples) were entered in a final logistic regression model using the original data.22,23 This method determines the empirical distribution of a variable’s likelihood of being included in the model, thereby quantifying the strength of evidence that a given variable is a true independent predictor of death, and compares favorably to more traditional cross-validation or isolated automated model development methods.23
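The bootstrap selection procedure described above can be sketched in a few lines. The sketch below is a minimal illustration on synthetic data: the variable names are hypothetical, and a simple correlation screen stands in for the stepwise logistic regression used in the actual analysis; only the resampling with replacement and the 60% inclusion-frequency rule are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 414 patients, 5 candidate predictors (names hypothetical).
n, names = 414, ["age", "hematocrit", "bilirubin", "fluid_balance", "noise"]
X = rng.normal(size=(n, len(names)))
# The outcome depends on the first four columns only; "noise" is uninformative.
logit = X[:, :4].sum(axis=1)
y = rng.random(n) < 1 / (1 + np.exp(-logit))

counts = dict.fromkeys(names, 0)
for _ in range(1000):                        # 1000 bootstrap samples, as in the text
    idx = rng.integers(0, n, size=n)         # draw n patients with replacement
    Xb, yb = X[idx], y[idx]
    for j, name in enumerate(names):
        # Stand-in for stepwise logistic selection: keep a variable when its
        # point-biserial correlation with death exceeds a crude threshold.
        r = np.corrcoef(Xb[:, j], yb)[0, 1]
        if abs(r) > 0.15:
            counts[name] += 1

# Retain predictors present in >=60% of bootstrap runs (>=600 of 1000).
retained = [v for v, c in counts.items() if c >= 600]
print(sorted(retained))
```

The inclusion-frequency rule rewards predictors whose association with the outcome is stable across resamples, rather than those that happen to survive a single automated stepwise run.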
Point scores were assigned to each covariate by rounding the regression coefficients in the final model to integers.24 We then calculated a point score for each patient in the cohort and plotted the resulting receiver operating characteristic (ROC) curve. The ROC curve graphically describes the overall performance of our point score.25 Discrimination of the model was summarized with the area under the curve (AUC) of the ROC curve.25 In addition, we derived positive likelihood ratio (LR+) estimates for each level of the point score to estimate how much a prior probability of death would be influenced by an observed point score. The LR+ summarizes how many times more likely patients who die are to have that particular point total than patients who survive.26,27 Predicted probabilities of death and their respective confidence intervals for each point stratum were generated from a logistic regression with mortality as the outcome and the point total per patient as the sole predictor. Post-test probabilities of death were generated using hypothetical, provider-determined pre-test probabilities of death and the LR+ for each point category as previously described.27 We calculated confidence intervals for post-test probabilities of death by incorporating the uncertainty in the likelihood ratio. Pre-test probabilities were assumed to have no uncertainty. We assessed calibration using the Hosmer-Lemeshow statistic, with P<0.10 indicating inadequate fit.28 Given the low power of this test in small samples, we also compared the actual and predicted mortality within each point stratum for the development and validation cohorts.
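The discrimination summary above has a convenient rank-based form: the AUC equals the probability that a randomly chosen non-survivor has a higher point total than a randomly chosen survivor, with ties counted as one half. A minimal sketch using hypothetical point totals:

```python
import numpy as np

def auc_from_scores(scores_dead, scores_alive):
    """Rank-based AUC: P(score of a non-survivor > score of a survivor),
    counting ties as 1/2 (the Mann-Whitney interpretation of the AUC)."""
    d = np.asarray(scores_dead)[:, None]      # column vector of non-survivor scores
    a = np.asarray(scores_alive)[None, :]     # row vector of survivor scores
    return (d > a).mean() + 0.5 * (d == a).mean()

# Hypothetical point totals (0 to 4+) for non-survivors and survivors.
dead = [4, 3, 3, 2, 4, 1]
alive = [0, 1, 2, 1, 0, 3, 2, 1]
print(round(auc_from_scores(dead, alive), 3))   # → 0.844
```

With only five possible point totals the score produces many ties, which is why the tie-handling term matters for a coarse index like this one.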
We assessed the internal validity of our model by comparing the AUC of our point score to that of the predicted mortality estimated from the APACHE III score12 using the method outlined by DeLong et al.29 APACHE probabilities of death were generated by fitting the APACHE III score in a logistic model where hospital death was the outcome. We assessed external validity by applying our model to an independent database drawn from the same target study population used in constructing the prediction model (participants in the ARDSnet clinical trial ALVEOLI).30 Briefly, ALVEOLI randomized 549 intubated, mechanically ventilated patients meeting the AECC definition for ALI or ARDS within 36 hours to receive higher or lower PEEP. All patients received tidal volumes of 6 mL/kg predicted body weight. Baseline variables collected in ALVEOLI were similar to those captured in ARMA. Patients were followed until hospital discharge. We limited our analysis of ALVEOLI to patients without trauma as the primary ALI risk factor (n=505).
As a sensitivity analysis, we determined the influence of missing data on our model by performing multiple imputation (SAS PROC MI) for each incomplete covariate as described by Rubin.31 The imputed model and mortality estimates derived from the imputed model were identical to those from complete case analysis. We also utilized the same variables and cut points to determine model performance for predicting 28-day mortality.
The institutional review board for each center participating in ARDSnet approved the parent studies. All statistical analyses were conducted with SAS 9.1 (SAS Institute, Cary, NC) and Stata 9.2 (StataCorp, College Station, TX). All tests of significance utilized a two-sided α = 0.05.
Of the 902 patients participating in the ARDSnet low tidal volume study, 429 were randomized to the 12 mL/kg tidal volume arm and excluded. Of the remaining 473 patients, 59 (12%) were excluded due to trauma as the primary risk factor for ALI, leaving 414 patients (88% of patients in the 6 mL/kg arm) available for analysis. Demographics, ALI risk factor, severity of illness, and laboratory and physiology data for the cohort are shown in Table 1. Of the 414 patients in the development cohort, 139 (33%) were dead at hospital discharge, similar to the 31% mortality reported in the 6 mL/kg arm of the parent study.16 In general, patients dead at hospital discharge were older and had a greater severity of physiologic and laboratory derangement.
During multivariable modeling, 64 additional patients were excluded due to missing data for bilirubin (n=38, 9%), fluid balance (n=24, 6%), and hematocrit (n=2). Variables retained in the final regression (covariates present in >60% of the bootstrap iterations) included age, hematocrit, 24-hour fluid balance, and bilirubin. The model derived from imputed data was identical to that derived by complete case analysis. For simplicity, we report only the results of the complete case analysis. Point values generated from the regression coefficients for each of these covariates are shown in Table 2. The resulting point total for each patient was incorporated in a regression with hospital mortality as the outcome. We refer to this model as the custom model. Predicted mortality by point total for the development cohort and observed mortality in the development and validation cohorts are presented in Table 3. The mean predicted mortality for each point stratum was very close to the observed mortality in both the development and validation cohorts. In all strata, observed mortality in the validation cohort fell within the confidence bounds of the predicted mortality.
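Given the point values in Table 2 and the stratum-level predicted mortality reported in the abstract and Table 3, the bedside calculation can be sketched as follows. The function name and argument units are our own; the point assignments (hematocrit <26% = 1, bilirubin ≥2 mg/dL = 1, fluid balance >+2.5 L = 1, age 40–64 = 1, age ≥65 = 2) and the predicted mortality by point total follow the published model.

```python
def ali_point_score(age_years, hematocrit_pct, bilirubin_mg_dl, fluid_balance_l):
    """Point score for hospital mortality in ALI (hypothetical helper;
    point values follow the published model)."""
    points = 0
    if hematocrit_pct < 26:
        points += 1
    if bilirubin_mg_dl >= 2:
        points += 1
    if fluid_balance_l > 2.5:          # net 24-hour fluid balance, litres positive
        points += 1
    if 40 <= age_years <= 64:
        points += 1
    elif age_years >= 65:
        points += 2
    return points

# Predicted hospital mortality by point total (development cohort; 4 means 4+).
PREDICTED_MORTALITY = {0: 0.08, 1: 0.17, 2: 0.31, 3: 0.51, 4: 0.70}

score = ali_point_score(age_years=70, hematocrit_pct=24,
                        bilirubin_mg_dl=3.1, fluid_balance_l=3.0)
print(score, PREDICTED_MORTALITY[min(score, 4)])   # → 5 0.7
```

A 70-year-old with anemia, hyperbilirubinemia, and a strongly positive fluid balance accrues 5 points and falls in the highest (4+) stratum, where predicted mortality is 70%.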
Positive likelihood ratios (LR+) and 95% confidence intervals for each point total in the combined cohorts are also shown in Table 3. Utilizing the LR+s from Table 3, we calculated the hypothetical post-test probability of death as a function of point total from our model over a range of pre-test probabilities of death (Table 4).
The comparison between predicted mortality estimated from the APACHE III score and the mortality rate predicted by the custom model is illustrated in Figure 1. Overall, there was considerable spread in the predicted mortality estimated from the APACHE III score within each point total. The Hosmer-Lemeshow goodness of fit test for the custom model showed no evidence of inadequate fit in either the development or validation cohort (p=0.67 and p=0.79, respectively).
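The Hosmer-Lemeshow check groups patients by predicted risk and compares observed with expected deaths in each group. The following is a generic sketch on synthetic, well-calibrated data; it is not the authors' SAS code, and the group boundaries and data are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(p_pred, y, groups=10):
    """Hosmer-Lemeshow chi-square statistic and p-value (df = groups - 2)."""
    order = np.argsort(p_pred)
    p = np.asarray(p_pred, dtype=float)[order]
    y = np.asarray(y, dtype=float)[order]
    stat = 0.0
    for pg, yg in zip(np.array_split(p, groups), np.array_split(y, groups)):
        n_g = len(pg)
        expected = pg.sum()                  # expected deaths in this risk group
        observed = yg.sum()                  # observed deaths in this risk group
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n_g))
    return stat, chi2.sf(stat, groups - 2)

# Well-calibrated synthetic predictions: outcomes drawn from the predicted risks.
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, 500)
y = rng.random(500) < p
stat, pval = hosmer_lemeshow(p, y)
print(round(stat, 2), round(pval, 3))
```

Because the outcomes are generated directly from the predicted probabilities, this simulation should usually show no evidence of inadequate fit, mirroring the result reported for the custom model.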
ROC curves for the custom model in the development and validation cohorts are compared to APACHE III in Figure 2. The custom model outperformed APACHE III in the development cohort and performed worse than APACHE III in the validation cohort. The AUC for the custom model in the derivation set was 0.72 compared to 0.67 for APACHE III (p=0.09). When applied to the validation cohort the AUC for the custom model was 0.68 while the AUC for APACHE III was 0.75 (p=0.03).
At 28 days, 90 (26%) patients in the development cohort were dead. Predicted 28-day mortality, observed 28-day mortality, and LR+ for the development and validation cohorts are presented in Table 5. In general, 28-day mortality was lower than hospital mortality for each point total; however, there was good agreement between predicted and observed mortality for each point total in the validation cohort. Positive likelihood ratios for each point total were similar to those reported for hospital mortality. Discrimination of the custom model in the development cohort was similar to discrimination in the validation cohort (AUC 0.71 vs. 0.71, respectively). The Hosmer-Lemeshow goodness of fit test for the custom model for 28-day mortality showed no evidence of inadequate fit in either the development or validation cohort (p=0.95 and p=0.79, respectively).
We developed and validated a simple, easily calculable scoring model that accurately predicts hospital mortality for patients with ALI. Our simple point score, incorporating age, 24-hour fluid balance, hematocrit, and bilirubin, is able to discriminate patients with high mortality from those with a lower mortality. Importantly, observed mortality in the validation dataset fell within predicted mortality ranges for the point total strata, indicating good model calibration. Furthermore, the accuracy of the model’s prediction for 28-day mortality was similar to that predicting hospital mortality. These results support the use of this model as a useful clinical tool for prognostication, classification, and counseling.
Our results are notable for the excellent concordance, or calibration, between our custom model’s predicted mortality rate and the observed mortality in each point stratum within the validation cohort. Although the AUC of our model in the validation cohort was worse than in the development cohort, calibration remained intact. Discrimination refers to a model’s ability to distinguish survivors from non-survivors. The AUC represents the probability that a patient who died had a greater predicted probability of dying than a patient who survived. Calibration refers to the agreement between predicted probabilities and the actual, observed probabilities. Ideally, a predictive model should have excellent discrimination (AUC >0.9) and calibration (observed rates = predicted rates). Maximizing calibration is of primary importance when a model is used to counsel patients or their families about prognosis,15 because patients and their families are more interested in an accurate assessment of the probability of death (calibration) than in how sick the patient is relative to other patients (discrimination).15
This model can be used to inform prognosis (e.g. in counseling patients or families) but should not be used for decision making (e.g. withdrawal of support). The literature documenting the presence of cognitive biases in physician decision making is extensive.8,10 Confronted with the task of prognosticating in the complex environment of the ICU, physicians must assess the probability of an uncertain event. Physicians often use heuristics, or simple rules-of-thumb, in place of explicit analysis of probabilities to reduce these complex tasks to simpler judgments.8 While often useful when utilized by experienced ICU attending physicians,32 these heuristics can lead to severe errors in assessing the probability of an event. For example, the availability of recent memories (e.g. “the last patient I cared for…”),8,10 an aversion to change therapeutic course (status-quo bias),11,33 or the potential to feel more responsible for an adverse outcome due to active treatment compared to inaction (regret/outcome bias)10 can unduly influence a physician’s estimates of prognosis in the ICU. There are also factors that ought not play a role in prognostic decision making, such as physician age, experience, and religion, patient age and race, and other conscious or unconscious biases that impede rational and compassionate decision making in critically ill patients.9,34–37 These biases may contribute to the discrepancy between an attending physician’s predicted outcome and the patient’s actual outcome.38
For these reasons, there is a great need for objective measures to facilitate prognostication in critically ill patients that are immune to bias and subjectivity. To date, however, experts advocate against using traditional severity-of-illness measures (e.g. APACHE, SAPS) for decision making at the end of life for multiple reasons.32,39,40 There is little evidence to suggest that prognostication systems influence physician decisions when caring for patients at the end of life.41 Additional objections stem from the inability of severity scores to convey uncertainty in estimated probabilities of death, the poor concordance between individual predictions among different severity models,39 the poor performance of such models at the extremes of estimated probabilities (e.g. close to zero or to one), and the complexity involved in their calculation.42 Based upon these limitations, we caution physicians against using our model in isolation for decision making in individual patients; ICU severity of illness scores, including our point score, will never predict patient outcomes with 100% certainty. Though accurate for populations of patients, such models can never truly account for all uncertainty when applied to individuals. Nonetheless, families value prognostic discussions and utilize mortality estimates to prepare emotionally for the possibility that a patient may not survive, even when they appreciate that prognostic estimates may not be correct.43,44 Providing stratum-specific estimates of mortality, such as those provided by our point score, to patients and their families has been recommended by many risk communication experts.45,46
While the use of scoring systems as the sole guide to decisions about whether to initiate or continue intensive care is inappropriate,40 they can provide an objective means for providers to inform their own assessment of prognosis. Combining clinician estimates of mortality with model estimates of mortality improves one’s overall ability to discriminate patients who live from those who die compared to either estimate alone.41,47 Given physicians’ pessimistic estimates of mortality, whether combining physician and model estimates improves agreement between the expected and actual mortality is still unclear.47,48
Providers can utilize the likelihood ratios from our model at the bedside similarly to a diagnostic test to estimate the post-test probability of death. Figure 3 illustrates a hypothetical “case study” examining how a prior probability of death of 0.4 (based upon population estimates from the literature) is updated to a probability of 0.74 with knowledge that the patient’s point score is four. It is important to note that population-based data support a pre-test mortality in all comers with ALI of approximately 40%.49 Given this estimate, most ALI patients will have post-test mortalities indicating a significant chance of surviving to hospital discharge. We also stress that, in practice, providers often have uncertainty in their estimated pre-test probability of death. Our analyses do not incorporate this uncertainty and thus confidence intervals around the post-test probabilities are too narrow.
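The updating in this example is the standard odds form of Bayes' rule: convert the pre-test probability to odds, multiply by the LR+, and convert back to a probability. A minimal sketch reproducing the quoted 0.40 → 0.74 example; the LR+ of roughly 4.3 is implied by those two numbers, not taken from Table 3.

```python
def posttest_probability(pretest, lr_positive):
    """Update a pre-test probability of death with a positive likelihood ratio:
    post-test odds = pre-test odds x LR+, then convert odds back to probability."""
    pre_odds = pretest / (1 - pretest)
    post_odds = pre_odds * lr_positive
    return post_odds / (1 + post_odds)

# Pre-test mortality of 0.40 with LR+ ~= 4.27 (the value implied by the
# 0.40 -> 0.74 example in the text; the actual Table 3 value may differ).
print(round(posttest_probability(0.40, 4.27), 2))   # → 0.74
```

An LR+ of 1 leaves the probability unchanged, which is why only point totals with likelihood ratios well above or below 1 meaningfully shift a provider's prior estimate.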
There are several strengths to our analysis. We utilized a well-defined cohort of patients with ALI cared for in hospitals throughout the United States. We subsequently validated our model utilizing an independent cohort of patients arising from a similar patient population. Finally, our score, utilizing only four readily available clinical variables, is considerably easier to calculate than the APACHE III or SAPS 3 predicted probability of death, yet maintains excellent discrimination and calibration.
We also recognize several limitations to our analysis. First, our model was derived from data from the ARDSnet low tidal volume study, a study conducted over 10 years ago. The mortality of ALI has decreased over time as implementation of evidence-based therapy in this disease has improved.50 We attempted to address this limitation by validating the model in a more contemporary population of patients (ALVEOLI); nevertheless, our model may perform differently in more current ALI cohorts. Second, our derivation population had a small number of deaths, limiting our ability to evaluate all potential predictors of death without overfitting the model.51 Third, in contrast to the development of APACHE III, our model development was limited to variables available in the data set; we were unable to evaluate some potentially important predictors, such as pulmonary dead space and PEEP responsiveness, as they were not collected routinely in this cohort.52–54 We were also unable to evaluate the predictive ability of other comorbidities, such as chronic liver disease and metastatic cancer55, as patients with these underlying illnesses were excluded from the parent study. Fourth, in addition to excluding trauma patients, we excluded 15% (64/414) of the cohort due to missing data to maximize the utility of our model in practice. This may have influenced the variables selected for our model and may bias the mortality within each stratum when applied. Validation of our model in populations with complete data is important prior to its routine use. Fifth, our model was derived in a cohort collected from multiple academic tertiary-care hospitals participating in a randomized trial with specific exclusion criteria.
Documented differences between academic- and community-based ALI patients, and patients enrolled versus not enrolled in randomized trials may prevent generalization to the broader community.49 Moreover, our inclusion of fluid balance, a treatment dependent variable, may influence the performance of our model under different practice patterns. Further validation of this model in a contemporary, large, multicenter study should be performed prior to widespread adoption. Finally, APACHE III was developed to predict mortality utilizing data during the first 24 hrs of ICU stay; therefore, our use of APACHE III scores generated at the time of enrollment may have resulted in underperformance of APACHE III.
We have developed a simple prognostic score that accurately identifies groups of ALI patients at high risk of death. This model can facilitate a provider’s assessment of prognosis when informing patients and their families about the possible outcomes of ALI. Prior to widespread use, this model should be validated in contemporary non-clinical trial populations.
Financial Support: F32 HL090220, N01 HR46055, NO1 HR46058
This study was conducted at the University of Pennsylvania and the University of Washington.
Conflict of interest: All authors have no conflicts of interest to disclose.