The purpose of this study was to evaluate the criterion validity of selected surgical PSIs in the VA using chart-abstracted data collected on surgical adverse events by NSQIP. Despite differences between the PTF and NSQIP, we were able to create a matched PTF/NSQIP file to validate five of the surgical PSIs (and two experimental PSIs) using “gold standard” clinical data. In general, we found moderate sensitivities and PPVs for the original PSIs. The proportion of adverse events identified by NSQIP that were also flagged by ICD-9-CM codes varied across the PSIs, from 19 percent for “postoperative respiratory failure” to 56 percent for “postoperative PE/DVT.” The proportion of events identified by ICD-9-CM codes that were confirmed by NSQIP had a similar range, from 22 percent for “postoperative PE/DVT” to 74 percent for “postoperative respiratory failure.” All PSIs had high specificities and positive likelihood ratios, indicating that flagged events were from 65 to 524 times more likely to occur in a hospitalization that had a true adverse outcome (based on NSQIP) than in a hospitalization that did not have a true adverse outcome.
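The relationship among these measures can be made concrete with a small sketch. The counts below are hypothetical and chosen only for illustration; they are not study data, and the class and field names are our own. The sketch shows how an indicator with only moderate sensitivity and PPV can still have a very large positive likelihood ratio when specificity is close to 1, because the likelihood ratio divides sensitivity by the false-positive rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cross-classification of a PSI flag against a chart-based reference standard."""
    tp: int  # flagged by the PSI and confirmed by chart abstraction
    fp: int  # flagged by the PSI but not confirmed
    fn: int  # confirmed by chart abstraction but not flagged
    tn: int  # neither flagged nor confirmed

    @property
    def sensitivity(self) -> float:
        # Share of true events that the PSI flags.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Share of event-free hospitalizations that the PSI leaves unflagged.
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self) -> float:
        # Share of flagged hospitalizations that are true events.
        return self.tp / (self.tp + self.fp)

    @property
    def positive_lr(self) -> float:
        # Sensitivity divided by the false-positive rate (1 - specificity).
        false_positive_rate = self.fp / (self.fp + self.tn)
        return self.sensitivity / false_positive_rate


# Hypothetical counts for illustration only (NOT taken from the study).
table = TwoByTwo(tp=50, fp=50, fn=50, tn=9850)

print(f"sensitivity = {table.sensitivity:.2f}")
print(f"specificity = {table.specificity:.4f}")
print(f"PPV         = {table.ppv:.2f}")
print(f"LR+         = {table.positive_lr:.1f}")
```

With these invented counts, half of the true events are missed and half of the flags are false, yet a flag is still far more likely in a hospitalization with a true event than in one without, because false positives are rare relative to the large number of event-free hospitalizations.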
NSQIP events were generally defined more narrowly or precisely than PSI events, except for “postoperative wound dehiscence.” Our alternative PSI definitions improved the sensitivities of all five PSIs, although the only statistically significant increases were for “postoperative respiratory failure” and “postoperative wound dehiscence.” For these two PSIs, we observed a tradeoff between sensitivity and PPV, although the decrease in PPV (and positive likelihood ratio) was statistically significant only for “postoperative wound dehiscence.” In version 3.0 of the PSI software, AHRQ adopted our alternative definitions for “postoperative physiologic and metabolic derangements,” “postoperative respiratory failure,” and “postoperative sepsis” (AHRQ 2008). For the other two indicators, the modest improvements in sensitivity with our alternative definitions were judged to be outweighed by decreased PPV.
Our research builds on other studies that have attempted to validate or improve the PSIs by linking administrative and chart-abstracted data on adverse events.
Zhan et al. (2007) used 2002–2004 Medicare discharge data to compare “postoperative PE/DVT” events identified by ICD-9-CM codes with medical record information on 20,868 beneficiaries. Their sensitivity, specificity, and PPV estimates were 68, 90, and 29 percent, respectively. Our sensitivity and PPV estimates for “postoperative PE/DVT” (56 and 22 percent, respectively) were slightly lower than those reported by Zhan and colleagues, perhaps due to superior coding in non-VA hospitals, variation in the epidemiology of thromboembolic disease, or NSQIP's more restrictive definition of PE. NSQIP required a “high probability” nuclear scan, but PE may be diagnosed after an “intermediate-probability” scan in a high-risk patient (PIOPED 1990). Because of the indicator's poor predictive ability, we conducted sensitivity analyses separately for PE and DVT. Using original PSI definitions, we found sensitivity and PPV of 53 and 42 percent, respectively, for PE alone, compared with sensitivity and PPV of 29 and 15 percent, respectively, for DVT alone. A recent study using administrative data from New York and California suggested that the poor PPV of “postoperative PE/DVT” is largely attributable to preexisting or chronic thromboembolic disease, as 54–57 percent of these diagnoses were reported by hospitals as present on admission (POA) (Houchens, Elixhauser, and Romano 2008). By contrast, the “POA” rates for the other PSIs evaluated herein ranged from 6–7 percent for “postoperative respiratory failure” to 23–36 percent for “postoperative derangements.”
Gallagher, Cen, and Hannan (2005b) examined the validity of the PSI “accidental puncture or laceration.” Of 67 cases found in New York State administrative data in 2000, 75 percent (50) appeared to be true cases based on medical record abstraction. Three recent studies showed that the sensitivity of “postoperative PE/DVT” (Weller et al. 2004), “postoperative hemorrhage and hematoma” (Shufelt, Hannan, and Gallagher 2005), and “selected infections due to medical care” (Gallagher, Cen, and Hannan 2005a) could be improved by expanding the PSI definitions to capture readmissions within 30 days of a previous surgical hospitalization. AHRQ has recently revised the specifications of “postoperative hemorrhage and hematoma” to enhance sensitivity, based on the findings of Shufelt and colleagues (AHRQ 2008).
Finally, Best et al. (2002) used 1994–1995 VA administrative data to compare ICD-9-CM codes from discharge abstracts to NSQIP chart-abstracted adverse events. Eighty-six percent of the NSQIP indicators had potentially matching ICD-9-CM codes. Of these, only 23 percent had sensitivities >50 percent and only 31 percent had PPVs >50 percent. However, the coding of VA inpatient data has substantially improved since this study was conducted (Kashner 1998), so its applicability to present circumstances is limited.
Evaluation of the criterion validity of the PSIs remains a challenge because of the limited data available for analysis. Sensitivity and PPV estimates depend on the accuracy and completeness of chart-abstracted data. Although we used NSQIP as the gold standard, it assesses only major noncardiac surgeries and does not capture complications that may result from high-volume minor surgeries. Because of NSQIP's exclusion criteria, we were able to match only about 50 percent of our flagged PSI hospitalizations with NSQIP adverse events. Further, we examined relatively infrequent events, limiting the power of our analyses.
The generalizability of our findings to non-VA administrative data sets is uncertain. VA inpatient data have a high level of completeness and are not affected by financial incentives for providers to “upcode” diagnoses (Kashner 1998). Some administrative data sets, but not the PTF, permit users to distinguish between conditions that develop during hospitalization and those that are “POA.” Incorporating this information into the PSI logic, as AHRQ now encourages, would be expected to enhance PPV with little effect on sensitivity. Administrative data sets also differ on the number of allowable diagnoses and procedures. The VA PTF Main File contains a maximum of 10 diagnosis fields, and the Bedsection Files (also used in this study) contain up to five codes each, yielding a maximum of 31 unique diagnoses per hospital stay. By contrast, many state databases contain only 10–15 diagnosis fields. However, this difference may have little practical significance, as we recently found that the VA data sets and the HCUP Nationwide Inpatient Sample had the same average number of diagnosis codes per discharge (6.5).
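The expected effect of POA information on PPV can be sketched as follows. The record layout, field names, and the five flagged hospitalizations are hypothetical and invented for illustration; the sketch also assumes that no chart-confirmed event carries a POA indicator, which is the condition under which excluding POA diagnoses leaves sensitivity unchanged.

```python
from typing import Iterable, List, NamedTuple


class Flagged(NamedTuple):
    """One PSI-flagged hospitalization (hypothetical record layout)."""
    stay_id: str
    present_on_admission: bool  # POA indicator on the triggering diagnosis
    confirmed_by_chart: bool    # event confirmed by the reference standard


def ppv(cases: Iterable[Flagged]) -> float:
    """Share of flagged hospitalizations confirmed by the reference standard."""
    cases = list(cases)
    if not cases:
        raise ValueError("no flagged cases")
    return sum(c.confirmed_by_chart for c in cases) / len(cases)


def exclude_poa(cases: Iterable[Flagged]) -> List[Flagged]:
    """Drop flags whose triggering diagnosis was present on admission."""
    return [c for c in cases if not c.present_on_admission]


# Invented records for illustration only (NOT study data).
flagged = [
    Flagged("A", present_on_admission=False, confirmed_by_chart=True),
    Flagged("B", present_on_admission=True, confirmed_by_chart=False),
    Flagged("C", present_on_admission=True, confirmed_by_chart=False),
    Flagged("D", present_on_admission=False, confirmed_by_chart=False),
    Flagged("E", present_on_admission=False, confirmed_by_chart=True),
]

ppv_before = ppv(flagged)              # all five flags in the denominator
ppv_after = ppv(exclude_poa(flagged))  # POA flags removed from the denominator

print(f"PPV before POA exclusion: {ppv_before:.2f}")
print(f"PPV after POA exclusion:  {ppv_after:.2f}")
```

In this invented sample, both confirmed events survive the exclusion, so the number of true positives (and hence sensitivity) is unchanged while the denominator of PPV shrinks. If some true in-hospital events were miscoded as POA, sensitivity would fall as well.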
Ten PSIs were recently submitted to the National Quality Forum (NQF) for consideration as hospital performance measures (AHRQ 2007b). Several of these indicators, including “postoperative physiologic and metabolic derangement,” “postoperative respiratory failure,” and “postoperative sepsis,” were withdrawn because of insufficient evidence of validity, actionability, or both (although “postoperative respiratory failure” appears to have relatively high sensitivity and PPV). The poor PPV for “postoperative PE/DVT” is correctable, with future implementation of POA coding and proposed new codes for subacute and upper extremity thromboses (Centers for Disease Control and Prevention 2008). Of the PSIs examined in this study, only two (“postoperative respiratory failure” and “postoperative wound dehiscence”) appear ready for use in efforts beyond quality improvement and screening (based on sensitivity and PPV exceeding 60 percent). Only the latter indicator, among those evaluated here, is now endorsed by the NQF (National Quality Forum 2008). One experimental PSI (“postoperative myocardial infarction”) also appears promising. However, the high positive likelihood ratios of all five PSIs suggest that they are valuable case-finding tools for providers and quality improvement organizations. We should continue to explore more creative algorithms based on diagnosis and procedure codes to improve the sensitivity and PPV of the PSIs. The addition of POA reporting should also help to improve PSI validity (Naessens et al. 2007; Bahl et al. 2008). Ongoing and future research, such as the AHRQ Validation Pilot Project, will build on our results by evaluating both surgical and nonsurgical PSIs, and by reviewing random samples of eligible records, without irrelevant exclusion criteria.
Efforts to improve safety will be facilitated by the availability of valid measures that can be used to evaluate hospital performance. The AHRQ PSIs represent a useful step in this direction, but our results demonstrate that health data agencies, purchaser coalitions, and other sponsors should still proceed cautiously in using administrative data to identify postoperative complications for the purpose of public reporting on hospital safety performance.