Phase 1: Algorithm parameters
Using the first and second datasets, the authors empirically set the parameters for each algorithm. For EWMA, the authors set a decay rate λ=0.3 and an alerting threshold k=5. For CUSUM, the authors used a V-mask to determine the alerting threshold, with a daily rise of three times the standard deviation of the CUSUM statistic for each particular organism. SaTScan was executed using its purely temporal Poisson model, and WSARE was executed with its Fisher's exact scoring metric and 100 randomizations for each day.
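As an illustration of how the EWMA parameters interact, the following is a minimal sketch of an EWMA alerting rule with λ=0.3 and k=5. The text does not state how the baseline mean and standard deviation were estimated or which control limit formula was used. The `baseline` argument, the function name `ewma_alerts`, and the asymptotic control limit are therefore assumptions for illustration only.

```python
from math import sqrt
from statistics import mean, stdev


def ewma_alerts(counts, baseline, lam=0.3, k=5.0):
    """Return indices of days whose EWMA statistic exceeds the upper limit.

    counts   -- daily isolate counts to monitor
    baseline -- historical daily counts used to estimate mean and SD
                (assumption: the source does not describe the baseline)
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    # Assumed asymptotic EWMA limit: mu + k * sigma * sqrt(lam / (2 - lam))
    limit = mu + k * sigma * sqrt(lam / (2.0 - lam))
    z = mu  # start the smoothed statistic at the baseline mean
    alerts = []
    for day, x in enumerate(counts):
        # Smooth today's count with the previous EWMA value
        z = lam * x + (1.0 - lam) * z
        if z > limit:
            alerts.append(day)
    return alerts
```

A larger λ weights recent days more heavily, while a larger k raises the alerting limit and so reduces the number of alerts.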
Phase 2.1: Expert review results
For institution-wide microbial data covering the 2-year study period, the four outbreak detection algorithms collectively generated a total of 257 alerts (CUSUM: 114, EWMA: 66, SaTScan: 21, WSARE: 56). To present alerts to clinical expert reviewers, the study combined any computer-generated alerts with start and stop dates differing by fewer than 2 days into one single alert. As a result, six alerts detected by two algorithms and one alert detected by three algorithms were combined to form the final review list of 249 alerts.
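The merging rule can be sketched as follows. Six two-algorithm merges remove six alerts and one three-algorithm merge removes two, which accounts for the reduction from 257 to 249. The tuple layout and the function name `merge_alerts` are hypothetical. The sketch also assumes that the alerts passed in already concern the same organism and location, and that each new alert is compared with the first alert of an existing group.

```python
from datetime import date


def merge_alerts(alerts, max_gap_days=2):
    """Group alerts whose start dates and stop dates each differ by
    fewer than max_gap_days.

    alerts -- list of (algorithm, start, stop) tuples (hypothetical layout)
    Returns a list of dicts, one per merged alert.
    """
    merged = []
    for algorithm, start, stop in sorted(alerts, key=lambda a: (a[1], a[2])):
        for group in merged:
            # Compare against the first alert that opened this group
            if (abs((start - group["start"]).days) < max_gap_days
                    and abs((stop - group["stop"]).days) < max_gap_days):
                group["algorithms"].append(algorithm)
                break
        else:
            # No existing group is close enough, so open a new one
            merged.append({"start": start, "stop": stop,
                           "algorithms": [algorithm]})
    return merged


# Illustrative dates only, not taken from the study data
example = [
    ("CUSUM", date(2020, 1, 1), date(2020, 1, 5)),
    ("EWMA", date(2020, 1, 2), date(2020, 1, 5)),
    ("WSARE", date(2020, 1, 10), date(2020, 1, 12)),
]
review_list = merge_alerts(example)
```

Here the CUSUM and EWMA alerts collapse into a single review item, while the WSARE alert remains separate.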
Percent agreement on the clusters between the two assigned reviewers ranged from 79% to 88%, with Cohen's κ ranging from 0.11 to 0.49 (table 2). Overall, reviewers agreed on their determinations for 210 of the 249 alerts, with 17 (8.1%) deemed candidate outbreaks.
| Table 2 Percent agreement between reviewers (Cohen's κ in parentheses) |
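High percent agreement can coexist with a low κ when one category (here, 'not an outbreak') dominates, because κ discounts the agreement expected by chance. The sketch below computes both quantities from a 2×2 table of reviewer decisions. The counts in the example are hypothetical and are not taken from the study.

```python
def cohens_kappa(both_yes, a_only, b_only, both_no):
    """Return (observed agreement, Cohen's kappa) for two reviewers.

    both_yes -- clusters both reviewers called candidate outbreaks
    a_only   -- clusters only reviewer A called candidate outbreaks
    b_only   -- clusters only reviewer B called candidate outbreaks
    both_no  -- clusters both reviewers called false alarms
    """
    n = both_yes + a_only + b_only + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_only) / n
    b_yes = (both_yes + b_only) / n
    # Agreement expected if the two reviewers decided independently
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    # Undefined (division by zero) if expected agreement is exactly 1
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts: 50 clusters, 42 agreements
agreement, kappa = cohens_kappa(both_yes=2, a_only=4, b_only=4, both_no=40)
```

With these illustrative counts the reviewers agree on 84% of clusters, yet κ is only about 0.24, because most of that agreement would be expected by chance.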
For the 39 clusters on which the pair of initial reviewer assessments disagreed, the study assigned a randomly selected third reviewer. Of the 39, the third reviewer deemed nine (23%) to be candidate outbreaks. Six randomly selected candidate outbreaks (where the two initial reviewers agreed the cluster was a potential outbreak) and six randomly selected false alarms (where the reviewers had agreed the cluster was not an outbreak) were also assigned to a random third reviewer. The third reviewer agreed with the first two reviewers on all six of the false alarms. However, for the six pairwise-agreed-upon candidate outbreaks, the third expert reviewer only agreed with the initial experts' judgment once (17%).
The hospital infection control service had previously identified five suspected outbreak clusters during the study period. Those clusters were not detected by any of the algorithms as originally configured for the phase 1 study. Of the five, two have been excluded from the study analysis. In one, the laboratory assay for the involved organism, Clostridium difficile, was not included in the input since the dataset only included organisms identified by microbiological culturing and thus C difficile antigen could not be detected by the algorithms. In the other, the outbreak spanned several months and began prior to the beginning of the study period. The study ‘gold standard’ outbreak dataset therefore contained 29 candidate outbreaks: 17 from the initial expert consensus review, nine from the second expert conflict-resolving review, and three from the infection control archival data.
Phase 2.2: Algorithm performance
For the four evaluated algorithms, the positive predictive value relative to the study-derived gold standard ranged from 5.3% to 29%, with sensitivities ranging from 21% to 31%. Table 3 shows individual results for each algorithm. The differences in sensitivity were not sufficient to reject the null hypothesis that the algorithms had identical performance. For positive predictive value, CUSUM was significantly lower than all other algorithms (p<0.001 in all comparisons), and EWMA and WSARE were significantly lower than SaTScan (p<0.001 for each).
| Table 3 Cluster determination by algorithm |
Stratifying the analysis by location type (hospital-wide clusters and inpatient units as inpatient; clinics and emergency rooms as outpatient) demonstrated that clusters from inpatient locations were much more likely to be considered candidate outbreaks than clusters from outpatient locations (inpatient: 21/120 clusters vs outpatient: 5/129 clusters; χ2 p=0.002).
Phase 3.1: Parameter adjustment
As EWMA yielded both better positive predictive value and better sensitivity than CUSUM, project members adjusted EWMA's decay rates and minimum alerting thresholds in phase 3. After the adjustments, EWMA detected up to 24 of the 29 candidate outbreaks, but its positive predictive value suffered: at this most sensitive setting it generated 629 false alarms, for a positive predictive value of 3.7%.
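The trade-off at the most sensitive setting can be checked with a short calculation. The sketch assumes that each of the 24 detected outbreaks counts as one true-positive alert; the function names are illustrative.

```python
def positive_predictive_value(true_positives, false_positives):
    """Fraction of alerts that correspond to candidate outbreaks."""
    return true_positives / (true_positives + false_positives)


def sensitivity(true_positives, total_outbreaks):
    """Fraction of gold-standard candidate outbreaks that were detected."""
    return true_positives / total_outbreaks


# Figures reported for the most sensitive adjusted EWMA setting
ppv = positive_predictive_value(true_positives=24, false_positives=629)
sens = sensitivity(true_positives=24, total_outbreaks=29)
print(f"PPV = {ppv:.1%}, sensitivity = {sens:.1%}")
```

Under that assumption, 24/(24+629) reproduces the reported 3.7%, and 24/29 corresponds to a sensitivity of about 83%.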
Phase 3.2: Scoring metrics
Using the minimum alerting threshold k as the initial ranking metric to sort the original list of 249 clusters generated by the four algorithms yielded an area under the precision-recall curve (AUC) of 0.283, where the AUC for a precision-recall curve represents the average overall precision. A linear interpolation of the expert reviewers' performance targets of 0.5 precision at 0.9 recall and 0.75 precision at 0.25 recall gives a target AUC of 0.65. The accompanying figure shows the precision-recall curve for this initial metric, with the curve for the adjusted EWMA algorithm and points for each of the individual algorithms.
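Both quantities can be illustrated with a short sketch. `average_precision` is one common estimate of the area under a precision-recall curve for a ranked list; the text does not state which integration method was used, so it is an assumption here. `target_auc` takes the straight line through the two expert targets, extends it over recall 0 to 1, and returns the area beneath it, a reading that reproduces the stated target of 0.65.

```python
def average_precision(labels):
    """Average precision for clusters sorted by descending ranking metric.

    labels -- list of booleans, True where the cluster is a candidate outbreak
    """
    hits = 0
    total = 0.0
    for rank, is_outbreak in enumerate(labels, start=1):
        if is_outbreak:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / hits if hits else 0.0


def target_auc(p1=(0.25, 0.75), p2=(0.9, 0.5)):
    """Area under the line through two (recall, precision) targets,
    taken over recall 0..1."""
    (r1, q1), (r2, q2) = p1, p2
    slope = (q2 - q1) / (r2 - r1)
    precision_at_zero = q1 - slope * r1
    precision_at_one = q1 + slope * (1.0 - r1)
    # Area of a trapezoid of unit width
    return (precision_at_zero + precision_at_one) / 2.0


print(round(target_auc(), 2))
```

A ranking metric improves the AUC when it moves candidate outbreaks toward the top of the sorted list, which is the goal of the adjustments in this phase.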
To investigate whether primary culture specimen type could help to separate clinically significant clusters from less important ones, project members developed an algorithm that labeled each cluster by specimen type (blood, urine, wound, etc) if more than 50% of the cultures in a given cluster shared a common source. A χ2 test compared that specimen type to all other cultures independent of source type. The only statistically significant relationship this analysis identified was that urine cultures were less reliable indicators of clusters than other specimen types (2.0% of urine vs 13% non-urine; p=0.029). After adjusting the ranking metric downward for clusters of urine cultures, the k-sorted precision-recall AUC improved from 0.283 to 0.356. As observed in phase 2, clusters in inpatient locations were more likely to produce candidate outbreaks than clusters in outpatient units. After increasing the ranking metric for inpatient clusters, the AUC rose from 0.356 to 0.489.
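The labeling rule (more than 50% of cultures sharing a common source) can be sketched as below. The function name and the list-of-strings input are hypothetical. The sizes of the downward and upward ranking adjustments are not given in the text, so they are not modeled here.

```python
from collections import Counter


def label_cluster(specimen_types, min_share=0.5):
    """Return the dominant specimen type of a cluster, or None.

    specimen_types -- one entry per culture, e.g. "blood", "urine", "wound"
    A label is assigned only if strictly more than min_share of the
    cultures share the same specimen type.
    """
    if not specimen_types:
        return None
    specimen, count = Counter(specimen_types).most_common(1)[0]
    if count / len(specimen_types) > min_share:
        return specimen
    return None
```

For example, a cluster of two urine cultures and one blood culture is labeled as urine, whereas a cluster split evenly between urine and blood receives no label.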
Project members calculated antibiotic susceptibility difference scores for the 165 clusters that met the 50% criterion, including six of the 19 candidate outbreaks. Antibiotic susceptibility difference scores ranged from 0 to 138 in the false alarm clusters and from 0 to 2.7 in the candidate outbreaks. Based on these results, project members generated new precision-recall curves after eliminating all clusters with difference scores greater than a conservative threshold of 5 and, separately, greater than an aggressive threshold of 3. These adjustments increased the precision-recall AUC from 0.489 to 0.528 for the conservative threshold and to 0.553 for the aggressive threshold. Precision-recall curves for each of these adjustments are shown in the accompanying figure.
Phase 3.3: Retrospective evaluation of combined algorithms
During the 6-month retrospective evaluation period, infection control staff identified and confirmed two single-unit outbreaks: an outbreak of vancomycin-resistant Enterococcus and an outbreak of C difficile. Unlike the phase 2 dataset, the phase 3 dataset included non-culture assays, allowing the system to detect the C difficile outbreak. The system detected a total of 41 clusters during that time period, including both of the confirmed outbreak clusters. No phase 2-type expert analyses of the other 39 clusters were conducted.