We sought to validate three methods for automated safety monitoring by evaluating clinical trials with elevated adverse events.
An automated outcomes surveillance system was used to retrospectively analyze data from two randomized, multi-center TIMI trials. Trial A was stopped early due to elevated 30-day mortality rates in the intervention arm. Trial B was not stopped early, but there was transient concern regarding 30-day intracranial hemorrhage rates. We compared Statistical Process Control (SPC), Logistic Regression Risk Adjusted SPC (LR-SPC), and Bayesian Updating Statistic (BUS) methods with a standard prospective two-arm event rate analysis. Each method compares observed event rates to alerting boundaries established with previously collected data. In this evaluation, the control arms served as the prior data, and the intervention arms served as the observed data.
Trial A experienced elevated 30 day mortality rates beginning 7 months after the start of the trial and continuing until termination at month 14. Trial B did not experience elevated major bleeding rates. Combining the alerting performance of each method across both trials resulted in sensitivities and specificities of 100% and 85% for SPC, 0% and 100% for BUS, 100% and 93% for both LR-SPC models, respectively.
Both SPC and LR-SPC methods correctly identified the majority of months during which the cumulative event rates were elevated in Trial A, but were susceptible to false positive alerts in Trial B. The BUS method did not generate any alerts in either trial and requires revision.
Recent product recalls of both medications and medical devices have highlighted the need for robust improvements in post-marketing surveillance methods.1 While post-marketing clinical outcomes data have become increasingly available in the form of clinical registries and electronic health records, there is no consensus on which continuous monitoring methodologies would be most appropriate for these types of observational cohort data.
There are a number of statistical process control (SPC) techniques used for adverse event surveillance in industrial processes, such as Shewhart control charts,2 Exponentially Weighted Moving-Average charts (EWMA), cumulative sum charts (CUSUM), and sequential probability ratio tests (SPRT).3 Until recently, these methods had limited application in the medical environment because of clinical data heterogeneity, which arises from the demographic and clinical characteristics of patients, provider practice variation, non-standard data collection, and missing data elements. While attempts have been made to apply basic SPC methods to medical outcomes surveillance,4–7 these factors require sophisticated risk adjustment and protocols for establishing adverse event rate alerting boundaries that are unnecessary in industrial processes.
Statistical advances in risk adjustment methods and the subsequent incorporation of those techniques into some SPC frameworks have facilitated use of these tools in medical outcomes surveillance. There have been some evaluations of these methods within specific clinical domains, such as pediatric and adult cardiac surgery,3, 8, 9 general surgery,10 and interventional cardiology.11 Most of these studies were performed on retrospective clinical registry data, but the results are encouraging for detecting unexpected or elevated adverse event rates in broader applications. However, the full utility of these methods lies in prospective data monitoring.
We developed a computer application, Data Extraction and Longitudinal Time Analysis (DELTA), that provides both retrospective and prospective outcomes monitoring among new and established medical devices and medications.12 A number of non-risk adjusted and risk adjusted SPC-based methods were developed and adapted for use in this application, and pilot studies using the methods and system have been successfully conducted within a single institution interventional cardiology clinical registry.11–14
However, our methods have not been fully validated. Randomized controlled trial (RCT) data provide a gold standard for comparison. Any attempt to validate these methods using observational cohort data faces limitations from potential unmeasured confounding. RCT data balance unmeasured confounding between control and intervention arms within the trial design, and using the control arm to provide the baseline or event rate expectations addresses this limitation. In addition, RCT data provide meticulously adjudicated outcomes with independent review by a data safety monitoring board (DSMB) both at set time intervals and after the conclusion of a trial. In this idealized setting, results from an SPC-based method with and without risk adjustment should be approximately the same, which allows flaws in the risk adjustment unrelated to confounding to be discovered. This also allows the baseline accuracy of both types of methods to be compared with standard trial statistical analysis and the DSMB findings.
In this study, we sought to validate three SPC-based methods embedded in an automated monitoring application against a standard of statistical methods employed by DSMBs with randomized controlled trial data. To provide a true positive signal, we selected a trial that was stopped early because of a high rate of adverse events. To provide a true negative signal with a reasonable chance of a false positive alarm, we selected a trial in which adverse event rates were of concern early on but never met the established DSMB stopping rules.
Two Thrombolysis In Myocardial Infarction (TIMI) randomized, controlled trials with DSMB monitoring and safety endpoint stopping rules were selected for use in this study.15, 16 Trial A was a multi-center, randomized trial that evaluated the efficacy of an oral platelet glycoprotein antagonist (GP IIb/IIIa inhibitor) versus placebo with regards to a composite primary endpoint of death, myocardial infarction, stroke, or recurrent ischemia at rest leading to re-hospitalization or emergent revascularization at 30 days and 10 months post-randomization. A total of 10,288 patients were enrolled in the study within three arms: 3,421 in the placebo arm, 3,330 in the active treatment arm with sustained dosing, and 3,537 in the active treatment arm with reduced dosing after the first thirty days. The primary safety endpoints for the trial were all-cause mortality or severe or life-threatening bleeding, defined as intracranial hemorrhage or bleeding associated with severe hemodynamic compromise, a drop in hematocrit by 15% or more, or requiring a blood transfusion. The trial was terminated by the DSMB before the goal of 12,000 patients was reached due to an increase in the 30-day mortality in the reduced dose treatment arm.15
Trial B was a multi-center, randomized trial that evaluated the efficacy of an oral anti-platelet agent versus placebo in combination with fibrinolytic therapy in ST elevation myocardial infarction with regards to a composite primary endpoint of an occluded infarct-related artery by angiography, or death or recurrent myocardial infarction in the absence of angiography. A total of 3,491 patients were enrolled in the study: 1,739 in the control arm, and 1,751 in the treatment arm. The primary safety endpoint was TIMI major bleeding, of which intracranial hemorrhage was a component.17 The trial operations committee became concerned with the rates of intracranial bleeding and major bleeding early in the trial based upon aggregated blinded data, but neither safety outcome reached DSMB stopping criteria during the trial. This study was approved by the Brigham & Women’s Institutional Review Board.
Three statistical methods were used to assess for deviation from acceptable safety benchmarks during this evaluation. These include both a non-risk adjusted and a risk-adjusted SPC method based upon Shewhart control charts. The third method is a non-risk adjusted Bayesian adaptation of Shewhart control charts. These methods are reviewed in brief below, and are more fully described elsewhere.12
All of the methods use four values in each time period to determine whether or not to generate an alert. These values are: observed number of events, observed number of cases, expected number of events, and expected number of cases. A limitation of standard Shewhart control charts is that the observed number of cases (sample size) is ignored, since the observed value is only represented as a proportion or point estimate. This can result in alerting insensitivity, particularly in cumulative analyses where the values for each observed period are the sum of all the prior observed periods and the current period. We addressed this by adapting each of the methods to generate the alert threshold using Wilson’s method for comparing independent proportions.24
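The four-value alert decision described above can be illustrated with a short sketch. It assumes that the interval intended is the hybrid score interval for a difference of two independent proportions, built from single-sample Wilson intervals; the function names, the z value of 1.96, and the "lower bound above zero" alert rule are illustrative assumptions rather than the DELTA implementation.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a single proportion (events / n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def difference_interval(obs_events, obs_n, exp_events, exp_n, z=1.96):
    """Hybrid score interval for (observed rate - expected rate)."""
    p_obs, p_exp = obs_events / obs_n, exp_events / exp_n
    l_obs, u_obs = wilson_interval(obs_events, obs_n, z)
    l_exp, u_exp = wilson_interval(exp_events, exp_n, z)
    diff = p_obs - p_exp
    lower = diff - math.sqrt((p_obs - l_obs) ** 2 + (u_exp - p_exp) ** 2)
    upper = diff + math.sqrt((u_obs - p_obs) ** 2 + (p_exp - l_exp) ** 2)
    return lower, upper

def should_alert(obs_events, obs_n, exp_events, exp_n, z=1.96):
    """Assumed rule: alert when the whole interval lies above zero."""
    lower, _ = difference_interval(obs_events, obs_n, exp_events, exp_n, z)
    return lower > 0

# Hypothetical counts, not taken from either trial.
print(should_alert(100, 1000, 30, 1000))
print(should_alert(50, 1000, 50, 1000))
```

Because both sample sizes enter the interval, the same observed proportion produces a narrower interval, and therefore an earlier alert, as the number of observed cases grows.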
This method is an adaptation (as described above) of standard non-risk adjusted Shewhart control charts to provide cumulative event rate monitoring in which adverse events and total cases are aggregated and analyzed in pre-defined time periods. This method is most appropriate under unchanging conditions where deviations from an established norm need to be detected, and is very reliable for these purposes.2
Logistic regression adjusted SPC18 is an experimental methodology that incorporates our Shewhart control chart adaptation with logistic regression (LR), a modeling technique that estimates the probability of an outcome on a case-level basis. An LR model is developed from the available baseline data, and validated by resampling methods or external data sets. The model is then applied to the observed data, and the model outcome probabilities provide the expected (baseline) number of events. This incorporates risk adjustment into the method by allowing the expected event rate to change over time with the composition of the observed cases.
However, the limitations of this method are those inherent in LR modeling in general. The levels of discrimination and calibration of the model on the baseline data are not guaranteed to remain the same even on closely related subsequent populations. While discrimination is generally retained across different patient populations, calibration can vary, which may directly impact monitoring results.19
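The expected-event calculation described for LR-SPC can be sketched as follows. The sketch assumes that the expected number of events is the sum of the case-level predicted probabilities; the intercept, coefficients, and risk factor names are hypothetical and are not the models developed in this study.

```python
import math

def predicted_probability(intercept, coefficients, case):
    """Logistic model probability for one case (dict of risk factor values)."""
    linear = intercept + sum(coefficients[name] * case[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-linear))

def expected_events(intercept, coefficients, cases):
    """Assumed definition: expected events = sum of predicted probabilities."""
    return sum(predicted_probability(intercept, coefficients, c) for c in cases)

# Hypothetical model and observed cases, for illustration only.
intercept = -3.0
coefficients = {"age_over_75": 0.8, "diabetes": 0.5}
observed_cases = [
    {"age_over_75": 0, "diabetes": 0},
    {"age_over_75": 1, "diabetes": 0},
    {"age_over_75": 1, "diabetes": 1},
]

expected = expected_events(intercept, coefficients, observed_cases)
expected_rate = expected / len(observed_cases)
print(expected, expected_rate)
```

The expected rate rises when the observed cases carry more risk factors, which is how the alerting threshold follows the case mix of the monitored cohort.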
Bayesian Updating Statistics (BUS) is an experimental methodology pioneered in nuclear power safety monitoring.20 We developed this method by incorporating Bayesian statistics21 into a traditional SPC framework by utilizing prior observed data to evolve the estimates of risk.22 The incorporation of previously observed information into the expected data allows the method to be very sensitive for detecting reversal of trends and sudden, large changes in event rates. However, a slow drift (either elevation or depression) of the observed event rate can be missed.13 This method partially addresses the limitation in changing conditions in SPC, and is best suited to detecting changes after an incremental change (such as a medication or device) is introduced. However, this method should be considered non-risk adjusted since individual patient conditions and exposures are not considered in establishing the alerting threshold.
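One way to see how prior data shape a Bayesian estimate of risk is a conjugate beta-binomial update. This is only a sketch under assumptions: the text does not specify the prior family or the alert rule used by BUS, so the uniform Beta(1, 1) starting prior, the function names, and the counts below are all hypothetical.

```python
def beta_update(prior_events, prior_n, observed_events, observed_n):
    """Beta posterior parameters after pooling prior and observed counts.

    Starts from a uniform Beta(1, 1) prior, which is an assumption.
    """
    alpha = 1 + prior_events + observed_events
    beta = 1 + (prior_n - prior_events) + (observed_n - observed_events)
    return alpha, beta

def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Same observed data (10 events in 100 cases), two hypothetical baselines
# with the same 3% event rate but different sizes.
large_prior = posterior_mean(*beta_update(30, 1000, 10, 100))
small_prior = posterior_mean(*beta_update(3, 100, 10, 100))
print(large_prior, small_prior)
```

With the larger baseline the posterior stays close to the prior rate, so an elevated observed rate moves the estimate less than it does with a sparse baseline.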
We have previously described an automated real-time safety monitoring tool, Data Extraction and Longitudinal Time Analysis (DELTA), that is able to perform an arbitrary number of concurrent prospective analyses using statistical methodologies (SPC, BUS, LR-SPC) and alerting thresholds.12 The system uses a SQL 2000 server (Microsoft Corp., Redmond, WA) to provide internal data storage and configuration information, as well as providing the capability to integrate with external databases. The user interface is displayed in a web browser from a Microsoft IIS 5.0 Web Server (Microsoft Corp., Redmond, WA).
The system is currently in operation within the Partners Healthcare intranet, a secure multi-hospital network. Security of patient data is further addressed by record de-identification steps23 and user login access restrictions. DELTA is part of ongoing quality assessment and control measures within the institution.
Both sets of trial data were imported into DELTA, and then SPC, LR-SPC, and BUS analyses were configured to evaluate the outcomes of interest in monthly intervals. These outcomes were the primary safety endpoints of the trials, which were 30 day mortality for Trial A and major bleeding for Trial B. The gold standard of whether the appropriate safety endpoint event rate in each trial was elevated in a particular month was determined using standard DSMB analysis methods. This was calculated using the Fisher’s exact test applied to the cumulative control and intervention data on a monthly basis for each outcome of interest in both trials. The proportional difference method with 95% confidence intervals was used to establish alert thresholds for the same sets of cumulative data in order to compare performance with Fisher’s exact method.
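The monthly gold-standard comparison can be sketched as a two-sided Fisher’s exact test applied to cumulative counts. The sketch below uses only the standard library; the 0.05 significance level, the requirement that the intervention rate exceed the control rate, the function names, and the monthly counts are illustrative assumptions.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p value for the 2x2 table [[a, b], [c, d]].

    Rows are arms; columns are (events, non-events).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def monthly_signals(control, intervention, alpha=0.05):
    """control, intervention: per-month (events, cases) tuples, not cumulative."""
    signals = []
    c_events = c_cases = i_events = i_cases = 0
    for (ce, cn), (ie, inn) in zip(control, intervention):
        c_events, c_cases = c_events + ce, c_cases + cn
        i_events, i_cases = i_events + ie, i_cases + inn
        if c_cases == 0 or i_cases == 0:
            signals.append(False)  # nothing to compare yet
            continue
        p = fisher_exact_two_sided(i_events, i_cases - i_events,
                                   c_events, c_cases - c_events)
        elevated = i_events / i_cases > c_events / c_cases
        signals.append(p < alpha and elevated)
    return signals

# Hypothetical monthly counts, not taken from either trial.
print(monthly_signals([(1, 50), (1, 50)], [(1, 50), (12, 50)]))
```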
Each of the statistical methods used by DELTA requires a baseline event rate expectation to establish alerting thresholds, and this data is generally obtained from observational cohort data prior to the initiation of a new medication or device. In order to simulate this environment with randomized, controlled trial data, the control arms were used for this baseline measurement, and the intervention arms were used as the monitored prospective observational cohort. This resulted in all of the control arm data being available at the “beginning” of the intervention arm monitoring. The reduced dosage treatment arm in Trial A was utilized for the intervention arm since that was the arm for which the DSMB stopped the trial.
The SPC alerting threshold was static. The LR-SPC alerting threshold was adjusted in each analyzed time period by the model-predicted event rate of the observed (intervention) data, and the BUS alerting threshold was adjusted in each analyzed time period by the observed event rate of the observed (intervention) data. Monthly time intervals and 95% confidence intervals or posterior credible intervals were used for each analysis in this study.24 Overall sensitivity and specificity of the methods can be ‘tuned’ by adjusting the alerting threshold, but the emphasis in this evaluation was to determine relative performance between the methods for a standard threshold set point.
LR-SPC required the development of a logistic regression model in order to perform case-level risk adjustment. A literature search was conducted to identify risk factors for each of the outcomes of interest in the trials, and all such factors that were associated with the respective outcome of interest were included in the LR model development process. Model development was done in SAS (Version 9.1, Cary, NC). The models were evaluated for discrimination with the Area Under the ROC Curve (AUC) and calibration with the Hosmer-Lemeshow goodness-of-fit (HL-GOF) deciles test using 10-fold cross-validation.25, 26
The accuracy of each method for detecting elevated event rates was calculated by comparing, in each month, whether the method alerted with the result of the standard trial analysis. There were a total of 14 months in Trial A and 21 months in Trial B, resulting in 35 values (alert / not-alert) for each method. The results in each month for both trials were aggregated to determine overall sensitivity (defined as the true positives divided by the sum of true positives and false negatives) and specificity (defined as the true negatives divided by the sum of true negatives and false positives).
The cross-validation results for the logistic regression models developed from the Trial A control arm data were an AUC of 0.67 [0.59 – 0.75] and a HL-GOF of 8.82 (p = 0.358) for a variable selection threshold of 0.01, and an AUC of 0.70 [0.62 – 0.78] and a HL-GOF of 8.38 (p = 0.397) for a threshold of 0.20. The cross-validation results for the logistic regression models developed from the Trial B control arm data were an AUC of 0.79 [0.73 – 0.86] and a HL-GOF of 15.7 (p = 0.047) for a threshold of 0.01, and an AUC of 0.80 [0.74 – 0.87] and a HL-GOF of 10.5 (p = 0.230) for a threshold of 0.20.
Significant differences for the outcome of all-cause death at 30 days in Trial A were noted between the control arm and the reduced treatment arm from month 7 until the trial’s early termination at month 14. A summary of the event rates of each arm and p values by month is listed in Table 1. The SPC monitoring method also reported significant event rate elevations in months 7 through 14 (Figure 1a). The BUS method, however, did not report any intervention arm event rate elevations during monitoring. The LR-SPC method using a model building threshold of 0.01 reported elevations in months 7 through 14 (Figure 1b), and reported elevations in months 6 through 14 for a model building threshold of 0.20 (Figure 1c). A summary of the Trial A proportional difference results for each of the monitoring methods by month is shown in Appendix 1.
No significant differences for the outcome of major bleeding at 30 days were noted between the control and intervention arms in Trial B. Month 8 was the period in which the event rate of the intervention arm was the most elevated in relation to the control arm, with a p value of 0.262. A summary of the event rates for each arm with p values by month is listed in Table 2. The SPC monitoring method did generate alerts for months 7 through 9 and 11 (Figure 2a). BUS did not generate any alerts. The LR-SPC method using a model building threshold of 0.01 reported elevations in months 7 and 14 (Figure 2b), and reported an elevation in month 7 for a model building threshold of 0.20 (Figure 2c). A summary of the Trial B proportional difference results for each of the monitoring methods by month is shown in Appendix 2.
Aggregating the results of both trials for each method as compared to the trial analysis standard resulted in sensitivities and specificities of 100% and 85% for SPC, 0% and 100% for BUS, and 100% and 93% for both LR-SPC models, respectively. A summary of the 2×2 table elements is listed in Table 3.
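The aggregated figures can be reproduced from the alert months reported above, treating months 7 through 14 of Trial A as the only truly elevated months out of the 35 trial-months. The sketch below is illustrative; the function and variable names are assumptions.

```python
def sensitivity_specificity(truth, alerts, all_months):
    """truth, alerts: sets of (trial, month); all_months: every (trial, month)."""
    tp = sum(1 for m in all_months if m in truth and m in alerts)
    fn = sum(1 for m in all_months if m in truth and m not in alerts)
    fp = sum(1 for m in all_months if m not in truth and m in alerts)
    tn = sum(1 for m in all_months if m not in truth and m not in alerts)
    return tp / (tp + fn), tn / (tn + fp)

def months(trial, numbers):
    return {(trial, n) for n in numbers}

all_months = months("A", range(1, 15)) | months("B", range(1, 22))
truth = months("A", range(7, 15))  # elevated by the standard trial analysis

alerts_by_method = {
    "SPC": months("A", range(7, 15)) | months("B", [7, 8, 9, 11]),
    "BUS": set(),
    "LR-SPC 0.01": months("A", range(7, 15)) | months("B", [7, 14]),
    "LR-SPC 0.20": months("A", range(6, 15)) | months("B", [7]),
}

results = {
    name: sensitivity_specificity(truth, alerts, all_months)
    for name, alerts in alerts_by_method.items()
}
for name, (sens, spec) in results.items():
    print(f"{name}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

SPC has 4 false positives among 27 non-elevated months (23/27, about 85%), and each LR-SPC model has 2 (25/27, about 93%), which matches the reported values.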
This study evaluated non-risk adjusted and risk adjusted statistical process control methods for detecting elevated adverse event rates among randomized controlled trial data. Both SPC and LR-SPC performed well and were comparable to each other. However, BUS was significantly over-specific and did not alert in any month in either trial.
The proportional difference test was validated by comparison with Fisher’s exact test and the results were concordant in both trials analyzed. The SPC method alerted properly in each of the months identified by Fisher’s method for the Trial A data, and alerted in 4 months in the Trial B data in which Fisher’s method did not find a significant difference. The LR-SPC method alerted appropriately with the exception of one false positive in Trial A (for the 0.20 model threshold) and false positive alerts in Trial B (two for the 0.01 model threshold and one for the 0.20 model threshold). The BUS method did not alert at any point during either trial, resulting in false negatives in Trial A.
Substantially different performance was found between SPC and BUS. The SPC method was consistently more sensitive and the BUS method was more specific. The performance of both methods is sensitive to the data used in establishing the expected event rates and alerting thresholds. Theoretically, as the n (number of subjects) of the baseline data increases, SPC becomes more sensitive and BUS becomes less sensitive (and more specific) to event rate deviations in the monitored data. Conversely, BUS should more rapidly detect an event rate difference than SPC for sparse or low volumes of baseline data, and this has been shown in other monitoring applications.27, 28 The clinical trials evaluated here had large numbers of control patients in order to appropriately evaluate the primary outcomes, and this could have favored the SPC method in this analysis. A sensitivity analysis between the performance of SPC and BUS for large ranges of n is ongoing in order to determine relative performance between the methods, and we are currently evaluating an alternate BUS alerting threshold using the percentage overlap of the area under the probability density function.
LR-SPC, unlike SPC and BUS, is not directly sensitive to the n of the baseline data because the alerting threshold is generated from the predicted event rate of the observed data, which results in the n used to generate the alerting threshold being equal to n of the observed data. LR-SPC is sensitive to the performance of the logistic regression model used, and such models are more robust when generated from larger data sets.
In this evaluation, LR-SPC performed in a comparable manner to the SPC method. However, this result should be interpreted as LR-SPC performing in a non-inferior way to SPC in the absence of confounding in the evaluated data. This is a useful finding because it supports the use of the methodology, but it does not provide an evaluation of the method’s risk adjustment efficacy. Further work needs to be performed to establish the relative performance between SPC and LR-SPC using observational cohort data.
There are a number of limitations to this study. The control group data in both trials were accumulated concurrently and in a randomized fashion with the intervention data. However, in order to evaluate the system, the control data were assumed to be collected prior to the intervention data. This allows direct comparison of methods, but may result in over-optimistic performance measurements when such methods are applied to a prospective patient cohort, which experiences shifts in patient case-mix and provider behavior over time. In addition, all of the methods used perform serial evaluations of the data, which can increase the false positive alerting rate. However, these methods are intended for screening large numbers of outcomes for a wide variety of medications and medical devices within an automated application. Such surveillance emphasizes early detection and accepts lower specificity for additional sensitivity in this setting. Because of this, in-depth manual review of identified signals must then be performed in order to determine whether the signal is a true positive. Additional work will be required to satisfactorily adjust the sensitivity and specificity of the alerts to a manageable rate for manual review of the results from this application.
These methods are intended for use in prospective observational cohort surveillance within a health care environment, whether it is one hospital or a network of hospitals and outpatient clinics. Once a surveillance methodology is validated and established, selection of the baseline or expected data becomes critical for risk adjustment purposes and defines the nature of any resulting alerts. For example, a medical product just released to market could use phase 3 trial data as a baseline, which would evaluate whether the observed population experienced safety outcomes in excess of that reference group. However, such trials are well-known to recruit healthy patients, and sample sizes are generally low. Alternatively, outcome data from a closely related product with the same indication could be collected in the local environment for this purpose. This has the benefit of a larger sample size and could allow more granular data collection (since data elements in phase 3 trial data are expensive to collect) but might also suffer from missing data or collection, recall, or other biases. Further work must be done in this area to establish data selection hierarchies and protocols in order to inform such a process.
In conclusion, the SPC and LR-SPC methods performed well when evaluating randomized controlled trial data for significant safety event rate elevations. For monitoring where large amounts of data are available to provide the expected event rate (and threshold), SPC and LR-SPC appear to outperform BUS monitoring. Further work is required to establish risk adjustment performance in the LR-SPC method, and to establish BUS performance for event rate monitoring in conditions with sparse prior data or when highly variable trends in safety are present.
The authors would like to thank Anne Fladger and her library staff for their assistance. This study was funded in part by grants T15-LM-007092, R01-LM-008142, and R01-LM-009520 from the National Library of Medicine of the National Institutes of Health.