The IOM report, Preventing Medication Errors, emphasizes the overall lack of knowledge of the incidence of Adverse Drug Events (ADE). Operating rooms, emergency departments and intensive care units are known to have a higher incidence of ADE. Labor and Delivery (L&D) is an emergency care unit that could have an increased risk of ADE, where reported rates remain low and under-reporting is suspected. Risk factor identification with electronic pattern recognition techniques could improve ADE detection rates.
The objective of the present study is to apply Synthetic Minority Over Sampling Technique (SMOTE) as an enhanced sampling method in a sparse dataset to generate prediction models to identify ADE in women admitted for Labor and Delivery based on patient risk factors and comorbidities.
By creating synthetic cases with the SMOTE algorithm and using a 10-Fold Cross validation technique, we demonstrated improved performance of the Naïve Bayes and the decision tree algorithms. The true positive rate (TPR) of 0.32 in the raw dataset increased to 0.67 in the 800% over-sampled dataset.
Enhanced performance from classification algorithms can be attained with the use of synthetic minority class over-sampling techniques in sparse clinical datasets. Predictive models created in this manner can be used to develop evidence based ADE monitoring systems.
The Institute of Medicine (IOM), in the report Preventing Medication Errors, recommended the implementation of decision support tools derived from evidence based knowledge and patient information as part of the strategies to prevent medication errors (ME). The report also recommended the active monitoring of medication use to promote prevention strategies. Although medical research has actively pursued these problems, the reported incidence of ME is suspected to be under-estimated [1–3].
Operating rooms, emergency departments and intensive care units are known to have a higher incidence of ADE. Labor and Delivery (L&D) areas are considered by quality assurance groups as special care units and pregnant women are considered by the FDA as a vulnerable group for ADE. L&D provides emergency care and therefore should also be treated as a high risk area. Studies published in the literature focus on specific drugs and anesthesiology events [5–9]. To the best of our knowledge there are no published studies of ADE as a general category in pregnant women. Our findings indicate an incidence of 0.34% of ADE in women admitted to L&D. This incidence is surprisingly low in a population that includes at least 10% of high risk pregnancies that require poly-pharmacy.
One of the most complex tasks in the design and development of automated decision support tools is evidence based rule generation and knowledge extraction from existing data. The task is even more challenging when the class label of interest, ADE patients in this case, has an incidence of 1% or less. Datasets with these characteristics are also known as skewed or imbalanced datasets. The class of interest is relatively rare and there are important trade-offs in the decision between false negatives and false positives. Overall, it is more costly to have a false negative than a false positive, especially in a medical application where the interest is detecting patients with adverse outcomes that can be prevented. Without loss of generality, we will assume that the larger class or the majority class is the negative class and the class of interest is the minority (smaller) or positive class. We will use these terms interchangeably in the paper. The use of machine learning algorithms in sparse datasets with class imbalance causes suboptimal classification performance as these techniques get overwhelmed by the majority class. Recent work has focused on sampling techniques that counter the problem of class imbalance by either oversampling the minority class or under-sampling the majority class [12–15].
In this paper, we focus on the application of the Synthetic Minority Over sampling Technique (SMOTE). SMOTE works by generating new instances from the existing cases. SMOTE effectively counters the imbalance in data by not only solving the problem of high class skew but also the problem of high sparsity. It works in the “feature space” rather than “data space”. The synthetic samples are created by taking each minority class sample and the k nearest neighbors. The synthetic sample shares features of both the chosen minority class sample and one or more of the nearest neighbors. This approach effectively forces the decision region of the minority class to become more general. The synthetic cases will not only increase the data space but will also amplify the features of the minority class without duplicating the original data. SMOTE’s effectiveness has been shown in a variety of domains and with a variety of classifiers [15, 16].
The objective of the present study was to apply SMOTE as an enhanced sampling method using a sparse dataset and to identify a prediction model for ADE in women admitted for L&D based on patient risk factors and comorbidities. We would like to note here that we tried other sampling methods, such as oversampling by replication and random under-sampling, but none of them resulted in improvement. Hence, for clarity of presentation in the paper, we only focus our discussion and results on using SMOTE.
Machine learning techniques include both data sampling and learning algorithms. Over sampling techniques are applied to reuse the available data by dividing the dataset into three or more sets. Once the data sampling step is completed, the classification algorithms are applied to the resulting datasets. Subsequently, the performance of the classifiers is evaluated by comparison of the results in the training, testing and validation datasets.
SMOTE was used to generate new synthetic cases for this study. The computations for the new synthetic sample variables are based on Euclidean distance for continuous variables and the Value Difference Metric for the nominal features. The continuous variable values are created by taking the difference between the feature values of a minority class sample and those of one of its nearest minority class neighbors and multiplying that difference by a random number between 0 and 1. The resulting number is added to the feature value of the original sample and the result will be the value of that variable in the new synthetic sample. For nominal variables, the variable value is assigned by majority vote of the k nearest neighbors. As a result, the synthetic cases will have attributes with values similar to the existing cases and are not just replications, as produced by oversampling with replication. The objective is to increase the representation of the minority class in the resulting dataset and reflect the structure of the original cases. By adding new samples of similar characteristics to the originals the decision region is amplified and there should be improvement of the evaluation measures: true positives and the Area Under the Curve (AUC). The newly created cases are appended to the original dataset in 100% increments. Thus the “second” dataset will have 100% more minority class cases, the third 200% more minority class cases and so forth. This technique has proven to be useful in improving prediction of sparse datasets by other authors.
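The interpolation step for continuous variables can be illustrated with a minimal sketch. It is not the implementation used in the study: the function names, the toy data, the default `k` and the fixed random seed are assumptions made for illustration, and nominal features (assigned by majority vote of the neighbors) are not handled.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_percent=100, k=5, seed=0):
    """Create synthetic minority samples (continuous features only).

    n_percent is the amount of over-sampling in multiples of 100,
    so 200 yields two synthetic cases per original minority case.
    """
    rng = random.Random(seed)
    per_sample = n_percent // 100
    synthetic = []
    for i, sample in enumerate(minority):
        # k nearest minority-class neighbors, excluding the sample itself
        others = [m for j, m in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda m: euclidean(sample, m))[:k]
        for _ in range(per_sample):
            neighbor = rng.choice(neighbors)
            gap = rng.random()  # random number in [0, 1)
            # new value = original + gap * (neighbor - original), per feature
            synthetic.append(
                [s + gap * (n - s) for s, n in zip(sample, neighbor)]
            )
    return synthetic


# Hypothetical minority-class cases with two continuous features
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_cases = smote(minority, n_percent=200, k=2)
```

Because every synthetic case lies on the segment between an original case and one of its neighbors, the new cases stay inside the region already occupied by the minority class rather than duplicating existing points.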
Naïve Bayes is a simple probabilistic classifier based on Bayes’ theorem with strong (naive) independence assumptions. Bayes’ theorem is based on conditional probability theory; the posterior probability is proportional to the product of the prior probability and the likelihood. With the independence assumption, the Naïve Bayes classifier simplifies the model: it avoids the complexity of producing the joint probabilities across features, which quickly becomes overwhelming when the number of features is large. While the assumption of independence is “naïve”, it has been shown to perform exceptionally well in classification in the medical field [17, 18].
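As a rough illustration of the proportionality between posterior, prior and likelihood, the sketch below trains a Naïve Bayes model on dichotomous (0/1) features. The toy records, the feature names and the use of Laplace smoothing are assumptions for illustration and do not reproduce the WEKA implementation used in the experiments.

```python
from collections import defaultdict


def train_naive_bayes(rows, labels):
    """Estimate P(class) and P(feature = 1 | class) with Laplace smoothing."""
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(row):
            feature_counts[label][j] += value
    total = len(rows)
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {
        c: [(feature_counts[c][j] + 1) / (class_counts[c] + 2)
            for j in range(n_features)]
        for c in class_counts
    }
    return priors, likelihoods


def posterior(row, priors, likelihoods):
    """Posterior is proportional to prior times the product of likelihoods."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for value, p in zip(row, likelihoods[c]):
            score *= p if value == 1 else (1 - p)
        scores[c] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}


# Hypothetical toy data: columns are [severe_hypertension, cesarean]
rows = [[1, 1], [1, 0], [0, 0], [0, 0], [0, 1], [0, 0]]
labels = ["ADE", "ADE", "no_ADE", "no_ADE", "no_ADE", "no_ADE"]
priors, likelihoods = train_naive_bayes(rows, labels)
probs = posterior([1, 1], priors, likelihoods)
```

The per-feature probabilities are multiplied independently, which is exactly the independence assumption: no joint probability across features is ever estimated.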
Decision Trees are predictive models that allow the selection of an attribute that will serve as the root node for prediction. Based on the probability of occurrence and the gain or utility of the root nodes, the leaf nodes (or branching nodes) are created. Decision Trees are inductive learners that have proven to perform well in clinical research. The interpretation is facilitated for domain knowledge experts by the display in graphical form. C4.5 is a popular decision tree learning algorithm used in a multitude of domains. We used the WEKA (Waikato Environment for Knowledge Analysis) Open Source Software implementation of C4.5, namely J48, in our experiments.
Naïve Bayes and Decision Trees were chosen as the classification algorithms for the experiments because the results are in a format that facilitates interpretation by domain experts. The graphical representation of the Decision Trees and the simplicity of the Naïve Bayes model are easily understood as opposed to the “black box” that other algorithms such as Neural Networks and Support Vector Machines generate.
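The notion of gain used to choose a split can be sketched with plain information gain on a dichotomous attribute. This is a simplified illustration: C4.5 normalizes this quantity into a gain ratio, and the attribute and labels below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * log2(n / total) for n in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Reduction in entropy obtained by splitting on one attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical records: 1 = ADE, 0 = no ADE
ade = [1, 1, 0, 0, 0, 0]
informative_attribute = [1, 1, 0, 0, 0, 0]    # separates the classes perfectly
uninformative_attribute = [1, 0, 1, 0, 1, 0]  # same class mix on both sides

gain_informative = information_gain(informative_attribute, ade)
gain_uninformative = information_gain(uninformative_attribute, ade)
```

The attribute with the larger gain would be placed closer to the root of the tree.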
Records for the present study came from the Enterprise Data Warehouse (EDW) of Intermountain Healthcare in Salt Lake City, Utah. The EDW contains clinical care and coded data for billing and reporting. Data from 135,000 individual patients admitted for L&D during years 2002–2005 were extracted. The variables included demographic characteristics and discharge diagnosis as well as maternal and fetal outcomes and maternal comorbidities.
Inclusion criteria were postpartum women with gestational ages between 20 and 44 weeks and birth weight between 500 and 4800 grams. Two patients’ records with maternal age above 55 were excluded as they were confirmed to be data entry errors. In patients with multifetal pregnancies, the outcome data of the first-born infant were selected for inclusion.
A classification methodology for outcomes and comorbidities was created based on the clinical classification of ICD9 codes for labor and delivery published by Yasmeen and on the reportable adverse events criteria published by the Joint Commission and the Utah Department of Health[21, 22]. In interest of clarity we called these tables “published classifications”.
The published classifications included ICD9 codes assigned to obstetrical diagnoses, pregnancy related comorbid diagnoses, procedures and sentinel events. For example, the diagnosis “diabetes mellitus” includes ICD9 codes 250.xx, 357.2, 362.0 and 648.0x. We created an electronic table called “classifications” with one column that included each one of the diagnoses, procedures and sentinel events and another column with the ICD9 code. The original ICD9 table included the ICD9 code and the description. We then used SQL queries to join both tables on the ICD9 code and selected both the description from the ICD9 table and the classification from the published classifications. Each row was verified one by one to ensure that the ICD9 description matched the classification. A column was added to the ICD9 table for the class variables of diagnoses, procedures and risk factors used in our study, e.g. ‘ADE’, ‘Cesarean’, ‘pregnancy induced hypertension’, etc. Once the tables were joined by ICD9 code and the verification was made, we updated the class variable column, assigning a category to each ICD9 code. Table 1 shows the resulting clinical classification and categories and the corresponding ICD9 codes. Some factors of interest for prediction were not included in the published classifications, so we added them to the table. The factors added by us were demographic variables such as maternal age, fetal weight and fetal presentation during labor.
The clinical classification attribute was added to the patient dataset as a dichotomous variable. For each comorbidity, risk factor or procedure, a record was assigned a value of 1 if it had a corresponding ICD9 code and 0 if it did not.
The above procedure was done in order to ensure the accuracy of the classifications and include other codes that were in use at Intermountain Healthcare and were not in the publications. It also allowed us to assign a diagnosis to each patient and use it for the validation with the patient electronic record.
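A minimal sketch of this coding step follows. The study performed it with SQL joins; here a small lookup table stands in for the “classifications” table. Only the diabetes mellitus codes come from the text, while the prefix-matching rule, the function name and the example patient codes are assumptions for illustration.

```python
# Category -> ICD9 code prefixes. "250." stands for 250.xx and "648.0"
# for 648.0x; matching by prefix is an assumption of this sketch.
CLASSIFICATIONS = {
    "diabetes_mellitus": ["250.", "357.2", "362.0", "648.0"],
}


def flag_comorbidities(patient_codes, classifications=CLASSIFICATIONS):
    """Return a 0/1 flag per category for one patient's list of ICD9 codes."""
    flags = {}
    for category, prefixes in classifications.items():
        present = any(
            code.startswith(prefix)
            for code in patient_codes
            for prefix in prefixes
        )
        flags[category] = int(present)
    return flags


# Hypothetical discharge codes for two patients
with_diabetes = flag_comorbidities(["250.01", "654.21"])
without_diabetes = flag_comorbidities(["654.21"])
```

Each category becomes one dichotomous column in the patient dataset, which is the form consumed by the classifiers.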
Despite shortcomings, numerous clinical and informatics researchers have proven the usefulness of ICD9 coding systems for clinical research. Table 2 describes the different methodologies used to validate the accuracy of the clinical classification. The patient electronic records were randomly selected and the validation for diagnosis was done on the clinician’s interface of the medical record. The Kappa statistic was used to measure agreement between the free text diagnosis in the clinical notes and the classification created based on the ICD9 codes.
From the pharmacy database we extracted values for the number of drugs administered to the patients with ADE and to those without ADE. The mean values for number of drugs for each group and the t-statistic for comparison are also included in Table 2. As expected from previous reports in the literature, patients with ADE had a statistically significantly higher number of drugs.
A comparison was performed between disease incidence in the study population and the population disease incidence reported by the Utah Department of Health. Similar incidences were found in the comparison for pregnancy induced hypertension, gestational diabetes, preterm birth and fetal weight.
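For reference, the agreement measure can be sketched as Cohen's kappa for two raters, here the ICD9-based classification and the free text diagnosis. The ten paired ratings are hypothetical and are not taken from Table 2.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n)
        for c in categories
    )
    return (observed - expected) / (1 - expected)


# Hypothetical validation sample: 1 = diagnosis present, 0 = absent
icd9_based = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
free_text = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]
kappa = cohen_kappa(icd9_based, free_text)
```

In this toy sample the two sources agree on 8 of 10 records, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement.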
The original dataset consisted of eighty four variables including maternal comorbidities, demographic information, fetal outcomes and surgical procedures. Principal components analysis (PC) and Chi-Square ranking were used to determine the explained variability in the dataset. The methods were also used for variable selection of highly correlated variables and to avoid multicollinearity[1, 3]. We applied Chi-Square ranking and PC to each of the complete datasets after the SMOTE procedure. This approach allowed the comparison of the variance in each of the original and resulting datasets. The intent was to verify if SMOTE altered the structure of the data. Variables with high collinearity (Eigenvectors > .5 ) were dropped in favor of those that preserved more specific information e.g. puerperal fever vs surgical wound infection. After we ensured that the preserved variables had no collinearity, we selected the variables with Eigenvalues that explained 80% of the variability as advised in the literature.
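The 80% criterion can be sketched as a cumulative sum over the eigenvalues in decreasing order. The eigenvalues below are hypothetical, and the sketch covers only the selection rule, not the principal components analysis itself.

```python
def components_for_variance(eigenvalues, threshold=0.80):
    """Number of leading components needed to explain `threshold` of variance."""
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(ordered)


# Hypothetical eigenvalues from a principal components analysis
eigenvalues = [4.0, 2.5, 2.0, 0.8, 0.4, 0.3]
n_components = components_for_variance(eigenvalues)
```

Running the same rule on the original and on each SMOTE dataset gives a simple way to check whether the explained variance pattern has shifted.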
The ratio of ADE to controls in the dataset was 0.348/100, which clearly qualifies as a highly imbalanced dataset. We used 10-fold cross-validation as a vehicle to empirically validate the results. 10-fold cross-validation divides the data into 10 mutually exclusive subsets, and then combines 9 of those at a time and evaluates the 10th left-out subset. Thus, a classifier is identified on ten different, but overlapping training sets, and evaluated on 10 completely unique testing sets. In preliminary experiments (results not included in this study), we applied a popular ensemble technique called AdaBoost that provides random oversampling of the minority class and random under-sampling of the majority class. None of these resulted in an improvement over the performance of the base classifier. The SMOTE algorithm was applied creating new synthetic cases of the class of interest in 100% increments. The first synthetic dataset had 100% more ADE cases than the original one, the second synthetic dataset had 200% more synthetic cases and so forth.
The suite of classification algorithms was then applied to the SMOTE boosted datasets using the 10-fold cross validation sampling technique. The decision to use the 10-fold cross validation sampling technique was based on the small number of cases with the class label of interest (ADE). The literature reports a risk of overfitting with this technique, which would introduce bias into the evaluation of the performance of the classification algorithms. However, the standard evaluation technique in situations where a limited number of cases is available is stratified 10-fold cross validation [17, 26]. Stratified 10-fold cross validation implies averaging the results after invoking the algorithm 10 times ten fold. In other words, each classification algorithm runs 100 times on each dataset. In our experiments, the Naïve Bayes classifier took 2 hours for one instance of 10-fold and 4.5 hours for the Decision Tree. The total time to run the experiments reported was 136.5 hours. The computational expense for 21 datasets was beyond the capacity of our resources. Based on the literature, 10 is the suggested number of folds for the best estimate of errors. Likewise, SMOTE does not alter the original distribution of the data, therefore the problem of over-fitting is avoided.
The performance measures for evaluation of the classification algorithms were True Positive Rate (TPR), AUC (Area Under the Curve) and Kappa Statistics for agreement of classification between the different models.
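The fold construction can be sketched as follows. This is an illustrative stand-in for the WEKA procedure: the function name, the seed and the toy class sizes are assumptions, and stratification is done by dealing the shuffled cases of each class round-robin across the folds.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Split record indices into folds that preserve the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)
        # deal the shuffled cases of this class across the folds
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical imbalanced dataset: 20 positive and 980 negative cases
labels = [1] * 20 + [0] * 980
folds = stratified_folds(labels)

splits = []
for held_out in range(len(folds)):
    test_idx = folds[held_out]
    train_idx = [
        i for f, fold in enumerate(folds) if f != held_out for i in fold
    ]
    splits.append((train_idx, test_idx))
```

Each record appears in exactly one test fold and in nine training sets, and each fold keeps the same share of minority cases as the full dataset.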
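The first two measures can be sketched directly from their definitions. The scores, the labels and the 0.5 decision threshold below are hypothetical; the AUC is computed as the proportion of positive and negative pairs in which the positive case receives the higher score.

```python
def true_positive_rate(y_true, y_pred):
    """TPR = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier output: 1 = ADE, 0 = no ADE
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tpr = true_positive_rate(y_true, y_pred)
area = auc(y_true, scores)
```

The TPR depends on the chosen threshold, whereas the AUC summarizes the ranking over all thresholds, which is why both are reported.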
As previously noted, the justification for utilizing SMOTE as the data boosting algorithm is to increase the availability of cases with the class label of interest: patients with ADE. We decided not to use over-sampling techniques that involve exact data replication and favored SMOTE as an alternative that creates new synthetic cases of the original class label of interest. In order to verify that SMOTE did not change the original data structure, we applied PC to compare the variance of the original dataset and that of the synthetic datasets through the comparison of the Eigenvalues. Likewise, PC is described as an exploratory technique useful to gain a better understanding of the interrelationships among the data.
Domain expertise, in this case clinical interpretation of the results, is necessary when applying novel techniques for predictive models [17, 25]. In order to determine if the predictive models generated by our experiments can eventually be used to create electronic applications, the results were clinically analyzed by two of the authors, both specialists in obstetrics and gynecology. The purpose was to determine if the risk factors and comorbidities in the predictive models are likely to be associated with a higher risk of ADE.
The statistical comparison for the performance of the classifiers was done with the results of the three tests in the SAS output of the univariate procedure: Student’s t test, the sign test and the Wilcoxon signed rank test. Although the t test is the most common one found in the data-mining literature for this purpose, there is evidence that non-parametric tests are more reliable when the number of datasets to compare is 30 or less and there is no assumption of normal distribution. The statistical reason in favor of non-parametric tests for this purpose is beyond the scope of the present report. We refer the reader to the paper published by Demsar on Statistical Comparison of Classifiers over Multiple Data Sets for this purpose.
The MySQL V5.0 Open Source database management system was used for data preparation and transformation. The WEKA Machine Learning Tools version 3.5.5 Open Source system, SAS software Release 9.1 and SAS Enterprise Miner Release 4.3 were used for data analysis and construction of the predictive models.
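Of the three tests, the sign test is the simplest to sketch: it only counts how often one classifier outperforms the other across datasets and compares that count with a binomial distribution with probability 0.5. The paired differences below are hypothetical, and this exact two-sided version is an illustration rather than a reproduction of the SAS output.

```python
from math import comb


def sign_test_p_value(differences):
    """Exact two-sided sign test on paired differences (zeros are dropped)."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(1 for d in nonzero if d > 0)
    k = min(positives, n - positives)
    # probability of a split at least this extreme in one tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical AUC differences between two classifiers on ten datasets
differences = [0.03, 0.02, 0.05, 0.01, 0.04, -0.01, 0.02, 0.03, 0.06, 0.02]
p_value = sign_test_p_value(differences)
```

Because only the signs are used, the test makes no assumption about the distribution of the differences.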
Institutional Review Board approval was obtained from both Intermountain Health Care and the University of Utah.
There were 106,480 cases that met the inclusion criteria and 371 ADE were identified based on the clinical classification previously described.
The demographic maternal characteristics as well as fetal outcomes showed no significant variation on ADE as indicated by the Eigenvalues of the PC. Surgical procedures (cesarean section and forceps) had the highest variation. Fifty five independent comorbidities were identified and accounted for explaining 80% of the variation in the dataset and were used in the final model.
Figures 1 and 2 show the increments in the number of new synthetic ADE cases obtained after each SMOTE procedure. Each time the algorithm was applied, 371 new synthetic cases were added to the original dataset. Figure 1 shows the improved performance of the evaluation metrics with the minority class boosted datasets on the J48 Decision Tree. The original dataset showed a TPR of .32 and an AUC of .78. In the first synthetic dataset the TPR increased to .59 and the AUC to .81. A small increment of the evaluation metrics was observed as the number of synthetic cases increased. Figure 2 shows the results for the evaluation metrics for the Naïve Bayes classification algorithm. With the initial 100% boosting there was a slight decrease in the AUC and the TPR remained unchanged. However, after 200% boosting there was an immediate improvement of the performance measures. After the initial increment, the performance measures slightly improved until the 900% SMOTE point was reached. There was no further increased performance beyond the 1000% increase of the synthetic cases.
An analysis of the structure of the synthetic datasets was done by comparison of the principal components. The principal components of the original dataset and of those with synthetic cases remained the same. There was a non-significant variation in the Eigenvalues and the percentage of variation explained by each principal component did not vary. Thus, we believe that SMOTE was effectively able to counter the highly sparse nature of the data by increasing the density of points that enabled the classifiers to discriminate between the two classes.
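The effect of each increment on the class balance can be worked out from the reported counts. The sketch assumes that the 106,480 included cases already contain the 371 ADE cases, which is consistent with the reported ratio of 0.348/100; the function name is illustrative.

```python
TOTAL_CASES = 106_480  # cases meeting the inclusion criteria
ADE_CASES = 371        # original minority-class cases


def minority_after_smote(smote_percent):
    """Minority count and minority share after a given SMOTE percentage."""
    synthetic = ADE_CASES * smote_percent // 100
    minority = ADE_CASES + synthetic
    return minority, minority / (TOTAL_CASES + synthetic)


raw_count, raw_share = minority_after_smote(0)
boosted_count, boosted_share = minority_after_smote(900)
```

Under this assumption, the 900% dataset contains 3,710 ADE cases, which raises the minority share from about 0.35% to about 3.4% of all records.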
The decision trees in all the models were similar in structure. The first split in the decision tree occurred in patients with external trauma followed by anomalies of the cervix, genito-urinary infections and chorioamnionitis. The next split occurred at severe pregnancy induced hypertension followed by history of previous cesarean and preterm birth labor. The main difference in the structure of the decision trees is in the number of leaves and granularity of the divisions for each rule. While a greater granularity in the decision trees is not necessarily a sign of improvement in the prediction model and can be attributed to over-fitting, the increased number of leaves in the boosted models facilitates the ability of domain experts to determine if the comorbidities and risk factors found could be associated with patients with ADE. Figure 3 shows the difference in structure and decision paths obtained with the decision tree classification algorithm in the raw dataset and the 900% boosted dataset.
Table 3 shows the results of the test statistics used for comparison of the performance of the two classifiers on the raw dataset and the SMOTED datasets. The results indicate a statistically significant difference for the Kappa statistics with both parametric and non-parametric tests. The p value from the t statistic for the comparison of the AUC shows a level of significance < 0.0321. However, the sign test and the signed rank test indicate a p < .0001. The number of datasets for evaluation was 21 and, with a t statistic within levels of significance, we conclude that the evaluation metrics are indeed significantly different, as confirmed by the non-parametric tests.
The importance of developing automatic detection tools for ADE has been widely emphasized. The current low ADE reporting rate creates unbalanced datasets that are very difficult to analyze and use for automatic rule extraction. Electronic methods used for knowledge extraction are likely to fail, as demonstrated by the evaluation of the classifiers in the raw dataset. Alternative data manipulation methodologies are a subject of current research in disciplines outside of medicine where it is also necessary to develop knowledge bases to predict rare occurrences of an event. Sparse data sets that would otherwise be useless can be used to create the starting point of evidence based electronic systems. Predictive models created in this manner can be used to develop evidence based ADE monitoring systems with the potential to increase ADE detection. Increased detection of patients at risk for ADE can lead to changes in patient care protocols and improve patient safety and quality of care. One role of biomedical informatics is to evaluate these methodologies and determine the usability in the clinical arena [20, 30–33].
The use of ICD9 coded data for clinical research has been controversial. However, multiple research studies have demonstrated its usefulness [14, 27]. In addition, Yasmeen et al. showed the reliability of reports of disease incidence using such classification. It should be kept in mind that the resulting clinical classification is a general classification of risk factors and comorbidities with the limitations and shortcomings of a system as nonspecific as ICD9. Nonetheless, it can be used to create useful predictive models to automatically detect those patients at higher risk for ADE and even as an automatic method to detect disease incidence or study populations for further research.
Obstetric indicators report severe pregnancy induced hypertension, embolism and infection as the three leading causes for severe maternal morbidity and mortality[34, 35]. Our results show severe hypertension and wound infection as two of the leading factors for variability in the dataset. It is unclear to us why “trauma” appears as the leading factor for variability since the incidence of trauma is extremely low. We can only speculate that it is because these patients are at higher risk for obstetrical complications such as embolism, infections and hemorrhage as reported in the literature.
As noted in the introduction, existing methodologies for detection of ADE and AE in general are insufficient, and underreporting is suspected at all levels. We believe that the introduction of machine learning methods could have a promising future in this arena if we are able to create predictive models that can deal with clinical factors of low incidence like ADE. Machine learning methods are capable of detecting associations that are not evident when the prevalence is low. Clinical data are numerous, complex, and can be confounding and noisy; as a consequence, datasets of this nature are likely to be sparse and difficult to analyze. The introduction of boosting algorithms like SMOTE, where the original structure of the data is maintained, is promising and future research is necessary. However, for a real time automatic detection method to be reliable, the clinical data of interest would have to be coded in real time. Existing real time reports of Natural Language Processing and detection of antidote drugs for ADE are promising [37, 38].
In the present study, we found important discordance between the coded data and the text reports in the electronic medical record (Table 2). ICD9 coding for billing and reporting is done based on both electronic and paper records. Therefore higher agreement could be expected if the validation of the ICD9 codes were done including both sources. Nonetheless, our data indicated similar disease incidence when comparing the study population to that of the State of Utah. Likewise, based on the validation study published by Yasmeen we can conclude that the ICD9 coding system is accurate for clinical classification of obstetrical diagnosis.
Another limitation of the ICD9 coding system, and more so of the way it is used for billing and reporting, is the impossibility of determining the timing of the comorbidity in relation to the time of delivery and patient admission. The ICD9 codes are included in the electronic record after patient discharge and account for all the events that accompanied the patient during the hospital stay and are not stratified by date or time. This could be a problem if specific comorbidity analysis is done. We can only conclude that patients with certain comorbidities are prone to ADE but we cannot determine the timing of the appearance of the comorbidity in relation to the maternity admission or the ADE. Also, the nature of the data makes it impossible to differentiate between patients with preventable and non-preventable ADE. The clinical classification used in the present study could be used to classify patients in general categories of comorbidities and procedures and to identify risk factors. A classification like this could be useful to identify groups of patients with shared clinical trends. However, a real time monitoring system could not be implemented since the ICD9 codes are not assigned until days after the patient is discharged from the hospital.
The disadvantages of using sampling and classification techniques with all types of datasets are over-fitting or over-training. Oversampling leads to overfitting, while random under-sampling does not necessarily provide new information. The data are optimized in such a way that the classifiers have an excellent performance in the training and testing sets but can have poor performance in the validation sets. In this case, the normal distribution of the individual variables is altered. Oversampling techniques often involve making exact copies of minority class cases, which results in overfitting and does not solve the problem of sparse data. It can, on the other hand, increase the computational expense without improving the performance in the validation sets. Under-sampling can discard useful information and therefore decrease classifier performance [16, 17]. The SMOTE algorithm creates synthetic cases based on the values of the variables of the nearest neighbors. This approach maintains the original distribution and therefore the over-fitting problem is avoided. In the present study, we were able to verify this by comparison of the Eigenvalues of the principal components in the raw dataset with those that included the synthetic cases.
It could be argued that the improvement of the evaluation throughout the experiment is evident but that it does not show dramatic changes. We demonstrated statistically significant differences with the use of both parametric and non-parametric statistics in the evaluation metrics of both classifiers. The structure of the decision trees does change and shows additional split areas that can be used in practical applications through identification of patients at higher risk for ADE. These models can be used as a starting point in future research to focus attention on factors that might be shared by the cases present in the models.
Although precise clinical conclusions cannot be drawn from the results of the present study, the decision trees allow clinical validation of the results. The decision tree in the raw dataset has one split at the beginning and does not allow discrimination between different groups of patients that may have a similar risk for ADE. By displaying the risk factors in this manner, it is impossible to discern if there are groups that could share a similar risk for ADE and not the same diagnosis. On the other hand, the tree resulting from the SMOTED datasets allowed the visualization of different groups that are at the same level of risk for ADE and do not share a diagnosis (Figure 3). The left hand side of the figure (tree resulting from the raw data) shows trauma, severe pregnancy induced hypertension and wound infection in decreasing levels of importance. The right hand side of the figure (tree resulting from the 900% SMOTED dataset) shows trauma, severe pregnancy induced hypertension and wound infection as parent nodes at the same level. Through this graphical display we can see how patients with different diseases receiving completely different sets of medication can share a similar risk for ADE.
The ICD9 classification system used in the present study is general and unspecific for the study of individual diseases. We believe that if a similar methodology to the ones used in this report were to be applied by replacing ICD9 codes with clinical events, signs, symptoms and data from the actual medical record, there would be more success in developing predictive models that could be used in real time electronic systems. It is also of importance to study the types of drugs associated with ADE in the pregnant population. The pharmacopeia in obstetrics is limited and it is likely that a sparse dataset can be encountered when analyzing drugs likely to cause ADE. Further research is necessary in order to determine which drugs are associated with ADE and also to determine which drug combinations are likely to produce ADE and drug-drug interactions.
In addition, it would be desirable to compare the performance of the classifiers among the subsets selected with additional variable selection techniques as advised by Hall et al.
The use of knowledge extraction techniques in clinical applications with sparse data is prone to failure without further data manipulation. Enhanced performance from classification algorithms can be attained with the use of SMOTE in the clinical setting as demonstrated in this study and previously reported by other clinical specialties. Models obtained through this methodology can be used as starting points to develop prediction models for future experiments that will ultimately aid in the development of automatic reporting tools.
The present study was conducted with data from Intermountain Health Care.
It was supported in part by the grant No. LM 007124-11 from the National Library of Medicine.