With increasing adoption of electronic health records (EHRs), there is an opportunity to use the free-text portion of EHRs for pharmacovigilance. We present novel methods that annotate the unstructured clinical notes and transform them into a deidentified patient–feature matrix encoded using medical terminologies. We demonstrate the use of the resulting high-throughput data for detecting drug–adverse event associations and adverse events associated with drug–drug interactions. We show that these methods flag adverse events early (in most cases before an official alert), allow filtering of spurious signals by adjusting for potential confounding, and compile prevalence information. We argue that analyzing large volumes of free-text clinical notes enables drug safety surveillance using a yet untapped data source. Such data mining can be used for hypothesis generation and for rapid analysis of suspected adverse event risk.
Phase IV surveillance is a critical component of drug safety because not all safety issues associated with drugs are detected before market approval. Each year, drug-related events account for up to 50% of adverse events occurring in hospital stays,1 significantly increasing costs and length of stay in hospitals.2 As much as 30% of all drug reactions result from concomitant drug use, with an estimated 29.4% of elderly patients on six or more drugs.3
Efforts such as the Sentinel Initiative and the Observational Medical Outcomes Partnership4 envision the use of electronic health records (EHRs) for active pharmacovigilance.5–7 Complementing the current state of the art—based on reports of suspected adverse drug reactions—active surveillance aims to monitor drugs in near real time and potentially shorten the time that patients are at risk.
Coded discharge diagnoses and insurance claims data from EHRs have already been used for detecting safety signals.8–10 However, some experts argue that methods that rely on coded data could be missing >90% of the adverse events that actually occur, in part because of the nature of billing and claims data.1 Researchers have used discharge summaries (which summarize information from a care episode, including the final diagnosis and follow-up plan) for detecting a range of adverse events11 and for demonstrating the feasibility of using the EHR for pharmacovigilance by identifying known adverse events associated with seven drugs using 25,074 notes from 2004.12 Therefore, the clinical text can potentially play an important role in future pharmacovigilance,13,14 particularly if we can transform notes taken daily by doctors, nurses, and other practitioners into more accessible data-mining inputs.15–17
Two key barriers to using clinical notes are privacy and accessibility.16 Clinical notes contain identifying information, such as names, dates, and locations, which is difficult to redact automatically, so care organizations are reluctant to share them.
We describe an approach that computationally processes clinical text rapidly and accurately enough to serve use cases such as drug safety surveillance. Like other terminology-based systems, it deidentifies the data as part of the process.18 We trade some individual note-level accuracy in the text processing for the “unreasonable effectiveness”24 of large data sets. Given the large volumes of clinical notes, our method produces a patient–feature matrix encoded using standardized medical terminologies. We demonstrate the use of the resulting patient–feature matrix as a substrate for signal detection algorithms for drug–adverse event associations and drug–drug interactions.
Our results show that it is possible to detect drug safety signals using clinical notes transformed into a feature matrix encoded using medical terminologies. We evaluate the performance of the resulting data set for pharmacovigilance using curated reference sets of single-drug adverse events as well as adverse events related to drug–drug interactions. In addition, we show that we can simultaneously estimate the prevalence of adverse events resulting from drug–drug interactions. The reference set, described in the Methods section, contains 28 positive associations and 165 negative associations spanning 78 drugs and 12 different events for single drug–adverse event associations. For the drug–drug interactions, the reference set contains 466 positive and 466 negative associations spanning 333 drugs across 10 events.
To demonstrate the feasibility of using free text–derived features for detecting drug–adverse event associations, we reproduce the well-known association between rofecoxib and myocardial infarction. Rofecoxib was taken off the market because of the increased risk of heart attack and stroke.19,20 We compute an association between rofecoxib and myocardial infarction, keeping track of the temporal order of the diagnosis of rheumatoid arthritis, exposure to the drug, and occurrence of an adverse event as described in the Methods section. Using data up to 2005, we obtain an odds ratio (OR) of 1.31 (95% confidence interval (CI): 1.16–1.45) for the association, which agrees with previously reported results.19,20 In a previous study, we compared using clinical notes with using the codes from the International Classification of Diseases, Ninth Revision (ICD-9), and found no association (OR: 1.71; 95% CI: 0.74–3.53) using the coded data.21 This is probably due to undercoding: being counted as exposed requires a prior arthritis indication, a criterion that only approximately one-third of the patients meet.
Figure 1 shows the adjusted ORs and 95% CIs for the 28 true-positive associations from our single drug–adverse event reference set. As expected, the results show some variation across the adverse events.10 Figure 2a shows the overall performance for detecting associations between a single drug and its adverse event, with an area under the receiver operating characteristic curve (AUC) of 75.3% (unadjusted) and 80.4% (adjusted). A threshold of 1.0 (a commonly used cutoff) on the lower bound of the 95% CI of the adjusted ORs translates to 39% sensitivity and 97.5% specificity. Choosing a signaling threshold, defined using minimum specificity of 90%, based on the receiver operating characteristic curve, yields a cutoff of 1.18 (unadjusted) and 0.84 (adjusted) on the lower bound of the 95% CI. Supplementary Data S1 online lists all adjusted results, and Supplementary Data S2 online lists the AUC threshold data.
Figure 3 shows the cumulative ORs and exposures over time based on the unadjusted associations for the 10 drugs in our reference set that have had an alert in the past decade. Using a threshold of 1.0 on the lower bound of the CI for the association, we would flag six of nine alerts earlier than the official date (we do not have enough data for one drug, troglitazone). By comparison, the propensity-adjusted method would catch three of the alerts early. The unadjusted associations can flag signals worth investigating, and the adjusted associations may reduce false alarms.
Figure 2b shows the performance (AUC of 81.5%) for detecting known adverse events arising from drug–drug interactions. Adjusting the associations for potential confounding improves the signal detection capability (red curves in Figure 2b).22 In the drug–drug interaction scenario, we do not constrain by drug indications because of combinatorial complexity. We obtain 52% sensitivity at 91% specificity, using 1.0 as a threshold on the lower bound of the CI for the adjusted associations.
Population-level prevalence data for adverse events are hard to come by. For single drugs, sources such as Side Effect Resource provide information on the frequency of specific adverse events from the drug product label. No such comparable resource exists for adverse events arising from drug–drug interactions.
While performing the drug–adverse event association calculations using data from a clinical data warehouse, we can in parallel estimate the prevalence of adverse events associated with drug–drug interactions. For example, we found that 42.8% (176 of 411) of patients on both levodopa and lorazepam experience parkinsonian symptoms, 19.8% (140 of 707) of patients on paclitaxel and trastuzumab experience neutropenia, and 17.8% (796 of 4,467) of patients on amiodarone and metoprolol experience bradycardia.
We have demonstrated that adverse drug events as well as adverse events associated with drug–drug interactions can be detected using a deidentified patient–feature matrix extracted from free-text clinical documents. Blumenthal and others5 envision a scenario in which a new drug comes to market and a nationwide learning system monitors for safety signals. Our results show that deidentified clinical notes can be used to generate drug safety signals—taking a step toward such a scenario. In addition, the patient–feature matrix also provides prevalence data not available from other data sources (e.g., spontaneous reports). Having such prevalence information can assist in prioritizing actionable events and reducing alert fatigue.23
Our approach to processing clinical notes is simple in comparison with advanced natural language processing (NLP) systems that may have better accuracy in identifying nuanced attributions of disease conditions. We sacrifice some individual note-level accuracy in exchange for the ability to detect population-level trends against massive data sets. Our results, based on a reference set of known drug–event pairs, show that when exposure data are numerous enough, the use of relatively simple text mining with standard association strength tests for signal detection can work, reflecting the adage in the machine-learning community that “a dumb algorithm with lots of data beats a clever one with modest amounts of it.”24,25 When used in combination with other data sets, clinical notes may address cases that otherwise pass undetected. We sacrifice sensitivity for specificity because for a new approach, and a new data source (clinical notes), keeping false-discovery rates low is important, particularly in the initial stages of establishing feasibility.
We find that ontologies are an excellent source of features and allow systematic normalization and aggregation when the feature set needs reduction.15,26 For example, we can count all patients who experience cardiac arrhythmias as patients with arrhythmias because of the hierarchical relationships. Therefore, ontology hierarchies can organize a very large number of terms into a smaller feature set. Moreover, because names, dates, and locations are not present in the clinical terminologies, those are not extracted as features by dictionary-based methods.18,27
We believe that the information embedded in text is crucial for leveraging EHR data,10,13,14 particularly for rare events for which large amounts of data are needed. Our annotation-based approach produces a feature matrix that complements other structured data such as codes from the ICD-9. Of note, our methods are not dependent on any particular NLP tool (we contrast MGREP and UNITEX in the Methods section), and we expect results to improve given the availability of better and faster clinical NLP tools.28,29 We are currently collaborating with researchers at the Mayo Clinic to improve the speed of the clinical Text Analysis and Knowledge Extraction System,29 one of the state-of-the-art NLP tools available for clinical text. Broader availability of curated clinical NLP data sets and health outcome definitions would accelerate research and validation.
Our work has several limitations and opportunities for improvement. Not all conditions are equally identifiable from text using lexical approaches (Supplementary Data S3 online reports validation results by condition). Advanced NLP tools would improve accuracy in these cases. Biases in our reference set, although it is among the largest used for such a study, affect our performance estimates. A new reference standard covering four events has just recently been released by the Observational Medical Outcomes Partnership,4 and we are currently evaluating its utility. Some adverse drug events are dose dependent, and our methods currently ignore this information. The UNITEX tool, described in the Methods section, includes libraries for dosage extraction and thus is a logical next step. We do not distinguish between new users of drugs and prevalent or chronic users. Our methods have a limited ability to define eras (durations of medication and illness). We are currently examining the annotation data for the utility of the last mention of a concept, sentence-level co-occurrences, and temporal density of mentions to address this question. The majority of our findings are based on the Stanford Hospital and Clinics, which is a tertiary-care center representing a skewed population. At the same time, this population has added utility for investigating rare events. Variations in signaling thresholds can also occur as a result of the prevalence or rarity of an event,10 and more research is needed to adapt detection algorithms accordingly. The prevalence data estimated in studies such as ours are an important step in this direction.10 Finally, we note that the Observational Medical Outcomes Partnership group suggests that no single method works best uniformly, that different methods be considered for each event and data source, and that profiling performance via receiver operating characteristic curves assists in understanding the utility of a method or data source.4
To conclude, our method extracts from textual clinical notes a deidentified patient–feature matrix encoded using standardized medical terminologies. We have demonstrated the use of the resulting patient–feature matrix as a substrate for detecting single drug–adverse event associations (AUC of 80.4%) and for detecting adverse events associated with drug–drug interactions (AUC of 81.5%), illustrating that clinical notes can be a source for detecting drug safety signals at scale.15 The patient–feature matrix can also be used to learn off-label usage30 and to discern drug adverse events from indications.31 Using the textual contents of the EHR complements efforts using billing and claims data or spontaneous reports4,8,14,32,33 and opens up new opportunities for leveraging observational data.
Our primary data source was the Stanford Translational Research Integrated Database Environment,34 which spans 18 years of patient data from 1.8 million patients; it contains 19 million encounters, 35 million coded ICD-9 diagnoses, and >11 million unstructured clinical notes, which are a combination of pathology, radiology, and transcription reports. The gender split is ~60% female; the average age is 44 with an SD of 25.
We created reference standards of known drug–adverse event associations for testing the performance of our methods in detecting drug safety signals from text. Supplementary Data S4 online lists the single drug–event reference set.
For the single-drug adverse events, our reference set included 12 distinct events worth monitoring35 and 78 distinct drugs, 28 positive cases, and 165 negative cases. We started with a validation set from the European Union adverse drug event project (EU-ADR)36 and to that set, we added 10 drug safety signals that involved US Food and Drug Administration intervention in the past decade, manually curating these from the literature and cross-referencing with the agency’s website. We established our false-discovery rate by generating a set of negative associations by creating all combinations of drugs and events and subtracting any known associations that were identified by any one of the EU-ADR filtering workflows,37 the Medi-Span (Wolters Kluwer Health, Indianapolis, IN) Adverse Drug Effects Database, or the Side Effect Resource database.38
For the two-drug case, known drug–drug interactions were extracted (and manually validated) from textual monographs in DrugBank and the Medi-Span Drug Therapy Monitoring System. In this case, we simulated the negative set by associating drug pairs with a randomly chosen event, removing any cases that were already known to be associated on the basis of external knowledge (DrugBank, Medi-Span, Drugs.com, Unified Medical Language System (UMLS), or Side Effect Resource). This reference set included 10 distinct events, 333 distinct drugs, 466 positive cases, and 466 negative cases.
We followed a two-step process for detecting drug safety signals: first, we computed a raw association in the form of an unadjusted OR, followed by adjustment for potential confounders. The first step is useful for flagging putative signals, and the second step is useful in reducing false alarms.
In the first step, we computed unadjusted ORs and 95% CIs by constructing a 2 × 2 contingency table26,33 from the patient–feature matrix. On the basis of first mentions of drug, event, and indication and their temporal order, we assigned patients to specific cells of a 2 × 2 contingency table as shown in Figure 4 (see also Supplementary Data S5 online). The temporal information in the patient–feature matrix is critical for determining whether the event follows exposure.39 Patients having no mention of the indication at any time are excluded from the analysis (see Supplementary Data S6 online for details on the excluded patients). Using data following the indication, and not counting repeat mentions, the ordering of the drug and event determined into which cell of the 2 × 2 table the patient fell. Because all unexposed patients have the indication, they could be on an alternative drug or other treatment, or none at all.
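As an illustrative sketch (not the production pipeline), the unadjusted OR and its Wald 95% CI can be computed from such a 2 × 2 table as follows; the cell counts in the usage line are hypothetical:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald 95% CI from a 2 x 2 table:
    a = exposed with event, b = exposed without event,
    c = unexposed with event, d = unexposed without event."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of log(OR)
    lower = math.exp(math.log(or_) - z * se_log)
    upper = math.exp(math.log(or_) + z * se_log)
    return or_, lower, upper

# Hypothetical counts for illustration only
or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)  # OR 2.25, CI ~ (0.99, 5.09)
```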
In the second step, we adjusted for confounding by specific patient factors. We included age, gender, race, and comorbidity and coprescription frequency (as a surrogate for overall health status) in calculating the propensity score.9 The propensity score quantified the likelihood of a patient to be exposed to a drug. Patients with known indications were matched (exposed vs. unexposed) via the propensity score. Finally, we included the propensity score as a covariate in logistic regression to compute adjusted ORs and 95% CIs using the coefficients of the regression model. We used the Matching and Survival packages in R.40
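The shape of this adjustment can be sketched as follows; this is a simplified numpy analogue using a plain Newton-Raphson logistic fit, not the R Matching and Survival packages used in the study, and `adjusted_or` is a hypothetical helper illustrating the propensity-as-covariate step:

```python
import numpy as np

def logit_fit(X, y, iters=25):
    """Newton-Raphson logistic regression; returns coefficients and
    standard errors (from the inverse observed Fisher information)."""
    X = np.column_stack([np.ones(len(y)), X])  # add intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None])             # Fisher information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    return beta, se

def adjusted_or(exposure, event, confounders):
    """Hypothetical sketch: fit propensity = P(exposed | confounders),
    then regress the event on exposure with the propensity score as a
    covariate; the adjusted OR is exp(exposure coefficient)."""
    ps_beta, _ = logit_fit(confounders, exposure)
    propensity = 1.0 / (1.0 + np.exp(
        -np.column_stack([np.ones(len(exposure)), confounders]) @ ps_beta))
    beta, se = logit_fit(np.column_stack([exposure, propensity]), event)
    return (np.exp(beta[1]),
            np.exp(beta[1] - 1.96 * se[1]),
            np.exp(beta[1] + 1.96 * se[1]))
```

For a single binary covariate, the fitted coefficient equals the log of the 2 × 2 table OR, which makes the fit easy to sanity-check.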
For single drug–event associations, we identified the indications of the drug using the Medi-Span Drug Indications Database and the National Drug File–Reference Terminology. In the drug–drug interaction scenario, the key idea is to determine whether the association of the event with the combination of the two drugs outweighs any association of the event with either one of the drugs alone (or none at all). Including the indications adds a degree of combinatorial complexity, so we focused primarily on the temporal order of the two drugs and event (Figure 4b) without restricting by the indications of the drugs.
Our annotator workflow, described previously,21,30 uses ~5.6 million strings from existing terminologies; filters unambiguous terms that are predominantly noun phrases representing drugs, diseases, devices, or procedures; uses the cleaned up lexicon for term recognition in the clinical notes to tag or annotate41 the text; excludes negated terms or terms that apply to family and medical history;42 normalizes all terms using the ontology hierarchies; and finally uses the time stamps of the note to produce a deidentified, temporally ordered patient–feature matrix. The process is summarized in Figure 5 and the individual steps are detailed below.
We use existing ontologies as a source of (i) a lexicon of strings that are grouped together and linked to over a million concepts via synonymy (referred to as mappings) and (ii) a hierarchy of >14 million parent–child relationships among those concepts. We use the lexicon to recognize terms in the input text using a tool called MGREP,41 which also tracks the relative position at which each term occurs (Figures 5 and 6). In addition to clinical terms, based on the ConText system,42 we include terms corresponding to contextual cues called “triggers” in our lexicon. Cues such as “denies,” “no sign of,” and “father has a history of” are used in a postprocessing step to identify terms that are negated or that apply to family or medical history. Terms that correspond to mentions in these contexts are ignored—thus, the subsequent analysis relies on positive, present mentions of concepts.
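A minimal sketch of this trigger-based filtering follows. It drops every term in a sentence carrying a cue, whereas the actual system uses MGREP with term positions and the full ConText rules; the cue list here is a small illustrative fragment:

```python
import re

# Illustrative trigger cues; the real system uses the ConText rule set.
TRIGGERS = {"denies": "negated", "no sign of": "negated",
            "father has a history of": "family_history"}

def positive_mentions(text, lexicon):
    """Tag lexicon terms sentence by sentence and keep only positive,
    present mentions (sentences with a trigger cue are skipped)."""
    mentions = []
    for sentence in re.split(r"[.;]\s*", text.lower()):
        if any(cue in sentence for cue in TRIGGERS):
            continue  # negated / history context: ignore its terms
        mentions.extend(t for t in lexicon if t in sentence)
    return mentions
```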
The resulting annotations for the Stanford Translational Research Integrated Database Environment data set comprise ~3.75 billion records. It takes 1 hour to generate annotations from 3 million documents using a single computer workstation and ~2 hours to postprocess the data. MGREP can be substituted with other NLP tools: one such tool we have tested is UNITEX,43 which offers advanced functionality such as regular expressions for drug doses and morpheme-based matching at the cost of an additional 10–20% processing time.
Motivated by previous work on identifying and removing noninformative terms,44,45 we apply a series of suppression rules that fall into two categories: syntactic and semantic. We keep terms that are predominantly noun phrases46 based on an analysis of over 20 million MEDLINE abstracts; we remove uninformative phrases based on term frequency analysis of >50 million clinical documents from the Mayo Clinic;47 and we suppress terms having fewer than four characters by default because the majority of these tend to be ambiguous abbreviations. Finally, using frequency-based sorting, we manually identify ambiguous terms that belong to more than one semantic group (drug, disease, device, and procedure),47,48 and we suppress their least likely interpretation. For example, “clip” is more likely to be a device than a drug in clinical text, so we suppress the interpretation as “corticotropin-like intermediate lobe peptide” even though clip is listed as its synonym.
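These rules can be sketched as a single filter; the 0.5 noun-phrase and 0.25 document-frequency cutoffs below are hypothetical placeholders, not the values used in the study:

```python
def keep_term(term, group, noun_phrase_frac, doc_frac, suppressed_senses):
    """Simplified suppression rules. `suppressed_senses` holds manually
    curated (term, semantic group) pairs marking the least likely sense
    of an ambiguous term, e.g. ("clip", "drug"). The 0.5 and 0.25
    cutoffs are hypothetical placeholders."""
    if len(term) < 4:
        return False                      # likely an ambiguous abbreviation
    if (term, group) in suppressed_senses:
        return False                      # unlikely sense of ambiguous term
    return (noun_phrase_frac.get(term, 0.0) > 0.5 and
            doc_frac.get(term, 1.0) < 0.25)
```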
Drug prescriptions are identified via the text processing and normalized into active ingredients using relationships from RxNorm (e.g., “tradename_of ”). Therefore, “rofecoxib 12.5 mg oral tablet” and “Vioxx” are normalized to the active ingredient rofecoxib. In addition, we map ingredients to the Anatomical Therapeutic Chemical Classification System, which enables four levels of aggregation, i.e., rofecoxib, celecoxib, and valdecoxib are all cyclooxygenase-2 inhibitors, which are nonsteroidal anti-inflammatory drugs, and so on.
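The normalization step can be sketched with toy mapping fragments; the real tables come from RxNorm (the “tradename_of” relationship) and the Anatomical Therapeutic Chemical Classification System, and the single-token splitting of dose and form here is a simplification:

```python
# Toy mapping fragments for illustration only; real mappings come from
# RxNorm and the ATC classification.
TRADENAME_OF = {"vioxx": "rofecoxib", "celebrex": "celecoxib"}
ATC_CLASS = {"rofecoxib": "coxibs", "celecoxib": "coxibs"}

def normalize_drug(mention):
    """Reduce a drug mention to (active ingredient, ATC class)."""
    token = mention.lower().split()[0]   # strip dose/form, e.g. "12.5 mg"
    ingredient = TRADENAME_OF.get(token, token)
    return ingredient, ATC_CLASS.get(ingredient)
```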
Although drug normalization is fairly straightforward, diseases, devices, and procedures present a challenge. In what we call the two-hop method (Figure 7), we use a query-driven approach to normalize disease, device, and procedure concepts. We start with definitions from the EU-ADR project’s specifications and MedDRA standardized query definitions: for example, for myocardial infarction, we would start with ICD-9 code 410 (acute myocardial infarction) and 18 different UMLS concept unique identifiers including C0027051 (myocardial infarction), C0340324 (silent myocardial infarction), and C0155626 (acute myocardial infarction). Starting with these “seed” concepts, we utilize mappings across ontologies and the hierarchical parent–child relationships to expand subsumed entities. Supplementary Data S7 online lists all seed queries and their full expansions.
We first precompute the transitive closure over all parent–child hierarchies, and we index it such that we can retrieve all ancestors or all descendants of a given concept. Second, the mappings among synonymous terms form an equivalence class to which we assign a unique identifier (similar to the UMLS Metathesaurus concept unique identifiers). Using these two resources, given concepts of interest as a seed query, for example, the 18 concepts for myocardial infarction, we use the mappings to find all canonical identifiers (first hop) and then use the transitive closure to include all subsumed concepts in the query. Next, we repeat the process once more with this expanded set of concepts (second hop). For myocardial infarction, the expansion process yields 470 unique strings. In principle, recursion with a least fixed-point semantics would apply; however, recursion does not work well in practice because of differing abstraction levels among ontologies, which induce cycles. We have found that two hops achieve an adequate balance between soundness and completeness for the current use case.
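The two-hop expansion can be sketched with a toy hierarchy; in the real pipeline the inputs are the precomputed transitive closure and the synonym equivalence classes, and the concept names below are illustrative:

```python
def descendants(children, roots):
    """Transitive closure: all concepts subsumed by `roots` via
    parent -> child edges (iterative DFS, cycle-safe)."""
    seen, stack = set(roots), list(roots)
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def two_hop(seeds, canonical, children):
    """Map concepts to canonical ids via synonym mappings, expand to
    all subsumed concepts, then repeat once more (two hops)."""
    expanded = set(seeds)
    for _ in range(2):
        expanded = descendants(children,
                               {canonical.get(c, c) for c in expanded})
    return expanded
```

Stopping after two hops, rather than recursing to a fixed point, mirrors the paper's observation that cycles induced by differing abstraction levels make full recursion impractical.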
By combining the above procedures (seeding queries using established definitions, normalizing and aggregating terms, and using only positive, present mentions; see Supplementary Data S8 online), we are able to recognize events and exposures with enough accuracy for the drug safety use case. We determine the accuracy of the event identification using a gold-standard corpus from the 2008 i2b2 Obesity Challenge.49 This corpus has been manually annotated by two annotators for 16 conditions and was designed to evaluate the ability of NLP systems to identify a condition present for a patient given a textual note. We extended this corpus by manually annotating each of the events listed in Figures 1 and 3 (see Supplementary Data S3 online).
Using the set of terms corresponding to the definition of the event of interest (see Supplementary Data S7 online) and the set of terms recognized by our annotation workflow in the i2b2 notes, we evaluate the sensitivity and specificity of identifying each of the events (see Supplementary Data S3 online). Overall, our event identification has 74% sensitivity and 96% specificity. Accuracy varies by condition: for example, myocardial infarction has 63% sensitivity and 94% specificity, whereas gallstones have 15% sensitivity and 99% specificity.
Drug recognition is done in a similar manner using strings from RxNorm. An independent study at the University of Pittsburgh, which manually examined the annotations on 1,960 clinical notes,50 estimated over 84% recall and 84% precision for recognizing drugs (R. Boyce, personal communication).
We use the time stamps for each note to induce a temporal ordering over the recognized concepts on a per-patient basis. We focus on first mentions of concepts and do not use exposure windows or eras. We keep positive, present mentions and ignore negated mentions and family and medical history mentions identified via trigger terms. Therefore, for every patient, the feature matrix contains a temporally ordered list of drugs, diseases, devices, and procedures mentioned in their medical record.
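A minimal sketch of this step, assuming annotations arrive as (patient, note timestamp, concept) triples and using hypothetical example data:

```python
from collections import defaultdict

def build_feature_matrix(annotations):
    """annotations: iterable of (patient_id, timestamp, concept) triples
    for positive, present mentions. Keeps the first mention of each
    concept per patient and orders concepts in time."""
    first = defaultdict(dict)
    for pid, ts, concept in annotations:
        if concept not in first[pid] or ts < first[pid][concept]:
            first[pid][concept] = ts           # earliest mention wins
    return {pid: sorted(seen, key=seen.get)    # temporal order
            for pid, seen in first.items()}
```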
The authors acknowledge support from the National Institutes of Health grant U54-HG004028 for the National Center for Biomedical Ontology. NHS also acknowledges support from NIH grant U54-LM008748. The authors thank Cédrick Fairon for assistance in evaluating UNITEX and Richard Boyce for evaluating drug accuracy.
AUTHOR CONTRIBUTIONS
P.L., A.B.-M., S.V.I., and N.H.S. wrote the manuscript. P.L., S.V.I., A.B.-M., and N.H.S. designed the research. P.L., S.V.I., A.B.-M., J.M.M., and N.H.S. performed the research. P.L., S.V.I., A.B.-M., R.H., T.P., and T.A.F. analyzed the data. P.L., S.V.I., A.B.-M., and N.H.S. contributed new reagents/analytical tools.
CONFLICT OF INTEREST
The authors declared no conflict of interest.