Recently, comprehensive EHRs linked with DNA biorepositories have demonstrated their value for genomic research, including identifying genetic variants contributing to diseases.4
Since EHRs contain a longitudinal record of medication exposure and response, EHR-linked genomic data may also be a valuable resource for pharmacogenomic research. A recent study demonstrated this potential, replicating the associations between steady-state warfarin weekly dose and genetic variants in VKORC1
However, the most challenging step in that study was extracting drug-exposure and outcome information from the EHR. Manual chart review is costly and time-consuming, which hampers the efficiency of pharmacogenetic studies. In this study, we extended an existing medication extraction tool to automatically calculate weekly doses of warfarin from clinical text, and evaluated whether known genetic associations with warfarin dose could be found using only automatically extracted values. Evaluation showed that the system performed well in capturing weekly doses of warfarin. Moreover, the genetic association analysis successfully replicated previously known associations between steady-state warfarin weekly dose and variants in VKORC1 genes, using completely automated calculations of weekly dose. This demonstrates that informatics tools have the potential to simplify the data-extraction process for EHR-based pharmacogenetic studies.
Determining medication doses is critical to many drug-related studies using EHR data, including pharmacogenetic studies. In inpatient settings, structured medication data can be obtained from computerized systems such as physician order entry and electronic medication administration records. For outpatients, however, detailed drug-dose information is often embedded in clinical text and typically requires costly manual abstraction. NLP systems that can automatically extract and calculate daily or weekly doses of medications used in the outpatient setting would therefore remove a major bottleneck. However, current medication extraction systems, such as those developed for the 2009 i2b2 NLP challenges, output textual strings of extracted information such as dose and frequency, and these strings are not directly usable for daily or weekly dose calculation. Interpreting these data for real-world use is not a trivial task. In this study, we demonstrated that we could extend an existing NLP tool (MedEx) with new knowledge components to accurately capture weekly doses of warfarin, providing a good example of applying existing NLP tools to practical research uses.
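To illustrate why extracted dose and frequency strings need a further interpretation step, the following is a minimal sketch of a weekly dose calculation, not the MedEx implementation. The function name `weekly_dose_mg`, the frequency table `ADMINISTRATIONS_PER_WEEK`, and the idea that strength, tablet count, and frequency arrive as separate normalized strings are all assumptions made for this example.

```python
from fractions import Fraction

# Hypothetical lookup: administrations per week for a few normalized
# frequency strings. A real system would need a far larger lexicon.
ADMINISTRATIONS_PER_WEEK = {
    "daily": Fraction(7),
    "bid": Fraction(14),      # twice a day
    "qod": Fraction(7, 2),    # every other day
    "weekly": Fraction(1),
}


def weekly_dose_mg(strength_mg, tablets="1", frequency="daily"):
    """Weekly dose = tablet strength x tablets per administration
    x administrations per week.

    `tablets` may be a decimal ("0.5") or a fraction ("1/2") string.
    `frequency` defaults to "daily" (an assumption of this sketch) for
    mentions that state no frequency.
    """
    strength = Fraction(str(strength_mg))
    per_administration = strength * Fraction(tablets)
    return float(per_administration * ADMINISTRATIONS_PER_WEEK[frequency.lower()])
```

Exact fractions are used internally so that expressions such as half or quarter tablets do not accumulate rounding error before the final conversion to a float.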
Although MedEx performs well unchanged for most medications and dosing regimens, applying it to warfarin extraction was challenging. We found that the dosing text of warfarin was much more complicated than that of typical drugs (see examples in ). Our current implementation, with new lexicons and an additional parsing step, provides a generalizable approach to this problem, allowing MedEx to be customized to reach high performance for a specific drug. In addition, we conducted an experiment to assess whether these modifications to MedEx affect its performance on other drugs. We randomly selected 1000 discharge summaries from the SD and processed them using both the original and the modified MedEx. Among all 42 563 medication entities (including associated signature information) recognized by the modified MedEx, 41 942 (98.54%) were identical to the outputs from the original MedEx. A manual analysis of 50 mismatched medication entities extracted by the modified MedEx showed that 53% of them were correctly identified by the original MedEx, and 47% were correctly identified by the modified MedEx. These results indicate that modifying MedEx to improve warfarin dose extraction did not substantially affect its performance on other drugs.
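The comparison between the two versions can be sketched as a simple agreement count over extracted entities. This is an illustration only: the function name `agreement` and the representation of each system's output as a dictionary keyed by a hypothetical entity identifier are assumptions, not the format MedEx produces.

```python
def agreement(original, modified):
    """Compare two extraction runs over the same documents.

    Both arguments map a hypothetical entity key (for example, a
    (note_id, character_offset) pair) to the extracted entity, including
    its signature information. Returns the number of entities in
    `modified` that are identical in `original`, the total number of
    entities in `modified`, and the agreement as a percentage.
    """
    total = len(modified)
    identical = sum(
        1 for key, entity in modified.items() if original.get(key) == entity
    )
    return identical, total, round(100 * identical / total, 2)


# The counts reported above correspond to this percentage:
reported_percent = round(100 * 41942 / 42563, 2)
```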
We also examined errors in the weekly dose extraction, which fell into two classes: (1) failures in capturing dose-related findings (eg, ‘1/4 tablet’ was not identified in the sentence ‘warfarin 2.5 mg PO 0.5 tabs daily ex ¼ tablet q Th’); and (2) failures in dose normalization and weekly dose calculation (eg, the sentence ‘Continue Coumadin 5 mg’ indicated a weekly dose of 35 mg, as it omitted the default frequency ‘daily’, but our program output 5 mg as its weekly dose). Manual analysis of 20 sentences with incorrect weekly doses revealed that 10% of errors came from normalization and calculation, and 90% came from dose entity extraction, which reflects the complexity of natural language expression (eg, we noticed different expressions for ‘every Monday and Wednesday’: ‘qM, W; q Mon, Wed; qMon, Wed; q Mondays and Wednesdays; qM&W … etc’). Our future work will investigate methods to capture variants of dosing entities.
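As an illustration of the variant problem, the sketch below normalizes the day-of-week expressions quoted above to a canonical list of days. It is a hedged example, not the parsing step used in this study: the alias table `DAY_ALIASES` and the function `normalize_days` are hypothetical, and single-letter abbreviations such as ‘T’ or ‘S’ are resolved by an arbitrary convention chosen for this example.

```python
import re

# Hypothetical alias table; "t" is read as Tuesday and "th" as Thursday.
DAY_ALIASES = {
    "Mon": ("m", "mon", "monday"),
    "Tue": ("t", "tu", "tue", "tues", "tuesday"),
    "Wed": ("w", "wed", "wednesday"),
    "Thu": ("th", "thu", "thur", "thurs", "thursday"),
    "Fri": ("f", "fri", "friday"),
    "Sat": ("sa", "sat", "saturday"),
    "Sun": ("su", "sun", "sunday"),
}
ALIAS_TO_DAY = {
    alias: day for day, aliases in DAY_ALIASES.items() for alias in aliases
}


def normalize_days(expression):
    """Map an expression such as 'qM&W' to canonical day names."""
    text = expression.strip().lower()
    text = re.sub(r"^q\s*", "", text)            # drop the leading 'q' (every)
    tokens = [t for t in re.split(r"[\s,&]+", text) if t and t != "and"]
    days = []
    for token in tokens:
        # Accept plural forms such as 'mondays' by stripping a trailing 's'.
        key = token if token in ALIAS_TO_DAY else token.rstrip("s")
        if key not in ALIAS_TO_DAY:
            raise ValueError(f"unrecognized day token: {token!r}")
        days.append(ALIAS_TO_DAY[key])
    return days
```

A lookup table of this kind handles only the variants someone has already seen, which is consistent with the observation that most errors arose at the entity extraction stage rather than in the arithmetic.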
Our genetic association algorithm selected the steady-state warfarin dose by finding a time period of between 3 and 12 weeks in which the patient had stable INR values between 2 and 3 (for patients in whom that was the goal range). Given that many people had stable doses for long periods of time with only slight adjustments in dose, we analyzed the ‘median’ dose during that time period. When we compared the patient-level stable weekly doses from the automated approach with those from manual review, 75% were exactly the same, and 88% were within 20% of the manually extracted doses. Manual review of 20 patients randomly selected from the 25% mismatched population showed that 11 of them (55%) had an incorrect stable weekly dose from the automated approach due to errors by the weekly dose calculation system. Seven patients (35%) had dose differences because the automated weekly dose extraction system identified more warfarin dosing mentions than were found in the manual review. The remaining two patients had incorrect stable weekly doses from the manual review (ie, the reviewer calculated the weekly dose incorrectly). These findings indicate that the weekly dose extraction system needs further improvement, but also that some of the differences were due to errors in the manual review process. We also noticed that a fair number of patients had discordant drug-dosing information stored in the EHR for the same day. Six of the 20 patients reviewed (30%) had at least one discrepant pair of warfarin weekly doses from different notes recorded on the same date. A detailed analysis of these discrepancies is beyond the scope of this study, but resolving such discrepant information would help to identify the correct weekly dose.
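The selection and comparison steps described in this paragraph can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm used in the study: the input format (date-sorted tuples of date, INR, and weekly dose), the rule of taking the first qualifying run of consecutive in-range INR values, and the names `stable_weekly_dose` and `dose_agreement` are all hypothetical.

```python
from datetime import date, timedelta
from statistics import median


def stable_weekly_dose(visits, low=2.0, high=3.0, min_weeks=3, max_weeks=12):
    """Median weekly dose over the first run of in-range INR values that
    spans at least `min_weeks`, truncated at `max_weeks`.

    `visits` is a list of (date, inr, weekly_dose_mg) sorted by date.
    Returns None when no qualifying period exists.
    """
    runs, current = [], []
    for visit in visits:
        if low <= visit[1] <= high:
            current.append(visit)
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)

    for run in runs:
        start = run[0][0]
        window = [v for v in run if v[0] - start <= timedelta(weeks=max_weeks)]
        if window[-1][0] - start >= timedelta(weeks=min_weeks):
            return median(v[2] for v in window)
    return None


def dose_agreement(automated, manual, tolerance=0.20):
    """Fraction of patients whose automated dose equals the manual dose,
    and fraction within `tolerance` (relative to the manual dose)."""
    pairs = list(zip(automated, manual))
    exact = sum(a == m for a, m in pairs)
    close = sum(abs(a - m) <= tolerance * m for a, m in pairs)
    return exact / len(pairs), close / len(pairs)
```

Using the median rather than the mean keeps a single slightly adjusted dose within a long stable period from shifting the patient-level value.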
Despite its success in replicating known warfarin pharmacogenetic associations, this study has limitations. The drug-outcome data in this study concern drug doses rather than drug responses such as adverse events or treatment efficacy. Assessing drug responses requires accurate identification of an event and its timing in relation to drug exposure, which could be more challenging for automated extraction. Additionally, assessing drug exposure based on medication mentions in clinical text does not account for possible non-compliance, which can be common with many medications. Finally, such methods require a robust EHR that can be easily queried and that has been linked to genomic information. The medication dose extraction tools developed in this study are valuable for clinical research, but they are not yet robust enough to support practical applications such as decision-support systems in clinical settings.