This case study examined the utility of regular expressions to identify clinical data relevant to the epidemiology of treatment of hypertension. We designed a software tool that employed regular expressions to identify and extract instances of documented blood pressure values and anti-hypertensive treatment intensification from the text of physician notes. We determined sensitivity, specificity and precision of identification of blood pressure values and anti-hypertensive treatment intensification using a gold standard of manual abstraction of 600 notes by two independent reviewers. The software processed 370 Mb of text per hour, and identified elevated blood pressure documented in free text physician notes with sensitivity and specificity of 98%, and precision of 93.2%. Anti-hypertensive treatment intensification was identified with sensitivity 83.8%, specificity of 95.0%, and precision of 85.9%. Regular expressions can be an effective method for focused information extraction tasks related to high-priority disease areas such as hypertension.
By some estimates free text physician notes contain over 50% of the data in the patient’s medical record. 1 Across the United States healthcare organizations are increasingly moving toward electronic medical record systems. 2 As part of this process, physician notes are frequently becoming available in digital format and thus potentially amenable to computational processing. 3 Information extracted from narrative medical documents has been used for populating structured electronic medical records databases, 4 billing, 5 identification of potential subjects for research studies, 6 and epidemiological research. 7–9
A number of software tools for extraction of information from narrative medical documents have been described in the literature. 4,5,10–14 At this time, most of the tools developed in academic settings are not easily available to other researchers, while the commercial ones are costly. Regular expressions, a metalanguage that describes finite-state automata used to recognize string patterns, 15 have been employed in information extraction both in and outside of medicine, 16–18 and could provide an alternative to more complex syntactic/semantic parsers.
Regular expressions were first described by Kleene in 1956. 19 Their advantages include speed and ease of use: the tools for interpreting regular expressions already exist in multiple implementations and over the years have been fine-tuned for performance. 20 At the same time regular expression syntax is mostly standard across all implementations and regular expressions developed for one application can usually be transferred to any of the others with minimal modification. 15 The main drawback of regular expressions when compared to the syntactic/semantic parsers is their lack of flexibility. However, medical narrative documents have been shown to be lexically less ambiguous than unrestricted documents. 21 Consequently, regular expressions possess many of the qualities necessary for successful extraction of information from free text medical documents, such as physician notes.
In this paper we report an application of regular expressions to the extraction of blood pressure values and anti-hypertensive medication intensification information from narrative medical documents. Information derived from this application can potentially be used for future epidemiologic studies of hypertension treatment. Many large-scale epidemiologic investigations are currently carried out using manual review of patient charts, which typically requires the expensive labor of trained professionals over long periods of time. In contrast, information extraction software based on regular expressions can process many thousands of notes per hour, 22 drastically reducing the cost and time required to complete such a study.
In order to successfully apply computational information extraction to epidemiologic research, it is necessary to show that the technique employed has high accuracy and offers a significant gain in speed over manual methods. In this study we therefore evaluated the accuracy and speed of a regular expression-based software tool that abstracts the documented blood pressure values and anti-hypertensive medication regimen intensification from the text of narrative physician notes in the electronic medical record.
The program used to extract the data from physician notes was implemented in Perl and used extended regular expressions to detect word patterns empirically determined to be specific for the concepts sought. The program identified two sets of concepts: blood pressure values and treatment intensification. Treatment intensification was defined as initiation of a new or an increase in the dose of an existing anti-hypertensive medication (the definition used in previously reported studies on the subject 23 ). Substitutions of one anti-hypertensive medication for another were included; decreases in the dose of an existing anti-hypertensive medication were excluded. 24 The documented blood pressure values were identified using the following algorithm:
Documentation of anti-hypertensive treatment intensification was identified using the following algorithm:
The actual regular expressions used to detect blood pressure and anti-hypertensive treatment intensification can be found in Appendices 1 and 2, respectively (both available as JAMIA online supplements at www.jamia.org). If more than one blood pressure value was documented in the note, the blood pressure value with the lowest mean arterial blood pressure (diastolic blood pressure + one-third of the difference between systolic and diastolic blood pressure) was recorded.
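The selection rule described above can be sketched in a few lines. The example below is written in Python for illustration only (the original program was implemented in Perl), and the names `Reading`, `mean_arterial_pressure`, and `select_recorded_reading` are hypothetical, not taken from the authors' software.

```python
from typing import List, NamedTuple, Optional


class Reading(NamedTuple):
    """One documented blood pressure value, in mm Hg (hypothetical type)."""
    systolic: int
    diastolic: int


def mean_arterial_pressure(reading: Reading) -> float:
    # Formula as stated in the text: diastolic pressure plus one-third of
    # the difference between systolic and diastolic pressure.
    return reading.diastolic + (reading.systolic - reading.diastolic) / 3


def select_recorded_reading(readings: List[Reading]) -> Optional[Reading]:
    # When a note documents several values, keep the one with the lowest
    # mean arterial pressure; return None if the note documents none.
    return min(readings, key=mean_arterial_pressure, default=None)
```

For a note documenting both 150/90 and 138/84, this sketch would record 138/84, since its mean arterial pressure (102 mm Hg) is lower than that of 150/90 (110 mm Hg).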
We assessed the accuracy of extraction of blood pressure values documented in the text of the note and of identification of anti-hypertensive medication intensification. Two non-overlapping sets of 300 primary care physician notes randomly selected from the electronic medical record of two academic medical centers were used to evaluate each of these outcomes. The notes in the medication intensification set were randomly selected from the notes with documented elevated blood pressure (identified using the technique validated in the first phase of the evaluation). Elevated blood pressure was defined as either systolic blood pressure above 129 mm Hg, or diastolic above 84 mm Hg, in accordance with the guidelines published prior to the beginning of the study period. 25
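The definition of elevated blood pressure used here reduces to a simple threshold test. The following Python sketch is illustrative only; the constant and function names are assumptions, and only the two thresholds come from the text.

```python
# Thresholds from the study's definition of elevated blood pressure (mm Hg).
SYSTOLIC_LIMIT = 129
DIASTOLIC_LIMIT = 84


def is_elevated(systolic: int, diastolic: int) -> bool:
    # Elevated if EITHER component exceeds its limit (strictly above).
    return systolic > SYSTOLIC_LIMIT or diastolic > DIASTOLIC_LIMIT
```

Under this definition a reading of 129/84 is not elevated, while 130/80 and 120/85 both are.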
For the first phase of the evaluation, blood pressure values were manually abstracted from each of the notes by two independent reviewers who did not participate in the design of the software and did not know the word patterns that the software identified. The reviewers’ results were then compared and inter-reviewer consensus was established after joint review of the notes for which the original abstractions differed. This consensus was subsequently used as the gold standard to which the software results were compared to determine sensitivity, specificity, and overall agreement of automatic extraction of documented blood pressure values. For the second phase of the evaluation, a similar procedure was followed to establish inter-reviewer consensus between manual abstractions of anti-hypertensive treatment intensification by two independent reviewers. This consensus was subsequently used as the gold standard to which the software results were compared to determine sensitivity, specificity and overall agreement of automatic extraction of documented intensification of anti-hypertensive medication regimen.
This study was approved by the Partners HealthCare Human Research Committee.
Inter-reviewer agreement for manual abstraction of the numeric value of documented blood pressure from 300 randomly selected physician notes was 94.0% with kappa of 0.94. Inter-reviewer agreement for manual abstraction of documentation of anti-hypertensive treatment intensification from a second set of 300 randomly selected physician notes that documented elevated blood pressure was 91.7% with kappa of 0.79.
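The text does not state which kappa statistic was computed. Assuming it is Cohen's kappa for two raters, percent agreement and kappa can be obtained from the paired abstractions as in the following Python sketch. The function name and the sample labels in the usage note are hypothetical and are not study data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` has observed agreement 0.75 and chance agreement 0.5, giving a kappa of 0.5. When the labels can take many distinct values, as with numeric blood pressure readings, chance agreement is small and kappa approaches the raw agreement rate.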
The same 600 physician notes were subsequently processed by the software and the results were compared to the inter-reviewer manual abstraction consensus. The software processed over 370 Mb of text per hour. Sensitivity (recall), specificity, and positive predictive value (precision) were calculated for identification of elevated blood pressure, numeric value of the blood pressure documented in the note, and documentation of anti-hypertensive treatment intensification. Sensitivity of the automated data extraction ranged from 83.8% for treatment intensification to 98.2% for documentation of elevated blood pressure, and specificity ranged from 95.0% for treatment intensification to 98.4% for documentation of elevated blood pressure. Positive predictive value of the automated data extraction ranged from 85.9% for treatment intensification to 93.2% for documentation of elevated blood pressure.
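The three measures reported above follow from a comparison of the software output with the gold standard, note by note. The Python sketch below shows one way to compute them for a binary outcome such as "elevated blood pressure documented"; the function name and the toy labels in the usage note are hypothetical and do not reproduce the study data.

```python
from typing import Dict, Sequence


def evaluate(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, float]:
    """Sensitivity, specificity, and precision against a gold standard.

    Assumes both outcomes occur in the gold standard and that the software
    flags at least one note, so that no denominator is zero.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g and p for g, p in pairs)            # true positives
    fn = sum(g and not p for g, p in pairs)        # false negatives
    fp = sum(not g and p for g, p in pairs)        # false positives
    tn = sum(not g and not p for g, p in pairs)    # true negatives
    return {
        "sensitivity": tp / (tp + fn),  # recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),    # positive predictive value
    }
```

With eight toy notes, of which three are truly positive and the software flags two of those plus one negative note, the sketch yields sensitivity 2/3, specificity 4/5, and precision 2/3.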
Examples of the word patterns correctly and incorrectly identified by the software are given in . For blood pressure identification, the most common false negatives were encountered in sentences that documented only the systolic but not the diastolic blood pressure, and the most common false positives were blood pressure values that did not represent the patient’s blood pressure (e.g., goal blood pressure). Many of the false positive identifications of treatment intensification were due to missed conditionals or references to the past, while most of the false negatives were caused by word patterns not captured by the set of regular expressions used.
In this study, we present an evaluation of the utility of regular expressions for computational extraction of blood pressure and anti-hypertensive medication intensification information from narrative physician notes. The software achieved accuracy rates comparable to the rates of agreement between human abstractors for all categories of information it extracted, while processing data at speeds several orders of magnitude higher. This approach could make it possible to use the software in large-scale epidemiologic studies, where it could potentially replace months of manual work by many highly trained human abstractors.
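To make these error classes concrete, the Python sketch below uses a deliberately simplified pattern. It is not one of the expressions from the appendices: the pattern, the context check, and all names are assumptions made for illustration. It shows how a reading with both components is matched, how a systolic-only mention is missed (a false negative), and how a guard for words such as "goal" can suppress one kind of false positive.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: a "BP" or "blood pressure" label, up to 20 non-digit
# characters, then systolic/diastolic written as two 2-3 digit numbers.
BP_PATTERN = re.compile(
    r"\b(?:bp|blood\s+pressure)\b"
    r"[^0-9\n]{0,20}?"
    r"(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)

# Hypothetical guard: words suggesting the value is not a measurement.
GOAL_WORDS = re.compile(r"\b(?:goal|target)\b", re.IGNORECASE)


def extract_readings(note: str) -> List[Tuple[int, int]]:
    readings = []
    for match in BP_PATTERN.finditer(note):
        # Inspect the text from 30 characters before the label up to the
        # number, restricted to the current sentence or line.
        window = note[max(0, match.start() - 30):match.start(1)]
        window = re.split(r"[.\n]", window)[-1]
        if GOAL_WORDS.search(window):
            continue  # e.g. "goal BP is 130/80" is not a measurement
        readings.append((int(match.group(1)), int(match.group(2))))
    return readings
```

Applied to "BP today 142/88. Goal BP is 130/80." the sketch keeps only the measured value, while "Systolic BP 150 in the office" yields nothing because no diastolic value follows. A production rule set would need many more variants than this single pattern.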
Our software used a set of regular expressions to accomplish the task. This approach has both advantages and disadvantages. Its obvious limitation is the lack of generalizability: a new set of regular expressions has to be developed and validated for each particular task. Applications of regular expressions are also limited to the extraction of data items that have a constrained lexical scope, and complex synonyms have to be manually generated. On the other hand, a set of regular expressions can be developed much faster than a full-fledged natural language processing engine. This is particularly important because most of the academic natural language processing engines are not publicly available, and the commercial ones frequently bear a price tag unaffordable to researchers. The only freely available natural language processing engine for medical documents—MetaMap—has significant limitations, including lack of implementation of negations. 4
Both the benefits and the shortcomings of using regular expressions for information extraction were well illustrated in our study. The software tool was designed, implemented and validated over a period of only six months, a significant advantage over syntactic and semantic parsing systems that frequently take years to develop. The software achieved high accuracy rates while maintaining the speed of data processing necessary for handling large data sets.
Conversely, the accuracy of the information extraction by the software, while high, was not perfect. At the point where the design and implementation of the software were completed, there remained a number of patterns that the software misinterpreted either as false positives or false negatives. Typically these patterns occurred rarely (e.g., once or twice in the entire dataset of several thousand documents) and it was therefore not practical to design additional regular expressions to capture them. Concepts whose expression in the narrative medical documents is less lexically constrained than the ones we chose to study may not be suitable for this technique.
In conclusion, our case study demonstrates that regular expressions can be effectively used to extract focused information from narrative medical documents. When general purpose NLP software is not available, regular expressions provide an alternative approach for abstraction of lexically constrained data elements that can be quickly designed and validated and can potentially be used in a number of clinical applications.
The authors would like to express their gratitude to Dr. Maria Shubina at the Center for Clinical Investigations at the Brigham and Women’s Hospital for her advice on statistical evaluation.
This research was supported in part by the Partners HealthCare IS Research Council (AT, JSE), Diabetes Trust Foundation (AT), NHLBI training grant T32HL007609 (NSK), and NIDDK Career Development Award K23 DK067452 (RWG).