In this study, we evaluated the utility of regular expressions for computational extraction of blood pressure and anti-hypertensive medication intensification information from narrative physician notes. The software achieved accuracy rates comparable to the rates of agreement between human abstractors for all categories of information it extracted, while processing data several orders of magnitude faster. This performance could make the software usable in large-scale epidemiologic studies, where it could potentially replace months of manual work by many highly trained human abstractors.
Our software used a set of regular expressions to accomplish the task. This approach has both advantages and disadvantages. Its obvious limitation is the lack of generalizability: a new set of regular expressions has to be developed and validated for each particular task. Applications of regular expressions are also limited to the extraction of data items that have a constrained lexical scope, and complex synonyms have to be generated manually. On the other hand, a set of regular expressions can be developed much faster than a full-fledged natural language processing engine. This is particularly important because most academic natural language processing engines are not publicly available, and the commercial ones frequently carry a price tag unaffordable to researchers. The only freely available natural language processing engine for medical documents, MetaMap, has significant limitations, including the lack of negation handling. 4
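To illustrate what a lexically constrained data item looks like in practice, the following is a minimal sketch of a regular expression for blood pressure readings. It is not one of the expressions used in our software: the pattern, the function name `extract_blood_pressures`, and the plausibility thresholds are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern (not the study's actual expressions): a "BP" or
# "blood pressure" label, a short run of non-digit filler, then
# systolic/diastolic values separated by a slash.
BP_PATTERN = re.compile(
    r"\b(?:bp|blood\s+pressure)\b\D{0,15}?(\d{2,3})\s*/\s*(\d{2,3})\b",
    re.IGNORECASE,
)


def extract_blood_pressures(note: str) -> list[tuple[int, int]]:
    """Return (systolic, diastolic) pairs found in a narrative note."""
    readings = []
    for match in BP_PATTERN.finditer(note):
        systolic, diastolic = int(match.group(1)), int(match.group(2))
        # Assumed plausibility limits, used only to discard obvious
        # non-readings such as ratios or mistyped values.
        if 50 <= systolic <= 300 and 20 <= diastolic <= 200 and systolic > diastolic:
            readings.append((systolic, diastolic))
    return readings


if __name__ == "__main__":
    note = "Pt seen today. BP 142/88, repeat blood pressure was 138 / 84. Seen 10/12."
    print(extract_blood_pressures(note))  # [(142, 88), (138, 84)]
```

The unlabeled "10/12" is ignored because the pattern requires an explicit label. A sketch this small also shows the limitation noted above: it does not recognize synonyms or phrasings outside the two labels it lists, and each such variant would need to be added and validated by hand.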
Both the benefits and the shortcomings of using regular expressions for information extraction were well illustrated in our study. The software tool was designed, implemented and validated over a period of only six months, a significant advantage over syntactic and semantic parsing systems, which frequently take years to develop. The software achieved high accuracy rates while maintaining the speed of data processing necessary for handling large data sets.
Conversely, the accuracy of the information extraction by the software, while high, was not perfect. At the point where the design and implementation of the software were completed, there remained a number of patterns that the software misinterpreted, producing either false positives or false negatives. Typically these patterns occurred rarely (e.g., once or twice in the entire dataset of several thousand documents), and it was therefore not practical to design additional regular expressions to capture them. Concepts whose expression in narrative medical documents is less lexically constrained than the ones we chose to study may not be suitable for this technique.
In conclusion, our case study demonstrates that regular expressions can be effectively used to extract focused information from narrative medical documents. When general-purpose NLP software is not available, regular expressions provide an alternative approach to the abstraction of lexically constrained data elements, one that can be quickly designed and validated and can potentially be used in a number of clinical applications.