The Obesity Challenge, sponsored by Informatics for Integrating Biology and the Bedside (i2b2), a National Center for Biomedical Computing, asked participants to build software systems that could “read” a patient's clinical discharge summary and replicate the judgments of physicians in evaluating the presence or absence of obesity and 15 comorbidities. The authors describe their methodology and discuss the results of applying Lockheed Martin's rule-based natural language processing (NLP) capability, ClinREAD. We tailored ClinREAD with medical domain expertise to assign default judgments based on the most probable results in the ground truth. The system then used rules to collect evidence similar to the evidence that the human judges likely relied upon, and applied a logic module to weigh the strength of all evidence collected and arrive at final judgments. The Challenge results suggest that rule-based systems guided by human medical expertise are capable of solving complex problems in machine processing of medical text.
The Obesity Challenge, sponsored by Informatics for Integrating Biology and the Bedside (i2b2), a National Center for Biomedical Computing, asked participants to build software systems that could “read” a patient's clinical discharge summary and replicate the judgments of physicians in evaluating the patient's condition. To create training and evaluation data, physician judges, who had substantial clinical experience treating obese patients, were asked to read 1,250 discharge reports and make judgments about whether each patient had or did not have obesity and fifteen of its comorbidities. The judgments were of two types: textual judgments, based solely on the literal content, and intuitive judgments, based on the judges' “gestalt” interpretation of all information in the text. The 15 comorbidities were asthma, coronary artery disease, congestive heart failure, depression, diabetes, gastroesophageal reflux disease (GERD), gallstones, gout, high cholesterol, hypertension, hypertriglyceridemia, osteoarthritis, obstructive sleep apnea, peripheral vascular disease, and venous insufficiency. The challenge was to replicate the human judges' judgments: the possible textual judgments were “yes” (Y), “no” (N), “possibly” (Q), or “unmentioned” (U); possible intuitive judgments were Y, N, and Q. The metrics used to evaluate participating systems were recall, precision, and the combined F-measure.
Lockheed Martin and SAGE Analytica partnered to create a rule-based system guided by human medical expertise. Methods for processing clinical records, as with free text in other domains, include statistical systems, 1 rule-based systems, and hybrids. Arguably, all approaches require some manual labor, either in developing the rules or in developing the training data. Much work has been published related to the Natural Language Processing (NLP) challenges of 2006 (identifying patient smoking status) 2 and of 2007 (assigning ICD-9-CM tags to radiology reports). 3 In the former, the top performing system (Clark et al.) 4 combined a rule-based extraction engine with machine learning algorithms. In the latter, the top system (Farkas and Szarvas) 5 was a hand-crafted rule-based system combined with statistical learning models. Because many rule-based systems, including the Lockheed Martin system, performed well in the 2007 challenge, we hypothesized that a rule-based approach integrating medical, epidemiological, and NLP expertise would be able to effectively complete the 2008 task.
An unabridged version of this manuscript is available as an online supplement at http://www.jamia.org.
We approached the Challenge as an exercise in gathering and weighing evidence. The fundamental problem was to identify the “signals” that the judges relied on for each disease judgment, and then create a system that could find these signals and weigh them against one another to reproduce those judgments. We used Lockheed Martin's ClinREAD, a rule-based NLP capability, a to build a tailored software solution. The solution contained a knowledge base comprising the lexical information necessary to recognize mentions of obesity and each of the 15 comorbidities in the clinical records, the syntactic context of these mentions, and the logic required for weighing the evidence found. Our domain experts manually examined many of the patient records in the development set, along with the judges' decisions, in order to replicate the judges' logic as closely as possible.
Our development process was iterative, starting with textual judgments and then turning to intuitive ones. Because the judges chose U for 72% of records in the training set, we assigned U as the default textual judgment. We also developed a set of basic rules that would find mentions of each condition; any mention found would provide weak evidence for a Y judgment. Then, over several rounds of system testing and error analysis, we established rules to convert a provisional Y judgment into a Q or N. We increased the weight of evidence found in the “past medical history” and “primary diagnosis” sections of a record, and established rules to eliminate any evidence from the “family history” section. The goal in each iteration of system development was to balance precision and recall scores to maximize the macroaveraged F-measure.
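The section-sensitive weighting described above can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' implementation: the section names come from the text, but the mapping function and strength labels are hypothetical.

```python
# Illustrative sketch of section-sensitive evidence weighting:
# mentions in "past medical history" or "primary diagnosis" are
# up-weighted, mentions in "family history" are discarded, and
# mentions elsewhere count only as weak evidence.

SECTION_WEIGHT = {
    "past medical history": "strong",
    "primary diagnosis": "strong",
    "family history": None,  # evidence in this section is eliminated
}

def weight_mention(section):
    """Map a record section to an evidence strength (None = ignore)."""
    return SECTION_WEIGHT.get(section.lower(), "weak")
```

A mention of "diabetes" under "Family History," for example, would be dropped entirely rather than counted as weak evidence for a Y judgment.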
Intuitive defaults were derived from the system's textual judgments and the pattern observed in the ground truth judgments. A textual U or N accompanied an intuitive N 99.9% of the time; a textual Y, an intuitive Y 65% of the time; and a textual Q, an intuitive Q 91% of the time. Intuitive defaults were set to N for a textual U or N, Y for a textual Y, and Q for a textual Q. We wrote further intuitive-only rules on diseases for which the intuitive ground truth was most likely to depart from the default text-to-intuitive pattern; these included depression, GERD, and osteoarthritis. The signals used to override an intuitive default were often the presence of drug names associated with a specific disease, even where the disease itself was not explicitly mentioned.
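The text-to-intuitive defaults and drug-name overrides described above amount to a small lookup-and-override scheme, sketched below in Python. The function name, the override direction, and the example drug names are illustrative assumptions, not the authors' actual rule set.

```python
# Hypothetical sketch of the textual-to-intuitive default mapping and
# drug-signal overrides described in the text. Drug names below are
# illustrative examples only.

TEXT_TO_INTUITIVE_DEFAULT = {
    "U": "N",  # textual "unmentioned" defaults to intuitive "no"
    "N": "N",  # textual "no" defaults to intuitive "no"
    "Y": "Y",  # textual "yes" defaults to intuitive "yes"
    "Q": "Q",  # textual "possibly" defaults to intuitive "possibly"
}

# Diseases whose intuitive judgment was most likely to depart from the
# default, flagged by drug names associated with the disease.
DRUG_SIGNALS = {
    "depression": {"prozac", "zoloft"},      # illustrative antidepressants
    "gerd": {"prilosec", "protonix"},        # illustrative PPIs
}

def intuitive_judgment(disease, textual, drugs_mentioned):
    """Start from the default, then let disease-specific drug
    evidence override it toward Y."""
    default = TEXT_TO_INTUITIVE_DEFAULT[textual]
    if default != "Y" and DRUG_SIGNALS.get(disease, set()) & drugs_mentioned:
        return "Y"
    return default
```

Under this sketch, a record that never mentions depression (textual U) but lists an associated antidepressant would still receive an intuitive Y.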
To more closely model the judges' thinking, the system was adjusted for two anomalies in the development set ground truth, both of which seemed to result from how the judges went about their task. First, the probability that intuitive ground truth would not follow the default text-to-intuitive pattern increased dramatically for documents with ID numbers greater than 500, indicating that the judges changed their criteria for evaluating intuitive responses as they worked through the document set. We therefore applied our intuitive rules so that they could override the text-to-intuitive default only for documents with ID numbers above 500. Second, textual Q responses for obesity were far more likely in documents with numbers less than 60 than in the rest of the set. This seemed to stem from early confusion on the judges' part about the difference between a textual Q and an intuitive Y. We therefore applied different rules for textual obesity Q judgments to low-numbered documents. Although both of these strategies responded to artifacts introduced by the judging process in the context of a competition, similar systematic biases can occur in real-world situations. We concentrated most of our efforts on Q and N judgments, as these had the greatest impact on macroaveraged F-measure, the primary outcome.
Our system was rule-based, meaning that a comprehensive set of rules defines patterns in the text and instructs the system to take specific actions when each pattern is found. In the nomenclature of our software, a feature is a literal text string that represents a disease or other concept; synonymous features are grouped together into feature lists. For example, our features for “osteoarthritis” were arthritis, DJD, OA, osteoarthritis, osteoarthrosis, degenerative arthritis, degenerative joint disease, hypertrophic osteoarthritis, and cervical spine degenerative disease; our feature list for “symptom” included evidence, history, sign, and symptom. See Appendix 1, available as an online data supplement at http://www.jamia.org, for a listing of all disease features.
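The feature and feature-list nomenclature above maps naturally onto a dictionary of string lists. The sketch below is an illustrative rendering in Python, not ClinREAD's internal representation; the "osteoarthritis" and "symptom" entries are taken from the text, and the lookup function is hypothetical.

```python
import re

# Illustrative representation of features and feature lists; the
# strings for "osteoarthritis" and "symptom" are those given in the
# text. find_features() is a hypothetical helper, not ClinREAD code.

FEATURE_LISTS = {
    "osteoarthritis": [
        "arthritis", "DJD", "OA", "osteoarthritis", "osteoarthrosis",
        "degenerative arthritis", "degenerative joint disease",
        "hypertrophic osteoarthritis", "cervical spine degenerative disease",
    ],
    "symptom": ["evidence", "history", "sign", "symptom"],
}

def find_features(text, concept):
    """Return each feature of `concept` found in `text`, matching on
    word boundaries, case-insensitively."""
    return [
        f for f in FEATURE_LISTS[concept]
        if re.search(r"\b" + re.escape(f) + r"\b", text, re.IGNORECASE)
    ]
```

Matching on word boundaries rather than raw substrings avoids, for example, counting the "OA" inside an unrelated token as a disease mention.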
An element is a pattern of feature lists and/or literal strings. One such element is composed of a feature list followed by a set of “and”, “or”, and punctuation that defines a list. This element is then used in a rule for matching lists of diseases such as “… patient has asthma, OA and CHF.” Elements that can be expected to behave similarly in context, such as all disease types, are grouped together into element lists.
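An element of the kind just described, one that matches a comma/conjunction-separated list of diseases, can be approximated with a regular expression. The sketch below is a hedged illustration under an abridged feature set; ClinREAD's actual element machinery is not regex-based code shown here.

```python
import re

# Hypothetical regex sketch of an "element" matching a list of disease
# features separated by commas and/or "and"/"or", as in
# "... patient has asthma, OA and CHF." The feature set is abridged.

DISEASE_FEATURES = ["asthma", "OA", "CHF", "GERD", "hypertension"]

_DISEASE = "(?:" + "|".join(map(re.escape, DISEASE_FEATURES)) + ")"
# One disease, optionally followed by more, with comma and/or
# conjunction separators between items.
DISEASE_LIST = re.compile(
    _DISEASE + r"(?:\s*(?:,|,?\s+(?:and|or))\s*" + _DISEASE + r")*"
)

def match_disease_list(sentence):
    """Return the longest disease list found in the sentence, or None."""
    m = DISEASE_LIST.search(sentence)
    return m.group(0) if m else None
```

Grouping such patterns into element lists then lets one rule cover every disease list, rather than writing a separate rule per disease.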
A rule defines a pattern composed of feature lists, element lists, and/or literal text strings, and instructs the system to take a specific action when that pattern is found in the text. In rule-based systems, the action is typically to create a data structure recording the information the rule was designed to collect; features, elements, and rules thus combine to trigger actions. Other factors, including context and the presence of conflicting information, were also weighed to reach valid conclusions. We assigned simple symbolic strength values to judgments (weak, normal, or strong) based on the particular document section in which the evidence was found and the surrounding local context. For example, when making textual judgments, a rule specified that the pattern “her OA” yielded a strong judgment of Y for that record. Simply finding the word “asthma” generated a weak judgment of Y for asthma, while the phrase “? asthma” generated a normal-strength Q judgment. We then used a logic module to weigh the evidence and decide on the final judgment.
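The strength-weighted decision just described can be sketched as a tiny evidence-aggregation loop. The toy rules, numeric strength values, and tie-breaking behavior below are illustrative stand-ins; the actual ClinREAD logic module is not public.

```python
# Simplified sketch of evidence collection and weighing. The rules,
# numeric strengths, and decision logic are illustrative assumptions,
# not the ClinREAD logic module itself.

STRENGTH = {"weak": 1, "normal": 2, "strong": 3}

def collect_evidence(record, disease="asthma"):
    """Apply a few toy rules, returning (judgment, strength) pairs."""
    evidence = []
    if f"? {disease}" in record or f"?{disease}" in record:
        evidence.append(("Q", "normal"))   # "? asthma" -> normal-strength Q
    elif disease in record:
        evidence.append(("Y", "weak"))     # bare mention -> weak Y
    return evidence

def final_judgment(evidence, default="U"):
    """Pick the judgment backed by the strongest total evidence,
    falling back to the default when no evidence was found."""
    if not evidence:
        return default
    totals = {}
    for judgment, strength in evidence:
        totals[judgment] = totals.get(judgment, 0) + STRENGTH[strength]
    return max(totals, key=totals.get)
```

A record with no mention of a disease thus falls through to the default U, matching the defaulting strategy described earlier.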
The metrics used to evaluate participating systems were recall, precision, and F-measure. Recall is the percentage of the correct answers that a system finds. Precision is the percentage of a system's answers that are correct. F-measure is a weighted harmonic mean of these scores. 6 Due to the uneven distribution of judgments of each type (Y, N, Q, U), a macroaveraged F-measure was used as the primary ranking metric, with a microaveraged F-measure as secondary. In effect, the macroaveraged F-measure is an unweighted average of the F-measures achieved for each judgment type. (Please see Table 2 in the unabridged online version at http://www.jamia.org for the confusion matrices with accompanying recall and precision scores of our results.)
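The difference between the two averages can be made concrete with a short computation over per-class counts. This sketch assumes per-class true-positive, false-positive, and false-negative counts are available; it is a standard textbook formulation, not Challenge scoring code.

```python
# Sketch of macro- vs micro-averaged F-measure over the judgment
# classes (Y, N, Q, U), given per-class (tp, fp, fn) counts.

def f_measure(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f(counts):
    """Unweighted mean of per-class F-measures: rare classes such as Q
    count equally with common ones such as U."""
    return sum(f_measure(*c) for c in counts.values()) / len(counts)

def micro_f(counts):
    """F-measure over pooled counts: dominated by the common classes."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f_measure(tp, fp, fn)
```

Because U judgments dominate the data, a system can score a high microaveraged F while its macroaveraged F is dragged down by poor performance on the rare Q and N classes, which is why effort concentrated there.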
Our best-performing system ranked third in the intuitive category and fourth in the textual category; however, in terms of statistical significance, it tied for first place for intuitive judgments and second place for textual judgments. 7 The system processed the 507 evaluation discharge records in 14 seconds on a commercially available 2.0 GHz laptop with 2.0 GB of RAM; that corresponds to about 130,000 records per hour.
Most of the development time was devoted to improving system performance on textual Q and N judgments, because a relatively small improvement there would have a substantial effect on macroaveraged precision and recall. The “questionable” signals that system rules looked for included “? [disease element],” “questionable [disease element],” and “question of [disease element].” Some Q signals that we did not include, because of the possibility of false positives, were “possible [disease element],” “possible h/o [disease element],” and “might have [disease element]”; post-Challenge analysis indicated that we should have included them. In some cases, our manual review of records uncovered inconsistency in the judging, which made rule writing difficult. For example, the term “borderline” meant different things for different diseases: “borderline hypertension” was rated a Y for hypertension, while “borderline diabetes” was sometimes a Q and sometimes a Y, even in the absence of any other textual mention of diabetes.
Our system for making intuitive judgments depended heavily on making the correct textual judgment, because our intuitive answers were largely determined by a default pattern derived from the textual answers. After we noticed changing ground truth patterns in documents numbered 500 and above, we chose to modify the intuitive default responses for some diseases when specific drug names were present. This strategy gave the system reasonably good F-measures for intuitive Y and N, but not Q. If we had applied the intuitive rules to all documents regardless of document number, however, the system's overall macroaveraged intuitive F-measure would have been slightly higher. We also applied special obesity rules to documents numbered below 60; specifically, the system accepted “ADA 1800” as evidence of a textual Y for obesity, even though the judges apparently abandoned this criterion after about Document 60. This strategy improved our overall scores slightly.
Although not the primary measure of success in the competition, the microaveraged F scores of 0.98 textual and 0.96 intuitive reflect a system capability that could be useful to an organization wishing to find similar information in a large body of patient records.
Our system is similar to an expert system in both form and function. Knowledge engineers capture domain expertise, in the form of an extraction rule base and decision rules, within a knowledge base that is applied by a domain-independent text-processing engine. The system functions in a data-driven fashion, gathering the evidence it finds in text and then applying decision logic to reach a conclusion; it might be thought of as a tightly focused cousin of an expert system. Unlike the A-Life system 8 in the 2006 challenge, which has a separate expert system module, our system's domain expertise and logic are integrated throughout processing.
The knowledge engineering required to build a rule-based system is sometimes seen as an obstacle to fielding an operational system. Our team collectively spent approximately 300 person-hours training our system by iteratively developing rules and testing their effect. This was a substantial—but not Herculean—investment of resources. We believe, however, that our system could readily be scaled up to handle a much larger number of diseases with similar precision and recall. For its 2007 NLP Challenge 9 system, Lockheed Martin substantially reduced the knowledge engineering labor by applying Perl scripts to the online ICD-9-CM documentation to acquire the relevant terms. We believe that similar methods could be applied to acquire the concepts and synonyms already compiled in the National Library of Medicine's Unified Medical Language System, allowing us to extend our existing knowledge base to other diseases. Moreover, because human domain experts are intimately involved in building the system, our approach would not necessarily require a large body of marked-up text on which to train the system. Many systems in the 2008 challenge took advantage of Chapman's NegEx 10 algorithm; use of that algorithm might improve our system as well.
Our results suggest that rule-based systems could play a larger role in overcoming real-world problems in medical language processing. One very promising area, for example, would be to apply NLP to chart review for retrospective epidemiological studies. 11,12
The authors thank the organizers of the NLP Challenge, Informatics for Integrating Biology and the Bedside (i2b2), a National Center for Biomedical Computing. In particular, the authors thank Dr. Özlem Uzuner for her patient attention to detail as she organized and administered the NLP Challenge, the Workshop (including fruitful interactions with other participants), and all follow-up activity. The authors also thank the JAMIA reviewers for helpful suggestions on article content.
aClinREAD uses the commercial NLP development environment, Rocket AeroText. See http://www.rocketsoftware.com/products/rocket-aerotext.