Research Corpus
The research corpus was supplied by the NLP Challenge organizers. It consisted of a training set of approximately 700 de-identified discharge summaries, each abstracted by two physician experts for the presence or absence of 16 different patient conditions (obesity, diabetes mellitus, etc.). The abstractions were further broken down into “textual” and “intuitive” judgments. The textual judgment recorded whether the patient condition was directly mentioned in the text of the discharge summary, and each textual score had four possible values: Yes, No, Questionable, and Unknown. The intuitive judgment identified those patient conditions that may not have been directly mentioned, but whose presence or absence could be inferred from the text of the document. Each intuitive score had three possible values: Yes, No, and Unknown. The training dataset that accompanied the 700 discharge summaries therefore contained roughly 16 × 2 × 700 = 22,400 individual data points.
The challenge organizers focused on obesity and its co-morbidities.
Overview of Procedure
Documents were first preprocessed to remove unnecessary elements. For each principal concept, the document was then evaluated textually in a three-step process, both for the concept itself and for any secondary concepts. First, each concept was searched for by regular expression (explicit textual). Second, for each match found, the neighborhood of the match was examined for context. Third, the contextualized matches were brought together in an evaluation for that concept. Once all the principal and secondary concepts had been evaluated, a determination was made for the principal concept.
Preprocessing
Portions of the documents that were not germane to the task at hand were removed.
Many of the documents contained a family history section. While these sections did occasionally contain information about the patient's current condition, they were mainly focused on the conditions of other family members. In early trials, we determined that these mentions of familial conditions confused the bag-of-words-based context finder. Therefore, this section was removed in a preprocessing step.
Some of the documents contained text identifying an override of the computerized clinical decision alert along with the responsible physician. An example is shown below:
- ECASA (ASPIRIN ENTERIC COATED) 325 MG PO QD
- Override Notice: Override added on 11/9/01 by
- FUDD, ELMER J., M.D.
- on order for COUMADIN PO (ref # 00944322)
- POTENTIALLY SERIOUS INTERACTION: ASPIRIN & WARFARIN
- Reason for override: md aware
These medicolegal snippets sometimes included references to medications the patient was not taking. In addition, these sections confused human readers. Consequently, they were removed in the preprocessing step.
Finally, when the first line of the document contained an admitting diagnosis, that line was removed, because it often represented a working hypothesis rather than a confirmed diagnosis. Admitting diagnosis sections elsewhere in the document were not removed.
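The three removal steps described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the header patterns, the assumption that a family history section runs until the next upper-case header, and the assumption that an override block runs from its "Override Notice:" line through its "Reason for override:" line are all hypothetical choices made for this example.

```python
import re

# Assumption: a section header is an upper-case phrase at line start followed by a colon.
HEADER = re.compile(r"^[A-Z][A-Z /]+:")
FAMILY_HISTORY = re.compile(r"^\s*FAMILY HISTORY:", re.IGNORECASE)
OVERRIDE_START = re.compile(r"^\s*Override Notice:", re.IGNORECASE)
OVERRIDE_END = re.compile(r"^\s*Reason for override:", re.IGNORECASE)
ADMIT_DX = re.compile(r"^\s*ADMITTING DIAGNOSIS", re.IGNORECASE)


def preprocess(text):
    """Drop a leading admitting-diagnosis line, family history, and override notices."""
    lines = text.splitlines()
    if lines and ADMIT_DX.match(lines[0]):
        lines = lines[1:]  # only the first line is dropped, not later sections
    kept = []
    in_family = False
    in_override = False
    for line in lines:
        if in_override:
            if OVERRIDE_END.match(line):
                in_override = False  # the closing line itself is also dropped
            continue
        if OVERRIDE_START.match(line):
            in_override = True
            continue
        if FAMILY_HISTORY.match(line):
            in_family = True
            continue
        if in_family:
            if HEADER.match(line):
                in_family = False  # next section starts here; keep this line
            else:
                continue
        kept.append(line)
    return "\n".join(kept)
```

Note that the medication line preceding an override notice is kept; only the notice itself is removed.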
Feature Extraction
For feature selection we used a combination of guided and manual methods.
Most features were identified by medical relevance. We attempted to first answer the question “why did the physician grade the discharge summary this way?” and then to mimic that process.
Medical concepts were linked to the original 16 whenever they were suspected of affecting an annotation. For example, when it was conjectured that a patient's history of a myocardial infarction was used by the physician evaluators to establish a diagnosis of CAD, myocardial infarction was added as a concept, linked to CAD.
The most common class of secondary concepts was medications. Medications were added semantically, with brand names, generic names, and common abbreviations. The concept was generally at the level of a drug class, such as steroids, non-steroidal anti-inflammatory drugs (NSAIDs), ACE inhibitors, loop diuretics, thiazide diuretics, β blockers, and antidepressants. A few concepts were more specific, such as nitroglycerin or albuterol. For example, use of inhaled albuterol suggests the diagnosis of asthma.
Treatments were also used as secondary concepts. For instance, gastric bypass surgery suggested obesity, and pressure stockings indicated venous insufficiency.
Additionally, word bigrams were considered, weighted by inverse document frequency. This surfaced a few phrases, such as “GI Bleed” in a GERD context.
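How the bigrams were scored is not detailed here. As one plausible reading, the sketch below computes a plain inverse document frequency, idf = log(N / df), for every word bigram in a collection of documents. The tokenization and the function names `bigrams` and `bigram_idf` are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Return the set of adjacent lower-cased word pairs in a document."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return set(zip(words, words[1:]))


def bigram_idf(documents):
    """Return {bigram: log(N / document frequency)} over the given documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(bigrams(doc))  # a set, so each bigram counts once per document
    return {bg: math.log(n_docs / df) for bg, df in doc_freq.items()}
```

Bigrams with a high value are rare across the collection, so a rare bigram that recurs in documents positive for one condition is a candidate feature for that condition.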
Each concept had a synonym list. For example, the list of synonyms for albuterol was “salbutamol”, “albuterol”, “ventolin”, “proventil”, and “proair”. For medications, we used the Apelon terminology engine to provide these synonym sets. For a given drug class, this list might contain hundreds of synonyms.
Numerical features were also supported. For example, obesity could be inferred from a numeric BMI, from a numeric weight and height, or from a numeric weight alone (e.g., >90 kg). Many other numerical features were recognized, including ejection fraction, lipid panel results, and hemoglobin A1c values.
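The obesity example above can be sketched as follows. The 90 kg weight cutoff comes from the text; the BMI cutoff of 30 kg/m² is the conventional definition of obesity and is assumed here, as are the regular expressions, the metric units (kg, cm), and the function names.

```python
import re

NUM = r"(\d+(?:\.\d+)?)"
# Assumption: the number follows the keyword within a few non-digit characters.
BMI_RE = re.compile(r"\bBMI\b\D{0,15}?" + NUM, re.IGNORECASE)
WEIGHT_RE = re.compile(r"\bweight\b\D{0,15}?" + NUM + r"\s*kg\b", re.IGNORECASE)
HEIGHT_RE = re.compile(r"\bheight\b\D{0,15}?" + NUM + r"\s*cm\b", re.IGNORECASE)


def first_number(pattern, text):
    """Return the first captured number as a float, or None if absent."""
    m = pattern.search(text)
    return float(m.group(1)) if m else None


def obesity_from_numbers(text, bmi_cutoff=30.0, weight_cutoff_kg=90.0):
    """Return True/False when a number allows a call, None otherwise."""
    bmi = first_number(BMI_RE, text)
    weight = first_number(WEIGHT_RE, text)
    height_cm = first_number(HEIGHT_RE, text)
    if bmi is None and weight is not None and height_cm:
        bmi = weight / (height_cm / 100.0) ** 2  # kg / m^2
    if bmi is not None:
        return bmi >= bmi_cutoff
    if weight is not None:
        return weight > weight_cutoff_kg  # weight-alone fallback
    return None
```

A stated or computed BMI takes priority, and weight alone is used only when no BMI is available. Real discharge summaries also use pounds and inches, which this sketch does not handle.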
Synonym lists were converted to regular expressions for matching against the clinical documents. The regular expressions were generally case insensitive and matched whole words, but these features could be overridden. For example, the synonym list for myocardial infarction was “MI”, “ami”, “imi”, “myocardial infarc”, “septal infarc”, and “heart attack”.
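A minimal sketch of this conversion, using Python's `re` module, is shown below. The function name and the two keyword parameters standing in for the overridable defaults are hypothetical; the text does not say how the overrides were expressed.

```python
import re


def synonyms_to_regex(synonyms, ignore_case=True, whole_word=True):
    """Compile a synonym list into one alternation pattern."""
    # Longest alternatives first, so a longer synonym wins over a shorter prefix.
    alternatives = sorted((re.escape(s) for s in synonyms), key=len, reverse=True)
    body = "(?:" + "|".join(alternatives) + ")"
    if whole_word:
        body = r"\b" + body + r"\b"
    return re.compile(body, re.IGNORECASE if ignore_case else 0)


albuterol = synonyms_to_regex(
    ["salbutamol", "albuterol", "ventolin", "proventil", "proair"]
)
# Stems such as "myocardial infarc" only match longer words when whole-word
# matching is switched off.
mi_stems = synonyms_to_regex(["myocardial infarc", "septal infarc"], whole_word=False)
```

With these defaults, `albuterol` matches "Ventolin" in any capitalization but not a longer token that merely contains a synonym, while `mi_stems` matches "myocardial infarction".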
Concept Context
When a concept match was found, the neighborhood around the concept was examined. This approach follows NegEx and related work by Chapman et al. [17,18] The examination looked, in a fixed order, for the presence of key phrases within the neighborhood. Our implementation differed from Chapman's in two important ways. First, we processed at the level of characters rather than words, using regular expressions alone without preliminary lexing. Second, we used a neighborhood bounded by any punctuation mark, rather than a five-word window.
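The punctuation-bounded neighborhood can be sketched as below. The exact set of characters treated as punctuation is an assumption; the slash is deliberately left out here so that markers such as "h/o" stay intact.

```python
import re

# Assumption: these characters delimit a neighborhood.
PUNCTUATION = re.compile(r"[.,;:!?()\[\]]")


def neighborhood(text, start, end):
    """Return the text around text[start:end], cut at the nearest punctuation on each side."""
    left = 0
    for m in PUNCTUATION.finditer(text, 0, start):
        left = m.end()  # keep the last punctuation mark before the match
    right_match = PUNCTUATION.search(text, end)
    right = right_match.start() if right_match else len(text)
    return text[left:right].strip()
```

For a match on "chest pain" in "Lungs clear, patient denies chest pain, no edema.", this returns "patient denies chest pain". Because it works on characters, a decimal point inside a number would also end the neighborhood, which a fuller implementation would need to handle.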
The first phrases searched for were obliterators. If an obliterator from a concept's obliterator list was found, the match was determined to be not a match at all. For example, for the concept osteoarthritis, the phrase “arthritis” was taken as a possible match. However, the phrase “rheumatoid arthritis” was used as an obliterator as it describes a different disease from osteoarthritis.
The next phrases searched for were pseudonegators, such as “no further …”. Pseudonegators look like negations but are not. Then, we looked for hypothetics, such as “evaluate for.” These indicate that the condition is possible but not certain. Next, history markers were considered. These are strings like “h/o” (meaning “history of”) that indicate the temporality of the condition. Finally, plain negators were sought, such as “denies” or “not signs of.”
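The ordered search described above can be sketched as follows. Only the quoted phrases come from the text; the bare negator "no", the decision to blank out pseudonegators before later checks, and the first-match-wins outcome are assumptions made for this example.

```python
import re
from enum import Enum


class Context(Enum):
    OBLITERATED = "obliterated"
    AFFIRMED = "affirmed"
    HYPOTHETICAL = "hypothetical"
    HISTORICAL = "historical"
    NEGATED = "negated"


# Illustrative phrase lists, written as regular expressions.
PSEUDONEGATORS = [r"\bno further\b"]
HYPOTHETICS = [r"\bevaluate for\b"]
HISTORY_MARKERS = [r"\bh/o\b"]
NEGATORS = [r"\bdenies\b", r"\bnot signs of\b", r"\bno\b"]  # "no" is an added assumption


def _find(patterns, text):
    return any(re.search(p, text, re.IGNORECASE) for p in patterns)


def classify(neighborhood, obliterators=()):
    """Classify one concept match from the text of its neighborhood."""
    if _find(obliterators, neighborhood):
        return Context.OBLITERATED  # not a match for this concept at all
    # Blank out pseudonegators so their embedded "no" cannot fire as a negator.
    for p in PSEUDONEGATORS:
        neighborhood = re.sub(p, " ", neighborhood, flags=re.IGNORECASE)
    if _find(HYPOTHETICS, neighborhood):
        return Context.HYPOTHETICAL
    if _find(HISTORY_MARKERS, neighborhood):
        return Context.HISTORICAL
    if _find(NEGATORS, neighborhood):
        return Context.NEGATED
    return Context.AFFIRMED
```

For the osteoarthritis example, `classify("rheumatoid arthritis", obliterators=[r"rheumatoid arthritis"])` yields `Context.OBLITERATED`, while "no further chest pain" is affirmed rather than negated.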
Rules
We developed and inserted sets of rules to assist with the intuitive determinations. These rules were framed in a domain-specific language (DSL) written in the Scala programming language.
In the development of the rules, there was always a tension between what seemed to achieve better agreement with the annotators in the training set and what was more medically correct. Following the annotators too closely can lead to overfitting, that is, matching noise rather than signal. By contrast, a rule that does not at first seem medically sound may serve as a proxy for some real condition of the patient or thought process of the annotator.
An illustrative example is congestive heart failure (CHF). When the concept is not directly mentioned in the text, several ways of inferring the patient's status were used.
This condition is unequivocally indicated by a low numeric ejection fraction; we took 50% as a threshold. Most patients with CHF will be treated with ACE inhibitors; however, ACE inhibitors are also used to treat other conditions such as hypertension. In treating hypertension, ACE inhibitors are often prescribed with a thiazide diuretic; this combination is rarer for CHF. So an ACE inhibitor without a thiazide diuretic is a possible treatment for CHF, but we would also like to see a finding associated with CHF. The findings we identified were a history of heart transplant, the presence of pulmonary edema, or a high numeric wedge pressure. Sample rules described in the DSL are shown below.
- val chfMed = hasAce and !hasThiazide;
- val chfSymptom = hasHeartTransplant |
-     hasPulmonaryEdema |
-     hasHighWedgePressure;
- val chfRules = hasLowEjectionFraction |
-     (chfMed and chfSymptom);
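To show how such composable rules can be evaluated, the sketch below mirrors the CHF rules in Python rather than in the authors' Scala DSL, whose internals are not described here. The `Rule` class, the `has` helper, the finding keys, and the strict "below 50%" reading of the ejection fraction threshold are all assumptions made for illustration.

```python
class Rule:
    """A boolean test over a dict of patient findings, composable with &, | and ~."""

    def __init__(self, test):
        self.test = test

    def __call__(self, findings):
        return bool(self.test(findings))

    def __and__(self, other):
        return Rule(lambda f: self(f) and other(f))

    def __or__(self, other):
        return Rule(lambda f: self(f) or other(f))

    def __invert__(self):
        return Rule(lambda f: not self(f))


def has(key):
    """Rule that is true when the named finding is present and truthy."""
    return Rule(lambda f: f.get(key, False))


LOW_EF_THRESHOLD = 50  # percent, the threshold quoted in the text

low_ef = Rule(
    lambda f: f.get("ejection_fraction") is not None
    and f["ejection_fraction"] < LOW_EF_THRESHOLD
)
chf_med = has("ace_inhibitor") & ~has("thiazide")
chf_symptom = (
    has("heart_transplant") | has("pulmonary_edema") | has("high_wedge_pressure")
)
chf_rules = low_ef | (chf_med & chf_symptom)
```

For example, `chf_rules({"ace_inhibitor": True, "pulmonary_edema": True})` is true, while adding `"thiazide": True` to the same findings makes it false unless the ejection fraction is also low.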
Technologies and Support Systems Used
Expert and rule-based systems have a long history in healthcare, going back to the early efforts with Mycin [19,20] and Internist [21]. There are several commercial and open source solutions that implement forward chaining, backward chaining, or even blackboard systems [22]. We chose to implement our solution from scratch, as our goals were more experimental. The difficult part of building the system was developing the feature set; implementing the inference logic was more straightforward. We relied on the Apelon terminology environment to determine relevant medications for the conditions used in the challenge [23].
Evaluation Metrics Used
We used Cohen's kappa metric to measure the progress of our evolving solution against the physician gold-standard annotations on the training set of documents. The kappa statistic for two-rater agreement measures agreement beyond that expected by chance. The main reason for selecting this metric was that a good number of answers could simply be guessed correctly; that is, “the person does not suffer from X” will be mostly true and sometimes overwhelmingly true [24]. Use of this measure also allowed us to focus on optimizing a single metric, as opposed to a suite of metrics.
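Cohen's kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below computes it for two label sequences; the function name and the choice to return 1.0 in the degenerate single-label case are assumptions made here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

This illustrates the point made above: if the gold standard is "No" for all ten documents and the system answers "No" nine times, raw agreement is 90% but kappa is 0, because that level of agreement is exactly what chance would produce.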
The NLP challenge organizers provided a utility to measure six metrics: micro and macro versions of precision, recall, and F-measure. These measures are reported in the results section.
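The organizers' utility is not reproduced here. As a sketch of the standard definitions only, the code below computes micro-averaged scores from pooled counts and macro-averaged scores as the unweighted mean of per-class scores. The function names are hypothetical, and the utility may differ in detail (for instance, some definitions compute macro F from macro precision and recall instead of averaging per-class F).

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_macro(gold, predicted):
    """Return micro and macro (precision, recall, F1) for single-label data."""
    classes = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    per_class = [prf(tp[c], fp[c], fn[c]) for c in classes]
    macro = tuple(sum(s[i] for s in per_class) / len(classes) for i in range(3))
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {"micro": micro, "macro": macro}
```

Macro averaging gives rare classes such as Questionable the same weight as common ones, whereas micro averaging is dominated by the most frequent class.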