The NLP task in the 2008 i2b2 Obesity Challenge involved classifying medical discharge summaries into several judgment categories for obesity and 15 associated co-morbidities. Two types of judgments were provided: textual judgments based on explicit indicators for the morbidities in the documents, and intuitive judgments based on what was implied about the morbidities in the documents. Teams had the option of classifying the documents based on either type of judgment or both.
Data for the challenge were released in two sets: (1) a training set of 730 annotated discharge summaries for development and training purposes, and (2) a test set of 507 discharge summaries without annotations for the evaluation. In the textual training set, the Y and U categories were well represented with tens or hundreds of documents for most morbidities. In contrast, the N and Q categories contained fewer training documents, ranging from zero to 23 for each morbidity. Due to concerns that the limited amount of training data in two of the four judgment categories would substantially hinder the effectiveness of a machine learning algorithm, we opted to use a rule-based approach.
Our team chose to participate in the classification task based on textual judgments since our approach looked for keywords directly related to each morbidity. The textual judgments consisted of the following four categories: (a) the morbidity is “present” (labeled “Y” for “yes”), (b) the morbidity is “absent” (labeled “N” for “no”), (c) occurrence of the morbidity is “questionable” (labeled “Q”), or (d) the morbidity is “unmentioned” (labeled “U”).
Our approach involved three steps:
1. text preprocessing
2. identification of keyword occurrences and associated assertion types
3. document scoring and classification
The text preprocessing step was done by a custom script written in the Perl programming language. The script performed text clean-up and modification to improve the effectiveness of the keyword identification step. Text modifications were made using regular expression pattern matching and substitution. The preprocessing script made the following changes to each discharge summary: (a) removed the “Family History” section since text here is not relevant to the patient's current condition, (b) changed question marks (“?”) to the word “questionable” to improve assertion type detection, and (c) changed commas (“,”) to periods (“.”) to restrict assertion modifiers to the most immediate terms they modify. A more detailed discussion of these modifications is available in a JAMIA online data supplement at http://www.jamia.org.
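The three preprocessing substitutions can be illustrated with a minimal sketch. It is written in Python rather than the Perl of the original script, and the section-header convention (upper-case header lines ending in a colon), the function name `preprocess`, and the whitespace clean-up are assumptions made for illustration, not details taken from the challenge data or the original code.

```python
import re

# Assumption: a section starts with an upper-case header line ending in ":".
# The "Family History" header itself is matched case-insensitively.
FAMILY_HISTORY = re.compile(
    r"^(?i:family history):.*?(?=^[A-Z][A-Z ]+:|\Z)",
    re.MULTILINE | re.DOTALL,
)


def preprocess(text: str) -> str:
    """Apply the three text modifications described above to one summary."""
    # (a) Remove the Family History section, up to the next header or end of text.
    text = FAMILY_HISTORY.sub("", text)
    # (b) Spell out question marks so they can act as assertion terms.
    text = text.replace("?", " questionable ")
    # (c) Turn commas into periods to limit the reach of assertion modifiers.
    text = text.replace(",", ".")
    # Illustrative clean-up only: collapse the extra spaces introduced in (b).
    return re.sub(r"[ \t]{2,}", " ", text)


sample = (
    "HISTORY OF PRESENT ILLNESS:\n"
    "Patient with ?CAD, obesity, and HTN.\n"
    "FAMILY HISTORY:\n"
    "Mother had diabetes.\n"
    "MEDICATIONS:\n"
    "Aspirin, insulin.\n"
)
cleaned = preprocess(sample)
```

In this sketch the order of the steps matters only in that the section removal runs first, so that text in the removed section is never rewritten.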
In the keyword identification step, the discharge summaries were examined for the occurrence of keywords associated with each morbidity, along with the type of assertion in which each keyword occurred. The possible assertion types were (a) positive (e.g., “Diabetes: diet controlled”), (b) negative (e.g., “no significant CAD”), and (c) questionable (e.g., “borderline HTN”). This step used the NegEx negation detection application developed by Chapman et al. [10]. A key NegEx component is its dictionary of clinical terms and several types of negation phrases. We made substantial customizations to the NegEx dictionary to tailor it to this classification task. In particular, the default list of clinical terms was completely replaced with a custom list of keywords associated with the morbidities targeted in this task. In addition, the list of conditional possibility terms (e.g., “no history of”, “might be ruled out for”) was repurposed to store terms pertaining to other family members (e.g., “maternal”, “father”, “cousin”) so keywords identified with this code could be ignored in the document scoring and classification step. A more detailed discussion of these modifications is available in a JAMIA online data supplement at http://www.jamia.org.
NegEx used the terms in its dictionary for identifying morbidities and assertion types in the discharge summaries. The NegEx output consisted of modified discharge summaries containing markup around any identified keywords. This markup indicated the assertion type in which each keyword occurred.
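To make the idea of dictionary-driven assertion tagging concrete, the following Python sketch labels each keyword hit according to trigger terms that precede it in the same sentence. It is not NegEx and does not reproduce its algorithm, dictionary format, or output markup; the tiny term lists, the `family` label, and the function name `tag_assertions` are all hypothetical stand-ins.

```python
import re

# Hypothetical miniature dictionaries; the customized NegEx lists were far larger.
KEYWORDS = {
    "CAD": ["cad", "coronary artery disease"],
    "Hypertension": ["htn", "hypertension"],
    "Diabetes": ["diabetes"],
}
NEGATION = ["no", "denies", "without"]
QUESTIONABLE = ["questionable", "borderline", "possible"]
FAMILY = ["mother", "father", "maternal", "paternal", "cousin"]


def _has_term(terms, text):
    """True if any term occurs in text as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(t)}\b", text) for t in terms)


def tag_assertions(text):
    """Return (morbidity, assertion_type) pairs, one per keyword occurrence."""
    hits = []
    # Sentences end at periods or newlines, so modifiers cannot cross them.
    for sentence in re.split(r"[.\n]+", text.lower()):
        for morbidity, keywords in KEYWORDS.items():
            for kw in keywords:
                for m in re.finditer(rf"\b{re.escape(kw)}\b", sentence):
                    before = sentence[: m.start()]
                    if _has_term(FAMILY, before):
                        label = "family"  # to be ignored during scoring
                    elif _has_term(NEGATION, before):
                        label = "negative"
                    elif _has_term(QUESTIONABLE, before):
                        label = "questionable"
                    else:
                        label = "positive"
                    hits.append((morbidity, label))
    return hits


example_hits = tag_assertions(
    "Diabetes: diet controlled. no significant CAD. "
    "borderline HTN. Mother had diabetes."
)
```

Because sentences are split at periods, a trigger such as “no” affects only keywords in its own sentence, which is the effect the comma-to-period substitution in preprocessing was meant to achieve.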
The output of the NegEx algorithm was passed through the document scoring and classification step. This step was performed by a custom Perl script that calculated the total number of positive, negative, and questionable assertions for each morbidity in each discharge summary. For each document, this resulted in three “scores” for each morbidity, one for each assertion type. The scoring process ignored any keyword occurrences pertaining to other family members as indicated by the terms assigned to the conditional possibility codes in the NegEx dictionary, as well as any keywords identified with a special code indicating keywords to be ignored. The discharge summaries were then assigned to a judgment category for each morbidity based on the assertion type with the highest total. Positive assertions corresponded to the “Y” judgment, negative assertions corresponded to the “N” judgment, questionable assertions corresponded to the “Q” judgment, and the absence of keyword occurrences corresponded to the “U” judgment.
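The scoring and classification logic can be sketched as follows, in Python rather than the original Perl. The input format (a list of `(morbidity, assertion_type)` pairs) is an assumed simplification of the NegEx markup, which is not shown in the text, and the function names are hypothetical. Ties are only flagged here (returned as `None`); treating a morbidity whose only hits were ignored as “U” is likewise an assumption.

```python
from collections import Counter

# Assertion type -> textual judgment, as described in the text.
JUDGMENT = {"positive": "Y", "negative": "N", "questionable": "Q"}


def score_document(hits):
    """Count assertions per morbidity; labels outside JUDGMENT are skipped."""
    scores = {}
    for morbidity, label in hits:
        if label in JUDGMENT:  # drops family-member and other ignored hits
            scores.setdefault(morbidity, Counter())[label] += 1
    return scores


def classify_document(hits, morbidities):
    """Map each morbidity to Y/N/Q/U, or None when the top score is tied."""
    scores = score_document(hits)
    result = {}
    for morbidity in morbidities:
        counts = scores.get(morbidity)
        if not counts:
            result[morbidity] = "U"  # no countable keyword occurrences
            continue
        top = max(counts.values())
        winners = [a for a in JUDGMENT if counts[a] == top]
        result[morbidity] = JUDGMENT[winners[0]] if len(winners) == 1 else None
    return result


hits = [
    ("Diabetes", "positive"),
    ("Diabetes", "positive"),
    ("Diabetes", "negative"),
    ("CAD", "negative"),
    ("Asthma", "family"),
    ("Hypertension", "positive"),
    ("Hypertension", "questionable"),
]
judgments = classify_document(
    hits, ["Diabetes", "CAD", "Asthma", "Gout", "Hypertension"]
)
```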
For ties between several assertion types, three different tie-breaking rules were developed: (a) positive-weighted tie-breaking, in which ties between positive and non-positive assertions resulted in a “Y” judgment, (b) negative-weighted tie-breaking, in which ties between negative and non-negative assertions resulted in an “N” judgment, and (c) questionable-weighted tie-breaking, in which ties between questionable and non-questionable assertions resulted in a “Q” judgment (please see Tables 1, 2, and 3, available in a JAMIA online data supplement at http://www.jamia.org).
In scenarios in which the weighted assertion type did not participate, we made an arbitrary rule that the weighted judgment could not win the tie-breaker and we would default to the least positive judgment available. For example, in the positive-weighted scenario in which a tie exists between negative and questionable assertions and the positive assertion type is not involved, the resulting tie-breaker judgment is “N.” The same outcome applies in the questionable-weighted scenario in which a tie exists between positive and negative assertions. In the negative-weighted scenario, a tie between positive and questionable assertions results in a tie-breaker judgment of “Q” since it is the least positive of the non-weighted judgments.
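The tie-breaking rules described above reduce to a short function, sketched here in Python. The ordering from least to most positive (N, then Q, then Y) is inferred from the examples in the text, and the function name `break_tie` and its set-based input are hypothetical; the supplementary tables remain the authoritative statement of the rules.

```python
# Assertion type -> textual judgment, as described in the text.
JUDGMENT = {"positive": "Y", "negative": "N", "questionable": "Q"}

# Inferred from the examples in the text: N is least positive, then Q, then Y.
LEAST_POSITIVE_FIRST = ["negative", "questionable", "positive"]


def break_tie(tied, weighted):
    """Resolve a tie among assertion types under one weighting scheme.

    tied     -- set of assertion types sharing the highest score
    weighted -- the assertion type favored by the scheme in use
    """
    if weighted not in JUDGMENT:
        raise ValueError(f"unknown weighting: {weighted!r}")
    # The weighted type wins whenever it takes part in the tie.
    if weighted in tied:
        return JUDGMENT[weighted]
    # Otherwise fall back to the least positive judgment among the tied types.
    for assertion in LEAST_POSITIVE_FIRST:
        if assertion in tied:
            return JUDGMENT[assertion]
    raise ValueError("no known assertion types in the tie")
```

For instance, `break_tie({"negative", "questionable"}, "positive")` follows the fallback branch, matching the positive-weighted example in the text.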
The initial training set for the challenge was released in mid-March 2008, giving us several months to test and adjust our approach before the release of the test set. Repeated runs against the training set were used to tune the various components of our classification system. Tuning tasks included adjusting the keyword lists for morbidities, negation terms, and questionable assertion terms, and trying different tie-breaker rules. The test set was released at the end of June, and teams were allowed three days to evaluate the data and submit their results.