We developed a hybrid system with three processing steps: (1) a pre-processing step, (2) a field detection step that identifies the six fields, and (3) a field linking step that links fields together to form entries. The second step is a statistical system, whereas the other two steps are rule-based. The second step was the main focus of this study. The entire system was first presented at the 2010 Louhi Workshop [13], where the authors were invited to contribute to the special issue of this journal.
In addition to common processing steps such as part-of-speech (POS) tagging, our pre-processor includes a section segmenter that breaks discharge summaries into sections. Discharge summaries tend to consist of sections such as “ADMIT DIAGNOSIS”, “PAST MEDICAL HISTORY”, and “DISCHARGE MEDICATIONS”. Knowing section boundaries is important for the task because, according to the i2b2 challenge annotation guidelines for creating the gold standard, medications occurring under certain sections (e.g., “FAMILY HISTORY” and “ALLERGIES”) were to be excluded from the system output. Knowing the sections could also be useful for field detection and linking. For example, the “DISCHARGE MEDICATIONS” section is more likely to contain medications in a list than medications embedded in narrative text.
The set of sections and the exact spelling of section headings vary across discharge summaries. The section segmenter uses a regular expression (a line starting with a sequence of capital letters followed by a colon) to collect potential section headings from the training data. Headings whose frequency exceeds a threshold are then used to identify section boundaries in the discharge summaries.
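The two stages of the segmenter (collecting candidate headings, then splitting on the frequent ones) can be sketched as follows. This is only an illustration: the regular expression, the function names, and the threshold value are assumptions, since the text does not give the exact expression or cut-off.

```python
import re
from collections import Counter

# Assumed pattern: a line that starts with capital letters (spaces allowed
# between words) followed by a colon. The system's actual expression is not given.
HEADING_RE = re.compile(r"^([A-Z][A-Z ]*?)\s*:")

def collect_headings(training_docs, threshold):
    """Return candidate headings whose frequency in the training data exceeds threshold."""
    counts = Counter()
    for doc in training_docs:
        for line in doc.splitlines():
            match = HEADING_RE.match(line)
            if match:
                counts[match.group(1)] += 1
    return {heading for heading, count in counts.items() if count > threshold}

def segment(doc, headings):
    """Split a document into (heading, lines) pairs using the accepted headings."""
    sections = [(None, [])]  # text before the first heading has no heading
    for line in doc.splitlines():
        match = HEADING_RE.match(line)
        if match and match.group(1) in headings:
            rest = line[match.end():].strip()
            sections.append((match.group(1), [rest] if rest else []))
        else:
            sections[-1][1].append(line)
    return sections
```

A candidate that matches the pattern but is too rare in the training data (for example, an upper-case abbreviation followed by a colon inside narrative text) is treated as ordinary content of the current section.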
This step consists of three modules: find_name, which finds medication names; context_type, which determines whether each identified medication name appears in narrative text or in a list of medications; and find_others, which detects the five non-name field types. For all three modules we use the Maximum Entropy (MaxEnt) learner in the MALLET package [14], because the training time for MaxEnt can be shorter than that of more sophisticated algorithms such as CRF [15]. For find_name, we follow the common practice of treating named entity (NE) detection as a sequence labeling task with the Inside-Outside-Beginning (IOB) tagging scheme; that is, each token in the input is tagged with B-x (beginning an NE of type x), I-x (inside an NE of type x), or O (outside any NE).
The find_name module
As this module identifies medication names only, the tagset under the IOB scheme has three tags: B-m for beginning of a name, I-m for inside a name, and O for outside.
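To make the three-tag scheme concrete, the sketch below converts medication-name spans into B-m/I-m/O tags and back. The function names and the example sentence are illustrative assumptions, not part of the system described here.

```python
def to_iob(tokens, name_spans, label="m"):
    """Tag tokens given name spans as (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in name_spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

def from_iob(tags, label="m"):
    """Recover (start, end) spans from a tag sequence."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a span at the end
        if start is not None and tag != f"I-{label}":
            spans.append((start, i))
            start = None
        if tag == f"B-{label}":
            start = i
    return spans

tokens = "Continue insulin glargine and aspirin daily".split()
tags = to_iob(tokens, [(1, 3), (4, 5)])
# tags == ["O", "B-m", "I-m", "O", "B-m", "O"]
```

The separate B-m tag is what lets the scheme distinguish two adjacent medication names from one multi-word name.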
Various features are used for this module, which we group into four types:
• (F1) includes word n-gram features (n=1,2,3). For instance, the bigram feature w_{i-1} w_i consists of the previous word and the current word.
• (F2) contains features of properties of the current word and its neighbors (e.g., their POS tags, affixes, lengths, containing section, and capitalization).
• (F3) checks the IOB tags of previous words.
• (F4) contains features that check whether an n-gram in the text appears as part of a medication name in some medication name lists.
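As a rough illustration of feature types (F1) to (F3), the sketch below builds string-valued features for one token. The feature names, the choice of a three-character affix, and the boundary symbols are assumptions made for the example; the actual feature templates are not listed in this section.

```python
def token_features(tokens, pos_tags, prev_iob, i, section):
    """Return a set of features for the token at position i.

    prev_iob holds the IOB tags already assigned to tokens before position i.
    """
    word = tokens[i]
    cur = word.lower()
    prev_w = tokens[i - 1].lower() if i > 0 else "<s>"
    next_w = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return {
        # (F1) word n-grams, n = 1, 2, 3
        f"w={cur}",
        f"w-1_w={prev_w}_{cur}",
        f"w-1_w_w+1={prev_w}_{cur}_{next_w}",
        # (F2) properties of the current word
        f"pos={pos_tags[i]}",
        f"prefix3={cur[:3]}",
        f"suffix3={cur[-3:]}",
        f"len={len(word)}",
        f"section={section}",
        f"capitalized={word[:1].isupper()}",
        # (F3) IOB tag assigned to the previous word
        f"tag-1={prev_iob[i - 1] if i > 0 else '<s>'}",
    }

feats = token_features(
    ["Continue", "Lasix", "daily"], ["VB", "NNP", "RB"], ["O"], 1,
    "DISCHARGE MEDICATIONS",
)
```

Each string would act as one binary feature for the MaxEnt learner. Because (F3) depends on earlier predictions, tokens have to be tagged from left to right at test time.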
For (F4) we used two medication name lists. The first list consists of medication names from the training data and is the only list used in set F4a. The second list includes drug names from the FDA National Drug Code Directory (http://www.accessdata.fda.gov/scripts/cder/ndc/) and is used to test whether features that check an external resource improve performance. Feature set F4b uses both lists.
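One way to realize the (F4) check is to index every n-gram that occurs inside a listed name and then test the n-grams that cover the current token. The sketch below assumes whitespace tokenization, lower-casing, and n up to 3; the function names, feature names, and the tiny example list are hypothetical.

```python
def build_ngram_index(names, max_n=3):
    """Index all word n-grams (n <= max_n) that occur inside any listed name."""
    index = set()
    for name in names:
        toks = name.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                index.add(tuple(toks[i:i + n]))
    return index

def lexicon_features(tokens, i, index, list_name, max_n=3):
    """Fire one feature per n for which an n-gram covering token i is in the index."""
    lowered = [t.lower() for t in tokens]
    feats = set()
    for n in range(1, max_n + 1):
        for start in range(i - n + 1, i + 1):
            end = start + n
            if start >= 0 and end <= len(tokens) and tuple(lowered[start:end]) in index:
                feats.add(f"{list_name}_{n}gram")
    return feats

train_index = build_ngram_index(["insulin glargine", "aspirin"])
tokens = ["Start", "insulin", "glargine", "tonight"]
f4a = lexicon_features(tokens, 2, train_index, "train")
```

Under this sketch, F4a would use only features from the training-data index, and F4b would take the union of those with features computed the same way from an index built over the external drug list.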
The context_type module
This module is a binary classifier that determines whether a medication name occurs in a list or narrative context. Features used by this module include the section name as identified by the pre-processing step, the number of commas and words on the line, the medication name itself and its position on the line, and nearby words.
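The features listed above could be assembled along the following lines. This is a sketch under assumptions: the feature names, the use of a token index as the position, and the two-word window are illustrative choices rather than details given in the text.

```python
def context_features(line_tokens, name_start, name_end, section, window=2):
    """Features for one medication name occupying line_tokens[name_start:name_end]."""
    feats = {
        "section": section,
        "n_commas": sum(tok.count(",") for tok in line_tokens),
        "n_words": len(line_tokens),
        "name": " ".join(line_tokens[name_start:name_end]).lower(),
        "name_position": name_start,
    }
    # Nearby words on each side of the name, padded at the line boundaries.
    for offset in range(1, window + 1):
        left = name_start - offset
        right = name_end - 1 + offset
        feats[f"w-{offset}"] = line_tokens[left].lower() if left >= 0 else "<s>"
        feats[f"w+{offset}"] = line_tokens[right].lower() if right < len(line_tokens) else "</s>"
    return feats

feats = context_features("Lasix 40 mg , daily".split(), 0, 1, "DISCHARGE MEDICATIONS")
```

Short lines that begin with the medication name and contain few words are the kind of evidence that would push the classifier toward the list label.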
The find_others module
This module complements the find_name module and uses eleven IOB tags (a B and an I tag for each of the five non-name fields, plus O) to identify the five non-name fields. The feature set used in this module is similar to the one used in find_name, but some features in (F2) and (F4) are modified to suit the non-name fields. For instance, one feature that was not present in find_name checks whether a word fits a common pattern for dosage. In addition, some features in find_others look at the output of previous modules, such as the location of nearby medication names, as this information can be provided by the find_name module at test time.
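The sketch below shows how the eleven-tag set arises and what the two kinds of added features might look like. The field names are assumed from the i2b2 medication challenge (dosage, mode, frequency, duration, reason); the dosage pattern and all function names are hypothetical, since the actual pattern is not given here.

```python
import re

# Assumed non-name field types; only dosage is named explicitly in this section.
NON_NAME_FIELDS = ["dosage", "mode", "frequency", "duration", "reason"]

def build_tagset(fields):
    """One B- and one I- tag per field, plus a single O tag."""
    tags = ["O"]
    for field in fields:
        tags += [f"B-{field}", f"I-{field}"]
    return tags

# Hypothetical dosage pattern: a number followed by a unit, e.g. "40mg".
DOSAGE_RE = re.compile(r"^\d+(\.\d+)?(mg|mcg|g|ml|units?)$", re.IGNORECASE)

def looks_like_dosage(word):
    return bool(DOSAGE_RE.match(word))

def distance_to_nearest_name(i, name_tags):
    """Token distance from position i to the nearest token that find_name
    tagged as part of a medication name, or None if there is no such token."""
    positions = [j for j, tag in enumerate(name_tags) if tag != "O"]
    return min((abs(i - j) for j in positions), default=None)

tagset = build_tagset(NON_NAME_FIELDS)
```

With five fields, `build_tagset` yields 2 × 5 + 1 = 11 tags, matching the count given above.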
The final step is to form entries by associating each medication name with its related fields. Our current implementation uses simple heuristics. First, for each non-name field the closest prior and subsequent name fields are identified. Second, each non-name field is linked to one of those two name fields. In most cases, the non-name field is linked to the prior name field, but if the distance to the subsequent name field is shorter than the distance to the prior name field by more than two lines, we link the non-name field to the subsequent name field. Third, the (name, non-name) pairs are assembled into entries with a few rules that apply if more than one non-name field of the same type is linked to the same name field. More information about the modules, including the features and the linking rules, is available in [16