Coreference resolution of concepts, although a very active area in the natural language processing community, has not yet been widely applied to clinical documents. Accordingly, the 2011 i2b2 competition focusing on this area is a timely and useful challenge. The objective of this research was to collate coreferent chains of concepts from a corpus of clinical documents. These concepts are in the categories of person, problems, treatments, and tests.
A machine learning approach based on graphical models was employed to cluster coreferent concepts. Features selected were divided into domain independent and domain specific sets. Training was done with the i2b2 provided training set of 489 documents with 6949 chains. Testing was done on 322 documents.
The learning engine, using the unweighted average of three different measurement schemes, achieved an F measure of 0.8423 when no domain specific features were included and 0.8483 when the feature set included both domain independent and domain specific features.
Our machine learning approach is a promising solution for recognizing coreferent concepts, which in turn is useful for practical applications such as the assembly of problem and medication lists from clinical documents.
The Health Information Technology for Economic and Clinical Health (HITECH) act, passed in February 2009 as part of the American Recovery and Reinvestment Act (ARRA), sets aside significant funds as incentives for the adoption of electronic medical records (EMR). In particular, the act calls for providers to demonstrate ‘meaningful use’ of a certified EMR to qualify for financial incentives. Evidence of meaningful use has been defined, in part, as the capturing of structured elements in an EMR such as problem lists, medications, procedures, allergies, and quality measures. Identifying coreferent concepts is an essential part of capturing these elements.
The 2011 natural language processing (NLP) challenge's focus on coreference (different terms in a document that refer to the same concept) gives it a high practical relevance in the marketplace. In particular, the relationship of problems and treatments in transcribed medical documents can be used to assemble complete, yet precise, problem and medication lists. For instance, the terms: ‘congestive heart failure,’ ‘CHF,’ ‘systolic heart failure,’ and ‘heart failure’ may be coreferent—that is, they may all describe the same condition in the same patient. The objective of the challenge is to assemble all such relationships into chains of coreferent problems, treatments, tests, or persons.
This year, i2b2 provided two sets of annotations using two different guidelines for the challenge: ODIE and i2b2. The ODIE guidelines were more elaborate and detailed than the i2b2 guidelines. We entered track 1c, with i2b2 annotations, to capitalize on our experience with this annotation set from last year's competition. The training document set included the following annotation types: Test, Problem, Treatment, Person, and Pronoun. The annotation chains (coreferent chains) were of the following categories: Test, Problem, Treatment, and Person.
The effort discussed in this report details our solution to the coreference challenge using the i2b2 guidelines.
There has been significant recent research effort in the NLP community to address the problem of coreference resolution. The BART system1 uses a classifier that relies on a feature set that can be tuned to different languages. Facets of the feature set described in that system include: gender agreement (he/she), number agreement (singular/plural), animacy agreement (him/it, them/that), string match (exact or partial match), distance between concepts (physical separation: the number of characters or words between mentions), and aliases (synonyms). The authors also describe semantic tree compatibility, in which a frame of slot-value pairs (that include the above features) is associated with each concept and the frames are compared for compatibility. Most approaches to coreference resolution rely on supervised learning techniques; Raghunathan and coworkers,2 however, take a completely different approach: they order the feature sets from most precise to least precise and apply them successively to collate coreferent chains. A cluster-ranking approach, in which coreference resolution is recast as the problem of finding the best preceding cluster to which to link a particular mention, is discussed by Rahman and Ng.3
Our approach in this effort builds upon the methods described by Culotta et al.4 It uses a learning engine and a feature set that is fine-tuned to the clinical domain. The core learning engine is implemented using the Scala programming language.
We used the ‘Factorie’ toolkit5 to support the learning task. The toolkit is used to implement factor graphs. In the factor graph, the mentions are represented as nodes. Mentions which are coreferent in a given configuration are connected by edges, which we call pairwise-affinity factors. As the system considers different possible configurations, it constructs a factor graph to represent each configuration.
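As a rough sketch of this configuration search (the names below are ours for illustration, not the Factorie API), a candidate configuration can be scored by summing toy pairwise-affinity scores over mentions that share a chain:

```python
# Illustrative sketch only: a configuration partitions mentions into chains,
# and its score sums toy pairwise-affinity factors within each chain.

def pair_score(a: str, b: str) -> float:
    """Toy pairwise-affinity score: exact string match, else token overlap."""
    if a.lower() == b.lower():
        return 2.0
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def score_configuration(config: list[list[str]]) -> float:
    """Sum pairwise factors over every mention pair inside each chain."""
    total = 0.0
    for chain in config:
        for i in range(len(chain)):
            for j in range(i + 1, len(chain)):
                total += pair_score(chain[i], chain[j])
    return total
```

Under such a scoring, moving a mention into the chain whose members it overlaps most raises the configuration score, which is the kind of comparison the sampler makes between candidate configurations.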
For example, figure 1 shows an incorrect configuration with four mentions divided into three chains. The mentions ‘gnr’ and ‘gram negative rods’ are chained with each other, but the mention ‘gnr bacteremia’ is incorrectly omitted from the chain.
The system will consider adding ‘gnr bacteremia’ to the correct chain as in figure 2.
It will also consider adding ‘gnr bacteremia’ to the chain containing the mention ‘hypoxic’ as in figure 3. The features for the pairwise-affinity factors are of two types. Some of the features attempt to capture the relationship of the two mentions. This uses a fairly standard hand-constructed list including: distance metrics, gender agreement (he/she), laterality agreement (left/right), number agreement (singular/plural), overlap, synonyms in SNOMED, hypernyms (broader concepts) in SNOMED, string equality, etc. For the second set of pair-wise features, we take the cross-product of certain mention-wise features. Mention-wise features include words, bi-grams, four character prefixes, and enclosing section type. We also used chain-wise factors, one per chain, to capture information about the chain as a whole. As an example, a chain which included both ‘Mr.’ and ‘she’ would be noted as having a gender inconsistency. The graph was trained using a maximum entropy model with adaptive regularization of weight vectors (AROW) updates.6 Sampling was from the plausible permutations generated for the mentions.
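The pairwise and chain-wise features above can be sketched as follows (mention_features, cross_features, and gender_inconsistent are illustrative names we assume; this is not the system's actual implementation):

```python
def mention_features(text: str) -> set[str]:
    """Mention-wise features: words, bi-grams, four character prefixes."""
    words = text.lower().split()
    unigrams = {"w=" + w for w in words}
    bigrams = {"bi=" + "_".join(p) for p in zip(words, words[1:])}
    prefixes = {"pre4=" + w[:4] for w in words}
    return unigrams | bigrams | prefixes

def cross_features(a: str, b: str) -> set[str]:
    """Second pairwise type: cross-product of the two mentions' features."""
    return {fa + "|" + fb
            for fa in mention_features(a)
            for fb in mention_features(b)}

# A chain-wise factor: flag a chain containing both masculine and feminine
# markers (e.g. 'Mr.' and 'she') as gender inconsistent.
MASCULINE = {"he", "him", "his", "mr."}
FEMININE = {"she", "her", "hers", "mrs.", "ms."}

def gender_inconsistent(chain: set[str]) -> bool:
    lower = {m.lower() for m in chain}
    return bool(lower & MASCULINE) and bool(lower & FEMININE)
```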
In the next section, we discuss feature selection in greater detail.
Regardless of whether the features discussed above are used in a pair-wise fashion or as an aspect of a single mention or a whole chain, we can classify the feature selection as being domain independent or domain dependent.
The domain independent features include four and five character prefixes, words, bi-grams, string match, gender match, and number match. We also considered headword (root/stem) matching and animacy matching, but lacked the development time to implement these before the contest deadlines. Although we had access to parts-of-speech tagging based on the cTAKES system,7 we did not use it. We anticipate trialing these approaches as time permits.
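The gender and number agreement checks can be sketched as follows, assuming simplified pronoun lists and a crude pluralization heuristic of our own (not the system's actual rules):

```python
# Hedged sketch: the pronoun lists and pluralization heuristic below are
# simplifications we assume for illustration.
MASCULINE_PRONOUNS = {"he", "him", "his"}
FEMININE_PRONOUNS = {"she", "her", "hers"}

def gender(word: str):
    """Return 'm', 'f', or None (unknown) from simple pronoun lists."""
    w = word.lower()
    if w in MASCULINE_PRONOUNS:
        return "m"
    if w in FEMININE_PRONOUNS:
        return "f"
    return None

def gender_match(a: str, b: str) -> bool:
    """Mismatch only when both genders are known and differ."""
    ga, gb = gender(a), gender(b)
    return ga is None or gb is None or ga == gb

def is_plural(word: str) -> bool:
    """Crude number heuristic: trailing 's', except pronouns like 'his'."""
    w = word.lower()
    return w.endswith("s") and w not in MASCULINE_PRONOUNS \
        and w not in FEMININE_PRONOUNS

def number_match(a: str, b: str) -> bool:
    return is_plural(a) == is_plural(b)
```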
For the domain dependent features, we implemented a variety of approaches, including SNOMED-based relationships and web search co-occurrence.
Although this approach identified ‘knee’ as a term related to ‘meniscal tear,’ we found that using the SNOMED vocabulary afforded better results in general. The use of web searches nevertheless appears promising and merits further investigation. Of note, Google did not allow programmatic use of its search engine, while Microsoft Bing provided useful software resources to support such searches.
The whole NLP framework was developed using the Scala programming language, which supports both object-oriented and functional programming. Implementations of the language are available for the Java and .NET platforms; our NLP platform was built in the Java environment. We also relied on the Apelon terminology environment to determine relevant SNOMED-based aliases.
The evaluation metrics were supplied by the i2b2 contest organizers, who specified four metrics: MUC, B3, CEAF, and BLANC. For an excellent comparison of the metrics chosen for the i2b2 contest see Recasens and Hovy.8 Cai and Strube also discuss some of these metrics.9 For B3, each mention i contributes

Precision_i = |R_i ∩ K_i| / |R_i|    Recall_i = |R_i ∩ K_i| / |K_i|

where R_i is the response chain (system response) containing the ith mention and K_i is the corresponding gold standard chain. These per-mention values are then summed over the entire set of mentions and normalized by the number of mentions.
The B3 measure is overly sensitive to a large number of singleton mentions, a fact that we verified by assigning every mention to its own coreference chain. This resulted in a B3 value of 0.955, higher than any result we obtained in the actual runs.
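The singleton effect can be seen in a minimal B3 sketch (our own illustrative code, not the contest's scorer):

```python
# Our own illustrative B3 sketch. Each chain is a set of mention ids;
# response and gold are partitions of the same mentions.
# Returns (precision, recall).

def b3(response, gold):
    mentions = [m for chain in gold for m in chain]

    def chain_of(partition, m):
        return next(chain for chain in partition if m in chain)

    precisions, recalls = [], []
    for m in mentions:
        r, k = chain_of(response, m), chain_of(gold, m)
        overlap = len(r & k)
        precisions.append(overlap / len(r))  # precision_i
        recalls.append(overlap / len(k))     # recall_i
    return sum(precisions) / len(mentions), sum(recalls) / len(mentions)
```

Placing every mention in its own chain yields perfect B3 precision for each mention, since a singleton response chain is wholly contained in its gold chain; a corpus dominated by singletons therefore inflates the overall score.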
The BLANC measure, although computed, was not used by the i2b2 organizers and is not included in our results.
Table 1 shows the results from training using only the domain independent features set. On our computing hardware, the training phase required approximately 18 h to run on the 489 documents in the contest training set.
Table 2 shows the results from training using both the domain independent and domain dependent feature sets; the latter extensively used the SNOMED vocabularies discussed in the previous sections. This training phase took approximately 40 h to run.
In general, we did less well on the MUC metric, and scored particularly poorly in the ‘tests’ category by MUC analysis, perhaps because the training documents had fewer instances of ‘tests’ than the other categories, offering less material for our learning engine to build upon.
Table 3 shows the result of testing/evaluation carried out on 322 documents. The test required only 10 min to run. It used the model created from training on the domain independent features set.
Table 4 shows the result for the test run, based on the model created from using domain independent and domain dependent features. This test run was completed in about 10 min.
Table 5 compares the results across all four runs.
The striking finding is that the addition of the domain dependent features, which extensively used SNOMED concepts, did not provide as much benefit in the scoring as we expected. Our assumption when approaching the task was exactly the opposite, a notion supported by the literature.13 Multiple factors appear to be at play in our results. The training samples were fairly extensive (for some of the categories) and the testing samples were drawn from the same corpus, so routine aliases were already captured in the training set and the added benefit of the SNOMED vocabulary was correspondingly smaller. In addition, our domain dependent features were not targeted at pronoun resolution, which formed the greater part of the recognition task for the challenge.
Our machine learning technique performed poorly at recognizing tests, most likely because of the smaller number of tests in the training documents available to the learning process. We also performed relatively weakly at pronoun recognition, primarily because we, the investigators, paid less attention to this facet of the challenge: disambiguating pronouns is of less practical significance to a commercial entity such as ours than collating a precise collection of problems or medications.
We examined the errors made by the learning engine and placed them in general classes.
Our efforts to use SNOMED vocabulary were clearly targeted at addressing errors that arose from failure to recognize synonymous terms. However, that effort has not yet borne fruit. Temporal features were not implemented, which might have eliminated some of the errors. Recognizing when a concept is coreferent and when it is not remains a challenge for NLP systems.
The coreference challenge has focused attention on an area that has sometimes been ignored in clinical document analysis. Our machine learning approach is a promising solution to the task of automating the assembly of problem and medication lists from clinical documents. Our results also suggest that a high-quality set of annotated training documents is key, and that domain independent features are sufficient for obtaining reasonable results. In real world deployment, however, we expect it will be critical to employ well formulated domain specific features to provide a more robust engine that works across different document types and sources. Problem and medication lists generated through this method will, of course, need to be manually reviewed and corrected, but the techniques explored here can capably render an excellent first draft for a human validator.
We would like to thank our M*Modal management for supporting i2b2 and our participation in this challenge.
Funding: The 2011 i2b2/VA challenge and the workshop are funded in part by grant number 2U54LM008748 on Informatics for Integrating Biology to the Bedside from the National Library of Medicine. This challenge and workshop are also supported by the resources and facilities of the VA Salt Lake City Health Care System with funding support from the Consortium for Healthcare Informatics Research (CHIR), VA HSR HIR 08-374 and the VA Informatics and Computing Infrastructure (VINCI), VA HSR HIR 08-204, and the National Institutes of Health, National Library of Medicine under grant number R13LM010743-01.
Competing interests: None.
Provenance and peer review: Not commissioned; externally peer reviewed.
Data sharing statement: The i2b2 organizers are making the data used for the challenge available to research institutions.