Narratives of electronic medical records contain information that can be useful for clinical practice and multi-purpose research. This information needs to be put into a structured form before it can be used by automated systems. Coreference resolution is a step in the transformation of narratives into a structured form.
This study presents a medical coreference resolution system (MCORES) for noun phrases in four frequently used clinical semantic categories: persons, problems, treatments, and tests. MCORES treats coreference resolution as a binary classification task. Given a pair of concepts from a semantic category, it determines coreferent pairs and clusters them into chains. MCORES uses an enhanced set of lexical, syntactic, and semantic features. Some MCORES features measure the distance between various representations of the concepts in a pair and can be asymmetric.
MCORES was compared with an in-house baseline that uses only single-perspective ‘token overlap’ and ‘number agreement’ features. MCORES was shown to outperform the baseline; its enhanced features contribute significantly to performance. In addition to the baseline, MCORES was compared against two available third-party, open-domain systems, RECONCILEACL09 and the Beautiful Anaphora Resolution Toolkit (BART). MCORES was shown to outperform both of these systems on clinical records.
Narratives in electronic medical records present information about the patient's health but are not directly available to clinical automated systems. Natural language processing (NLP) technologies can facilitate the integration of narratives into automated systems by extracting structured information (eg, names of persons, treatments) from narratives, identifying negation and uncertainty, and determining the relations between concepts (eg, LabScanner1 and MedLEE2). A key NLP task, coreference resolution, determines whether two concepts are coreferent, ie, linked by an ‘identity’ or ‘equivalence’ relation. For example, in the sentence ‘A complete blood count showed…, but the test did not…’, ‘the test’ and ‘a complete blood count’ are equivalent because they refer to the same entity. Using the results of NLP analysis for clinical applications, whether these are in decision support, quality assessment, or epidemiological modeling, relies on deriving an accurate, non-redundant set of facts from the narratives. Correctly determining coreference helps to tie together different statements made about a single entity and helps to keep multiple entities that might be confused for each other distinct.
In this study, we focused on coreference resolution of noun phrases in the clinical domain. We defined mentions of concepts as markables and considered pairs of markables for coreference. We defined an ordered pair of markables I–J as a coreference candidate; in such a pair, I is the antecedent, and J is the anaphor. Groups of coreferent markables create chains, for example, I–J–K is a chain consisting of coreferent pairs I–J and J–K.
In the NLP literature, coreference resolution has focused primarily on newspaper3 4 and biomedical corpora,5 leaving clinical corpora relatively unexplored.6–8 Inspired by the work of He,9 we developed a medical coreference resolution system (MCORES) that targets noun phrases in the clinical domain, and showed that a rich set of features can be beneficial to coreference resolution. In addition, we compared MCORES with two existing third-party, open-domain (ie, not domain-specific) coreference resolution systems as a way of benchmarking it against the state of the art. We found that MCORES outperforms these systems on clinical records.
MCORES was inspired by work conducted in open-domain NLP. In general, research on coreference resolution in open-domain NLP made use of annotated corpora10–13 and created exemplary rule-based and machine learning systems.
One of the seminal coreference resolution systems, RESOLVE, was designed by McCarthy and Lehnert.3 RESOLVE tackled coreference resolution of noun phrases in four steps: pair creation, feature set determination, learning, and clustering. RESOLVE focused on four semantic types: organizations, facilities, persons, and products-services. It applied the C4.514 decision tree algorithm to classify pairs as coreferent and then clustered coreferent pairs into chains, achieving an average F-measure of 0.858 over coreference chains.
Soon et al15 extended RESOLVE to pronouns. They expanded RESOLVE's feature set, and modified RESOLVE's pair creation and clustering steps, achieving an F-measure of 0.626 on the Message Understanding Conference 6 (MUC-6) corpus compared with RESOLVE's F-measure of 0.472 on the same corpus. Versley et al16 implemented the algorithm of Soon et al15 in the Beautiful Anaphora Resolution Toolkit (BART). On the MUC-7 corpus, BART gave a best precision of 0.741 using the support vector machine (SVM) linear classifier, and a best recall of 0.563 using the maximum entropy classifier.
Yang18 extended RESOLVE's feature set in order to examine how the different methods of measuring the token overlap affect coreference resolution. His system achieved a best precision of 0.697 and a best recall of 0.714. Castaño et al19 deviated from RESOLVE's framework and focused on sortal and pronominal coreference resolution on MEDLINE abstracts. For sortal coreference resolution, they used the UMLS Metathesaurus20 and MetaMap to identify biomedical markables and their semantic types. They achieved a precision of 0.733 and a recall of 0.700.
Son et al21 studied the coreference of findings of lung masses in radiology documents. Their system incorporated domain knowledge (eg, mass location, quantity, size, calcification pattern) and achieved a 0.672 MUC F-measure. Yang et al22 approached coreference resolution by exploring the relationship between noun phrases and coreference clusters. Their system achieved an F-measure of 0.817 on MEDLINE abstracts.
Stoyanov et al23 modeled RECONCILEACL09 after the state-of-the-art system of Ng and Cardie.17 They used a set of 76 features and applied the perceptron learning algorithm for classification. A single-link algorithm was used for clustering. RECONCILEACL0924 outperformed the systems of Soon et al15 and Ng and Cardie17 with a 0.712 MUC F-measure on MUC-6 and a 0.629 MUC F-measure on MUC-7.
Open-domain coreference resolution systems paved the way for coreference resolution in clinical records. Inspired by these systems, we developed MCORES for noun phrase coreference resolution in clinical records.
We focused on noun phrases that fall under four semantic categories: persons, problems, treatments, and tests. We observed the importance of token overlap and number agreement features for successful open-domain coreference resolution. Given the observations from the open-domain:
We used a corpus of de-identified clinical records that contained 230 discharge summaries from Partners Healthcare (PH) and 196 from Beth Israel Deaconess Medical Center (BIDMC). The Partners Healthcare records contained a total of 23 277 noun phrase markables, and the Beth Israel Deaconess Medical Center records contained a total of 16 072 noun phrase markables. These records25 were provided by the i2b2 National Center for Biomedical Computing and were prepared for the coreference resolution track of the 2011 Shared Tasks for Challenges in NLP for clinical data.26 This study was approved by the relevant institutional review boards.
We targeted the discovery of coreference chains for noun phrase markables in four semantic categories. The coreference chains in our data were built on gold standard markables with gold standard semantic categories.
The markables in our corpus were annotated as a part of the 2010 i2b2/VA challenge on concepts, assertions, and relations.27 These markables included tests, problems, and treatments;28 however, for coreference resolution, persons and pronouns were added to the 2010 challenge markables.26 Table 1 shows the number of annotated markables and chains per semantic category in our corpus.
The corpus was doubly annotated for coreference given the gold standard markables and their semantic categories. The length of the annotated chains varied between two and 149 markables, with an average of two markables per chain for problems, treatments, and tests, and 13 markables per chain for persons (see supplementary table 2, available online only).
MCORES resolves coreference in four steps that follow in the spirit of RESOLVE: pair creation, feature set determination, classification, and output clustering.
MCORES creates positive training pairs only from neighboring markable pairs in a chain. Table 3 shows that MCORES created 94 914 pairs in our corpus; of these, 80 455 were non-coreferent and 14 459 were coreferent.
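The positive-pair rule above can be illustrated with a minimal sketch. It assumes a chain is given as a list of markable identifiers in document order; the function name `positive_pairs` and the identifiers are hypothetical, and the creation of non-coreferent pairs is not covered because the text here does not spell it out.

```python
from typing import List, Tuple


def positive_pairs(chain: List[str]) -> List[Tuple[str, str]]:
    """Return (antecedent, anaphor) pairs for neighboring markables in one chain.

    Assumption: `chain` lists markable identifiers in document order.
    """
    # zip pairs each markable with its immediate successor in the chain.
    return list(zip(chain, chain[1:]))


# A chain I-J-K yields the neighboring pairs I-J and J-K, but not I-K.
print(positive_pairs(["I", "J", "K"]))
```

A chain of n markables therefore contributes n - 1 positive pairs rather than all n(n - 1)/2 possible pairings.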
Ng and Cardie17 showed that large feature sets help coreference resolution. We consequently built MCORES with a large feature set that included lexical, syntactic, and semantic information. Some of these features (eg, token distance, sentence-level markable overlap) are novel to the task of coreference resolution.
We observed that some of the frequently used features in coreference resolution are asymmetric, as they measure the distance between various representations of markables. We incorporated multiple perspectives of these features into MCORES: (1) the antecedent perspective assessed each feature from the vantage point of the antecedent; (2) the anaphor perspective was from the vantage point of the anaphor; (3) the greedy perspective was the maximum of the antecedent and anaphor perspectives; and (4) the stingy perspective was their minimum. As an example, a multi-perspective token overlap feature first counted the number of tokens that were common to the antecedent and the anaphor; it created the antecedent perspective by normalizing this count by the number of tokens in the antecedent; it created the anaphor perspective by normalizing the count by the number of tokens in the anaphor; greedy and stingy perspectives were created by applying max and min to the antecedent and anaphor perspectives. We refer to features that do not have multiple perspectives, as well as each of the individual perspectives of multi-perspective features, as single-perspective features.
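As an illustration of the four perspectives, the following sketch computes a multi-perspective token overlap feature. It is not the MCORES implementation: lowercased whitespace tokenization and counting distinct tokens are assumptions made for this example, and the function name is hypothetical.

```python
from typing import Dict


def token_overlap_perspectives(antecedent: str, anaphor: str) -> Dict[str, float]:
    """Token overlap from the antecedent, anaphor, greedy, and stingy perspectives."""
    # Assumption: lowercased whitespace tokens, counted once each.
    ant_tokens = set(antecedent.lower().split())
    ana_tokens = set(anaphor.lower().split())
    common = len(ant_tokens & ana_tokens)

    # Normalize the shared-token count by each markable's own token count.
    ant_view = common / len(ant_tokens) if ant_tokens else 0.0
    ana_view = common / len(ana_tokens) if ana_tokens else 0.0

    return {
        "antecedent": ant_view,
        "anaphor": ana_view,
        "greedy": max(ant_view, ana_view),
        "stingy": min(ant_view, ana_view),
    }


features = token_overlap_perspectives("right basilar atelectasis", "atelectasis")
print(features)
```

For this pair the anaphor is fully covered by the antecedent while the antecedent is only one-third covered, which is the asymmetry the perspectives are meant to capture.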
MCORES included the following phrase-level lexical features:
We hypothesized that two coreferent markables will probably be surrounded by similar tokens and markables, and we supplemented MCORES with sentence-level lexical information that captured context:
Syntactic features of markables can help resolve coreference. MCORES included:
MCORES mapped markables to UMLS concept unique identifiers (CUI) using MetaMap; it filtered these CUIs according to their UMLS scores20 and the number of their UMLS semantic types, creating a set of UMLS CUIs for the antecedent (set I) and a set for the anaphor (set J). This process also created a set of UMLS semantic types for the antecedent (set S) and the anaphor (set T):
MCORES supplemented its feature set with a series of single perspective attributes:
MCORES used the C4.5 decision tree algorithm. We selected this algorithm for its flexibility, prediction model readability, and established track record.14
We used RESOLVE's aggressive merge to cluster coreferent pairs into chains. Aggressive merge clustered a coreferent pair A–B with all coreferents linked to A or B.
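Read literally, this description amounts to taking connected components over the coreferent pairs. A minimal sketch using a union-find structure follows; this is one possible way to realize the merge, not necessarily how RESOLVE or MCORES implemented it, and the function name is hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def aggressive_merge(pairs: Iterable[Tuple[Hashable, Hashable]]) -> List[Set[Hashable]]:
    """Cluster coreferent pairs into chains by merging anything linked to either member."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        # Register unseen markables as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two chains

    chains: Dict[Hashable, Set[Hashable]] = {}
    for markable in list(parent):
        chains.setdefault(find(markable), set()).add(markable)
    return list(chains.values())


# Pairs I-J and J-K end up in one chain; X-Y forms a separate chain.
print(aggressive_merge([("I", "J"), ("J", "K"), ("X", "Y")]))
```

Because every positive pair decision merges two whole chains, a single false positive can join two otherwise unrelated chains.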
We evaluated coreference systems on pairs (see supplementary data, available online only) and on chains.
We evaluated chains using four widely used metrics that possess different strengths: MUC,29 B-Cubed (B3),30 CEAF,31 and BLANC.32 The MUC metric ignores recall for singleton chains (ie, chains that consist of a single markable), and favors systems that generate longer chains (ie, a system that generates a single chain of all the markables will receive 100% recall, and a fairly high precision). The B3 metric takes singletons into account. Because multiple markables can belong to a single chain, the B3 metric can count the same chain many times. To avoid this, the CEAF metric aligns entities in the system response and the gold standard before evaluating performance. BLANC adjusts the Rand index33 for coreference resolution. It takes singletons into consideration and evaluates correctly identified chains according to the number of markables they contain.
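To make the MUC behavior described above concrete, here is a minimal sketch of the link-based MUC score. It assumes chains are given as sets of markable identifiers and that each markable belongs to at most one chain per side; the function names are hypothetical, and the other three metrics are not shown.

```python
from typing import Hashable, List, Set, Tuple

Chain = Set[Hashable]


def _muc_ratio(reference: List[Chain], other: List[Chain]) -> float:
    """Fraction of the reference chains' links that are preserved by `other`."""
    chain_of = {m: i for i, chain in enumerate(other) for m in chain}
    numerator = denominator = 0
    for chain in reference:
        # Count how many pieces `other` splits this reference chain into;
        # markables missing from `other` each count as their own piece.
        linked = {chain_of[m] for m in chain if m in chain_of}
        unlinked = sum(1 for m in chain if m not in chain_of)
        numerator += len(chain) - (len(linked) + unlinked)
        denominator += len(chain) - 1  # a singleton contributes no links
    return numerator / denominator if denominator else 0.0


def muc(gold: List[Chain], system: List[Chain]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) under the MUC metric."""
    recall = _muc_ratio(gold, system)
    precision = _muc_ratio(system, gold)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


gold = [{"a", "b"}, {"c", "d"}, {"e"}]
one_big_chain = [{"a", "b", "c", "d", "e"}]
print(muc(gold, one_big_chain))
```

In this toy case the single all-inclusive chain recovers every gold link, so recall is 1.0, and the gold singleton `e` has no effect on recall.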
We used the unweighted average of the four metrics as a measure of coreference performance on chains and evaluated pair classification using F-measure. For details on each of the metrics, see supplementary data (available online only).
We used the approximate randomization test34 to assess whether two system outputs were significantly different from each other. We set α to 0.05. Because we compared multiple hypotheses, we applied the Bonferroni35 correction to counteract the problem of multiple comparisons. The Bonferroni adjusted α was set to 0.00045 for 111 comparisons (see table 5). See supplementary data (available online only) for details.
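The two steps above can be sketched as follows. The Bonferroni adjustment is simply α divided by the number of comparisons. The randomization test shown here is a generic paired version that assumes one score per evaluation unit (for example, per document) and uses the absolute difference of mean scores as the test statistic; the exact statistic and shuffling unit used in the study are described in the supplementary data, so treat these choices and the function names as assumptions.

```python
import random
from typing import Sequence


def bonferroni_alpha(alpha: float, comparisons: int) -> float:
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / comparisons


def approximate_randomization(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    trials: int = 10000,
    seed: int = 0,
) -> float:
    """Estimate a p-value for the difference between two paired score lists."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            # Swap the two systems' scores for this unit with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


adjusted = bonferroni_alpha(0.05, 111)
print(round(adjusted, 5))
```

A difference between two systems would then be called significant only if the estimated p-value falls below the adjusted threshold.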
We evaluated MCORES against a baseline that was identical to MCORES except in its feature set. The baseline employed only token overlap and number agreement as features; comparison with this baseline revealed the gain of MCORES' features over those of the baseline.
We also compared MCORES against RECONCILEACL09 and BART, two state-of-the-art, open-domain systems. Both RECONCILEACL09 and BART automatically generate features from any given corpus and can therefore be applied to any domain. RECONCILEACL09 and BART differ from MCORES in their classification and clustering algorithms (see table 4). Given these differences, we present the results of these systems only as a benchmark that shows how state-of-the-art, open-domain systems perform on clinical records.
Last, but not least, to evaluate each of the features of MCORES, we ran it with subsets of its features. Each of the instances of MCORES run with subsets of features was referred to as sub-MCORES.
RECONCILEACL09 and BART were designed to handle all markables in the corpus together, regardless of their semantic categories. We therefore evaluated MCORES, the baseline, sub-MCORES, RECONCILEACL09, and BART under this scenario:
However, given the availability of ground truth semantic category information for our markables, we also ran MCORES, the baseline, sub-MCORES, and RECONCILEACL09 on individual semantic categories as follows.
We evaluated the runs:
Given its design, we evaluated BART across all markables only. All systems were cross-validated (10-fold) on our data.
We evaluated the difficulty of coreference resolution in our corpus by checking the degree of token overlap between the markables in pairs.
Table 3 shows that approximately 30–40% of coreferent pairs in each of the semantic categories in our corpus present exact overlap (markables overlap in their entirety). At least 40% of problem, treatment, and test coreferent pairs present partial overlap (markables share at least one token and, as supplementary table 2 (available online only) shows, the average markable length is 5.28 across all semantic categories). In addition, on average 1% of non-coreferent pairs show exact overlap (eg, two x-rays taken on two different days) and between 2% and 5% of non-coreferent pairs show at least partial overlap.
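The overlap categories used in this analysis can be illustrated with a short sketch. It assumes lowercased whitespace tokenization, which may differ from the tokenization used in the study, and the function name `overlap_type` is hypothetical.

```python
def overlap_type(markable_a: str, markable_b: str) -> str:
    """Classify a markable pair as 'exact', 'partial', or 'none' token overlap."""
    # Assumption: lowercased whitespace tokens.
    tokens_a = markable_a.lower().split()
    tokens_b = markable_b.lower().split()
    if tokens_a == tokens_b:
        return "exact"  # the markables overlap in their entirety
    if set(tokens_a) & set(tokens_b):
        return "partial"  # at least one shared token
    return "none"


print(overlap_type("the mri on admission", "the hct on admission"))
print(overlap_type("a complete blood count", "the test"))
```

The first pair shares three of four tokens yet names two different tests, which is why partial overlap alone is an unreliable signal for coreference.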
Given these data, we then investigated the three questions from the problem definition section. We repeat these questions here for convenience:
MCORES performed well on both pairs and chains (see table 5 for chains and supplementary table 6, available online only, for pairs). It achieved an unweighted average F-measure of 0.749 for per-corpus and 0.898 for per-entity runs when evaluated on chains across all markables. MCORES' best per-entity run, with an unweighted average F-measure of 0.886, was on treatments (see table 5). MCORES' performance on individual semantic categories was lower on the per-corpus runs than on the per-entity runs.
MCORES significantly outperformed the baseline across all markables in per-entity runs; its per-corpus unweighted average F-measure was 0.749 versus 0.730 of the baseline; its per-entity unweighted average F-measure was 0.898 versus 0.804 of the baseline (see table 5). Extended results are included in supplementary table 7 (available online only).
MCORES also outperformed the baseline in pair classification (see supplementary table 6, available online only). The per-entity run of MCORES performed particularly well on pair classification. It had an F-measure of 0.824 across all markables with best performance on persons and lowest performance on problems (for details see supplementary text, available online only).
We evaluated pair classification on exact token overlap, partial token overlap, and no token overlap pairs separately. MCORES outperformed the baseline on all pair types for per-corpus and for per-entity runs when evaluated across all markables (see supplementary table 8, available online only). When evaluated on individual semantic categories, MCORES outperformed the baseline on some and was outperformed on others for both the per-entity and per-corpus runs.
We evaluated MCORES against RECONCILEACL09 and BART. Table 5 shows that MCORES outperformed RECONCILEACL09 across all markables on per-entity runs. It outperformed RECONCILEACL09 on persons, problems, and treatments on both per-entity and per-corpus runs. MCORES outperformed BART across all markables on the per-corpus runs.
To evaluate MCORES' lexical, syntactic, semantic, and miscellaneous features, we ran it with each group of features separately. Each of the runs with subsets of MCORES' features is referred to as feature-based sub-MCORES.
Table 5 shows that MCORES outperformed the feature-based sub-MCORES (except for the phrase-level lexical sub-MCORES) on per-entity runs when evaluated across all markables. It outperformed the miscellaneous sub-MCORES on per-corpus runs when evaluated across all markables. Analysis on individual semantic categories of per-entity runs shows that MCORES performed as well as or better than all feature-based sub-MCORES on all semantic categories. Analysis on individual semantic categories of per-corpus runs shows that MCORES outperformed all feature-based sub-MCORES on persons; it outperformed all but one (the phrase-level lexical sub-MCORES) on problems and all but one (the sentence-level lexical sub-MCORES) on treatments; and it was comparable to all feature-based sub-MCORES on tests.
To measure the value of multi-perspective features, we evaluated each of the individual perspectives against MCORES. Each of the runs with individual perspectives is referred to as perspective-based sub-MCORES. The results of comparing antecedent, anaphor, greedy, and stingy perspective-based sub-MCORES with MCORES are shown in table 5. Comparing the perspective-based sub-MCORES with each other, we find that, in the per-entity runs, greedy perspective sub-MCORES gives the best performance; in the per-corpus runs, antecedent perspective sub-MCORES gives the best performance.
In general, MCORES outperformed the baseline and the third-party systems across all markables. In the per-entity runs, MCORES significantly outperformed the baseline on persons and tests. In the per-corpus runs, MCORES significantly outperformed the baseline on persons, problems, treatments, and tests.
Analysis of system outputs revealed several patterns. The gain of MCORES in persons over the baseline came from its ability to link markables with no token overlap (eg, ‘patient’–‘Kulrine, ryyelege n’); on tests, the baseline showed a strong disadvantage by linking unrelated markables with partial token overlap (eg, the baseline incorrectly links ‘the mri on admission’–‘the hct on admission’). In general the baseline overgenerated chains; this coincidentally worked to its advantage on the less prevalent classes, such as treatments. However, we expect that this advantage would disappear as the data set grows.
Both MCORES and the phrase-level lexical sub-MCORES outperformed the third-party systems (see table 5). The third-party systems generally underpredicted the true coreference pairs; this was accounted for by their pair creation methods. Despite the performance displayed by the phrase-level lexical sub-MCORES with no clinical knowledge, clinical knowledge played a role in the gain of MCORES over RECONCILEACL09. For example, pairs such as ‘right basilar atelectasis’—‘atelectasis’ were correctly classified by MCORES but were missed by RECONCILEACL09, BART, and the phrase-level lexical sub-MCORES.
Phrase-level lexical sub-MCORES performed similarly to MCORES on most individual semantic categories and across all markables. However, much like the baseline, phrase-level lexical sub-MCORES tended to incorrectly link markable pairs that had partial overlap, and mostly generated shorter chains than the gold standard. We expect that as the data grow, MCORES will also gain over this sub-MCORES on the individual semantic categories.
We found that individual perspective-based sub-MCORES performed similarly to each other and to MCORES when evaluated across all markables (see table 5). While some of their differences were statistically significant, the observed differences may not justify the additional model complexity and a greedy perspective sub-MCORES may in general be sufficient for our application.
MCORES did not always outperform its competitors. For example, it did not significantly outperform the perspective-based sub-MCORES and the phrase-level sub-MCORES on the per-entity runs. It also did not show significantly better performance on treatments, tests, and across all markables on the per-corpus runs. Yet, there is no single system that could outperform MCORES on both the per-corpus and per-entity runs across all markables and across all semantic categories.
MCORES generated typical errors for each semantic category; for example, it failed to classify misspelled person pairs. False positives for problems were generated by the inability to distinguish between newly arisen and recurring events (eg, pneumonia occurring at different dates vs an incurable disease such as AIDS). Treatment false positives were encountered when medications with the same name (but different routes of administration) did not corefer. Test errors occurred because many test pairs that exhibited exact token overlap did not corefer. For all semantic categories, false negatives mainly came from markables with no token overlap, a shortcoming that could be remedied by additional world knowledge, or by better filtering the knowledge provided by the UMLS (eg, the system should infer that ‘infection’ and ‘communicable disease’ referred to the same entity).
Last, but not least, this paper relied on markables and semantic categories to be annotated before resolving coreference so that we could focus on coreference resolution without having to worry about the noise that could be introduced by automatic processes. Obtaining such corpora is difficult; however, manual markable and semantic category information can be replaced by their automated counterparts at the expense of some system performance.
We presented MCORES, a coreference resolution system that is modeled after RESOLVE but includes significant expansions to the original feature set. Our evaluation of coreference resolution in the clinical domain found token overlap to be a very helpful, but insufficient, feature that can overgenerate chains. With a feature set that enhances token overlap with lexical, syntactic, and semantic information, we showed that MCORES outperformed an in-house baseline and two third-party systems, improving coreference resolution on clinical records.
Contributors: AB is the primary author and was instrumental in designing and developing the work and performed data analyses. PS and OU are the principal investigators for the grant involving the secondary use of clinical data. OU co-designed the experiments, led the data analysis, provided expertise in machine learning, and co-wrote and edited the manuscript. PS provided expertise in data analysis and reviewed and edited the manuscript. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NLM, NHLBI, NIH, or ONC.
Funding: This project was supported in part by award number 2U54LM008748 from the National Institutes of Health (NIH)/National Library of Medicine (NLM) and co-funded by the National Heart, Lung and Blood Institute (NHLBI), by contract number 90TR0002 (SHARP: Secondary Use of Clinical Data) from the Office of the National Coordinator (ONC) for Health Information Technology, and by contract number EB001659 from the National Institute of Biomedical Imaging and Bioengineering.
Competing interests: None.
Ethics approval: This study was conducted with the approval of the institutional review boards of Partners Health Care, MIT, and SUNY at Albany.
Provenance and peer review: Not commissioned; externally peer reviewed.
Data Sharing Statement: Data are available from i2b2.org/NLP.