J Am Med Inform Assoc. 2012 Sep-Oct; 19(5): 875–882.
Published online 2012 May 19. doi:  10.1136/amiajnl-2012-000810
PMCID: PMC3422838

A supervised framework for resolving coreference in clinical records



Objective

To develop a method for the automatic resolution of coreference between medical concepts in clinical records.

Materials and methods

A multiple pass sieve approach utilizing support vector machines (SVMs) at each pass was used to resolve coreference. Information such as lexical similarity, recency of a concept mention, synonymy based on Wikipedia redirects, and local lexical context were used to inform the method. Results were evaluated using an unweighted average of MUC, CEAF, and B3 coreference evaluation metrics. The datasets used in these research experiments were made available through the 2011 i2b2/VA Shared Task on Coreference.


Results

The method achieved an average F score of 0.821 on the ODIE dataset, with a precision of 0.802 and a recall of 0.845. These results compare favorably to the best-performing system, which reported an F score of 0.827 on the dataset, and to the median system F score of 0.800 among the eight teams that participated in the 2011 i2b2/VA Shared Task on Coreference. On the i2b2 dataset, the method achieved an average F score of 0.906, with a precision of 0.895 and a recall of 0.918, compared to the best F score of 0.915 and the median of 0.859 among the 16 participating teams.


Discussion

Post hoc analysis revealed significant performance degradation on pathology reports. The pathology reports were characterized by complex synonymy and very few patient mentions.


Conclusion

The use of several simple lexical matching methods had the most impact on achieving competitive performance on the task of coreference resolution. Moreover, the ability to detect patients in electronic medical records helped to improve coreference resolution more than other linguistic analysis.

Keywords: Natural language processing, clinical informatics, medical records systems, computerized, semantic relations, statistical learning, machine learning, predictive modeling, privacy technology

Background and significance

The adoption of electronic medical records (EMRs) has enabled the use of automatic methods for analyzing, reviewing, and querying patient records. Natural language processing (NLP) technology contributes to many of the automatic analysis methods that provide a wide variety of applications. This is due to the fact that many EMRs contain an unstructured, narrative portion where important medical information is recorded. However, most NLP techniques have been traditionally developed for processing very different discourse domains (eg, news), and in very different genres (eg, financial or political). For the medical domain, NLP techniques need to take into account significant semantic information available in various medical ontologies. Moreover, the narratives are written in quite different styles than for other domains, imposing a variety of processing techniques that capture the pragmatics of clinical discourse.

As clinical narratives repeatedly refer to the same concept, special NLP techniques need to be developed to resolve references. Mentions of people (eg, the patient, the doctor), tests (eg, x-ray, blood count), treatments (eg, drugs, surgeries, therapy), and problems (eg, diabetes, atrial fibrillation) need to be resolved to specific entities. Existing NLP systems can accurately identify medical concepts in EMRs;1 however, those systems do not identify whether several concept mentions actually refer to the same concept. For example, in the sentence ‘The patient's cardiovascular status has been stable throughout his CMED CSRU stay,’ the concepts ‘The patient’ and ‘his’ refer to the same person. In other words, these concepts are coreferential. Automatically resolving such coreferences enables the consolidation of medical information that would otherwise appear unrelated.

Coreference resolution is known to be a complex and difficult problem2–6 because it relies on syntactic, semantic, and mostly pragmatic knowledge that is difficult to discern from narratives. However, when documents pertain to a specific domain, for example, medicine, pragmatic knowledge is replaced largely by domain-specific knowledge, which can be modeled by domain-specific concepts.

To exemplify the complexity of knowledge required for resolving coreference in clinical texts, consider the relationship between the underlined phrases involving laparoscopy (a camera-aided procedure through the abdomen) from the following example:

  1. In November 2008 her doctors in Louisiana did an exploratory laparoscopy.
  2. The laparoscopy saw only some minimal endometriosis.
  3. The pain returned and she had a repeat laparoscopy which showed nothing.
  4. She has a new-onset of pain since her surgery.

The two mentions of laparoscopy from sentences 1 and 2 refer to the same procedure. By examining the lexical overlap between the mentions, it can be inferred that they refer to the same treatment. While such an assumption often holds true, the mentions of laparoscopy from sentences 2 and 3 are actually referring to different procedures, because the mention from sentence 3 is qualified with the word ‘repeat.’ Thus, we need syntactic knowledge to understand that ‘repeat’ is an adjective modifying the concept of ‘laparoscopy’ and semantic knowledge to know that a ‘repeat laparoscopy’ refers to a second procedure.

A different type of semantic knowledge must be used to determine that the concepts ‘laparoscopy’ and ‘surgery’ in sentences 3 and 4 are referring to the same procedure. In this instance the reader must understand that laparoscopy is a type of surgery and the doctor has referred back to the same concept using a generalization. Hence, achieving the most accurate resolution of coreference in clinical records automatically requires the incorporation of multiple forms of linguistic knowledge.

Performing automatic coreference resolution provides valuable information when extracting knowledge from clinical records. In the example sentences above, simply knowing how many laparoscopies the patient has undergone requires knowledge about which mentions of laparoscopy are referring to the same procedure and which ones are distinct. Furthermore, with coreference information, details extracted about individual concept mentions can be merged to form a more complete picture, such as the fact that a November 2008 laparoscopy revealed only minimal endometriosis.

The 2011 i2b2/VA Shared Task on Coreference focused on evaluating techniques for clinical coreference resolution, providing both a training and testing dataset. We use these datasets for evaluation and describe them further in the section on Materials and methods.

Related work

Coreference resolution has been studied for years in the NLP literature.2 5 7 8 Approaches have included both supervised4 9–11 and unsupervised methods.12–15 Bengtson and Roth9 showed that a detailed focus on high quality features outperforms more sophisticated models in a supervised setting. Haghighi and Klein12 were the first to report that a generative model which jointly models entities across multiple documents can perform at a state-of-the-art level with very little supervision. Nicolae and Nicolae16 introduced BestCut, which treats coreference resolution as a graph cutting problem, achieving state-of-the-art performance.

A common approach to coreference resolution involves determining the best antecedent for every mention. Our approach instead makes decisions for individual pairs of concepts, possibly determining that a single concept is coreferential with many other concepts. Ng and Cardie4 review several methods which choose a single best previous mention. These include Closest-Link, which makes pair-wise decisions about coreference, but only keeps the most recent antecedent in the final determination. Another such method is the Best-Link strategy, which generates scores for every antecedent of a mention. The mention is linked only with the highest scoring antecedent, and only if that score is above a threshold. More recently, Raghunathan et al15 reported on a multiple pass sieve approach, which we describe later.

Zheng et al provide a good review of the literature on coreference resolution in the clinical domain.17 Wang et al evaluated pronominal coreference for the words ‘it,’ ‘this,’ and ‘that’ within 1000 sentences taken from clinical text.18 Using a rule-based approach they achieved results ranging from 90% to 94%. He et al19 studied coreference in hospital discharge summaries involving five types of entities using a supervised C4.5 decision tree classifier and a carefully selected set of features. Previous results are also available on the ODIE dataset which comprises one of our evaluation corpora. Zheng et al report on the results of a support vector machine (SVM)-based approach trained on syntactic, semantic, and surface features.20 Research on biomedical literature, particularly biomedical scholarly articles, has also discussed the coreference problem. However, such articles usually focus on coreference resolution for drugs,21 genes,22 proteins, and bio-processes,23 which occur less frequently in the clinical domain.

Previous work on coreference resolution suggests that supervised approaches do well when a sizeable training corpus is available. We therefore chose to incorporate the knowledge derived from the annotations on the training data through the use of supervised classifiers in our approach.

Materials and methods

i2b2/VA 2011 Shared Task on Coreference

The 2011 i2b2/VA Shared Task on Coreference follows a series of annual shared tasks having different NLP focuses. The 2010 task made available a corpus of annotated concepts and their concept types. The 2011 task extends this by additionally annotating coreference between the concepts. Concepts have been annotated in two different ways: (1) according to the i2b2 guidelines, and (2) according to the ODIE24 guidelines. The i2b2 guidelines specify five concept types: problem, treatment, test, person, and pronoun. The ODIE guidelines were developed independently and contain concept types such as people, disease-or-syndrome, sign-or-symptom, anatomical-site, procedure, organ-or-tissue-function, laboratory-test-or-result, other, and none.

Multiple pass sieve strategy for resolving coreference

We use a multiple pass sieve approach similar to Raghunathan et al.15 This method involves multiple independent models for resolving coreference which are executed in succession. Each model (or pass) makes coreference decisions on pairs of concepts from the text. Given a pair of concepts, a model can decide either that the two concepts from a pair are coreferential or that they are not. Rather than considering all possible pairs of concepts from the text, each model has its own selection criteria for choosing a subset of those pairs on which to make decisions. For instance, one pass identifies coreferential pairs of mentions which are synonymous (eg, ‘GERD’ and ‘Gastro-esophageal reflux disease’), while another pass identifies coreferential pairs of mentions whose strings are identical (eg, ‘attending physician’ and ‘attending physician’). Unlike the approach in Raghunathan et al,15 each of our passes uses a machine-learned classifier to identify which of the pairs of mentions are actually coreferential. For instance, rather than assuming that all mentions sharing the same exact text are coreferential, we train a binary classifier to make the final determination.

As seen in figure 1, coreference resolution is performed by executing the passes sequentially. Each pass makes use of the coreference decisions output by the previous passes. The final set of coreference chains is then the combination of running individual coreference passes that are specialized at identifying specific types of coreference. All of the passes use an SVM classifier provided by the LIBLINEAR library25 with default settings. With the exception of pass 1, all of the passes share a common set of features describing properties of a pair of concepts, with some passes adding additional features. These features are used by the classifier to make a determination about whether the pair of concepts is coreferential. Each pass addresses a different coreference problem. The creation of these passes was data driven: we examined how coreference occurs in clinical records and created the different passes to resolve the different types of coreference that we observed.

Figure 1
Architecture for the multiple pass sieve strategy. SVM, support vector machine.
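The control flow described above can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the names `SievePass`, `run_sieve`, and `same_text_pairs` are hypothetical, and a plain function stands in for the trained SVM of each pass.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple

Pair = Tuple[int, int]  # indices of two concepts, in document order


@dataclass
class SievePass:
    """One pass: a pair-selection rule plus a binary decision function."""
    name: str
    select_pairs: Callable[[List[str], Set[Pair]], Iterable[Pair]]
    is_coreferent: Callable[[List[str], Pair], bool]  # stand-in for a trained SVM


def run_sieve(concepts: List[str], passes: List[SievePass]) -> Set[Pair]:
    """Run the passes in order; each pass can see the links made so far."""
    links: Set[Pair] = set()
    for p in passes:
        for pair in p.select_pairs(concepts, links):
            if pair not in links and p.is_coreferent(concepts, pair):
                links.add(pair)  # links are only ever added, never removed
    return links


def same_text_pairs(concepts, links):
    """Toy selection rule: pairs whose lower-cased text is identical."""
    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            if concepts[i].lower() == concepts[j].lower():
                yield (i, j)


concepts = ["attending physician", "GERD", "Attending physician"]
toy_pass = SievePass("same-text", same_text_pairs, lambda c, pair: True)
links = run_sieve(concepts, [toy_pass])
print(links)
```

Keeping selection and classification as separate callables mirrors the split in the text between a pass's selection criteria and its classifier.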

Pass 1: identification of patient mentions

The first pass differs from the others because it classifies individual concepts rather than pairs of concepts. In this pass, we utilize a classifier which identifies concepts referring to the patient. These mentions include ‘the patient,’ ‘he,’ ‘she,’ ‘the infant,’ etc. We train an SVM classifier to identify whether each concept refers to the patient. The manually annotated concepts and coreference chains provided for the patient records in the training data do not identify whether each concept refers to the patient. Therefore, we assume that the largest coreference chain involving person (i2b2) or people (ODIE) concepts is the chain for references to the patient. This assumption held in the documents which we inspected. An exception is the pathology reports in the ODIE dataset, which we discuss further in the Results section. All concepts belonging to this largest chain can then be used as positive training instances for a classifier, with the remaining person/people concepts being used as negative instances. The trained classifier is used to identify all concepts which are mentions of the patient in the testing data. Those concepts are then combined into a single coreference chain.
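As a rough illustration of how training labels could be derived under the largest-chain assumption, consider the sketch below. The function name `patient_training_labels`, the data layout (concept ids mapped to text and type), and the tie-breaking between equally long chains are assumptions made for this example only.

```python
def patient_training_labels(concepts, chains, person_types=("person", "people")):
    """Label concepts as patient (1) or non-patient (0) mentions.

    concepts: dict mapping concept id -> (text, concept type)
    chains:   list of gold coreference chains, each a list of concept ids
    """
    # Chains that involve at least one person/people concept.
    person_chains = [
        ch for ch in chains if any(concepts[c][1] in person_types for c in ch)
    ]
    # Assumption from the text: the largest such chain refers to the patient.
    # Ties are broken by order of appearance (our choice, not the paper's).
    patient_chain = set(max(person_chains, key=len, default=[]))

    labels = {}
    for cid, (_, ctype) in concepts.items():
        if cid in patient_chain:
            labels[cid] = 1          # positive instance
        elif ctype in person_types:
            labels[cid] = 0          # remaining person concepts are negatives
    return labels


concepts = {
    0: ("The patient", "person"),
    1: ("he", "pronoun"),
    2: ("Dr John Doe", "person"),
    3: ("his", "pronoun"),
    4: ("aortic stenosis", "problem"),
    5: ("this condition", "problem"),
}
chains = [[0, 1, 3], [4, 5]]
labels = patient_training_labels(concepts, chains)
print(labels)
```

Concepts that are neither in the patient chain nor of a person type (here the two problem mentions) receive no label and are simply left out of training.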

The classifier uses several features: the full text of the concept, individual tokens from the concept, the three tokens before/after the concept, character trigrams from the concept, the section header, and individual tokens from the section header. To identify sections within the clinical record, we simply treat any line ending in a colon as a section header rather than using more sophisticated techniques.26 27
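The feature list and the colon-based header heuristic above might look like the following in code. This is a sketch under assumptions: the string feature names, the lower-casing, and whitespace tokenization are choices made for illustration and are not specified in the text.

```python
def section_headers(lines):
    """For each line, return the most recent line that ended in a colon."""
    current = ""
    headers = []
    for line in lines:
        if line.rstrip().endswith(":"):
            current = line.strip().rstrip(":")
        headers.append(current)
    return headers


def char_trigrams(text):
    """All overlapping three-character substrings of the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def patient_features(tokens, start, end, header, window=3):
    """Binary features for the concept tokens[start:end] (pass 1)."""
    concept = tokens[start:end]
    text = " ".join(concept).lower()
    feats = {f"text={text}"}
    feats |= {f"tok={t.lower()}" for t in concept}
    feats |= {f"before={t.lower()}" for t in tokens[max(0, start - window):start]}
    feats |= {f"after={t.lower()}" for t in tokens[end:end + window]}
    feats |= {f"tri={g}" for g in char_trigrams(text)}
    feats.add(f"header={header.lower()}")
    feats |= {f"header_tok={t.lower()}" for t in header.split()}
    return feats


lines = [
    "HISTORY OF PRESENT ILLNESS:",
    "The patient is a 55-year-old male",
    "Plan:",
    "Follow up",
]
headers = section_headers(lines)
tokens = lines[1].split()
feats = patient_features(tokens, 0, 2, headers[1])
```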

While this initial pass operates on individual concepts, the remaining passes operate on pairs of concepts and identify whether each pair is coreferential or not.

Pass 2: coreference resolution between concepts with the same text

Selection criteria: concepts which share the same text

This pass will only consider pairs of concepts whose texts are the same. A loose definition of ‘same’ is used, where case is ignored, and Porter stemming is performed. Also, initial determiners and possessive pronouns are ignored. Under this relaxation, for instance, the concepts ‘Propofol drips’ and ‘his propofol drip’ would be considered the same. While previous coreference approaches have considered exact string matches to be coreferential,15 we found that training a classifier to filter some pairs of concepts improved performance. The features used by the classifier are described in the next section.
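The loose notion of ‘same’ text can be sketched as below. Note the swapped-in stemmer: to keep the example self-contained, `crude_stem` only strips a plural ‘s’ and is not the Porter algorithm used in the paper; a real Porter stemmer could be passed in through the `stem` argument. The determiner and possessive lists are also illustrative assumptions.

```python
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}
POSSESSIVES = {"my", "your", "his", "her", "its", "our", "their"}


def crude_stem(token):
    """Placeholder stemmer: strips a plural 's'. NOT the Porter algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(text, stem=crude_stem):
    """Lower-case, drop initial determiners/possessives, stem each token."""
    tokens = text.lower().split()
    # Keep at least one token so a bare pronoun is not reduced to nothing.
    while len(tokens) > 1 and tokens[0] in DETERMINERS | POSSESSIVES:
        tokens = tokens[1:]
    return tuple(stem(t) for t in tokens)


def same_text(a, b, stem=crude_stem):
    """Selection test for pass 2: do the two concepts normalize identically?"""
    return normalize(a, stem) == normalize(b, stem)


print(same_text("Propofol drips", "his propofol drip"))
print(same_text("The laparoscopy", "a repeat laparoscopy"))
```

The second call shows why a classifier is still useful after selection: modifiers such as ‘repeat’ already prevent a match here, but pairs that do match textually may still refer to different events.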

Base set of features for passes 2 through 8

Passes 2 through 8 share a common set of features, although some of the passes have extra features used only by that pass. Each feature describes some property of a pair of concepts. Using these features, a classifier determines whether the two concepts are coreferential or not. The features are detailed in table 1. Feature F3 enables the classifier to determine whether the concepts share the same concept type. Features F4 and F5 can be particularly useful for pronouns such as ‘which’ or ‘This,’ which tend to be coreferential with immediately adjacent concepts. F6 and F7 capture individual tokens in the concepts. Hence, the properties of coreference the classifier learns about ‘small cell lung cancer’ can also be applied to ‘cervical cancer’ because they both have the token ‘cancer.’ For example, both concepts have a high affinity for linking with mentions of ‘tumors.’

Table 1
Set of features used by passes 2 through 8

Feature F8 provides information about the context of the two concepts. If there are more than 10 tokens between the concepts, this feature returns an empty set to avoid confusing the classifier with many non-contextual tokens. Feature F9 buckets the number of tokens between the concepts among: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100}, with other distances being assigned to the closest bucket. Bucketing the distances in this way provides the classifier with more examples per feature value.
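A minimal sketch of features F8 and F9 as described above follows. The text does not say how ties between two equally close buckets are resolved, so breaking ties toward the smaller bucket is an assumption, as are the function names and the lower-casing of context tokens.

```python
BUCKETS = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100)


def context_tokens(between, max_tokens=10):
    """F8: tokens between the two concepts, or an empty set if too far apart."""
    if len(between) > max_tokens:
        return set()
    return {t.lower() for t in between}


def bucket_distance(n_tokens):
    """F9: map a token distance to the closest bucket.

    Ties (eg, 15 lies midway between 10 and 20) go to the smaller bucket;
    this tie-breaking rule is our assumption.
    """
    return min(BUCKETS, key=lambda b: (abs(b - n_tokens), b))


for distance in (3, 14, 16, 74, 500):
    print(distance, "->", bucket_distance(distance))
```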

We have also included three features related to words which are shared between the two concepts. For instance, ‘small cell lung cancer’ and ‘lung cancer’ share the words ‘lung’ and ‘cancer.’ One feature (F10) indicates all of the shared words. Another feature (F11) indicates the number of shared words, which would be two in this case.

Sharing a common word such as ‘the’ or ‘and’ is less important for coreference detection than sharing a word such as ‘lung.’ This importance is approximated by a third feature (F12): the log term frequency of the shared word in the 2011 PubMed Central corpus of scholarly biomedical articles, rounded down to the nearest integer. For instance, the term ‘lung’ has a log term frequency of 17, while ‘the’ has a log term frequency of 25, and a rare term such as ‘catecholamine’ has a log term frequency of 11. Smaller values indicate less frequent terms, and sharing a less frequent term is stronger evidence that the two concepts are related.
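The three shared-word features can be illustrated as follows. Several details here are assumptions: the text does not state the base of the logarithm (base 2 is used below), nor how F12 is represented when several words are shared (one value per word is emitted here). The term counts are invented placeholders, not real PubMed Central statistics.

```python
import math

# Hypothetical corpus counts, chosen only to give plausible magnitudes.
TERM_COUNTS = {"the": 50_000_000, "cancer": 1_000_000, "lung": 200_000}


def shared_word_features(a, b, term_counts=TERM_COUNTS):
    """F10: shared words; F11: their number; F12: floored log term frequency."""
    shared = set(a.lower().split()) & set(b.lower().split())
    log_tf = {
        w: math.floor(math.log2(term_counts[w]))  # base 2 is an assumption
        for w in shared
        if term_counts.get(w, 0) > 0
    }
    return {"F10": shared, "F11": len(shared), "F12": log_tf}


feats = shared_word_features("small cell lung cancer", "lung cancer")
print(feats)
```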

Pass 3: identification of coreference between consecutive concepts of the same concept type

Selection criteria: consecutive pairs of concepts of the same concept type

Each concept is paired with the next concept in the text having the same concept type. In addition, pairs of concepts where at least one concept is a pronoun or was detected as a patient mention are disregarded. For example, in the following sentence several concepts have been marked:

[This]pronoun is a 55-year-old male with [critical aortic stenosis]problem who was referred to [Dr John Doe]person for discussion for [surgical options]treatment to treat [this condition]problem.

Pass 3 would extract the following pair of concepts for consideration:

([critical aortic stenosis]problem, [this condition]problem).

This pass implements the assumption that neighboring concepts of the same type are more likely to be coreferential.
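The selection rule for pass 3 can be reproduced on the example sentence with a few lines of code. The function name `pass3_pairs` and the representation of patient mentions as a set of concept indices are assumptions made for this sketch.

```python
def pass3_pairs(concepts, patient_ids=frozenset()):
    """Pair each concept with the next concept of the same type.

    concepts:    list of (text, concept type) in document order
    patient_ids: indices of concepts detected as patient mentions in pass 1
    """
    pairs = []
    for i, (_, ctype) in enumerate(concepts):
        nxt = next(
            (j for j in range(i + 1, len(concepts)) if concepts[j][1] == ctype),
            None,
        )
        if nxt is None:
            continue
        # Pairs involving a pronoun or a patient mention are disregarded.
        if ctype == "pronoun" or i in patient_ids or nxt in patient_ids:
            continue
        pairs.append((concepts[i][0], concepts[nxt][0]))
    return pairs


sentence_concepts = [
    ("This", "pronoun"),
    ("critical aortic stenosis", "problem"),
    ("Dr John Doe", "person"),
    ("surgical options", "treatment"),
    ("this condition", "problem"),
]
print(pass3_pairs(sentence_concepts))
```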

Pass 4: resolving coreference when concepts are Wikipedia aliases

Selection criteria: any pairs of concepts that are mapped to the same article in Wikipedia

Wikipedia articles can have many aliases (redirects) to account for alternative spellings, misspellings, and synonymous terms. For example, the terms ‘Hepatitis C’ and ‘Hep C’ are both aliases within Wikipedia for the article about hepatitis C. Therefore, if a record contains both terms, the pair of concepts will be considered for coreference resolution in this pass. Other examples caught by this pass include ‘A fib’ and ‘Atrial fibrillation,’ ‘Vtach’ and ‘Ventricular tachycardia,’ as well as ‘COPD’ and ‘Chronic obstructive pulmonary disease.’

This pass includes a feature indicating the canonical title of the Wikipedia article associated with the pair of concepts (eg, hepatitis C, atrial fibrillation). We also include a feature indicating the number of tokens in that Wikipedia article title. The intuition behind this feature is that longer titles are more likely to be correctly mapped.
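A small sketch of the alias lookup and the two extra features is given below. The `REDIRECTS` table is a hand-written stand-in for a redirect map extracted from Wikipedia; how the authors built or queried that map is not described, so the lookup by lower-cased string is an assumption.

```python
# Hypothetical redirect table: lower-cased alias -> canonical article title.
REDIRECTS = {
    "hep c": "Hepatitis C",
    "hepatitis c": "Hepatitis C",
    "a fib": "Atrial fibrillation",
    "atrial fibrillation": "Atrial fibrillation",
    "copd": "Chronic obstructive pulmonary disease",
    "chronic obstructive pulmonary disease": "Chronic obstructive pulmonary disease",
}


def wikipedia_alias_features(a, b, redirects=REDIRECTS):
    """Return pass 4 features if both concepts map to the same article."""
    title_a = redirects.get(a.lower())
    title_b = redirects.get(b.lower())
    if title_a is None or title_a != title_b:
        return None  # pair does not meet the selection criteria
    return {
        "wiki_title": title_a,
        "wiki_title_tokens": len(title_a.split()),
    }


print(wikipedia_alias_features("Hep C", "Hepatitis C"))
print(wikipedia_alias_features("Hep C", "COPD"))
```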

Pass 5: recognition of coreferential consecutive concepts of the same animacy

Selection criteria: pairs of concepts that are either both animate or both inanimate

The concepts must also be consecutive (allowing for other concepts between them of the opposite animacy). The animacy of a concept was determined solely on the basis of concept type, where types ‘person’ and ‘people’ were marked as animate and all other concepts were marked as inanimate. Pairs of concepts involving at least one pronoun were ignored in this pass.

Pass 6: identification of coreference between concepts with shared prefixes

Selection criteria: pairs of concepts which have a common prefix of at least five characters

We assume that concepts which start with the same characters (a very simple type of word stem) are more likely to be the same entity, and therefore coreferential. Only concepts with no intervening concepts of the same prefix are paired. One such example is the concepts ‘hypotensive’ and ‘hypotension.’ Stemming algorithms may not stem these two words identically. However, the UMLS SPECIALIST Lexicon is capable of enumerating morphological derivatives, including ‘hypotensive/hypotension,’ and would be a better resource for identifying related concepts than relying on prefixes. The main benefits of an approach based on prefixes are speed and simplicity. In future work, we plan to integrate the UMLS SPECIALIST Lexicon to gain further improvements in performance.

This pass uses two features beyond the base set of features: the sets of tokens present in the section header for the first and second concept, respectively.
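The prefix test itself is simple enough to show directly. The sketch below assumes case-insensitive comparison, which the text does not specify, and the function names are illustrative.

```python
def common_prefix_len(a, b):
    """Number of leading characters the two strings share (case-insensitive)."""
    n = 0
    for x, y in zip(a.lower(), b.lower()):
        if x != y:
            break
        n += 1
    return n


def shares_prefix(a, b, min_len=5):
    """Selection test for pass 6: common prefix of at least five characters."""
    return common_prefix_len(a, b) >= min_len


print(shares_prefix("hypotensive", "hypotension"))
print(shares_prefix("hypotension", "hypertension"))
print(shares_prefix("hyperglycemia", "hypertension"))
```

The last pair also passes the test even though the two terms are unrelated, which illustrates why the selected pairs are still filtered by a classifier rather than linked directly.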

Pass 7: resolution of coreference between nearby concepts

Selection criteria: consecutive pairs of concepts which have at most seven tokens between them

This pass relies on the intuition that mentions which are close to each other are more likely to be coreferential. Unlike in pass 3, however, no restriction is placed on concept type. One additional feature is added in this pass, representing all word bigrams found between the mentions. A word bigram consists of two adjacent words. This pass was added when we noticed many phrases indicative of coreference such as ‘[concept], which,’ where ‘which’ and ‘[concept]’ are coreferential. Another example would be ‘This [concept]’ where ‘This’ and ‘[concept]’ are coreferential. Rather than requiring individual rules for every similar case, the machine learning classifier is able to learn the relevant patterns which indicate coreference.
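The bigram feature for pass 7 could be computed as in the sketch below. Concept spans are given as token offsets, which is an assumed representation, and whether the bigrams may include tokens of the concepts themselves is not stated in the text; only tokens strictly between the two mentions are used here.

```python
def between_bigrams(tokens, first_span, second_span, max_gap=7):
    """Word bigrams between two concepts, or None if they are too far apart.

    first_span / second_span: (start, end) token offsets, end exclusive.
    """
    between = tokens[first_span[1]:second_span[0]]
    if len(between) > max_gap:
        return None  # pair does not meet the pass 7 selection criteria
    lowered = [t.lower() for t in between]
    return list(zip(lowered, lowered[1:]))


tokens = "critical aortic stenosis who was referred to Dr John Doe".split()
bigrams = between_bigrams(tokens, (0, 3), (7, 10))
print(bigrams)
```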

Pass 8: identification of coreferential concepts which share at least one word

Selection criteria: pairs of concepts which have at least one word in common between them

This is a relaxation of pass 2, which requires the entire strings of both mentions to be the same. The shared word must not be a stopword or a single character. In addition, pairs of concepts which meet the selection criteria for pass 2 are not considered in this pass.

Training the classifiers

The classifiers used for all passes are trained independently of each other. All pairs of concepts from the training data meeting the selection criteria for a pass are used as training instances. Two concepts will be considered coreferential if any of the passes determines they are coreferential. During testing, after all passes have been executed, coreference chains are formed from all concepts which can be considered coreferential, using the transitive closure over pairs of concepts.
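Forming chains by transitive closure over pairwise links is commonly done with a union-find (disjoint-set) structure. The following sketch shows one way to do it; the paper does not say which algorithm was used, and dropping singleton concepts from the output is a choice made here.

```python
def build_chains(n_concepts, links):
    """Group concepts into chains via the transitive closure of the links.

    n_concepts: number of concepts, identified by indices 0..n_concepts-1
    links:      iterable of (i, j) pairs judged coreferential by any pass
    """
    parent = list(range(n_concepts))

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in range(n_concepts):
        groups.setdefault(find(i), []).append(i)
    # Concepts linked to nothing are not reported as chains.
    return [members for members in groups.values() if len(members) > 1]


chains = build_chains(6, [(0, 1), (1, 4), (2, 3)])
print(chains)
```

Concepts 0 and 4 end up in the same chain even though no pass linked them directly, which is exactly the effect of taking the transitive closure.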

Automatic extraction of concepts

In order to determine the feasibility of using our coreference algorithm on completely unseen data, we ran a series of experiments with automatically annotated concepts. We utilized a pre-existing concept identification system,28 which, given training data, automatically chooses the best set of features for concept identification. The system was trained on the ODIE concept data under two separate configurations: (1) automatically annotated concepts based on the i2b2 concept types (problem, treatment, test, person) were available to the ODIE concept classifier (which uses nine concept types), and (2) no automatic annotations were available to the ODIE concept classifier. The automatic concepts were represented in an IOB-style feature. The features selected for each configuration are shown in table 2. For more details on the individual features or feature selection process, see Roberts and Harabagiu.28

Table 2
Features selected for the automatic concept extraction method


Results

We evaluate our coreference approach by training the classifiers on the training portion of data made available during the i2b2 2011 Challenge, and by evaluating on the testing portion of data provided by that challenge. Both training and testing datasets were divided into two subsets, one which had been annotated under the ODIE standard, and another which had been annotated under the i2b2 standard. Table 3 summarizes the data. The records came from four hospital systems: Beth, Partners, Mayo Clinic, and the University of Pittsburgh Medical Center. No records were annotated using both i2b2 and ODIE guidelines. The ODIE subsets of the data were considerably smaller than the i2b2 subsets.

Table 3
Statistics about the training and testing data

We performed several experiments to evaluate our method on these corpora. In the first experiment, we evaluated how our method performed on only the i2b2 portion of the data. Likewise, we experimented on only the ODIE portion of the data. Finally, we performed an experiment to evaluate our method when trained on both portions of the data. The i2b2 2011 Challenge used four official scoring metrics: B3,29 MUC,30 BLANC,31 and CEAF.32 The challenge also included an official overall score which was the unweighted average of B3, MUC, and CEAF, referred to in the tables below as Avg.
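To make the scoring concrete, the sketch below computes the B3 metric in its usual mention-based form and the unweighted average used as the overall score. It is not the official challenge scorer: treating a mention that is missing from one side as a singleton is an assumption, and scorers differ on this point.

```python
def b_cubed(gold_chains, system_chains):
    """Mention-averaged B3 precision, recall, and F score."""
    gold = {m: frozenset(c) for c in gold_chains for m in c}
    system = {m: frozenset(c) for c in system_chains for m in c}
    mentions = set(gold) | set(system)
    precision = recall = 0.0
    for m in mentions:
        g = gold.get(m, frozenset([m]))    # unseen mention: singleton (assumption)
        s = system.get(m, frozenset([m]))
        overlap = len(g & s)
        precision += overlap / len(s)
        recall += overlap / len(g)
    precision /= len(mentions)
    recall /= len(mentions)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


def overall_average(b3, muc, ceaf):
    """Unweighted average of the three metric scores (the 'Avg' column)."""
    return (b3 + muc + ceaf) / 3


gold = [[1, 2, 3], [4, 5]]
system = [[1, 2], [3, 4, 5]]
p, r, f = b_cubed(gold, system)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this toy case, moving mention 3 into the wrong chain costs both precision and recall, since B3 scores every mention against the chain it was placed in.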

Table 4 shows the results of our evaluation. The scores on the smaller ODIE corpus are significantly lower than the scores on the i2b2 corpus. In both cases, training on all available data shows a small improvement. We analyzed the results by both concept types (table 5) and record types (table 6) in an effort to determine the reason for the lower scores on the ODIE data. Table 5 shows that F scores are lower for ODIE data across all concept types. This indicates that a difference in concept types was likely not the cause of the performance discrepancies between the i2b2 and ODIE datasets.

Table 4
Evaluation results for tasks 1B and 1C
Table 5
Evaluation of coreference results by concept type
Table 6
Performance of coreference resolution on subsets of the clinical records, by type of record

Furthermore, table 6 shows the performance of the automatic coreference resolution approach broken down by record type. The performance on the ODIE dataset is comparable to the i2b2 set on all record types except pathology reports. However, these constitute almost a third of the records and thus bring the overall score down significantly. The pathology reports are characteristically different from the other report types. For example, shown below are the chains for a single pathology report:

  1. Invasive, grade 3 (of 4) adenocarcinoma arising from the tubular adenoma ‖ Neoplasm
  2. Rectal polyp base ‖ The separately submitted polyp base ‖ adenomatous mucosa
  3. Colon ‖ rectum ‖ rectal ‖ submucosa ‖ a cauterized margin.

The first notable difference between these chains and those in many of the other types of reports is the absence of patient mentions. In discharge and progress notes, the patient is mentioned many times and constitutes a very large fraction of the mentions in coreference chains. Detecting patient mentions is also relatively easy in comparison to detecting other types of coreference. This leads to higher coreference scores for records which mention the patient frequently. The sixth column of table 6 shows the average number of concepts in coreference chains within each type of record. As expected, pathology records have the smallest coreference chains, largely due to the absence of patient mentions. An additional factor affecting the performance on pathology records is the semantic knowledge required to correctly detect coreferential concepts. Each of the three example chains above contains terms which cannot be associated at the lexical level. One must know that a neoplasm (tumor) is caused by cancer and can therefore be used to refer to the cancer (adenocarcinoma) which caused it. These factors also explain why adding the i2b2 data did not significantly help performance when testing on ODIE data. The i2b2 data consist of only discharge and progress notes, which do not contain nearly as many of the long, precise technical terms found in pathology notes.

Table 7 shows how well our method performs when only using some of the passes of the sieve method. The average shown in the last column increases as each pass is executed. The evaluation was performed using the i2b2 portion of the data. Pass 1, which identifies mentions of the patient and links them together, acts as a good baseline with a score only about 11 points lower than the score for all eight passes. This is reasonable because patient mentions usually form the largest coreference chain in each record. Also, other references are limited and usually only contain a few concepts. The second pass, which matches concepts that have the same string, has a very large impact, adding almost eight points to the average F score. Three other passes have an impact of almost a full point each. The first was pass 5, which incorporates information about consecutive concepts of the same animacy. The second was pass 7, which detects short patterns indicative of coreference between nearby concepts. Finally, pass 8 also has a large impact by identifying concepts which share words. Table 8 shows the results of the sieve method when only a single pass is executed. These results give a better idea of the strength of the passes on their own. Passes 1, 2, and 5 performed the strongest. The strong results from passes 2 and 5 are likely a result of their recall-oriented nature and the fact that these passes will link likely patient mentions as well.

Table 7
Results obtained when running only a subset of the coreference passes using the i2b2 portion of the data
Table 8
Results obtained when an individual pass is executed by itself

Table 9 shows the performance of our end-to-end coreference approach on the ODIE test data, using automatically extracted concepts. Two methodologies were used: with and without i2b2-style automatic concepts. It appears that the performance was improved slightly by providing i2b2 concepts, despite the fact that i2b2 and ODIE used different concept types. It is unclear why the exact matching metric is higher for the second case than the partial matching.

Table 9
End to end results on the ODIE test dataset when using automatic concept recognition


Discussion

Our approach achieved encouraging results while at the same time being simple and using mostly lexical features. During our analysis of the medical records and the occurrence of coreference within them, we observed that the majority of the references were of two kinds: (1) references to the patient and (2) nominal coreference. The majority of pronominal references, other than simple constructions such as ‘which’ and ‘that,’ were references to the patient. That is why we chose an approach which first tries to identify all mentions of the patient. Once those mentions have been identified, a large portion of the remaining references can be identified using fairly simple lexical features.

Although our approach did not incorporate much syntactic, semantic, or pragmatic information, it is quite likely that such information would lead to even better results. Furthermore, information from existing medical ontologies such as UMLS33 can be incorporated through the addition of a new pass which links concepts that are semantically similar. The limited use of external resources should enable the application of our approach in other domains with minimal reconfiguration. The various passes are not domain specific, relying primarily on vicinity and lexical features. Even pass 1, which detects patient mentions, can be applied to other domains in which a single entity is mentioned predominantly in a document.

The order in which we performed the passes roughly follows the rule that more precise passes should be performed first, as suggested by Raghunathan et al.15 We followed this rule only roughly because our method, unlike theirs, does not pass attributes about entities across passes, so the order of the passes is not nearly as important. It is possible that further experiments on the ordering of passes could lead to additional gains in performance. A limitation to extending this method is that the passes are all trained independently and the pairs of mentions linked across all passes are combined. As a result, adding new passes cannot break the coreference chains produced by earlier passes; the chains can only be made longer. This limitation could be remedied by incorporating passes that break chains, or by using a scoring-based approach such as Best-Link instead.
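The chain-growing behavior described above can be illustrated with a minimal sketch. It assumes that each pass simply outputs a list of linked mention pairs and that the pairs from all passes are combined by taking their transitive closure (here with a union-find structure). The function name `merge_passes`, the example mentions, and the three pass outputs are hypothetical and are not taken from our system.

```python
from collections import defaultdict


def merge_passes(mentions, passes):
    """Combine the mention pairs linked by every pass into coreference chains."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the chain representative, halving the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for linked_pairs in passes:
        for a, b in linked_pairs:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # links are only ever added, never removed

    chains = defaultdict(list)
    for m in mentions:
        chains[find(m)].append(m)
    # Singletons are not coreference chains, so they are dropped.
    return sorted(sorted(c) for c in chains.values() if len(c) > 1)


# Hypothetical mentions and pass outputs, for illustration only.
mentions = ["the patient", "she", "a mass", "the mass", "the lesion"]
pass_1 = [("the patient", "she")]      # e.g. a patient-mention pass
pass_2 = [("a mass", "the mass")]      # e.g. a lexical-match pass
pass_3 = [("the mass", "the lesion")]  # e.g. a newly added synonymy pass

print(merge_passes(mentions, [pass_1, pass_2]))
# [['a mass', 'the mass'], ['she', 'the patient']]
print(merge_passes(mentions, [pass_1, pass_2, pass_3]))
# [['a mass', 'the lesion', 'the mass'], ['she', 'the patient']]
```

Under these assumptions, adding the third pass extends the existing chain and leaves every earlier link in place, which is why an incorrect link made by an early pass cannot be undone by a later one.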


Conclusion

We achieved promising results on the task of resolving coreference between concepts in medical records using a simple approach based on a multi-pass sieve that included a machine learning classifier in every pass. While our approach owes its inspiration to an existing method reported by Raghunathan et al,15 we have adapted it in several important ways. The first is the inclusion of a classifier in each pass. The large corpus generously made available by i2b2/VA, University of Pittsburgh Medical Center, and several other institutions allowed a hybrid approach incorporating machine learning to outperform a purely rule-based approach. Another significant departure from the existing approach is that our method makes coreference decisions at the level of pairs of concepts, rather than finding a single antecedent for every mention. The information used by our method is primarily lexical. However, information about alternative spellings and synonyms of concepts from Wikipedia was also incorporated. Given the encouraging results obtained by this approach, our immediate future work will explore adding more semantic information (UMLS SPECIALIST Lexicon, SNOMED CT,34 and distributional similarity techniques35) along with pragmatic information.36
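The difference between pair-level decisions and selecting a single antecedent per mention can be sketched as follows. This is an illustration under stated assumptions, not our implementation: the pair scores are invented stand-ins for classifier decision values, a score above a threshold of 0 is assumed to mean "coreferent", and the function names `link_all_pairs` and `link_best` are hypothetical.

```python
def link_all_pairs(scores, threshold=0.0):
    """Pair-level decisions: keep every (antecedent, mention) pair above the threshold."""
    return sorted(pair for pair, s in scores.items() if s > threshold)


def link_best(scores, threshold=0.0):
    """Single-antecedent (Best-Link style): keep only the top-scoring antecedent per mention."""
    best = {}
    for (antecedent, mention), s in scores.items():
        if s > threshold and (mention not in best or s > best[mention][1]):
            best[mention] = (antecedent, s)
    return sorted((a, m) for m, (a, _) in best.items())


# Invented scores for (antecedent, mention) pairs, for illustration only.
scores = {
    ("a mass", "the mass"): 1.4,
    ("a mass", "the lesion"): 0.2,
    ("the mass", "the lesion"): 0.9,
    ("a cyst", "the lesion"): 0.3,
}

print(link_all_pairs(scores))
# [('a cyst', 'the lesion'), ('a mass', 'the lesion'), ('a mass', 'the mass'), ('the mass', 'the lesion')]
print(link_best(scores))
# [('a mass', 'the mass'), ('the mass', 'the lesion')]
```

In this toy example the pair-level strategy links ‘the lesion’ to both ‘a mass’ and ‘a cyst’, which would merge two chains once the pairs are combined, whereas the single-antecedent strategy keeps only the strongest link for each mention.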


Acknowledgments

We would like to acknowledge the efforts of the organizers of the 2011 i2b2 Challenge, who made this work possible.


Funding: The 2011 i2b2/VA challenge and the workshop are funded in part by grant number 2U54LM008748 on Informatics for Integrating Biology to the Bedside from the National Library of Medicine. This challenge and workshop are also supported by the resources and facilities of the VA Salt Lake City Health Care System with funding support from the Consortium for Healthcare Informatics Research (CHIR), VA HSR HIR 08-374 and the VA Informatics and Computing Infrastructure (VINCI), VA HSR HIR 08-204, and the National Institutes of Health, National Library of Medicine under grant number R13LM010743-01. MedQuist, the largest transcription technology and services vendor, co-sponsored the 2011 i2b2/VA challenge meeting at AMIA.

Competing interests: None.

Provenance and peer review: Not commissioned; externally peer reviewed.


References

1. Uzuner O, South BR, Shen S, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc 2011;18:552–6
2. Hobbs JR. Resolving pronoun references. Lingua 1978;44:339–52
3. Ng V. Semantic class induction and coreference resolution. Annual Meeting-Association for Computational Linguistics; Prague, Czech Republic. Stroudsburg, PA: Association for Computational Linguistics, 2007;45
4. Ng V, Cardie C. Improving machine learning approaches to coreference resolution. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics; University of Pennsylvania, USA. Stroudsburg, PA: Association for Computational Linguistics, 2002
5. Baldwin B. CogNIAC: high precision coreference with limited knowledge and linguistic resources. Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts. Stroudsburg, PA: Association for Computational Linguistics, 1997
6. Yang X, Su J. Coreference resolution using semantic relatedness information from automatically discovered patterns. Annual Meeting-Association for Computational Linguistics; Prague, Czech Republic. Stroudsburg, PA: Association for Computational Linguistics, 2007;45
7. Stoyanov V, Cardie C, Gilbert N, et al. Coreference resolution with Reconcile. Proceedings of the ACL 2010 Conference Short Papers; Uppsala, Sweden. Stroudsburg, PA: Association for Computational Linguistics, 2010
8. Recasens M, Marquez L, Sapena E, et al. SemEval-2010 Task 1: coreference resolution in multiple languages. Proceedings of the 5th International Workshop on Semantic Evaluation; Uppsala, Sweden. Stroudsburg, PA: Association for Computational Linguistics, 2010
9. Bengtson E, Roth D. Understanding the value of features for coreference resolution. Proceedings of the Conference on Empirical Methods in Natural Language Processing; Waikiki, Honolulu, Hawaii. Stroudsburg, PA: Association for Computational Linguistics, 2008
10. Soon WM, Ng HT, Lim DCY. A machine learning approach to coreference resolution of noun phrases. Computational linguistics. Cambridge, MA: MIT Press, 2001;27
11. Versley Y, Ponzetto SP, Poesio M, et al. BART: a modular toolkit for coreference resolution. Proceedings HLT-Demonstrations '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session; Columbus, OH. Stroudsburg, PA: Association for Computational Linguistics, 2008
12. Haghighi A, Klein D. Coreference resolution in a modular, entity-centered model. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics; Los Angeles, CA. Stroudsburg, PA: Association for Computational Linguistics, 2010
13. Ng V. Unsupervised models for coreference resolution. EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing; Waikiki, Honolulu, Hawaii. Stroudsburg, PA: Association for Computational Linguistics, 2008
14. Poon H, Domingos P. Joint unsupervised coreference resolution with Markov logic. Proceedings of the Conference on Empirical Methods in Natural Language Processing; Waikiki, Honolulu, Hawaii. Stroudsburg, PA: Association for Computational Linguistics, 2008:650–9
15. Raghunathan K, Lee H, Rangarajan S, et al. A multi-pass sieve for coreference resolution. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing; MIT Stata Center, MA. Stroudsburg, PA: Association for Computational Linguistics, 2010
16. Nicolae C, Nicolae G. BestCut: a graph algorithm for coreference resolution. EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing; Sydney, Australia. Stroudsburg, PA: Association for Computational Linguistics, 2006
17. Zheng J, Chapman WW, Crowley RS, et al. Coreference resolution: a review of general methodologies and applications in the clinical domain. J Biomed Inform 2011;44:1113–22
18. Wang Y, Melton GB, Pakhomov S. It's about this and that: a description of anaphoric expressions in clinical text. AMIA Annu Symp Proc 2011;2011:1471–80
19. He TY, Uzuner O, Szolovits P. Coreference resolution on entities and events for hospital discharge summaries. Thesis (M. Eng.), Massachusetts Institute of Technology, 2007
20. Zheng J, Chapman WW, Miller TA, et al. A system for coreference resolution for the clinical narrative. J Am Med Inform Assoc. Published Online First. doi:10.1136/amiajnl-2011-000599
21. Segura-Bedmar I, Crespo M, de Pablo-Sánchez C, et al. Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents. BMC Bioinformatics 2010;11(Suppl 2):S1
22. Vlachos A, Gasperin C, Lewin I, et al. Bootstrapping the recognition and anaphoric linking of named entities in Drosophila articles. Pac Symp Biocomput 2006:100–11
23. Pustejovsky J, Castaño J, Zhang J, et al. Robust relational parsing over biomedical literature: extracting inhibit relations. Pac Symp Biocomput 2002:362–73
24. Uzuner O, Forbush T, Shen S, et al. i2b2/VA 2011 Co-reference Annotation Guidelines for the Clinical Domain. 2011.
25. Fan RE, Chang KW, Hsieh CJ, et al. LIBLINEAR: a library for large linear classification. J Machine Learn Res 2008;9
26. Denny JC, Spickard A 3rd, Johnson KB, et al. Evaluation of a method to identify and categorize section headers in clinical documents. J Am Med Inform Assoc 2009;16:806–15
27. Li Y, Gorman SL, Elhadad N. Section classification in clinical notes using supervised hidden Markov model. IHI '10 Proceedings of the 1st ACM International Health Informatics Symposium, Arlington, VA. New York: ACM, 2010
28. Roberts K, Harabagiu SM. A flexible framework for deriving assertions from electronic medical records. J Am Med Inform Assoc 2011;18:568–73
29. Bagga A, Baldwin B. Algorithms for scoring coreference chains. Proceedings of the First International Conference on Language Resources and Evaluation Workshop on Linguistic Coreference; Granada, Spain. Paris, France: European Language Resources Association (ELRA), 1998
30. Vilain M, Burger J, Aberdeen J, et al. A model-theoretic coreference scoring scheme. Proceedings of the 6th Message Understanding Conference (MUC6); Columbia, MD. Stroudsburg, PA: Association for Computational Linguistics, 1995
31. Recasens M, Hovy E. BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering. 2010
32. Luo X. On coreference resolution performance metrics. HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing; Vancouver, BC, Canada. Stroudsburg, PA: Association for Computational Linguistics, 2005
33. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium. 2001;17
34. Stearns MQ, Price C, Spackman KA, et al. SNOMED clinical terms: overview of the development process and project status. Proceedings of the AMIA Symposium; Washington, DC. Bethesda, MD: American Medical Informatics Association, 2001
35. Lee L. Measures of distributional similarity. Proceedings of the 37th annual meeting of ACL; College Park, Maryland, USA. Stroudsburg, PA: Association for Computational Linguistics, 1999
36. Iida R, Inui K, Takamura H, et al. Incorporating contextual cues in trainable models for coreference resolution. Proceedings of the 10th EACL Workshop on the Computational Treatment of Anaphora; Budapest, Hungary. Stroudsburg, PA: Association for Computational Linguistics, 2003

Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of American Medical Informatics Association