The focus of text mining systems in biomedicine has recently shifted from extracting named entities of biological relevance to recognizing relations and events. Biomedical event extraction (EE) systems have already been integrated with a number of applications, such as semantic search, association mining for knowledge discovery, bioprocess extraction and pathway reconstruction (Ananiadou et al., 2010
; Björne et al., 2010
; Tsuruoka et al., 2011
; Wang et al., 2011
). The increasing requirement for high performance EE systems has motivated the development of several event annotated corpora to facilitate their training [e.g. GENIA (Kim et al., 2008
), BioInfer (Pyysalo et al., 2007
), GREC (Thompson et al., 2009
), BioNLP Shared Tasks 2009 (ST09) (Kim et al., 2011a
) and 2011 (ST11) (Kim et al., 2011b
) corpora]. Events are structured descriptions of biological processes involving complex relationships (e.g. angiogenesis, metabolism and reactions) between biomedical entities, and are highly dependent on context. Events usually consist of triggers, which are often verbs (e.g. inhibit
) or nominalizations (e.g. inhibition
), and their typed arguments, which can be biomedical entities (e.g. gene
) or other events (e.g. Regulation
Typically, EE systems find both triggers and associated arguments, which can present a number of challenges. Triggers are expressed in diverse ways and their exact interpretation can depend upon context, e.g. ‘expression of [gene]’ is an event of type Gene Expression, but ‘expression of [mRNA]’ is of type Transcription. Furthermore, event arguments can be difficult to detect; since triggers are sublanguage dependent, their exact syntactic arguments are dictated by the domain. EE systems are thus highly dependent on the availability of training sets and external resources (e.g. dictionaries) that can expand the trigger candidates in the training set with semantically similar alternatives. Generally, machine learning methods are applied to syntactic parse results to disambiguate event types according to their surrounding contexts and to identify relations between the syntactic and semantic arguments of triggers.
There are two important issues that are scarcely dealt with by existing EE systems. First, the usual method of training on only a single annotated corpus can limit the coverage and scalability of the system. Secondly, coreference resolution (CR) is required for the correct interpretation of certain event arguments. This is illustrated in , in which there are two coreferential links. The first link involves the mention this gene and its antecedent jun-B, whereas the second concerns the mention that and its antecedent expression. There are two gene expression events in the sentence, i.e. jun-B expression and jun-C expression, which can only be correctly recognized if these coreferential links are identified and resolved.
Coreference example. Two coreferential links are illustrated
This article reports on two enhancements that have been made to an existing EE system, EventMine (Miwa et al., 2010b
), to address the issues outlined above. First, we constructed a new CR system, whose performance in resolving anaphoric references to genes and proteins considerably exceeds that of the best participating system in the Protein Coreference task (COREF) of ST11, which focussed on protein/gene name CR. Subsequent integration of our new CR system with EventMine clearly demonstrates an improvement in state-of-the-art EE performance on several event corpora. Secondly, our incorporation of domain adaptation (DA) methods into EventMine, which make it possible to use information from multiple annotated corpora when training the system, have been shown to further boost EE performance. The enhanced version of EventMine, incorporating both CR and DA, outperforms the highest ranked systems participating in ST09, and the GENIA and Infectious Diseases (ID) subtasks of ST11.