A detailed understanding of biological systems requires the ability to trace cause and effect across multiple levels of biological organization, from molecular-level reactions to cellular, tissue- and organ-level effects to organism-level outcomes (Kitano, 2002). Consequently, any effort aiming to comprehensively represent biological systems must address entities and processes at all of these levels.
This challenge has so far been only partially met in biomedical information extraction (IE) and text mining, which aim to improve access to domain knowledge by automating aspects of processing the literature. Until recently, efforts in domain IE were primarily focused on the basic task of recognizing mentions of relevant entities such as genes and proteins in text (Yeh et al., 2005) and on the extraction of pairwise relations between these entities, representing, for example, protein–protein interactions (Krallinger et al., 2007; Nédellec, 2005). Such representations lack the capacity to capture any but the simplest of associations.
In recent years, there has been increasing interest in the extraction of structured representations capable of capturing associations of arbitrary numbers of participants in specific roles. Such approaches to IE, frequently termed event extraction, are capable of representing complex associations, such as the binding of a protein to another inhibiting its localization to a specific cellular compartment, and open many new opportunities for domain text mining applications ranging from semantic search to database and pathway curation support (Ananiadou et al., 2010). There is significant momentum behind the move to richer representations for IE: more than 30 groups have introduced methods for biomedical event extraction in shared tasks (Kim et al., 2011a); event-annotated corpora have been introduced for many extraction targets, including DNA methylation (Ohta et al., 2011a), protein modifications (Pyysalo et al., 2011) and the molecular mechanisms of infectious diseases (Pyysalo et al., 2012c); event extraction methods have been applied to automatically analyze all 20 million PubMed abstracts (Björne et al., 2010); and event extraction analyses are being integrated into literature search systems such as MEDIE and applied in support of advanced tasks such as pathway curation (Ohta et al., 2011b).
Figure: Example sentence with event annotation. Prot, -Reg and Cell comp. are abbreviations for Protein, Negative regulation and Cell component, respectively.
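To make the contrast with pairwise relations concrete, the following minimal Python sketch shows one way an event of the kind described above could be represented as a typed, possibly nested structure. The class names, the role labels (Theme, Theme2, Cause, ToLoc) and the entities ProtA and ProtB are illustrative assumptions, not the annotation scheme or data of any particular corpus.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Entity:
    """A mention of a physical entity in text (hypothetical minimal form)."""
    type: str  # e.g. "Protein" or "Cell component"
    text: str  # the mention as it appears in the sentence


@dataclass(frozen=True)
class Event:
    """A typed association of any number of participants in named roles."""
    type: str     # e.g. "Binding", "Localization", "Negative regulation"
    trigger: str  # the word in the sentence that expresses the event
    # (role, participant) pairs; a participant may itself be an event
    args: Tuple[Tuple[str, Union[Entity, "Event"]], ...]


def render(event: Event) -> str:
    """Render an event, including nested events, as a bracketed string."""
    parts = []
    for role, arg in event.args:
        value = render(arg) if isinstance(arg, Event) else f"{arg.type}:{arg.text}"
        parts.append(f"{role}={value}")
    return f"{event.type}({', '.join(parts)})"


def entities_in(event: Event) -> List[Entity]:
    """Collect every entity reachable from an event, in argument order."""
    found: List[Entity] = []
    for _, arg in event.args:
        if isinstance(arg, Event):
            found.extend(entities_in(arg))
        else:
            found.append(arg)
    return found


# Hypothetical example: the binding of ProtA to ProtB inhibits the
# localization of ProtB to the nucleus.
prot_a = Entity("Protein", "ProtA")
prot_b = Entity("Protein", "ProtB")
nucleus = Entity("Cell component", "nucleus")

binding = Event("Binding", "binding", (("Theme", prot_a), ("Theme2", prot_b)))
localization = Event(
    "Localization", "localization", (("Theme", prot_b), ("ToLoc", nucleus))
)
# An event whose participants are themselves events: this nesting is what a
# flat pairwise relation between two entities cannot express.
inhibition = Event(
    "Negative regulation", "inhibits", (("Cause", binding), ("Theme", localization))
)

print(render(inhibition))
```

In this sketch a pairwise relation corresponds to the special case of an event with exactly two entity arguments, while the inhibition event takes two other events as its participants.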
While the event extraction approach has been demonstrated to be applicable to a variety of extraction targets across different subdomains of biomedical science, related efforts all share a key restriction: a nearly exclusive focus on molecular-level entities and events.
Entities such as proteins and genes and events such as binding and phosphorylation are an important part of the picture of biological systems, but still only a part, and any IE approach aiming to capture the whole picture must also consider other levels of biological organization.
In this study, our aim is to extend the scope of existing event extraction resources and methods to levels of biological organization ranging from the subcellular to the organism level, as a step toward developing the capacity for the automatic extraction of these targets from the entire available literature. Toward this end, we propose relevant entity and event types for annotation across these levels with reference to community-standard ontologies, develop a set of detailed guidelines for their annotation in text and create structured event annotations marking over 8000 entities and 6000 events in abstracts relevant to cancer biology that were previously annotated by domain experts to identify spans of text relevant to their interests. Using these data, we perform experiments with state-of-the-art methods for both entity mention detection and event extraction to analyze the feasibility of extraction using existing tools, and further evaluate the benefits of specific adaptations of such tools to the novel task.