Agglomerating results from studies of individual biological components has shown the potential to produce biomedical discovery and the promise of therapeutic development. Such knowledge integration could be tremendously facilitated by automated text mining for relation extraction in biomedical literature. Relation extraction systems cannot be developed without substantial datasets annotated with ground truth for benchmarking and training. The creation of such datasets is hampered by the absence of a resource for launching a distributed annotation effort, as well as by the lack of a standardized annotation schema. We have developed an annotation schema and an annotation tool which can be widely adopted so that the resulting annotated corpora from a multitude of disease studies could be assembled into a unified benchmark dataset. The contribution of this paper is threefold. First, we provide an overview of available benchmark corpora and derive a simple annotation schema for specific binary relation extraction problems such as protein-protein and gene-disease relation extraction. Second, we present BioNotate: an open source annotation resource for the distributed creation of a large corpus. Third, we present and make available the results of a pilot annotation effort of the autism disease network.
Recent disease network studies on ataxia [1,2] and Huntington’s disease demonstrated that integrating biological knowledge from several sources can lead to biomedical discovery. It is particularly attractive to connect the dots using the wealth of knowledge in the published literature. The entire body of biomedical literature, dubbed the “bibliome”, represents a significant resource for understanding the genetic basis of disease. It contains high-quality, high-confidence information on genes that have been studied for decades, including a gene’s relevance to a disease, its reaction mechanisms, structural information and well-characterized interactions.
Once a clinical study of a disease produces a list of genes, it is necessary to place these in a wider context of cellular mechanisms and molecular interactions. There are numerous academic and commercial projects striving to correctly represent the wealth of entities and relations extracted from the bibliome. However, these resources are bound to be either reliable but scarce and outdated — in the case of manual curation, or extensive but unacceptably inaccurate — in the case of automated extraction. Often this situation leaves clinical disease researchers sifting through the literature with the naked eye and using ad-hoc methods to represent relevant knowledge in a disease evidence network. In this paper we describe an open-access tool that facilitates this process, while enabling the gradual development of advanced text mining algorithms.
Recently, high-throughput methods in biology have produced a rising volume of scientific publications that analyze the new data and extract biological knowledge from them [4,5]. The scientific community now faces the problem of scaling methods for representing and searching such a volume of information. For example, a direct PubMed search for the term autism returns over 11,000 related papers; a search for the gene p53 returns almost 45,000 articles. The biomedical literature grows at an exponential rate, and MEDLINE currently contains more than 16 million publications. Naturally, there has been growing interest in text-mining techniques to automatically extract expert knowledge from the literature.
One idea common to the multitude of advanced computational-linguistics approaches is that the semantics of a relation expressed in a piece of text must fall out of the syntactic analysis of that text. The syntactic structure of a sentence is often represented as a parse tree capturing the interaction of its constituents, for instance within the formalism of Dependency Grammar.
In this paper we focus on extracting relationships between biomedical entities, such as interactions between genes and proteins or associations between genes and diseases. We propose a cross-fertilization of two efforts: 1)human-curated compilation of literature into disease evidence networks; and 2) automated information extraction. This work constitutes the first step towards a system where the creation of curated disease networks (see Figure 1) is seamlessly augmented to collect the curator’s judgments about syntactic and semantic features of the texts supporting the relationships represented in the network. Such annotations would be juxtaposed to the parsed structures in order to develop and improve the automated relation extraction, which in turn will serve to facilitate the curation process in order to build and maintain current disease networks.
Example associations extracted from biomedical texts are presented in Table 1. The first sentence reports an interaction between two genes: SCPA and C5a. The second sentence rules out the existence of a relation between a gene (APOE) and a disease (autism). While many methods have been put forward for automatically extracting such relations between biological entities from the scientific literature (for a review see [4,7–9]), the problem remains unsolved. One reason is that the development and validation of such methods require a large corpus of correctly annotated text, while available corpora are still small and poorly annotated. Although recent efforts have made progress [22,27], there is still a great need for large corpora annotated with protein-protein and gene-disease interactions. To analyze the state of the art on this issue, we have compiled in Table 2 the main features of the available corpora that annotate relations between biomedical entities (genes, proteins and diseases) and/or syntactic dependencies within sentences. In addition, Table 3 shows a detailed analysis of the corpora containing protein-protein interactions.
A careful analysis of these tables suggests the following observations:
Several annotation tools have emerged in recent years for general-purpose annotation tasks (Knowtator, WordFreak, SAFE-GATE, iAnnotate). These tools provide the user with flexible mechanisms to define the annotation schema, so they can be customized for annotating relationships in biomedical texts. Some BioNLP groups have also created annotation tools customized for their specific tasks, such as the Xconc Suite implementation for annotating events in the GENIA corpus. These tools are not intended for distributed or large-scale annotation efforts, but for annotation processes carried out by a limited group of trained annotators following sophisticated annotation schemas.
In the biomedical field, tools for collaborative annotation have been developed, such as WikiGene, CBioC and WikiProteins. WikiGene and WikiProteins are two collaborative frameworks built on wiki-based systems: users can edit pages associated with the entities of interest and share their knowledge with the community. WikiGene focuses on genes and gene regulatory events. WikiProteins is a more ambitious effort that allows the annotation of many kinds of biomedical concepts (genes, proteins, drugs, tissues, diseases, etc.) and their relationships with other concepts (creating what the authors call Knowlets). While these efforts provide the community with a means to access and modify a large amount of information indexed by biological entities of interest, they are not intended for creating a corpus that explicitly records the text supporting the relationships between the entities.
Our work is largely inspired by the distributed, collaborative annotation efforts that have recently emerged in the image-analysis domain. These efforts have shown great potential, since they allow any interested user to contribute to the annotation task. In particular, Label-Me is a tool for tracing and labeling boundaries of objects in images, and Google™ Image Labeler is a tool for labeling images to improve search results.
Our approach is most similar to that of CBioC. This tool allows the user to annotate relationships between biomedical concepts while browsing PubMed records. The user is presented with potential relationships for the current record, extracted by automated tools or suggested by other users. Registered users can add new relationships and vote on suggested ones. For a given PubMed record, a relationship is defined by providing the literals of the two interacting entities and the keywords of the interaction. However, CBioC does not allow users to highlight the exact mentions of these words in the text. Furthermore, users cannot access the whole corpus of annotations until the CBioC team makes it publicly available.
We offer an intuitive framework for distributed annotation of extracts of biomedical text with interactions between biological entities of interest. Our system, which we call BioNotate, allows disparate research groups to perform literature annotation to suit their individual research needs, while at the same time contributing to the large-scale effort. There are multiple levels of integration built into the system. At one level, several annotators could collaborate on processing statements from a single corpus on their own server. At another level, multiple corpora could be created on different servers, and the resulting corpora could be integrated into a single overarching resource.
BioNotate provides the community with an annotation tool to harness the great collaborative power of biomedical community over the internet and create a substantially sized corpus as a baseline for research on biomedical relation extraction. Specifically, we focus on the annotation of protein-protein and gene-disease relationships. However, the proposed tool and annotation schema promise to be reusable for a variety of relation types.
We tackle the creation of this collaborative annotation resource in two phases: Phase One involves the design of the annotation schema and protocol; Phase Two involves the design and implementation of the annotation tool.
Following a recently developed convention in the field, throughout this paper we use the term snippet to mean a small chunk of text, anywhere from a few to a few dozen words, which does not have to fall on sentence boundaries. The annotation process we propose focuses the annotator’s attention on two particular biomedical named entities which appear in a snippet. In this context, we ask the annotator to answer the following question: “does this snippet imply a direct interaction between the provided entities?”. The Yes/No answer to this question allows the snippets to be classified into positive (existence of an interaction) and negative (absence of an interaction). To further help future research and to enrich the annotation of the corpus, the annotator is also asked to mark the minimal phrase in the snippet that supports his answer. For example, consider again the snippets from Table 1. The first snippet reports an interaction between the genes SCPA and C5a; the annotator would therefore answer Yes and highlight inhibits as the minimal phrase supporting this answer. The second snippet reports negative evidence that the gene APOE is associated with autism; while the two words of interest (APOE and autism) are syntactically connected by the phrase is associated with, the phrase that carries the main message of the discourse is failed to reveal. The annotator would therefore mark up failed to reveal as the minimal phrase supporting the No answer. Note that our focus is on the contextual semantics of the message.
If the entities only co-occur in the snippet without any explicit relation being reported between them, the answer would be No (the two entities are not related). Since there is no explicit support in the text for either positive or negative evidence of interaction, nothing needs to be highlighted to justify the answer in this case. An example of this can be found in the third snippet in Table 1. Complete annotations of these snippets are provided below.
We consider these two simple annotations, the Yes/No answer to the question above and the highlighted text supporting that answer, to be the most valuable knowledge the annotator can transfer to the corpus for identifying relations between the provided entities. In addition, this protocol is simple and intuitive enough to be embedded in an annotation tool open to voluntary collaboration.
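Concretely, a single annotator's judgment on a snippet can be captured in a minimal record. The following is an illustrative sketch in Python; the field names are hypothetical and do not reflect BioNotate's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SnippetAnnotation:
    """One annotator's judgment on one snippet (field names are illustrative)."""
    snippet_id: str
    entity_a: str                  # first highlighted entity, e.g. "SCPA"
    entity_b: str                  # second highlighted entity, e.g. "C5a"
    interacts: bool                # Yes/No: does the snippet imply a direct interaction?
    support: Optional[str] = None  # minimal phrase justifying the answer, if any

# Snippet 1 from Table 1: positive interaction, supported by "inhibits".
a1 = SnippetAnnotation("s1", "SCPA", "C5a", interacts=True, support="inhibits")

# Snippet 2: negative evidence; the supporting phrase is "failed to reveal".
a2 = SnippetAnnotation("s2", "APOE", "autism", interacts=False,
                       support="failed to reveal")

# Snippet 3: mere co-occurrence; the answer is No with nothing highlighted.
a3 = SnippetAnnotation("s3", "geneX", "diseaseY", interacts=False)
```

The `support` field is optional precisely because co-occurrence snippets carry a No answer with no highlighted justification.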
For our purposes, a snippet is a small chunk of text that may confirm or rule-out a relationship between two known entities (genes, proteins or diseases). We are particularly interested in two types of snippets:
By our definition, a “relationship” or “interaction” (either positive or negative) between two entities that co-occur in a snippet exists only if there is text in the snippet that explicitly supports that relationship.
Also, we are only interested in direct interactions between pairs of entities. For example, these sentences:
do not imply a direct interaction between A and B.
The annotator will be shown a snippet and a pair of entities of interest: gene-gene or gene-disease. One mention of each of the two entities of interest is highlighted in the text of the snippet in advance.
For a given snippet, the annotator is asked to:
Gene: Protein A
Gene: Protein B
Snippet: Protein A is found in tissue T. Protein A interacts with protein B in the presence of catalyst C to produce D.
changing the first mention of Protein A to protein E would not alter the relation being expressed, while changing the second mention to protein E would. Therefore, the second occurrence of Protein A should be highlighted, together with the mention of protein B.
Also, in the case where a pronoun refers to the entity of interest and links the entity to the INTERACTION phrase, the pronoun, and not the entity mention, should be marked up with the corresponding label (GENE or DISEASE). For example, consider the following snippet:
Snippet: Gene RELN was studied in various disorders. It turned out to be causing autism.
It should be marked up as a GENE, since it refers to the gene of interest, RELN, and changing this pronoun to a mention of another entity would alter the relation being expressed between this gene and the other entity of interest, autism.
This also applies to noun phrases that refer to one of the entities of interest. For example, in the following snippet:
Snippet: Recently, two proteins homologous to FMRP were discovered: FXR1 and FXR2. These novel proteins interact with FMRP and with each other. (PubMed 009259278)
“These novel proteins” should be marked up as a GENE, since it refers to one of the genes of interest (“FXR1”) and changing this noun phrase to a mention of another entity would not convey the same relationship. Only one mention of each gene/disease of interest should be highlighted in each snippet. The annotator should check whether the highlighted regions comply with these guidelines and correct annotations that do not.
The resulting set of available tags for the annotation is the following:
Detailed annotation guidelines and more annotation examples are provided at the Sourceforge.net project site.
Since the task of annotating the snippets will be carried out simultaneously by many annotators, we have implemented an annotation tool with the following features (see Section 2.2.2 for more details):
When an annotator logs into the system and requests a snippet to annotate, he is assigned a new one from the pool of documents pending annotation. The assigned document is picked at random from those not previously annotated by this user. Each snippet is annotated by at least k different annotators. If the k annotations of a snippet do not meet a minimum degree of agreement, the snippet is presented to another annotator picked at random. The process continues until at least k annotations of the snippet meet the minimum degree of agreement (see Figure 2). We have initially set k = 2 for the current annotation effort; this value can be increased as more annotators join the effort.
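The assignment policy just described can be sketched as follows. This is a hypothetical reconstruction of the loop, not BioNotate's actual code; the function names and data layout are our own:

```python
import random
from itertools import combinations

K = 2  # minimum number of agreeing annotations per snippet

def next_snippet(pending, user, annotations):
    """Pick a random pending snippet that `user` has not yet annotated.

    `annotations` maps snippet id -> list of (user, annotation) pairs.
    Returns None when no eligible snippet remains.
    """
    eligible = [s for s in pending
                if user not in {u for u, _ in annotations.get(s, [])}]
    return random.choice(eligible) if eligible else None

def is_done(snippet, annotations, agree):
    """A snippet is finished once some K of its annotations mutually agree.

    `agree` is a predicate over two annotations (e.g. the agreement
    criterion of section 2.2.1).
    """
    anns = [a for _, a in annotations.get(snippet, [])]
    return any(all(agree(x, y) for x, y in combinations(group, 2))
               for group in combinations(anns, K))
```

A snippet that never reaches K mutually agreeing annotations simply stays in the pending pool and keeps being offered to fresh annotators.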
k given annotations are said to meet the minimum agreement if they satisfy the following three conditions:
For example, consider again snippet (1) from Table 1. If Annotator1 highlights “inhibits” as INTERACTION and Annotator2 highlights “inhibits the activity of” with the same label, the two annotations agree on the interaction phrase, since none of the tokens of the shorter interaction phrase (“inhibits”) is missing from the longer one (“inhibits the activity of”). If a new annotator, Annotator3, highlights “enzymatically inhibits”, this also agrees with Annotator1 (for the same reason) and with Annotator2: only one token (“enzymatically”) of the shorter interaction phrase (Annotator3’s) is missing from the longer one (Annotator2’s). If a new annotator, Annotator4, highlights “action of SCPA enzymatically inhibits” as the interaction, this annotation would not agree with Annotator2’s, but would agree with those of Annotator1 and Annotator3.
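The token-overlap criterion illustrated by these examples can be written down as a small predicate. The sketch below is reconstructed from the worked examples only and is not the system's actual implementation; the threshold of one extra token is an inference from the text:

```python
def phrases_agree(p1, p2, max_extra=1):
    """Two highlighted interaction phrases agree when at most `max_extra`
    token of the shorter phrase is absent from the longer one.
    (Reconstructed from the paper's worked examples.)"""
    t1, t2 = p1.lower().split(), p2.lower().split()
    shorter, longer = (t1, t2) if len(t1) <= len(t2) else (t2, t1)
    return sum(1 for tok in shorter if tok not in longer) <= max_extra

# The worked example from the text:
assert phrases_agree("inhibits", "inhibits the activity of")                 # 0 tokens missing
assert phrases_agree("enzymatically inhibits", "inhibits the activity of")   # 1 token missing
assert not phrases_agree("action of SCPA enzymatically inhibits",
                         "inhibits the activity of")                         # 2 tokens missing
assert phrases_agree("inhibits", "action of SCPA enzymatically inhibits")    # Annotator1 vs 4
```

All four pairwise judgments match the agreements and disagreements described above.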
Figure 5 shows the information flow in and out of BioNotate. The system must be fed XML-formatted snippets in which two entities of interest have been identified and marked. The resulting annotations performed by the users are also available as XML files. The XML formats used for the original snippets and the annotations are shown in the figure. The system also generates a plain-text file listing, for every snippet, references to the annotations that agree. A full description of the XML formats used by BioNotate and step-by-step configuration instructions can be found on the project webpage.
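For illustration, an input snippet file in the spirit of this format might be consumed as follows. The element and attribute names below are hypothetical, since the real schema is defined in Figure 5 and on the project webpage:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet file; tag and attribute names are illustrative only.
SNIPPET_XML = """\
<snippet id="s1">
  <entity type="GENE">SCPA</entity>
  <entity type="GENE">C5a</entity>
  <text>Purified SCPA inhibits the activity of C5a in vitro.</text>
</snippet>"""

root = ET.fromstring(SNIPPET_XML)
# Pull out the pre-marked pair of entities and the snippet text.
entities = [(e.get("type"), e.text) for e in root.findall("entity")]
text = root.find("text").text
```

Any tool that can emit this kind of pre-marked XML can therefore feed new snippet sets into the annotation pipeline.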
As an example of the use of the BioNotate system, this section presents a pilot effort to annotate a corpus of interactions between genes related to autism. It provides a description of the methods for creating the corpus and the first results of the annotation effort.
Our main source of data is PubMed, a widely used biomedical search tool whose database, MEDLINE, includes over 16 million citations and abstracts. As previously mentioned, a PubMed search for extensively studied genes, proteins or diseases returns a huge number of results. To narrow our search to papers reporting protein-protein or protein-disease interactions, we used publicly available tools and databases such as STRING. STRING is a database of known and predicted protein-protein interactions, mainly derived from PubMed. The input is one or more protein names, for which STRING returns a graph whose nodes are proteins and whose edges are the relationships between pairs found in the literature. For every edge in this graph, the list of publications supporting the relation is provided.
Our goal in this case study is to build a disease evidence network for autism. Therefore, to create a pilot corpus to start our annotation effort, we focus on genes and proteins involved in autism and the relationships between them.
To obtain the pilot corpus, we proceeded as follows:
The resultant corpus contains snippets supporting 168 relationships between 127 proteins. A total of 2,053 abstracts were processed, yielding 1,819 snippets.
Once all the mentions of the genes/diseases of interest in a text have been identified (by STRING in our case), we create one snippet for every pair of mentions of different genes/diseases which are close enough to each other in the text. Each snippet will contain the text in between the two mentions, and a small amount of text before and after to provide the annotator with some context around the segment of interest. The following schema describes the steps of the algorithm:
Thus the maximum length of a snippet is given by MAX_LENGTH_TOTAL, and MAX_LENGTH_CORE gives the maximum number of tokens allowed between the two entities of interest in the snippet. We also use the constant MIN_LENGTH_TOTAL to guarantee that every snippet has enough context for better comprehension during annotation, and thus a higher annotation accuracy. The appropriate values for these constants depend on the type of text being analyzed. We established experimentally that a MAX_LENGTH_TOTAL of 300 tokens is appropriate for snippets extracted from biomedical texts, with a MAX_LENGTH_CORE of 240 tokens and a MIN_LENGTH_TOTAL of 40.
When there was more than one mention of the entities of interest in a snippet, we marked up the two mentions occurring nearest each other to create the snippets that were loaded into BioNotate.
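Under these definitions, the snippet-cutting step can be sketched as follows. Only the three length constants come from the text; the symmetric padding strategy is an assumption, and the function is a sketch rather than the authors' actual code:

```python
MAX_LENGTH_TOTAL = 300   # maximum tokens in a snippet
MAX_LENGTH_CORE = 240    # maximum tokens between the two mentions
MIN_LENGTH_TOTAL = 40    # minimum tokens, to guarantee enough context

def make_snippet(tokens, i, j, max_total=MAX_LENGTH_TOTAL,
                 max_core=MAX_LENGTH_CORE, min_total=MIN_LENGTH_TOTAL):
    """Cut a snippet around mention positions i < j (token indices).

    Returns None when the mentions are too far apart; otherwise pads the
    core span symmetrically with context, respecting the length constants.
    """
    core = j - i + 1
    if core > max_core:
        return None                              # mentions too far apart
    pad = (max(min_total, core) - core + 1) // 2  # context needed per side
    pad = min(pad, (max_total - core) // 2)       # respect the total budget
    start = max(0, i - pad)
    end = min(len(tokens), j + 1 + pad)
    return tokens[start:end]
```

For example, with mentions six tokens apart, the padding brings the snippet up to the 40-token minimum; a pair more than MAX_LENGTH_CORE tokens apart yields no snippet at all.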
As part of this publication we provide the resulting corpus of our pilot annotation effort on literature related to autism. To date, it consists of one thousand snippets annotated by one of the authors, though we expect it to grow rapidly as we involve multiple annotators. The resulting corpus consists of the archived original snippets and the marked-up snippets in XML format, together with some post-processing output, namely compiled lexicons of all 200 entities encountered in the corpus along with all quotes and supporting relations.
Our annotation reveals that only 116 of the original thousand snippets contain positively identified relationships, i.e., there is roughly an 89% error rate on the text support, assuming that every abstract was meant to identify a positive relation in the STRING database. Not all 200 entities are in fact distinct; for example, synonymous names for the same gene, such as (VLDL-R, VLDLR, VLDLr) and (5-HT-2A, 5-HT2A, 5HT2A), were not merged. As for the phrases supporting the relationships, we encountered many action verbs, e.g., “associated with”, “docks to”, “binds”, “phosphorylated by”. Naturally, there were also many inconclusive cases, such as “probably unrelated genes”, “may interact with” and “little is known about”. Phrases supporting an interaction span anywhere from 1 to 28 words, with an average of about 4 words.
In order to evaluate the agreement between different annotators, we performed a test on a reduced corpus. For this test, we selected the snippets for which the previous annotator had highlighted an “interaction” phrase, i.e., the snippets which explicitly supported either a positive or a negative relationship between the entities of interest (according to that single annotator). We focused on this collection because these snippets potentially support interactions, since one annotator had already reported so, making them an interesting test of the effectiveness of our approach. The resultant corpus contains 139 snippets. We involved three more annotators to complete its annotation, with the goal of finding agreement between two annotators for every snippet according to the criteria presented in section 2.2.1.
The results are shown in Table 5. Previous annotation efforts on gene identification and normalization reported agreement rates ranging from 91% down to 69% in certain contexts. In our case, the average agreement per annotation is over 75%, and the task involves annotating both the interacting entities and the interaction keywords in the snippets. These agreement rates are thus similar to those of other annotation tasks and show that the approach we propose is effective for collaborative annotation.
Disagreement analysis revealed some errors inadvertently introduced by the annotators, such as a negative answer accompanied by a highlighted interaction clearly implying a positive relationship. Another frequent reason for disagreement was the presence of several distinct interaction phrases in the same snippet. For example:
“The KH domains of FXR1 and FMR1 are almost identical, and the two proteins have similar RNA binding properties in vitro. However, FXR1 and FMR1 have very different carboxy-termini. […] These findings demonstrate that FMR1 and FXR1 are members of a gene family and suggest a biological role for FXR1 that is related to that of FMR1.”
Sometimes long interaction phrases can also cause disagreement among annotators, e.g.:
“By immunoblotting, we found that a marked reduction in FMRP levels is associated with a modest increase in FXR1P” (PubMed 012112448).
“No association between the very low density lipoprotein receptor gene and late-onset Alzheimer’s disease nor interaction with the apolipoprotein E gene in population-based and clinic samples.” (PubMed 009181358)
Another source of disagreement is the highlighting of pronouns and noun phrases that refer to one of the entities of interest according to the guidelines provided in section 2.1. For example, consider the following sentence:
“The biological role of the very low density lipoprotein receptor (VLDL-R) in humans is not yet elucidated. This cellular receptor binds apolipoprotein E (apoE)-containing lipoparticles and is mainly expressed in peripheral tissues.” (PubMed 009409253).
In this case one annotator highlighted the mention of the gene “very low density lipoprotein receptor” while another two annotators highlighted the noun phrase “This cellular receptor” which refers to this gene and whose replacement with another entity would alter the relationship being expressed.
Since we require at least two annotators to substantially agree on their annotations, inadvertent errors and disagreements are discarded, and the resultant corpus of agreed annotations reflects good-quality relationships.
This analysis also allowed us to improve the annotation guidelines and to enrich the documentation available on the project site with illustrative examples.
According to the gold standard established by the annotators, 110 of the 139 snippets contain a positive relationship, another 8 contain explicit negative interactions, and the remaining snippets do not contain any relationship. In 76% of the snippets with an interaction, the interacting entities were those highlighted in advance (the two mentions occurring nearest each other were marked in advance, as described in section 3.2). In another 18%, the interacting mentions were not those highlighted in advance, but the literals were the same as those provided. In the remaining 6%, the interacting entities were synonyms of those provided (such as ‘methyl CpG binding protein 2’ for ‘MECP2’), or pronouns or noun phrases referring to the entities (such as “This cellular receptor”).
We have presented BioNotate, an open source resource for supporting distributed collaborative annotation efforts via a simple interface over a standard, browser-based client/server system. This resource provides a way to create a substantial benchmark corpus for the evaluation and development of automated relation-extraction methods for biomedical literature mining. Additionally, we presented a study of existing definitions of relations between genes and suggested a method to merge them. It is our hope that the resource presented in this paper will enable rapid progress at the intersection of biology, computational linguistics and knowledge representation.
We also described a method for the creation of a pilot corpus focused on the relations between proteins associated with autism. The annotation tool we built can easily consume new sets of snippets, facilitating similar efforts for the creation of new corpora on this or other diseases. The resultant snippets can be contributed at any time to the current annotation effort. Furthermore, since the tool is freely available, it can be downloaded and deployed anywhere, allowing parallel annotation efforts to be undertaken. Our aim is to facilitate many small, distributed annotation efforts whose output could be integrated into a single and uniform resource.
The consistency of the resulting integrated corpus is guaranteed because every annotated interaction must have explicit verbal support in the text of the snippet. Note that our system offers a process for assembling disjoint annotation efforts into a single corpus that is consistent from the point of view of computational linguistics, but not necessarily from the biological point of view. That is, decisions about the meaning of text must be consistently supported by the annotation, but, generally speaking, different annotators do not have to agree on what constitutes an interaction. More stringent consistency requirements could, however, be enforced by re-constituting the corpus from the disjoint sets contributed by individual groups, once such agreement is reached within each group.
There are several ways for our annotation system to evolve through a tight feedback loop between manual annotation, training of automated text-mining tools on a larger corpus, and the use of automated tools to facilitate manual annotation. One issue we encountered was the unbalanced composition of our corpus, since most of the snippets did not support a relation. Currently, we are using more sophisticated tools for the automatic selection of snippets and the identification of named entities in order to create better-balanced corpora. We are also developing methods to automatically classify snippets using simple heuristics in order to enrich the annotated material with positive snippets. This is just one form of active learning. Another related improvement is the use of shallow parsers and noun-phrase chunkers to learn to hypothesize the phrase supporting the relationship, relying on the user for corrections rather than selecting it from scratch.
Once a substantial corpus has been assembled, annotated and analyzed, additional annotation will benefit from the existing material. As we learn about typical disagreements between annotators on the same snippet, and about ambiguous phrases and acronyms, we will provide helpful on-the-fly tips and shortcuts to the annotator. Future directions include the integration of knowledge from other biological sources, such as gene expression data and sequence-derived knowledge, into the disease evidence networks. An additional improvement, aimed at extending the use of the resulting annotated corpora, is to adopt the Distributed Annotation System (DAS) protocol to provide the annotated relationships and their textual support to other tools and servers.
We also plan to improve the available software and extend it by providing tools for further post-processing the annotated snippets, such as retrieving all the annotations above varying levels of agreement.
In conclusion, while this resource was developed with binary gene-gene and gene-disease relations in mind, we would welcome applications outside of the original domain. In principle, the tool may be used to create annotated benchmark corpora for arbitrary domains, as long as named entities can be reliably identified.
C.C. and A.B. are supported by the projects P08-TIC-4299 of J. A., Sevilla and TIN2006-13177 of DGICT, Madrid. L.P. is supported by the Milton foundation.
1. STRING version 6.3