Overview of the GENIA corpus
The event annotation presented here builds on our earlier work in extracting the GENIA corpus and annotating it with linguistic features and biological terms. The documents in the corpus come from the Medline database, which covers a broad range of domains in bio-medicine. However, since we are interested in providing semantically rich annotation for text mining in molecular biology, we have focused on a much smaller, semantically homogeneous subject domain: biological reactions concerning transcription factors in human blood cells. We used a search query, "Humans" [MeSH] AND "Blood Cells" [MeSH] AND "Transcription Factors" [MeSH] to retrieve a set of articles, and then chose 2,000 of these articles for our annotation.
The biological annotations in the GENIA corpus include term annotation, which was completed in earlier work, and the event annotation described in this paper. The term annotations include 93,293 bio-medical terms that have been annotated using the 35 terminal classes of the GENIA term ontology (see Figure ). The event annotation was performed on top of the term annotation, relating the terms.
GENIA term ontology. The hierarchy of the GENIA term ontology. Terminal classes are used for GENIA term annotation. The figures in parenthesis indicate number of annotation instances made to the GENIA corpus.
While terms in text are related with each other in various ways, we have focused on dynamic relations. By "dynamic", we mean that at least one of the biological entities in the relationship is affected, with respect to its properties or its location, in the reported context. Extracting such information from text would be useful in building models of biological systems, e.g. pathways. In order to focus on dynamic relations, some relationships are excluded from our annotation, even though they are biologically important. Static relationships such as Part-of, IS-A, and Similarity relationships between terms are all excluded. (This does not necessarily mean that expressions in text which usually describe static relations were ignored. See Section Single-facet Annotation for detail.) Examples of these are given below:
• The structural similarity of SNI1 to Armadillo repeat protein ... [Similarity]
• Connexin has four transmembrane domains. [Part-of]
• NF kappa B, a transcription factor, is ... [IS-A]
Relationships outside the domain of molecular biology, such as clinical ones involving diseases and symptoms, are also excluded from the current annotation.
An example of event annotation
Figure shows a screen snapshot of our annotation tool, XConc. There are four regions within the figure, each outlined by a box. The top box contains a sentence under annotation. That is,
The binding of I kappa B/MAD-3 to NF-kappa B p65 is sufficient to retarget NF-kappa B p65 from the nucleus to the cytoplasm.
Figure 2 Example of event annotation. GENIA event annotation is made sentence by sentence. Although the actual corpus file with annotation is encoded in XML (C), the annotators work on a CSS-styled view (A) which is much more user-friendly. Sometimes, a graphical ...
Each of the remaining three boxes displays an event annotation which has been added to this sentence. The original sentence without term annotation is shown inside each of those boxes, to allow annotators to mark-up text spans that belong to the corresponding annotation.
Biological entities, which had been annotated earlier during term annotation, are shown in colors on the screen. Blue and green indicate protein molecules and cell components, respectively. Each term is assigned a term Id (T36~T40 in the example of Figure ). These terms are expressed as n-tuples of attribute-value pairs as follows:
• (Id: T36, Class: Protein_molecule, Name: I kappa B/MAD-3)
• (Id: T37, Class: Protein_molecule, Name: NF-kappa B p65)
• (Id: T38, Class: Protein_molecule, Name: NF-kappa B p65)
• (Id: T39, Class: Cell_component, Name: nucleus)
• (Id: T40, Class: Cell_component, Name: cytoplasm)
As mentioned, the three boxes under the input sentence in Figure show three event annotations. The first event E5 represents binding of the two entities, T36 (I kappa B/MAD-3) and T37 (NF-kappa B p65). The word "binding" is shown in red. This indicates the clue which the annotator used as textual evidence for the existence of a binding event. One of our annotation principles requires each event to be supported by such a clue word. This principle is described in the Text-bound Annotation Section. Clue words are described in detail in the section Linguistic clues and event classes. Additional supporting words are shown in yellow ("of" and "to").
Each event is also assigned a unique Id. The description of the binding event is:
• (Id: E5, Class: Binding, ClueType: binding, Theme: T36, Theme: T37)
The two arguments are specified by their Ids so that they are unique and bound globally over the corpus. The Theme in an event is an attribute or slot to be filled by an entity or entities whose properties are affected by the event. The second event E6 represents localization of the protein T38. The textual clues, "retarget" and "to the cytoplasm", are marked up as key expressions denoting the event type and the location relevant to the event, respectively. E6 is represented as:
• (Id: E6, Class: Localization, ClueType: retarget, Theme: T38, ClueGoal: T40)
T38 is taken as a Theme since its location is affected by the event. The two entities T37 and T38, though they have the identical textual expression NF-kappa B p65, are distinguished by their Ids. They appear in two different spans in text and thus in different biological contexts. They are identified as the Themes of the two events E5 and E6, respectively. This distinction is important for identification of biological entities in their proper context (See Section Event annotation and pathways).
The last event E7 is the causality relation between E5 and E6. That is, the binding event (E5) of the two proteins "causes" the localization event (E6) of one of the two proteins. This causality relation is represented as an event of type Positive_regulation.
(Id: E7, Class: Positive_regulation,
ClueType: is sufficient to,
Theme: E6 (Localization, Theme: T38),
Cause: E5 (Binding, Theme: T36, Theme: T37))
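The attribute-value tuples above map naturally onto a small data model. The following Python sketch is illustrative only: the class names `Term` and `Event`, their field names, and the `brief` helper are assumptions made for this example, and do not reflect the distributed corpus format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass(frozen=True)
class Term:
    """A term annotation: (Id, Class, Name)."""
    id: str
    cls: str
    name: str


@dataclass
class Event:
    """An event annotation; its arguments are Terms or other Events."""
    id: str
    cls: str
    clue_type: str
    themes: List[Union[Term, "Event"]] = field(default_factory=list)
    cause: Optional[Union[Term, "Event"]] = None
    clue_goal: Optional[Term] = None


def brief(arg: Union[Term, Event]) -> str:
    """Render an argument as in the E7 description: nested events inline."""
    if isinstance(arg, Term):
        return arg.id
    themes = ", ".join(f"Theme: {t.id}" for t in arg.themes)
    return f"{arg.id} ({arg.cls}, {themes})"


# Terms T36..T40 from the example sentence.
t36 = Term("T36", "Protein_molecule", "I kappa B/MAD-3")
t37 = Term("T37", "Protein_molecule", "NF-kappa B p65")
t38 = Term("T38", "Protein_molecule", "NF-kappa B p65")
t39 = Term("T39", "Cell_component", "nucleus")
t40 = Term("T40", "Cell_component", "cytoplasm")

# Events E5..E7; E7 takes the other two events as its arguments.
e5 = Event("E5", "Binding", "binding", themes=[t36, t37])
e6 = Event("E6", "Localization", "retarget", themes=[t38], clue_goal=t40)
e7 = Event("E7", "Positive_regulation", "is sufficient to",
           themes=[e6], cause=e5)

print("Theme:", brief(e7.themes[0]))
print("Cause:", brief(e7.cause))
```

Note that `t37` and `t38` share a name but remain distinct objects because their Ids differ, which mirrors the distinction drawn in the text between the two occurrences of NF-kappa B p65.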
In the current GENIA event ontology, Regulation has a broader definition than regulatory events in a strict biological sense, e.g., catalysis, inhibition, up-/down-regulation, etc. It is used to encode general causality among events. We will discuss the issues related to regulatory events in Section General causality. Note that, although the expression "is sufficient to" is hardly a typical linguistic expression for causality, the annotator recognized it as such in this sentence.
To assist the reader in understanding these relationships, we present Figure , a graphical depiction of the example from Figure . In this representation, entities from the GENIA term ontology are shown in rectangular boxes, while entities from the GENIA event ontology are shown in circles. Black, red and blue arrows indicate a link between an event and its themes, causes and location, respectively.
Figure shows the XML representation of the three event annotations. This format will be used for public distribution of the event-annotated corpus.
Event annotation and biological ontologies
Although text in natural language (like English) is easy for human readers to understand, the "same" biological events are expressed in diverse surface textual forms. A representation scheme of events such as those in the previous section is important for reducing such surface diversity. It represents the "same" events in the same formats.
In order to establish such a scheme, we have to answer certain ontological questions, such as how to identify the "same" events or the same types of events (event classes), and what structures are needed to represent them. We partly avoided these questions by adopting the Gene Ontology (GO) [47] as our core ontology. We started with GO to define the initial set of event classes and revised them subsequently. The definitions in GO have frequently been referred to by our annotators to judge whether events in text belong to certain event classes or not.
While our information-centered approach to event annotation frees the annotators from linguistics-based criteria for annotation, annotation should not be totally free from the text being annotated. Annotation by biologists should be constrained by the information actually encoded in the text. In other words, annotation should be based on information explicitly present in the source text and should not stray too far from it. This requirement that annotation should reflect the organization of information in text imposes constraints on our representation scheme, distinguishing it from other, more biology-oriented, schemes. In the following three sections, we will describe the ontology we used for text annotation, discussing how it differs from other bio-ontologies and the reasons why.
GENIA ontology and GO
The GENIA event annotation relies on two ontologies: the event ontology and the term ontology. The GENIA event ontology defines and classifies events (or occurrents in the terminology of philosophical ontology [62]) which are of interest in the GENIA domain. In contrast, the GENIA term ontology defines things (or continuants [62]) which cause or run through the events. Roughly speaking, the event ontology provides vocabulary for predicates (e.g. "binding", "phosphorylation", etc.), while the term ontology is for arguments (e.g. proteins) which are used in event descriptions. The term ontology is given in Figure . For the details of the term ontology, please refer to [32]. In this section we focus on the details of the GENIA event ontology.
Figure shows the hierarchy of the event classes in the GENIA event ontology. The numbers attached to the nodes are the number of instances of the events in the current annotation of 1,000 abstracts. With the exception of the six classes shown in dotted boxes (Gene_expression, Artificial_process, Correlation, Regulation, Positive_regulation, Negative_regulation), all event classes are taken from GO. We inherit the names and definitions of the event classes from GO, performing minimal conversion to fit them to the Web Ontology Language (OWL) naming conventions. While the GO class Regulation, with its two sub-classes Negative regulation and Positive regulation, remains in our ontology, the definitions of these classes are different from those of GO (See Section General causality). Since the domain of interest in GENIA is much narrower than that in GO, we only use a subset of the GO classes. For example, under the top level class Biological_process, we retain only three classes, Cellular_process, Physiological_process and Viral_life_cycle. These three classes reflect the three major topics in the GENIA domain. In particular, Physiological_process with its subclasses Metabolism and Localization is the main focus of the domain. Accordingly, the GENIA event ontology includes the finer grained GO sub-types of these event classes.
Figure 3 GENIA event ontology. The hierarchy of the GENIA event ontology. For event annotation, not only terminal classes but also classes at higher level are allowed to be used. The figures in parenthesis indicate number of annotation instances made to the GENIA ...
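Because annotators may use non-terminal classes as well as terminal ones, it can be useful to test whether one event class falls under another. The sketch below is a minimal illustration that encodes only the parent links named explicitly in the text; the names `PARENT`, `ancestors` and `is_a` are hypothetical, and the full hierarchy (including where Regulation, Gene_expression, Artificial_process and Correlation attach) is given only in the ontology figure and is not reproduced here.

```python
# Parent links for the event classes named in the text. Classes whose
# position in the hierarchy the text does not state are left as roots here.
PARENT = {
    "Cellular_process": "Biological_process",
    "Physiological_process": "Biological_process",
    "Viral_life_cycle": "Biological_process",
    "Metabolism": "Physiological_process",
    "Localization": "Physiological_process",
    "Positive_regulation": "Regulation",
    "Negative_regulation": "Regulation",
}


def ancestors(cls):
    """Return the chain of superclasses of cls, nearest first."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def is_a(cls, ancestor):
    """True if cls is the same class as ancestor or lies below it."""
    return cls == ancestor or ancestor in ancestors(cls)


print(ancestors("Localization"))
print(is_a("Positive_regulation", "Regulation"))
```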
In addition, the GENIA event ontology has the following three event classes which GO does not have: Gene_expression, Artificial_process, and Correlation.
Gene expression is missing from the Gene Ontology, so for the GENIA event class Gene_expression, we use the definition given in MeSH, i.e. the phenotypic manifestation of a gene or genes by the processes of genetic transcription and translation. Gene expression is not in GO because it is not a single event, but a macro process. An event in this class consists of micro events or processes such as transcription, translation, and post-translational processes. All of these micro events are in GO. While the decision to exclude a composite process like Gene expression may be justifiable in GO, we need this class for text annotation. The versatility of natural language freely allows authors to express information with variable granularity, and authors often use expressions with coarse granularity to denote complex objects or events. Such expressions are pervasive in text: in the GENIA event-annotated corpus, 3,535 events have been annotated as Gene_expression. Some example sentences involving Gene_expression are given below:
• T-cell expression of the human GATA-3 gene is regulated by a non-lineage-specific silencer (Figure ).
• Most retinoblastoma specimens revealed a high COX-2 expression.
• IL-10 preferentially increased expression of IFNgamma-inducible genes.
• However, B cells can also synthesize IL-2.
• The ability of CMV IE gene products to enhance IL-6 production may play ...
Figure 4 Graphical representation of events in some example sentences. Examples in text with corresponding event annotation in graphical representation. (A) T-cell expression of the human GATA-3 gene is regulated by a non-lineage-specific silencer. (B) The extent ...
Artificial_process describes experimental processes which are performed by human researchers. Examples include transfection and treatment. Although the use of this event class was not encouraged, the annotators identified 597 events in 1,000 abstracts. Example sentences involving Artificial_process are given below:
• ... to induce NF-kappa B/Rel nuclear activity in cells incubated in the presence of 3,4-dichloroisocoumarin, ...
• Endogenous or exogenously administered RA may have a significant role in HIV regulation.
• Over-expression of STAT2 by transfection of the cDNA prevented apoptosis of the T cell clones.
Correlation represents an underspecified relation between events. It is a characteristic feature of natural language that authors can leave irrelevant or unknown details unspecified. Consider the following sentence:
The extent of IFN-induced NK cell killing of E1A-expressing cells was proportional to the level of E1A expression ... (Figure )
The text in this example indicates that there is a certain relationship between the two events "IFN-induced NK cell killing of E1A-expressing cells" and "E1A expression", but the author avoids specifying which event is the cause and which one is the consequence.
Such under-specification is frequently observed in text, and there are many linguistic expressions used to leave the relationships underspecified. While the exact relationship is left unsaid in such expressions, the existence of a relationship between two events is still crucial information for biologists. In these cases we encouraged annotators to use the event type Correlation. 1,722 Correlation events are recognized in the current annotation. Some examples are given below:
• Cell hemoglobinization was accompanied by the increased expression of genes encoding gamma-globin. (Figure )
• Decreased adhesion molecule expression was associated with a reduction of monocytic cell adhesion.
• ...may have a role in the increase in globin gene transcription that characterizes erythroid maturation.
• This increase in p50 homodimers coincides with an increase in p105 mRNA.
Event annotation and pathways
While developing the annotation framework which we have described so far, we compared our work to current research in representing pathways [63]. A pathway is a detailed graphical representation of a biological system, which comprises a set of mutually related events [65]. It integrates pieces of information on biological events scattered in many scientific publications into a coherent system, and thereby facilitates discussion among a large group of biologists and helps build consensus on what actually happens in a biological system.
The event annotation is intended to be used for development of an ER (event recognition) program. While the results of ER can be used for various NLP-based TM applications such as intelligent text retrieval and question answering, one of the major challenges is to use them to associate text fragments with the relevant parts of pathways, or to construct pathways semi-automatically. Since events extracted from individual papers have to be integrated into organized networks of events, we need to transform the results of ER into the forms required by pathway models [66].
Research on formalizing pathway representation has made significant progress in the last few years and has reached a consensus on how information on biological events should be represented [63], showing how biological events should be represented in a way consistent with the scientific view of a biological system. This consensus contrasts with our own event representation. These contrasts highlight the difference between building a biological model, as pathways do, and building a loose biological description, as we find in natural language. From this point of view, the two most significant properties of pathway representations can be summarized as follows:
(1) Entity-Centered Representation. Pathway representation has become entity-centered, while language organizes information in a predicate-centered manner. That is, pathways are usually organized around state-changes of continuants. The major players in this type of representation, e.g. nodes in a graphical representation, are biological entities which correspond to continuants in specific biological contexts. Events organized around predicates are relegated to mere labels which are attached to links between nodes.
(2) General Causality. As a typical pathway shows, biological events are intertwined with each other. This makes it difficult, if not impossible, to determine causation, e.g. which event causes which. As a result, pathway representations either eradicate "general" causality from their representations or restrict the relation to a set of limited relations whose underlying mechanisms are well circumscribed.
We discuss each of these issues in detail in the following sections.
Systems Biology Mark-up Language (SBML) is a framework which is becoming a de facto standard for pathway representation, and which clearly commits to the entity-centered representation [63]. Figure shows the SBML representation for the same set of events as in the previous example, in Figure . In this representation, the same continuant, NF-kappa B p65, appears as three distinct nodes in different biological contexts: one before binding, another after binding, and the third after localization. These three nodes denote instances of the same continuant in different biological contexts. Since these three instances have different properties, it is natural that a pathway representation captures them as different nodes. In this paper we apply definitions introduced in [62], which distinguishes between continuants and occurrents. A continuant is an entity which endures, or continues to exist, through time while undergoing different sorts of changes, including changes of location. We use the term biological entity to refer to an instance of a continuant at a specific time, which is also bound to a specific biological context. The SBML representation is entity-centered since it gives independent status to each of the biological entities, or instances of the same continuant.
SBML-style event description for the example in Figure 2. The nodes denote biological entities. The links denote transitions between different states of entities and correspond to events causing the state transitions.
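The entity-centered view can be illustrated with a small sketch in which the nodes are instances of a continuant in a given context and the events survive only as labels on the links. This is not SBML and does not use any SBML library; the `EntityState` class and the context labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityState:
    """An instance of a continuant bound to one biological context."""
    continuant: str  # the enduring thing, e.g. a protein
    context: str     # hypothetical label for the biological context


# Three instances of the same continuant, as described in the text.
p65_before = EntityState("NF-kappa B p65", "before binding")
p65_bound = EntityState("NF-kappa B p65", "after binding")
p65_moved = EntityState("NF-kappa B p65", "after localization")

# Entity-centered representation: links between entity states, each
# labeled with the event that causes the state transition.
transitions = [
    (p65_before, "Binding", p65_bound),
    (p65_bound, "Localization", p65_moved),
]

nodes = {state for src, _, dst in transitions for state in (src, dst)}
continuants = {state.continuant for state in nodes}

print(len(nodes), "nodes for", len(continuants), "continuant")
```

In the predicate-centered event annotation, by contrast, the two events are the primary objects and the protein appears only as the text spans that the sentence actually contains.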
On the other hand, natural language text does not usually make explicit such distinctions among instances of the same continuant with different properties or in different contexts. Consider the example sentence (shown here again for quick reference):
The binding of I kappa B/MAD-3 to NF-kappa B p65 is sufficient to retarget NF-kappa B p65 from the nucleus to the cytoplasm.
The two events (the binding and localization events that occur in sequence) and their relationship are described. Since the sentence is organized around the main predicate "is sufficient to" without any explicit time points, there is no natural way to introduce a new entity (NF-kappa B/I kappa B complex before localization) created by the binding event. The first occurrence of NF-kappa B p65 is involved in the binding event, but the expression does not make explicit whether it denotes the entity before or after the event. The same is true for the localization (retargeting) event; since the sentence is organized around the predicate "retarget", the distinction of the entities before and after the retargeting event is not made explicit.
Although such implicitness may be taken as a limitation of natural language as a language for science, it contributes to the ease and efficiency of communication by language. Human perception of continuants is strong. Even though a continuant may change its properties over the course of an event, it is perceived as the same continuant and expressed as such in language. Such a conception of the perpetual existence of continuants strongly influences expressions in language. It may even affect our modes of intuitive understanding and inference. Since continuants recognized as such permeate text, replacing them with distinct entities in different contexts requires a significant reorganization of information in text, and thus makes text annotation extremely difficult.
While the introduction of new entities such as NF-kappa B/I kappa B complex in the nucleus or NF-kappa B/I kappa B complex in the cytoplasm may improve the explicitness of pathway representations, in event annotation it is likely to introduce different interpretations by individual biologists. Interpretations which are not properly bound to expressions in text are one of the major causes of inter-annotator discrepancy. As we saw in Section An example of event annotation, we have two textual spans with the same expression NF-kappa B p65, but with different Ids. The existence of these two distinguished entities is supported by evidence in text, and will facilitate the transformation from a textual description of the event to a more biology-oriented representation. However, no distinctions which lack explicit textual evidence should be made in the annotation.
Representation of General causality is closely related to the treatment of another controversial concept, "Agency." Agency, like causality, is basically an epistemological concept which presupposes that a participant with intention is involved in the event. Of the two major roles, Agent (deep subject) and Theme (deep object), which linguists normally use in the semantic representation of an event, involvement of the Agent in an event is much more tenuous than that of the Theme. In particular, verbs such as "raise," "activate", and "inhibit" which, by themselves, do not specify what actions are taken by their agents, pose special difficulties in semantic analysis.
The sentence "Mary hurt John," for example, can be interpreted as "Mary did something" which resulted in "John being hurt" [67]. The sentence explicitly states the getting hurt event, and the involvement of John (Theme) is obvious since John is affected by the event. On the other hand, the actual event in which Mary (Agent) is involved is unstated, and the connection between Mary (Agent) and the getting hurt event is only causality: whatever Mary did, it caused John to get hurt. In this analysis, verbs like "hurt" are taken to express a causal relationship between unspecified actions, taken by the Agent, and the event which explicitly involves the Theme.
This analysis provides us with a principled way of treating verbs such as "activate," "promote," "inhibit," and "induce." In the domain we are dealing with, there are no Agents with intention except in Artificial_process events. We therefore treat these verbs simply as expressions of causality. Consider the following three sentences:
• Expression of LMP1 in host cells activates NF-kappa B.
• LMP1 needs only 11 amino acids to activate NF-kappa B.
• All six B-cell lines tested showed NF-kappa B activation in response to LMP1 expression.
These three sentences show the variety of ways in which an event and its causes can be linked in text. The last sentence expresses the causal relationship between the two events (Activation of NF-kappa B and Expression of LMP1) by linking them with "in response to", while the other two sentences use the verb "activate" to express the causal relation. In addition, the first sentence expresses the cause as an event ("LMP1 expression"), while the second sentence expresses it as an entity ("LMP1"). These two expressions differ on the surface, but they are related in meaning. In our representation, activation of a protein is classified as a Positive_regulation event, following the definition in GO. Such regulation events can have causes, which are other events. Hence, in the first sentence, the event Expression of LMP1 can be represented as a cause of the event Activation (See Figure ). In the second sentence, the protein "LMP1" is directly linked as a cause of Activation (See Figure ). Equivalence between the two expressions can be recognized by applying a rule of entailment: "If a protein positively regulates an event, physical manifestation of the protein will cause the event."
Figure 6 Graph representations of events about "LMP1 to activate NF-kappa B". (A) expresses the event "LMP1 activates NF-kappa B", and (B) expresses the event "expression of LMP1 activates NF-kappa B". Biological implication of the two expressions is equivalent, ...
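The entailment rule quoted above can be read as a normalization step applied before two annotations are compared. The following sketch is one possible reading, not the procedure used for the corpus: it assumes that "physical manifestation of the protein" is modeled as a Gene_expression event whose Theme is that protein, and the dictionary layout and function names are hypothetical.

```python
def normalize_cause(cause):
    """Map an entity cause to the event form used for comparison.

    Assumption: a bare protein name given as a cause is rewritten as a
    Gene_expression event with that protein as its Theme.
    """
    if isinstance(cause, str):       # a bare entity name
        return ("Gene_expression", cause)
    return cause                     # already an (event class, theme) pair


def same_regulation(a, b):
    """True if two regulation events agree once causes are normalized."""
    key_a = (a["class"], a["theme"], normalize_cause(a["cause"]))
    key_b = (b["class"], b["theme"], normalize_cause(b["cause"]))
    return key_a == key_b


# "Expression of LMP1 in host cells activates NF-kappa B."
first = {"class": "Positive_regulation", "theme": "NF-kappa B",
         "cause": ("Gene_expression", "LMP1")}
# "LMP1 needs only 11 amino acids to activate NF-kappa B."
second = {"class": "Positive_regulation", "theme": "NF-kappa B",
          "cause": "LMP1"}

print(same_regulation(first, second))
```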
In contrast to these textual expressions of causality, biology-oriented representations like SBML pathways do not represent causality among events explicitly. Instead, a sequence of state changes of biological entities is represented in a network. A set of biological entities upstream in a network is linked with other biological entities downstream, whose states change. Causality is represented implicitly by linked paths between upstream and downstream entities. In such a representation, LMP1 would be located upstream, with active NF-kappa B downstream.
However, the pathway representation makes other relationships even more explicit than they usually are in text. For example, the second sentence given above suggests that LMP1 has a binding site of 11 amino acids for an unspecified adaptor protein. No concrete adaptor proteins were mentioned in the abstract where this sentence appears. However, a review paper [68] constructed a partial pathway (Figure ) in which the adaptor protein was identified as TRADD. This information came from other publications, and the author of the review paper integrated such pieces of information scattered in the literature, in order to create a pathway. Furthermore, the resulting pathway indicates that a long sequence of biological entities and their state changes intervene between LMP1 and activated NF-kappa B. The linked path involves the adaptor protein TRADD, NIK, IKK, and others, and finally reaches activated NF-kappa B. This is in contrast to the three sentences shown above, which gloss over the linked path by simply expressing that "expression of LMP1" causes "activation of NF-kappa B."
Figure 7 Molecular interactions and signaling pathways engaged by LMP1. LMP1 is involved in the activation of NFkB. Even though it has to get through a complex path for the role of LMP1 to take effect on the activation of NFkB, in natural language text, the involvement ...
As these examples show, causality expressions are convenient since they allow authors to describe relations among events without explaining the details of underlying mechanisms. Authors may want to leave such explanations out of a publication when they are not relevant or, in some cases, since the authors may not know these underlying mechanisms. For all of these reasons, expressions for causation are pervasive in text. Several more examples are given below:
• Expression of LMP1 activates transcription from p50/p65- and c-Rel-responsive promoters.
• Expression of LMP1 in EBV-negative nasopharyngeal epithelial cells induced COX-2 expression.
• Inhibition of NF-kappa B in T-lineage cells leads to a dramatic decrease in cell proliferation.
• Overexpression of TRAMP leads to two major responses, NF-kappaB activation and apoptosis.
• Apoptosis can occur after Bcl-2 phosphorylation.
In response to the omnipresence of causal expressions in natural language, we have chosen to make causality explicit in our event representation. In addition to expressions like "is sufficient to" and "in response to", verbs such as "induce," "promote," "activate," and "lead to" are treated as expressions of causal relationships between events. Note also that temporal expressions such as "after" are interpreted in some contexts as Causal in our representation.
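As a rough illustration of how the causal expressions listed above might be located in a sentence, the sketch below scans for them with regular expressions. It only proposes candidate clue expressions; in the corpus itself the decision was made by human annotators, and "after" in particular is causal only in some contexts. The pattern table, the simple inflection handling and the function name are assumptions made for this example.

```python
import re

# Expressions the text treats as causal; patterns cover simple inflections.
CAUSAL_PATTERNS = {
    "is sufficient to": r"\bis sufficient to\b",
    "in response to": r"\bin response to\b",
    "induce": r"\binduc(?:e|es|ed|ing)\b",
    "promote": r"\bpromot(?:e|es|ed|ing)\b",
    "activate": r"\bactivat(?:e|es|ed|ing)\b",
    "lead to": r"\b(?:lead|leads|led|leading) to\b",
}
# Temporal expressions that are causal only in some contexts.
CONTEXT_DEPENDENT = {"after": r"\bafter\b"}


def find_clue_candidates(sentence):
    """Return sorted (offset, label, needs_context_check) candidates."""
    found = []
    for label, pattern in {**CAUSAL_PATTERNS, **CONTEXT_DEPENDENT}.items():
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            found.append((match.start(), label, label in CONTEXT_DEPENDENT))
    return sorted(found)


for s in [
    "Overexpression of TRAMP leads to two major responses, "
    "NF-kappaB activation and apoptosis.",
    "Apoptosis can occur after Bcl-2 phosphorylation.",
]:
    print([(label, check) for _, label, check in find_clue_candidates(s)])
```

Nominal forms such as "activation" are deliberately not matched here, since in the annotation they denote events in their own right and do not link two events.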
Biological annotation and quality control
Before the actual event annotation, we performed a preliminary annotation with a loosely defined annotation scheme. We first gave annotators a set of GO classes with their definitions, and asked each of them to annotate the same set of abstracts. As mentioned previously, we did not restrict these annotations to staying within the boundaries of linguistic structures such as constituent or predicate-argument structure. For example, biologists identified events in expressions such as "the inhibitory effect of CaM-K II on IL-2 promoter" (See Section Linguistic clues and event classes). They often saw causal relations among events in temporal sequences such as "apoptosis can occur after Bcl-2 phosphorylation". They tended to ignore or abstract away from certain linguistic structures. They simply decomposed "A activates B as well as C" into two events, "A activates B" and "A activates C". Some adjectives were treated as causes, as in "mitogenic activation" and "thermal activation", while certain adverbs such as "transcriptionally" in "A upregulates transcriptionally B" were taken to signal events. Our annotators identified two events, upregulation and transcription, in this last example.
Interesting though they were, the preliminary results of annotation also showed the difficulty of the biological annotation. That is, since it relied on interpretation by individual biologists without specific annotation guides, inter-annotator discrepancies were much larger than we had expected. As a result we developed several techniques for a more sophisticated annotation methodology, which improved inter-annotator agreement.
First, biological annotation inevitably involves interpretation based on background knowledge and information from context. However, these are the two main factors which lead to discrepancies. We had to introduce a principle of annotation to curb the effects of these factors (Text-bound Annotation). Second, we had to give very clear guidance on the scope of annotation. This principle guides what types of information should be annotated and what types should not (Single-facet Annotation). Finally, we needed careful verification of annotation results. In particular, we found Cross Validation between event and term classes very effective for finding anomalies and cleaning up annotations (Semantic Typing and Cross Validation).
The environment for annotation work also played a crucial role in quality control. Sharing experience, in particular reviewing text previously annotated by other annotators, became essential for maintaining homogeneity of annotation. The coordinator of annotation organized weekly meetings with the annotators and involved them closely in the adjudication process. We also developed a tool (XConc) for multi-layered annotation. The environment of annotation will be described in the Methods Section. Text-bound Annotation, Single-facet Annotation, and Semantic Typing are discussed in the following sections.
The first key principle which we established for reducing annotator discrepancies is called Text-bound Annotation. It can be described simply as follows:
Associate all annotations with actual expressions in text.
A similar principle was used in the annotation of BioInfer [56]. As in BioInfer, we do not allow annotators to annotate an event unless an expression mentioning the event type appears in the text. However, in our approach we deliberately dissociate annotation from linguistic structures, and events in our annotation are not necessarily organized around verbs. That is, an event does not necessarily correspond to a constituent, such as a clause or phrase, governed by a verb. Expressions which indicate occurrences of an event and expressions which describe its participants (arguments) can be scattered throughout a sentence without forming a single constituent in the linguistic structure. Nonetheless, such expressions must be provided for each annotation, and we refer to them as "clue words" or "clue expressions". This principle ensures that each annotation is grounded in textual evidence, and that annotations are not the result of unbounded interpretation by individual annotators. It applies even when the annotator could infer the existence of an event from context (See Section Linguistic clues and event classes).
We also aligned our annotations to single sentences. That is, all evidence attached to an event should come from the same sentence. There are some cases in which slots for arguments are filled by anaphoric expressions such as pronouns, definite noun phrases and noun phrases with demonstrative determiners (such as this or these). Only in such cases were annotators allowed to expand the scope of annotation, identifying textual expressions outside the current sentence to fill the argument slots. Even in these cases, expansion of scope is explicitly indicated by a special link (Co-Ref link), which associates the anaphoric expression inside the sentence with the entity outside.
The goal of these restrictions is to prohibit annotators from introducing entities or events which lack textual clues in the same sentences. This does not imply that annotation was performed sentence by sentence, without considering context. On the contrary, the annotators were encouraged to use the document context for disambiguation. Consider the following examples:
• In addition, forced expression of GATA3 potentiated the induction of RALDH2 by TAL1 and LMO, and these three factors formed a complex in vivo (Figure ).
• Furthermore, a TAL1 mutant not binding to DNA also activated the transcription of RALDH2 in the presence of LMO and GATA3.
• In contrast, in vivo footprints on GT (CACCC) motifs differed between the cells expressing the fetal or the adult globin program.
In the first of these sentences, an annotator has to disambiguate the anaphoric expression these three factors. Without context, it can refer to either of the two sets of entities, (TAL1, LMO, RALDH2) or (TAL1, LMO, GATA3). However the second sentence provides enough context for the annotator to identify the third element in the set as GATA3, not RALDH2. It is important to note that this type of interpretation still adheres to our principle of Text-bound Annotation, because it relies on textual evidence in the same sentence: namely, the anaphoric expression.
On the other hand, although "footprints" in the third sentence indicates a DNA-binding event, implying the presence of a protein which is bound, there are no textual clues in the sentence to indicate the existence of such a protein. In such cases, annotators were not permitted to represent this protein (the hypothesized Theme of binding) in annotation, even if they could identify the missing element from context. As a result, quite a few events in our annotation lack necessary arguments (See Results and discussion Section). Filling them from context remains a topic for future work, since this would require carefully calibrated guidelines to ensure inter-annotator agreement.
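The constraints of Text-bound Annotation described above can be summarized as a small consistency check. The following is only an illustrative sketch, not the representation used by XConc: the class names (`Event`, `Argument`), the character-offset spans, and the function `text_bound_violations` are all assumptions introduced here. It encodes three rules taken from the text: every event needs a clue expression, all evidence must lie in the same sentence, and material outside the sentence may be reached only through a Co-Ref link from an anaphoric expression inside the sentence. A missing argument is not a violation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


@dataclass
class Argument:
    role: str                            # e.g. "Theme" or "Cause"
    span: Span                           # expression that fills the slot
    coref_target: Optional[Span] = None  # antecedent reached via a Co-Ref link


@dataclass
class Event:
    event_class: str
    clue: Optional[Span]                 # None means no textual evidence was given
    arguments: List[Argument] = field(default_factory=list)


def inside(span: Span, sentence: Span) -> bool:
    """True if span lies entirely within the sentence boundaries."""
    return sentence[0] <= span[0] and span[1] <= sentence[1]


def text_bound_violations(event: Event, sentence: Span) -> List[str]:
    """List the ways an event breaks the (hypothetical) text-bound rules."""
    problems = []
    if event.clue is None:
        problems.append("no clue expression")
    elif not inside(event.clue, sentence):
        problems.append("clue expression outside sentence")
    for arg in event.arguments:
        # The slot filler itself must be in the sentence; only its
        # Co-Ref antecedent (coref_target) may lie outside.
        if not inside(arg.span, sentence):
            problems.append(f"{arg.role} expression outside sentence")
    return problems


sentence = (0, 100)
# A Binding event with a clue but no Theme is acceptable under these rules.
footprint_event = Event("Binding", clue=(10, 20))
print(text_bound_violations(footprint_event, sentence))
```

Under this sketch, the "footprints" case is represented as an event with an empty argument list rather than with a Theme taken from elsewhere in the document.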
Our second key principle for reducing annotator discrepancies is called Single-facet Annotation. It is described as follows:
Keep the view point for annotation as simple and focused as possible.
Consider the following sentence:
Calcineurin acts in synergy with PMA to inactivate I kappa B/MAD3, an inhibitor of NF-kappa B.
One annotator identified a single event in this sentence, which was Inactivation of I kappa B/MAD3 by Calcineurin. However, another annotator claimed that the sentence conveys additional biologically important information: that calcineurin actually enables NF-kappa B to be activated by inactivating I kappa B/MAD3, which inhibits NF-kappa B. For her, the expression "I kappa B/MAD3, an inhibitor of NF-kappa B" indicated another event: Inhibition of NF-kappa B by I kappa B/MAD3. This is a typical discrepancy caused by the multi-faceted nature of information in text.
When we see the sentence from the view point of events and their relationships, we interpret the sentence in the same way as the second annotator. That is, we consider every expression in the sentence as possible evidence of an event, even in cases where there is no explicit verb, as in "I kappa B/MAD3, an inhibitor of NF-kappa B".
On the other hand, the first annotator read the sentence from a rather general, non-focused view. She used a generic interpretation of the linguistic device of apposition, so she interpreted the same expression as a static IS-A relation, i.e. I kappa B/MAD3 IS-A inhibitor of NF-kappa B.
The goal of Single-facet Annotation is to reduce such discrepancies by defining one aspect of text as the focus of annotation. In our annotation, we asked annotators to examine text from the focused view point of events and their relationships. We gave each annotator a list of event classes from GO (the 35 event classes we chose) and asked them to identify as many events and their relations as possible in every sentence, within the limit imposed by Text-bound Annotation. We call this Event-centered Annotation as an instance of Single-facet Annotation.
Event-centered Annotation not only reduced annotator discrepancy but also contributed to the identification of a diverse vocabulary of event-related expressions. This is a secondary feature of Single-facet Annotation. As the annotation example above shows, focusing our interpretation on one facet of text, like the expression of events and their relationships, allows us to ignore the constraints that are usually imposed by other facets, like linguistic constituent structure. When we instruct annotators to examine every part of a sentence with respect to its role in an event, they are able to ascribe event-related meanings to parts of the sentence that cross constituent boundaries and that do not conform to predictable predicate-argument structures. Table shows examples which were identified as Inhibition events by the annotators. These examples demonstrate the wide variety of expressions that can be interpreted as events under the principle of Single-facet, Event-centered Annotation.
Linguistic realization of the word "inhibit" in various contexts
Hence, these principles work together to bound the interpretations given by annotators. Single-facet Annotation, in particular Event-centered Annotation, forces annotators to identify events, rather than static semantic relationships (IS-A, for example), or syntactic features. According to this principle, they should annotate as many events as possible. The principle of Text-bound Annotation gives this process a well-defined stopping criterion: "As many as possible" means precisely the number of events that can be linked with textual evidence, or clue words, from the same sentence.
Semantic Typing and Cross Validation
The GENIA event classes correspond to biologically homogeneous classes. This property is manifested in the homogeneity across entities (GENIA terms) which appear as arguments for the events in a given class. Although the relationship between GENIA term and event classes is not so straightforward (See Section Distribution of semantic types), semantic homogeneity of these arguments has been useful for Cross Validation of term and event annotations.
When only a relatively small number of instances of event annotations contain entities from specific term classes, either the term annotation or the event annotation may be wrong.
For example, after an initial phase of annotation, for the event class Gene_expression, we found the following patterns suspicious, since their rates of occurrence are very small:
• Gene_expression of Peptide (5 instances)
• Gene_expression of Nucleotide (2 instances)
• Gene_expression of Lipid (1 instance)
After verification, 4 annotation instances of the first case (Peptide) were accepted as correct annotations. The others were errors in which the wrong term IDs had been given for the arguments. We added a new functionality to the annotation tool, XConc, to prevent the same errors from occurring again.
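The Cross Validation step described above amounts to counting how often each term class appears as the Theme of each event class and flagging the rare combinations for manual review. The sketch below is one possible way to do this, not the procedure actually implemented in XConc; the function name `rare_theme_patterns` and the 1% threshold are assumptions. In the usage example, the three small counts are those reported above, while the count of 1,000 Protein Themes is a made-up placeholder used only to give the rare patterns something to be rare against.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rare_theme_patterns(
    pairs: Iterable[Tuple[str, str]], max_share: float = 0.01
) -> List[Tuple[str, str, int]]:
    """Return (event_class, theme_class, count) for suspiciously rare pairs.

    A pair is flagged when it accounts for at most `max_share` of all
    Theme annotations of its event class. The threshold is an assumption.
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    class_totals = Counter(event_class for event_class, _ in pairs)
    flagged = [
        (event_class, theme_class, n)
        for (event_class, theme_class), n in pair_counts.items()
        if n / class_totals[event_class] <= max_share
    ]
    # Most frequent suspicious patterns first, within each event class.
    return sorted(flagged, key=lambda item: (item[0], -item[2], item[1]))


# Toy input: the Protein count is a placeholder, not a corpus figure.
toy_pairs = (
    [("Gene_expression", "Protein")] * 1000
    + [("Gene_expression", "Peptide")] * 5
    + [("Gene_expression", "Nucleotide")] * 2
    + [("Gene_expression", "Lipid")] * 1
)
for event_class, theme_class, n in rare_theme_patterns(toy_pairs):
    print(f"{event_class} of {theme_class} ({n})")
```

Flagged patterns are only candidates for review: as the Peptide case shows, some rare combinations turn out to be correct.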
We also found many Binding events where two instances of DNA were annotated as Themes. However, the annotation coordinator was suspicious, thinking that DNA-DNA binding should be rare in our domain (transcription factors in human blood cells). When those instances were checked at an adjudication meeting, it turned out that there had been quite a few errors in term annotation. At the same time, they found that a few instances of DNA Metabolism had been wrongly annotated as Binding. An example is given below:
• In the T cell line CTLL2, ligation of kit/IL-4R alpha induces cellular proliferation.
Ligation can be considered a type of binding. However, in GO, it is classified under DNA Metabolism. One annotator was not aware of this. Through our process of Semantic Typing and Cross Validation, we were able to find and correct the resulting inconsistencies. In Table (of which a detailed description is given in Section Distribution of semantic types), 31 instances of DNA-DNA Binding still remain, but all of them are instances of Binding by a DNA-probe, which can appear in the domain of the GENIA corpus.
Distribution of theme classes for Transcription, Translation, Gene_expression and Binding events
As a result of completing this stage of event annotation, we were able to examine two important distributions in detail: first, the distribution of linguistic clue words with regard to event classes, and second, the distribution of the semantic event classes themselves. We describe each of these in the following sections.
Linguistic clues and event classes
Clue words are important in our framework not only because they help enforce the principle of Text-bound Annotation but also because they can be used in the next stage of our work, i.e. development of ER (Event Recognition) programs. They can be used as features for Machine Learners or as key words in rules for ER. However, the distribution of clue expressions indicates the kinds of difficulties which an ER program will have to resolve. In a similar way as NER (Named Entity Recognition), ER has to deal with difficulties caused by the ambiguity and diversity of language.
Table shows three representative event classes with the distribution of their linguistic clues. The distribution suggests that diverse words with different POS and syntactic structures are used to describe the same events. While some clue expressions such as "transcription" or "transcribe," "translocation," "secretion," and "cross-linking" unambiguously denote single event classes, other clues such as "engage," "recognize," and "associate" are general and ambiguous.
Clue expressions for some event classes
The following two sentences show how the verb "associate" can refer to two different event classes.
• In coimmunoprecipitation experiments using transfected COS cells, GATA-1 and ER associate in a ligand-dependent manner. [Binding]
• The induction of these genes is associated with interleukin-2 (IL-2)-induced T-cell proliferation. [Correlation]
In addition, while the event class of Binding has many specific clue expressions such as "bind", "interact", and "ligation", general expressions which are used for other event classes also appear. Examples are given below:
• CTLA-4 engagement by mAbs inhibits IL-2 production and proliferation upon T cell activation.
• The GM-kappa B sequence is recognized by NF-kappa B, which is mainly induced by PMA.
These ambiguous verbs with broad meanings would cause difficulties for event extraction programs. Even seemingly non-problematic verbs, such as "activate" or "bind," are ambiguous from the biological view point. In the current event-annotated corpus, there are 1,785 occurrences of "activate" which are annotated as Positive_regulation, while 496 occurrences are annotated as Physiological_process. However, uses of the word "activate" labeled with Physiological_process convey the same meaning as uses which are labeled Positive_regulation, i.e. either the number of the entity in the Theme increases or the function of the Theme is materialized. The ambiguity is purely due to the organization of the class hierarchy of GO. Events denoted by "bind" require a similar distinction. The term is sometimes used to refer to Cell_adhesion, which is a separate class from Binding in GO. However a larger proportion of occurrences of "bind" are still annotated as Binding events.
• Induction of cytokine expression in leukocytes by binding of thrombin-stimulated platelets. [Binding]
• Combinations of hypoxia and LPS significantly increased lymphocyte binding. [Cell_adhesion]
These ambiguities are not ambiguities of the meaning of the words themselves. They share the same linguistic core meanings. Instead, their ambiguities come from the biological heterogeneity of the events that these expressions denote. In these cases, annotators have to check the semantic classes of Theme in the term ontology for the correct classification of these events. The annotation guidelines list such confusing cases explicitly.
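The difficulty that ambiguous clue words pose for an ER program can be made concrete by tabulating, for each clue, how its occurrences are spread over event classes. The following sketch is illustrative only; the function names are assumptions. The usage example takes the two counts reported above for "activate" (1,785 and 496) and ignores any other classes the word may have been annotated with.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def clue_distribution(
    annotations: Iterable[Tuple[str, str]]
) -> Dict[str, Counter]:
    """Map each clue word to a Counter over the event classes it signals."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for clue, event_class in annotations:
        table[clue][event_class] += 1
    return table


def majority_class(counts: Counter) -> Tuple[str, float]:
    """Most frequent event class for a clue and its share of occurrences."""
    event_class, n = counts.most_common(1)[0]
    return event_class, n / sum(counts.values())


def ambiguous_clues(table: Dict[str, Counter]) -> List[str]:
    """Clue words annotated with more than one event class."""
    return sorted(clue for clue, counts in table.items() if len(counts) > 1)


annotations = (
    [("activate", "Positive_regulation")] * 1785
    + [("activate", "Physiological_process")] * 496
)
table = clue_distribution(annotations)
event_class, share = majority_class(table["activate"])
print(event_class, round(share, 3))
```

On these two counts alone, a classifier that always assigned "activate" to its majority class would be right in roughly 78% of cases, which suggests why the semantic class of the Theme is needed as additional evidence.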
The class of regulatory events has the most diverse clue expressions. This is partly because, unlike other event classes, this class denotes relationships among events or processes. As noted before, the class Regulation which we use for event annotation covers a much wider range of relations than its counterpart in GO. We use it to denote general causal relationships among events. This may also contribute to the diversity of clue expressions. In GO, regulatory events are sub-classified further. One may argue that subclassification of regulatory events leads to more uniform clue expressions for subclasses. This remains to be examined, but since most of the clue expressions for this class are general terms such as "regulate," "dependent," or "affect," we doubt that this is the case.
Distribution of semantic types
Table shows the distribution of term classes which appear as Themes of four events: Transcription, Translation, Gene_expression and Binding. Reflecting the nature of the event classes, the first three events, Transcription, Translation, and Gene_expression, appear with a small, concentrated list of term classes as their Themes. This is in contrast to the long list of term classes that appear as the Theme of Binding. The first three classes are all related to gene expression, which consists of two micro events, Transcription and Translation.
As we expect, gene-related entities like DNA, RNA, and proteins are identified as possible Themes of the first three classes. The same is true of viruses, which often have genes expressed inside human bodies. In addition, we see a small number of occurrences of peptides which are gene products (e.g. insulin, GH). However, closer examination reveals interesting and rather convoluted phenomena. From the biological point of view, Transcription is the first step of Gene_expression, transcribing DNA to RNA. From a naive predicate-centered view, this means that DNA appears as the Theme of the event, and RNA appears as the Location. Accordingly, as Table shows, the majority (538) of Themes in our annotated Transcription events are instances of DNA. The following is an example of such a sentence:
The Ca(2+)-dependent factor NF-ATP plays a key role in the inducible transcription of both these lymphokine genes.
On the other hand, a transcription event can also be described from the view point of what is produced as a result. In this case, the Theme is RNA, i.e. what is expressed by Transcription. The following is an example:
These B cells expressed p40 and p35 mRNA, and phorbol myristate acetate (PMA) stimulation strongly enhanced p40 and p70 production.
The frequency of this type of expression, in which RNA was annotated as the Theme, is also high (334). Since linguistic expressions do not distinguish entities before or after an event, an entity can be described as a Theme in either of its states, before or after the event.
More interestingly, we observed quite a large number (291) of occurrences of Protein as the Theme of Transcription. The following is one of the typical contexts in which this occurs:
YM268 facilitated the insulin-stimulated triglyceride accumulation in 3T3-L1 adipocytes and increased the mRNA expression of fatty acid-binding protein.
Although "fatty acid-binding protein" has been annotated as a Protein, what is actually transcribed is the genomic information for the protein. In a specific context (i.e. transcription, translation or gene expression), the physical form (or the container) of the genomic information of the protein is obvious. Thus sometimes, the Theme is rather less strictly described in text.
This phenomenon is related to our perception of continuants (like proteins) and to systematic metonymy [69], which permeates language. For an example of systematic metonymy, consider the following sentence given in everyday language:
The picture was developed, printed and sent to him.
Precisely speaking, what was developed is actually the film containing the picture, what was printed is the content of the picture (an image), and what was sent was the printed picture (physical manifestation of the image). The same expression "picture" is used in different contexts of development, printing, and delivery. Depending on the context, the proper interpretation is taken by the reader.
Similar phenomena are frequently observed in our domain. In the following example, the three terms "JunB", "FosB" and "c-Fos" are used to refer to genes in the context of transcription, and then used to refer to the corresponding proteins in the context of DNA binding.
... which correlates with an absence of JunB, FosB, and c-Fos transcription, as well as an absence of their DNA-binding activity.
In the current release of the GENIA event corpus, the term and event annotations will be kept as they are. However, these phenomena will have to be carefully studied to design a new annotation scheme. The scheme should be able to accommodate both the context-dependent nature of term semantic classes, and the context-independent nature of the classes of continuants.
Some Transcription events are annotated without any Theme. This is because transcription is often mentioned as a function of a protein as follows:
Transcriptional activity of p105 is also increased in infected cells and is also mediated by NF-kappa B through a specific kappa B motif.
Because our Single-facet Annotation principle focuses on events and their relations, the function of a protein is interpreted as a potential event regulated by the protein. Hence, the expression "Transcriptional activity of p105" in the above sentence is paraphrased as "transcription event regulated by p105". However, since the focus of the original sentence is different (namely, the function of the protein), the Theme of the event (what is transcribed) is out of scope and not mentioned. The same phenomena are observed in Regulation, Positive_regulation and Negative_regulation events in Table .
Distribution of theme classes for Regulation events
The fact that a large number of events without any Theme (16 in Transcription, 192 in Regulation, 277 in Positive_regulation, 76 in Negative_regulation) were annotated indicates that our Single-facet Annotation worked as we hoped. That is, taking an Event-centered view of each sentence caused the annotators to identify every event mentioned in the text, including the main event indicated explicitly by the author as well as events which are described peripherally, with little additional detail.
Missing Themes in the Binding events described in the Semantic Typing Section are the same in nature. The annotators identified DNA binding in sentences such as:
A footprint was visible over this region of the c-myb 5' flanking sequence in activated T-cell but not in unactivated T-cell.
One can safely assume the existence of another Theme of binding, which is the protein that left the footprint, but there was no mention of this protein in the text.
Table and show the type distribution of Themes and Causes of regulatory events, respectively, while Table shows a breakdown of the Positive_ and Negative_regulation which appear as Causes of regulatory events. The type distribution of Themes systematically corresponds to the subclassification of Regulation in GO. This means that, if we recognized basic event types, we could further subclassify them by rather simple rules referring to the types of their arguments. The only exceptions are the cases in which terms, instead of events, occupy the Theme. In these cases, ambiguity remains as to whether Positive regulation means increasing their amounts or enabling their functions.
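The observation that regulatory events could be subclassified by simple rules over the types of their arguments can be sketched as follows. This is a hypothetical illustration, not a rule set we have implemented: the function name, the label format, and the `ambiguous` flag are assumptions, and the generated labels are not actual GO class names.

```python
from typing import Optional, Tuple


def subclassify_regulation(
    regulation_class: str,
    theme_type: Optional[str],
    theme_is_event: bool,
) -> Tuple[str, bool]:
    """Derive a finer-grained label from the type of the Theme.

    Returns (label, ambiguous). The label is a made-up string of the form
    "<regulation class>_of_<theme type>". `ambiguous` is True when the
    Theme is a term rather than an event, since it is then unclear whether
    the amount or the function of the entity is regulated.
    """
    if theme_type is None:
        # No Theme annotated: nothing to refine.
        return regulation_class, False
    label = f"{regulation_class}_of_{theme_type}"
    return label, not theme_is_event


print(subclassify_regulation("Positive_regulation", "Transcription", True))
print(subclassify_regulation("Positive_regulation", "Protein", False))
print(subclassify_regulation("Regulation", None, False))
```

The second call illustrates the exception noted above: when a term occupies the Theme, the rule can still produce a label, but the result has to be marked as ambiguous.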
Distribution of cause classes for Regulation events
Breakdown of causes in Positive_ and Negative_regulation
The type distribution of Causes also shows some interesting tendencies. A large portion of the Causes are proteins (5,797). While these are topics beyond the scope of this paper, we are now formulating entailment rules by which we can transform all complex cases, such as Positive_regulation of Protein (405) and Gene_expression of Protein (218), into Protein, or vice versa. We expect that a set of such entailment rules will make our representation framework capable of handling variable granularity and underspecification of information, which are essential properties of natural language.