CRAFT can be compared to a number of previously released corpora. We focus primarily on biomedical corpora, as these are most directly related: ABGene [42], BioInfer [43], the CLEF Corpus [44], the FetchProt Corpus [45], the Fourth i2b2/VA Challenge corpus [46], GENETAG [47], GENIA [48], GREC [50], the ITI TXM PPI and TE corpora [51], MedPost [52], the PennBioIE Oncology and CYP v1.0 corpora [13], and the Yapex corpus [53]. Because of its vast scale, we also compare CRAFT to the output of the Collaborative Annotation of a Large Biomedical Corpus (CALBC), which bills itself as a “silver standard” and is an effort at constructing a very large biomedical corpus through “harmonization” of the automatically generated annotations of five systems [54]. Finally, though it focuses on newswire articles, we also compare our corpus to OntoNotes Release 2.0, as it is analogously a large-scale manually created corpus project with multiple types of semantic and syntactic annotation [55]. Table summarizes some criteria by which we compare CRAFT to other corpora.
Comparison of corpora in terms of total numbers of words/tokens is summarized in Table . The full corpus contains ~790,000 tokens, and the initial release contains more than 560,000; both are larger than nearly all gold-standard annotated corpora (for which we could find published numbers), including GENETAG, OntoNotes, GENIA, the PennBioIE Oncology and CYP Corpora, the MedPost Corpus, and BioInfer. The only corpora larger than ours by this criterion are the silver-standard CALBC corpus, with ~16,000,000 tokens, and the gold-standard ITI TXM PPI and TE Corpora, with ~2,000,000 and ~1,900,000 tokens, respectively; however, the counts of the ITI TXM corpora include all versions of the subset of documents that were multiply annotated (independently, for IAA calculation), and, as discussed later, not all sections of the component documents of these corpora were annotated.
Concept annotation attributes of corpora
Corpora can also be compared on the size of the documents annotated, also summarized in Table . Most of the corpora surveyed here are composed of relatively short documents. Among the shortest are those documents that are individual sentences, which compose GENETAG, the ABGene Corpus, and BioInfer. Most comparable corpora are composed of documents of several sentences to a paragraph, typically publication abstracts (e.g., the CALBC corpus, GENIA, the PennBioIE Oncology and CYP Corpora, GREC, and the Yapex Corpus), as well as those composed of discharge summaries (e.g., the Fourth i2b2/VA Challenge Corpus). The CLEF Corpus is composed of a number of different types of moderately sized medical documents, and the OntoNotes corpus contains 1,000 multiparagraph newswire documents. The longest documents of these surveyed corpora are full-length biomedical articles, e.g., the ITI TXM PPI and TE Corpora, the FetchProt Corpus, and the CRAFT Corpus. In the biomedical domain, having access to full-length articles is increasingly seen as important for concept-identification and information-extraction efforts [57].
Another point of comparison of annotated corpora is in terms of their respective domain(s), also summarized in Table . The corpora surveyed are within the biomedical domain, with the exception of OntoNotes, which covers English and Chinese newswire text. The CLEF Corpus and the i2b2/VA Challenge Corpus contain clinical documents, which are relatively rare due to issues of patient confidentiality of medical records. The remainder of the corpora discussed here are composed of sentences, abstracts, or full-length articles culled from MEDLINE. However, most of these are further narrowed to one or several relatively specific biomedical domains. In addition to requiring open licensing, the articles of the CRAFT Corpus were selected for their being evidential sources for one or more GO and/or MP annotations of mouse genes or gene products. Apart from focusing on the laboratory mouse (though not exclusively, as evidenced by the unique-concept statistics for the NCBI Taxonomy annotations, as seen in Table ), the articles have no predefined constraints within the biomedical domain, and the corpus includes articles ranging over the disciplines of genetics, biochemistry and molecular biology, cell biology, developmental biology, and even computational biology. While our corpus does not include examples of articles that do not support GO and/or MP annotations of mouse genes/gene products, e.g., clinical studies, it otherwise reflects a broad overview of the biomedical literature. Compared to other publicly available corpora, CRAFT is a less biased sample of the biomedical literature, and it is reasonable to expect that NLP systems trained and tested on CRAFT are more likely to produce generalizable results than those trained on narrower domains. At the same time, since our corpus primarily concentrates on mouse biology, we expect it to exhibit some bias toward mammalian systems.
One of the most important aspects of the semantic markup of corpora is the total number of concept annotations, for which we have provided statistics in Table . The full corpus contains over 140,000 annotations to terms from ontologies and other controlled terminologies; the initial release contains nearly 100,000 such annotations. This is among the most extensive concept markup of the corpora discussed here for which we have been able to find such counts, including the ITI TXM PPI and TE corpora, GENIA, and OntoNotes, and it is considerably larger than that of most corresponding previously released corpora, including GENETAG, BioInfer, the ABGene corpus, GREC, the CLEF Corpus, the Yapex corpus, and the FetchProt Corpus. The only corpus with amounts of concept markup considerably larger than ours (and for which we have been able to find such data) is the silver-standard CALBC corpus.
A significant difference between the CRAFT Corpus and many other corpora is in the size and richness of the annotation schemas used, i.e., the concepts that are targeted for tagging in the text, also summarized in Table . Some corpora, including the ITI TXM Corpora, the FetchProt Corpus, and the CALBC corpus, used large biomedical databases for portions of their entity annotation, though mostly in a limited fashion; furthermore, though such databases represent large numbers of biological entities, the records are flat sets of entities rather than concepts that themselves are embedded in a rich semantic structure. There has been a small amount of corpus annotation with large vocabularies with at least hierarchical structure, among these the ITI TXM Corpora and the CALBC corpus, though these are limited in various ways as well. OntoNotes, the GREC, and BioInfer use custom-made schemas whose sizes number in the hundreds, while most annotated corpora rely on very small concept schemas. In the CRAFT Corpus, all concept annotation relies on extensive schemas; apart from drawing from the ~7,200,000 records of the Entrez Gene database, these schemas draw from ontologies in the Open Biomedical Ontologies library, ranging from the ~800 classes of the Cell Type Ontology to the ~410,000 concepts of the NCBI Taxonomy. The initial 67-article release of the CRAFT Corpus contains over 4,300 distinct concepts from these terminologies. Furthermore, the annotation of relationships among these concepts (on which work has begun) will result in the creation of a large number of more complex concepts defined in terms of these explicitly annotated concepts, in the vein of anonymous OWL classes formally defined in terms of primitive (or even other anonymous) classes [61]. Analogous to research done in calculating the information content of GO terms by analyzing their use in annotations of genes/gene products in model-organism databases (and from this, the information content of these annotations) [62], the information content of biomedical concepts can be calculated by analyzing their use in annotations of textual mentions in biomedical documents (and from this, the information content of these documents).
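As a rough illustration of the last point, the following sketch computes a corpus-frequency information content, IC(c) = -log2 p(c), where a mention of a concept also counts toward each of its ancestors. This is one common formulation and is not necessarily the method of [62]; the function name, the toy hierarchy, and the toy mentions are all invented for illustration and are not taken from CRAFT.

```python
import math
from collections import Counter


def information_content(annotations, parents):
    """Return {concept: IC}, with IC(c) = -log2(p(c)).

    annotations: list of concept IDs, one per annotated textual mention.
    parents: dict mapping a concept ID to a list of its direct parent IDs
             (a hypothetical toy hierarchy, assumed to be acyclic).
    A mention of a concept is also counted for all of its ancestors.
    """
    def ancestors(concept):
        seen = set()
        stack = list(parents.get(concept, []))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, []))
        return seen

    counts = Counter()
    for concept in annotations:
        counts[concept] += 1
        for ancestor in ancestors(concept):
            counts[ancestor] += 1

    total = len(annotations)
    return {c: -math.log2(n / total) for c, n in counts.items()}


# Toy data, invented for illustration only.
toy_parents = {
    "erythrocyte": ["blood cell"],
    "leukocyte": ["blood cell"],
    "blood cell": ["cell"],
}
toy_mentions = ["erythrocyte", "leukocyte", "cell", "erythrocyte"]
ic = information_content(toy_mentions, toy_parents)
```

In this toy example the root concept "cell" receives an information content of zero because every mention counts toward it, while the rarer "leukocyte" scores higher than "erythrocyte". Aggregating such scores into a per-document measure is left out, as the text does not specify how that would be done.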
A crucial difference between the CRAFT Corpus and many other gold-standard annotated biomedical corpora is that markup of concepts requires semantic identity. By this we mean that every annotation in CRAFT is tagged with a term from an ontology or controlled vocabulary such that the text selected for the annotation is essentially semantically equivalent to the term; that is, each piece of annotated text, in its context, has the same meaning as the formal concept used to annotate it. In many other corpora, text is marked up even if the concept denoted is more specific than the concept used to annotate it; this approach is sometimes referred to as marking up all mentions “within the domain of” the given annotation class. For example, given a schema with a cell class (but nothing more specific), most corpora would annotate a mention of the word “erythrocyte” to that class. This results in semantic loss: It is not the case that the annotated text means the same thing as the associated semantic class. The size of the annotation schemas and the principle of semantic identity make assertions involving annotated concepts more valuable. For example, if the goal is to identify specific proteins expressed in specific cell types, annotations to generic categories such as “protein” or “cell” are not adequate.
Though it may sound straightforward to mark up all mentions of a given annotation class, it is often difficult and can seem subjective. Tateisi et al. have reported on the difficulty of distinguishing the names of substances from general descriptions of the substances in the construction of GENIA [64], and there was relatively low agreement on what qualified as, e.g., activators, repressors, and transcription factors in the GREC [50]. This is even more difficult when it involves identifying precise text spans for annotation. Our annotators found that evaluating whether a span of text is semantically equivalent to a given term is easier than attempting to evaluate whether a piece of text refers to a concept that is subsumed by a more general schema class but not explicitly represented. It is for this reason that we emphasize annotation to an ontology/terminology rather than to a domain. Domain boundaries are often ill-defined, which makes it difficult to evaluate whether a piece of text refers to a concept that “should be” in some ontology; thus, we annotate only to what actually is in an ontology, not to some abstract idea of its domain. For example, if the ontology being used to annotate the corpus contains a concept representing vesicles but nothing more specific than this, a textual mention of “microvesicle” would not be annotated, even though it is a type of vesicle; this is because this mention refers to a concept more specific than the vesicle concept (and our annotation guidelines do not allow annotations to a part of a word such as this). In other cases, a portion of a mention to a concept missing from an ontology can be marked up; for example, for the text “mutant vesicles”, “vesicles” by itself is tagged with the vesicle concept. We regard such an approach as a strength, as only text that directly corresponds to concepts represented in the terminology is selected. Although experts might use such texts to make suggestions of new concepts to ontology curators, such activity was in general beyond the scope of the annotation work itself. However, we expect that the CRAFT Corpus could be exploited by ontology curators to find such missing concepts.
The CRAFT Corpus is distinguished by the quality and applicability of the schemas (i.e., potential target concepts) used for annotation. Many other corpora rely on concept schemas custom-made for their specific projects, often with representational idiosyncrasies; such schemas are not widely reusable for other purposes. Some corpora, such as the GREC and the event subset of GENIA, use schemas based, at least in part, on subsets of established external resources. The CRAFT Corpus is unique in that it relies on well-established, independently curated resources in their entirety. Eight of these resources are formal biomedical ontologies developed within the sphere of the Open Biomedical Ontologies (OBO) movement and are dedicated to faithfully representing the concepts within their respective domains, including five in the OBO Foundry that conform to an additional set of ontological principles. By predominantly annotating to widely used, high-quality terminologies, the CRAFT Corpus builds on years of careful knowledge representation work and is semantically consistent with a wide variety of other efforts that exploit these community resources.
In addition to using community-curated resources in our scheme, CRAFT also annotates every mention of nearly every concept that appears in the texts. Although such an approach seems intuitive (and is clearly beneficial for training machine-learning NLP systems), it is not used in a number of corpora. Tanabe et al. have written that “one fundamental problem in corpus annotation is the definition of what constitutes an entity to be tagged” and cited the complex guidelines of the MUC-7 Named Entity Task as evidence [47]. In BioInfer, the focus is the annotation of relationships among genes, proteins, and RNAs, and entities are only annotated if they are relevant to this focus and if they are named entities (a term itself with much baggage). However, if the arguments of primary events are other events or qualities that recursively have genes, proteins, and/or RNAs as arguments, these secondary events or qualities are annotated as “extended named entities”, but only in such cases. In the PennBioIE Oncology corpus, a gene is only annotated if there is an associated variation event, and in the i2b2/VA Challenge corpus, only concepts lexicalized as complete noun phrases are annotated; e.g., “diabetes” is annotated in “she developed diabetes” but not in “she takes diabetes medication”.
The span-selection guidelines for the concept annotations of the CRAFT Corpus also provide important advantages. Given an initial anchor word as the basis for an annotation, the rules for deciding which adjacent words can be considered for inclusion in an annotation and which cannot are precise and purely syntax-based, and the decision as to whether to include one or more modifiers or modifying phrases rests solely on whether their inclusion would result in a direct semantic match to a concept in the terminology being used. Unlike some other corpora (e.g., GENETAG, the ITI TXM corpora), annotations in CRAFT can be discontinuous, i.e., can be composed of two or more nonadjacent spans of text, though these must still abide by the same span-selection guidelines. Use of discontinuous annotations allows us to ensure that only text that is semantically identical to a concept is marked, regardless of internal interruptions. In some corpora, there are unclear guidelines (and consequently inconsistent annotations) for the text spans associated with an annotation. For example, in GENIA, “the inclusion of qualifiers is left to the experts [sic] judgment” for the task of entity annotation [48], and in the i2b2/VA Challenge corpus, “[u]p to one prepositional phrase following a markable concept can be included if the phrase does not contain a markable concept and either indicates an organ/body part or can be rearranged to eliminate the phrase” [46]. The CRAFT specifications minimize subjective selections and increase interannotator agreement on spans. The CRAFT span-selection guidelines are quite extensive (see supplementary materials), but our concept annotators, who are biomedical-domain experts with no previous experience in formal linguistics, were able to learn them quickly.
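To make the notion of a discontinuous annotation concrete, here is a minimal sketch of one way to represent an annotation as a list of character-offset spans. The class name, field names, concept identifier, and example sentence are hypothetical and do not reflect the actual CRAFT release format or guidelines.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ConceptAnnotation:
    """A concept annotation made of one or more text spans.

    concept_id: identifier of the concept (hypothetical here).
    spans: (start, end) character offsets, with end exclusive.
    """
    concept_id: str
    spans: Tuple[Tuple[int, int], ...]

    @property
    def is_discontinuous(self) -> bool:
        # More than one span means the annotated text is interrupted.
        return len(self.spans) > 1

    def covered_text(self, text: str) -> str:
        # Join the annotated pieces in document order.
        return " ".join(text[start:end] for start, end in sorted(self.spans))


# Invented example: "red ... blood cells" skips the coordinated "and white".
sentence = "red and white blood cells"
red_cells = ConceptAnnotation("EXAMPLE:red_blood_cell", ((0, 3), (14, 25)))
white_cells = ConceptAnnotation("EXAMPLE:white_blood_cell", ((8, 25),))
```

With this representation, `red_cells.covered_text(sentence)` yields the text that matches the concept even though the mention is interrupted, whereas a format restricted to single continuous spans would have to include the intervening words or drop part of the mention.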
Finally, few corpora have attempted to capture semantic ambiguity in concept annotations. The most prominent way in which CRAFT represents concept ambiguity is in cases in which a given span of text could be referring to two (or more) represented concepts, none of which subsumes another, and we have not been able to definitively decide among these. This occurs most frequently among the Entrez Gene annotations, in which many mentions of genes/gene products not grammatically modified with their organismal sources are multiply annotated with the Entrez Gene IDs of the species-specific genes/gene products to which these mentions could plausibly refer. Similar to GENIA, this multiple-concept annotation explicitly indicates that these cases could not be reliably disambiguated by human annotators and therefore are likely to be particularly difficult for computational systems. Explicitly representing this ambiguity allows for more sophisticated scoring mechanisms in the evaluation of automatic concept annotation; for example, for a mention annotated with two candidate concepts, a maximum score could be given if a system assigned both concepts and a partial score if it assigned only one of them. However, we have attempted to avoid such multiple annotation by instead singly annotating such mentions according to improvised guidelines for specific markup issues (which do not conflict with the official span-selection guidelines but rather build from them). For example, some nominalizations (e.g., insertion) may refer either to a process (e.g., the process of insertion of a macromolecular sequence into another) or to the resulting entity (e.g., the resulting inserted sequence), both of which are represented in the SO, and it is often not possible to distinguish between these with certainty; we have annotated such mentions as the resulting sequences except those that can only (or most likely) be referring to the corresponding processes.
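The scoring idea mentioned above can be sketched as follows. The text does not specify a formula, so this example assumes a simple set-overlap (Jaccard) score per mention; the function name and the placeholder gene identifiers are hypothetical.

```python
def score_ambiguous_mention(gold_concepts, predicted_concepts):
    """Score one mention whose gold annotation may list several concepts.

    Assumed scheme (not from the source): Jaccard overlap between the set
    of gold concepts and the set of predicted concepts. Assigning all gold
    concepts and nothing else scores 1.0; assigning only some of them, or
    adding spurious ones, scores less.
    """
    gold = set(gold_concepts)
    predicted = set(predicted_concepts)
    if not gold:
        raise ValueError("a gold mention needs at least one concept")
    return len(gold & predicted) / len(gold | predicted)


# Placeholder identifiers standing in for two species-specific gene records.
gold = {"GENE:mouse_version", "GENE:human_version"}
full = score_ambiguous_mention(gold, {"GENE:mouse_version", "GENE:human_version"})
partial = score_ambiguous_mention(gold, {"GENE:mouse_version"})
miss = score_ambiguous_mention(gold, {"GENE:unrelated"})
```

Here a system that returns both candidate concepts receives the maximum score, one that returns a single candidate receives half, and one that returns an unrelated concept receives zero. Other weightings are equally plausible; the point is only that multiply annotated mentions make such graded scoring possible.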
A simpler case involves a text span that might refer either to a concept or to another concept that it subsumes. In such a case, only the more general concept is used; for example, Mus refers both to an organismal-taxonomic genus and to one of its subgenera, so a given mention would only be annotated with the genus. The rationale for this decision is that it is generally not safe to assume that the more specific concept is the one being mentioned.
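The subsumption rule just described can be expressed as a small filter over candidate concepts: any candidate that is subsumed by another candidate is dropped, while mutually unrelated candidates are all kept (the multiple-annotation case). This is a sketch under assumptions; the function name, the labels, and the toy hierarchy are invented for illustration.

```python
def most_general_candidates(candidates, parents):
    """Drop every candidate that is subsumed by another candidate.

    candidates: list of concept IDs a mention could plausibly refer to.
    parents: dict mapping a concept ID to a list of its direct parent IDs
             (hypothetical toy hierarchy, assumed to be acyclic).
    """
    def is_proper_ancestor(ancestor, concept):
        seen = set()
        stack = list(parents.get(concept, []))
        while stack:
            node = stack.pop()
            if node == ancestor:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, []))
        return False

    return [
        c for c in candidates
        if not any(o != c and is_proper_ancestor(o, c) for o in candidates)
    ]


# Toy hierarchy with illustrative labels rather than real taxonomy IDs.
toy_parents = {"Mus (subgenus)": ["Mus (genus)"]}
kept = most_general_candidates(["Mus (genus)", "Mus (subgenus)"], toy_parents)
unrelated = most_general_candidates(["gene A (mouse)", "gene A (human)"], toy_parents)
```

For the Mus example only the genus survives, whereas two candidates with no subsumption relation between them are both retained.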
Ongoing and future work
In addition to the conceptual annotation that is described here and the syntactic annotation that we describe in a companion article [27], there are multiple ongoing projects that add further layers of annotation to the CRAFT Corpus data, all of which will be made available in future releases of the corpus:
· We have begun work on assertional annotation of the corpus, i.e., the markup of assertions among the annotated concepts by linking them via relations. We have encountered many difficult aspects in this task, which may be challenging to accomplish as consistently as the concept annotation. We seek to create this assertional markup using a methodology such that the annotations can be programmatically translated into formal knowledge representations that can be stored and queried in an RDF knowledge base.
· An extensive project is nearly complete to mark all coreference in the corpus. The two relations of COREF (coreferentiality) and APPOS (appositive) are marked. The guidelines for this portion of the work were adapted from the OntoNotes guidelines, with the major difference that we did not utilize the category of generics. As we have discussed in relation to the guideline selection process for this task, we maintain that in the biomedical domain, in which everything mentioned, including abstract concepts such as data, belongs in the domain of an ontology, the notion of genericity does not apply.
· Discourse annotation on the sentence level, using the CISP/ART schema, is nearly complete. An early result of this work has been the finding that sequences of rhetorical moves can be characterized by finite state machines.
· The contents of all parentheses are being annotated with respect to a schema of twenty categories, including citations, data values, p-values, figure/table pointers, list elements, and others. We have previously presented the annotation procedure and the use cases for the various categories in the schema, as well as a classifier for determining category membership of contents of parentheses.
· As a primary criterion in the selection of articles for the corpus was their use as evidential sources for ontological annotations of mouse genes/gene products in the Mouse Genome Database (a major component of the Mouse Genome Informatics resources), we have marked up the specific sentences within these articles upon which these annotations are based. Motivated by a growing need for semiautomatic assistance in the curation of data in model-organism databases, we intend for this to serve as a gold standard for the training of systems to identify relevant evidential sentences in the biomedical literature.
Furthermore, we intend to periodically update the annotations using current versions of the OBOs and to correct errors that we find or that are brought to our attention.