Why a sequence ontology is needed
Genomic annotations are the focal point of sequencing, bioinformatics analysis, and molecular biology. They are the means by which we attach what we know about a genome to its sequence. Unfortunately, biological terminology is notoriously ambiguous; the same word is often used to describe more than one thing and there are many dialects. For example, does a coding sequence (CDS) contain the stop codon or is the stop codon part of the 3'-untranslated region (3' UTR)? There really is no right or wrong answer to such questions, but consistency is crucial when attempting to compare annotations from different sources, or even when comparing annotations performed by the same group over an extended period of time.
At present, GenBank [1] houses 220 viral genomes, 152 bacterial genomes, 20 eukaryotic genomes and 18 archaeal genomes. Other centers such as The Institute for Genomic Research (TIGR) [2] and the Joint Genome Institute (JGI) [3] also maintain and distribute annotations, as do many model organism databases such as FlyBase [4], WormBase [5], The Arabidopsis Information Resource (TAIR) [6] and the Saccharomyces Genome Database (SGD) [7]. Each of these groups has its own databases and many use their own data model to describe their annotations. There is no single place at which all sets of genome annotations can be found, and several sets are informally mirrored in multiple locations, leading to location-specific version differences. This can make it hazardous to exchange, combine and compare annotation data. Clearly, if genomic annotations were always described using the same language, then comparative analysis of the wealth of information distributed by these institutions would be enormously simplified: hence the Sequence Ontology (SO) project. SO began 2 years ago, when a group of scientists and developers from the model organism databases - FlyBase, WormBase, Ensembl, SGD and MGI - came together to collect and unify the terms they used in their sequence annotation.
The goal of SO is to provide a standardized set of terms and relationships with which to describe genomic annotations and to provide the structure necessary for automated reasoning over their contents, thereby facilitating data exchange and comparative analyses of annotations. SO is a sister project to the Gene Ontology (GO) [8] and is part of the Open Biomedical Ontologies (OBO) project [9]. The scope of the SO project is the description of the features and properties of biological sequence. The features can be located in base coordinates, such as gene, and the properties of these features describe an attribute of the feature; for example, a gene may be maternally_imprinted.
SO terminology and format
Like other ontologies, SO consists of a controlled vocabulary of terms or concepts and a restricted set of relationships between those terms. While the concepts and relationships of the sequence ontology make it possible to describe precisely the features of a genomic annotation, discussions of them can lead to much lexical confusion, as some of the terms used by SO are also common words; thus we begin our description of SO with a discussion of its naming conventions, and adhere to these rules throughout this document.
Wherever possible, the terms used by SO to describe the parts of an annotation are those commonly used in the genomics community. In some cases, however, we have altered these terms in order to render them more computer-friendly so that users can create software classes and variables named after them. Thus, term names do not include spaces; instead, underscores are used to separate the words in phrases. Numbers are spelled out in full, for example five_prime_UTR, except in cases where the number is part of the accepted name. If the commonly used name begins with a number, such as 28S RNA, the stem is moved to the front - for example, RNA_28S. Symbols are spelled out in full where appropriate, for example, prime, plus, minus; as are Greek letters. Periods, points, slashes, hyphens, and brackets are not allowed. If there is a common abbreviation it is used as the term name, and case is always lower except when the term is an acronym, for example, UTR and CDS. Where there are differences in the accepted spelling between English and US usage, the US form is used.
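As a rough illustration of how these conventions could be checked mechanically, the following Python sketch flags candidate names that contain disallowed characters or begin with a digit. The function name naming_problems is hypothetical and not part of SO; the check covers only the character-level rules described above and does not attempt the case or abbreviation rules.

```python
import re

# Anything other than ASCII letters, digits, and underscores is flagged. This covers
# spaces, periods, slashes, hyphens, and brackets, as well as symbols and Greek letters,
# which the conventions require to be spelled out.
_DISALLOWED = re.compile(r"[^A-Za-z0-9_]")


def naming_problems(term_name):
    """Return a list of naming-convention problems found in a candidate term name."""
    problems = []
    if _DISALLOWED.search(term_name):
        problems.append("contains a character other than letters, digits, or underscores")
    if term_name[:1].isdigit():
        problems.append("begins with a digit")
    return problems


for candidate in ("five_prime_UTR", "RNA_28S", "28S RNA", "3'-UTR"):
    print(candidate, naming_problems(candidate))
```

Names that pass return an empty list; the caller decides whether a flagged name should be rejected or rewritten by hand.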
Synonyms are used to record the variant term names that have the same meaning as the term. They are used to facilitate searching of the ontology. There is no limit to the number of synonyms a term can have, nor do they adhere to SO naming conventions. They are, however, still lowercase except when they are acronyms.
Throughout the remainder of this document, the terms from SO are highlighted in italics and the names of relationships between the terms are shown in bold. The terms are always depicted exactly as they appear in the ontology. The names of EM operators are underlined.
SO, SOFA, and the feature table
To facilitate the use of SO for the markup of gene annotation data, a subset of terms from SO consisting of some of those terms that can be located onto sequence has been selected; this condensed version of SO is especially well suited for labeling the outputs of automated or semi-automated sequence annotation pipelines. This subset is known as the Sequence Ontology Feature Annotation, or SOFA.
SO, like GO, is an 'open source' ontology. New terms, definitions, and their location within the ontology are proposed, debated, and approved or rejected by an open group of individuals via a mailing list. SO is maintained in OBO format and the current version can be downloaded from the CVS repository of the SO website [10]. For development purposes, SOFA was stabilized and released (in May 2004) for at least 12 months to allow development of software and formats. SO is a directed acyclic graph (DAG), and can be viewed using the editor for OBO files, OBO-Edit [11].
The terms describing sequence features in SO and SOFA are richer than those of the Feature Table [12] of the three large genome databanks: GenBank [1], EMBL [13] and the DNA Data Bank of Japan (DDBJ) [14]. The Feature Table is a controlled vocabulary of terms describing sequence features and is used to describe the annotations distributed by these data banks. The Feature Table does provide a grouping of its terms for annotation purposes, based on the degree of specificity of the term. The relationships between the terms are not formalized; thus the interpretation of these relationships is left to the user to infer, and, more critically, must be hard-coded into software applications. Most of the terms in the Feature Table map directly to terms in SO, although the term names may have been changed to fit SO naming conventions. In general, SO contains a more extensive set of features for detailed annotation. There are currently 171 locatable sequence features in SOFA compared with 65 in the Feature Table. There are 11 terms in the Feature Table that are not included in SO. These terms fall into two categories: remarks and immunological features, both of which have been handled slightly differently in SO. A mapping between SO and the Feature Table is available from the SO website [10].
Database schemas, file formats and SO
SO is not a database schema, nor is it a file format; it is an ontology. As such, SO transcends any particular database schema or file format. This means it can be used equally well as an external data-exchange format or internally as an integral component of a database.
The simplest way to use SO is to label data destined for redistribution with SO terms and to make sure that the data adhere to the SO definition of the data type. Accordingly, SO provides a human-readable definition for each term that concisely states its biological meaning. Usually the definitions are drawn from standard authoritative sources such as The Molecular Biology of the Cell [15], and each definition contains a reference to its source. Defining each term in such a way is important as it aids communication and minimizes confusion and disputes as to just what data should consist of. For example, the term CDS is defined as a contiguous RNA sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon. According to SO, the sequence of a three_prime_UTR does not contain the stop_codon - and files with such sequences are SO-compliant; files of three_prime_UTR sequences that include the stop codon are not. This is a trivial example, illustrating one of the simplest use cases, but it does demonstrate the power of SO to put an end to needless negotiations between parties as to the details of a data exchange. This aspect of SO is especially well suited for use with the generic feature format (GFF) [16]. Indeed, the latest version, GFF3, uses SO terms and definitions to standardize the feature type described in each row of a file and SO terms as optional attributes to a feature.
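To make the GFF3 use case concrete, here is a minimal Python sketch that splits one GFF3 row into its nine tab-separated columns and checks whether the type column holds a recognized term. The set KNOWN_TERMS is a small illustrative subset chosen for this example, not the SOFA term list, and the function names are hypothetical.

```python
# The nine tab-separated columns of a GFF3 feature row, in order.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)

# Illustrative subset only; a real check would load the terms from the ontology file.
KNOWN_TERMS = {"gene", "mRNA", "exon", "CDS", "five_prime_UTR", "three_prime_UTR"}


def parse_gff3_line(line):
    """Split one GFF3 feature row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def has_known_type(record, terms=KNOWN_TERMS):
    """Return True if the row's type column is one of the recognized terms."""
    return record["type"] in terms


row = parse_gff3_line("chr1\t.\texon\t100\t200\t.\t+\t.\tID=exon1")
print(row["type"], has_known_type(row))
```

A row whose type is not in the term set would be reported rather than silently accepted, which is the kind of check that a shared vocabulary makes possible.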
SO can also be employed in a much more sophisticated manner within a database. CHADO [17] is a modular relational database schema for integrating molecular and genetic data and is part of the Generic Model Organism Database project (GMOD) [18], currently used by both FlyBase and TIGR. The CHADO relational schema is extremely flexible, and is centered on genomic features and their relationships, both of which are described using SO terms. This use of SO ensures that software that queries, populates and exports data from different CHADO databases is interoperable, and thus greatly facilitates large-scale comparisons of even very complex genomics data.
Like GFF3, Chaos-XML [19] is a file format that uses SO to label and structure data, but it is more intimately tied to the CHADO project than is GFF3. Chaos-XML is a hierarchical XML mapping of the CHADO relational schema. Annotations are represented as an ontology-typed feature graph. The central concept of Chaos-XML is the sequence-feature, which is any sequence entity typed by SO. The features are interconnected via feature relationship elements, whereby each relationship connects a subject feature and an object feature. Features are located via featureloc elements which use interbase (zero-based) coordinates. Chaos-XML and CHADO are richer models than GFF3 in that feature_relationships are typed, and a more sophisticated location model is used. Chaos-XML is the substrate of a suite of programs called Comparative Genomics Library (CGL), pronounced 'seagull' [20], which we have used for the analyses presented in our Results section.
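The interbase coordinates mentioned above number the gaps between bases rather than the bases themselves. The following Python sketch shows one way to convert between one-based, inclusive coordinates (as used in GFF3) and interbase coordinates; the function names are hypothetical and are not taken from CHADO or Chaos-XML.

```python
def to_interbase(start, end):
    """Convert one-based, inclusive coordinates to interbase (zero-based) coordinates."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def from_interbase(start, end):
    """Convert interbase coordinates back to one-based, inclusive coordinates."""
    if start < 0 or end <= start:
        raise ValueError("a zero-length interbase span has no one-based, inclusive equivalent")
    return start + 1, end


def interbase_length(start, end):
    """Length of a span in interbase coordinates; zero for a point between two bases."""
    return end - start


print(to_interbase(1, 10))
print(interbase_length(*to_interbase(1, 10)))
print(interbase_length(5, 5))
```

One consequence of this scheme is that a position between two bases, such as an insertion site, is expressed naturally as a span with equal start and end.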
The basic types in SOFA, from which other types are defined, are region and junction, equivalent to the concepts of interiors and boundaries defined in the field of topological relationships [21]. A region is a length of sequence such as an exon or a transposable_element. A junction is the space between two bases, such as an insertion_site. Building on these basic data types, SOFA can be used to describe a wide range of sequence features. Raw sequence features such as assembly components are captured by terms like contig. Analysis features, defined by the results of sequence-analysis programs such as BLAST [22], are captured by terms such as nucleotide_match. Gene models can be defined on the sequence using terms like gene. Variation in sequence is captured by subtypes of the term sequence_variant. These terms have multiple parentages with either region or junction. SOFA (and SO) can also be used to describe many other sequence features, for example, repeat. Thus, SOFA together with GFF3 or Chaos-XML provides an easy means by which parties can describe, standardize, and document the data they distribute and exchange.
The SO and SOFA controlled vocabularies can be used for de novo annotation. Several groups including SGD and FlyBase now use either SO or SOFA terms in their annotation efforts. SO is not restricted to new annotations, however, and may be applied to existing annotations. For example, annotations from GenBank may be converted into SO-compliant formats using Bioperl [23] (see Materials and methods).
One essential difference between a controlled vocabulary, such as the Feature Table, and an ontology is that an ontology is not merely a collection of predefined terms that are used to describe data. Ontologies also formally specify the relationships between their terms. Labeling data with terms from an ontology makes the data a substrate for software capable of logical inference. The information necessary for making logical inferences about data resides in the class designations of the relationships that unite terms within SO. We detail this aspect of the ontology below. For purposes of reference, a section of SO illustrating the various relationships between some of its terms is shown in Figure 1.
Figure 1 A section of the Sequence Ontology showing how terms and relationships are used together to describe knowledge about sequence. The kind_of relationships are depicted using arrows labeled with 'i', and the part_of relationships use arrows with 'P'.
Currently, SO uses three basic kinds of relationship between its terms: kind_of, derives_from, and part_of. These relationships are defined in the OBO relationship types ontology [24].
kind_of relationships specify what something 'is'. For example, an mRNA is a kind_of transcript. Likewise an enhancer is a kind_of regulatory_region. kind_of relationships are valid in only one direction. Hence, a regulatory_region is not a kind_of enhancer. One consequence of the directional nature of kind_of relationships is that their transitivity is hierarchical - inferences as to what something 'is' proceed from the leaves towards the root of the ontology. For example, an mRNA is a kind_of processed_transcript AND a processed_transcript is a kind_of transcript. Thus, an mRNA is a kind_of transcript. kind_of relationships are synonymous with is_a relationships. We adopted the 'kind_of' notation to avoid the lexical confusion often encountered when describing relationships, as the phrase 'is a' is often used in conjunction with other relationships in English - for example 'is a part_of'.
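The leaf-to-root inference described above can be sketched in a few lines of Python. The dictionary KIND_OF below holds only the handful of relationships used as examples in the text, and the function names are hypothetical; a real application would load the relationships from the ontology file instead.

```python
# Each term maps to its direct kind_of parents (a term may have more than one).
KIND_OF = {
    "mRNA": ["processed_transcript"],
    "processed_transcript": ["transcript"],
    "enhancer": ["regulatory_region"],
}


def ancestors(term, kind_of=KIND_OF):
    """Collect every term reachable by following kind_of links towards the root."""
    seen = set()
    stack = list(kind_of.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(kind_of.get(parent, []))
    return seen


def is_kind_of(child, parent, kind_of=KIND_OF):
    """True if parent can be inferred from child by transitive kind_of links."""
    return parent in ancestors(child, kind_of)


print(is_kind_of("mRNA", "transcript"))
print(is_kind_of("regulatory_region", "enhancer"))
```

Because links are only ever followed from child to parent, the inference is directional: the first query succeeds and the second does not.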
SO uses the term derives_from to denote relationships of process between two terms. For example, an EST derives_from an mRNA. derives_from relationships imply an inverse relationship, derives. Note that although a polypeptide derives_from an mRNA, a polypeptide cannot be derived from an ncRNA (non-coding RNA), because no derives_from relationship unites these two terms in the ontology. This fact illustrates another important aspect of how SO handles relationships: children always inherit from parents but never from siblings. An ncRNA is a kind_of transcript, as is an mRNA. Labeling something as a transcript implies that it could possibly produce a polypeptide; labeling that same entity with the more specific term ncRNA rules that possibility out. Thus, a file that contained ncRNAs and their polypeptides would be semantically invalid.
part_of relationships pertain to meronomies; that is to say 'part-whole' relationships. An exon, for example, is a part_of a transcript. part_of relationships are not valid in both directions. In other words, while an exon is a part_of a transcript, a transcript is not a part_of an exon. Instead, we say a transcript has_part exon. SO does not explicitly denote whole-part relationships, as every part_of relationship logically implies the inverse has_part relationship between the two terms.
Transitivity is a more complicated issue with regard to part-whole relationships than it is for the other relationships in SO. In general, part_of relationships are transitive - an exon is a part_of a gene, because an exon is a part_of a transcript, and a transcript is a part_of a gene. Not every chain of part-whole relationships, however, obeys the principle of transitivity. This is because parts can be combined to make wholes according to different organizing principles. Winston et al. [25] have described six different subclasses of the part-whole relationship, based on the following three properties: configuration, whether the parts have a structural or functional role with respect to one another or the whole they form; substance, whether the part is made of the same stuff as the whole (homomerous or heteromerous); and invariance, whether the part can be separated from the whole. These six relations and their associated part_of subclasses are detailed in Table 1.
Six subclasses of part-whole relationships
Winston et al. [25] argue that there is transitivity across a series of part_of relationships only if they all belong to the same subclass. In other words, an exon can only be part_of a gene, if an exon is a component_part_of a transcript, and a transcript is a component_part_of a gene. If, however, the two statements contain different types of part_of relationship, then transitivity does not hold.
By addressing the vague English term 'part of' in this way, Winston et al. solve many of the problems associated with reasoning across part_of relationships; thus, we are adopting their approach with SO. The parts contained in the sequence ontology are mostly of the type component_part_of such as exon is a part_of transcript, although there are a few occurrences of member_part_of such as read is a part_of contig.
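A minimal Python sketch of this restricted transitivity follows. Each part_of link is tagged with its subclass, and a chain is accepted only if every link in it carries the same tag. The contig-to-assembly link and the term assembly are hypothetical, added purely so that the example contains a chain of mixed subclasses; the function name is likewise an assumption.

```python
# (part, whole) -> subclass of the part_of relationship.
PART_OF = {
    ("exon", "transcript"): "component",
    ("transcript", "gene"): "component",
    ("read", "contig"): "member",
    ("contig", "assembly"): "component",  # hypothetical link, for illustration only
}


def transitive_part_of(part, whole, edges=PART_OF):
    """True if whole is reachable from part through links of one single subclass."""
    for subclass in set(edges.values()):
        frontier, seen = [part], {part}
        while frontier:
            current = frontier.pop()
            for (child, parent), kind in edges.items():
                if child == current and kind == subclass and parent not in seen:
                    if parent == whole:
                        return True
                    seen.add(parent)
                    frontier.append(parent)
    return False


print(transitive_part_of("exon", "gene"))
print(transitive_part_of("read", "assembly"))
```

The exon-to-gene chain is accepted because both links are component parts, whereas the read-to-assembly chain mixes a member link with a component link and is therefore rejected.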
SO's relationships facilitate software design and bioinformatics research
Genomic annotations are substrates for a multitude of software applications. Annotations, for example, are rendered by graphical viewers, or, as another example, their features are searched and queried for purposes of data validation and genomics research. Using an ontology for sequence annotation purposes offers many advantages over the traditional Feature Table approach. Because controlled vocabularies do not specify the relationships that obtain between their terms, using the Feature Table has meant that relationships between features have had to be hard-coded in software applications themselves; consequently, adding a new term to the Feature Table and/or changing the details of the relationships that obtain between its terms has meant revising every software application that made use of the Feature Table. Ontologies mitigate this problem as all of the knowledge about terms and their relationships to one another is contained in the ontology, not the software.
SO-compliant software need only be provided with an updated version of the ontology, and everything else will follow automatically. This is because SO-compliant software need not hard-code the fact that a tRNA is a kind_of transcript; it need merely know that kind_of relationships are transitive and hierarchical and be capable of internally navigating the network of relationships specified by the ontology (see Figure 1) in order to logically infer this fact. This means that every time a new form of ncRNA is discovered, and added to SO, all SO-compliant software applications will automatically be able to infer that any data labeled with that new term is a kind_of transcript. As a result, existing graphical viewers will render those data with the appropriate transcript glyph, and validation and query tools will automatically deal with this new data-type in a coherent fashion. Placing the biological knowledge in the ontology rather than in the software means that the ontology and the software that uses it can be developed, revised, and extended independently of one another. Thus ontologies offer the bioinformatics programming community significant opportunities as regards software design and the speed of the development cycle. Using an ontology does, however, mean that software applications must meet certain professional standards; namely, they must be capable of parsing an OBO file and navigating the network of relationships that constitute the ontology, but these are minimal hurdles.
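To give a sense of how small the parsing hurdle is, the following Python sketch reads [Term] stanzas from OBO-formatted text and records each term's name, its is_a parents, and its part_of relationships. It handles only these few tags and is not a complete OBO parser; the identifiers in SAMPLE_OBO are illustrative placeholders rather than real SO accessions, and the function name is hypothetical.

```python
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0001
name: transcript

[Term]
id: EX:0002
name: mRNA
is_a: EX:0001 ! transcript

[Term]
id: EX:0003
name: exon
relationship: part_of EX:0001 ! transcript

[Typedef]
id: part_of
name: part_of
"""


def parse_obo(text):
    """Return {term id: {'name', 'is_a', 'part_of'}} for every [Term] stanza."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing '! comment' text
        if not line:
            continue
        if line.startswith("["):
            # Only [Term] stanzas are collected; other stanza types are skipped.
            current = {"is_a": [], "part_of": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue
        tag, value = (s.strip() for s in line.split(":", 1))
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
        elif tag == "relationship" and value.startswith("part_of "):
            current["part_of"].append(value.split()[1])
    return terms


terms = parse_obo(SAMPLE_OBO)
print(terms["EX:0002"])
```

Once the stanzas are in a dictionary like this, navigating the relationship network is an ordinary graph traversal, so updating the software's knowledge amounts to loading a newer file.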
SO facilitates bioinformatics research in ways that reach far beyond its utility as regards software design. For example, SO's kind_of relationships provide a subsumption hierarchy, or classification system for its terms. This added depth of knowledge greatly improves the searching and querying capabilities of software using SO. The ontology's higher-level terms may be used to query via inference, even if they are never used for annotation. We recommend that annotators label their data using terms corresponding to terminal nodes in the ontology. Transcripts, for example, might be annotated using terms such as mRNA, tRNA, and rRNA (see Figure 1). Note that doing so means that if, for example, non-coding RNA sequences are required for some subsequent analysis, then SO-compliant software tools can locate annotations labeled with the subtypes of ncRNA, and retrieve tRNAs and rRNAs to the exclusion of mRNAs, even though these data have not been explicitly labeled with the term ncRNA. Thus, many analyses become easy, for example, how many ncRNAs are annotated in H. sapiens? Of these, what percent have more than one exon? Are any maternally imprinted? Moreover, using SO as part of a database schema ensures that such questions 'mean' the same thing in different databases.
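The ncRNA query described above might look like the following Python sketch. For brevity it assumes each term has a single kind_of parent, which does not hold for SO as a whole since it is a directed acyclic graph; the annotation records, their identifiers, and the function names are hypothetical.

```python
# Direct kind_of parent of each term (single-parent simplification for this example).
KIND_OF_PARENT = {
    "mRNA": "transcript",
    "ncRNA": "transcript",
    "tRNA": "ncRNA",
    "rRNA": "ncRNA",
}


def subsumed_by(term, query_term, parents=KIND_OF_PARENT):
    """True if term is query_term itself or one of its kind_of descendants."""
    while term is not None:
        if term == query_term:
            return True
        term = parents.get(term)
    return False


def select(annotations, query_term):
    """Return the annotations whose type is subsumed by query_term."""
    return [a for a in annotations if subsumed_by(a["type"], query_term)]


annotations = [
    {"id": "a1", "type": "mRNA"},
    {"id": "a2", "type": "tRNA"},
    {"id": "a3", "type": "rRNA"},
]
print([a["id"] for a in select(annotations, "ncRNA")])
```

None of the records is labeled ncRNA, yet the query for ncRNA returns the tRNA and rRNA records and excludes the mRNA record, while a query for transcript would return all three.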
SO also greatly facilitates the automatic validation of annotation data, as the relationships implied by an annotation can be compared to the allowable relationships specified in the ontology. For example, an annotation that asserts an intron to be part_of an mRNA would be invalid, as this relationship is not specified in the ontology (Figure 1). On the other hand, an annotation that asserted that a UTR sequence was part_of an mRNA would be valid (Figure 1). This makes possible better quality control of annotation data, and makes it possible to check existing annotations for such errors when converting them to a SO-compliant format such as GFF3.
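A validation pass of this kind can be sketched in Python as a lookup against the set of part_of relationships that the ontology specifies. The set ALLOWED_PART_OF below contains only pairs mentioned in the text and is an illustrative fragment; the function name is hypothetical, and the sketch checks direct relationships only.

```python
# (part, whole) pairs specified in the ontology; illustrative fragment only.
ALLOWED_PART_OF = {
    ("exon", "transcript"),
    ("UTR", "mRNA"),
}


def invalid_part_of(assertions, allowed=ALLOWED_PART_OF):
    """Return the asserted (part, whole) pairs that the ontology does not specify."""
    return [pair for pair in assertions if pair not in allowed]


asserted = [("UTR", "mRNA"), ("intron", "mRNA")]
print(invalid_part_of(asserted))
```

An empty result means every asserted relationship is permitted; anything returned can be reported to the annotator before the data are converted or redistributed.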
To summarize, by identifying the set of relationships between terms that are possible, we are also specifying the inferences that can be drawn from these relationships: that is, the software operations that can be carried out over the data. As a consequence, software is easier to maintain, SO can easily be extended to embrace new biological knowledge, quality controls can be readily implemented, and software to mine data can be written so as to be very flexible.
EM operators and SO
SO also enables some modes of analyses of genomics data that are completely new to the field. One such class of analyses involves the use of extensional mereology (EM) operators to ask questions about gene parts. Although new to genomics, EM operators are well known in the field of ontology, where they provide a basis for asking and answering questions pertaining to how parts are distributed within and among different wholes (reviewed in [26]). These operators are usually applied to studies of how parts are shared between complex wholes - such as different models of automobiles or personal computers - for the purpose of optimizing manufacturing procedures. Below we explain how these same operators can be applied to the analyses of genomics data. Although these operators, such as difference, share the same names as topological operators, they are different in that they function on the parts of an object, not on its geometric coordinate space. The topological operators, regarding the coincidence of edges and interiors - equality, overlap, disjointedness, containment and coverage of spatial analysis [21] - may also be applied to biological sequence.
EM is a formal theory of parts: it defines the properties of the part_of relationship and then provides a set of operations (Table ) that can be applied to those parts. These operators are akin to those of set theory, but whereas set theory makes use of an object's kind_of relationships, EM operators function on an object's part_of relationships. Only wholes and their 'proper parts' are legitimate substrates for EM operations. Proper parts are those parts that satisfy three self-evident criteria: first, nothing is a proper part of itself (a proper part is part of but not identical to the individual or whole); second, if A is a proper part of B then B is not a part of A; third, if A is a part of B and B is a part of C then A is a part of C.
Note that the third criterion of proper parts is that they obey the rule of transitivity. As we discussed earlier, not all part_of relationships are transitive. Accordingly, we have restricted our analyses (see Results and discussion) to component parts (Table 1).
Figure 2 illustrates the effects of applying EM operations to analyze the relationships 'transcript is a part_of gene' and 'exon is a part_of transcript'. The EM operations overlap and disjoint pertain to relationships between transcripts, whereas difference and binary product pertain to exons. Two transcripts overlap if they share one or more exons. Two transcripts are disjoint if they do not share any exons. The exons shared between two overlapping transcripts are the binary product of the two transcripts, and the exons not shared comprise the difference between the two transcripts. The binary sum of two transcripts is simply the sum of their parts.
Figure 2 Using EM operations to characterize alternatively spliced transcripts and their exons. The EM operations overlap and disjoint can be used to characterize pair-wise relationships between alternative transcripts.
One key feature of EM operations is that they operate in 'identifier space' rather than 'coordinate space'. Two transcripts overlap only if they share a part in common rather than if their genomic coordinates overlap. Thus, two transcripts may be disjoint even if their exons partially overlap one another. This is one way in which EM analyses differ from standard bioinformatics analyses, and it has some interesting repercussions. This is particularly so with regard to modes of alternative splicing, as each of the EM operations suggests a distinct category by means of which two alternatively spliced transcripts can be related to one another. We further explore the potential of these operations to classify alternative transcripts and their exons below.
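Because EM operations act on identifiers, they can be illustrated with ordinary set operations. In the Python sketch below each transcript is modeled as a set of exon identifiers; the identifiers and function names are hypothetical. Difference is taken here as the exons of the first transcript that are absent from the second, which is one reading of the description above; a symmetric variant would use the ^ operator instead.

```python
def overlap(t1, t2):
    """Two transcripts overlap if they share at least one exon identifier."""
    return bool(t1 & t2)


def disjoint(t1, t2):
    """Two transcripts are disjoint if they share no exon identifier."""
    return not (t1 & t2)


def binary_product(t1, t2):
    """Exons shared by both transcripts."""
    return t1 & t2


def difference(t1, t2):
    """Exons of t1 that are not parts of t2 (assumed one-sided reading)."""
    return t1 - t2


def binary_sum(t1, t2):
    """All exons that are parts of either transcript."""
    return t1 | t2


# Hypothetical transcripts; exon3 and exon3b could overlap in genomic coordinates,
# but they carry different identifiers and so count as different parts.
transcript_a = frozenset({"exon1", "exon2", "exon3"})
transcript_b = frozenset({"exon1", "exon2", "exon3b"})
transcript_c = frozenset({"exon3b_variant"})

print(overlap(transcript_a, transcript_b))
print(sorted(binary_product(transcript_a, transcript_b)))
print(sorted(difference(transcript_a, transcript_b)))
print(disjoint(transcript_a, transcript_c))
```

No coordinates appear anywhere in the sketch, which is the point: whether two transcripts overlap is decided entirely by shared identifiers.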