The Sequence Ontology is an established ontology, with a large user community, for the purpose of genomic annotation. We are reforming the ontology to provide better terms and relationships to describe the features of biological sequence, for both genomic and derived sequence. The SO is working within the guidelines of the OBO Foundry to provide interoperability between SO and the other related OBO ontologies. Here we report changes and improvements made to SO including new relationships to better define the mereological, spatial and temporal aspects of biological sequence.
Genomic data was notorious for the multitude of file formats that expressed the same kind of data in different ways. Each gene prediction algorithm, for example, exported its gene models either in a format different from that of other groups or, when the same format was used, with terms that often had slightly different meanings. Data integration between groups was therefore not straightforward. Likewise, validation of annotations relied on programmers understanding the nuances of each kind of annotation and hard-coding their programs to match. The Sequence Ontology (SO) was initiated in 2003 to provide the terms, and the relations that obtain between terms, needed to describe biological sequences. The main purpose was to unify the vocabulary used in genomic annotations, specifically in genomic databases and flat file data exchange formats. The Sequence Ontology Project provides a forum, in the form of mailing lists, trackers and workshops, for the genomic annotation community to discuss and agree on terminology to describe the biological sequence they manage.
The purpose of annotating a genome is to find and record the parts of the genome that are biologically significant. In this way researchers can make sense of what would otherwise be just a very long string of letters. For example, after annotation, a researcher can tell which sequence variants fall in coding or non-coding sequence and perform subsequent analyses accordingly. A genome annotation anchors knowledge about the genomic sequence, and about the sequence of molecules derived from the genome, onto a linear representation of the replicon (chromosome, plasmid, etc.) using base pair coordinates to capture the position. A sequence_feature is a region or a boundary of sequence that can be located in coordinates on biological sequence, and SO was initially created as an ontology of these sequence feature types and their attributes.
The SO has a large user community of established model organism databases and newer ‘emerging model organism’ systems that rely on the Generic Model Organism Database (GMOD) suite of tools to annotate and disseminate their genetic information. GMOD is a group that provides an open source collection of tools for dealing with genomic data. GMOD schemas and exchange formats, such as the Chado database with its related XML formats and the tab-delimited flat file exchange format Generic Feature Format (GFF3), rely on the SO to type their features. Several GMOD tools use GFF3, for example GBrowse. SO is also used by genome integration projects such as Flymine, modENCODE and the BRC pathogen data repository [8, 9]. There are other uses for SO, such as natural language processing initiatives that use the SO terminology [10, 11].
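As a rough illustration of how a GFF3 record carries an SO type, the following minimal sketch parses the nine tab-separated GFF3 columns and exposes the type column and the Parent attribute. The class name Gff3Feature, the function parse_gff3_line and the three example records are hypothetical and chosen only for this sketch; it ignores comment lines, URL escaping and multi-valued attributes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gff3Feature:
    """One GFF3 record; `type` holds an SO term name or accession."""
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse a single feature line (simplified: no escaping, no comments)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes: Dict[str, str] = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return Gff3Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                       cols[5], cols[6], cols[7], attributes)


# Hypothetical records: a gene, its mRNA and one exon.
EXAMPLE_LINES: List[str] = [
    "chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene1",
    "chr1\texample\tmRNA\t1000\t5000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\texample\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=mrna1",
]

features = [parse_gff3_line(line) for line in EXAMPLE_LINES]
for feature in features:
    print(feature.type, feature.attributes.get("Parent"))
```

Note that each record names only the feature type and, optionally, the parent feature; the relation that holds between the two types is looked up in the ontology rather than stated in the file.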
Genome annotations specify the coordinates of sequence features that are manifest in one or more of the kinds of molecule defined by the central dogma. For example, although an intron is manifest as an RNA molecule, the coordinates of the intron can be projected onto the genomic sequence. The term labels chosen for SO were those in use by the genome annotation community, thus “transcript”, “intron” and so on were chosen as labels for the sequence feature types corresponding to genome regions encoding actual transcript and intron molecules. This polysemy does not cause problems when SO is used purely for genome annotation, but is potentially confusing when it is used in the context of other ontologies.
The current version of SO uses a subsumption hierarchy to describe the kinds of features and a meronomy to describe their part-whole structures. Sequence features were related by their genomic position. For example polypeptide (which referred to the sequence that corresponds to a polypeptide molecule) and transcript (which referred to the sequence that corresponds to an RNA molecule) were described only by genomic context, that is the region of the genome that encodes their sequence. This excluded the post-genomic topology of these features: how the topology of the features changes, as the sequence is expressed by different molecules.
The SO is one of the original members of the OBO Library, a collection of orthogonal, interoperable ontologies developed according to a shared set of principles. These later evolved into the OBO Foundry principles, which include a common syntax, a data-versioning system, collaborative development, and adherence to the same set of defined relationships. The OBO Foundry ontology developers attempt to accurately represent biological reality. Membership in the OBO Foundry represents a commitment to adhere to common ontology design principles and to agree to reform where necessary. The OBO Foundry spans the biomedical domain in steps of granularity from the molecule to the organism, and also extends into the realm of experimental measurements, instrumentation and protocol. The OBO Foundry also partitions ontologies according to their relationship to time. Continuants endure through time, whereas occurrents, which include processes, unfold through time in stages. Anatomical entities such as cells and organs are continuants, as are molecules.
The SO is orthogonal to the neighbor ontologies within the OBO Foundry which represent molecular continuants. Chemical Entities of Biological Interest (ChEBI) is a dictionary of small molecules. The RNA Ontology represents the secondary and tertiary motifs of RNA as well as describing the interactions between bases for base pairing and stacking. The Protein Ontology (PRO) defines the forms of proteins and the evolutionary relationships between protein families. These ontologies are themselves orthogonal to ontologies of processes, such as the Biological Process (BP) and Molecular Function (MF) subsets of the Gene Ontology (GO). The GO BP ontology represents processes of relevance to SO, such as transcription, gene expression and splicing.
In order to best divide work between curators of neighboring ontologies, and to ensure that SO can reuse material from these ontologies and vice versa, the ontologies must all adhere to the same principles. In this paper we will describe how we have been developing the Sequence Ontology in two respects, first to promote interoperability and second to provide a solid framework to describe how sequences change over the course of genomic and post-genomic processes. The rest of the paper is structured as follows: in Section 2 we describe the OBO Foundry standards we have been adopting. In Section 3 we describe new relations for post-genome topology and in Section 4 we describe the relation of SO to neighboring ontologies.
The SO, like other pre-existing ontologies, has begun to undergo reform to meet the OBO Foundry standards.
Upper ontologies such as Basic Formal Ontology (BFO)  provide a formal structure upon which to base domain ontologies. BFO provides a hierarchy of upper-level abstract classes. Classes in domain-specific ontologies can be defined as sub-classes of appropriate abstract classes and inherit their properties. This allows the multiple independently developed ontologies of the OBO Foundry to be linked together. The development of SO preceded the adoption of BFO by the OBO Foundry, so it was necessary to align SO to BFO post-hoc. In order to do this, a fundamental question must be answered: what kind of entity is a sequence feature? This is not a trivial question and suggested answers have ranged from: molecules or molecule regions, the physical pattern of electrons in a computer or purely abstract mathematical forms. None of these solutions was biologically satisfying. Our position is that biological sequences exist independently of our abstractions or computational representations, but are not identical with the molecules themselves. Multiple molecules can have the same sequence, and a sequence feature exists so long as there is a molecule with that sequence. This can be seen as analogous to the distinction between the physical content of a book, and the words written in that book.
BFO divides continuants into independent continuants and dependent continuants. The former include physical objects such as molecules, and the latter include entities such as physical qualities, shapes and functions. The relation that links these is called inheres_in, and we say, for example, that my temperature inheres_in me, or that I am the bearer of my temperature. Dependent continuants are broken down into specifically dependent continuants (SDCs) and generically dependent continuants (GDCs). What differentiates these is the number of bearers: an SDC has a single bearer, and ceases to exist when that bearer ceases to exist (thus the shape of a particular apple disappears after the apple is eaten). A GDC can have multiple bearers, and can continue to exist when bearers cease to exist, so long as there is at least one bearer. A given genomic sequence may be borne by a DNA molecule, an RNA molecule, a polypeptide chain, or indeed by other molecules or systems that are not products of the replication machinery of the cell, for example the set of instructions that drive a solid-phase nucleic acid synthesis device. For this reason we take biological sequences to be GDCs (Fig 1). One of the consequences of this decision is that genes such as the gene denoted by the NCBI Gene ID 6469 (human Shh) are individuals rather than types.
The other SO root classes have also been aligned to BFO, as shown in Figure 1. We take sequence_collection, which is a non-contiguous set of sequences, and sequence_variant, such as a mutation, to be the same sort of thing as a sequence_feature, and hence a GDC. For the moment we are treating sequence_attribute as an intrinsic property of the molecule that bears the sequence, hence in BFO terms a quality, but this is under review. Lastly, the sequence_variant_effect, for example a structural change or a change in transcription, need not necessarily happen so we treat it as a disposition.
We now define new terms according to the OBO Foundry guidelines for definitions. Initially the terms in SO were either defined by a member of the developer community, or taken directly from a reputable website or textbook, giving the ISBN or the URL as the cross-reference. This led to inconsistency between the definitions, and sometimes to inconsistency between the definition and the placement of the term within the ontology. In particular, because the definitions did not conform to a common pattern, it was often unclear whether a feature described a molecule or a sequence. For example, mRNA was defined as: ‘Messenger RNA is the intermediate molecule between DNA and protein. It includes UTR and coding sequences. It does not contain introns.’ This has been updated to ‘Messenger RNA sequence is a mature transcript sequence, a portion of which is coding. It may include UTR but not intron sequence’. The OBO Foundry recommends that terms be defined with respect to the is_a parent and the attributes that differentiate the term from its parent and sibling terms, called the differentiae. This practice forces a self check on whether the position of the term in the ontology agrees with the defined meaning of the term. New definitions in SO must adhere to the “A is_a B that C’s” principle. For example, the new term vector_replicon, a subtype of replicon, has the following definition: A replicon that has been modified to act as a vector for foreign sequence. We are actively refining existing terms.
The SO was the first ontology in the OBO library to augment free text definitions aimed at humans with computable necessary and sufficient ‘cross-product’ definitions. SO has 100 of these definitions in genus/differentiae form. The genus is the broader category to which the term belongs, and the differentiae are the properties that other members of the genus do not have.
To achieve these computable definitions, sequence_feature terms are defined with sequence_attribute terms, using a new relationship has_quality (see footnote 1). Prior to the creation of cross product terms, a complex term such as engineered_foreign_transposible_element_gene would have several manually edited is_a parents: transposable_element_gene, engineered_foreign_gene, and engineered_foreign_region. These multiple parents cause problems for the ontology developer and for visualization and reasoning software. The developer must manually check for other is_a parents percolating further up the graph, and the graph itself becomes difficult to navigate. With the addition of the cross-product relations, the definition becomes computationally visible. The term engineered_foreign_transposible_element_gene now has a single is_a parent, transposable_element_gene, and two qualities, engineered and foreign. A reasoner can then be used to place the terms in the correct place in the ontology. This is especially useful as it untangles the graph for editing purposes. The SO is released in two forms: with the logical definitions, or fully classified for use without a reasoner.
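The classification step described above can be sketched as a toy subsumption check. This is not the reasoner used for SO; it is a minimal illustration under stated assumptions: each defined term has one genus and a set of qualities, a defined class subsumes another when its genus lies on the other's asserted is_a lineage and its qualities are a subset of the other's, and the small is_a fragment (gene directly under region) is a simplification. The names LogicalDefinition, lineage and inferred_superclasses are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class LogicalDefinition:
    """Genus/differentia definition: an is_a genus plus has_quality values."""
    genus: str
    qualities: FrozenSet[str]


# Simplified, hypothetical fragment of the asserted is_a hierarchy.
ASSERTED_IS_A: Dict[str, str] = {
    "gene": "region",
    "transposable_element_gene": "gene",
}

EF = frozenset({"engineered", "foreign"})
DEFINITIONS: Dict[str, LogicalDefinition] = {
    "engineered_foreign_region": LogicalDefinition("region", EF),
    "engineered_foreign_gene": LogicalDefinition("gene", EF),
    "engineered_foreign_transposible_element_gene":
        LogicalDefinition("transposable_element_gene", EF),
}


def lineage(term: str) -> List[str]:
    """The term followed by its asserted is_a ancestors, nearest first."""
    chain: List[str] = []
    current: Optional[str] = term
    while current is not None:
        chain.append(current)
        current = ASSERTED_IS_A.get(current)
    return chain


def inferred_superclasses(term: str) -> Set[str]:
    """The genus plus every other defined class that subsumes `term`."""
    target = DEFINITIONS[term]
    genus_line = set(lineage(target.genus))
    found = {target.genus}
    for name, other in DEFINITIONS.items():
        if name == term:
            continue
        if other.genus in genus_line and other.qualities <= target.qualities:
            found.add(name)
    return found


print(sorted(inferred_superclasses(
    "engineered_foreign_transposible_element_gene")))
```

In this sketch only the genus and the two qualities are asserted for the complex term; the other two superclasses that previously had to be maintained by hand fall out of the definitions.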
In order for reasoners to be able to draw correct inferences about the entities in an ontology, the class level relations must be of the all–some, all–only or all–all types, of which “all–only” is the weakest. This is one of the reasons for the Foundry principle that ontologies should reuse relations from the OBO Relations Ontology (RO), which provides a set of defined formal type level and instance level relations, typically of the all–some form. The list of relations may be extended by individual ontologies as required. In practice, making these changes to SO has required the addition of the ‘has_part’ relation to the ontology. For example, the ontology states that overlapping_EST_set has_part EST. If this relation were reversed and the ontology stated that EST part_of overlapping_EST_set, it would have serious implications for software that uses reasoning to validate sequence annotations. It would imply that all EST sequence annotations were part of a region composed of more than one EST, and single ESTs would therefore incorrectly cause a validation error.
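The effect of the direction of an all–some relation on validation can be sketched as follows. This is a toy validator written for illustration, not software from the SO project: the Feature tuple, the violations function and the four example annotations are all hypothetical, and containment is modelled simply by a parent identifier.

```python
from typing import List, NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    id: str
    type: str
    parent: Optional[str] = None  # id of the containing feature, if any


# (subject type, relation, object type), read as an all-some statement:
# every instance of the subject type stands in the relation to some
# instance of the object type.
Axiom = Tuple[str, str, str]


def violations(features: List[Feature], axiom: Axiom) -> List[str]:
    """Return ids of subject-type features that fail the all-some axiom."""
    subject_type, relation, object_type = axiom
    by_id = {f.id: f for f in features}
    failed: List[str] = []
    for f in features:
        if f.type != subject_type:
            continue
        if relation == "has_part":
            ok = any(c.parent == f.id and c.type == object_type
                     for c in features)
        elif relation == "part_of":
            parent = by_id.get(f.parent) if f.parent is not None else None
            ok = parent is not None and parent.type == object_type
        else:
            raise ValueError(f"unsupported relation: {relation}")
        if not ok:
            failed.append(f.id)
    return failed


ANNOTATIONS = [
    Feature("set1", "overlapping_EST_set"),
    Feature("est1", "EST", parent="set1"),
    Feature("est2", "EST", parent="set1"),
    Feature("est3", "EST"),  # a single EST that belongs to no set
]

as_stated = violations(ANNOTATIONS, ("overlapping_EST_set", "has_part", "EST"))
reversed_axiom = violations(ANNOTATIONS, ("EST", "part_of", "overlapping_EST_set"))
print(as_stated, reversed_axiom)
```

With the relation as stated in the ontology the lone EST is accepted; with the reversed statement it is the only annotation reported as invalid.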
We have added the integral_part_of relation and its inverse. X integral_part_of Y iff every X part_of some Y and every Y has_part some X. This covers the cases where the existence of the part implies the existence of the whole and vice versa.
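Written out at the instance level, the definition above can be read as the following formula. The predicate inst(x, X), meaning that x is an instance of class X, is notation assumed here for illustration; part_of and has_part on the right-hand side are the instance-level relations.

```latex
X \;\mathrm{integral\_part\_of}\; Y \iff
  \forall x\, \bigl( \mathrm{inst}(x, X) \rightarrow
      \exists y\, ( \mathrm{inst}(y, Y) \wedge x \;\mathrm{part\_of}\; y ) \bigr)
  \;\wedge\;
  \forall y\, \bigl( \mathrm{inst}(y, Y) \rightarrow
      \exists x\, ( \mathrm{inst}(x, X) \wedge y \;\mathrm{has\_part}\; x ) \bigr)
```

The first conjunct is the all–some reading of X part_of Y and the second is the all–some reading of Y has_part X; integral_part_of holds only when both do.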
There are several kinds of relation that are needed to describe the complex nature of biological sequence. Mereological relations are needed to describe containment. Spatial relations are needed to relate the positional information about features. These relations are based on Allen’s interval logic. Each transformation of sequence requires a temporal relation. We propose to extend SO with the relations outlined in Table 1.
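To make the spatial relations concrete, here is a minimal sketch of Allen's thirteen interval relations applied to feature coordinates. It assumes half-open, zero-based intervals so that adjacent features share a boundary coordinate; the relation names are Allen's, not necessarily those adopted in Table 1, and the Interval class, the allen_relation function and the example coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open interval [start, end) on a sequence."""
    start: int
    end: int

    def __post_init__(self) -> None:
        if self.start >= self.end:
            raise ValueError("interval must have start < end")


def allen_relation(a: Interval, b: Interval) -> str:
    """Name the single Allen relation that holds from a to b."""
    if a.end < b.start:
        return "before"
    if a.end == b.start:
        return "meets"
    if b.end < a.start:
        return "after"
    if b.end == a.start:
        return "met_by"
    if a.start == b.start and a.end == b.end:
        return "equals"
    if a.start == b.start:
        return "starts" if a.end < b.end else "started_by"
    if a.end == b.end:
        return "finishes" if a.start > b.start else "finished_by"
    if b.start < a.start and a.end < b.end:
        return "during"
    if a.start < b.start and b.end < a.end:
        return "contains"
    # Remaining case: the intervals partially overlap.
    return "overlaps" if a.start < b.start else "overlapped_by"


# Hypothetical transcript with two exons separated by one intron.
transcript = Interval(0, 400)
exon1 = Interval(0, 100)
intron1 = Interval(100, 250)
exon2 = Interval(250, 400)

print(allen_relation(exon1, intron1))     # adjacent features
print(allen_relation(exon1, transcript))  # shares the start boundary
print(allen_relation(exon2, transcript))  # shares the end boundary
```

Because exactly one of the thirteen relations holds between any two intervals, a function of this shape can serve as a basis for checking positional constraints between annotated features.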
Biological sequences inhere in three kinds of polymeric molecule that are produced by the cell’s replication machinery: DNA, RNA and polypeptide. There are also man-made polymers that can bear sequences, such as PNA. The SO will represent the transformation of sequence from one kind of molecule to another using the temporal relations shown in Table 1. A primary_transcript, which is expressed as RNA, is transcribed_from a gene. A polypeptide sequence is a ribosomal_translation_of the CDS sequence. Transcript molecules also undergo processing such as splicing and editing, which remove or add additional sequences. The relations processed_from and processed_into relate the primary transcript to its mature processed form.
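One way to picture these temporal relations is as a small set of triples that can be traversed to recover where a sequence came from. The sketch below is illustrative only: the triple list, the use of mature_transcript as the label for the processed form, the inverse table and the function names are assumptions made for this example, and the relation names themselves remain under review.

```python
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Statements taken from the paragraph above; "mature_transcript" is an
# assumed label for the mature processed form of the primary transcript.
TRIPLES: List[Triple] = [
    ("primary_transcript", "transcribed_from", "gene"),
    ("mature_transcript", "processed_from", "primary_transcript"),
    ("polypeptide", "ribosomal_translation_of", "CDS"),
]

# processed_from and processed_into are treated here as an inverse pair.
INVERSES: Dict[str, str] = {"processed_from": "processed_into"}


def with_inverses(triples: Sequence[Triple]) -> List[Triple]:
    """Add the inverse statement for every relation that has one."""
    result = list(triples)
    for subject, relation, obj in triples:
        if relation in INVERSES:
            result.append((obj, INVERSES[relation], subject))
    return result


def derivation_chain(
    start: str,
    triples: Sequence[Triple],
    relations: Sequence[str] = ("transcribed_from", "processed_from"),
) -> List[str]:
    """Follow derivation relations back from `start` to its origin."""
    chain = [start]
    current = start
    while True:
        source = next(
            (o for s, r, o in triples if s == current and r in relations),
            None,
        )
        if source is None:
            return chain
        chain.append(source)
        current = source


print(derivation_chain("mature_transcript", TRIPLES))
print(with_inverses(TRIPLES)[-1])
```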
The actual names of relations are under review – for example, we may decide to use sequence-specific relations such as upstream_adjacent_to in place of the starts relation, as it may be desirable to reserve starts as a temporal relation between processes.
The Gene Ontology is reforming itself in line with OBO Foundry principles by adding cross-product definitions of its classes where possible. However, for those terms that involve DNA, RNA and polypeptides, this alignment is hampered by SO describing the sequences that inhere in molecules rather than the molecules themselves. Most biologically relevant molecules belong in ChEBI; however, the scope of ChEBI explicitly excludes molecules that are specified by the genome. This gap is now filled by Sequence Ontology:Molecules (SOM), an ontology of molecules of genomic origin. This separate ontology, which represents the molecules and molecular parts that correspond to SO terms such as exon, intron, transposon and so forth, will provide a bridge to neighboring ontologies in the form of cross product generation. ChEBI will continue to provide the molecular units from which genomic molecules are constructed, such as nucleotide residues. A further distinction from a purely structure-based interpretation of the ChEBI ontology is that the circumstances of a molecule are important (see footnote 2). For example, an intron molecule is necessarily the result of a splicing process (an atom-for-atom identical molecule in a comet would not be an intron); hence in BFO terms the SOM classes are defined classes rather than universals. The classes in SOM are cross-referenced to the Sequence Ontology via their logical definitions. In some cases the SO term (the sequence, for example an intron) takes logical precedence, hence the SOM term will be defined in terms of SO, while in other cases the SOM term (the molecule, for example a transfer RNA) comes first. Figure 3 illustrates the difference between SO and SOM. Note that the ontology structure is not always completely isomorphic: a transcript feature such as an exon or intron can be a subsequence of the genomic sequence or the transcript sequence, but this is not true of the corresponding molecule.
Equally, not every class in SO has a SOM counterpart, and vice versa.
Conversely, there were terms in SO that described what are really processes, such as rolling circle replication and theta replication. As such, these terms have been obsoleted in SO and donated to the Gene Ontology.
The Sequence Ontology has always contained terms for annotating sequence regions according to how they were obtained and how much and how well they have been sequenced. As such there is overlap with recent work on the Ontology for Biomedical Investigations (OBI). OBI is formalizing the representation of experimental design, protocol, instrumentation, materials, data generated and analyses performed. SO has taken steps to redefine the kinds of biological region to be in alignment with the OBI distinctions. Region has thus been subtyped to include: biological_region, the mind-independent sequences inhering in the nucleic acid and peptide molecules of you, me and the dinosaurs; biomaterial_region, describing sequences with a specified experimental purpose, for example acting as a bacterial vector; and experimental_feature, describing how sequences were assembled (whether they were a match, contig, supercontig and so forth) and what is known about them. Again, the biomaterial_region sequences are defined classes rather than universals and inhere in molecules which have a particular function or role. Those functions and roles are the domain of OBI.
Hoehndorf et al. have written an interesting “bottom-up” account of axiomatizing sequences and their relations from a logician’s perspective. Two of the assumptions made in the paper allow us to clarify some important points, one about sequence mereogeometry, the other about existential dependence. The first is that they draw a faulty distinction between the mereology of sequences and the mereology of molecules, in arguing that the sequence ACAC has a single sequence AC that appears twice (and hence only seven sequences that are parts), as opposed to the molecule ACAC which has distinct molecular parts AC- and -AC, and hence ten molecular parts. Hoehndorf et al. read the sequence ACAC such that AC proper_part_of ACAC, which contravenes the weak company principle, namely that anything with a proper part must have at least one further proper part distinct from the first.
But this is a misreading, because ACAC should really be read as $A_{n}C_{n+1}A_{n+2}C_{n+3}$. Sequences, as their name suggests, consist of parts in a particular order, and the part AC that starts at position n is clearly distinct from the part AC that starts at position n+2. The second assumption is that they take junctions to be specifically dependent in the BFO sense on their sequences (by which they mean sequence regions). It is true that junctions existentially depend on the regions they start or end, but the sense of “dependence” intended by BFO’s “specifically dependent continuant” is one of inherence, and junctions inhere in molecules just as regions do.
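The two counts at issue in the ACAC example, seven versus ten, can be reproduced with a short enumeration. This is a minimal sketch in which a part is identified either by its letters alone or by its start position together with its letters; both counts include the whole sequence itself, and the function name positional_parts is hypothetical.

```python
from typing import List, Tuple


def positional_parts(seq: str) -> List[Tuple[int, str]]:
    """Every contiguous part of seq, tagged with its 0-based start offset."""
    n = len(seq)
    return [
        (start, seq[start:end])
        for start in range(n)
        for end in range(start + 1, n + 1)
    ]


parts = positional_parts("ACAC")
distinct_strings = {letters for _, letters in parts}

print(len(distinct_strings))  # parts identified by letters alone
print(len(parts))             # parts identified by position and letters
print([p for p in parts if p[1] == "AC"])
```

Identifying parts by letters alone collapses the two occurrences of AC into one, giving seven parts; keeping the position gives ten, with AC at offset 0 and AC at offset 2 as distinct parts, which is the reading argued for above.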
The updates to the SO, based on OBO Foundry recommendations, have strengthened the ontology as a tool for reasoning. The treatment of definitions enforces a tight regulation on the position of a new term in the ontology and synchronizes the textual definition within the subsumption hierarchy. The process of updating all of the definitions is ongoing. Stricter adherence to the OBO Relations Ontology is making SO interoperable with the other OBO ontologies. The SO uses a reasoner to maintain the is_a parents of cross product terms. This aids ontology maintenance and can be used as a model for other OBO ontologies.
The application of sequence_features that span the range of the molecular biology central dogma, rather than simply the position of the genomic region that encodes the molecule, is a subtle but important step forward. It allows the topological relations at each stage from genome to transcript or peptide to be catalogued. It roots the SO within OBO making cross products between the neighbor ontologies possible.
The addition of a suite of mereological, topological and temporal relations will dramatically enhance the ability to use the SO as a tool for computational reasoning. Each of the new defined relationships adds another avenue for analysis. This is especially important for the validation of sequence annotations using SO.
The creation of the SOM subset of terms fills a gap in the OBO Foundry ontologies between SO, ChEBI and RNAO, in describing the physical molecules that are encoded by genomes. This will greatly facilitate inter-ontology relations, and also be useful in defining SO terms. The placement of both SO and SOM into the BFO hierarchy also strengthens the interoperability of the ontologies, and promotes reuse and cross product formation.
It is important to understand how the proposed changes will affect the annotation community who already use the terms and relations of SO in their pipelines and processes. The daily revisions to the SO are managed using a CVS repository, and there is a bi-monthly release schedule for more stable versions. Developers commit to using either the revisions or the releases. SOM is checked into the CVS repository and will undergo releases as required. The terminology already in use to type features will not change. The GFF3 format will be unaffected, as it lists the feature types and the parent term of a given relation. It does not name the relation; this is maintained in the ontology. Developers are given notice of new relationships and structures via the developer mailing list, as these may have adverse effects on pipelines and programs. The relations are added to the ontology before they are used structurally. A webpage addresses the upcoming changes to the SO.
This work is supported by the NHGRI, via the Gene Ontology Consortium, HG004341.
Footnote 1: This is under review. According to BFO, the quality inheres in the independent continuant, so we will likely need a relation that chains the sequence feature to the molecule to the quality.
Footnote 2: Though in ChEBI the only difference, for example, between glycolipid (CHEBI:33563) and neoglycolipid (CHEBI:51019) is that the latter has been synthetically produced.