The Gene Ontology (GO) project develops and uses a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences. The GO Consortium continues to improve the vocabulary content, reflecting the impact of several novel mechanisms for incorporating community input. A growing number of model organism databases and genome annotation groups contribute annotation sets using GO terms to GO's public repository. Updates to the AmiGO browser have improved access to contributed genome annotations. As the GO project continues to grow, the use of the GO vocabularies is becoming more varied as well as more widespread. The GO project provides an ontological annotation system that enables biologists to infer knowledge from large amounts of data.
The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate, consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotations. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of interest to them, to enhance the corpus of GO annotations available to all.
The Sequence Ontology is an established ontology, with a large user community, for the purpose of genomic annotation. We are reforming the ontology to provide better terms and relationships to describe the features of biological sequence, for both genomic and derived sequence. The SO is working within the guidelines of the OBO Foundry to provide interoperability between SO and the other related OBO ontologies. Here we report changes and improvements made to SO including new relationships to better define the mereological, spatial and temporal aspects of biological sequence.
Sequence Ontology; biomedical ontology; genome annotation
CharProtDB (http://www.jcvi.org/charprotdb/) is a curated database of biochemically characterized proteins. It provides a source of direct rather than transitive assignments of function, designed to support automated annotation pipelines. The initial data set in CharProtDB was collected through manual literature curation over the years by analysts at the J. Craig Venter Institute (JCVI) [formerly The Institute for Genomic Research (TIGR)] as part of their prokaryotic genome sequencing projects. CharProtDB has been expanded by import of selected records from publicly available protein collections whose biocuration indicated direct rather than homology-based assignment of function. Annotations in CharProtDB include gene name, symbol and various controlled vocabulary terms, including Gene Ontology terms, Enzyme Commission number and TransportDB accession. Each annotation is referenced to its source: ideally a journal reference or, if a record was imported and lacks one, the original database source.
PeerGAD is a web-based database-driven application that allows community-wide peer-reviewed annotation of prokaryotic genome sequences. The application was developed to support the annotation of the Pseudomonas syringae pv. tomato strain DC3000 genome sequence and is easily portable to other genome sequence annotation projects. PeerGAD incorporates several innovative design and operation features and accepts annotations pertaining to gene naming, role classification, gene translation and annotation derivation. The annotator tool in PeerGAD is built around a genome browser that offers users the ability to search and navigate the genome sequence. Because the application encourages annotation of the genome sequence directly by researchers and relies on peer review, it circumvents the need for an annotation curator while providing added value to the annotation data. Support for the Gene Ontology™ vocabulary, a structured and controlled vocabulary used in classification of gene roles, is emphasized throughout the system. Here we present the underlying concepts integral to the functionality of PeerGAD.
Semantic similarity analysis facilitates automated semantic explanations of biological and clinical data annotated by biomedical ontologies. The Gene Ontology (GO) has become one of the most important biomedical ontologies with a set of controlled vocabularies, providing rich semantic annotations for genes and molecular phenotypes for diseases. Current methods for measuring GO semantic similarities are limited to considering only the ancestor terms while neglecting the descendants. One can find many GO term pairs whose ancestors are identical but whose descendants are very different, and vice versa. Moreover, the lower parts of GO trees are full of terms with more specific semantics.
This study proposed a method of measuring semantic similarities between GO terms using the entire GO tree structure, including both the upper (ancestral) and the lower (descendant) parts. Comprehensive comparison studies were performed with well-known information content-based and graph structure-based semantic similarity measures with protein sequence similarities, gene expression-profile correlations, protein–protein interactions, and biological pathway analyses.
The proposed bidirectional measure of semantic similarity outperformed other graph-based and information content-based methods.
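As a rough illustration of the bidirectional idea, the sketch below scores two terms on a toy is_a DAG by mixing the overlap of their ancestor sets with the overlap of their descendant sets. The term names, the Jaccard overlap and the equal weighting are invented for illustration and are not the paper's exact measure.

```python
# Illustrative bidirectional similarity on a toy is_a DAG.
IS_A = {                      # child -> list of parents (toy terms)
    "binding": [],
    "protein_binding": ["binding"],
    "dna_binding": ["binding"],
    "kinase_binding": ["protein_binding"],
}

def ancestors(term):
    """All terms reachable by following is_a links upward."""
    seen, stack = set(), [term]
    while stack:
        for parent in IS_A[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def descendants(term):
    """All terms that have `term` among their ancestors."""
    return {t for t in IS_A if term in ancestors(t)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def bidirectional_sim(t1, t2, w=0.5):
    """Mix ancestor-set overlap (upper graph) with descendant-set
    overlap (lower graph); each set includes the term itself."""
    up = jaccard(ancestors(t1) | {t1}, ancestors(t2) | {t2})
    down = jaccard(descendants(t1) | {t1}, descendants(t2) | {t2})
    return w * up + (1 - w) * down
```

Two terms with the same ancestors but different descendants now receive a similarity below 1, which a purely ancestor-based measure would miss.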
Bioinformatics; medical informatics; meta-data
Mereological relationships--relationships between parts and wholes--are essential for ontological engineering in the anatomical domain. We propose a knowledge engineering approach that emulates mereological reasoning by taxonomic reasoning based on SEP triplets, a special data structure for the encoding of part-whole relations, which is fully embedded in the formal framework of standard description logics. We extend the SEP formalism in order to account not only for the part-of but also for the has-part relation, both being considered transitive in our domain. Furthermore we analyze the distinction between the ontological primitives singletons, collections and mass concepts in the anatomy domain and sketch how reasoning about these kinds of concepts can be accounted for in a knowledge representation language, using the extended SEP formalism.
The Sequence Ontology (SO) aims to unify the way in which we describe
sequence annotations, by providing a controlled vocabulary of terms and the
relationships between them. Using SO terms to label the parts of sequence annotations
greatly facilitates downstream analyses of their contents, as it ensures that annotations
produced by different groups conform to a single standard. This greatly facilitates
analyses of annotation contents and characteristics, e.g. comparisons of UTRs,
alternative splicing, etc. Because SO also specifies the relationships between features,
e.g. part_of, kind_of, annotations described with SO terms are also better substrates
for validation and visualization software.
This document provides a step-by-step guide to producing a SO compliant file
describing a sequence annotation. We illustrate this by using an annotated gene as an
example. First we show where the terms needed to describe the gene's features are
located in SO and their relationships to one another. We then show line by line how
to format the file to construct a SO compliant annotation of this gene.
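A minimal sketch of what such a file can look like: the snippet below emits GFF3-style lines in which column 3 carries a SO feature term and Parent= attributes encode the part_of relationships between gene, mRNA and exon features. The sequence ID, coordinates and source column are invented for illustration.

```python
# Emit GFF3-style lines whose type column (column 3) is a SO term and
# whose Parent= attributes encode part_of links; all data are invented.
def gff3_line(seqid, so_type, start, end, strand, attrs):
    attr_str = ";".join(f"{k}={v}" for k, v in attrs.items())
    return "\t".join([seqid, "example", so_type, str(start), str(end),
                      ".", strand, ".", attr_str])

lines = [
    gff3_line("chr1", "gene", 1000, 3000, "+", {"ID": "gene01"}),
    gff3_line("chr1", "mRNA", 1000, 3000, "+",
              {"ID": "mrna01", "Parent": "gene01"}),
    gff3_line("chr1", "exon", 1000, 1200, "+", {"Parent": "mrna01"}),
    gff3_line("chr1", "exon", 2500, 3000, "+", {"Parent": "mrna01"}),
]
print("\n".join(lines))
```

Because every feature type is a SO term and every containment is explicit, downstream tools can validate the file against SO's part_of constraints (e.g. an exon must be part_of a transcript).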
The Mammalian Phenotype Ontology (MP) is a structured vocabulary for describing mammalian phenotypes and serves as a critical tool for efficient annotation and comprehensive retrieval of phenotype data. Importantly, the ontology contains broad and specific terms, facilitating annotation of data from initial observations or screens and detailed data from subsequent experimental research. Using the ontology structure, data are retrieved inclusively, i.e., data annotated to chosen terms and to terms subordinate in the hierarchy. Thus, searching for “abnormal craniofacial morphology” also returns annotations to “megacephaly” and “microcephaly,” more specific terms in the hierarchy path. The development and refinement of the MP is ongoing, with new terms and modifications to its organization undergoing continuous assessment as users and expert reviewers propose expansions and revisions. A wealth of phenotype data on mouse mutations and variants annotated to the MP already exists in the Mouse Genome Informatics database. These data, along with data curated to the MP by many mouse mutagenesis programs and mouse repositories, provide a platform for comparative analyses and correlative discoveries. The MP provides a standard underpinning to mouse phenotype descriptions for existing and future experimental and large-scale phenotyping projects. In this review we describe the MP as it presently exists, its application to phenotype annotations, the relationship of the MP to other ontologies, and the integration of the MP within large-scale phenotyping projects. Finally we discuss future application of the MP in providing standard descriptors of the phenotype pipeline test results from the International Mouse Phenotype Consortium projects.
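The inclusive retrieval described above can be sketched with a toy hierarchy: querying a broad term also collects data annotated to its subordinate terms. The parent links and mutant names below are illustrative placeholders, not actual MP content.

```python
# Inclusive retrieval over a toy phenotype hierarchy: a query on a broad
# term also returns data annotated to its more specific descendants.
PARENT = {  # child term -> parent term (illustrative placeholders)
    "megacephaly": "abnormal craniofacial morphology",
    "microcephaly": "abnormal craniofacial morphology",
}
ANNOTATIONS = {  # term -> annotated mutants (placeholders)
    "abnormal craniofacial morphology": ["mutant_C"],
    "megacephaly": ["mutant_A"],
    "microcephaly": ["mutant_B"],
}

def subtree(term):
    """The term plus every term below it in the hierarchy."""
    terms, changed = {term}, True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in terms and child not in terms:
                terms.add(child)
                changed = True
    return terms

def retrieve(term):
    """All annotations to the term or any of its subordinate terms."""
    return sorted(m for t in subtree(term) for m in ANNOTATIONS.get(t, []))
```

A query on the broad term returns all three mutants, while a query on a leaf term returns only its own annotations.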
The Gene Ontology (GO) is a collaborative effort that provides structured vocabularies for annotating the molecular function, biological role, and cellular location of gene products in a highly systematic way and in a species-neutral manner with the aim of unifying the representation of gene function across different organisms. Each contributing member of the GO Consortium independently associates GO terms to gene products from the organism(s) they are annotating. Here we introduce the Reference Genome project, which brings together those independent efforts into a unified framework based on the evolutionary relationships between genes in these different organisms. The Reference Genome project has two primary goals: to increase the depth and breadth of annotations for genes in each of the organisms in the project, and to create data sets and tools that enable other genome annotation efforts to infer GO annotations for homologous genes in their organisms. In addition, the project has several important incidental benefits, such as increasing annotation consistency across genome databases, and providing important improvements to the GO's logical structure and biological content.
Biological research is increasingly dependent on the availability of well-structured representations of biological data with detailed, accurate descriptions provided by the curators of the data repositories. The Reference Genome project's goal is to provide comprehensive functional annotation for the genomes of human and of eleven organisms that are important models in biomedical research. To achieve this, we have developed an approach that superposes experimentally based annotations onto the leaves of phylogenetic trees; we then manually annotate the functions of the common ancestors, on the assumption that the ancestors possessed the experimentally determined functions held in common at these leaves, and that these functions are likely to be conserved in all other descendants of each family.
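A minimal sketch of this propagation idea, under simplifying assumptions: an ancestor is assigned the functions shared by its experimentally characterized leaf descendants, and those functions are then predicted for uncharacterized leaves. Gene and function names are invented; the real pipeline works over curated phylogenetic trees with manual review.

```python
# Toy tree-based propagation: the ancestor gets the functions its
# characterized descendants hold in common; uncharacterized leaves
# inherit them as predictions. All names are invented.
TREE = {"ancestor": ["geneA", "geneB", "geneC"]}  # ancestor -> leaves
EXPERIMENTAL = {
    "geneA": {"kinase activity", "ATP binding"},
    "geneB": {"kinase activity"},
    # geneC has no experimental annotation
}

def infer(tree, experimental):
    predictions = {}
    for anc, leaves in tree.items():
        known = [experimental[l] for l in leaves if l in experimental]
        # functions held in common at the characterized leaves
        shared = set.intersection(*known) if known else set()
        predictions[anc] = shared
        for leaf in leaves:
            if leaf not in experimental:
                predictions[leaf] = set(shared)
    return predictions
```

Here the ancestor and the uncharacterized geneC are both predicted to have "kinase activity", the only function common to the characterized leaves.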
Gramene (http://www.gramene.org/) is a comparative genome database for cereal crops
and a community resource for rice. We are populating and curating Gramene with
annotated rice (Oryza sativa) genomic sequence data and associated biological information
including molecular markers, mutants, phenotypes, polymorphisms and Quantitative Trait
Loci (QTL). In order to support queries across various data sets as well as across external
databases, Gramene will employ three related controlled vocabularies. The specific goals of
Gramene are, first, to provide a Trait Ontology (TO) that can be used across the cereal
crops to facilitate phenotypic comparisons both within and between the genera. Second, a
vocabulary for plant anatomy terms, the Plant Ontology (PO) will facilitate the curation
of morphological and anatomical feature information with respect to expression,
localization of genes and gene products and the affected plant parts in a phenotype. The
TO and PO are both in the early stages of development in collaboration with the
International Rice Research Institute, TAIR and MaizeDB as part of the Plant Ontology
Consortium. Finally, as part of another consortium comprising macromolecular databases
from other model organisms, the Gene Ontology Consortium, we are annotating the
confirmed and predicted protein entries from rice using both electronic and manual curation methods.
The value of any kind of data is greatly enhanced when it exists in a form that allows it to be integrated with other data. One approach to integration is through the annotation of multiple bodies of data using common controlled vocabularies or ‘ontologies’. Unfortunately, the very success of this approach has led to a proliferation of ontologies, which itself creates obstacles to integration. The Open Biomedical Ontologies (OBO) consortium is pursuing a strategy to overcome this problem. Existing OBO ontologies, including the Gene Ontology, are undergoing coordinated reform, and new ontologies are being created on the basis of an evolving set of shared principles governing ontology development. The result is an expanding family of ontologies designed to be interoperable and logically well formed and to incorporate accurate representations of biological reality. We describe this OBO Foundry initiative and provide guidelines for those who might wish to become involved.
The advent of sequencing and structural genomics projects has provided a dramatic boost in the number of protein structures and sequences. Due to the high-throughput nature of these projects, many of the molecules are uncharacterised and their functions unknown. This, in turn, has led to the need for a greater number and diversity of tools and databases providing annotation through transfer based on homology and prediction methods. Though many such tools to annotate protein sequence and structure exist, they are spread throughout the world, often with dedicated individual web pages. This situation does not provide a consensus view of the data and hinders comparison between methods. Integration of these methods is needed. So far this has not been possible since there was no common vocabulary available that could be used as a standard language. A variety of terms could be used to describe any particular feature, ranging from different spellings to completely different terms. The Protein Feature Ontology (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS) is a structured controlled vocabulary for features of a protein sequence or structure. It provides a common language for tools and methods to use, so that integration and comparison of their annotations are possible. The Protein Feature Ontology comprises approximately 100 positional terms (those located in a particular region of the sequence), which have been integrated into the Sequence Ontology (SO). A further 40 non-positional terms describing general protein properties have also been defined and, in addition, post-translational modifications are described by using an already existing ontology, the Protein Modification Ontology (MOD). The Protein Feature Ontology has been used by the BioSapiens Network of Excellence, a consortium comprising 19 partner sites in 14 European countries generating over 150 distinct annotation types for protein sequences and structures.
We present an ontology for describing genomes, genome comparisons, their evolution
and biological function. This ontology will support the development of novel genome
comparison algorithms and aid the community in discussing genomic evolution. It
provides a framework for communication about comparative genomics, and a basis
upon which further automated analysis can be built. The nomenclature defined
by the ontology will foster clearer communication between biologists, and also
standardize terms used by data publishers in the results of analysis programs.
The overriding aim of this ontology is to facilitate consistent annotation of
genomes by computational methods rather than by human annotators. To this
end, the ontology includes definitions that support computer analysis and automated
transfer of annotations between genomes, rather than relying upon human mediation.
Ontologies are widely used to represent knowledge in biomedicine. Systematic approaches for detecting errors and disagreements are needed for large ontologies with hundreds or thousands of terms and semantic relationships. A recent approach of defining terms using logical definitions is now increasingly being adopted as a method for quality control as well as for facilitating interoperability and data integration.
We show how automated reasoning over logical definitions of ontology terms can be used to improve ontology structure. We provide the Java software package GULO (Getting an Understanding of LOgical definitions), which allows fast and easy evaluation for any kind of logically decomposed ontology by generating a composite OWL ontology from appropriate subsets of the referenced ontologies and comparing the inferred relationships with the relationships asserted in the target ontology. As a case study we show how to use GULO to evaluate the logical definitions that have been developed for the Mammalian Phenotype Ontology (MPO).
Logical definitions of terms from biomedical ontologies represent an important resource for error and disagreement detection. GULO gives ontology curators a fast and simple tool for validation of their work.
Researchers in biomedical informatics use ontologies and terminologies to annotate their data in order to facilitate data integration and translational discoveries. As the use of ontologies for annotation of biomedical datasets has risen, a common challenge is to identify ontologies that are best suited to annotating specific datasets. The number and variety of biomedical ontologies is large, and it is cumbersome for a researcher to figure out which ontology to use.
We present the Biomedical Ontology Recommender web service. The system uses textual metadata or a set of keywords describing a domain of interest and suggests appropriate ontologies for annotating or representing the data. The service makes a decision based on three criteria. The first one is coverage, or the ontologies that provide most terms covering the input text. The second is connectivity, or the ontologies that are most often mapped to by other ontologies. The final criterion is size, or the number of concepts in the ontologies. The service scores the ontologies as a function of scores of the annotations created using the National Center for Biomedical Ontology (NCBO) Annotator web service. We used all the ontologies from the UMLS Metathesaurus and the NCBO BioPortal.
We compare and contrast our Recommender with previously published efforts through an exhaustive functional comparison. We evaluate and discuss the results of several recommendation heuristics in the context of three real-world use cases. The best recommendation heuristics, rated ‘very relevant’ by expert evaluators, are the ones based on the coverage and connectivity criteria. The Recommender service (alpha version) is available to the community and is embedded into BioPortal.
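A hypothetical sketch of how the three criteria might be combined into a single ranking: the weights and the max-normalization below are invented for illustration and do not reproduce the service's actual scoring function.

```python
# Invented illustration: rank ontologies by a weighted mix of coverage,
# connectivity and size, each normalized by the best value seen.
def recommend(ontologies, weights=(0.6, 0.3, 0.1)):
    """ontologies: dicts with 'name', 'coverage', 'connectivity', 'size'."""
    w_cov, w_con, w_size = weights
    max_cov = max(o["coverage"] for o in ontologies) or 1
    max_con = max(o["connectivity"] for o in ontologies) or 1
    max_size = max(o["size"] for o in ontologies) or 1
    scored = sorted(
        ((w_cov * o["coverage"] / max_cov
          + w_con * o["connectivity"] / max_con
          + w_size * o["size"] / max_size), o["name"])
        for o in ontologies
    )
    return [name for _, name in reversed(scored)]  # best first
```

Weighting coverage most heavily mirrors the evaluation result above, where coverage-based heuristics were rated most relevant.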
Controlled vocabularies are common within bioinformatics resources. They can be used to
give a summary of the knowledge held about a particular entity. They are also used to
constrain values given for particular attributes of an entity. This helps create a shared
understanding of a domain and aids increased precision and recall during querying of
resources. Ontologies can provide such facilities and can further enhance their utility.
Controlled vocabularies are often simply lists of words, but may be viewed as a kind of
ontology. Ideally ontologies are structurally enriched with relationships between terms
within the vocabulary. Use of such rich forms of vocabularies in database annotation could
enhance those resources' usability by both humans and computers. The representation of
the knowledge content of biological resources in a computationally accessible form opens
the prospect of greater support for a biologist investigating new data.
The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences. Many model organism databases and genome annotation groups use the GO and contribute their annotation sets to the GO resource. The GO database integrates the vocabularies and contributed annotations and provides full access to this information in several formats. Members of the GO Consortium continually work collectively, involving outside experts as needed, to expand and update the GO vocabularies. The GO Web resource also provides access to extensive documentation about the GO project and links to applications that use GO data for functional analyses.
Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for the description of genes and their products in any organism. Annotation by GO terms is performed in most of the current genome projects, which besides generality has the advantage of being very convenient for computer-based classification methods. However, direct use of GO in small sequencing projects is not easy, especially for species not commonly represented in public databases. We present a software package (GOblet), which performs annotation based on GO terms for anonymous cDNA or protein sequences. It uses the species-independent GO structure and vocabulary together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The sensitivity and the reference protein sets can be selected by the user. GOblet runs automatically and is available as a public service on our web server. The paper also addresses the reliability of automated GO annotations by using a reference set of more than 6000 human proteins. The GOblet server is accessible at http://goblet.molgen.mpg.de.
Functional annotation of novel sequence data is a primary requirement for the utilization of functional genomics approaches in plant research. In this paper, we describe the Blast2GO suite as a comprehensive bioinformatics tool for functional annotation of sequences and data mining on the resulting annotations, primarily based on the gene ontology (GO) vocabulary. Blast2GO optimizes function transfer from homologous sequences through an elaborate algorithm that considers similarity, the extent of the homology, the database of choice, the GO hierarchy, and the quality of the original annotations. The tool includes numerous functions for the visualization, management, and statistical analysis of annotation results, including gene set enrichment analysis. The application supports InterPro, enzyme codes, KEGG pathways, GO directed acyclic graphs (DAGs), and GOSlim. Blast2GO is a suitable tool for plant genomics research because of its versatility, easy installation, and friendly use.
Controlled vocabularies (CVs), i.e. collections of predefined terms describing a modeling domain, and ontologies are used for the semantic annotation of data in structured data formats and databases: they avoid inconsistencies in annotation, provide unique (and preferably short) accession numbers and give researchers and computer algorithms the possibility of more expressive semantic annotation of data. The Human Proteome Organization (HUPO)–Proteomics Standards Initiative (PSI) makes extensive use of ontologies/CVs in their data formats. The PSI-Mass Spectrometry (MS) CV contains all the terms used in the PSI MS–related data standards. The CV contains a logical hierarchical structure to ensure ease of maintenance and the development of software that makes use of complex semantics. The CV contains terms required for a complete description of an MS analysis pipeline used in proteomics, including sample labeling, digestion enzymes, instrumentation parts and parameters, software used for identification and quantification of peptides/proteins and the parameters and scores used to determine their significance. Owing to the range of topics covered by the CV, collaborative development across several PSI working groups, including proteomics research groups, instrument manufacturers and software vendors, was necessary. In this article, we describe the overall structure of the CV, the process by which it has been developed and is maintained and the dependencies on other ontologies.
Database URL: http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo
Recent increases in the volume and diversity of life science data and information and an increasing emphasis on data sharing and interoperability have resulted in the creation of a large number of biological ontologies, including the Cell Ontology (CL), designed to provide a standardized representation of cell types for data annotation. Ontologies have been shown to have significant benefits for computational analyses of large data sets and for automated reasoning applications, leading to organized attempts to improve the structure and formal rigor of ontologies to better support computation. Currently, the CL employs multiple is_a relations, defining cell types in terms of histological, functional, and lineage properties, and the majority of definitions are written with sufficient generality to hold across multiple species. This approach limits the CL's utility for computation and for cross-species data integration.
To enhance the CL's utility for computational analyses, we developed a method for the ontological representation of cells and applied this method to develop a dendritic cell ontology (DC-CL). DC-CL subtypes are delineated on the basis of surface protein expression, systematically including both species-general and species-specific types and optimizing DC-CL for the analysis of flow cytometry data. We avoid multiple uses of is_a by linking DC-CL terms to terms in other ontologies via additional, formally defined relations such as has_function.
This approach brings benefits in the form of increased accuracy, support for reasoning, and interoperability with other ontology resources. Accordingly, we propose our method as a general strategy for the ontological representation of cells. DC-CL is freely available.
Over the last 50 years, sequencing, structural biology and bioinformatics have completely revolutionised biomolecular science, with millions of sequences and tens of thousands of three-dimensional structures becoming available. The bioinformatics of enzymes is well served by, mostly free, online databases. BRENDA describes the chemistry, substrate specificity, kinetics, preparation and biological sources of enzymes, while KEGG is valuable for understanding enzymes and metabolic pathways. EzCatDB, SFLD and MACiE are key repositories for data on the chemical mechanisms by which enzymes operate. At the current rate of genome sequencing and manual annotation, human curation will never finish the functional annotation of the ever-expanding list of known enzymes. Hence there is an increasing need for automated annotation, though it is not yet widespread for enzyme data. In contrast, functional ontologies such as the Gene Ontology already profit from automation. Despite our growing understanding of enzyme structure and dynamics, we are only beginning to be able to design novel enzymes. One can now begin to trace the functional evolution of enzymes using phylogenetics. The ability of enzymes to perform secondary functions, albeit relatively inefficiently, gives clues as to how enzyme function evolves. Substrate promiscuity in enzymes is one example of imperfect specificity in protein-ligand interactions. Similarly, most drugs bind to more than one protein target. This may sometimes result in helpful polypharmacology as a drug modulates plural targets, but also often leads to adverse side-effects. Many cheminformatics approaches can be used to model the interactions between druglike molecules and proteins in silico. We can even use quantum chemical techniques like DFT and QM/MM to compute the structural and energetic course of enzyme-catalysed chemical reaction mechanisms, including a full description of bond making and breaking.
Automated protein function prediction methods are the only practical approach for assigning functions to genes obtained from model organisms. Many of the previously reported function annotation methods are of limited utility for fungal protein annotation. They are often trained only to one species, are not available for high-volume data processing, or require the use of data derived by experiments such as microarray analysis. To meet the increasing need for high throughput, automated annotation of fungal genomes, we have developed a tool for annotating fungal protein sequences with terms from the Gene Ontology.
We describe a classifier called PoGO (Prediction of Gene Ontology terms) that uses statistical pattern recognition methods to assign Gene Ontology (GO) terms to proteins from filamentous fungi. PoGO is organized as a meta-classifier in which each evidence source (sequence similarity, protein domains, protein structure and biochemical properties) is used to train independent base-level classifiers. The outputs of the base classifiers are used to train a meta-classifier, which provides the final assignment of GO terms. An independent classifier is trained for each GO term, making the system amenable to updating, without having to re-train the whole system. The resulting system is robust. It provides better accuracy and can assign GO terms to a higher percentage of unannotated protein sequences than other methods that we tested.
Our annotation system overcomes many of the shortcomings that we found in other methods. We also provide a web server where users can submit protein sequences to be annotated.
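The meta-classifier arrangement can be sketched in miniature: one base classifier per evidence source votes on a GO term assignment, and a meta-level combines their outputs. Here the meta-level is a simple accuracy-weighted vote standing in for a trained classifier; PoGO itself trains both the base and meta levels on real evidence sources (sequence similarity, domains, structure, biochemical properties).

```python
# Toy stacking sketch: weight each base classifier by its training
# accuracy, then combine base outputs with a weighted vote. This is a
# stand-in for a trained meta-classifier, not the PoGO implementation.
def train_meta_weights(base_preds, labels):
    """base_preds: one list of 0/1 predictions per base classifier."""
    weights = []
    for preds in base_preds:
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        weights.append(acc)
    return weights

def meta_predict(base_outputs, weights):
    """Weighted vote over base-classifier outputs for one protein."""
    score = sum(w * o for w, o in zip(weights, base_outputs))
    return 1 if score >= sum(weights) / 2 else 0
```

Because the weights for each GO term are learned independently, a single term's classifier can be retrained without touching the rest of the system, matching the updatability property described above.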
The Zebrafish Information Network (ZFIN) is a web based community resource that serves as a centralized location for the curation and integration of zebrafish genetic, genomic and developmental data. ZFIN is publicly accessible at http://zfin.org. ZFIN provides an integrated representation of mutants, genes, genetic markers, mapping panels, publications and community contact data. Recent enhancements to ZFIN include: (i) an anatomical dictionary that provides a controlled vocabulary of anatomical terms, grouped by developmental stages, that may be used to annotate and query gene expression data; (ii) gene expression data; (iii) expanded support for genome sequence; (iv) gene annotation using the standardized vocabulary of Gene Ontology (GO) terms that can be used to elucidate relationships between gene products in zebrafish and other organisms; and (v) collaborations with other databases (NCBI, Sanger Institute and SWISS-PROT) to provide standardization and interconnections based on shared curation.