The GO project has three major goals: (i) to develop a set of controlled, structured vocabularies—known as ontologies—to describe key domains of molecular biology, including gene product attributes and biological sequences; (ii) to apply GO terms in the annotation of sequences, genes or gene products in biological databases; and (iii) to provide a centralized public resource allowing universal access to the ontologies, annotation data sets and software tools developed for use with GO data.
The GO project provides ontologies to describe attributes of gene products in three non-overlapping domains of molecular biology. Within each ontology, terms have free text definitions and stable unique identifiers. The vocabularies are structured in a classification that supports ‘is-a’ and ‘part-of’ relationships. The scope and structure of the GO vocabularies are described in more detail in references (5). In the current research environment, where new genome sequences are being rapidly generated, and where comparative genome analysis requires the integration of data from multiple sources, it is especially germane to provide rigorous ontologies that can be shared by the community.
Molecular Function (MF) describes activities, such as catalytic or binding activities, at the molecular level. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when or in what context the action takes place. Examples of individual molecular function terms are the broad concept ‘kinase activity’ and the more specific ‘6-phosphofructokinase activity’, which represents a subtype of kinase activity.
Biological Process (BP) describes biological goals accomplished by one or more ordered assemblies of molecular functions. High-level processes such as ‘cell death’ can have both subtypes, such as ‘apoptosis’, and subprocesses, such as ‘apoptotic chromosome condensation’.
Cellular Component (CC) describes locations, at the levels of subcellular structures and macromolecular complexes. Examples of cellular components include ‘nuclear inner membrane’, with the synonym ‘inner envelope’, and the ‘ubiquitin ligase complex’, with several subtypes of these complexes represented.
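The structure described above, terms with stable identifiers linked by ‘is-a’ and ‘part-of’ relationships, can be sketched as a small directed graph. The following Python fragment is only an illustration: the identifiers (`EX:0001` and so on), the `TERMS` and `PARENTS` tables and the `ancestors` function are hypothetical names introduced here, not part of any GO software, and the term names are the examples quoted in the text.

```python
from collections import deque

# Hypothetical identifiers for illustration; real GO identifiers are not reproduced here.
TERMS = {
    "EX:0001": "kinase activity",
    "EX:0002": "6-phosphofructokinase activity",
    "EX:0003": "cell death",
    "EX:0004": "apoptosis",
    "EX:0005": "apoptotic chromosome condensation",
}

# Each child term maps to a list of (relationship, parent term) pairs.
PARENTS = {
    "EX:0002": [("is_a", "EX:0001")],     # a subtype of kinase activity
    "EX:0004": [("is_a", "EX:0003")],     # a subtype of cell death
    "EX:0005": [("part_of", "EX:0003")],  # a subprocess of cell death
}


def ancestors(term_id, parents=PARENTS):
    """Return the set of terms reachable by following is_a/part_of links upward."""
    seen = set()
    queue = deque([term_id])
    while queue:
        current = queue.popleft()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

Under these assumptions, `ancestors("EX:0002")` yields the single broader term for ‘kinase activity’, while a term with no recorded parents yields an empty set.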
The recent development of the Sequence Ontology (SO) permits the classification and standard representation of sequence features. Defined sequence features include terms such as ‘exon’, whose meaning is widely accepted, and the more problematic term ‘pseudogene’, for which several different usages have yet to be resolved. Although the SO is a relatively new vocabulary, and is still undergoing refinement, it is already being used for genome annotation projects in Drosophila and Caenorhabditis elegans.
Collaborating databases provide data sets comprising links between database objects and GO terms, with supporting documentation. Every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis; furthermore, the annotation must indicate the type of evidence the cited source provides to support the association between the gene product and the GO term. A standard set of evidence codes qualifies annotations with respect to different types of experimental determinations. For example, an annotation supported by a direct assay of the function of the exact gene product being annotated is considered more reliable than one inferred from a sequence architecture comparison.
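The requirements on an annotation (a gene product, a GO term, an attributed source and an evidence code) can be illustrated with a minimal record type. This is a sketch under stated assumptions rather than the format used by the collaborating databases: the `Annotation` class, its field names, the `experimentally_supported` helper and the placeholder values are hypothetical. IDA (‘inferred from direct assay’) and ISS (‘inferred from sequence or structural similarity’) are GO evidence codes; treating only IDA as experimental is a simplification made for this example.

```python
from dataclasses import dataclass

# Simplifying assumption: only IDA is treated as direct experimental evidence here.
EXPERIMENTAL_CODES = {"IDA"}


@dataclass(frozen=True)
class Annotation:
    gene_product: str   # database object being annotated
    term_id: str        # identifier of the GO term
    evidence_code: str  # type of evidence, e.g. "IDA" or "ISS"
    source: str         # literature reference, database or analysis

    def __post_init__(self):
        # Every annotation must be attributed to a source and carry an evidence code.
        if not self.source:
            raise ValueError("annotation must be attributed to a source")
        if not self.evidence_code:
            raise ValueError("annotation must carry an evidence code")


def experimentally_supported(annotations):
    """Keep only annotations whose evidence code is in EXPERIMENTAL_CODES."""
    return [a for a in annotations if a.evidence_code in EXPERIMENTAL_CODES]
```

A record built without a source, for instance `Annotation("geneC", "EX:0001", "IDA", "")`, is rejected with a `ValueError`, which mirrors the rule that no annotation may be unattributed.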
High-quality GO annotations, normally based on curatorial review of published literature and supported by experimental evidence, are now available for gene products in many model organisms. In addition, large sets of annotations made using automated methods cover both model organisms and less experimentally tractable organisms, including human. A number of different automatic methods have been applied (e.g. 8), all of which are represented by the evidence code IEA (‘inferred from electronic annotation’). The accompanying table provides a snapshot of current annotations in the GO database; a more detailed table is maintained on the web at http://www.geneontology.org/doc/GO.current.annotations.shtml. Additional information on GO annotations can be found in references (5) and (13).
The SO is being used by the collaborating databases for genomic feature annotation. Like GO annotations, SO annotations are curated using both manual work by experts and purely computational methodologies.
For many purposes, in particular reporting the results of GO annotation of a genome or cDNA collection, it is very useful to have a high-level view of each of the three ontologies. These subsets of the GO have become known as ‘GO slims’, the first of which was constructed for the annotation of the Drosophila genome. An example of a GO slim analysis is shown in the accompanying figure.
Application of a GO slim set in genome annotation. The number of gene products annotated to each term in each of four model organism genomes is shown for a GO slim set taken from the cellular component ontology (data as of August 1, 2003).
The shared use of GO slims makes comparisons of summary GO term distributions very easy. Different applications, however, may require different GO slim sets tailored to the specific needs of an analysis. To address this, the GO Consortium makes both generic and specific GO slim files available. The generic GO slim file is kept up to date with respect to the full ontologies, and specific GO slim files that have been used in particular publications or analyses are archived.
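A GO slim summary of the kind described above can be approximated by mapping each annotated term to the slim terms found among the term itself and its ancestors, then counting distinct gene products per slim term. The fragment below is a minimal sketch, not the Consortium's own tooling: the `slim_counts` function and its argument names are hypothetical, and it assumes that ancestor sets have already been computed from the ontology.

```python
from collections import defaultdict


def slim_counts(annotations, ancestors_of, slim_terms):
    """Count distinct gene products annotated at or below each slim term.

    annotations  -- iterable of (gene_product, term_id) pairs
    ancestors_of -- dict mapping a term_id to the set of its ancestor term_ids
    slim_terms   -- set of term_ids that make up the slim
    """
    products = defaultdict(set)
    for gene_product, term_id in annotations:
        # A term counts toward itself and toward every one of its ancestors.
        lineage = {term_id} | ancestors_of.get(term_id, set())
        for slim_term in lineage & slim_terms:
            products[slim_term].add(gene_product)
    return {term: len(members) for term, members in products.items()}
```

Counting sets of gene products rather than annotations means that a product annotated to several specific terms under the same slim term is counted once, which keeps per-genome totals comparable.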