As one example, consider DNA metabolism, a biological process carried out by largely (but not entirely) shared elements in eukaryotes. The part of the process ontology shown (with selected gene names from S. cerevisiae, Drosophila and M. musculus) is largely one parent to many children (Fig. 1). One notable exception is the process of DNA ligation, which is a child of three processes: DNA replication, DNA repair and DNA recombination. The yeast gene product Cdc9p is able to carry out the ligation step for all three processes, whereas it is uncertain whether the same enzyme is used in the other species. From the point of view of the ontology, this does not matter: a computer (or a human searcher) will find the appropriate nodes in either case, using as the query the enzyme, the gene name(s) or the GO term (or, if available, the unique GO identifier, in this case GO:0003910).
Fig. 1 Examples of Gene Ontology. Three examples illustrate the structure and style used by GO to represent the gene ontologies and to associate genes with nodes within an ontology. The ontologies are built from a structured, controlled vocabulary.
Also shown in Fig. 1 are the molecular function ontology for the MCM protein complex members that are known to regulate initiation of DNA replication in the three organisms, and a portion of the cellular component ontology for these proteins. These ontologies reflect the finding that Mcm2–7 proteins are components of the pre-replicative complex in several model organisms and also sometimes localize to the cytoplasm (ref. 30). The ontology supports both biological realities, yet the molecular functions and the biological processes of the MCM homologues are conserved.
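The one-child, several-parents arrangement described above for DNA ligation means the process ontology behaves as a directed acyclic graph rather than a strict tree. A minimal Python sketch of that idea follows. The dictionary layout and the helper names `ancestors` and `genes_for` are illustrative assumptions, not the GO project's own data format, and placing the three parent processes directly under DNA metabolism is a simplification for the example.

```python
# Child term -> parent terms. A term may list several parents, so the
# structure is a directed acyclic graph, not a tree.
# Assumption: the three parent processes sit directly under DNA metabolism.
PARENTS = {
    "DNA metabolism": [],
    "DNA replication": ["DNA metabolism"],
    "DNA repair": ["DNA metabolism"],
    "DNA recombination": ["DNA metabolism"],
    "DNA ligation": ["DNA replication", "DNA repair", "DNA recombination"],
}

# Gene product -> terms it is annotated to (only the yeast example from the text).
ANNOTATIONS = {"Cdc9p": ["DNA ligation"]}


def ancestors(term):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS[parent])
    return seen


def genes_for(term):
    """Return gene products annotated to `term` or to any of its descendants."""
    hits = set()
    for gene, terms in ANNOTATIONS.items():
        for annotated in terms:
            if annotated == term or term in ancestors(annotated):
                hits.add(gene)
    return hits
```

Under these assumptions, a query on any of the three parent processes, or on DNA metabolism itself, reaches Cdc9p through its single annotation to DNA ligation, which is the behaviour the text describes for a searcher starting from different nodes.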
The usefulness of the GO ontologies for annotation received its first major test in the annotation of the recently completed sequence of the Drosophila genome. Little human intervention was required to annotate 50% of the genes to the molecular function and biological process ontologies using the GO method. Another use for GO ontologies that is rapidly gaining acceptance is the annotation of gene-expression data, especially after these have been clustered by similarity in expression pattern (ref. 32). Clustering the results of about 100 yeast experiments (of which about half are shown; Fig. 2) grouped together a subset of genes whose names alone convey little to most biologists. When the full short GO annotations for process, molecular function and location are added, however, the biological reason for, and import of, the co-expression of these genes becomes evident.
Fig. 2 Correspondence between hierarchical clustering of expression microarray experiments and GO terms. The coloured matrix represents the results of clustering many microarray expression experiments (ref. 32). In the matrix, each row represents a yeast gene.
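The step of adding GO annotations to a cluster of co-expressed genes can be pictured as a simple tally of shared terms. The sketch below is only an illustration of that idea: the function name `summarize_cluster` is hypothetical, and the gene names and annotations in the usage note are placeholders, not data from the experiments discussed here.

```python
from collections import Counter


def summarize_cluster(cluster, process_of):
    """Count how often each process term occurs among the genes in a cluster.

    `cluster` is a list of gene names; `process_of` maps a gene name to its
    biological process term. Genes without an annotation are skipped.
    """
    return Counter(process_of[gene] for gene in cluster if gene in process_of)
```

For instance, with the placeholder mapping `{"gene_a": "DNA replication", "gene_b": "DNA replication", "gene_c": "DNA repair"}` and the cluster `["gene_a", "gene_b", "gene_c", "gene_d"]`, the tally shows DNA replication as the most frequent term, which is the kind of shared biology that gene names alone would not reveal.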
The GO project is currently using a flat file format to store the ontologies, definitions of terms and gene associations. The ontologies, gene associations, definitions and documentation are available from the GO web site (http://www.geneontology.org), which also describes the principles and objectives used by the project. The ontologies are by no means complete. They are being expanded during the association of gene products from the collaborating databases, and we expect them to continue to evolve for many years. GO requires that all gene associations to the ontologies be attributed to the literature; for each citation the type of evidence will be encoded. As of early April 2000 there were 1,923, 2,094 and 490 nodes in the process, function and component ontologies, respectively. The three organism databases have made substantial progress in linking gene products to the ontologies. Thus far the process, function and component ontologies have associations with 1,624, 1,602 and 1,577 yeast genes; 741, 2,334 and 1,061 fly genes; and 1,933, 2,896 and 1,696 mouse genes, respectively. A running table of these statistics can be found at the web site.
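The attribution rule stated above (every gene association carries a literature citation, and each citation carries a type of evidence) can be sketched as a record that refuses to exist without both. This is a minimal illustration under stated assumptions: the class name `GeneAssociation`, its field names and the example values are hypothetical and do not reproduce the project's flat file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneAssociation:
    """One link between a gene product and an ontology term (hypothetical layout)."""

    gene: str       # gene product being annotated
    term: str       # ontology term, or GO identifier, it is linked to
    reference: str  # literature citation supporting the association
    evidence: str   # type of evidence recorded for that citation

    def __post_init__(self):
        # Mirror the stated rule: no association without a citation and
        # an evidence type.
        if not self.reference.strip():
            raise ValueError("association must be attributed to the literature")
        if not self.evidence.strip():
            raise ValueError("evidence type must be recorded for the citation")
```

A call such as `GeneAssociation("Cdc9p", "DNA ligation", "REF:placeholder", "direct assay")` succeeds, where the reference and evidence strings are placeholders, while the same call with an empty reference raises `ValueError`. Enforcing the rule at construction time keeps unattributed associations out of the data set altogether.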
The GO concept is intended to make possible, in a flexible and dynamic way, the annotation of homologous gene and protein sequences in multiple organisms using a common vocabulary that results in the ability to query and retrieve genes and proteins based on their shared biology. The GO ontologies produce a controlled vocabulary that can be used for dynamic maintenance and interoperability between genome databases. The ontologies are a work in progress. They can be consulted at any time on the World-Wide Web; indeed, their availability to human and machine alike is essential to maintain their flexibility and allow their evolution along with increased understanding of the underlying biology. It is hoped that the GO concepts, especially the distinctions between biological process, molecular function and cellular component, will find favour among biologists so that we can all facilitate, in our writing as well as our thinking, the grand unification of biology that the genome sequences portend.