The Gene Ontology (GO) project [1] is a collaborative effort among multiple groups to develop a standardized and shared approach for describing biology in a species-independent manner. The ontology itself contains over 32 000 terms describing the sub-cellular localization [Cellular Component (CC): ~3000 terms], biochemical activity [Molecular Function (MF): ~9000 terms] and participation in larger processes [Biological Process (BP): ~20 000 terms] of proteins and other gene products. Each term is defined and placed in a directed acyclic graph with relations between terms: ‘is a’ (for subclasses), ‘part of’ and ‘regulates’. For example, superoxide dismutase (SOD) proteins are annotated with the term ‘SOD activity’ (MF, GO:0004784), which is a subclass of ‘antioxidant activity’ (GO:0016209); SOD proteins are also described by the term ‘removal of superoxide radicals’ (BP, GO:0019430) and, for different members of the family, the CC terms ‘mitochondrion’ (GO:0005739) or ‘extracellular space’ (GO:0005615). For a recent review of GO, see du Plessis et al., 2011 [3]. The GO database contains nearly 3 million annotations to over 466 000 proteins. (In this article, we will generally refer to gene products simply as ‘proteins’, although the overwhelming majority of statements also apply to the various types of RNA gene products and protein complexes.)
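To make the graph structure concrete, the following minimal sketch shows how an annotation to a specific term implies annotation to its more general ‘is a’ parents. The `GO_IS_A` dictionary and the function names are hypothetical, introduced only for illustration; the single edge it contains is the SOD example from the text, and real GO data would be loaded from the ontology files rather than written by hand.

```python
# Toy fragment of the GO graph: child term -> list of 'is a' parent terms.
# Hypothetical hand-written structure; only the edge below is taken from the text.
GO_IS_A = {
    "GO:0004784": ["GO:0016209"],  # 'SOD activity' is a 'antioxidant activity'
    "GO:0016209": [],              # parents omitted in this toy fragment
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following 'is a' edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def implied_terms(direct_terms, is_a):
    """Direct annotations plus all terms they imply through 'is a'."""
    implied = set(direct_terms)
    for term in direct_terms:
        implied |= ancestors(term, is_a)
    return implied


sod_terms = implied_terms({"GO:0004784"}, GO_IS_A)
```

In this sketch only ‘is a’ edges are followed; handling ‘part of’ and ‘regulates’ would need relation-specific rules that are not covered here.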
GO annotations are assigned using one of two general approaches: from direct experimental results or by sequence analysis. In the experiment-based approach, biocurators make annotations that record the results of experimental work published in the biomedical literature. There are 375 000 experiment-based annotations in the GO database, covering more than 81 000 proteins. While these annotations describe proteins from over 900 different species, most of the data come from a small number of well-studied model organisms. As shown in the table below, only 20 species have more than 1000 experiment-based GO annotations. The second annotation approach, sequence-based, uses bioinformatics techniques to infer a likely function for uncharacterized proteins from sequence characteristics. These can include short sequence motifs that can arise by either convergent or divergent evolution (e.g. mitochondrial targeting sequences or helical transmembrane domains), or long regions of sequence similarity between two proteins that can only be reasonably explained by divergence from a common ancestor (homology).
Species with more than 1000 experimentally based annotations (evidence codes: EXP, IDA, IEP, IMP, IGI and IPI)
The overwhelming majority of sequences in public databases remain experimentally uncharacterized, and their share is growing rapidly as modern sequencing technologies make new sequences easy to obtain. To give a rough idea of the disparity between characterized and uncharacterized sequences, there are ~15 million protein sequences in the UniProt database that are candidates for annotation, while, as previously noted, only 81 000 (0.3%) have been annotated with a GO term based on experimental evidence. It is therefore indispensable to develop powerful and reliable methods for predicting protein function.
The GO Consortium coordinates an effort to maximize the utility of a large and representative set of key genomes, which we refer to as reference genomes. The Reference Genome project has two aspects: (i) to encourage complete and precise annotations of the proteins for the species widely used as model organisms; and (ii) to provide inferred annotations for proteins for which no experimental data are available [4]. We describe here the homology-based method and software we have developed to achieve those goals.
Function inference by homology: theory and implementation in PAINT
Our method starts by treating each gene function (in this case, a GO term, or group of related terms) as a ‘character’, in the standard sense used for evolutionary inference [5]. These functional characters are not used to reconstruct the phylogeny of each gene family (amino acid or nucleotide sequence characters are used for that purpose as described above). Rather, given the phylogeny, and the known functions of some subset of the extant genes (leaves of the tree), the goal is to reconstruct the functional evolution events (e.g. gain, loss and inheritance) that most likely led to the functions observed in extant sequences. We have developed a software application, called Phylogenetic Annotation and Inference Tool (PAINT), which allows a biocurator to implement this explicit phylogenetic paradigm. In PAINT, gain and loss events are represented as annotations of ancestral nodes in the phylogenetic tree. Inheritance of an annotation from each ancestor to its descendants is then automatically inferred to occur unless stopped by an explicit annotation of a loss event. This inheritance enables the inference of GO annotations for extant sequences that have not been characterized experimentally. In short, our process represents homology inference in terms of a gene family-specific model of the evolution of function within that family.
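The inheritance rule described above can be illustrated with a minimal sketch. This is not PAINT's implementation: the `Node` class, the `propagate` function and the toy family tree are hypothetical, and the GO identifiers are reused from the SOD example purely as labels. The sketch assumes that each node records the terms gained or lost on the branch leading to it, and that every term is passed on to all descendants unless a loss is annotated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in a gene family tree (hypothetical structure, not PAINT's)."""
    name: str
    children: list = field(default_factory=list)
    gains: set = field(default_factory=set)    # terms gained on the branch leading here
    losses: set = field(default_factory=set)   # terms lost on the branch leading here


def propagate(node, inherited=frozenset(), out=None):
    """Pass terms from ancestors to descendants; return {leaf name: set of terms}."""
    if out is None:
        out = {}
    # Inherit everything from the parent, remove annotated losses, add annotated gains.
    current = (set(inherited) - node.losses) | node.gains
    if not node.children:
        out[node.name] = current
    for child in node.children:
        propagate(child, frozenset(current), out)
    return out


# Toy family: the root gains a function; after a duplication one copy loses it,
# while a third lineage gains an additional term.
leaf_a = Node("leafA")
leaf_b = Node("leafB", losses={"GO:0004784"})
duplication = Node("duplication", children=[leaf_a, leaf_b])
leaf_c = Node("leafC", gains={"GO:0005739"})
root = Node("root", children=[duplication, leaf_c], gains={"GO:0004784"})

result = propagate(root)
```

Here `leafA` and `leafC` receive GO:0004784 by inheritance without any direct annotation, while the explicit loss on `leafB` stops the inheritance for that lineage only. Gains and losses are recorded independently, so a lineage can gain one term without losing another.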
Our general approach is similar to the ‘phylogenomic’ method proposed by Eisen [6] and further developed into a probabilistic form by Engelhardt et al., but with important differences. Eisen proposed a conceptual approach for predicting protein function using a phylogenetic tree together with available experimental knowledge of proteins. The original approach relied on manual curation to identify gene duplication events and to find and assimilate the literature for characterized members of the family. Engelhardt et al. used automated reconciliation with the species tree [8] to identify gene duplication events, and experimental GO terms (MF only) to capture the experimental literature. Using this information, they defined a probabilistic model of evolution of MF involving transitions between different molecular functions.
From these previous studies, we adopt the basic approach of function evolution through a phylogenetic tree and the use of GO annotations to represent function. However, unlike these other phylogenomic methods, we represent the evolution in terms of discrete gain and loss events. In Eisen's original model, an annotation does not necessarily represent a gain of function (it could have been inherited from an earlier ancestor), and losses are not explicitly annotated. The transition-based model of Engelhardt et al. assumes replacement of one function by another (gain of one function coupled to the loss of another), and does not capture uncoupled events, which is particularly important for BP annotations and cases where a protein has multiple molecular functions (see examples below). In addition, we make no a priori assumptions about conservation of function within versus between orthologous groups, or about the relationship between evolutionary distance and functional conservation (as the distance may not necessarily reflect every given function). While, as described below, gene duplication events and relatively long tree branches are important clues for curators to locate functional divergence (gain and/or loss), in our paradigm an ancestral function can be inherited by both descendants following a duplication (resulting in paralogs with the same function) or gained/lost by one descendant following a speciation event (resulting in orthologs with different functions). Evolution of each function is evaluated on a case-by-case basis, using many different sources of information about a given protein family.