The GOA group is the main supplier of electronic GO annotations to the GO Consortium. The group's annotation pipelines primarily use existing cross-references, keywords and Enzyme Commission (EC) numbers in UniProtKB entries and, by means of ‘translation tables’ that map these external vocabularies to their equivalent GO terms, create GO annotations for the protein entries. These translation tables are manually curated to ensure that the resulting GO annotation set is highly accurate, and are applied to the UniProtKB database every 3 weeks in order to reflect any changes in the annotation work carried out by other annotation groups. Similar electronic methods are applied by UniProtKB/Swiss-Prot for HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes) family rules (8), and by InterPro (9) for InterPro domains: HAMAP rules or InterPro domains are assigned GO terms, and protein entries that either fall within these families or contain a mapped domain are automatically assigned the associated GO term(s).
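The translation-table approach can be illustrated with a minimal Python sketch. It is not the GOA implementation: the `Annotation` record, the `apply_translation_table` function, the entry names and the `GO_REF:PLACEHOLDER` reference are hypothetical, and the table holds a single illustrative EC-to-GO mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    accession: str   # protein entry accession
    go_id: str       # GO term assigned through the mapping
    evidence: str    # "IEA" for every electronic annotation
    reference: str   # GO Reference identifying the method used


# Hypothetical excerpt of a translation table: external term -> GO terms.
EC2GO = {
    "EC:1.1.1.1": ["GO:0004022"],  # alcohol dehydrogenase (NAD+) activity
}


def apply_translation_table(entries, table, reference):
    """Create IEA annotations for every external term that has a mapping."""
    annotations = set()  # a set drops duplicates from repeated external terms
    for accession, external_terms in entries.items():
        for term in external_terms:
            for go_id in table.get(term, []):  # unmapped terms yield nothing
                annotations.add(Annotation(accession, go_id, "IEA", reference))
    return sorted(annotations, key=lambda a: (a.accession, a.go_id))


# Made-up entries: accession -> external terms found in the entry.
entries = {
    "ENTRY_A": ["EC:1.1.1.1", "EC:1.1.1.1"],
    "ENTRY_B": ["EC:0.0.0.0"],  # no mapping, so no annotation is created
}
for annotation in apply_translation_table(entries, EC2GO, "GO_REF:PLACEHOLDER"):
    print(annotation)
```

Because annotations are derived entirely from the table, re-running the function after a curator edits a mapping regenerates the whole annotation set, which is in keeping with the periodic re-application described above.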
These four mapping methods (UniProtKB/Swiss-Prot Keywords2GO, EC2GO, HAMAP2GO and InterPro2GO) have been applied to UniProtKB by the GOA group for many years. However, they must be continually maintained and checked to ensure that, as both the external resources and GO change and as the number of proteins annotated by these methods increases, the generated GO annotation sets remain correct.
The nature of electronic annotation means that it can easily be applied to large numbers of protein entries, and because UniProtKB covers such a vast array of species, it follows that some of the less well-studied species can benefit from the addition of GO annotation to their proteins, which may well remain experimentally uncharacterized. In addition to this, there are some species which do not have a dedicated annotation effort and so may never be supplied with manual GO annotation. In UniProtKB there are currently 168 308 species (4 257 090 proteins) for which electronic annotation pipelines are the only source of GO annotation.
In this context the GOA group has recognized the need for additional automatic association of GO terms and so has expanded its provision of electronic annotation pipelines. One such development has been in collaboration with the Swiss Institute of Bioinformatics, where a new GO mapping table has been created to exploit annotations made in UniProtKB Subcellular Location annotation lines. To date, 92% of subcellular location terms from the Comment (CC) lines of UniProtKB entries have been manually mapped to GO terms, providing 587 074 new cellular component GO annotations.
The mapping of external resources to equivalent GO terms has been developed gradually, so that GOA now uses mappings from over 14 450 external terms, which together have produced more than 30 million annotations from UniProtKB over the last four years (a 9-fold increase).
GOA also introduced a complementary electronic GO annotation method in December 2006, in collaboration with the Ensembl Compara group, which applies comparative genomics data to propagate annotations to non-model organism species (7). In this pipeline, one-to-one and apparent one-to-one orthology data from Compara are used to project manually assigned GO terms from a source species (currently one of human, mouse, rat, Xenopus, Drosophila or zebrafish) into one or more target species. Currently 38 different species benefit from this annotation pipeline (including Xenopus, Macaque and Tetraodon), many of which are non-model organism species with few other annotations available. The Ensembl Compara pipeline currently produces over 147 000 GO annotations for more than 35 000 proteins (GOA release, September 2008).
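The projection step can be sketched in Python as follows. This is an illustration under stated assumptions, not the GOA or Ensembl Compara code: the function name, the tuple layouts, the protein names and the orthology type labels are all assumed, and ‘manual’ is taken here to mean any evidence code other than IEA.

```python
# Orthology types accepted for projection (label strings are assumed).
ACCEPTED_TYPES = {"ortholog_one2one", "apparent_ortholog_one2one"}


def project_annotations(manual_annotations, orthologs):
    """Copy manual GO terms from source proteins to their orthologs as IEA."""
    # Map each source protein to its target proteins, keeping accepted types only.
    targets = {}
    for source, target, orthology_type in orthologs:
        if orthology_type in ACCEPTED_TYPES:
            targets.setdefault(source, []).append(target)

    projected = set()  # a set avoids duplicate projected annotations
    for source, go_id, evidence in manual_annotations:
        if evidence == "IEA":
            continue  # electronic annotations are not projected further
        for target in targets.get(source, []):
            projected.add((target, go_id, "IEA"))
    return sorted(projected)


# Made-up data: (protein, GO term, evidence code) and (source, target, type).
manual = [("HUMAN_1", "GO:0005634", "IDA"), ("HUMAN_2", "GO:0005737", "IEA")]
orthologs = [
    ("HUMAN_1", "MACAQUE_1", "ortholog_one2one"),
    ("HUMAN_1", "TETRAODON_1", "ortholog_one2many"),  # rejected: not one-to-one
    ("HUMAN_2", "MACAQUE_2", "ortholog_one2one"),     # source has only IEA
]
print(project_annotations(manual, orthologs))
```

Restricting the projection to one-to-one relationships limits the risk of transferring a term to a paralogue whose function has diverged, at the cost of leaving some target proteins unannotated.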
Electronic annotations are updated every 3 weeks, as part of the release process, and procedures are checked automatically for obsolete GO terms or secondary protein accessions/GO identifiers on a weekly basis as well as at release time. Each electronic annotation can be identified by the ‘IEA’ (Inferred from Electronic Annotation) evidence code, and the specific method applied to generate an electronic GO annotation can be identified by its distinct ‘GO Reference’ identifier, which links to a full description of each method on the GO Consortium's web site (ftp://ftp.geneontology.org/go/doc/GO.references) (see for details on each GOA-supplied GO_REF identifier).
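A check of this kind might look like the following Python sketch. The function name, the lookup tables and every identifier are placeholders chosen for illustration, and the policy shown (replace secondary identifiers with primary ones, flag annotations to obsolete terms) is an assumption rather than a description of the GOA procedure.

```python
def check_annotations(annotations, obsolete_go, secondary_go, secondary_acc):
    """Update secondary identifiers and separate annotations to obsolete terms.

    annotations   -- list of (accession, go_id) pairs
    obsolete_go   -- set of obsolete GO identifiers
    secondary_go  -- dict: secondary GO identifier -> primary GO identifier
    secondary_acc -- dict: secondary accession -> primary accession
    """
    kept, flagged = [], []
    for accession, go_id in annotations:
        accession = secondary_acc.get(accession, accession)  # use primary accession
        go_id = secondary_go.get(go_id, go_id)               # use primary GO id
        if go_id in obsolete_go:
            flagged.append((accession, go_id))  # needs a curator's attention
        else:
            kept.append((accession, go_id))
    return kept, flagged


# Placeholder identifiers, not real GO terms or accessions.
annotations = [
    ("ACC_OLD", "GO:9999991"),
    ("ACC_2", "GO:9999992"),
    ("ACC_3", "GO:9999993"),
]
kept, flagged = check_annotations(
    annotations,
    obsolete_go={"GO:9999993"},
    secondary_go={"GO:9999992": "GO:9999994"},
    secondary_acc={"ACC_OLD": "ACC_1"},
)
print(kept)
print(flagged)
```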