Manual GO annotation generates high-quality, reliable information that is more accurate than electronic annotation. It also allows comparisons to be made with new annotation approaches and is an important tool for validating these methods. However, manual annotation is time-consuming and depends on skilled biologists capable of extracting key information from the published literature.
In view of this, the bioinformatics community has placed greater emphasis on the development of new automatic annotation techniques, such as automated information extraction and the conversion of this knowledge into the GO vocabulary (9). This has resulted in a variety of GO prediction servers with varying abilities to interpret accurately the subtleties of scientific natural language as well as GO structure, mappings and annotation styles (see GO Tools list on GO home page). To assess these information extraction techniques and allow users to apply the methods judiciously, the BioLINK group (http://www.pdg.cnb.uam.es/BioLINK/) organized the BioCreative (Critical Assessment of Information Extraction systems in Biology) competition. In collaboration with BioLINK, GOA provided one of the gold standard training and test sets of GO annotation. UniProt curators also took part in manual verification of GO terms mined from the literature, which were corroborated by evidence from the text. In spring 2004, the results of the competition will be announced, and it is anticipated that those techniques with the potential to assist GOA in the accurate prediction of deep-level GO terms will be highlighted and may supersede current electronic strategies.
The core objective of GOA is to provide high-quality GO annotation to supplement and improve the interoperability and querying of external and in-house knowledgebases. Increasingly, it is being used to predict the function and biological roles of new gene products, and to reclassify and model relationships between known proteins (19). For example, genes whose products are involved in the same biological process are likely to be regulated in a coordinated manner. For these reasons, public repositories of microarray data such as ArrayExpress (21) actively encourage users to provide GO annotation in the array design, to facilitate clustering of data according to the GO ontologies and to allow cross-platform data comparisons. GO annotation provides a link between biological knowledge and gene expression profiles. When combined with statistical analysis, the GOA data set is a useful resource for building pathways and can help facilitate microarray probe design to create more focused arrays (e.g. neurofunction arrays) (22). It is hoped that this use of GOA will assist researchers in quickly identifying new proteins of pharmaceutical interest based on their functional similarities. In addition to microarray data analysis tools, GOA is also incorporated into evolutionary studies, particularly when correlating structure–function relationships (24). These examples help to demonstrate the potential that exists for the use of GOA in the development and validation of tools that try to predict GO annotation accurately.
GOA can also be used to answer specific biological problems. As GO represents a universal set of curated keywords, many users wish to retrieve all possible annotations to a high-level GO term in a candidate-based approach. According to GO philosophy, every child term inherits the meaning of all of its parent terms. As such, every annotation to a child term should be true for every parent of that child; this is called the ‘true path rule’. A user who wanted to analyse all proteins involved in the process of transcription would therefore have to retrieve all proteins annotated to the GO term for ‘transcription’ (GO:0006350) and to all of its child terms. Retrieving the annotations to both the parent GO term and its children is possible via SRS (25) but requires prior knowledge of this powerful retrieval system.
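The retrieval implied by the true path rule can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the SRS query itself: the child-to-parent table, the protein labels and the identifiers `GO:child1`, `GO:child2` and `GO:unrelated` are hypothetical placeholders, and only GO:0006350 is taken from the text.

```python
from collections import defaultdict

# Hypothetical toy ontology: each child term maps to its parent terms.
# Only GO:0006350 (transcription) comes from the text; the rest are placeholders.
PARENTS = {
    "GO:child1": ["GO:0006350"],
    "GO:child2": ["GO:child1"],
    "GO:unrelated": ["GO:otherroot"],
}

# Hypothetical (protein, GO term) annotation pairs.
ANNOTATIONS = [
    ("P1", "GO:0006350"),
    ("P2", "GO:child2"),
    ("P3", "GO:unrelated"),
]


def descendants(term, parents):
    """Return every term that has `term` as a direct or indirect parent."""
    # Invert the child -> parents table into parent -> children.
    children = defaultdict(set)
    for child, parent_list in parents.items():
        for parent in parent_list:
            children[parent].add(child)

    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def proteins_for_term(term, parents, annotations):
    """Proteins annotated to `term` or to any of its descendants."""
    wanted = descendants(term, parents) | {term}
    return {protein for protein, go_id in annotations if go_id in wanted}


transcription_proteins = proteins_for_term("GO:0006350", PARENTS, ANNOTATIONS)
```

Because the ontology is a graph in which a term may have several parents, the traversal records visited terms so that each child is expanded only once.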
Another way of performing the query is to use the protein assignments to a set of GO-slim terms. Essentially, GO-slim is a list of high-level GO terms that cover the main aspects of each of the three GO ontologies. As each community has different needs, a variety of GO-slim files have been archived on the GO home page by Consortium members (ftp://ftp.geneontology.org/pub/go/GO_slims/). GOA has created its own GO-slim (goslim_goa.2002) to summarize the GO annotation of each completed proteome on the Proteome Analysis pages (26, Table ). As an additional service, this mapping of GO annotation to both the GOA GO-slim (goslim_goa.2002) and a generic set of GO-slim terms (generic.0208) is available for download on the EBI FTP site (Table ). From there, users can download all possible annotations to the GO-slim term for transcription (GO:0006350). Users wishing to use a different set of GO-slim terms are advised to use the map2slim.pl script archived on the Berkeley Drosophila Genome Project (BDGP) home page (http://www.fruitfly.org/developers/src/go.dev/apps/query-utils/). This script uses the GO MySQL database and requires prior knowledge of the Perl API.
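The idea behind mapping annotations onto a GO-slim can be illustrated with a short Python sketch. It is not a reimplementation of map2slim.pl, whose exact rules are not described here; it simply assumes that each annotation is assigned to every slim term found among the annotated term and its ancestors. The child-to-parent table, the protein labels and all identifiers other than GO:0006350 are hypothetical placeholders.

```python
# Hypothetical toy ontology: each child term maps to its parent terms.
PARENTS = {
    "GO:child1": ["GO:0006350"],
    "GO:child2": ["GO:child1"],
}

# Hypothetical GO-slim containing a single high-level term from the text.
SLIM_TERMS = {"GO:0006350"}

# Hypothetical (protein, GO term) annotation pairs.
ANNOTATIONS = [
    ("P1", "GO:child2"),
    ("P2", "GO:unrelated"),
]


def ancestors(term, parents):
    """Return every direct or indirect parent of `term`."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def map_to_slim(annotations, parents, slim_terms):
    """Map each annotation to the slim terms at or above the annotated term."""
    slim_annotations = set()
    for protein, go_id in annotations:
        lineage = ancestors(go_id, parents) | {go_id}
        for slim_term in lineage & slim_terms:
            slim_annotations.add((protein, slim_term))
    return slim_annotations


slim_annotations = map_to_slim(ANNOTATIONS, PARENTS, SLIM_TERMS)
```

In this sketch, annotations whose lineage contains no slim term are simply dropped; a production tool would need an explicit policy for such cases.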