The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current.
Since 2001, the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) has used the Gene Ontology (GO) to annotate gene products in the budding yeast Saccharomyces cerevisiae (1,2). GO consists of three sets of structured, controlled vocabularies, also known as ontologies: the Molecular Function ontology describes the activities of gene products; the Biological Process ontology places molecular functions in a biological context; and the Cellular Component ontology describes the subcellular localizations of gene products (3). The selection of a GO term from one of these ontologies to annotate a gene product must be supported by a reference, such as a peer-reviewed research article or an abstract, as well as by an evidence code that describes the type of evidence present in that reference (4).
At SGD, results from traditional experimental methods published in the scientific literature are the primary sources of evidence used to support the GO annotation of gene products. If no experimental data are available for a gene, it is annotated to the terms ‘biological_process’, ‘molecular_function’ or ‘cellular_component’ (the root terms of the three ontologies) with the evidence code ‘ND’ to indicate there are ‘No Biological Data Available’. While this does not describe the biology of the gene product, it indicates that no experimental results are available in the published literature at the time of annotation (Table 1). Using this curatorial process, every S. cerevisiae gene product has been assigned at least one GO term in each of the three ontologies since 2003.
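The fallback rule described above can be sketched in a few lines of Python. This is a minimal illustration, not SGD's curation software: the function name `fill_with_nd`, the dictionary layout and the one-letter aspect keys (P, F, C) are assumptions made for the example, while the root term names and the 'ND' code come from the text.

```python
# Root terms of the three ontologies, as named in the text.
# The one-letter aspect keys are an assumption made for this sketch.
ROOT_TERMS = {
    "P": "biological_process",
    "F": "molecular_function",
    "C": "cellular_component",
}


def fill_with_nd(annotations):
    """Return a copy in which every aspect has at least one annotation.

    `annotations` maps an aspect key to a list of (term, evidence_code)
    tuples (a hypothetical layout). Aspects with no annotation receive
    the root term of that ontology with the evidence code 'ND'.
    """
    completed = {aspect: list(annotations.get(aspect, [])) for aspect in ROOT_TERMS}
    for aspect, root_term in ROOT_TERMS.items():
        if not completed[aspect]:
            completed[aspect].append((root_term, "ND"))
    return completed


# A gene product with only a Cellular Component annotation ('IDA' is used
# here purely as an example of an experimental evidence code).
example = fill_with_nd({"C": [("nucleus", "IDA")]})
```

Applied to a gene product with only a Cellular Component annotation, the sketch leaves that annotation untouched and adds root-term 'ND' entries for the other two ontologies, which is how every gene product ends up with at least one term per ontology.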
In recent years, results from comparative sequence and genomic studies, as well as analyses of functional genomic and proteomic data, have provided valuable insights into the biological roles of gene products, especially when data from traditional experimental approaches are unavailable (5,6). In order to provide greater access to these results, SGD now incorporates these data as GO annotations. Because the process of assigning GO annotations from high-throughput experimental data and computational predictions differs from the process of assigning annotations from traditional experimental studies, GO annotations in SGD are now distinguished by their annotation method.
Traditional experimental methods, focusing on in-depth characterization of small numbers of genes, have been and will continue to be the primary source of evidence for GO annotations. However, modern techniques allow experiments to be designed on a genome-wide scale, generating data for large numbers of genes. SGD now assigns GO annotations based on data from such high-throughput experiments. These data sources have been particularly valuable in providing a nearly comprehensive set of Cellular Component GO annotations: from the GO annotation summary on SGD's Genome Snapshot, 5474 of 6301 gene products have been assigned at least one Cellular Component GO term as of September 2007, and 2238 of these are supported by data from high-throughput methods (7–9).
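To put the Genome Snapshot counts quoted above in proportion, the short calculation below converts them to percentages. Only the three counts come from the text; the variable and function names are illustrative.

```python
# Counts quoted in the text (Genome Snapshot, September 2007).
total_gene_products = 6301
with_component_term = 5474
supported_by_high_throughput = 2238


def percent(part, whole):
    """Return `part` as a percentage of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of all gene products with at least one Cellular Component term.
coverage = percent(with_component_term, total_gene_products)

# Share of those annotated gene products supported by high-throughput data.
high_throughput_share = percent(supported_by_high_throughput, with_component_term)

print(f"Cellular Component coverage: {coverage}%")
print(f"Supported by high-throughput data: {high_throughput_share}%")
```

In other words, roughly 87% of gene products carry a Cellular Component term, and about 41% of those are supported by high-throughput data.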
In addition to data from high-throughput experimental methods, GO annotations can also be generated by computational analyses. For example, the Gene Ontology Annotation Project generates computationally predicted GO annotations for UniProt proteins based on sequence similarity algorithms (GOA UniProt; http://www.ebi.ac.uk/GOA/) (10,11). In order to provide greater access to these predictions, GOA UniProt annotations are now incorporated into SGD. Because these computationally predicted GO annotations are added without being reviewed in the context of literature-based GO annotations, they retain the ‘Inferred from Electronic Annotation’ (‘IEA’) evidence code assigned by GOA UniProt (Table 1).
Note that GOA UniProt also compiles literature-based GO annotations from many data sources (10). These annotations are also available at SGD, along with their original evidence codes and data sources, but are reviewed for redundancy with current SGD GO annotations before being incorporated (Table 1).
In addition to GO annotations derived from the manual curation of traditional experimental approaches published in the literature, SGD now contains GO annotations derived from data from high-throughput experiments as well as computational predictions provided by GOA UniProt, creating a central repository for all S. cerevisiae GO annotations. Although all of these annotations are supported by references and evidence codes, the basis for any differences among the GO annotations for any given gene may not be immediately clear. The curation process used for assigning GO annotations from these data varies according to the experimental approach. Therefore, in order to indicate how the data were curated, and to facilitate identification and comparison of these annotations, each GO annotation is now categorized in one of three annotation methods: manually curated, high-throughput or computational (Table 1).
The manually curated method indicates that the evidence in a publication has been individually reviewed to generate an annotation. Types of evidence can include experimental results in published literature that focuses on single genes or small sets of genes, author statements in a publication and sequence similarities that have been analyzed by the authors [for examples, see (12,13), shown in Figure 1B].
The high-throughput method indicates that, although the evidence for a subset of results from a high-throughput or genome-wide experimental approach may have been reviewed, results for each gene product in the dataset have not been individually reviewed. Generally, this annotation method includes data from experimental approaches in which all significant results were produced using the same condition or analysis [for examples, see (7,8)].
In contrast, annotations generated by the computational method are not supported by direct experimental evidence and are not individually reviewed. These annotations include predictions generated by sequence similarity algorithms or by the integrated computational analyses of different sets of high-throughput experimental data that have not been individually reviewed [for examples, see (11,14–17)].
All literature-based GO annotations from SGD and GOA UniProt are classified either as manually curated or high-throughput. Computational predictions provided by GOA UniProt are classified as computational (Table 1).
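The three-way classification can be summarized as a small decision rule. The sketch below is a simplification under stated assumptions: the `Annotation` record and its `individually_reviewed` flag are hypothetical, and 'IEA' is used as the only marker of a computational prediction, which matches the GOA UniProt predictions described above but would not capture computational analyses that carry other evidence codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one GO annotation (field names are illustrative)."""
    gene: str
    go_term: str
    evidence_code: str
    assigned_by: str
    individually_reviewed: bool  # assumed flag: was this gene's result reviewed on its own?


def annotation_method(annotation):
    """Return one of the three annotation methods named in the text."""
    # Simplification: only 'IEA' is treated as a computational prediction.
    if annotation.evidence_code == "IEA":
        return "computational"
    # Literature-based annotations split on whether each result was reviewed.
    if annotation.individually_reviewed:
        return "manually curated"
    return "high-throughput"


# Invented example records for a placeholder gene.
examples = [
    Annotation("GENE_A", "nucleus", "IDA", "SGD", True),
    Annotation("GENE_A", "nucleus", "IDA", "SGD", False),
    Annotation("GENE_A", "nucleus", "IEA", "UniProt", False),
]
methods = [annotation_method(a) for a in examples]
```

Keeping the data source (`assigned_by`) separate from the method reflects the point made in the text: the source says who made the annotation, while the method says how the evidence was handled.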
SGD has changed several web interfaces in order to display data sources and annotation methods. The Locus Summary lists each manually curated and high-throughput GO annotation and indicates when computational GO annotations are available (Figure 1A). The phrases ‘All GO Evidence and References’ and ‘View Computational GO annotations’ are both hyperlinked to a detailed Gene Ontology Annotations page, which is subdivided into sections according to each annotation method. Because annotations no longer come solely from SGD, an ‘Assigned by’ column now indicates the data source (Figure 1B).
From the Locus Summary and GO Annotations pages, each GO term is hyperlinked to its GO Term page, which now lists all annotation methods used to generate that annotation for a particular gene. Annotations may be downloaded, according to annotation method, from the summary table at the top of the page (Figure 1C).
To ensure that data analyzed at SGD or by others in the scientific community are based on GO annotations supported by evidence in the published literature, only manually curated and high-throughput GO annotations are publicly available from the GO Consortium (http://www.geneontology.org/GO.current.annotations.shtml). They are also the default annotation sets used for SGD's GO Term Finder (http://www.yeastgenome.org/TermFinder) and GO Slim Mapper (http://www.yeastgenome.org/SlimMapper).
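For readers who assemble their own annotation sets, a comparable literature-based subset can be approximated by dropping electronically inferred rows. The sketch below assumes the annotations are held in the GO Consortium's tab-delimited annotation file layout, in which comment lines begin with `!` and the evidence code is the seventh column; the function name and the sample rows are invented for illustration, and the rows are truncated to their first nine columns for brevity.

```python
EVIDENCE_CODE_INDEX = 6  # evidence code is the 7th tab-separated column (0-based index 6)


def drop_electronic_annotations(gaf_text):
    """Return the data rows of an annotation file whose evidence code is not 'IEA'.

    Comment lines (starting with '!') and blank lines are skipped.
    Each kept row is returned as a list of its tab-separated fields.
    """
    kept = []
    for line in gaf_text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.split("\t")
        if len(fields) > EVIDENCE_CODE_INDEX and fields[EVIDENCE_CODE_INDEX] != "IEA":
            kept.append(fields)
    return kept


# Invented sample: identifiers and references are placeholders, not real records.
sample = "\n".join([
    "!gaf-version: 2.0",
    "\t".join(["SGD", "ID_A", "GENE_A", "", "GO:0005634", "REF:1", "IDA", "", "C"]),
    "\t".join(["SGD", "ID_B", "GENE_B", "", "GO:0005634", "REF:2", "IEA", "", "C"]),
])
kept_rows = drop_electronic_annotations(sample)
```

Filtering on 'IEA' alone is only a proxy for the annotation method: it removes the GOA UniProt predictions described in the text but would not remove computational annotations that carry a different evidence code.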
SGD will continue to update manually curated GO annotations as new experimental data are published and will add more sources of high-throughput and computational GO annotations. Discrepancies between annotations may become evident as GO annotations are made from different data sources and annotation methods. These differences can help refine GO and individual annotations by indicating areas in the ontology that require modification and gene products whose annotations need to be reviewed and updated to reflect the current literature. SGD will use this method of comparison to identify under-annotated gene products and areas in the GO structure that need to be reviewed.
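One simple way to picture this comparison is as a set difference between the terms predicted for a gene and the terms already curated for it. The sketch below is an assumption-laden illustration, not SGD's procedure: the function name and input layout are hypothetical, and it compares term names literally, whereas a real comparison would also need to consult the ontology structure so that a predicted parent of an already curated term is not reported as a discrepancy.

```python
# Root terms carry no biological information, so they are ignored here.
ROOT_TERMS = {"biological_process", "molecular_function", "cellular_component"}


def review_candidates(manual, computational):
    """Return genes whose computational terms are absent from their manual terms.

    Both arguments map a gene name to a set of GO term names (hypothetical
    layout). The result maps each flagged gene to a sorted list of the
    predicted terms that have no literal match among its curated terms.
    """
    candidates = {}
    for gene, predicted in computational.items():
        curated = manual.get(gene, set())
        unmatched = predicted - curated - ROOT_TERMS
        if unmatched:
            candidates[gene] = sorted(unmatched)
    return candidates


# Invented example: GENE_B has only a root-term annotation in the manual set.
manual = {
    "GENE_A": {"nucleus", "DNA binding"},
    "GENE_B": {"molecular_function"},
}
computational = {
    "GENE_A": {"nucleus"},
    "GENE_B": {"kinase activity"},
}
flagged = review_candidates(manual, computational)
```

In this example only the second gene is flagged, because its sole curated term is a root term while a prediction suggests a specific function; such a gene would be a natural candidate for a fresh look at the literature.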
The incorporation of annotations from additional data sources makes SGD a central source for S. cerevisiae GO annotations. Differentiating these annotations by annotation method distinguishes what has been experimentally determined for each gene from what has only been computationally predicted. This knowledge will spur experimental research by contributing valuable information for genes that have not been experimentally characterized, and by suggesting additional roles for others (6).
SGD is committed to maintaining high-quality GO annotations and welcomes all comments or questions. Please contact us at: yeast-curator@genome.stanford.edu.
The SGD project is supported by a P41 grant from the NHGRI, HG001315 (J.M.C.), and through the GO Consortium P41 grant from the NHGRI, HG002273 (co-PI J.M.C.). Funding to pay the Open Access publication charges for this article was provided by the National Human Genome Research Institute.
Conflict of interest statement. None declared.