In the original S. cerevisiae genomic annotation (c. 1993–1996), protein-encoding genes were simply annotated as the longest possible open reading frame (ORF) of 100 or more codons. These annotations have now been subjected to a decade of testing by thousands of scientists worldwide, using a wide range of experimental and comparative methods. In particular, the genome-wide comparisons published by Brachat et al. (2003), Cliften et al. (2003), and Kellis et al. (2003) provided an excellent opportunity to review the entire S. cerevisiae gene model, both in sequence and interpretation. In these studies, the sequenced species were closely enough related to S. cerevisiae to allow the expectation of very close conservation of ORF size, location and intron/exon structure. Not surprisingly, there have been many suggested changes: new ORFs have been identified, and existing ORFs have been ‘removed’ and revised (Figure 1).
Figure 1. Sequence annotation changes since 1996. The arrow symbol represents the location of an indel [inserted or deleted nucleotide(s)]. Only the most common types of change have been diagrammed. (A) New ORF addition: 523 small ORFs have been added as a result of …
Most newly identified ORFs have been smaller than 100 codons. This is because the S. cerevisiae genome sequencing project did not annotate ORFs of fewer than 100 codons unless they had significant sequence similarity to a previously identified gene. This approach was necessary because ORFs of this size are very likely to be just fortuitous sequences of nucleotides: only 342 (about 2%) of the 15 000 ORFs in the genome between 50 and 99 codons in length are currently thought to encode proteins within the yeast cell. As a consequence, any ORF under 100 codons is treated as spurious until proved otherwise through either experimental or comparative work.
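The length-based rule described above (take the longest possible ORF and keep it only if it reaches 100 codons) can be illustrated with a minimal sketch. This is not the code used by the sequencing project. The function names are hypothetical, the standard stop codons are assumed, ORFs are assumed to start at the first in-frame ATG after the preceding stop, the codon count is assumed to exclude the stop codon, and introns are ignored.

```python
# Minimal sketch of a length-threshold ORF scan (illustrative assumptions only).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan_strand(seq, min_codons=100):
    """Scan the three frames of one strand.

    Returns (start, end, n_codons) tuples, where start is the first in-frame
    ATG after the previous stop (the longest possible ORF), end is the index
    just past the stop codon, and n_codons excludes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first ATG in this stop-delimited stretch
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None  # look for the next ORF after this stop
    return orfs


def scan_both_strands(seq, min_codons=100):
    """Scan both strands; coordinates are relative to the strand scanned."""
    hits = [("+",) + orf for orf in scan_strand(seq, min_codons)]
    hits += [("-",) + orf
             for orf in scan_strand(reverse_complement(seq), min_codons)]
    return hits
```

Under these assumptions, an ORF of 99 codons is skipped at the default threshold and is reported only if `min_codons` is lowered, which is the situation of the small ORFs discussed above.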
However, length alone does not guarantee that an ORF is genuine, and the total number of biologically significant S. cerevisiae ORFs has been the subject of debate since the completion of the genomic sequence (Termier and Kalogeropoulos, 1996; Zhang and Wang, 2000; Malpertuy et al., 2000; Wood et al., 2001; Mackiewicz et al., 2002; Brachat et al., 2003; Cliften et al., 2003; Kellis et al., 2003). At the heart of this debate is the basic principle that it is virtually impossible to demonstrate experimentally that an ORF is nonfunctional; there is always a chance that a suspect ORF encodes a protein that is of extremely low abundance or is produced only under some specific environmental condition. Fortunately, the availability of genomic sequences from other fungi provides a positive test for the relevance of experimentally uncharacterized ORFs: evolutionary conservation among very closely related species. This has allowed significant ORFs to be separated from those that are likely to be spurious.
Even many bona fide ORFs have required updating. Revisions of ORF annotation fall into two major categories: those in which the nucleotide sequence is corrected, and those in which the nucleotide sequence remains the same but its interpretation is altered. Changes in the first category often affect the start codon, stop codon, reading frame or coding sequence for that ORF, while changes in the second category include annotation of different start codons and intron/exon structure.
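The distinction between the two categories can be made concrete with a small sketch. The `OrfAnnotation` record and the `classify_revision` function are hypothetical and do not reflect SGD's data model. The sketch assumes each annotation stores the genomic sequence of the region containing the ORF, together with the exon coordinates interpreted on that sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrfAnnotation:
    """Hypothetical record for one version of an ORF annotation."""
    region_sequence: str  # genomic sequence of the region containing the ORF
    exons: tuple          # ((start, end), ...) coordinates within the region


def classify_revision(old, new):
    """Assign a revision to one of the two categories described above."""
    if old.region_sequence != new.region_sequence:
        # The underlying nucleotides changed (for example, an indel fixed).
        return "sequence correction"
    if old.exons != new.exons:
        # Same nucleotides, but a different start codon or intron/exon structure.
        return "reinterpretation"
    return "unchanged"
```

For example, moving the annotated start codon on an unchanged sequence is classed as a reinterpretation, whereas inserting a missing nucleotide is classed as a sequence correction even if the coordinates also shift as a result.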
Although automated data processing is an important element in the process of revising and updating genomic sequence annotation, human evaluation is also essential. In making any changes to the genome sequence, SGD curators evaluate and synthesize all available types of evidence, including that generated by individual gene-specific experiments, by large-scale analyses and by cross-species comparisons.
Because SGD strives to provide rapid access to new information, individual updates are integrated into the genome sequence and released to the community as soon as possible. As a result, genome updates have been made gradually and released continually, rather than as rare scheduled updates encompassing multiple changes. While this approach provides the fastest means of disseminating the updates, alerting the research community to the changes has proven to be a continuing challenge. Here, we describe the types of changes that have been incorporated into the S. cerevisiae genome annotation, how SGD handles each type of change and how the research community can access the updated information.