In the lexicon of genomics, an annotation is any feature tied to the genomic DNA sequence, for example, a protein-coding gene model, a transposon, or a non-protein-coding RNA gene. Adding such annotations to the sequence of a genome in a rigorous and consistent way is a prerequisite for the efficient use of that sequence in biological research. Learning how to identify, display, query, and interpret genome features in well-characterized model organisms like the fruit fly, Drosophila melanogaster, is crucial to understanding the genomes of more complex organisms, including Homo sapiens.
A major long-term goal of the FlyBase [1] annotation project is to overlay the Drosophila melanogaster genomic sequence with all available biological information and to provide traceable evidence for every annotation in a publicly accessible database. In this paper, we provide a description of our most recent step toward these goals.
In March 2000, a collaborative group including Celera Genomics, the Berkeley and European Drosophila Genome Projects (BDGP and EDGP), and a number of additional Drosophila experts published the annotated, nearly finished genomic sequence of the fruit fly [3]. This annotated sequence was called Release 1, in anticipation of future changes to the sequence and annotations. At that time, the annotation of genes relied heavily on computational gene-prediction algorithms with only limited human curation. The BDGP provided approximately 80,000 expressed sequence tags (ESTs), mostly from the 5' ends of genes, which were used in the computational analyses of the genome [5]. Because these ESTs were derived from non-normalized cDNA libraries and were limited in number, they corresponded to only about 40% of all genes in the genome [5]. Complete or nearly complete sequences for an overlapping set of approximately 2,500 known Drosophila genes in GenBank/EMBL/DDBJ were also available [3]. Owing to the nature of whole-genome shotgun (WGS) assembly, the 1,630 gaps present in the genome tended to occur at the sites of repetitive sequence [3]; gaps corresponding to transposable elements were filled with composite sequences (reflecting sequence reads from throughout the genome) rather than the actual sequence. Release 1 predicted 13,601 protein-coding genes, encoding 14,080 transcripts; each gene was assigned a unique CG identifier. The coordinates and predicted sequences of the annotations, although not the evidence for the predictions, were made available to GenBank/EMBL/DDBJ [6] and FlyBase, the public databases charged with making these annotations accessible to the research community. In FlyBase, the annotations were made available as part of the genome annotation database, Gadfly [12].
Release 2, a collaborative effort between Celera Genomics and the BDGP, was submitted to GenBank/EMBL/DDBJ and FlyBase in October 2000, after approximately 330 of the gaps in the Release 1 sequence had been filled. Changes to the annotations were based largely on approximately 6,000 new 3' ESTs sequenced by the BDGP, which increased the number of genes with 3' UTRs and allowed further refinement in gene structures. Sequences of transposable elements remained inaccurate, being based on composite sequences. In all, 748 transcripts were modified, 114 transcripts were deleted, and 336 transcripts were added. Release 2 predicted 13,474 protein-coding genes, encoding 14,335 polypeptides, of which 13,218 (92%) were unchanged relative to Release 1. Thus, the change from Release 1 to Release 2 was minimal.
Inaccuracies in the Release 1 and 2 predicted gene structures resulted mainly from computationally predicted annotations that lacked supporting cDNA data. In addition, the annotation was carried out rapidly by a large and diverse group of curators. Mistakes in the annotation of more than 1,000 genes were reported to FlyBase in error reports from the community, and over 1,000 discrepancies between the translated annotations and those in the curated protein database SWISS-PROT [13] were reported by Karlin et al. Finally, a report of 1,042 new predicted annotations that did not match any of the original 13,601 predicted genes [15], and another based on analysis of testes cDNA sequences [16], suggested that the initial annotation may have missed a substantial number of genes.
The D. melanogaster 116.8 megabase (Mb) euchromatic genomic sequence has now been finished to high quality [17]. Here we report the results of the re-evaluation of previous annotations in light of the finished euchromatic genome and considerable additional experimental data. We call this sequence and new annotation set Release 3.
To support this re-annotation effort, a computational 'pipeline' was created, and the results were stored in a new Gadfly database, so that evidence for the annotations can be tracked and queried by the public [12]. To identify new features in the genome, we utilized prediction software and annotated alignments of non-protein-coding genes, transposons [18], and pseudogenes. To improve the extent and consistency of human curation, a small group of expert FlyBase curators visually inspected each gene in the entire euchromatic sequence, using defined rules to integrate computational analyses, cDNA data and protein alignments into updated annotations. To assess the accuracy of the exon-intron structures, we compared the resulting annotations to the subset of curated peptides in SWISS-PROT and TrEMBL that are based on experimental evidence [12].
The annotations in Release 3 alter the majority (85%) of gene models, yet confirm that previous releases accurately reflected the number of protein-coding genes. The gene models have been enhanced in a number of ways. The number of genes with annotated untranslated regions (UTRs) and alternative transcripts has increased as a direct result of the increase in EST and complete cDNA sequences, and the fine details of the exon-intron structure are significantly improved. Numerous genes have been merged and/or split - that is, the partitioning of adjacent exons into individual gene models has changed - based on cDNA and protein sequence alignments. Overall, the improved annotations result in changes in more than 40% of the predicted proteins; however, more than 85% of the exons in the originally predicted genes contain sequences that are present in predicted exons in Release 3. We describe these changes under the headings 'Genome statistics: how is Release 3 different?', 'New and deleted annotations', and 'Structural changes to gene models' in Results and discussion.
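The exon-level comparison mentioned above (the share of originally predicted exons whose sequence is still present in a Release 3 exon) can be illustrated with a minimal sketch. This is not the procedure used for Release 3; the interval representation, the function names `overlaps` and `fraction_retained`, and the toy coordinates are all assumptions made for illustration, and every exon is assumed to lie on the same chromosome arm and strand.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 1-based inclusive,
# all exons assumed to be on the same arm and strand.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def fraction_retained(old_exons: List[Interval], new_exons: List[Interval]) -> float:
    """Fraction of old exons that overlap at least one new exon."""
    if not old_exons:
        return 0.0
    retained = sum(
        1 for old in old_exons if any(overlaps(old, new) for new in new_exons)
    )
    return retained / len(old_exons)


# Toy data, invented for this example only.
old = [(100, 200), (300, 400), (500, 600), (900, 950)]
new = [(90, 210), (320, 400), (450, 620)]
print(f"{fraction_retained(old, new):.2f}")  # 0.75
```

An exon counts as retained here if it shares any base with a new exon, which is why a measure of this kind can remain high even when many exon boundaries, and therefore many predicted proteins, have changed.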
The new annotations reveal a surprising number of genes that fall outside the typical definition of a protein-coding gene model with a 5' UTR, coding sequence (CDS), and 3' UTR distinct from neighboring genes. We found genes containing 3' UTR sequences that overlap the 5' UTR of the gene immediately downstream, examples of dicistronic transcripts (two or more distinct and non-overlapping coding regions contained on a single processed mRNA), and genes that, by means of alternative splicing, encode two completely distinct non-overlapping peptides. These atypical gene models illustrate the complexity of detailed annotation and pose new challenges for the computational annotation of genomic sequence. We describe these unusual genes, as well as assessment of and access to the data, under the headings 'Complex gene models', 'Assessment of Release 3 quality', 'Accessing data and reporting errors', and 'Future updates'.
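To make the three atypical cases concrete, the following sketch encodes a simplified transcript model and one test per case. It is an illustrative assumption, not the Gadfly schema or the curation rules: the `Transcript` class, its field names, the helper functions, and the coordinates are hypothetical, introns are ignored, and all features are placed on the forward strand of a single sequence.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive, forward strand


@dataclass
class Transcript:
    """Hypothetical, intron-free transcript model used only for illustration."""
    gene: str
    five_utr: Interval
    cds: List[Interval]  # one span per distinct coding region on the mRNA
    three_utr: Interval


def intervals_overlap(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def utr_overlap(upstream: Transcript, downstream: Transcript) -> bool:
    """3' UTR of the upstream gene overlaps the 5' UTR of the next gene."""
    return intervals_overlap(upstream.three_utr, downstream.five_utr)


def is_dicistronic(t: Transcript) -> bool:
    """Two or more distinct, non-overlapping coding regions on one mRNA."""
    return len(t.cds) >= 2 and not any(
        intervals_overlap(a, b) for a, b in combinations(t.cds, 2)
    )


def encodes_distinct_peptides(a: Transcript, b: Transcript) -> bool:
    """Alternative transcripts of one gene whose coding regions never overlap."""
    return a.gene == b.gene and not any(
        intervals_overlap(x, y) for x in a.cds for y in b.cds
    )


# Toy examples with invented names and coordinates.
gene_a = Transcript("geneA", (100, 150), [(151, 400)], (401, 520))
gene_b = Transcript("geneB", (500, 560), [(561, 900)], (901, 950))
gene_c = Transcript("geneC", (1000, 1050), [(1051, 1300), (1400, 1700)], (1701, 1800))
gene_d1 = Transcript("geneD", (2000, 2050), [(2051, 2300)], (2900, 3000))
gene_d2 = Transcript("geneD", (2000, 2050), [(2400, 2800)], (2801, 3000))

print(utr_overlap(gene_a, gene_b))                  # True
print(is_dicistronic(gene_c))                       # True
print(encodes_distinct_peptides(gene_d1, gene_d2))  # True
```

A realistic implementation would also need strand, exon structure, and reading frame; the sketch only shows why a single "one gene, one 5' UTR, one CDS, one 3' UTR" record is too restrictive for these cases.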