Next-generation sequencing technologies have drastically reduced the cost and increased the speed of complete genome sequencing. As a result, the number of completely sequenced genomes has more than doubled since 2006, from ~450 to over 1000 genomes in August 2009 (http://genomesonline.org/). Since the release of the first genome sequence >10 years ago, a large number of gene prediction tools have been published, using different data inputs and prediction methods to identify the location and exon–intron structures of genes. While ab initio prediction tools such as GenScan (1) and GeneID (2) showed some success in predicting protein-coding genes using hidden Markov models (HMMs) and basic characteristics of genes, they have now mostly been replaced by evidence-based tools [Gnomon (3), Augustus (4), EuGène (5)] and, in some cases, dual-genome comparative prediction tools [Twinscan (6) and SLAM (7)]. But even for these more sophisticated, evidence-based tools, prediction of the exact exon–intron structures and splice variants of genes remains a challenge. For example, the human ENCODE Genome Annotation Assessment Project (8) showed that the average multiple-transcript accuracy (i.e. the accuracy in predicting all isoforms of a gene correctly) of the tested prediction tools reached only 40–50%. Genomes also contain several other types of genes, such as pseudogenes, RNA genes, uORFs and short coding genes, which are much harder to predict than typical multi-exon protein-coding genes (9). Even though it is clear that all genomes would benefit substantially from manual curation, only a few model organisms, such as Drosophila melanogaster, Arabidopsis thaliana, Caenorhabditis elegans and Escherichia coli, benefit from the continuous, in-depth annotation of expert curators. For most newly sequenced genomes, however, no curatorial teams are available and genome annotation often remains limited to computational predictions. Incomplete knowledge of a genome’s gene repertoire represents a significant bottleneck in biological research, as correct gene structures are a prerequisite for computational sequence analysis to determine gene function, for primer design to amplify genes and detect expression, for comparative analysis, and for the identification and analysis of regulatory elements and splicing factors. It is therefore crucial for the research community to become more involved in the annotation of newly sequenced genomes.

In the last few years, a plethora of freely available genome browsing and editing tools have become available, including those developed by the Generic Model Organism Database (GMOD) project. Furthermore, emerging RNA sequencing technologies are starting to generate vast amounts of transcriptome data, which represent extremely useful experimental evidence for improving gene structures and detecting new splice variants. Increased community-based genome annotation will depend on the availability of robust, intuitive and integrated suites of tools, applicable across many species, to visualize, edit, analyze and annotate genes and gene products, features and attributes.
As large genome centers and model organism databases are leading efforts in large-scale genome annotation and tool development, interest in their annotation protocols, methods and tools has markedly increased in recent years. In this article, we discuss methodologies and standards of annotation, as well as tools used by four annotation teams at the following three centers: the J. Craig Venter Institute (11), the Wellcome Trust Sanger Institute (WTSI) (12) and The Arabidopsis Information Resource (TAIR) (13). The authors presented this work as a workshop at the Third International Biocuration Conference in Berlin, Germany, April 2009, organized and chaired by Dr Linda Hannick. This report is not intended as a comprehensive review of all available annotation methodologies and tools, but rather as a discussion of the work presented at the workshop.