A core part of Ensembl is its automatic gene build system, which is used to consistently annotate genomes for which there is no curated geneset. The genome sequences of human, chimpanzee, mouse, rat, chicken, fugu, zebrafish, C.briggsae, mosquito and honeybee have been annotated in this way. Annotation of the newest genomes, dog, cow and frog, is in progress. The exceptions are fruitfly [annotation imported from FlyBase (12)], C.elegans [annotation imported from WormBase (13)] and tetraodon (annotation provided by Genoscope). Ensembl genesets have formed the basis for the initial analysis and publication of most vertebrate genomes. During the last year, the rat (14) and chicken (16) genomes have been published.
Full descriptions of the gene build system (7) and the genewise family of algorithms that it uses (8) have recently been published. Briefly, the system is mainly based on building gene models from initial alignments of protein and cDNA sequences to a genome sequence. Where a genome has limited species-specific cDNA data, the number of genes in the geneset will reflect the number of homologous cDNAs from other related organisms that can be aligned. Where expressed sequence tag (EST) collections are thought to contain a significant number of artefact sequences, they are considered a less reliable source of evidence for gene structures and so separate gene builds are created from them (9). Where ESTs have been generated by a small number of groups and are thought to be of consistent quality they are used in the main gene build, but gene models are only built from them if there is no other evidence. This approach has been used in the chicken and honeybee gene builds.
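The evidence rules described above can be illustrated with a minimal sketch. It is not the Ensembl implementation (the production system is described in the cited references); the `Alignment` record, the `select_evidence` function and the `ests_trusted` flag are hypothetical names introduced here, and the sketch assumes that alignments have already been grouped under a locus label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one sequence aligned to the genome."""
    locus: str      # illustrative locus label, assumed to be assigned upstream
    source: str     # "protein", "cdna" or "est"
    accession: str


def select_evidence(alignments, ests_trusted):
    """Split alignments into main-build evidence and separate EST-build evidence.

    Protein and cDNA alignments always go to the main build. ESTs are used
    in the main build only when the collection is trusted AND the locus has
    no other evidence; untrusted ESTs go to a separate EST gene build.
    """
    by_locus = {}
    for aln in alignments:
        by_locus.setdefault(aln.locus, []).append(aln)

    main_build, est_build = {}, {}
    for locus, alns in by_locus.items():
        primary = [a for a in alns if a.source in ("protein", "cdna")]
        ests = [a for a in alns if a.source == "est"]
        if primary:
            main_build[locus] = primary
        elif ests and ests_trusted:
            # ESTs are a last resort: used only where nothing else aligns
            main_build[locus] = ests
        if ests and not ests_trusted:
            est_build[locus] = ests
    return main_build, est_build
```

With `ests_trusted=True` the sketch mirrors the approach described for chicken and honeybee; with `ests_trusted=False` every EST is routed to the separate EST build.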
Over the past year, the flexibility of the gene build system has been exploited to create genesets for genomes such as zebrafish, honeybee and chicken, which have varied amounts of species-specific cDNA data. Of these three gene builds, the most difficult has been that for honeybee, which is evolutionarily very distant from other sequenced organisms. The zebrafish and chicken gene builds are much more complete, because these species are evolutionarily closer to other sequenced vertebrates. Experimental validation of a randomly selected sample of gene models from the geneset made for the chicken genome (17) shows a low false positive rate of ~4% (Eyras et al., submitted for publication). As cDNA resources for these genomes increase, it is fully expected that the number of genes in their genesets will increase until it is similar to those of comparable organisms.
As the genome sequence of human has been finished, genesets for individual chromosomes have been manually curated and published [for a review see (18)]. Ensembl is part of a new international collaboration to refine the human geneset involving the Havana group (19) [which provides most of the curated annotation in the Vega database (20)], the NCBI groups [that curate RefSeq (21) and generate automatic gene builds], the UCSC browser group (22) and UniProt (23). The aim is to resolve transcript sequence differences and to generate and maintain a set of human genes with stable identifiers, where the CDS part of the gene structure can be agreed between all groups. The process of comparing genesets has proved very fruitful and is leading to improved automatic gene building methods for both Ensembl and NCBI. It is anticipated that this agreed geneset will progressively increase in size as the entire human genome is fully curated. The current Ensembl human gene build is already benefiting from this comparison: where the CDS part of a Vega curated transcript or an Ensembl transcript agrees perfectly with that of an NCBI transcript and is complete (from ATG to stop codon), the transcript propagates automatically into the next Ensembl geneset.
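The propagation rule in the last sentence can be sketched as follows. This is an illustration under stated assumptions, not the comparison pipeline used by the collaborating groups: the `Transcript` record and the functions `is_complete`, `cds_key` and `agreed_transcripts` are hypothetical, "agree perfectly" is interpreted here as identical CDS exon coordinates on the same chromosome and strand, and the requirement that a complete CDS has a length divisible by three is an added assumption.

```python
from dataclasses import dataclass
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript record; coordinates cover the CDS only."""
    source: str                            # e.g. "vega", "ensembl", "ncbi"
    chrom: str
    strand: int                            # +1 or -1
    cds_exons: Tuple[Tuple[int, int], ...]  # (start, end) pairs of coding exons
    cds_seq: str                           # spliced coding sequence


def is_complete(t):
    """True if the CDS runs from an ATG to a stop codon (assumed in frame)."""
    seq = t.cds_seq.upper()
    return (
        len(seq) >= 6
        and len(seq) % 3 == 0          # assumption: complete CDS is in frame
        and seq.startswith("ATG")
        and seq[-3:] in STOP_CODONS
    )


def cds_key(t):
    """Key under which two CDS structures are considered identical."""
    return (t.chrom, t.strand, t.cds_exons)


def agreed_transcripts(vega_or_ensembl, ncbi):
    """Return Vega/Ensembl transcripts whose complete CDS matches an NCBI one."""
    ncbi_keys = {cds_key(t) for t in ncbi if is_complete(t)}
    return [
        t for t in vega_or_ensembl
        if is_complete(t) and cds_key(t) in ncbi_keys
    ]
```

Comparing on the CDS alone means that transcripts differing only in their untranslated regions still count as agreeing, which matches the stated aim of agreeing on the CDS part of the gene structure.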