Since the publication of the draft human genome sequence in 2001 [
6,
7], a number of human gene reference sets have been created using computational prediction, manual annotation, or a mixture of the two. The Ensembl project was initially set up to warehouse and annotate the large amount of unfinished genomic data being produced as part of the public human genome project, as well as to provide browser capacity for both sequences and annotations (Figure ). Ensembl has since expanded and now generates automatic predictions for more than 35 species. The Ensembl gene build process is based on alignments of protein and cDNA sequences to produce a highly accurate gene set with a low rate of false positives [
19].
Another genome browser supplying sequence and annotation data for a large number of genomes is the University of California, Santa Cruz (UCSC) genome browser database [
20]. In April 2007, UCSC released an improved version of its 'Known Gene Set' for the human genome, which includes putative noncoding RNAs as well as protein-coding genes. Each entry in this set requires the support of a GenBank entry and at least one other line of evidence, except for curated cDNAs, which require no other evidence.
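The inclusion rule described above can be expressed as a small predicate. The following is only a minimal sketch of one reading of that rule, not UCSC's actual pipeline; the `Candidate` class and its field names (`has_genbank_entry`, `is_curated_cdna`, `other_evidence`) are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Candidate:
    """Hypothetical record for a candidate entry (illustrative fields only)."""
    has_genbank_entry: bool
    is_curated_cdna: bool = False
    other_evidence: Tuple[str, ...] = ()  # e.g. labels of additional evidence lines


def qualifies(candidate: Candidate) -> bool:
    """Apply the rule as read here: GenBank support plus at least one other
    line of evidence, unless the entry is a curated cDNA."""
    if not candidate.has_genbank_entry:
        return False
    if candidate.is_curated_cdna:
        # Curated cDNAs need no further evidence.
        return True
    return len(candidate.other_evidence) >= 1


if __name__ == "__main__":
    print(qualifies(Candidate(True, False, ("EST alignment",))))
    print(qualifies(Candidate(True, True)))
    print(qualifies(Candidate(True, False)))
```

Under this reading, an entry with GenBank support but no additional evidence is accepted only when it is a curated cDNA.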
Manual annotation still plays a significant part in annotating high-quality finished genomes. Currently, the National Center for Biotechnology Information (NCBI) reference sequences (RefSeq) collection provides a highly (manually) curated resource of multi-species transcripts, including plant, viral, vertebrate and invertebrate sequences [
21,
22]. These are, as their name indicates, transcript-oriented and usually rely on full-length cDNAs for reliable curation, although the dataset also contains predictions using expressed sequence tags (ESTs) and partial cDNAs aligned against genomic sequence using the Gnomon prediction program [
23]. Manually reviewed RefSeq nucleotide sequences carry the NM accession prefix, whereas unreviewed predictions carry the XM prefix. When a new genome is first sequenced, researchers usually use the RefSeq data set to identify missing genes or genomic rearrangements within genes, as RefSeq is used internationally as a standard for genome annotation [
21]. RefSeq is a very reliable, but also conservative, gene reference set. Other reference sets usually include RefSeq but extend it substantially. For instance, the UCSC 'Known Genes' set has 10% more protein-coding genes, approximately five times as many putative coding genes and twice as many splice variants as RefSeq.
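The NM/XM distinction mentioned above lends itself to a simple check on the accession prefix. The sketch below assumes accessions of the form `PREFIX_digits.version`; the function name `refseq_status` and the example accession strings are placeholders chosen for illustration, not real records.

```python
def refseq_status(accession: str) -> str:
    """Classify a RefSeq nucleotide accession by its prefix.

    Returns "reviewed" for NM, "predicted" for XM, and "other" for
    any prefix not discussed in the text.
    """
    prefix = accession.strip().split("_", 1)[0].upper()
    if prefix == "NM":
        return "reviewed"
    if prefix == "XM":
        return "predicted"
    return "other"


if __name__ == "__main__":
    # Placeholder accessions, used only to exercise the function.
    for acc in ("NM_000000.0", "XM_000000.0", "ZZ_000000.0"):
        print(acc, refseq_status(acc))
```

A filter of this kind is one way to restrict an analysis to manually reviewed transcripts before comparing against other gene sets.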
A different approach to manual gene annotation is to annotate transcripts aligned to the genome and take the genomic sequences as the reference rather than the cDNAs. This is how the HAVANA group at the Wellcome Trust Sanger Institute produces its annotation on vertebrate sequence. Currently, only three vertebrate genomes - human, mouse and zebrafish - are being fully finished and sequenced to a quality that merits manual annotation [
24]. The finished genomic sequence is analyzed using a modified Ensembl pipeline [
25], and BLAST results of cDNAs/ESTs and proteins, along with various
ab initio predictions, can be analyzed manually in the annotation browser tool Otterlace. The advantage of genomic annotation compared with cDNA annotation is that more alternatively spliced variants can be predicted, as partial EST evidence and protein evidence can be used, whereas cDNA annotation is limited by the availability of full-length transcripts. Moreover, genomic annotation produces a more comprehensive analysis of pseudogenes. One disadvantage, however, is that if a polymorphism occurs in the reference sequence, a coding transcript cannot be annotated, whereas cDNA annotation can select the major haplotypic form and is, therefore, not limited by a reference sequence.
In 2006, the groups mentioned above (NCBI (RefSeq), UCSC, the Wellcome Trust Sanger Institute (HAVANA) and Ensembl) identified a need to collaborate and produce a consensus gene set for the human reference genome as there was still no official agreement between the different databases on the human protein-coding genes. Referred to as the Consensus Coding Sequence Set (CCDS) [
26], it currently contains only those coding transcripts that are equivalent in each database's gene build from start codon to stop codon. The latest human CCDS release (May 2008) contains 20,151 consensus coding sequences representing 17,052 genes. For the first time, this provides researchers with a consistent, reliable gene set that has been derived independently from a combination of manual and automated annotation by three groups (Ensembl, NCBI and HAVANA) and quality checked at UCSC. Protein-coding genes that differ between the groups' gene sets and cannot be merged automatically will be re-examined manually and either rejected or, if they receive a unanimous vote from the groups at NCBI, UCSC and HAVANA, added to the consensus set.
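The consensus criterion, a coding structure that is identical from start codon to stop codon in every contributing gene build, can be illustrated as a set intersection. This is a simplified sketch and not the CCDS pipeline itself: the representation of a coding sequence as a `(chromosome, strand, coding intervals)` tuple, the function name `consensus_cds`, and the toy coordinates are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: chromosome, strand, and the ordered
# coding intervals (start, end) running from start codon to stop codon.
CDS = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def consensus_cds(builds: Dict[str, Set[CDS]]) -> Set[CDS]:
    """Return the coding structures present, identically, in every build."""
    if not builds:
        return set()
    return set.intersection(*(set(cds_set) for cds_set in builds.values()))


if __name__ == "__main__":
    shared: CDS = ("chr1", "+", ((100, 200), (300, 400)))
    variant: CDS = ("chr1", "+", ((100, 200), (300, 410)))  # different stop position
    toy_builds = {
        "build_a": {shared, variant},
        "build_b": {shared},
        "build_c": {shared, variant},
    }
    print(consensus_cds(toy_builds))
```

In the toy data only the structure shared by all three builds survives; the variant with a different final coding interval is left for the kind of manual re-examination described above.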
Complementary to the CCDS project is the GENCODE project [
27]. The GENCODE consortium [
28] was initially formed to identify and map all protein-coding genes within the regions selected in the framework of the ENCODE project [
29,
30], representing 1% of the human genome sequence. This was achieved by a combination of initial manual annotation by HAVANA, computational predictions and experimental validation, followed by refinement of the annotation on the basis of the experimental results. In 2008 the project was funded to annotate the whole human reference genome sequence and to verify a number of putative loci experimentally. The scaled-up annotation includes identification of pseudogenes and noncoding loci supported by transcript evidence. The initial manual annotation is compared with automated predictions to highlight inconsistencies based on comparative analysis or new transcript data. It is expected that, upon completion in 2011, this gene set will become the standard human gene reference set.