Currently only three vertebrate genomes, human, mouse and zebrafish, are being fully sequenced and finished to a quality that merits manual annotation. Although labour intensive and relatively slow compared with automatic annotation methods, manual annotation provides an invaluable, reliable reference resource that can be used to predict gene structures on low-coverage genomes from other vertebrate species. The Vega database is the central repository for the majority of genome sequencing centres to deposit their annotation of human chromosomes. Unlike other browsers, Vega only displays a manually annotated gene set on the latest chromosome assemblies, which are often more up-to-date than the reference genome assembly generated by NCBI. Currently, the human database contains twenty chromosomes annotated by eight different sequencing centres. The Havana Group at the Wellcome Trust Sanger Institute (WTSI) is updating the annotation through its involvement in the consensus coding sequence (CCDS) collaboration with UCSC, NCBI and Ensembl (http://www.ncbi.nlm.nih.gov/CCDS/), which aims to produce a reference set of protein-coding gene annotation across the entire human genome.
The four mouse chromosomes (2, 4, 11 and X) sequenced at WTSI have been virtually fully annotated and can be browsed through Vega. The rest of the mouse genome is being annotated on a gene-by-gene basis as part of the mouse CCDS collaboration.
The zebrafish genome, which is being fully sequenced and manually annotated at the WTSI in collaboration with Zfin (1), currently features eight completely annotated chromosomes.
In addition to full genomes, and unlike other browsers, Vega also displays small finished regions of interest from genomes of other vertebrates, human haplotypes and mouse strains. Currently this comprises the finished sequence and annotation of the major histocompatibility complex (MHC) from different human haplotypes, and from dog and pig [the latter of which is currently otherwise only available in very limited form in Ensembl Pre! (http://pre.ensembl.org/Sus_scrofa/index.html)]. Additionally, there is mouse NOD (non-obese diabetic) strain annotation of IDD (insulin-dependent diabetes) candidate regions and two more pig regions.
Improvements and progress in Vega since 2004
All three complete genomes (mouse, human and zebrafish) now contain a view of all the chromosomes in the Karyotype View and the annotation progress of each chromosome is highlighted with grey shading. Since the original Vega publication in 2005 (2), the number of human gene loci annotated has more than doubled to almost 33 000 (June 2007 release), close to 19 000 of which are predicted to be protein coding. Four chromosomes (2, 4, 5 and 11) remain to be fully manually annotated to the Havana standard and these will be completed as part of the CCDS collaboration and the whole-genome extension of the ENCODE project (see below). Since annotation is continually re-evaluated on a gene-by-gene basis, every locus is versioned and the date of creation and last update can now be viewed by the user on the curated locus report page (GeneView).
The CCDS project aims to produce a set of protein-coding transcripts that is agreed upon by the RefSeq group at the NCBI, the Havana and Ensembl groups at the Wellcome Trust Genome Campus and the Genome Informatics group at the UCSC. Though originally limited to human genes, the project now includes mouse. As part of the collaboration, we are comprehensively annotating (i.e. including all coding and non-coding variants) each human and mouse CCDS locus to provide a solid basis for comparison with RefSeq. In the process, this supplies up-to-date annotation to previously annotated sequences and novel annotation to unannotated sequence. Where appropriate, Vega transcript and gene records (TranscriptView, GeneView) have links to the CCDS gene records in the CCDS database at the NCBI (http://www.ncbi.nlm.nih.gov/CCDS/).
As part of the ENCODE project (3), Havana have comprehensively annotated the target genes (1% of human genes) in human and mouse and updated the annotation following both experimental and computational feedback from the GENCODE project (5–7). In human Vega, ENCODE regions are marked in ContigView (users may have to switch on the relevant track in the ‘Decorations’ menu).
Vega transcript objects are also shown, in a separate track, in Ensembl Detailed View (tracks named ‘Vega Havana Gene’ and ‘Vega External Gene’; the user may have to switch these tracks on in the ‘Features’ menu). In order to eliminate redundancy in the Ensembl transcript track and highlight commonality, Ensembl and Vega have started to match protein-coding transcripts between the two datasets and only present a single transcript if a Vega and an Ensembl transcript within a given locus are identical. These transcripts (and loci containing them) are coloured gold and labelled ‘Merged Known Protein Coding’ or ‘Common Known Protein Coding’ in Ensembl ContigView. The project is currently limited to human genes annotated by Havana, but is expected to include Havana-annotated mouse genes in Ensembl version 48 (December 2007 release).
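The matching step can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that two transcripts count as identical when they share strand, exon coordinates and CDS boundaries; the actual criteria used by Ensembl and Vega are not described here, and the `Transcript` type, the `merge_locus` function and the identifiers in the usage note are hypothetical.

```python
from typing import NamedTuple, Tuple


class Transcript(NamedTuple):
    source: str                        # "vega" or "ensembl" (hypothetical labels)
    stable_id: str                     # hypothetical identifier
    strand: int                        # +1 or -1
    exons: Tuple[Tuple[int, int], ...] # (start, end) pairs, chromosome coordinates
    cds: Tuple[int, int]               # (start, end) of the coding region


def structure_key(t):
    """Key that is equal for two transcripts with the same structure (assumed definition)."""
    return (t.strand, tuple(sorted(t.exons)), t.cds)


def merge_locus(transcripts):
    """Collapse identical Vega/Ensembl transcripts within one locus.

    Returns a list of (label, ids) tuples: label is "merged" when both
    sources contribute a transcript with the same structure, otherwise
    the name of the single source.
    """
    by_structure = {}
    for t in transcripts:
        by_structure.setdefault(structure_key(t), []).append(t)

    shown = []
    for group in by_structure.values():
        sources = {t.source for t in group}
        if {"vega", "ensembl"} <= sources:
            # Present a single entry for the shared structure.
            shown.append(("merged", tuple(t.stable_id for t in group)))
        else:
            shown.extend((t.source, (t.stable_id,)) for t in group)
    return shown
```

For example, a locus holding one Vega transcript and two Ensembl transcripts, one of which shares the Vega structure, would yield one merged entry and one Ensembl-only entry.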
In preparation for the zebrafish genome paper (which will be based on genome assembly Zv8), all mRNA entries in the Zfin database (http://zfin.org/) have been aligned to the current Zv7 assembly and those that map have been annotated (currently 6157). On an ongoing basis, known mRNAs are being mapped and annotated as new finished genomic sequence becomes available. To remove artificial duplications, annotation from the previous mixed-strain library genomic clones has been moved to a reference assembly constructed from a single double-haplotype Tübingen strain individual. The original clones are still visible in ContigView and annotation can be compared between the two in MultiContigView using the ‘Comparative’ menu in Detailed View.
Figure 3. Zebrafish haplotype clones are marked in yellow (Top level and Navigational overview panels above). In MultiContigView, annotation can be shown on both reference and haplotype simultaneously with lines linking homologous genes (Detailed View panel above).
We collaborate closely with the MGI group at The Jackson Laboratory (http://www.informatics.jax.org/) regarding mouse gene sets and their nomenclature. Genes are cross-linked between Vega and MGI: Vega GeneView pages link to MGI locus records and vice versa. A similar collaboration is in place with the HGNC (http://www.genenames.org/) for human genes and Zfin (http://zfin.org/) for zebrafish.
For the first time a large region of the porcine genome, 8.2 Mb of chromosome 17 sequence orthologous to human chromosome 20q13 and mouse 2, has been made available (8). The region has been used to assess the sequencing methodology for the pig genome (8). As both the pig sequence and the orthologous human and mouse sequences have been annotated by Havana, users can compare the sequences in Vega's MultiContigView. In addition to the chromosome 17 sequence, Vega presents the pig MHC region, located on chromosome 7 (9) (see below), and the region of pig chromosome 6 containing the LRC (leukocyte receptor complex) genes (10). Their orthologous regions in human have been annotated by Havana, so again, they can be viewed in Vega alongside human sequence, and, in the case of the MHC, dog as well.
Below, a selection of new projects, where the data have been first released in Vega, are described in more detail.
Mouse genes targeted for knockout
The WTSI is producing annotation for both the EUCOMM (European Conditional Mouse Mutagenesis) (http://www.eucomm.org/) and KOMP (Knock-Out Mouse Project) (http://www.nih.gov/science/models/mouse/knockout/) efforts. These two projects aim to generate a comprehensive resource of (conditional) knockout (KO) alleles in mouse embryonic stem cells. The target genes can be viewed in Vega as a KO track. Transcript models shown in this track are the transcripts produced in KO mice where target exon(s) (also shown) have been deleted; the resulting coding transcripts are subject to nonsense-mediated decay.
Mouse diabetes (IDD) candidate regions
Mouse strain NOD (non-obese diabetic) is a model for identifying genes involved in IDD (insulin-dependent diabetes) (12–14). We are annotating candidate regions, in parallel, in the reference BL/6 strain and the NOD strain in order to compare the two strains and detect differences that may be relevant to type I diabetes susceptibility. Reference and NOD strain annotation can be viewed alongside each other in Vega MultiContigView.
MHC haplotype and comparative MHC
The primary aim of the human MHC Haplotype Project (15–17) is to provide a comprehensively annotated reference sequence of a single, HLA-homozygous MHC haplotype and to use it as a basis against which to assess variations from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population. Through the Vega database users can access gene annotation of the eight MHC haplotype sequences as it becomes available, providing a valuable public resource and a means of integrating annotation and variation data. As mentioned earlier, canine (Doberman breed) (18) and porcine MHC regions have been sequenced and annotated as well, allowing for a direct comparison of the region between three different organisms and between a number of human haplotypes.
Figure 4. MultiContigView of a region of the pig and dog MHC. Lines link computationally determined orthologues in the Detailed View. This view is accessible by choosing the desired second dataset from the ‘View alongside’ menu from the left-hand ...
Additional classification and improved annotation of alternatively spliced variants
Our locus classification classes were developed to aid standardization of the annotation of gene features by different groups across the human genome and were initially developed through a series of workshops (http://www.sanger.ac.uk/HGP/havana/hawk.shtml). However, as transcript diversity appears to present a complex landscape for each locus, we have introduced an in-depth classification at the transcript level to aid interpretation of transcript functionality. As mentioned above, the Havana group produced the reference annotation for the ENCODE project as part of the GENCODE collaboration. As part of this project, all coding transcripts were analysed by the Biosapiens consortium, which examined the structural viability of each protein by various methods (19). On feedback from the consortium we have started to classify our coding transcripts into the following four categories:
- Known CDS: identical to SwissProt entry or RefSeq NP protein.
- Novel CDS: shares >60% of its coding length with Known CDS, has cross-species or gene family support for its structure or a Pfam domain structure identical to Known CDS.
- Putative CDS: shares <60% of its coding length with Known CDS, has novel first or last coding exon or lacks cross-species or gene family support for its structure.
- NMD: if the CDS (following the appropriate reference CDS) of a transcript finishes >50 bp from a downstream splice site, the transcript is tagged as being subject to nonsense-mediated decay (NMD).
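The NMD criterion in the last category lends itself to a short illustration. The Python sketch below is one possible reading of the rule, not the Havana implementation: it assumes transcript-relative, 1-based coordinates, takes `cds_end` to be the last base of the CDS, and measures the distance to each exon-exon junction lying downstream of it. The function names and the `threshold` parameter are hypothetical.

```python
NMD_THRESHOLD_BP = 50  # distance quoted in the NMD category


def splice_junctions(exon_lengths):
    """Transcript coordinates of the last base before each exon-exon junction."""
    positions = []
    total = 0
    for length in exon_lengths[:-1]:  # the last exon has no downstream junction
        total += length
        positions.append(total)
    return positions


def is_nmd_candidate(exon_lengths, cds_end, threshold=NMD_THRESHOLD_BP):
    """True if the CDS ends more than `threshold` bp upstream of a splice junction.

    exon_lengths: exon lengths in transcript order (5' to 3').
    cds_end:      transcript coordinate of the last CDS base (assumed convention).
    """
    return any(j - cds_end > threshold for j in splice_junctions(exon_lengths))
```

With exons of 200, 150 and 300 bp, a CDS ending at position 250 lies 100 bp upstream of the second junction and would be flagged, whereas a CDS ending at position 320 (30 bp upstream) or in the final exon would not.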
Furthermore, transcript variants for which a CDS cannot be confidently assigned are classified into the following main types:
- Transcript: does not qualify for any of the specific types below.
- Retained intron: relative to an appropriate reference variant, transcript contains intronic sequences not due to alternative splice sites.
- Putative: up to three exons, supported by only up to two ESTs (from same or other species).
- Non-coding: for known non-coding genes only.
- Antisense: for known antisense genes only (i.e. genes that have a published regulatory/expression/functional relationship with the gene on the opposite strand, such as mouse Nespas).
- IG segment: for known immunoglobulin gene fragments only (e.g. the IGL cluster on human chromosome 22 or the Trav cluster on mouse chromosome 14).
Generating the database for the Vega website
As mentioned in Ashurst et al., the data released via the Vega website is produced by merging two in-house databases at the Sanger Institute: the pipeline database containing the genome assembly and alignments of features (mRNAs, proteins and ESTs, gene predictions, etc.) to that assembly, and the Otter annotation database containing the manual annotation. The Vega website runs from an Ensembl (21–23) schema database, the version of which is, as far as possible, kept synchronized with that of the Ensembl website. Keeping closely synchronized with Ensembl has advantages such as facilitating maintenance of the website: new features developed for Ensembl can sometimes become available to Vega with little or no development time being required. However, the schema difference between the Otter annotation database (which is based on a version of the schema originating from 2003 and positions genes on clones instead of chromosomes) and the Vega website database is significant for the Vega release process: the genes have to be mapped from clones onto chromosomes, and data has to be moved from legacy tables into core Ensembl schema tables. Whilst there have been numerous improvements to this process over the four-year life of Vega, this step remains a bottleneck in the release process. For this, and for other reasons, we are currently in the process of migrating the Otter annotation database onto the current Ensembl schema (see Future Plans section). However, the frequency of release of the website will always be limited by the requirement to generate additional data required for the functionality of each specific release of the website. These data include the Compara database that allows for the comparative analysis in Vega, the files used for sequence (BLAST and SSAHA) and text (Exalead) searching, and updates to help documentation and other information.
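To make the clone-to-chromosome mapping step concrete, the following Python sketch projects a feature from clone coordinates onto a chromosome. It is a simplified illustration under stated assumptions, not the Vega release code: each clone is assumed to contribute one contiguous segment to the assembly, coordinates are 1-based and inclusive, and the `Placement` record and `clone_to_chromosome` function are hypothetical names.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    """Where a contiguous piece of a clone sits in the chromosome assembly."""
    clone: str
    clone_start: int   # first clone base used in the assembly (1-based, inclusive)
    clone_end: int     # last clone base used in the assembly
    chrom: str
    chrom_start: int   # lowest chromosome coordinate covered by this piece
    orientation: int   # +1 if the clone runs with the chromosome, -1 if reversed


def clone_to_chromosome(p, start, end, strand):
    """Project a feature (start, end, strand) from clone to chromosome coordinates."""
    if start > end:
        raise ValueError("start must not exceed end")
    if start < p.clone_start or end > p.clone_end:
        # Features crossing clone boundaries need extra handling, omitted here.
        raise ValueError("feature extends outside the assembled part of the clone")

    chrom_end = p.chrom_start + (p.clone_end - p.clone_start)
    if p.orientation == 1:
        new_start = p.chrom_start + (start - p.clone_start)
        new_end = p.chrom_start + (end - p.clone_start)
    else:
        # Reversed clone: the clone's first base lands on the highest coordinate.
        new_start = chrom_end - (end - p.clone_start)
        new_end = chrom_end - (start - p.clone_start)
    return p.chrom, new_start, new_end, strand * p.orientation
```

A gene annotated on a reverse-oriented clone therefore changes both its coordinates and its strand when moved to the chromosome, which is one reason the mapping has to be applied consistently to every exon of every transcript.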
Accessing and querying data
Most of the Vega annotation data can be accessed via Ensembl through its BioMart system (24) for data queries. Furthermore, genomic, transcript and protein sequences can be easily exported in several formats from the various Views (for example ‘Export cDNA’ or ‘Export peptide’ from the menu obtained by clicking on gene cartoons in the Detailed View or Basepair View panels in ContigView). We also have BLAST and SSAHA services available for alignments of users' query sequences against Vega transcripts, proteins or genomic sequence, and users can download FASTA files from the Vega FTP site (ftp://ftp.sanger.ac.uk/pub/vega).
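Once a FASTA file has been downloaded, its records can be read with a few lines of Python. The sketch below is a generic FASTA reader, not a Vega-specific tool; it makes no assumption about the layout of the header lines in the Vega files, and the function name `read_fasta` is hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of lines, an open file handle for a downloaded file can be passed directly, and records are produced one at a time without loading the whole file into memory.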
Feedback and submitting data
In order to maintain and enhance the quality and coverage of our annotation, the Havana team is always interested in feedback, collaboration and high-quality external data. Please feel free to contact us at vega/at/sanger.ac.uk for feedback and queries or contact the corresponding author to discuss collaborations and data submissions.
Future plans
A significant development in the near future will be the migration of the Otter annotation database to a near-current version of the Ensembl schema. This should increase the release frequency and allow us to present the most recent data to the community. It will also improve versioning, searching, dealing with exceptions (e.g. selenocysteine), and mapping features across clone boundaries. In addition, in the longer term we are aiming for much of the data that is currently generated after merging the pipeline and annotation databases, such as the location of protein domains on translations, links between Vega genes/transcripts and external databases (such as MGI), karyotype images, etc. to be incorporated into the annotation database.
We will continue adding mouse, and updating human, CCDS annotation in collaboration with the NCBI and UCSC. Other ongoing collaborations are Ensembl-Vega gene merges, refining and extending ENCODE annotation with the ENCODE and GENCODE consortia and refining nomenclature and annotation with HGNC, MGI and Zfin. Maintenance and updating of existing annotation in human, mouse and zebrafish is ongoing, as is general (non-project related) de novo annotation.