Ensembl annotates known genes and predicts novel genes, with functional annotation from the InterPro (1) protein family databases and with additional annotation by OMIM disease (2), SAGE expression (3,4) and by gene family (5).
Prediction of genes is the most important part of genome annotation, connecting the DNA sequence with the wide array of experimental data. In eukaryotic organisms with large introns, ab initio predictions are useful but have a high false positive rate and often predict partially incorrect gene structures. Thus, incorporation of all available evidence for gene prediction is necessary.
The Ensembl gene build system incorporates a wide range of methods including ab initio gene predictions, homology and gene prediction HMMs. Genes are placed in the genome using a three step process. First, ‘best in genome’ positions for all known human proteins from SPTREMBL (6) are found using a fast protein to DNA matcher (pmatch, R. Durbin, unpublished software). These positions are refined using genewise (7) to provide an accurate gene structure. UTRs are also aligned to each gene structure using full-length cDNAs where known. Secondly, a similar process is used to align paralogous human proteins and proteins from other organisms to the genome to form a set of novel human genes. Finally, the ab initio program genscan (8) is run across the entire genome to create a set of genscan peptides. Exons from these predicted peptides that are confirmed by blast matches to proteins, vertebrate mRNA and UniGene clusters are assembled into genes.
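The final step, keeping only those predicted exons that are confirmed by similarity evidence, can be illustrated with a minimal sketch. It assumes that exons and blast hits are reduced to closed (start, end) intervals on a single sequence and that "confirmed" means sharing at least one base with a hit. Both the interval representation and the function names are hypothetical; the actual confirmation criteria used by the gene build are not specified here.

```python
def overlaps(a, b):
    """Return True if closed intervals a and b, each (start, end), share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def confirmed_exons(predicted_exons, evidence_hits):
    """Keep predicted exons that overlap at least one evidence hit.

    Assumption of this sketch: any overlap counts as confirmation.
    Input order of the exons is preserved.
    """
    return [
        exon
        for exon in predicted_exons
        if any(overlaps(exon, hit) for hit in evidence_hits)
    ]


# Hypothetical coordinates, for illustration only.
exons = [(100, 200), (300, 400), (900, 950)]
hits = [(150, 180), (390, 500)]
kept = confirmed_exons(exons, hits)
```

In this example the third exon has no supporting hit and is dropped, so only the first two exons would go forward to be assembled into genes.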
The above process creates a set of transcripts and these are grouped into genes wherever an exon is shared. These ‘Ensembl genes’ are regarded as being accurate predicted gene structures with a low false positive rate, since they are all supported by experimental evidence of at least one form via sequence homology. Ensembl human genes are identified by numbers beginning ENSG (transcripts begin ENST, exons begin ENSE and translations begin ENSP). These identifiers are kept stable, as far as is possible, between assemblies of the human genome.
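Grouping transcripts into genes by shared exons can be sketched as a connected-components problem. The example below is an illustration and not the Ensembl implementation. It assumes that an exon is identified by a (chromosome, start, end, strand) tuple, that two exons are the same only if these coordinates are identical, and that sharing is applied transitively, so two transcripts linked through a third fall into one gene. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def group_transcripts_into_genes(transcripts):
    """Cluster transcripts that share an exon, directly or transitively.

    `transcripts` maps a transcript name to a list of exons, each a hashable
    (chromosome, start, end, strand) tuple. Returns a sorted list of sorted
    transcript-name lists, one list per gene.
    """
    parent = {name: name for name in transcripts}

    def find(name):
        # Walk up to the cluster root, halving the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_owner = {}  # exon -> first transcript seen that contains it
    for name, exons in transcripts.items():
        for exon in exons:
            if exon in first_owner:
                union(first_owner[exon], name)
            else:
                first_owner[exon] = name

    genes = defaultdict(list)
    for name in transcripts:
        genes[find(name)].append(name)
    return sorted(sorted(members) for members in genes.values())


# Hypothetical transcripts, for illustration only.
example = {
    "t1": [("22", 100, 200, 1), ("22", 300, 400, 1)],
    "t2": [("22", 300, 400, 1), ("22", 500, 600, 1)],
    "t3": [("22", 500, 600, 1)],
    "t4": [("22", 900, 1000, -1)],
}
genes = group_transcripts_into_genes(example)
```

Here t1 and t3 share no exon with each other, but both share one with t2, so all three are placed in the same gene, while t4 forms a gene of its own. A union-find structure is used because it handles such transitive links without repeated pairwise comparison of transcripts.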
Ensembl is continuously refining and extending its gene building process, calibrating it against regions of the genome that have been hand annotated and experimentally investigated, such as human chromosome 22 (9). We are in the process of integrating EST data into Ensembl gene building. ESTs offer a considerable advantage in aiding the prediction of non-coding exons, especially those located within the 3′-UTR. Two EST/genome alignment algorithms, namely exonerate (G. Slater, unpublished) and EST_genome (10), have been integrated with the Ensembl gene-building pipeline to yield gene predictions incorporating EST alignments. Because EST data are notorious for their high sequence error rate, strict quality measures have been introduced such that only splicing ESTs are considered, and priority is given to those ESTs which align on the genome into clusters.
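The two quality measures described above can be sketched as a filter followed by a clustering step. This sketch makes several assumptions that are not taken from the pipeline itself: an EST alignment is represented as a list of (start, end) blocks on one strand of one chromosome, an EST counts as spliced when its alignment has at least two blocks (that is, at least one intron), ESTs cluster when their genomic spans overlap, and priority is expressed by ranking larger clusters first. All names are hypothetical.

```python
def is_spliced(blocks):
    """Assumption: two or more aligned blocks imply at least one intron."""
    return len(blocks) >= 2


def cluster_spliced_ests(alignments):
    """Drop unspliced ESTs, then cluster the rest by overlapping genomic span.

    `alignments` maps an EST name to a list of (start, end) blocks, all on
    the same chromosome and strand. Returns clusters of EST names, with the
    largest clusters first.
    """
    spans = sorted(
        (min(s for s, _ in blocks), max(e for _, e in blocks), name)
        for name, blocks in alignments.items()
        if is_spliced(blocks)
    )

    clusters = []
    current, current_end = [], None
    for start, end, name in spans:
        if current and start <= current_end:
            # Span overlaps the running cluster, so extend it.
            current.append(name)
            current_end = max(current_end, end)
        else:
            if current:
                clusters.append(current)
            current, current_end = [name], end
    if current:
        clusters.append(current)

    # Stable sort: equally sized clusters keep their genomic order.
    clusters.sort(key=len, reverse=True)
    return clusters


# Hypothetical alignments, for illustration only.
ests = {
    "est1": [(100, 200), (500, 600)],
    "est2": [(150, 200), (500, 650)],
    "est3": [(1000, 1300)],
    "est4": [(2000, 2100), (2500, 2600)],
}
ranked = cluster_spliced_ests(ests)
```

In this example est3 aligns as a single block and is discarded, est1 and est2 overlap and form the highest-priority cluster, and est4 remains as a singleton cluster ranked after them.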
The whole genome shotgun (WGS) sequence of the mouse genome (data generated by the mouse sequencing consortium) is another rich source for identifying human genes. We have developed a very fast gapped DNA–DNA alignment algorithm ‘exonerate’ and have used it to align 14 million mouse reads to the assembled human genome. We have found that matches between human and mouse can be assessed using genscan to indicate those which are potentially novel coding exons.