Beyond these major new features, there have been continuous enhancements (and bug fixes) to Ensembl over the year. Users are encouraged to read the what's new pages accompanying every release, since user interface improvements are frequently subtle but can save researchers considerable time. Some of the more significant improvements are listed here.
Ensembl genome annotation
The overall principle of the three step Ensembl gene building system (2) remains the same; however, the details of its implementation have been refined with each human genome release. The accuracy and coverage of the genes built by this automatic system continue to improve, as assessed by comparison with the gene structures of the current ‘gold standard’ of manually annotated and experimentally verified finished human chromosomes 20 (6) and 22 (7). Sanger finished clones are initially annotated and updated by the Sanger Institute vertebrate annotation group (http://www.sanger.ac.uk/HGP/havana/) and experimentally investigated by the experimental gene annotation group (http://www.sanger.ac.uk/Teams/Team69/). Ensembl works closely with these groups to integrate this annotation into Ensembl web displays and provide feedback.
One area where the existing Ensembl gene building system has been weak is the prediction of alternative transcript forms: the core gene building machinery does not rely on ESTs directly, because doing so results in too many false positives. To partly address this, a separate set of gene predictions is now being made, built entirely from human ESTs using the Ensembl EST GeneBuilder.
Ensembl maps ESTs to the genome using a combination of Exonerate, BLAST and EST2Genome. The mapped ESTs are then processed by merging redundant ESTs and setting splice sites to the most common ends. This method finds the correct internal splice sites, clusters 5′ and 3′ ESTs into UTRs and joins the fragments into longer transcript structures. The resulting transcripts are processed by Genomewise, which finds the longest ORF across each one.
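The final step, finding the longest ORF across a transcript, can be illustrated with a minimal sketch. This is not Genomewise itself, which does considerably more; the function name `longest_orf` and the example sequence are hypothetical, and the sketch assumes an already assembled, forward-strand transcript sequence in which an ORF runs from an ATG to the first in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Returns None when no complete ORF is found. (Illustrative only; this is
    an assumption about what 'longest ORF' means, not the Genomewise model.)
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best
```

For example, `longest_orf("CCATGAAATGA")` reports the ORF `ATGAAATGA` in the third reading frame.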
Alternative transcripts are predicted where there is at least one alternatively spliced EST, and each EST gene has a supporting evidence page showing which ESTs have been used to construct it. At present, these EST genes are not classed as ‘Ensembl genes’ and therefore do not have Ensembl stable identifiers; however, we are working to combine the EST and core gene builders in a way that increases alternative transcript coverage without decreasing gene prediction accuracy.
Ensembl web site
All interfaces have continued to be refined during the year, with probably the most development carried out on contigview, the ‘workhorse’ interface to genome sequence. Refinements include toggle controls to switch between single-line and multi-line track displays, screen width configuration, contig orientation indicators and a gap type track. New tracks include an Eponine (8) track showing transcription start site predictions. Speed has been significantly improved by the use of the Ensembl-lite denormalised database, which now provides much of the data for these pages and has been optimised for web queries.
New interfaces are the martview data mining interface (see above), goview and haploview. In goview we have integrated the standard GO browser from the GO consortium (9). GO is an ontology of gene function, process and location terms (e.g. ‘protein phosphorylation’ or ‘cell cycle processing’). The GO data for human is inherited directly from SWISS-PROT GOAH work. The haploview interface provides access to haplotype data, currently available for human chromosome 22 (10). To access this data, turn on the haplotypes track in contigview and where haplotypes are shown, click on them to jump to haploview.
What is unlikely to be apparent to the user are the underlying changes to the web code that make it more multi-species oriented, allowing it to support all the species presented in Ensembl from a single codebase.
Finally, the integration of DAS (11, 12) servers with the website has greatly increased. New DAS tracks on contigview include NCBI Transcript models, NCBI GenomeScan predictions, Acembly Transcript models and Ensembl mapped RefSeqs. Improvements have also been made to the interfaces to allow you to add DAS tracks from your own servers and to upload your own data directly for display. It is clear that the usage of DAS to integrate user data with our baseline annotation has increased greatly over the year.
Ensembl software system
Maintaining the circa 500 000 lines of code that support and run the Ensembl project is a major task in itself. Over 2002, Ensembl has transitioned to a revised schema and code base, a change that has principally involved more complete compliance with code and schema standards. For example, mixed case columns have been removed from the schema definitions and foreign key relationships are consistently named. In the code, the previously loose convention of separating the ‘biological’ objects from the database aware ‘adaptor’ objects is now applied consistently across the database, with a consistent style of function name.
In addition to the Perl code base, there is a parallel Java code base with a common design between the two language bindings. As with the Perl code, the separation of biological objects from database aware adaptor objects is rigorously followed. The Java layer is currently used for stable ID transfer and as a backend data adaptor for Apollo.
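The separation of ‘biological’ objects from database aware ‘adaptor’ objects can be sketched as follows. This is an illustration of the design convention only, written in Python rather than in the Perl or Java of the actual code base; the class names, the table layout and the identifier `DEMO_GENE_1` are assumptions made for the example and do not reproduce the Ensembl schema or API.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Gene:
    """'Biological' object: holds data and biology-level logic, no SQL."""
    stable_id: str
    chromosome: str
    start: int
    end: int

    def length(self):
        # Inclusive coordinates, so add one.
        return self.end - self.start + 1


class GeneAdaptor:
    """Database aware 'adaptor': the only place that knows about tables."""

    def __init__(self, connection):
        self._db = connection

    def store(self, gene):
        self._db.execute(
            "INSERT INTO gene (stable_id, chromosome, seq_start, seq_end) "
            "VALUES (?, ?, ?, ?)",
            (gene.stable_id, gene.chromosome, gene.start, gene.end),
        )

    def fetch_by_stable_id(self, stable_id):
        row = self._db.execute(
            "SELECT stable_id, chromosome, seq_start, seq_end "
            "FROM gene WHERE stable_id = ?",
            (stable_id,),
        ).fetchone()
        return Gene(*row) if row else None


def make_demo_db():
    """Create an in-memory database with a hypothetical, lower-case schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE gene (stable_id TEXT PRIMARY KEY, chromosome TEXT, "
        "seq_start INTEGER, seq_end INTEGER)"
    )
    return db
```

Because all SQL lives in the adaptor, code that works with `Gene` objects does not need to change when the schema is revised, which is one plausible motivation for applying the convention uniformly.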
Ensembl data analysis pipeline
The data analysis pipeline has had a number of improvements, in particular to the processing of ESTs and cDNAs as part of the EST Gene analysis. In addition, the work with the A. gambiae (Holt et al., Science, in press) and C. briggsae projects has introduced more configuration options to allow the pipeline to adapt to these invertebrate genomes. For example, the heuristics about maximum intron size have to be adjusted between vertebrates and invertebrates.
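A configurable intron size heuristic of this kind might look like the sketch below. The configuration class, the function name and, in particular, the numeric limits are hypothetical placeholders chosen for illustration; they are not the values used by the Ensembl pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneBuildConfig:
    """Per-genome settings (hypothetical; only one option is shown)."""
    max_intron_length: int


# Placeholder limits, NOT Ensembl's actual settings.
CONFIGS = {
    "vertebrate": GeneBuildConfig(max_intron_length=200_000),
    "invertebrate": GeneBuildConfig(max_intron_length=20_000),
}


def introns_plausible(exons, config):
    """Check every intron implied by a transcript against the limit.

    `exons` is a list of (start, end) pairs with inclusive coordinates,
    sorted by position on the forward strand.
    """
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron_length = next_start - prev_end - 1
        if intron_length > config.max_intron_length:
            return False
    return True
```

With these placeholder limits, a transcript containing a 50 kb intron would be accepted under the vertebrate configuration and rejected under the invertebrate one.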
At a more technical level, the Ensembl pipeline system has improved its handling of complex data conditions that previously took manual work to fix. For example, areas of the genome that are almost completely masked by repeats, and so often triggered software errors in programs such as GenScan (13) when presented with high N content, are now recognised and special processing rules applied. Different scheduler systems, such as PBS and GridEngine as well as LSF, can now be used.
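Recognising heavily masked regions before they reach an ab initio predictor can be as simple as measuring the fraction of N characters in the masked sequence. The sketch below assumes hard masking with N; the function names, the rule labels and the 0.9 threshold are hypothetical and are not taken from the Ensembl pipeline.

```python
def n_fraction(seq):
    """Fraction of the sequence that is N (hard-masked or unknown)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def choose_rule(seq, max_n_fraction=0.9):
    """Pick a processing rule for a masked sequence slice.

    Returns 'skip_ab_initio' when the slice is almost completely masked,
    otherwise 'standard'. Threshold and labels are illustrative assumptions.
    """
    if n_fraction(seq) >= max_n_fraction:
        return "skip_ab_initio"
    return "standard"
```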
One innovation has been the compact storage of gapped alignments by storing the maximum extent of the matches and then a text string which encodes the placement of gaps inside the alignment. This text string format was first introduced in Exonerate and represents the state path of the alignment process. Colloquially inside Ensembl this is called a ‘cigar line’, and its adoption has shrunk the number of rows in the feature table by around 4-fold.
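The idea can be sketched with a simplified encoding. The code below assumes a run-length string of match (M), insertion (I) and deletion (D) states written as a count followed by a letter; the exact syntax used by Exonerate and by the Ensembl schema may differ, and the function names are hypothetical. A gapped alignment then needs only its overall start and end coordinates plus one such string, rather than one row per ungapped block.

```python
import re
from itertools import groupby


def encode_cigar(query_aln, target_aln):
    """Encode two gapped, equal-length alignment rows as a cigar string.

    '-' marks a gap. M = both rows have a residue, I = residue in the query
    only, D = residue in the target only (an assumed convention).
    """
    if len(query_aln) != len(target_aln):
        raise ValueError("alignment rows must have equal length")
    states = []
    for q, t in zip(query_aln, target_aln):
        if q == "-" and t == "-":
            raise ValueError("column with gaps in both rows")
        states.append("D" if q == "-" else "I" if t == "-" else "M")
    # Run-length encode the state path.
    return "".join(f"{len(list(run))}{state}" for state, run in groupby(states))


def aligned_lengths(cigar):
    """Return (query_length, target_length) covered by a cigar string."""
    query_len = target_len = 0
    for count, state in re.findall(r"(\d+)([MID])", cigar):
        n = int(count)
        if state in "MI":
            query_len += n
        if state in "MD":
            target_len += n
    return query_len, target_len
```

For instance, aligning `ACG-TT` against `ACGATT` gives `3M1D2M`, from which the extents on both sequences (5 and 6 residues) can be recovered without storing the individual blocks.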