The Ensembl project (http://www.ensembl.org) provides genome information for sequenced chordate genomes with a particular focus on human, mouse, zebrafish and rat. Our resources include evidence-based gene sets for all supported species; large-scale whole genome multiple species alignments across vertebrates and clade-specific alignments for eutherian mammals, primates, birds and fish; variation data resources for 17 species and regulation annotations based on ENCODE and other data sets. Ensembl data are accessible through the genome browser at http://www.ensembl.org and through other tools and programmatic interfaces.
Ensembl (http://www.ensembl.org) collects, creates, organizes and distributes data resources in support of research into the genetics and genomics of chordates. We currently support 70 species with a focus on human in addition to agricultural animals and major vertebrate model organisms such as mouse, zebrafish and rat. We support a full range of researchers in genomics, from bench biologists interested in looking up specific details about their genes or loci of interest using a graphical web interface to advanced bioinformatics programmers looking to do complex analysis or build new tools that leverage the Ensembl infrastructure. As such, we provide all of the Ensembl source code freely under an Apache-style license and release all of our data without restriction. Ensembl data are distributed from our genome browser at http://www.ensembl.org as well as via BioMart, the Ensembl Application Programming Interface (API), direct MySQL access, Amazon Web Services Public data sets (http://www.ensembl.org/info/data/amazon_aws.html) and via full data download.
Ensembl aims to be a hub of genome information by linking identifiers and information between external biological resources and data within Ensembl or importing essential information from other resources so that it can be found within Ensembl and linked back to the original resource as necessary. For example, we provide up to date external database references to gene names from the HUGO Gene Nomenclature Committee (HGNC) (1), the Universal Protein Resource (UniProt) (2), Orphanet portal for rare diseases and orphan drugs (3), the Online Mendelian Inheritance in Man (OMIM) database (4), the RefSeq collection of Reference Sequences from NCBI (5), the UCSC Genome Browser (6), the Protein Data Bank (PDB) repository for biological macromolecular structures (7) and many other resources.
We participate in or work closely with a number of large-scale international projects including the 1000 Genomes Project (8), ENCODE (9), the International Cancer Genome Consortium (ICGC) (10) and the BLUEPRINT epigenome mapping project (11). Participation in these efforts helps ensure that we produce timely and valuable resources through direct scientific engagement with the communities that we are trying to serve. In addition, we actively develop and provide key pieces of large-scale bioinformatics infrastructure including the eHive workflow management system for genomic analysis (12).
Full incorporation of the data types resulting from the myriad of experimental assays now leveraging next generation sequencing technology remains an important area of development for the project. During the past year, we have made considerable progress in a number of ways including a greater incorporation of RNA-seq data into our gene annotations and ChIP-seq data into our regulatory annotations. In general, we believe that the most useful resources provide integrated summary information that transforms the raw sequencing data into biological knowledge that can provide a foundation for further biological research. Thus, we believe that the display of the called variants from the 1000 Genomes Project or regulatory region annotations supported by specific histone modification or transcription factor (TF) binding sites are more useful as resources for the community than a display of the raw aligned sequence reads. However, Ensembl does support the upload and visualization of read alignment data (e.g. alignment files in BAM format) and provides signal files for our ChIP-seq and alignment files for RNA-seq data within the browser for those users needing direct access to the supporting data. Indeed, Ensembl’s API development this year included increasing support for file-based data access to enable integration of very large BAM and other file-based data sets into the browser.
This report highlights the new data we have released and the new mechanisms of data access that we have deployed during the past year since our previous report (13). We describe how these new features extend the existing capabilities of the project and explain those existing capabilities where appropriate.
As of release 69 (October 2012), Ensembl supports 70 species including 61 species fully supported on our main site. Of these, we have created full gene annotations for 58 chordates (43 with high-coverage genome sequences and 15 with low-coverage) and have imported annotation data for three non-chordate model organisms (Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster) to facilitate comparative analysis. Five new species were included during the past year with full support: Atlantic cod (Gadus morhua), coelacanth (Latimeria chalumnae), ferret (Mustela putorius furo), Nile tilapia (Oreochromis niloticus) and Chinese softshell turtle (Pelodiscus sinensis). An additional nine species are currently available with limited support on the Ensembl Pre! site (http://pre.ensembl.org) including the following, which were newly added in the past year: budgerigar (Melopsittacus undulatus), Chinese hamster CHO cell line (Cricetulus griseus), painted turtle (Chrysemys picta bellii), spotted gar (Lepisosteus oculatus), collared flycatcher (Ficedula albicollis) and squirrel monkey (Saimiri boliviensis boliviensis). Ensembl Pre! sites provide BLAST and genome visualization, but do not provide a complete gene build. For specific genomes, we also provide downloadable data on the preview site.
We update the human gene set for every Ensembl release via a merge of the Ensembl evidence-based automatic annotation and Havana manual annotation (14) to produce an updated GENCODE gene set (9,15). This set also includes all current human Consensus Coding Sequence (CCDS) gene models (16). Manual annotation from Havana is also incorporated into our gene sets on alternate releases for mouse and zebrafish. In addition, pig now includes manual annotation from Havana on selected regions of the genome.
The human genome assembly is updated regularly by the Genome Reference Consortium (GRC) to include alternate sequences in the form of ‘fix’ and ‘novel’ assembly patches (17), and we continue to include these additional alternate sequences and annotate them with genes and other features as appropriate. Ensembl release 69 (October 2012) included GRCh37.p8 (i.e. the eighth patch release of the GRCh37 assembly). The mouse genome annotation, which also incorporates all current mouse CCDS models, was updated for Ensembl release 68 (July 2012) to reflect the new GRCm38 assembly. Other species previously available on our website also saw updates in the past year including new primary assemblies and gene sets for chimpanzee, dog, pig, ground squirrel, bushbaby and Ciona intestinalis. The gene sets for orang-utan, opossum and platypus were also updated using RNA-seq data.
The whole genome multiple and pairwise alignments have been re-run in conjunction with the incorporation of new or updated genomes. In addition to cross-species alignments, we now provide self-alignments for the human genome and also use the Ensembl comparative genomics infrastructure for the comparison of fix and novel patches alongside the reference human genome (Figure 1).
The year 2012 has seen the inclusion of RNA-seq data provided by several different groups (18–20) as supporting evidence for our gene annotations. Thirteen species currently incorporate RNA-seq data including zebrafish, chimpanzee, Nile tilapia, dog, Chinese softshell turtle, pig, ferret, platyfish, coelacanth, Tasmanian devil, orang-utan, opossum and platypus. For some of these species, the RNA-seq data were added after a standard gene annotation process (21), whereas for other species, the data were added as an integral part of the genebuild process. Some species also include tissue-specific RNA-seq data that enables the exploration of tissue-specific expression. In addition, the Illumina Human BodyMap 2.0 data (http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-513) have been re-processed using our enhanced pipeline to produce updated gene models and new BAM files.
RNA-seq data are now routinely used in gene annotation in a number of ways, and we anticipate that RNA-seq data will be used in almost all gene annotation projects for the foreseeable future. Briefly, our current procedure starts with raw-sequencing reads that are aligned to the genome and processed to produce RNA-seq-based gene models, BAM files and intron features that are supported by intron-spanning reads. Intron-supporting evidence helps to quantify intron predictions in RNA-seq transcript sets. The intron features and RNA-seq-based gene models are used alongside cDNA and EST alignments to compare and filter the preliminary set of protein-coding models against a set of highly supported splice sites. In addition, the RNA-seq-based gene models are used to provide alternate isoforms and fill in gaps between models identified by the standard Similarity Genewise component of our annotation system, which aligns protein sequences to the genome, and to add untranslated regions to the protein coding models.
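The idea of intron features supported by intron-spanning reads can be illustrated with a minimal sketch. It is not the Ensembl pipeline: it assumes alignments are already available as hypothetical (chromosome, position, CIGAR) tuples following SAM conventions, and it simply counts how many spliced reads support each candidate intron.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def introns_from_read(pos, cigar):
    """Return (start, end) pairs, 1-based inclusive, for each N gap in a read.

    pos is the 1-based leftmost aligned reference position, as in SAM.
    """
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # An N operation marks reference bases skipped by a spliced read.
            introns.append((ref, ref + length - 1))
        # M, D, N, = and X consume reference bases; I, S, H and P do not.
        if op in "MDN=X":
            ref += length
    return introns

def count_intron_support(reads):
    """Count intron-spanning reads per (chromosome, start, end) intron."""
    support = Counter()
    for chrom, pos, cigar in reads:
        for start, end in introns_from_read(pos, cigar):
            support[(chrom, start, end)] += 1
    return support

# Hypothetical alignments: two spliced reads sharing one intron, one unspliced.
reads = [
    ("1", 100, "50M1000N50M"),
    ("1", 120, "30M1000N70M"),
    ("1", 500, "100M"),
]
support = count_intron_support(reads)
```

In this toy input both spliced reads agree on the same intron boundaries, so that intron would carry a support count of two; a real pipeline would additionally filter on splice-site motifs, mapping quality and minimum read support.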
We have also developed an RNA-seq update pipeline that allows an existing Ensembl gene set to be updated through incorporation of new RNA-seq data. The RNA-seq update pipeline takes in the results of the standard Ensembl gene annotation method and also RNA-seq-based models produced by the pipeline previously described (20). The two sets of input models are compared and merged to produce an updated gene set. This new method was used to improve the existing opossum, platypus and orang-utan gene sets for Ensembl release 69 (October 2012). The method is particularly effective for species that are distantly related to the well-annotated mammals and those with little species-specific sequence data available at the time of initial annotation. Specific improvements from the RNA-seq update pipeline include lengthening truncated genes, merging adjacent gene fragments and splitting artificially merged genes. RNA-seq-based data are also useful for higher primate species that have previously relied largely on human sequence data for annotation, as they allow for the identification of non-human primate-specific gene expression.
We create variation resources for 17 species by importing and merging data from many different sources through our pipeline (22). The current list of variation data is provided at http://www.ensembl.org/info/docs/variation/sources_documentation.html. Most of our SNP and in-del data (rsIDs, locations, allele frequencies and genotypes) come from dbSNP (23). This year, we have updated the Ensembl Variation databases for human, rat, chimpanzee, orang-utan, zebrafish, pig, dog and macaque. We have also remapped the variation data for mouse onto the new GRCm38 assembly before updated GRCm38 mappings were provided by dbSNP and provided the same update for the new dog assembly. Available structural variation data have increased considerably, and we have data for human, mouse, horse, zebrafish, cow and macaque largely provided by the DGVa database of copy number and structural variation (24). The human structural variation data are more comprehensive than those of all other species combined and include >6 million variants of which 5624 are somatic. The variation database infrastructure storing genotypes has also been redeveloped to improve the responsiveness of our displays and to support non-diploid genomes.
The human variation data also include genotypes imported from the 1000 Genomes Project and the NHLBI Exome Sequencing Project (25), ~79 000 mutation data locations provided by HGMD (26), clinical variants on LRGs (27) and >135 000 somatic mutation positions from COSMIC (28). We have also added mitochondrial variants, information on clinical significance and global minor allele frequencies from dbSNP, as well as phenotype data for >287 000 variants from OMIM (4), the European Genome-phenome Archive (EGA) and the NHGRI GWAS catalog (29). We denote those variants present on three Affymetrix genotyping chips (GeneChip 100 K Array, GeneChip 500 K Array, GenomeWideSNP_6.0) and nine Illumina chips (CytoSNP12v1, Human660W-quad, Human1M-duoV3, CardioMetaboChip, HumanOmni1-Quad, HumanHap650, HumanHap550, HumanOmni2.5 and Human610_Quad), and also indicate those variants curated by UniProt (2).
For all species, we calculate the effect of each variant allele on overlapping Ensembl transcripts and whether the variant falls within an Ensembl regulatory feature, TF binding motif or a high information position within the motif. Our consequence annotation now uses defined Sequence Ontology (SO) terms (30) for all descriptions, which enable querying of ontological relationships in BioMart. More detailed consequence information is also provided for SNPs and in-dels in specific genomic locations such as splice sites. These SO terms have also been adopted by both the UCSC genome browser and ICGC providing a standard to enable easy comparison of variation annotation.
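As a rough illustration of assigning SO terms to coding changes, the sketch below compares a reference codon with a variant codon under the standard genetic code. It is a simplified toy rather than the Ensembl consequence pipeline: the function name is hypothetical, and it covers only single-codon substitutions, ignoring splice sites, UTRs, frameshifts and regulatory features.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def coding_consequence(ref_codon, alt_codon):
    """Return an SO term name for a substitution within a single codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        # A stop codon changed to another stop codon has its own SO term.
        return "stop_retained_variant" if ref_aa == "*" else "synonymous_variant"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense_variant"

# Example calls on illustrative codons.
examples = {
    ("GAA", "GAG"): coding_consequence("GAA", "GAG"),  # Glu to Glu
    ("GAA", "GTA"): coding_consequence("GAA", "GTA"),  # Glu to Val
    ("TGG", "TGA"): coding_consequence("TGG", "TGA"),  # Trp to stop
}
```

Because the returned strings are SO term names, results from such a classifier could in principle be grouped by their ontological parents, which is the kind of querying that standardized terms make possible.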
Other resources supporting human variation include calculated linkage disequilibrium values and tag SNPs, in addition to SIFT (31) and PolyPhen (32) predictions for amino acid changes. This year we have switched to using the Ensembl comparative genomics pipeline to provide the ancestral alleles of SNPs and short deletions for human, orang-utan, chimpanzee and macaque (previously this was imported from dbSNP). We have also extensively improved our quality control (QC) procedures, which leverage the eHive software and have been extended to include structural variations.
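For readers unfamiliar with linkage disequilibrium values, the following sketch computes the widely used r² statistic for two biallelic sites from phased haplotype counts. The text does not state which statistic or method Ensembl uses, so both the choice of r² and the function name are assumptions made for illustration.

```python
def ld_r_squared(n_AB, n_Ab, n_aB, n_ab):
    """Return r^2 between two biallelic sites from phased haplotype counts.

    n_AB is the number of haplotypes carrying allele A at the first site and
    allele B at the second, and so on. Returns None if either site is
    monomorphic, because r^2 is undefined in that case.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("at least one haplotype is required")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    # D measures the departure of the haplotype frequency from independence.
    d = p_AB - p_A * p_B
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        return None
    return d * d / denominator

perfect_ld = ld_r_squared(50, 0, 0, 50)   # alleles always co-occur
no_ld = ld_r_squared(25, 25, 25, 25)      # alleles are independent
```

A tag SNP selection procedure would typically pick a subset of variants such that every other variant has a high r² with at least one member of the subset.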
As a result of our effort to provide the most useful possible summaries of large data sets to our users, we have added new tracks for 1000 Genomes Project common variants and also tracks for each global 1000 Genomes population. Additionally, appropriate phenotype data have been collected into a dedicated section on the Ensembl gene pages. Finally, the documentation section of the website has also been extended and improved for all areas of Ensembl Variation especially for the Variant Effect Predictor (VEP), SO consequences, QC pipeline and API diagrams.
During the past year, development on the Ensembl web interface has continued a combined strategy of small incremental improvements on the website while making substantial progress on a number of major infrastructure-level projects.
On the data display front, we are now able to show alignments of human assembly patches to the reference assembly (Figure 1) and have renamed the ‘Multi-species view’ as ‘Region comparison’ to reflect its wider applicability. We have also added a transcript variation page, similar to the gene variation page but showing only one transcript at a time, which is particularly helpful in the case of large, well-annotated genes that are challenging to display quickly or interpret easily due to their data density. Other additions to the user interface include a new online tool, Region Report, which provides graphical access to the API script of the same name to export sequence, genes and other annotation from one or more regions. We have also re-introduced the ability to save configurations on images: users can turn their choice of tracks on and off and then save this selection in either the browser session or their personal accounts and then quickly return to the same layout at a later time. These configurations can also be grouped into sets (e.g. to combine a set of favourite variation tracks with a set of gene tracks) for even quicker reconfiguration of images.
We have started to refresh the look and feel of the website. For example, our icon set was previously created from various sources and has now been replaced with a single matching set. We have adapted the layout and colour scheme for increased readability, and we are continuing the process of replacing text-heavy pages with simpler, more user-friendly layouts where appropriate.
During the past year, we have significantly updated and increased the amount of data available from the Ensembl regulation database. As of Ensembl release 69 (October 2012), there are 532 ChIP-seq and DNase-seq data sets from 13 human and five mouse cell lines. In total, these data sets represent information about the genomic locations of 49 different histone modification types and the binding regions of 113 different TFs. Forty of these TFs have binding matrices available through the JASPAR database (33), and we have incorporated these motif data as positions of high probability TF-binding sites (5% False Discovery Rate) within the binding regions. We have also created a dedicated experimental summary page providing information on individual experimental details and summary metadata, such as references to the raw sequences reads available in the European Nucleotide Archive (34).
The data underlying the Ensembl Regulatory Build currently include experiments in 13 cell lines. Regulatory Build coverage has increased by 15% in the past year and now annotates 270 Mb of the human genome in 518 020 regulatory features. In Ensembl release 65 (December 2011), we introduced the combined Segway (35) and ChromHMM (36) segmentation analyses developed for ENCODE (9), which classifies the genome into regions based on 12 specific assays to obtain a single-track summary of the functional architecture of the human genome. The segmentation tracks are currently available for six human cell lines: GM12878, K562, H1-hESC, HepG2, HeLa-S3 and HUVEC. The segmentation tracks are displayed with specific views available from the ‘Regulation’ configuration in the Ensembl browser (Figure 2).
The Ensembl Regulation database and web views continue to provide various other data resources including the following: mapping of probe sets for all the common microarray platforms, DNA methylation from various projects including ENCODE, high profile externally curated data sets such as cisRED motifs (37) and an updated VISTA enhancer set (38).
New species added in the past year such as coelacanth and lamprey have provided our gene trees with representatives of new taxonomic groups. These species define additional branching points in the phylogenetic trees, enable splitting long branches and provide us with more taxonomic power to better resolve the gene trees. Further information on the evolution of the gene families is now provided by supplementing our phylogenetic analysis with a calculated assessment on the possible expansions and contractions in each family using the CAFE tool (39).
Our data model for gene trees has been modified to handle both protein and ncRNA gene trees. During that process, we also improved our support for protein super-trees, which are used in the resolution of very large protein families. These are split in sub-families, and the super-protein tree represents the relationship between these sub-families. We have developed a better identification and annotation of split genes that usually arise because of assembly errors (40). In our current implementation, the enhanced gene tree pipeline (41) detects gene split events after building the protein multiple alignment, and the resulting nodes of the tree can be annotated as gene split events when they relate to partial proteins that could be concatenated to form a full gene.
During the past year, we have made significant improvements to the Ensembl VEP (42) and launched a beta implementation of a new Ensembl REST API. The VEP provides comprehensive analysis of SNP, in-del or structural variation data including reports of which genes, transcripts, proteins or regulatory regions overlap the variants of interest and whether there is any change in amino acid sequence. It also includes information about SIFT and PolyPhen predictions in human, protein domains, exon/intron numbers, minor allele frequencies and other information. The VEP works with many different file formats and can also convert variant positions between different coordinate systems (Ensembl, RefSeq, LRG and HGVS). We have also written plugins to report on the degree of conservation, the presence of the variant in a Locus Specific Database (LSDB) that uses the Leiden Open Variation Database (LOVD) software (43) and other capabilities. Our VEP plugins are present in the ensembl-variation GitHub repository (https://github.com/ensembl-variation/VEP_plugins), and we encourage users to share their own plugins.
The REST API web service was released as a beta application this year at http://beta.rest.ensembl.org. Although we have a fully supported Perl API to all of the Ensembl data (44), the REST API addresses those users who wish to access Ensembl data in a language-agnostic manner. The web service is built using the Perl web framework Catalyst, Catalyst::Action::REST and our existing Perl API providing a rapid development environment and lowering the cost of creating new endpoints. Output is a combination of bioinformatics and programmatically relevant formats such as FASTA and JSON. We provide access to sequences, assembly mapping, homologues and integration of the VEP with support for genomic features. The REST service, like all Ensembl software, is free to download from our CVS server allowing users to deploy over their local Ensembl databases.
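A language-agnostic client needs little more than an HTTP request and a parser for the returned format. The sketch below builds a request URL for a sequence in FASTA format and parses a FASTA response. The base URL comes from the text, but the `/sequence/id/` path and the `content-type` query parameter are assumptions modelled on the later public Ensembl REST service and may not match the beta exactly; the stable identifier is only an example.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://beta.rest.ensembl.org"  # beta address given in the text

def sequence_url(stable_id, base_url=BASE_URL):
    """Build a URL requesting the sequence of a stable ID as FASTA.

    The endpoint path and parameter name are assumptions, not confirmed
    by the source.
    """
    query = urlencode({"content-type": "text/x-fasta"})
    return f"{base_url}/sequence/id/{quote(stable_id)}?{query}"

def parse_fasta(text):
    """Parse FASTA text into a dict mapping header line to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Example identifier used purely for illustration.
url = sequence_url("ENSG00000139618")

# A made-up response body, to show the parser without a network call.
sample = ">example_record\nACGTAC\nGTTA\n"
parsed = parse_fasta(sample)
```

The URL could then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`, and the decoded body passed to `parse_fasta`; requesting JSON instead would only change the content type and the parser.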
Each Ensembl release provides a full rebuild of seven BioMart (45,46) databases. Four of these BioMart databases (Ensembl Gene, Ensembl Variation, Ensembl Regulation and VEGA) are visible on the Ensembl BioMart interface, and the remaining three BioMart databases are hidden from view but are accessed through federation with visible BioMart databases to provide ontology, sequence and genomic feature data. Performing a complete rebuild each release ensures the availability of the most up-to-date integrated data from across the Ensembl project. Users can access these data via the MartView (web interface) and MartService (BioMart Perl API, DAS server, SOAP, REST, BioConductor biomaRt package).
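Programmatic access through MartService is driven by an XML query document that names a dataset, its filters and the attributes to return. The sketch below assembles such a document. The dataset, filter and attribute names used in the example (`hsapiens_gene_ensembl`, `chromosome_name`, `ensembl_gene_id`) and the `datasetConfigVersion` value are assumptions about a typical Ensembl Gene mart configuration, and the helper function itself is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_biomart_query(dataset, attributes, filters=None):
    """Return a BioMart XML query string for one dataset.

    attributes is a list of attribute names to return; filters maps filter
    names to values. All names are supplied by the caller.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",            # tab-separated output
        "header": "0",                 # no header row
        "uniqueRows": "1",             # collapse duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in (filters or {}).items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body

# Example: gene identifiers on one chromosome (names are assumptions).
xml_query = build_biomart_query(
    "hsapiens_gene_ensembl",
    attributes=["ensembl_gene_id"],
    filters={"chromosome_name": "21"},
)
```

The resulting string would normally be sent as the `query` parameter of a request to a MartService URL; the biomaRt package and the BioMart Perl API construct equivalent queries on the user's behalf.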
Each Ensembl BioMart release includes the addition of any new species, updated assemblies, updates to the germline and somatic variation and structural variation data sets as well as updates to the regulation data. One can now obtain our SIFT and PolyPhen predictions and scores from the Ensembl variation BioMart and from the variation ‘filter’ and ‘attribute’ sections of the Ensembl gene BioMart. It is also possible to select specific mouse strain information from the mouse structural variation data set, and one can filter on the source and study accession of interest in the structural variation data sets available for cow, zebrafish, horse, human, mouse and macaque. A new human somatic structural variation dataset has been added containing data from COSMIC (28). The ability to search multiple chromosomal regions at once has been added to the Ensembl Regulation mart. In addition to this, users can query human regulatory segmentation features using the newly added regulatory segments filter section and attribute page.
Ensembl supports new and existing users in a variety of ways from a strong and increasing on-line presence to direct face-to-face training at universities and other institutions worldwide. This year, we held one-day workshops on five continents and launched new virtual initiatives available to all including those further afield or without the means to host a one-day workshop.
We provide extensive free and user-driven tutorials via the Ensembl YouTube (http://www.youtube.com/user/EnsemblHelpdesk) and YouKu (http://i.youku.com/u/id_UMzM1NjkzMTI0) channels and e-learning course (http://www.ebi.ac.uk/training/online/course/ensembl-browsing-chordate-genomes). The Ensembl YouTube channel has >165 subscribers and >91 000 video views, and now hosts >20 videos including navigation ‘how-to’ guides. This year, we have added more advanced videos covering subjects such as patches and haplotypes on the human assembly, API installation and how RNA-seq data are used in the genebuild. In 2012, the top 20 countries accessing our on-line training reflect a worldwide audience from the USA, Europe, India, Japan, Australia, Pakistan, Taiwan, Mexico, South Korea and Brazil, and our most popular videos have been viewed hundreds or thousands of times.
We communicate more informally and highlight updates and new features using the Ensembl blog (http://www.ensembl.info/), Facebook page (http://www.facebook.com/Ensembl.org) and Twitter account (http://twitter.com/ensembl). Our Helpdesk (helpdesk/at/ensembl.org) continues to provide email support for >100 questions monthly, and we are exploring webinars as a vehicle for more interactive long-distance learning and plan to offer more of these events in 2013.
The Wellcome Trust provides majority funding for the Ensembl project [WT062023 and WT079643] with additional funding from the National Human Genome Research Institute [U01HG004695, U54HG004563 and U41HG006104], the BBSRC [BB/I025506/1] and the European Molecular Biology Laboratory. Additional support for specific project components as specified: Funded by the European Commission under SLING, grant agreement number 226073 (Integrating Activity) within Research Infrastructures of the FP7 Capacities Specific Programme; The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 222664 (“Quantomics”). This Publication reflects only the author's views and the European Community is not liable for any use that may be made of the information contained herein; The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754 – the GEN2PHEN project; The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under the grant agreement no 223210 CISSTEM; The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no 282510 – BLUEPRINT. Funding for open access charge: The Wellcome Trust.
Conflict of interest statement. None declared.
The authors are grateful to their users, and especially to those who take the time to contact them through the mailing lists, blog and other avenues. They acknowledge those researchers, organizations and large-scale projects that have provided data to Ensembl before publication under the understandings of the Fort Lauderdale meeting discussing Community Resource Projects and the Toronto meeting on pre-publication data sharing.