Biological databases are an important resource for the life sciences community. Accessing the hundreds of databases supporting molecular biology and related fields is a daunting and time-consuming task. Integrating this information into one access point is a necessity for the life sciences community, which includes researchers focusing on human disease. Here we discuss the Ensembl genome browser, which acts as a single entry point, with a graphical user interface, to data from multiple projects, including OMIM, dbSNP, and the NHGRI GWAS catalog. Ensembl provides a comprehensive source of annotation for the human genome, along with other species of biomedical interest. In this unit, we explore how to use the Ensembl genome browser in example queries related to human genetic diseases. Support protocols demonstrate quick sequence export using the BioMart tool.
The number of databases supporting molecular and cell biological research is growing. Starting with OMIM (Online Mendelian Inheritance in Man; Borate and Baxevanis, 2009), a collection of human diseases and phenotypes developed in the 1970s, and moving up to the NHGRI’s recently developed GWAS (Genome-Wide Association Studies) catalog (Hindorff et al., 2009), focusing on trait/disease-associated variations, the data available to the biological community are vast. Last year’s NAR database issue lists 1330 databases focusing on aspects of life sciences (Galperin and Cochrane, 2011).
However, integrating these data into a single database and/or graphical user interface, such as a genome browser, presents challenges. Quality of information, data formats, and underlying sequences can differ, and the need for security in dealing with patient data restricts data access (Horaitis and Cotton, 2005). The necessity of integrating information from disparate sources is clear, and projects are currently underway to standardize data formats (Dalgleish et al., 2010). The ability to access effects of sequence variation on genes, protein products, and diseases or phenotypes from one central point would allow faster integration and understanding of the effect of sequence variation on organisms, fueling fields such as pharmacogenomics (UNIT 9.19).
Genome browsers (the UCSC genome browser, NCBI Map Viewer, and the Ensembl genome browser) provide useful tools for data access from different sources. In this protocol, we focus on the Ensembl graphical user interface at http://www.ensembl.org. From sequence variation associated with human disease, to conserved genomic regions calculated from multispecies alignments, we present a how-to guide to accessing data that supports research and understanding, using the Ensembl genome browser.
Basic Protocol 1 takes a variation-centric view of genome browsing. We enter the browser by searching for a single nucleotide polymorphism (SNP) associated with hereditary hemochromatosis. This SNP is nonsynonymous in ten splice variants of the HFE gene. Navigation through the variation views for this SNP reveals the risk allele (A), individual genotypes, and the phylogenetic context (see Internet Resources) summary. We also view expression data for the HFE gene stored in the ArrayExpress database (Parkinson et al., 2009).
In Basic Protocol 2, we enter the genome browser with a sequence. The Ensembl BLAST-Like Alignment Tool (BLAT; Kent, 2002) is used to position a short oligonucleotide sequence within the human genome, allowing identification of variations and genes (the MYC gene) corresponding to this sequence. The focus of this protocol is on location views, showing a region of the genome. Tissue-specific methylation patterns in this region are also examined.
Basic Protocol 3 explores functional information for the protein product of a human oncogene in the RAS superfamily of GTPases, HRAS, using gene ontology. The individual sequences of J. Watson (Wheeler et al., 2008) and C. Venter (Levy et al., 2007) are compared to the reference sequence. Linkage disequilibrium (LD) of associated variations is viewed as LD plots, and exported.
Basic Protocol 4 explores a region of the genome. We investigate the basis for a predicted regulatory sequence in a highly conserved region.
Support Protocols 1 and 2 focus on the export of sequences from the browser, and present the BioMart data-mining tool as an option to quickly export sequence and other gene annotation.
Please note that the discussion in this unit pertains to Ensembl version 60. Refer to our archive site at http://Nov2010.archive.ensembl.org/index.html for consistency.
Views: Variation tab (gene/transcript, individual genotypes, phenotype data), Gene tab (gene summary, external data: gene expression atlas).
In this protocol, we enter the Ensembl genome browser by searching with the dbSNP ID rs1800562 (Benyamin et al., 2009). This is the identifier for an SNP associated with hereditary hemochromatosis, a disease in which iron is not metabolized. This SNP has also been reported in the literature as C282Y, referring to the nonsynonymous status of this variation, which codes for cysteine or tyrosine (Cullen et al., 1999; Lucotte and Dieterlen, 2003).
The search function is the main entry point to the Ensembl genome browser; searching is described in more detail elsewhere (see Basic Protocol 1 in Fernández-Suárez and Schuster, 2010). A search can be performed using a gene symbol, name, or description; an identifier from a public sequence database such as UniProtKB or NCBI Entrez Gene; a gene ontology term from the GO project (Ashburner et al., 2000); a protein domain; a disease; or, as in Basic Protocol 1 of this unit, a variation.
The Ensembl home page (Fig. 6.11.1) provides links to all genomes housed in Ensembl, which include nearly 50 vertebrate species.
The “view full list of all Ensembl species” link (circled in Fig. 6.11.1) allows access to a sister project focusing on invertebrates.
Data for the Ensembl site are updated regularly. News for the latest update (or release) is shown at the bottom right of the home page.
Species-specific news can be obtained by choosing a species from the drop-down menu in the “All genomes” section of the home page. Click on Human to go to the human index page. Click on What’s New at the left of the human home page to see human-specific updates in the current release.
Sources of variations for human include NCBI dbSNP (Sayers et al., 2009), Affymetrix, and Illumina, and individual sequences (J. Watson and C. Venter). Variations in Ensembl are preferentially assigned a dbSNP identifier. If there is no dbSNP record for a variation, either the identifier from the contributing source is used, or, in the case of the alignment with Watson and Venter’s sequences, Ensembl assigns an ID (Chen et al., 2010).
You should now be in the variation tab for rs1800562 (Fig. 6.11.2). Links at the left lead to information specific to this variation. The location tab will also be showing. In the location tab, you can explore a region of the genome. We will explore this tab, along with the gene and transcript tabs (not shown here), in Basic Protocol 2.
Figure 6.11.2. The variation tab for rs1800562. Ensembl genes and transcripts containing this variation are available through the Gene/Transcript link (1), genotype information for this variation can be accessed by “Individual genotypes” (2), phenotypes ...
The variation summary view (Fig. 6.11.2), which you should be currently looking at, shows the source(s) of the variation. A link to the dbSNP record is provided under “Variation class.” Synonyms, or other sources including this variation, are listed below the dbSNP ID. For example, rs1800562 is mapped in the Affymetrix GeneChip 500K Array, two Illumina arrays, and also UniProt. Also shown in the summary are links to linkage disequilibrium plots, and the flanking sequence, with the variation marked in red.
There are fourteen transcripts associated with this variation. Human Ensembl transcript IDs begin with ENST, which is followed by a unique, eleven-digit number. Ensembl transcripts that overlap in coding sequence are assigned to the same gene identifier (ENSG …). This view shows that ENSG00000010704 is the only gene associated with rs1800562.
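The identifier format described above can be checked mechanically. The following is a minimal sketch, assuming that human stable IDs consist of the prefix ENSG (gene) or ENST (transcript) followed by exactly eleven digits; the function name `classify_stable_id` is hypothetical and is not part of any Ensembl tool.

```python
import re

# Assumed pattern, taken from the description in the text: "ENS", a feature
# letter (G = gene, T = transcript), then exactly eleven digits.
HUMAN_STABLE_ID = re.compile(r"^ENS([GT])(\d{11})$")

FEATURE_TYPES = {"G": "gene", "T": "transcript"}


def classify_stable_id(identifier):
    """Return 'gene' or 'transcript' for a well-formed human ID, else None."""
    match = HUMAN_STABLE_ID.match(identifier.strip())
    if match is None:
        return None
    return FEATURE_TYPES[match.group(1)]


if __name__ == "__main__":
    for candidate in ("ENSG00000010704", "ENST00000397022", "ENST0000397022"):
        print(candidate, "->", classify_stable_id(candidate))
```

A check like this is mainly useful for catching identifiers that lost or gained a digit when copied from the literature or from a spreadsheet.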
The SNP is nonsynonymous in ten of the transcripts shown. The position in the transcript, counting from the transcript start site, is given where applicable. For example, in the transcript, ENST00000397022, the SNP position in the transcript is 936 bp.
The amino acid positions of both synonymous and nonsynonymous coding variations are displayed in a column at the right of the table. For ENST00000397022, the SNP is located at amino acid position 259.
The protein alleles are also listed; e.g., in ENST00000397022, the amino acid at position 259 is either a cysteine (C) or tyrosine (Y). This SNP was first identified in the literature as C282Y. Only two of the transcripts, ENST00000357618 and ENST00000309234, encode proteins with the variation at this position. This represents one challenge in transferring variations reported in the literature to positions on genes, transcripts, or the genome. Often, the literature reports the position on one transcript, and it is not always clear which transcript was examined. Indeed, at the time of discovery of the variation, only one transcript for a gene may have been known.
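The mismatch between a literature label such as C282Y and the position reported for a given transcript can be illustrated with a short sketch. It assumes the label follows the common "reference residue, position, variant residue" pattern; the function names are hypothetical, and the position table contains only the three transcripts whose positions are stated in this protocol.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# One-letter reference residue, 1-based position, one-letter variant residue.
PROTEIN_CHANGE = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


def parse_protein_change(label):
    """Split a label such as 'C282Y' into (reference, position, variant)."""
    match = PROTEIN_CHANGE.match(label.strip().upper())
    if match is None:
        raise ValueError(f"not a simple protein change label: {label!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def transcripts_matching(label, positions_by_transcript):
    """List transcripts whose variant position equals the one in the label."""
    _, position, _ = parse_protein_change(label)
    return sorted(
        transcript
        for transcript, transcript_position in positions_by_transcript.items()
        if transcript_position == position
    )


# Amino acid positions of rs1800562 as given in the text for three transcripts.
POSITIONS = {
    "ENST00000397022": 259,
    "ENST00000357618": 282,
    "ENST00000309234": 282,
}

if __name__ == "__main__":
    print(parse_protein_change("C282Y"))
    print(transcripts_matching("C282Y", POSITIONS))
```

Only the transcripts whose numbering agrees with the literature label are returned, which makes explicit why a label alone does not identify a position unless the transcript is also named.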
For more variation positions and effects (including splice site and stop gained/lost), see Chen et al. (2010).
The variation rs1800562 has been genotyped in 853 subjects. Populations analyzed by a variety of projects that submitted genotype information into dbSNP are shown at the right, in the “Populations” column. For example, the Windber Research Institute is listed in the table as a source of genotype data. Click on the links to find the original entries submitted into NCBI dbSNP.
The majority of the genotyped individuals are G|G. This can also be seen in the “Population genetics” link.
The strongest risk allele for several phenotypes is “A.” The phenotype relationships are taken from the NHGRI GWAS catalog. Compare this with the genotype found in the “Individual genotypes” or “Population genetics” view discussed in step 7, to deduce that the A allele is quite rare, at least in the populations studied.
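The deduction that the A allele is rare can be made explicit by converting genotype counts into allele frequencies. The sketch below assumes diploid genotypes written in the "G|G" style used in the browser; the counts are invented for illustration and are not the values shown in Ensembl.

```python
from collections import Counter


def allele_frequencies(genotype_counts):
    """Convert counts of diploid genotypes (e.g. 'A|G') to allele frequencies."""
    allele_counts = Counter()
    for genotype, individuals in genotype_counts.items():
        for allele in genotype.split("|"):
            allele_counts[allele] += individuals
    total_alleles = sum(allele_counts.values())
    return {allele: n / total_alleles for allele, n in allele_counts.items()}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate a rare A allele.
    example_counts = {"G|G": 90, "A|G": 9, "A|A": 1}
    for allele, frequency in sorted(allele_frequencies(example_counts).items()):
        print(f"{allele}: {frequency:.3f}")
```

Each individual contributes two alleles, so a heterozygote adds one count to each allele and a homozygote adds two counts to the same allele.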
Linkage disequilibrium, expressed as r² and D′ values, is shown for different populations (e.g., HapMap groups). No data are shown for rs1800562 in version 60; however, look for variation rs1333049 for an example with linkage data.
The phylogenetic context view shows the nucleotide for other species that align in this region. Multiple genome alignments across 12 eutherian mammals were calculated by the Ensembl Compara team using the EPO pipeline (Fig. 6.11.3).
Figure 6.11.3. Phylogenetic context for rs1333049. Variations are highlighted within the mammals in the alignment. The view is centered on rs1333049 in human.
The top panel of all “Gene” pages displays basic information, such as the gene name (HFE), the stable identifier (ENSG00000010704), and the genomic location, which is chromosome 6, base pairs 26,087,509-26,098,571 (from the start of the chromosome). The forward strand of the chromosome is indicated. A table lists all alternative splice variants for this gene and their corresponding protein products (Fig. 6.11.4A).
Figure 6.11.4. (A) Transcript table in the gene tab for the human HFE gene. Fourteen transcripts are shown. The twelve protein-coding transcripts are listed first. Seven transcripts are found in the CCDS set, and all transcripts have been identified by manual annotation ...
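For readers unfamiliar with the two linkage disequilibrium measures, the following sketch computes them from haplotype and allele frequencies using the standard textbook definitions. It is not the code Ensembl uses, and the input frequencies are hypothetical.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (|D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at the first locus
          and allele B at the second locus.
    p_a, p_b: frequencies of allele A and allele B.
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical frequencies: allele A is only ever seen together with B.
    d_prime, r_squared = linkage_disequilibrium(p_ab=0.3, p_a=0.3, p_b=0.5)
    print(f"D' = {d_prime:.2f}, r2 = {r_squared:.2f}")
```

In this example D′ is at its maximum because one haplotype is absent, while r² is well below 1 because the allele frequencies at the two loci differ. This is why both values are usually reported.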
The bottom panel of the “Gene” page shows a graphic of all transcript variants of this gene in their genomic context (Fig. 6.11.4B).
Transcripts are color-coded, depending on how they are determined. Red indicates protein-coding transcripts that have either been annotated automatically (Ensembl) (Curwen et al., 2004) or manually (HAVANA; Wilming et al., 2008), while gold indicates protein-coding transcripts where automated and manual annotation lead to the same results (Ensembl-HAVANA merge). These “gold transcripts” have a higher probability of being correct, as two independent annotation methods (the Ensembl automatic annotation pipeline and HAVANA manual annotation) lead to the same transcript structure.
A second measure of quality is found in the CCDS set (Pruitt et al., 2009a). The transcript table (Fig. 6.11.4A) indicates which transcripts have a consensus coding sequence agreed upon by Ensembl and RefSeq (Pruitt et al., 2009b). HAVANA and UCSC (Karolchik et al., 2009) are consulted when there are discrepancies in these transcripts.
Blue transcripts in the diagram are noncoding, and are listed as such in the table. In the case of HFE, in release 60, seven protein-coding transcripts are agreed upon by Ensembl and HAVANA (gold transcripts), five protein-coding transcripts are from one source only, either Ensembl or HAVANA (red), and two transcripts are noncoding (blue). The seven gold transcripts have a CCDS identifier, meaning they are also agreed upon by NCBI and UCSC. Transcript names with a number beginning with “0” are manually curated by the HAVANA project, while those starting with “1” are from the Ensembl pipeline. Thus, all the red transcripts are from HAVANA (HFE-011, HFE-014, HFE-015, HFE-019, and HFE-024). The GENCODE set currently includes both HAVANA manual annotation and Ensembl automatic annotation, with the aim of representing all human transcripts (Searle et al., 2010).
This link shows data from the ArrayExpress project, housed at the EBI. The Gene Expression Atlas contains curated data showing individual gene expression across experiments, and across biological conditions.
In this case, we can see the HFE gene is expressed in different organs (such as skeletal muscle and the pancreas). In some conditions, it is found to be over-expressed, and in others it is under-expressed.
Views: BLAT, Location [“Region in Detail,” Alignments (text)], Variation (Population genetics, Gene/Transcript), Gene (Variation Image, Variation Table), Transcript (cDNA).
Studies using the oligonucleotide sequence:
show hybridization to genomic DNA from diseased human cells, and no hybridization to wild-type human cells (under stringent conditions). Assume that you suspect at least one mutation in this sequence.
In this protocol, we use BLAT to align the oligonucleotide to the genomic sequence. The MYC gene, which overlaps with this sequence, is investigated, and a variation mapping to this sequence is identified. We discuss the potential effect of the variation on protein signatures and DNA methylation sites.
BLAT is the default alignment program in Ensembl. BLAT runs quickly, and is best for aligning sequences that will have an exact match. BLASTN and TBLASTX are also available for nucleotide searches, and are recommended for sequence alignments where gaps and/or mismatches are expected. BLASTP and TBLASTN are available for protein queries.
Alternate entries into the BLAST program include a sequence ID or accession number, or an existing ticket ID.
The human karyotype is shown (Fig. 6.11.5, labeled “1”), with the best hit boxed. In this case, there is only one hit, on chromosome 8. Look down the page (Fig. 6.11.5, labeled “2”). The high-scoring pair (HSP) is diagrammed in red along the query sequence, which is drawn as white and black blocks. This HSP appears to align with most if not all of the query sequence.
Further down the page is a hit table (Fig. 6.11.5, labeled “3”). This table is customizable. By default, the score, E-value, %ID, and length are shown. In this case, the hit has a score of 164, a low E-value (the number of hits with a score at least this good that would be expected by chance), a high %ID, and a length nearly matching the query (remembering that the query sequence is only 34 nucleotides long).
The alignment between the query sequence and the genomic sequence (chromosome 8, base pairs 128750508 to 128750541) shows only one mismatch (Fig. 6.11.6). The “g” at base pair 33 in the query sequence does not match the “a” at base pair 128750540 in chromosome 8. The + (Fig. 6.11.6, circled) denotes the positive strand, meaning the query sequence aligns to the positive, or forward, strand of chromosome 8.
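The relationship between a position in the query and a genomic coordinate can be sketched for an ungapped, forward-strand alignment like this one. The sequences below are placeholders, since the oligonucleotide itself is not reproduced here; only the start coordinate and the mismatch position are taken from the text.

```python
def find_mismatches(query, target, target_start):
    """Compare two ungapped, equal-length aligned sequences.

    Returns a list of (query_position, genomic_position, query_base,
    target_base) tuples; both positions are 1-based, and the target is
    assumed to lie on the forward strand starting at target_start.
    """
    if len(query) != len(target):
        raise ValueError("sequences must have the same length (no gaps)")
    mismatches = []
    pairs = zip(query.lower(), target.lower())
    for offset, (query_base, target_base) in enumerate(pairs):
        if query_base != target_base:
            mismatches.append(
                (offset + 1, target_start + offset, query_base, target_base)
            )
    return mismatches


if __name__ == "__main__":
    # Placeholder 34-mers, NOT the real oligonucleotide or genomic sequence.
    genomic = "acgt" * 8 + "ac"
    oligo = genomic[:32] + "g" + genomic[33:]
    print(find_mismatches(oligo, genomic, target_start=128750508))
```

With a start coordinate of 128750508, the 33rd base of the query corresponds to genomic position 128750508 + 32 = 128750540, in agreement with the alignment described above.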
The “Region in detail” view is divided into three panels representing different zoom levels into the genome sequence. A red box outlines the extent of the region displayed in the panel one level below. The red box in chromosome 8 (Fig. 6.11.7, labeled “1”) outlines 1 Mb of sequence, and is expanded in the “top panel” (Fig. 6.11.7, labeled “2”). The red box in panel 2 is expanded to show sequence encoding the MYC gene in the “main panel” (Fig. 6.11.7, labeled “3”).
Genome sequence annotation is organized in tracks and the entire data display is highly customizable using the configuration dialog. Individual tracks can be added or removed from the display, and the genome sequence region can also be zoomed in and out over a broad range.
Click on a track name for information about the source of the track.
The uppermost panel shows an ideogram of the entire chromosome, indicating centromeric and telomeric regions, as well as the cytogenetic banding pattern, if one has been established for the species. Bands are labeled if space permits. The red box marks the location of any gene you were previously browsing, or the center of the view (in this case, it is determined by our BLAT hit).
The top panel below the chromosome diagram provides an overview of a 1-million base-pair region for vertebrate species, or a 0.5-million base-pair region for other species with denser genomes. For a larger region of the genome, click on “Region overview” at the left of the page. The top panel is centered on the gene of interest (or in this case, our BLAT hit). A scale bar in the top panel illustrates the physical map coordinates for the region. The “Contigs” track indicates in alternating light and dark blue color the individual sequences that form the genome sequence assembly. Greater than or less than signs provide information regarding in which orientation a sequence region has been incorporated into the genome sequence. For human, mouse and zebrafish, the assembly almost exclusively comprises bacterial artificial chromosome (or BAC) clones, which are also stored in the EMBL, GenBank, and DDBJ nucleotide repositories. BAC clones are labeled with accession numbers assigned by these public nucleotide sequence archives, and can be shown in the “Main panel” of this display.
By default the top panel also provides a graphic of Ensembl and HAVANA genes, as well as noncoding RNA genes and immunoglobulin and T-cell receptor genes that have been annotated in this region. Clicking any of these gene names provides more information about a particular gene, along with links to Gene pages.
The main panel provides the finer details of the genome sequence and its annotation down to the base-pair level. The main panel of the display is highly configurable, in that many tracks can be added, representing features of various types that Ensembl annotates on a genome-wide scale. In contrast to the top panel, the main panel can be zoomed over a range from a single base pair up to a 1-Mbp overview of genes and other annotation.
As in the gene summary described in Basic Protocol 1, features shown above the genome sequence (blue bar) are annotated on the forward strand, while those shown below are on the reverse strand. Nonstranded features would be shown at the top (e.g., whole genome, multiple sequence alignments, and conservation scores) or bottom of the panel (e.g., pairwise conserved blocks and genetic variation information).
While the top panel shows genes, the main panel shows individual transcripts color-coded according to the description in Basic Protocol 1, step 11. Six transcripts of the MYC gene are shown in the view.
The BLAT hit (red-filled rectangle, circled in Fig. 6.11.7) aligns to the 5′ end of one MYC transcript, MYC-201, determined by the Ensembl annotation pipeline. A UTR has not been annotated in this transcript based on the evidence available.
Two gold transcripts (HAVANA/Ensembl merges) are also shown (MYC-001 and MYC-002). The BLAT hit aligns to coding sequence (filled boxes) in these two transcripts. The dotted lines extending out of the view indicate that MYC-001 and MYC-002 are not fully shown.
Now that we are looking at a larger region, let us look at the tracks turned on by default (Fig. 6.11.8).
Alignments of cDNA/mRNA sequences in the NCBI RefSeq set and in EMBL-nucleotides (Sterk et al., 2007) are shown in collapsed format, in green. These are alignments to the genome. Close inspection shows that many of these alignments mimic the exon structure of the MYC transcripts; thus, they can be seen as supporting the MYC transcript structure. Click on these green bars to find the accession number and description of the aligned sequence. They can be shown in expanded (normal) format using the “Configure this page” button at the left of the view.
The CCDS set is also shown. Only one CCDS sequence aligns to the genome in this location, supporting MYC-001. The Human RefSeq/EMBL-nucleotides set and the CCDS sequence are drawn above the blue bar, indicating they are on the forward strand of the genome.
Finally, regulatory features are drawn under the genome. The “core features” indicate sites of chromatin accessibility [DNase I hypersensitive sites (Gross and Garrard, 1988), transcription factor binding sites (http://www.ensembl.org/info/docs/funcgen/index.html), and CTCF binding sites (Nikolaev et al., 2009)]. The core features are extended to indicate positions of histone modification sites across cell types. Data for these features come from the ENCODE project (ENCODE Project Consortium, 2007). Click on any of these tracks (gray bars) for more information.
Turn on the three tracks with 16 amniota vertebrates in the name using the “Multiple alignments” menu of the configuration dialog. The “16 way GERP scores” track shows a conservation score (Cooper et al., 2005) for every nucleotide in a whole-genome alignment of sixteen amniota vertebrates (Chicken, Chimpanzee, Cow, Dog, Gorilla, Horse, Human, Macaque, Marmoset, Mouse, Opossum, Orangutan, Pig, Platypus, Rat, and Zebra Finch). This alignment was performed using the Pecan program (Paten et al., 2008). Positive scores indicate highly conserved nucleotides. Regions of conserved nucleotides (consecutive peaks in the GERP score plot) are shown in the “16 way GERP elements” track.
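The idea that runs of positively scored nucleotides form conserved regions can be illustrated with a deliberately simplified sketch. This is not the GERP method for calling constrained elements; it only groups consecutive positive scores, and the scores, start coordinate, and minimum run length are all hypothetical.

```python
def positive_runs(scores, start, min_length=3):
    """Group consecutive positive per-base scores into (start, end) regions.

    scores: per-nucleotide conservation scores, in genomic order.
    start: genomic coordinate of the first score.
    min_length: shortest run to report (an arbitrary illustrative cutoff).
    """
    regions = []
    run_start = None
    for index, score in enumerate(scores):
        if score > 0:
            if run_start is None:
                run_start = index
        else:
            if run_start is not None and index - run_start >= min_length:
                regions.append((start + run_start, start + index - 1))
            run_start = None
    # Close a run that extends to the end of the score list.
    if run_start is not None and len(scores) - run_start >= min_length:
        regions.append((start + run_start, start + len(scores) - 1))
    return regions


if __name__ == "__main__":
    example_scores = [1.2, 2.0, 0.5, -1.0, 0.3, 0.4, -0.2, 1.0, 1.1, 0.9, 2.3]
    print(positive_runs(example_scores, start=100))
```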
The display should now be zoomed in (Fig. 6.11.9).
The available tracks are divided into submenus (Fig. 6.11.10, labeled “1”). The “Active tracks” are the data selected by default.
Figure 6.11.10. The configuration dialog for the location tab: “region in detail” view. Click on the “configure this page” link at the left of the “region in detail” to select or deselect data tracks. Active tracks are ...
For example, we could expand the Human RefSeq/EMBL cDNA track by clicking on the half-filled box to the left of the track, and changing the selection to “normal.”
As discussed in step 9 of Basic Protocol 1, this information is shown using DAS. In this case, regions of methylated DNA are immunoprecipitated using an antibody against methylated cytosines (Deng et al., 2009).
Different cell and tissue types are listed. These reflect methylation sites studied on a genome-wide level using microarray analysis (Rakyan et al., 2008). Click on the empty box at the left of “MeDIP-chip B-cells” and select “normal” (Fig. 6.11.11).
Figure 6.11.11. Searching with the term MeDIP in the configuration dialog reveals multiple tracks showing DNA methylation across different tissue types. These tracks can be found in the “Functional genomics” menu.
This region is rich in annotation. Sequence variants are drawn along the contig, indicating their position in the genomic sequence. The colors reflect their position and effect (if any) on the transcripts, according to the variation legend at the bottom of the view.
Note that our BLAT hit aligns to the same genomic position as a yellow (nonsynonymous coding) variation. We will come back to this in step 24.
Finally, we see annotation from the MeDIP-chip study done using B cells.
A pop-up box indicates the type, the start, the end, and the score of this methylated region.
After zooming in, you should see the yellow nonsynonymous coding variant aligning to the 3′ end of the BLAT hit (look at the “Sequence variants” track).
The pop-up box reveals the rs ID (dbSNP reference SNP ID) for this variant (rs4645959). The genomic position of the SNP is 128750540. This corresponds to the mismatched allele seen in step 5 of this protocol. The alleles in the box are given as A/G. The “A” is listed first, indicating that is the allele in the reference sequence. Our starting sequence has a “G” at position 128750540.
The Variation tab has opened, described in Basic Protocol 1, steps 5 to 10.
Looking through different population data focusing on populations from the HapMap project (International HapMap Consortium et al., 2007), it is clear that “G” is the minor allele. The “G” allele is seen at a low frequency in Japanese, Chinese, Yoruba, and European populations. Heterozygotes for the G allele (A|G) exist in 0% to 11% of the different populations; however, no individual has been found to be homozygous for the G allele. The “Individual genotypes” link reveals the alleles for each individual in these studies.
No Phenotype Data are available in Ensembl for this variation in release 60; therefore, the Phenotype Data link at the left is disabled.
The six possible transcripts for the MYC gene are listed in the table at the top of the page. They are drawn in the image below a track displaying all sequence variants in this region as vertical lines (Fig. 6.11.14, labeled “1” and “2”).
Figure 6.11.14. The “variation image” in the gene tab. (1) All variations in MYC transcripts are displayed as vertical lines, color-coded according to the legend at the bottom of the view (not shown in the figure). (2) The six MYC transcripts are drawn. ...
Each transcript is expanded, and variations in the transcript are drawn as colored boxes underneath each exon. The colors reflect the position of the variant in the transcript (e.g., intronic, coding, or UTR). A legend describing these colors is found at the bottom of the view. In the case of coding variants, nonsynonymous SNPs are colored in yellow, and synonymous SNPs are colored in green.
In addition, the amino acid(s) are written within the box according to the single-letter code. Intronic variations within 100 bp of an exon/intron junction are drawn, but this may be changed (fewer, or more intronic variations can be shown) using the Intron Context menu in the configuration dialog.
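The effect of the intron context setting can be sketched as a simple coordinate filter. The exon coordinates below are invented, and the function name `is_drawn` is hypothetical; the sketch only mirrors the rule described above, in which a variation is shown if it lies in an exon or within a chosen number of base pairs of an exon/intron junction.

```python
def is_drawn(position, exons, intron_context=100):
    """Return True if a variation falls in an exon or in the flanking
    intron_context base pairs on either side of an exon.

    exons: list of (start, end) tuples, 1-based and inclusive.
    """
    for exon_start, exon_end in exons:
        if exon_start - intron_context <= position <= exon_end + intron_context:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical exon coordinates, for illustration only.
    exons = [(1000, 1200), (2000, 2300)]
    for position in (1100, 1250, 1500, 1900):
        print(position, is_drawn(position, exons))
```

Raising `intron_context` shows more intronic variations, and lowering it shows fewer, which corresponds to the choice offered by the Intron Context menu.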
Clicking on the SNP should cause a pop-up box to open with an ID of rs4645959. The link to Variation Properties opens the variation tab for this SNP. The potential alleles shown are A/G, in the nucleotide sequence, and N/S in the protein sequence. In the SNP information boxes, the first allele is the one in the reference sequence, in this case GRCh37 from the International Human Genome Sequencing Consortium (see Internet Resources).
Protein signatures from various databases [in this case, PROSITE (Sigrist et al., 2010), Pfam (Finn et al., 2010), PRINTS (Attwood et al., 2003), SMART (Letunic et al., 2009), and SUPERFAMILY (Wilson et al., 2009)] are drawn as purple boxes underneath the transcript. The variation rs4645959 falls within the Pfam domain PF01056, named Tscrpt reg Myc N (Fig. 6.11.14, labeled “4”). Click on the domain for more information, including links to the original database and InterPro (McDowall and Hunter, 2011). This domain is involved in regulation of transcription of the MYC gene. We might hypothesize that the sequence variant we find in diseased cells (leading to an amino acid substitution from asparagine to serine) causes a disruption in transcription, due to its presence in this transcriptional regulation domain. Though the change from asparagine to serine is relatively conservative (they are both hydrophilic), a loss of size and/or a loss of an NH2 group might disrupt specific interactions with a binding partner. Experiments would have to be done to verify any functional effect of this variation on the domain.
We can compare the variation directly with the domain by following the yellow line. In fact, following the line through all three transcripts shows that this variation (and the Pfam domain) is present in five of the six transcripts.
The variation table lists all the consequence types of the variations shown in the variation image. Click on “show” in front of any variation type to reveal a table of this type of variation in the transcript. For example, click on “show” in front of “Non-synonymous coding.” Variation IDs are listed in the first column, and a link is provided to the variation page for each ID. The location in the transcript, along with any consequence on the protein sequence, is shown in the “type” column. The genomic positions of variations are listed, along with ambiguity codes. Positions in the protein are listed (if any), along with protein alleles. Finally, the source(s) of the variation and the validation status are shown.
The transcript tab should open. By selecting a particular transcript, via the table or a context menu, a new page with transcript-specific information opens. This new Transcript tab indicates the name of the selected transcript, the number of exons, the transcript length in base pairs, and the protein length in amino acids. Furthermore, we can see the source of this transcript, whether it be the Ensembl annotation pipeline or the VEGA/HAVANA project. In this case, the transcript is colored gold, which indicates that both HAVANA and Ensembl agree upon the transcript. This is written at the bottom of the view.
Links at the left are now specific for this splice variant (ENST00000377970). Note that the Location, Gene, and Variation tabs are still available for quick navigation to those pages.
Three different types of sequence are shown in this view (Fig. 6.11.15). The first line corresponds to the cDNA sequence, starting with UTR (highlighted sequence in yellow). The second line shows the coding sequence, and the third line shows protein sequence. The numbering scheme is specific to each sequence type (i.e., the first line numbering begins at “1” at the start of the transcript, the second line numbering begins at “1” at the start of the coding sequence, and the third line numbering begins at “1” at the start of the protein sequence).
Figure 6.11.15. A selection of sequence from the transcript tab: cDNA view. Sequence and line numbering in the top line corresponds to the transcript, including UTR, highlighted in bright yellow. The second line corresponds to the coding sequence only. Codons in the ...
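The three numbering schemes are related by simple arithmetic, which the following sketch makes explicit. It assumes a spliced transcript with a known 5′ UTR length; the UTR length used in the example is hypothetical, and the function names are illustrative only.

```python
def cdna_to_cds(cdna_position, utr5_length):
    """Convert a 1-based transcript (cDNA) position to a 1-based CDS position."""
    cds_position = cdna_position - utr5_length
    if cds_position < 1:
        raise ValueError("position lies in the 5' UTR")
    return cds_position


def cds_to_protein(cds_position):
    """Return (amino acid number, position within the codon), both 1-based."""
    residue = (cds_position - 1) // 3 + 1
    codon_position = (cds_position - 1) % 3 + 1
    return residue, codon_position


if __name__ == "__main__":
    utr5_length = 100  # hypothetical 5' UTR length
    for cdna_position in (101, 103, 104, 105):
        cds_position = cdna_to_cds(cdna_position, utr5_length)
        print(cdna_position, cds_position, cds_to_protein(cds_position))
```

The first base after the 5′ UTR is coding position 1 and falls in the first codon, so three consecutive coding positions always share one amino acid number.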
Variations are drawn within the coding sequence. Nonsynonymous variations are highlighted in yellow, and synonymous variations are highlighted in green. The ambiguity code is positioned on top of the nucleotide to which the variation maps. Clicking on the ambiguity code opens the variation tab for that specific variant.
If a variant is nonsynonymous, the amino acid in the protein sequence is highlighted in red. Hover with the mouse over the first red “N” in the protein sequence. This reveals the protein alleles to be N and S. The ambiguity code above the variant is R, representing the A and G alleles. The codon in the reference sequence is AAC. The G allele changes this codon to AGC.
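The codon change described above can be reproduced with the standard genetic code and the IUPAC two-base ambiguity codes. The sketch below is a minimal illustration; the helper names are hypothetical, and only single-base substitutions are handled.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC ambiguity codes for pairs of nucleotides.
TWO_BASE_CODES = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def ambiguity_code(allele_1, allele_2):
    """Return the IUPAC code representing two different nucleotide alleles."""
    return TWO_BASE_CODES[frozenset((allele_1.upper(), allele_2.upper()))]


def substitute(codon, position_in_codon, new_base):
    """Replace the base at a 1-based position within a codon."""
    index = position_in_codon - 1
    return codon[:index] + new_base + codon[index + 1:]


if __name__ == "__main__":
    reference_codon = "AAC"
    variant_codon = substitute(reference_codon, 2, "G")
    print(ambiguity_code("A", "G"))
    print(reference_codon, CODON_TABLE[reference_codon])
    print(variant_codon, CODON_TABLE[variant_codon])
```

Substituting G at the second position of AAC gives AGC, which changes the encoded residue from asparagine (N) to serine (S).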
This view allows the sequence of multiple species to be compared. Precalculated alignments include multi-species whole-genome alignments of 12 eutherian mammals, 34 eutherian mammals (including low coverage genomes), 16 amniota vertebrates, 6 catarrhini primates, and 5 fish. These alignments are performed using EPO or Pecan. Pairwise alignments between two species are also available, calculated using BLASTZ-Net or TBLAT (Kent, 2002).
By default only the sequence for the gene of interest is shown (in this case, human MYC). Exons are shown in red letters, and introns and flanking sequence are shown in black.
Ensembl-calculated alignments of the high-coverage genomes for 11 eutherian mammals are shown, aligned with the EPO pipeline (Paten et al., 2008).
Variations should be highlighted within the sequence. SNPs are demonstrated by replacing the nucleotide in the reference sequence with an ambiguity code. Chromosome and base pair coordinates are shown (for example, 8:128748330 means chromosome 8, base pair 128,748,330). Variation IDs and coordinates are shown at the right. Blue shading indicates identical nucleotides across species.
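When coordinates in this "chromosome:position" notation need to be handled outside the browser, they are simple to split apart. The sketch below is an assumption-laden illustration rather than an Ensembl tool: `parse_location` is a hypothetical helper, and it also accepts a "chromosome:start-end" range on the assumption that ranges are written with a hyphen.

```python
def parse_location(text):
    """Split 'chromosome:start' or 'chromosome:start-end' into its parts."""
    chromosome, _, span = text.partition(":")
    start_text, _, end_text = span.partition("-")
    # Tolerate thousands separators such as 128,748,330.
    start = int(start_text.replace(",", ""))
    # A single position is treated as a range of length one.
    end = int(end_text.replace(",", "")) if end_text else start
    return chromosome, start, end


chromosome, start, end = parse_location("8:128748330")
print(f"chromosome {chromosome}, base pair {start:,}")
# chromosome 8, base pair 128,748,330
```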
It should be at position 128750540 (we learned this in step 24 of this protocol).
Viewing the sequence in this position shows us an “R” in the human sequence (which is the ambiguity code marking this SNP). The other species in the alignment have an “A” at this position. The “A” is a highly conserved nucleotide across these 11 eutherian mammals.
Views: Transcript (General Identifiers, GO Terms, Population Comparison), Variation (LD view, Export).
Starting with the human HRAS gene, an oncogene in the RAS subfamily of GTPases, we will look at sequence matches between Ensembl and other databases such as UniProt. Gene ontology terms will be examined for an overview of what is known or predicted about the function of the HRAS protein. Variations are compared between human individuals. Finally, we will view linkage disequilibrium plots for a synonymous SNP in the HRAS coding sequence, and export linkage disequilibrium (LD) values from the browser.
In this section, 155 matches of the Ensembl transcript (or encoded protein) to sequences in public scientific databases are shown. Matches include two antibodies in the Human Protein Atlas (Ponten et al., 2008), the RASH Human gene in UniProtKB/Swiss-Prot (UniProt Consortium, 2010), and, scrolling down, four diseases in the OMIM database. Click the Align link next to the RASH Human hit from UniProtKB/Swiss-Prot. Note that the sequence matches completely. Go back once with the browser controls to the “General identifiers” view.
The Gene Ontology project provides a hierarchical set of standardized terms describing protein functional classes, cellular locations, and biological processes (Gene Ontology Consortium, 2010). Evidence codes (three-letter codes) reflect how the term was assigned to the protein.
For example, in this case, “nucleotide binding” is IEA, or “inferred from electronic annotation.” “Cytoplasm” has the code TAS, or “traceable author statement,” signifying that a publication exists showing the association of ENSP00000373382 with the cytoplasm. The meaning of the evidence codes can be found by clicking on the Help button in this view, or by going to the Gene Ontology Web site (http://www.geneontology.org).
Variations are mapped to the individual genomes of James Watson and Craig Venter. Genomes can be selected in the “configure this page” dialog. If the variation shows the same allele as the reference sequence, “SARA” appears under the “type” column, for “Same As Reference Assembly.” Most of the alleles in this transcript are the same between Watson, Venter, and the reference sequence. However, rs12628 is a synonymous SNP in the HRAS coding sequence in Watson’s genome (Fig. 6.11.16).
Figure 6.11.16. The transcript tab: population comparison view. Variations are shown in Jim Watson’s genome. One synonymous coding SNP (rs12628) and one intronic SNP (rs61877782) differ in allele when compared to the reference genome.
In the variation summary page for rs12628, linkage disequilibrium (LD) data are shown in the form of plots. Twelve plots demonstrating LD values for specific populations are shown above the flanking sequence. LD values are from the HapMap project in this example.
LD plots are drawn underneath the transcript diagrams for the HRAS gene, and variations in this region. The two plots represent LD values measured as r² or D′. Red regions show variations in high LD, while white regions indicate no LD. Click on the plot to see the variations measured in that region and their variation IDs. To export a table of LD values, click on the blue “Export data” button at the left. Follow the link to “HTML” to view the values in table format (Fig. 6.11.17).
Figure 6.11.17. The location tab: Linkage Data view. This view is reachable from the variation tab: summary view, if linkage disequilibrium (LD) values have been calculated for the specific variant. Clicking on the LD plot for the population “CSHL-HAPMAP:CHB” ...
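For readers unfamiliar with the two LD measures, the sketch below computes them from haplotype and allele frequencies using the standard textbook definitions: D = p(AB) − p(A)p(B), D′ = |D| / Dmax, and r² = D² / (p(A)(1 − p(A))p(B)(1 − p(B))). It is not the code Ensembl or HapMap uses, the function name is hypothetical, and the frequencies in the example are invented for illustration.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    Both loci must be polymorphic (0 < p_a < 1 and 0 < p_b < 1).
    """
    d = p_ab - p_a * p_b
    # D_max depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r_squared


# Invented frequencies: alleles A and B always occur together.
print(linkage_disequilibrium(p_ab=0.5, p_a=0.5, p_b=0.5))   # (1.0, 1.0)
# Invented frequencies: alleles combine at random (no LD).
print(linkage_disequilibrium(p_ab=0.25, p_a=0.5, p_b=0.5))  # (0.0, 0.0)
```

Values near 1 correspond to the red regions of the plots, and values near 0 to the white regions.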
Views: Location (“Region in Detail,” “Region Overview”), Gene (Regulation), Feature (Summary, Context).
We begin with a region encompassing a highly conserved sequence that Ensembl predicts to be a regulatory region. The underlying evidence for this predicted regulatory region is examined. Other motifs associated with gene regulation are discussed, specifically from the CisRED (Hillier et al., 2004), miRanda (Betel et al., 2008), and VISTA (Visel et al., 2007) projects. Markers and clones are viewed along the chromosome, and tilepath clones (the clones used to determine the genome sequence) are exported.
Most constrained elements correspond with exons. However, in this case we see one element calculated from the 34 eutherian mammals alignment that lies outside an exon (Fig. 6.11.18, circled).
Figure 6.11.18. Location tab: “region in detail” view. The main panel is shown. Constrained elements (circled) and GERP scoring (labeled “1”) of each nucleotide in the 34 species alignment are shown. The circled constrained element falls ...
A regulatory feature maps to the highly conserved sequence discussed in Basic Protocol 2, step 12.
A pop-up box should appear, showing this sequence has transcription factor binding sites, along with DNase I hypersensitivity.
A new tab opens, the regulation tab, which focuses on one regulatory region. This highly conserved regulatory feature is based on DNase I hypersensitive sites from various cell types, along with transcription factor binding sites. These make up the “core feature.” Histone modification sites extend the feature, and these extensions are drawn as error bars. Certain histone modification patterns are associated with promoters. If the feature contains these patterns, it is termed “promoter-associated.” ENSR00000684596 exhibits these patterns in various cell types and covers the 5′ end of the TALDO1 gene.
This view diagrams potential regulatory features in the region of the TALDO1 gene. In addition to the features from the regulatory build that are associated with the 5′ end of the gene (Fig. 6.11.19, green and gray boxes), two regulatory regions map to the 3′ terminus (Fig. 6.11.19, blue boxes). Furthermore, a track shows regulatory regions from the CisRED, miRanda, and VISTA projects (Fig. 6.11.19, circled). CisRED features are only searched for in specific regions of the gene (5′ end and upstream), indicated by the light purple “CisRed search regions” track. Clicking on any of these features (Fig. 6.11.19, circled) will yield a pop-up box with more information.
Figure 6.11.19. Gene tab: regulation view. A graphical display of predicted and known sequences associated with gene regulation is shown for ENSG00000177156. Features from the Ensembl Regulatory Build are shown, along with sequences from cisRED (circled). Click on any ...
Scrolling down, the genomic sequence corresponding to each regulatory feature is shown. The feature we investigated in steps 5 to 9 of this protocol is ENSR00000684596. The sequence for this promoter-associated feature is shown. Clicking on the ID (ENSR00000684596) brings us back to the regulation tab.
The region for the TALDO1 gene is shown.
Make sure you are viewing the full TALDO1 gene model. Add these tracks by searching for each one in the configuration dialog and selecting it. Save and close the configuration dialog to refresh the Region in Detail view with the new tracks. The view should now show two markers (pink boxes named RH36444 and D11S327) and a clone (gold rectangle named XX-55F22; Fig. 6.11.20). Click on one of the markers and follow the link to “Marker info.” The resulting view shows IDs for this marker in other databases (“synonyms”) and two primer sequences, along with expected product size.
Figure 6.11.20. Location tab: “region in detail” view. The main panel is centered on the TALDO1 transcript. Markers are displayed as pink blocks. Clicking on either RH36444 or D11S3271 will reveal more information about the marker. For color version of ...
Markers can also be turned on in the Top panel of “Region in Detail,” using the configuration dialog. To restore tracks to the default configuration, click on “configure this page” at the left, and choose the option “Reset configuration for Main panel to default settings” at the bottom of the configuration dialog.
You might have to do this twice in order to achieve a region of less than 200 nucleotides. Once the range is small enough, the sequence will appear (Fig. 6.11.21).
Figure 6.11.21. Location tab: “region in detail” view. The main panel has been zoomed in to chromosome 11, base pairs 755,875 to 756,005. Sequence and translated sequence are selected in the configuration dialog. Sequence is only displayed if the base ...
This view is similar to “Region in Detail,” but allows over 1 Mb of sequence to be viewed. Syntenic regions, or long regions of conserved sequence and gene order between species, may be drawn in this view.
The resulting view (Fig. 6.11.22) should show the genomic assembly (blue rectangles) and clones.
Figure 6.11.22. Location tab: region overview. Contigs, genes, and tilepath clones are displayed. Gold clones indicate finished sequence, and a black triangle at the upper left-hand corner of a gold rectangle indicates the clone was mapped using fluorescence in situ ...
The export dialog allows export of sequence and annotation in various formats.
All options are selected for CSV.
Clones for the regions selected are listed.
For other ways to export sequence and annotation, see Support Protocol 1.
Views: Transcript (cDNA, Export), BioMart.
The browser can be used to export gene, transcript, or protein sequence in FASTA format. While this is useful for one or two genes, it becomes tedious for a set of genes. BioMart (Haider et al., 2009) is a quick export tool that retrieves information from a “martified” Ensembl database according to the user’s request. Both BioMart and export from the browser will be explored in this protocol.
The cDNA sequence will appear in FASTA format.
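Once exported, a FASTA file can be read with a few lines of code. The sketch below is a minimal illustration, not an Ensembl utility: `read_fasta` is a hypothetical helper, and the record name and sequence in the example are invented placeholders rather than real export output.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after the ">" marker.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Sequence lines are wrapped, so collect and join them later.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


# Invented placeholder record, not real Ensembl output.
example = """>example_transcript cdna
ATGACGGAATATAAG
CTGGTGGTG
"""
sequences = read_fasta(example)
print(sequences["example_transcript"])
print(len(sequences["example_transcript"]))
```

For larger projects, an established parser such as the one in Biopython is usually a better choice than a hand-written function.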
An alternative to export from the browser is found in the BioMart data mining tool. BioMart is also accessible from an alternate location: http://www.biomart.org.
More than one gene symbol may be entered. Alternatively, gene IDs or accession numbers can be used.
Click the blue Count button at the top left to make sure the gene ID is accepted. The result should be 1/52,580 (1 out of a possible 52,580 human genes; Fig. 6.11.24, circled).
The header may be customized in the “Header information” options. Ensembl gene ID and Ensembl transcript ID are automatically in the header, but may be deselected. Gene name, chromosome, and position in base pairs are all options.
Sequences may be exported as FASTA. In Support Protocol 2, we export variations for a gene and explore other file formats.
For a BioMart tutorial video, see the Ensembl tutorials page: http://www.ensembl.org/info/website/tutorials/index.html (Introduction to BioMart). These videos are hosted at YouTube.
The browser can be used to view all variations for a gene, as shown in Basic Protocol 1, steps 28 to 33. To quickly obtain a table of variations for one, or several, genes, BioMart can be used, as outlined in this protocol.
Instead of choosing Ensembl genes 60 as the database, it is possible to choose Ensembl variations. This would be relevant if a list of variation IDs were to be used in the Filters. Choosing Ensembl genes 60 allows filters to apply to the genes. Choosing the variation database means filters apply to variations. For example, intergenic variations may be seen when choosing the variation database.
By default, Gene ID, Transcript ID, and Variation ID are selected. Choices such as validation status, variation source, and consequence type can be selected.
Supported file types include HTML, CSV, TSV, and XLS. The “compressed web file, notify by email” option is especially useful in cases of large export files.
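The dataset, filter, attribute, and file-type choices made in the BioMart interface can also be written down as an XML query, which is useful for keeping a record of an export or repeating it later. The sketch below only builds the query text and does not contact any server. The function name is hypothetical, and the dataset, filter, and attribute names (`hsapiens_gene_ensembl`, `hgnc_symbol`, `ensembl_gene_id`, `ensembl_transcript_id`) are assumptions based on commonly used BioMart identifiers; they should be checked against the XML that the BioMart interface itself generates for the release in use.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes, formatter="TSV"):
    """Build BioMart-style query XML as a string (no network access)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": formatter,       # assumed to match the export file type
        "header": "1",                # include column names in the output
        "uniqueRows": "1",            # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    dataset_node = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    for name, value in filters.items():
        ET.SubElement(dataset_node, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All identifiers below are assumptions; verify them in the BioMart interface.
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"hgnc_symbol": "HRAS,MYC"},
    attributes=["ensembl_gene_id", "ensembl_transcript_id"],
)
print(query_xml)
```

Several gene symbols are passed as one comma-separated filter value, mirroring the way more than one symbol can be entered in the interface.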
With the vast number of databases in life sciences today, it becomes necessary to provide one access point to a comprehensive set of annotation on a gene or genome. Ensembl meets this goal by coordinating with other projects such as UniProt, InterPro, and Gene Ontology in order to show multiple types of information on a genome-wide level.
The complete gene and transcript set for human is not yet known. Ensembl bases all automatically annotated transcripts on protein and cDNA sequences from public sequence sets (NCBI RefSeq, UniProt). As the sequences in underlying databases (European Nucleotide Archive, GenBank, and DDBJ) have been submitted by wet lab biologists, each Ensembl transcript goes back to experimental evidence. In addition, the Ensembl/HAVANA merged transcripts obtain a higher degree of confidence, as manual annotation confirms Ensembl determinations.
Variation resources providing disease and phenotype information, such as NHGRI’s GWAS catalog, are now developing. Understanding the effect of sequence variation on human disease will allow deleterious polymorphisms to be spotted, making for quicker diagnosis, and potentially, disease treatment. Genome browsers allow these types of annotations to be easily compared with genes and other genomic features, such as conserved regions across species, and/or protein domains, providing a bigger picture of any locus.
Like life sciences in general, the bioinformatics resources supporting the field are not static entities, but changing and improving over time. Although data updates and changes in the user interface may be challenging for biologists, the aim of database resources is always to provide a better integrated view of current biology.
The upshot of the fluidity of biological data is that data searches and analysis should be repeated from time to time. New supporting evidence, a better genome sequence assembly, or a refined algorithm all help to improve annotation quality significantly over time. Therefore, the answers given by a more current version than Ensembl 60 (November 2010), which is the basis of this unit, are also subject to improvement.
To allow looking up older references (e.g., in lab journals or research articles) in their original context, Ensembl has implemented an archive site (http://archive.ensembl.org/). We aim to provide a back-catalog of earlier releases for at least two years. Access to this particular release (Ensembl 60, November 2010) is made available via http://Nov2010.archive.ensembl.org/. The Ensembl project is open source; all data shown in the browser and underlying databases are freely available, along with the Web code.
For further support, view video tutorials for the project at http://www.ensembl.org/info/website/tutorials/index.html. Help pages are available for each Ensembl view, accessible via a blue Help button to the right of the page title. The help also provides links to a glossary, and to a “contact helpdesk” form. The helpdesk may also be reached by e-mail (helpdesk/at/ensembl.org). Finally, the “Help” and “Documentation” link at the top right of each Ensembl view provides basic documentation and protocols used in the project, along with frequently asked questions.
The Ensembl genome server is the result of the work of the Ensembl team at the Wellcome Trust Sanger Institute (WTSI) and the European Bioinformatics Institute (EMBL-EBI), an outstation of the European Molecular Biology Laboratory. The Ensembl project is principally funded by the Wellcome Trust with additional support from the European Molecular Biology Laboratory (EMBL), the National Human Genome Research Institute (NHGRI), the U.S. National Institute of Allergy and Infectious Diseases (NIAID), the Biotechnology and Biological Sciences Research Council (BBSRC), the Medical Research Council (MRC), and the European Union.
Ensembl project home page.
Support videos and other tutorials for Ensembl.
Distributed Annotation System (DAS) and BioDAS.
dbSNP: a repository of polymorphisms.
Gene Ontology Consortium.
Genome Reference Consortium: houses the reference human genome.
NCBI GWAS catalog.
An international organization working towards a haplotype map of the human genome.
HUGO Gene Nomenclature Committee (HGNC).
InterPro, a collection of protein signatures.
Online Mendelian Inheritance in Man, a set of human genes and phenotypes.
A multi-organism, nonredundant database of sequences.
UniProtKB, a catalog of information on proteins.
UniSTS, databank for chromosomal markers.
Vertebrate Genome Annotation (VEGA) at Sanger Institute.
International Human Genome Sequencing Consortium.