The ENCODE labs are submitting data from a wide range of experiment types to the DCC.
In addition to primary experimental data sets, the ENCODE data also includes the Gencode V4 gene annotation set (4). This set contains both manually curated and annotated gene models, as well as automatically generated annotations for regions where manual annotation has not yet been performed, along with computationally identified pseudogenes.
Other data sets estimate the mapability of regions of the human genome for given sequencing read lengths. This mapability is determined by how many times a certain DNA sequence appears in the human genome, where higher numbers limit the ability to accurately map a sequence in that region.
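The counting idea behind mapability can be illustrated with a minimal sketch. It is not the method used to build the ENCODE mapability data sets: it assumes exact matches on the forward strand only, uses a toy sequence in place of the genome, and the scoring rule (1 divided by the occurrence count) and the function names are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(genome, k):
    """Count every k-length substring of the sequence (forward strand only)."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def mapability(genome, k):
    """Score each start position as 1 / (occurrences of the k-mer starting there).

    A score of 1.0 means the k-mer is unique in the sequence; lower scores
    mean a read of length k from that position could map to several places.
    The 1/count scoring is an illustrative assumption.
    """
    counts = kmer_counts(genome, k)
    return [1.0 / counts[genome[i:i + k]] for i in range(len(genome) - k + 1)]


# Toy sequence standing in for a genome; "ACGT" occurs twice.
toy_genome = "ACGTACGTTTGA"
scores = mapability(toy_genome, 4)
```

In this toy case the two positions that start an `ACGT` 4-mer receive a lower score than the positions whose 4-mer is unique, which mirrors the statement that sequences appearing many times are harder to map accurately.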
For experimental data to be meaningful, the user must have some information on the context of the experiment. To communicate this contextual information, the DCC has established a controlled vocabulary of experimental metadata that includes such information as cell type and experiment type, as well as experiment-specific data such as the antibody used for transcription factor-binding experiments, insert length for paired-tag experiments and cellular compartment for organelle-specific measures of transcript presence. This metadata is presented in the Genome Browser track display, and is available in the download area of each ENCODE track.
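As a rough illustration of how a controlled vocabulary constrains metadata, the sketch below checks a metadata record against a small set of registered terms. The field names, the allowed terms and the per-experiment required fields are all hypothetical; they are not the DCC's actual vocabulary or file format.

```python
# Hypothetical controlled vocabulary: field names and terms are illustrative only.
VOCABULARY = {
    "cell": {"GM12878", "K562"},
    "experiment_type": {"ChIP-Seq", "RNA-Seq"},
}

# Hypothetical experiment-specific fields, keyed by experiment type.
REQUIRED_BY_TYPE = {
    "ChIP-Seq": {"antibody"},
    "RNA-Seq": {"cellular_compartment"},
}


def validate(metadata):
    """Return a list of problems found in one metadata record (empty if none)."""
    errors = []
    # Every vocabulary-controlled field must hold a registered term.
    for field, allowed in VOCABULARY.items():
        value = metadata.get(field)
        if value not in allowed:
            errors.append(f"{field}: {value!r} is not a registered term")
    # Experiment-specific fields must be present for the given experiment type.
    required = REQUIRED_BY_TYPE.get(metadata.get("experiment_type"), ())
    for field in sorted(required):
        if field not in metadata:
            errors.append(f"{field}: required for this experiment type")
    return errors


good = validate({"cell": "K562", "experiment_type": "ChIP-Seq", "antibody": "CTCF"})
bad = validate({"cell": "HeLa", "experiment_type": "ChIP-Seq"})
```

Here `good` is an empty list, while `bad` reports both the unregistered cell term and the missing antibody field.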
Currently, most ENCODE tracks represent primary experimental data, as recorded by an experimental assay. As the ENCODE project matures, the DCC has begun publishing analysis tracks derived from processing the primary data. The first set of this data is displayed in Figure 1 as a rainbow multi-wiggle track. This composite track integrates six to eight different primary tracks representing RNA expression and histone modifications into a single display in which the signal from each track is rendered as a semi-transparent signal of a given color defined by the cell type assayed. This display facilitates the visual identification of constitutive and tissue-specific signals of cellular state.
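The effect of overlaying semi-transparent signals can be sketched with ordinary alpha blending. This is not the Genome Browser's rendering code: the blending formula, the fixed opacity of 0.5, the white background and the function names are assumptions chosen to show why overlapping signals from several cell types produce mixed colors while a signal from a single cell type keeps that cell type's hue.

```python
WHITE = (255, 255, 255)


def blend_over(background, color, alpha):
    """Composite one semi-transparent RGB color over an opaque background."""
    return tuple(
        round(alpha * c + (1 - alpha) * b) for c, b in zip(color, background)
    )


def render_column(signals, colors, height, alpha=0.5):
    """Return pixel colors, bottom to top, for one display column.

    Each track fills the pixels below its own signal height with its color;
    where several tracks reach the same pixel, their colors are blended in
    track order.
    """
    column = []
    for y in range(height):
        pixel = WHITE
        for signal, color in zip(signals, colors):
            if y < signal:
                pixel = blend_over(pixel, color, alpha)
        column.append(pixel)
    return column


# Two hypothetical tracks (red and blue) with signal heights 2 and 1.
column = render_column([2, 1], [(255, 0, 0), (0, 0, 255)], height=3)
```

In this example the bottom pixel, reached by both tracks, takes a mixed color; the middle pixel, reached only by the red track, is a pale red; and the top pixel stays white. Note that this simple blending depends on the order in which tracks are drawn.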
Figure 1. Constitutive and tissue-specific gene regulation under the ENCODE Integrated Regulation tracks. These tracks compare transcription, promoter marks and enhancer marks across the ENCODE Tier 1 and Tier 2 cell lines. Each cell line is rendered in one color.
Within each multi-cellular organism, the DNA in each cell is virtually identical, yet the regulatory signals from the configuration of this DNA yield a variety of cellular phenotypes depending on tissue type and developmental stage. To capture the breadth of these signals, the ENCODE project is surveying a variety of cellular contexts: over 100 different cell types have been designated for use. These are organized into three priority tiers: Tiers 1, 2 and 3. The accompanying table lists the Tier 1 and 2 cell lines. All experiments are performed on the Tier 1 cell lines if possible. Most experiments are also performed on the Tier 2 cell lines, and selected (often lab-specific) experiments are performed on Tier 3 cell lines. This experimental design facilitates integrative analysis while capturing some breadth of cellular state.
The ENCODE Tier 1 and Tier 2 cell types
Since late 2009 (5), the DCC has more than doubled the number of publicly available ENCODE tracks, to 863 as of August 2010. This includes 305 additional ChIP-Seq tracks that collectively now report on 98 sets of transcription factor-binding sites and 10 histone modifications. The DCC is hosting several new types of data this year, including Digital DNase Genomic Footprinting (6), ORChID predicted hydroxyl radical cleavage intensity on naked DNA (7), RIP-chip identification of RNA–protein-binding sites (8) and integrated regulatory tracks (see Figure 1). The RNA-Seq data sets have been augmented with longer reads, more paired reads and data sets in which the direction of transcription can be determined.
The human reference genome has transitioned from the hg18/NCBI36 assembly to hg19/GRCh37, and most tracks originally submitted on hg18 have been re-mapped to hg19. ENCODE data visualization has been enhanced through the deployment of new Genome Browser features, such as the ability to reorder tracks by dragging and dropping and support for BAM format (see the UCSC Genome Browser paper in this issue). The DCC ENCODE portal (http://genome.ucsc.edu/encode/) now provides additional features for interpreting ENCODE data, such as a listing of all registered metadata variables.
We have added a set of sample Genome Browser sessions to the UCSC wiki (http://genomewiki.ucsc.edu/index.php/Encode_scenarios). These sessions illustrate how ENCODE data can be used in conjunction with other Genome Browser data to support biological inferences. Each image is a screenshot that links to a Genome Browser session (9) with a preconfigured set of tracks and display parameters. Within the session, a user can access all the standard Genome Browser functionality, such as moving to a nearby region or adding a custom track to the display.