All ENCODE production data for the 5-year initial production phase of the project have now been submitted to the ENCODE Data Coordination Center at UCSC. UCSC has performed quality review and publicly released all conforming ENCODE data sets along with metadata, as both tracks for browsing and downloadable files for data mining. In the human genome, 288 cell and tissue types are now represented, covering 32 assays. Chromatin features and sites of DNA binding are mapped for >300 factors and marks. In mouse, 81 cell and tissue types were surveyed in five experimental assays.
The results of five new experiment types were released during the fifth year: chromatin interactions based on chromosome conformation capture carbon copy (5C) and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) methods, proteogenomics and DNA replication timing by both sequencing and microarray methods.
Although DNA is a linear molecule, it is packed and organized inside the nucleus in a 3D milieu, and gene regulation can be affected by interactions from elements located hundreds of kilobases distant in the genome. Long-range chromatin looping interactions can be detected using various techniques, including chromosome conformation capture (9) and chromatin interaction analysis with paired-end tag (10). The ENCODE chromatin interactions data sets comprise experiments in 14 cell types.
Proteogenomic methods differ from conventional mass spectrometry proteomic methods, which identify peptides by comparing them with peptides produced from known proteins. Proteogenomic methods instead compare peptides with all peptides that might be produced by the six translation frames of the genome, to identify the genomic region from which the peptides were produced. Study of proteogenomic data offers insights into regulatory mechanisms, including translation, pre-messenger RNA (mRNA) splicing and transcript diversity, nonsense-mediated decay and transcription of novel protein-coding genes. The ENCODE proteogenomics data are available in four cell types. Figure 1 presents a Genome Browser session that includes proteogenomics data in conjunction with ENCODE gene, transcriptome and regulatory data sets.
Figure 1. ENCODE data displayed in the UCSC Genome Browser together with annotations from the ENCODE Analysis Hub in the region of the nucleoporin gene NUP133 demonstrate the power this diversity of data provides for visual interpretation.
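The six-frame comparison described for proteogenomics can be illustrated with a minimal Python sketch. It is not the ENCODE pipeline: real proteogenomic searches match mass spectra against a six-frame database, whereas this example assumes the peptide sequence is already known and simply looks it up by string matching. The function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(dna):
    """Map (strand, offset) to the protein string for each of the six frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def locate_peptide(peptide, dna):
    """Return (strand, offset, start, end) hits in 0-based, half-open
    forward-strand coordinates for every frame containing the peptide."""
    hits = []
    n = len(dna)
    for (strand, offset), protein in six_frame_translations(dna).items():
        pos = protein.find(peptide)
        while pos != -1:
            start = offset + 3 * pos
            end = start + 3 * len(peptide)
            if strand == "-":
                # Convert reverse-strand positions back to the forward strand.
                start, end = n - end, n - start
            hits.append((strand, offset, start, end))
            pos = protein.find(peptide, pos + 1)
    return hits


if __name__ == "__main__":
    print(locate_peptide("MAK", "ATGGCCAAGTAA"))
```

Returning forward-strand coordinates for both strands makes each hit directly comparable with gene annotations at the same locus.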
The order in which DNA is duplicated during the synthesis phase of the cell cycle is correlated with the expression of genes and the structure of chromosomes; replication timing is known to be an important feature for epigenetic control of gene expression. ENCODE ‘Repli-chip’ (microarray) experiments are available in 9 cell types, and ‘Repli-seq’ (sequencing) experiments in 15.
The encyclopædia of genes and gene variants (GENCODE) gene set (11) is a fundamental resource produced by ENCODE, providing high-quality manual annotation from the Human and Vertebrate Analysis and Annotation (HAVANA) group merged with evidence-based automated annotation from Ensembl (12) across the human genome. For the final release (V12), the data organization and display were improved to make the data more accessible and intuitive. Annotations are now categorized according to their function and level of support. Color coding reflects non-coding, coding, pseudogene or problem status. To complement the ‘Comprehensive’ gene set, a new ‘Basic’ subset provides a simplified view intended for the majority of users, by filtering incomplete and problem annotations while still ensuring that at least one annotation is displayed at every locus. For researchers who require more detail regarding the degree of evidence supporting individual coding transcripts, a five-level scoring metric is provided, based on assessment of alignments of mRNAs and expressed sequence tags (ESTs) across the full length of the annotation. Filtering options allow tuning of the display based on the basic biological function of the transcript (coding, non-coding, etc.), annotation method (manual versus automated) or specific biotype characterization (http://www.gencodegenes.org/gencode_biotypes.html). Finally, two additional annotation subtracks are provided: the ‘2-Way Pseudogene’ subtrack shows consensus pseudogenes predicted by two pipelines [Yale Pseudopipe (13) and UCSC Retrofinder (14)], and the ‘PolyA’ subtrack presents polyA signals and sites manually annotated on the genome based on transcribed evidence.
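The idea behind the ‘Basic’ subset, filtering incomplete and problem annotations while keeping at least one annotation per locus, can be sketched in a few lines of Python. The text does not give the actual GENCODE selection rules, so everything here is an assumption for illustration: the `Transcript` fields, the function name, and in particular the fallback of keeping the longest transcript when no clean one exists.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    # Hypothetical record; these fields are not the GENCODE schema.
    gene_id: str
    transcript_id: str
    is_complete: bool
    is_problem: bool
    length: int


def basic_subset(transcripts):
    """Drop incomplete/problem transcripts, but never leave a gene empty."""
    by_gene = defaultdict(list)
    for t in transcripts:
        by_gene[t.gene_id].append(t)

    kept = []
    for group in by_gene.values():
        clean = [t for t in group if t.is_complete and not t.is_problem]
        if clean:
            kept.extend(clean)
        else:
            # Assumed fallback: keep the single longest transcript so that
            # the locus still shows one annotation.
            kept.append(max(group, key=lambda t: t.length))
    return kept


if __name__ == "__main__":
    demo = [
        Transcript("geneA", "A-1", True, False, 2000),
        Transcript("geneA", "A-2", False, False, 900),
        Transcript("geneB", "B-1", False, False, 500),
        Transcript("geneB", "B-2", False, True, 1200),
    ]
    print([t.transcript_id for t in basic_subset(demo)])
```

In this toy input, geneA keeps only its complete transcript, while geneB has no clean transcript and falls back to a single representative.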
The majority of ENCODE primary data submitted and released in the past year for the human genome expanded the existing tracks with additional experiments; new cell types, subcellular fractions, transcription factors and histone marks were also mapped. The total complement of available data sets is summarized by cell type and by assay in the two tables that follow, and all links mentioned in this publication are collected in a third table.
The full complement of ENCODE data sets summarized by cell type [types annotated as cancer are marked with asterisk (*)]
The full complement of ENCODE data sets summarized by assay
All links mentioned in this publication are collected in this table
Much of the primary human data (January 2011 data freeze) has been processed uniformly and used as the basis of the integrated analyses published in September 2012 by the ENCODE analysis group. Results from this processing and analysis are accessible in the UCSC browser via a UCSC public Track Data Hub (ENCODE Analysis Hub) accessed at http://genome.ucsc.edu/cgi-bin/hgHubConnect. All data have been reprocessed using the ENCODE uniform processing pipeline, including signal tracks corrected for read length and mappability (http://code.google.com/p/align2rawsignal), peak calls from the SPP (15) and PeakSeq (16) peak callers filtered using the irreproducible discovery rate (17), computed RNA contigs from seven cellular localizations (18) and genome segmentations for the six Tier 1 and 2 cell lines. A total of 2876 data sets are included in the hub, with a reduced set displayed by default. The track organization for the analysis hub is illustrated in Figure 2. Additional background and resources related to the ENCODE analysis effort are provided on the portal’s Integrative Analysis page, described later.
Figure 2. The ENCODE Analysis Hub at the EBI hosts over 2800 ENCODE data sets, organized in six tracks controlled via the track menu shown here.
Although the ENCODE project aims to discover all DNA sequences in the human genome with biochemical function under the expectation that these will likely be functional, extending the analysis to use comparative genomics approaches was identified as a fruitful direction for the project. Thus, in the third year, a Mouse ENCODE Project was inaugurated (19). Assays identical to those being used in the ENCODE project are performed in cell types in mouse that are similar or homologous to those studied in the human project. The comparison will be used to discover which epigenetic features are conserved between mice and humans.
The past year marked the expansion of the UCSC Mouse browser (mm9/NCBI37) from a few preliminary ENCODE tracks to a full representation of Mouse ENCODE data production. A total of 20 tracks of ChIP-seq, RNA-seq and DNase-seq were released, reflecting 583 experiments. Figure 3 provides a graphical view of the Mouse ENCODE data availability.
Figure 3. All three screens of the Experiment Matrix for mouse are shown overlaid. The Data Summary screen lists experiments by data type, and provides launching to the two matrix screens that organize the data by assay and cell type.
The ENCODE Data Coordination Center at UCSC (DCC) has accessioned all relevant ENCODE data at the Gene Expression Omnibus (GEO) (20) and the Short Read Archive. Since September 2011, the DCC has archived a total of 3346 GEO Samples across 53 GEO Series for humans (GRCh37/hg19) and mice (NCBI37/mm9). ENCODE data now encompass a total of 4835 GEO Samples and 98 GEO Series. ENCODE GEO submissions are listed on the GEO ENCODE summary page, http://www.ncbi.nlm.nih.gov/geo/info/ENCODE.html. ENCODE has been assigned National Center for Biotechnology Information (NCBI) BioProject identifiers to further organize the data: PRJNA30707 for Human ENCODE (with the subproject PRJNA63443 for production phase data) and PRJNA50617 for Mouse ENCODE. Data in each project are further categorized as ‘epigenomic’, ‘functional genomics’ or ‘transcriptome’. Both UCSC and GEO are archival sites for 2007–12 ENCODE data, and user choice of repository is largely a matter of preferred interface.
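For readers who prefer programmatic access, the BioProject identifiers listed above can be turned into search requests. The following Python sketch only builds the request URLs and performs no network access. It assumes the NCBI E-utilities `esearch` endpoint with its `db`, `term` and `retmode` parameters, none of which is mentioned in this publication, and it assumes that searching the `bioproject` database by accession returns the expected record. The dictionary labels and function name are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption; not part of this paper).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# BioProject identifiers as given in the text; the keys are illustrative.
ENCODE_BIOPROJECTS = {
    "human": "PRJNA30707",
    "human_production_phase": "PRJNA63443",
    "mouse": "PRJNA50617",
}


def bioproject_search_url(accession):
    """Build an esearch URL for one BioProject accession (no request sent)."""
    query = urlencode({"db": "bioproject", "term": accession, "retmode": "json"})
    return f"{EUTILS_ESEARCH}?{query}"


if __name__ == "__main__":
    for label, accession in ENCODE_BIOPROJECTS.items():
        print(label, bioproject_search_url(accession))
```

The resulting URLs can be passed to any HTTP client.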
All released data from ENCODE are tagged with permanent DCC accessions. Each accession represents a logical experiment, and therefore groups related files representing different levels of results (e.g. sequence files, Binary Alignment/Map (BAM) alignments, signal graphs and peaks) for multiple replicates. ENCODE data sets at UCSC include GEO accessions in the accessible metadata.
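The grouping of files under one accession can be pictured as a nested mapping from accession to result level to files. The Python sketch below is only an illustration of that structure. The record keys (`accession`, `level`, `replicate`, `name`), the accession strings and the file names are hypothetical placeholders, not the DCC metadata vocabulary.

```python
from collections import defaultdict


def group_by_accession(files):
    """Group file records as {accession: {level: [file names]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in files:
        grouped[record["accession"]][record["level"]].append(record["name"])
    # Convert to plain dicts so the result prints and compares cleanly.
    return {acc: dict(levels) for acc, levels in grouped.items()}


if __name__ == "__main__":
    # Placeholder records: two replicates of one experiment, plus another.
    files = [
        {"accession": "EXP0001", "level": "sequence", "replicate": 1, "name": "rep1.fastq"},
        {"accession": "EXP0001", "level": "sequence", "replicate": 2, "name": "rep2.fastq"},
        {"accession": "EXP0001", "level": "alignment", "replicate": 1, "name": "rep1.bam"},
        {"accession": "EXP0001", "level": "peaks", "replicate": 1, "name": "rep1.peaks"},
        {"accession": "EXP0002", "level": "signal", "replicate": 1, "name": "rep1.signal"},
    ]
    for accession, levels in group_by_accession(files).items():
        print(accession, levels)
```

Keying on the accession first mirrors the description above: one logical experiment owns all of its replicates and result levels.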