The ENCODE portal (http://encodeproject.org), which is the centralized resource for accessing the information and tools described in this section, was extensively upgraded this year. An entire section for Mouse ENCODE resources has been added. The experimental guidelines and data standards developed by the ENCODE Consortium this year for a broad range of whole-genome assays (RNA-seq, ChIP-seq, DNase-seq, DNA methylation assays) are hosted on a dedicated portal Data Standards page, along with platform characterization summaries and references.
A key resource for learning about ENCODE data is the OpenHelix ENCODE tutorial (openhelix.com/ENCODE), a free online resource released in November 2010. This tutorial provides an overview of the ENCODE project, summarizes the types of data available through ENCODE, and details methods for accessing ENCODE data via the UCSC Genome Browser. The tutorial and accompanying instructional material are free to the public and are sponsored by the DCC. Other resources for learning about ENCODE data usage can be found on the new ENCODE portal Education and Outreach page.
The DCC devoted considerable engineering effort this year to developing tools to enable users to easily locate data of interest within the overwhelming set of ENCODE data tracks and subtracks. For an overview of ENCODE data, the DCC now provides a Data Summary page on the ENCODE portal. This page includes a spreadsheet in multiple formats itemizing ENCODE experiments by lab, data type, cell type and other experimental variables.
The premier methods for locating ENCODE data are the new Track Search and File Search tools, available from the ENCODE portal and Genome Browser web pages. Both of these tools allow free-text searching by keyword, coupled with an advanced search feature that provides selectable lists of terms from the ENCODE controlled vocabulary (described below) to guide the search. Multiple terms can be applied in both ‘and’ and ‘or’ combinations. For example, in a single advanced search, a user can locate tracks showing evidence of the enhancer-associated histone modifications ‘H3K4me1’ and ‘H3K27Ac’ in either ‘NHLF’ or ‘IMR90’ lung cell lines. The Track Search tool is described more fully in the companion Genome Browser paper in this issue. The File Search tool locates downloadable files for analysis across the full range of ENCODE data sets, and the related track File Downloads tool (available from the track configuration page) selects files within a single track. The Downloads pages of many ENCODE tracks include hundreds or even thousands of files. Using controlled vocabulary terms relevant for each experiment set, the files are now listed in a sortable and filterable table.
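The ‘and’/‘or’ combination in the example search above can be illustrated with a minimal Python sketch. It is not the DCC implementation: the function `search_tracks`, the field names `antibody` and `cell`, and the track records are hypothetical, and the sketch assumes that terms listed for one field are combined with ‘or’ while different fields are combined with ‘and’.

```python
from typing import Dict, Iterable, List, Set

Track = Dict[str, str]


def search_tracks(tracks: Iterable[Track],
                  criteria: Dict[str, Set[str]]) -> List[Track]:
    """Return tracks that satisfy every field in `criteria` ('and'),
    where a field is satisfied by any one of its listed terms ('or')."""
    return [
        track for track in tracks
        if all(track.get(field) in allowed
               for field, allowed in criteria.items())
    ]


# Hypothetical metadata records, for illustration only.
TRACKS = [
    {"name": "track_a", "antibody": "H3K4me1", "cell": "NHLF"},
    {"name": "track_b", "antibody": "H3K27Ac", "cell": "IMR90"},
    {"name": "track_c", "antibody": "H3K4me1", "cell": "K562"},
    {"name": "track_d", "antibody": "CTCF", "cell": "NHLF"},
]

# Either histone mark, in either lung cell line.
hits = search_tracks(
    TRACKS,
    {"antibody": {"H3K4me1", "H3K27Ac"}, "cell": {"NHLF", "IMR90"}},
)
hit_names = [track["name"] for track in hits]
```

Under these assumptions only `track_a` and `track_b` are returned: `track_c` fails the cell-line criterion and `track_d` fails the antibody criterion.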
In a related effort, the DCC this year implemented an accessioning scheme to group related files and tracks within logical experiments. These accessions make it easier to relate associated files and provide a short, stable identifier for citations. Each experiment groups a set of data from a single providing laboratory for a single assay in a single cell type and set of experimental conditions. All replicates and levels of data (raw sequence files and mappings to multiple genome assemblies, processed data such as peak calls or putative transcription isoforms) associated with a single logical experiment are assigned the same accession. The DCC accession is visible everywhere metadata for a track or file appears. As of this writing, ENCODE comprises 1861 experiments in human and 174 experiments in mouse.
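The grouping rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DCC's accessioning code: the function `assign_accessions`, the record fields (`lab`, `assay`, `cell`, `treatment`), the file names and the `EXP` accession prefix are all hypothetical.

```python
from typing import Dict, Iterable, Tuple

FileRecord = Dict[str, str]


def assign_accessions(files: Iterable[FileRecord],
                      prefix: str = "EXP") -> Dict[str, str]:
    """Map each file name to an experiment accession.

    Files sharing lab, assay, cell type and treatment form one logical
    experiment and receive the same accession, regardless of replicate
    number or data level (raw reads, alignments, peak calls, ...).
    """
    accession_by_key: Dict[Tuple[str, str, str, str], str] = {}
    accession_by_file: Dict[str, str] = {}
    for record in files:
        key = (record["lab"], record["assay"],
               record["cell"], record["treatment"])
        if key not in accession_by_key:
            # Hypothetical format: prefix plus a zero-padded counter.
            accession_by_key[key] = f"{prefix}{len(accession_by_key) + 1:06d}"
        accession_by_file[record["file"]] = accession_by_key[key]
    return accession_by_file


# Hypothetical file records, for illustration only.
FILES = [
    {"file": "rep1.fastq", "lab": "LabA", "assay": "ChIP-seq",
     "cell": "NHLF", "treatment": "none"},
    {"file": "rep2.fastq", "lab": "LabA", "assay": "ChIP-seq",
     "cell": "NHLF", "treatment": "none"},
    {"file": "peaks.bed", "lab": "LabA", "assay": "ChIP-seq",
     "cell": "NHLF", "treatment": "none"},
    {"file": "other.fastq", "lab": "LabA", "assay": "ChIP-seq",
     "cell": "IMR90", "treatment": "none"},
]

accessions = assign_accessions(FILES)
```

Here the two replicates and the peak file share one accession, while the IMR90 file, which differs in cell type, receives a separate one.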
The ENCODE DCC controlled vocabulary (CV) is a mechanism for associating metadata with ENCODE experiments. Metadata terms are added as needed, and the metadata controlled vocabularies have been expanded this year for both human and mouse. There are currently 23 metadata controlled vocabularies. The largest vocabularies are ‘Antibody’ (199 terms) and ‘Cell Line’ (235 human and 34 mouse cell types). The CV has received extensive curation and quality review this year to ensure completeness and eliminate duplicate and confusing terms. This effort has led to a more informative set of metadata associated with each track, including links to term descriptions and supporting documents. Two specific areas where the CV was improved are the cell type karyotype and lineage terms. The karyotype term has been simplified to describe cell lines that are derived from normal or cancerous tissues. At present 72 cell lines have been annotated as normal and 47 cell lines as cancerous. The lineage term has been used to describe the progenitor tissue type from which the source tissue type has differentiated. The values ectoderm, endoderm, mesoderm and inner cell mass are associated with 36, 45, 90 and 12 cell lines, respectively.
A new Genome Browser feature, Data Hubs, supports display of off-site annotations alongside ENCODE data. The first publicly provided hub presents the Roadmap Epigenomics (20) catalog of data sets, enabling close comparison of the voluminous and complementary results from these two consortia. Figure 1 shows a Genome Browser screen showcasing ENCODE and Roadmap Epigenomics data together. For more information about the Data Hubs feature, see the Genome Browser update in this issue.
Figure 1. ENCODE data displayed in the UCSC Genome Browser together with two annotations from the Roadmap Epigenomics Release III data hub. The genomic region contains two protein coding genes, plasma membrane calcium ATPase 4a (ATP2B4) and lymphocyte transmembrane (more ...)
The DCC effort to pass quality-reviewed ENCODE data to the NCBI Gene Expression Omnibus (GEO) (21) and Short Read Archive (SRA) as auxiliary data repositories has made considerable progress in the past year. Since September 2010, we have accessioned 916 GEO Samples in 15 GEO Series in human and mouse over three assemblies (NCBI36/hg18, GRCh37/hg19 and NCBI37/mm9). To further organize the data and facilitate access, NCBI BioProjects have been created for ENCODE.