Each individual factor page contains a number of sections (descriptions below) and a panel on the top right of the page, which provides an image of the three-dimensional protein structure of the TF (when available) and links to outside databases (PDB, HGNC, Gene Card, Entrez, RefSeq, UCSC, UniProt and Wikipedia). The UCSC link points to the Genome Browser at University of California Santa Cruz displaying all raw ENCODE data on this TF.
TF function
This section contains a brief overview of the molecular function of the TF. When known, factorbook provides information about its protein family, consensus-binding sequence, functional-binding partners (e.g. if it is part of a complex which acts as an expressional regulator) and disease phenotypes. The information was taken from the UCSC annotation for these factors, supplemented with information from RefSeq and Gene Card.
ENCODE ChIP-seq datasets
Each TF has a matrix organized by cell line and laboratory that generated the data. The number displayed represents the number of ChIP-seq experiments performed for that factor in that cell line. In some of these experiments, cells were subjected to treatment (e.g. with dexamethasone) to detect potential differences in TF binding. The user can toggle to show or hide this matrix, which can be convenient when there are many datasets for a TF (e.g. the insulator-binding protein CTCF and RNA Polymerase II). Clicking on the number opens a new page containing a sortable table of available data. The download links allow the user to download the datasets from the UCSC genome browser.
Average profiles of modified histones around the summit of ChIP-seq peaks
Average histone modification profiles are shown for the [−2 kb, +2 kb] window around the summits (the position with the most sequence reads) of TF ChIP-seq peaks (). These are separated into peaks that are proximal to (within 1 kb of) an annotated transcription start site or TSS (dashed lines) and peaks that are distal to all annotated TSS (solid lines). Only histone modification data from the same cell line as the TF ChIP-seq data are shown. The Broad Institute team in the ENCODE consortium performed the ChIP experiments with antibodies specific for various modified histones to generate the histone modification data used in factorbook (7).
These graphs are interactive. A user can hover the cursor over a curve to reveal its histone modification identity, or hover over a histone modification in the legend to highlight its curves and gray out the other histone modifications in the figures. The user can also click a histone modification in the legend to toggle its curve on or off in all figures, or click the ‘Proximal’ or ‘Distal’ button in the legend to show only the average histone modification profiles anchored around ChIP-seq peaks that are proximal or distal to annotated transcripts. The legend remains on the right of the page even when many datasets are available and the user scrolls down the page.
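As an illustration only, the proximal/distal split and the window averaging described in this section could be sketched as follows. This is not factorbook's implementation: the function names, the single-chromosome position-to-signal dictionary and the inclusive 1 kb cutoff are assumptions made for the example.

```python
from bisect import bisect_left

WINDOW = 2000        # half-width of the profile window in bp (from the text)
PROXIMAL_BP = 1000   # proximal = within 1 kb of a TSS (inclusive cutoff is an assumption)


def nearest_tss_distance(summit, sorted_tss):
    """Distance in bp from a summit to the closest TSS; sorted_tss is ascending."""
    if not sorted_tss:
        return float("inf")  # no annotation: treat the peak as distal
    i = bisect_left(sorted_tss, summit)
    candidates = sorted_tss[max(i - 1, 0):i + 1]
    return min(abs(summit - t) for t in candidates)


def average_profiles(summits, tss_positions, signal):
    """Mean signal per offset in [-WINDOW, +WINDOW], split by peak class.

    signal is a hypothetical dict mapping a genomic position on one
    chromosome to a histone modification signal; missing positions count as 0.
    """
    sorted_tss = sorted(tss_positions)
    width = 2 * WINDOW + 1
    sums = {"proximal": [0.0] * width, "distal": [0.0] * width}
    counts = {"proximal": 0, "distal": 0}
    for s in summits:
        is_proximal = nearest_tss_distance(s, sorted_tss) <= PROXIMAL_BP
        group = "proximal" if is_proximal else "distal"
        counts[group] += 1
        for k, offset in enumerate(range(-WINDOW, WINDOW + 1)):
            sums[group][k] += signal.get(s + offset, 0.0)
    return {
        g: ([v / counts[g] for v in sums[g]] if counts[g] else [])
        for g in sums
    }
```

Index `WINDOW` of each returned list corresponds to the summit itself, so the two lists can be plotted directly as the dashed (proximal) and solid (distal) curves.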
Average profiles of nucleosomes around the summit of ChIP-seq peaks
This section shows the effect of bound TFs on regional nucleosome positioning (). Average nucleosome occupancy profiles are shown for the [−2 kb, +2 kb] window around the summits of TF ChIP-seq peaks. Red lines represent peaks that are proximal to an annotated transcript (within 1 kb of a TSS) and blue lines show peaks that are distal to all annotated transcripts (>1 kb from all TSS). As for histone modifications, proximal profiles of nucleosome occupancy are arranged such that the transcriptional direction of the nearest transcript is toward the right. The Stanford team generated the nucleosome positioning data in cell lines GM12878 and K562, using MNase digestion of chromatin followed by deep sequencing of mononucleosomal DNA (8). Three examples shown in are normalized nucleosome occupancy around the ChIP-seq peaks of CTCF from the GM12878 cell line; the ChIP-seq datasets were generated by three different laboratories. The plots consistently show that CTCF positions nucleosomes more strongly in distal regions (higher signal, blue lines in ) than in regions proximal to the TSS (lower signal, red lines in ) and that there is a loss of nucleosomes in the gene body. More details are explained in our recent publication (3).
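The orientation step mentioned above (proximal profiles arranged so that the nearest transcript points to the right) can be illustrated with a minimal sketch. It assumes each peak's profile is a list of occupancy values ordered by genomic coordinate and that the strand of the nearest transcript is given as '+' or '-'; both function names are hypothetical.

```python
def orient_profile(profile, strand):
    """Reverse a per-peak profile when the nearest transcript is on the minus
    strand, so that transcription always runs toward the right."""
    return list(reversed(profile)) if strand == "-" else list(profile)


def mean_oriented_profile(profiles_with_strand):
    """Average equal-length profiles after orienting each one.

    profiles_with_strand is a list of (profile, strand) pairs.
    """
    oriented = [orient_profile(p, s) for p, s in profiles_with_strand]
    n = len(oriented)
    return [sum(column) / n for column in zip(*oriented)]
```

Without this step, features downstream of plus-strand and minus-strand transcripts would fall on opposite sides of the summit and partly cancel in the average.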
Motifs enriched in the top ChIP-seq peaks
We built a computational pipeline that takes advantage of the MEME-ChIP suite of tools to discover the motifs enriched in the sequences of the top 500 TF ChIP-seq peaks (3). We display five motifs (M1–M5), with motif name (when it is known, otherwise shown as no match) and sequence logo, as well as the number of peaks out of the top 500 peaks containing a motif site. This section allows the user to customize the motifs shown by cell line, laboratory, protocol, treatment and antibody. The five motifs shown in are from the top 500 ChIP-seq peaks of CTCF in GM12878 generated by the Broad Institute. Among the five motifs, M1 and M2 are variants of the canonical CTCF motif and M3 is the extension of the CTCF motif. However, M4 and M5 do not match any known motifs.
We used the FIMO tool (9) to scan the de novo identified motifs in the entire set of ChIP-seq peaks and the two equal-length regions flanking the peaks as control. A series of graphs report two quantities for each motif on bins of peaks sorted by their ChIP-seq q-values: (left y-axis) percentage of the peaks that contain a site for the motif and (right y-axis) the distribution of the distances of the motif site to the summit of the peak (the position in a TF ChIP-seq peak that corresponds to the most sequencing reads). For the previous CTCF example (), comparison between the peak and the regions flanking the peak shows that M4 is not more enriched in peaks than in flanking regions. Thus, M4 does not appear to be a valid motif, nor does M5 (data not shown). More details are explained in our recent publication (3).
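To make the two plotted quantities concrete, the following sketch computes them per bin from motif-scan results that are assumed to be already available. It does not call FIMO. The record fields (`q_value`, `site_offsets`, `flank_hits`), the default bin size of 500 peaks and the ascending q-value order are assumptions chosen for illustration.

```python
def motif_enrichment_by_bin(peaks, bin_size=500):
    """Summarize motif occurrence for bins of peaks ranked by q-value.

    Each peak is a dict with hypothetical keys:
      q_value      - ChIP-seq q-value of the peak
      site_offsets - offsets (bp) of motif sites from the summit, [] if none
      flank_hits   - number of the two flanking regions with a site (0 to 2)
    """
    ranked = sorted(peaks, key=lambda p: p["q_value"])  # most significant first
    rows = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        n = len(chunk)
        with_site = sum(1 for p in chunk if p["site_offsets"])
        flank_sites = sum(p["flank_hits"] for p in chunk)
        rows.append({
            "first_rank": start + 1,
            "pct_peaks": 100.0 * with_site / n,           # left y-axis
            "pct_flanks": 100.0 * flank_sites / (2 * n),  # control
            # right y-axis: distances of motif sites to the summit
            "distances": sorted(abs(d) for p in chunk for d in p["site_offsets"]),
        })
    return rows
```

A motif such as M4 in the CTCF example would show `pct_peaks` no higher than `pct_flanks` across bins, whereas a genuine motif is expected to show higher `pct_peaks` and small distances to the summit.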
Comparison of the binding profile of a TF to those of other TFs and histone modifications
This section shows a detailed view of the relationship between a pivoting TF and other TFs or histone marks. A shows a heatmap for which the pivoting TF is c-Fos in the HUVEC cell line. The first column represents the binding profile of c-Fos in a [−1 kb, +1 kb] window centered on the peak summit, with each row being a peak and the rows sorted inversely by ChIP signal. Other columns represent the binding profiles of other TFs in the same regions as in the first column. B is the same as A except that the other columns represent the profiles of histone modifications in a wider window ([−5 kb, +5 kb] centered on the peak summit). Analyses are limited to TFs and histone marks from the same cell line as the pivoting TF.
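The row ordering shared by all heatmap columns can be sketched as below. The example reads "sorted inversely by ChIP signal" as strongest peak first, which is an interpretation, and the function name and input layout are hypothetical.

```python
def order_heatmap_rows(pivot_signal, other_tracks):
    """Order peaks by the pivoting TF's ChIP signal and reuse that order.

    pivot_signal: one ChIP signal value per peak of the pivoting TF.
    other_tracks: dict mapping a TF or histone mark name to a list of
                  per-peak profiles, in the same peak order as pivot_signal.
    Returns the row order (peak indices) and the reordered tracks.
    """
    order = sorted(range(len(pivot_signal)),
                   key=lambda i: pivot_signal[i], reverse=True)
    reordered = {name: [rows[i] for i in order]
                 for name, rows in other_tracks.items()}
    return order, reordered
```

Because every column uses the same row order, a horizontal slice of the heatmap always refers to a single peak of the pivoting TF.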
FUTURE DEVELOPMENT
Our future plans for factorbook include the addition of more TFs, as well as more data types, including RNA-seq and DNase I. Further analysis results from the ENCODE Data Analysis Center (motif analysis from the Kellis Lab and the CAGT plots from Kundaje et al. (8), cross-species sequence conservation and sequence variation within the human population, and allele-specific sites) are forthcoming as well.