Downstream analyses for a cistrome study require specific or integrative tools. The value of Cistrome is that it enables biologists to use a broad range of bioinformatics tools to easily generate report-quality figures and tables, and to simplify routine analysis using reproducible pipelines. In Cistrome, we provide tools for correlation studies, genome feature association studies and motif analysis together with public workflows to link these tools together.
Usually, researchers require at least two biological replicates to show the consistency of an experiment. An intuitive way to show consistency is to ask whether the replicates are correlated in some meaningful measurement. Correlation can also answer the question of whether two transcription factors are co-localized. For instance, a low correlation between two biological replicates might suggest poor data quality, whereas highly overlapping cistromes for two factors might suggest interactions between the factors. For these reasons, we deployed two levels of tools in Cistrome to calculate correlations: one to compare protein-DNA binding signals and the other to investigate the overlap of the predicted binding sites. First, Cistrome can calculate Pearson correlation coefficients for multiple signal profiles on a whole-genome scale or by restricting the calculation to a set of genomic regions defined by the user. A Pearson correlation coefficient close to 1 suggests that the replicates are consistent or that the two factors are correlated. To save computation time, these tools use window-smoothing methods to calculate the mean or median values within non-overlapping fixed-size windows, which decreases the number of data points involved in the calculation. The results are represented as scatter plots or heatmap images in either PDF or PNG format, as illustrated in Figure 2a. The second level of correlation addresses how many of the predicted binding sites (peaks) from several replicates, different factors or different conditions overlap. We provide a tool for drawing a Venn diagram using two or three BED format peak files. The circles and overlapping regions in the Venn diagram can be proportional to the actual number of peaks and overlaps (Figure 2).
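The window-smoothing step described above can be sketched as follows. This is a minimal illustration rather than the Cistrome implementation: the function names `window_means` and `windowed_pearson` are hypothetical, the signal is assumed to be already loaded as a per-base array for a single chromosome, and the toy profiles are random data generated only for demonstration.

```python
import numpy as np

def window_means(signal, window):
    """Average a per-base signal over non-overlapping fixed-size windows."""
    signal = np.asarray(signal, dtype=float)
    n_windows = len(signal) // window           # an incomplete tail window is dropped
    trimmed = signal[: n_windows * window]
    return trimmed.reshape(n_windows, window).mean(axis=1)

def windowed_pearson(profiles, window):
    """Pairwise Pearson correlation matrix of window-averaged profiles."""
    binned = np.vstack([window_means(p, window) for p in profiles])
    return np.corrcoef(binned)

# Toy data (assumption): two noisy "replicates" of one profile plus an unrelated one.
rng = np.random.default_rng(0)
base = rng.random(10_000)
rep1 = base + 0.05 * rng.standard_normal(10_000)
rep2 = base + 0.05 * rng.standard_normal(10_000)
unrelated = rng.random(10_000)

corr = windowed_pearson([rep1, rep2, unrelated], window=100)
print(np.round(corr, 2))
```

With a window of 100, each 10,000-point profile is reduced to 100 values before the correlation is computed, which is the source of the time saving. Replacing `mean` with `np.median` along the same axis would give the median variant.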
Figure 2 Correlation and association tools. (a) Correlation plots using different histone marks in C. elegans early embryos. Cistrome correlation tools can generate either a heatmap with hierarchical clustering according to pair-wise correlation coefficients ...
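For the peak-level comparison, the counts behind a two-set Venn diagram can be obtained by testing interval overlap between two BED peak lists. The sketch below is an assumption-laden illustration, not the Cistrome tool: `read_bed_intervals` and `count_overlapping` are hypothetical names, the peaks are toy values, and the quadratic scan per chromosome is chosen for clarity rather than speed.

```python
def read_bed_intervals(lines):
    """Parse (chrom, start, end) from BED lines, skipping blank and header lines."""
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        peaks.append((chrom, int(start), int(end)))
    return peaks

def count_overlapping(peaks_a, peaks_b):
    """Count peaks in A that overlap at least one peak in B (half-open intervals)."""
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    count = 0
    for chrom, start, end in peaks_a:
        candidates = by_chrom.get(chrom, [])
        if any(start < b_end and b_start < end for b_start, b_end in candidates):
            count += 1
    return count

# Toy peak files (assumption): three peaks in A, two in B.
bed_a = ["track name=A", "chr1\t100\t200", "chr1\t500\t600", "chr2\t100\t200"]
bed_b = ["chr1\t150\t250", "chr2\t300\t400"]

peaks_a = read_bed_intervals(bed_a)
peaks_b = read_bed_intervals(bed_b)
shared = count_overlapping(peaks_a, peaks_b)
print("A only:", len(peaks_a) - shared, "shared:", shared)
```

Note that the count is not necessarily symmetric: one wide peak in A may overlap several peaks in B, so a Venn diagram drawn from such counts has to fix a convention for which file defines the shared region.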
Functional DNA regions in genomes are often evolutionarily conserved between different species [17]. Therefore, higher evolutionary conservation of ChIP-chip/seq peaks compared with flanking non-peak regions is often a good indicator of data quality and correct data preprocessing. In Cistrome, the 'Conservation Plot' tool can take one or more cistromes in BED files as input, and use UCSC PhastCons conservation scores [20] to produce a figure showing the average conservation score profiles around the peak centers (Figure 2). This analysis could be extended to compare the conservation differences between multiple cistromes.
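The averaging behind such a conservation profile can be illustrated with a short sketch. It assumes that per-base scores for one chromosome have already been loaded into an array (reading PhastCons files is omitted), and the function name `average_profile`, the flank size and the toy scores are all hypothetical.

```python
import numpy as np

def average_profile(scores, centers, flank):
    """Mean score at each offset in [-flank, +flank] around the given centers.

    Centers whose window would run past either end of the array are skipped.
    """
    scores = np.asarray(scores, dtype=float)
    windows = [
        scores[c - flank : c + flank + 1]
        for c in centers
        if c - flank >= 0 and c + flank < len(scores)
    ]
    if not windows:
        raise ValueError("no center has a complete flanking window")
    return np.mean(windows, axis=0)

# Toy scores (assumption): background of 0 with conserved blocks at three peak centers.
scores = np.zeros(1000)
peak_centers = [200, 500, 800]
for c in peak_centers:
    scores[c - 5 : c + 6] = 1.0

profile = average_profile(scores, peak_centers, flank=20)
print(len(profile), profile[20], profile[0])
```

A profile that is higher at the center offset than at the flanks corresponds to the pattern expected for peaks that are more conserved than their surroundings. Calling the function once per BED file would give the curves needed to compare several cistromes.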
Another useful task is to find the genomic features or genes associated with transcription factor binding or histone modification sites. For instance, H3K4me3 is enriched in the promoter regions of active genes [21], and H3K36me3 is enriched in transcribed exons [22]. Finding the target genes is critical to understanding the function of transcription factors, such as transcription repression or activation. Therefore, a set of tools from the CEAS (Cis-regulatory Element Annotation System) [23] package, including SitePro, GCA (Gene Centered Annotation), Peak2Gene and the CEAS main program, has been deployed in the Cistrome web interface. SitePro draws the average signal profiles around given genomic locations. When multiple locations or sets of signal files are used as input, SitePro can address questions such as how the signals of multiple factors change at the same locations between different conditions or how the same factor changes in different sets of genomic locations. The GCA tool finds the peaks that are closest to the transcription start site (TSS) of each gene and calculates the peak coverage of the gene body, reporting the results in a spreadsheet. The Peak2Gene tool finds the nearest genes for each peak. The CEAS main program generates multi-page figures as either a PDF document or PNG image. In general, when a BED file for peaks and a WIGGLE file for signals are used as input, the resulting report includes the peak enrichment on chromosomes and on various genomic features, such as gene promoters, downstream regions, UTRs, coding exons or introns, as well as the average signal profile around TSSs and transcription termination sites (TTSs), the meta-gene body (all genes are scaled to 3 kbp), concatenated exons (coding regions), or concatenated introns. When gene lists are provided (for example, a list of genes with the highest and lowest levels of expression for the same sample in a ChIP-chip or ChIP-seq experiment), CEAS will plot the average signal profiles for different gene groups in different colors for the TSS, TTS, gene bodies, exons, or introns (Figure 2). This function can be coupled with the gene expression tools described in the previous section to show whether the signals of the transcription factor or histone marks are related to transcription repression or activation.
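The nearest-gene assignment that a tool such as Peak2Gene performs can be sketched with a binary search over sorted TSS positions. This is only an illustration under simplifying assumptions, not the CEAS code: the function `nearest_gene` and the gene names are hypothetical, a single chromosome is considered, and strand is ignored.

```python
from bisect import bisect_left

def nearest_gene(peak_center, tss_positions, gene_names):
    """Return (gene, distance) for the TSS closest to peak_center.

    tss_positions must be sorted in ascending order and aligned with gene_names.
    On a tie, the upstream (lower-coordinate) TSS is returned.
    """
    i = bisect_left(tss_positions, peak_center)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    best = min(candidates, key=lambda j: abs(tss_positions[j] - peak_center))
    return gene_names[best], abs(tss_positions[best] - peak_center)

# Toy annotation and peaks (assumption): three genes on one chromosome.
tss_positions = [1000, 5000, 9000]
gene_names = ["geneA", "geneB", "geneC"]
peaks = [(1200, 1400), (4000, 4600), (9500, 9700)]

assignments = []
for start, end in peaks:
    center = (start + end) // 2
    assignments.append(nearest_gene(center, tss_positions, gene_names))
print(assignments)
```

The gene-centered view taken by GCA is the mirror image of this lookup: the peak centers are sorted and each TSS is used as the query position.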
In addition to the average signal profiles at a given set of genomic locations, as shown in CEAS, the visualization and clustering of signal profiles from different factors at specific locations provide another angle of insight. Through the observation of patterns, we can also find the co-factors (co-activators or co-repressors) that tend to work together on their regulated genes. The Cistrome 'Heatmap' tool can extract the signals centered at every given genomic location, perform either k-means clustering or sorting by maximum, mean, or median values within each region, and then draw a heatmap. For example, the group of TSSs for active genes should have H3K4me3 enriched at the TSS and a gradual H3K36me3 enrichment downstream of the TSS, whereas the group of TSSs for inactive genes would have low signals of both H3K4me3 and H3K36me3. More detailed clusters are revealed when signal profiles of multiple factors are used (Figure 3). Multiple WIGGLE files for different factors or different conditions can be used as input together with a set of genomic locations defined in a BED file. These regions could be nucleosome-free regions or transcription factor binding sites instead of TSSs of genes. Clustering or sorting can be based on all or some of the WIGGLE files. The color scheme of the heatmap is configurable to adjust the contrast for better visualization between high and low signals.
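The extract, sort and cluster steps described above can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the k-means routine is a minimal Lloyd's iteration written for this example rather than the implementation used by the Heatmap tool, and the signal is a toy array standing in for data read from a WIGGLE file.

```python
import numpy as np

def extract_windows(signal, centers, flank):
    """Stack the signal in [-flank, +flank] around each center into a matrix (one row per center)."""
    signal = np.asarray(signal, dtype=float)
    kept = [c for c in centers if c - flank >= 0 and c + flank < len(signal)]
    return np.array([signal[c - flank : c + flank + 1] for c in kept])

def sort_rows(matrix, how="mean"):
    """Order rows from highest to lowest by their max, mean or median value."""
    stat = {"max": np.max, "mean": np.mean, "median": np.median}[how]
    order = np.argsort(-stat(matrix, axis=1), kind="stable")
    return matrix[order]

def kmeans_rows(matrix, k, n_iter=20, seed=0):
    """Minimal Lloyd's k-means over rows; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centroids = matrix[rng.choice(len(matrix), size=k, replace=False)]

    def assign():
        dists = ((matrix[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign()
        for j in range(k):
            if np.any(labels == j):             # leave an empty cluster's centroid unchanged
                centroids[j] = matrix[labels == j].mean(axis=0)
    return assign()

# Toy signal (assumption): three "active" sites with signal and two "inactive" sites without.
signal = np.zeros(2000)
active, inactive = [200, 600, 1000], [1400, 1800]
for c in active:
    signal[c - 10 : c + 11] = 5.0

matrix = extract_windows(signal, active + inactive, flank=10)
labels = kmeans_rows(matrix, k=2)
print(matrix.shape, labels)
```

To cluster on several factors at once, the per-factor matrices for the same locations could be concatenated column-wise before calling `kmeans_rows`. Rows ordered by cluster label can then be passed to any image-plotting routine to draw the heatmap.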
Figure 3 Heatmap analysis with k-means clustering. By combining H3K27me3, H3K9me3, H3K4me3, H3K4me2, H3K36me3 and MES-4 (the histone H3K36 methyltransferase) ChIP-chip signals, as in Figure 2a, the Cistrome heatmap tool separates the ± 1-kbp regions for ...
Transcription factor motif analysis is key to understanding the specific DNA patterns of in vivo transcription factor binding. Motif analysis can also identify the co-factors that work together to activate or repress gene expression because the binding sites of co-factors should have similar DNA motifs. We deployed a new motif algorithm called 'SeqPos' in Cistrome based on the algorithm in [24]. By taking the peak locations as the input, SeqPos can find motifs that are enriched close to the peak centers. SeqPos can scan all of the motifs that we collected from JASPAR [25], TRANSFAC [26], Protein Binding Microarray (PBM) [27], Yeast-1-hybrid (y1h) [28], and the human protein-DNA interaction (hPDI) databases [29]. SeqPos can also find de novo motifs using the MDscan algorithm [30]. The final significant motifs are listed in an HTML page, as in Figure 4, where the user can sort the motifs by z-score or P-value and click on each motif to see detailed information, such as the probability matrix, logos, and the motif consensus. A position-specific scoring matrix can be copied or passed to another tool within Cistrome called 'screen motif' to search a given set of genomic locations for all occurrences of a particular motif.
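Searching sequences for occurrences of a motif from a position-specific scoring matrix can be illustrated with a generic log-odds scan. The sketch below does not reproduce SeqPos or the 'screen motif' tool: the functions, the uniform background, the pseudocount, the threshold and the four-base toy motif are all assumptions made for the example, and only the forward strand is scanned.

```python
import math

def log_odds_matrix(prob_matrix, background=0.25, pseudo=1e-3):
    """Convert per-position base probabilities to log2 odds against a uniform background."""
    return [
        {b: math.log2((col.get(b, 0.0) + pseudo) / background) for b in "ACGT"}
        for col in prob_matrix
    ]

def scan_motif(sequence, pssm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pssm)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start : start + width]
        if any(b not in "ACGT" for b in window):    # skip windows containing N or other symbols
            continue
        score = sum(pssm[i][b] for i, b in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

def consensus_column(base):
    """Toy probability column (assumption): 0.97 for the consensus base, 0.01 for the others."""
    return {b: (0.97 if b == base else 0.01) for b in "ACGT"}

pssm = log_odds_matrix([consensus_column(b) for b in "TGCA"])
hits = scan_motif("AATGCATTTGCAGG", pssm, threshold=5.0)
print([start for start, _ in hits])
```

A fuller scan would also score the reverse complement of each window and would take the background base frequencies from the genome being searched instead of assuming 0.25 for every base.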
Figure 4 Cistrome SeqPos motif analysis. A screenshot of the SeqPos output. The enriched motifs at the androgen receptor binding sites without FoxA1 binding are displayed in an interactive HTML page. When the user clicks on the row of a particular motif, the motif ...
Cistrome has many other useful tools to help users better manipulate their data. A lift over tool can convert WIGGLE files from one genome assembly to another if users want to combine old analysis results with a new genome annotation. However, ab initio re-preprocessing is recommended to generate new WIGGLE files for the new genome assembly. A WIGGLE file standardization tool can convert the resolution of a WIGGLE file to 8, 32, 64 or 128 bp. Two other tools can extract the data for a given chromosome from a BED file or a WIGGLE file. Furthermore, many Galaxy functions that we considered to be very useful for ChIP-chip/seq data analyses are also enabled in Cistrome. For example, the intersect tool for two interval files and the filtering/sorting/cutting tools for tab-delimited text files are widely used in many of our precompiled public workflows to post-process intermediate results and then feed them into downstream tools (Table S2 in Additional file 1