Transcription factors (TFs) play critical roles in regulating gene expression.
Determining transcription factor binding sites (TFBSs) is challenging because the
DNA segments recognized by TFs are often short and dispersed in the genome, and the
target loci of a TF vary between tissues, developmental stages and physiological
conditions.
Genome-wide protein-DNA interactions are now typically profiled using ChIP-Seq, i.e.
chromatin immunoprecipitation (ChIP) with massively parallel short-read sequencing
[1]. A typical
ChIP-Seq experiment generates millions of short (35–75 bp) directional DNA
sequence reads that represent ends of ~200 bp immunoprecipitated DNA fragments.
The read sequences are mapped onto a reference genome. Then, for experiments with
transcription factors, there are three central analysis issues: peak-calling,
binding motif identification, and motif interpretation. Here, we report an
R/Bioconductor-based pipeline that offers an efficient, integrated set of analysis
tools for such experiments.
The aligned read data are first transformed into a form that reflects local densities
of immunoprecipitated DNA fragments, and regions with high read densities, typically
referred to as peaks, are identified by a peak-calling algorithm (Reviewed in
[2],
[3]). Here, we
use an R package, based on PICS, which we developed for this pipeline. PICS (see
methods) has been shown to perform well compared to the QuEST
[4], MACS
[5], CisGenome
[6], and USeq
[7].
Peak-calling returns a list of enriched genomic regions in which the protein of
interest is expected to be directly or indirectly associated with DNA. Analysis then
identifies potential DNA binding sites within these regions, and summarizes these
sets of short sequences as motifs, typically as position weight matrices (PWMs) or
families of PWMs
[8],
[9]. There are two main types of algorithms for
de
novo motif discovery: enumerative and probabilistic. Enumerative
methods identify and rank all m-letter patterns in a set of sequences. Probabilistic
methods use stochastic sequence models along with Expectation-Maximization (EM) or
Gibbs sampling techniques to infer PWMs
[10]–
[13], and can be computationally
impractical for large datasets. Established tools like Weeder
[14], Gibbs sampler
[15] or MEME
[16] were developed
to address relatively small sets of input sequences, and scale poorly to the much
larger sets of enriched sequences that whole-genome ChIP-Seq data can return.
Pipelines developed for ChIP-Seq analysis, e.g. CisGenome
[6] and MICSA
[17], are based on these algorithms
or variants of them, and face similar constraints. Other tools like HMS
[18] and ChiPMunk
[19] were
developed for motif discovery from ChIP-Seq data, and so are more scalable, but can
identify only a single-motif at a time, and would need to be modified to discover
motif combinations. Our pipeline uses GADEM
[20], which is a good compromise between
fully probabilistic and enumerative approaches, can process large sets of ChIP-Seq
regions, handles both dimer and monomer motifs, automatically identifies multiple
motifs, and automatically adjusts motif widths. We have ported GADEM to R, as a
package called rGADEM. To address very large sets of enriched regions, we have
extended the original C code to take advantage of multithreading, without requiring
user configuration, via Grand Central Dispatch on OS X, and openMP (openmp.org),
which supports shared-memory parallel programming on all architectures, including
Unix and Windows. Compared to probabilistic approaches, this provides a simple, fast
and efficient
de novo framework.
Once
de novo motifs have been identified, it is desirable to
compare, annotate and assess these in order to retain motifs that are likely to be
biologically relevant, while removing artifactual and background motifs. For this we
have designed a new tool, MotIV (
Motif
Identification and
Validation),
which is based on STAMP
[21]. Like STAMP, MotIV provides queries to the JASPAR
database
[22], and
users can flexibly input other sets of reference PWMs (e.g. TRANSFAC
[23], UniProbe
[24], DBTBS
[25], or
RegulonDB
[26]). As outlined below, MotIV provides visualization and
postprocessing options that are unavailable in STAMP, TOMTOM
[27] and MACO
[28]. It provides summary statistics on
motif occurrences, reports joint motif occurrences and plots distance and
pairwise-distance distributions. It can also refine motifs and motif occurrences
based on a set of filters provided by the user.
Because gene regulation typically involves combinatorial action of multiple TFs,
functional binding sites tend to occur as groups that are often referred to as
cis-regulatory modules (CRMs)
[29]. Identifying CRMs can improve the accuracy of predicting
functional binding sites. However, results from computational methods for
determining CRMs (e.g. Cluster-Buster
[30] and CisModule
[31]) are rarely
reported for ChIP-Seq data, because they are too computationally intensive or return
long lists of candidate modules that are challenging to assess. MotIV offers an
alternative way to identify biologically relevant combinations of motifs.
Below, we describe the pipeline in more detail. Its core consists of three
Bioconductor packages: PICS calls enriched regions; rGADEM identifies de
novo motifs; and MotIV visualizes and annotates motifs, and identifies
motif combinations that have nonrandom spatial relationships. This is the first
complete Bioconductor pipeline for analyzing transcription factor ChIP-Seq data. The
pipeline is computationally efficient, supporting processing datasets that consist
of tens of thousands of peaks. We illustrate the pipeline by analyzing published
Illumina datasets for genome-wide binding in human of FOXA1, ER, STAT1, and CTCF. We
compare the performance of our approach to previously described methods for motif
and module discovery, and show that the pipeline supports detecting biologically
relevant motif modules that are not easily discovered by other methods.