Chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) is a powerful tool for studying the binding of transcription factors to genomic DNA. ChIP-seq provides a genome-wide map of the locations bound by the immunoprecipitated (ChIP-ed) transcription factor (TF). The resolution of the map depends on the TF and on the software used to determine the binding locations (so-called ‘peak-calling software’), but the predicted locations are often within 50 base pairs (bp) of a site matching the TF's known DNA-binding propensity (1). This map provides direct evidence of the enhancers and promoters bound by the TF and clues to its role in transcriptional regulation. In addition, the short genomic regions identified by ChIP-seq are generally highly enriched in binding sites of the ChIP-ed TF, and consequently provide a rich source of information about its relative DNA-binding affinity. The regions also tend to be enriched for the binding sites of other TFs that bind cooperatively or competitively with the ChIP-ed TF (2,3).
DNA-binding motifs expressed as position-weight matrices (PWMs) can be used to model the binding free energy of a TF protein to a specific sequence of DNA relative to random DNA (4). (In what follows, we will simply say that a motif represents the ‘DNA-binding affinity’, dropping the term ‘relative’ for compactness of exposition.) A primary objective of many ChIP-seq experiments is determining the in vivo DNA-binding affinity of the ChIP-ed TF, and it has been shown that ChIP-seq tag densities are predictive of protein–DNA binding affinity (5). This objective is usually approached by ab initio motif discovery, for which many algorithms exist (3,6,7). Motif discovery yields one or more motifs, one of which may represent the DNA-binding affinity of the ChIP-ed TF. The other motifs may be those of cooperatively or competitively binding TFs. In many cases, one motif stands out as occurring more frequently in the ChIP-ed regions than any other, and is assumed to be that of the ChIP-ed TF.
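As an illustration of the PWM model mentioned above, the following minimal Python sketch builds a log-odds matrix from per-position base counts and uses it to score candidate sites. The function names, the uniform background, and the pseudocount are assumptions made for this example only; they are not taken from any particular motif discovery program.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, background=None, pseudocount=1.0):
    """Convert per-position base counts into a log-odds PWM (in bits).

    `counts` is a list of dicts, one per motif position, mapping each base
    to its observed count. A uniform background is assumed by default.
    """
    background = background or {b: 0.25 for b in BASES}
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm

def score_site(pwm, site):
    """Sum the per-position log-odds scores for a site as wide as the motif."""
    return sum(column[base] for column, base in zip(pwm, site))

def best_site(pwm, sequence):
    """Return (start, score) of the best-scoring window on the given strand.

    The sequence must be at least as long as the motif.
    """
    width = len(pwm)
    starts = range(len(sequence) - width + 1)
    start = max(starts, key=lambda i: score_site(pwm, sequence[i:i + width]))
    return start, score_site(pwm, sequence[start:start + width])
```

Under this additive model, a higher score corresponds to a sequence that is more favorable relative to the background; a real analysis would normally scan both strands as well.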
Assuming that the most highly ‘enriched’ motif represents the direct DNA-binding affinity of the ChIP-ed TF can be dangerous for several reasons. Firstly, if the ChIP-seq data are of low quality due to poor antibody performance or sample preparation issues, the correct motif may not be present in the set of discovered motifs, or the algorithms may fail to find any motifs. Secondly, if the TF primarily binds DNA in conjunction with one or more other DNA-binding TFs, their motifs may appear more enriched than that of the ChIP-ed TF. Thirdly, the ChIP-ed factor may not bind DNA directly at all, but always by ‘piggy-backing’ on one or more distinct DNA-binding TFs.
This article describes a novel method for identifying the DNA-binding motif of the ChIP-ed TF even in difficult ChIP-seq data sets. Our method is designed to overcome the first two sources of difficulty described in the preceding paragraph (poor ChIP-seq data quality and highly enriched co-factor binding sites). It can also predict when the third situation (binding by ‘piggy-backing’) is likely to be occurring. Our method can be used to analyze sets of motifs determined using ab initio motif discovery on the ChIP-seq regions. It can also be applied more generally as a motif enrichment analysis (MEA) tool (8–10), to consider all motifs in a compendium of known motifs as candidates for the ChIP-ed TF's binding motif.
Our analysis methodology, which we call ‘central motif enrichment analysis’ (CMEA), is based on the simple observation that the binding sites of the assayed transcription factor in a successful TF ChIP-seq experiment will cluster near the centers of the declared ChIP-seq peaks. In other words, the actual location of direct DNA binding by whatever protein or protein complex was actually pulled down by the antibody to the TF should tend toward the center of any given ChIP-seq region. This assumption should be true if the ChIP-seq region itself was identified based on sharply defined ‘peaks’ in the mapped sequence tag density, as is the case for many commonly used ‘peak-calling’ algorithms [e.g. MACS (11), PeakSeq (12), QuEST (13)]. When all goes well, the actual ChIP-ed binding site lies somewhere within a region of about 100 bp (1), centered on the ‘peak’, and with increasing probability closer to the center. In other words, we expect the probability (density) of the true binding location to be maximum in the center of a peak.
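To make the centering assumption concrete, the sketch below derives a fixed-width window around a peak center, so that every region has the same size and the expected binding location sits in the middle. The function name and argument layout are hypothetical, and taking the midpoint of a reported interval when no single summit coordinate is available is a simplification adopted for this example.

```python
def centered_window(width, summit=None, start=None, end=None):
    """Return a half-open (start, end) window of `width` bp around a peak.

    If `summit` is given it is used as the center; otherwise the midpoint
    of the reported interval [start, end) is used. Windows are not clipped,
    so callers should discard any that run off the end of a chromosome.
    """
    if summit is None:
        if start is None or end is None:
            raise ValueError("provide either a summit or a start/end interval")
        summit = (start + end) // 2
    left = summit - width // 2
    return left, left + width
```

For example, `centered_window(100, summit=1050)` yields the window `(1000, 1100)`, and an interval from 900 to 1300 with no summit yields `(1050, 1150)`.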
We implement our approach in the CentriMo algorithm (Centrality of Motifs), which takes as input a set of equal-sized regions identified in a TF ChIP-seq experiment and one or more TF binding motifs expressed as PWMs. Ideally, each of these ChIP-seq regions should be centered on a single coordinate reported as the position of ‘maximum confidence’ within a peak by the peak-calling software. If the program only reports regions (rather than single genomic positions), we use equal-sized genomic regions centered on the precise middle of each of the reported regions. For each motif, CentriMo outputs a plot of the probability that a predicted binding site occurs at each position in a ChIP-seq region (site-probability plot). It also outputs, for each motif, the width of the central region that is most enriched in binding sites according to a statistical test, and a P-value adjusted for multiple tests. We refer to this as the ‘central enrichment P-value’ of the motif. To aid in visualization, CentriMo outputs the site-probability curves for the n motifs that are most highly ‘centrally enriched’, according to their central enrichment P-values. Thus, CentriMo both serves as a visualization tool and provides an objective assessment of the degree to which each of the input motifs predicts centrally enriched binding sites.
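The two outputs described above can be sketched in a few lines of Python. This is not the CentriMo implementation: the text here does not name the statistical test or the correction procedure, so the one-sided binomial test and the Bonferroni adjustment below are assumptions chosen for illustration, as are all function names. The input is taken to be the start position of the best predicted site in each equal-sized region.

```python
import math

def site_probability(starts, region_len, motif_width):
    """Fraction of regions whose best site starts at each possible position."""
    n_pos = region_len - motif_width + 1
    counts = [0] * n_pos
    for s in starts:
        counts[s] += 1
    return [c / len(starts) for c in counts]

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def central_enrichment(starts, region_len, motif_width):
    """Return (best_window_width, adjusted_p) over all central windows.

    Each window is a run of site start positions centered in the region.
    Under the null hypothesis (assumed uniform), a site falls inside a
    window of w positions with probability w / n_pos.
    """
    n_pos = region_len - motif_width + 1
    center = (n_pos - 1) / 2
    # Window sizes share the parity of n_pos so each is exactly centered;
    # the full region is excluded because it can never be enriched.
    windows = list(range(2 - n_pos % 2, n_pos, 2))
    best_w, best_p = None, 1.1
    for w in windows:
        half = (w - 1) / 2
        k = sum(1 for s in starts if abs(s - center) <= half)
        p_value = binom_sf(k, len(starts), w / n_pos)
        if p_value < best_p:
            best_w, best_p = w, p_value
    # Bonferroni adjustment for the number of windows tested (an assumption).
    return best_w, min(1.0, best_p * len(windows))
```

Plotting the list returned by `site_probability` against position gives a curve of the kind described above, while `central_enrichment` reports the window width with the smallest P-value together with that P-value after adjustment. Ranking motifs by the adjusted value would then select the most centrally enriched ones.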
As we show in the ‘Results’ section, CMEA is consistently able to determine the direct DNA-binding motif of ChIP-ed transcription factors, even in cases where motif discovery and motif enrichment algorithms fail or give ambiguous results. We illustrate how to apply CentriMo to analyze ChIP-seq data sets using motifs from motif discovery algorithms, motifs from motif databases and even hand-tailored motifs. In the process, we point out the characteristics of site-probability curves that distinguish between direct-binding motifs and the motifs of co-factors that merely bind near the ChIP-ed transcription factor with high frequency.