Modeling the shift size of ChIP-Seq tags
ChIP-Seq tags represent the ends of fragments in a ChIP-DNA library and are often shifted towards the 3' direction to better represent the precise protein-DNA interaction site. The size of the shift is, however, often unknown to the experimenter. Since ChIP-DNA fragments are equally likely to be sequenced from both ends, the tag density around a true binding site should show a bimodal enrichment pattern, with Watson strand tags enriched upstream of binding and Crick strand tags enriched downstream. MACS takes advantage of this bimodal pattern to empirically model the shifting size to better locate the precise binding sites.
Given a sonication size (bandwidth) and a high-confidence fold-enrichment (mfold), MACS slides 2bandwidth windows across the genome to find regions with tags more than mfold enriched relative to a random tag genome distribution. MACS randomly samples 1,000 of these high-quality peaks, separates their Watson and Crick tags, and aligns them by the midpoint between their Watson and Crick tag centers (Figure ) if the Watson tag center is to the left of the Crick tag center. The distance between the modes of the Watson and Crick peaks in the alignment is defined as 'd', and MACS shifts all the tags by d/2 toward the 3' ends to the most likely protein-DNA interaction sites.
Figure 1 MACS model for FoxA1 ChIP-Seq. (a,b) The 5' ends of strand-separated tags from a random sample of 1,000 model peaks, aligned by the center of their Watson and Crick peaks (a) and by the FKHR motif (b). (c) The tag count in ChIP versus control in 10 kb (more ...)
When applied to FoxA1 ChIP-Seq, which was sequenced with 3.9 million uniquely mapped tags, MACS estimates the d to be only 126 bp (Figure ; suggesting a tag shift size of 63 bp), despite a sonication size (bandwidth) of around 500 bp and Solexa size-selection of around 200 bp. Since the FKHR motif sequence dictates the precise FoxA1 binding location, the true distribution of d could be estimated by aligning the tags by the FKHR motif (122 bp; Figure ), which gives a similar result to the MACS model. When applied to NRSF and CTCF ChIP-Seq, MACS also estimates a reasonable d solely from the tag distribution: for NRSF ChIP-Seq the MACS model estimated d as 96 bp compared to the motif estimate of 70 bp; applied to CTCF ChIP-Seq data the MACS model estimated a d of 76 bp compared to the motif estimate of 62 bp.
For experiments with a control, MACS linearly scales the total control tag count to be the same as the total ChIP tag count. Sometimes the same tag can be sequenced repeatedly, more times than expected from a random genome-wide tag distribution. Such tags might arise from biases during ChIP-DNA amplification and sequencing library preparation, and are likely to add noise to the final peak calls. Therefore, MACS removes duplicate tags in excess of what is warranted by the sequencing depth (binomial distribution p-value <10-5). For example, for the 3.9 million FoxA1 ChIP-Seq tags, MACS allows each genomic position to contain no more than one tag and removes all the redundancies.
With the current genome coverage of most ChIP-Seq experiments, tag distribution along the genome could be modeled by a Poisson distribution [7
]. The advantage of this model is that one parameter, λBG
, can capture both the mean and the variance of the distribution. After MACS shifts every tag by d
, it slides 2d
windows across the genome to find candidate peaks with a significant tag enrichment (Poisson distribution p
-value based on λBG
, default 10-5
). Overlapping enriched peaks are merged, and each tag position is extended d
bases from its center. The location with the highest fragment pileup, hereafter referred to as the summit
, is predicted as the precise binding location.
In the control samples, we often observe tag distributions with local fluctuations and biases. For example, at the FoxA1 candidate peak locations, tag counts are well correlated between ChIP and control samples (Figure ). Many possible sources for these biases include local chromatin structure, DNA amplification and sequencing bias, and genome copy number variation. Therefore, instead of using a uniform λBG estimated from the whole genome, MACS uses a dynamic parameter, λlocal, defined for each candidate peak as:
λlocal = max(λBG, [λ1k,] λ5k, λ10k)
where λ1k, λ5k and λ10k are λ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample, or the ChIP-Seq sample when a control sample is not available (in which case λ1k is not used). λlocal captures the influence of local biases, and is robust against occasional low tag counts at small local regions. MACS uses λlocal to calculate the p-value of each candidate peak and removes potential false positives due to local biases (that is, peaks significantly under λBG, but not under λlocal). Candidate peaks with p-values below a user-defined threshold p-value (default 10-5) are called, and the ratio between the ChIP-Seq tag count and λlocal is reported as the fold_enrichment.
For a ChIP-Seq experiment with controls, MACS empirically estimates the false discovery rate (FDR) for each detected peak using the same procedure employed in the previous ChIP-chip peak finders MAT [13
] and MA2C [14
]. At each p
-value, MACS uses the same parameters to find ChIP peaks over control and control peaks over ChIP (that is, a sample swap). The empirical FDR is defined as Number of control peaks / Number of ChIP peaks. MACS can also be applied to differential binding between two conditions by treating one of the samples as the control. Since peaks from either sample are likely to be biologically meaningful in this case, we cannot use a sample swap to calculate FDR, and the data quality of each sample needs to be evaluated against a real control.
The two key features of MACS are: empirical modeling of 'd
' and tag shifting by d
to putative protein-DNA interaction site; and the use of a dynamic λlocal
to capture local biases in the genome. To evaluate the effectiveness of tag shifting based on the MACS model d
, we compared the performance of MACS to a similar procedure that uses the original tag positions instead of the shifted tag locations. The effectiveness of the dynamic λlocal
is assessed by comparing MACS to a procedure that uses a uniform λBG
from the genome background. Figure show that both the detection specificity, measured by the percentage of predicted peaks with a FKHR motif within 50 bp of the peak summit
, and the spatial resolution, defined as the average distance from the peak summit
to the nearest FKHR motif, are greatly improved by using tag shifting and the dynamic λlocal
. In addition, FoxA1 is known to cooperatively interact with estrogen receptor in breast cancer cells [1
]. As evidence for this, we also observed enrichment for estrogen receptor elements (3.1-fold enriched relative to genome motif occurrence) and its half-site (2.7-fold) [15
] within the center 300 bp regions of MACS-detected FoxA1 ChIP-Seq peaks.
is also effective in capturing the local genomic bias from a ChIP sample alone when a control is not available. To demonstrate this, we applied MACS to FoxA1 ChIP-Seq and control data separately. Using the same parameters, all the control peaks are, in theory, false positives, so the FDR can be empirically estimated as Number of control peaks / Number of ChIP peaks. To identify 7,000 peaks, the FDR for MACS is only 0.4% when a control is available and λlocal
is used. To get 7,000 peaks when a control is not available, the FDR could still remain low at 3.8% if MACS estimates λlocal
from the ChIP sample, whereas it would reach 41.2% if MACS uses a global λBG
. This implies that the λlocal
is critical for ChIP-Seq studies when matching control samples are not available [5
We compared MACS with three other publicly available ChIP-Seq peak finding methods, ChIPSeq Peak Finder [8
], FindPeaks [11
] and QuEST [12
]. To compare their prediction specificity, we swapped the ChIP and control samples, and calculated the FDR of each algorithm as Number of control peaks / Number of ChIP peaks using the same parameters for ChIP and control. For FoxA1 and NRSF ChIP-Seq (an FDR for CTCF is not available due to the lack of control), MACS consistently gave fewer false positives than the other three methods (Figure ).
Figure 2 Comparison of MACS with ChIPSeq Peak Finder, FindPeaks and QuEST. (a-f) Shown is the FDR for FoxA1 (a) and NRSF (b) ChIP-Seq, motif occurrence within 50 bp of the peak centers for FoxA1 (c) and NRSF (d), and the average distance from the peak center to (more ...)
Determining the percentage of predicted peaks associated with a motif within 50 bp of the peak center for FoxA1 and NRSF ChIP-Seq, we found MACS to give consistently higher motif occurrences (Figure ). Evaluating the average distance from peak center to motif, excluding peaks that have no motif within 150 bp of the peak center, we found that MACS predicts peaks with better spatial resolution in most cases (Figure ). For CTCF, since QuEST does not run on samples without controls, we only compared MACS to ChIPSeq Peak Finder and FindPeaks. Again, MACS gave both higher motif occurrences within 50 bp of the peak center and better spatial resolutions than other methods (Figure S1 in Additional data file 1). In general, MACS not only found more peaks with fewer false positives, but also provided better binding resolution to facilitate downstream motif discovery.
Comparison of ChIP-Seq to ChIP-chip
A comparison of FoxA1 ChIP-Seq and ChIP-chip revealed the peak locations to be fairly consistent with each other (Figure ). Not surprisingly, the majority of ChIP-Seq peaks under a FDR of 1% (65.4%) were also detected by ChIP-chip (MAT [13
] cutoff at FDR <1% and fold-enrichment >2). Among the remaining 34.6% ChIP-Seq unique peaks, 1,045 (13.3%) were not tiled or only partially tiled on the arrays due to the array design. Therefore, only 21.4% of ChIP-Seq peaks are indeed specific to the sequencing platform. Furthermore, ChIP-chip targets with higher fold-enrichments are more likely to be reproducibly detected by ChIP-Seq with a higher tag count (Figure ). Meanwhile, although the signals of array probes at the ChIP-Seq specific peak regions are below the peak-calling cutoff, they show moderate signal enrichments that are significantly higher than the genomic background (Wilcoxon p
; Figure ). Indeed, 835 out of 1,684 ChIP-Seq specific peaks could also be detected in ChIP-chip, when the less stringent FDR cutoff of 5% is used. Another reason why peaks detected by ChIP-Seq may be undetected by ChIP-chip is that ChIP-Seq specific peaks are usually slightly shorter than similar fold-enrichment peaks found by both ChIP-Seq and ChIP-chip (Figure ) and may not be detectable on the array due to insufficient probe coverage. On the other hand, ChIP-chip specific peak regions also have significantly more sequencing tags than the genomic background (Wilcoxon p
; Figure S2 in Additional data file 1), although with current sequencing depth, those regions cannot be called as peaks.
Figure 3 Comparison of FoxA1 ChIP-Seq and ChIP-chip. (a) Overlap between the FoxA1 binding sites detected by ChIP-chip (MAT; FDR <1% and fold-enrichment >2) and ChIP-Seq (MACS; FDR <1%). Shown are the numbers of regions detected by both (more ...)
Comparing the difference between ChIP-chip and ChIP-Seq peaks, we find that the average peak width from ChIP-chip is twice as large as that from ChIP-Seq. The average distance from peak summit to motif is significantly smaller in ChIP-Seq than ChIP-chip (Figure ), demonstrating the superior resolution of ChIP-Seq. Under the same 1% FDR cutoff, the FKHR motif occurrence within the central 200 bp from ChIP-chip or ChIP-Seq specific peaks is comparable with that from the overlapping peaks (Figure ). This suggests that most of the platform-specific peaks are genuine binding sites. A comparison between NRSF ChIP-Seq and ChIP-chip (Figure S3 in Additional data file 1) yields similar results, although the overlapping peaks for NRSF are of much better quality than the platform-specific peaks.