Generation of Illumina Tag Sequencing Data
For this paper we generated two deeply sequenced ChIP-Seq datasets in the HeLa S3 cell line, using antibodies against human RNA polymerase II (Pol II) and against STAT1. For Pol II, 24 lanes of Illumina sequencing yielded more than 29 million mapped reads for the ChIP sample and a matching 29 million mapped reads for a control sample of HeLa S3 input DNA. STAT1 ChIP was performed in HeLa S3 cells stimulated with interferon-γ, producing 26 million mapped reads for the IFN-γ-stimulated STAT1 ChIP DNA and 24 million mapped reads for a matching control of IFN-γ-stimulated HeLa S3 input DNA. See Supplementary Material for further details of the experimental protocols. A complete breakdown of the ChIP-Seq reads generated is summarized in Supplementary Table 1. Raw and aligned sequence reads for all datasets have been deposited at GEO under accession numbers GSE12781 (Pol II) and GSE12782 (STAT1).
Illumina Data Analysis Pipeline
Raw image files from the Illumina Genome Analyzer are transferred to the Yale biomedical high-performance computing cluster and processed following the recommended Illumina pipeline. Raw images are first processed with the Firecrest software package, and base-called sequences with confidence metrics are obtained using the Bustard software. Gerald, the last part of the pipeline, uses the program ELAND to align the short sequence reads against the genome of interest, allowing up to two mismatches. By default ELAND reports only the locations of reads that align uniquely to the target sequence; however, with modified parameters it can report the locations of reads that map to multiple locations. For the human ChIP-Seq and control datasets we aligned the sequence reads to the latest build of the human genome (hg18/NCBIv36) obtained from the UCSC Genome Browser12, excluding the random unassembled contigs.
Alignment of Tag Sequences
Following a typical analysis pipeline, flowcell images are analyzed, yielding base-called sequences with confidence scores, which are then aligned against the appropriate genome. The standard Illumina pipeline uses the program ELAND for aligning sequence tags, although a number of other applications have been developed for the same purpose, e.g. Maq19, Rmap (unpublished) and SOAP20. When aligning sequence reads against the genome, reads aligning to multiple locations are typically excluded as ambiguous (see Supplementary Table 1 for the proportion of sequence reads that map to multiple locations in the human genome). However, this exclusion leaves portions of the genome, including highly repetitive sequences and recent segmental duplications, unalignable and thus unassayable.
Although the algorithm we have developed uses, by default, only sequence tags that map uniquely, it would be a relatively straightforward modification to utilize tags that map to multiple locations: cap the number of locations to which a tag is allowed to map, and either weight each tag by a factor dependent on the number of locations to which it maps or randomly select one of those locations. Many of the mapping algorithms, including ELAND and Maq, have options that allow sequence reads to be aligned to multiple locations in the genome. The effect of including such multi-mapping reads in the scoring is also discussed in these Methods.
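As an illustration, the two strategies above can be sketched as follows. This is a minimal sketch under assumptions not stated in the text: the function name, the input representation (one list of candidate positions per read) and the cap of 10 locations are all hypothetical, not the pipeline's actual implementation.

```python
import random

def place_multireads(read_locations, max_locations=10, mode="random", rng=None):
    """Assign genomic positions to reads that may map to several places.

    read_locations: list of lists; each inner list holds every position a
    read aligns to. Reads mapping to more than max_locations places are
    discarded. For the rest, either one location is chosen at random
    ("random") or every location is kept with fractional weight 1/n
    ("weight"). Returns (position, weight) pairs.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    placed = []
    for locs in read_locations:
        if len(locs) > max_locations:
            continue  # too ambiguous; drop the read entirely
        if mode == "random":
            placed.append((rng.choice(locs), 1.0))
        else:  # fractional weighting across all candidate locations
            w = 1.0 / len(locs)
            placed.extend((pos, w) for pos in locs)
    return placed
```

Under either strategy each read contributes a total weight of one, so downstream fragment counts remain comparable to the uniquely mapping case.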
PeakSeq: First Pass Identification of Potential Target Sites
Once fragment density maps have been created for both the sample and control datasets, we initially focus on the sample density map. Each chromosome is subdivided into segments of length Lsegment (typically 1 Mb) for analysis; depending upon the genome being analyzed, one might select a different segment size, e.g. smaller segments for more compact genomes. For each segment, the number of fragments aligning to it, Nreads, is counted. In addition, using the mappability map for the genome of interest, we can precompute the fraction of uniquely alignable bases in that segment, f (see Methods for further details). Using these two parameters, Nreads and f, we perform a computational simulation by randomly generating Nreads aligned DNA fragments in a scaled segment of length f×Lsegment (see the second panel of the figure and below for details of the simulation). To obtain an accurate estimate, this simulation is performed multiple times and the results are averaged.
Using a height threshold, we determine all the contiguous regions in the sample fragment density map that lie above this threshold. Regions above threshold that are separated by genomic distances of less than the average fragment length (~200 bp) are merged. This is similar to the maxgap/minrun approach commonly used for analyzing ChIP-chip tiling array data14,21. For the same threshold we determine the number of merged regions above threshold in the simulation. For each threshold, the fraction of false positives is calculated as the ratio of the estimated number of false positives above threshold from the simulation to the number of regions above threshold for the ChIP-Seq sample. By tuning the threshold we can set an initial first-pass false discovery rate (the false discovery rate for the final list of target binding sites is determined after comparison with the control sample). The threshold is set independently for each segment of each chromosome, which accommodates genomic variability along each chromosome due to, e.g., structural variation15,16.
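The thresholding-and-merging step can be sketched as follows. This is an illustrative sketch, not the published code: the per-base density array, function name and parameter defaults are assumptions.

```python
def call_regions(density, threshold, max_gap=200):
    """Return merged [start, end) intervals where the per-base fragment
    density exceeds `threshold`; above-threshold regions separated by
    fewer than `max_gap` bp (~ the average fragment length) are merged,
    in the spirit of the maxgap/minrun approach."""
    regions, start = [], None
    for pos, height in enumerate(density):
        if height > threshold and start is None:
            start = pos                       # open a new region
        elif height <= threshold and start is not None:
            regions.append([start, pos])      # close the current region
            start = None
    if start is not None:
        regions.append([start, len(density)])
    # merge regions separated by less than max_gap
    merged = []
    for r in regions:
        if merged and r[0] - merged[-1][1] < max_gap:
            merged[-1][1] = r[1]
        else:
            merged.append(r)
    return merged
```

Counting the regions returned for the real and simulated density maps at the same threshold gives the numerator and denominator of the first-pass false-positive fraction.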
Using this thresholding procedure we obtain, for each segment, candidate sets of peak regions (i.e. putative binding sites) from each chromosome that are significantly enriched compared to a null random background. However, some of these regions might reflect underlying peak signal that is also present in the control sample. Thus, in order to determine whether each of these putative peaks is actually bound by the transcription factor, we need to show that the number of mapped fragments from the sample dataset is significantly enriched compared to the control input DNA dataset.
Estimation of False Positives by Simulation in the First Pass of PeakSeq
For each segment of length Lsegment, a computational simulation of Nreads tag sequences is performed using the scaled segment length f×Lsegment (the length is scaled by the fraction of uniquely mappable bases in the segment). Nreads tag sequences are randomly placed along the scaled segment, and the same thresholding procedure is then applied to determine false positives from the simulated data (shown in the second panel of the figure). The simulation is performed multiple times and the number of false positives is averaged over the different simulations.
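A minimal sketch of this background simulation follows. The function name, the difference-array implementation and the default number of simulations are assumptions for illustration, and for brevity the sketch counts above-threshold regions without the region-merging step described above.

```python
import random

def simulate_false_positives(n_reads, segment_len, mappable_frac,
                             read_len, threshold, n_sims=10, seed=0):
    """Estimate background peak counts for one segment: drop n_reads
    reads uniformly on a segment scaled to its uniquely mappable length
    (f x Lsegment), build the height profile, and count regions above
    the height threshold, averaged over n_sims simulations."""
    rng = random.Random(seed)
    scaled_len = int(segment_len * mappable_frac)
    total = 0
    for _ in range(n_sims):
        # height profile via a difference array over read extents
        diff = [0] * (scaled_len + 1)
        for _ in range(n_reads):
            start = rng.randrange(scaled_len - read_len)
            diff[start] += 1
            diff[start + read_len] -= 1
        height, above, n_regions = 0, False, 0
        for d in diff:
            height += d
            if height > threshold and not above:
                n_regions += 1  # entered a new above-threshold region
                above = True
            elif height <= threshold:
                above = False
        total += n_regions
    return total / n_sims
```

Sweeping the threshold and comparing the simulated region count against the observed count yields the first-pass false-positive fraction for the segment.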
We use only a simple background null distribution (Poisson) for each segment, rather than the more complicated background model10, during the first pass of the PeakSeq procedure. This is because we are trying to identify a candidate list of potential target regions using a relatively liberal threshold. The non-uniformity of the background is accounted for in the second pass, when counts of mapped fragments for putative binding sites are compared against the input DNA control. The control captures the actual background distribution, which we therefore do not need to model explicitly. If we were scoring the ChIP-Seq data without a reference control, the non-uniformity of the background would have to be modeled explicitly.
Application of the Mappability Map to PeakSeq Scoring
The mappability map is initially constructed for each base pair in the genome (see Supplementary Material), counting the number of locations to which a subsequence starting at that position, typically of length 30 nt, aligns. From this we can generate a coarse-grained map of the fraction of uniquely mappable bases (corresponding to tags starting at those base-pair locations) in a segment (i.e. window) of a given size. In the paper we have used coarse-grained maps with 1 Kb segments for illustrative purposes. In addition, we have generated a coarse-grained mappability map using larger 1 Mb segments. As part of the first-pass filtering in the PeakSeq scoring procedure (the second panel of the figure), a peak-height threshold is determined for each 1 Mb segment in the genome. For a segment, the threshold is determined by comparison against a simulated null background using the same number of tag reads randomly mapped onto a region whose length corresponds to the number of uniquely mappable bases in that 1 Mb segment (i.e. the fraction of mappable bases multiplied by the segment length).
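The coarse-graining step can be sketched as follows; the function name and the list-of-counts input representation are assumptions for illustration.

```python
def coarse_grain_mappability(per_base_counts, window=1000):
    """Given, for each base, the number of genomic locations to which the
    read-length subsequence starting there aligns, return the fraction
    of uniquely mappable start positions (count == 1) in each
    non-overlapping window. A count of 0 (unalignable) or >1
    (multi-mapping) makes a position non-unique."""
    fractions = []
    for start in range(0, len(per_base_counts), window):
        chunk = per_base_counts[start:start + window]
        unique = sum(1 for c in chunk if c == 1)
        fractions.append(unique / len(chunk))
    return fractions
```

The resulting per-window fractions are the f values used to scale each segment's length in the first-pass simulation.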
PeakSeq: Normalization of Control to ChIP-Seq Sample
Before this comparison can be made, the control dataset has to be appropriately normalized to the sample dataset. Naïvely, one could use matching datasets with the same number of mapped reads by removing reads from the larger dataset. This, however, is not the correct way to perform the normalization, as it is overly conservative: the sample dataset is composed partly of mapped reads that come from the background distribution, while the remainder arise from peak regions that are genuine binding sites (see Methods for further details). Mapped reads belonging to genuine binding sites would incorrectly skew the apparent parity achieved by simply matching the total number of mapped reads between sample and control. We actually want to normalize the control dataset against the background component of the sample dataset. In order to reduce the effect of peaks on the normalization, we divide each chromosome into short segments (length ~10 Kb) and perform the normalization using all segments that have at least one mapped fragment. We would like to exclude from the normalization procedure segments that contain peaks corresponding to binding sites; however, we do not want to exclude all putative binding sites identified in the first pass, as this would exclude segments containing peaks that are present in both the sample and control background distributions. Thus we introduce a parameter, Pf, the fraction of the peaks (ranked by peak height) that are excluded from the normalization procedure. If Pf = 0, no peaks are excluded and all segments are used for the normalization, whereas if Pf = 1, only segments that do not overlap any candidate peak are used. With this procedure all the included segments contribute equally when computing the normalization factor, rather than allowing the peaks to dominate.
For each segment (indexed by s) not overlapping the Pf fraction of putative peaks identified in the first pass, we count the number of mapped tags per segment for both the sample, n_sample(s), and the control, n_control(s). Chromosome by chromosome, we perform least-squares linear regression between these two sets of counts, n_sample(s) ≈ α × n_control(s). The slope of the regression, α, is then a scaling factor between the number of counts from the control and from the sample of interest. In general, setting Pf = 0 will be more conservative because both the slope α and the normalized counts of tags from the control for each potential target binding site will be larger, and thus fewer regions will be deemed enriched relative to the control (see the normalization step in the figure).
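The normalization can be sketched as below. This is a simplified illustration: the function name and input layout are assumptions, an ordinary least-squares slope stands in for the regression, and segments with no mapped fragments in either dataset are skipped as a stand-in for the "at least one mapped fragment" rule.

```python
def normalization_factor(sample_counts, control_counts, peak_segments,
                         excluded_fraction):
    """Estimate the control-to-sample scaling factor alpha.

    sample_counts / control_counts: mapped-tag counts per ~10 kb segment.
    peak_segments: indices of segments overlapping first-pass peaks,
    ordered from tallest peak downward.
    excluded_fraction: Pf, the fraction of ranked peaks whose segments
    are excluded from the regression (Pf = 0 excludes none).
    """
    n_excluded = int(round(excluded_fraction * len(peak_segments)))
    excluded = set(peak_segments[:n_excluded])
    xs, ys = [], []
    for i, (x, y) in enumerate(zip(control_counts, sample_counts)):
        if i in excluded or (x == 0 and y == 0):
            continue  # skip excluded-peak segments and empty segments
        xs.append(x)
        ys.append(y)
    # ordinary least-squares slope of sample counts on control counts
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Because every retained segment contributes one point to the fit, the slope reflects the background relationship between control and sample rather than being dominated by a few tall peaks.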
PeakSeq: Second Pass Scoring Target Sites Relative to Control
For each of the putative binding regions, indexed by r, we can now count the number of mapped fragments that overlap the region from both the sample dataset, n_sample(r), and the control dataset, n_control(r). We normalize the count from the control by multiplying by the scaling factor α computed above. For each putative site we first compute the fold enrichment, i.e. the ratio of the number of mapped reads from the sample to the scaled number of mapped reads from the control, n_sample(r) / (α × n_control(r)). The fold enrichment is the signal normally computed for a transcription factor binding site, which should be proportional to the occupancy of the binding site (the fraction of cells in the experiment that have the transcription factor bound at this site). Using the binomial distribution we can calculate a p-value for the significance of the region’s enrichment in the number of fragments from the sample as compared to the scaled number from the control (because the scaled number is not in general an integer, it is rounded up to the nearest integer). The null hypothesis is that there is no enrichment (see Methods for the statistical tests used).
As is typical in high-throughput experiments that generate a large number of results, corrections need to be made to account for multiple hypothesis testing. Given the large number of statistical tests performed, for any p-value threshold some number of false positives (potentially many) will arise by random chance. Following a standard approach for the analysis of large-scale experiments, we employ a false discovery rate22-24 based approach using a q-value. A Bonferroni-type correction for multiple hypothesis testing is typically overly conservative, so we choose to use a Benjamini-Hochberg correction17.
In order to determine whether a given putative target region r is enriched in the number of mapped tags from the ChIP-Seq sample compared to the normalized input DNA control, we calculate the p-value from the cumulative distribution function of the binomial distribution, which corresponds to summing the tail of the distribution:

p_r = Σ_{k = n_sample(r)}^{n} C(n, k) p^k (1 − p)^(n − k),  with n = n_sample(r) + n_control(r),

where n_control(r) is the scaled number of sequence tags overlapping the target region from the input DNA (rounded up to the nearest integer), n_sample(r) is the number of tags from the sample, and p = 0.5 is the probability under the null hypothesis that a tag occurs with equal likelihood from the sample as from the control. Once np is sufficiently large, the binomial distribution can be accurately approximated by a normal distribution with mean np and variance np(1 − p).
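The tail sum above can be computed directly with the standard library; the function name and argument layout are illustrative assumptions.

```python
from math import ceil, comb

def enrichment_pvalue(n_sample, n_control_scaled, p=0.5):
    """One-sided binomial p-value that a region's sample tag count is
    enriched relative to the scaled control count: sum the upper tail
    of Binomial(n, p) from n_sample to n, where
    n = n_sample + ceil(n_control_scaled)."""
    n_control = ceil(n_control_scaled)  # scaled count rounded up
    n = n_sample + n_control
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n_sample, n + 1))
```

For large n the direct sum becomes slow and numerically delicate, which is where the normal approximation mentioned above (mean np, variance np(1 − p)) would take over.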
Correcting for Multiple Testing
We follow Benjamini and Hochberg17 in adjusting our p-values to correct for multiple testing. All the target regions tested for significance are ranked by p-value from most significant to least significant. For the region with rank i, the q-value is then given by q_i = p_i × N / i, where N is the total number of regions tested. Enriched target regions are then selected using a q-value threshold rather than a p-value threshold.
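A sketch of the q-value computation follows; the function name is an assumption, and the running-minimum step, a standard refinement of the bare q_i = p_i × N / i formula, enforces that q-values are monotone in p.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values: q_i = p_i * N / rank_i, with a
    running minimum applied from least to most significant so that
    q-values never decrease as p-values increase."""
    n = len(pvalues)
    # indices of regions ordered from most to least significant
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvals = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # walk from least significant
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvals[i] = running_min
    return qvals
```

Regions whose q-value falls below the chosen threshold (e.g. 0.05) are reported as enriched target sites.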
Scoring ChIP-Seq Data Including Reads that Map to Multiple Locations
In order to investigate the effect of including only uniquely mapping reads, we selected a single lane of Pol II ChIP-Seq and input DNA data and included reads that aligned to at most 10 distinct locations in the genome, allowing up to two mismatches per read. Alignments were performed using ELAND. For reads aligning to multiple locations, we randomly selected one of those locations. Scoring these data using the same PeakSeq procedure, we find that the number of binding sites identified increases by 17% compared to using only reads whose best match is unique. Thus PeakSeq can be used to score reads that map to multiple locations; however, the resulting binding sites are inherently ambiguous due to the nature of these sequences, although some will correspond to legitimate sites of factor binding to DNA. These results are available for download from http://www.gersteinlab.org/proj/PeakSeq/.
Analysis of ChIP-Seq Data from Biological Replicas
In order to appropriately compare sequence reads from biological replicas, we subdivided the data from each of the three different biological replicas (the sequences were also randomly permuted for each replica). We first selected 9 million reads from each of the three replicas (only 8.1 million reads were available from the third biological replica). Each of these datasets was scored against 9 million reads randomly selected from the input DNA control (the same 9 million control reads were used for each analysis). Second, we randomly selected 4.5 million reads from each of two different biological replicas, combined them to form 9 million reads, and scored these against the same 9 million control reads. This was done for all three two-of-three pairings of samples. Lastly, we selected 3 million reads from each of the three replicas, combined them, and scored the pooled reads against the sampled control dataset. This type of analysis generalizes to the comparison of more than three replicas.
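The subsample-and-pool scheme can be sketched as follows; the function name and the representation of replicas as lists of read identifiers are illustrative assumptions.

```python
import random

def subsample_and_pool(replica_reads, reads_per_replica, seed=0):
    """Randomly subsample an equal number of reads from each biological
    replica and pool them, so that pooled and single-replica datasets
    can be scored against the same depth-matched control. If a replica
    has fewer reads than requested, all of its reads are used."""
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    pooled = []
    for reads in replica_reads:
        k = min(reads_per_replica, len(reads))
        pooled.extend(rng.sample(reads, k))
    return pooled
```

Calling this with three replicas and 3 million reads each reproduces the pooled nine-million-read dataset described above, while calling it with one replica and 9 million reads reproduces the single-replica case.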