Overall experimental design
We amplified S2 cells from the modENCODE batch before transferring them to the plates. We put a total of 15 plates in culture until the cells reached the appropriate concentration. We harvested cells from 2 plates to extract the genomic DNA. We subsequently treated the remaining 13 plates with formaldehyde at the same time, and then extracted the chromatin. Next, we used 3 plates to produce a chromatin input sample corresponding to the chromatin DNA of the treated cells. Each plate corresponded to an Eppendorf tube of chromatin and we performed ChIP experiments on 5 tubes for H3K36me3 and on 5 tubes for Su(Hw). We performed ChIP and DNA extraction procedures independently on each tube. After the purification of the DNA, we pooled together the tubes corresponding to the same sample (Su(Hw), H3K36me3, chromatin input and gDNA). After mixing, we again aliquoted each sample into 5 tubes. Next, we used 1 aliquot of each for ChIP-chip on Affymetrix tiling arrays for quality control, 1 aliquot for SE sequencing at the high-throughput genomic analysis core of the University of Chicago, 1 aliquot for SE sequencing at the high-throughput sequencing facility of the University of North Carolina at Chapel Hill (UNC), 1 aliquot for PE sequencing at UNC, and the remaining aliquot to back up the original samples.
After the expansion of the S2 cell line, we transferred cells to a plate. Once confluent, we added 1.8% formaldehyde to the cell culture. We harvested cells in the presence of the formaldehyde with the help of a cell scraper. After 15 minutes of incubation at room temperature, we quenched the crosslinking reaction with glycine for 5 minutes. We subsequently washed cell pellets 3 times with a lysis buffer. We performed regular chromatin extraction before sonication. We then used sonicated chromatin for ChIP experiments or directly for DNA extraction for the chromatin input samples. We performed ChIP as previously described1, 2
. The anti-H3K36me3 antibody was from Abcam (catalogue # ab9050/lot # 927884). The anti-Su(Hw) antibody was from Pamela K. Geyer's lab. Both are rabbit polyclonal antibodies.
Library preparation and sequencing
At the University of Chicago, we prepared the libraries according to Illumina's instructions accompanying the DNA Sample Kit (Part# 0801-0303). Briefly, we end-repaired DNA using a combination of T4 DNA polymerase, E. coli DNA Pol I large fragment (Klenow polymerase) and T4 polynucleotide kinase. We treated the blunt, phosphorylated ends with Klenow fragment (3’ to 5’ exo minus) and dATP to yield a protruding 3- ‘A’ base for ligation of Illumina's adapters, which have a single ‘T’ base overhang at the 3' end. After adapter ligation, we PCR amplified DNA with Illumina primers for 15 cycles, and band isolated library fragments of ~250 bp from a 2% agarose gel. We captured the purified DNA on an Illumina flow cell for cluster generation. We sequenced libraries on the Genome Analyzer IIx following the manufacturer's protocols. At UNC, we used a slightly different protocol of library preparation in which we band isolated the library fragments of ~150-500 bps immediately after the ligation of Illumina's adapters followed by 18 cycles of PCR amplification.
The definition of library complexity for PE and SE data
The library complexity of SE data is defined as the number of non-redundant SE reads divided by the total number of reads, where redundant SE reads are those that are mapped to the same location with the same orientation in the genome. The library complexity of PE data is defined as a non-redundant pair of reads divided by a total pair of reads, where redundant PE reads are those that have identical genomic locations on both ends.
The mapping of sequencing reads
We used ELAND to align the SE sequencing reads to the flybase BDGPv5 reference genome. We pooled together the uniquely mapped SE reads with no more than 2 mismatches from different runs up to 120 M for the ChIP-sample of Su(Hw) and H3K36me3. For chromatin input and gDNA samples, we added uniquely mapped PE reads to constitute the total 120 M reads, thereby compensating for the failed runs of SE sequencing of these two samples (Supplementary Table 1
). For a comparison of differences in the read mappability and coverage of the repeats region, we performed an estimation of the library complexity for the PE and SE reads. We first used Bowtie-0.12.5 to map the PE reads with almost all of the default settings (-chunkmbs 120), and we constrained the fragment size between 80 and 600 bp. Next, we re-aligned the uniquely mapped PE reads in the SE mode using the same parameter settings.
The mappability, the heterochromatin and the repeat regions of the Drosophila genome
We obtained the mappability data of Drosophila
genome from an early study3
and the details of how the mappability was calculated was described previously4
. The interspersed repeats and of low-complexity DNA that were identified by the RepeatMasker program (http://www.repeatmasker.org
). The simple-repeat refers to the simple tandem repeats (possibly imperfect repeats) that were identified by the Tandem Repeats Finding program5
. Both the repeat region and the heterochromatin region annotation were based on UCSC dm3, and were downloaded from the UCSC genome browser.
ChIP-seq data analysis algorithms
We chose 7 algorithms that are capable of using chromatin input data, are not restricted to only TF or histone marks, are supportive for analyzing ChIP-seq data from Drosophila
, and are among the most cited algorithms. These algorithms are CisGenome (v1.2), MACS (v1.40beta), spp (v1.8), QuEST (v2.4), Useq (v6.9), SISSRS (v1.4) and PeakSeq (v1.01). We did not include E-RANGE6
, two highly cited algorithms, in the evaluation because the parameters of E-RANGE were optimized for the mammalian genome and F-Seq does not provide good support for peak finding in invertebrates. For the evaluation of the algorithm performance on the ChIP-seq data of Su(Hw), we did not include PeakSeq because it does not provide the information of the peak summit of each peak, whereas the evaluation of peak quality requires the information of the peak summit. When we used the middle point of each peak identified by PeakSeq as a surrogate of the peak summit, PeakSeq exhibited a poor performance, thereby making the comparison unfair. For the evaluation of the algorithm performance on the ChIP-seq data of H3k36me3, we did not include SISSRs because the width of the regions identified by SISSRs is too small to be consistent with that of the broad regions.
ChIP-seq data analysis at different sequencing depths
We randomly sampled reads at sequencing depths of 0.45 M, 0.9 M, 2.7 M, 5.4 M and 16.2 M reads from a pool of 120 M reads for both ChIP and chromatin input samples. These sequencing depths approximately correspond to 9 M, 18 M, 55 M, 109 M and 327 M reads in a human ChIP-seq experiment4
. At each sampling depth, we generated five independent ”replicates” of the sequencing data. We averaged the analysis results of each algorithm over the five replicates before comparison.
ChIP-chip peak calling for Su(Hw) and the fold change calculation for the sequencing and tiling array platform
We performed peak calling for Su(Hw) using the MAT8
algorithm, which is among the best peak-calling algorithms for ChIP-chip data from Affymetrix data9
with a band width of 250 bp, a p-value cutoff of 10-5
and a false discovery rate (FDR) cutoff of 5%. We only considered the peaks with fold changes of no less than 3-fold (the detection limit of the Affymetrix tiling array9
) for further comparison with ChIP-seq peaks. All 500 bp windows that centered on the summit of these ChIP-chip peaks of Su(Hw) were used as reference ChIP-chip peaks and as a proxy for the true positives to evaluate the sensitivity of different algorithms for identifying ChIP-seq peaks. The fold change of the signal (ChIP versus chromatin input) in each 500 bp-scanning window centered on the probes was calculated using Tiling Analysis Software (Affymetrix). The fold change in the 200 bp and 500 bp windows that were centered on the summit of ChIP-seq peaks was calculated as follows: (number of covered fragments+1)ChIP
/(number of covered fragments+1)input
The use of IDR to quantify the reproducibility of peak calling between a pair of replicates
The reproducibility across replicates is essential to ChIP experiments not only at the level of read count data but also at the level of peak calling because the identified peaks usually are the primary substrates for downstream analysis. IDR (irreproducible discovery rate) is a statistical measure that assesses the consistency of the rank orders between a pair of rank lists10
. Unlike the usual scalar measures of reproducibility (e.g., the rank correlation), this measure describes reproducibility in terms of the extent to which the ranks of the entries on the lists are no longer consistent across replicates that are ordered in descending significance. Based on a copula mixture model, this measure provides a “score” that estimates the probability that each pair of peaks is reproducible, and it reports the expected rate of irreproducible discoveries (IDR) in the selected peaks in a fashion analogous to that of FDR. The number of reproducible peaks across replicates at given IDR levels can be used to compare the relative reproducibility of different peak calling algorithms.
IDR is independent of the threshold choice that is used for peak calling, and it emphasizes implicitly the consistency between the top-ranked peaks, rather than treating all of the ranks equally. Therefore, this method overcomes many limitations in traditional ways of measuring reproducibility and is suitable for our purposes. Detailed descriptions of the methodology and the implementation of IDR for narrow peak can be found in Ref. 10
. For broad peaks, often one peak overlaps with multiple (small) peaks on the other replicate. When this occurs, all these small peaks are lumped as one peak and the most substantial significance of these small peaks is used as the significance of the lumped peak (unpublished results, personal communication with Dr. Kundaje).
Because the number of identified peaks of some algorithms is much larger than others, we took the top 3000 significant peaks from all of the identified peaks for the evaluation. We used the R package in Ref. 10
for all of the IDR analyses.
The RNA-seq and H3K4me3 ChIP-chip data
We calculated the gene-expression level summarized as the RPKM value based on the RNA-seq data from a previous study11
. The H3K4me3 ChIP-chip data was generated, and the ChIP-enriched regions were identified, by the White lab. We considered the 2kb TSS-centered promoter to have H3K4me3 enrichment if it overlaps with the H3K4me3 peaks.
The data for enriched and depleted regions of various histone marks
The enriched and depleted regions of 15 histone marks that were identified from ChIP-chip data were obtained from modENCODE1, 12-14
. These histone marks include H3K18ac, H3K27ac, H3K27me3, H3K36me1, H3K36me3, H3K4Me3, H3K4me1, H3K4me2, H3K79Me1, H3K79me2, H3K9ac, H3K9me3, H4K16ac, H4K5ac, and H4K8ac.
Motif enrichment analysis
We mapped the position-specific weight matrix (PSWM) of Su(Hw) onto the genome of Drosophila (dm3) using CisGenome with a 3rd-order Markov background model. We calculated the distance between the nearest mapped motif and the peak summit using a custom Perl script.
The definition of positive and negative regions for H3K36me3 and the measurement of sensitivity and the false positive rate
H3K36me3 is highly enriched in exonic regions but not in intronic regions of actively transcribed genes15
, which allows us to approximate the positive and negative regions of H3K36me3 in the genome and to estimate the sensitivity and specificity of different algorithms using these predefined regions. We defined the positive regions of H3K36me3 as the exons of the top 4000 expressed genes (the evaluation results were similar for the top 1000 or top 2000 genes) and the negative regions as the gene body of the non-expressed genes. We estimated the expression level of each annotated gene based on the RNA-seq data from the S2 cells as the total number of reads of all of the unique exons per kb of total length of unique exons per million mapped reads (RPKM)16
. We averaged the RPKM value of each gene over two biological replicates. We used the same criterion to define the non-expressed genes as in the previous study, where the non-expressed genes were those with the number of unique mapped reads per kb per million mapped reads (RKPM) smaller than or equal to 411
. This cutoff was chosen based on the distribution of RKPM values in intergenic regions, where the probability of observing a RKPM value greater than or equal to 4 is approximately 5%. To control for the difference in peak width among algorithms, we used the coverage of the positive regions normalized by the total width of all predicted enriched-regions as a proxy for the sensitivity and the width-normalized coverage of the negative regions as a proxy for the false positive rate.
The calculation of the GC composition, the read-count ratio, and the coverage over different genomic features
We calculated the window-based GC composition (window-size 36bp, the same as read length) across the genome using the hgGcPercent program from Jim Kent (UCSC). We calculated the GC composition of sequencing reads of different samples using a custom Perl script. We calculated the window-based read-count ratio and the read coverage over different genomic features, including exons, gene bodies and repeat regions using the combination of custom Perl scripts and BEDTools17
. We calculated the genomic distribution of most hyper/hyposequenced 1-kb windows between the chromatin input and the gDNA samples by the stand-alone Cis-regulatory Element Annotation System (CEAS) package18
We performed all of the statistical analyses were in R, and showed all of the p-values smaller than 2.2 × 10-16 as P < 2.2 × 10-16, which is the default cutoff in R.