We focus here on the application of high-throughput sequencing using the Illumina Genome Analyser to chromatin immunoprepcipitation (ChIP-seq) for genome-wide identification of DNA–protein interaction sites, although the technical issues are also relevant to other sequencing applications and platforms. In a typical ChIP-seq protocol, extracts are prepared by sonication of formaldehyde cross-linked chromatin. DNA fragments associated with a target protein are co-immunoprecipitated from this extract using an antibody; in parallel, a portion of the total DNA in the extract is used as an input control sample. The sequences of the purified immunoprecipitated DNA and input control DNA are determined by sequencing. Usually, ~35
bp of sequence is read from millions of DNA fragments in the ~150- to 300-bp range (28
), then these sequence reads are mapped onto a reference genome.
Because a ChIP input control library should, in theory, equally represent all regions of the genome, we expected that its sequence reads would be distributed relatively uniformly over the genome. However, we observed that C. elegans
ChIP input control DNA sequence data (hereafter referred to as input sequence) display strong reproducible patterns (a). These patterns do not appear to be due to cross-linking and incomplete solubilization of DNA fragments in the input extract because similar patterns are evident in published genomic DNA sequence data (7
), where the DNA was deproteinated prior to fragmentation and not cross-linked (a). The reproducibility of the patterns indicates that they are not due to random sampling variations. Human input sequence has also been shown to have biased genomic coverage (11
We found that the input sequence read counts correlate with the GC contents of the underlying genomic sequences (a and Supplementary Table S1
). To examine this potential GC bias in more detail, we compared the GC frequency distribution of the C. elegans
input sequence reads with the expected distribution of the genome. This demonstrated an over-representation of GC-rich and an under-representation of AT-rich sequences in the input sequence data (b and c). A similar bias was seen with C. elegans
and human genomic DNA sequence (Supplementary Figures S1a and S2
Because genomic features can deviate substantially from the mean genomic GC content, a GC bias in sequencing data could lead to an artefactual inflation of read counts in such regions. For example exons are generally more GC-rich than introns (14
), as visualized by plotting GC content across a set of aligned internal exons and their flanking introns (a and g). As expected from the GC bias in input sequence data, we found a strong enrichment of sequence reads across C. elegans
and human exons (c and i). Similarly, raw input sequence data showed enrichment near aligned transcript start sites (TSSs) which resembles the GC content plots of the same regions (d, f, j and l). We also observed exonic and TSS peaks in raw C. elegans
genomic DNA sequence (Supplementary Figure S1
b and c).
Figure 2. Bias in GC content (a, d, g, j), mappability (b, e, h, k) and raw input sequence signals (c, f, i, l) across internal exons and around transcript start sites in C. elegans and human. The error bars represent the 95% confidence intervals of the estimated (more ...)
A second source of potential bias lies in the read mapping procedure. Short sequence reads obtained in ChIP-seq experiments can only be confidently mapped on the reference genome if their sequences are unique. After mapping, reads are usually either extended to the expected fragment size or shifted towards the centre of the distributions made up of forward- and reverse-strand reads (28
). Signals are then sampled at regular intervals to construct a genomic profile. Therefore, non-mappable regions and adjacent areas will have lower than expected read counts. We quantified mappability across the entire genome (see ‘Materials and Methods’ section) and found that it is higher in exons than in introns in both C. elegans
and human genome (b and h). A region of higher mappability was also noticed near TSSs (e and k), as observed previously in human genome (11
). Correction of the biases in GC content and mappability of high-throughput sequencing data is clearly necessary for determining true enrichment patterns in ChIP experiments.
We first explored whether a simple division by input control would correct biases in sequence data. We sequenced several different libraries of input control DNA (biological replicates) and found that each had a different extent of GC bias (Supplementary Figure S3
a). Moreover, we found that data from multiple sequencing runs of a library (technical replicates) can have different GC compositions (Supplementary Figure S3
b; data from human 1000 genomes project). These observations suggest that division by input would not correctly remove bias. Indeed, we found that dividing one input control by input controls with different GC biases resulted in apparent exonic enrichment or depletion (Supplementary Figure S4
). Similarly, we show below that dividing ChIP-seq data sets by different input controls leads to different resulting enrichment patterns. Therefore, data set-specific biases in both input and ChIP-seq data need to be corrected independently.
To estimate GC bias in a data set and determine correction factors individually for each library, we compared GC histograms of the sequence reads to the expected GC distribution of the genome. For each GC bin a correction factor was calculated by dividing the frequency expected to that observed. This factor was applied to each sequence read in a given bin, with the effect of shifting the experimental distribution to that of the genome. After applying this GC correction to C. elegans and human input sequence data, the enrichments on exons and near TSSs were greatly reduced in magnitude (green plots in a, b, g and h), indicating that GC bias plays a major role in creating systematic biases.
Figure 3. BEADS normalization of high-throughput sequence reads of C. elegans input sequence (a, b), H3K4me3 ChIP (c, d), H3K36me3 ChIP (e, f) and human input sequence (g, h) libraries. Shown are the results following each step of correction: Raw (uncorrected), (more ...)
To calculate a mappability correction, we simulated sequence reads where each possible sequence in the genome was represented once, and mapped these onto the reference genome using the same criteria for processing experimental data (see ‘Materials and Methods’ section). We then extended each read to the estimated fragment size and quantified mappability of a given genomic position by counting the number of possible overlapping extended mapped reads. At each genomic location, we scaled the GC-weighted input sequence signal according to mappability, with higher weights given to locations of low mappability. We introduced an adjustable cutoff filter to remove regions with very low mappability prior to subsequent analyses because the low read coverage of these regions makes it difficult to reliably correct the signals (see ‘Materials and Methods’ section). We found that the mappability adjustment step further flattened the exonic peak of input sequence data (blue plots in a and g). This step also further levelled the signals around the aligned TSSs (b and h), but the effect was smaller presumably because mappability varies less around TSSs than across exons (b, e, h and k). The GC and mappability corrections removed most of the bias in the input sequence data, but a small amount of bias still remained after these steps.
As part of sequencing library generation, DNA molecules are physically broken down into smaller fragments (e.g. by sonication or other methods) and this process could generate bias due to differences in underlying DNA structure. For example, heterochromatin is found to be more refractory to shearing, resulting in an under-representation of these DNA fragments compared to euchromatin (12
). Smaller-scale structural differences due to nucleotide composition might also cause DNA to be differentially susceptible to breakage. Such chromatin or DNA structural effects might contribute to the coverage inhomogeneity in high-throughput sequencing data.
In order to correct for such local biases, after GC and mappability correction, we pooled several independently generated input sequence data sets to create a ‘master’ control data set to reflect reproducible local biases. We then applied a local correction by dividing the signals of the experimental data set by the master control data set at each position in the genome. The input sequence data being studied was not included in the master control data set. We found that the local correction step further removed the unexpected enrichments of C. elegans input sequence data seen on exons and near TSSs (a and b). The signals from fully normalized input sequence data also no longer followed the patterns of GC content or mappability of underlying sequences (a). For human input sequence, the division step did not have a large correctional effect on the overall signal profile. The GC and mappability corrections appeared sufficient to remove biases across TSSs and exons (g and h). It is possible that the DNA structural differences were too subtle to be captured by the sequenced data due to its poor read coverage (~0.04×).
We next addressed normalization of C. elegans
ChIP-seq data. For this, we focused on three well-studied histone modifications, trimethylation of lysine 4, lysine 9 and lysine 36 of histone H3 (H3K4me3, H3K9me3 and H3K36me3). We previously showed using chromatin immunoprecipitation followed by microarray hybridization (ChIP-chip) in C. elegans
, that similar to other organisms, H3K4me3 shows discrete peaks of enrichment near TSSs of transcribed genes, H3K9me3 shows higher enrichment across silent genes than active genes, and H3K36me3 is broadly enriched on the actively transcribed regions of genes (22
). Further, there is a significant enrichment of H3K36me3 on exonic compared to intronic chromatin. In contrast, H3K4me3 and H3K9me3 do not show general exon enrichment.
Chromatin immunoprecipitation enriches for DNA bound to factors of interest, but 60–99% of ChIP reads are estimated to be background noise (29
). Given the observed biases in input sequence data, this background is likely to cause artefactual patterns. Indeed, in addition to the expected signals, we found that H3K4me3 and H3K9me3 ChIP-seq data also showed enrichment on exons that was not seen in ChIP-chip mapping experiments (22
) ( and Supplementary Figure S5
). An expected exonic enrichment was observed in raw H3K36me3 ChIP-seq data (e), but it is possible that background bias contributes to this signal, making analysis of amount of enrichment difficult.
We tested the effects of normalization through division by input control sequence data. We carried out ChIP of H3K4me3 and H3K36me3 each from a different wild-type extract, and then we prepared libraries and sequenced ChIP and input DNA samples. When the input sequence of matched extract was used as a divisor for H3K4me3, we observed an unusual pattern: loss of the expected enrichment just downstream of the TSS and lower signal in exons relative to introns. Using the input sequence of non-matched extract as a divisor resulted in an expected promoter peak and removed most of the exonic enrichment (Supplementary Figure S6
a and b). For H3K36me3, dividing by the extract matched input control produced an exonic enrichment as expected but using the other input resulted in an exon depletion (Supplementary Figure S6
c and d). These different and unexpected enrichment patterns are likely to be due to the input and ChIP samples having different amounts of technical bias. We conclude that dividing ChIP signals by input signals is not a generally appropriate method for correcting systematic biases in sequencing data.
We next applied BEADS to remove bias from H3K4me3, H3K9me3, and H3K36me3 ChIP-seq data. Because a factor of interest might have a true binding preference for AT- or GC-rich sequences (e.g. H3K36me3 is enriched on exons, which are also GC-rich), using the entire set of ChIP-seq reads as we did for input control data could cause either over- or under-correction. Therefore, we attempted to separate the reads into two components, potential enriched regions and background, and used only background reads for estimating GC bias. We defined sequence reads as background if they did not overlap regions of potential enrichment identified using a peak calling program [(20
); see ‘Materials and Methods’ section]. After deriving GC correction factors from background reads, we applied the correction to the entire set of ChIP sequence reads. Therefore, an AT or GC preference for factor enrichment would not affect estimation of GC bias in the sequence data set. After BEADS normalization of H3K4me3 ChIP-seq data, the TSS peak was still present, as expected, but the exon peak disappeared (c and d). Similarly, H3K9me3 enrichment on exons was removed by BEADS normalization, but enrichment on silent genes was maintained (Supplementary Figure S5
). In contrast, exonic enrichment of H3K36me3 was still observed after BEADS normalization (e and f).
We found that generation of the background data set was relatively insensitive to over- or under-identification of peak regions (Supplementary Figure S7
). In addition, deriving GC correction from only a subset of sequence reads did not introduce apparent artefacts. Using only reads in the H3K4me3 or H3K36me3 background regions to calculate GC correction factors for input sequence data yielded similar results to using the entire set of input sequence reads (Supplementary Figure S8
). However, we found that use of the correct average library fragment size is necessary for effective bias removal. Under- or over- extending read lengths led to the generation of artefactual patterns (Supplementary Figure S9
We also tested the effect of BEADS normalization on the distributions of two proteins with punctate binding patterns: DPY-27, a component of the C. elegans
dosage compensation complex that has been shown to bind specifically to foci on chromosome X (23
), and BLMP-1, a C. elegans
transcription factor (26
). Similar to raw H3K4me3 ChIP-seq data, raw DPY-27 and BLMP-1 ChIP-seq data showed an unexpected enrichment on exons in addition to the sharp peak detected on known binding sites (Supplementary Figure S10
). After applying BEADS normalization to the data set, the expected peaks on chromosome X foci for DPY-27 and on binding sites of BLMP-1 were still sharp and high in magnitude, whereas the unexpected exon peaks were removed (Supplementary Figure S10
We have demonstrated that BEADS removes systematic biases present in high-throughput sequencing data. Before normalization, raw input sequence and ChIP data sets showed artefactual enrichments not reflective of true biology. After BEADS normalization, these enrichments were removed, but previously documented enrichments determined by other methods (such as ChIP-chip or qPCR) remained intact. Bias correction was thus essential for the analysis of these ChIP-seq data. Furthermore, when factor enrichments are low, false signals from systematic biases are likely to dominate. However, we note that since BEADS uses ‘background’ reads to derive GC correction factors, it is difficult to apply to data sets where background reads cannot be easily identified (e.g. nucleosome or RNA-seq data).
We expect BEADS to be useful in other applications of high-throughput sequencing besides ChIP-seq. For example, BEADS normalization could aid the detection of genome copy number variations, where bias in the distribution of mapped sequence reads could mask or enlarge differences in copy number. Bias correction using BEADS might also improve the performance of peak-calling algorithms.