High-throughput sequencing of chromatin immunoprecipitated DNA, or ChIP-seq [1
], has replaced microarray-based techniques as the standard tool for investigating protein-DNA interactions in the cell. However, the data generated will always contain noise due to sequencing biases, PCR-artefacts, low complexity regions/mappability, chromatin structure or non-specific antibody interactions in the ChIP-step. The noise appears as a background distribution of reads, or tags, which must be taken into account in downstream analyses such as peak-calling.
Experimental assessments of the background read distribution is favoured over purely theoretical and therefore not experimentally validated background models [2
]. One such assessment is to sequence the sonicated sample prior to immunoprecipitation (IP). The resulting read distribution is commonly referred to as 'input'. Ideally this distribution would be uniform but Kharchenko et al
] identifies three types of repeatable anomalies that arise in input: singular peaks with very high pile-up, non-uniform wide clusters of increased tag density and, lastly, small clusters of tag densities resembling real peaks but typically with small strand separation where aligned reads pile up in non-meaningful ways. The latter anomaly is difficult to distinguish from a true pile-up. These anomalies are significant to the analysis of ChIP data because the precipitate is a mixture of protein-DNA complexes and bare DNA; ChIP only enriches the protein target and typically only a few percent of the sequenced reads fall within identified peaks [3
]. Another source of false positives in ChIP-seq analysis is non-specific binding in the immunoprecipitate. To control for that, the sample can be precipitated using non-specific antiserum, i.e. immunoglobulin G (IgG) that does not have a known antigen in the organism under study. It should be noted that since the degree of enrichment will vary between different antisera an input control experiment adds information to the IgG control. The observed distribution in a specific IP is a mixture of reads due to input anomalies, non-specific and specific IP.
Many of the recently reported peak-callers for ChIP-seq data can make use of control-data to improve predictions of enriched regions. The strategies for correction of background densities vary but are, for instance, performed by simple contrast approaches such as subtraction of the background read distribution from the ChIP-signal or by calculating fold-changes. Other more sophisticated ways of filtering the peaks have been proposed such as using the background read densities as priors in a statistical framework or estimating the false discovery rate. See [2
] for a discussion of techniques and overview of peak-callers. However, none of the peak-callers offers a way to export the transformed (normalized) raw signal (e.g. ChIP-seq pile-up) actually used for inferring binding sites in the same format as the raw ChIP-seq data. Consequently, there is no way to visualize or compute statistics on the processed signal used internally in the peak-callers to detect enriched regions. The only normalization method published so far seems to be the one introduced by Taslim et al
]. This method yields an output signal with limited resolution due to its use of summary statistics in sliding windows of typically length 1 kb along the genome. This resolution might be sufficient in the application of main interest to Taslim et al
, which was detection of regions with differential enrichment of RNA polymerase II between conditions, where the exact location of sequenced reads is not required. However, it is an important limitation in applications where fine resolution mappings of for example protein-DNA interactions are studied.
Another issue with ChIP-seq data besides background noise is that different manufactures and versions of sequencing hardware produce reads of different sizes (usually 35-75 nucleotides). To facilitate comparison between different setups it is desirable that the representation of the signal is independent of the read length. There are at least two possibilities to make the representation independent of read-length. One option is to only use counts at the start of aligned reads, the 5' coordinates of reads that aligned the sense strand and the 3' coordinates of reads aligned to the anti-sense strand of that fragment, referred to as 5' and 3' below. Another commonly used option is synthetic in silico
extension of the read-length to the estimated mean length of the sequenced fragments. The latter results in extended ChIP-seq reads, also known as extended short-read single-end tags (XSETS) [6
]. The counts of XSETS can then be added up to produce a combined pile-up signal. Its merit can be seen in Figure where these two different representations of the signal are exemplified. The top panel shows the coverage of reads aligned to the sense (blue) and anti-sense strands (red), and also the combined coverage signal (XSET) in black where each read have been extended with 150 bp. The panels immediately below show the raw (middle) and smoothed (see Methods, bottom) estimates of 5' and 3' locations, respectively. Although there are two clearly visible peaks in the top panel, the 5' and 3' estimates do not surpass three counts at any individual position, which means that it is not easily detected by eye. This is at least partly due to the high biological variation within the cell population and the randomness of the shearing process: the 5' and 3' signals seldom pile up at a single position but rather enrich a small region. The prolonged signal, on the other hand, will generate pile-ups over a larger region, typically centred between the regions enriched by the 5' and 3' signals.
Figure 1 Strategy overview, synthetic & experimental results. (A) Results of the proposed normalization strategy on synthetic data. The top three panels represent (from top to bottom): i) the synthetic read signal, (sense in blue, antisense in red and (more ...)
However, synthetically extending sequence reads relies not only on an accurate estimate of the fragment length but also on that the estimate is representative of the distribution of fragments. Hence we focused on a method that produces normalised 5'- and 3'-read counts. Here we present, to our best knowledge, for the first time a normalization algorithm for ChIP-seq data that preserves the high resolution needed to fine map protein-DNA interactions. Since the fragment lengths will vary between experiments we apply an averaging (see Methods) of the 5' and 3' coordinates. The algorithm is based on regression modelling that uses sufficiently small windows (5 bp default) to retain high resolution whilst correcting for one or multiple experimental control measurements simultaneously. We present a demonstration of the strategy on a simulated example data set as well as an in depth evaluation of the normalisation procedure when applied to experimental transcription factor ChIP-sequencing data.