To motivate our work, we examine data for histone modification H3K27me3 (K27) in mouse ES and NP cells
[8]. displays a window of the data mapped onto the mouse genome through the UCSC genome browser
[17]. ES data has better S/N ratio as well as more peaks in gene-rich regions than in gene-poor regions. These characteristics introduce a bias that must be eliminated before comparing ES data to NP data, as can be seen in the results of the ChIPDiff method
[9] in the same figure: most of the differentially NP enriched regions proposed by ChIPDiff fall within gene-poor regions and are almost certainly false positives.
In the following, we use a notation similar to that of Xu
et
al. 2008
[9]. In particular, we assume that the data has been processed by dividing the genome into bins and collecting, for each bin, a count of the mapped sequence tags that fall within the bin. The result is a “library”, which is simply a list of positive integers, each successive integer associated with the next bin. Let

and

be two libraries containing the same histone modification for two different cell types – in our example, modification K27 for ES and NP cell types. Let

be the total number of bins in the library and set

and

to be the observed counts of the ChIP-seq fragments for libraries

and

respectively, where

, respectively

, are the sum of the fragments lying in the

th bin. In ChIP-seq, a tag is retrieved by sequencing one end of the ChIP fragment, and the median length of this fragment is around 200 bp
[5],
[18]. As was done in Xu
et
al. 2008
[9], we approximate the center of each fragment by shifting the tag end position by 100 bp downstream or upstream, according to its orientation on the chromosome. We choose different bin sizes for different types of the histone modifications so as to maximize the discriminative information between the two libraries of the different cell types and minimize the discrimination of the two replicates of the same libraries. We use the spread of the data in the scatter plots as a measure for discriminative information. A lower bin size favors a better spread (away from the diagonal) between the data of the two cell types in the scatter plot.
An observed fragment count

at the

th bin can be related to the actual number of histone modifications

at the

th bin using the following model:
Function

is the (unknown) deterministic function that describes the nonlinear transformation of the actual histone modifications, accounting for the various experimental conditions that may influence the observations in a systematic way. The additive

term accounts for the stochastic (background) noise introduced by the experimental setup, such as stray fragments from neighbouring modifications. Finally, the parameter

accounts for local genomic bias, mainly bias due to open chromatin regions and mapability, such noise is common in both the actual ChIP-seq library and the corresponding control dataset. Naturally, one could choose a stochastic, rather than a deterministic model for

; but since our goal is to detect regions with differential enrichment, and not to produce a detailed predictive model, the deterministic approach suffices.
ChIPnorm scheme
We will address each source of error separately and proceed in two main stages. In the first stage, we address the removal of stochastic background noise and local genomic bias in each library. In the second stage, we address the problem of normalization.
Stochastic noise

To solve the problem of stochastic background noise, both Bayesian modelling methods
[19] and statistical confidence measure methods have been used. In terms of statistical confidence, the problem reduces to evaluating the probability that a particular bincount (bincount is the total number of DNA fragments captured inside a bin of some size) would occur by chance
[20]. We estimate this probability by defining a “null hypothesis”, which is a random distribution of bincounts, and then comparing it with the distribution of bincounts of the ChIP-seq library.
To understand the rationale for choosing the amplified binomial distribution (ABD) as the random distribution for the bincounts of ChIP-seq library, we must examine the ChIP-seq process as illustrated in . The short segments of DNA are treated with a specific antibody to capture a particular histone modification in the genome. The rest of the fragments are washed away. The captured fragments are sequenced by a high-throughput sequencing method, which typically uses PCR amplification of the captured fragments
[21] before performing the final base-pair sequencing. The sequenced data is binned to obtain a ChIP-seq library.
To estimate the null distribution of a ChIP-seq library, we assume that the fragmented DNA from the sonicated whole cell is treated with an antibody that randomly captures fragments without any specificity. The bincounts of captured fragments then follow a binomial distribution. The captured fragments are amplified, sequenced, and binned to get the null distribution of the ChIP-seq library. This binned data follows a random distribution, in which each of the fragments following the binomial distribution is amplified; we refer to it as the amplified binomial distribution (ABD). We assume that each fragment is amplified by a constant amplification factor.
An ABD can be defined by two parameters: the total number of fragments

in its corresponding binomial distribution and the amplification factor

. We made two assumptions to calculate these parameters for the desired null distribution: (a) the total number of the fragments in the original ChIP-seq data

and in the corresponding null distribution

are the same, and (b) the total number of bins with zero bincount is the same in the ChIP-seq library (

) and the null distribution. Since the amplification does not change the number of bins with bin-count zero, we can write

, where

is total number of bins in the library. Since

and

are observed variables,

and

can be evaluated. Now the probability mass function of the random distribution prior to amplification is
The ABD is thus estimated as

.
We calculated the false discovery rate (FDR) for a ChIP-seq data using its corresponding estimated null distribution (ABD)
[20]. We declared bins with FDR

as significant bins in the ChIP-seq library. The value 0.05 is the standard value used in the field
[20].
Local genomic bias

ChIP-seq data contains many local genomic biases corresponding to open chromatin regions, over-amplified satellite repeats, GC-rich regions, and unmappable perfect repeat regions. Some of these biases depend on experimental conditions and others may vary among the cell lines in a systematic manner. If such cell-specific biases are not taken care of while comparing ChIP-seq libraries it will give many false positive differential regions. Some of these biases can be reduced by using an input DNA control library. To find differential regions, we must consider only those regions that are significantly enriched with respect to the input DNA control. The input DNA control are the DNA fragments in the ChIP-seq experiment prior to the application of the histone specific antibody. To find the enriched bins, we need to normalize the input DNA control with respect to the data.
Yang
et
al. 2002
[22] recommended the use of locally weighted regression (LOESS) normalization. The basic assumption is that the percentage of the differential sites, considered as outliers by LOESS, is small so that these sites do not affect the normalization. However, this is not true in the case of a ChIP-seq library. In a ChIP-seq library, the percentage of bins that are differentially enriched relative to the input DNA control can exceed 50%. These differentially enriched bins will affect the LOESS normalization as shown in
Figure S1(a) and lead to many false negatives. We introduce an iterative normalization to overcome this problem.
In the first stage of iterative normalization, we normalize the DNA control with respect to the data using quantile normalization. For illustration purposes, we first show the LOESS curve of the MA plot in
Figure S1(a). We first classify bins enriched with respect to the DNA control (fold change

) after quantile normalization as outliers. These outliers are then removed from further iterations.
Figure S1(b) shows the LOESS curve after removing the outliers. We see that the LOESS curve is relatively less affected by outliers. In the next iteration, we use non-outliers bins for second quantile normalization to get a more accurate estimate of the normalization function. Once we get this normalization function, we normalize (quantile) all bins. The process of removing outliers and then performing normalization can be repeated to rescue more bins falsely declared as non-enriched. Each bin is declared as enriched if its fold-change value of ChIP-seq data and quantile normalized control data is

. (The LOESS normalization is not used in the ChIPnorm method but is shown here only for illustrating the affects of normalization and removing outliers. Instead quantile normalization is used).
Bins which are declared both as significant in the ABD approach and as enriched in the iterative approach, are declared as ‘enriched-significant’. Bins which are declared as enriched-significant in either of the two libraries are passed on to the second stage, with their original bincount values.
Quantile normalization
Since our first stage removed the majority of bins with low S/N ratios and genomic bias, and since we expect interesting regions to have good S/N ratios, we now make the simplifying assumption:

and

.
With this assumption the next step is to normalize the data to the same scale so that bin values in the two libraries are comparable. Mathematically, given two observed data

and

, find a transformation

such that

.
We propose a quantile normalization method (similar to Bolstad
et al. 2003
[16]) to solve this normalization problem. Quantile normalization assumes that the distribution of the data of the two libraries that are being compared are similar. This may seem problematic because histone modifications change significantly during differentiation. But it is reasonable to assume that their probability distribution of the bincount over the whole genome is similar across different cell types. (This might not be true if one of the two libraries have histone modifications knocked out.) We use the inverse cumulative distribution function (on the modified data after removing noisy bins) of the enrichment level, as shown in . The

axis of this figure is the percentile while the

axis is the bin values. The figure shows the

and

bin values plotted against their cumulative percentile. To get the desired transformation of

, we must ensure that the post-transformation data

follows the same cdf as

. We fit a spline smoothing function on the bin values of library

, then, for all percentile values

, we perform a transformation

such that

. This is the desired transformation

. The proof is given in the supplementary material (
Supporting Information S1).
This transformation reduces the problem of comparing two libraries with different probability distributions to the problem of comparing two libraries following the same probability distribution, so that a direct comparison of values can now be used. Since in the second stage we considered bins which were declared as enriched-significant in either of the two libraries

and

(a union operation), some bins which are not declared as enriched-significant would be present in the second stage too. If both libraries were completely independent events, we would expect 50% of the bins to be enriched-significant, because of the union operation. In effect, we define a bin in library

to be differentially enriched for the target modification if (i) observed bin value in libraries

lies above the

region in the inverse cumulative distribution function and (ii) for some chosen fold change threshold

(

), we have

. Similarly we can define a bin in library

as differentially enriched if the bin value in library

lies above the

region in the inverse cumulative distribution function and for for some chosen threshold

(

), we have

. All bins are thus reported as differentially enriched or not. Adjacent bins of the same type of differential enrichment can be grouped together to form differential regions (DHE).
The complete ChIPnorm method summarized
The complete ChIPnorm method is summarized in the . In the first stage, we identify bins having a significant bincount compared to the estimated random distribution of a ChIP-seq library as significant bins, by using a false discovery rate (FDR) analysis. We also identify bins of a ChIP-seq library as enriched bins, if their bincounts are higher than the corresponding bincounts of the normalized input DNA control. Those bins which are both significant w.r.t. null hypothesis and enriched w.r.t. normalized input control DNA are declared as enriched-significant bins. Bins which are declared as enriched-significant in either of these two libraries are passed to the second stage. In the second stage, we normalize the enriched-significant bincounts of the two ChIP-seq libraries and use a fold change to obtain differentially enriched bins.
The normalization can also be used to find bins that are enriched in both libraries, thereafter called constitutively highly enriched (CHE). Bins which are above 50% in

and

in the inverse cumulative distribution function and also below threshold

are declared as CHE. While differential histone modification enrichment (DHE) regions help us understand why different types of cells behave differently, CHE regions are conserved between the cell types and thus presumably essential to the survival of both types.
The normalization method also facilitates the comparison of more than two libraries. Our method is easily extended to handle multiple types of histone modifications in multiple cell types. Such analyses can give more insight into combinatorial patterns of histone modifications, sometimes referred to as “histone language”
[23]. For example, Bernstein
et
al. 2006
[24] hypothesized that a bivalent domain with both H3K27me3 and H3K4me3 modification at the same site plays a crucial functional role in embryonic stem cells. Finally, the ChIPnorm method can be used with any ChIP-seq data – not just with histone modifications.
Experimental Design
We carried out a series of experiments with the two libraries for H3K27me3 and H3K4me3 histone modifications (ES and NP cells), including experiments for bias and sensitivity. Since H3K4me3 has sharper peaks than H3K27me3, it needs a finer resolution, and smaller bin sizes are used. Using the replicate data analysis described earlier, we chose a bin size of 1000 bp for H3K27me3 (K27) and a bin size of 200 bp for H3K4me3 (K4). The bin size of 1000 bp for H3K27me3 has also been used previously in literature
[9]. We compared ChIPnorm with six other normalization methods: (a) unit mean normalization; (b) quantile normalization; (c) MACS peak finder; (d) ChIPDiff method
[9]; (e) rank normalization; and (f) two-stage unit mean normalization. We ran these methods on the H3K27me3 data for ES and NP mouse cells provided by Mikkelsen
et
al. 2007
[8] (with whole cell extract (WCE) control library) and on the H3K27me3 data (of Broad Institute) for ES and GM12878 (replicate 1) from the human ENCODE project
[25],
[26]. (GM12878 is a lymphoblastoid cell line produced from the blood of a female donor with northern and western European ancestry by EBV transformation
[25].) Processing was done on individual chromosomes of the two libraries.
The five methods not yet described are as follows:
- • unit mean normalization: is the standard Affymetrix scaling method for microarray data [16]. To perform consistent comparison with the ChIPnorm method, we normalized the two libraries to have unit mean using a method similar to the Affymetrix scaling method. To normalize the bin values
of a library we calculated its trimmed mean
(the mean of the non-zero bins in the library) and then the normalized bin value is set to
. Finally a threshold (
) was used to classify bins as differential or not. - • quantile normalization: the two libraries are quantile normalized, and a fold change threshold (
) is used to classify bins as differential or not. - • MACS peak finder method: Although MACS is a peak-finding software [13], we use it indirectly to find differential regions as follows: peaks for one library are detected by giving the other library as control, and the bins with peaks are considered as differential regions.
- • rank normalization: the bin values of each of the libraries are sorted separately; the sorted lists are divided into 10 equal partitions, which we define as rank. Finally we compare the values of corresponding ranks at each bin value in both libraries. If the difference between the values is greater than a threshold
then the bin is classified as differential. - • RSEG method: RSEG is a recently published method [12] to not only find peaks in histone modifications but to also identify differential regions (rseg-diff) between two histone modification ChIP-seq libraries.
- • two-stage unit mean normalization: we removed the noisy bins using the first stage of the ChIPnorm method before applying the unit mean normalization and fold change classification.