Covalent modifications of chromatin, including DNA methylation and histone modifications, play critical roles in gene regulation and cell lineage determination and maintenance (Bernstein
et al.,
2007; Felsenfeld and Groudine,
2003). Defects in these epigenetic controls have been implicated in many pathological conditions in humans. Genome-scale profiling of these epigenetic marks has been dramatically facilitated by the recent progress in the ultra high-throughput massively parallel sequencing technologies (Barski
et al.,
2007; Mikkelsen
et al.,
2007). ChIP-Seq combines chromatin immunoprecipitation (ChIP) with high-throughput sequencing to map genome-wide chromatin modification profiles and transcription factor (TF) binding sites. It is characterized by high resolution, a quantitative nature, cost effectiveness and no complication due to probe hybridization as encountered in ChIP-chip assays (Schones and Zhao,
2008). A large amount of data has recently been generated using the ChIP-Seq technique, and these datasets call for new analysis algorithms.
Binding of TFs is mainly governed by their sequence specificity and therefore is typically associated with very localized ChIP-Seq signals in the genome. A number of algorithms have been developed to find the exact locations of TF binding sites from ChIP-Seq data (Chen
et al.,
2008; Fejes
et al.,
2008; Ji
et al.,
2008; Johnson
et al.,
2007; Jothi
et al.,
2008; Kharchenko
et al.,
2008; Nix
et al.,
2008; Rozowsky
et al.,
2009; Valouev
et al.,
2008; Zhang
et al.,
2008a). In contrast, the signals for histone modifications, histone variants and histone-modifying enzymes are usually diffuse and lack of well-defined peaks, spanning from several nucleosomes to large domains encompassing multiple genes (Barski
et al.,
2007; Pauler
et al.,
2009; Wang
et al.,
2008; Wen
et al.,
2009) (see, e.g.
Figure S1). The detection of diffuse signals often suffers from high noise level and lack of saturation in sequencing coverage. These generally weak signals render approaches seeking strong local enrichment, such as those peak-finding algorithms used in finding TF binding sites, inadequate.
Many modification marks are known to form broad domains (Barski
et al.,
2007; Wang
et al.,
2008). This is believed to be helpful in stabilizing the chromatin state and propagating such states through cell division robustly (Bernstein
et al.,
2007). A well-studied case is the trimethylation of histone H3 lysine 9 (H3K9me3). H3K9me3 recruits HP1 via its chromodomain. HP1 in turn recruits H3K9 methyltransferase Suv39h, which modifies H3K9 on other histones in the vicinity, thereby self-propagating the heterochromatin state (Aagaard
et al.,
1999; Bannister
et al.,
2001; Lachner
et al.,
2001). Another example is the trimethylation of histone H3 lysine 27 (H3K27me3). H3K27me3 is generated by the activity of the Polycomb complex, PRC2, and is believed to recruit the PRC1 complex (Schwartz and Pirrotta,
2007). In
Drosophila, it has been suggested that the spreading of H3K27me3 results from looping action of PRC1 and PRC2 that both anchor at the polycomb response elements (Schwartz and Pirrotta,
2007) with nucleosomes at a distance. Recent experiments in human cells indicate direct recruitment of PRC2 by H3K27me3 (Hansen
et al.,
2008), suggesting a mechanism for the spreading of H3K27me3. In addition to histone methylation, the more dynamic histone acetylation marks also cluster, and several histone acetyltransferases contain bromodomains that specifically bind acetylated histones (Dodd
et al.,
2007; Jacobson
et al.,
2000; Owen
et al.,
2000).
Motivated by the mounting evidence of recruitment by modified histones of their respective enzymes, we develop a spatial clustering approach for the identification of ChIP-enriched regions (SICER) in histone modification data. A central feature of our method is pooling together signals from all the nucleosomes located together in the same modification state. This feature improves the signal-to-noise ratio and is especially helpful in dealing with the difficult case of diffuse enrichment covering extended genomic regions produced by histone modifications, for which enrichment at any short distance of one or several nucleosomes does not appear to be significant enough.
Our method involves scoring each potential ChIP-enriched domain according to the collective profile of enrichment on the domain. We developed a mathematical theory for the score distribution in a genomic background model of random reads, and employed this theory to identify spatial clusters, large and small, unlikely to appear by chance. Utilizing a control library, we identified a set of candidate domains that exhibit ChIP signal clustering using the random background model, and compare the strength of the ChIP signal with that of the control signal at each candidate domain to determine the significance of enrichment. Using a scaling approach for evaluation of false positives that is based on the digitized nature of ChIP-Seq data, and two datasets with experimental validation, we demonstrated that SICER outperforms other ChIP-Seq methods in dealing with histone modification data. Furthermore, we demonstrated its use as an unbiased general noise filter in such important issues in the statistical analysis of ChIP-Seq data as data normalization, and scaling analysis of sequencing coverage (see
Supplementary Material).