Cis-acting regulatory DNA elements, such as promoters, enhancers and insulators, play an essential role in establishing precise temporal and tissue-specific gene expression patterns. Systematic and precise mapping of these regulatory DNA elements, especially enhancers, is a prerequisite for understanding gene expression programs in both healthy and diseased cells. Experimentally, enhancers can be mapped using chromatin immunoprecipitation coupled with microarray chip (ChIP-Chip) (Kim and Ren, 2006) or short-read sequencing (ChIP-Seq) (Park, 2009). However, this approach is limited by the availability of the large number of ChIP-grade antibodies needed to specifically recognize the transcription factors (TFs) of interest. Alternatively, enhancers can be computationally predicted based on the observation that they often contain dense clusters of TF binding sites (TFBS) in a short stretch of DNA (<1000 bp) and are often conserved. Methods relying on clustering of TFBS (Frith et al.; Pennacchio et al.; Sinha et al.) require prior knowledge of the binding specificities of the TFs involved, which is still quite limited. Methods based on sequence conservation (Blanchette et al.; King et al.; Visel et al.) require precise alignment of regulatory DNA sequences from multiple species, which is not necessarily achievable for all elements.
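To make the TFBS-clustering idea concrete, the following is a minimal sketch, not the algorithm of any of the cited methods. It assumes that the genomic coordinates of predicted TFBS on one chromosome are already available, and it reports every window of at most 1000 bp that begins at a site and contains a minimum number of sites. The function name `dense_tfbs_windows` and the default threshold `min_sites=5` are hypothetical choices made only for illustration.

```python
from bisect import bisect_right


def dense_tfbs_windows(site_positions, window_bp=1000, min_sites=5):
    """Find windows of at most `window_bp` that hold at least `min_sites` sites.

    site_positions: iterable of integer coordinates of predicted TFBS
                    (assumed to lie on a single chromosome).
    Returns a list of (first_site, last_site, site_count) tuples, one per
    window that starts at a site and meets the density threshold.
    """
    positions = sorted(site_positions)
    hits = []
    for i, start in enumerate(positions):
        # Index one past the last site inside [start, start + window_bp - 1].
        j = bisect_right(positions, start + window_bp - 1)
        count = j - i
        if count >= min_sites:
            hits.append((start, positions[j - 1], count))
    return hits


if __name__ == "__main__":
    # Hypothetical coordinates: five sites packed together, two isolated.
    sites = [100, 250, 400, 600, 900, 5000, 9000]
    for first, last, count in dense_tfbs_windows(sites):
        print(f"candidate cluster {first}-{last} with {count} sites")
```

Overlapping windows are reported separately here; a practical tool would merge them and, as noted above, would still depend on reliable binding specificities to predict the sites in the first place.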
Histone proteins in chromatin are subject to a number of covalent modifications, primarily at their N-terminal tails, including methylation, acetylation, phosphorylation, ubiquitylation and ADP-ribosylation. These chromatin modifications have profound influences on gene expression (Schones and Zhao, 2008). Numerous genome-wide ChIP-Chip/Seq studies have provided data on the distribution of histone modifications in various model organisms and cell types. A picture is now emerging in which distinct genomic regions, such as enhancers, promoters and gene bodies (of both protein-coding and non-coding RNA genes), have distinct histone modification signatures (Heintzman and Ren, 2009; Schones and Zhao, 2008). For example, high levels of histone H3 lysine 4 methylation have been found at gene promoters and at many enhancers (Heintzman et al.; Wang et al.). In addition, it has been shown that many regulatory elements carry these epigenetic modifications only in specific cell/tissue types or under specific environmental conditions, a context dependence that cannot be determined by comparative genomics based on sequence alone. Collectively, these observations suggest that epigenetic signatures could be an alternative and powerful way to pinpoint regulatory DNA elements in the genome.
Given the rapid growth of genome-wide chromatin modification data from different species and cell types, there is now a pressing need for computational tools capable of integrating various histone modification maps to discover regulatory DNA elements. Recently, several groups have started to develop computational tools to address this need. Heintzman et al. were the first to develop a computational tool for predicting promoters and enhancers in HeLa cells using six histone modification maps covering 1% of the human genome. Their algorithm predicts promoters and enhancers based on correlation to the average histone modification profiles trained on known examples. Despite its success, the profile-based method is limited in two respects: (i) the contribution of each histone modification mark to the classification, and the interdependency among marks, was not examined; (ii) the window size of the histone modification patterns (10 kb) was chosen arbitrarily. To improve the profile-based approach, Won et al. introduced a hidden Markov model (HMM)-based method and used simulated annealing to optimize the window size and the choice of histone modification marks (Won et al.). Evaluated on a set of known enhancers, the HMM-based and profile-based methods achieved a positive predictive value [PPV = TP/(TP + FP)] of 54.8 and 53.0%, respectively, and a sensitivity [Sn = TP/(TP + FN)] of 74.1 and 68.9%, respectively, where TP, FP and FN denote true positives, false positives and false negatives. Besides these two supervised learning-based approaches, Hon et al. introduced an unsupervised approach to identify histone modification signatures by aligning segments of histone modification data (Hon et al.). Using the same set of known enhancers, they showed that the unsupervised method achieved a sensitivity of 53.5%. No PPV was reported for the unsupervised method.
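The profile-based idea and the two evaluation measures quoted above can be illustrated with a short sketch. It is a deliberate simplification and not the published implementation: a single modification track is assumed, a reference profile is formed by averaging known examples, candidate windows are scored by Pearson correlation to that reference, and PPV and Sn are then computed from counts. All names (`average_profile`, `predict_by_profile`, the cutoff `min_r=0.5`) are hypothetical and introduced only for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no shape information
    return cov / sqrt(var_x * var_y)


def average_profile(examples):
    """Bin-wise mean over a list of equal-length training profiles."""
    n = len(examples)
    return [sum(column) / n for column in zip(*examples)]


def predict_by_profile(candidates, reference, min_r=0.5):
    """Names of candidate windows whose correlation to `reference` is >= min_r."""
    return [name for name, profile in candidates.items()
            if pearson(profile, reference) >= min_r]


def ppv(tp, fp):
    """Positive predictive value, TP / (TP + FP)."""
    return tp / (tp + fp)


def sensitivity(tp, fn):
    """Sensitivity, TP / (TP + FN)."""
    return tp / (tp + fn)


if __name__ == "__main__":
    # Hypothetical binned signals around two known enhancers.
    reference = average_profile([[0, 2, 8, 2, 0], [0, 3, 9, 3, 0]])
    candidates = {
        "w1": [0, 2, 7, 3, 0],   # peaked like the reference
        "w2": [5, 5, 5, 5, 5],   # flat
        "w3": [9, 1, 0, 1, 9],   # inverted
    }
    print(predict_by_profile(candidates, reference))
    # Hypothetical counts, not the values reported in the text.
    print(ppv(tp=40, fp=10), sensitivity(tp=40, fn=60))
```

Both measures are undefined when their denominator is zero (no predictions for PPV, no known elements for Sn); the sketch leaves that case to raise `ZeroDivisionError` rather than choosing a convention.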
Although the success of these previous methods is encouraging, there is still considerable room for improvement, judging by the PPV and sensitivity values reported above. Part of the reason for the limited success of previous methods is that they do not fully exploit the signal in the ChIP-Chip/Seq data. From a pattern recognition point of view, improving the feature extraction step in these methods can help ensure that potentially important signals are not missed. For instance, besides amplitude, the shape of the signal peaks (broad versus narrow, symmetric versus asymmetric, etc.) could also be very informative for distinguishing enhancers from other types of functional DNA elements and for distinguishing among different types of enhancers. With this in mind, we hypothesize that by introducing efficient data transformation and feature extraction procedures before classification, we can increase the overall accuracy of a classifier for predicting regulatory DNA elements using genome-wide chromatin signatures.
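As an illustration of what shape-aware features might look like, the sketch below summarizes a binned signal by its amplitude, its width at half maximum and a simple left/right asymmetry score. It is an assumed, minimal example and not the data transformation or feature extraction procedure proposed in this work; the function name `peak_shape_features` and the specific definitions of width and asymmetry are hypothetical.

```python
def peak_shape_features(signal):
    """Summarize the dominant peak of a binned, non-negative signal.

    signal: list of enrichment values over equally spaced bins.
    Returns a dict with:
      amplitude  - height of the highest bin
      width_bins - number of contiguous bins around the peak at or above
                   half of the amplitude (broad versus narrow)
      asymmetry  - (right area - left area) / (right area + left area),
                   in [-1, 1]; 0 means a symmetric peak
    """
    if not signal:
        raise ValueError("signal must contain at least one bin")

    # Index of the highest bin (the first one if there are ties).
    peak = max(range(len(signal)), key=signal.__getitem__)
    amplitude = signal[peak]
    half = amplitude / 2.0

    # Extend left and right while neighbouring bins stay at or above half max.
    left = peak
    while left > 0 and signal[left - 1] >= half:
        left -= 1
    right = peak
    while right < len(signal) - 1 and signal[right + 1] >= half:
        right += 1

    left_area = sum(signal[:peak])
    right_area = sum(signal[peak + 1:])
    total = left_area + right_area
    asymmetry = 0.0 if total == 0 else (right_area - left_area) / total

    return {
        "amplitude": amplitude,
        "width_bins": right - left + 1,
        "asymmetry": asymmetry,
    }


if __name__ == "__main__":
    # Hypothetical signals with the same amplitude but different shapes.
    for name, sig in [("narrow", [0, 1, 10, 1, 0]),
                      ("broad", [2, 6, 8, 10, 8, 6, 2]),
                      ("skewed", [0, 0, 10, 6, 4, 2])]:
        print(name, peak_shape_features(sig))
```

The three toy signals share the same amplitude, so an amplitude-only feature could not separate them, whereas the width and asymmetry values differ.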