DNA associates with histone proteins to form chromatin [1]. In cells, histones generally carry post-translational modifications that can modulate gene expression [2]. For instance, there is a genome-wide relation between histone H3 lysine 36 trimethylation (H3K36me3) and transcriptional activity [4]. This and other epigenetic modifications are key to cellular differentiation [6], and their alterations have been associated with early stages of cellular transformation in tumors [7]. The combinations of histone modifications, which can have cooperative or opposing effects on the chromatin state, have been proposed to reflect a histone code that would determine the regulation of gene expression and the cell state [9]. High-throughput sequencing (HTS) technologies provide a very effective way to obtain information about histone modification patterns at a genome-wide scale [10]. Efforts to integrate the available genome-wide chromatin datasets from various conditions are crucial for improving our understanding of the role of epigenetics in gene regulation.
Recent publications have made progress in defining a histone code of gene expression by generating predictive models of transcriptional activity based on histone mark information [11]. They provide insights into possible mechanisms of regulation and a formal description of the postulated histone code [18]. These methods generally relate the histone signals obtained from chromatin immunoprecipitation followed by HTS (ChIP-Seq) [20] to a read-out of gene expression based on expression microarrays or HTS of mRNAs (RNA-Seq) [21]. In these approaches, the chromatin signal is generally represented in terms of read counts or peak significance in the promoter, and sometimes in the gene body. However, these analyses are generally based on a single condition or cell line. That is, they effectively compare the properties of different genes directly, relying on the premise that signals in two different genes are comparable, so the accuracy of the predictive model depends on how accurately the significance of the ChIP-Seq signals is estimated. Yet genes differ in many properties, such as the number of introns or the presence of CpG islands in their promoters, which may affect these measurements. For instance, recent experiments show that the splicing machinery can recruit histone-modifying enzymes and influence the chromatin state, with the consequence that genes with introns tend to have higher levels of H3K36me3 signal [22]. Thus, the comparison of genes with and without introns is not straightforward. Additionally, various other factors may affect the local density of HTS signal [23]. For instance, the tag counts from an HTS experiment are influenced by the chromatin structure of the DNA and by shearing effects [24], not all regions have the same mappability [27], and there is often a GC bias in the reads [28]. These issues result in differences in coverage between regions, which are further exacerbated for the broad signals obtained in histone ChIP-Seq experiments. Control samples can partly alleviate this, but their effectiveness depends strongly on the sequencing depth. Thus, HTS signals from two genes are not, in general, directly comparable.
Here, we propose a new method to measure epigenetic signals and to relate them to expression based on the comparison between two conditions. In our approach, the same genomic locus is compared between two conditions; hence, the predictive model describes changes in gene expression in terms of relative changes in epigenetic mark densities between two conditions or cell types. The significance of these changes is calculated taking the read density into account, thereby mitigating the confounding effects mentioned above. Additionally, unlike a previous method that made pairwise comparisons of epigenetic data from cell lines [17], our method considers continuous changes in the epigenetic signal densities rather than an on-off state description. Moreover, our framework provides greater flexibility than previous approaches for the generation of computational predictive models.
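As an illustration only, the idea of comparing the same locus between two conditions, and of judging a change relative to the local read density, can be sketched as follows. This is a minimal sketch under stated assumptions and not the procedure used in this work: the function names (`rpkm`, `ma_values`, `local_zscores`), the pseudocount, and the sliding-window z-score are all hypothetical choices made for the example.

```python
import math
from statistics import mean, pstdev


def rpkm(count, length_bp, total_reads):
    """Depth- and length-normalized density: reads per kilobase per million mapped reads."""
    return count * 1e9 / (length_bp * total_reads)


def ma_values(dens_a, dens_b, pseudo=1.0):
    """For each locus return (M, A), where M is the log2 fold change between
    condition B and condition A, and A is the mean log2 density of the two.
    The pseudocount (an assumption of this sketch) avoids log2(0)."""
    out = []
    for a, b in zip(dens_a, dens_b):
        la = math.log2(a + pseudo)
        lb = math.log2(b + pseudo)
        out.append((lb - la, (la + lb) / 2))
    return out


def local_zscores(ma, window=5):
    """Z-score of each M value relative to loci of similar density.
    Loci are ranked by A; each M is compared with the M values of up to
    `window` neighbours on either side in that ranking (hypothetical scheme)."""
    order = sorted(range(len(ma)), key=lambda i: ma[i][1])
    z = [0.0] * len(ma)
    for rank, i in enumerate(order):
        lo = max(0, rank - window)
        hi = min(len(order), rank + window + 1)
        neigh = [ma[order[j]][0] for j in range(lo, hi)]
        sd = pstdev(neigh)
        # A flat neighbourhood carries no information about spread.
        z[i] = (ma[i][0] - mean(neigh)) / sd if sd > 0 else 0.0
    return z


# Toy densities for eleven loci; only the last one changes between conditions.
cond_a = [10.0] * 11
cond_b = [10.0] * 10 + [80.0]
scores = local_zscores(ma_values(cond_a, cond_b))
```

In this toy example the last locus receives the largest score, because its fold change stands out against loci of comparable density. Since each locus is compared only with itself across conditions, locus-specific factors such as mappability or GC content affect both measurements in a similar way.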
To illustrate our method, we have built a model of expression regulation from epigenetic changes using data from various ENCODE cell lines [29]. To extend this relation, we include additional epigenetic data not considered previously, namely HTS of DNase I hypersensitive sites (DNase-Seq) [30] and DNA methylation data [31]. Our results show a different epigenetic code for expression in intron-less and intron-containing genes, with this difference being more prominent in genes with low GC content around the transcription start site. Moreover, when anti-sense transcription and overlapping promoters and tails from different genes are eliminated, which has not been done before, the prediction accuracy improves considerably. Furthermore, the predictive model built from one pair of cell lines performs with high accuracy on a different pair. Finally, we are able to generate a minimal code for expression regulation between two cell lines that is generic enough to correctly predict the regulatory outcome of up to 70% of transcripts from a different pair of cell lines.