In this study, we have derived a novel two-step model to study the relationships between chromatin features and gene expression. With this model, we have shown strong correlation (for example, r = 0.9) between gene expression and chromatin features in various human cell lines, confirming the conclusions from previous studies with better performance. We also took advantage of the wide range of datasets from the ENCODE project and compared the accuracy of predicting RNA measured by different sequencing techniques (that is, CAGE, RNA-PET, and RNA-Seq), and from different cell lines (for example, embryonic stem cells, normal tissue cells, and tumor cells) and different cell compartments. We showed that different groups of chromatin features reflect gene 'on'/'off' status versus gene transcription levels. Also, we revealed different groups of chromatin features predict CAGE- versus RNA-Seq-based expression, suggesting transcription initiation and transcription elongation are represented by different sets of chromatin features. Comparisons among various cellular sub-compartments suggests that the non-polyadenylated RNAs might be regulated by different mechanisms from polyadenylated RNAs, and that chromatin-associated RNAs are likely transcribed, but uncapped.
Although previous studies have already identified the correlation between chromatin features and gene expression levels, our study makes additional contributions in three ways. First, our analysis benefits from the wealth of data produced by the ENCODE project, allowing us to use the widest range of data thus far to study this problem. The ENCODE Consortium quantified RNA species in whole cells and sub-cellular compartments, mapped histone modifications by ChIP-Seq, and measured chromatin and DNA accessibility in various cell lines. Unlike the limitations of other studies (for example, only one cell line, no RNA type), for the first time we have linked gene expression with its effectors in great detail and in well-matched conditions.
Second, we built a novel two-step model to quantify the relationship between chromatin features and expression. Several early studies [7
] either simply described this relationship or quantified chromatin features and/or expression. Recent studies [10
] have assessed the relationship using more sophisticated quantitative models. Here, our model expands upon this previous work by using both classification and regression, giving an even further in-depth analysis of the relationship. Given the observation that nearly 40% of all TSSs are not expressed in each of the investigated datasets (data not shown), applying regression directly on a dataset with many zeros could bias the result. Compared with a regression model alone, the two-step model shows an improvement in performance (for example, r
= 0.895 versus 0.871 for the dataset in Figure ; Table ). More importantly, chromatin features involved in turning gene expression 'on' and 'off' may differ from those that control the level of expression. This is why we chose a two-step model - first classifying the 'on' and 'off' genes by the available features, then performing regression on the expressed genes only - so each predicted expression is based on the product of the output of these two models. Additionally, instead of using a fixed bin for different chromatin features, we used the 'bestbin' strategy to capture maximal effects from different chromatin features. We have compared the performance of the 'bestbin' strategy with that of several other bin-selection methods. Table shows that the 'bestbin' approach improves the performance by 2 to 13% compared to fixed-bin or no binning, and that 'bestbin' has the best performance overall. Moreover, most chromatin marks show very stable 'bestbin', such as H3K36me3, DNase, H3K27me3, H4K20me1, and H3K9me1 (Figure S9 in Additional file 2
). Finally, using an optimal pseudocount led to a consistent improvement in performance compared with using a small fixed pseudocount (Figure S10 in Additional file 2
), without changing the primary conclusions.
Performance of different modeling and bin selection strategies
Third, our model performs well in predicting gene expression using chromatin features. Using a linear regression model to correlate histone modifications at promoters and expression in human CD4+
T cells, Karlić et al
] calculated a correlation coefficient of r =
0.77 for microarray data, and 0.81 for RNA-Seq data. Cheng et al
] showed that a support vector machine regression model learned from modENCODE worm data has r =
0.73 in human K562 cells, and r =
0.74 in mouse embryonic stem cells. Our model expands upon these well-performing models, with a number of datasets having r
> 0.9, and 55 (out of 78) datasets having r
While our model shows high correlation between chromatin features and gene expression levels, it cannot be used to imply the causal effect of chromatin features on gene expression. Henikoff and Shilatifard [40
] recently discussed the 'cause or cog' role of histone modifications in gene transcription, and proposed that histone modification patterns are actually the result of a series of dynamic processes coupled with transcription, including transcription factor binding, RNA polymerase elongation, nucleosome remodeling, and targeting of non-coding RNAs.
It has been shown that chromatin features possess a certain level of redundancy and that certain chromatin features may work in a combinatorial fashion. One way to study the effect of combinatorial chromatin features is to introduce interaction terms in the linear regression model, which is computationally expensive for a model with more than ten terms and has been shown to provide little contribution in improving the expression prediction accuracy [11
]. Instead, we grouped chromatin features into different categories according to their known function in transcriptional regulation and performed regression on each category. This is less computationally expensive and the results are straightforward to understand. For example, grouping H3K4me2, H3K4me3, H2A.Z, and H3K27ac together allows us to determine how predictive promoter marks are for gene expression. However, the details of how these multiple chromatin features work together to reflect the gene expression levels need further exploration.
The model can be further improved in several ways. While the model can well predict gene expression using the current available set of chromatin features, we could retrain the model by incorporating newly discovered marks (such as histone lysine crotonylation [41
]) and therefore study the importance of new effectors in regulating gene expression levels. Although our model shows good results for genes with single transcripts (Figure S11 in Additional file 2
), multiple transcripts from the same gene may be subject to differential chromatin-based regulation. It is interesting and challenging to interpret chromatin-based regulation for multiple transcripts with shared TSSs. In this study, we chose the transcript with the highest expression level as the representative if a gene has multiple transcripts, which could hamper our ability in uncovering the effectors of repressed genes or transcripts (for example, a repressive mark such as H3K37me3). Also, if a gene has zero (or low) expression, we cannot tell whether it is unexpressed or suppressed. Unlike active marks (where a higher signal level indicates a higher expression level), repressive marks cannot lead to a negative expression level. These limitations could potentially underestimate the relative importance of repressive marks, which underscores a need for future work on refining the models for repressed genes. We have shown the general application of models across different cell types. As an extension of this analysis, further work could include building models to relate differential gene expression with differential histone modification profiles, and evaluate the relative contributions of these modifications to differential expression between cell types (for example, in differentiated versus H1-hESC cells). Due to the requirements of our binning method, we only included transcripts longer than 4,100 bp in this study. Also, current analysis only includes experiments for RNA molecules longer than 200 nucleotides. This leaves room for improvement in understanding how chromatin features help regulate other genes (especially long or short non-coding RNA genes). With regular improvements in gene annotation and expression quantification techniques, it is promising that we will understand the regulation of gene expression more accurately in the future.