Chromatin features show distinct signal patterns around genic regions
To systematically study the genome-wide properties of various chromatin features, we collected more than 50 ChIP-chip and ChIP-seq profiles of histone modifications and DNA binding factors in
C. elegans from the modENCODE project (see Materials and methods). We divided the DNA regions around (± 4 kb) the transcription start site (TSS) and transcription termination site (TTS) of each transcript into small 100-bp bins and calculated the average signal of the chromatin features in each bin. As a result, each bin was assigned a matrix whose elements are the average signals of different features in different transcripts (Figure ). Figure shows the rich spatial pattern of 16 features in the early embryonic (EEMB) stage, where the signals are averaged over all transcripts. We first observed that the upstream and downstream regions of TSSs and TTSs are clearly distinct. Most chromatin features have higher signals in the transcribed regions (downstream of TSSs and upstream of TTSs). Interestingly, we found that RNA polymerase II (Pol II) has the strongest binding signal in regions right after the TTS, rather than within the transcribed region (Figure ). The enriched binding signals right after the TTS may indicate the importance of anti-sense transcription as a regulatory mechanism for gene expression [
14,
33]. Strong Pol II signal was also observed at regions before the TSS in some other developmental stages (Figure S1 in Additional file
1), which was also reported previously in
C. elegans by [
34], and was thought to be related to the accumulation of TSS-associated RNAs in mouse and human [
35,
36]. The signal pattern of histone H3 suggests that nucleosomes have lower occupation density in regions around the TSS and TTS than within the transcribed regions. H3K4me2 and H3K4me3 are enriched upstream of the TSS, consistent with their reported role as histone marks for active promoters [
14]. On the other hand, signals for H3K9me2 and H3K9me3 are depleted around TSS compared to neighboring regions, which may reflect the low density of nucleosomes around the TSS of genes [
28].
Chromatin features exhibit distinct spatial correlation patterns with gene expression levels
The different chromatin features display distinct spatial patterns. It is thus worthwhile to explore the relationship between these patterns and the level of gene expression. Making use of RNA-seq data obtained from the different stages of C. elegans, we quantified the expression level of each gene. For each bin, we then calculated the correlation between the gene expression levels and the average signals of each chromatin feature of the bin. Figure shows the spatial variation of these correlation coefficients around TSSs and TTSs. According to the correlation patterns, there are two main types of chromatin features: ones that are positively correlated with gene expression (such as H3K79me1, H3K79me2 and H3K79me3); and ones that are negatively correlated with gene expression (such as H3K9me2 and H3K9me3). While some features show largely uniform correlations across the 16-kb regions, some others are more variable across the regions. For example, H3K79me2 has a high correlation coefficient (0.65) near the TSS, but rather a low correlation (0.10) downstream of the TTS. It is interesting to observe that the negative features tend to have more uniform spatial patterns while the positive features tend to show greater variation. In addition, for chromatin features such as H3K79me2, although the average signal intensity decreases with distance downstream from the TSS, the correlation between the feature signal and the expression level remains high. This pattern suggests that, while some chromatin features have the strongest average signals only at some highly specific regions, the differences of their signals between genes with low and high expression levels remain strong over much broader regions.
We chose the long window size of 4 kb in order to inspect how fast the signals of the chromatin features fade out as we move away from the TSS and TTS. Indeed, the correlations of some chromatin features (for example, H3K9me3) remain strong a few kilobases away from the TSS and TTS, and the fading could only be observed at the 4-kb boundaries. To make sure that our conclusions are not affected by short genes with some bins having both the identities of being within 4 kb downstream of the TSS and within 4 kb upstream of the TTS, we also did the correlation analysis only on transcripts longer than 8 kb, and found that the correlation patterns are the same (Figure S2 in Additional file
2). Also, as the
C. elegans genome is quite compact, the region 4 kb upstream of a TSS or downstream of a TTS could be overlapping with another gene. We thus repeated the analysis using transcripts that are at least 4 kb away from any other known transcripts, and again obtained similar correlation patterns (Figure S3 in Additional file
3). Furthermore, analysis based on bins within intergenic regions again resulted in a similar correlation pattern. Therefore, the high correlation of gene expression with feature signal at distant locations does reflect the long-range effects of their regulation, instead of an artifact caused by chromatin structure of the nearby genes.
Furthermore, to assess whether the trends we observed are universal to all developmental stages rather than specific to the EEMB stage, we repeated the analysis in other stages, including late embryo, larval stages and young adult. Although the exact values of correlation coefficients vary across stages, the spatial patterns are consistent in all stages (Figure S4 in Additional file
4). In addition, a large number of genes are associated with multiple transcripts corresponding to different alternative splicing isoforms. In many cases, the overlap between these transcripts is substantial, which might affect the correlation patterns between chromatin features and expression. We thus repeated the correlation analysis using only genes with a single transcript, and obtained the same qualitative results (Figure S5 in Additional file
5).
Among the chromatin features shown in Figure , MES-4 and MRG-1 are factors associated with X-chromosome inactivation [
37,
38]. These factors are supposed to have different binding patterns in the X chromosome than in autosomes. We therefore analyzed their correlation patterns in X genes and autosomal genes separately. As expected, we found that MES-4 and MRG-4 associate predominantly with autosomal DNAs, while the dosage compensation complex (DCC) subunits bind specifically with X-chromosomal DNAs (data not shown), which is in line with previous reports [
19]. Consistent with this finding, MES-4 and MRG-4 show stronger positive correlation with autosomal gene expression.
Unsupervised clustering reveals general activating and repressing chromatin features for individual genes
As some chromatin features are positively correlated with gene expression levels and some are negatively correlated, the two groups potentially represent general active and repressive marks of gene expression. Yet since these correlations capture only the average behavior across all genes, it is still not clear if these features are strong indicators of the expression levels of individual genes.
In order to examine the relationship between chromatin features and the expression levels of all individual genes, we performed a two-way hierarchical clustering of both the chromatin features and the annotated genes, according to the feature signals at the TSS bins (bin 1). As shown in Figure , genes can be divided into two clusters (labeled as H and L, respectively) based on the signals of the 16 features. We found that the two clusters roughly correspond to genes with high expression levels (H) and genes with low expression levels (L), respectively (Figure ). These two clusters are characterized by complementary patterns of chromatin features. Cluster H is characterized by high signals of 11 features (the right component of the upper dendrogram), and low signals for the other 5 features. We note in particular that highly expressed genes tend to have a strong H3K36me3 signal, which is consistent with the role of H3K36me3 as a chromatin mark that activates transcription of associated genes. Similarly, the well-known repressive mark H3K9me3 shows a low signal. Compared to cluster H, genes in cluster L show the opposite pattern of chromatin signals.
To explore which regions around the TSS and TTS provide the greatest power in determining gene expression levels, we repeated the two-way clustering procedure for each of the 160 bins around TSSs and TTSs. Figure shows the resulting t-statistics. We observe that the signals slightly downstream of TSSs are the most informative. In general, the t-statistics decrease as the distance from the TSS or TTS increases. The decay is steeper at the region downstream of TTSs.
The above integrative analysis involves all chromatin features. To examine how each feature individually affects gene expression, for each feature we performed hierarchical clustering of the genes based on the collective signals of the feature at all 160 bins. An example is shown in Figure , in which signals of the single feature H3K79me2 at the different bins were used to cluster the genes. As in the case when all chromatin features were used, the signals from single chromatin features can divide genes into two clusters (that are not exactly the same as, but similar to, the ones obtained from all features) with a significant difference in expression level (Figure ). Again we quantified the power of each feature in distinguishing genes with high and low expression levels using t-statistics. As shown in Figure , apart from a few exceptions (black bars), most features are informative. The most informative features are H3K79me2, H3K79me3 and H3K4me2. The informative features can be further grouped into two classes. Activating features are those that are positively correlated with gene expression (cyan) and repressive features are those that are negatively correlated (blue).
Chromatin features can statistically predict gene expression levels with high accuracy using supervised integrative models
The above analyses suggest that gene expression levels can be at least partially deduced from chromatin features. To examine how much of gene expression is determined by chromatin features, we tried to predict gene expression levels using the features. We started with the simplified task of distinguishing highly expressed and lowly expressed transcripts, where the two classes of transcripts were constructed by discretizing gene expression levels (see Materials and methods). We divided all the transcripts into training and testing sets, and learned a support vector machine (SVM) model from the signals of all 13 chromatin features of the training transcripts at a certain bin (Figure ). The model was then used to predict to which class each transcript in the testing set belongs. We repeated the procedure for all 160 bins, and 100 different random splitting of the transcripts into training and testing sets for each bin (see Materials and methods). We represented the overall performance of the model using the receiver operating characteristic (ROC) curve and further quantified the accuracy using the area under the curve (AUC). Figure shows the ROCs corresponding to the prediction performance of five different bins. Compared to random ordering, which would give a diagonal ROC curve on average with an expected AUC of 0.5, we observed that all five curves are much better than random but with diverse performance, which indicates that all the bins are useful to classify gene expression but they are not equally informative. This result is consistent with what we have observed using the unsupervised method described above (Figure ). Instead of using SVM, we also learned support vector regression (SVR) models using similar procedures (see Materials and methods) to predict expression values directly. Figure shows that there is a high positive correlation (0.75) between the predicted levels from an SVR model and the actual expression levels measured by RNA-seq. This analysis suggests that chromatin features explain at least 50% of gene expression variation (see Materials and methods).
We then compared the prediction accuracy of all 160 SVM models learned from the different bins. As shown in Figure , the models learned from regions around the TSS (-300 to 500 bp) and upstream of the TTS (-200 bp to 0 bp) have highest accuracy, with AUC values greater than 0.9. Prediction accuracy decreases gradually as we move away from these regions, which confirms the spatial effects that we observed from the unsupervised analysis (Figure ).
We have also tested more comprehensive models that combine the chromatin features in 40 bins around the TSS (-2 kb to 2 kb). These comprehensive models achieve slightly higher prediction accuracy than those based on single bins, yet the enhancement is not dramatic, with an average AUC of 0.94 for the classification model (SVM) and an average correlation coefficient of 0.75 for the regression model (SVR) (Figure in Additional file
6).
We then learned SVM models using only features of individual types. As shown in Figure , the AUC obtained by using all features (black) is comparable to the AUCs obtained from models using only particular subsets of features. Strikingly, the model involving only the 9 histone modification features is almost as accurate as the model involving all 16 features. We further divided the histone modification features into four subsets: modifications on K4, K9, K36 and K79, respectively. While the integrated model with all histone modifications achieves an AUC value of 0.9, using just one of the subsets can yield an AUC higher than 0.8 (Figure ). In particular, the set H3K79 is found to be most predictive, which again confirms our previous finding of the importance of these histone modifications in regulating gene expression (Figure ).
The results of the supervised analysis suggest that chromatin features are not only correlated with expression but are also predictive of the expression levels of individual genes with good accuracy and could explain a large portion of the expression differences between different genes. We note that histone modifications may have other regions of enrichment that are informative about gene expression: for instance, the percentage of gene length with strong histone modification signals. We therefore examined the power of using these features for predicting gene expression levels. Specifically, we calculated the percentage of transcribed regions with strong signals (>10%) for all genes. Using them as predictors, we obtained high prediction accuracy (AUC = 0.90). However, a combination of these percentage features with the original chromatin features does not lead to obvious improvement in prediction accuracy, indicating that they are redundant.
Combination of chromatin features contribute to gene expression prediction
Both the unsupervised and supervised analyses above suggest that chromatin features possess a certain level of redundancy. In the unsupervised clustering (Figure ), different chromatin features show similar signal patterns around the TSS regions of genes. In the supervised predictions (Figure ), high accuracy was achieved by multiple features as well as feature subsets. Though the SVR model offers good prediction power, it may be instructive to build a simpler linear regression model to explore to what extent the chromatin features are redundant, and to what extent they are interacting in a combinatorial fashion. Specifically, for each bin, we modeled the expression level y as a linear combination of the effects of individual histone modification features xi and their products xixj:
We found that among the 66 (12 × 11/2) possible interactions between the 12 distinct histone modification features, many interactions are statistically significant. For example, for bin 1, we detected 12 significant interactions (
P < 0.001, linear regression) between the histone modifications (Table S7 in Additional file
7).
To quantify the importance of these interactions in determining gene expression levels, we compared the above regression model with a singleton model that does not contain the interaction terms:
By evaluating the prediction power of the two models using a cross-validation method, we found that with respect to the singleton model the interaction model improves prediction accuracy by 4%. Thus, the contribution of interactions among chromatin features to gene expression prediction is not substantial.
We further examined each pair of modifications individually to see if there is any redundancy between any of the modifications. Using simplified models each involving only two modification features, we found that no two histone modifications are completely redundant (Table S8 in Additional file
8). These results were confirmed by a similar analysis based on mutual information (Figure S9 in Additional file
9). Two examples are shown in Figure . In each example, we considered a specific pair of histone modification features, and divided all genes into four categories based on the signals of the two features at their TSS bins. In the first example (Figure ), expression levels are the lowest when both H3K4me3 and H3K36me3 are low but moderate if either one of them is high. This suggests that both features are activators. When both features have high signals, an even higher expression level is observed, showing that the two are not totally redundant. In the second example (Figure ), H3K9me3 is found to repress gene expression in general, while H3K79me3 is found to activate gene expression. As expected, a combination of high H3K9me3 signal and low H3K79me3 signal results in a lower expression level than when both signals are low. When the signals of both features are high, we observe a significant difference in gene expression compared to the other three cases, indicating that the features contribute to gene expression regulation in a collective manner.
Our analyses of the interactions between the above chromatin features only considered binary interactions between two features. For higher-order relationships involving more features, it is infeasible to perform the same type of analyses, as the number of feature combinations would become intractable. Also, the above analyses only suggest which features interact with each other, but do not explain how the features interact. In particular, the complex correlations between features and gene expression make it difficult to extract directional relationships between them (Figure S10 in Additional file
10). We therefore used Bayesian networks to study the higher order relationships between the chromatin features and gene expression (see Additional file
11 for details).
The chromatin model is developmental stage-specific
We have previously constructed an integrative model using chromatin features at the EEMB stage of C. elegans development and used it to predict gene expression levels at the same stage. How well can we predict gene expression levels at other developmental stages using the chromatin feature data from EEMB? To answer this question, we applied the model to predict gene expression at EEMB, L1 (larva stage 1), L2, L3, L4, and adult. Specifically, the chromatin feature data from EEMB were combined with expression data from a stage to train a SVM model, which was then used to predict gene expression levels of other genes at that stage. As shown in Figure , the chromatin model based on EEMB data is able to predict the expression at other developmental stages with reasonable accuracy (AUC = 0.8). However, the predictions of gene expression levels in all these stages have lower accuracy than the predictions for EEMB itself. This result suggests that signals from chromatin features are developmental stage-specific and regulate biological processes in a dynamic manner depending on the particular stage. The stage specificity is more apparent when we apply the model to genes that are differentially expressed between stages. For example, we have identified 4,042 genes that differ in expression levels by at least four-fold between EEMB and L3 stages. Using the EEMB stage chromatin model to predict the expression level of these genes, the prediction accuracy further decreases (AUC = 0.70).
Chromatin features show different correlation patterns with different genes in an operon
In
C. elegans some neighboring genes are organized into operons. The genes in an operon are co-transcribed as a polycistronic pre-messenger RNA and processed into monocistronic mRNAs [
39,
40]. Here we investigate the differential signals of chromatin features among genes in operons and how this organization affects their expression levels. We collected the first, second and last genes in 881
C. elegans operons and calculated the signals of chromatin features in each of the 160 bins around their annotated TSS and TTS. We observed strong correlations between expression levels and chromatin feature signals for the first genes (Figure ). In comparison, the correlation patterns for the second and last genes of the operons are not as apparent (Figure S12 in Additional file
12). The weaker correlations could be caused by the lack of signals for some histone modification types. As we observed, the mark for active promoters, H3K4me3, demonstrates strong signals around the TSS of the first genes, which is the shared promoter of genes in the same operon. In the upstream region of the internal genes, the H3K4me3 signal is often relatively weak. Alternatively, the weak correlation for internal genes may also be explained by the intensive post-transcriptional regulation of these genes, which can not be captured by our chromatin feature based model [
41]. In fact there is only weak correlation (Pearson correlation coefficient (PCC) = 0.10) between the expression levels of the first and the second genes. Moreover, on average the first genes are two-fold and three-fold more highly expressed than the second genes and the last genes, respectively. Taken together, although genes in the operons are co-transcribed, they are regulated post-transcriptionally to achieve distinct expression levels [
41].
Chromatin models learned from protein-coding genes are able to predict microRNA expression levels with high accuracy
Do chromatin features influence transcription of microRNAs in the same way as they do with protein-coding genes? As a way to study the similarity of the two mechanisms, we investigated the effectiveness of the chromatin model learned from protein-coding genes in predicting microRNA expression. Since precise TSSs are not available for most worm microRNAs, we calculated the signals of chromatin features in the genomic regions corresponding to pre-microRNAs, and used them as the input features for our chromatin model.
We predicted the expression levels of 162 worm microRNAs with genomic locations obtained from miRBASE [
42]. We then compared our predictions with the experimental measurements performed by Kato
et al. [
43]. As shown in Figure , our predictions are in good agreement with the experimental results in the EEMB stage (see also the prediction results for the L3 stage in Figure S13 in Additional file
13). Some microRNAs locate within or near gene loci, which may confound the prediction of microRNA expression. To address this issue, we also checked the prediction accuracy using only microRNAs that are away from any known gene, and obtained similar prediction accuracy (PCC = 0.62).
It is interesting to see that the expression of microRNAs can be accurately predicted using a chromatin model trained by data for protein-coding genes. Consistent with previous reports on microRNA transcriptional regulation [
44,
45], this result suggests that microRNAs and protein-coding genes share a similar mechanism of transcriptional regulation by chromatin modifications.
As with the prediction of expression levels of protein-coding genes, the prediction accuracy of microRNA expression also shows developmental stage specificity. When the signals of the chromatin features from the EEMB stage were used, the resulting model achieved the best accuracy when predicting microRNA expression at the same stage (PCC = 0.60), whereas for stages L1, L2, L3, L4 and adult, the accuracy is much lower (PCC < 0.50) (Figure S14 in Additional file
14). Similarly, when chromatin features at L3 were used to train the model, the model achieved better prediction results in L3 than in other stages.
Application to other organisms
The models described above provide a useful tool to integrate gene expression and chromatin data. Currently, the C. elegans dataset is the best one to demonstrate the utility of the method and we have focused on it here. However, we know that further integrated genomic datasets (comprising matched genome-wide histone features and expression measurements) are coming in many other organisms. Thus, to illustrate the broad utility of our method, we demonstrate here how readily it can be applied in other contexts. Specifically, we have packaged our methods as a tool and applied it to data sets from four other organisms: yeast, fruit fly, mouse and human. The results indicate that chromatin features, in particular histone modifications, are highly correlated to gene expression levels in all these organisms (Figure ). More importantly, the relative statistical contribution of each histone modification type to expression is similar in tested organisms (and also in different tissues, cell-lines, and developmental stages). For example, H3K4me3 signals around the TSS of genes show high predictive capability in all the analyses we have performed. We also found that the models based on expression levels measured by RNA-seq achieved higher prediction accuracy than those by microarrays, consistent with the higher measurement accuracy of RNA-seq compared to microarrays. Our method can, of course, be applied to multiple data sets in each species (for example, different developmental stages in fruit fly). Figure shows only a single illustrative example for each species. We only show initial statistical analysis here, further biological interpretation would, of course, be the subject of future studies.