3.1 Using in silico TF binding predictions to model gene expression
We first explore whether in silico TF binding predictions can replace ChIP-seq binding measurements. To do this, we build models using only TF binding features and compare models of gene expression based on TF ChIP-seq binding data with models based on in silico TF binding estimates (a). As expected, models using in vivo measurements of TF binding (ChIP-seq TFAS) explain substantially more variance (adjusted R2 = 0.644; a, bar 3) than models using either the non-tissue-specific in silico (PWM) or tissue-specific in silico (PWM + H3K4me3) affinity scores. We find that a model using tissue-specific in silico TF binding predictions (PWM scanning with a H3K4me3 prior) is much more predictive of expression than the non-tissue-specific in silico affinity score model (adjusted R2 = 0.521 and adjusted R2 = 0.279; a, bars 1 and 2, respectively), nearly doubling the explained variance.
Fig. 1. TF-only models of gene expression in mES cells. Each bar shows the proportion of gene expression variance explained by each TF-only model (adjusted R2). Error bars correspond to 95% confidence intervals estimated via bootstrapping. (a) The accuracy of (more ...)
We hypothesize that possibly one or more of the TFs are poorly described by their PWM, hampering the performance of our models that use in silico binding predictions. To explore this, we take the ChIP-seq TFAS model and replace a single TF with the equivalent tissue-specific in silico data. Replacing the single E2f1 tissue-specific in silico TFAS with ChIP-seq E2f1 data yields a model that is statistically indistinguishable from one using all 12 ChIP-seq datasets (, top two bars. R2 = 0.630 and R2 = 0.644, respectively, error bars overlap). Replacing any other tissue-specific in silico TFAS with a ChIP-seq TFAS has either minimal or no effect (b, compare bottom bar with others), with the exception of Sox2, where its inclusion surprisingly decreases R2 (see Section 4). Thus, with our in silico method and ChIP-seq data for one TF, we can achieve indistinguishable accuracy to using 12 ChIP-seq datasets.
Having observed that E2f1 is the sole TF, where replacing tissue-specific in silico E2f1 data with ChIP-seq TF data is sufficient to increase explained gene expression variance to be indistinguishable from using 12 TF ChIP-seq datasets, we also examine whether in an all ChIP-seq TFAS model, replacing the E2f1 ChIP-seq data with in silico data decreases the accuracy of the model. In our model using all ChIP-seq TFASs, we find that replacing any ChIP-seq TFAS except E2f1 with PWM + H3K4me3 data has no statistically significant effect on model accuracy (c, compare bottom bar with others). Replacing E2f1 ChIP-seq TFAS with the E2f1 PWM + H3K4me3 TFAS, however, causes a large decrease in R2, suggesting that the tissue-specific in silico binding predictions fail to completely capture the binding of E2f1. For any other TF in the model, replacing ChIP-seq data with tissue-specific in silico binding data causes no significant decrease in accuracy.
When we regress each ChIP-seq TFAS alone against expression (‘Results’ section in the Supplementary Material
; ), we find that E2f1 is the most individually informative TF. It has been suggested that E2f1 binds <20% of its binding sites directly in human ES cells (Bieda et al., 2006
). If this is the case, PWM-based methods would therefore be unable to predict the majority of E2f1 binding sites. Given the importance of E2f1 in this model, it is not surprising that models including in silico
E2f1 data are less predictive of gene expression.
Fig. 3. (a–d) Correlation of actual RNA-seq expression versus predicted expression in models of mES gene expression. Plots show measured (X) versus predicted expression (Y) of each gene. PWM is the non-tissue-specific in silico method, and PWM + H3K4me3 (more ...)
3.2 A combination of histone modification and DNase I hypersensitivity data is as predictive of expression as TF ChIP-seq data
Next, we examine whether integrating histone modification data and chromatin accessibility information (DNase) will improve the predictive accuracy of both tissue-specific in silico TFAS and ChIP-seq TFAS models of gene expression. We perform log-linear regression using our 12 TF affinity scores, both alone and combined with histone modification and chromatin accessibility data. In , we compare the descriptive power (expressed as adjusted R2) of four models: ChIP-seq TFAS only, tissue-specific in silico TFAS with histone modification and DNase data, ChIP-seq TFAS with histone modification and DNase data, and histone modification and DNase data alone (no TF binding data).
Fig. 2. Proportion of gene expression variance explained by each model (adjusted R2). Each bar corresponds to the proportion of the variance in RNA-seq gene expression levels explained by each model. Error bars correspond to 95% confidence intervals estimated (more ...)
We find that models that use no
TF binding data at all are surprisingly accurate. A model using just seven histone scores and the DNase score (bar 2, ) is as descriptive of gene expression as the model using the ChIP-seq TF binding data for all 12 TFs (bar 1, ). There is no statistically significant difference in the accuracy of these two models (R2
= 0.624 cf. R2
= 0.644, 95% confidence intervals overlap). While it is well known that histone modifications play key roles in mediating gene expression (Karlić et al., 2010
), it is still surprising that a model based on only these seven histone modifications and chromatin accessibility, and not on TF binding data, predicts expression almost as well as a model using the 12 TFs known to be key regulators of gene expression in pluripotent mES cells (Chen et al., 2008
Next, we determine whether combining histone modification and chromatin accessibility data with our tissue-specific in silico TF binding predictions improves predictive accuracy. We find that including incorporating histone modification and chromatin accessibility data with our tissue-specific in silico (PWM + H3K4me3) method to be as effective as using ChIP-seq TFAS data (R2 = 0.649 and R2 = 0.644, , bars 3 and 1, respectively, error bars overlap). We also find that our PWM + H3K4me3 method with histone and DNase scores to be more effective than only using histone modification and DNase data (R2 = 0.649 cf. R2 = 0.624, , bars 3 and 2, respectively, error bars do not overlap).
Finally, combining histone modification and chromatin accessibility data with ChIP-seq TFAS (right-most bar, ) provides a significant improvement to the regression fit (R2 = 0.695) over ChIP-seq TFAS alone. Both the histone and DNase score only model, and the ChIP-seq TFAS only model, are quite descriptive of gene expression. The improvement gained by combining histone, DNase and TF binding data suggests that there is additional information available in the histone and DNase data that is not present in the TF ChIP-seq data.
3.3 Predicted expression correlates well with measured expression
It is also informative to plot the predicted expression against the actual RNA-seq expression data for our models (a–d) and to calculate the Pearson correlation coefficients (r
). For several of our methods (tissue-specific in silico
TF predictions with histone and DNase data; ChIP-seq TF data only; ChIP-seq TF data with histone modification and DNase data; histone modification and DNase data alone), we actually achieve higher correlation between our predicted expression values and actual (RNA-seq) expression values than that observed between RNA-seq and microarray expression experiments in the same tissue (r
= 0.734; see the Supplementary Material
to Ouyang et al.
The predicted and actual expression values are fairly well correlated in the model that is based on ChIP-seq TF, histone and DNase scores (c, r = 0.834) and in the model t based on tissue-specific in silico TF, histone and DNase scores (d, r = 0.806). The correlation in the non-tissue-specific in silico model (PWM, a). We find only a relatively small difference in accuracy between these models, both having predicted expression values that correlate well with actual expression.
3.4 Principal components provide information on TF and histone roles
We then perform PCA using singular value decomposition of TF, histone modification and chromatin accessibility (DNase) data, decomposing a joint matrix (with TF, histone modification and DNase data as columns) into PCs (see Section 2.4). Using these components rather than TFASs, and histone and DNase scores directly, we perform log-linear regression. PCA allows the extraction of a small number of non-correlated PCs responsible for the majority of gene variance. We examine the role of each TF, histone modification and chromatin accessibility in the regression. We plot the relative loading of each feature within the most informative PC (PC2), weighted by the proportion of variance explained (adjusted R2) by that PC ().
Fig. 4. The relative loading of each feature in the second principal component for two models. (a) The relatively loading of each feature of PC2 for a model containing ChIP-seq TFAS with histone and DNase scores. (b) The relative loading of each feature of PC2 (more ...)
We find that the tissue-specific in silico
TF PCA model is very similar to the ChIP-seq TF PCA model at providing biological information on the regulatory roles of the features, especially in the most important PCs. Visually, a high degree of similarity between the PCs is present, with only minor differences (comparing a and b). Both models show all TFs as positively associated with transcription, with E2F1, Klf4, Zfx, c-Myc and n-Myc as the most weighted TFs. Likewise, the weighted loading on each histone modification is qualitatively similar between models. shows the second PC, which alone is the most informative PC in the regression. When a model with PC2 as the only feature is fit to expression, it explains almost half of gene expression variance (R2
= 0.475). The remaining PCs and a short discussion are provided in ‘Results’ section in the Supplementary Material
shows the relative loading of each component within PC2 (which accounts for ≈48% of explained variance) successfully recapitulating known knowledge of histone modification effects. Both models suggest that H4K20me3, H3K27me3 and H3K9me3 repress gene transcription. Both H3K9me3 and H4K20me3 are known to be involved in gene silencing, associating with heterochromatin (Schotta et al., 2004
). H3K27me3 is associated with facilitating binding of the Polycomb repressor complex 1 (Cao et al., 2002
). In PC2, both the loadings of H3K4me2 and H4K4me3 suggest activating effects, which are well reported throughout the literature (e.g. Barski et al., 2009
; van Ingen et al., 2008
). A difference between the ChIP-seq and tissue-specific in silico
(PWM + H3K4me3) PCA models is noted for H3K36me3, which is known to maintain genes in a ‘poised’ state for either activation or repression. The similarity between the predicted roles by our tissue-specific in silico
method and the roles predicted by the ChIP-seq data strongly suggests that our in silico
method is capable of making biologically sound predictions of TF and histone roles.
3.5 Confirmation in GM12878 cells
To insure that our results were not specific to mES cells, we repeated our experiments using equivalent data from GM12878 cells. The pattern of the results are extremely similar to those using mES cell data, with models based on TF ChIP-seq, histone ChIP-seq and DNase I hypersensitivity explaining more gene expression variance than models based on TF ChIP-seq alone (, bars 3 and 6). Compared with the mES cell models, the relative improvement in accuracy of the ChIP-seq-based gene expression model when chromatin modification and accessibility data are added is much larger (compare and ). This is consistent with the fact that the model based on histone and DNase I data alone (without any TF-specific data) is much more accurate than model based on TF ChIP-seq data. This may be due to the set of TFs we use with the mES cell data being more complete (with respect to the complete set of key regulators) than the set we use in the GM12878 experiments. We note as well that the model using PWMs plus histone and DNase data is far more predictive of gene expression than the model based solely on TF ChIP-seq data (, bars 3 and 5).
Fig. 5. Proportion of GM12878 gene expression variance explained by each model (adjusted R2). Each bar corresponds to the proportion of the variance in RNA-seq gene expression levels explained by each model. Error bars correspond to 95% confidence intervals estimated (more ...)
We note that compared with our previous results in mES cells, each model explains comparatively less expression variance in GM12878 cells (compare a and with ). For example, the best performing model in both tissues—ChIP-seq binding measurements combined with histone modification and chromatin accessibility data—explains ≈30% less of the total gene expression variance in GM12878 cells than in mES cells (R2 = 0.412 and R2 = 0.695, respectively). As noted earlier, this may be partially due to the TFs used in the model. Additionally, one of the most informative histone modifications in mES cells, H4K20me3 () was unavailable for GM12878 cells.
We also measure correlation between actual and predicted expression in GM12878 cells for each regression model and perform the singular value decomposition regressions in GM12878 cells and these results are included in the Supplementary Material