Differential histone modifications between ChIP-chip verified and non-verified TFBSs
In order to examine whether chromatin modifications are predictive features for functional TFBSs, we first investigated chromatin modification signals at ChIP-chip verified and non-verified TFBSs defined as follows. Based on previous TFBS prediction models, we denoted the TFBSs of a factor as the local genomic sequences that match its PSSM. We then used ChIP-chip experimental data to distinguish ChIP-chip verified and non-verified TFBSs based on the existence of actual binding peak signals. Although both verified and non-verified TFBSs contain TF binding PSSMs, they were found to have differential chromatin modification signals. Here, we use the factor SWI4, a component of the SBF complex regulating cell cycle gene expressions, as an example (Figure ). We observed that individual histone modifications varied significantly between ChIP-chip verified and non-verified TFBSs. Among the 14 different histone modifications under 2 conditions (YPD and H2
), 11 were significantly different (P
-value < 0.01) in their signals between the verified and non-verified TFBSs of SWI4 (Figure ). Among them, H3 and H4 signatures were particularly strong features for distinguishing between the two types of TFBSs, as they showed significantly lower signal in verified TFBSs than in non-verified TFBSs (P
-value < 10-20
). Consistent with previous studies, this indicates that verified binding sites of factors in regulatory regions are typically depleted of nucleosomes [27
]. Encouraged by the observed differential histone modifications between verified and non-verified TFBSs, we then constructed a model that combines histone features with binding motif information for target gene prediction in yeast.
Figure 1 Differential histone occupation and modifications between ChIP-chip verified TFBSs and non-verified motif matching sites. Showing SWI4 as an example, most histone modifications (in different colors) are significantly different between ChIP-chip verified (more ...)
Improving target gene prediction by combining histone modifications and PSSMs
Since the S. cerevisiae
genome is quite compact with respect to other higher eukaryotic species, it is reasonable to define the target genes of a TF as those with one or more upstream TFBSs. We combined chromatin modifications and PSSM data and used them as input features to a SVM learning model for predicting TF target genes. The model was compared with two other models that are based only on histone modifications or PSSMs. The prediction accuracy of the models was tested using a gold standard data set from ChIP-chip experiments, which provided target genes of 203 yeast TFs [5
]. Specifically, we choose 0.01 as the P
-value cutoff for target gene calling from ChIP-chip, which provides us with enough high confidence positive target genes for model training (Figure ). The data set was separated into training and testing data, and the performance of the models was assessed by cross-validation (see Materials and methods).
Figure 2 Chromatin modifications substantially improve TF target gene predictions. (a) ROC curves show improved TF target gene predictions using histone modifications. (b, c) Performance of prediction models for individual TFs, with PSSMs from Beer and Tavazoie (more ...)
For chromatin modifications, we used 11 histone modifications that covered most yeast ORF regions from ChIP-chip experiments [23
]. Since TFs bind to the upstream sequence of ORFs, we focused on histone modification signals in the 1 kb of sequence flanking transcription start sites (ATGs), because TFBSs were enriched in these regions. For TF PSSMs, two independent sets were obtained from previous studies. One set comprised PSSMs discovered using a sequence analysis-based method, basically looking for enriched motifs in the DNA regions upstream of all yeast ORFs [20
]. From approximately 5, 600 upstream sequences a total of 666 motifs have been discovered, among which 48 could be associated with known TFs according to the literature. The other set of PSSMs was based on ChIP-chip data [5
]. For each TF a target gene set was determined and then the binding motif for the TF was identified from the DNA region upstream of these genes.
Our results indicate that, for almost all TFs, the model using both histone modification and PSSM data achieves the best or nearly best performance (measured using the area under the receiver operator characteristic (ROC) curve (AUC)) compared to the other two models using histone modification data or PSSM data alone (Figure ; Additional files 1
). For example, we obtained an AUC of 0.89 for predicting target genes of the factor SUM1 when both the histone modification and PSSM data were used. If only the PSSM or histone modification information was used, however, the models resulted in lower AUCs (0.77 for PSSM only and 0.85 for histone modification only; Figure ; Additional file 1
). The improved performance of the combined model was observed for both of the TF PSSM sets (Figure ; Additional files 1
), indicating that the improvement does not rely on a particular source or quality of PSSMs. Interestingly, for some TFs, the histone-only model performed better than the PSSM-only model, while for some other TFs, the opposite was observed.
In order to examine whether we could achieve better TF target prediction when we have more chromatin modification data, we used histone modification data sets from two independent experiments, performed by Pokholok et al
] and Kurdistani et al
. respectively [22
]. We found that we achieved a higher prediction accuracy by using the data set from Pokholok et al
. than using that from Kurdistani et al
. This is probably due to the fact that the latter contains only histone acetylation data, while the former contains both histone methylation and acetylation data, which might provide complementary information on the regulation of chromatin structure and TF binding. We then combined the two data sets for predicting TF target genes, and found that, for most TFs, the performance was only slightly better than using the Pokholok et al
. data set alone (Figure ). Nevertheless, for some TFs we observed a substantial improvement when including the Kurdistani et al
. data set (Table ). It is thus promising that we might improve the performance of our chromatin model in the future by incorporating more histone modification data.
Figure 3 Model parameters. (a) Stricter thresholds for target gene calling result in better predictions. (b) Combination of independent histone modification data sets can improve target predictions. Predictions for 203 TFs are evaluated by area under the receiver (more ...)
Transcription factors with improved prediction by including multiple histone modification datasets
We also investigated the positional effect of histone modification signals for target gene prediction. First, we observed that signals of different types of histone modifications showed different patterns at DNA regions around the ATG, suggesting that they might affect TF binding in different ways. Second, histone modification signals upstream of the ATG are generally more predictive than those downstream of it, as we have observed for both 500-bp and 1, 000-bp flanking region sizes (Figure ). This is somewhat expected because TFBSs are more enriched in the upstream regions of ORFs for the purpose of transcriptional regulation. It is also interesting to see that ATG flanking regions of 500 bp and 1, 000 bp result in almost the same performance (Figure ). Given the compact nature of the yeast genome, transcription start sites for most ORFs are located within the region 1, 000 bp upstream of the ATG [31
]. The comparable performance when using 500-bp ATG flanking region indicates that most discriminative histone modification signals for TF binding are embedded in this region. We also find that histone modifications in intergenic regions are more predictive than those in ORF coding regions (Figure ).
The number of PSSM-containing genes is another factor that might also affect prediction using histone modifications. We counted the number of genes with PSSMs in their promoters for each TF, and found no obvious relationship between this and the prediction power (AUC) when using histone modifications (Additional files 3
). In fact, the number of PSSM-containing genes varies extensively from 79 to 5, 810 depending on the information content of the PSSMs, but its correlation with the prediction power is negligible (R = 0.1 for the Frankel et al
. data set [32
Both histone modifications and TF binding are involved in regulation of gene expression. To rule out the possibility that the capability of histone modifications for predicting TF target genes is actually mediated by the expression level of genes, we examined the marginal effect of gene expression level on TF target prediction. Including expression levels as an additional predictor in the histone modification-based models does not significantly change the prediction accuracy for any TF. Furthermore, SVM models based on expression level alone can hardly predict targets for any of the TFs (Additional file 5
). The only exception is FHL1 (prediction accuracy AUC = 0.79), which predominantly regulates the transcription of ribosomal protein genes; it is the extreme abundance of these genes that enables accurate prediction for this TF. Thus, the prediction power of histone modification models is not likely to be mediated by gene expression levels overall.
Condition specificity of the chromatin model
TFs bind to and regulate target gene expression in a complex and dynamic manner to coordinate biological processes [5
]. Chromatin modifications also change rapidly in response to stimuli from the extracellular environment [3
]. Therefore, chromatin modifications in one condition should match TF target binding in that condition but not other conditions.
We investigated the condition specificity of our chromatin model in two conditions, YPD medium and H2O2. We tested a total of 12 TFs, for which we had the PSSM, histone modification and TF target binding data under both conditions. For each of the TFs, we constructed two separate chromatin models: one model (model A) used PSSM and histone modifications under the YPD condition as features, while the other (model B) used PSSMs from the YPD condition but histone modifications under the H2O2 condition. The two models were then used to predict TF target binding under H2O2 conditions. It is generally believed that TFs keep their binding specificity PSSMs in different conditions and even over large evolutionary distances. Therefore, we used PSSMs from the YPD condition as a close approximation in model B, where no PSSM information is available under the H2O2 condition.
For TFs that are known to be functional under the H2O2 condition, model B performed better than model A. For example, HSF1, a heat shock TF that activates genes in response to stresses, is more active under the H2O2 condition, with 326 target genes, than in YPD medium, with only 123 target genes. Using condition-matched histone modification data (model B), our chromatin model achieved an AUC of 0.77. In contrast, using non-condition-matched data (model A), the chromatin model only achieved an AUC of 0.56 (Figure ). Similar results were observed for another TF, MSN2, which is activated along with MSN4 to regulate stress response genes. MSN2 is more active under the H2O2 condition, and our chromatin model performed better with condition-matched data (Figure ). These results indicate that histone modifications are actually dynamic and function in a condition-specific manner. Thus, target genes of TFs under a certain condition can be best predicted by using histone modification data from the same condition. In practice, this enables us to predict condition-specific target genes of TFs, which cannot be achieved using the PSSM-based method.
Figure 4 Condition specificity of the chromatin model for TF target prediction. (a, b) ROC curves showing the performance of two chromatin models in predicting target genes of HSF1 (a) and the MSN2/4 complex (b). The model (model B) using histone modifications (more ...)
Relative importance of different histone modifications for target prediction
TFs bind to the upstream regions of target genes by recognizing their specific binding motif PSSMs. We then asked an analogous question: do the binding sites of TFs have specific histone modification profiles? To address this question, we calculated a histone modification profile for each TF by averaging upstream ATG histone modification signals over all its target genes. In our analysis, we included 25 different histone modifications from the two studies mentioned above [22
We found that different TFs have distinct target histone modification profiles. A histone modification high in one TF's profile could be low in another TF's profile (Figure ). We performed unsupervised clustering analysis for the histone modification profiles of all TFs, and detected two TF clusters (Figure ). One of the two clusters showed generally larger variations (more high and low signals) among histone modifications in the upstream region of their target genes, while the signals in the other cluster are more often around the mean (P
-test). We thus refer to the 68 TFs in the former cluster as 'histone modification-sensitive' (histone-sensitive) TFs, and the 135 TFs in the latter cluster as 'histone modification insensitive' (histone-insensitive) TFs (Additional file 6
Figure 5 Target histone modification profiles and target-non-target differential modification profiles of TFs. (a) Target histone modification profiles comprising different normalized histone modification signals (columns) of TFs (rows). A target histone modification (more ...)
Correlations between pairs of histone modifications are shown in Figure , based on their signals in the histone modification profiles for all TFs. Only pairs with strong correlation (r > 0.5 or < -0.5) were connected in the form of a correlation network. The dense connectivity in this network reveals strong pairwise redundancy of histone modification signals, which is also indicative of redundancy for predicting target genes.
We next examined the relative importance of each histone modification for predicting target genes of all TFs. Given a histone modification, we compared its signal difference between target and non-target genes of a TF. The signal difference is represented as t-statistics (see Materials and methods for details), which indicate the relative importance of that histone modification for predicting the target genes of a TF. A larger absolute value for a t-statistic indicates more importance. The t-statistics for all histone modifications form a TF-specific profile, denoted as differential modification profiles of the TF. Interestingly, histone-sensitive TFs and histone-insensitive TFs defined based on target histone modification profiles are also distinct according to their differential modification profiles (Figure ). This suggests that histone-sensitive and -insensitive TFs are actually robust clusters with different patterns of histone modifications in their target genes.
Histone modification sensitivity of transcription factors
To understand the biological nature of the histone-sensitive and -insensitive TFs, we explored the different features of these two TF classes under different biological 'contexts'. First, we observed a difference in predictive power of histone modifications for target gene prediction between the two TF classes. As shown in Figure , histone modifications are generally more predictive of the target genes of histone-sensitive TFs than those of histone-insensitive TFs. This is due to the fact that target genes of histone-sensitive TFs have stronger histone modification signals, which substantially improve the performance of our chromatin model.
Figure 6 Distinctions between histone-sensitive and -insensitive TFs. (a) Target genes of histone-sensitive TFs are better predicted using intergenic or upstream histone modifications than are those of histone-insensitive TFs. (b) Histone-sensitive TFs have a (more ...)
Histone-sensitive and -insensitive TFs also show distinct topological characteristics in biological networks. In general, histone-sensitive TFs have less target genes than histone-insensitive TFs (Figure ). Yu and Gerstein constructed a hierarchical network in yeast based on TF-TF regulation relationships identified by ChIP-chip [34
]. We mapped the histone-sensitive and -insensitive TFs onto this hierarchical network and found that histone-sensitive TFs were enriched in the upper layers. This suggests that histone-sensitive TFs are more likely to act as 'managers' that regulate other TFs, while histone-insensitive TFs tend to be 'workers' at the bottom layer of the hierarchy (Table ). We also examined the 'degrees' of these TFs in the protein-protein interaction networks [35
]. Our results indicate that histone-sensitive TFs tend to have more physical interaction partners than histone-insensitive TFs (Figure ). The high connectivity of histone-sensitive TFs further implies their functional importance.
Transcription factor histone sensitivity relates to hierarchical level in a regulatory network
Interestingly, the histone sensitivity of TFs also indicates distinct co-regulation relationships. Two TFs are said to be co-regulatory if their sets of targets significantly overlap. Among the approximately 14, 000 possible TF pairs, we found 1, 440 significant co-regulatory relationships (P < 0.05, Fisher's exact test). Among the TFs involved in co-regulatory relationships, 64 are histone sensitive and 95 are histone insensitive. Of the 1, 440 significant co-regulatory pairs, 447 are between two histone-sensitive TFs, 437 between two histone-insensitive TFs, and 556 between one histone-sensitive TF and one histone-insensitive TF. Fisher's exact test showed that histone-sensitive TFs are more likely to be involved in a co-regulatory relationship than histone-insensitive TFs (P < 10-16). In summary, the histone-sensitive TFs reside mostly in the upper layers of the regulatory network, and tend to work and communicate with other TFs during transcriptional regulation.
Furthermore, we found that the expression levels of histone-sensitive TFs were higher than those of the histone-insensitive TFs (Figure ). It seems that the histone sensitivity of TFs is also related to their biological functions. For example, TFs involved in cell cycle regulation predominantly belong to the histone-sensitive class (Table ). Out of 17 cell cycle TFs reported in a previously study [36
], only CST6 was classified to be histone insensitive. We also examined TFs that were specific to certain conditions - for example, heat shock or oxidative stress. Condition-specific TFs were classified into both histone-sensitive and -insensitive classes; thus, it is not obvious whether the histone sensitivity and condition specificity of TFs are related.
Transcription factor histone sensitivity relates to cellular functions
PSSM predictability and cooperativity of transcription factors
TFs exhibit different PSSM predictability in that the targets of some TFs are well predicted using the PSSM alone but others are not. PSSM predictability reflects the extent to which TFs recognize binding sites through motif matching. Since PSSMs are available for only 50 TFs, accounting for about a quarter of 203 TFs with histone-modification profiles, it is difficult to perform systematic identification and classification to categorize them as well-predictable or weakly predictable using PSSMs. However, we have identified ten TFs with a target prediction AUC of > 0.7 using their PSSMs alone, providing a confident subset of PSSM well-predictable TFs (Table ). To check whether the high predictability is attributed to PSSM specificity, we calculated the information content of these PSSMs. We found that the information content of the ten well-predictable TFs' PSSMs is not different from those of other TFs (P = 0.4, Wilcoxon test). We investigated the expression level, number of target genes, and the hierarchy in the regulatory network of these ten TFs and found no significant difference from the other TFs. We will be able to make more confident conclusions when more PSSMs for TFs become available in the future.
Top 10 PSSM well-predictable transcription factors
For TFs that are weakly predictable using PSSM information, we hypothesized that these TFs may bind to their targets indirectly by cooperating with other TFs. If this is the case, we would expect to predict the target genes of such a TF accurately by using the PSSM of its cooperative TF. We tested this by using each PSSM to predict targets of all TFs (Additional files 1
). The targets of most TFs were best predicted by their own PSSMs, but some TFs have their targets better predicted using other TFs' PSSMs. For example, YAP1's target genes were better predicted using CAD1's PSSM - the AUC increased from 0.71 to 0.75. Similarly, the target genes of INO4 were better predicted using INO2's PSSM than it's own, with the AUC increasing from 0.78 to 0.81. In fact, YAP1 and CAD1 work together in stress-induced transcriptional responses, and INO2 and INO4 form heteromeric complexes involved in phospholipid biosynthesis [37
]. We also found that the PSSMs of the cooperative TFs are actually quite similar, measured using a similarity score range from 0 to 1. The PSSMs of CAD1 and YAP1 have a similarity score of 0.72 (top 1% among all pairs), and those of INO2 and INO4 have a similarity score of 0.55 (top 5%). This further indicates the cooperation between the two TF pairs through indirect binding. Therefore, TF target gene prediction using 'cross-PSSMs' could help identify co-operative interactions between TFs. On the other hand, this suggests that using a TF's own PSSM may not always be best for predicting its target genes, especially when there is evidence it co-operates with another TF.
Chromatin model improves prediction of TF binding sites
We have examined our chromatin model using the TF target data from large-scale ChIP-chip experiments [5
], and shown its effectiveness for predicting target genes. The study by Harbison et al
] investigated TF binding within yeast promoter regions only. However, technical advancement of ChIP-chip and ChIP-seq has enabled us to obtain the binding sites of a TF across the whole genome. Given these more high-resolution data, our chromatin model can also be used to predict TFBSs. For demonstration, we use the ChIP-seq data for STE12 as an example [39
We examined chromatin signals around STE12 binding peaks. We found many histone marks, such as H3K9ac, were enriched in the peak regions (Additional file 7
), implying that they are informative for predicting TFBSs. We separated the yeast genome into 100-nucleotide bins, and divided them into positive bins and negative bins according to their overlap with STE12 binding peaks. Then we constructed SVM classification models (histone modification model, PSSM model and combined model) to predict positive bins in a similar way as predicting target genes. We estimated their prediction accuracy by using a cross-validation method, which verified the effectiveness of the chromatin model for binding site prediction (Figure ). The prediction power of using both histone modification and PSSM information is roughly the same as when using only histone modification (AUC = 0.73), whereas using PSSM only is close to random (AUC = 0.5). The advantage of combining histone modifications with PSSM information, however, is clearly demonstrated when the positive predictive value (PPV; the fraction of positive predictions being true positives) is concerned (Figure ). This is of particular importance in TFBS prediction because reducing false positive predictions is the major challenge of model improvement.
Chromatin modifications improve STE12 binding site prediction. (a, b) The AUC (a) and positive predictive value (b) of the histone-only model (HM), the PSSM-only model (PWM) and the combined model (PWM+ HM) for STE12 binding site prediction.
Comparison with previous methods
We compared our SVM-based method with several previously published approaches, including Cluster-Buster [13
], MCAST [40
], EEL [41
], and Stubb [16
]. We calculated the prediction accuracy of each method by applying it to 10 TFs with more than 200 target genes under the YPD condition. As shown in Table , our method integrating histone modification and PSSM data sets achieves the best prediction for most TFs. For histone-senstive TFs such as SWI4 and SWI6, including histone modification data can improve target prediction accuracy substantially, and PSSM alone gives relatively poor predictions no matter what algorithms are used to search for TFBSs.
Comparison of several computational methods for target gene prediction
Among the previously published methods, EEL and Stubb take advantage of conservation of TF binding motifs between related species, and as shown they achieve relatively more accurate prediction results than FIMO, Cluster-Buster and MCAST. We also tried the 'Chromia' method proposed by Won et al
]. Similar to our method, Chromia integrates histone modification and PSSM data sets but using a hidden Markov model. The method has shown impressive performance when applied to genome-wide ChIP-seq data in mouse for predicting TFBSs. However, when applied to the yeast data in our case, it does not result in good prediction due to the low coverage of the histone modification and TF binding data sets [5
]. For example, the arrays used for the Pokholok et al
. ChIP-chip data contain approximately 42, 000 probes (60-mers), representing only about 20% of the yeast genome [23
]. The arrays used for identifying yeast TFBSs are essentially promoter arrays, covering only DNA regions around the transcription start site of yeast ORFs [5
]. In practice, our method requires only data for interested regions (for example, promoter regions), and thereby is more flexible and can be applied to a wide range of data sets.
We examine the effectiveness of these methods for predicting STE12 binding sites (see 'Chromatin model improves prediction of TFBSs' section). Chromia predicts STE12 binding sites with an AUC of 0.66 and a PPV of 5.6%, presumably also due to the low resolution of the histone modification data. Those motif-based methods perform similarly to our PSSM-based methods (Figure ), and taking into account conservation does not lead to significant improvement for the STE12 case.