In the three experiments with gcn4, mbp1 and ste12, the correct motifs found by our method were very strong: they were detected in several clusters with top ranks. These motifs could also be detected without functional clustering. The yap1 experiment demonstrated the advantage of our method. The canonical Yap1 binding motif, TTACTAA, was detected in one ‘negative’ cluster consisting of five genes on the GO node ‘oxygen and reactive oxygen species metabolism’, consistent with the previous finding of Yap1 function in the oxidative stress response (25). This motif could not be detected in the NOGO test even when we relaxed the filtering criteria in the motif search. The reason may be that the majority of the significant genes do not have canonical Yap1 binding sites in their promoters. Some of them may be Yap1 targets using degenerate/variant motifs, but more likely, many of them change their expression due to indirect effects of YAP1 deletion.
It is interesting that the canonical Gcn4 motif was found in several experiments besides the GCN4 mutant, including GLN3, MAC1, SWI4, SWI5 and YAP1 (Table , marked a). Since Gcn4 was suggested to be a master regulator of gene expression in response to cellular stresses (14), it is possible that those mutations triggered compensatory responses involving GCN4. The yap1 experiment is complicated by the fact that Yap1 was shown to bind to the Gcn4 site, although less optimally than to its canonical site (25). However, ChIP-chip data do not support the binding of Yap1 to the Gcn4 site, because positive Yap1 binding is seen in the promoters of only a small fraction (<20%) of genes in clusters where a significant Gcn4 motif is found (Supplementary Material Table S1). Consistent with the known function of Gcn4 as an activator responding to amino acid starvation and other cellular stresses in yeast (14), the vast majority of the significant genes in the GCN4 mutant were down-regulated and most of the significant clusters obtained with our method were on GO nodes related to amino acid metabolism or biosynthesis (Supplementary Material Table S1). The Gcn4 binding motif was detected in almost all of these clusters, suggesting that these genes are likely to be activated by Gcn4 in the wild type even under non-starved conditions.
Several factors may have contributed to the failure of our approach to detect expected TF binding motifs. First, the experimental conditions may not be appropriate for the TF to manifest its function. For example, Gln3 is known to activate genes involved in the usage of poor nitrogen sources, and those genes are repressed when readily used nitrogen sources are available (26). Therefore, it is possible that under normal culture conditions, as in the Rosetta experiments, the target genes of Gln3 are repressed in wild-type cells and the deletion of GLN3 has no effect. In fact, among the 118 significant genes in the gln3 experiment, only two gene promoters were shown to bind Gln3 in the ChIP-chip data, even at a very loose P-value cut-off (0.05). Second, our knowledge of gene functions, as represented in the GO annotation, is far from complete. This prevents successful functional clustering in some cases. For example, a NOGO motif search detected a Swi5 motif in the down-regulated gene set, but we did not obtain any significant functional clusters with the same set of genes. Among the 22 genes in this set, 7 (32%) are annotated as ‘biological_process unknown’ (GO:0000004). Even genes with some known functions may have other functions that have not been annotated. Third, functional redundancy of some TFs may have reduced the effect of gene deletion on the direct targets, thereby reducing the signals in the expression data. This could be one reason why we did not detect the canonical binding motif for SCB (Swi4), because it is well known that Mbp1 and Swi4 have overlapping functions. In addition, expression profiling using non-synchronized yeast cell populations may also have reduced the TF deletion effect on some cell cycle-related target genes. One example was reported by Koch et al.: the mRNA levels of some Mbp1 target genes in the MBP1 null mutant were intermediate between the peaks and troughs observed in wild-type cells during the cell cycle, possibly because the Mbp1–Swi6 complex could be an activator or a repressor depending on the phase of the cell cycle. Therefore, the average mRNA levels for some cell cycle-related Mbp1 targets may be similar in the non-synchronized cell populations of both mutant and wild type. Similar effects could be relevant to the Swi4 null mutant as well.
The mac1 experiment may be a special case deserving more discussion. Our method reported the canonical Mac1 binding motif GAGCAAA (CuRE, copper-response element) in the cluster ‘iron transport’ (GO:0006826), second to the most significant motif TGCACCC (Supplementary Material Table S1). However, a close examination of this cluster raises questions about the validity of this result. First, among the five genes (FET5, FRE2, FRE1, FET3 and ENB1) in this cluster, only FRE1, a known Mac1 target (28), showed significant promoter binding in the ChIP-chip data. In fact, only two and five promoters among the 89 significant genes in the MAC1 mutant experiment were shown to bind Mac1 in ChIP-chip data at P-value cut-offs of 0.001 and 0.05, respectively. Second, it is known that two copies of CuRE in the promoter are necessary for efficient activation of downstream gene transcription (TRANSFAC database). Among the known Mac1 target genes, CuREs tend to be close to each other in the promoter. However, the FET5 promoter contains only one CuRE and the two CuREs in the promoter of ENB1 are >350 bp apart, with one very far from the translation start site. Therefore, these two genes may not be true Mac1 targets (unless other variant or degenerate CuREs exist). On the other hand, the top motif, TGCACCC, perfectly matches the core of the RCS1 (AFT1) motif consensus in the TRANSFAC database. RCS1 is known to be involved in high affinity iron ion transport (SGD annotation), which is consistent with the ‘iron transport’ cluster. In fact, TGCACCC was detected as the top motif in all positive clusters in the mac1 experiment (Supplementary Material Table S1) and we see a 26% increase in the RCS1 mRNA level in the MAC1 mutant versus wild type (ratio = 1.26, P-value = 0.07), consistent with the role of RCS1 as a transcriptional activator (29). Although the fold change and P-value do not reach our criteria for significant genes, the change may be biologically significant, as it is well known that a small fluctuation in TF expression may have a big impact on downstream genes. Therefore, we think a more plausible interpretation is that many of the significant genes in the mac1 experiment result from increased expression of RCS1, an indirect effect of MAC1 deletion, and that the detection of a CuRE in our analysis may be due to the coupling between iron and copper transport. Alternatively, the detection of a CuRE in our analysis could be an artifact. It is reported that the CuRE is strongly bound by Mac1 only under copper starvation (28). Therefore, it is also possible that under the yeast culture conditions used in the Rosetta Compendium, Mac1 was not active in the wild type. In this case, deletion of MAC1 would not change the expression of most Mac1 targets. This echoes our point in the previous paragraph: the experimental conditions are critical in the study of TF functions.
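The CuRE copy-number and spacing argument above lends itself to a mechanical check. The following is a minimal sketch, assuming promoter sequences are available as plain strings; the function names, the toy sequence and the decision to scan both strands are illustrative assumptions rather than a description of the analysis performed in this study.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))


def find_sites(promoter, motif):
    """Sorted 0-based start positions of exact motif matches on either strand."""
    promoter = promoter.upper()
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return [i for i in range(len(promoter) - k + 1) if promoter[i:i + k] in targets]


def min_site_spacing(promoter, motif):
    """Smallest start-to-start distance (bp) between adjacent sites, or None if < 2 sites."""
    sites = find_sites(promoter, motif)
    if len(sites) < 2:
        return None
    return min(b - a for a, b in zip(sites, sites[1:]))


# Hypothetical toy promoter carrying two CuRE copies whose starts are 20 bp apart.
toy = "TTTT" + "GAGCAAA" + "C" * 13 + "GAGCAAA" + "TTTT"
print(find_sites(toy, "GAGCAAA"))
print(min_site_spacing(toy, "GAGCAAA"))
```

A promoter for which `min_site_spacing` returns `None` (a single CuRE, as described for FET5) or a large value (as described for ENB1) would be flagged for closer inspection; any numerical threshold for "close" is left to the user, since none is specified here.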
Although we did not detect the published SCB (Swi4) binding motif in the SWI4 deletion mutant experiment, our method identified two very similar motifs in the histone cluster and the CDK regulation cluster. An interesting point about these two motifs is that they are of the form A?GCGAA, which is somewhat similar to both the canonical Swi4 binding motif CGCGAAA and the Mbp1 binding motif ACGCGT. As can be seen in Tables and , the promoters of many genes in these two clusters were shown to bind Mbp1 in the ChIP-chip experiments of Lee et al. and/or Iyer et al. Some of these genes do not have a single copy of even a degenerate Mbp1 site, ACGCGN. Although a motif like A?GCGAA may have a lower affinity for Swi4 and Mbp1 than their canonical binding motifs, it may provide a mechanism for cross-talk between the two pathways, as it is known that Swi4 and Mbp1 overlap functionally. A relatively low affinity for this motif may be compensated for by multiple occurrences of the motif in the promoters or by cooperation with other factors, like those of histone genes. It is also worth noting that in a recent study by Liu et al., who applied their motif discovery algorithm (MDscan) to earlier published ChIP-chip data, the top ranked motifs reported for Swi4 were ACGCGAA and AACGCGA, resembling the motif we found.
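To make the comparison between A?GCGAA, ACGCGN and the canonical sites concrete, the following sketch counts occurrences of a degenerate pattern in a promoter on either strand. It assumes that ‘?’ stands for any base (as N does in the IUPAC nucleotide code); the function names and test strings are hypothetical and only illustrate the matching logic.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
    "?": "[ACGT]",  # assumption: '?' in A?GCGAA means "any base"
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN?", "TGCAYRSWMKVHDBN?")


def reverse_complement(pattern):
    """Reverse complement of a (possibly degenerate) upper-case pattern."""
    return pattern.translate(COMPLEMENT)[::-1]


def motif_regex(pattern):
    """Compile a pattern into a regex; the lookahead lets overlapping sites be found."""
    body = "".join(IUPAC[c] for c in pattern)
    return re.compile("(?=(" + body + "))")


def count_sites(promoter, pattern):
    """Number of distinct start positions matching the pattern on either strand."""
    promoter, pattern = promoter.upper(), pattern.upper()
    fwd = {m.start() for m in motif_regex(pattern).finditer(promoter)}
    rev = {m.start() for m in motif_regex(reverse_complement(pattern)).finditer(promoter)}
    return len(fwd | rev)


# The MDscan motif ACGCGAA quoted above is one instance of the A?GCGAA form.
print(count_sites("ACGCGAA", "A?GCGAA"))
```

Collecting start positions in a set means that a palindromic site such as ACGCGT is counted once, not once per strand.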
One difference between our method and many early microarray data analysis methods is that our method attempts to combine the information from genome-wide expression data and known gene functions, while others mostly use functional information ad hoc to confirm or interpret the resulting clusters. Another feature of our approach is that our motif search algorithm uses the promoters of a set of non-significant control genes as background instead of sequences based on a random model. That may have enhanced the sensitivity of our motif search because we at least partially corrected the bias in the sequence word distribution. As a consequence, our motif search algorithm does not seem to be severely affected by simple repeats [e.g. poly(A), poly(T) and dinucleotide repeats] in promoter sequences while some other motif search methods often need to mask these simple repeats before searching.
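The idea of scoring words against control promoters instead of a random sequence model can be illustrated with a short sketch. The enrichment statistic used here (a hypergeometric upper tail on the number of promoters containing each word) is an assumption chosen for illustration, as are the function names, the word length of 7 and the forward-strand-only counting; the scoring actually used by our motif search may differ.

```python
from collections import Counter
from math import comb


def words_in(promoter, k):
    """Set of distinct k-mers present in one promoter (forward strand only, for brevity)."""
    promoter = promoter.upper()
    return {promoter[i:i + k] for i in range(len(promoter) - k + 1)}


def hypergeom_upper_tail(x, n, K, N):
    """P(X >= x) when n promoters are drawn from N, of which K contain the word."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(x, min(n, K) + 1)) / total


def rank_words(cluster, control, k=7):
    """Rank k-mers by enrichment in cluster promoters relative to control promoters."""
    fg = Counter(w for p in cluster for w in words_in(p, k))
    bg = Counter(w for p in control for w in words_in(p, k))
    n, N = len(cluster), len(cluster) + len(control)
    scored = [(hypergeom_upper_tail(c, n, c + bg[w], N), w) for w, c in fg.items()]
    return sorted(scored)


# Hypothetical toy data: three cluster promoters sharing TTACTAA, four control promoters.
cluster = ["AAATTACTAAGG", "CCTTACTAACC", "GGTTACTAATT"]
control = ["ACGTACGTACGT", "GGGGCCCCGGGG", "ATATATATATAT", "CCCCGGGGAAAA"]
print(rank_words(cluster, control)[0])
```

Because a word that is common in all promoters (for example a poly(A) run) is also common in the control set, its enrichment score stays unremarkable, which is the intuition behind not needing to mask simple repeats.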
A recent study by Wang et al. extended the REDUCE algorithm (10) and applied it to a dataset consisting of more than 500 microarrays, including the Rosetta Compendium, in an attempt to systematically reconstruct transcription networks. REDUCE is a powerful algorithm for motif detection, as demonstrated by Bussemaker (10) and more recently by van Steensel (32). Wang et al. successfully rediscovered the known motifs of several TFs in the corresponding TF perturbation experiments. However, their method appeared to be susceptible to indirect effects of TF perturbations. They reported TGACTCA as the motif for Yap1 and TGCACCC as the candidate motif for Mac1. In contrast, our method successfully detected the canonical Yap1 motif TTACTAA and arguably detected the CuRE GAGCAAA for Mac1. Therefore, we believe our method is to some extent complementary to theirs. This comparison also shows that conclusions based solely on a single TF perturbation expression study may not be reliable. Other sources of information, such as ChIP data or multiple expression arrays with different types of perturbations of a TF, are needed to verify the results and reach a sensible conclusion.
As discussed previously, the effectiveness of our method relies on the level of present knowledge about gene functions. The eight TFs in our test experiments are relatively well studied. A key question is, did we merely recover TF/target gene information already in the functional annotation? This question is critical if one wants to extrapolate the performance of our method to other less studied TFs. We manually checked the evidence codes and references used for GO annotation in SGD for a few small clusters in the mbp1 and yap1 experiments. For the putative target genes in those clusters, none of the references involved direct binding assays of TF/cis-elements. Instead, most of the references involved phenotypic studies. A few annotations are linked to review papers or with the evidence code IEA (inferred from electronic annotation), which may include some information from binding assays. Therefore, we believe that our method may have inferred TF/cis-element relations based on mRNA level changes and promoter sequences, combined with functional information mostly obtained from phenotypic studies. As an example, MBP1 is annotated in SGD as involved in ‘DNA replication’ (GO:0006260). One of the significant clusters found with our method in the MBP1 mutant experiment is on node ‘DNA repair’ (GO:0006281) and every gene in that cluster contains at least one copy of ACGCGT in its promoter. This suggests that Mbp1 may also be involved in DNA repair and that our method may be able to reveal new functional relationships between genes. The effort of GO annotation in SGD is still ongoing. The current annotation probably reflects only a subset of our knowledge about all yeast genes in the literature. When more functional annotation information is available, our method should become more effective.
In this study, we used ChIP-chip data to verify the motifs found with our method. Of course, when ChIP-chip data are available for a TF, it may be more desirable to use these data directly with a method such as MDscan (30) or that of Kato et al. (submitted for publication) to detect the binding motif of that TF. However, even after a large-scale study such as that of Lee et al., ChIP-chip data are still not available for many TFs in yeast. ChIP-chip data for other species are far less common. Therefore, methods such as ours are valuable in detecting relevant TF binding motifs when only TF intervention data are available. As we have shown in the last section, our method is able to provide a promising candidate motif list without using ChIP-chip data. In addition, the functional clustering algorithm we implemented is not limited to motif finding. It can be applied to lists of candidate genes obtained with other methods, for example, genes that are differentially expressed between two tissue samples or genes significantly bound by a TF in ChIP experiments. This approach may reveal new functional relations among genes or provide new insights into the function of the TF in question.
Our method at present is still rather simple and we can foresee several improvements. For example, our motif search currently counts only exact word matches without tolerating variants. The sensitivity may be improved if we allow degenerate motifs. The challenge, however, is that the top ranks in such a search will be dominated by variants of the strongest motif. Thus, better filtering methods will be required. Another issue is the P-value cut-off in the functional clustering. We currently set the P-value cut-off empirically at a relatively stringent 5E-6 in order to address the multiple testing problem. While our random control test suggested that this cut-off should keep the random hits at a fairly low level, a fixed cut-off may not be optimal. It may be too stringent when the number of significant genes is small and too liberal when the number of significant genes is large. An adaptive cut-off based on false discovery rate control (33) may be more desirable. In addition, we currently examine the distribution of the motif hits among the genes in a cluster and the locations of the matching sites in the promoters empirically and ad hoc. A more formal method of handling these aspects may further enhance the specificity of our prediction.
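As an illustration of what an adaptive cut-off could look like, the sketch below implements the Benjamini-Hochberg step-up procedure, one widely used false discovery rate method. Whether this is the specific procedure of reference (33) is an assumption, and the function name, the FDR level of 0.05 and the example P-values are hypothetical.

```python
def bh_cutoff(p_values, fdr=0.05):
    """Largest P-value declared significant by Benjamini-Hochberg, or None if none passes."""
    m = len(p_values)
    cutoff = None
    for rank, p in enumerate(sorted(p_values), start=1):
        # Step-up rule: keep the largest p whose rank-scaled threshold it satisfies.
        if p <= fdr * rank / m:
            cutoff = p
    return cutoff


# Hypothetical P-values for ten GO nodes tested in one experiment.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bh_cutoff(example))
```

Unlike a fixed threshold, the resulting cut-off depends on the number of tests and on the distribution of the observed P-values, so it tightens or relaxes with the size of the gene list and the number of GO nodes examined.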
As discussed previously, expression array data from a TF intervention may not always contain enough information for motif detection. A similar problem may exist for ChIP-chip experiments as well. In some preliminary tests using the Gln3 and Mac1 ChIP-chip data from Lee et al. and an extended version of MDscan (30), we were unable to detect any motifs resembling the published motifs for Gln3 or Mac1. The reason may be similar to that for the Rosetta deletion experiments: these TFs are not active under the culture conditions of the experiments and few real targets are bound by the TF studied. Moreover, some binding detected by ChIP-chip may not be specific or functional. Thus, the signal may be too weak to be detected by a program like MDscan. With current technologies, both expression array data and ChIP-chip data contain a significant amount of noise. However, they may reflect different aspects of the same biological process. Methods that integrate the information from expression arrays and ChIP-chip experiments are therefore worth further investigation. Additional improvement may be gained with further integration of other sources of information, including, but not limited to, biological knowledge in the literature and databases. Our study is a first step in that direction. Even with our current method, the success rate is encouraging. It is our belief that with the rapid accumulation of biological data, our approach, with further improvements, will be valuable for identifying TF/target relationships and for deciphering genetic networks of yeast and other living organisms.