We have presented a powerful approach for discovering transcriptional regulatory elements that are globally conserved between pairs of genomes. Our approach requires only two unaligned genomes, thus allowing the use of genomes of arbitrary divergence and those with extensive rearrangements of noncoding regions. Moreover, our motif-finding strategy does not use any parameters other than a conservation score threshold, used to separate presumptive functional from nonfunctional motifs. We have shown that such thresholds can be roughly estimated using independent biological data, when available. Our approach is also computationally efficient: whole eukaryotic genomes can be processed in minutes on a typical computer. In turn, this efficiency allows FastCompare to explore exhaustive pattern lists.
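As a rough illustration of the score that this single threshold is applied to, the sketch below computes a network-level conservation score for one k-mer from two sets of unaligned upstream regions. It follows the description given later in this section (overlap between the two ORF sets, hypergeometric p-value, negative natural logarithm, matches allowed on either strand). The function names, the dictionary layout (ortholog identifier mapped to upstream sequence) and the toy sequences are assumptions made for illustration; this is not the FastCompare implementation.

```python
from math import comb, log

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_with_kmer(upstream, kmer):
    """ORFs whose upstream region contains the k-mer on either strand."""
    rc = reverse_complement(kmer)
    return {orf for orf, seq in upstream.items() if kmer in seq or rc in seq}


def conservation_score(overlap, n_a, n_b, total):
    """Negative natural log of the hypergeometric tail P(X >= overlap)."""
    # Exact integer arithmetic avoids underflow for very small p-values.
    tail = sum(comb(n_a, i) * comb(total - n_a, n_b - i)
               for i in range(overlap, min(n_a, n_b) + 1))
    return log(comb(total, n_b)) - log(tail)


def kmer_conservation_score(kmer, upstream_a, upstream_b):
    """Score one k-mer; both dicts are assumed keyed by ortholog identifier."""
    orthologs = upstream_a.keys() & upstream_b.keys()
    set_a = orfs_with_kmer(upstream_a, kmer) & orthologs
    set_b = orfs_with_kmer(upstream_b, kmer) & orthologs
    return conservation_score(len(set_a & set_b), len(set_a),
                              len(set_b), len(orthologs))


# Toy data (hypothetical sequences, four ortholog pairs).
species_a = {"g1": "TTACGCGTAA", "g2": "CCCCCCCCCC",
             "g3": "GGACGCGTGG", "g4": "ATATATATAT"}
species_b = {"g1": "AAACGCGTCC", "g2": "TTTTTTTTTT",
             "g3": "CCACGCGTAA", "g4": "GCGCGCGCGC"}
score = kmer_conservation_score("ACGCGT", species_a, species_b)
```

In the toy data the k-mer occurs upstream of the same two genes in both species, so the overlap is as large as it can be and the score is correspondingly high for so small a gene set. With real genomes the same calculation would be repeated for every k-mer in the exhaustive list.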
Our results show that FastCompare can recover most of the known functional binding sites in S. cerevisiae when its upstream regions are compared to those of a related species, S. bayanus. We comprehensively explored the globally conserved motif content in worm, fly and mammalian genomes, discovering large sets of known and novel motifs. The use of external information (expression data, functional categories, TRANSFAC, chromatin IP and known motifs) clearly shows that our method is able to detect conserved and functional motifs in all the phylogenetic groups that we studied. In all analyses, we have shown that some of the discovered known or novel motifs were severely constrained, either in terms of position relative to the start of translation or in orientation. We also observed that some of the known or novel motifs are co-conserved within upstream regions, potentially revealing interactions between the (often unknown) transcription factors that bind them.
We have created a set of web tools to superimpose the most globally conserved k-mers discovered by FastCompare onto user-supplied sequences or multiple alignments. An example is shown in Figure , in which the upstream regions of the STE2 gene (encoding the alpha-factor pheromone receptor) from four different yeast species were aligned using ClustalW, and the most globally conserved k-mers are highlighted. All experimentally determined sites for STE2 were also predicted to be globally conserved by FastCompare. Moreover, several other sites also appear to be conserved, both at the global level (predicted by FastCompare) and the local level (shown by the multiple alignment). In Figure , the same analysis was performed on only two orthologous upstream regions instead of four. Many more sites appear to be locally conserved than when using four species, but the globally conserved sites found by FastCompare allow the efficient selection of experimentally verified and putative binding sites. These tools should be particularly useful in designing stepwise promoter deletions and site-directed mutagenesis experiments for understanding the regulatory code of specific genes.
Figure 10 Partial representation (most proximal region) of the aligned 1 kb upstream regions of the S. cerevisiae STE12 gene and its orthologs. (a) The highest scoring 7-mers found by FastCompare in a comparison between S. cerevisiae and S. bayanus are highlighted.
While powerful, our approach has potential limitations. Our current approach allows matches to a given k-mer to be on different strands within pairs of orthologous upstream regions. This flexibility substantially increases the number of k-mers that are supported by independent biological data (that is, true positives), at least for yeasts and worms (data not shown). However, it is difficult to evaluate whether this flexibility introduces more true positives than false positives. Also, transcription factors often bind several slightly distinct sites with different affinities, and it is widely acknowledged that binding-site degeneracy is better captured by using position-weight matrices (PWM) instead of k-mers or consensus patterns [74]. To evaluate whether weight matrices would display better conservation scores, we calculated a conservation score for weight matrices corresponding to 20 well characterized yeast binding sites, and compared them to the conservation scores obtained for the best k-mers that unambiguously correspond to the same binding sites. Conservation scores for weight matrices were calculated as described for k-mers in Materials and methods, except that we used the weight-matrix score thresholds that maximize the significance of the overlap between the two sets of ORFs containing matches to the weight matrices in each species. This involves progressively lowering the score threshold by small increments, and for each threshold, calculating the overlap and its hypergeometric p-value. We then choose the score threshold corresponding to the most significant p-value, and use the negative natural logarithm of this p-value as the conservation score. As shown in Table , only in 11 cases out of 20 did weight matrices have a higher conservation score than the corresponding k-mers. These results suggest that k-mers provide results that are almost as good as those obtained using weight matrices, when utilizing the network-level conservation criterion. One reason why, in many cases, k-mers have a higher conservation score than weight matrices may have to do with the narrower selection of k-mers for binding sites with similar or identical affinities. In fact, we recently showed that PWM scores, widely seen as proxies for binding affinity, are statistically conserved in a comparison between S. cerevisiae and S. bayanus. In the context of the present study, the different k-mers representing each transcription factor binding site may be defining affinity classes that are more strongly conserved than a looser definition of a binding site represented by a weight matrix. Recent work in bacteria has established the importance of binding affinity, especially with respect to coordinating the temporal order of events [75].
Comparison of conservation scores between highest scoring k-mers and position weight matrices (PWM) for 20 known regulatory elements in S. cerevisiae, obtained when comparing S. cerevisiae and S. bayanus
However, Table shows that the conservation score for weight matrices describing very degenerate binding sites, such as RAP1, is significantly higher than the conservation score obtained for the best corresponding k-mer. This suggests that our k-mer based approach is limited in its ability to discover highly degenerate binding sites.
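The threshold scan used for the weight matrices can be sketched as follows. The sketch assumes that the best weight-matrix score in each upstream region has already been computed and stored per ortholog identifier, and that a single threshold is applied to both species; the text does not say whether one shared threshold or one threshold per species was used, so this is an assumption. The function names, the step size and the toy scores are hypothetical.

```python
from math import comb, log


def neg_log_pvalue(overlap, n_a, n_b, total):
    """Negative natural log of the hypergeometric tail P(X >= overlap)."""
    tail = sum(comb(n_a, i) * comb(total - n_a, n_b - i)
               for i in range(overlap, min(n_a, n_b) + 1))
    return log(comb(total, n_b)) - log(tail)


def best_threshold(scores_a, scores_b, step=0.5):
    """Lower the score threshold in small steps and keep the most
    significant overlap. Returns (conservation_score, threshold)."""
    orfs = scores_a.keys() & scores_b.keys()
    top = max(max(scores_a[o], scores_b[o]) for o in orfs)
    bottom = min(min(scores_a[o], scores_b[o]) for o in orfs)
    best_score, best_t = 0.0, None
    n_steps = int((top - bottom) / step)
    for i in range(n_steps + 1):
        t = top - i * step
        # ORFs whose best match reaches the current threshold.
        set_a = {o for o in orfs if scores_a[o] >= t}
        set_b = {o for o in orfs if scores_b[o] >= t}
        score = neg_log_pvalue(len(set_a & set_b), len(set_a),
                               len(set_b), len(orfs))
        if score > best_score:
            best_score, best_t = score, t
    return best_score, best_t


# Toy best-match scores per ortholog (hypothetical values).
scores_a = {"g1": 9.0, "g2": 8.0, "g3": 2.0, "g4": 0.0, "g5": 1.0, "g6": 0.5}
scores_b = {"g1": 8.5, "g2": 9.0, "g3": 0.0, "g4": 2.5, "g5": 0.5, "g6": 1.0}
pwm_score, pwm_threshold = best_threshold(scores_a, scores_b)
```

Because the threshold is tuned to maximize significance, the resulting score is the best of many tested values; any comparison with k-mer scores should keep this extra degree of freedom in mind.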
As shown by our inter-group analysis, many regulatory elements have remained functional across evolution, but few have remained upstream of the same genes. The network-level conservation principle thus appears less applicable to species that diverged very long ago. For example, when we compared the Drosophila and mosquito genomes (which diverged approximately 250 million years ago), we only found a handful of k-mers (interestingly including GATA-factor and Myc/Max binding sites) to have conservation scores above those obtained from randomized data.
There are also several directions in which our approach could be extended. From a methodological standpoint, the approach could be extended to take into account local over-representation of identical or nearly identical copies of the same binding sites, a well known feature in the promoter regions of higher eukaryotic species [16]. To discover highly degenerate regulatory elements, k-mers could be used to seed weight matrices whose individual weights could be optimized for network-level conservation, using stochastic optimization procedures (for example, simulated annealing; Mike Beer, personal communication). Introns and downstream noncoding regions could also be explored using our approach, as these regions are known to harbor functional regulatory elements in metazoan genomes. While our approach can deal with genomes presenting arbitrary levels of divergence and rearrangements, it would be interesting to investigate how global alignments or suboptimal and non-overlapping local alignments [76] could be used to filter out regions of non-conservation. This approach would be particularly interesting when analyzing very long upstream regions, in order to increase the signal-to-noise ratio. Finally, mRNA 3' UTRs could be compared in order to find specific downstream regulatory elements involved in post-transcriptional mRNA regulation (for example, mRNA localization, decay or translational repression).
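One way the proposed seeding of weight matrices from k-mers might look is sketched below. It is a purely illustrative outline of a possible future extension, not something implemented in this work: the seed weight, the perturbation scheme, the linear cooling schedule and all function names are assumptions. The `objective` argument stands in for a network-level conservation score computed from a weight matrix, which the caller would have to supply.

```python
import math
import random

BASES = "ACGT"


def seed_pwm(kmer, match_weight=0.85):
    """Build a probability matrix that strongly favors the seed k-mer."""
    other = (1.0 - match_weight) / 3.0
    return [{b: (match_weight if b == base else other) for b in BASES}
            for base in kmer]


def perturb(pwm, rng, scale=0.05):
    """Return a copy with one randomly chosen weight nudged, then renormalized."""
    new = [dict(col) for col in pwm]
    col = rng.choice(new)
    base = rng.choice(BASES)
    col[base] = max(1e-6, col[base] + rng.uniform(-scale, scale))
    norm = sum(col.values())
    for b in BASES:
        col[b] /= norm
    return new


def anneal(pwm, objective, steps=1000, start_temp=1.0, seed=0):
    """Simulated annealing that maximizes objective(pwm)."""
    rng = random.Random(seed)
    current, current_score = pwm, objective(pwm)
    best, best_score = current, current_score
    for i in range(steps):
        temp = start_temp * (1.0 - i / steps) + 1e-9  # linear cooling
        candidate = perturb(current, rng)
        score = objective(candidate)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if score >= current_score or \
                rng.random() < math.exp((score - current_score) / temp):
            current, current_score = candidate, score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score
```

A call such as `anneal(seed_pwm("ACGCGT"), objective)` would start from the k-mer and return the best matrix encountered, which by construction never scores below the seed.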