|Home | About | Journals | Submit | Contact Us | Français|
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
We describe a powerful new approach for discovering globally conserved regulatory elements between two genomes. The method is fast, simple and comprehensive, without requiring alignments. Its application to pairs of yeasts, worms, flies and mammals yields a large number of known and novel putative regulatory elements. Many of these are validated by independent biological observations, have spatial and/or orientation biases, are co-conserved with other elements and show surprising conservation across large phylogenetic distances.
One of the major challenges facing biology is to reconstruct the entire network of protein-DNA interactions within living cells. A large fraction of protein-DNA interactions corresponds to transcriptional regulators binding DNA in the neighborhood of protein-coding and RNA genes. By interacting with RNA polymerase or recruiting chromatin-modifying machinery, transcriptional regulators increase or decrease the transcription rate of these genes. Transcriptional regulators bind specific DNA sequences upstream, within or downstream of the genes they regulate, and a large number of experimental and computational studies are aimed at locating these sites and understanding their functions (for example [1,2]). The increasing availability of whole-genome sequences provides unprecedented opportunities for identifying binding sites and studying their evolution. The strong conservation of functional elements (binding sites, protein-coding genes, noncoding RNAs, and so on) across even distantly related species should make it possible to predict these functional elements and prioritize them for experimental validation. The few large-scale comparative genomics approaches for finding transcriptional regulatory elements have so far relied mostly on detecting locally conserved motifs within global alignments of orthologous upstream sequences [3,4]. Although very powerful and straightforward, these approaches cannot be used when upstream regions are very divergent or have undergone genomic rearrangements. For example, aligning the mouse and puffer fish orthologous upstream regions would be very difficult, because of the great reduction that the puffer fish intergenic regions have undergone . Also, global alignments cannot be used when the positions of regulatory elements within functionally conserved promoter regions have been scrambled, for example through genomic rearrangements. Also, global alignment-based approaches often generate an overwhelming number of predictions because of the basal conservation between the genomes under study. To reduce the number of predictions, multiple global alignments of upstream sequences from several related species have been used, yielding many new candidate binding sites [3,4]. However, multiple (more than two) closely related genome sequences are not always available; moreover, by focusing only on regulatory elements that are conserved between several genomes, these approaches might miss elements that are conserved in more local areas of the phylogenetic tree.
Here we describe a simple and efficient comparative approach for finding short noncoding DNA sequences that are globally conserved between two genomes, independently of their specific location within their respective promoter regions. Our method, which we call FastCompare, is based on a principle that we have termed 'network-level conservation' , according to which the wiring of transcriptional regulatory networks should be largely conserved between two closely related genomes.
Our previous attempts at using network-level conservation relied on Gibbs sampling to find candidate regulatory elements . However, Gibbs sampling and related algorithms are not fully appropriate in this context, because of the low density of actual binding sites in pairs of orthologous upstream regions. Moreover, these algorithms are non-deterministic, relatively slow, and rely on sequence sampling, which makes them likely to miss many regulatory elements. While our previous approach was successful at predicting a large fraction of functional regulatory elements in the relatively small yeast genome, analyzing larger and more complex metazoan genomes requires faster and more exhaustive algorithms. Here, we use a faster, simpler and more comprehensive approach for detecting conserved and probably functional regulatory elements using the network-level conservation principle. FastCompare allows comprehensive exploration of the conserved - but not aligned - motifs between two genomes, while retaining a linear time complexity. We apply our approach to a large number of species, including yeasts, worms, flies and mammals, and describe some of the most conserved known and unknown regulatory elements within these genomes. We also show how this approach may help reconstruct part of the transcriptional network and reveal some of its associated constraints. Finally, we show that a large number of predicted motifs are conserved within and across different phylogenetic groups.
In the following sections, pairs of closely related species are termed phylogenetic groups. We applied FastCompare to the four following phylogenetic groups: yeasts (Saccharomyces cerevisiae and S. bayanus), worms (Caenorhabditis elegans and C. briggsae), flies (Drosophila melanogaster and D. pseudoobscura) and mammals (Homo sapiens and Mus musculus). For each phylogenetic group, we describe some of the most interesting, known and novel, predicted regulatory elements. For each of these regulatory elements, we perform independent validation using gene expression data, chromatin immunoprecipitation (IP) data, known motifs and data from several biological databases (Gene Ontology (GO)/MIPS, TRANSFAC), and show that the most globally conserved predicted regulatory elements are strongly supported by these independent sources.
The average nucleotide identity between S. cerevisiae and S. bayanus upstream regions is approximately 62%  (similar to the identity between human and mouse upstream regions) and divergence times are estimated between 5 and 20 million years . The number of ortholog pairs between S. cerevisiae and S. bayanus is 4,358 (see Materials and methods). We chose to analyze 1 kb-long upstream regions, because most of the known transcription factor binding sites in S. cerevisiae are located within this range . Using FastCompare, we calculated a conservation score for all possible 7-, 8- and 9-mers on the corresponding 8.6 megabase-pairs (Mbp) of sequences and sorted each list separately according to conservation score (see Figure Figure1;1; the raw sorted lists are available on our website ). On a typical desktop PC, this analysis took approximately 5 minutes (for example, the entire set (8,170) of 7-mers was processed in 35 seconds).
As described in Materials and methods, conservation scores are calculated for all k-mers (with fixed k), and are relative measures of network-level conservation for these k-mers (the higher the conservation score, the more conserved the corresponding k-mer). We first describe the distribution of conservation scores for all 7-mers. As shown in Figure Figure2,2, the distribution of conservation scores has a very long tail and many 7-mers on the tail correspond to well known regulatory elements in S. cerevisiae (see below for a detailed description of these sites). To verify that such high conservation scores could not be obtained by chance, we generated randomized sequences as described in Materials and methods and re-ran FastCompare on these sequences. The corresponding distribution of conservation scores is shown on Figure Figure22 and clearly shows that the high conservation scores corresponding to known regulatory elements are extremely unlikely to arise by chance.
We used various independent sources of biological data to demonstrate that k-mers with the highest conservation scores are likely to be functional. For a given k-mer, we define the 'conserved set' as the set of ORFs corresponding to the overlap between the two sets of orthologous ORFs containing at least one exact match to the k-mer in their upstream regions (see Materials and methods). We found that conserved sets defined for the highest-scoring 7-mers are significantly enriched with genes whose upstream regions contain occurrences of known motifs in yeast (Figure (Figure3a),3a), significantly enriched with genes whose upstream regions were shown to be bound by known transcription factors in vivo (Figure (Figure3b),3b), and significantly enriched in at least one MIPS functional category (Figure (Figure3c).3c). We also show that the number of 7-mers found upstream of over- or underexpressed genes in at least one microarray condition increases with the conservation score (Figure (Figure3d)3d) and that the number of 7-mers matching at least one TRANSFAC consensus also increases with the conservation score (Figure (Figure3e).3e). Altogether, these data provide strong and independent evidence that our method identifies functional yeast regulatory elements by giving them a high conservation score.
Closer examination of Figure 3a-d shows that the 400 highest-scoring 7-mers are most strongly supported by independent data. Therefore we retain them for further analysis and, when possible, replace them by 8-mers and 9-mers with higher conservation scores and also add the high-scoring 8-mers and 9-mers without high-scoring substrings, as described in Materials and methods. This processing yields 398 k-mers (k = 7, 8 and 9).
Then, for each of these 398 k-mers, we determine the optimal window within the initial 1 kb which maximizes the conservation score (see Materials and methods); we then re-evaluate the functionality of each of the 398 k-mers with the independent biological information described above, using the new conserved sets. The full information for the 398 k-mers is available at .
Using known transcription factor binding site motifs, genome-wide in vivo binding data, functional annotation and literature searches, we found at least 27 different known transcription factor binding sites among the 398 highest scoring k-mers. These regulatory elements, along with their support from independent biological data, are shown in Table Table1.1. Some of the best-known binding sites are represented several times within the 398 top scoring k-mers, in the form of slightly distinct or overlapping sequences (see ). Note also that we use very stringent criteria for identifying known binding sites among our predictions. When we matched our predictions to the known motifs published in  (regular expressions), we predicted 42 out of 53 known motifs (Kellis et al.  predict exactly the same number of motifs, and essentially the same motifs, but using multiple alignments of four yeast genomes).
Among the 27 different known regulatory elements returned by FastCompare, several (Swi4, Mbp1, Sum1/Ndt80, Fkh1/2) are involved in regulating the yeast cell cycle. The other known sites are also involved in fundamental biological processes in yeast: amino-acid metabolism (Cbf1, Gcn4), meiosis (Ume6), rRNA transcription (PAC and RRPE), proteolytic degradation (Rpn4), stress response (Msn2/Msn4) and general activation/repression (Rap1, Reb1). As described in Materials and methods, our approach also handles gapped motifs. Thus, the binding sites for Abf1, a chromatin reorganizing transcription factor (CGTNNNNNNTGA), and Mcm1, a factor involved in cell-cycle regulation and pheromone response (CCCNNNNNGGA), were also identified as very high-scoring patterns and strongly supported by independent information (known motifs and chromatin immunoprecipitation).
When we used the same independent biological data to evaluate the 400 highest-scoring 7-mers obtained on randomized data, we found only three known binding sites (RRPE, FKH1 and BAS1).
Several known binding sites are not found among the 398 top-scoring k-mers, perhaps because their transcriptional network has undergone extensive rewiring since the speciation of the two yeasts, or because the corresponding transcription factors regulate few genes. In some cases, the presence of several known sites (clearly identified in terms of independent data) among the full set of 7-mers argues in favor of the rewiring hypothesis. For example, the binding site for the Rcs1 transcription factor, TGCACCC, only appears at the 1,883rd position within the list of ranked 7-mers. Despite its lack of conservation, this site is strongly backed by independent biological information: it is identified as a known motif, it is found in 33 microarray conditions, and its conserved set is significantly enriched in genes annotated with homeostasis of metal ions (p < 10-5), which is the known function for Rcs1 . Similarly, the known binding sites for the Ace2/Swi5 and Hsf1 transcription factors were clearly identified (in terms of independent data) within the complete list of 7-mers, but not among the 398 highest scoring k-mers.
It is now known that functional regulatory elements can be positionally constrained, relative to other regulatory elements or to the start of transcription [7,11,12]. To assess whether some of the predicted regulatory elements are positionally constrained in yeast, we calculated the median distance to ATG for the conserved sets of each of the 398 k-mers and independently built the distribution of median distances to ATG for all 7-mers as described in Materials and methods (the distribution is shown in Figure Figure4)4) and found d0.025 = 350 and d0.975 = 680. In other words, a median distance to ATG of less than 350 or higher than 680 should each arise by chance with only a 2.5% probability. Among the 398 most conserved k-mers, more than a fifth (86) have their median distance below 350 (p < 10-52), while only seven have a median distance greater than 680. A closer examination reveals that a few known sites are particularly constrained. For example, the binding sites for Reb1, PAC, TATA, Swi4, Rpn4, RRPE and Mbp1 are found to be situated relatively close to the start of translation, with a median distance to ATG between 150 and 300 bp. Some of these constraints were also found to be good predictors of gene expression in a recent study  (for RPN4, PAC and RRPE, for example). In contrast, binding sites for Met4, Ume6, Hap4, Rap1, Ino4 and Ste12 are found to be situated at a greater median distance, between 400 and 500 bp from ATG.
We found many novel motifs among our highest-scoring predictions. For example, we found two strongly conserved motifs, AGGGTAA (rank 17) and TGTAAATA (rank 31), which are situated relatively close to ATG (with a median distance to ATG of 349 and 378.5 bp, respectively) and more often in upstream regions than in coding regions (with ratios of 1.95 and 1.83, respectively). Interestingly, TGTAAATA also has a statistically significant 5' to 3' orientation bias (binomial p-value < 10-7). However, neither of the two putative sites is supported by independent biological data. Additional expression data may help define their biological role. Other sites, such as CAGCCGC or GCGCCGC are found upstream of over- or underexpressed genes in many microarray conditions (15 and 6, respectively). While these two sites are similar to the canonical Ume6-binding site, the latter was not found in any microarray conditions (as none of the microarray experiments we used is related to meiosis, the biological process which Ume6 is known to be involved in), suggesting that the two sites are bound by other factors.
We repeated the same analysis on distinct pairs of yeast species other than S. cerevisiae/S. bayanus. We first compared S. cerevisiae and S. paradoxus (a much closer relative of S. cerevisiae) and found 15 of the 27 known motifs we obtained when comparing S. cerevisiae and S. bayanus (results are available at ). We also compared S. cerevisiae with S. castellii, which is a more distant relative within the Saccharomyces phylogenetic group. S. castelli is interesting in that its upstream regions cannot be globally aligned with those of S. cerevisiae, because of extensive sequence divergence . We also found 15 of the 27 known motifs found in the S. cerevisiae/S. bayanus comparison (results at ), although they were different from the S. cerevisiae/S. paradoxus conserved motifs. Interesting similarities and differences in conservation were revealed when comparing the known motifs discovered in each comparison. For example, the PAC, RRPE and Mbp1 motifs were found within the highest-scoring k-mers in all three comparisons, hinting at the conserved role of the corresponding proteins. However, the Reb1-binding site, which was found to be highly conserved between S. cerevisiae and S. bayanus (rank 1), is much less conserved between S. cerevisiae and S. castelli (rank 230). This argues for extensive rewiring in the Reb1 transcriptional network in the lineage that led to S. castelli.
To discover interactions between regulatory elements, we searched for co-conservation of pairs of high-scoring predicted regulatory elements, as described in Materials and methods. Not surprisingly, the most conserved interaction is between RRPE (AAAAATTTT) and PAC (CTCATCGC), with a median distance D = 22 bp [11,13]. We also find that the Cbf1-binding site (CACGTGA) is strongly co-conserved with the Met4-binding site (CTGTGGC), and that these two sites are separated by a short distance (D = 44.5) in S. cerevisiae. Indeed, it has been shown that the binding of Cbf1 in the vicinity of a very similar sequence (AAACTGTG) enhances the DNA-binding affinity of a Met4-Met28-Met31 complex for this sequence , and that the median distance between the above Cbf1 and Met4 sites is small .
Many of the predicted interactions have not yet been experimentally studied. For example, we found that the highest scoring Reb1 motif (CGGGTAA) is significantly co-conserved with both the highest scoring RRPE motif (AAAAATTTT) and the highest scoring PAC motif (CTCATCGC), with a short median distance between the two sites in both cases (D = 38 and D = 63.5, respectively). The Reb1/RRPE interaction was also discovered independently as a good predictor of expression . We also found that Reb1 interacts with the Cbf1 motif (CACGTGA), also at a short median distance (D = 30). An interesting interaction between RRPE and an unknown motif, TGAAGAA, displays a conserved set strongly enriched in translation (p < 10-11), while RRPE alone is more strongly enriched in rRNA transcription (p < 10-14). The full sorted list of interactions is available at .
In contrast to yeast, relatively little is known about cis-regulatory sequences in C. elegans. There is a dramatically greater complexity of transcriptional regulation in multicellular organisms. Indeed, transcription factors in multicellular organisms regulate cohorts of genes in different tissues and at different times during development . C. elegans promoter regions often contain many domains of activation/repression and, as a result, are much larger than those in yeast.
We applied FastCompare to the genomes of C. elegans and C. briggsae, two worms that diverged about 50-120 million years ago . The number of orthologous open reading frames (ORFs) between these two species is 13,046 and here we have only considered 2,000 bp upstream regions. It takes approximately 11 minutes for FastCompare to process the corresponding 50 Mbp of sequences and calculate a conservation score for all 7-, 8- and 9-mers on a typical desktop PC.
The distribution of conservation scores for all 7-mers shows that high conservation scores are unlikely to be obtained by chance (Figure (Figure5a).5a). As shown in Figure Figure5a,5a, many known regulatory elements fall on the tail of the distribution. We then used functional categories, over- or underexpression, and TRANSFAC motifs to assess the ability of FastCompare to predict functional regulatory elements. Figure 5b-d shows that support for the highest-scoring k-mers by functional enrichment, expression and TRANSFAC strongly increases with conservation score. We have only retained the 400 highest-scoring 7-mers, which are particularly well supported by independent biological information as shown in Figure 5b,c. Starting from these 400 highest-scoring 7-mers, we obtain 437 k-mers (k = 7, 8 or 9) using the procedure described in Materials and methods.
As shown in Table Table2,2, at least 15 distinct known binding sites in C. elegans and other metazoan organisms were identified among the 437 predicted regulatory elements.
One of the most conserved is TGATAAG, the binding site for the GATA factors, a family of regulators controlling intestinal development (see  for review). Another motif returned by FastCompare, GTGTTTGC, corresponds to the binding site for the forkhead-related activator-4 (Freac-4) . Note that this motif is also compatible with the PHA-4-binding site (published consensus: T[AG]TT[GT][AG][CT] ), present in the upstream regions of pharyngeal genes  (PHA-4 is also a member of the forkhead family of transcription factors). FastCompare also returned TGTCATCA, the known binding site for the SKN-1 transcription factor (published consensus [AT][AT]T[AG]TCAT). In C. elegans, SKN-1 is known to initiate mesendodermal development by inducing expression of the GATA factors MED-1 and MED-2 (required for mesendodermal differentiation in the EMS lineage) .
The GAGA-factor binding site (AGAGAGA) was also found as a highly conserved pattern. GAGA repeats in upstream regions have been shown to be functional in C. elegans in at least two separate studies [22,23]. At least one GAGA-binding protein has been identified in D. melanogaster, and is assumed to create nucleosome-free regions of DNA, thus allowing additional transcription factors to bind those regions . However, the ortholog of this protein has not yet been identified in C. elegans .
We also found CAGCTGG, a site known to be bound by the myogenic basic helix-loop-helix (bHLH) family of transcription factors (in worms, flies and mammals) and AP-4 transcription factors (in mammals) [25,26] (published consensus CAGCTG [27-29]). The homolog of human AP-4 was found to be ubiquitously expressed in D. melanogaster and a C. elegans homolog has also been identified . FastCompare returned GTAAACA, the known binding site for the DAF-16 transcription factor (published consensus GTAAACA [30,31]). DAF-16, a FOXO-family transcription factor, was shown to influence the rate of aging of C. elegans in response to insulin/insulin-like growth factor-1 signaling [31,32].
Searching for gapped motifs found few strongly conserved sites. However, when searching for 8-mers with a 5-bp gap, we found that TGGCNNNNNGCCA, the known binding site for nuclear factor I (NFI) , had a score comparable to those of the highest-scoring k-mers.
Several of the C. elegans sites returned by FastCompare and shown in Table Table22 are known to be functional transcription factor binding sites in other species. For example, TGACTCAT, identical to the AP-1-binding site , is known to be bound in yeast (by Gcn4), Drosophila , mouse and human (see  for a review).
FastCompare also returns the CACGTGG motif, which is the binding site for the Myc/Max complex, a family of bHLH transcription factors . Among the top-scoring motifs in Table Table2,2, we also find AAGGTCA, the hormone response element (HRE), bound by several transcription factors in human, mouse, fruit fly and silkworm (published consensus [CT]CAAGG[CT]C[AG] [38,39]); TGACGTC, the cAMP response element (published consensus TGACGTCA ); CCCGCCC, the binding site for the mammalian Sp1 transcription factor (known consensus CCCCGCCCC); ATCAATCA, the known binding site for the human proto-oncogene Pbx-1 . A similar site, ATCAATTA, has been shown to be bound in vitro by the Drosophila homolog of Pbx-1, the extradenticle (exd) protein . Moreover, CEH-20C was identified as the C. elegans homolog of both Pbx-1 and exd. Other known sites discovered by FastCompare include CAGGTGA, similar to the known binding site for the Snail protein, a transcription factor involved in dorso-ventral pattern formation in Drosophila (published consensus [AG][AT][AG]ACAGGTG[CT]AC ), and TTCGCGC, the known binding site for the E2F proteins, a family of transcription factors involved in regulating the cell cycle in Drosophila and mammals (published consensus TTTCGCGC ). An E2F homolog has been identified in C. elegans and recently shown to be involved in cell-cycle regulation [45,46].
As in yeast, several of the known binding sites in C. elegans appear to be constrained in terms of position. Using the distribution of median distances for all 7-mers (see Materials and methods), we found d0.025 = 690 and d0.975 = 1,135. Among the 437 highest-scoring k-mers, we found that 75 are located below the lower threshold, a proportion that is much higher than the expected 2.5% (p < 10-38). The binding sites for forkhead-related activator-4 (Freac-4), Sp1, E2F and AP-1 are particularly constrained (see Figure Figure6).6). We found only 21 k-mers to be located further away from the distant d0.975 threshold. Interestingly, the most conserved k-mer among these 21, CCACCAGGA (rank 96), is found in the upstream regions of over- or underexpressed genes in 57 microarray conditions.
Note that for a few predicted elements (for example, CAGGTGA, rank 111), the median distance falls outside of the optimal window; this is due to the fact that, for these elements, the median distance does not correspond to the peak of the distribution of distances to ATG. Hence, for these elements, the optimal window provides a better descriptor of the positional bias than the median distance. Additional analysis reveals that several of the known binding sites discovered in this study are constrained in term of orientation. For example, the binding site for the GATA-factor(s) (as shown in Table Table2)2) is significantly more often found in the 3' to 5' orientation, relative to downstream genes. Probably the most interesting finding is that the GAGA repeats appear to be strongly oriented 3' to 5' relative to their downstream genes. Indeed, 2,375 out of 3,557 (67%) of the AGAGAGA sites are oriented 3' to 5', a proportion that is much larger than the expected 50% (p < 10-90). This bias is confirmed by the fact that TCTCTCT alone (not taking into account its reverse complement) has a much higher conservation score (129.2) than AGAGAGA (34.3). We also found that several related motifs display a similar, albeit weaker, orientation bias, for example, GAAGAAG (p < 10-16), GGAGGAG (p < 10-10). It is interesting that all the GAGA repeats found to be necessary for correct expression of the ceh-24 and unc-54 genes are in fact TCTC repeats [22,23]. The conserved sets for TCTCTCT or AGAGAGA were not found to be enriched in any GO category. Note that this orientation bias is not due to genes with the repeats in their upstream regions being predominantly located on one strand, as these genes are approximately identically distributed on each strand (1,065/1,122, p = 0.89). Interestingly, conserved GAGA repeats in D. melanogaster were also found to be constrained in terms of orientation, but at a much lower significance (p < 10-4, see below). Although it is possible that the TCTC repeats are bound at the 5' untranslated region (UTR) mRNA level, the positional distribution of the conserved AGAGAGA sites does not indicate a strong positional bias with respect to ATG (DATG = 893).
FastCompare also returned many novel motifs; some of the most interesting ones are shown in Table Table3.3. The top-scoring motif, CTGCGTCT, belongs to this category. A larger version of that motif, TCTGCGTCTCT, was found in a recent study to be necessary for the expression of several ethanol-response genes . However, the very high conservation of this site suggests a broader role. It is interesting to note that this site was not significantly found upstream of under- or overexpressed genes in any microarray conditions (including the data from ). Interestingly, the most conserved k-mer found in yeast, the binding site for the Reb1 protein, had the same property. Moreover, this site displays a relatively strong orientation bias 5' to 3' (p < 10-10).
Several of the other novel predicted regulatory elements in Table Table33 have interesting properties. For example, the fourth most-conserved k-mer, CGACACTCC, is one of the closest motifs to ATG, with a median distance of 234 bp, and its conserved set is strongly enriched in genes involved in positive regulation of growth (a biological process defined in GO as the increase in size or mass of all or part of the worm) (p < 10-7). Another predicted regulatory element, CGAGACC (rank 20), is found upstream of downregulated genes in 23 microarray conditions. Interestingly, it is found upstream of downregulated genes in a study measuring gene-expression changes at several time points during worm aging , in two distinct strains (fer-15 and spe-9;fer-15) and at similar time points (6, 9 and 10 days for fer-15, 9 and 11 for spe-9;fer-15). In addition, the functional enrichment of its conserved set points at a potential role in embryonic development (p < 10-7). Another strongly conserved and novel motif, CTCCGCCC (rank 14), was independently found upstream of almost all transcribed worm microRNA genes in a recent study .
We found many interactions between the most conserved k-mers found at the previous stage. For example, the most conserved k-mer, TCTGCGTCT, is very often co-conserved with AGAGAGA. The high-scoring interaction between the DRE-like motif, AATCGAT and the putative E2F-binding site, TTTTCGC, also appears interesting. Indeed, the conserved sets for both k-mers are separately enriched significantly with genes involved in embryonic development, according to GO (p < 10-8 and p < 10-7, respectively). However, the conserved set of genes having both elements in their upstream regions is even more enriched in this GO category (p < 10-9). TTTTCGC also seems to interact with the novel site CGACACTCC, and the corresponding conserved set is enriched with genes involved in modification-dependent protein catabolism (p < 10-5). The full list of motif interactions is available at .
We applied FastCompare to the genomes of D. melanogaster and D. pseudoobscura, two species of Drosophila that diverged about 46 million years ago . The number of orthologous ORFs between these two species is 11,306 and here we only consider 2,000-bp upstream regions. Using 5,000 bp instead produced similar results, but also produced additional putative binding sites (results are available at ). It takes approximately 10 minutes for FastCompare to process the corresponding 45 Mbp of sequences and calculate a conservation score for all 7-mers, 8-mers and 9-mers on a typical desktop PC.
The distribution of conservation scores shown in Figure Figure7a,7a, for actual and randomized data, shows once again that the high conservation scores obtained with the real sequences are very unlikely to be achieved by chance. Also, as shown in Figure Figure7a,7a, many known regulatory elements fall on the tail of the distribution.
As for the yeast and worm genomes, we used functional annotations (GO), expression data and known TRANSFAC sites to evaluate the FastCompare predictions. Unfortunately, expression data is often available for only a subset of genes and its analysis led to very few validations. However, Figure 7b,c clearly shows that functional enrichment of the conserved sets and TRANSFAC matches strongly correlate with conservation score. As with yeasts and worms, we focused on the 400 highest-scoring 7-mers, which are particularly well supported by the functional enrichment analysis (see Figure Figure7b).7b). The simple processing described in Materials and methods yielded 469 k-mers (k = 7, 8 or 9), which we further analyze below.
As shown in Table Table4a,4a, we found at least 16 distinct known regulatory elements among the 469 highest-scoring k-mers. The most conserved element, AACAGCTG, is similar to the site known to be bound by AP-4 (mammals) and MyoD (worms, flies and mammals). One of the most interesting predictions is TATCGATA (rank 12); this palindromic motif, known as the DNA replication-related element (DRE), has been experimentally proved to be necessary for proper expression of several cell proliferation-related genes in D. melanogaster  and, more recently, the genes encoding the TATA-binding protein (TBP)  and catalase  in the same organism. Interestingly, it is both the motif with the closest median distance to ATG (DATG = 168), and the most over-represented k-mer (among the 469 highest scoring ones) within D. melanogaster upstream regions compared to exons, with a ratio of 5.39.
Several of the other predicted sites are known to be bound by Drosophila transcription factors involved in development. For example, FastCompare predicts TTTATGGC (rank 14) and TAATTGA (rank 24), the binding sites for two homeodomain transcription factors. The first site matches the TRANSFAC consensus binding site for Abd-B ([CG]NTTTATGGC), while the second site is the known consensus binding site for the Antennapedia (Antp) class of homeodomain proteins  (TAATTGA matches the TRANSFAC consensus binding site for Ubx, a member of the Antp class). FastCompare also predicts ATTTATGC, a site matching the TRANSFAC consensus binding site for the chicken CdxA protein ([AC]TTTAT[AG]), the homolog of the Caudal protein in D. melanogaster. Also, FastCompare predicts CAGGTGC, the binding site for the Snail repressor/activator protein, a transcription factor required for proper mesodermal development .
FastCompare also predicts ATTTGCATA (rank 3) as one of the most conserved putative regulatory elements between the two flies. This site is the binding site for the POU-domain family of transcription factors, and it is probably bound by one or several of the three POU-domain transcription factors in Drosophila: DFR, PDM-1 and PDM-2. These three proteins are involved in different stages of Drosophila development: DFR is expressed in midline glia and in tracheal cells , whereas the redundant PDM-1 and PDM-2 are essential for proper neuronal development .
Many of the known motifs found when comparing the two Drosophila genomes were also found when analyzing the worm genomes. For example, GAGA repeats are found to be strongly conserved, slightly oriented 3' to 5' (p < 10-4), and very significantly found upstream of genes involved in morphogenesis (p < 10-23). GTAAACA (rank 147), the DAF16-binding site in C. elegans, is also one of the most conserved sites between the two Drosophila genomes. This site is probably bound by dFOXO, the unique homolog of the C. elegans DAF16 protein in D. melanogaster .
As for both previous phylogenetic groups (yeasts and worms), the median distances to ATG for the conserved elements show that some of the predicted regulatory elements are severely constrained in terms of position. Among the most constrained k-mers are the DRE site (TATCGATA, DATG = 168) and the known AP-4/MyoD binding site (AACAGCTG, DATG = 373). However, both the optimal windows and the median distances in Table Table4a4a show that, compared to previously studied organisms, a smaller number of conserved regulatory element are constrained. Using the distribution of median distances for all 7-mers, we find that the d0.025 = 798 and d0.975 = 1,126. Among the 469 highest scoring k-mers, 45 fall below 798 (p < 10-13) and 36 above 1,126 (p < 10-8), once again suggesting weaker positional constraints than in yeasts and worms, at least when considering the first 2,000 bp of 5' upstream sequences.
FastCompare predicts many putative regulatory elements in Drosophila that to the best of our knowledge are unknown (Table (Table4b).4b). One of these novel sites, CAGGTAG (rank 143), was found upstream of several genes that are activated before widespread activation of zygotic transcription (which begins during the 14th nuclear cycle), in several Drosophila species ; it was also found to be necessary for the early expression of several of these genes (Sxl and sisterlessB) in a subsequent study (J.R. ten Bosch, J.A. Benavides and T.W. Cline, personal communication). It is interesting to see that this particular site is significantly conserved upstream of genes involved in cell fate commitment (p < 10-8).
Some of these sites, such as the palindromic TTAATTA (rank 31), are found much more often in upstream regions than in exons (with an over-representation ratio of 3.07). Others, such as ACACACAC, are found to be significantly enriched upstream of genes in known functional categories (embryonic development, p < 10-9). The same site appears to be strongly oriented 5' to 3' (p < 10-12). Others, such as GTGTGACC or AAATGGCG, appear to be located closer to ATG than most other sites (DATG = 296 and 592, respectively).
We found many potential interactions between the most conserved sites discovered by FastCompare. For example, the POU-domain-binding site ATTTGCATA was found to be strongly co-conserved with TAATTGA, the Antp-binding site, and with many other potential homeodomain sites, such as AATAAAT and TAATTAA. The CACA repeats were also found to be co-conserved with several different sites, and in some cases, the set of genes having both sites simultaneously conserved in their upstream regions (conserved sets) was found to be enriched in certain functional categories, for example, ACACACAC and GAGAGAG, regulation of transcription (p < 10-12); ACACACAC and TAATTGC (an Antp variant site), embryonic development (p < 10-5). The full list of interactions is available at .
The much larger noncoding regions of mammalian genomes present significant challenges for computational motif discovery. Also, many repeat elements (for example, Alu) have colonized mammalian genomes and are likely to be conserved between closely related genomes. The distance between enhancers and the transcriptional start of the genes they regulate can be extremely large, reaching tens of kilobases. Finally, gene predictions and gene boundaries are still largely unverified experimentally for a large number of genes.
We applied FastCompare to the genomes of H. sapiens and M. musculus,, which diverged about 75 million years ago . The number of orthologous ORFs between these two species is 15,983 and again, we have only considered 2,000-bp upstream regions. As in flies, using 5,000-bp instead produced similar results. It takes approximately 15 minutes for FastCompare to process the corresponding 60 Mbp of sequences and calculate a conservation score for all 7-mers, 8-mers and 9-mers on a typical desktop PC.
Unlike the other genomes considered so far, the output of FastCompare from the mammalian genomes is dominated by GC-rich sequences, probably corresponding to CpG islands (GC-rich regions known to be associated with the promoters of many genes). However, analysis of the FastCompare output yielded the same validations as for other species. Indeed, the distribution of conservation scores obtained on actual and randomized sequences shows that high conservation scores are very unlikely to be obtained by chance (Figure (Figure8a).8a). As with other species, many known regulatory elements are on the tail of the distribution (Figure (Figure8a).8a). Also, as shown in Figure 8b-d, more k-mers are found upstream of over or underexpressed genes, more k-mers have their conserved set enriched with GO functional categories, and more k-mers match TRANSFAC consensus sites as the conservation score increases.
We found that masking Alu repeats did not influence the output of FastCompare (data not shown). To overcome the overabundance of GC-rich sequences in the FastCompare output, we use longer k-mers as starting points, namely 8-mers instead of 7-mers. We started with the 600 highest-scoring 8-mers, and replaced each of these 8-mers by one of its substrings (7-mer) or one of its superstrings (9-mer), when their conservation score is higher. We then removed duplicates in the list and added the high-scoring 9-mers that have no substrings within the list. This procedure yielded 284 k-mers (k = 7, 8, 9). Subsequent validation was limited to this small set of high-scoring predictions.
As shown in Table Table5a,5a, we found 17 distinct known regulatory elements among the 284 highest-scoring k-mers. Among these are the well characterized sites for the Sp1, C/EBP, CREB and Myc/Max proteins or families of proteins. These four sites reside very close to ATG (their median distance to ATG is between 100 and 250 bp), suggesting that the four proteins (or families of proteins) may be involved in intimate interactions with the transcriptional complex. Sp1 is an ubiquitous transcription factor, involved in the basal expression of a large number of genes in mammals (see  for review). The CCAAT/enhancer binding protein (C/EBP) has been implicated in the regulation of cell-specific gene expression mainly in hepatocytes, adipocytes and hematopoietic cells (see  for review). Both Sp1 and C/EBP are constitutive transcription factors whose presence is necessary for significant induction of a large number of genes . The CRE-binding protein (CREB or CBP) is a transcription factor that binds cyclic AMP (cAMP) response elements (CREs) in the promoters of specific genes, and functions as a co-activator for a large number of other transcription factors (see  for review). The Myc/Max heterodimer binds the CACGTG sequence, and also acts as a transcriptional activator (see  for review).
Interestingly, we found that some of the most conserved interactions between k-mers (see Materials and methods) involve Sp1-binding sites (CCCGCCC or CCGCCCC) with other known sites such as CACGTGAC (Myc/Max), TGACGTCA (CREB), CGCAGGCGC (unknown), GCCAATC (CCAAT-box) and ACTTCCG (Ets), and that the median distances between these sites are relatively small (138, 164, 200.5, 234 and 234, respectively).
Among the other predicted regulatory elements returned by FastCompare are CCGCCTC, a site known as the insulin response element ; CGGAAGTGA, a site known to be bound by the GA-binding protein in human ; CGCATGCG, a site known as the palindromic octamer sequence, which was found at 132 bp (relative to ATG) upstream of the inosine-5'-monophosphate dehydrogenase type II gene in human, and shown to be functional in resting and activated T cells using site-directed mutagenesis, in vivo footprinting and electrophoretic mobility shift assay (EMSA) ; TTTCGCGC, the E2F-binding site; TAATCCCAG, a site known to be bound in D. melanogaster by the anterior morphogen Bicoid, and also recently shown to be bound in human by Goosecoid-like (GSCL) . Interestingly, this site has a relatively strong orientation bias 3' to 5' (p < 10-14). It is also the site with the strongest over-representation in upstream regions compared to exons that we observed, with a ratio of 7.06.
FastCompare also predicts ATTTGCAT, the binding site for the POU-domain Oct-1 and Oct-2 proteins, known to bind the promoter and intronic enhancer of immunoglobulin genes ; it also returns GGAAGTCCC, a site that was shown to bind NFκB [70,71], a transcription factor involved in a variety of pathways (including inflammation, response to infection and oxidative stress, and apoptosis).
The distribution of distances to ATG for all 7-mers (Figure (Figure9)9) shows an interesting bimodal shape, indicating that a large number of short sequences are constrained to reside around 500 bp to ATG. We calculated d0.025 = 342 and d0.975 = 1,185 and found that 83 k-mers among the 284 highest-scoring ones have a shorter median distance than 342 (p < 10-63) and only 11 have a larger median distance than 1,185. Indeed, a majority of the known sites identified by FastCompare are preferentially located near the 5' start of genes, with some elements being very close to ATG (for example, the CREB site, whose median distance to ATG is 107, whereas the optimal window is [0;1,000]). Nonetheless, a few known motifs do not seem to show any positional constraints. For example, the Bicoid-like site TAATCCCAG has a median distance to ATG of 1,258.
FastCompare identifies many putative regulatory elements which to the best of our knowledge are novel (Table (Table5b).5b). Some of these predicted regulatory elements are found upstream of over- or underexpressed genes in many microarray conditions. One example is CCCCAGC, which is significantly found upstream of overexpressed genes in 21 conditions (out of 30) of the human cell-cycle experiment . Other conserved elements are found much more often in upstream regions than in exons, for example, CCCCTCCC or TCTCGCGA, with ratios of 5.12 and 4.45, respectively. Others appear to be positionally constrained, for example, the palindromic CTGCGCA with an optimal window [0;300] and a median distance to ATG of 199, or constrained in terms of orientation, for example, GTGAGCCAC, which is significantly oriented 5' to 3' (p < 10-6).
To gain a better understanding of the network-level conservation of regulatory elements between the different phylogenetic groups, we compared the results we obtained by applying FastCompare to yeasts, worms, flies and mammals in the previous sections. We calculated the overlap (and its significance) of the 400 highest-scoring 7-mers and 8-mers found for each phylogenetic group. As shown in Table 6a,b, the number of shared predicted sites correlates with phylogenetic distance (the number of high-scoring putative motifs that two phylogenetic groups have in common decreases as the phylogenetic distance between the groups increases). All of the overlaps were found to be statistically significant, except for the yeast-human comparison. For both 7-mers and 8-mers, the best overlap is the one obtained between the two invertebrate phylogenetic groups: worms and flies. Indeed, simple observation of the identified known regulatory elements in Tables Tables22 and and4a4a reveals that these two organisms have a large number of predicted binding sites in common.
However, when we looked at the overlap between conserved sets for identical high-scoring k-mers in different phylogenetic groups (after determination of reciprocal best BLAST hits between the considered species), we found little overlap. The only significant overlap we found (after Bonferroni correction) was between the GATA sites (GATAAGA) in worm and fly (p = 2.5 × 10-4). As a control, we performed the same analysis within the yeast phylogenetic group, using the S. cerevisiae/S. bayanus and S. paradoxus/S. mikatae 400 most conserved 7-mers. One hundred and ninety-five sites were found in both groups of 7-mers, and for all of them, the overlaps between the conserved sets obtained separately in the S. cerevisiae/S. bayanus and S. paradoxus/S. mikatae analyses were highly significant, with hypergeometric p-values < 10-40. Therefore, our results strongly suggest that, while transcription factors have largely retained their ability to recognize specific DNA sites, their targets have largely changed through appearance or disappearance of those binding sites in promoters. This hypothesis is supported by recent analysis of the fission yeast cell cycle using microarrays, which showed that the role and the binding sites for several of the main transcription factors involved in regulating the yeast cell cycle (Swi4/Mbp1, Fkh1/Fkh2, Swi5/Ace2) are conserved between budding and fission yeasts (which diverged about 1 billion years ago), but the sets of genes that they regulate overlap much less than expected (only about 50 orthologous genes are cell-cycle-regulated in both species) .
It is particularly interesting to consider the seven 8-mers that are top predictions for all three multicellular phylogenetic groups (note that many more 7-mers are conserved between these groups). These sites include the CRE (TGACGTCA, GACGTCAC and ATGACGTC), the POU-domain binding site (ATTTGCAT), and the HRE (CAAGGTCA). A fourth site is also shared (GCCACGCC, CCACGCCC), which to the best of our knowledge is a novel motif. Its strong over-representation in upstream regions compared to coding regions, and its closeness to ATG (median DATG = 230 for GCCACGCC) make it a promising candidate for experimental testing. Interestingly, the location constraints on these conserved sites can vary across phylogenetic groups. For example, the CRE appears weakly constrained in worms and flies in terms of distance to ATG (DATG = 708 and 825, respectively), but is very close to ATG in mammalian genomes (DATG = 107). However, the distances to ATG of the POU-domain-binding sites (862, 882 and 729, respectively) indicate that their positional constraints are shared among the phylogenetic groups. The same holds for the HRE binding site (845, 1,015.5 and 895, respectively).
We have presented a powerful approach for discovering transcriptional regulatory elements that are globally conserved between pairs of genomes. Our approach requires only two unaligned genomes, thus allowing the use of genomes of arbitrary divergence and those with extensive rearrangements of noncoding regions. Moreover, our motif-finding strategy does not use any parameters other than a conservation score threshold, used to separate presumptive functional from nonfunctional motifs. We have shown that such thresholds can be roughly estimated using independent biological data, when available. Our approach is also computationally efficient: whole eukaryotic genomes can be processed in minutes on a typical computer. In turn, this efficiency allows FastCompare to explore exhaustive pattern lists.
Our results show that FastCompare can recover most of the known functional binding sites in S. cerevisiae when its upstream regions are compared to those of a related species, S. bayanus. We comprehensively explored the globally conserved motif content between worms, flies and mammalian genomes, discovering large sets of known and novel motifs. The use of external information (expression data, functional categories, TRANSFAC, chromatin IP and known motifs) clearly shows that our method is able to detect conserved and functional motifs in all the phylogenetic groups that we studied. In all analyses, we have shown that some of the discovered known or novel motifs were severely constrained, either in terms of position relative to the start of translation or in orientation. We also observed that some of the known or novel motifs are co-conserved within upstream regions, potentially revealing interactions between the (often unknown) transcription factors that bind them.
We have created a set of web tools to superimpose the most globally conserved k-mers discovered by FastCompare to user-supplied sequences or multiple alignments. An example is shown in Figure 10a, in which the upstream regions of the STE2 gene (encoding the alpha-factor pheromone receptor) from four different yeast species were aligned using ClustalW, and the most globally conserved k-mers are highlighted. All experimentally determined sites for STE2 were also predicted to be globally conserved by FastCompare. Moreover, several other sites also appear to be conserved, both at the global level (predicted by FastCompare) and the local level (shown by the multiple alignment). In Figure 10b, the same analysis was performed on only two orthologous upstream regions instead of four. Many more sites appear to be locally conserved than when using four species, but the globally conserved sites found by FastCompare allow the efficient selection of experimentally verified and putative binding sites. These tools should be particularly useful in designing stepwise promoter deletions and site-directed mutagenesis experiments for understanding the regulatory code of specific genes.
While powerful, our approach has potential limitations. Our current approach allows matches to a given k-mer to be on different strands within pairs of orthlogous upstream regions. This flexibility substantially increases the number of k-mers that are supported by independent biological data (that is, true positives), at least for yeasts and worms (data not shown). However, it is difficult to evaluate whether this flexibility introduces more true positives than false positives. Also, transcription factors often bind several slightly distinct sites with different affinities, and it is widely acknowledged that binding-site degeneracy is better captured by using position-weight matrices (PWM) instead of k-mers or consensus patterns . To evaluate whether weight matrices would display better conservation scores, we calculated a conservation score for weight matrices corresponding to 20 well characterized yeast binding sites, and compared them to the conservation scores obtained for the best k-mers that unambiguously correspond to the same binding sites. Conservation scores for weight matrices were calculated as described for k-mers in Materials and methods, except that we used the weight-matrix score thresholds that maximize the significance of the overlap between the two sets of ORFs containing matches to the weight matrices in each species. This involves progressively lowering the score threshold by small increments, and for each threshold, calculating the overlap and its hypergeometric p-value. We then choose the score threshold corresponding to the most significant p-value, and use the negative natural logarithm of this p-value as the conservation score. As shown in Table Table7,7, only in 11 cases out of 20 did weight matrices have a higher conservation score than the corresponding k-mers. These results suggest that k-mers provide results that are almost as good as those obtained using weight matrices, when utilizing the network-level conservation criterion. One reason why, in many cases, k-mers have a higher conservation score than weight matrices may have to do with the more narrow selection of k-mers for binding sites with similar or identical affinities. In fact, we recently showed that PWM scores, widely seen as proxies for binding affinity, are statistically conserved in a comparison between S. cerevisiae and S. bayanus . In the context of the present study, the different k-mers representing each transcription factor binding site may be defining affinity classes that are more strongly conserved than a looser definition of a binding site represented by a weight matrix. Recent work in bacteria has established the importance of binding affinity, especially with respect to coordinating the temporal order of events .
However, Table Table77 shows that the conservation score for weight matrices describing very degenerate binding sites, such as RAP1, is significantly higher than the conservation score obtained for the best corresponding k-mer. This suggest that our k-mer based approach is limited in its ability to discover highly degenerate binding sites.
As shown by our inter-group analysis, many regulatory elements have remained functional across evolution, but few have remained upstream of the same genes. The network-level conservation principle thus appears less applicable to species that diverged very long ago. For example, when we compared the Drosophila and mosquito genomes (which diverged approximately 400 million years ago), we only found a handful of k-mers (interestingly including GATA-factor and Myc/Max binding sites) to have conservation scores above those obtained from randomized data.
There are also several directions in which our approach could be extended. From a methodological standpoint, the approach could be extended to take into account local over-representation of identical or nearly identical copies of the same binding sites, a well known feature in the promoter regions of higher eukaryotic species . To discover highly degenerate regulatory elements, k-mers could be used to seed weight matrices whose individual weights could be optimized for network-level conservation, using stochastic optimization procedures (for example, simulated annealing; Mike Beer, personal communication). Introns and downstream noncoding regions could also be explored using our approach, as these regions are known to harbor functional regulatory elements in metazoan genomes. While our approach can deal with genomes presenting arbitrary levels of divergence and rearrangements, it would be interesting to investigate how global alignments or suboptimal and non-overlapping local alignments  could be used to filter out regions of non-conservation. This approach would be particularly interesting when analyzing very long upstream regions, in order to increase the signal-to-noise ratio. Finally, mRNA 3' UTRs could be compared in order to find specific downstream regulatory elements involved in post-transcriptional mRNA regulation (for example, mRNA localization, decay or translational repression).
First we determined orthology relationships between ORFs on the basis of reciprocal best BLAST hits (Figure (Figure1a)1a) and extracted the corresponding upstream regions from the genome sequences. Then, we considered every possible short DNA sequence of length k (k-mer, with k between 7 and 9) as a candidate regulatory element. For each k-mer, we found the set of ORFs whose upstream regions contain at least one exact match to the k-mer, anywhere in the upstream region, in the first genome. We did the same for the second genome, obtaining another set of ORFs. Then, we calculated the overlap between the two sets and assessed its statistical significance (Figure (Figure1b).1b). The statistical significance of the overlap provides a measure of conservation with which we score and rank every possible k-mer (Figure (Figure1c).1c). Note that our approach is very different from the classical k-tuple DNA sequence-analysis methods [77,78], which are not based on comparative genomics and are local methods; that is, they only deal with single promoters or small sets of functionally related promoters (while our approach provides a genome-level measure of conservation for candidate regulatory elements).
Sequence data were downloaded from the Saccharomyces Genome Database (SGD) for all yeast species considered in this paper; worm (C. elegans and C. briggsae), Drosophila (D. melanogaster), human (H. sapiens) and mouse (M. musculus) sequence data were downloaded from Ensembl . The D. pseudoobscura genome sequences (contigs) were downloaded from . The upstream regions used in this study are immediately adjacent to the ATG codons of their downstream genes, and are 1-kb long (yeasts) or 2-kb long (worms, flies and mammals). Note that transcription-factor-binding sites generally reside in the region situated upstream of the transcription start site. Unfortunately, not all genes have well annotated transcription start sites. This problem should not, however, strongly influence the output of FastCompare, as distances between start of transcription and start of translation should be at most on the order of a few hundred base-pairs (except in certain cases, for example when 5' UTRs are interrupted by long introns). However, as gene structures become better annotated (mainly as a result of massive cDNA sequencing projects) and promoter regions become more accurately delimited, we expect that the ability of FastCompare to discover regulatory elements will be significantly improved.
Orthology information provided by Ensembl or by Kellis et al.  was used throughout this study, when available. Ensembl provides strong homology relationships between genes from different species, but does not provide reciprocal best matches. Therefore, we determine reciprocal best matches using the provided sequence identity between homologous genes. When orthology information is not available in Ensembl (for example, between D. melanogaster and D. pseudoobscura, or between distant species such as S. cerevisiae and C. elegans), we determine orthologs using the reciprocal best BLAST hits approach.
Given a value of k, we first generated the set of all possible k-mers and removed half of them on the basis of reverse complementarity. We also removed k-mers with very low complexity and which are over-abundant in the intergenic regions of the genomes we analyzed (that is, those that contain k - 1 or more As or Ts), as these sequences are unlikely to be regulatory elements. Every remaining k-mer (that is, 8,170 for k = 7) is then considered as a candidate regulatory element. For each k-mer, we found the set of ORFs in the first species that have at least one exact occurrence of the k-mer in their upstream regions. We then found the set of ORFs in the second species that have at least one occurrence of the same k-mer in their upstream region. Importantly, the matches can be anywhere in the upstream regions: they do not have to be at the same positions in two orthologous upstream regions (as with multiple alignment) and can be on any strand. Since both functional and non-functional elements are expected to be conserved between two closely related species, the two sets are expected to overlap. However, under the network-level conservation principle, the extent of the overlap - and therefore its statistical significance - will be even greater for k-mers that represent functional transcription factor binding sites. The significance of the overlap can be measured using the hypergeometric distribution. The probability of two sets of size s1 and s2, drawn from a set of N elements, to have i or more elements in common is given by :
In this way, all k-mers can be ranked by their hypergeometric p-values. It is important to note that due to basal conservation (that is, conservation arising from common ancestry), the hypergeometric p-values will generally be very small for most k-mers. Therefore, we only use these p-values as relative measures of network-level conservation and focus on k-mers with the greatest conservation. For simplicity, we define the 'conservation score' to be the negative logarithm (base e) of the hypergeometric p-value obtained for a given k-mer. Therefore, the more extensive the overlap between the two sets, the higher the conservation score. Also, for the same k-mer, we call 'conserved set' the set of ORFs corresponding to the overlap between the two sets of orthologous ORFs containing at least one exact match to the k-mer in their upstream regions. Conserved sets are used throughout this study to get insights into the function of the most conserved k-mers, using functional annotation [81,82], chromatin IP , known motifs, and to evaluate whether these k-mers are constrained in terms of position or orientation.
The current FastCompare implementation handles k-mers with a user-specified gap (termed gapped k-mers), which is a straightforward extension of the approach described above. The conservation score returned by FastCompare is independent of the size of the patterns (that is, the value of k); therefore k-mers with different sizes, and gapped k-mers (for example, CGTNNNNNNTGA) can be compared.
We use the following strategy when applying FastCompare to pairs of genomes. First, we calculate conservation scores for all 7-mers, 8-mers and 9-mers. We then retain only the m highest-scoring 7-mers, with m chosen according to independent biological data (alternatively, m could be chosen according to the estimated number of transcription factors in the species being considered). We then replace each of the retained 7-mers by an 8-mer (if there is one) with higher conservation score for which the considered 7-mer is a substring. We also include within the final list the 8-mers which do not have any substrings within the m 7-mers. We then repeat the same process for the retained 8-mers, replacing each of them by its higher scoring 9-mer superstring if there is one, and add the 9-mers that do not have any substring within the 8-mers. This strategy thus allows the optimal length for candidate regulatory elements to be determined.
FastCompare is implemented in the C language and uses efficient data structures (hash tables and prefix trees ). For a given value of k, the worst-case time complexity is O(kn + 4k(p + k)), where n is the total amount of upstream sequences and p is the total number of orthologous pairs. Note that the first term is generally much larger than the second one; therefore the complexity of our approach can be seen as linear in the combined sizes of the genomes to be compared (when k is restricted to 7, 8 and 9). The calculation of hypergeometric p-values involves factorials of large integers, so we use specialized C routines, as described in . FastCompare runtimes provided in the Results section are obtained using a standard desktop PC (2.0 GHz CPU, 1 GB RAM).
As described in Results, we applied FastCompare to 1 kb (yeast) or 2 kb upstream regions (worms, flies and mammals). While these lengths are reasonable, they are somewhat arbitrary, and it is known that some regulatory elements are constrained to be within specific distances (often shorter than 1 kb) from the start of transcription, reflecting mechanistic constraints for transcription factor-transcription factor or transcription factor-RNA polymerase interactions . Moreover, some regulatory elements have orientation biases (see [11,12] for examples). To discover such constraints, we analyzed the most conserved k-mers found at the previous stage in the following ways.
First, for each high-scoring k-mer, we calculated the median distance to ATG (as the start of transcription is generally not known) for the set of all (non-overlapping) occurrences of this k-mer within the upstream regions of its conserved set (see previous section for a definition of the conserved set of a given k-mer). To statistically assess whether the median distance to ATG for a given k-mer is unusually small or large, we built the distribution P(d) of median distances to ATG, for the entire set of 8,170 7-mers. We first created a histogram by binning the median distances to ATG for all 7-mers into 20-bp bins, and then smoothed the histogram (using a normal kernel and a bandwidth of 50 as implemented in the ksmooth function of the R statistical software package). Then, using numerical integration, we sought the distance thresholds d0.025 and d0.975 such that P(d <d0.025) = 0.025 and P(d <d0.975) = 0.975. We then considered the median distance to ATG for a given k-mer as unusually short or long when it is less than d0.025 or greater than d0.975, respectively.
For each k-mer, we also sought the sequence window which maximizes the conservation score by progressively shortening all upstream regions (all having equal lengths) by 100 bp increments from the 5' end. Then, we did the same from the 3' end using the optimal 5' end found in the previous step. Evaluating every possible window whose length is a multiple of 100 bp almost always yields identical results. We then calculated the conserved sets for these windows, and output the orientation (strand) for each k-mer occurrence within its conserved set (palindromes were counted on both strand).
Finally, using the results of the previous step, for each k-mer, we used the binomial distribution to assess whether the proportion of occurrences of this k-mer (within its conserved sets) on one strand is significantly smaller (or larger) than 0.5. Binomial p-values less than 0.05 (after Bonferroni correction) are considered significant.
It is now known that the regulatory code governing the expression of genes is combinatorial [11,85,86]. The network-level conservation principle can be trivially extended to discover interactions (that is, co-conservation) between two k-mers. To focus on heterotypic interactions, we only examined k-mers that differ by more than l nucleotides, after optimal ungapped alignment. We tested several values of l and found that l = 4 was most appropriate when using 7-, 8- and 9-mers. Then, we proceeded as described above, except that instead of seeking two sets of ORFs (one for each species) whose upstream regions contain a single k-mer, we sought the two sets of ORFs that contain the two k-mers simultaneously. Once these two sets were available, we evaluated the extent of their overlap as described above, and rank interaction pairs according to their conservation score.
We used randomized data to show that high conservation scores (obtained as described above) are unlikely to be obtained by chance, and independent biological information to assess the ability of FastCompare to predict functional regulatory elements by giving them a high conservation score. We also estimated the over-representation of predicted regulatory elements in upstream regions compared to coding regions.
Our goal was to generate new pairs of upstream regions that are conserved at the same level of divergence as the actual sequence data. We align each pair of orthologous sequences using the Needleman-Wunsch algorithm , and calculate substitution frequencies between all pairs of nucleotides (A → A, A → T, and so on). Then, we reconstructed new pairs of orthologous sequences by mutating one of the sequences in each initial pair using the estimated frequencies. Generating the sequences to be mutated using locally estimated first-order Markov models yielded the same results.
The proportions of 7-mers supported by each type of independent data, as presented in Figures Figures3,3, ,5,5, ,77 and and8,8, is calculated as follows. In these figures, support for a given 7-mer is considered as binary, and depends on whether the 7-mer meets the particular validation criterion or not (or whether it is found upstream of over- or underexpressed genes, in at least one microarray condition, see below). 7-mers are first sorted by conservation score, and the proportion of supported 7-mers were calculated using a sliding window of 100 7-mers. For each window and each type of independent biological data, we simply calculated the number of 7-mers for which support is available and divided this number by 100.
Yeast (S. cerevisiae), worm (C. elegans), fly (D. melanogaster) and human (H. sapiens) functional categories and corresponding ORF annotations were downloaded from the MIPS  and GO  websites. The statistical significance of the functional enrichments within sets of ORFs was evaluated using the hypergeometric distribution, as discussed above. Hypergeometric p-values for functional enrichment were not corrected for multiple testing, but only p-values smaller than 10-4 are reported, providing a slightly less stringent thresholds than Bonferroni corrections.
Weight matrices corresponding to known yeast motifs were obtained from Gibbs sampling-based motif finding on chromatin IP data , functional categories and clusters of co-expressed genes . Only high-confidence binding sites (that is, sites confirmed by several sources including the literature) were included in our list of known motifs. We label a given k-mer as a known motif if it meets the following two criteria. The first is significant overlap (p < 10-4) between the conserved set for the given k-mer and the set of ORFs whose upstream regions contain at least one match to the known motif (the sets of ORFs were defined using ScanACE with the weight matrix for the known motif, and with the standard average minus two standard deviations threshold ). The second criterion is strong sequence similarity between the considered k-mer and the known motif weight matrix. To evaluate this similarity, we turn the considered k-mer into a weight matrix of 0s and 1s, and use CompareACE  to calculate the Pearson correlation between the weights of this matrix and the weights of the known motif weight matrix; correlation coefficients > 0.65 are considered significant. Finally, for a given k-mer, we report the known motif for which the above hypergeometric p-value is the smallest.
Genome-wide binding locations were previously evaluated for 106 transcription factors in S. cerevisiae . For each transcription factor, we retain the set of ORFs with p-value < 0.001 (see  for details of the error model). To evaluate a given k-mer with respect to chromatin IP, we evaluate the statistical significance of the overlap between the conserved set of the considered k-mer and the set of ORFs defined as described above corresponding to each transcription factor. We report the most significant chromatin IP enrichment, with hypergeometric p-value < 10-4.
The 309 weight matrices and corresponding consensus patterns for known transcription factor binding sites were downloaded from [90,91]. k-mers were then simply matched to the consensus patterns. We eliminated consensus patterns that match too many k-mers, by matching each of them to all (8,170) 7-mers and removing consensus patterns that matched more than 50 7-mers.
Expression data for all species considered were downloaded from diverse sources [92,93]. Overall, we downloaded 765 microarray conditions for S. cerevisiae, 555 conditions for C. elegans, 156 conditions for D. melanogaster, and 1,384 conditions for H. sapiens. We use these expression data in the following way.
We evaluated the over-representation of each k-mer in the upstream regions of genes that are themselves over- or underexpressed in certain microarray conditions. Over- or underexpressed genes are operationally defined as having a log ratio of intensity above average plus two standard deviations, or below average minus two standard deviations, respectively (averages and standard deviations are calculated for each condition; using fold changes instead of standard deviations produced roughly the same results). To evaluate the over-representation of a given k-mer in a given microarray condition, we defined as O1 the set of overexpressed genes in this condition, and as O2 the set of ORFs whose upstream regions contain at least one occurrence of the considered k-mer, genome-wide. Then, we evaluated the significance of the overlap between O1 and O2 using the hypergeometric distribution, as described above. Overlaps whose hypergeometric p-value is smaller than 0.05 (after Bonferroni correction) were considered significant. We proceeded separately with the set of underexpressed genes in the same way. The total number of microarray conditions (overexpressed plus underexpressed) for which a k-mer was found to be significantly over-represented is reported. Note that we do not use the conserved set for the considered k-mer here, as we do not want to restrict our analysis to orthologous genes. Indeed, except for yeast, microarrays often contain only a fraction of all genes within the considered organism. In these cases, the overlap between conserved sets and over- or underexpressed genes can be very small, reducing statistical power. Using all genes, therefore, increases our power to detect significant associations, while retaining a uniform approach for all species considered.
As shown in  for the yeast RAP1 transcription factor, some transcription factors bind intergenic regions much more frequently than they bind coding regions. Consequently, it is expected that sequences corresponding to regulatory elements are more often present in intergenic regions than in coding regions. To evaluate this bias, we calculate the ratio of the number of genes that have the k-mer in their upstream regions over the number of genes that have the k-mer in their coding regions (using only exons), and we correct this ratio using the average length of the upstream and coding regions.
The FastCompare implementation, all the sequences, and results are available on our website .
We thank David Stern, Mike Beer, Chang Chan, Yir-Chung Liu and two anonymous reviewers for providing helpful comments on the manuscript, Mike Beer for providing weight matrices for known transcription factors in yeast and the other members of the Tavazoie laboratory for helpful discussions. S.T. is supported in part by grants from NSF CAREER, DARPA, and NIH.