Model for Phylogenetically Unrelated Sequences
In order to motivate and explain our model for phylogenetically related sequences it is helpful to first introduce the model for sequences that are not phylogenetically related. In this context, “not phylogenetically related” means that for any pair of sequences in the input data, their common ancestor sequence is sufficiently far in the evolutionary past that mutations have been introduced multiple times at each position in the sequences. That is, any similarity left between the input sequences cannot be due to evolutionary proximity.
We assume that our data contain an unknown number of sites for an unknown number of different TFs. The state space of possible solutions to the problem of identifying the binding sites contained in these sequences consists of all possible ways in which one can assign groups of binding sites to these sequences. An example of such binding site assignments, which we call “configurations,” is shown in .
Assuming that the width of the binding sites is m bases, each contiguous segment of m bases in our dataset is a potential binding site. We call such m-base segments “windows” and will generalize this concept later for phylogenetically related sequences. We label the WMs for the different TFs with “colors” one, two, etc., and use the color zero to indicate nonfunctional “background” sequence. Thus, formally, a configuration C is an assignment of nonzero colors to a particular set of nonoverlapping windows. Note that the requirement that colored windows cannot overlap means that a colored window “blocks” up to 2(m −1) others.
Given a dataset of input sequences S, our model assigns to each possible configuration C a posterior probability P(C|S). Using Bayes's theorem we have
) is the prior probability of configuration C
. As described in Materials and Methods
, PhyloGibbs provides for priors that incorporate prior information on the likely number of motifs and binding sites in the input sequences. The likelihood function P
) gives the probability that, for each color in C,
all sequences of the same color were sampled from a common WM, multiplied by the probability of the remaining sequences under a “background” model. That is, we formally have
where we denote all sequence that is colored zero in the configuration C
by S C,
the background model for this sequence by P
), the set of sequences in windows with color c
and the probability that all sequences in Sc
were drawn from a common WM by P
). Expressions for all these quantities are derived in Materials and Methods
. The probability P
) that all sequences in color c
were sampled from the same WM is obtained by taking the probability P
) that all sequences Sc
were sampled from a particular WM w,
and integrating this over all possible WMs w
When considering datasets that contain phylogenetically related sequences, such as orthologous intergenic regions from related species, the main problem is distinguishing sequence similarity that is due to evolutionary proximity from sequence similarity that arises from functional constraints. That is, when calculating the probability P(S) that a group of sequences S are all binding sites for a common WM, we should treat sequences that are orthologous by a different model than those that are not phylogenetically related. We derive such a model below. In order to apply this model we also have to determine which parts of the input sequences are orthologous, i.e., we need to provide a multiple alignment. We could in principle let the algorithm search both the space of multiple alignments and the space of binding site configurations C at the same time, but we decided that this state space is likely too large to search effectively. Moreover, for closely related species large segments of the orthologous intergenic regions can be unambiguously aligned, and by pre-aligning these we significantly reduce the search space for the algorithm.
Our strategy is thus to first produce a multiple alignment and then search the space of binding site configurations that are consistent with this alignment. Standard global multiple alignment algorithms [29
] can be used for this task, and PhyloGibbs can be run on the outputs they produce. However, as discussed in the Introduction, alignments of orthologous intergenic regions from related species (such as the recently sequenced Saccharomyces
]) show a mosaic pattern of well-conserved blocks interspersed with stretches of unconserved sequence, and global alignment algorithms may spuriously align many of these phylogenetically unrelated parts. Binding sites are typically not restricted to conserved regions [18
], and spurious alignment of phylogenetically unrelated regions may hamper the discovery of binding sites contained within them. Our strategy is thus to only align those parts of the intergenic regions that are clearly orthologous and can be unambiguously aligned, and to leave the rest unaligned. In searching binding site configurations we demand consistency of the configuration with the aligned orthologous blocks, but at the same time also allow windows to be placed in unaligned parts of the sequences. Such syntenic local multiple alignments are produced by the algorithm Dialign [22
], and we used this algorithm for producing the multiple alignments of the datasets that we report on below (see [32
] for a recent review of the performance of various alignment algorithms on orthologous intergenic DNA). Another algorithm that produces syntenic local multiple alignments and is especially suited for large genomic regions is Threaded Blockset Aligner [33
Once we have a syntenic multiple local alignment, we treat columns of aligned bases as phylogenetically related, i.e., arising from a common ancestor base. The state space again consists of all possible configurations of binding sites but now with the constraint that “windows” that include aligned bases have to extend over all sequences in the alignment. That is, we assume that if a binding site occurs in a sequence segment that is aligned with sequence segments from the other species, then binding sites for the same TF have to occur in the corresponding positions of these other sequence segments. To this end we extend the concept “window” (denoting a position of a potential binding site) to multiple local alignments, as illustrated in .
An Alignment of Four Sequences Showing Three Legitimate Windows and One Illegitimate Window
The figure shows a sample stretch of four aligned sequences, where uppercase letters are aligned and lowercase letters are “independent.” In an initial pass the program identifies the set of all legitimate “windows” in the entire sequence data. Each of these windows may encompass one or more sequences. The windows must contain consistently aligned uppercase letters: there should not be “gaps” that give inconsistent spacing between aligned uppercase letters. For example, in , the sequences boxed with solid lines show consistent windows, whereas a dotted line boxes an inconsistent window. Note that, as in the leftmost window in the figure, lowercase letters can be used to complete a window in which only some letters are uppercase. The details of identifying the set of all legitimate windows in the sequence data are described in Materials and Methods
. At the end of this procedure, we have a list of windows that represent potential sites for TF binding sites; some of these windows contain only one sequence and represent a potential independent occurrence of a binding site, while others contain multiple sequences and represent potential binding sites that evolved from a common ancestor site and were conserved across multiple species.
Next, we need to generalize our probabilistic model to multiply aligned orthologous sequences. For the single-sequence windows of the previous section, the probability P(s|w) of observing the sequence s given a WM w is simply given by
the base at position i
. For a window overlaying a region of multiply aligned sequences, the single base si
is replaced by the set of bases Wi
in the i
th column of the alignment. We now replace the probability
with the probability P
) that the bases Wi
in the i
th column of the window derive from a common ancestor base, and have been evolving under the selective constraints set by the requirement that they remain binding sites for the TF represented by WM w
. Our evolutionary model assumes that mutations are introduced at a fixed rate and that the probability for selection to fix a mutation toward base α in the i
th column of a site is proportional to the WM entry wαi
. Since the positions are mutually independent we have for the whole window
The probability P
) that all windows W c
in color c
evolved under the constraints set by WM w
is simply given by the product
and the probability P
) is again given by an integral of P
) over the space of all possible WMs. The background score P
) for a window W
is given by the exact same expression as P
) except that the WM entries wαi
are replaced with the background probabilities for the bases in each column. Detailed derivations and explicit expressions are provided in Materials and Methods
Strategy: Anneal and Track
By repeating moves from the move set described in the previous section PhyloGibbs will, in the limit of long time, sample each configuration C
according to its posterior probability P
). Even though the distribution P
) represents everything that can be inferred from the data using our model, we still have to decide how to usefully summarize this information. There is, to our knowledge, currently no generally satisfactory solution to this problem (see the discussion in [28
]). One would like the summary to report all the relevant features that are shared by the configurations C
with high posterior probability P
). In particular, one would like to identify groups of windows that with high probability share a color. Given a reference set of windows, it is straightforward to check what fraction of the time different subsets from this reference set are colored the same, and the fraction of the time that other windows are co-colored with windows in this reference set. However, one obviously cannot check this for all possible subsets of windows, and we thus have to decide what reference sets we want to “track.”
Our current strategy is to first search for the configuration C*
that has the globally maximal posterior probability P
) and to take the sets of co-colored windows in this configuration as the reference sets. We use simulated annealing to search for this globally optimal configuration C*
, which we call the “reference configuration.” Instead of sampling configurations from the distribution P
) we raise each probability to the power β and sample from the posterior
. Initially we set β = 1 and slowly increase β with time until the sampler “freezes” into a configuration C*
with (at least locally) maximal probability P
The annealing phase is followed by a tracking phase in which we sample from the distribution P
) in order to estimate, for each window w
and each color c,
the probability p
) that window w
belongs to the same motif (color) as the windows in color c
of the reference configuration C*
. The probabilities p
) are estimated as follows. After every cycle of moves we compare the current configuration C
with the reference configuration C*
. Specifically, for every color c
in the reference configuration C*
we determine which color
of the current configuration C
most closely (allowing for shift and reverse-complement of the windows in
with respect to the windows in c;
see Materials and Methods
). For every window w
(properly shifted to align with c
), we add a count of one to the number of times n
) that w
is associated with color c
. At the end we set p
) = n
the total number of cycles.
In its default mode of operation PhyloGibbs reports the following results for the input set of sequences S: (1) the configuration C* that has maximal posterior probability P(C|S), (2) an inferred WM for each color c in configuration C*, (3) for each color c, all windows w for which p(w,c) ≥ pmin, with pmin a cutoff that the user can specify, and (4) a WM for each color c from reference configuration C* that is inferred by weighing the member windows w with their membership probabilities p(w,c).
Note that, in general, different member windows w of color c in reference configuration C* will have very different posterior probabilities p(w,c) to be members of the motif. In addition, for each color c the tracking will typically uncover windows w that were not a member of color c in configuration C* but that with reasonable probability p(w,c) belong to the motif as well. One inherent limitation of our anneal-and-track strategy is that it can in principle miss a group of windows that are co-colored a significant fraction of the time in P(C|S) but that do not occur in configuration C*. Trivially, if the user allows for too few colors, some motifs may be missed. It generally does not hurt the results to overestimate the number of colors (different motifs), although it increases running time. Similarly, it is also not necessary to accurately estimate the total number of sites. If the prior overestimates the total number of sites, some spurious sites in the reference configuration will simply fail to track. Similarly, if the total number of sites is underestimated, some sites that were missed in the reference configuration will be picked up in the tracking phase. To give maximal flexibility to the user, the algorithm can also, instead of doing an anneal, be given an input reference configuration and track the motifs in this configuration.
As far as we are aware, PhyloGibbs is the only motif-finding algorithm that rigorously assigns posterior probabilities p
) to the binding sites that it reports by sampling the entire space of binding site configurations. The Gibbs motif sampler [4
] WGibbs samples the space of configurations while keeping track of the best configuration C*
that it has seen, i.e., it does not use annealing. It also reports posterior probabilities for the sites it reports, but these are based on sampling of only a small subspace of configurations around the configuration C*
. Algorithms that search the space of WMs [6
] use EM to find an optimal WM explaining the data and then assign posterior probabilities to reported sites by assuming this WM is correct. That is, they do not take into account the (likely) possibility that the WM found through EM is not entirely correct.
As we show below, only the rigorous sampling of the space of all configurations as implemented in PhyloGibbs is capable of assigning realistic posterior probabilities to the sites it reports. It is sometimes attempted to identify “significant” motifs by simply rerunning one or several different motif-finding algorithms and looking for recurring motifs. However, this merely generates the subsidiary problem of clustering the multiple predictions. Instead of using ad hoc scoring schemes for clustering, reported binding sites should ideally be clustered using the same probabilistic scoring that generated them, i.e., as in [28
]. PhyloGibbs generalizes the binding site clustering algorithm of [28
] and subsumes any problems of clustering.
Performance on Synthetic Data: Anneal
In general there are three qualitatively different issues that contribute to the performance of motif-finding algorithms on real data. First, all motif-finding algorithms make assumptions about the data that will ignore at least some of the complexities of real data. The performance of a given motif-finding algorithm will depend on the extent to which these ignored complexities affect the algorithm's ability to perform its task. Second, the search spaces of all possible WMs or all possible binding site configurations are too large to search exhaustively, and therefore all algorithms employ heuristic methods to search for the globally optimal WMs or configurations. The extent to which the heuristic methods succeed or fail will also affect the performance of the algorithms. Third, even if the data adhere to all assumptions that an algorithm makes, and the algorithm successfully finds the global optimum in the search space, this still does not guarantee that the algorithm will recover the correct motifs and sites. That is, if the motifs are fuzzy and the sites are embedded in long background sequences it might occur that, by chance, the background contains sets of sites that are more conserved and more similar than the embedded sites. In this case it will be impossible for any algorithm to recover the true sites.
By generating synthetic data to accord, as much as possible, with the assumptions that the motif-finding algorithms make, we can study the second and third issues separately from the first issue. In this section and the next we do a number of such tests. In our first test we want to evaluate to what extent PhyloGibbs can recover a fixed number of sites embedded in a perfect alignment of orthologous sequences as the quality of the WMs and the phylogenetic distances of the orthologs are varied. At the same time, we want to test how well PhyloGibbs performs when operating on perfect alignments compared to algorithms that do not take phylogenetic information into account and that cannot operate on multiple alignments (including PhyloGibbs in the mode where it ignores phylogenetic information). This test will indicate how much performance can be improved by using phylogenetic information and multiple alignments in an ideal situation. For ease of reference, from now on we refer to all algorithms that use phylogenetic information and that can operate on multiple alignments as “phylo” algorithms, while referring to algorithms that treat all sequences as independent as “non-phylo” algorithms.
For our first test we generated synthetic datasets as follows. (1) We first generated a WM of width w
. For each position in the WM we picked a random “consensus” base, set the probability of that base to p,
and set the probabilities of the other bases to (1 − p
)/3. The WM was thus parametrized by p,
which we call its “polarization.” For some datasets we also generated random WMs by picking, for each position, a random distribution (wa, wc, wg, wt
) uniformly from the simplex
. Note that for these random WMs the “polarization” varies from column to column. (2) We sampled s
sites from the WM and embedded them in a random sequence of length L
. This sequence formed the ancestor sequence. (3) We then generated S
descendant sequences as follows. For each base the descendant's base was equal to the ancestor's base with probability q
. With probability 1 − q
we mutated the base. Outside binding sites, a mutation replaced the ancestor base with a randomly chosen base. Within binding sites, a mutation replaced the ancestor base with a new sample from the WM. Because the WM is generally biased toward particular bases this results in more conservation within binding sites than outside them. We refer to the parameter q
as the “proximity” of the descendants to their common ancestor. (4) We measured performance by the overlap between the predicted sites in the reference configuration C*
and the embedded sites. Formally, we counted the number of bases in the intersection of predicted and embedded sites and divided by the total number of bases in embedded sites (which, in these tests, equals the total number of bases in predicted sites). A performance of one thus corresponds to a perfect overlap of the embedded and predicted sites.
We compared the performance of PhyloGibbs with those of non-phylo algorithms on alignments of S = 5 orthologous intergenic regions of length L = 500 containing s = 4 sites for a single motif, as a function of the WM polarization p and the proximity of the descendants q. The results are shown in .
Performance of PhyloGibbs and Non-Phylo Motif-Finding Algorithms on Alignments of Orthologous Intergenic Regions as a Function of the Evolutionary Proximity of the Orthologs and the Quality of the WM
All algorithms assume that the data are a mixture of random uncorrelated background sequences and samples from a number of WMs of certain lengths. With the exception of the phylogenetic relationship of the sequences, which is ignored by the non-phylo algorithms, the synthetic data are thus in complete accordance with the assumptions that each of the algorithms make. For each algorithm we specified the correct length and number of sites. Since when using PhyloGibbs with phylogeny the windows extend over all five sequences in the alignment, we asked PhyloGibbs to predict four multi-sequence windows for a single motif, while we asked the non-phylo algorithms to search for 20 single-sequence sites for a single motif. Since for any algorithm, the performance differs substantially between input datasets that were generated with the same parameter settings, we averaged results over 50 datasets and in show two standard errors (dotted lines) around the average performance (solid lines).
All non-phylo algorithms, including PhyloGibbs when phylogeny is turned off, perform roughly equally well (or badly). For highly polarized WMs all non-phylo algorithms perform quite well. In contrast, for low polarizations (p = 0.6 in the upper left panel) or for random WMs (lower right panel), all non-phylo algorithms perform hardly better than random predictions would do (the four embedded binding sites cover 8% of the input data). The second thing to note is that, especially as phylogenetic distance between the orthologs increases (lower q), PhyloGibbs performs substantially better than the non-phylo algorithms. As the amount of conservation due to evolutionary proximity becomes larger than the amount of conservation due to functional constraints, i.e., as q becomes larger than p, the performance drops and the performance of PhyloGibbs becomes only marginally better than that of the non-phylo algorithms.
It is important to point out that PhyloGibbs's superior performance for these data is partly due to the fact that the five sequences have been perfectly aligned and that it is searching only through configurations that are consistent with this alignment. In contrast, the non-phylo algorithms treat the five sequences as independent and have to search a much larger space of configurations. For real data we of course do not have perfect alignments and it will generally be hard to obtain good alignments when the proximity q becomes small, which will negatively affect the performance of PhyloGibbs. This effect is seen below in our tests on real data.
In Figure S1
we show the results of additional tests analogous to those shown in . In these tests we chose the WMs and phylogeny to mimic as closely as possible the situation in yeast, whose real data we study below. That is, instead of creating random WMs we used “real” WMs of yeast TFs with known binding specificities, and we used phylogenetic distances for the five descendants that are proportional to the phylogenetic distances of the Saccharomyces
sensu stricto species that we use in our tests on real data below. The results with these more realistic synthetic datasets are quantitatively very similar to the results shown in the upper right panel of .
It might also be asked if the non-phylo algorithms are put at a disadvantage by the fact that they have to search for a much larger number of sites. That is, to get a 50% performance PhyloGibbs needs only to get two multi-species sites correct, whereas the non-phylo algorithms need to get ten sites correct. To test this we ran MEME and WGibbs on single sequences, as opposed to groups of S = 5 orthologs, and asked the algorithms to find only s = 4 sites instead of sS = 20. The non-phylo algorithms MEME and WGibbs performed extremely poorly in this mode. It thus appears that, at least for this kind of synthetic data, the benefit of having more sites outweighs the negative effects of having more sequence to search through.
Although PhyloGibbs performed consistently better than the non-phylo algorithms, in many cases it recovered only a fraction of the embedded sites. Since the synthetic data were generated exactly according to the model that PhyloGibbs assumes, there are only two possible reasons for the failure of PhyloGibbs to recover the embedded sites. The first possibility is that the correct configuration, i.e., with the four binding sites occurring at the positions where they were embedded, is the globally optimal binding site configuration, but that the anneal phase failed to identify it and instead settled on an only locally optimal configuration. In that case the posterior probability P(Ccor|S) of the correct configuration Ccor should be higher than the posterior probability P(C*|S) of the configuration C* that the anneal obtained. The second possibility is that the anneal did identify the globally optimal configuration, and that this configuration C* has higher posterior probability P(C*|S) than the posterior probability of the correct configuration P(Ccor|S). This can happen when, by chance, positions outside of the embedded binding sites are more conserved and/or better match a common WM than the embedded sites, making it impossible for any algorithm to recover them correctly.
To investigate how often the anneal in PhyloGibbs identifies the globally optimal configuration, we compared P(C*|S) with P(Ccor|S) for the runs with proximities q = 0.2, q = 0.5, and q = 0.8, shown in the lower right panel of . For each value of q there were 50 independent datasets generated, and PhyloGibbs was run five times on each dataset. Thus, for each value of q there were 250 runs in total. We found that P(C*|S) ≥ P(Ccor|S) for 245 of the 250 runs for q = 0.2, for 243 of the 250 for q = 0.5, and for all 250 runs for q = 0.8. Thus, for 98.4% of the runs, the posterior probability of the configuration found in the anneal was at least as high as the posterior probability of the correct configuration.
In conclusion, these first tests with synthetic data showed that, when sites from WMs are embedded in a random ancestor sequence, and PhyloGibbs is given a perfect alignment of a set of descendants of this sequence, it performs significantly better than algorithms that treat the descendants as independent sequences. It also shows that as the similarity between sites becomes less than or equal to the similarity between orthologous sequences due to evolutionary proximity, it becomes impossible for any algorithm to accurately recover the sites.
In the first test PhyloGibbs used both information from the overrepresentation of a motif in the data and information about the conservation of its sites. We next investigated how many species one would need, in an ideal situation, to reliably infer the location of a single binding site using conservation only. To test this we generated synthetic intergenic regions of length L = 500 that contained a single site for a single randomly chosen WM of width w = 10 and created alignments of S descendant sequences with proximity q = 0.5. We then ran PhyloGibbs for different values of S, asking it to search for a single multi-sequence window in the alignment. The results are shown in .
Performance of PhyloGibbs in Recovering a Single Site of a Randomly Chosen WM of Width w = 10 from the Alignment of S Orthologous Intergenic Regions of Proximity q = 0.5 and Length L = 500 as a function of S
We see that more than ten species are needed to have a 50% probability to recover a single site of a random WM at q = 0.5, and that as many as 25 species are necessary to recover the site with 90% probability. Note that significantly more species are necessary to recover a single site than are necessary to recover a group of multiple sites from the same WM. That is, in the lower right of about 40% of the quartet of sites is recovered at q = 0.5, whereas in at S = 5 the single site is recovered with a probability of only 0.15. In conclusion, this test has shown that if enough species are available in which a given binding site occurs, and these can be reliably aligned, then even individual sites can be reliably recovered by PhyloGibbs.
Performance on Synthetic Data: Tracking
In the next section we compare the performance of PhyloGibbs and other algorithms that use phylogeny (PhyME [25
] and EMnEM [24
]) on real data from five Saccharomyces
species. To compare and contrast with those tests we here create synthetic datasets that are comparable with those real datasets but that are idealized in the sense that they respect the assumptions that the various algorithms make about the data as much as possible. This test allows us to directly compare the performance of the phylo algorithms on idealized data, and the extent to which they outperform non-phylo algorithms in this idealized setting.
As mentioned in the previous section, our synthetic data are in accordance with all assumptions that the non-phylo algorithms make about the data, except of course for the phylogenetic relationships between the sequences. Our synthetic orthologous sequences are generated in accordance with the evolutionary model that PhyloGibbs and PhyME assume. For these two algorithms the synthetic data are thus in exact accordance with the assumptions that these algorithms make. EMnEM employs an evolutionary model that uses the same substitution matrix both within and outside of sites, but allows each position in a binding site to evolve at a different overall rate. This model is thus less realistic than the model that PhyME and PhyloGibbs use in that it ignores that the probabilities of different substitutions within a site depend on the site's WM. However, since it has more free parameters that can be fitted, we suspect that in practice it will be able to reasonably approximate the evolutionary model that PhyloGibbs and PhyME use.
We followed the estimates of [35
] and created 250 synthetic datasets, each containing nine binding sites, sampled equally from three random WMs for each intergenic region (see for details). We assessed the results with a plot of specificity (fraction of predictions matching true sites) versus sensitivity (fraction of true sites that were recovered). All the algorithms report posterior probabilities (or a p
-value) for the sites they report, which we used to rank the predictions. We pooled the predictions from all datasets and then generated successively larger lists of predictions by including all predictions over a given posterior probability. For each list we then determined the specificity and sensitivity and plotted them in (see Materials and Methods
Performance of Several Motif-Finding Algorithms on Synthetic Data Prepared as for
The algorithms that ignore phylogeny did not recover more than 16% of the true sites (sensitivity), and did so with a nearly fixed specificity of around 20%, meaning that there is not much enrichment in true sites in the top versus the bottom of their ranked lists. The algorithms that exploit the phylogeny all did better for the simple reason that they operate on the perfect multiple alignments and therefore their search space is much smaller. Of these three algorithms PhyloGibbs performed best. PhyME, in common with the non-phylo algorithms, reports a very limited range of posterior probabilities for the sites it reports, which leads to a relatively small “dynamic range” in sensitivity/specificity.
Note that even with this perfectly aligned data, to get 50% of the true sites, PhyloGibbs needed to make more than twice as many predictions. Again, to determine to what extent the failure of PhyloGibbs to recover all the embedded sites was caused by the anneal getting trapped in locally optimal configurations, we compared the posterior probabilities P(C*|S) of the reference configurations with those of the correct states P(Ccor|S). We found that P(C*|S) was greater than or equal to P(Ccor|S) for all 250 datasets.
These synthetic data also provide the opportunity to test how well the algorithms assess their reliability, i.e., how well the reported posterior probabilities for their predictions match the specificities (fraction of predictions matching true sites) we compute by knowing the true sites. Ideally the two are the same, so that for real data one could use the posterior probabilities to gauge the fraction of correct predictions. The right panel of shows that with the exception of PhyloGibbs (with and without phylogeny) all algorithms were much too optimistic about the quality of their predictions: EMnEM and MEME gave posterior probabilities larger than 95% when their specificity was around 35%, and WGibbs gave posteriors of 90% when its real specificity was only 20%. MEME is not included in the right panel of because it reports p-values instead of posterior probabilities.
Both EMnEM and PhyME overestimated their specificity because they calculate their posterior probabilities for sites under the assumption that the WMs that they infer are correct. In reality, the inferred WM will often not match the true WM that generated the data. For WGibbs the overestimation stems from the restricted sampling of configurations around the one that gave the maximum posterior probability during sampling. Only PhyloGibbs bases its posterior on sampling of the whole space of binding site configurations.
In Figure S2
we present the results of a test analogous to the one in , using again “real” WMs from yeast instead of random WMs, and using the proximities of the sensu stricto Saccharomyces
species instead of proximity 0.5 for all descendants. All algorithms performed substantially better in these tests, which is in all likelihood mostly because of the higher information scores (see Materials and Methods
) of the yeast WMs compared to random WMs. All phylo algorithms still outperformed the non-phylo algorithms, and of the phylo algorithms, PhyloGibbs performed significantly better than PhyME and EMnEM. In contrast to the test shown in , PhyME clearly outperformed EMnEM on this test.
In summary, these tests have shown that, given perfectly aligned input sequences, all phylo algorithms substantially outperform non-phylo algorithms. The tests have also shown that, on data that are in accordance with the assumptions that the phylo algorithms make (almost all for EMnEM), PhyloGibbs outperforms the other algorithms. In addition, only PhyloGibbs is capable of reasonably estimating the reliability of its own predictions.
Results on the Yeast Genome
To test the performance of PhyloGibbs and other algorithms on real data we decided to use data from the recently sequenced yeast genomes [15
]. For our first test on these data we used the list of documented binding sites for yeast TFs in the Saccharomyces cerevisiae
promoter database (SCPD) [26
]. After “clean up” this list contains 466 binding sites upstream of 200 different genes with a little less than 30 bp in sites per intergenic region. Based on recent estimates [35
] this probably corresponds to roughly a third of all the binding sites in these upstream regions. This dataset of experimentally verified sites allows us to quantitatively compare the abilities of the different algorithms to recover true binding sites using only the orthologous sequences of the five sensu stricto species as we did with the synthetic data in . By comparing the performance on the real data with the performance on synthetic data we also learn about the effect, on the various algorithms, of the complexities in the real data that are not captured by the assumptions that the algorithms make.
For each of the 200 genes, we gathered its upstream region together with the orthologous upstream regions from S. paradoxus, S. mikatae, S. bayanus,
and S. kudriavzevii
. In complete analogy with the previous test on synthetic data, we ran PhyloGibbs with and without phylogeny, PhyME, EMnEM, WGibbs, and MEME on each of these 200 upstream regions. For PhyloGibbs we used Dialign [22
] alignments of the upstream regions. PhyME uses the MLAGAN [29
] software for its multiple alignments, and EMnEM uses ClustalW alignments [30
]. Each of these algorithms was asked to search for three motifs and an expected three sites per motif (nine sites in total). The non-phylo algorithms were also asked to search for three motifs and to expect ten sites per motif. (We experimented with other site numbers, and ten gave the best overall results for the non-phylo algorithms.) For the phylo algorithms we also needed to specify the phylogenetic tree. We approximated the topology of the sensu stricto species by a star topology and set the branch lengths based on recent estimates of conservation rates at third positions in 4-fold degenerate codons between these genomes [35
] as described in Materials and Methods
The left panel of shows the performance of the algorithms on this dataset analogously to the left panel of . We see that, as on the synthetic data, PhyloGibbs outperformed all other algorithms on the real data. In contrast to the performances on the synthetic data, the difference between the performances of the phylo and non-phylo algorithms was much less pronounced. At very low sensitivity, PhyloGibbs run without phylogeny performed almost equally well as PhyloGibbs with phylogeny. PhyME had high sensitivity and outperformed both EMnEM and the non-phylo algorithms at this sensitivity, but it seemed unable to make very specific predictions. EMnEM did not perform better than any of the non-phylo algorithms on these data.
Performance of Several Motif-Finding Algorithms on 200 Alignments of Orthologous Intergenic Regions from Five Saccharomyces Species Containing Documented Binding Sites
We believe that one important factor contributing to the smaller difference between the phylo and non-phylo algorithms is the limited reliability of the multiple alignments. Since all phylo algorithms only sample configurations consistent with the alignment, any errors in the alignment will hurt their performance. Another factor that probably plays a role is that all phylo algorithms assume that when a site occurs in a conserved block, the site must occur in all species. This is probably not always true, i.e., there are cases where only some of the sequences in an aligned block have retained the site. The non-phylo algorithms can easily deal with this by placing windows only on those sequences that have retained the site, but the phylo algorithms cannot, and a block with several binding sites may be “spoiled” by a single sequence that is missing the site.
All specificities in are significantly lower than those for the synthetic data in . There are, of course, many differences between the synthetic and real data (the real data have more complex background, WMs of varying widths, varying numbers of motifs and sites, etc.), but we believe the main reason for the much lower specificity is that the specificities are based on counting only documented sites, and that many true binding sites are not yet documented. There is no reason to believe that the algorithms are more likely to recover documented binding sites than they are to recover true but undocumented binding sites. The reported specificities sr counting only documented sites thus likely underestimate the true specificities st by a factor that roughly corresponds to the fraction fd of all true sites that are documented. That is, assuming that all algorithms are equally likely to recover true but undocumented sites as they are to recover documented sites, and assuming that the algorithm's true specificity st matches the specificity sp that it predicts itself, we have
the reported specificity on documented sites as shown in the left panel of . This implied value fd
is shown in the right panel of as a function of the predicted specificity sp
. We see that PhyloGibbs predicted a fraction fd
that was relatively insensitive to sp
and lay between 33% and 45% over a wide range. PhyloGibbs without phylo predicted an fd
in the range of 25% to 40%. Both of these predictions are consistent with the independent estimate [35
] that the documented sites represent about 33% of all true binding sites (black line). As with the synthetic data, these results suggest that PhyloGibbs's own assessment of the reliability of its predictions is fairly accurate. Thus, while 18.5% of all predicted sites at a sensitivity of 50% matched documented binding sites, the true specificity is probably somewhere around 55%, and the predicted sites at sensitivity 10% are likely almost all real (A rescaled version of the left panel of is shown in Figure S3
). In contrast, the values of fd
obtained for the other algorithms were all unrealistic, for reasons that we already discussed above.
All binding sites that PhyloGibbs predicted in the upstream regions of the genes with one or more sites in SCPD are listed in Dataset S1
. They can also be viewed at [36
Inferring Yeast's “Regulatory Code”
In the previous sections PhyloGibbs inferred the locations of regulatory sites in one intergenic region at a time. Although sites for a given TF often occur in multiple copies in a single intergenic region, there are also many cases where only a single site occurs, and in those cases PhyloGibbs has to rely on conservation alone to infer the locations of the regulatory sites. However, PhyloGibbs is not limited to run on a single multiple alignment of orthologous intergenic regions but can also run on a set of multiple alignments for co-regulated genes, which should significantly increase sensitivity and specificity.
To test the performance of PhyloGibbs in this setting we used data from a recently published [27
] draft transcriptional “regulatory code” of S. cerevisiae
. Harbison et al. [27
] performed ChIP-on-chip experiments with 203 different TFs from S. cerevisiae
to identify the intergenic regions that are bound by each of them. A suite of six motif-finding algorithms was run on these intergenic regions (several algorithms also used the orthologous regions from other species), and the results were then clustered to arrive at a consensus WM for each TF. When no motif was found computationally for the intergenic regions pulled down, the literature was used, whenever possible, to define a motif. This led to predicted WMs and binding sites for 102 TFs.
We tested PhyloGibbs on the highest confidence set of intergenic regions regulated by each factor. We focused on the 45 TFs that had the fewest binding sites annotated in [27
] (a minimum of three and a maximum of 25) since these are generally the most challenging to locate. For 21 of these 45 TFs, the six motif-finding algorithms employed in [27
] failed to find a significant motif in the input data, and the reported motif and sites are based entirely on a consensus obtained from the literature (of the remaining 57 WMs with more than 25 or less than three annotated sites, 16 were also solely based on literature.)
We first tested whether, in contrast to the motif-finding algorithms employed in [27
], PhyloGibbs was capable of recovering a significant motif that matches the literature consensus for these 21 TFs. For each TF we collected all intergenic regions that were annotated in [27
] to contain at least one binding site, collected their orthologs from the other sensu stricto Saccharomyces
species, produced multiple alignments using Dialign [22
], and ran PhyloGibbs on each of these sets of alignments. Since each of these collections of intergenic regions will typically contain binding sites for multiple TFs, we asked PhyloGibbs to search for three motifs, with a total number of sites equaling three times the number of annotated sites for the TF in [27
]. The results of this test are shown in .
Table 1 Results of PhyloGibbs on Collections of Intergenic Regions for 21 TFs for Which the Motif-Finding Algorithms in  Failed to Recover a Significant Motif but for Which a Literature Consensus Motif Is Available
We evaluated the results that PhyloGibbs reported for each TF in various ways. As described in Materials and Methods
, PhyloGibbs reports two WMs for each motif that it finds: one constructed from the configuration C*
at the end of anneal, and one by weighing the member windows of a motif with their membership probabilities obtained through tracking. We compared these WMs with the WM that is reported in the literature for each of these 21 TFs. We also compared the sites that PhyloGibbs reported, in both anneal and tracking, with the sites reported in [27
]. For each TF we ran PhyloGibbs twice, first with a motif width matching the literature consensus, and then with a default width of 15. We then determined which of the motifs that PhyloGibbs reported best matches the literature motif, and we report a number of statistics for this motif (). For example, the statistics for TF GCR1 show that, for one of the motifs reported by PhyloGibbs, both the anneal WM and the tracking WM have a probability 1.0 to match the literature WM. There are ten windows with this color in the anneal configuration C
*, of which six match sites annotated in [27
] for GCR1.
During tracking, on average 7.96 windows were members of this motif, and on average 5.17 of these members matched sites annotated in [27
]. A total of seven sites were annotated for GCR1 in [27
We see that for 16 of the 21 TFs, PhyloGibbs found a motif that matched, according to at least one statistic, the consensus motif known for this TF in the literature. PhyloGibbs thus apparently outperformed all of the motif-finding algorithms used in [27
] on these 16 datasets. For the top eight TFs in there is a good match between the WMs that PhyloGibbs reports and the literature WM as well as a significant overlap between the reported sites and the sites reported in [27
]. This agreement might suggest that these groups of sites almost exhaustively capture all sites genome-wide for these eight factors. To test this we picked one example, GCR1, and compared the reported sites with known sites from the literature. In [26
] there are six genes whose upstream regions have reported GCR1 sites (TPI1, CDC19, ENO2, ADH1, ENO1,
More recently, it was shown that there are additional GCR1 sites upstream of GKL1
]. Somewhat surprisingly, of these eight genes only one (CDC19)
is among the set of four upstream regions containing GCR1 annotated sites in [27
]. We ran PhyloGibbs on the eight upstream regions of TPI1, CDC19, ENO2, ADH1, ENO1, PGK1, GKL1,
and recovered a motif that perfectly matched the GCR1 literature consensus. This motif is represented with sites in all upstream regions, although the site upstream of GKL1
has a posterior probability of only 0.1. Thus, the sites PhyloGibbs found in these upstream regions are very likely true GCR1 sites. This indicates that the sites reported in [27
] are far from exhaustive.
The results for PUT3 seem paradoxical. All sites PhyloGibbs reported matched sites reported in [27
], but the WMs did not match. The reason for this is that the consensus for PUT3 used in [27
], CGG..........CCG, has a 10-bp spacer that is presumed to contain random bases. In contrast, the motif that PhyloGibbs inferred, CGGNNNNGGNTTCCCG, is much more specific. It has been established that PUT1
are directly regulated by PUT3 [39
]. The upstream region of PUT2
is indeed among the three upstream regions annotated with sites for PUT3 in [27
], but PUT1
is not. We added this upstream region to the collection of upstream regions for PUT3 and reran PhyloGibbs. PhyloGibbs again found the PUT3 motif, which now included two good binding sites upstream of PUT1
For MET31 the WMs matched reasonably well, and two out of three sites in configuration C*
matched, but the sites did not cluster stably during tracking. According to the literature [40
], MET31 directly regulates MET25, MET3, MET14, GSH1,
. None of their upstream regions are among the upstream regions annotated with sites for MET31 in [27
]. We ran PhyloGibbs on these five upstream regions and found a motif that matched the literature consensus that had sites in all five of these upstream regions that are stable under tracking. Thus, as with GCR1, all these sites are very likely true MET31 sites not annotated in [27
For ADR1 and MAC1, both reported WMs showed a significant match to the literature motif but the reported sites overlapped only marginally with the sites reported in [27
]. For ADR1 there are two genes that have known sites upstream (ADH2
]. Neither of these have annotated sites for ADR1 in [27
]. A recent microarray-based study [43
] lists 74 genes as under control of ADR1, of which only PXA1
occurs among the upstream regions with sites for ADR1 in [27
]. Both ADH2
occur in this list as well. We ran PhyloGibbs on the upstream regions of ADH2, CTA1,
and found a motif matching the ADR1 literature consensus and containing sites in all three upstream regions. Again these sites are thus very likely true sites for ADR1 that, except for the PXA1
sites, are not contained in the annotation of [27
]. For MAC1 a similar story applies. Binding sites for MAC1 are listed upstream of FRE1
], and CTR3
are additionally identified as targets in the literature [44
]. Of these only CTR1
is among the genes with sites for MAC1 in [27
]. Running PhyloGibbs on these four upstream regions recovered a motif that matched the MAC1 literature consensus perfectly and that had sites upstream of FRE1
It also had a site upstream of orthologs of CTR3
but, curiously, not in the S. cerevisiae
upstream region of CTR3,
which, equally curiously, did not align well with the upstream regions of its orthologs. PhyloGibbs found no site in CTT1
. This case warrants closer study.
For HAP5 and SKO1, only the anneal WM matched the literature WM. Although a reasonable number of windows occurred on average during tracking for these motifs, there was no stable core. Even the stablest window in each group was only present about 50% of the time. The membership of these groups thus fluctuated significantly during tracking, and this is reflected in the fact that the information score (see Materials and Methods
) of the tracking WM is much lower than that of the anneal WM. In [46
] five genes (AHP1, GLR1, GRE2, SFA1
are reported as direct targets of SKO1. None of their upstream regions occur among the upstream regions with annotated sites for SKO1 in [27
]. We ran PhyloGibbs on the upstream regions of these five genes and found a motif that matches the literature consensus and had sites in the upstream regions of all but SFA1
. Interestingly, the consensus of the motif PhyloGibbs reported, TTACGTAA, subtly differs from the literature consensus TGACGTCA. The PhyloGibbs consensus is still a palindrome but the second guanine and penultimate cysteine are replaced by thymine and adenine, respectively.
For GZF3 and RLM1 there was only a moderate match of the anneal WM to the literature WM, and no overlap whatsoever of the reported sites with the sites reported in [27
]. Coffman et al. [47
] report GAP1, DAL80,
as direct target genes of GZF3. Again, none of these are among the genes annotated with sites for GZF3 in [27
]. We ran PhyloGibbs on the upstream regions of these three genes and recovered a motif that significantly matched the literature motif and had sites upstream of all three genes. Interestingly, the motif that PhyloGibbs reported, GATWAGCGAT, while matching the literature consensus GATAAG, is longer and significantly more specific. Dodou et al. [48
] report RLM1, SMP1, HKR1, KTR2, HSP150,
as direct targets of RLM1. Of these, only HSP150
is among the genes with sites annotated in [27
]. We ran PhyloGibbs on the set of six upstream regions from [48
] and recovered a motif that reasonably matched the RLM1
literature consensus. The consensus of the motif that PhyloGibbs reported is WGCWAANNTTARAW, whereas the literature consensus is CTAWWWWTAG. PhyloGibbs found sites upstream of four (SMP1, HSP150, KTR2,
of the six genes, with multiple sites in front of HSP150.
Finally, there were five TFs (DAL80, MOT3, ROX1, YAP6, and YOX1) for which PhyloGibbs did not find any motif matching the literature motif among the intergenic regions from [27
]. DAL80 is one of a family of four GATA TFs that all bind motifs containing the consensus GATA. The experimentally best confirmed target genes for DAL80 are DAL3, DAL80
]. None of these have binding sites for DAL80 annotated in [27
]. We ran PhyloGibbs on the upstream regions of these four genes and found a motif with GATAAG consensus that had at least two sites in each of the upstream regions.
Hongay et al. [51
] suggest that ERG2, ERG6,
are direct targets of MOT3. Again, none of these upstream regions have sites annotated for MOT3 in [27
]. We ran PhyloGibbs on the upstream regions of these three genes but did not find a motif clearly matching the literature consensus.
Linde and Steensma [52
] found 25 genes in an expression array experiment that are likely targets of ROX1, among which are the three genes (CYC1, HEM13,
with binding sites in [26
]. None of these 25 genes are among the genes with ROX1 sites annotated in [27
]. We ran PhyloGibbs on the three upstream regions of CYC1, HEM13,
and recovered a motif that perfectly matches the literature consensus and had sites in each of the three upstream regions.
For YAP6 a consensus binding site has been established by in vitro studies of different YAP proteins binding DNA [53
]. As far as we can tell no clear target genes are known in the literature for YAP6.
Finally, Pramila et al. [54
] report 28 target genes for the homeodomain protein YOX1. Of these only SPO12
is among the genes with sites annotated for YOX1 in [27
]. YOX1 is known to interact directly with MCM1, and a motif for YOX1 was reported in [54
] that was found by first identifying the MCM1 binding sites in the upstream regions of the target genes, and then searching for an additional overrepresented motif near the MCM1 sites. We also ran PhyloGibbs on the upstream regions of these 28 genes and identified a motif with consensus ATTACWTTTCCYNAW. The right end of this consensus matches the MCM1
consensus tttCC.rAt..gg, and the left end corresponds to the standard homeodomain core ATTA. This motif has sites in about half of the 28 upstream regions. Note that Pramila et al. [54
] report a YOX1 consensus of yaATTa that differs from the consensus, AsAATA.TGAmr, that is reported in [27
shows the results of running PhyloGibbs on the remaining 24 TFs with fewer than 25 sites in [27
], where their computational methods did produce a motif. The table shows that for 18 of these, both the anneal and tracking WM from PhyloGibbs matched the WM in [27
], along with a significant fraction of the sites. As with the case of GCR1 above, one should not conclude from this that the reported sites in any way cover all the true sites for these TFs. For four TFs, a motif that PhyloGibbs reported had some overlap with the motif reported in [27
], and there was no meaningful overlap for only two cases: SPT2 and SPT23. We examined in detail only these two cases.
Table 2 Results of PhyloGibbs on Collections of Intergenic Regions for 24 TFs for Which the Motif-Finding Algorithms in  Found a Significant Motif
The protein SPT2 is involved in regulation of chromatin structure and is known to interact directly with the SWI/SNF complex and with histones. SPT2 has been reported to not have any sequence specificity in its DNA binding [55
], and more recently Novoseler et al. [56
] have proposed that SPT2 binds to two strands of double-stranded DNA at their crossing point. Moreover, the only well-established target of SPT2 that we found in our cursory survey of the literature was the HO
gene, and this gene is not among the genes annotated with sites for SPT2 in [27
]. We thus believe that the motif reported in [27
] is dubious. It is well established that SPT23 regulates OLE1
], but this gene is not among the genes with sites for SPT23 annotated in [27
]. Given that SPT23 does not seem to have a DNA-binding domain, it is likely that SPT23 functions as a cofactor and lacks specific DNA binding on its own.
In summary, PhyloGibbs, when run on the highest quality intergenic regions and their orthologs reported in [27
], found a motif that matches the literature consensus for 16 of 21 TFs, where the computational methods of [27
] failed. For 11 TFs (ADR1, DAL80, GCR1, GZF3, MAC1, MET31, MOT3, RLM1, ROX1, SKO1, and YOX1), where the correspondence was weakest or nonexistent, we extracted co-regulated groups of genes from the literature. In every case there was little or no overlap between the literature list and the set of regulatory targets claimed in [27
]. For all but one of the 11 (MOT3), PhyloGibbs found a motif that matched the literature consensus and reported sites in all or almost all of the upstream regions. Thus, when a solid motif was not found, the problem was likely with the set of intergenic regions in [27
], not with PhyloGibbs.
Detailed comparisons of PhyloGibbs's results with the annotations of [27
] are in Dataset S2
. A list of all binding sites predicted by PhyloGibbs for the 45 TFs is in Dataset S3
. They can also be browsed at [36