We developed an approach to identify conserved TFBSs that combines probabilistic models of binding-site specificity [20
] with probabilistic models of evolution [23
]. Starting with an alignment of sequences from multiple related species, we use the known sequence specificity for a transcription factor to compare the likelihood of the sequences under two evolutionary models - one for background and one for TFBSs. The central feature of this method that underlies its ability to identify conserved TFBSs is that it uses a specific probabilistic evolutionary model for the binding sites of each transcription factor. The evolutionary model we use for TFBSs [25
] assumes that sites were under selection to remain binding sites throughout the evolutionary history of the species being studied. This model uses the sequence specificity of the factor to predict patterns and rates of evolution that recapitulate the patterns and rates observed in real TFBSs [19
MONKEY: scanning alignments to identify conserved transcription factor binding sites
MONKEY, our tool for identifying conserved TFBSs, takes as input a multiple sequence alignment, a tree describing the relationship of the aligned species, a model of a transcription factor's binding specificity and a model for background noncoding DNA. It returns, for each position in the alignment, a likelihood ratio comparing the probability that the position is a conserved binding site for the selected factor compared to the probability that the position is background.
Extending matrix searches to multiple sequence alignments
For the model of binding specificity, we use a traditional frequency matrix [20
]. The values in the matrix - fib
- represent the probability of observing the base b
(A, C, G or T) at the i
th position in a binding site of width w
. For the model of the background, we use a single set of base frequencies gb
A widely used statistic for scoring the similarity of a single sequence to a frequency matrix is the log likelihood ratio comparing the probability of having observed a sequence X of width w under the motif model (a frequency matrix, designated as motif) to the probability of having observed X under the background model (designated by bg), which can be easily reduced to:
where Xib is an indicator variable which equals 1 if base b is observed at position i, and zero otherwise.
This classifier can be motivated by the approximation that the data are distributed as a two-component mixture of sequences matching the frequency matrix and sequences drawn from a uniform background. In practice, we compute this score using a position-specific scoring matrix (PSSM) with entries, Mib = log(fib/gb), and find S for a particular w-mer by adding up the entries that correspond to the bases in the query sequence.
In extending this to a pair of aligned sequences X
, we want to perform the same calculation on their common ancestor A
. Since A
is not observed, we consider all possible ancestral sequences by summing over them, weighting each by their probability given the data (X
), the phylogenetic tree (T
) that relates the sequences, and a probabilistic evolutionary model [23
We can write a new score representing the log-likelihood ratio that compares the hypothesis that X and Y are a conserved example of the binding site represented by the frequency matrix to the hypothesis that they have been drawn from the background:
where Rmotif and Rbg are rate matrices describing the substitution process of the binding site and background respectively. Using the conditional independence of the sequences X and Y on the ancestor, A, and writing TAX for the evolutionary distance separating sequence X from A, this becomes:
The class of evolutionary models used by MONKEY define a substitution matrix, p
) = eRt
, that represents the probability of observing each base at position i
in the extant sequence (X
) given each base in the ancestral sequence (A
) after t
units of evolutionary time or distance, given some rate matrix, R
]. Since these models retain positional independence, we can rewrite this as:
This can be extended to more than two sequences, that is,
, ..., Z
), by replacing the probabilities of X
with the probability with the left and right branches of the tree below, and performing the calculation at the root. The probabilities of the left and right branches of the tree can be calculated recursively as has been described previously [23
Once again, for practical purposes we can convert these scores to a PSSM, whose entries are given for the pairwise case by:
where at each position we now index by the bases a and b in the two sequences. For multiple alignments of n species, each position requires 4n entries.
The use of evolutionary models is critical to the function of MONKEY. Myriad of such models exist, and in principle all can be used in MONKEY. For the background, it is natural to use a model appropriate for sites with no particular constraint, such as the average intergenic or synonymous rates. MONKEY allows the use of the JC [26
] or HKY [27
] models, and here we use the latter with the base frequencies, rates and transition-transversion rate-ratio estimated from noncoding alignments assuming a single model of evolution over the noncoding regions (see details in Materials and methods). It is also possible to estimate the evolutionary model separately for each intergenic alignment, although the small size of yeast intergenic regions leads to variable estimates.
In principle, the JC and HKY models can also be used for the motif, with rates set according to our expectation of the overall rate of evolution in functional binding sites, which has been estimated as two to three times slower than the average intergenic rate [19
]. However, we have previously shown that there is position-specific variation in evolutionary rates within functional transcription factor binding sites [19
] and that positions in a motif with low degeneracy in the binding-site model evolve more slowly than positions with high degeneracy; this relationship between the equilibrium frequencies and the position-specific evolutionary rates is accurately predicted by an evolutionary model from Halpern and Bruno (HB model) [25
In using this model, we assume that sequences evolve under constant purifying selection to maintain a particular set of equilibrium base frequencies. The use of this model corresponds to a definition of a conserved TFBS as a sequence position where there has always been a binding site for the transcription factor. Although the model does not strictly require that a binding site be present in each of the observed species, positions lacking such sites will have lower probabilities as they require the use of less probable substitutions. The rate of change from residue a to b at position i in the motif is given by:
where Q is the (position independent) underlying mutation matrix, which we set equal to the background model (Q = Rbg), and f is the frequency matrix describing the specificity of the factor. Thus, for each position in the motif, the HB model predicts the rates of each type of substitution as a function of the frequency matrix, and the background model.
Comparing hits for different factors and evolutionary distances: computing the null distribution
To compare scores from different evolutionary distances and different factors, it is critical that we are able to assign significance to a particular value of the score. To do so, we need to compute the distribution of the score under the null hypothesis that the sequence is part of the background. Calculating a p
-value for a score S
in a single sequence requires the enumeration of all possible w
-mers that have a score S
or greater under the background model. For n
aligned sequences this requires the enumeration all 4wn
possible sets of aligned w
-mers with scores S
or greater under the background model. While the number of possible alignments of n w
-mers can be unmanageably large for even small values of n
, because we treat each position independently we can enumerate these possibilities efficiently using an algorithm developed for matrix searches of single sequences [28
Every observed score is a sum of w numbers, one from each column of the matrix. The probability of observing exactly score S is the number of paths through the matrix whose entries add up to S, weighted by the probability of the path. By converting the matrix to integers, we can compute this probability for all values of S recursively. We initialize Pi(S) (the probability of observing score S after i columns in the matrix) by setting P0(S) = 1 for S = 0, and P0(S) = 0 for S ≠ 0. We then compute the values of the function for i = [1, w] as follows:
For aligned sequences, c represents a column in the alignment, and the sum is over all 4n possible columns an alignment of n sequences. The probability distribution function (PDF) of scores is Pw(S), and from this the cumulative distribution function (CDF), the probability of observing a score of S or greater, can be directly computed. Although in principle we can compute the probabilities to arbitrary precision, because the time complexity increases with the number of possible scores, we limit the precision to within approximately 0.01 bits.
Figure compares empirical p-values from 5,000 pairs of sequences evolved in a simulation (see Materials and methods) with those computed by this method, and shows that they agree closely. We have used this method to compute the CDFs for alignments of up to six species, and therefore can apply our method to most comparative genomics applications. We note, in addition, that the likelihood ratio scores are approximately Gaussian (data not shown). As the means and variance of the scores under each model can be computed efficiently (see Materials and methods) we can estimate p-values using a Gaussian approximation (Figure ) when the number of sequences in the alignment is large.
Figure 1 Accuracy of p-value estimations. To examine the accuracy of our p-value estimates, we compared the empirical p-value (computed from the observed distribution of scores) to p-values computed using either the exact method described above (black points) (more ...)
Heuristics for alignments with gaps
The treatment of alignment gaps in identifying conserved TFBSs is somewhat problematic. One the one hand, nonfunctional sequences may be inserted and deleted over evolution more rapidly than functional elements [30
], and thus the presence of a gap aligned to a predicted binding site could indicate that it is nonfunctional. On the other hand, alignment algorithms are imperfect, and must often make arbitrary decisions about the placement of gaps. We sought to design a heuristic that accommodated both these aspects of genomic sequence data by locally optimizing alignments for the purpose of comparative annotation of regulatory elements.
The idea is to assign a poor score to regions of the alignment with a large number of gaps, but to locally realign regions with a small number of gaps to identify conserved but misaligned binding sites. To do this, we scan along the ungapped version of one of the aligned sequences - the 'reference' sequence. For each position in the reference sequence pr, we define a window in each other sequence around ps, the position in sequence s aligned to position pr. The window runs from ps - (a + b) to ps + w + (a + b), where a and b are the number of gaps in the aligned versions of sequences r and s in position p to p + w, where p is the position in the alignment of pr. For each subsequence of length w in the window, we calculate the percent identity to the reference sequence, and create an alignment of pr to pr + w (in the reference sequence) to the most similar word in the window of each other sequence. This locally optimized alignment is then scored. Note that if a and b are zero (meaning there are no gaps in the aligned sequence), no optimization is done. If a is too large (in most contexts greater than five) we exclude that region of the alignment from further. This heuristic encapsulates the idea that too many gaps are indicative of lack of constraint, but conservatively allows for a few gaps due to alignment or sequence imperfections.
Application to Saccharomyces
The genome sequences of several species closely related to the budding yeast Saccharomyces cerevisiae
have recently been published and become models for the comparative identification of transcription factor binding sites [8
]. We aligned the intergenic regions of S. cerevisiae
genes to their orthologs in S. paradoxus
, S. mikatae
, S. bayanus
and S. kudriavzevii
genomes using CLUSTALW (see Materials and methods) and sought to evaluate the effectiveness of MONKEY under different evolutionary models and distances.
Ideally, we would use several diverse transcription factors with known binding specificity, where the set of matches to the factor's matrix in the S. cerevisiae
genome could be divided into two reasonably sized sets: those known to be bound by the factor (positives) and those known not to be bound by the factor (negatives). Unfortunately, even in yeast, the number of such cases is limited. For many factors we can identify true positives by combining high- and low-throughput experimental data that supports the hypothesis that a particular position in the genome is bound by a given factor. A true negative set, however, must be constructed on the basis of lack of evidence that a sequence is functional, as the interpretation of negative results almost always is ambiguous. In the case of transcription factor binding sites this is particularly problematic, because DNA-binding proteins have overlapping specificity, and we may therefore observe conservation of a binding site because it is bound by another factor with similar specificity. After evaluating all factors with binding specificity in Saccharomyces cerevisiae
Promoter Database (SCPD) [33
], we focus on Gal4p and Rpn4p for further analysis (see Table for properties of these factors, and Materials and methods for a description of the selection of positive and negative sets).
Definition of positive and negative sets of matrix matches
The effects of evolutionary models on the discrimination of functional binding sites
To evaluate the performance of our evolutionary method in correctly identifying bona fide binding sites, we calculated the p-values of the positive and negative sites for each factor, using MONKEY on alignments of all five genomes for Rpn4p and four species (with S. kudriavzevii excluded because too few sequences were available) for Gal4p. We compared the performance of MONKEY with the HB model to scores from S. cerevisiae alone and to a 'simple' score (equal to the average of the single sequence log likelihood ratios) that utilizes all the comparative data without an evolutionary model.
The results are summarized in Table . An ideal scoring method would assign low p-values to real sites (positives) and high p-values to spurious sites (negatives), and we therefore compared the p-values assigned by monkey based on the HB model to those based on the 'simple' score. Not surprisingly, both methods were a great improvement over searching in S. cerevisiae alone. Overall, when compared to each other, the HB score assigned lower p-values to the binding sites more often in the positive sets (90% for Gal4p and 80% for Rpn4p) and less often in the negative sets (20% for Gal4p and 25% for Rpn4p) than did the simple score. We note that some of the supposedly functional Rpn4p sites were assigned higher p-values in S. cerevisiae alone, suggesting that they are not in fact conserved; these will be discussed below.
Performance of different scores in recognizing functional and nonfunctional sites
The effect of evolutionary distance on the discrimination of functional binding sites
As evolutionary distance increases, we expect fewer matches to the matrix to be conserved by chance, which implies that the probability of observing matches as highly conserved as the functional sites should decrease. Similarly, we expect the nonfunctional sites to show many substitutions and their p-values to increase over evolution. To explore the change in p-values over evolutionary distance, we scored the functional and nonfunctional sets of binding sites at a variety of evolutionary distances by creating alignments of different combinations of species (see Materials and methods). The median p-value of the positive set of TFBSs decreases monotonically with evolutionary distance, with the rate of decrease an approximately constant function of evolutionary distance (see Figure ). The median p-value for the binding sites in the negative set increases with evolutionary distance, although somewhat erratically. This demonstrates that MONKEY effectively exploits evolutionary distance, and confirms our intuition that as evolutionary distance increases, functional elements should be increasingly easy to distinguish from spurious predictions.
Figure 2 Significance of matches increases with evolutionary distance. Median p-values for the positive (black squares) and negative (white triangles or white triangle points) sets of binding sites for (a) Gal4p and (b) Rpn4p at different evolutionary distances (more ...)
To test this hypothesis on a more quantitative level we sought to compare the observed scores with the expected scores assuming that binding sites evolved precisely according to the evolutionary models used by MONKEY. Briefly, given a binding-site model and a phylogenetic tree, we assume we have observed a binding site in the reference genome, and that this site evolves along the tree under either the motif model (HB) or background model (HKY), representing functional and nonfunctional binding sites, respectively (see Materials and methods for details). The expected p
-values associated with the functional binding sites (Figure , solid lines) showed reasonable agreement with the models, consistent with previous observations that they are evolving under constraint that is well modeled by the purifying selection on the base frequencies in the specificity matrix [19
Pairwise versus multi-species comparisons
The comparisons at the different evolutionary distances used in Figure employed variable numbers of species, with the shorter distances representing primarily pairwise comparisons and the longer distances comparisons of three or more species. While we expect the variation in p-values with different combinations of species to be primarily a function of the evolutionary distance spanned by these species, there will also be effects related to the number of species and the topology of the three. For example, in the limit of very long branch lengths, the evolutionary p-values are on the order of the power of the number of species and are independent of evolutionary distance. In contrast, in the limit of very short branch lengths, the evolutionary p-values depend only on the distance spanned by the comparison, as most of the information provided by additional species is redundant. However, because most comparisons that are actually carried out are far from either of these extremes, we sought to evaluate the effects of species numbers and tree topology for the Saccharomyces species analyzed here.
First, we recomputed the expected p-values for all the distances analyzed in Figure , except that instead of using the real tree topology, we used a single pairwise comparison at the same evolutionary distance (Figure , dotted lines). For example, for the Rpn4p analyses using all five species we assumed a pairwise comparison at an evolutionary distance of around 1.1 substitutions per site. Note that this is considerably more distant than any of the pairwise comparisons available among these species. The predictions for the pairwise and multi-species comparisons are very similar, suggesting that at the evolutionary distances spanned by these species there is little difference in using multiple species alignments relative to a pairwise alignment that spans the same evolutionary distance. Only at the longest distances considered (greater than 0.8 substitutions per site) does the power of the pairwise comparison begin to level off, although there are other reasons that multiple species comparisons might still be preferred (see Discussion).
To complement this theoretical analysis, we were interested in using empirical data to compare pairwise and multi-species analyses. Fortuitously, the evolutionary distance between S. cerevisiae and S. kudriavzevii is almost exactly equal to the evolutionary distance spanned by S. cerevisiae, S. paradoxus and S. mikatae (median tree length approximately 0.5 substitutions per site; see Figure ). Because our models predict that we are in a regime where evolutionary distance is the primary determinant of the p-values, we expect searches using these different sets of species to yield similar results. We tested this hypothesis by calculating the p-values associated with the Rpn4p-binding sites using the sequences from these two comparisons. The median p-values in both the positive and negative sets are very similar (Figure ), confirming that at these relatively short evolutionary distances, the power of the comparative method is independent of the number of species considered (see Discussion).
Figure 3 Significance of binding sites in pairwise or three-way comparisons at similar evolutionary distance. (a) Histogram of the percent identities of all aligned noncoding regions of S. cerevisiae and S. kudriavzevii (open squares) and S. cerevisiae, S. paradoxus (more ...)
Taken together, these results strongly support the idea that when appropriate methods are used, data from multiple species can be combined effectively to span larger evolutionary distances. Note that this in no way implies that the addition of extra species to an existing pairwise comparisons is not useful - such additions will always increase the evolutionary distance spanned by the species and thus will increase the power of the comparison.
Testing the power of comparative annotation of transcription factor binding sites
At the distances spanned by all available sequence data, the p-values are so small that we no longer expect to find matches of the quality of those in the positive set by chance, especially for Rpn4p. To test this further, we scanned both strands of all the available alignments of all five sensu stricto species (around 2.7 Mb) to identify our most confident predictions of conserved matches to the Rpn4p matrix. We chose the p-value cutoff of 1.85 × 10-8, which corresponds to a probability of 0.05 of observing one match at that level over the entire search (using a Bonferroni correction for multiple testing). After excluding divergently transcribed genes, there were 56 genes that contained putative binding sites at that p-value. Of 32 genes in our positive set that had sequence available for all five species, 30 had binding sites below this p-value. Of the 28 genes in the negative set for which sequences were available, only three had binding sites below this cutoff. In this (nearly ideal) case we have ruled out nearly 90% of the negative set at the expense of less than 10% of the positives.
Examining the expression patterns of these genes (Figure ) allows them to be divided into three major classes. The first is a group (indicated by a blue bar) containing 30 genes (28 of which were in our original positive set and two other genes) that show a very similar pattern over the entire set of conditions. The second group (indicated by a green bar) contains 11 genes (of which only one was in our original positive set) that show uncoordinated gene expression changes in some conditions in addition to the stereotypical Rpn4p expression pattern. It is possible that these genes' regulation is controlled by multiple mechanisms under different conditions [34
], and regulation by Rpn4p is one contribution to their overall pattern of expression. Further supporting this hypothesis, only one of these genes (UFD1
) is annotated as involved in protein degradation, and three (YBR062C
) have unknown functions.
Figure 4 Relationship between conserved Rpn4p-binding sites and expression. (a) We identified 56 Rpn4p-binding sites with p-values below 1.85 × 10-8 using all five species and the HB model. The expression patterns of these genes (clustered and displayed (more ...)
Finally, and most surprising from the perspective of comparative annotation, is a third set of 14 genes, including one from our original positive set and three from our negative set, most of which show no evidence of the proteasomal expression pattern associated with Rpn4p (Figure ). It is extremely unlikely that these sequences have been conserved by chance, and we suggest that they represent matches that are conserved for reasons other than binding by Rpn4p (see Discussion).
Nonconserved binding sites in regulated genes
Having identified examples of conserved binding sites whose nearby genes showed no evidence of function, we decided to examine the converse: binding sites near regulated genes, and therefore presumably functional, that are not conserved. Figure shows the p-values of individual positive Rpn4p sites at different evolutionary distances. While most of the sites follow the trajectory predicted for sites evolving under the HB model, the p-values for four of the positive sites seem to be well-modeled by the 'background' or unconstrained model. This is surprising because we expect these binding sites to be functional, and therefore under purifying selection. One explanation is that some of these sites may have been misannotated as functional. For example, in addition to a nonconserved positive site, the upstream region of REH1 contains another binding site that is a weaker match to the Rpn4p matrix (Figure ) and did not pass our threshold for inclusion in the positive set (see Materials and methods). This weaker match is more highly conserved and may represent the functional site in this promoter. In the case of PTC3, however, we can find no other candidate binding sites nearby (Figure ). This represents a possible example of binding-site gain, a proposed mechanism of regulatory evolution at the molecular level (see Discussion).
Figure 5 Some apparently functional Rpn4p-binding sites are not conserved. (a) The MONKEY p-values (points) of all putatively functional Rpn4p-binding sites at varying evolutionary distances, along with the expected values under the HB and HKY models (solid traces). (more ...)
Different factors have different relationships between significance and evolutionary distance
The optimal selection of species for comparative sequence analysis remains an open question. To analyze this question for transcription factor binding sites, we examined the relationship between evolutionary distance and the MONKEY p
-values for several S. cerevisiae
transcription factors (Figure ) for which sufficient characterized binding sites were available in SCPD [33
]. We find that while all factors show the tendency for p
-values to decrease with evolutionary distance, the p
-values for each factor remain very different. For example, with alignments of four species spanning about 0.8 substitutions per site, we expect a conserved match to the Gcn4p matrix as good as the median functional binding site (Figure , red triangles) approximately every million bases of aligned sequence. This in contrast to Rpn4p, for which in the same alignments we expect such a match (Figure , violet crosses) only once in about 1 billion base pairs. Thus, the evolutionary distance required to achieve a desired p
-value is different for different factors. Understanding the relationship between a frequency matrix and the behavior of its p
-values is an area for further theoretical exploration. We note that, once again, we can predict the behavior of these p
-values (Figure ), and that while our predictions agree qualitatively, there is considerable variability.
Figure 6 The evolutionary distance required to confidently identify conserved binding sites varies among transcription factors. (a) Median p-values for functional binding sites for various factors at different evolutionary distances. The evolutionary distance (more ...)
MONKEY is implemented in C++. It is available for download under the GPL and can be accessed over the web at [35