|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: TP RBRA. Performed the experiments: TP RBRA. Analyzed the data: TP RBRA. Contributed reagents/materials/analysis tools: TP. Wrote the paper: TP RBRA.
The promoter regions of many genes contain multiple binding sites for the same transcription factor (TF). One possibility is that this multiplicity evolved through transitional forms showing redundant cis-regulation. To evaluate this hypothesis, we must disentangle the relative contributions of different evolutionary mechanisms to the evolution of binding site multiplicity. Here, we attempt to do this using a model of binding site evolution. Our model considers binding sequences and their interactions with TFs explicitly, and allows us to cast the evolution of gene networks into a neutral network framework. We then test some of the model's predictions using data from yeast. Analysis of the model suggested three candidate nonadaptive processes favoring the evolution of cis-regulatory element redundancy and multiplicity: neutral evolution in long promoters, recombination and TF promiscuity. We find that recombination rate is positively associated with binding site multiplicity in yeast. Our model also indicated that weak direct selection for multiplicity (partial redundancy) can play a major role in organisms with large populations. Our data suggest that selection for changes in gene expression level may have contributed to the evolution of multiple binding sites in yeast. We conclude that the evolution of cis-regulatory element redundancy and multiplicity is impacted by many aspects of the biology of an organism: both adaptive and nonadaptive processes, both changes in cis to binding sites and in trans to the TFs that interact with them, both the functional setting of the promoter and the population genetic context of the individuals carrying them.
TFs regulate gene expression by binding to specific sequences in the promoter regions of their target genes. Promoters often contain multiple copies of the same TF binding sites. How does this multiplicity evolve? One possibility is that individuals with multiple, redundant binding sites have higher fitness. However, nonadaptive processes are also likely to be important. Here, we develop a mathematical model of the evolution of TF binding sites to help us disentangle how different evolutionary mechanisms contribute to the evolution of binding site redundancy and multiplicity. We show that recombination is expected to promote the evolution of multiple binding sites. This prediction is corroborated by genome-wide data from yeast. Another important factor in the evolution of multiplicity predicted in our analysis is TF promiscuity, that is, the ability of a TF to bind to multiple sequences. In addition, our analysis indicated that direct selection can have large effects on the evolution of redundancy and multiplicity. Data from yeast identified selection for changes in expression level as a candidate mechanism for the evolution of multiple binding sites. We conclude that, although selection may play a major role in the evolution of multiplicity in regulatory regions, nonadaptive forces can also lead to high levels of multiplicity.
Promoters frequently contain multiple functional regulatory elements . For example, the regulatory region for stripe 2 of even-skipped (eve) of the fruit fly Drosophila melanogaster comprises 17 binding sites for four transcription factors (TFs), including five binding sites (B1–B5) for the activator bicoid (bcd) . How does cis-regulatory element multiplicity evolve? There are three possibilities. First, perhaps “more is better” when it comes to TF binding sites. Multiple binding sites may cause changes in the level of gene expression or in its robustness against variation in TF concentrations , –. Second, multiplicity might be favored by selection, but independently of its functional consequences. For example, genotypes with many binding sites may be more likely to produce viable offspring after mutation or recombination with genotypes with fewer binding sites –. Third, cis-regulatory element multiplicity may arise by nonadaptive processes –. Stone and Wray  have shown that a population of diploid individuals could evolve two identical copies of a 6 base pair (bp) binding site in a 200-bp promoter every generations through random mutation and genetic drift alone. The intergenic regions of Saccharomyces cerevisiae are bp long on average, whereas those of multicellular eukaryotes can be orders of magnitude longer.
The common thread to all the evolutionary scenarios listed above is redundancy, the ability of structurally identical elements to contribute to the same function –. Redundancy is thought to be widespread in biological systems. In eukaryotes, a large proportion of genes are duplicates, and deletion of one copy often has little or no phenotypic effect because the other copy can compensate for the loss of function . Functionality and redundancy are more difficult to establish for the case of multiple cis-regulatory elements . The five bcd binding sites in eve the stripe 2 enhancer are not fully redundant because loss-of-function mutations to B1, B2 or B3 cause reduced eve stripe 2 expression and gain-of-function mutations to B4 and B5 lead to increased expression , . However, redundancy was likely important in the evolution of these sites. When Ludwig and colleagues  compared the stripe 2 enhancers of different species of Drosophila, they found that some of them lacked the B3 site (Figure 1). This observation implies that the B3 site evolved recently in the lineage leading to the last common ancestor of D. melanogaster and D. simulans. Furthermore, the B3 site was probably redundant when it first appeared because the stripe 2 enhancers of three species lacking the B3 binding site were able to drive expression of a reporter gene in D. melanogaster embryos coincident with native eve stripe 2 (Figure 1). Thus, redundant transitional forms can, in principle, play an important role in the evolution of cis-regulatory element multiplicity , . In this paper we develop a model of binding site evolution and use it to evaluate the plausibility of different scenarios for the evolution of cis-regulatory element redundancy and multiplicity. We then test predictions obtained from our model using data from yeast.
Here we introduce a model of binding site evolution. The model extends earlier phenomenological models of the evolution of cis-regulatory element redundancy , , , ,  in that it considers binding sequences and their interactions with TFs explicitly, albeit in a simplified manner . We also build upon recent attempts to apply the mutational network approach ,  to the study of gene regulatory networks –. We then use our model to investigate the conditions favoring the evolution of multiple TF binding sites.
A target gene has a promoter containing cis-regulatory sites for a number of TFs. denotes the jth binding site for . The TFs regulate expression of the target gene according to the following rules:
Rules #2 and #3 are compatible with the two-state model for TF binding , –. Unless otherwise stated, our model deals with the evolution of the binding sites for a single transcriptional activator. For a discussion of how our model can be extended to repressors see ‘Generalizations and caveats’.
The target gene is considered functional if, given normal levels of expression of its transcriptional regulators , it is active above a threshold level, arbitrarily set to in this paper. Consider a promoter that contains binding sites for and is capable of sustaining gene function. A particular site is considered functional if binding to the site has an effect on gene expression, that is, if . Multiple binding sites () are considered redundant if at least one of them can be deleted without affecting gene function. Full redundancy occurs when the viability of redundant and nonredundant genotypes is the same; partial redundancy occurs when the viability of redundant genotypes is higher than that of nonredundant ones ,  (see also ‘Natural selection’ below). Note that, according to the above definitions, multiplicity does not imply redundancy (full or partial).
In our model, the total effect of on the expression of a gene (, Equation 1) can change in three ways: a mutation in a binding site j that alters its (cis), a mutation in the coding sequence of that modifies the function directly (trans), and a change in the concentration of , . In the rest of the paper we consider only the first two types of evolution. We begin by considering the cis evolution of a single binding site.
One way to represent the evolution of a binding site is through its mutational network . Two genotypes are connected in a mutational network if one genotype can be obtained from the other through a single mutation. For example, the sequences ACGCGC and ACGCAT are both connected to ACGCGT, but not to each other, in the mutational network of all possible DNA sequences of length base pairs (Figure 2A). If the mutation rate per base pair per generation is , then ACGCGT will mutate into ACGCGC with a probability . One difficulty with this approach is that even the relatively short sequences of TF binding sites ( to 10 bp) define large mutational networks (e.g., the network of DNA sequences of length has sequences).
Given that binding site functionality in our model is determined by the number of mismatches m relative to the canonical sequence, we can simplify the mutational network of a binding site by collapsing all sequences with a given m. This grouping of genotypes is appropriate because every sequence contained within a given m class has exactly the same number of mutational neighbors, both within the m class (obtained by mutating already mismatched sites) and in the neighboring classes. A similar approach has been employed by others , . We call the resulting network a condensed mutational network (Figure 2B). The m condensed genotypic class includes sequences. For example, if ACGCGT is the canonical sequence of the yeast TF Mbp1 , then both ACGCGC and ACGCAT belong to the condensed class. The condensed mutational network representation is extremely compact, implying a reduction from to states for a binding site of length n.
Now we introduce evolution over the condensed mutational network. Consider an infinite-sized population of asexual, haploid organisms. Each individual has a genotype at a binding site of length n, such that the population is distributed over a condensed mutational network. The state of the population is given by a vector of frequencies , where is the proportion of sequences in the population a Hamming distance m away from the canonical sequence. The population reproduces asexually, in discrete generations. Mutation causes the population to evolve according to the equation:
where is the state of the population at time t, and Q is the transition matrix such that is the probability that the offspring from an individual with i mismatches has j mismatches. Assuming that a sequence cannot acquire more than one mutation in a single generation (appropriate for realistic values of ), the nonzero elements of row i of Q are given by: , and . For example, the probability that the sequence ACGCGC from the condensed class will mutate into the canonical sequence ACGCGT () is . The probability that it will mutate into a sequence from class is the probability that a mutation occurs at a site other than the already mismatched site: . And the probability that it will remain in the class is the probability that either no mutation occurs or that a mutation occurs at the mismatched site but does not result in the canonical sequence (CA or CG, but not CT): . The transition probabilities are the same for any other sequence, such as ACGCAT.
We introduce selection by assuming that target gene function is required for viability. The population can only occupy states within the viable portion of the condensed mutational network. Every generation, mutant genotypes may appear in the inviable part of the condensed mutational network, but they fail to reproduce. The evolutionary dynamics of the population can be described by restricting Equation 2 to the set of viable genotypes:
where ‘’ means entry-by-entry multiplication of the two vectors, is a vector of the viabilities of each genotypic class (1 and 0 correspond to viable and inviable, respectively) and is the mean viability of the population.
Equations 2 and 3 can be generalized for genotypes with any number of binding sites K (Figures 3A, S2 and S3). This amounts to considering that each possible binding site defines an axis in a K-dimensional space, and that each point along that axis is the Hamming distance between the site and the canonical binding sequence of the corresponding TF. In the next four sections we consider the case in detail (see ‘Number of segregating binding sites’ for ).
We begin by considering one of the simplest situations that can be represented in our model: an essential gene regulated by a single constitutively expressed activator . Gene function is required for viability. A binding site for this TF, , is functional only if it matches the canonical binding sequence exactly :
A single functional binding site is both necessary and sufficient to sustain gene function; additional functional binding sites are fully redundant.
The condensed mutational network for the case of sites is shown in Figure 3A. The axes represent the Hamming distances of each binding site relative to the canonical sequence. The viable portion of the condensed mutational network comprises the genotypic classes that have at least one functional site, and corresponds to the left and bottom edges of the mutational network in Figure 3A. There is only one redundant genotypic class: that possessing two functional binding sites (open circle in Figure 3A).
Not all genotypes within the viable portion of the condensed mutational network have the same reproductive value, defined as the proportion of viable offspring they produce. The reproductive value of condensed genotypic class i is given by: . The redundant genotype (0,0) has a reproductive value of because both of its mutational neighbors, (0,1) and (1,0), are viable. All other viable genotypes (with one functional binding site) have reproductive value (i.e., half of its mutational neighbors are inviable).
When the population reaches mutation-selection equilibrium it is not evenly distributed over all viable genotypes. Rather, the redundant genotype is approximately 2-fold overrepresented in the population, relative to other (nonredundant) genotypes (Figure 3B, squares). This finding is consistent with the prediction  that more highly connected genotypes in a neutral network should be overrepresented at equilibrium relative to a uniform distribution. But although the redundant genotype is overrepresented at equilibrium, redundancy cannot evolve easily in this model. That is because the (noncondensed) set of viable genotypes contains a single redundant genotype, but 8,190 nonredundant ones, 83% of which include at least 4 mismatches in the nonfunctional binding site. Thus, at equilibrium, the redundant genotype constitutes a miniscule proportion of the population (0.012%). This pattern is visible in the sum of the frequencies of all genotypes in a viable condensed genotypic class (Figure 3B, circles).
The model outlined above can be considered neutral with respect to redundancy because redundant and nonredundant genotypes have the same viability , . The interaction between viability selection and the structure of the mutational network in this model does create indirect selection for multiplicity , but it is too weak to maintain a substantial proportion of redundant genotypes in the population. The results in this section are consistent with those obtained by Gerland and Hwa using a similar model . In the next three sections, we build on this model by introducing different mechanisms independently, one at a time, and investigating how they affect the evolution of redundant genotypes.
Partial redundancy is thought to be more common than full redundancy , . The presence of multiple binding sites might be advantageous if, for example, it changes the expression level of the target gene, or buffers expression against fluctuations in TF concentration , –. We model partial redundancy by setting the viabilities of redundant and nonredundant genotypes to and , respectively. The equilibrium frequency of the redundant genotype increases with the strength of selection for redundancy (; Figure 4A). The effect of selection on redundancy undergoes a phase transition around the point where selection becomes strong relative to the rate of mutation from redundant to nonredundant genotypes (): the response to selection is small for weaker selection, but it increases sharply for stronger selection.
We incorporate recombination into our full redundancy model by taking into account the probability that each genotype has resulted from recombination between each available pair of genotypes (see Protocol S1 for details). Recombination is only allowed between binding sites, not within them. Recombination changes the evolutionary dynamics because it allows long steps across the mutational network. A modest amount of recombination between sites () leads to the evolution of a high level of redundancy at mutation-recombination-selection equlibrium (Figure 4B). Lynch obtained similar results using a simpler model .
Our result can be understood by considering recombination between nonredundant genotypes containing different functional binding sites, that is, in Figure 3A. A recombination event between the sites produces two genotypes: one viable, with two functional binding sites (redundant), and another inviable, without any functional sites. In contrast, redundant genotypes always give rise to viable offspring, regardless of the kind of genotype they recombine with. This leads to strong selection against nonredundant genotypes. Stochastic simulations confirm that recombination can have a large effect on the evolution of redundancy even in finite populations (Figure S4).
Many TFs are promiscuous, that is, they can bind to several different sequences (Figure 1). Our model allows us to explore the implications of different levels and kinds of TF promiscuity for the evolution of redundancy (Figure S2B; also, see next section). We consider two different ways of increasing the promiscuity of the TF described in the basic full redundancy model (see Equation 4).
We begin by considering the “all or nothing” case where affects gene expression through a binding site according to the following relationship:
The binding site is functional only if its sequence differs from the canonical sequence of by no more than mismatches (Equation 4 is the special case for ). Thus, the greater the value of , the more promiscuous the TF. Increasing expands both the size of the viable portion of the condensed mutational network and the number of redundant states. Figure 5 shows the viable portions of the condensed mutational networks for a stringent (a: ) and a promiscuous TF (b: ). The promiscuous TF evolves an equilibrium frequency of redundant genotypes two orders of magnitude greater than the stringent one.
Generally, mismatches reduce the binding affinity and, therefore, the regulatory influence of a TF , , . For example, in the D. melanogaster eve stripe 2 enhancer, the different bcd binding sites show different numbers of mismatches relative to the bcd consensus recognition sequence, inferred from in vitro binding assays ,  (Figure 1). The deletion of binding sites with lower numbers of mismatches (B1 or B2, both with ) result in much more severe reductions in stripe 2 expression, when compared to deletions in sites with higher numbers of mismatches (B4 or B5, both with ; B3, with ) , . In addition, when the high-m sites B3–B5 were mutated into consensus sites (), they restored expression of a defective promoter lacking the B1 binding site . To incorporate this type of “graded” TF promiscuity in our model, we defined as a decreasing function of (instead of a step-function as in Equation 5). Figure 5 shows two examples (c, d) that imply that graded promiscuity can promote the evolution of redundant cis-regulation more strongly than the all or nothing kind. The reason for this is that graded TF promiscuity can lead to the appearance of nonredundant genotypes containing multiple binding sites capable of sustaining gene function together but not in isolation (gray, Figure 5). These results show that nonredundant multiplicity can evolve from redundant transitional forms.
Until now, our model has assumed that only alleles at binding sites segregate within a population at a given time. This assumption may not be met in reality. Long promoters provide the opportunity for more sites to arise by chance in a population  (Figure S2). Other factors that are expected to influence the number of segregating binding site alleles include the length of the site (n, Figure S2A), the match between the GC-content of the promoter and that of the canonical binding sequence , the promiscuity of the TF (m, Figure S2B; see previous section), the mutation rate and the population size.
As the number of segregating binding sites in the full redundancy model increases, the dimensionality of the model and the number of possible redundant genotypes also increase (Figures S2C and S3). The increase in the number of available redundant genotypes results in an increase in the total equilibrium frequency of these genotypes (Figure S2D). But although the number of redundant genotypes grows roughly exponentially with the number of segregating binding site alleles, the equilibrium frequency of redundant genotypes increases linearly, suggesting that the number of segregating binding site alleles has only a modest effect on the evolution of redundancy. The situation changes when the expected binding site copy number in a sequence is (e.g., points above the dashed line in Figures S2A and S2B). If that occurs, the maintenance of redundant cis-regulation does not require a selective explanation.
All the results derived above for an individual transcriptional activator can be generalized to two scenarios. First, to combinations of different transcriptional activators following similar rules. This would allow us to model the evolution of cis-regulatory element degeneracy (the equivalent of redundancy for elements that are structurally different ). Second, to transcriptional repressors (considered individually or in combination), where the function of the target gene is defined by its inactivity. Selection for decreased gene expression is expected to influence the evolution of the copy number of the binding sites of transcriptional repressors in the same way that selection for increased gene expression affects the evolution of the copy number of the binding sites of transcriptional activators. A major challenge for future work is to consider the simultaneous evolution of sites for activators and repressors in the same promoter.
Our model includes many simplifying assumptions, such as that the positions within a binding site influence TF binding uniformly and additively, and that TFs act additively through multiple binding sites. Additivity among the positions of a binding site appears to be a reasonable approximation ,  and (together with uniformity) serves as the basis for the widely used two-state model , –. The other assumptions are not particularly realistic: synergistic effects among binding sites are commonplace ,  and many TFs are not uniformly promiscuous (Table S1). The extent to which changing the assumptions of our model would modify our conclusions is not clear at present, and remains a fundamental question for future modeling.
To evaluate the level of regulatory multiplicity in the yeast genome, we have scanned all intergenic sequences depleted of nucleosomes  upstream of a single protein-coding gene ( sequences, covering 8% of the genome) for 326 position weight matrix (PWM) models of 179 TFs from the literature , – (see Methods; Tables S1 and S2). In what follows, we analyse the 312 PWMs (corresponding to 176 TFs) predicted to have at least two binding sites in total. For simplicity, we refer to intergenic regions as promoters. On average, each promoter contained 0.08 binding sites of each PWM (standard deviation, ).
We defined the amount of regulatory multiplicity (M) for a PWM as the proportion of promoters having at least one binding site that have two or more binding sites. On average, PWMs showed 7.1% multiplicity (; Table S3). The M measure of multiplicity is partly confounded with overall binding site copy number. To correct for this effect, we calculated the expected value of M for each PWM under the assumption that binding site copy number in a promoter region i of length (nucleosome depleted) is Poisson distributed with expectation , where is the observed number of binding sites in promoter j. Figure 6 shows that the observed cis-regulatory element multiplicity was approximately 40% higher than that expected under the null expectation (paired Wilcoxon test of the hypothesis that : ).
How did the excess multiplicity shown in Figure 6 evolve? Our model and the literature suggest three possibilities , –, : 1) recombination, 2) direct selection for increased robustness in gene expression, or 3) direct selection for increased gene expression. Each hypothesis makes a different prediction about promoters displaying binding site multiplicity: they should experience 1) higher recombination rates, or be upstream of genes showing 2) more robust expression patterns, or 3) higher expression levels (for activators; the opposite is expected for repressors). To test these hypotheses, we looked for genome-wide associations between cis-regulatory element multiplicity and a range of features of the promoters and the genes downstream of those promoters. Consider a genomic property (e.g., promoter length). For each PWM, we calculated an effect size , where () is the mean of associated with promoters containing a single (multiple) binding site(s), and is an unbiased estimate of the pooled standard deviation  (the effect size for binary traits, such as gene essentiality, was estimated as the arcsine transformed risk difference based on 22 tables). We then combined the effect sizes and respective variances corresponding to different PWMs in a random-effects meta-analytic model. The results are summarized in Figure 7.
Promoters with higher numbers of crossovers  showed significantly higher levels of binding site multiplicity (, ; Figure 7), which is consistent with the recombination hypothesis. This association is explained in part by promoter length (Spearman's rank correlation coefficient: , ). However, we believe that our data provide strong backing for the recombination hypothesis for three reasons. First, promoter length alone cannot explain the excess multiplicity illustrated in Figure 6 because it was considered in the calculation of . When the analysis was restricted to the subset of PWMs displaying excess multiplicity (), the effect size of crossover number was unchanged (, ). Second, the effect size of the residual crossover number from a Poisson log-linear regression model with log-transformed promoter length as an explanatory variable decreased (), but remained statistically significant (, ). Third, a measure of frequency of meiotic double-strand breaks (DSBs) per bp  was also elevated in promoters showing cis-regulatory element multiplicity (, ; Figure 7).
Promoters showing binding site multiplicity tended to be upstream of genes showing low robustness in gene expression to various trans-perturbations  (all , , which contradicts the hypothesis that redundancy has evolved as a result of selection for robustness in gene expression. Although promoters with multiple binding sites were also more likely to contain a TATA box  (, ), the results shown in Figure 7 did not change qualitatively when the analyses were repeated separately for genes with and without TATA boxes (not shown). Furthermore, multiplicity was not associated with protein expression noise  (, . Genes downstream of promoters with multiple binding sites tended to have higher expression levels (both protein and mRNA: , ), which is consistent with the selection for expression hypothesis for activators, but not repressors.
Cis-regulatory element multiplicity was associated with several correlates of gene functionality (Figure 7). Promoters containing multiple sites tended to evolve more slowly  (divergence: , ), and the genes downstream of these promoters tended to show higher levels of selective constraint  (: , ) and to be involved in interactions with a greater number of other genes  (degree centrality: , ). Genes with duplicates elsewhere in the genome were more likely to show binding site multiplicity  (, ). Several gene ontology terms were significantly enriched in genes downstream of promoters containing multiple sites, including: plasma membrane, transporter activity, transcription regulator activity, DNA binding and transport (Table S5).
Our mathematical model suggests that selection for multiple binding sites, that is, partial redundancy, can influence the evolution of cis-regulatory element multiplicity, provided that the redundant genotype has a selective advantage . Mutation rates per base pair in DNA-based organisms are of the order of . Therefore, weak selection can play a major role in the evolution of cis-regulatory element multiplicity, provided that the effective population size is also large enough () to render genetic drift negligible , .
We found a positive association in yeast between the presence of multiple binding sites for a TF and expression level of the downstream gene. This association is unlikely to have evolved neutrally or as a correlated response to the increases in multiplicity generated by recombination (see below) because gene expression patterns are under intense stabilizing selection ,  and increases in gene expression are energetically costly ; also, the effect sizes of crossover number are not significantly correlated with those of either mRNA or protein abundance (both ). Rather, the association between multiplicity and gene expression is consistent with an adaptive origin of cis-regulatory element multiplicity. Increasing the number of binding sites for transcriptional activators (inhibitors) in a promoter typically increases (decreases) gene expression , . Since transcriptional activators are thought to be more common than repressors in yeast, and many TFs can perform either role , selection for different levels of expression of certain genes in certain environments could, over time, generate a positive association between cis-regulatory element multiplicity and expression level (provided that there is no strong overall bias towards selection for reduced expression). This adaptive scenario for the evolution of binding site multiplicity is consistent with the observation that functional TF binding sites have been frequently gained or lost in a lineage-specific manner among three closely related species of yeast , .
Our finding that cis-regulatory element multiplicity was associated with lower robustness in gene expression to various trans-perturbations contradicts the hypothesis that redundant genotypes benefit from being more robust , . An earlier study reported a positive association between the number of binding sites for any TF—a possible correlate of both redundancy and degeneracy —and variation in gene expression in yeast using different data from ours . Our evidence is, of course, correlative: a more direct test would be to compare the robustness in expression of genes downstream of promoters containing multiple binding sites with that of the same genes with various combinations of sites mutated or deleted. Nevertheless, the observed relationships between multiplicity and robustness are also consistent with selection for changes in expression level. If the main consequence of gaining binding sites is to increase the effect of a TF on gene expression ( in Equation 1), then changes in the levels of TFs, such as those caused by viable knockout mutations , are expected to lead to greater variance in these effects in promoters containing multiple sites, compared to promoters containing a single binding site.
In addition, our model highlighted three candidate nonadaptive mechanisms for the evolution of cis-regulatory element multiplicity through fully redundant transitional forms. The first is the neutral evolution of multiple binding sites in long promoters. Such “trivial” redundancy is expected to occur in a long promoter if functional binding sites can occur over a large proportion of its length. Although the latter condition is difficult to evaluate in real organisms, intergenic regions longer than -bp are common in several mammals, including humans. Therefore, many mammalian promoters may be trivially redundant , . This could explain the observation that approximately a third of human functional TF binding sites are not functional in rodents . However, we do not expect that trivial redundancy played a dominant role in the evolution of multiplicity in organisms with relatively shorter promoters and larger populations, such as yeast.
The second mechanism is recombination. Based on our model we predict that recombination between binding sites on the order of will promote the evolution of cis-regulatory element redundancy. In yeast, a pair of sites 100-bp apart is expected to experience , ; if yeast only undergo sexual reproduction once every 1,000 asexual generations  we estimate , suggesting that this process has operated in yeast. We found a positive association between the presence of multiple binding sites for a TF and recombination rate in yeast. Estimates based on polymorphism data from 10 species of plants and animals  give bp, indicating that recombination is likely to be a powerful force in the evolution of cis-regulatory element multiplicity in other eukaryotes with relatively large populations. Our findings are in agreement with recent work showing that recombination selects for “mixable” genotypes , which leads to the evolution of higher mutational robustness , –. Our model predicts that redundant genotypes are robust to mutations in the binding sites, but this kind of mutational robustness does not imply robustness in the expression pattern of the downstream gene to trans-perturbations. In fact we found that cis-regulatory element multiplicity was associated with reduced robustness to perturbations in trans (see previous section).
The third nonadaptive mechanism indicated by our model is that increases in TF promiscuity promote the evolution of cis-regulatory element multiplicity. We could not test this prediction directly with our yeast data because we made an implicit assumption about the level of TF promiscuity when we scanned for binding sites. However, Bilu and Barkai , using a different data set from ours, reported that binding sites tended to be “fuzzier” (i.e., have lower PWM scores) when they appeared in promoter regions containing other binding sites for any TF. This observation is consistent with the prediction that graded TF promiscuity allows the existence of viable genotypes containing multiple binding sites, where each binding site is fuzzier than those found in viable genotypes containing fewer binding sites. Graded TF promiscuity is believed to be common , , , suggesting that multiplicity will often evolve through transitional forms showing redundant cis-regulation that then degenerate into nonredundant forms. If this evolutionary scenario is common, then lack of redundancy in extant genotypes containing multiple binding sites will be a poor indicator of whether or not its ancestral genotypes were redundant.
Our results suggest that redundant transitional forms can, indeed, play an important role in the evolution of cis-regulatory element multiplicity. Many aspects of the biology of an organism affect the evolution of redundancy and multiplicity: both adaptive and nonadaptive processes, both changes in cis to binding sites and in trans to the TFs that interact with them, both the functional setting of the promoter and the population genetic context of the individuals carrying them. Thus, understanding how gene networks evolve will require going beyond mere plausibility arguments into rigorous testing of specific mechanisms , . We believe that the approach developed here provides a valuable framework to advance this research program.
The results reported in Figures 3– 55 and were based on a deterministic version of the model (i.e., assuming infinite population size). The frequencies of different genotypic classes at mutation-selection or mutation-recombination-selection equilibrium were calculated by iterating populations for as long as necessary for genotypic class frequencies not to change by more than from one generation to the next.
We used 326 PWMs summarizing the binding specificities of 179 putative yeast TFs reported in four studies , – (Table S1; see Protocol S1 for more details). Sequences scoring 95% or higher of the highest possible score for a given PWM were considered putative binding sites (on average, this allowed 1.23 mismatches, s.d.=1.51). Each intergenic sequence was scanned with a PWM and its reverse complement and the number of matches were counted (simultaneous hits on exactly the same sequence and its reverse complement 1 nucleotide were counted as a single hit; otherwise, binding sites overlapping over nucleotides were counted separately).
Other studies have attempted to distinguish between real and “impostor” binding sites by taking into account additional information, such as the degree of conservation of putative sites , . We did not follow this approach because promoter sequence divergence is significantly correlated with many of the genomic features shown in Figure 7 (Table S4; see next section).
We calculated the following quantities for each intergenic region: 1) sequence length (including regions occupied by nucleosomes); 2) proportion of sequence occupied by nucleosomes ; 3) whether it contains a TATA box . 4) GC content of the sequence; 5) a measure of the frequency of meiotic DSBs ; 6) proportion of nucleotides that differ between S. cerevisiae and S. paradoxus . 7) number of crossover events . We also calculated the following quantities for the gene downstream of these promoters: 1) three measures of robustness to trans-perturbations , derived from measurements of the variance in levels of gene expression (corrected for mean) across 167 viable knockout mutations (genetic), 30 wild isolates (genetic background), and 35 environments (environmental robustness); 2) essentiality, whether a homozygous knock-out of the gene was lethal , ; 3) whether the gene has a duplicate elsewhere in the genome ; 4) , the ratio between the rates of nonsynonymous and synonymous site substitution based on the comparison between S. cerevisiae and S. paradoxus ; 5) degree centrality, the total number of interactions with other genes ; 6) protein expression noise ; 7) mRNA and 10) protein abundance , . See Protocol S1 for more details.
Supplementary Methods. Sections: modeling recombination; yeast data; software.
(0.09 MB PDF)
Examples of f functions consistent with the viable portions of the condensed mutational networks in Figure 5. The dashed line indicates the threshold for driving gene expression.
(0.06 MB PDF)
Redundancy is more likely to evolve if there are more segregating binding site alleles. (A) Expected number of exact matches to canonical binding sequences of different lengths (n) in promoters of different lengths (L). (B) Expected number of matches to an 8-bp canonical binding sequence allowing for different numbers of mismatches (m) in promoters of different L. The value of m models different levels of TF promiscuity. In (A) and (B) values are means and 95% confidence intervals of 10 independent sets of 104 random sequences with the same average GC content as yeast intergenic regions (except for n = 8, m = 0 and L≤200, where 60 sets of sequences were used). Dashed lines mark an expected number of 2 binding sites. (C) Number of redundant genotypes and (D) total equilibrium frequency of redundant genotypes for different numbers of segregating binding sites (K). For K = 2, the model is that shown in Figure 2. See Figure S3 for K = 3.
(0.09 MB PDF)
Condensed mutational networks for a promoter with K = 3 binding sites (all with length n = 6). (A) Diagram of gene with three binding sites. (B) Condensed mutational network. Axes represent the numbers of mismatches of each binding site relative to the canonical sequence. Each node represents a genotypic class. As in Figure 2, the promoter regulates an essential gene such that at least one canonical binding site is required for activity. The nodes shown in black define the viable portion of the condensed mutational network. The nodes in gray represent inviable genotypes. (C) Shows only the viable portion of the condensed mutational network. The genotypes highlighted in gray are redundant.
(0.17 MB PDF)
Stochastic simulations of the effect of recombination. Populations of different sizes (N) are initialized at mutation-selection equilibrium. (A) r/μ = 0.1, (B) r/μ = 1, and (C) r/μ = 10. In all cases, we used μ = 0.1, an unrealistically high value. Values are medians of the frequencies of redundant genotypes for 500 replicate populations. In populations of both sizes redundancy evolves quickly, but is then lost by drift. Dotted lines show the deterministic expectation (see Figure 3B).
(0.08 MB PDF)
Binding site PWM models used in our study. Values are the length (8.7 ± 2.6 bp, mean ± standard deviation), GC content (0.53 ± 0.19) and mean information content (I) per position (1.33 ± 0.29) of each PWM (after processing as described in Protocol S1). The value of I can vary between 0 and 2, and is a measure of the energy contribution of a position to TF binding. Each PWM is summarized by its canonical sequence: “.” indicates a position with I = 0; “[ / ]” indicates bases with the same weight at a given position. Letters in parentheses after TF names indicate the study from which we took the PWM data: B, Badis et al. (2008); H, Harbison et al. (2004); M, MacIsaac et al. (2006); Zhu et al. (2009).
(0.06 MB PDF)
Binding site PWM models not considered in our study. These PWMs were excluded because they were almost identical to the PWMs listed in the ‘Equivalent’ column, shown in Table S1. Letters in parentheses after TF names indicate the study from which we took the PWM data: B, Badis et al. (2008); H, Harbison et al. (2004); M, MacIsaac et al. (2006); Zhu et al. (2009).
(0.04 MB PDF)
Data used to construct Figure 6. The data are sorted by decreasing M Obs/M Exp. The second and third columns show the Total number of binding sites revealed in a scan across the number of promoter regions in the ‘Prom’ column (the numbers vary because the minimum promoter length considered in each scan is twice the length of the PWM). Letters in parentheses after TF names indicate the study from which we took the PWM data: B, Badis et al. (2008); H, Harbison et al. (2004); M, MacIsaac et al. (2006); Zhu et al. (2009).
(0.05 MB PDF)
Matrix of correlations between genomic features. Values are Spearman's rank correlation coefficients (ρ). Data used for the cluster analysis in Figure 7.
(0.04 MB PDF)
Association between multiplicity and GO slim terms from each domain. Significance levels after correction for multiple comparisons using the Holm method: *, P<0.01; **, P<0.001; ***, P<0.0001.
(0.06 MB PDF)
We thank G. Balázsi, T. Cooper, S. Proulx, C. Ray and M. Siegal for helpful discussions. S. Proulx shared the yeast gene expression robustness data.
The authors have declared that no competing interests exist.
This work was supported in part by grants from the National Science Foundation (EF-0742803) and the James S. McDonnell Foundation to RBRA. TP was supported in part by a fellowship from the Fundação para a Ciência e a Tecnologia (Portugal). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.