The number of binding sites is correlated with expression variability
To examine whether there is a connection between combinatorial regulation and the length of transcription factor binding sites, we considered the comprehensive map of S. cerevisiae binding site locations derived by Harbison and coworkers [8]. This map was generated using a ChIP-chip assay, characterizing all promoter regions that bind a specific transcription factor, followed by a computational analysis that predicted the precise location of each binding site. Altogether, the data set includes 9,715 binding sites for 102 transcription factors (about 30% of all putative factors), distributed among 2,928 gene promoters.
The number of binding sites varied greatly among gene promoters. Whereas in most promoters at most one or two binding sites were identified, a fraction of genes (about 4%) exhibited more than ten binding sites in their promoter region (Figure ). Genes displaying multiple binding sites in their promoter exhibit a more variable expression pattern (Figure ; see Materials and methods, below), suggesting that the number of binding sites appearing in a gene's promoter can serve as a plausible measure of the degree of combinatorial regulation.
Figure 1 Distribution of binding site numbers and correlation to gene expression. (a) Cumulative fraction of genes according to the number of binding sites in their promoter region. (b) Expression variance averaged over all genes with a like number of binding sites ...
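The tabulation behind the cumulative distribution described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout (one `(gene, factor)` pair per predicted site), the function names, and the toy gene and factor labels are all assumptions made for the example.

```python
from collections import Counter


def sites_per_promoter(site_records):
    """Count predicted binding sites per gene from (gene, factor) records."""
    return Counter(gene for gene, _factor in site_records)


def cumulative_fraction(counts):
    """Map each site count k to the fraction of genes with at most k sites."""
    n_genes = len(counts)
    histogram = Counter(counts.values())
    running = 0
    result = {}
    for k in sorted(histogram):
        running += histogram[k]
        result[k] = running / n_genes
    return result


# Toy records (hypothetical labels, for illustration only).
records = [
    ("geneA", "TF1"),
    ("geneB", "TF1"), ("geneB", "TF2"),
    ("geneC", "TF3"), ("geneC", "TF1"), ("geneC", "TF2"),
    ("geneD", "TF2"),
]
counts = sites_per_promoter(records)
cdf = cumulative_fraction(counts)
print(cdf)
```

Only genes with at least one predicted site enter this toy tabulation; whether promoters without any site are included in the denominator is a choice the sketch does not settle.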
Binding sites for specific transcription factors are less specific when they act in combination with other sites
To examine whether binding site properties depend on their co-appearance with additional sites in the same promoter region, we focused first on binding sites for specific transcription factors. The factor that binds the largest number of genes (293) is Reb1, whose well-defined consensus binding site consists of seven nucleotides. As expected, in most gene promoters the predicted Reb1 binding site deviates somewhat from the precise consensus. We considered whether this deviation depends on the number of additional binding sites appearing in the same promoter.
The match of the Reb1 binding site to its consensus motif decreased sharply with the number of co-appearing binding sites (Figure ). Although this is particularly striking for Reb1, similar behavior was observed for two-thirds of all 102 transcription factors and for 82.5% of the 40 transcription factors that regulate at least 50 genes (P = 5 × 10⁻⁵ for this number of factors, estimated by randomly shuffling the binding sites of each factor and assuming a normal distribution). We conclude that binding sites for a specific transcription factor tend to be less specific when they co-appear with additional binding sites in the same promoter regions.
'Fuzziness' of Reb1 binding sites. Average fit of Reb1 binding sites to the consensus matrix, as a function of the number of binding sites within the promoter they appear in.
Because different factors often compete for the same binding site [9], we considered whether the reduced precision of the motif reflects the need to comply with several factors, and perhaps also to tune the binding equilibrium between them. However, our analysis does not support this possibility because there was no significant difference between the fit to the consensus of binding sites that overlap other binding sites and of those that do not. In fact, for 25 of the 40 transcription factors that regulate at least 50 genes, the average fit to the motif was higher for binding sites that overlap other sites as compared with those that do not (see Materials and methods, below).
Binding sites that appear in combination with other sites tend to be shorter and less specific
The results above focus on a particular binding site and compare its sequence in different promoter regions. We then considered whether binding sites that tend to appear in promoters containing multiple sites are shorter, on average, than are binding sites that act in isolation. To examine this, we counted for each gene the number of binding sites in its promoter and measured their average length (as it appears in [8]). Indeed, there is a clear inverse correlation between these two values; the higher the number of binding sites, the shorter their average length (Figure ; Additional data file 7). Note that length here is defined according to the motif consensus, as indicated by Harbison and coworkers [8].
Figure 3 Average promoter and gene properties as a function of the number of binding sites. (a) Average binding site length. (b) Fraction of essential genes. (c) Sum of expression correlations. (d) Fraction of binding sites that are 'new' (not conserved in other ...
One possibility is that this negative correlation merely reflects the fact that shorter binding sites appear more often (or are predicted more often by the computational method used). To control for this possibility, we examined the distribution of correlations obtained by reshuffling the binding data. Indeed, the observed correlation is 13.6 standard deviations away from the mean of this random distribution, corresponding to a P value of about 10⁻⁴² (assuming a normal distribution). Moreover, essentially the same results are obtained when controlling for multiple appearances of the same binding sites, and considering only the number of transcription factors that bind the promoter (Additional data file 4). In contrast to the total number of binding sites, this latter measure is independent of the computational methods used by Harbison and coworkers [8] in defining binding sites.
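The shuffling control described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: we assume that 'reshuffling the binding data' means permuting site lengths across all sites while keeping the number of sites per promoter fixed, and that the P value is a one-sided normal tail probability. All function names and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_vs_mean_length(promoters, lengths):
    """Correlate per-promoter site count with per-promoter mean site length."""
    grouped = {}
    for promoter, length in zip(promoters, lengths):
        grouped.setdefault(promoter, []).append(length)
    counts = [len(v) for v in grouped.values()]
    means = [statistics.fmean(v) for v in grouped.values()]
    return pearson(counts, means)


def normal_tail(z):
    """One-sided tail probability P(Z >= z) of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def shuffle_test(promoters, lengths, n_shuffles=200, seed=0):
    """Return the observed correlation, its z-score against the shuffled
    distribution, and the normal-approximation P value."""
    rng = random.Random(seed)
    observed = count_vs_mean_length(promoters, lengths)
    shuffled = list(lengths)
    null = []
    for _ in range(n_shuffles):
        # Permute lengths across sites; site counts per promoter stay fixed.
        rng.shuffle(shuffled)
        null.append(count_vs_mean_length(promoters, shuffled))
    z = (observed - statistics.fmean(null)) / statistics.stdev(null)
    return observed, z, normal_tail(abs(z))


# Synthetic data (assumption): promoters with more sites get shorter sites.
data_rng = random.Random(1)
promoters, lengths = [], []
for i in range(60):
    k = 1 + i % 6
    for _ in range(k):
        promoters.append(f"promoter_{i}")
        lengths.append(12 - k + data_rng.choice([-1, 0, 1]))

observed, z, p = shuffle_test(promoters, lengths)
print(observed, z, p)
print(normal_tail(13.6))  # tail probability at 13.6 standard deviations
```

Under the normal approximation, a deviation of 13.6 standard deviations corresponds to a one-sided tail probability of order 10⁻⁴², in line with the value quoted in the text.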
Importantly, the negative correlation between the length of a binding site and the number of additional sites appearing in the same promoter region does not depend on the precise definition of binding-site length. In fact, similar correlations, with equivalent statistical significance, were also observed for more refined definitions of binding-site length or 'fuzziness', including the Euclidean or Kullback-Leibler (KL) distance of the motif from the background distribution, the average fit of a binding site to the motif, and the probability that a given binding site appears at random (see Materials and methods, below; also see Additional data file 1).
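Two of the measures named above, the KL and Euclidean distances of a motif from the background distribution, can be illustrated with a short sketch. The exact definitions are given in Materials and methods; here we assume, for illustration only, that a motif is a list of per-position base frequencies, that distances are summed over positions, and that the background is uniform. The function names and the toy motif are hypothetical.

```python
import math

BASES = "ACGT"


def kl_from_background(motif, background):
    """Sum over positions of the KL divergence (in bits) between each
    motif column and the background base frequencies."""
    total = 0.0
    for column in motif:
        for base in BASES:
            p = column[base]
            if p > 0:  # terms with p == 0 contribute nothing
                total += p * math.log2(p / background[base])
    return total


def euclidean_from_background(motif, background):
    """Sum over positions of the Euclidean distance between each motif
    column and the background base frequencies."""
    return sum(
        math.sqrt(sum((column[b] - background[b]) ** 2 for b in BASES))
        for column in motif
    )


# Uniform background (assumption; a genome-specific background could be used).
uniform = {b: 0.25 for b in BASES}

# Toy two-position motif: one fully specific position, one uninformative.
toy_motif = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

kl = kl_from_background(toy_motif, uniform)
euclid = euclidean_from_background(toy_motif, uniform)
print(kl, euclid)
```

With these definitions, both a shorter motif and a more degenerate one score lower, which is why such distances can serve as refined measures of effective binding-site length.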
Particularly informative is the fuzziness measure, which describes the average fit of the motif to its consensus site (Additional data file 1 [panel d]). Longer motifs are expected to have more ambiguous positions than shorter ones because there is some flexibility in defining the boundaries of a binding site, and also simply because there are more positions that can be ambiguous. Indeed, when considering all appearances, longer sites tend to be fuzzier than shorter ones (Additional data file 2). Because motif length is negatively correlated with the number of co-appearing sites (Figure ), the null hypothesis is that motif fuzziness is negatively correlated with the number of co-appearing sites. The observation that the opposite phenomenon occurs (Additional data file 1 [panel d]) further emphasizes the statistical significance of the correlation between motif fuzziness and the number of co-appearing binding sites.
Functional characterization of genes under combinatorial control
Taken together, our results suggest that multiple binding sites are associated with shorter and less specific binding sequences. One possibility is that motif multiplicity allows for mutations that decrease the length and specificity of the motif. In this model, interactions between factors can compensate for the decreased specificity of each individual site, ensuring precise expression of the associated gene.
Alternatively, shorter and fuzzier motifs may indicate lower pressure to maintain precise control of the expression of the associated gene. Lower selective pressure would allow for mutations that reduce binding-site specificity on the one hand, and would also allow for the addition of new binding sites on the other. In this case, both binding-site fuzziness and combinatorial regulation reflect the same gene property, but they do not cause each other.
To try to differentiate between the two possibilities, we examined the properties of genes with promoters that exhibit a large number of binding sites. Interestingly, we found that essential genes (in rich glucose medium [10]) are over-represented among genes with few binding sites (Figure ). This preferential appearance of binding sites in the promoter regions of nonessential genes, the regulation of many of which we conjecture to be under lower negative selection, supports the possibility that binding site abundance depends on the selective pressure acting on the region.
Genes that are not essential for growth in rich glucose medium might still be essential for growth in other conditions. To complement the analysis described above, we also analyzed the number of binding sites upstream from genes whose knockout led to slow and fast growth in different growth media (Yeast Deletion Project [11]). As shown in Table , in all five conditions for which data are available, those genes whose deletion leads to slow growth, and whose regulation we conjecture to be under stronger negative selection, have few binding sites on average. Similarly, genes whose deletion does not hamper growth tend to have a large number of binding sites. We note, however, that these additional conditions are still only a subset of those that are of relevance, and ultimately more experiments are needed to test this hypothesis in full.
Average number of binding sites for genes leading to slow and fast growth
As another indicator of the functional importance of the transcriptional regulation of a particular gene, we considered the number of genes that are correlated with it. Indeed, genes that are part of large co-regulated groups tend to exhibit a lower number of binding sites in their promoter region, as compared with genes that are co-regulated with only a few genes (Figure ; P ). A similar although less significant (P = 0.04) correlation was observed for genes that participate in large protein complexes [13].
The gene properties above provide only an indirect indication of the functional importance of a gene and thus of the selective pressure to maintain its expression. Perhaps a more direct way to identify promoters that are under negative selective pressure is to differentiate between promoters that potentially regulate two genes on the two opposing strands ('divergent promoters') and those that regulate only one. The former group is likely to be under stronger negative selection because mutations there will potentially affect the regulation of both genes. Indeed, as can be seen in Figure , divergent promoters tend to exhibit a lower number of binding sites, supporting the proposal that binding site multiplicity reflects lower selection pressure on promoter regions.
Distribution of 'divergent' promoters. The fraction of promoters that potentially regulate two genes in each subset of promoters with an equal number of binding sites.
Finally, we also looked for Gene Ontology terms associated with sets of genes whose promoters exhibit an exceptionally high or low average number of binding sites (Table ). Genes involved in metabolism appear to have a higher number of binding sites, but this enrichment is only marginally significant (P values shown are the probability for a set of this size to have the observed average number of binding sites).
Average number of binding sites according to GO annotations
'Preferential attachment' pattern for the addition of new binding sites
Our findings are consistent with a model whereby increased fuzziness and increased number of binding sites both reflect reduced selection pressure to maintain precise expression. To examine this possibility from a different angle, we considered whether new binding sites tend to appear preferentially in some promoter regions. If multiple sites merely compensate for binding-site specificity, then no specific trend is expected. By contrast, if multiple sites (and the fuzziness of binding sites) reflect reduced constraints on gene expression control, then new binding sites would be expected to appear in promoters of genes that already exhibit a large number of binding sites. Indeed, their appearance in such regions is probably less likely to be selected against.
To examine the appearance of new binding sites, we used the data comparing the conservation of binding sites between S. cerevisiae and the three sensu stricto species whose genomes were recently sequenced [14]. It is likely that sites that are conserved in these species were also present in the genome of the common ancestor and thus represent ancient binding sites. In contrast, binding sites that are not conserved in any of the species may represent new additions to the S. cerevisiae genome.
We found that new binding sites tend to appear in promoter regions that already contain a large number of binding sites (Figure ). By randomly shuffling the binding-site data, we estimated this observation to be highly significant (P is approximately 10⁻²², assuming a normal distribution).
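The grouping behind this comparison can be sketched as follows: sites are binned by the total number of sites in their promoter, and the fraction of non-conserved ('new') sites is computed per bin. This is a minimal sketch under assumptions: the record layout (one `(promoter, is_conserved)` pair per site), the function name, and the toy data are hypothetical and are not taken from the original analysis.

```python
from collections import defaultdict


def fraction_new_by_site_count(sites):
    """Group promoters by their number of sites and return, for each
    site count, the fraction of sites that are not conserved ('new')."""
    per_promoter = defaultdict(list)
    for promoter, conserved in sites:
        per_promoter[promoter].append(conserved)

    total_sites = defaultdict(int)
    new_sites = defaultdict(int)
    for flags in per_promoter.values():
        n = len(flags)
        total_sites[n] += n
        new_sites[n] += sum(1 for conserved in flags if not conserved)

    return {n: new_sites[n] / total_sites[n] for n in sorted(total_sites)}


# Toy data (hypothetical): True marks a site conserved in the other species.
toy_sites = [
    ("p1", True),
    ("p2", True),
    ("p3", True), ("p3", False), ("p3", False),
    ("p4", False), ("p4", True), ("p4", True),
]
fractions = fraction_new_by_site_count(toy_sites)
print(fractions)
```

A rising fraction with increasing site count is the pattern the text describes as preferential attachment; its significance would still need to be assessed against shuffled data.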