|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: JSR KB GPM WHL. Analyzed the data: JSR KB GPM. Wrote the paper: JSR KB GPM WHL.
It is now experimentally well known that variant sequences of a cis transcription factor binding site motif can contribute to differential regulation of genes. We characterize the relationship between motif variants and gene expression by analyzing expression microarray data and binding site predictions. To accomplish this, we statistically detect motif variants with effects that differ among environments. Such environmental specificity may be due to either affinity differences between variants or, more likely, differential interactions of TFs bound to these variants with cofactors, and with differential presence of cofactors across environments. We examine conservation of functional variants across four Saccharomyces species, and find that about a third of transcription factors have target genes that are differentially expressed in a condition-specific manner that is correlated with the nucleotide at variant motif positions. We find good correspondence between our results and some cases in the experimental literature (Reb1, Sum1, Mcm1, and Rap1). These results and growing consensus in the literature indicates that motif variants may often be functionally distinct, that this may be observed in genomic data, and that variants play an important role in condition-specific gene regulation.
Transcription of genes into mRNA is mediated by transcription factor (TF) binding sites in upstream promoter and enhancer sequences. Mutations in these promoter sequences therefore affect gene regulation and may contribute to pathogenesis or evolution , , , , , , , , , . Different types of promoter variation are rapidly being explored, including heterozygous variation between promoter copies resulting in allele-specific expression , , complete gain and loss of regulatory function by single nucleotide substitutions , , and differences in binding properties among binding site motif variants (BSMVs) that promote differential interactions with co-activators . The observation that BSMVs from co-associating sites in the genome often co-vary with each other to maintain function has led to a method for discovering binding sites by searching for correlated SNPs 1–2 kb apart among individuals , . Promoter variation is an important source of data that will aid understanding the encoding of regulatory function in promoter and other regulatory sequences. The function of several promoters has now been modeled computationally , . However, predicting the activity of promoters on a genome-wide scale will require a sophisticated understanding of the functional effect of BSMVs, the interaction of bound TFs with dynamically changing cofactors, the combinatorial interactions between these sites, and with other epigenetic factors.
Functional BSMVs have been shown to be important in promoting condition-specific activity of transcription factors. BSMVs that have different rates of occupancy (or affinity) by a TF can result in differential gene expression , , , , . McCord et al.  showed a predictive relationship between binding site affinity for many TFs and condition specific differential expression using genome-wide expression data. Ordered binding affinities can explain linear chains of activation, shutoff, or synchronization in dynamic pathways . Differential affinity has been shown to act in coordination with higher order chromatin modifications  and methylation . Computational and data mining approaches to learn these patterns from genomic sequence and expression data will be an important approach for elucidating cases and principles where BSMVs contribute to functionality. For example, Michal et al.  showed that sets of short sequences from promoters can be grouped together according to the expression of associated genes, and that single mutations between these sequence groups are related to known functionally-relevant BSMVs.
One common assumption is that affinity differences between alternative nucleotides provide the biological basis for BSMVs, yet the explanatory power of affinity differences alone is relatively weak. Further, the experimental literature suggests that the mechanisms by which BSMVs mediate differential expression are far more complex. In this more complex class of studied cases, BSMVs cause the bound TF to adopt different conformations, directing interactions with specific cofactors , , , , , , , ; these have been termed allosteric regulators . For example, in the mouse, a single nucleotide difference within the Pit-1 TF binding site determines activation or repression of growth hormone in different cell–types of the posterior pituitary caused by different conformations of the DNA binding domain when it sits on alternative BSMVs . Variation in interaction with sets of cofactors represents a more realistic view of the combinatorial nature of cellular interactions, where the presence or absence of an effected cofactor in different conditions determines whether the BSMV will actually cause a functional effect . Differential ability to interact with cofactors (including repressors versus activators) or TFs bound at cognate binding sites may be a primary basis for regulatory differentiation of BSMVs. For example, the energetics and orientation of the Jun-Fos heterodimer when bound to DNA is altered by single nucleotide variants of the TGACTCA binding site motif . These BSMVs cause differential regulation both among genes and among individuals.
BSMVs may only have simple binding affinity differences for TFs, where one variant is a “higher quality” binding site and has a higher occupancy or recruitment rate , , . Such BSMVs would not be directly causal of complete differences in regulatory activity among their respectively regulated genes. For example, an activating TF at low concentrations may drive higher expression of genes with a high affinity BSMV than genes with a low affinity BSMV. At high enough concentrations of the TF, both high and low affinity BSMVs may be fully occupied, and the expression level of genes with both BSMVs would be the same. Even if expression levels are not comparable between genes, the steepness of the TF concentration-gene expression response curve will vary between BSMVs with different affinities.
In contrast, a proportion of BSMVs that are affected by allosteric effects are expected to show a regulatory impact that is, at the extreme, completely reversed between BSMVs. The most obvious examples are BSMVs that switch between activators and repressors, depending on the presence of cofactors . Such reversals require condition-specific cofactors that are responsible for the differential function of the BSMV across conditions. A condition-specific cofactor may bind differentially to the TF depending on the TF conformation induced by the specific BSMV of the motif . The result is that expression is both BSMV-specific and condition-specific, dependent upon the presence or the activity of the cofactor across conditions. While the regulatory effects of BSMVs that differ in affinity is never expected to be reversed, by searching for BSMVs that are associated with opposite regulatory effects, we propose to identify BSMVs whose action is due to more complex interactions than affinity alone. We applied a statistic focused on detecting instances where the relative expression levels of target genes with distinct BSMVs are maximally different between conditions. We use this statistic to assess the minimum contribution of allosteric interactions to the function of BSMVs, to identify novel candidates for further investigation, and to assess the contribution of these more complex regulatory types to the evolution of regulatory systems.
We tested whether changes in gene expression patterns can be attributed to functional BSMVs by comparing distances between pairs of expression profiles associated with each nucleotide variant at each position of a binding site, where the expression of each gene is ranked across different experimental conditions. We emphasize that BSMVs discussed here are considered only at a single motif position at a time, and the variation in the motif is observed at different promoters in the same genome (as opposed to, for example, population-level variation). Specifically, the effect size was calculated from the average difference between BSMVs in the ranking of expression values for genes controlled by those BSMVs across experiments. We call this metric the variant distance of ranked experiments (VDRE). For each variable TF motif position, VDRE subdivides genes into groups based on the nucleotide at that position in the binding site of the gene's promoter. All variants are simultaneously considered, resulting in a maximum of four BSMV groupings, one for each of the four nucleotides. Significance is measured by comparing within- versus between- BSMV distributions of VDRE with a distribution based on permuted data. The largest effect size in VDRE would occur if the relative ranking of gene expression across conditions is exactly reversed between BSMVs.
The gene expression data used in the VDRE analysis was obtained from 211 published Affymetrix S98 expression microarrays from Saccharomyces cerevisiae. The BSMVs were obtained from genome-wide binding site annotations for 77 TFs in S. cerevisiae, derived by computationally scanning the genome with motif models based on ChIP-chip binding assays, conservation and motif overrepresentation . We divided putative binding sites into a primary (high probability) set and a secondary (low probability) set. We considered only target genes with a single primary binding site. This allowed us to consider 195 variable positions (each with two or more BSMVs) from 48 TF binding motifs.
Using this data set with VDRE, we found that ~29% of TF binding motifs have functional BSMVs (Table 1; Table S1; Fig. S1). In total, we identified 9% (17/195) of the motif positions as functionally variant (p<0.05) across the conditions surveyed in this study at a false discovery rate of 0.3, suggesting that ~12/17 functional BSMVs are true positives. As expected, average distance in expression profile between genes with the same BSMV is significantly smaller than the distance between genes with a different BSMV for functionally variant positions, but not for other variable binding site positions (Fig. 1). In our analysis, only genes with a single primary input are considered; however, if additional target genes with multiple primary inputs are also included, some functional BSMVs are still detected, even though complex regulation was not considered (Fig. S2; Table S2).
We further tested the functional BSMVs identified according to the VDRE statistic (single primary inputs only) to see if they display reversal of their regulatory effects between different experiments—that is, whether their rankings of expression across conditions are reversed between BSMVs. We tested all pairwise combinations of BSMVs for a significant reversal in the ranks of expression levels between genes associated with different BSMVs across experiments. We also tested whether, when experiments are ordered according to the average difference in rank between BSMVs, a line fitted through the average ranking of one BSMV has a positive slope, and a line fitted through the average ranking of the second BSMV has a negative slope. We found that 8 of the 17 functional BSMVs pass both of these tests (14/37 individual comparisons). For these cases, we suspect that the simple binding affinity model can be rejected in favor of a cofactor interaction model.
Our test can only detect functional BSMVs given a dataset of expression patterns across heterogeneous experimental conditions. These conditions must be different enough from each other to create condition-specific expression patterns that cause differential change in the activity of BSMVs. Given this requirement for environmental heterogeneity, the VDRE statistic is agnostic about the relationship between individual gene expression experiments, whether they represent replicates, points in a time series, different concentrations of media additives, or equivalent treatments conducted in different labs. Yet the VDRE approach predicts that the effect of the BSMV will group individual experiments into biologically meaningful clusters, because there will be a detectable and consistent reordering of the expression ranks only when a proportion of the experiments are similarly affected. To test this prediction, we grouped the experiments into classes according to basic treatment (starvation, sporulation, etc.), and examined whether like-experiments cluster together in their explanation of the functional BSMVs.
For each pairwise combination of nucleotides observed at a functionally variant motif position, we ordered all the experiments by the relative mean expression difference between the sets of target genes of each BSMV, and found that similar experimental treatments group together (p<0.05) in all 47 pairwise comparisons in the Affymetrix dataset (Table S3). For example, the effect of “A” and “T” BSMVs at position 4 of the Mcm1 binding site are different after exposure to MMS compared to desiccation and rehydration (p<0.0001; Fig. 2A). Similarly, the effect of “A” and “T” BSMVs at position 8 of the Sum1 binding site are different during sporulation compared to all other treatments (p<0.0001; Fig. 2B, 2C).
We also find that 56 out of 83 pairwise comparisons between functional BSMVs identified from an among-species comparative dataset (described below in the “conservation” discussion section) show condition specificity (Table S3). For example, the difference in regulation of genes with “G” or “A” at position 9 for the Reb1 binding sites is highest during growth in glycerol in all three Saccharomyces species examined, and therefore these experiments cluster together in Figure 3A–C.
As an independent line of evidence supporting the condition-specific action of functional BSMVs, genes associated with particular BSMVs often show enrichment for gene ontology (GO) processes consistent with their condition specific effects (Table S4). In the example of Mcm1, genes with an “A” variant of the binding site are induced during desiccation and rehydration (Fig. 2A), and these genes are also enriched for the protein modification GO process (p=0.004). Genes with “T” BSMVs are upregulated in other conditions, and these genes are enriched for the DNA metabolism GO process (p=0.02). In the example of Sum1, genes with an “A” variant at position 8 of the binding site are upregulated specifically during sporulation (Fig. 2B, 2C) and are also enriched for the sporulation GO process (p<0.001). Genes with a “T” BSMV at this position are upregulated in other conditions, and these genes are enriched for the protein biosynthesis GO process (p=0.04).
The quality of the binding site annotations for a TF and the extent to which the TF's target genes are influenced by the TF are both important for our conclusions. To increase our power to detect functional BSMVs, in the analysis presented above we focused on target genes with simple regulatory control regions. As a proxy for regulatory simplicity, we selected only those target genes with a single primary binding site for the same TF (posterior probability >0.7). Pairs of such genes have significantly more similar expression profiles than pairs of genes that either have additional primary binding sites (p<2.2e-16) or than random gene pairs (p<2.2e-16), as is expected for genes participating in simpler regulatory circuits (Fig. 4). Target genes sharing secondary TF binding sites (posterior probability <0.7 and >0.2) have significantly less similar expression profiles than target genes that share a single primary binding site (p<2.2e-16), indicating that the posterior probabilities of the binding site predictions are reasonable. While we relied on genome-scale binding site annotations, a small number of RAP1 binding sites have been experimentally determined; these sites show the same pattern between BSMVs and target gene expression as predicted to be functional in this paper (Fig. S3). This suggests good concordance between the genome-scale annotations and experimentally validated sites for the binding sites considered in this study.
BSMVs considered in the analysis have, on average, ~35 target genes, and functional BSMVs do not have a significantly different number of target genes than do non-functional BSMVs (p=0.4; Fig. S4). However, power to identify functional BSMVs is a function of the number of within and between-BSMV comparisons, not simply the number of target genes. If BSMVs are distributed unevenly among genes or if genes are partitioned into too many BSMV groups, then there may be few between-BSMV comparisons or within-BSMV comparisons, reducing the power. We find fewer functionally variant binding site positions in our dataset that have a small number of either within or between comparisons than do all variable motif positions, indicating that our permutation test does not spuriously indicate sites for which there is too little information to reliably classify them as functional, and that most variable binding positions have a substantial number of both within and between comparisons (Fig. 5). The permutation test itself accounts for correlation structure due to multiple pairwise comparisons.
We selected genes with only a single primary TF binding site, yet it is possible that functional BSMVs we detect are due to co-occurrence between a particular BSMV and a TF binding site that falls below our primary stringency threshold or is not otherwise known or annotated. In this case, similar expression profiles may be caused by the presence of separately binding TFs . The non-random presence of such secondary binding sites may either be biologically related to cooperation with the BSMV, or may occur by chance. To examine possible co-occurrences in our dataset, we searched for correlations between functional BSMVs and secondary (lower quality) binding sites. For at least 13 of the 17 motif positions with functional variants (and 44 of the 54 positions discovered across species, below), we can exclude the possibility that the association between BSMV and expression profile is due to a secondary TF binding site co-occurring with a particular BSMV (Table S5). We note that most secondary binding sites probably do not represent real binding sites, and we suspect that these correlations are due by chance to the extremely large number (~18,000) of secondary binding sites genome-wide. Our estimate of the fraction of functional BSMVs that could potentially be explained by additional low probability binding sites is conservative, since we cannot consider TFs that do not yet have characterized binding sites.
We applied our method to each of four Saccharomyces sensu stricto species, using a published comparative data set of gene expression during stress conditions, assayed on a single cDNA microarray platform . In this dataset, we found that ~30–39% of TF binding motifs have functionally variant positions in Saccharomyces sensu stricto species (Table 1; Table S1; Fig. S5). This proportion is comparable to the ~29% of motifs with functional variants identified in the Affymetrix dataset. Nine out of these 21 motifs have functional variants that are conserved in more than one species. These conserved functional BSMVs comprise about one fifth (9/42) of the positions identified as having functional variants (Table 2).
This conservation suggests that the BSMVs are under evolutionary constraint to preserve their function. Indeed we find that there is purifying selection acting both on variable and highly variable motif positions. We calculated the average evolutionary substitution rate of each site across the Saccharomyces sensu stricto phylogeny, and found that low information BSMV positions (≤1 bit of information) evolve significantly slower than sites that are expected to be evolving neutrally: the third position of codons, introns, or other intergenic regions (Fig. 6). These low information BSMV positions have nearly twice the number of evolutionarily invariant sites (65%) compared to third codon positions (37%), and more than intergenic regions (47%) and introns (62%; Table 3). These results suggest that even highly variable binding site motif positions are functionally constrained. This observation is in agreement with previous studies which showed that different rates of nucleotide substitution at binding sites are sometimes associated with functionally different classes of BSMVs, where classes sometimes differ by only a single nucleotide , .
Considering the limited number of genes that meet our criteria for having a simple cis-regulatory promoter, and the finite number of conditions for which expression data is available, the proportion of functional BSMVs (9%) among all motif positions is remarkable.
We turned to the literature to assess the validity of a sample of the functionally variant binding positions we identified. We discuss what is predicted about each example from the VDRE approach alone, and then discuss each prediction in light of experimental evidence from the literature. Position 4 of the Mcm1 binding site, also called the middle sporulation element, is an example of a functionally variant binding site position identified in this analysis (Fig. 2A; p=0.032). Under conditions where yeast is subjected to desiccation and rehydration, genes with an “A” at this position are induced, in comparison to genes with a “T” at this position. Under conditions where yeast is treated with methyl methanesulfonate (MMS), a DNA-damaging alkylating agent, the genes with an “A” at this position are repressed, in comparison to genes with a “T” at this position. A third category of genes has “C” at this position, and the VDRE scores of all three nucleotides (“A”,“T” and “C”) were considered when determining that the position is functional (Fig. S1). The Mcm1 protein is a member of the MADS box family and plays important roles in several diverse cellular processes; therefore, its binding site has been extensively characterized. When Mcm1 binding sites were selected from a pool of random sequence oligonucleotides, about three quarters of the selected sequences had “A” at position 4, ~15% contained a “T” at this position, and Mcm1 had a higher affinity to “A” BSMVs than to “T” BSMVs . Putative Mcm1 binding sites were cloned in a heterologous promoter in front of a reporter gene , and a Mcm1 binding site was subjected to saturation mutagenesis in front of a reporter , and in both cases, Mcm1 binding sites with “A” variants at position 4 showed higher (~2×–3×) activation of the reporter than “T” (or “C”) variants.
Mcm1 acts as an activator alone, but as a repressor when co-bound with α2. The saturation mutagenesis of the Mcm1 binding site shows that BSMVs have different effects, depending on whether or not the α2 is co-bound . An “A” nucleotide at position 4 of the binding site results in more than twice as much activation of the reporter gene than a “T”, but when α2 is present the high level of repression of reporter gene by the two BSMVs is almost identical–130× for the “A” BSMV and 126× for the “T” BSMV. One reason for this combinatorial effect may be that Mcm1 is known to induce sequence-specific DNA bending, which in turn regulates the formation of ternary complexes with other cofactors , . Many of the single base pair changes in the binding site that alter its DNA bending and transcriptional regulation do not affect the affinity of the TF for the binding site . Our finding that the “A” and “T” variants at position 4 of the Mcm4 binding site have different effects under different conditions makes sense because cofactors that act in a BSMV-specific way may be present in only a subset of these conditions. Although we have not determined which cofactor(s) are involved in our case, it is interesting that α2 is absent from the haploid a-mating type strain used in the MMS experiments , but present in the a/α diploid strain used in the desiccation/rehydration experiments .
Sum1 provides another example of how BSMVs may regulate target genes in a condition-specific manner through the participation of another factor, in this case, a competing transcription factor. During growth in rich media, we find that genes regulated by binding sites with a “T” at this position are induced, relative to genes with an “A” at position 8 (Fig. 2B; significance of functional BSMV p=0.003). During sporulation, the opposite relationship is observed. (Sum1 binding sites with “C” at this position are also functional; Fig. S1). During vegetative growth, Sum1 induces expression of target genes, and the regulatory difference between genes with different variants at position 8 of the Sum1 binding site is small; indeed, while Sum1 has been shown experimentally through mutagenesis to bind sites with a “T” BSMV at position 8 at about 20% the rate of sites with an “A,” repression of reporter activity remained similar between the BSMVs in that study . However, during sporulation, the repressor Ndt80 is also expressed, and competes with Sum1 for binding to the motif, dictating whether the site acts as a repressor or activator. The relative affinity of the BSMV for Ndt80 versus Sum1 acts as a molecular switch that induces only the genes required for the meiotic G2-to-M transition. For the “A” variant at position 8 of the binding site, Ndt80 out-competes Sum1 and causes induction of the target gene, while for the “T BSMV, Ndt80 does not out-compete Sum1, and the repressive effect of Sum1 on the target gene remains the same as it was for the “A” BSMV in the absence of Ndt80. This type of effect may explain why “A” and “T” functional variants at position 8 of the Sum1 binding site have different regulatory associations with target genes in sporulation media versus other conditions.
The functional BSMV at position 8 of the Sum1 binding motif remains significant when also considering target genes with multiple primary inputs using VDRE (p<0.001), and its effect on target genes in different conditions remains the same, even though the number of genes considered is greater (Fig. 2C). Although the method presented here does not explicitly accounts for the effects of both multiple regulatory inputs and BSMVs, such an approach is currently under development .
The functional BSMVs revealed using the two different platforms (cDNA vs. Affymetrix) were largely non-overlapping. This is the expected result since the regulatory function of BSMVs we detect is condition specific, and the conditions investigated in these sets of experiments are different.
A proportion of the functional BSMVs were identified in multiple species, suggesting that the BSMVs are under evolutionary constraint to preserve their function. For example, position 9 of the Reb1 binding site was identified as having functional BSMVs in S. cerevisae, S. paradoxus and S. mikatae. Genes regulated by binding sites with a “G” at this position are induced relative to genes with an “A” during growth in glycerol in all three species (Fig. 3A–C). In a small-scale affinity selection experiment, Reb1 had lower binding strength to sites with “G” at position 9 than to sites with “A” at position 9, and “G” BSMVs promoted lower transcriptional activity than “A” BSMVs when grown on 2% glucose plates .
Position 10 of the Rap1 binding site also has functional BSMVs identified in multiple species. During glucose starvation conditions (growth in glycerol), genes with “C” BSMVs are induced with respect to genes with “T” BSMVs (Fig. 3D–F). Differences in affinity of Rap1 binding sites have been shown to be specifically associated with expression in low glucose conditions, according to a precise set of experiments including ChIP-chip, protein binding microarrays, deletion mutants, and gene expression analysis , . High affinity sites are constitutively bound by Rap1, while low affinity binding sites are protected by chromatin structure from Rap1, except during low glucose conditions, when chromatin conformational changes expose them, and Rap1 binds and induces expression. According to our method, such BSMV-by-condition patterns for Rap1 can be learned from accurate binding site predictions and expression patterns alone.
Yeast has only around 200–300 TFs to regulate its complex regulatory function—from budding to the cell cycle to selectively metabolizing dozens of different energy sources. The fundamental question in regulatory biology is how a relatively small number of TFs orchestrate the regulation of thousands of genes to achieve innumerable phenotypic responses. The fine-tuning of TF binding motifs at non-consensus positions may provide an important source of control in coordinating these condition-specific expression patterns.
In this study, we found that a significant proportion of variable positions in TF binding motifs may have functional consequences. Several of these predictions are in agreement with available experimental evidence, and several are corroborated by conservation across species. We considered only a single variable position at a time and did not explicitly account for promoters with complex regulatory inputs. More functional BSMVs should be found if combinations of positions and/or binding sites are formally considered .
Functional BSMVs allow the same TF to have a broad range of regulatory effects simultaneously over different target genes. Our results, consistent with the molecular biology literature, show that these differential regulatory effects between BSMVs can change with the concentration of the TF and/or the concentration of cofactors across environmental or cellular conditions.
As the complexity of organisms increase, the complexity of their regulatory responses needs to also increase to accommodate differential expression across tissues and numerous developmental stages. We therefore expect that the contribution of functional BSMVs to the cis-regulatory code of higher eukaryotes may be even more pronounced, an idea supported by the observation of such BSMVs in the experimental literature in diverse organisms such as nematode , , fly , mouse , and human .
Binding site annotations were obtained from SwissRegulon , where position weight matrices (PWMs) from over-represented motifs in microarray bound DNA regions from high-throughput chromatin immunoprecipitation of 102 TFs  in S. cerevisiae were calculated by PhyloGibbs , and where these PWMs were inputted into MotEvo, which is a scanning algorithm that finds hits to a PWM, but also considers conservation in other species . To obtain a set of genes under putatively simple forms of regulatory control, we included promoters and target genes with a single primary TF binding site, where primary binding sites have a posterior probability (according to MotEvo) of 0.7 or greater. This results in 1219 genes included among the target sets for the Affymetrix expression data set (see below) and between 648 and 664 genes, depending on the species, for the cDNA expression data sets. For comparison, we also examined sets of genes with multiple primary binding sites, or with secondary binding sites, which have a posterior probability less than 0.7 but greater than 0.2.
The lengths of the binding site motifs vary from 6 bp to 16 bp. The information, Rij, at position j of site i, was calculated according to , given the following base frequencies: fA=0.307, fC=0.188, fG=0.188, and fT=0.316.
To calculate the nucleotide substitution rate at each position in each binding site, we performed a whole-genome multiple sequence alignment and pairwise alignments using MAUVE  between S. cerevisiae and S. bayanus, S. mikatae, S. paradoxus and S. kudriavzevii, and discarded regions of sequence that are gapped in S. cerevisiae. We discarded the binding site if the alignment from the pairwise and multiple sequence alignments was different or if gaps existed in the sequence, except in the following cases: (1) no orthologous alignment of the region was produced by one method (pairwise or multiple) but an ungapped alignment was produced by the alternative method and (2) the alternative alignments of the binding sites are the same except for the introduction of gaps into the pairwise alignment, in which case we used the multiple alignment. We discarded positions that contain a gap or are unalignable.
Branch lengths for the five-species phylogeny  were calculated in PAML under a reversible model and a gamma distribution of rates with four categories . Site-specific normalized average evolutionary rates were estimated based on this genome alignment and phylogeny using empirical Bayesian estimation in the program Rate4Site 2.01 . A gamma distribution of rates with 35 rate categories was fit with an alpha parameter value of 0.62. The mean rate of sites from each of the following classes was calculated: coding regions (first, second and third positions), introns, intergenic regions, binding site positions with greater than one bit of information, and binding site positions with less than or equal to one bit of information. The 95% confidence interval for the mean rates of each class was calculated from 1000 nonparametric bootstrap replicates in the R package boot. We also calculated proportion of invariant sites for each of these classes. The 95% confidence interval for these proportions was calculated using the method of Wilson .
We assembled a concatenated expression microarray dataset representing S. cerevisiae expression in 211 experimental conditions from S98 Affymetrix array data in the NCBI Gene Expression Omnibus. To maximize comparability we chose to use data from only one array-type, Affymetrix S98, which has the largest number of experiments available among all platforms and array designs. Expression values were normalized by the robust multi-chip average (rma) algorithm implemented in the R-affy package of Bioconductor .
The second expression dataset we used contains measurements under equivalent environmental stress conditions for four yeast species from the Saccharomyces sensu stricto complex: S. cerevisiae, S. paradoxus, S. kudriavzevii and S. mikatae measured on Y6.4kv6 cDNA arrays (GEO record GSE3406 ). cDNA hybridization after each stress condition for each species was measured relative to cDNA before application of the stress condition for that species, with either four biological replicates (two stress conditions) or across six time points (five stress conditions).
In order to make between-gene comparisons of expression across conditions, we rank-ordered each gene's expression level across conditions, and used the ranks as a proxy for the expression levels in all analyses. To identify functional BSMVs, we selected a set of binding site positions that have at least two target genes for at least two BSMVs. For a given position in a motif, we compared two distributions of pairwise Euclidean distances between the ranked expression profiles (VDRE): DW, the distribution of distances between pairs of targets that have the same BSMV, and DB the distribution of distances between pairs of targets that have different BSMVs. The functionality score of the motif at one position is:
where N is the number of within-BSMV pairwise distances in the summation and M is the number of between-BSMV pairwise distances, and the distances between target genes of all variants of a binding site motif are considered simultaneously for a single motif position. We stress that this measure for assessing the functionality of BSMVs incorporates expression distances between target genes of all the nucleotide variants of that binding site position, not only between genes associated with just two BSMVs. The significance of the association between BSMV and expression profile was calculated by permuting the assignment of target genes to BSMVs 1000 times and comparing observed values of F with the permuted null distribution. The full set of motif positions with functional BSMVs detected in each species is shown in Table S1. The false discovery rate was calculated using the p-values from the observed data .
For comparison, we also identified functional BSMVs for all genes, including genes with more than a single primary binding site in their promoter. The same methodology was followed as for genes with a single primary input. We checked to see whether the nucleotides defining the functional BSMVs are correlated with low probability, secondary binding sites (posterior probability <0.7 and >0.2).
We assessed the coincidence of secondary binding sites for all other TFs with each BSMV at each a functionally variant position. More specifically, among the target genes for a TF with functional BSMVs at a particular position, we tallied the number of targets with a particular BSMV, or a particular coincident low probability binding site, both or neither, resulting in a 2×2 contingency table that we tested for a significant correlation with Fisher's exact test at a level of 0.01.
Genes that are bound by the same transcription factor are often co-expressed . To assess our assumption that genes with only a single primary binding site are subject to simpler regulatory control than genes with additional binding sites, we compared gene expression between the set of all genes with a single primary site for a TF versus the set also including genes that are bound by the TF but may also have additional primary binding sites. We used the Euclidean distance between ranked gene expression values as a comparative measure of gene expression value, a measure that is commonly used , , . We performed a Welch two sample t-test on the two groups to assess whether genes with a single primary binding site (bound by the same TF) have significantly smaller expression distance between them than genes bound by the same TF, but which may also have additional primary binding sites. We similarly assessed our assumption that primary binding sites contain better predictions than secondary binding sites by comparing the expression distance between genes that share a primary binding site versus genes that share a secondary binding site. We compared these distances to expression distances between random pairs of genes.
We sought to quantify the biological consistency of the set of experimental conditions for which each functional BSMV explains target gene expression differences. For each pair of BSMVs, we sorted the experiments according to the difference of the mean expression value for the target sets (normalized by the variance in expression value), and then considered whether similar experimental conditions are clustered. We classified all experiments into groups of similar experimental types. For the comparative dataset, each stress condition was considered a single class including all time points and replicates. For the Affymetrix dataset, all rich media wild type/control experiments were classified together. Most other experimental types (e.g. growth in different media types, deletion strains, etc.) were classified into respective categories across time points. One dataset (replicates GSE1311–1314) which includes a large number of experiments was separated into the two stages of the experiment (desiccation and rehydration). The level of clustering of conditions according to differences in expression values of the BSMVs is defined by:
where CQ the average distance between ranks of experiments of different types, CW is the average distance between ranks of experiments of the same type, A is the number of same-experimental-type pairwise differences in the summation, and B is the number of different-experimental-type pairwise differences. We compared the observed level of clustering to the distribution of clustering for 10,000 data sets, where the assignment of experimental types is permuted.
We also considered whether target genes associated with different functional BSMVs show an enrichment of particular Gene Ontology (GO) biological processes. For each set of target genes with a given BSMV, we permuted the biological processes between BSMVs to create 10,000 null data sets. We examined whether target genes associated with particular BSMVs show enrichment for biological processes compared to the null distribution for that process. A full list of significant GO enrichments for functional BSMVs is shown in Table S4.
Comparison of average gene expression levels between genes with different functional transcription factor binding site motif variants (BSMVs) in S. cerevisiae (Affymetrix). Mean expression levels for target genes of functional BSMVs of S. cerevisiae using expression data from 211 Affymetrix S98 arrays and a variety of experimental conditions. Even if more than two BSMVs exist at a position, only two are shown in each individual graph, and additional graphs show the pairwise comparison between each BSMV present at each position. The means are ordered across conditions according to the difference between mean expression of the two BSMVs. Vertical lines extending from each point indicate the standard deviation of the mean. Horizontal black bars indicate the difference between the mean ranks. The significance of the functional BSMVs was determined without reference to the segregation of experimental conditions, which are shown according to color along the x-axis. The number of targets for each BSMV graphed are shown at the bottom right hand of the graph.
Mean expression levels of target genes with multiple primary inputs (posterior probability >0.7), for functional binding site motif variants (BSMVs) that were identified in both a dataset with multiple primary inputs, and limited to genes with a single primary input. Expression data is from 211 Affymetrix S98 arrays and a variety of experimental conditions. Even if more than two BSMVs exist at a position, only two are shown in each individual graph, and additional graphs show the pairwise comparison between each BSMV present at each position. The means are ordered across conditions according to the difference between mean expression of the two BSMVs. Vertical lines extending from each point indicate the standard deviation of the mean. Horizontal black bars indicate the difference between the mean ranks. The significance of the functional BSMV was determined without reference to the segregation of experimental conditions, which are shown according to color along the x axis. The number of targets for each BSMV graphed is shown at the bottom right hand of the graph.
Comparison of expression patterns in S. cerevisiae of genes associated with “A” and “T” binding site motif variants (BSMVs) at position 7 of the RAP1 transcription factor binding site based on binding site annotations from small-scale experimental mapping or the genome-wide annotations used in this study. Of six RAP1 binding sites for single input genes according to SCPD, three identical (same position and sequence) binding sites are found in the genome-wide binding site annotations used in this study. A) Genes with only a single RAP1 binding site and no other TF binding site annotations collected from the experimental mapping literature in SCPD  were selected. Average expression values of 3 genes with an “A” nucleotide at position 7 of their associated RAP1 binding site are lower during growth in glycerol than the average expression values of 2 genes with a “T.” Expression values of genes with either BSMV are not different during growth in other stress conditions. B) Position 7 of the RAP1 binding site is functionally variant (p=0.024) based on our analysis of genome-wide binding site annotations for RAP1 derived from genome-scanning with models based on ChIP-chip binding assays, conservation and motif overrepresentation , , , . Similar to above, average expression of 27 genes with an “A” BSMV at position 7 of their associated RAP1 binding site are expressed at a lower level than 3 genes with a “T” BSMV.
Number of target genes associated with functional and non-functional binding site motif variants (BSMVs). Variable motif positions considered in the analysis have, on average, ~35 target genes, and functional BSMVs do not have a significantly different number of target genes than do non-functional BSMVs (p=0.4).
Complete set of figures showing comparison of average gene expression levels between genes with different functional transcription factor binding site motif variants (BSMVs) in S. cerevisiae, S. kudriavzevii, S. mikatae and S. paradoxus based on expression data from Y6.4kv6 cDNA arrays. Mean expression levels for target genes of functional BSMVs found at positions in TF binding sites using expression data from Y6.4kv6 cDNA arrays and stress conditions. Even if more than two BSMVs exist at a position, only two are shown in each individual graph, and additional graphs show the pairwise comparison between each BSMV present at each position. The means are ordered across conditions according to the difference between mean expression of genes regulated by the two BSMVs. Vertical lines extending from each point indicate the standard deviation of the mean. Horizontal black bars indicate the difference between the mean ranks. The significance of the functional BSMVs was determined without reference to the segregation of experimental conditions, which are shown according to color along the x-axis. The number of targets for each BSMV graphed is shown at the bottom right hand of the graph.
Transcription factor binding site motif positions that have functional variants inferred according to the variant distance of ranked experiments statistic (p<0.05).
Transcription factor binding site motif positions that have functional variants inferred according to the variant distance of ranked experiments statistic (p<0.05) in both target genes with single primary inputs and multiple primary inputs.
Condition specificity of functional binding site motif variants (BSMVs). Significance of segregation of experimental conditions dependent upon upregulation of the major base or minor base. Cases are shown where the position has a functional BSMV and where there is clustering of similar experimental conditions when the experiments are sorted according to the difference in regulation of each-BSMV's target set.
Gene ontology enrichment among genes with alternative binding site motif variants at functionally variant binding site motif positions.
Secondary transcription factor binding sites correlated with binding site motif variants at nearby functionally variant binding sites. Secondary binding sites have a posterior probability of <0.7 and >0.2. Significant coincidence of secondary binding sites for each other TF with each nucleotide at each functional binding site motif variant position is given according to Fisher's exact test at a level of 0.01.
Ilya Ruvinsky, Matthew Saunders, and Michael Bradley provided useful discussion and comments on the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This study was supported by National Institutes of Health (www.nih.gov) grants (GM30998 and GM081724) and the International Balzan Foundation (www.balzan.org). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.