|Home | About | Journals | Submit | Contact Us | Français|
Identifying the nucleotides that cause gene expression variation is a critical step in dissecting the genetic basis of complex traits. Here, we focus on polymorphisms that are predicted to alter transcription factor binding sites (TFBSs) in the yeast, Saccharomyces cerevisiae. We assembled a confident set of transcription factor motifs using recent protein binding microarray and ChIP-chip data and used our collection of motifs to predict a comprehensive set of TFBSs across the S. cerevisiae genome. We used a population genomics analysis to show that our predictions are accurate and significantly improve on our previous annotation. Although predicting gene expression from sequence is thought to be difficult in general, we identified a subset of genes for which changes in predicted TFBSs correlate well with expression divergence between yeast strains. Our analysis thus demonstrates both the accuracy of our new TFBS predictions and the feasibility of using simple models of gene regulation to causally link differences in gene expression to variation at individual nucleotides.
Natural variation in gene expression underlies many diseases (Knight 2004; Cookson et al. 2009) and plays an important role in evolution (Wray 2007; Carroll 2008). A number of studies have demonstrated that gene expression variation is widespread and heritable across a wide range of species, including human, rat, mouse, yeast and Drosophila (Rockman and Kruglyak 2006). Identifying the specific genomic changes that cause gene expression variation is a vital step in understanding phenotypic diversity and the genetic architecture of complex traits. Loci that cause expression variation can be classified as cis- or trans-acting. Cis-acting variation is often thought to be prevalent in evolution because it is believed to cause fewer pleiotropic effects than trans-acting variation (Chen and Rajewsky 2007; Ronald and Akey 2007; Carroll 2008). Despite the presumed importance of cis-acting variation, only a few polymorphisms that cause gene expression differences have been identified, largely because of the difficulty of fine-mapping phenotypic traits in most organisms. We thus explored the feasibility of using genome-wide computational predictions of transcription factor binding sites (TFBSs) to predict nucleotides causing variation in transcript levels, using the yeast, Saccharomyces cerevisiae, as a model system. Although it is clear that gene expression can be regulated posttranscriptionally, our expectation was that transcriptional control would likely play a major role in determining transcript abundance.
In this study, we present two major results. The first result is a major reannotation of the yeast transcription regulatory network. A number of groups have produced TFBS predictions for S. cerevisiae based on motifs inferred from a large set of ChIP-chip experiments (Harbison et al. 2004; Erb and van Nimwegen 2006; Macisaac et al. 2006). We previously published algorithms for predicting transcription factor (TF) motifs (Siddharthan et al. 2005) and predicting TFBSs (van Nimwegen 2007) and also demonstrated the accuracy of the algorithms in the case of S. cerevisiae (Erb and van Nimwegen 2006). The current study builds on our previous work by incorporating a large set of new TF motifs from recent protein binding microarray experiments (Badis et al. 2008; Zhu et al. 2009), allowing us to significantly increase the scope of the network while maintaining a high degree of specificity. We expect that our annotations are likely to be of independent interest to the community, and they are freely available online (http://www.swissregulon.unibas.ch/).
Our second major result is the identification of a subset of genes for which we can significantly correlate changes in the predicted TFBSs with gene expression divergence. The problem of predicting gene expression from sequence alone is well known to be difficult because of the complexity of cis-regulatory regions, even in a relatively simple eukaryote, such as S. cerevisiae (Yuan et al. 2007). For example, the effects of a mutation at a given TFBS may depend on the constellation of other TFBSs in the promoter. Several authors examined the correlation between differences in TFBSs and gene expression divergence between different yeast species (Doniger and Fay 2007; Tirosh et al. 2008) or duplicated genes within S. cerevisiae (Zhang et al. 2004; Leach et al. 2007). These studies had only limited success in correlating expression with sequence that we hypothesize is partly because of the large evolutionary distances used in the comparisons. For example, the sequence divergence between two commonly studied S. cerevisiae strains, S288c and RM11-1a, is ~0.5%, whereas the divergence between S. cerevisiae and S. paradoxus is as much as ~12% for coding sequence and ~18% for noncoding sequence (Cliften et al. 2001). Likewise, most gene duplications in yeast are ancient, with the majority of the duplication events occurring around the time of the eukaryote–prokaryote split (Gu et al. 2005). Therefore, at these larger evolutionary distances promoters typically differ at multiple positions. We reasoned that fewer complex changes in cis-regulatory region organization are likely to have occurred over the timescales separating S. cerevisiae strains, allowing us to more readily correlate sequence and expression divergence.
Taken together, our evolutionary and gene expression analyses demonstrate that our new TFBS predictions significantly improve on the previous annotations and that for a subset of genes, changes in predicted TFBSs correlate significantly with changes in gene expression divergence. Our fine-scale sequence-based computational approach thus complements the classical phenotype-based approach in which quantitative trait locus (QTL) mapping methods are used to identify genomic loci associated with the phenotype. Ultimately, we expect that a combination of the two approaches will be necessary for elucidating the mapping of genotype to phenotype.
We combined 89 position-specific weight matrices (PWMs) from Zhu et al. (2009), 112 PWMs from Badis et al. (2008), and 72 PWMs from Erb and van Nimwegen (2006). Overall, visual inspection of TF motifs inferred by more than one method suggests that there is good agreement between the three data sets. Single PWMs for each TF were obtained using a Bayesian procedure that takes a set of PWMs as input and determines the relative alignment of the PWMs that maximizes the probability that the entire set derives from a single underlying PWM and also infers this underlying PWM (FANTOM Consortium 2009). This method also determines whether the data are consistent with all PWMs deriving from one common PWM. For 12 TFs, two of the methods agreed while the third was an outlier, so for each of these TFs, the outlier was manually removed and the two remaining PWMs were aligned. For two TFs, the protein binding microarray methods disagreed between a dimer and monomer motif so we resolved these cases manually. For the other TFs, we first aligned the two protein binding microarray PWMs and then aligned the resulting average protein binding microarray PWM with the ChIP-chip PWM. Finally, we manually trimmed the motif boundaries to exclude positions with little information content and discarded the motif for FHL, a forkhead-like TF that, based on in vitro assays, is suspected of not binding DNA directly (Rudra et al. 2005).
All analyses were performed on the April 2006 Saccharomyces Genome Database (SGD) version of the S. cerevisiae genome sequence to facilitate comparison with our previous TFBS predictions (Erb and van Nimwegen 2006). There have not been major changes in the S. cerevisiae genome since 2006. Intergenic regions were aligned to S. paradoxus, S. mikatae, S. kudriavzevii, and S. bayanus using MLAGAN and used as input to the MOTEVO program as previously described (Erb and van Nimwegen 2006).
From the SCPD (Zhu and Zhang 1999) and TRANSFAC (Matys et al. 2003) databases, we curated a set of 452 binding for sites which a binding TF could be identified unambiguously (Erb and van Nimwegen 2006). For each TF for which both MotEvo predictions and at least one known site were available and for each intergenic region, we calculated the sum of the posteriors of all binding sites in the region. The predicted target TF/promoter–region combinations were then ordered by the sum of posteriors and at different cutoffs the fraction of all known targets that were among the predictions (sensitivity) and the fraction of all predictions that correspond to known targets (specificity) were calculated. For the ChIP-chip data of Harbison et al. (2004), TF/promoter–region target combinations were sorted by the P value of binding and the sensitivity and specificity were calculated at different P value cutoffs.
In processing the raw single nucleotide polymorphism (SNP) data, we used a threshold of 40 on the Phred score and normalized the allele frequency by the sequencing coverage, following Liti et al. (2009). The results were similar for Phred cutoff of 20 or when excluding singleton polymorphisms (data not shown). For the derived allele frequency (DAF) distributions, we aligned the S. cerevisiae and S. paradoxus reference genomes with MAVID (Bray and Pachter 2004) and rooted the S. cerevisiae SNPs with the S. paradoxus reference sequence and vice versa. We defined conserved elements as 7-mers (possibly overlapping) conserved in the same five species used for TFBS prediction because the average information score of the PWMs was 14 bits. We varied the parameters by testing 6- to 12-mers and 4 or 5 species. The results were entirely consistent in that the inferred selective constraint increased with longer motifs or more species.
We collected all TFBSs that were predicted in the BY strain and contained exactly one SNP relative to the RM strain and determined the difference dl in log-likelihood of the BY and RM sequences for the corresponding PWM. From this, we determined the distribution of dl for all observed SNPs, weighing each SNP by the posterior probability of the TFBS in which it occurs. We compared the distribution of observed dl with two randomized distributions. First, we used the distribution of dl of all single point mutations of the TFBS containing SNPs, again weighing the mutations in a TFBS by the posterior probability of the TFBS. Second, we used the distribution of dl of all single point mutations at the same position as where the observed SNP occurred. To take into account the sequence composition of intergenic regions, different point mutations in the randomized sets were also weighed by the overall frequency of the corresponding nucleotide in intergenic regions.
Unless specified otherwise, we used 600-nt upstream of the transcription start site to define the promoter region. Transcription starts were defined in Zhang and Dietrich (2005) and David et al. (2006). For divergently transcribed genes where the intergenic region was less than 1,200 nt, we divided the region into two equal-sized promoter regions, based on the result of Erb and van Nimwegen (2006) that most TFBSs likely regulate only one gene. We removed the following TFs as not being transcribed in rich media based on tiling array data (David et al. 2006): DAL80, GAL4, SIP4, and THI2.
For the correlation analysis, we varied the promoter region length from 400 to 1,000 in increments of 200, the posterior probability cutoff from 0.3 to 0.9 in increments of 0.2, and a fold change cutoff from 0 to 0.6 in increments of 0.2. We experimented with other strategies, such as taking the sum or the expectation instead of the max, using two sets of estimated fold changes (Brem and Kruglyak 2005; Wang et al. 2007) and using the estimated activities of the TFs over all segregants, fit using a linear model, similar to previous works (Sun et al. 2007; Ye et al. 2009). None of these strategies resulted in improved correlations though the sum statistic gives similar results to the max statistic (supplementary table S1, Supplementary Material online). We also attempted to divide genes into groups according to which TF regulated them. In theory such a grouping might give an improvement because of the different ways in which changes in PWM score might affect changes in expression. In practice, however, the groups were very small and resulted in high variance in average correlation coefficient, such that of 51 TFs with >2 genes in their group, 21 had negative correlation. For the analysis of the data from Emerson et al. (2010), we used the dependent method of parameter estimation (Emerson et al. 2010) because it has higher accuracy and the correlation between cis and trans effects is not relevant in our application.
We started by assembling a catalog of 164 yeast TF PWMs. This catalog was computed using our previously described algorithm (FANTOM Consortium 2009) to combine 141 motifs derived from two recent sets of protein binding microarray experiments (Badis et al. 2008; Zhu et al. 2009) with 79 motifs predicted from genome-wide ChIP-chip data (Harbison et al. 2004; Siddharthan et al. 2005). By comparison, a previous, commonly used data set based only on ChIP-chip data and literature-derived motifs (Macisaac et al. 2006) contained 124 motifs. Our motif set thus contains a large fraction of the ~200 TFs in the S. cerevisiae genome (Harbison et al. 2004).
Using our updated catalog of motifs, we predicted TFBSs in S. cerevisiae using MotEvo, a Bayesian TFBS prediction algorithm that combines matches to a given PWM with a rigorous analysis of orthologous sequence segments across related species using an explicit statistical model of TFBS evolution (van Nimwegen 2007). That is, while promoter segments that are likely capable of being bound by a given TF are identified based on PWM match, cross-species comparison is used to evaluate the evidence of purifying selection acting to preserve the binding site, and a posterior probability that the site is functional is assigned based on this evidence. Thus, whereas MotEvo is in principle able to detect binding sites that are specific to a single species, higher probability will be assigned to those sites that exhibit evidence of selection acting to preserve them. To facilitate comparison with our previous annotations, we used the same set of parameters and alignments as in our previous analysis (Erb and van Nimwegen 2006). Because it is known that low-affinity (Tanay 2006) and nonconserved (Dermitzakis et al. 2003; Emberly et al. 2003) binding sites can be biologically important, we compared the two annotations over a wide range of posterior probability thresholds (0.05 < Prob < 0.9). Over this range, the new annotations contained roughly twice the number of TF-TFBS relationships and 31–44% more bases in at least one TFBS compared with the old annotations. Moreover, there was substantial overlap between the bases previously annotated to be in TFBSs and those newly annotated to be in TFBSs (table 1). Although ~22–25% of the bases in the old annotations were reannotated as not in TFBSs in the new annotations (table 1), the large overlap implies that the major change between our annotations was the addition of the new motifs from the protein binding microarray data rather than changes in previously known motifs.
Estimating the absolute sensitivity and specificity of genome-wide TFBS predictions is known to be difficult because there are essentially no comprehensive collections of TFBSs with experimentally demonstrated functionality to use as a reference. In our previous annotation (Erb and van Nimwegen 2006), we used data from the SCPD (Zhu and Zhang 1999) and TRANSFAC (Matys et al. 2003) databases to curate a collection of 452 experimentally determined TFBSs from 184 promoter regions and calculated the sensitivity and specificity of the MotEvo annotations on this small set of known sites. We have repeated this analysis for our new TFBS annotation (fig. 1), and we find that the specificity attained by the new annotation is essentially the same as that of the previous annotation. It should be noted that since the set of known sites represents only a small fraction of all true functional sites, the specificity reported in figure 1 underestimates the true specificity of our predictions by a substantial factor.
Although it is tempting to use results from genome-wide binding (i.e., ChIP-chip) or microarray experiments to define reference genome-wide target sets and to use these for estimating absolute specificities, we previously demonstrated (Schlecht et al. 2008) that computationally predicted TFBSs show more overlap with target sets obtained by different high-throughput methods than the experimental target sets show with each other. This suggests that computational predictions of the genome-wide targets of TFs may be more accurate than targets inferred from high-throughput experiments. To further analyze this phenomenon, we obtained the genome-wide binding data from the ChIP-chip experiments of Harbison et al. (2004) and used these to predict target promoters, sorted by P value, for each TF analyzed. We then calculated the sensitivity and specificity that the ChIP-chip data attain on the same set of known sites (fig. 1). Strikingly, the specificity attained by the ChIP-chip data is a factor 5–10 lower than that attained by the MotEvo predictions, strongly supporting that MotEvo’s predictions are substantially more accurate than those based directly on ChIP-chip data. In summary, our observations imply that our new annotations represent a significant increase in coverage of the yeast genome TF regulatory network and that these predictions are at least as accurate as predicted targets based on high-throughput experiments.
To further validate the accuracy of our TFBS annotations and establish their functional significance, we used a population genomics approach similar to previous works (Fairbrother et al. 2004; Chen and Rajewsky 2006; Chen et al. 2009). The basic idea of this approach is to use the estimated strength of natural selection on predicted cis-regulatory sites as a measure of the functionality of the sites and the accuracy of the predictions. To carry out this analysis, we used data from a recent survey of polymorphism in 39 isolates of S. cerevisiae (Liti et al. 2009). We used two statistics to quantify the level of selective constraint: SNP density and minor allele frequency (MAF). Although the SNP density measure can be biased by heterogeneity in the mutation rate, the allele frequency spectrum is free from such mutation rate biases (Fay et al. 2002) and thus is likely to be a more accurate measure of selective constraint than SNP density. As reported by Liti et al. (2009) and confirmed in our study (data not shown), the DAF spectrum has an anomalously high number of high frequency alleles. Such a pattern is consistent with positive selection acting on those alleles. However, this pattern can also result from misspecification of ancestral alleles, which is likely to occur when the outgroup and ingroup species are separated by a large evolutionary distance, as are S. paradoxus and S. cerevisiae. For this reason, we followed a previous analysis of noncoding SNPs in S. cerevisiae (Fay and Benavides 2005) by using MAF spectra rather than DAF spectra.
We observed that selective constraint as measured by SNP density was greater on predicted TFBSs than on synonymous sites or on sites in intergenic regions (which include TFBSs) (table 2). This is likely to be a conservative test for purifying selection because many synonymous sites are under selective constraint in S. cerevisiae (Zhou et al. 2010), and intergenic regions are likely to contain constrained sequences other than TFBSs (e.g., nucleosome positioning elements, noncoding RNAs etc.). When comparing the new and old sets of TFBS predictions, we found that the selective constraint was virtually identical across the full range of posterior probabilities (table 2). This result further supports that the new TFBS predictions achieved essentially the same specificity as the old TFBS predictions while significantly improving the overall coverage of the yeast transcriptional network. The MAF spectra exhibited similar patterns to those observed in the SNP density analysis (fig. 2). That is, the frequency spectrum was more skewed toward rare alleles for predicted TFBSs than for synonymous sites and intergenic regions.
However, when we compared TFBSs with nonsynonymous sites and with 7-mers in intergenic regions that were completely conserved across the five species (hereafter “conserved elements”; see Materials and Methods), we found that the selective constraint on TFBSs was lower than on either of these two classes of sites (table 2, fig. 2). One reason for this result could be that many positions in TF motifs are degenerate and therefore expected to evolve under relatively low selective constraint. The results presented in the next section provide support for this interpretation. We restricted our attention to positions in TFBS conserved across the five species and found that these positions were indeed highly constrained as measured by SNP density, similar to positions in conserved elements. According to the MAF distribution analysis, conserved positions in TFBSs were even more strongly constrained than conserved elements. Overall, these data suggest that selective constraint on conserved positions in TFBS is at least as high as that on nonsynonymous sites or conserved elements.
We confirmed these patterns by examining the SNP density and MAF spectra in 35 isolates of the closely related species S. paradoxus (Liti et al. 2009) (data not shown). Although these data were very similar to the S. cerevisiae results overall, the MAF distribution for nonsynonymous sites was more strongly skewed toward low-frequency alleles in S. cerevisiae than in S. paradoxus. This result is likely due to the draft nature of the S. paradoxus gene annotations, which were simply lifted over from S. cerevisiae based on the genome alignment (http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp_manual.pdf). For example, over 1,000 annotated genes in S. paradoxus have a coding-sequence length that is not a multiple of three. Thus, a significant fraction of putative nonsynonymous sites are likely to be actually synonymous or intergenic sites.
So far we have considered only the SNP density and MAF spectrum within predicted TFBSs. These analyses demonstrated that positions within TFBS are under negative selection but they do not address the specific function of these nucleotides since conceivably they could have a function other than acting as a TFBS. To test for evidence that positions in TFBSs are specifically under selection for binding to the corresponding TF, we compared the distribution of PWM score changes of the observed SNPs with those resulting from randomly mutating a randomly chosen position in the same TFBS. We also performed a more stringent test in which we compared the observed PWM score changes with those that result from randomly mutating the same position in the TFBS. As shown in figure 3, the PWM score changes observed in the SNPs are very significantly biased to maintain the affinity of the TFBSs to their cognate TF (P values of 4 × 10−131 and 5 × 10−18, respectively, Kolmogorov–Smirnov test). These results strongly suggest that the predicted TFBSs are indeed under selection for maintaining their affinity to the cognate TF. Supplementary Figure S1 (Supplementary Material online) shows the analogous results for predicted TFBSs of three individual TFs. To summarize the results of this analysis for individual TFs, figure 4 shows the average and standard error of the difference between PWM score changes for the observed SNPs and PWM score changes in the randomized data sets for each TF separately. Although in many cases, the number of SNPs in predicted TFBSs is too small for a meaningful statistical analysis, for the large majority of individual TFs, the observed PWM score changes are biased toward maintaining the affinity of the TFBS (i.e., the vertical coordinate in fig. 4 is negative), demonstrating that selection for maintaining TFBS affinity applies across most of the TFs for which we provide predictions.
Having examined the selective constraint on our predicted TFBSs, we next tested if changes in the TFBSs correlated with changes in gene expression between S. cerevisiae strains. We thus analyzed genome-wide gene expression and genotype data from 112 haploid segregants from a cross of two parental S. cerevisiae strains (Brem and Kruglyak 2005), a wild strain, RM11-1a, and a standard laboratory strain, BY4716 (hereafter RM and BY, respectively). Treating gene expression level as a quantitative trait, roughly a quarter of all gene expression levels were significantly associated with a marker close to the gene itself. We will refer to these genes as cis-expression QTLs (eQTLs) or “CE genes,” with the understanding that in some cases the variation may in fact be due to a nearby trans factor. CE genes are also referred to as genes with local eQTLs (Rockman and Kruglyak 2006).
For the remainder of the analysis, we restricted our attention to genetic variation relevant to these eQTL data (i.e., we considered only SNPs between the BY and RM strains). Likewise, we only considered TFs expressed by cells growing in rich media (David et al. 2006), the experimental condition used for the microarray measurements of gene expression. First, we confirmed that the upstream cis-regulatory regions (hereafter referred to as promoter regions) of CE genes were significantly enriched for SNPs between RM and BY in TFBS when compared with the promoter regions of all genes (Chi-square test, P value 0.0147), consistent with a previous result that used a different set of TFBS predictions (Ronald et al. 2005). For this analysis, we restricted our attention to the most confident set of TFBS predictions (posterior probability > 0.9) because we previously observed that the degree of selective constraint correlated well with the posterior probability cutoff on the TFBSs.
Next, we turned to the more difficult problem of correlating the magnitude of the changes in PWM scores with the expression fold change. Because the annotation of TFs as activators or repressors is not currently complete, and the role of a TF as activator or repressor can be dependent on the binding of cofactors or the cellular condition, we examined the correlation of the absolute value of the changes in PWM scores with the absolute log fold change of mRNA expression. Manually annotating TFs as activators or repressors based on their SGD annotations (http://www.yeastgenome.org/) and considering the signs of the changes in PWM scores and mRNA expression did not result in any improvement (data not shown). We made the further approximation of taking the maximum PWM score change over all promoter region SNPs for promoters with multiple SNPs. However, the results were very similar when taking the sum instead of the maximum (supplementary table S1, Supplementary Material online). Because the observed correlations depended on the parameters (e.g., promoter length, posterior probability of a TFBS etc.), we explored the parameter space thoroughly (see Materials and Methods). We then compared the distribution of Pearson correlation coefficients over the parameter space for different sets of genes and TFBS predictions.
For the set of CE promoter regions, the average correlation over all parameter settings was moderately strong (average Pearson’s R = 0.301). In comparison, when using the previous TFBS predictions, we found a lower average correlation (average Pearson’s R = 0.203). This difference in correlation corresponds to approximately a doubling of the variance in expression-level change explained by TFBS changes and thus strongly supports the conclusion that our new TFBS annotations are a significant improvement. To estimate the statistical significance of the correlation results, we computed the same statistic over 1,000 random permutations of the fold change to gene assignments. The highest correlation observed among the permutations was 0.0947 for the old TFBSs and 0.116 for the new TFBSs, implying an empirical P value of less than 0.001.
Many of the CE genes had only a small difference in expression between the two parental strains, even though this difference was significantly linked to a genomic locus. We thus further constrained the CE set to only those genes with a statistically significant expression difference based on a larger set of microarray experiments for the two parental strains (Wang et al. 2007). We further removed all genes with trans-eQTLs to minimize non-cis sources of variation (Yvert et al. 2003). The remaining set of 305 genes, which we call our “restricted cis-eQTL” or RCE genes (supplementary table S2, Supplementary Material online), is small but contains the most confident genes for the purposes of identifying causal cis-regulatory SNPs. Using GO term enrichment analysis (Berriz et al. 2009), we found that the genes in the RCE set are significantly enriched for genes functioning in processes related to the cell wall (corrected P = 0.001) and plasma membrane (corrected P = 0.005). Furthermore, the expression of one gene associated with the plasma membrane, FLO11, was previously shown to correlate with the brightness difference of the cell wall in the two relevant strains, BY and RM (Nogami et al. 2007). Cell wall organization and biogenesis were also observed as an overrepresented functional category for genes with experimentally measured variation in binding of the TF Ste12 in segregants of a cross of a BY-related laboratory strain with a different divergent strain, HS959 (Zheng et al. 2010). Thus, the RCE set of genes contains functionally coherent subsets of genes that may contribute to phenotypic differences between the BY and RM strains.
When we repeated the correlation analysis on the set of RCE genes, we found significantly stronger correlations than for the CE genes (fig. 5, average Pearson’s R = 0.314 for the old predictions and 0.514 for the new predictions). Our analysis thus demonstrates the feasibility of correlating sequence divergence at individual nucleotides with expression divergence, at least for a restricted set of genes. It also demonstrates that our new TFBS predictions improve on our previous TFBS predictions because they correlate more strongly with gene expression changes.
Because the maximum or sum of PWM changes is positively correlated with the number of TFBSs and SNPs in the promoter, it is plausible that either of these is the underlying signal in our experiments. To exclude this possibility, we examined the relationship between expression change and either number of TFBSs or SNPs over the same parameter space (see Materials and Methods). However, we found only weak correlations in both cases (supplementary table S1, Supplementary Material online), suggesting that it is necessary to combine both TFBS and SNP information, and consider PWM score changes, to achieve reasonable correlations with gene expression. Another plausible explanation for our results is that the new predictions are enriched for binding sites of a small number of TFs that, for an unknown reason, show exceptionally good correlations between PWM score and expression change. However, we do not observe a limited set of TFs regulating the genes in the RCE set: there are 41 TFs that regulate at least one gene in the RCE set and of these TFs, 21 are associated with TFBS SNPs after taking the maximum over each promoter individually. Thus, the improved correlation combines contributions from a large number of TFs.
In a recent study of cis and trans effects between the BY and RM strains using next-generation transcriptome sequencing (Emerson et al. 2010), the authors identified a “cis only” set (61 genes) for which only cis effects were detected and a “cis major” set (an additional 371 genes) for which both cis and trans effects were detected but the cis effects were significantly larger. We found an average correlation of R = 0.212 for the cis major set, similar to what we obtained for CE genes (R = 0.301). For the cis only set, we found an average correlation of R = 0.450, similar to our result for the RCE set (R = 0.514) (see Materials and Methods). We conclude that our gene sets and those of Emerson et al. (2010) are both enriched for cis effects but we used our sets of genes because they are larger and produced slightly higher correlations.
In addition to showing a statistical association between gene-expression changes and TFBS changes, another important goal is to identify specific nucleotide changes that underlie functional divergence. Examination of the TFBS SNPs in the RCE promoter regions (supplementary table S3, Supplementary Material online) reveals several cases where our predictions align with additional biological information to generate hypotheses that can be tested experimentally. In particular, there are several RCE promoter regions that contain TFBS SNPs for two or more TFs with related functions. Some such cases are trivial because two TFs (e.g., INO4/INO2, MSN2/MSN4, PBF1/PBF2) have identical or nearly identical PWMs and so share the same TFBS. However, there are several nontrivial cases of multiple TFBS SNPs. One example involves SKN7 and MCM1, both of which have TFBS SNPs in the promoter region of AMN1, which encodes a protein involved in exit from mitosis (Wang et al. 2003). These two TFs function together in osmoregulation (Li et al. 1998). For both the SKN7 TFBS and the MCM1 TFBS, the PWM score is higher in BY relative to RM, perhaps indicating a larger connection between mitotic exit and osmotic stress in BY than in RM. Another example involves the glucose-dependent repressors MIG1, MIG2, and MIG3 (Westholm et al. 2008), all three of which have TFBS SNPs in the promoter region of YKL187C, which encodes a protein of unknown function. In this case, the PWM score for the MIG1 TFBS changes in the opposite direction from the other two. This might therefore be an example of compensatory evolution leading to the same total amount of repression. We previously predicted such compensatory coevolution of inputs to the same gene on theoretical grounds (Siegal et al. 2007). In the promoter region of APT2, on the other hand, the PWM scores of TFBS for MIG1, MIG2, and MIG3 are all higher in BY, implying greater glucose-mediated repression of APT2 in BY than in RM.
Identifying the nucleotides responsible for gene expression variation is an important problem in genetics and evolution. Although advances have been made using genome-wide association studies to map complex phenotypes (Hindorff et al. 2009), the mapping resolution of these studies is typically too low to pinpoint causal genes let alone individual nucleotides (Altshuler et al. 2008). One promising idea is to use eQTL analysis to predict causal genes, under the assumption that candidate genes in a disease-related locus that also have cis-eQTLs are more likely to be causal genes for the phenotype (Cookson et al. 2009). This approach has been used to map a number of causal genes for obesity in mice (Schadt et al. 2005; Yang et al. 2009).
Here, we have taken the eQTL mapping paradigm to the single nucleotide level by predicting specific nucleotide differences that cause gene expression divergence using computational predictions of TFBSs. Several groups previously explored the relationship between sequence divergence and gene expression divergence across yeast species (Doniger and Fay 2007; Tirosh et al. 2008). Our analysis differs from those analyses in several ways. First, we produced a set of TFBS predictions that is significantly larger than those used in previous studies yet maintains specificity. Beyond the analyses presented here, we believe that our comprehensive TFBS annotations will be of independent interest to the community.
Second, we used eQTL and microarray data to focus our attention on a restricted set of 305 genes whose differences in expression are likely caused by changes in cis-regulatory elements. Thus, although it remains difficult to correlate sequence divergence with gene expression change for all genes in the genome, we have shown that such an analysis is possible at least for a restricted set of genes. Third, we focused on closely related yeast strains as opposed to different yeast species to minimize the occurrence of complex changes in promoter organization. A recent study showed that variation in TF binding between S. cerevisiae strains can often be associated with specific mutations in TF binding sites (Zheng et al. 2010), but they studied only one TF, whereas we analyzed sites for 164 TFs. Together, these improvements allowed us to make progress on the problem of identifying sequence determinants of gene expression variation (Yuan et al. 2007).
There have also been several attempts to correlate sequence and expression divergence in metazoans. Castillo-Davis et al. (2004) examined the correlation of cis-regulatory region divergence and gene expression in Caenorhabditis elegans and C. briggsae. Because there does not exist a set of TFBS annotations for C. elegans of comparable accuracy with those for yeast, the authors estimated promoter region divergence using alignment programs. Andersen et al. (2008) took a similar approach to ours using TFBS predictions in humans. Several other authors have explored the relationship of sequence change and gene expression change. Segal et al. (2007) showed computationally that a single nucleotide change in a TFBS can change the mechanism of TF binding and thereby significantly change gene expression. Lapidot et al. (2008) examined specific nucleotide changes in TFBSs and found that changes involving adenine were more likely to maintain the expression pattern, whereas changes involving guanine were more likely to change it. Swamy et al. (2009) studied the impact of one or two nucleotide changes in TFBSs on gene expression in S. cerevisiae cells grown in different conditions, unlike our study which focused on changes between strains of S. cerevisiae under one growth condition. They found that 1/3 of variable positions in TFBS motifs and 20% of dependent position pairs in TFBS motifs are correlated with gene expression. Many of these TFBS positions were also evolutionarily conserved and condition dependent.
The complexity of transcriptional regulation in metazoan genomes poses significantly greater difficulties than in the relatively simple yeast genome. For example, promoter and enhancer regions are much larger and not characterized as well. It is also more difficult to control the environmental conditions and to assay the cell-type specificity of gene expression change in a metazoan compared with unicellular organisms. Nonetheless, we expect that the availability of more functional genomics data sets for humans, such as protein binding microarray data (Badis et al. 2009) and tissue-specific eQTL data, will likely make the problem of mapping DNA sequence change to gene expression change in humans more tractable in the near future.
We thank Rachel Brem for sharing the eQTL data with us. We also thank Ian Ehrenreich, Sasha Levy, and Matt Rockman for comments on the manuscript as well as Nicholas Socci and the Siegal laboratory for helpful discussions. This work was partially supported by the National Institutes of Health (grant numbers F32HG 004590-01, K99HG004515-01, R00HG004515-02 to K.C.), National Science Foundation (grant number IOS-0642999 to M.L.S.), and Swiss National Science Foundation (grant number SNF: 3100A0-118318 to E.v.N.).