|Home | About | Journals | Submit | Contact Us | Français|
Noncoding genetic variation is known to significantly influence gene expression levels in a growing number of specific cases; however, the patterns of genome-wide noncoding variation present within populations, the evolutionary forces acting on noncoding variants, and the relative effects of regulatory polymorphisms on transcript abundance are not well characterized. Here, we address these questions by analyzing patterns of regulatory variation in motifs for 177 DNA binding proteins in 37 strains of Saccharomyces cerevisiae. Between S. cerevisiae strains, we found considerable polymorphism in regulatory motifs across strains (mean π = 0.005) as well as diversity in regulatory motifs (mean 0.91 motifs differences per regulatory region). Population genetics analyses reveal that motifs are under purifying selection, and there is considerable heterogeneity in the magnitude of selection across different motifs. Finally, we obtained RNA-Seq data in 22 strains and identified 49 polymorphic DNA sequence motifs in 30 distinct genes that are significantly associated with transcriptional differences between strains. In 22 of these genes, there was a single polymorphic motif associated with expression in the upstream region. Our results provide comprehensive insights into the evolutionary trajectory of regulatory variation in yeast and the characteristics of a compendium of regulatory alleles.
Noncoding genetic variation makes a significant contribution to phenotypic diversity and disease susceptibility by modulating gene expression (Rockman and Kruglyak 2006; Skelly et al. 2009). Examples of noncoding variants causing phenotypic differences within and between species are rapidly accumulating in diverse lineages (Wray 2007). For example, noncoding variants have been identified that cause pigmentation differences in Drosophila (Wittkopp et al. 2002), skeletal reduction in stickleback fish (Shapiro et al. 2004), skin wrinkling in the domesticated dog (Akey et al. 2010; Olsson et al. 2011), and loss of neck feathers in chicken (Mou et al. 2011). Although the precise molecular mechanisms that causal noncoding variants act through remain poorly defined, many regulatory variants likely alter the binding of sequence-specific DNA-binding proteins. These proteins affect gene expression by interacting with the transcriptional machinery, cooperatively binding to other activating or repressing proteins, or modulating chromatin structure (Lee and Young 2000; Farnham 2009).
Yeast is an excellent system in which to study noncoding variation because of the availability of whole-genome sequences from diverse strains and species. For example, whole-genome sequences are available for 37 Saccharomyces cerevisiae strains, which are functionally and geographically diverse (Liti et al. 2009). In addition, sequence motifs for the majority of known DNA-binding factors in yeast have been characterized (Bryne et al. 2008). Motif usage across species has been studied extensively in yeast. Previous work on the evolution of noncoding regions has shown that motifs rapidly turn over between species, including yeast (Dermitzakis and Clark 2002; Moses et al. 2006; Borneman et al. 2007; Doniger and Fay 2007). In some cases, genes whose co-expression has been conserved across species may have acquired different regulators in different species, as in the case of ribosomal protein modules in yeast (Wapinski et al. 2010). Despite this frequent turnover, often the presence of specific motifs is conserved, if not the location (Doniger and Fay 2007). Motifs that are conserved within and between species are correlated with several characteristics, such as being upstream of essential genes, closer to transcription initiation sites, and within open chromatin regions (Francesconi et al. 2011).
More generally, previous analyses of noncoding regions in diverse species have found strong signatures of both positive and negative selection (Mustonen and Lassig 2005; Chen et al 2010; He et al. 2011). These studies have had several limitations, however; for example, they have focused on small collections of known binding sites (Mustonen and Lassig 2005), motifs involved in key developmental modules (He et al. 2011), or motifs ascertained based on their conservation (Chen et al. 2010).
Gene expression variation has also been studied extensively in yeast. These studies have revealed that specific classes of genes are more likely to diverge between species (Thompson and Regev 2009), and such loci share architectural features such as containing a TATA box in their promoter and harboring more binding sites for regulatory proteins (Tirosh et al. 2006). In addition, expression QTL (eQTL) studies have identified a significant role for cis-acting variation in gene expression differences between strains or species (Brem et al. 2002; Ronald and Akey 2007; Ehrenreich et al. 2009; Tirosh et al. 2009; Emerson et al. 2010). Differences in predicted motifs have been associated with expression differences for some of the genes with cis-linkages in a cross between two strains (Chen et al. 2010). In addition, Zheng et al. (2010) identified several hundred genes showing significant gene expression variation associated with differences in protein binding for the factor STE12 (Zheng et al. 2010). Thus, variation in DNA-binding motifs can be an important causal source of gene expression variation.
In this study, we describe a comprehensive genome-wide analysis of polymorphisms located in 177 DNA sequence motifs across 37 S. cerevisiae strains (Liti et al. 2009). We expand the number of motifs studied from previous studies, and identify motifs genome-wide in an unbiased manner without regard to conservation. We perform extensive population genomics analyses that reveal DNA sequence motifs are subject to purifying selection, and quantify the strength of selection for each motif. Furthermore, we used RNA-Seq data that were previously collected for 22 of these strains and performed association analyses between polymorphisms in motifs and differences in gene expression. We identified six polymorphic motifs associated with widespread and consistent changes in gene expression, 49 polymorphic motifs associated with transcriptional variation at individual genes, and a compendium of high confidence regulatory alleles.
We first examined patterns of motif differences across 37 globally and functionally diverse S. cerevisiae strains whose genomes have been sequenced (Liti et al. 2009; supplementary fig. S1, Supplementary Material online), by independently calling motifs in all strains (see Materials and Methods). We found substantial divergence in motif content across strains. The average pairwise number of motif differences per intergenic region is 0.91 motifs (range 0–27; fig. 1A), and as expected pairwise motif differences recapitulate the known phylogeny (data not shown). Across all strains, a median of eight motifs were called in each intergenic region, and a median of four motifs per intergenic region were variable in at least one of the 37 strains (range 0–137). One example of a highly divergent region is the region upstream of AAH1, an adenine deaminase, which is regulated by nutrient levels (fig. 1B). Another highly variable region is upstream of FLO1, which is involved in flocculation, a phenotype known to have diverged between laboratory and wild strains (Liu et al. 1996). Interestingly, a cluster containing both lab and wild strains shows a divergent motif pattern in this region (fig. 1C). A list of additional genes with highly polymorphic motif patterns is provided in supplementary table S1, Supplementary Material online.
To quantify the strength of selection acting on intergenic regions more systematically, we used the McDonald-Kreitman framework to assess deviations from neutral expectations across intergenic regions (McDonald and Kreitman 1991). For measures of divergence, we used S. paradoxus as an outgroup. We initially characterized the evolutionary forces acting at four classes of sites: nonsynonymous, noncoding sites within predicted motifs, noncoding sites outside predicted motifs, and experimentally determined motifs (MacIsaac et al. 2006; see Materials and Methods). Specifically, we counted polymorphic and diverged sites across all intergenic and genic regions that could be aligned between S. cerevisiae and S. paradoxus (~4,700 regions). As putatively neutral sites, we used synonymous sites. We found that purifying selection acts on all four classes of sites (P < 2.2 × 10−16). We next estimated the –log(Neutrality Index), denoted as -log(NI) (Rand and Kann 1996), to compare the magnitude of purifying selection across site types. A value for –log(NI) of zero is consistent with neutrality, negative values suggest negative selection, and positive values indicate positive selection. As expected, the −log(NI) was lowest for experimentally determined motifs, which appear to be under strong purifying selection. We also found that –log(NI) was lower at noncoding sites inside predicted motifs compared with noncoding sites outside of motifs and nonsynonymous sites, suggesting that a higher proportion of sites falling within predicted motifs are under purifying selection than in the other classes of sites (fig. 2A). The observation that −log(NI) at noncoding sites outside predicted motifs was similar to that at nonsynonymous sites is unexpected because noncoding sites outside motifs are generally thought to be subject to less functional constraint. However, this result may be due in part to the high threshold we used to call motifs; lowering the threshold resulted in the −log(NI) at noncoding sites outside motifs becoming closer to neutral expectations (supplementary fig. S2, Supplementary Material online).
To identify heterogeneity of selective constraint across DNA-binding motifs, we calculated a motif specific estimate of the −log(NI). As shown in figure 2B, selective constraint varies widely across motifs, with some motifs under very strong purifying selection. Out of 133 motifs with sufficient data (see Materials and Methods), we identified 112 whose –log(NI) was significantly less than zero (supplementary table S2, Supplementary Material online). As expected from the earlier discussed analysis, a sizable number of motifs (63) had a –log(NI) significantly lower than that at nonsynonymous sites.
Moreover, we examined constraint acting at the level of individual intergenic regions. To this end, we compared polymorphism and divergence at sites that fell within predicted motifs in each region with synonymous sites in the genes flanking each region. We found that many intergenic regions had negative –log(NI), as expected from the motif-specific results described earlier, although the power of this analysis is lower given the reduced number of polymorphisms and divergent sites within each region. Using the MK test, we identified 152 regions that have significant evidence for purifying selection at false discovery rate (FDR) = 0.10. Eleven of these regions were significant after a more stringent Bonferroni correction for multiple testing (supplementary table S3, Supplementary Material online). We did not find any regions significant for positive selection at FDR = 0.10 or after a Bonferroni correction; however, four regions had a suggestive P value (P ≤ 0.05, uncorrected). Three of these regions flanked genes of unknown function; the remaining region flanks ADH4, an alcohol dehydrogenase gene, which has been linked to increased ethanol production (Mizuno et al. 2006). Interestingly, many S. cerevisiae strains were domesticated for use in fermentation, and thus positive selection for changes in the regulation of ADH4 may have occurred between S. cerevisiae and S. paradoxus, which may have made S. cerevisiae favorable for use in domestication.
To assess the relationship between motif and gene expression variation, we obtained RNA-Seq data that had been collected on a subset of the 22 strains of S. cerevisiae analyzed earlier (Skelly et al. submitted). We performed extensive normalization of the data to account for batch effects and unknown sources of variation (see Materials and Methods). By analyzing the complete distribution of P values using the positive false discovery rate approach of Storey and Tibshirani (2003), we estimate that 79.0% of genes are differentially expressed across the 22 strains. Of these, 5,472 genes are significantly differentially expressed at a FDR = 0.10.
We investigated the relationship between motif polymorphism and transcriptional variation using two complimentary approaches. First, we tested for associations between the presence or absence of motifs and expression levels at downstream genes. Specifically, we performed association tests correcting for population structure for 13,089 motifs located upstream of 3,505 distinct genes (Connelly and Akey 2012). We note that with a small sample size of 22 strains, we have limited power to detect variants, except those with large effect sizes (supplementary table S4, Supplementary Material online). We found 49 polymorphic motifs located upstream of 30 distinct genes that were correlated with significant changes in gene expression (FDR = 0.10). Of the 49 associated polymorphic motifs, 21 resulted in increased expression with the presence of the motif (i.e., acted as an activator) and 28 resulted in decreased expression with the presence of the motif. Interestingly, 22 of these genes contained only a single polymorphic motif associated with expression variation in the upstream region (table 1). In addition, one gene did not contain any additional promoter variants located outside of motifs that are in strong linkage disequilibrium (r2 > 0.8) with the polymorphic motif (fig. 3). Moreover, we found evidence for one case of a bidirectional promoter, where polymorphism in the REB1 motif was associated with changes in expression of both flanking genes.
In addition, we tested the hypothesis that levels of sequence conservation varied between polymorphic motifs that were or were not associated with differences in gene expression. Using phastCons scores from the 12-species yeast alignment (Siepel et al. 2005), we compared the mean conservation score at 1,039 polymorphic motifs nominally associated with expression differences among strains (P ≤ 0.05) with a null distribution constructed by drawing the same number of randomly chosen motifs not associated with gene expression differences. We found conservation was significantly higher at motifs associated with expression differences (P = 0.024). Thus, the statistical and bioinformatics data strongly suggest that these 22 polymorphic motifs are enriched for causal regulatory polymorphisms.
Second, we tested whether motifs were acting consistently as activators or repressors across a majority of genes upstream of which they were polymorphic. Specifically, for the ith motif, we identified all genes whose upstream intergenic region contained a variable motif i. We discarded genes where the variable motif was only observed in a single strain. Next, we converted gene expression values for this set of genes to a Z score and tested for differences between the distribution of expression values when motif i was present or absent (see Materials and Methods). At a FDR = 0.10, we found that polymorphisms in 9 out of the 148 motifs were significantly associated with consistent transcriptional differences (5 motifs were significantly associated with increased expression and 4 motifs were significantly associated with decreased expression; table 2 and fig. 4).
Finally, we investigated what characteristics were associated with high expression divergence. As a measure of expression divergence between strains, we calculated the average pairwise difference in expression between strains. We first tested whether the absolute value of the –log(NI) for motifs in each region was associated with expression divergence. We used the absolute value so that any region under either positive or negative selection would have a value greater than zero and regions under no selection would be closer to zero. We found a negative correlation (ρ = −0.07, P = 2.43 × 10−6, Spearman rank-sum test), demonstrating that regions under stronger selection showed less expression divergence. We also tested whether nucleotide diversity (π) within predicted motifs was associated with expression divergence. We found a positive correlation (ρ = 0.10, P = 6.84 × 10−14, Spearman rank-sum test), illustrating that higher nucleotide diversity was associated with higher expression divergence between strains. This correlation was still significant after controlling for the presence or absence of TATA box and for the nucleosome occupancy upstream of each gene (see Materials and Methods).
Interpreting noncoding variation is challenging yet vital for identifying causal regulatory variation, delimiting the contribution of expression variation to phenotypic diversity and evolutionary diversification, and elucidating the molecular mechanisms through which noncoding variation acts. By focusing on interpretable noncoding variation, namely variants within known motifs for DNA-binding proteins, we were able to perform detailed evolutionary and statistical analyses on the evolutionary pressures acting at these motifs and the functional consequences of putative regulatory variation.
We first addressed the evolutionary pressures affecting motif diversity and divergence, and found that motifs are generally subject to purifying selection. These results are broadly consistent with previous analyses demonstrating purifying selection acting on yeast promoter and 3′-untranslated regions (Mustonen and Lassig 2005; Ronald and Akey 2007; Chen et al. 2010). Similarly, studies in humans have found decreased nucleotide diversity in open chromatin regions (Thurman et al. 2012; Vernot et al. 2012) and have correlated transcription factor occupied sites with higher conservation across multiple species (Neph et al. 2012). Similarly, we found that experimentally validated sites were subject to stronger purifying selection. Interestingly, we found that the level of purifying selection acting on all predicted motifs was still quite strong. We also found that the selection on experimentally determined sites and on predicted sites was stronger than that on nonsynonymous variants. One possible explanation for this is that a smaller proportion of nonsynonymous sites will actually affect gene function compared with the proportion needed to disrupt a motif. It is also possible that by testing only motifs within regions which could be aligned between the two species, we may be biased toward detecting conserved motifs. In comparison with other species, it is interesting that similar studies in Drosophila have found widespread evidence of adaptive evolution in noncoding regions (Andolfatto 2005), whereas we found little evidence for adaptive evolution. We speculate that these differences in the tempo and mode of noncoding evolution between species may be due, at least in part, to differences in effective population size. We also found that while a majority of motifs are under purifying selection, a subset is evolving neutrally. This may suggest that the position weight matrices for these motifs are ineffective at identifying functional binding sites or that, alternatively, these motifs are in general less constrained.
To investigate the effects of motif changes on transcriptional variation, we characterized gene expression differences among 22 strains. We identified nine motifs acting consistently as activators or repressors across a majority of genes they regulated. These transcription factors are involved in diverse processes, but it appears that they are broadly active in phosphate-limiting conditions. We also identified 30 genes where one or more motifs were associated with gene expression variation. Approximately one-third of these genes contained multiple motifs associated with expression variance at that gene, making it difficult to identify the causal variant, though it is also possible that there may be multiple motif changes contributing to gene expression differences at these loci, as observed previously (Prud’homme et al. 2006; Tao et al. 2006). In addition, we were able to identify 22 genes with only one motif associated with expression differences. Although for all but one of these there were other SNPs in strong LD with the associated motif in the intergenic region, SNPs that fall within motifs are a strong candidate for being a causal SNP because of their potential functional role.
We found that conservation scores across species were significantly higher at motifs associated with expression differences than at motifs not associated with expression differences, suggesting that cross-species conservation is useful for fine-scale mapping causal regulatory variation. In addition, measures of constraint within species that combine information across multiple motifs in a region were useful for predicting more general patterns of expression divergence. Specifically, we found that regions with less constrained motifs as measured by the −log(NI) and nucleotide divergence were more likely to have higher expression divergence, although the magnitude of the correlation was modest.
There are several limitations to our study design. Because we are using computationally predicted motifs, not all are actually used in vivo; however, by using stringent cutoffs for calling motifs (see Materials and Methods), we attempted to collate a high confidence set of predicted motifs on which to perform our analyses. The evolutionary analyses also suggest that we are identifying active sites that are under constraint. In addition, our study only tests the effects of motif variation on gene expression in one experimental condition. Finally, as our sample size was small, we are underpowered to detect associations attributable to rare variants or variants with small effect sizes (supplementary table S4, Supplementary Material online).
In summary, our approach demonstrates the utility of using motif predictions in conjunction with functional genomics data for identifying functional noncoding sequence variation and DNA-binding proteins that have significant effects on gene expression. In the future, it will be important to integrate additional types of data, such as in vivo DNA-binding protein information and ChIP-Seq data, to facilitate the interpretation of noncoding variation, the identification of causal noncoding variants, and the correlation of transcriptional variation to phenotypic diversity. Such integrative genomics analyses are likely to play a key role in ultimately developing predictive models to distinguish functionally important noncoding variation from functionally and phenotypically benign variants.
We obtained sequence data and whole genome alignments for 37 S. cerevisiae strains and the S. paradoxus reference sequence (CBS432-0809) from the Saccharomyces Genome Resequencing Project (Liti et al. 2009). The alignment between S. cerevisiae and S. paradoxus was done by repeating masking the reference S. cerevisiae genome and CBS432. The programs LASTZ (Harris 2007) and TBA (Blanchette 2004) were used to construct the alignment. Substitution scoring parameters for LASTZ alignments were inferred using two S. cerevisiae strains (the reference strain and RM11_1A). For all further analyses, we excluded intergenic regions that aligned to more than one contiguous block in S. cerevisiae.
We searched all intergenic regions for the 37 strains and S. paradoxus for each of the 177 known DNA binding motifs on both strands using Position-specific site matrices obtained from JASPAR and converted to PWMs (Bryne et al. 2008). Note that these matrices come from experimental studies and are not ascertained based on conservation across species. In all further analyses, we did not include sites with missing data in 1 or more of the strains, or sites that were called due to indels to mitigate alignment errors. Motifs were called if they had 90% of the observed maximum weight matrix score.
For the experimentally determined sites, we used binding sites identified by ChIP-chip (MacIsaac et al. 2006) that were significant at P < 0.001 and not subject to conservation criteria. This list consists of 9,708 motif sites.
We calculated the NI as: (Rand and Kann 1996). Here, D is the count of polymorphic sites between S. cerevisiae and S. paradoxus, and P is the count of polymorphic sites (frequency greater than 5%) between the S. cerevisiae strains, n = neutral sites (synonymous sites), and s = putative selected sites. When calculating the NI for each intergenic region, we used the synonymous sites from immediately flanking genes. For bootstrapping, we resampled 1,000 times from the data for each intergenic region.
Raw RNA reads were obtained from Skelly et al. (submitted). We mapped RNA-Seq reads to the S288c reference genome (UCSC sacCer2) using the program BFAST version 0.6.4e (Homer et al. 2009) with options –K 100 and –M 500 to bfast match. We aligned colorspace reads using a main index with mask 111111111111111111 (hash width 14) and secondary indexes with masks 1111101110111010100101011011111, 1011110101101001011000011010001111111, and 10111001101001100100111101010001011111 (all using hash width 14). We output the results in SAM format and converted to BAM format using samtools (Li et al. 2009). We computed read depth across genes using bedtools version 2.15.0 (Quinlan and Hall 2010).
We normalized counts for each gene by the number of total read counts for that strain. We then carried out a median normalization step to normalize across flow cells (Pickrell et al. 2010). After this step, we removed any genes that had no counts across any strain. Finally, we fit a linear model of the form log(normalized_counts) ~ batch + flow cell + strain + significant surrogate variables. We used the R package sva to calculate surrogate variables, which revealed four significant surrogate variables (Leek and Storey 2007). Further tests used the residuals from this model.
We used a random effects model to test for a strain effect using the R package lme4, using the Maximum Likelihood method and calculating P values using the χ2 distribution (R Development Core Team 2011). We assigned q values by permuting the strain assignments 1,000 times and repeating the analysis, calculating the empirical P value from this distribution, then using the R package q value to assign q values (Storey and Tibshirani 2003).
For each gene with a variable motif, we tested the hypothesis that there was a difference in expression between strains with the motif present and strains with the motif absent, using the program EMMA to control for population structure (Kang et al. 2008). P values were again assigned by permuting the motif presence/absence labels 1,000 times, and calculating q values as earlier.
For each gene, we converted the normalized expression values to Z scores. For each motif, we then identified genes that had a variable motif upstream. We tested for differential expression by combining expression Z scores from all strains with the motif present, and compared them to Z scores from strains with the motif absent, combining these Z scores across the genes identified earlier. We tested for differential expression using a t test, and determined q values by permuting the labels of present/absent for each gene 1,000 times.
The simulations were done as previously described (Connelly and Akey 2012). Briefly, we chose 1,000 random SNPs, which fell within genes or 1,000 bp up- or downstream of genes and which had a minor allele frequency of at least 3 out of 22 as causal SNPs and simulated data based on the genotype at each SNP. We generated simulated data of three effect sizes, 25% of variance in phenotype explained by the genotype, 50% of variance explained by the genotype, and 75% of the variance explained by the genotype. This was equal to a fixed effect of k = 1.64, 2.85, and 4.885 times the standard deviation, respectively, solving for k using the formula percent variance explained = p(1 − p)k2/(p(1 − p)k2 + 1 − 1/n) = ~1/(1 + 1/(p[1 − p]k2) where k is the fixed effect of x times the standard deviation, p is the frequency of the polymorphism with the fixed effect, and n is the number of individuals (Yu et al. 2006). To assess power, we tested for association between the simulated data and the genotype at the causal variant for each of the 1,000 simulations using EMMA (Kang et al. 2008). To assess the type I error rate, we chose 1,000 random SNPs and asked how often they showed association with any of the 1,000 simulated data sets.
We obtained phastCons scores from the UCSC genome browser for each position in the S288c genome (Siepel et al. 2005). We used the P values from the gene-specific test above to identify motifs nominally associated with expression differences among strains (P ≤ 0.05, n = 1,039). To assess significance of polymorphic motif conservation scores, we generated a null distribution by calculating mean conservation from 1,000 randomly selected motifs that are not associated with expression differences (P > 0.05).
We obtained calls for the presence or absence of a TATA box upstream of each gene (Tirosh et al. 2006). For a measurement of nucleosome occupancy, we used the genome-wide nucleosome occupancy data (Lee et al. 2007) and calculated nucleosome occupancy 100 bp upstream of transcription start sites (Zhang and Dietrich 2005), a similar approach to that taken by Tirosh and Barkai (2008). We used a linear model to test for the effect of nucleosome occupancy, TATA box presence, and nucleotide diversity within motifs on expression divergence.
The authors thank the Yeast Resource Center. This work was supported by National Institute of Health grants R01GM094810 and R01GM098360 to J.M.A. M.J.D. is a Rita Allen Scholar.