Initiation and regulation of gene expression is critically dependent on the binding of transcriptional regulators, which is often temporal and position specific. Many transcriptional regulators recognize and bind specific DNA motifs. The length and degeneracy of these motifs result in their frequent occurrence within the genome, with only a small subset serving as actual binding sites. By occupying potential binding sites, nucleosome placement can specify which sequence motif is available for DNA-binding regulatory factors. Therefore, the specification of nucleosome placement to allow access to transcriptional regulators whenever and wherever required is critical. We show that many DNA-binding motifs in Saccharomyces cerevisiae have a strong positional preference to occur only in potential regulatory regions. Furthermore, using gene ontology enrichment tools, we demonstrate that proteins with binding motifs that show the strongest positional preference also tend to have chromatin-modifying properties and functions. This suggests that some DNA-binding proteins may depend on the distribution of their binding motifs across the genome to assist in the determination of specificity. Since many of these DNA-binding proteins have chromatin remodeling properties, they can alter the local nucleosome structure to a more permissive and/or restrictive state, thereby assisting in determining DNA-binding protein specificity.
At any given point in time, cells are performing complex programs of gene expression. The binding of transcriptional regulators to target genes determines their expression or repression. Many DNA-binding proteins (DBPs) recognize and bind specific DNA sequence motifs located within specific regulatory regions of the gene. However, the length and nucleic acid composition of these binding motifs frequently enables their random occurrence within the genome, sometimes up to thousands of repetitions. Therefore, sequence information alone is insufficient to completely determine specificity (1,2).
Within the nucleus, DNA is complexed with RNA and proteins to form chromatin. Commonly composed of an octamer of histone proteins consisting of two copies each of histones H2A, H2B, H3 and H4, nucleosomes are the basic repeating units of chromatin [for review see ref. (3)]. DNA wraps around the histone octamer core in approximately two superhelical turns. These cores are spaced ~10–80 bp apart; this internucleosomal DNA is referred to as linker DNA. Linker DNA can vary significantly in length, even between neighboring nucleosomes. DNA within nucleosomes is less accessible to DBPs, including transcriptional regulators (4). It has long been thought that by occupying potential binding sites, nucleosomes play an indirect role in regulating gene expression (4–7). However, this raises the question of how the structure of chromatin is constructed initially to ensure the availability of sites for transcriptional regulator binding. It is likely that inherent signals within the DNA sequence play an important role in positioning nucleosomes (8,9). Also critical are chromatin remodeling factors (CRFs) that reposition or modify nucleosomes (8,10–13), thereby repressing or enhancing transcription. Whether and how CRFs act to modify chromatin structure to a more permissive/restrictive state remains unknown. One possibility is that CRFs rely on the quality and genomic position of their DNA sequence motifs to help establish specificity. In this study, we investigated this hypothesis by examining the positional distribution of predicted binding sites for 184 DBPs in the Saccharomyces cerevisiae genome.
Transcription start sites (TSSs), as well as promoter and coding sequences, were obtained from the UCSC genome browser (14). The Mining Yeast Binding Sites (MYBS) database was used to obtain 666 position weight matrices (PWMs) (15). The Spt10 PWM was obtained from ref. (16) for a total of 667 PWMs. Promoters were defined as regions extending 1000 bp upstream of TSSs, excluding any coding sequence. Each PWM was used to score both promoter and coding sequences while looking for subsequences that closely match the binding motif represented by the PWM. The score of each subsequence was derived from the sum of the position-specific score of each nucleotide composing the subsequence. For a subsequence $s = s_1 \ldots s_l$ whose length $l$ equals the number of columns in the PWM, the score was calculated as

$$\mathrm{score}(s) = \sum_{j=1}^{l} m_{s_j,\,j} \qquad (1)$$

where $s_j$ represents the nucleotide at position $j$ of subsequence $s$ and $m_{i,j}$ represents the score in the PWM for row $i$ and column $j$, with row $i$ being the row for nucleotide $s_j$.
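The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `TOY_PWM`, `score_subsequence` and `scan` are hypothetical, the matrix values are invented, and only one strand is scanned.

```python
# Hypothetical toy PWM: one dict per column, mapping nucleotide -> score.
# The values are invented for illustration and do not come from MYBS.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": 0.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.5, "T": -1.0},
]


def score_subsequence(pwm, subseq):
    """Sum the position-specific score of each nucleotide in subseq."""
    if len(subseq) != len(pwm):
        raise ValueError("subsequence length must equal the number of PWM columns")
    return sum(column[base] for column, base in zip(pwm, subseq))


def scan(pwm, sequence):
    """Yield (start, score) for every window of PWM width in sequence."""
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        yield start, score_subsequence(pwm, sequence[start:start + width])


if __name__ == "__main__":
    for start, score in scan(TOY_PWM, "TACG"):
        print(start, score)
```

Representing each column as a dictionary keeps the lookup of the row for nucleotide s_j explicit; an array indexed by nucleotide would work equally well.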
We randomized the sequence of interest by shuffling the nucleotides while retaining the overall nucleotide composition. Then each set of randomized sequences was scanned against the set of PWMs and the number of high-scoring matches was counted. The randomization was performed 800 times, and the mean and standard deviation for the number of matches expected in the randomized sequence for a given PWM were calculated. A z-score representing the degree of sequence motif enrichment was calculated using

$$z = \frac{x - \mu_r}{\sigma_r} \qquad (2)$$

where $x$ is the number of high-scoring matches for the unshuffled sequence, $\mu_r$ is the mean number of high-scoring matches for 800 sets of shuffled sequences, and $\sigma_r$ is the standard deviation for the group of 800 sets of shuffled sequences. We then used the calculated z-scores from the promoter and coding sequence to calculate a promoter enrichment score (i.e. promoter z-score − ORF z-score) for each PWM.
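The shuffle-based z-score and the promoter enrichment score can be sketched as follows. This is an illustration under stated assumptions: `count_matches` is a hypothetical stand-in that counts exact occurrences of a motif string instead of PWM matches above a cutoff, and returning NaN when the shuffled counts have zero variance is a choice made here, not something described in the text.

```python
import random
import statistics


def count_matches(sequence, motif):
    """Stand-in for PWM scanning: count exact occurrences of a motif string."""
    width = len(motif)
    return sum(sequence[i:i + width] == motif
               for i in range(len(sequence) - width + 1))


def shuffle_z_score(sequence, motif, n_shuffles=800, seed=0):
    """z = (observed - mean of shuffled counts) / sd of shuffled counts."""
    rng = random.Random(seed)
    observed = count_matches(sequence, motif)
    bases = list(sequence)
    shuffled_counts = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)  # permutes in place, so composition is preserved
        shuffled_counts.append(count_matches("".join(bases), motif))
    mean = statistics.mean(shuffled_counts)
    sd = statistics.stdev(shuffled_counts)
    if sd == 0:
        return float("nan")  # assumption: undefined when there is no variance
    return (observed - mean) / sd


def promoter_enrichment(promoter_z, orf_z):
    """Promoter enrichment score: promoter z-score minus ORF z-score."""
    return promoter_z - orf_z


if __name__ == "__main__":
    toy_promoter = "TATA" * 20 + "GC" * 40
    print(shuffle_z_score(toy_promoter, "TATA"))
```

With the z-scores reported later for the Orc1p motif (261 in promoters, −0.88 in coding sequence), `promoter_enrichment` gives a score of about 261.88.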
To perform this analysis, it was necessary to select a cutoff score. As in comparable studies (17), we chose a cutoff representing 70% of the maximum possible score for a given PWM. Analyses using cutoff scores representing 80 and 90% of the maximum possible score showed little difference.
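The cutoff can be illustrated with a short sketch. It assumes that the maximum possible score is the sum of the largest entry in each PWM column and that a match must score at or above the cutoff; the text does not specify either detail. The names and the toy matrix are hypothetical.

```python
# Hypothetical toy PWM (one dict per column); values are invented.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": 0.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.5, "T": -1.0},
]


def max_possible_score(pwm):
    """Best achievable score: the largest entry of every column, summed."""
    return sum(max(column.values()) for column in pwm)


def cutoff_score(pwm, fraction=0.70):
    """Cutoff as a fraction of the maximum possible score."""
    return fraction * max_possible_score(pwm)


def is_high_scoring(pwm, subseq, fraction=0.70):
    """True if subseq scores at or above the cutoff (>= is an assumption)."""
    score = sum(column[base] for column, base in zip(pwm, subseq))
    return score >= cutoff_score(pwm, fraction)


if __name__ == "__main__":
    for fraction in (0.70, 0.80, 0.90):
        print(fraction, is_high_scoring(TOY_PWM, "TCG", fraction))
```

In this toy case the subsequence `TCG` passes at 70% of the maximum score but not at 90%, which is the kind of sensitivity the 80 and 90% reruns were checking for.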
The set of PWMs was filtered using the methods outlined below and then ranked according to the promoter enrichment score. Finally, using the online DAVID GO tool, we searched for enriched GO terms (18) in the top 20% of PWMs (N = 37). As a control, we assessed the set of all proteins (184) represented by the collection of 667 PWMs used in this study. To avoid the use of an arbitrary percentile cutoff, we also applied the online Gene Ontology enRIchment anaLysis and visuaLizAtion (GOrilla) tool (19) to our set of ranked proteins. GOrilla uses a flexible threshold technique to search for GO terms enriched in a ranked list.
The set of PWMs used contained considerable redundancy (i.e. many DBPs are associated with multiple PWMs). To perform the GO analysis, it was necessary to filter the set of 667 PWMs to obtain a unique set of 184 PWMs to pair with the 184 unique proteins. Two different filtering methods were used to determine which PWM out of the set of PWMs associated with a given DBP would be used when ranking the protein. With the first method, we filtered PWMs based on the promoter enrichment score. The PWM with the highest promoter enrichment score from the set of PWMs was selected to pair with that protein. Each protein was then ranked according to the promoter enrichment score of its paired PWM and GO analysis performed as outlined above. Using this method, both analysis tools identified GO terms related to chromatin modification for the highly ranked proteins. With the second method, we filtered the PWMs according to information content. The PWM with the highest information content was selected to pair with its associated protein. We repeated the above analysis using both GO tools. Using the DAVID tool, we again identified an enrichment of chromatin-modifying GO terms for highly ranked proteins (P<0.05). However, GOrilla did not reveal any enriched GO terms, possibly because of the more stringent cutoff (P<0.001) used by this tool.
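Both filtering methods reduce to picking, for each protein, the PWM that maximizes some key. The sketch below assumes the standard Shannon definition of information content for DNA (2 bits minus the column entropy, summed over columns), since the text does not give a formula. The record layout, protein labels and function names are hypothetical.

```python
import math


def information_content(freq_columns):
    """Total information content in bits for a matrix of base frequencies.

    Assumes the common definition: sum over columns of 2 - H(column).
    """
    total = 0.0
    for column in freq_columns:
        entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
        total += 2.0 - entropy
    return total


def pick_representative(pwms_by_protein, key):
    """Keep, for each protein, the PWM record that maximizes key(record)."""
    return {protein: max(records, key=key)
            for protein, records in pwms_by_protein.items()}


if __name__ == "__main__":
    # Hypothetical records; the scores are invented for illustration.
    toy = {
        "ProteinA": [
            {"id": "pwm1", "enrichment": 1.0,
             "freqs": [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}]},
            {"id": "pwm2", "enrichment": 3.0,
             "freqs": [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]},
        ],
    }
    by_score = pick_representative(toy, key=lambda r: r["enrichment"])
    by_info = pick_representative(toy, key=lambda r: information_content(r["freqs"]))
    print(by_score["ProteinA"]["id"], by_info["ProteinA"]["id"])
```

As the toy data show, the two criteria can select different PWMs for the same protein, which is why the GO analysis was repeated under both.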
With the set of high-scoring matches in promoter regions and a map of nucleosome positions produced in a recent study (20), we calculated the fraction of predicted binding sites that overlapped with a well-positioned nucleosome for each PWM. Nucleosomes, unlike many DBPs, do not necessarily have a well-defined binding site. Instead, the same nucleosome may occupy different locations in different cells. For each nucleosome, Mavrich et al. (20) calculated a ‘fuzziness score’ that represented the extent to which a nucleosome varies its binding location. To obtain a list of well-positioned nucleosomes we ranked all nucleosomes by their fuzziness score and took the top 15%.
To calculate the significance of the observed overlap of predicted binding sites with well-positioned nucleosomes, we randomly changed the positions of the predicted binding sites within a 1000-bp window and calculated the fraction of randomized sites that overlapped with a well-positioned nucleosome. After 1000 iterations, the mean and standard deviation of nucleosome overlap were estimated. A z-score representing the degree of nucleosome overlap above or below random chance was then calculated according to Equation (2), where $x$ was the fraction of high-scoring matches that overlapped a nucleosome, $\mu_r$ was the mean fraction of high-scoring matches that overlapped a nucleosome based on 1000 random permutations, and $\sigma_r$ was the standard deviation of the fractional overlap of the randomly moved high-scoring matches.
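The permutation test for nucleosome overlap can be sketched as below. It is a simplified illustration: sites and nucleosomes are half-open intervals on a single coordinate axis, each site is moved by a uniform random offset within a window centered on its original position, and all function names are hypothetical.

```python
import random
import statistics


def overlaps_any(start, width, nucleosomes):
    """True if [start, start + width) overlaps any (n_start, n_end) interval."""
    end = start + width
    return any(start < n_end and n_start < end for n_start, n_end in nucleosomes)


def overlap_fraction(sites, width, nucleosomes):
    """Fraction of sites that overlap at least one nucleosome."""
    return sum(overlaps_any(s, width, nucleosomes) for s in sites) / len(sites)


def overlap_z_score(sites, width, nucleosomes, window=1000, n_iter=1000, seed=0):
    """z-score of the observed overlap against randomly repositioned sites."""
    rng = random.Random(seed)
    observed = overlap_fraction(sites, width, nucleosomes)
    half = window // 2
    fractions = []
    for _ in range(n_iter):
        moved = [s + rng.randint(-half, half) for s in sites]
        fractions.append(overlap_fraction(moved, width, nucleosomes))
    mean = statistics.mean(fractions)
    sd = statistics.stdev(fractions)
    return (observed - mean) / sd if sd > 0 else float("nan")


if __name__ == "__main__":
    toy_nucleosomes = [(0, 147), (1000, 1147)]   # invented positions
    toy_sites = [500, 600, 700]                  # all in the linker
    print(overlap_z_score(toy_sites, 10, toy_nucleosomes))
```

For the toy sites, which all lie in linker DNA, the z-score is negative: the observed overlap is lower than that of the randomly repositioned sites.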
Promoter regions have a tendency to contain nucleosome-depleted regions (21). To control for potential bias, we randomly changed the predicted binding site location within a 1000-bp window that was centered on the binding site. In doing so, the randomly permuted binding sites were still mostly positioned within the same local chromatin structure. A 1000-bp window will almost always include some of the neighboring ORF sequences. Thus, while restricting the randomization to a defined window reduces the effect of simply being within a promoter region, it does not eliminate it entirely. One could argue that our results indicating a strong bias toward promoter regions for some motifs exacerbate this issue. However, in our calculation of nucleosome occupancy, we only used those sites found in promoter regions. Hence, promoter bias should not play a significant role in these analyses. For each PWM we paired its promoter enrichment score with its nucleosome overlap score and calculated the correlation using Spearman rank correlation. Correlation coefficients were calculated using those PWMs with at least 50 predicted binding sites.
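The correlation step can be sketched with a small pure-Python Spearman implementation (Pearson correlation computed on average ranks, so ties are handled). The function names are hypothetical, and no P-value is computed here. In the analysis, only PWMs with at least 50 predicted binding sites would be passed in.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    enrichment = [5.0, 40.0, 12.0, 90.0]       # invented scores
    nucleosome_overlap = [1.2, -0.5, 0.3, -2.0]
    print(spearman(enrichment, nucleosome_overlap))
```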
For each high-scoring promoter region match, we calculated the distance to the closest TSS. Predicted binding sites that could not be mapped to a TSS were discarded. Sequence motifs that were highly ‘location constrained’ within promoter regions clustered together. For every PWM that had at least 50 predicted binding sites within promoter regions, we obtained the distance from the TSS for every high-scoring match (i.e. predicted binding site) and then calculated the mean, median and semi-interquartile range for the distance distribution. The smaller the semi-interquartile range, the more clustered the predicted binding sites were and the stronger the location constraint within promoter regions.
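The spread measure can be sketched with the standard library. The quartile convention is an assumption: `statistics.quantiles` defaults to the "exclusive" method, and the text does not say which convention was used. The function name and the distances are invented.

```python
import statistics


def semi_interquartile_range(distances):
    """Half the distance between the first and third quartiles."""
    q1, _median, q3 = statistics.quantiles(distances, n=4)
    return (q3 - q1) / 2


if __name__ == "__main__":
    # Invented distances (bp upstream of the TSS) for two hypothetical motifs.
    constrained = [150, 160, 170, 180, 190, 200, 210]
    unconstrained = [20, 150, 300, 450, 600, 750, 950]
    print(semi_interquartile_range(constrained))
    print(semi_interquartile_range(unconstrained))
```

The tightly clustered motif yields the smaller semi-interquartile range, matching the interpretation that a smaller value means a stronger location constraint.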
Sequence motifs for DBPs are commonly represented by a position weight matrix (PWM) (1,22). We obtained a set of 667 PWMs representing binding motifs for 184 DBPs from the MYBS database (15). For each PWM we calculated a promoter enrichment score. The larger the score, the more enriched the sequence motif was in promoter regions relative to coding regions.
Not surprisingly, most sequence motifs showed dramatically greater enrichment in promoter regions than in coding regions (Figure 1). For example, Orc1p, which has been demonstrated to function in chromatin modification (23), displayed the greatest difference in enrichment between promoter and coding sequence. For this sequence motif, the number of high-scoring matches within the promoter region was 1240, corresponding to a z-score of 261. Meanwhile, the number of high-scoring sequence motif matches within coding sequence was 38, corresponding to a z-score of −0.88. Yeast contains ~8.4 Mb of coding sequence compared to ~2.5 Mb of promoter sequence. Despite this, the Orc1p motif occurred far more often in potential regulatory, but not coding, sequence in the yeast genome.
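As a rough check on the size of this effect, the counts above can be normalized by the amount of sequence searched. This is only a back-of-envelope comparison using the approximate figures quoted in the text, not the z-score calculation used in the analysis:

$$\frac{1240}{2.5\ \text{Mb}} \approx 496\ \text{matches/Mb (promoter)}, \qquad \frac{38}{8.4\ \text{Mb}} \approx 4.5\ \text{matches/Mb (coding)}$$

so the Orc1p motif is roughly 110-fold denser in promoter sequence than in coding sequence.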
We then investigated whether proteins whose sequence motifs showed a high positional preference for promoter regions also shared common biological functions. To explore this question, the set of 184 proteins was sorted according to the promoter enrichment score from largest to smallest (‘Materials and methods’ section). Then the online DAVID bioinformatics resource tool (http://david.abcc.ncifcrf.gov/home.jsp) (18) was used to assess GO terms associated with the top 20% of ranked proteins. Chromatin remodeling-related terms were highly represented among these highly ranked proteins (P < 0.05), including chromatin modification; establishment and/or maintenance of chromatin architecture; DNA packaging; gene silencing; negative regulation of gene expression, epigenetic; chromatin silencing; and heterochromatin formation.
To verify these results, we performed a similar analysis using the GOrilla tool (http://cbl-gorilla.cs.technion.ac.il/) (19). When given a ranked list of genes, GOrilla searches for GO terms that show greater enrichment for items near the top of the list relative to the rest of the list. Therefore, it was unnecessary to limit this analysis to the top 20% of ranked proteins. We submitted to GOrilla a set of proteins ranked according to their promoter enrichment score and examined GO term enrichment. Similar to the analysis using DAVID, many chromatin-associated GO terms were identified for high-ranking proteins, including histone modification, covalent chromatin modification, and chromatin modification. This analysis indicates that DBPs whose sequence motifs showed the strongest positional preference for promoters also tended to be CRFs.
The relationship revealed above between the positional preference of sequence motifs and CRFs led us to postulate that a correlation may also exist between the binding of proteins exhibiting a high positional preference and nucleosome occupancy. Based on nucleosome positions obtained in a recent ChIP-Seq study (20), we calculated a score to represent nucleosome occupancy (see ‘Materials and methods’ section) for each PWM.
A large negative score indicated that the overlap between predicted binding sites and nucleosomes was much less than would be expected by random chance. Conversely, a large positive score suggested that the likelihood of an overlap was greater than random chance. The Spearman rank correlation coefficient between the promoter enrichment score and the score representing nucleosome occupancy of predicted binding sites was then calculated. Indeed, there was a negative correlation between positional preference and nucleosome occupancy (rs = −0.39, P < 1e−16) (Figure 2A). The P-values for correlation coefficients were calculated according to Best and Roberts (24). This result, combined with those from the GO analysis, suggests that DBPs whose binding sites show strong positional preference may act in part to remove or shift nucleosomes upon binding to allow entry by other transcriptional regulators (10), thereby playing a role in determining specificity.
To further confirm these results, we repeated the correlation analysis using a different measure of nucleosome occupancy. Kaplan et al. (8) produced a high-resolution map of nucleosome occupancy across the yeast genome. For each position in the genome, a nucleosome occupancy score was calculated. A negative number indicated that nucleosome occupancy was below the genome average, while a positive number represented an above-average likelihood of occupancy. We obtained the data set from Kaplan et al. (8) and averaged the nucleosome occupancy score for the set of predicted binding sites in promoter regions for a given PWM. Then, the Spearman rank correlation between the promoter enrichment score and the average nucleosome occupancy was calculated. With this method, we again observed a negative correlation between nucleosome occupancy and promoter preference (rs = −0.44, P < 1e−16) (Figure 2B).
Kaplan et al. also produced a map of nucleosome occupancy for chromatin that was reconstituted in vitro. Our results suggest that the trend toward lower nucleosome occupancy for motifs with a high positional preference may be due to active chromatin remodeling by the transcription factors that bind those motifs. If so, motifs with a high positional preference should also show the largest difference between in vitro and in vivo nucleosome occupancy, giving a positive correlation between the two. To test this hypothesis, we calculated the correlation between the promoter enrichment score and the difference in nucleosome occupancy in vitro and in vivo for the set of predicted binding sites in promoter regions for each PWM. As anticipated, promoter enrichment and the difference between in vitro and in vivo nucleosome occupancy were positively correlated (rs = 0.46, P < 1e−16, see Supplementary Figure 1).
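The per-PWM occupancy summaries used in these two analyses can be sketched as below. Several details are assumptions made for illustration: the occupancy track is a per-base list of scores, every base covered by a predicted site contributes equally to the average, and the difference is taken as in vitro minus in vivo. The function names and toy values are hypothetical.

```python
def mean_site_occupancy(track, sites, width):
    """Average per-base occupancy score over all bases covered by the sites."""
    values = [track[pos]
              for start in sites
              for pos in range(start, start + width)]
    return sum(values) / len(values)


def occupancy_difference(in_vitro, in_vivo, sites, width):
    """In vitro minus in vivo mean occupancy (sign convention assumed)."""
    return (mean_site_occupancy(in_vitro, sites, width)
            - mean_site_occupancy(in_vivo, sites, width))


if __name__ == "__main__":
    # Invented per-base scores; negative means below the genome average.
    toy_in_vitro = [1.0] * 10
    toy_in_vivo = [0.0] * 5 + [-1.0] * 5
    print(occupancy_difference(toy_in_vitro, toy_in_vivo, sites=[5], width=3))
```

Under this sign convention, a site that is occupied in reconstituted chromatin but depleted in vivo gives a positive difference, which is the pattern expected if nucleosomes are actively removed in the cell.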
Previous studies have shown that motif context, including distance from the TSS, likely plays a role in gene regulation in yeast and humans (25,26). This prompted us to investigate whether sequence motifs showing strong promoter enrichment also display a strong positional constraint within promoter regions. To answer this question, we calculated the distance to the TSS for predicted binding sites in yeast promoters. Sequence motifs that demonstrated significant location constraint within promoter regions clustered together at similar distances from the TSS corresponding to a narrow distribution of distances (Figure 3A). Sequence motifs that were not constrained within the promoter exhibited distance distributions with a larger spread (Figure 3B). We noticed with interest that sequence motifs with a strong positional bias within promoter regions seem to cluster ~100–300 bp upstream of the TSS (Figure 3C).
The semi-interquartile range was calculated to quantify the spread of each distance distribution. Because many of the distance distributions were skewed (see Figure 3A), the semi-interquartile range was a better measure of spread than the standard deviation. The Spearman rank correlation coefficient between the promoter enrichment score and the semi-interquartile range was then calculated. Indeed, positional preference for promoter regions (high promoter enrichment) correlated with positional preference within promoter regions (small semi-interquartile range) (rs = −0.29, P = 2.4e−12) (Figure 4).
Recent work elucidating nucleosome positioning in yeast has revealed a common chromatin architecture around TSSs consisting of a nucleosome covering the TSS, an immediate upstream nucleosome-free region (NFR) of ~140 bp, and a well-positioned nucleosome (‘−1’ nucleosome) on the upstream border of the NFR (7,27). Venters et al. (28) demonstrated that the −1 nucleosome is evicted upon recruitment of RNA polymerase II. Additionally, they showed that a number of chromatin remodeling complexes were selectively associated with the −1 nucleosome. Furthermore, a number of sequence-specific, experimentally determined binding sites overlapped the −1 nucleosome. These results support the idea that the positioning of the −1 nucleosome may be strongly regulated.
Here we show that sequence motifs with a strong positional bias within promoter regions cluster almost exclusively ~100–300 bp upstream of the TSS (Figure 3C). This localization places them in a prime location to regulate or be regulated by the −1 nucleosome, further supporting the idea that positioning of the −1 nucleosome is important in transcriptional regulation.
If CRFs with sequence motifs that exhibit strong positional preferences are modifying the chromatin structure in part to provide specificity to other DBPs, what is the mechanism of action? One possibility is that CRFs remove and/or shift nucleosomes to open up binding sites for other transcriptional regulators. For example, Rap1p, Abf1p and Reb1p are all highly abundant sequence-specific general regulatory factors that bind motifs with a strong preference for promoter regions. There is good evidence that all three play a role in influencing chromatin structure (10,29,30). Additionally, these proteins appear to act in part by creating bubbles of open chromatin (8,31–33). In the case of Rap1p and Abf1p, creating a region of open chromatin appears to facilitate the binding of additional regulatory factors, leading to transcription enhancement (31). In many cases, Rap1p and Abf1p are unable to activate robust transcription alone (34,35) and require additional regulatory factors. Further support is provided by the observation that Rap1p- and Abf1p-binding sites can be substituted for one another without a loss in function (31,35).
However, both Rap1p and Abf1p are involved in many functions, including repression (36–38). Rap1p initiates a repressive chromatin structure by interacting directly with the chromatin modifying factors Sir3p and Sir4p (37). Therefore, in addition to making binding sites accessible, it is likely that DBPs whose sequence motifs show a strong positional preference can increase specificity by directly interacting with chromatin modifiers or transcriptional regulators.
A question that immediately presents itself is whether the pronounced preference for promoter regions is sufficient to determine specificity. Is the positional distribution sufficient to fully explain binding in vivo? In a genome-wide location analysis, Lieb et al. (39) noted the strongly skewed positional preference of Rap1p-binding motifs and concluded that the positional distribution of potential Rap1p-binding sites may account for much of the specificity in Rap1p binding. However, the skewed positional distribution of these potential binding sites was insufficient to fully explain the pattern of Rap1p binding. In the case of Rap1p, additional genome-wide mechanisms also appear to be at work.
Supplementary Data are available at NAR Online.
L.M.R. was supported by CORPOICA. Funding for open access charge: Intramural Program of the National Library of Medicine, National Institutes of Health.
Conflict of interest statement. None declared.
The authors thank John Spouge for reviewing the manuscript and for his suggestions.