We used genome-wide location analysis
(refs 7–10) to determine the genomic occupancy of 203 DNA-binding transcriptional regulators in rich media conditions and, for 84 of these regulators, in at least 1 of 12 other environmental conditions (
Supplementary Table 1,
Supplementary Fig. 1;
http://web.wi.mit.edu/young/regulatory_code). These 203 proteins are likely to include nearly all of the DNA-binding transcriptional regulators encoded in the yeast genome. Regulators were selected for profiling in an additional environment if they were essential for growth in that environment or if there was other evidence implicating them in the regulation of gene expression in that environment. The genome-wide location data identified 11,000 unique interactions between regulators and promoter regions at high confidence (
P ≤ 0.001).
To identify the
cis-regulatory sequences that are likely to serve as recognition sites for transcriptional regulators, we merged information from genome-wide location data, phylogenetically conserved sequences, and prior knowledge (). We used six motif discovery methods
(refs 11–13) to discover 68,279 DNA sequence motifs for the 147 regulators that bound more than ten probes (
Supplementary Methods,
Supplementary Fig. 2). From these motifs we derived the most likely specificity for each regulator through clustering and stringent statistical tests. This motif discovery process identified highly significant (
P ≤ 0.001) motifs for each of 116 regulators. We determined a single high-confidence motif for 65 of these regulators by using additional criteria including the requirement for conservation across three of four related yeast species. Examples of discovered and rediscovered motifs are depicted in , and comparisons of the discovered motifs with those described previously are shown in
Supplementary Table 2. The discovered motifs provide significantly more information than was previously available; for 21 of the regulators there was no prior specificity information in the literature, and detailed probability matrices had previously been determined for only 17 regulators for which we report motifs
(ref. 14). For Cin5, which showed the largest difference between the computationally derived motif (TTACRTAA) and the previously reported site (TTACTAA;
Supplementary Table 2), we found that the motif we report is also the preferred target
in vitro (
Supplementary Fig. 3). We supplemented the discovered motifs with additional motifs from the literature that also passed conservation tests, and we used this compendium of sequence motifs for 102 regulators (
Supplementary Table 3) in all subsequent analysis.
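The difference between the two Cin5 sites is easier to see once the degenerate symbol R is expanded. The following is a minimal sketch, assuming the standard IUPAC nucleotide codes (R = A or G); the function names and the test sequence are hypothetical and are not taken from the study, and only the forward strand is scanned.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC consensus such as TTACRTAA into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_sites(motif, sequence):
    """Return (start, matched_text) for every forward-strand match.

    A lookahead is used so that overlapping matches are also reported.
    """
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical sequence containing both expansions of TTACRTAA.
toy_sequence = "GGTTACGTAACCTTACATAAGG"
derived = find_sites("TTACRTAA", toy_sequence)   # computationally derived motif
reported = find_sites("TTACTAA", toy_sequence)   # previously reported site
```

On this toy sequence the eight-base motif matches twice and the seven-base site does not match at all, which illustrates that the two consensus strings describe different sets of sequences rather than one being a relaxed form of the other.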
We constructed an initial version of the transcriptional regulatory code by mapping on the yeast genome sequence the motifs that are bound by regulators at high confidence (
P ≤ 0.001) and that are conserved among
sensu stricto Saccharomyces species (;
http://web.wi.mit.edu/fraenkel/regulatory_map). This map includes 3,353 interactions within 1,296 promoter regions. Maps of regulatory sites encompassing larger numbers of promoters, constructed with lower-confidence information, can also be viewed on the authors' website. Because the information used to construct the map includes binding data from multiple growth environments, the map describes transcriptional regulatory potential within the genome. During growth in any one environment, only subsets of the binding sites identified in the map are occupied by transcriptional regulators, as we describe in more detail below.
Where the functions of specific transcriptional regulators were established previously, the functions of the genes they bind in the regulatory map are highly consistent with this prior information. For example, the amino-acid biosynthetic regulators Gcn4 and Leu3 bind to sites in the promoter of BAP2 (chromosome II), which encodes an amino-acid transporter (). Six well-studied cell cycle transcriptional regulators bind to the promoter for YHP1 (chromosome IV), which has been implicated in the regulation of the G1 phase of the cell cycle. The regulator of respiration Hap5 binds upstream of COX4 (chromosome VII), which encodes a component of the respiratory electron transport chain. Where regulators with established functions bind to genes of unknown function, these target genes are newly implicated in such functional processes.
The utility of combining regulator binding data and sequence conservation data is illustrated in . All sequences matching the regulator DNA binding specificities described in this study (
Supplementary Table 3) that occur within the 884-base-pair intergenic region upstream of the gene
BAP2 are shown in the upper panel. The subset of these sequences that have been conserved in multiple yeast species, and are therefore likely candidates for regulator interactions, is shown in the middle panel of . The presence of these conserved regulatory sites indicates the potential for regulation through this sequence but does not indicate whether the site is actually bound by a regulator under some growth condition. The incorporation of binding information (, bottom panel) identifies those conserved sequences that are used by regulators in cells grown under the conditions examined.
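The three panels described above correspond to successive filters on the same list of motif matches. A minimal sketch of that filtering follows, assuming each match carries a conservation flag and a binding P value; the record layout, the regulator names and the numbers are hypothetical placeholders, and only the P ≤ 0.001 threshold comes from the text.

```python
from dataclasses import dataclass

P_THRESHOLD = 0.001  # high-confidence binding threshold used in the text


@dataclass(frozen=True)
class MotifMatch:
    regulator: str    # regulator whose motif matches here
    position: int     # bp upstream of the coding sequence (hypothetical field)
    conserved: bool   # True if the match is conserved in related species
    binding_p: float  # P value for binding of this regulator at this promoter


def conserved_matches(matches):
    """Middle panel: matches that are conserved."""
    return [m for m in matches if m.conserved]


def bound_conserved_matches(matches, p_threshold=P_THRESHOLD):
    """Bottom panel: conserved matches whose regulator binds the promoter."""
    return [m for m in matches if m.conserved and m.binding_p <= p_threshold]


# Toy records with made-up values, for illustration only.
all_matches = [
    MotifMatch("RegA", 150, True, 0.0002),
    MotifMatch("RegA", 620, False, 0.0002),
    MotifMatch("RegB", 310, True, 0.2),
    MotifMatch("RegC", 420, True, 0.0009),
]
middle_panel = conserved_matches(all_matches)
bottom_panel = bound_conserved_matches(all_matches)
```

Each filter can only shrink the list, which mirrors the progression from sequence matches to conserved sites to sites supported by binding data.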
The distribution of binding sites for transcriptional regulators reveals constraints on the organization of these sites in yeast promoters (). Binding sites are not uniformly distributed over the promoter regions but instead show a sharply peaked distribution. Very few sites are located in the region 100 base pairs (bp) upstream of protein-coding sequences. This region typically includes the transcription start site and is bound by the transcription initiation apparatus. The vast majority (74%) of the transcriptional regulator binding sites lie between 100 and 500 bp upstream of the protein-coding sequence, far more than would be expected at random (53%). Regions further than 500 bp upstream contain fewer binding sites than would be expected at random. It seems that yeast transcriptional regulators function at short distances along the linear DNA, a property that reduces the potential for inappropriate activation of nearby genes.
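The comparison between the observed fraction of sites (74%) and the fraction expected at random (53%) can be illustrated with a simple enrichment calculation. This is a sketch only: it assumes that a one-sided binomial test is an acceptable null model, which the text does not state, and the distances below are made-up values rather than data from the study.

```python
from math import comb


def fraction_in_window(distances, lo=100, hi=500):
    """Fraction of sites whose upstream distance (bp) lies in [lo, hi]."""
    inside = sum(1 for d in distances if lo <= d <= hi)
    return inside / len(distances)


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical upstream distances for ten binding sites.
toy_distances = [50, 120, 200, 250, 300, 450, 480, 600, 150, 350]
observed_fraction = fraction_in_window(toy_distances)

# Probability of seeing at least this many sites in the window if each
# site fell there independently with the random expectation of 0.53.
n_sites = len(toy_distances)
n_inside = round(observed_fraction * n_sites)
p_value = binomial_upper_tail(n_inside, n_sites, 0.53)
```

With only ten toy sites the tail probability is not small; the same calculation applied to thousands of sites would make a 74% versus 53% difference far more extreme.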
We note that specific arrangements of DNA binding sites occur within promoters, and we suggest that these promoter architectures provide clues to regulatory mechanisms (). For example, the presence of a DNA binding site for a single regulator is the simplest promoter architecture and, as might be expected, we found that sets of genes with this feature are often involved in a common biological function (
Supplementary Table 4). A second type of promoter architecture consists of repeats of a particular binding site sequence. Repeated binding sites have been shown to be necessary for stable binding by the regulator Dal80 (
ref. 15). This repetitive promoter architecture can also permit a graded transcriptional response, as has been observed for the
HIS4 gene
(ref. 16). Several regulators, including Dig1, Mbp1 and Swi6, show a statistically significant preference for repetitive motifs (
Supplementary Table 5). A third class of promoter contains binding sites for multiple different regulators. This promoter arrangement implies that the gene might be subject to combinatorial regulation, and we expect that in many cases the various regulators can be used to execute differential responses to varied growth conditions. Indeed, we note that many of the genes in this category encode products that are required for multiple metabolic pathways and are regulated in an environment-specific fashion. In the fourth type of promoter architecture we discuss here, binding sites for specific pairs of regulators occur more frequently within the same promoter regions than would be expected by chance (
Supplementary Table 6). This ‘co-occurring’ motif architecture implies that the two regulators interact physically or have related functions at multiple genes.
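Whether two motifs share promoters more often than expected by chance can be assessed with a hypergeometric test. The sketch below assumes that test as the null model; the text does not specify which statistic was used, and the function name and the counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(overlap, n_a, n_b, n_total):
    """P(X >= overlap) when n_b promoters are drawn from n_total,
    of which n_a carry motif A, and X counts the draws carrying A.

    Here n_a and n_b are the numbers of promoters containing motif A
    and motif B, and overlap is the number containing both.
    """
    max_overlap = min(n_a, n_b)
    favourable = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / comb(n_total, n_b)


# Hypothetical counts: 40 promoters with motif A, 30 with motif B,
# 12 with both, out of 1,000 promoters in total.
p_cooccur = hypergeom_upper_tail(12, 40, 30, 1000)
```

Under independence the expected overlap for these toy counts is 40 × 30 / 1,000 = 1.2 promoters, so an observed overlap of 12 gives a very small tail probability.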
By conducting genome-wide binding experiments for some regulators under multiple cell-growth conditions, we learned that regulator binding to a subset of the regulatory sequences is highly dependent on the environmental conditions of the cell (
Supplementary Fig. 4). We observed four common patterns of regulator binding behaviour (,
Supplementary Table 7). Prior information about the regulatory mechanisms employed by well-studied regulators in each of the four groups suggests hypotheses to account for the environment-dependent binding behaviour of the other regulators.
‘Condition-invariant’ regulators bind essentially the same set of promoters (within the limitations of noise) in two different growth environments (). Leu3, which is known to regulate genes involved in amino-acid biosynthesis, is among the best studied of the regulators in this group. Binding of Leu3
in vivo has been shown to be necessary but not sufficient for the activation of Leu3-regulated genes
(ref. 17). Rather, regulatory control of these genes requires the association of a leucine metabolic precursor with Leu3 to convert it from a negative to a positive regulator. We note that other zinc cluster type regulators that show ‘condition-invariant’ behaviour are known to be regulated in a similar manner
(refs 18,19). It is therefore reasonable to propose that the activation or repression functions of some of the other regulators in this class have requirements in addition to DNA binding.
‘Condition-enabled’ regulators do not bind the genome detectably under one condition, but bind a substantial number of promoters with a change in environment. Msn2 is among the best-studied regulators in this class, and the mechanisms involved in Msn2-dependent transcription provide clues to how the other regulators in that class might operate. Msn2 is excluded from the nucleus when cells grow in the absence of stresses but accumulates rapidly in the nucleus when cells are subjected to stress
(refs 20,21). This condition-enabled behaviour was also observed for the thiamine biosynthetic regulator Thi2, the nitrogen regulator Gat1 and the developmental regulator Rim101. We suggest that many of these transcriptional regulators are regulated by nuclear exclusion or by another mechanism that would cause this extreme version of condition-specific binding.
‘Condition-expanded’ regulators bind to a core set of target promoters under one condition but bind an expanded set of promoters under another condition. Gcn4 is the best-studied of the regulators that fall into this ‘expanded’ class. The levels of Gcn4 are reported to increase sixfold when yeast cells are introduced into media with limiting nutrients
(ref. 22), owing largely to increased nuclear protein stability
(refs 21,23), and under this condition we find that Gcn4 binds to an expanded set of genes. The probes bound when Gcn4 levels are low contain better matches to the known Gcn4-binding site than probes that are bound exclusively at higher protein concentrations, which is consistent with a simple model for specificity based on intrinsic protein affinity and protein concentration (
Supplementary Fig. 5). The expansion of binding sites by many of the regulators in this class might reflect increased levels of the regulator available for DNA binding.
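The simple model referred to above, in which occupancy depends on intrinsic affinity and on protein concentration, can be sketched with a single-site equilibrium binding expression. This is an illustrative assumption and not a model fitted in the study: the dissociation constants, the occupancy cut-off and the concentration units are arbitrary, and only the sixfold increase comes from the text.

```python
def occupancy(concentration, kd):
    """Equilibrium fractional occupancy of a single site:
    concentration / (concentration + Kd)."""
    return concentration / (concentration + kd)


def bound_sites(site_kds, concentration, threshold=0.5):
    """Names of sites whose occupancy reaches the (hypothetical) threshold."""
    return sorted(
        name for name, kd in site_kds.items()
        if occupancy(concentration, kd) >= threshold
    )


# Hypothetical sites: a lower Kd stands for a better match to the motif.
toy_sites = {"strong": 1.0, "medium": 4.0, "weak": 20.0}

low_level = bound_sites(toy_sites, concentration=1.0)
high_level = bound_sites(toy_sites, concentration=6.0)  # sixfold increase
```

In this toy setting only the best match is occupied at the low concentration, and the sixfold increase adds the intermediate site while the weakest site remains unbound, which is the qualitative pattern described for the condition-expanded class.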
‘Condition-altered’ regulators exhibit an altered preference for the set of promoters bound in two different conditions. Ste12 is the best-studied of the regulators whose binding behaviour falls into this ‘altered’ class. Depending on the interactions with other regulators, the specificity of Ste12 can change and alter its cellular function
(ref. 24). For example, under filamentous growth conditions, Ste12 interacts with Tec1, which has its own DNA-binding specificity
(ref. 25). This condition-altered behaviour was also observed for the transcriptional regulators Aft2, Skn7 and Ume6. We propose that the binding specificity of many of the transcriptional regulators might be altered through interactions with other regulators or through environment-dependent modifications (for example, chemical modifications).
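The four binding patterns described above can be expressed as a rule over the sets of promoters bound in two conditions. The following is a minimal sketch under stated assumptions: the cut-offs (a Jaccard index of 0.8 for invariance, 80% retention of the smaller set for expansion) are hypothetical and are not the criteria used in the study, and noise in the binding data is ignored.

```python
def classify_binding(targets_a, targets_b, min_targets=1,
                     invariant_jaccard=0.8, core_retained=0.8):
    """Assign one of the four binding patterns to a regulator, given the
    promoters it binds in condition A and in condition B."""
    a, b = set(targets_a), set(targets_b)
    if len(a) < min_targets and len(b) < min_targets:
        return "unbound"  # no detectable binding in either condition
    if len(a) < min_targets or len(b) < min_targets:
        return "condition-enabled"  # binding appears in only one condition
    shared = a & b
    if len(shared) / len(a | b) >= invariant_jaccard:
        return "condition-invariant"  # essentially the same promoters
    smaller = a if len(a) <= len(b) else b
    if len(shared) / len(smaller) >= core_retained:
        return "condition-expanded"  # core set kept, extra promoters added
    return "condition-altered"  # largely different promoters


# Toy promoter sets with placeholder gene names.
invariant = classify_binding({"g1", "g2", "g3"}, {"g1", "g2", "g3"})
enabled = classify_binding(set(), {"g1", "g2"})
expanded = classify_binding({"g1", "g2"}, {"g1", "g2", "g3", "g4", "g5"})
altered = classify_binding({"g1", "g2", "g3"}, {"g3", "g4", "g5"})
```

The order of the checks matters: the enabled case is tested first because an empty set makes the overlap measures uninformative, and invariance is tested before expansion so that two nearly identical sets are not labelled as expanded.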
Substantial portions of eukaryotic genome sequence are believed to be regulatory
(refs 2,3,26), but the DNA sequences that actually contribute to regulation of genome expression have been ill-defined. By mapping the DNA sequences bound by specific regulators in various environments, we identify the regulatory potential embedded in the genome and provide a framework for modelling the mechanisms that contribute to global gene expression. We expect that the approaches used here to map regulatory sequences in yeast can also be used to map the sequences that control genome expression in higher eukaryotes.