|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: ST MEF. Performed the experiments: ST MEF. Analyzed the data: ST RAH MEF. Contributed reagents/materials/analysis tools: MEF. Wrote the paper: ST RAH MEF.
Heat-shock genes have a well-studied control mechanism for their expression that is mediated through cis-regulatory motifs known as heat-shock elements (HSEs). The evolution of important features of this control mechanism has not been investigated in detail, however. Here we exploit the genome sequencing of multiple Drosophila species, combined with a wealth of available information on the structure and function of HSEs in D. melanogaster, to undertake this investigation. We find that in single-copy heat shock genes, entire HSEs have evolved or disappeared 14 times, and the phylogenetic approach bounds the timing and direction of these evolutionary events in relation to speciation. In contrast, in the multi-copy gene Hsp70, the number of HSEs is nearly constant across species. HSEs evolve in size, position, and sequence within heat-shock promoters. In turn, functional significance of certain features is implicated by preservation despite this evolutionary change; these features include tail-to-tail arrangements of HSEs, gapped HSEs, and the presence or absence of entire HSEs. The variation among Drosophila species indicates that the cis-regulatory encoding of responsiveness to heat and other stresses is diverse. The broad dimensions of variation uncovered are particularly important as they suggest a substantial challenge for functional studies.
Although cis-acting sequences are clearly vital to the regulation of gene expression and its evolution , the decoding of the organizational basis for this regulation and evolution is still a work in progress despite considerable effort, numerous research studies, and substantial genomic data. The difficulty of this task resides in the nature of the code itself: it is irregular, contains numerous synonyms and exceptions, and no rule appears to go unviolated . We suggest, as have others  that a way forward may be in the combination of two aspects: a cis-regulatory sequence whose mechanism of controlling gene expression is understood in detail, and a robust phylogeny of successively more distantly related species. The former feature links explicit consequences for gene expression to variation in the cis-regulatory sequence. A phylogenetic approach has two advantages beyond “phylogenetic footprinting”; i.e., the identification of putative cis-regulatory sequence through conservation . First, it poses specific hypotheses about where in evolution changes have arisen; these hypotheses may in turn anchor other hypotheses about specific genetic mechanisms of evolutionary change and the adaptive consequences of the change. Second, it highlights which features of the cis-regulatory sequence are free to evolve and which are stabilized by selection, providing corroboration for functional studies but also possibly suggesting other, unrecognized features of the functional mechanism for further testing –. Here we exemplify these points in the cis-regulatory sequences governing the heat-shock response, heat-shock elements (HSEs).
HSEs are among the best-characterized and simplest of cis-regulatory elements (CREs) in metazoans (Figure 1); this and the nature of the heat-shock response itself provide an exceptionally firm foundation for analysis. In brief, heat, other proteotoxic stresses, and diverse other signals change gene expression dramatically. In Drosophila, the transcription of most active genes is halted  and a restricted set of heat-shock genes, many encoding molecular chaperones and including HSEs in their proximal promoter regions, are coordinately upregulated during or after heat-stress . Unlike in many other promoters, experimentally-verified HSEs of Drosophila are rarely beyond ~500 bp of the transcription start site, and this relatively short proximal promoter suffices for full-strength transcription .
HSEs are binding sites for HSFs (heat-shock factors, Figure 1), transcription factors whose trimerization mediates the heat-shock response in eukaryotes ,  in combination with numerous co-regulators . In many higher eukaryotes HSFs are multiple, with family members varying in stimuli for their activation, affinity for HSEs, and downstream targets , . Fortuitously for the present study, HSF is singular in Drosophila  and highly conserved . [This contrasts with the co-evolution of CRE binding sites and TFs both in other HSFs and for other TFs (eg. )]
The DNA binding domain of HSF contains an antiparallel β sheet and a cluster of three α helices, one of which (helix 3) functions in DNA recognition in the context of a helix-turn-helix structure . This domain is itself highly conserved , implying that variation in HSEs is primarily if not exclusively responsible for variation in HSF-HSE interaction in Drosophila.
Well-defined features of the HSE sequence seem critical for HSF-HSE binding. In flies as in other taxa, the canonical HSE comprises inverted repeats of the pentanucleotide sequence 5′-NGAAN-3′, where N is any nucleotide , and typically is within 400 bp upstream of the transcriptional start site. HSEs often contain at least 3 continuous 5 bp subunits alternating between NGAAN or NTTCN or vice versa, with each 5 bp HSE subunit capable of binding one monomer of the HSF trimer –. The alternating multiple subunits promote cooperative binding of HSF trimers in the major groove of the DNA helix , which correlates with more stable protein-protein and DNA-protein interactions and stronger heat shock responses , . In general, the number of consecutive 5 bp units relates to the magnitude of the heat-shock response. However, in yeast, HSEs of equal length (4 subunits) and homologous to the consensus have appreciably different binding properties depending on which subunit variant (NTTCN or NGAAN) occupies the most distal position . This difference is due to the ability of the variant beginning with NTTCN to bind an additional HSF trimer , and is not evident in experimental studies of D. melanogaster .
The three central nucleotides, GAA or TTC, are highly conserved in D. melanogaster , implying they are functionally constrained. Indeed, the 2nd position of an NGAAN motif is critical for binding, with the 3rd and 4th positions having a lesser effect and the 5th position no effect . Mutations in the first position also affect HSF binding , however, although the compositional bias is less than for positions 3 or 4. These data suggest the consensus motif should again [see, for example, , – as former episodes of redefinition] be redefined as AGAAN . As both HSF activation and the heat-shock response are multistep processes, variation in these processes cannot unilaterally be attributed to variation in HSEs, however.
Finally, HSEs are often multiple. Although HSF-HSE complexes that form near the transcription start site (TSS) can engage the transcriptional apparatus through physical proximity, even more distal HSEs can similarly engage through the bending of the promoter DNA . The approximation of distal HSEs to the TSS is a function of both their position within the promoter and the interaction of GAGA factor and its binding sites, positioned nucleosomes and their acetylation state, and other co-regulators , , .
As with other aspects of the cis-regulatory code, HSEs deviate from the canon. In vitro, HSEs with only 2 subunits can form stable interactions with HSF , . Furthermore, the 5 bp inverted repeats need not be consecutive. Gapped HSEs, for example, contain internal 5 bp blocks with little or no homology to the canonical motif flanked by canonical sequences in the proper orientation. These HSEs have binding affinity comparable to the minimal functional binding sequence as long as they exceed 4 subunits . In yeast, gapped versions constitute an entire class of HSEs . HSF's affinity for HSEs varies with their sequence and position , . HSF also appears to bind promoters that lack either typical or gapped HSEs, and in genes not previously associated with the heat-shock response , –. In some cases, recognizable motifs for other transcription factors are present, suggesting that HSF binding to these promoters is mediated differently than when HSEs are present .
Here we use this knowledge to elucidate the evolution of cis-regulatory sequence in Drosophila heat-shock proximal promoters derived from 12 species whose genomes have been sequenced , and. We also include data from the island endemic species D. santomea, which has diverged substantially from its sister species, D. yakuba, in thermal tolerance , a phenotype potentially explicable by regulatory divergence in heat-shock promoters. As suggested above, the combination of detailed mechanistic understanding of HSE structure-function and a phylogenetic context reveals hitherto unrecognized features of heat-shock promoters that apparently are under functional constraint. Moreover, where heat-shock promoters have evolved, this same combination defines the nature and timing of the underlying evolutionary events.
The 13 species of Drosophila exhibit at least 419 computationally identifiable HSEs in the 8 heat-shock genes under study (Table S1, Table S2, Figures 2–9).9). This estimate includes 223 in usually single-copy genes (Hsp22, Hsp23, Hsp26, Hsp27, DnaJ-1, Hsp68, and Hsp83), with the balance in Hsp70, which is always multi-copy. These HSEs vary in occurrence, number, position, and conformity with the canonical sequence. For example, HSEs range from two 5 bp subunits, [in Hsp23, Hsp26, DnaJ-1, and Hsp70 (Figures 3, ,4,4, ,6,6, ,8)]8)] to eleven subunits [in Hsp68 (Figure 7)]. Importantly, phylogenetic conservation in the 13 species suggests that certain features are functionally significant and are therefore subject to selection. These features include:
(a) Putative HSEs that previously had been overlooked. In Hsp26, for example, a third (and putative) HSE consisting of 2–3 5 bp units occurs in all melanogaster subgroup (Figure 4) species. This sequence was observed previously in D. melanogaster but reasonably declared not an HSE at that time . Its conservation in 6 species for >10 MY suggests that this conclusion should be revisited.
(b) Tail-to-tail configurations, which are preferred and persist in evolution. Identified HSEs are both head-to-head (an HSE with a nGAAn-type motif at its most distal site) and tail-to-tail (an HSE in which nTTCn is the most distal subunit) . The tail-to-tail configuration is preferred, however. Of the 393 HSEs in the analysis, 236 began with nTTCn while only 157 began with nGAAn (G-test: G=15.19, df=1, p<0.0001), despite an overall ratio of nTTCn-type subunits to nGAAn-type of 1.01. The orientation of an HSE appears conserved across species. This is true of most HSEs (Figures 2–9)9) and is especially clear in the proximal HSE of Hsp83.
(c) Gapped HSEs, which are common and conserved across species. We classify HSEs as gapped if one or more internal 5 bp subunits score <50% of the maximum possible for their motif type, and are flanked by subunits scoring >50%. Despite prior treatment in the literature as anomalies (see Introduction), gapped HSEs are commonplace in the 13 species of Drosophila. Gapped HSEs occur in every gene in our data set, in both proximal and distal HSEs, and in all species. The gapping, moreover, is conserved, appearing as light stripes in the heat maps in Figures 2–9.9. In Hsp23, for example, a low-scoring 5 bp unit (in grey) persists in 2 separate HSEs in melanogaster group (Figure 3) species diverging as much as 12 MYA, as do 2 tandem 5 bp units in the second most proximal HSE.
(d) Diversity in the organization of heat-shock promoters. Overall, species vary in the aggregate length of HSE sequence in their heat-shock promoter regions (G-test: G=22.810, df=11, p<0.019). Few if any heat-shock promoters are identical in the number, arrangement, and size of their HSEs (except that all include at least one HSE), suggesting that organizations sufficient for heat-shock responsiveness are multiple. Although the distal HSEs are more variable than the proximal HSEs, even the latter are not uniform. For example, whereas most promoters in the present study included a proximal (or solo) HSE within 67 bp of the TSS, the proximal HSE is 94–137 bp from the TSS in the Hsp23 promoters of all species and >270 bp from the TSS in the Hsp27 promoters of all melanogaster subgroup species, anomalies evidently conserved for >40 MY and >10 MY, respectively. In other cases, HSEs come and go during evolution, merge and diverge, move proximally or distally, and expand and contract – but with resultant patterns conserved for millions of years.
We cannot unambiguously decipher the relationships of the distal HSEs in Hsp27 promoters of the subgenus Drosophila (Figure 5) species and D. willistoni, and so exclude them from the following analysis. While the presence or absence of the remaining HSEs varies, this variation is always consistent with the phylogeny. That is, in no case does a feature disappear in a clade and reappear in a descendant clade. This consistency, combined with the principle of parsimony, allows the unambiguous assignment of evolutionary gains or losses of HSEs to specific periods in cladogenesis (Figures 2–99):
(a) Multiple origins of unique HSEs. In Hsp23, three such origins of distal HSEs occurred independently after D. mojavensis, D. willistoni, and D. ananassae each diverged from the other species. In Hsp26, two such origins of distal HSEs occurred independently after D. willistoni, and D. virilis each diverged from the other species. A distal HSE evolved in DnaJ-1 of D. grimshawi after it diverged from the other species. No such events occurred in the other genes examined.
(b) Unique losses of HSEs. A distal HSE disappeared in Hsp22 of D. grimshawi after it diverged from the other species, and is thus absent in this species only. A distal HSE disappeared in the Hsp83 of a common ancestor of all Sophophora (Figure 9) species examined, and is thus absent in all. This event cannot unambiguously be designated as a loss or a gain from the data for the 13 species, but the Hsp83 sequence of an outgroup (Culex) indicates the HSE was lost in Sophophora rather than gained in a common ancestor of the non-Sophophora species.
(c) Origins of HSEs with subsequent retention in all descendant species. In Hsp22, HSE features evolved prior to the divergence of the 13 species, although the distal HSE disappeared after D. grimshawi diverged. [An alternative explanation, that the state in D. grimshawi was ancestral and a distal HSE of similar size and position originated independently first in a common ancestor of D. virilis and D. mojavensis and second in a common ancestor of all other species, can be rejected as less parsimonious.] In Hsp23 and in Hsp27, a distal HSE evolved uniquely in a common ancestor of the Sophophora species (but after D. willistoni diverged for Hsp23), appearing in all descendant species. In Hsp26, a pair of distal HSEs evolved uniquely in a common ancestor of the melanogaster subgroup species, appearing in all descendant species. A distal HSE similarly evolved in Hsp27. In DnaJ-1 a distal HSE evolved in a common ancestor of D. persimilis and D. pseduoobscura, and a pair of distal HSEs arose separately in a common ancestor of the melanogaster group (Figure 6) species, in each case appearing in all descendant species.
(d) HSEs of Hsp70 are anomalous in their conservatism (Figure 8). The only routinely multi-copy gene surveyed includes four HSEs of similar size and location in the vast majority of its proximal promoter regions (except in a single copy each in D. santomea and D. willistoni).
The data support 14 discrete evolutionary events (gains or losses) distributed among 24 branches in the phylogeny, with 0, 1, or 2 events per branch (Figure 10). Lengths of these three classes of branches average 5.2, 19.5, and 29.7 MY respectively; this variation is significant (Kruskal-Wallis test, 2 df, p=0.0012). HSEs have thus neither appeared nor disappeared in the individual species of the melanogaster subgroup or the two species of the obscura group (Figures 2–9)9) presumably because the time since their divergence has been insufficient.
The sequences of HSEs seem far more variable than their number, size, and position. With Hsp70 HSEs excluded, overall HSEs vary among species (Tamura-Nei gamma distance: mean=0.410, SE=0.089; transversions only) as much as do the third positions of codons within the coding sequence of the genes in which they occur. (synonymous: mean=0.413, S.E=0.095; HSE: mean=0.410, SE=0.089; transversions only). By contrast, certain positions within HSEs are highly constrained. We analyzed 872 nGAAn-type 5 bp subunits and 880 nTTCn-type 5 bp subunits. While our findings (Figures 11 and and12)12) largely conform to previously reported patterns, some departures are evident. For example, whereas Position 1 in nTTCn has absolutely no nucleotide preference as previously reported, in the corresponding position in nGAAn (position 5) A accounts for 42.7% of the bases. Also, in position 5 in nTTCn only 40.5% of the motifs are T, less than previously reported.
Furthermore, even between otherwise poorly conserved sites in the 5 bp motifs, certain combinations of nucleotides were more prevalent than expected by chance for nGAAn-type motifs (Figure 12). For example, As in both position 1 and position 5 (denoted A1A5) were excessively more frequent than expected from their joint probability, as was A3A4. By contrast, A1T5 and C1A5 were excessively infrequent. Excesses and deficiencies in nTTCn-type motifs were smaller.
The foregoing analysis of 5 bp subunits included those from all HSEs in promoters we examined, including those in all copies of paralogous genes. To assess whether this universal inclusion affects our results, we repeated all analyses with HSEs from a single, arbitrarily chosen gene copy in each species. The results were not appreciably different than those for the more inclusive dataset.
Currently studies of cis-regulatory element (CRE) evolution emphasize (a) the discovery of novel CREs through their evolutionary conservation , , (b) the change in the combination of heterotopic CREs in cis-regulatory modules and networks , , , (c) evolved differences in CREs in sister taxa , –, and (d) the deduction of the general properties of CRE evolution through data-mining on a multigenome-wide scale , , , among others. While these emphases are laudable, they curiously have bypassed the distribution and nature of discrete evolutionary changes in CREs on a phylogenetic scale (e.g., Figures 2–9).9). [By contrast, comparable investigations of coding sequence (e.g., ) and phenotype (e.g., , Figure 1 in ) are numerous.] Thus, the present study provides a unique glimpse at the evolution of cis-regulatory elements (CREs) and the promoters that include them. As foreseen (see Introduction), the combination of a large pre-existing literature on the focal CRE's structure and function, a well-established and detailed phylogeny, and aligned orthologous promoters of the included clades underlie this analysis, and ought to permit similar analyses wherever this combination is available elsewhere. Indeed,  have similarly described gains and losses of bicoid and hunchback binding sites, and  changes in the structure and function of an enhancer element, in a subset of the 13 species investigated in the present study.
HSEs are always present in the genes we have studied, and presumably the promoters that include them uniformly can respond to activated HSF. By inference, the Drosophila heat-shock promoters studied never completely gain or lose HSF responsiveness, but evolve the number of HSFs that can bind, binding affinity, and/or the locations of the HSEs, which putatively affects their ability of HSF-HSE complexes to interact with the transcriptional apparatus. Three studies examining the gain and loss of other CREs in Drosophila implicate more rapid turnover , , ; each reports, for example, turnover in the melanogaster subgroup species, within which we detect none. By contrast, while HSEs appear or disappear (but never entirely) in evolution, they more frequently expand or contract, merge or separate, and/or migrate towards or away from the TSS. A possible explanation is that the genes hosting CREs in the work of , ,  are evolving more rapidly than heat-shock genes. The prior studies examine CREs involved in early development, whose evolution is responsible for the divergent phenotypes of the species. By contrast, the heat-shock response and its major genes are notoriously highly conserved ,  and considerably predate complex eukaryotes –. An approximate counterpart concerns duplicate genes, wherein one member of a duplicate pair can evolve rapidly after duplication. In genes free to evolve in this manner, especially recent duplicates , both gene expression and cis-regulatory sequence evolve more rapidly than in non-duplicate genes whose evolution is constrained . [Hsp70, the one multi-copy gene in the present study, is the only gene with almost no turnover of HSEs; its evolution, however, is constrained by gene conversion, at least among transcribed sequence ]. HSEs occur in numerous genes other than heat-shock genes; whether HSEs turn over more rapidly in less-conserved non-heat-shock genes might confirm or reject this explanation.
The phylogenetic context bounds the dates of each evolutionary event. While we have limited our dating to the most conspicuous of evolutionary events, the gain or loss of entire HSEs, changes in size and location of HSEs can similarly be dated. The dating can unambiguously exclude evolutionary explanations. We ourselves have posited that the heat-shock expression system in Drosophila melanogaster (and perhaps also D. simulans and D. yakuba) is an adaptation to the unique thermal environment that this species encounters. While D. melanogaster's thermal niche may be unique, the organization of its HSEs is both unremarkable and preceded the evolution of the species. The same is true of the desert species, D. mojavensis, and D. yakuba vs. its less-thermotolerant sister species, D. santomea . With respect to a linkage between regulatory evolution and speciation, none is evident in the heat-shock promoters under study. Finally, the gain or loss of HSEs is evident only for long time periods, as none is detectable in species diverging during the last 10 MY (Figures 2–99).
Nonetheless, our analysis has several significant limitations. Obviously it is only as good as the phylogeny on which it rests. Although the relationships of the 13 species examined in the present study are well established , , the 12 whose genomes have been sequenced were chosen to achieve objectives  other than the elucidation of HSE evolution. Hence, they are less than ideal for present purposes. For example, inclusion of D. montium, whose subgroup diverged from the ananassae subgroup, would have enabled localization of the evolution of an HSE in Hsp23 of D. ananassae to before or after the divergence. In particular, the time segments pertinent to the evolution of the Drosophila subgenus species, obscura subgroup species, and D. willistoni are too long to time evolutionary events precisely. This limitation can be addressed by selective resequencing of species whose divergence times progressively subdivide lengthy time segments; i.e., phylogenetic walking. Drosophila is a speciose group with a detailed phylogenetic tree (http://flybase.org/static_pages/allied-data/phylogeny/Drosophilidae-Tree/list3.html) and representative stocks (http://stockcenter.ucsd.edu) available, features lacking for many other taxa. Another limitation is that sequence variation, especially in non-transcribed sequence, often impedes the alignments on which the analysis depends, thereby reducing sample size and power , . We attempted to include members of the Hsp60 gene family in the analysis, for example, but were thwarted for lack of confidence in their orthology and alignment. In a sister study  of 117 primarily single-copy promoters in the 12 sequenced Drosophila genomes, despite extensive manual curation only 33% had conservation of the TSS sufficient to be alignable in all 12 species. Lack of conservation may limit the application of our approach to most genes, which are less well conserved than heat-shock genes , .
As noted, phylogenetic context also discovers aspects that are unique vs. repeatable and constrained vs. variable in evolution, thereby posing functional hypotheses for follow-up. Ordinarily caveats (e.g., chance resemblances among species, mutations in trans that compensate for mutations in cis) apply to this concept , but are minimal in the present study due to the detailed pre-existing mechanistic understanding of regulation of the heat-shock response. The prevalence and conservation of gapped HSEs, tail-to-tail organization, and interactions between positions 1 and 5 in nGAAn-type motifs are among these discoveries. In yeast, HSE variants beginning with nGAAn or nTTCNn (head-to-head or tail-to-tail, respectively) have distinct binding properties and biological activities , as tail-to-tail variants can bind more HSF trimers than head-to-head variants of equal size. In D. melanogaster, all that is known is that both variants bind HSF with similar affinity in vitro . Thus, either the benefit of the tail-to-tail arrangement or some other explanation (e.g., genetic linkage) is yet to be discovered for Drosophila. Gapped HSEs have been documented in a number of heat-inducible genes in yeast and are capable of binding HSF , but to our knowledge have not been described in Drosophila. In the yeast, Schizosaccharomyces pombe, the modulation of heat-shock expression through HSF appears to vary based on HSE architecture, suggesting that whether typical or gap-type HSEs occur in a promoter may be related to gene- or stress-specific differences in regulation . Although induction of heat-shock gene expression is not identical in yeast and flies , the nature of the HSF-HSE interaction is well-conserved , . This fact, together with the conservation of gap-type HSEs across Drosophila species, argues for constraint on these motifs, and a possible functional role in Drosophila. Finally, the canonical sequence of HSE 5 bp subunits has been inferred largely from a single species, D. melanogaster, with inherent ascertainment bias. A phylogenetic and multispecies perspective can test the robustness of a canon based on a single species. Our analysis is consistent with the previously-reported conservation of sites 1 and 2 in the AGAAn motif and site 4 in the nTTCn motif , , sites intimately involved in the interaction between HSE and the HSF DNA-binding domain . Additionally, the correlations between nucleotide states at pairs of sites (Figure 12) suggests that certain combinations of nucleotides outside the highly conserved regions favor or disfavor HSF binding. This finding suggests that the influence of nucleotide composition at individual sites in a HSE on binding affinity may be non-additive. Interestingly, significant correlations among states at different sites occurred in only the nGAAn-type motif. In conjunction with the finding of a bias towards HSEs with nTTCn motifs at the most distal site (tail-to-tail), this suggests the intriguing possibility that HSF-binding and heat-shock gene expression could differ mechanistically depending on HSE orientation in Drosophila.
While our sample of HSEs is both larger and phylogenetically more diverse than previously available, it reaffirms rather than alters the consensus sequence previously derived from more limited samples of HSEs. Importantly, our results overall suggest some caveats on the uses of this consensus sequence. For example, in silico searches for HSEs (e.g.,  should accommodate gapped HSEs and recognize that sites outside the central 3 positions of each 5 bp unit may consequential. Such practices may reconcile the apparent contradiction between experimentally determined binding sites of HSF and predicted binding sites.
While we have posited that pre-existing detailed understanding of a CRE's function ought to facilitate the analysis of its evolution, we may have underestimated the level of detail that is necessary. Although the heat-shock promoter is among the best understood of eukaryotic promoters and piecemeal experimental investigations of numerous variants establish that most are consequential for heat-shock gene expression (see Introduction), a precise functional read-out of evolutionary variation in the heat-shock promoter [as, for example,  have attempted at the level of the cis-regulatory module] is currently difficult to impossible to obtain experimentally, even with recent advances in throughput . As we have reviewed, the number, arrangement, sequence, and position of HSEs is consequential for heat-shock gene expression. Thus a single 3×5 bp-unit HSE represents 415 potential combinations of nucleotides, as even the highly-conserved positions sometimes vary when a canonical 5 bp-unit is adjacent to it. The HSE may be tail-to-tail or head-to-head. HSEs may be larger than 3×5 bp units, presenting both more combinations and increasing potential for cooperative binding of HSF. Separation between the HSE and the TSS can vary. Multiple HSEs can occur within a single heat-shock promoter, with identical HSEs having different impacts on heat-shock expression depending on their position and arrangement. Finally, mechanisms in addition to HSF binding to HSEs, both in cis and in trans, regulate heat-shock expression . The specific patterns observed in the 145 promoters of the 13 species presumably are feasible for heat-shock gene expression. Which unobserved patterns are not feasible and which are feasible but have not evolved? Of the observed arrangements, which encode higher vs. lower levels of gene expression (and what levels), and which are functionally equivalent? Elucidation of the cis-regulatory code rests on the answers to such questions. The enormous complexity of even such a seemingly simple and well-understood promoter as the heat-shock promoter, however, suggests that this elucidation may be among the “grand challenges” of biology . While “nothing in biology makes sense except in the light of evolution” , sometimes even this light is insufficient to make sense.
Sequences for 8 D. melanogaster heat-shock genes (Hsp22, Hsp23, Hsp26, Hsp27, DnaJ-1, Hsp68, Hsp70, and Hsp83) were obtained from FlyBase (http://flybase.org). We screened the highest scoring results of BLASTn and tBLASTn searches with D. melanogaster nucleotide coding and protein sequences using sequence homology, synteny, FlyBase orthology assignments, and reciprocal BLAST to D. melanogaster to identify orthologs in the other 11 species of Drosophila with sequenced genomes.
Several putatively single-copy heat-shock genes had lineage-specific duplicate copies, which underwent further analysis.
Six copies of Hsp70 are in the D. melanogaster genome. We used BLASTn with the coding sequence of D. melanogaster Hsp70Aa as a query to each of the other 11 sequenced genomes. All hits with an e-value of 0.0 were further examined to determine whether they constituted a putative copy of Hsp70, using a combination of Gbrowse orthology calls, synteny and relative position, and reciprocal BLAST to D. melanogaster. Hsp70 homologs were generally found in high-scoring clusters of >75% identity to D. melanogaster Hsp70, often in tail-to-tail configurations. Hsp68 and Hsp70 were distinguishable from diagnostic sites . Analyses incorporated all verified copies found in each species , except where specifically noted.
The identification of regulatory elements is dependent on the recognition of the promoter region upstream of the TSS. We chose a 1000 bp region that should encompass the majority of proximal promoter regulatory elements . Transcriptional start sites were first determined in D. melanogaster using experimental data collected in the Eukaryotic Promoter Database  and extrapolated to other species by referencing the whole genome alignments via the UCSC Genome Browser (http://genome.ucsc.edu/). The 1000 bp region upstream of the conserved TSS was then extracted for each gene in each species. In several cases this process was complicated by gene duplication or similarity. First, assessment of TSS location was confounded for Hsp68, as several species appear to have copies of Hsp70 aligned to the D. melanogaster Hsp68 gene region. We located the UCSC aligned regions in the genomes of species where the alignment appeared incorrect, and confirmed that the predicted coding sequence of the adjacent gene was Hsp70 through comparisons to diagnostic sites . Local multiz alignments obtained from the PromAn server  were exclusively used for the promoter region of Hsp68 to define the TSS in each species. Second, for several lineage-specific duplications or for the multi-copy gene Hsp70, only a single paralog is aligned at UCSC. We used local alignments to determine placement of the TSS in all paralogs. Third, Hsp22 is duplicate in D. grimshawi, but only one copy has identifiable HSEs in the region upstream of the putative TSS and was included in analysis.
We obtained heat-shock promoter sequence from D. santomea strain CAR1600  with primers designed from the closely related D. yakuba . DNA was extracted from single individuals using a DNeasy kit (Qiagen, Rockville, MD), and PCR was performed using the following conditions: 94°C for 2 min., 35 cycles of 94°C for 30 sec., 50–55°C for 30 sec., and 72°C for 1 min, with a final extension at 72°C for 6 min. Sequence reads were trimmed and edited using Sequencher 4.5.In D. yakuba, Hsp70 occurs in two tail-to-tail clusters. One set of primers enabled sequencing of the entire intergenic region of one cluster (GE24511 and GE26203 in D. yakuba); the second cluster (GE26149 and GE24569) proved refractory to PCR amplification. As D. santomea is not included in the UCSC genome alignment, we aligned the TSS from this species based on the closely related D. yakuba for promoter region delineation.
The defined promoter region upstream of the aligned TSS in all promoters was used as the search region for HSEs. First, we employed a motif finder, part of Mod Tools , to locate potential HSEs by searching for the 8-nucleotide sequence TTCNNGAA, allowing for a maximum of 4 mutations and screening for 8 bp sequences with a matrix score of at least 70. This search motif ought to detect all HSEs, which will include this sequence regardless of whether NTTCN or NGAAN occupies the most distal position. HSEs longer than the search motif were obvious as they generated multiple hits in the same region. All HSEs identified in the initial motif search were further screened prior to retention for analysis based on the following criteria derived from experimental data: (1) they must contain at least 3 contiguous alternating 5 bp subunits matching the canonical HSE, (2) they must have no more than 2 substitutions and (3) they must have no mismatch to G in position 2 (NGAAN type) and C in position 4 (NTTCN type), unless the 5 bp mismatched subunit was interior to an HSE that otherwise contained at least 4 subunits. 10 bp gaps were not tolerated under any circumstance.
A final screen was performed to quantify the likely binding strength of each HSE subunit from its sequence similarity with the canonical motif, and to define boundaries of HSEs. We generated WebLogos  for each of the alternative 5 bp subunits recovered in the motif search. Bit-scores were summed for each base at the 5 positions in each subunit, and expressed as a percentage of the sum calculated if the preferred base were at every position. Subunits scoring <50% and at an external position in a putative HSE were removed from further analysis. We specifically retained gapped HSEs in the dataset by including subunits scoring <50% but flanked by higher-scoring subunits. Lastly, we used the 12-species genome alignment at UCSC supplemented with multiz alignments of local regions to assess the orthology of individual HSEs. Many HSEs detectable in this search were homologous to previously identified and, in some cases, experimentally confirmed HSEs in D. melanogaster, adding confidence to their assignment.
A final screen was performed to test for the possible presence of HSEs consisting of only 2 5 bp subunits . To exclude false positives, we applied a stringent threshold: only motifs conserved across multiple species with a maximum of one substitution (excluding positions G2 and T4) from the canonical HSE motif were retained. This search yielded the final collection of HSEs.
We compared rates of divergence among species in HSE sequence with a neutral proxy, synonymous changes at 3rd positions of adjacent heat-shock coding regions. We performed this calculation among all orthologous HSEs and coding regions for all single copy genes except Hsp27 to ensure adequate sample sizes. To minimize the issue of substitutional saturation, we based our calculations on transversions only.
We performed 2-way χ2 tests on all possible position pairs for the 5 positions within a subunit, using the entire pool of 1,611 identified motifs. The test was also performed on a corrected pool of motifs, to account for overrepresentation of HSEs from Hsp70, and other duplicates. The nGAAn-type motifs were tested separately from the nTTCn-type motifs, as these variants differ in binding ability . The expect values for two given bases in their respective locations was the product of their rate of occurrences at those two positions. Results indicated where excesses and deficiencies arose in the data set. Three-way χ2 tests were not performed due to insufficient sample size.
Extracted sequences of all HSEs and HSE motifs. All identified HSEs are indicated along with gene and species of origin. Individual motifs that constitute each HSE are also listed.
(0.14 MB XLS)
Annotated HSEs in heat-shock promoters. All individual HSEs are indicated within promoters from all species and all genes. HSEs are color coded by distance to the transcription start site.
(0.23 MB DOC)
The authors would like to thank the members of the laboratory groups of Martin Kreitman and Kevin White at the University of Chicago for comments and discussion on matters relevant to the current study, Marcos Antezana for assistance with the analysis, and Patricia Wittkopp for comments on the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Funding: Supported by the National Science Foundation (IOB06-41278). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.