The recognition of specific DNA sequences by proteins is thought to depend on two types of mechanisms: one that involves the formation of hydrogen bonds with specific bases, primarily in the major groove, and one involving sequence-dependent deformations of the DNA helix. By comprehensively analyzing the three dimensional structures of protein-DNA complexes, we show that the binding of arginines to narrow minor grooves is a widely used mode for protein-DNA recognition. This readout mechanism exploits the phenomenon that narrow minor grooves strongly enhance the negative electrostatic potential of the DNA. The nucleosome core particle offers a striking example of this effect. Minor groove narrowing is often associated with the presence of A-tracts, AT-rich sequences that exclude the flexible TpA step. These findings suggest that the ability to detect local variations in DNA shape and electrostatic potential is a general mechanism that enables proteins to use information in the minor groove, which otherwise offers few opportunities for the formation of base-specific hydrogen bonds, to achieve DNA binding specificity.
The ability of proteins to recognize specific DNA sequences is a hallmark of biological regulatory processes. The determination of the three-dimensional structures of numerous protein-DNA complexes has provided a detailed picture of binding, revealing a structurally diverse set of protein families that exploit a wide repertoire of interactions to recognize the double-helix1. Nucleotide sequence-specific interactions often involve the formation of hydrogen bonds between amino acid side chains and hydrogen bond donors and acceptors of individual base pairs. It has long been recognized that every base pair has a unique hydrogen bonding signature in the major groove but that this is not the case in the minor groove2. Thus, the expectation has been that the recognition of specific DNA sequences would take place primarily in the major groove through the formation of a series of amino acid- and base-specific hydrogen bonds1. This “direct readout” mechanism is consistent with observations derived from three-dimensional structures of protein-DNA complexes but it is far from the entire story.
In many complexes, the DNA assumes conformations that deviate from the structure of an ideal B-form double helix3–5, sometimes bending in such a way as to optimize the protein-DNA interface6 and in some cases undergoing significant conformational changes, as in the opening of the minor groove in the complex formed between TBP and the TATA box7,8. The term “indirect readout” was coined9 to describe such recognition mechanisms that depend on the propensity of a given sequence to assume a conformation that facilitates its binding to a particular protein. The bases involved in such mechanisms need not be in contact with the protein and, for example, can be found in linker sequences that connect two half-sites that themselves are bound by individual protein subunits10,11.
We recently described an example of a novel readout mechanism, the recognition of local sequence-dependent minor groove shape12, that is distinct from previously described indirect readout mechanisms. In this case, the sequence-dependence of minor groove width and corresponding variations in electrostatic potential are used by the Hox protein Sex combs reduced (Scr) to distinguish small differences in nucleotide sequence12. Here we report that this mechanism is a widely used mode of protein-DNA recognition that involves the creation of specific binding sites for positively charged amino-acids, primarily arginine, within the minor groove. Minor groove narrowing is found to be correlated with A-tracts13,14, usually defined as stretches of four or more As or Ts that do not contain the flexible TpA step15, but extended here to include as few as three base pairs (see below). Our results offer fundamentally new insights into the structural and energetic origins of protein-DNA binding specificity and thus have important implications for the prediction of transcription factor binding sites in genomes.
Figure 1a reports the percentage of minor groove contacts associated with each amino acid, classified according to the width of the minor groove. Arginine constitutes 28% of all amino acid residues that contact the minor groove and is significantly enriched in narrow minor grooves, defined here by a groove width of <5.0 Å (compared to 5.8 Å in ideal B-DNA). Remarkably, 60% of the residues in narrow minor grooves are arginines as compared to 22% in minor grooves that are defined as not narrow – i.e. width ≥5.0 Å. A smaller enrichment is also observed for lysines but the overall population of lysines within the minor groove is much less than for arginine.
Binding to the minor groove is a characteristic of many, but not all, protein superfamilies and a significant subset of these contact a narrow minor groove (Table 1). Moreover, if the minor groove is contacted, arginines are likely to be involved, and the likelihood that an arginine will be present becomes even greater for narrow minor grooves (Supplementary Table 1).
Figure 1b compiles the DNA sequence preferences for protein-DNA complexes in which an arginine contacts a narrow minor groove. The figure shows that the base pair that has the shortest contact distance with the arginine guanidinium group has a probability of 78% of being an AT and 22% of being a GC. Neighboring base pairs in both the 5' and 3' directions surrounding the closest contacting base pair also have a strong tendency to be AT. Taken together, these data demonstrate that arginines tend to bind narrow minor grooves in AT-rich DNA.
We calculated minor groove widths for all tetranucleotides contained in PDB structures for both free DNA (Figure 2a) and DNA in complexes with proteins (Figure 2b). There is a large spread of values due in part to end effects and to the effects of crystal packing but some trends are nevertheless evident. For example, for free DNA structures most of the tetranucleotides with narrow minor grooves (width <5.0 Å) are AT-rich (Figure 2a and Supplementary Table 2a). Similar behavior is observed in protein-DNA complexes (Figure 2b and Supplementary Table 2b). In contrast, tetranucleotides with wide minor grooves have a strong tendency to be GC-rich.
The correlation between AT content and groove width is not unexpected given the fact that A-tracts are known to produce narrow minor grooves. However, TpA steps have a tendency to widen the minor groove15, so it was of interest to determine whether the distinct properties of A-tracts and TpA steps are reflected in our tetranucleotide data set. We find that 67% of tetranucleotides composed only of AT base pairs have a narrow minor groove but that this number increases to 82% if we exclude TpA steps so as to consider only A-tracts. Even A-tracts of length three have a strong tendency to narrow the minor groove. Forty-three percent of the tetranucleotides with a minor groove width of <5.0 Å have an A-tract of length three, a percentage that decreases to 11% of tetranucleotides with canonical minor groove widths (between 5.0 and 7.0 Å) and to 4% of tetranucleotides with minor grooves wider than 7.0 Å (Supplementary Figure 1). Additionally, compared to other AT-rich sequences, A-tracts are specifically enriched in DNAs with narrow minor grooves (Supplementary Figure 1). Thus, although A-tracts are usually thought of as requiring four or more base pairs, in part because a minimum of four is required to rigidify the DNA14, this analysis shows that A-tracts as short as length three are positively correlated with narrow minor grooves.
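The A-tract definition used above (a run of As and Ts that excludes the TpA step) can be made concrete with a short sketch. This is an illustration only, not the analysis code used for the study; the function names `find_a_tracts` and `has_a_tract` are hypothetical, and the sketch assumes that a tract is a maximal run of the form A...AT...T read along one strand.

```python
import re

def find_a_tracts(seq, min_len=3):
    """Return (start, end) spans of A-tracts in seq.

    Assumption for this sketch: an A-tract is a maximal run of A/T
    bases that contains no TpA step, i.e. a run of the form A...AT...T.
    """
    seq = seq.upper()
    spans = []
    # "A*T*" can never contain a T followed by an A, so each match
    # ends either at a TpA step or at the first G/C.
    for m in re.finditer(r"A*T*", seq):
        if m.end() - m.start() >= min_len:
            spans.append((m.start(), m.end()))
    return spans

def has_a_tract(tetranucleotide, min_len=3):
    """True if the tetranucleotide contains an A-tract of at least min_len."""
    return bool(find_a_tracts(tetranucleotide, min_len))
```

Under this definition `has_a_tract("AATT")` is true, whereas `has_a_tract("TATA")` is false because each TpA step splits the AT-rich run into pieces shorter than three base pairs.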
Figure 3 and Supplementary Figure 2 plot minor groove width and electrostatic potential vs. binding site sequence for several complexes whose binding interface includes an arginine inserted into the minor groove. The correlation of width and potential as well as the tendency of arginines to be located close to minima in width and potential is evident. Below we highlight a few specific examples of how arginine-minor groove interactions are used in DNA recognition.
Figure 3a represents the ternary complex of the Hox protein Ultrabithorax (Ubx) and its cofactor Extradenticle (Exd) bound to DNA16. In this complex, Arg5 of Ubx, which is a conserved residue across all homeodomains, inserts into a narrow region formed by a four base pair A-tract. Figure 3b provides an example of a long and very narrow A-tract that binds α2-Arg7 from the MATa1/MATα2 complex with DNA17. In contrast, α2-Arg4 inserts into a shallower region at one end of the A-tract where there are local minima in width and potential that are smaller than at the Arg7 site in the center of the A-tract. The two POU-domains of the Oct-1/PORE complex bind to two A-tracts (Figure 3c) where the minima are positioned in such a way as to provide binding sites for four arginines, two from each POU domain18.
The location of these A-tracts with respect to other nucleotide sequence features can be used to generate specificity, as previously discussed for the Hox protein Scr12. In the case of Scr binding, the position of a TpA step within an AT-rich region plays a critical role in binding specificity. A similar strategy is used by the MogR transcription factor where two long A-tracts separated by a TpA step produce two arginine binding sites19 (Figure 3d). The unique shape recognized by these two arginines is likely to contribute to the position of the MogR binding site along the DNA sequence. The overall tendency of TpA steps to widen the minor groove is most apparent when they are positioned between two A-tracts (as in Scr12 and MogR19) where the TpA step acts as a “hinge” between more rigid elements15,20. In other contexts, due to their flexibility, TpA steps can also be accommodated in narrow minor grooves21. An example is provided by the bipartite DNA-binding domain of Tc3 transposase where the arginines bind to a narrow region containing a TATA box22 that displays enhanced negative electrostatic potential (Figure 3e).
Although less frequent, arginines also bind narrow grooves associated with non-A-tract sequences. Figure 3f summarizes features of the binding of the 434 repressor to its operator23 which contains seven base pairs that are all AT with the exception of a central CG. (The guanine amino group tends to widen narrow grooves but a single GC base pair can be accommodated with only minor disruption.)
Figure 4a plots minor groove width and electrostatic potential along the DNA sequence of the nucleosome core particle containing recombinant histones and a 147 base pair DNA fragment (PDB code 1kx5)24. There are 14 minima in minor groove width corresponding to regions where the DNA bends so as to wrap around the histone core. As above, there is a striking correlation between width and potential. The variation in width between the narrowest and widest regions is about 5 Å and the difference between the maxima and minima in electrostatic potential is about 6 kT/e (Figure 4a). As a consequence, there should be a strong driving force for basic amino acids to bind to narrow regions and indeed arginines are found in 9 of the 14 minima. These arginines are shown in Figure 4b where the nucleosomal DNA has been color coded by minor groove width. (Although all 14 narrow minor groove regions are contacted by arginines24 only 9 of these satisfy our criteria of <6.0 Å between arginine atoms and base atoms in the groove). A similar repeating pattern of narrow minor grooves that are contacted by arginines is seen in all 35 available nucleosome crystal structures (Supplementary Figure 3a,b).
Because short A-tracts narrow the minor groove and facilitate the bending of DNA, we would expect to see a periodicity of A-tracts in DNA sequences bound by nucleosomes in vivo. Previous analyses have focused on dinucleotide statistics25,26 although it has been known for some time that there is a periodic pattern of AAA and AAT trinucleotides in nucleosome core DNA27,28. An analysis of DNA sequences bound in vivo by yeast nucleosomes29 reveals a clear periodicity for A-tracts of at least length three (denoted 3+, Figure 4c). Moreover, nucleosomal DNAs contain, on average, 10.0 A-tracts of length 3+ (Figure 4d). Periodicity is also detected for A-tracts of length 4+ and even 5+, although the number per nucleosome decreases to 4.1 and 1.6, respectively (Supplementary Figure 3). Thus, even though long A-tracts tend to be excluded from the nucleosome30, A-tracts of ≤ five base pairs, when present, are used to facilitate bending of the DNA around the histone core.
To evaluate the effect of TpA steps, we compared the periodicities of A-tracts of length three to those of other trinucleotides composed only of AT base pairs. Trinucleotides that contain TpA steps exhibit a much weaker periodic signal than A-tracts of length three, which exclude the TpA step (Supplementary Figure 4). Together, this analysis suggests that many of the sequence periodicities in nucleosomal DNA reflect the presence of short A-tracts that lead to narrow regions in the minor groove that in turn are recognized by a complementary set of arginines present on the surface of the nucleosome core particle.
The remarkable correlation between minor groove width and electrostatic potential (Figures 3 and 4) is due primarily to the properties of the Poisson-Boltzmann (PB) equation that have been extensively discussed in the literature31. Biological macromolecules are less polarizable than the aqueous solvent and, in the language of classical physics, can be thought of as a low dielectric region embedded in a high dielectric solvent. Solutions of the PB equation for DNA showed that lines of electrostatic potential due to backbone phosphates follow the shape of the DNA and are the most negative within the grooves32. This effect is due to electrostatic focusing, first observed for the protein superoxide dismutase31, where the narrow active site focuses electric field lines away from the protein and into the high dielectric solvent. The same physical phenomenon produces enhanced potentials in grooves, accounting for the strong correlation described above.
In order to establish the source of the effect in quantitative terms, we calculated the potentials for the MogR binding site19 when the dielectric constant is set to 80 both inside the macromolecule and in the solvent (Figure 5, dashed line) and for the case where the two dielectric constants are different (Figure 5, solid line). Strikingly, a significant enhancement of electrostatic potentials is only observed when the dielectric constants of the macromolecule and solvent are different, reflecting the focusing of electric field lines described qualitatively above. The small effect seen when the dielectric constant is the same results from the phosphates being closer to the center of the groove when it is narrow (see Supplementary Figure 5 for a breakdown of the contributions to the net electrostatic potential). Both sets of calculations were carried out at physiological salt concentrations. Although ionic strength has a strong effect on the absolute values of the potentials, the effect remains qualitatively the same (Supplementary Figure 6).
It is somewhat surprising that there is such a significant population of arginines in the minor groove, and a large enrichment when the groove is narrow, whereas the effects for lysines are more modest (Figure 1a). Arginines have been known for some time to be enriched relative to lysines in protein-protein33 and protein-DNA interfaces34 and the difference has generally been attributed to the ability of the guanidinium group to engage in more hydrogen bonds than the amino group of lysine35. To evaluate this idea we determined the number of hydrogen bonds formed by all the arginines and lysines in our data set that penetrate the minor groove. Surprisingly, on average, less than one hydrogen bond is formed by either amino acid side chain to DNA (0.9 for arginine and 0.6 for lysine), and the standard deviations are such that this difference is insignificant (Supplementary Table 3).
An alternate explanation derives from the difference in the size of the cationic moieties of the two residues. According to the classical Born model the solvation free energies of ions are proportional to the inverse of their radii31, suggesting that it is energetically less costly to remove a charged guanidinium group from water than it is to remove the smaller amino group of a lysine. To test this quantitatively, we calculated the change in free energy in transferring arginine and lysine from water to a medium of dielectric constant 2 (see Methods for details). The difference in the transfer free energies between the two residues ranges from 2.3 to 6.7 kcal/mole, depending on the force field that was used, with lysine consistently having the higher value (Supplementary Table 4). These results suggest that the higher prevalence of arginines compared to lysines in minor grooves is due, at least in part, to the greater energetic cost of removing a charged lysine from water than of removing a charged arginine.
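The inverse-radius dependence in the Born model can be illustrated with a minimal sketch. It treats each cationic group as a single charged sphere, which is a deliberate simplification: the calculations reported here used full charge distributions and atomic radii from several force fields, so the numbers produced below are not comparable to Supplementary Table 4. The function name and the radii passed to it are hypothetical.

```python
# Coulomb constant in kcal * Angstrom / (mol * e^2)
COULOMB_KCAL = 332.06

def born_transfer_energy(radius, charge=1.0, eps_from=80.0, eps_to=2.0):
    """Born free energy (kcal/mol) of moving a spherical ion of the given
    radius (Angstrom) from a medium of dielectric constant eps_from to
    one of dielectric constant eps_to."""
    return (COULOMB_KCAL * charge ** 2 / (2.0 * radius)
            * (1.0 / eps_to - 1.0 / eps_from))

# Illustrative radii only (assumptions, not fitted values): a smaller
# sphere standing in for the lysine amino group and a larger one for
# the arginine guanidinium group.
small_ion_cost = born_transfer_energy(2.0)
large_ion_cost = born_transfer_energy(3.0)
```

Because the energy scales as 1/radius, the smaller ion always pays the larger desolvation penalty in this model, which is the qualitative point made above.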
We have shown that there is a dramatic enrichment of arginines in narrow regions of the DNA minor groove that provides the basis for a novel DNA recognition mechanism that is used by many families of DNA-binding proteins. A readout mechanism based on groove width requires a connection between sequence and shape. This connection appears to be provided in part by A-tracts, which have a strong tendency to narrow the groove, producing binding sites for arginines that, when spaced appropriately on the protein surface, offer a complementary set of positive charges that can recognize local variations in shape. Arginines often insert into the minor groove as part of short sequence motifs (e.g. RQR in the Hox protein Scr12, RKKR in POU homeodomains18, RPR in Engrailed36, RGHR in MATa1/MATα217, RRGR in the nuclear orphan receptor37 and RGGR in the human orphan receptor38), thus offering a variety of presentation modes that can contribute to the specificity of DNA shape recognition.
The tendency of A-tracts to narrow the minor groove is due primarily to their ability to assume conformations, through propeller twisting, that lead to the formation of inter-base pair hydrogen bonds in the major groove15. This network is disrupted by TpA steps as strikingly seen in the MogR binding site19. GC base pairs also have a tendency to widen the minor groove14. The combination of these and other factors, such as effects induced by flanking bases that are not directly located within the binding site39, can produce a complex minor groove landscape that offers numerous possibilities for specific interactions with proteins. Indeed, minor groove geometry is no doubt the result of the interplay of intrinsic and protein-induced structural effects.
The physical mechanisms described here are dramatically evident in the nucleosome. The energetic cost of narrowing and bending the DNA in regions where the backbone faces inward will be reduced by the presence of short A-tracts that have an intrinsic propensity to assume such conformations and hence to bend the DNA28. In addition, the penetration of arginines into the minor groove at sites where the DNA bends and the groove is narrow21,40 provides a significant stabilizing interaction.
The variations in DNA shape observed in protein-DNA complexes often reflect conformational preferences of free DNA4,10,41. Sequence-dependent conformational preferences have also been observed in computational studies11,21,42 and, most recently, analysis of hydroxyl radical cleavage patterns shows that DNA shape is under evolutionary selection43. Such observations suggest that the role of DNA shape must be taken into consideration when annotating entire genomes and predicting transcription factor binding sites. The biophysical insights described here, together with the increased availability of high-throughput binding data, offer the hope of major progress in understanding how proteins recognize specific DNA sequences and in the development of improved predictive algorithms.
Minor groove geometry was analyzed with Curves44 for all 1,031 crystal structures of protein-DNA complexes in the PDB that have any amino acid contacting base atoms. Protein side chains contact the minor groove in 69% of those structures that have at least one helical turn of DNA. The probabilities for each amino acid to contact the minor groove were calculated for three groups of DNAs: total, narrow, and not narrow. Proteins were grouped based on 40% sequence identity. The properties of free DNAs and DNAs bound to proteins were analyzed based on the minor groove widths of tetranucleotides, defined at the central base pair step.
All 35 crystal structures of the nucleosome available in the PDB were analyzed. The analysis of nucleosomal DNA is based on 23,076 sequences in an in vivo yeast dataset29. The signal for a sequence motif in nucleosomal DNA is positive for a base pair when the base pair comprises any part of the sequence motif. Frequencies were symmetrized by analyzing both complementary DNA strands.
Electrostatic potentials were obtained from solutions to the non-linear Poisson-Boltzmann equation at physiologic ionic strength using the DelPhi program31,45. Regions inside the molecular surface of the DNA were assigned a dielectric constant of 2 while the solvent was assigned a value of 80. The potential is reported at a reference point at the center of the minor groove. The reference point is located close to the bottom of the groove in approximately the plane of a base pair. This definition provides a measure of electrostatic potential as a function of base sequence. Solvation free energies of amino acids were calculated for extended conformations of arginine and lysine side chains and compared for four different force fields.
There were in total 1,031 crystal structures of protein-DNA complexes in the PDB as of June 1, 2008 in which the DNA was contacted by any amino acid side chain at a distance <6.0 Å from base atoms. Of these structures, 567 contained at least one helical turn, and no chemical modifications or deformations that prevent the calculation of minor groove width. Groove geometry was analyzed using Curves44 and minor groove width was calculated as a function of base sequence by averaging all the Curves levels given for each nucleotide.
Of the 567 protein-DNA structures in our dataset, 392 have at least one minor groove contact defined by a distance of <6.0 Å between any base and side chain atoms. To avoid an oversampling bias, proteins in this dataset that shared ≥40% sequence identity were grouped to create 109 groups. The average number of contacts within each group was subsequently averaged over all 109 groups. These averages were divided by the sum of the average number of contacts for all amino acids to calculate the total minor groove contacts, contacts in not narrow minor grooves (≥5.0 Å), and contacts in narrow minor grooves (<5.0 Å), for each amino acid.
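The two-stage averaging described above (first within each sequence-identity group, then across groups, followed by normalization over all amino acids) can be sketched as follows. The data layout and the function name `contact_fractions` are assumptions made for illustration; they are not taken from the original analysis scripts.

```python
from collections import defaultdict

def contact_fractions(groups):
    """Fraction of minor groove contacts attributed to each amino acid.

    groups maps a group id to a list of per-structure dictionaries,
    each mapping an amino acid name to its number of contacts.
    """
    # Step 1: average the contact counts within each group.
    group_means = []
    for structures in groups.values():
        totals = defaultdict(float)
        for counts in structures:
            for aa, n in counts.items():
                totals[aa] += n
        group_means.append({aa: t / len(structures) for aa, t in totals.items()})

    # Step 2: average the group means over all groups, so that each
    # group carries equal weight regardless of how many structures it has.
    overall = defaultdict(float)
    for means in group_means:
        for aa, value in means.items():
            overall[aa] += value / len(group_means)

    # Step 3: normalize by the sum over all amino acids.
    total = sum(overall.values())
    if total == 0:
        return {aa: 0.0 for aa in overall}
    return {aa: value / total for aa, value in overall.items()}
```

Running the same function separately on contacts from narrow (<5.0 Å) and not narrow (≥5.0 Å) grooves would give the three sets of percentages compared in Figure 1a. The grouping step is what prevents heavily crystallized protein families from dominating the statistics.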
Hydrogen bond contacts between amino acid side chains and the DNA bases and phosphates, water molecules, and other protein atoms were identified with the HBplus program46.
The proteins in our dataset of protein-DNA complexes were classified in SCOP47 superfamilies. Proteins for which SCOP annotations were not available were annotated manually or using the ASTRAL database48.
Tetranucleotides in free DNA and protein-DNA complexes were used to analyze the base sequence propensity of minor groove regions as a function of minor groove width. The minor groove width of a tetranucleotide was defined by the average of all Curves44 levels for groove width of the second nucleotide and the first level of the third nucleotide, which describes groove width at the central base pair step. End regions and irregular tetranucleotides were excluded by requiring groove width definitions for at least one Curves level of each of the four nucleotides. Tetranucleotides from nucleosomal DNA were excluded from this analysis because the DNA is strongly deformed and the spacing between narrow regions is fixed at about one helical turn, thus adding a bias to the results. When applied to the 521 protein-DNA complexes in our dataset, these criteria allowed the analysis of all 136 possible unique tetranucleotides. When applied to the 88 free DNA structures in our dataset, the same criteria resulted in the analysis of 59 unique tetranucleotides. In order to increase coverage for the free DNA dataset, NMR structures were included if dipolar coupling data were used in the refinement.
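The figure of 136 unique tetranucleotides follows from counting a tetranucleotide and its reverse complement as the same double-stranded sequence. A brief sketch that enumerates them is given below; the helper names are hypothetical, and choosing the alphabetically smaller strand as the representative is simply a convenient convention for this example.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(seq):
    """Pick one representative for a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))

# All 4**4 = 256 tetranucleotides collapse to 136 strand-independent
# classes: 16 are their own reverse complement, the other 240 pair up.
unique_tetranucleotides = sorted(
    {canonical("".join(bases)) for bases in product("ACGT", repeat=4)}
)
```

With this convention, groove widths measured for `TTTT` on one strand and `AAAA` on the other are pooled under the single key `AAAA`.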
The structural analysis of nucleosomes includes all 35 crystal structures in the PDB as of May 1, 2009. The sequence analysis was based on 23,076 nucleosome sequences of length 146–148 base pairs in a yeast in vivo dataset29. These nucleosome sites were scanned for sequence motifs such as A-tracts of different length, TpA steps, or other AT-rich regions. A given motif contributed to a positive signal for any base pair that overlapped that motif, thus longer motifs contributed signals to more nucleotide positions. The frequencies of all motifs were symmetrized by analyzing both complementary strands.
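The per-position motif signal described above can be sketched in a few lines. This is one possible reading of the procedure, not the original implementation: it assumes that all sequences have been trimmed or aligned to a common length, and that symmetrization means scanning each sequence and its reverse complement in the same 5' to 3' coordinate frame and averaging the two. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def coverage(seq, motifs):
    """Return a 0/1 list marking every base that lies inside any motif."""
    covered = [0] * len(seq)
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            for i in range(start, start + len(motif)):
                covered[i] = 1
            # Step by one so that overlapping occurrences are found.
            start = seq.find(motif, start + 1)
    return covered

def symmetrized_signal(sequences, motifs):
    """Fraction of strands in which each position is covered by a motif."""
    length = len(sequences[0])
    signal = [0.0] * length
    for seq in sequences:
        for strand in (seq, reverse_complement(seq)):
            for i, hit in enumerate(coverage(strand, motifs)):
                signal[i] += hit
    return [value / (2 * len(sequences)) for value in signal]

# A-tracts of length three: AT-only trinucleotides without a TpA step.
A_TRACT_3 = ["AAA", "AAT", "ATT", "TTT"]
```

Because both strands are scanned, the resulting profile is symmetric about the center of the sequence, and a periodicity of about one helical turn would show up as regularly spaced peaks in `symmetrized_signal(sequences, A_TRACT_3)`.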
Electrostatic potentials were obtained from solutions to the non-linear Poisson-Boltzmann equation at 0.145 M salt using the DelPhi program31,45. Partial charges and atomic radii were taken from the Amber force field49. The interior of the molecular surface of the solute molecule (calculated with a 1.4 Å probe sphere) was assigned a dielectric constant of ε=2 while the exterior aqueous phase was assigned a value of ε=80. Debye-Hückel boundary conditions and five focusing steps were used with a cubic grid size of 165 (a grid size of 185 was used for the nucleosome).
The electrostatic potential is reported at a reference point close to the bottom of the minor groove approximately in the plane of base pair i. The reference point i is defined as the geometric midpoint between the O4' atoms of nucleotide i+1 in the 5'-3' strand and nucleotide i−1 in the 3'−5' strand12. Where the DNA strongly bends into the major groove the reference point can clash with the guanine amino group and cause large positive potentials (as seen in Figure 4a for three regions of the nucleosome).
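The reference point definition above reduces to the midpoint of two atomic positions. A minimal sketch, assuming the O4' coordinates have already been extracted from the structure file as (x, y, z) tuples in Å; the function name is hypothetical and the sketch does not handle chain termini, where nucleotide i+1 or i−1 does not exist.

```python
def groove_reference_point(o4_next_53, o4_prev_35):
    """Geometric midpoint between two O4' atoms.

    o4_next_53: O4' coordinates of nucleotide i+1 in the 5'-3' strand
    o4_prev_35: O4' coordinates of nucleotide i-1 in the 3'-5' strand
    """
    return tuple((a + b) / 2.0 for a, b in zip(o4_next_53, o4_prev_35))
```

The potential at base pair i is then read from the Poisson-Boltzmann solution at the returned coordinates.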
Desolvation free energies were calculated with the DelPhi program31,45 for the transfer of arginine and lysine side chains in extended conformations from water to a medium of dielectric constant ε=2. Transfer free energies were calculated for each of the two side chains based on charge distributions and atomic radii taken from Amber49 and three other force fields (see Supplementary Table 3).
This work was supported by NIH grants GM54510 (R.S.M.) and U54 CA121852 (B.H. and R.S.M.). The authors thank Andrea Califano for many helpful conversations.