Specific binding of a protein to DNA is now appreciated to be influenced both by the sequence of nucleotides and by the shape of the DNA double helix.1,2
A direct connection between minor groove width and electrostatic potential was recently established, providing a physical basis for this type of shape readout. Specifically, the magnitude of the electrostatic potential in the minor groove is controlled by the width of the groove,3 with narrowing of the groove associated with more negative electrostatic potential. Many proteins have been found to take advantage of this property by inserting positively charged arginine side chains into the groove where it is narrow.4
Different DNA sequences can give rise to similar DNA shapes.5
The smaller "space" of DNA structure compared to nucleotide sequence confounds typical sequence-based analyses of genomes, which may miss regions of structural similarity that are not also similar in sequence.6,7
For example, current computational strategies for finding protein binding sites in genomes, which rely on nucleotide sequence identity (or similarity),8 are not effective in identifying similarities in DNA shape. To use shape recognition to understand how a protein selects a binding site in a genome, we need a way to map DNA shape variation both at high resolution and on a large scale. Here we report that hydroxyl radical cleavage of DNA provides the information required to evaluate shape and electrostatic potential variation in a DNA molecule of any length, including the DNA of an entire genome.
We begin by comparing the experimental hydroxyl radical cleavage pattern of a DNA molecule with NMR and X-ray structures of the same DNA sequence to "calibrate" the structural information embodied in the cleavage pattern. We next construct a new type of cleavage pattern that includes information from both DNA strands, to map minor groove width and electrostatic potential. Finally, we use an experimental database of cleavage patterns as the basis of an algorithm to computationally predict the minor groove shape and electrostatic potential for entire genomes.
We first forge an explicit link between cleavage and structure through quantitative comparison of the experimental hydroxyl radical cleavage pattern and the three-dimensional structure of DNA. For this analysis we use the Drew-Dickerson dodecamer, [d(CGCGAATTCGCG)]₂, undoubtedly the structurally best-characterized DNA molecule.11–15
We had previously obtained a large amount of experimental hydroxyl radical cleavage data for the Drew-Dickerson dodecamer in the context of our efforts to construct ORChID, the •OH Radical Cleavage Intensity Database. This database contains experimental cleavage patterns for more than 150 DNA sequences, each 40 base pairs in length. As a result, all 512 unique pentanucleotide sequences are represented in ORChID. Each of the 40-mers in ORChID is flanked on both sides by the Drew-Dickerson dodecamer sequence. The hydroxyl radical cleavage pattern of the dodecamer is therefore exceptionally well determined, because we have so many independent examples of the pattern.
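The count of 512 can be checked with a short sketch. It assumes that "unique" means a pentanucleotide and its reverse complement are counted once, since both describe the same duplex; that reading is an assumption, and the function names are illustrative only.

```python
from itertools import product

# Watson-Crick complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Pick one representative for a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


def count_unique_kmers(k: int) -> int:
    """Count k-mers, treating a k-mer and its reverse complement as one."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return len({canonical(kmer) for kmer in kmers})


print(count_unique_kmers(5))  # 512
```

Because no odd-length sequence can equal its own reverse complement, the 4^5 = 1024 pentanucleotides pair off exactly into 512 classes.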
The hydroxyl radical cleaves DNA by abstracting a hydrogen atom from a deoxyribose residue in the backbone. We showed previously that the solvent accessible surface area (SASA) of a deoxyribose hydrogen atom governs the extent of its reactivity with the hydroxyl radical.16
These experiments found that the 5', 5", and 4' hydrogen atoms, which lie on the outer edges of the DNA minor groove and are most exposed to solvent (Figure 1, panel a), react most often. We have also noted that the extent of hydroxyl radical cleavage varies at each nucleotide along a double-stranded DNA molecule,17,18 suggesting that the cleavage pattern embodies information on sequence-dependent variation in DNA shape.
Figure 1 Quantitative correlation of hydroxyl radical cleavage with DNA structure. (a) H4' (blue) and H5', H5" (red) are the deoxyribose hydrogen atoms most often abstracted by the hydroxyl radical.16 (b) The extent of hydroxyl radical cleavage (black circles) ...
In Figure 1, panel b we compare the extent of hydroxyl radical cleavage for each nucleotide of the Drew-Dickerson dodecamer with the sum of the solvent accessible surface areas of the 5', 5", and 4' hydrogen atoms of that nucleotide, as determined from X-ray structures. Where the minor groove is wide and the deoxyribose backbone hydrogens are exposed, cleavage is high (Figure 1, panel b, left inset); where the groove is narrow and the backbone hydrogens are less exposed, cleavage is low (Figure 1, panel b, right inset). This plot demonstrates that hydroxyl radical cleavage accurately reports the sequence-dependent variation in the shape of the DNA backbone. We call this type of hydroxyl radical cleavage pattern ORChID1, to indicate that it represents the •OH radical cleavage pattern of one strand of the DNA duplex.
We next sought to develop a method to map the shape of the DNA minor groove, since minor groove width and electrostatic potential are important recognition elements for protein binding.3,4,19
But while the ORChID1 pattern and SASA are properties associated with individual nucleotides in one of the strands of the double helix (Figure 1), the minor groove width depends on both DNA strands. To construct a metric that incorporates hydroxyl radical cleavage information from both strands, we first determined the extent of cleavage for a given nucleotide, and then averaged this value with the extent of cleavage for the residue on the opposite strand that is closest in space across the minor groove. In B-form DNA these two positions are staggered by three nucleotides in the 3' direction.7
The phosphate groups of the same two nucleotides are used to define minor groove width.20
We call this new type of hydroxyl radical cleavage pattern ORChID2, to denote that it incorporates cleavage information from both strands of the DNA duplex. This approach enables a direct comparison between hydroxyl radical cleavage and minor groove width, with both structural parameters treated as double-strand properties.
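The two-strand averaging described above can be sketched as follows. The indexing convention is an assumption made for illustration: both cleavage lists are indexed by base-pair position along the top strand (5' to 3'), so the cross-groove partner of top-strand position i is the bottom-strand nucleotide at base pair i + 3. The function name and the numbers in the usage example are hypothetical.

```python
def orchid2(top, bottom, offset=3):
    """Average single-strand cleavage values across the minor groove.

    top[i]    : cleavage of the top-strand nucleotide at base pair i
    bottom[i] : cleavage of the bottom-strand nucleotide at base pair i
    offset    : stagger between cross-groove partners (3 for B-form DNA,
                per the text; the direction used here is an assumption)
    """
    if len(top) != len(bottom):
        raise ValueError("both strands must cover the same base pairs")
    # Positions without a partner at the end of the duplex are dropped.
    return [(top[i] + bottom[i + offset]) / 2
            for i in range(len(top) - offset)]


# Hypothetical cleavage values for a five-base-pair duplex.
print(orchid2([1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]))
# [20.5, 26.0]
```

Each output value is a double-strand quantity, which is what allows it to be compared directly with a groove width measured between the two strands.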
To test the correspondence of the ORChID2 pattern with minor groove width we again took advantage of the structurally well-characterized Drew-Dickerson dodecamer. We made two comparisons, first with NMR structures of the dodecamer, and then with X-ray structures. While NMR structures of nucleic acids are usually not as high in resolution as X-ray structures, they are determined in solution, and so are free of crystal packing effects. For our analysis we used an NMR structure of the Drew-Dickerson dodecamer15 that was determined using dipolar coupling and chemical shift anisotropy data, and is thus of very high quality. We observe an excellent correlation of the experimental ORChID2 pattern with the width of the minor groove derived from the NMR structure (Figure 2, panel a).
Figure 2 (a) Quantitative correlation of the experimental ORChID2 cleavage pattern (black circles), with electrostatic potential (red diamonds) and minor groove width (green squares) determined from a set of five NMR structures of the Drew-Dickerson dodecamer.
Like the NMR experiment, the hydroxyl radical cleavage experiment is performed in solution, where the three-dimensional structure of the Drew-Dickerson dodecamer has the same symmetry as its palindromic nucleotide sequence.15
The ORChID2 pattern therefore exhibits the symmetry that is inherent in the palindromic sequence and structure of the dodecamer. But it is well known that crystal packing effects lead to asymmetry in the crystal structure of the dodecamer.21
To compare the ORChID2 pattern to the minor groove width determined from various X-ray structures,10,12,22–27 we symmetrized the groove width28 based on the inherent symmetry of the Drew-Dickerson dodecamer sequence. Symmetrization is a standard approach to separate crystal packing from sequence-dependent effects on DNA structure.28
We find an excellent correlation between the experimental ORChID2 pattern and the symmetrized minor groove width derived from the X-ray structures (Figure 2, panel b). We also compared the ORChID2 pattern with the minor groove width derived from all-atom Monte Carlo simulations of the Drew-Dickerson dodecamer,3,28 and observe an excellent correlation (Supplementary Figure 1).
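One simple way to symmetrize a per-position profile of a palindromic duplex is to average each value with the value at its symmetry-related position. The sketch below assumes this mirror-averaging scheme; the text does not spell out the exact procedure, and the groove widths in the usage example are made-up numbers.

```python
def symmetrize(values):
    """Average each position with its mirror image about the center.

    For a palindromic duplex, position i and position n - 1 - i are
    related by the two-fold symmetry of the sequence, so any difference
    between them in a crystal structure is attributed to packing.
    """
    n = len(values)
    return [(values[i] + values[n - 1 - i]) / 2 for i in range(n)]


# Hypothetical minor groove widths (in angstroms) along a short profile.
widths = [6.0, 4.0, 3.0, 5.0, 7.0]
print(symmetrize(widths))  # [6.5, 4.5, 3.0, 4.5, 6.5]
```

The result is identical when read from either end, matching the symmetry that the solution measurements show.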
To test the generality of the correspondence of cleavage pattern with structure, we searched the ORChID database for sequences that also have X-ray structures. We found the 9-mer sequence GATATCGCG, which is contained in the dodecamer [d(CGCGATATCGCG)]₂, for which the X-ray structure has been determined.29 Although less experimental cleavage data is available in ORChID for this sequence than for the Drew-Dickerson dodecamer, we find an excellent correlation between the ORChID2 pattern and the X-ray-derived minor groove width (Supplementary Figure 2).
To extend the comparison of ORChID2 to more nucleotide sequences, for each tetranucleotide in the Protein Data Bank (PDB) we plotted the minor groove width4 versus the experimental ORChID2 value derived from the ORChID database (Supplementary Figure 3). For DNA in protein-DNA complexes we find a very good correlation of average ORChID2 value with minor groove width (Pearson correlation = 0.653, p-value < 1 × 10⁻¹⁶). For free DNA molecules the correlation is similar (Pearson correlation = 0.638, p-value = 4.06 × 10⁻⁸), despite the smaller number of free DNA structures in the PDB.
Since it has been shown that electrostatic potential depends on minor groove width,3 the results depicted in Figure 2 and Supplementary Figures 1 and 2 suggest that the ORChID2 pattern also embodies information on the local variation of minor groove electrostatic potential. To test this idea we solved the non-linear Poisson-Boltzmann equation to calculate the electrostatic potential at points in the center of the DNA minor groove.4,30 We symmetrized the electrostatic potential pattern that was calculated from X-ray-derived structures to remove crystal-packing effects. We did not symmetrize the electrostatic potential pattern derived from the NMR-based structures. As we anticipated, the ORChID2 pattern and electrostatic potential are highly correlated, for both NMR- and X-ray-derived structures (Figure 2). We find a similarly high degree of correspondence between the ORChID2 pattern and electrostatic potential for the sequence GATATCGCG (Supplementary Figure 2).
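For reference, one common textbook form of the non-linear Poisson-Boltzmann equation, written for a 1:1 salt in SI units, is given below. The notation is not taken from the text and is an assumption: ψ is the electrostatic potential, ε(r) the position-dependent permittivity, ρ_f the fixed charge density of the DNA, c₀ the bulk number density of each ion species, and λ(r) a mask equal to 1 where mobile ions can reach and 0 elsewhere. The cited calculations may use a different but equivalent formulation.

```latex
\nabla \cdot \left[ \varepsilon(\mathbf{r}) \, \nabla \psi(\mathbf{r}) \right]
  = -\rho_{f}(\mathbf{r})
    + 2 e c_{0} \, \lambda(\mathbf{r}) \,
      \sinh\!\left( \frac{e \, \psi(\mathbf{r})}{k_{B} T} \right)
```

The hyperbolic sine term is the net charge of the mobile ions under a Boltzmann distribution; replacing it with its argument gives the linearized equation, which is less accurate near a highly charged molecule such as DNA.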
The results shown in Figure 2 and Supplementary Figure 2 establish that the experimentally determined ORChID2 pattern represents a quantitative map of minor groove width and electrostatic potential. To use ORChID2 to map structural and electrostatic variation in genome-scale DNA molecules, we have developed a method to computationally predict the ORChID2 pattern. We have shown previously that a prediction tool based on experimental ORChID1 patterns, which considers the properties of a single DNA strand, can be used to predict the ORChID1 pattern for any DNA sequence, of any length, with high accuracy.5 We developed a related algorithm that computationally predicts the ORChID2 pattern for any DNA sequence of interest. The algorithm is very efficient, so predictions can be made for genome-length DNA sequences. We have deposited a dataset consisting of the ORChID2 pattern for the human genome in the UC Santa Cruz genome browser. Because of its high correlation with minor groove electrostatic potential (Figure 2), the ORChID2 pattern represents the structure-dependent variation of electrostatic potential in the DNA minor groove throughout the human genome, at single base-pair resolution. This tool allows the role of DNA shape in protein-DNA recognition to be evaluated at the whole-genome scale. We note that our approach is applicable to any genome for which sequence information is available.
To demonstrate the application of ORChID2 to genome-scale recognition of DNA shape, we have analyzed sets of nucleosome-bound sequences that were identified in the yeast31 and Drosophila32 genomes. Appreciation of the involvement of chromatin structure in gene regulation has focused widespread attention on the underlying basis for nucleosome positioning.33 Previous work has most often attempted to find nucleotide sequence motifs that are associated with positioned nucleosomes,4,31,34 with limited success. But while sequence motifs have been identified that lead to bends or kinks that facilitate nucleosome binding,4 similar structural patterns can result from different sequence motifs.5 Because the hydroxyl radical cleavage pattern has been shown to be capable of uncovering structural similarity among sets of diverse nucleotide sequences,6 we used the ORChID2 pattern to reveal structural motifs in genomic DNA sequences that form nucleosomes.
When DNA wraps around the histone octamer, the minor groove faces the histone core every 10 base pairs, a strikingly periodic structural feature. X-ray structures of nucleosome core particles reveal a corresponding periodic variation in the width of the minor groove of nucleosome-bound DNA (Supplementary Figure 4). We calculated the ORChID2 patterns for 23,076 DNA sequences from yeast that were found experimentally to be occupied by nucleosomes,31 and averaged these patterns. The resulting composite ORChID2 pattern has a clear 10 bp periodicity (Figure 3, panel a). We find a very good correlation between minor groove width and the ORChID2 value for each nucleotide in the nucleosome-binding sequences (Figure 3, panel c). Minima in the ORChID2 pattern occur at nucleotide positions at which the minor groove is narrowest in the nucleosome structure (Supplementary Figure 5).
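The averaging step and a basic periodicity check can be sketched as below. This is an illustration only: the input is synthetic (a cosine with a 10-position period plus a per-sequence offset), and using the autocorrelation peak to read off the period is a choice made for this sketch, not a method described in the text.

```python
import math


def mean_pattern(patterns):
    """Average aligned, equal-length patterns position by position."""
    length = len(patterns[0])
    if any(len(p) != length for p in patterns):
        raise ValueError("all patterns must have the same length")
    return [sum(column) / len(patterns) for column in zip(*patterns)]


def autocorrelation(values, lag):
    """Normalized autocorrelation of a mean-centered series at one lag."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    denominator = sum(c * c for c in centered)
    numerator = sum(centered[i] * centered[i + lag]
                    for i in range(len(centered) - lag))
    return numerator / denominator


# Synthetic stand-in for ORChID2 patterns: period 10, three offsets.
patterns = [[math.cos(2 * math.pi * i / 10) + 0.1 * k for i in range(140)]
            for k in range(3)]
consensus = mean_pattern(patterns)

# The lag with the strongest autocorrelation estimates the period.
best_lag = max(range(2, 21), key=lambda lag: autocorrelation(consensus, lag))
print(best_lag)  # 10
```

With real data the individual patterns are noisy, so the composite pattern only becomes clearly periodic after many sequences are averaged.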
Figure 3 ORChID2 nucleosome patterns in yeast and fly. Mean ORChID2 values at each nucleotide position of 23,076 yeast (a) and 25,654 fly (b) nucleosome sequences (blue lines) are compared with ORChID2 values for shuffled versions of the same sequences (gray).
We performed a similar analysis for a dataset consisting of 25,654 sequences bound by nucleosomes in Drosophila.32 We found a very similar periodic ORChID2 pattern (Figure 3, panel b) and correlation of ORChID2 values with minor groove width (Figure 3, panel d). Despite the very different G/C contents of the Drosophila melanogaster and Saccharomyces cerevisiae nucleosome-bound sequences (Figure 3, panel e), the two ORChID2 patterns are highly similar to each other (Figure 3, panel f).
What is especially noteworthy about the distinctive ORChID2 patterns of nucleosome-binding sequences is that they reveal periodic structural features already present in naked genomic DNA, and that these features correspond to the more extreme structural deformations that DNA adopts when wrapped around the histone octamer.
Although the periodic ORChID2 nucleosome pattern (Figure 3, panels a and b) is not correlated with the pattern for shuffled sequences (Supplementary Figure 6), the pattern itself is weak. The range in ORChID2 values for the consensus nucleosome pattern is ~0.03, while the range for the Drew-Dickerson pattern is ~3. The weakness of the pattern is likely the consequence of averaging many thousands of ORChID2 patterns to give the consensus nucleosome-associated ORChID2 pattern. To investigate how ORChID2 patterns of individual nucleosome binding sites correspond to the consensus pattern, we scanned the consensus pattern across all 23,076 yeast nucleosome-bound sequences, and calculated the Pearson correlation for each sequence. As a control we scanned the consensus pattern across shuffled versions of the same sequences. We performed the same analysis for the set of 25,654 Drosophila sequences, and plotted the distribution of correlation scores for each (Supplementary Figure 7). Distributions of correlation scores for both real and shuffled sequences are shifted to the right of zero, suggesting that most nucleotide sequences have a positive correlation with the weak consensus pattern. However, the real genomic sequences are shifted significantly more to the right (p-value < 2.2 × 10⁻¹⁶) and have a longer right tail compared to the shuffled sequences, showing that they are more similar to the consensus. We speculate that genomic sequences on the far right of the distribution, for which the ORChID2 pattern most closely resembles the consensus periodic pattern, form stable and well-phased nucleosomes and therefore might serve as nucleation sites for nucleosomal arrays.
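The scanning step can be illustrated with a minimal sketch. Two details are assumptions not stated in the text: the consensus is slid along a longer per-sequence pattern one position at a time, and the best-scoring window is reported as the score for that sequence. The function names and numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined for a flat series; scoring it as 0.0
        # is an arbitrary choice made for this sketch.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_scan_correlation(consensus, pattern):
    """Slide the consensus along a pattern; return (offset, best score)."""
    width = len(consensus)
    if len(pattern) < width:
        raise ValueError("pattern is shorter than the consensus")
    scores = [pearson(consensus, pattern[start:start + width])
              for start in range(len(pattern) - width + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical consensus and a pattern that contains it at offset 1.
consensus = [0.0, 1.0, 0.0, -1.0]
pattern = [5.0, 5.0, 7.0, 5.0, 3.0, 5.0, 5.0]
offset, score = best_scan_correlation(consensus, pattern)
print(offset)
```

Because Pearson correlation ignores scale and baseline, a sequence can match the weak consensus closely even though the consensus spans a far smaller range of ORChID2 values than any individual pattern.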
While this paper was being revised, a method was published for computational prediction of minor groove electrostatic potential, using uranyl photocleavage of DNA as input data.35 Uranyl yields an essentially mirror-image cleavage pattern compared to hydroxyl radical. Uranyl, a positively charged ion, binds directly to the negatively charged phosphates in the DNA backbone, and cleaves most in regions where the minor groove is narrow.36 In contrast, the hydroxyl radical cleaves least where the minor groove is narrow (Figure 2, panel a). We used the authors' web server to predict the electrostatic potential of the Drew-Dickerson dodecamer based on uranyl cleavage data. When we compared the uranyl-based prediction with a Poisson-Boltzmann calculation of the electrostatic potential from eight X-ray structures of the dodecamer, we found a Pearson correlation of 0.79 (p-value = 6.11 × 10⁻³, 10 nucleotide positions), a result comparable to ours using ORChID2 (Figure 2).
Representing genome sequences as strings of letters obscures the structural biology of DNA that is the true basis for protein recognition. We have shown here that a chemical probe (the hydroxyl radical) can be used to link high-resolution three-dimensional structure with DNA shape and electrostatic potential variation at the scale of an entire genome. Our results open the way to applying the powerful idea of shape-directed DNA recognition4 to the analysis of genomes.