Genomic sequences that code for proteins are relatively well understood, but comprise only ~2% of the human genome (
1). Many functions are encoded in the remaining ~98% non-coding portion of the genome, but little is known about how functional non-coding information is specified (
2). It has been hypothesized that functional regions are likely to be evolutionarily constrained because of their importance to the organism (
3–
11). Nonetheless, evolutionary sequence-constraint algorithms fail to identify many non-coding functional elements (
12–
15). This may be because these methods only analyze the primary nucleotide sequence of a genome (i.e., the order of A’s, C’s, G’s, and T’s). However, DNA is a molecule with a three-dimensional structure that varies according to the nucleotide sequence (
16–
18).
We used a previously developed method based on the hydroxyl radical cleavage pattern of DNA (
19) to quantitatively evaluate how the structure of DNA varies throughout a genome. This approach can be used to predict the shape of the DNA backbone and grooves of genomic DNA at single-nucleotide resolution (
20). We call this pattern the structural profile of a DNA region. Structural profiles reveal that the relationship of DNA structure to the corresponding DNA sequence is not always simple. While similar sequences often adopt similar structures () divergent nucleotide sequences can have similar local structures () (
20). Conversely (albeit less frequently), similar sequences can adopt very different structures (). These observations indicate that DNA regions that differ on the basis of the order of nucleotides may be similar in structure, which suggests they may perform similar biological functions.
We quantified the effect of single-base substitutions on DNA structure by computing the structural profiles of all possible 11-base pair (bp) sequences (4,194,304 in total), measuring the similarity between profiles for all pairs of 11-bp sequences that differ only by a single substitution at the central nucleotide (
21). A histogram of structural differences for all 11-bp sequences reveals a range of effects from minor to drastic (). In some cases, the one-base change in an 11-bp sequence has a minor effect, but the sequence may have a dramatically different structure if another base is substituted.
Because current sequence-constraint algorithms do not compute the effect of a base change on DNA structure, we developed a computer program, Chai, that incorporates structural information. This method works in a manner analogous to the DNA sequence-based binomial conservation (binCons) algorithm (
8) but instead of computing the binomial probability of observed base substitutions between species, Chai calculates the difference between DNA structural profiles as a measure of similarity (
21). We used the binCons and Chai algorithms to analyze 30 Mb of high-quality, comparative sequence data for 36 different species [the ENCODE pilot project regions (
15,
22)]. We defined a false discovery rate (FDR) on the basis of a neutral (or null) alignment (
10,
22) in an identical manner for assessing both sequence- and structure-informed conservation, so that each type of conserved region was identified with equal statistical confidence. We found that at any given FDR, the structure-informed Chai algorithm identified more evolutionarily constrained bases compared to binCons (,
Table S1). For example, at a 5% FDR, binCons identified 6.7% of bases as constrained, corresponding with previous findings of sequence-based mammalian constraint (
8–
11,
15,
22,
23). Chai identified nearly twice as many bases as constrained (12%) () while identifying 89% of the regions identified with the binCons method.
We next determined the extent to which evolutionarily constrained regions identified by Chai and binCons harbor functional elements. We examined deoxyribonuclease I (DNase I)-hypersensitive sites (
15) and predicted transcriptional enhancers identified using chromatin modification patterns (
24) from published studies (
21) as examples of non-coding functional elements that frequently exhibit little sequence similarity. Compared to binCons regions, we found that Chai-detected regions overlap a higher proportion of DNase I-hypersensitive sites (78% for Chai versus 50% for binCons) and predicted enhancers (84% for Chai versus 59% for binCons) (). To test whether the increased correlation with functional sequences was due to the additional territory identified by Chai relative to binCons, we tuned each algorithm for equal base coverage as opposed to equal statistical confidence and found that Chai-detected regions still overlap a significantly greater number of functional non-coding elements compared to binCons-detected regions (p < 10
−12; Fisher’s exact test) (
Fig. S1, ,
Table S1).
Focusing our analysis on regions identified only by Chai and not by binCons (Chai-only regions) resulted in a statistically significant overrepresentation of noncoding functional sequences [p < 0.01, genome structure correction (GSC) statistic] [see (
21) for a description of the GSC statistic] in the Chai-only regions and a statistically significant underrepresentation (p < 10
−9; GSC statistic) of coding regions (). This suggests that some of the functional information in the non-coding portion of the genome is conferred by DNA structure as well as by the nucleotide sequence.
We thus examined if non-coding nucleotide substitutions inducing changes in DNA structure impact the biological function of a sequence. We focused on the DNA-binding properties of the Zif268 protein, a mammalian transcription factor that consists of three zinc fingers that wrap around DNA in the major groove (
25). Binding affinity data (
26) were compared with structural profiles for the 15 sites identified to bind wild type Zif268 and the Zif268 REDV mutant (
21). For both proteins the structural profiles of the high-affinity sequence motifs were similar, while low-affinity motifs had a different structural profile (
Fig. S2A,
Fig. S3A).
We ranked each Zif268 REDV motif by the extent of DNA structural difference relative to the best binding site, and found that binding affinity correlated with this ranking (r=0.75; p = 6.9 × 10
−4; t-test;
Fig. S2B). In other words, low-affinity binding sites differed dramatically in structure from high-affinity sites, while sites with intermediate affinity showed less structural difference. We observed a similarly high correlation for the wild type protein, which has different binding preferences (r = 0.74; p = 7.5 × 10
−4; t-test;
Fig. S3B). Analysis of DNA-binding site structural profiles and binding affinity for the archaeal transcriptional regulator Ss-LrpB revealed similar trends to those found for Zif268 (
Fig. S4).
We next tested whether mutations that change the molecular topography of non-coding genomic regions have phenotypic consequences, using data from the PhenCode Project (
27), which organizes information from several locus-specific mutation databases. We gathered 734 non-coding single-nucleotide variants in the human genome associated with a phenotype. For each variant, we calculated the difference in the structural profile between the mutant and non-mutant sequence over an 11-bp window centered on the variant nucleotide (similar to the analysis above). For comparison, a distribution of baseline variation in DNA topography was computed for 16,832 neutrally-evolving single-nucleotide polymorphisms (SNPs) (
21). The phenotype-associated distribution was significantly correlated with larger changes in structure (p<3 × 10
−4; Wilcoxon rank sum test) relative to the baseline distribution, and contained an additional high structure-change peak (). These results indicate that phenotype-associated mutations tend to induce larger changes in the structural profile of non-coding DNA than does baseline neutral variation.
We also identified phenotype-associated SNPs causing the top 10% structural changes and the lowest 10%. The phenotype-associated mutations with the highest changes in DNA structure occurred significantly more often (p<10
−2, Fisher’s exact test) in evolutionarily constrained regions of the genome (56% for high structure-change regions versus 29% for low structure-change regions) (
21). This suggests that non-coding DNA may be under selective constraint, which may prevent changes in DNA structure. Because the severity of structural change might help identify functional SNPs, we constructed a database of changes in the structural profile for all known SNPs in the human genome (
21).
Finally, we demonstrated that structural changes affect biological function in non-coding evolutionarily constrained regions identified by the Chai algorithm. We chose 12 predicted enhancer-containing regions (
24): 5 regions overlap elements only detected by the Chai algorithm; 7 regions overlap a combination of Chai- and binCons-detected elements (
Table S2). We cloned the 300 bp surrounding each of these genomic regions in a luciferase reporter construct and transfected them into 293T cells. Eight of the twelve constructs displayed luciferase activity that was significantly greater (p ≤ 0.05; Wilcoxon rank sum test) than that of random control sequences (). Three of the 5 constructs only overlapping Chai-detected elements were positive (
Table S2).
Given the plethora of regulatory functions that a genome encodes and the three-dimensional genomic architecture required to orchestrate these events (
28), it may not be surprising that there is widespread conservation of local DNA topography. Perhaps only a subset of local structural configurations can accommodate the functional requirements of a genomic locus [for example see (
29)]. Once the molecular topography of a locus is permissive to a regulatory function, this structure may be maintained within the genome. Our high-resolution topography-based constraint-detection method reveals that structure-informed constraint is widespread in the human genome and that these regions overlap known non-coding functional sites. Because different DNA sequences can have similar local structures (
20), these regions might escape detection with sequence-based conservation–identification methods.