While the three-letter genetic code that maps nucleotide sequence to protein sequence is well known, there must exist other codes that are embedded in the human genome. Recent work points to sequence-dependent variation in DNA shape as one mechanism by which regulatory and other information could be encoded in DNA. Recent advances include the discovery of shape-dependent recognition of DNA that depends on minor groove width and electrostatics, the existence of overlapping codes in protein-coding regions of the genome, and evolutionary selection for compensatory changes in nucleotide composition that facilitate nucleosome occupancy. It is becoming clear that DNA shape is important to biological function, and therefore will be subject to evolutionary constraint.
Elucidation of the genetic code undoubtedly is one of the most important discoveries of modern biology. The ubiquity, across all known life forms, of using a triplet DNA sequence lookup table to encode amino acids underscores the elegance of this finding.
Although the triplet genetic code is the most common lookup table used to decode genomic information, the possibility exists of additional codes in the genome, both within and outside coding regions (Figure 1). The flexibility in DNA sequence afforded by the degeneracy of the genetic code allows for additional information to be encoded within protein-coding sequences. Motivation for searching for codes outside protein-coding regions of the genome stems from the prescient suggestion that regulatory mutations would have a larger biological impact than would mutations in coding sequences.
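To make the idea of degeneracy concrete, the following minimal sketch counts how many distinct DNA sequences encode the same peptide under the standard genetic code. The function names and the example peptide are illustrative assumptions, not part of any study discussed here; the codon table is the standard one, written in its usual compact form.

```python
from itertools import product

# Standard genetic code in compact form (codons enumerated with bases in
# the order T, C, A, G; '*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(amino_acid):
    """Return every codon that encodes the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def count_synonymous_sequences(peptide):
    """Count the distinct DNA sequences that encode `peptide`.

    Hypothetical helper for illustration: the count is the product of the
    number of synonymous codons at each position.
    """
    total = 1
    for aa in peptide:
        codons = synonymous_codons(aa)
        if not codons:
            raise ValueError(f"unknown amino acid: {aa!r}")
        total *= len(codons)
    return total


if __name__ == "__main__":
    # Met has one codon; Leu, Ser and Arg have six each.
    print(count_synonymous_sequences("MLSR"))
```

Even a four-residue peptide can be written in hundreds of ways, which illustrates how much room a coding sequence leaves for a second, overlapping signal.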
In this review we discuss progress over the past two years in elucidating some of the many codes embodied in a genome, and how evolution may be sculpting these codes. Our approach to this question has its basis in the structural biology of DNA. While the elegance of the three-letter genetic code often has focused analysis of the human genome on the sequence of nucleotides, this approach neglects the molecular nature of DNA, the physical embodiment of the information that is encoded in the genome. Recent work has reminded us that subtle variations in DNA shape can be exploited by biological systems.
Since evolution is the guiding principle of biology, one might ask whether structural features of DNA are under evolutionary constraint. Recent work has demonstrated that substantially more territory in the human genome is under selection for maintaining DNA shape than for the exact sequence of nucleotides. This work showed that segments in the human genome that are DNA shape-constrained encompass a substantial fraction of experimentally determined functional regions (enhancers, deoxyribonuclease I hypersensitive sites, promoters, etc.), evidence that maintaining DNA shape is important to at least some aspects of genomic function.
Of the evolutionarily constrained nucleotides that have been identified in the human genome, about one-third occur in coding sequences. A large fraction of these constrained coding bases are likely the result of selective pressure at the level of protein structure and function. But a recent study shows that even in protein-coding sequences, overlapping codes exist, which implies that selection within these regions could operate through other means. In an impressive effort, Itzkovitz et al. analyzed over 600 different genomes from diverse phyla including viruses, bacteria, fungi, plants, and vertebrates. They found that coding regions encode additional information, including known alternate codes like bacterial translation initiation sites, as well as codes that are as yet unknown. A striking example of the existence of overlapping genomic codes is the presence of enhancer regions that have recently been identified within coding sequences [6-10].
An earlier study had found that the organization of the genetic code allows for superimposition of a DNA structural signal onto a protein-coding sequence via amino acid substitution. How might this occur? In the DNA double helix, backbone atoms that are in closest proximity across the minor groove, and therefore influence DNA shape, are separated by three nucleotides on the complementary strands (Figure 2). Accordingly, positions offset by three nucleotides in adjacent codons are ideal places for selective constraint to act on DNA shape in coding regions.
The majority of evolutionarily constrained bases in the human genome, totaling two-thirds of all constrained positions, reside outside of coding sequences. Selective pressures imposed upon non-coding regions likely differ from selection that operates on coding sequences, which are subject to the strict rules of the genetic code. In support of this idea, a recent study found that broadly expressed genes have highly constrained protein sequences, but relatively plastic regulatory sequences. Another study found differential constraint patterns, including DNA shape-based constraint, operating on the non-coding and coding regions of a recently duplicated gene pair. Additional evidence that different mechanisms of evolutionary selection operate in non-coding regions comes from observations that substitution biases depend on local nucleotide context and proximity to genes, and that positively selected human-specific insertions/deletions (indels) are enriched in non-coding regions near genes.
Because protein-DNA interactions are so diverse and use a wide variety of recognition mechanisms, we are not likely to find a universal code that explains functional non-coding sequences. Further, functional sequences tend to turn over frequently, rendering identification across species using comparative methods a difficult task. In fact, enhancers can evolve beyond recognizable sequence similarity and still retain function. Given that different DNA sequences can encode similar shapes [3,20,21] (Figure 3), DNA shape-based functional equivalence becomes an interesting concept to investigate, in parallel with traditional investigations of functional equivalence based on nucleotide sequence identity.
How might the molecular nature of DNA be used to encode information in a genome? Proteins are known to exploit nuances in DNA shape for recognition. A recent review covered how progress in computational methods, particularly molecular simulation, has improved our understanding of how DNA shape depends on sequence. A new and remarkably widespread mechanism for shape-specific recognition of DNA was discovered by comprehensive data-mining of three-dimensional protein-DNA structures. Many DNA-binding proteins (including the histones that make up the nucleosome core particle) were found to insert positively charged arginine side chains into especially narrow segments of the DNA minor groove, thereby exploiting the enhanced negative electrostatic potential of a narrow minor groove for shape-dependent recognition. Another study found an alternative mechanism for creating a narrow minor groove through the use of Hoogsteen base pairing within a canonical B-DNA helix, and demonstrated the importance of DNA shape for p53 binding. The concepts presented in these recent findings reveal a clearer structural picture of how DNA shape can specify important genomic signals.
Although the bacterial gene architectural protein Fis is considered to be a protein that binds to DNA with little sequence preference, there are some sequences to which it binds with sub-nanomolar affinity. The authors of a recent study investigated the basis for Fis binding selectivity by solving X-ray crystal structures of Fis complexed with 11 different DNA sequences. The authors concluded that Fis initially selects binding targets that have a narrow DNA minor groove at the center of the binding site, which allows the helix-turn-helix units of Fis to bind to the adjacent segments of the major groove. The ultimate stability of the Fis complex is governed by the ability of a DNA binding site to bend, which depends on the location of pyrimidine-purine (YR) steps in the site.
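Locating YR steps in a candidate binding site is a simple scan over adjacent base pairs. The sketch below is only an illustration of that bookkeeping: the function name and the example sequence are hypothetical, and it does not attempt to model bending energetics or reproduce the analysis of the Fis study.

```python
# Pyrimidines (Y) are C and T; purines (R) are A and G.
PYRIMIDINES = set("CT")
PURINES = set("AG")


def yr_step_positions(seq):
    """Return 0-based indices i where seq[i] is a pyrimidine and seq[i+1]
    is a purine, i.e. the start of each YR dinucleotide step on the strand
    as written."""
    seq = seq.upper()
    return [
        i
        for i in range(len(seq) - 1)
        if seq[i] in PYRIMIDINES and seq[i + 1] in PURINES
    ]


if __name__ == "__main__":
    # Hypothetical 7-nt sequence containing the steps TA, CG and CA.
    print(yr_step_positions("GTACGCA"))
```

Because a YR step read on one strand is also a YR step read 5' to 3' on the complementary strand, scanning a single strand is sufficient to find them all.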
A recent paper compared the structures of the DNA in three protein-DNA complexes with the X-ray structures of the naked DNA molecules. The authors found that structural nuances observed in protein-bound DNA also existed in the unbound DNA target, suggesting how recognition might occur. This paper demonstrates the power of having detailed structural information for naked DNA molecules to enable the elucidation of the pathway for recognition by a DNA binding protein.
Wrapping eukaryotic genomic DNA around histone octamers to form an array of nucleosome core particles influences the functions encoded in the underlying sequence. The proteins that comprise the nucleosome core particle are among the most highly constrained, an observation that underscores their biological significance. Recent improvements in high-throughput genome profiling methods have yielded high-resolution nucleosome occupancy maps for a number of species [27-29]. Examination of nucleosome-bound sequences has led to the conclusion that nucleosome positioning is sequence-directed [30-32], yet the histone octamer makes no base-specific contacts with DNA. This scenario leads to the intriguing hypothesis that DNA structure, and not sequence per se, can direct nucleosome positioning.
Properties of DNA that influence nucleosome positioning can be segregated into two general categories: those that are conducive to nucleosome formation, and those that exclude nucleosomes. A recent study performed a statistical analysis of sequence features that are predictive of nucleosome occupancy, and concluded that G+C content is the dominant feature. While this is an informative conclusion, the DNA structural property of minor groove width also was found to be important and, unsurprisingly, G+C content generally correlates with many other DNA structural features. Consistent with this finding is extensive evidence that long A-tracts (stretches of consecutive deoxyadenosine nucleotides on one strand of the double helix) strongly influence nucleosome organization. A-tracts are enriched in eukaryotic genomes, and have unique structural and mechanical properties that likely resist the DNA structural deformation required for nucleosome formation. Systematic mutagenesis and subsequent functional analysis of short A/T-rich sequences found that they can act as core promoter elements. To explain this finding, the authors proposed complementary and redundant mechanisms of nucleosome exclusion by A/T-rich sequences, and binding site recognition by TFIID.
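The two sequence features highlighted above, G+C content and A-tracts, are easy to compute directly from a sequence. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the minimum tract length of 5 is an arbitrary illustrative threshold rather than a value taken from the studies cited, and runs of T are reported as well because they correspond to an A-tract on the complementary strand.

```python
import re


def gc_content(seq):
    """Fraction of G or C bases in `seq` (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def find_a_tracts(seq, min_len=5):
    """Return (start, end, base) for each run of at least `min_len`
    consecutive A's, or consecutive T's (an A-tract on the opposite
    strand). Coordinates are 0-based and half-open; `min_len` is an
    assumed, illustrative threshold."""
    seq = seq.upper()
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end(), m.group()[0]) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    # Hypothetical sequence: a 6-nt A run and a 7-nt T run between GC pairs.
    example = "GC" + "A" * 6 + "GC" + "T" * 7 + "CG"
    print(round(gc_content(example), 3))
    print(find_a_tracts(example, min_len=6))
```

In practice such features would be computed in sliding windows along a genome and compared against measured nucleosome occupancy; the sketch shows only the per-sequence calculation.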
Other efforts to explain nucleosome positioning focused more on physical properties of DNA. For example, nucleosome occupancy in yeast and fly can be predicted using only DNA flexibility and curvature. Another group developed the Repositioned Mutation (RM) test, which is an elegant algorithm designed to detect evolutionary selection for nucleosome positioning by comparing patterns at orthologous loci (originally proposed in earlier work). Implementation of the RM test on the yeast genome revealed that the biophysical property of nucleosomal deformation energy is preserved across species so as to maintain chromatin organization in non-coding regions. Together, these results suggest that physical properties of DNA are crucial for chromatin organization.
Two recent studies that used different methods to carefully analyze nucleotide substitution patterns in the yeast genome are particularly insightful about nucleosome positioning codes and the selective pressures that can act upon them. In the first study, the authors performed a thorough analysis of substitution patterns overlaid on high-resolution nucleosome positioning data. Knowing that G+C content and A-tracts influence nucleosome positioning (see above), the authors focused on substitution patterns that would affect these signals in regions that positioning data show are important (for example, well-defined nucleosomes or nucleosome-depleted regions). Remarkably, they found regionally linked compensatory substitutions that serve to maintain nucleosome-positioning dynamics. They conclude that local sequence composition is influenced by nucleosome organization. The other study also looked at substitution patterns, but took a different approach in which the authors measured the effects of substitutions on a structurally based model of nucleosomal deformation energy. They observed a strong anti-correlation between substitution frequency and the DNA structure-based energetics of nucleosome formation. Together, these studies demonstrate the functional conservation of chromatin organization through natural selection operating on DNA shape-based signals.
Selection for DNA structural features that maintain nucleosomal positioning signals could result in a large fraction of the genome being under structural constraint. This could be considered a kind of low-level and pervasive form of DNA structural selection, whereas DNA structural selection acting on transcription factor binding sites would likely be less pervasive and, possibly, more intense. It is interesting to note that a DNA structure-based code for nucleosome positioning has been found in protein-coding regions, indicating the compatible superimposition of the genetic code and a nucleosome positioning code. An intriguing possibility is that variations in local DNA shape that are encoded along genomic sequences could have a profound impact on chromatin organization, and therefore the evolution of regulatory systems.
In the rapidly approaching age of personalized genomics and genomic medicine, the ability to interpret non-coding variation will be critical [43,44]. This is made clear by a recent meta-analysis of 465 unique human trait-associated single nucleotide polymorphisms (SNPs) that were identified across a series of genome-wide association studies (GWAS), which showed that 89% of the variants occur in non-coding regions. This work suggests that sequence differences in regulatory regions of the genome may capture more trait-associated variation than do differences in coding sequences. As a specific example, a recent study discovered differences in allelic enhancers at several type 2 diabetes-associated loci through comprehensive identification of regulatory regions in human pancreatic islet cells. Another study used DNA shape as a guide to find a functional non-coding variant. Based on these early results, it is clear that analyses of DNA shape will contribute to the pressing endeavor of interpreting non-coding variations and assessing their effect on disease.
In this brief review we have discussed how the shape and physical properties of DNA can influence biological function. The existence of such phenomena suggests that functional genomic codes can utilize DNA shape in addition to nucleotide sequence. The implication is that DNA shape can be a substrate for selective evolutionary pressure.
A recent example points the way to new DNA structure-based biological phenomena. High-throughput sequencing was used to determine the spectrum of mutations that occurred after treatment of DNA with a mutagen. This experimental approach allows for exhaustive characterization of mutation frequency, in contrast to standard methods that rely on phenotypic change. The frequency of mutation at a given position was found to vary depending on the identities of the nucleotides neighboring the site of mutation, beyond nearest-neighbors, an unexpected result. The authors used their results to advance the idea that a genotype itself has a phenotype, which is expressed only when the genotype is embodied as a molecular entity, DNA.
The flood of genome-wide non-coding functional data that are emerging from the ENCODE and modENCODE [49,50] Projects only whets our appetite for similar data from other organisms in the tree of life. Such data will allow us to directly test the correspondence of nucleotide sequence conservation with conservation of function, and so perhaps detect the presence of new kinds of genomic signals. A pioneering effort in this realm used chromatin immunoprecipitation combined with high-throughput sequencing to compare maps of the genome-wide occupancy of two transcription factors in the livers of five vertebrates.
Multi-species genome alignments are the foundation of comparative genomics, but current methods align genomes based strictly on nucleotide sequence. New methods are emerging [53,54] that can incorporate other information besides sequence (including DNA shape) to drive multi-species alignments. These and other advances will give us new ways to interpret the complex and overlapping codes that are hidden in a genome.
We thank Adam Woolfe for help in identifying papers reporting exonic enhancers, and Jason A Greenbaum for permission to use Figure 2. This work was supported by a grant to TDT from the National Human Genome Research Institute of the National Institutes of Health (R01 HG003541). SCJP was supported by the Intramural Research Program of the NHGRI, NIH.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.