Repetitive DNA motifs are abundant in the genomes of various species and have the capacity to adopt non-canonical (i.e. non-B) DNA structures. Several non-B DNA structures, including cruciforms, slipped structures, triplexes, G-quadruplexes, and Z-DNA, have been shown to cause mutations, such as deletions, expansions, and translocations in both prokaryotes and eukaryotes. Their distributions in genomes are not random and often co-localize with sites of chromosomal breakage associated with genetic diseases. Current genome-wide sequence analyses suggest that the genomic instabilities induced by non-B DNA structure-forming sequences not only result in predisposition to disease, but also contribute to rapid evolutionary changes, particularly in genes associated with development and regulatory functions. In this review, we describe the occurrence of non-B DNA-forming sequences in various species, the classes of genes enriched in non-B DNA-forming sequences, and recent mechanistic studies on DNA structure-induced genomic instability to highlight their importance in genomes.
The canonical right-handed double helical structure of B-form DNA has had a profound influence over studies designed to determine the function of DNA. However, many alternative DNA structures (reviewed in ) have been known to exist since the late 1950s and their roles in biological functions have begun to be elucidated, with substantial progress over the past decade. In 1957, sedimentation coefficient and optical absorption measurements revealed the association of ribonucleic poly-A and poly-U polymers into three-stranded complexes. A d(CpGpCpGpCpG) DNA fragment was crystallized in 1979, revealing a left-handed conformation (Z-DNA) with altered helical parameters relative to the right-handed B-form. Soon after, cruciform structures formed by inverted repeats were identified by S1 nuclease probing [5,6] and by two-dimensional gel electrophoresis. During this same period, parallel four-stranded complexes (tetraplex or G-quadruplex DNA) were found to be formed by guanine-rich DNA sequences. To date, biophysical and biochemical studies have established the existence of more than 10 different DNA conformations, and more are likely to be identified.
Non-B DNA-forming sequences in genomes affect DNA replication and transcription, and contribute to genome instability [9–12]. In 1984, Glickman and Ripley reported the induction of deletions in the lacI gene of Escherichia coli by putative cruciform structures. More recently, studies in model systems (bacteria, yeast, and mammalian cell culture) on trinucleotide repeat sequences, whose expansions in disease-related genes are involved in approximately 30 human hereditary neurological disorders (reviewed in ), support the mutagenic role of non-B DNA structures [15–17]. Similarly, DNA sequences from the human c-MYC and BCL-2 loci that are capable of forming non-canonical structures and that colocalize with translocation breakpoints undergo frequent double-strand breaks (DSBs) in mammalian cells [12,18–22]. In support of these results, the same non-B DNA structure-forming sequence from the human c-MYC gene also stimulates genomic instability on chromosomes in transgenic mice. Large (A+T)-rich inverted repeats on chromosome 22q11 and other chromosomes, such as 11q23 and 17q11, were found to cause recurrent translocations both in sperm cells in the general population and in cell culture, providing evidence that cruciform structures may cause genomic rearrangements (reviewed in [26–28]). Thus, alternative DNA conformations are believed to contribute to mutations and to the dysregulation of cancer-related genes in translocation-related malignant diseases such as myeloma, leukemia, and lymphoma (reviewed in [11,29]).
Recent advances in the field of genomics have revealed the widespread occurrence of non-B DNA-forming motifs in various genomes, their selective enrichment within specific classes of genes and/or chromosomes, and their asymmetric frequency distributions within transcriptional units. These data are paradoxical given the mutagenic role of repeating sequences and their involvement in human disease. Herein, we describe the structural features and biological functions (and potential mechanisms involved) of the most well-characterized DNA structures, i.e., Z-DNA, cruciforms, triplexes or H-DNA, G-quadruplexes, and looped-out slipped structures. We also provide evidence for novel roles of non-B DNA structure-forming sequences as co-regulators of transcriptional activity and as genomic elements through which positive selective pressures have acted during evolutionary time so as to shape and preserve specific genomic functions.
The distribution of nucleotides in genomes is not random. Many DNA sequence patterns exist throughout genomes from bacteria to human, such as direct repeats of homo-, di-, or tri-nucleotides, inverted repeats, mirror repeats, etc. Unlike the majority of DNA sequences which form the canonical right-handed B-form , repeated sequences have the capacity to also adopt alternative conformations (i.e. non-B DNA structures). To date, nearly a dozen types of non-B DNA structures have been described, including hairpins/cruciforms, Z-DNA, triplexes (H-DNA), tetraplexes, slipped DNA, and sticky DNA.
Hairpin/cruciform structures can form at inverted repeats . One side of an inverted repeat, equidistant to the symmetric center, is complementary to the sequence on the other side, e.g. 5'-GACTGC….GCAGTC-3' (Fig. 1A). The two inverted repeats base pair with one another and form an intrastrand hairpin stem, leaving the sequence at the symmetric center looped out as a single strand. The cruciform structure consists of two hairpin-loop arms and a 4-way junction, which is structurally similar to a Holliday junction recombination intermediate . Formation of hairpin/cruciform structures from double-stranded DNA requires energy that may come from negative supercoiling [32,33]. Under these conditions, two inverted repeats as short as 7 bp are sufficient for the formation of hairpin structures .
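The sequence symmetry described above can be illustrated with a short Python sketch that scans one strand for an arm followed, after a loop of limited size, by its reverse complement. The function names, the fixed 7-nt arm (taken from the minimum length quoted above), and the 20-nt loop limit are illustrative assumptions rather than a published search criterion, and a hit only indicates sequence potential, since extrusion depends on supercoiling.

```python
# Minimal sketch: locate short inverted repeats (potential hairpin/cruciform
# sites) on one strand. Arm and loop limits are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_inverted_repeats(seq, arm=7, max_loop=20):
    """Return (start, arm_length, loop_length) for every arm-length window
    whose reverse complement occurs downstream within max_loop nucleotides."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm + 1):
        target = reverse_complement(seq[start:start + arm])
        for loop in range(max_loop + 1):
            right = start + arm + loop
            if right + arm > len(seq):
                break
            if seq[right:right + arm] == target:
                hits.append((start, arm, loop))
    return hits


# Hypothetical test sequence: a 7-nt arm, a 4-nt loop, and the complementary arm.
print(find_inverted_repeats("TTGACTGCAAAAATGCAGTCTT"))  # [(2, 7, 4)]
```

Because the arm length is fixed, a longer inverted repeat is reported as several overlapping hits; merging them into maximal arms is left out to keep the sketch short.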
Sequences with alternating pyrimidines and purines, such as (CG:CG)n and (CA:TG)n, may wind the double helix into a left-handed zigzag pattern (Z-DNA), as depicted in Fig. 1B. Whereas CG:CG repeats are most likely to form the Z-DNA structure, GT:AC repeats are more abundant in the human genome . Compared to the right-handed B-form, the left-handed Z-DNA contains inverted purines in the syn-conformation while pyrimidines remain in the anti-conformation with the sugar-pucker altered from the C2- to the C3-endo position so as to maintain the Watson-Crick base-pairing . These alterations cause a change in the sugar-phosphate backbone that changes the organization of the double helix. Therefore, unlike B-form DNA, which possesses one major groove and one minor groove, Z-DNA has only one deep and narrow groove with 12 bp per helical turn [4,37,38]. The crystal structure of a B- to Z-DNA junction was solved in 2005 and revealed an extruded base pair on each side of the DNA duplex, which is susceptible to DNA modification .
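As a simple illustration of the alternating pyrimidine-purine pattern described above, the following Python sketch reports maximal tracts in which each base belongs to the opposite class (purine or pyrimidine) from its neighbor. The function name and the 12-nt minimum length are assumptions made for the example, and the input is assumed to contain only A, C, G, and T. Alternation alone does not measure Z-DNA-forming propensity, which depends on sequence composition and supercoiling.

```python
# Minimal sketch: find maximal alternating purine/pyrimidine tracts.
# The 12-nt threshold is an illustrative assumption, not a published cutoff.
PURINES = set("AG")


def alternating_tracts(seq, min_len=12):
    """Return (start, tract) for each maximal run of strictly alternating
    purines and pyrimidines that is at least min_len nucleotides long."""
    seq = seq.upper()
    tracts = []
    start = 0
    for i in range(1, len(seq) + 1):
        # A run ends at the end of the sequence or where two neighboring
        # bases fall in the same class (both purines or both pyrimidines).
        if i == len(seq) or (seq[i] in PURINES) == (seq[i - 1] in PURINES):
            if i - start >= min_len:
                tracts.append((start, seq[start:i]))
            start = i
    return tracts


# Hypothetical example: a (CG)6 tract flanked by non-alternating bases.
print(alternating_tracts("TTTT" + "CG" * 6 + "AAAA"))  # [(4, 'CGCGCGCGCGCG')]
```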
Intramolecular triplex DNA structures can form at homopurine:homopyrimidine sequences with mirror symmetry, where a single-stranded region can bind in the major groove of the underlying DNA duplex to form a three-stranded helix [40–42] (Fig. 1C). Triplex DNA can be classified according to the orientation and composition of the third strand, which can form either Hoogsteen or reverse-Hoogsteen hydrogen bonds with the purine-rich strand of the duplex DNA. Hence, the third strand can be either pyrimidine-rich and parallel to the complementary strand (Y*R:Y), or purine-rich and anti-parallel to the complementary strand (R*R:Y). Whereas (R*R:Y) triplexes form under conditions of physiological pH, triplex structures of the (Y*R:Y) composition form most readily under conditions of acidic pH. At physiological pH, triplex structures may be stabilized by negative supercoiling, modification with phosphorothioate groups, or polyvalent cations such as spermine and spermidine (reviewed in ).
G-quadruplex (tetraplex) DNA is a four-stranded structure built from square co-planar arrays of four guanines formed by a stretch of guanine-rich DNA (Fig. 1D). Each guanine acts as a donor and acceptor of Hoogsteen hydrogen bonds in a cyclic arrangement involving N-1, N-2, O-6, and N-7. In vitro, these structures are stabilized by K+ or Na+ ions at physiological pH and temperature. Quadruplex structures may be formed by one, two, or four interacting strands and exist in a variety of conformations depending on the polarity of the strands (parallel or anti-parallel), glycosidic torsion angles, groove size, base sequence of the connecting loops, and the participation of cations.
When direct repeats are base paired with the complementary strand in a misaligned fashion, a slipped structure forms, particularly following unwinding, yielding hairpins or looped-out bases  (Fig. 1E). When direct repeats involve several units, like the telomeric or triplet repeat sequences (CGG, CTG, and CAG), the looped-out bases may form duplexes stabilized by interstrand stacking interactions .
The formation of DNA secondary structures in vitro has been demonstrated by several methods, including polyacrylamide gel electrophoresis, nuclease cleavage, chemical probing, circular dichroism, NMR, ultraviolet absorption, electron microscopy, atomic force microscopy, and crystallography (reviewed in ).
In vivo, non-B DNA conformations are believed to form, at least transiently, during DNA metabolic processes such as replication, transcription, repair, or recombination [9,11]. The expansion of slipped DNA-forming trinucleotide repeats observed in neurological diseases  correlates with the stability of secondary structures in vitro. For example, interruptions in the trinucleotide repeats of the SCA1 (CAG:CTG) and FRAXA (CCG:CGG) genes exert a protective role against instability [47–49]. These interruptions reduce the propensity of DNA secondary structure formation in vitro and the correlation between rates of expansion in individuals and slipped DNA formation has been taken to support a role for slipped DNA in genetic instability. Nevertheless, the transient existence of these, and other non-B DNA structures, has made their detection difficult in genomic DNA , particularly in cases such as inverted repeats, in which multiple conformations are possible, depending upon the environmental conditions .
To date, fluorescence immunostaining with antibodies that recognize specific DNA structures, rather than the sequences per se, is considered the most direct method for detecting non-B DNA structures in vivo. Rabbit antibodies specific for the Z-DNA structure formed by brominated poly[d(GC)]:poly[d(GC)] were generated in 1981, and used to bind the interband regions of Drosophila polytene chromosomes and to detect Z-DNA formed by GT repeats in negatively supercoiled plasmids in vitro. Currently, antibodies against Z-DNA are commercially available (Abcam, GeneTex, etc.). One caveat of this methodology is that estimates of non-B DNA structure formation in vivo may not reflect the physiological equilibrium, since binding of Z-DNA antibodies may shift the B- to Z-DNA equilibrium towards the non-B conformation.
Several mouse monoclonal antibodies were developed to detect triplex DNA in chromosomes [56,57], and were demonstrated to bind triplex DNA specifically . H-DNA structures in human interphase nuclei were also detected by fluorescently labeled single-stranded DNA oligonucleotides (complementary to the single-stranded region of the H-DNA structure) in vivo . A quadruplex monoclonal antibody was first developed in 1998 [60,61] in mice against the quadruplexes formed by synthetic d(CGCG4GCG) and the telomere-derived d(TG4) and d(T2G4)4 sequences in vitro. Later, a high affinity (Kd ~ 4 nM) antibody against tetraplex DNA structures was developed and used in in vivo studies of telomeric tetraplex structures in the macronucleus of Stylonychia lemnae . Finally, a monoclonal antibody was developed to recognize cruciform and T-shaped DNA structures .
Additional evidence for the existence of non-B DNA structures in vivo has been generated using methods such as chemical probing and DNA cross-linking of genomic DNA sequences [64,65]. However, most of these methods require DNA extraction (before and/or after treatment) for analyses, as it has proven difficult to directly detect these structures in living cells. In addition to technical challenges associated with the detection of non-B DNA, these structures are certainly transient in nature in cells, making their detection even more challenging.
Since the abundance and distribution of non-B DNA-forming sequences may provide insights into their functions in DNA metabolism, analyses were carried out to compare the abundance of these structures in the genomes of various organisms. Overall, non-B DNA-forming sequences are more abundant in eukaryotic genomes than in prokaryotes .
Analysis of human sequences containing 157 genes, for a total of 1 Mb of genomic sequence (including exons, introns, and 5’- or 3’-UTRs), revealed many dA:dT sequences, which may form cruciforms. In this sample set, the overall dA:dT abundance was ~49.7%, and the frequency of cruciform-forming sequences (≥8 bp (A+T)-rich inverted repeats) in the human genome was ~1/41,700 bp. Additional analyses of genomic sequences in E. coli and yeast revealed that cruciform-forming sequences were more abundant in yeast (1/19,700 bp) and human than in E. coli. The distributions of hairpin/cruciform structure-forming sequences often overlap with chromosomal regions prone to gross rearrangements in both somatic and germ cells [67–69].
Although the human genome is less (G+C)-rich than prokaryotic genomes, Z-DNA-forming sequences are in fact very abundant. The GT:AC repeats are estimated to account for more than 0.25% of the entire human genome. A computer-based thermodynamic search strategy (Z-Hunt-II) used by the Ho group to analyze the complete human genome showed that Z-DNA-forming sequences occur approximately once every 3,000 bp. Furthermore, Z-DNA-forming regions were found to be distinctly located near the 5’ ends of genes in the genome, and the proximity between these regions and the transcription start sites became more pronounced during the divergence from prokaryotes to eukaryotes. Therefore, the location bias of these GT:AC repeats supports the view that Z-DNA is formed and stabilized by the transient surges in negative supercoiling associated with transcription. As early as 1983, Nordheim and Rich suggested that three 8-bp Z-DNA-forming sequences in the simian virus 40 enhancer region may function in transcriptional activation. Studies in yeast showed that Z-DNA structures can be induced or stabilized by Z-DNA-binding proteins and function in gene regulation and chromatin remodeling [72,73]. The occurrence of Z-DNA-forming sequences at chromosomal breakpoints in human tumors suggests that Z-DNA plays a role in causing genomic instability, perhaps by inducing double-strand breaks and large deletions.
H-DNA-forming sequences occur at higher levels than expected in mammalian genomes. Using the same 1 Mb sequence sample set from the human genome as in the study of hairpin/cruciform structure-forming sequences, Schroth and Ho found that the occurrence of H-DNA sequences (≥10 bp 100% homopurine:homopyrimidines but <80% (A+T)-rich) in the human genome was ~1/49,400 bp. Long (≥100 bp) homopurine:homopyrimidine sequences in human genes were confined to the introns of genes whose products localize to the cell membrane or function in phosphorylation, signal transduction, development, and morphogenesis. H-DNA structure-forming sequences are also found flanking proto-oncogenes, such as c-MYC, and may cause genomic instability, such as deletions and other rearrangements [12,23].
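The search criteria quoted above (tracts of ≥10 bp that are 100% homopurine:homopyrimidine but <80% (A+T)-rich) can be sketched with a regular expression in Python. The function name and return format are assumptions made for illustration, and this is not the original authors' program. The sketch does not test for the mirror symmetry needed to fold into intramolecular H-DNA, so its hits are only candidate tracts.

```python
import re


def find_ry_tracts(seq, min_len=10, max_at_fraction=0.8):
    """Return (start, tract) for uninterrupted purine-only or pyrimidine-only
    runs of at least min_len nt whose A+T fraction is below max_at_fraction.
    A purine run on one strand is a pyrimidine run on the other, so scanning
    a single strand for both classes covers the duplex."""
    pattern = re.compile(rf"[AG]{{{min_len},}}|[CT]{{{min_len},}}")
    hits = []
    for match in pattern.finditer(seq.upper()):
        tract = match.group()
        at_fraction = sum(base in "AT" for base in tract) / len(tract)
        if at_fraction < max_at_fraction:
            hits.append((match.start(), tract))
    return hits


# Hypothetical example: a mixed GA tract passes; a poly(A) tract is rejected
# because it is 100% (A+T).
example = "CC" + "GAGAGGAGAG" + "CC" + "A" * 12 + "C"
print(find_ry_tracts(example))  # [(2, 'GAGAGGAGAG')]
```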
Two independent genome-wide surveys for potential intramolecular G-quadruplex-forming sequences identified ~37,000 sites in the human genome, approximately 1 tetraplex every 10 kb [75,76], with ~60% of them located outside of coding regions . Tetraplex-forming guanine-rich sequences are found in immunoglobulin switch regions , telomeric DNA [77,78], poly (dG) runs , and promoter regions . An analysis of promoter regions of 19,268 validated human genes in ENSEMBL (NCBI 34) showed that ~42.7% of human gene promoters contain at least one quadruplex-forming sequence . Du et al (2007, 2008) analyzed 13,276 human Reference Sequence (RefSeq) genes and 2,892 chicken RefSeq genes for potential G-quadruplex-forming sequences and identified one or more G4 DNA motifs in >60% of the genes studied [81,82]. The distribution of the more stable form of G-tetraplex, which contains single-nucleotide loops, is more abundant near transcription start sites, suggesting that this stable secondary structure may have been under positive selection to influence the transcription of particular groups of genes . In addition, a high proportion of genes also contain G4 motifs in 3’-UTRs, implying a role in facilitating transcriptional termination, perhaps by weakening the association of an RNA polymerase complex with template DNA . Therefore, the distribution of G-rich sequences in genomes supports their involvement in the regulation of transcription, in addition to other roles, such as homologous recombination [8,84] and telomere maintenance .
Repetitive DNA sequences account for nearly 30% of the human genome, and are interspersed throughout chromosomes [85,86]. These repeats are referred to as microsatellites (1–7 nt) or minisatellites (10–100 nt). Various human diseases have been demonstrated to be associated with either expansion or contraction of microsatellites and minisatellites [48,87]. Although microsatellites are abundant in the human genome, their representation varies greatly depending on sequence composition. For example, whereas >16,000 tracts comprised of A or T mononucleotide runs were present in the hg16 assembly at length ≥30 nt, only seven analogous tracts of Gs and Cs were found. Closer examination of the physical properties of tri- and tetra-nucleotide repeats revealed an inverse relationship between their number in vertebrate genomes and their propensity to fold into hairpin or quadruplex structures. These data suggest that sequences with the propensity to form stable secondary structures have not been maintained as efficiently as their less stable counterparts during evolutionary time. Nevertheless, a comparison of the distribution of these tri- and tetra-nucleotide sequences in protein coding vs. non-coding regions revealed that the number of certain “strong secondary structure-forming” sequences, such as AGC, CCG, CCCG, AGCG, CCGG and ACCG, was higher than expected in coding regions, supporting the idea that selective pressures acted so as to preserve the amino acid coding ability of these inherently unstable sequences.
It is important to point out that not all the repeated sequences analyzed to date have the same capacity to form non-B DNA structures. The search criteria used in different reports were set to answer different questions. For example, the Ho group cautioned that although (G+C)-rich sequences are abundant in E. coli, not all of them meet the requirement for forming stable secondary structures. Rather, these (G+C)-rich repeats in bacteria are mostly recognized as transcription termination sequences when transcribed into RNA. Also, the most abundant tetraplex-forming G-rich sequences in the human genome analyzed by Huppert and Balasubramanian (2005) are located on the coding strand and therefore may fold into alternative structures in the RNA transcripts rather than in genomic DNA. Therefore, all repeat-based analyses should be interpreted with the realization that some of these ‘unusual’ sequences may not form ‘unusual’ DNA structures.
The completion of the Human Genome Project (HGP) [35,90] has made it possible to address the question of the distribution of non-B DNA-forming sequences in relation to transcribed DNA. More than 99% of euchromatic DNA, which contains genes and putative genes, is currently assembled. The remaining 0.5–1% of gapped DNA (~24 Mb) mostly contains segmental duplications, i.e. nearly identical sequences present at different chromosomal locations , for which clones are available to enable covering. Hence, the data summarized below is expected to capture most of the global genomic organization of genes in relation to non-B DNA-forming sequences. One notable exception is represented by the 18S- and 28S-ribosomal RNA gene arrays in acrocentric chromosomes, which like centromeric, pericentromeric and subtelomeric heterochromatin, are not targeted for sequencing. Indeed, few clones are available for such recalcitrant regions. Heterochromatin, which amounts to ~5–7% (~200 Mb) , is almost entirely populated by tandem repeats and shows limited transcriptional activity.
The first genome-wide search for inverted repeats (IRs) in the human genome revealed the prevalence of large IRs (96 with arm size ≥8 kb and ≥95% sequence identity) on the X (~25%) and Y (~15%) chromosomes . Of the 49 IRs whose arms shared >97–99% sequence identity, eleven from chromosome X, six from chromosome Y, and one from chromosome 11 contained genes/gene clusters predominantly expressed in the testis (Table 1). Indeed, all annotated genes present on the IRs from chromosome Y display testis-restricted expression and have a function in sperm production and maturation .
A subsequent search for the distribution of long (≥100 and ≥250 nt) R:Y tracts within human genes indicated the presence of such sequences in the introns of 1,951 and 228 non-redundant transcriptional units, respectively. Strong enrichment (P-values as low as 10⁻¹⁵) was observed for sequences in genes encoding proteins with ion channel activity, cell adhesion, and cell-cell communication functions, particularly in subcellular structures, such as the post-synaptic density, critical to the transmission of nerve impulses (Table 1).
Herein, we report the analysis of the distribution of tetranucleotide repeat (TR) sequences ≥8 units  in human genes. Of the 29,708 TR tracts found genome-wide , 8,943 (~1/3) were located in 4,182 non-redundant RefSeq genes (~1/5 of all annotated genes), or within 1 kb of their transcriptional boundaries, with an average of 2 TR tracts per gene. Also, 114 genes were found to contain the repeats in the promoter region (within 1 kb of the predominant transcription start site), two in the 5’-UTR, 4,485 in introns, 23 in the 3’-UTR, and 100 within 1 kb downstream of the transcriptional unit. Thus, ~95% of gene-associated TRs are located within introns. The group of TR-containing genes was found to be most enriched for genes involved in cell adhesion, localization to the plasma membrane, ion channel function, and receptors involved in signal transduction pathways, cell communication, and transmission of the nerve impulse (Table 1 and Supplementary Information). In addition, genes associated with glutamate receptor activity were progressively enriched as a function of TR length (Supplementary Fig. 1 and Supplementary Information).
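The following Python sketch shows one way a search for tetranucleotide repeats of ≥8 units could be expressed, and rechecks the ~95% intronic fraction from the location counts quoted above. It is not the program used for the analysis; the function names are hypothetical, and excluding units that are themselves mono- or dinucleotide repeats (e.g. AAAA, ACAC) is an assumption, since the text does not state how such cases were handled.

```python
import re


def find_tetranucleotide_repeats(seq, min_units=8):
    """Return (start, unit, n_units) for tandem runs of a 4-nt unit repeated
    at least min_units times. Units that reduce to a mono- or dinucleotide
    repeat are skipped (an assumption made for this sketch)."""
    pattern = re.compile(rf"([ACGT]{{4}})\1{{{min_units - 1},}}")
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if unit[:2] == unit[2:]:  # AAAA, ACAC, ... are not true tetranucleotides
            continue
        hits.append((match.start(), unit, len(match.group()) // 4))
    return hits


def intronic_fraction(counts):
    """Fraction of gene-associated TR locations that fall within introns."""
    return counts["intron"] / sum(counts.values())


# Hypothetical sequence: (GATA)8 is reported, (ACAC)8 is filtered out.
print(find_tetranucleotide_repeats("CC" + "GATA" * 8 + "CC" + "ACAC" * 8))
# [(2, 'GATA', 8)]

# Location counts as quoted in the text above.
reported = {"promoter": 114, "5'-UTR": 2, "intron": 4485,
            "3'-UTR": 23, "downstream": 100}
print(round(intronic_fraction(reported), 2))  # 0.95
```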
The enrichment analyses for the two gene datasets containing either TR (≥8 units, 4,182 genes) sequences or long R:Y tracts (≥100 nt, 1,951 genes) were extended to additional genomic functions. Both datasets were highly enriched in genes known to undergo alternative splicing and prone to DNA breakage leading to chromosomal translocations (Table 1). These data enable the following conclusions: 1) the categories of genes enriched in long R:Y tracts are also enriched in TR sequences; 2) the gene functions involved are associated, as a whole, with communication between cells; 3) long R:Y tracts (which also include most TRs ≥18 units, Supplementary Information and Supplementary Fig. 1) are a distinctive feature of genes involved in synaptic glutamatergic activity; 4) intragenic R:Y and TR tracts are characteristic of genes that have acquired a complex organization through alternative splicing and thus may encode proteins with multiple functions; and 5) the genes involved are generally prone to breakage. An important aspect of these studies is the association between R:Y-tract containing genes and genes that confer susceptibility to complex mental disorders. This association has recently been strengthened by genome-wide case-control analyses in subjects afflicted with schizophrenia [74,95]. Hence, triplex-forming sequences are attributes of genes involved in integrative networking functions in the brain.
Analysis of the distribution of micro and minisatellites ranging from dinucleotides to 11-mer repeats in human cDNAs  identified 2,626 unique RefSeq genes. The set displayed strong enrichment for genes associated with transcription factors, the regulation of transcription and specific signaling pathways, including genes from the MAPK and WNT pathways (Table 1). Similar searches at the proteomic level also showed preferential enrichment for transcription factors, chromatin binding proteins, DNA and RNA binding proteins, and proteins involved in translation [96,97]. The current rationale for these observations consists of a model whereby homo-amino acid runs constitute disordered protein regions that become ordered upon nucleic acid and/or cognate protein binding. The transition from a disordered to an ordered state would then greatly enhance the stability of the ensuing complexes and therefore elicit specific biological functions (reviewed in ).
As mentioned above, G-quadruplex-forming repeats predominate in gene regions flanking the transcription start sites but are also abundant in 3’-UTRs. The classes of genes most enriched in such repeats belong to the family of small GTPases, such as Rho, which play critical roles in signal transduction  and in the regulation of stress fibers, including the actin cytoskeleton  (Table 1).
In summary, the association of repetitive DNA with gene function follows specific patterns, i.e. genes involved in male reproduction for large IRs, cell-cell communication for long R:Y tracts, transcription and its regulation for coding microsatellites and small GTPase signaling/regulation for G-quadruplexes. Therefore, it is likely that selective pressures have acted so as to maintain specific DNA sequences in coding regions and to enable the acquisition and maintenance of novel gene functions during the course of evolution.
The first analyses on the genome-wide distributions of quadruplex-forming motifs (G3+N1–7 G3+N1–7 G3+N1–7 G3+) revealed their high prevalence in warm-blooded species  and an overrepresentation in the promoter region of genes [75,80,82,101]. Indeed, a recent investigation on a dataset of 13,276 non-redundant human RefSeq genes established the presence of one or more G4 motifs in the 500-nt region flanking the transcription start site (TSS) of 8,214 (~62%) such genes , a significant proportion. When the expression value of the RefSeq genes was analyzed in 79 human tissues/cell types, a significant association was found between G4 motifs downstream, but not upstream, of the TSS and an increase in gene expression. Moreover, a direct relationship was evident between the number of G4 motifs (0–4) and the levels of gene expression. Further analyses indicated that the average levels of gene expression for both the G4-negative and G4-positive genes varied according to tissue/cell type. Nevertheless, in each case the G4-positive gene set displayed higher transcriptional values than the G4-negative set (Fig. 2A). Hence, a direct association exists between G4 motifs and gene transcription, supporting a genome-wide role for quadruplex structures in either promoting transcriptional activity and/or stabilizing the ensuing pre-mRNA transcripts.
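The quadruplex-forming motif quoted above (G3+N1–7 G3+N1–7 G3+N1–7 G3+) translates directly into a regular expression. The Python sketch below is a minimal illustration, not the code used in the cited surveys; the function name and output format are assumptions. It scans the given strand for the G-rich motif and for the equivalent C-rich motif, which marks a G-rich motif on the complementary strand.

```python
import re

# Four runs of >=3 guanines separated by loops of 1-7 nucleotides.
G4_PLUS = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")
# The same motif on the complementary strand appears as C runs on this strand.
G4_MINUS = re.compile(r"C{3,}[ACGT]{1,7}C{3,}[ACGT]{1,7}C{3,}[ACGT]{1,7}C{3,}")


def find_g4_motifs(seq):
    """Return (start, strand, motif) for non-overlapping G4 motif matches,
    sorted by start position. '+' is the given strand; '-' is its complement."""
    seq = seq.upper()
    hits = [(m.start(), "+", m.group()) for m in G4_PLUS.finditer(seq)]
    hits += [(m.start(), "-", m.group()) for m in G4_MINUS.finditer(seq)]
    return sorted(hits)


# Hypothetical example with single-nucleotide loops.
print(find_g4_motifs("TTGGGAGGGTGGGCGGGTT"))
# [(2, '+', 'GGGAGGGTGGGCGGG')]
```

Because `finditer` returns non-overlapping matches, a long G-rich region that could fold in several registers is counted once. Published counts differ depending on how such overlaps are treated.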
Quadruplex nucleic acid structures are likely to regulate transcriptional activity by several, and perhaps opposing, mechanisms. A recent search for G4 motifs in 32,985 annotated 5’-UTRs and 32,818 3’-UTRs from a compilation of 21,658 human genes yielded the following trend in relative frequencies per kb of DNA: 5’-UTR>3’-UTR>transcriptome>whole-genome, with values ranging from 0.382 to 0.057. Significantly, G4 motifs were overrepresented in 3’-UTRs as well as in 5’-UTRs; moreover, for a high proportion of genes (97/561 or ~17%) with G4 motifs in 3’-UTRs, the genomic distance from the end of transcription to the next gene was shorter (within 1 kb) than the genome average, suggesting a role for G-quadruplex structures in transcription termination. Finally, a large body of evidence (reviewed in ) supports the conclusion that quadruplex DNA may form in the promoter region of oncogenes and elicit functional roles, such as the transcriptional inhibitory activity observed in c-MYC [103,104].
Herein, we contrast the global gene expression profile of genes that contain quadruplex-forming sequences with those that harbor triplex-forming sequences, i.e. the set of 228 genes (set 1) containing the longest (≥250 nt) R:Y tracts (Table 1) and the set of 190 genes (set 2) containing ≥18 TR units (Table 1, Supplementary Information and Supplementary Fig. 1). Analysis of the gene expression data in 70 tissues/cell lines (cancer tissues and cancer cell lines were not included) showed that for the 16,146-probe set comprising the control genes (i.e. sets 1 and 2 excluded) the transcriptional values followed a bimodal distribution composed of two overlapping Gaussian curves (Supplementary Fig. 2), the first accounting for 75% of the data and showing high levels of gene expression (HGE) and the second comprising the remaining 25% of the data and displaying low levels of gene expression (LGE) (Fig. 2B). For comparative purposes, the HGE mean value was normalized to 1. Accordingly, the LGE mean value was 0.13 when the respective natural logarithms were transformed back into raw gene expression values, a 9-fold reduction. Set 1 also displayed a bimodal distribution. However, whereas the LGE mean value did not differ from that of the control probe set, the HGE distribution was shifted to significantly lower values (normalized mean = 0.73, P<0.001, an ~25% reduction relative to the control data-set mean). Similarly, set 2 displayed a significant reduction in gene expression for both the HGE and the LGE distributions (normalized mean values 0.75 and 0.11, respectively; P<0.001) (Fig. 2B). Hence, genes containing long R:Y tracts with the potential to form triplex DNA structures are generally transcribed at lower levels than genes that do not contain such elements.
A previous analysis  of the tissue-specific patterns of gene expression for set 1 after z-scoring (which normalizes the average expression of any given gene across all tissues) indicated that the highest transcriptional activity occurred in the brain. Hence, taken together, these data suggest brain-specific roles for long R:Y tracts in transcriptional regulation. Finally, these analyses reveal the contrasting transcriptional profiles of genes harboring quadruplex-forming repeats (increased transcription) and those containing triplex-forming sequences (decreased transcription).
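The z-scoring step mentioned above can be sketched in Python as follows. The function name, the use of the population standard deviation, and the three expression values are assumptions made purely for illustration; the cited analysis may have used a different estimator or software.

```python
from statistics import mean, pstdev


def zscore_across_tissues(expression):
    """Center and scale one gene's expression values across tissues so that
    the gene's mean becomes 0 and its (population) standard deviation 1."""
    values = list(expression.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A gene expressed identically everywhere has no tissue preference.
        return {tissue: 0.0 for tissue in expression}
    return {tissue: (value - mu) / sigma for tissue, value in expression.items()}


# Hypothetical expression values for a single gene (arbitrary units).
gene = {"brain": 9.0, "liver": 3.0, "testis": 3.0}
scores = zscore_across_tissues(gene)
print(max(scores, key=scores.get))  # brain
```

After this transformation, genes with very different absolute expression levels can be compared by the tissue in which each is relatively most active.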
Sex-specific genes are clustered in the arms of IRs on the X and Y chromosomes . The Y-chromosome comprises two external pseudo-autosomal (PAR1 and PAR2) regions (≤1.5 Mb) homologous to the X-chromosome and essential for chromosome segregation at meiosis, and a central male-specific segment (MSY) functionally divided into euchromatin (shorter p-arm) and heterochromatin (distal q-arm).
The euchromatin region is itself a complex mosaic of modular DNA sequences characterized by eight large (up to 1.46 Mb in length) inverted repeats, commonly referred to as palindromes 1–8, shorter inverted and direct repeats, all of which contain gene families with expression patterns specific to the testis and performing essential functions in the production and maturation of sperm . Two other regions, X-transposed and X-degenerate, harbor paralogous genes with copies on the X-chromosome. Modular tandem arrays also compose the entire heterochromatic region, whose length variation caused by polymorphic tandem array repeat number confers large-scale differences to the size of Y-chromosomes in the general population (Fig. 3A). Hence, inverted and direct repeats compose most of the human Y-chromosome, thus conferring higher-order structural architectures to the primary genomic sequence.
The MSY region does not have a counterpart in other chromosomes and thus it is excluded from sexual recombination. This unique behavior has prompted speculation  that Y-chromosome extinction is inevitable given that gene decay, consequent to naturally occurring mutations, would be irreversible. Indeed, the Y-chromosome has degenerated substantially both in size and gene content in comparison with the X chromosome. However, the ampliconic gene families nested within the palindromic arms and key to spermatogenesis have sustained much lower-than-expected mutation rates during evolutionary time . For example, not only do the intrapalindromic (arm-to-arm) sequences share on average >99% sequence identity, but gene pairs located at symmetrical positions within palindromic arms are also generally identical or nearly so . In contrast, substantial sequence divergence exists between gene pairs belonging to the same gene families but located at different arm positions . Thus, high rates of gene conversion are believed to have occurred among testis-specific genes in the human Y-chromosome, which have effectively counteracted the threat of gene decay imposed by the absence of meiotic recombination [106,108]. In fact, comparative analyses between the human and chimpanzee Y-chromosomes strongly support the conclusion that the ampliconic gene families in palindromes have been under strong positive selective pressure, most likely because of their key role in spermatogenesis (Fig. 3A) .
These observations raise a number of questions. Did the inverted repeat architecture of palindromes play a critical role in shaping and preserving Y-chromosome function? How did gene conversion take place between the arms of palindromes? Several studies have been performed to address these issues. First, analyses from representative ethnic groups revealed that the IR3/IR3 region of the Y-chromosome was inverted in 16/47 cases . This corresponds to a frequency of ~9.2×10⁻⁴ inversion events per father-to-son transmission, which is at least 10,000 times higher than the frequency of single nt changes. Second, recent detailed sequence analyses of microinversions that distinguish the human and chimpanzee genomes showed that in all cases inverted repeats were present at breakpoints . Therefore, whereas inverted repeats may suppress random nucleotide changes arising from within their repeating arms , they nevertheless represent a structural unit capable of changing genomic orientation over time. We and others  have proposed that large inverted repeats may promote strand exchange and form stem-loop structures, which may account for these features . Accordingly (Fig. 3B), the two arms of an inverted repeat may interact and engage in a strand-exchange reaction leading to the formation of intra-strand Watson-Crick hydrogen bonded base pairs (Fig. 3B, Structure I). This gives rise to a stem-loop structure characterized by two Holliday-like junctions, one at the apex between the stem and the looped-out intervening sequence, the other at the base between the stem and the sequences flanking the inverted repeats (Fig. 3B, Structure II). Resolution of the Holliday-like junctions would yield two types of events. First, in 50% of cases the intervening sequence will invert, assuming equal rates of cleavage at the intersecting vs. non-intersecting strands (Fig. 3B, Structure III). 
Second, upon inversion the complementary DNA strands of the inverted repeats will contain the nucleotides that were previously located on the same DNA strand, effectively providing a means for the correction of mispairs through mismatch or other repair pathways (Fig. 3B, Structure IV). These models (Fig. 3B and ) offer a rationale for the observations that: a) inverted repeats mediate genomic inversions ; b) high rates of “gene conversion” events take place between the arms of palindromes ; and c) genes of the same family show a pair-wise pattern of sequence identity based upon their location at similar palindromic arm positions . In addition, these models provide a rationale for the formation of large stem-loop structures, including cruciforms [24–28,111], for which the physiologic levels of negative supercoiling appear insufficient . Finally, because strand exchange may initiate and terminate anywhere along the inverted repeat sequences, their total lengths do not impose a size constraint on stem-loop structures, which may vary in length. This contrasts with the “classic” cruciform structure (Fig. 1A), which nucleates from the apical loop.
In summary, these composite data provide empirical evidence in support of the notion that cruciforms have played a pivotal role during evolutionary time by providing a genomic structure upon which selection acted so as to preserve, and perhaps shape, the sex-specific functions of the human Y-chromosome.
Studies using model systems suggest that instability caused by trinucleotide repeats and other non-B DNA-forming sequences may occur via aberrant DNA replication events (reviewed in [16,113,114]), as well as replication-independent mechanisms in non-proliferating tissues (reviewed in ). We discuss results to support both replication-dependent and replication-independent mechanisms of DNA structure-induced genetic instability below.
Human fragile sites often consist of non-B DNA-forming tandem repeats . Studies of model sequences have provided links between DNA replication and fragile site instability (reviewed in [114,116]). For example, the mutation rate of hairpin-forming CAG repeats increased when the DNA polymerase zeta subunit rev1 was mutated in Saccharomyces cerevisiae , suggesting that the transient formation of single-strand DNA during replication and the ensuing slipped DNA structures are mutagenic. Indeed, replication slippage at repetitive sequences (e.g. CTG:CAG, GAA:TTC, CGG:CCG, and GAC:GTC) has been implicated in mutations, deletions, or expansions of repeating units, causing genetic instability related to hereditary neurological diseases (reviewed in ).
Direct evidence for a link between replication and non-B DNA structures was provided by the ability of non-B DNA structure-forming sequences to slow replication forks. Using 2D gel electrophoretic analyses and electron microscopy, stalling of replication intermediates by trinucleotide repeats, inverted repeats of Alu elements , and an (A+T)-rich fragile site (FLEX1) from the human FRA16D gene  was detected when these elements were cloned into bacterial, yeast, and human cells. Replication attenuation was dependent on the length and/or sequence of these repeats and correlated with their capacity to form DNA secondary structures. A stalled replication fork will give rise to longer exposure of ssDNA, and may cause replication fork collapse and DSBs, which may be processed in a mutagenic fashion. DNA triplex structures can also block replication forks and cause DSBs [12,42].
Due to the differences between leading and lagging strand DNA synthesis during replication, the orientation of repeat sequences greatly influences their stability in model systems such as bacteria, yeast, and cultured mammalian cells [120–123]. Most non-B DNA structure-forming trinucleotide repeats are more unstable when they serve as lagging strand templates. The instability of GAA repeats in the FRDA gene responsible for Friedreich ataxia is dependent on the orientation of DNA replication. In yeast for example, GAA repeats display nearly 100-fold higher instability on the lagging strand than on the leading strand . Similarly, CTG repeats show higher levels of DNA instability when used as a template for lagging strand synthesis (to the replication origin ColE1) in a recA− strain of E. coli, upon induction of DSBs . A long (CTG)130 repeat from a myotonic dystrophy patient was unstable on the lagging-strand template but was stable on the leading strand template in yeast . Also, the (CGG)160 repeat from the 5'-UTR region of the FMR1 gene contracts when placed as the lagging strand template in the yeast chromosome, but yields few contractions when the repeat is located in the leading strand template . The strand-preference of trinucleotide repeat instability indicated that the ability to form secondary structures differs for the two complementary sequences. For example, the CTG repeats adopt a more stable hairpin structure than CAG repeats [126,127]. Hence, when CAG repeats serve as the lagging strand template, the newly synthesized complementary CTG repeats would be prone to form non-B structures that may cause repeated synthesis, resulting in expansion of the repeat [15,17]. At the same time, if the leading strand template with CTG repeats forms secondary structures, it may be bypassed and give rise to contractions within the repeat. 
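The orientation dependence discussed above rests on the fact that the two strands of a repeat tract carry different sequences: the strand complementary to a CAG tract is a CTG tract. The sketch below is an illustration under stated assumptions and is not taken from the cited studies. It locates the longest uninterrupted tract of a repeat unit and reports which unit lies on the lagging strand template for a given fork direction, using the standard fork geometry in which the strand written 5'→3' in the direction of fork movement is the lagging strand template. All function names and the toy sequence are hypothetical.

```python
import re

# Translation table for complementing A/C/G/T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_repeat_tract(seq: str, unit: str, min_units: int = 5):
    """Return (start, n_units) of the longest uninterrupted run of `unit`.

    Returns None if no run of at least `min_units` copies is present.
    """
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_units},}}")
    best = None
    for match in pattern.finditer(seq.upper()):
        n_units = (match.end() - match.start()) // len(unit)
        if best is None or n_units > best[1]:
            best = (match.start(), n_units)
    return best


def lagging_template_unit(top_unit: str, fork_left_to_right: bool) -> str:
    """Return the repeat unit found on the lagging strand template.

    Assumes the top strand is written 5'->3' left to right. For a fork
    moving left to right, the top strand is the lagging strand template;
    for the opposite direction, the bottom strand is.
    """
    return top_unit if fork_left_to_right else reverse_complement(top_unit)


# Toy top strand carrying ten CAG units; the bottom strand carries CTG.
top = "AT" + "CAG" * 10 + "GC"
bottom = reverse_complement(top)

top_tract = longest_repeat_tract(top, "CAG")
bottom_tract = longest_repeat_tract(bottom, "CTG")
print(top_tract, bottom_tract)
print(lagging_template_unit("CAG", fork_left_to_right=True))
print(lagging_template_unit("CAG", fork_left_to_right=False))
```

Flipping the fork direction, which is equivalent to moving the replication origin to the other side of the tract, swaps which of the two complementary repeats serves as the lagging strand template. This is the variable manipulated in the orientation experiments described above.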
Whereas contractions of trinucleotide repeats are seen in many yeast and bacterial models, expansions are prevalent in human diseases [14,15,128]. The reasons for this discrepancy remain to be clarified; however, trans-acting factors may be involved. For example, the human MSH2–MSH3 complex can bind CAG or CTG repeats , and knockdown of the proteins in this complex has been shown to reduce trinucleotide repeat instability [130,131]. Thus, it is possible that the MSH2–MSH3 complex stabilizes the secondary structures formed by the repeats rather than processing the “mismatched” nucleotides (discussed below). Due to its strand discrimination ability, MSH2–MSH3 might then preferentially stabilize the structure formed on trinucleotide repeat tracts on the newly synthesized strand, leading to expansion events.
The ability of non-B DNA structure-forming sequences to stall replication forks can be counteracted by proteins that stabilize replication forks. Studies on CGG repeats and inverted repeats in yeast indicate that the replication fork-stabilizing proteins Mrc1 and Tof1 could reduce the replication stalling effect of non-B DNA structures [118,132]. Proteins functioning in the maturation of Okazaki fragments also influence the expansion and contraction of repeat sequences. For example, mutations in yeast Rad27 (homologous to the human FEN-1 flap endonuclease 1) lead to the expansion of repeated CAG:CTG sequences and to the recombination/instability of inverted Alu elements [133,134]. The interactions among Rad27, DNA ligase I, and proliferating cell nuclear antigen (PCNA) are critical for the maintenance of CAG:CTG repeats in yeast . Similar to Rad27, which prevents the expansion of trinucleotide repeats, the yeast helicase Srs2 unwinds the secondary structures formed by trinucleotide repeats, and together with post-replication repair proteins prevents the expansion of CAG:CTG repeats [136–138]. However, these results demonstrating a role for Rad27 in repeat stability in yeast are not consistent with those observed in mammalian cells. For example, the CAG:CTG repeat from the Huntington locus was stable over 27 successive cell passages when FEN-1 was continuously knocked-down by siRNA . Similarly, in mice, haploinsufficiency of Fen1 increases the expansion of CAG:CTG repeats at the Huntington locus but does not affect their stability at the myotonic dystrophy type 1 (DM1) locus in knock-in models .
Whereas DNA replication-related mechanisms may largely be responsible for non-B DNA structure-induced genomic instability in proliferating tissues, they do not account for genetic instabilities found in non-proliferative tissues (reviewed in ). For example, analyses of patients with Huntington disease and spinocerebellar ataxias showed instability of CAG:CTG repeats in their non-proliferative tissues, such as brain and sperm [141,142]. Similarly, H- and Z-DNA structures were found to induce large-scale deletions and rearrangements in replication-deficient HeLa cells ( and our unpublished results). In transgenic mice CAG repeats might expand by gap repair in germ cells without replication or recombination taking place [128,143]. In addition, the translocation of the palindromic AT-rich repeat has been shown to be independent of replication [25,111]. Several DNA repair-related mechanisms have been proposed to explain replication-independent mutagenesis events at non-B DNA conformations (reviewed in ).
Being different from the canonical B-form DNA conformation, non-B DNA structures represent distortions of the DNA double helix, including the non-B structure itself, and the non-B to B-form junctions. These distortions may be recognized as “damage” by DNA repair proteins. One consequence of such “damage” recognition is the introduction of mutations/deletions, causing genomic instability (Fig. 4). Many non-B DNA structures can lead to the generation of DSBs during DNA repair, which are critical lesions that can lead to cell death or chromosomal rearrangements .
Trinucleotide repeats can form hairpins with mismatched nucleotides in the stems. This structural property may be recognized as “damage” by repair proteins. The Mre11/Rad50 complex was shown to cleave hairpins/cruciforms in a structure-specific manner . Inverted repeats also generate DSBs and stimulate unequal sister-chromatid exchange in yeast . Although it was not evaluated whether replication is important for DNA breakage and translocation in this case, the rate of this spontaneous exchange was reduced to ~50% in yeast strains with mutations in the mismatch repair (MMR) genes Msh2 or Msh3, suggesting a role for DNA repair in non-B DNA structure-induced mutagenesis . Kirkpatrick and Petes (1997) reported that repair of 26-base loops in yeast involved both Msh2 and Rad1, suggesting that these repair proteins recognize helical distortions as “DNA damage” and remove DNA loops formed by trinucleotide repeats . The absence of a functional nucleotide excision repair (NER) protein UvrA has been shown to increase the instability of long CTG repeats in E. coli [146,147]. However, conflicting results on the roles of MMR and NER repair proteins on repeat instability have been reported in human cell lines or mouse model systems [130,148–150]. Thus, further studies in this area are warranted.
While it is clear that Z-DNA-forming sequences can cause genetic instability in a number of organisms, the underlying mechanisms remain largely speculative (reviewed in ). Studies from our laboratory have demonstrated that the instability of Z-DNA-forming sequences (CG)14 results from the DSBs induced by these sequences in mammalian cells . However, the mutation spectrum induced by the same (CG)14 sequence in bacteria is quite different . In bacteria the predominant mutation/deletion appears to be within the CG repeat with a gain or loss of dinucleotides, likely caused by slippage events during replication. In contrast, replication was not required for the (CG)14-induced mutations in mammalian cells, where predominant mutation events were large (>50 bp) deletions . It is possible that these deletions were the result of error-generating DNA repair processing events at these unusual DNA structures. Chromatin immunoprecipitation experiments showed that Z-DNA-forming (CG)14 repeats were enriched relative to B-DNA sequence controls in the precipitations with antibodies against the NER protein, XPA, and the MMR protein, MSH2 (Wang & Vasquez, unpublished data). Moreover, the mutation frequency of this Z-DNA-forming sequence was lower in XPA- or MSH2-deficient human cells than in their isogenic wild-type counterparts, suggesting that these proteins contribute to Z-DNA induced mutagenesis in human cells.
We have demonstrated that the naturally occurring H-DNA structure-forming sequence from the human c-MYC gene, which co-localizes with translocation breakpoints, can induce DSBs within these sequences in mammalian cells and cause genomic instability in mice [12,23]. Similarly, the instability of H-DNA structure-forming sequences from the polycystic kidney disease 1 (PKD1) gene was lower in MMR-deficient bacterial cells compared to wild type cells . Our data suggest that like Z-DNA, the mutagenicity of H-DNA-forming sequences involves XPA and MSH2 (Wang & Vasquez, unpublished data). Recently, we discovered that the MMR protein complex, MSH2–MSH3 (MutSβ), cooperates with two key NER protein complexes (XPA-RPA and XPC-RAD23B) in the recognition of triplex structures in the presence of a psoralen interstrand crosslink. This interaction was enhanced up to 10-fold in the presence of a psoralen interstrand crosslink within a triplex structure compared to a psoralen interstrand crosslink within a duplex DNA substrate, suggesting that the non-B DNA structure is a strong recognition signal for both NER and MMR proteins .
However, binding of DNA repair proteins to non-B DNA structure-forming sequences does not always result in increased instability. In some cases, binding of MSH2 or MSH3 to the hairpin structures formed by trinucleotide repeats may prevent the structure from being processed. In yeast, the Msh2–Msh3 complex binds preferentially to the imperfect stem formed by interrupted trinucleotide repeats and blocks their expansion . The human MMR protein complex MSH2–MSH3 was confirmed to preferentially bind looped-out secondary structures formed by CTG repeats, and the ATPase activity required for its repair function was decreased after binding to the non-B DNA structure-forming sequences .
DNA repair processes may promote the transition from B- to non-B DNA structures. When DNA damage occurs at or near repeated sequences, the subsequent repair processes may unwrap the DNA from the chromatin, which generates negative superhelical stress and promotes the transition to non-B DNA. Alternatively, single-stranded DNA regions may form, which then allow the folding of secondary structures to take place (Fig. 4). Genetic experiments in a mouse system demonstrated that knockdown of the recombination protein Rad52 decreased the expansion of CTG repeats . Introducing DSBs within the GAA repeats or within CTG repeats in E. coli resulted in deletion, but this stimulatory effect only occurred when DSBs were located within the repeats [155,156]. Similarly, more instability was seen in the processing of DSBs with a CTG repeat sequence in mammalian cells when the CTG repeat was capable of forming slipped DNA structures compared to a linear DNA control . These results suggest that hairpin/cruciform structure-forming sequences may be more susceptible to deletion or rearrangement events during DNA repair in the surrounding regions.
On the other hand, the formation of DNA secondary structures near DNA damage might influence the repair processing, depending on the type of damage, the environment, and the nature of the secondary structures. For example, the Malkova group has shown that in yeast, the inverted Ty elements promote the repair of DSBs at distances of up to 30 kb from the elements by forming dicentric inverted dimers [158,159]. The existence of inverted repeats flanking a DSB is thought to channel repair from a homologous recombination pathway into a single-strand annealing-gross chromosomal rearrangements (SSA-GCR) pathway in yeast . This pathway is not dependent on homologous recombination because in a rad51Δ strain, the existence of intact large inverted repeats near the DSB reduces the broken chromosomal loss from roughly 40% to ~13% . Unlike inverted repeats which promote the repair of DSBs, the secondary structures formed by CTG units in a plasmid reporter system in mammalian cells showed decreased repair efficiency of the DSB within the repeat, compared to a control of linearized plasmid containing the same CTG sequence and DSB . These results suggest that non-B DNA structures are able to form during DNA repair and the formation of such structures can potentially alter repair. If the non-B DNA structure-forming sequences near the damage site are processed during the repair of the lesion, they may contribute to the error-generating repair and lead to genomic instability. This notion is supported by data from patients showing that gene conversion contributes to the instability of CGG:CCG repeats in the FRAXA and CTG:CAG tracts in DM1 cases (reviewed in ).
Non-B DNA structures may also affect DNA repair by increasing DNA damage susceptibility and/or damage accumulation . The distortion of the DNA helix and the altered arrangement of the bases and sugar moiety in non-B DNA conformations can influence the interactions of DNA damaging factors with the nucleotides, and thus modify their accessibility to DNA damage. For example, many types of non-B DNA conformations, e.g., H-DNA, B-Z junctions, hairpin and loop structures, contain single-stranded regions that are not protected by hydrogen bonding, and are often precluded from chromatin that can otherwise protect the bases. Thus, non-B DNA structures may be more accessible to DNA damaging factors than B-DNA . For example, the guanines in a Z-DNA structure are more sensitive to ionizing radiation , and are more sensitive to oxidative damage in the single-stranded regions compared to B-form duplex DNA . On the other hand, it is also possible that DNA in non-B conformations is more resistant to certain types of damaging agents, e.g., interstrand crosslinks are less likely to be formed in the single-stranded regions of non-B DNA structures than in duplex DNA.
The abnormal positioning of the bases and sugar moiety in non-B DNA conformations can also impact the function of some DNA repair proteins on damaged DNA. For example, alkylating damage such as N7-methylguanine or O6-methylguanine is not repaired as efficiently in Z-DNA as it is in B-DNA [163,164]. This topic is covered in depth in a recent review by Wang and Vasquez (2009), which describes a model of “DNA repair-stimulated non-B DNA structure formation” .
Since the discovery of non-B DNA structures several decades ago, these structures have been shown to influence critical genetic transactions, such as DNA replication, transcription, recombination, and repair. Our knowledge of the role of non-B DNA structures in genomic instability has grown recently along with the progress made in understanding the DNA structural characteristics, the correlations between DNA structure and genetic diseases, and the proteins that influence the stability of DNA structures. Genome-wide analyses have greatly influenced our view of DNA structure-induced genomic plasticity and its consequences in human disease and in the evolutionary changes that have taken place since the divergence of eukaryotes from prokaryotes. The capability of non-B DNA structures to induce mutations/deletions and to promote chromosome rearrangements gives them potential evolutionary functions, e.g., mutating to adapt to rapid changes while at the same time preserving DNA information through recombination (as in the case of the human Y chromosome mentioned above).
However, there are still many questions to be answered regarding the relationships between DNA sequence, structure, and function. For example, what environmental conditions promote non-B structure formation? What proteins function in the recognition and subsequent processing of non-B DNA structures? What proteins/pathways are involved in their error-generating repair causing genomic instability? The same trinucleotide repeat sequences in various systems do not always result in genetic instability, suggesting that DNA sequence context and/or location in the genome may be critical factors in repeat instability. In our studies, H-DNA sequences are mutagenic in mammalian cells, but are not mutagenic when introduced in bacteria, suggesting a requirement for trans-acting factors/proteins in a host-specific fashion for structure formation and/or processing. The observation that specific types of non-B DNA-forming sequences are enriched in gene families with particular functions, and the correlation between gene expression levels and the presence of non-B DNA-forming sequences in these gene regions, emphasize the need to further investigate the regulatory function of repetitive elements. It is not clear whether these elements are enriched due to their regulatory function, or due to the higher mobility of unstable non-B DNA structure-forming sequences.
The current mechanisms proposed for non-B DNA-induced genetic instability include abnormal DNA replication, which can explain the contraction and expansion of trinucleotide repeats in replicating systems, and processing by DNA repair proteins, which contributes to replication-independent mutagenesis induced by non-B DNA structures. Many DNA repair proteins have been found to interact with non-B DNA structures in vitro; while some protein-non-B DNA interactions lead to repair processing and DNA breakage, other proteins might stabilize the non-B DNA conformations. Furthermore, a particular protein may have different effects on non-B DNA conformation in different species. The much-needed screening for proteins that interact with non-B DNA structures is in progress and will provide more information about their recognition and structure-induced genomic instability at the molecular level. These results will help us to comprehensively understand how these DNA structures influence genome stability, DNA metabolic functions (e.g. gene function and regulation), and the balance between selection stress and adaptation to changing environmental conditions.