Different types of human gene mutation may vary in size, from structural variants (SVs) to single base-pair substitutions, but what they all have in common is that their nature, size and location are often determined either by specific characteristics of the local DNA sequence environment or by higher-order features of the genomic architecture. The human genome is now recognized to contain ‘pervasive architectural flaws’ in that certain DNA sequences are inherently mutation-prone by virtue of their base composition, sequence repetitivity and/or epigenetic modification. Here we explore how the nature, location and frequency of different types of mutation causing inherited disease are shaped in large part, and often in remarkably predictable ways, by the local DNA sequence environment. The mutability of a given gene or genomic region may also be influenced indirectly by a variety of non-canonical (non-B) secondary structures whose formation is facilitated by the underlying DNA sequence. Since these non-B DNA structures can interfere with subsequent DNA replication and repair, and may serve to increase mutation frequencies in generalized fashion (i.e. both in the context of subtle mutations and SVs), they have the potential to serve as a unifying concept in studies of mutational mechanisms underlying human inherited disease.
“Where, when, and in which individual a particular mutation will appear is unpredictable”.
Theodosius Dobzhansky (1970) Genetics of the Evolutionary Process
“A mutation is in itself a microscopic event, a quantum event, to which the principle of uncertainty consequently applies – an event which is hence by its very nature essentially unpredictable”.
Jacques Monod (1971) Chance and Necessity: An Essay on the Natural Philosophy of Modern Biology
Although mutation is still often casually described as a ‘random process’ [Gerrish, 2002; Ayala, 2007; Kondrashov and Kondrashov, 2010], there is now abundant evidence that the process of mutation is far from random. Indeed, over the last 20 years, it has become ever clearer that human gene mutation is frequently a highly sequence-specific process, irrespective of the type of lesion involved. Further, we have come to understand that certain DNA sequences are inherently mutation-prone by virtue of their base composition, sequence repetitivity, epigenetic modification, and/or characteristic secondary structures, and hence have a tendency to mutate in very specific ways. This inherent mutability pertains not only with respect to gross gene lesions but also to subtle mutations such as single base-pair (bp) substitutions. Thus, whereas highly prominent genomic structural features may act at a distance so as to induce gross genomic rearrangements, the nature, location and frequency of micro-lesions are often influenced by their immediate DNA sequence context. The recognition that certain DNA sequences are inherently hypermutable has been accompanied by an emerging understanding of how DNA sequence influences (and indeed often underpins) secondary structure formation, how certain local DNA structures can themselves be mutagenic, and how the type and frequency of the resulting mutations can in turn help to explain the nature and prevalence of specific human genetic diseases [Rogozin and Pavlov, 2003; Bacolla et al., 2008; Arnheim and Calabrese, 2009]. Studies of hypermutable sequences have also provided important insights into the endogenous nature of many of the known mechanisms of mutagenesis, for example CpG deamination or slipped mispairing at the DNA replication fork, that are responsible for quite different types of recurring micro-lesion.
Human mutational spectra are increasingly being ascertained on a genome-wide scale, as for example in sequenced cancer genomes that can constitute an intricate patchwork of clustered, or even overlapping, somatic lesions. Here, however, we have attempted to focus on those mutations that have occurred in the germline and which underlie human inherited disease. Many of these lesions have become explicable (albeit retrospectively) in terms of their underlying mutational mechanisms by reference to local genome structure and sub-structure. In this review, we explore how the nature, location and frequency of the many different types of human gene mutation causing inherited disease are shaped in large part, and often in remarkably predictable ways, by the local DNA sequence environment. The central hypothesis we aim to discuss herein is that sites of mutation leading to inherited disease often coincide with DNA sequences known to possess peculiar biochemical and/or structural features, ranging from the spontaneous deamination of single bases to the cooperative transition from the canonical right-handed double-helix to complex secondary structures, including triplexes, slipped-out bases and cruciforms (collectively termed non-B DNA) such that the root cause of the vulnerability of DNA to mutation often resides within its own sequence.
The text below is organized operationally into sections, allowing us to address sequentially the impact of DNA sequence architecture upon (i) single nucleotide substitutions, (ii) microdeletions, microinsertions and indels, (iii) structural variants (SVs) including copy number variations, (iv) microsatellite mutation and (v) mutations in or involving the mitochondrial genome. We then discuss the extent to which non-canonical (non-B) DNA structure-forming sequences have the potential to contribute to a generalizable and hence potentially unifying hypothesis in the field of mutagenesis, on the basis that non-B DNA structures appear to have the capacity to increase the mutation frequency not only with respect to SVs but also in the context of subtle mutations.
“It is quite clear that the abnormal hemoglobins of man reveal a pattern of nucleotide replacements which is distinctly non-random. It is also clear that the major contributor.…is the G→A transition. Precisely the same conclusions are obtained from the data on the evolution of cytochrome c. This occurs in spite of the fact that in the case of hemoglobin we are probably looking almost exclusively at deleterious mutations, whereas in the case of cytochrome c we are looking only at mutations which have survived the rigors of selection”.
W. M. Fitch (1967) J. Mol. Biol. 26:499-507.
“Our observation of recurrent CG-TG mutations strongly supports the view that these dinucleotides are mutation hotspots”.
H. Youssoufian et al. (1986) Nature 324:380-382.
5-methylcytosine (5mC) is the most frequent post-synthetic (epigenetic) DNA modification in the human genome and is largely, but not exclusively, confined to the CpG dinucleotide. The first hint that the CpG dinucleotide might constitute a hotspot for pathological mutations in the human genome came 25 years ago with the finding that two different CGA>TGA (Arg>Term) nonsense mutations in the factor VIII gene (F8; MIM# 306700) had recurred quite independently in unrelated individuals causing hemophilia A [Youssoufian et al., 1986]. The potential generality of this phenomenon soon became evident with the finding that 12 of the 34 (35%) single base-pair substitutions then known to cause human inherited disease were C>T and G>A (on the other strand) transitions within CpG dinucleotides [Cooper and Youssoufian, 1988]. Further studies confirmed that the CpG dinucleotide was also a mutation hotspot in a number of other human disease genes including PAH [MIM# 612349; Abadie et al., 1989], SERPINC1 [MIM# 107300; Perry and Carrell, 1989], F9 [MIM# 300746; Koeberl et al., 1990], LDLR [MIM# 606945; Rideout et al., 1990], RB1 [MIM# 180200; Mancini et al., 1997], HPRT1 [MIM# 308000; O’Neill and Finette, 1998] and DMD [MIM# 300377; Buzin et al., 2005]. As mutation data accumulated, CGA>TGA transitions were encountered disproportionately frequently as a cause of human genetic disease [Krawczak et al., 1998]. This was due not only to the hypermutability of the CpG dinucleotide but also to the fact that such nonsense mutations are inherently more likely than missense mutations to come to clinical attention owing to their greater functional impact [Mort et al., 2008].
From the outset, it was realised that the hypermutability of the CpG dinucleotide was related to its role as the major site of cytosine methylation in the human genome. The reason traditionally put forward to explain this association has been that while cytosine spontaneously deaminates to uracil (which is efficiently recognized as a non-DNA base and removed by uracil-DNA glycosylase), the spontaneous deamination of 5mC yields thymine [Shen et al., 1994] thereby creating G•T mismatches whose removal by methyl-CpG binding domain protein 4 (MBD4) and/or thymine DNA glycosylase followed by base excision repair (BER) is inherently less efficient [Hendrich et al., 1999; Waters and Swann, 2000; Walsh and Xu, 2006; Cortázar et al., 2007; Boland and Christman, 2008]. This notwithstanding, it should be appreciated that CpG transitions do not originate exclusively via the spontaneous deamination of 5mC but may also arise through the action of other mechanisms and processes e.g. nucleotide misincorporation during replication [Shen et al., 1992; Zhang and Mathews, 1994; Pfeifer, 2006]. Irrespective of the precise nature of the underlying mutational mechanism, Krawczak et al. estimated that, in the context of inherited disease, the rate of CG>TG (and CG>CA on the other strand) transitions was five times the base mutation rate. Subsequent estimates of 5mC hypermutability, derived from various studies of polymorphism, pathological mutations or sequence divergence in an evolutionary context, have ranged from four- to fifteen-fold [Nachman and Crowell, 2000; Kondrashov, 2003; Tomso and Bell, 2003; Jiang and Zhao, 2006a; Zhao and Zhang, 2006; Zhang et al., 2007; Elango et al., 2008; Misawa and Kikuno, 2009; Li et al., 2009]. Ultimately, the question of whether or not a given CpG dinucleotide is hypermutable in the context of inherited disease is determined by its methylation status in the germline.
An added level of complexity is however likely to be introduced into the equation by site-specific differences in the efficiency of DNA methylation (by DNA methyltransferases) that are conferred by the immediate flanking sequence [Wienholz et al., 2010]. It would also appear that local DNA structure, specifically in the form of sequences capable of forming DNA structures other than the canonical right-handed double-helix (collectively called non-B DNA), can influence the efficiency of DNA methylation [Halder et al., 2010]. In passing, another potential source of 5mC-associated mutations is the genome-wide induction of single-strand breaks generated during the waves of demethylation and remethylation in the zygote [Wossidlo et al., 2010]; such a mechanism may account for the large deletions stimulated by knocked-in (CG•CG) tracts in the mouse [Wang et al., 2008].
Self-evidently, since the CpG dinucleotide is a hotspot for mutation, the CpG mutation rate is considerably higher than the non-CpG mutation rate. However, it would appear that the non-CpG mutation rate is contingent to some extent upon the local CpG content [Walser et al., 2008]. This correlation between the CpG and non-CpG mutation rates seems to be independent of G+C content, recombination rate and chromosomal location but, intriguingly, approximates to a sigmoidal curve [Walser and Furano, 2010]. This is potentially explicable in terms of the effect of CpG content on the non-CpG mutation rate being subject to a certain threshold (~0.53%), with ‘saturation’ being attained when the CpG content rises above a particular level (~0.63%). In addition, the mutational spectrum (transition/transversion ratio) of non-CpG sites was noted to change with CpG content [Walser and Furano, 2010] supporting the authors’ contention that this ‘CpG effect’ could be an intrinsic property of the DNA sequence.
It has been known for some time that cytosine methylation also occurs in the context of CpNpG sites (where N represents any nucleotide) in mammalian genomes [Woodcock et al., 1987; Clark et al., 1995; Ramsahoye et al., 2000] and in vitro [Pradhan et al., 1999]. Since the intrinsic symmetry of the CpNpG trinucleotide would support a semi-conservative model of replication of the methylation pattern (as with the CpG dinucleotide), it comes as no surprise that both maintenance and de novo methylation occur at CpNpG sites in mammalian cells [Clark et al., 1995]. In their landmark paper on the human methylome, Lister et al. reported abundant DNA methylation in CpHpG trinucleotides (where H = A, C or T); more specifically, some 17.3% of 5mC in embryonic stem cells was found to occur within CpApG, CpCpG and CpTpG with a further 7.2% of 5mC occurring in CpHpH. Although Lister et al. suggested that non-CpG methylation is almost entirely lost upon differentiation (a conclusion based upon the analysis of fetal lung fibroblasts), others have noted CpNpG methylation within human genes in a variety of different somatic tissues [Lee et al., 2010; Laurent et al., 2010]. If we therefore assume not only that CpHpG methylation occurs in the germline but also that 5mC deamination can occur within a CpHpG context, then it follows that methylated CpHpG sites are very likely to constitute mutation hotspots causing human inherited disease. Initial indirect evidence that this might indeed be the case came from the observation that disproportionately high numbers of C>T and G>A transitions occur at CpNpG sites in studies of the human genes, NF1 [MIM# 613113; Rodenhiser et al., 1997] and BRCA1 [MIM# 113705; Cheung et al., 2007].
In the light of the above, Cooper et al. revisited the question of CpG dinucleotide hypermutability and explored the potential contribution that CpHpG transitions might make to human inherited disease. A total of 54,625 missense and nonsense mutations from 2,113 genes causing inherited disease were retrieved from the Human Gene Mutation Database [http://www.hgmd.org; Stenson et al., 2009]. Some 18.2% of these pathological lesions were found to be C>T and G>A transitions located within CpG dinucleotides (compatible with a model of methylation-mediated deamination of 5mC), a ~10-fold higher proportion than would have been expected by chance alone [Cooper et al., 2010]. The corresponding proportion for the CpHpG trinucleotide was 9.9%, a ~2-fold higher proportion than would have been expected by chance alone. Cooper et al. therefore estimated that ~5% of missense/nonsense mutations causing human inherited disease could be attributable to methylation-mediated deamination of 5mC within a CpHpG context. Irrespective of the functional role(s) of cytosine methylation in the human genome, it would appear that methylation of the CpHpG trinucleotide may leave a significant imprint on the spectrum of point mutations causing human genetic disease.
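The classification underlying such counts, namely whether a given C>T or G>A transition lies within a CpG dinucleotide or a CpHpG trinucleotide, can be made concrete with a short sketch. What follows is an illustration only and not the procedure used in the studies cited above: the function name classify_context and the convention of passing a sequence, a 0-based position and the new base are assumptions introduced here. A G>A change is treated as a C>T change on the opposite strand, so the relevant neighbours are read 5′ of the G on the strand supplied.

```python
def classify_context(seq: str, pos: int, alt: str) -> str:
    """Classify a substitution at 0-based index `pos` of `seq` (hypothetical helper).

    Returns "CpG", "CpHpG" or "other". Only C>T, and G>A (the same event
    read from the opposite strand), are considered; H means A, C or T.
    """
    seq, alt = seq.upper(), alt.upper()
    ref = seq[pos]
    if ref == "C" and alt == "T":
        nxt1 = seq[pos + 1:pos + 2]   # base 3' of the C ("" at the sequence end)
        nxt2 = seq[pos + 2:pos + 3]   # second base 3' of the C
        if nxt1 == "G":
            return "CpG"
        if nxt1 in ("A", "C", "T") and nxt2 == "G":
            return "CpHpG"
    elif ref == "G" and alt == "A":
        # The opposite strand carries a C here; its 3' neighbours are the
        # complements of the bases lying 5' of the G on this strand.
        prv1 = seq[pos - 1] if pos >= 1 else ""
        prv2 = seq[pos - 2] if pos >= 2 else ""
        if prv1 == "C":
            return "CpG"
        # complement(prv1) must be A, C or T, i.e. prv1 is T, G or A
        if prv1 in ("A", "G", "T") and prv2 == "C":
            return "CpHpG"
    return "other"


# The CGA>TGA nonsense change discussed earlier is a CpG transition.
context_of_cga = classify_context("CGA", 0, "T")
```

Tallying the returned labels over a list of substitutions gives the observed proportions; the expected proportions would still have to be derived separately from the sequence composition of the genes concerned, which this sketch does not attempt.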
“A model for frameshift mutation can often be hypothesized from knowledge of the DNA sequence and contexts of the mutants and the sequence-specific behaviour of enzymes believed to be involved in mutation”.
L. S. Ripley (1990) Annu. Rev. Genet. 24:189-213.
In addition to the CpG and CpHpG effects discussed above, other types of nucleotide substitution also display context dependence in that substitution rates are dependent upon the identity of the neighbouring bases. For example, Krawczak et al. observed a subtle and locally confined influence of the surrounding DNA sequence on relative rates of single-base-pair substitutions causing human inherited disease. Most notably, T>(C,A), A>(C,G), and G>(T,C) appear to be biased by the nucleotide at position −1, whereas T>(C,G), C>(G,A), A>(T,G), and G>T are biased by the nucleotide at position +1. However, the nearest-neighbour influence decreases markedly with distance from the site of nucleotide substitution. A significant, albeit weak effect was also observed for position +2, but only for five specific substitutions viz. T>C, C>T, A>G and G>(T,C). Interestingly, the six substitutions significantly influenced by the −1 (5′) nucleotide can be matched with their complementary substitutions being significantly influenced by the +1 (3′) neighbour, and vice versa. When nearest-neighbour effects were analyzed in such a way as to allow for neighbouring dinucleotides rather than mononucleotides, more substitutions were found to exhibit a statistically significant (albeit weaker) rate dependency [Krawczak et al., 1998]. Nevertheless, this effect was again found not to extend beyond positions −2 and +3. Hence, in most cases, the influence of the surrounding DNA sequence would appear to extend no further than ~2 bp from the site of nucleotide substitution. This notwithstanding, recent work suggests that the presence in the vicinity of sequences capable of forming non-B DNA may be capable of exerting an influence on nucleotide mutability [Bacolla et al., 2011].
One possible mechanism to account for an influence of the local DNA sequence environment on the nature and location of single base-pair substitutions is misalignment mutagenesis [Kunkel, 1990]. Transient misalignment of the primer template, caused by looping out of a single template base can give rise to nucleotide misincorporation during DNA replication [Kunkel, 1990]. If not promptly repaired, such misaligned structures can be bypassed and extended by low-fidelity DNA polymerases, ultimately giving rise to heritable mutations [Sutton, 2010]. Employing primer-template models in vitro, Chi and Lam [2008; 2009] have shown that the relative stabilities of misaligned DNA structures, and hence the likelihood of their templating mutations, are dependent upon the terminal base-pair at the replicating site, the identity of the templating base and the nature of the upstream and downstream nucleotides. If this were to play an important role in the generation of single-base-pair substitutions in human genes, then a substantial proportion of single base-pair substitutions should exhibit identity between the newly introduced base and one of the bases immediately flanking the site of mutation. Consistent with this prediction, Krawczak et al.  previously showed that mutations causing human inherited disease display a degree of mutational bias that favours substitutions in the direction of the flanking bases, at least for certain codon positions. Mutation toward the 5′ flanking nucleotide was found to occur significantly more often than expected at the second position of the codon but not at the first or last position; mutation toward the 3′ flanking base was found to be favoured at the first position of a codon but was disfavoured at the second position. 
These findings were held to be suggestive of a mutational mechanism, involving positions 1 and 2 in the codon (both of which are critical for the specification of the encoded amino acid residue), that is biased toward the nucleotide at the other position. Inspection of the genetic code revealed that such a bias invariably serves to avoid the de novo introduction of termination codons [Krawczak et al., 1998]. Finally, although no specific preponderance of repeat-sequence motifs was noted in the vicinity of the nucleotide substitutions, a moderate correlation between the relative mutability and thermodynamic stability of DNA triplets emerged [Krawczak et al., 1998]. This was suggestive either of inefficient DNA replication in regions of high stability or the transient stabilization of misaligned intermediates. Not surprisingly, nearest neighbour effects are not confined to mutations causing inherited disease. Indeed, they are also evident in the spectrum of single nucleotide polymorphisms in the human genome [Zhao and Boerwinkle, 2002; Zhang and Zhao, 2004; Jiang and Zhao, 2006b] as well as in the context of evolutionary substitutions [Blake et al., 1992; Hess et al., 1994; Siepel and Haussler, 2004; Nevarez et al., 2010; Ma et al., 2010], findings that argue strongly for the ubiquity of the underlying mutational mechanisms.
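The prediction derived from misalignment mutagenesis, that the newly introduced base should often be identical to one of the bases immediately flanking the mutated site, lends itself to a simple check. The sketch below is illustrative only; the names flank_identity and codon_position, and the use of 0-based coordinates within a coding sequence, are assumptions made here rather than features of the analyses cited above.

```python
from typing import Tuple


def flank_identity(seq: str, pos: int, alt: str) -> Tuple[bool, bool]:
    """Report whether the new base `alt` at 0-based index `pos` matches the
    5' flanking base and/or the 3' flanking base (hypothetical helper)."""
    seq, alt = seq.upper(), alt.upper()
    five = seq[pos - 1] if pos > 0 else None              # base at position -1
    three = seq[pos + 1] if pos + 1 < len(seq) else None  # base at position +1
    return (alt == five, alt == three)


def codon_position(cds_index: int) -> int:
    """Codon position (1, 2 or 3) of a 0-based coding-sequence index,
    assuming the coding sequence starts in frame at index 0."""
    return cds_index % 3 + 1


# A T>A change in the codon ATG is a mutation toward the 5' flanking base,
# and it falls at the second position of the codon.
toward_5prime, toward_3prime = flank_identity("ATG", 1, "A")
position_in_codon = codon_position(1)
```

Counting such matches separately for each codon position, and comparing the counts with those expected from base composition, would be one way of approaching the kind of bias described above; the expectation itself is not modelled here.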
The molecular basis of the sequence dependency of human mutation is clearly complex since the extensive inter- and intra-chromosomal variation in the mutation rate cannot be due entirely to neighbouring nucleotide effects [Hodgkinson et al., 2009]. Instead, a given mutational spectrum is likely to result from a combination of a number of different processes such as (i) the sequence specificity of both exogenous mutagens and endogenous mutational mechanisms, (ii) cellular attempts to repair the mutation in question followed by replication of the repaired DNA and (iii) chromatin composition (i.e. bulk vs. epigenetically modified nucleosomes [Tolstorykov et al., 2011]). Whilst imbalances in intracellular pools of dNTPs have long been known to exert a general mutagenic effect [Mathews, 2006], different exogenous mutagens can target specific sequence contexts [Pfeifer and Besaratinia, 2009]. Thus, both benzo[a]pyrene and UV light have been reported to display a target site specificity for CpG dinucleotides [Denissenko et al., 1997; You and Pfeifer, 2001], although both mutagens are rather more likely to be relevant to mutagenesis in the soma than in the germline. However, since the context-dependent pattern of (germline) mutations occurring during mammalian evolution correlates strongly with empirically determined patterns of oxidative damage [Stoltzfus, 2008; Sedelnikova et al., 2010], we may infer that oxidative damage probably plays a key role in germline mutation and that at least some of the context dependency of mutations is bound up with this mechanism of mutagenesis [Hsu et al., 2004]. This may be particularly relevant for sequences containing clustered guanine residues, since during repair these readily oxidized bases may give rise to opposing (or nearly opposing) single-strand breaks, which may then yield mutagenic double-strand breaks [Sedelnikova et al., 2010].
Different DNA polymerases and repair enzymes also exhibit their own characteristic sequence specificities and error signatures [Donigan and Sweasy, 2009; Mazurek et al., 2009; Korona et al., 2011; Lange et al., 2011]. In transcribed regions, transcription-coupled repair gives rise to inter-strand asymmetries in the mutation rate [Green et al., 2003; Polak and Arndt, 2008; Mugal et al., 2009] which are superimposed upon the intrinsic replication-associated mutational asymmetries that are thought to result from a combination of (i) the unequal rates of complementary base misincorporation by DNA polymerases and (ii) the different efficiencies of action of DNA mismatch repair enzymes on the leading and lagging DNA strands [Chen et al., 2011]. Different base mismatches, arising as a consequence of base misincorporation during DNA replication, display context dependency with respect to helix stability [SantaLucia and Hicks, 2004] and this strongly influences the local sequence bias exhibited by the resulting mutations [Nakken et al., 2010]. Local DNA flexibility is also known to be capable of modulating the efficiency of the enzymes involved in both base excision repair and mismatch repair [Seibert et al., 2002; Seibert et al., 2003; Wang et al., 2003; Isaacs and Spielmann, 2004] and this flexibility is itself sequence-dependent [Geggier and Vologodskii, 2010; Peters and Maher, 2010]. Finally, DNA repair efficiency may be influenced by nucleosome positioning [Ying et al., 2010] which is also DNA sequence-dependent [Chung and Vingron, 2009; Cui and Zhurkin, 2010; Wu et al., 2010]. Thus, a variety of different properties of a given DNA sequence and its structure are likely to impact on the inherent mutability of that sequence and the efficiency with which mutations arising are subsequently repaired.
Gene conversion occurs during homologous recombination and refers to the unidirectional transfer of genetic material from a ‘donor’ sequence to a highly homologous ‘acceptor’ [reviewed in Chen et al., 2007]. It can affect paralogous sequences (nonallelic homologous gene conversion) or different alleles at a given locus. Gene conversion appears to be most efficient when the sequences involved share homology over a range between 295 bp and 1 kb, but efficiency tails off rapidly if the length of the homologous stretch is less than 200 bp [Liskay et al., 1987]. The suggestion has also been made that gene conversion occurs optimally when the homology of the paralogous sequences involved exceeds 92% [Wolf et al., 2009].
A variety of DNA sequences, including direct repeats, inverted repeats, minisatellite repeats, the χ recombination hotspot and alternating purine–pyrimidine tracts with Z-DNA-forming potential have frequently been noted in association with gene conversion events in human genes indicative of the sequence-directed nature of this mutational mechanism [see Chuzhanova et al., 2009]. These somewhat anecdotal findings have recently been formalized by a methodical statistically-based analysis of 27 well-characterized human gene conversion mutations [Chuzhanova et al., 2009]. The lengths of the maximal converted tracts (MaxCTs) associated with these pathogenic gene conversions tended to be fairly short, rarely exceeding 1 kb. In silico analysis of the DNA sequence tracts involved in the 27 non-overlapping pathogenic gene conversion events in 19 different genes yielded several novel findings [Chuzhanova et al., 2009]. First, gene conversion events tended to occur preferentially within (C+G)- and CpG-rich regions. Second, sequences with the potential to form non-B DNA structures were found to occur disproportionately within MaxCTs and/or short flanking regions. Third, MaxCTs were enriched in several sequence motifs including a truncated version of the χ element (a TGGTGG motif) and the classical meiotic recombination hotspot, CCTCCCCT. Finally, there was a tendency for gene conversion to occur in genomic regions that had the potential to fold into stable hairpin conformations [Chuzhanova et al., 2009].
Another important aspect of this topic that is relevant to our brief relates to the phenomenon of biased gene conversion (BGC). Gene conversion is said to be biased if one of the two DNA molecules involved in the gene conversion event is more likely than the other to be the donor. In the case of allelic gene conversion, BGC leads to an excess of the ‘favoured’ allele in the pool of gametes and therefore tends to increase the frequency of this allele in the general population. Analysis of polymorphism and nucleotide substitution patterns in primate genomes has provided firm evidence for the action of BGC favouring GC alleles over AT alleles, i.e. the derived allele frequency of AT>GC mutations is higher than that of GC>AT mutations [Duret and Galtier, 2009; Clément and Arndt, 2011]. Recently, it has been shown that the spectrum of missense polymorphisms in human populations exhibits the footprints of GC-favoured BGC [Necşulea et al., 2011]. This pattern cannot be explained in terms of selection and is evident with all nonsynonymous mutations, including those implicated in human genetic disease. Necşulea et al.  have speculated that the genes most likely to be influenced by this effect will be those that are AT-rich (i.e. those genes for which the opportunities for AT>GC mutations are maximized) and which coincide with recombination hotspots, “an additional argument for these hotspots being an Achilles’ heel of the human genome”.
“Just as the constant increase of entropy is the basic law of the universe, so it is the basic law of life to be ever more highly structured and to struggle against entropy”.
The sequence context of microdeletions and microinsertions (<21 bp in length) causing human genetic disease was studied by Ball et al. who analysed a total of 3,767 microdeletions (from 426 genes) and 1,960 microinsertions (from 307 genes). Deletions of 1 bp were the most common type of microdeletion analyzed (48% of the total) while 2,815 microdeletions (75% of the total) were between 1 and 3 bp in length. Of the 3,144 microdeletions located within coding regions, 2,758 (88%) were of a length that was not a multiple of three and hence would be expected to alter the reading frame. Some 45% of microdeletions led to the removal of a repeated sequence, an event termed “deduplication” by Kondrashov and Rogozin in order to highlight the identity of the deleted sequence and the sequence abutting the site of deletion; Kondrashov and Rogozin observed a deduplication frequency of 66%. In the study of Ball et al., the proportion of deduplications decreased with increasing length of the deletion. For deletions of 2–5 bp, 38% were found to be deduplications whereas for deletions of ≥6 bp it was only 3%. By contrast, some 85% of microinsertions represented duplications of sequence bordering the site of mutation, comparable to the 81% reported by Kondrashov and Rogozin; this proportion was independent of the length of the insertion. Ball et al. reported that 1 bp constituted by far the most common length of microinsertion, with 66% of the total being of this size. As with microdeletions, the distribution was somewhat skewed, with some 1,571 microinsertions (80%) being between 1 and 3 bp in length. Of the 1,660 microinsertions located within gene coding regions, 1,556 (94%) were of a length that was not a multiple of three, and which would therefore be expected to alter the reading frame. Comparable results have been reported from extensive surveys of microinsertion/microdeletion polymorphisms in the human genome [Mills et al., 2006; Tan and Li, 2006].
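The two criteria applied in this paragraph, whether a lesion alters the reading frame and whether it removes or adds a copy of the sequence abutting the site of mutation, can be stated compactly. The following is a minimal sketch under simplifying assumptions: the function names are hypothetical, coordinates are 0-based, and only exact identity with the immediately adjacent sequence of the same length is tested, which is likely to be a narrower definition than that used in the studies cited.

```python
def alters_reading_frame(length: int) -> bool:
    """A coding-region indel shifts the frame unless its length is a multiple of three."""
    return length % 3 != 0


def is_deduplication(seq: str, start: int, length: int) -> bool:
    """True if the deleted segment seq[start:start+length] is identical to the
    segment of equal length immediately upstream or downstream of it."""
    if length <= 0:
        raise ValueError("length must be positive")
    deleted = seq[start:start + length]
    upstream = seq[max(0, start - length):start]
    downstream = seq[start + length:start + 2 * length]
    return deleted in (upstream, downstream)


def is_duplication(seq: str, pos: int, inserted: str) -> bool:
    """True if `inserted`, placed between seq[pos-1] and seq[pos], copies the
    sequence immediately bordering the insertion site on either side."""
    n = len(inserted)
    if n == 0:
        raise ValueError("inserted sequence must not be empty")
    return inserted in (seq[max(0, pos - n):pos], seq[pos:pos + n])


# Proportions quoted in the text, recomputed from the stated counts.
frameshift_share_deletions = round(100 * 2758 / 3144)   # percent of coding microdeletions
frameshift_share_insertions = round(100 * 1556 / 1660)  # percent of coding microinsertions
```

For example, deleting one T from within the run in GATTTTCA counts as a deduplication under this definition, whereas deleting CT from GACTGA does not.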
Ball et al.  found that many of the lesions of >1 bp were potentially explicable in terms of slippage mutagenesis, and involved the addition or removal of one copy of a mono-, di-, or tri-nucleotide repeat. Various sequence motifs were found to be over-represented in the vicinity of both microinsertions and microdeletions, including the heptanucleotide CCCCCTG that shares homology with the complement of the 8-bp human minisatellite conserved sequence/χ-like element (GCWGGWGG) [Ball et al., 2005]. The previously reported indel hotspot GTAAGT [Chuzhanova et al., 2003a] and its complement ACTTAC were also found to be overrepresented in the vicinity of both microinsertions and microdeletions, thereby providing a first example of a mutational hotspot that is common to different types of gene lesion. Other motifs overrepresented in the vicinity of microdeletions and microinsertions included DNA polymerase pause sites and topoisomerase cleavage sites [Ball et al., 2005]. Analysis of DNA sequence complexity also demonstrated that a combination of slipped mispairing mediated by direct repeats, and secondary structure formation promoted by symmetric elements, can account for the majority of documented microdeletions and microinsertions [Ball et al., 2005]. Thus, microinsertions and microdeletions exhibit strong similarities in terms of the characteristics of their flanking DNA sequences, implying that they are generated by very similar underlying mechanisms.
Once again, replication slippage is the key to understanding the genesis of microdeletions and microinsertions. Replication slippage involves DNA polymerase pausing at a direct repeat sequence, enzyme dissociation, reannealing of the polymerase to a second direct repeat copy in the vicinity to generate a misaligned intermediate, followed by resumption of DNA replication [Kunkel, 2004; Garcia-Diaz et al., 2006]. In vitro studies have shown that the fidelity of DNA replication is strongly dependent upon both the local DNA sequence environment and the type of DNA polymerase involved [Kunkel, 2004; Loeb and Monat, 2008]. Further, different DNA polymerases appear to be characterized by subtly different types of misalignment mutagenesis during DNA replication/repair, giving rise to different types of lesion [Eckert et al., 2002; Wolfle et al., 2003; Tippin et al., 2004; Zhang and Dianov, 2005; Arana et al., 2007; Lyons and O’Brien, 2010]. The considerable explanatory value of these studies for slippage-mediated mutagenesis in vivo is evidenced by the concordance noted between in vitro and in vivo mutational spectra [Muniappan and Thilly, 2002]. Further, it has long been recognized that mononucleotide tracts are hotspots for microinsertions and microdeletions causing human genetic disease [Kondrashov and Rogozin, 2004; Truong et al., 2010; Ivanov et al., 2011] while Ball et al.  noted that oligonucleotides of 5–7 bp that were overrepresented in the vicinity of both microdeletions and microinsertions frequently contain A, C, or G mononucleotide tracts of 4–7 bp.
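As a rough illustration of why mononucleotide tracts are natural substrates for slippage, the sketch below locates such tracts in a sequence and models the outcome of a slippage event as the gain or loss of one repeat unit. It is a toy model rather than a description of polymerase behaviour: the function names, the default minimum tract length of 4 bp (chosen to echo the 4–7 bp tracts mentioned above) and the 0-based coordinates are all assumptions introduced here.

```python
import re
from typing import List, Tuple


def mononucleotide_tracts(seq: str, min_len: int = 4) -> List[Tuple[int, str, int]]:
    """Return (start, base, length) for every run of a single base that is
    at least `min_len` long (hypothetical helper)."""
    # A base followed by at least (min_len - 1) further copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(seq.upper())]


def slip(seq: str, start: int, unit_len: int, gain: bool) -> str:
    """Return `seq` after gaining (microinsertion) or losing (microdeletion)
    one copy of the repeat unit that begins at `start`."""
    unit = seq[start:start + unit_len]
    if gain:
        return seq[:start] + unit + seq[start:]
    return seq[:start] + seq[start + unit_len:]


example = "GCAAAAATCGGGGT"
tracts = mononucleotide_tracts(example)
inserted = slip("GCAAAAAT", 2, 1, gain=True)    # one extra A in the tract
deleted = slip("GCAAAAAT", 2, 1, gain=False)    # one A fewer in the tract
```

Because every copy within a tract is identical, the position at which the unit is gained or lost within the tract cannot be recovered from the mutant sequence, which is why such lesions are usually reported as a change in tract length.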
In his study of inherited mutations in a total of 20 human genes, Kondrashov reported a strong correlation between the rates of microdeletion and microinsertion. Such a correlation was also evident for the much larger number of genes examined by Ball et al. [2005]. The observation that the propensity of a given gene to undergo microdeletion is related to its propensity to undergo microinsertion could be a consequence of the presence of certain DNA sequences that are prone to both types of lesion [Truong et al., 2010]. Consistent with this view, Ball et al. [2005] reported strong similarities between microinsertions and microdeletions in terms of the sequence characteristics and repetitivity of the flanking DNA sequence, the overrepresentation of motifs known to play a role in recombination, mutation, cleavage, and rearrangement, and the likely involvement of various types of repetitive sequence element in the mutational mechanism. Similar conclusions have been drawn from studies of microdeletions and microinsertions identified in an evolutionary context [Zhang and Gerstein, 2003; Taylor et al., 2004; Messer and Arndt, 2007; Tanay and Siggia, 2008; Kvikstad et al., 2009; Sjödin et al., 2010]. Taken together, these results are consistent with the view that microdeletions and microinsertions are generated by very similar sequence-directed molecular mechanisms. The observation, noted above, that a GTAAGT hotspot of indel formation [Chuzhanova et al., 2003a] is significantly overrepresented in the vicinity of both microdeletions and microinsertions [Ball et al., 2005; Ivanov et al., 2011] suggests that some sequence motifs may represent hotspots for different types of mutation. Any such model, perhaps involving the repeat-mediated formation and resolution of secondary structure intermediates, would be most satisfying in that it could serve to mechanistically unify the various different types of microrearrangement described in human genes.
“The distribution of break points in human chromosomes….is non-random with seemingly preferential breakages in negative band areas in terms of Giemsa banding. The determination of ‘hot-spots’ for breakage in the human genome may help us in….investigating the cause or causes which give rise to some of these abnormalities”.
C.W. Yu, D.S. Borgaonkar & D.R. Bolling (1978) Hum. Hered. 28:210-225.
Structural variation of the human genome is characterized by a variety of different types of gross rearrangement including deletions, duplications, insertions (termed Copy Number Variants, CNVs) as well as inversions and translocations. Four major mutational mechanisms account for these structural variants (SVs): nonallelic homologous recombination, non-homologous end joining, replication-based mechanisms and L1-retrotransposition (Fig. 1) [Conrad et al., 2010; Kidd et al., 2010; Mills et al., 2011]. In what follows, we shall describe some well-studied examples of structural variation in the human genome, with an emphasis on disease-associated SVs as well as gross chromosomal aberrations such as translocations and isochromosomes that illustrate the sequence-directed nature of the above mentioned mutational mechanisms.
Sequence analysis of the breakpoints of 1,054 SVs identified in the genomes of 17 healthy human individuals of different geographical origins indicated that NAHR accounts for 22.5% of insertions and deletions, as well as 69.1% of the inversions identified [Kidd et al., 2010; Fig. 2]. The majority of SVs identified in this study are likely to represent more or less neutral polymorphisms but at least 1% are estimated to be disease-associated. Interestingly, some of the SVs that segregate as polymorphisms within the normal population predispose to further structural changes such as disease-associated deletions and duplications [Antonacci et al., 2010; Ciccone et al., 2006; Giglio et al., 2002; Gimelli et al., 2003; Hobart et al., 2010; Osborne et al., 2001; Visser et al., 2005]. Thus, for example, heterozygosity for a ~970 kb inversion polymorphism of the MAPT locus [MIM# 157140] at 17q21.3 predisposes to the NAHR events that underlie the 17q21.31 microdeletion syndrome [MIM# 610443; Antonacci et al., 2009; Koolen et al., 2006; Koolen et al., 2008; Rao et al., 2010; Shaw-Smith et al., 2006]. The most likely explanation for this phenomenon is that inversion heterozygosity perturbs the pairing of homologous chromosomes during meiosis, which then promotes interchromosomal NAHR between the inversion-flanking low copy repeats (LCRs) thereby giving rise to the 17q21.3 microdeletion.
NAHR-mediated SVs are not randomly distributed across the human genome but rather are frequently located within complex regions that are enriched with segmental duplications. NAHR between segmental duplications not only causes submicroscopic CNVs giving rise to microdeletion and microduplication syndromes [reviewed by Guo et al., 2008; Stankiewicz and Lupski, 2010], but is also involved in the generation of cytogenetically visible chromosomal aberrations including isodicentric chromosomes and translocations. Thus, the isodicentric Xp11 chromosomes responsible for Turner syndrome do not simply occur at random but instead are mediated by NAHR between large inverted repeats comprising repetitive gene clusters and segmental duplications, which themselves correspond to regions of CNV [Scott et al., 2010; Koumbaris et al., 2011]. Recent findings also indicate that NAHR represents a major mechanism underlying unbalanced recurrent translocations, which are mediated either by interchromosomal LCRs or segmental duplications located on non-homologous chromosomes [Ou et al., 2011]. Regions containing highly redundant gene duplications such as those involving the olfactory receptor multigene family, located in the subtelomeric regions of human chromosomes, appear to be particularly prone to mediate interchromosomal NAHR causing recurrent translocations. These findings serve to emphasize the point that segmental duplications or LCRs are ubiquitous ‘soft spots’ in the human genome that have the potential to mediate SVs and other chromosomal rearrangements such as translocations. Clearly, not all LCRs are prone to undergo recurrent NAHR events; as deduced from LCRs that are known to be involved in recurrent pathogenic large deletions and duplications, the sequence requirements for LCRs to be frequently involved in mediating genomic instability include >95% sequence identity, >10 kb of LCR length and a distance between the LCRs of 50 kb-10 Mb [Bailey et al., 2002]. 
Based upon these criteria, a map of potential ‘rearrangement hotspots’ in the human genome has been defined and some of these predicted hotspots have already been found to be prone to recurrent disease-associated SVs [Mefford et al., 2007, 2008; Sharp et al., 2006, 2008; Shaw-Smith et al., 2006; Ou et al., 2011]. It should be kept in mind that LCRs are themselves non-randomly distributed at the chromosomal level [Bailey and Eichler, 2006; Marques-Bonet and Eichler, 2009].
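The sequence requirements quoted above (>95% sequence identity, >10 kb of LCR length, and an inter-LCR distance of 50 kb-10 Mb) lend themselves to a simple filter. The sketch below shows one way such a filter might be expressed in Python; the class `LCRPair`, the function `meets_nahr_criteria` and the example values are hypothetical, and the code is not the procedure used to build the published hotspot map.

```python
from dataclasses import dataclass

@dataclass
class LCRPair:
    identity: float    # fractional sequence identity between the two copies (0-1)
    length_bp: int     # length of the LCR, in bp
    distance_bp: int   # distance separating the two LCR copies, in bp

def meets_nahr_criteria(pair):
    """Apply the three thresholds quoted in the text to one LCR pair."""
    return (pair.identity > 0.95
            and pair.length_bp > 10_000
            and 50_000 <= pair.distance_bp <= 10_000_000)

# Hypothetical LCR pairs, for illustration only.
candidates = {
    "pair_A": LCRPair(identity=0.98, length_bp=24_000, distance_bp=1_500_000),
    "pair_B": LCRPair(identity=0.93, length_bp=24_000, distance_bp=1_500_000),
    "pair_C": LCRPair(identity=0.99, length_bp=6_000, distance_bp=200_000),
}
for name, pair in candidates.items():
    print(name, meets_nahr_criteria(pair))
```

Whether the distance bounds should be inclusive is not specified in the text; the inclusive comparison used here is an assumption.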
Pathogenic NAHR and normal meiotic AHR (allelic homologous recombination) appear to have similar sequence requirements, as suggested by the spatial coincidence of AHR and meiotic NAHR hotspots [Lindsay et al., 2006; De Raedt et al., 2006]. This view is supported by the observation that the 13-bp sequence motif CCNCCNTNNCCNC, located within 40% of AHR hotspots, is also present in the NAHR hotspots that mediate CNVs [Myers et al., 2008]. PRDM9, a meiosis-specific protein which contains zinc finger arrays, binds to this motif and targets the initiation of recombination to specific locations (hotspots) in the genome [Baudat et al., 2010; Berg et al., 2010; Parvanov et al., 2010]. Genetic variation at the PRDM9 locus has been shown to exert a powerful effect on recombination hotspot activity in sperm. Further, subtle changes within the zinc finger array serve to create hotspot-non-activating or -enhancing variants, suggesting that PRDM9 is a major regulator of AHR hotspot activity in the human genome [Berg et al., 2010]. Importantly, genetic variation at the PRDM9 locus [MIM# 609760] also influences NAHR activity as is evident in the context of the Charcot-Marie-Tooth type 1A-repeat (CMT1A-REP)-mediated duplications and deletions at 17p11.2 [MIM# 118220]; in the sperm of healthy donors homozygous for the A allele of PRDM9, de novo rearrangements between the CMT1A-REPs were observed >20-fold more frequently than in individuals homozygous for non-A alleles [Berg et al., 2010]. Taken together, these findings indicate that the locations of meiotic NAHR hotspots are not only determined by highly homologous target sequences but also by specific DNA sequence motifs and the proteins (such as PRDM9) which bind to them so as to perform their functions as trans-regulators of meiotic recombination.
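A degenerate motif such as CCNCCNTNNCCNC can be searched for by translating each N into a character class. The following Python sketch shows one possible approach; the function names and the example sequence are assumptions made for illustration, and only the strand supplied is scanned.

```python
import re

HOTSPOT_MOTIF = "CCNCCNTNNCCNC"  # 13-bp motif quoted in the text

def motif_to_regex(motif):
    """Translate a motif containing N (any base) into a compiled regex.

    The pattern is wrapped in a lookahead so that overlapping
    occurrences are also reported.
    """
    body = "".join("[ACGT]" if base == "N" else base for base in motif)
    return re.compile("(?=(" + body + "))")

def find_motif(seq, motif=HOTSPOT_MOTIF):
    """Return the 0-based start positions of all motif occurrences."""
    return [m.start() for m in motif_to_regex(motif).finditer(seq.upper())]

# Hypothetical sequence containing one instance of the motif.
example = "AAACCTCCCTAACCACTTT"
print(find_motif(example))
```

To cover both strands, the same function could be applied to the reverse complement of the sequence.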
Disease-associated NAHR also occurs in mitotic cells [reviewed by Moynahan and Jasin, 2010]. Although both meiotic NAHR and mitotic NAHR may be mediated by the same pairs of LCRs [Carvalho and Lupski, 2008; Messiaen et al., 2011], they are very likely to differ in terms of the underlying determinants for DSB formation, since SPO11 and other recombination initiating factors are expressed exclusively in meiotic cells [Shannon et al., 1999; Hayashi et al., 2005]. This is consistent with the observation that mitotic NAHR events causing large deletions of the NF1 gene region do not cluster in highly localized hotspots that are limited to a few hundred base-pairs, in contrast to the majority of NF1 deletions which are caused by meiotic NAHR [De Raedt et al., 2006; Roehl et al., 2010]. The observed properties of type-1 NF1 deletions are largely consistent with the finding that certain NAHR hotspots predominate during meiosis and are found only rarely (or not at all) during mitosis [Messiaen et al., 2011; Turner et al., 2008]. Breakpoint regions of structural variants generated by meiotic NAHR events have been previously found to be (i) biased toward GC-rich regions and (ii) to manifest higher DNA helix stability and lower DNA flexibility as compared with rearrangements caused by NHEJ [Lam et al., 2010; Lopez-Correa et al., 2001; Visser et al., 2005]. Interestingly, both the DNA stability and GC content have been found to be significantly higher in the PRS1 and PRS2 meiotic NAHR hotspots causing type-1 NF1 deletions than in the breakpoint regions of the mitotic type-2 NF1 deletions [Roehl et al., 2010]. However, in passing, we should point out that mitotic NAHR-mediated deletions also appear to be sequence-directed since short repeats capable of forming non-B DNA structures have been found to be over-represented within the breakpoint regions of mitotic type-2 NF1 deletions [Roehl et al., 2010].
The defining characteristic of NHEJ (Fig. 1) is the ligation of DSB ends without the requirement for extensive homology, in stark contrast to the mechanism of homologous recombination. The presence of terminal microhomologies (typically 1-3 bp) facilitates canonical NHEJ (C-NHEJ) but this appears not to be an absolute requirement for it to occur. C-NHEJ of ends from simultaneous DSBs accounts for a diverse range of genomic rearrangements [Chen et al., 2010; Kidd et al., 2010].
Increasing evidence has emerged to support the view that when the core C-NHEJ factors (i.e. Ku and/or DNA ligase IV-XRCC4) are absent, DSB ends can still be repaired by NHEJ. This latter pathway, originally termed microhomology-mediated end joining (MMEJ) is now commonly known as alternative NHEJ (A-NHEJ) [Boboila et al., 2010; Fattah et al., 2010; Helmink et al., 2011; Lee-Theilen et al., 2011; Simsek and Jasin, 2010; Yan et al., 2007; Zhang and Jasin, 2011]. The process of A-NHEJ is presumed to involve a 5′ to 3′ end resection of DNA DSB(s), thereby exposing microhomologies between the resulting two 3′ single-strand DNA tails; subsequent annealing at the region of microhomology followed by 3′-flap removal and gap filling then gives rise to deletions or translocations [Lee-Theilen et al., 2011; Zhang and Jasin, 2011]. As compared with C-NHEJ, A-NHEJ is inherently more prone to generate large genomic rearrangements, particularly translocations [Boboila et al., 2010; Fan et al., 2010; Helmink et al., 2011; Simsek and Jasin, 2010; Yan et al., 2007]. Approximately 30-50% of all structural variants in the human genome have originated through microhomology–mediated NHEJ events [Conrad et al., 2010; Kidd et al., 2010].
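The extent of microhomology at a deletion junction can be estimated directly from the reference sequence and the breakpoint coordinates. The Python sketch below follows one common convention, which is assumed here rather than taken from the studies cited: the microhomology is the number of bases over which the deletion could be shifted left or right without changing the resulting sequence. The function name and the example sequences are hypothetical.

```python
def junction_microhomology(ref, del_start, del_end):
    """Length of microhomology flanking a deletion of ref[del_start:del_end].

    Coordinates are 0-based and half-open. The value returned is the number
    of positions by which the deletion can be shifted without altering the
    deleted product.
    """
    if not 0 <= del_start < del_end <= len(ref):
        raise ValueError("invalid deletion coordinates")
    ref = ref.upper()
    # Bases at the start of the deleted segment that match those just after it.
    right = 0
    while del_end + right < len(ref) and ref[del_start + right] == ref[del_end + right]:
        right += 1
    # Bases just before the deleted segment that match those at its end.
    left = 0
    while del_start - 1 - left >= 0 and ref[del_start - 1 - left] == ref[del_end - 1 - left]:
        left += 1
    return left + right

# Hypothetical reference: a 3-bp repeat (ACC) flanks the deleted segment.
ref = "TTGACCTGAAGTCACCGTCA"
print(junction_microhomology(ref, 3, 13))
```

A value of 0 corresponds to a blunt join; values of 1-3 bp fall within the range described above as typical of terminal microhomologies.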
Although some NHEJ events will have resulted from the repair of DSBs that originated quasi-randomly, there are also many well documented cases in which the location of the NHEJ-initiating DSBs appears to be highly dependent upon the local DNA sequence environment. The role of the local DNA sequence context in generating NHEJ-mediated germline mutations is exemplified by the constitutional t(11;22), the most common type of recurrent non-Robertsonian translocation in humans. The breakpoint sequences of both chromosomes are characterized by several hundred base-pairs of inverted AT-rich repeats; similar sequences have also been identified at the breakpoints of other non-recurrent translocations [Kehrer-Sawatzki et al., 1997; Kurahashi et al., 2010; Rhodes et al., 1997]. Evidently, NHEJ of two ends from different DSBs requires such ends to be physically located in the immediate vicinity. In mammalian cells, high-precision tracking of tagged broken chromosome ends indicates that these ends can only partially separate and, consequently, DSBs preferentially undergo translocations with those chromosomes with whom they share nuclear space [Soutoglou et al., 2007; Wijchers and de Laat, 2011]. This provides strong support for the ‘contact-first’ hypothesis, which proposes that interactions between different DSBs can only take place if they are colocalized at the time of DNA damage [Nikiforova et al., 2000]. Consistent with this hypothesis, close spatial proximity has been observed between several frequent translocation partners [reviewed by Meaburn et al., 2007; Wijchers and de Laat, 2011].
A meta-analysis of germ-line and somatic DNA breakpoint junction sequences derived from a total of 219 different rearrangements (most of which are likely to be NHEJ events) underlying human inherited disease and cancer allowed the first methodical examination of the local DNA sequence environment of translocation and deletion breakpoints across a wide variety of different gene loci [Abeysinghe et al., 2003; Chuzhanova et al., 2003b]. A number of recombination-predisposing motifs and non-B DNA-forming sequences were found to be overrepresented at these breakpoints as compared with randomly selected control sequences, indicative of the sequence-directed nature of many NHEJ mediated rearrangements.
It has been observed that at least one of the breakpoints of NHEJ-mediated rearrangements is often located within repetitive elements (such as LTRs, LINE or Alu elements) and sequence motifs capable of causing DSBs have been frequently identified in the vicinity of the breakpoints of these NHEJ-mediated rearrangements [Inoue et al., 2002; Kehrer-Sawatzki et al., 2005, 2008; Nobile et al., 2002; Oshima et al., 2009; Shaw and Lupski, 2005; Stankiewicz et al., 2003; Toffolatti et al., 2002; Vissers et al., 2009; Yatsenko et al., 2009]. Importantly, the breakpoints of many non-recurrent CNVs mediated by NHEJ map to LCRs [Carvalho et al., 2009; Stankiewicz et al., 2003; Kehrer-Sawatzki et al., 2005, 2008; Shaw and Lupski, 2005; Zhang et al., 2010] suggesting that LCRs can promote genomic instability by inducing certain chromatin secondary structures thereby facilitating NHEJ-mediated rearrangement.
Replication slippage or template switching during replication can account for both small and large deletions and duplications with terminal microhomologies (Fig. 1). Recently, relevant replication-based models including serial replication slippage (SRS) [Chen et al., 2005a; Chen et al., 2005b; Chen et al., 2005c], fork stalling and template switching (FoSTeS) [Lee et al., 2007] and microhomology-mediated break-induced replication (MMBIR) [Hastings et al., 2009], which were collectively termed microhomology-mediated replication-dependent recombination (MMRDR) by Chen et al., have been used to explain the generation of a diverse range of complex genomic rearrangements [Bauters et al., 2008; Carvalho et al., 2009; Chauvin et al., 2009; Collie et al., 2010; Koumbaris et al., 2011; Sheen et al., 2007; Vissers et al., 2009; Zhang et al., 2009, 2010].
For example, DNA replication stalling-induced chromosome breakage has turned out to be an important mechanism causing deletions at chromosomal ends. Different types of telomeric deletions have been described (Fig. 3) [Kulikowski et al., 2010]: type A terminal deletions are formed by chromosomal ends that are stabilized by the capture of a telomere from another source, whereas type B deletions are actually interstitial deletions towards the chromosomal ends. By contrast, type C deletions describe the process by which chromosomal ends are stabilized by telomere healing, namely the telomerase-dependent de novo addition of telomeres at non-telomeric sites. Terminal deletions associated with inverted duplications [Zuffardi et al., 2009] can be classified as either type A or type C. Recently, Hannes et al. [2010] succeeded in cloning the breakpoints of nine chromosome 4p terminal deletions. All nine cases were shown to be type C terminal deletions. Bioinformatics analysis of the breakpoint-flanking regions involved in these nine cases, together with 12 previously fully characterized type C terminal deletions, led to the realization that there is an enrichment in secondary structure-forming sequences and replication stalling site motifs in these regions as compared with a randomly selected sequence dataset [Hannes et al., 2010].
Certain sequence features, such as microsatellites and transposon-rich regions, can serve to induce replication stalling, thereby acting as potential sources of genome instability [e.g. Cha and Kleckner, 2002; Pelletier et al., 2003]. On this basis, Koszul and colleagues proposed a two-step mechanism to account for the generation of large segmental duplications: “First, a replication fork pauses and collapses generating a chromosome breakage. Second, the double-strand break can be processed into a new replication fork either intra- or inter-molecularly by a break-induced replication-like mechanism that does not necessarily need a long sequence homology”. It was this ‘microhomology-dependent BIR’ model (Fig. 1) that was subsequently deployed to explain disease-causing copy number mutations. In MMBIR, replication ends with the engagement of a misaligned template instead of reannealing to its original template; the synthesis of the second strand then follows the synthesis of the first [reviewed in Chen et al., 2010]. In practice, mutations due to SRS/FoSTeS are often indistinguishable from those due to MMBIR. Indeed, the two terms have sometimes been used interchangeably [e.g. Choi et al., 2011; Zhang et al., 2009].
All the replication-based models recently proposed to account for the formation of structural variants and/or mutations in the human genome stress the importance of genomic architectural elements such as palindromic DNA, stem-loop structures, repeats etc, features which may facilitate the initial stalling of the replication fork [Gu et al., 2008; Chen et al., 2010].
L1 elements comprise ~17% of the human reference genome sequence [Lander et al., 2001]. Retrotranspositionally competent L1 elements are typically ~6.0 kb in length and comprise a 5′-untranslated region (UTR), two non-overlapping open reading frames (ORF1 and ORF2), a short 3′-UTR, and a poly(A) tail. Whereas ORF1 encodes an RNA-binding protein, ORF2 encodes a protein with endonuclease (L1 EN) and reverse transcriptase (L1 RT) activities. L1 retrotransposition is thought to occur by target site-primed reverse transcription; briefly, it would appear that the L1 EN cleaves genomic DNA at a degenerate consensus target sequence (3′-A/TTTT-5′ and variants thereof), thereby freeing up a 3′-OH group that then serves as a primer for the reverse transcription of L1 RNA by L1 RT. The nascent L1 cDNA then recombines with genomic DNA, generating in the process the characteristic hallmarks of L1 retrotransposition such as 5′ truncations, a 3′ poly(A) tail and target site duplications (TSDs) of variable length [Cordaux and Batzer, 2009; Kazazian, 2004]. L1 retrotransposition requires a precise interplay between ORF1p, ORF2p, and L1 RNA [Doucet et al., 2010].
Of the >500,000 L1 copies in the reference human genome, only 80–100 are believed to be capable of active retrotransposition [Brouha et al., 2003]. Recent studies have however revealed that (i) the actual number of highly active or “hot” L1s in the human population is much higher than that identified in the reference human genome [Beck et al., 2010], and (ii) L1 retrotransposition has played a more important role in generating structural variation in the human genome than previously appreciated [Ewing and Kazazian, 2010; Huang et al., 2010; Iskow et al., 2010; Xing et al., 2009]. The rate of L1 retrotransposition in humans has been estimated by one study to be one insertion in every 108 births [Huang et al., 2010] and between 1/95 and 1/270 births by another [Ewing and Kazazian, 2010]. The number of dimorphic L1 elements in the human population with allele frequencies >0.05 is estimated to be between 3,000 and 10,000 [Ewing and Kazazian, 2010], far exceeding the ~400 human L1 retrotransposon insertion polymorphisms (RIPs) registered in dbRIP [Wang et al., 2006a].
L1 retrotransposition can affect the primary structure of the human genome in a variety of ways other than by simple self-insertion. For example, L1 elements are also able to mobilize non-autonomous sequences in trans, including repetitive Alu sequences, SVA (short interspersed nucleotide elements-R, variable-number-of-tandem-repeats, and Alu) elements, and processed pseudogenes [Cordaux and Batzer, 2009; Kazazian, 2004; Konkel and Batzer, 2010] (Fig. 1). In addition, L1 retrotransposition can give rise to large genomic deletions [Callinan et al., 2005; Han et al., 2005; Xing et al., 2009]. L1 elements can also undergo retrotransposition in the germline [Ostertag et al., 2002], during early embryonic development [Garcia-Perez et al., 2007; Garcia-Perez et al., 2010; Kano et al., 2009; van den Hurk et al., 2007], in certain somatic cells [Coufal et al., 2009; Muotri et al., 2005] and in the human lung cancer genome [Iskow et al., 2010].
L1 retrotransposition can also give rise to human inherited disease. Since the first report of Kazazian et al., L1-mediated simple L1, Alu and SVA insertions have been increasingly reported to cause inherited disease [see Chen et al., 2005d for publications prior to 2005 and subsequently, Apoil et al., 2007; Bochukova et al., 2009; Bouchet et al., 2007; Chen et al., 2008; Gallus et al., 2010; Musova et al., 2006]. Following our own retrospective identification of pathogenic large genomic deletions caused by L1-mediated Alu insertions [Chen et al., 2005d], pathogenic large genomic deletions caused by L1-mediated L1 [Miné et al., 2007; Morisada et al., 2010], a number of Alu [Okubo et al., 2007; Schollen et al., 2007] and SVA [Takasu et al., 2007] insertions have been reported in prospective screens while the first cases of L1-driven pseudogene insertion causing human genetic disease have also been reported [Awano et al., 2010; Tabata et al., 2008].
The non-random insertion of L1-mediated retrotranspositional elements into the human genome can be considered at two distinct levels. First, consistent with the known target site specificity for L1 EN, the study of pre-insertion sites of de novo L1 insertions in cultured human cancer cells revealed an AT-rich bias in the 50 bp flanking the insertion sites [Gasior et al., 2007]. The genome-wide profiling of human L1(Ta) retrotransposons has also revealed a tendency for L1(Ta)s to accumulate within AT-rich regions [Huang et al., 2010]. L1(Ta) (transcribed L1, subset a) is the youngest L1 family that is currently capable of active retrotransposition, and hence the L1 family that is largely responsible for generating L1 insertion (presence/absence) polymorphisms in the human genome. In addition, the currently reported pathogenic L1-mediated events have almost invariably integrated at L1 EN consensus target sites. Second, in the abovementioned study of pre-insertion sites of de novo L1 insertion in cultured human cancer cells, a statistically significant cluster of such insertions was localized in the vicinity of the c-myc gene (MYC; MIM# 190080). This finding suggested that in addition to the local sequence determinants (i.e. L1 EN target sites), other features of the flanking genomic region may also influence the insertion preference of L1-mediated insertions [Gasior et al., 2007]. Apparent insertion clusters have also been observed in the context of pathogenic L1-mediated events. Thus, three independent Alu insertions have been found to be integrated into a 104 bp region of the FGFR2 gene [MIM# 176943; Bochukova et al., 2009; Oldridge et al., 1999] while two independent L1 insertions have been reported to have inserted into exon 44 of the dystrophin gene (DMD; MIM# 300377) within an 89 bp region [Musova et al., 2006; Narita et al., 1993].
The above notwithstanding, the most striking finding pertinent to the non-random nature of L1 retrotranspositional insertion is that independent L1 retrotransposition elements can integrate at precisely the same chromosomal sites [Chen et al., 2005d]. Thus, an L1 element and an Alu sequence are known to have become inserted at exactly the same location in the APC gene [MIM# 611731] in two unrelated individuals [Halling et al., 1999; Miki et al., 1992]; whilst the L1 element was a somatic insertion, the Alu sequence was a germline insertion. In addition, two markedly different Alu Ya5a2 elements have become integrated at precisely the same site in the F9 gene [MIM# 300746] causing severe hemophilia B [Vidaud et al., 1993; Wulff et al., 2000]. Finally, an SVA element and an Alu sequence have inserted at the same site within the coding region of the BTK gene [MIM# 300300; Conley et al., 2005]. These observations are consistent with some genomic locations being exquisitely prone to L1 retrotransposition [Chen et al., 2005d].
A canonical Alu element is about 300 base-pairs long, comprising two related GC-rich monomers separated by an A-rich linker region and ending with a poly(A) tail [Cordaux and Batzer, 2009]. Owing to the high frequency (>1 million copies) of complete or partial Alu elements in the human reference genome (~10.6% of the genome sequence) [Lander et al., 2001], they serve as a huge reservoir of sequences for homology-based recombination. Alu-mediated recombination (AMR) between nonallelic sequences is also a frequent cause of human genetic disease as evidenced by the many recently described examples [e.g. Abo-Dalo et al., 2010; Champion et al., 2010; Cozar et al., 2011; Gentsch et al., 2010; Goldmann et al., 2010; Franke et al., 2009; Resta et al., 2010; Shlien et al., 2010; Tuohy et al., 2010; Yang et al., 2010; Zhang et al., 2010].
The importance of Alu elements in the context of mediating genomic deletions is unlikely to be owing simply to their sheer abundance in the human genome. In other words, Alu elements themselves must possess inherent recombination-predisposing properties [Rudiger et al., 1995]. A survey of a small subset (n = 36) of Alu-mediated rearrangements in several human genes identified a 26 bp core sequence that is often located at or close to the sites of recombination [Rudiger et al., 1995]. Importantly, this core sequence contains the pentanucleotide motif CCAGC, which represents a truncated version of the χ recombination hotspot (consensus sequence: 5′-GCTGGTGG-3′ or its complement, 5′-CCACCAGC-3′) [Kenter and Birshtein, 1981; Smith, 1983]. This is likely to have had important implications with respect to many of the subsequently found AMR-mediated pathogenic deletions. In the absence of any meta-analysis or systematic review, we shall mention only two studies, to which some of us contributed. The first study reported a gross HFE [MIM# 613609] deletion consistent with AMR; the 17 bp crossover region contained two sequence motifs, CCACCA and CCAGC, both truncated versions of the χ recombination hotspot [Le Gac et al., 2008]. It should be noted that CCACCA has also been noted to be a mutational ‘super-hotspot’ common to microdeletions, microinsertions, and indels [Ball et al., 2005]. The second study reported, among others, a 2,769 bp SERPINC1 [MIM# 107300] deletion mediated by Alu elements. The 13 bp crossover region (i.e. GCCACCACGCCCG) was also found to contain the CCACCA mutational ‘super-hotspot’ [Picard et al., 2010]. In passing, it should also be appreciated that Alu elements are particularly prone to form non-B DNA structures (e.g. slipped structures) owing to their containing two related GC-rich monomers (Fig. 4).
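The relationships between the χ consensus, its complement and the truncated motifs mentioned above can be checked with a few lines of Python. The sketch below is illustrative only; the function names `reverse_complement` and `find_motifs` are assumptions, while the sequences are those quoted in the text.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq.upper()))

CHI_LIKE_MOTIFS = ("CCACCA", "CCAGC")  # truncated chi motifs named in the text

def find_motifs(seq, motifs=CHI_LIKE_MOTIFS):
    """Map each motif to the list of 0-based positions at which it occurs."""
    seq = seq.upper()
    hits = {}
    for motif in motifs:
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start)
            start = seq.find(motif, start + 1)  # allow overlapping hits
        hits[motif] = positions
    return hits

# The chi consensus and the 13 bp SERPINC1 crossover region quoted above.
print(reverse_complement("GCTGGTGG"))
print(find_motifs("CCACCAGC"))
print(find_motifs("GCCACCACGCCCG"))
```

Run on the 13 bp crossover region, the search locates CCACCA but not CCAGC, in keeping with the description given above.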
“The distribution of interspersed repeats close to and even within genes has brought the mechanism of their mutation into the arena of human molecular genetics. These sequences have a unique form of mutation: variation in copy number. The rate of the mutation is related to the copy number, and therefore, the mutability of the product of a change in copy number is different from that of its predecessor. For this reason, we have termed this mechanism dynamic mutation.”
R.I. Richards & G.R. Sutherland (1992) Cell 70:709-712.
Microsatellites, defined as the repetition of short (1-6 bp) DNA tandem motifs, display somewhat higher (yet individually distinct) mutation rates than the average nucleotide substitution rate genome-wide. Microsatellites comprise ~3% of the human genome [Lander et al., 2001] and the proportion of mononucleotide and dinucleotide repeat tracts that display length polymorphism in the human population (involving the generation of multiple alleles) has been found to increase almost exponentially above a length threshold of ~10 nt [Kelkar et al., 2010]. In similar vein, analyses of length polymorphisms of trinucleotide sequences in several human transcriptomes, assessed from the coding portions of RefSeq genes annotated in the human reference genome, have revealed the existence of multiple length difference alleles above ~25 bp, as opposed to essentially only two alleles for shorter (<~16 bp) tracts [Molla et al., 2009]. With an estimated average mutation rate of 10⁻⁵, microsatellites accumulate mutations at a rate roughly three orders of magnitude (~500-fold) higher than the average rate of nucleotide substitutions genome-wide (2 × 10⁻⁸) [Molla et al., 2009].
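The definition given above (tandem repetition of a 1-6 bp motif) and the ~10 nt length threshold can be combined into a simple scan. The Python sketch below is one possible implementation under stated assumptions: the function name `find_microsatellites`, the default parameters and the example sequence are hypothetical, only perfect (uninterrupted) repeats are detected, and tract boundaries may be off by a base or two where a tract abuts a run of a different repeat.

```python
import re

def find_microsatellites(seq, min_tract_len=10, max_unit=6):
    """Return sorted (start, unit, copies) for perfect tandem repeats.

    Units of 1..max_unit bp are considered; a tract is reported only if
    its total length is >= min_tract_len (0-based start positions).
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        # A unit of fixed length followed by one or more copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1+" % unit_len)
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are themselves repeats of a shorter unit
            # (e.g. "AA" or "CACA"); these are reported at the shorter length.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            if len(m.group(0)) >= min_tract_len:
                hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return sorted(hits)

# Hypothetical sequence containing a (CA)6 tract.
print(find_microsatellites("TTCACACACACACAGG"))
```

Dedicated tools handle interrupted and compound repeats, which this sketch deliberately ignores.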
In addition to length changes, single base changes within microsatellites also occur at higher frequencies than the genome-wide average, not only within the microsatellite repeats but also in the bases adjacent to the repeats [Siddle et al., 2011]. From the analysis of 1000 Genomes Project pilot data, variability genome-wide has been noted to be at its maximum (accounting for 42.5% and 28%, respectively) at the dinucleotides (TG•CA) and (TA•TA) [McIver et al., 2011], which are the most abundant microsatellites, and for which mutation rates of up to 10⁻² per locus per gamete per generation have been reported [Eckert and Hile, 2009]. Parent/child transmission studies have revealed that several of the most mutable loci also contain compound sequences comprising two or more different types of microsatellite repeat [Brinkmann et al., 1998; Dupuy et al., 2004; Eckert and Hile, 2009]. Thus, in addition to tract length, microsatellite sequence composition also exerts a powerful influence on the mutation rate. Systematic analyses, performed in human colorectal cancer cells defective in post-replicative mismatch repair (MMR), have provided evidence for heteroduplex DNA at (A•T)10, (G•C)10, (CA•TG)13 and (CA•TG)23 target microsatellites, with one strand containing the initial number of repeats and the complementary strand containing either +1 or −1 repeats. Hence, the lack of correction by MMR of bulges and unpaired/mispaired loops resulting from the misalignment of repetitive DNA during DNA synthesis appears to be the most plausible mechanism for the observed increase in mutation rates at microsatellite loci (Fig. 5A). Indeed, strand slippage of repetitive DNA motifs represents a significant cause of mutation genome-wide, as mentioned previously. Mutation rates are therefore a consequence of the combined kinetics of strand slippage and its subsequent repair, both of which display a complex dependence upon DNA sequence.
For example, in MLH1-deficient (MMR) HCT116 cells, (A•T)10 repeats display 5- to 15-fold higher susceptibility to replication errors than (G•C)10 repeats [Campregher et al., 2010], whereas the longer (A•T)17 tract is 7- to 15-fold less prone to replication errors than (G•C)17 on the same genetic background [Boyer et al., 2002]. Conversely, in isogenic cells complemented for MMR function, (G•C)16 repeats (which exhibit a ~20- to 60-fold higher error rate during replication than (G•C)10 tracts in the absence of MLH1), are repaired ~10 times more efficiently than (A•T)10 repeats [Campregher et al., 2010]. However, repair efficiency varies by more than two orders of magnitude between different genetic backgrounds [Boyer et al., 2002]. In addition to sequence composition, the repair of slipped-out bases is dependent upon their size and densities along the DNA chain, decreasing sharply as a function of both loop size (1 – 30 bases) and local concentration [Panigrahi et al., 2010]. Finally, slippage-dependent mutation rates at microsatellites are highly sensitive to their flanking sequence. For example, (A•T)7 and (A•T)10 repeat tracts exhibit an ~3-fold higher mutation rate when inserted within exon 10 of the ACVR2A gene [MIM# 102581] than within exon 3 of the TGFBR2 gene [MIM# 190182] in MMR-deficient cells, whereas the converse is seen for (A•T)13 tracts [Chung et al., 2008]. In addition, −2 bp deletions resulting from multiple slippage events were only seen in the TGFBR2 exonic context. The DNA sequence features responsible for these complex patterns are largely unknown; however, both base stacking interactions [Bacolla et al., 2008; Yang, 2008] and energy coupling reactions between bases distally located within loops and/or the flanking duplex region [Völker et al., 2010] are likely to be involved.
Some 30 human inherited diseases associated with neuromuscular and developmental disorders have now been linked to the expansion of a microsatellite repeat within the corresponding disease-associated gene [Brouwer et al., 2009; Lopez Castel et al., 2010; Wells and Ashizawa, 2006]. Expansions generally originate from ‘at-risk’ (premutation) alleles, from which the addition of up to thousands of repeat units, usually trinucleotides, takes place within parent-child transmissions. The number of repeats in normal alleles is highly variable between loci, but is generally limited to fewer than 40-45 repeats. Small expansions into the premutation range (~29-35 repeats within coding regions and ~55-200 repeats in non-coding regions) and/or loss of interruptions within the repeat tract act to destabilize the sequences, which then become increasingly prone to further expansion [Brouwer et al., 2009; McMurray, 2010; Orr and Zoghbi, 2007], triggering an escalating positive feed-back loop that creates pathogenic mutation alleles within a few generations [Wells and Ashizawa, 2006]. As the lengths of the microsatellites increase, the disease symptoms generally become more severe and/or the age of onset decreases, a phenomenon termed ‘genetic anticipation’.
The dramatic intergenerational instability observed in microsatellite expansion diseases (MEDs) differs markedly from the population-based microsatellite instability described above. Indeed, several molecular mechanisms in addition to slippage are believed to occur. The microsatellite sequences involved in MEDs include the trinucleotide repeats (GAA•TTC) in intron 1 of the frataxin gene [FXN; MIM# 606829] in Friedreich ataxia; (CTG•CAG) in the 3′UTR of the DMPK gene [MIM# 605377] in myotonic dystrophy type 1 and the ataxin 8 opposite strand gene [ATXN8OS; MIM# 603680] in spinocerebellar ataxia type 8 (SCA8), in the 5′UTR of the serine/threonine-protein phosphatase 2A 55 kDa regulatory subunit B beta isoform gene [PPP2R2B; MIM# 604325] in SCA12 and within the coding regions of 9 polyglutamine expansion diseases; (CGG•CCG) in the 5′UTR of the fragile X mental retardation 1 gene [FMR1; MIM# 309550] in fragile X syndrome and fragile X-associated ataxia and tremor, the 5′UTR of the AF4/FMR2 family member 2 gene [AFF2; MIM# 300806] in FRAXE-associated mental retardation and the coding regions of 9 polyalanine expansion diseases; the tetranucleotide (CCTG•CAGG) in intron 1 of the CCHC-type zinc finger, nucleic acid binding protein gene [CNBP; MIM# 116955] in myotonic dystrophy type 2 and the pentanucleotide (ATTCT•AGAAT) in intron 9 of the ataxin 10 gene [ATXN10; MIM# 611150] in SCA10 [Lopez Castel et al., 2010; McMurray, 2010; Messaed and Rouleau, 2009]. A tenth polyalanine expansion disorder associated with the Zic family member 3 gene [ZIC3; MIM# 300265] and leading to X-linked heterotaxy with VACTERL association has recently been described [Wessels et al., 2010]. 
Most of these microsatellite sequences have been shown to be capable of adopting specific secondary structures (non-B DNA), including hairpin-loops, three-stranded (triplex) and four-stranded (quadruplex) structures and left-handed Z-DNA [Lopez Castel et al., 2010; Mirkin, 2007; Renciuk et al., 2011; Wells and Ashizawa, 2006]. Below, we review some of the most compelling evidence in support of a role for DNA secondary structures in microsatellite expansion and/or the process of pathogenesis.
In fragile X syndrome, premutations expand to full mutation only upon maternal transmission, whereas full mutations invariably contract to premutations upon paternal transmission [Brouwer et al., 2009; Jin and Warren, 2000]. Expansion is believed to occur early in oogenesis, a stage when primary oocytes remain quiescent (i.e. do not divide) for years. Work both in the mouse and in vitro supports the view that following DNA damage (including oxidation-related DNA damage) within the (CGG•CCG) sequence, repair of the damaged bases (a process which involves both base excision and mismatch repair) entails the formation of stable hairpins on one DNA strand, which then direct DNA synthesis to the complementary strand in order to incorporate the looped-out structures into de novo DNA [Entezam et al., 2010; Lopez Castel et al., 2010; McMurray, 2010], resulting in expansion (Fig. 5B). In contrast to this replication-independent mechanism of repeat expansion, in the majority of the other MEDs in which large expansions also occur, similarly stable hairpins are thought to form, mainly on the lagging strand during DNA replication, thereby blocking DNA synthesis; further resolution of stalled replication forks and reinitiation of synthesis could then lead to expansion [Mirkin, 2007] (Fig. 5C). For the smaller expansions seen in polyglutamine diseases, slippage during DNA replication involving small hairpin-loops, as in the case of microsatellite length polymorphism (see above) remains the most likely mechanism [Lopez Castel et al., 2010; Mirkin, 2007; Wells and Ashizawa, 2006] (Fig. 5A). In polyalanine expansion diseases, in which the coding trinucleotide tracts are short and often interrupted, pedigree analyses support the occurrence of both fork stalling and template switching, triggered by secondary structure formation (Fig. 5D), as well as unequal crossing-over between two normal alleles [Arai et al., 2010; Cocquempot et al., 2009; Messaed and Rouleau, 2009; Warren, 1997] (Fig. 5E).
At the time the molecular basis of Friedreich ataxia [MIM# 229300] was first reported [Campuzano et al., 1996], a substantial number of studies had already been performed on the biophysical properties of the (GAA•TTC) sequence. Indeed, the asymmetric purine•pyrimidine composition was known to enable the formation of three-stranded structures [Wells et al., 1988]. In triplex DNA, the purine-rich strand of duplex DNA binds a third strand through specific Hoogsteen hydrogen bonds, including A:A and G:G (purine-rich third strand) and A:T and G:C+ (pyrimidine-rich third strand) pairs. Thus, mirror symmetry within purine•pyrimidine sequences is required to yield stable triplex structures [Frank-Kamenetskii and Mirkin, 1995]. As expected, long (GAA•TTC) tracts cloned in plasmids were found to interact with each other and form stable intermolecular DNA structures that were interpreted as triplexes (sticky DNA) [Sakamoto et al., 1999]. However, their exceptionally high thermal stability, and the number of negative superhelical turns remaining in plasmids after DNA structure formation, suggest that other conformations, such as duplex-duplex interactions, are also feasible for long (GAA•TTC) repeats [Son et al., 2006]. In Friedreich ataxia, the FXN locus is silenced [Al-Mahdawi et al., 2008]. Within the FXN gene, local chromatin is characterized by hypoacetylation of histones H3 and H4 and methylation of histone H3 at Lys9 (H3K9), which are hallmarks of transcriptionally inactive heterochromatin [Punga and Buhler, 2010]. However, heterochromatin does not spread to the 5′ and 3′ sections of the gene and only transcriptional elongation (rather than initiation) appears to be impaired in patient-derived lymphoblastoid cell lines. 
Moreover, removal of H3K9 methylation marks is ineffective in re-establishing transcriptional elongation [Punga and Buhler, 2010], strongly supporting a model in which the expanded (GAA•TTC) repeat itself or, more likely, its folding into a secondary structure, imposes a direct block upon the transcriptional apparatus.
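The sequence requirement for triplex formation noted above for (GAA•TTC) tracts, namely a strand composed entirely of purines (or entirely of pyrimidines) that also reads the same in both directions, can be expressed as a simple check. The short Python sketch below is illustrative only; the function name `is_mirror_pur_pyr` is hypothetical, and the test ignores tract length, superhelical stress and the other physical factors that determine whether a triplex actually forms.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def is_mirror_pur_pyr(seq):
    """True if `seq` is all-purine or all-pyrimidine AND mirror-symmetric.

    Mirror symmetry here means that the strand reads the same 5'->3' and
    3'->5' (seq == reversed seq); this is distinct from an inverted repeat,
    which involves the reverse complement.
    """
    seq = seq.upper()
    bases = set(seq)
    one_class = bases <= PURINES or bases <= PYRIMIDINES
    return bool(seq) and one_class and seq == seq[::-1]
```

For example, the window `AAGAAGAAGAA` taken from a (GAA)n tract satisfies both conditions, whereas `GAAGAAGAA` is all-purine but not mirror-symmetric as written, and `GATATAG` is mirror-symmetric but mixes purines and pyrimidines.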
Expansion of the (CGG•CCG) repeat in the FMR1 gene also leads to gene silencing in fragile X syndrome. A decrease in histone H3 and H4 acetylation is evident in pathological full mutation alleles, accompanied by de novo methylation of the repeat tract and the upstream CpG island in the promoter region [Jin and Warren, 2000]. Studies in intact ovaries of fetuses and chorionic villus samples harbouring full-mutations [reviewed in Jin and Warren, 2000] suggest that methylation is a dynamic process that takes place over an extended time period. However, the mechanisms by which expanded (CGG•CCG) repeats induce methylation remain unclear. (CGG•CCG) repeats have been shown to fold into hairpin-loops [Amrane and Mergny, 2006; Darlow and Leach, 1998], quadruplexes [Khateb et al., 2004; Usdin and Woodford, 1995] and left-handed Z-DNA [Renciuk et al., 2011], and to possess inherently high flexibility (bending) [Bacolla et al., 1997]; they are also predicted to sustain stable ‘bubbles’ despite their high CG content [Alexandrov et al., 2011]. In vitro, the ability of DNA methyltransferase 1 (DNMT1) to methylate (CGG•CCG) repeats increases with increasing negative supercoiling [Bacolla et al., 2001]. Hence, it is possible that the formation of alternative DNA structures and/or open (denatured) states favored by torsional stress at long repeat tracts might nucleate unscheduled de novo methylation in the FMR1 gene, leading to gene silencing.
In addition to the MEDs described above, length polymorphism at specific microsatellites within genes or their promoters has been associated with phenotypic trait variation and/or susceptibility to disease [Bacolla et al., 2008; Gemayel et al., 2010]. For example, a highly polymorphic (GT•CA)n repeat within the proximal SLC11A1 gene [MIM# 600266] promoter regulates variation in allele expression [Bayele et al., 2007] by directly modulating the recruitment of HIF-1α to the repeat sequence through its ability to interconvert from the canonical right-handed B- to left-handed Z-DNA. In addition, surrogate stimuli of the innate immune response (such as E. coli and S. typhimurium LPS, mannose- and phosphoinositide-capped lipidoarabinomannans from M. bovis and M. smegmatis, respectively), stimulate HIF-1α-dependent transactivation. Given the prominent role of HIF-1α in integrating innate immune responses to infection and inflammation, this SLC11A1 repeat polymorphism is believed to contribute to the heritable variation in susceptibility to infection and/or inflammation that is observed within and between populations [Bayele et al., 2007].
A recent study of the relationship between matrix metalloproteinase genetic polymorphisms and vulnerable plaques in a cohort of patients with cerebrovascular disease revealed a significant association between prognosis and the length of a polymorphic (CA•TG)13-26 microsatellite upstream of the MMP9 [MIM# 120361] transcriptional start site [Fiotti et al., 2011]. Specifically, carriers of ≥22 repeats displayed ~50% larger plaques and had a significantly higher risk of persistent angina and ischemic stroke than non-carriers. Consistent with this association, long (CA•TG)-containing alleles manifest increased MMP9 gene expression relative to shorter ones [Shimajiri et al., 1999].
Other examples of the involvement of polymorphic microsatellites in disease susceptibility include a (CA•TG) dinucleotide repeat in the EGFR [MIM# 131550] 5′UTR and gastrointestinal cancers [Baranovskaya et al., 2009], an (AAAT•ATTT) tetranucleotide repeat in intron 27b of the NF1 gene and mental retardation [Védrine et al., 2011], an (AAAG•CTTT) repeat in the estrogen receptor-related γ (ESRRG; MIM# 602969) 5′UTR and breast cancer [Galindo et al., 2011], a (GGGCGG•CCGCCC) hexanucleotide repeat in the arachidonate 5-lipoxygenase gene (ALOX5; MIM# 152390) and risk of carotid atherosclerosis and myocardial infarction [Vikman et al., 2009], and a (CATT•AATG) tetranucleotide repeat in the macrophage migration inhibitory factor gene (MIF; MIM# 153620) promoter and duodenal ulcer, rheumatoid arthritis and psoriasis [Shiroeda et al., 2010].
The mitochondrial genome differs from the nuclear genome in a variety of different respects, most notably in terms of its high copy number (with the consequent potential for heteroplasmy), matrilineal inheritance, a 10 to 17-fold higher mutation rate despite having its own DNA repair system [Liu and Demple, 2010], active exposure to reactive oxygen species [Sedelnikova et al., 2010], a unique mode of DNA replication [Wanrooij and Falkenberg, 2010] and the virtual lack of any recombination [Krishnan and Turnbull, 2010]. A wealth of knowledge has now accumulated with respect to the spectrum of germline mitochondrial genome mutations that are responsible for heritable mitochondrial disease [Taylor and Turnbull, 2005; Neiman and Taylor, 2009; Wallace, 2010]. Despite these basic differences, the nature, location and frequency of the many different types of mutation in the mitochondrial genome are also strongly influenced by the local DNA sequence environment. Thus, as already reported for the nuclear genome, direct repeats have been frequently noted at mitochondrial DNA (mtDNA) breakpoints in mtDNA deletion syndromes [Samuels et al., 2004]. Indeed, mtDNA deletions may be separated into two types, type I (with a direct repeat) and type II (with an imperfect or no direct repeat), with respect to the sequences present at the two breakpoints. Sadikovic et al. [2010] have recently shown that, irrespective of the presence or absence of a direct repeat, most mtDNA deletions are characterized by an increase in sequence homology surrounding the breakpoints. This finding is consistent with sequence homology being a key determinant of breakpoint location in mtDNA deletion syndromes. 
In accord with an expectation that the longest direct repeats would be likely to demarcate the most dramatic mtDNA deletion hotspots, the most common mtDNA deletion (8470-13447), which is flanked by the longest (13 bp) direct repeat, has been noted in 37% of mtDNA deletion syndrome patients [Sadikovic et al., 2010]. The presence of sequence homologies at the deletion breakpoints is suggestive of a role for sequence homology not only in the generation of the initial break but also in the subsequent repair of the mtDNA damage. It has been suggested that direct repeats serve to promote breakpoint generation when there is an error in mtDNA replication due either to the illegitimate alignment of direct repeats [Holt et al., 2000] or to mtDNA damage [Krishnan et al., 2008]. Defects in mtDNA replication, resulting from the inappropriate alignment of direct repeats or mis-annealing of a single-stranded mtDNA molecule following the occurrence of a double strand break, both require the presence of direct repeats (or at the very least some sequence homology).
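The notion of a direct repeat flanking the two deletion breakpoints can be illustrated with a short Python sketch that reports the longest identical stretch shared by small windows centred on each breakpoint. This is a simplified illustration under stated assumptions rather than the analysis performed in the studies cited: the function name `longest_direct_repeat`, the window size of 15 nt and the use of an exact longest-common-substring search are all hypothetical choices, and the two windows are assumed not to overlap. Imperfect homology, which the text indicates is also relevant, is not scored here.

```python
def longest_direct_repeat(seq, bp5, bp3, window=15):
    """Longest exact substring shared by windows around two breakpoints.

    `bp5` and `bp3` are 0-based positions of the 5' and 3' breakpoints;
    `window` (an illustrative assumption) is the number of bases examined
    on each side of a breakpoint.
    """
    a = seq[max(0, bp5 - window): bp5 + window]
    b = seq[max(0, bp3 - window): bp3 + window]
    best = ""
    # Classic dynamic-programming longest-common-substring scan:
    # cur[j] = length of the common suffix of a[:i] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > len(best):
                    best = a[i - cur[j]: i]
        prev = cur
    return best
```

On a toy sequence in which the same 13-nt motif is placed at both breakpoints, the function returns that motif; where no repeat is present it returns only whatever short chance match the two windows happen to share.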
The mitochondrial genome is however also involved in a very different type of mutation. Numerous fragments of mitochondrial DNA are present throughout the human nuclear genome, these fragments having migrated from the mitochondrial genome over evolutionary time [Mishmar et al., 2004; Ricchetti et al., 2004]. An occasional consequence of these migrations in extant genomes is the de novo disruption of nuclear genes resulting in a heritable disease. Once again, the nature and location of these highly unusual lesions are both strongly influenced by the local DNA sequence environment. Probably the best characterized example of a pathogenic mitochondrial-nuclear DNA transfer is that described by Turner et al.  in a sporadic case of Pallister-Hall syndrome [MIM# 146510], a condition usually inherited in an autosomal dominant fashion. The mutation involved a de novo nucleic acid transfer from the mitochondrial to the nuclear genome, more specifically the insertion of a 72-bp segment into exon 14 of the GLI3 gene [MIM# 165240] thereby creating a premature stop codon. The insertion site in the GLI3 gene was flanked by inverted repeat elements that could have facilitated hairpin-loop formation. Although no similarity of the 72-bp mitochondrial (mt) DNA insert and the GLI3 gene was apparent, Turner et al.  noted significant sequence identity (~60%) of a 112-bp region (interrupted by a 31 bp inverted repeat) 5′ to the GLI3 gene insertion site and an 81 bp region of the mitochondrial genome immediately 5′ to the 72 bp insertion sequence. They therefore proposed that a mtDNA fragment, initially >72 bp in length, had interfered with the resolution of a transient GLI3 hairpin-loop structure, leading to the illegitimate insertion of a 72 bp mtDNA fragment during DNA repair.
A further example of this type of insertion was recently described in an isolated case of lissencephaly [MIM# 607432; Millar et al., 2010]: a de novo 130 bp mtDNA insertion into the 5′ untranslated region of the PAFAH1B1 gene [MIM# 601545], 7 bp upstream of the translational initiation site. The inserted DNA sequence was found to exhibit perfect homology to two non-contiguous regions of the mitochondrial genome [8,479 to 8,545 and 8,775 to 8,835, containing portions of two genes, MTATP8 (MIM# 516070) and MTATP6 (MIM# 516060)]. Several other examples of mitochondrial-nuclear DNA transfer have been reported as a cause of human inherited disease. However, in the context of the mutation reported here, the mtDNA insertion polymorphism in intron 1 of the FOXO1A gene [MIM# 136533; Giampieri et al., 2004] is perhaps the most intriguing, since this 39 bp insertion was derived from the mtDNA sequence between nucleotides 8,531 and 8,569 containing the MTATP8 and MTATP6 genes. The mtDNA sequence inserted into the FOXO1A gene therefore overlaps with the 130 bp PAFAH1B1 gene insert reported by Millar et al. [2010] by 14 bases (8,532 to 8,545), raising the possibility of the preferential insertion (into the nuclear genome) of certain mtDNA fragments.
In the preceding sections, numerous examples of mutations have been provided in which the formation of non-B DNA conformations (including cruciforms, looped-out bases, quadruplex, triplex and Z-DNA structures) [Figs. 4 and 5] has been postulated to account for intermediate (and transient) forms of DNA that generally serve to promote genetic instability while giving rise specifically to frameshift mutations, repeat expansions and other gross rearrangements. However, with the notable exception of heteroduplex formation by microsatellite repeats in MMR-deficient human cells, direct evidence for such structures having formed and being responsible for the reported mutations has been lacking, with most conclusions being drawn from experiments performed either in vitro or using episomal systems in bacteria and yeast [Mirkin, 2007]. Here, we review some of the work that has directly addressed the extent to which non-B DNA structures can induce human genomic rearrangements.
As already mentioned, the t(11;22)(q23;q11) is a recurrent balanced translocation and is the most frequent of non-Robertsonian translocations, i.e. those that do not involve the large heterochromatic regions of acrocentric chromosomes [Kurahashi et al., 2006]. Although carriers of t(11;22) are generally healthy or only mildly affected, their offspring may come to clinical attention as a consequence of severe mental retardation and morphologic anomalies, associated with the inheritance of the supernumerary der(22) chromosome (Emanuel syndrome, MIM# 609209) [Kurahashi et al., 2006]. Positional cloning permitted the identification of junction fragments in ~40 cases studied, which revealed the clustering of t(11;22) breakpoints at the centre of two large A+T-rich regions (~450 and ~590 bp, respectively), one on each chromosome, and each capable of forming a near-perfect cruciform due to the arrangement of the A+T-rich bases as an inverted repeat [Kurahashi and Emanuel, 2001]. These sequences were termed ‘palindromic AT-rich regions’, or PATRR11 and PATRR22. Most of the chromosomal breaks occurred within the predicted single-stranded loops that separated the two arms of each cruciform, one on chromosome 11 and the other on chromosome 22. Interestingly, despite the A+T-richness, no significant homology was apparent between PATRR11 and PATRR22, suggesting that t(11;22) events resulted from double-stranded break repair by a non-homologous end-joining mechanism.
Further support for this model has come from the analysis of two independent cases of neurofibromatosis type 1 (NF1) caused by a rare t(17;22)(q11;q11) translocation that disrupted the NF1 gene on chromosome 17. Molecular cloning identified PATRR22 as the region responsible for the rearrangements on chromosome 22, whereas an additional ~200 bp PATRR (PATRR17) within intron 31 of the NF1 gene was revealed to be the partner breakage site on chromosome 17 [Kehrer-Sawatzki et al., 1997; Kurahashi et al., 2006]. Thus, a mechanism similar to t(11;22) was apparent in both cases. More recently, analyses of at least 12 individuals with both balanced and unbalanced t(8;22)(q24.13;q11.2) translocations also showed the consistent involvement of PATRR22, as well as a predicted 129-145 bp long undisrupted cruciform structure at PATRR8 involving a sequence that is ~97% A+T [Sheridan et al., 2010]. Hence, in all these instances, a PATRR predicted to fold into a stable cruciform and hosting chromosomal breaks at the single-stranded centre loop, is believed to have been directly involved in the translocation process. Based on the PATRR-dependent cruciform model, Sheridan et al. made the prediction that if both t(11;22) and t(8;22) were recurrent events involving the common PATRR22 region, then t(8;11) might also occur at some frequency, even though carriers of such a rearrangement had not been reported in the literature [Sheridan et al., 2010]. The use of specific PCR primers on sperm samples from healthy males confirmed the occurrence of just such an event, which took place with an estimated frequency of <2.6 × 10⁻⁶. Additional sperm analyses aimed at detecting the frequencies of t(11;22) and t(8;22) in healthy males also provided strong support for the PATRR-dependent cruciform model for translocation. The t(11;22) was found to occur at a frequency of ~10⁻⁵, whereas the frequency of t(8;22) ranged from ~10⁻⁶ to 10⁻⁵. 
Importantly, these frequencies were found to vary by more than two orders of magnitude and correlated in a predictable manner with PATRR sequence length polymorphisms (i.e. the existence of multiple alleles of variable length in the general population). Specifically, homozygous males carrying long PATRRs with the inverted symmetry required to extrude cruciform structures from regular duplex DNA were associated with high translocation frequencies, whereas carriers of shorter alleles in which such inverted symmetry was either reduced or lost, manifested fewer, if any, translocation events [Kato et al., 2006; Sheridan et al., 2010]. These results therefore provide compelling support for the hypothesis that cruciform structures generated by PATRR sequences are responsible for recurrent non-Robertsonian translocations, by providing a substrate for the generation of structure-directed double-strand breaks.
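The two sequence properties emphasized above for PATRRs, namely A+T-richness and the inverted (palindromic) symmetry required for cruciform extrusion, can each be quantified with a few lines of Python. The sketch below is a rough illustration under stated assumptions; the function names are hypothetical, the identity score is a naive position-by-position comparison that does not align around insertions or deletions, and no claim is made that this reproduces the analyses of Kato et al. or Sheridan et al.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def palindrome_identity(seq):
    """Fraction of positions at which `seq` equals its own reverse complement.

    1.0 indicates a perfect inverted repeat (both arms can pair fully);
    lower values indicate reduced or lost inverted symmetry.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    return sum(x == y for x, y in zip(seq, rc)) / len(seq)


def at_content(seq):
    """Fraction of A and T bases in `seq`."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)
```

A toy palindrome built as `arm + reverse_complement(arm)` scores 1.0 for identity, whereas an asymmetric tract such as a run of A residues scores 0.0. In this simplified picture, shorter or asymmetric alleles would be expected to score lower, in line with the reduced symmetry described in the text.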
In addition to these composite cases that share common recombination hotspots, a number of other studies have reported the occurrence of non-B DNA-forming sequences at breakpoints of rearrangements associated with inherited disease. For example, a common 1.1 Mb deletion on chromosome 14q32 has been identified in two unrelated patients diagnosed with uniparental disomy. An expanded (TGG)n repeat was identified on either side of the deletion, suggesting that either non-allelic homologous recombination between the two repeat tracts and/or the formation of non-B structures (such as quadruplexes) adjacent to the repeats, could have induced strand breakage thereby triggering the deletion [Bena et al., 2010]. Additional examples include the presence of triplex-, quadruplex- and hairpin-forming sequences at sites of subtelomeric rearrangements associated with mental retardation [Rooms et al., 2007] and other abnormalities (ear shape, scoliosis) [Bonaglia et al., 2009], short cruciform structures flanking a large (~30 kb) heterozygous deletion that removed the entire SPINK1 gene [MIM# 167790], associated with idiopathic pancreatitis [Masson et al., 2007] and similar structures formed by inverted Alu repeats flanking deletions in the OTC gene [MIM# 300461] leading to ornithine transcarbamylase deficiency [Quental et al., 2009]. Comprehensive meta-analyses that have aimed to determine whether non-B DNA-forming motifs are enriched at rearrangement breakpoints have also supported the involvement of DNA secondary structural features in promoting genetic instability [Bacolla et al., 2004; Wells, 2007; Bengesser et al., 2010; Quemener et al., 2010; Roehl et al., 2010].
The abovementioned studies, together with those cited in previous sections, raise the question as to how non-B DNA structures form on chromatin and how they induce genetic instability. A large number of studies, performed in vitro and on model organisms, now support the conclusion that non-B DNA conformations may arise through various mechanisms, including the folding of single-stranded DNA regions during replication [Mirkin, 2007], transcription [Belotserkovskii et al., 2007; Lin et al., 2010; Tornaletti, 2009] and repair [Wang and Vasquez, 2009], as well as through the generation of unrestrained negative supercoiling [Napierala et al., 2005] either via such processes as transcription and replication or upon nucleosome release. In the specific case of MEDs, a number of studies in bacteria, yeast, mammalian cell culture and mouse models support the conclusion that the extent of instability is intimately associated with replication fork dynamics, being generally greater when microsatellite repeats are close to, or part of, a replication origin and/or when more stable hairpins may form on the lagging, rather than on the leading, strands [Wells and Ashizawa, 2006; Potaman et al., 2003; Liu et al., 2010; Nichol Edamura et al., 2005; Yang et al., 2003; Tomé et al., 2011]. As already mentioned (Fig. 5), collapsed replication forks may lead to aberrant repair, including recombination, at non-B DNA conformations leading to instability. Hence, initiation of replication, recombination and mutagenesis probably constitute the three corners of a triangle associated with a number of human pathological conditions. On the other hand, while we believe that these processes are adequate to explain the transient formation of short non-B DNA regions, they appear insufficient to account for the formation of much larger structures, such as the cruciforms extruded from PATRR elements. 
Hence, it is possible that other as yet unidentified mechanisms of secondary structure formation also operate.
Regarding the mechanisms underlying non-B DNA induced genetic instability, studies in bacteria, yeast and mouse are consistent with the recognition and cleavage of non-B DNA structures by DNA repair enzyme pathways [Lopez Castel et al., 2010; Wang et al., 2008; Wang et al., 2006b; Wang and Vasquez, 2004; Wang and Vasquez, 2009] and the local induction of oxidative damage [Bacolla et al., 2011], followed by DSB repair via non-homologous end-joining. Nevertheless, a number of questions still remain to be addressed. For example, although the t(11;22), t(8;22) and t(8;11) translocations were detected in sperm samples, as mentioned above, they were not observed in somatic cells, despite the occurrence of such recombination events in episomal DNA systems in cell culture [Inagaki et al., 2009]. Thus, it appears that during the course of meiosis, ‘natural’ chromatin might offer a more favourable environment for the generation of non-B DNA conformations and the ensuing genomic instability than mitotic cells.
Finally, the survey presented here raises the question as to the overall impact that non-canonical DNA conformations might have in the context of the causation of human genetic disease. Since the number of pathological conditions specifically listed above is necessarily quite limited, it would at first sight appear as if the overall impact of non-B DNA structures in human inherited disease could be rather modest. However, the specific examples described above were for the most part confined to either repeat expansions or the close proximity observed between the location of chromosomal strand breaks and the presence of potential non-canonical DNA structures. A recent study using human osteosarcoma cell lines has shown that non-canonical DNA conformations are capable of increasing the overall spectrum of mutations (from single base substitutions to gross rearrangements) in a reporter gene in cis by exposing those distant DNA sequences to oxidative damage [Bacolla et al., 2011]. Further, in this study, the spectrum of single base substitutions was shown to be indistinguishable from that induced by other conditions known to lead to an hyperoxidative state (such as Werner deficiency and lung tumorigenesis), an observation which lends support to a model whereby DNA bases become oxidized, followed by the transfer of their oxidized state (‘hole migration’) to target neighbouring bases. If these observations are eventually found to be relevant in the context of ‘natural’ chromatin during meiosis, then the impact of non-canonical DNA conformations on human inherited disease, both with respect to gross rearrangements and single base substitutions, would be even greater than the current review already appears to suggest.
“The human genome [is] riddled with structural and operational deficiencies ranging from the subtle to the egregious. These genetic defects register not only as deleterious mutational departures from some hypothetical genomic ideal but as universal architectural flaws in the standard genomes themselves”.
John C. Avise (2010) Proc. Natl. Acad. Sci. USA 107:8969-8976.
In the above discussion, we have seen that the most plausible explanations for many types of inherited mutation almost invariably invoke either the immediate DNA sequence environment or higher order (but nevertheless still comparatively local) features of genome structure and sub-structure. Different types of mutation may vary dramatically in size (from gross genomic rearrangements down to subtle gene lesions at the single base-pair level) but what they have in common is that their nature, location and extent are often determined by specific characteristics of the local DNA sequence environment. Thus, both the non-randomness and sequence directedness of human gene mutation are reflections of the influence of a number of different genomic features including base composition, epigenetic modification and sequence repetitivity. In addition, the presence of certain DNA sequence motifs may serve to induce mutations by initiating or modulating specific biological processes (e.g. recombination or DNA repair) associated with that motif. Together, such sequence features exert a profound influence over the likelihood of occurrence of specific types of mutation at specific sites or in particular genomic locations.
It has also come to be realised that the mutability of a given gene/genomic region can be mediated indirectly through a variety of non-standard secondary structures whose formation is facilitated by the underlying DNA sequence. These unusual secondary structures may be slipped mispairing intermediates or any one of a number of different non-B DNA structures that can interfere with subsequent DNA replication and repair. It is also becoming apparent that once formed, non-B DNA structures can serve to increase the mutation frequency in generalized fashion, inducing large deletions and other gross genomic rearrangements as well as subtle mutations such as single base-pair substitutions. For reasons that we do not yet fully understand, the single nucleotide substitution rate often covaries with the frequency of insertions, deletions and other rearrangements in the human genome [Longman-Jacobsen et al., 2003; Yang et al., 2004; Marques-Bonet et al., 2007; Tian et al., 2008]. One explanation could be that the single nucleotide substitution rate becomes elevated as a direct consequence of the low fidelity of the error-prone DNA polymerases used to repair regions that have been subject to structural alteration [De and Babu, 2010], a hypothesis not inconsistent with the concept of transient hypermutability [Chen et al., 2009].
Since the human genome is a product of molecular evolution rather than some form of ‘intelligent design’, it is scarcely surprising to find that it contains “pervasive architectural flaws” rendering it “the antithesis of thoughtful organic engineering” [Avise, 2010; Chapman, 2010]. Indeed, over evolutionary time, and as an integral part of its development, the extant human genome has acquired a variety of rearrangements including inversions, insertions and duplications [Cooper, 1999] that, by virtue of their structure and/or organization, now constitute mutation hotspots. This should not of course be held to imply that a relatively immutable primeval genome once existed which then proceeded to decay to an imperfect state over evolutionary time; genomes always were and always will be mutable, and it could not be otherwise since mutability constitutes the major driving force behind the evolution of all life forms. In yet another manifestation of the much-vaunted ‘Goldilocks principle’, if somehow the genomes of our ancestors had been immutable, we would not now be around to register it. There is however a price that extant organisms, including humans, must pay for the inherent mutability of their genomes: genetic disease. There are numerous examples of benign (or relatively benign) genetic changes or rearrangements that occurred during the evolutionary history of our species, and which gave rise to particular types of genomic organization or even specific DNA sequences that are now inherently hypermutable and hence responsible for the recurrence of pathological mutations in extant humans [Laken et al., 1997; Huang et al., 2004; Mirkin, 2007; Bacolla et al., 2008; Kim et al., 2008; Wolf et al., 2009; Witherspoon et al., 2009; Fu et al., 2010].
In this review, we have come to appreciate, through perusal of some of the many published studies of molecular defects identified in individuals afflicted by inherited disease, that the structure of the human genome is inherently nemesistic in the sense that it contains buried within it the seeds of its own destruction, or at the very least its own decay. Our task is to come to understand the ground rules that characterize the different mechanisms of mutagenesis in order to apply this knowledge in the context not only of the analysis and diagnosis of genetic disease but also, eventually perhaps, in the cause of its therapeutic correction.
This work was supported in part by a National Cancer Institute/National Institutes of Health Contract HHSN261200800001E (to A.B.).