Summary: Repeated elements can be widely abundant in eukaryotic genomes, composing more than 50% of the human genome, for example. It is possible to classify repeated sequences into two large families, “tandem repeats” and “dispersed repeats.” Each of these two families can itself be divided into subfamilies. Dispersed repeats contain transposons, tRNA genes, and gene paralogues, whereas tandem repeats contain gene tandems, ribosomal DNA repeat arrays, and satellite DNA, itself subdivided into satellites, minisatellites, and microsatellites. Remarkably, the molecular mechanisms that create and propagate dispersed and tandem repeats are specific to each class and usually do not overlap. In the present review, we have chosen in the first section to describe the nature and distribution of dispersed and tandem repeats in eukaryotic genomes in the light of complete (or nearly complete) available genome sequences. In the second part, we focus on the molecular mechanisms responsible for the fast evolution of two specific classes of tandem repeats: minisatellites and microsatellites. Given that a growing number of human neurological disorders involve the expansion of a particular class of microsatellites, called trinucleotide repeats, a large part of the recent experimental work on microsatellites has focused on these particular repeats, and thus we also review the current knowledge in this area. Finally, we propose a unified definition for mini- and microsatellites that takes into account their biological properties and try to point out new directions that should be explored in the near future on our road to understanding the genetics of repeated sequences.
At the dawn of the 21st century, the human genome was sequenced, and even though it was only the fifth eukaryotic genome to be analyzed as such, it opened a new era for geneticists. With over 150 eukaryotic genomes sequenced within the last few years, we are now provided with a wealth of DNA sequence information, an unprecedented event in the history of science. However, several years before reliable and convenient sequencing methods were published (324, 440), scientists already knew that vertebrate genomes contained a large proportion of repeated sequences. In denaturation-renaturation experiments, the rate of renaturation of genomic DNA after heat denaturation is proportional to its concentration. The C0t parameter, the product of the initial DNA concentration (C0) and the reassociation time (t), was defined as the value at which the reassociation is half completed under controlled conditions. Each organism could then be defined by its C0t value. Using this approach, it was shown that the C0t value of the rapidly reassociating fraction in calf DNA was 0.03, while the C0t value of the slowly reassociating fraction was 3,000, showing that the concentration of DNA in the rapidly reassociating fraction was 100,000 times the concentration of the slowly reassociating fraction (52). Three different values of C0t parameters were identified for mice and in other eukaryotes. Highly repetitive sequences had the lowest C0t value and accounted for approximately 10% of the mouse genome (1,000,000 copies). They corresponded to what was called satellite DNA. Moderately repetitive sequences represented 20% of the mouse genome (approximately 1,000 to 100,000 copies), and unique sequences represented approximately 70% of the mouse genome. Although the C0t method slightly underestimated the real amount of repetitive sequences, probably due to slow renaturation of diverged or rearranged repetitive elements (a common characteristic of transposons and retrotransposons), it is remarkable that this method gave a globally accurate picture of genome composition. 
C0t-based DNA fractionation is still used today to produce genomic DNA libraries that are specific for highly repetitive, moderately repetitive, and single-copy sequences (390).
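The relationship between the two calf DNA values quoted above (0.03 and 3,000) can be illustrated with a minimal sketch. It assumes ideal second-order reassociation kinetics, a textbook simplification that is not spelled out in this review, in which the fraction of DNA remaining single stranded is 1 / (1 + C0t / C0t½) and a lower C0t½ means faster reassociation. The function and variable names are hypothetical and chosen only for illustration.

```python
# Minimal sketch of idealized C0t (reassociation) kinetics.
# Assumption: ideal second-order kinetics; real curves are broader.

def fraction_single_stranded(c0t: float, c0t_half: float) -> float:
    """Fraction of DNA still single stranded at a given C0t value."""
    return 1.0 / (1.0 + c0t / c0t_half)


def relative_concentration(c0t_half_slow: float, c0t_half_fast: float) -> float:
    """How many times more concentrated the fast fraction's sequences are.

    Under the ideal model, C0t(1/2) is inversely proportional to the
    concentration of each reassociating sequence.
    """
    return c0t_half_slow / c0t_half_fast


FAST_C0T_HALF = 0.03    # rapidly reassociating calf DNA fraction
SLOW_C0T_HALF = 3000.0  # slowly reassociating calf DNA fraction

ratio = relative_concentration(SLOW_C0T_HALF, FAST_C0T_HALF)
print(f"Fast fraction is about {ratio:,.0f} times more concentrated")

# By definition, half of a fraction has reassociated at its own C0t(1/2).
print(fraction_single_stranded(FAST_C0T_HALF, FAST_C0T_HALF))
```

The ratio of the two half-reassociation values reproduces the 100,000-fold difference in concentration reported for the two fractions.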
Early observers of chromosomes found that different species contained different amounts of DNA in their nuclei, an amount called the “C value.” This apparently benign observation of DNA content caused a lot of trouble when it was shown that amphibians and fishes contained 20 times more DNA per nucleus than mammals, which were considered to contain more genes than primitive fish because of their higher developmental complexity. Even more surprising, it was subsequently found that the DNA content of the unicellular amoeba Amoeba dubia was 200 times higher than that in humans. This was called the “C-value paradox” and was for a long time the argument of choice for the early opponents of the DNA-based theory of heredity (493). Later on, it was discovered that the increased C value in these organisms was actually due to the presence of abundant repetitive sequences and that the numbers of coding genes are of the same order of magnitude in all eukaryotes, from about 6,000 in the unicellular Saccharomyces cerevisiae to approximately 20,000 to 25,000 in the human genome (which is 200 times bigger than the genome of budding yeast).
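The contrast between genome size and gene number can be made explicit with a short calculation. The sketch below uses only the figures quoted in the preceding paragraph (about 6,000 yeast genes, 20,000 to 25,000 human genes, and a 200-fold difference in genome size); the variable names are hypothetical, and the result is an order-of-magnitude illustration rather than a measurement.

```python
# Rough illustration of the C-value paradox using the figures quoted above.
YEAST_GENES = 6_000
HUMAN_GENES_RANGE = (20_000, 25_000)  # lower and upper estimates
GENOME_SIZE_RATIO = 200               # human genome size / yeast genome size

# Ratio of gene numbers (human / yeast) for each estimate.
gene_ratio = tuple(h / YEAST_GENES for h in HUMAN_GENES_RANGE)

# Fold decrease in genes per unit of DNA when going from yeast to human.
density_drop = tuple(GENOME_SIZE_RATIO / r for r in gene_ratio)

for genes, g, d in zip(HUMAN_GENES_RANGE, gene_ratio, density_drop):
    print(f"{genes} human genes: {g:.1f}x more genes than yeast, "
          f"about {d:.0f}x lower gene density")
```

A 200-fold larger genome thus carries only about three to four times as many genes, which is consistent with most of the additional DNA being repetitive or otherwise noncoding.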
At the present time, we know that repetitive elements can be widely abundant in some eukaryotes, composing more than 50% of the human genome, for example. It is possible to classify repeated sequences into two large families, called “tandem repeats” and “dispersed repeats.” Each of these two families can itself be divided into several subfamilies, as shown in Fig. 1. Dispersed repeats contain all transposons, tRNA genes, and gene paralogues, whereas tandem repeats contain gene tandems, ribosomal DNA (rDNA) repeat arrays, and satellite DNA, itself subdivided into satellites, minisatellites, and microsatellites. Remarkably, the molecular mechanisms that create and propagate dispersed and tandem repeats are specific to each class and usually do not overlap. In the present review, we have chosen in the first section to describe the nature and distribution of dispersed and tandem repeats in eukaryotic genomes, in the light of the complete (or nearly complete) genome sequences that are available. In the second part, we will focus on the molecular mechanisms responsible for the fast evolution of two specific classes of tandem repeats: minisatellites and microsatellites. Given that a growing number of human neurological disorders involve the expansion, sometimes massive, of a particular class of microsatellites, called trinucleotide repeats, a large part of the recent experimental work on microsatellites has focused on these particular repeats, and thus we will also review the current knowledge in this area. Finally, we will propose a unified definition for mini- and microsatellites that takes into account their biological properties and will try to point out new directions that should be explored in the near future on our road to understanding the genetics of repeated sequences.
Centromeres, telomeres, and mitochondrial and chloroplastic DNA, which are particular kinds of repeated sequences as well, will not be reviewed here, and readers are encouraged to consult appropriate publications concerning these topics. Note also that this review focuses on tandem repeat elements in eukaryotes but such elements can also be found in prokaryotes, although generally less abundantly (510).
Saccharomyces cerevisiae was the first eukaryotic organism whose nuclear genome was fully sequenced (167). As such, it has also been the first eukaryotic genome to be investigated for genome redundancy and gene duplications. In a seminal paper published shortly after the complete sequence was released, Wolfe and Shields (538) proposed that the modern yeast genome is derived from an ancestral tetraploid genome, followed by massive gene loss and translocations. Based on two observations, namely, the orientation of the 55 duplicated regions compared to what is seen for centromeres and the absence of any triplicated region (i.e., duplication of a duplication), the authors favored the hypothesis that an ancestral whole-genome duplication, as opposed to several successive segmental duplications, was responsible for the structure of the modern yeast genome. Later on, following the partial sequencing of 13 hemiascomycetous yeast species by the Genolevures consortium, it was found that ancestral chromosomal segments did not entirely coincide with S. cerevisiae duplicated blocks. It was therefore proposed that the hemiascomycete genomes evolved by successive segmental duplications, an alternative to the whole-genome duplication model (290). These apparently conflicting data were reconciled when other hemiascomycete genomes were completely sequenced. The ancestral whole-genome duplication was found to have occurred in the genome of the ancestor of S. cerevisiae and Candida glabrata, but no evidence of whole-genome duplication was found in the genomes of Ashbya gossypii, Kluyveromyces waltii, and Kluyveromyces lactis, three hemiascomycetes phylogenetically more distant from S. cerevisiae than C. glabrata is (108, 117, 242). Following this whole-genome duplication, genes have been lost differentially between the duplicated species (140). 
Various other studies showed that duplicated genes evolve or are lost at different rates during the evolution of yeast genomes (271), and that rates of large genome rearrangements, based on synteny conservation, were highly variable among hemiascomycetes (141), suggesting that the remodeling of duplicated blocks and the loss of duplicated genes were both subject to constraints specific to each organism. It must be noted that segmental duplications have been experimentally reproduced in S. cerevisiae. Using a gene dosage selection system, Koszul et al. (256) showed that large inter- and intrachromosomal duplications, covering from 41 kb up to 655 kb in size and encompassing up to several hundreds of genes, occurred with a frequency of close to 10⁻⁷ per cell/generation. Junctions of segmental duplications frequently contain either microsatellites or transposable elements. The stability of these segmental duplications during meiosis and mitosis was shown to rely both on the size of the duplication and on its structure (257). Replication-based mechanisms leading to segmental duplications rely on both homologous and nonhomologous events (384a). Similarly, by use of a positive selection screen relying on a mutated allele of the URA2 gene, segmental duplications covering from 5 kb to up to 90 kb were found to occur spontaneously in baker's yeast and to be independent of the product of the RAD52 gene, which is necessary for homologous recombination (449, 450). Whole-genome duplications, segmental duplications, and genome redundancies in hemiascomycetes have been recently reviewed (115), and possible molecular mechanisms leading to the formation of segmental duplications were reviewed by Koszul and Fischer (258).
Schizosaccharomyces pombe, an archiascomycetous yeast, is another model organism, whose complete sequence was published in 2002. Its genome does not exhibit the signature of an ancient whole-genome or large-scale duplication, but it contains many duplicated blocks at the subtelomeres of chromosomes I and II (539). These blocks contain several groups of two to four genes whose DNA sequences are 100% identical and which are predicted to be cell surface proteins, an observation reminiscent of what is observed at S. cerevisiae subtelomeres (87, 167).
It has long been postulated that vertebrate genomes resulted from two rounds of whole-genome duplications that occurred early in their evolution (369). By comparing gene duplications between humans and two invertebrates (Drosophila melanogaster and Caenorhabditis elegans), McLysaght and colleagues (326) showed that a number of large paralogous regions detected in the human genome were significantly larger than what would be expected by chance. Molecular clock analysis of invertebrate and human orthologs revealed that a burst of gene duplications occurred in an early chordate ancestor, suggesting that at least one round of whole-genome duplication occurred in this distant ancestor. More recently, distantly related chordate genomes, namely, those of the tunicate Ciona intestinalis, the pufferfish Takifugu rubripes, the mouse, and the human, were compared. The pattern of gene duplications observed was indicative of two successive rounds of whole-genome duplications in vertebrates (98). In addition, the complete sequence of Tetraodon nigroviridis revealed that a whole-genome duplication occurred in an ancestor of this actinopterygian fish after diverging from sarcopterygians (including tetrapods and like organisms). It was shown that one region of synteny in humans was typically associated with two regions of synteny in Tetraodon, a distinctive signature of whole-genome duplications (210). It was possible to reconstruct the ancestral genome of Tetraodon, given that ancestral chromosome duplications were easy to identify due to there being few rearrangements following duplication. This is very different from what is seen for mammalian genomes, which have been extensively reshuffled compared to that of Tetraodon. This might be due to the fact that, compared to the genome of Tetraodon, they contain many transposable elements that may be directly involved in the numerous rearrangements observed. 
Recent segmental duplications have also been found in the human (26, 27, 270a, 463), rat (504), and mouse (25, 494) genomes. Segmental duplications show a statistical bias for pericentromeric and subtelomeric regions in these three species, although interstitial duplications are more abundant. Interestingly, although segmental duplications in humans are enriched for short interspersed elements (SINEs), no such enrichment was found for rats, except for a fourfold enrichment for centromeric satellite repeats, suggesting that these repeats could be involved in the formation of segmental duplications in rats (504). Comparisons between the human and chimpanzee (Pan troglodytes) genomes revealed that a surprisingly large fraction of duplicated DNA in humans (approximately 32 Mb) is not duplicated in the chimpanzee. These human-specific duplications represent 515 regions, with biases for chromosomes 5 and 15. Reciprocally, 202 regions of duplicated sequences (approximately 36 Mb) in the chimpanzee are unique in the human genome (79). When junctions of subtelomeric segmental duplications were analyzed in the chimpanzee genome, it was found that a majority of them (49 out of 53) probably resulted from nonhomologous end joining (NHEJ), whereas only 4 of the events involved nonallelic homologous recombination between repeated elements. Surprisingly, only microhomology sequences (less than 5 bp) and no microsatellites were found at the junctions (284), a situation different from experimental segmental duplications in budding yeast, in which microsatellites were found at 14% of new junctions (256). By comparing segmental duplications in humans, chimpanzees, and macaques (Macaca mulatta), Jiang et al. (222) were able to identify ancestral duplication blocks. Among those, “core duplicons” were defined as ancestral duplications that were present in more than 67% of the blocks. 
The 14 core duplicons identified are shared by human and chimpanzee and they have a higher gene density and match with more spliced expressed sequence tags than nonduplicated regions of the genome, suggesting that they carry some selective advantage that allows rapid expansion and fixation during great ape evolution.
In a diploid organism, one round (or more) of whole-genome duplication leads to polyploidy, a well-described phenomenon in flowering plants (reviewed in reference 533). Among monocotyledons, maize (Zea mays) is an allotetraploid, resulting from the fusion of two diverged ancestors approximately 11.4 million years (My) ago (161). Wheat (Triticum aestivum) is an allohexaploid, containing three sets of homoeologous chromosomes (i.e., chromosomes that were completely homologous in an ancestral form), whereas rice (Oryza sativa) does not show any evidence for polyploidy (44). Among dicotyledons, soybean (Glycine subgenus soja) has probably undergone more than one round of duplication, since there is an abundance of triplicate and quadruplicate sequences in its genome (465). Using a global analysis of age distribution of paralogous pairs of genes among 11 dicotyledons, Blanc and Wolfe (44) found that seven species (namely, tomato [Lycopersicon esculentum], potato [Solanum tuberosum], soybean [Glycine max], barrel medic [Medicago truncatula], cotton [both Gossypium arboreum and Gossypium herbaceum], and Arabidopsis thaliana) exhibited large-scale gene duplications reminiscent of polyploidy or aneuploidy events. Analysis of the complete genome sequence of the model flowering plant Arabidopsis thaliana revealed 24 large duplicated regions of at least 100 kb, covering 58% of the genome (65.6 Mb) (12a). Further analyses showed that duplicated blocks fall into four different groups based on their ages. The most recent group corresponds to the duplication of approximately 9,000 genes at the same time, reminiscent of a whole-genome duplication event. The three other classes are older and may represent successive large-scale duplication events (519). 
The evolution of gene content has also been studied for this species, and studies have shown that gene duplicates are not lost evenly across functional categories: signal transduction and transcription genes have been preferentially retained, whereas DNA replication and repair genes have been preferentially lost (43). This suggests that the expression of genes involved in genome maintenance and transmission is finely tuned and that any imbalance could be lethal and therefore rapidly counterselected.
One cannot review whole-genome duplications without mentioning the unexpected and so far unique case of Paramecium tetraurelia. Sequencing of the macronuclear genome of this ciliate revealed three whole-genome duplications (and possibly a fourth, more ancient), comprising a very recent event occurring before the divergence of P. tetraurelia and P. octaurelia, an old event that occurred before the divergence of Paramecium and Tetrahymena and an intermediate event. A striking feature of the recent whole-genome duplication is the high number of genes retained in duplicate (about 68% of the proteome is composed of two-gene families). This is in contrast with whole-genome duplications discovered in yeast or fish, in which such events could not be detected without the comparison with a nonduplicated reference genome. Maintenance of such a high number of duplicated genes may be driven by gene dosage constraints, since many of the recent duplicates are functionally redundant and are under strong purifying selection, indicative of events in which deleterious mutations, affecting one of the two copies, are not complemented by the expression of the other copy (20).
Whole-genome duplications and segmental duplications are two active phenomena that create redundancy by duplicating a very large amount or the totality of the genes in a genome. When this happens, coding sequences (exons) and noncoding sequences (introns, 5′ untranslated region [5′-UTR], and 3′-UTR) are duplicated and may undergo purifying selection or accumulate mutations and become pseudogenes. Another way of duplicating genes is to reverse transcribe mRNA and recombine the resulting cDNA. In mammalian genomes, many examples are well characterized in which genes were created by the retrotranscription of a spliced mRNA into cDNA, followed by the integration of the resulting cDNA into the genome. The 5′ sequences of these cDNAs are sometimes truncated and their upstream regulatory and promoter sequences are lacking, since they are not part of the mature transcript. They do not contain introns. These retrogenes are therefore not functional and are called retropseudogenes, but there are a few cases described in which transcription initiation started upstream of the normal promoter and reverse transcription gave rise to a functional retrogene. The process of making retropseudogenes can be dramatically efficient in mammals, since more than 200 copies of the glyceraldehyde-3-phosphate dehydrogenase (GAPDH) gene were found in rat and mouse (reviewed in reference 528). Analysis of the human genome sequence revealed that it contains approximately 10,000 retropseudogenes, including more than 1,700 ribosomal pseudogenes, while the C. elegans genome contains slightly more than 200 retropseudogenes and 2,000 pseudogenes (189, 555). Consistent with a reverse transcription intermediate in the formation of retropseudogenes, there seems to be a positive correlation between the number of retropseudogenes for one given gene and its level of transcription, i.e., more pseudogenes are found for highly transcribed genes (555). 
By establishing an experimental system in budding yeast requiring transcription, splicing, and reverse transcription of a selectable marker, Derr et al. showed that such events occurred at a frequency of about 10⁻⁷ per cell/generation and were dependent on the expression of Ty retroelement reverse transcriptase (105). However, no evidence for the presence of retrogenes in S. cerevisiae has been published thus far. Retrotransposition, like segmental and whole-genome duplications, therefore contributes to the formation of paralogues (or pseudoparalogues), thereby increasing the overall level of redundancy in eukaryotic genomes.
By definition, all the paralogues of a given genome belong to a family whose size ranges from two members up to several hundred, for immunoglobulin genes, for example (270a, 294). Gene families represent a rather large proportion of all protein-encoding genes in almost all eukaryotes. In S. cerevisiae, 40% of predicted open reading frame (ORF) products are not unique, showing significant identity with 1 to 22 paralogues (487). In other hemiascomycetes, the number of genes belonging to gene families ranges from 31.8% for K. lactis (the least duplicated genome) to 51.5% for Debaryomyces hansenii (117), figures that are similar to what is observed for D. melanogaster and C. elegans. The situation is very different for S. pombe, in which 93% of predicted protein-coding genes do not belong to recognizable gene families (539). In D. melanogaster, 40% of predicted genes are duplicated (433), whereas in C. elegans, figures range from 32% to 49%, depending on whether proteins (541) or protein domains (433) are considered. It must be noted that approximately 7% of nematode duplicated genes are thought to have resulted from block duplications involving more than one gene (541). As expected from the genome of A. thaliana, which underwent several large-scale duplications, 65% of its genes belong to gene families, and a substantially higher proportion of genes (37.4%) belong to families containing more than five members, compared to what is seen for D. melanogaster (12.1%) and C. elegans (24%) (12a). In mice and humans, genes belonging to families correspond to 60 to 80% of all genes, a figure similar to what was found for other mammals (103) (Table 1).
Gene duplications may give rise to different outcomes. One of the two copies may become nonfunctional by the accumulation of point mutations, insertions, and deletions (nonfunctionalization), or one copy may acquire a novel function, while the other retains the original function (neofunctionalization). Alternatively, the two copies may accumulate point mutations so that neither of the two copies is functional by itself and requires the presence of the other copy, or each copy loses one or more enzymatic activities and becomes specialized (subfunctionalization). Studies of substitutions in paralogues encoded by several eukaryotic genomes suggested that gene duplications are a rather frequent phenomenon, arising at the average rate of 0.01 per gene per 1 My, indicating that 50% of all genes in a genome are expected to be duplicated at least once in a 35- to 350-My time scale (301). As a corollary, since random duplications of genes occur in each genome, gene families must expand and contract compared with each other in closely related organisms. That is indeed what was observed for hemiascomycetes (179) as well as for mammals (103). Thus, even in the absence of whole-genome duplication, a large fraction of all genes in a genome are expected to become duplicated, generating a powerful source of novelty in eukaryotes.
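The link between a per-gene duplication rate and the time needed for half of all genes to be duplicated can be sketched as follows. This is a simplified model, not the analysis of the cited study: it assumes that duplications arise independently at a constant rate (a Poisson process) and ignores subsequent gene loss. Only the average rate of 0.01 per gene per My comes from the text; the bounds of 0.002 and 0.02 are assumptions chosen because they reproduce the quoted 35- to 350-My range. Function names are hypothetical.

```python
import math

# Simplified model: duplications arise as a Poisson process at a constant
# rate per gene per million years (My); later loss of copies is ignored.

def fraction_duplicated(rate_per_my: float, time_my: float) -> float:
    """Expected fraction of genes duplicated at least once after time_my."""
    return 1.0 - math.exp(-rate_per_my * time_my)


def half_time_my(rate_per_my: float) -> float:
    """Time (My) after which 50% of genes have been duplicated at least once."""
    return math.log(2) / rate_per_my


AVERAGE_RATE = 0.01          # per gene per My, as quoted in the text
ASSUMED_HIGH_RATE = 0.02     # assumption, not stated in the text
ASSUMED_LOW_RATE = 0.002     # assumption, not stated in the text

for label, rate in [("average", AVERAGE_RATE),
                    ("assumed high", ASSUMED_HIGH_RATE),
                    ("assumed low", ASSUMED_LOW_RATE)]:
    print(f"{label} rate {rate}: half of genes duplicated after "
          f"about {half_time_my(rate):.0f} My")
```

Under these assumptions, the average rate gives a half-time of roughly 70 My, and the two assumed bounds bracket it at roughly 35 and 350 My.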
Transfer RNAs, the genetic link between transcription and translation, are essential for cell viability in all living organisms. In eukaryotes, there is no correlation between genome size and tDNA copy number (note that figures given hereafter only refer to nucleus-encoded tDNA and do not include organelle-encoded tDNA). A. thaliana contains 589 tDNAs and 13 tDNA pseudogenes, a number higher than that for any other eukaryotic genome sequenced so far, including the human genome. D. melanogaster contains 292 tDNAs, C. elegans has 659 tDNAs and 29 pseudogenes, and in the human genome 345 tDNAs and 167 pseudogenes were detected (5, 12a, 73a, 270a). In the mouse genome, 335 putative tDNAs were found but analysis was complicated by the presence of thousands of active B2 sequences (see the following section), which are derived from an ancient tDNA. It is therefore possible that several tDNAs detected are not functional (526). An in-depth analysis of tDNAs in nine hemiascomycetes and one archiascomycete (S. pombe) revealed 2,335 genes, which ranged in distribution from 131 in Candida albicans to 510 in Yarrowia lipolytica (313). In hemiascomycetes, tDNAs generally appear scattered throughout the genome, except in D. hansenii, in which eight identical tandem copies of a tDNA-Lys are found on chromosome B, reminiscent of the frequent occurrence of gene tandems in this organism (see “Tandem repeats of paralogues” below). Clusters of tDNAs have also been found in other eukaryotes. In S. pombe, 22 tDNAs were found in a 50-kb pericentromeric region on chromosome II and two other clusters were found around the two other centromeres (265, 484). In D. melanogaster, a genome region contains a cluster of 10 tDNAs (102), and in humans 140 tDNAs are found in a 4-Mb region on chromosome 6. This rather small region (0.1% of the whole genome) contains representatives for 36 of the 49 anticodons found in the human genome (270a).
It was experimentally shown using S. cerevisiae that when tDNAs are transcribed in the orientation opposite to replication fork progression, they promote the formation of replication pause sites (106). One may therefore wonder what the effect of large tDNA arrays (such as those mentioned above) might be on replication and, more generally, on the stability of genomic regions encompassing them (see “Fragile sites and cancer” below).
The mechanism(s) by which tDNAs propagate in genomes is at the present time only speculative. One may imagine that reverse transcription of tRNAs followed by integration at an ectopic position in the genome is a possible mechanism. However, to the best of our knowledge, there is no experimental evidence for such a mechanism being active on tRNAs.
Transposable elements were elegantly discovered by Barbara McClintock several years before the biochemical structure of DNA itself was solved (325). Since that time, transposons have been found in prokaryotes and eukaryotes and can be classified into two large families: retrotransposons (class I elements) and DNA transposons (class II elements). Recently, a more sophisticated classification based on the mode of transposition and on insertion mechanisms was proposed. Using this classification, retrotransposons were themselves divided into five “orders,” including long terminal repeat (LTR) retrotransposons, long interspersed nuclear elements (LINEs), SINEs, DIRS (Dictyostelium intermediate repeat sequence) elements, and PLE (Penelope-like elements), the last two being less widely spread (535). In each family, autonomous elements, which are able to catalyze their own transposition, and nonautonomous elements, which rely on autonomous elements in order to transpose, are found. Their abundance in genomes is highly variable, from one complete copy of a Ty element in Candida glabrata to millions of copies in the human genome (≈50% of the total sequence) (270a). Homologous (or homeologous) recombination between transposons may induce chromosomal rearrangements such as deletions, inversions, translocations, and segmental duplications, and transposons may also cause mutations when they insert into genes (34). For these reasons, transposable elements have been considered to be an important driving force in eukaryotic genome evolution (139).
LINEs are non-LTR class I elements whose best characterized member is the LINE-1 (L1) mammalian retrotransposon. This 6- to 8-kb element contains two ORFs (ORF1 and ORF2) that are cotranscribed from the same promoter but can be processed into several distinct messenger RNAs. ORF1 encodes a trimeric nucleic acid chaperone protein that binds to L1 mRNA to form ribonucleoprotein complexes, considered to be transposition intermediates. ORF2 contains endonuclease and reverse transcriptase activities, both required for retrotransposition in wild-type cell lines (34). L1 elements transpose in three steps: (i) formation of a nick in double-stranded DNA at AT-rich sequences, mediated by the endonuclease activity of ORF2; (ii) annealing of the L1 mRNA poly(A) tail to the 5′ poly(T) tail at the nick and cDNA synthesis by the reverse transcriptase activity of ORF2; and (iii) degradation of the L1 mRNA and second-strand synthesis followed by ligation (227). It is worth noting that this mechanism is reminiscent of budding yeast group II mitochondrial intron transposition, in which the intron-encoded protein makes a double-strand break (DSB) at the exon-exon junction, which serves as a primer for reverse transcription of the intron RNA (557, 558). It is therefore possible that mitochondrial group II introns are distant ancestors of mammalian LINE elements, although it could also be a case of convergent evolution.
In the human genome, approximately 850,000 LINEs were found (270a), and 660,000 were found in the mouse genome (526); both figures represent approximately 20% of their respective genome sequences. Several instances of “exonization” of active LINE elements, leading to the creation of new genes, have been recently reviewed (62). By comparison, the genome of another vertebrate, Tetraodon nigroviridis, contains 700 times fewer transposable elements, almost half of them being LINEs (210, 426). Aside from sometimes being involved in large chromosomal rearrangements, like segmental duplications (see “Whole-genome and segmental duplications in vertebrates” above), LINEs may also play a role in evolution by regulating global genome transcription. It was shown that the presence of the human L1 retrotransposon inhibited transcriptional elongation and induced the premature polyadenylation of the transcript containing the L1 sequence. Given the highly repetitive nature of L1 elements in the human genome (Table 1), it is likely that these elements play an active role in regulating gene expression genome-wide and are therefore a key component of mammalian genome evolution (182).
SINEs are the most abundant elements in mammalian genomes. They include Alu and MIR elements in primates and B1, B2, and ID elements in rodents as well as many other elements in mammalian and nonmammalian genomes. Alu elements are composed of two 130-bp monomers separated by a short A-rich linker region. Each monomer was ancestrally derived from the 7SL RNA, following a duplication of this gene before the time of the mammalian radiation (233, 507). They transpose through a mechanism basically similar to that seen for L1 elements, needing only the product of ORF2 to be active (34). They are classified into several families, themselves classified into subfamilies, based on sequence conservation. The S and J families are the oldest, having their origin 35 to 55 My ago, followed by the Y family, at the radiation between green monkeys and the branch leading to African apes, some 25 My ago. This family was itself expanded 4 to 6 My ago, after the divergence of humans and African apes, to give rise to “young” Alu elements (32).
Both the human and the mouse genomes contain approximately 1,500,000 SINEs, which make them the most abundant repeated elements in these genomes (270a, 526). Interestingly, Alu elements are threefold more active in humans than in chimpanzees, since approximately 7,000 lineage-specific Alu sequences were found in the human genome, compared to 2,300 lineage-specific copies in the chimpanzee genome. Homologous recombination between more or less diverged Alu elements can produce deletions of the sequence located between the repeats. More than 600 such deletions were found in the human genome and more than 900 were found in the chimpanzee genome, underlining the dramatic role that Alu elements can play in genome rearrangements (79a).
Although occasional examples of exon capture by Alu insertion have been recorded (32), a very recent work identified a SINE family, AmnSINE1, that is conserved in all amniota (mammals, birds, and reptiles) and that may be involved in mammalian brain formation. Two of these conserved AmnSINE1 elements were found to behave as distal transcriptional enhancers of developmental genes. Out of 124 conserved AmnSINE1 elements in the human genome, one-fourth are located near genes involved in brain development, leading the authors to speculate that this conserved family could have played a central role in the development of the central nervous system in mammals (444).
LTR retrotransposons are class I elements that transpose by a mechanism different from that seen for LINEs/SINEs. It also involves a reverse transcription step but one that is usually primed by annealing of a tRNA to the primer binding site, the 3′ end of the transposon RNA, followed by reverse synthesis of the first cDNA strand and then synthesis of the second DNA strand. This process occurs in the cytoplasm and the transposon is subsequently transferred to the nucleus, in which integration occurs by a mechanism similar to what is seen for type II DNA elements, with a nuclease making specific nicks at the integration site to catalyze the process (99). Some retroviruses share a similar method of propagation, including the formation of a cytoplasmic particle containing the viral genome, and can therefore be classified as LTR retrotransposons (364). It must be noted that homologous recombination between the two LTRs of a transposon results in “popping out” of the element, leaving as a scar a solo LTR. Most of the LTR retrotransposon copies (85%) are detected as solo LTRs both in the yeast genome (247) and in the human genome (270a). The number of such elements varies greatly among sequenced genomes, from one single full-length copy of a Ty element in the hemiascomycetous yeast Candida glabrata to 443,000 copies of retrovirus-like elements in the human genome (8% of the genome). Fifty-one Ty elements, classified in five families, have been detected in the S. cerevisiae genome, along with 268 to 280 solo LTRs (184, 247), with this number varying between budding yeast strains. Yeast Ty elements tend to be clustered around tRNA genes and genes transcribed by RNA polymerase III (Pol III), this preference most likely being mediated by interactions between the Pol III complex and the integration complex (184).
Similarly to the human genome, the macaque (M. mulatta) genome contains around half a million recognizable copies of retroviruses, among which 2,750 copies are lineage specific and result from at least eight instances of horizontal transmission, a figure higher than that for the human lineage (183). LTR retroelements are particularly abundant in plants, representing up to 80% of the DNA sequence in some genomes. Massive expansions of such elements may lead to a rapid increase in genome size, as in Oryza australiensis (a wild relative of the Asian cultivated O. sativa), in which 90,000 retrotransposon copies have been accumulated in the last 3 My, leading to a doubling of genome size (393).
Class II elements are mobile DNA elements that utilize a transposase and single- or double-strand DNA breaks to transpose. They can be classified into three major subclasses: (i) elements that excise as double-stranded DNA and transpose by a classical “cut-and-paste” mechanism, such as Drosophila P elements; (ii) elements that utilize a rolling-circle mechanism, such as Helitrons (398); and (iii) elements that probably utilize a self-encoded DNA polymerase but whose transposition mechanism is not well understood, such as Mavericks (399). Based on transposase sequence similarities and phylogenetic analyses, they can be classified into 10 different families (131). Their number and proportion, compared to those of retrotransposons, are highly variable among eukaryotic genomes. The human genome contains about 300,000 copies of DNA transposons, 100 times more than the C. elegans genome (119, 315) and 700 times more than the D. melanogaster genome (230). In the genome of the protist pathogen Trichomonas vaginalis, an estimated 3,000 Maverick copies are found, which occupy approximately 37% of the genome size (399). Given that most transposon copies in this genome show a very low level of polymorphism (2.5% on the average), and by comparison with its sister taxon Trichomonas tenax (a trichomonad of the oral cavity), it was suggested that the T. vaginalis genome was recently invaded by such elements, leading to a very substantial increase of its size (71). By comparison, some eukaryotic genomes, such as those of S. cerevisiae and S. pombe, do not contain any DNA transposons, although they do contain retroelements. It is, however, not a rule in all ascomycetes, since Y. lipolytica and C. albicans contain several DNA transposons (366). 
Interestingly, it was shown that 100 to 200 copies of a MITE (miniature inverted-repeat transposable element)-type transposon in the rice genome, called Micron, were flanked by (TA)n repeats, suggesting that this transposon specifically targets a microsatellite (7). This peculiar example shows how a dispersed repeated element may propagate among tandemly repeated elements such as microsatellites.
Ectopic recombination between transposable elements is detrimental to genome structure and organization. Numerous examples of large-scale chromosomal rearrangements in plants, animals, and budding yeast have been reported (32, 131, 256). Therefore, although a rapid expansion of such elements in a genome may sometimes be viable and will increase genetic diversity, it may also rapidly reduce fitness and eventually have lethal consequences. Transposable elements are not rare in the genomes of some filamentous fungi (88), and several species have developed specific mechanisms to counter their propagation. Neurospora crassa uses a process called RIP (for repeat-induced point mutation) to efficiently detect and mutate duplicated sequences. RIP recognizes duplications of at least 400 bp in length and introduces C:G-to-T:A mutations into both copies of the duplicated sequence. Since its discovery in N. crassa, evidence for a similar process operating in other filamentous fungi has been reported (155). In Ascobolus immersus, a nonmutagenic process called MIP (for methylation induced premeiotically) methylates cytosines contained in duplicated sequences with a high efficiency, reducing meiotic crossovers dramatically. By decreasing the efficiency of homologous recombination between duplicated sequences, MIP therefore reduces the chance of nonallelic translocations occurring between repeats (305).
Contrary to dispersed repeats, tandem DNA repeats are sequentially repeated. This truism must not mask the fact that out of two possible orientations for tandem repeats (head-to-tail repeats, also called “direct repeats,” and head-to-head repeats, also called “inverted repeats”), only direct repeats are frequently found in genomes. This is demonstrated by the biased distribution of Alu tandem repeats. Alu elements are frequently found in tandem within the human genome, sometimes separated by only a few base pairs. It was found that nearly identical inverted Alu repeats are 70-fold less frequent than the same repeats in direct orientation when the two copies are separated by less than 20 bp, but this difference is abolished when the two copies are separated by more than 100 bp (292). It was postulated that such repeats are able to form hairpin or cruciform structures in vivo, and Lobachev et al. (291) showed that inverted Alu elements induce DSBs in budding yeast. These breaks require the Mre11-Rad50-Xrs2 complex (a multifunctional protein complex conserved in all eukaryotes) in order to be correctly processed and repaired. Another study with the fission yeast S. pombe showed that a 160-bp palindrome induced homologous recombination and that this induction was dependent on the Rad50 orthologue, Rhp50 (126). Similarly, palindromes also induce DSBs during budding yeast meiosis (358, 362). In Escherichia coli, a very recent work showed that a 246-bp palindrome integrated into the bacterial chromosome was cleaved in vivo by the SbcCD protein complex, the prokaryotic orthologue of the Rad50-Mre11 complex, giving rise to a two-ended DSB that can be detected by Southern blotting (123). It must also be noted that large inverted repeats can be formed in yeast by a mechanism similar to rDNA palindrome formation in Tetrahymena thermophila, a highly regulated process involving the generation of a DSB near a short inverted DNA repeat (63, 64).
The effect of palindrome-induced homologous recombination can be dramatic for cells, since chromosomal rearrangements reminiscent of those found in human tumors, such as internal deletions and inverted duplications, frequently occur in yeast cells harboring such inverted repeats (361). In humans, a long AT-rich palindrome suspected to form a cruciform structure in vivo is found at the constitutive t(11;22) breakpoint, the most frequently occurring non-Robertsonian translocation (266, 267). Since inverted repeats are deleterious sequences leading to large chromosomal rearrangements, they must be counterselected for, and the vast majority of tandem repeats found in eukaryotic genomes are repeats in direct orientation (292). This is the case of all tandem repeat classes detailed in the present chapter, from the large rDNA arrays covering hundreds to thousands of kilobases to the more discrete but widely abundant microsatellites.
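The distinction between direct (head-to-tail) and inverted (head-to-head) repeats discussed above can be illustrated with a minimal Python sketch. The function names and toy sequences are hypothetical and are not taken from the cited studies; the sketch handles exact copies only, whereas real Alu pairs are diverged and would require sequence alignment. The 20-bp spacer default simply mirrors the distance below which inverted Alu pairs were reported to be underrepresented.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def repeat_orientation(first: str, second: str) -> str:
    """Classify two exact repeat copies read on the same strand.

    'direct'    : the second copy is identical to the first (head-to-tail).
    'inverted'  : the second copy is the reverse complement of the first.
    'unrelated' : neither; diverged copies are not handled in this sketch.
    """
    first, second = first.upper(), second.upper()
    if second == first:
        return "direct"
    if second == reverse_complement(first):
        return "inverted"
    return "unrelated"


def is_close_inverted_pair(first: str, spacer: str, second: str,
                           max_spacer: int = 20) -> bool:
    """Flag inverted pairs separated by fewer than max_spacer bp.

    Illustrative assumption: such closely spaced inverted copies are the
    ones able to fold back on themselves as a hairpin.
    """
    return (repeat_orientation(first, second) == "inverted"
            and len(spacer) < max_spacer)
```

For instance, `repeat_orientation("GGAAT", "ATTCC")` classifies the pair as inverted, since ATTCC is the reverse complement of GGAAT.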
Gene tandems are not particularly frequent in hemiascomycetes (a few dozen arrays per genome), except in Debaryomyces hansenii, in which 247 tandem arrays were detected throughout its genome, including large arrays of up to nine copies that were not found in other yeast genomes, significantly contributing to global genome redundancy. Like Alu tandems in humans, most of the tandems (80 to 90%) were found to be in direct orientation (117). Given the fragmented structure of most genes in higher eukaryotes, tandem repeats of paralogues are rare, but they are not completely absent. The mouse genome draft sequence contains a high proportion of regions that could not be assembled or anchored on the chromosomes due to the repetitive nature of these regions. One striking example is a large region on chromosome 1 containing a tandem expansion of the Sp100-rs gene, repeated approximately 60 times and covering a 6-Mb region. This region is highly variable in size among mouse species and laboratory strains, ranging from 6 to 200 Mb, suggesting that an active process frequently expands and contracts this region (526). In S. cerevisiae, the CUP1 gene, encoding a copper metallothionein, can be tandemly amplified, conferring resistance to high concentrations of copper to yeast cells. Laboratory strains are polymorphic at this locus, usually exhibiting 10 to 12 tandem copies of CUP1. Losses and gains of repeat units occur mainly by meiotic homologous recombination, and both gene conversions between repeat arrays and unequal crossovers are observed (529). This is reminiscent of what is observed for minisatellites in yeast and humans (see “Rearrangements during homologous recombination” below) and suggests that homologous recombination may lead to expansions and contractions of gene tandem repeats in both budding yeast and humans.
rDNA is another essential genetic element linking transcription to translation. rRNA is at the same time the main structural and the catalytic component of the ribosome. rRNA is transcribed from a large tandem repeat found at one or more loci in each haploid genome. It is essential for cell viability since it is transcribed into rRNA, the central component of the whole ribosomal translational machinery. Each repeat unit contains the 28S large subunit, the 18S small subunit, and the 5.8S gene as well as two internal transcribed spacers (ITS1 and ITS2) and a large intergenic nontranscribed spacer (294). Another gene, the 5S rRNA gene, may be present within the rDNA array, as is the case in most hemiascomycetes (117), or is encoded elsewhere in the genome, as is the rule with most other eukaryotes. The number of repeat units varies greatly among eukaryotes, from 40 to 19,300 in animals and from 150 to 26,000 in plants, and is positively correlated with genome size (400). Given the repetitive nature of rDNA arrays, it is not always easy to determine whether all of the repeat units share an identical sequence. In a recent study, nonassembled rDNA sequences generated during whole-genome shotgun sequencing of five fungi have been examined in order to look for possible polymorphisms between rDNA repeat units. Few base variations were found, from 4 in S. cerevisiae to 37 in Cryptococcus neoformans, and there was no obvious bias toward their localization to spacer regions (158). These results show that rDNA tandem arrays are evolving through concerted evolution and suggest that sequence quasi-identity is maintained by homogenization of rDNA repeat arrays. This homogenization could occur by homologous recombination between tandem repeats, since Holliday junctions (a hallmark of homologous recombination) were detected in rDNA during mitotic growth of yeast cells. 
Their presence is dependent on Pol α, but not on Pol δ or Pol ε, and they are significantly reduced in a rad52 mutant in which homologous recombination is abolished (560). RAD52 is also directly involved in the formation of extrachromosomal circles (ERCs) in old yeast cells (382). ERCs are DNA minicircles whose formation is dependent on several cis- and trans-acting factors. A replication block is located 3′ of each rDNA repeat unit in budding yeast that arrests the replication fork coming from the 3′ end so that it cannot collide with the RNA polymerase complex transcribing the repeat unit in the opposite orientation. This replication fork block is dependent on the presence of the Fob1 protein and its mechanism has been extensively studied and reviewed elsewhere (268, 333, 432). Interestingly, mutations in the FOB1 gene lead to an increase in budding yeast life span and a decrease in the amount of ERCs (96). The molecular link between the amount of ERCs and aging in yeast is unclear, but both depend on the presence of the SGS1 helicase and on the SIR complex, involved in chromatin silencing (244, 468). Yeast cells mutated for the SGS1 helicase contain a higher proportion of ERCs, exhibit nucleolar fragmentation, and age prematurely compared to wild-type cells (469). SGS1 encodes an S-phase DEAH-box DNA helicase that was first identified as a suppressor of a mutation in the topoisomerase TOP3 gene (156). It was subsequently shown to play several roles during homologous recombination and probably also during replication in yeast (85, 125, 157, 207, 283, 425). SGS1 has orthologues in E. coli (RecQ), in all hemiascomycetous yeasts (419), and in mammals. In humans, five orthologues are found, namely, WRN, BLM, and RTS, involved in Werner, Bloom, and Rothmund-Thomson syndromes, respectively, and two shorter forms, RecQL and RecQ5. Interestingly, it was shown in humans that rDNA organization depends on the WRN gene. 
Using single-molecule analysis, Caburet and colleagues (65) have shown that rDNA tandem arrays frequently differ from the canonical organization. The size of the intergenic nontranscribed spacer varies from 9 kb to 72 kb, and palindromic structures are found in one-third of the molecules analyzed in wild-type cells. However, the proportion of palindromes increases to 50% in cells deficient for the WRN helicase, suggesting that some form of illegitimate recombination controlled by this helicase is responsible for making rearrangements within human rDNA tandem arrays. In conclusion, homologous recombination is very frequent within rDNA and is tightly linked to DNA replication of the tandem arrays.
As mentioned above, 5S rDNA genes are not often encoded within the large rDNA array but are found as dispersed elements, themselves sometimes amplified in tandem repeats, as in Drosophila species. Comparison of 5S tandem repeat sequences in several Drosophila species revealed that insertions and deletions were very frequent between species and were often flanked by conserved nucleotides, suggesting that they could occur by slippage of the newly synthesized strand during DNA replication or alternatively by gene conversion (380) (see “Molecular mechanisms involved in mini- and microsatellite expansions” below). A more recent work on four completely sequenced filamentous fungi (Aspergillus nidulans, Fusarium graminearum, Magnaporthe grisea, and Neurospora crassa) revealed an interesting property of 5S genes in these species. It was shown that 5S genes located at different loci share more identity to other 5S genes in other species than with 5S genes in the same species (5S clusters are interspecies instead of being intraspecies) (429). This suggests that 5S genes in a given species do not coevolve by a mechanism similar to large rDNA arrays and are not homogenized during evolution of a given species. This also suggests that an active mechanism of constant “birth-and-death” creates new 5S sequences, as opposed to the model of concerted evolution that seems to apply to large rDNA tandem arrays (365). Interestingly, a class of SINE (SINE3) deriving from a 5S gene was discovered in the zebrafish genome and is probably mobilized in trans by zebrafish LINE elements (234). This raises the interesting possibility that some 5S rDNA genes could also be reverse transcribed and transposed elsewhere in the genome, therefore themselves behaving like transposable elements. In plants, a retroelement called Cassandra exhibits the unique property of carrying universally conserved 5S sequences in each of its two LTRs. 
Transposition of Cassandra would therefore propagate 5S sequences in plant genomes, providing an explanation for the lack of concerted evolution and of rapid rearrangements of 5S loci in plants (228).
Historically, satellite DNA was identified as a DNA fraction that sedimented as a strong and localized band, above or below the main band in cesium chloride density gradients, hence its name (520). It is widespread in eukaryotic genomes, such as D. melanogaster (293), plants (461), and mammals (505) but is absent from hemiascomycetes and S. pombe, although the large fission yeast centromeres contain many repetitive elements essential for their function (539). Satellite DNA is found in heterochromatin regions, such as mammalian centromeres, the D. melanogaster Y chromosome, and plant subtelomeres and centromeres but may also be found as intercalary DNA (76). Although its unusual buoyant density was the hallmark of a strong nucleotide composition bias, molecular analyses of satellite DNA showed its highly repetitive nature. It is characterized by large tandem repeats, whose total length may reach several millions of nucleotides and whose repeat units show a great variation in size, ranging from 5 nucleotides for human satellite III up to several hundreds of base pairs. Repeat units are not strictly identical and exhibit sequence polymorphisms (505). The motif sizes, total lengths, and the numbers of satellites per genome are summarized in Fig. 2. In humans, several centromeric satellites are known, and their repeat unit lengths range from 5 to 171 nucleotides. The 5-bp satellite is an imperfect GGAAT repeat present in most if not all chromosomes, spanning up to hundreds of kilobases, and it might be a functional component of the centromere. The 171-bp satellite (generally called the α-satellite) is also found on all chromosomes and was shown to bind the centromere protein CENP-B (505). This protein is thought to be derived from transposases encoded by ancient DNA transposons. It was recently shown that S. pombe homologues of human CENP-B localize to Tf2 retrotransposons and recruit histone deacetylases to silence these retroelements. 
Therefore it is possible that CENP-B binding at human satellites similarly helps to recruit histone deacetylases to silence these heterochromatin regions (69). The β-satellite is normally present in tandem arrays, covering hundreds of kilobases on the short arms of acrocentric chromosomes. Remarkably, the insertion of 18 complete β-satellite repeat units (68 bp) within the TMPRSS3 gene led to both congenital and childhood onset autosomal recessive deafness in humans. It is totally unclear how this satellite was propagated and inserted into this gene, leading to its inactivation (458). In Mus musculus, the minor satellite often maps close to a telomeric (TTAGGG)n sequence, whereas the major satellite is pericentromeric (53, 406). Plant satellites were identified in centromeric and telomeric positions for dozens of species and harbor repeat unit lengths ranging from 118 to 755 bp. Some of them are also present in distinct regions on both chromosomal arms in several Triticeae genomes (461). Satellite DNA has been extensively studied and mapped on each chromosome in D. melanogaster. Repeat unit sizes range from 5 to 359 bp, with the larger units being found essentially within heterochromatin covering about half of chromosome X. Shorter repeat unit satellites are localized on all chromosomes [like the (AAGAGAG)n and the (AATAT)n satellites] or on only a subset (293). The Y chromosome, almost entirely heterochromatic, carries nine satellites whose repeat unit sizes range from 5 to 7 nucleotides, three of them mapping only to the Y chromosome and the others being present on other chromosomes (47). In Tetraodon nigroviridis, a 118-bp tandem repeat is found at all centromeres. This centromeric satellite DNA remarkably shows a high conservation of the first half of the repeat unit (approximatively 60 bp) and a more variable second half of the repeat unit, suggesting that both halves of the repeat unit are not under the same constraints (426). 
Remarkably, transposons, although scarce in T. nigroviridis, are preferentially found within heterochromatic regions, in proximity to satellite elements, suggesting either preferential insertions of transposons in these regions or selective elimination of transposed elements in euchromatic regions as a way to reduce the deleterious incidence of homologous recombination between them (138).
One may wonder about the putative function of satellite DNA, given that some of these elements are conserved over large evolutionary distances, like the human α-satellite, which was detected in chicken and zebrafish (281). An old hypothesis suggested that heterochromatin satellites would help proper meiotic disjunction, increasing the chance of correctly segregating chromosomes during meiosis (520). Given the centromeric or pericentromeric locations of several mammalian satellites, and the binding of centromere-specific proteins, like CENP-B or histone CEN H3, it is reasonable to assume that they may play a role either in replicating or in correctly segregating centromeres during mitosis (and/or meiosis). Interestingly, several authors reported that satellites are transcribed in a variety of organisms, including invertebrates, vertebrates, and plants. These polyadenylated transcripts are highly regulated, being differentially expressed at particular developmental stages or in specific tissues, raising a possible role for satellites in development. Short interfering RNAs originating from satellite DNA have also been detected, and they may play a role either in control of the initial formation or subsequent maintenance of heterochromatin or in the expression of particular genes embedded in satellites (506). Due to the highly repetitive nature of satellite DNA, whole-genome sequencing studies of eukaryotic organisms have focused on sequencing euchromatic regions, and little information has been obtained about the heterochromatic nature of such genomes. Sequencing satellites is still a challenge, and the possible presence of protein-coding genes or other genetic or regulatory elements in such regions is still questionable.
Mini- and microsatellites are tandem repeats composed of short repeat units. The repeat unit size is used as the main feature to classify a short tandem repeat as a mini- or microsatellite. However, there is at the present time no consensus about the precise definition of both kinds of repeats. Some authors do not consider mononucleotide repeats [poly(A) tracts, for example] as microsatellites, whereas for others the threshold between micro- and minisatellites may vary between 6 and 10 repeats. There is no consensus either for the minimal number of repeat units to be considered as a micro- or minisatellite. Some authors consider that two repeat units are not enough and fix the threshold at three, four, or even five units. It has also been proposed that any distinction between the two types of repeats would be purely academic. In the present chapter, we will review experimental data showing that although molecular mechanisms involved in mini- and microsatellite size changes are basically similar (if not identical), it is possible to propose their classification as two different types of repeats based on their distributions and functions in eukaryotic genomes. The motif sizes, total lengths, and the numbers of minisatellites and microsatellites per genome are summarized in Fig. 2.
Historically, the first minisatellite (also called VNTR [for variable number of tandem repeats]) was discovered by Wyman and White (543), who identified a human locus exhibiting restriction fragment length polymorphism among individuals with various degrees of proximity. Later on, several hypervariable loci were identified in the human genome and called “minisatellites,” as a reference to megabase-large variable satellite DNA (221). One of the first minisatellites was found in an intron of the human myoglobin gene and comprised four 33-bp tandem repeats with some sequence similarities with other minisatellites discovered previously. It was flanked by a 9-bp direct repeat, a characteristic signature of transposable elements, suggesting that this minisatellite was able to transpose in some way (221). Although the first microsatellite was characterized by Weller and colleagues (531) as a polymorphic (GGAT)165 repeat in the human myoglobin gene, the term “microsatellite” (also called short sequence repeat) entered the literature a few years later, with the demonstration that a (TG)n repeat in the human genome exhibited size polymorphisms when amplified by PCR on genome samples from several individuals (286). The increasing availability of DNA amplification by PCR at the beginning of the 1990s triggered a tremendous number of studies using the amplification of microsatellites as genetic markers for forensic medicine, paternity testing, or positional cloning. Among the most prominent and original studies are the identifications by microsatellite genotyping of the skeletal remains of an 8-year old murder victim (178) and of Josef Mengele, who escaped to South America following World War II (215). DNA analysis of the descendants of the U.S. president Thomas Jefferson showed that he was the father of one of his slave's children, a long-standing debate among historians (146). 
Microsatellite typing started to be used in yeast studies 12 years ago to identify laboratory strains of S. cerevisiae (413) and more recently to identify industrial yeast strains or pathogenic strains of S. cerevisiae (196, 304), Candida albicans (300), and Candida glabrata (S. Brisse, C. Pannier, A. Angoulvant, T. de Meeus, O. Faure, P. Lacube, H. Muller, J. Peman, A. M. Viviani, R. Grillot, B. Dujon, C. Fairhead, and C. Hennequin, unpublished data) involved in human infections. Population geneticists also extensively used microsatellite typing to study population structures and evolution (213) and to study specific questions concerning the origin of domestic horses (516) or French wine grapes (49), to give just a couple of examples. Before the completion of whole-genome sequences, several linkage maps were built using microsatellites as genetic markers for channel catfish (Ictalurus punctatus), rainbow trout (Oncorhynchus mykiss), wheat (Triticum aestivum), Arabidopsis thaliana, pine (Pinus taeda), and Homo sapiens (107), to name just a few. Minisatellite size polymorphisms were used in a similar way for paternity testing (194) or to determine the source of saliva on a used postage stamp (201) as well as various other forensic studies (163). But the abundance of variable microsatellites, compared to minisatellites, made the former the marker of choice for similar studies. This is represented in Fig. 3, in which the numbers of citations per year in the PubMed database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed) for the words “microsatellite” and “minisatellite” have been plotted. In a few years, the number of citations for microsatellites went from 2 in 1989 to 433 in 1994 and to more than 2,000 in 1999, well above levels of citations attained by minisatellites and DNA satellites. 
The relatively recent development of single-nucleotide polymorphisms as genetic markers most probably led to the clear inflection in the citation curve observed for microsatellites (Fig. 3). Undoubtedly, genotyping human populations by use of variable microsatellites (or single-nucleotide polymorphisms) has become a powerful tool, not only for human geneticists who study population differentiation in modern humans (21, 30, 404) but also for governments in regulating immigration flows, as already enforced by laws in several states of the European Union.
Over the last 10 years, a large number of studies on microsatellite distribution in eukaryotic genomes have been published. Unfortunately, it is not always an easy task to compare published results, since different authors often use different algorithms (sometimes homemade and not necessarily published), and they do not always agree on the definition of a microsatellite and therefore use different settings and thresholds to detect what they think should be defined as a microsatellite. However, a recent study describes a comparative analysis of the main algorithms available to exhaustively search for tandem repeats in DNA sequences. Five algorithms were compared, namely, Mreps (255), Sputnik (C. Abajian, University of Washington, Seattle [http://espressosoftware.com/pages/sputnik.jsp]), TRF (35), RepeatMasker (A. Smit, R. Hubley, and P. Green [http://repeatmasker.org/]), and STAR (101). It was shown that the total number of perfect microsatellites detected varies greatly among the five algorithms, ranging from 6,228 detections per megabase, for Sputnik, to 76 detections per megabase, for RepeatMasker. STAR and RepeatMasker, which are less efficient for the detection of abundant microsatellites of two to three repeat units in length, generally detect fewer but longer microsatellites than do TRF, Mreps, and Sputnik. Most microsatellites detected by RepeatMasker and STAR are also detected by the three other algorithms, whereas the reverse is not true (273). Note that other algorithms have been developed to detect tandem repeats in DNA sequences (434), some of them, like ACMES or MREPATT, dedicated to the detection of perfect tandem repeats (410, 431), and others, like TandemSWAN, designed to specifically detect imperfect (or “fuzzy”) tandem repeats (46), but their efficiency, compared to that of other more widely used algorithms, has not been carefully evaluated. 
Therefore, when undertaking a microsatellite search in a given genome, one is advised to consider many parameters before selecting a program, particularly if short, imperfect, or compound microsatellites are being sought, since the efficiency of detection of such genetic objects greatly depends on the algorithm used. It must be noted that recent works tried to define the main parameters that are associated with tandem repeat polymorphism in order to predict variable (or hypervariable) micro- and minisatellites. In one approach, it was found that G+C content and a measure of redundant patterns of mutation (called HistoryR) were both strongly correlated with minisatellite polymorphism (104). In the other approach, a numeric score (called the VARscore) dependent on several parameters, including the number of units, length, and purity, was assigned to each tandem repeat. A good correlation was found between the VARscore and tandem repeat polymorphism, as determined experimentally (275).
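The dependence of microsatellite counts on settings and thresholds can be illustrated with a minimal sketch. It is not the method used by Mreps, Sputnik, TRF, RepeatMasker, or STAR; it is a simple regular-expression scan for perfect repeats only, and the function name and the parameters `min_unit`, `max_unit`, and `min_copies` are assumptions introduced here for illustration.

```python
import re

def find_perfect_microsatellites(seq, min_unit=1, max_unit=6, min_copies=3):
    """Return sorted (start, end, unit, copies) tuples for perfect tandem repeats.

    Illustrative sketch only: coordinates are 0-based, end-exclusive, and the
    thresholds are hypothetical defaults, not those of any published program.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # A unit of unit_len bases followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are themselves repeats of a shorter motif
            # (e.g. "AAA" or "ATAT"); the shorter unit length reports them.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            copies = (m.end() - m.start()) // unit_len
            hits.append((m.start(), m.end(), unit, copies))
    return sorted(hits)

# Toy sequence: five CAG units and five AT units separated by short spacers.
toy = "TT" + "CAG" * 5 + "CC" + "AT" * 5 + "GG"
loose = find_perfect_microsatellites(toy, min_copies=3)
strict = find_perfect_microsatellites(toy, min_copies=6)
```

On this toy sequence the looser threshold reports both repeats, whereas requiring six copies reports none, which is the same kind of threshold effect that makes published counts difficult to compare.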
The S. cerevisiae genome was the first eukaryotic genome to be completely sequenced and, as such, was also the first in which microsatellites could be exhaustively analyzed. Given our previous comments on the different algorithms available for such a study and the lack of a clear consensus on the definition of a microsatellite by the scientific community, it is therefore not surprising that the outcomes of such studies showed large variations in the numbers of microsatellites detected. If one compares only trinucleotide repeats (a class of microsatellites), for which several independent studies with budding yeast are available, absolute numbers of such repeats in the budding yeast genome vary from 92 to 1,769 (Table 2), depending on parameters chosen by the authors (19, 95, 133, 235, 306, 415, 548). Despite these expected discrepancies, authors agree that microsatellites are generally excluded from yeast genes, except for trinucleotide repeats and hexanucleotide repeats, which are found both in ORFs and in intergenic regions (499). Careful analysis of amino acids encoded by trinucleotide repeats showed an overrepresentation of polar or charged residues, such as glutamine, asparagine, and glutamic and aspartic acids. The genes containing these repeats often encode nuclearly located proteins, particularly transcription factors and regulators of gene expression (312a, 415, 442, 548). This is also the case for 13 other hemiascomycetous yeast genomes analyzed (306) and seems to be true for other eukaryotes, including humans (129). A recent study of 12 sequenced genomes from Drosophila species showed that the most frequent codon found in gene-encoded trinucleotide repeats was CAG, whereas CAA was the most frequent triplet in repeats detected in noncoding regions (203), reminiscent of what was observed in Saccharomyces bayanus var. uvarum (306).
Early studies of human DNA sequences found in public databases concluded that CAG and CGG trinucleotide repeats were overrepresented in the human genome (181, 475), but more-recent studies on the human genome public sequence revealed that this overrepresentation was limited to coding regions (478). This discrepancy is probably due to a bias in the sequences deposited in public databases at the time of the former studies. In humans, as in yeast, microsatellites are generally underrepresented in exons, except for trinucleotide and hexanucleotide repeats. The densities of microsatellites are similar on all chromosomes, even though chromosomes 17, 19, and 22 show a slight increase in density (479). Analysis of five complete plant genomes showed that microsatellites are preferentially found in unique regions of the genomes and exhibit a lack of association with transposon-rich regions. The frequencies of each microsatellite class vary, but imperfect trinucleotide repeat densities range from 77 repeats/Mb in soybean (Glycine max) to 159 repeats/Mb in Arabidopsis thaliana, with an average of 105 repeats/Mb, a density significantly higher than that seen for budding yeast for the same class of repeats (imperfect trinucleotide repeats at least four units long) (Table 2) (346). Compared to plants, yeast, and humans, teleostean fishes are remarkably rich in perfect microsatellites, since 1,700/Mb and 1,176/Mb are detected in the genomes of Tetraodon nigroviridis and Fugu rubripes, respectively, compared to 281/Mb on average for plants (346), 145/Mb for budding yeast (306), and 87/Mb in the human genome (270a) (Table 3).
It must also be noted that microsatellites are not homogeneously distributed along budding yeast chromosomes; rather, they exhibit repeat-rich and repeat-poor regions (418). Interestingly, some of these regions also correspond to regions highly biased for G+C content (116, 462), suggesting that forces shaping chromosome structure have an influence on microsatellite formation or maintenance. A similar observation was made for dinucleotide repeats in D. melanogaster, which were frequently found in clusters (22). Similar analyses of other eukaryotic genomes will be required in order to determine if these observations underlie a more general rule.
In summary, several algorithms were designed to search for tandem repeats in DNA sequences, but one should keep in mind that some of them are more efficient at finding specific kinds of repeats (short and perfect repeats or long and imperfect ones, for example). Therefore, before selecting a given program to perform a search, it is recommended first to define the kind of repeats that are being looked for and then to select, among those available, the program best suited to perform this search. An alternative approach could be to select two or more programs, run them, and compare their outputs in order to get a list of repeats that would be more exhaustive than one obtained with a single program.
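The alternative approach of running two or more programs and comparing their outputs can be sketched as follows. This is a minimal illustration that assumes each program's output has already been reduced to (start, end) coordinates (0-based, end-exclusive) on the same sequence. The function names and the example coordinates are hypothetical and do not correspond to the output format of any particular program.

```python
def merge_calls(*call_sets):
    """Union of repeat calls from several detectors.

    Overlapping or directly adjacent (start, end) intervals are fused into one,
    giving a list more exhaustive than that of any single detector.
    """
    intervals = sorted(iv for calls in call_sets for iv in calls)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Extends (or touches) the previous interval: fuse the two.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def shared_calls(calls_a, calls_b):
    """Calls from detector A that overlap at least one call from detector B."""
    return [a for a in calls_a
            if any(a[0] < b[1] and b[0] < a[1] for b in calls_b)]

# Hypothetical outputs of two detectors run on the same sequence.
detector_a = [(2, 17), (19, 29)]
detector_b = [(5, 17), (40, 60)]
combined = merge_calls(detector_a, detector_b)
consensus = shared_calls(detector_a, detector_b)
```

The merged list favors exhaustiveness, whereas the shared list keeps only loci supported by both detectors. Which of the two is preferable depends on the kind of repeats being looked for.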
Minisatellites have been less studied than microsatellites (Fig. 3), and few genome-wide analyses have been performed. Earlier attempts at systematic minisatellite cloning gave a rough estimate of a few hundred minisatellites in the human genome (18, 359), but other estimations were in the range of a few thousand. By analyzing the sequence from human chromosomes 21 and 22, Denoeud et al. (104) found 127 minisatellites fulfilling their criteria. Extrapolating this number to the complete human genome gives a rough estimation of 6,000 minisatellites. Among these, a majority (75%) are expected to show some degree of polymorphism in the population. In a more recent work, Vergnaud and Denoeud (513) analyzed the minisatellite content of human chromosome 22, A. thaliana chromosome 4, and C. elegans chromosome 1 by use of the TRF software (35). In this study, minisatellites were defined as tandem repeats with units longer than 16 bp and covering at least 100 bp with a high G+C content and a strong strand bias. By use of this definition, half of the 62 minisatellites detected on chromosome 22 were located within the terminal 10% of chromosome 22, confirming previous studies (104). The same analysis revealed that minisatellite densities were similar in A. thaliana and C. elegans, with the same subtelomeric bias in the nematode, whereas in A. thaliana, minisatellites tend to cluster in the pericentromeric region. In the genome of the teleostean T. nigroviridis, one tandem repeat that could qualify as a minisatellite was detected. Its repeat unit is 10 bp long and is repeated in very large variable-size arrays on 10 of the 11 short arms of subtelocentric chromosomes, with the exception being the rDNA-bearing chromosome (426).
To our knowledge, the only eukaryotic genome that was entirely analyzed for the presence of minisatellites is the budding yeast genome. By use of different algorithms, 44 minisatellite-containing genes were found by Verstrepen et al. (515) and 49 were detected by Richard and Dujon (414), with a 95% overlap between the two sets of minisatellites. In addition, 11 minisatellites were detected in intergenic regions, but they are, on the average, shorter than those included in ORFs, and 18 were found in subtelomeric Y′ elements (298). If one excludes Y′ minisatellites, there is no obvious bias for their distribution in subtelomeric or centromeric regions. They are, on the average, GC rich, a property shared by human minisatellites, and they generally exhibit a strong negative GC skew (more cytosines than guanines on the coding strand). The conservation of S. cerevisiae minisatellites in other completely sequenced hemiascomycetous yeasts reveals that a large number of them are not conserved during yeast evolution, although the genes that contain them are conserved. In Saccharomyces paradoxus, a close relative to S. cerevisiae, 73% are conserved, but in more distantly related yeast species (Candida glabrata, Kluyveromyces lactis, Debaryomyces hansenii, and Yarrowia lipolytica), only 25 to 47% of minisatellites are conserved. In addition, in each species a few pseudominisatellites were detected, testifying that in the distant past, the minisatellite was present at the same location in the same gene. Remarkably, in several instances, a different minisatellite was found at the same location in a given gene among different species. This is the case for the PRY2 gene, which contains a six-copy 18-bp repeat in S. cerevisiae, a nine-copy 15-bp repeat in S. paradoxus, a six-copy 15-bp repeat (with a different motif) in Y. lipolytica, and a pseudominisatellite in D. hansenii and does not contain any minisatellite in C. glabrata and K. lactis. 
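The G+C content and GC skew mentioned above can be computed in a few lines. The sketch below assumes the commonly used normalization (G - C)/(G + C) for the skew, which is a convention chosen here rather than one stated in the studies cited. The function names and the example motif are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_skew(seq):
    """(G - C) / (G + C) on the strand given; negative means more C than G.

    Returns 0.0 when the sequence contains neither G nor C.
    """
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)

# Hypothetical 12-bp coding-strand motif: 6 C, 2 G, 4 A/T.
motif = "CCACCTGCTGCA"
content = gc_content(motif)   # (6 + 2) / 12
skew = gc_skew(motif)         # (2 - 6) / (2 + 6)
```

For this invented motif the skew is negative, the situation described above for yeast minisatellites, in which cytosines outnumber guanines on the coding strand.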
This suggests that the evolution rate of minisatellites is much higher than that for their containing genes or, alternatively, that minisatellites are able to invade genes at specific locations, like transposable elements (414). Interestingly, among minisatellite-containing genes, 50 to 60% encode cell wall proteins or proteins involved in cell wall metabolism (48, 414, 515). Minisatellites found in these genes encode mostly serine and threonine residues (59% of the total), which are believed to be the sites of O mannosylations by the Pmt4 protein, these posttranslational modifications being important for maintaining the protein at the cell wall surface (121, 272). Therefore, it seems reasonable to assume that these minisatellites were positively selected for, as their presence is essential for protein function. Note that, as expected, all of the minisatellites detected in yeast genes contain a motif that is a multiple of 3 nucleotides, allowing unit additions and deletions without disrupting the reading frame. A more recent global analysis of minisatellites in the pathogenic yeast Candida glabrata revealed that this hemiascomycete contains very large minisatellites, often included in cell wall genes and genes involved in cell-to-cell adhesion, making these large minisatellites good candidates to play a role in pathogenicity (489). In support of this hypothesis, a recent work using Aspergillus fumigatus revealed that this pathogenic filamentous fungus contains several minisatellites, some of which contain large repeat units of up to 255 bp. Approximately 3% of these minisatellites are located in genes that encode potential cell surface proteins. One of these genes (Afu3g08990) was inactivated and the corresponding mutant conidia showed decreased cellular adhesion, although no effect on virulence in a mouse model was detected (279).
We have seen that Alu elements are widely spread in the human genome, representing more than 10% of its total size (Table 1). Since Alu repeats contain a poly(A) tail and a central linker region rich in adenines, their association with A-rich microsatellites has been studied. Nadir et al. (356) showed a significant association of the 3′ end of Alu sequences not only with (A)n mononucleotide repeats but also with (AAC)n, (AAT)n, and A-rich tetra- to hexanucleotide repeats, but this association was surprisingly weaker with (AT)n dinucleotide repeats. In another study, (AC)n dinucleotide repeats were found to be preferentially associated with Alu elements, 75% of them being found at the 3′ end of the element, while the remainder were found in the central linker region (15). Interestingly, the (GAA)n trinucleotide repeat, involved in Friedreich's ataxia, may have arisen with the insertion of an Alu element. Out of 788 human genomic loci containing (GAA)n repeats, 63% (501 loci) map within 25 bp of an Alu element. Among them, 94% are associated with the poly(A) tail and the remainder are associated either with the 5′ end of the element or with the central linker region (83). Genome-wide studies using complete genome draft sequences will now be necessary to determine the real impact of Alu and other transposable elements on the spreading of microsatellites in eukaryotic genomes.
We have seen in the first chapter that both dispersed and tandem repeats are abundant in all eukaryotic genomes sequenced so far, although their relative numbers vary among organisms. Due to the repetitive nature of these elements, their presence in a genome may cause reciprocal or nonreciprocal translocations, segmental duplications, gene amplifications, and other kinds of spontaneous chromosomal rearrangements that may ultimately lead to cell death. The molecular mechanisms creating such large genome rearrangements mainly involve defects during S-phase replication and during homologous recombination. These mechanisms, along with their effect on genome stability, will be studied in the present chapter.
Fragile sites were defined cytologically within human metaphasic chromosomes as chromatid constrictions or breaks after cells were grown in the presence of drugs involved in impairing DNA metabolism or replication. Two types of fragile sites were distinguished; common fragile sites were found in all individuals, whereas rare fragile sites were present only in a small proportion of the population (5% or less). Common fragile sites are expressed in the presence of aphidicolin (DNA polymerase inhibitor), bromodeoxyuridine, or 5-azacytidine (nucleotide analogues). Rare fragile sites are expressed in the presence of folic acid (a cofactor in nucleotide anabolism), distamycin A (an antibiotic that binds to AT-rich DNA), or bromodeoxyuridine. Fragile sites have been extensively studied for years, since they are associated with hereditary mental retardations and several types of cancers in humans. Several reviews specifically focused on fragile sites have been published (93, 164, 483), but we will more specifically discuss in the present review the link between DNA repeats and fragile sites and how the replication of regions containing repeated elements may induce the expression of a fragile site.
Three types of repeated elements are found at fragile sites: transposons, microsatellites, and minisatellites. The common fragile site FRA3B was sequenced over 10 years ago (205) and was shown to contain numerous LINEs (L1 and L2), SINEs (Alu and MIR), LTR retroviruses (HERVs and MalRVs), and DNA transposons (mariner and MER) in direct and inverted repeat orientations. Forty Alu repeats, 41 MIR elements, and 39 MER repeats were recognized, covering 12.3% of the region. Altogether, repeated elements covered between 23% and 43% of the FRA3B region, with variations in subregions. Several types of cancers are associated with FRA3B (see below). The mouse orthologous region Fra14A2 is also an aphidicolin-inducible fragile site, and comparison of human and mouse sequences showed a good sequence conservation (72.6% identity). Numerous repeated elements were also detected. Overall, dispersed repeats represent 32.6% of the Fra14A2 region. Surprisingly, almost all L1 elements were inserted after the divergence of human and mouse lineages, since they disrupt the alignment, suggesting that this region was a hot spot for transposition in both lineages (464). Two fragile sites, FRA10B and FRA16B, are AT-rich regions containing tandem repeats of a 42-bp minisatellite and of a 33-bp minisatellite, respectively (198, 552). Both regions exhibit size polymorphisms and intergenerational as well as somatic instability, and fragility is associated with a large expansion of these minisatellites. Thus far, no pathology has been found to be associated with these two fragile sites. This is not the case for the rare fragile site FRAXA, which is the most common cause of hereditary mental retardation in humans. This disorder is due to an expansion of a (CGG)n microsatellite into the 5′-UTR region of the FMR1 gene (152, 514). The FMR1 gene is one of several genes involved in X-linked mental retardation (430), although most of them are not fragile sites. 
Nevertheless, at least three other loci containing (CCG)n microsatellites are responsible for mental retardation, namely, FRAXE (251), FRAXF (424), and FRA11B (225, 226). Note that although the fragile site FRA16A is also due to the expansion of a (CCG)n microsatellite (360), no phenotype has been associated with it thus far. Interestingly, in the pathogenic yeast Candida albicans, chromosomal translocations and chromosome losses are often associated with the major repeat sequence (MRS). The MRS is a large and complex tandem repeat that is found at nine different loci in the haploid genome. The MRS is composed of a 2-kb sequence (RPS) that itself includes seven to nine copies of a 16-bp repeated sequence and six to eight copies of a 29-bp sequence. The RPS can be tandemly repeated up to 39 times in a single MRS, and size heterogeneity of the MRS is a major cause of chromosome length polymorphism in C. albicans (302).
Interestingly, fragile sites are not always associated with repeated sequences. This is the case for FRA7H, which does not exhibit any particular enrichment in repeated elements but has a sequence that is AT rich (58%). Computer analysis of sequence flexibility (443) showed several peaks of high flexibility within the region (337). The same observation was made for FRA16D (423) and FRA7E (559). In addition, it was predicted for FRA7E that these flexible peaks enriched in AT base pairs are able to form secondary structures, at least in silico. Therefore, it seems that a fragile site is determined not only by the presence of many repeated elements or large microsatellites but also by the propensity of the sequence to form some kind of secondary structure that could impede replication through its locus, thus causing fragility.
At the present time, the precise cause for chromosomal fragility is still under debate, but several hypotheses, not mutually exclusive, have been proposed. First, as mentioned above, the molecular structure of the region, particularly its ability to form secondary structures that may stall replication forks, seems to be important. However, one may wonder whether breaks result from single-strand nicks on one DNA strand that are transformed into DSBs when the replication fork encounters these nicks or from another origin, like a nuclease that would recognize and cleave those secondary structures. In Schizosaccharomyces pombe, mating-type switching is induced by the replication fork encountering a single-strand nick at the mat1 locus, transforming this nick into a DSB, thus initiating homologous recombination with one of the two homologous mat cassettes (references 13, 14, 237, and 237a). This is a highly regulated process under the control of several proteins, including the conserved fork protection Swi1/Swi3 complex (Tim/Tipin in mammals), responsible for the temporary replication fork pausing near mat1 (200, 237). The possible presence of single-strand nicks at fragile sites in vivo, during or after replication, is an open question.
The search for trans-acting factors that may regulate fragile site expression led to the finding that the downregulation of Rad51 (involved in homologous recombination) or of DNA-dependent protein kinase (DNA-PK) and ligase IV, both involved in NHEJ (263), increased the expression of FRA3B and FRA16D in the presence of aphidicolin (453). Inactivation of the ATR protein also increases FRA3B and FRA16D expression, but ATM has no effect (72), suggesting that stalled replication forks but not DSBs are the signal activating the replication checkpoint (187). Similarly, the inhibition of BRCA1, a key player in DNA damage response (530), increases the expression of FRA3B and FRA16D (17). The Werner syndrome helicase, a member of the RecQ family of helicases and involved in the resolution of DNA structures during replication (55), was also shown to be involved in fragile site expression. In the presence of aphidicolin, Werner syndrome helicase-deficient cells show a significantly higher level of gaps and breaks than wild-type cells at FRA7H, FRA16D, and FRA3B (396). Altogether, these data point to a role of stalled forks in promoting single-stranded damage at fragile sites, triggering checkpoints, and increasing fragility, but the precise mechanism remains elusive.
Several studies with the model budding yeast have also tried to decipher the molecular basis for chromosomal fragility. Freudenreich and colleagues designed an elegant experimental system to look at chromosomal fragility in yeast. The sequence to be tested for fragility was cloned into a yeast artificial chromosome, between the centromere and a URA3 selectable marker. If the cloned sequence induced fragility, the distal chromosome part containing the URA3 marker would be lost and yeast cells would become resistant to the drug 5-fluoroorotic acid (148). Using this system, they were able to show that (CAG)n and (CCG)n repeats exhibited length-dependent fragility in vivo (28, 148). They also showed that fragility was increased in cells mutated for MEC1 (yeast ATR homologue), RAD9, and RAD53 (269), and in a checkpoint-deficient allele of the MRC1 gene, mrc1-1, involved in replication fork stability (149). Using the same experimental system, they looked at the fragility of FRA16D and showed that a short (AT)n repeat within the fragile site sequence was able to induce chromosome breakage in a length-dependent manner. By two-dimensional gel electrophoresis, they were able to visualize the accumulation of DNA molecules during replication at the (AT)n repeat, proving that replication forks stall within or near this sequence in vivo in a length-dependent manner (554). Other experimental systems were designed in yeast, using the loss of a genetic marker on a yeast chromosome following natural or I-SceI-induced chromosome breakage (488). These studies showed that chromosomal fragility was increased in DNA damage checkpoint mutants (6, 354) and when Pol α levels were reduced (276). 
Interestingly, fragility sometimes involved repeated elements, like tDNAs and LTRs (6) or Ty retrotransposons (276), but spontaneous chromosome rearrangements were also observed for regions lacking such elements (354), suggesting that replication stalling and DNA breaks (either single or double stranded) may occur in nonrepetitive regions. These regions have been called replication slow zones by Cha and Kleckner (74), who determined that DSBs do not occur stochastically but rather in specific regions on yeast chromosome III during replication of strains carrying a mutant allele of the MEC1 checkpoint gene (mec1-1). In another study using the same mec1-1 mutant, it was shown that 17 early-firing regions in the yeast genome were not efficiently replicated, but they do not seem to correspond to replication slow zones (407). Hence, at the present time, it is clear that yeast fragile sites and mammalian fragile sites share some properties, such as nonrandom breakage at specific programmed sites within the genome, an increase in fragility when replication is slowed down with drugs, and the implication of the MEC1/ATR checkpoint in regulating fragility. However, the majority of yeast fragile sites do not involve repeated elements, nor do they occur in regions of nucleotide composition bias, suggesting that not all fragile sites in mammals may occur in such regions.
There is a long-standing debate among cancer specialists of how to determine whether large chromosomal abnormalities detected in cancer cells are the result of uncontrolled cell proliferation or are required to transform a normal cell into a cancerous one (318). Early studies described extensive karyotype alterations (2), suggesting that spontaneous DNA damage during replication could promote formation of DSBs that are highly recombinogenic (517) and would give rise to chromosomal translocations and other genome-wide rearrangements (451). More-recent work helped to refine this model. By comparing different stages of human tumors and normal tissues, Bartkova et al. (31) showed that the DNA damage response is activated very early in tumor life and precedes the appearance of p53 mutations. At the same time, Gorgoulis et al. (170) reached an identical conclusion, showing in addition that FRA3B was frequently lost in neoplastic tissues, a signature of an unrepaired chromosomal break at this fragile site. Several lines of evidence for the involvement of FRA3B in cancer exist (421) and are strengthened by a very recent study in which human-mouse chromosome 3 somatic hybrid cells were exposed to aphidicolin-mediated replication stress. Between 13% and 23% of clones exhibited deletions of the FRA3B region, spanning 200 to 600 kb and matching deletions observed for several gastrointestinal, colon, lung, breast, and cervical cancers (120). Similarly, loss of heterozygosity was frequently observed at FRA16D for breast and prostate cancers, and a recurrent t(14;16) translocation involving FRA16D has been identified for multiple myeloma (397). Sequence analysis of one deletion showed that an (AT)n dinucleotide repeat and an AT-rich minisatellite were found at both endpoints, showing the implication of AT-rich repeats in FRA16D fragility. 
Most cell lines with FRA16D deletions also exhibited FRA3B deletions, showing that chromosomal breakage, following replication stress, may affect more than one common fragile site (137).
In conclusion, human fragile sites are associated with the presence of dispersed or tandem repeats, although this is not necessarily the case in yeast. Therefore, rather than envisioning the presence of a fragile site as the direct effect of the presence of such repeats, we should try to analyze replication processes in eukaryotic cells more thoroughly. It is possible that repeated elements just play the role of enhancers of “replication defects” that occur at several other places within genomes but are not detected since they do not give rise to fragile sites.
Trinucleotide repeats belong to the category of microsatellites. Since the discovery 17 years ago of the first neurological disorders involving trinucleotide repeat expansions, more than two dozen human diseases involving what are sometimes called “dynamic mutations” (422) have been brought to light. Despite extensive studies undertaken by many groups, using different prokaryotic or eukaryotic model systems, the molecular mechanism(s) underlying such dramatic expansions of triplet repeats in one single generation in humans is still unknown. This is not the place to extensively review all that is known about the disorders or about the peculiar properties of these particular microsatellites, since many review articles have been dedicated to that purpose in the last few years (84, 160, 277, 334-336, 357, 372, 385, 420, 532). Instead, we are going to give a general overview of the diverse molecular mechanisms involved in trinucleotide repeat instability and, more importantly, address what we think are crucial questions for better understanding of these mechanisms.
Researchers usually classify triplet expansions into two main categories, those that occur within genes and generate a protein containing an expansion of a given amino acid, and those that occur in noncoding regions. For the first category, only expansions of polyglutamine and polyalanine have been discovered thus far. It must be noted that the “rule of three” (308) was broken when expansions of a GC-rich dodecamer (270), of a tetranucleotide repeat (285), and of a pentanucleotide repeat (320) were discovered as being involved in human neurological disorders. Therefore, the most commonly accepted view is that expansions are unrelated to the trinucleotide repeat nature of the sequence but are related to the propensity of the sequence to form secondary structures (see below) and to interfere with DNA replication, repair, or recombination.
Compared to those in coding regions, expansions in noncoding sequences include a large number and variety of tandem repeats. As already mentioned, the most common form of inherited mental retardation in humans, called fragile X syndrome, is due to expansions of a (CCG)n trinucleotide repeat responsible for the fragility (152, 514). Type 1 myotonic dystrophy (DM1) and type 8 spinocerebellar ataxia (SCA8) involve a (CTG)n repeat in the 3′-UTR of the DMPK gene (54, 153, 186), whereas DM2 involves expansion of a (CCTG)n repeat in an intron (285). Friedreich's ataxia is induced by the expansion of a (GAA)n repeat in an intron (70), whereas SCA10 is linked to the expansion of a pentanucleotide (ATTCT)n, also in an intron (320), and progressive myoclonus epilepsy involves a (C4GC4GCG)n dodecamer in the 5′-UTR of the EPM1 gene (270, 518). Intuitively, one could think that noncoding regions would show a higher flexibility than coding regions to accommodate tandem repeats of various unit lengths, since any change in length that is not a multiple of three nucleotides would be fatal for gene function if the repeats were in exons. This is what is observed, since several of these repeat unit lengths are not multiples of three. Even though these repeats are located in nontranslated regions, their repetitive sequences sometimes interfere with other cellular processes, such as replication, transcription, and splicing. In the fragile X syndrome, the FMR1 gene is extensively methylated and inactivated in full-mutation alleles containing more than 230 CCG repeats (394). Within shorter alleles, the locus is not methylated and the mRNA levels are high (245), but protein translation is decreased, due to ribosome stalling in the expanded CCG repeats in the 5′-UTR of the FMR1 mRNA, leading to a dramatic reduction in protein levels (130, 223). Hypermethylation of CpG islands adjacent to FRAXE and FRAXF alleles, containing an expanded (CCG)n repeat, was also reported (251, 424).
In Friedreich's ataxia, the expansion of the (GAA)n sequence in the first intron of the FRDA gene results in a reduction in gene expression, due to the inhibition of transcriptional elongation (70). This inhibition was recapitulated in vitro and in vivo in bacteria by Krasilnikova et al. (260). They showed that long (GAA)n repeats carried by an E. coli plasmid interfered with transcription by dramatically reducing the amount of full-size mRNA containing these repeats. A similar observation was made in an in vitro reconstituted transcription system by the same authors. Transcriptional study of expanded (CAG)n or (CTG)n repeats also showed a reduction in the amount of normally sized mRNA molecules containing these repeats, but, surprisingly, longer mRNA molecules were detected, suggesting that some kind of transcription slippage could occur, leading to transcripts longer than expected (124). In DM1, the expanded (CTG)n repeat in the 3′-UTR of the DMPK gene reduces the expression of the downstream gene DMAHP, probably by locally remodeling chromatin (250, 495). Two CTCF-binding sites were identified flanking the (CTG)n repeat, and methylation of the DM1 locus in congenital myotonic dystrophy disrupts their function, probably by modifying nucleosome positioning in this region (136). DMPK mRNAs, containing expanded CUG repeats, form nuclear foci in vivo and the transcripts are not properly exported (91, 486). When a URA3 reporter gene containing in its 3′-UTR an expanded CUG repeat tract was expressed in yeast cells, CUG-containing foci were detected but were not specifically clustered in the nucleus, since cytoplasmic labeling was visible. This suggests that some but not all CUG-containing RNA defects can be recapitulated in budding yeast (124). The DMPK mRNA binds to the muscleblind-like protein (MBNL) (127, 330) and the CUG-binding protein (CUG-BP) (392), deregulating splicing of several transcripts and causing defects in a muscle-specific chloride channel (77, 309).
Expanded CUG repeats also form hairpins that activate the double-stranded RNA-dependent protein kinase PKR, interfering with its normal cellular function (496). Note that in DM2, which is due to an expansion of a (CCTG)n tetranucleotide repeat, the CCUG-containing mRNAs also form nuclear foci and bind muscleblind-like proteins (128, 310). In conclusion, expanded repeats in noncoding regions interfere with several cellular processes, such as methylation, transcription, splicing, RNA processing, nuclear export, and translation, and the resulting expanded mRNAs often acquire a dominant negative altered function that is directly involved in pathogenicity.
In comparison to noncoding sequences, trinucleotide repeat expansions within coding sequences are homogeneous. First, they concern only triplets, since any other size change would disrupt the reading frame and lead to gene loss of function. Second, they have been found thus far to concern only two types of amino acids, namely, glutamine and alanine. Third, expansions are always of moderate size compared to noncoding expansions, which may reach several hundreds or even thousands of triplets in one generation. Although mechanisms leading to polyglutamine and polyalanine expansions are not necessarily different, they have their own specificities that will be discussed below.
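The reading-frame argument above can be illustrated with a short sketch. It is only an arithmetic illustration, not a model of any particular gene: the function names are hypothetical, and the only assumption is that a size change of a number of base pairs that is not a multiple of 3 shifts the downstream frame.

```python
def preserves_reading_frame(unit_length: int, units_changed: int) -> bool:
    """Return True if gaining (positive) or losing (negative) `units_changed`
    repeat units of `unit_length` bp leaves the downstream reading frame intact."""
    return (unit_length * units_changed) % 3 == 0


def frame_preserving_changes(unit_length: int, max_units: int = 6) -> list:
    """List the unit gains (1..max_units) that keep the reading frame.
    `max_units` is an arbitrary illustrative bound."""
    return [n for n in range(1, max_units + 1)
            if preserves_reading_frame(unit_length, n)]


if __name__ == "__main__":
    for unit_length in (1, 2, 3, 4):
        print(unit_length, frame_preserving_changes(unit_length))
```

For a triplet repeat every gain or loss of units keeps the frame, whereas for a mono-, di-, or tetranucleotide repeat only changes occurring in multiples of three units do so, which is consistent with coding expansions being restricted to triplets.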
Expansions of (CAG)n repeats inside exons are found in Huntington disease (HD) and several SCAs. In these neurodegenerative disorders, the reading frame consists of an expanded CAG triplet that always encodes a polyglutamine tract. It is peculiar that out of two possible codons for glutamine, CAA and CAG, only CAG expansions have been discovered thus far, probably pointing to a requirement for a GC-rich triplet in order to trigger expansions. Mechanisms leading to polyglutamine pathogenesis have been reviewed elsewhere (160, 371, 372). Polyglutamine aggregates were described more than 10 years ago as intranuclear inclusions that cause a progressive neurological phenotype in mice that is similar to polyglutamine disorders in humans (90, 370, 452). Formation of these detergent-resistant aggregates (238) depends on the length of the polyglutamine tract (262, 317) and on the presence of chaperone proteins in several model organisms, including yeast (262, 350), Drosophila (75, 239), and C. elegans (445). Aggregates were initially thought to be directly involved in pathogenicity, but it was subsequently shown that neuronal death was directly correlated not to their presence (446) but rather to the nuclear presence of a soluble fraction of polyglutamine proteins (249). These proteins induce cell death in neurons by apoptosis (280, 446), although a mechanism of cellular death not involving apoptosis has also been reported (503). Protein-protein interaction domains often contain polyglutamine tracts and are therefore more prone to self-aggregation than other protein domains (236). 
We want to point out that the formation of polyglutamine aggregates is reminiscent of abnormal protein aggregation seen for other neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease, and prion diseases (401, 452), and therefore searching for genes encoding polyglutamine tracts in fully sequenced genomes could help to predict which proteins might share the same properties (329).
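As a minimal sketch of what such a search could look like, the following snippet scans protein sequences for uninterrupted runs of glutamine (Q). The function name, the example sequence, and the minimum tract length of 10 residues are illustrative assumptions and are not taken from the studies cited above; a real survey would also need to handle interrupted tracts and sequence databases.

```python
import re


def find_polyq_tracts(protein: str, min_length: int = 10) -> list:
    """Return (start_index, length) for every run of at least `min_length`
    consecutive glutamine residues (one-letter code Q) in `protein`.
    The default threshold is an arbitrary illustrative choice."""
    pattern = re.compile(r"Q{%d,}" % min_length)
    return [(m.start(), m.end() - m.start())
            for m in pattern.finditer(protein.upper())]


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real protein.
    toy_protein = "MA" + "Q" * 12 + "PLQQQA"
    print(find_polyq_tracts(toy_protein))
    print(find_polyq_tracts(toy_protein, min_length=3))
```

Start indices are 0-based positions in the protein sequence.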
It is intriguing that several trinucleotide repeat disorders affect the central nervous system. This observation has not found any satisfying explanation thus far. A possibility would be that genes involved in neural development are richer in microsatellites than other genes, increasing the chance to contain an expandable trinucleotide repeat tract. It may be the case in Drosophila, in which long amino acid repeats were preferentially found in genes involved in developmental control and in central nervous system development (236). Similar studies on completely sequenced mammalian genomes should help to clarify this point.
Alanine tracts are expanded in several human developmental disorders, such as type II synpolydactyly (8), oculopharyngeal muscular dystrophy (50), cleidocranial dysplasia (351), and holoprosencephaly (57). Expansions are rather shorter than polyglutamine expansions and they involve imperfect trinucleotide repeat tracts, since polyalanine tracts are often encoded by two to four different alanine codons (56). This is very different from what is seen for other trinucleotide repeat expansions, in which repeats are stabilized by the presence of an imperfect triplet in the sequence. These observations suggest that mechanisms of poly-Gln and poly-Ala expansions could differ. It was therefore suggested that polyglutamine expansions mainly rely on a replication slippage mechanism, whereas polyalanine expansions are due to unequal crossover (525). However, careful examination of poly-Ala-expanded alleles reveals similarities between their pattern and the pattern of minisatellite rearrangements during meiotic recombination (see “Molecular mechanisms involved in mini- and microsatellite expansions” below), suggesting that gene conversion (with or without crossover) could be involved in polyalanine expansions. Interestingly, it was shown that polyglutamine tracts were often mistranslated, leading to polyalanine tracts by a ribosomal −1 frameshift (159, 500).
One of the early questions related to trinucleotide repeats was the time of their expansion in human cells. Were expansions meiotic, prezygotic, postzygotic, or somatic? Addressing this question was a way to define the precise mechanism involved: S-phase mitotic or meiotic replication, meiotic recombination, or yet another mechanism. Since expansions are detected in every tissue (with some degree of mosaicism), it is tempting to think that they mainly occur either during parental meiosis, during zygote formation, or very shortly thereafter (348, 537). Earlier studies of the fragile X syndrome showed that (CCG)n expansions were absent from sperm cells, although they could be detected in lymphocytes, suggesting that expansions occurred in the female germ line and subsequently contracted to shorter allele sizes in the male germ line but not in somatic cells (411). Strengthening this hypothesis, further analyses of intact fetus ovaries showed that oocytes contained expanded alleles (307). Similarly, expansions were detected in sperm DNA, but not in lymphocyte DNA, in patients affected by SCA1 (321).
Single sperm analysis of (CAG)n repeats in the HD gene in humans revealed that the frequency of expansions was dependent on allele size, with longer alleles being more prone to expansions (274). It was subsequently shown by single-molecule analysis of sperm cell DNA that expansions of HD gene repeats occur before the completion of meiosis, and some of the expansions were detected before the beginning of meiosis (547). Mouse models have been particularly helpful in addressing this question. Single-cell analysis of mice transgenic for the DM1 repeats showed that both sperm cells and somatic cells exhibited a bias toward expansions, suggesting that at least some of the intergenerational expansions observed for DM1 originated from somatic expansions (343). When male mouse germ cells were sorted according to their maturation stage and DM1 allele size was analyzed by PCR, no size change was detectable in spermatogonia, spermatocytes, or spermatids, but increases were visible in mature spermatozoa, suggesting that some mechanism(s) was generating expansions after meiotic replication and recombination took place (259). However, a more recent study, using an improved method to sort germ cells in order to reduce contamination by other cell types as much as possible, revealed that expansions of the DM1 allele were detected very early, in spermatogonia, before meiotic replication took place (448).
Analysis of the sperm DNA of a Friedreich's ataxia carrier of a premutation allele (around 100 repeats) showed that an expanded allele of approximately 320 repeats was present. This carrier's son was affected by the disease and his DNA exhibited expansions up to 1,040 repeats. This suggested that a first expansion occurred in the father's germ cells (from 100 to 320 repeats) and another one occurred very early after zygote formation (from 320 to 1,040 repeats) (100).
In summary, trinucleotide repeat expansions may occur at different stages during cell life, probably reflecting the fact that different repeat sequences and different genetic locations may trigger different mechanisms leading to these expansions.
Besides being involved in a number of fragile sites and associated cancers as well as in several neurological and developmental diseases, tandem repeats have a more positive role in eukaryotic genome evolution by allowing the rapid adaptation of a given organism to its environment, namely, the fast evolution of morphological features or modulation of sociobehavioral traits, as will be exemplified now.
As described above, several genes in S. cerevisiae involved in cell wall biogenesis contain minisatellites. Among them, genes belonging to the FLO family of mannoproteins also contain various minisatellites, whose repeat unit sizes are 30 bp (FLO11), 81 bp (FLO10), and 135 bp (FLO1, FLO5, and FLO10) (414). These genes are homologous to the ALS and EPA gene families in C. albicans and C. glabrata, respectively, which are involved in cell-to-cell adhesion and pathogenicity. By using several alleles of FLO1 differing only in their numbers of minisatellite repeat units, Verstrepen et al. (515) showed that the length of the minisatellite was directly correlated to cell adhesion. Yeasts with longer FLO1 minisatellites exhibit a better adhesion to polystyrene but also to other cells, as demonstrated by increased flocculation of liquid cultures. Similarly, it was shown that a yeast strain used to make cherry wine exhibits higher hydrophobicity and cell-cell adhesion, giving it the property to form a buoyant biofilm at the wine surface. This property is dependent both on the level of expression of the FLO11 gene and on the number of minisatellite repeat units in this gene (132).
Morphological variations are common among canine species. By studying 36 developmental genes that contain microsatellites, Fondon and Garner (145) found that 29 out of the 36 loci exhibit fewer interruptions in the repeat tract than their human orthologues. Probably due to the greater repeat purity, some microsatellites were highly polymorphic in dogs, and these size variations were correlated with morphological changes, such as digit polydactyly and skull morphology variations. Some of the genes involved, such as Hox-D13, Runx-2, and Zic-2, are also involved in the genetic abnormalities found associated with polyalanine expansions in humans (see above), suggesting that the others could well be associated with developmental phenotypes in humans too. This is a striking example of how the rapid evolution of a gene sequence by microsatellite size changes may lead to phenotypic diversity in a modern domesticated animal species, probably much faster than would be allowed by accumulation of point mutations in the same genes.
An example of the involvement of microsatellite polymorphism in social behavior is given by voles. The prairie species of this little rodent is biparental and shows high levels of social interest, in contrast to the closely related meadow vole. Length differences in a (GA)n dinucleotide microsatellite in the 5′-UTR of the vasopressin receptor gene (V1aR) underlie this difference. In species with a long, expanded microsatellite, males show higher levels of V1aR in the brain, concomitant with higher rates of pup licking and grooming, along with higher levels of partner preference formation, compared to species with a short microsatellite (180). Remarkably, comparison of the V1aR orthologues in humans, chimpanzees (Pan troglodytes), and bonobos (Pan paniscus) shows that the microsatellite is conserved in humans and bonobos, both species sharing similar sociosexual behaviors, whereas in chimpanzees, a 360-bp sequence encompassing the microsatellite is deleted (180).
The frequent size variability of mini- and microsatellites has prompted a large number of studies aimed at understanding the mechanisms involved. In addition to the work of Alec Jeffreys and colleagues on minisatellite instability during human meiosis, numerous studies with model organisms have been helpful in dissecting such mechanisms. Given experimental data on human minisatellite size changes during meiosis (216, 220), it was initially thought that minisatellites expanded and contracted by homologous recombination, whereas microsatellites were subject to unrepaired slippage events between the newly synthesized strand and its template during S-phase DNA synthesis (also called “replication slippage”) (Fig. 4) (476). Nevertheless, subsequent experiments with both kinds of tandem repeats revealed that the differences between them were less pronounced than initially believed. Extensive studies of microsatellites, particularly of trinucleotide repeats, have shown that several mechanisms were involved in their instability, including replication, meiotic and mitotic homologous recombination, and postreplicational DNA repair. In the meantime, it was shown that minisatellites also evolved by slippage during S-phase replication. Most of what we know today about the molecular mechanisms involved in micro- and minisatellite instability comes from studies in model organisms, mainly but not exclusively E. coli, budding yeast, and mice. Although this review focuses on tandem repeats in eukaryotes, some of the most important papers deriving from studies with bacteria have been included in the present chapter. We will see that although microsatellites and minisatellites are generally rearranged in vivo by similar mechanisms, important details distinguish the two types of repeats.
Due to their repetitive nature and highly biased nucleotide composition, mini- and microsatellites were suspected early on to form secondary structures that may play an important role in the mutational process. Earlier studies on homopurine-homopyrimidine (GA)-(TC) dinucleotide repeat tracts showed that they were able to form triple helices in vitro that block DNA synthesis in vitro (29). This is a general property of all homopurine-homopyrimidine tracts, even though they are not repeated in tandem (16). It was subsequently shown that (GA)n, (AT)n, and (GC)n dinucleotide repeats were very poor substrates for binding E. coli single-strand binding protein (SSB) and RecA or Rad51 recombination proteins. This was interpreted as the inability of these proteins to bind to structured DNA (41). With the discovery of the first trinucleotide repeat disorders, data on structural properties of such repeats have blossomed. Biophysical and biochemical analyses showed that (GTC)n, (CAG)n, and (CTG)n repeats form hairpin structures in vitro, in which cytosines and guanines are paired and adenines or thymines are excluded (Fig. 5) (154, 338, 340, 550, 551). The same repeats also form slipped structures on double-stranded DNA in which both DNA strands carry a hairpin (387). A more recent analysis of (CTG)n repeats shows that two nucleotides at the base of the stem are sensitive to single-strand-specific nucleases, suggesting that some sort of secondary hairpin arises from the stem base (10). In RNA, (CUG)n repeats are able to fold into a triangular tubelike structure that looks like the chocolate bar confection “Toblerone” and is more similar to a triplex than to a classical hairpin (395). Bending properties of (CCG)n, (GGC)n, and (CCA)n trinucleotide repeats were also determined and correlate well with data obtained by X-ray crystallography (58). 
At the same time, it was shown that (CCG)n and (CGG)n repeats form stable hairpins in vitro (154, 339, 355, 549). The same repeats are also able to fold into stable tetraplex structures similar to those found at telomeres (Fig. 5) (144, 151). (GAA)n trinucleotide repeats form several distinct types of secondary structures: at low temperature they form hairpins (193, 480); they may also form triple helices, like other homopurine-homopyrimidine sequences (314); and two triplexes may associate in a structure called “sticky DNA” (435). Both DNA triplex and sticky DNA inhibit transcription (172, 437). Secondary structure formation and transcription inhibition depend on repeat purity, since interrupted repeats lose these properties (436). (GGA)n trinucleotide repeats, which are also homopurine-homopyrimidine tracts, form intramolecular tetraplexes in vitro, similarly to (CCG)n repeats (319).
Interestingly, (CCG)n and (CGG)n repeats block DNA synthesis in vitro, suggesting that secondary structures make significant obstacles to replication factors (509), this defect being partially alleviated by the WRN helicase in vitro (229). Note that DNA synthesis in vitro can be efficiently arrested by a short G16CG(GGT)3 motif, which was proposed to form a tetraplex-like structure (540). Hairpin structures formed by trinucleotide repeats are resistant to cleavage by the FEN-1 nuclease, suggesting that if DNA flaps form in vivo during replication, they cannot be accurately processed by FEN-1 if they contain trinucleotide repeats that form hairpins (474).
Given that all trinucleotide repeat disorders involve sequences that are able to form secondary structures in vitro, it was postulated previously that these structures were essential to the expansion mechanism (327). However, at the present time, there is no direct proof that the same (or similar) structures form in vivo within these repeats, although several lines of evidence that will be discussed below suggest that at least some kinds of trinucleotide repeat-containing structures form in living cells.
Since trinucleotide repeats form efficient secondary structures in vitro, it was a valid concern to assay the efficiency of nucleosome formation on these structures. Nucleosomes are normally formed by 146 bp of DNA wrapped around an octamer of histone proteins, and it is possible to reconstitute them in vitro using purified histones. When (CCG)n repeats were used in this assay, a strong nucleosomal exclusion was found, characterized by a reduced amount of histone-DNA complexes compared to nonrepetitive DNA (498, 522). This nucleosome exclusion was actually one of the strongest known at that time (523). Methylation of CpG dinucleotides inside the repeat tract reinforced nucleosome exclusion, suggesting that expanded (CCG)n repeats that are heavily methylated in fragile X patients may strongly inhibit nucleosome formation in vivo (524). Surprisingly, the effect is the opposite with (CTG)n trinucleotide repeats. These repeats apparently enhance nucleosome assembly in vitro in a length-dependent manner (longer repeats are more efficient at assembling nucleosomes than shorter ones) (521), while other studies demonstrated that as few as six CTG repeats facilitate nucleosome assembly (165, 439). No effect of GGA repeats on nucleosome formation was found (498).
It was very recently shown that inhibition of the SIRT1 histone deacetylase, the homologue of yeast SIR2 involved in rDNA maintenance (see “rDNA repeated arrays” above), reactivated alleles of FMR1 silenced by methylation. This reactivation was accompanied by an increase in histone H3 and H4 acetylation (40). This result shows that changing the methylation of FMR1 is possible by modulating the levels of histone acetylase/deacetylase in vivo, although this approach will most certainly lead to undesired effects on global gene expression.
On one hand, unrepaired slippage between the newly synthesized strand and its template, the so-called “replication slippage” model (Fig. 4), was the main pathway initially proposed to account for microsatellite instability. It was subsequently shown that trinucleotide repeats could be rearranged by gene conversion during homologous recombination. On the other hand, meiotic gene conversion was first identified as being responsible for minisatellite size changes; therefore, studies focused mainly on homologous recombination until it was shown that strand slippage within minisatellites could also occur during mitotic S-phase replication. At the present time, it is generally accepted that any mechanism which involves new synthesis of DNA, such as replication, recombination, and repair, may generate size changes within tandem repeats, the frequency and extent of such changes being clearly dependent on the chromosomal location, sequence, length, and purity of the repeat and on the genetic background.
Several experimental setups have been designed for model organisms in order to look at microsatellite stability in replication mutants. Since almost all of the genes involved in replication are essential for cell viability, conditional mutants of the replication fork were tested. In budding yeast, (GT)n microsatellites are destabilized in a length-dependent manner, with the rate of mutation varying by 500-fold between (GT)15 and (GT)105 repeats and with the longest being the most unstable (536). A mutation in POL30 (yeast PCNA [proliferating cell nuclear antigen] or “clamp”) (Fig. 6) increases the instability of mono-, di-, penta-, and octanucleotide repeats by several orders of magnitude, with mononucleotide repeats being the most strongly destabilized (252). Similarly, a transposon insertion into one of the subunits of the RFC complex, RFC1 (“clamp loader”) (Fig. 6), increases by 10-fold the instability of a (GT)16 tract (544). A specific allele of the POL2 gene (yeast Pol ε, the main polymerase of the leading strand), pol2-C1089Y, was identified as having a moderate effect on mononucleotide repeat stability, whereas another allele, pol2-4, has no effect on the same tracts (248). In contrast, the pol3-01 allele of the POL3 gene (yeast Pol δ, the main polymerase of the lagging strand) destabilizes the same mononucleotide tract 75-fold compared to what is seen for the wild type. This suggests that replication defects on the lagging strand are more prone to induce microsatellite size changes than replication defects on the leading strand.
The above-mentioned experimental setups were designed by cloning a repeat tract into a reporter gene and looking for mutations that restore (or lose) the reading frame. Mutations recovered were almost always additions or deletions of one repeat unit, suggesting that these mutations are the most prevalent in yeast and strengthening the “stepwise mutation model” of microsatellite evolution (66, 173, 556). Although elegantly designed, these experimental setups were not adapted to study trinucleotide repeat instabilities, since any change of one or more repeats would maintain the reading frame, and thus alternative systems needed to be designed. Several groups have used molecular approaches such as PCR and Southern blotting to detect trinucleotide repeat size changes, but although these methods are perfectly qualified to detect frequent size changes, they become tedious when the frequency of instability is low, and they are impossible to use in a genome-wide screen. The first genetic assay designed to overcome these limitations took advantage of the S. pombe ADH1 promoter, which exhibits specific spacing requirements in order to function in S. cerevisiae. If the distance between the TATA element and the transcription initiation start is below or above a given size, the promoter does not function and the reporter gene is turned off. This setup allows one to screen for contractions of a “long promoter” or expansions of a “short promoter”; however, the main limitation of this assay is that it does not allow screening for expansions of trinucleotide repeat tracts that are longer than 28 repeat units, since at that size or above, the promoter is turned off (109). The same genetic assay was used to look for trinucleotide repeat instability in human cells by cloning the (CAG)n-containing reporter gene into a shuttle vector that replicates both in yeast and in human cells (82). 
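The stepwise mutation model mentioned above can be sketched as a simple simulation in which a repeat tract gains or loses exactly one unit each time a mutation occurs. This is a toy illustration only: the function name, the symmetric choice between gain and loss, the per-generation mutation rate, and the lower bound of one unit are all assumptions made for the example and are not taken from the studies cited.

```python
import random


def simulate_stepwise(start_units: int, generations: int,
                      mutation_rate: float, seed: int = 0,
                      min_units: int = 1) -> list:
    """Simulate a repeat tract under a simple stepwise mutation model.

    At each generation, with probability `mutation_rate`, the tract gains or
    loses one repeat unit (equal probability, an illustrative assumption).
    The tract is never allowed to drop below `min_units`.
    Returns the tract length (in units) at every generation, start included.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    units = start_units
    history = [units]
    for _ in range(generations):
        if rng.random() < mutation_rate:
            units = max(min_units, units + rng.choice((-1, 1)))
        history.append(units)
    return history


if __name__ == "__main__":
    trajectory = simulate_stepwise(start_units=16, generations=1000,
                                   mutation_rate=0.01)
    print("final length:", trajectory[-1], "units")
```

Under this model, consecutive generations never differ by more than one repeat unit, which mirrors the single-unit additions and deletions recovered in the reporter assays described above.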
An alternative assay was designed by cloning triplet repeats in the intron of a yeast suppressor tDNA, SUP4. The tDNA is correctly transcribed and spliced as long as the repeat tract is no longer than 30 repeat units, and it suppresses a point mutation in the ADE2 reporter gene. When the repeat tract is longer than this threshold, the suppressor tDNA is no longer active and yeast cells are ade2− (416). Unfortunately, this system suffers from a similar drawback, since it is impossible to use for screening expansions of a trinucleotide repeat tract longer than 30 repeat units. Nevertheless, using either assay, as well as molecular methods, the effects of several replication mutants on trinucleotide repeat instability have been assayed. Mutations in most of the key players of the replication fork destabilize triplet repeats to some extent (Fig. 6). Notable exceptions are mutants of the RNase H complex, with mutations of the pol3-01 allele of Pol δ and the pol2-4 allele of Pol ε, both deficient in proofreading function, namely, the pol2-18 and pol1-17 mutants. Knockouts of Pol ζ (REV7) and Pol η (RAD30), both involved in error-prone translesion bypass synthesis, have no effect on trinucleotide repeat stability (111). The most drastic effects on triplet repeat stability have been observed for mutants of PCNA (POL30), ligase I (CDC9), and yeast FEN-1 (RAD27), the latter showing the strongest effect of all replication mutants. FEN-1/RAD27 is a structural homologue of RAD2 (209, 408); it harbors both endo- and exonuclease activities, interacts with PCNA (202, 542), and is involved in Okazaki fragment processing. The RNA primer of Okazaki fragments is normally removed by RNase H, but the last ribonucleotide is cleaved by Rad27p (352, 353, 403). It was shown that Rad27p processes 5′ flaps very efficiently but at a much lower rate when these flaps are structured as hairpins (197). 
Mutants that separate endo- and exonuclease functions have been isolated, and several studies suggest that efficient processing of trinucleotide repeat-containing flaps necessitates all biochemical activities of Rad27p (288, 289, 470, 545). Interestingly, Dna2p, which is an essential nuclease also involved in Okazaki fragment processing, exhibits only weak phenotypes on trinucleotide repeat instability (Fig. 6). Dna2p and Rad27p interact with each other (61), and further studies suggest that Dna2p first cleaves the RNA-containing 5′ flap of Okazaki fragments, before Rad27p cleaves a second time to obtain a fully processed Okazaki fragment (24, 232). It is not straightforward to reconcile this model with the effects of dna2-1 and rad27Δ mutants on triplet repeat instability, as the latter has a strong effect, while the former exhibits only a very moderate increase in instability. It is not completely clear either why DNA2, along with all other genes involved in replication in yeast, is essential for cell viability, while RAD27 is dispensable, although the knockout strain shows a strong mutator phenotype (497). One explanation could be that RAD27, in addition to its role at the replication fork, also plays another role in DNA metabolism, perhaps required for efficient processing of some recombination intermediates that could arise during the course of replication, as was suggested by some authors (188). In that case, in the absence of Rad27p more lesions would be made during replication of trinucleotide repeats, and these lesions could not be properly repaired, leading to the very high instability observed in this mutant background. One could also simply argue that comparing a point mutation in DNA2 to a complete deletion of RAD27 is unfair and that a point mutation in RAD27 exhibits a reduced level of trinucleotide repeat instability compared to what is seen for the complete inactivation of the gene (545). 
Nevertheless, it is clear that processing of Okazaki fragments during replication is an essential step leading to both contractions and expansions of triplet repeats and more generally of microsatellites. Strong destabilization of all microsatellites has also been observed in different alleles of PCNA, but this large processivity complex interacts with several replication proteins, including Pol δ, Pol ε, Rad27, and Cdc9, as well as the mismatch repair complex (508) (see “Defects in mismatch repair dramatically increase microsatellite instability” below). The observed increase in instability is probably the result of defects at several stages during replication. These data were collected from many publications and are summarized in Fig. 6 (68, 208, 253, 409, 416, 455, 456, 477, 534).
Compared to the abundant literature on microsatellite instability in replication mutants, little has been done for minisatellites. Initially, the human MS32 minisatellite known to exhibit high rates of meiotic rearrangements (0.8% per molecule in sperm) was found to be also unstable in blood cells, although with a much lower frequency (<0.06%). Mutations observed include simple duplications or deletions of a given number of repeat units and can all be explained by intra-allelic events. No evidence of complex events involving interallelic recombination was found, and it was therefore proposed that minisatellite mitotic rearrangements involve replication slippage or sister-chromatid recombination (217).
Following the discovery that microsatellites, particularly trinucleotide repeats, were destabilized by inactivating RAD27, the stabilities of several minisatellites have been assayed in yeast strains deficient for this nuclease. A short minisatellite made up of 20-bp repeat units, tandemly repeated three times, shows an 11-fold increase in instability in a rad27Δ strain and a 13-fold increase in a pol3-t strain (253). Further studies of several human minisatellites integrated into the budding yeast genome showed that, in the absence of RAD27, their instability is highly increased, whereas it is only moderately increased in a dna2-1 mutant and unchanged in a rnh35Δ mutant, recapitulating the respective effects of the same mutants on trinucleotide repeat instability (295, 303). It was subsequently shown that minisatellite rearrangements in rad27Δ cells occur by homologous recombination, since they are suppressed by deletions of RAD51 or RAD52, both key players in homologous recombination, and they exhibit specific features usually associated with homology-driven mechanisms (296). A recent study identified the zinc transporter ZRT1 as also being involved in minisatellite instability during mitotic divisions in budding yeast. Although the pathway involved in this process is not fully understood, it was shown that instability is reduced in a rad50Δ mutant, suggesting a role for this protein, whose functions in DSB repair are numerous (243).
Since the discovery of the first trinucleotide repeat disorders, it has been clear that cis-acting elements played a central role in the expansion process, as reviewed by several authors (277, 335, 385). We are now going to discuss three specific properties of trinucleotide repeats that, in our opinion, are essential to keep in mind when dealing with triplet repeats. As far as we know, other microsatellites do not exhibit the same properties, although specific experiments should certainly be designed to eventually answer this question.
First, it is noteworthy that trinucleotide repeat expansions are locus specific, meaning that a patient with an expansion in DM1 or FRAXA does not show any sign of expansion at other triplet repeat loci. This is exemplified by work published almost 10 years ago in which trinucleotide repeat polymorphisms at the DM1 locus and the SCA1 locus were compared in 29 families affected by HD. They found frequent polymorphism at the DM1 locus, whereas the SCA1 locus and other genome-wide microsatellites were stable. Moreover, they analyzed the same loci in 26 families affected by colorectal cancer, a type of cancer characterized by a high frequency of microsatellite instability, due to deficiencies in the mismatch repair system (see “Defects in mismatch repair dramatically increase microsatellite instability” below). They found that microsatellites, including triplet repeats at DM1 and SCA1, were generally highly variable, strongly suggesting that the molecular mechanisms involved in genome-wide microsatellite instability and those for trinucleotide repeat expansions were fundamentally different (166). This also explains why unstable human trinucleotide repeats integrated at different locations in the mouse genome exhibit variable levels of somatic and intergenerational instability (171, 282, 342, 460) and may even become highly stable (166) or lead to large expansions similar to those observed for humans (168). Interestingly, it was shown that (GT)16 dinucleotide repeats exhibited different rates of variation, depending on the locus at which they were integrated in the yeast genome. A 16-fold difference in the stability rate was measured between the most stable and the most unstable locus, suggesting that different regions of the genome do not replicate microsatellites with the same accuracy (190).
Another remarkable property of trinucleotide repeats is their propensity to be stabilized by the presence of interruptions in the repeat sequence, i.e., one or several repeat units that differ from the repeat consensus. This was first noticed in the fragile X locus FRAXA, in which AGG triplets are frequently found interrupting the (CGG)n repeat tract in the normal population, while patients exhibiting fragile X syndrome show pure, uninterrupted tracts (112, 199, 472). This is also true for (CAG)n repeat tracts in SCA1 alleles that are interrupted by a CAT triplet and for (CAG)n tracts in SCA2 alleles that are interrupted by CAA triplets, these interruptions being lost in the corresponding expanded alleles (80, 81). Similarly, a human (CA)n dinucleotide repeat shows a high stability when interrupted by a TA dinucleotide compared to what is seen for an uninterrupted allele (23). In budding yeast, it was shown that the introduction of an interruption within a perfect (CAG)n repeat tract stabilizes the tract by almost 2 orders of magnitude (427) and that contractions of the interrupted repeat tract almost always removed the interruption, transforming it into a perfect tract (110, 322). Similarly, Petes and colleagues (391) showed that interrupted (GT)n dinucleotide repeat instability was fivefold decreased compared to what was seen for a perfect (GT)n tract, suggesting that the property of interruptions that stabilize a repeat tract could be generalized to other microsatellites. One possible (but not exclusive) explanation would be that the presence of a different repeat unit within a microsatellite could disfavor the formation of potential secondary structures, hence reducing the chance of replication slippage due to these structures.
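The notion of an interruption can be made concrete with a short sketch. The Python code below is only an illustration, not a method taken from the cited studies: it reads a repeat tract in frame, lists the units that differ from the consensus, and reports the longest pure stretch. The function names and the two example alleles are hypothetical and were chosen only to mimic an AGG-interrupted and a pure (CGG)n tract.

```python
def split_units(tract, unit):
    """Cut a repeat tract into consecutive unit-sized pieces, read in frame."""
    k = len(unit)
    if len(tract) % k != 0:
        raise ValueError("tract length must be a multiple of the unit length")
    return [tract[i:i + k] for i in range(0, len(tract), k)]


def interruptions(tract, unit):
    """Return (unit index, sequence) for every unit that differs from the consensus."""
    return [(i, u) for i, u in enumerate(split_units(tract, unit)) if u != unit]


def longest_pure_run(tract, unit):
    """Length, in repeat units, of the longest uninterrupted stretch of the consensus."""
    best = run = 0
    for u in split_units(tract, unit):
        run = run + 1 if u == unit else 0
        best = max(best, run)
    return best


# Hypothetical alleles, for illustration only (not real patient data).
interrupted_allele = "CGG" * 9 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 9
pure_allele = "CGG" * 29

print(interruptions(interrupted_allele, "CGG"))     # two AGG units
print(longest_pure_run(interrupted_allele, "CGG"))  # longest pure stretch
print(longest_pure_run(pure_allele, "CGG"))         # whole tract is pure
```

Both example alleles have the same total length (29 units), but they differ in the length of their longest pure stretch, which is the quantity that the stabilizing effect of interruptions would be expected to depend on under the secondary-structure explanation given above.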
Last but not least of the cis-acting effects, the orientation of the repeat tract compared to the replication origin was first described for E. coli in a seminal paper showing that (CTG)n trinucleotide repeats cloned into a plasmid are more unstable when the CTG strand is the lagging-strand template than when the CAG strand is the lagging-strand template (231). This effect was confirmed in E. coli when repeats were integrated into the bacterial chromosome (553). Similar experiments were performed in yeast cells, in which (CTG)n or (CCG)n repeats were integrated in yeast chromosomes in the two possible orientations. With (CCG)n repeats, the repeat tract is more unstable when the lagging-strand template carries the CCG triplets (28). With (CTG)n repeats, the same orientation dependence seen for E. coli was found for yeast, with unstable repeats carrying CTG triplets on the lagging-strand template (150, 323, 331), although in one early report no difference was detected between the two orientations (332). The underlying model for this orientation effect relies on the propensity for CTG repeats to form secondary hairpins more stable than those formed by CAG repeats. Since it is suspected that the lagging-strand template is more prone to have single-stranded regions during replication than the leading-strand template, then when CTG repeats are exposed on such single-stranded regions, they are more prone to form hairpins than CAG repeats, and therefore they are more unstable. Note that at the present time, it is not formally proven that single-strand regions on the lagging-strand template are exposed long enough during the course of replication to allow the formation of secondary structures. It is also a possibility that some structures are left behind the replication fork and are inherited by the daughter cell, raising the chance of replication errors during the next round of replication.
It would be interesting to determine the orientation of trinucleotide repeats undergoing expansions in humans, but, unfortunately, most replication origins remain to be determined. Recently, a replication origin was identified in the promoter region of the FMR1 gene (174). The replication fork coming from this origin would replicate the (CCG)n trinucleotide repeat tract expanded in FRAXA such that the CGG sequence would be on the lagging-strand template, the orientation that was found to be more unstable in yeast, although this orientation leads more frequently to contractions than to expansions (28).
Given that authors generally use different experimental systems along with different nomenclatures to name repeat orientations, it is not always straightforward to know in which orientation repeats are replicated in a given system, which sometimes adds confusion to the results. We therefore propose to adopt a general terminology to name repeat orientation. Since the lagging-strand template is supposed to be a key player in the instability process, we propose that repeats be named according to the sequence found on the lagging-strand template, i.e., CTG repeats when the CTG sequence is on the lagging-strand template (equivalent to the leading strand). This would correspond to what is generally (but not systematically) described as orientation II in the literature. The advantage of such a nomenclature is that it is applicable to all kinds of microsatellites, without the need to define what is orientation I or II and what is orientation C or D, etc. Hopefully, there will be studies on other types of microsatellites that will help to determine if the orientation effect is restricted to trinucleotide repeats or is a general property of microsatellites.
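The proposed naming rule can be written as a small sketch. The Python code below is an illustration under stated assumptions: the repeat unit is given 5′ to 3′ as read on one strand, and the caller knows whether that strand serves as the lagging-strand template. The function names are hypothetical and are not part of any published tool.

```python
# Translation table for Watson-Crick complements of the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the 5'-to-3' sequence of the complementary strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def repeat_name(unit_on_strand, strand_is_lagging_template):
    """Name a repeat after the unit carried by the lagging-strand template.

    unit_on_strand: repeat unit read 5' to 3' on the strand being described.
    strand_is_lagging_template: True if that strand is the lagging-strand
    template; otherwise the name is taken from the complementary strand.
    """
    unit = unit_on_strand.upper()
    return unit if strand_is_lagging_template else reverse_complement(unit)


# A tract described as CAG on the leading-strand template is a "CTG repeat".
print(repeat_name("CAG", False))
# A tract described as GAA on the lagging-strand template is a "GAA repeat".
print(repeat_name("GAA", True))
```

Because the rule needs only the unit sequence and the identity of the lagging-strand template, the same function applies unchanged to mono-, di-, or tetranucleotide repeats.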
One of the early questions about trinucleotide repeats was the possibility that secondary structure formation could impede the progression of the replication fork through the repeat tracts. This question was first addressed in E. coli, in which plasmids containing (CCG)n and (CAG)n repeats were cloned and transformed. By analysis of replication intermediates using two-dimensional gel electrophoresis, Samadashwily et al. (438) showed that (CCG)n repeat tracts induced strong replication blocks when located near a replication origin, and they addressed the effect of CCG or CGG triplets cloned on the lagging-strand template. The replication block was length dependent, such that longer repeats showed a stronger effect. In contrast, (CTG)n repeats induced stalling of the replication fork only when the CTG triplets, and not the CAG triplets, were cloned in the lagging-strand template and only when chromosomal replication (but not plasmid replication) was inhibited with chloramphenicol. Repeats shorter than (CTG)70 did not show any effect on replication. In budding yeast, by use of a similar experimental setting and the same technical procedure, (CCG)n trinucleotide repeats were also found to stall the replication fork in both orientations, even when short repeat tracts (18 repeat units) were examined. In contrast, only a very weak replication slowdown was observed for (CAG)80 repeats in both orientations (388). Similarly, no strong replication block was detected when (CAG)n repeats were integrated into a yeast chromosome (A. Kerrest, R. Anand, R. Sundararajan, R. Bermejo, G. Liberi, B. Dujon, C. H. Freudenreich, and G. F. Richard, unpublished data). When (GAA)n repeats were examined using the same approach, replication stalling was strong when GAA triplets, but not TTC triplets, were cloned on the lagging-strand template, showing that impediment of replication on the lagging strand was promoted by the homopurine but not the homopyrimidine sequence (260).
Replication fork stalling and replication restart have been intensely studied in the past few years and are the topic of several recent reviews (268, 333, 432). The fate of the replication fork after stalling or blocking was investigated in model organisms. It was first proposed by Seigneur and colleagues (459) that arrested replication forks are transiently “reversed” into Holliday junctions by the RuvAB complex, serving as a substrate for homologous recombination. Fork reversal is under the control of several proteins in E. coli, including the helicase UvrD (143, 278), a structural homologue of the Srs2 helicase that is involved in homologous recombination in budding yeast (261, 512) (see “Role of the error-free postreplication repair pathway on trinucleotide repeat expansions” below). In budding yeast, it was proposed that replication fork reversal occurs during chromosomal replication when it is slowed down with hydroxyurea and that it is controlled by Rad53, a key player in the checkpoint response (297). Using plasmid-borne (CAG)n repeat tracts, Fouché et al. (147) observed by electron microscopy “chicken foot” structures representing fork reversals and showed that these structures were dependent on the presence of the repeat tract. It would now be interesting to know if trinucleotide repeats that form strong secondary structures, like (CCG)n and (GAA)n, are also able to induce fork reversal in vivo or if this is restricted to the more labile secondary structures formed by (CAG)n repeats. It is nevertheless interesting that fork reversal cannot occur spontaneously in supercoiled regions and could occur in vivo only if topoisomerases are present to relax the supercoiling induced by the progression of the replication fork (134, 135).
As presented above, mutations in DNA damage checkpoints increase fragile site expression, both in human and yeast cells (see “Molecular basis for fragility” above). The stability of triplet repeats has also been studied for such mutants. The contraction frequency of (CAG)n trinucleotide repeats is increased in yeast cell mutants for MEC1, DDC2, RAD17, RAD24, and RAD53 genes, with the effects of other checkpoint mutants being less pronounced (269). Interestingly, the expansion frequency of the same repeat tract is not increased in a comparable way, suggesting that an increase in contractions is specific to checkpoint mutants. One possibility is that checkpoint mutants increase DNA fragility (and therefore DNA breaks) within the repeat tract and that these breaks are repaired mainly by annealing of both sides of the break by single-strand annealing (SSA), a mechanism that was shown to preferentially produce contractions of the repeat tract (416). A different result was obtained with mice heterozygous for a mutation in the ATR gene (the mammalian homologue of yeast MEC1), in which more expansions of a (CGG)n repeat tract were detected (122). It is unclear if the different results obtained for yeast and for mice may reflect a difference in the intrinsic functionality of checkpoint proteins in both organisms, or whether CAG and CGG repeats behave differently in checkpoint mutants.
The mismatch repair (MMR) pathway relies on a set of proteins conserved from bacteria to all known eukaryotes and responsible for detecting replication errors such as transitions, transversions, insertions, and deletions and for signaling them to the cell so that they can be repaired. The MMR machinery includes several proteins encoded by genes belonging to the MutS family, such as MSH2, MSH3, and MSH6 in budding yeast, or to the MutL family, such as PMS1, MLH1, and MLH2 in yeast, and an exonuclease, EXO1 (254). Mismatch repair became famous when it was shown to be directly involved in sporadic colon cancer and in hereditary nonpolyposis colon cancer (341, 389). In these two classes of colon cancer, the rate of mutation of microsatellites is several orders of magnitude higher than that in noncancerous cell types, suggesting a somatic origin for the instability in these tumors (287, 383). It was shown for these cancer patients that the MSH2 gene was mutated, leading to a large increase in microsatellite instability but also in point mutations (142). Several genes that could be directly involved in tumorigenesis were subsequently found to be altered by this hypermutator phenotype. The type II transforming growth factor β receptor gene, involved in epithelial cell growth, contains an insertion in a (GT)3 dinucleotide repeat or a 1- or 2-nucleotide deletion in an (A)10 mononucleotide repeat in cancerous cells (316). IGFIIR, the insulin-like growth factor II receptor gene, contains a 1- or 2-bp deletion in a (G)8 mononucleotide repeat (473), and deletions and insertions in a (G)8 mononucleotide repeat in the BAX gene, involved in apoptosis, were found in tumors (405). A hereditary nonpolyposis colon cancer-like cancer predisposition was also found in mice inactivated for the MSH3 or MSH6 genes (106a), and microsatellite instability is also elevated in C. elegans when MSH2 is inactivated (97).
Several experimental systems have been designed in budding and fission yeasts to study the effect of the MMR on microsatellite instability. Most of them rely on the insertion of a mono- or dinucleotide repeat tract within the coding sequence of a reporter gene, in such a way that a loss or gain of one repeat unit leads to gene inactivation (or activation, depending on how the construct was initially made). With such genetic tools, it was found that a deletion of any of the MutS or MutL yeast homologues led to a dramatic increase in microsatellite instability (Fig. 6) (185, 299, 312, 466, 467, 476, 501). Surprisingly, a deletion of mutS in Haemophilus influenzae has no effect on tetranucleotide repeat stability, while it has an effect on dinucleotide repeat stability, suggesting that this bacterial species has a specific way of regulating microsatellite instability. This might be linked to the fact that most of the microsatellites in H. influenzae are tetranucleotide repeats that are located in genes involved in virulence and adaptation to environmental changes and might have been selected to be more refractory to genome-wide microsatellite instability (33).
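The logic of these frameshift reporters can be sketched with simple modular arithmetic. The Python code below is a minimal illustration under a simplifying assumption: the reporter is considered functional only when the inserted tract length is a multiple of three, i.e., the flanking coding sequence is taken to be in frame on its own. The function names and the (GT)15 and (GT)16 constructs are hypothetical examples, not the constructs used in the cited studies.

```python
def tract_frame(unit_len, n_units):
    """Reading-frame offset (0, 1, or 2) contributed by the inserted tract."""
    return (unit_len * n_units) % 3


def reporter_switches(unit_len, n_units, delta_units):
    """True if a size change of delta_units flips the reporter state.

    Assumption: the reporter is active only when the tract is in frame
    (offset 0), so the state flips when exactly one of the two tract
    lengths, before or after the change, is a multiple of three.
    """
    active_before = tract_frame(unit_len, n_units) == 0
    active_after = tract_frame(unit_len, n_units + delta_units) == 0
    return active_before != active_after


# In-frame (GT)15 construct: losing or gaining one unit inactivates the gene.
print(reporter_switches(2, 15, -1), reporter_switches(2, 15, +1))
# Out-of-frame (GT)16 construct: only the loss of one unit restores the frame.
print(reporter_switches(2, 16, -1), reporter_switches(2, 16, +1))
# A trinucleotide tract never changes frame, whatever the size change.
print(reporter_switches(3, 25, +1))
```

Under this assumption, a construct that starts out of frame detects only one direction of change, which is one way to read the remark that the outcome depends on how the construct was initially made. It also shows why single-unit changes in trinucleotide tracts cannot be scored by frameshift and must be detected by other means.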
The stability of trinucleotide repeats was also assayed in MMR-deficient cells by use of assays similar to those used to assess the effect of replication mutants. Experimental systems initially designed in budding yeast were aimed at identifying repeat expansions or contractions of several triplets. Since most repeat size changes in MMR-deficient cells are additions or deletions of one repeat unit, mismatch repair mutants were found to have little or no effect in such experimental settings, compared to what was observed for other microsatellites (331, 332, 416). More surprisingly, when all small deletions and additions of triplets were scored using molecular approaches, it was found that trinucleotide repeats exhibited a much higher stability in MMR-deficient cells than other microsatellites (454) (Fig. 6). This suggested that secondary structures formed by trinucleotide repeats efficiently escaped recognition by the mismatch repair system or other DNA repair pathways, as suggested by some authors (345), and that trinucleotide repeat instability was largely independent of mismatch repair. However, additional studies revealed that the human MSH2 protein binds in vitro to secondary structures formed by (CAG)n and (CTG)n triplets in a length-dependent manner and with a higher efficiency for (CAG)n repeats (386). It was subsequently shown that transgenic mice deficient for Msh2 show a reduction in (CAG)n instability, a surprising result that is in contrast to what is commonly observed for other microsatellites, which are destabilized in MMR-deficient cells (311). The loss of Msh2 actually decreases the frequency of expansions, while contractions become much more frequent (447). These Msh2-dependent expansions seem to occur premeiotically, at the beginning of spermatogenesis (448). It was further shown that Msh3, but not Msh6, is also involved in (CAG)n expansions in mice, implicating the Msh2-Msh3 heterodimer in the expansion process (374). 
The same authors showed that purified Msh2-Msh3 complex binds efficiently to (CAG)n hairpins and it was proposed that this binding stabilizes the loop structure and inhibits correct repair by the MMR system or other DNA repair complexes. Thus, Msh2-Msh3 would have an opposite effect on trinucleotide repeats compared to other microsatellites, actually reducing the chance of correct repair of the hairpin. Strengthening the results obtained with Msh2 and Msh3, Pms2 null mice exhibit the same phenotype, a reduction in (CAG)n expansions, showing that another component of the MMR pathway is involved in making expansions (169).
A new way of stabilizing trinucleotide repeats was recently discovered when a whole-genome screen for mutants affecting the stability of (CAG)n repeats identified the SRS2 gene of S. cerevisiae as being involved in this process. SRS2 was originally cloned as a suppressor of the radiation sensitivity phenotype of rad18 mutants and is involved in the postreplication repair pathway of DNA damage (4). The Srs2 protein exhibits a 3′-to-5′ ATP-dependent helicase activity in vitro (428) and was shown to disrupt a Rad51p nucleoprotein filament, an intermediate of homologous recombination (261, 512). The same property was found for the bacterial functional homologue of SRS2 (UvrD) in unwinding RecA nucleoprotein filaments in E. coli (511). Further biochemical analysis of the protein substrates suggests that the Srs2 helicase could act as an antirecombinogenic protein that unwinds toxic recombination intermediates (118). Bhattacharyya and Lahue (38) showed that (CTG)13, (CTG)25, (CAG)25, and (CGG)25 trinucleotide repeats were more prone to expansions in an srs2 mutant and that these expansions were largely independent of RAD51-mediated homologous recombination. Interestingly, we recently found that longer (CTG)n repeats were highly destabilized in an srs2 mutant but that this instability was completely dependent on homologous recombination (Kerrest et al., unpublished), suggesting that two distinct pathways lead to trinucleotide repeat instability in srs2 mutants, depending on the size of repeat tracts. Mutants in the RAD18 and RAD5 genes or a mutation that abolishes the ubiquitination and sumoylation of PCNA (pol30-K164R) increase (CAG)n and (CTG)n repeat instability, confirming the role of the error-free postreplication pathway in this process (89). In addition, biochemical studies revealed that the Srs2 helicase is also able to unwind CTG hairpins or CTG-containing double-stranded DNA (39). 
Many years ago, it was published that a (GT)14 dinucleotide repeat tract was stabilized approximately 10-fold in a rad5 mutant (224), but this effect has been interpreted as an error-prone function of the RAD5 gene (89). RAD5 encodes a multifunctional protein which exhibits ATPase activity, is involved in DSB repair of cohesive ends in a way similar to that seen for the Mre11 protein (78), and is proposed to be a key player in replication fork reversal (45). Additional work will be required to determine whether replication fork reversal mediated by RAD5 (or SRS2) could be a way of stabilizing trinucleotide repeat tracts. It is interesting that SRS2 has a homologue in humans and in Schizosaccharomyces pombe, the FBH1 gene (F-Box DNA helicase). FBH1 in S. pombe seems to play some of the roles of SRS2 (347, 373), and it would be interesting to test the effect of FBH1 on trinucleotide repeats in human cells.
When the first massive expansions of trinucleotide repeats were discovered, it was originally thought that expansions occurred during S-phase DNA replication and most probably that the mismatch repair system was in some way involved in this process. These hypotheses were not completely wrong, although the precise mechanism responsible for these large expansions was not simple “replication slippage,” as was observed with other microsatellites in MMR mutants. After a while, some authors looked toward another mechanism, homologous recombination, as a possible cause of the large expansions, helped by the large body of evidence published on minisatellite instability during human meiosis.
Several human minisatellites were initially found to be highly variable in the human population (218, 219, 221), and this variability was subsequently shown to arise in the germ line (60, 485). Molecular analysis of minisatellite rearrangements in sperm cells revealed complex mutation events, including both intra- and interallelic exchanges. This observation, along with the polarity of rearrangements (preferential modifications at one end of the tandem array) and the absence of evidence for the exchange of flanking markers, suggested that rearrangements occur mainly through gene conversion not associated with crossover (59, 220). A meiotic recombination hot spot was mapped upstream of the MS32 minisatellite and was shown to be responsible for its meiotic instability (216). It was concluded that minisatellites are frequently rearranged by meiotic gene conversion when they are located near a meiotic recombination hot spot. The minisatellite mutation rate in the germ line was studied among children born in the area contaminated by radiation spills after the Chernobyl accident. It was shown that the mutation rate in the exposed population was twofold higher than the mutation rate in the control population, suggesting that ionizing radiation induces minisatellite rearrangements, probably by making low levels of DSBs that activate homologous recombination (114). In support of this hypothesis, it was shown that the mutation rate of mouse minisatellites was increased about twofold when mice were exposed to 0.5 Gy or 1 Gy of γ-irradiation (113), a dose more acute than what was estimated for the Chernobyl accident. A more recent study, looking specifically at the mouse minisatellite Ms6hm, showed that its mutation rate was also increased twofold following exposure to 6 Gy of γ-irradiation (368). 
Minisatellite instability was also observed in SCID (severe combined immunodeficiency) mouse cells, in which the catalytic subunit of the DNA-dependent protein kinase (DNA-PKcs) was impaired, resulting in defective end joining. These cells exhibited a higher rate of minisatellite instability, suggesting that inactivating end joining increases minisatellite instability, possibly by increasing homologous recombination (204). It is interesting that meiotic instability is not restricted to natural minisatellites, since an artificial transgene array composed of 8 kb of human and bacterial sequences was dramatically amplified in mice. Expansions from 5 to 8 initial repeat units of the transgene to 200 to 300 copies were observed to happen in one generation, reminiscent of the dramatic expansions of trinucleotide repeats found to be associated with neurological disorders in humans. It was likely that these transgenic expansions occurred during gametogenesis in the male germ line (240).
S. cerevisiae has been a powerful tool to study minisatellite rearrangements during meiosis. MS1 was the first human minisatellite to be introduced in the genome of a haploid yeast cell. Frequent size changes of the minisatellite were observed and it was concluded that recombination between homologous chromosomes was therefore not a prerequisite for minisatellite rearrangements (73). The rate of instability of the human minisatellite MS32 was then compared during meiosis and mitosis in diploid yeast cells. MS32 was integrated near a well-characterized meiotic hot spot on yeast chromosome III. The frequency of size changes following meiosis was around 10%, a 40-fold increase over the frequency found during mitotic growth of yeast cells (12). Size changes occurred by gene conversion, with or without crossover (11). Similar features were found for two other human minisatellites (MS205 and MS1) integrated at the same locus in the yeast genome (36, 191, 192). A further step was made when Debrauwère and colleagues (94) showed that the meiotic instability of the human CEB1 minisatellite integrated near a yeast meiotic hot spot was dependent on the presence of the Spo11 endonuclease responsible for initiating meiotic recombination by making DSBs (37, 241). They also showed that minisatellite instability required the activity of Rad50, a protein involved in processing meiotic DSBs in order for recombination to occur properly. Subsequent analyses of human minisatellite instability during yeast meiosis showed that it did not depend on the mismatch repair system or on the Sgs1 helicase (42) but that the Rad1 protein was specifically increasing the frequency of minisatellite expansions (214). RAD1 is a gene whose product is essential to remove nonhomologous tails during gene conversion and SSA in yeast (377, 481), suggesting that such structures are formed during minisatellite recombination that need to be removed in order for expansions to occur.
All of the collective observations made for humans, along with experiments performed in budding yeast, formally proved that meiotic gene conversion was the most important driving force for minisatellite instability. This was proposed to be very different for microsatellites, however. Earlier experiments comparing the rates of instability of a (GT)16 dinucleotide repeat tract did not find significant differences between a strain mutated for the RAD52 gene and the wild-type yeast strain, suggesting that homologous recombination was not involved in microsatellite instability (195). At the same time, large expansions of (CGG)n repeats in FRAXA patients were found to occur in the absence of recombination of flanking markers (no crossover), which was too rapidly interpreted as ruling out meiotic homologous recombination as a possible cause for expansions (152), since the exchange of genetic information during meiosis is not always accompanied by crossovers (376). However, when (GT)n dinucleotide repeat tracts were introduced into a diploid yeast chromosome and these cells underwent meiosis, it was found that the presence of the microsatellite increased by severalfold the frequency of crossover and the frequency of multiple recombination events (involving more than two chromatids) in its vicinity (502). Later experiments in budding yeast showed that the presence of a (GT)39 dinucleotide repeat interfered with crossover resolution and increased the frequency of multiple recombination events. The microsatellite was found to be more unstable in recombinant spores than in parental spores (162). The instability of a (CAG)n trinucleotide repeat tract was increased during meiosis, compared to the mitotic instability, when the repeat tract was inserted near a yeast recombination hot spot (212). Most meiotic rearrangements (95%) of the repeat tract were found to be dependent on Spo11 (211). 
Similarly, (CAG)n trinucleotide repeats integrated in a yeast artificial chromosome were more unstable during meiosis than during mitotic cell divisions in yeast (86). Contrary to this, working with much smaller and interrupted (CAG)n trinucleotide repeats, Schweitzer et al. (457) did not find any evidence for increased instability, gene conversion, or crossover events associated with the presence of the microsatellite during yeast meiosis. The conclusion of these experiments is not qualitatively different from what was observed for minisatellites during meiosis: whenever perfect microsatellites are introduced in close proximity to a meiotic hot spot, they show a higher rate of instability during meiosis than during mitosis, this instability being dependent on Spo11 and therefore on the formation of meiotic DSBs. However, it was unclear if the microsatellite by itself was able to act as a meiotic hot spot, hence inducing meiotic recombination, although some data suggested that it could be the case (212). This question was first addressed by Moore and colleagues (345), who showed that the presence of (CAG)10 or (CGG)10 trinucleotide repeat tracts did not increase recombination between two flanking markers and were not the preferential site of meiotic DSBs. Additional experiments using much longer trinucleotide repeat tracts, (CAG)98 and (CAG)255, confirmed that even these long repeats did not act as meiotic recombination hot spots in yeast, the meiotic recombination rate being independent of the presence or absence of the microsatellite (412). Using the same experimental setting, a unique meiotic DSB was introduced by the specific endonuclease I-SceI in one of the two homologues, the other homologue carrying either a (CAG)98 or a (CAG)255 repeat tract. Contractions and expansions of the (CAG)98 repeat tract occurred in about 5% of the meiotic gene conversions, while such events were 10-fold more frequent when the (CAG)255 repeat tract was used as a template. 
Surprisingly, the frequency of rearrangements dropped to less than 1% when I-SceI was expressed during the mitotic growth of the cells, suggesting that meiotic recombination is more prone to make errors than mitotic recombination when (CAG)n repeat tracts are copied (412).
Experimental systems have also been set up in model organisms to study the fate of tandem repeats during mitotic recombination. In D. melanogaster, a tandem array of 5S ribosomal genes located within a P element was shown to be highly unstable following excision of the transposon. Contractions and expansions of the repeat array occurred in 40% of the progeny and were proposed to be the result of rearrangements within the 5S genes during repair of the DSB generated by the transposon excision (375, 381). This experimental system was transposed into budding yeast, and it was elegantly shown that the 5S tandem array underwent frequent expansions and contractions during repair of a single DSB induced by the HO endonuclease. Interestingly, the DSB could be repaired using two ectopic overlapping donor sequences, showing that homologous recombination was able to find and assemble overlapping sequences located at different loci (378). The same experimental system was used to study the fate of a 36-bp minisatellite during HO-induced DSB repair in yeast. Contractions and expansions of the minisatellite were found in 16% of the repair events, and it was shown that Msh2 and Rad1 proteins were required to promote contractions of the minisatellite during gene conversion (379), a result opposite to what was observed during meiotic recombination of a minisatellite (214). A similar experimental system was used to study (CAG)n size changes during HO- or I-SceI-induced DSB repair in budding yeast. When the DSB was repaired using a (CAG)39 repeat tract as a template, frequent contractions but no expansions of the repeat tract were observed (416), whereas both expansions and contractions were found when a longer (CAG)98 repeat tract was used as the template (420). 
It was shown that the MRE11-RAD50-XRS2 complex was responsible for making large expansions, and it was proposed that these expansions occurred either through cleavage of secondary structures formed by the repeats and mediated by the endonuclease activity of Mre11 or, alternatively, by an unwinding of these secondary structures by the MRE11-RAD50-XRS2 complex (417). It is worth noting that when the endonuclease cleavage site was flanked by two short (CAG)5 repeats, a majority of the repair events occurred by annealing between the two repeats, often leading to a shorter repeat tract and therefore making SSA an attractive mechanism to generate repeat contractions (416). Note that repeat contractions and expansions of 5S genes or trinucleotide repeats during mitotic gene conversion are not associated with crossovers, ruling out the hypothesis of unequal crossover as a possible source of instability (378, 412) and supporting the hypothesis of a “DSB repair slippage” that would occur during gene conversion and would be 2 to 3 orders of magnitude more prone to errors than classical replication slippage (416). One of the possible models describing how DSB repair slippage may occur during gene conversion is shown in Fig. 7.
Studies with transgenic mice showed that neither Rad52 nor Rad54 had any effect on (CAG)n trinucleotide repeat instability (447). However, it was proposed that the functional homologue of yeast RAD52 in mammals might be Brca2 (349, 384, 482), and therefore the possible involvement of this gene in trinucleotide repeat instability should be assayed.
It is noteworthy that several human ataxias involve defects in DNA repair. Ataxia telangiectasia (ATM) and ATM-like disorders are characterized by early-onset cerebellar ataxia and the progressive degeneration of the cerebellum and spinocerebellar tract. These two ataxias respectively involve defects in the ATM checkpoint gene and the MRE11 gene, which are required for a proper DSB repair response. These ataxias also exhibit radiosensitivity, chromosomal instability, and a high occurrence of cancers. Mutations in other genes involved in DSB repair lead to neural death by apoptosis. This is the case for NHEJ genes like those for the Ku complex, XRCC4, and ligase IV or for genes involved in homologous recombination like those for XRCC2 or BRCA1 (3). Other types of ataxias involve deficiencies in single-strand break repair. This is the case for SCA with axonal neuropathy (SCAN1), due to a defect in the TDP1 gene required for proper single-strand break repair, and for ataxia-ocular motor apraxia (AOA1), involving the aprataxin gene, which is required for single-strand break signaling and repair (67). However, in these last two cases, no evidence for a general increase in genetic instability has been recorded and the defects are apparently restricted to the nervous system. At the present time, it is unclear why deficiencies in the above-mentioned genes seem to preferentially affect the nervous system, although one can advance the hypothesis that neurons are more sensitive than other cells to DNA breaks and enter apoptosis more readily when unrepaired breaks occur. It is possible that mild deficiencies in DSB repair pathways would lead to an increase in the level of endogenous single-strand breaks or DSBs and that these breaks could in turn trigger trinucleotide repeat expansions, increasing the chance of further breaks. Some of these unrepaired breaks could also trigger neuron apoptosis, increasing the severity of the clinical phenotype.
At the end of the section on molecular mechanisms generally involved in mini- and microsatellite instability, particularly those related to trinucleotide repeat expansions, we want to propose a simple model that recapitulates data obtained both from human patients and from experiments in model organisms to explain how trinucleotide repeat instabilities occur (Fig. 8). In this model, the fate of a trinucleotide repeat depends exclusively on its propensity to form secondary structures and on the size of these structures. Short structures will be covered and protected by Msh2, preserving them and favoring replication slippage within the repeats. Slippage will eventually be corrected by the postreplicational repair pathway under the control of the Srs2 protein. Longer trinucleotide repeats will form longer and more stable hairpins that will also be substrates for Msh2. In addition, these longer hairpins may stall replication forks, leading to single-strand breaks and eventually DSBs. If DSBs are not correctly recognized by the checkpoint machinery, they will escape repair and lead to chromosomal fragility. DSB repair is also under the control of the Srs2 (and the Rad51) protein and may lead to repeat contractions and expansions by gene conversion. Other DNA damage, like oxidative damage, may also lead to fork stalling and DSBs. Given that Srs2 has a central role in this model, it would be interesting to test the role of its human homologue, FBH1, on trinucleotide repeat instability in mammalian cells.
Finally, the link between replication and homologous recombination needs to be explored further. In model organisms like budding yeast and bacteria, several lines of evidence connect these two processes (175), but little is known about their interconnection in mammalian cells. As an alternative to DSB repair, fork reversal after DNA damage and fork stalling could supply cells with a mechanism that does not involve homologous recombination and would therefore be less prone to chromosomal rearrangements leading to segmental duplications and to microsatellite instabilities.
As we have seen, several mechanisms relying on the formation of secondary structures, and on replication and homologous recombination, interplay with each other to generate contractions and expansions of tandem repeats. Even though the precise molecular steps remain to be clarified, the logic of going from two repeat units to hundreds or thousands is straightforward. However, it is still unclear how one goes from a single unit to two repeat units or, put another way, how tandem repeats are born. One may easily imagine that point mutations can create mono-, di-, or even trinucleotide duplications, but what about longer microsatellites, minisatellites, and longer tandem repeats? It is unlikely that minisatellite birth occurs by successive point mutations. A seductive hypothesis was proposed by Haber and Louis (177), who noticed that the repeat units of several yeast and human minisatellites were flanked by short (5-bp) direct repeats and suggested that initial slippage between these direct repeats could duplicate the DNA sequence between them, giving birth to a minisatellite. A similar observation was made on many natural minisatellites found in budding yeast (414). Given the number of whole-genome sequences now available for eukaryotes, systematic sequence comparisons of closely related genomes will certainly shed new light on the molecular mechanisms involved in micro- and minisatellite birth (328, 527, 556). Alternatively (or complementarily), sophisticated experimental systems in powerful model organisms like S. cerevisiae should also reveal some of these answers.
How microsatellites evolve is another intriguing question. The most recently accepted view is that short microsatellites tend to expand, while longer ones tend to contract. This was demonstrated by analyzing 122 human tetranucleotide repeat loci and showing that the rate of expansions is constant for all alleles, whereas the rate of contractions increases exponentially with repeat length. This led to a model in which microsatellites tend to expand until they reach a critical size, above which they will tend to contract (546). A corollary to this model is that the size distribution of microsatellites, at equilibrium in a given genome, should be centered around the critical size. Another model proposes that microsatellite evolution is driven by slippage that will tend to increase repeat size and by point mutations that will tend to interrupt the tandem repeat sequence, therefore stabilizing it (66, 264, 441). Note that these two hypotheses are not mutually exclusive and that experimental results on microsatellite instability in model organisms support both models.
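The first model can be made concrete with a minimal numerical sketch. It assumes a constant expansion rate u and a contraction rate of the form a·exp(b·L) for a tract of L repeat units; the functional form follows the description above, but the parameter values and function names are purely illustrative assumptions and are not taken from the cited study (546). Under these assumptions, the critical size is the length at which the two rates are equal, L* = ln(u/a)/b.

```python
import math

# Hypothetical parameters, chosen only for illustration (not from ref. 546).
U = 1e-3   # expansion rate per generation, constant for all alleles
A = 1e-5   # contraction rate prefactor
B = 0.2    # exponential growth of the contraction rate per repeat unit


def expansion_rate(length, u=U):
    """Expansion rate: constant, independent of repeat length."""
    return u


def contraction_rate(length, a=A, b=B):
    """Contraction rate: increases exponentially with repeat length."""
    return a * math.exp(b * length)


def net_drift(length, u=U, a=A, b=B):
    """Positive when expansions dominate, negative when contractions do."""
    return expansion_rate(length, u) - contraction_rate(length, a, b)


def critical_size(u=U, a=A, b=B):
    """Length L* at which u = a * exp(b * L*), i.e. L* = ln(u / a) / b."""
    return math.log(u / a) / b


if __name__ == "__main__":
    print("critical size (repeat units):", round(critical_size(), 2))
    for length in (5, 15, 25, 35):
        trend = "expands" if net_drift(length) > 0 else "contracts"
        print(length, "units:", trend)
```

In this sketch, tracts shorter than L* have a positive net drift and tracts longer than L* a negative one, which is why the equilibrium size distribution is expected to be centered around the critical size.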
Both microsatellites and minisatellites are frequently found within genes, at least in hemiascomycete genomes, in which they have been the most studied. Remarkably, they are not found in the same types of genes, with microsatellites being found mainly in nuclear transcription factors, while minisatellites are found in cell wall genes. Therefore, the distinction between the two types of tandem repeats, originally based on historical grounds, finds a biological justification. We therefore propose that since the shortest minisatellite unit found in a cell wall gene was 9 nucleotides long (414), mono- to octanucleotide repeats should be called microsatellites, whereas nonanucleotide repeats and above should be called minisatellites. Hopefully, this definition will be adopted by a majority of people who work in the field, so that everyone will use the same terminology when dealing with tandem repeats. It would also be helpful if the same terminology could be used to define the orientation of a repeat tract according to replication: we therefore propose that repeats be named according to the sequence found on the lagging-strand template, i.e., CTG repeats when the CTG sequence is on the lagging-strand template (equivalent to the leading strand). This nomenclature is also applicable to all kinds of microsatellites besides trinucleotide repeats, as long as the direction of replication is determined.
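The proposed definition and naming convention can be written as two small rules. The sketch below is only an illustration: the function names and the boolean flag describing which strand is the lagging-strand template are hypothetical, and the code assumes that the direction of replication through the repeat is already known.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def classify_tandem_repeat(unit):
    """Apply the proposed threshold: units of 1 to 8 nucleotides are
    microsatellites, units of 9 nucleotides and above are minisatellites."""
    if len(unit) < 1:
        raise ValueError("repeat unit must contain at least one nucleotide")
    return "microsatellite" if len(unit) <= 8 else "minisatellite"


def repeat_name(unit_on_reference, reference_is_lagging_template):
    """Name a repeat after the sequence read on the lagging-strand template.

    unit_on_reference: repeat unit as read on the annotated reference strand.
    reference_is_lagging_template: True if that strand is the lagging-strand
    template (hypothetical flag; requires a known replication direction).
    """
    unit = unit_on_reference.upper()
    if reference_is_lagging_template:
        return unit
    return reverse_complement(unit)


if __name__ == "__main__":
    print(classify_tandem_repeat("CAG"))         # trinucleotide unit
    print(classify_tandem_repeat("ACGTACGTA"))   # nonanucleotide unit
    print(repeat_name("CAG", reference_is_lagging_template=False))
```

A tract annotated as CAG on a strand that is not the lagging-strand template would thus be named a CTG repeat, in line with the convention proposed above.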
Adding to the trouble coming from the lack of clear definitions for these elements, it is difficult to find two scientific reports using the same algorithm to detect tandem repeats, and whenever this occasionally happens, the parameters and thresholds used for detection are so different that any comparison between data sets is quite ambiguous (Table 3). It is therefore very difficult to compare microsatellite distributions among eukaryotes, and it is still unclear to us whether the density of trinucleotide repeats in the human genome (11.8 per megabase) (270a) is higher or lower than the density of trinucleotide repeats in the yeast genome, which varies from 8 to 147 repeats per megabase (Table 3). Imperfect and perfect microsatellites should at least be computed separately, which is not systematically the case.
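The dependence of repeat counts on detection thresholds can be illustrated with a minimal detector for perfect repeats only. This is a sketch under stated assumptions, not one of the published algorithms compared in Table 3: the function names, the default threshold of five copies, and the choice to ignore imperfect repeats are all illustrative.

```python
import re


def find_perfect_repeats(seq, unit_len=3, min_copies=5):
    """Find perfect tandem repeats with a given unit length.

    Returns a list of (start, unit, copies) tuples. Only uninterrupted
    tracts are reported; imperfect repeats are deliberately ignored.
    The unit is reported in the frame where the match starts, so cyclic
    permutations (CAG, AGC, GCA) describe the same kind of repeat.
    """
    if min_copies < 2:
        raise ValueError("min_copies must be at least 2")
    seq = seq.upper()
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        # Skip units that are themselves repeats of a shorter motif
        # (for example AAA, which is a mononucleotide repeat).
        divisors = [d for d in range(1, unit_len) if unit_len % d == 0]
        if any(unit == unit[:d] * (unit_len // d) for d in divisors):
            continue
        hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return hits


def density_per_megabase(n_repeats, genome_length_bp):
    """Convert a repeat count into a density per megabase."""
    return n_repeats * 1e6 / genome_length_bp


if __name__ == "__main__":
    toy = "TTTT" + "CAG" * 6 + "TTTT"
    # The same tract is counted or missed depending on the threshold.
    print(find_perfect_repeats(toy, unit_len=3, min_copies=5))
    print(find_perfect_repeats(toy, unit_len=3, min_copies=7))
```

With the toy sequence above, the six-copy CAG tract is reported when the threshold is five copies and missed when it is seven, which is the kind of discrepancy that makes densities computed with different parameters hard to compare.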
As one of the pioneers of molecular biology and a Nobel prize winner, Jacques Monod was amazed by the conservation of structures in living organisms, given the relatively high frequency of some mutations in human beings: “Altogether, we may estimate that in the present-day human population of approximately three thousand million there occur, with each new generation, some hundred thousand million to a billion mutations… . Considering the scope of this gigantic lottery and the speed with which nature draws the numbers, it may well seem that the amazing and indeed paradoxical thing, hard to explain, is not evolution but rather the stability of the ‘forms’ that make up the biosphere” (344). What would Jacques Monod have thought of mutations that are even more frequent, on the order of 10⁻² to 10⁻³ for some microsatellites, and even higher for trinucleotide repeat expansions? He would probably have been fascinated by the fact that morphological shapes in dogs rely on such unstable sequences (145). However, these mutations must still be exposed to the process of natural selection and, in the end, a dog might remain a dog or evolve into another species.
We are greatly indebted to the many colleagues who contributed over the past years to stimulating discussions on the evolution of repeated sequences, especially to all the members, past and present, of the Unité de Génétique Moléculaire des Levures; to Jim Haber, Alain Nicolas, Benoit Arcangioli, and all the members of their respective labs; and to Geneviève Gourdon, Catherine Freudenreich, and Frédéric Pâques for more-specific discussions on tandem repeat instabilities. We apologize to many colleagues working on microsatellites, since it was not possible to extensively cite all their studies. We are very grateful to Allyson Holmes, Gilles Fischer, and Cécile Neuvéglise for many suggestions that improved the overall quality of the present work and to Agnès Ullmann for lending us her personal copy of Chance and Necessity.
This work was supported by grant 3738 from the Association pour la Recherche contre le Cancer (ARC) and grant ANR-05-BLAN-0331 from the Agence Nationale de la Recherche. B.D. is a member of the Institut Universitaire de France.