The mutation rate at fifty-four perfect (uninterrupted) dinucleotide microsatellite loci is estimated by direct genotyping of 96 Arabidopsis thaliana mutation accumulation lines. The estimated rate differs significantly among motif types with the highest rate for AT repeats (2.03 × 10−3 per allele per generation), intermediate for CT (3.31 × 10−4), and lowest for CA (4.96 × 10−5). The average mutation rate per generation for this sample of loci is 8.87 × 10−4 (SE 2.57 × 10−4). There is a strong effect of initial repeat number, particularly for AT repeats, with mutation rate increasing with the length of the microsatellite locus in the progenitor line. Controlling for motif and initial repeat number, chromosome 4 exhibited an elevated mutation rate relative to other chromosomes. A survey of dinucleotide repeats across the entire Arabidopsis genome indicates that AT repeats are most abundant, followed by CT, and CA. The great majority of mutations were gains or losses of a single repeat. Several lines exhibited multiple step changes from the progenitor sequence, although it is unclear whether these are multi-step mutations or multiple single step mutations. Generally, the data are consistent with the stepwise mutation model of microsatellite evolution.
Microsatellites are simple sequence repeats that frequently display length variation within natural populations. These loci can be classified according to the length and type of repeated motif, where the most common lengths are 2, 3, or 4 bases (di-, tri- and tetranucleotide repeats, respectively). Because microsatellites are highly polymorphic, they are frequently used as genetic markers in ecological and evolutionary studies (Schlötterer and Pemberton, 1994). The multi-allelic character of microsatellites makes them ideal for paternity analysis (Chase et al., 1996; Dow and Ashley, 1998), estimation of parameters in pollination biology (e.g. Kelly and Willis, 2002) and studies of dispersal/spatial-genetic structure (e.g. Sweigart et al., 1999). If one further assumes that microsatellite variation is selectively neutral, it can be used to estimate the effective population size (e.g. Schug et al., 1998).
Polymerase slippage during DNA replication is thought to be the primary source of mutation in microsatellites (Schlötterer et al., 1998). However, much remains unknown about the nature of the mutational process. Most studies suggest that mutations are typically the gain or loss of a single repeated unit (Thuillet et al., 2002; Vigouroux et al., 2002), although there are putative examples of multi-repeat gains or losses (Ellegren, 2004). The rate of mutation may depend on allele length, i.e. the number of repeat units (Wierdl et al., 1997; Vigouroux et al., 2002; Thuillet et al., 2004), as can the direction of changes, i.e. the relative likelihood of gain versus loss (see Wierdl et al., 1997). Finally, the mutation rate and other mutational properties may depend on the repeat motif, e.g. AG vs. CG (Bachtrog et al., 2000; Kelkar et al., 2008). Most data suggest that dinucleotide microsatellites mutate at a rate that is greater than that of trinucleotide and tetranucleotide microsatellites (Chakraborty et al., 1997; but see Weber and Wong, 1993).
Microsatellites are distributed non-randomly across plant genomes and are associated with non-repetitive DNA (Zhang et al., 2006). In A. thaliana, they are often found in regulatory regions, especially 5’ UTRs and 5’ flanking regions (Zhang et al., 2006; Grover and Sharma, 2007). A-rich repeats are prominent in introns and intergenic regions. AG is the most common dinucleotide motif in exons and 5’ flanking regions, while AT is most common in introns, intergenic regions, and 3’ flanking regions (Zhang et al., 2004).
Microsatellite mutation rates have been estimated for a variety of crop plants (Table 1). Rate estimates range from 0 to 5 × 10–3 per locus per generation. Across these studies, mutations were more frequently observed in loci with long alleles (more repeat units) and most were single repeat changes with gains more frequent than losses. Across all three studies of Table 1, smaller loci (fewer repeats) tended to expand while longer loci (more repeats) tended to lose repeats.
Estimates of microsatellite mutation rates are directly relevant to hypotheses about genetic diversity in natural populations. Symonds and Lloyd (2003) found that genetic diversity for 20 microsatellite loci across 126 accessions was positively correlated with the number of contiguous repeats in A. thaliana. This association is predicted by models where mutation rate increases with repeat number. Direct estimates of mutation rate are also essential for evaluating theories of microsatellite evolution. The simplest model is the Infinite Alleles Model (IAM; Kimura and Crow, 1964; Balloux and Lugon-Moulin, 2002) where mutations occur at a constant rate and each mutation creates a novel allele. Seemingly more appropriate for microsatellites is the stepwise mutation model (SMM; Ohta and Kimura, 1973) where mutations occur at a constant rate and involve the gain or loss of a single unit. The two phase model of DiRienzo et al. (1994) is a modification of the SMM with most mutations involving a gain or loss of a single repeat and the remainder of the mutations being multi-step mutations following a geometric distribution. In a survey of variation at five microsatellite loci across 37 populations of A. thaliana, Bakker et al. (2006) found support for both the SMM (2 of the 5 loci) and the IAM (4 of the 5 loci).
In this paper, we estimate the rate of mutation per allele per generation of dinucleotide repeats in A. thaliana. A large panel of Mutation Accumulation (MA) lines is scored for allele length at fifty-four perfect dinucleotide repeat loci. Perfect repeats are uninterrupted strings of a single motif, e.g. AT. The loci examined in this study are not associated with genes or within intergenic regions of gene clusters. As a consequence, natural selection on allele length within these loci is likely to be much weaker than for gene associated microsatellites. All putative mutations were confirmed by multiple independent PCR amplifications. These results corroborate the effect of allele length on mutation rate. They also indicate an important effect of motif type and possibly also chromosomal location. We also conduct a genomic survey of A. thaliana and interpret our mutation estimates in relation to the full distribution of repeat lengths and motif frequency in the Arabidopsis genome.
Shaw et al. (2002) maintained 118 independent Mutation Accumulation Lines of Arabidopsis thaliana for 30 generations prior to the current study. All lines were initiated from the Columbia accession and each was propagated by single seed descent. We chose a random subset of this population (96 lines) and grew plants to maturity in the University of Kansas greenhouse in February 2008. The soil was equal parts vermiculite and perlite with potting soil sprinkled on top of seeds. Day length was artificially expanded to 18 hours and plants were fertilized every week with Peat-lite (20–10–20 NPK). Tissue was collected for DNA extraction from the basal rosette when each plant was approximately five weeks old.
Tissue was collected into a 96-well plate with a metal bead in each well. We added 500 µL of CTAB buffer and 1 µL of β-mercaptoethanol to each sample. The plate was then sealed and shaken at high speed for 45 s in a bead beater. The plate was incubated for ~20 min in a 60°C water bath and then centrifuged for ~10 s (3980 rpm) to separate solids. We transferred 300 µL of liquid from each tube to a new 96-well Costar plate and added 300 µL of chloroform to each sample. This was followed by another round of mixing using the “slanted-vortex technique” and centrifugation for 10 min at 3980 rpm. Each sample was then fully separated into aqueous (upper) and chloroform (lower) layers. We removed the aqueous layer to a new 96-well plate, added 200 µL of isopropanol, and mixed well by inverting the plate repeatedly. The new plate was stored at −20°C overnight and then centrifuged for 10 min at 3980 rpm. This produced a gelatinous pellet in each well. We then poured off the supernatant, added 200 µL of 70% ethanol, capped the tubes, and repeated the shake and centrifuge steps. We then poured off the ethanol and air-dried the pellet. Each DNA pellet was resuspended in 50 µL of distilled water. All samples were quantified using a NanoDrop 1000 spectrophotometer (Thermo Scientific) and diluted with distilled H2O to 7–9 ng/µL.
Microsatellite loci were identified by searching the Arabidopsis genome sequence via The Arabidopsis Information Resource (TAIR) website (www.arabidopsis.org). Microsatellites were found by searching for each motif in a string of 8 repeats, e.g. ATATATATATATATAT or (AT)8. For coverage of the genome, we divided each of 5 chromosomes into four regions and selected one locus per region per motif type. Not all regions contained a microsatellite satisfying our selection criteria. We eliminated microsatellites that were within 200 bp of start/end of gene, in either a UTR or an intron, had more than 30 repeats, or if the repeat sequence of the microsatellite was interrupted. We found no CG repeats that met these conditions and so our sample consisted entirely of AT, CA, and CT repeats. A number of loci failed to amplify, and as a consequence, we ended up with fewer CA loci (14) than AT or CT loci (20 of each). Primers, described in the Appendix, were designed for the selected loci using the program Primer3 with the default settings (Rozen and Skaletsky, 2000).
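The search criterion described above (an uninterrupted string of at least 8 copies of one motif, with loci above 30 repeats excluded) can be illustrated with a short script. This is a minimal sketch and not the TAIR search actually used: the function name and the single-strand, fixed-phase matching are simplifying assumptions, and the filters on gene proximity, UTRs and introns are not represented.

```python
import re

MIN_REPEATS = 8   # search threshold stated in the text: (AT)8 and so on
MAX_REPEATS = 30  # loci with more than 30 repeats were excluded

def find_perfect_dinucleotide_repeats(seq, motifs=("AT", "CA", "CT", "CG")):
    """Return sorted (start, motif, repeat_count) tuples for perfect repeats.

    Hypothetical helper: scans one strand only and reports each
    uninterrupted run of a motif that has MIN_REPEATS..MAX_REPEATS copies.
    """
    seq = seq.upper()
    hits = []
    for motif in motifs:
        # Greedy match of MIN_REPEATS or more back-to-back copies of the motif.
        pattern = re.compile("(?:%s){%d,}" % (motif, MIN_REPEATS))
        for match in pattern.finditer(seq):
            n_repeats = len(match.group(0)) // 2
            if n_repeats <= MAX_REPEATS:
                hits.append((match.start(), motif, n_repeats))
    return sorted(hits)

# Toy sequence: (AT)10 qualifies, (CA)5 is too short, (CT)8 qualifies.
toy = "GGG" + "AT" * 10 + "CCC" + "CA" * 5 + "GGG" + "CT" * 8 + "T"
hits = find_perfect_dinucleotide_repeats(toy)
```

A production search would also scan the reverse complement and apply the positional filters described in the text.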
For each locus, we genotyped 96 individuals using a 3-primer method for polymerase chain reaction (PCR; Boutin-Ganache et al., 2001). We used one untagged primer for each pair, a second primer with a 5’ tag (CAG sequence: 5’-CAGTCGGGCGTCATCA-3’), and a third CAG-sequence primer with a 5’-6FAM (Applied Biosystems) fluorescent label. The CAG sequence was added to the primer in each pair such that the melting temperature of the tagged primer was approximately 65 ºC. PCRs (15 µL total volume) contained 40 ng of template DNA, 0.25 µM untagged primer, 0.025 µM CAG-tagged primer, 0.25 µM 6FAM-labeled CAG primer, 200 µM each dNTP, 0.5 units Taq DNA polymerase (Promega) and 1× PCR buffer (500 mM KCl, 15 mM MgCl2, 100 mM Tris-HCl; Promega). For temperature cycling, we implemented a touchdown PCR protocol using an iCycler Thermal Cycler (BioRad): 94 ºC for 1 min, 21 cycles of denaturing at 94 ºC for 30 s, annealing for 20 s, and extension at 72 ºC for 20 s; initial annealing temperature (Ta) = 60 ºC and decreased by 0.5 ºC with each cycle until Ta reached 50 ºC, followed by 9 cycles using this Ta, and a final extension at 67 ºC for 45 min. We detected PCR-amplified fragments on an ABI 3130 Genetic Analyzer (Applied Biosystems), and sized fragments using GeneMapper 4.0 software (Applied Biosystems) calibrated with the ROX500 size standard (Applied Biosystems). Logistic regression and other statistical analyses of the mutation accumulation data were performed in R (www.r-project.org/).
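The touchdown schedule can be written out explicitly to check that the stated cycle counts are consistent. The sketch below assumes that the cycle in which Ta first equals 50 ºC is counted as the last of the 21 touchdown cycles; the function name and defaults are illustrative and not part of the protocol.

```python
def touchdown_annealing_schedule(start=60.0, step=0.5, floor=50.0, plateau_cycles=9):
    """Return (touchdown_cycle_count, list of annealing temperatures per cycle).

    Assumption: Ta drops by `step` each cycle, and the first cycle run at
    `floor` is counted as part of the touchdown phase.
    """
    temps = []
    ta = start
    while ta > floor:
        temps.append(ta)
        ta -= step  # 0.5 is exactly representable, so no rounding drift here
    temps.append(floor)                      # touchdown phase ends at the floor
    touchdown_cycles = len(temps)
    temps.extend([floor] * plateau_cycles)   # remaining cycles at constant Ta
    return touchdown_cycles, temps

cycles, temps = touchdown_annealing_schedule()
```

Under this reading, stepping from 60 ºC to 50 ºC in 0.5 ºC decrements takes 21 cycles, and the 9 additional cycles at 50 ºC bring the total to 30.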
We downloaded entire chromosome sequences as FASTA files from www.arabidopsis.org and used the program Tandem Repeats Finder v. 4.0 for Windows (TRF; Benson, 1999) to identify microsatellites. We used the following parameter values within TRF for genome analysis: alignment weights +2, −7, −7 (representing match, mismatch and indel penalties); matching probability of 0.80 and an indel probability of 0.10 (pM = 0.80 and pI = 0.10, respectively); a minimum alignment score of 20 and a maximum period size of 10. We extracted the dinucleotide repeats of all motif types from the full TRF output by visual inspection. We statistically analyzed the resulting data in Minitab (v. 14.0) for mean repeat length for each repeat motif category.
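When dinucleotide repeats are tallied by motif type, repeats that differ only in reading phase or strand belong to the same class: a CA repeat on one strand is a TG repeat on the other, and a CT repeat is an AG repeat. The sketch below shows one way to collapse the twelve possible dinucleotide motifs into four classes. It is an illustration only; the function name and the choice of the alphabetically first member as the class label are assumptions, and the extraction in this study was done by visual inspection.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_dinucleotide(motif):
    """Map a dinucleotide motif to its class label (AT, AC, AG or CG).

    Hypothetical helper: motifs are treated as equivalent if they differ
    only by a one-base phase shift or by reverse complementation.
    """
    motif = motif.upper()
    if len(motif) != 2 or set(motif) - set("ACGT"):
        raise ValueError("expected a two-base motif over A, C, G, T")
    if motif[0] == motif[1]:
        raise ValueError("mononucleotide run, not a dinucleotide motif")
    shifted = motif[1] + motif[0]                  # same repeat, other phase
    revcomp = motif.translate(COMPLEMENT)[::-1]    # same repeat, other strand
    revcomp_shifted = revcomp[1] + revcomp[0]
    return min(motif, shifted, revcomp, revcomp_shifted)

classes = {m: canonical_dinucleotide(m) for m in ("AT", "CA", "CT", "CG")}
```

With this labeling, the CA loci fall in class AC and the CT loci in class AG.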
For all loci, the majority of lines produced fragments that matched the length of the progenitor sequence: the Col-1 genomic sequence length plus the increment due to the primers. Putative mutations were identified as deviations from this progenitor sequence length. Each putative mutant was subsequently re-amplified and re-genotyped 2–6 times to distinguish real mutations (acquired during mutation accumulation) from those due to PCR error. Approximately 15% (19/124) of all putative mutations identified in the initial screen were determined to be PCR errors.
Across lines and loci, there were 5165 genotypes. Of these, 137 (2.7%) were confirmed mutations (Table 2). If we pool all mutant types in Table 2, the (haploid) mutation rate, μ, can be estimated as the number of mutations divided by the product of the number of lines (L) and the number of generations of mutation accumulation (G). Each line is expected to produce 2μ mutations per locus per generation but only half of these mutations will fix in subsequent generations of propagation. By this method, the estimated μ is 2.03 × 10−3 for the 20 AT repeats, 4.96 × 10−5 for the 14 CA repeats, and 3.31 × 10−4 for the 20 CT repeats. For the entire sample, the estimated μ = 8.87 × 10−4 with a standard error of 2.57 × 10−4.
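As a rough arithmetic check on the estimator, the pooled counts reported above can be plugged in directly. This is a sketch under the assumption that every scored genotype represents 30 generations of accumulation; the function name is illustrative.

```python
def pooled_mutation_rate(n_mutations, n_genotypes, n_generations):
    """Mutations per allele per generation, pooling over lines and loci.

    n_genotypes is the number of line-by-locus genotypes scored, so the
    denominator is the total number of line-generations observed per locus,
    summed over loci.
    """
    return n_mutations / (n_genotypes * n_generations)

# Counts from the text: 137 confirmed mutations, 5165 genotypes, 30 generations.
pooled = pooled_mutation_rate(137, 5165, 30)
```

The pooled value is about 8.8 × 10−4, close to the reported 8.87 × 10−4. The small difference is presumably because the reported figure averages per-locus estimates rather than pooling counts, although that is an inference and not stated in the text.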
The preceding calculations are approximate because the number of mutant lines may not exactly match the number of mutant alleles. Counting het-gain and het-loss as full mutations produces a slight upward bias in mutation rate because we expect that half of these lines will revert to the progenitor sequence with random allele loss due to segregation. However, we are likely underestimating mutation rate by single counting the multi-gain and multi-loss lines. These lines might reflect real multi-step mutations but they might also have fixed multiple single repeat mutations. Also, a small fraction of lines are expected to match the progenitor because of canceling of gains and losses.
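The expectation that half of new mutations fix, and that half of currently heterozygous lines will revert, follows from Mendelian segregation under selfing with single seed descent. The sketch below tracks the exact probabilities for one new heterozygous mutation, assuming strict selfing, one offspring per generation, and no selection; the function name is illustrative.

```python
from fractions import Fraction

def het_fate_after_selfing(generations):
    """Return (P fixed, P still heterozygous, P lost) after n selfing generations.

    Each generation a heterozygote yields a mutant homozygote with
    probability 1/4, a heterozygote with 1/2, and a progenitor
    homozygote with 1/4; homozygous states are absorbing.
    """
    fixed, het, lost = Fraction(0), Fraction(1), Fraction(0)
    for _ in range(generations):
        fixed += het * Fraction(1, 4)
        lost += het * Fraction(1, 4)
        het *= Fraction(1, 2)
    return fixed, het, lost

fixed_10, het_10, lost_10 = het_fate_after_selfing(10)
```

The probability of remaining heterozygous halves each generation, and the fixation and loss probabilities both converge to 1/2.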
There was a great deal of variability among loci in mutation rate (Table 2). This is partly due to the difference among motif types. However, within both the AT and CT groups, the variance in mutation count substantially exceeds the mean. Much of this variation can be attributed to the strong effect of initial repeat number (Figure 1). For both AT and CT repeats, mutation rate increases substantially with the allele length for that locus in the progenitor line. This is confirmed statistically using a Poisson generalized linear model with mutant count per locus as the response variable, motif type as a categorical factor, and progenitor repeat number as the covariate. The estimated mutation rate equations for each motif type are:
All coefficients (intercepts and slope) are significantly different from zero (p < 0.001). These equations share the same slope estimate because the test for an interaction between motif type and progenitor repeat number (slope heterogeneity) is not significant.
Finally, we examined whether the direction of mutation (gain vs. loss) was related to repeat number. In our screen, gains were more frequent than losses. For AT loci, there were an equal number of gains and losses (4 of each), but gains occurred more frequently in shorter alleles (16.5 vs. 20 repeats on average, respectively). For the CA (AC) repeat loci, there was an equal number of gains and losses (1 of each). The gain occurred in an allele with 10 repeats and the loss in an allele with 13 repeats. For the CT (AG) repeats, all five mutations were gains. In our second longest locus (AT0402; 28 repeats), 6 of the mutation accumulation lines differed from the progenitor by 2 or more repeats and all were losses. This is consistent with the trend noted in other studies for longer loci to contract with mutation.
The loci were chosen to span all five chromosomes of Arabidopsis. To test for an effect of chromosome on mutation rate, we added it as a factor in the Poisson regression model. Controlling for the effect of initial repeat number and motif type, the chromosomes were indistinguishable except for chromosome 4, which exhibits an elevated mutation rate (Z = 2.876, p < 0.005). This is because the most mutable loci within motifs (AT402, AT403 and CT401, CT402) reside on chromosome 4. With chromosome included as a factor in the model, initial repeat number remains the dominant predictor of mutation rate, although the estimated slope is reduced by about 25%.
Microsatellites composed of AT repeats were the most frequent followed by AG and then AC microsatellites (Table 3). The scan also identified a small number of short GC repeats, but these were excluded from Table 3. A greater number of perfect microsatellites (uninterrupted repeat strings) were identified than imperfect microsatellites. The latter category included compound microsatellites for all repeat motif types. Compound microsatellites comprise more than one repeat type. Some, but not all, compound microsatellites also have insertions between the multiple repeat types and this is likely to affect the mutational pattern.
This survey estimates the rate of mutation at 54 dinucleotide microsatellite loci in A. thaliana. The average estimated rate across loci is µ = 8.87 × 10−4 and the great majority of mutations were gains or losses of a single repeat. The mutation rate is heterogeneous across loci and increases with repeat number. Mutations in longer alleles are more frequently losses than gains (e.g. locus AT0402 in Table 2). These observations are fully consistent with previous mutational studies of plants (Table 1) and other organisms (e.g. Wierdl et al., 1997; Schlötterer et al., 1998; Dieringer and Schlötterer, 2003; Harr and Schlötterer, 2004; Seyfert et al., 2008).
For a given allele length, mutation rate differed among motif types. Kelkar et al. (2008) review a number of reasons why motifs might differ in mutability. The rate of loss and/or formation of hydrogen bonds can differ among motifs (AT may be more mutable because fewer H bonds must be broken). The relative mutability of motifs could also depend on the stability of the hairpin structures formed (ranked by mutation rate and hairpin stability: ATn > AGn > ACn) or of other secondary structures. Finally, motifs may be recognized differently by DNA repair mechanisms (see Harr and Schlötterer, 2000; Schlötterer et al., 2006). We found the AT motif to be most mutable and the CA motif to be least mutable (see difference in intercept estimates in equations 1), which is consistent with each of the first two suggestions (hydrogen bond and hairpin stability). There is also a slight tendency towards greater variability in allele length among A. thaliana lines for AT loci than for other motifs in the surveys of Innan et al. (1997) and Symonds and Lloyd (2003).
Our overall mutation rate estimate is probably less useful than the calibrated functions predicting rate given locus-specific features (equations 1). The strong dependence on motif and initial length implies that the average genome mutation rate depends on the relative frequency of the various motif types and on the distribution of allele sizes currently segregating in the population. The AT motif, which had highest mutation rate, is the most frequent repeat type in the entire genome (Table 3; see also Morgante et al., 2002). The CA motif, which is least mutable, is least frequent. The overall average mutation rate also depends on the distribution of repeat numbers per motif in the genome. We selected loci with allele sizes in the 8–30 range (Figure 1; averages 15.35, 11.86, and 16.35 for AT, CA, and CT, respectively). These average repeat lengths for our sample are higher than the mean for each motif type in our genome survey (Table 3). Since mutation rate increases with repeat number, the average rate across our loci within motifs should be elevated relative to the genomic average. However, this bias is counteracted because the most mutable motif (AT) is more frequent in the genome than in our sample.
Equations (1) use a single slope to describe the linear relationships between mutation rate and repeat length across motifs. This is statistically defensible—the test for slope heterogeneity was not significant—but is unlikely to be literally correct. For example, we see essentially no relationship between allele length and mutation rate in CA repeats of our dataset (Figure 1), although our sample contains few CA loci with large numbers of repeats. Also, the fact that equations (1) have negative intercept estimates is consistent with the idea that there is a minimum size for microsatellite loci to accrue mutations at their typically high rate. According to our linear model, this minimum is identified by where our lines cross the x-axis. However, we caution that the true relationship between mutation rate and repeat length is likely to be non-linear.
Approximately 15% of all putative mutations identified in our initial screen proved to be PCR errors and were discarded. This proportion is lower than in other studies that have verified putative mutations with multiple rounds of PCR. In their study of corn, Vigouroux et al. (2002) found 166 putative mutations in their initial screen, but only 72 were confirmed (approximately 43%, implying an error rate of about 57%). Symonds and Lloyd (2003) reported a PCR error rate of 95% for single base pair differences in A. thaliana microsatellites. While replicating PCR eliminates ‘false positives’, it is also possible for PCR to produce false negatives. This occurs if a PCR error reverts a real mutation back to the allele length of the progenitor. While we did not directly correct for false negatives, this bias should be minimal.
There is great interest in estimating Ne, the effective size of natural populations (Frankham, 1995; Leberg, 2005). The neutral theory of molecular evolution predicts that the amount of genetic diversity within a population should be a direct function of the product of Ne and the mutation rate, μ (Kimura, 1983). An independent estimate for μ allows these two variables to be disentangled and permits inference of Ne from genetic diversity.
Symonds and Lloyd (2003) surveyed 126 accessions of A. thaliana for variation at 20 dinucleotide microsatellite loci. The average gene diversity (G) in this survey was 0.76, similar to a previous estimate (0.79) obtained by Innan et al. (1997). Assuming neutrality, the expected G is 1 − 1/√(1 + 8 Ne µ) under the Stepwise Mutation Model (Ohta and Kimura, 1973). Substituting the average G from Symonds and Lloyd (2003) and our average μ across loci, we find that Ne ≈ 2300. With G = 0.79, Ne ≈ 3050. A distinct estimator for Ne is based on V, the variance of allele lengths in a population. The expected value for V is 4 Ne µ, assuming stepwise mutation (Moran, 1975). Pooling variance estimates from 20 loci (accounting for differences in sample sizes) in Innan et al. (1997) yields an average V of 25.5. Solving, Ne = 25.5/(4 × 8.87 × 10−4) ≈ 7200.
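The Ne values above can be reproduced numerically. The sketch below assumes the standard stepwise-model expectation for gene diversity, G = 1 − 1/√(1 + 8 Ne µ), which recovers the quoted figures, together with the variance relation V = 4 Ne µ given in the text. Function names are illustrative.

```python
MU = 8.87e-4  # average mutation rate per allele per generation from this study

def ne_from_gene_diversity(G, mu):
    """Solve G = 1 - 1/sqrt(1 + 8*Ne*mu) for Ne (assumed SMM expectation)."""
    return (1.0 / (1.0 - G) ** 2 - 1.0) / (8.0 * mu)

def ne_from_length_variance(V, mu):
    """Solve V = 4*Ne*mu for Ne (variance of allele length under stepwise mutation)."""
    return V / (4.0 * mu)

ne_symonds = ne_from_gene_diversity(0.76, MU)   # roughly 2300
ne_innan = ne_from_gene_diversity(0.79, MU)     # roughly 3050
ne_variance = ne_from_length_variance(25.5, MU) # roughly 7200
```

Because Ne enters the gene diversity formula through 1/(1 − G)², small changes in G near its upper range produce large changes in the estimate, which is one reason the two diversity-based values differ by about a third.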
While reasonable, these Ne estimates are encumbered with a number of notable caveats. First, each is subject to the bias inevitable when substituting point estimates into non-linear functions. Estimation error in either the variation statistics (G or V) or in the mutation rate biases estimation of Ne. Second, these calculations ignore real variation in mutation rate among loci. Finally, and perhaps most importantly, microsatellite allele length may not be selectively neutral. Very weak selection can substantially affect species level polymorphism (Akashi, 1997). The first two issues could be addressed by applying a more elaborate statistical model to the data. A large population survey focused on the same loci for which we have direct mutation rate information could potentially provide a strong test of the neutrality assumption.
Plants do not have a segregated germ line and as a consequence both mitotic and meiotic mutations will accumulate in MA lines. A few studies have attempted to isolate the mitotic rate by comparing genotypes from ancestral and descendant cells within the same plant. Cloutier et al. (2003) observed no microsatellite mutations in a total of 12 loci of Pinus strobus, allowing the authors to place an upper bound of between 2.3 × 10–7 and 6.9 × 10–8 for the mutation rate per mitotic cell division. O’Connell and Ritland (2004) observed one microsatellite mutation across 8 loci of Thuja plicata and from this estimated 3.13 × 10–4 mitotic mutations per allele per generation.
While our study cannot distinguish between meiotic and mitotic mutations, we suggest that meiotic errors are likely to be more important. Whittle and Johnston (2003) found that a greater proportion of mutations in A. thaliana are transmitted to progeny via pollen than ovule, implying mutation during gametogenesis. Also, our mutation rate estimate and most of the others in Table 1 are much higher than the mitotic rate estimate obtained by Cloutier et al. (2003). However, in long-lived species or those with extensive clonal reproduction, mitotic mutations might contribute a larger fraction of the genetic variation. In the future, application of the molecular tools available for this model plant might provide a quantitative estimate of the relative contributions of meiotic and mitotic mutation.
We thank J. Gleason, S. Macdonald, L. Hileman, J. Preston, J. Mojica, and M. Holder for comments on this paper. C. Baer provided insightful criticism of an early draft. This research was supported by NIH grant GM073990 and NSF grant DEB-0543052 to J. K. Kelly, NSF grants DEB-9629457 and DEB-9981891 to R. G. Shaw, and NSF DEB-0108242 to M. Orive. M. E. Mort acknowledges DEB-0344883. We thank S. Macdonald and J. Gleason for use of their laboratory equipment and Lisa Darmo for establishing the MA lines.