High-throughput DNA analyses are increasingly being used to detect rare mutations in moderately sized genomes. These methods have yielded genome mutation rates that are markedly higher than those obtained using pre-genomic strategies. Recent work in a variety of organisms has shown that mutation rate is strongly affected by sequence context and genome position. These observations suggest that high-throughput DNA analyses will ultimately allow researchers to identify trans-acting factors and cis sequences that underlie mutation rate variation. Such work should provide insights on how mutation rate variability can impact genome organization and disease progression.
Mutations play important roles in disease progression (e.g. 1) and in shaping genome evolution and architecture. Such events can affect gene size, organization, and expression level, and can alter genetic interactions that act in recombination and sex (2–5). Mutations also play direct roles in phenotypic evolution (6). For example, gene duplication followed by divergence of the duplicated genes through mutational processes is thought to be a major mechanism for evolving novel gene functions (7).
In general, mutations fall into one of three categories: single nucleotide mutations, insertions/deletions, and chromosome rearrangements. Insertions and deletions can involve single base pairs, entire genes, or larger chromosomal regions. Single nucleotide mutations can result from exposure of the genome to endogenous and exogenous mutagens (8). These DNA damage events can occur at high frequency; for example, a single mammalian cell accumulates ~10,000 abasic (apurinic) lesions per day (9). Most DNA lesions induced by endogenous and exogenous means are recognized and repaired by well characterized DNA repair systems (10, 11). In their absence, the replication of DNA containing damaged bases (e.g. oxidation, deamination, alkylation) can generate mutational events at a high rate, primarily through the loss of template information and/or the recruitment of low fidelity DNA polymerases that display error rates as high as 1 per 100–1000 nucleotides incorporated (12). Mutations in the genome can also arise as a result of errors during DNA replication. These events are rare because the nucleotide selectivity and proofreading functions of DNA polymerases, combined with post-replicative mismatch repair systems, greatly reduce the error rate to roughly 1 per 10^9 nucleotides incorporated (12).
Mutations occur in coding and non-coding regions and can be broadly classified as lethal, deleterious, neutral, or beneficial based on their fitness effects. Neutral mutations have negligible effects on fitness and are thus invisible to natural selection. Beneficial mutations are favored by positive selection and may range from mildly to highly adaptive (13–15), while deleterious mutations impose a fitness cost and tend to be removed from the population by natural selection. The relative proportion of these mutations varies between species (16) due to factors including a species’ effective population size and genome organization (e.g. the percent of the genome containing coding and repetitive sequences). For example, 20–30% of amino acid mutations appear neutral in humans (17, 18, 19); this value is significantly lower (2.8%) in enteric bacteria (20).
Work in a number of organisms suggests that mutation rates vary as a function of genome position (21–24). In baker’s yeast the mutation rate of a microsatellite reporter placed at a variety of chromosomal positions varied by 16-fold (21). A two-fold difference in substitution rate was observed in human chromosomes measured at one megabase pair resolution (24); because these rates were estimated using non-functional transposable elements, it is likely that this variation is largely due to mutation rate variation among genomic locations. Regional variation in mutation rates has been correlated with differences in base composition (23, 25), local recombination rate and gene density (26), transcriptional activity (27, 28), variations in repair efficiency at different sites in the genome (21), chromatin structure (29), nucleosome position (30) and replication timing (31). Some of the mutational variation described above is likely to occur in repetitive parts of the genome that are subject to less selection and appear to be less stable (32, 33).
Obtaining a measure of mutation rate variation using more accurate and comprehensive high-throughput methods, combined with bioinformatic approaches, will allow for a better understanding of the cis and trans-acting factors that affect mutation rate and will likely be relevant to understanding genome evolution, organization and disease progression. This review will provide an overview of pre- and post-genomic methods to determine mutation rates and how post-genomic technologies can be used to measure mutation rate variation.
Mutation rates are typically represented as µn, the mutation rate per base, or U, the mutation rate per genome. In general these rates are measured with respect to single nucleotide mutations. U can often be indicated as UT, the total mutation rate per genome, or UD, the deleterious mutation rate per genome based on fitness (34, 35). µn and U are presented as the number of mutations/base/division, and the number of mutations/genome/division, respectively. For multicellular organisms, these units are typically expressed per generation instead of per division. Unless otherwise noted, all estimates for UT and UD are for diploid genomes. Pre-genomic mutation rate measurements in bacteria and yeast have generally been obtained using fluctuation tests (36). In these assays, a large number of parallel cultures are started with a relatively low number of wild-type cells, grown under non selective conditions, and then plated onto selective media to identify mutants. The total number of cells at the end of the growth period is determined by plating an appropriate dilution on non-selective media. The number of mutations that arise in each culture should follow a Poisson distribution that can be used to estimate the mutation rate by several methods including the commonly used method of the median (36). At the CAN1 locus in the baker’s yeast S. cerevisiae, the average mutation rate per base (µn) from the analysis of several fluctuation tests was estimated to be 1.7 × 10−10 (37). A more recent study in S. cerevisiae (38) provided several refinements to this analysis; µn was estimated, based on analysis of the URA3 and CAN1 loci, to be 3.80 × 10−10 and 6.44 × 10−10, respectively. These data also support the idea that mutation rates vary as a function of genome position, and provide further motivation to implement high-throughput methods to accurately quantify mutation rates across the genome.
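As a rough illustration of the method of the median mentioned above, the following sketch solves the Lea-Coulson median relation, r/m − ln(m) − 1.24 = 0, for the expected number of mutations per culture (m), given the median number of mutants per culture (r). The function names are hypothetical, and dividing m by the final cell number is a simplifying assumption; published analyses sometimes apply additional correction factors, and converting to a per-base rate further requires an estimate of the mutational target size.

```python
import math
from statistics import median


def lea_coulson_m(median_mutants: float) -> float:
    """Solve r/m - ln(m) - 1.24 = 0 for m by bisection.

    r is the median number of mutants per culture; m is the expected
    number of mutations per culture.
    """
    if median_mutants <= 0:
        raise ValueError("median number of mutants must be positive")

    def f(m: float) -> float:
        return median_mutants / m - math.log(m) - 1.24

    # f is strictly decreasing in m, positive near zero and negative
    # at the upper bound chosen here, so bisection is safe.
    lo, hi = 1e-9, max(10.0, 10.0 * median_mutants)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def mutation_rate(mutant_counts, final_cells: float) -> float:
    """Mutations per cell per division at the assayed locus.

    Assumption: rate = m / final cell number, with no further correction.
    """
    return lea_coulson_m(median(mutant_counts)) / final_cells
```

For example, `mutation_rate(counts, n_cells)` takes the mutant colony counts from the parallel cultures and the average final number of cells per culture determined from the non-selective plates.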
The fluctuation test estimate of µn in S. cerevisiae relies on detecting mutational events in genetic markers using selection based on reversion to function or loss of function. For example, the reversion rate at three different loci was used to measure rates of spontaneous mutation during mitosis and meiosis (39). While these early estimates of mutation rates were important in terms of obtaining initial values across organisms, they were based on measurements made at a few loci that were then extrapolated to the entire genome. Such an approach is likely to be inaccurate because mutation rates can vary according to chromosome position (21 and see below) and it can often fail to detect synonymous mutations and mutations in non-coding regions.
Mutation accumulation assays have been used in several non-mammalian model systems to directly measure genome-wide mutation rates (40–43). In these assays, a set of initially isogenic lines are maintained and allowed to accumulate mutations by minimizing the effects of selection. Selection is minimized by frequent bottlenecks, where minimum effective population sizes are maintained to allow even deleterious mutations to accumulate. Different lines will independently accumulate different numbers of mutations, leading to loss of fitness (ΔM) compared to controls, and an increase in variance for fitness (ΔV) among the lines. Fitness measures commonly used in these assays are growth rate and reproductive success. Because organismal fitness is controlled by a very large number of loci, it offers the widest mutational target, allowing the recovery of most mutational events (44). From the ΔM and ΔV parameters, one can infer mutation rates computationally using the Bateman-Mukai (BM; 42), maximum likelihood (ML; 45) or minimum distance (MD; 46) methods. A comparison of these methods is described in Garcia-Dorado and Gallego (47). As shown in Table 1, computational analysis of mutational accumulation assays in Drosophila melanogaster has resulted in estimates of UD ranging from 0.01 to 0.17 (43, 46, 48). In C. elegans, analogous methods have estimated UD to be 0.005 per haploid genome (49). In S. cerevisiae, estimates of UD range from 6.3 × 10−5 per haploid genome (14) to 9.5 × 10−5 per diploid genome (50).
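To make the role of ΔM and ΔV concrete, the sketch below applies the classical Bateman-Mukai relations in their simplest form: a lower bound on the deleterious mutation rate, U ≥ ΔM²/ΔV, and an upper bound on the average fitness effect per mutation, s ≤ ΔV/ΔM. This is an illustration under the equal-effects assumption, with ΔM and ΔV expressed per generation; the function name is hypothetical, and the cited studies may use ploidy-specific or otherwise modified versions of these formulas.

```python
def bateman_mukai(delta_m: float, delta_v: float):
    """Return (U_min, s_max) from mutation accumulation summary statistics.

    delta_m: per-generation decline in mean fitness (entered as a positive number)
    delta_v: per-generation increase in among-line variance in fitness
    Assumes all mutations have equal effects, so U_min is a lower bound on U
    and s_max is an upper bound on the mean effect.
    """
    if delta_m <= 0 or delta_v <= 0:
        raise ValueError("delta_m and delta_v must be positive")
    u_min = delta_m ** 2 / delta_v
    s_max = delta_v / delta_m
    return u_min, s_max
```

Because the two bounds move in opposite directions, a given fitness decline is compatible either with many mutations of small effect or with few mutations of large effect, which is one reason these estimates are sensitive to the distribution of fitness effects.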
The computational models described above assume that all mutations have similar effects on fitness. This assumption, which is almost certain to be incorrect, will cause bias in mutation rate estimates. Methods have been developed that do not make this assumption. For instance, in S. cerevisiae, all four meiotic products can be recovered, which facilitates directly linking a single deleterious mutational event and its fitness effect (44). Diploid clones that have acquired a single deleterious mutation can be sporulated and all four haploid products can be recovered; the growth rates of the two wild type haploids relative to the two mutant haploids are used to estimate the fitness effects of the novel mutation. Using this strategy UD was determined in S. cerevisiae to be 1.1 × 10−3 per diploid genome (44). However, even this direct method can underestimate UD as deleterious mutations with small fitness effects may not show observable growth differences among the haploid progeny, especially under laboratory conditions.
Indirect methods are often used to infer mutation rates in mammals where mutation accumulation studies and fluctuation tests are impractical. These methods measure neutral sequence differences between related species to infer mutation rates (51). Neutral substitutions are considered to be: 1. The four-fold degenerate synonymous sites in open reading frames of protein-coding sequences, 2. Pseudogene loci, 3. Repetitive DNA sequences and 4. Noncoding non-repetitive DNA. Mutation rate estimates in mammals based mostly on indirect methods and a few direct estimates are shown in Table 2.
While indirect methods have improved our estimation of mutation rates, they also have their limitations. Estimating mutation rates indirectly from phylogenetic comparisons of DNA sequences is dependent upon accurate estimates of the generation length and divergence time of a species; these measures, however, are difficult to obtain. In addition, the four-fold degenerate synonymous sites may not be neutral, as has been suggested (52, 53), and non-coding DNA may be subject to high levels of selective constraint and may also evolve under positive selection, at least in some systems (e.g. 54, 55).
The above pre-genomic approaches typically determine mutation rates based on limited sequence analyses at a few loci or fitness based assays. Since most mutations that confer phenotypes are thought to be deleterious, these small-scale approaches can skew the distribution towards observable mutations (16, 40, 48, 50, 56–59). In addition, the heterogeneity in the fitness effects of mutations makes it difficult to accurately infer mutation rate in fitness based mutation accumulation assays (60), where the fitness effects of all mutations are assumed to be the same. High-throughput genome wide measurements of mutation can reduce concerns about skewed mutation distributions and variable fitness effects because mutations are directly detected.
High-throughput measurements can be performed using traditional sequencing methods such as Sanger sequencing or by other methods that detect sequence variants such as single strand conformation polymorphism (SSCP) and denaturing high performance liquid chromatography (DHPLC; 61). However, these methods are in general labor and cost intensive. An alternative is to use new sequencing and microarray technologies that provide rapid, accurate, and cost effective mutational profiling at a genomic scale. Three new sequencing technologies are commercially available that produce short sequence reads (25–200 bp) using a massively parallel approach. These technologies have been developed by Illumina (Genome Analyzer), Roche (Genome Sequencer FLX) and ABI (SOLiD). In general, a reference genome, which is available for almost all model organisms, is required to assemble these reads. These and other emerging technologies are reviewed exhaustively elsewhere (62, 63). Microarray approaches have been developed that offer a viable alternative to whole genome sequencing by identifying mutations based on differential hybridization to oligonucleotide tiling arrays. Such arrays allow the entire genome to be interrogated at single base level for mutations which can be identified by re-sequencing (64).
High-throughput DNA analysis approaches have led to a considerable amount of new information on single base mutation rates in model organisms. In particular, estimates of mutation parameters µn, UT, and UD have been significantly revised upwards; these increases most likely reflect the greater sensitivity of high-throughput approaches for detecting mutation events (Table 3). Post-genomic estimates of UD appear high in D. melanogaster and C. elegans and are similar to values determined in mammals using pre-genomic estimates (Tables 2 and 3). This suggests that high deleterious mutation rates are not unique to mammalian genomes.
In S. cerevisiae, µn was determined to be 3.3 × 10−10 by pyrosequencing the genomes of four mutation accumulation lines to a 5-fold average genome coverage (65). UT was estimated to be approximately 0.32 per haploid genome and mostly comprised homopolymeric mutations (0.30) that were shown to have very high mutation rates. Although comparison with pre-genomic estimates of UD that are in the range of 10−5 to 10−3 suggests that only a very small percentage (0.1%) of mutations confer fitness defects that can be detected in laboratory assays, it is not clear what fraction of the homopolymeric mutations are deleterious (see 66).
In C. elegans, a total of 4 Mb was sequenced from mutation accumulation lines at different generations using the Sanger method (67). µn was estimated to be 2.1 × 10−8 per base, which is an order of magnitude higher than pre-genomic estimates based on laboratory fitness assays (34, Table 3). UT was estimated to be 2.1 per haploid genome which, when compared to pre-genomic haploid estimates of UD (0.005), suggested that most mutations (99%) in these lines have fitness defects that are not easily seen in the laboratory. UD was inferred to be 0.48 per haploid genome which is two orders of magnitude higher than the pre-genomic estimates and highlights the drawbacks of inferring them based on laboratory fitness assays. More than half of the mutations were small insertions or deletions (17 insertion-deletion events out of 30 mutations observed) instead of single base mutation events as assumed earlier (68). A high estimate of UD (0.14 per diploid genome per generation for protein coding genes) was also inferred by Davies et al. (69) by comparing the number of deleterious mutations detected at the molecular level in forward and reverse mutation assays with fitness based assays. This comparison suggested that greater than 96% of the deleterious mutations fixed in the mutation accumulation lines have fitness effects too subtle to be detected based on laboratory fitness assays.
In Drosophila melanogaster the mutation rate per base (µn) was estimated to be 8.4 × 10−9 by scanning 20 Mb of DNA from three sets of mutation accumulation lines using denaturing high-performance liquid chromatography (70). This estimate is about 24 fold higher than pre-genomic estimates (34). UD was estimated to be 1.2, which is again higher than pre-genomic estimates from computational analysis of mutation accumulation lines (48). Significant heterogeneity in the mutation rate was also seen between the three lines. µn was 4.8 × 10−9 for the Madrid line, 17.2 × 10−9 for the Florida-33 line, and 6.8 × 10−9 for the Florida-39 line. The ability to detect mutation rate variation between lines/individuals is one of the advantages of using high-throughput approaches as opposed to the population-based estimates that are obtained from pre-genomic methods. Unlike the case in C. elegans, single base substitution mutations comprised the majority of the mutations.
The post-genomic estimates of µn appear somewhat similar in C. elegans and D. melanogaster (2.5-fold difference, Table 3) but are markedly lower in S. cerevisiae (25- and 63-fold lower relative to D. melanogaster and C. elegans, respectively). Mutation rate estimates in multicellular organisms such as C. elegans and D. melanogaster can appear magnified because the values reflect per generation estimates (number of mutations per division multiplied by the number of germ line divisions per generation) rather than per division as determined for unicellular organisms. Even after taking this into account, it is likely that multicellular organisms bear an increased mutational load that makes them more susceptible to the effect of deleterious mutations (71; UD estimates in Table 3). It is fascinating that these higher mutation rates do not appear to interfere with the evolution of complexity in multicellular organisms (72). One explanation for the appearance of higher mutation rates in multicellular organisms is that the lower effective population sizes of these organisms contribute to higher mutation rates by increasing the role of genetic drift in fixing mutator alleles (2).
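The fold differences quoted above follow directly from the post-genomic µn values given in the text, and the per-generation versus per-division distinction is a simple rescaling. The sketch below reproduces that arithmetic; the function names are hypothetical, and the number of germ line divisions per generation is left as an input because it is not given here.

```python
# Post-genomic per-base estimates quoted in the text.
MU_YEAST = 3.3e-10  # S. cerevisiae, per base per division
MU_WORM = 2.1e-8    # C. elegans, per base per generation
MU_FLY = 8.4e-9     # D. melanogaster, per base per generation


def fold_difference(rate_a: float, rate_b: float) -> float:
    """How many times larger rate_a is than rate_b."""
    return rate_a / rate_b


def per_division_rate(per_generation_rate: float, germline_divisions: float) -> float:
    """Convert a per-generation rate to a per-division rate.

    Assumes mutations accumulate evenly over the germ line divisions in
    one generation; germline_divisions must be supplied by the user.
    """
    if germline_divisions <= 0:
        raise ValueError("germline_divisions must be positive")
    return per_generation_rate / germline_divisions
```

With these values, `fold_difference(MU_WORM, MU_FLY)` is 2.5, and the fly and worm estimates are roughly 25 and 63 times the yeast estimate before any correction for germ line divisions.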
Two problems associated with the new high-throughput sequencing technologies are the inability to obtain mappable sequence information from repetitive regions of the genome, and the high error rates associated with sequence detection (73). Repetitive regions are significantly underrepresented in the useable output of current high-throughput sequencing methods due to the difficulties in mapping repetitive sequences to unique chromosomal positions using short read data. As described below this can be a major concern because mutation rates in repetitive regions are likely to be significantly higher than in non-repetitive regions. In addition, mutations found in only a subset of the repeat copies are often masked as sequencing errors because it is difficult to determine the origin of any particular sequence read (74). Also, DNA sequencing errors, which are often specific to the new technology used, can make it challenging to identify heterozygous mutations in diploid genomes (64, 75). Lack of adequate genome wide coverage during sequencing can also contribute to sequence errors that appear as mutational events. At present the overall impact of false negatives and positives on mutation rate is unclear.
Many of the sequencing errors associated with the new technologies cannot be resolved by programs designed for the Sanger sequencing method such as Polyphred; therefore, observed mutations have to be carefully analyzed (76). New algorithms are being developed to provide base quality scores for these new sequencing methods (e.g. 77), and to map short repeat sequence reads to a reference genome (78). Also, some of the base calling errors characteristic to the new sequencing technologies, such as higher error rates towards the end of sequencing reads, have been estimated (79) and methods to better detect these errors are being developed (80, 81). Since most new mutations that arise in diploid organisms are heterozygous, detecting them using these new sequencing technologies is challenging, but will likely be overcome with increased sequence coverage and verification through other methods such as Sanger sequencing.
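The effect of sequence coverage on detecting heterozygous mutations can be illustrated with a simple binomial model. The sketch below assumes that each read covering a heterozygous site samples the mutant allele with probability 0.5, that reads are independent, and that a mutation is called only when at least a chosen number of reads support it; sequencing errors and mapping biases are ignored, and the function name and the calling rule are hypothetical.

```python
from math import comb


def prob_detect_het(coverage: int, min_reads: int, allele_fraction: float = 0.5) -> float:
    """Probability that at least min_reads of coverage reads carry the mutant allele.

    Assumes independent reads that each sample the mutant allele with
    probability allele_fraction (0.5 for a heterozygous site), with no errors.
    """
    p = allele_fraction
    return sum(
        comb(coverage, i) * p ** i * (1 - p) ** (coverage - i)
        for i in range(min_reads, coverage + 1)
    )
```

Under these assumptions, requiring three supporting reads at 5-fold coverage gives a detection probability of 0.5, whereas the same rule at 20-fold coverage detects nearly all heterozygous sites, which is why deeper coverage is expected to help.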
Accurate estimates of mutation rate parameters µn and U are crucial from a molecular evolution perspective because they are used to fix baseline mutation rates within a species. These values are in turn useful for mutation rate comparisons under altered environmental and growth conditions. Improvements in mutation rate measurements will be useful for accurately determining the rate of deleterious mutations that are not efficiently removed by selection and can thus contribute to mutational loads that accumulate and cause the extinction of small populations (82). Because humans pass on roughly 100 new mutations to their offspring (83), knowing how many of these mutations are beneficial, neutral or deleterious has implications for the long term fitness of a species that produces few offspring (84).
More accurate estimates of mutation rate are crucial in an evolutionary context as well (51). For instance, more informed estimates of mutational load are of primary importance from a theoretical perspective in terms of understanding the evolution of sex and the evolution of recombination. In addition, because the number of neutral substitutions per site (K) is a function of time and mutation rate (K = 2µT), increasing the accuracy of mutation rate (μ) estimates will improve our estimates of divergence times among species under the assumption of a molecular clock. Finally, the most fundamental population genetic parameter is arguably θ = 4Neµ, where Ne is the effective population size. This equation describes the level of neutral variation in a population. This parameter, which depends critically on mutation rate, is indispensable in population genetic models, which are used to infer patterns of selection and demography based on extant population-level sequence data. Improved estimates of mutation rate will help inform our understanding of the strength and frequency of adaptive events, the distribution of selective constraint across the genome, and the effects of population history on population-level variation.
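Both relations above can be rearranged to show how directly the inferred quantities depend on the mutation rate. The sketch below solves K = 2µT for the divergence time and θ = 4Neµ for the effective population size; the function names are hypothetical, µ is taken per site per generation so that T is in generations, and a strict molecular clock is assumed.

```python
def divergence_time(k: float, mu: float) -> float:
    """Divergence time T (in generations) from K = 2 * mu * T."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return k / (2.0 * mu)


def effective_population_size(theta: float, mu: float) -> float:
    """Effective population size Ne from theta = 4 * Ne * mu."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return theta / (4.0 * mu)
```

Because µ appears in the denominator of both expressions, doubling the estimated mutation rate halves both the inferred divergence time and the inferred effective population size, so any upward revision of µ propagates proportionally into these estimates.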
As described above, a major drawback of most high-throughput DNA analyses is that they are unable to detect mutations in highly repetitive sequences. These include simple repeat sequences that have repeat lengths of less than 300 bp such as microsatellites, minisatellites, Alu repeats and telomeric repeats as well as much longer repeats (several kb) like LINE elements and rDNA repeats. Repetitive DNA comprises as much as 50% of the human genome (85) and a substantial portion (~17–57%) of the genome in most model organisms (e.g. 86, 87). Repetitive regions are prone to a wider range of mutational events such as insertion-deletion mutations and rearrangements (88). In addition, single nucleotide substitution events are known to be higher near insertion/deletion mutations (89). Studies performed primarily in bacteria, yeast and Drosophila have shown that frameshift mutations in repetitive regions can occur at up to several orders of magnitude higher than in non-repetitive regions (e.g. 90–92). These mutagenic events are likely to dramatically affect the fitness of an organism if they occur in the open reading frame of a gene. For example, Heck et al. (66) searched the S. cerevisiae genome and identified greater than 600 seven-nucleotide repeat runs in essential genes and calculated that strains grown for 160 generations display a 7 × 10−4 probability of acquiring a mutation in one or more of these runs. Thus, essential genes containing simple sequence repeats within open reading frames are at risk for disruption. Simple repetitive sequences have also been identified within developmental genes (93). The repetitive sequences in these genes are thought to contribute to rapid morphological evolution by contributing to localized genetic variation without causing a general increase in mutational load (94, 95).
The presence of localized regions of the genome with different mutation rates can have important consequences for the study of evolution. Recent work examining cryptic mutation hotspots indicates that mutation rate variation has been grossly underestimated in the human genome (96). The substantial mutation rate variation within a genome makes previous calculations of average mutation rates µn and U, initially measured to be relatively constant across species (34), less useful in terms of estimating population history and species divergence time. The existence of such variation also makes it more difficult to determine the roles of selection and mutation in maintaining conserved genomic regions. Identifying regional variation in mutation rates using bioinformatic analyses of high-throughput data is likely to be useful in terms of identifying sequences/regions that correlate with both high and low mutation rates, and should allow us to distinguish between the role played by selection or low mutation pressure in maintaining conserved genomic regions (24). Already such approaches have borne fruit. For example, recent work in yeast suggests that linker DNA has a 10–15% lower substitution rate than nucleosomal DNA (30). Future work in this field will likely involve an analysis of the relative contributions of mutation rate versus selective constraint in establishing nucleosome positioning (30). Regional variation in mutation rate has also been hypothesized to influence the spatial distribution of genes (3).
Genome wide mutation rate measures and local heterogeneity in mutation rates are relevant from a human disease perspective. Knowledge of the mutation rate associated with different tumors can be useful for clinical therapy. Greenman et al. (97) re-sequenced 274 Mb of DNA corresponding to coding exons of 518 protein kinase genes in 210 human cancers. Strikingly, the prevalence of mutations in different types of cancers was different. For example lung and gastric cancers showed the highest prevalence of mutations (4.21 and 2.10 mutations per Mb, respectively) while testis and breast cancers showed lower levels (0.12 and 0.19 mutations per Mb, respectively). Tumors with higher mutation rates are predicted to require treatment with multiple drugs in order to avoid drug resistance (98). Chromosomal regions with higher mutation rates may also be more susceptible to DNA damage. Such sites could be responsible for chromosome instability that is associated with tumorigenesis (99).
Lastly, the new DNA sequencing methods will likely identify patterns of mutagenesis that would be difficult to detect by measuring mutations at only a few loci. For example, mutations have been reported to occur in showers in systems ranging from viruses (100) to mice (101). Mutational showers, defined as regions containing mutations at levels higher than predicted by mutation rate and random distribution, are hypothesized to occur due to a transient hypermutable state of a fraction of the population (102, 103). Such clustering of mutations in genomic space was seen in transcription-associated mutagenesis in bacteria and yeast (27, 104). Mutations mediated by error-prone DNA double strand break repair pathways are also thought to create mutation clusters around double-strand break sites (103, 105–108). The new sequencing technologies offer an efficient way to characterize subpopulations with different mutation rates because the expense of sequencing multiple populations is relatively modest.
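The definition of a mutational shower given above suggests a simple statistical check: compare the number of mutations observed in a genomic window with the number expected if mutations were distributed randomly at the average rate. The sketch below computes the Poisson tail probability for such a window. It is only an illustration; the function names are hypothetical, the expected density of mutations per base pair over the course of the experiment must be supplied by the user, and no correction is made for testing many windows.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf


def shower_p_value(n_mutations: int, window_bp: int, mutations_per_bp: float) -> float:
    """Probability of seeing at least n_mutations in a window by chance.

    mutations_per_bp is the expected mutation density under a uniform
    model (for example, per-base rate multiplied by generations elapsed).
    """
    return poisson_tail(n_mutations, window_bp * mutations_per_bp)
```

A very small value for a window indicates more clustering than the uniform model predicts; for extremely small tail probabilities the subtraction from one loses precision, so a dedicated statistics library would be preferable in practice.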
We are looking forward to the development of more accurate high-throughput methods that can be used to identify de novo mutation events in both unique and repetitive regions of a diploid genome. Such measurements, performed on a large scale, will provide more accurate estimates of mutation rates that can ultimately be used to identify trans acting factors and cis sequences that affect mutation rate variation. Work by Lynch et al. (65), Denver et al. (67) and Haag-Liautard et al. (70) provides excellent examples of how this work will be pursued. Traditional approaches that seek to summarize mutation rate information over a genome are being replaced by high-throughput approaches that will provide a better estimation of the mutation rate variation that results from distinct mutation formation mechanisms (89, 94, 96). Such achievements will ultimately allow scientists to measure mutation rate variation associated with different drug treatments, biological processes (e.g. different stages of the cell cycle), environmental conditions, and nutritional and disease states (e.g. 97). Ultimately the high-throughput technologies will allow one to determine mutation rate variation between individuals at specific regions in the genome. This is particularly relevant in the coming era of personalized medicine for estimating genetic disease risk.
We are grateful to Charles Aquadro and members of the Alani laboratory for discussions and comments on the manuscript. E. A. and K. T. N. were supported by NIH grant GM53085. N. D. S. is supported by NIH grant number 1F32GM080944-01 to N. D. S., Charles F. Aquadro, and Andrew G. Clark.