|Home | About | Journals | Submit | Contact Us | Français|
DNA mutations are the source of genetic variation within populations. The majority of mutations with observable effects are deleterious. In humans mutations in the germ line can cause genetic disease. In somatic cells multiple rounds of mutations and selection lead to cancer. The study of genetic variation has progressed rapidly since the completion of the draft sequence of the human genome. Recent advances in sequencing technology, most importantly the introduction of massively parallel sequencing (MPS), have resulted in more than a hundred-fold reduction in the time and cost required for sequencing nucleic acids. These improvements have greatly expanded the use of sequencing as a practical tool for mutation analysis. While in the past the high cost of sequencing limited mutation analysis to selectable markers or small forward mutation targets assumed to be representative for the genome overall, current platforms allow whole genome sequencing for less than $5,000. This has already given rise to direct estimates of germline mutation rates in multiple organisms including humans by comparing whole genome sequences between parents and offspring. Here we present a brief history of the field of mutation research, with a focus on classical tools for the measurement of mutation rates. We then review MPS, how it is currently applied and the new insight into human and animal mutation frequencies and spectra that has been obtained from whole genome sequencing. While great progress has been made, we note that the single most important limitation of current MPS approaches for mutation analysis is the inability to address low-abundance mutations that turn somatic tissues into mosaics of cells. Such mutations are at the basis of intra-tumor heterogeneity, with important implications for clinical diagnosis, and could also contribute to somatic diseases other than cancer, including aging. Some possible approaches to gain access to low-abundance mutations are discussed, with a brief overview of new sequencing platforms that are currently waiting in the wings to advance this exploding field even further.
DNA mutations are a double-edged sword. On the one hand they provide a template for natural selection in creating the bewildering diversity of life on Earth, while on the other hand, their random occurrence can disturb highly conserved and interconnected gene regulatory networks resulting in altered genes or gene expression, defective cell functioning and disease. To maintain germ line mutation rate at an optimal level, allowing organisms to adapt to new environments while avoiding deleterious effects on fitness, an efficient system of genome maintenance has evolved over the millennia. This has led to a mutation rate that is surprisingly well conserved among unicellular and multicellular organisms with rates between 1×10−9 and 1×10−10 mutations per base per cell division (Table 1). The majority of deleterious mutations that spontaneously arise in a population are quickly removed from the gene pool through selection. However, mutations with mildly deleterious effects can remain in the population and lead to disease-related phenotypes.
For multicellular organisms, it is necessary to differentiate between mutational processes that occur within the germline and those that arise in the soma. Although selective pressures in the germline and soma differ, they share a common set of repair enzymes that likely evolved prior to the emergence of multicellular organisms. Mutations in the germline, so-called de novo mutations, can be passed on to offspring and may have adverse phenotypic consequences. Mutations can also arise in the soma, contributing to the development of both neoplastic and non-neoplastic syndromes.
In spite of the importance of DNA mutation as the substrate of evolution and a major cause of human disease, there is very little direct information about mutation frequencies and spectra in metazoans. This is entirely due to the lack of methods for quantifying and characterizing germline and somatic mutations. Especially low-abundance, somatic mutations are currently beyond the reach of most molecular analysis methods. With the emergence of massively parallel sequencing (MPS) methods, the direct measurement of genetic mutations is now possible and has already led to new data on germline mutation frequencies in invertebrate organisms. Here, we give a short historical background of the field of mutation research with the technology platforms it has used to estimate mutation rates and study mutation spectra in different cells and organisms. We then review MPS technologies and their applications in mutation research with a focus on mutation detection in mammalian systems. Finally, we briefly discuss new approaches to capture low-abundance mutations and experimentally address cell-to-cell variation in mutation loads, including the impact of new, experimental systems for single molecule sequencing.
Some of the earliest attempts to define the rate of germline mutation were described at the beginning of the 20th century[4, 5]. The first demonstration of an induced mutation load was provided by Muller’s X-ray experiments on Drosophila. This work formed the basis of forward genetics and expanded the tools available to estimate germline mutation rates. Muller used a phenotypic scoring approach, counting mutant lethals in order to measure the effect of irradiation on mutation frequencies. Building upon Muller’s work, Stadler irradiated maize and scored mutants using qualitative traits at eight loci to estimate the mutation frequency of each gene (between 10−4 and 10−6). Haldane provided the first estimate (indirect) for human mutation frequency using the principle of selection balance. Based on demographic data on the fitness and frequency of hemophilia-affected males, he was able to estimate the frequency of new mutations arising at the haemophilia locus in the general population.
Due to a lack of knowledge on the size of the genome, the size of the locus, and the fraction of the locus that, when mutated, led to a dysfunctional protein product, extrapolations of the mutation frequency to the entire genome were not possible until very recently. The availability of fully annotated genomes has produced accurate estimations of the fraction of functional sites per locus and has allowed for the quantification of genome-wide per generation base-pair mutation rates in many organisms (Table 1)[9–12].
In principle, non-synonymous mutations can be detected at the protein level. Neel and co-workers examined children whose parents had been exposed to radiation at the time of the atomic bombings of Hiroshima and Nagasaki for the occurrence of mutations altering the electrophoretic mobility or activity of a series of proteins. The mutation rate observed in the children of exposed individuals was 0.60 × 10−5/locus/generation compared to 0.64 × 10−5/locus/generation in the control children, whose parents had not been exposed to radiation. Apart from the fact that the mutant frequency in this case appeared not to depend on parental exposure, these results reveal a spontaneous mutation frequency that is very similar to the estimates made 50 years earlier by Haldane.
A major limitation of approaches based on phenotypic scoring is the high number of individuals that must be screened in order to detect spontaneous mutations. One way to circumvent this problem is to look at hundreds or thousands of loci simultaneously, e.g. using two-dimensional protein gel electrophoresis-based methods. Classical phenotypic scoring of visible markers or mutant lethals, however, can often be extremely labor intensive, requiring millions of screened cells or samples, e.g., 2.8 million mice, from multiple labs across the US, were screened for seven visible markers in the specific locus test.
The large sample size needed in phenotype-based assays can be circumvented through mutation accumulation experiments. These assays tested for a reduction in fitness, or for a visible phenotypic change, after an inbred population accumulated mutations over many generations. By measuring the fitness of Drosophila at different generational time points, Mukai was able to provide an estimate for the deleterious (but non-lethal) mutation rate. The validity of data obtained using this method has recently been questioned due to consistent overestimation of the genome-wide rate when compared with other assays. A dependence on deleterious mutations of large effect has limited its usefulness and applications. Also, variation in the selective effect of deleterious mutations is not accounted for and could be the reason for overestimation of the mutation rate.
Prior to the introduction of DNA sequencing, mutation research was limited to estimating mutation rates and mapping new mutations through linkage analysis, i.e., by tracking the co-segregation of phenotypic marker loci. The emergence of assays to directly analyze DNA sequence variation enabled investigators to identify mutations at the molecular level and explore their mechanisms of action. The first DNA-based assays screened for mutations at restriction sites that resulted in a restriction fragment length polymorphism (RFLP). The development of nucleotide sequencing methods subsequently permitted the analysis of small sections of genomic DNA for allelic variants after cloning[20, 21]. However, the high cost of sequencing led to the development of alternative assays to scan DNA fragments for sequence variants. In the 1980s multiple techniques were developed that screened samples for single base variants at specific, PCR-amplified loci[22, 23]. Examples include denaturing gradient gel electrophoresis (DGGE) and temperature gradient gel electrophoresis (TGGE), which are based on the exquisite sensitivity of DNA denaturation for sequence variants; two 500-bp fragments of similar size, differing in only one base pair melt at different temperatures and can be separated by gel electrophoresis in a gradient of chemical denaturants or temperature[24–26]. The sensitivity of these assays can be greatly increased by first allowing a mixture of mutant and wildtype fragments to denature and then slowly reanneal. The subsequent heteroduplex fragments are then easily distinguished by denaturing gradient gel separation. To increase efficiency, denaturing gradient assays can be run in a two-dimensional format allowing the detection of all possible sequence variants in the PCR-amplified coding and regulatory regions of large genes. Other frequently used formats of DNA melting assays for mutation detection are denaturing high performance liquid chromatography (DHPLC) and high-resolution melting curve analysis . These approaches are used as a preliminary screen for mutants, which can then be characterized using Sanger sequencing. The arrival of shotgun sequencing enabled labs to screen larger fractions of the genome for mutations. Several groups have sequenced specified regions from their mutation accumulation lines to identify germline mutations and estimate locus-specific base-pair substitution rates[30, 31].
Meanwhile, array-based methods, such as comparative genomic hybridization (aCGH), allowed the identification of large germline variation, such as copy number variation. While the resolution of these assays has recently been improved to 500bp with Nimblegen’s 4.2M arrays, their sensitivity and accuracy remain poor for events smaller than 10kb. Still, they are much more sensitive than cytogenetic assays, based on fluorescence in situ hybridization (FISH), which have a resolution of no more than 10Mb, but offer the advantage that low-abundant chromosomal mutations can be detected in single cells. More recently, single cell assays have also emerged for aCGH based on whole genome amplification[35–37]. However, while useful in their own right, aCGH and FISH are incapable of detecting small mutations, which comprise the majority of the mutational landscape.
A connection between somatic mutations and human disease was first proposed by Carl Nordling who proposed that cancer is the result of accumulated mutations to a cell’s DNA[38, 39]. Around the same time, Leo Szilard attempted to explain the nature of aging through a somatic mutation framework. Szilard postulated that aging is caused by the accumulation of damaged chromosomes or genes in somatic tissue, realizing that ‘when a chromosome suffers an aging hit, the cell will cease to be functional if the homologous chromosome has either previously suffered an aging hit or if it carries a fault’ . Almost two decades later, Knudson, based on these ideas, proposed his two-hit hypothesis, i.e., hereditary cancer is caused by the inheritance of a mutant allele in the germinal cells and acquisition of a mutation in the normal allele in a somatic cell; in nonhereditary cancers both mutations occur in somatic cells. As Szilard predicted, the frequency of somatic mutations has now been demonstrated to increase with age, but their relationship with aging and disease remains unclear. However, somatic mutations are now considered to be the driving force behind the majority of cancers as well as a causal factor in some non-neoplastic human diseases, such as neurofibromatosis 1 and 2.
The discovery of the role played by somatic mutations in cancer initiation led to the emergence of assays to measure somatic mutation rates and to test the mutagenicity of different chemical agents and compounds in human cell lines and mice. For this purpose, cytogenetic methods were initially used, which proved capable of detecting the mutagenic effect of clastogens. For smaller mutations bacterial systems, such as the Ames test, and endogenous and transgenic reporter-gene based mutation assays in mammals have dominated the field.
The endogenous, X-linked hypoxanthine-guanine phosphoribosyltransferase (HPRT) assay was one of the first reporter systems available and remains the most widely used assay for the analysis of somatic mutation frequencies in humans[45, 46]. A similar system in mice, based on a heterozygous mutation in the (autosomal) Aprt gene, also allows mutation analysis in vivo by screening for loss of the remaining wild-type allele in somatic cells. These assays, however, can only be performed on cells that can actively proliferate in culture. The need for an in vivo assay that can be applied to all organs and tissues led to the development of transgenic reporter based methods. Mice harboring a LacZ or LacI reporter transgene emerged in the late 80s [48–50]. These reporter genes are part of a lambda or plasmid construct that can be recovered from genomic DNA and tested in E. coli for mutational inactivation of the reporter gene. Since then, a plethora of data has been published on somatic mutation frequencies and spectra in these animals, both induced mutations after exposure to a variety of mutagenic agents and spontaneous mutations accumulating during the normal aging process[51–53]. Using transgenic reporter genes, it has been shown that mutation frequencies in most tissues, especially actively proliferating tissues, such as the epithelia in the small intestine, consistently show an age-dependent increase (Fig. 1).
While reporter-based assays have provided us with information about the frequency and spectra of mutations in human lymphocytes and varioust organs and tissues of aging animals, they are limited to a single locus and so might not be representative of genome-wide events. Additionally, their reliance on a robust phenotypic change, i.e. a dysfunctional protein product, ignores slightly deleterious mutations and therefore leads to a gross underestimation of the actual mutation frequency at the reporter locus. But the most significant drawback of transgenic reporter gene-based methods is the fact that they are restricted to model organisms and therefore unable to replace the HPRT assay for studying mutational processes in humans.
The main difference between massively parallel sequencing (MPS) and the classical Sanger sequencing is that in the former each target is sequenced multiple times as a series of overlapping short sequencing ‘reads’. To do this a DNA sample is first randomly fragmented, for example, by sonication, into fragments that can vary from 200 to 500 basepairs (bp). These fragments are then sequenced, millions at a time, in a random fashion. The resulting short sequence reads are aligned to reference sequences (when available), and consensus base calls are made. The emergence of MPS, which allows for the sequencing of 500Gb (500 billion) nucleotides of DNA sequence in a week on a single machine, has opened the door for projects previously thought unfeasible. This is reflected in the dramatically lower cost of sequencing. Figure 2 shows the cost per basepair since 1971, when the first DNA molecule was sequenced. Initially, improvements in Sanger sequencing drove this development but with the introduction of the first ‘next-generation’ sequencing machine, the Roche 454 in 2005, the costs have come down by orders of magnitude. During the last 10 years, Genbank has grown from 5Gb to over 285Gb. More impressively, a new database dedicated to the open access of short-read sequencing data, the Short Read Archive (SRA), has grown to contain over 60 terabases(1012) of sequence data.
The importance of this development for the future of genetics and medicine is well recognized. The new technologies have been used to sequence hundreds of human genomes, novel organisms, and metagenomes (genomes recovered from environmental samples). Multiple human cancer genomes have been interrogated leading to the discovery of new oncogenes and tumor suppressor genes. With most of the focus on applications directed to cancer genomics and personalized medicine, the potential for MPS (or next-generation sequencing, NGS, as it is often called) to revolutionize mutation research and lead to more sensitive assays to test for mutagen exposure and genome instability has received less attention.
Here, we will address the impact of this new technology on mutation research (Table 2) per se, i.e., the study of the natural tendency of genomes to be unstable and its consequences with respect to evolution and as a cause of disease, aging and death.
There are currently three routinely used platforms for MPS: the Roche 454 system, the Illumina HiSeq, and the SOLiD system of Applied Biosystems. A fourth, the PacBio RS single molecule sequencing system of Pacific Biosciences, was introduced fairly recently and will be briefly discussed later. The principles behind these four systems are schematically depicted in Figure 3. Additional platforms exist, such as the Polonator, the Helicos Single Molecule Sequencer, and an in-house only nanoarray-based sequencing-by-ligation technology developed by Complete Genomics. These latter systems remain quite limited in their use and will not be discussed.
The Roche 454 system employs a “sequencing-by-synthesis” strategy based on pyrosequencing, which detects pyrophosphate molecules as they are cleaved during nucleotide incorporation. Although the system was the first massively parallel technology released, a high error rate at homopolymer sites and a low throughput compared to the two newer platforms has narrowed its applications. It is now primarily used for sequencing metagenomes, e.g., human microbial communities, and for gap filling in de novo sequencing projects, both of which benefit from its long read length (500bp).
The SOLiD sequencer and the Illumina HiSeq, an updated version of Illumina’s earlier Genome Analyzer, are competing technologies that use different sequencing strategies. In the HiSeq, a library of double-stranded adapter-ligated template molecules between 300 and 600bp in size, constructed from fragmented nucleic acids, is flowed across a hollow glass slide coated on the inside with polyacrylamide to which forward and reverse primers are attached. The adapter-ligated template DNA hybridizes to the primers and is copied onto the flow-cell surface by extension of the flow-cell primer to which it is hybridized. These newly synthesized strands serve as templates for an isothermal amplification reaction, resulting in clusters of amplified strands. One strand is selectively removed before a sequencing primer is hybridized. Sequencing begins using reversible fluorescent terminator dNTPs. Each DNA strand within a cluster incorporates the same, single nucleotide during each chemistry cycle. At the end of every cycle, the clusters are imaged, before the blocking groups and fluorophores on the newly incorporated nucleotides are removed by chemical cleavage and the next round of nucleotide incorporation begins. Four images are the output, one for each fluorophore. A base and associated base quality score (which is estimated using the background fluorophore levels at each cluster on logarithmically linked to error probabilities, similar to Phred quality scores used in Sanger sequencing) are called for each cluster using built-in analysis software. By sequentially sequencing from both ends of a DNA fragment, it is possible to obtain so-called paired end sequences of up to 150 bases each. This entire process (Fig. 3) is described in detail on the Illumina website and in some excellent reviews[65, 66]. The output format for the sequencing files is FASTQ, which contains information on the cluster location, the sequence of called bases, and the associated base-quality scores (Fig. 4A).
In the SOLiD machine, adapter-ligated template molecules are individually captured on a bead and subjected to PCR. This produces an emulsion of beads each containing thousands of copies of an identical template molecule. The beads flow across a glass slide and are chemically cross-linked to its surface. Instead of sequencing by synthesis, the SOLiD platform exploits so-called 2-base encoding, utilizing a ligation-mediated sequencing reaction. Basically, hexamer probes, which contain a 5′ di-base specific label, are added to the slide during each cycle and are ligated to the 3′ end of a sequencing primer hybridized to the template molecule. Each di-base sequence is matched to a fluorophore that is detected at each subsequent ligation step. Following fluorescence detection, the 3′ base of the incorporated hexamer is removed and the entire process is repeated with a net extension of five bases per cycle. After 10 cycles, the entire synthesized strand is removed and the process is repeated beginning with an “n-1” sequencing primer, which is offset by a single base. The SOLiD machine currently has read lengths of 50bp and has kits designed for single-end reads and paired-end reads for fragments between 600 bp and 10 kbp (the so called mate-pair approach). The SOLiD machine outputs color space data, where a nucleotide is called based on the sequence of two emitted flurophores instead of a single fluorophore. The output format is similar to the FASTQ format in that it contains associated quality scores. However, instead of presenting a sequence of called bases, a string of the numbers 0, 1, 2, and 3 is used, which represent the four different emitted fluorophores. The string of numbers can then be converted into a nucleotide sequence (Fig. 3).
The shift from Sanger sequencing to high-throughput technologies has required new computational approaches to deal with the considerably larger data sets. The first challenge is to match the sequence reads to a reference sequence of the species under study. This so-called sequence alignment profits from the many consensus genome sequences that are now available for a large number of animal and plant species. To process the millions of reads produced by the SOLiD and Illumina machines, alignment algorithms (Table 3) have been optimized for speed and memory usage. Additionally, because the major application of these platforms lies in genome re-sequencing rather than de novo sequencing, the alignment algorithms have been designed for low divergence rates. Expectations for the average number of mismatches between the short read and the reference sequence are driven by the species polymorphism rate and the platform error rate instead of evolutionary divergence. These assumptions have allowed for faster alignments of considerably larger datasets without a large increase in the required computational resources.
The two major alignment algorithms used are hash table–based algorithms (BLAST, MAQ, Eland) and Burrows Wheeler transform (BWT)-based methods (SOAP2, BWA, Bowtie). Hash table-based algorithms use an indexing scheme that enables ultra-fast searches for short sequences of a defined length k (k-mer matches or seeds) with up to m mismatches. These seeds are then extended to find the optimal hit. In other words, a small piece of the target sequence or read (k-mer) is aligned to the reference and if a match is found the small piece is extended to see if the whole read matches. Unlike BLAST, which seeds alignments for consecutive matches, the algorithms used by Eland and MAQ utilize spaced seeds (Fig. 4B) to improve sensitivity around polymorphisms and single-base errors.
BWT-based methods also use a hash table, but apply a Burrows Wheeler transform (Fig. 4C), which helps build a more efficient index of either the reference sequence or the sequencing reads, leading to a reduction in the memory-footprint. Currently the gold standard for short-read alignments is BWA, a BWT-based method that is both accurate and fast. BWA produces aligned sequences in the SAM/BAM format, which contains all of the information needed by downstream variant calling programs (Fig. 4D).
Ideally, following an alignment one could easily determine mismatches between the genome or genomic region under study and the reference sequence by ‘calling’ the correct base or set of bases at each position in the genomic target. This would allow one to detect single nucleotide variants and small deletions or insertions (InDels). Unfortunately, locus specific differences in sequencing depth, mapping quality scores (a measure of the probability that the read is correctly aligned) and allelic imbalance (where one allele makes up a greater fraction of reads than the second allele), as well as a high frequency of sequencing errors (i.e., in the range of 0.1–1%), necessitate variant calling algorithms that normalize the sequencing data and eliminate sources of error and bias. Although there are many homebrew programs available, developed in bioinformatics groups all over the world, two SNP calling pipelines have dominated modern genotyping (Table 3): the recently published Genome Analysis Tool Kit (GATK), which was used to analyze most of the data from the 1000 Genomes Project, and SAMtools. GATK is a comprehensive package that contains tools for working with aligned BAM files. Its main advantages over other software tools lie in its ability to recalibrate base quality scores, which helps to lower the false positive rate of SNP calling by lowering the base quality scores of specific dinucleotides (and other variables such as homopolymer tracts) that are associated with higher error rates, and its ability to identify small intra-read insertions and deletions (responsible for misaligned reads) and realign the reads at these loci using indel-friendly parameters (the alignment algorithm’s penalties for insertions and deletions will be reduced) leading to cleaner consensus calls (Fig. 4E).
Analysis of structural variation can be performed using paired-end sequencing data from either the Illumina or SOLiD platform. Basically, from the sequencing library of randomly fragmented DNA a particular size class is selected, for example, all fragments of 500 bp ±10bp, and sequenced from both ends. The two sequenced ends should now align to the reference sequence within the 500 bp range. But when, for example, one of the paired reads maps to another chromosome this is taken as evidence for an interchromosomal rearrangement (Fig. 5). Similarly, when the two end sequences align too far out this is evidence for a deletion (Fig. 5) with the opposite situation indicative for an insertion. The read lengths vary between 50 and 150bp depending on the platform and kit used. Alternatively, a mate-paired approach can be used where larger fragments (between 2kb and 10kb) are circularized after their ends have been labeled with biotinylated nucleotides. Following fragmentation of the circularized molecules, the fragments containing biotin groups, representing the circularized ends of the original fragments, are captured on streptavidin magnetic beads. The eluted fragments are sequenced using the standard paired-end module producing paired reads with an inverted orientation. The mate-pair approach produces a considerably higher mapping coverage of the genome from the same number of reads and provides a cost-effective approach for identifying large structural variants genome-wide. Recently two groups used this approach to identify rearrangements in tumor samples[74, 75].
Similar to calling single nucleotide variants and InDels, analyzing structural variations is prone to artifacts. A common artifact involves chimeras between two fragments in the library that can be miscalled as genome rearrangements. Chimeras are thought to originate from the ligation reaction during the library preparation that attaches sequencing adapters to all DNA fragments. While the nature of the adapters should prevent such self-ligation, it apparently happens at low frequency. To address this problem one approach uses a series of stringent gel-based size-selected fragments both before and after the ligation reaction. Any chimeras that are produced will be double the size of the selected fragment range and thus will not be retained after the second size selection. Another source of false positive variant calls are incorrectly mapped paired-end reads resulting from multiple single nucleotide errors in a single read. To filter out these artifacts and produce high-quality variant calls, most programs require a cluster of read-pairs supporting any aberration.
A comprehensive and proven package for the detection and characterization of structural variations is yet to be released. The spectrum of structural variation has made it difficult to design a single program that can accurately call all types of structural variation. Instead, three distinct types of calling algorithms have been developed (Table 3) that together, are able to capture the full spectrum of these events.
The most basic algorithm is that applied by Breakdancer, Breakway, SVDetect, BreakSeq and GASV. These programs use mapping distance data provided through the paired-end alignment statistics to estimate the average fragment size of the library. Clusters of aligned reads that appear to be mapping at a distance that is more than three standard deviations away from the average are identified and called as structural variants. The stringency of this algorithm depends on a number of key parameters: the minimum number of reads supporting a cluster (normally >4), the frequency of the event at the locus (commonly an underestimate due to reads that span the breakpoint and thus are not aligned), and the map quality score, which is a measure of the probability that the paired read is correctly aligned. By narrowing the size distribution of fragments that are selected during the library preparation, it is possible to call smaller insertions and deletions. Events that are missed using these algorithms include deletions and insertions smaller than 100bp, insertions larger than 400bp, and some segmental duplications (CNVs). To correctly identify these events, two additional algorithms must be implemented. The first uses an approach similar to that used for the analysis of array-based Comparative Genomic Hybridizatin (aCGH) data. Bins of a width defined by the user are constructed to span the genome, and the number of aligned reads in each bin is recorded. Bins that have an average read depth that is significantly different from the norm can be investigated manually using a genome visualizer and can be validated using aCGH or qPCR. This approach can be used to look for both insertions and deletions and can be used in parallel with a program like Breakdancer to provide additional support for these events.
For the investigation of large insertions, so called orphan reads (only one of the two read pairs map) are extracted from the alignment data and used as input in a local de novo assembly using the assembly software ABySS or Velvet. The assembled fragments are aligned to the reference genome and the alignments are parsed for contigs that provide evidence of insertional breakpoints. An open-source pipeline called SVMerge was recently released that uses the output from the various classes of variant calling algorithms in order to filter and classify a complete set of structural variants. The pipeline runs a de novo local assembly using reads that align at the genome coordinates associated with each identified structural variant. This provides an additional validation step to filter out false positives and identify exact breakpoints.
Until recently, our knowledge of the rate and spectra of de novo mutations was limited to studies in a number of reporter genes. The reduced cost and higher throughput of MPS has led to the first genome-wide estimates for germline mutation rates and spectra in yeast, worms, plants, and flies. The experiments, run on both 454 and Illumina machines, used accumulation lines, where inbred populations accumulated mutations over many generations. A wide range of generational time points were used, with an average of 4800 generations from the founder for S. cerevisae, 300 generations for C. elegans, 30 generations for A. thaliana, and 262 generations for D. melanogaster. The criterion for calling germline mutations in any one of the accumulation lines was that the variant must not be present in the composite control, representing the consensus sequence for all other accumulation lines and/or the founder line. Using data on the number of base-substitutions in each line and the number of generations from the founder, a mutation rate was calculated for each species (Table 1). Missing from all four experiments was comprehensive data on structural variants. Although two of the four groups presented data on InDels, the absence of paired-end sequencing data prevented the groups from analyzing large deletions, insertions and segmental-duplications. Decreasing sequencing costs should allow future studies to be performed using shorter generational time points.
With the completion of the draft sequence of the human genome, the human genetics community has turned to analyzing human genetic variation. The HapMap project, which was started in 2003, has genotyped four million Single Nucleotide Polymorphisms (SNPs) in 1301 individuals from eleven populations distributed across the world. The data has provided an abundance of information on common SNPs and their association with human disease, but many variants, including disease-causing variants, that occur at a low allele frequency in the population have been missed. The emergence of next-generation sequencing technologies led to a proposal to sequence the entire genomes of 1000 individuals of different ethnicity. This would provide investigators conducting genome-wide association studies (GWAS) with all variants of at least 1% minor allele frequency in the disease-associated regions. In addition, it allows imputation of many millions of variants identified in the 1000 genomes in the GWAS studies, based on shared haplotype stretches[91, 92]. The project data is available through the NIH and European Bioinformatics Institute (EBI) websites and is updated in real time. So far, data for more than 1100 individuals has been released, with an additional 1400 genomes planned for 2011 (http://www.1000genomes.org/).
Two family trios were sequenced as part of the project. Working with a combination of calling algorithms, de novo mutations arising in the parental gametes were identified. Based on an initial validation using targeted resequencing, 1001 and 669 germline de novo mutations were scored in the offspring from the two families. The majority of these “de novo” mutations represented lymphoblastoid cell-line specific somatic mutations resulting from clonal selection during passaging. Therefore, a second validation was performed using primary DNA sources and by testing for segregation to offspring. This resulted in 35 and 45 “de novo” mutations for the two trios, respectively. Mutation rates were estimated by correcting for the false negative rate of mutation discovery from the three calling algorithms and the false negative rate in the validation experiments (Table 1). Structural variant analysis did not reveal any confident calls in either trio; only deletions were identified with high sensitivity. It is likely that with improved calling algorithms and higher sequencing coverage, the full spectrum of events will be uncovered.
During the last two decades of the 20th century, advances in cancer genetics were hindered by the limited resolution of available techniques and by an incomplete map of the human genome. With both limitations now overcome, the full spectrum of somatic mutations in cancer genomes has begun to emerge. Since 2002, the number of recognized cancer genes has grown from 115 to 457. Multiple international efforts, including the Cancer Genomes Project, the Cancer Genome Atlas and the International Cancer Genome Consortium, have set out to uncover new driver mutations and survey the association between mutated genes and drug efficacy.
Traditionally, genome-wide discovery efforts were performed using FISH or array-based methods, leading to the identification of translocations, for example, those common to Chronic Myelogenous Leukemia (CML) and Burkitt’s lymphoma[98, 99]. These assays were limited to analyzing large translocations or copy number changes. In an attempt to identify both somatic and germline mutations in protein coding genes, and thereby discover new oncogenes and tumor suppressor genes, genome resequencing projects have targeted the exome, the transcriptome and even the complete genomes of tumor and matched normal tissues. Although some of the early data was generated using capillary sequencing, the majority of cancer sequencing projects have relied on MPS.
The analysis of these MPS data sets has posed unique challenges. A cancer is a mass of cells that has gone through multiple rounds of selection and clonal expansion[103, 104] and has both heterogeneous and homogeneous components to its mutation profile. Arising from a single cell, with its own spectrum of unique somatic mutations that accumulated over the more than 50 rounds of cell division since fertilization of the egg, a million or more cells arise that share the same set of clonally amplified somatic mutations (both drivers and passengers). In parallel, many low-abundance mutations arise as a consequence of what has been termed a mutator phenotype, resulting from mutations in genes involved in genome maintenance. This creates a large degree of mutational heterogeneity within the tumor. Identification of the clonally amplified mutations is akin to looking for germline events, but investigating the low-abundant random somatic mutations requires new strategies and approaches. Due to this heterogeneity, and the frequent contamination of tumor samples with adjacent normal tissue, allele frequencies do not always fall within standard genotyping windows, i.e. 0.0, 0.5, 1.0. Therefore, standard SNP-calling algorithms have been modified to allow for the full spectrum of genotyping calls. SomaticSniper and VarScan use aligned data from both the tumor sample and matched control sample as input to identify low frequency events specific to the tumor. The combination of these stringent algorithms and the higher coverage available through high-throughput technologies has enabled the identification of mutant alleles present in tumor samples in proportions as low as 0.2% . Translocations and copy-number changes have also been investigated using paired-end sequencing coupled with the aforementioned structural variation callers, e.g., Breakdancer, GASV, BreakSeq. Following the identification of somatic variants in genome-wide studies, investigators have performed large scale resequencing projects to determine whether the gene is mutated at a statistically significant frequency in the cancer subtype, and if so, what the underlying biological mechanism might be [110–113].
The falling costs of high-throughput sequencing, and improvements in statistical methods and data analysis tools, have opened the door to additional medical applications. The Cancer Translation Project was created within the Cancer Genomes Project to investigate the genomics of drug sensitivity in cancer. The first results from the project were released in 2010 and provide evidence of strong correlations between gene-specific mutations and drug responses.
High-throughput sequencing is also being used as a tool to monitor disease progression through the identification of cancer-specific rearrangements[74, 75]. After discovering these rearrangements through whole-genome sequencing, simple PCR assays were designed to target the rearrangement signatures in circulating tumor cells in plasma (CTCs) and thereby track disease progression and response to treatments. A mate-paired sequencing approach was used in one of these studies, with a cost of only $2000 per patient. The ability to track the disease in real-time and monitor the effectiveness of different treatments represents a potentially invaluable tool for clinicians that could reduce both the cost and treatment time associated with ineffective chemotherapy drugs.
Heterogeneity among cells in their mutation spectra is not limited to tumors. Due to the inevitability of errors in the processing of DNA damage, mutations accumulate from the zygote through development, adulthood and aging. Life style factors, environmental exposure and the quality of one’s genome maintenance systems determine the rate and severity of this process of somatic mutation accumulation. The stochasticity of mutagenesis with its many low-abundance or even unique mutations has necessitated the use of reporter assays in mutation research. As we have seen, MPS has now essentially broken through the high-cost barrier that thus far essentially constrained whole genome sequencing to identify mutational differences between tissues. However, is it also capable of accurately determining how individual cells in a cell population genetically diverge over time or during cell culture?
Using MPS it is now possible to sequence an entire mammalian genome to find mutational variants present in all or most of the cells of a tissue. This readily allows the determination of mutation rates and spectra in germlines and clonally derived tissues, such as tumors. However, low-abundant, somatic mutations remain a major challenge for two reasons. First, they require a significantly high coverage to identify them. And, second, the nature of MPS, which is based on a consensus model, tends to discard low-abundant variants as potential artifacts.
Due to sequencing errors produced by all contemporary technologies, the identification of somatic variants that are present in a single copy, or a few copies (if clonally amplified), poses many problems. Illumina sequencing runs consistently display a base-pair error rate of 0.05 to 1%. The di-base encoding used by the SOLiD machine is able to lower this error rate to about 0.075% (SOLiD application note). These error rates constitute a background that is several orders of magnitude higher than the recorded somatic base-pair mutation rate and therefore low-abundance somatic events are effectively masked.
Similarly, the detection of random somatic rearrangements by paired-end sequencing is limited by the generation of chimeric sequences, i.e., ligation of two genomic sequences to each other, during the library preparation. Normally, the DNA sample is subjected to random fragmentation, after which the DNA fragments are end-polished and appended with an A-overhang, which promotes preferential annealing with the T-overhang-containing sequencing adapters and precludes cross-ligation. However, as already discussed above, cross-ligation does occur at a very low rate. Such artifacts are not a problem in consensus-based procedures, since unique rearrangement events are discarded, but for somatic mutation research they will lead to high false positive rates.
There are essentially two types of approach to overcome these problems. First, one could try to decrease error rates, either experimentally or through the application of in silico filters. For example, polymerase errors that arise during template preparation (during the pre-amplification step required for library preparation) or the solid-phase amplification on the instrument can be reduced by using high-fidelity enzymes, such as the phusion DNA polymerase. Also machine errors can be expected to decrease further in the future by advances in chemistry and fluidics. As mentioned above, to reduce false positive genome rearrangement calls in the form of chimeras between two unrelated fragments it is possible to apply a more stringent size selection. Ultimately, the solution to sequencing-related errors is single molecule sequencing; multiple passes of the same molecule will quickly eliminate random errors. Such a system is now available in the form of PacBio’s single molecule, real-time sequencing technology, but is as yet immature for this purpose (see below). To eliminate or greatly reduce errors it is also possible to apply sophisticated algorithms that filter out errors. This can only be done with errors that exhibit some consistent patterns related to the library preparation and sequencing steps and not feasible for errors that appear at random. Interestingly, a recent approach to the identification of rare variants utilized the PCR amplification step in the (Illumina) library preparation to identify families of templates carrying the same mutation, which then could be inferred to have been present in the original template molecule as a real mutation
A second approach for avoiding sequencing errors as confounders in the identification of low-abundant mutations is to use single cells. To sequence the genomes of single cells rather than mixtures of genomes from whole cell populations or tissues is critically important in its own right. However, it also allows one to circumvent the problem of sequencing errors by adopting the consensus model after single-cell, whole-genome amplification (WGA). Multiple protocols exist for whole genome amplification[118, 119], the most common of which is Multiple Displacement Amplification (MDA), which uses a high-fidelity isothermal polymerase (phi29) to generate up to 20μg of DNA product after starting from a single cell. Working with single cells, it is possible to follow the consensus model in next-generation sequencing in which mutations can be called in the single cell on the basis of their occurrence in multiple reads (Fig. 6A), not unlike current procedures for identifying mutations in tumors. In our lab, we use an MDA-based protocol where single cells are subjected to whole genome amplification followed by paired-end sequencing. An unamplified sample, representing the population as a whole, is also sequenced. After the alignment to a reference sequence, a three-way comparison is made between the reference sequence, the unamplified sample and the amplified single cells. Both germline and somatic mutations can be recorded(Fig. 6B,C).
The success of any single cell genomics approach is heavily dependent on the development of accurate variant calling algorithms that can correct for the vast differences in coverage, allele dropout, and locus dropout that are produced by whole genome amplification procedures. Multiple aCGH experiments[36, 37], and recently a MPS experiment, have used single-cell approaches to profile CNVs in individual human normal and tumor cells, but base-pair resolution genotyping of single eukaryotic cells has not yet been shown. Massively parallel sequencing of single-cells has the potential to revolutionize reproductive medicine, cancer research, developmental biology and aging research.
Almost immediately following the release of the Illumina and SOLiD platforms, reports appeared regarding third-generation sequencing. Pacific Biosciences’ PacBio RS detects fluorescently labeled nucleotides in real time as they are incorporated during a second-strand synthesis (Fig. 3)[63, 120]. Oxford Nanopore’s sequencing platform uses an exonuclease cleavage reaction and a protein nanopore to sequence individual cleaved bases by a unique electrical signature as they are transported through the pore.
The Pacific Biosciences instrument was made available to early access customers in September 2010. The instrument has a quicker turn-around time (approximately 100MB of sequence is produced during a 90 minute run), and longer read lengths (average of 2000bp, with 5% exceeding 5000bp) than the Illumina and SOLiD platforms, but the throughput is only 1 Gb/day and the error rate is approximately 20%. To compensate for the high per base error rate, the sample preparation, which is similar to the Illumina protocol but uses hairpin adapters, was designed to allow for multiple sequencing rounds across the same site in the template molecule, lowering the error rate to 4% with two passes, 0.8% with three passes and 0.16% with 4 passes. When designing experiments with this platform, a trade-off must be made between sequencing 2000bp fragments at 80% accuracy or 500bp fragments at 99.8% accuracy
The limiting factor in the length of PacBio RS sequencing reads is the half-life of the polymerase, which is damaged by the laser-induced photochemistry used in base calling. An alternative protocol, termed “strobe sequencing” has been revealed by the company that would allow for gapped sequences covering up to 10kb. Results from a simulation showed that using strobe reads instead of classical paired-end reads for calling inversions and deletions larger than 120bp improved the sensitivity while reducing the false positive rate by more than 50%. Because the experiment was run under the assumption that the base-pair error rate was 5%, instead of 20%, it will be necessary to validate the results by repeating the experiment.
The SOLiD and Illumina platforms, which can produce up to 75GB a day, will remain the first choice for the majority of applications. For genome-wide mutation detection in mammalian species, high throughput and high accuracy is a prerequisite. Although the PacBio RS instrument could help identify germline structural variants as a supplement to a whole genome run with the Illumina or SOLiD, it does not have a high enough throughput to be used for the detection of random somatic variants.
Oxford Nanopore has two sequencing strategies in development, both of which will likely enable almost unlimited read lengths. The technology closest to the market is exonuclease-based and will be marketed, sold, distributed, and serviced by Illumina. The alternative technology, termed ‘strand sequencing’  may be capable of rereading the same strand multiple times. It remains to be seen what the throughput and error rate of the instruments will be, but due to a lower reagent cost and the absence of a library preparation, the cost of the assay will most likely be significantly lower than current technologies.
The study of genetic variation has progressed rapidly over the last twenty years and because of recent advances in sequencing technology, the rate of discovery is faster than ever. Classical tools for the detection and characterization of germline and clonally amplified somatic variants are being replaced by a new array of massively parallel sequencing based methods. The characterization of genome-wide genetic variants in individual humans has become an important tool in the diagnosis of congenital malformations[125, 126], especially with recent advances in non-invasive prenatal genetic testing, and has led to the discovery of novel genes and pathways associated with human disease[128, 129]. Applications in cancer research and clinical oncology are even more pertinent, where MPS has already impacted oncogene discovery, clinical diagnosis, pharmacogenomics, and the monitoring of disease progression. As we enter the age of personalized genomics there remain gaps in our knowledge, none greater than a lack of understanding of the rate and spectra of random somatic variation in our tissues. The accumulation of random alterations in the genome sequence of our cells might have profound functional consequences that have been ignored because of a focus on the average (or consensus) level of gene and protein expression in our tissues. Although there are limitations with the technology as it stands, advances in single-cell genomics techniques and the development of methods to lower the intrinsic error rates of MPS technology and filter out false positives will lead to increasingly sensitive and specific assays for measuring low-abundant somatic mutations. Additionally, single-molecule sequencing technologies have begun to emerge that may have lower intrinsic error rates than the current platforms, which would not interfere with the detection of low-abundant somatic variants.
This work was supported by AG034421, ES019520 and AG17242.
Conflict of interest statement
The authors declare that there are no conflicts of interest.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.