The main difference between massively parallel sequencing (MPS) and classical Sanger sequencing is that in the former each target is sequenced multiple times as a series of overlapping short sequencing ‘reads’. To do this, a DNA sample is first randomly fragmented, for example by sonication, into fragments that can vary from 200 to 500 basepairs (bp). These fragments are then sequenced, millions at a time, in a random fashion. The resulting short sequence reads are aligned to reference sequences (when available), and consensus base calls are made. The emergence of MPS, which allows for the sequencing of 500 Gb (500 billion nucleotides) of DNA sequence in a week on a single machine, has opened the door for projects previously thought unfeasible. This is reflected in the dramatically lower cost of sequencing. Figure 2 shows the cost per basepair since 1971, when the first DNA molecule was sequenced[55]. Initially, improvements in Sanger sequencing drove this development, but with the introduction of the first ‘next-generation’ sequencing machine, the Roche 454, in 2005, costs have come down by orders of magnitude. During the last 10 years, GenBank has grown from 5 Gb to over 285 Gb[56]. More impressively, a new database dedicated to the open access of short-read sequencing data, the Short Read Archive (SRA), has grown to contain over 60 terabases (10^12 bases) of sequence data[57].
Figure 2. Sequencing cost per megabase since 1971. The sequencing cost per megabase has decreased rapidly since the first 12 bp were sequenced in 1971. The cost is displayed on a logarithmic scale, with key events in the history of sequencing indicated.
The importance of this development for the future of genetics and medicine is well recognized. The new technologies have been used to sequence hundreds of human genomes, novel organisms, and metagenomes (genomes recovered from environmental samples). Multiple human cancer genomes have been interrogated, leading to the discovery of new oncogenes and tumor suppressor genes. With most of the focus on applications directed at cancer genomics and personalized medicine, the potential for MPS (or next-generation sequencing, NGS, as it is often called) to revolutionize mutation research and lead to more sensitive assays for mutagen exposure and genome instability has received less attention.
Here, we will address the impact of this new technology on mutation research per se, i.e., the study of the natural tendency of genomes to be unstable and its consequences for evolution and as a cause of disease, aging, and death.
Advantages of using high-throughput sequencing for mutation discovery
There are currently three routinely used platforms for MPS: the Roche 454 system[58], the Illumina HiSeq[59], and the SOLiD system of Applied Biosystems[60]. A fourth, the PacBio RS single-molecule sequencing system of Pacific Biosciences[61], was introduced fairly recently and will be briefly discussed later. The principles behind these four systems are schematically depicted in Figure 3. Additional platforms exist, such as the Polonator[62], the Helicos Single Molecule Sequencer[63], and an in-house only nanoarray-based sequencing-by-ligation technology developed by Complete Genomics[64]. These latter systems remain quite limited in their use and will not be discussed.
Figure 3. Massively parallel sequencing technologies. A schematic showing sample preparation and sequencing technologies for the four major commercially available sequencers: the GS FLX Titanium by 454 (Roche), the HiSeq by Illumina, the SOLiD system by Life Technologies, and the PacBio RS by Pacific Biosciences.
The Roche 454 system employs a “sequencing-by-synthesis” strategy based on pyrosequencing, which detects pyrophosphate molecules as they are cleaved during nucleotide incorporation. Although the system was the first massively parallel technology released, a high error rate at homopolymer sites and a low throughput compared to the two newer platforms have narrowed its applications. It is now primarily used for sequencing metagenomes, e.g., human microbial communities, and for gap filling in de novo sequencing projects, both of which benefit from its long read length (500 bp).
The SOLiD sequencer and the Illumina HiSeq, an updated version of Illumina’s earlier Genome Analyzer, are competing technologies that use different sequencing strategies. In the HiSeq, a library of double-stranded adapter-ligated template molecules between 300 and 600 bp in size, constructed from fragmented nucleic acids, is flowed across a hollow glass slide coated on the inside with polyacrylamide to which forward and reverse primers are attached. The adapter-ligated template DNA hybridizes to the primers and is copied onto the flow-cell surface by extension of the flow-cell primer to which it is hybridized. These newly synthesized strands serve as templates for an isothermal amplification reaction, resulting in clusters of amplified strands. One strand is selectively removed before a sequencing primer is hybridized. Sequencing begins using reversible fluorescent terminator dNTPs. Each DNA strand within a cluster incorporates the same single nucleotide during each chemistry cycle. At the end of every cycle the clusters are imaged, before the blocking groups and fluorophores on the newly incorporated nucleotides are removed by chemical cleavage and the next round of nucleotide incorporation begins. Four images are the output, one for each fluorophore. A base and an associated base quality score (which is estimated from the background fluorophore levels at each cluster and is logarithmically linked to the error probability, similar to the Phred quality scores used in Sanger sequencing) are called for each cluster using built-in analysis software. By sequentially sequencing from both ends of a DNA fragment, it is possible to obtain so-called paired-end sequences of up to 150 bases each. This entire process (Figure 3) is described in detail on the Illumina website and in some excellent reviews[65]. The output format for the sequencing files is FASTQ, which contains information on the cluster location, the sequence of called bases, and the associated base-quality scores (Figure 4).
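To illustrate how these quality scores relate to error probabilities, the short Python sketch below parses a FASTQ record and converts each quality character into the probability that the base call is wrong. It assumes the now-standard ASCII offset of 33 (Phred+33; older Illumina pipelines used an offset of 64), and the record itself is invented for illustration.

    # Minimal FASTQ reader illustrating Phred-scaled base-quality scores.
    def read_fastq(handle):
        """Yield (identifier, sequence, quality string) tuples from a FASTQ stream."""
        while True:
            header = handle.readline().rstrip()
            if not header:
                return
            seq = handle.readline().rstrip()
            handle.readline()              # the '+' separator line
            qual = handle.readline().rstrip()
            yield header[1:], seq, qual

    def phred_to_error_probability(qual_char, offset=33):
        """Convert one quality character to the probability that the base call is wrong."""
        q = ord(qual_char) - offset        # Phred score Q
        return 10 ** (-q / 10.0)           # P(error) = 10^(-Q/10)

    if __name__ == "__main__":
        import io
        record = io.StringIO("@cluster_1\nACGT\n+\nII5!\n")  # hypothetical read
        for name, seq, qual in read_fastq(record):
            for base, q in zip(seq, qual):
                print(base, round(phred_to_error_probability(q), 5))

A quality character of 'I' (Q40) thus corresponds to an error probability of 1 in 10,000, whereas '!' (Q0) indicates an essentially uninformative base call.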
Figure 4. Bioinformatics formats and tools. a. FASTQ format, which uses four lines per read, is the preferred output format for MPS data. The example shown is Illumina FASTQ format, with the read identifier occupying the first and third lines, and the sequence and base-quality string on the second and fourth lines.
In the SOLiD machine, adapter-ligated template molecules are individually captured on a bead and subjected to PCR. This produces an emulsion of beads, each carrying thousands of copies of an identical template molecule. The beads are flowed across a glass slide and chemically cross-linked to its surface. Instead of sequencing by synthesis, the SOLiD platform exploits so-called 2-base encoding, utilizing a ligation-mediated sequencing reaction. Basically, hexamer probes, which carry a 5′ di-base-specific label, are added to the slide during each cycle and are ligated to the 3′ end of a sequencing primer hybridized to the template molecule. Each di-base sequence is matched to a fluorophore that is detected at each subsequent ligation step. Following fluorescence detection, the 3′ base of the incorporated hexamer is removed and the entire process is repeated, with a net extension of five bases per cycle. After 10 cycles, the entire synthesized strand is removed and the process is repeated beginning with an “n−1” sequencing primer, which is offset by a single base. The SOLiD machine currently has read lengths of 50 bp and offers kits designed for single-end reads and for paired-end reads from fragments between 600 bp and 10 kb (the so-called mate-pair approach). The SOLiD machine outputs color-space data, in which a nucleotide is called based on the sequence of two emitted fluorophores instead of a single fluorophore. The output format is similar to the FASTQ format in that it contains associated quality scores. However, instead of presenting a sequence of called bases, a string of the numbers 0, 1, 2, and 3 is used, representing the four different emitted fluorophores. The string of numbers can then be converted into a nucleotide sequence.
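The decoding step can be illustrated with a few lines of Python. The di-base-to-color table below follows the standard published SOLiD encoding (color 0 for identical adjacent bases, and so on); the example read and its anchor base are invented for illustration.

    # Decode a SOLiD color-space read into nucleotides.
    # Each color (0-3) encodes a transition between two adjacent bases, so decoding
    # needs a known first base (the last base of the sequencing primer).

    TRANSITION = {
        '0': {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T'},   # same base
        '1': {'A': 'C', 'C': 'A', 'G': 'T', 'T': 'G'},
        '2': {'A': 'G', 'G': 'A', 'C': 'T', 'T': 'C'},
        '3': {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'},
    }

    def decode_colorspace(first_base, colors):
        """Translate a string of colors ('0'-'3') into bases, given the anchor base."""
        bases = []
        current = first_base
        for color in colors:
            current = TRANSITION[color][current]
            bases.append(current)
        return ''.join(bases)

    # Hypothetical read: anchor base 'T' followed by five color calls.
    print(decode_colorspace('T', '32010'))

Because each base is defined by two overlapping colors, a single sequencing error changes all downstream base calls after naive decoding, which is why color-space reads are normally aligned in color space rather than decoded first.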
The shift from Sanger sequencing to high-throughput technologies has required new computational approaches to deal with the considerably larger data sets[67]. The first challenge is to match the sequence reads to a reference sequence of the species under study. This so-called sequence alignment profits from the many consensus genome sequences that are now available for a large number of animal and plant species. To process the millions of reads produced by the SOLiD and Illumina machines, alignment algorithms have been optimized for speed and memory usage. Additionally, because the major application of these platforms lies in genome re-sequencing rather than de novo sequencing, the alignment algorithms have been designed for low divergence rates. Expectations for the average number of mismatches between the short read and the reference sequence are driven by the species polymorphism rate and the platform error rate instead of evolutionary divergence. These assumptions have allowed for faster alignments of considerably larger datasets without a large increase in the required computational resources.
MPS data analysis toolkit for mutation research
The two major classes of alignment algorithms are hash table–based algorithms (BLAST, MAQ[68]) and Burrows-Wheeler transform (BWT)-based methods (SOAP2[69]). Hash table-based algorithms use an indexing scheme that enables ultra-fast searches for short sequences of a defined length k (k-mer matches or seeds) with up to m mismatches. These seeds are then extended to find the optimal hit. In other words, a small piece of the target sequence or read (a k-mer) is aligned to the reference and, if a match is found, the small piece is extended to see if the whole read matches. Unlike BLAST, which seeds alignments on consecutive matches, the algorithms used by Eland and MAQ utilize spaced seeds to improve sensitivity around polymorphisms and single-base errors.
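The seed-and-extend idea can be conveyed with a toy example: index every k-mer of a reference, look up the first k-mer of each read, and extend each candidate hit while counting mismatches. The k-mer size and mismatch ceiling below are arbitrary illustration values, not the defaults of any particular aligner.

    # Toy hash-table (seed-and-extend) aligner.
    from collections import defaultdict

    def build_kmer_index(reference, k):
        """Map every k-mer in the reference to the positions where it occurs."""
        index = defaultdict(list)
        for pos in range(len(reference) - k + 1):
            index[reference[pos:pos + k]].append(pos)
        return index

    def align_read(read, reference, index, k, max_mismatches=2):
        """Return (position, mismatches) for the best seed-anchored hit, or None."""
        best = None
        for pos in index.get(read[:k], []):        # seed: exact match of the first k bases
            window = reference[pos:pos + len(read)]
            if len(window) < len(read):
                continue
            mismatches = sum(a != b for a, b in zip(read, window))   # extend and count
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (pos, mismatches)
        return best

    reference = "ACGTTAGCCGTAGGCTTACGATCGATCGGA"   # invented reference fragment
    index = build_kmer_index(reference, k=8)
    print(align_read("CGTAGGCTAACG", reference, index, k=8))   # one mismatch allowed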
BWT-based methods also rely on an index, but apply a Burrows-Wheeler transform, which produces a much more compact index of either the reference sequence or the sequencing reads, leading to a reduction in the memory footprint. Currently the gold standard for short-read alignment is BWA, a BWT-based method that is both accurate and fast. BWA produces aligned sequences in the SAM/BAM format, which contains all of the information needed by downstream variant calling programs.
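The transform itself is easy to demonstrate. The naive rotation-sorting construction below merely shows what the BWT of a short string looks like; production aligners such as BWA build the transform via suffix arrays and query it through a compressed FM-index, which is far more memory-efficient.

    # Naive Burrows-Wheeler transform: sort all rotations of the text (terminated
    # with a unique '$' sentinel) and take the last column of the sorted matrix.
    def bwt(text):
        text += "$"                                   # end-of-string sentinel
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rotation[-1] for rotation in rotations)

    print(bwt("ACAACG"))   # short reference fragment, purely for illustration

The transformed string groups identical characters together, which is what makes it so amenable to compression and to the backward-search queries used during alignment.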
2.3. Single nucleotide variants/InDel calling
Ideally, following an alignment one could easily determine mismatches between the genome or genomic region under study and the reference sequence by ‘calling’ the correct base or set of bases at each position in the genomic target. This would allow one to detect single nucleotide variants and small deletions or insertions (InDels). Unfortunately, locus-specific differences in sequencing depth, mapping quality scores (a measure of the probability that the read is correctly aligned) and allelic imbalance (where one allele makes up a greater fraction of reads than the second allele), as well as a high frequency of sequencing errors (i.e., in the range of 0.1–1%), necessitate variant calling algorithms that normalize the sequencing data and eliminate sources of error and bias. Although there are many home-brewed programs available, developed in bioinformatics groups all over the world, two SNP calling pipelines have dominated modern genotyping: the recently published Genome Analysis Toolkit (GATK)[72], which was used to analyze most of the data from the 1000 Genomes Project, and SAMtools[73]. GATK is a comprehensive package that contains tools for working with aligned BAM files. Its main advantages over other software tools are twofold. First, it can recalibrate base quality scores, which lowers the false positive rate of SNP calling by reducing the base quality scores of specific dinucleotides (and other contexts, such as homopolymer tracts) that are associated with higher error rates. Second, it can identify small intra-read insertions and deletions (responsible for misaligned reads) and realign the reads at these loci using indel-friendly parameters (the alignment algorithm’s penalties for insertions and deletions are reduced), leading to cleaner consensus calls.
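To make concrete what a caller must weigh at a single position, the sketch below applies naive depth, base-quality, and allele-fraction filters to a hypothetical pileup. The thresholds are arbitrary placeholders; real callers such as GATK and SAMtools use full probabilistic genotype-likelihood models rather than hard cutoffs.

    # Naive single-position variant filter: count alleles from a pileup of
    # (base, Phred quality) observations and report a candidate SNV if enough
    # high-quality reads support a non-reference allele.
    from collections import Counter

    def call_snv(ref_base, pileup, min_depth=10, min_qual=20, min_allele_fraction=0.2):
        good = [base for base, qual in pileup if qual >= min_qual]
        if len(good) < min_depth:
            return None                               # too little usable coverage
        counts = Counter(good)
        alt_base, alt_count = max(
            ((b, c) for b, c in counts.items() if b != ref_base),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if alt_base and alt_count / len(good) >= min_allele_fraction:
            return alt_base, alt_count / len(good)
        return None

    # Hypothetical pileup at one genomic position (reference base 'G'):
    # 12 reference reads, 9 high-quality alternate reads, 2 low-quality alternate reads.
    pileup = [('G', 35)] * 12 + [('A', 30)] * 9 + [('A', 8)] * 2
    print(call_snv('G', pileup))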
2.4. Structural variation calling
Analysis of structural variation can be performed using paired-end sequencing data from either the Illumina or SOLiD platform. Basically, from the sequencing library of randomly fragmented DNA a particular size class is selected, for example all fragments of 500 ± 10 bp, and sequenced from both ends. The two sequenced ends should then align to the reference sequence within the 500 bp range. When, for example, one of the paired reads maps to another chromosome, this is taken as evidence for an interchromosomal rearrangement. Similarly, when the two end sequences align too far apart this is evidence for a deletion, with the opposite situation indicative of an insertion. The read lengths vary between 50 and 150 bp depending on the platform and kit used. Alternatively, a mate-pair approach can be used in which larger fragments (between 2 kb and 10 kb) are circularized after their ends have been labeled with biotinylated nucleotides. Following fragmentation of the circularized molecules, the fragments containing biotin groups, representing the circularized ends of the original fragments, are captured on streptavidin magnetic beads. The eluted fragments are sequenced using the standard paired-end module, producing paired reads with an inverted orientation. The mate-pair approach produces a considerably higher mapping coverage of the genome from the same number of reads and provides a cost-effective approach for identifying large structural variants genome-wide. Recently, two groups used this approach to identify rearrangements in tumor samples[74].
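The underlying classification logic can be sketched as follows. The insert-size mean, standard deviation, and three-standard-deviation cutoff are illustrative placeholders, as are the field names; they are not the conventions of any specific caller.

    # Classify aligned read pairs as concordant or as candidate structural-variant
    # evidence, based on chromosome, orientation, and observed insert size.
    from dataclasses import dataclass

    @dataclass
    class ReadPair:
        chrom1: str
        chrom2: str
        insert_size: int          # distance between the outer ends of the aligned mates
        proper_orientation: bool

    def classify_pair(pair, mean_insert=500, sd_insert=50, n_sd=3):
        if pair.chrom1 != pair.chrom2:
            return "interchromosomal rearrangement"
        if not pair.proper_orientation:
            return "inversion or tandem-duplication signature"
        if pair.insert_size > mean_insert + n_sd * sd_insert:
            return "deletion in the sample (mates map too far apart)"
        if pair.insert_size < mean_insert - n_sd * sd_insert:
            return "insertion in the sample (mates map too close)"
        return "concordant"

    print(classify_pair(ReadPair("chr1", "chr1", 900, True)))   # deletion signature
    print(classify_pair(ReadPair("chr1", "chr5", 0, True)))     # interchromosomal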
Figure 5. Schematic depiction of the Illumina protocol for structural variation detection. DNA is extracted from a tissue or cell population, randomly fragmented, and gel size-selected to approximately 500 bp. Adapters are ligated to both ends of the fragments.
Similar to calling single nucleotide variants and InDels, analyzing structural variation is prone to artifacts. A common artifact involves chimeras between two fragments in the library that can be miscalled as genome rearrangements. Chimeras are thought to originate from the ligation reaction during library preparation that attaches sequencing adapters to all DNA fragments. While the nature of the adapters should prevent such self-ligation, it apparently happens at low frequency. To address this problem, one approach[76] uses stringent gel-based size selection of fragments both before and after the ligation reaction. Any chimeras that are produced will be double the size of the selected fragment range and thus will not be retained after the second size selection. Another source of false positive variant calls is incorrectly mapped paired-end reads resulting from multiple single nucleotide errors in a single read. To filter out these artifacts and produce high-quality variant calls, most programs require a cluster of read pairs supporting any aberration.
A comprehensive and proven package for the detection and characterization of structural variation has yet to be released. The breadth of the structural variation spectrum has made it difficult to design a single program that can accurately call all types of structural variation. Instead, three distinct types of calling algorithms have been developed that, together, are able to capture the full spectrum of these events.
The most basic algorithm is that applied by Breakdancer[77] and GASV[81]. These programs use the mapping distance data provided by the paired-end alignment statistics to estimate the average fragment size of the library. Clusters of aligned reads that appear to map at a distance more than three standard deviations away from the average are identified and called as structural variants. The stringency of this algorithm depends on a number of key parameters: the minimum number of reads supporting a cluster (normally >4), the frequency of the event at the locus (commonly an underestimate due to reads that span the breakpoint and thus are not aligned), and the map quality score, which is a measure of the probability that the paired read is correctly aligned. By narrowing the size distribution of fragments that are selected during library preparation, it is possible to call smaller insertions and deletions. Events that are missed using these algorithms include deletions and insertions smaller than 100 bp, insertions larger than 400 bp, and some segmental duplications (CNVs). To correctly identify these events, two additional algorithms must be implemented. The first uses an approach[82] similar to that used for the analysis of array-based Comparative Genomic Hybridization (aCGH) data. Bins of a width defined by the user are constructed to span the genome, and the number of aligned reads in each bin is recorded. Bins with an average read depth that differs significantly from the norm can be investigated manually using a genome visualizer and can be validated using aCGH or qPCR. This approach can detect both insertions and deletions and can be used in parallel with a program like Breakdancer to provide additional support for these events.
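A minimal version of this read-depth scan is sketched below: reads are counted in fixed-width bins and bins whose depth deviates from the genome-wide mean by more than three standard deviations are flagged. The bin width, cutoff, and simulated data are placeholders for illustration only.

    # Read-depth scan for copy-number changes: bin the genome, count reads per bin,
    # and flag bins whose depth departs strongly from the overall mean.
    import random
    import statistics
    from collections import Counter

    def bin_read_depth(read_starts, chrom_length, bin_width=1000):
        counts = Counter(start // bin_width for start in read_starts)
        n_bins = (chrom_length + bin_width - 1) // bin_width
        return [counts.get(i, 0) for i in range(n_bins)]

    def flag_outlier_bins(depths, n_sd=3):
        mean = statistics.mean(depths)
        sd = statistics.stdev(depths)
        return [(i, d) for i, d in enumerate(depths)
                if sd > 0 and abs(d - mean) > n_sd * sd]

    # Simulated read start positions on a 100 kb region; reads falling in the
    # 40-41 kb bin are removed to mimic a deletion, which should be flagged.
    random.seed(0)
    reads = [random.randrange(0, 100_000) for _ in range(20_000)]
    reads = [r for r in reads if not 40_000 <= r < 41_000]
    depths = bin_read_depth(reads, chrom_length=100_000)
    print(flag_outlier_bins(depths))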
For the investigation of large insertions, so-called orphan reads (read pairs in which only one of the two mates maps) are extracted from the alignment data and used as input for a local de novo assembly using assembly software such as ABySS or Velvet[83]. The assembled fragments are aligned to the reference genome and the alignments are parsed for contigs that provide evidence of insertion breakpoints. An open-source pipeline called SVMerge[84] was recently released that uses the output from the various classes of variant calling algorithms to filter and classify a complete set of structural variants. The pipeline runs a local de novo assembly using reads that align at the genome coordinates associated with each identified structural variant. This provides an additional validation step to filter out false positives and identify exact breakpoints.
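As a simple illustration of the orphan-read extraction step, the sketch below scans SAM-formatted records for pairs in which the read itself is unmapped but its mate is mapped, and emits the unmapped sequence in FASTA form for a local assembler. The two example records are hypothetical.

    # Extract "orphan" reads from SAM records: read pairs where exactly one mate is
    # mapped. The unmapped mate (anchored in the genome by its mapped partner) is
    # written out in FASTA form so it can be fed to a local de novo assembler.

    FLAG_UNMAPPED = 0x4        # this read is unmapped
    FLAG_MATE_UNMAPPED = 0x8   # the paired mate is unmapped

    def orphan_reads(sam_lines):
        for line in sam_lines:
            if line.startswith('@'):                  # skip header lines
                continue
            fields = line.rstrip('\n').split('\t')
            name, flag, seq = fields[0], int(fields[1]), fields[9]
            unmapped = bool(flag & FLAG_UNMAPPED)
            mate_unmapped = bool(flag & FLAG_MATE_UNMAPPED)
            if unmapped and not mate_unmapped:        # this read is the orphan
                yield f">{name}\n{seq}"

    # Two hypothetical SAM records: an unmapped read whose mate maps (kept) and a
    # properly mapped read (skipped). Fields are tab-separated per the SAM specification.
    sam = [
        "read1\t69\tchr1\t10468\t0\t*\t=\t10468\t0\tACGTACGTAC\tIIIIIIIIII",
        "read2\t99\tchr1\t20345\t60\t10M\t=\t20400\t65\tTTGCATGCAT\tIIIIIIIIII",
    ]
    print("\n".join(orphan_reads(sam)))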