|Home | About | Journals | Submit | Contact Us | Français|
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short-read lengths (35–250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.
The landscape of human genetics is rapidly changing, fueled by the advent of massively parallel sequencing technologies . New instruments from Roche (454), Illumina (GenomeAnalyzer), Life Technologies (SOLiD) and Helicos Biosciences (Heliscope) generate millions of short sequence reads per run, making it possible to sequence entire human genomes in a matter of weeks. These ‘next-generation sequencing’ (NGS) technologies have already been employed to sequence the constitutional genomes of several individuals [2–10]. Ambitious efforts like the 1000 Genomes Project and the Personal Genomes Project  hope to add thousands more. The first five cancer genomes to be published [12–17] revealed thousands of novel somatic mutations and implicated new genes in tumor development and progression. Our knowledge of the genetic variants that underlie disease susceptibility, treatment response and other phenotypes will continually improve as these studies expand the catalog of DNA sequence variation in humans.
The genomes of at least 10 individuals have been sequenced to high coverage using NGS technologies (Table 1). The first such genome (Watson) was sequenced to ~7.4× coverage on the 454 GS (Roche) platform , and included ~3.3 million single nucleotide polymorphisms (SNPs) of which 82% were already listed in the National Center for Biotechnology Information SNP database (dbSNP) . Remarkably, the nine personal genomes that followed on NGS technologies [2–8] reported similar results in terms of SNPs: 3–4 million per genome, 80–90% of which overlapped dbSNP. This pattern is so robust, in fact, that many consider ~3 million SNPs with 80–90% dbSNP concordance (depending on the ethnicity of the sample) to be the ‘gold standard’ for SNP discovery in whole-genome sequencing (WGS). Another implication of this pattern is that individual genomes contain ~0.5 million novel SNPs, whose submission to public databases will cause exponential growth as WGS studies expand. Indeed, since the completion of the Watson genome in 2007, submissions to dbSNP have skyrocketed (Figure 1). As of February 2010, dbSNP received over 100 million submissions for human, corresponding to 23.7 unique sequence variants of which more than half have been validated .
NGS technologies show great promise for the study of the genetic underpinnings of human disease. WGS is particularly appealing because it can detect the full spectrum of genetic variants—SNPs, indels, structural variants (SVs) and copy number variatons (CNV)—that may contribute to a phenotype . Indeed, the complete genome sequences several human cancers—AML [12, 13], breast cancer [16, 17], melanoma , lung cancer  and glioblastoma —have dramatically expanded the catalog of acquired (somatic) changes that may contribute to tumor development and growth (Table 1). For Mendelian diseases, massively parallel sequencing of family pedigrees offers an effective means of identifying the variants and genes underlying inherited disease . Indeed, the recent sequencing and analysis of a proband with Charcot–Marie tooth syndrome  demonstrates that these technologies have the potential for diagnostics in a clinical setting.
The value of massively parallel sequencing instruments for research is clearly illustrated by the widespread adoption of these platforms throughout North America, Europe, Asia and the Pacific (Figure 2). The commoditization of NGS throughout the world suggests that a substantial portion of sequenced human genomes will be produced outside of major genome sequencing centers. Very soon, groups with little to no experience in working with massively parallel sequencing data will gain access to these powerful technologies. The challenges that they face—in terms of production, management, analysis and interpretation of incredible amounts of sequence data—are daunting indeed. Fortunately, major genome centers and other groups who pioneered both traditional and NGS of human genomes have already addressed many of the key issues. Their strategies and methods for high-throughput sequencing of human genomes are the focus of this review.
Massively parallel sequencing enjoys a wide array of applications to the study of human genetics. Generally speaking, however, human genome resequencing using NGS technologies typically employs one of three strategies: targeted resequencing (Target-Seq), whole genome shotgun sequencing (WGS) and transcriptome sequencing (RNA-Seq). The types of genetic variation that can be characterized by these strategies are largely complementary; ultimately, a combination of whole-genome, targeted, and transcriptome sequencing yields the most comprehensive view of an individual genome (Figure 3).
Targeted sequencing (Target-Seq) applies genome enrichment strategies to isolate specific regions of interest prior to sequencing. Polymerase chain reaction (PCR)-based approaches for enrichment are gradually being supplanted by hybrid selection (capture) technologies [23, 24], in which sets of DNA or RNA oligonucleotide probes complementary to regions of interest are hybridized with libraries of fragmented DNA. Several methods for capture have been optimized for use with massively parallel sequencing [25–29]. Perhaps the penultimate goal of massively parallel targeted sequencing is to fully characterize the ‘exome’, or the full set of known coding exons. Indeed, dozens of human exomes have been sequenced using hybrid selection technologies paired with massively parallel sequencing .
WGS offers the most comprehensive and unbiased approach to genome characterization with next-generation instruments. WGS is particularly attractive because it lets one study the full scope of known DNA sequence variation—from SNPs and small indels to large SVs and CNVs—in a single experiment . Furthermore, sequence reads from single DNA molecules enable the phasing of detected variants to determine which occur on the same chromosome copy, information which is critical for genotype–phenotype correlation. To comprehensively characterize the variation in a single genome, however, it is necessary to generate highly redundant coverage to account for the increased sequencing error and shorter read lengths of massively parallel sequencing technologies. The redundancy required for accurate sequencing (currently ~30×) is dependent upon read lengths and sequencing error rate; as these metrics improve, less redundancy may be required.
Massively parallel sequencing of cDNA libraries, or RNA-Seq, is a rapidly developing application for NGS technologies . RNA-Seq offers a powerful approach to study the transcribed portion of the human genome, providing a digital readout of gene expression with sensitivity that far exceeds microarray-based methods. Furthermore, RNA-Seq enables the characterization of alternative splicing, allele-specific expression, fusion genes, and other forms of variation at the transcript levels. Specialized methods for mapping mRNA–miRNA interactions have also been adopted for massively parallel sequencing [32, 33].
The broad set of applications for massively parallel sequencing technologies, combined with their widespread adoption by the research community, suggest that NGS will continue to play a key role in the biological discoveries of coming years. Although investigators are understandably eager to harness the power of these new technologies, the massively parallel sequencing of human genomes presents some significant challenges.
It is important to realize that the generation and analysis of data from next-generation instruments present numerous challenges (Table 2). Principal among these are issues of sample contamination from non-human sources, library chimaeras, sample mix-ups and variable run quality.
While sample contamination remains an area of concern in any sequencing project, two aspects of NGS help mitigate this issue. First, NGS can be performed on libraries without the use of bacterial cloning, which was a significant source of sequence contamination in capillary-based sequencing. Second, each read from NGS interrogates a single DNA molecule, which permits the identification and removal of individual contaminating reads. Indeed, by mapping NGS reads to a database of common contaminating genome sequences (of bacterial and viral origin, for example), it is possible to rapidly screen libraries and remove the contaminating sequences.
As many as 5% of long-insert paired-end libraries contain chimeric reads . This artifact can have serious ramifications for de novo assembly [34–37] and SV prediction [38, 39] algorithms that rely upon mate pairing information. The assembly problem is potentially more severe, as chimeric fragments can generate false assembly paths. One solution is to use only fragment-end or short insert paired-end data for de novo assembly, and long insert paired-end data for the scaffolding of assembled contigs. For both scaffolding and SV detection, requiring a minimum of three or more independent supporting read pairs at a given locus helps reduce the influence of low-frequency chimeras.
In a high-throughput sequencing environment, human error is an important factor. Major genome centers have developed strategies to identify samples that are switched, mislabeled, or highly contaminated. To identify mislabeled samples, our group and others utilize high-density SNP array data, which provide thousands or millions of accurate genotypes across the genome. These not only provide reference points for diploid coverage estimation, but also constitute a highly individualized forensic DNA profile of the intended sample. Even a single lane of data from WGS or exome captures typically provides sufficient depth to call genotypes at thousands of variant positions; a simple concordance analysis between these and the expected genotypes from high-density arrays (Figure 4A) can distinguish correctly a correctly matched sample (90–99% concordant) from a mis-labeled one (60–80% concordance).
NGS of cancer genomes is typically performed on tumor samples and matched normal controls from the same patient. Here, correct sample identification is particularly critical, since the discovery of somatic changes requires a direct comparison of tumor to normal. Unfortunately, high-density SNP arrays are less informative, since samples share a common genetic origin. For many tumors, however, widespread genomic alterations and copy number changes distinguish tumor from normal. Thus, our group and others have applied CNV detection algorithms to NGS data from tumor-normal pairs to identify possible sample switches.
As massively parallel sequencing instruments transition from technology development labs to production floors, maintaining consistent run quality is an important challenge. The quality and amount of input DNA and reagents, as well as the skill of the laboratory technicians, can significantly affect results. Given the typically high cost of a single run on NGS instruments, experimental variability must be reduced as much as possible. Major genome centers have addressed this issue through automated liquid handling and streamlined workflows. Regular training of laboratory personnel is important as well. Finally, a series of quality control checks—DNA quantification using Picogreen, gel-based or microfluidics fragment size selection, for example—can isolate the source of problems when they arise.
Although the throughputs of current NGS platforms are significant, some samples (particularly those undergoing WGS) may require multiple sequencing runs. Given that sequencing runs on NGS instruments are costly and time-consuming, defining a data generation goal is an important step of the planning process. How much sequence data is enough? Often this question is answered by the practical considerations of funding, instrument access, and/or the availability of sample material. In the absence of such restrictions, however, certain performance metrics can indicate the quality and completeness of a sequenced genome.
The vendor-provided software for most NGS platforms provides some informative run metrics indicating the quality. Specifically, the number of reads, average read length (for the Roche/454 platform), alignment rate, and inferred error rate are the most obvious indicators of success or failure. As users gain experience with NGS data, the performance metrics of good versus bad runs will become more obvious. On the Illumina GAIIx platform, for example, we expect that good runs will yield 35–75 million reads per lane, with error rates of <2% and ELAND alignment rates of >80%. Error rate and alignment rate are correlated; as error rates increase, alignment rates tend to decrease (Figure 4B).
Sequencing coverage of the genome (for WGS studies) or of target regions (for Target-Seq studies) is the most basic metric for genome completion. There are several advantages to using coverage rather than numbers of runs, lanes, reads, or bases generated. Importantly, coverage excludes reads that are unmapped, ambiguously mapped, or marked as PCR duplicates to provide an estimate of ‘usable’ sequence data. The depth and breadth of sequencing coverage are directly related to the sensitivity and specificity of variant detection, which often represents the key analysis endpoint of human resequencing.
‘Fold’ redundancy (also called haploid coverage) is the number usually followed by an ‘X’ in whole genome resequencing studies. Most of the currently published WGS studies report fold coverage in the ~30–50× range, which seems to be the bar for a genome sequenced on current NGS platforms. Among the individual genome sequences listed in Table 1 are two exceptions. One is the genome of James D. Watson , which was the first to be sequenced on a massively parallel platform (Roche/454) and whose 7.4×-fold coverage represented a major achievement in sequencing technology. The second exception to the ~30× rule is the sequencing of NA18507 on Life Technologies’ SOLiD platform , which utilizes a di-base encoding scheme that requires lower redundancy to achieve >99% sequencing accuracy .
The availability of high-density SNP arrays, which typically assay >1-million SNPs across the human genome, provides another key metric of genome completion. Granted, current SNP arrays are largely comprised of SNPs that were characterized by large-scale efforts such as the International HapMap Project, sets which are known to harbor certain biases (assay ability, allele frequency and proximity to genes). Highly repetitive regions, for example, are under-represented. Nevertheless, SNP array data for a sequenced genome is extremely valuable because it provides millions of data points at which sequencing coverage and accuracy can be assessed. Because SNP arrays include many common variants, a substantial number are likely to be heterozygous in the individual being sequenced; detection of both alleles in sequencing data indicates that both chromosomes in a pair are represented. Thus, a comparison of the SNP calls from sequencing data to known genotypes from high-density SNP arrays serves as a more direct measurement of diploid coverage, which should reach 98–99% in a completed genome [12, 13].
Initially, the sheer volume of data produced by NGS instruments can be overwhelming. Development of a streamlined, highly automated pipeline to facilitate data analysis is a critical step that facilitates the transition from technology adoption to rapid data generation, analysis and publication. In this portion of the review, we discuss the key components of a primary analysis pipeline: sequence alignment, read de-duplication and conversion of data into a generic format in preparation for downstream analysis (Figure 5A).
The key first step in the analysis of next-generation resequencing data is the alignment, or mapping, of sequence reads to a reference sequence. Three characteristics of NGS data complicate this task. First, the read lengths are relatively short (36–250 bp) compared to traditional capillary-based sequencing, which not only provides less information to use for mapping, but also decreases the likelihood that a read can be mapped to a single, unique location. Second, reads from NGS platforms are of imperfect quality; that is, they contain higher rates of sequencing error. On the Roche/454 platform, for example, homopolymeric sequences are often over- or under-called , resulting in reads that contain gaps relative to the reference sequence. On the Illumina GAIIx platform, base quality is a function of read position , with the highest-quality bases at the start of the read. The third complication presented by NGS platforms is the sheer volume of data. A single run produces millions of sequencing reads, whose alignment to a large reference sequence requires significant computing power.
Recent years have seen a plethora of short-read alignment tools to support next-generation data analysis. Reads produced on the Roche/454 platform are long enough that traditional algorithms like BLAT  and SSAHA2  can be used effectively to map them. The high-throughput and short-read length of the Illumina/Solexa platform, however, presented a significant algorithmic challenge. One of the first tools to address it was the mapping and alignment with qualities algorithm, or MAQ . Compared to the vendor-provided software for Illumina data alignment, MAQ offered several advantages. It considered base quality scores during sequence alignment, which helped to address the variable quality of sequence across a read. Second, it assigned a mapping quality score to quantify the algorithm’s confidence that a read was correctly placed. Finally, MAQ made use of read pairing information (for Illumina paired-end libraries) to improve mapping accuracy and identify aberrantly-mapped pairs. MAQ was widely adopted by the NGS community, and utilized in WGS of human [2, 3] and cancer [12, 13, 16] genomes.
Efficient mapping of short reads to a large reference sequence has remained a considerable computational challenge, spurring the development of dozens of alignment algorithms (Table 3). Some, like Novoalign (http://www.novocraft.com), sought to improve upon the sensitivity of Illumina read alignment. At least three aligners (Bowtie , BWA  and SOAP2 ) have leveraged the Burrows–Wheeler transformation (BWT) algorithm, to dramatically decrease alignment time. Indeed, these algorithms can map a single lane of Illumina data (~20 million reads) in a matter of hours, compared to the several days required by Maq or Novoalign.
The SOLiD platform (Life Technologies) utilizes a unique di-base encoding scheme in which each base is interrogated twice, to help distinguish sequencing errors from true variation. Indeed, a recently published study applied SOLiD sequencing to characterize an entire genome with only ~18× haploid coverage . While the vendor-provided software for mapping SOLiD data is available, independent groups have developed their own. SHRiMP  is a rapid implementation of Smith–Waterman alignment that performs colorspace-correction while aligning reads. The BLAT-like fast alignment software tool (BFAST ) maps reads in color space and allows gaps, which enabled the identification of ~190 000 small (1–21 bp) indels in the recent sequencing of a glioblastoma cell line .
Early during the rise of NGS, it became apparent that many of the reads from massively parallel sequencing instruments were identical—same sequence, start site and orientation—suggesting that they represent multiple reads of the same unique DNA fragment, possibly amplified by PCR during the sequencing workflow [49–52]. It is critical to identify and remove these duplicate reads prior to variant calling, since the unintended amplification of PCR-introduced errors can yield skew variant allele frequencies and thereby decrease variant detection sensitivity and specificity .
SAMtools (, http://samtools.sourceforge.net) includes utilities for the removal of PCR duplicates from single-end or paired-end libraries. However, a superior solution is offered by the Picard suite (http://picard.sourceforge.net), which not only applies optimal fragment-based duplicate identification, but marks duplicate reads using the FLAG field rather than removing them from a SAM file.
The definition of the sequence alignment map (SAM) format and its binary equivalent (BAM) was a critical achievement for NGS data analysis. The SAM format specification (http://samtools.sourceforge.net/SAM1.pdf) describes a generic format for storing both sequencing reads and their alignment to a reference sequence or assembly. SAM/BAM format is relatively compact, but flexible enough to accommodate relevant information from different sequencing platforms and short-read aligners. A single SAM/BAM file can store mapped, unmapped, and even QC-failed reads from a sequencing run, and indexed to allow rapid access. This means that, if desired, the raw sequencing data can be fully recapitulated from the SAM/BAM file.
One key field of the SAM format specification is the FLAG, a ‘bitwise’ representation of several read properties, which can be true or false. Each property is set to on (1) or off (0); the bits that are set to on, when combined, represent an integer value. Thus, a single field in the SAM format specification indicates if a read is paired, properly paired, mapped, read1 or read2, quality-failed, or marked as duplicate. Thus, SAM/BAM files can contain extensive information about a read, its properties, and its alignment to a reference sequence. A freely available software package, SAMtools , provides the utilities for creating, sorting, combining, indexing, viewing, and manipulating SAM/BAM files. For these reasons, SAM/BAM format has been widely adopted by the sequencing community.
The availability of sequencing services offered by private companies  such as Complete Genomics, as well as the Beijing Genomics Institute and other centers, have raised the possibility of ‘outsourcing’ massively parallel sequencing. This option may be attractive to investigators because it mitigates the considerable financial and personnel investment required for NGS instruments . Furthermore, the development of NGS data analysis packages for cloud computing  suggests that computationally intense analyses may be run on rented hardware, thus removing the cost of purchasing and maintaining such equipment.
The possibility of outsourcing DNA sequencing to a third party deserves careful consideration. There are important concerns related to privacy and security of the data—since DNA and RNA contain information that could be used to identify an individual, keeping that information in confidence, and safe from intrusion, is of the utmost importance for many investigators. The ethical and legal responsibilities surrounding human samples continue to gain prominence; suggesting that permitting third parties to perform the sequencing faces, at the very least, an uphill battle. Transparency in the data generation process is also a key issue; since the primary analysis of NGS data is so critical to the final results, every step between receiving a sample and providing a BAM file must be carefully documented.
Despite these difficulties, it is clear that some companies and institutions will have the capacity to perform sequencing for outside parties, and some investigators are bound to find sequencing-as-a-service appealing for their research. Furthermore, recent studies in which NGS technologies uncovered important genes for Mendelian disease [21, 22] illustrate the potential of sequencing data to enhance clinical research. For these reasons, the following sections on downstream analysis have the expectation of sequencing data in a BAM file, which seems the most likely endpoint for both primary analysis pipelines and outsourced sequencing.
A key advantage of converting NGS data to SAM/BAM format is that all downstream analysis can be driven from a single data file (Figure 5B). Properly mapped reads can be used to identify SNPs/indels and to infer genome-wide copy number. Aberrantly mapped read pairs can be screened for evidence of underlying structural variation, while de novo assembly of unmapped reads [34–37] enables the characterization of novel insertions and SV breakpoints. In this section of the review, we discuss some of the algorithms that have been developed for detecting these types of variation in NGS data.
Massively parallel sequencing data has proven ideal for the identification of SNPs [55, 56]. Indeed, some ~3–4-million SNPs per individual were reported for the WGS studies presented in Figure 1; of these, the vast majority (74–90%) [2–9] corresponded to known variants in the National Center for Biotechnology Information’s database of sequence variation in humans (dbSNP). This high overlap suggests that the vast majority of reported variants are real human polymorphisms. Yet the significant fraction of novel SNPs (10–26%) identified from whole genome sequencing implies that a substantial portion of rare variation remains to be discovered. Efforts such as the 1000 Genomes Project (http://www.1000genomes.org) hope to catalog these by sequencing the genomes of thousands of individuals. In cancer, most of the validated somatic single nucleotide variants (SNVs) are neither present in dbSNP nor shared amongst other tumors [12, 13, 16]. Accurate detection of single nucleotide variation, therefore, remains an important aspect of NGS.
Numerous algorithms for calling SNPs in NGS data have been developed in recent years. Bayesian methods (e.g. Atlas-SNP , SOAPsnp ) utilize prior probability calculations to determine the most probable genotype (reference or variant) based upon available sequence information. Other packages (e.g. SAMtools , VarScan ) include numerous utilities for detection and filtering of variant calls based on heuristic and probabilistic models, reinforced with empirical knowledge of massively parallel sequencing platforms. No one tool single-handedly outperforms the others. Indeed, a combination of variant calling algorithms, each tuned to perform optimally for the dataset in hand, is likely to yield the best combination of sensitivity and specificity for variant detection in human genomes.
False positives during SNP calling generally arise from two phenomena. The first source is sequencing error, which is more prevalent for NGS platforms than traditional capillary-based methods. While sequencing errors are often random, certain platform-specific and platform-independent trends have become evident. On the Illumina/Solexa platform, sequencing error is positively correlated with read position; errors tend to occur near the ends of reads. In contrast, errors on the Roche/454 platform are not dependent on read position, but tend to cluster around homopolymeric sequences that are under- or over-called during 454 pyrosequencing.
Alignment artifacts are the second major source of false positive SNP calls. The relatively short-read lengths from NGS platforms and complexity of the human reference genome make read mis-alignments inevitable. Paralogous sequences and low-copy repeats that differ by only a few bases can give rise to reads that, when aligned incorrectly, appear to support a substitution at the same position. Thus, these types of errors can manifest even in regions of deep coverage. A window-based filtering approach that identifies clusters of SNP calls (i.e. three SNPs within 10 bp) can help remove some of these artifacts.
Detection of small insertions and deletions in NGS data has proven more difficult, particularly due to the relatively short-read length typical of most platforms. Computationally speaking, aligning reads with substitutions (SNPs) to a reference sequence is much easier than aligning reads with gaps (indels). While the longer reads of the 454 platform seem to address this problem, indels detected in 454 data tend to carry a high false-positive rate, primarily due to the inability of pyrosequencing technology to resolve homopolymers (runs of a single nucleotide) longer than 4–5 bases. The growing read length of the Illumina/Solexa platform (currently 100 bp) coupled with improvements in gapped short-read alignment (BWA, Novoalign), makes it feasible to detect insertions of up to 30 bp and deletions of nearly any size. We developed a tool, called VarScan (http://varscan.sourceforge.net)  that performs indel detection using gapped alignments of massively parallel sequencing data. The Pindel tool takes another approach to indel detection that leverages the mate pairing information from paired-end sequencing on Illumina or SOLiD platforms. By isolating mate pairs where only one read is mapped, and performing split-read alignment of the unmapped read, Pindel identifies slightly larger indel events that are refractory to direct gapped alignment.
Despite these advances, accurate indel detection using massively parallel sequencing data remains challenging. One reason for this is the relatively short-read lengths of NGS platforms, which limits their ability to detect large events, particularly for insertions. Furthermore, indels predicted in NGS datasets, particularly single-base events, suffer high false positive rates due to alignment artifacts and sequencing error. The combination of paired-end sequencing (to increase mapping accuracy) and localized de novo assembly (to remove local mis-alignments and resolve breakpoints) [34–37] improves the performance of indel detection, though not nearly to levels of sensitivity and specificity that are achievable for SNPs.
Massively parallel sequencing data is particularly advantageous for the study of structural variation (SV). Not only does it offer the sensitivity to detect SVs across a wide range of sizes (1–1000 kb), but also enables precise identification of structural breakpoints at base-pair resolution [59–62]. Most sequence-based approaches to SV detection extend seminal work by Volik et al.  and Raphael et al. . Their approach used traditional 3730 sequencing to perform end-sequence profiling (ESP) of bacterial artificial chromosomes (BACs). When mapped to the human genome, aberrations in distance and/or orientation between end-sequence read pairs revealed the presence of underlying structural variation.
The ESP method has since been adapted to characterize structural variation in human genomes using paired-end sequencing on the Roche/454 , Illumina  and SOLiD platforms . Our group has developed an automated pipeline for SV prediction from Illumina paired-end sequencing data. The software algorithm, BreakDancer , utilizes data in BAM format to conduct de novo prediction and in silico confirmation of structural variation. The confidence score for each SV prediction is estimated using a Poisson model that takes into consideration the number of supporting reads, the size of the anchoring regions, and the coverage of the genome. BreakDancerMax outputs five types of SVs: insertions, large deletions (>100 bp), inversions, intra-chromosomal rearrangements and inter-chromosomal translocations. Alignment artifacts in short-read data appear to be the most significant source of false positives from BreakDancer and other SV prediction algorithms. To remove false positives, and to precisely define the breakpoints of each variant, we perform de novo assembly using TIGRA (unpublished) of all read pairs that have at least one end mapped to the predicted intervals. AbySS , Velvet  and other short-read assemblers are well-suited to localized de novo assembly for this purpose. Even the most advanced pipelines for SV detection suffer a high false positive rate , suggesting SV detection using NGS data is still in its infancy. However, theoretical work shows the possibility, at least in principle, of controlling false-positives by appropriately tuning redundancy .
Massively parallel, WGS enables detection of CNV at unprecedented resolution. It is important, however, to account for certain biases when utilizing sequencing coverage to infer copy number. First, variable G + C content throughout the genome is known to influence sequence coverage on most NGS platforms. On the Illumina platform, for example, regions with significantly low (<20%) or high (>60%) G + C content are under-represented in shotgun sequencing . To address this bias, Yoon et al.  segmented the genome into 100-bp windows, and adjusted each window’s read counts based on the observed deviation in coverage for a given G + C percentage. Mapping bias is another important contributor to variation in sequencing coverage, particularly for the short (35–50 bp) reads produced on Illumina and SOLiD platforms. Campbell et al.  proposed a method to correct for mapping bias based on simulations of Illumina 2 × 35 bp reads, which they mapped to the genome using MAQ. Next, they divided the genome into non-overlapping ‘windows’ of unequal width such that each window contained roughly the same number of mapped reads.
After correcting for G + C content and uniqueness, the normalized read depth offers a uniform representation of copy number across the genome. To identify regions of significant copy number change, Campbell et al.  adapted a circular binary segmentation algorithm for SNP array data. Their adaptation is implemented in R as the ‘DNAcopy’ library of the Bioconductor project (http://www.bioconductor.org). A similar method, correlational matrix diagonal segmentation (CMDS) , enables copy number estimation across a population of samples.
The rise of massively parallel sequencing has fundamentally changed the study of genetics and genomics. Whole genome sequencing of 10 individuals and several tumor samples has only begun to reveal the extent and nature of human sequence variation. To date, the majority of NGS has taken place inside of major genome centers. However, the widespread adoption of new sequencing instruments throughout the world suggests that this pattern will change. It should be noted that NGS as a research tool presents substantial challenges—in production, in data management, and in downstream analysis. Investigators stand to benefit from strategies for quality control and data analysis that produced the first studies enabled by NGS technologies. It is clear that new sequencing technologies hold incredible promise for research; their capabilities in the hands of investigators will undoubtedly accelerate our understanding of human genetics.
This work was supported by the National Human Genome Research Institute [grant number HG003079, PI Richard K. Wilson].
We thank Michael C. Wendl for critical reading of the manuscript. We also thank Robert S. Fulton, Lucinda Fulton, David Dooling, David E. Larson, Ken Chen, Michael D. McLellan, Nathan Dees, and Christopher C. Harris of the Genome Center at Washington University in St. Louis for their contributions to discussions related to this review.
Daniel C. Koboldt works in the medical genomics group of the Genome Center at Washington University in St. Louis, and maintains a blog on next-generation sequencing at http://www.massgenomics.org.
Li Ding is assistant director and head of the medical genomics group at the Genome Center and a research assistant professor at Washington University in St. Louis.
Elaine R. Mardis is co-director of the Genome Center and an associate professor of genetics at Washington University in St. Louis.
Richard K. Wilson is director of the Genome Center and a professor of genetics at Washington University in St. Louis.