With broad applications in research and clinical diagnostics, next-generation DNA sequencing (NGS) has become an important platform for identifying mutations and variants from clinical samples. NGS has been frequently applied to the detection of polymorphisms in normal diploid genomic DNA samples, where the allele frequency follows Mendelian inheritance. In this case, a heterozygous variant accounts for roughly half of the read depth at its position. However, many samples are more complex genetic mixtures in which a mutation or variant may be present in only a small proportion of the relevant sequences. Deep sequencing analysis with very high levels of coverage on smaller targeted regions can sensitively detect less prevalent, minor alleles and mutations in these admixed samples. For example, ultrasensitive detection could identify mutations in individual genes that cause resistance to the drugs that target specific gene products.
Generally, applications of sensitive rare mutation and minor allele detection from admixed samples include microbial or viral population sequencing, detection of rare cancer-specific mutations in primary tumors, environmental diversity sampling of specific microbes, and pooled sample sequencing. As noted, many studies are particularly interested in small sets of genes that are therapeutic targets. Deep sequencing has been used to analyze clinical samples from individuals infected with HIV or other viruses to identify the multiple related viral clones, often referred to as ‘quasi-species’, that coexist in an infected individual (1–4). This offers the opportunity to identify rare mutations conferring resistance to antiviral therapies, whose representation in a virus population can expand after therapeutic selection in chronically infected individuals. NGS can also provide sensitive detection of cancer-specific mutations in primary clinical cancer samples contaminated with normal stroma (5); in a similar context, these mutations can lead to resistance to cancer therapy. By pooling genomic DNA from many individuals in a cohort and sequencing the pool, one can identify rare variants in smaller genomic regions at a much lower per-sample cost than in large population studies (6).
There are many experimental methods available for highly sensitive detection of rare mutations and variants from admixture samples containing multiple genotypes. These include denaturing high-performance liquid chromatography (DHPLC) (7), high-resolution melting analysis (HRMA) and mutation-specific PCR-based genotyping assays (8). However, DHPLC and HRMA require DNA sequencing as final confirmation of the identity of a mutation, and mutation-specific genotyping assays (9), while highly sensitive and specific, require a priori knowledge of the mutation. Compared to these other methods, direct DNA sequencing offers significant advantages for both discovery and confirmation of rare mutations in samples that are complex genetic mixtures.
Presently, there are only a limited number of ways to detect rare single nucleotide mutations using an NGS platform (1). SNPseeker (11) uses quality filtering and large-deviation theory to call SNPs with a minor allele frequency (MAF) as low as 0.5–1.2%. VarScan (10) uses thresholds on coverage, quality and variant frequency to call variants with a MAF as low as 1%. CRISP (13) uses a probabilistic model to call rare variants present in pools as large as 25 individuals, representing a level of 2% allele frequency. Hedskog et al. report detection of 0.07–0.09% variants in a viral population using pyrosequencing technology. Some methods are designed to detect both SNPs and indels. The major challenge for NGS rare variant and mutation detection is finding a true signal against the relatively high error rates of NGS. With the initial commercial release of these technologies, these errors were generally quoted as ranging from 1% to 3% (14). We demonstrate that the error rates are significantly lower, based on our results from sequencing synthetic DNA samples. Our overall objective was to develop a robust and general method to detect rare (0.1%) single nucleotide variants with current sequencing-by-synthesis NGS technology by overcoming the general sequencing error rate limitations. This level corresponds to accurately detecting one mutation among 1000 wild-type alleles.
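To make the scale of this problem concrete, the short Python sketch below compares the expected number of reads carrying a true 0.1% variant with the expected number of reads that differ from the reference through sequencing error alone. The read depth of 100,000 and the function name `expected_counts` are illustrative assumptions, not values from this study; the 1% and 3% error rates are the figures quoted above, and 0.1% is included as a hypothetical lower error rate.

```python
def expected_counts(depth, variant_fraction, error_rate):
    """Return (expected reads from a true variant, expected reads from error).

    Assumes every read at the position is independent and, as a worst case,
    that all sequencing errors produce the same non-reference base.
    """
    true_reads = depth * variant_fraction
    error_reads = depth * error_rate
    return true_reads, error_reads


if __name__ == "__main__":
    depth = 100_000           # hypothetical read depth at one position
    variant_fraction = 0.001  # the 0.1% target: 1 mutant per 1000 alleles
    for error_rate in (0.03, 0.01, 0.001):
        true_reads, error_reads = expected_counts(depth, variant_fraction, error_rate)
        ratio = true_reads / error_reads
        print(f"error rate {error_rate:.3f}: "
              f"{true_reads:.0f} variant reads vs {error_reads:.0f} error reads "
              f"(signal/error = {ratio:.2f})")
```

Under these assumptions, a 0.1% variant is outnumbered by erroneous reads whenever the per-base error rate exceeds 0.1%, which is why a lower, well-characterized error rate matters for detection at this level.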
Our method for the detection of rare single nucleotide mutations at the 0.1% level relies on innovations in both experimental design and statistical algorithm (Figure 1). We use a multi-reference, indexed experimental design to minimize experimental variance and characterize a position-specific error distribution. We employ a rigorous statistical model to estimate the position-specific error rate distribution for reference sequences and thus the probability of a true mutation at each position in the sample. The statistical model provides a rigorous framework for hypothesis testing and estimation that minimizes false positives in variant calling. We demonstrate our method by accurately calling known mutant positions in a 0.1% mixture of two pure synthetic DNA constructs sequenced via Illumina NGS. The reference and mutant positions are known a priori and provide a gold standard for testing our approach. We then apply our method to identify mutations in the influenza A (H1N1) neuraminidase gene (NA) obtained from nine infected individuals during the 2009 pandemic. We identified a known drug resistance mutation among these variants. Finally, we present a statistical power analysis of our method to characterize the sequencing parameters under which it can be generalized to other novel applications.
Figure 1. Method flowchart. The method for detecting rare variants compares the baseline error rate from multiple reference replicates to the sample error rate at each position. Sample and reference DNA are independently prepared and tagged with indexed adapters.
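The comparison of a sample against reference replicates at a single position can be illustrated with a deliberately simplified Python sketch. This is not the statistical model used in this work, which is not specified in this section: here the reference replicates are simply pooled into one error-rate estimate and a one-sided exact binomial test is applied, ignoring replicate-to-replicate variance. The function names, the read counts, and the significance threshold `alpha` are all hypothetical.

```python
import math


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0  # no errors expected, so any observed variant read is surprising
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def call_position(ref_replicates, sample_nonref, sample_depth, alpha=1e-6):
    """Test one position for an excess of non-reference reads in the sample.

    ref_replicates: list of (non_reference_reads, depth), one per reference replicate.
    Returns (p_value, is_called). alpha is a hypothetical threshold.
    """
    ref_nonref = sum(nonref for nonref, _ in ref_replicates)
    ref_depth = sum(depth for _, depth in ref_replicates)
    baseline_error = ref_nonref / ref_depth  # pooled position-specific error rate
    p_value = binom_sf(sample_nonref, sample_depth, baseline_error)
    return p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical counts at one position in three reference replicates.
    reference = [(20, 100_000), (25, 100_000), (18, 100_000)]
    for nonref in (22, 120):
        p_value, called = call_position(reference, nonref, 100_000)
        print(f"{nonref} non-reference reads: p = {p_value:.3g}, called = {called}")
```

A pooled binomial test of this kind treats the baseline error rate as known exactly; a model that estimates the distribution of the error rate across replicates, as described above, would be more conservative when replicates disagree.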