Genetic differences can arise among individual viral particles within an infected host, and detecting these viral genetic variants can reveal how viruses adapt to challenges such as host immune responses, antiviral medications, and transmission bottlenecks. However, detecting rare variants is difficult with existing sequencing technologies due to low sensitivity, high error rates, and/or poor scalability. For example, bulk-sequencing approaches generate a consensus assembly, but they have limited sensitivity to detect intra-host variation
[1]. One approach to increase sensitivity is to amplify and clone selected fragments of viral nucleic acids into proliferating targets that are subsequently isolated and sequenced
[2], but this method has a higher false positive rate and poor scalability. To reduce errors, the single genome amplification (SGA) method isolates individual viral genomes through dilution, and then amplifies and sequences each genome individually to minimize introduced errors
[3]–
[5], although scalability remains an issue. Rare variant detection requires deep coverage that is not cost-effective with current methods of cloning or SGA. To address scalability, massively parallel sequencing technologies can isolate and sequence individual DNA or cDNA molecules en masse from the population of viral genomes and generate millions of short read sequences that can increase the sensitivity and decrease the cost to detect variants
[6],
[7]. Still, increased error rates can somewhat impact potential gains in sensitivity. Here, we report on a novel method to detect rare variants that increases sensitivity even in the presence of process errors.
Detecting biological variants involves not only finding them, which deep sequencing technologies can do with high sensitivity, but also differentiating them from process (i.e. amplification or sequencing) errors. One way to do this is to compare variants to a distribution of errors. For example, several authors have reported using a Poisson or binomial probability model to define the error distribution, and they can call candidates that fall outside the distribution variants
[8]–
[13]. These models, however, assume that all bases have equal quality scores, where the base quality score is a measure of how accurate the base call is. This assumption is invalid for bases measured by massively parallel sequencing technologies, as Brockman et al.
[14] have shown, since base quality can vary by several criteria; in fact, sequencing technologies take criteria such as these into account when assigning base quality scores. To avoid this assumption, probability models can incorporate base quality scores. Such probability models exist in tools that call single nucleotide polymorphisms (SNPs) in human and other diploid genomes, including MAQ
[15], SoapSNP
[16], Unified Genotyper
[17], SNVMix
[18], or Slider
[19]. In contrast, instead of an explicit error probability model, Hoffman et al.
[20] compare variants to an empirical control data set. Archer et al.
[21] and Rozera et al.
[22] report methods that correct read sequences for suspected process errors prior to calculating variant frequencies. Archer et al.
[21] use a k-mer mapping approach to position reads on a consensus template and refine alignments locally, and Rozera et al.
[22] turn to heuristic rules to filter out errors based on cutoffs for base quality scores and other criteria. Both strategies avoid using an explicit probability model of error and hence assume that all process errors take a specific form, and that no biological variants take the same form as the process errors.
The above models separate variants from error using specific forms or heuristics or a probabilistic distribution. An alternative approach is to consider patterns of candidate variants. For example, Eriksson et al.
[9] use Fisher's exact test to find patterns that occur more frequently than expected by chance to call variants. Refining this approach further, several authors probabilistically cluster patterns to infer variant haplotypes
[9],
[11],
[12],
[23]; the cluster centers are haplotypes, and process errors can be removed by collapsing variation within the cluster. Since patterns of variants are essentially groups of variants that occur at the same loci on multiple reads, i.e. in phase, we can analyze them together as a group of phased variants, and we can compare them to phased errors in the same pattern. Phased errors presumably occur much less frequently than errors in general, making it easier to recognize phased variants.
To address the challenge of calling rare genetic variants in diverse populations in the presence of error, we introduce V-Phaser, a single nucleotide variant calling tool that uses phase and quality filtering with a probability model that incorporates and recalibrates individual base quality scores. To increase sensitivity, V-Phaser looks not only for variants that fall outside the distribution of errors but also for patterns of variants in phase. To increase specificity, it incorporates individual base quality scores into a composite Bernoulli model that allows error rates to vary from base to base. It also uses a pre-processing filter to screen out low quality bases and improve the fit of the model. We calculate the theoretical gain in sensitivity of detecting variants using phase to increase specificity. We then validate V-Phaser on read sets with known variation generated by the 454 sequencing platform to estimate sensitivity and specificity. To determine the effect of each algorithmic step on performance, we evaluate the method with each of three features (phasing, recalibration, and filtering) turned off and compare these results to those achieved on the same data with several other viral variant callers. Finally, we use V-Phaser on data from a chronically HIV-1 infected subject to demonstrate its utility to detect low frequency variants in viral populations.