Recent advances in sequencing technologies have enabled more widespread study of heterogeneity and sub-populations in a cell population, and a migration away from a ‘consensus sequence’ view of their evolution. Such a ‘population perspective’ has applications in a range of biological systems, from the characterization of viral quasi species and intra-host variation (1
), to bacterial sub-populations (3–5
), to sub-clonal evolution in cancer (6–8
). Precise characterization of population structure (and rare sub-populations) in these studies is fundamental to the analysis of population evolution and dynamics as a function of host response or drug exposure. Several recent cancer sequencing studies have further emphasized the functional role of rare sub-populations and variants in aspects such as tumor growth, drug resistance and metastasis (9
) and the need for computational tools to study them.
In principle, the high throughput of massively parallel sequencing allows for sampling of even rare sub-populations. Sequencing errors, however, complicate the determination of true variations in the population. Sequencing error rates are known to be highly variable and differ significantly between technologies, runs, lanes, multiplexes, genomic location as well as substitution types (11–13
). While approaches to correct for these have been studied, the majority of variant-calling methods have focused on low-coverage human re-sequencing data and diploid calls (14–16
) with discrete frequencies of interest (i.e. 0, 0.5 and 1; a related set of methods are those tailored for calling diploid genotypes in pooled sequencing data (17–20
) and are not generally applicable).
Studies of RNA viruses have provided the exceptions to this rule (21–24
). RNA viruses have high mutation rates (due to poor or missing proof-reading capability of the viral RNA-dependent DNA polymerase) and high replication rates, and the resulting intra-host variations have implications for drug treatment strategies (25
) and the genetic monitoring of live vaccines (26
). The methods used in these studies though rely on ad hoc
trimming, filtering and thresholds to call variants, limiting their sensitivity and widespread applicability (needing manual adjustment per sample or sequencing run). Recent model-based approaches such as Breseq (27
) and SNVer (29
) are potentially more sensitive and generic, but rely on simple binomial models and are not tailored to accommodate biases in error rates. A more sophisticated approach, that relies on phasing to reduce the effect of sequencing errors and is tailored to 454 sequencing has recently been applied to viral datasets (30
). This method is, however, not directly applicable to other technologies and cannot be run on large genomes or sequencing datasets.
In emerging clinical applications that use sequencing to monitor the genomic state of cells, the ability to detect rare variants in a population and to do so at the edge of detection limits is an important unfulfilled capability. On the one hand, increased sensitivity in variant callers can make it possible to monitor rare but important sub-populations (e.g. cancer stem cell mutations) and on the other hand, sensitivity is essential for early detection of say a drug-resistant sub-population (e.g. with antiretroviral drugs for HIV). In such settings, ad hoc approaches lack the desired adaptability and robustness and may suffer from an artificial cap in the sensitivity of variant detection. Precise modeling of sequencing errors is essential to push sensitivity limits and it is this need that we seek to address.
In this work, we present a sensitive and robust approach for calling single-nucleotide variants (SNVs) from high-coverage sequencing datasets, based on a formal model for biases in sequencing error rates. We show that rigorous statistical testing can be done efficiently under this model, without resorting to approximations, thus allowing for the exact analysis of large genomes and high-coverage datasets. The resulting method, LoFreq, adapts automatically to sequencing run and position-specific sequencing biases and can call SNVs at a frequency lower than the average sequencing error rate in a dataset. LoFreq’s robustness, sensitivity and specificity were validated using several simulated and real datasets (viral, bacterial and human) and on two experimental platforms (Fluidigm and Sequenom). Our results from applying LoFreq to call rare somatic SNVs (in exome sequencing datasets for gastric cancer) and for studying dengue virus quasi species before and after treatment in a clinical study (of a nucleoside-analog drug Balapiravir) further highlight the robustness and versatility of this approach.