High-throughput sequencing (HTS) technologies are revolutionizing the way biologists acquire and analyze genomic data. HTS machines, such as 454/Roche, Illumina/Solexa and AB SOLiD are able to sequence up to a full human genome per week, at a cost hundreds fold less than previous methods. The resulting data consists of reads ranging in length between 35 and 400 nt, from unknown locations in the genome. Analysis of these datasets poses an unprecedented informatics challenge due to the sheer number of reads that a single run of an HTS machine can produce, the shortness of the reads, and the various technologies' different sequencing biases and error rates. The two basic steps in the discovery of variants in the human population from reads coming from any of these technologies are: first, the mapping of reads to a finished (reference) genome, and second the identification of variation by analysis of these mappings.
In the last few years, there have been many approaches proposed for mapping reads from HTS technologies (Campagna et al.
; Langmead et al.
; Li and Durbin, 2009
; Li et al.
; Lin et al.
; Rumble et al.
among many others; see Dalca and Brudno, 2010
; Flicek and Birney, 2009
for reviews) that utilize a wide variety of approaches. Compared to this multitude of mapping tools, there have only been a handful of toolsets for single nucleotide polymorphism (SNP) and small (1–5 bp) indel discovery. The main challenge in detecting these variants is using the error rates of the sequencing platform, the potentially incorrect mappings, and the varying coverage to determine the likelihood that a position represents a heterozygous or homozygous variant with respect to some reference genome. We use the term heterozygous to refer to the case when a single donor allele differs from the reference, and homozygous to refer to the case when both donor alleles differ from the reference, and are the same as each other. Tri-allelic SNPs, when the two donor alleles differ from each other and from the reference, are rare. This variation detection task is further complicated by the different types of errors and data representation methods used by various technologies. For example, while the predominant error type in Illumina sequencing is the misreading of a base pair, in 454/Roche the most common mistake is insertion/deletion errors in a homopolymer (same base repeating multiple times). The AB SOLiD system introduced a dibase sequencing technique, where two nucleotides are read at every step of the sequencing process together as one color
. Only four dyes are used for the 16 possible dibases (a), and the predominant error is the miscall of a color (colors are usually written as numbers 0–3). Most tools for variation detection (Li et al.
; Marth et al.
) combine a detailed data preparation step, in which the reads are filtered, realigned and often rescored, with a nucleotide or heterozygosity calling step, typically done using a Bayesian framework. The typical parameters considered are the sequencing error rate, the SNP rate in the population (the prior) and the likelihood of misalignment (mapping quality). Most of the tools for SNP calling analyze one base of the reference genome at a time and do not use adjacent locations to help call SNPs (positions are considered independent).
Fig. 1. Color-space description: Parts (a) and (b) show the correspondence between di-nucleotides and their color space representation with a translation matrix and the corresponding Finite State Automaton. In part (c), we show the effect of SNPs on the color-space (more ...)
AB SOLiD's dibase sequencing presents several unique challenges for SNP and indel identification. While typical, letter-space reads represent the DNA sequences directly as a string of A's, C's, G's and T's, one can think of dibase encoding as the output of a Finite State Automaton: consider each color as the shift from one letter to the next, so even though only four colors are generated, we can derive each subsequent letter if we know the previous one (b). Sequencing starts at the last letter of the molecule that connects to the DNA (the linker), which is known, thus enabling the translation of the whole read from color space into letter space. It is important to note, however, that if one of the colors in a read is misidentified (e.g. due to a sequencing error), this will change all of the subsequent letters in the translation (c). For this reason, simply translating the reads to letter-space would be impractical. While this error profile may at first seem detrimental, it can actually be advantageous when we need to decide if a particular difference between a read and the reference genome is due to an underlying change in DNA or a sequencing error: all SNPs will change two adjacent colors, while the probability that two adjacent colors are both misread is small, as error probabilities at adjacent positions are independent. Simultaneously, non-SNP genomic variants (e.g. polymorphisms at adjacent residues and micro-indels) have more complicated color-space signatures, complicating variation discovery.
Some tools for color-space SNP calling first map the reads in color space by translating the reference, but then translate the multiple alignment back to nucleotide space in order to call SNPs (Li and Durbin, 2009
; Li et al.
). McKernan et al.
) describe Corona Lite, a consensus technique where each valid pair of read colors votes for an overall base call. Currently, there are no methods that can simultaneously take full advantage of both color- and letter-space data to call variants—an important, consideration since the advantages and disadvantages of the various platforms are quite disparate. By combining these data sources, it is possible to exploit the strengths of multiple HTS technologies to improve on the accuracy of current SNP callers. Here, we present VARiD—a probabilistic approach for variant identification from either or both letter- and color-space data simultaneously. We represent both types of data as emissions from a hidden Markov model (HMM), while the underlying genotypes of the sequenced genome are the hidden states. By applying the forward–backward algorithm on the HMM we generate, for every base of the genome, a probability distribution over the possible bases. In our testing, VARiD performs more accurately than AB's Corona Lite pipeline for just color-space data, while its ability to incorporate letter-space data allows for more accurate determination of genomic variants using multiple read types, simultaneously.