|Home | About | Journals | Submit | Contact Us | Français|
Motivation: High-throughput sequencing (HTS) technologies are transforming the study of genomic variation. The various HTS technologies have different sequencing biases and error rates, and while most HTS technologies sequence the residues of the genome directly, generating base calls for each position, the Applied Biosystem's SOLiD platform generates dibase-coded (color space) sequences. While combining data from the various platforms should increase the accuracy of variation detection, to date there are only a few tools that can identify variants from color space data, and none that can analyze color space and regular (letter space) data together.
Results: We present VARiD—a probabilistic method for variation detection from both letter- and color-space reads simultaneously. VARiD is based on a hidden Markov model and uses the forward-backward algorithm to accurately identify heterozygous, homozygous and tri-allelic SNPs, as well as micro-indels. Our analysis shows that VARiD performs better than the AB SOLiD toolset at detecting variants from color-space data alone, and improves the calls dramatically when letter- and color-space reads are combined.
Availability: The toolset is freely available at http://compbio.cs.utoronto.ca/varid
High-throughput sequencing (HTS) technologies are revolutionizing the way biologists acquire and analyze genomic data. HTS machines, such as 454/Roche, Illumina/Solexa and AB SOLiD are able to sequence up to a full human genome per week, at a cost hundreds fold less than previous methods. The resulting data consists of reads ranging in length between 35 and 400 nt, from unknown locations in the genome. Analysis of these datasets poses an unprecedented informatics challenge due to the sheer number of reads that a single run of an HTS machine can produce, the shortness of the reads, and the various technologies' different sequencing biases and error rates. The two basic steps in the discovery of variants in the human population from reads coming from any of these technologies are: first, the mapping of reads to a finished (reference) genome, and second the identification of variation by analysis of these mappings.
In the last few years, there have been many approaches proposed for mapping reads from HTS technologies (Campagna et al., 2009; Langmead et al., 2009; Li and Durbin, 2009; Li et al., 2008a, b, 2009; Lin et al., 2008; Rumble et al., 2009 among many others; see Dalca and Brudno, 2010; Flicek and Birney, 2009 for reviews) that utilize a wide variety of approaches. Compared to this multitude of mapping tools, there have only been a handful of toolsets for single nucleotide polymorphism (SNP) and small (1–5 bp) indel discovery. The main challenge in detecting these variants is using the error rates of the sequencing platform, the potentially incorrect mappings, and the varying coverage to determine the likelihood that a position represents a heterozygous or homozygous variant with respect to some reference genome. We use the term heterozygous to refer to the case when a single donor allele differs from the reference, and homozygous to refer to the case when both donor alleles differ from the reference, and are the same as each other. Tri-allelic SNPs, when the two donor alleles differ from each other and from the reference, are rare. This variation detection task is further complicated by the different types of errors and data representation methods used by various technologies. For example, while the predominant error type in Illumina sequencing is the misreading of a base pair, in 454/Roche the most common mistake is insertion/deletion errors in a homopolymer (same base repeating multiple times). The AB SOLiD system introduced a dibase sequencing technique, where two nucleotides are read at every step of the sequencing process together as one color. Only four dyes are used for the 16 possible dibases (Fig. 1a), and the predominant error is the miscall of a color (colors are usually written as numbers 0–3). Most tools for variation detection (Li et al., 2008a, 2009; Marth et al., 1999) combine a detailed data preparation step, in which the reads are filtered, realigned and often rescored, with a nucleotide or heterozygosity calling step, typically done using a Bayesian framework. The typical parameters considered are the sequencing error rate, the SNP rate in the population (the prior) and the likelihood of misalignment (mapping quality). Most of the tools for SNP calling analyze one base of the reference genome at a time and do not use adjacent locations to help call SNPs (positions are considered independent).
AB SOLiD's dibase sequencing presents several unique challenges for SNP and indel identification. While typical, letter-space reads represent the DNA sequences directly as a string of A's, C's, G's and T's, one can think of dibase encoding as the output of a Finite State Automaton: consider each color as the shift from one letter to the next, so even though only four colors are generated, we can derive each subsequent letter if we know the previous one (Fig. 1b). Sequencing starts at the last letter of the molecule that connects to the DNA (the linker), which is known, thus enabling the translation of the whole read from color space into letter space. It is important to note, however, that if one of the colors in a read is misidentified (e.g. due to a sequencing error), this will change all of the subsequent letters in the translation (Fig. 1c). For this reason, simply translating the reads to letter-space would be impractical. While this error profile may at first seem detrimental, it can actually be advantageous when we need to decide if a particular difference between a read and the reference genome is due to an underlying change in DNA or a sequencing error: all SNPs will change two adjacent colors, while the probability that two adjacent colors are both misread is small, as error probabilities at adjacent positions are independent. Simultaneously, non-SNP genomic variants (e.g. polymorphisms at adjacent residues and micro-indels) have more complicated color-space signatures, complicating variation discovery.
Some tools for color-space SNP calling first map the reads in color space by translating the reference, but then translate the multiple alignment back to nucleotide space in order to call SNPs (Li and Durbin, 2009; Li et al., 2008a). McKernan et al. (2009) describe Corona Lite, a consensus technique where each valid pair of read colors votes for an overall base call. Currently, there are no methods that can simultaneously take full advantage of both color- and letter-space data to call variants—an important, consideration since the advantages and disadvantages of the various platforms are quite disparate. By combining these data sources, it is possible to exploit the strengths of multiple HTS technologies to improve on the accuracy of current SNP callers. Here, we present VARiD—a probabilistic approach for variant identification from either or both letter- and color-space data simultaneously. We represent both types of data as emissions from a hidden Markov model (HMM), while the underlying genotypes of the sequenced genome are the hidden states. By applying the forward–backward algorithm on the HMM we generate, for every base of the genome, a probability distribution over the possible bases. In our testing, VARiD performs more accurately than AB's Corona Lite pipeline for just color-space data, while its ability to incorporate letter-space data allows for more accurate determination of genomic variants using multiple read types, simultaneously.
In this section, we introduce our application of a HMM to the process of detecting variation from mapped reads. We begin by describing a simplified version of the model, and then describe the details of the full model and pipeline.
An HMM is a statistical model where the states of the system are hidden—that is, not observable directly—and respect a Markov progression. The observables are emissions from the hidden states. For a detailed introduction to HMMs, we refer the reader to Chapter 3 of Durbin et al. (1999). The structure of an HMM is defined in terms of the possible hidden states and the permitted transitions and hidden states and the permitted transitions between these. The model is parameterized by the emissions and transition probabilities. In the context of variation detection, we define the following HMM model (illustrated in Fig. 2):
In the previous subsection, we described a simplified HMM for variation detection that can use both color- and letter-space data. This simple HMM, however, calls only a single nucleotide per position, and cannot detect events such as micro-indels or heterozygous SNPs. In this section, we describe the full VARiD variation identification algorithm, including the expanded HMM utilized to address the above shortcomings, and the use of base and mapping quality values to parameterize the emission probabilities. We also describe the post-processing methods utilized in VARiD to filter some types of spurious calls. A summary of the VARiD pipeline and model is given in Figure 6.
The HMM that VARiD utilizes is memoryless: the information about the specific reads that generated certain letters and colors is not maintained. This leads to the possibility that a valid path through the state space is not supported by any reads. Figure 7b depicts an example that may result in a heterozygous SNP prediction: four counts of red and two counts of blue for the first position, and four yellow, and two green for the second. Red:yellow and blue:green are considered ‘valid’ adjacent color changes that typically support a SNP. In this case, however, there are no individual reads that support the blue:green combination, indicating that this combination is actually unlikely to appear in the genome and hence is unlikely to be a heterozygous variant. While the proper approach to fixing this problem would be to use a higher order HMM, this would be computationally inefficient. We instead supplement the current probabilistic model with a post-processing step, where we verify that a statistically likely fraction of the reads directly support each heterozygous SNP call. This approach is fast, as putative SNPs are rare.
The running time of the typical forward–backward algorithm is O(nt), where n is the length of the sequence and t is the number of permitted transitions. While t<k2, where k is the number of states, in the VARiD HMM k = 1600 and it is necessary to utilize sparse matrix operations to efficiently implement the forward–backward algorithm. Overall, the running time of VARiD is linear in the length of the genome. Furthermore, it is possible to parallelize VARiD over larger intervals by splitting the reference into smaller segments or windows, with the requirement that they be slightly overlapping. The overlapping regions can then be easily reconciled. VARiD required ~4 min on a single Intel P4 Xeon 3.2GHz machine to predict variants in the 80 kb of the human genome that we analyze in the next section.
To test VARiD, we utilized the dataset of Harismendy et al. (2009), who sequenced several regions of the human genome, spanning a total of 260 kb, from four individuals (NA17156, NA17275, NA17460 and NA17773), both with the AB SOLiD platform and the 454/Roche Pyrosequencer. To validate the SNP calls, the authors also resequenced 80 kb from the same regions with Sanger sequencing. From the original high-coverage datasets, we generated reduced coverage, randomly selected subsets from the individuals with different degrees of coverage. To analyze the AB SOLiD data we ran the SOLiD System Analysis Pipeline Tool (Corona Lite 4.2.2 with the 35_3 schema) on the color-space data, as well as VARiD with both the AB Pipeline mappings as well as SHRiMP (Rumble et al., 2009) mappings, for all of the read subsets. For the 454 data, we ran VARiD and gigaBayes (Marth et al., 1999) on the letter-space reads (using Mosaik and SHRiMP as the mapping tools). Finally, we tested our prediction pipeline on various color- and letter-space subsets combined. We compared the variants called by each method with the Sanger validation set to compute the following statistics:
The results of our analysis are illustrated in Figures 8–10, where we present results of color space only, results of letter space only and results for combinations of the two sequence types, respectively.
In Figure 8, we present results from variation identification with VARiD and the Corona Lite SNP caller (http://www.solidsoftwaretools.com/gf/project/mapreads) using the color-space data. We ran VARiD both with the alignments produced by the AB pipeline for the Corona caller and with alignments generated by SHRiMP. While the results as a whole demonstrate the difficulty of calling variants from color-space data, even at high coverages, a direct comparison of the two SNP calling pipelines shows that at low-coverage (10×) VARiD outperforms the Corona pipeline when using the same set of mappings generated by AB's own mapping tool, while at higher coverage VARiD has better precision and worse recall (and a lower F-measure). The VARiD + SHRiMP pipeline has slightly lower precision than Corona and VARiD + AB mapper, but a significantly better recall, leading to a higher F-measure score.
In Figure 9, we compare results of running the VARiD framework on the 454/Roche letter-space data using the Mosaik alignments as well as using the SHRiMP alignments, compared to gigaBayes using Mosaik alignments. At low coverage (1–5×), the gigaBayes SNP caller produces the best results, having higher precision with similar recall. At higher coverages (10–20×), VARiD outperforms gigaBayes with higher recall and higher precision, regardless of the mapper used to generate the alignments.
Figure 10 shows the main advantage of the VARiD pipeline: its ability to combine color- and letter-space reads. In determining useful combinations of the SOLiD and 454/Roche subsets for running on the VARiD framework together, we considered the cost and accuracy of each platform. The 454/Roche contains a relatively high indel count, but has much more accurate base calls. At the same time, the 454 platform is ~10 times more expensive. Therefore, we considered combining read coverages with 10-fold more AB SOLiD than 454 data. For example, we may combine 50× of color-space reads with 5× letter-space, giving us the equivalent of 100× of AB SOLiD or 10× of 454 in terms of cost. Of course, the best trade-off will vary depending on the costs of the platforms and their respective accuracies.
In Figure 10, we consider the various possible coverage combinations between the AB SOLiD data and the 454/Roche. In general, the performance of VARiD on a certain coverage of color-space data can be greatly improved with just a small number of 454 reads. More concretely, comparing at cost we can look at 50× coverage of color space with 5× coverage of 454 data: when combined, we find 84% precision and 77% recall. Looking at the cost equivalent coverage of just 454 data—10×—gives 7–9% lower precision and recall. Similarly, for the cost equivalent coverage of AB SOLiD data—100×—will again perform worse. Combining the data thus shows significant improvement over predicting variation from letter or color space only.
The various HTS technologies that have emerged in the past few years have different data representations, advantages, biases and features. In this work, we introduced a novel probabilistic framework for variation identification, which can use both letter- and color-space data simultaneously. We have shown in our results that when using only color-space data—a data type for which very few genomic analysis tools exist—the model outperforms the AB SOLiD toolkit Corona Lite, and performs on par with gigaBayes predictions for letter-space data alone. More importantly, when the color- and letter-space data are combined, the VARiD framework allows for a significant performance increase, demonstrating that a method that can take into consideration multiple technologies, combining their different advantages and compensating for their different weaknesses can achieve higher accuracy variant predictions than are possible from any single data type.
We thank Adrian Dalca Sr for help with the implementation.
Funding: National Sciences and Engineering Research Council (NSERC) of Canada; Mathematics of Information Technology and Complex Systems (MITACS) grant; Life Technologies (Applied Biosystems).
Conflict of Interest: none declared.