|Home | About | Journals | Submit | Contact Us | Français|
Summary: We have developed a tool, called ProbeMatch, for matching a large set of oligonucleotide sequences against a genome database using gapped alignments. Unlike most of the existing tools such as ELAND which only perform ungapped alignments allowing at most two mismatches, ProbeMatch generates both ungapped and gapped alignments allowing up to three errors including insertion, deletion and mismatch. To speedup sequence alignment, ProbeMatch uses gapped q-grams and q-grams of various patterns to identify target hits to a query sequence. This approach results in fewer initial sequences to examine with no loss in sensitivity. ProbeMatch has been used to align 169 095 Illumina GAII reads against the human genome, which could not be mapped by ELAND, and found alignments for 28 625 reads of the 169 095 reads in less than 3 h.
Availability: Source code is freely available at http://www.cs.wisc.edu/~jignesh/probematch/
Supplementary information: Supplementary data are available at Bioinformatics online.
New high-throughput DNA sequence technologies play an important role in modern life science research. These high-throughput methods, such as the Illumina and 454 Life Sciences technologies, produce a large volume of sequence data, which can be used for a variety of tasks including genome resequencing and genome-wide polymorphism discovery. These methods produce a large set (thousands or millions) of short sequences that often must be mapped to a genome, allowing for only a few errors. Traditional sequence alignment tools such as BLAST (Altschul et al., 1997) provide the necessary functionality, but they are computationally prohibitively expensive for this task.
To address this computational problem, several programs have been developed. ELAND (A.Cox, ELAND: efficient local alignment of nucleotide data, unpublished data), which is a part of the data analysis pipeline for the Illumina analyzer, is designed to search DNA databases for a large number of short sequences. To speedup the data processing, ELAND performs only ungapped alignments allowing up to two mismatches. MAQ (H.Li, Mapping and assembly with quality, http://maq.sourceforge.net, unpublished data) is another alignment program designed for Illumina analyzer, which also performs only ungapped alignments allowing up to three mismatches. In addition, using sequence quality information, MAQ measures the error probability of alignments. SOAP (Li et al., 2008) is another tool, which allows for ungapped alignments, and alignments with one continuous gap (one opening gap and up to two extended gaps) and no mismatch. For example, SOAP will find an alignment with one continuous gap of size 2, but will not find an alignment with one gap and one mismatch. SeqMap (Jiang et al., 2008) is another tool similar to ProbeMatch which allows gapped alignment. Compared with SOAP, which allows only one continuous gap, SeqMap allows any gap combinations, up to three gaps. However, its performance is much slower than SOAP and degrades significantly once the gaps are allowed.
Compared with traditional programs such as BLAST, these newer programs are often faster by an order of magnitude or more, but these programs usually map only 60–80% of the query sequences to genomes, leaving a significant fraction of the sequence workload for further processing using computationally expensive but sensitive alignment methods (Hillier et al., 2008). Unfortunately, as predicted by Amdahl's law (Amdahl, 1967), the overall gain using these faster methods is limited and does not adequately address the pressing end-to-end computational problem of rapidly and accurately mapping a large number of oligonucleotide sequences against entire genomes. In this work, we address this challenge and present an efficient and highly sensitive sequence alignment technique for a large set of short sequences. Unlike most of the existing methods, our program allows for a richer match model and finds gapped and ungapped alignments with up to three errors of any error combinations (mismatch, insertion and deletion). The ability to identify both gapped and ungapped alignments has the added advantage of being able to detect multiple classes of mutations: single nucleotide variations (SNVs) and insertions or deletions (indels). The genetic variation provided by each class of mutations can occur in coding or regulatory regions thereby altering the function of important proteins which have a major impact on human health and susceptibility to disease.
ProbeMatch has been used to align 169 095 Illumina GAII reads against a human genome, which were not mapped by ELAND, and found alignments for 28 625 reads of these 169 095 sequences in ≤3 h. It is interesting to note that a large number of reads are not mapped. A likely reason, based on our experience and other labs that are working with the relatively new Illumina technology, is that these may be due to adaptor sequences, low-quality sequences, sequences with many error in base calling, contaminants, etc. It is also likely that more indels/mismatches could be responsible for these if they have large gain/loss of nucleotide. We plan on exploring these issues further in the future as we gain more experience with this technology.
ProbeMatch takes as input a query sequence set and a database of sequences. Since the database is often large and cannot fit in memory, ProbeMatch divides the database into small segments. ProbeMatch loads each database segment and builds a q-gram index. For each query sequence, ProbeMatch searches against the q-gram index to find potential hits and extends hits to find longer alignments. This process is repeated for each database segments.
To speedup sequence alignment, ProbeMatch uses a filtering technique based on the following lemma: If two sequences, Q and T, match within k errors and j non-overlapping fragments are taken from Q, then the matching sequence T contains at least one of the fragments with at most k/j errors (error includes not only mismatches but also insertions and deletions).
Based on this lemma, to find matching sequences with at most three errors (k = 3), first, the query sequence is split into two fragments (j = 2). Next, matching target sequence that can be aligned to one of the fragments within one error (k/j=1) are found. Finally, the matched hits are extended to check if the entire query sequence and the target sequence can be aligned within three errors.
The ProbeMatch index is a gapped q-gram index. A gapped q-gram is a non-contiguous alphabet sequence with q alphabet matches and has ‘don't cares’ at specified positions (Burkhardt and Kärkkäinen, 2002). For example, one gapped 3-gram is a non-contiguous sequence of length 4, which has three exact alphabet matches and one ‘don't care’ at a specified position. There exist several patterns of a gapped q-gram depending on the ‘don't care’ position. For instance, one gapped 3-gram has two different patterns, ##_# and #_##, where # and _ corresponding to an alphabet match and a ‘don't care’, respectively.
The gapped q-gram has an advantage over an ungapped q-gram as a carefully selected gapped q-gram pattern provides a more efficient filtering than an ungapped q-gram. Details regarding the choice of the gapped q-gram are presented in the Supplementary Material.
We evaluated the performance of ProbeMatch using transcriptome data from a prostate cell line (RWPE), generated by the Illumina Genome Analyzer. The queries in our experiment are 169 095 transcriptome short reads (36 nt), that were not mapped to the human genome by ELAND. This query set was matched against the human genome using various alignment programs on a 3.12 GHz Intel Xeon, 2 MB Cache, 4 GB RAM, Linux Fedora 2.6.9 machine. Results are shown in Table 1.
These results include two runs of BLAST with two different word sizes—the default word size of 11, and a smaller word size of 9. For BLAST, the DUST filter option is disabled. We ran SOAP in both ungapped and gapped alignment modes. We report the number of unique queries that align with less than or equal to three errors.
Table 1 compares sensitivity and speed of various programs. It shows the number of queries mapped to the genome and total execution times of each program. As shown in Table 1, ProbeMatch is 10–15 times faster than BLAST, while aligning more queries. Compared with SOAP, ProbeMatch is 3.6–7 times slower, but it aligns 30–40% more queries: ProbeMatch and SOAP align 26 467 and 18 681 queries, respectively. Both programs found ungapped alignments with three mismatches for 16 024 queries. They also found gapped alignments with one continuous gap for 2657 queries. In addition to these alignments, ProbeMatch found alignments for 9944 queries, which cannot be found by SOAP. These queries are the ones with more than one gap, or have both gaps and mismatches. Compared with SeqMap, both SeqMap and ProbeMatch found the same number of alignments when three mismatches with up to three gaps are allowed. However, as shown in Table 1, ProbeMatch is 25 times faster than SeqMap.
We thank Terrence Barrette, Xuhong Cao and Chandan Kumar for making libraries, setting up machines and handling analysis pipeline.
Funding: National Science Foundation (DBI-0543272); National Institutes of Health (1-U54-DA021519-01A1).
Conflict of Interest: none declared.