Following the development of sensitive local alignment software, such as FASTA (Pearson and Lipman, 1988
) and BLAST (Altschul et al.
) around 1990, a new generation of faster methods to find DNA sequence matches was developed since 2000, including MegaBLAST (Morgulis et al.
; Zhang et al.
), SSAHA2 (Ning et al.
), BLAT (Kent, 2002
) and PatternHunter (Ma et al.
), greatly speeding up matching capillary sequencing reads against a large reference genome. When new sequencing technologies arrived that generated millions of short (<100 bp) reads, a variety of new algorithms were developed which were 10–1000 times faster, including SOAP (Li,R. et al.
), MAQ (Li,H. et al.
), Bowtie (Langmead et al.
) and BWA (Li and Durbin, 2009
). However, Roche/454 sequencing technology has already produced reads >400 bp in production, Illumina gradually increases read length >100 bp, and Pacific Bioscience generates 1000 bp reads in early testing (Eid et al.
). Reads coming from the new sequencing technologies are not short any more, which effectively rules out many of the new aligners exclusively designed for reads no longer than 100 bp. Efficiently aligning long reads against a long reference sequence like the human genome poses a new challenge to the development of alignment tools.
Long-read alignment has different objectives from short-read alignment. First, in short-read alignment, we would usually like to align the full-length read to reduce the reference bias caused by the mismatches toward the ends of the read. Given this requirement, we can design spaced seed templates (Ma et al.
) spanning the entire read (Jiang and Wong, 2008
; Lin et al.
; Smith et al.
), or quickly filter out poor matches, for example, by applying q-gram filtration (Rumble et al.
; Weese et al.
) or by bounding the search process (Li and Durbin, 2009
), and thus accelerate the alignment. In long-read alignment, however, we would prefer to find local matches because a long read is more fragile to structural variations and misassemblies in the reference but is less affected by the mismatches close to the ends of a read. Secondly, many short-read aligners are only efficient when doing ungapped alignment or allowing limited gaps, e.g. a maximum of one gap. They cannot find more gaps or the performance quickly degrades when they are tuned for this task. Long-read aligners, however, must be permissive about alignment gaps because indels occur more frequently in long reads and may be the dominant source of sequencing errors for some technologies such as 454 and Pacific Bioscience.
When considering algorithms to speed-up long-read alignment, hash table indexing as is used in most current software is not the only choice. Meek et al.
) found a Smith–Waterman-like dynamic programming that can be applied between a query sequence and the suffix tree of the reference, effectively aligning the query against each subsequence sampled from the suffix tree via a top-down traversal. As on a suffix tree identical sequences are collapsed on a single path, time is saved by avoiding repeated alignment of identical subsequences. Lam et al.
) furthered this idea by implicitly representing the suffix tree with an FM-index (Ferragina and Manzini, 2000
), which is based on the Burrows–Wheeler Transform (BWT; Burrows and Wheeler, 1994
), to achieve a small memory footprint. Their new algorithm, BWT-SW, is able to deliver identical results to the standard Smith–Waterman alignment, but thousands of times faster when aligning against the human genome sequence. While BWT-SW is still slower than BLAST on long query sequences, it finds all matches without heuristics. One can imagine that introducing heuristics would further accelerate BWT-SW. Our BWA-SW algorithm follows this route.
To some extent, BWA-SW, as well as BWT-SW, also follows the seed-and-extend paradigm. But different from BLAT and SSAHA2, BWA-SW finds seeds by dynamic programming between two FM-indices and allows mismatches and gaps in the seeds. It extends a seed when the seed has few occurrences in the reference sequence. Speed is gained by reducing unnecessary extension for highly repetitive sequences.
In this article, we will describe this new alignment algorithm, BWA-SW, for long-read alignments and evaluate its practical performance along with BLAT and SSAHA2 on both simulated and real data. We will also give a brief introduction to suffix array and FM-index, but readers are referred to Li and Durbin (2009
) for more details.