|Home | About | Journals | Submit | Contact Us | Français|
Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.
The rapid development of new sequencing technologies substantially extends the scale and resolution of many biological applications, including the scan of genome-wide variation , identification of protein binding sites (ChIP-seq), quantitative analysis of transcriptome (RNA-seq) , the study of the genome-wide methylation pattern  and the assembly of new genomes or transcriptomes . Most of these applications take alignment or de novo assembly as the first step; even in de novo assembly, sequence reads may still need to be aligned back to the assembly as most large-scale short-read assemblers [5, 6] do not track the location of each individual read. Sequence alignment is therefore essential to nearly all the applications of new sequencing technologies.
All new sequencing technologies in production, including Roche/454, Illumina, SOLiD and Helicos, are able to produce data of the order of giga base-pairs (Gbp) per machine day . With the emergence of such data, researchers have quickly realized that even the best tools for aligning capillary reads [8, 9] are not efficient enough given the unprecedented amount of data. To keep pace with the throughput of sequencing technologies, many new alignment tools have been developed in the last two years. These tools exploit the many advantages specific to each new sequencing technology, such as the short sequence length of Illumina, SOLiD and Helicos reads, the di-base encoding of SOLiD reads, the high base quality towards the 5′-end of Illumina and 454 reads, the low indel error rate of Illumina reads and the low substitution error rate of Helicos reads. Short read aligners outperform traditional aligners in terms of both speed and accuracy. They greatly boost the applications of new sequencing technologies as well as the theoretical studies of alignment algorithms.
This article aims to systematically review the recent advance with respect to alignment algorithms. It is organized as follows. We first review the progress on general alignment techniques and then examine their applications in the context of specific sequencing platforms and experimental designs. We will use simulated data to evaluate the necessity of gapped alignment and paired-end mapping, and present a list of alignment software that are actively maintained and widely used. Finally, we will discuss the future development of alignment algorithms.
Most of fast alignment algorithms construct auxiliary data structures, called indices, for the read sequences or the reference sequence, or sometimes both. Depending on the property of the index, alignment algorithms can be largely grouped into three categories: algorithms based on hash tables, algorithms based on suffix trees and algorithms based on merge sorting. The third category only consists of Slider  and its descendant SliderII . This review will therefore focus on the first two categories.
The idea of hash table indexing can be tracked back to BLAST [12, 13]. All hash table based algorithms essentially follow the same seed-and-extend paradigm. BLAST keeps the position of each k-mer (k = 11 by default) subsequence of the query in a hash table with the k-mer sequence being the key, and scans the database sequences for k-mer exact matches, called seeds, by looking up the hash table. BLAST extends and joins the seeds first without gaps and then refines them by a Smith–Waterman alignment [14, 15]. It outputs statistically significant local alignments as the final results.
The basic BLAST algorithm has been improved and adapted to alignments of different types. Nevertheless, the techniques discussed below focus on mapping a set of short query sequences against a long reference genome of the same species.
BLAST seeds alignment with 11 consecutive matches by default. Ma et al.  discovered that seeding with non-consecutive matches improves sensitivity. For example, a template ‘111010010100110111’ requiring 11 matches at the ‘1’ positions is 55% more sensitive than BLAST’s default template ‘11111111111’ for two sequences of 70% similarity. A seed allowing internal mismatches is called spaced seed; the number of matches in the seed is its weight.
Eland (A.J. Cox, unpublished results) was the first program that utilized spaced seed in short-read alignment. It uses six seed templates spanning the entire short read such that a two-mismatch hit is guaranteed to be identified by at least one of the templates, no matter where the two mismatches occur. SOAP  adopts almost the same strategy except that it indexes the genome rather than reads. SeqMap  and MAQ  extends the method to allow k-mismatches, but to be fully sensitive to k-mismatch hits, they require templates, which is exponential in k and thus inefficient given large k. To improve the speed, MAQ only guarantees to find two-mismatch hit in the first 28 bp of each read, the most reliable part of an Illumina read. It extends the partial match when a seed match is found.
RMAP [20, 21], which is based on the Baeza–Yates–Perleberg algorithm , applies a different set of seed templates. It effectively uses k + 1 templates to find k-mismatch hits. RMAP reduces the number of templates, but for large k, the weight of each template is small. Such a strategy cannot fully take the advantage of hash table indexing in this case as many candidates will be returned.
An improvement is achieved by Lin et al.  who present the optimal the way to design a minimum number of spaced seeds, given a specified read length, sensitivity requirement and memory usage. For example, their program ZOOM is able to identify all two-mismatch hits for 32 bp reads using five seed templates of weight 14. In comparison, RMAP uses three templates of weight 10; Eland uses six templates of weight 16 but with only 12.5 bases indexed in the hash table to reduce memory requirement. As the time complexity of spaced seed algorithm is approximately proportional to where q is the weight, m the number of templates, n the number of reads and L the genome size, ZOOM has better theoretical time complexity given limited memory.
The memory required by hashing genome is usually bytes where s is the sampling frequency . It is memory demanding to hold in RAM a hash table with q larger than 15. Homer et al.  proposed a two-level indexing scheme for any large q. They build a hash table for j-long (j < q, typically 14) bases. To find a q-long key, they look up the hash table from the first j bases and then perform a binary search among elements stored in the resulting bucket. Looking up a q-long key takes time, only slightly worse than the optimal speed O(1). The peak memory becomes independent of q. A similar idea is also used by Eland and MAQ, but they index reads instead of the genome.
Many other aligners [26–28] also use spaced seed with different templates designed specifically for the reference genome and sensitivity tolerances, making spaced seed the most popular approach for short-read alignment.
A potential problem with consecutive seed and spaced seed is they disallow gaps within the seed. Gaps are usually found afterwards in the extension step by dynamic programming, or by attempting small gaps at each read positions [17, 18]. The q-gram filter, as is implemented in SHRiMP  and RazerS , provides a possible solution to building an index natively allowing gaps. The q-gram filter is based on the observation that at the occurrence of a w-long query string with at most k differences (mismatches and gaps), the query and the w-long database substring share at least (w + 1) − (k + 1)q common substrings of length q [31–33]. Methods based on spaced seeds and the q-gram filter are similar in that they both rely on fast lookup in a hash table. They are mainly different in that the former category initiates seed extension from one long seed match, while the latter initiates extension usually with multiple relatively short seed matches. In fact, the idea of requiring multiple seed matches is more frequently seen in capillary read aligners such as SSAHA2 and BLAT; it is a major technique to accelerate long-read alignment.
Due to the use of long spaced seeds, many aligners do not need to perform seed extension or only extend a seed match without gaps, which is much faster than applying a full dynamic programming. Nonetheless, several improvements over BLAST have been made regarding on seed extension. A major improvement comes from the recent advance in accelerating the standard Smith–Waterman with vectorization. The basic idea is to parallelize alignment with the CPU SIMD instructions such that multiple parts of a query sequence can be processed in one CPU cycle. Using the SSE2 CPU instructions implemented in most latest x86 CPUs,  derived a revised Smith–Waterman algorithm that is over 10 times faster than the standard algorithm. Novoalign (http://novocraft.com), CLC Genomics Workbench (http://clcbio.com/index.php? id = 1240) and SHRiMP are known to make use of vectorization.
Another improvement is achieved by constraining dynamic programming around seeds already found in the seeding step [25, 35, 36]. Thus unnecessary visits to cells far away from seed hits in iteration are greatly reduced. In addition, Myers  found that a query can be aligned in full length to an L-long target sequence with up to k mismatches and gaps in O(kL) time, independent of the length of the query. These techniques also help to accelerate the alignment when dynamic programming is the bottleneck.
All algorithms in this category essentially reduce the inexact matching problem to the exact matching problem and implicitly involve two steps: identifying exact matches and building inexact alignments supported by exact matches. To find exact matches, these algorithms rely on a certain representation of suffix/prefix trie, such as suffix tree, enhanced suffix array  and FM-index . The advantage of using a trie is that alignment to multiple identical copies of a substring in the reference is only needed to be done once because these identical copies collapse on a single path in the trie, whereas with a typical hash table index, an alignment must be performed for each copy.
It should be noted that the choice of these data structures is independent of methods for finding inexact matches. An algorithm built upon FM-index, for example, would also work with suffix tree index in principle.
A suffix trie, or simply a trie, is a data structure that stores all the suffixes of a string, enabling fast string matching. To establish the link between a trie and an FM-index, a data structure based on Burrows-Wheeler Transform (BWT) , we focus on prefix trie which is the trie of the reverse string. All algorithms on a trie can be seamlessly applied to the corresponding prefix trie.
Figure 1A gives the prefix trie of AGGAGC. Finding all exact matches of a query sequence is equivalent to searching for a path descending from the root where each edge label on the path matches a query letter in the reverse order. If such a path exists, the query is a substring. Given a query AGC, for example, the path matching the query is [0, 6]→[3, 3]→[5, 5]→[1, 1].
The time complexity of determining if a query has an exact match against a trie is linear in the length of the query, independent of the length of the reference sequence. However, a trie takes O(L2) space where L is the length of the reference. It is impractical to build a trie even for a bacterial genome. Several data structures are proposed to reduce the space. Among these data structures, a suffix tree (Figure 1C) is most widely used. It achieves linear space while allowing linear-time searching. Although it is possible in theory to represent a suffix tree in L log2L + O(L) bits using rank-selection operations , even the most space efficient implementation of bioinformatics tools requires 12–17 bytes per nucleotide , making it impractical to hold the suffix tree of the human genome in memory.
To solve this problem, Abouelhoda et al.  derived an enhanced suffix array that consists of a suffix array and several auxiliary arrays, taking 6.25 bytes per nucleotide. It can be regarded as an implicit representation of suffix tree, and has an identical time complexity to suffix tree in finding exact matches, better than the suffix array originally invented by Manber and Myers .
A further improvement on memory is achieved by Ferragina and Manzini  who proposed the FM-index and found that locating a child of a parent node in the prefix trie can be done in constant time using a backwards search on this data structure. Thus the time complexity of finding exact matches with an FM-index is identical to that with a trie. With respect to memory, the FM-index was originally designed as a compressed data structure such that the theoretical index size can be smaller than the original string if the string contains repeats (equivalently, has small entropy). The FM-index is usually not compressed for better performance during alignment since DNA sequences have a small alphabet. The practical memory footprint of an FM-index is typically 0.5–2 bytes per nucleotide, depending on implementations and the parameters in use. The index of the entire human genome only takes 2–8 GB of memory.
It is worth noting that we only focus on the data structures having been used for DNA sequence alignment. There is a large volume of literature in computer science on general theory of string matching, especially on short string matching. Readers are referred to ref.  for a more comprehensive review in a wider scope. Nevertheless, traditional string matching algorithms strive for completeness, while many current aligners sacrifice absolute completeness for speed.
Of published aligners that can be used for query-reference alignment, MUMmer  and OASIS  are based on suffix tree, Vmatch  and Segemehl  on enhanced suffix array, and Bowtie , BWA , SOAP2 , BWT-SW  and BWA-SW  on FM-index. As explained above, a program built upon one representation of suffix/prefix trie can be easily migrated to another. The FM-index is most widely used mainly due to its small memory footprint.
As to the algorithms for inexact matching, MUMmer and Vmatch anchor the alignment with maximal unique matches (MUMs), maximal matches, maximal repeats or exact matches, and then join these exact matches with gapped alignment. Similarly, Segemehl initiates the alignment with the longest prefix match of each suffix, but it may also enumerate mismatches and gaps at certain positions of the query to reduce false alignments.
OASIS and BWT-SW essentially sample substrings of the reference by a top–down traversal on the trie and align these substrings against the query by dynamic programming. BWA-SW furthers BWT-SW by representing the query as a directed word graph (DAWG) , which also enables it to deploy heuristics to accelerate alignment.
Bowtie and BWA also sample short substrings of the reference, but instead of performing dynamic programming, they compare the query and sampled substrings only allowing a few differences. In addition, as they require the entire read to be aligned, the traversal of the trie can be bounded since it is unnecessary to descend deeper in the trie if it can be predicted that doing so leads to an alignment with excessive mismatches and gaps. Alternatively, Bowtie and BWA can be considered to enumerate all combinations of possible mismatches and gaps in the query sequence such that the altered query can be aligned exactly.
The algorithms reviewed above are general techniques. Depending on the characteristics of the sequencing technologies and their applications, aligners for new sequence reads also implement extra features.
Sequence reads from Illumina and SOLiD technologies were initially 25 bp in length. Performing gapped alignment for such short reads is computationally challenging because allowing a gap in this case will slow down most seeding algorithms. Fortunately, the growing read length makes gapped alignment tractable, although this feature still comes at the cost of efficiency. This raises the question about whether gapped alignment is worth doing.
From Figure 2A, it is clear that gapped alignment (curve ‘gap-se’) increases sensitivity by a few percent in comparison to ungapped alignment (curve ‘ungap-se’), but does not reduce alignment errors. To this end, gapped alignment does not seem to be an essential feature. However, gapped alignment plays a far more important role in variant discovery [53, 54]. When gapped alignment is not implemented, a read containing an indel polymorphism may still be mapped to the correct position but with consecutive mismatches towards the underlying location of the indel. These mismatches can be seen on multiple reads mapped to the same locus, which cause most variant callers to call false SNPs. As a result, more false SNPs are predicted from ungapped alignment (Figure 2B) and these SNPs cannot be easily filtered out even with the help of sophisticated tools such as the GATK realigner (http://tinyurl.com/broad-gatk); all high-quality false SNPs by ‘gap-se’ are also around undetected long indels. Furthermore, lacking gapped alignment may also lead to false structural variation calls at least for some algorithms. For example, on the simulated data used in Figure 2B, when the aligner in use does ungapped alignment only, an indel polymorphism is seen causing seven reads mapped to a wrong position with high confidence. BreakDancer  predicts a high scoring translocation based on the wrong alignments. Effective gapped aligners such as BWA and novoalign (http://novocraft.com) do not produce this false translocation. Therefore, gapped alignment is essential to the variant discovery, but how ChIP- and RNA-seq  may be affected is an open question.
Some sequencing technologies produce read pairs such that the two reads are known to be close to each other in physical chromosomal distance. These reads are called paired-end or mate-pair reads. With this mate-pair information, a repetitive read will be reliably placed if its mate can be placed unambiguously. Alignment errors may be detected and fixed when wrong alignments break the mate-pair requirement. Figure 2A shows that paired-end alignment outperforms single-end alignment in terms of both sensitivity and specificity. The gain of sensitivity is also obvious from SNP discovery (Figure 2B).
In addition, it is worth noting here that although curve ‘novo-pe’ outperforms ‘gap-pe’ in Figure 2A, the accuracy of SNP calls are similar in Figure 2B. This is possibly because the extra alignment errors from ‘gap-pe’ are random and thus contribute little to variant discovery.
Smith et al.  discovered that using base quality scores improves alignment accuracy because knowing the error probability of each base, the aligner may pay lower penalty for an error-prone mismatch. Figure 3 shows that using base quality score halves alignment errors when the quality score is accurate. In practice, however, accurate quality score is not always available from the base calling pipeline. Recalibration of quality score is recommended to make this strategy more effective.
Long reads have greater potential than short reads to contain long indels, structural variations and misassemblies in the reference genome. It is essential for a long-read aligner to be permissive about alignment gaps and to allow partially aligned read sequences in alignment. At present, all programs capable of genome-wide long-read alignment follow the seed-and-extend paradigm, seeding the alignment using hash table index [8, 9] or more recently FM-index [50, 51], and extending seed matches with the banded Smith–Waterman algorithm. This allows for sensitive detection of indels as well as allowing for partial hits.
The SOLiD sequencing technology observes two adjacent bases simultaneously. Each dinucleotide (16 possibilities) is encoded as one of four possible colors, with the encoding referred to as color space (Figure 4A). Although the known primer base allows for the decoding of a color read to bases (Figure 4B), contiguous errors will arise from a single color sequencing error in this conversion (Figure 4D). Thus algorithms that naively decode a color read will fail. Given the fact that reverse complementing a base sequence is equivalent to reversing the color sequence (Figure 4C), the proper solution is to encode the reference as a color sequence and align color reads directly to the color reference as if they are base sequences with the exception of the complementing rule. After alignment, color sequence can be converted to base sequences with dynamic programming .
Performing alignment entirely in the color space may not be optimal, though. With color encoding, one base mutation leads to two contiguous color changes with some restrictions (Figure 4E). Two adjacent consistent color changes are preferred over two discontinuous changes. A better solution is to perform a color-aware Smith–Waterman alignment found in BFAST and SHRiMP [29, 56]. This extension to the standard Smith–Waterman algorithm allows the detection of indels without the aid of post-alignment analysis at the cost of increased computational complexity. Most alignment algorithms described in the previous sections can be applied to SOLiD sequencing reads with few modifications making them color space aware.
Bisulfite sequencing is a technology to identify methylation patterns . From the alignment point of view, unmethylated ‘C’ bases, or cytosines, are converted to ‘T’ (sequences 1 and 4 in Figure 5) and ‘G’ bases complement those cytosines converted to ‘A’ (sequences 2 and 3). Directly aligning converted sequences against the standard reference sequence would be difficult due to the excessive mismatches. Most aligners capable of bisulfite alignments [24, 57] do the following. They create two reference sequences, one with all ‘C’ bases converted to ‘T’ bases (the C-to-T reference) and the other with all ‘G’ bases converted to ‘A’ bases (the G-to-A reference). In alignment, ‘C’ bases are converted to ‘T’ base for reads and are mapped to the C-to-T reference (then a C–T mismatch is effectively regarded as a match); a similar procedure is performed for the G-to-A conversion in the next round of alignment. The results from two rounds of alignment are combined to generate the final report. If there are no mutations or sequencing errors, a bisulfite treated read can always be mapped exactly in one of the two rounds.
Transcriptome sequencing, or RNA-seq , produces reads from transcribed sequences with introns and intergenetic regions excluded. When RNA-seq reads are aligned against the genomic sequence, a read may be mapped to a splicing junction, which will fail with a standard alignment algorithm. It is possible to add sequences around known or predicted splicing junctions to the ref.  or more cleverly to make the alignment algorithm aware of known splicing junctions . Nevertheless, novel splicing will not be discovered this way.
QPALMA  and TopHat  were developed to solve this problem. They first align reads to the genome using a standard mapping program and identify putative exons from clusters of mapped reads or from reads mapped into introns at their last few bases, possibly aided by splicing signals learnt from real data. In the next round, potential junctions are enumerated within a certain distance around the putative exons. The unmapped reads are then aligned against the sequences flanking the possible junctions. Novel junctions can thus be found. However, Trapnell et al.  have reported that only 72% of splicing sites found by ERANGE  can be identified by TopHat without using known splicing sites (TopHat is able to consider known splicing), indicating that incorporating known splicing sites in alignment may be necessary to RNA-seq. Readers are also referred to ref.  for a more comprehensive review on practical issues on processing RNA-seq data.
Reads mapped to the same locus are highly correlated, but all read aligners map a read independent of others and thus cannot make use of the correlation between reads or the expected coverage at the same position. Especially in the presence of indels, not using this correlation may lead to wrong an alignment around the tail of a read. For indel calling, it is necessary to perform multi-alignment for reads mapped to the same locus. Realigner  is such a tool, but originally designed for capillary read alignment. GATK implements a different algorithm for new sequencing data. Sophisticated indel callers such as SAMtools  also implicitly realign reads around potential indels.
Over 20 short-read alignment software have been published in the past 2 years and dozens of more are still unpublished. The availability of these tools greatly boosts the development of alignment algorithms, yet only a few of them are being heavily used. Table 1 gives a list of free popular short-read alignment software packages, based on the ‘tag cloud’ of the SEQanswers forum (http://seqanswers.com). They all output alignments in the SAM format , the emerging standard alignment format which is widely supported by alignment viewers such as GBrowse , LookSeq , Tablet , BamView , Gambit (http://tinyurl.com/gambit-viewer), IGV (http://www.broadinstitute.org/igv/) and MagicViewer (http://bioinformatics.zj.cn/magicviewer/), as well as generic variant callers such as SAMtools, GATK (http://tinyurl.com/broad-gatk) VarScan  and BreakDancer .
On speed, Bowtie and BWA typically align ~7 Gbp against the human genome per CPU day. In comparison, the standard Illumina base caller, Bustard, processes ~3 Gbp per CPU day and the real-time image analysis requires similar amount of CPU time to base calling (Skelly and Bonfields, personal communication). Therefore, alignment is no longer the most time consuming step in the entire data processing pipeline. Speed improvement will not greatly reduce the time spent on data analyses.
For long reads, SSAHA2 and BLAT are still the most popular aligners. However, their alignment speed is of the order of 0.5 Gbp per CPU day, much slower than short-read aligners. Recently developed algorithms such as Mosaik and BWA-SW are faster and may alleviate this computational bottleneck.
Short-read alignment is thought to be the computing bottleneck of the analysis of new sequencing data. Fortunately, the active development of alignment algorithms opens this bottleneck even with the rapidly increasing throughput of sequencing machines. In a couple of years, however, long reads will dominate again and programs developed for short reads will not be applicable; long-read alignment and de novo assembly will become crucial.
In addition, while major sequencing centers have sufficient localized computing resources to analyze data at present, such resources are not available to small research groups, which hamper the application of new sequencing technologies. Even between major centers, data sharing in a large collaborative project such as the 1000 genomes project (http://1000genomes.org) poses challenges. One possible solution to these problems might be cloud computing, with data uploaded and analyzed in a shared cloud. Several researchers [26, 68] have explored this approach, but establishing a cloud computing framework requires the efforts of the entire community. Furthermore, data transfer bottlenecks and leased storage have yet to be proved cost-effective for cloud computing.
Another trend of development is the simultaneous alignment against multiple genomes. Li et al.  have found the presence of extensive novel sequences absent from the human reference genome, which may lead to the loss of information when reads are aligned to a single genome. In the light of large-scale resequence projects such as the 1000 genomes project, the Drosophila population genomics project (http://dpgp.org) and the 1001 genomes project (http://1001genomes.org), alignment against multiple genomes will become increasingly important. Several groups [70, 71] have pioneered in this direction; the proposal of unifying multi-genome alignment and de novo assembly with an assembly graph (Birney and Durbin, personal communication) is attractive, but how to apply the methods given genome-wide human data is yet to be solved in practice.
H.L. is supported by the NIH grant 1U01HG005208-01 and N.H. by 1U01HG005210-01.
We thank the three anonymous reviewers whose comments helped us to improve the manuscript.
Heng Li is a postdoctoral researcher at the Broad Institute and used to work at the Sanger Institute where he developed several popular alignment algorithms.
Nils Homer is a PhD student in Computer Science and Human Genetics departments of UCLA. He developed the BFAST alignment algorithm.