2.1 Prefix trie and string matching

The prefix trie for string *X* is a tree where each edge is labeled with a symbol and the string concatenation of the edge symbols on the path from a leaf to the root gives a unique prefix of *X*. On the prefix trie, the string concatenation of the edge symbols from a node to the root gives a unique substring of *X*, called the string represented by the node. Note that the prefix trie of *X* is identical to the suffix trie of reverse of *X* and therefore suffix trie theories can also be applied to prefix trie.

With the prefix trie, testing whether a query

*W* is an exact substring of

*X* is equivalent to finding the node that represents

*W*, which can be done in

*O*(|

*W*|) time by matching each symbol in

*W* to an edge, starting from the root. To allow mismatches, we can exhaustively traverse the trie and match

*W* to each possible path. We will later show how to accelerate this search by using prefix information of

*W*. gives an example of the prefix trie for ‘

`GOOGOL`’. The suffix array (SA) interval in each node is explained in

Section 2.3.

2.2 Burrows–Wheeler transform

Let Σ be an alphabet. Symbol $ is not present in Σ and is lexicographically smaller than all the symbols in Σ. A string *X*=*a*_{0}*a*_{1}…*a*_{n−1} is always ended with symbol $ (i.e. *a*_{n−1}=$) and this symbol only appears at the end. Let *X*[*i*]=*a*_{i}, *i*=0, 1,…, *n*−1, be the *i*-th symbol of *X*, *X*[*i*, *j*]=*a*_{i} …*a*_{j} a substring and *X*_{i}=*X*[*i*, *n*−1] a suffix of *X*. Suffix array *S* of *X* is a permutation of the integers 0…*n*−1 such that *S*(*i*) is the start position of the *i*-th smallest suffix. The BWT of *X* is defined as *B*[*i*]=$ when *S*(*i*)=0 and *B*[*i*]=*X*[*S*(*i*)−1] otherwise. We also define the length of string *X* as |*X*| and therefore |*X*|=|*B*|=*n*. gives an example on how to construct BWT and suffix array.

The algorithm shown in is quadratic in time and space. However, this is not necessary. In practice, we usually construct the suffix array first and then generate BWT. Most algorithms for constructing suffix array require at least

*n*log

_{2}*n* bits of working space, which amounts to 12 GB for human genome. Recently, Hon

*et al.* (

2007) gave a new algorithm that uses

*n* bits of working space and only requires <1 GB memory at peak time for constructing the BWT of human genome . This algorithm is implemented in BWT-SW (Lam

*et al.*,

2008). We adapted its source code to make it work with BWA.

2.3 Suffix array interval and sequence alignment

If string

*W* is a substring of

*X*, the position of each occurrence of

*W* in

*X* will occur in an interval in the suffix array. This is because all the suffixes that have

*W* as prefix are sorted together. Based on this observation, we define:

In particular, if

*W* is an empty string,

*R*(

*W*)=1 and

. The interval

is called the

*SA interval* of

*W* and the set of positions of all occurrences of

*W* in

*X* is

. For example in , the SA interval of string ‘

`go`’ is [1, 2]. The suffix array values in this interval are 3 and 0 which give the positions of all the occurrences of ‘

`go`’.

Knowing the intervals in suffix array we can get the positions. Therefore, sequence alignment is equivalent to searching for the SA intervals of substrings of *X* that match the query. For the exact matching problem, we can find only one such interval; for the inexact matching problem, there may be many.

2.4 Exact matching: backward search

Let

*C*(

*a*) be the number of symbols in

*X*[0,

*n*−2] that are lexicographically smaller than

*a* Σ and

*O*(

*a*,

*i*) the number of occurrences of

*a* in

*B*[0,

*i*]. Ferragina and Manzini (

2000) proved that if

*W* is a substring of

*X*:

and that

if and only if

*aW* is a substring of

*X*. This result makes it possible to test whether

*W* is a substring of

*X* and to count the occurrences of

*W* in

*O*(|

*W*|) time by iteratively calculating

*R* and

from the end of

*W*. This procedure is called

*backward search*.

It is important to note that Equations (

3) and (

4) actually realize the top-down traversal on the prefix trie of

*X* given that we can calculate the SA interval of a child node in constant time if we know the interval of its parent. In this sense, backward search is equivalent to exact string matching on the prefix trie, but without explicitly putting the trie in the memory.

2.5 Inexact matching: bounded traversal/backtracking

gives a recursive algorithm to search for the SA intervals of substrings of *X* that match the query string *W* with no more than *z* differences (mismatches or gaps). Essentially, this algorithm uses backward search to sample distinct substrings from the genome. This process is bounded by the *D*(·) array where *D*(*i*) is the lower bound of the number of differences in *W*[0, *i*]. The better the *D* is estimated, the smaller the search space and the more efficient the algorithm is. A naive bound is achieved by setting *D*(*i*)=0 for all *i*, but the resulting algorithm is clearly exponential in the number of differences and would be less efficient.

The CalculateD procedure in gives a better, though not optimal, bound. It is conceptually equivalent to the one described in , which is simpler to understand. We use the BWT of the reverse (not complemented) reference sequence to test if a substring of *W* is also a substring of *X*. Note that to do this test with BWT string *B* alone would make CalculateD an *O*(|*W*|^{2}) procedure, rather than *O*(|*W*|) as is described in .

To understand the role of *D*, we come back to the example of searching for *W* =`LOL` in *X*=`GOOGOL$` (). If we set *D*(*i*)=0 for all *i* and disallow gaps (removing the two star lines in the algorithm), the call graph of InexRecur, which is a tree, effectively mimics the search route shown as the dashed line in . However, with CalculateD, we know that *D*(0)=0 and *D*(1)=*D*(2)=1. We can then avoid descending into the ‘`G`’ and ‘`O`’ subtrees in the prefix trie to get a much smaller search space.

The algorithm in guarantees to find all the intervals allowing maximum

*z* differences. It is complete in theory, but in practice, we also made various modifications. First, we pay different penalties for mismatches, gap opens and gap extensions, which is more realistic to biological data. Second, we use a heap-like data structure to keep partial hits rather than using recursion. The heap-like structure is prioritized on the alignment score of the partial hits to make BWA always find the best intervals first. The reverse complemented read sequence is processed at the same time. Note that the recursion described in effectively mimics a depth-first search (DFS) on the prefix trie, while BWA implements a breadth-first search (BFS) using this heap-like data structure. Third, we adopt an iterative strategy: if the top interval is repetitive, we do not search for suboptimal intervals by default; if the top interval is unique and has

*z* difference, we only search for hits with up to

*z* + 1 differences. This iterative strategy accelerates BWA while retaining the ability to generate mapping quality. However, this also makes BWA's speed sensitive to the mismatch rate between the reads and the reference because finding hits with more differences is usually slower. Fourth, we allow to set a limit on the maximum allowed differences in the first few tens of base pairs on a read, which we call the

*seed* sequence. Given 70 bp simulated reads, alignment with maximum two differences in the 32 bp seed is 2.5× faster than without seeding. The alignment error rate, which is the fraction of wrong alignments out of confident mappings in simulation (see also

Section 3.2), only increases from 0.08% to 0.11%. Seeding is less effective for shorter reads.

2.6 Reducing memory

The algorithm described above needs to load the occurrence array

*O* and the suffix array

*S* in the memory. Holding the full

*O* and

*S* arrays requires huge memory. Fortunately, we can reduce the memory by only storing a small fraction of the

*O* and

*S* arrays, and calculating the rest on the fly. BWT-SW (Lam

*et al.*,

2008) and Bowtie (Langmead

*et al.*,

2009) use a similar strategy which was first introduced by Ferragina and Manzini (

2000).

Given a genome of size

*n*, the occurrence array

*O*(·, ·) requires 4

*n*log

_{2}*n* bits as each integer takes

log

_{2}*n* bits and there are 4

*n* of them in the array. In practice, we store in memory

*O*(·,

*k*) for

*k* that is a factor of 128 and calculate the rest of elements using the BWT string

*B*. When we use two bits to represent a nucleotide,

*B* takes 2

*n* bits. The memory for backward search is thus 2

*n*+

*n*log

_{2}*n*/32 bits. As we also need to store the BWT of the reverse genome to calculate the bound, the memory required for calculating intervals is doubled, or about 2.3 GB for a 3 Gb genome.

Enumerating the position of each occurrence requires the suffix array

*S*. If we put the entire

*S* in memory, it would use

*n*log

_{2}*n* bits. However, it is also possible to reconstruct the entire

*S* when knowing part of it. In fact,

*S* and inverse compressed suffix array (inverse CSA) Ψ

^{−1} (Grossi and Vitter,

2000) satisfy:

where (Ψ

^{−1})

^{(j)} denotes repeatedly applying the transform Ψ

^{−1} for

*j* times. The inverse CSA Ψ

^{−1} can be calculated with the occurrence array

*O*:

In BWA, we only store in memory

*S*(

*k*) for

*k* that can be divided by 32. For

*k* that is not a factor of 32, we repeatedly apply Ψ

^{−1} until for some

*j*, (Ψ

^{−1})

^{(j)}(

*k*) is a factor of 32 and then

*S*((Ψ

^{−1})

^{(j)}(

*k*)) can be looked up and

*S*(

*k*) can be calculated with Equation (

5).

In all, the alignment procedure uses 4

*n*+

*n*log

_{2}*n*/8 bits, or

*n* bytes for genomes <4 Gb. This includes the memory for the BWT string, partial occurrence array and partial suffix array for both original and the reversed genome. Additionally, a few hundred megabyte of memory is required for heap, cache and other data structures.

2.7 Other practical concerns for Illumina reads

2.7.1 Ambiguous bases Non-A/C/G/T bases on reads are simply treated as mismatches, which is implicit in the algorithm (). Non-A/C/G/T bases on the reference genome are converted to random nucleotides. Doing so may lead to false hits to regions full of ambiguous bases. Fortunately, the chance that this may happen is very small given relatively long reads. We tried 2 million 32 bp reads and did not see any reads mapped to poly-N regions by chance.

2.7.2 Paired-end mapping BWA supports paired-end mapping. It first finds the positions of all the good hits, sorts them according to the chromosomal coordinates and then does a linear scan through all the potential hits to pair the two ends. Calculating all the chromosomal coordinates requires to look up the suffix array frequently. This pairing process is time consuming as generating the full suffix array on the fly with the method described above is expensive. To accelerate pairing, we cache large intervals. This strategy halves the time spent on pairing.

In pairing, BWA processes 256K read pairs in a batch. In each batch, BWA loads the full BWA index into memory, generates the chromosomal coordinate for each occurrence, estimates the insert size distribution from read pairs with both ends mapped with mapping quality higher than 20, and then pairs them. After that, BWA clears the BWT index from the memory, loads the 2 bit encoded reference sequence and performs Smith–Waterman alignment for unmapped reads whose mates can be reliably aligned. Smith–Waterman alignment rescues some reads with excessive differences.

2.7.3 Determining the allowed maximum number of differences Given a read of length *m*, BWA only tolerates a hit with at most *k* differences (mismatches or gaps), where *k* is chosen such that <4% of *m*-long reads with 2% uniform base error rate may contain differences more than *k*. With this configuration, for 15–37 bp reads, *k* equals 2; for 38–63 bp, *k*=3; for 64–92 bp, *k*=4; for 93–123 bp, *k*=5; and for 124–156 bp reads, *k*=6.

2.7.4 Generating mapping quality scores For each alignment, BWA calculates a mapping quality score, which is the Phred-scaled probability of the alignment being incorrect. The algorithm is similar to MAQ's except that in BWA we assume the true hit can always be found. We made this modification because we are aware that MAQ's formula overestimates the probability of missing the true hit, which leads to underestimated mapping quality. Simulation reveals that BWA may overestimate mapping quality due to this modification, but the deviation is relatively small. For example, BWA wrongly aligns 11 reads out of 1 569 108 simulated 70 bp reads mapped with mapping quality 60. The error rate 7 × 10^{−6} (= 11/1 569 108) for these Q60 mappings is higher than the theoretical expectation 10^{−6}.

2.8 Mapping SOLiD reads

For SOLiD reads, BWA converts the reference genome to dinucleotide ‘color’ sequence and builds the BWT index for the color genome. Reads are mapped in the color space where the reverse complement of a sequence is the same as the reverse, because the complement of a color is itself. For SOLiD paired-end mapping, a read pair is said to be in the correct orientation if either of the two scenarios is true: (i) both ends mapped to the forward strand of the genome with the R3 read having smaller coordinate; and (ii) both ends mapped to the reverse strand of the genome with the F3 read having smaller coordinate. Smith–Waterman alignment is also done in the color space.

After the alignment, BWA decodes the color read sequences to the nucleotide sequences using dynamic programming. Given a nucleotide reference subsequence

*b*_{1}*b*_{2}…

*b*_{l+1} and a color read sequence

*c*_{1}*c*_{2}…

*c*_{l} mapped to the subsequence, BWA infers a nucleotide sequence

such that it minimizes the following objective function:

where

*q*′ is the Phred-scaled probability of a mutation,

*q*_{i} is the Phred quality of color

*c*_{i} and function

*g*(

*b*,

*b*′)=

*g*(

*b*′,

*b*) gives the color corresponding to the two adjacent nucleotides

*b* and

*b*′. Essentially, we pay a penalty

*q*′ if

and a penalty

*q*_{i} if

.

This optimization can be done by dynamic programming because the best decoding beyond position

*i* only depends on the choice of

. Let

be the best decoding score up to

*i*. The iteration equations are

BWA approximates base qualities as follows. Let

. The

*i*-th base quality

,

*i*=2…

*l*, is calculated as:

BWA outputs the sequence

and the quality

as the final result for SOLiD mapping.