|Home | About | Journals | Submit | Contact Us | Français|
The famous Burrows–Wheeler Transform (BWT) was originally defined for a single string but variations have been developed for sets of strings, labeled trees, de Bruijn graphs, etc. In this paper we propose a framework that includes many of these variations and that we hope will simplify the search for more.
We first define Wheeler graphs and show they have a property we call path coherence. We show that if the state diagram of a finite-state automaton is a Wheeler graph then, by its path coherence, we can order the nodes such that, for any string, the nodes reachable from the initial state or states by processing that string are consecutive. This means that even if the automaton is non-deterministic, we can still store it compactly and process strings with it quickly.
We then rederive several variations of the BWT by designing straightforward finite-state automata for the relevant problems and showing that their state diagrams are Wheeler graphs.
The Burrows–Wheeler Transformation (BWT) has a very peculiar history. First conceived in 1983, it was published only eleven years later in a technical report , presumably because it was so innovative that the first reviewers were not able to grasp its full significance. A few years later, the compression algorithm bzip2 based on the BWT became popular, challenging gzip's dominance, thanks to the finely engineered implementation of Julian Seward  (the very same computer scientist who gave us also the invaluable tool Valgrind ).
After its introduction as a compression tool, interest in the BWT was rekindled when many researchers realized that, among the different techniques discovered at the turn of the century for designing compressed indexes , , , those based on the BWT are probably the simplest and most space efficient , . After this realization, in the last ten years we have witnessed an unusual phenomenon in computer science: variants of the BWT have been proposed and applied to more and more complex objects: from trees, to graphs, to alignments. These variants are clearly related to the BWT even if some of them no longer have the two main features of the original BWT, namely of being invertible and of “helping” compression.
At this point it is natural to ask whether we have approached the BWT as the blind men approached the elephant (see, e.g., ), with one touching a leg and thinking the elephant is like a tree, another touching the trunk and thinking it is like a snake, and yet another touching the tail and thinking it is like a rope. We do not disparage previous surveys of the BWT and related data structures, such as , , , since we too have spent years trying to make sense of our sometimes disparate impressions of the it. Without pretending to give a complete answer, in this paper we propose a unifying view for many different BWT variants. Somewhat surprisingly we get our unifying view considering the Nondeterministic Finite Automata related to different pattern matching problems. We show that the state graphs associated to these automata have common properties that we summarize with the concept of Wheeler graphs.1
Using the notion of a Wheeler graph, we show that it is possible to process strings efficiently, e.g., in linear time if the alphabet is constant, even if the automaton is nondeterministic. In addition, we show that Wheeler graphs can be compactly represented and traversed using up to three arrays with additional data structures supporting efficient rank and select operations. It turns out that these arrays coincide with, or are substantially equivalent to, the output of many BWT variants described in the literature.
We believe our unifying view can help researchers develop new BWT variants and new indexing data structures. However, we stress that not every BWT-related data structure fits our framework: for example Ganguly et al.'s parameterized BWT  and the index for order-preserving matching in . Therefore, we hope that our contribution will spur further research resulting in a wider vision of the fascinating field originated by the seminal work of Burrows and Wheeler.
Consider a directed edge-labeled graph G such that each edge is labeled by a character from an totally-ordered alphabet A. We use to denote the ordering among A's elements. Labels on the edges leaving a given node are not necessarily distinct, and there can be multiple edges linking the same pair of nodes (for simplicity we still use the term graph rather than the more formally correct multi-graph).
G is a Wheeler graph if there is an ordering of the nodes such that nodes with in-degree 0 precede those with positive in-degree and, for any pair of edges and labeled a and respectively, the following monotonicity properties hold:
An example of a Wheeler graph is shown in Fig. 1. As an immediate consequence of (1), all edges entering a given node must have the same label. We now show that Wheeler graphs also possess the following property:
G is path coherent if there is a total order of the nodes such that for any consecutive range of nodes and string α, the nodes reachable from those in in steps by following edges whose labels for α when concatenated, themselves form a consecutive range.
If G is a Wheeler graph by an ordering π on its nodes then it is path coherent by π.
Suppose G is a Wheeler graph by π. Consider a consecutive range of nodes and let be the smallest range that contains all the nodes reachable from those in in one step by following edges labeled with some character a. By our choice of , both and are reachable from nodes in in one step by following edges labeled a. By our definition of a Wheeler graph, nodes with in-degree 0 precede those with positive in-degree, such as , so every node in has at least one incoming edge.
Assume some node v strictly between and has an incoming edge labeled . Since we have , by our definition of a Wheeler graph and modus tollens; similarly, since we have , thus obtaining a contradiction. It follows that the edges arriving at nodes in are all labeled a. Furthermore, since the labels are equal, by the second implication in (1) and modus tollens we get that any edge with destination strictly between and must originate in .
It follows that the nodes reachable in one step from those in by following edges labeled a are the ones in , which is a consecutive range. For any string α, therefore, the nodes reachable in steps from those in by following edges whose labels form α, themselves form a consecutive range, by induction on the length of α.□
In the later sections of this paper, we explore the implications of path coherence. In the remainder of this section we first show it is possible to obtain a fast and compact representation of a Wheeler graph, then sketch why Wheeler graphs can achieve compression. We point out that, even without explicitly defining Wheeler graph s, many researchers have implicitly considered them while studying how to implement efficiently BWT variants and how to bound their space usage, and given results analogous to the ones we give now.
A plain, edge-by-edge representation of a labeled graph with n nodes and e edges uses bits. Given a Wheeler graph G, let denote the ordered set of nodes. For let and denote respectively the out-degree and in-degree of node . Define the binary arrays of length
Note that O (resp. I) consists of the concatenated unary representations of the out-degrees (resp. in-degrees). Let denote the multiset of labels on the edges exiting from arranged in an arbitrary order, and let denote the concatenation . By construction, and there is a natural one-to-one correspondence between the 's in O and the characters in L. For example, for the graph of Fig. 1 it is
while the indegree array is . Finally, let denote the array such that is the number of edges with label smaller than c. For simplicity, assume every distinct character labels some edge; otherwise, we store a bitvector of bits marking the characters that do label edges and work with that subset. With this assumption, and has all incoming edges labeled c if and only if .
Given an array Z we use the standard notation to denote the number of occurrences of c in , and to denote the position of the j-th c in Z. For simplicity we assume . The following two properties are straightforward to prove; indeed, similar properties have been established in many papers dealing with BWT-related data structures.
With a symmetric reasoning, given and c we can establish whether there exists an edge labeled c entering in , and, if this is the case, all indices i such that has label c. Using standard data structures to represent compactly arrays supporting rank and select queries , we can establish the following result.
It is possible to represent an n-node, e-edge Wheeler graph with labels over the alphabet A in bits. The representation supports the forward and backward traversing of the edges in time.□
We can often reduce even further the space used to represent Wheeler graphs. For example, suppose G is a Wheeler graph; is the set of nodes of G reachable in exactly k steps from some other nodes; S is a set of strings of length k such that every node in is reachable in k steps by following edges whose labels when concatenated form a string in S; and f is a function assigning to each node a string such that v is reachable in k steps by following edges whose labels when concatenated form s. Consider the list of labels of edges originating in in order by origin with ties broken by label.
By Lemma 3, for all the nodes such that form a consecutive interval. Therefore, we can partition into consecutive intervals such that each interval is the concatenation of the labels of edges leaving nodes v such that ; is empty for . It follows that if , where is a constant and is the size of the alphabet of labels, then we can store in bits, where is the 0th-order empirical entropy of . Informally, this means that if knowing how we can get to a node in k steps tells us a lot about the labels that are likely to be on the edges leaving that node, then we can compress well. In some ways this generalizes the well-known analyses ,  of the compression achievable using the Burrows–Wheeler Transform on a single string.
Suppose we have never heard the words “Suffix Array” or “Burrows–Wheeler Transform” but we want to build a data structure supporting efficient queries for the substrings of, say, . A simple solution would be to build the Deterministic Finite Automaton (DFA) accepting all substrings of s, see Fig. 2 (left). Although there are techniques to compactly represent DFAs, such avenue will likely lead to a data structure much larger than the bits of the original text.
A more space economical alternative is to build the Nondeterministic Finite Automaton (NFA) for the same set of substrings: It has a linear structure, see Fig. 2 (right), but because of nondeterminism, processing of patterns appears to be a difficult task. It turns out that this difficulty is only a matter of perspective: traversing the DFA becomes much simpler if we recognize that it is a Wheeler graph in disguise.
The NFA of Fig. 2 is already a labeled directed graph. To make it a Wheeler graph we only need to eliminate the ϵ-transitions (which violate the property that all the incoming edges to a node have the same label), make all the states initial (so the language is not changed), and define an ordering of the nodes such that the monotonicity properties (1) hold. To this end, we associate to each node of the NFA the prefix of s that takes us there without an ϵ-transition and order the NFA nodes according to the right-to-left lexicographic rank of such strings, see Fig. 3 (left and center). With this definition the NFA without ϵ-transitions is a Wheeler graph. Once the node ordering is established we can discard the prefixes and identify each node with its rank in the ordering, see Fig. 3 (right).
In the Wheeler graph derived from the NFA of Fig. 3 every node has out-degree 1 except from node 5, corresponding to the longest prefix, that has out-degree zero. Hence, in the representation of the Wheeler graph described in Sect. 2, we can get rid of the bitarray O and instead insert in position 5 of L a symbol not occurring elsewhere in s, say $. Since all nodes have in-degree 1 except node 0 (ignoring the sourceless edges indicating that all nodes are initial) there is no need to store the bitarray I either. Summing up, we can represent and navigate the Wheeler graph using the string , enriched with data structures for /select operations, and the array . Since the string L coincides with the last column of the BWT matrix of , the string s reversed, and C is a well-known representation of the first column of such matrix, we have established the following result.
Given a string s, let denote the corresponding NFA described above. The FM-index of is a compact representation of . The Last-to-First and First-to-Last maps of the FM-index coincides with the navigation operations in .□
The attentive reader may ask why the Wheeler graph represents the FM-index of while historically the FM-index was defined for s. The reason is that the FM-index was derived from the BWT of s. The consequence is that the search of the pattern must be done right-to-left: the infamous backward-search procedure . Here, we have defined the NFA so that the search is done left-to-right and through the Wheeler graph we have therefore obtained the FM-index of .
Lemma 5 provides a new perspective on FM-indexes that may be interesting in its own right and, more importantly, suggests a way we can systematically generalize the ideas behind FM-indexes to indexing other kinds of data than single strings. Using the concept of path coherence, we now prove a more powerful version of this lemma:
Consider a finite-state automaton over an alphabet A without ϵ-transitions and with either one initial state or all states initial. If its state diagram is a Wheeler graph with n nodes and e edges then we can store it in bits such that, for any string α, in time we can compute the set of states reachable from the initial states on α.
Suppose the state diagram is a Wheeler graph by an ordering π on its nodes so, by Lemma 3, it is path coherent by π. Since either one state is initial or all are, the initial states form a consecutive range in π. By Definition 2, the nodes reachable from the initial states on each prefix of α form a consecutive interval. We can use Lemma 4 to map from the interval for each prefix to the interval for the next prefix in time.□
Theorem 6 means that if we have a problem we can solve with a finite-state automaton, then even if the automaton is non-deterministic we may still be able to implement it without the usual blowups in space or query time. In the rest of this paper we show that many data structures based on variants of the BWT fit into this framework.
We note as an aside that FM-indexes usually include a sampled suffix array which permits us to locate each occurrence of a pattern in the indexed string. Extending this idea to automata seems straightforward, by sampling the nodes, but we do not explore that in this paper.
The first natural generalization of the BWT is to extend it to a collection of strings. This was done for the first time in , ,  where the authors described a reversible multi-string transformation inspired by the BWT, and showed its effectiveness for data compression and for measuring string similarity.
In the context of pattern matching, given a collection of strings , in addition to substring queries it is convenient to offer also the possibility of prefix/suffix queries. In a prefix/suffix query, given two substring we want to find all strings such that is prefixed by α and suffixed by β. As first observed in ,  with the Compressed Permuterm Index, such queries can be solved by adding a special symbol $ to each string and searching for the pattern inside a circular version of each . Fig. 4 (left) shows the NFA supporting circular queries for the strings AT$, HOT$, HAT$. To each node we naturally associate a circular shift of one of the strings; if we sort the nodes according to the right-to-left lexicographic order of such shifts the resulting graph is a Wheeler graph, see Fig. 4 (right).
Since each node in the Wheeler graph has in-degree and out-degree 1, we can represent it with the approach of Lemma 4, using only the arrays L and C. Note that the array L, in the example of Fig. 4, coincides with the multi-string BWT defined in terms of the cyclic shifts in . Instead, the Compressed Permuterm Index is built computing the single-string BWT of the concatenation that does not support naturally the search for circular patterns. The authors of  got around this problem by sorting the strings before the concatenation and by applying the so-called jump2end function. More recent algorithms using circular pattern search all make use of the cyclic multi-string BWT , , , .
If we are only interested in the substrings of a collection , and not in circular matches, a smaller automaton than the one considered in the previous section is the one derived from the trie data structure, see Fig. 5. Following the lead from , , , we associate to each node the string formed by the labels in the node-to-root upward path, and we order all nodes according to the lexicographic rank of such strings.
The resulting graph is a Wheeler graph with the same number of nodes as the original trie. In a trie all nodes except the root have in-degree 1 so in the representation of Lemma 4 we do not need to store the in-degree binary array. Hence the representation consists of the out-degree binary array O, the labels array L and the count array C. For the trie in Fig. 5 we have
This representation is essentially equivalent to the XBWT (eXtended BWT) introduced in  to represent labeled trees, with the arrays O and L corresponding respectively to and in . Note that we can restrict the search to prefixes by setting the root as the only start state, and we can restrict the search to suffixes by setting as accepting states only those corresponding to one of the input strings without affecting our representation.
Bowe, Onodera, Sadakane and Shibuya  (see also, e.g., , , ) extended the XBWT from trees to de Bruijn graphs, which are widely used in bioinformatics for de novo assembly, read correction, identifying genetic variations in a population, and other applications. A kth-order de Bruijn graph for a string or set of strings contains a node for each distinct k-tuple that occurs in those strings, and an edge if there is a -tuple in the strings that starts with u and ends with v (which implies that v can be obtained from u by deleting u's first character and appending a character).
Together with the GCSA, described in Section 4.5, Bowe et al.'s representation is really the prototypical BWT-based data structure that requires us to think in terms of graphs instead of strings. The main difference between representing a labeled tree and representing a de Bruijn graph is that, in a graph, nodes can have in-degree more than 1, but this can be handled using the in-degree array I as described in Lemma 4.
We can also rederive Bowe et al.'s representation from Theorem 6: we build a finite-state automaton that accepts all prefixes of length at least k of strings containing only k-tuples and -tuples from a given list, by building a trie for the k-tuples and then adding edges connecting the leaves appropriately. Fig. 6 shows an example for the triples AAA, AAB, AAC, ABA, BAA, BAB, CAB and CCA, with the edges for BAAA and CABA missing: this automaton accepts all prefixes of length at least 3 of strings in which the only triples are AAA, AAB, AAC, ABA, BAA, BAB, CAB and CCA and in which BAAA and CABA do not occur. By numbering each node according to the lexicographic rank of the string labeling the node-to-root path, the state diagram is immediately seen to be a Wheeler graph.
Let G be a Wheeler graph under node ordering π. If all incoming edges to nodes in range have the same label, we can merge the range into a single node v, and the resulting graph will still be a Wheeler graph. If graph G has an edge with label a from node u to a node in range , graph will have an edge with label a. Similarly, if graph G has an edge with label from a node in range to node w, graph will have an edge with label .
If we have a multi-string BWT, we can transform its Wheeler graph into a more compact representation of the strings by merging ranges of nodes. This compact representation may contain false positives: path labels that combine substrings from different original strings. The FM-index of alignment (FMA) ,  has an efficient procedure for detecting false positives, based on carefully choosing the nodes to merge.
We start with an alignment of the strings, and separate the alignment into common regions shared between all strings, and non-common regions where some strings are different. We assume that each common region has a proper suffix that does not occur anywhere else in the strings. If no such suffix exists, we merge the common region into the neighboring non-common regions. We transform the alignment in two steps. First we move the suffixes into the following non-common regions . Then we justify each non-common region to the right, and use and to denote the common and non-common regions in the transformed alignment. See Fig. 7 for an example.
The transformed alignment guides us in merging the Wheeler graph of the multi-string BWT into the FM-index of alignment. As we are building an FM-index for the original strings, we need to build a Wheeler graph for the reverse strings. We merge nodes corresponding to aligned positions, if the nodes a) have in-degree 0; b) are in a common region; or c) correspond to the same suffix of the non-common region. See Fig. 8 for an example.
By Lemma 3, the Wheeler graph of a multi-string BWT is path coherent. To see that the graph remains a Wheeler graph after merging, we note that:
False positives are paths that do not correspond to any of the original strings. They can occur when the path comes from a non-common region , enters a common region , and exits into another non-common region . Assume that for each node v we have stored the set of strings passing through the node. To check for false positives, we take the intersection of sets over all nodes v on the path. If the intersection is non-empty, it contains the strings compatible with the path. If is the only outgoing edge from node u, we have . Similarly, if is the only incoming edge to node v, we have . Hence it is enough to take the intersection a) at the final node; and b) at nodes u, where the path crosses an edge to a node v with multiple incoming edges. It is also enough to store the set explicitly if a) node u has multiple or no outgoing edges; or b) for the only outgoing edge .
Let α be a string, let be the set of nodes reachable by following paths with label α, and let be the set of strings compatible with the paths. Because the graph is a DFA, the set of reachable nodes is monotonically decreasing: and for any character . If is a moved substring, for any character , as the paths can only end at the rightmost node of the common region . If set is non-empty, set contains all strings, as is a substring of all strings.
When we search in FMA, we move from to . We update the set of compatible strings S lazily. Initially the set contains all strings. If , there cannot be false positives. Either none of the paths enters a common region from a non-common region, or all such paths start within and enter through every incoming edge. In either case, . If and the node has multiple incoming edges, there is a risk of false positives. We therefore update . When the search finishes at , the set of compatible strings is .
While not all NFAs have an ordering of nodes that makes them Wheeler graphs, we can use Wheeler graphs to index path labels of length up to k in arbitrary graphs . This is the idea behind Generalized Compressed Suffix Arrays (GCSAs). We emphasize that Wheeler graphs can be larger than the graphs they are used to index.
Let G be an NFA, let be a Wheeler graph, let α be a string, and let and be the sets of nodes reachable with α in G and , respectively. Graph is a kth-order path graph of G for a , if there is a function f from nodes of to sets of nodes of G such that for all .
We can use a kth-order de Bruijn graph of path labels in graph G as a kth-order path graph of G. See Fig. 9 for an example. Function f has the same role as the suffix array. As with the sets of strings in FMA, we must store set explicitly, if a) node u has multiple or no outgoing edges; or b) we cannot derive from for the only outgoing edge . We often number the nodes of the NFA in a way such that , if is the only outgoing edge from u and the only incoming edge to v.
With large values of k, de Bruijn graphs often have many redundant nodes when we use them as path graphs. If , the nodes reachable with α in the de Bruijn graph are the ones with α as a suffix of the corresponding k-tuples. By Lemma 3, the nodes form a consecutive range. If for all , we can merge the nodes without affecting reachability. GCSA2  uses such pruned de Bruijn graphs to save space when indexing path labels in an NFA. See Fig. 9 for an example.
The original GCSA  considers a different scenario. Instead of indexing paths of length up to k in an arbitrary NFA, it indexes paths of arbitrary length in an acyclic DFA. Conceptually, GCSA construction searches for a de Bruijn graph that is equivalent to the input graph G as a DFA, and then uses that de Bruijn graph as an infinite-order path graph of G. Consecutive ranges of nodes are merged if for all nodes v and in the range, even if the ranges do not correspond to shared suffixes of the k-tuples. See Fig. 9 (bottom) for an example.
Another remarkable variant of the BWT is Durbin's positional BWT  (PBWT), which he introduced for haplotype matching but has also been used for reconstructing ancestral recombination graphs . Given d strings of equal length the PBWT supports the matching of substrings starting at a specified position in the strings.
We can view the PBWT as a Wheeler graph with the same structure as that of the multi-string BWT. If the multi-string BWT has an edge with label a from text position i to position , PBWT uses as the label. Because all edges from position i go to position , the positional component is always for all outgoing edges in range . Hence, we can infer the position from the range and store just the character a in the succinct representation (Lemma 4). A generalization  of PBWT relaxes the requirements. Instead of storing strings of equal length, we store the labels of paths in a graph. If edge in the Wheeler graph corresponds to edge with label a in the underlying graph, we use as the label of the edge. As we can no longer infer the positional component, we have to store it explicitly.
Although introduced for completely different applications, the PBWT bears a striking resemblance to a data structure called a wavelet matrix, which Claude, Navarro and Ordóñez  introduced as a version of the wavelet tree  better suited for large alphabets. This suggests that even wavelet trees and matrices can be viewed as Wheeler graphs. For the sake of brevity, we assume the reader is familiar with these data structures with “myriad virtues” , , and note only that, while building a wavelet tree is analogous to a most-significant-bit-first radix sort, building either a PBWT or a wavelet matrix is analogous to a least-significant-bit-first radix sort.
Consider the wavelet tree shown on the left in Fig. 10. Since it is a tree, we can use our construction from Subsection 4.2 to build an FSA (with one initial state) that accepts all the binary strings that are root-to-leaf paths in the wavelet tree. This does not store all the information in the wavelet tree, however, so we turn the graph from an FSA into a directed multi-graph, as shown on the right in Fig. 10, which is still a Wheeler graph. This graph has some interesting properties: e.g., the order of the outgoing edges from each node encode the bitvector stored at that node in the wavelet tree; as in the wavelet tree, for each internal node v except the root, v's in-degree and out-degree are equal and its in-degree is equal to the sum of the in-degrees of its leaf descendants. It is beyond the scope of this paper to investigate this representation of wavelet trees as a data structure, but it seems like an interesting direction for future research that goes beyond the reach of Theorem 6.
Now consider the wavelet matrix show on the left in Fig. 11. We can turn it into a Wheeler graph if, as in the wavelet tree, we transform it into a directed multi-graph and then, as in the PBWT, we rename the label c on each edge with the pair , where i is the level of the starting node, as shown on the right in Fig. 11. Again, the Wheeler graph clearly encodes the wavelet matrix, but we leave as future work investigating the possible benefits of this alternative representation.
We have defined Wheeler graphs to try to capture the ideas behind many of the variants of the BWT, and given a framework for developing new variants by solving problems with finite-state automata whose state diagrams are Wheeler graphs. We did not pursue the topic here, but Wheeler graphs seem able to capture properties also of string transformations not related to pattern matching: by reversing the inequality in (1) we get a slightly different notion of Wheeler graph that can be used to succinctly represent the variant of the BWT defined in terms of the alternating lexicographic order  (see also  for an ancestor of the multi-string BWT).
There has been so much work involving the BWT, however, that it would be surprising if one idea could subsume it all, and indeed some BWT-related results, such as the positional BWT, seemingly cannot be reasonably modeled by finite-state automata, even if they can still be viewed as Wheeler graphs; other BWT-related results, such as Ganguly et al.'s parameterized BWT  and the index for order-preserving matching in , seem not even to be based on Wheeler graphs at all. We have not yet even considered bidirectional FM-indexes , ,  and bidirectional BWT-based de Bruijn graphs .
Apart from applying and extending our framework, we hope to develop algorithms to recognize Wheeler graphs efficiently, and to characterize classes of finite-state diagrams that are Wheeler graphs or can be expanded slightly to become Wheeler graphs without changing the language accepted by the automata (although characterizing all such state diagrams might be difficult). In this regard we observe that not all regular languages have finite-state automata whose state diagrams are Wheeler graphs: e.g., in any finite-state automaton for the language , there must be disjoint paths for and and both ends of all edges with label x must be in the same order, so there must be separate nodes for and for all values of i.
Finally, we hope our new perspective on BWT variants makes them more accessible to computer scientists from areas outside string algorithms and data structures.
This work was supported by Academy of Finland grant 268324, FONDECYT grant 1171058, KITE-DiSIT project, INdAM-GNCS Project “Efficient algorithms and techniques for the organization, management and analysis of biological Big Data”, and the Wellcome Trust grant .
1On many occasions Mike Burrows stated that, as reported also in , the original idea of the BWT is due to David Wheeler. We therefore decided to name this graph class after this pioneer of computer science.