MEMs are exact matches between two sequences that cannot be extended in either direction toward the end or the beginning of the sequences without introducing a mismatch (Abouelhoda et al.
; Kurtz et al.
). The only constraint on MEMs is a minimum length L
. Without this constraint, a large number of matches consisting of a small number of characters will be returned. To find MEMs between a query string P
and the reference string S
, the algorithm advances through each character position p
in the query string and attempts to find MEMs of sufficient length that begin at p
in the query string.
In order to find MEMs using sparse SAs, we adapt several existing techniques for finding MEMs in full-text SAs. Specifically, we use an approach where MEMs are found by maintaining two intervals (Abouelhoda et al.
). These intervals are obtained by top-down binary search at position p
in the query string P
. The first interval d
] is found by matching at most L
− 1) characters (Supplementary Algorithm 1
). For this interval, d
is the length of the current match, s
is the start of the interval in the SA and e
is the end of the interval in the SA. By allowing for matches that are K
− 1 characters <L
in length, the entire match can be recovered by scanning regions in between the sparsely indexed suffix positions (a). The second interval q
] is found by matching as many characters as possible, the longest possible match. For this interval, q
is the length of the current match, l
is the start of the interval in the SA and r
is the end of the interval in the SA. Note that q
] is a subinterval of d
], i.e. s
, and q
Fig. 3. (a) Partial matches at successive locations in the query P = issxiss and K = 2 sparsely indexed string S = mississippi$. The asterisk indicates an indexed position and the numbers in the left column designate positions in the query P. A match of ‘ss’ (more ...)
If there are at least d≥L−(K − 1) matched characters at position p in the query P, the algorithm uses both intervals to scan for MEMs of length L. The use of these intervals is based on the observation that every suffix corresponding to the first interval d:[s..e] has a prefix of length L−(K − 1) which matches the query string P at position p. To determine whether each of these prefixes corresponds to a MEM of length L, we need to find its left and right maximal boundaries.
In order to find these MEMs, we use an approach that differs from the maintenance of a stack and reliance on the structure of the suffix tree present in traditional approaches for finding MEMs (Abouelhoda et al.
; Kurtz et al.
). Using both intervals, d
] and q
], we determine the length and position of right MEMs. Right MEMs are matches between S
that cannot be extended any further towards the end of the strings.
The algorithm finds these right MEMs by ‘unmatching’ characters in the maximum length match using LCP values at the start LCP[l
] and end LCP[r
+ 1] of the interval (Supplementary Algorithm 3
; b). The first-right MEMs of length q
are obtained directly from the interval q
]. Observe that the LCP of the set of suffixes indexed between l
corresponds to the maximum length q
, and that both LCP[l
] and LCP[r
+ 1] must be less than this value, or otherwise the interval corresponding to the maximum length match could be expanded. Thus, q
+ 1]) is the length of the next longest match and the minimum LCP value of the interval with characters unmatched.
Using the new LCP value, the interval can be expanded to the left l=l − 1 until LCP[l|<q′ and the right r = r + 1 until LCP[r + 1]<q′. Each array index visited as these intervals are expanded has a corresponding suffix position in the SA, and each of these suffixes in the SA must be right maximal matches of length q′. This expansion process continues as long as q′≥d the length of the right maximal matches collected are greater than the minimum-length match.
Once a right maximal match and its corresponding length is found, the algorithm determines left maximality by scanning to the left of the right maximal match (Supplementary Algorithm 4
). Because, the reference string S
is indexed at every K
-th suffix position, the algorithm must scan up to K
characters to the left of the match. The scan is stopped at a mismatch or at the beginning of either string. The resulting left maximal match is stored only if it meets the length constraint ≥L
Approaches that rely on the structure of the suffix tree, such as the ESA approach in Abouelhoda et al.
), have complexity O
), where m
is the length of matched query sub-string and R
is the number of right maximal matches. In contrast, our approach has complexity O
) where Q
is the number of length m
− 1) matches of a query sub-string in the sparse SA.
At this point, the algorithm can advance to the next position p +1 of the query P and reset the minimum length interval d:[s..e] and the maximum length interval q:[l..r] both to be 0:[0..n/K − 1] the maximum- and minimum-length intervals, and match again from this new query string prefix. However, this naive algorithm will repeat all the work required to obtain the d − 1 and q − 1 characters of the minimum- and maximum-length matches, respectively. Suffix links offer a way of avoiding the additional work of matching these characters again. In the next section, we describe an approach that does not rely on having pre-computed and stored suffix links, but instead uncovers them with extra computation.