|Home | About | Journals | Submit | Contact Us | Français|
We consider the problem of identifying motifs, recurring or conserved patterns, in the biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings.
The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobin families.
Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences.
Finding motifs or repeated patterns in data is of wide scientific interest [1-4] with many applications in genomic and proteomic analysis. The motif search problem abstracts many important problems in analysis of sequence data, where motifs are, for instance, biologically important patterns. For example, elucidating motifs in DNA sequences is a critical first step in understanding biological processes as basic as the RNA transcription. There, the motifs can be used to identify promoters, the regions in DNA that facilitate the transcription. Finding motifs can be equally crucial for analyzing interactions between viruses and cells or identification of disease-linked patterns. Discovery of motifs in music sequences, text, or time series data is a fundamental, general means of summarizing, mining and understanding large volumes of data. For the purpose of this study, motifs are (short) patterns that occur in an exact or approximate form in all or most of the strings in a data set. Consider a set of input strings S of size N = |S| constructed from an alphabet Σ. The solution for the (k, m, Σ, N)-motif finding problem (Figure (Figure1)1) is the set M of k-mers (substrings of length k), M Σk, such that each motif a M, |a| = k, is at minimum Hamming distance of at most m from all (or almost all) strings s S.
In this work, we focus on a deterministic, exhaustive approach to motif search. Exhaustive motif finding approaches are guaranteed to report all instances of motifs in a set of sequences, but are faced by the exponential complexity of such search. As a consequence, the problem quickly becomes intractable for even moderately long motifs and small alphabets. We present a new deterministic algorithm for finding common patterns with the search complexity that scales well with the size of the alphabet. Compared to existing algorithms in this class (e.g. [5,6]) that have strong dependency on the alphabet size and work with small-alphabet input, our algorithms significantly improve search efficiency in the important case of large-alphabet inputs (e.g. protein alphabet, extended DNA alphabet, etc.) and inputs of large length. As we show in the experiments, using both synthetic and real data, our algorithms are orders-of-magnitude faster than existing state-of-the-art deterministic search algorithms, especially on large-alphabet inputs (e.g., protein sequences). This result extends applicability of the exact motif search algorithms to more complex problems requiring analysis of biological sequence data modeled as strings over large alphabets. The problem of motif discovery has been tackled extensively over the past two decades . Within the class of exhaustive methods, a number of approaches have been proposed, including graph methods (WINNOWER) , explicit trie traversal (MITRA) , explicit mapping (Voting algorithms) , suffix trees [6,9], sorting and enumeration , etc. Existing exhaustive algorithms use explicit exploration of the motif space and require time proportional to the size of the neighborhood of a k-mer, i.e. the number of k-mer sequences at Hamming distance of at most m from it. This size, depends on the alphabet size, and can lead to high computational complexity and running times, as shown in Table Table11.
Explicit mapping (voting) algorithms proposed in  use an indicator array V of the maximum size |Σ|k to find motifs through voting. Each length-k substring observed in the input has at most one vote for each input sequence and gives this vote to all of its V(k, m) neighbors. The substrings that occur in every input string will receive N votes and will be included in the output motif set M. The algorithm takes O(km+1|Σ|mnN) time and requires at least O(km+1|Σ|mnN) space. The large space requirement of the algorithm restricts its usage to small values of k and m, as well as to small alphabet size |Σ|.
One of the most efficient exact algorithms for motif search, the mismatch tree (MITRA) algorithm , uses efficient trie traversal to find a set of motifs in the input strings. Under a trie-based computation framework [5,11], the list of k-long contiguous substrings (k-mers) extracted from given strings is traversed in a depth-first search manner with branches corresponding to all possible symbol substitutions from alphabet Σ. Each leaf node at depth k corresponds to a particular k-mer feature (either exact or inexact instance of the observed exact string features) and will contain a list of matching features from each string. The leaf nodes corresponding to motifs will contain instances from all (or almost all) strings. The complexity of the trie-based traversal algorithm for motif finding is O(km+1|Σ|mnN). Note that the algorithm essentially explores the neighborhood of all O(nN) k-mers in the input.
Another class of efficient algorithms is based on sorting and enumeration . The PMSP algorithm enumerates all possible neighboring k-mers for the first string s1 and outputs k-mers that occur in every string with Hamming distance at most m, similar to the Voting algorithm . The PMSprune algorithm  employs a more efficient search strategy to traverse the candidate space and is an improvement, in the expected case, over the PMSP. We note that explicit enumeration is employed by all above-mentioned algorithms.
While the exact algorithms focus on retrieving all possible motif patterns, an important issue of estimating significance of the found motif patterns can be addressed with existing techniques as used in, for instance, non-exhaustive algorithms based on stochastic optimization (e.g., MEME ).
In contrast to existing exact exhaustive algorithms, we approach the problem of motif finding by performing an efficient search over patterns with wildcards. As a consequence, the proposed method’s complexity becomes independent of the alphabet size.
In this section, we develop an efficient combinatorial algorithm for motif finding with the search complexity independent of the size of the alphabet |Σ|. The algorithm begins by finding a set of candidate motifs, followed by the construction of the intersections of those candidates’ neighborhoods, the sequences that are at most m symbols apart from each candidate pair. In a crucial departure from other approaches, this set is efficiently represented using stems, or patterns with wildcards. The number of the stems does not depend on the alphabet size and is a function of the motif length (k), the number of mismatches (m) and the Hamming distance between k-mers. Patterns common to all (or almost all) input strings are then found by pruning the stems that do not satisfy the motif property (i.e., do not occur in all input strings). The main idea of our approach is to construct a candidate set C which includes all motifs M plus some non-motifs, i.e. M C, and then efficiently select true motifs from the candidate set. Given C, the complexity of motif finding is then proportional to its size: the motifs can be extracted from C by checking each candidate against the motif property, a task we accomplish using rounds of counting sort in Algorithm 2. To generate C, we collect the sets of stems which characterize the common neighbors of the pairs of k-mers (a, b) in the input. We call these sets the stem sets, H(a, b). Finding each H(a, b) is independent of the alphabet size and is accomplished in Algorithm 3. To further reduce the complexity, we construct the stem sets only for potential motif instances I, those k-mers that are at Hamming distance of at most 2m from every input string. We find I using rounds of counting sort (Algorithm 2). We outline our motif search algorithm below:
This algorithm uses as its main sub-algorithm (in step 2) a procedure that finds the intersection of k-mer neighborhoods for any pair of the k-mers a, b. This intersection finding algorithm is described in Section ‘Motif generation’. We describe selection and pruning steps (steps 1 and 3) in Section ‘Selection algorithm’.
The overall complexity of the algorithm is where H is the maximum size of H(a, b), and I is the size of I, the number of k-mers used to construct the candidate set C. The important fact that makes our algorithm efficient in practice is that typically I min(nN, |Σ|k) and H V(k, m), particularly for large alphabets. We demonstrate this in our experimental results and provide an expected-size analysis in Section ‘Selection algorithm’.
A necessary condition for a group of k-mers to have a shared, common neighbor (motif) is that the Hamming distance between any pair of patterns cannot exceed 2m. We will use this condition to select k-mers from input that are potential motif instances and place them in set I. A particular k-mer a in the input is a potential motif instance if it is at the minimum Hamming distance at most 2m from each of the input strings. All other k-mers that violate the above condition cannot be instances of a motif and can be discarded. To select the valid k-mers, we use multiple rounds of count sort by removing iteratively 2m out of k positions and sorting the resulting set of (k − 2m)-mers. A k-mer is deemed a potential motif instance if it matched at least one k-mer from each of the other strings in at least one of the sorting rounds. The purpose of sorting is to group same k-mers together. Using a simple linear scan over the sorted list of all input k-mers, we can find the set of potential motifs and construct I. This algorithm is outlined in Algorithm 2. As we will see in the experiments (Section ‘Results and Discussion’), the selection step
significantly reduces the number of k-mer instances considered by the algorithm and improves search efficiency. The number of selected k-mers, i.e. the size of I, is small, especially for large-alphabet inputs. This can be seen from the expected case analysis. For this purpose we assume that sequences are generated from a background process with few motifs implanted in the background-generated sequences. Assuming an iid background model with equiprobable symbols, the expected number of k-mers in the input of N strings of length n that match each of the N strings with up to 2m mismatches by chance is
where pk,2m is the probability that two randomly selected k-mers are at distance of at most 2m. For instance, for a set of N = 20 protein sequences (sampled from alphabet |Σ| = 20) of length n = 600 the expected number of potential motifs of length k = 13, m = 4 by chance is about 8, with p13,8 = 2.9 10−4. Given t implanted motif instances, the average number of k-mers that will be selected from nN input samples, or the expected size of I, is
E[I] = t + nN(1 − (1 − pk,2m)t) + E[IB].
Since t and p are typically small, for small pn, E[I] nN, the number of k-mers in the input. In the protein example above the expected size of I is about 1 + 3 + 8=12 for t = 1, which is orders of magnitude smaller than nN = 12000, signifying the importance of creating I first. This is empirically demonstrated in Section ‘Results and Discussion’.
The sorting approach of Algorithm 2 is also used to select patterns satisfying the motif property from the candidates C (Step 3 in main Algorithm 1). The pruning step is based on verifying the motif property (i.e. whether given patterns match all input sequences with up to m mismatches) and can be accomplished using rounds of counting sort.
In what follows, we describe an efficient algorithm that finds the set of stems that represent the set of k-mers shared by a pair of k-mers a and b. This process is used to create set C from potential instances I, which is subsequently pruned to yield the true motif instances.
The number of k-mers in the common neighborhood of any two particular k-mers a and b assumes a fixed set values depending on the Hamming distance d(a, b) between k-mers , for given values of |Σ|, k, and m. We want to represent the shared k-mers in this intersection using a set of stems, patterns with wildcards. However, the number of stems will not depend on the alphabet size |Σ|.
To find all stems shared by k-mers a and b, consider two sets of positions: mismatch region in which a and b disagree and match region in which a and b agree. We consider two cases depending on the number of mismatch positions (i.e. Hamming distance between a, b). In the first case, the distance d(a, b) is at most m, the maximum number of mismatches allowed. In the second case, the distance d(a, b) exceeds m. When d(a, b) ≤ m, wildcard characters can appear both inside and outside of the mismatch region. When d(a, b) >m, wildcard characters can appear only inside the mismatch regions. Consider for example, the case of d(a, b) = 0 and m = 1. In this case, the set of stems is the set of patterns with 1 wildcard at each of the possible k positions (with the remaining positions as in a) plus one stem with 0 wildcards. When m = 2, and d(a, b) = 1, the set of stems will include patterns with 0 or 1 wildcard in k − d positions and 0 or 1 wildcards in the remaining d = 1 positions. For example, for the pair (tgt, tgc) the corresponding patterns with wildcards are tg?, t??, ?g?, t?c, and ?gc, where ? denotes a wildcard.
We outline our algorithm for finding set of stems for the k-mer neighborhood intersection in Algorithm 3. The number of stems generated by the algorithm is
The number of stems describing all the explicit k-mers shared between a, b does not depend on the alphabet size. The complexity of the stemming algorithm is proportional to the number of stems generated. The maximum number of stems H is for typical values of m <k/2. We use Algorithm 3 for every pair of k-mers in I (step 2) to construct C as outlined in the main algorithm.
The complexity of the selection step 1 for constructing I is and does not depend on the alphabet size |Σ|. Steps 2 and and 3 have the complexity and again do not depend on |Σ|. As a consequence, the three-step procedure gives us an efficient, alphabet-independent motif search algorithm that outputs all motifs embedded in the input S. Our experiments will next demonstrate that this allows efficient exploitation of sparsity of typical solutions—we explore only a small portion of the motif space by focusing (using Algorithm 2) only on the support samples that are potential instances of the motifs. This results in significant reductions in running times, especially for large-alphabet inputs, i.e. the cases difficult for the current exact motif finding algorithms.
Our proposed framework can be used to reduce search complexity for other exact search-based motif finding algorithms. Existing exhaustive algorithms typically (e.g. [5,8,10]) use the entire input (i.e. all the k-mers in the input) and find motif by essentially exploring neighborhoods of every k-mer in the input. Their search complexity can be improved by using a reduced set of k-mers instead of all input samples. This reduced set of k-mers can be obtained using our linear time selection algorithm (Algorithm 2, Section ‘Selection algorithm’). Using reduced set of k-mers, the actual search complexity after the selection step becomes sublinear in the input size (since the number of selected k-mers I = |I| is much smaller than input length O(nN)). For instance, the search complexity of the trie-based algorithms (e.g., ) can be reduced to instead of O(knNV(k, m)), where V(k, m) is O(km|Σ|m). This will lead to a more efficient search especially for large-alphabet since a possibly large input (O(knN)) is replaced with a smaller set I of k-mers that match with up to 2m mismatches every string in the input.
We evaluate our algorithms on synthetic benchmark motif finding tasks and real data sets. We first test our algorithms on the planted motif problem commonly used as a benchmark for evaluation of the motif finding algorithms [2,5,10]. We then illustrate our method on several DNA and protein sequence data sets.
A planted motif problem  is the task of finding motifs and their instances in a set of sequences with variants of the consensus string (motif) implanted with up to m mismatches in every string. This task represents a well-defined subtle motif discovery problem. Instances of this problem with large number of mutations m are known to be challenging for most of the motif finding algorithms.
We follow the standard setting used in previous studies [2,5,10] and generate N = 20 random strings of length n = 600 using iid, uniformly distributed symbols from an alphabet of size |Σ|. We then embed a copy (with up to m substitutions at random positions) of a motif at a random location in every string. The task is then to identify motifs hidden in the input.
In Table Table2,2, we compare the running time of our algorithms with state-of-the-art motif finding algorithms on several challenging instances of the planted motif problem. We give the running time comparison for large-alphabet (|Σ| = 20 − 100) instances in Table Table3.3. As we can see from the results in Table Table22 and Table Table3,3, our algorithms show significant reduction in running times compared to state-of-the-art methods, especially for large-|Σ| inputs (Table (Table3).3). For large alphabets and large k,m trie traversal takes substantial amount of time and results in these cases are not reported. In Figure Figure2,2, we show the running time ratio (logarithmic scale) between the mismatch trie traversal (MITRA) algorithm and our algorithm as a function of the alphabet set size. The running time is measured on (13,4) instances of the planted motif problem. For relatively small alphabet of size 20 our algorithm is about 104 times faster than the mismatch trie. The difference in running time increases with the size of the alphabet. Large alphabets can, for instance, arise when encoding the 3D protein structure, a necessity in cases when sequences share little similarity at primary level.
Figure Figure33 shows efficiency of the selection (step 1 in the algorithm) as a ratio between the input size and the number of the selected samples (k-mers) |I|. We observe that across different input sizes selection reduces the number of samples by a factor of about 103. The observed number of selected samples I = |I| agrees with the theoretical estimates (Section ‘Selection algorithm’) (e.g., for |Σ| = 50, n=5000, N=20, we expect about 52 k-mers to be selected, and the observed size of I is 103 k-mers).
We use several data sets with experimentally confirmed TF binding sites: CRP, FNR, and LexA. The CRP data set contains 18 DNA sequences of length 105 with one or two CRP-binding sites [2,14]. The FNR and LexA data sets are obtained from RegulonDB  database and contain 30 and 91 sequences known to have sites of length 14 and 20 bases. The task is to identify the sequence motif corresponding to the binding sites and the positions of sites within sequences.
For CRP, we use relatively long k-mers of length k = 18, with a large number of allowed mismatches (m = 7) from a given set of 18 DNA sequences (|Σ| = 4). For FNR and LexA data sets, we set motif length to k = 14 and k = 16 bases, with the maximum number of mismatches set to m = 4 and m = 6, respectively.
Figure Figure44 illustrates motifs found by the algorithm on the CRP data set. In the figure, colors indicate the importance of positions as measured by the number of hits between the found motif patterns and the sequences, with blue horizontal lines denoting true (confirmed) locations of the binding sites. The set of discovered locations agrees with the set of experimentally confirmed primary positions. The discovered motif patterns correspond to instances of the reference consensus motif TGTGAnnnnnnTCACA [14,16]. Because of large k and m we observe running time improvements similar to the benchmark planted motif problems: our algorithm takes about 6 minutes, while the mismatch trie traversal requires about 12 times as long (4489 seconds). Allowing a large number of mismatches (m = 7) in this case is critical for the motif prediction performance, because fewer mismatches do not lead to successful identification of the binding sites.
For FNR and LexA motifs, our algorithm correctly finds consensus patterns TTGATnnnnATCAA and CTGTnnnnnnnnnCAG, in line with the validated transcription factor binding sites, with the performance coefficients  of 83.69 and 90.38.
We also apply our algorithm to finding subtle sequence motifs on several protein sequence datasets, a challenging task due to the increased alphabet size (|Σ| = 20) coupled with large k and m.
Lipocalin motifs. We first consider motifs in lipocalins which are topologically similar but have very diverse primary sequences. Using k-mer of length k = 15 with m = 7 mismatches, we identify motifs containing 15 residues with the instance majority FD[IKLW]S[AKNR]FAGTWYE[ILMV]AK (Figure (Figure5),5), which agrees with the known reference motif . Our algorithm takes about 5 minutes to complete this task, while the mismatch trie algorithm takes more than a day. As in the case of the DNA, a large number of mismatches is critical for finding motifs, while smaller values of k, m do not result in motif identification.
Zinc metallopeptidase motif. In this experiment, 10 relatively long (average length is 800) human zinc metallopeptidase sequences used to test motif finding. Identification of subtle motifs in this case is made even more challenging by the length of the sequences. We use 11 residues long k-mer with m = 5 mismatches and find sequence motifs with the instance majority VAAHELGHS[GL]G in 9 out of 10 sequences that correspond to previously confirmed locations. We note the large number of mismatches (m = 5) was critical to motif identification.
Super-secondary structure sequence motifs. We consider now two data sets of protein sequences with interesting 3D sandwich structure studied previously by biologists, for which existence of corresponding sequence motifs has been postulated . Using Cadherin and Immunoglobin superfamilies as an example, our algorithm finds sequence patterns that correspond to the supersecondary structure (SSS) motifs [18,19], i.e. arrangements of the secondary structure units (loops, strands). In particular, in Cadherin superfamily we find long motifs of length 20 (using m = 4 mismatches) corresponding to the secondary structure units strand 1 - loop - strand 2 (VIPPISCPENE[KR]GPFPKNLV) and strand 3 - loop - strand 4 (YSITGQGAD[KNQT]PPVGVFII) (3D SSS motif ). Figure Figure66 shows identified motif positions within the sequences. Our algorithm finds 36 potential motif instances (out of 330 samples) after the selection (step 1) and takes about 47 seconds (compared to about 600 seconds using the trie traversal). In Immunoglobin superfamily (C1 set domains), we find a sequence motif of length 19 SSVTLGCLVKGYFPEPVTV which corresponds to strand 2-loop-strand 3 secondary structure units (2E SSS motif).
We presented a new deterministic and exhaustive algorithm for finding motifs, the common patterns in sequences. Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in large-alphabet sequences (e.g., proteins), as we showed on several motif discovery problems in both DNA and protein sequences. The proposed algorithms could be applied to other cases and challenging problems in sequence analysis and mining potentially characterized by large alphabets, such as text mining.
The authors declare that they have no competing interests.
All authors contributed equally to this work.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 8, 2010: Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S8.