The importance of discovering the patterns and features in genomic sequences is motivated by a number of scientific contexts. The Encyclopedia of DNA Elements project (ENCODE) seeks ‘to identify all functional elements in the human genome sequence’ [

1]. Another context, the study of co-regulated genes, involves the analysis of the promoter sequences, introns, and untranslated regions (UTRs) of genes that were determined by microarray experiments to be co-regulated. Similarly, transcription factor binding regions identified by ChIP-chip and ChIP-Seq experiments are examined to identify genomic patterns [

2]. Genome-wide pattern discovery studies, which seek to identify vocabularies of genomes [

3,

4], provide yet another perspective on genomic data. Large scale analysis of genomic data is also performed in the search for genomic signatures (unique elements that characterize specific organisms, tissues, pathways, and functions) [

5]. All of these problems require the discovery of patterns in genomic sequences.

Several approaches have been developed for genomic pattern discovery. Word enumeration methods are algorithmic techniques that systematically discover either substrings (i.e.,

*words*) or sets of related substrings (i.e.,

*motifs*) in DNA sequences. Most enumeration methods create a data representation of the input sequence(s) that provides fast retrieval of elementary word statistics. The representation serves as a central data structure for a number of other analyses, including statistical word scoring, word-clustering, and motif discovery. A number of algorithmic techniques for word space enumeration have been proposed. Each of the enumeration algorithms can be classified as either

*index-based*[

6-

19],

*graph-based*[

20], or

*iterative*[

21,

22].

*Index-based* enumeration strategies create a data structure, called the index, and provide a mapping function, which maps the character composition of a word to a specific entry in the index data structure. Index-based strategies differ in (1) the data structure used for the index and (2) the type of mapping function employed. Popular index-based strategies employ hash functions, radix trees, and suffix trees.

*YMF*[

6,

13,

14],

*Wordspy*[

15,

16], and

*RMES*[

8,

9,

11] employ

*hash functions* for enumerating the word space. An alternative to hash functions,

*radix trees* require

*O(n*^{2}) space (where

*n* is the total number of characters in the input sequences), and are among the fastest representations for the retrieval of words.

*Seeder*[

7] and

*SMS*[

12] are examples of approaches that utilize a

*radix tree* for storing words. A third alternative for index data structure,

*suffix trees* provide a semantically rich representation of a set of input sequences. They require

*O(n)* time and space, and enable a number of efficient and elegant string processing algorithms. Many tools and algorithms based employ suffix trees, including

*Speller*[

10],

*Weeder*[

17],

*REPuter*[

18], and

*Verbumculus*[

19].

*Winnower*[

20], a

*graph-based* approach, has been used for solving

*the Planted (l,d) Motif* problem [

20] (the problem of finding a motif of length

*l* occurring among all sequences in a set, allowing for at most

*d* mismatches between the instances of the motif). The Winnower algorithm reduces the problem of finding (l, d) motifs to the problem of finding large cliques in multi-partite graphs. The undirected Winnower graph

*G* contains nodes representing words, and edges representing a similarity relationship (e.g., hamming distance) between words. Instead of finding maximal cliques, which is an NP-complete problem [

23], Pevzner and Sze iteratively remove edges from

*G* that are guaranteed not to be contained in a clique of size

*k*, resulting in an algorithm of O(N

^{k+1}), where

*N* is the total number of nucleotides.

*Iterative* approaches, such as

*Teiresias*[

22] and

*Mitra*[

21], incrementally concatenate short motifs from the input sequences to discover maximal motifs. These methods generate the set of maximal patterns without having to enumerate the entire word-space of an input sequence set. The Teiresias algorithm divides the motif discovery process into two phases: scanning and convolution. During the scanning phase, a set of elementary patterns of length

*W*, satisfying a user-defined quorum

*q*, is enumerated for a specific length with a required number of non-mismatches

*L*. During the convolution phase, the elementary patterns are combined pair-wise and the resulting patterns are added to the set of elementary patterns if they satisfy the quorum. During convolution it is necessary to consistently detect and remove patterns that are no longer maximal, but are instead part of larger patterns with the same quorum satisfaction. The complexity of the scanning phase is O(NW

^{L}), with N being the total number of nucleotides, and the complexity of the convolution phase is

, (where

*rc(T’)* represents the matches in a pattern

*T’*, which is a maximization of a pattern

*T*[

24]). Taking into consideration all calls to a maximization function, the worst-case time complexity of the Teiresias algorithm is

, where

*t*_{H} is the time needed for locating hash entries, and

*P* is a pattern to be inserted into the set of maximal patterns [

24].

While a number of algorithms and software tools have been developed to solve the word discovery problem, most do not provide the scalability needed to process large (genome-scale) data sets. For example, our single-processor enumeration methods, based on either a radix tree or a suffix tree, are unable to perform word enumeration for the ~27,000 core promoters of the *Arabidopsis thaliana* genome for word lengths greater than 19bp (see Figure ).

The WordSeeker software suite addresses this problem by providing scalable word discovery algorithms. The software described herein builds upon earlier work of the authors (reported in [

32]), which developed cache aware data layout and access strategies for a shared memory implementation of the radix tree data structure. WordSeeker has been used to analyze the promoter regions of genes in the DNA repair pathways of

*Homo sapiens*[

25], the entire genome of

*Arabidopsis thaliana*[

26], and regulatory regions involved in gravity response in

*Arabidopsis thaliana*[

27]. As reported in [

28], results of the WordSeeker analysis of the

*Arabidopsis thaliana* genome have been incorporated into AGRIS - the Arabidopsis Gene Regulatory Information Server [

29].

The remainder of the manuscript presents a description of the methods employed by WordSeeker, an experimental assessment of their effectiveness, and a discussion of results.