R

ecently, genome-wide location analyses have been carried out for proteins such as OCT4 (Boyer et al.,

2005), P53 (Wei et al.,

2006), EREα (Carroll et al.,

2006; Lin et al.,

2007), c-Myc (Zeller et al.,

2006), CTCF (Kim et al.,

2007), FOXP3 (Zheng et al.,

2007), NRSF (Johnson et al.,

2007), STAT1 (Robertson et al.,

2007), and for histone markers (Bernstein et al.,

2006; Lee et al.,

2006; Mikkelsen et al.,

2007; Pan et al.,

2007; Schones et al.,

2008; Wang et al.,

2008). One goal of these studies is to discover short functional elements such as

*cis*-elements embedded in these sites that are a few hundreds to tens of thousands of nucleotides long. Computational tools for

*de novo* motif discovery in such massive data are needed.

During the last decade or so, many

*de novo* motif discovery methods have been developed (Bailey and Elkan,

1994; Buhler and Tompa,

2002; Down and Hubbard,

2005; Elemento et al.,

2007; Eden et al.,

2007; Hertz and Stormo,

1999; Linhart et al.,

2008;Liu et al.,

2001,

2002; Pavesi et al.,

2001; Pevzner and Szu,

2000; Roth et al.,

1998; Sinha and Tompa,

2002; Sumazin et al.,

2005; Thijs et al.,

2001; van Helden et al.,

2000). These methods fall into two categories: word enumeration and local search. The performance of many algorithms has recently been assessed (Tompa et al.,

2005). Word-enumeration techniques count the number of occurrences of a motif, defined as a string of letters {a,c,g,t and sometime with degenerate letters, e.g., y = c,t and r = a,g} of certain length (e.g., 6–20) in the sequence data. When no degenerate letters are used in the motif profile/model, a subsequence is considered an instance of the motif when the number of mismatches between the subsequence and the motif is less than a threshold. The motifs are then rank-ordered based on their overrepresentation, thus, these approaches guarantee the global optimum—e.g., producing motifs with the highest overrepresentation. Many methods in this group have been developed. For instance, Consensus (Hertz and Stormo,

1999) first uses each

*k*-mer to form the first sequence to construct an alignment matrix and the matrix is further updated. The

projection algorithm (Buhler and Tompa,

2002) projects every

*l*-mer in the input data into a smaller space by hashing. Other methods in this category includes

winnower (Pevzner and Szu,

2000), spaced dyads (van Helden et al.,

2000; Li et al.,

2002), Weeder (Pavesi et al.,

2001),

mitra (Eskin and Pevzner,

2002), YMF (yeast motif finder) (Sinha and Tompa,

2002), DWE (Sumazin et al.,

2005), Drim (Eden et al.,

2007), and

fire (Elemento et al.,

2007). Recently, Xie et al. (

2007) enumerated a list of candidate

*k*-mers (12–22 nucleotides) and counted the number of matching instances in a set of conserved noncoding elements in the human genome.

Perhaps more widely used approaches employ local search techniques such as EM and Gibbs sampling (for recent reviews, see Jensen et al. [

2004] and van Nimwegen [

2007]). Unlike word-enumeration, the local search techniques use the PWM as the motif model. Initially, these models are either pre-defined or randomly specified. The models are then updated by an iterative process until convergence. Local search methods include

meme (Bailey and Elkan,

1994), AlignACE (Roth et al.,

1998), BioProspector (Liu et al.,

2001), motifSampler (Thijs et al.,

2001), MDScan (Liu et al.,

2002), Nested

mica (Down and Hubbard,

2005) and fdrMotif (Li et al.,

2008).

One advantage of the local search methods is that initial motif models are iteratively updated. On the other hand, the search space can be very large. Consequently, one of the main challenges for the local search techniques is how to obtain starting positions for the local search algorithms. For instance, meme converts each subsequence of length, *w*, into a letter probability matrix and uses it as the starting point for its EM algorithm. Only one step of EM is carried out for each starting matrix. The resulting best models are subjected to full EM. Motifs with the strongest statistical significances (*E*-values) are reported. This approach almost guarantees good starting positions for the EM algorithm. However, examining all possible starting positions for various lengths of *w*'s (e.g., 6–30) for a large dataset is computationally too costly to be practically useful. Here we present an efficient method that combines existing algorithms to identify good starting positions for an EM algorithm for unbiased motif discovery in large scale data sets.

Our method begins by counting the number of matching instances of all

*k*-mers (

*k* = 3, 4, 5, 6) in the data. For instance, there are 4

^{3} = 64 possible tri-nucleotides (3-mers). For each

*k*, the

*k*-mers are rank-ordered based on their overrepresentation. The top-ranked

*k*-mers for all four

*k*'s are subsequently used as the words for the spaced dyads (Li et al.,

2002; van Helden et al.,

2000). The top-ranked words may be viewed as “seeds” for a motif. Unlike the word-enumeration methods, we do not count the numbers of matching instances of the spaced dyads in the data. Instead, we convert the spaced dyads into letter probability matrices (similar to

meme), which in turn, are used as the initial models for a local search technique via an EM algorithm. Thus, one might regard our method as a hybrid of the word enumeration and local search techniques. Similar hybrid methods have been proposed. For instance, Eskin (

2004) developed the

mitra-pssm algorithm that combined an efficient branch and bound algorithm for finding consensus patterns and a local search algorithm.

A spaced dyad consists of two words separated by a certain number of spacers (unspecified bases),

*d*. If we choose

*N*_{k} top-ranked

*k*-mers as the possible words and

*d* spacers for the spaced dyads, the number of possible spaced dyads constructed from these building blocks can be large. In theory, there are (

*N*_{3} +

*N*_{4} +

*N*_{5} +

*N*_{6}) × (

*d* + 1) × (

*N*_{3} +

*N*_{4} +

*N*_{5} +

*N*_{6}) possible spaced dyads when both words of the spaced dyads can independently come from any of the four

*k*-mer groups and there are 0 to

*d* spacers between the words. Subjecting all of them (after conversion to probability matrices) to EM is impractical. Therefore, we employ a genetic algorithm (GA) (Goldberg,

1989) to guide the formation of spaced dyads so that only a small fraction of the spaced dyads are examined while finding most or all motifs. A GA is very effective in searching high-dimensional space and has been used in many optimization/search problems. Earlier, Wei and Jensen (

2006) proposed a GA-based approach,

game, to evolve motifs from randomly generated starting motifs. We refer to our method as

gadem (

**G**enetic

**A**lgorithm guided formation of spaced

**D**yads coupled with

**E**M for

**M**otif identification).