The binding of transcription factors to relatively short and variably degenerate regulatory DNA sequences (cis-regulatory elements) is central to the regulation of gene expression (Orphanides and Reinberg, 2002). While several sequenced genomes are nearly deciphered in terms of the protein-coding gene repertoire, the inventory and comprehensive characterization of cis-regulatory elements remain elusive.
Motif discovery has motivated the development of numerous tools and algorithms, and the use of various motif models and statistical approaches (Guha Thakurta, 2006). Motif discovery methods can be broadly divided into ‘sequence-driven’ and ‘pattern-driven’ methods. Sequence-driven methods typically build a position-weight matrix (PWM) from sequence data and use local search techniques such as expectation–maximization or Gibbs sampling to optimize the log-likelihood ratio until convergence or until a maximum number of iterations is reached. Though routinely fast, these methods are not guaranteed to yield the best solution, or global optimum (Stormo, 2000). Pattern-driven, or enumerative, methods, on the other hand, are guaranteed to find a global optimum but have the drawback of being computationally expensive and limited to short motifs.
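To make the PWM and log-likelihood ratio concrete, the following is a minimal Python sketch, not the implementation of any tool cited here. It assumes a set of already aligned, equal-length sites, a uniform pseudocount, and a zero-order background model; the function names `build_pwm`, `log_likelihood_ratio`, and `best_site` are hypothetical. A sequence-driven method would repeat steps of this kind, re-estimating the matrix from the best-scoring sites until the score stops improving.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0):
    """Estimate per-column nucleotide probabilities from aligned sites.

    Assumes all sites have the same length; the pseudocount is an
    illustrative choice that avoids zero probabilities.
    """
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pwm


def log_likelihood_ratio(word, pwm, background):
    """Log2 ratio of the word's probability under the PWM vs. the background."""
    return sum(math.log2(pwm[i][b] / background[b]) for i, b in enumerate(word))


def best_site(sequence, pwm, background):
    """Return the window of the sequence with the highest log-likelihood ratio."""
    width = len(pwm)
    windows = (sequence[i:i + width] for i in range(len(sequence) - width + 1))
    return max(windows, key=lambda w: log_likelihood_ratio(w, pwm, background))
```

Because the search starts from an initial set of sites and only improves locally, a different starting point can lead to a different final matrix, which is why such methods do not guarantee a global optimum.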
Searching a set of sequences for patterns that are overrepresented relative to a given background model may converge towards motifs that are prevalent throughout the genome and are thus unlikely to represent regulatory elements. Sinha (2003) introduced the notion of ‘discriminative’ motif discovery, in which a motif is treated as a feature that leads to good classification between positive sequences deemed to contain common cis-regulatory elements and a set of background sequences.
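The discriminative idea can be illustrated with a short Python sketch. It is neither the method of Sinha (2003) nor the Seeder algorithm: it simply enumerates every word of length k and ranks it by the difference between the fraction of positive sequences and the fraction of background sequences that contain it. The scoring rule and the function names `fraction_containing` and `rank_discriminative_words` are assumptions chosen for illustration.

```python
from itertools import product


def fraction_containing(word, sequences):
    """Fraction of sequences that contain the word at least once."""
    return sum(word in s for s in sequences) / len(sequences)


def rank_discriminative_words(positive, background, k):
    """Rank all DNA words of length k by how well they separate the two sets.

    The score is the difference in occurrence fractions, so a word that is
    common in both sets scores near zero even if it is frequent overall.
    Ties are broken alphabetically.
    """
    words = ("".join(p) for p in product("ACGT", repeat=k))
    scored = [
        (fraction_containing(w, positive) - fraction_containing(w, background), w)
        for w in words
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored
```

Enumerating all 4^k words makes the search exhaustive for a given k, but the cost grows exponentially with motif length, which reflects the limitation of enumerative methods to short motifs.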
In this work, we present the Seeder algorithm, a novel, exact discriminative seeding DNA motif discovery algorithm inspired by previous work (Keich and Pevzner, 2002; Pizzi et al.). The major benefits of the Seeder algorithm are (i) the use of intuitive and reliable statistics for the choice of motif seeds and (ii) a data structure that significantly accelerates the computation of motifs and background models. When benchmarked against popular motif-finding tools, the algorithm demonstrates greater performance. The algorithm is applied to the analysis of Arabidopsis thaliana seed-specific promoters (seed here refers to the plant structure, not to be confused with a motif seed) and identifies motifs with high similarity to seed-specific cis-regulatory elements experimentally characterized in Brassica napus, a closely related species.