A central challenge in molecular biology is understanding the logic and mechanisms of gene regulation. An important step towards this lies in identifying regulatory elements within genomes. The prominent role of sequence specificity in controlling gene expression allows regulatory element detection to be performed via motif discovery, where the goal is to find approximately repeated patterns in unaligned sequences that are thought to share a common regulator and thus possess a common motif. Such sets of sequences can be obtained through DNA microarray studies (Tavazoie et al.
), ChIP-chip (Lee et al.
) or ChIP-seq (Robertson et al.
) experiments or protein binding microarrays (Mukherjee et al.
). Alternatively, sets of regulatory regions of orthologous genes, which may be partly or wholly aligned, can be searched for regulatory elements (e.g. Stark et al.
Because it represents the most basic type of first step analysis towards uncovering regulatory networks, de novo
motif finding is a classic and widely studied problem in computational biology. Despite numerous motif finding approaches, based on various formulations and algorithmic techniques [see reviews of Das and Dai (2007
); MacIsaac and Fraenkel (2006
), and references therein], no single currently existing method can solve them completely [e.g. see Das and Dai (2007
); Hu et al.
); Tompa et al.
Here, we introduce a novel motif finding algorithm, SAMF
(Solution-Aggregating Motif Finder), which builds upon powerful techniques in machine learning by modeling the problem as a Markov Random Field (MRF) and iteratively inferring an ensemble of highly probable model configurations (or top scoring solutions) using the Best Max-Marginal First (BMMF) algorithm (Yanover and Weiss, 2004
). We utilize an exact calculation of statistical significance (Zaslavsky and Singh, 2006
) to determine the number of configurations to be considered, and derive coherent motifs by applying a new technique for aggregating and clustering the ensemble of significant configurations. Each such configuration is assigned a weight that depends on its score in the model, and the algorithm predicts a final set of putative motifs corresponding to sufficiently highly weighted clusters. This ensemble-based procedure enables SAMF to detect both distinct multiple motifs and repeated motif instances within each sequence without requiring an estimate on the number of binding sites. The spirit of our work agrees with the recent observation that considering an ensemble of solutions to motif finding (Reddy et al.
) and optimization problems in general (Webb-Robertson et al.
) is more suitable than simply searching for a single optimal solution. Our approach and the meaning ascribed to the term ‘ensemble’ is different from that of a number of recent methods, which combine the findings of several motif discovery algorithms, each potentially utilizing entirely different methodologies and objective functions, to produce a final motif prediction (Hu et al.
; Wijaya et al.
). Here, we explicitly aim to enumerate and aggregate an ensemble of distinct
solutions obtained via the same underlying MRF model.
While our method can be used in any motif finding application, whether in DNA, RNA or protein sequences, we apply it to detect regulatory elements in prokaryotic genomes. In prokaryotes, regulation of gene expression is carried out primarily during transcription, and is modulated by transcription factors that carry out their function by binding DNA fragments in the immediate upstream vicinity of the gene being regulated. Though much recent attention has focused on computational methods for the difficult problem of identifying regulatory elements in eukaryotes [e.g. Tompa et al.
)], prokaryotic transcription factor binding site detection comes with its own set of challenges. Prokaryotic promoter regions contain binding sites that tend to be long [10–48 bp, Robison et al.
]; this limits the applicability of motif finders based on pattern enumeration—which are the basis of some of the most successful motif finders in eukaryotes (Elemento et al.
; Tompa et al.
)—as they typically consider binding sites of at most 12 bp. Additionally, prokaryotic binding sites can overlap and often appear in tandem (Hermsen et al.
; Karp et al.
), mechanisms that are used to selectively modulate gene expression. The overlapping nature of the binding sites poses a problem to some motif finders (Bailey and Elkan, 1995
; Roth et al.
) as they are often constrained to look for non-overlapping motifs. Additionally, many methods require the expected number of motifs and their occurrences to be specified as parameters.
Our algorithm, SAMF, is based on a novel formulation and robustly addresses the issues above. The motifs we are able to find are not length constrained and can overlap. We can uncover multiple distinct motifs and multiple occurrences without having to provide an estimate on the number of binding sites. Indeed, we obtain excellent results in practice. In particular, we test SAMF on sets of genes experimentally determined to be regulated by a common Escherichia coli
transcription factor (McGuire et al.
; Robison et al.
), and apply our algorithm to look for regulatory motifs in the corresponding upstream regions. We compare our results with those of a representative set of previous approaches, and demonstrate that our algorithm outperforms the others both in sensitivity and specificity when identifying regulatory elements in bacteria. Additionally, we predict a number of putative sites and wholly conserved motifs in this dataset, and provide biological evidence that they likely are real transcription factor binding sites.