The rapid emergence of experimental techniques that can probe for functional elements at whole-genome scales
necessitates computational methods to analyze data in these settings. In particular, methods that locate promoters or measure gene expression on genome-wide scales (e.g. 
) must be complemented by algorithms that can find the active regulatory elements within the larger promoters. Ab initio
computational search for transcription factor binding sites (TFBS) in DNA sequences is often termed “motif discovery”. “Motif” here refers to a general pattern describing what DNA sequences the transcription factor binds
. Motif discovery is one of the classical problems in computational sequence analysis and can be briefly stated as: Given a set of sequences containing one or several short overrepresented sites, locate these and produce a model describing them.
There are two main avenues used to attack this problem: i) enumerative algorithms based on word counting, such as 
, and ii) pattern-based approaches often using position specific weight matrices (WMs), which scores sites based on position specific weights 
. Since the binding preferences of transcription factors (TFs) are not easily captured by a single word or consensus string, pattern-based approaches can give solutions closer to the biological reality and it has been argued that the matrix score is related to the binding energy 
. However, such approaches correspond to the problem of finding local, optimal multiple alignments, which is NP-complete 
. Therefore, almost all pattern-based motif finders use statistical optimization methods such as Gibbs sampling or expectation maximization 
A typical instance of motif discovery starts with a set of upstream promoter regions of co-expressed genes suspected to be co-regulated and by extension more likely to be under control by the same regulatory machinery. This set is called the “positive set” and most methods proceed from here by locating motifs that are in some way statistically overrepresented in this set. The most successful applications of motif discovery have been in organisms whose regulatory information is densely aggregated around transcription start sites, such as Saccharomyces cerevisiae
(baker's yeast). In mammalian genomes, regulatory information is spread out over wider regions, which makes “pattern drowning” a significant issue; in other words, the information in the regulatory sites is too small to stand out in the large genomic region of interest. In this context, the accuracy of contemporary pattern finders is not sufficient for many biologically important problems 
Most methods operate with some notion of a background model describing “generic DNA” against which the over-representation is measured. The model is often a multinomial or a Markov model. The choice of model is important for obtaining good results 
. However, most such models have difficulty in capturing the complexity of the highly heterogeneous mammalian genome sequence, which has a multitude of different promoter architectures
, numerous interspersed repeats, low complexity sequences, CpG islands, etc. 
. Instead of simplifying the underlying DNA sequence by a general model, we take this to its extreme conclusion and use a very large set of promoters as the actual background instead of building a model describing the sequences in the promoters. For simplicity, we use the term “negative set” to describe the background set; this is strictly speaking not true as sites could occur in this set at a much lower frequency, since real promoters are sampled randomly. By contrasting the sets, it is possible to see what common features make the sequences in the positive set unique.
Discriminatory motif searching is not a new idea; several methods have been developed that take advantage of a negative set 
. However, many of these use word-based models 
, which might not capture the diversity of binding sites. Others again use PWMs, but have binary hit models that do not distinguish between hits as long as they are over a threshold 
. A discriminatory approach similar to ours has been combined with the use of expression data 
, but depending on the regions that are being investigated this might often not be available or even possible. We adopt an approach similar to DEME 
to identify the most discriminative set of motifs by modeling the sequence labels (positive or negative) rather than using the conventional generative approach
. However, there are some important differences to DEME. Firstly, DEME uses a global string-based search followed by a local gradient refinement, which may miss patterns that are not well-represented by a consensus string, whereas we use a global optimization technique (simulated annealing) for optimizing the model, which does not have this limitation, although it may have others (see below). Secondly, our method (Motif Annealer - MoAn) uses and optimizes a threshold, and uses an enhanced suffix array (ESA) to speed up pattern searches. Thirdly, in MoAn the length of the motif is also optimized. DEME is also particularly targeted towards proteins while our approach is intended for use with DNA.
Specifically, we use conditional maximum likelihood to estimate the WMs and their thresholds such that the probability of the positive and negative sets is maximized (see Methods
). Thus, the resulting matrices cannot be derived from the frequency matrix for the sites found – it is rather the matrices that lead to the best discrimination. The probability of a sequence is calculated as a product of the probabilities given by the matrices matching above a threshold and a simple null model for non-matching regions. From this and prior probabilities for matches in the positive and negative sets, the probability of the set label (positive or negative) is calculated. In this probability the background model cancels. The total likelihood is a product of the class probabilities for all sequences (positive and negative).
This conditional likelihood leads to a non-trivial optimization problem which is handled using simulated annealing (see Methods
), where we iteratively change the WMs and their thresholds, retaining changes that lead to higher discriminatory power using the Metropolis-Hastings algorithm 
. Given sufficient iterations, the method guarantees convergence on the optimally discriminatory motifs. To cope with the vast size of the sets we utilize a highly efficient data structure, the ESA, for searching DNA for pattern instances
. With reasonable cutoffs, this reduces the computation by an order of magnitude