Enhancers and silencers of transcription consist of clusters of transcription factor binding sites (1
). A number of publications have proposed to detect transcription regulatory regions by searching for clusters of the sequence motifs preferentially bound by a set of transcription factors (2
). It remains to be seen whether this approach will be generally successful for large eukaryotic genomes. Most recent methods for finding motif clusters fall into two categories: those that count motif ‘hits’ occurring within a sequence window of some size (2
), and those that employ probabilistic models (5
). The advantage of the former methods is the intuitive clarity of what they do. Advantages of the latter are avoidance of arbitrary thresholds and the ability to integrate contributions from indefinitely many indefinitely weak motif hits. By using log likelihood ratios, model-based approaches can also claim to discriminate motif clusters from background DNA in a mathematically optimal way (the Neyman–Pearson Lemma).
We have taken the modeling approach, searching for regions of the sequence that resemble a statistical model of a motif cluster more than they resemble a model of ‘background DNA’. Our motif cluster model is for motifs to occur randomly with a uniform distribution across the region and the background model consists of independent, random nucleotides with probabilities estimated from their local abundances in the query sequence. We wish to identify subsequences whose log likelihood ratios, In [Prob (subsequence
cluster model)/Prob (subsequence
background model)], are maximal (i.e. they do not overlap subsequences with higher log likelihood ratios). Unfortunately, the algorithm for finding these subsequences requires time proportional to the sequence length squared and is not feasible for sequences longer than a few kb (15
We have developed three ways of circumventing this problem. Our first program, Cister, does not directly predict motif clusters, but returns a probability curve indicating the probability that each basepair in the sequence lies within a cluster, using the linear-time Forward–Backward algorithm (10
). Cister pays the price of using a slightly more complex probabilistic model with more nuisance parameters. Comet finds motif clusters in linear time with the Viterbi algorithm, but it does not calculate the full log likelihood ratio, considering only the most likely arrangement of motifs within the subsequence (11
). An advantage of Comet is that it calculates E-values to indicate the statistical significance of its predictions. Cluster-Buster tackles the problem head-on, employing a linear-time heuristic which attempts to return the same cluster predictions as the full quadratic-time algorithm. As a test we applied Cluster-Buster and an implementation of the quadratic-time algorithm to a set of 27 short sequences. The two programs returned the exact same 19 clusters. So Cluster-Buster appears to be extremely successful at emulating the exact algorithm.
In constructing the Cluster-Buster web server, we have gone to unusual lengths to make it convenient to use. For example, every option on the input form is linked to a pop-up help box which describes its purpose. The output provides an overview figure depicting the locations of motif clusters and annotated protein-coding regions in the sequence, followed by graphics and tables that detail the motifs within each cluster. We also provide a Linux executable of Cluster-Buster for download, which would be necessary for large-scale or highly customized use. Cluster-Buster is extremely fast, requiring ~5
s to analyze a megabase of sequence with five motifs on a 1.6
GHz Athlon processor. That extrapolates to about 4
h for the whole human genome.