|Home | About | Journals | Submit | Contact Us | Français|
The signals that determine activation and repression of specific genes in response to appropriate stimuli are one of the most important, but least understood, types of information encoded in genomic DNA. The nucleotide sequence patterns, or motifs, preferentially bound by various transcription factors have been collected in databases. However, these motifs appear to be individually too short and degenerate to enable detection of functional enhancer and silencer elements within a large genome. Several groups have proposed that dense clusters of motifs may diagnose regulatory regions more accurately. Cluster-Buster is the third incarnation of our software for finding clusters of pre-specified motifs in DNA sequences. We offer a Cluster-Buster web server at http://zlab.bu.edu/cluster-buster/.
Enhancers and silencers of transcription consist of clusters of transcription factor binding sites (1,2). A number of publications have proposed to detect transcription regulatory regions by searching for clusters of the sequence motifs preferentially bound by a set of transcription factors (2–14). It remains to be seen whether this approach will be generally successful for large eukaryotic genomes. Most recent methods for finding motif clusters fall into two categories: those that count motif ‘hits’ occurring within a sequence window of some size (2,12,13), and those that employ probabilistic models (5,10,11,14). The advantage of the former methods is the intuitive clarity of what they do. Advantages of the latter are avoidance of arbitrary thresholds and the ability to integrate contributions from indefinitely many indefinitely weak motif hits. By using log likelihood ratios, model-based approaches can also claim to discriminate motif clusters from background DNA in a mathematically optimal way (the Neyman–Pearson Lemma).
We have taken the modeling approach, searching for regions of the sequence that resemble a statistical model of a motif cluster more than they resemble a model of ‘background DNA’. Our motif cluster model is for motifs to occur randomly with a uniform distribution across the region and the background model consists of independent, random nucleotides with probabilities estimated from their local abundances in the query sequence. We wish to identify subsequences whose log likelihood ratios, In [Prob (subsequencecluster model)/Prob (subsequencebackground model)], are maximal (i.e. they do not overlap subsequences with higher log likelihood ratios). Unfortunately, the algorithm for finding these subsequences requires time proportional to the sequence length squared and is not feasible for sequences longer than a few kb (15).
We have developed three ways of circumventing this problem. Our first program, Cister, does not directly predict motif clusters, but returns a probability curve indicating the probability that each basepair in the sequence lies within a cluster, using the linear-time Forward–Backward algorithm (10). Cister pays the price of using a slightly more complex probabilistic model with more nuisance parameters. Comet finds motif clusters in linear time with the Viterbi algorithm, but it does not calculate the full log likelihood ratio, considering only the most likely arrangement of motifs within the subsequence (11). An advantage of Comet is that it calculates E-values to indicate the statistical significance of its predictions. Cluster-Buster tackles the problem head-on, employing a linear-time heuristic which attempts to return the same cluster predictions as the full quadratic-time algorithm. As a test we applied Cluster-Buster and an implementation of the quadratic-time algorithm to a set of 27 short sequences. The two programs returned the exact same 19 clusters. So Cluster-Buster appears to be extremely successful at emulating the exact algorithm.
In constructing the Cluster-Buster web server, we have gone to unusual lengths to make it convenient to use. For example, every option on the input form is linked to a pop-up help box which describes its purpose. The output provides an overview figure depicting the locations of motif clusters and annotated protein-coding regions in the sequence, followed by graphics and tables that detail the motifs within each cluster. We also provide a Linux executable of Cluster-Buster for download, which would be necessary for large-scale or highly customized use. Cluster-Buster is extremely fast, requiring ~5s to analyze a megabase of sequence with five motifs on a 1.6GHz Athlon processor. That extrapolates to about 4h for the whole human genome.
Cluster-Buster takes two required items of input: a DNA sequence and a selection of motifs, and optionally a small number of parameters may be varied to tune its behavior. The sequence may be in raw, Fasta or GenBank format. If GenBank format is used, any annotated protein-coding regions (CDS) will be drawn in the results overview figure along with the motif clusters. Alternatively, if the user simply enters a GenBank identifier, the web server will automatically fetch the sequence. The web server, but not the downloadable program, limits the sequence length to 100kb. Motifs are entered as 4×N matrices, where rows correspond to sequential positions in the motif and columns indicate abundances of A, C, G and T at each position. A limited number of built-in motifs are provided as checkboxes and matrices can be copied and pasted directly from the TRANSFAC website (16). A user of Cluster-Buster must begin with a hypothesis that a particular set of motifs may occur in clusters.
The tunable parameters include a gap parameter, residue abundance range and pseudocount. The gap parameter indicates the average distance between motifs within a cluster in Cluster-Buster's internal model of motif clusters. Low values enhance the program's sensitivity for tight clusters of weak motifs and high values enhance its sensitivity for loose clusters of strong motifs. Cluster-Buster's model of background DNA varies across the sequence to take account of fluctuations in nucleotide abundances. The residue abundance range specifies how far either side of each point in the sequence to count residue frequencies. The pseudocount is added to all entries in all motif matrices. Pseudocounts are a widely used technique, with a theoretical underpinning in Bayesian statistics, for estimating underlying frequencies from a limited number of counts. The web server also allows low-complexity regions, such as microsatellites or poly(A) tracts, to be filtered from the sequence using the program dust (R. Tatusov and D. Lipman, unpublished). Another option is to filter sequence regions written with lowercase letters, which is becoming a standard way of indicating repetitive elements. The repetitive nature of such regions may lead to artefactually strong motif clusters, but in some cases they may contain genuine regulatory elements.
Cluster-Buster produces an overview diagram indicating the locations of motif clusters with score higher than a user-specified threshold and any annotated coding regions (Fig. (Fig.1A),1A), followed by details of the motifs within each cluster (Fig. (Fig.1B).1B). The overview represents motif clusters as green rectangles whose horizontal position indicates their position in the sequence and whose height is proportional to their log likelihood ratio score. Protein-coding regions on the forward strand are drawn as purple rectangles above the central line and those on the reverse strand are drawn below the line.
In the details figure, motifs are drawn as color-coded rectangles above or below the line according to the strand they are on (relative to the motif matrix; the ERE appears on only one strand in Fig. Fig.1B1B because we used a slightly non-palindromic matrix to represent this motif). The heights of the motif rectangles represent their log likelihood ratio scores using the standard weight matrix technique (17). A table below this figure lists the position, strand, score and sequence of each motif.
The Cluster-Buster algorithm, in outline, consists of three steps:
This method bears some resemblance to one that has been used earlier for protein fold recognition (18).
M.F. is a Howard Hughes Medical Institute Predoctoral Fellow. This project was partially funded by NSF grants DBI-0078194 and DBI-0116574 and NIH grant 1P20GM066401-01.