With the rapid accumulation of genomic sequence data and the development of high-throughput experimental technologies such as DNA microarrays and their combination with chromatin immunoprecipitation, recent years have witnessed a substantial increase in computational efforts toward the understanding of transcriptional regulation. The identification of short sequence motifs, such as transcription factor binding sites, is at the center of such efforts. Regulatory motifs are often overrepresented in the promoters of co-regulated genes, which can be detected by microarray studies of genetically altered or chemically stimulated cells. There are two main approaches toward finding such motifs: (i) ab initio motif discovery with multiple local sequence alignment and (ii) the detection of statistical overrepresentation of previously known motifs. While much attention has been paid to ab initio algorithms [e.g. (1–5)], the detection of overrepresentation is simpler and potentially more powerful, because the search is confined to a library of known motifs (6). However, the latter approach is not able to identify previously unknown motifs.
Algorithms for testing whether a motif is overrepresented in a target set of DNA sequences have appeared only recently (6–12). Most of these methods ultimately reduce to the statistics of contingency tables. A position-specific scoring matrix (PSSM) for a motif is first scanned across the target sequences and a set of control sequences, and matches with a similarity score greater than some threshold are recorded. The data can be cast as a 2 × 2 contingency table, with the four entries of the table denoting (i) the number of target sequences with a match, (ii) the number of control sequences with a match, (iii) the number of targets without a match and (iv) the number of controls without a match. A χ² test or Fisher's exact test (using the hypergeometric distribution) can then be performed to test the null hypothesis that the sequences with motif matches are evenly distributed among the target and control sets. Three flavors of such sequence-based methods have been published (7–9). Since target and control sequences are typically of different lengths and can contain different numbers of motifs, several motif-based methods have been proposed to count the total number of motifs rather than sequences, and construct a similar contingency table to that described above (8, 10–12). We have recently developed a method named Clover which cannot be classified as sequence- or motif-based. Rather, it combines multiple matches per sequence in an intuitive manner, motivated by a simple thermodynamic model (6). We further establish the statistical significance of the results via comparison to sequences obtained by randomizing nucleotides or dinucleotides in the target sequences, by randomizing the columns of each PSSM and by selecting random segments from a large set of background DNA sequences (6).
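The sequence-based contingency-table test described above can be illustrated with a short Python sketch. It computes a one-sided Fisher's exact P-value directly from the hypergeometric distribution. The function name and argument names are illustrative assumptions, and the code is not taken from any of the cited programs; it only shows the general form of the test.

```python
from math import comb


def fisher_overrepresentation_p(targets_with, controls_with,
                                targets_without, controls_without):
    """One-sided Fisher's exact test on a 2 x 2 table (illustrative sketch).

    Returns the probability, under the hypergeometric null model, of seeing
    at least `targets_with` target sequences among all sequences that
    contain a motif match.
    """
    n_target = targets_with + targets_without
    n_control = controls_with + controls_without
    n_match = targets_with + controls_with
    total = n_target + n_control

    # Number of ways to choose which sequences carry a match.
    denominator = comb(total, n_match)

    # Sum the hypergeometric tail: k matched targets, k >= observed count.
    upper = min(n_target, n_match)
    numerator = sum(
        comb(n_target, k) * comb(n_control, n_match - k)
        for k in range(targets_with, upper + 1)
    )
    return numerator / denominator


# Hypothetical counts: 3 of 4 targets and 1 of 4 controls contain a match.
p_value = fisher_overrepresentation_p(3, 1, 1, 3)
```

A small P-value suggests that matches are concentrated in the target set. A motif-based variant would fill the same table with motif counts rather than sequence counts.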
With the accumulation of known motifs in databases such as TRANSFAC (13) and JASPAR (14), motif-overrepresentation algorithms are becoming sufficiently accurate for guiding laboratory experimentation. Active collaborations centered on the application of such algorithms have sprung up between our lab and three experimental labs studying estrogen response elements, dopamine response genes and platelet-specific genes. We quickly realized the need for a user-friendly web interface for motif-overrepresentation algorithms. Here we describe an interactive web server for a motif-based algorithm Rover (12), a sequence-based algorithm Motifish (M. C. Frith, Y. Fu and Z. Weng, unpublished results) and Clover (6). For comparison, we also include a simple program named Possum which scans one sequence against one motif, since this represents the base-line prediction.
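As an illustration of what scanning one sequence against one motif involves, the following Python sketch slides a PSSM along a sequence and records windows whose score exceeds a threshold. It is a minimal sketch under stated assumptions, not Possum's actual implementation: the function name, the toy matrix, the additive integer scoring and the forward-strand-only scan are all hypothetical choices made for the example.

```python
def scan_pssm(sequence, pssm, threshold):
    """Return (start, window, score) for every window scoring above threshold.

    `pssm` is a list of per-position dicts mapping A/C/G/T to a score.
    Only the forward strand is scanned in this sketch.
    """
    width = len(pssm)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(column[base] for column, base in zip(pssm, window))
        if score > threshold:
            hits.append((start, window, score))
    return hits


# Toy 3-column matrix (assumed values) that favors the word ACG.
toy_pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
toy_hits = scan_pssm("TTACGTACGA", toy_pssm, threshold=5)
```

With these toy values, only exact occurrences of ACG pass the threshold. A practical scanner would typically also score the reverse complement and use log-odds scores derived from observed base frequencies.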
MotifViz makes these four algorithms available to a much wider audience and facilitates the comparison of results from different approaches. This web server provides a focused toolset for motif detection, in contrast to other applications that are available only as command-line-driven, platform-specific programs or are buried in large software packages. With experimental scientists in mind, we have carefully designed a consistent web interface that gives a similar look and feel to all four algorithms. Both an overview of motif distribution and a detailed sequence output are presented, with cross-referencing and mouse-over functions to facilitate the selection of sequence regions for experimental testing. The common input and output formats we provide for this spectrum of motif detection algorithms allow the user to identify the relevant benefits of different methods and to focus later research on higher-likelihood results on which multiple methods agree.