The web resource Regulatory Sequence Analysis Tools (RSAT) (http://rsat.ulb.ac.be/rsat) offers a collection of software tools dedicated to the prediction of regulatory sites in non-coding DNA sequences. These tools include sequence retrieval, pattern discovery, pattern matching, genome-scale pattern matching, feature-map drawing, random sequence generation and other utilities. Alternative formats are supported for the representation of regulatory motifs (strings or position-specific scoring matrices) and several algorithms are proposed for pattern discovery. RSAT currently holds >100 fully sequenced genomes and these data are regularly updated from GenBank.
Despite the essential role played by non-coding sequences in transcriptional regulation, genome annotations usually focus on identifying the genes and predicting their function through sequence similarity searches. The services offered by most genome centers are restricted to analysis of coding and peptidic sequences. The web resource Regulatory Sequence Analysis Tools (RSAT) is dedicated to the analysis of the other part of the genomes: the non-coding sequences. It proposes a collection of modular tools which can be combined in different ways to predict regulatory elements. Three main scenarios can be handled: (i) starting from a set of co-regulated genes, retrieve their upstream sequences and detect over-represented motifs, which might be responsible for their co-regulation (pattern discovery); (ii) predict the location of binding sites for a known transcription factor in a given sequence (pattern matching); (iii) starting from the known consensus pattern for a given transcription factor, scan all upstream sequences of a selected genome in order to predict putative target genes (genome-scale pattern matching).
The procedures currently supported by RSAT are summarized in Table 1. These procedures are linked in a pipeline, as illustrated in Figure 1. In the following we describe different tasks and the programs that are most appropriate for performing them.
The simplest input for RSAT is a list of gene names. Using this list, the retrieve-seq program returns upstream, downstream or unspliced ORF sequences (introns and spliced ORFs will soon be supported). The user can specify the left and right limits of the sequences to be retrieved. Default values have been selected for each genome, depending on the average size of the intergenic regions and the mechanisms of regulation. Upstream sequences can be retrieved over a constant size, but an option also allows them to be clipped in order to avoid the inclusion of coding sequences from upstream ORFs.
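The clipping option described above can be illustrated with a minimal sketch. This is not the retrieve-seq implementation: the function name, the 0-based coordinates, the restriction to genes on the forward strand and the toy genome are all assumptions made for illustration.

```python
from typing import Optional


def retrieve_upstream(genome: str, gene_start: int, size: int,
                      prev_orf_end: Optional[int] = None) -> str:
    """Return up to `size` bases located immediately upstream of a gene.

    Assumptions (hypothetical, for illustration only):
    - `genome` is the forward strand and the gene lies on that strand;
    - `gene_start` is the 0-based index of the first coding base;
    - `prev_orf_end` is the 0-based exclusive end of the closest upstream ORF.
    """
    # Constant-size window, truncated at the beginning of the sequence.
    left = max(0, gene_start - size)
    # Optional clipping: never extend into the upstream ORF.
    if prev_orf_end is not None:
        left = max(left, min(prev_orf_end, gene_start))
    return genome[left:gene_start]


# Toy genome: the gene of interest starts at index 16 (ATG).
genome = "AAAACCCCGGGGTTTTATGAAA"
fixed = retrieve_upstream(genome, 16, 8)                      # constant size
clipped = retrieve_upstream(genome, 16, 8, prev_orf_end=12)   # clipped
```

With the toy genome, the constant-size call returns the eight bases preceding the start codon, whereas the clipped call stops at the end of the hypothetical upstream ORF.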
The specificity of a transcription factor can be described by a pattern. Two alternative formats are currently used to describe regulatory signals: strings (including the IUPAC alphabet for ambiguous nucleotides) or position-specific scoring matrices (PSSM) (1).
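To make the matrix representation concrete, the sketch below converts a table of nucleotide counts into log-odds weights and scores candidate sites with it. The pseudo-count scheme, the uniform background and all function names are assumptions chosen for illustration; they are not necessarily the formulas used by the matrix-based programs available in RSAT.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, background=None, pseudo=1.0):
    """Convert per-position count dictionaries into log-odds weights.

    `counts` is a list with one {base: count} dictionary per motif position.
    The pseudo-count is distributed according to the background frequencies
    (an assumption; other corrections are possible).
    """
    background = background or {b: 0.25 for b in BASES}
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + pseudo
        matrix.append({
            b: math.log((column.get(b, 0) + pseudo * background[b])
                        / total / background[b])
            for b in BASES
        })
    return matrix


def score_site(matrix, site):
    """Sum the weights of the bases observed at each position of `site`."""
    return sum(column[base] for column, base in zip(matrix, site))


def best_match(matrix, seq):
    """Return (score, position) of the best-scoring window in `seq`.

    Assumes `seq` is at least as long as the matrix.
    """
    width = len(matrix)
    return max((score_site(matrix, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Hypothetical 3-column motif built from four aligned sites, all "TGA".
counts = [{"T": 4}, {"G": 4}, {"A": 4}]
pssm = log_odds_matrix(counts)
score, position = best_match(pssm, "CCTGACC")
```

A string would describe this toy motif simply as TGA; the matrix additionally assigns a graded penalty to each mismatching base, which is what allows it to capture degenerate positions.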
When the regulatory pattern is known (e.g. the consensus binding sequence for a given transcription factor), one may wish to locate its occurrences in order to identify putative transcription factor binding sites in the upstream sequences of a set of genes. Patterns can be collected from the literature or obtained from specialized databases (2–4). String-based pattern matching is performed with the program dna-pattern. This program supports the IUPAC degenerate alphabet, as well as regular expressions, which allow the specification of spacers of variable length. Patterns can be searched on either one or both strands. A matrix-based pattern matching procedure, patser, developed by Jerry Hertz (5,6), has been integrated into the web interface.
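String-based matching with the IUPAC alphabet can be sketched by translating each degenerate symbol into a regular-expression character class and searching both the pattern and its reverse complement. This is a minimal illustration, not the dna-pattern code; the function names, the strand labels "D" and "R" and the example pattern are assumptions.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(pattern):
    """Reverse complement of a sequence or IUPAC pattern."""
    return pattern.translate(COMPLEMENT)[::-1]


def iupac_to_regex(pattern):
    """Translate an IUPAC string into a regular expression."""
    return "".join(IUPAC[symbol] for symbol in pattern.upper())


def find_matches(pattern, seq, both_strands=True):
    """Return sorted (position, strand, matched text) tuples.

    Positions are 0-based on the input sequence. "D" marks a match of the
    pattern as given, "R" a match of its reverse complement.
    """
    pattern, seq = pattern.upper(), seq.upper()
    searches = [("D", pattern)]
    if both_strands and reverse_complement(pattern) != pattern:
        searches.append(("R", reverse_complement(pattern)))
    hits = []
    for strand, pat in searches:
        # The lookahead reports overlapping occurrences as well.
        regex = re.compile("(?=(" + iupac_to_regex(pat) + "))")
        for match in regex.finditer(seq):
            hits.append((match.start(), strand, match.group(1)))
    return sorted(hits)


hits = find_matches("CACGTK", "TTCACGTGAATTCACGTTAA")
```

In this example the site at position 2 is reported on both strands because CACGTG is its own reverse complement, whereas CACGTT at position 12 matches on the direct strand only.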
Given a set of co-regulated genes, pattern discovery programs can be used to detect over-represented motifs in their upstream regions. This is particularly useful for the prediction of regulatory motifs from clusters of co-expressed genes, such as those obtained from microarray data or other high-throughput methods. Several algorithms for pattern discovery are supported. The program oligo-analysis (7) analyzes oligonucleotide occurrences and returns those that are statistically over-represented (Table 2A).
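The principle of detecting over-represented oligonucleotides can be sketched as follows: count every word of length k, then compare each count with the number expected from a background frequency using a binomial tail probability. This is a simplified illustration under stated assumptions, not the oligo-analysis program: the function names, the default uniform background and the threshold are hypothetical, and the binomial model ignores the dependency between overlapping occurrences.

```python
from collections import Counter
from math import comb


def count_oligos(seqs, k):
    """Count all overlapping words of length k; also return the number of positions."""
    counts = Counter()
    positions = 0
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
            positions += 1
    return counts, positions


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def over_represented(seqs, k, expected_freq, threshold=0.01):
    """Return (oligo, occurrences, p-value) sorted by increasing p-value.

    `expected_freq` maps oligos to background frequencies; words missing
    from it fall back to a uniform frequency (an assumption).
    """
    counts, n = count_oligos(seqs, k)
    results = []
    for oligo, occurrences in counts.items():
        p = expected_freq.get(oligo, 0.25 ** k)
        p_value = binomial_tail(n, occurrences, p)
        if p_value < threshold:
            results.append((oligo, occurrences, p_value))
    return sorted(results, key=lambda result: result[2])


# Toy input; real analyses use upstream sequences of co-regulated genes.
toy = ["ACGTACGTACGT", "TTACGTTT"]
ranked = over_represented(toy, 4, expected_freq={})
```

In practice the choice of background frequencies matters a great deal; the uniform fallback used here is only a placeholder for frequencies estimated from an appropriate reference set of sequences.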
Despite its simplicity, this program has proven to be very efficient for the detection of regulatory motifs in the yeast Saccharomyces cerevisiae. However, some motifs escape detection because they take the form of a spaced dyad, i.e. a pair of very short oligonucleotides separated by a region of fixed length but variable content. A second program, dyad-analysis (8), specifically detects such spaced motifs, which are typical of many bacterial transcription factors and of the fungal binuclear zinc cluster proteins. String-based pattern discovery programs generally return several oligonucleotides or dyads, which can be assembled with the program pattern-assembly to yield larger and/or partially degenerate motifs (Table 2B). Two matrix-based pattern discovery programs, Andrew Neuwald's gibbs sampler (9) and Jerry Hertz's consensus (5,6), are also available.
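The notion of a spaced dyad can be made concrete with a short counting sketch: for every spacing in a chosen range, each pair of short words separated by that spacing is tallied, regardless of the content of the spacer. This only illustrates the counting step under assumed parameter names and defaults; it is not the dyad-analysis program and omits the statistical evaluation of the counts.

```python
from collections import Counter


def count_dyads(seqs, monad=3, min_spacing=0, max_spacing=20):
    """Count (left word, spacing, right word) triples in a set of sequences.

    `monad` is the length of each half of the dyad; the spacer content
    is ignored, only its length is recorded.
    """
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for spacing in range(min_spacing, max_spacing + 1):
            width = 2 * monad + spacing
            for i in range(len(seq) - width + 1):
                left = seq[i:i + monad]
                right = seq[i + monad + spacing:i + width]
                counts[(left, spacing, right)] += 1
    return counts


def format_dyad(left, spacing, right):
    """Render a dyad in a compact notation such as CGGn{11}CCG."""
    return f"{left}n{{{spacing}}}{right}"


# Two toy sequences sharing the same dyad with different spacer content.
toy = ["CGG" + "A" * 11 + "CCG", "TT" + "CGG" + "T" * 11 + "CCG" + "TT"]
dyads = count_dyads(toy)
label = format_dyad("CGG", 11, "CCG")
```

Because the two toy sequences have entirely different spacers, an oligonucleotide count would not link them, whereas the dyad count registers the shared pair of trinucleotides at a spacing of 11.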
The strength of string-based pattern discovery methods is their very low rate of false positives and the fact that they are able to return multiple motifs when a set of genes is regulated by several factors. This is illustrated by the example in Table 2B, where the analysis of 10 methionine-responsive genes led to the detection of two distinct patterns, corresponding to the binding sites of Met4p and Met31p, respectively. Matrix-based programs return a more refined description of pattern degeneracy, but have the drawback of always returning an answer, even when random sequences are submitted.
Pattern matching can be applied to the full set of upstream sequences in a genome, in order to predict genes possibly regulated by a given transcription factor. It should be noted that the simple presence of a motif in a given upstream region is generally not sufficient to predict regulation. Indeed, given the short size of the motifs and the large size of the genomes, hundreds, or even thousands of matches could be returned by chance alone. Predictions can be improved by detecting multiple binding sites, either for the same transcription factor, or for combinations of several different transcription factors.
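The order of magnitude of chance matches can be estimated with a back-of-the-envelope calculation: the expected count is the number of scanned positions multiplied by the probability that a random position matches the pattern. The sketch below assumes independent, equiprobable nucleotides unless other frequencies are supplied; the function names and the genome dimensions (6,000 upstream regions of 800 bp) are hypothetical figures chosen for illustration.

```python
# Bases covered by each IUPAC symbol.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_probability(pattern, freqs=None):
    """Probability that a random position matches `pattern`.

    Assumes independent nucleotides; uniform frequencies by default.
    """
    freqs = freqs or {base: 0.25 for base in "ACGT"}
    probability = 1.0
    for symbol in pattern.upper():
        probability *= sum(freqs[base] for base in IUPAC_BASES[symbol])
    return probability


def expected_matches(pattern, n_seqs, seq_len, strands=1, freqs=None):
    """Expected number of chance matches over `n_seqs` sequences of `seq_len` bp."""
    positions = max(0, seq_len - len(pattern) + 1)
    return n_seqs * positions * strands * match_probability(pattern, freqs)


# Hypothetical genome-scale scan with a hexanucleotide on one strand.
chance_hits = expected_matches("CACGTG", n_seqs=6000, seq_len=800)
```

Under these assumptions, a single hexanucleotide is expected to occur about 1,165 times by chance on one strand, which is why an isolated match carries little predictive value.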
The results obtained by pattern matching can be displayed graphically, in the form of a feature map (Fig. 2). In this map, each motif is represented by a box painted in a different color, whose height is proportional to the statistical significance of the pattern. Feature maps are not only useful for illustrative purposes; they can also reveal additional properties of the discovered motifs, such as a conserved position relative to the start codon, a distal or proximal location, the pairing of heterologous motifs and so on.
Random sequences are useful for performing negative controls. Indeed, some programs have the drawback of systematically returning an answer, even when the submitted sequence set contains no biologically significant features. The program random-seq generates random DNA sequences on the basis of various probabilistic models (independent nucleotides, Markov chains).
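The two families of models mentioned above can be sketched with a single generator: a Markov chain of order 0 corresponds to independent nucleotides, and higher orders condition each base on the preceding ones. This is an illustrative sketch rather than the random-seq program; the function names, the training sequence and the uniform fallback for unseen contexts are assumptions.

```python
import random
from collections import Counter, defaultdict


def train_markov(seqs, order=1):
    """Count transitions from each context of `order` bases to the next base.

    With order=0 the single context is the empty string, which amounts
    to a model of independent nucleotides.
    """
    transitions = defaultdict(Counter)
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - order):
            transitions[seq[i:i + order]][seq[i + order]] += 1
    return transitions


def random_sequence(length, transitions, order, rng):
    """Generate a random sequence from a trained model.

    Assumes the model was trained on at least one sequence longer than `order`.
    """
    # Start from one of the contexts observed during training.
    seq = list(rng.choice(sorted(transitions)))
    while len(seq) < length:
        context = "".join(seq[-order:]) if order else ""
        options = transitions.get(context)
        if not options:
            # Context never seen in training: fall back to uniform sampling.
            seq.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*sorted(options.items()))
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq)[:length]


rng = random.Random(0)  # fixed seed for reproducible controls
model = train_markov(["ACGTACGTAACCGGTT"], order=1)
control = random_sequence(50, model, order=1, rng=rng)
```

Submitting such sequences to a pattern discovery program indicates how often, and with what apparent significance, the program reports motifs in the absence of any biological signal.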
Another program, random-genes, selects random sets of genes for a given organism. Random gene selections provide a very stringent test for pattern discovery programs. Indeed, although each selected gene is likely to have some regulatory elements, there is no reason for the selected group, as a whole, to be co-regulated, and a good pattern discovery program should thus generally return a negative answer or motifs with low significance.
The originality of the RSAT resource is that it provides an integrated approach for tackling a variety of questions about regulatory sequences. The tools are integrated into a pipeline (Fig. 1), but can also be used individually by filling the forms with data from external sources. This includes the uploading of large sequence files.
The web interface has been designed to give non-specialists ready access to the tools. Default parameters have been defined on the basis of previous experience. In addition, a user manual provides a detailed description of the options for each program, and a series of tutorials is available to guide first-time users step by step.
The web tools presented here perform predictions of regulatory elements using several approaches under various circumstances. The approaches used have strong points as well as limitations of which one should be well aware. Any predictive method will unavoidably return false positives and/or miss some genuine regulatory patterns. Until now, most methods have been optimized and validated in microbial model organisms (yeast and bacteria). Whether these approaches can be extended to higher organisms is still an open question. Currently we are evaluating the applicability of RSAT for the detection of regulatory elements in the genomes of multicellular organisms. To perform such detection successfully, new methods, based on comparative genomics, might be required (reviewed in 10).
I acknowledge support from the Action de Recherches Concertées de la Communauté Française de Belgique (contract ARC-97/01-211). I am grateful to Jean Richelle for the system administration and to Shoshana Wodak who helped to improve this manuscript. The RSAT project was originated at the Universidad Nacional Autonoma de Mexico, in the laboratory of Julio Collado-Vides, to whom I am thankful for past and present collaboration.