|Home | About | Journals | Submit | Contact Us | Français|
Identifying and characterizing the transcription factor binding site (TFBS) patterns of cis-regulatory elements represents a challenge, but holds promise to reveal the regulatory language the genome uses to dictate transcriptional dynamics. Several studies have demonstrated that regulatory modules are under positive selection and, therefore, are often conserved between related species. Using this evolutionary principle, we have created a comparative tool, rVISTA, for analyzing the regulatory potential of noncoding sequences. Our ability to experimentally identify functional noncoding sequences is extremely limited, therefore, rVISTA attempts to fill this great gap in genomic analysis by offering a powerful approach for eliminating TFBSs least likely to be biologically relevant. The rVISTA tool combines TFBS predictions, sequence comparisons and cluster analysis to identify noncoding DNA regions that are evolutionarily conserved and present in a specific configuration within genomic sequences. Here, we present the newly developed version 2.0 of the rVISTA tool, which can process alignments generated by both the zPicture and blastz alignment programs or use pre-computed pairwise alignments of several vertebrate genomes available from the ECR Browser and GALA database. The rVISTA web server is closely interconnected with the TRANSFAC database, allowing users to either search for matrices present in the TRANSFAC library collection or search for user-defined consensus sequences. The rVISTA tool is publicly available at http://rvista.dcode.org/.
Unlike most prokaryotic genomes, which are composed of tightly packed gene units with limited intergenic regions, eukaryotic genomes are rich in noncoding sequences of unknown functions. Extensive annotation of the human and mouse genomes has predicted in the vicinity of ~40000 genes, exons of which account for <5% of the genome. An additional 40–45% of the mammalian genome is comprised of repetitive DNA, while the remaining 50% is noncoding in nature (1,2). First glimpses of the human genome have revealed very few insights regarding new RNA coding genes, transcriptional regulatory elements or any other biologically relevant sequences present in noncoding regions. Although some parts of the noncoding genome will likely demonstrate no measurable biological function, it is widely assumed that much of our genetic complexity is due to sophisticated regulatory noncoding signals that determine when, where and to what extent each gene displays transcriptional activity. Despite the importance of noncoding sequences in gene regulation, our ability to computationally identify and characterize these elements is very limited.
In multi-cellular organisms, modulation of gene expression is accomplished through the compound interaction of regulatory proteins (transcription factors, TFs) and the specific DNA regions (cis-regulatory sequences or modules) with which they physically interact. Numerous DNA footprinting and biochemical studies carried out over the last decade have identified close to 500 vertebrate-specific TFs, and information regarding the DNA sequences they recognize. The TRANSFAC database (http://www.biobase.de) (3,4) represents the most comprehensive collection of TF binding specificities, summarized as position weight matrices (PWMs). A major limitation in using PWMs to computationally identify functional transcription factor binding sites (TFBSs) is that many TFs bind to short degenerate sequence motifs (6–12 bp in length). Such sequences occur very frequently in a genome, and experimentally it has been shown that only a very small fraction of these predicted TFBSs are functionally relevant.
We have previously shown that the rVista tool combines pattern recognition with comparative sequence analysis to dramatically reduce the number of false positive TFBS matches and enrich for functional sites (5). These results suggest an alternative strategy for sequence-based discovery of biologically relevant regulatory elements. To increase its versatility, and create a more efficient and user-friendly tool, we developed rVISTA 2.0, an improved web-based server that interconnects TFBS motif searches and cross-species sequence analysis with several comparative sequence analysis tools to significantly simplify and expedite its use. Originally, rVISTA required external alignment files to be submitted for analysis and was limited to only one alignment format. Currently, rVISTA accepts blastz alignments submitted at the rVISTA homepage or alignment and gene annotations automatically forwarded from (1) the ECR Browser, (2) zPicture and (3) GALA database. Also, we designed a new program for detecting TFBSs that is significantly faster than the MATCH program originally accompanying the TRANSFAC database (4,6). This new development significantly decreases the processing time, enabling the analysis of much larger genomic intervals.
There are four main ways to access the rVISTA tool: (i) submitting a blastz alignment file (7) at the rVISTA homepage (http://rvista.dcode.org/), (ii) dynamically generating and automatically forwarding (with a single mouse button click) zPicture alignments (http://zpicture.dcode.org/) (8), accessing pre-computed multiple genome alignment data available at (iii) the ECR Browser website (http://ecrbrowser.dcode.org/) (Figure (Figure1A)1A) and (iv) the GALA database. All these tools providing alignments for rVista 2.0 use the blastz program (7) to identify homologous regions and to produce local sequence alignments between the reference sequence and one or more other orthologous sequences. The local alignment method used by zPicture and the ECR Browser tools provides a careful assessment of the evolutionary rearrangements, ensuring the ability of rVISTA to detect TFBSs that have undergone positional changes relative to nearby genes and other features over the course of evolution.
rVISTA analysis proceeds in four main steps: (i) detect TFBS matches in each individual sequence using PWMs from the TRANSFAC database, (ii) identify pairs of locally aligned TFBSs, (iii) select TFBSs present in regions of high DNA conservation and (iv) create a graphical display that dynamically overlays individual or clustered TFBSs with the conservation profile of the genomic locus. Users have the option of either selecting matrices from the TRANSFAC library or inputting their own TFBS consensus sequences. TRANSFAC professional library includes matrices from vertebrates, plants, nematodes, insects, fungi and bacteria. The current TRANSFAC library utilized by rVista 2.0 contains representatives from ~500 vertebrate TF matrices that comprise ~400 TF families. Selected matrices from this library are additionally verified and improved. Users selecting the TRANSFAC library have the option to specify the stringency to be used for the PWM identification.
We have replaced the MATCH (6) program accompanying the TRANSFAC (3,4) database with a recently developed tfSearch tool for detecting TFBS (I. Ovcharenko, unpublished data). tfSearch combines ‘suffix tree’-based fast substring searches (9) with PWM scoring of substring similarities. Transforming the original sequence into the suffix tree may use extensive memory (requiring a memory allocation ~100 times larger than the size of the sequence), but it greatly raises the efficiency in localizing substrings. A substring of size N will require O(N) operations with the suffix tree in order to localize all the matches. PWM searches that use the suffix tree require a scan of the suffix tree at a depth ≤N and stop when the count at the node is below the PWM matrix similarity threshold selected by the user. Table Table11 summarizes results of PWM-detecting TFBSs in two genomic loci, 100 kb and 1 Mb long, utilizing MATCH and tfSearch tools. The gain in speed obtained with use of the tfSearch tool varies from 10- to 100-fold in comparison with the time required by the MATCH program. It is especially pronounced when a large number of PWMs is used. The speed improvement thus introduced into the rVista 2.0 tool significantly decreases the tool's response time due to the fact that detecting TFBSs in the sequence file is the performance bottleneck of this approach.
After localizing the TFBSs in both sequences, rVISTA proceeds to identify pairs of aligned TFBSs that are interconnected in the local blastz alignment. Genomic DNA insertions and deletions in either of the sequences (identified as gaps in the alignment) that occur in the core region of a TFBS disqualify the prediction. Subsequently, rVISTA requires aligned TFBS predictions to be locally highly conserved. Local conservation of at least 80% sequence identity in a 20 bp sliding window spanning the binding site (and always including the core of the binding site) selects aligned-and-conserved TFBSs (that are also referred to as conserved in the rVISTA output).
The rVista web page that is returned to the user contains detailed information on rVista processing results. This includes positional information on TFBS predictions in both sequences, and distribution of aligned and aligned-and-conserved TFBSs. The report includes data on the location, percentage identity and strand (Figure (Figure1B)1B) (reference sequence only). Conserved sites can also be visualized in the textual blast-like alignment, and are highlighted in blue. Finally, rVISTA results provide an interactive visualization module that overlays positional information on TFBS predictions above a graphical conservation profile that includes annotation of protein coding features for the locus. Clustering analysis of TFBSs permits the search and subsequent visualization of complex TFBS modules consisting of multiple different TFBSs (Figure (Figure1C).1C). For more informative analysis, users have the option to select for visualization only a subset of TFs from the initial list provided.
Several visualization parameters can be adjusted by the user: (i) alignment size (in bp) per layer, (ii) window resolution, (iii) types of site to be displayed (all, aligned, conserved) and (iv) the type of clustering analysis to be used. Two clustering options are also available, individual and combinatorial. Individual clustering is used for identifying groups of TFBSs belonging to the same TFs. Users have the option to indicate the number of sites and the size of the TFBS module they wish to identify. Combinatorial clustering is carried out for groups of TFBSs belonging to two or more different TFs. For example, if the visualization module is selected to display binding sites for TFs Hnf1, Tbx5 and Nkx2.5, and the user is interested in finding 100 bp regions that contain clusters comprised of at least five sites from this selected subset, rVISTA will identify all evolutionary conserved regions with any combination of these sites. In the visual display rVISTA will present only sites that fit the selected criteria (Figure (Figure11C).
To illustrate the application of the rVISTA tool, we have carried out an unbiased analysis for the NKX2.5 human locus with the intention of detecting the regulatory element known to play a key role in cardiac development. The conservation profile available for this gene in the ECR Browser revealed several upstream and intronic noncoding elements in this locus (Figure (Figure2A).2A). Experimental evidence suggests that Smad proteins are involved during cardiac development, and in particular a direct link has been established between NKX2.5 induction and Smad consensus sequences. Therefore, Smad transcription factors were selected as ideal candidates for NKX2.5 TFBS analysis (10–11). rVISTA analysis of the ECR Browser alignment spanning ~7 kb including the NKX2.5 region was performed using TFBS matrices for the transcription factor, Smad4. A TFBS search with a 0.85 PWM matrix cut-off identified 43 PWM matches across the locus, four of which are highly conserved in the human–mouse alignment (Figure (Figure2B).2B). All four Smad4 TFBSs are localized inside the single conserved element located ~2 kb upstream of the NKX2.5 transcription start site. This highly conserved element has been previously shown to function as a cardiac enhancer in transgenic mice. In particular, one of these conserved Smad4 TFBSs coincides with the site mutated by Lien et al. (12) and has been shown to be required for the proper activity of the NKX2.5 cardiac enhancer (Figure (Figure2C).2C). It was also demonstrated that a 2 basepair mutation (from A to C) in the most highly conserved Smad4 TFBS was able to diminish the cardiac enhancer properties of this regulatory element (Figure (Figure2C)2C) (12).
Understanding the function of noncoding DNA and identifying and characterizing the structure of transcriptional regulatory elements embedded in the human genome create continuing challenges. We present a completely redeveloped rVISTA 2.0 web server, for high-throughput discovery of cis-regulatory elements. By combining interspecies sequence conservation, reliable TF matrices and combinatorial clustering of TFBSs, rVISTA 2.0 maximizes the probability of identifying functional TFBSs. The novel features and programs implemented into rVISTA 2.0 make this tool very powerful for identifying and analyzing TFBSs in long genomic intervals. The interconnectivity with blastz, zPicture and the ECR Browser tools for genome comparative sequence analysis makes rVISTA 2.0 a valuable resource for establishing a direct link between the language of noncoding DNA and biological function of genomes.
We are grateful to Lisa Stubbs for constant support, critical comments and suggestions on the manuscript. The work was performed under the auspices of the US Department of Energy, Office of Biological and Environmental Research, by the University of California, Lawrence Livermore National Laboratory Contract No. W-7405-Eng-48.