There are four main ways to access the rVISTA tool: (i) submitting a blastz alignment file (7) at the rVISTA homepage (http://rvista.dcode.org/), (ii) dynamically generating zPicture alignments (http://zpicture.dcode.org/) and forwarding them automatically with a single mouse click, (iii) accessing pre-computed multiple genome alignment data at the ECR Browser website (http://ecrbrowser.dcode.org/) (Figure A) and (iv) accessing such data in the GALA database. All of these tools that provide alignments for rVista 2.0 use the blastz program (7) to identify homologous regions and to produce local sequence alignments between the reference sequence and one or more orthologous sequences. The local alignment method used by zPicture and the ECR Browser provides a careful assessment of evolutionary rearrangements, ensuring that rVISTA can detect TFBSs that have undergone positional changes relative to nearby genes and other features over the course of evolution.
rVISTA analysis proceeds in four main steps: (i) detect TFBS matches in each individual sequence using PWMs from the TRANSFAC database, (ii) identify pairs of locally aligned TFBSs, (iii) select TFBSs present in regions of high DNA conservation and (iv) create a graphical display that dynamically overlays individual or clustered TFBSs with the conservation profile of the genomic locus. Users have the option of either selecting matrices from the TRANSFAC library or inputting their own TFBS consensus sequences. The TRANSFAC Professional library includes matrices from vertebrates, plants, nematodes, insects, fungi and bacteria. The current TRANSFAC library utilized by rVista 2.0 contains representatives from ~500 vertebrate TF matrices that comprise ~400 TF families. Selected matrices from this library are additionally verified and improved. Users selecting the TRANSFAC library can specify the stringency to be used for PWM matching.
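Step (i) and the stringency option can be illustrated with a minimal sketch of PWM scanning. This is not the rVISTA or TRANSFAC implementation: the toy matrix, the function names and the min/max normalization of the score to a 0-1 similarity are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# A PWM is modelled here as one {base: weight} column per site position.
Pwm = List[Dict[str, float]]

# Hypothetical 4 bp matrix favouring the site "ACGT" (illustration only).
EXAMPLE_PWM: Pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]


def pwm_score(pwm: Pwm, site: str) -> float:
    """Sum the column weights of the bases in `site` (unknown bases score 0)."""
    return sum(column.get(base, 0.0) for column, base in zip(pwm, site))


def scan_sequence(sequence: str, pwm: Pwm, stringency: float) -> List[Tuple[int, float]]:
    """Return (position, similarity) for every window at or above `stringency`.

    Similarity is the raw score rescaled to [0, 1] between the worst and
    best scores the matrix can produce (an assumed normalization).
    """
    best = sum(max(column.values()) for column in pwm)
    worst = sum(min(column.values()) for column in pwm)
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        raw = pwm_score(pwm, sequence[start:start + width])
        similarity = (raw - worst) / (best - worst)
        if similarity >= stringency:
            hits.append((start, similarity))
    return hits


if __name__ == "__main__":
    print(scan_sequence("TTACGTAA", EXAMPLE_PWM, stringency=0.9))
```

Raising `stringency` towards 1.0 keeps only near-perfect matches to the matrix, whereas lowering it admits weaker candidate sites.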
We have replaced the MATCH (6) program accompanying the TRANSFAC (3) database with a recently developed tfSearch tool for detecting TFBSs (I. Ovcharenko, unpublished data). tfSearch combines ‘suffix tree’-based fast substring searches (9) with PWM scoring of substring similarities. Transforming the original sequence into a suffix tree may use extensive memory (requiring an allocation ~100 times larger than the size of the sequence), but it greatly increases the efficiency of localizing substrings. Localizing all matches of a substring of size N requires O(N) operations on the suffix tree. PWM searches that use the suffix tree scan the tree to a depth ≤N and stop when the count at the node is below the PWM matrix similarity threshold selected by the user. Table 1 summarizes the results of PWM-based TFBS detection in two genomic loci, 100 kb and 1 Mb long, using the MATCH and tfSearch tools. The tfSearch tool is 10- to 100-fold faster than the MATCH program, and the gain is especially pronounced when a large number of PWMs is used. Because detecting TFBSs in the sequence file is the performance bottleneck of this approach, this speed improvement significantly decreases the response time of rVista 2.0.
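The idea of combining a tree index with PWM scoring can be sketched as follows. Because tfSearch is unpublished, this is only an illustration under stated assumptions: a depth-limited trie of all sequence windows stands in for a true suffix tree, the trie depth equals the PWM width, and a branch is abandoned as soon as the best score still achievable falls below the threshold. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Pwm = List[Dict[str, float]]


def build_trie(sequence: str, depth: int) -> dict:
    """Index every window of length `depth`; leaves store start positions under "$"."""
    root: dict = {}
    for start in range(len(sequence) - depth + 1):
        node = root
        for base in sequence[start:start + depth]:
            node = node.setdefault(base, {})
        node.setdefault("$", []).append(start)
    return root


def search_trie(trie: dict, pwm: Pwm, min_score: float) -> List[Tuple[int, float]]:
    """Walk the trie once per PWM, pruning branches that cannot reach `min_score`."""
    width = len(pwm)
    # best_rest[i] = highest score obtainable from columns i..width-1
    best_rest = [0.0] * (width + 1)
    for i in range(width - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(pwm[i].values())

    hits: List[Tuple[int, float]] = []

    def walk(node: dict, depth: int, score: float) -> None:
        if depth == width:
            hits.extend((pos, score) for pos in node["$"])
            return
        for base, child in node.items():
            new_score = score + pwm[depth].get(base, 0.0)
            # Prune: even a perfect remainder would stay below the threshold.
            if new_score + best_rest[depth + 1] >= min_score:
                walk(child, depth + 1, new_score)

    walk(trie, 0, 0.0)
    return sorted(hits)


if __name__ == "__main__":
    pwm = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    ]
    trie = build_trie("ACGTTACGT", depth=len(pwm))
    print(search_trie(trie, pwm, min_score=2.0))
```

The index is built once per sequence and reused for every matrix of the same width, and identical windows are scored only once. This is one plausible reason why a tree-based search gains most when many PWMs are applied, although the sketch makes no claim about how tfSearch achieves its reported speed-up.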
Table 1. Comparative detection of PWMs in long genomic intervals performed by MATCH (6) and tfSearch programs
After localizing the TFBSs in both sequences, rVISTA proceeds to identify pairs of aligned TFBSs that are interconnected in the local blastz alignment. Genomic DNA insertions and deletions in either of the sequences (identified as gaps in the alignment) that occur in the core region of a TFBS disqualify the prediction. Subsequently, rVISTA requires aligned TFBS predictions to be locally highly conserved. Local conservation of at least 80% sequence identity in a 20 bp sliding window spanning the binding site (and always including the core of the binding site) selects aligned-and-conserved TFBSs (that are also referred to as conserved in the rVISTA output).
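The two filters described above (no gaps in the core, and at least 80% identity in a 20 bp window that includes the core) can be sketched as a single check. The function name, the use of alignment-column coordinates and the decision to count gap columns as mismatches are assumptions for illustration, not details taken from rVISTA.

```python
def is_aligned_and_conserved(
    ref_aln: str,
    other_aln: str,
    core_start: int,
    core_end: int,
    window: int = 20,
    min_identity: float = 0.8,
) -> bool:
    """Check one aligned TFBS pair.

    `ref_aln` and `other_aln` are equal-length rows of a pairwise alignment
    with "-" marking gaps; the core occupies columns [core_start, core_end).
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("alignment rows must have equal length")

    # A gap inside the core in either sequence disqualifies the prediction.
    if "-" in ref_aln[core_start:core_end] or "-" in other_aln[core_start:core_end]:
        return False

    # Slide the window over every placement that still contains the core.
    first = max(0, core_end - window)
    last = min(core_start, len(ref_aln) - window)
    for start in range(first, last + 1):
        pairs = zip(ref_aln[start:start + window], other_aln[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        if matches / window >= min_identity:
            return True
    return False


if __name__ == "__main__":
    ref = "ACGT" * 6
    print(is_aligned_and_conserved(ref, ref, core_start=10, core_end=14))
```

In this sketch an alignment block shorter than the window never qualifies, since no full 20-column window can be placed over it.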
The rVista web page that is returned to the user contains detailed information on rVista processing results. This includes positional information on TFBS predictions in both sequences, and distribution of aligned and aligned-and-conserved TFBSs. The report includes data on the location, percentage identity and strand (Figure B) (reference sequence only). Conserved sites can also be visualized in the textual blast-like alignment, and are highlighted in blue. Finally, rVISTA results provide an interactive visualization module that overlays positional information on TFBS predictions above a graphical conservation profile that includes annotation of protein coding features for the locus. Clustering analysis of TFBSs permits the search and subsequent visualization of complex TFBS modules consisting of multiple different TFBSs (Figure C). For more informative analysis, users have the option to select for visualization only a subset of TFs from the initial list provided.
Several visualization parameters can be adjusted by the user: (i) alignment size (in bp) per layer, (ii) window resolution, (iii) types of site to be displayed (all, aligned, conserved) and (iv) the type of clustering analysis to be used. Two clustering options are available: individual and combinatorial. Individual clustering identifies groups of TFBSs belonging to the same TF. Users can specify the number of sites and the size of the TFBS module they wish to identify. Combinatorial clustering is carried out for groups of TFBSs belonging to two or more different TFs. For example, if the visualization module is set to display binding sites for the TFs Hnf1, Tbx5 and Nkx2.5, and the user is interested in finding 100 bp regions that contain clusters of at least five sites from this selected subset, rVISTA will identify all evolutionarily conserved regions with any combination of these sites. In the visual display, rVISTA will present only sites that fit the selected criteria (Figure C).
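The clustering example above can be approximated with a sliding window over sorted site positions. This is a hedged sketch rather than the rVISTA algorithm: sites are reduced to hypothetical (start position, TF name) pairs, site width is ignored, and overlapping windows are reported separately instead of being merged.

```python
from typing import List, Set, Tuple

Site = Tuple[int, str]  # (start position in bp, TF name)


def find_clusters(
    sites: List[Site],
    selected_tfs: Set[str],
    module_size: int = 100,
    min_sites: int = 5,
) -> List[List[Site]]:
    """Return every run of >= `min_sites` selected sites within `module_size` bp."""
    chosen = sorted(site for site in sites if site[1] in selected_tfs)
    clusters = []
    left = 0
    for right in range(len(chosen)):
        # Shrink the window until its sites start within module_size bp.
        while chosen[right][0] - chosen[left][0] >= module_size:
            left += 1
        if right - left + 1 >= min_sites:
            clusters.append(chosen[left:right + 1])
    return clusters


if __name__ == "__main__":
    sites = [
        (10, "Hnf1"), (30, "Tbx5"), (50, "Nkx2.5"), (70, "Hnf1"),
        (90, "Tbx5"), (500, "Hnf1"), (40, "Sp1"),
    ]
    # Combinatorial clustering: any mix of the three selected TFs counts.
    print(find_clusters(sites, {"Hnf1", "Tbx5", "Nkx2.5"}))
    # Individual clustering: restrict the selection to a single TF.
    print(find_clusters(sites, {"Hnf1"}, min_sites=2))
```

Passing a single TF name reproduces the individual mode, and passing several reproduces the combinatorial mode; in both cases the input would be restricted to conserved sites beforehand.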