Transcriptional regulation is one of the most basic regulatory mechanisms in the cell. The accumulation of multiple metazoan genome sequences and the advent of high-throughput experimental techniques have motivated the development of a large number of bioinformatics methods for the detection of regulatory motifs. The regulatory process is extremely complex and individual computational algorithms typically have very limited success in genome-scale studies. Here, we argue the importance of integrating multiple computational algorithms and present an infrastructure that integrates eight web services covering key areas of transcriptional regulation. We have adopted the client-side integration technology and built a consistent input and output environment with a versatile visualization tool named SeqVISTA. The infrastructure will allow for easy integration of gene regulation analysis software that is scattered over the Internet. It will also enable bench biologists to perform an arsenal of analysis using cutting-edge methods in a familiar environment and bioinformatics researchers to focus on developing new algorithms without the need to invest substantial effort on complex pre- or post-processors. SeqVISTA is freely available to academic users and can be launched online at http://zlab.bu.edu/SeqVISTA/web.jnlp, provided that Java Web Start has been installed. In addition, a stand-alone version of the program can be downloaded and run locally. It can be obtained at http://zlab.bu.edu/SeqVISTA.
Transcriptional regulation is an extremely important mechanism in controlling the spatial and temporal production of mRNA molecules, which impacts the subsequent production of protein molecules. Thus, from a finite number of genes, an almost infinite variety of protein forms (produced from alternative splicing and post-translational modifications) can be created in a highly regulated fashion in response to physiological and environmental stimuli. Transcriptional regulation is highly complex, especially in multicellular eukaryotic organisms such as humans. Despite the extraordinary strides that have been made in genome sequencing and computational algorithm development, our ability to recognize functional regulatory elements in multicellular eukaryotic genomic sequences remains limited. Currently available computational algorithms, if applied individually, are unable to reliably detect cis-elements that are functional in vivo. This is largely due to the complexity of eukaryotic genomes. For example, the chromatin around a site may not be open for the DNA to bind to transcription factors (TFs), and even if a site were accessible, the gene may be expressed in a cell type in which the TF is not transcribed. Furthermore, the prediction of the proximal promoter may be in error because of the alternative usage of another first exon (and thus the associated proximal promoter) at a distant chromosomal location. The computational methods will obviously continue to improve as more experimental data become available. However, we argue that biological systems are so complex that we are extremely far away from producing a generic computational model that is sophisticated enough to capture the regulatory mechanisms of all genes.
Therefore, in order for current computational methods to significantly impact our biological understanding, we must (i) integrate all methods and databases so that the user can take advantage of the strengths of different methods as well as annotations describing previous experimental results on the same gene, and (ii) make the system extremely user-friendly to bench biologists, so that they can incorporate their own expert knowledge and experimental results and perform computation–experiment iterations to maximize the impact on their results.
There are a large number of sequence databases of regulatory regions and programs for analyzing these sequence regions. A long list is available at our lab website: http://zlab.bu.edu/zlab/gene.shtml. Some programs have a web interface and others can be downloaded to run locally. Many of the online resources provide remarkable user interfaces and rich connections between diverse types of data. However, as with all rapidly developing fields, there has been little cooperation among bioinformatics development efforts. Each author exerts maximal creativity in order to achieve the best user experience when it comes to designing both data organization and user interface. Although individually highly functional, the lack of interoperability across multiple tools poses a severe inconvenience for an experimental biologist user in several aspects: (i) since most sensible biological questions require the integration of various types of data, the user frequently needs to use multiple tools to retrieve and analyze these data; (ii) as programs with similar goals can have distinctly different user interfaces, the user has to spend substantial time adjusting to each program; (iii) there is no easy way to exchange data between the programs and the user often needs to submit the same input to multiple programs one by one and attempt to piece together and compare the output of these programs, which are formatted rather differently. The lack of interoperability is disastrous for bioinformaticians, who must develop computer programs to communicate in multiple formats. Numerous parsers need to be implemented to load data with different formats or data from different sources with distinct interfaces. If an interface or data format changes, the parser will need to be modified to accommodate the change. This situation is described in L. Stein's excellent reviews (1,2).
To date, we are aware of three programs that are working toward the goal of integration. However, there is still ample room for further development. Regulatory Sequence Analysis Tools (RSAT) is a compilation of multiple tools in one website (3). It has extensive documentation, and is easy to use. However, the user still needs to paste the input sequence into each tool and perform the analysis individually. Furthermore, the resource has been designed for analyzing prokaryote and yeast sequences. Toucan is a stand-alone program with functionality for phylogenetic footprinting, overrepresented motif search and ab initio motif discovery (4). However, it has a rather limited visualization front-end. INCLUSive is a web compilation of microarray analysis and cis-element motif analysis tools (5), using Toucan as the visualization module. Similar to RSAT, tools in INCLUSive are only loosely coupled and the user must run them individually.
In this paper, we present an integration effort that includes eight web services (described in the next section), which cover most key approaches for finding regulatory motifs in higher eukaryotes. We have augmented a versatile sequence and annotation visualization program SeqVISTA (6) with a new Motif module to facilitate the integration. SeqVISTA supplies a consistent input and output interface as well as the ability to recognize rich annotations in input sequences. Its new Motif module contains functions to directly query the web services and retrieve results. The module can automatically recognize the logic of the sequential application of multiple programs and the relationships among their results and provides a coherent and versatile visualization of the output files. It is technically straightforward to integrate additional web services, which makes SeqVISTA a general integration infrastructure for biological sequence analysis.
Table 1 lists the addresses for the eight web services that are now integrated in SeqVISTA. Below we briefly describe the method of each of them. Among the five tools that have been developed in our lab, Glam and Clover were available only as downloads, while the others were downloadable and also supported web-based user interfaces. None of them could communicate via the Simple Object Access Protocol (SOAP; http://www.w3.org/TR/2003/REC-soap12-part0-20030624/), which is the current approach for stable integration across multiple software platforms. We have developed SOAP-based web services for these five programs.
In order to help the user learn these web services, as well as the batch function described below, we have prepared several tutorials. These tutorials can be accessed at http://zlab.bu.edu/SeqVISTA/tutorials/. Phylogenetic footprinting is a commonly used technique to discover functionally conserved sequence regions across species, which may have a higher chance of harboring regulatory elements. We plan to incorporate a phylogenetic footprinting service into SeqVISTA in the near future. Presently, the user can upload a sequence with non-conserved regions in lower-case letters and they will be skipped over in the motif-finding services in SeqVISTA.
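The lower-case convention for non-conserved regions can be illustrated with a minimal sketch. The code below is not taken from SeqVISTA or its web services; the function name conserved_segments is hypothetical, and the sketch simply assumes that upper-case runs are the regions a motif-finding service would scan.

```python
import re

def conserved_segments(sequence):
    """Return (start, end, text) for each upper-case run; lower-case runs are skipped.

    Coordinates are 0-based and end-exclusive. Hypothetical helper for illustration.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"[A-Z]+", sequence)]

# Two upper-case (conserved) stretches embedded in lower-case (non-conserved) sequence.
segments = conserved_segments("acgtTTGACAggcaTATAATcc")
for start, end, text in segments:
    print(start, end, text)
```

Keeping the original coordinates with each segment means that any motif site found inside a segment can be mapped back to its position in the full input sequence.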
Previously, we developed a Java-based sequence visualization program named SeqVISTA (6). It presents a holistic, interactive graphical view of sequence records with supporting annotation data. It has most functions for sequence manipulation, such as load, copy, paste, locate and pattern search. We have tried to make SeqVISTA as user-friendly as possible. Specifically, we have developed several functions in the sequence panel of SeqVISTA to render sub-sequence selection effortless. In addition, the user can launch the program while browsing a sequence record using Internet Explorer by clicking the ‘SeqVISTA’ button, which is added to the browser during the installation of SeqVISTA. SeqVISTA can communicate with external analysis programs and displays their outputs along with the GenBank annotations of the sequence in an integrated fashion. SeqVISTA runs on all computer platforms that support Java.
Our goal is to develop the essential features of an integrated infrastructure for computational studies of gene regulation. The infrastructure contains three components: (i) Preprocessing: focused on data collection and formatting; (ii) Computational Core: focused on data analysis, e.g. repeat masking and motif discovery; (iii) Post-Processing: focused on output visualization and integration. SeqVISTA covers the basic components of pre- and post-processing and the web services described in the previous section constitute the computational core. Since the original publication of SeqVISTA (6), we have significantly improved it to provide a reliable open architecture for loading sequences from different data sources and for performing versatile analyses on these sequences. The new features that are essential for the integration are described as follows:
The user can load an input sequence locally from a file or remotely at a web address (which corresponds to the text file of the sequence record). The latter facilitates easy exchange of sequence records among different labs through the Internet. SeqVISTA also allows the user to supply a GenBank Identification (GI) number or an accession number and it will retrieve the sequence record directly from the NCBI server. For local files, different data formats are identified by the filename extension and assigned to a corresponding parser through an XML-based configuration file (SeqVISTA.xml, located at the SeqVISTA install directory). Similarly, a file over the web can be identified by its base Universal Resource Locator (URL) and assigned to the corresponding parser. For example, the URL http://zlab.bu.edu/muscle_mouse.fasta will trigger SeqVISTA to interpret the data using the FASTA parser. For computational efficiency, parsers are loaded into SeqVISTA at run time.
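The dispatch of inputs to parsers can be sketched as follows. This is an illustration only: the two dictionaries stand in for the mappings kept in SeqVISTA.xml, whose actual schema is not described here, and the names choose_parser, PARSERS_BY_EXTENSION and PARSERS_BY_BASE_URL as well as the example.org address are hypothetical. How the base-URL rule and the extension rule are combined is also an assumption: the sketch checks base URLs first and falls back to the filename extension.

```python
import os
from urllib.parse import urlparse

# Hypothetical stand-ins for the mappings in SeqVISTA.xml (real schema not shown in the text).
PARSERS_BY_EXTENSION = {".fasta": "FASTA", ".fa": "FASTA", ".gb": "GenBank"}
PARSERS_BY_BASE_URL = {"http://example.org/records/": "GenBank"}

def choose_parser(location):
    """Pick a parser name for a local path or a web address."""
    # Assumed rule 1: a registered base URL decides the parser.
    for base_url, parser in PARSERS_BY_BASE_URL.items():
        if location.startswith(base_url):
            return parser
    # Assumed rule 2: otherwise use the extension of the file name.
    path = urlparse(location).path if "://" in location else location
    extension = os.path.splitext(path)[1].lower()
    if extension not in PARSERS_BY_EXTENSION:
        raise ValueError(f"no parser configured for {location!r}")
    return PARSERS_BY_EXTENSION[extension]

print(choose_parser("http://zlab.bu.edu/muscle_mouse.fasta"))  # FASTA
print(choose_parser("http://example.org/records/12345"))       # GenBank
```

Keeping the mapping in data rather than in code is what lets a new format be supported by editing the configuration file instead of the program.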
For each web service integrated within SeqVISTA, we have developed the corresponding service adaptor. A service adaptor allows the user to enter various parameters to modify the behavior of the corresponding web service. The service adaptors, together with the default parameters of the corresponding web service, are also configurable in the configuration file SeqVISTA.xml mentioned above. The parameters of a service adaptor are categorized into three groups: required, optional and advanced.
For the eight services integrated in SeqVISTA, the required categories of adaptor parameters in general include two items: sequence and motif. There are three options for the input sequence, allowing the user to perform analysis on the active sequence (at any point in time there is one active sequence), selected sequences, or all the sequences. A sequence can be selected by mouse-clicking the check box in front of the sequence in the tree panel of SeqVISTA (Figure 1; left panel). The sequence selection functions can also be invoked by right-mouse-clicking in the tree panel or by selecting from the Edit menu tab. SeqVISTA allows the user to select one or multiple motifs from a list provided by the corresponding service. For example, Possum, Clover and Cluster-Buster allow a flexible selection of motifs from a tree-structured list organized by motif families in the JASPAR database (31). The user can also upload motif matrices.
The optional category is typically service specific. For the services that have not had corresponding web interfaces before (Clover and Glam), we provide instant tool-tips when the user passes the mouse over the corresponding field. A detailed explanation of these parameters can be found in SeqVISTA's user manual.
The advanced category provides parameters that are only useful for advanced users in special cases. For example, the user can redirect SeqVISTA to run the RepeatMasker service at another server that is faster by changing the service address.
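The three parameter groups can be modeled with a small data structure. The sketch below is an assumption about how such an adaptor might be organized, not SeqVISTA's actual implementation; the class ServiceAdaptor, the method build_request, the service name ExampleMasker and the example.org addresses are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceAdaptor:
    """Hypothetical adaptor with required, optional and advanced parameters."""
    name: str
    required: tuple                                 # names the user must supply
    optional: dict = field(default_factory=dict)    # service-specific defaults
    advanced: dict = field(default_factory=dict)    # e.g. the service address

    def build_request(self, **user_values):
        """Merge user-supplied values over the configured defaults."""
        missing = [key for key in self.required if key not in user_values]
        if missing:
            raise ValueError(f"{self.name}: missing required parameter(s) {missing}")
        known = set(self.required) | set(self.optional) | set(self.advanced)
        unknown = sorted(set(user_values) - known)
        if unknown:
            raise ValueError(f"{self.name}: unknown parameter(s) {unknown}")
        # Defaults first, then user-supplied values override them.
        return {**self.optional, **self.advanced, **user_values}

adaptor = ServiceAdaptor(
    name="ExampleMasker",
    required=("sequence",),
    optional={"cutoff": 0.5},
    advanced={"service_address": "http://example.org/soap"},
)
# An advanced user redirects the call to a different (hypothetical) server.
request = adaptor.build_request(
    sequence="ACGT", service_address="http://mirror.example.org/soap"
)
print(request)
```

Separating the groups this way lets a typical user fill in only the required fields, while the advanced override of the service address leaves the other defaults untouched.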
We continue to use the three-panel format for SeqVISTA's display window, as described in detail previously (6). Figure 1 is a screen shot of the SeqVISTA window displaying four original input mouse sequences and the results of MotifSampler, Clover, Possum, MotifScanner and Glam on these sequences. The left panel (tree panel) is a tree structure of all sequences and their features (each feature in a different color). Motif sites predicted by the various programs are represented as features. The top-right panel (graphics panel) contains graphical depictions of the features indicated by colored boxes, with features on the + strand drawn above the line representing the sequence and features on the − strand below the line. The location and width of each colored box represent the location of the feature in the sequence and the number of bases to which the feature corresponds. The lower-right panel (sequence panel) contains the nucleotide sequence for the active sequence record. The three panels are dynamically linked; if the user selects a feature in one panel by mouse clicking, the corresponding feature and sub-sequence will be displayed and highlighted accordingly. The user can hide selected sequences or features to improve clarity. One sequence may contain one or more plots in the graphics panel to represent the results of analysis programs. The user can show, hide or delete individual plots.
An analysis produces another sequence record with features corresponding to the results. Such a record is saved in SeqVISTA as a new record, but linked to the original sequence record that was used as the input for the analysis. The name of the analysis program is prepended to the input sequence name to produce the output sequence name, which allows the user to clearly identify the parent/child relationship between these two sequences. Figure 1 compares the outputs of five programs on a set of four input sequences. For clarity, the results for only one sequence are shown, along with the original input sequence with experimentally annotated motif binding sites indicated with an arrow. It is apparent that the results of multiple programs are easily contrasted in such a visualization setting. The user can obtain the details of an individual prediction by mousing over the corresponding colored box in the graphics panel. A tool tip will appear with the details of the prediction (e.g. from and to positions and score).
The unified data model and the integration of abundant analysis programs provided by our infrastructure enable the flexibility of performing batch data analysis with multiple programs on a selected set of sequences. In addition, a pipelined batch process can be carried out sequentially such that the result of one analysis can serve as the input for another analysis. The analyses are done in the background and the user can perform visualization in the meantime. We have developed the corresponding software components to control the batch process in a separate thread. These components provide a graphical user interface (GUI) to let the user configure the batch process (see Figure 2 and explanation below). The user may also change the default parameters (such as cutoff values) for different analysis services using this GUI. SeqVISTA will check whether the analysis program takes multiple sequences as input. If not, the sequences can be fed to the analysis program one by one without intervention from the user.
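The check on multi-sequence input can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SeqVISTA's code: run_batch and its accepts_multiple flag are hypothetical, and a service is modeled as a plain function that takes a list of sequences and returns a list of results.

```python
def run_batch(service, sequences, accepts_multiple):
    """Call `service` once with all sequences, or once per sequence if it takes only one."""
    if accepts_multiple:
        return list(service(sequences))
    results = []
    for sequence in sequences:
        # Feed the sequences one at a time, collecting results in input order.
        results.extend(service([sequence]))
    return results

def toy_single_sequence_service(sequences):
    """Toy stand-in for a web service that accepts exactly one sequence per call."""
    if len(sequences) != 1:
        raise ValueError("this service accepts one sequence per call")
    return [sequences[0].count("G") + sequences[0].count("C")]

gc_counts = run_batch(toy_single_sequence_service, ["ACGT", "GGCC", "ATAT"], accepts_multiple=False)
print(gc_counts)
```

Hiding the loop behind one call is what allows the same batch configuration to be used for services with either kind of input.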
Figure 2 illustrates the GUI used to facilitate the construction and management of pipelined tasks. The left panel lists all the available services that can be used to construct a pipeline while the right panel lists the created pipelines. A pipeline is defined as a set of tasks that will be carried out with specified web services either in parallel or sequentially. Each pipeline node is marked with a checkbox, which allows a subset of pipelines to be carried out. The right panel of Figure 2 shows four example pipelines. The user can drag a web service from the left panel and drop it into the right panel and configure it with the pop-up menu in the right panel to create new pipelines. The task nodes within a pipeline are maintained using a hierarchical structure in which the tasks at the same level will be carried out in parallel while sub-tasks will be carried out sequentially. For example, the second pipeline shown in Figure 2 (Ab initio Motif Detection) indicates that PromoSer will be first run to obtain the promoter sequences, which are then used as input to Glam and MotifSampler, which will be run in parallel. Each task node can be configured to change the default parameters of the service it represents. SeqVISTA recognizes the dependencies of the tasks within a pipeline and will require a specific order of configuration whenever necessary. For services that always require configuration (e.g. PromoSer requires the user to enter a list of GenBank accession numbers), their task nodes are indicated with a ‘?’ as shown in Figure 2. The ‘?’ will disappear once the node is configured.
The created pipeline can be saved and reused later, with different input sequences and reconfiguration of individual services. By default the running pipeline is represented as an animated button in the task queue panel. The progress of the pipelined tasks can be examined by clicking on the button. The corresponding results of the pipelined analysis, however, will be added to SeqVISTA only when all tasks have finished.
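The hierarchical execution rule (tasks at the same level run in parallel, sub-tasks run after their parent) can be sketched as a small tree walk. The code is an illustration under assumptions rather than SeqVISTA's implementation: the Task class and execute function are hypothetical, and the three lambda functions are toy stand-ins for real web services.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Task:
    """A pipeline node: `run` maps input to output; children consume that output."""
    name: str
    run: Callable
    children: List["Task"] = field(default_factory=list)

def execute(task, data, results):
    """Run `task`, then hand its output to all of its children, which run in parallel."""
    output = task.run(data)
    results[task.name] = output
    if task.children:
        with ThreadPoolExecutor(max_workers=len(task.children)) as pool:
            futures = [pool.submit(execute, child, output, results) for child in task.children]
            for future in futures:
                future.result()  # wait for the child and re-raise any error it hit
    return results

# Toy pipeline: one retrieval step feeding two analysis steps that run side by side.
pipeline = Task("fetch_promoters", lambda ids: [f"promoter_of_{i}" for i in ids], [
    Task("finder_a", lambda seqs: [s.upper() for s in seqs]),
    Task("finder_b", lambda seqs: [len(s) for s in seqs]),
])
# `results` is complete only after every task in the tree has finished.
results = execute(pipeline, ["X1", "X2"], {})
print(results)
```

Because execute returns only after every child has finished, the caller sees the results of the whole pipeline at once, which mirrors the behavior of adding results to the display only when all tasks are done.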
We thank Brian Pierce and Heather Burden for thoroughly proofreading the manuscript. This work has been supported in part by NSF grants DBI-0078194 and MRI DBI-0116574 and NIH grants 1P20GM066401-01, 1R01HG03110-01 and A08-POGM66401A.