Transcriptional regulation is a key mechanism for controlling the spatial and temporal production of mRNA molecules, which in turn affects the production of protein molecules. Thus, from a finite number of genes, a vast variety of protein forms (produced by alternative splicing and post-translational modifications) can be created in a highly regulated fashion in response to physiological and environmental stimuli. Transcriptional regulation is highly complex, especially in multicellular eukaryotic organisms such as humans. Despite the extraordinary strides that have been made in genome sequencing and computational algorithm development, our ability to recognize functional regulatory elements in multicellular eukaryotic genomic sequences remains limited. Currently available computational algorithms, if applied individually, are unable to reliably detect cis-elements that are functional in vivo. This is largely due to the complexity of eukaryotic genomes. For example, the chromatin around a site may not be open for transcription factors (TFs) to bind the DNA, and even if a site were accessible, the gene may be expressed in a cell type in which the TF is not transcribed. Furthermore, the prediction of the proximal promoter may be in error because of the alternative usage of another first exon (and thus its associated proximal promoter) at a distant chromosomal location. Computational methods will continue to improve as more experimental data become available. However, we argue that biological systems are so complex that we are still far from producing a generic computational model sophisticated enough to capture the regulatory mechanisms of all genes.
Therefore, in order for current computational methods to significantly impact our biological understanding, we must (i) integrate all methods and databases so that the user can take advantage of the strengths of different methods as well as annotations describing previous experimental results on the same gene, and (ii) make the system extremely user-friendly to bench biologists, so that they can incorporate their own expert knowledge and experimental results and perform computation–experiment iterations to maximize the impact on their results.
There are a large number of sequence databases of regulatory regions and programs for analyzing these regions. A long list is available at our lab website: http://zlab.bu.edu/zlab/gene.shtml. Some programs have a web interface and others can be downloaded to run locally. Many of the online resources provide remarkable user interfaces and rich connections between diverse types of data. However, as in all rapidly developing fields, there has been little cooperation among bioinformatics development efforts. Each developer designs data organization and user interface independently, aiming for the best possible user experience. Although the tools are individually highly functional, the lack of interoperability among them is a severe inconvenience for an experimental biologist in several respects: (i) since most sensible biological questions require the integration of various types of data, the user frequently needs multiple tools to retrieve and analyze these data; (ii) as programs with similar goals can have distinctly different user interfaces, the user has to spend substantial time adjusting to each program; (iii) there is no easy way to exchange data between programs, so the user often needs to submit the same input to multiple programs one by one and then piece together and compare their outputs, which are formatted rather differently. The lack of interoperability is also a serious problem for bioinformaticians, who must develop computer programs that communicate in multiple formats. Numerous parsers need to be implemented to load data in different formats or from different sources with distinct interfaces. If an interface or data format changes, the corresponding parser must be modified to accommodate the change. This situation is described in L. Stein's excellent reviews (1
To date, we are aware of three programs that are working toward the goal of integration. However, there is still ample room for further development. Regulatory Sequence Analysis Tools (RSAT) is a compilation of multiple tools in one website (3). It has extensive documentation, and is easy to use. However, the user still needs to paste the input sequence into each tool and perform the analysis individually. Furthermore, the resource has been designed for analyzing prokaryote and yeast sequences. Toucan is a stand-alone program with functionality for phylogenetic footprinting, overrepresented motif search and ab initio motif discovery (4). However, it has a rather limited visualization front-end. INCLUSive is a web compilation of microarray analysis and cis-element motif analysis tools (5), using Toucan as the visualization module. Similar to RSAT, tools in INCLUSive are only loosely coupled and the user must run them individually.
In this paper, we present an integration effort that includes eight web services (described in the next section), which cover most key approaches for finding regulatory motifs in higher eukaryotes. We have augmented a versatile sequence and annotation visualization program SeqVISTA (6) with a new Motif module to facilitate the integration. SeqVISTA supplies a consistent input and output interface as well as the ability to recognize rich annotations in input sequences. Its new Motif module contains functions to directly query the web services and retrieve results. The module can automatically recognize the logic of the sequential application of multiple programs and the relationships among their results and provides a coherent and versatile visualization of the output files. It is technically straightforward to integrate additional web services, which makes SeqVISTA a general integration infrastructure for biological sequence analysis.
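To illustrate why a consistent input and output interface makes adding further services straightforward, the following is a minimal sketch in Python. It is not SeqVISTA's actual code or API: the names `Annotation`, `Service`, `find_literal`, `run_all` and the registry entries are all hypothetical, and a toy exact-match scanner stands in for a remote web service. The only assumption is that every tool can be wrapped as a function from a sequence to a list of annotations in one shared format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Annotation:
    """One predicted feature on the input sequence, in a shared format (hypothetical)."""
    source: str  # name of the tool that produced the hit
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    label: str


# Assumption: each wrapped tool maps a sequence to a list of annotations.
Service = Callable[[str], List[Annotation]]


def find_literal(name: str, motif: str) -> Service:
    """Toy stand-in for a remote motif finder: reports exact matches of `motif`."""
    def run(sequence: str) -> List[Annotation]:
        hits = []
        pos = sequence.find(motif)
        while pos != -1:
            hits.append(Annotation(name, pos, pos + len(motif), motif))
            pos = sequence.find(motif, pos + 1)  # allow overlapping matches
        return hits
    return run


def run_all(sequence: str, services: Dict[str, Service]) -> List[Annotation]:
    """Submit one sequence to every registered service and merge the results."""
    merged: List[Annotation] = []
    for service in services.values():
        merged.extend(service(sequence))
    # Sort by position so hits from different tools can be compared side by side.
    return sorted(merged, key=lambda a: (a.start, a.end, a.source))


# Adding a new tool only requires one more registry entry.
registry: Dict[str, Service] = {
    "tata_scan": find_literal("tata_scan", "TATAAA"),
    "gc_box_scan": find_literal("gc_box_scan", "GGGCGG"),
}

results = run_all("CCGGGCGGTTTATAAAGGTATAAA", registry)
for hit in results:
    print(hit.source, hit.start, hit.end, hit.label)
```

In this sketch the user supplies the sequence once, and the merged, position-sorted list is what a visualization layer would consume. A real wrapper would replace `find_literal` with a call to the remote service and a conversion of its output into the shared format.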