A significant portion of modern biological research involves the identification of the biochemical and biological functions associated with one or multiple positions of a sequence. Numerous databases have been constructed to store these sequence regions and their associated functions, defined as sequence features. An example compilation of such databases is available at http://zlab.bu.edu/~mfrith/tools.shtml. Common features for DNA sequences include introns, exons, 3' or 5' untranslated regions, transcription start sites, cis-elements and other protein binding sites, repeats, low complexity regions and single nucleotide polymorphisms (SNPs). Protein sequence features include secondary structures (α-helices and β-strands), transmembrane regions, and post-translational modifications such as phosphorylation and glycosylation sites. There can be dozens of features associated with a single sequence. Frequently, features can be nested; for example, a SNP can reside within a cis-element, which can be in an intron. Therefore, it is extremely difficult for a text record in a database to reveal all of the salient features of a sequence to the user in an intuitive fashion.
The human genome project has motivated substantial scientific and technological developments in sequencing large eukaryotic genomes. Among the many tools developed in the course of the project, web-based graphical viewers facilitate the search, display and retrieval of sequences and annotations associated with a genome. Such viewers are typically integrated with the databases that store the genomes. They are not only extremely important for delivering the final results of a sequencing project to lab-bench biologists but also indispensable in the assembly and annotation of genome drafts, since assorted evidence must be integrated. Three well-known genome viewers are available for the public working draft of the human genome: the viewers developed by the Ensembl project (http://www.ensembl.org) [1], the human genome browser at UCSC (http://genome.ucsc.edu) [2] and the NCBI map viewer (http://www.ncbi.nlm.nih.gov). The focus of genome viewers is typically above the gene level, and their most common use is searching for evidence of novel genes. VISTA is another user-friendly program for visualizing the alignment of very long DNA sequences [3]. With the rapid enrichment of annotated sequence features, there is a need for sequence viewers that operate at the nucleotide or amino acid level and target lab-bench experimentalists. An example of such a sequence viewer, viewGene, focuses on polymorphism visualization [4].
Computational analysis of DNA and protein sequences is among the most frequently encountered activities in bioinformatics research. Computational tools for sequence analysis are often specialized to produce only one kind of feature, and frequently report it only as text output. The most widely used tools detect genes [5-8], cis-elements and general promoter regions [9-12], repeats [13,14], protein secondary structures [15-17] and protein transmembrane helices [18,19]. Currently, no visualization tool makes it easy to compare the output of a sequence analysis program with the annotations of the same sequence stored in a database, or to compare the outputs of multiple analysis programs with one another. THEATRE [20] is an attempt to combine the sequence features produced by widely used sequence analysis tools or sequence databases; however, it only produces a static PostScript graph.
We have developed SeqVISTA with the exact goal of facilitating the visualization of sequence features in annotation records such as those of GenBank [21] and Swiss-Prot [22], as well as the comparison of multiple sequence feature sets, produced by different sequence analysis programs, with the annotation record. We take advantage of the observation that all sequence features are indexed with one or several positions of the sequence, and construct a coherent framework for the representation of virtually unlimited feature types and feature sets. SeqVISTA can serve as a general platform for integrating numerous sequence analysis tools, and thus alleviate the need to develop program-specific visualization software. More importantly, with careful programming design and implementation, SeqVISTA targets the broad community of experimental biologists. All features are linked directly and dynamically to the sequence itself, and the user is presented with a global view of the most salient features. Furthermore, the user can easily and precisely extract any feature-containing sequence region for further experiments.
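
The observation that every feature is indexed by sequence positions can be illustrated with a minimal sketch. The `Feature` class, its field names, the 1-based inclusive coordinates and the helper functions below are illustrative assumptions; they are not SeqVISTA's actual data model or implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A sequence feature indexed by positions (hypothetical model)."""
    kind: str    # e.g. "intron", "cis-element", "SNP"
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    source: str  # feature set it came from, e.g. a database or a program

    def contains(self, other: "Feature") -> bool:
        # True when `other` lies entirely within this feature.
        return (self != other
                and self.start <= other.start
                and other.end <= self.end)


def extract(sequence: str, feature: Feature) -> str:
    """Return the sequence region covered by a feature."""
    return sequence[feature.start - 1:feature.end]


def enclosing(feature: Feature, features: list) -> list:
    """Return the features that contain `feature`, outermost first."""
    hits = [f for f in features if f.contains(feature)]
    return sorted(hits, key=lambda f: f.end - f.start, reverse=True)


def shared_positions(set_a: list, set_b: list) -> set:
    """Return positions covered by at least one feature in each set."""
    def covered(features):
        return {p for f in features for p in range(f.start, f.end + 1)}
    return covered(set_a) & covered(set_b)
```

Under these assumptions, the nesting example from the text (a SNP within a cis-element within an intron) reduces to interval containment, and comparing a program's output with an annotation record reduces to comparing the positions that each feature set covers.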