SRS (Sequence Retrieval System) is a widely used keyword search engine for querying biological databases. BLAST2 is the most widely used tool to query databases by sequence similarity search. These tools allow users to retrieve sequences by shared keyword or by shared similarity, with many public web servers available. However, with the increasingly large datasets available it is now quite common that a user is interested in some subset of homologous sequences but has no efficient way to restrict retrieval to that set. By allowing the user to control SRS from the BLAST output, BLAST2SRS (http://blast2srs.embl.de/) aims to meet this need. This server therefore combines the two ways to search sequence databases: similarity and keyword.
BLAST2 is the most widely used tool to query databases by sequence similarity search (1). The major BLAST2 webservers, as at NCBI, EBI or ExPASy, allow the user to retrieve sequences returned in the output. This is essential for enabling further analyses such as multiple alignment, calculating a tree, investigating residue conservation and other forms of comparative sequence analysis.
Often a user will be interested in a defined subset of sequences, for example just mammalian or just purple bacteria. Servers such as at NCBI and ExPASy allow prior taxonomic restriction of the database, saving the computational search time.
However, we find that a more common situation arises when the user is not interested in all the protein sequences found by BLAST, which in the case of protein kinases, for example, will number in the thousands. Instead, the set of interesting sequences depends on what the search returns and may need continual revision as the analysis of the hits proceeds. As an example, if one is interested in the evolution of vertebrate multigene families, an invertebrate outgroup may be needed. But which? Chordates like amphioxus or Ciona would be desirable but the sequence may not have been determined yet. The Drosophila proteome is available but the homologous protein may be absent (true, for example, of many vertebrate extracellular matrix proteins in this soft-centred organism). Shall we next try Caenorhabditis or Aplysia? Our choice of sequence to use as outgroup depends on what sequences are found to be available.
While the user can click through an output list to find out which species are present, the cryptic entry IDs used by SPTrEMBL (and other databases except SWISS-PROT) are a hindrance. Furthermore no major BLAST server provider currently includes sufficient useful information like species or gene name in the output to allow rapid perusal, even though this information can be easily parsed into the BLAST-formatted binary databases.
Experienced SRS users can run BLAST from within an SRS server (e.g. http://srs.ebi.ac.uk/) and use query linking to select desired data subsets. The approach is powerful, though difficult for naive users. More recently BLAST at NCBI has allowed the user to apply ENTREZ keyword queries to select sequence entries from the output. As noted above, however, both servers currently provide uninformative BLAST outputs so the user may still have difficulty determining what is present.
To meet the need for flexible on-line retrieval, both for ourselves and others, we wanted a server that allows the user to apply taxonomic and other keywords to delimit sequence collection from BLAST output. One way to do this is by harnessing the SRS retrieval tool (2,3) in tandem with BLAST, principally by setting up the BLAST web output with SRS controls. A comprehensive sequence similarity search is performed first, and the results, a list of database hits, are saved for subsequent rounds of selection, filtering and viewing by different criteria. Additionally, by carrying useful database annotation into the BLAST output, the user can be better informed as to what is available for retrieval.
Here we summarise the functionality of the BLAST2SRS server. It is designed for flexible retrieval of subsets of related sequence entries in the SWISS-PROT and SPTrEMBL protein sequence databases (4). The focus is on this alone and features, such as graphical alignment summaries, found on more sophisticated user interfaces are not implemented here. Nor is this server aimed at maximising sensitive detection of remote homologues, where PSI-BLAST (1), Profile (5) or HMM (6) searches would be more appropriate.
To meet our aims with the server we cannot support arbitrary flat file database formats. Therefore only the SWISSALL database group (4) is used, allowing for consistent parsing and preprocessing of the database. This includes the SWISS-PROT database, the best annotated of all the protein sequence databases but with a relatively small set of entries (4). SWISSALL is updated weekly and now includes >1000000 protein sequences. It should be fairly comprehensive as regards annotated genes in sequence data processed by the major DNA databases; for up-to-the-minute data, the user should access individual genome servers. Our formatted binary BLAST database includes the species, gene name, description and fragment fields in addition to the ID and Accession numbers.
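As a rough illustration of how annotation can be carried from a flat-file entry into a sequence header before the BLAST database is formatted, the following minimal Python sketch reads the ID, AC, DE, GN and OS lines of a SWISS-PROT-style entry and detects the "(Fragment)" flag in the description. It is not the server's actual preprocessing code: the function names, the header layout and the sample entry (DEMO_HUMAN, X00000) are all invented for this example.

```python
import re

# Hypothetical, hand-written entry in SWISS-PROT flat-file style (not real data).
ENTRY = """\
ID   DEMO_HUMAN     STANDARD;      PRT;   120 AA.
AC   X00000;
DE   Example protein kinase domain
DE   (Fragment).
GN   DEMO1.
OS   Homo sapiens (Human).
//
"""

def parse_entry(text):
    """Collect ID, accession, description, gene, species and fragment status."""
    fields = {"id": "", "accession": "", "description": "", "gene": "", "species": ""}
    multi = {"DE": "description", "GN": "gene", "OS": "species"}  # may span lines
    for line in text.splitlines():
        # Two-letter line code, three spaces, then the value.
        code, value = line[:2], line[5:].strip()
        if code == "ID" and not fields["id"]:
            fields["id"] = value.split()[0]
        elif code == "AC" and not fields["accession"]:
            fields["accession"] = value.split(";")[0].strip()
        elif code in multi:
            key = multi[code]
            fields[key] = (fields[key] + " " + value).strip()
    for key in multi.values():
        fields[key] = fields[key].rstrip(".")
    # Incomplete sequences are flagged in the description line.
    fields["fragment"] = re.search(r"\(Fragments?\)", fields["description"]) is not None
    return fields

def make_defline(f):
    """Build a FASTA header in an assumed layout (not the server's real one)."""
    flag = "fragment" if f["fragment"] else "complete"
    return (f">{f['id']} {f['accession']} [{f['species']}] "
            f"gene={f['gene']} {flag} {f['description']}")

if __name__ == "__main__":
    print(make_defline(parse_entry(ENTRY)))
```

Keeping the species in a fixed, bracketed position is one simple way to make it recoverable later from the hit definition line in the BLAST output.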
The server is run on an IBM Intel processor-based FreeBSD server using Apache 2. Parser scripts are written in Python using available library routines for flat file parsing (7). NCBI BLAST2 (1) version 2.2.5 is currently used. The BLAST XML output is parsed into an HTML format with a species list in the header, E-value cutoff and other controls. Links are provided to the EMBL SRS server (currently SRS version 5; http://srs.embl-heidelberg.de:8000/srs5/). BLAST2SRS is at http://blast2srs.embl.de/.
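The XML parsing step can be sketched as follows, using only the Python standard library. This is an illustration, not the server's parser. The element names (Hit, Hit_id, Hit_def, Hit_accession, Hsp_evalue) follow the NCBI BLAST XML output, while the sample document, the bracketed-species convention in Hit_def and the function names are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Minimal hand-written document using NCBI BLAST XML element names (invented hits).
SAMPLE = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_id>DEMO_HUMAN</Hit_id><Hit_def>Example factor [Homo sapiens]</Hit_def>
<Hit_accession>X00000</Hit_accession>
<Hit_hsps><Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp></Hit_hsps></Hit>
<Hit><Hit_id>DEMO_SULSO</Hit_id><Hit_def>Example factor [Sulfolobus solfataricus]</Hit_def>
<Hit_accession>X00001</Hit_accession>
<Hit_hsps><Hsp><Hsp_evalue>2e-20</Hsp_evalue></Hsp>
<Hsp><Hsp_evalue>0.5</Hsp_evalue></Hsp></Hit_hsps></Hit>
<Hit><Hit_id>DEMO_YEAST</Hit_id><Hit_def>Related factor [Saccharomyces cerevisiae]</Hit_def>
<Hit_accession>X00002</Hit_accession>
<Hit_hsps><Hsp><Hsp_evalue>0.02</Hsp_evalue></Hsp></Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""

def parse_blast_xml(xml_text):
    """Return one record per hit, scored by its best (lowest) HSP E-value."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.iter("Hit"):
        evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
        if not evalues:
            continue  # skip hits without any reported alignment
        hits.append({
            "id": hit.findtext("Hit_id"),
            "accession": hit.findtext("Hit_accession"),
            "definition": hit.findtext("Hit_def"),
            "evalue": min(evalues),
        })
    return hits

def species_list(hits, cutoff=1e-3):
    """Sorted species names among hits at or below the E-value cutoff.

    Assumes the species appears in square brackets in the definition line.
    """
    found = set()
    for h in hits:
        match = re.search(r"\[([^\]]+)\]", h["definition"] or "")
        if match and h["evalue"] <= cutoff:
            found.add(match.group(1))
    return sorted(found)

if __name__ == "__main__":
    print(species_list(parse_blast_xml(SAMPLE)))
```

A species list built this way could populate the tick boxes in the output header, with the cutoff exposed as a user control.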
In a future revision, the parser will be rewritten to replace the XML with the standard BLAST output. Using the XML output has resulted in an inefficient parser that impacts server performance.
When a user requires an arbitrary set of sequences (arbitrary from the perspective of the server provider), they will have several main reasons in mind for keeping or rejecting entries. These include the significance of the match (E-value), the species, the protein/gene name (when there are paralogous groups in a multiprotein family) and whether the protein is a fragment (useless for phylogeny and many other analyses). These are foreseeable requirements and specific control is provided to the user by BLAST2SRS. A list of included species is given in the header, which is very useful in situations where taxonomy is important. There are also less foreseeable requirements, such as desiring a higher level taxonomic node (e.g. Chordata) and/or wishing to exclude transmembrane receptors from the output of a tyrosine kinase query. The SRS query language can support these demands, so a query input box is available to the user.
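The foreseeable criteria listed above (E-value, species, gene name and fragment status) amount to a simple filter over the saved hit list. The sketch below shows one way such a filter could look. The Hit record, the select_hits function and the sample hits are hypothetical and are not taken from the server code; the default cutoff of 1e-3 mirrors the default E-value cutoff mentioned for the server.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

@dataclass
class Hit:
    """One saved BLAST hit with the annotation needed for selection."""
    entry_id: str
    species: str
    gene: str
    evalue: float
    fragment: bool

def select_hits(hits: Iterable[Hit],
                max_evalue: float = 1e-3,
                species: Optional[Set[str]] = None,
                gene: Optional[str] = None,
                drop_fragments: bool = False) -> List[Hit]:
    """Keep hits that pass every criterion the user has switched on."""
    kept = []
    for h in hits:
        if h.evalue > max_evalue:
            continue  # not significant enough
        if species is not None and h.species not in species:
            continue  # species not ticked by the user
        if gene is not None and gene.lower() not in h.gene.lower():
            continue  # gene name does not match
        if drop_fragments and h.fragment:
            continue  # incomplete sequence
        kept.append(h)
    return kept

# Invented hits for illustration only.
HITS = [
    Hit("DEMO1_HUMAN", "Homo sapiens", "DEMO1", 1e-50, False),
    Hit("DEMO1_SULSO", "Sulfolobus solfataricus", "DEMO1", 1e-20, False),
    Hit("DEMO1_PYRFU", "Pyrococcus furiosus", "DEMO1", 3e-18, True),
    Hit("DEMO2_YEAST", "Saccharomyces cerevisiae", "DEMO2", 0.02, False),
]

if __name__ == "__main__":
    archaea = {"Sulfolobus solfataricus", "Pyrococcus furiosus"}
    for h in select_hits(HITS, species=archaea, drop_fragments=True):
        print(h.entry_id, h.evalue)
```

Because the filter runs on the saved hit list rather than on a new search, the selection can be revised repeatedly without rerunning BLAST.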
The user invokes BLAST2SRS in the usual way by copy-pasting a protein sequence into the input box, choosing the database and filtering option and submitting the search. Limited control over other BLAST parameters is also available. The search and output parsing will take longer than typical for one of the major service providers, a minute or more depending on query, DB and output sizes. The output page can be bookmarked to guarantee retrieval and to return to at a later date. Table 1 summarises the retrieval controls that are to be found in the output header.
Figure 1 shows an example of output from BLAST2SRS. Imagine we are interested in the evolution of TFIIB transcription factors and we want to know if any occur in prokaryotic organisms. Using the human TFIIB sequence (SWISS-PROT Accession Q00403) we search SWISSALL with BLAST2SRS. There are >70 significant matches from ~40 species. [In this case there is no clear border between the true and false matches with eukaryotic paralogous sequences (BRF/TFIIIB) scoring a little below the default E-value cutoff at e−3.] Archaea such as Sulfolobus and Pyrococcus are represented with good E-values (e−20). Archaeal experts can simply tick the archaeal species listed, update the selection and then collect the sequences in FASTA format, suitable, for example, for ClustalX multiple alignment (8). Less knowledgeable researchers can select ‘All Species’ and type ‘archaea’ into the SRS text box and use SRS to retrieve all the archaeal sequences, choosing formats such as FASTA or SWISS-PROT. At a quick glance there are no significant eubacterial matches in the list but, to be sure, a query with ‘bacteria’ in the SRS text box will confirm that there are none. We have now established that TFIIB homologues exist in archaea but not in eubacteria and retrieved the significant archaeal matches for further processing. This example has a smallish list of matches so is easy to work through to become familiar with BLAST2SRS: the utility of the server increases with the size of the BLAST output since it is impractical to work through hundreds of entries by individual retrieval and scrutiny.
BLAST2SRS can be helpful in any situation where both homology and keyword are needed to define a set of sequences. This list illustrates some typical situations where this is true:
We thank Chenna Ramu and Christine Gemünd for help, discussions and Python modules.