SMART (Simple Modular Architecture Research Tool) is a web tool (http://smart.embl.de/) for the identification and annotation of protein domains, and provides a platform for the comparative study of complex domain architectures in genes and proteins. The January 2004 release of SMART contains 685 protein domains. New developments in SMART are centred on the integration of data from completed metazoan genomes. SMART now uses predicted proteins from complete genomes in its source sequence databases, and integrates these with predictions of orthology. New visualization tools have been developed to allow analysis of gene intron–exon structure within the context of protein domain structure, and to align these displays to provide schematic comparisons of orthologous genes, or multiple transcripts from the same gene. Other improvements include the ability to query SMART by Gene Ontology terms, improved structure database searching and batch retrieval of multiple entries.
The SMART database (http://smart.embl.de; http://smart.ox.ac.uk) provides a tool to identify and annotate the signalling domains found in many eukaryotic proteins (1). The database consists of a library of hidden Markov models, which are used to provide statistically robust inferences of the presence of specific domains in a particular sequence, together with multiple sequence alignments of user query sequences with domains. The database provides extensive annotation for each domain, and is a comprehensive source of information on the proteins in which each domain is found.
The primary motivation for the development of SMART was as a tool to study the evolution of function within multi-domain proteins. The availability of completed metazoan genomes, and increasing accuracy of prediction of gene structures and their multiple splice variants (2), has enabled us to create new extensions to SMART, allowing detailed overlaying of gene intron and exon structure with protein domain organization. This is coupled with a cross-referencing of orthologous genes in multiple genomes, collections of multiple splice variants of individual genes and new visualization tools to show schematic alignments of multiple gene structures. These new developments make SMART an ideal tool for studies of the evolution of gene and protein function.
In addition to the Swiss-Prot and spTrembl databases (3), which have been used by SMART since its inception, SMART’s source sequence databases now include all available Ensembl proteomes (2). We compare sequences from all sources and generate a non-redundant set of proteins with multiple identifiers per sequence. Sequences are retrievable, and linkable, via any of the original identifiers.
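The merging step described above can be illustrated with a minimal Python sketch. It assumes that redundancy means exact sequence identity (ignoring letter case); the function name, the identifiers and the sequences are hypothetical and are not taken from SMART itself.

```python
from collections import defaultdict


def build_nonredundant(records):
    """Collapse identical sequences and keep every identifier for each one.

    records: iterable of (identifier, sequence) pairs from any source database.
    Returns (ids_by_seq, seq_by_id).
    """
    ids_by_seq = defaultdict(list)
    for identifier, sequence in records:
        key = sequence.upper()  # assumption: identity is judged case-insensitively
        if identifier not in ids_by_seq[key]:
            ids_by_seq[key].append(identifier)
    # Reverse index so a sequence is retrievable via any original identifier.
    seq_by_id = {i: s for s, ids in ids_by_seq.items() for i in ids}
    return dict(ids_by_seq), seq_by_id


# Hypothetical identifiers and toy sequences, for illustration only.
records = [
    ("SP:EXAMPLE_1", "MKVLAAGT"),
    ("ENSP_EXAMPLE_1", "MKVLAAGT"),
    ("TR:EXAMPLE_2", "MSTNPKPQ"),
]
ids_by_seq, seq_by_id = build_nonredundant(records)
print(ids_by_seq["MKVLAAGT"])
print(seq_by_id["ENSP_EXAMPLE_1"])
```

Keeping a reverse index from identifier to sequence is one simple way to make each entry retrievable through any of its source identifiers.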
SMART continues to expand its domain coverage, with more than 70 new domains in the latest release, bringing the total close to 700. The rate at which new, widespread domains are discovered is falling, primarily because their number is limited (4). However, we continue to identify new domains of interest [e.g. (5,6)] and to establish new links between others [e.g. (7)].
The core of SMART is a relational database management system (RDBMS) which stores information on SMART domains. In addition to previously available features (1,8), the SMART database now includes information on Pfam (9) domains in all proteins in NRDB. Users can now query the database for proteins that contain specific combinations of Pfam and SMART domains.
In addition to standard ‘Domain selection’ querying, it is now possible to find proteins based on Gene Ontology (GO) (10) terms associated with domains. Associations of domains with GO terms are taken from InterPro (11). GO querying is a two-step process. In the first step, the user obtains a list of domains matching the GO terms entered. In the second step, the user selects the domains of interest from this list, and the proteins containing those domains are displayed. As with standard domain querying, results can be limited to specific taxonomic ranges.
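The two-step GO query can be sketched in Python as follows. This is an illustration only, not the SMART implementation: the data structures, function names, domain names and GO identifiers are placeholders, and the sketch assumes that a protein must contain all selected domains to be reported.

```python
def domains_for_go(go_terms, domain_to_go):
    """Step 1: list the domains associated with any of the GO terms entered."""
    wanted = set(go_terms)
    return sorted(d for d, terms in domain_to_go.items() if wanted & set(terms))


def proteins_with_domains(selected, proteins, taxa=None):
    """Step 2: list proteins containing every selected domain.

    proteins: {protein_id: {"domains": set_of_domains, "taxon": name}}
    taxa: optional set of taxon names used to restrict the results.
    """
    selected = set(selected)
    return sorted(
        pid
        for pid, info in proteins.items()
        if selected <= info["domains"] and (taxa is None or info["taxon"] in taxa)
    )


# Placeholder associations and proteins (hypothetical, not taken from InterPro).
domain_to_go = {
    "DOM_A": ["GO:term1"],
    "DOM_B": ["GO:term2"],
    "DOM_C": ["GO:term1", "GO:term2"],
}
proteins = {
    "P1": {"domains": {"DOM_A", "DOM_C"}, "taxon": "Metazoa"},
    "P2": {"domains": {"DOM_A"}, "taxon": "Fungi"},
    "P3": {"domains": {"DOM_B"}, "taxon": "Metazoa"},
}

candidates = domains_for_go(["GO:term1"], domain_to_go)
hits = proteins_with_domains(["DOM_A"], proteins, taxa={"Metazoa"})
print(candidates)
print(hits)
```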
SMART uses the CRC64 algorithm to calculate checksums for all user-supplied sequences. If a matching checksum is found in the SMART database, pre-calculated results are displayed. Approximately 45% of all user-submitted sequences are identified in this way, resulting in shorter queues and much faster response times for all users.
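The checksum lookup can be sketched as below. The text does not state which CRC64 variant SMART uses; this sketch assumes the ISO 3309 polynomial in reflected, table-driven form with a zero initial value. The normalisation step, the function names and the stored result are likewise assumptions made for illustration.

```python
# Reflected form of the ISO 3309 polynomial x^64 + x^4 + x^3 + x + 1
# (an assumption: the variant used by SMART is not specified in the text).
POLY = 0xD800000000000000


def _make_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


TABLE = _make_table()


def crc64(sequence):
    """Return the 64-bit checksum of an ASCII sequence string."""
    crc = 0
    for byte in sequence.encode("ascii"):
        crc = TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc


def normalise(sequence):
    """Strip whitespace and upper-case, so that formatting does not alter the checksum."""
    return "".join(sequence.split()).upper()


def lookup(sequence, precomputed):
    """Return stored results for a sequence, or None if it must be analysed afresh."""
    return precomputed.get(crc64(normalise(sequence)))


# Hypothetical store of pre-calculated results keyed by checksum.
precomputed = {crc64("MKVLAAGT"): "stored annotation"}
print(lookup("mkvl aagt\n", precomputed))
print(lookup("MKVLAAGS", precomputed))
```

A checksum match is not strict proof of sequence identity, so a production system might also compare the stored sequence before returning cached results.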
Since user-supplied sequences can now be identified, several important new features have been introduced into SMART:
(i) Batch access: the SMART batch access facility allows users to submit multiple sequence identifiers or actual sequences, either by directly pasting the data into their web browser, or by uploading a file to the SMART server. If the user supplies plain sequences, their CRC checksums are calculated, and those with matches in the SMART database are displayed.
(ii) Intron positions shown in schematic protein figures: for proteins that match any of the Ensembl predictions, SMART will show intron positions as vertical coloured lines in graphical representations (Fig. 1). This information is retrieved from a pre-calculated mapping of Ensembl gene structures to protein sequences.
(iii) Extra information in the main results page: in cases where multiple IDs are associated with the same sequence, users get a list of all IDs with links to corresponding source databases. Since SMART now incorporates Ensembl genomes, users also get a list of alternative splices of the gene encoding the analysed protein (if there are any). It is possible to either display SMART protein annotation for any of the alternative splices, or get a graphical multiple sequence alignment of all of them.
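The mapping of gene structure to protein coordinates mentioned in item (ii) is not described in detail here. One plausible way to derive intron positions, sketched below in Python, is to accumulate the lengths of the coding portions of each exon and convert nucleotides to codons. The function name and the exon lengths are hypothetical.

```python
def intron_positions(coding_exon_lengths):
    """Map introns onto protein coordinates.

    coding_exon_lengths: lengths (in nucleotides) of the coding part of each
    exon, in transcript order.
    Returns one (complete_codons_before_intron, phase) pair per intron, where
    phase is the number of nucleotides of a split codon lying before the intron.
    """
    positions = []
    coding_nt = 0
    for length in coding_exon_lengths[:-1]:  # no intron follows the last exon
        coding_nt += length
        positions.append((coding_nt // 3, coding_nt % 3))
    return positions


# Hypothetical gene with three coding exons of 90, 100 and 50 nucleotides.
print(intron_positions([90, 100, 50]))
```

In this toy gene the first intron falls cleanly after codon 30 (phase 0), whereas the second interrupts codon 64 after its first nucleotide (phase 1).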
User sequences can now be searched against profiles derived from the SCOP database, using RPS-Blast (12,13). As well as detecting homologues of known structure, this enables easy identification of the evolutionary superfamily to which any domains belong, and complements the links provided in domain annotation pages.
SMART provides orthology information for all Ensembl predicted proteins. These relationships are distinct from those provided by Ensembl. There are two separate sets of orthologues for each protein: 1:1 reciprocal best matches in other genomes and orthologous groups with reciprocal best hits from all genomes analysed (i.e. each of these proteins has exactly one orthologue in all six genomes). Orthologous groups are displayed as graphical multiple sequence alignments (Fig. 2). All orthology information is extracted from all-against-all Smith–Waterman (14) similarities for combined proteomes, using a previously described method (15).
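The two kinds of orthologue set can be illustrated with a simplified reciprocal-best-hit sketch in Python. It is not the published method (15), which may differ in its details; the similarity scores, protein names and genome labels are toy values, and ties are resolved in favour of the first hit encountered.

```python
from collections import defaultdict
from itertools import combinations


def best_hits(scores, genome_of):
    """For each protein, find its best-scoring hit in every other genome."""
    best = {}
    for (query, subject), score in scores.items():
        if genome_of[query] == genome_of[subject]:
            continue
        key = (query, genome_of[subject])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: hit for key, (hit, _) in best.items()}


def reciprocal_best_hits(scores, genome_of):
    """Return 1:1 pairs in which each protein is the other's best hit."""
    best = best_hits(scores, genome_of)
    return {
        frozenset((query, hit))
        for (query, _), hit in best.items()
        if best.get((hit, genome_of[query])) == query
    }


def orthologous_groups(pairs, genome_of):
    """Return groups with one member per genome, all pairwise reciprocal."""
    genomes = set(genome_of.values())
    partners = defaultdict(set)
    for pair in pairs:
        a, b = tuple(pair)
        partners[a].add(b)
        partners[b].add(a)
    groups = set()
    for protein, others in partners.items():
        members = others | {protein}
        covers_all = (len(members) == len(genomes)
                      and {genome_of[m] for m in members} == genomes)
        mutual = all(frozenset(p) in pairs for p in combinations(members, 2))
        if covers_all and mutual:
            groups.add(frozenset(members))
    return groups


# Toy data: three genomes (A, B, C) and symmetric similarity scores.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
symmetric = {
    ("a1", "b1"): 900, ("a1", "c1"): 850, ("b1", "c1"): 870,
    ("a2", "b2"): 400, ("a2", "b1"): 300, ("a2", "c1"): 200,
    ("a1", "b2"): 250, ("b2", "c1"): 150,
}
scores = {}
for (x, y), value in symmetric.items():
    scores[(x, y)] = value
    scores[(y, x)] = value

pairs = reciprocal_best_hits(scores, genome_of)
groups = orthologous_groups(pairs, genome_of)
print(sorted(sorted(p) for p in pairs))
print(sorted(sorted(g) for g in groups))
```

In this toy example a2 and b2 form a 1:1 pair but belong to no orthologous group, because neither has a reciprocal best hit in genome C.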
With the growing number of completely sequenced eukaryotic genomes, the scientific community requires tools for easy comparative and large-scale analyses. With recent additions, we have expanded SMART’s capabilities to accommodate the needs of many different types of user.