1.  The InterPro protein families database: the classification resource after 15 years 
Nucleic Acids Research  2014;43(Database issue):D213-D221.
The InterPro database is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36 766 member database signatures integrated into 26 238 InterPro entries, an increase of over 3993 entries (5081 signatures) since 2012.
PMCID: PMC4383996  PMID: 25428371
2.  PTMcode v2: a resource for functional associations of post-translational modifications within and between proteins 
Nucleic Acids Research  2014;43(Database issue):D494-D502.
The post-translational regulation of proteins is mainly driven by two molecular events: their modification by several types of moieties and their interaction with other proteins. These two processes are interdependent and together are responsible for the function of the protein in a particular cell state. Several databases focus on the prediction and compilation of protein–protein interactions (PPIs) and no less on the collection and analysis of protein post-translational modifications (PTMs); however, there are no resources that concentrate on describing the regulatory role of PTMs in PPIs. We developed several methods based on residue co-evolution and proximity to predict the functional associations of pairs of PTMs that we apply to modifications in the same protein and between two interacting proteins. In order to make data available for understudied organisms, PTMcode v2 includes a new strategy to propagate PTMs from validated modified sites through orthologous proteins. The second release of PTMcode covers 19 eukaryotic species from which we collected more than 300 000 experimentally verified PTMs (>1 300 000 propagated) of 69 types extracting the post-translational regulation of >100 000 proteins and >100 000 interactions. In total, we report 8 million associations of PTMs regulating single proteins and over 9.4 million interplays tuning PPIs.
PMCID: PMC4383916  PMID: 25361965
3.  SMART: recent updates, new developments and status in 2015 
Nucleic Acids Research  2014;43(Database issue):D257-D260.
SMART (Simple Modular Architecture Research Tool) is a web resource providing simple identification and extensive annotation of protein domains and the exploration of protein domain architectures. In the current version, SMART contains manually curated models for more than 1200 protein domains, with ∼200 new models since our last update article. The underlying protein databases were synchronized with UniProt, Ensembl and STRING, bringing the total number of annotated domains and other protein features above 100 million. SMART's ‘Genomic’ mode, which annotates proteins from completely sequenced genomes, was greatly expanded and now includes 2031 species, compared to 1133 in the previous release. SMART analysis results pages have been completely redesigned and include links to several new information sources. A new, vector-based display engine has been developed for protein schematics in SMART, which can also be exported as high-resolution bitmap images for easy inclusion into other documents. Taxonomic tree displays in SMART have been significantly improved, and can be easily navigated using the integrated search engine.
PMCID: PMC4384020  PMID: 25300481
4.  PTMcode: a database of known and predicted functional associations between post-translational modifications in proteins 
Nucleic Acids Research  2012;41(Database issue):D306-D311.
Post-translational modifications (PTMs) are involved in the regulation and structural stabilization of eukaryotic proteins. The combination of individual PTM states is a key to modulate cellular functions as became evident in a few well-studied proteins. This combinatorial setting, dubbed the PTM code, has been proposed to be extended to whole proteomes in eukaryotes. Although we are still far from deciphering such a complex language, thousands of protein PTM sites are being mapped by high-throughput technologies, thus providing sufficient data for comparative analysis. PTMcode aims to compile known and predicted PTM associations to provide a framework that would enable hypothesis-driven experimental or computational analysis of various scales. In its first release, PTMcode provides PTM functional associations of 13 different PTM types within proteins in 8 eukaryotes. They are based on five evidence channels: a literature survey, residue co-evolution, structural proximity, PTMs at the same residue and location within PTM highly enriched protein regions (hotspots). PTMcode is presented as a protein-based searchable database with an interactive web interface providing the context of the co-regulation of nearly 75 000 residues in >10 000 proteins.
PMCID: PMC3531129  PMID: 23193284
6.  InterPro in 2011: new developments in the family and domain prediction database 
Nucleic Acids Research  2011;40(Database issue):D306-D312.
InterPro is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
PMCID: PMC3245097  PMID: 22096229
7.  eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges 
Nucleic Acids Research  2011;40(Database issue):D284-D289.
Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster with more functionality. In total, eggNOG v3 contains 721 801 orthologous groups, encompassing a total of 4 396 591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101 208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is amended by multiple sequence alignments and maximum-likelihood trees, and broad functional descriptions are provided for 450 904 orthologous groups (62.5%).
PMCID: PMC3245133  PMID: 22096231
8.  SMART 7: recent updates to the protein domain annotation resource 
Nucleic Acids Research  2011;40(Database issue):D302-D305.
SMART (Simple Modular Architecture Research Tool) is an online resource for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 7 contains manually curated models for 1009 protein domains, 200 more than in the previous version. The current release introduces several novel features and a streamlined user interface resulting in a faster and more comfortable workflow. The underlying protein databases were greatly expanded, resulting in a 2-fold increase in the number of annotated domains and features. The database of completely sequenced genomes now includes 1133 species, compared to 630 in the previous release. Domain architecture analysis results can now be exported and visualized through the iTOL phylogenetic tree viewer. ‘metaSMART’ was introduced as a novel subresource dedicated to the exploration and analysis of domain architectures in various metagenomics data sets. An advanced full text search engine was implemented, covering the complete annotations for SMART and Pfam domains, as well as the complete set of protein descriptions, allowing users to quickly find relevant information.
PMCID: PMC3245027  PMID: 22053084
9.  iPath2.0: interactive pathway explorer 
Nucleic Acids Research  2011;39(Web Server issue):W412-W415.
iPath2.0 is a web-based tool for the visualization and analysis of cellular pathways. Its primary map summarizes the metabolism in biological systems as annotated to date. Nodes in the map correspond to various chemical compounds and edges represent series of enzymatic reactions. In two other maps, iPath2.0 provides an overview of secondary metabolite biosynthesis and a hand-picked selection of important regulatory pathways and other functional modules, allowing a more general overview of protein functions in a genome or metagenome. iPath2.0's main interface is an interactive Flash-based viewer, which allows users to easily navigate and explore the complex pathway maps. In addition to the default pre-computed overview maps, iPath offers several data mapping tools. Users can upload various types of data and completely customize all nodes and edges of iPath2.0's maps. These customized maps give users an intuitive overview of their own data, guiding the analysis of various genomics and metagenomics projects.
PMCID: PMC3125749  PMID: 21546551
10.  Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy 
Nucleic Acids Research  2011;39(Web Server issue):W475-W478.
Interactive Tree Of Life is a web-based tool for the display, manipulation and annotation of phylogenetic trees. It is freely available and open to everyone. In addition to classical tree viewer functions, iTOL offers many novel ways of annotating trees with various additional data. The current version introduces numerous new features and greatly expands the number of supported data set types. Trees can be interactively manipulated and edited. A free personal account system is available, providing management and sharing of trees in user-defined workspaces and projects. Export to various bitmap and vector graphics formats is supported. A batch access interface is available for programmatic access or inclusion of interactive trees into other web services.
PMCID: PMC3125724  PMID: 21470960
11.  Toward molecular trait-based ecology through integration of biogeochemical, geographical and metagenomic data 
Using metagenomic ‘parts lists’ to study microbial ecology remains a significant challenge. This work proposes a molecular trait-based approach to biogeography by integrating metagenomic data with external metadata and using functional community composition as readout.
Climatic factors drive functional and phylogenetic composition of ocean microbial communities. Function dispersal is controlled by environmental conditions. Functional richness has a clear latitudinal gradient and correlates with primary production. Metagenomic data can be used as a predictor for ecosystem processes. To understand the relationship between community composition and environment, functional readouts are the most direct. Metagenomic data enable such trait-based ecology at the molecular level.
Metagenomics (shotgun sequencing of pooled DNA of complete microbial communities) is widely used to investigate ecosystem functioning of environmental and clinical samples. However, the nature of this data (usually a gigantic collection of gene fragments of 1000s of organisms) makes it very hard to infer global patterns on microbial ecology of the environment at hand. To address important ecological questions such as ‘How do microbial communities adapt to the environmental conditions?’, ‘What drives the functional variation across the globe and to what extent do genes disperse?’ and ‘What drives variation of CO2 uptake across different locations and communities?’, we integrated 25 ocean metagenomes from the Global Ocean Sampling project with geographical, meteorological and geophysicochemical data. We find that climatic factors (temperature, sunlight) are the major determinants of the functional and phylogenetic composition of an environment and the main limiting factor on whether functions disperse across the planet. We find a distinct latitudinal gradient in the size and diversity of the functional repertoire of ocean microbial communities, which peaks at 20°N and correlates with oceanic CO2 uptake. The latter can also be predicted from the molecular functional composition of an environmental sample. Together, our results show that the functional community composition derived from metagenomes can be used as a quantitative predictor for molecular trait-based biogeography and ecology.
Using metagenomic ‘parts lists’ to infer global patterns on microbial ecology remains a significant challenge. To deduce important ecological indicators such as environmental adaptation, molecular trait dispersal, diversity variation and primary production from the gene pool of an ecosystem, we integrated 25 ocean metagenomes with geographical, meteorological and geophysicochemical data. We find that climatic factors (temperature, sunlight) are the major determinants of the biomolecular repertoire of each sample and the main limiting factor on functional trait dispersal (absence of biogeographic provincialism). Molecular functional richness and diversity show a distinct latitudinal gradient peaking at 20°N and correlate with primary production. The latter can also be predicted from the molecular functional composition of an environmental sample. Together, our results show that the functional community composition derived from metagenomes is an important quantitative readout for molecular trait-based biogeography and ecology.
PMCID: PMC3094067  PMID: 21407210
ecosystems biology; environmental genomics; metagenomics; microbiology; molecular trait-based ecology
12.  A side effect resource to capture phenotypic effects of drugs 
The molecular understanding of phenotypes caused by drugs in humans is essential for elucidating mechanisms of action and for developing personalized medicines. Side effects of drugs (also known as adverse drug reactions) are an important source of human phenotypic information, but so far research on this topic has been hampered by insufficient accessibility of data. Consequently, we have developed a public, computer-readable side effect resource (SIDER) that connects 888 drugs to 1450 side effect terms. It contains information on frequency in patients for one-third of the drug–side effect pairs. For 199 drugs, the side effect frequency of placebo administration could also be extracted. We illustrate the potential of SIDER with a number of analyses. The resource is freely available for academic research.
PMCID: PMC2824526  PMID: 20087340
adverse drug reactions; database; drugs; human phenotypes; side effects
13.  Discovering Functional Novelty in Metagenomes: Examples from Light-Mediated Processes 
Journal of Bacteriology  2008;191(1):32-41.
The emerging coverage of diverse habitats by metagenomic shotgun data opens new avenues of discovering functional novelty using computational tools. Here, we apply three different concepts for predicting novel functions within light-mediated microbial pathways in five diverse environments. Using phylogenetic approaches, we discovered two novel deep-branching subfamilies of photolyases (involved in light-mediated repair) distributed abundantly in high-UV environments. Using neighborhood approaches, we were able to assign seven novel functional partners in luciferase synthesis, nitrogen metabolism, and quorum sensing to BLUF domain-containing proteins (involved in light sensing). Finally, by domain analysis, for RcaE proteins (involved in chromatic adaptation), we predict 16 novel domain architectures that indicate novel functionalities in habitats with little or no light. Quantification of protein abundance in the various environments supports our findings that bacteria utilize light for sensing, repair, and adaptation far more widely than previously thought. While the discoveries illustrate the opportunities in function discovery, we also discuss the immense conceptual and practical challenges that come along with this new type of data.
PMCID: PMC2612456  PMID: 18849420
14.  The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans 
Nature  2008;451(7180):783-788.
Choanoflagellates are the closest known relatives of metazoans. To discover potential molecular mechanisms underlying the evolution of metazoan multicellularity, we sequenced and analysed the genome of the unicellular choanoflagellate Monosiga brevicollis. The genome contains approximately 9,200 intron-rich genes, including a number that encode cell adhesion and signalling protein domains that are otherwise restricted to metazoans. Here we show that the physical linkages among protein domains often differ between M. brevicollis and metazoans, suggesting that abundant domain shuffling followed the separation of the choanoflagellate and metazoan lineages. The completion of the M. brevicollis genome allows us to reconstruct with increasing resolution the genomic changes that accompanied the origin of metazoans.
PMCID: PMC2562698  PMID: 18273011
15.  SMART 6: recent updates and new developments 
Nucleic Acids Research  2008;37(Database issue):D229-D232.
Simple modular architecture research tool (SMART) is an online tool for the identification and annotation of protein domains. It provides a user-friendly platform for the exploration and comparative study of domain architectures in both proteins and genes. The current release of SMART contains manually curated models for 784 protein domains. Recent developments were focused on further data integration and improving user friendliness. The underlying protein database based on completely sequenced genomes was greatly expanded and now includes 630 species, compared to 191 in the previous release. As an initial step towards integrating information on biological pathways into SMART, our domain annotations were extended with data on metabolic pathways and links to several pathways resources. The interaction network view was completely redesigned and is now available for more than 2 million proteins. In addition to the standard web access to the database, users can now query SMART using distributed annotation system (DAS) or through a simple object access protocol (SOAP) based web service.
PMCID: PMC2686533  PMID: 18978020
16.  metaTIGER: a metabolic evolution resource 
Nucleic Acids Research  2008;37(Database issue):D531-D538.
Metabolic networks are a subject that has received much attention, but existing web resources do not include extensive phylogenetic information. Phylogenomic approaches (phylogenetics on a genomic scale) have been shown to be effective in the study of evolution and processes like horizontal gene transfer (HGT). To address the lack of phylogenomic information relating to eukaryotic metabolism, metaTIGER has been created, using genomic information from 121 eukaryotes and 404 prokaryotes and sensitive sequence search techniques to predict the presence of metabolic enzymes. These enzyme sequences were used to create a comprehensive database of 2257 maximum-likelihood phylogenetic trees, some containing over 500 organisms. The trees can be viewed using iTOL, an advanced interactive tree viewer, enabling straightforward interpretation of large trees. Complex high-throughput tree analysis is also available through user-defined queries, allowing the rapid identification of trees of interest, e.g. containing putative HGT events. metaTIGER also provides novel and easy-to-use facilities for viewing and comparing the metabolic networks in different organisms via highlighted pathway images and tables. metaTIGER is demonstrated through evolutionary analysis of Plasmodium, including identification of genes horizontally transferred from chlamydia.
PMCID: PMC2686446  PMID: 18953037
17.  InterPro: the integrative protein signature database 
Nucleic Acids Research  2008;37(Database issue):D211-D215.
The InterPro database integrates predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total ∼58 000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein–protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the website, via web services, by downloading files by anonymous FTP or by using the InterProScan search software.
PMCID: PMC2686546  PMID: 18940856
18.  4DXpress: a database for cross-species expression pattern comparisons 
Nucleic Acids Research  2007;36(Database issue):D847-D853.
In the major animal model species like mouse, fish or fly, detailed spatial information on gene expression over time can be acquired through whole mount in situ hybridization experiments. In these species, expression patterns of many genes have been studied and data has been integrated into dedicated model organism databases like ZFIN for zebrafish, MEPD for medaka, BDGP for Drosophila or GXD for mouse. However, a central repository that allows users to query and compare gene expression patterns across different species has not yet been established. Therefore, we have integrated expression patterns for zebrafish, Drosophila, medaka and mouse into a central public repository called 4DXpress (expression database in four dimensions). Users can query anatomy ontology-based expression annotations across species and quickly jump from one gene to the orthologues in other species. Genes are linked to public microarray data in ArrayExpress. We have mapped developmental stages between the species to be able to compare developmental time phases. We store the largest collection of gene expression patterns available to date in an individual resource, reflecting 16 505 annotated genes. 4DXpress will be an invaluable tool for developmental as well as for computational biologists interested in gene regulation and evolution. 4DXpress is available online.
PMCID: PMC2238840  PMID: 17916571
19.  New developments in the InterPro database 
Nucleic Acids Research  2007;35(Database issue):D224-D228.
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver, and for download by anonymous FTP. The InterProScan search tool is now also available via a web service.
PMCID: PMC1899100  PMID: 17202162
20.  SMART 5: domains in the context of genomes and networks 
Nucleic Acids Research  2005;34(Database issue):D257-D260.
The Simple Modular Architecture Research Tool (SMART) is an online resource used for protein domain identification and the analysis of protein domain architectures. Many new features were implemented to make SMART more accessible to scientists from different fields. The new ‘Genomic’ mode in SMART makes it easy to analyze domain architectures in completely sequenced genomes. Domain annotation has been updated with a detailed taxonomic breakdown and a prediction of the catalytic activity for 50 SMART domains is now available, based on the presence of essential amino acids. Furthermore, intrinsically disordered protein regions can be identified and displayed. The network context is now displayed in the results page for more than 350 000 proteins, enabling easy analyses of domain interactions.
PMCID: PMC1347442  PMID: 16381859
21.  Fast identification of folded human protein domains expressed in E. coli suitable for structural analysis 
High-throughput protein structure analysis of individual protein domains requires analysis of large numbers of expression clones to identify suitable constructs for structure determination. For this purpose, methods need to be implemented for fast and reliable screening of the expressed proteins as early as possible in the overall process from cloning to structure determination.
88 different E. coli expression constructs for 17 human protein domains were analysed using high-throughput cloning, purification and folding analysis to obtain candidates suitable for structural analysis. After 96 deep-well microplate expression and automated protein purification, protein domains were directly analysed using 1D 1H-NMR spectroscopy. In addition, analytical hydrophobic interaction chromatography (HIC) was used to detect natively folded protein. With these two analytical methods, six constructs (representing two domains) were quickly identified as being well folded and suitable for structural analysis.
The described approach facilitates high-throughput structural analysis. Clones expressing natively folded proteins suitable for NMR structure determination were quickly identified upon small scale expression screening using 1D 1H-NMR and/or analytical HIC. This procedure is especially effective as a fast and inexpensive screen for the 'low hanging fruits' in structural genomics.
PMCID: PMC516802  PMID: 15113422
structural genomics; hydrophobic interaction chromatography; homonuclear NMR; protein domains; high-throughput expression
22.  SMART 4.0: towards genomic data integration 
Nucleic Acids Research  2004;32(Database issue):D142-D144.
SMART (Simple Modular Architecture Research Tool) is a web tool for the identification and annotation of protein domains, and provides a platform for the comparative study of complex domain architectures in genes and proteins. The January 2004 release of SMART contains 685 protein domains. New developments in SMART are centred on the integration of data from completed metazoan genomes. SMART now uses predicted proteins from complete genomes in its source sequence databases, and integrates these with predictions of orthology. New visualization tools have been developed to allow analysis of gene intron–exon structure within the context of protein domain structure, and to align these displays to provide schematic comparisons of orthologous genes, or multiple transcripts from the same gene. Other improvements include the ability to query SMART by Gene Ontology terms, improved structure database searching and batch retrieval of multiple entries.
PMCID: PMC308822  PMID: 14681379
23.  ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins 
Nucleic Acids Research  2003;31(13):3625-3630.
Multidomain proteins predominate in eukaryotic proteomes. Individual functions assigned to different sequence segments combine to create a complex function for the whole protein. While on-line resources are available for revealing globular domains in sequences, there has hitherto been no comprehensive collection of small functional sites/motifs comparable to the globular domain resources, yet these are as important for the function of multidomain proteins. Short linear peptide motifs are used for cell compartment targeting, protein–protein interaction, regulation by phosphorylation, acetylation, glycosylation and a host of other post-translational modifications. ELM, the Eukaryotic Linear Motif server, is a new bioinformatics resource for investigating candidate short non-globular functional motifs in eukaryotic proteins, aiming to fill the void in bioinformatics tools. Sequence comparisons with short motifs are difficult to evaluate because the usual significance assessments are inappropriate. Therefore the server is implemented with several logical filters to eliminate false positives. Current filters are for cell compartment, globular domain clash and taxonomic range. In favourable cases, the filters can reduce the number of retained matches by an order of magnitude or more.
PMCID: PMC168952  PMID: 12824381
24.  The InterPro Database, 2003 brings increased coverage and new features 
Nucleic Acids Research  2003;31(1):315-318.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created in 1999 as a means of amalgamating the major protein signature databases into one comprehensive resource. PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been manually integrated and curated and are available in InterPro for text- and sequence-based searching. The results are provided in a single format that rationalises the results that would be obtained by searching the member databases individually. The latest release of InterPro contains 5629 entries describing 4280 families, 1239 domains, 95 repeats and 15 post-translational modifications. Currently, the combined signatures in InterPro cover more than 74% of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15% since the inception of InterPro. New features of the database include improved searching capabilities and enhanced graphical user interfaces for visualisation of the data. The database is available via a webserver and anonymous FTP.
PMCID: PMC165493  PMID: 12520011
25.  Recent improvements to the SMART domain-based sequence annotation resource 
Nucleic Acids Research  2002;30(1):242-244.
SMART (Simple Modular Architecture Research Tool) is a web-based resource used for the annotation of protein domains and the analysis of domain architectures, with particular emphasis on mobile eukaryotic domains. Extensive annotation for each domain family is available, providing information relating to function, subcellular localization, phyletic distribution and tertiary structure. The January 2002 release has added more than 200 hand-curated domain models. This brings the total to over 600 domain families that are widely represented among nuclear, signalling and extracellular proteins. Annotation now includes links to the Online Mendelian Inheritance in Man (OMIM) database in cases where a human disease is associated with one or more mutations in a particular domain. We have implemented new analysis methods and updated others. New advanced queries provide direct access to the SMART relational database using SQL. This database now contains information on intrinsic sequence features such as transmembrane regions, coiled-coils, signal peptides and internal repeats. SMART output can now be easily included in users’ documents. A SMART mirror has been created.
PMCID: PMC99073  PMID: 11752305

