The SWISS-MODEL Repository is a database of annotated three-dimensional comparative protein structure models generated by the fully automated homology-modelling pipeline SWISS-MODEL. The Repository currently contains about 300 000 three-dimensional models for sequences from the Swiss-Prot and TrEMBL databases. The content of the Repository is updated on a regular basis incorporating new sequences, taking advantage of new template structures becoming available and reflecting improvements in the underlying modelling algorithms. Each entry consists of one or more three-dimensional protein models, the superposed template structures, the alignments on which the models are based, a summary of the modelling process and a force field based quality assessment. The SWISS-MODEL Repository can be queried via an interactive website at http://swissmodel.expasy.org/repository/. Annotation and cross-linking of the models with other databases, e.g. Swiss-Prot on the ExPASy server, allow for seamless navigation between protein sequence and structure information. The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated three-dimensional protein models generated by automated homology modelling, bridging the gap between sequence and structure databases.
Three-dimensional protein structures are key to a detailed understanding of the molecular basis of protein function. Combining sequence information with 3D structure gives invaluable insights for the development of effective rational strategies for experiments such as site-directed mutagenesis, studies of disease related mutations, or the structure based design of specific inhibitors. Techniques for experimental structure solution by X-ray crystallography and nuclear magnetic resonance spectroscopy have made great progress in recent years and currently more than 22 000 experimental protein structures have been deposited in the Protein Data Bank PDB (1). However, experimental protein structure determination is still a time-consuming process without guaranteed success. This is reflected by the fact that the number of structurally characterized proteins is about two orders of magnitude smaller than the number of known protein sequences in the Swiss-Prot and TrEMBL (2) databases, which hold more than one million entries. Thus, no experimental structural information is available for the vast majority of protein sequences. Therefore, theoretical methods for protein structure prediction aiming to bridge this structure knowledge gap have gained much interest in recent years. Among all current computational approaches, homology modelling is the only method that can reliably generate a three-dimensional model for a protein (3). If a target protein shares significant amino acid sequence similarity to at least one experimentally solved three-dimensional structure (template), homology or comparative modelling can be applied to construct a three-dimensional model for the new protein. During the past few years, several structural genomics initiatives (4) were started with the goal to speed up the experimental elucidation of new protein folds. Protein structure determination and comparative modelling complement one another in the exploration of the protein structure space (5).
Information from three-dimensional comparative protein models is used routinely in a wide variety of applications (6,7). The usefulness of homology models for specific applications is strongly dependent on their quality. The accuracy of a protein model can be evaluated by assessing the deviation of the model from its actual experimentally determined structure. Manual assessment of prediction methods, e.g. during the biennial CASP experiments (8), is a good means to evaluate new algorithmic developments based on a small number of examples. Likewise, automated blind assessment of modelling servers provides statistically meaningful estimates for the expected accuracy and stability of automated prediction methods (9,10). Several approaches to the automated evaluation of modelling methods have been developed during recent years (11–13). SWISS-MODEL was among the first modelling servers to join the EVA (12) project, which continuously and automatically monitors the accuracy and reliability of the participating protein structure prediction pipelines. Applications for high quality models are manifold, and include planning site-directed mutagenesis experiments and rationalizing the effect of mutations (14–16), characterization of molecular functions (17,18) and structure based drug design (19,20). Although medium accuracy models are prone to significant errors (7,21), often such inaccuracies are located in the variable surface and loop regions, while the conserved core and active sites are modelled correctly. These protein models can still provide a valuable basis for identifying functionally relevant residues for site-directed mutagenesis experiments, or for the validation of sequence based functional annotations (22).
Homology modelling of protein structures consists of four steps: template selection, target-template alignment, model building, and model evaluation. Each of these individual steps usually requires expertise in structural biology and the use of specialized computer programs. A huge and constantly growing number of structurally uncharacterized protein sequences together with the increasing number of available template structures motivated the development of automated, stable and reliable modelling methods (23,24). The idea of an internet based automated modelling facility with integrated expert knowledge was first implemented 10 years ago by Peitsch and co-workers (23,25) and formed the starting point for the SWISS-MODEL server. With the presently available computing power, it is possible to apply comparative modelling on a large scale to whole genomes, e.g. Escherichia coli (26), Saccharomyces cerevisiae (24), or entire sequence databases (27,28). It was proposed by Sanchez et al. (29) that the availability of structural information for whole protein families, organisms, or metabolic pathways will encourage new types of applications. For example, the development of drugs with higher selectivity for a given target protein would be facilitated by the availability of structural models for all proteins sharing similar ligand binding sites. Structural comparisons would allow screening for drug candidates with better specificity at an earlier stage of drug development.
Storing and organizing results of large-scale automated modelling in a database makes better use of the available computing resources, and gives instant and queryable access to models without having to wait for a computation to complete. The easy access to pre-computed and annotated comparative models through a model repository helps to enrich other database projects with structural information, e.g. sequence knowledge bases like Swiss-Prot (2), or databases dedicated to specific cellular functions, e.g. the meiosis specific database GermOnline (30). In this paper, we describe the SWISS-MODEL Repository, a database of annotated three-dimensional protein models created by the SWISS-MODEL server pipeline (31).
The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated models generated by automated homology modelling, bridging the gap between sequence and structure databases. All models in the Repository are publicly accessible via our interactive website at http://swissmodel.expasy.org/repository/. The design of the web page is printer-friendly, so that all information can be printed in one step from any standard web browser (Fig. 1). A graphical ‘model navigator’ provides an overview of the models that have been generated for a selected sequence, allowing fast and easy navigation between the different regions of the protein for which three-dimensional models are available. The ‘model info’ section contains information about the template structure and the target-template sequence alignment on which the modelling has been based. The interactive display allows expanding detailed views of the target-template sequence alignment, the force field based assessment of the model and the modelling log files. The model assessment is presented as a diagram of Gromos96 (32) force field energies and Anolea (33) mean force potentials on a per residue basis. These allow visual inspection of the model quality to identify unreliable regions, e.g. caused by errors in the target-template alignment. A small ribbon representation is included to give a first impression of the model structure. Model coordinates can be downloaded in PDB format or as DeepView projects. Protein models can be displayed directly from within the web browser using any molecular viewer application, e.g. DeepView (25), Dino (http://www.dino3d.org), or Rasmol (34). Moreover, complete DeepView (Swiss-PdbViewer) modelling projects can be exported. These project files contain the final model superposed on the template structure. DeepView is used to visualize the model and analyse certain structural features, e.g. Ramachandran plots or electrostatic properties. It also allows manual adjustment of the placement of insertions and deletions in the alignment on which the initial modelling process was based. The project with the modified alignment can then be re-submitted to the SWISS-MODEL server for further model building.
The Repository can be queried by protein or gene name, Swiss-Prot accession code, protein description keyword, E.C. number and organism name. The search interface allows these descriptors to be combined into complex queries, e.g. searching the Repository for all models of a certain enzyme in several organisms. For each model, the Repository provides links to the target sequence entry in Swiss-Prot (2), the template structure entries in PDB (1), SCOP (35) and CATH (36), and domain organization in InterPro (37). Cross-linking individual repository entries to and from other databases, e.g. ExPASy (38), allows navigation between protein sequence and structure information.
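The combination of descriptors can be illustrated with a minimal sketch. It does not reproduce the actual search interface or data schema of the Repository: the `RepositoryEntry` class, the `search` function and the accession codes are hypothetical, and the example assumes that different descriptors are combined with a logical AND, while several organisms are treated as alternatives.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class RepositoryEntry:
    """Hypothetical, simplified view of one Repository entry."""
    accession: str      # placeholder code, not a real Swiss-Prot accession
    protein_name: str
    ec_number: str
    organism: str


def search(entries: Iterable[RepositoryEntry],
           ec_number: Optional[str] = None,
           organisms: Optional[Set[str]] = None,
           keyword: Optional[str] = None) -> List[RepositoryEntry]:
    """Return the entries that match every descriptor that was supplied.

    Descriptors left as None are ignored. Several organisms are treated
    as alternatives (logical OR) within the organism descriptor.
    """
    hits = []
    for entry in entries:
        if ec_number is not None and entry.ec_number != ec_number:
            continue
        if organisms is not None and entry.organism not in organisms:
            continue
        if keyword is not None and keyword.lower() not in entry.protein_name.lower():
            continue
        hits.append(entry)
    return hits


# Made-up entries for illustration only.
EXAMPLE_ENTRIES = [
    RepositoryEntry("ACC0001", "Alcohol dehydrogenase", "1.1.1.1", "Saccharomyces cerevisiae"),
    RepositoryEntry("ACC0002", "Alcohol dehydrogenase", "1.1.1.1", "Escherichia coli"),
    RepositoryEntry("ACC0003", "Alcohol dehydrogenase", "1.1.1.1", "Oryza sativa"),
    RepositoryEntry("ACC0004", "Glutamate synthase", "1.4.7.1", "Oryza sativa"),
]

# All models of one enzyme (by E.C. number) in two organisms.
hits = search(EXAMPLE_ENTRIES, ec_number="1.1.1.1",
              organisms={"Escherichia coli", "Oryza sativa"})
```

In this sketch, omitting a descriptor simply removes that restriction, so a query with no descriptors returns every entry.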
As of August 2003, the SWISS-MODEL Repository contained 317 616 models for 282 096 different Swiss-Prot/TrEMBL sequence entries, i.e. a significant part of the sequence could be modelled for 34% of the 132 244 Swiss-Prot (Release 41.19) entries and 25% of the 941 322 TrEMBL (Release 24.6) entries. The length of the models varies from 45 to 1524 residues (ferredoxin-dependent glutamate synthase from Oryza sativa), with an average model size of 200 residues, which corresponds well with the expected size of individual protein domains; 47% of the models in the Repository correspond to eukaryotic proteins, 4.7% to human protein sequences, and 20% are of prokaryotic origin. The Repository is updated regularly to take into account new sequences, modifications of existing sequence entries, and new template structures released by the PDB that might allow the construction of models for previously unmodelled proteins, or might provide a better template for already existing model entries. Fundamental changes and improvements of the modelling pipeline also initiate a new update cycle. An individual checksum is assigned to each model entry to ensure the consistency of the data with the information in other databases.
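The relation between these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the derived quantities (overall coverage, models per sequence, and the sequence count implied by the rounded percentages) are computed for illustration and are not stated in the text. The function and constant names are hypothetical.

```python
# Figures quoted in the text (August 2003).
TOTAL_MODELS = 317_616
MODELLED_SEQUENCES = 282_096
SWISSPROT_ENTRIES = 132_244   # Release 41.19
TREMBL_ENTRIES = 941_322      # Release 24.6


def coverage_summary() -> dict:
    """Derive simple coverage statistics from the quoted figures."""
    total_entries = SWISSPROT_ENTRIES + TREMBL_ENTRIES
    # Sequence count implied by the rounded coverage percentages (34% and 25%).
    implied = round(0.34 * SWISSPROT_ENTRIES + 0.25 * TREMBL_ENTRIES)
    return {
        "total_entries": total_entries,
        "overall_coverage": MODELLED_SEQUENCES / total_entries,
        "models_per_sequence": TOTAL_MODELS / MODELLED_SEQUENCES,
        "implied_modelled_sequences": implied,
    }


summary = coverage_summary()
for name, value in summary.items():
    print(f"{name}: {value}")
```

The count implied by the rounded percentages differs from the reported 282 096 sequences by less than 1%, which is consistent with rounding of the two coverage values.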
The SWISS-MODEL Repository has been implemented using relational database technology. During the modelling process, it communicates with the SWISS-MODEL server pipeline and keeps track of the workflow for individual target sequences. The models in the SWISS-MODEL Repository are computed by a modified version of the SWISS-MODEL server pipeline (31). Since no manual intervention takes place during the model building process, care must be taken to assess the quality of models generated to minimize the number of erroneous models in the database. We have defined criteria for entry of models to the Repository based on the EVA evaluation of the SWISS-MODEL server during a period of 150 weeks comprising 12 100 models built for 9125 individual proteins, and the evaluation during the 3D Crunch experiment comprising 1200 model/control structure comparisons (21,39). Additionally, each model is assessed using a partial Gromos96 force field implementation (32) and the empirical Anolea mean force potential (33). Protein models with a minimum length of 45 residues sharing at least 40% sequence identity with their template structure are entered into the database if their Anolea mean force potential is below 200 kJ/mol. The chosen threshold values for models to enter the Repository are specific for the current implementation of the modelling pipeline and will be adjusted with improvements of the modelling algorithms.
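The entry criteria stated above can be summarized as a simple filter. This is a minimal sketch and not the code of the actual pipeline: the `CandidateModel` class, its field names and the `passes_entry_criteria` function are hypothetical, and the sketch assumes that the Anolea value is a single total energy per model in kJ/mol.

```python
from dataclasses import dataclass

# Threshold values as quoted in the text.
MIN_LENGTH = 45              # residues
MIN_SEQUENCE_IDENTITY = 40.0  # percent identity with the template
MAX_ANOLEA_ENERGY = 200.0     # kJ/mol; the model must be below this value


@dataclass
class CandidateModel:
    """Hypothetical summary of one model produced by the pipeline."""
    length: int                 # number of residues in the model
    identity_percent: float     # sequence identity to the template
    anolea_energy: float        # assumed: total Anolea mean force potential


def passes_entry_criteria(model: CandidateModel) -> bool:
    """Return True if the model satisfies all three entry criteria."""
    return (
        model.length >= MIN_LENGTH
        and model.identity_percent >= MIN_SEQUENCE_IDENTITY
        and model.anolea_energy < MAX_ANOLEA_ENERGY
    )


# Illustrative values only, not taken from the Repository.
accepted = passes_entry_criteria(CandidateModel(200, 55.0, -120.0))
too_short = passes_entry_criteria(CandidateModel(44, 90.0, -50.0))
```

Keeping the thresholds as named constants reflects the statement that they are specific to the current pipeline and are expected to be adjusted as the modelling algorithms improve.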
The number of models in the Repository is expected to grow rapidly as a result of ongoing genome sequencing and structural genomics efforts. We will continue to develop the SWISS-MODEL Repository as a resource connecting sequence to structure information. Integrating InterPro domain information (37) in our data schema will provide functional annotation mapped onto three-dimensional structural models. The SWISS-MODEL Repository will widen the spectrum of biological information it provides by adding species-specific views and cross-linking with other knowledge bases.
Users of the SWISS-MODEL Repository are requested to cite this article in their publications.
We are deeply indebted to Manuel C. Peitsch (Novartis AG, Basel) and to Nicolas Guex (GSK, Raleigh, NC) for their pioneering work on large-scale protein structure modelling. We would like to thank Jozef Aerts (Biozentrum and SIB, Basel) for excellent technical help on the Anolea mean force potentials. We are grateful to Nicola Mulder, Rolf Apweiler (EBI Hinxton, UK) and Lorenza Bordoli (EMBnet and Biozentrum and SIB, Basel) for very fruitful and encouraging discussions. We would like to acknowledge the financial support by the Swiss National Science Foundation (SNF) and Novartis Pharma AG, Basel.