This paper presents a variety of techniques and technologies aimed at the transformation of crystallographic data into information and knowledge.
Structural and functional studies require the development of sophisticated ‘Big Data’ technologies and software to increase the knowledge derived and ensure reproducibility of the data. This paper presents summaries of the Structural Biology Knowledge Base, the VIPERdb Virus Structure Database, evaluation of homology modeling by the Protein Model Portal, the ProSMART tool for conformation-independent structure comparison, the LabDB ‘super’ laboratory information management system and the Cambridge Structural Database. These techniques and technologies represent important tools for the transformation of crystallographic data into knowledge and information, in an effort to address the problem of non-reproducibility of experimental results.
meaning from data; big data; databases; knowledge bases; data deposition
Computational modeling and prediction of three-dimensional macromolecular structures and complexes from their sequence has been a long-standing vision in structural biology as it holds the promise to bypass part of the laborious process of experimental structure solution. Over the last two decades, a paradigm shift has occurred: starting from a situation where the “structure knowledge gap” between the huge number of protein sequences and small number of known structures has hampered the widespread use of structure-based approaches in life science research, today some form of structural information – either experimental or computational – is available for the majority of amino acids encoded by common model organism genomes. Template-based homology modeling techniques have matured to a point where they are now routinely used to complement experimental techniques. With the scientific focus of interest moving towards larger macromolecular complexes and dynamic networks of interactions, the integration of computational modeling methods with low-resolution experimental techniques allows studying large and complex molecular machines. Computational modeling and prediction techniques are still facing a number of challenges which hamper the more widespread use by the non-expert scientist. For example, it is often difficult to convey the underlying assumptions of a computational technique, as well as the expected accuracy and structural variability of a specific model. However, these aspects are crucial to understand the limitations of a model, and to decide which interpretations and conclusions can be supported.
Motivation: Membrane proteins are an important class of biological macromolecules involved in many cellular key processes including signalling and transport. They account for one third of genes in the human genome and >50% of current drug targets. Despite their importance, experimental structural data are sparse, resulting in high expectations for computational modelling tools to help fill this gap. However, as many empirical methods have been trained on experimental structural data, which is biased towards soluble globular proteins, their accuracy for transmembrane proteins is often limited.
Results: We developed a local model quality estimation method for membrane proteins (‘QMEANBrane’) by combining statistical potentials trained on membrane protein structures with a per-residue weighting scheme. The increasing number of available experimental membrane protein structures allowed us to train membrane-specific statistical potentials that approach statistical saturation. We show that reliable local quality estimation of membrane protein models is possible, thereby extending local quality estimation to these biologically relevant molecules.
Availability and implementation: Source code and datasets are available on request.
Supplementary data are available at Bioinformatics online.
The SIB Swiss Institute of Bioinformatics (www.isb-sib.ch) was created in 1998 as an institution to foster excellence in bioinformatics. It is renowned worldwide for its databases and software tools, such as UniProtKB/Swiss-Prot, PROSITE, SWISS-MODEL and STRING, that are all accessible on ExPASy.org, SIB's Bioinformatics Resource Portal. This article provides an overview of the scientific and training resources SIB has consistently been offering to the life science community for more than 15 years.
Protein structure homology modelling has become a routine technique to generate 3D models for proteins when experimental structures are not available. Fully automated servers such as SWISS-MODEL with user-friendly web interfaces generate reliable models without the need for complex software packages or downloading large databases. Here, we describe the latest version of the SWISS-MODEL expert system for protein structure modelling. The SWISS-MODEL template library provides annotation of quaternary structure and essential ligands and co-factors to allow for building of complete structural models, including their oligomeric structure. The improved SWISS-MODEL pipeline makes extensive use of model quality estimation for selection of the most suitable templates and provides estimates of the expected accuracy of the resulting models. The accuracy of the models generated by SWISS-MODEL is continuously evaluated by the CAMEO system. The new web site allows users to interactively search for templates, cluster them by sequence similarity, structurally compare alternative templates and select the ones to be used for model building. In cases where multiple alternative template structures are available for a protein of interest, a user-guided template selection step allows building models in different functional states. SWISS-MODEL is available at http://swissmodel.expasy.org/.
Motivation: The assessment of protein structure prediction techniques requires objective criteria to measure the similarity between a computational model and the experimentally determined reference structure. Conventional similarity measures based on a global superposition of carbon α atoms are strongly influenced by domain motions and do not assess the accuracy of local atomic details in the model.
Results: The Local Distance Difference Test (lDDT) is a superposition-free score that evaluates local distance differences of all atoms in a model, including validation of stereochemical plausibility. The reference can be a single structure, or an ensemble of equivalent structures. We demonstrate that lDDT is well suited to assess local model quality, even in the presence of domain movements, while maintaining good correlation with global measures. These properties make lDDT a robust tool for the automated assessment of structure prediction servers without manual intervention.
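The idea of a superposition-free local distance score can be illustrated with a minimal Python sketch. It is not the published lDDT implementation: the function name, the input layout, the 15 Å inclusion radius and the four tolerance thresholds (0.5, 1, 2 and 4 Å) are assumptions made for illustration, and the stereochemical plausibility checks and ensemble references mentioned above are omitted.

```python
import math
from itertools import combinations

def lddt_sketch(ref_coords, model_coords, residue_ids,
                radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of local reference distances preserved in the model.

    ref_coords, model_coords: lists of (x, y, z) tuples for the same atoms,
    in the same order. residue_ids: residue index of each atom.
    All parameter defaults are illustrative assumptions.
    """
    preserved = 0
    total = 0
    for i, j in combinations(range(len(ref_coords)), 2):
        if residue_ids[i] == residue_ids[j]:
            continue  # skip distances within one residue
        d_ref = math.dist(ref_coords[i], ref_coords[j])
        if d_ref >= radius:
            continue  # only local distances in the reference are scored
        d_model = math.dist(model_coords[i], model_coords[j])
        diff = abs(d_ref - d_model)
        total += len(thresholds)
        preserved += sum(diff < t for t in thresholds)
    return preserved / total if total else 0.0
```

Because only interatomic distances are compared, a rigid-body shift or rotation of the model leaves the score unchanged, which is the property that makes such a score insensitive to global superposition.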
Availability and implementation: Source code, binaries for Linux and MacOSX, and an interactive web server are available at http://swissmodel.expasy.org/lddt
Supplementary information: Supplementary data are available at Bioinformatics online.
One goal of the CASP Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction is to identify the current state of the art in protein structure prediction and modeling. A fundamental principle of CASP is blind prediction on a set of relevant protein targets, i.e. the participating computational methods are tested on a common set of experimental target proteins, for which the experimental structures are not known at the time of modeling. Therefore, the CASP experiment would not have been possible without broad support of the experimental protein structural biology community. In this manuscript, several experimental groups discuss the structures of the proteins which they provided as prediction targets for CASP9, highlighting structural and functional peculiarities of these structures: the long tail fibre protein gp37 from bacteriophage T4, the cyclic GMP-dependent protein kinase Iβ (PKGIβ) dimerization/docking domain, the ectodomain of the JTB (Jumping Translocation Breakpoint) transmembrane receptor, Autotaxin (ATX) in complex with an inhibitor, the DNA-Binding J-Binding Protein 1 (JBP1) domain essential for biosynthesis and maintenance of DNA base-J (β-D-glucosyl-hydroxymethyluracil) in Trypanosoma and Leishmania, a so far uncharacterized 73 residue domain from Ruminococcus gnavus with a fold typical for PDZ-like domains, a domain from the Phycobilisome (PBS) core-membrane linker (LCM) phycobiliprotein ApcE from Synechocystis, the Heat shock protein 90 (Hsp90) activators PFC0360w and PFC0270w from Plasmodium falciparum, and 2-oxo-3-deoxygalactonate kinase from Klebsiella pneumoniae.
CASP; protein structure; X-ray crystallography; NMR; structure prediction
The Protein Model Portal (PMP) has been developed to foster effective use of 3D molecular models in biomedical research by providing convenient and comprehensive access to structural information for proteins. Both experimental structures and theoretical models for a given protein can be searched simultaneously and analyzed for structural variability. By providing a comprehensive view on structural information, PMP offers the opportunity to apply consistent assessment and validation criteria to the complete set of structural models available for proteins. PMP is an open project so that new methods developed by the community can contribute to PMP, for example, new modeling servers for creating homology models and model quality estimation servers for model validation. The accuracy of participating modeling servers is continuously evaluated by the Continuous Automated Model EvaluatiOn (CAMEO) project. The PMP offers a unique interface to visualize structural coverage of a protein combining both theoretical models and experimental structures, allowing straightforward assessment of the model quality and hence their utility. The portal is updated regularly and actively developed to include latest methods in the field of computational structural biology.
The Critical Assessment of Protein Structure Prediction round 9 (CASP9) aimed to evaluate predictions for 129 experimentally determined protein structures. To assess tertiary structure predictions, these target structures were divided into domain-based evaluation units that were then classified into two assessment categories: template based modeling (TBM) and template free modeling (FM). CASP9 targets were split into domains of structurally compact evolutionary modules. For the targets with more than one defined domain, the decision to split structures into domains for evaluation was based on server performance. Target domains were categorized based on their evolutionary relatedness to existing templates as well as their difficulty levels indicated by server performance. Those target domains with sequence-related templates and high server prediction performance were classified as TBM, while those targets without identifiable templates and low server performance were classified as FM. However, using these generalizations for classification resulted in a blurred boundary between CASP9 assessment categories. Thus, the FM category included those domains without sequence detectable templates (25 target domains) as well as some domains with difficult to detect templates whose predictions were as poor as those without templates (5 target domains). Several interesting examples are discussed, including targets with sequence related templates that exhibit unusual structural differences, targets with homologous or analogous structure templates that are not detectable by sequence, and targets with new folds.
Protein Structure; CASP9; Classification; Fold space; sequence homologs; structure analogs; free modeling; template based modeling; structure prediction
Summary: MODalign is an interactive web-based tool aimed at helping protein structure modelers to inspect and manually modify the alignment between the sequences of a target protein and of its template(s). It interactively computes, displays and, upon modification of the target-template alignment, updates the multiple sequence alignments of the two protein families, their conservation score, secondary structure and solvent accessibility values, and local quality scores of the implied three-dimensional model(s). Although it has been designed to simplify the target-template alignment step in modeling, it is suitable for all cases where a sequence alignment needs to be inspected in the context of other biological information.
Interactions between proteins and their ligands play central roles in many physiological processes. The structural details for most of these interactions, however, have not yet been characterized experimentally. Therefore, various computational tools have been developed to predict the location of binding sites and the amino acid residues interacting with ligands. In this manuscript, we assess the performance of 33 methods participating in the ligand binding site prediction category in CASP9. The overall accuracy of ligand binding site predictions in CASP9 appears rather high (average MCC of 0.62 for the ten top performing groups), and compared to previous experiments more groups performed equally well. However, this should be seen in context of a strong bias in the test data towards easy template based models. Overall, the top performing methods have converged to a similar approach using ligand binding site inference from related homologous structures, which limits their applicability for difficult “de novo” prediction targets. Here, we present the results of the CASP9 assessment of the ligand binding site category, discuss examples for successful and challenging prediction targets in CASP9, and finally suggest changes in the format of the experiment to overcome the current limitations of the assessment.
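The MCC quoted above is the Matthews correlation coefficient. As a rough illustration of how such a value can be computed for one target, the sketch below treats each residue as either binding or non-binding and compares predicted and observed residue sets. The function name and the input layout are hypothetical, and the exact protocol used in the CASP9 assessment (for example, how binding residues were defined) is not reproduced here.

```python
import math

def binding_site_mcc(predicted, observed, n_residues):
    """Matthews correlation coefficient for a per-residue binary prediction.

    predicted, observed: iterables of residue numbers flagged as binding.
    n_residues: total number of residues in the target.
    """
    predicted, observed = set(predicted), set(observed)
    tp = len(predicted & observed)       # correctly predicted binding residues
    fp = len(predicted - observed)       # predicted but not observed
    fn = len(observed - predicted)       # observed but missed
    tn = n_residues - tp - fp - fn       # correctly predicted non-binding
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A perfect prediction gives 1.0 and a prediction no better than chance gives a value near 0, so the score is less flattering than raw accuracy when binding residues are a small minority of the protein.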
protein function; protein structure; evaluation; assessment; binding site; active site; co-factor; ligand; CASP
The Protein Structure Initiative’s Structural Biology Knowledgebase (SBKB, URL: http://sbkb.org) is an open web resource designed to turn the products of the structural genomics and structural biology efforts into knowledge that can be used by the biological community to understand living systems and disease. Here we will present examples on how to use the SBKB to enable biological research. For example, a protein sequence or Protein Data Bank (PDB) structure ID search will provide a list of related protein structures in the PDB, associated biological descriptions (annotations), homology models, structural genomics protein target status, experimental protocols, and the ability to order available DNA clones from the PSI:Biology-Materials Repository. A text search will find publication and technology reports resulting from the PSI’s high-throughput research efforts. Web tools that aid in research, including a system that accepts protein structure requests from the community, will also be described. Created in collaboration with the Nature Publishing Group, the Structural Biology Knowledgebase monthly update also provides a research library, editorials about new research advances, news, and an events calendar to present a broader view of structural genomics and structural biology.
Protein; Protein production; Structural biology; Structural databases; Structural genomics; Theoretical models
Motivation: Quality assessment of protein structures is an important part of experimental structure validation and plays a crucial role in protein structure prediction, where the predicted models may contain substantial errors. Most current scoring functions are primarily designed to rank alternative models of the same sequence supporting model selection, whereas the prediction of the absolute quality of an individual protein model has received little attention in the field. However, reliable absolute quality estimates are crucial to assess the suitability of a model for specific biomedical applications.
Results: In this work, we present a new absolute measure for the quality of protein models, which provides an estimate of the ‘degree of nativeness’ of the structural features observed in a model and describes the likelihood that a given model is of comparable quality to experimental structures. Model quality estimates based on the QMEAN scoring function were normalized with respect to the number of interactions. The resulting scoring function is independent of the size of the protein and may therefore be used to assess both monomers and entire oligomeric assemblies. Model quality scores for individual models are then expressed as ‘Z-scores’ in comparison to scores obtained for high-resolution crystal structures. We demonstrate the ability of the newly introduced QMEAN Z-score to detect experimentally solved protein structures containing significant errors, as well as to evaluate theoretical protein models.
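The Z-score step can be sketched in a few lines of Python. This is a simplified illustration only: the function name and the example numbers are invented, the use of the sample standard deviation is an assumption, and any details of how the reference set of high-resolution structures is chosen are left out.

```python
from statistics import mean, stdev

def quality_z_score(model_score, reference_scores):
    """Express a model's normalized quality score as a Z-score.

    reference_scores: normalized scores of high-resolution experimental
    structures (hypothetical values in the usage example below).
    """
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)  # sample standard deviation (assumption)
    return (model_score - mu) / sigma

# Usage with made-up numbers: a model scoring well below the reference
# distribution receives a strongly negative Z-score.
z = quality_z_score(0.6, [0.7, 0.8, 0.9])
```

A Z-score close to zero indicates a model whose score is typical of the reference structures, while strongly negative values flag models (or experimental structures) that deviate from them.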
In a comprehensive QMEAN Z-score analysis of all experimental structures in the PDB, membrane proteins accumulate on one side of the score spectrum and thermostable proteins on the other. Proteins from the thermophilic organism Thermotoga maritima received significantly higher QMEAN Z-scores in a pairwise comparison with their homologous mesophilic counterparts, underlining the significance of the QMEAN Z-score as an estimate of protein stability.
Availability: The Z-score calculation has been integrated in the QMEAN server available at: http://swissmodel.expasy.org/qmean.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Developers of new methods in computational structural biology are often hampered in their research by incompatible software tools and non-standardized data formats. To address this problem, we have developed OpenStructure as a modular open source platform to provide a powerful, yet flexible general working environment for structural bioinformatics. OpenStructure consists primarily of a set of libraries written in C++ with a cleanly designed application programmer interface. All functionality can be accessed directly in C++ or in a Python layer, meeting both the requirements for high efficiency and ease of use. Powerful selection queries and the notion of entity views to represent these selections greatly facilitate the development and implementation of algorithms on structural data. The modular integration of computational core methods with powerful visualization tools makes OpenStructure an ideal working and development environment. Several applications, such as the latest versions of IPLT and QMean, have been implemented based on OpenStructure—demonstrating its value for the development of next-generation structural biology algorithms.
Availability: Source code licensed under the GNU lesser general public license and binaries for MacOS X, Linux and Windows are available for download at http://www.openstructure.org.
Supplementary information: Supplementary data are available at Bioinformatics online.
We describe the proceedings and conclusions from a “Workshop on Applications of Protein Models in Biomedical Research” that was held at University of California at San Francisco on 11 and 12 July, 2008. At the workshop, international scientists involved with structure modeling explored (i) how models are currently used in biomedical research, (ii) what the requirements and challenges for different applications are, and (iii) how the interaction between the computational and experimental research communities could be strengthened to advance the field.
Structural Genomics has been successful in determining the structures of many unique proteins in a high throughput manner. Still, the number of known protein sequences is much larger than the number of experimentally solved protein structures. Homology (or comparative) modeling methods make use of experimental protein structures to build models for evolutionarily related proteins. Thereby, experimental structure determination efforts and homology modeling complement each other in the exploration of the protein structure space. One of the challenges in using model information effectively has been to access all models available for a specific protein in heterogeneous formats at different sites using various incompatible accession code systems. Often, structure models for hundreds of proteins can be derived from a given experimentally determined structure, using a variety of established methods. This has been done by all of the PSI centers, and by various independent modeling groups. The goal of the Protein Model Portal (PMP) is to provide a single portal which gives access to the various models that can be leveraged from PSI targets and other experimental protein structures. A single interface allows all existing pre-computed models across these various sites to be queried simultaneously, and provides links to interactive services for template selection, target-template alignment, model building, and quality assessment. The current release of the portal consists of 7.6 million model structures provided by different partner resources (CSMP, JCSG, MCSG, NESG, NYSGXRC, JCMM, ModBase, SWISS-MODEL Repository). The PMP is available at http://www.proteinmodelportal.org and from the PSI Structural Genomics Knowledgebase.
Protein model portal; PSI structural genomics knowledgebase; Comparative protein structure modeling; Homology modeling; Model database
The selection of the most accurate protein model from a set of alternatives is a crucial step in protein structure prediction both in template-based and ab initio approaches. Scoring functions have been developed which can either return a quality estimate for a single model or derive a score from the information contained in the ensemble of models for a given sequence. Local structural features occurring more frequently in the ensemble have a greater probability of being correct. Within the context of the CASP experiment, these so called consensus methods have been shown to perform considerably better in selecting good candidate models, but tend to fail if the best models are far from the dominant structural cluster. In this paper we show that model selection can be improved if both approaches are combined by pre-filtering the models used during the calculation of the structural consensus.
Our recently published QMEAN composite scoring function has been improved by including an all-atom interaction potential term. The preliminary model ranking based on the new QMEAN score is used to select a subset of reliable models against which the structural consensus score is calculated. This scoring function, called QMEANclust, achieves a correlation coefficient of predicted quality score and GDT_TS of 0.9 averaged over the 98 CASP7 targets and performs significantly better in selecting good models from the ensemble of server models than any other group participating in the quality estimation category of CASP7. Both scoring functions are also benchmarked on the MOULDER test set consisting of 20 target proteins, each with 300 alternative models generated by MODELLER. QMEAN outperforms all other tested scoring functions operating on individual models, while the consensus method QMEANclust only works properly on decoy sets containing a certain fraction of near-native conformations. We also present a local version of QMEAN for the per-residue estimation of model quality (QMEANlocal) and compare it to a new local consensus-based approach.
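The combination of a single-model score with a structural consensus can be sketched as follows. This is a schematic illustration and not the QMEANclust implementation: the function name, the fixed fraction of models retained, and the plain (unweighted) averaging are assumptions, and the pairwise similarity function is supplied by the caller (in practice a measure such as GDT_TS could play that role).

```python
import math

def prefiltered_consensus(single_scores, similarity, keep_fraction=0.5):
    """Score each model by its mean similarity to a pre-filtered subset.

    single_scores: dict mapping model name -> single-model quality score
    (higher is better). similarity: callable (a, b) -> float.
    keep_fraction: share of top-ranked models kept as the reference
    subset (an illustrative choice, not a published parameter).
    """
    ranked = sorted(single_scores, key=single_scores.get, reverse=True)
    n_keep = min(len(ranked), max(2, math.ceil(len(ranked) * keep_fraction)))
    reference = ranked[:n_keep]  # models trusted by the single-model score
    consensus = {}
    for model in single_scores:
        others = [r for r in reference if r != model]
        if others:
            consensus[model] = sum(similarity(model, r) for r in others) / len(others)
        else:
            consensus[model] = 0.0  # a lone model has no ensemble to agree with
    return consensus
```

Restricting the consensus to the pre-filtered subset keeps a cluster of mutually similar but poor models from dominating the ranking, which is the failure mode of plain consensus scoring described above.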
Improved model selection is obtained by using a composite scoring function operating on single models in order to enrich higher quality models which are subsequently used to calculate the structural consensus. The performance of consensus-based methods such as QMEANclust highly depends on the composition and quality of the model ensemble to be analysed. Therefore, performance estimates for consensus methods based on large meta-datasets (e.g. CASP) might overrate their applicability in more realistic modelling situations with smaller sets of models based on individual methods.
Model quality estimation is an essential component of protein structure prediction, since ultimately the accuracy of a model determines its usefulness for specific applications. Usually, in the course of protein structure prediction a set of alternative models is produced, from which subsequently the most accurate model has to be selected. The QMEAN server provides access to two scoring functions successfully tested at the eighth round of the community-wide blind test experiment CASP. The user can choose between the composite scoring function QMEAN, which derives a quality estimate on the basis of the geometrical analysis of single models, and the clustering-based scoring function QMEANclust which calculates a global and local quality estimate based on a weighted all-against-all comparison of the models from the ensemble provided by the user. The web server performs a ranking of the input models and highlights potentially problematic regions for each model. The QMEAN server is available at http://swissmodel.expasy.org/qmean.
The Protein Structure Initiative Structural Genomics Knowledgebase (PSI SGKB, http://kb.psi-structuralgenomics.org) has been created to turn the products of the PSI structural genomics effort into knowledge that can be used by the biological research community to understand living systems and disease. This resource provides central access to structures in the Protein Data Bank (PDB), along with functional annotations, associated homology models, worldwide protein target tracking information, available protocols and the potential to obtain DNA materials for many of the targets. It also offers the ability to search all of the structural and methodological publications and the innovative technologies that were catalyzed by the PSI's high-throughput research efforts. In collaboration with the Nature Publishing Group, the PSI SGKB provides a research library, editorials about new research advances, news and an events calendar to present a broader view of structural biology and structural genomics. By making these resources freely available, the PSI SGKB serves as a bridge to connect the structural biology and the greater biomedical communities.
SWISS-MODEL Repository (http://swissmodel.expasy.org/repository/) is a database of 3D protein structure models generated by the SWISS-MODEL homology-modelling pipeline. The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated 3D protein models generated by automated homology modelling for all sequences in Swiss-Prot and for relevant model organisms. Regular updates ensure that target coverage is complete, that models are built using the most recent sequence and template structure databases, and that improvements in the underlying modelling pipeline are fully utilised. As of September 2008, the database contains 3.4 million entries for 2.7 million different protein sequences from the UniProt database. SWISS-MODEL Repository allows users to assess the quality of the models in the database, to search for alternative template structures, and to build models interactively via SWISS-MODEL Workspace (http://swissmodel.expasy.org/workspace/). Annotation of models with functional information and cross-linking with other databases such as the Protein Model Portal (http://www.proteinmodelportal.org) of the PSI Structural Genomics Knowledge Base facilitates the navigation between protein sequence and structure resources.
Arginine (R)-based ER localization signals are sorting motifs that confer transient ER localization to unassembled subunits of multimeric membrane proteins. The COPI vesicle coat binds R-based signals but the molecular details remain unknown. Here, we use reporter membrane proteins based on the proteolipid Pmp2 fused to GFP and allele swapping of COPI subunits to map the recognition site for R-based signals. We show that two highly conserved stretches—in the β- and δ-COPI subunits—are required to maintain Pmp2GFP reporters exposing R-based signals in the ER. Combining a deletion of 21 residues in δ-COP together with the mutation of three residues in β-COP gave rise to a COPI coat that had lost its ability to recognize R-based signals, whilst the recognition of C-terminal di-lysine signals remained unimpaired. A homology model of the COPI trunk domain illustrates the recognition of R-based signals by COPI.
The Sec61/SecY translocon mediates translocation of proteins across the membrane and integration of membrane proteins into the lipid bilayer. The structure of the translocon revealed a plug domain blocking the pore on the lumenal side. It was proposed to be important for gating the protein conducting channel and for maintaining the permeability barrier in its unoccupied state. Here, we analyzed in yeast the effect of introducing destabilizing point mutations in the plug domain or of its partial or complete deletion. Unexpectedly, even when the entire plug domain was deleted, cells were viable without growth phenotype. They showed an effect on signal sequence orientation of diagnostic signal-anchor proteins, a minor defect in cotranslational and a significant deficiency in posttranslational translocation. Steady-state levels of the mutant protein were reduced, and when coexpressed with wild-type Sec61p, the mutant lacking the plug competed poorly for complex partners. The results suggest that the plug is unlikely to be important for sealing the translocation pore in yeast but that it plays a role in stabilizing Sec61p during translocon formation.
The SWISS-MODEL Repository is a database of annotated 3D protein structure models generated by the SWISS-MODEL homology-modelling pipeline. As of September 2005, the repository contained 675 000 models for 604 000 different protein sequences of the UniProt database. Regular updates ensure that the content of the repository reflects the current state of sequence and structure databases, integrating new or modified target sequences, and making use of new template structures. Each Repository entry consists of one or more 3D models accompanied by detailed information about the target protein and the model building process: functional annotation, a detailed template selection log, target-template alignment, summary of the model building and model quality assessment. The SWISS-MODEL Repository is freely accessible at .
The SWISS-MODEL Repository is a database of annotated three-dimensional comparative protein structure models generated by the fully automated homology-modelling pipeline SWISS-MODEL. The Repository currently contains about 300 000 three-dimensional models for sequences from the Swiss-Prot and TrEMBL databases. The content of the Repository is updated on a regular basis incorporating new sequences, taking advantage of new template structures becoming available and reflecting improvements in the underlying modelling algorithms. Each entry consists of one or more three-dimensional protein models, the superposed template structures, the alignments on which the models are based, a summary of the modelling process and a force field based quality assessment. The SWISS-MODEL Repository can be queried via an interactive website at http://swissmodel.expasy.org/repository/. Annotation and cross-linking of the models with other databases, e.g. Swiss-Prot on the ExPASy server, allow for seamless navigation between protein sequence and structure information. The aim of the SWISS-MODEL Repository is to provide access to an up-to-date collection of annotated three-dimensional protein models generated by automated homology modelling, bridging the gap between sequence and structure databases.