This paper presents a variety of techniques and technologies aimed at the transformation of crystallographic data into information and knowledge.
Structural and functional studies require the development of sophisticated ‘Big Data’ technologies and software to increase the knowledge derived and ensure reproducibility of the data. This paper presents summaries of the Structural Biology Knowledge Base, the VIPERdb Virus Structure Database, evaluation of homology modeling by the Protein Model Portal, the ProSMART tool for conformation-independent structure comparison, the LabDB ‘super’ laboratory information management system and the Cambridge Structural Database. These techniques and technologies represent important tools for the transformation of crystallographic data into knowledge and information, in an effort to address the problem of non-reproducibility of experimental results.
meaning from data; big data; databases; knowledge bases; data deposition
The Procrustes Structural Matching Alignment and Restraints Tool (ProSMART) has been developed to allow local comparative structural analyses independent of the global conformations and sequence homology of the compared macromolecules. This allows quick and intuitive visualization of the conservation of backbone and side-chain conformations, providing complementary information to existing methods.
The identification and exploration of (dis)similarities between macromolecular structures can help to gain biological insight, for instance when visualizing or quantifying the response of a protein to ligand binding. Obtaining a residue alignment between compared structures is often a prerequisite for such comparative analysis. If the conformational change of the protein is dramatic, conventional alignment methods may struggle to provide an intuitive solution for straightforward analysis. To make such analyses more accessible, the Procrustes Structural Matching Alignment and Restraints Tool (ProSMART) has been developed, which achieves a conformation-independent structural alignment, as well as providing such additional functionalities as the generation of restraints for use in the refinement of macromolecular models. Sensible comparison of protein (or DNA/RNA) structures in the presence of conformational changes is achieved by enforcing neither chain nor domain rigidity. The visualization of results is facilitated by popular molecular-graphics software such as CCP4mg and PyMOL, providing intuitive feedback regarding structural conservation and subtle dissimilarities between close homologues that can otherwise be hard to identify. Automatically generated colour schemes corresponding to various residue-based scores are provided, which allow the assessment of the conservation of backbone and side-chain conformations relative to the local coordinate frame. Structural comparison tools such as ProSMART can help to break the complexity that accompanies the constantly growing pool of structural data into a more readily accessible form, potentially offering biological insight or influencing subsequent experiments.
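The idea of scoring local structure independently of global conformation can be illustrated with a minimal sketch. It is not ProSMART's algorithm: ProSMART superposes fragments by Procrustes analysis, whereas the sketch substitutes a simpler rotation-invariant measure, the root-mean-square difference between corresponding intra-fragment distances. The function names, the window width and the assumption of a one-to-one pre-aligned list of atoms are all hypothetical choices made for illustration.

```python
import math


def fragment_score(frag_a, frag_b):
    """RMS difference between corresponding intra-fragment distances.

    Both fragments are equal-length lists of (x, y, z) tuples. Internal
    distances do not change under rotation or translation, so no
    superposition is required (a stand-in for a Procrustes score).
    """
    diffs = []
    for i in range(len(frag_a)):
        for j in range(i + 1, len(frag_a)):
            d_a = math.dist(frag_a[i], frag_a[j])
            d_b = math.dist(frag_b[i], frag_b[j])
            diffs.append(d_a - d_b)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def local_scores(coords_a, coords_b, width=3):
    """Score every sliding window of `width` aligned atoms."""
    n_windows = len(coords_a) - width + 1
    return [
        fragment_score(coords_a[k:k + width], coords_b[k:k + width])
        for k in range(n_windows)
    ]


# Toy example: a straight chain and the same chain bent by 90 degrees
# at the third atom (a hinge motion between two rigid halves).
straight = [(float(x), 0.0, 0.0) for x in range(6)]
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
        (2.0, 1.0, 0.0), (2.0, 2.0, 0.0), (2.0, 3.0, 0.0)]
hinge_scores = local_scores(straight, bent, width=3)
```

In this toy case only the window that spans the hinge receives a non-zero score, while windows lying entirely within either rigid half score zero, although the two chains differ greatly when compared as rigid bodies.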
ProSMART; Procrustes; structural comparison; alignment; external restraints; refinement
The PDB_REDO pipeline aims to improve macromolecular structures by optimizing the crystallographic refinement parameters and performing partial model building. Here, algorithms are presented that allowed a web-server implementation of PDB_REDO, and the first user results are discussed.
The refinement and validation of a crystallographic structure model is the last step before the coordinates and the associated data are submitted to the Protein Data Bank (PDB). The success of the refinement procedure is typically assessed by validating the models against geometrical criteria and the diffraction data, and is an important step in ensuring the quality of the PDB public archive [Read et al. (2011), Structure, 19, 1395–1412]. The PDB_REDO procedure aims for ‘constructive validation’, aspiring to consistent and optimal refinement parameterization and pro-active model rebuilding, not only correcting errors but striving for optimal interpretation of the electron density. A web server for PDB_REDO has been implemented, allowing thorough, consistent and fully automated optimization of the refinement procedure in REFMAC and partial model rebuilding. The goal of the web server is to help practicing crystallographers to improve their model prior to submission to the PDB. For this, additional steps were implemented in the PDB_REDO pipeline, both in the refinement procedure, e.g. testing of resolution limits and k-fold cross-validation for small test sets, and as new validation criteria, e.g. the density-fit metrics implemented in EDSTATS and ligand validation as implemented in YASARA. Innovative ways to present the refinement and validation results to the user are also described, which together with auto-generated Coot scripts can guide users to subsequent model inspection and improvement. It is demonstrated that using the server can lead to substantial improvement of structure models before they are submitted to the PDB.
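The generic idea behind k-fold cross-validation for small test sets can be sketched as follows. This is an illustration of the general technique only, not the PDB_REDO implementation, whose details are not given here. The function names are hypothetical, and `refine_and_score` is an assumed user-supplied callback that refines against the work reflections and returns a free R factor for the held-out ones.

```python
import random


def k_fold_free_sets(n_reflections, k, seed=0):
    """Split reflection indices into k disjoint test ('free') sets.

    Every reflection is held out exactly once, so a small data set
    still yields a free R factor estimated from all reflections.
    """
    indices = list(range(n_reflections))
    random.Random(seed).shuffle(indices)
    return [sorted(indices[i::k]) for i in range(k)]


def cross_validated_r_free(n_reflections, k, refine_and_score, seed=0):
    """Average the free R factor over the k work/free partitions.

    `refine_and_score(work, free)` is a hypothetical callback standing
    in for a full refinement run; it must return a number.
    """
    scores = []
    for free in k_fold_free_sets(n_reflections, k, seed):
        held_out = set(free)
        work = [i for i in range(n_reflections) if i not in held_out]
        scores.append(refine_and_score(work, free))
    return sum(scores) / k
```

Each of the k runs is independent, so in practice they could be executed in parallel; the cost is k refinements instead of one.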
PDB_REDO; validation; model optimization
Paramagnetic NMR data (pseudocontact shifts and self-orientation residual dipolar couplings) and diamagnetic residual dipolar couplings can now be used in the program REFMAC5 from CCP4 as structural restraints together with X-ray crystallographic data. These NMR restraints can reveal differences between solid state and solution conformations of molecules or, in their absence, can be used together with X-ray crystallographic data for structural refinement.
The program REFMAC5 from CCP4 was modified to allow the simultaneous use of X-ray crystallographic data and paramagnetic NMR data (pseudocontact shifts and self-orientation residual dipolar couplings) and/or diamagnetic residual dipolar couplings. Incorporation of these long-range NMR restraints in REFMAC5 can reveal differences between solid-state and solution conformations of molecules or, in their absence, can be used together with X-ray crystallographic data for structural refinement. Since NMR and X-ray data are complementary, when a single structure is consistent with both sets of data and still maintains reasonably ‘ideal’ geometries, the reliability of the derived atomic model is expected to increase. The program was tested on five different proteins: the catalytic domain of matrix metalloproteinase 1, GB3, ubiquitin, free calmodulin and calmodulin complexed with a peptide. In some cases the joint refinement produced a single model consistent with both sets of observations, while in other cases it indicated, outside the experimental uncertainty, the presence of different protein conformations in solution and in the solid state.
structure refinement; PCS; RDC; X-ray; REFMAC
Three-dimensional (3D) structure determination by single particle electron cryomicroscopy (cryoEM) involves the calculation of an initial 3D model, followed by extensive iterative improvement of the orientation determination of the individual particle images and the resulting 3D map. Because there is much more noise than signal at high resolution in the images, this creates the possibility of noise reinforcement in the 3D map, which can give a false impression of the resolution attained. The balance between signal and noise in the final map at its limiting resolution depends on the image processing procedure and is not easily predicted. There is a growing awareness in the cryoEM community of how to avoid such over-fitting and over-estimation of resolution. Equally, there has been a reluctance to use the two principal methods of avoidance because they give lower resolution estimates, which some people believe are too pessimistic. Here we describe a simple test that is compatible with any image processing protocol. The test allows measurement of the amount of signal and the amount of noise from overfitting that is present in the final 3D map. We have applied the method to two different sets of cryoEM images of the enzyme beta-galactosidase using several image processing packages. Our procedure involves substituting the Fourier components of the initial particle image stack beyond a chosen resolution by either the Fourier components from an adjacent area of background, or by simple randomisation of the phases of the particle structure factors. This substituted noise thus has the same spectral power distribution as the original data. Comparison of the Fourier Shell Correlation (FSC) plots from the 3D map obtained using the experimental data with that from the same data with high-resolution noise (HR-noise) substituted allows an unambiguous measurement of the amount of overfitting and an accompanying resolution assessment. 
A simple formula can be used to calculate an unbiased FSC from the two curves, even when a substantial amount of overfitting is present. The approach is software independent. The user is therefore completely free to use any established method or novel combination of methods, provided the HR-noise test is carried out in parallel. Applying this procedure to cryoEM images of beta-galactosidase shows how overfitting varies greatly depending on the procedure, but in the best case shows no overfitting and a resolution of ~6 Å.
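The two ingredients of the HR-noise test, phase randomisation beyond a chosen resolution and the correction of the FSC curve, can be sketched in a few lines. The formula itself is not stated in this abstract; the form used below, FSCtrue = (FSCt - FSCn) / (1 - FSCn), is the one commonly quoted for this test and should be checked against the full paper. The function names and the representation of Fourier components as (d-spacing, complex value) pairs are assumptions made for illustration.

```python
import cmath
import math
import random


def randomize_phases_beyond(components, cutoff_angstrom, seed=0):
    """Randomise phases of components at higher resolution than the cutoff.

    `components` is a list of (d_spacing_in_angstrom, complex_value).
    Amplitudes are kept, so the spectral power is unchanged. The sketch
    treats components as independent and ignores the Friedel symmetry
    that a real-valued image would impose.
    """
    rng = random.Random(seed)
    substituted = []
    for d_spacing, value in components:
        if d_spacing < cutoff_angstrom:  # smaller d = higher resolution
            phase = rng.uniform(-math.pi, math.pi)
            value = cmath.rect(abs(value), phase)
        substituted.append((d_spacing, value))
    return substituted


def fsc_true(fsc_t, fsc_n):
    """Correct an FSC curve for overfitting, shell by shell.

    fsc_t: FSC from the experimental data.
    fsc_n: FSC from the same data after HR-noise substitution.
    Shells where fsc_n equals 1 would divide by zero and are assumed
    not to occur.
    """
    return [(t - n) / (1.0 - n) for t, n in zip(fsc_t, fsc_n)]
```

Where the noise-substituted curve is zero, the correction leaves the experimental FSC unchanged; where it is non-zero beyond the substitution resolution, the corrected value is lower than the experimental one.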
• A new method to validate 3D cryoEM maps of biological structures is described.
• High-resolution noise substitution is a tool to measure the amount of overfitting of noise in single particle cryoEM.
• A reliable, unbiased resolution estimation can be obtained even when some overfitting is present.
• Structure of beta-galactosidase at ~6 Å resolution is determined by cryoEM.
Single particle; Electron cryomicroscopy; Validation; Resolution; Overfitting; Beta-galactosidase
The new scaling program AIMLESS is described and tests of refinements at different resolutions are compared with analyses from the scaling step.
Following integration of the observed diffraction spots, the process of ‘data reduction’ initially aims to determine the point-group symmetry of the data and the likely space group. This can be performed with the program POINTLESS. The scaling program then puts all the measurements on a common scale, averages measurements of symmetry-related reflections (using the symmetry determined previously) and produces many statistics that provide the first important measures of data quality. A new scaling program, AIMLESS, implements scaling models similar to those in SCALA but adds some additional analyses. From the analyses, a number of decisions can be made about the quality of the data and whether some measurements should be discarded. The effective ‘resolution’ of a data set is a difficult and possibly contentious question (particularly with referees of papers) and this is discussed in the light of tests comparing the data-processing statistics with trials of refinement against observed and simulated data, and automated model-building and comparison of maps calculated with different resolution limits. These trials show that adding weak high-resolution data beyond the commonly used limits may make some improvement and does no harm.
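The averaging of symmetry-related measurements and one of the resulting statistics can be illustrated with a minimal sketch. It assumes that the observations have already been scaled and reduced to unique indices, and it uses a plain unweighted mean and the conventional Rmerge definition; it is not the AIMLESS implementation, and the function name is hypothetical.

```python
from collections import defaultdict


def merge_observations(observations):
    """Average equivalent observations and compute Rmerge.

    `observations` is a list of ((h, k, l), intensity) pairs whose
    indices are assumed to be already mapped to the asymmetric unit.
    Rmerge = sum |I_i - <I>| / sum I_i, taken over reflections that
    were measured more than once. Returns (merged, r_merge), with
    r_merge set to None if no reflection was measured repeatedly.
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)

    merged = {hkl: sum(vals) / len(vals) for hkl, vals in groups.items()}

    numerator = 0.0
    denominator = 0.0
    for hkl, vals in groups.items():
        if len(vals) < 2:
            continue  # single measurements carry no agreement information
        numerator += sum(abs(v - merged[hkl]) for v in vals)
        denominator += sum(vals)

    r_merge = numerator / denominator if denominator else None
    return merged, r_merge
```

Rmerge is shown only because it is simple to compute; it rises with multiplicity and is therefore a poor guide to where the resolution cutoff should lie.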
data reduction; data scaling; software; data statistics
Dethiobiotin synthetase (DTBS) is involved in the biosynthesis of biotin in bacteria, fungi and plants. As humans lack this pathway, dethiobiotin synthetase is a promising antimicrobial drug target. We determined structures of DTBS from H. pylori (hpDTBS) bound to cofactors and a substrate analog and described its unique characteristics relative to other DTBS proteins. Comparison with bacterial DTBS orthologues revealed considerable structural differences in nucleotide recognition. The C-terminal region of DTBS proteins, which contains two nucleotide-recognition motifs, differs greatly among DTBS proteins from different species. The structure of hpDTBS revealed that this protein is unique in lacking the C-terminal region that contains one of the motifs. The single nucleotide-binding motif in hpDTBS is similar to its counterpart in GTPases; however, ITC binding studies show that hpDTBS has a strong preference for ATP. The structural determinants of ATP specificity were assessed through X-ray crystallographic studies of hpDTBS:ATP and hpDTBS:GTP complexes. The unique mode of nucleotide recognition in hpDTBS makes this protein a good target for H. pylori-specific inhibitors of the biotin synthesis pathway.
The CCP4 template-restraint library defines restraints for biopolymers, their modifications and ligands that are used in macromolecular structure refinement. JLigand is a graphical editor for generating descriptions of new ligands and covalent linkages.
Biological macromolecules are polymers and therefore the restraints for macromolecular refinement can be subdivided into two sets: restraints that are applied to atoms that all belong to the same monomer and restraints that are associated with the covalent bonds between monomers. The CCP4 template-restraint library contains three types of data entries defining template restraints: descriptions of monomers and their modifications, both used for intramonomer restraints, and descriptions of links for intermonomer restraints. The library provides generic descriptions of modifications and links for protein, DNA and RNA chains, and for some post-translational modifications including glycosylation. Structure-specific template restraints can be defined in a user’s additional restraint library. Here, JLigand is described: a new CCP4 graphical interface to LibCheck and REFMAC that has been developed to manage the user’s library and to generate new monomer entries, as well as new entries for links and associated modifications.
macromolecular refinement; restraint library; molecular graphics
Low-resolution refinement tools implemented in REFMAC5 are described, including the use of external structural restraints, helical restraints and regularized anisotropic map sharpening.
Two aspects of low-resolution macromolecular crystal structure analysis are considered: (i) the use of reference structures and structural units for provision of structural prior information and (ii) map sharpening in the presence of noise and the effects of Fourier series termination. The generation of interatomic distance restraints by ProSMART and their subsequent application in REFMAC5 is described. It is shown that the use of such external structural information can enhance the reliability of derived atomic models and stabilize refinement. The problem of map sharpening is considered as an inverse deblurring problem and is solved using Tikhonov regularizers. It is demonstrated that this type of map sharpening can automatically produce a map with more structural features whilst maintaining connectivity. Tests show that both of these directions are promising, although more work needs to be performed in order to further exploit structural information and to address the problem of reliable electron-density calculation.
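The effect of Tikhonov regularization on sharpening can be illustrated with a deliberately simplified sketch. It assumes an isotropic blurring model in which each amplitude is attenuated by k = exp(-B s^2 / 4), with s = 1/d, and applies the textbook zeroth-order Tikhonov filter k / (k^2 + alpha) to a single amplitude. The regularizers actually used in REFMAC5, including the anisotropic treatment, are not described here, so the function names and the parameters `b_factor` and `alpha` are assumptions.

```python
import math


def blur_factor(d_spacing, b_factor):
    """Attenuation exp(-B s^2 / 4) at resolution d, with s = 1/d."""
    s_squared = 1.0 / (d_spacing * d_spacing)
    return math.exp(-b_factor * s_squared / 4.0)


def tikhonov_sharpen(amplitude, d_spacing, b_factor, alpha):
    """Regularized inverse of the blurring for one amplitude.

    Minimising |k x - y|^2 + alpha |x|^2 over x gives
    x = k y / (k^2 + alpha). With alpha = 0 this reduces to the naive
    inverse y / k, which grows without bound at high resolution; with
    alpha > 0 the gain can never exceed 1 / (2 sqrt(alpha)).
    """
    k = blur_factor(d_spacing, b_factor)
    return amplitude * k / (k * k + alpha)
```

At low resolution, where k is close to 1, the regularized and naive results nearly coincide; at high resolution the regularized gain is damped, which limits the amplification of noise.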
low-resolution refinement; REFMAC5
The decision-making algorithms and software used in PDB_REDO to re-refine and rebuild crystallographic protein structures in the PDB are presented and discussed.
Developments of the PDB_REDO procedure that combine re-refinement and rebuilding within a unique decision-making framework to improve structures in the PDB are presented. PDB_REDO uses a variety of existing and custom-built software modules to choose an optimal refinement protocol (e.g. anisotropic, isotropic or overall B-factor refinement, TLS model) and to optimize the geometry versus data-refinement weights. Next, it proceeds to rebuild side chains and peptide planes before a final optimization round. PDB_REDO works fully automatically without the need for intervention by a crystallographic expert. The pipeline was tested on 12 000 PDB entries and the great majority of the test cases improved both in terms of crystallographic criteria such as Rfree and in terms of widely accepted geometric validation criteria. It is concluded that PDB_REDO is useful to update the otherwise ‘static’ structures in the PDB to modern crystallographic standards. The publicly available PDB_REDO database provides better model statistics and contributes to better refinement and validation targets.
validation; refinement; model building; automation; PDB
An overview of the CCP4 software suite for macromolecular crystallography is given.
The CCP4 (Collaborative Computational Project, Number 4) software suite is a collection of programs and associated data and software libraries which can be used for macromolecular structure determination by X-ray crystallography. The suite is designed to be flexible, allowing users a number of methods of achieving their aims. The programs are from a wide variety of sources but are connected by a common infrastructure provided by standard file formats, data objects and graphical interfaces. Structure solution by macromolecular crystallography is becoming increasingly automated and the CCP4 suite includes several automation pipelines. After giving a brief description of the evolution of CCP4 over the last 30 years, an overview of the current suite is given. While detailed descriptions are given in the accompanying articles, here it is shown how the individual programs contribute to a complete software package.
CCP4; macromolecular crystallography; software; collaboration; automation; macromolecular structure determination
The automated pipelines for molecular replacement MrBUMP and BALBES are reviewed, with an emphasis on understanding their output. Conclusions are drawn from their performance in extensive trials.
Molecular replacement is one of the key methods used to solve the problem of determining the phases of structure factors in protein structure solution from X-ray image diffraction data. Its success rate has been steadily improving with the development of improved software methods and the increasing number of structures available in the PDB for use as search models. Despite this, in cases where there is low sequence identity between the target-structure sequence and that of its set of possible homologues it can be a difficult and time-consuming chore to isolate and prepare the best search model for molecular replacement. MrBUMP and BALBES are two recent developments from CCP4 that have been designed to automate and speed up the process of determining and preparing the best search models and putting them through molecular replacement. Their intention is to provide the user with a broad set of results using many search models and to highlight the best of these for further processing. An overview of both programs is presented along with a description of how best to use them, citing case studies and the results of large-scale testing of the software.
MrBUMP; BALBES; molecular replacement
The general principles behind the macromolecular crystal structure refinement program REFMAC5 are described.
This paper describes various components of the macromolecular crystallographic refinement program REFMAC5, which is distributed as part of the CCP4 suite. REFMAC5 utilizes different likelihood functions depending on the diffraction data employed (amplitudes or intensities), the presence of twinning and the availability of SAD/SIRAS experimental diffraction data. To ensure chemical and structural integrity of the refined model, REFMAC5 offers several classes of restraints and choices of model parameterization. Reliable models at resolutions at least as low as 4 Å can be achieved thanks to low-resolution refinement tools such as secondary-structure restraints, restraints to known homologous structures, automatic global and local NCS restraints, ‘jelly-body’ restraints and the use of novel long-range restraints on atomic displacement parameters (ADPs) based on the Kullback–Leibler divergence. REFMAC5 additionally offers TLS parameterization and, when high-resolution data are available, fast refinement of anisotropic ADPs. Refinement in the presence of twinning is performed in a fully automated fashion. REFMAC5 is a flexible and highly optimized refinement package that is ideally suited for refinement across the entire resolution spectrum encountered in macromolecular crystallography.
The automated building of a protein model into an electron density map remains a challenging problem. In the ARP/wARP approach, model building is facilitated by initially interpreting a density map with free atoms of unknown chemical identity; all structural information for such chemically unassigned atoms is discarded. Here, this is remedied by applying restraints between free atoms, and between free atoms and a partial protein model. These are based on geometric considerations of protein structure and tentative (conditional) assignments for the free atoms. Restraints are applied in the REFMAC5 refinement program and are generated on an ad hoc basis, allowing them to fluctuate from step to step. A large set of experimentally phased and molecular replacement structures showcases individual structures where automated building is improved drastically by the conditional restraints. The concept and implementation we present can also find application in restraining geometries, such as hydrogen bonds, in low-resolution refinement.
The default model-preparation scheme of MOLREP is described. Two examples are presented of model improvement using X-ray data.
The success of molecular replacement is critically dependent on the quality of the search model. Several model-preparation procedures are integrated in the molecular-replacement program MOLREP. These include model modification on the basis of amino-acid sequence alignment and model correction based on analysis of the solvent-accessibility of the atoms. The packing function used in MOLREP for the translational search is explained in the context of model preparation. In difficult cases, bioinformatics-based modifications are not sufficient for successful molecular replacement. An approach implemented in MOLREP for solving cases with translational noncrystallographic symmetry is an example of model preparation in which analysis of X-ray data plays an essential role. In addition, two examples are presented in which the X-ray data were used to refine partial models for subsequent use in molecular replacement.
MOLREP; model preparation; molecular replacement
A systematic test shows how ARP/wARP deals with automated model building for structures that have been solved by molecular replacement. A description of protocols in the flex-wARP control system and studies of two specific cases are also presented.
Automatic iterative model (re-)building, as implemented in ARP/wARP and its new control system flex-wARP, is particularly well suited to follow structure solution by molecular replacement. More than 100 molecular-replacement solutions automatically solved by the BALBES software were submitted to three standard protocols in flex-wARP and the results were compared with final models from the PDB. Standard metrics were gathered in a systematic way and enabled the drawing of statistical conclusions on the advantages of each protocol. Based on this analysis, an empirical estimator was proposed that predicts how good the final model produced by flex-wARP is likely to be based on the experimental data and the quality of the molecular-replacement solution. To introduce the differences between the three flex-wARP protocols (keeping the complete search model, converting it to atomic coordinates but ignoring atom identities or using the electron-density map calculated from the molecular-replacement solution), two examples are also discussed in detail, focusing on the evolution of the models during iterative rebuilding. This highlights the diversity of paths that the flex-wARP control system can employ to reach a nearly complete and accurate model while actually starting from the same initial information.
model building; refinement; molecular replacement
The fully automated pipeline, BALBES, integrates a redesigned hierarchical database of protein structures with their domains and multimeric organization, and solves molecular-replacement problems using only input X-ray and sequence data.
The number of macromolecular structures solved and deposited in the Protein Data Bank (PDB) is higher than 40 000. Using this information in macromolecular crystallography (MX) should in principle increase the efficiency of MX structure solution. This paper describes a molecular-replacement pipeline, BALBES, that makes extensive use of this repository. It uses a reorganized database taken from the PDB with multimeric as well as domain organization. A system manager written in Python controls the workflow of the process. Testing the current version of the pipeline using entries from the PDB has shown that this approach has huge potential and that around 75% of structures can be solved automatically without user intervention.
BALBES; molecular replacement
The presence of pseudosymmetry can cause problems in structure determination and refinement. The relevant background and representative examples are presented.
It is not uncommon for protein crystals to crystallize with more than a single molecule per asymmetric unit. When more than a single molecule is present in the asymmetric unit, various pathological situations such as twinning, modulated crystals and pseudo translational or rotational symmetry can arise. The presence of pseudosymmetry can lead to uncertainties about the correct space group, especially in the presence of twinning. The background to certain common pathologies is presented and a new notation for space groups in unusual settings is introduced. The main concepts are illustrated with several examples from the literature and the Protein Data Bank.
pathology; twinning; pseudosymmetry