Here we describe the updated MolProbity rotamer-library distributions derived from an order-of-magnitude larger and more stringently quality-filtered dataset of about 8000 (vs. 500) protein chains, and we explain the resulting changes and improvements to model validation as seen by users. To include only sidechains with satisfactory justification for their given conformation, we added residue-specific filters for electron-density value and model-to-density fit. The combined new protocol retains a million residues of data, while cleaning up false-positive noise in the multi-χ datapoint distributions. It enables unambiguous characterization of conformational clusters nearly 1000-fold less frequent than the most common ones. We describe examples of local interactions that favor these rare conformations, including the role of authentic covalent bond-angle deviations in enabling presumably strained sidechain conformations. Further, along with favored and outlier, an allowed category (0.3% to 2.0% occurrence in reference data) has been added, analogous to Ramachandran validation categories. The new rotamer distributions are used for current rotamer validation in MolProbity and PHENIX, and for rotamer choice in PHENIX model-building and refinement. The multi-dimensional χ distributions and Top8000 reference dataset are freely available on GitHub. These rotamers are termed “ultimate” because data sampling and quality are now fully adequate for this task, and also because we believe the future of conformational validation should integrate sidechain with backbone criteria.
sidechain rotamer library; rare sidechain conformations; structural bioinformatics; structure validation; Phenix; high-quality dataset; protein conformation
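The three validation categories named in the abstract above can be illustrated with a minimal sketch. It assumes each sidechain already carries a score expressed as a percentage, in the same sense as the 0.3% and 2.0% cutoffs quoted there. The function name `classify_rotamer` and the treatment of values exactly on a boundary are assumptions made for illustration, not MolProbity's actual code.

```python
def classify_rotamer(score_percent: float) -> str:
    """Assign a rotamer validation category from a percentage score.

    Cutoffs follow the abstract: "allowed" spans 0.3% to 2.0%.
    Assumption: values exactly at 0.3 count as allowed and values
    exactly at 2.0 count as favored.
    """
    if not 0.0 <= score_percent <= 100.0:
        raise ValueError("score_percent must lie between 0 and 100")
    if score_percent < 0.3:
        return "outlier"
    if score_percent < 2.0:
        return "allowed"
    return "favored"


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    for score in (0.1, 1.0, 45.0):
        print(score, classify_rotamer(score))
```

Keeping the cutoffs in one small function makes it easy to see how the allowed band sits between favored and outlier, in the same way as the Ramachandran categories.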
The Staphylococcus aureus virulence factor staphylococcal protein A (SpA) is a major contributor to bacterial evasion of the host immune system, through high-affinity binding to host proteins such as antibodies. SpA includes five small three-helix-bundle domains (E-D-A-B-C) separated by conserved flexible linkers. Prior attempts to crystallize individual domains in the absence of a binding partner have apparently been unsuccessful. There have also been no previous structures of tandem domains. Here we report the high-resolution crystal structures of a single C domain, and of two B domains connected by the conserved linker. Both structures exhibit extensive multiscale conformational heterogeneity, which required novel modeling protocols. Comparison of domain structures shows that helix1 orientation is especially heterogeneous, coordinated with changes in sidechain conformational networks and contacting protein interfaces. This represents the kind of structural plasticity that could enable SpA to bind multiple partners.
Staphylococcal protein A (SpA) is a multi-domain protein consisting of 5 globular IgG binding domains separated by a conserved 6–9 residue flexible linker. We collected SAXS data on the N-terminal protein-binding half of SpA (SpA-N) and constructs consisting of 1–5 domain modules in order to determine the statistical conformation of this important S. aureus virulence factor. We fit the SAXS data to a scattering function based on a new polymer physics model, which provides an analytical description of the SpA-N statistical conformation. We describe a protocol for systematically determining the appropriate level of modeling to fit a SAXS dataset, based on goodness of fit and whether the addition of parameters improves it. In the case of SpA-N, the analytical polymer physics description provides a depiction of the statistical conformation of a flexible protein that, while lacking atomistic detail, properly reflects the information content of the data.
The hepatitis delta virus (HDV) ribozyme is a self-cleaving RNA enzyme essential for processing viral transcripts during rolling circle viral replication. The first crystal structure of the cleaved ribozyme was solved in 1998, followed by structures of uncleaved, mutant-inhibited and ion-complexed forms. Recently, methods have been developed that make the task of modeling RNA structure and dynamics significantly easier and more reliable. We have used ERRASER and PHENIX to rebuild and re-refine the cleaved and cis-acting C75U-inhibited structures of the HDV ribozyme. The results correct local conformations and identify alternates for RNA residues, many in functionally important regions, leading to improved R values and model validation statistics for both structures. We compare the rebuilt structures to a higher resolution, trans-acting deoxy-inhibited structure of the ribozyme, and conclude that although both inhibited structures are consistent with the currently accepted hammerhead-like mechanism of cleavage, they do not add direct structural evidence to the biochemical and modeling data. However, the rebuilt structures (PDBs: 4PR6, 4PRF) provide a more robust starting point for research on the dynamics and catalytic mechanism of the HDV ribozyme and demonstrate the power of new techniques to make significant improvements in RNA structures that impact biologically relevant conclusions.
Model validation has evolved from a passive final gatekeeping step to an ongoing diagnosis and healing process that enables significant improvement of accuracy. A recent phase of active development was spurred by the worldwide Protein Data Bank requiring data deposition and establishing Validation Task Force committees, by strong growth in high-quality reference data, by new speed and ease of computations, and by an upswing of interest in large molecular machines and structural ensembles. Progress includes automated correction methods, concise and user-friendly validation reports for referees and on the PDB websites, extension of error correction to RNA and error diagnosis to ligands, carbohydrates, and membrane proteins, and a good start on better methods for low resolution and for multiple conformations.
Macromolecular crystal structures are among the best of scientific data, providing detailed insight into these complex and biologically important molecules with a relatively low level of error and subjectivity. However, there are two notable problems with getting the most information from them. The first is that the models are not perfect: there is still opportunity for improving them, and users need to evaluate whether the local reliability in a structure is up to answering their question of interest. The second is that protein and nucleic acid molecules are highly complex and individual, inherently handed and 3-dimensional, and the cooperative and subtle interactions that govern their detailed structure and function are not intuitively evident. Thus there is a real need for graphical representations and descriptive classifications that enable molecular 3D literacy. We have spent our career working to understand these elegant molecules ourselves, and building tools to help us and others determine and understand them better. The Protein Data Bank (PDB) has of course been vital and central to this undertaking. Here we combine some history of our involvement as depositors, illustrators, evaluators, and end-users of PDB structures with commentary on how best to study and draw scientific inferences from them.
The foundations and current features of a widely used graphical user interface for macromolecular crystallography are described.
A new Python-based graphical user interface for the PHENIX suite of crystallography software is described. This interface unifies the command-line programs and their graphical displays, simplifying the development of new interfaces and avoiding duplication of function. With careful design, graphical interfaces can be displayed automatically, instead of being manually constructed. The resulting package is easily maintained and extended as new programs are added or modified.
macromolecular crystallography; graphical user interfaces; PHENIX
A macromolecular structure, as measured data or as a list of coordinates or even on-screen as a full atomic model, is an extremely complex and confusing object. The underlying rules of how it folds, moves, and interacts as a biological entity are even less evident or intuitive to the human mind. To do science on such molecules, or to relate them usefully to higher levels of biology, we need to start with a natural history that names their features in meaningful ways and with multiple representations (visual or algebraic) that show some aspect of their organizing principles. The two of us have jointly enjoyed a highly varied and engrossing career in biophysical research over nearly 50 years. Our frequent changes of emphasis are tied together by two threads: first, by finding the right names, visualizations, and methods to help both ourselves and others to better understand the 3D structures of protein and RNA molecules, and second, by redefining the boundary between signal and noise for complex data, in both directions—sometimes identifying and promoting real signal up out of what seemed just noise, and sometimes demoting apparent signal into noise or systematic error. Here we relate parts of our scientific and personal lives, including ups and downs, influences, anecdotes, and guiding principles such as the title theme.
scientific biography; structural biology; molecular graphics; ribbon drawings; structure validation; all-atom contacts
We have developed a suite of protein redesign algorithms that improves realistic in silico modeling of proteins. These algorithms are based on three characteristics that make them unique: (1) improved flexibility of the protein backbone, protein side chains, and ligand to accurately capture the conformational changes that are induced by mutations to the protein sequence; (2) modeling of proteins and ligands as ensembles of low-energy structures to better approximate binding affinity; and (3) a globally-optimal protein design search, guaranteeing that the computational predictions are optimal with respect to the input model. Here, we illustrate the importance of these three characteristics. We then describe OSPREY, a protein redesign suite that implements our protein design algorithms. OSPREY has been used prospectively, with experimental validation, in several biomedically-relevant settings. We show in detail how OSPREY has been used to predict resistance mutations and explain why improved flexibility, ensembles, and provability are essential for this application.
protein design; OSPREY; Dead-end elimination; protein ensembles; protein flexibility; K*; minDEE
X-ray crystallography is a critical tool in the study of biological systems. It is able to provide information that has been a prerequisite to understanding the fundamentals of life. It is also a method that is central to the development of new therapeutics for human disease. Significant time and effort are required to determine and optimize many macromolecular structures because of the need for manual interpretation of complex numerical data, often using many different software packages, and the repeated use of interactive three-dimensional graphics. The Phenix software package has been developed to provide a comprehensive system for macromolecular crystallographic structure solution with an emphasis on automation. This has required the development of new algorithms that minimize or eliminate subjective input in favour of built-in expert-systems knowledge, the automation of procedures that are traditionally performed by hand, and the development of a computational framework that allows a tight integration between the algorithms. The application of automated methods is particularly appropriate in the field of structural proteomics, where high throughput is desired. Features in Phenix for the automation of experimental phasing with subsequent model building, molecular replacement, structure refinement and validation are described and examples given of running Phenix from both the command line and graphical user interface.
Macromolecular Crystallography; Automation; Phenix; X-ray; Diffraction; Python
Amino acid substitutions in protein structures often require subtle backbone adjustments that are difficult to model in atomic detail. An improved ability to predict realistic backbone changes in response to engineered mutations would be of great utility for the blossoming field of rational protein design. One model that has recently grown in acceptance is the backrub motion, a low-energy dipeptide rotation with single-peptide counter-rotations, that is coupled to dynamic two-state sidechain rotamer jumps, as evidenced by alternate conformations in very high-resolution crystal structures. It has been speculated that backrubs may facilitate sequence changes equally well as rotamer changes. However, backrub-induced shifts and experimental uncertainty are of similar magnitude for backbone atoms in even high-resolution structures, so comparison of wildtype-vs.-mutant crystal structure pairs is not sufficient to directly link backrubs to mutations. In this study, we use two alternative approaches that bypass this limitation. First, we use a quality-filtered structure database to aggregate many examples for precisely defined motifs with single amino acid differences, and find that the effectively amplified backbone differences closely resemble backrubs. Second, we directly apply a provably-accurate, backrub-enabled protein design algorithm to idealized versions of these motifs, and discover that the lowest-energy computed models match the average-coordinate experimental structures. These results support the hypothesis that backrubs participate in natural protein evolution and validate their continued use for design of synthetic proteins.
Protein design has the potential to generate useful molecules for medicine and chemistry, including sensors, drugs, and catalysts for arbitrary reactions. When protein design is carried out starting from an experimentally determined structure, as is often the case, one important aspect to consider is backbone flexibility, because in response to a mutation the backbone often must shift slightly to reconcile the new sidechain with its environment. In principle, one may model the backbone in many ways, but not all are physically realistic or experimentally validated. Here we study the "backrub" motion, which has been previously documented in atomic detail, but only for sidechain movements within single structures. By a two-pronged approach involving both structural bioinformatics and computation with a principled design algorithm, we demonstrate that backrubs are sufficient to explain the backbone differences between mutation-related sets of very precisely defined motifs from the protein structure database. Our findings illustrate that backrubs are useful for describing evolutionary sequence change and, by extension, suggest that they are also appropriate for rational protein design calculations.
Recent developments in PHENIX are reported that allow the use of reference-model torsion restraints, secondary-structure hydrogen-bond restraints and Ramachandran restraints for improved macromolecular refinement in phenix.refine at low resolution.
Traditional methods for macromolecular refinement often have limited success at low resolution (3.0–3.5 Å or worse), producing models that score poorly on crystallographic and geometric validation criteria. To improve low-resolution refinement, knowledge from macromolecular chemistry and homology was used to add three new coordinate-restraint functions to the refinement program phenix.refine. Firstly, a ‘reference-model’ method uses an identical or homologous higher resolution model to add restraints on torsion angles to the geometric target function. Secondly, automatic restraints for common secondary-structure elements in proteins and nucleic acids were implemented that can help to preserve the secondary-structure geometry, which is often distorted at low resolution. Lastly, we have implemented Ramachandran-based restraints on the backbone torsion angles. In this method, a ϕ,ψ term is added to the geometric target function to minimize a modified Ramachandran landscape that smoothly combines favorable peaks identified from nonredundant high-quality data with unfavorable peaks calculated using a clash-based pseudo-energy function. All three methods show improved MolProbity validation statistics, typically complemented by a lowered Rfree and a decreased gap between Rwork and Rfree.
macromolecular crystallography; low resolution; refinement; automation
The crystal structure of the hypothetical protein PF0899 from P. furiosus has been determined to 1.85 Å resolution.
The hypothetical protein PF0899 is a 95-residue peptide from the hyperthermophilic archaeon Pyrococcus furiosus that represents a gene family with six members. P. furiosus ORF PF0899 has been cloned, expressed and crystallized and its structure has been determined by the Southeast Collaboratory for Structural Genomics (http://www.secsg.org). The structure was solved using the SCA2Structure pipeline from multiple data sets and has been refined to 1.85 Å against the highest resolution data set collected (a presumed gold derivative), with a crystallographic R factor of 21.0% and Rfree of 24.0%. The refined structure shows some structural similarity to a wedge-shaped domain observed in the structure of the major capsid protein from bacteriophage HK97, suggesting that PF0899 may be a structural protein.
structural genomics; SECSG; Pfu-871755; PF0899; high-throughput structure
What conformations do protein molecules populate in solution? Crystallography provides a high-resolution description of protein structure in the crystal environment, while NMR describes structure in solution but using less data. NMR structures display more variability, but is this because crystal contacts are absent or because of fewer data constraints? Here we report unexpected insight into this issue obtained through analysis of detailed protein energy landscapes generated by large-scale, native-enhanced sampling of conformational space with Rosetta@HOME for 111 protein domains. In the absence of tightly associating binding partners or ligands, the lowest-energy Rosetta models were nearly all <2.5 Å Cα RMSD from the experimental structure; this result demonstrates that structure prediction accuracy for globular proteins is limited mainly by the ability to sample close to the native structure. While the lowest-energy models are similar to deposited structures, they are not identical; the largest deviations are most often in regions involved in ligand, quaternary, or crystal contacts. For ligand binding proteins, the low energy models may resemble the apo structures, and for oligomeric proteins, the monomeric assembly intermediates. The deviations between the low energy models and crystal structures largely disappear when landscapes are computed in the context of the crystal lattice or multimer. The computed low-energy ensembles, with tight crystal-structure-like packing in the core, but more NMR-structure-like variability in loops, may in some cases resemble the native state ensembles of proteins better than individual crystal or NMR structures, and can suggest experimentally testable hypotheses relating alternative states and structural heterogeneity to function.
Rosetta; alternative conformations; protein mobility; structure prediction; validation
Central to crystallographic structure solution is obtaining accurate phases in order to build a molecular model, ultimately followed by refinement of that model to optimize its fit to the experimental diffraction data and prior chemical knowledge. Recent advances in phasing and model refinement and validation algorithms make it possible to arrive at better electron density maps and more accurate models.
For template-based modeling in the CASP8 Critical Assessment of Techniques for Protein Structure Prediction, this work develops and applies six new full-model metrics. They are designed to complement and add value to the traditional template-based assessment by GDT (Global Distance Test) and related scores (based on multiple superpositions of Cα atoms between target structure and predictions labeled “model 1”). The new metrics evaluate each predictor group on each target, using all atoms of their best model with above-average GDT. Two metrics evaluate how “protein-like” the predicted model is: the MolProbity score used for validating experimental structures, and a mainchain reality score using all-atom steric clashes, bond length and angle outliers, and backbone dihedrals. Four other new metrics evaluate match of model to target for mainchain and sidechain hydrogen bonds, sidechain end positioning, and sidechain rotamers. Group-average Z-score across the six full-model measures is averaged with group-average GDT Z-score to produce the overall ranking for full-model, high-accuracy performance.
Separate assessments are reported for specific aspects of predictor-group performance, such as robustness of approximately correct template or fold identification, and self-scoring ability at identifying the best of their models. Fold identification is distinct from but correlated with group-average GDT Z-score if target difficulty is taken into account, while self-scoring is done best by servers and is uncorrelated with GDT performance. Outstanding individual models on specific targets are identified and discussed. Predictor groups excelled at different aspects, highlighting the diversity of current methodologies. However, good full-model scores correlate robustly with high Cα accuracy.
homology modeling; protein structure prediction; all-atom contacts; full-model assessment
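The ranking scheme described in the CASP8 abstract above (a group-average Z-score over six full-model measures, averaged with the group-average GDT Z-score) can be sketched roughly as follows. This is a simplified illustration, not the assessors' code: it assumes each group has already been reduced to one value per metric, that every metric is oriented so that higher is better, and that population standard deviation is used. All function names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw scores to Z-scores across groups (0.0 if no spread)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def overall_ranking(full_model, gdt):
    """Rank groups by the mean of (average full-model Z) and (GDT Z).

    full_model: dict mapping group name -> list of six metric values
    gdt:        dict mapping group name -> GDT value
    Assumption: all metrics are "higher is better" on input.
    """
    groups = sorted(full_model)
    n_metrics = len(full_model[groups[0]])
    # Z-score each metric separately across groups.
    per_metric = [
        z_scores([full_model[g][m] for g in groups]) for m in range(n_metrics)
    ]
    full_z = [mean(per_metric[m][i] for m in range(n_metrics))
              for i in range(len(groups))]
    gdt_z = z_scores([gdt[g] for g in groups])
    combined = {g: (full_z[i] + gdt_z[i]) / 2 for i, g in enumerate(groups)}
    return sorted(groups, key=lambda g: combined[g], reverse=True)


if __name__ == "__main__":
    # Hypothetical groups and values, for illustration only.
    metrics = {"A": [3, 3, 3, 3, 3, 3], "B": [2] * 6, "C": [1] * 6}
    gdt_values = {"A": 80.0, "B": 60.0, "C": 40.0}
    print(overall_ranking(metrics, gdt_values))
```

Working in Z-scores puts the six measures and GDT on a common scale before averaging, which is why measures with very different units can be combined into one ranking.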
Application of phenix.model_vs_data to the contents of the Protein Data Bank shows that the vast majority of deposited structures can be automatically analyzed to reproduce the reported quality statistics. However, the small fraction of structures that elude automated re-analysis highlight areas where new software developments can help retain valuable information for future analysis.
phenix.model_vs_data is a high-level command-line tool for the computation of crystallographic model and data statistics, and the evaluation of the fit of the model to data. Analysis of all Protein Data Bank structures that have experimental data available shows that in most cases the reported statistics, in particular R factors, can be reproduced within a few percentage points. However, there are a number of outliers where the recomputed R values are significantly different from those originally reported. The reasons for these discrepancies are discussed.
PHENIX; Protein Data Bank; data quality; model quality; structure validation; R factors
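As a rough illustration of the comparison described above, the sketch below computes the standard crystallographic R factor, R = Σ| |Fobs| − k|Fcalc| | / Σ|Fobs|, and checks whether a recomputed value falls close to a reported one. It is not the phenix.model_vs_data implementation. The function names are hypothetical, and the 0.02 tolerance is only an assumed stand-in for "within a few percentage points".

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """Standard R factor from observed and calculated amplitudes.

    scale is the factor k applied to the calculated amplitudes.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(fo) for fo in f_obs)
    if denominator == 0:
        raise ValueError("sum of observed amplitudes is zero")
    numerator = sum(
        abs(abs(fo) - scale * abs(fc)) for fo, fc in zip(f_obs, f_calc)
    )
    return numerator / denominator


def reproduces_reported(reported, recomputed, tolerance=0.02):
    """True if the recomputed R is within the tolerance of the reported R.

    Assumption: the 0.02 default is illustrative, not a Phenix setting.
    """
    return abs(reported - recomputed) <= tolerance


if __name__ == "__main__":
    # Hypothetical amplitudes, for illustration only.
    print(r_factor([10.0, 20.0], [9.0, 22.0]))
    print(reproduces_reported(0.210, 0.225))
```

The same function gives Rwork or Rfree depending on which reflections are passed in: the working set or the held-out test set.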
The PHENIX software for macromolecular structure determination is described.
Macromolecular X-ray crystallography is routinely applied to understand biological processes at a molecular level. However, significant time and effort are still required to solve and complete many of these structures because of the need for manual interpretation of complex numerical data using many software packages and the repeated use of interactive three-dimensional graphics. PHENIX has been developed to provide a comprehensive system for macromolecular crystallographic structure solution with an emphasis on the automation of all procedures. This has relied on the development of algorithms that minimize or eliminate subjective input, the development of algorithms that automate procedures that are traditionally performed by hand and, finally, the development of a framework that allows a tight integration between the algorithms.
PHENIX; Python; macromolecular crystallography; algorithms
MolProbity structure validation will diagnose most local errors in macromolecular crystal structures and help to guide their correction.
MolProbity is a structure-validation web service that provides broad-spectrum solidly based evaluation of model quality at both the global and local levels for both proteins and nucleic acids. It relies heavily on the power and sensitivity provided by optimized hydrogen placement and all-atom contact analysis, complemented by updated versions of covalent-geometry and torsion-angle criteria. Some of the local corrections can be performed automatically in MolProbity and all of the diagnostics are presented in chart and graphical forms that help guide manual rebuilding. X-ray crystallography provides a wealth of biologically important molecular data in the form of atomic three-dimensional structures of proteins, nucleic acids and increasingly large complexes in multiple forms and states. Advances in automation, in everything from crystallization to data collection to phasing to model building to refinement, have made solving a structure using crystallography easier than ever. However, despite these improvements, local errors that can affect biological interpretation are widespread at low resolution and even high-resolution structures nearly all contain at least a few local errors such as Ramachandran outliers, flipped branched protein side chains and incorrect sugar puckers. It is critical both for the crystallographer and for the end user that there are easy and reliable methods to diagnose and correct these sorts of errors in structures. MolProbity is the authors’ contribution to helping solve this problem and this article reviews its general capabilities, reports on recent enhancements and usage, and presents evidence that the resulting improvements are now beneficially affecting the global database.
all-atom contacts; clashscore; automated correction; KiNG; ribose pucker; Ramachandran plots; side-chain rotamers; model quality; systematic errors; database improvement
Misfit sidechains in protein crystal structures are a stumbling block in using those structures to direct further scientific inference. Problems due to surface disorder and poor electron density are very difficult to address, but a large class of systematic errors are quite common even in well-ordered regions, resulting in sidechains fit backwards into local density in predictable ways. The MolProbity web site is effective at diagnosing such errors, and can perform reliable automated correction of a few special cases such as 180° flips of Asn or Gln sidechain amides, using all-atom contacts and H-bond networks. However, most at-risk residues involve tetrahedral geometry, and their valid correction requires rigorous evaluation of sidechain movement and sometimes backbone shift. The current work extends the benefits of robust automated correction to more sidechain types. The Autofix method identifies candidate systematic, flipped-over errors in Leu, Thr, Val, and Arg using MolProbity quality statistics, proposes a corrected position using real-space refinement with rotamer selection in Coot, and accepts or rejects the correction based on improvement in MolProbity criteria and on χ angle change. Criteria are chosen conservatively, after examining many individual results, to ensure valid correction. To test this method, Autofix was run and analyzed for 945 representative PDB files and on the 50S ribosomal subunit of file 1YHQ. Over 40% of Leu, Val, and Thr outliers and 15% of Arg outliers were successfully corrected, resulting in a total of 3,679 corrected sidechains, or 4 per structure on average. Summary Sentences: A common class of misfit sidechains in protein crystal structures is due to systematic errors that place the sidechain backwards into the local electron density. 
A fully automated method called “Autofix” identifies such errors for Leu, Val, Thr, and Arg and corrects over one third of them, using MolProbity validation criteria and Coot real-space refinement of rotamers.
Electronic supplementary material
The online version of this article (doi:10.1007/s10969-008-9045-8) contains supplementary material, which is available to authorized users.
Automation; Structure improvement; Crystallography; Sidechain rotamers; Protein/RNA interactions
In molecular applications, virtual reality (VR) and immersive virtual environments have generally been used and valued for the visual and interactive experience – to enhance intuition and communicate excitement – rather than as part of the actual research process. In contrast, this work develops a software infrastructure for research use and illustrates such use on a specific case.
The Syzygy open-source toolkit for VR software was used to write the KinImmerse program, which translates the molecular capabilities of the kinemage graphics format into software for display and manipulation in the DiVE (Duke immersive Virtual Environment) or other VR system. KinImmerse is supported by the flexible display construction and editing features in the KiNG kinemage viewer and it implements new forms of user interaction in the DiVE.
In addition to molecular visualizations and navigation, KinImmerse provides a set of research tools for manipulation, identification, co-centering of multiple models, free-form 3D annotation, and output of results. The molecular research test case analyzes the local neighborhood around an individual atom within an ensemble of nuclear magnetic resonance (NMR) models, enabling immersive visual comparison of the local conformation with the local NMR experimental data, including target curves for residual dipolar couplings (RDCs).
The promise of KinImmerse for production-level molecular research in the DiVE is shown by the locally co-centered RDC visualization developed there, which gave new insights now being pursued in wider data analysis.
Motivation: The Backrub is a small but kinematically efficient side-chain-coupled local backbone motion frequently observed in atomic-resolution crystal structures of proteins. A backrub shifts the Cα–Cβ orientation of a given side-chain by rigid-body dipeptide rotation plus smaller individual rotations of the two peptides, with virtually no change in the rest of the protein. Backrubs can therefore provide a biophysically realistic model of local backbone flexibility for structure-based protein design. Previously, however, backrub motions were applied via manual interactive model-building, so their incorporation into a protein design algorithm (a simultaneous search over mutation and backbone/side-chain conformation space) was infeasible.
Results: We present a combinatorial search algorithm for protein design that incorporates an automated procedure for local backbone flexibility via backrub motions. We further derive a dead-end elimination (DEE)-based criterion for pruning candidate rotamers that, in contrast to previous DEE algorithms, is provably accurate with backrub motions. Our backrub-based algorithm successfully predicts alternate side-chain conformations from ≤0.9 Å resolution structures, confirming the suitability of the automated backrub procedure. Finally, the application of our algorithm to redesign two different proteins is shown to identify a large number of lower-energy conformations and mutation sequences that would have been ignored by a rigid-backbone model.
Availability: Contact authors for source code.
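The primary rigid-body rotation of a backrub, as described in the Motivation paragraph above, can be sketched as a rotation of atom coordinates about an axis. The sketch assumes the axis runs from Cα(i−1) to Cα(i+1); that is how the backrub is commonly described, but the abstract does not state it. The smaller counter-rotations of the two individual peptides are left out, and the function name and coordinates are hypothetical. This is not the authors' design algorithm.

```python
import math


def rotate_about_axis(point, axis_start, axis_end, angle_deg):
    """Rotate a 3D point about the line through axis_start and axis_end.

    Uses Rodrigues' rotation formula. In a backrub sketch, axis_start
    and axis_end would be the flanking C-alpha positions (an assumption).
    """
    axis = [e - s for s, e in zip(axis_start, axis_end)]
    norm = math.sqrt(sum(c * c for c in axis))
    if norm == 0:
        raise ValueError("axis_start and axis_end must differ")
    k = [c / norm for c in axis]                       # unit axis vector
    v = [p - s for p, s in zip(point, axis_start)]     # point relative to axis
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    rotated = [
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1 - cos_t)
        for i in range(3)
    ]
    return tuple(r + s for r, s in zip(rotated, axis_start))


if __name__ == "__main__":
    # Hypothetical coordinates: rotate one atom by 10 degrees about the axis.
    ca_prev, ca_next = (0.0, 0.0, 0.0), (0.0, 0.0, 5.4)
    print(rotate_about_axis((1.5, 0.0, 2.7), ca_prev, ca_next, 10.0))
```

Atoms that lie on the axis, such as the two flanking Cα atoms, do not move under this rotation, which is consistent with the abstract's point that the rest of the protein is left virtually unchanged.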
MolProbity is a general-purpose web server offering quality validation for 3D structures of proteins, nucleic acids and complexes. It provides detailed all-atom contact analysis of any steric problems within the molecules as well as updated dihedral-angle diagnostics, and it can calculate and display the H-bond and van der Waals contacts in the interfaces between components. An integral step in the process is the addition and full optimization of all hydrogen atoms, both polar and nonpolar. New analysis functions have been added for RNA, for interfaces, and for NMR ensembles. Additionally, both the web site and major component programs have been rewritten to improve speed, convenience, clarity and integration with other resources. MolProbity results are reported in multiple forms: as overall numeric scores, as lists or charts of local problems, as downloadable PDB and graphics files, and most notably as informative, manipulable 3D kinemage graphics shown online in the KiNG viewer. This service is available free to all users at http://molprobity.biochem.duke.edu.
MolProbity is a general-purpose web service offering quality validation for three-dimensional (3D) structures of proteins, nucleic acids and complexes. It provides detailed all-atom contact analysis of any steric problems within the molecules and can calculate and display the H-bond and van der Waals contacts in the interfaces between components. An integral step in the process is the addition and full optimization of all hydrogen atoms, both polar and nonpolar. The results are reported in multiple forms: as overall numeric scores, as lists, as downloadable PDB and graphics files, and most notably as informative, manipulable 3D kinemage graphics shown on-line in the KiNG viewer. This service is available free to all users at http://kinemage.biochem.duke.edu.
We compared the results of Gram staining and culture of cerebrospinal fluid to results obtained with a rapid PCR assay for the diagnosis of meningococcal meningitis in 281 cases of suspected bacterial meningitis. PCR had a sensitivity of 97% compared to a sensitivity of 55% for culture, and the PCR specificity was 99.6%. PCR results were available within 2 h of the start of the assay.
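For readers less familiar with the two figures quoted above, sensitivity and specificity follow from the counts of a 2×2 diagnostic table. The sketch below shows only the definitions. The counts in the usage example are hypothetical and are not the study's data.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of truly positive cases that the assay detects."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no positive cases")
    return true_positives / total


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of truly negative cases that the assay reports as negative."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no negative cases")
    return true_negatives / total


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    print(f"sensitivity = {sensitivity(97, 3):.3f}")
    print(f"specificity = {specificity(249, 1):.3f}")
```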