Motivation: Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed.
Results: We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was therefore applied first to sequence/HMM alignments, then to HMM–HMM alignments and then to structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our assignments, presented in a database called PDBfam, contain Pfams for 99.4% of chains >50 residues.
Availability: The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
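The greedy assignment step described in the Results above can be illustrated with a minimal sketch. It is not the PDBfam implementation: the `Hit` record, the `greedy_assign` function, the example hits and the 10-residue overlap allowance are all hypothetical choices made for illustration, and the sketch does not join alignments split by insertions or treat repeat families specially.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One candidate domain assignment on a query sequence (hypothetical record)."""
    family: str
    start: int      # first aligned query residue, 1-based, inclusive
    end: int        # last aligned query residue, inclusive
    evalue: float


def overlap(a: Hit, b: Hit) -> int:
    """Number of query residues shared by two hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def greedy_assign(hits: List[Hit], max_overlap: int = 10) -> List[Hit]:
    """Accept hits from best to worst E-value, tolerating partial overlaps.

    max_overlap is an assumed allowance, not a value taken from the source.
    """
    accepted: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(overlap(hit, kept) <= max_overlap for kept in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)


# Made-up hits: FamC overlaps FamA by 51 residues and is rejected,
# while FamB overlaps FamA by only 6 residues and is kept.
hits = [
    Hit("FamA", 1, 100, 1e-30),
    Hit("FamB", 95, 200, 1e-20),
    Hit("FamC", 50, 150, 1e-10),
]
assigned = greedy_assign(hits)
```

In a multilevel procedure of the kind described, the same routine could be run on each alignment source in turn, with the hits accepted at one level carried forward as already occupied regions for the next.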
Despite the importance of intracellular signaling networks, there is currently no consensus regarding the fundamental nature of the protein complexes such networks employ. One prominent view involves stable signaling machines with well-defined quaternary structures. The combinatorial complexity of signaling networks has led to an opposing perspective, namely that signaling proceeds via heterogeneous pleiomorphic ensembles of transient complexes. Since many hypotheses regarding network function rely on how we conceptualize signaling complexes, resolving this issue is a central problem in systems biology. Unfortunately, direct experimental characterization of these complexes has proven technologically difficult, while combinatorial complexity has prevented traditional modeling methods from approaching this question. Here we employ rule-based modeling, a technique that overcomes these limitations, to construct a model of the yeast pheromone signaling network. We found that this model exhibits significant ensemble character while generating reliable responses that match experimental observations. To contrast the ensemble behavior, we constructed a model that employs hierarchical assembly pathways to produce scaffold-based signaling machines. We found that this machine model could not replicate the experimentally observed combinatorial inhibition that arises when the scaffold is overexpressed. This finding provides evidence against the hierarchical assembly of machines in the pheromone signaling network and suggests that machines and ensembles may serve distinct purposes in vivo. In some cases, e.g. core enzymatic activities like protein synthesis and degradation, machines assembled via hierarchical energy landscapes may provide functional stability for the cell. In other cases, such as signaling, ensembles may represent a form of weak linkage, facilitating variation and plasticity in network evolution. 
The capacity of ensembles to signal effectively will ultimately shape how we conceptualize the function, evolution and engineering of signaling networks.
Intracellular signaling networks are central to a cell's ability to adapt to its environment. Developing the capacity to effectively manipulate such networks would have a wide range of applications, from cancer therapy to synthetic biology. This requires a thorough understanding of the mechanisms of signal transduction, particularly the kinds of protein complexes that are formed during transmission of extracellular information to the nucleus. Traditionally, signaling complexes have been largely perceived (albeit often implicitly) as machine-like structures. However, the number of molecular complexes that could theoretically be formed by complex signaling networks is astronomically large. This has led to the pleiomorphic ensemble hypothesis, which posits that diverse and rapidly changing sets of transient protein complexes can transmit and process information. Our goal was to use computational approaches, specifically rule-based modeling, to test these hypotheses. We constructed a model of the prototypical yeast mating pathway and found significant ensemble-like behavior. Our results thus demonstrated that ensembles can in fact transmit extracellular signals with minimal noise. Additionally, a comparison of this model with one tailored to generate machine-like complexes displayed notable phenotypic differences, revealing potential advantages for ensemble-like signaling. Our demonstration that ensembles can function effectively will have a significant impact on how we conceptualize signaling and other processes inside cells.
The proper biological functioning of proteins often relies on the occurrence of coordinated fluctuations around their native structure, or on their ability to perform wider and sometimes highly elaborated motions. Hence, there is considerable interest in the definition of accurate coarse-grained descriptions of protein dynamics, as an alternative to more computationally expensive approaches. In particular, the elastic network model, in which residue motions are subjected to pairwise harmonic potentials, is known to capture essential aspects of conformational dynamics in proteins, but has so far remained mostly phenomenological, and unable to account for the chemical specificities of amino acids. We propose, for the first time, a method to derive residue- and distance-specific effective harmonic potentials from the statistical analysis of an extensive dataset of NMR conformational ensembles. These potentials constitute dynamical counterparts to the mean-force statistical potentials commonly used for static analyses of protein structures. In the context of the elastic network model, they yield a strongly improved description of the cooperative aspects of residue motions, and give the opportunity to systematically explore the influence of sequence details on protein dynamics.
Decades of experimental evidence have underlined the fact that protein structures can hardly be considered as static objects. To understand how a protein achieves its biological purpose, it is therefore quite often necessary to unravel the complexity of its dynamical behavior. However, the definition of accurate and computationally tractable descriptions of protein dynamics remains a highly challenging task. Indeed, even though proteins are all built from a limited set of amino acids and local conformational arrangements, the specific nature of biologically relevant motions may vary widely from one protein to another, which constitutes a serious obstacle to the identification of common rules and properties. Here, instead of focusing on the study of a single protein, we adopt a more general perspective by condensing the information contained in a multitude of NMR conformational ensembles. This approach allows us to characterize the dynamical behavior of residues and residue pairs in a mean protein environment, independently of each protein's specific architecture. We describe how this analysis can be exploited to assess the performances of coarse-grained models of protein dynamics, to take advantage of existing experimental data for a more rational and efficient parametrization of these models and, ultimately, to improve our understanding of the intrinsic dynamical properties of amino acid chains.
Temporally and spatially controlled activation of the Aurora-A kinase (AURKA) regulates centrosome maturation, entry into mitosis, formation and function of the bipolar spindle, and cytokinesis. Genetic amplification, and mRNA and protein overexpression of Aurora-A are common in many types of solid tumor, and are associated with aneuploidy, supernumerary centrosomes, defective mitotic spindles, and resistance to apoptosis. These properties have led Aurora-A to be considered a high value target for development of cancer therapeutics, with multiple agents currently in early phase clinical trials. More recently, identification of additional, non-mitotic functions and means of activation of Aurora-A during interphase neurite elongation and ciliary resorption have significantly expanded understanding of its function, and may offer insights into clinical performance of Aurora-A inhibitors. We here review mitotic and non-mitotic functions of Aurora-A, discuss Aurora-A regulation in the context of protein structural information, and evaluate progress in understanding and inhibiting Aurora-A in cancer.
Aurora-A; AURKA; cancer; mitosis; cell cycle; kinase; centrosome; cilia
Predicting the phenotypes of missense mutations uncovered by large-scale sequencing projects is an important goal in computational biology. High-confidence predictions can be an aid in focusing experimental and association studies on those mutations most likely to be associated with causative relationships between mutation and disease. As an aid in developing these methods further, we have derived a set of random mutations of the enzymatic domains of human cystathionine beta synthase. This enzyme is a dimeric protein that catalyzes the condensation of serine and homocysteine to produce cystathionine. Yeast missing this enzyme cannot grow on medium lacking a source of cysteine, while transfection of functional human CBS into yeast strains missing endogenous enzyme can successfully complement for the missing gene. We used PCR mutagenesis with error-prone Taq polymerase to produce 948 colonies, and compared cell growth in the presence or absence of a cysteine source as a measure of CBS function. We were able to infer the phenotypes of 204 single-site mutants, 79 of them deleterious and 125 neutral. This set was used to test the accuracy of six publicly available prediction methods for phenotype prediction of missense mutations: SIFT, PolyPhen, PMut, SNPs3D, PhD-SNP, and nsSNPAnalyzer. The top methods are PolyPhen, SIFT, and nsSNPAnalyzer, which have similar performance. Using kernel discriminant functions, we found that the difference in position-specific scoring matrix values is more predictive than the wild-type PSSM score alone, and that the relative surface area in the biologically relevant complex is more predictive than that of the monomeric proteins.
mutations; phenotype prediction; cystathionine beta synthase
Training and testing of conventional machine learning models on binary classification problems depend on the proportions of the two outcomes in the relevant data sets. This may be especially important in practical terms when real-world applications of the classifier are either highly imbalanced or occur in unknown proportions. Intuitively, it may seem sensible to train machine learning models on data similar to the target data in terms of proportions of the two binary outcomes. However, we show that this is not the case using the example of prediction of deleterious and neutral phenotypes of human missense mutations in human genome data, for which the proportion of the binary outcome is unknown. Our results indicate that using balanced training data (50% neutral and 50% deleterious) results in the highest balanced accuracy (the average of True Positive Rate and True Negative Rate), Matthews correlation coefficient, and area under ROC curves, no matter what the proportions of the two phenotypes are in the testing data. Besides balancing the data by undersampling the majority class, other techniques in machine learning include oversampling the minority class, interpolating minority-class data points and various penalties for misclassifying the minority class. However, these techniques are not commonly used in either the missense phenotype prediction problem or in the prediction of disordered residues in proteins, where the imbalance problem is substantial. The appropriate approach depends on the amount of available data and the specific problem at hand.
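The evaluation measures named above can be made concrete with a short sketch. Balanced accuracy is computed as defined in the text (the average of True Positive Rate and True Negative Rate), and the Matthews correlation coefficient uses its standard confusion-matrix formula. The function names, the toy labels and the fixed random seed are illustrative assumptions, and `undersample` shows only the simplest balancing scheme (random undersampling of the majority class).

```python
import math
import random


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 = deleterious and 0 = neutral."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Average of True Positive Rate and True Negative Rate."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def undersample(labels, seed=0):
    """Indices of a 50/50 subset obtained by randomly undersampling the majority class."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))


# Toy data: 4 deleterious and 6 neutral mutations (made-up labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp, tn, fp, fn = confusion(y_true, y_pred)
bacc = balanced_accuracy(tp, tn, fp, fn)
mcc = matthews_cc(tp, tn, fp, fn)
balanced_idx = undersample(y_true)
```

Because balanced accuracy averages the two per-class rates, it does not reward a classifier for simply predicting the majority class, which is why it is a natural measure when the class proportions of the target data are unknown.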
The goal of this paper is to reduce the complexity of the side chain search within docking problems. We apply six methods of generating side chain conformers to unbound protein structures, and determine their ability to capture the bound conformation in small ensembles of conformers. Methods are evaluated in terms of the positions of side chain end groups. Results for 68 protein complexes yield two important observations. First, the end group positions change less than 1 Å upon association for over 60% of interface side chains. Thus, the unbound protein structure carries substantial information about the side chains in the bound state, and the inclusion of the unbound conformation into the ensemble of conformers is very beneficial. Second, considering each surface side chain separately in its protein environment, small ensembles of low energy states include the bound conformation for a large fraction of side chains. In particular, the ensemble consisting of the unbound conformation and the two highest probability predicted conformers includes the bound conformer with an accuracy of 1 Å for 78% of interface side chains. Since more than 60% of the interface side chains have only one conformer and many others only a few, these ensembles of low energy states substantially reduce the complexity of side chain search in docking problems. This approach has already been used for finding pockets in protein-protein interfaces that can bind small molecules to potentially disrupt protein-protein interactions. Side chain search with the reduced search space will also be incorporated into protein docking algorithms.
rotamer libraries; side chain flexibility; protein binding; structure prediction; preexisting ensemble of conformers
The stress-induced heat shock protein 70 (HSP70) is an ATP-dependent molecular chaperone that plays a key role in refolding misfolded proteins and promoting cell survival following stress. HSP70 is marginally expressed in non-transformed cells, but is greatly overexpressed in tumor cells. Silencing HSP70 is uniformly cytotoxic to tumor but not normal cells; therefore, there has been great interest in the development of HSP70 inhibitors for cancer therapy. Here we report that the HSP70 inhibitor 2-phenylethynesulfonamide (PES) binds to the substrate-binding domain of HSP70, and requires the C-terminal helical ‘lid’ of this protein (amino acids 573-616) in order to bind. Using molecular modeling and in silico docking, we have identified a candidate binding site for PES in this region of HSP70, and we identify point mutants that fail to interact with PES. A preliminary structure-activity relationship analysis has revealed a derivative of PES, 2-(3-chlorophenyl) ethynesulfonamide (PES-Cl), which shows increased cytotoxicity and ability to inhibit autophagy, along with significantly improved ability to extend the life of mice with pre-B cell lymphoma, compared to the parent compound (p=0.015). Interestingly, we also show that these HSP70 inhibitors impair the activity of the Anaphase Promoting Complex/Cyclosome (APC/C) in cell-free extracts, and induce G2/M arrest and genomic instability in cancer cells. PES-Cl is thus a promising new anti-cancer compound with several notable mechanisms of action.
Phenylethynesulfonamide; HSP70; HSP72; lymphoma; autophagy; HSP90
Agents targeting EGFR and related ErbB family proteins are valuable therapies for the treatment of many cancers. For some tumor types, including squamous cell carcinomas of the head and neck (SCCHN), antibodies targeting EGFR were the first protein-directed agents to show clinical benefit, and remain a standard component of clinical strategies for management of the disease. Nevertheless, many patients display either intrinsic or acquired resistance to these drugs; hence, major research goals are to better understand the underlying causes of resistance, and to develop new therapeutic strategies that boost the impact of EGFR/ErbB inhibitors. In this review, we first summarize current standard use of EGFR inhibitors in the context of SCCHN, and describe new agents targeting EGFR currently moving through pre-clinical and clinical development. We then discuss how changes in other transmembrane receptors, including IGF1R, c-Met, and TGF-β, can confer resistance to EGFR-targeted inhibitors, and discuss new agents targeting these proteins. Moving downstream, we discuss critical EGFR-dependent effectors, including PLC-γ; PI3K and PTEN; SHC, GRB2, and RAS; and the STAT proteins, as factors in resistance to EGFR-directed inhibitors and as alternative targets of therapeutic inhibition. We summarize alternative sources of resistance among cellular changes that target EGFR itself, through regulation of ligand availability, post-translational modification of EGFR, availability of EGFR partners for hetero-dimerization and control of EGFR intracellular trafficking for recycling versus degradation. Finally, we discuss new strategies to identify effective therapeutic combinations involving EGFR-targeted inhibitors, in the context of new system level data becoming available for analysis of individual tumors.
PLC-γ; PI3K; PTEN; SHC; GRB2; RAS; STAT; IGFR; c-MET
Rotamer libraries are used in protein structure determination, structure prediction, and design. The backbone-dependent rotamer library consists of rotamer frequencies and their mean dihedral angles and variances as a function of the backbone dihedral angles ϕ and ψ. Previous versions of this rotamer library were not developed with smoothness in mind, although some structure prediction and protein design methods would strongly benefit from smoothing. A new version of the backbone-dependent rotamer library has been developed using adaptive kernel density estimates for the rotamer frequencies and adaptive kernel regression for the mean dihedral angles and variances. The formulation presented allows for evaluation of the rotamer probabilities, mean angles and variances at any ϕ, ψ point, i.e. as a continuous function of ϕ and ψ. Continuous probability density estimates for the non-rotameric degrees of freedom of amides, carboxylates, and aromatic side chains have been modeled as a function of the backbone dihedral angles and rotamers of the remaining degrees of freedom. New backbone-dependent rotamer libraries at varying levels of smoothing are available from http://dunbrack.fccc.edu.
Previous analyses of the complementarity determining regions (CDRs) of antibodies have focused on a small number of “canonical” conformations for each loop. This is primarily the result of the work of Chothia and colleagues, most recently in 1997. Because of the widespread utility of antibodies, we have revisited the clustering of conformations of the six CDR loops with the much larger amount of structural information currently available. In this work, we were careful to use a high-quality data set by eliminating low-resolution structures and CDRs with high B-factors or high conformational energies. We used a distance function based on directional statistics and an effective clustering algorithm using affinity propagation. With this data set of over 300 non-redundant antibody structures, we were able to cover 28 CDR-length combinations (e.g., L1 length 11, or “L1-11” in our nomenclature) for L1, L2, L3, H1 and H2. The Chothia analysis covered only 20 CDR-lengths. Only four of these had more than one conformational cluster, of which two could easily be distinguished by gene source (mouse/human; κ/λ) and one purely by the presence and positions of Pro residues (L3-9). Thus using the Chothia analysis does not require the complicated set of “structure-determining residues” that is often assumed. Of our 28 CDR-lengths, 15 of them have multiple conformational clusters including ten for which Chothia had only one canonical class. We have a total of 72 clusters for the non-H3 CDRs; approximately 85% of the non-H3 sequences can be assigned to a conformational cluster based on gene source and/or sequence. We found that earlier predictions of “bulged” vs. “non-bulged” conformations based on the presence or absence of anchor residues Arg/Lys94 and Asp101 of H3 have not held up, since all four combinations lead to a majority of conformations that are bulged. Thus the earlier analyses have been significantly enhanced by the increased data. 
We believe the new classification will lead to improved methods for antibody structure prediction and design.
antibody structure; canonical loop conformations; affinity propagation
Foldamers present a particularly difficult challenge for accurate computational design compared to conventional peptide and protein design, due to the lack of a large body of structural data to allow parameterization of rotamer libraries and energies. We therefore explored the use of molecular mechanics for constructing rotamer libraries for non-natural foldamer backbones. We first evaluated the accuracy of molecular mechanics (MM) for the prediction of rotamer probability distributions in the crystal structures of proteins. The van der Waals radius, dielectric constant and effective Boltzmann temperature were systematically varied to maximize agreement with experimental data. Boltzmann-weighted probabilities from these molecular mechanics energies compare well with database-derived probabilities for both an idealized α-helix (R = 0.95) and β-strand conformations (R = 0.92). Based on these parameters, de novo rotamer probabilities for secondary structures of peptides built from β-amino acids were determined. To limit computational complexity, it is useful to establish a residue-specific criterion for excluding rare, high-energy rotamers from the library. This is accomplished by including only those rotamers with probability greater than a given threshold (e.g. 10%) of the random value, defined as 1/n where n is the number of potential rotamers for each residue type.
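The pruning criterion can be illustrated with a minimal sketch: rotamer energies are converted to Boltzmann-weighted probabilities, and a rotamer is kept only if its probability exceeds a fraction (here 10%) of the random value 1/n. The rotamer names, the energies in kcal/mol and the temperature of 300 K are made-up inputs; the source treats the effective Boltzmann temperature as a fitted parameter and does not report its value here. Whether the retained probabilities are renormalized afterwards is not stated in the source, so the sketch leaves them as they are.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_probabilities(energies, temperature):
    """Convert rotamer energies (kcal/mol) into Boltzmann-weighted probabilities."""
    e_min = min(energies.values())  # shift by the minimum for numerical stability
    weights = {
        rot: math.exp(-(e - e_min) / (K_B * temperature))
        for rot, e in energies.items()
    }
    z = sum(weights.values())
    return {rot: w / z for rot, w in weights.items()}


def prune_rotamers(probs, fraction=0.10):
    """Keep rotamers whose probability exceeds `fraction` of the random value 1/n."""
    cutoff = fraction / len(probs)
    return {rot: p for rot, p in probs.items() if p > cutoff}


# Hypothetical energies for a residue type with three candidate rotamers.
energies = {"g+": 0.0, "t": 0.5, "g-": 3.0}
probs = boltzmann_probabilities(energies, temperature=300.0)
kept = prune_rotamers(probs)  # cutoff is 0.10 * (1/3), so the rare "g-" rotamer is dropped
```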
Protein intrinsic disorder is becoming increasingly recognized in proteomics research. While lacking structure, many regions of disorder have been associated with biological function. There are many different experimental methods for characterizing intrinsically disordered proteins and regions; nevertheless, the prediction of intrinsic disorder from amino acid sequence remains a useful strategy, especially for many large-scale proteomics investigations. Here we introduce a consensus artificial neural network (ANN) prediction method, which was developed by combining the outputs of several individual disorder predictors. By eight-fold cross-validation, this meta-predictor, called PONDR-FIT, was found to improve the prediction accuracy over a range of 3 to 20% with an average of 11% compared to the single predictors, depending on the datasets being used. Analysis of the errors shows that the worst accuracy still occurs for short disordered regions with fewer than ten residues, as well as for the residues close to order/disorder boundaries. Increased understanding of the underlying mechanism by which such meta-predictors give improved predictions will likely promote the further development of protein disorder predictors. PONDR-FIT is available at www.disprot.org.
natively unfolded; intrinsically unstructured; intrinsically disordered; highly flexible; highly dynamic; structurally disordered; predictor; PONDR
Comparison of protein structures is important for revealing the evolutionary relationship among proteins, predicting protein functions and predicting protein structures. Many methods have been developed in the past to align two or multiple protein structures. Despite the importance of this problem, rigorous mathematical or statistical frameworks have seldom been pursued for general protein structure comparison. One notable issue in this field is that with many different distances used to measure the similarity between protein structures, none of them are proper distances when protein structures of different sequences are compared. Statistical approaches based on those non-proper distances or similarity scores as random variables are thus not mathematically rigorous. In this work, we develop a mathematical framework for protein structure comparison by treating protein structures as three-dimensional curves. Using an elastic Riemannian metric on spaces of curves, geodesic distance, a proper distance on spaces of curves, can be computed for any two protein structures. In this framework, protein structures can be treated as random variables on the shape manifold, and means and covariance can be computed for populations of protein structures. Furthermore, these moments can be used to build Gaussian-type probability distributions of protein structures for use in hypothesis testing. The covariance of a population of protein structures can reveal the population-specific variations and be helpful in improving structure classification. With curves representing protein structures, the matching is performed using elastic shape analysis of curves, which can effectively model conformational changes and insertions/deletions. We show that our method performs comparably with commonly used methods in protein structure classification on a large manually annotated data set.
Protein structure comparison is important for understanding the evolutionary relationships among proteins, predicting protein functions, and predicting protein structures. Despite its importance, there have been no rigorous mathematical or statistical frameworks for protein structure comparison. One notable issue in this field is that with many different similarity measures used in comparing protein structures, none of them are proper distances when protein structures of different sequences are compared. In this study, we develop a mathematical framework for protein structure comparison by treating protein structures as three dimensional curves. A formal distance, geodesic distance, can be computed for any two protein structures. In this framework, population-specific variations within protein families can be characterized through building probability distributions for structures of protein families. The mean and covariance computed from groups of protein structures can also help to improve the classifications of protein structures. With curves representing protein structures, the matching is performed using elastic shape analysis of curves, which can effectively model conformational changes and insertions/deletions.
In homology modeling of protein structures, it is typical to find templates through a sequence search against a database of proteins with known structures. In more complicated modeling cases, such as modeling a protein structure in contact with a ligand, sequence information itself may not be enough and more biological information is required for a successful modeling process. SCOP and PFAM are two databases providing protein domain information which can be utilized in complex protein structure modeling. However, due to the manually-curated nature of both databases, they fail to provide timely coverage of protein sequences existing in the Protein Data Bank (PDB). In this paper, we introduce a new relational database, IDOPS, which integrates sequence and biological information extracted from remediated PDB files and protein domain information generated with HMM profiles of PFAM families. With a carefully designed protocol, this database is updated regularly and the coverage rate of PDB entries is guaranteed to be high.
Determination of side-chain conformations is an important step in protein structure prediction and protein design. Many such methods have been presented, although only a small number are in widespread use. SCWRL is one such method, and the SCWRL3 program (2003) has remained popular due to its speed, accuracy, and ease-of-use for the purpose of homology modeling. However, higher accuracy at comparable speed is desirable. This has been achieved through: 1) a new backbone-dependent rotamer library based on kernel density estimates; 2) averaging over samples of conformations about the positions in the rotamer library; 3) a fast anisotropic hydrogen bonding function; 4) a short-range, soft van der Waals atom-atom interaction potential; 5) fast collision detection using k-discrete oriented polytopes; 6) a tree decomposition algorithm to solve the combinatorial problem; and 7) optimization of all parameters by determining the interaction graph within the crystal environment using symmetry operators of the crystallographic space group. Accuracies as a function of electron density of the side chains demonstrate that side chains with higher electron density are easier to predict than those with low electron density and presumed conformational disorder. For a testing set of 379 proteins, 86% of χ1 angles and 75% of χ1+2 are predicted correctly within 40° of the X-ray positions. Among side chains with higher electron density (25th–100th percentile), these numbers rise to 89% and 80%. The new program maintains its simple command-line interface, designed for homology modeling, and is now available as a dynamic-linked library for incorporation into other software programs.
homology modeling; side-chain prediction; protein structure; rotamer library; graph decomposition; SCWRL
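The accuracy measure quoted in the abstract above (χ angles predicted within 40° of the X-ray positions) can be sketched as follows. Angle differences must be taken on the circle, so that 175° and -178° differ by 7° rather than 353°. The sketch assumes that "χ1+2 correct" means both χ1 and χ2 fall within the tolerance, which is the usual convention but is not spelled out in the text. The function names and the three example residues are hypothetical, and symmetric side chains (for example, ring flips in Phe or Tyr) are not handled.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two dihedral angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def chi_accuracy(predicted, reference, tol=40.0):
    """Return (fraction with chi1 correct, fraction with chi1 and chi2 both correct).

    predicted and reference are equal-length lists of (chi1, chi2) tuples in degrees.
    """
    chi1_ok = 0
    chi12_ok = 0
    for (p1, p2), (r1, r2) in zip(predicted, reference):
        ok1 = angle_diff(p1, r1) <= tol
        ok2 = angle_diff(p2, r2) <= tol
        chi1_ok += ok1
        chi12_ok += ok1 and ok2
    n = len(reference)
    return chi1_ok / n, chi12_ok / n


# Made-up dihedral angles for three residues.
predicted = [(-60.0, 180.0), (175.0, 60.0), (60.0, -90.0)]
reference = [(-65.0, 170.0), (-178.0, 120.0), (-60.0, -85.0)]
acc_chi1, acc_chi12 = chi_accuracy(predicted, reference)
```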
The Protein Common Interface Database (ProtCID) is a database that contains clusters of similar homodimeric and heterodimeric interfaces observed in multiple crystal forms (CFs). Such interfaces, especially of homologous but non-identical proteins, have been associated with biologically relevant interactions. In ProtCID, protein chains in the Protein Data Bank (PDB) are grouped based on their PFAM domain architectures. For a single PFAM architecture, all the dimers present in each CF are constructed and compared with those in other CFs that contain the same domain architecture. Interfaces occurring in two or more CFs comprise an interface cluster in the database. The same process is used to compare heterodimers of chains with different domain architectures. By examining interfaces that are shared by many homologous proteins in different CFs, we find that the PDB and the Protein Interfaces, Surfaces, and Assemblies (PISA) server are not always consistent in their annotations of biological assemblies in a homologous family. Our data therefore provide an independent check on publicly available annotations of the structures of biological interactions for PDB entries. Common interfaces may also be useful in studies of protein evolution. Coordinates for all interfaces in a cluster are downloadable for further analysis. ProtCID is available at http://dunbrack2.fccc.edu/protcid.
Protein structure determination and predictive modeling have long been guided by the paradigm that the peptide backbone has a single, context-independent ideal geometry. Both quantum-mechanics calculations and empirical analyses have shown this is an incorrect simplification in that backbone covalent geometry actually varies systematically as a function of the Φ and Ψ backbone dihedral angles. Here, we use a nonredundant set of ultrahigh-resolution protein structures to define these conformation-dependent variations. The trends have a rational, structural basis that can be explained by avoidance of atomic clashes or optimization of favorable electrostatic interactions. To facilitate adoption of this new paradigm, we have created a conformation-dependent library of covalent bond lengths and bond angles and shown that it has improved accuracy over existing methods without any additional variables to optimize. Protein structures derived from crystallographic refinement and from predictive modeling both stand to benefit from incorporation of the new paradigm.
Comparison of elastic network model predictions with experimental data has provided important insights into the dominant role of the network of inter-residue contacts in defining the global dynamics of proteins. Most of these studies have focused on interpreting the mean-square fluctuations of residues, or deriving the most collective, or softest, modes of motions that are known to be insensitive to structural and energetic details. However, with increasing structural data, we are in a position to perform a more critical assessment of the structure-dynamics relations in proteins, and gain a deeper understanding of the major determinants of not only the mean-square fluctuations and lowest frequency modes, but the covariance or the cross-correlations between residue fluctuations and the shapes of higher modes. Here, a large set of NMR-determined proteins is systematically analyzed using a novel method based on entropy maximization to demonstrate that the next level of refinement in the elastic network model description of proteins ought to take into consideration properties such as contact order (or sequential separation between contacting residues) and the secondary structure types of the interacting residues, whereas the types of amino acids do not play a critical role. Most importantly, an optimal description of observed cross-correlations requires the inclusion of destabilizing, as opposed to exclusively stabilizing, interactions, stipulating the functional significance of local frustration in imparting native-like dynamics. This study provides us with a deeper understanding of the structural basis of experimentally observed behavior, and opens the way to the development of more accurate models for exploring protein dynamics.
As more protein structures are solved, we are able to perform a more critical assessment of the relationship between protein structure and dynamics, and to gain a deeper understanding of the major determinants of structural dynamics. Here we perform a systematic study on a set of proteins structurally determined by NMR spectroscopy. The dynamics are analyzed using elastic network models and a novel method based on entropy maximization to demonstrate that properties such as contact order and secondary structure do play a role in defining the experimentally observed covariance data. Most importantly, an optimal description of observed cross-correlations requires the inclusion of destabilizing, as well as stabilizing, interactions, stipulating the functional significance of local frustration in imparting native-like dynamics.
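To make the notion of contact order concrete, the following sketch lists the inter-residue contacts of a structure together with their sequential separation, which is the property the abstract describes as relevant to refining elastic network models. The Cα-only representation, the 7.0 Å cutoff, and the function names are assumptions chosen for illustration. The entropy-maximization method itself is not reproduced here.

```python
import math


def contacts(ca_coords, cutoff=7.0):
    """Return (i, j, j - i) for residue pairs whose CA atoms lie within cutoff.

    ca_coords: list of (x, y, z) tuples in angstroms, in sequence order.
    The 7.0 A cutoff is a common elastic-network choice, assumed here.
    """
    found = []
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                found.append((i, j, j - i))
    return found


def mean_sequence_separation(contact_list, min_separation=2):
    """Average sequence separation over contacts, skipping chain neighbours."""
    seps = [s for _, _, s in contact_list if s >= min_separation]
    return sum(seps) / len(seps) if seps else 0.0


# Toy hairpin: two antiparallel three-residue strands, 3.8 A spacing.
hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0),
]
hairpin_contacts = contacts(hairpin)
hairpin_separation = mean_sequence_separation(hairpin_contacts)
```

A model that weights springs by contact order could then use the third element of each tuple to assign different force constants to local and nonlocal contacts.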
Distributions of the backbone dihedral angles of proteins have been studied for over 40 years. While many statistical analyses have been presented, only a handful of probability densities are publicly available for use in structure validation and structure prediction methods. The available distributions differ in a number of important ways, which determine their usefulness for various purposes. These include: 1) input data size and criteria for structure inclusion (resolution, R-factor, etc.); 2) filtering of suspect conformations and outliers using B-factors or other features; 3) secondary structure of input data (e.g., whether helix and sheet are included; whether beta turns are included); 4) the method used for determining probability densities ranging from simple histograms to modern nonparametric density estimation; and 5) whether they include nearest neighbor effects on the distribution of conformations in different regions of the Ramachandran map. In this work, Ramachandran probability distributions are presented for residues in protein loops from a high-resolution data set with filtering based on calculated electron densities. Distributions for all 20 amino acids (with cis and trans proline treated separately) have been determined, as well as 420 left-neighbor and 420 right-neighbor dependent distributions. The neighbor-independent and neighbor-dependent probability densities have been accurately estimated using Bayesian nonparametric statistical analysis based on the Dirichlet process. In particular, we used hierarchical Dirichlet process priors, which allow sharing of information between densities for a particular residue type and different neighbor residue types. The resulting distributions are tested in a loop modeling benchmark with the program Rosetta, and are shown to improve protein loop conformation prediction significantly. The distributions are available at http://dunbrack.fccc.edu/hdp.
The three-dimensional structure of a protein enables it to perform its specific function, which may be catalysis, DNA binding, cell signaling, maintaining cell shape and structure, or one of many other functions. Predicting the structures of proteins is an important goal of computational biology. One way of doing this is to figure out the rules that determine protein structure from protein sequences by determining how local protein sequence is associated with local protein structure. That is, many (but not all) of the interactions that determine protein structure occur between amino acids that are a short distance away from each other in the sequence. This is particularly true in the irregular parts of protein structure, often called loops. In this work, we have performed a statistical analysis of the structure of the protein backbone in loops as a function of the protein sequence. We have determined how an amino acid bends the local backbone due to its amino acid type and the amino acid types of its neighbors. We used a recently developed statistical method that is particularly suited to this problem. The analysis shows that backbone conformation prediction can be improved using the information in the statistical distributions we have developed.
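For orientation, the simplest of the density-estimation approaches mentioned above, a plain histogram, can be sketched as follows. This is deliberately not the hierarchical Dirichlet process method used in the work. It only shows what a binned Ramachandran probability distribution looks like, with the bin width and optional pseudocount as assumed parameters.

```python
from collections import Counter


def ramachandran_histogram(phi_psi_pairs, bin_width=30, pseudocount=0.0):
    """Estimate P(bin) on a periodic (phi, psi) grid from observed angles.

    phi_psi_pairs: list of (phi, psi) in degrees.
    Returns a dict mapping (phi_bin, psi_bin) indices to probabilities.
    """
    n_bins = 360 // bin_width
    total = len(phi_psi_pairs) + pseudocount * n_bins * n_bins
    if total == 0:
        raise ValueError("no observations and no pseudocount")

    counts = Counter()
    for phi, psi in phi_psi_pairs:
        # Shift to [0, 360) so that -180 and 180 fall in the same bin.
        i = int(((phi + 180.0) % 360.0) // bin_width)
        j = int(((psi + 180.0) % 360.0) // bin_width)
        counts[(i, j)] += 1

    return {
        (i, j): (counts[(i, j)] + pseudocount) / total
        for i in range(n_bins)
        for j in range(n_bins)
    }


# Toy input: three helical-region observations and one extended one.
observations = [(-63, -43), (-70, -40), (-65, -35), (-120, 130)]
density = ramachandran_histogram(observations)
```

A neighbor-dependent version would keep one such table per (residue type, neighbor type) pair. With sparse data those tables become noisy, which is the problem that sharing information across related densities is meant to address.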
We describe the proceedings and conclusions from a “Workshop on Applications of Protein Models in Biomedical Research” that was held at University of California at San Francisco on 11 and 12 July, 2008. At the workshop, international scientists involved with structure modeling explored (i) how models are currently used in biomedical research, (ii) what the requirements and challenges for different applications are, and (iii) how the interaction between the computational and experimental research communities could be strengthened to advance the field.
Many proteins function as homooligomers and are regulated via their oligomeric state. For some proteins, the stoichiometry of homooligomeric states under various conditions has been studied using gel filtration or analytical ultracentrifugation experiments. The interfaces involved in these assemblies may be identified using crosslinking and mass spectrometry, solution-state NMR, and other experiments. But for most proteins, the actual interfaces that are involved in oligomerization are inferred from X-ray crystallographic structures using assumptions about interface surface areas and physical properties. Examination of interfaces across different PDB entries in a protein family reveals several important features. First, similarity of space group, asymmetric unit size, and cell dimensions and angles (within 1%) does not guarantee that two crystals are actually the same crystal form, that is, that they contain similar relative orientations and interactions within the crystal. Conversely, two crystals in different space groups may be quite similar in terms of all of the interfaces within each crystal. Second, NMR structures and an existing benchmark of PDB crystallographic entries consisting of 126 dimers and larger structures and 132 monomers were used to determine whether the presence or absence of common interfaces across multiple crystal forms can be used to predict whether a protein is an oligomer. Monomeric proteins tend to have common interfaces across only a minority of crystal forms, while higher order structures exhibit common interfaces across a majority of available crystal forms. The data can be used to estimate the probability that an interface is biological if two or more crystal forms are available. Finally, the PISA database available from the EBI is more consistent in identifying interfaces observed in many crystal forms than is the PDB or EBI’s Protein Quaternary Server (PQS). The PDB in particular is missing highly likely biological interfaces in its biological unit files for about 10% of PDB entries.
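The minority-versus-majority observation above amounts to asking, for each interface, in what fraction of the available crystal forms it appears. A minimal sketch of that bookkeeping follows. The input format (crystal form identifiers mapped to sets of interface labels) and the function names are hypothetical, and no probability model from the work is implied.

```python
def interface_coverage(crystal_forms):
    """Fraction of crystal forms in which each interface is observed.

    crystal_forms: dict mapping a crystal form id to the set of interface
    labels found in that form (labels are assumed to be already matched
    across forms).
    """
    n_forms = len(crystal_forms)
    if n_forms == 0:
        return {}
    counts = {}
    for interfaces in crystal_forms.values():
        for label in set(interfaces):
            counts[label] = counts.get(label, 0) + 1
    return {label: count / n_forms for label, count in counts.items()}


def most_common_interface(crystal_forms):
    """Return (label, fraction) for the most widely shared interface, or None."""
    coverage = interface_coverage(crystal_forms)
    if not coverage:
        return None
    # Sort labels first so that ties are broken deterministically.
    best = max(sorted(coverage), key=coverage.get)
    return best, coverage[best]


# Toy family with four crystal forms.
family = {
    "CF1": {"A", "B"},
    "CF2": {"A"},
    "CF3": {"A", "C"},
    "CF4": {"C"},
}
coverage = interface_coverage(family)
top_interface = most_common_interface(family)
```

Here interface A appears in three of the four crystal forms, so under the reasoning above it would be the more plausible candidate for a biological interface than B or C.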
Cytosolic sulfotransferases catalyze the sulfonation of hormones, metabolites, and xenobiotics. Many of these proteins have been shown to form homo- and heterodimers. An unusually small dimer interface was previously identified by Petrotchenko et al. (FEBS Lett 490, 39-43, 2001) by crosslinking, protease digestion, and mass spectrometry, and verified by site-directed mutagenesis. Analysis of the crystal packing interfaces in all 28 available crystal structures, which comprise 17 crystal forms, shows that this interface occurs in all of them. With a small number of exceptions, the publicly available databases of biological assemblies contain either monomers or incorrect dimers. Even crystal structures of mouse SULT1E1, which is a monomer in solution, contain the common dimeric interface, although distorted and missing two important salt bridges.
SCWRL and MolIDE are software applications for prediction of protein structures. SCWRL is designed specifically for the task of prediction of side-chain conformations given a fixed backbone usually obtained from an experimental structure determined by X-ray crystallography or NMR. SCWRL is a command-line program that typically runs in a few seconds. MolIDE provides a graphical interface for basic comparative (homology) modeling using SCWRL and other programs. MolIDE takes an input target sequence, and uses PSI-BLAST to identify and align templates for comparative modeling of the target. The sequence alignment to any template can be manually modified in a graphical window that shows the target-template alignment, and the alignment can be visualized on the template structure. MolIDE builds the model of the target structure from the template backbone, side-chain conformations predicted with SCWRL, and a loop-modeling program applied to insertion-deletion regions with user-selected sequence segments. SCWRL and MolIDE can be obtained at http://dunbrack.fccc.edu/Software.php.
Computational methods; Protein structure prediction; Comparative (homology) modeling