We describe a method to predict protein-protein interactions (PPIs) formed between structured domains and short peptide motifs. We take an integrative approach based on consensus patterns of known motifs in databases, structures of domain-motif complexes from the PDB and various sources of non-structural evidence. We combine this set of clues using a Bayesian classifier that reports the likelihood of an interaction and obtain significantly improved prediction performance when compared to individual sources of evidence and to previously reported algorithms. Our Bayesian approach was integrated into PrePPI, a structure-based PPI prediction method that, so far, has been limited to interactions formed between two structured domains. Around 80,000 new domain-motif mediated interactions were predicted, thus enhancing PrePPI’s coverage of the human protein interactome.
Complexes formed between a structured domain on one protein and an unstructured peptide on another are ubiquitous. However, they are often quite difficult to detect experimentally. The development of computational approaches to predict domain-motif interactions is therefore an important goal. We report a method to predict domain-motif interactions using a Bayesian approach to integrate evidence from a variety of sources, including three-dimensional structural and non-structural information. The method was applied to the entire human proteome and showed significant improvement over existing methods. The method was incorporated into PrePPI, a computational pipeline for the prediction of protein-protein interactions that relies heavily on structural information. Approximately 80,000 new interactions were detected. The new PrePPI database provides easy access to about 400,000 human protein-protein interactions and should thus constitute a valuable resource in a variety of biological applications including the characterization of molecular interaction networks and, more generally, in the study of interactions mediated by proteins in families that may not be extensively studied experimentally.
The last decade has seen a dramatic expansion in the number and range of techniques available to obtain genome-wide information, and to analyze this information so as to infer both the function of individual molecules and how they interact to modulate the behavior of biological systems. Here we review these techniques, focusing on the construction of physical protein-protein interaction networks, and highlighting approaches that incorporate protein structure, which is becoming an increasingly important component of systems-level computational techniques. We also discuss how network analyses are being applied to enhance the basic understanding of biological systems and their dysregulation, and how they are being applied in drug development.
Protein domain family PF11267 (DUF3067) is a family of proteins of unknown function found in both bacteria and eukaryotes. Here we present the solution NMR structure of the 102-residue Alr2454 protein from Nostoc sp. PCC 7120, which constitutes the first structural representative from this conserved protein domain family. The structure of Nostoc sp. Alr2454 adopts a novel protein fold.
Alr2454 protein; DUF3067; PF11267; Protein Structure Initiative; Solution NMR structure; Structural genomics
High-quality NMR structures of the homo-dimeric proteins Bvu3908 (69 residues per monomer) from Bacteroides vulgatus and Bt2368 (74 residues per monomer) from Bacteroides thetaiotaomicron reveal the presence of winged helix-turn-helix (wHTH) motifs mediating tight complex formation. Such homo-dimer formation by wHTH motifs is otherwise found only in two DNA-binding proteins with known structure: the C-terminal wHTH domain of transcriptional activator FadR from E. coli and protein TubR from B. thuringiensis, which is involved in plasmid DNA segregation. However, the relative orientation of the wHTH motifs is different and residues involved in DNA-binding are not conserved in Bvu3908 and Bt2368. Hence, the proteins of the present study are unlikely to bind DNA, but are likely to exhibit a function that has thus far not been ascribed to homo-dimers formed by wHTH motifs. The structures of Bvu3908 and Bt2368 are the first atomic resolution structures for PFAM family PF10771, a family of unknown function (DUF2582) currently containing 128 members.
Bvu3908; Bt2368; PF10771; DUF2582; Winged helix-turn-helix; Structural genomics
The genome-wide identification of pairs of interacting proteins is an important step in the elucidation of cell regulatory mechanisms [1,2]. Much of our current knowledge derives from high-throughput techniques such as yeast two-hybrid and affinity purification [3], as well as from manual curation of experiments on individual systems [4]. A variety of computational approaches based, for example, on sequence homology, gene co-expression, and phylogenetic profiles have also been developed for the genome-wide inference of protein-protein interactions (PPIs) [5,6]. Yet, comparative studies suggest that the development of accurate and complete repertoires of PPIs is still in its early stages [7–9]. Here we show that three-dimensional structural information can be used to predict PPIs with an accuracy and coverage that are superior to predictions based on non-structural evidence. Moreover, an algorithm, PrePPI, that combines structural information with other functional clues is comparable in accuracy to high-throughput experiments, yielding over 30,000 high confidence interactions for yeast and over 300,000 for human. Experimental tests of a number of predictions demonstrate the ability of the PrePPI algorithm to identify unexpected PPIs of significant biological interest. The surprising effectiveness of three-dimensional structural information can be attributed to the use of homology models combined with the exploitation of both close and remote geometric relationships between proteins.
PrePPI (http://bhapp.c2b2.columbia.edu/PrePPI) is a database that combines predicted and experimentally determined protein–protein interactions (PPIs) using a Bayesian framework. Predicted interactions are assigned probabilities of being correct, which are derived from calculated likelihood ratios (LRs) by combining structural, functional, evolutionary and expression information, with the most important contribution coming from structure. Experimentally determined interactions are compiled from a set of public databases that manually collect PPIs from the literature and are also assigned LRs. A final probability is then assigned to every interaction by combining the LRs for both predicted and experimentally determined interactions. The current version of PrePPI contains ∼2 million PPIs that have a probability greater than ∼0.1, of which ∼60 000 PPIs for yeast and ∼370 000 PPIs for human are considered high confidence (probability > 0.5). The PrePPI database constitutes an integrated resource that enables users to examine aggregate information on PPIs, including both known and potentially novel interactions, and that provides structural models for many of the PPIs.
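As an illustration of how likelihood ratios from several evidence sources can be turned into a single probability, the following minimal sketch multiplies the LRs under a naive Bayes independence assumption and converts the resulting posterior odds to a probability. The function name, the independence assumption and the prior odds value are assumptions made for this example; the abstract does not specify how PrePPI combines its LRs or what prior it uses.

```python
from math import prod


def posterior_probability(likelihood_ratios, prior_odds):
    """Combine per-evidence likelihood ratios into a posterior probability.

    Assumes the evidence sources are conditionally independent (naive Bayes),
    so the combined LR is the product of the individual LRs.
    `prior_odds` is a hypothetical value chosen by the caller.
    """
    posterior_odds = prior_odds * prod(likelihood_ratios)
    # Convert odds to probability: p = odds / (1 + odds)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical LRs for structural and co-expression evidence
p = posterior_probability([10.0, 5.0], prior_odds=0.01)
```

With these made-up numbers the posterior odds are 0.5, which corresponds to a probability of one third. An LR above 1 raises the probability of an interaction and an LR below 1 lowers it.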
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is “optimal” in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are “suboptimal” in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for “modelability”, we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.
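For readers unfamiliar with the DP score referred to above, the sketch below computes the optimal global alignment score with the standard Needleman-Wunsch recurrence. It illustrates conventional dynamic programming only, not the fragment-based method described in the abstract, and the match, mismatch and gap values are arbitrary assumptions.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty).

    Scoring parameters are illustrative assumptions, not values from the text.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


score = nw_score("ACGT", "ACGT")
```

The recurrence returns only the single highest-scoring alignment, which is why an alignment that is better in structural terms but lower in score is never reported by this procedure.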
It has been suggested that, for nearly every protein sequence, there is already a protein with a similar structure in current protein structure databases. However, with poor or undetectable sequence relationships, it is expected that accurate alignments and models cannot be generated. Here we show that this is not the case, and that whenever a structural relationship exists, there are usually local sequence relationships that can be used to generate an accurate alignment, no matter what the global sequence identity. However, this requires an alternative to the traditional dynamic programming algorithm and the consideration of a small ensemble of alignments. We present an algorithm, S4, and demonstrate that it is capable of generating accurate alignments in nearly all cases where a structural relationship exists between two proteins. Our results thus constitute an important advance in the full exploitation of the information in structural databases. That is, the expectation of an accurate alignment suggests that a meaningful model can be generated for nearly every sequence for which a suitable template exists.
We describe MarkUs, a web server for analysis and comparison of the structural and functional properties of proteins. In contrast to a ‘structure in/function out’ approach to protein function annotation, the server is designed to be highly interactive and to allow flexibility in the examination of possible functions, suggested either automatically by various similarity measures or specified by a user directly. This is combined with tools that allow a user to assess independently whether or not a suggested function is consistent with the bioinformatic and biophysical properties of a given query structure, further allowing the user to generate testable hypotheses. The server is available at http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:Mark-Us.
We describe PredUs, an interactive web server for the prediction of protein–protein interfaces. Potential interfacial residues for a query protein are identified by ‘mapping’ contacts from known interfaces of the query protein’s structural neighbors to surface residues of the query. We calculate a score for each residue to be interfacial with a support vector machine. Results can be visualized in a molecular viewer and a number of interactive features allow users to tailor a prediction to a particular hypothesis. The PredUs server is available at: http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:PredUs.
SkyLine, a high-throughput homology modeling pipeline tool, detects and models true sequence homologs to a given protein structure. Structures and models are stored in SkyBase with links to computational function annotation, as calculated by MarkUs. The SkyLine/SkyBase/MarkUs technology represents a novel structure-based approach that is more objective and versatile than other protein classification resources. This structure-centric strategy provides a multidimensional organization and coverage of protein space at the levels of family, function, and genome. The concept of “modelability”, the ability to model sequences on related structures, provides a reliable criterion for membership in a protein family (“leverage”) and underlies the unique success of this approach. The overall procedure is illustrated by its application to START domains, which comprise a Biomedical Theme for the Northeast Structural Genomics Consortium (NESG) as part of the Protein Structure Initiative (PSI). START domains are typically involved in the non-vesicular transport of lipids. While 19 experimentally determined structures are available, the family, whose evolutionary hierarchy is not well determined, is highly sequence diverse, and the ligand-binding potential of many family members is unknown. The SkyLine/SkyBase/MarkUs approach provides significant insights and predicts: 1) many more family members (~4,000) than any other resource; 2) the function for a large number of unannotated proteins; 3) instances of START domains in genomes from which they were thought to be absent; and 4) the existence of two types of novel proteins, those containing dual START domains and those containing N-terminal START domains.
Homology modeling; Structural genomics; Bioinformatics; Protein function annotation; START domain; Arabidopsis thaliana
The construction of a homology model for a protein can involve a number of decisions requiring the integration of different sources of information and the application of different modeling tools depending on the particular problem. Functional information can be especially important in guiding the modeling process, but such information is not generally integrated into modeling pipelines. Pudge is a flexible, interactive protein structure prediction server, which is designed with these issues in mind. By dividing the modeling into five stages (template selection, alignment, model building, model refinement and model evaluation) and providing various tools to visualize, analyze and compare the results at each stage, we enable a flexible modeling strategy that can be tailored to the needs of a given problem. Pudge is freely available at http://wiki.c2b2.columbia.edu/honiglab_public/index.php/Software:PUDGE.
Mutations in leucine-rich repeat kinase 2 (LRRK2) are the most common genetic cause of Parkinson disease (PD). LRRK2 contains an “enzymatic core” composed of GTPase and kinase domains that is flanked by leucine-rich repeat (LRR) and WD40 protein-protein interaction domains. While kinase activity and GTP-binding have both been implicated in LRRK2 neurotoxicity, the potential role of other LRRK2 domains has not been as extensively explored.
We demonstrate that LRRK2 normally exists in a dimeric complex, and that removing the WD40 domain prevents complex formation and autophosphorylation. Moreover, loss of the WD40 domain completely blocks the neurotoxicity of multiple LRRK2 PD mutations.
These findings suggest that LRRK2 dimerization and autophosphorylation may be required for the neurotoxicity of LRRK2 PD mutations and highlight a potential role for the WD40 domain in the mechanism of LRRK2-mediated cell death.
The current non-redundant protein sequence database contains over seven million entries and the number of individual functional domains is significantly larger than this value. The vast quantity of data associated with these proteins poses enormous challenges to any attempt at function annotation. Classification of proteins into sequence and structural groups has been widely used as an approach to simplifying the problem. In this article we question such strategies. We describe how the multi-functionality and structural diversity of even closely related proteins confounds efforts to assign function based on overall sequence or structural similarity. Rather, we suggest that strategies that avoid classification may offer a more robust approach to protein function annotation.
protein function annotation; protein classification; fold space
Very few methods address the problem of predicting beta-barrel membrane proteins directly from sequence. One reason is that only very few high-resolution structures for transmembrane beta-barrel (TMB) proteins have been determined thus far. Here we introduce the design, statistics and results of a novel profile-based hidden Markov model for the prediction and discrimination of TMBs. The method carefully attempts to avoid over-fitting the sparse experimental data. While our model training and scoring procedures were very similar to a recently published work, the architecture and structure-based labelling were significantly different. In particular, we introduced a new definition of beta-hairpin motifs, explicit state modelling of transmembrane strands, and a log-odds whole-protein discrimination score. The resulting method reached an overall four-state (up-, down-strand, periplasmic-, outer-loop) accuracy as high as 86%. Furthermore, it accurately discriminated TMB from non-TMB proteins (45% coverage at 100% accuracy). This high precision enabled the application to 72 fully sequenced Gram-negative bacteria. We found over 164 previously uncharacterized TMB proteins at high confidence. Database searches did not implicate any of these proteins with membranes. We predict that the vast majority of our 164 predictions will eventually be verified experimentally. All proteome predictions and the PROFtmb prediction method are available at http://www.rostlab.org/services/PROFtmb/.
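A figure such as "45% coverage at 100% accuracy" can be read as the fraction of true TMB proteins that score above every non-TMB protein. The sketch below computes that quantity from scored, labelled proteins. It assumes that coverage means recall and accuracy means precision at a score cutoff; the function name and the example scores are hypothetical and do not come from PROFtmb.

```python
def coverage_at_full_accuracy(scores, labels):
    """Fraction of positives scoring strictly above the best-scoring negative.

    `scores` are discrimination scores (higher means more TMB-like) and
    `labels` are True for known TMB proteins, False otherwise.
    Assumes coverage = recall and accuracy = precision above the cutoff.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives:
        return 0.0
    # Any cutoff at or below the top negative would admit a false positive
    cutoff = max(negatives) if negatives else float("-inf")
    return sum(s > cutoff for s in positives) / len(positives)


# Hypothetical scores for five proteins, three of which are true TMBs
cov = coverage_at_full_accuracy([9.0, 8.0, 7.0, 6.0, 5.0],
                                [True, True, False, True, False])
```

In this toy example two of the three true TMBs score above the best non-TMB, so coverage at full accuracy is two thirds. Positives that tie with the top negative are excluded, which keeps the accuracy at 100% above the cutoff.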