Predicting the function of newly sequenced proteins is crucial due to the pace at which these raw sequences are being obtained. Almost all resources for predicting protein function assign functional terms to whole chains, and do not distinguish which particular domain is responsible for the allocated function. This is not a limitation of the methodologies themselves but it is due to the fact that in the databases of functional annotations these methods use for transferring functional terms to new proteins, these annotations are done on a whole-chain basis. Nevertheless, domains are the basic evolutionary and often functional units of proteins. In many cases, the domains of a protein chain have distinct molecular functions, independent from each other. For that reason resources with functional annotations at the domain level, as well as methodologies for predicting function for individual domains adapted to these resources are required.
We present a methodology for predicting the molecular function of individual domains, based on a previously developed database of functional annotations at the domain level. The approach, which we show outperforms a standard method based on sequence searches in assigning function, concomitantly predicts the structural fold of the domains and can give hints on the functionally important residues associated to the predicted function.
Intrinsically Disordered Proteins/Regions (IDPs/IDRs) are currently recognized as a widespread phenomenon having key cellular functions. Still, many aspects of the function of these proteins need to be unveiled. IDPs conformational flexibility allows them to recognize and interact with multiple partners, and confers them larger interaction surfaces that may increase interaction speed. For this reason, molecular interactions mediated by IDPs/IDRs are particularly abundant in certain types of protein interactions, such as those of signaling and cell cycle control. We present the first large-scale study of IDPs in Arabidopsis thaliana, the most widely used model organism in plant biology, in order to get insight into the biological roles of these proteins in plants. The work includes a comparative analysis with the human proteome to highlight the differential use of disorder in both species. Results show that while human proteins are in general more disordered, certain functional classes, mainly related to environmental response, are significantly more enriched in disorder in Arabidopsis. We propose that because plants cannot escape from environmental conditions as animals do, they use disorder as a simple and fast mechanism, independent of transcriptional control, for introducing versatility in the interaction networks underlying these biological processes so that they can quickly adapt and respond to challenging environmental conditions.
To some extent contradicting the classical paradigm of the relationship between protein 3D structure and function, now it is clear that large portions of the proteomes, especially in higher organisms, lack a fixed structure and still perform very important functions. Proteins completely or partially unstructured in their native (functional) form are involved in key cellular processes underlain by complex networks of protein interactions. The intrinsic conformational flexibility of these disordered proteins allows them to bind multiple partners in transient interactions of high specificity and low affinity. In concordance, in plants this type of proteins has been found in processes requiring these complex and versatile interaction networks. These include transcription factor networks, where disordered proteins act as integrators of different signals or link different transcription factor subnetworks due to their ability to interact (in many cases simultaneously) with different partners. Similarly, they also serve as signal integrators in signaling cascades, such as those related to response to external stimuli. Disordered proteins have also been found in plants in many stress-response processes, acting as protein chaperones or protecting other cellular components and structures. In plants, it is especially important to have complex and versatile networks able to quickly and efficiently respond to changing environmental conditions since these organisms cannot escape and have no other choice than adapting to them. Consequently, protein disorder can play an especially important role in plants, providing them with a fast mechanism to obtain complex, interconnected and versatile molecular networks.
protein function; protein structure; protein interactions; protein intrinsic disorder; biological networks; plant environmental responses
The prediction and study of protein interactions and functional relationships based on similarity of phylogenetic trees, exemplified by the mirrortree and related methodologies, is being widely used. Although dependence between the performance of these methods and the set of organisms used to build the trees was suspected, so far nobody assessed it in an exhaustive way, and, in general, previous works used as many organisms as possible. In this work we asses the effect of using different sets of organism (chosen according with various phylogenetic criteria) on the performance of this methodology in detecting protein interactions of different nature.
We show that the performance of three mirrortree-related methodologies depends on the set of organisms used for building the trees, and it is not always directly related to the number of organisms in a simple way. Certain subsets of organisms seem to be more suitable for the predictions of certain types of interactions. This relationship between type of interaction and optimal set of organism for detecting them makes sense in the light of the phylogenetic distribution of the organisms and the nature of the interactions.
In order to obtain an optimal performance when predicting protein interactions, it is recommended to use different sets of organisms depending on the available computational resources and data, as well as the type of interactions of interest.
Prediction of physical protein-protein interactions represents a key challenge in computational systems biology. This study provides a proof-of-principle that high-throughput in silico protein docking results can be used to predict interaction partners.
Deciphering the whole network of protein interactions for a given proteome (‘interactome') is the goal of many experimental and computational efforts in Systems Biology. Separately the prediction of the structure of protein complexes by docking methods is a well-established scientific area. To date, docking programs have not been used to predict interaction partners. We provide a proof of principle for such an approach. Using a set of protein complexes representing known interactors in their unbound form, we show that a standard docking program can distinguish the true interactors from a background of 922 non-redundant potential interactors. We additionally show that true interactions can be distinguished from non-likely interacting proteins within the same structural family. Our approach may be put in the context of the proposed ‘funnel-energy model'; the docking algorithm may not find the native complex, but it distinguishes binding partners because of the higher probability of favourable models compared with a collection of non-binders. The potential exists to develop this proof of principle into new approaches for predicting interaction partners and reconstructing biological networks.
interactome; protein docking; protein–protein interaction
Global studies of the protein repertories of organisms are providing important information on the characteristics of the protein space. Many of these studies entail classification of the protein repertory on the basis of structure and/or sequence similarities. The situation is different for metabolism. Because there is no good way of measuring similarities between chemical reactions, there is a barrier to the development of global classifications of "metabolic space" and subsequent studies comparable to those done for protein sequences and structures.
In this work, we propose a vectorial representation of chemical reactions, which allows them to be compared and classified. In this representation, chemical compounds, reactions and pathways may be represented in the same vectorial space. We show that the representation of chemical compounds reflects their physicochemical properties and can be used for predictive purposes. We use the vectorial representations of reactions to perform a global classification of the reactome of the model organism E. coli.
We show that this unsupervised clustering results in groups of enzymes more coherent in biological terms than equivalent groupings obtained from the EC hierarchy. This hierarchical clustering produces an optimal set of 21 groups which we analyzed for their biological meaning.
Co-evolution has an important function in the evolution of species and it is clearly manifested in certain scenarios such as host–parasite and predator–prey interactions, symbiosis and mutualism. The extrapolation of the concepts and methodologies developed for the study of species co-evolution at the molecular level has prompted the development of a variety of computational methods able to predict protein interactions through the characteristics of co-evolution. Particularly successful have been those methods that predict interactions at the genomic level based on the detection of pairs of protein families with similar evolutionary histories (similarity of phylogenetic trees: mirrortree). Future advances in this field will require a better understanding of the molecular basis of the co-evolution of protein families. Thus, it will be important to decipher the molecular mechanisms underlying the similarity observed in phylogenetic trees of interacting proteins, distinguishing direct specific molecular interactions from other general functional constraints. In particular, it will be important to separate the effects of physical interactions within protein complexes (‘co-adaptation') from other forces that, in a less specific way, can also create general patterns of co-evolution.
co-evolution; protein–protein interaction; interactome; phylogenetic tree; mirrortree
A fundamental problem when trying to define the functional relationships between proteins is the difficulty in quantifying functional similarities, even when well-structured ontologies exist regarding the activity of proteins (i.e. 'gene ontology' -GO-). However, functional metrics can overcome the problems in the comparing and evaluating functional assignments and predictions. As a reference of proximity, previous approaches to compare GO terms considered linkage in terms of ontology weighted by a probability distribution that balances the non-uniform 'richness' of different parts of the Direct Acyclic Graph. Here, we have followed a different approach to quantify functional similarities between GO terms.
We propose a new method to derive 'functional distances' between GO terms that is based on the simultaneous occurrence of terms in the same set of Interpro entries, instead of relying on the structure of the GO. The coincidence of GO terms reveals natural biological links between the GO functions and defines a distance model Df which fulfils the properties of a Metric Space. The distances obtained in this way can be represented as a hierarchical 'Functional Tree'.
The method proposed provides a new definition of distance that enables the similarity between GO terms to be quantified. Additionally, the 'Functional Tree' defines groups with biological meaning enhancing its utility for protein function comparison and prediction. Finally, this approach could be for function-based protein searches in databases, and for analysing the gene clusters produced by DNA array experiments.
It has repeatedly been shown that interacting protein families tend to have similar phylogenetic trees. These similarities can be used to predicting the mapping between two families of interacting proteins (i.e. which proteins from one family interact with which members of the other). The correct mapping will be that which maximizes the similarity between the trees. The two families may eventually comprise orthologs and paralogs, if members of the two families are present in more than one organism. This fact can be exploited to restrict the possible mappings, simply by impeding links between proteins of different organisms. We present here an algorithm to predict the mapping between families of interacting proteins which is able to incorporate information regarding orthologues, or any other assignment of proteins to "classes" that may restrict possible mappings.
For the first time in methods for predicting mappings, we have tested this new approach on a large number of interacting protein domains in order to statistically assess its performance. The method accurately predicts around 80% in the most favourable cases. We also analysed in detail the results of the method for a well defined case of interacting families, the sensor and kinase components of the Ntr-type two-component system, for which up to 98% of the pairings predicted by the method were correct.
Based on the well established relationship between tree similarity and interactions we developed a method for predicting the mapping between two interacting families using genomic information alone. The program is available through a web interface.
The production of new chemicals for industrial or therapeutic applications exceeds our ability to generate experimental data on their biological fate once they are released into the environment. Typically, mixtures of organic pollutants are freed into a variety of sites inhabited by diverse microorganisms, which structure complex multispecies metabolic networks. A machine learning approach has been instrumental to expose a correlation between the frequency of 149 atomic triads (chemotopes) common in organo-chemical compounds and the global capacity of microorganisms to metabolise them. Depending on the type of environmental fate defined, the system can correctly predict the biodegradative outcome for 73–87% of compounds. This system is available to the community as a web server (http://www.pdg.cnb.uam.es/BDPSERVER). The application of this predictive tool to chemical species released into the environment provides an early instrument for tentatively classifying the compounds as biodegradable or recalcitrant. Automated surveys of lists of industrial chemicals currently employed in large quantities revealed that herbicides are the group of functional molecules more difficult to recycle into the biosphere through the inclusive microbial metabolism.
BDPServer; biodegradation; chemotopes; machine learning; REACH
This paper reports two studies to model the inter-relationships between protein sequence, structure and function. First, an automated pipeline to provide a structural annotation of proteomes in the major genomes is described. The results are stored in a database at Imperial College, London (3D-GENOMICS) that can be accessed at www.sbg.bio.ic.ac.uk. Analysis of the assignments to structural superfamilies provides evolutionary insights. 3D-GENOMICS is being integrated with related proteome annotation data at University College London and the European Bioinformatics Institute in a project known as e-protein (http://www.e-protein.org/). The second topic is motivated by the developments in structural genomics projects in which the structure of a protein is determined prior to knowledge of its function. We have developed a new approach PHUNCTIONER that uses the gene ontology (GO) classification to supervise the extraction of the sequence signal responsible for protein function from a structure-based sequence alignment. Using GO we can obtain profiles for a range of specificities described in the ontology. In the region of low sequence similarity (around 15%), our method is more accurate than assignment from the closest structural homologue. The method is also able to identify the specific residues associated with the function of the protein family.
bioinformatics; proteome annotation; protein function
An entire family of methodologies for predicting protein interactions is based on the observed fact that families of interacting proteins tend to have similar phylogenetic trees due to co-evolution. One application of this concept is the prediction of the mapping between the members of two interacting protein families (which protein within one family interacts with which protein within the other). The idea is that the real mapping would be the one maximizing the similarity between the trees. Since the exhaustive exploration of all possible mappings is not feasible for large families, current approaches use heuristic techniques which do not ensure the best solution to be found. This is why it is important to check the results proposed by heuristic techniques and to manually explore other solutions. Here we present TSEMA, the server for efficient mapping assessment. This system calculates an initial mapping between two families of proteins based on a Monte Carlo approach and allows the user to interactively modify it based on performance figures and/or specific biological knowledge. All the explored mappings are graphically shown over a representation of the phylogenetic trees. The system is freely available at . Standalone versions of the software behind the interface are available upon request from the authors.
The TreeDet (Tree Determinant) Server is the first release of a system designed to integrate results from methods that predict functional sites in protein families. These methods take into account the relation between sequence conservation and evolutionary importance. TreeDet fully analyses the space of protein sequences in either user-uploaded or automatically generated multiple sequence alignments. The methods implemented in the server represent three main classes of methods for the detection of family-dependent conserved positions, a tree-based method, a correlation based method and a method that employs a principal component analyses coupled to a cluster algorithm. An additional method is provided to highlight the reliability of the position in the alignments. The server is available at .
Here we introduce EVAcon, an automated web service that evaluates the performance of contact prediction servers. Currently, EVAcon is monitoring nine servers, four of which are specialized in contact prediction and five are general structure prediction servers. Results are compared for all newly determined experimental structures deposited into PDB (∼5–50 per week). EVAcon allows for a precise comparison of the results based on a system of common protein subsets and the commonly accepted evaluation criteria that are also used in the corresponding category of the CASP assessment. EVAcon is a new service added to the functionality of the EVA system for the continuous evaluation of protein structure prediction servers. The new service is accesible from any of the three EVA mirrors: PDG (CNB-CSIC, Madrid) (); CUBIC (Columbia University, NYC) (); and Sali Lab (UCSF, San Francisco) ().
EVA (http://cubic.bioc.columbia.edu/eva/) is a web server for evaluation of the accuracy of automated protein structure prediction methods. The evaluation is updated automatically each week, to cope with the large number of existing prediction servers and the constant changes in the prediction methods. EVA currently assesses servers for secondary structure prediction, contact prediction, comparative protein structure modelling and threading/fold recognition. Every day, sequences of newly available protein structures in the Protein Data Bank (PDB) are sent to the servers and their predictions are collected. The predictions are then compared to the experimental structures once a week; the results are published on the EVA web pages. Over time, EVA has accumulated prediction results for a large number of proteins, ranging from hundreds to thousands, depending on the prediction method. This large sample assures that methods are compared reliably. As a result, EVA provides useful information to developers as well as users of prediction methods.
Infectious bursal disease virus (IBDV) capsids are formed by a single protein layer containing three polypeptides, pVP2, VP2, and VP3. Here, we show that the VP3 protein synthesized in insect cells, either after expression of the complete polyprotein or from a VP3 gene construct, is proteolytically degraded, leading to the accumulation of product lacking the 13 C-terminal residues. This finding led to identification of the VP3 oligomerization domain within a 24-amino-acid stretch near the C-terminal end of the polypeptide, partially overlapping the VP1 binding domain. Inactivation of the VP3 oligomerization domain, by either proteolysis or deletion of the polyprotein gene, abolishes viruslike particle formation. Formation of VP3-VP1 complexes in cells infected with a dual recombinant baculovirus simultaneously expressing the polyprotein and VP1 prevented VP3 proteolysis and led to efficient virus-like particle formation in insect cells.