The oral cavity is a complex ecosystem where human chemical compounds coexist with a particular microbiota. However, shifts in the normal composition of this microbiota may result in the onset of oral ailments, such as periodontitis and dental caries. In addition, it is known that the microbial colonization of the oral cavity is mediated by protein-protein interactions (PPIs) between the host and microorganisms. Nevertheless, PPIs of this kind remain largely uncharacterized. To elucidate these interactions, we have created a computational prediction method that allows us to obtain a first model of the Human-Microbial oral interactome.
We collected high-quality experimental PPIs from five major human databases. The obtained PPIs were used to create our positive dataset and, indirectly, our negative dataset. The positive and negative datasets were merged and used for training and validation of a naïve Bayes classifier. For the final prediction model, we used an ensemble methodology combining five distinct PPI prediction techniques, namely: literature mining, primary protein sequences, orthologous profiles, biological process similarity, and domain interactions. Performance evaluation of our method revealed an area under the ROC curve (AUC) value greater than 0.926, supporting our primary hypothesis, as no single set of features reached an AUC greater than 0.877. After subjecting our dataset to the prediction model, the classified result was filtered for very high confidence PPIs (probability ≥ 1 − 10^-7), leading to a set of 46,579 PPIs to be further explored.
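As a rough illustration of how a naïve Bayes classifier can combine several evidence sources and how the very high confidence filter might then be applied, the sketch below multiplies per-source likelihood ratios into posterior odds. The function names, the prior, and the likelihood ratios are hypothetical and are not taken from the study; the independence of the evidence sources is the usual naïve Bayes assumption.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Combine evidence sources under the naive Bayes independence assumption.

    prior: assumed prior probability that a protein pair interacts (0 < prior < 1).
    likelihood_ratios: one ratio P(evidence | interacting) / P(evidence | not
    interacting) per evidence source (hypothetical values, all > 0).
    """
    # Work in log-odds space so that many factors do not overflow or underflow.
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    # Numerically stable logistic transform back to a probability.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def filter_high_confidence(scored_pairs, threshold=1.0 - 1e-7):
    """Keep the pairs whose posterior probability reaches the threshold."""
    return [pair for pair, probability in scored_pairs if probability >= threshold]
```

With a prior of 0.5 and likelihood ratios of 2 and 3, the posterior odds are 6, that is a probability of 6/7; a pair would need far stronger combined evidence to pass the 1 − 10^-7 cutoff.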
We believe this dataset holds not only important pathways involved in the onset of infectious oral diseases, but also potential drug targets and biomarkers. The dataset used for training and validation, the predictions obtained and the final network are available at http://bioinformatics.ua.pt/software/oralint.
Protein-protein interactions; Oral interactome; Bayesian classification
Protein-protein interaction (PPI) data sets generated by high-throughput experiments are contaminated by large numbers of erroneous PPIs. Therefore, computational methods for PPI validation are necessary to improve the quality of such data sets. Against the background of the theory that most extant PPIs arose as a consequence of gene duplication, the sensitive search for homologous PPIs, i.e. for PPIs descending from a common ancestral PPI, should be a successful strategy for PPI validation.
To validate an experimentally observed PPI, we combine FASTA and PSI-BLAST to perform a sensitive sequence-based search for pairs of interacting homologous proteins within a large, integrated PPI database. A novel scoring scheme that incorporates both quality and quantity of all observed matches allows us (1) to consider also tentative paralogs and orthologs in this analysis and (2) to combine search results from more than one homology detection method. ROC curves illustrate the high efficacy of this approach and its improvement over other homology-based validation methods.
New PPIs are primarily derived from preexisting PPIs and not invented de novo. Thus, the hallmark of true PPIs is the existence of homologous PPIs. The sensitive search for homologous PPIs within a large body of known PPIs is an efficient strategy to separate biologically relevant PPIs from the many spurious PPIs reported by high-throughput experiments.
Motivation: Protein–protein interaction (PPI) extraction from published biological articles has attracted much attention because of the importance of protein interactions in biological processes. Despite significant progress, mining PPIs from the literature still relies heavily on time- and resource-consuming manual annotation.
Results: In this study, we developed a novel methodology based on Bayesian networks (BNs) for extracting PPI triplets (a PPI triplet consists of two protein names and the corresponding interaction word) from unstructured text. The method achieved an overall accuracy of 87% on a cross-validation test using a manually annotated dataset. We also showed, by extracting PPI triplets from a large number of PubMed abstracts, that our method was able to complement human annotations and extract a large number of new PPIs from the literature.
Availability: Programs/scripts we developed/used in the study are available at http://stat.fsu.edu/~jinfeng/datasets/Bio-SI-programs-Bayesian-chowdhary-zhang-liu.zip
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: High-throughput protein interaction data, with ever-increasing volume, are becoming the foundation of many biological discoveries, and thus high-quality protein–protein interaction (PPI) maps are critical for a deeper understanding of cellular processes. However, the unreliability and paucity of currently available PPI data are key obstacles to subsequent quantitative studies. It is therefore highly desirable to develop an approach to deal with these issues from the computational perspective. Most previous works for assessing and predicting protein interactions either need supporting evidence from multiple information resources or are severely impacted by the sparseness of PPI networks.
Results: We developed a robust manifold embedding technique for assessing the reliability of interactions and predicting new interactions, which purely utilizes the topological information of PPI networks and can work on a sparse input protein interactome without requiring additional information types. After transforming a given PPI network into a low-dimensional metric space using manifold embedding based on isometric feature mapping (ISOMAP), the problem of assessing and predicting protein interactions is recast as one of measuring similarity between points in that metric space. Then a reliability index, a likelihood indicating the interaction of two proteins, is assigned to each protein pair in the PPI network based on the similarity between the points in the embedded space. Validation of the proposed method is performed with extensive experiments on densely connected and sparse PPI networks of yeast. Results demonstrate that the interactions ranked top by our method have high functional homogeneity and localization coherence; in particular, our method is very efficient for large sparse PPI networks, on which traditional algorithms fail. Therefore, the proposed algorithm is a much more promising method to detect both false positive and false negative interactions in PPI networks.
Availability: MATLAB code implementing the algorithm is available from the web site http://home.ustc.edu.cn/~yzh33108/Manifold.htm.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Most functions within the cell emerge thanks to protein–protein interactions (PPIs), yet experimental determination of PPIs is both expensive and time-consuming. PPI networks present significant levels of noise and incompleteness. Predicting interactions using only PPI-network topology (topological prediction) is difficult but essential when prior biological knowledge is absent or unreliable.
Methods: Network embedding emphasizes the relations between network proteins embedded in a low-dimensional space, in which protein pairs that are closer to each other represent good candidate interactions. To achieve network denoising, which boosts prediction performance, we first applied minimum curvilinear embedding (MCE), and then adopted shortest path (SP) in the reduced space to assign likelihood scores to candidate interactions. Furthermore, we introduce (i) a new valid variation of MCE, named non-centred MCE (ncMCE); (ii) two automatic strategies for selecting the appropriate embedding dimension; and (iii) two new randomized procedures for evaluating predictions.
Results: We compared our method against several unsupervised and supervisedly tuned embedding approaches and node neighbourhood techniques. Despite its computational simplicity, ncMCE-SP was the overall leader, outperforming the current methods in topological link prediction.
Conclusion: Minimum curvilinearity is a valuable non-linear framework that we successfully applied to the embedding of protein networks for the unsupervised prediction of novel PPIs. The rationale for our approach is that biological and evolutionary information is imprinted in the non-linear patterns hidden behind the protein network topology, and can be exploited for predicting new protein links. The predicted PPIs represent good candidates for testing in high-throughput experiments or for exploitation in systems biology tools such as those used for network-based inference and prediction of disease-related functional modules.
Supplementary data are available at Bioinformatics online.
Prediction and analysis of protein-protein interactions (PPI) and specifically types of PPIs is an important problem in life science research because of the fundamental roles of PPIs in many biological processes in living cells. In addition, electrostatic interactions are important in understanding inter-molecular interactions, since they are long-range, and because of their influence in charged molecules. This is the main motivation for using electrostatic energy for prediction of PPI types.
We propose a prediction model to analyze protein interaction types, namely obligate and non-obligate, using electrostatic energy values as properties. The prediction approach uses electrostatic energy values for pairs of atoms and amino acids present in interfaces where the interaction occurs. The main features of the complexes are found and then the prediction is performed via several state-of-the-art classification techniques, including linear dimensionality reduction (LDR), support vector machine (SVM), naive Bayes (NB) and k-nearest neighbor (k-NN). For an in-depth analysis of classification results, some other experiments were performed by varying the distance cutoffs between atom pairs of interacting chains, ranging from 5Å to 13Å. Moreover, several feature selection algorithms including gain ratio (GR), information gain (IG), chi-square (Chi2) and minimum redundancy maximum relevance (mRMR) are applied on the available datasets to obtain more discriminative pairs of atom types and amino acid types as features for prediction.
Our results on two well-known datasets of obligate and non-obligate complexes confirm that electrostatic energy is an important property for predicting obligate and non-obligate protein interaction types, with accuracies of over 98% across the experiments. Furthermore, a comparison performed by changing the distance cutoff demonstrates that the best values for prediction of PPI types using electrostatic energy range from 9Å to 12Å, which shows that electrostatic interactions are long-range and cover a broader area in the interface. In addition, the results on using feature selection before prediction confirm that (a) a few pairs of atoms and amino acids are appropriate for prediction, and (b) prediction performance can be improved by eliminating irrelevant and noisy features and selecting the most discriminative ones.
Protein-protein interactions (PPIs) play crucial roles in the execution of various cellular processes and form the basis of biological mechanisms. Although a large amount of PPI data for different species has been generated by high-throughput experimental techniques, the PPI pairs obtained with experimental methods so far cover only a fraction of the complete PPI networks, and the experimental methods for identifying PPIs are both time-consuming and expensive. Hence, it is urgent and challenging to develop automated computational methods to predict PPIs efficiently and accurately.
We present here a novel hierarchical PCA-EELM (principal component analysis-ensemble extreme learning machine) model to predict protein-protein interactions using only the information of protein sequences. In the proposed method, 11188 protein pairs retrieved from the DIP database were encoded into feature vectors by using four kinds of protein sequence information. Focusing on dimension reduction, PCA, an effective feature extraction method, was then employed to construct the most discriminative new feature set. Finally, multiple extreme learning machines were trained and then aggregated into a consensus classifier by majority voting. The ensembling of extreme learning machines removes the dependence of results on initial random weights and improves the prediction performance.
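The aggregation step described above, in which several classifiers are merged into a consensus by majority voting, can be sketched as follows. This is only a minimal illustration: the function name and the input layout (one list of predicted labels per classifier) are assumptions, and no extreme learning machine is trained here.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the consensus label for each sample.

    predictions: list with one entry per classifier; each entry is a list of
    predicted labels, and all entries have the same length (hypothetical layout).
    With an odd number of binary classifiers no ties can occur.
    """
    n_samples = len(predictions[0])
    consensus = []
    for i in range(n_samples):
        # Count the labels that the individual classifiers assigned to sample i.
        votes = Counter(model[i] for model in predictions)
        consensus.append(votes.most_common(1)[0][0])
    return consensus
```

For three classifiers predicting `[1, 0, 0]`, `[1, 1, 0]` and `[0, 0, 1]`, the consensus is `[1, 0, 0]`.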
When applied to the PPI data of Saccharomyces cerevisiae, the proposed method achieved 87.00% prediction accuracy with 86.15% sensitivity at a precision of 87.59%. Extensive experiments were performed to compare our method with the state-of-the-art support vector machine (SVM) technique. Experimental results demonstrate that the proposed PCA-EELM outperforms the SVM method under 5-fold cross-validation. In addition, PCA-EELM runs faster than the PCA-SVM based method. Consequently, the proposed approach can be considered a promising and powerful new tool for predicting PPIs with excellent performance and less time.
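For readers unfamiliar with the three measures reported above, the following sketch shows their standard definitions in terms of confusion-matrix counts. The function name and the example counts are hypothetical and do not reproduce the study's data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard accuracy, sensitivity (recall) and precision from raw counts.

    tp/fp/tn/fn: numbers of true positives, false positives, true negatives
    and false negatives (hypothetical inputs).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,    # fraction of all pairs classified correctly
        "sensitivity": tp / (tp + fn),    # fraction of true interactions recovered
        "precision": tp / (tp + fp),      # fraction of predicted interactions that are true
    }
```

For example, 8 true positives, 2 false positives, 7 true negatives and 3 false negatives give an accuracy of 0.75 and a precision of 0.8.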
Protein-protein interactions (PPIs) play important roles in various cellular processes. However, the low quality of current PPI data detected from high-throughput screening techniques has diminished the potential usefulness of the data. We need to develop a method to address the high data noise and incompleteness of PPI data, namely, to filter out inaccurate protein interactions (false positives) and predict putative protein interactions (false negatives).
In this paper, we proposed a novel two-step method to integrate diverse biological and computational sources of supporting evidence for reliable PPIs. The first step, interaction binning or InterBIN, groups PPIs together to more accurately estimate the likelihood (Bin-Confidence score) that the protein pairs interact for each biological or computational evidence source. The second step, interaction classification or InterCLASS, integrates the collected Bin-Confidence scores to build classifiers and identify reliable interactions.
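One simple way to picture the binning idea is to group protein pairs by the score they receive from a single evidence source and to estimate, for each bin, the fraction of pairs known to interact. The sketch below does exactly that with equal-width bins; it is an assumption-laden illustration rather than the InterBIN procedure itself, whose binning strategy is not detailed here, and the function name and input layout are hypothetical.

```python
def bin_confidence(scored_pairs, n_bins=5):
    """Estimate a per-bin confidence for one evidence source.

    scored_pairs: list of (score, is_true_interaction) tuples, with score
    assumed to lie in [0, 1] (hypothetical layout).
    Returns a list of length n_bins holding the fraction of true interactions
    in each equal-width score bin, or None for empty bins.
    """
    totals = [0] * n_bins
    positives = [0] * n_bins
    for score, label in scored_pairs:
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(score * n_bins), n_bins - 1)
        totals[idx] += 1
        positives[idx] += int(label)
    return [positives[i] / totals[i] if totals[i] else None for i in range(n_bins)]
```

The per-bin fractions obtained for each evidence source could then serve as input features for a downstream classifier, in the spirit of the second step described above.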
We performed comprehensive experiments on two benchmark yeast PPI datasets. The experimental results showed that our proposed method can effectively eliminate false positives in detected PPIs and identify false negatives by predicting novel yet reliable PPIs. Our proposed method also performed significantly better than merely using each of individual evidence sources, illustrating the importance of integrating various biological and computational sources of data and evidence.
Motivation: Protein–protein interactions (PPIs) are critical for virtually every biological function. Recently, researchers suggested using supervised learning for the task of classifying pairs of proteins as interacting or not. However, its performance is largely restricted by the availability of truly interacting proteins (labeled). Meanwhile, there exists a considerable number of protein pairs where an association appears between two partners, but there is not enough experimental evidence to support it as a direct interaction (partially labeled).
Results: We propose a semi-supervised multi-task framework for predicting PPIs from not only labeled, but also partially labeled reference sets. The basic idea is to perform multi-task learning on a supervised classification task and a semi-supervised auxiliary task. The supervised classifier trains a multi-layer perceptron network for PPI predictions from labeled examples. The semi-supervised auxiliary task shares network layers of the supervised classifier and trains with partially labeled examples. Semi-supervision could be utilized in multiple ways. We tried three approaches in this article: (i) classification (to distinguish partial positives from negatives); (ii) ranking (to rate partial positives as more likely than negatives); (iii) embedding (to make data clusters get similar labels). We applied this framework to improve the identification of interacting pairs between HIV-1 and human proteins. Our method improved upon the state-of-the-art method for this task, indicating the benefits of semi-supervised multi-task learning using auxiliary information.
Protein-protein interaction (PPI) is essential to most biological processes. Abnormal interactions may have implications in a number of neurological syndromes. Given that the association and dissociation of protein molecules is crucial, computational tools capable of effectively identifying PPI are desirable. In this paper, we propose a simple yet effective method to detect PPI based on pairwise similarity and using only the primary structure of the protein. The PPI based on Pairwise Similarity (PPI-PS) method consists of a representation of each protein sequence by a vector of pairwise similarities against large subsequences of amino acids created by a shifting window which passes over concatenated protein training sequences. Each coordinate of this vector is typically the E-value of the Smith-Waterman score. These vectors are then used to compute the kernel matrix which will be exploited in conjunction with support vector machines.
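The representation described above can be sketched as follows: a window slides over the concatenated training sequences, and each protein is described by its local-alignment score against every window. Note the assumptions: the sketch uses raw Smith-Waterman scores with a simple match/mismatch scheme and a linear gap penalty, whereas the text refers to E-values; the function names, scoring parameters and window settings are all hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are never allowed to drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity_vector(sequence, training_sequences, window=4, step=2):
    """Score a sequence against windows shifted over the concatenated training set."""
    concatenated = "".join(training_sequences)
    windows = [
        concatenated[start:start + window]
        for start in range(0, len(concatenated) - window + 1, step)
    ]
    return [smith_waterman_score(sequence, w) for w in windows]
```

The resulting vectors, one per protein, are the kind of input from which a kernel matrix for a support vector machine could be computed.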
To assess the ability of the proposed method to recognize the difference between "interacted" and "non-interacted" protein pairs, we applied it to different datasets from the available yeast Saccharomyces cerevisiae protein interaction data. The proposed method achieved reasonable improvement over the existing state-of-the-art methods for PPI prediction.
Pairwise similarity score provides a relevant measure of similarity between protein sequences. This similarity incorporates biological knowledge about proteins and it is extremely powerful when combined with support vector machine to predict PPI.
Motivation: Eukaryotic proteins are highly modular, containing multiple interaction interfaces that mediate binding to a network of regulators and effectors. Recent advances in high-throughput proteomics have rapidly expanded the number of known protein–protein interactions (PPIs); however, the molecular basis for the majority of these interactions remains to be elucidated. There has been a growing appreciation of the importance of a subset of these PPIs, namely those mediated by short linear motifs (SLiMs), particularly the canonical and ubiquitous SH2, SH3 and PDZ domain-binding motifs. However, these motif classes represent only a small fraction of known SLiMs and outside these examples little effort has been made, either bioinformatically or experimentally, to discover the full complement of motif instances.
Results: In this article, interaction data are analysed to identify and characterize an important subset of PPIs, those involving SLiMs binding to globular domains. To do this, we introduce iELM, a method to identify interactions mediated by SLiMs and add molecular details of the interaction interfaces to both interacting proteins. The method identifies SLiM-mediated interfaces from PPI data by searching for known SLiM–domain pairs. This approach was applied to the human interactome to identify a set of high-confidence putative SLiM-mediated PPIs.
Availability: iELM is freely available at http://elmint.embl.de
Supplementary data are available at Bioinformatics online.
Motivation: Protein–protein interactions (PPIs) are a promising, but challenging target for pharmaceutical intervention. One approach for addressing these difficult targets is the rational design of small-molecule inhibitors that mimic the chemical and physical properties of small clusters of key residues at the protein–protein interface. The identification of appropriate clusters of interface residues provides starting points for inhibitor design and supports an overall assessment of the susceptibility of PPIs to small-molecule inhibition.
Results: We extract Small-Molecule Inhibitor Starting Points (SMISPs) from protein-ligand and protein–protein complexes in the Protein Data Bank (PDB). These SMISPs are used to train two distinct classifiers, a support vector machine and an easy to interpret exhaustive rule classifier. Both classifiers achieve better than 70% leave-one-complex-out cross-validation accuracy and correctly predict SMISPs of known PPI inhibitors not in the training set. A PDB-wide analysis suggests that nearly half of all PPIs may be susceptible to small-molecule inhibition.
Supplementary data are available at Bioinformatics online.
Despite the availability of a large number of protein–protein interactions (PPIs) in several species, researchers are often limited to using very small subsets in a few organisms due to the high prevalence of spurious interactions. In spite of the importance of quality assessment of experimentally determined PPIs, a surprisingly small number of databases provide interactions with scores and confidence levels. We introduce HitPredict (http://hintdb.hgc.jp/htp/), a database with quality assessed PPIs in nine species. HitPredict assigns a confidence level to interactions based on a reliability score that is computed using evidence from sequence, structure and functional annotations of the interacting proteins. HitPredict was first released in 2005 and is updated annually. The current release contains 36 930 proteins with 176 983 non-redundant, physical interactions, of which 116 198 (66%) are predicted to be of high confidence.
Protein-protein interactions (PPIs) play crucial roles in virtually every aspect of cellular function within an organism. Over the last decade, the development of novel high-throughput techniques has resulted in enormous amounts of data and provided valuable resources for studying protein interactions. However, these high-throughput protein interaction data are often associated with high false positive and false negative rates. It is therefore highly desirable to develop scalable methods to identify these errors from the computational perspective.
We have developed a robust computational technique for assessing the reliability of interactions and predicting new interactions by combining manifold embedding with multiple information integration. Validation of the proposed method was performed with extensive experiments on densely-connected and sparse PPI networks of yeast respectively. Results demonstrate that the interactions ranked top by our method have high functional homogeneity and localization coherence.
Our proposed method achieves better performance than existing methods in both assessing and predicting protein interactions. Furthermore, our method is general enough to work over a variety of PPI networks, whether densely connected or sparse. Therefore, the proposed algorithm is a much more promising method to detect both false positive and false negative interactions in PPI networks.
Experimentally verified protein-protein interactions (PPI) cannot be easily retrieved by researchers unless they are stored in PPI databases. The curation of such databases can be made faster by ranking newly-published articles' relevance to PPI, a task which we approach here by designing a machine-learning-based PPI classifier. All classifiers require labeled data, and the more labeled data available, the more reliable they become. Although many PPI databases with large numbers of labeled articles are available, incorporating these databases into the base training data may actually reduce classification performance since the supplementary databases may not annotate exactly the same PPI types as the base training data. Our first goal in this paper is to find a method of selecting likely positive data from such supplementary databases. Only extracting likely positive data, however, will bias the classification model unless sufficient negative data is also added. Unfortunately, negative data is very hard to obtain because there are no resources that compile such information. Therefore, our second aim is to select such negative data from unlabeled PubMed data. Thirdly, we explore how to exploit these likely positive and negative data. And lastly, we look at the somewhat unrelated question of which term-weighting scheme is most effective for identifying PPI-related articles.
To evaluate the performance of our PPI text classifier, we conducted experiments based on the BioCreAtIvE-II IAS dataset. Our results show that adding likely-labeled data generally increases AUC by 3~6%, indicating better ranking ability. Our experiments also show that our newly-proposed term-weighting scheme has the highest AUC among all common weighting schemes. Our final model achieves an F-measure and AUC 2.9% and 5.0% higher than those of the top-ranking system in the IAS challenge.
Our experiments demonstrate the effectiveness of integrating unlabeled and likely labeled data to augment a PPI text classification system. Our mixed model is suitable for ranking purposes whereas our hierarchical model is better for filtering. In addition, our results indicate that supervised weighting schemes outperform unsupervised ones. Our newly-proposed weighting scheme, TFBRF, which considers documents that do not contain the target word, avoids some of the biases found in traditional weighting schemes. Our experiment results show TFBRF to be the most effective among several other top weighting schemes.
The rapid growth of protein-protein interaction (PPI) data has led to the emergence of PPI network analysis. Despite advances in high-throughput techniques, the interactomes of several model organisms are still far from complete. Therefore, it is desirable to expand these interactomes with ortholog-based and other methods.
Orthologous pairs of 18 eukaryotic species were expanded and merged with experimental PPI datasets. The contributions of interologs from each species were evaluated. The expanded orthologous pairs enable the inference of interologs for various species. For example, more than 32,000 human interactions can be predicted. The same dataset has also been applied to the prediction of host-pathogen interactions. PPIs between P. falciparum calmodulin and several H. sapiens proteins are predicted, and these interactions may contribute to the maintenance of host cell Ca2+ concentration. Using comparisons with Bayesian and structure-based approaches, interactions between putative HSP40 homologs of P. falciparum and the H. sapiens TNF receptor associated factor family are revealed, suggesting a role for these interactions in the interference of the human immune response to P. falciparum.
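The interolog principle used above, namely that an interaction observed in one species is transferred to the orthologs of both partners in another species, can be illustrated with a short sketch. The function name, the protein identifiers and the ortholog mapping layout are hypothetical, and no ortholog quality filtering is applied.

```python
def infer_interologs(source_ppis, orthologs):
    """Transfer interactions from a source species to a target species.

    source_ppis: iterable of (protein_a, protein_b) pairs known to interact
    in the source species.
    orthologs: dict mapping a source protein to a list of its orthologs in
    the target species (hypothetical layout).
    Returns a set of predicted target-species pairs, stored in sorted order
    so that (x, y) and (y, x) are treated as the same interaction.
    """
    predicted = set()
    for a, b in source_ppis:
        for a_target in orthologs.get(a, ()):
            for b_target in orthologs.get(b, ()):
                predicted.add(tuple(sorted((a_target, b_target))))
    return predicted
```

A source interaction contributes no prediction when either partner lacks an ortholog in the target species, which is one reason why the contribution of each species differs.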
The PPI datasets are available from POINT and POINeT. Further development of methods to predict host-pathogen interactions should incorporate multiple approaches in order to improve sensitivity, and should facilitate the identification of targets for drug discovery and design.
As an increasing number of reliable protein–protein interactions (PPIs) become available and high-throughput experimental methods provide systematic identification of PPIs, there is a growing need for fast and accurate methods for discovering homologous PPIs of a newly determined PPI. PPISearch is a web server that rapidly identifies homologous PPIs (called a PPI family) and infers transferability of interacting domains and functions of a query protein pair. This server first identifies the two homologous families of the query proteins by using BLASTP to scan an annotated PPI database (290 137 PPIs in 576 species), which is a collection of five public databases. We determined homologous PPIs from protein pairs of homologous families when these protein pairs were in the annotated database and had significant joint sequence similarity (E ≤ 10^-40) with the query. Using these homologous PPIs across multiple species, this server infers the conserved domain–domain pairs (Pfam and InterPro domains) and function pairs (Gene Ontology annotations). Our results demonstrate that the transferability of conserved domain–domain pairs between homologous PPIs and query pairs is 88% using 103 762 PPI queries, and the transferability of conserved function pairs is 69% based on 106 997 PPI queries. The PPISearch server should be useful for searching homologous PPIs and PPI families across multiple species. The PPISearch server is available through the website at http://gemdock.life.nctu.edu.tw/ppisearch/.
Protein-protein interaction (PPI) plays essential roles in cellular functions. The cost, time and other limitations associated with the current experimental methods have motivated the development of computational methods for predicting PPIs. As protein interactions generally occur via domains instead of the whole molecules, predicting domain-domain interaction (DDI) is an important step toward PPI prediction. Computational methods developed so far have utilized information from various sources at different levels, from primary sequences, to molecular structures, to evolutionary profiles.
In this paper, we propose a computational method to predict DDI using support vector machines (SVMs), based on domains represented as interaction profile hidden Markov models (ipHMM) where interacting residues in domains are explicitly modeled according to the three dimensional structural information available at the Protein Data Bank (PDB). Features about the domains are extracted first as the Fisher scores derived from the ipHMM and then selected using singular value decomposition (SVD). Domain pairs are represented by concatenating their selected feature vectors, and classified by a support vector machine trained on these feature vectors. The method is tested by leave-one-out cross validation experiments with a set of interacting protein pairs adopted from the 3DID database. The prediction accuracy has shown significant improvement as compared to InterPreTS (Interaction Prediction through Tertiary Structure), an existing method for PPI prediction that also uses the sequences and complexes of known 3D structure.
We show that domain-domain interaction prediction can be significantly enhanced by exploiting information inherent in the domain profiles via feature selection based on Fisher scores, singular value decomposition and supervised learning based on support vector machines. Datasets and source code are freely available on the web at http://liao.cis.udel.edu/pub/svdsvm. Implemented in Matlab and supported on Linux and MS Windows.
As numerous experimental factors drive the acquisition, identification, and interpretation of protein-protein interactions (PPIs), aggregated assemblies of human PPI data invariably contain experiment-dependent noise. Ascertaining the reliability of PPIs collected from these diverse studies and scoring them to infer high-confidence networks is a non-trivial task. Moreover, a large number of PPIs share the same number of reported occurrences, making it impossible to distinguish the reliability of these PPIs and rank-order them. For example, for the data analyzed here, we found that the majority (>83%) of currently available human PPIs have been reported only once.
In this work, we proposed an unsupervised statistical approach to score a set of diverse, experimentally identified PPIs from nine primary databases to create subsets of high-confidence human PPI networks. We evaluated this ranking method by comparing it with other methods and assessing their ability to retrieve protein associations from a number of diverse and independent reference sets. These reference sets contain known biological data that are either directly or indirectly linked to interactions between proteins. We quantified the average effect of using ranked protein interaction data to retrieve this information and showed that, when compared to randomly ranked interaction data sets, the proposed method created a larger enrichment (~134%) than either ranking based on the hypergeometric test (~109%) or occurrence ranking (~46%).
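One of the comparison methods mentioned above ranks interactions with the hypergeometric test. As a generic illustration only, the sketch below computes the upper-tail hypergeometric probability of observing two proteins together at least k times; how the counts are defined in the actual evaluation is not stated here, so the function name and the parameterisation (total observations, per-protein counts, co-occurrence count) are assumptions.

```python
from math import comb


def hypergeom_upper_tail(k, n_total, n_a, n_b):
    """P(X >= k) for a hypergeometric variable X.

    n_total: total number of observations (hypothetical: e.g. experiments).
    n_a: observations that contain protein A.
    n_b: observations that contain protein B.
    k: observations that contain both A and B.
    A small value suggests that the co-occurrence is unlikely to be random.
    """
    upper = min(n_a, n_b)
    denominator = comb(n_total, n_b)
    numerator = sum(
        comb(n_a, i) * comb(n_total - n_a, n_b - i) for i in range(k, upper + 1)
    )
    return numerator / denominator
```

Pairs could then be ranked by increasing tail probability, so that the least likely chance co-occurrences come first.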
From our evaluations, it was clear that ranked interactions were always of value because higher-ranked PPIs had a higher likelihood of retrieving high-confidence experimental data. Reducing the noise inherent in aggregated experimental PPIs via our ranking scheme further increased the accuracy and enrichment of PPIs derived from a number of biologically relevant data sets. These results suggest that using our high-confidence protein interactions at different levels of confidence will help clarify the topological and biological properties associated with human protein networks.
High confidence; Human protein interaction network; Protein-protein interactions
Many crucial cellular operations, such as metabolism, signalling, and regulation, are based on protein-protein interactions. However, the lack of robust protein-protein interaction information is a challenge. One reason for this lack is poor agreement between experimental findings and computationally predicted sets, which in turn stems from the large number of false positive predictions produced by computational approaches. Reducing false positive predictions and enhancing the true positive fraction of computationally predicted protein-protein interaction datasets on the basis of highly confident experimental results has not been adequately investigated.
Gene Ontology (GO) annotations were used to reduce false positive protein-protein interaction (PPI) pairs resulting from computational predictions. Using experimentally obtained PPI pairs as a training dataset, eight top-ranking keywords were extracted from GO molecular function annotations. The sensitivity of these keywords is 64.21% in the yeast experimental dataset and 80.83% in the worm experimental dataset. The specificities, a measure of recovery power, of these keywords applied to four predicted PPI datasets for each studied organism are 48.32% and 46.49% (averaged over the four datasets) in yeast and worm, respectively. Based on the eight top-ranking keywords and the co-localization of interacting proteins, a set of two knowledge rules was deduced and applied to remove false positive protein pairs. The 'strength', a measure of the improvement provided by the rules, was defined based on the signal-to-noise ratio and used to measure the applicability of the knowledge rules to the predicted PPI datasets. Depending on the PPI-predicting method employed, the strength is between two and ten times that obtained by randomly removing protein pairs from the datasets.
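A keyword-based filter of this kind, and the sensitivity measured on experimentally known pairs, might be sketched as follows. The rule used here (both proteins must carry at least one common keyword in their molecular function annotations) is an assumption made for illustration and may differ from the rules actually deduced; the keywords, protein identifiers and annotations are hypothetical.

```python
def passes_keyword_rule(pair, annotations, keywords):
    """Assumed rule: keep a pair if both proteins share at least one keyword."""
    a, b = pair
    text_a = " ".join(annotations.get(a, [])).lower()
    text_b = " ".join(annotations.get(b, [])).lower()
    return any(k in text_a and k in text_b for k in keywords)

def sensitivity(known_pairs, annotations, keywords):
    """Fraction of experimentally known pairs that the rule retains."""
    hits = sum(passes_keyword_rule(p, annotations, keywords) for p in known_pairs)
    return hits / len(known_pairs)

# Hypothetical keywords and GO molecular function annotations.
keywords = ["binding", "kinase"]
annotations = {
    "P1": ["ATP binding"],
    "P2": ["protein binding"],
    "P3": ["kinase activity"],
    "P4": ["transporter activity"],
    "P5": ["protein kinase activity"],
}
known_pairs = [("P1", "P2"), ("P2", "P4"), ("P1", "P3"), ("P3", "P5")]

toy_sensitivity = sensitivity(known_pairs, annotations, keywords)
```

The same predicate applied to a predicted dataset would discard every pair that fails the rule, which is the false positive removal step; the co-localization rule would be a second predicate of the same shape.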
Gene Ontology annotations, along with the deduced knowledge rules, can be used to partially remove falsely predicted PPI pairs. Removing false positives from predicted datasets increases their true positive fractions, improves the robustness of the predicted pairs compared with random protein pairing, and eventually results in better overlap with experimental results.
Protein–protein interaction (PPI) maps provide insight into cellular biology and have received considerable attention in the post-genomic era. While large-scale experimental approaches have generated large collections of experimentally determined PPIs, technical limitations preclude certain PPIs from detection. Recently, we demonstrated that yeast PPIs can be computationally predicted using re-occurring short polypeptide sequences between known interacting protein pairs. However, the computational requirements and low specificity made this method unsuitable for large-scale investigations. Here, we report an improved approach, which exhibits a specificity of ∼99.95% and executes 16 000 times faster. Importantly, we report the first all-to-all sequence-based computational screen of PPIs in yeast, Saccharomyces cerevisiae, in which we identify 29 589 high confidence interactions of ∼2 × 10⁷ possible pairs. Of these, 14 438 PPIs have not been previously reported and may represent novel interactions. In particular, these results reveal a richer set of membrane protein interactions, not readily amenable to experimental investigations. From the novel PPIs, a novel putative protein complex comprised largely of membrane proteins was revealed. In addition, two novel gene functions were predicted and experimentally confirmed to affect the efficiency of non-homologous end-joining, providing further support for the usefulness of the identified PPIs in biological investigations.
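The scale of an all-to-all screen follows from simple counting, as the short calculation below shows. The proteome size of 6,300 proteins is an assumption chosen for illustration and is not stated in the text; the counts 29 589 and 14 438 are the figures reported above.

```python
from math import comb

# Assumption: roughly 6,300 yeast proteins (illustrative value, not from the text).
n_proteins = 6300

# Number of unordered pairs in an all-to-all screen: n * (n - 1) / 2.
possible_pairs = comb(n_proteins, 2)

# Figures reported in the abstract.
predicted = 29589
novel = 14438

fraction_predicted = predicted / possible_pairs  # share of all pairs called as interacting
fraction_novel = novel / predicted               # share of predictions not previously reported
```

Under this assumption the screen covers about 1.98 × 10⁷ pairs, of which roughly 0.15% are predicted to interact, and a little under half of those predictions are novel.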
Motivation: Identification and characterization of protein–protein interactions (PPIs) is one of the key aims in biological research. While previous research in text mining has made substantial progress in automatic PPI detection from literature, the need to improve the precision and recall of the process remains. More accurate PPI detection will also improve the ability to extract experimental data related to PPIs and provide multiple lines of evidence for each interaction.
Results: We developed an interaction detection method and explored the usefulness of various features in automatically identifying PPIs in text. The results show that our approach outperforms other systems using the AImed dataset. In the tests where our system achieves better precision with reduced recall, we discuss possible approaches for improvement. In addition to test datasets, we evaluated the performance on interactions from five human-curated databases—BIND, DIP, HPRD, IntAct and MINT—where our system consistently identified evidence for ∼60% of interactions when both proteins appear in at least one sentence in the PubMed abstract. We then applied the system to extract articles from PubMed to annotate known, high-throughput and interologous interactions in I2D.
Availability: The data and software are available at: http://www.cs.utoronto.ca/~juris/data/BI09/.
Supplementary information: Supplementary data are available at Bioinformatics online.
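The precondition mentioned in the Results, that both proteins appear in at least one sentence of an abstract, can be checked with a short sketch. The sentence splitter is deliberately naive, protein names are matched as case-insensitive substrings, and the abstract text and function names are hypothetical; a real system would need proper name normalization.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def co_occur_in_sentence(text, protein_a, protein_b):
    """True if both protein names appear together in at least one sentence."""
    a, b = protein_a.lower(), protein_b.lower()
    return any(a in s.lower() and b in s.lower() for s in split_sentences(text))

# Hypothetical abstract text.
abstract = "BRCA1 binds BARD1 in vivo. TP53 was not examined in this study."

same_sentence = co_occur_in_sentence(abstract, "BRCA1", "BARD1")
different_sentences = co_occur_in_sentence(abstract, "BRCA1", "TP53")
```

Co-occurrence only selects candidate sentences; deciding whether a sentence actually asserts an interaction is the job of the detection method itself.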
Complexes of physically interacting proteins are one of the fundamental functional units responsible for driving key biological mechanisms within the cell. With the advent of high-throughput techniques, a significant amount of protein interaction (PPI) data has been catalogued for organisms such as yeast, which has in turn fueled computational methods for the systematic identification and study of protein complexes. However, many complexes are dynamic entities - their subunits are known to assemble at a particular cellular space and time to perform a particular function and disassemble after that - and while current computational analyses have concentrated on studying the dynamics of individual proteins or pairs of proteins in PPI networks, a crucial aspect that has been overlooked is the dynamics of whole complex formations. In this work, using yeast as our model, we incorporate 'time' in the form of cell-cycle phases into the prediction of complexes from PPI networks and study the temporal phenomena of complex assembly and disassembly across phases. We hypothesize that 'staticness' (constitutive expression) of proteins might be related to their temporal "reusability" across complexes, and test this hypothesis using complexes predicted from large-scale PPI networks across the yeast cell cycle phases. Our results hint towards a biological design principle underlying cellular mechanisms - cells maintain generic proteins as 'static' to enable their "reusability" across multiple temporal complexes. We also demonstrate that these findings provide additional support and alternative explanations to findings from existing works on the dynamics in PPI networks.
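One simple way to quantify "reusability" is to count the phase-specific complexes in which each protein takes part. The sketch below does this under that assumed definition; the phases, complexes, protein identifiers and the set of static proteins are all hypothetical toy data.

```python
from collections import defaultdict

def reusability(complexes_by_phase):
    """Count, for each protein, the phase-specific complexes it belongs to."""
    counts = defaultdict(int)
    for complexes in complexes_by_phase.values():
        for members in complexes:
            for protein in set(members):
                counts[protein] += 1
    return dict(counts)

def mean_reusability(counts, proteins):
    """Average reusability over a group of proteins (0 for proteins in no complex)."""
    return sum(counts.get(p, 0) for p in proteins) / len(proteins)

# Hypothetical complexes predicted per cell-cycle phase.
complexes_by_phase = {
    "G1": [{"P1", "P2"}, {"P1", "P3"}],
    "S": [{"P1", "P4"}],
    "G2/M": [{"P5", "P6"}],
}
static_proteins = ["P1"]  # assumed constitutively expressed
dynamic_proteins = ["P2", "P3", "P4", "P5", "P6"]

counts = reusability(complexes_by_phase)
static_mean = mean_reusability(counts, static_proteins)
dynamic_mean = mean_reusability(counts, dynamic_proteins)
```

In this toy example the static protein is reused in three complexes while each dynamic protein appears in one, which is the kind of contrast the hypothesis predicts.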
Protein-protein interactions are key to many biological processes. Computational methodologies devised to predict protein-protein interaction (PPI) sites on protein surfaces are important tools in providing insights into the biological functions of proteins and in developing therapeutics targeting the protein-protein interaction sites. One of the general features of PPI sites is that the core regions from the two interacting protein surfaces are complementary to each other, similar to the interior of proteins in packing density and in the physicochemical nature of the amino acid composition. In this work, we simulated the physicochemical complementarities by constructing three-dimensional probability density maps of non-covalent interacting atoms on the protein surfaces. The interacting probabilities were derived from the interior of known structures. Machine learning algorithms were applied to learn the characteristic patterns of the probability density maps specific to the PPI sites. The trained predictors for PPI sites were cross-validated with the training cases (consisting of 432 proteins) and were tested on an independent dataset (consisting of 142 proteins). The residue-based Matthews correlation coefficient for the independent test set was 0.423; the accuracy, precision, sensitivity, and specificity were 0.753, 0.519, 0.677, and 0.779, respectively. The benchmark results indicate that the optimized machine learning models are among the best predictors in identifying PPI sites on protein surfaces. In particular, the PPI site prediction accuracy increases with increasing size of the PPI site and with increasing hydrophobicity in amino acid composition of the PPI interface; the core interface regions are more likely to be recognized with high prediction confidence.
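The residue-based measures quoted above all derive from a single confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical toy values chosen for easy checking and are not the study's data, and the function name is illustrative.

```python
from math import sqrt

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),      # predicted interface residues that are correct
        "sensitivity": tp / (tp + fn),    # true interface residues that are recovered
        "specificity": tn / (tn + fp),    # non-interface residues correctly rejected
        # Matthews correlation coefficient; defined as 0 when a margin is empty.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical residue counts (toy values).
metrics = binary_metrics(tp=50, fp=10, tn=30, fn=10)
```

Because interface residues are usually a minority of surface residues, the Matthews correlation coefficient is a useful complement to accuracy: it stays low when a predictor simply favours the majority class.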
The results indicate that the physicochemical complementarity patterns on protein surfaces are important determinants in PPIs, and a substantial portion of the PPI sites can be predicted correctly with the physicochemical complementarity features based on the non-covalent interaction data derived from protein interiors.
We have developed a method that predicts protein-protein interactions (PPIs) based on the similarity of the context in which proteins appear in the literature. This method outperforms previously developed PPI prediction algorithms that rely on the conjunction of two protein names in MEDLINE abstracts. We show significant increases in coverage (76% versus 32%) and sensitivity (66% versus 41% at a specificity of 95%) for the prediction of PPIs currently archived in six PPI databases. A retrospective analysis shows that PPIs can efficiently be predicted before they enter PPI databases and before their interaction is explicitly described in the literature. The practical value of the method for discovery of novel PPIs is illustrated by the experimental confirmation of the inferred physical interaction between CAPN3 and PARVB, which was based on frequent co-occurrence of both proteins with concepts like Z-disc, dysferlin, and alpha-actinin. The relationships between proteins predicted by our method are broader than PPIs, and include proteins in the same complex or pathway. Depending on the type of relationships deemed useful, the precision of our method can be as high as 90%. The full set of predicted interactions is available in a downloadable matrix and through the webtool Nermal, which lists the most likely interaction partners for a given protein. Our framework can be used for prioritizing potential interaction partners, hitherto undiscovered, for follow-up studies and to aid the generation of accurate protein interaction maps.
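Context similarity of this kind can be illustrated by comparing the concept profiles of two proteins. The sketch below uses cosine similarity, which is an assumption and not necessarily the measure used by the method; the concept weights are invented for illustration, and only the protein and concept names CAPN3, PARVB, Z-disc, dysferlin and alpha-actinin come from the text.

```python
from math import sqrt

def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse concept profiles (concept -> weight)."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = sqrt(sum(w * w for w in profile_a.values()))
    norm_b = sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical concept profiles; all weights are made up.
capn3 = {"Z-disc": 1.0, "dysferlin": 1.0, "alpha-actinin": 1.0, "proteolysis": 1.0}
parvb = {"Z-disc": 1.0, "dysferlin": 1.0, "alpha-actinin": 1.0, "focal adhesion": 1.0}
unrelated = {"ribosome": 1.0}

similar_score = cosine_similarity(capn3, parvb)
unrelated_score = cosine_similarity(capn3, unrelated)
```

Two proteins can score highly without ever being named in the same abstract, because the similarity depends only on the concepts they share; this is what allows a prediction to precede an explicit description of the interaction.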