Motivation: Tandem mass spectrometry (MS/MS) offers fast and reliable characterization of complex protein mixtures, but suffers from low sensitivity in protein identification. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other information available, e.g. the probability of a protein's presence is likely to correlate with its mRNA concentration.
Results: We develop a Bayesian score that estimates the posterior probability of a protein's presence in the sample given its identification in an MS/MS experiment and its mRNA concentration measured under similar experimental conditions. Our method, MSpresso, substantially increases the number of proteins identified in an MS/MS experiment at the same error rate, e.g. in yeast, MSpresso increases the number of proteins identified by ∼40%. We apply MSpresso to data from different MS/MS instruments, experimental conditions and organisms (Escherichia coli, human), and predict 19–63% more proteins across the different datasets. MSpresso demonstrates that incorporating prior knowledge of protein presence into shotgun proteomics experiments can substantially improve protein identification scores.
Availability and Implementation: Software is available upon request from the authors. Mass spectrometry datasets and supplementary information are available from http://www.marcottelab.org/MSpresso/.
Contact: firstname.lastname@example.org; email@example.com
Supplementary Information: Supplementary data website: http://www.marcottelab.org/MSpresso/.
M. tuberculosis is a formidable bacterial pathogen. There is thus an increasing demand on understanding the function and relationship of proteins in various strains of M. tuberculosis. Protein-protein interactions (PPIs) data are crucial for this kind of knowledge. However, the quality of the main available M. tuberculosis PPI datasets is unclear. This hampers the effectiveness of research works that rely on these PPI datasets. Here, we analyze the two main available M. tuberculosis H37Rv PPI datasets. The first dataset is the high-throughput B2H PPI dataset from Wang et al’s recent paper in Journal of Proteome Research. The second dataset is from STRING database, version 8.3, comprising entirely of H37Rv PPIs predicted using various methods. We find that these two datasets have a surprisingly low level of agreement. We postulate the following causes for this low level of agreement: (i) the H37Rv B2H PPI dataset is of low quality; (ii) the H37Rv STRING PPI dataset is of low quality; and/or (iii) the H37Rv STRING PPIs are predictions of other forms of functional associations rather than direct physical interactions.
To test the quality of these two datasets, we evaluate them based on correlated gene expression profiles, coherent informative GO term annotations, and conservation in other organisms. We observe a significantly greater portion of PPIs in the H37Rv STRING PPI dataset (with score ≥ 770) having correlated gene expression profiles and coherent informative GO term annotations in both interaction partners than that in the H37Rv B2H PPI dataset. Predicted H37Rv interologs derived from non-M. tuberculosis experimental PPIs are much more similar to the H37Rv STRING functional associations dataset (with score ≥ 770) than the H37Rv B2H PPI dataset. H37Rv predicted physical interologs from IntAct also show extremely low similarity with the H37Rv B2H PPI dataset; and this similarity level is much lower than that between the S. aureus MRSA252 predicted physical interologs from IntAct and S. aureus MRSA252 pull-down PPIs. Comparative analysis with several representative two-hybrid PPI datasets in other species further confirms that the H37Rv B2H PPI dataset is of low quality. Next, to test the possibility that the H37Rv STRING PPIs are not purely direct physical interactions, we compare M. tuberculosis H37Rv protein pairs that catalyze adjacent steps in enzymatic reactions to B2H PPIs and predicted PPIs in STRING, which shows it has much lower similarities with the B2H PPIs than with STRING PPIs. This result strongly suggests that the H37Rv STRING PPIs more likely correspond to indirect relationships between protein pairs than to B2H PPIs. For more precise support, we turn to S. cerevisiae for its comprehensively studied interactome. We compare S. cerevisiae predicted PPIs in STRING to three independent protein relationship datasets which respectively comprise PPIs reported in Y2H assays, protein pairs reported to be in the same protein complexes, and protein pairs that catalyze successive reaction steps in enzymatic reactions. Our analysis reveals that S. cerevisiae predicted STRING PPIs have much higher similarity to the latter two types of protein pairs than to two-hybrid PPIs. As H37Rv STRING PPIs are predicted using similar methods as S. cerevisiae predicted STRING PPIs, this suggests that these H37Rv STRING PPIs are more likely to correspond to the latter two types of protein pairs rather than to two-hybrid PPIs as well.
The H37Rv B2H PPI dataset has low quality. It should not be used as the gold standard to assess the quality of other (possibly predicted) H37Rv PPI datasets. The H37Rv STRING PPI dataset also has low quality; nevertheless, a subset consisting of STRING PPIs with score ≥770 has satisfactory quality. However, these STRING “PPIs” should be interpreted as functional associations, which include a substantial portion of indirect protein interactions, rather than direct physical interactions. These two factors cause the strikingly low similarity between these two main H37Rv PPI datasets. The results and conclusions from this comparative analysis provide valuable guidance in using these M. tuberculosis H37Rv PPI datasets in subsequent studies for a wide range of purposes.
Motivation: High-throughput protein interaction data, with ever-increasing volume, are becoming the foundation of many biological discoveries, and thus high-quality protein–protein interaction (PPI) maps are critical for a deeper understanding of cellular processes. However, the unreliability and paucity of current available PPI data are key obstacles to the subsequent quantitative studies. It is therefore highly desirable to develop an approach to deal with these issues from the computational perspective. Most previous works for assessing and predicting protein interactions either need supporting evidences from multiple information resources or are severely impacted by the sparseness of PPI networks.
Results: We developed a robust manifold embedding technique for assessing the reliability of interactions and predicting new interactions, which purely utilizes the topological information of PPI networks and can work on a sparse input protein interactome without requiring additional information types. After transforming a given PPI network into a low-dimensional metric space using manifold embedding based on isometric feature mapping (ISOMAP), the problem of assessing and predicting protein interactions is recasted into the form of measuring similarity between points of its metric space. Then a reliability index, a likelihood indicating the interaction of two proteins, is assigned to each protein pair in the PPI networks based on the similarity between the points in the embedded space. Validation of the proposed method is performed with extensive experiments on densely connected and sparse PPI network of yeast, respectively. Results demonstrate that the interactions ranked top by our method have high-functional homogeneity and localization coherence, especially our method is very efficient for large sparse PPI network with which the traditional algorithms fail. Therefore, the proposed algorithm is a much more promising method to detect both false positive and false negative interactions in PPI networks.
Availability: MATLAB code implementing the algorithm is available from the web site http://home.ustc.edu.cn/∼yzh33108/Manifold.htm.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: High-throughput protein identification experiments based on tandem mass spectrometry (MS/MS) often suffer from low sensitivity and low-confidence protein identifications. In a typical shotgun proteomics experiment, it is assumed that all proteins are equally likely to be present. However, there is often other evidence to suggest that a protein is present and confidence in individual protein identification can be updated accordingly.
Results: We develop a method that analyzes MS/MS experiments in the larger context of the biological processes active in a cell. Our method, MSNet, improves protein identification in shotgun proteomics experiments by considering information on functional associations from a gene functional network. MSNet substantially increases the number of proteins identified in the sample at a given error rate. We identify 8–29% more proteins than the original MS experiment when applied to yeast grown in different experimental conditions analyzed on different MS/MS instruments, and 37% more proteins in a human sample. We validate up to 94% of our identifications in yeast by presence in ground-truth reference sets.
Availability and Implementation: Software and datasets are available at http://aug.csres.utexas.edu/msnet
Contact: firstname.lastname@example.org, email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Identification of protein-protein interactions (PPIs) is essential for a better understanding of biological processes, pathways and functions. However, experimental identification of the complete set of PPIs in a cell/organism (“an interactome”) is still a difficult task. To circumvent limitations of current high-throughput experimental techniques, it is necessary to develop high-performance computational methods for predicting PPIs.
In this article, we propose a new computational method to predict interaction between a given pair of protein sequences using features derived from known homologous PPIs. The proposed method is capable of predicting interaction between two proteins (of unknown structure) using Averaged One-Dependence Estimators (AODE) and three features calculated for the protein pair: (a) sequence similarities to a known interacting protein pair (FSeq), (b) statistical propensities of domain pairs observed in interacting proteins (FDom) and (c) a sum of edge weights along the shortest path between homologous proteins in a PPI network (FNet). Feature vectors were defined to lie in a half-space of the symmetrical high-dimensional feature space to make them independent of the protein order. The predictability of the method was assessed by a 10-fold cross validation on a recently created human PPI dataset with randomly sampled negative data, and the best model achieved an Area Under the Curve of 0.79 (pAUC0.5% = 0.16). In addition, the AODE trained on all three features (named PSOPIA) showed better prediction performance on a separate independent data set than a recently reported homology-based method.
Our results suggest that FNet, a feature representing proximity in a known PPI network between two proteins that are homologous to a target protein pair, contributes to the prediction of whether the target proteins interact or not. PSOPIA will help identify novel PPIs and estimate complete PPI networks. The method proposed in this article is freely available on the web at http://mizuguchilab.org/PSOPIA.
Prediction of protein-protein interactions; Homology; Machine learning; Averaged One-Dependence Estimators (AODE)
Human protein-protein interaction (PPIs) data are the foundation for understanding molecular signalling networks and the functional roles of biomolecules. Several human PPI databases have become available; however, comparisons of these datasets have suggested limited data coverage and poor data quality. Ongoing collection and integration of human PPIs from different sources, both experimentally and computationally, can enable disease-specific network biology modelling in translational bioinformatics studies.
We developed a new web-based resource, the Human Annotated and Predicted Protein Interaction (HAPPI) database, located at . The HAPPI database was created by extracting and integrating publicly available protein interaction databases, including HPRD, BIND, MINT, STRING, and OPHID, using database integration techniques. We designed a unified entity-relationship data model to resolve semantic level differences of diverse concepts involved in PPI data integration. We applied a unified scoring model to give each PPI a measure of its reliability that can place each PPI at one of the five star rank levels from 1 to 5. We assessed the quality of PPIs contained in the new HAPPI database, using evolutionary conserved co-expression pairs called "MetaGene" pairs to measure the extent of MetaGene pair and PPI pair overlaps. While the overall quality of the HAPPI database across all star ranks is comparable to the overall qualities of HPRD or IntNetDB, the subset of the HAPPI database with star ranks between 3 and 5 has a much higher average quality than all other human PPI databases. As of summer 2008, the database contains 142,956 non-redundant, medium to high-confidence level human protein interaction pairs among 10,592 human proteins. The HAPPI database web application also provides …” should be “The HAPPI database web application also provides hyperlinked information of genes, pathways, protein domains, protein structure displays, and sequence feature maps for interactive exploration of PPI data in the database.
HAPPI is by far the most comprehensive public compilation of human protein interaction information. It enables its users to fully explore PPI data with quality measures and annotated information necessary for emerging network biology studies.
The oral cavity is a complex ecosystem where human chemical compounds coexist with a particular microbiota. However, shifts in the normal composition of this microbiota may result in the onset of oral ailments, such as periodontitis and dental caries. In addition, it is known that the microbial colonization of the oral cavity is mediated by protein-protein interactions (PPIs) between the host and microorganisms. Nevertheless, this kind of PPIs is still largely undisclosed. To elucidate these interactions, we have created a computational prediction method that allows us to obtain a first model of the Human-Microbial oral interactome.
We collected high-quality experimental PPIs from five major human databases. The obtained PPIs were used to create our positive dataset and, indirectly, our negative dataset. The positive and negative datasets were merged and used for training and validation of a naïve Bayes classifier. For the final prediction model, we used an ensemble methodology combining five distinct PPI prediction techniques, namely: literature mining, primary protein sequences, orthologous profiles, biological process similarity, and domain interactions. Performance evaluation of our method revealed an area under the ROC-curve (AUC) value greater than 0.926, supporting our primary hypothesis, as no single set of features reached an AUC greater than 0.877. After subjecting our dataset to the prediction model, the classified result was filtered for very high confidence PPIs (probability ≥ 1-10−7), leading to a set of 46,579 PPIs to be further explored.
We believe this dataset holds not only important pathways involved in the onset of infectious oral diseases, but also potential drug-targets and biomarkers. The dataset used for training and validation, the predictions obtained and the network final network are available at http://bioinformatics.ua.pt/software/oralint.
Protein-protein interactions; Oral interactome; Bayesian classification
Protein-protein interactions (PPIs) are crucial for almost all cellular processes, including metabolic cycles, DNA transcription and replication, and signaling cascades. Given the importance of PPIs, several methods have been developed to detect them. Since the experimental methods are time-consuming and expensive, developing computational methods for effectively identifying PPIs is of great practical significance.
Most previous methods were developed for predicting PPIs in only one species, and do not account for probability estimations. In this work, a relatively comprehensive prediction system was developed, based on a support vector machine (SVM), for predicting PPIs in five organisms, specifically humans, yeast, Drosophila, Escherichia coli, and Caenorhabditis elegans. This PPI predictor includes the probability of its prediction in the output, so it can be used to assess the confidence of each SVM prediction by the probability assignment. Using a probability of 0.5 as the threshold for assigning class labels, the method had an average accuracy for detecting protein interactions of 90.67% for humans, 88.99% for yeast, 90.09% for Drosophila, 92.73% for E. coli, and 97.51% for C. elegans. Moreover, among the correctly predicted pairs, more than 80% were predicted with a high probability of ≥0.8, indicating that this tool could predict novel PPIs with high confidence.
Based on this work, a web-based system, Pred_PPI, was constructed for predicting PPIs from the five organisms. Users can predict novel PPIs and obtain a probability value about the prediction using this tool. Pred_PPI is freely available at http://cic.scu.edu.cn/bioinformatics/predict_ppi/default.html.
Protein-protein interactions (PPIs) are crucial in cellular processes. Since the current biological experimental techniques are time-consuming and expensive, and the results suffer from the problems of incompleteness and noise, developing computational methods and software tools to predict PPIs is necessary. Although several approaches have been proposed, the species supported are often limited and additional data like homologous interactions in other species, protein sequence and protein expression are often required. And predictive abilities of different features for different kinds of PPI data have not been studied.
In this paper, we propose ppiPre, an open-source framework for PPI analysis and prediction using a combination of heterogeneous features including three GO-based semantic similarities, one KEGG-based co-pathway similarity and three topology-based similarities. It supports up to twenty species. Only the original PPI data and gold-standard PPI data are required from users. The experiments on binary and co-complex gold-standard yeast PPI data sets show that there exist big differences among the predictive abilities of different features on different kinds of PPI data sets. And the prediction performance on the two data sets shows that ppiPre is capable of handling PPI data in different kinds and sizes. ppiPre is implemented in the R language and is freely available on the CRAN (http://cran.r-project.org/web/packages/ppiPre/).
We applied our framework to both binary and co-complex gold-standard PPI data sets. The detailed analysis on three GO aspects suggests that different GO aspects should be used on different kinds of data sets, and that combining all the three aspects of GO often gets the best result. The analysis also shows that using only features based solely on the topology of the PPI network can get a very good result when predicting the co-complex PPI data. ppiPre provides useful functions for analysing PPI data and can be used to predict PPIs for multiple species.
Protein-protein interactions (PPIs) are essential to most fundamental cellular processes. There has been increasing interest in reconstructing PPIs networks. However, several critical difficulties exist in obtaining reliable predictions. Noticeably, false positive rates can be as high as >80%. Error correction from each generating source can be both time-consuming and inefficient due to the difficulty of covering the errors from multiple levels of data processing procedures within a single test. We propose a novel Bayesian integration method, deemed nonparametric Bayes ensemble learning (NBEL), to lower the misclassification rate (both false positives and negatives) through automatically up-weighting data sources that are most informative, while down-weighting less informative and biased sources. Extensive studies indicate that NBEL is significantly more robust than the classic naïve Bayes to unreliable, error-prone and contaminated data. On a large human data set our NBEL approach predicts many more PPIs than naïve Bayes. This suggests that previous studies may have large numbers of not only false positives but also false negatives. The validation on two human PPIs datasets having high quality supports our observations. Our experiments demonstrate that it is feasible to predict high-throughput PPIs computationally with substantially reduced false positives and false negatives. The ability of predicting large numbers of PPIs both reliably and automatically may inspire people to use computational approaches to correct data errors in general, and may speed up PPIs prediction with high quality. Such a reliable prediction may provide a solid platform to other studies such as protein functions prediction and roles of PPIs in disease susceptibility.
Protein interactions are the basic units in almost all biological processes. It is thus vitally important to reconstruct protein-protein interactions (PPIs) before we can fully understand biological processes. However, critical difficulties exist. Particularly the rate of wrongly predicting PPIs to be true (false positive rate) is extremely high in PPIs prediction. The traditional approaches of error correction from each generating source can be both time-consuming and inefficient. We propose a method that can substantially reduce false positive rates by emphasizing information from more reliable data sources, and de-emphasizing less reliable sources. We indicate that it is indeed the case from our extensive studies. Our predictions also suggest that large numbers of not only false positives but also false negatives may exist in previous studies, as validated by two human PPIs datasets having high quality. The ability to predict large numbers of PPIs both reliably and automatically may inspire people to use computational approaches to correct data errors in general, and speed up PPIs prediction with high quality. Reliable prediction from our method may benefit other studies involving such as protein function prediction and roles of PPIs in disease susceptibility.
Summary: Mass spectrometry-based proteomics stands to gain from additional analysis of its data, but its large, complex datasets make demands on speed and memory usage requiring special consideration from scripting languages. The software library ‘mspire’—developed in the Ruby programming language—offers quick and memory-efficient readers for standard xml proteomics formats, converters for intermediate file types in typical proteomics spectral-identification work flows (including the Bioworks .srf format), and modules for the calculation of peptide false identification rates.
Availability: Freely available at http://mspire.rubyforge.org. Additional data models, usage information, and methods available at http://bioinformatics.icmb.utexas.edu/mspire
Protein-protein interactions (PPIs) may represent one of the next major classes of therapeutic targets. So far, only a minute fraction of the estimated 650,000 PPIs that comprise the human interactome are known with a tiny number of complexes being drugged. Such intricate biological systems cannot be cost-efficiently tackled using conventional high-throughput screening methods. Rather, time has come for designing new strategies that will maximize the chance for hit identification through a rationalization of the PPI inhibitor chemical space and the design of PPI-focused compound libraries (global or target-specific). Here, we train machine-learning-based models, mainly decision trees, using a dataset of known PPI inhibitors and of regular drugs in order to determine a global physico-chemical profile for putative PPI inhibitors. This statistical analysis unravels two important molecular descriptors for PPI inhibitors characterizing specific molecular shapes and the presence of a privileged number of aromatic bonds. The best model has been transposed into a computer program, PPI-HitProfiler, that can output from any drug-like compound collection a focused chemical library enriched in putative PPI inhibitors. Our PPI inhibitor profiler is challenged on the experimental screening results of 11 different PPIs among which the p53/MDM2 interaction screened within our own CDithem platform, that in addition to the validation of our concept led to the identification of 4 novel p53/MDM2 inhibitors. Collectively, our tool shows a robust behavior on the 11 experimental datasets by correctly profiling 70% of the experimentally identified hits while removing 52% of the inactive compounds from the initial compound collections. We strongly believe that this new tool can be used as a global PPI inhibitor profiler prior to screening assays to reduce the size of the compound collections to be experimentally screened while keeping most of the true PPI inhibitors. PPI-HitProfiler is freely available on request from our CDithem platform website, www.CDithem.com.
Protein-protein interactions (PPIs) are essential to life and various diseases states are associated with aberrant PPIs. Therefore significant efforts are dedicated to this new class of therapeutic targets. Even though it might not be possible to modulate the estimated 650,000 PPIs that regulate human life with drug-like compounds, a sizeable number of PPI should be druggable. Only 10-15% of the human genome is thought to be druggable with around 1000-3000 druggable protein targets. A hypothetical similar ratio for PPIs would bring the number of druggable PPIs to about 65,000, although no data can yet support such a hypothesis. PPI have been historically intricate to tackle with standard experimental and virtual screening techniques, possibly because of the shift in the chemical space between today's chemical libraries and PPI physico-chemical requirements. Therefore, one possible avenue to circumvent this conundrum is to design focused libraries enriched in putative PPI inhibitors. Here, we show how chemoinformatics can assist library design by learning physico-chemical rules from a data set of known PPI inhibitors and their comparison with regular drugs. Our study shows the importance of specific molecular shapes and a privileged number of aromatic bonds.
Motivation: Approaches that use supervised machine learning techniques for protein–protein interaction (PPI) prediction typically use features obtained by integrating several sources of data. Often certain attributes of the data are not available, resulting in missing values. In particular, our host–pathogen PPI datasets have a large fraction, in the range of 58–85% of missing values, which makes it challenging to apply machine learning algorithms.
Results: We show that specialized techniques for missing value imputation can improve the performance of the models significantly. We use cross species information in combination with machine learning techniques like Group lasso with ℓ1/ℓ2 regularization. We demonstrate the benefits of our approach on two PPI prediction problems. In our first example of Salmonella–human PPI prediction, we are able to obtain high prediction accuracies with 77.6% precision and 84% recall. Comparison with various other techniques shows an improvement of 9 in F1 score over the next best technique. We also apply our method to Yersinia–human PPI prediction successfully, demonstrating the generality of our approach.
Availability: Predicted interactions, datasets, features are available at: http://www.cs.cmu.edu/~mkshirsa/eccb2012_paper46.html.
Supplementary data are available at Bioinformatics online.
Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate.
We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence.
Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein.
Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. The partner-specific classifier, PS-HomPPI can, on a large dataset of transient complexes, predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the target can be reliably identified. The HomPPI web server is available at http://homppi.cs.iastate.edu/.
Sequence homology-based methods offer a class of computationally efficient and reliable approaches for predicting the protein-protein interface residues that participate in either obligate or transient interactions. For query proteins involved in transient interactions, the reliability of interface residue prediction can be improved by exploiting knowledge of putative interaction partners.
One of the crucial steps toward understanding the biological functions of a cellular system is to investigate protein–protein interaction (PPI) networks. As an increasing number of reliable PPIs become available, there is a growing need for discovering PPIs to reconstruct PPI networks of interesting organisms. Some interolog-based methods and homologous PPI families have been proposed for predicting PPIs from the known PPIs of source organisms.
Here, we propose a multiple-strategy scoring method to identify reliable PPIs for reconstructing the mouse PPI network from two well-known organisms: human and fly. We firstly identified the PPI candidates of target organisms based on homologous PPIs, sharing significant sequence similarities (joint E-value ≤ 1 × 10−40), from source organisms using generalized interolog mapping. These PPI candidates were evaluated by our multiple-strategy scoring method, combining sequence similarities, normalized ranks, and conservation scores across multiple organisms. According to 106,825 PPI candidates in yeast derived from human and fly, our scoring method can achieve high prediction accuracy and outperform generalized interolog mapping. Experiment results show that our multiple-strategy score can avoid the influence of the protein family size and length to significantly improve PPI prediction accuracy and reflect the biological functions. In addition, the top-ranked and conserved PPIs are often orthologous/essential interactions and share the functional similarity. Based on these reliable predicted PPIs, we reconstructed a comprehensive mouse PPI network, which is a scale-free network and can reflect the biological functions and high connectivity of 292 KEGG modules, including 216 pathways and 76 structural complexes.
Experimental results show that our scoring method can improve the predicting accuracy based on the normalized rank and evolutionary conservation from multiple organisms. Our predicted PPIs share similar biological processes and cellular components, and the reconstructed genome-wide PPI network can reflect network topology and modularity. We believe that our method is useful for inferring reliable PPIs and reconstructing a comprehensive PPI network of an interesting organism.
Protein-protein interaction (PPI) plays essential roles in cellular functions. The cost, time and other limitations associated with the current experimental methods have motivated the development of computational methods for predicting PPIs. As protein interactions generally occur via domains instead of the whole molecules, predicting domain-domain interaction (DDI) is an important step toward PPI prediction. Computational methods developed so far have utilized information from various sources at different levels, from primary sequences, to molecular structures, to evolutionary profiles.
In this paper, we propose a computational method to predict DDI using support vector machines (SVMs), based on domains represented as interaction profile hidden Markov models (ipHMM) where interacting residues in domains are explicitly modeled according to the three dimensional structural information available at the Protein Data Bank (PDB). Features about the domains are extracted first as the Fisher scores derived from the ipHMM and then selected using singular value decomposition (SVD). Domain pairs are represented by concatenating their selected feature vectors, and classified by a support vector machine trained on these feature vectors. The method is tested by leave-one-out cross validation experiments with a set of interacting protein pairs adopted from the 3DID database. The prediction accuracy has shown significant improvement as compared to InterPreTS (Interaction Prediction through Tertiary Structure), an existing method for PPI prediction that also uses the sequences and complexes of known 3D structure.
We show that domain-domain interaction prediction can be significantly enhanced by exploiting information inherent in the domain profiles via feature selection based on Fisher scores, singular value decomposition and supervised learning based on support vector machines. Datasets and source code are freely available on the web at http://liao.cis.udel.edu/pub/svdsvm. Implemented in Matlab and supported on Linux and MS Windows.
Identifying protein-protein interactions (PPIs) is essential for elucidating protein functions and understanding the molecular mechanisms inside the cell. However, the experimental methods for detecting PPIs are both time-consuming and expensive. Therefore, computational prediction of protein interactions are becoming increasingly popular, which can provide an inexpensive way of predicting the most likely set of interactions at the entire proteome scale, and can be used to complement experimental approaches. Although much progress has already been achieved in this direction, the problem is still far from being solved and new approaches are still required to overcome the limitations of the current prediction models.
In this work, a sequence-based approach is developed by combining a novel Multi-scale Continuous and Discontinuous (MCD) feature representation and Support Vector Machine (SVM). The MCD representation gives adequate consideration to the interactions between sequentially distant but spatially close amino acid residues, thus it can sufficiently capture multiple overlapping continuous and discontinuous binding patterns within a protein sequence. An effective feature selection method mRMR was employed to construct an optimized and more discriminative feature set by excluding redundant features. Finally, a prediction model is trained and tested based on SVM algorithm to predict the interaction probability of protein pairs.
When performed on the yeast PPIs data set, the proposed approach achieved 91.36% prediction accuracy with 91.94% precision at the sensitivity of 90.67%. Extensive experiments are conducted to compare our method with the existing sequence-based method. Experimental results show that the performance of our predictor is better than several other state-of-the-art predictors, whose average prediction accuracy is 84.91%, sensitivity is 83.24%, and precision is 86.12%. Achieved results show that the proposed approach is very promising for predicting PPI, so it can be a useful supplementary tool for future proteomics studies. The source code and the datasets are freely available at http://csse.szu.edu.cn/staff/youzh/MCDPPI.zip for academic use.
Motivation: Most functions within the cell emerge thanks to protein–protein interactions (PPIs), yet experimental determination of PPIs is both expensive and time-consuming. PPI networks present significant levels of noise and incompleteness. Predicting interactions using only PPI-network topology (topological prediction) is difficult but essential when prior biological knowledge is absent or unreliable.
Methods: Network embedding emphasizes the relations between network proteins embedded in a low-dimensional space, in which protein pairs that are closer to each other represent good candidate interactions. To achieve network denoising, which boosts prediction performance, we first applied minimum curvilinear embedding (MCE), and then adopted shortest path (SP) in the reduced space to assign likelihood scores to candidate interactions. Furthermore, we introduce (i) a new valid variation of MCE, named non-centred MCE (ncMCE); (ii) two automatic strategies for selecting the appropriate embedding dimension; and (iii) two new randomized procedures for evaluating predictions.
Results: We compared our method against several unsupervised and supervisedly tuned embedding approaches and node neighbourhood techniques. Despite its computational simplicity, ncMCE-SP was the overall leader, outperforming the current methods in topological link prediction.
Conclusion: Minimum curvilinearity is a valuable non-linear framework that we successfully applied to the embedding of protein networks for the unsupervised prediction of novel PPIs. The rationale for our approach is that biological and evolutionary information is imprinted in the non-linear patterns hidden behind the protein network topology, and can be exploited for predicting new protein links. The predicted PPIs represent good candidates for testing in high-throughput experiments or for exploitation in systems biology tools such as those used for network-based inference and prediction of disease-related functional modules.
firstname.lastname@example.org or email@example.com
Supplementary data are available at Bioinformatics online.
Knowing which proteins exist in a certain organism or cell type and how these proteins interact with each other are necessary for the understanding of biological processes at the whole cell level. The determination of the protein-protein interaction (PPI) networks has been the subject of extensive research. Despite the development of reasonably successful methods, serious technical difficulties still exist. In this paper we present DomainGA, a quantitative computational approach that uses the information about the domain-domain interactions to predict the interactions between proteins.
DomainGA is a multi-parameter optimization method in which the available PPI information is used to derive a quantitative scoring scheme for the domain-domain pairs. Obtained domain interaction scores are then used to predict whether a pair of proteins interacts. Using the yeast PPI data and a series of tests, we show the robustness and insensitivity of the DomainGA method to the selection of the parameter sets, score ranges, and detection rules. Our DomainGA method achieves very high explanation ratios for the positive and negative PPIs in yeast. Based on our cross-verification tests on human PPIs, comparison of the optimized scores with the structurally observed domain interactions obtained from the iPFAM database, and sensitivity and specificity analysis; we conclude that our DomainGA method shows great promise to be applicable across multiple organisms.
We envision the DomainGA as a first step of a multiple tier approach to constructing organism specific PPIs. As it is based on fundamental structural information, the DomainGA approach can be used to create potential PPIs and the accuracy of the constructed interaction template can be further improved using complementary methods. Explanation ratios obtained in the reported test case studies clearly show that the false prediction rates of the template networks constructed using the DomainGA scores are reasonably low, and the erroneous predictions can be filtered further using supplementary approaches such as those based on literature search or other prediction methods.
Protein-protein interactions are key to many biological processes. Computational methodologies devised to predict protein-protein interaction (PPI) sites on protein surfaces are important tools in providing insights into the biological functions of proteins and in developing therapeutics targeting the protein-protein interaction sites. One of the general features of PPI sites is that the core regions from the two interacting protein surfaces are complementary to each other, similar to the interior of proteins in packing density and in the physicochemical nature of the amino acid composition. In this work, we simulated the physicochemical complementarities by constructing three-dimensional probability density maps of non-covalent interacting atoms on the protein surfaces. The interacting probabilities were derived from the interior of known structures. Machine learning algorithms were applied to learn the characteristic patterns of the probability density maps specific to the PPI sites. The trained predictors for PPI sites were cross-validated with the training cases (consisting of 432 proteins) and were tested on an independent dataset (consisting of 142 proteins). The residue-based Matthews correlation coefficient for the independent test set was 0.423; the accuracy, precision, sensitivity, specificity were 0.753, 0.519, 0.677, and 0.779 respectively. The benchmark results indicate that the optimized machine learning models are among the best predictors in identifying PPI sites on protein surfaces. In particular, the PPI site prediction accuracy increases with increasing size of the PPI site and with increasing hydrophobicity in amino acid composition of the PPI interface; the core interface regions are more likely to be recognized with high prediction confidence. The results indicate that the physicochemical complementarity patterns on protein surfaces are important determinants in PPIs, and a substantial portion of the PPI sites can be predicted correctly with the physicochemical complementarity features based on the non-covalent interaction data derived from protein interiors.
Infectious diseases result in millions of deaths each year. Physical interactions between pathogen and host proteins often form the basis of such infections. While a number of methods have been proposed for predicting protein–protein interactions (PPIs), they have primarily focused on intra-species protein–protein interactions.
We present an application of a supervised learning method for predicting physical interactions between host and pathogen proteins, using the human–HIV system. Using a Support Vector Machine with a linear kernel, we explore the use of a number of features including domain profiles, protein sequence k-mers, and properties of human proteins in a human PPI network. We achieve the best cross-validation performance when we use a combination of all three of these features. At a precision value of 70% we obtain recall values greater than 40%, depending on the ratio of positive examples to negative examples used during training. We use a classifier trained using these features to predict new PPIs between human and HIV proteins. We focus our discussion on those predicted interactions that involve human proteins known to be critical for HIV replication and propagation. Examples of predicted interactions with support in the literature include those necessary for viral attachment to the host membrane and subsequent invasion of the host cell.
Unlike intra-species PPIs, host–pathogen PPIs have not yet been experimentally detected on a large scale, though they are likely to play important roles in pathogenesis and disease outcomes. Computational methods that can robustly and accurately predict host–pathogen PPIs hold the promise of guiding future experiments and gaining insights into potential mechanisms of pathogenesis.
Host–pathogen interactions; Protein interaction prediction; Systems biology; Infectious disease
Motivation: Protein-protein interactions (PPIs), though extremely valuable towards a better understanding of protein functions and cellular processes, do not provide any direct information about the regions/domains within the proteins that mediate the interaction. Most often, it is only a fraction of a protein that directly interacts with its biological partners. Thus, understanding interaction at the domain level is a critical step towards (i) thorough understanding of PPI networks; (ii) precise identification of binding sites; (iii) acquisition of insights into the causes of deleterious mutations at interaction sites; and (iv) most importantly, development of drugs to inhibit pathological protein interactions. In addition, knowledge derived from known domain–domain interactions (DDIs) can be used to understand binding interfaces, which in turn can help discover unknown PPIs.
Results: Here, we describe a novel method called K-GIDDI (knowledge-guided inference of DDIs) to narrow down the PPI sites to smaller regions/domains. K-GIDDI constructs an initial DDI network from cross-species PPI networks, and then expands the DDI network by inferring additional DDIs using a divide-and-conquer biclustering algorithm guided by Gene Ontology (GO) information, which identifies partial-complete bipartite sub-networks in the DDI network and makes them complete bipartite sub-networks by adding edges. Our results indicate that K-GIDDI can reliably predict DDIs. Most importantly, K-GIDDI's novel network expansion procedure allows prediction of DDIs that are otherwise not identifiable by methods that rely only on PPI data.
Supplementary information: Supplementary data are available at Bioinformatics online.
Protein-protein interaction (PPI) is essential to most biological processes. Abnormal interactions may have implications in a number of neurological syndromes. Given that the association and dissociation of protein molecules is crucial, computational tools capable of effectively identifying PPI are desirable. In this paper, we propose a simple yet effective method to detect PPI based on pairwise similarity and using only the primary structure of the protein. The PPI based on Pairwise Similarity (PPI-PS) method consists of a representation of each protein sequence by a vector of pairwise similarities against large subsequences of amino acids created by a shifting window which passes over concatenated protein training sequences. Each coordinate of this vector is typically the E-value of the Smith-Waterman score. These vectors are then used to compute the kernel matrix which will be exploited in conjunction with support vector machines.
To assess the ability of the proposed method to recognize the difference between "interacted" and "non-interacted" proteins pairs, we applied it on different datasets from the available yeast saccharomyces cerevisiae protein interaction. The proposed method achieved reasonable improvement over the existing state-of-the-art methods for PPI prediction.
Pairwise similarity score provides a relevant measure of similarity between protein sequences. This similarity incorporates biological knowledge about proteins and it is extremely powerful when combined with support vector machine to predict PPI.
Motivation: Protein–protein interactions (PPIs) are a promising, but challenging target for pharmaceutical intervention. One approach for addressing these difficult targets is the rational design of small-molecule inhibitors that mimic the chemical and physical properties of small clusters of key residues at the protein–protein interface. The identification of appropriate clusters of interface residues provides starting points for inhibitor design and supports an overall assessment of the susceptibility of PPIs to small-molecule inhibition.
Results: We extract Small-Molecule Inhibitor Starting Points (SMISPs) from protein-ligand and protein–protein complexes in the Protein Data Bank (PDB). These SMISPs are used to train two distinct classifiers, a support vector machine and an easy to interpret exhaustive rule classifier. Both classifiers achieve better than 70% leave-one-complex-out cross-validation accuracy and correctly predict SMISPs of known PPI inhibitors not in the training set. A PDB-wide analysis suggests that nearly half of all PPIs may be susceptible to small-molecule inhibition.
Supplementary data are available at Bioinformatics online.
Here we present ComPPI, a cellular compartment-specific database of proteins and their interactions enabling an extensive, compartmentalized protein–protein interaction network analysis (URL: http://ComPPI.LinkGroup.hu). ComPPI enables the user to filter biologically unlikely interactions, where the two interacting proteins have no common subcellular localizations and to predict novel properties, such as compartment-specific biological functions. ComPPI is an integrated database covering four species (S. cerevisiae, C. elegans, D. melanogaster and H. sapiens). The compilation of nine protein–protein interaction and eight subcellular localization data sets had four curation steps including a manually built, comprehensive hierarchical structure of >1600 subcellular localizations. ComPPI provides confidence scores for protein subcellular localizations and protein–protein interactions. ComPPI has user-friendly search options for individual proteins giving their subcellular localization, their interactions and the likelihood of their interactions considering the subcellular localization of their interacting partners. Download options of search results, whole-proteomes, organelle-specific interactomes and subcellular localization data are available on its website. Due to its novel features, ComPPI is useful for the analysis of experimental results in biochemistry and molecular biology, as well as for proteome-wide studies in bioinformatics and network science helping cellular biology, medicine and drug design.