Biological processes exhibit a hierarchical structure in which the basic working units, proteins, physically associate to form stoichiometrically stable complexes. Complexes interact with individual proteins or other complexes to form functional modules and pathways that carry out most cellular processes. Such higher level interactions are more transient than those within complexes and are highly dependent on temporal and spatial context. The function of each protein or complex depends on its interaction partners. Therefore, a faithful reconstruction of the entire set of complexes in the cell is essential to identifying the function of individual proteins and complexes as well as serving as a building block for understanding the higher level organization of the cell, such as the interactions of complexes and proteins within cellular pathways. Here we describe a novel method for reconstruction of complexes from a variety of biological assays and a method for predicting the network of interactions relating these core cellular units (complexes and proteins).
Our reconstruction effort focuses on the yeast
Saccharomyces cerevisiae. Yeast serves as the prototypical case study for the reconstruction of protein-protein interaction networks. Moreover the yeast complexes often have conserved orthologs in other organisms, including human, and are of interest in their own right. Several studies (
1–
4) using a variety of assays have generated high throughput data that directly measure protein-protein interactions. Most notably, two high quality data sets (
3,
4) used tandem affinity purification (TAP)
1 followed by MS to provide a proteome-wide measurement of protein complexes. These data provide the basis for attempting a comprehensive reconstruction of a large fraction of the protein complexes in this organism. Indeed a number of works (
5,
6) have attempted such a reconstruction. Generally speaking, all use the same general procedure: one or more data sources are used to estimate a set of affinities between pairs of proteins, essentially measuring the likelihood of that pair to participate together in a complex. These affinities induce a weighted graph whose nodes are proteins and whose edges encode the affinities. A clustering algorithm is then used to construct complexes, sets of proteins that have high affinity in the graph. Although similar at a high level, the different methods differ significantly on the design choices made for the key steps in the process.
Recent works (since 2006) all focus on processing the proteome-wide TAP-MS data and using the results to define complexes. Gavin
et al. (
3), Collins
et al. (
7), and Hart
et al. (
5) all use probabilistic models that compare the number of interactions observed between proteins in the data
versus the number expected in some null model. Collins
et al. (
7) and Hart
et al. (
5) both used all three of the available high throughput data sets (
2–
4) in an attempt to provide a unified interaction network. The two unified networks resulting from these studies were shown to have large overlap and to achieve comparable agreement with the set of co-complex interactions in the MIPS data set (
8) that are collated from previous small scale studies. The interaction graphs resulting from the computed affinity scores are then clustered to produce a set of identified complexes. Gavin
et al. (
3), Hart
et al. (
5), and Pu
et al. (
6) all use a Markov clustering (MCL) (
9) procedure; Collins
et al. (
7) use a hierarchical agglomerative clustering (HAC) procedure but do not suggest a computational procedure for using the resulting dendrogram to produce specific complex predictions.
Despite the fairly high quality of these networks and the agreement between them, they still contain many false positives and negatives. False negatives can arise, for example, from the difficulty in detecting interactions involving low abundance proteins or membrane proteins or from cases where the tag added to the bait protein during TAP-MS prevents binding of the bait to its interacting partners. False positives can arise, for example, from complexes that share components or from the contaminants that bind to the bait nonspecifically after cell lysis. Therefore, the set of complexes derived from the protein-protein interaction network alone has limited accuracy. Less than 20% of the MIPS complexes (
8), which are derived from reliable small scale experiments, are exactly captured by the predictions of Pu
et al. (
6) or by those of Hart
et al. (
5).
In this study, we constructed a method that generates a set of complexes with higher sensitivity and coverage by integrating multiple sources of data, including mRNA gene expression data, cellular localization, and yeast two-hybrid data. The data integration approach was used in some early works on predicting protein-protein interactions (
10,
11) and more recently by Qiu and Noble (
12), but these studies focus only on predicting pairs of proteins in the same complex and not on reconstructing entire complexes. Many recent studies (
13–
21) have successfully integrated multiple types of data to predict functional linkage between proteins, constructing a graph whose pairwise affinity score summarizes the information from different sources of data. However, because the data integration is not trained toward predicting complexes, the high affinity pairs contain transient binding partners and even protein pairs that never interact directly but merely function in the same pathways. When these graphs are clustered, the clusters correspond to a variety of cellular entities, including pathways, functional modules, or co-expression clusters. We developed a data integration approach that is aimed directly at the problem of predicting stoichiometrically stable complexes.
We used a two-phase automated procedure that we trained on a new high quality reference set that we generated from annotations in MIPS and SGD and from manual curation of the literature. In the first phase, we used
boosting (
22), a state-of-the-art machine learning method, to train an affinity function that is specifically aimed at predicting whether two proteins are co-complexed. Unlike most other learning methods, boosting is capable of inducing useful features by combining different aspects of the raw data, making it particularly well suited to a data integration setting. Once we generated the learned affinity graph over pairs of proteins, we predicted complexes by using a novel clustering algorithm called hierarchical agglomerative clustering with overlap (HACO). The HACO algorithm is a simple and elegant extension of HAC that addresses many of its limitations, such as the irreversible commitment to a possibly incorrect clustering decision. HACO can be applied to any setting where HAC is applied; given the enormous usefulness of HAC for the analysis of biological data sets of many different types (
e.g. Refs. 7, 23, and 24), we believe that HACO may be applicable in a broad range of other tasks.
To validate our approach, we tested the ability of our methods and other methods to predict reference complexes that were not used in training. By integrating multiple sources of data, we recovered more reference complexes than other state-of-the-art methods (
5,
6) when applied to the same set of yeast proteins. We also validated our predicted set of complexes against external data sources that are not used in the training. In all cases, our predictions were shown to be more coherent than other methods and, in many cases, more coherent even than the set of reference complexes.
A detailed examination of our predicted complexes suggests that many of them were previously known but not included in our (comprehensive) reference set, suggesting that our complexes form a valuable new set of reference complexes. In several cases, our predicted complexes were not previously characterized. We experimentally validated two of these predictions: a new component in the recently characterized eisosome complex (
25), which marks the site of endocytosis in eukaryotes, and a newly characterized six-protein complex, including four phosphatases, that appears to be involved in the response to nutrient starvation and that we named the nutrient starvation complex (NSC).
The complex-based view provides a new perspective on the analysis and reconstruction of the protein interaction network. In the past, Jeong
et al. (
26) have suggested that the degree of a protein in an interaction network is positively correlated with its essentiality and have argued that “hubs” in the network are more likely to be essential because they are involved in more interactions. Our analysis presents a complex-based alternative view: essential proteins tend to cluster together in essential complexes (
5), and essential complexes tend to be large; thus, the essential hubs in the network are often members in large complexes comprised mostly of essential proteins. We also reformulate the task of reconstructing the protein interaction network. Rather than considering interactions between individual proteins (
27–
29), a somewhat confusing network that confounds interactions within complexes and interactions between complexes, we tackle the novel task of predicting a comprehensive protein interaction network that involves both individual proteins and larger complexes. We argue that these entities are the right building blocks in reconstructing cellular processes, providing a view of cellular interaction networks that is both easier to interpret than the complex network of interactions between individual proteins and more faithful to biological reality. Moreover a complex, which is a stable collection of many proteins that act together, provides a more robust basis for predicting interactions as we can combine signals for all its constituent proteins, reducing sensitivity to noise.
To accomplish this goal, we constructed a reference set of complex-complex interactions, considering two complexes to interact if they are significantly enriched for reliable interactions between their components. We further augmented this set with a hand-curated list of established complex-complex interactions. We then used a machine learning approach to detect the “signature” of such interactions from a large set of assays that are likely to be indicative. We explored different machine learning methods and showed that a partially supervised naïve Bayes model, where we learned the model from both labeled and unlabeled interactions, provides the best performance. This model was applied both to our predicted complexes and to individual proteins, providing a new, comprehensive reconstruction of the
S. cerevisiae interaction network, which can be downloaded from our project Web page.
2 We showed that entities that are predicted to interact are more likely to share the same functional categories. A detailed investigation of our new predicted interactions presents many that are established in the literature as well as some that are novel but consistent, presenting plausible hypotheses for further investigation.