The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact firstname.lastname@example.org
We demonstrate that protein–protein interaction networks in several eukaryotic organisms contain significantly more self-interacting proteins than expected if such homodimers appeared randomly in the course of evolution. We also show that on average homodimers have twice as many interaction partners as non-self-interacting proteins. More specifically, the likelihood of a protein to physically interact with itself was found to be proportional to the total number of its binding partners. These properties of dimers are in agreement with a phenomenological model in which individual proteins differ from each other by the degree of their ‘stickiness’, or general propensity toward interaction with other proteins, including themselves. A duplication of a self-interacting protein creates a pair of paralogous proteins interacting with each other. We show that such pairs occur more frequently than could be explained by pure chance alone. Similar to homodimers, proteins involved in heterodimers with their paralogs on average have twice as many interacting partners as the rest of the network. The likelihood of a pair of paralogous proteins to interact with each other was also shown to decrease as their sequence similarity decreases. This points to the conclusion that most interactions between paralogs are inherited from ancestral homodimeric proteins, rather than established de novo after duplication. Finally, we discuss possible implications of our empirical observations from functional and evolutionary standpoints.
Many functionally important proteins, such as receptors [G-protein-coupled receptors (1), tyrosine kinase receptors (2)], enzyme complexes (3), ion channels (4) and transcription factors (5), are homo- or hetero-dimers. For example, ~70% of enzymes listed in the Brenda database (http://www.brenda.uni-koeln.de/) can self-interact to form dimers or higher-order oligomers. As another example, the G-protein-coupled receptor (1), chemokine (6), cytokine (7) and tyrosine kinase receptor (2) families all use oligomerization as a step in pathway activation in response to an agonist (3). Examples of multi-protein complexes containing homodimers include the proteasome (8), ribosome (9) and nucleosome (10). The function of most filamentous proteins of the cytoskeleton, such as actin, myosin, spectrin and tubulin, relies on their oligomerization or polymerization. The ability to self-interact confers several structural and functional advantages to proteins, including improved stability (11,12), control over the accessibility and specificity of active sites (3), and increased structural complexity. In addition, self-association can help to minimize genome size while maintaining the advantages of modular complex formation. Protein assembly into heterodimers has the combinatorial effect of producing multiple species with different affinities for their substrates and other biophysical characteristics, giving the cell an instrument for fine-tuning its regulatory responses. An even bigger variety of complexes contain (or are formed by) interacting paralogs, such as the spliceosome (13), the actin-promoting complex Arp2/3, membrane receptors (14) and transcription factors (5).
While many specific dimerizing proteins are well studied and their biological and structural properties have been established, little is known about the overall topological influence and high-level statistical properties of the distribution of dimers in protein networks. Protein networks have recently become a subject of extensive research by biologists as well as by scientists from other fields interested in networks and graphs [e.g. (5,15–19)]. Among the various types of protein–protein networks studied, binding (physical interaction) networks have several appealing properties that make them a popular research subject: they are undirected, Boolean and the most extensive, in principle spanning all proteins present in a given organism. Several universal features of the binding networks are believed to be established fairly well. Examples include an apparent broad (scale-free) degree distribution [(16) and references therein], suppression of interactions between high-degree (hub) proteins (17), a higher than randomly expected number of tightly linked sub-graphs or cliques (15) and evolutionary conservation of such tightly linked sub-graphs (18). In this paper, we describe a systematic empirical study of topological properties of physical interaction networks in the neighborhood of homodimers (self-interacting proteins) as well as heterodimers formed by paralogous proteins.
The protein interaction data for all four species were obtained from the Biological Association Network databases available from Ariadne Genomics (http://www.ariadnegenomics.com/). The database for Homo sapiens was derived from the Ariadne Genomics ResNet database, constructed from various literature sources using Medscan, Ariadne Genomics' proprietary natural language processing technology (20,21). The list of all human proteins used in our study, along with their degrees (number of binding partners), dimerization state and a brief description of their functional role in the cell (if known), is available in the Supplementary Material. The databases for the baker's yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans and the fruit fly Drosophila melanogaster were constructed by combining the data from published high-throughput experiments with the literature data obtained using Medscan technology. For more details on the construction of these databases, please refer to the PathwayAssist manual (http://www.ariadnegenomics.com/products/pathway.html).
Most of the protein–protein interactions (PPIs) among fly proteins (20 496 out of 20 595 or 99.5%) are extracted from a single system-wide two-hybrid study (22), while most worm interactions (4027 out of 5309 or 75%) are from a large-scale two-hybrid study (23). The abnormally small average degree in the worm PPI network compared with that of other organisms might be explained by the fact that, unlike in the yeast (24) and the fly (22) cases, the high-throughput two-hybrid assay of worm proteins was not truly genome-wide. Indeed, in (23) the authors experimentally investigated interactions of only 1873 specially selected baits (out of some 22000 worm proteins) against genome-wide libraries of preys. Owing to the small probability that a given interaction would be observed in both directions, proteins that were not tested as baits on average get only half of their number of interaction partners. Indeed, we found that the average degree of worm proteins tested as baits (or rather 729 of them that were found to have at least one prey partner) is ~6.1 as opposed to the average degree of ~3 in the whole two-hybrid part of the worm network. This is remarkably close to the 5.7–6.6 range found in the other three organisms studied here. It is important to note that the number of homodimer proteins found in this study (60 proteins) is a gross underestimate of the total number of homodimers among worm proteins, because for a self-interaction to be detected both the bait and the prey hybrids of a protein have to be used in the study. A crude estimate gives the overall number of homodimers in the worm to be at least (60 × 22000)/1873 ≈ 700.
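The scale-up behind this crude estimate can be checked in a few lines. This is only a sketch of the arithmetic quoted above; the variable names are ours, and the assumption that untested proteins self-interact as often as the tested baits is the one implicit in the text.

```python
# Numbers quoted in the text for the worm two-hybrid study (23).
homodimers_found = 60   # self-interactions detected among tested baits
baits_tested = 1873     # proteins screened as baits
worm_proteins = 22000   # approximate number of worm proteins

# Assumption: untested proteins self-interact at the same rate as tested baits.
estimated_homodimers = homodimers_found * worm_proteins / baits_tested
print(round(estimated_homodimers))  # prints 705, i.e. roughly 700
```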
Lists of paralogous pairs and their sequence similarities for all four species studied here were obtained by the following procedure. Amino acid sequences of individual proteins were obtained from the RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/). For each organism, the sequences were compared against themselves using the BLASTp program with the expectation value cutoff equal to 0.001 (25). A global alignment similarity was then computed by adding together numbers of similar amino acids from all non-overlapping locally aligned segments and dividing this number by the geometric average of two protein lengths. Thus, gaps between the aligned segments were considered to have zero similarity. In the case of overlapping segments, we took the one with the highest percent of similarity. We estimated that ~2% of the true homologs are not recovered by this approach due to an incompleteness of the BLASTp output for local alignment. Another sacrifice for quicker calculation is an underestimation of the global alignment score by 5–10% compared with more precise calculation after alignment using the CLUSTALW algorithm (26).
To reduce the number of false positives, we further restricted our set to include only protein pairs with similarity >30%. Finally, all protein pairs that had been aligned by BLAST but omitted from the final paralog list because they failed the similarity cutoff were checked for having common paralogs. If a common paralog was found, the pair was reinstated in the paralog list.
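The similarity score and the paralog filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the `Segment` record, the function names, the greedy resolution of overlaps and the rule that two segments overlap if they overlap on either protein are all our own choices, and the reinstatement step is applied in a single pass.

```python
from math import sqrt
from typing import NamedTuple


class Segment(NamedTuple):
    """One local alignment; coordinates are 1-based and inclusive."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    positives: int  # number of similar amino acids in the segment


def _overlaps(a, b):
    """True if two segments overlap on either protein (our assumption)."""
    on_query = a.q_start <= b.q_end and b.q_start <= a.q_end
    on_subject = a.s_start <= b.s_end and b.s_start <= a.s_end
    return on_query or on_subject


def _percent_similarity(seg):
    aligned = max(seg.q_end - seg.q_start, seg.s_end - seg.s_start) + 1
    return seg.positives / aligned


def global_similarity(segments, len_q, len_s):
    """Similar residues summed over non-overlapping segments, divided by
    the geometric mean of the two protein lengths."""
    kept = []
    # Among overlapping segments, keep the one with the highest percent similarity.
    for seg in sorted(segments, key=_percent_similarity, reverse=True):
        if not any(_overlaps(seg, other) for other in kept):
            kept.append(seg)
    return sum(seg.positives for seg in kept) / sqrt(len_q * len_s)


def paralog_pairs(similarity, cutoff=0.30):
    """similarity maps (protein_a, protein_b) -> global similarity for every
    BLAST-aligned pair. Returns the pairs above the cutoff plus the rejected
    pairs that share a common paralog."""
    accepted = {pair for pair, s in similarity.items() if s > cutoff}
    partners = {}
    for a, b in accepted:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    reinstated = {
        (a, b) for (a, b) in similarity
        if (a, b) not in accepted
        and partners.get(a, set()) & partners.get(b, set())
    }
    return accepted | reinstated
```

Gaps between the kept segments contribute nothing to the numerator, which matches the statement that they are treated as having zero similarity.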
We have assembled and analyzed the PPI (binding) networks from four organisms: the baker's yeast S.cerevisiae, the nematode worm C.elegans, the fruit fly D.melanogaster and the human H.sapiens (see Materials and Methods for details). The most apparent observation that follows from the network data (Table 1) is that the number of self-interacting proteins in all four organisms is substantially higher than one would expect purely by chance. Indeed, in a network with N proteins (each having at least one interaction), a straightforward estimate assuming equal affinity to itself and other proteins suggests that a protein with the connectivity (degree) k would have a probability to bind to itself equal to k/N. The expected total number of dimers is then the sum of this expression over all proteins, which is the average connectivity, ⟨k⟩. The actual number of dimers is 25–200 times higher than expected based on this simple-minded hypothesis (Table 1).
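The null estimate above, in which each protein of degree k binds to itself with probability k/N so that the expected number of homodimers equals the average degree, can be written as a short sketch. The function names and the toy degree list are hypothetical and are not taken from Table 1; whether a self-interaction itself counts toward k is left open here.

```python
def expected_homodimers(degrees):
    """Null expectation: a protein of degree k self-binds with probability k/N.
    Summing k/N over all N proteins gives the average degree."""
    n = len(degrees)
    return sum(degrees) / n


def enrichment(observed_homodimers, degrees):
    """How many times more homodimers are observed than the null predicts."""
    return observed_homodimers / expected_homodimers(degrees)


# Hypothetical toy network, not data from Table 1.
toy_degrees = [1, 2, 3, 6]
print(expected_homodimers(toy_degrees))  # 3.0
print(enrichment(150, toy_degrees))      # 50.0
```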
The abundance of dimers in all species suggests that their functional importance has been preserved through evolution. In support of this conclusion, we note that self-interacting proteins also have about twice as many interaction partners as non-dimers (Table 1). Indeed, the number of interaction partners of a protein was previously shown to be positively correlated with its probability to be essential for the survival of the cell and to be conserved in the course of evolution (18).
Sometimes, the ease with which proteins form self-interactions has a purely structural (as opposed to functional) origin, explained, e.g., by the domain swapping model (27). Indeed, in the fully folded state the individual structural components of a protein are expected to make multiple binding contacts with each other. A pair of identical (or homologous) proteins might then be able to use the same set of contacts to physically interact with each other if they encounter each other in a partially unfolded state. It is interesting to note that the average degrees of dimers are almost equal to each other in all four organisms studied here. The average degrees of all proteins in the network are also quite close to each other (a plausible experimental source of the anomalously low ⟨k⟩ ≈ 3 of the worm network is explained in Materials and Methods). At present, it is unclear whether this apparent similarity is just a coincidence or has a deeper explanation. In any case, the inter- and intra-species comparison of these networks indicates that the PPI data for any of these organisms are far from saturation and that a considerable number of new interactions are expected to be added to these networks in the future.
To better understand the connectivity patterns of homodimers in protein interaction networks, we studied how the likelihood of a protein to interact with itself, Pdimer(k), depends on its overall number of binding partners (degree) k. Pdimer(k) is simply the fraction of homodimers among all proteins with the degree k. Figure 1 shows Pdimer(k) versus k measured in the fly data, based mainly on the species-wide two-hybrid dataset described previously (22). As one can see, the probability of self-interaction increases linearly with the degree in the protein network (the dashed line on the log–log plot in Figure 1 has slope 1). The proportionality coefficient of this linear increase can be interpreted as the probability pself ≈ 3.5 × 10⁻³ that a given edge of a physical interaction network starting at a certain protein ends up connecting this node with itself. It is ~25 times larger than the probability pothers = 1/7000 ≈ 1.4 × 10⁻⁴ that it will instead connect with a randomly selected other node among the ~7000 proteins present in the fly interaction dataset. This is consistent with the larger than expected number of homodimers discussed above. The observation that the likelihood of a protein to interact with itself increases linearly with the total number of its interaction (binding) partners (Figure 1) contains important information about the general mechanisms of such interactions. We conjecture that every protein i can be characterized by a unique intrinsic parameter that we refer to as its ‘stickiness’ σi. This parameter quantifies the protein's overall propensity toward forming physical interactions. We further assume that both the probability of a protein to interact with itself and its probability to interact with other proteins are proportional to this stickiness (albeit with different coefficients, as we saw above) and thus should depend linearly on each other.
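The definition of Pdimer(k) as the fraction of homodimers among proteins of degree k translates directly into code. The sketch below uses hypothetical names and toy input; the least-squares slope through the origin is our own stand-in for estimating pself in the linear regime, since the text does not state how the dashed line in Figure 1 was fitted. The last lines reproduce the quoted ~25-fold ratio from the fly values given above.

```python
from collections import defaultdict


def dimer_fraction_by_degree(proteins):
    """proteins: iterable of (degree, is_homodimer) pairs.
    Returns {k: fraction of homodimers among proteins with degree k}."""
    total = defaultdict(int)
    dimers = defaultdict(int)
    for degree, is_homodimer in proteins:
        total[degree] += 1
        dimers[degree] += int(is_homodimer)
    return {k: dimers[k] / total[k] for k in sorted(total)}


def slope_through_origin(fractions):
    """Least-squares slope of Pdimer(k) = p_self * k (an assumed fitting choice)."""
    numerator = sum(k * p for k, p in fractions.items())
    denominator = sum(k * k for k in fractions)
    return numerator / denominator


# Values quoted in the text for the fly dataset.
p_self_fly = 3.5e-3
p_others_fly = 1 / 7000
ratio = p_self_fly / p_others_fly  # about 24.5, i.e. the ~25-fold excess
print(ratio)
```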
This rather plausible conjecture of the existence of a ‘universal propensity toward interactions’ of individual proteins in an organism thus explains both the linear scaling in Figure 1 and our original observation that self-interacting proteins in several organisms tend to have a higher than average number of binding partners in the physical interaction network (Table 1). Indeed, by considering the homodimers, we automatically pick proteins with higher than average stickiness and thus end up with a subset of proteins characterized by a higher than average number of binding partners k. It is important to emphasize that the proposed ‘stickiness’ of a protein should not be interpreted literally, i.e. as the ability of a protein to bind other proteins unspecifically. In fact, all interactions in our datasets (with the exception of false positives) come from specific, functionally relevant bindings between proteins. Instead, one should view the ‘stickiness’ as a complex quantitative characteristic of a protein, which has contributions from such properties as the number and nature of its constituent domains, the hydrophobicity of its surface, the number of copies of the protein per cell, the extent of its evolutionary conservation, the overall level of ‘cooperativity’ of the functional task in which it is involved, etc. In some of our datasets (e.g. human), which are based on a large number of small-scale experiments instead of a single genome-wide assay, the ‘stickiness’ of a protein may also correlate with its overall popularity, i.e. the number of publications in which it was studied. Figure 2 shows the correlation between the propensity toward self-interactions and the number of binding partners in the human dataset. Here, as for the fly (see Figure 1), Pdimer(k) has a region of linear k-dependence. However, here this region is limited to small values of k. For larger values of k, Pdimer(k) starts to show saturation effects and completely saturates at 1 for k > 100.
The saturation is expected to follow a linear region, as obviously no probability can exceed 1. Moreover, it can be qualitatively described by the following simple model. Suppose that each of the k interaction links starting at a given protein ends at the same protein with a probability pself, while with a probability 1 − pself it selects some other protein target. Then, the chance that none of the k links results in the formation of a homodimer is (1 − pself)^k, while a homodimer is formed with a probability

$$P_{\mathrm{dimer}}(k) = 1 - (1 - p_{\mathrm{self}})^{k}.$$
For k < 1/pself, this expression yields a linear k-dependence for Pdimer(k), as was observed for the fly data (Figure 1). This general formula also fits Pdimer(k) nicely over the whole range of k (see dashed lines in Figure 2). The fit with this formula provides an estimate of the propensity toward self-interactions among human proteins, pself, which is ~10 times higher than in our fly dataset. This is why the saturation of Pdimer(k) is clearly visible in the human but not in the fly. However, owing to the vast differences in the extent of coverage and in the sources of the data describing PPIs in the human (interacting protein pairs extracted from abstracts indexed in PubMed) and the fly (a genome-wide two-hybrid assay), different values of pself do not have to reflect actual differences between these two organisms. Finally, in Figure 3 we show the fraction of homodimers versus degree in our worm and yeast datasets. One can see that our previous observations remain valid. The worm dataset is well described by a linear scaling of Pdimer(k) with k, corresponding to a value of pself somewhere halfway between those of the fly and the human. The curve for the yeast exactly follows that of the worm until its slope suddenly changes to a much smaller value around k = 10. The causes of this sudden change of behavior in yeast are unclear to us. It could be related to the popularity of yeast as a model eukaryotic organism: unlike in the worm or the fly, both large-scale and small-scale experimental techniques contribute significantly to our knowledge of PPIs in yeast.
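The saturation model above, in which a homodimer forms with probability 1 − (1 − pself)^k, can be compared numerically with its linear approximation k·pself. This is a minimal sketch: the function name is ours, and only the fly value of pself quoted in the text is used, because the fitted human value is not reproduced here.

```python
def p_dimer(k, p_self):
    """Probability that at least one of k links ends at the protein itself."""
    return 1.0 - (1.0 - p_self) ** k


P_SELF_FLY = 3.5e-3  # value quoted in the text for the fly dataset

for k in (1, 10, 100, 1000):
    exact = p_dimer(k, P_SELF_FLY)
    linear = min(1.0, k * P_SELF_FLY)  # small-k approximation, capped at 1
    print(f"k={k:5d}  exact={exact:.4f}  linear={linear:.4f}")
```

For k well below 1/pself the two columns nearly coincide, while for larger k the exact expression bends toward 1, which is the behavior described for the human data.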
Interacting paralogous proteins (paralogous heterodimers) are often thought [e.g. (5)] to be closely related to self-interacting proteins, or homodimers. Indeed, a duplication of a homodimer-encoding gene in evolution results in the appearance of a new pair (or several pairs for larger families) of interacting paralogous proteins. Such interaction links between paralogs could be destroyed with time as the accumulation of mutations in the constituent proteins changes their 3D shapes. A binding between a pair of non-homodimeric paralogous proteins may also appear de novo after a duplication event. The relative importance of these two mechanisms of formation of paralogous heterodimers is not universally agreed on [e.g. see (16) for a point of view favoring the de novo formation]. In this section we study pairs of interacting paralogs present in our datasets. The purpose of this study is twofold:
We first count the number of linked paralogous pairs, nlinked paralogs, in each dataset. If most links between paralogs were indeed inherited from homodimeric ancestors, nlinked paralogs should be significantly higher than nlinked random, the number of links one expects to find among the same number, Nparalogous pairs, of randomly selected pairs of non-paralogous proteins. Indeed, as we demonstrated in the previous sections, all four organisms included in our study are characterized by an unusually large number of homodimers. However, if most links between paralogous proteins were established de novo after duplication, there is no reason to expect the number of such links to be unusually large compared with a random set of protein pairs. The results presented in Table 2 strongly support the hereditary origin of most paralogous heterodimers: for all species nlinked paralogs is much larger than nlinked random (by several orders of magnitude). This is strong evidence for the hereditary rather than the de novo origin of the paralog–paralog links. Another strong argument for the hereditary hypothesis follows from Figure 4. This figure reveals that the further paralogs diverge in their amino acid sequences, the smaller is the probability that they are linked to each other. This suggests that pairs of linked paralogs typically gradually lose inherited interactions rather than establish new ones. Thus, we conclude that most interacting paralogs present in our data were created by duplication of homodimeric proteins. A final argument in support of this conclusion is that the average number of binding partners of interacting paralogs, klinked paralogs, is indistinguishable from that of homodimers, kdimer, and is ~2–3 times higher than the average over the whole network (see Tables 1 and 2). Given that most paralogous heterodimers were at some point formed from homodimers, one might assume that most proteins involved in such heterodimeric complexes are homodimers.
However, this is far from being the case (see Table 3). This discrepancy has two causes, one purely evolutionary and the other anthropogenic.
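The comparison between the number of linked paralogous pairs and its random expectation, used earlier in this section, can be sketched as follows. The names are hypothetical, and the random baseline here is an analytic estimate from the overall link density of the network, which is our simplifying assumption; the text does not specify how nlinked random was computed, and a Monte Carlo sample of non-paralogous pairs would be an alternative.

```python
def count_linked_pairs(pairs, edges):
    """Number of protein pairs that are connected by an interaction link."""
    # Self-interactions are excluded; links are treated as undirected.
    edge_set = {frozenset(e) for e in edges if e[0] != e[1]}
    return sum(frozenset(p) in edge_set for p in pairs)


def expected_random_links(n_pairs, n_proteins, n_edges):
    """Expected links among n_pairs random pairs, assuming every pair of
    distinct proteins is linked with the same probability (link density)."""
    possible_pairs = n_proteins * (n_proteins - 1) / 2
    return n_pairs * n_edges / possible_pairs


# Hypothetical toy network with four proteins, not data from Table 2.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "A")]
toy_paralogs = [("A", "B"), ("C", "D"), ("A", "D")]

n_linked = count_linked_pairs(toy_paralogs, toy_edges)
n_random = expected_random_links(len(toy_paralogs), n_proteins=4, n_edges=3)
print(n_linked, n_random)  # 2 1.5
```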
We demonstrated that self-interacting proteins tend to have connectivity significantly above the average in the PPI network. This phenomenon appears universally in the PPI networks of all four model organisms studied above. As a related phenomenon, we found that interacting paralogs also have increased connectivity, likely because most of them are descendants of ancient self-interacting proteins. We have also shown that the numbers of homodimers and of interacting paralogs are both higher than expected by pure chance alone. We unify these phenomena by introducing the concept of a protein's ‘stickiness’, which measures its overall propensity for binding. Both the propensity of a protein toward self-interaction and its degree in the PPI network are proportional to this parameter. However, the dimerization probability apparently has a larger proportionality coefficient. This is not very surprising given the multitude of functional roles that dimers (or polymers) play in living cells. Dimerizing and oligomerizing proteins are ubiquitous in all organisms and are present in the most evolutionarily conserved protein complexes (3). On the evolutionary side, we have confirmed that most links between paralogs are probably inherited from their dimerizing ancestors. This does not exclude the possibility that some of these links are formed after duplication as a result of random mutations, but the number of such de novo created links is relatively small. This conclusion has several implications for the network topology. If a given dimerizing protein has duplicated several times, this leads to the appearance of a fully interconnected complex, or clique, of paralogous heterodimers. In reality, some links inside this complex are lost due to the divergence of the sequences of paralogous proteins. Such loss of links may split a higher-order clique into several lower-order ones or turn it into a densely (yet not fully) interconnected motif.
A higher density of links around dimers caused by these remaining heterodimeric links may provide a qualitative explanation for the empirically observed abundance of highly interconnected motifs and cliques in protein networks (15). Several simple models of network growth and evolution by gene duplication, followed by functional divergence of the resulting pair of paralogous proteins, lead to networks with an unrealistic bipartite topology, in which descendants of a particular protein never interact with their paralogs (19). Introduction of a large number of heterodimers to the ancestral network in these models generates frequent links between paralogs, which in the end gives rise to more realistic network topologies. Finally, we would like to speculate on a general role that the highly connected self-interacting proteins might play in the cell. A single protein molecule can simultaneously bind only a limited number of partners, at most equal to the number of its functional domains. On the other hand, most biological processes require many different proteins, in numbers far greater than the binding capacity of a single protein molecule. The protein components of large signaling or biochemical pathways do not form large stable complexes containing all proteins simultaneously. Yet, all the necessary molecules must be in physical proximity to each other to form a functional module. This contradiction poses a question: how can so many different proteins co-localize in a cell to correctly perform a physiological function? A possible answer to this question involves highly connected self-interacting proteins serving as self-organizing centers for co-localization of the pathway components. The self-interaction (oligomerization) of such proteins might function as a general mechanism for sensing protein concentration (3).
Indeed, a random increase of a local concentration of monomers leads to their oligomerization and subsequently to the increase in the concentration of binding sites for other pathway components, increasing in turn their effective concentration.
Supplementary Material is available at NAR Online.
This work was supported by grant 1 R01 GM068954-01 from NIGMS. Work at Brookhaven National Laboratory was carried out under Contract no. DE-AC02-98CH10886, Division of Material Science, U.S. Department of Energy. Two of us (I.I. and I.M.) thank the Theory Institute for Strongly Correlated and Complex Systems at BNL for the hospitality and financial support during the visit in which some of this work was accomplished. Funding to pay the Open Access publication charges for this article was provided by the NIGMS grant.
Conflict of interest statement. None declared.