Conceived and designed the experiments: SA CMD MAP NSJ. Performed the experiments: SA. Analyzed the data: SA. Contributed reagents/materials/analysis tools: SA CMD MAP NSJ. Wrote the paper: SA CMD MAP NSJ. Group leader: NSJ.
The idea of “date” and “party” hubs has been influential in the study of protein–protein interaction networks. Date hubs display low co-expression with their partners, whilst party hubs have high co-expression. It was proposed that party hubs are local coordinators whereas date hubs are global connectors. Here, we show that the reported importance of date hubs to network connectivity can in fact be attributed to a tiny subset of them. Crucially, these few, extremely central, hubs do not display particularly low expression correlation, undermining the idea of a link between this quantity and hub function. The date/party distinction was originally motivated by an approximately bimodal distribution of hub co-expression; we show that this feature is not always robust to methodological changes. Additionally, topological properties of hubs do not in general correlate with co-expression. However, we find significant correlations between interaction centrality and the functional similarity of the interacting proteins. We suggest that thinking in terms of a date/party dichotomy for hubs in protein interaction networks is not meaningful, and it might be more useful to conceive of roles for protein-protein interactions rather than for individual proteins.
Proteins are key components of cellular machinery, and most cellular functions are executed by groups of proteins acting in concert. The study of networks formed by protein interactions can help reveal how the complex functionality of cells emerges from simple biochemistry. Certain proteins have a particularly large number of interaction partners; some have argued that these “hubs” are essential to biological function. Previous work has suggested that such hubs can be classified into just two varieties: party hubs, which coordinate a specific cellular process or protein complex; and date hubs, which link together and convey information between different function-specific modules or complexes. In this study, we re-examine the ideas of date and party hubs from multiple perspectives. By computationally partitioning protein interaction networks into functionally coherent subnetworks, we show that the roles of hubs are more diverse than a binary classification allows. We also show that the position of an interaction in the network is related to the functional similarity of the two interacting proteins: the most important interactions holding the network together appear to be between the most dissimilar proteins. Thus, examining interaction roles may be relevant to understanding the organisation of protein interaction networks.
Protein interaction networks, constructed from data obtained via techniques such as yeast two-hybrid (Y2H) screening, do not capture the fact that the actual interactions that occur in vivo depend on prevailing physiological conditions. For instance, actively expressed proteins vary amongst the tissues in an organism and also change over time. Thus, the specific parts of the interactome that are active, as well as their organisational form, might depend a great deal on where and when one examines the network , . One way to incorporate such information is to use mRNA expression data from microarray experiments. Han et al.  examined the extent to which hubs in the yeast interactome are co-expressed with their interaction partners. They defined hubs as proteins with degree at least 5, where “degree” refers to the number of links emanating from a node. Based on the averaged Pearson correlation coefficient (avPCC) of expression over all partners, they concluded that hubs fall into two distinct classes: those with a low avPCC (which they called date hubs) and those with a high avPCC (so-called party hubs). They inferred that these two types of hubs play different roles in the modular organisation of the network: Party hubs are thought to coordinate single functions performed by a group of proteins that are all expressed at the same time, whereas date hubs are described as higher-level connectors between groups that perform varying functions and are active at different times or under different conditions.
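The avPCC calculation described above can be illustrated with a minimal sketch. It assumes the network is given as a list of undirected links and the expression data as a dictionary mapping each protein to a list of expression values measured under the same conditions; the function names `pearson` and `hub_avpcc` are hypothetical and not taken from the original study.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a flat profile as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def hub_avpcc(edges, expression, min_degree=5):
    """Return {hub: avPCC} for nodes with at least min_degree partners."""
    partners = {}
    for u, v in edges:
        partners.setdefault(u, set()).add(v)
        partners.setdefault(v, set()).add(u)
    result = {}
    for node, neighbours in partners.items():
        if len(neighbours) < min_degree or node not in expression:
            continue
        # Average the PCC over all partners that have expression data.
        pccs = [pearson(expression[node], expression[p])
                for p in neighbours if p in expression]
        if pccs:
            result[node] = sum(pccs) / len(pccs)
    return result
```

Under the date/party hypothesis, hubs whose value falls below a chosen threshold would be labelled date hubs and those above it party hubs.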
The validity of the date/party hub distinction has since been debated in a sequence of papers, and there appears to be no consensus on the issue. Two established points of contention are: (1) is the distribution of hubs truly bimodal (as opposed to exhibiting continuous variation without clear-cut groupings), and (2) is the date/party distinction that was originally observed a general property of the interactome or an artefact of the employed data set? Different statistical tests have suggested seemingly different answers. However, despite (or in some cases due to) this ongoing debate, the hypothesis has been highly prominent in the literature. Here, following up on the work of Batada et al., we revisit the initial data and suggest additional problems with the statistical methodology that was employed. In accordance with their results, we find that the differing behaviour observed on the deletion of date and party hubs, which seemed to suggest that date hubs were more essential to global connectivity, was largely due to a very small number of key hubs rather than being a generic property of the entire set of date hubs. More generally, we use a complementary perspective to Batada et al. to define structural roles for hubs in the context of the modular organisation of protein interaction networks. Our results indicate that there is little correlation between expression avPCC and structural roles. In light of this, the more refined categorisation of date, party, and ‘family’ hubs, which was based on taking into account differences in expression variance in addition to avPCC, also appears inappropriate. A recent study by Taylor et al. argued for the existence of ‘intermodular’ and ‘intramodular’ hubs (a categorisation along the same lines as date and party hubs) in the human interactome.
We show that their observation of a binary hub classification is susceptible to changes in the algorithm used to normalise microarray expression data or in the kernel function used to smooth the histogram of the avPCC distribution. The data does not in fact display any statistically significant deviation from unimodality as per the DIP test , , as has already been observed by Batada et al. ,  for yeast data. We revisited the bimodality question because it was a key part of the original paper , and in particular because it made a reappearance in Taylor et al.  for human data. However, it is possible that a date-party like continuum may exist even in the absence of a bimodal distribution, and this is why we also attempt to examine the more general question of whether the network roles of hub proteins really are related to their co-expression properties with interaction partners.
Many real-world networks display some sort of modular organisation, as they can be partitioned into cohesive groups of nodes that have a relatively high ratio of internal to external connection densities. Such sub-networks, known as communities, often correspond to distinct functional units –. Several studies in recent years have considered the existence of community structure in protein-protein interaction networks –. A myriad of algorithms have been developed for detecting communities in networks , . For example, the concept of graph ‘modularity’ can be used to quantify the extent to which the number of links falling within groups exceeds the number that would be expected in an appropriate random network (e.g., one in which each node has the same number of links as in the network of interest, but which are randomly placed) . One of the standard techniques to detect communities is to partition a network into sub-networks such that graph modularity is maximised , .
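To make the notion of graph modularity concrete, the following is a minimal sketch that scores a given partition of an unweighted, undirected network. It assumes the standard degree-preserving null model; the `resolution` argument is an assumption added for illustration (a value of 1 gives ordinary modularity), and the function name is hypothetical. It only evaluates a partition and does not search for the optimal one.

```python
def modularity(edges, community, resolution=1.0):
    """Modularity of a partition: sum over communities of
    (fraction of links inside) - resolution * (fraction of link ends)^2."""
    m = len(edges)
    internal = {}    # community -> number of links inside it
    degree_sum = {}  # community -> total degree of its nodes
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - resolution * (d / (2 * m)) ** 2
               for c, d in degree_sum.items())
```

For two triangles joined by a single link, placing each triangle in its own community gives a modularity of 5/14, whereas placing all six nodes in one community gives 0.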
We use the idea of community structure to take a new approach to the problem of hub classification by attempting to assign roles to hubs purely on the basis of network topology rather than on the basis of expression data. Our rationale is that the biological roles of date and party hubs are essentially topological in nature and should thus be identifiable from the network alone (rather than having to be inferred from additional information). Once we have partitioned the network into a set of meaningful communities, it is possible to compute statistics to measure the connectivity of each hub both within its own community and to other communities. One method for assigning relevant roles to nodes in a metabolic network was formulated by Guimerà and Amaral , and we follow an analogous procedure for the hubs in our protein interaction networks. We then examine the extent to which these roles match up with the date/party hypothesis, finding little evidence to support it.
One might also wonder about the extent to which observed interactome properties are dependent on the particular instantiation of the network being analysed. Several papers have discussed at length concerns about the completeness and reliability (or lack thereof) of existing protein interaction data sets, e.g. –. Such data have been gathered using multiple methods, the most prominent of which are Y2H and affinity purification followed by mass spectrometry (AP/MS). (See the discussion in Materials and Methods.) In a recent paper, Yu et al. examined the properties of interaction networks that were derived from different sources, suggesting that experimental bias might play a key role in determining which properties are observed in a given data set . In particular, their findings suggest that Y2H tends to detect key interactions between protein complexes—so that Y2H data sets may contain a high proportion of date hubs (i.e., hubs with low partner co-expression)—whereas AP/MS tends to detect interactions within complexes, so that hubs in AP/MS-derived networks are predominantly highly co-expressed with their partners (i.e., these networks will contain party hubs). This indicates that a possible reason for observing the bimodal hub avPCC distribution  is that the interaction data sets used information that was combined from both of these sources. Here we compare several yeast interaction data sets and find both widely differing structural properties and a surprisingly low level of overlap.
Finally, as an alternative to the node-based date/party categorisation, we suggest thinking about topological roles in networks by defining measures on links rather than on nodes. In other words, one can attempt to categorise interactions between proteins rather than the proteins themselves. We use a well-known measure of link significance known as betweenness centrality ,  and examine its relation to phenomena such as protein co-expression and functional overlap. Here as well we find little evidence of a significant correlation with expression PCC of the interactors. However, there seems to be a reasonably strong relation between link betweenness and functional similarity of the interacting proteins, so that link-centric role definitions might have some utility.
In summary, we have examined the proposed division of hubs in the protein interaction network into the date and party categories from several different angles, demonstrating that prior arguments in favour of a date/party dichotomy appear to be susceptible to various kinds of changes in the data and methods used. Observed differences in network vulnerability to attacks on the two hub types seem to arise from only a small number of particularly important hubs. These results strengthen the existing evidence against the existence of date and party hubs. Furthermore, a detailed analysis of network topology, employing the novel perspective of community structure and the roles of hubs within this context, suggests that the picture is more complicated than a simple dichotomy. Proteins in the interactome show a variety of topological characteristics that appear to lie along a continuum, and there is no clear correlation between their location on this continuum and the avPCC of expression of their interaction partners. On the other hand, investigating link (interaction) betweenness centralities reveals an interesting relation to the functional linkage of proteins, suggesting that a more nuanced notion of roles for both nodes and links might provide a better framework for understanding the organisation of the interactome.
The definitions of date and party hubs are based on the expression correlations of hubs with their interactors in the protein interaction network . Specifically, the avPCC has been computed for each hub and its distribution was observed by Han et al.  to be bimodal in some cases. A date/party threshold value of avPCC (for a given expression data set) was defined in order to optimally separate the two types of hubs .
We have re-examined the data sets and analyses that were used to propose the existence and dichotomy of date versus party hubs. In the original studies on yeast data, any hub that exhibited a sufficiently high avPCC (i.e., any hub lying above the date/party threshold) on any one expression data set was identified as a party hub. Batada et al. noted that this definition causes the date/party assignment to be overly conservative, in that a hub's status is unlikely to change as a result of additional expression data. In fact, some of the original expression data sets were quite small, containing fewer than 10 data points per gene. This suggests that classification of proteins as ‘party’ hubs was based on high co-expression with partners for just a small number of conditions in a single microarray experiment, even though such co-expression need not have been observed in other conditions and experiments. For instance, Han et al. found 108 party hubs in their initial study. However, calculating avPCC across their entire expression compendium (rather than separately for the five constituent microarray data sets) and using the date/party threshold specified by the authors for this compendium avPCC distribution yields just 59 party hubs. Using only the “stress response” data set, which comprises over half of the data points in their compendium and is substantially larger than the other 4 sets, yields 74 party hubs. Thus, the results of applying this method to categorise hubs depend heavily on the expression data sets that one employs and are vulnerable to variability in smaller microarray experiments.
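The effect of the ‘any one data set’ rule can be seen in a minimal sketch. It assumes that avPCC values have already been computed separately for each expression data set; the function name, data-set names, and numbers in the usage note are hypothetical and purely illustrative.

```python
def party_hubs_any(avpcc_by_dataset, thresholds):
    """Hubs whose avPCC exceeds the threshold in at least one data set.

    avpcc_by_dataset: {data set name: {hub: avPCC}}
    thresholds:       {data set name: date/party threshold}
    """
    party = set()
    for name, avpcc in avpcc_by_dataset.items():
        for hub, value in avpcc.items():
            if value > thresholds[name]:
                party.add(hub)
    return party
```

With hypothetical values `{"large": {"X": 0.2, "Y": 0.1}, "small": {"X": 0.8, "Y": 0.1}}` and a threshold of 0.5 for both sets, hub X is labelled a party hub on the strength of the small data set alone. Adding further data sets can only enlarge the party set under this rule, never shrink it.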
Recent support for the idea of date and party hubs appeared in a paper that considered data relating to the human interactome; the authors found multimodal distributions of avPCC values, seemingly supporting a binary hub classification. We used an interaction data set provided by Taylor et al. (an updated version of the one used in their paper, sourced from the Online Predicted Human Interaction Database (OPHID); see Materials and Methods), and found that the form of the distribution of hub avPCC that they observed is not robust to methodological changes. For instance, raw intensity data from microarray probes has to be processed and normalised in order to obtain comparable expression values for each gene. The expression data used by Taylor et al. (taken from the human GeneAtlas) was normalised using the Affymetrix MAS5 algorithm; when we repeated the analysis using the same data normalised by the GCRMA algorithm (which is the preferred method to control for probe affinity) instead of by MAS5, we obtained significantly different results. Figure 1 depicts the avPCC distributions for hubs (defined as the top 15% of nodes by degree, corresponding in this case to degree 15 or greater) in the two cases. We obtained density plots for varying smoothing kernel widths. The GCRMA-processed data does not appear to lead to a substantially bimodal distribution at any kernel width, whereas the MAS5-processed data appears to give bimodality for only a relatively narrow range of widths and could just as easily be regarded as trimodal. We also used Hartigan's DIP test to check whether either of the two versions of the expression data gives a distribution of avPCC values showing significant evidence of bimodality. The DIP value is a measure of how far an observed distribution deviates from the best-fit unimodal distribution, with a value of 0 corresponding to no deviation. We used a bootstrap sample of 10,000 to obtain p-values for the DIP statistic.
We found no significant deviation from unimodality: for MAS5, the DIP value is (p-value ) and for GCRMA the DIP value is (p-value ). This suggests that the apparent bimodal or trimodal nature of some of the curves in Figure 1 is illusory and not statistically robust.
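The sensitivity of apparent multimodality to the smoothing kernel width can be illustrated with a minimal sketch. It assumes a Gaussian kernel evaluated on a regular grid and simply counts local maxima of the smoothed curve; this is not Hartigan's DIP test, and the function name and grid settings are hypothetical.

```python
import math

def kde_mode_count(values, bandwidth, grid_points=501):
    """Count local maxima of a Gaussian kernel density estimate."""
    lo = min(values) - 3 * bandwidth
    hi = max(values) + 3 * bandwidth
    step = (hi - lo) / (grid_points - 1)
    grid = [lo + i * step for i in range(grid_points)]
    # Unnormalised density: the scale does not affect the number of modes.
    density = [sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
               for x in grid]
    return sum(1 for i in range(1, grid_points - 1)
               if density[i] > density[i - 1] and density[i] >= density[i + 1])
```

The same toy sample can look bimodal with a narrow kernel and unimodal with a wide one, which is why a formal test of unimodality is needed in addition to visual inspection of density plots.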
We also find variability across different interaction data sets: For instance, we analysed the recent protein-fragment complementation assay (PCA) data set  and found no clear evidence of a bimodal distribution of hubs along date/party lines (data not shown). Even in cases where multimodality is observed, it might be arising as a consequence or artefact of combining different types of interaction data; there are believed to be significant and systematic biases in which types of interactions each data-gathering method is able to obtain , , . For instance, analysing avPCC values on the stress response expression data set  for hubs in networks obtained from Y2H or AP/MS alone , we find that 100% (259/259) are date hubs in the former but that only about 30% (56/186) are date hubs in the latter. At the moment it is reasonable to entertain the possibility that new kinds of interaction tests might smear out the observed bimodality; this appears to be the case with the PCA data set.
One of the key pieces of evidence used to argue that date and party hubs have distinct topological properties was the apparent observation of different effects upon their deletion from the network. Removing date hubs seemed to lead to very rapid disintegration into multiple components, whereas removal of party hubs had much less effect on global connectivity. However, it has been observed that removing just the top 2% of hubs by degree from the comparison of deletion effects obviates this difference, suggesting that the observation is actually due to just a few extreme date hubs. In order to study this in greater detail, and to isolate the extreme hubs, we used node betweenness centrality (see Materials and Methods), a standard metric of a node's importance to network connectivity (this need not be strongly correlated with degree). We found that in the original ‘filtered yeast interactome’ (FYI) data set, date hubs have on average somewhat higher betweenness centralities ( for 91 date hubs versus for 108 party hubs, two-sample t-test p-value ). However, there happens to be one date hub (SPC24/UniProtKB:Q04477, a highly connected protein involved in chromosome segregation) that has an exceptionally high betweenness () in this network. When the set of date hubs minus this one hub is targeted for deletion, we find that the observed difference between date and party hubs is greatly reduced (Figure 2(a)).
It was subsequently shown that the FYI network was particularly sparse; as more data became available, the updated filtered high-confidence (FHC) data set was used to perform the same analysis (we also looked at the Y2H-only and AP/MS-only networks; see Figure S1). In the case of FHC, the network did not break down on removing date hubs but nevertheless displayed a substantially greater increase in characteristic path length (CPL) than seen for party hub deletion. For FHC too, date hubs have, on average, higher betweenness values than party hubs ( for 306 date hubs versus for 240 party hubs, p-value ). However, the larger average is due almost entirely to a small number of hubs with unusually high betweennesses, as removing the top 10 date hubs by betweenness (which all had values higher than any party hub) greatly reduced the difference between the distributions (p-value ). Furthermore, the removal of just these 10 hubs from the set of targeted date hubs is sufficient to virtually obviate the difference with party hubs, as shown in Figure 2(b). Notably, the set of 10 high-betweenness hubs includes prominent proteins such as Actin (ACT1/UniProtKB:P60010), Calmodulin (CMD1/UniProtKB:P06787), and the TATA binding protein (SPT15/UniProtKB:P13393), which are known to be key to important cellular processes (Table 1). Thus, we can account for the critical nodes for network connectivity using just a few major hubs, and most of the proteins that are classified as date hubs appear to be no more central than the party hubs. High betweenness nodes have previously been referred to as bottlenecks and it has been suggested that these are in general highly central and tend to correspond to date hubs. However, the same sort of analysis on the Yu et al. data set once again revealed that only the top 0.5% or so of nodes by betweenness are truly critical for connectivity (data not shown).
Additionally, the 10 key hubs in the FHC network show a wide range of avPCC values (Table 1): high betweenness does not necessitate low avPCC. Similarly, we found no strong correspondence between bottleneck/non-bottleneck and date/party distinctions across multiple data sets. These observations further weaken the claim that there is an inverse relation between a hub's avPCC and its central role in the network.
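The deletion experiments discussed above can be sketched as follows for an unweighted, undirected network. As an assumption made here for illustration, the characteristic path length is taken as the mean shortest-path length over all pairs of nodes that remain connected; the function name is hypothetical.

```python
from collections import deque

def connectivity_after_removal(edges, removed=()):
    """Return (characteristic path length, number of components)
    after deleting the nodes in `removed`."""
    removed = set(removed)
    adj = {n: set() for e in edges for n in e if n not in removed}
    for u, v in edges:
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    total, pairs, components, seen = 0, 0, 0, set()
    for source in adj:
        # Breadth-first search gives shortest path lengths from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        if source not in seen:
            components += 1
            seen.update(dist)
        total += sum(dist.values())
        pairs += len(dist) - 1
    cpl = total / pairs if pairs else float("inf")
    return cpl, components
```

Calling this function with successively larger sets of targeted hubs, with and without the few highest-betweenness hubs, gives curves of the kind compared in the deletion analysis.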
In principle, one should be able to view a categorisation of hubs according to the date/party dichotomy directly in the network structure, as the two kinds of hubs are posited to have different neighbourhood topologies. We thus leave gene expression data to one side for the moment and focus on what can be inferred about node roles purely from network topology. Guimerà and Amaral  have proposed a scheme for classifying nodes into topological roles in a modular network according to their pattern of intramodule and intermodule connections. Their classification uses two statistics for each node—within-community degree and participation coefficient (a measure of how well spread out a node's links are amongst all communities, including its own)—and divides the plane that they define into regions encompassing seven possible roles (see Materials and Methods for details). We depict these regions in Figure 3, which shows the node roles for yeast (FHC ) and human (Center for Cancer Systems Biology Human Interactome version 1 (CCSB-HI1) ) data sets, which we computed based on communities detected by optimising modularity via the Potts method  (see Text S1 for details, and Figure S4 and Table S1 for indications of the structural and functional coherence of the communities, respectively). Also, when partitioning the network using this method, one can adjust the resolution to get more or fewer communities. In Figure S2, we show the results of this computation repeated for two other values of the resolution parameter. In each case, we obtain a similar pattern to the results shown here, and the conclusions below are valid across the multiple resolutions examined.
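The two statistics used in this role classification can be computed as in the following minimal sketch. It assumes an unweighted, undirected network without self-loops and a known community assignment; the use of the population standard deviation and a z-score of 0 for communities with no variation are assumptions made here, and the function name is hypothetical.

```python
import math

def node_role_statistics(edges, community):
    """Return {node: (within-community degree z-score, participation coefficient)}."""
    # links[n][c] = number of links from node n to nodes in community c
    links = {n: {} for n in community}
    for u, v in edges:
        links[u][community[v]] = links[u].get(community[v], 0) + 1
        links[v][community[u]] = links[v].get(community[u], 0) + 1
    within = {n: links[n].get(community[n], 0) for n in community}
    by_community = {}
    for n, c in community.items():
        by_community.setdefault(c, []).append(within[n])
    stats = {}
    for n, c in community.items():
        values = by_community[c]
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))
        z = (within[n] - mean) / sd if sd > 0 else 0.0
        k = sum(links[n].values())
        # Participation is 0 when all links stay in one community and
        # approaches 1 when links are spread evenly over many communities.
        p = 1.0 - sum((x / k) ** 2 for x in links[n].values()) if k else 0.0
        stats[n] = (z, p)
    return stats
```

Each node can then be placed on the plane spanned by these two statistics and assigned to one of the role regions.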
Some of the topological roles defined by this method correspond at least to some extent to those ascribed to date/party hubs. For instance, one might argue that party hubs ought to be ‘provincial hubs’, which have many links within their community but few or none outside. Date hubs might be construed as ‘non-hub connectors’ or ‘connector hubs’, both of which have links to several different modules; they could also fall into the ‘kinless’ roles (though very few nodes are actually classified as such). We thus sought to examine the relationship between the date/party classification and this topological role classification. In Figure 3, we colour proteins according to their avPCC. In Figure 4, we present the same data in a more compact form, as we only show the hubs (defined as the top 20% of nodes ranked by degree ) in the two interaction networks, plotting them according to node role and avPCC. The horizontal lines correspond to an avPCC of 0.5, which was the threshold used to distinguish date and party hubs in the yeast interactome .
One immediate observation from these results is that the avPCC threshold clearly does not carry over to the human data. In fact, all of the hubs in the latter have an avPCC of well below 0.5. Even if we utilise a different threshold in the human network, we find that there is little difference in the avPCC distribution across the topological roles, suggesting that no meaningful date/party categorisation can be made (at least for this data set). This might be the case because the human data set represents only a small fraction of the actual interactome. Additionally, it is derived from only one technique (Y2H) and is thus not multiply-verified like the yeast data set.
For yeast, we see that hubs below the threshold line (i.e., the supposed date hubs) include not only virtually all of those that fall into the ‘connector’ roles but also many of the ‘provincial hubs’. On the other hand, those that lie above the line (i.e., the supposed party hubs) include mainly the provincial hub and peripheral categories. Although one can discern a difference in role distributions above and below the threshold, it is not very clear-cut and the so-called date hubs fall into all 7 roles. It would thus appear that even for yeast, the distribution of hubs does not clearly fall into two types (the original statistical analysis has already been disputed by Batada et al. , ), and the properties attributed to date and party hubs  do not seem to correspond very well with the actual topological roles that we estimate here. Indeed, these roles are more diverse than what can be explained using a simple dichotomy.
It has been proposed that date and party hubs play different roles with respect to the modular structure of protein interaction data. As there are diverse examples of such data, one might ask to what extent entities like date and party hubs can be consistently defined across these. In order to investigate the extent of network overlap and the preservation of the interactome's structural properties (such as community structure and node roles) for different data sets and data-gathering techniques, we compared statistics and results for four different yeast interaction data sets: FYI, FHC, Database of Interacting Proteins core (DIPc), and PCA (see Table 2 and Materials and Methods for details of these). Our motivation for these choices of data sets (aside from PCA) was that they all encompass multiply-verified or high-confidence interactions. We also used PCA data because it is from the first large-scale screen with a new technique that records interactions in their natural cellular environment . For each data set, we counted the number of nodes and links in common using pairwise comparisons in the largest connected component of the network. For the overlapping portions, we then computed the extent of overlap in node roles and communities. For the latter, we employed the Jaccard distance , which ranges from 0 for identical partitions to 1 for entirely distinct ones (see Materials and Methods). In Table 3, we present the results of our binary comparisons of the yeast data sets.
Table 3 reveals that there are large variations amongst the different networks reported in the literature. FYI, FHC, and DIPc are all regarded as high-quality data sets, yet they contain numerous disparate interactions. PCA has a very low overlap with both FYI and DIPc (considered separately), suggesting that it provides data that is not captured by either Y2H or AP/MS screens. Such differences unsurprisingly lead to nodes having variable community structure between data sets. The Jaccard distance for each pairwise comparison amongst the 4 networks is around 0.8, so on average the intersection of communities for the same node covers only about a fifth of their union (for comparison purposes, communities are computed over the complete network in each case, and then each community is pruned to retain only those nodes also present in the other network). Because we compute topological node roles relative to community structure, it is not surprising that the role overlap is also not very high in any of the cases.
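The community comparison can be sketched as below. As an assumption based on the description above, the distance is computed for each shared node from the communities containing it in the two networks (each pruned to the shared nodes) and then averaged; the function name is hypothetical and the exact definition is the one given in Materials and Methods.

```python
def mean_jaccard_distance(partition_a, partition_b):
    """Average over shared nodes of 1 - |A & B| / |A | B|, where A and B
    are the node's communities in the two partitions."""
    shared = set(partition_a) & set(partition_b)
    if not shared:
        raise ValueError("the two partitions have no nodes in common")

    def groups(partition):
        # Prune each community to nodes present in both networks.
        out = {}
        for node in shared:
            out.setdefault(partition[node], set()).add(node)
        return out

    groups_a, groups_b = groups(partition_a), groups(partition_b)
    total = 0.0
    for node in shared:
        a = groups_a[partition_a[node]]
        b = groups_b[partition_b[node]]
        total += 1.0 - len(a & b) / len(a | b)
    return total / len(shared)
```

A value near 0.8 corresponds to the two communities of a typical node sharing only about a fifth of their combined membership.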
Given the above, it is difficult to make any general inferences regarding proteome organisation from results on existing protein interaction networks. They depend a great deal on the explored data set, which in each case represents only part of the total interactome and may also contain substantial noise.
Most research on interactome properties has focused on node-centric metrics, which draw on the perspective of individual proteins. Here we try an alternative approach that instead uses link-centric metrics in order to examine how the topological properties of interactions in the network relate to their function. In order to quantify the importance of a given link to global network connectivity, we use link betweenness centrality (see Materials and Methods). We investigate the relationship between link betweenness and the expression correlation for a given interaction. If date and party hubs genuinely exist, one might expect a similar sort of dichotomy for interactions, with more central interactions having lower expression correlations and vice versa. That is, given the hypothesised functional roles of date and party hubs, most intermodular interactions would connect to a date hub, whereas most intramodular interactions would connect to a party hub. In Figure 5, we depict all of the interactions in two yeast data sets, which we position on a plane based on the values of their link betweenness and interactor expression PCC (calculated using the stress response data set as before). Additionally, we colour each point according to the level of functional similarity between the interacting proteins, as determined by overlap in GO (Cellular Component) annotations (see Materials and Methods). We also obtain similar results using the other two GO ontologies, which are shown in Figure S3.
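Link betweenness can be computed as in the following minimal sketch, which uses breadth-first search with Brandes-style accumulation on an unweighted, undirected network. It assumes no self-loops and counts each pair of nodes once; the function name is hypothetical, and normalisation conventions may differ from those in Materials and Methods.

```python
from collections import deque

def edge_betweenness(edges):
    """Return {frozenset({u, v}): number of shortest paths through the link},
    with paths between a pair split equally when several are shortest."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    score = {frozenset(e): 0.0 for e in edges}
    for source in adj:
        dist = {source: 0}
        sigma = {source: 1.0}           # number of shortest paths from source
        preds = {n: [] for n in adj}    # predecessors on shortest paths
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies from the farthest nodes inwards.
        delta = {n: 0.0 for n in order}
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Each unordered pair was visited from both endpoints.
    return {link: value / 2.0 for link, value in score.items()}
```

For two triangles joined by a single link, the joining link lies on all 9 shortest paths between the two triangles, whereas a link inside a triangle carries far fewer.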
For the FHC data set, we find no substantial relation between expression PCC and the logarithm of link betweenness (linear Pearson correlation , z-score , p-value ). For the FYI data set, there is a larger correlation (, z-score , p-value ). Correspondingly, we observe a dense cluster of interactions in the top left (i.e., they have low betweennesses and high expression correlations), but most of these are interactions within ribosomal complexes. If one removes such interactions from the data set, then here too one finds only a small correlation (, z-score , p-value ) between expression PCC and (log of) link betweenness. (Note that ribosomal proteins were already removed from FHC.) On the other hand, we find a fairly strong correlation between link betweenness (on a log-scale) and similarity in cellular component annotations (which can be used as a measure of co-localisation): the PCC values are (z-score, p-value) for FYI and (z-score, p-value) for FHC (very similar values are obtained for the Spearman rank correlation coefficient: for FYI and for FHC). In particular, there appears to be a natural threshold at the modal value of betweenness. (As discussed in Materials and Methods, this is a finite-size effect.) This is somewhat reminiscent of the weak/strong tie distinction in social networks, as the ‘weak’ (high betweenness) interactions serve to connect and transmit information between distinct cellular modules, which are composed predominantly of ‘strong’ (low betweenness) interactions. For instance, we found that interactions involving kinases fall largely into the ‘weak’ category. Additionally, GO terms such as intracellular protein transport, GTP binding, and nucleotide binding were enriched significantly in proteins involved in high-betweenness interactions.
In this paper, we have analysed modular organisation and the roles of hubs in protein interaction networks. We revisited the possibility of a date/party hub dichotomy and found points of concern. In particular, claims of bimodality in hub avPCC distributions do not appear to be robust across available interaction and expression data sets, and tests for the differences observed on deletion of the two hub types have not considered important outlier effects. Moreover, there is considerable evidence to suggest that the observed date/party distinction is at least partly an artefact, or consequence, of the different properties of the Y2H and AP/MS data sets.
In order to study the topological properties of hub nodes in greater detail, we partitioned protein interaction networks into communities and examined the statistics of the distributions of hub links. Our results show that hubs can exhibit an entire spectrum of structural roles and that, from this perspective, there is little evidence to suggest a definitive date/party classification. We find, moreover, that expression avPCC of a hub with its partners is not a strong predictor of its topological role, and that the extent of interacting protein co-expression varies considerably across the data sets that we examined.
Additionally, a key issue with existing interaction networks is that they are incomplete. We have compared some of the available ‘high-quality’ yeast data sets and shown that they have very little overlap with each other. One can obtain protein interaction data using several experimental techniques, and each method appears to preferentially pick up different types of interactions. The only published interactome map of which we are aware that examines proteins in their natural cellular environment is largely disjoint from other data sets and shows little evidence of a date/party dichotomy. We find similar issues in human interaction data sets. A general conclusion about interactome properties is thus difficult to reach, as it would require robust results for a number of different species, which are unattainable at present due to the limited quantity and questionable quality of protein interaction and expression data.
As an alternative way of defining roles in the interactome, we have also investigated a link-centric approach, in which we study the topological properties of links (interactions) as opposed to nodes (proteins). In particular, we examined link betweenness centrality as an indicator of a link's importance to network connectivity. We found that this too does not correlate significantly with expression PCC of the interacting proteins. For certain data sets, however, it does appear to correlate quite strongly with the functional similarity of the proteins. Additionally, there appears to be a threshold value of betweenness centrality beyond which one observes a sudden drop in functional similarity. We also found that the high-betweenness interactions are enriched for kinase bindings and other kinds of interactions involved in signalling and transportation functions. This suggests that a notion of intramodular versus intermodular interactions, somewhat analogous to the weak/strong tie dichotomy in social networks, might be more useful. However, further work would be required to establish such a framework of elementary biological roles in protein interaction networks. As the quantity, quality, and diversity of protein interaction and expression data sets increase, we hope that this perspective will enhance understanding of the organisational principles of the interactome.
Several experimental methods can be used to gather protein interaction data. These include high-throughput yeast two-hybrid (Y2H) screening; affinity purification of tagged proteins followed by mass spectrometry (AP/MS) to identify associated proteins; curation of individual protein complexes reported in the literature; and in silico predictions based on multiple kinds of gene data. There is also a more recent technique, known as the protein-fragment complementation assay (PCA), which is able to detect protein-protein interactions in their natural environment within the cell. However, only one large-scale study has used this technique thus far. Each of these methods gives an incomplete picture of the interactome; for instance, a recent aggregation of high-quality Y2H data sets for Saccharomyces cerevisiae (the best-studied organism) was estimated to represent only about 20% of the whole yeast binary protein interaction network.
Each technique also suffers from particular biases. It has been suggested that Y2H is likely to report binary interactions more accurately, and (due to the multiple washing steps involved in affinity purification) it is also expected to be better at detecting weak or transient interactions. Converting protein complex data into interaction data is also an issue with AP/MS. This method entails using a ‘bait’ protein to ‘capture’ other proteins that subsequently bind to it to form complexes. Once one has obtained these complexes and identified their proteins using mass spectrometry, one can assign protein-protein interactions using either the spoke or the matrix model. The spoke model only counts interactions between the bait and each of the proteins captured by it, whereas the matrix model counts all possible pairwise interactions in the complex. Unsurprisingly, the actual topology of the complex is generally different from either of these representations. On the other hand, AP/MS is expected to be more reliable at finding permanent associations. Two-hybrid approaches also do not seem to be particularly suitable for characterising protein complexes, giving rise to the view that complex formation is not merely the superposition of binary interactions. Thus, the two major techniques appear to be disjoint and to cover different aspects of the interactome, and the differences between data sets from these sources perhaps correspond mostly to false negatives rather than false positives.
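The difference between the spoke and matrix models can be illustrated with a short Python sketch. The function names and the representation of an interaction as an unordered pair (a `frozenset`) are assumptions made for illustration only.

```python
from itertools import combinations

def spoke_model(bait, preys):
    """Interactions between the bait and each captured protein only."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}

def matrix_model(bait, preys):
    """All possible pairwise interactions among the complex members."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}
```

For a bait that captures three proteins, the spoke model yields 3 interactions and the matrix model yields 6, so the choice of model can change the apparent degree of every protein in the complex.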
Given these factors, choosing which data sets to use for building and analysing the network is itself a significant issue (see the discussion in the main text). For our analysis, we chose to work predominantly with networks consisting of multiply-verified interactions, which are constructed from evidence attained using at least two distinct sources. Such data sets are unlikely to contain many false positives, but might include many false negatives (i.e., missing interactions). In Table 2, we summarise the data sets that we employed. Here are additional details about how they were compiled:
Betweenness centrality is a way of quantifying the importance of individual nodes or links to the connectivity of a network. It is based on the notion of information flow in the network. The (geodesic) betweenness centrality of a node/link is defined as the number of pairwise shortest paths in the network that pass through that object. If there are multiple shortest paths between a pair of nodes, each one is given equal weight so that all of their weights sum to unity. Thus, the weighted count of all pairwise shortest paths passing through a given node/link equals its betweenness centrality.
For finite, sparse, unweighted networks such as the ones we study, one observes an interesting effect in the distribution of link betweenness centrality values. The distribution is almost normal, with the exception of a large spike at a value well above the mean (see the long vertical bar of points in the plots in Figure 5). This results from the large number of nodes with degree 1. The link that connects such a node to the rest of the network must have a betweenness of $N-1$, where $N$ is the total number of nodes in the network. Simply, this link must lie on the shortest paths that connect the degree 1 node to all of the other nodes, and it cannot lie on any other shortest paths. Thus, for our networks, the link betweenness centrality distribution shows a strong spike at a value of precisely $N-1$.
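The definition of link betweenness and the degree-1 effect described above can be checked with a small pure-Python sketch. It uses a Brandes-style accumulation over breadth-first searches, which is one standard way of counting weighted shortest paths; the text does not say which algorithm was actually used, and the function name and adjacency-dictionary input are assumptions for illustration.

```python
from collections import deque, defaultdict

def link_betweenness(adj):
    """Geodesic link betweenness for an undirected, unweighted graph.

    adj maps every node to the set of its neighbours. Each unordered
    pair of nodes contributes a total weight of 1, shared equally
    among its shortest paths.
    """
    bet = defaultdict(float)
    for source in adj:
        sigma = dict.fromkeys(adj, 0.0)  # number of shortest paths from source
        sigma[source] = 1.0
        dist = {source: 0}
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate from the farthest nodes inwards
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # Every unordered pair was counted once from each of its endpoints.
    return {link: value / 2.0 for link, value in bet.items()}
```

On a triangle with one extra degree-1 node attached (4 nodes in total), the link to the degree-1 node has betweenness 3, in line with the argument above.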
The within-community degree refers to the number of connections a node has within its own community. It is normalised here to a $z$-score, which for the $i$th node is given by the formula

$$z_i = \frac{\kappa_i - \bar{\kappa}_{s_i}}{\sigma_{\kappa_{s_i}}},$$

where $s_i$ denotes the community label of node $i$, $\kappa_i$ is the number of links of node $i$ to other nodes in the same community $s_i$, the quantity $\bar{\kappa}_{s_i}$ is the average of $\kappa$ for all nodes in community $s_i$, and $\sigma_{\kappa_{s_i}}$ is the standard deviation of $\kappa$ in community $s_i$. The participation coefficient $P_i$ of node $i$ measures how its links are distributed amongst different communities. It is defined as

$$P_i = 1 - \sum_{s=1}^{N_M} \left(\frac{\kappa_{is}}{k_i}\right)^2,$$

where $N_M$ is the number of communities, $\kappa_{is}$ is the number of links of node $i$ to nodes in community $s$, and $k_i$ is the total degree of node $i$. The participation coefficient approaches 1 if the links of node $i$ are uniformly distributed amongst all communities (including its own) and is 0 if they are all within its own community.
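A minimal Python sketch of these two quantities follows. It assumes the network is given as a dictionary of neighbour sets and the partition as a dictionary of community labels; both input formats and the function name are illustrative. It also assumes the population standard deviation and sets the z-score to 0 when all nodes in a community have the same within-community degree, choices that the text does not specify.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

def community_coordinates(adj, community):
    """Return {node: (z, P)}: within-community degree z-score and
    participation coefficient for every node.

    adj maps each node to the set of its neighbours; community maps
    each node to its community label.
    """
    kappa = {}          # links to nodes in the node's own community
    participation = {}
    for node, neighbours in adj.items():
        per_community = Counter(community[n] for n in neighbours)
        kappa[node] = per_community[community[node]]
        k = len(neighbours)
        if k:
            participation[node] = 1.0 - sum((c / k) ** 2 for c in per_community.values())
        else:
            participation[node] = 0.0

    by_community = defaultdict(list)
    for node, value in kappa.items():
        by_community[community[node]].append(value)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_community.items()}

    coords = {}
    for node in adj:
        mu, sd = stats[community[node]]
        # Assumption: z = 0 when the community has no spread in kappa.
        z = (kappa[node] - mu) / sd if sd > 0 else 0.0
        coords[node] = (z, participation[node])
    return coords
```

Communities to which a node has no links contribute zero to the sum, so iterating only over the communities of its neighbours gives the same participation coefficient as summing over all communities.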
In the main text, we plot all nodes in the network in a two-dimensional space using coordinates determined by within-community degree and participation coefficient, and we divide the space into regions that correspond to different node roles. The boundaries between regions are of course arbitrary, so for simplicity we have used the demarcations employed by Guimerà and Amaral. First, it is important to distinguish between ‘community hubs’ and ‘non-hubs’; the former are defined as those nodes with within-community degree $z \geq 2.5$. In this context, the term ‘hub’ is applied to nodes with high within-community degree $z$, so ‘non-hubs’ might have high overall degree. One can further partition both ‘community hubs’ and ‘non-hubs’ on the basis of the participation coefficient, which gives four non-hub roles and three hub roles.
We depict these 7 roles as demarcated regions in the plots in Figure 3.
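The role assignment can be sketched in Python as below. The numerical boundaries are the ones commonly quoted from Guimerà and Amaral's cartographic scheme; they are an assumption here, since they are not stated in the text above, and should be checked against that reference. The function name and role labels are likewise illustrative.

```python
ROLE_NAMES = {
    "R1": "ultra-peripheral node",
    "R2": "peripheral node",
    "R3": "non-hub connector",
    "R4": "non-hub kinless node",
    "R5": "provincial hub",
    "R6": "connector hub",
    "R7": "kinless hub",
}

def node_role(z, participation, hub_threshold=2.5):
    """Assign one of seven roles from the within-community degree
    z-score and the participation coefficient.

    Thresholds are assumed to follow Guimerà and Amaral.
    """
    if z < hub_threshold:              # non-hubs
        if participation <= 0.05:
            return "R1"
        if participation <= 0.62:
            return "R2"
        if participation <= 0.80:
            return "R3"
        return "R4"
    if participation <= 0.30:          # community hubs
        return "R5"
    if participation <= 0.75:
        return "R6"
    return "R7"
```

For example, `ROLE_NAMES[node_role(3.0, 0.5)]` gives `"connector hub"`.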
If one has two partitions of a given set of nodes, and a node $i$ is part of subset (or community) $A$ of nodes in one partition and part of subset $B$ in the other partition, then the Jaccard distance for node $i$ across the two partitions is defined as

$$J_i = 1 - \frac{|A \cap B|}{|A \cup B|}.$$

The symbols $\cap$ and $\cup$ correspond, respectively, to set intersection and union, and $|A|$ denotes the number of elements in set $A$. A Jaccard distance of 0 corresponds to identical communities, whereas the distance approaches 1 for very different communities. By averaging over all nodes in the set, we can get an estimate of the similarity of the two partitions.
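The per-node Jaccard distance and its average over all nodes can be sketched in Python as follows. Each partition is assumed to be a dictionary that maps a node to its community label, and both partitions are assumed to cover the same set of nodes; the function names are illustrative.

```python
from collections import defaultdict

def communities_by_node(partition):
    """Map each node to the full set of nodes in its community."""
    groups = defaultdict(set)
    for node, label in partition.items():
        groups[label].add(node)
    return {node: groups[label] for node, label in partition.items()}

def jaccard_distances(partition_a, partition_b):
    """Jaccard distance of every node across two partitions."""
    a = communities_by_node(partition_a)
    b = communities_by_node(partition_b)
    return {
        node: 1.0 - len(a[node] & b[node]) / len(a[node] | b[node])
        for node in partition_a
    }

def mean_jaccard_distance(partition_a, partition_b):
    """Average the per-node distances to compare two partitions."""
    distances = jaccard_distances(partition_a, partition_b)
    return sum(distances.values()) / len(distances)
```

Because every node belongs to its own community in both partitions, the intersection is never empty, so the distance is always strictly less than 1.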
In order to compute the functional similarity of two interacting proteins, we first define the set information content (SIC) of each term in our ontology for a given data set. Suppose the complete set of proteins is denoted by $S$, and the subset annotated by term $i$ is denoted by $S_i$. The SIC of the term $i$ is then defined as

$$\mathrm{SIC}(i) = -\log\left(\frac{|S_i|}{|S|}\right).$$

Now suppose that we have two interacting proteins called $A$ and $B$. Let $T_A$ and $T_B$, respectively, denote their complete sets of annotations (consisting of not only their leaf terms but also all of their ancestors) from the ontology. Then the functional similarity of the proteins is given by

$$f(A, B) = \frac{\sum_{i \in T_A \cap T_B} \mathrm{SIC}(i)}{\sum_{j \in T_A \cup T_B} \mathrm{SIC}(j)}.$$
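The following Python sketch shows one way to compute such a similarity. It assumes that the SIC of a term is the negative logarithm (base 10 here, although the base cancels in the ratio) of the fraction of proteins annotated by that term, and that the similarity is the summed SIC of the shared terms divided by the summed SIC of all terms carried by either protein. These forms, the function names, and the input format are assumptions for illustration and should be checked against the original definitions.

```python
import math

def set_information_content(term_to_proteins, n_proteins):
    """SIC of each term: -log10 of the fraction of proteins it annotates.

    term_to_proteins maps a term to the set of proteins annotated by it
    (annotations already propagated to ancestor terms). Every term is
    assumed to annotate at least one protein.
    """
    return {
        term: -math.log10(len(proteins) / n_proteins)
        for term, proteins in term_to_proteins.items()
    }

def functional_similarity(terms_a, terms_b, sic):
    """Ratio of summed SIC over shared terms to summed SIC over all terms."""
    total = sum(sic[t] for t in terms_a | terms_b)
    if total == 0:
        # Only uninformative terms (annotating every protein) are present.
        return 0.0
    return sum(sic[t] for t in terms_a & terms_b) / total
```

Under these assumptions, a term that annotates every protein has zero information content, so sharing only the ontology root contributes nothing to the similarity.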
Hub deletion effects for AP/MS-only and Y2H-only data sets.
(0.07 MB PDF)
Topological node role assignments and relation with avPCC.
(0.24 MB PDF)
Relating interaction betweenness, co-expression, and functional similarity. Plots show link betweenness centralities versus expression correlations, with points coloured according to average similarity of interactors' GO Biological Process (BP, above) and Molecular Function (MF, below) annotations, for two protein interaction data sets: FYI (778 nodes, 1,798 links) and FHC (2,233 nodes, 5,750 links). Pearson correlation coefficient values of log(link betweenness) with functional similarity are BP: −0.41 (z-score = −18.6, p-value = $3.9 \times 10^{-77}$) for FYI, −0.42 (z-score = −33.9, p-value = $4.7 \times 10^{-252}$) for FHC; MF: −0.39 (z-score = −17.3, p-value = $4.5 \times 10^{-67}$) for FYI, −0.31 (z-score = −24.7, p-value = $1.6 \times 10^{-134}$) for FHC.
(0.15 MB PNG)
Community structure in the largest connected component of the FYI network.
(0.07 MB PDF)
Evaluating community partitions.
(0.02 MB PDF)
Communities in the Interactome.
(0.05 MB PDF)
We thank Patrick Kemmeren for providing the yeast expression data sets; Nicolas Simonis for giving details of date/party hub classification; Jeffrey Wrana, Ian Taylor and Katherine Huang for the curated OPHID human interaction data, and GeneAtlas expression data; Pao-Yang Chen and Waqar Ali for pointing us to relevant online data repositories; Peter Mucha and Stephen Reid for useful discussions and help with some code; Anna Lewis for several useful discussions, as well as providing GO annotation data and MATLAB code for computing functional similarities; and George Nicholson and Max Little for useful discussions.
The authors have declared that no competing interests exist.
SA acknowledges a scholarship from the Clarendon Fund (http://www.clarendon.ox.ac.uk/). MAP acknowledges a research award (#220020177) from the James S. McDonnell Foundation (http://www.jsmf.org/). NSJ acknowledges support from the BBSRC (http://www.bbsrc.ac.uk/) and the EPSRC (http://www.epsrc.ac.uk/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.