Conceived and designed the experiments: WKK EMM. Performed the experiments: WKK. Analyzed the data: WKK EMM. Contributed reagents/materials/analysis tools: WKK. Wrote the paper: WKK EMM.
Proteins interact in complex protein–protein interaction (PPI) networks whose topological properties—such as scale-free topology, hierarchical modularity, and dissortativity—have suggested models of network evolution. Currently preferred models invoke preferential attachment or gene duplication and divergence to produce networks whose topology matches that observed for real PPIs, thus supporting these as likely models for network evolution. Here, we show that the interaction density and homodimeric frequency are highly protein age–dependent in real PPI networks in a manner which does not agree with these canonical models. In light of these results, we propose an alternative stochastic model, which adds each protein sequentially to a growing network in a manner analogous to protein crystal growth (CG) in solution. The key ideas are (1) interaction probability increases with availability of unoccupied interaction surface, thus following an anti-preferential attachment rule, (2) as a network grows, highly connected sub-networks emerge into protein modules or complexes, and (3) once a new protein is committed to a module, further connections tend to be localized within that module. The CG model produces PPI networks consistent in both topology and age distributions with real PPI networks and is well supported by the spatial arrangement of protein complexes of known 3-D structure, suggesting a plausible physical mechanism for network evolution.
Proteins function together forming stable protein complexes or transient interactions in various cellular processes, such as gene regulation and signaling. Here, we address the basic question of how these networks of interacting proteins evolve. This is an important problem, as the structures of such networks underlie important features of biological systems, such as functional modularity, error-tolerance, and stability. It is not yet known how these network architectures originate or what driving forces underlie the observed network structure. Several models have been proposed over the past decade—in particular, a “rich get richer” model (preferential attachment) and a model based upon gene duplication and divergence—often based only on network topologies. Here, we show that real yeast protein interaction networks show a unique age distribution among interacting proteins, which rules out these canonical models. In light of these results, we developed a simple, alternative model based on well-established physical principles, analogous to the process of growing protein crystals in solution. The model better explains many features of real PPI networks, including the network topologies, their characteristic age distributions, and the spatial distribution of subunits of differing ages within protein complexes, suggesting a plausible physical mechanism of network evolution.
Life is highly organized at all levels of molecules, cells, tissues, and organisms, and such relationships among biological entities are often represented as networks, with vertices representing e.g. genes or proteins, and edges representing e.g. physical protein interactions, transcriptional regulation, or metabolic reactions. The topology of biological networks shows many interesting characteristics, such as scale-free topology (power-law or broad degree distribution) and hierarchical modularity (reviewed in ). These properties are believed to be the basis of functional modularity, error-tolerance, and stability – characteristic of many biological networks.
One important question is thus how these important network architectures originate, and what driving forces underlie the observed networks. It has not been clear whether network architecture results from the mosaic sum of each gene or protein's inherent properties, such as stickiness or interactive promiscuity ,, or from a stochastic mechanism underlying network evolution, in which the trajectory of network evolution is conditioned on the previous state of the network . This problem has been of wide interest because it raises fundamental questions about design principles of molecular networks and the role of natural selection in the evolution of network structure .
Initially, Barabási and Albert proposed a preferential attachment rule as a general mechanism to generate scale-free networks . In this model, a newly introduced node is more likely to be attached to highly connected nodes, resulting in a power-law degree distribution. In a network of protein-protein interactions (PPI), gene duplication and divergence (DD) is most popularly thought of as the origin of the scale-free topology of protein interaction networks –. In the DD model, the degree of a node increases mainly by having duplicate genes as its neighbors. Therefore, the preferential attachment rule is achieved implicitly, with highly connected nodes having more chance to have duplicate genes as their neighbors . The DD model is also shown to generate hierarchically modular networks under certain conditions .
Although the DD model generates scale-free and modular networks, it has drawbacks that must be noted if it is to be considered a main mechanism for PPI network evolution. Primarily, only a small fraction of duplicate genes effectively contribute to the overall network topology. The key feature of the DD model originates from the fact that duplicate genes share a certain number of interaction partners. However, the interaction patterns of duplicate genes diverge rapidly , and the vast majority of gene duplicates are shown to share no interaction partners –. Some duplicates, in fact, may have diverged so extensively that they can no longer be detected by sequence homology. These distant duplicates would share even fewer interaction partners, and thus they are essentially indistinguishable from non-duplicate pairs in terms of interaction patterns.
To better understand the evolution of PPI networks, we analyzed a non-topological property—the age of each protein as estimated based upon the taxonomic distribution of its constituent domains ,—and observe that yeast PPI networks show a unique interaction density pattern between different protein age groups. The density pattern of the yeast PPI network was compared with those generated by canonical network evolution models—preferential attachment (the Barabási-Albert model), duplication-divergence (DD), and anti-preferential attachment (AP). Each model generates a unique interaction density pattern between the age groups; thus, the validity of the models could be effectively discriminated. Using this test, we observe that none of the canonical models are consistent with real yeast PPI networks. The age-dependent interaction density pattern nonetheless suggests growth by a stochastic process. We therefore propose an alternative model called the crystal growth (CG) model, which is based upon known physical and chemical principles and shows good agreement with real PPI networks in both topological and age properties as well as the 3-D subunit configurations of protein complexes.
First, we introduce the basic attachment rules of protein-protein interactions. The interaction densities, Dm,n, between two protein age groups (m,n) show unique patterns depending upon the attachment rule. Three basic rules are considered—random attachment (RA), preferential attachment (PA) by Barabási and Albert ,, and anti-preferential attachment (AP). Here, we consider three protein age groups (G1, G2, and G3, from oldest to youngest), and assume a fixed number of new connections (ΔE) are made between a newly introduced node and the existing nodes as a network grows.
In the RA model, a new node is randomly connected to existing nodes with equal probabilities. Initially, at time t=1, the first age group, G1, makes only intra-group connections. Then a new group, G2, is introduced and connected randomly either to G1 (inter-group) or within G2 (intra-group). In the RA model, the expected interaction density, D, is the same for D1,2 and D2,2. Similarly, G3 connects to G1, G2, and within G3, showing the pattern of D1,3=D2,3=D3,3. More generally, the RA model shows a pattern of Dm,n=Dm+1,n (m<n) (Figure 1A). In the PA model, new proteins are preferentially connected to highly connected nodes. Thus, G2 proteins are more likely to be linked to G1 than to G2 because G1 proteins have previously made connections and have a higher average degree. Likewise, G3 proteins are more likely to be connected to older groups, showing D1,3>D2,3>D3,3. Thus the typical pattern of the PA model is Dm,n>Dm+1,n (m<n) (Figure 1B). The AP model shows an inverse pattern to the PA model, Dm,n<Dm+1,n (m<n), because new nodes prefer to connect to less-connected nodes (Figure 1C).
As a measure of the age-dependency of interaction density, ΔD is defined as the average value of Dm+1,n - Dm,n (m<n) (see Methods). A positive ΔD indicates that protein interactions are more likely between similar age groups. The sign of ΔD effectively discriminates the models: it is negative in PA (where Dm,n>Dm+1,n), positive in AP (where Dm,n<Dm+1,n), and near zero in the RA model.
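The ΔD statistic can be illustrated with a short Python sketch. The exact normalization of Dm,n is given in Methods and is not reproduced here, so this sketch assumes that the density is the number of observed edges between two age groups divided by the number of possible protein pairs. The function names `interaction_density` and `delta_d` are hypothetical, and age groups are indexed from oldest (0) to youngest.

```python
def interaction_density(edges, group_of, n_groups):
    """Return D[m][n], the density of interactions between age groups m and n.

    Assumption (not taken from the paper's Methods): density is the number of
    observed edges divided by the number of possible protein pairs.
    """
    sizes = [0] * n_groups
    for g in group_of.values():
        sizes[g] += 1
    counts = [[0] * n_groups for _ in range(n_groups)]
    for u, v in edges:
        if u == v:
            continue  # self-interactions are not counted here
        m, n = sorted((group_of[u], group_of[v]))
        counts[m][n] += 1
    dens = [[0.0] * n_groups for _ in range(n_groups)]
    for m in range(n_groups):
        for n in range(m, n_groups):
            if m == n:
                pairs = sizes[m] * (sizes[m] - 1) // 2
            else:
                pairs = sizes[m] * sizes[n]
            d = counts[m][n] / pairs if pairs else 0.0
            dens[m][n] = dens[n][m] = d
    return dens


def delta_d(dens):
    """Average of D[m+1][n] - D[m][n] over all pairs with m < n."""
    k = len(dens)
    diffs = [dens[m + 1][n] - dens[m][n] for n in range(k) for m in range(n)]
    return sum(diffs) / len(diffs)
```

With groups ordered from oldest to youngest, a positive return value from `delta_d` corresponds to denser interactions between proteins of similar age, and a negative value to the PA-like pattern.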
We collected two independent sets of yeast PPIs, literature-curated (LC) and high-throughput (HTP), using the method of Batada et al. (Dataset S1 and Dataset S2) and inspected both the network topology and the age-dependency of interaction density. The numbers of nodes, N (proteins), and edges, E (interactions), in the LC and HTP networks are NLC=3268, ELC=12058 and NHTP=2488, EHTP=6766, respectively. The union (LC+HTP) of the two networks has 3780 nodes and 16505 edges. As the HTP and LC+HTP sets, as well as the original set by Batada et al., show characteristics highly similar to those of the LC set (Figure S2), we mainly discuss the LC data set as the yeast PPI network (PPIyeast) here. The recently compiled set (Y2H-union) by Vidal and colleagues from large-scale yeast two-hybrid experiments showed the same trend (Figure S2).
The PPIyeast recapitulates known topological features such as a scale-free degree distribution, hierarchical modularity, and degree-dissortative mixing, which were characterized by the various network property indices shown in the first column (PPI) in Figure 2 (summarized in Table S1). The probability of a node having degree k follows a scale-free or power-law degree distribution in the P(k) ~ k^(−γ) plot (row I in Figure 2). The PPIyeast is highly modular, with a high clustering coefficient, C, and a high modularity index, Q, as defined by Newman. In particular, the PPIyeast has a scaling property in the C(k) ~ k^(−β) plot (β>0), suggesting hierarchical modularity (row II in Figure 2). In a dissortative network, high-degree nodes (hubs) tend to connect with low-degree nodes and hub-hub interactions are suppressed, a pattern known as the Maslov-Sneppen rule. The degree-dissortativity was characterized by a negative correlation in the <knn>(k) ~ k^δ (δ<0) plot (row III in Figure 2), where <knn>(k) is the average degree of the nearest neighbors of the nodes with degree k.
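The three degree-dependent quantities described above, P(k), C(k), and <knn>(k), can be computed from an edge list as in the following sketch. It uses the standard definition of the local clustering coefficient (the fraction of a node's neighbor pairs that are themselves linked). The function names are hypothetical, and the fitting of the exponents γ, β, and δ is not shown.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are connected to each other."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def degree_profiles(adj):
    """Return {k: (P(k), C(k), <knn>(k))} for every degree k present."""
    by_degree = defaultdict(list)
    for node in list(adj):
        by_degree[len(adj[node])].append(node)
    n_nodes = len(adj)
    profiles = {}
    for k, nodes in by_degree.items():
        c_k = sum(local_clustering(adj, v) for v in nodes) / len(nodes)
        knn_k = sum(
            sum(len(adj[w]) for w in adj[v]) / k for v in nodes
        ) / len(nodes)
        profiles[k] = (len(nodes) / n_nodes, c_k, knn_k)
    return profiles
```

A dissortative network gives <knn>(k) values that decrease with k, and a hierarchically modular one gives C(k) values that decrease with k.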
Surprisingly, the interaction density of PPIyeast is also highly age-dependent. Yeast proteins were assigned to one of the age groups ABE, AE/BE, E and F depending on the taxonomic distribution of constituent domains among archaea (A), bacteria (B), eukaryotes (E) and fungi (F) (see Methods, Figure S1). We measured the interaction density between the age groups and observed a positive ΔD similar to that of the AP model (the row IV in Figure 2). The pattern of positive ΔD is highly robust to the source of data (LC, HTP and LC+HTP) and to the random addition or deletion of edges, e.g. by 50%. This suggests that the positive ΔD is a genuine feature of PPIyeast.
We next simulated PPI network evolution using the three canonical models, PA (preferential attachment), DD (duplication and divergence), and AP (anti-preferential attachment), and tested their compatibility with PPIyeast in terms of both topology and age-dependency. In all three models, the network starts from a small number (N0=4) of seed nodes, and new nodes are added one at a time until the total number of nodes reaches N=3,000, which is comparable to the PPIyeast (LC) with 3,268 nodes and 12,058 edges. In the PA and AP models, a fixed number of edges (ΔE=4) is added for each new node, which makes the final network size similar to the PPIyeast. The link probability (P) is proportional to the degree in the PA model (P ~ k) and inversely proportional to it in the AP model (P ~ k^(−1)). For the DD model, we employ one of the simplest models, by Vázquez et al.: One node (i) is duplicated randomly, the new node (i') is connected to all of the neighbors of i, and then the duplicates (i and i') are linked with a small probability p. For each neighbor (j) of the duplicates, one of the two links (i,j and i',j) is chosen randomly and deleted with the divergence probability q. Because this model may generate orphan nodes that are not connected to any other nodes, orphan nodes were removed in each duplication step.
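The growth rules described above can be sketched in Python as follows. This is a minimal illustration rather than the simulation code used for the study. The function names are hypothetical, the seed nodes are assumed to start fully connected so that every degree is non-zero, and the default values of p and q are placeholders, since the values used in the study are not stated in this section.

```python
import random


def seed_clique(n_seed):
    """Assumption: the seed nodes start as a fully connected clique."""
    return {i: {j for j in range(n_seed) if j != i} for i in range(n_seed)}


def grow_attachment(n_total, rule, n_seed=4, delta_e=4, rng=None):
    """Add nodes one by one; rule is 'PA' (P ~ k), 'AP' (P ~ 1/k) or 'RA'."""
    rng = rng or random.Random()
    adj = seed_clique(n_seed)
    for new in range(n_seed, n_total):
        existing = list(adj)
        if rule == "PA":
            weights = [len(adj[v]) for v in existing]
        elif rule == "AP":
            weights = [1.0 / len(adj[v]) for v in existing]
        else:
            weights = [1.0] * len(existing)
        targets = set()
        # Rejection sampling gives delta_e distinct partners.
        while len(targets) < min(delta_e, len(existing)):
            targets.add(rng.choices(existing, weights=weights)[0])
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
    return adj


def grow_dd(n_total, p=0.1, q=0.7, n_seed=4, rng=None):
    """Duplication-divergence growth following the steps described above.

    p and q are placeholder values chosen for illustration only.
    """
    rng = rng or random.Random()
    adj = seed_clique(n_seed)
    next_id = n_seed
    while len(adj) < n_total:
        i = rng.choice(list(adj))
        dup, next_id = next_id, next_id + 1
        adj[dup] = set(adj[i])            # the copy inherits all links of i
        for j in adj[dup]:
            adj[j].add(dup)
        shared = list(adj[dup])           # common neighbors before divergence
        if rng.random() < p:              # link the two duplicates
            adj[i].add(dup)
            adj[dup].add(i)
        for j in shared:                  # divergence: drop one of the two links
            if rng.random() < q:
                loser = rng.choice((i, dup))
                adj[loser].discard(j)
                adj[j].discard(loser)
        for v in (i, dup):                # remove orphan nodes
            if not adj[v]:
                del adj[v]
    return adj
```

For example, `grow_attachment(3000, "AP", rng=random.Random(0))` yields a network with 6 seed edges plus 4 edges for each of the 2,996 added nodes, which is close in size to the LC network.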
Surprisingly, none of the three models satisfied all of the characteristics of PPIyeast (the 2nd, 3rd and 4th columns in Figure 2 for the PA, DD and AP models, respectively). The PA and DD models generate scale-free networks and show degree-dissortativity, and the DD model also shows some degree of hierarchical modularity. However, both the PA and DD models show an inverse interaction density pattern with negative ΔD. In contrast, although the AP model shows positive ΔD similar to PPIyeast, it deviates greatly in terms of topological characteristics. That is, the PPIyeast seems to show mixed characteristics, with the network topology resembling that of the DD (PA) model but with the interaction density similar to that of the AP model. Also, all three models generally show much lower levels of modularity than the PPIyeast (the row II in Figure 2). We further examined two more variants of the DD model: one in which the divergence of edges between the duplicates is asymmetric (DDasym), by Ispolatov et al., and one that allows rewiring in addition to asymmetric divergence (DDasym-rw), by Pastor-Satorras et al. None of the tested DD variants was in good agreement with PPIyeast; all showed negative ΔD and a lower clustering coefficient. In yeast, whole genome duplication (WGD) occurred relatively recently after speciation of Kluyveromyces waltii and Saccharomyces cerevisiae. Simulation of WGD at the last stage of the DD model did not improve the model either (data not shown). As a global topological index, the shortest path length was also examined but provided little discrimination among the tested models due to high variability depending on model parameters (DD model) and the choice of yeast PPI data set. Each model was simulated 100 times and the summary of the network properties is given in Table S2.
While additional variants of each model might be considered, the critical characteristics of each model are largely captured by these canonical models, e.g. the DD model has no mechanism to generate positive ΔD. The inconsistency of these models with the age-dependent interaction density of real PPI networks clearly suggests that none of these canonical models is sufficient in itself to qualify as a valid model for the evolution of the yeast PPI network.
To better address both topological and age properties of real networks, we developed an alternative model for PPI network evolution called the crystal growth (CG) model, in which we view the growth of a PPI network as analogous to incorporating new proteins into crystals grown in solution (Figure 3A). The two key ideas are as follows. First, the connection probability increases with the availability of unoccupied surface, and thus the model follows an anti-preferential attachment rule (AP rule). Second, the connections of a new node tend to be limited to a single network module, as observed in growing crystals; we term this localized connection.
The procedure of the CG model is illustrated in Figure 3B. As in the PA and AP models, the CG model starts with a few seed nodes (N0=4), and a new node makes a fixed number of connections (here, ΔE=4) to existing nodes. For each new node added, network modules are redefined as local dense regions in the network. As modules emerge as a result of network growth and are not pre-defined artificially, the number of modules (M) is not fixed but may increase or decrease in each step. With a small probability Pnew, a new node becomes a new module by itself and makes ΔE connections to other nodes in accordance with the AP rule. Otherwise, an existing module is selected randomly, and the new node is committed to the module by making connections exclusively within the selected module. The connection takes two steps, dubbed “anchoring and extension”. In the anchoring step, the new node connects to an anchor node in the module in accordance with the AP rule, and then, in the extension step, the new node further connects only to the neighbors of the anchor node in the module. Connections are created randomly to neighboring nodes until ΔE connections are made. The anchoring and extension steps are analogous to the node e in Figure 3A (stage II). Therefore, the CG model is inherently highly module-oriented. If the anchor node has fewer than ΔE neighbors in the chosen module, the module selection and connection step is repeated until ΔE connections are made, and the new node thus becomes connected to multiple modules.
The CG model introduces two parameters: how the network modules are defined and how frequently a new module is created (Pnew). A network module is generally defined as a densely connected sub-network, and there are various ways to partition a network into modules. Most stringently, modules can be defined as complete subgraphs or cliques, and more loosely they can be defined as k-cores, triangularly connected components (TCC) and so on. We tested two different module definitions, one by Newman and the other by TCC. We mainly discuss the results obtained with the Newman definition, but results using TCC were highly similar (Figure S3). Also, Pnew was set to M^(−1) (that is, 1/M) because the chance of creating a new module generally decreases with the number of existing modules (M). Setting a small, fixed value of Pnew also shows a similar result (data not shown).
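The anchoring and extension procedure can be sketched in Python as below. This is a deliberately simplified illustration and not the model as simulated in the study: instead of redefining modules at every step with the Newman or TCC definition, the sketch assumes that each node stays in the module it first joins, so M can only increase. The names `ap_pick` and `grow_cg` are hypothetical, and the seed nodes are assumed to form one fully connected starting module.

```python
import random


def ap_pick(candidates, adj, rng):
    """Anti-preferential choice: probability proportional to 1/degree."""
    weights = [1.0 / max(len(adj[v]), 1) for v in candidates]
    return rng.choices(candidates, weights=weights)[0]


def grow_cg(n_total, n_seed=4, delta_e=4, rng=None):
    """Crystal-growth-style network growth with fixed module membership.

    Simplification (an assumption of this sketch): modules are tracked by
    membership at insertion time rather than re-detected after every step.
    """
    rng = rng or random.Random()
    adj = {i: {j for j in range(n_seed) if j != i} for i in range(n_seed)}
    modules = [set(range(n_seed))]
    for new in range(n_seed, n_total):
        existing = list(adj)
        need = min(delta_e, len(existing))
        targets = set()
        if rng.random() < 1.0 / len(modules):      # Pnew = 1/M
            # The new node founds its own module and connects by the AP rule.
            while len(targets) < need:
                targets.add(ap_pick(existing, adj, rng))
            modules.append({new})
        else:
            home = None
            while len(targets) < need:
                module = rng.choice(modules)
                pool = [v for v in module if v not in targets]
                if not pool:
                    continue
                # Anchoring: AP choice of an anchor inside the module.
                anchor = ap_pick(pool, adj, rng)
                targets.add(anchor)
                if home is None:
                    home = module
                # Extension: random neighbors of the anchor in the same module.
                nbrs = [v for v in adj[anchor]
                        if v in module and v not in targets]
                rng.shuffle(nbrs)
                targets.update(nbrs[:need - len(targets)])
            home.add(new)
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
    return adj, modules
```

Because membership is fixed in this sketch, it illustrates the attachment logic only. Reproducing the reported network properties would require the module redefinition step described in the text.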
Networks generated by the CG model show a remarkable similarity to real PPI networks for all tested network properties. A typical result of the CG model is shown in the 5th column in Figure 2. The topology of the CG model is scale-free, hierarchically modular, and degree-dissortative. Interestingly, both the magnitude and the shape of the clustering coefficient were similar to those of the PPIyeast in the C(k) ~ k plot (the row II in Figure 2). The CG model also shows a similar pattern of degree-dissortativity and interaction density with a positive ΔD (the row III and IV in Figure 2). These characteristics were robust to varying network sizes, e.g., N=1,000 and N=5,000 (data not shown).
The canonical models were shown to significantly deviate from the PPIyeast, but the CG model shows a good agreement not only qualitatively but also quantitatively (Figure 4). For objective comparison of the models, various indices were used to summarize the network characteristics, including power-law degree distribution (γ), hierarchical modularity (Q, C, C(k) ~ k curve shape and triangle density, T), dissortativity (δ), and the age-dependency of interaction density (ΔD).
DD and PA show an inverse age-dependency of PPIyeast and much less modularity in terms of clustering coefficient and triangle density although they show scale-free degree distributions (Figure 4B and 4C). The AP model was not able to generate a scale-free network and significantly deviates from the PPIyeast for all the network indices tested except ΔD (Figure 4B). Only the CG model was comparable to the PPIyeast in terms of all the network indices tested, including both scale-freeness (γ) and age-dependency (ΔD) (Figure 4D). In particular, only the CG model shows an extremely high degree of modularity comparable to the PPIyeast in terms of both clustering coefficient and triangle density due to its inherently module-oriented mechanism. The mixing exponent (δ) is intermediate between LC and HTP. Therefore, of all models considered, the CG model agrees best with both topological and age-dependencies of the actual yeast PPI network. In Table S2, the network property indices are summarized for all the models tested after 100 simulations of each model.
In the CG model, homodimers would be more frequent in older groups because there are simply fewer proteins with which to make connections in earlier stages. The age distribution of homodimeric interactions was exactly in the order of ABE>AE/BE>E>Fu among the 166 homodimeric yeast proteins collected from UniProt and the literature (Figure 5, Dataset S4). This result is also consistent with previous studies of protein 3-D structures, in which ancient proteins were shown to be highly enriched with homodimeric or paralogous interactions. Although the PA and AP models would also generate a similar trend, their resulting topology and/or interaction density deviate too greatly from PPIyeast for them to be considered realistic models. In the DD model, a fixed interaction probability, p, is set for interactions between duplicates (paralogs); the model therefore implicitly predicts that homodimer formation is age-independent, because most paralogous interactions originate from homodimeric interactions and were not created de novo. Thus, the age-dependency of homodimeric frequencies, which has not previously been applied as a criterion for valid network evolution models, provides good support for the CG model.
Within the sub-networks of known complexes from MIPS, protein subunits either tend to be connected among similar age groups, in agreement with the general tendency of positive ΔD in the full yeast PPI networks (Figures S4A and S4B), or consist mostly of the same age group, reflecting the creation of a new protein module in a certain evolutionary lineage, e.g. actin-associated proteins (Figure S4E). Other complexes form densely connected sub-networks in which age-dependency was not evident, e.g. RNA polymerase I and III (Figures S4C and S4D).
We further validated the CG model by inspecting the 3-D subunit arrangement of protein complexes according to age. Obviously, a protein subunit of a stable complex interacts mostly with the subunits of its participating complex. When a subunit is in contact with multiple other subunits in a protein complex, it is most likely that the partner subunits are spatially close, often interacting among themselves as well. For transient interactions, the member proteins can interact with fewer spatial constraints but the interactions are much denser within each biological module, e.g. as for a MAP kinase signaling pathway or transcription initiation complex. Therefore, a protein tends to interact in a highly “localized” manner within the biological modules it belongs to. None of the canonical models has such a module-oriented mechanism as the CG model. In the CG model, older subunits of protein complexes would tend to be more centrally located than younger ones because each protein is attached in the order of its age. Therefore, it is more likely that older subunits are aggregated centrally and younger subunits are scattered at the periphery in a protein complex.
To examine this trend among known protein complexes, we collected protein complexes from the Protein Databank (PDB) which consisted of at least 3 protein chains, with at least 2 age groups represented; these are stringent criteria that strongly limit the number of available complexes. After removing inappropriate complexes, such as non-protein structures, viral proteins, antibodies and small peptides, a non-redundant set of 12 multi-protein complexes was collected that met these criteria (detailed descriptions are in Methods).
In general, older subunits tend to be aggregated centrally (red tone), while younger ones are separated peripherally (green and blue) (Figure 6). In Figure 6A, older subunits form trimeric aggregates but younger ones were separated. There were four linear complexes and no younger subunit intervened between the older ones (Figure 6B–6E). That is, the contacts were always in e.g. the ABE-ABE-AE configuration but not the ABE-AE-ABE, as predicted by the CG model, in which ABE-ABE is connected first and ABE-AE later. The other three complexes contain trans-membrane helix bundles, where the younger helix chain is located at the periphery (Figure 6F–6H). Of the remaining four complexes, two had all subunits contacting each other and were thus non-informative (Figure 6I–6J), and two had ambiguous age assignments for subunits, although the putatively younger subunits were spatially separated (Figure 6K–6L). Considering the eight informative complexes (Figure 6A–6H), the observed subunit arrangements significantly support the CG model at P=0.019, based on random permutations of chain arrangements within the asymmetric unit of each complex.
It is notable that the total degree in PPIyeast is underestimated relative to the actual degree because homomeric interactions and subunit stoichiometry are not counted. For example, the APRIL-TACI complex (Figure 6A) has the form A3B3 with the degree kA=3 (two homomeric, one heteromeric) and kB=1 (one heteromeric). In contrast, only one interaction (A–B) would be counted for each subunit in PPIyeast.
The validity of network evolution models has been measured mainly by the resulting network topology, such as a power-law degree distribution, hierarchical modularity and dissortativity as observed in real PPI networks. Accordingly, the DD model has been thought of as the principal mechanism for PPI network evolution. Here, we dissect the history of PPI network evolution by inspecting several protein age-dependent patterns such as interaction density, homodimeric frequency, and the 3-D spatial arrangement of subunits within multiprotein complexes. The age-dependencies are shown to be very effective in discriminating the validity of different models as summarized in Table 1. The tested aspects of age-dependency were independent of topologies as well as of each other, and are thus highly useful as orthogonal criteria for valid models. Importantly, the age-dependent interaction patterns provided insights into PPI evolution, providing evidence against the DD model as the dominant mode of PPI network evolution and instead supporting an alternative model, the CG model.
In the CG model, we view the PPI network as sparse and dynamic protein crystals per se. The CG model mimics the process of growing protein crystals in solution by sequentially adding each protein. Despite the huge differences in time scale and heterogeneous composition, PPI network evolution likely obeys similar constraints on growing protein crystals. In the CG model, a protein complex or a tightly linked module is analogous to individual crystals, and the number and membership of modules are not pre-defined but rather emerge naturally in each growing step. Crystals grow around multiple nuclei just as protein networks consist of multiple modules/complexes. New modules are generated as the genome size increases and novel function evolves in higher organisms, in a manner similar to how a new crystal forms occasionally through new nucleation events.
The CG model exploits two key ideas, the first being that the chance of a new connection is proportional to the availability of free surface, which is a feature readily recognized by a new protein molecule; this results in an anti-preferential attachment (AP) rule. Although the same surface of a protein can be involved in multiple interactions with different partners through spatial and temporal differentiation, such a factor uniformly increases the capacity of interactions in any protein. Therefore, the connection probability is still positively correlated with the available surface area. These results agree with those of Kim et al., which show that the evolutionary rate is anti-correlated with available surface area. There, multi-interface hubs were nearly four times more frequent than single-interface hubs, reflecting the dominant connection mode of the AP rule. The second key idea is that once an initial connection is made, the subsequent connections are localized to the neighbors of the initial partner within the same module. This localized connection enforces high modularity, similar to that observed in real PPI networks.
At the basis of the crystal growth model is the notion that new interactions form preferentially within existing physical complexes (enforcing modularity), and thus are limited by available protein surface area (the AP rule). Thus modularity and the AP rule both arise from simple physical constraints on which proteins are most accessible to each other. Recently, Levy and colleagues have shown that the successive steps of homo-oligomeric assembly mimic the evolutionary pathway. The CG model expands this idea, with crystal growth reproducing the evolution of the entire PPI network.
Given that the CG model follows an AP rule, how does it generate scale-freeness or “the rich get richer” connectivity? In the CG model, the network grows by anchoring and extension, where a node increases its degree either by becoming an anchor node (anchoring) or by being the neighbor of the anchor node (extension). Therefore, the highly connected nodes have greater chances to increase their degree within each module because they have more opportunities to have anchors as their neighbors. Therefore, the CG model implicitly implements the preferential attachment (PA) rule within each module in a manner similar to the DD model, where the nodes increase their degree by having duplicating genes as their neighbors.
Our result suggests that the CG model is a more plausible mechanism for PPI network evolution than the DD model. First, all the age-dependent aspects tested agree well with the CG model but disagree with the DD model. Second, the CG model is more comprehensive than the DD model in that the CG model can accommodate both gene duplication and horizontal gene transfer as the origins of new nodes (genes). Practically, the DD model may be applicable only to ~20% of the yeast proteome having identifiable duplicates . The CG model also embodies the rapid divergence of gene duplicates  by the AP rule, which avoids competition for the same interface on common partners and connects to new partners with less occupied surfaces. Finally, the CG model is more robust than the DD model. The DD model shows a highly variable degree distribution depending upon parameters and network sizes ,. In contrast, the CG model shows stable characteristics regardless of network size or different module definition methods. Taken together, these strongly suggest that the DD model is unlikely to be the principal, and strongly unlikely to be the sole, mechanism of PPI network evolution.
The age-dependency of interaction density also sheds light on a more fundamental question regarding the mechanism of PPI network evolution. It has been hypothesized that inherent features of proteins, such as stickiness and hydrophobicity, are dominant factors in shaping the global network structure. However, the observed age-dependency is inconsistent with such a hypothesis and suggests that a stochastic process played a major role. For example, the yeast PPI network shows the patterns of both D_{ABE,AE/BE} > D_{ABE,E} and D_{AE/BE,Fu} < D_{E,Fu} (the row IV in Figure 2). The connection probability cannot depend solely upon a feature such as protein length or surface hydrophobicity because no single feature (F) can satisfy F_{AE/BE} > F_{E} (with common F_{ABE}) and F_{AE/BE} < F_{E} (with common F_{Fu}) simultaneously.
Power-law distributions have been commonly observed in various types of networks, such as the Internet, social networks, and biological networks. However, the growth of a PPI network poses unique constraints compared to other types of networks. For example, in an airline or railroad network, each new connection is made by considering the context of global network topology (e.g., to minimize average path length), which seems intuitively unlikely to be the case in PPI networks. The CG model follows two simple constraints of available free surface and localized connection, which are physically plausible and depend only on local context but not global topology. With these minimal assumptions analogous to growing protein crystals, the CG model recapitulates remarkably well the age-dependencies as well as the network topologies of the yeast PPI networks.
Two independent sets of yeast protein-protein interaction data were collected using a method essentially identical to that described by Batada et al., differing only in that the HTP set was collected from the original publications instead of from BioGrid. We compiled the HTP set from Uetz et al., Ito et al., the merged set of Gavin et al., Ho et al., and Krogan et al., and then filtered out the interactions supported by only a single experiment. Repeated and reciprocal assays were considered independent experiments even if they were performed in the same publication. The LC data set was collected from the latest release of BioGrid, excluding high-throughput data. Ribosomal proteins were removed from both the LC and HTP data sets. All protein-RNA interactions and interactions supported only by co-localization or co-fractionation were removed. We further removed interactions supported only by Ptacek et al., Grandi, Collins et al., or Fields et al.
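The single-experiment filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `(bait, prey, assay_id)` record format, the function name, and the protein labels are hypothetical. Each distinct record counts as one independent experiment, so a reciprocal assay or a repeat adds support even within one publication.

```python
from collections import defaultdict


def filter_single_evidence(records, min_support=2):
    """Return the unordered protein pairs seen in at least `min_support` assays.

    `records` is an iterable of (bait, prey, assay_id) tuples (hypothetical
    format). The bait/prey direction is kept inside the evidence key, so the
    reciprocal assay B-A counts separately from A-B.
    """
    support = defaultdict(set)
    for bait, prey, assay_id in records:
        pair = frozenset((bait, prey))            # A-B and B-A map to one pair
        support[pair].add((bait, prey, assay_id))  # one entry per distinct assay
    return {pair for pair, assays in support.items() if len(assays) >= min_support}


records = [
    ("P1", "P2", "screen_a"),  # forward assay
    ("P2", "P1", "screen_a"),  # reciprocal assay in the same screen
    ("P1", "P3", "screen_b"),  # seen only once, so it is dropped
]
kept = filter_single_evidence(records)
```

Here `kept` contains only the P1-P2 pair, because P1-P3 is supported by a single record.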
Pfam domains were assigned for yeast proteins using BioMart (http://www.biomart.org). The taxonomic distributions of Pfam domains were obtained for archaea (A), bacteria (B), eukaryotes (E), and fungi (F) (http://www.sanger.ac.uk/Software/Pfam). According to these distributions, each Pfam domain was assigned to one of the age groups ABE, AE/BE, E, and F. The group ABE includes the oldest proteins common to all three kingdoms, while group F is the youngest, being specific to fungi. As yeast is a eukaryote, groups A, B, and AB do not occur. A protein's age group was assigned as the youngest age of its constituent Pfam domains—e.g., E for a protein with domains from ABE and E (Dataset S3, Figure S1).
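The assignment rule can be illustrated with a short sketch. The boolean inputs and function names are assumptions made for illustration. In particular, distinguishing E from F by whether a domain occurs in non-fungal eukaryotes is our reading of "specific to fungi", and every domain is assumed to occur in yeast itself.

```python
AGE_ORDER = ["ABE", "AE/BE", "E", "F"]  # oldest to youngest


def domain_age(in_archaea, in_bacteria, in_nonfungal_eukaryotes):
    """Map a Pfam domain's taxonomic distribution to an age group.

    The domain is assumed to be present in yeast, so eukaryotes (and fungi)
    are always covered; the flags describe where else it is found.
    """
    if in_archaea and in_bacteria:
        return "ABE"
    if in_archaea or in_bacteria:
        return "AE/BE"
    if in_nonfungal_eukaryotes:
        return "E"
    return "F"


def protein_age(domain_ages):
    """Return the youngest age group among a protein's domains, or None."""
    if not domain_ages:
        return None  # protein without an assigned Pfam domain
    return max(domain_ages, key=AGE_ORDER.index)


example = protein_age([domain_age(True, True, True), domain_age(False, False, True)])
```

The example reproduces the case given in the text: a protein with one ABE domain and one E domain is assigned to group E.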
Interaction density $D_{m,n}$ measures the normalized interaction density between two age groups $m$, $n$ ($m<n$). ΔD measures the interaction preference of a new node by the age differences. A positive value of ΔD indicates that a new node makes connections more frequently with close age groups than with distant ones.
First, the normalized interaction density $D_{m,n}$ between two age groups $m$, $n$ ($m<n$) is calculated as
where $l_{m,n}$ is the number of edges between the two age groups $m$ and $n$, and $E_{m,n}$ is the number of all possible interactions between the two groups. $N_m$ and $N_n$ are the number of nodes in the age groups $m$ and $n$, respectively, $L$ is the total number of edges, and $N$ is the total number of nodes in the network. Then the average interaction density gradient, ΔD, of a network is defined as
where G (G≥2) is the number of age groups.
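A normalized between-group density can be sketched from the symbol definitions above. The exact normalization is an assumption here: the sketch takes the number of possible interactions between two distinct groups to be N_m × N_n and divides the observed fraction by the global edge density L / (N(N−1)/2). It does not attempt to compute ΔD. Function and variable names are illustrative.

```python
from collections import Counter
from itertools import combinations


def interaction_density(edges, age_of, groups):
    """Return {(m, n): density} for each pair of distinct age groups.

    edges:  list of (u, v) pairs, assumed unique and without self-loops
    age_of: dict mapping every node to its age group
    groups: age groups ordered from oldest to youngest
    Assumed form: (l_mn / (N_m * N_n)) / (L / (N * (N - 1) / 2)).
    """
    group_size = Counter(age_of.values())
    n_nodes, n_edges = len(age_of), len(edges)
    global_density = n_edges / (n_nodes * (n_nodes - 1) / 2)

    between = Counter()  # edge counts between distinct age groups
    for u, v in edges:
        if age_of[u] != age_of[v]:
            between[frozenset((age_of[u], age_of[v]))] += 1

    density = {}
    for m, n in combinations(groups, 2):
        possible = group_size[m] * group_size[n]
        if possible:  # skip pairs involving an empty group
            density[(m, n)] = (between[frozenset((m, n))] / possible) / global_density
    return density


age_of = {"a": "ABE", "b": "ABE", "c": "E", "d": "E"}
edges = [("a", "c"), ("b", "d"), ("a", "b")]
density = interaction_density(edges, age_of, ["ABE", "E"])
```

Under this normalization, a value of 1 means the two groups interact exactly as often as a randomly chosen pair of nodes in the network.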
The modularity index Q of a given partition of the network into modules is calculated as
$$Q = \sum_{s=1}^{M}\left[\frac{l_s}{L} - \left(\frac{d_s}{2L}\right)^{2}\right]$$
where $M$ is the total number of modules, $L$ is the total number of edges in the network, $l_s$ is the number of edges within module $s$, and $d_s$ is the sum of the degrees of the nodes in module $s$. The modularity index Q measures the difference between the intra-module interaction density and the expected interaction density at random for a given partition, where Q≈0 for a random network and Q=1 for a completely modular network.
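The modularity index can be computed directly from an edge list and a module assignment. The sketch below assumes the standard form in which Q is the sum over modules of l_s/L − (d_s/2L)², which matches the symbols defined above. The function name and input format are illustrative.

```python
from collections import defaultdict


def modularity(edges, module_of):
    """Compute Q = sum_s [ l_s / L - (d_s / (2 L))**2 ] for a partition.

    edges:     list of (u, v) pairs, assumed unique and without self-loops
    module_of: dict mapping every node to its module label
    """
    n_edges = len(edges)
    within = defaultdict(int)      # l_s: edges with both ends in module s
    degree_sum = defaultdict(int)  # d_s: summed degree of nodes in module s
    for u, v in edges:
        degree_sum[module_of[u]] += 1
        degree_sum[module_of[v]] += 1
        if module_of[u] == module_of[v]:
            within[module_of[u]] += 1
    return sum(
        within[s] / n_edges - (degree_sum[s] / (2 * n_edges)) ** 2
        for s in degree_sum
    )


# Two triangles joined by a single edge, split into their natural modules.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
module_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
q = modularity(edges, module_of)
```

For this toy graph each module has l_s = 3 and d_s = 7 with L = 7, giving Q = 2 × (3/7 − 1/4) = 5/14. Placing every node in one module gives Q = 0.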
The list of PDB entries and their 3-D coordinates were obtained from PQS (Protein Quaternary Structure Server, ftp://ftp.ebi.ac.uk/pub/databases/msd/pqs). First, we took the PDB entries having three or more protein chains. PDB entries annotated by PQS as crystal packing interfaces, or determined by methods other than X-ray crystallography, were excluded.
The protein chain clusters at 30% sequence identity cut-off were downloaded from PDB (Protein Data Bank, ftp://ftp.wwpdb.org). PDB entries consisting of the same set of NR30 clusters were grouped together regardless of the number of chains, and one representative PDB entry was selected from each group to form the set of NR30 entries.
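The grouping step can be sketched as below. The input dictionaries and entry identifiers are hypothetical. The text does not say how the representative was chosen, so the sketch simply takes the lexicographically smallest entry ID as a placeholder.

```python
from collections import defaultdict


def select_nr30_representatives(entry_chains, cluster_of):
    """Group PDB entries by their set of NR30 clusters and pick one per group.

    entry_chains: {entry_id: [chain_id, ...]}
    cluster_of:   {(entry_id, chain_id): cluster_id}
    Returns {frozenset_of_cluster_ids: representative_entry_id}.
    """
    groups = defaultdict(list)
    for entry_id, chains in entry_chains.items():
        # A set ignores how many chains fall in each cluster, so complexes
        # with different chain counts but the same clusters are merged.
        key = frozenset(cluster_of[(entry_id, chain)] for chain in chains)
        groups[key].append(entry_id)
    # Placeholder choice: smallest entry ID (the actual criterion is not given).
    return {key: min(ids) for key, ids in groups.items()}


entry_chains = {"E1": ["A", "B", "C"], "E2": ["A", "B", "C", "D"], "E3": ["A", "B", "C"]}
cluster_of = {
    ("E1", "A"): 10, ("E1", "B"): 11, ("E1", "C"): 11,
    ("E2", "A"): 10, ("E2", "B"): 10, ("E2", "C"): 11, ("E2", "D"): 11,
    ("E3", "A"): 10, ("E3", "B"): 12, ("E3", "C"): 12,
}
representatives = select_nr30_representatives(entry_chains, cluster_of)
```

In this example E1 and E2 share the cluster set {10, 11} despite different chain counts, so only E1 is kept for that group, while E3 forms its own group.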
For the NR30 entries, the age group of each PDB chain was assigned using BLAST against the NR90 set of archaeal, bacterial, and eukaryotic sequences from UNIPROT (ftp://ftp.uniprot.org/pub/databases/uniprot) using >30% identity and >30 alignment length as criteria. We took only the PDB entries consisting of two or more protein age groups and further applied a number of filters manually, excluding entries with DNA, RNA, viral proteins, small peptides (<30 amino acids), and immunoproteins such as antibodies and MHCs with antigens. Where available, ambiguous quaternary structures were removed by comparing the data from PQS, PDB biological units, and 3D complex databases.
(0.20 MB TXT)
(0.11 MB TDS)
The age group assignment of yeast genes
(0.08 MB TDS)
The list of homodimeric proteins and their age group assignment
(0.01 MB TDS)
The protein ratio of different age groups in yeast PPI networks. LC: literature-curated, HTP: high-throughput, LC+HTP: the union of LC and HTP.
(0.08 MB PDF)
The network properties of the HTP, LC+HTP, and Y2H-union datasets. The plots in each row, I-IV, indicate (I) the degree distribution P(k), (II) the clustering coefficient C(k), (III) the average degree of nearest neighbors <knn>(k), and (IV) the interaction density pattern (ΔD) between protein age groups. The HTP, LC+HTP, and Y2H-union sets show characteristics similar to those of the LC dataset.
(0.29 MB PDF)
The network properties produced by the CG model when the network modules were defined by TCC (triangularly connected components) instead of Newman's method. The network structure is still similar to the yeast PPI networks, showing scale-free, hierarchical modular, degree-dissortative characteristics and an interaction density pattern of ΔD>0. (A) The degree distribution P(k), (B) the clustering coefficient C(k), (C) the average degree of nearest neighbors <knn>(k), (D) the interaction density pattern between protein age groups.
(0.09 MB PDF)
Age-dependent interaction patterns of several MIPS complexes in the LC+HTP set. In mRNA splicing (A) and replication (B) complexes, the subunits of the same age group are more likely to be connected. In RNA polymerase I & III (C and D), most subunits are densely connected to each other, therefore age-dependency is not evident. In the case of actin-associated proteins, most subunits are of the same age group (E), reflecting a relatively recently emerged module.
(0.52 MB PDF)
The network characteristics of the yeast PPI data.
(0.06 MB PDF)
The network characteristics of the network growth models
(0.13 MB PDF)
The authors have declared that no competing interests exist.
This work was supported by grants from the N.S.F. (IIS-0325116), N.I.H. (GM06779-01), Welch (F1515), and a Packard Fellowship (EMM).