PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of plosonePLoS OneView this ArticleSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)
 
PLoS One. 2012; 7(8): e41474.
Published online Aug 13, 2012. doi:  10.1371/journal.pone.0041474
PMCID: PMC3418285
A Network Synthesis Model for Generating Protein Interaction Network Families
Sayed Mohammad Ebrahim Sahraeian and Byung-Jun Yoon*
Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, United States of America
Franca Fraternali, Editor
King's College London, United Kingdom
* E-mail: bjyoon/at/ece.tamu.edu
Competing Interests: The authors have declared that no competing interests exist.
Conceived and designed the experiments: SMES BJY. Performed the experiments: SMES. Analyzed the data: SMES BJY. Wrote the paper: SMES BJY.
Received December 20, 2011; Accepted June 27, 2012.
In this work, we introduce a novel network synthesis model that can generate families of evolutionarily related synthetic protein–protein interaction (PPI) networks. Given an ancestral network, the proposed model generates the network family according to a hypothetical phylogenetic tree, where the descendant networks are obtained through duplication and divergence of their ancestors, followed by network growth using network evolution models. We demonstrate that this network synthesis model can effectively create synthetic networks whose internal and cross-network properties closely resemble those of real PPI networks. The proposed model can serve as an effective framework for generating comprehensive benchmark datasets that can be used for reliable performance assessment of comparative network analysis algorithms. Using this model, we constructed a large-scale network alignment benchmark, called NAPAbench, and evaluated the performance of several representative network alignment algorithms. Our analysis clearly shows the relative performance of the leading network algorithms, with their respective advantages and disadvantages. The algorithm and source code of the network synthesis model and the network alignment benchmark NAPAbench are publicly available at http://www.ece.tamu.edu/bjyoon/NAPAbench/.
Protein-protein interactions (PPIs) lie at the core of a wide range of biological processes in cells, including transcriptional, signaling, and metabolic processes [1]. Recent technological advances have enabled the high-throughput measurement of these interactions in various species [2][4], and a variety of computational methods have been developed for in-silico prediction of protein interactions [5][8]. Availability of large-scale protein interaction data, typically represented as networks of interacting proteins, has opened up new ways for the systematic study of biological networks. Especially, cross-species comparison of genome-scale PPI networks can provide important insights into the structure and organization of biological networks, as well as important similarities and variations across different species [9]. In recent years, a large number of computational methods have been developed for comparative analysis of biological networks, where their main focus has been on the identification of functional modules that are conserved in the networks of multiple species [10][39]. These methods can be broadly divided into two categories, namely, network querying and network alignment. Network querying aims to identify subnetwork regions in the network of a target species that are similar to a small subnetwork of another species, used as query [32][39]. For example, this could be used for querying a known functional pathway in a well-studied species to identify putative homologous pathways in different species, thereby allowing knowledge transfer across species. Network alignment can be viewed as a generalization of network querying, and it aims to predict the best mapping between a set of networks, based on the similarity of the constituent molecules and their interaction patterns [10][31]. Network alignment methods may be used to analyze the cross-species variations of biological networks, to predict conserved functional modules, or to infer the function of unannotated proteins.
Research in comparative network analysis is still at an early stage, but many existing studies have demonstrated its potential as an effective tool for gaining important insights into biological systems, that would be otherwise difficult to obtain.
Unfortunately, further advance in comparative network analysis research is critically impeded by the lack of a gold standard for evaluating network alignment algorithms. Currently, there is no comprehensive and reliable benchmark dataset that can be used for validating these algorithms [12]. For this reason, it is common practice to assess the performance of network alignment algorithms in indirect ways, for instance, based on the functional coherence of the aligned nodes in the predicted network alignment or simply through anecdotal examples. Functional annotations based on Gene Ontology (GO) [40] or KEGG orthology (KO) [41] are often employed for this purpose. However, these annotations are mainly curated based on the sequence similarity between molecules, hence they may fail to effectively capture the actual functional coherence between the molecules [28], [42]. Considering that network alignment aims to incorporate molecular interaction data with sequence data to make predictions that are biologically more relevant, evaluating network alignment algorithms based on annotations that are strongly influenced by sequence similarity is certainly less than ideal. Besides, currently available protein interaction databases, such as BioGRID [43], MIPS [44], DIP [45], IntAct [46], MINT [47], and Human Protein Reference Database (HPRD) [48], include the protein interaction networks for only a few species, where the interaction data are very incomplete even for meta-databases – such as PINA [49] and APID [50] – that have been constructed by integrating multiple databases. For example, BioGRID v. 3.1.82 (November 2011), which is one of the most comprehensive among the existing PPI databases, contains the PPI networks of just 25 organisms, where the networks of 7 organisms – A. thaliana, C. elegans, D. melanogaster, H. sapiens, M. musculus, S. cerevisiae, and S. pombe – include more than few hundred interactions. It is widely suspected that a significant number of interactions in the current PPI networks may be spurious, while many true interactions may be still missing. As discussed in [51], based on the analysis of synthetic networks, incomplete knowledge poses a major challenge for interactome-level comparison between different species.
Considering the incompleteness of the current PPI networks, as well as the difficulty of accurately assessing the functional correspondence between proteins, a network synthesis model that can generate families of protein interaction networks with biologically realistic properties may provide a practical and effective alternative. Recently, Ali and Dean [51] have performed a simulation-based study, where a pair of evolutionary related synthetic networks were analyzed to investigate the source of low level of interaction conservation in network alignment results. Erten et al. [52] also proposed a simulation scheme for generating a set of networks with known phylogeny, where the driving motivation was to evaluate the accuracy of their network-based phylogeny reconstruction algorithm. These studies [51], [52] serve as interesting showcases of the important role of synthetic network models. However, these models have also a number of practical limitations. For example, the model presented in [51] cannot be used to synthesize a network family with an arbitrary phylogeny. Furthermore, both models in [51] and [52] do not explicitly represent the functional correspondence between individual proteins across different networks, which is indispensable for evaluating the accuracy of network alignment algorithms.
In this paper, we present a general network synthesis model that can effectively address these issues. Following a pre-specified phylogenetic tree, the model can generate a family of evolutionarily related protein interaction networks, whose properties closely mimic those of real networks – in terms of both the internal properties of the individual networks as well as the comparative properties across networks – as will be shown in our analysis. By internal network properties, we refer to the local characteristics (such as the node degree and the clustering coefficient) and their distributions over each network, which are important in understanding the overall topology. On the other hand, by comparative or cross-network properties, we refer to the properties that can be estimated through network comparison (e.g., sequence similarity between proteins that belong to different networks) and reflect the similarity (or the lack thereof) between networks, which arise from their evolutionary relationship. To demonstrate the utility of the network synthesis model, we created a comprehensive network alignment benchmark based on the proposed model and carried out an extensive performance analysis of select state-of-the-art network alignment algorithms.
Network Growth Models
In this section, we briefly review existing network growth models that aim to computationally simulate the evolutionary growth of a single biological network. Recently, there has been significant interest in developing network growth models [53][70] that can capture the characteristics of real biological networks, including PPI networks. As pointed out in [71], PPI networks do not follow the Erdös-Rényi's model for random graphs. Instead, the structure of biological networks appears to be governed by a scale-free degree distribution, which is also the case for social networks. The scale-free model suggests that the probability that a given node will have a degree (i.e., number of edges) of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e001.jpg follows a power-law An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e002.jpg, for some degree exponent An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e003.jpg. In general, a scale-free network possesses a few highly connected nodes (often referred as hubs), while the rest of the nodes have only a relatively small number of connections. This trend is generally observed in many PPI networks, which can be explained at a molecular level, at least in part, by the different degrees of protein binding specificity – i.e., the number of binding surfaces or binding partners – required by the cell for carrying out various biological functions [42]. Preferential attachment (PA) growth model [56] is one of the network evolution models that can generate such a distribution. In the PA model, the network is grown by iteratively adding a new node to the network and adding random connections to existing nodes. The probability of adding an edge to a given node is proportional to its degree, hence the model prefers to connect the new node to nodes that have many interacting partners. The PA model can also capture another important property of PPI networks called the “small-world effect”, which means that any node in the network can be typically reached from other nodes within a few links. Despite its effectiveness in modeling the scale-free degree distribution in PPI networks as well as their small-world property, the PA mechanism fails to capture other important properties, such as the graphlet distribution in real networks and their structural modularity [53], [65], [72], [73].
Inspired by the gene duplication model used to explain genome evolution [74], several duplication-based techniques have been proposed to simulate network evolution [53][55], [57][63], [66], [67], [69]. Basically, the gene duplication models assumes that the primary source of protein diversity is the repetitive duplication of existing genes followed by mutation of the duplicated genes leading to functional divergence [74]. A recent analysis of protein interaction networks [75] showed that gene duplication may play important roles in increasing the organismal complexity. The duplication-divergence model can generate networks that retain many of the generic characteristics of biological networks, such as the power-law degree distribution [76], hence it can provide an alternative framework for modeling PPI networks. The duplication-mutation-complementation (DMC) model [53] and the duplication with random mutation (DMR) model [54], [55] are two examples of duplication-divergence based network growth models that have been investigated in depth. Given a seed network, the DMC model [53] grows it by iterating the following steps:
  • Add a new node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e004.jpg to the network by duplicating a randomly chosen node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e005.jpg in the current network. Connect An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e006.jpg to all neighbors An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e007.jpg of the node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e008.jpg.
  • For every neighbor An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e009.jpg, randomly pick either edge An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e010.jpg or An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e011.jpg, and randomly remove the edge with probability An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e012.jpg.
  • Add a new edge between An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e013.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e014.jpg with probability An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e015.jpg.
It was shown that the above DMC model can capture various biological features of PPI networks [72], [77], including their hierarchical modularity. The DMR model is another well-studied network growth model based on the duplication-divergence principle [54], [55], where the network is obtained by repetitively applying the following steps:
  • As in the DMC model, add a new node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e016.jpg to the network by duplicating a randomly chosen node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e017.jpg in the current network. Connect An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e018.jpg to all neighbors An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e019.jpg of the node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e020.jpg.
  • Randomly remove the edges between An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e021.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e022.jpg with probability An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e023.jpg.
  • Introduce random edges between An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e024.jpg and other nodes in the network (that are not connected to the original node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e025.jpg) with probability An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e026.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e027.jpg is the size of the current network.
As shown in [73], [78], the DMR model can generate networks that resemble real PPI networks in various aspects, such as the An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e028.jpg-hop reachability (i.e, the number of distinct nodes that can be reached from a given node via a path of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e029.jpg edges), the graphlet distribution, as well as the betweenness, closeness, and degree distributions.
Another notable network growth model that is not based on the duplication-divergence principle is the crystal growth (CG) model, recently proposed by Kim and Marcotte [65]. The CG model takes a highly module-oriented approach, which tries to emulate the physical process of growing protein crystals in solution. Kim and Marcotte [65] showed that the CG model can better explain many features of real PPI networks, including their network topology, their characteristic age distribution, and the spatial distribution of the subunits of different ages within protein complexes, hinting at a plausible physical mechanism of network evolution. Specifically, the capability to accurately capture age-dependent interaction patterns in PPI networks is an important advantage of the CG model, as this is one major drawback of existing models (e.g., duplication-based techniques). The CG model grows a seed network by iteratively adding new nodes as follows:
  • Define modules (i.e., dense local network regions) in the current network using Newman's algorithm [79]. Let An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e030.jpg be the number of modules in the network.
  • Introduce a new node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e031.jpg to the network. Either define the node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e032.jpg as a new module by itself (with probability An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e033.jpg) or add it to one of the existing modules (with probability An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e034.jpg).
  • If An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e035.jpg is defined as a new module, add An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e036.jpg random connections to other nodes in the network according to the anti-preferential attachment (AP) rule. (Note that, according to the AP rule, nodes prefer to add edges to low-degree nodes.)
  • Otherwise, randomly select one of the An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e037.jpg modules in the network and choose an anchor node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e038.jpg in the selected module, based on the AP rule. Add An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e039.jpg connections between An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e040.jpg and the randomly selected neighbors of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e041.jpg. Repeat this step if An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e042.jpg has less than An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e043.jpg neighbors.
In addition to these three network growth models, there are also other randomized network generation schemes based on different approaches. For example, the scheme proposed in [70] does not generate a random network by growing a small seed network. Instead, this algorithm, which is developed based on Tailored random graphs, initiates from another random graph with the same dimensionality and the same degree sequence (i.e., the sequence of node degrees of the desired network) as the final network. Then it iteratively rewires the network (e.g., by edge swapping) to reach the desired degree distribution and joint degree statistics for connected nodes. However, this method is not well-suited for modeling network families, as it requires a predefined degree sequence (which may not be available in practice). Furthermore, as this scheme does not follow a growth model, it cannot effectively simulate evolutionarily related networks.
In the current work, we adopt and compare the three network growth models discussed above–i.e., DMC, DMR, and CG–to generate families of synthetic PPI networks. Note that the variables An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e044.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e045.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e046.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e047.jpg are user defined parameters for DMC, DMR, and CG schemes. Incorporation of other network evolution models is straightforward.
Characteristics of Protein Interaction Networks
To develop a biologically realistic model for generating families of synthetic protein interaction networks, we first study the characteristics of real PPI networks of five organisms: C. elegans, D. melanogaster, H. sapiens, M. musculus, and S. cerevisiae. We present the analysis results for D. melanogaster, H. sapiens, and S. cerevisiae, which have the largest PPI networks among the five organisms, while the rest can be found in the supplementary data. The protein interaction data for these organisms have been obtained from IsoBase [80], a recently developed database of functionally related protein orthologs. IsoBase consists of the PPI networks of these five species, along with the homology scores between all pairs of proteins across different species, measured in terms of BLAST bit-value similarity of the protein sequences. The PPI networks in the IsoBase have been constructed by integrating the data in three different public databases: DIP [45], BioGRID [43], and HPRD [48]. Table 1 summarizes the statistics of IsoBase, which currently contains 48,120 proteins and 114,897 protein-protein interactions. From this table, we can also observe the incompleteness of the current PPI networks, evidenced by the large number of isolated proteins (i.e., proteins without known interactions). Furthermore, it also shows that only a small portion of the included proteins have known functional annotations according to the KEGG orthology. In the following, we investigate several important features that can be observed in these PPI networks.
Table 1
Table 1
Statistics of the IsoBase database.
Intra-network properties of individual PPI networks
Two important network properties that we can typically observe in a real PPI network is the scale-free property and the modularity. The scale-free property manifests itself in the degree distribution An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e048.jpg, defined as the probability that a given node in the network will have An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e049.jpg connections to other nodes, that follows a power-law distribution: An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e050.jpg for some An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e051.jpg. One measure that can be used to evaluate the modularity of a network is the clustering coefficient function An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e052.jpg. We define the clustering coefficient of a node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e053.jpg of degree An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e054.jpg as An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e055.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e056.jpg is the number of connections among the neighbors of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e057.jpg. The clustering function An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e058.jpg is defined as the average clustering coefficient of all nodes with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e059.jpg neighbors, and it is expected to scale down with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e060.jpg in a modular network. Figures 1(A)–1(F) and Figures S1(A)–S1(D) show the degree distribution An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e061.jpg and the clustering coefficient function An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e062.jpg for the five organisms. These figures show that the degree distribution of each organism clearly follows a power-law distribution An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e063.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e064.jpg ranges between 1.8 and 2.3. We can also see that the clustering coefficient An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e065.jpg quickly scales down with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e066.jpg for all organisms, indicating the hierarchical modularity present in the PPI networks [71], [81].
Figure 1
Figure 1
Network properties of various organisms.
Cross-network properties between different PPI networks
In order to devise a practical model for synthesizing a family of related networks, instead of a single network, it is important to investigate the cross-network properties that can be observed when comparing the PPI networks of different organisms. As discussed earlier, two aspects that are important in the comparative analysis of PPI networks are the structural similarity of the networks and the molecular similarity between the proteins that belong to different networks. The molecular similarity between proteins and their potential orthology is typically assessed based on their sequence similarity using a sequence alignment algorithm, such as BLAST [82] or FASTA [83]. Two questions of practical interest are: (i) how many potential orthologs would exist in different networks, for a specific protein in a given network, and (ii) how the protein similarity scores are distributed when comparing a network pair.
Distribution of potential orthologs
Let An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e067.jpg be the set of nodes (i.e., protein) in a PPI network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e068.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e069.jpg be the set of nodes in An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e070.jpg. For a given node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e071.jpg in the network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e072.jpg, how many potential orthologs exist in the network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e073.jpg? By potential orthologs, we refer to pairs of proteins (in different PPI networks) that are candidates for being true orthologs according to their sequence similarity. Sequence similarity is often used as practical evidence for predicting protein orthology, and we assume that nodes with relatively high sequence similarity are more likely to be orthologous. Thus, we estimate the number of potential orthologs of each node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e074.jpg as
A mathematical equation, expression, or formula.
 Object name is pone.0041474.e075.jpg
which is the number of nodes An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e076.jpg in the network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e077.jpg whose similarity score An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e078.jpg exceeds some threshold An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e079.jpg. In practice, we may use a sequence alignment score, such as the BLAST bit score, to estimate An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e080.jpg. For any integer An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e081.jpg, we define An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e082.jpg as the fraction of nodes An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e083.jpg with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e084.jpg. This relative frequency An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e085.jpg can provide useful insights regarding the presence of potential orthologs across different networks. Figures 2(A)–2(F) and Figures S2(A)–S2(N) show An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e086.jpg across all pairs of the five organisms in IsoBase, where a threshold of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e087.jpg was used in all experiments. As shown in these figures, potential orthologs are generally sparse across networks. The results in Figure 2 and Figure S2 clearly reveal that the distribution An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e088.jpg closely follows a power-law distribution An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e089.jpg with an exponent An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e090.jpg that ranges between 1.4 and 2.1. For example, let us consider the number of proteins in the D. melanogaster network that are potentially orthologous to proteins in the S. cerevisiae network. Among the 6,659 proteins in the S. cerevisiae network, 3,369 proteins do not have any potential orthologs in D. melanogaster whose sequence similarity score exceeds the threshold An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e091.jpg. Among the rest, 1,707 proteins have no more than two potential orthologs in the D. melanogaster PPI network, 578 proteins have An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e092.jpg potential orthologs, 291 proteins have An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e093.jpg potential orthologs, 246 proteins have An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e094.jpg potential orthologs, 295 proteins have An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e095.jpg potential orthologs, 130 proteins have An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e096.jpg potential orthologs, and only 43 proteins have more than 100 potential orthologs. The general trend does not significantly change for choosing a different threshold An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e097.jpg. For example, even when we raise the threshold to An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e098.jpg, the number of proteins in S. cerevisiae with more than 50 potential orthologs in D. melanogaster would just decrease to 33. The results are similar for other network pairs, which show that there are typically only a few nodes in a PPI network with a relatively large number of potential orthologs, while most nodes only have a small number of potential orthologs, if any, in other organisms. This observation reveals an important challenge in network alignment, namely, strong reliance on sequence similarity can lead to predictions that are biologically insignificant and misleading, and effective incorporation of interaction data is crucial to minimize this risk.
Figure 2
Figure 2
Cross-species network properties for different pairs of organisms.
Distribution of sequence similarity scores
Now, let us consider the distribution of the similarity score between nodes in different networks. As before, let An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e101.jpg be the set of nodes in a PPI network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e102.jpg and let An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e103.jpg be the set of nodes in a different PPI network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e104.jpg. We define the set of orthologous proteins in the two networks as
A mathematical equation, expression, or formula.
 Object name is pone.0041474.e105.jpg
and the set of non-orthologous proteins as
A mathematical equation, expression, or formula.
 Object name is pone.0041474.e106.jpg
where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e107.jpg (in network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e108.jpg) and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e109.jpg (in An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e110.jpg) are regarded as orthologs if they belong to the same KEGG ortholog group, thus share the same functional annotation. We define An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e111.jpg as the distribution of the similarity score An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e112.jpg for orthologous nodes An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e113.jpg. Similarly, we define An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e114.jpg as the score distribution for non-orthologous node pairs An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e115.jpg. These distributions are shown in Figures 2(G)–2(I) and Figures S3(A)–S3(G) across all pairs of the considered organisms. These results show that the score distribution can be closely approximated by the Gamma distribution An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e116.jpg, whose probability density function An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e117.jpg is defined as follows
A mathematical equation, expression, or formula.
 Object name is pone.0041474.e118.jpg
(1)
for some shape parameter An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e119.jpg and scale parameter An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e120.jpg. These figures also show that there is a substantial overlap between An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e121.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e122.jpg, the similarity score distribution for orthologs and that for non-orhologs, which again reveals the the importance of incorporating interaction data into comparative networks analysis. This observation also confirms the results in previous studies [28], [42], [67], which showed that proteins that are conserved at the sequence level may fail to have conserved functionalities at the network level.
Proposed Network Synthesis Model
Following the previous discussions, in this section, we propose a novel network synthesis model that can generate a family of evolutionarily related protein-protein interaction networks. Suppose we want to generate a family of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e123.jpg synthetic PPI networks An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e124.jpg. Each network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e125.jpg consists of set An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e126.jpg of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e127.jpg nodes; a set An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e128.jpg of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e129.jpg edges, where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e130.jpg denotes the edge between node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e131.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e132.jpg; and a set An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e133.jpg, which maps each node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e134.jpg to a functional group An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e135.jpg in An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e136.jpg, the set of all functional orthology (FO) annotations. A node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e137.jpg with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e138.jpg is regarded as an annotated protein with a known function An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e139.jpg, while it is regarded as an unannotated protein if An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e140.jpg. We define An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e141.jpg as a An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e142.jpg similarity score matrix that contains the sequence similarity score between all pairs of proteins for the networks An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e143.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e144.jpg. The set An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e145.jpg consists of the scoring matrices for all pairs of networks.
To generate the An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e146.jpg networks, we first specify the hypothetic phylogenetic tree An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e147.jpg that describes the evolutionary relationship among the networks. The tree An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e148.jpg, which is assumed to be a binary tree, will have exactly An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e149.jpg leaf nodes, in addition to a number of internal nodes, which correspond to the An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e150.jpg networks to be generated by the model. The basic idea of the proposed method is to follow the phylogenetic tree An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e151.jpg to create a set of related networks through repetitive network duplication, mutation, and network extension, starting from a single hypothetical ancestral network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e152.jpg. In order to create a biologically realistic ancestral network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e153.jpg, we begin by generating a small seed network and iteratively extend it using one of the network growth models – DMC, DMR, and CG models – described earlier. As discussed in [73], choosing the right seed network is crucial to capture the key topological features of real PPI networks. For the duplication-based models (i.e., DMC and DMR), we use a seed network that is similar to the one presented in [73], which was shown to accurately characterize the attributes of the S. cerevisiae PPI network. This seed network of size 50 includes two cliques (complete subgraphs), one with 10 nodes and the other with 7 nodes. Nodes in each of these two cliques are randomly connected to a few nodes in the other clique. The other 33 nodes are randomly connected to one of the 17 clique nodes. The nodes in the first and the second cliques are assigned to distinct functional groups An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e154.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e155.jpg, respectively. Each of the remaining 33 nodes is assigned to a different functional group, from An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e156.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e157.jpg. For the CG model, we use a seed graph of size 4 as in [65]. The initial seed network is grown into the ancestral network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e158.jpg of size An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e159.jpg by employing one of the network extension models. While growing the network, every new node is assigned to a new functional group of its own.
Once the ancestral PPI network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e160.jpg is created, we traverse the phylogenetic tree An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e161.jpg to generate descendant networks that are evolutionarily related to An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e162.jpg. Figure 3 illustrates an example of a phylogenetic tree An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e163.jpg for five hypothetical species, which correspond to the five leaf nodes An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e164.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e165.jpg. The tree also includes three internal nodes An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e166.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e167.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e168.jpg, and the root node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e169.jpg. Since the phylogenetic tree is assumed to be binary, each internal node (including the root node) branches off to two child nodes. For each child node, we create a network by duplicating the parent network and evolving it into a larger network. For example, according to the tree in Figure 3, we generate two networks An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e170.jpg (for the leaf node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e171.jpg) and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e172.jpg (for the internal node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e173.jpg) based on the ancestral network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e174.jpg that corresponds to the root node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e175.jpg, which is the parent of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e176.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e177.jpg. We will traverse the tree An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e178.jpg through a breadth-first search [84] and repeat this bifurcation process until all An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e179.jpg networks are generated. It is straightforward to see that this will require An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e180.jpg bifurcations, in total.
Figure 3
Figure 3
The phylogenetic tree of five hypothetical organisms.
The bifurcation step is carried out as follows. Suppose An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e181.jpg is the network that corresponds to the current internal node. We denote An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e182.jpg as the set of scoring matrices that contain the similarity scores between proteins in An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e183.jpg and those in the networks for other nodes in An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e184.jpg that have been previously visited. We generate the networks An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e185.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e186.jpg for the two child nodes by duplicating the parent network: An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e187.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e188.jpg. Both networks inherit the functional annotations of their parent An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e189.jpg and the set An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e190.jpg of scoring matrices. For every pair of nodes An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e191.jpg in An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e192.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e193.jpg in An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e194.jpg, we randomly assign their similarity score according to a Gamma distribution as follows:
A mathematical equation, expression, or formula.
 Object name is pone.0041474.e195.jpg
(2)
where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e196.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e197.jpg are random numbers sampled according to An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e198.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e199.jpg. Note that the similarity score An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e200.jpg takes a different distribution, depending on whether or not An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e201.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e202.jpg have the same functional annotation: An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e203.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e204.jpg are the shape and scale parameters of the Gamma distribution for orthologs (with identical FO annotations); An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e205.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e206.jpg are the parameters for non-orthologs (with different FO annotations). An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e207.jpg is used to simulate the thresholding effect of sequence similarity scores. As we have seen in our analysis of real PPI networks, potential orthologs across different networks are generally sparse. In the proposed model, we enforce the number of potential orthologs to follow a power-law distribution An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e208.jpg, as in real PPI networks.
To diverge the child networks An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e209.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e210.jpg from the parent network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e211.jpg, we independently apply a network growth algorithm (DMC, DMR, or CG) to each of these networks. In this step, the number of new nodes added to each child network may be specified according to the evolutionary distance between the corresponding hypothetical species in the tree An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e212.jpg. For instance, in Figure 3, the number of additional nodes (referred as the “length” of a given branch) are shown along the branches. In this example, if the ancestral network has An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e213.jpg nodes, the PPI network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e214.jpg for node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e215.jpg will have An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e216.jpg nodes and the PPI network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e217.jpg for node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e218.jpg will have An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e219.jpg nodes. Consider a new node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e220.jpg that was either (i) obtained by duplicating an existing node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e221.jpg (when using either the DMC or the DMR model) or (ii) a new node whose anchor node was chosen to be An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e222.jpg (when using the CG model). We transfer the functional annotation and the similarity scores from an existing node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e223.jpg to a new node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e224.jpg as follows:
  • With probability An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e225.jpg, assign An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e226.jpg to the same functional group as An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e227.jpg by setting An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e228.jpg. With probability An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e229.jpg, set An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e230.jpg, which implies that An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e231.jpg takes a new unknown function.
  • For every protein An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e232.jpg in the networks that correspond to previously visited nodes in An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e233.jpg, assign the similarity score between An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e234.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e235.jpg as:
    A mathematical equation, expression, or formula.
 Object name is pone.0041474.e236.jpg
    (3)
    where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e237.jpg is a random scaling factor with a uniform distribution over An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e238.jpg. The upper bound An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e239.jpg (An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e240.jpg) specifies the extent of the sequence-level divergence between An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e241.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e242.jpg.
In this way, we can model the functional inheritance and the sequence similarity between the duplicated nodes, where a duplicated node may have a different function from the original node. Finally, when using the CG model, a new node An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e243.jpg that forms a new functional module by itself, hence not anchored to any of the existing nodes, will be assigned a new unannotated function (i.e., An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e244.jpg).
Attributes of Synthetic Networks
To validate the proposed network synthesis model, we generated synthetic PPI networks according to the model and analyzed the individual and cross-species characteristics of the synthesized networks. We first generated an ancestral network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e245.jpg of size An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e246.jpg. A simple binary tree with two leaves was used to evolve An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e247.jpg into two networks An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e248.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e249.jpg, respectively with 5,000 nodes and 7,000 nodes. For network extension, we applied all three network growth models – DMC, DMR, and CG – discussed in this paper. For DMC, we used An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e250.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e251.jpg as in [65]. For DMR, we set the parameters to An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e252.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e253.jpg as in [73]. We used An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e254.jpg for CG as in [65]. The scaling and shape parameters of the Gamma distributions in (2) were set to An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e255.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e256.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e257.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e258.jpg, and the exponent An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e259.jpg in the distribution An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e260.jpg was set to An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e261.jpg, such that the cross-network properties between An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e262.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e263.jpg resemble those between the D. melanogaster PPI network and the S. cerevisiae PPI network. The parameters An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e264.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e265.jpg that control the functional inheritance and sequence similarity between orthologous nodes were set to An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e266.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e267.jpg, so that protein function and sequence similarity is conserved at the An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e268.jpg level. Although it is practically difficult to accurately determine these two parameters in real networks, the analysis in [85] shows this rate of functional conservation for duplicated genes.
In the case of CG algorithm, we made a slight modification in the first step of the algorithm as follows. In the original algorithm proposed in [65], when adding a new node, the modules of the current network are recomputed at each iteration. To speed up the CG algorithm, we instead redefine the modules every An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e269.jpg steps, where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e270.jpg is the size of the current network. In other words, in the early iterations, we redefine modules in every iteration, while as the network grows larger, we apply the module redefinition step only occasionally and use these modules over multiple iterations. Simulation results show that the CG method can still accurately capture the generic features of real PPI networks with this modification. We leave the module redefinition frequency as a control parameter that can be freely adjusted.
The properties of the synthetic PPI network are shown in Figure 4, Figure 5, and Figure 6, for using DMC, DMR, and CG, respectively. As can be seen in these figures, all three schemes can accurately model the scale-free degree distribution. However, it appears that the hierarchical modularity can be better captured by using either DMC or CG, rather than DMR. Regarding the cross-network properties, these results also clearly show that the proposed network synthesis model can effectively capture the attributes of real PPI networks. For example, this can be immediately seen by comparing the network properties of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e271.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e272.jpg in Figures 4(E) and 4(F) (when using DMC) with those of the D. melanogaster and the S. cerevisiae PPI networks shown in Figures 2(B) and 2(H). Similar observations can be made from Figures 5(E) and 5(F) (for DMR) as well as Figures 6(E) and 6(F) (for CG).
Figure 4
Figure 4
Properties of the networks generated using the DMC model.
Figure 5
Figure 5
Properties of the networks generated using the DMR model.
Figure 6
Figure 6
Properties of the networks generated using the CG model.
Construction of Network Alignment Benchmark
The network synthesis model presented in this paper provides an effective framework for generating network families with diverse characteristics. Such network sets may be used to assess the performance of various alignment techniques to identify their respective strengths and weaknesses under different conditions and problems settings. Furthermore, the proposed network synthesis model may be potentially used to expose previously unknown biases that a network alignment technique may have towards specific types of networks, thereby leading to better alignment techniques.
To demonstrate the utility of the proposed network generation scheme, we used it to create synthetic benchmark datasets that can be used for evaluating and comparing the performance of various network alignment algorithms. We call the proposed Network Alignment Performance Assessment benchmark as NAPAbench. In total, we generated three suites of datasets. The first suite (referred as the pairwise alignment dataset) contains three pairs of networks, where the respective network pairs were generated using DMC, DMR, and CG, respectively. Each pair consists of a network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e287.jpg with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e288.jpg nodes and another network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e289.jpg with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e290.jpg nodes, both evolved from an ancestral network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e291.jpg with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e292.jpg nodes, following a binary tree with two leaves. The second suite (referred as the 5-way alignment dataset) contains three network families, each with five networks generated using DMC, DMR, or CG. To generate the network family, we first created an ancestral network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e293.jpg with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e294.jpg nodes. The phylogenetic tree An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e295.jpg in Figure 3 was used to evolve An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e296.jpg into five networks – An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e297.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e298.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e299.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e300.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e301.jpg – which correspond to the five leaf nodes. For every branch, we set its length to 500. Thus, the size of the five networks were An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e302.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e303.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e304.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e305.jpg. This dataset simulates a family of PPI networks that correspond to distantly related species. Finally, the third suite (referred as the 8-way alignment dataset) also consists of three network families, each with eight networks generated by one of the three network extension models. The eight networks were obtained by evolving an ancestral network An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e306.jpg of size An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e307.jpg according to a full binary tree with eight leaf nodes. The branch length was set to 200 for all branches, which gave rise to eight equally sized networks, each with 1,000 nodes. This 8-way alignment dataset tries to simulate a network family of closely-related species. All the datasets in NAPAbench are publicly available at http://www.ece.tamu.edu/bjyoon/NAPAbench/.
Performance Analysis of Network Alignment Algorithms
The created benchmark datasets, NAPAbench, can be used for reliable and comprehensive performance evaluation of existing network alignments. In this work, we used this synthetic benchmark to assess the performance of five well-known multiple network alignment algorithms: IsoRank [13], IsoRankN [12], NetworkBLAST-M [15], Græ mlin 2.0 [11], and MI-GRAAL [25]. IsoRank [13] uses spectral graph theory to evaluate the overall similarity between nodes that belong to different networks. This pairwise alignment score is computed for every node pair across all pairs of networks, which is then used to build the multiple network alignment according to a greedy approach. IsoRankN [12] further extends the idea in IsoRank by employing a spectral clustering scheme based on the pairwise node alignment scores. NetworkBLAST-M [15] computes the network alignment by first constructing a layered alignment graph based on the potential orthologous nodes, and then greedily searching for highly conserved local regions in the alignment graph. Græ mlin 2.0 [11] takes a progressive approach to construct a global alignment of multiple networks, where it repeatedly performs pairwise network alignments according to a given phylogenetic tree that describes the relationship among the networks. The alignment is predicted by maximizing an objective function based on parameters that are learned from a set of known alignments. Finally, MI-GRAAL [25] is a recently proposed pairwise network alignment scheme that can integrate any number and type of similarity measures between network nodes, such as sequence similarity, structural similarity, and topological similarity.
Recall that the node similarity score in the proposed model tries to mimic the BLAST bit scores. Since NetworkBLAST-M and MI-GRAAL employ the BLAST E-values, instead of the BLAST bit scores, we should transform the bit scores into the corresponding E-values for these two algorithms. As discussed in [86], [87], the simulated bit score (An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e308.jpg) is related to the E-value (An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e309.jpg) as An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e310.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e311.jpg is the length of the BLAST query and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e312.jpg is the length of the target sequence. Here, we transform our simulated bit scores to E-values using An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e313.jpg (assuming, for instance, the case when we BLAST a protein sequence with 500 residues in a database that contains a total of 200,000,000 residues). In this paper, we used the restricted-order version of NetworkBLAST-M as the running time of the relaxed-order version increases exponentially with respect to the number of networks to be aligned. As Græ mlin needs to learn the parameters of its scoring function in advance, we generated a training set that consists of five networks (with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e314.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e315.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e316.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e317.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e318.jpg nodes, respectively), using the proposed scheme with the DMC model by following the tree shown in Figure 3. MI-GRAAL can integrate different kinds of similarity measures into the search process. Here, we adopt the graphlet degree signature distance and the E-values (measuring the sequence similarity) for MI-GRAAL alignment algorithm. For IsoRank and IsoRankN, the parameter An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e319.jpg, which determines the balance between sequence similarity and topological similarity, was set to 0.6.
The accuracy of each network alignment algorithm was assessed using four measures – specificity, correct nodes, mean normalized entropy, and coverage – which had been previously used in [11] and [12]. We refer the set of aligned nodes (i.e., potential orthologs) as the equivalence class. Each equivalence class may include an arbitrary number of nodes from each species. To compute the accuracy measures, we first removed the unannotated nodes from the alignment (i.e, nodes with the annotation An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e320.jpg) and then removed equivalence classes containing only a single node. A given equivalence class is viewed as being correct if all the included nodes belong to the same FO group. The four measures are defined as follows:
  • Specificity (SPE): The relative number of correctly predicted equivalence classes.
  • Correct Nodes (CN): The total number of nodes (i.e., proteins) that are assigned to the correct equivalence class. This measure reflects the sensitivity of the prediction [11].
  • Mean normalized entropy (MNE): The mean normalized entropy of the predicted equivalence classes can provide an effective measure of the consistency of the predicted network alignment. The normalized entropy of a given equivalence class An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e321.jpg is computed as:
    A mathematical equation, expression, or formula.
 Object name is pone.0041474.e322.jpg
    (4)
    where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e323.jpg is the fraction of proteins in An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e324.jpg with the FO annotation An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e325.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e326.jpg is the number of different FO groups. Thus, a cluster that consists of nodes with higher functional consistency will have lower entropy.
  • Coverage: For any integer An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e327.jpg, the total number of equivalence classes that contain nodes from An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e328.jpg species. We report this measure only for multiple network alignment experiments (and not for pairwise alignments).
NetworkBLAST-M reports only the local alignment of the input networks, while the other four algorithms yield the global alignment of the given networks. For a fair comparison between these algorithms, we first convert the local alignment predicted by NetworkBLAST-M into a global network alignment by merging all local node correspondences. For example, if nodes An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e329.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e330.jpg are aligned in one local alignment while An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e331.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e332.jpg are aligned in another local alignment, we assume that An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e333.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e334.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e335.jpg belong to the same equivalence class.
The SPE, CN, and MNE of the five algorithms are summarized in Table 2, Table 3, and Table 4, for the pairwise alignment dataset, 5-way alignment dataset, and the 8-way alignment dataset, respectively. Figure 7 and Figure 8 shows the coverage of different algorithms for the 5-way and 8-way dataset, respectively.
Table 2
Table 2
Performance of different alignment algorithms on the pairwise alignment dataset of NAPAbench.
Table 3
Table 3
Performance Comparison on the 5-way network alignment dataset of NAPAbench.
Table 4
Table 4
Performance Comparison on 8-way network alignment dataset of NAPAbench.
Figure 7
Figure 7
Number of equivalence classes in the 5-way alignment experiment that contain nodes from
Figure 8
Figure 8
Number of equivalence classes in the 8-way alignment experiment that contain nodes from
For pairwise network alignments, NetworkBLAST-M boasts significantly higher specificity and consistency (reflected in lower MNE) compared to other algorithms. IsoRank, IsoRankN, and MI-GRAAL yielded the highest number of correctly aligned nodes (i.e., CN) for networks generated using the DMC/DMR growth models, implying high sensitivity. For the networks created using the CG model, which yield highly modular networks, NetworkBLAST-M showed highest sensitivity, closely followed by MI-GRAAL.
For the 5-way and 8-way alignment experiments, we can clearly observe the degradation in sensitivity of NetworkBLAST-M, as shown in Table 3 and Table 4. This may be due to the fact that NetworkBLAST-M aims to predict equivalence classes that are conserved across all the compared species, as illustrated in Figure 7 and Figure 8. In these experiments, Græ mlin showed moderate performance, where the sensitivity was higher than NetworkBLAST-M, but the specificity and the consistency were lower. The multiple network alignment experiments based on the 5-way and the 8-way benchmark datasets in NAPAbench show that IsoRankN can yield the most accurate network alignment results, in terms of specificity, sensitivity, and consistency. This observation is in agreement with the performance assessment in [12], based on five real biological networks.
To compared the performance of different algorithms in predicting equivalence classes conserved across all networks, we also estimated the accuracy of IsoRankN and Græ mlin only for such classes. These results are shown in the last two rows of Table 3 and Table 4. We can see that IsoRankN still outperforms NetworkBLAST-M in most cases for 5-way alignment. In the 8-way network alignment, IsoRankN appears to outperform NetworkBLAST-M for networks generated using the DMC growth model. However, NetworkBLAST-M is more sensitive on networks obtained using the DMR model, and it is also more sensitive and more specific for networks generated using the CG model. These results also show that Græ mlin is outperformed by the other two algorithms in this case, which implies that it may not be effective in predicting orthologous nodes that are conserved across all species.
Figure 7 shows the number of equivalence classes (i.e., the coverage) that are predicted in the 5-way alignment dataset by the respective algorithms. In each case, the total number of equivalence classes is split into the number of classes that consist of nodes from An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e340.jpg different networks (An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e341.jpg). As shown in this figure, all three algorithms predicted similar number of equivalence classes that contain nodes from all An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e342.jpg networks. However, we can see that IsoRankN predicts a significantly larger number of equivalence classes with An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e343.jpg compared to the other algorithms. Considering that the 5-way alignment dataset consists of networks with varying size, equivalence classes that contain nodes from An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e344.jpg networks are fairly common, hence the ability of identifying such equivalence classes is certainly an important advantage of IsoRankN. Figure 8 shows coverage of different algorithms on the 8-way dataset. The trends are similar as in the 5-way alignment, and we can see that IsoRankN results in greater coverage for equivalence classes spanning An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e345.jpg networks. Another interesting observation is that Græ mlin predicts a large number of equivalence classes that contain only nodes from An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e346.jpg networks.
Next, we investigate the effect of sequence similarity on the performance of the various network alignment algorithms. To this aim, we add a bias term An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e347.jpg to the similarity score distribution of potential orthologs in (2), such that the score is randomly sampled as An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e348.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e349.jpg. Increasing the bias An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e350.jpg will further separate the similarity score distributions of orthologous and non-orthologous nodes. As a result, the larger An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e351.jpg is, the easier it becomes to align the networks (and to predict the potential orthologs across networks) based on sequence similarity alone, without utilizing the topological similarity between networks. For this experiment, we generated two networks with 1,000 nodes from an ancestral network of size An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e352.jpg. Figure 9 shows how specificity (SPE) and CN (which reflects sensitivity), change for varying values of An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e353.jpg between 0 and 250. As can be seen in this figure, as the separation between the score distributions of orthologs and non-orthologs increases, both the specificity and the sensitivity are improved for IsoRank, IsoRankN, and Græ mlin. On the other hand, NetworkBLAST-M and MI-GRAAL display a constant level of accuracy that does not depend on the amount of separation. This implies that the first three alignment algorithms rely on the similarity between nodes relatively strongly when predicting the network alignment, while NetworkBLAST-M and MI-GRAAL use the similarity score mainly to predict potential orthology and do not rely too much on the extent of the similarity. In these experiments, Græ mlin appears to most strongly rely on the node similarity among the compared algorithms. In fact, Græ mlin achieves the highest specificity and sensitivity when there is a large separation between the score distributions (e.g., An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e354.jpg), while resulting in the lowest sensitivity when the separation is small (e.g., An external file that holds a picture, illustration, etc.
Object name is pone.0041474.e355.jpg).
Figure 9
Figure 9
The specificity (SPE) and the CN (which reflects the sensitivity) of different alignment algorithms for varying level of separation between the similarity score distribution for orthologs and the score distribution for non-orthologs.
Table 5 compares the computational complexity of the five algorithms, in terms of the total CPU time needed to align the networks in the respective datasets. All experiments have been performed on a desktop computer with a 2.2GHz Intel Core2Duo CPU and 4GB memory. It should be noted that Græ mlin requires a training stage for estimating the parameters used by the algorithm, which took more than a day in our experiments. The CPU time shown in Table 5 reveals that Græ mlin (without considering the training stage) and NetworkBLAST-M are the fastest among the five algorithms, while IsoRankN and MI-GRAAL are computationally more complex than these two algorithms.
Table 5
Table 5
Total CPU time (min) for aligning the networks.
Discussion
Absence of a comprehensive and reliable network alignment benchmark has been a critical obstacle that has been hindering research progress in comparative network analysis. In this work, we addressed this problem by proposing a novel network synthesis model that can generate network families with biologically realistic properties. The proposed model allows us to effectively generate families of evolutionarily related networks, where the network family may contain any number of networks with arbitrary phylogenetic relationships. We demonstrated that the internal as well as the cross-network properties of the synthesized networks closely resemble those of real protein-protein networks. Based on the proposed model, we synthesized a number of network benchmark datasets and evaluated the performance of several representative network alignment algorithms. These experiments allow us to clearly delineate the advantages and disadvantages of the respective algorithms in contrast to other algorithms. As demonstrated throughout this paper, the proposed network synthesis model provides an effective framework for generating large-scale network benchmarks, which can be used to reliably assess the performance of current and future network alignment algorithms under various conditions and problem settings.
Supporting Information
Figure S1
Network properties of different organisms. (A), (B) show the degree distributions, and (C), (D) show the clustering coefficient profiles.
(EPS)
Figure S2
Cross-species network properties for different pairs of organisms. (A)–(N) show how the number of potential orthologs are distributed between a given pair of networks.
(EPS)
Figure S3
Cross-species network properties for different pairs of organisms. (A)–(G) illustrate the sequence similarity distribution for orthologous and non-orthologous node pairs.
(EPS)
Funding Statement
This work was supported in part by the National Science Foundation through NSF Award CCF-1149544. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
1. Zhang A (2009) Protein Interaction Networks: Computational Analysis, 1st ed. New York, NY: Cambridge University Press.
2. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623–627. [PubMed]
3. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180–183. [PubMed]
4. Ge H (2000) UPA, a universal protein array system for quantitative detection of protein-protein, protein-DNA, protein-RNA and protein-ligand interactions. Nucleic Acids Res 28: e3. [PMC free article] [PubMed]
5. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402: 86–90. [PubMed]
6. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 96: 4285–4288. [PubMed]
7. Deng M, Mehta S, Sun F, Chen T (2002) Inferring domain-domain interactions from proteinprotein interactions. Genome Res 12: 1540–1548. [PubMed]
8. Matthews LR, Vaglio P, Reboul J, Ge H, Davis BP, et al. (2001) Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or “interologs”. Genome Res 11: 2120–2126. [PubMed]
9. Sharan R, Ideker T (2006) Modeling cellular machinery through biological network comparison. Nat Biotechnol 24: 427–433. [PubMed]
10. Flannick J, Novak A, Srinivasan B, McAdams H, Batzoglou S (2006) Græmlin: general and robust alignment of multiple large interaction networks. Genome Res 16: 1169–1181. [PubMed]
11. Flannick J, Novak A, Do CB, Srinivasan BS, Batzoglou S (2009) Automatic parameter learning for multiple local network alignment. J Comput Biol 16: 1001–1022. [PMC free article] [PubMed]
12. Liao CS, Lu K, Baym M, Singh R, Berger B (2009) IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 25: i253–258. [PMC free article] [PubMed]
13. Singh R, Xu J, Berger B (2008) Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci USA 105: 12763–12768. [PubMed]
14. Chindelevitch L, Liao CS, Berger B (2010) Local optimization for global alignment of protein interaction networks. Pac Symp Biocomput 123–132. [PubMed]
15. Kalaev M, Smoot M, Ideker T, Sharan R (2008) NetworkBLAST: comparative analysis of protein networks. Bioinformatics 24: 594–596. [PubMed]
16. Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, et al. (2005) Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci USA 102: 1974–1979. [PubMed]
17. Koyuturk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W, et al. (2006) Pairwise alignment of protein interaction networks. J Comput Biol 13: 182–199. [PubMed]
18. Guo X, Hartemink AJ (2009) Domain-oriented edge-based alignment of protein interaction networks. Bioinformatics 25: i240–246. [PMC free article] [PubMed]
19. Dutkowski J, Tiuryn J (2007) Identification of functional modules from conserved ancestral protein-protein interactions. Bioinformatics 23: i149–158. [PubMed]
20. Berg J, Lassig M (2006) Cross-species analysis of biological networks by Bayesian alignment. Proc Natl Acad Sci USA 103: 10967–10972. [PubMed]
21. Zaslavskiy M, Bach F, Vert JP (2009) Global alignment of protein-protein interaction networks by graph matching methods. Bioinformatics 25: i259–267. [PMC free article] [PubMed]
22. Denilou YP, Boyer F, Viari A, Sagot MF (2009) Multiple alignment of biological networks: A flexible approach. In: Kucherov G, Ukkonen E, editors. Combinatorial Pattern Matching, volume 5577 of Lecture Notes in Computer Science. Heidelberg: Springer. pp. 263–273.
23. Klau GW (2009) A new graph-based method for pairwise global network alignment. BMC Bioinformatics 10 Suppl 1: S59. [PMC free article] [PubMed]
24. Bradde S, Braunstein A, Mahmoudi H, Tria F, Weigt M, et al. (2010) Aligning graphs and finding substructures by a cavity approach. Europhysics Letters (epl) 89.
25. Kuchaiev O, Przulj N (2011) Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics 27: 1390–1396. [PubMed]
26. Li Z, Zhang S, Wang Y, Zhang X, Chen L (2007) Alignment of molecular networks by integer quadratic programming. Bioinformatics 23: 1631–1639. [PubMed]
27. Ali W, Deane CM (2009) Functionally guided alignment of protein interaction networks for module detection. Bioinformatics 25: 3166–3173. [PMC free article] [PubMed]
28. Bandyopadhyay S, Sharan R, Ideker T (2006) Systematic identification of functional orthologs based on protein network comparison. Genome Res 16: 428–435. [PubMed]
29. Bayati M, Gerritsen M, Gleich D, Saberi A, Wang Y (2009) Algorithms for large, sparse network alignment problems. In: Ninth IEEE International Conference on Data Mining (ICDM), 2009. pp. 705–710.
30. Qian X, Yoon BJ (2009) Effective identification of conserved pathways in biological networks using hidden Markov models. PLoS ONE 4: e8070. [PMC free article] [PubMed]
31. Ay F, Kellis M, Kahveci T (2011) SubMAP: aligning metabolic pathways with subnetwork mappings. J Comput Biol 18: 219–235. [PMC free article] [PubMed]
32. Kelley BP, Yuan B, Lewitter F, Sharan R, Stockwell BR, et al. (2004) PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res 32: W83–88. [PMC free article] [PubMed]
33. Pinter R, Rokhlenko O, Yeger-Lotem E, Ziv-Ukelson M (2005) Alignment of metabolic pathways. Bioinformatics 21: 3401–3408. [PubMed]
34. Shlomi T, Segal D, Ruppin E, Sharan R (2006) QPath: a method for querying pathways in a proteinprotein interaction network. BMC Bioinformatics 7. [PMC free article] [PubMed]
35. Dost B, Shlomi T, Gupta N, Ruppin E, Bafna V, et al. (2008) QNet: A tool for querying protein interaction networks. J Comput Biol 15: 913–925. [PubMed]
36. Bruckner S, Huffner F, Karp RM, Shamir R, Sharan R (2010) Topology-free querying of protein interaction networks. J Comput Biol 17: 237–252. [PubMed]
37. Qian X, Sze SH, Yoon BJ (2009) Querying pathways in protein interaction networks based on hidden Markov models. Journal of Computational Biology 16: 145–157. [PMC free article] [PubMed]
38. Sahraeian SME, Yoon BJ (2011) Fast network querying algorithm for searching large-scale biological networks. In: 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). pp. 6008–6011.
39. Sahraeian SME, Yoon BJ (2012) RESQUE: Network reduction using semi-Markov random walk scores for efficient querying of biological networks. Bioinformatics doi: 10.1093/bioinformatics/bts341. [PubMed]
40. Ashburner M, Ball C, Blake J, Botstein D, Butler H, et al. (2000) Gene Ontology: Tool for the unification of biology. the gene ontology consortium. Nat Genet 25: 25–29. [PMC free article] [PubMed]
41. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28: 27–30. [PMC free article] [PubMed]
42. Beltrao P, Serrano L (2007) Specificity and evolvability in eukaryotic protein interaction networks. PLoS Comput Biol 3: e25. [PubMed]
43. Stark C, Breitkreutz BJ, Chatr-Aryamontri A, Boucher L, Oughtred R, et al. (2011) The BioGRID Interaction Database: 2011 update. Nucleic Acids Res 39: 698–704. [PMC free article] [PubMed]
44. Mewes HW, Frishman D, Mayer KF, Munsterkotter M, Noubibou O, et al. (2006) MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res 34: D169–172. [PMC free article] [PubMed]
45. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, et al. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32: D449–451. [PMC free article] [PubMed]
46. Kerrien S, Aranda B, Breuza L, Bridge A, Broackes-Carter F, et al. (2012) The IntAct molecular interaction database in 2012. Nucleic Acids Res 40: D841–846. [PMC free article] [PubMed]
47. Licata L, Briganti L, Peluso D, Perfetto L, Iannuccelli M, et al. (2012) MINT, the molecular interaction database: 2012 update. Nucleic Acids Res 40: D857–861. [PMC free article] [PubMed]
48. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, et al. (2009) Human Protein Reference Database–2009 update. Nucleic Acids Res 37: D767–772. [PMC free article] [PubMed]
49. Cowley MJ, Pinese M, Kassahn KS, Waddell N, Pearson JV, et al. (2012) PINA v2.0: mining interactome modules. Nucleic Acids Res 40: D862–865. [PMC free article] [PubMed]
50. Prieto C, De Las Rivas J (2006) APID: Agile Protein Interaction DataAnalyzer. Nucleic Acids Res 34: 298–302. [PMC free article] [PubMed]
51. Ali W, Deane CM (2010) Evolutionary analysis reveals low coverage as the major challenge for protein interaction network alignment. Mol Biosyst 6: 2296–2304. [PubMed]
52. Erten S, Li X, Bebek G, Li J, Koyuturk M (2009) Phylogenetic analysis of modularity in protein interaction networks. BMC Bioinformatics 10: 333. [PMC free article] [PubMed]
53. Vazquez A, Flammini A, Maritan A, Vespignani A (2003) Modeling of Protein Interaction Networks. Complexus 1: 38–44.
54. Sole RV, Pastor-Satorras R, Smith E, Kepler TB (2002) A model of large-scale proteome evolution. Advances in Complex Systems (ACS) 5: 43–54.
55. Pastor-Satorras R, Smith E, Sole RV (2003) Evolving protein interaction networks through gene duplication. J Theor Biol 222: 199–210. [PubMed]
56. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286: 509–512. [PubMed]
57. Qian J, Luscombe NM, Gerstein M (2001) Protein family and fold occurrence in genomes: powerlaw behaviour and evolutionary model. J Mol Biol 313: 673–681. [PubMed]
58. Bhan A, Galas DJ, Dewey TG (2002) A duplication growth model of gene expression networks. Bioinformatics 18: 1486–1493. [PubMed]
59. Rzhetsky A, Gomez SM (2001) Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17: 988–996. [PubMed]
60. i Goh K, Kahng B, Kim D (2005) Evolution of the protein interaction network of budding yeast: Role of the protein family compatibility constraint. J Korean Phys Soc 46: 551–555.
61. Bebek G, Berenbrink P, Cooper C, Friedetzky T, Nadeau JH, et al. . (2007) Improved duplication models for proteome network evolution. In: Proceedings of the 2005 joint annual satellite conference on systems biology and regulatory genomics (RECOMB' 05). Berlin: Springer-Verlag. pp. 119–137.
62. Przulj N, Kuchaiev O, Stevanović A, Hayes W (2010) Geometric evolutionary dynamics of protein interaction networks. Pac Symp Biocomput 178–189. [PubMed]
63. Berg J, Lassig M, Wagner A (2004) Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evol Biol 4: 51. [PMC free article] [PubMed]
64. Guillaume JL, Latapy M (2006) Bipartite graphs as models of complex networks. Physica A: Statistical and Theoretical Physics 371: 795–813.
65. Kim WK, Marcotte EM (2008) Age-dependent evolution of the yeast protein interaction network suggests a limited role of gene duplication and divergence. PLoS Comput Biol 4: e1000232. [PMC free article] [PubMed]
66. Ispolatov I, Krapivsky PL, Yuryev A (2005) Duplication-divergence model of protein interaction network. Phys Rev E Stat Nonlin Soft Matter Phys 71: 061911. [PMC free article] [PubMed]
67. Evlampiev K, Isambert H (2008) Conservation and topology of protein interaction networks under duplication-divergence evolution. Proc Natl Acad Sci USA 105: 9863–9868. [PubMed]
68. Levy ED, Pereira-Leal JB (2008) Evolution and dynamics of protein interactions and networks. Current Opinion in Structural Biology 18: 349–357. [PubMed]
69. Ratmann O, Jrgensen O, Hinkley T, Stumpf M, Richardson S, et al. (2007) Using likelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum. PLoS Comput Biol 3: e230. [PMC free article] [PubMed]
70. Coolen ACC, Fraternali F, Annibale A, Fernandes L, Kleinjung J (2011) Modelling Biological Networks via Tailored Random Graphs, JohnWiley and Sons, Ltd. pp. 309–329.
71. Barabasi AL, Oltvai ZN (2004) Network biology: understanding the cell's functional organization. Nat Rev Genet 5: 101–113. [PubMed]
72. Middendorf M, Ziv E, Wiggins CH (2005) Inferring network mechanisms: the Drosophila melanogaster protein interaction network. Proc Natl Acad Sci USA 102: 3192–3197. [PubMed]
73. Hormozdiari F, Berenbrink P, Przulj N, Sahinalp SC (2007) Not all scale-free networks are born equal: the role of the seed graph in PPI network evolution. PLoS Comput Biol 3: e118. [PubMed]
74. Ohno S (1970) Evolution by gene duplication, 1st ed. Heidelberg: Springer-Verlag. 160 p.
75. D'Antonio M, Ciccarelli FD (2011) Modification of Gene Duplicability during the Evolution of Protein Interaction Network. PLoS Comput Biol 7: e1002029. [PMC free article] [PubMed]
76. Wagner A (2003) How the global structure of protein interaction networks evolves. Proc Biol Sci 270: 457–466. [PMC free article] [PubMed]
77. Navlakha S, Kingsford C (2011) Network archaeology: Uncovering ancient networks from present-day interactions. PLoS Comput Biol 7: e1001119. [PMC free article] [PubMed]
78. Colak R, Hormozdiari F, Moser F, Schonhuth A, Holman J, et al. (2009) Dense graphlet statistics of protein interaction and random networks. Pac Symp Biocomput 178–189. [PubMed]
79. Newman ME (2004) Fast algorithm for detecting community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys 69: 066133. [PubMed]
80. Park D, Singh R, Baym M, Liao CS, Berger B (2011) IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Res 39: 295–300. [PMC free article] [PubMed]
81. Vazquez A, Dobrin R, Sergi D, Eckmann JP, Oltvai ZN, et al. (2004) The topological relationship between the large-scale attributes and local interaction patterns of complex networks. Proc Natl Acad Sci USA 101: 17940–17945. [PubMed]
82. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410. [PubMed]
83. Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227: 1435–1441. [PubMed]
84. Knuth DE (1997) The Art of Computer Programming, 3rd ed, vol. 1. Boston, MA: Addison-Wesley Professional. 672 p.
85. Baudot A, Jacq B, Brun C (2004) A scale of functional divergence for yeast duplicated genes revealed from analysis of the protein-protein interaction network. Genome Biol 5: R76. [PMC free article] [PubMed]
86. Altschul SF, Gish W (1996) Local alignment statistics. Meth Enzymol 266: 460–480. [PubMed]
87. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al. (1997) Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402. [PMC free article] [PubMed]
Articles from PLoS ONE are provided here courtesy of
Public Library of Science