Home | About | Journals | Submit | Contact Us | Français |

**|**PLoS Comput Biol**|**v.3(11); 2007 November**|**PMC2098858

Formats

Article sections

- Abstract
- Author Summary
- Introduction
- Results/Discussion
- Materials and Methods
- Supporting Information
- References

Authors

Related links

PLoS Comput Biol. 2007 November; 3(11): e230.

Published online 2007 November 30. Prepublished online 2007 October 9. doi: 10.1371/journal.pcbi.0030230

PMCID: PMC2098858

Oliver Ratmann,^{1,}^{*} Ole Jørgensen,^{2} Trevor Hinkley,^{3} Michael Stumpf,^{4,}^{5} Sylvia Richardson,^{6} and Carsten Wiuf^{2,}^{7}

Sebastion Bonhoeffer, Editor^{}

ETH Zürich, Switzerland

* To whom correspondence should be addressed. E-mail: oliver.ratmann/at/imperial.ac.uk

Received 2007 March 28; Accepted 2007 October 5.

This article has been cited by other articles in PMC.

Gene duplication with subsequent interaction divergence is one of the primary driving forces in the evolution of genetic systems. Yet little is known about the precise mechanisms and the role of duplication divergence in the evolution of protein networks from the prokaryote and eukaryote domains. We developed a novel, model-based approach for Bayesian inference on biological network data that centres on approximate Bayesian computation, or likelihood-free inference. Instead of computing the intractable likelihood of the protein network topology, our method summarizes key features of the network and, based on these, uses a MCMC algorithm to approximate the posterior distribution of the model parameters. This allowed us to reliably fit a flexible mixture model that captures hallmarks of evolution by gene duplication and subfunctionalization to protein interaction network data of *Helicobacter pylori* and *Plasmodium falciparum*. The 80% credible intervals for the duplication–divergence component are [0.64, 0.98] for *H. pylori* and [0.87, 0.99] for *P. falciparum*. The remaining parameter estimates are not inconsistent with sequence data. An extensive sensitivity analysis showed that incompleteness of PIN data does not largely affect the analysis of models of protein network evolution, and that the degree sequence alone barely captures the evolutionary footprints of protein networks relative to other statistics. Our likelihood-free inference approach enables a fully Bayesian analysis of a complex and highly stochastic system that is otherwise intractable at present. Modelling the evolutionary history of PIN data, it transpires that only the simultaneous analysis of several global aspects of protein networks enables credible and consistent inference to be made from available datasets. Our results indicate that gene duplication has played a larger part in the network evolution of the eukaryote than in the prokaryote, and suggests that single gene duplications with immediate divergence alone may explain more than 60% of biological network data in both domains.

The importance of gene duplication to biological evolution has been recognized since the 1930s. For more than a decade, substantial evidence has been collected from genomic sequence data in order to elucidate the importance and the mechanisms of gene duplication; however, most biological characteristics arise from complex interactions between the cell's numerous constituents. Recently, preliminary descriptions of the protein interaction networks have become available for species of different domains. Adapting novel techniques in stochastic simulation, the authors demonstrate that evolutionary inferences can be drawn from large-scale, incomplete network data by fitting a stochastic model of network growth that captures hallmarks of evolution by duplication and divergence. They have also analyzed the effect of summarizing protein networks in different ways, and show that a reliable and consistent analysis requires many aspects of network data to be considered jointly; in contrast to what is commonly done in practice. Their results indicate that duplication and divergence has played a larger role in the network evolution of the eukaryote *P. falciparum* than in the prokaryote *H. pylori*, and emphasize at least for the eukaryote the potential importance of subfunctionalization in network evolution.

Genomic sequence data provides substantial evidence for the abundance of duplicated genes in all organisms surveyed: at least 40% of genes in two prokaryotes [1,2] and 15%–90% of genes in eukaryotes [3–5] appear to be products of gene duplication. This suggests that gene duplication is a key mechanistic driving force behind the evolution of complex organisms [6]. In particular, the fact that the number of interactions shared by paralogous proteins decreases with sequence similarity in *Saccharomyces cerevisiae* [7,8] indicates that gene duplication might shape the topology of protein networks.

In theory, the evolutionary fate of gene duplicates can differ: (D1) one copy may become silenced (nonfunctionalization); (D2) both copies are very similar in sequence, and one is functionally redundant to the other [9]; (D3) both copies are mutationally compromised, and one or more subfunctions of the single progenitor are altered (subfunctionalization); or (D4) one copy may acquire a novel function preserved by natural selection, while the other copy retains the original function (neofunctionalization). The strength of (D3) is that it does not rely on the sparse occurrence of beneficial mutations, but on more frequently occurring loss-of-function mutations in regulatory regions [10,11]. Alternatively, based mostly on the assumption that the number of protein pairs that may acquire a novel function is large, several studies [7,12,13] promoted the relative importance of (D4), as well as the formation or degeneration of functional links between proteins in general (link turnover).

The structure of protein interaction networks (PINs) derives from multiple stochastic processes over evolutionary time scales, and a number of mechanisms have been proposed to capture aspects of network growth [12,14–16]. These models correspond to our limited understanding of network evolution, and there is no consensus as to which mechanisms are required to produce “realistic” models for biological PINs [17]. What is required is to be able to fit to biological network data a model (or mixture of models) of growing networks that reproduce more accurately the properties of real biological networks than simple preferential attachment [18] or duplication models [16]. For duplication–attachment models of network growth, Wiuf et al. [19] developed a full likelihood approach; this class of models, however, does not adequately explain the structure of most biological network data.

The analysis of PINs is notoriously difficult because measurements of PINs are subject to considerable levels of noise [20,21], and in their present guise, offer only an incomplete description of the true interaction network [22]. Interaction datasets are also highly averaged, not only over technical aspects such as the experimental protocol, but also over the precise cellular conditions under which interactions take place, interaction strength, and individual variation.

In this work, we develop an approximate, likelihood-free Monte-Carlo inference technique based on approximate Bayesian computation (ABC) [23–26] for inference on biological protein network data. Previously, an approximate composite likelihood approach has been proposed, using only the degree sequence to test whether or not simple scale-free models offer an adequate description of PIN data [27]. Owing to the complexity of PINs, we take multiple features of the data into account, which characterize PINs more fully. Our likelihood-free approach allows us to reliably compare more complex models of network evolution in order to study the relative importance of aspects of gene duplication and subsequent interaction divergence in prokaryotic and eukaryotic network evolution. Within the limits of the model and the available data, we find evidence for different dynamics in PIN evolution between the prokaryotic and eukaryotic domains as represented by *H. pylori* and *P. falciparum*, respectively.

The degree sequence [18], as well as the frequency profile of motif counts [28] are widely used to analyze protein network data. Our analysis shows that the degree sequence barely captures evolutionary footprints of PINs relative to other statistics. It also suggests that motif counts are extremely variable over the modelled evolutionary history of PINs, and thus inference based on these alone is fragile. Only the simultaneous analysis of many global aspects of PIN data rendered our evolutionary study credible and consistent.

To study the relative importance of aspects of duplication divergence in network evolution between different domains, we simulated the evolutionary history of PINs with a mixture of duplication divergence with parent–child attachment (DDa) and preferential attachment (PA); see Box 1. At each step, the network either grows according to DDa with probability 1 − α or PA with probability α. More precisely, let *G _{t}* be a network with

PINs can be described as graphs (Figure 7), which contain a set of **nodes** representing proteins with observed interactions, and **edges** representing observed interactions between proteins. Here, we focus on **undirected, unweighted, binary interactions** representing physical or indirect interaction under possibly different experimental conditions. **Randomly growing graphs** model the long-term, undirected evolution of protein networks; here, we consider two simple, **local growth mechanisms** that add a single node to the network at a time.

**Duplication Divergence with parent–child attachment** (DDa) [14,58] features a node duplication step followed immediately by an interaction divergence step. At each step (Figure 7), a parent node (orange) is randomly chosen, and its edges are duplicated. Each of the parental or duplicated edges (purple) is then lost (i.e., diverges) with probability *δ*_{Div}; but for each parental edge, either the parental or the duplicated one must be retained after divergence. An interaction of the parent node with its child (blue, as indicated by the blue arrow) is given probability *δ*_{Att}. DDa also generates nodes with no interactions; we impose that DDa does not generate nodes with no interactions.

**Preferential Attachment** (PA). A new node (purple) is added to the network, and attached to an existing node (orange) with probability proportional to the node degree (number of black edges per node).

The final steps in the graphs display possible realizations of the DDa and PA mechanisms, respectively.

The terms PA(*u*, *v*) and DDa(*u*, *v*, *δ*_{Div,} *δ*_{Att}) correspond to the probabilities of moving to the new configuration under PA and DDa, respectively. They are explained in Box 1 and defined fully in Protocol S1. By repeated application of the mechanism in Equation 1, we grew PINs to the approximate number of open reading frames in the respective genomes (*H. pylori*: 1,500, and *P. falciparum*: 5,300).

We chose this mixture evolution model for a number of reasons. DDa agrees with aspects of genome evolution by gene duplication [29]. Several studies [7,8,10,30] found a rapid divergence of the interaction profiles of duplicate genes, indicating that duplication and subsequent divergence might be adequately modelled in a single step. Importantly, DDa may relate to subfunctionalization [31]: as a rule, at least one edge disappears, and the duplicates share the pleiotropy of the parent node [10,32]. Also, the model does not disagree with purifying selection that maintains the ancestral function at both duplicates [9,33,34], because, occasionally, all ancestral edges are retained.

The second component of the mixture model, first introduced in [18], is a generic local growth mechanism based on PA that may explain some characteristics of networks, in particular the approximate power-law decay of the node degrees. In the present context, it captures effects of network growth which are not specifically related to (D1–D3). Such effects are likely present in network evolution; Middendorf et al. [35] showed that PINs simulated by DDa alone may underrepresent tree-like subgraphs, whereas these are more accurately generated by PA. Also, horizontal gene transfer is a major force in prokaryote evolution. It is plausible to model such transfer with an attachment process, although no particular model has been proposed in the literature.

Overall, in the mixture model (Equation 1), network evolution proceeds by repeated node addition. Apart from rate homogeneity over all proteins, there are thus no further assumptions on the evolutionary clock of our model; a property that is particularly desirable because evolutionary events such as duplication or interaction divergence are generally unavailable or difficult to estimate reliably. Since link turnover is suspected to operate on a different time scale than duplication divergence, extending the model (Equation 1) with preferential link rewiring [12] would imply further assumptions on the evolutionary clock; potentially, phylogenetic data could help to fit such birth and death models of network evolution.

The evolution parameters are abstract quantities that subsume a number of complex biological processes [36]. The parameter *δ*_{Div} may, for example, relate to mutations and insertions or deletions on the sequence level, but also to novel posttranslational modifications or translocations into a different cellular compartment of one interaction partner. Notably, *δ*_{Div} is associated with immediate divergence and thus differs from divergence probabilities obtained from sequence data, since the latter are usually inferred over a time interval [37]. The parameter *δ*_{Att} represents the probability of link formation between duplicates. In this study, the mixture parameter α is of particular interest; we ask whether and to what extent, despite high incompleteness, the PIN topology of representatives from the prokaryotic and eukaryotic domains contain evolutionary footprints that may be related to a model that captures hallmarks of network evolution by (D1–D3).

To account for incomplete data, random subnetworks of order *N* are chosen from the simulated networks that are grown to approximately the number of open reading frames in the respective genomes. Here, *N* is the number of proteins with observed interactions in the two datasets (*H. pylori*: 675 and *P. falciparum*: 1,271). The PIN datasets generated by Equation 1 and subsequent subsampling are dominated by stochastic effects (Figure S1); nevertheless, different parameters leave distinguishable imprints on simulated PINs (Figure S2).

The Bayesian paradigm is a powerful probabilistic framework for making inference on complex stochastic systems and allows all sources of uncertainty to be accounted for [38]. We applied this paradigm to estimate the posterior density *p(θ*|
D) of *θ*, given a real PIN dataset
D. Bayes' Theorem relates *p*(*θ*|
D) to the likelihood *p*(
D|*θ*) and the prior of *θ*, *p*(*θ*), via

where denotes “proportional to.” In the absence of substantial prior information on *θ*, we use a uniform prior. The increased flexibility of Equation 1 comes at a computational cost and prohibits likelihood calculations that have been formalized by Wiuf et al. [19] for only very simple evolution models.

ABC confers computational tractability by circumventing the problem of evaluating the likelihood directly [23–26] and relies instead on the simulation of networks and the computation of network summaries. All ABC algorithms have in common to approximate first the data
D by a set of summaries
S_{
D}, for example ND and DIA (see Box 2 for a glossary of summary statistics and their abbreviations in the text) in the case of protein networks, and then proceed through several steps to sample parameter values from an approximate posterior density; see Materials and Methods for details. One approach is to sample from the prior density (noninformative in our case) and accept the proposed value, given that certain criteria are fulfilled. However, as suggested by Figure S2, only a small range of parameter values generate data with summaries close to
S_{
D}. Consequently, we anticipate that generating candidate parameters from the prior will be highly inefficient.

**CC** Average Cluster Coefficient, mean probability that two neighbours of a node are themselves neighbours.

**Degree** The number of edges associated with a node.

**DIA** Diameter, the longest minimum path among pairs of nodes in a connected component of the network.

**Distance** The distance between nodes *i* and *j* is the minimum number of edges that have to be visited to reach *j* from *i*.

**FRAG** Fragmentation, the percentage of nodes not in the largest connected component.

**ND** Node Degree Distribution or Degree Sequence, *p*(*nd* = *k*), the percentage of nodes with degree *k* in a network.

Average Node Degree, the mean degree of a network.

**Order** The number of nodes in a network.

**PL** Average Path Length, the average distance of all node pairs in a connected component in the network.

**Size (R)** The number of edges in a network.

**TRIA** Number of Triangles, the number of 3-cycles in the network.

**WR** Within-Reach Distribution, *p*(*wr* ≤ *k*), the mean probability of how many nodes are reached from one node within distance *k* in the network.

Likelihood-free inference (LFI) within Markov Chain Monte Carlo (MCMC) [25] improves efficiency of standard ABC by exploiting knowledge of the current parameter value to make an educated guess on the next one. The details of algorithm ABC-MCMC are outlined in Material and Methods. It is guaranteed to eventually generate a series of correlated samples from

where is the tolerance according to the distance function *d*, and
S* _{θ}* is the set of summary statistics calculated on simulated data with parameter

In order to achieve an approximation of the posterior for inference on protein networks, we modified ABC-MCMC to our algorithm LFI; see also Materials and Methods.

LFI compares summaries of the observed dataset with mean summaries S of an ensemble of simulated PINs at each iteration of the algorithm during the burn-in phase, i.e., the first 800 iterations in this study. Since mean summaries over larger ensembles have reduced variance (Figure S3), LFI initially accepts parameters that are disproportionally close to the posterior mode. Consequently, algorithm LFI enables to burn in rapidly, enhancing computational efficiency. Previous studies [39,40] used averaging in similar numerical methods to efficiently approximate the maximum of the likelihood. We found that averaging summary statistics over 50 generated PINs is sufficient to burn in rapidly.

LFI within MCMC is often prone to get stuck or to sit in the tails of the distribution [26]; essentially because the likelihood ratio within MCMC is coarsely approximated by either zero or one. To avoid the chain getting stuck, we adopted a tempering scheme [41] on the threshold . That is, during the burn-in phase, acceptance of parameters is controlled by a decreasing sequence of thresholds until a minimal, preset value _{min} is reached. To avoid the chain sitting in tails, it is in our case sufficient to temper the proposal variance Σ. See Materials and Methods for further details.

Choosing appropriate summary statistics is central to any method approximating the true likelihood. This choice is governed by the principle that useful summaries should be sensitive to genuine changes in real PINs. Briefly, we characterized genuine changes by comparing the standardized mean derivative (smd) of a summary smd(*θ*), and the variation cv(*θ*) of the summary for varying values of *θ*. As further described in Materials and Methods, both smd(*θ*) and cv(*θ*) are scaled and use the same distance measure across summaries, so that our analysis allows us to compare summaries one by one. Except for ND, smd was not close to 0, and highest for CC and TRIA. Figure 1 illustrates this for the mixture parameter *α*. CC and TRIA, as well as FRAG, had the greatest variability, whereas ND showed almost no random fluctuations (see Figure S4). Taken together, this indicates that relative to other statistics, motif counts are extremely variable for fixed *θ*, and that ND has very limited power to detect genuine changes in *θ*. We derived a novel distributional statistic, the within-reach distribution (WR), that is more sensitive to changes in *θ* than ND; see Materials and Methods. Indeed, WR conveyed twice as much information as ND for the mixture parameter *α*; see Figure 1B.

LFI is sensitive to the particular type of distance function *d* on a set of summaries. Often, a linear combination of standardized summaries is used [24]; instead, as detailed in Materials and Methods, we require that each summary over simulated PIN data is sufficiently close to the respective observed summary. Figure 2 shows that posterior support obtained by a simple linear combination differed not only in scale, but also in shape from the one obtained by our more stringent approach *d*_{∩}.

In summary, for inference on protein networks our results suggest that

where each *S _{k}* denotes the

Our evolutionary analysis of real PIN datasets centres on a comparison of two representatives from the prokaryotic and eukaryotic domain. We obtained descriptions of the PINs of *H. pylori* and *P. falciparum* from the Database of Interacting Proteins (http://dip.doe-mbi.ucla.edu). We first investigated LFI with different sets of summaries on simulated data as outlined in Protocol S1; based on the test results, we selected the set of summaries WR + DIA + CC +
+ FRAG for LFI.

We successfully applied LFI on the *H. pylori* PIN. Figure 3 presents the MCMC chains for the divergence parameter *δ*_{Div} [0,1], and the estimated posterior *p(δ*_{Div}|
D). Similar good convergence was obtained for the attachment probability *δ*_{Att} and the mixture parameter *α*, and the 80% credible intervals (i.e., the inner range of values of a random variable that attains 80% posterior mass) are presented in Table 1. Technically important, the Markov chain resulting from algorithm LFI did not get stuck and did not sit in the tails for relatively small threshold values _{min}. We could not reproduce our results without averaging over an ensemble of *B* = 50 simulated PIN datasets during burn-in, nor without tempering of and Σ as described in Materials and Methods. Based on our theoretical considerations with smd(*θ*) and cv(*θ*) and our test results, we believe approximation (Equation 4) has been achieved, but note that ultimate evidence cannot be provided since evaluating the likelihood is not feasible to date.

Comparison of the Evolutionary Dynamics Inferred from *H. pylori* and *P. falciparum* PIN Data, with LFI Based on WR + DIA + CC +
+ FRAG

We repeated the LFI analysis on the *P. falciparum* PIN with the same set of summaries; importantly, these capture global aspects of PIN data simultaneously. The posterior distribution of *θ* for *P. falciparum* is summarized in Table 1. Notably, the DDa component obtained more weight in the posterior mixture model DDa + PA relative to *H. pylori*. This suggests, first, that duplication–divergence shapes the global structure of protein networks in a way distinguishable from preferential attachment, and that the difference is also evident when incompleteness of present PIN data is taken into account. Second, gene duplication and interaction divergence might play a larger role in eukaryotic than in prokaryotic protein network evolution, pointing to either discontinuous (i.e., likely to be adaptive) or continuous (i.e., unlikely to be adaptive) taxonomical differences, as already suggested from the extent [42], the size [43,44], and the complexity [45] of protein families.

We found that the lower 80% quantile of 1 − α is larger than 0.6 in both investigated species. Genomic and expression data indicate that repeated single gene duplications with immediate subfunctionalization are a driving force in the evolution of higher organisms [10,11,32,46,47]. Since, on average, DDa mimics duplication with subfunctionalization (see also Box 1), our results emphasize the potential importance of single gene duplications with immediate subfunctionalization in the evolution of the eukaryote. Moreover, we prove in Protocol S1 that DDa may describe any protein network topology due to complementary, random interaction divergence. The precise mechanisms of evolution are less clear for the prokaryote; in particular it is unclear whether horizontal gene transfer is adequately modelled with PA [48], and we caution against interpreting DDa + PA as a model of vertical versus horizontal gene transfer. Nevertheless, the prevalence of duplication divergence in prokaryotic evolution is also indicated from the protein repertoire itself [5,49,50]. In particular, the phylogenetic distributions of protein families over 41 bacteria are consistent with our findings: 60% of protein families in these prokaryotes can be explained by gene duplications alone [50].

The role of duplication divergence in evolution of protein networks across domains we promote here must be considered within the limits of our model and the data. However, we note that our analysis is based on several global features of the network data, which are more reliable than local aspects (Figure S4). More importantly, LFI allows us to take the stochasticity of the evolutionary process and the incompleteness of available network data into account. Also, the credible intervals of *δ*_{Div} and *δ*_{Att} for the *P. falciparum* PIN overlap with parameter estimates obtained from sequence data of *S. cerevisiae*. The study of Wagner [37] indicates a mean divergence probability around 35%–42% and a mean attachment probability around 1%–2% within the first 25 million years after a duplication event in this species. Given the number of limitations in both approaches, further work will be required to combine genomic with network data for a detailed reconstruction of the evolution of complex cellular units. Importantly, fitting a model of network evolution that includes link turnover as a case of neofunctionalization might put our conclusions into perspective.

The complexity of PIN data suggests that LFI on biological network data may be highly influenced by the choice of summaries. Table 2 summarizes that for different combinations of four or more summaries, the respective posterior means and 80% credible intervals coincided with those obtained by WR + DIA + CC +
+ FRAG. Thus, based on many aspects of PINs, inference on *θ* was consistent. Based on the *H. pylori* PIN, we found that the approximate posterior (Equation 4) was not identifiable from single summary statistics. Using ND only, it is possible to choose _{min} small, _{min} ≤ 0.35; but Table 2 shows that the inferred 80% credible interval on *θ* is very wide. Considering the parameters *δ*_{Div}, *δ*_{Att}, and α pairwise, as in Figure 4, illustrates that ND alone leads to two-dimensional high-density regions that are inconsistent with those obtained by four or more summaries. Similarly, LFI based on several other single summary statistics allowed small threshold values _{min}, but did not lead to a reliable and consistent estimation of *θ* (unpublished data). This indicates that many evolutionary histories may explain single aspects of PINs almost perfectly without representing the full topology, reflecting the complex nature of biological network data. Our findings relating to ND are particularly worrisome because the degree sequence is a standard descriptor of protein networks, and often kept fixed when generating randomized networks for a significance analysis on aspects of PIN data [28,51,52].

PA alone generates tree-like networks, whereas DDa occasionally produces triangles. Surprisingly, LFI with TRIA included in the set of summary statistics did not aid inference in that convergence took longer and fewer samples were accepted without tightening the credible intervals. Taken together with the fact that other motif counts have a similar high variation over the evolutionary history (unpublished data), this suggests that the extreme variability of motif counts in simulated data reduces their usefulness for inference on biological network data.

Aspects of the complete, unobserved PINs are easily predicted from the observed networks, once MCMC output is available. Here, we discuss the true network size *R*, by means of its posterior predictive distribution; as outlined in Materials and Methods. The posterior predictive distribution of *R* for *H. pylori* and *P. falciparum* is displayed in Figure 5. De Silva et al. [22] proposed a simple estimator of the network size based on the sampling fraction *ρ* of proteins that are present in the dataset. Applied to *H. pylori* (*P. falciparum*), the estimate is *R′* = 5,636 (43,835). This is consistent with the posterior predictive distribution obtained by LFI based on WR + DIA + CC +
+ FRAG in the sense that *Pr*(½ ≤ *R*/*R*′ ≤ 2|
D) ≥ 0.80.

The fact that current PINs are largely incomplete hampers inference [22,53]. Within our Bayesian framework, we compared the effect of different network order and different levels of incompleteness of PIN datasets on protein network inference (*H. pylori*: 675, *ρ* = 0.45; and *P. falciparum*: 1,271, *ρ* = 0.24).

We found large variability associated with predictions of the true network size (see Figure 5); notably, the *P. falciparum* posterior network size was more diffuse than the one of *H. pylori*. In order to see whether the large variability arises from the approximative nature of LFI, we repeated LFI based on WR + DIA + CC +
+ FRAG for relaxed choices of _{min}. Figure 5 shows that tightening the threshold values results in more reliable predictions, and that this effect is negligible when twice as much network data are available. This suggests that aspects of the structure of the true networks remain highly uncertain under the model (Equation 1) when incompleteness is large.

Instead, the credible intervals of all evolution parameters *θ* are tighter for *P. falciparum* than for *H. pylori*, even though our model accounts for incompleteness. This indicates that the power of LFI to uncover the evolutionary history of PIN datasets increases with network order irrespective of the levels of incompleteness, essentially because the resolution of the network summaries increases.

We further analysed how the degree of incompleteness affects LFI by randomly withholding more network data of the *P. falciparum* PIN (*ρ =* 0.17, 0.12, 0.06); see Materials and Methods for details. Briefly, for PINs with *ρ* ≥ 0.17, LFI using WR + DIA + CC +
+ FRAG was possible, and the parameters were distinguishable in terms of the errors between the real and associated simulated summaries. Table 3 summarizes the 80% credible intervals of all parameters for LFI based on WR + DIA + CC +
+ FRAG for different *ρ*. As expected, highly increased incompleteness implied larger credible intervals. More importantly, randomly omitting 500 proteins from the available PIN of 1,271 proteins did not significantly affect LFI. This is further illustrated with the posterior densities of the mixture parameter *α*, Figure 6.

PINs from different species have attracted much attention in molecular systems biology. Apart from their suspected role in modulating and underpinning the molecular machinery of complex phenotypes, their evolutionary properties are increasingly being investigated using a range of evolutionary and statistical approaches. We showed that it is possible to draw evolutionary inferences from large-scale, incomplete network data when models of randomly growing graphs are conditioned on many, carefully chosen aspects of the networks. Using a likelihood-free approach that relies on comparing summaries of real network data to simulated PINs, we were able to study more complex models of network evolution at increased confidence than had previously been possible [19].

Our results have important implications for the analysis of protein network topology. Due to its elusive complexity, the topology of a PIN is commonly summarized by the degree sequence [18], as well as the frequency profile of motif counts [28]. An extensive sensitive analysis showed that the degree sequence has very little power to distinguish among different parameters relative to other statistics (Figures 1B and S4B), and fails to infer the parameters correctly (Figure 4). We found that the number of triangles is extremely variable over the evolutionary history of simulated PINs (Figures S1B and S4A) and did not help inference, suggesting that motif counts are risky descriptors of PINs. Instead, if four or more network summaries are combined, then our method yields (i) consistent estimates as well as tight credible intervals on biological data, and (ii) accurate estimates on simulated test data where, by definition, the model is correct. The fact that a reliable, consistent analysis requires the combination of several summaries that capture global aspects of the networks, of which WR is computationally very expensive, renders an implementation targeting the *S. cerevisiae* PIN dataset extremely challenging.

We used our computational inference scheme to estimate the potential role of aspects of duplication divergence in different domains from large-scale biological network data of *H. pylori* and *P. falciparum*, complementing a number of efforts to uncover the mechanisms that underlie the evolutionary history of complex organisms from sequence data [1–3], protein structures [4], or gene families within a wider context [54]. Here, the evolutionary history of PINs was modelled with a mixture of randomly growing graphs that (i) agrees in particular with evolution by single gene duplications and immediate divergence, and (ii) puts minimal assumptions on the time of evolutionary events, because these are difficult to estimate reliably. Crucially, our approach fully deals with incomplete network data and the stochasticity of the underlying evolutionary process. Inference of the evolutionary parameters improves with an increasing order of the PIN data, irrespective of the levels of incompleteness (Figure 6 and Table 3). Within the limits of our evolutionary model and the available data, gene duplication and interaction divergence appear to play a dominant, distinguishably larger part in the evolution of the protein network of the eukaryote *P. falciparum* (Table 1). Our results emphasize the potential importance of duplication divergence in the evolution of networks across domains. Based on our sensitivity analysis of network summaries, our study suggests, in line with two other recent studies [55,56], that more information could be inferred from combining global aspects of interaction networks than is presently appreciated.

The opportunities arising from LFI to computational statistics on complex systems are large. Our results emphasize that choosing a set of appropriate summaries is central to maintaining the approximate character of LFI. We proposed the standardized mean derivative and measures of scaled variation to compare the power of summaries one by one. Although ABC-MCMC failed on network data, algorithm LFI enabled efficient and consistent inference. LFI might prove useful in other biological contexts when prior information is relatively vague, and when the underlying model is complex and highly stochastic.

For clarity of exposition, we first outline algorithm ABC-MCMC [25] and then present algorithm LFI, which achieves approximation (Equation 4) in protein network inference. Let
S = {S_{1},…, S_{k}…S_{K}} be the chosen set of summary statistics, and let > 0 be a threshold value. Let
S* _{
D}*, respectively
S

**ABC-MCMC1** If now at *θ*, propose a move to *θ′* according to a proposal density *q*(*θ → θ′*).

**ABC-MCMC2** Generate a dataset from *θ′* and compute
S* _{θ′}*.

**ABC-MCMC3** Calculate

Here, *d*(
S_{
D},
S* _{θ′}*) ≤ denotes that the distance between the

**ABC-MCMC4** Accept *θ′* with probability *h* and otherwise stay at *θ*, then return to ABC-MCMC1.

ABC-MCMC is guaranteed to eventually sample from *p*(*θ*|*d*(
S_{
D,}
S* _{θ}*) ≤ ), [25]. We now present algorithm LFI. Let

**LFI0** If _{k,t} ≥ _{k}_{,min}, update * _{k,t}*; if Σ

**LFI1** If now at *θ*, propose a move to *θ′* according to a Gaussian density, centred at *θ* with diagonal covariance matrix Σ* _{t}* and restricted to the interval [0,1], i.e.,

**LFI2** During the preset, empirically determined burn-in phase, go to LFI2′. Else, generate *B* = 1 PIN according to the mixture model (Equation 1) with parameter *θ′* and grown to the number of open reading frames in the genome of the observed PIN. Take a subnetwork of order that equals the order of the observed PIN. Compute the summaries, put *s _{k,θ′}* :=

**LFI2′** Perform LFI2 with *B* = 50 and compute the sample mean *
_{k}*

**LFI3** Calculate

In our case, the prior is uniform, and *p*(*θ′*)/*p*(*θ*) is one. The distance function *d _{k}* for the

**LFI4** Accept *θ′* with probability *h* and otherwise stay at *θ*, then return to LFI0.

LFI fulfils the detailed balance equations for the same reasons as [25], and hence is guaranteed to eventually sample from

*Tempering scheme*. We temper the acceptance threshold * _{t}* with an exponential cooling scheme, starting at some initial temperature

*Choice of distance function d*. Our distance function in LFI3 is inspired by the Chebyshev distance proposed in [23] (and outlined in ABC-MCMC3). Notably, since the reliability of PIN summaries differs largely, we combine and do not standardize the summaries; our approach requires *K* different tempering schemes.

For ND and WR, *d _{k}* is chosen to capture major pointwise differences in the summaries. Given

*Initial values*. One approach to investigate whether a Markov chain has not yet converged is to start multiple chains at overdispersed initial values. We have started four Markov chains at the initial values (0.9, 0, 0), (0.7, 0.13, 0.23), (0.5, 0.26, 0.46), and (0.3, 0.4, 0.7). The first and the latter initial values represent unrealistic models to check that the chains move toward the support of the distribution. The other two initial values interpolate between these two extremes.

*Within-reach distribution (WR)*. Given a network
D and two connected nodes *i* and *j*, consider the shortest path from *i* to *j* as their distance *d*_{
D}(*i*, *j*). The (random) number *wr ^{k}*(

where the normalization constant *C* is the sum of all node pairs in each component in
D.

*Mean derivative of summary statistics*. In order to analyze the information content of summaries for protein networks, we follow the approach recently proposed by K. Heggland and A. Frigessi [39]. Consider one summary statistic S(*θ,**G*), evaluated on simulated data *G* generated with parameter *θ*. Heggland and Frigessi argue that “if for fixed *θ*, the variance in S(*θ*,·) is large compared with the derivative of its expectation, it will be more difficult to detect genuine changes at *θ* in S(**·***,**G*).” We adopt a variant of their approach, modified to the settings of this paper. Networks *G** ^{b}*,

are computed (note that *G** ^{b}* is a different realization of the mixture model (Equation 1) for the same values of

Here, *h* > 0 and *l _{D}* is the

Note that the average cluster coefficient is an observed probability, which is already normalized, and we utilize Equation 7 directly to compute its mean derivative. For the node degree distribution and the WR distribution, we compute the common support of *
*(θ + *h _{l}*) and

*Variation of summary statistics*. Consider a summary statistic S(*θ,* *G** ^{b}*) evaluated on simulated data

These values yield a relative error histogram for fixed
S and *θ*, and we employed the biweight kernel to estimate the density of standardized variation. In the case of CC, ND, and WR, we normalized as detailed above.

Aspects or quantities of PINs can be predicted within the Bayesian framework. The posterior predictive distribution of such a quantity, e.g., the network size *R*, may be estimated directly from the MCMC output:

where
denotes a posterior sample from the set of accepted parameters *θ* after convergence in the MCMC run. We are left to approximate *p*(*R* |
) by repeatedly generating PINs
according to
and calculating *R*, i.e.,

We have chosen *B* = 50 again, and took *I* = 500 samples from the MCMC output.

Out of 1,271 proteins in the *P. falciparum* PIN dataset, we randomly chose subgraphs of order *n* = 900, 600, and 300 to mimic increased incompleteness. For each Markov chain in an LFI simulation, such a subgraph was taken as the observed PIN dataset. Consequently, the four chains within one LFI simulation are fitted to slightly varying observations, making inference harder.

One thousand networks to *H. pylori* (grown to 1,500 nodes and subsampled to 675) are generated with the parameter *θ* = (0.32, 0.02, 0.15), and the squared errors between each summary and the mean summary are recorded. The frequency of cases such that the squared error is greater than values on the abscissa is plotted for WR,
, PL, ND, and TRIA. In 20% of all cases, the squared error in TRIA is greater than 1,000, whereas in all cases, the squared error in ND is not larger than one. Except for ND, large deviations are likely for all summaries, reflecting that stochastic effects dominate network summaries.

(55 KB PDF)

Click here for additional data file.^{(56K, pdf)}

We compared WR and ND for *α* = 0, 0.2, 1 to the observed summaries of *H. pylori* (grey) by simulating 50 networks to *H. pylori* (grown to the number of open reading frames: 1,500, and subsampled to the observed network order: 675) with *θ* = (0.24, 0.04, *α*) for varying *α*. For each within-distance *d* and each node degree *k*, the interquantile range of *p*(*wr* ≤ *d*) and *p*(*k*) for the 50 generated networks was drawn.

(A) The interquantile ranges of WR for PINs generated by different parameters were clearly distinct, and the mixture model with *α* = 0.2 visually improved fit relative to DDa and PA.

(B) On the same scale, the interquantile ranges of ND largely overlapped, indicating that ND might have significantly less power than WR to distinguish between different parameters.

(C) On the log scale for *p*(*k*), the interquantile ranges of ND generated by different parameters were again distinguishable, suggesting that the use of different distance metrics might play an important role in inference on protein network data.

(1.4 MB TIF)

Click here for additional data file.^{(1.4M, tif)}

Mean summaries over larger ensembles of simulated PIN datasets have reduced variance, as exemplified here with DIA. We computed the mean summary (red points) from *B* = 200, 50, 5 networks to *H. pylori* (grown to 1,500 nodes with *θ* = (0.28, 0.03, 0.21) and subsampled to 675 nodes). In each computation, the 50 networks were randomly chosen from the 200 networks, and then the five networks were randomly chosen from the 50 networks. This procedure was repeated 100 times, and we report the density of the distance of the mean simulated DIA to the observed DIA for *B* = 200, 50, 5. The average of these errors (vertical red line) and the range of one standard deviation (blue) are added. Clearly, the variance of the mean DIA shrinks with increasing *B*, and similarly for all other summaries (unpublished data) with
according to the Central Limit Theorem (unpublished data).

(47 KB PDF)

Click here for additional data file.^{(47K, pdf)}

To compare the variability of the mean posterior summaries of *H. pylori*, we studied the coefficient of variation density cv(*θ*), described in Materials and Methods, on the grid *θ* [0.1, 0.7] × [0, 0.5] × [0.1, 0.6] in steps of 0.025. Computations were based on summaries taken from 1,000 simulated PINs to *H. pylori* (grown to 1,500 nodes and subsampled to 675). We plot the marginal cv(*α*) against *α* for (A) summary statistics and (B) summary distributions. cv complements the information given by smd in Figure 1 to characterize the sensitivity and variability of the summary statistics. TRIA, FRAG, and CC are extremely variable, offsetting their high standardized mean derivatives. ND is almost invariant to random fluctuations and to different parameters. Results for the other two parameters are very similar (unpublished data).

(76 KB PDF)

Click here for additional data file.^{(76K, pdf)}

(1.1 MB PDF)

Click here for additional data file.^{(1.0M, pdf)}

We thank Mikael Hvidtfeldt Christensen, René Thomsen, and Thomas Bataillon for stimulating discussions. We also thank David Balding, David Welch, and John Molitor for critical review of the manuscript. Computations were performed at the Imperial College High Performance Computing Centre [57], and we thank Simon Burbidge and Matt Harvey for their excellent service.

- ABC
- approximate Bayesian computation
- cv
- coefficient of variation density
- DDa
- duplication divergence with parent–child attachment
- LFI
- likelihood-free inference
- MCMC
- Markov Chain Monte Carlo
- PA
- preferential attachment
- PIN
- protein interaction network
- smd
- standardized mean derivative
- WR
- within-reach distribution

**Author contributions.** OR conceived and designed the experiments. OR, OJ, and TH performed the experiments. OR, SR, and CW analyzed the data. OR, MS and CW wrote the paper.

**Competing interests.** The authors have declared that no competing interests exist.

A previous version of this article appeared as an Early Online Release on October 9, 2007 (doi: 10.1371/journal.pcbi.0030230.eor).

**Funding.** OR gratefully accepts funding from the Wellcome Trust; OJ, from the Danish Research Council; TH, from the Medical Research Council; MS, from the Wellcome Trust, the Biotechnology and Biological Sciences Research Council (BBSRC), and the European Molecular Biology Organization; SR, from the BBSRC and the Centre for Integrative Systems Biology at Imperial College; and CW, from the Danish Cancer Society and the Carlsberg Foundation.

- Labedan B, Riley M. Widespread protein sequence similarities: origins of Escherichia coli genes. J Bacteriol. 1995;177:1585–1588. [PMC free article] [PubMed]
- Teichmann S, Park J, Chothia C. Structural assignments to the mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci U S A. 1998;95:14658–63. [PubMed]
- Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290:1151–1155. [PubMed]
- Gough J, Karplus K, Hughey R, Chothia C. Assignment of homology to genome sequences using a library of hidden markov models that represent all proteins of known structure. J Mol Biol. 2001;313:14658–14663.
- Chothia C, Gough J, Vogel C, Teichmann SA. Evolution of the Protein Repertoire. Science. 2003;300:1701–1703. 10.1126/science.1085371. [PubMed]
- Ohno S. Evolution by gene duplication. Springer-Verlag; 1970.
- Wagner A. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol. 2001;18:1283–1292. [PubMed]
- Maslov S, Sneppen K, Eriksen K, Yan K. Upstream plasticity and downstream robustness in evolution of molecular networks. BMC Evol Biol. 2004;4:9. [PMC free article] [PubMed]
- Nowak MA, Boerlijst MC, Cooke J, Smith JM. Evolution of genetic redundancy. Nature. 1997;388:167–171. [PubMed]
- Force A, Lynch M, Pickett F, Amoresa A, Yana Y, et al. Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 1999;151:1531–1545. [PubMed]
- Blanc G, Wolfe K. Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell. 2004;16:1679–1691. 10.1105/tpc.021410. [PubMed]
- Berg J, Lässig M, Wagner A. Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evol Biol. 2004;4:51. [PMC free article] [PubMed]
- Beltrao P, Serrano L. Specificity and evolvability in eukaryotic protein interaction networks. PLoS Comput Biol. 2007. e25
- Vazquez A, Flammini A, Maritan A, Vespignani A. Modeling of protein interaction networks. ComPlexUs. 2003;1:38–44.
- Albert R, Barabási A. Statistical mechanics of complex networks. Rev Mod Phy. 2002;74:47–97.
- Chung F, Lu L, Dewey T, Galas D. Duplication models for biological networks. J Comput Biol. 2003;10:677–87. [PubMed]
- de Silva E, Stumpf M. Complex networks and simple models in biology. J Roy Soc Interface. 2005;2:419–430. [PMC free article] [PubMed]
- Barabási A, Albert R. Emergence of scaling in random networks. Science. 1999;286:509–512. [PubMed]
- Wiuf C, Brameier M, Hagberg O, Stumpf M. A likelihood approach to analysis of network data. Proc Natl Acad Sci U S A. 2006;103:7566–7570. [PubMed]
- Bader G, Hogue C. Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechn. 2002;20:991–997.
- von Mering C, Krause R, Snel B, Cornell M, Oliver S, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002;417:399–403. [PubMed]
- de Silva E, Thorne T, Ingram P, Agrafioti I, Swire J, et al. The effects of incomplete protein interaction data on structural and evolutionary inferences. BMC Biol. 2006;4:39. [PMC free article] [PubMed]
- Pritchard J, Seielstad M, Perez-Lezaun A, Feldman M. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol. 1999;16:1791–1798. [PubMed]
- Beaumont M, Zhang W, Balding D. Approximate Bayesian Computation in population genetics. Genetics. 2002;162:2025–2035. [PubMed]
- Marjoram P, Molitor J, Plagnol V, Tavaré S. Markov Chain Monte Carlo without likelihoods. Proc Natl Acad Sci U S A. 2003;100:15324–15328. [PubMed]
- Sisson SA, Fan Y, Tanaka MM. Sequential Monte Carlo without likelihoods. Proc Natl Acad Sci U S A. 2007;104:1760–1765. [PubMed]
- Stumpf M, Ingram P, Nouvel I, Wiuf C. Statistical model selection methods applied to biological networks. Trans Comp Sys Biol. 2005;3:65–77.
- Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, et al. Network motifs: simple building blocks of complex networks. Science. 2002;298:824–827. [PubMed]
- Zhang J. Evolution by gene duplication: an update. Trends Ecol Evol. 2003;18:292–298.
- Lynch M, Force A. The probability of duplicate gene preservation by subfunctionalization. Genetics. 2000;154:459–473. [PubMed]
- Lynch M, Conery JS. The origins of genome complexity. Science. 2003;302:1401–1404. [PubMed]
- Gu Z, Nicolae D, Lu H, Li W. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 2002;18:609–613. [PubMed]
- Nei M, Rogozin I, Piontkivska H. Purifying selection and birth-and-death evolution in the Ubiquitin gene family. Proc Natl Acad Sci U S A. 2000;97:10866–10871. [PubMed]
- Piontkivska H, Rooney A, Nei M. Purifying selection and birth-and-death evolution in the Histone H4 gene family. Mol Biol Evol. 2002;19:689–697. [PubMed]
- Middendorf M, Ziv E, Wiggins C. Inferring network mechanisms: the Drosophila melanogaster protein interaction network. Proc Natl Acad Sci U S A. 2005;102:3192–3197. [PubMed]
- Stumpf M, Kelly W, Thorne T, Wiuf C. Evolution at the system level: the natural history of protein interaction networks. Trends Ecol Evol. 2007;22:366–373. [PubMed]
- Wagner A. How the global structure of protein interaction networks evolves. Proc Biol Sci. 2003;270:457–466. [PMC free article] [PubMed]
- Green PJ, Hjort NL, Richardson S. Highly structured stochastic systems. Oxford (United Kingdom): Oxford University Press; 2003. 536 p.
- Heggland K, Frigessi A. Estimating functions in indirect inference. J Roy Stat Soc B. 2004;66:447–462.
- Jiang W, Turnbull B. The indirect method: inference based on intermediate statistics—A synthesis and examples. Stat Sci. 2004;19:239–263.
- Gilks WR, Richardson S, Spiegelhalter DJ, editors. Markov Chain Monte Carlo in practice. Boca Raton (Florida): Chapman and Hall; 1998. 486 p.
- Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536–540. [PubMed]
- Huynen MA, van Nimwegen E. The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol. 1998;15:583–589. [PubMed]
- Pushker R, Mira A, Rodríguez-Valera F. Comparative genomics of gene-family size in closely related bacteria. Gen Biol. 2004;5:R27.
- Koonin EV, Aravind L, Kondrashov AS. The impact of comparative genomics on our understanding of evolution. Cell. 2000;101:573–576. [PubMed]
- Makova K, Li WH. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res. 2003;13:1638–1645. [PubMed]
- van Noort V, Snel B, Huynen M. The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Rep. 2004;5:280–284. [PubMed]
- Pàl C, Papp B, Lercher MJ. Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat Genet. 2005;37:1372–1375. [PubMed]
- Snel B, Bork P, Huynen MA. Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Res. 2002;12:17–25. [PubMed]
- Kunin V, Ouzounis CA. The balance of driving forces during genome evolution in prokaryotes. Genome Res. 2003;13:1589–1594. [PubMed]
- Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science. 2002;296:910–913. [PubMed]
- Barabási A, Oltvai ZN. Network biology: understanding the cell's functional organization. Nat Rev Genet. 2004;5:101–113. [PubMed]
- Han JDJ, Dupuy D, Bertin N, Cusick ME, Vidal M. Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechn. 2005;23:839–844.
- Amoutzias G, Robertson D, Oliver S, Bornberg-Bauer E. Convergent evolution of gene networks by single-gene duplications in higher eukaryotes. EMBO Rep. 2004;5:274–279. [PubMed]
- Ingram P, Stumpf M, Stark J. Network motifs: structure does not determine function. BMC Genomics. 2006;7:108. [PMC free article] [PubMed]
- Yu H, Kim P, Sprecher E, Trifonov V, Gerstein M. The importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics. PLoS Comput Biol. 2007. e59 doi: 10.1371/journal.pcbi.0030059.
- Imperial College High Performance Computing Service. Available: http://www.imperial.ac.uk/ict/services/teachingandresearchservices/highperformancecomputing. Accessed 27 October 2007.
- Sole R, Pastor-Satorras R, Smith E, Kepler T. A model of large-scale proteome evolution. Adv Complex Syst. 2002;5:43–54.

Articles from PLoS Computational Biology are provided here courtesy of **Public Library of Science**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |