Protein–protein interactions are one of the mechanisms by which biological organisms build complicated and flexible molecular machineries from relatively modest numbers of protein-coding genes. Similar to the way in which we can derive information about the evolution of genes and genomes through currently available high-throughput genome sequencing data, the availability of high-throughput protein interaction data from Yeast-2-Hybrid experiments and various other protocols gives us a snapshot of the evolutionary process by which the rich and complex structure of protein interactions in the cell is formed.
The nature of current protein interaction network (PIN) data presents challenges in analysing the data and performing inference that takes into account the global network structure. When considering evolutionary models we are faced with the problem of comparing the network structure produced by the model to that of the observed interaction network. A possible way of overcoming this problem is to calculate summary statistics describing some aspect of the data and compare these with predictions from evolutionary models. Several previous studies in the literature have applied summary statistics to compare the fit of network models to observed data [
1–
4], shedding some light on aspects of network evolution and organization. Early studies suggested that the scale-free (SF) network models [
5] might fit the observed PIN data well [
3,
6], but there have since been several and statistically robust challenges to this claim [
7–
9].
Considering more realistic biologically grounded models of network evolution has provided insights into potential mechanisms of PIN formation, and provides more readily interpretable and applicable results than those found by considering more general random graph models; in particular, it has become apparent that it is important to consider models of network growth (instead of static random graph models) even though they are vastly oversimplified compared with the real process of network evolution. A number of models have been proposed and analysed with respect to the observed data [
1,
2,
10–
12], all with the same general mechanism of node duplication, corresponding to gene duplication and subsequent divergence in function and of interactions.
Assessing the fit of various network growth models to the
Drosophila melanogaster protein interaction network, Middendorf
et al. [
13] found that a duplication model best describes the data. A similar result was found in Ratmann
et al. [
4], where combining several different network statistics to compare the fit of models with the
Treponema pallidum PIN, a model combining duplication divergence scheme with linear preferential attachment (LPA) was found to best explain the data. Plausible models should therefore include aspects of duplication followed by the ability of interactions to diverge and change with time.
Comparing models of network evolution—even if they are (by design) vastly oversimplified compared with the true process—holds the promise of allowing us to weigh up the relative contributions of different processes. For example, we may assess the relative role that duplication of individual proteins might have played in the evolution of natural systems. Ultimately, we would like to understand different processes and their roles in network evolution in a way that mirrors what is possible for sequence-based comparative analyses. Here, too, models are oversimplified (even if less severely) but have allowed us to disentangle different aspects affecting sequence evolution (codon usage, secondary structure constraints, etc.). More immediately, however, such evolutionary models also allow us to apply the comparative method to networks more meaningfully than mere lists of network characteristics would be. Comparative biology predates the availability of sequence information, of course, and here we will discuss models of network evolution in a manner akin to that used in classical morphologically based comparative studies [
14].
Evolutionary analysis at the level of network organization is fraught with considerable technical challenges: the data are often noisy and incomplete; networks are notoriously hard to describe in terms of summary statistics; and calibrating evolutionary models against the available data (or summary statistics) is also non-trivial. Here, we develop a flexible and robust inferential framework to deal with these three issues. Our approach is aimed at estimating the ‘effective’ parameters of models of network evolution against network data, and choosing between different plausible models of network evolution whenever possible. We employ a Bayesian framework that allows us to deal with different candidate models and the uncertainties and problems inherent to the PIN data; and we use concepts from spectral graph theory to describe the networks, rather than relying on summary statistics.
Because the likelihood of general network growth models is computationally difficult to evaluate, we adopt an approximate Bayesian computation (ABC) approach; in ABC procedures the data (or summary statistics thereof) of model simulations (with parameters, θ, drawn from the prior) are compared with the real data and if a suitable distance measure between the data/summary statistics falls below a tolerance level, ε, then θ is accepted as a draw from the (ABC) posterior distribution. If ε → 0, then the ABC posterior will be in agreement with the exact posterior, as long as the whole data are used. Use of summary statistics can be problematic for parameter inference and model selection if statistics are not sufficient. This is unlikely ever to be the case for networks and therefore the spectral perspective taken here, which captures the whole data, is particularly pertinent.
Below we outline the ABC framework employed here and its use in parameter estimation, model selection and model averaging contexts. After discussing the spectral graph measures, we outline different evolutionary models, and describe how we can analyse incomplete network datasets. We then illustrate our approach against simulated data before considering real protein–protein interaction data. We conclude with a discussion of the results and will make the case for the statistically informed analysis of such simple models in the context of evolutionary systems biology.