Homology – similarity through common descent – occurs on scales ranging from genetic sequence to anatomy. The high degree of observed protein sequence homology gives a strong expectation that discoveries about protein function made in one species will provide understanding in another. The extent of homology of protein function is of both practical and theoretical importance, as it underlies the reliance on a few model organisms and provides insight into the maintenance and diversification of protein function through evolution.
In this paper, we examine the evidence for homology in the realm of protein-protein interactions. Proteins, the main workhorses of the cell, do not carry out their functions in isolation but rather interact with each other to bring about biological function. We ask the following question: To what extent are protein-protein interactions conserved through evolution? A high degree of conservation makes the transfer of interactions across species viable. This is particularly pertinent given the cost of gathering experimental data and the concentration of those data in very few species. If, however, there is a low degree of conservation of protein interactions then – given the very high degree of conservation of protein sequences – this would suggest that interaction information cannot be transferred across species and that interactions can be lost and gained rapidly with little sequence change. This, in turn, could help explain how small changes in protein sequence on occasion bring about large phenotypic changes.
The homology of protein-protein interactions can be investigated by seeking evidence of interologs. Interologs are pairs of interacting proteins: A-B in one species and A'-B' in another, where A is a homolog of A' and B is a homolog of B' (see ). Homolog detection is an unsolved problem, so we consider three different definitions of homology: blastp reciprocal hits at different thresholds of similarity, blastp reciprocal best hits, and EnsemblCompara GeneTrees.
Methodology for inferring protein-protein interactions.
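To make the interolog definition concrete, the following is a minimal Python sketch, not the pipeline used in this study. It assumes that interactions are given as unordered pairs of protein identifiers and that a homolog mapping between the two species has already been computed by one of the homology definitions; the function name `find_interologs` and all identifiers in the toy data are hypothetical.

```python
from itertools import product


def find_interologs(source_ppis, target_ppis, homologs):
    """Map each source interaction to its homologous target interactions.

    source_ppis, target_ppis: iterables of (protein, protein) pairs,
        treated as undirected.
    homologs: dict from a source-species protein to the set of its
        target-species homologs (hypothetical input; any homology
        definition could be used to build it).
    """
    target = {frozenset(pair) for pair in target_ppis}
    interologs = {}
    for a, b in source_ppis:
        # Try every combination of a homolog of A with a homolog of B.
        candidates = product(homologs.get(a, ()), homologs.get(b, ()))
        hits = {frozenset(c) for c in candidates if frozenset(c) in target}
        if hits:
            interologs[frozenset((a, b))] = hits
    return interologs


# Hypothetical toy data: two source interactions, two target interactions.
source = [("A", "B"), ("C", "D")]
target = [("A1", "B1"), ("C1", "E1")]
homologs = {"A": {"A1"}, "B": {"B1", "B2"}, "C": {"C1"}, "D": {"D1"}}

conserved = find_interologs(source, target, homologs)
# Observed conservation rate: fraction of source interactions with an interolog.
observed_rate = len(conserved) / len(source)
print(conserved, observed_rate)
```

In this toy example only A-B has an interolog in the target set, so the observed rate is 0.5. An interaction that is truly conserved but missing from the target data would be counted as non-conserved, which is why observed rates of this kind can underestimate conservation.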
The notion of across-species interologs was first introduced by Walhout et al. in 2000 
. Since then, many studies have predicted interactions on the basis of transfer by homology (e.g. 
). Despite the prevalent use of transferred interactions, relatively little work has been published that investigates the reliability of this procedure across species. Published success rates for transferring interactions vary from less than
, and many values in between have been reported 
. These differences can be explained in part by methodological choices. For example, Qian et al. reported the highest conservation rate. They excluded gene duplicates and compared two organisms that are evolutionarily very close. In contrast, the majority of studies have focused on comparisons between species that are much more distant on the tree of life – budding yeast S. cerevisiae (SC), nematode worm C. elegans (CE), fruit fly D. melanogaster (DM), and human H. sapiens (HS) – as these are the species for which the most data exist.
It is also possible to investigate the homology of interactions within a species. Two types of homologous interactions exist. Interactions A-B and A'-B' in which A is a homolog of A' and B is a homolog of B' are homologous; we refer to these as both-different conserved interactions. Additionally, interactions A-B and A-B' in which B is a homolog of B' are homologous; we refer to such interactions as one-same
conserved interactions. Mika and Rost found that interactions were more conserved within species than across species. They considered this result surprising due to the long-standing belief that proteins arising from gene-duplication events (paralogs) must diverge in function in order to be conserved, whereas proteins that arise from a speciation event (orthologs) have evolutionary pressure to maintain the function of the ancestral protein. However, Mika and Rost did not separate orthologs from paralogs in their across-species study, so the results that they observed might be due to across-species out-paralogs outnumbering orthologs.
Errors in the interaction data, both false negatives (i.e. existing interactions that are not reported in the data set) and false positives (i.e. interactions in the data set that do not actually exist), can clearly have a substantial impact on results. Most obviously, false negatives in the target interactome will cause some interactions to be judged as non-conserved when the data in the target species are simply missing. However, except for Ref. 
, which examines one type of protein (transcription factors) in one pair of species (mouse and human), none of these studies investigated the role of errors in the data when assessing conservation.
A brief survey of the literature gives a sense of how significant these errors are believed to be. False-positive rates in high-throughput protein-protein interaction data, which have been estimated to be in excess of
, have more recently been estimated at
or considerably lower 
. False-positive rates in the multiple studies that are collated to give literature-curated data sets seem hard to assess. Error rates in the curation process have been estimated to be as high as
. By comparing the estimated sizes of interactomes to the current sizes of data sets, false-negative rates of aggregate data sets can be derived. Recent estimates of the S. cerevisiae
interactome range from
interactions in the data set we use); recent estimates for H. sapiens
in our data set); and recent estimates for D. melanogaster
range from about
in our data set). C. elegans
has been estimated to have about
in our data set). The large range of estimates gives a flavour of how results depend on the assumptions made. These estimates indicate that the false-negative rates for all species except S. cerevisiae are very high, whereas the S. cerevisiae interactome is potentially nearly complete.
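The comparison of an estimated interactome size with the size of a current data set can be written as a short calculation. The sketch below is illustrative only: the function name, the optional false-positive correction, and the numbers in the example are assumptions and are not values from this study.

```python
def false_negative_rate(n_observed, n_estimated, false_positive_rate=0.0):
    """Estimate the fraction of true interactions missing from a data set.

    n_observed: number of interactions in the data set.
    n_estimated: estimated size of the complete interactome.
    false_positive_rate: assumed fraction of reported interactions that
        are not real (hypothetical parameter; 0.0 ignores false positives).
    """
    if n_estimated <= 0:
        raise ValueError("n_estimated must be positive")
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must lie in [0, 1]")
    # Only the true positives count towards coverage of the interactome.
    true_observed = n_observed * (1.0 - false_positive_rate)
    # Clamp at zero in case the data set exceeds the size estimate.
    return max(0.0, 1.0 - true_observed / n_estimated)


# Hypothetical numbers, chosen only to illustrate the arithmetic.
print(false_negative_rate(25_000, 100_000))
print(false_negative_rate(25_000, 100_000, false_positive_rate=0.2))
```

Because the result depends directly on the interactome size estimate, a wide range of size estimates translates into a wide range of false-negative rates.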
In addition to being far from
in all organisms save S. cerevisiae, the coverage of interactomes is biased. In particular, there is a high correlation between the number of publications in which a protein is mentioned and the number of interactions reported for that protein in literature-curated data (an
was reported by 
). This reflects the fact that low-throughput experiments are hypothesis-driven, i.e. particular interactions are tested for if they are of interest to researchers. If hypotheses are formulated in part on what is known about homologous proteins, then one should expect a bias in which homologous interactions are more likely to be reported. This would lead to conservation rates appearing inflated compared to data sampled independently in different species.
In this study, we investigate the evidence for the homology of binary protein-protein interactions using data from six species: S. cerevisiae (SC), C. elegans (CE), D. melanogaster (DM), H. sapiens (HS), fission yeast S. pombe (SP) and mouse M. musculus (MM). We investigate the first four species because considerable data exist for them, and the last two because they are evolutionarily close to S. cerevisiae and H. sapiens, respectively, and thus provide an interesting point of comparison.
In the first part of the present study, we calculate observed conservation rates for interactions across species and discuss the effects of potential bias.
In the second part, we attempt to address the sources of error that could cause the observed conservation rates to be underestimates. We decouple the effects of interaction completeness from the conservation of interactions through evolution and thereby arrive at estimates for both. Using the assumptions of our model and definitions of homology frequently employed for transferring functional annotations, we show that the fraction of interactions that are conserved is low even when interactome errors are taken into account. If strict definitions of homology are employed, the number of conserved interactions across species is low. We emphasise that our estimates of the fraction of conserved interactions do not consider the biases in the interaction data and are hence probably overestimates. We then produce estimates for the rate at which interactions are lost through evolution – the first, to our knowledge, based on large-scale data sets and comparing species that are well separated on the tree of life – finding rates of about
per million years between the most sequence-similar proteins.
In the third part of this study, we consider the transfer of interactions within species. We examine three different sets of inferences. Set one is one-same inferences: for example, A-B is inferred from A-B' when B and B' are homologs and A is present in both interactions. Set two is both-different-1 inferences: for example, A-B is inferred from A'-B' when A and A' are homologs and B and B' are homologs. In a final case study on these data (both-different-2), we identify the closest homologous interaction and keep just a single inference for each interaction. This means that if the closest inference comes from a one-same inference, we no longer make a prediction from a less similar both-different inference. It has been shown previously that inferences of the one-same type are very powerful in within-species interaction prediction 
, a result we also observe. If one wishes to compare the rate of conservation of interactions within species to that across species then excluding one-same interactions as done in Ref. 
seems fair. In our test of this type (both-different-1), we find that within-species interactions are conserved to approximately the same extent as across-species interactions.
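The distinction between one-same and both-different inferences can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the procedure used in this study: it ignores sequence-similarity thresholds and the both-different-2 selection of the closest homologous interaction, and the function names and toy identifiers are hypothetical.

```python
from itertools import product


def within_species_inferences(ppis, homologs):
    """Enumerate (inferred, source) interaction pairs within one species.

    ppis: iterable of (protein, protein) pairs, treated as undirected.
    homologs: dict from a protein to the set of its within-species
        homologs (hypothetical input).
    Returns two sets: one-same and both-different inferences.
    """
    one_same, both_different = set(), set()
    for a, b in ppis:
        src = frozenset((a, b))
        # One-same: keep one partner and swap the other for a homolog.
        for kept, swapped in ((a, b), (b, a)):
            for h in homologs.get(swapped, ()):
                inferred = frozenset((kept, h))
                if inferred != src:
                    one_same.add((inferred, src))
        # Both-different: swap both partners for homologs.
        for ha, hb in product(homologs.get(a, ()), homologs.get(b, ())):
            inferred = frozenset((ha, hb))
            if not (inferred & src):  # neither protein is shared
                both_different.add((inferred, src))
    return one_same, both_different


def fraction_supported(inferences, ppis):
    """Fraction of inferences whose inferred interaction is in the data."""
    known = {frozenset(p) for p in ppis}
    if not inferences:
        return 0.0
    return sum(inferred in known for inferred, _ in inferences) / len(inferences)


# Hypothetical toy data: A/A2 and B/B2 are homologous pairs.
ppis = [("A", "B"), ("A", "B2"), ("A2", "B2")]
homologs = {"A": {"A2"}, "A2": {"A"}, "B": {"B2"}, "B2": {"B"}}

one_same, both_different = within_species_inferences(ppis, homologs)
print(fraction_supported(one_same, ppis))
print(fraction_supported(both_different, ppis))
```

Keeping the two sets separate makes it possible to compare only both-different inferences with across-species transfer, since an across-species interolog never shares a protein with its source interaction.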
Functional annotations are often transferred using definitions that are not particularly strict (see, e.g., 
). We argue that the low success of interaction transfer at comparable levels of sequence similarity cannot be explained solely by interactome errors. Unless a very stringent definition of homolog is employed, the rate of evolutionary change of interactions is too high to allow transfer across species that are well separated on the tree of life. At such stringent definitions, the number of conserved interactions is low. The common practice of transferring interactions on the basis of homology between such distant species 
must be treated with caution.