It has been proposed that interacting proteins should coevolve to maintain their interactions [1; 2; 3]. This idea provides the main motivation for the mirrortree method of predicting protein interactions [1; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13]. The mirrortree method predicts protein-protein interactions by assessing the agreement between evolutionary distances that could be attributed to correlated evolution. For this purpose, distance matrices are constructed from alignments of orthologous sequences taken from a common set of species. The degree of correlated evolution between two families of orthologs is then assessed by computing the correlation coefficient between the corresponding distance matrices. The method thus measures the correlation between evolutionary distances and, indirectly, the correlation between evolutionary rates along individual branches of the evolutionary trees of the two families. While correlation between the evolutionary trees of interacting proteins has been well documented [1; 2; 4; 7; 10; 14], the principal cause of such correlated changes remains unclear [15]. In particular, it has been proposed that the higher correlation between evolutionary trees of interacting proteins, relative to non-interacting ones, can be caused by compensatory mutations, in which mutations in one binding partner are compensated by complementary mutations in the other partner to maintain amino acid interactions important for protein function, stability and foldability [1; 2; 4; 16; 17; 18; 19].
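As an illustration of the correlation step described above, the following minimal sketch computes the Pearson correlation between the off-diagonal entries of two distance matrices. It assumes that both matrices are square, symmetric, and ordered by the same species; the function names and the toy matrices are hypothetical and are not taken from any published mirrortree implementation.

```python
import math
from itertools import combinations


def upper_triangle(matrix):
    """Return the entries strictly above the diagonal, one per species pair."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("distance matrix must be square")
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_x * var_y)


def mirrortree_correlation(dist_a, dist_b):
    """Correlate two distance matrices built over the same ordered species set."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Toy data (assumed, for illustration only): four species, two families.
family_a = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.3, 0.6],
    [0.4, 0.3, 0.0, 0.2],
    [0.5, 0.6, 0.2, 0.0],
]
# Family B evolves at twice the rate of family A plus a constant offset,
# so its distances are a linear function of those of family A.
family_b = [
    [0.0 if i == j else 2.0 * d + 0.05 for j, d in enumerate(row)]
    for i, row in enumerate(family_a)
]

r = mirrortree_correlation(family_a, family_b)  # close to 1.0
```

Only the entries above the diagonal are used, so that each species pair contributes once and the zero diagonal does not inflate the correlation.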
However, correlation between evolutionary distances of interacting proteins may also have other sources. For example, Fraser et al. used codon adaptation index analysis to infer that the expression levels of interacting partners are also subject to correlated evolution, and that such co-expression could be required for maintaining proper stoichiometry among interacting components. Expression levels have been observed to correlate with evolutionary rates [16; 21; 22], which might contribute to the coevolution signal measured by the mirrortree method. Indeed, Hakes et al. demonstrated that mRNA abundance is a good predictor of protein interactions. Another important argument against using compensatory mutations to explain the entire coevolutionary signal detected by the mirrortree method is that the method can also identify as interacting those proteins that belong to the same protein complex or biological pathway without interacting directly. Indeed, an extension of the mirrortree method recently introduced by Juan et al. detects proteins within the same metabolic pathways even though they are not necessarily related by physical interactions [24].
Challenging previous assumptions about the strong contribution of binding-interface coevolution to the correlation between evolutionary distances measured by the mirrortree method, Hakes et al. suggest that this correlation does not originate mostly from compensating mutations in the interface. They show that using only the surface residues or only the interface residues as input for the mirrortree approach yields results similar to those obtained with the whole protein sequence. Based on their analysis, they conclude that “correlated sequence evolution is most probably due to interacting proteins being constrained in similar ways and having similar rates of evolution across their entire sequences.”
In light of the considerations above, in this work “correlated evolution” refers to correlated changes in evolutionary rates imposed on a pair of interacting proteins to preserve their interaction properties. This definition therefore also includes correlated changes that preserve physical binding properties, co-expression, foldability and all other constraints imposed on a pair of interacting proteins to preserve the functional properties of the interaction. It is important to keep in mind that, because the mirrortree technique is based on correlated changes in distances between sequences of interacting proteins rather than on a direct measurement of any of the above-mentioned factors, it cannot assess which of them is the dominant contributor to the signal.
In this work, we analyze the contribution of binding sites to the coevolutionary signal of domain-domain interactions measured by the mirrortree method. For this purpose, we use binding sites together with their spatially surrounding residues, which we refer to as “binding neighborhoods”. We select a set of protein domains with representatives in one common set of species, thereby avoiding problems that arise when comparing correlations computed from different sets of species. Furthermore, to limit the impact of the coevolutionary signal due to common speciation divergence, we apply the speciation subtraction methods of Pazos et al. and Sato et al. With these controls in place, we develop several tests to compare the relative strength of the coevolutionary signal from the binding and non-binding parts of proteins. In particular, we test how the coevolutionary signal computed from the binding neighborhood compares to that computed from an equivalent number of non-binding positions.
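The comparison between a binding neighborhood and an equal number of non-binding positions can be sketched as follows. This is only an illustration under simplifying assumptions: distances are plain p-distances (the fraction of differing residues) rather than any particular evolutionary model, alignment columns are indexed from zero, and all function names, sequences, and column indices are hypothetical.

```python
import random
from itertools import combinations


def p_distance(seq_a, seq_b, columns):
    """Fraction of the selected alignment columns at which two sequences differ.

    Columns where either sequence has a gap ('-') are skipped.
    """
    compared = [
        (seq_a[c], seq_b[c])
        for c in columns
        if seq_a[c] != "-" and seq_b[c] != "-"
    ]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in compared) / len(compared)


def distance_vector(alignment, columns):
    """Pairwise distances over the chosen columns, one value per species pair."""
    return [
        p_distance(alignment[i], alignment[j], columns)
        for i, j in combinations(range(len(alignment)), 2)
    ]


def sample_control_columns(alignment_length, binding_columns, rng):
    """Draw as many non-binding columns as there are binding columns."""
    binding = set(binding_columns)
    non_binding = [c for c in range(alignment_length) if c not in binding]
    if len(non_binding) < len(binding):
        raise ValueError("too few non-binding columns for an equal-size sample")
    return sorted(rng.sample(non_binding, len(binding)))


# Toy alignment (assumed): one sequence per species, in a fixed species order.
alignment = [
    "ACDEFGHIKL",
    "ACDEFGHIKV",
    "ACNEFGHIKV",
    "ACNQFGHIRV",
]
binding_columns = [2, 3, 4]  # hypothetical binding neighborhood

rng = random.Random(0)  # fixed seed so the control sample is reproducible
control_columns = sample_control_columns(len(alignment[0]), binding_columns, rng)

binding_distances = distance_vector(alignment, binding_columns)
control_distances = distance_vector(alignment, control_columns)
```

Repeating the control draw many times with different seeds gives a distribution of distance vectors from non-binding positions. Each vector can then be correlated with the corresponding vector from the partner domain and compared with the value obtained from the binding neighborhood.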
In agreement with previous work indicating that the coevolutionary signal is not restricted to the binding interface, we find that when the binding neighborhoods are completely removed, the remaining sequences of interacting domains still contain a significant coevolutionary signal. However, we also find that the signal is not distributed uniformly across the sequence. Specifically, removing the binding neighborhood significantly reduces the performance of the method. In addition, the binding neighborhood alone provides a stronger coevolutionary signal than the same number of randomly selected residues outside the binding neighborhood. Thus, the correlation between evolutionary distances of interacting protein domains can only partially be explained by the common evolutionary pressure exerted along the whole sequence of the interacting domains. Our results indicate that the binding neighborhood contributes significantly more to this signal than the rest of the protein domain sequence.