|Home | About | Journals | Submit | Contact Us | Français|
It has been observed that the evolutionary distances of interacting proteins often display a higher level of similarity than those of non-interacting proteins. This finding indicates that interacting proteins are subject to common evolutionary constraints and constitutes the basis of a method to predict protein interactions known as mirrortree. It has been difficult, however, to identify the direct cause of the observed similarities between evolutionary trees. One possible explanation is the existence of compensatory mutations between partners’ binding sites to maintain proper binding. This explanation, however, has been recently challenged; and it has been suggested that the signal of correlated evolution uncovered by the mirrortree method is unrelated to any correlated evolution between binding sites. We examined the contribution of binding sites to the correlation between evolutionary trees of interacting domains. We showed that binding neighborhoods of interacting proteins have, on average, higher coevolutionary signal compared to the regions outside binding sites; although when the binding neighborhood was removed, the remaining domain sequence still contained some coevolutionary signal. In conclusion, the correlation between evolutionary trees of interacting domains cannot exclusively be attributed to the correlated evolution of the binding sites or to common evolutionary pressure exerted on the whole protein domain sequence, both factors contribute to the signal measured by the mirror tree approach. In addition, we showed that binding neighborhoods of interacting proteins have, on average, higher coevolutionary signal compared to the regions outside binding sites.
It has been proposed that interacting proteins should coevolve to maintain their interactions 1; 2; 3. This idea provides the main motivation for the method to predict protein interactions known as the mirrortree method 1; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13. The mirrortree method predicts protein-protein interactions by assessing the extent of agreement between evolutionary distances that could be attributed to correlated evolution. For this purpose, distance matrices are constructed from alignments of orthologous sequences taken from a common set of species. The degree of correlated evolution between families of orthologs is assessed by computing the correlation coefficient between the corresponding distance matrices. The mirrortree method measures the correlation between evolutionary distances and thus, indirectly, the correlation between evolutionary rates along individual branches of evolutionary trees of two families. While correlation between the evolutionary trees of interacting proteins has been well documented 1; 2; 4; 7; 10; 14, the principal cause of such correlated changes remains unclear 15. In particular, it has been proposed that higher correlation values between evolutionary trees of interacting proteins (with respect to non-interacting ones) can be caused by compensatory mutations, where mutations in one binding partner are being compensated by complementary mutations in another partner to maintain amino acid interactions important for protein function, stability and foldability 1; 2; 4; 16; 17; 18; 19.
However, correlation between evolutionary distances of interacting proteins may also have other sources. For example, Fraser et al. 20 used codon adaptation index analysis to infer that the levels of expression of interacting partners are also subject to correlated evolution and that such co-expression could be required for maintaining proper stoichiometry among interacting components. It has been observed that expression levels are correlated with evolutionary rates 16; 21; 22, which might contribute to the coevolution signal measured by the mirrortree method. Indeed, Hakes et al. 23 demonstrated that mRNA abundance is a good protein interaction predictor. Another important argument against using compensatory mutations to explain the entire coevolutionary signal detected by the mirrortree is that this approach could also identify as interacting non-interacting proteins within the same protein complex or biological pathway. Indeed, an extension of the mirrortree recently introduced by Juan et al. detects proteins within the same metabolic pathways despite the fact that they are not necessarily related by physical interactions 24.
Challenging previous assumptions about the strong contribution of coevolution of binding interfaces to the correlation signal between evolutionary distances measured by the mirrortree method, Hakes et al. 23 suggests that such correlation does not mostly originate from compensating mutations in the interface. In their work, Hakes et al. 23 shows that selecting only the surface residues or the interface residues as input for the mirrortree approach yields similar results as using the whole protein sequence. Based on their analysis, they conclude that “correlated sequence evolution is most probably due to interacting proteins being constrained in similar ways and having similar rates of evolution across their entire sequences.”
Accounting for the considerations above, in this work “correlated evolution” refers to correlated changes in evolutionary rates imposed on a pair of interacting proteins to preserve their interaction properties. As such, this definition of correlated evolution includes also correlated changes to preserve physical binding properties, co-expression, foldability and all other constraints that are imposed on a pair of interacting proteins to preserve functional properties of interaction. It is important to keep in mind that since the mirrortree technique is based on correlated changes in distances between sequences of interacting proteins rather than on a direct measurement of any of the above mentioned factors, it cannot assess which one of them is a dominating contributor to the signal.
In this work, we analyze the contribution of the binding sites to the coevolutionary signal of domain-domain interactions measured by the mirrortree method. For this purpose, we use binding sites together with their spatially surrounding residues, which we refer to as “binding neighborhoods”. We select a set of protein domains with representatives in one common set of species, thereby avoiding problems related to comparing correlations computed based on different sets of species. Furthermore, to limit the impact of the coevolutionary signal due to common speciation divergence, we apply the Pazos et al. 14 and Sato et al. 11 speciation subtraction methods. With these controls in place, we develop several tests to compare the relative strength of the coevolutionary signal from binding and non-binding parts of proteins. In particular, we test how the coevolutionary signal computed from the binding neighborhood compares to that computed from an equivalent number of non-binding positions.
In agreement with previous work indicating that coevolutionary signal is not restricted to the binding interface, we find that when completely removing the binding neighborhoods, the remaining sequences of interacting domains still contain significant coevolutionary signal. However, we also find that the signal is not distributed uniformly across the sequence. In particular, removing the binding neighborhood significantly reduces the performance of the method. In addition, we find that the binding neighborhood alone provides a stronger coevolutionary signal than the same number of randomly selected residues outside the binding neighborhood. Thus, the correlation between evolutionary distances of interacting protein domains can be only partially explained by the common evolutionary pressure exerted along the whole sequence of interacting protein domains. In particular, our results indicate that the binding neighborhood has a significantly higher contribution to this signal than the rest of the protein domain sequence.
To compare the contribution of correlated evolution measured by mirrortree based on binding sites against that of the whole protein domain sequence, we considered a set of columns from the multiple sequence alignment (MSA) corresponding to binding sites and their close neighborhood (“binding neighborhood”). The rationale to consider this binding neighborhood rather than binding sites alone is as follows. The mirrortree method measures the correlation between evolutionary changes but the binding sites alone are often (sometimes nearly perfectly) conserved and might not display enough variation to provide detectable coevolutionary signal. Furthermore, it has been found previously that the majority of the coevolving positions are not in direct contact but usually physically close (≤10 Å) 25.
First, we compared the performance of the mirrortree method using MSA columns from the binding neighborhood alone to the performance of the same method when an equal number of randomly selected non-neighborhood MSA columns were used (Figure 1). We considered binding neighborhoods at increasing thresholds: 6 Å, 8 Å, 10 Å and 12 Å (see Methods). We corrected for the speciation divergence using two different methods that we will refer to as the “non-orthogonal” and “orthogonal” methods proposed by Pazos et al. 14 and Sato et al. 11 respectively (see Methods). We refer to our previous study 6 for details about the methodological differences between these methods and for the explanation of this naming convention. We should note that the gold standard used for benchmarking was designed based on a set of domain-domain interactions verified with crystallographic data from Shoemaker et al. 26. This dataset, with additional constrains (see Methods), might be biased towards domain pairs that form more stable complexes rather than transient interactions due to the limited sample size. Figure 2 depicts the comparison, for all interacting domains pairs, of the correlation coefficients obtained using the binding neighborhood alone against those of using an equal number of randomly selected non-neighborhood MSA columns (see Methods). Results using the orthogonal and non-orthogonal speciation corrections are depicted in Figure 2a and 2b respectively. For both speciation corrections, the coevolutionary signal strength, represented by the correlation coefficients, derived from the binding neighborhood is predominately higher.
In addition, the accuracy of the different methods was measured using the Receiver Operating Curves (ROC )27. We used complete ROC and ROCn curves (plots truncated after the first n false results 28; 29) that were normalized so that the area under the ROC curve for an ideal retrieval method (one which returns all the true results first) was equal to 1.0. The corresponding ROC curves, for binding neighborhood of 10 Å, are shown in Figure 3. Independent of the speciation subtraction method used, exclusive use of the binding neighborhood drastically improves the performance in predicting domain interactions over the set of randomly selected MSA columns outside the binding neighborhood. Figure 4 shows values for ROC50, ROCtotal for all these experiments, together with the corresponding values from additional experiments discussed below. The values of ROC are given in Table 1. For the experiment using randomly selected columns, we computed the standard deviation based on 100 trials. Note that the results for randomly selected columns and the binding neighborhood differ by several standard deviations. We confirmed that the results presented in this paper are robust with respect to the definition of binding neighborhood.
Next, we analyzed the effect of removing the binding neighborhood on the performance of the mirrortree method and compared it with the effect of removing randomly chosen columns outside the binding neighborhood. The number of removed random columns was equal to the number of columns in the binding neighborhood, thereby accounting for any effect that the number of columns might have on the method. We applied both orthogonal and non-orthogonal speciation subtraction; results are depicted in Figure 5a and 5b respectively. Independent of the applied speciation subtraction, we observed that the removal of the binding neighborhood leads to a significant decrease in the performance of the mirrortree method. Yet, our results show that the sequence without the binding neighborhood still provides significant coevolutionary signal. Furthermore, for randomly selected residues, the discriminating power measured by ROC value increased with the number of selected columns.
One can argue that since Hakes et al. 23 found no differences in the discriminating power between the surface region and the whole sequence, the better performance of the binding neighborhood could be due to surface residues that might be contained within the binding neighborhood. To eliminate this possibility, we compared the ROC values obtained from the binding neighborhood to those computed based on surface residues (of the same size as the binding neighborhood and excluding residues from the binding neighborhood). In addition, only interacting pairs that contain sufficiently large numbers of surface residues outside of the binding neighborhood were selected (see Methods). The results for this analysis, depicted in Table 2, show that the ROC values using surface residues outside the binding neighborhood are always smaller than those using the binding neighborhood (independent of the speciation correction used).
Finally, Figure 4 provides a summary view of the results discussed above without speciation subtraction (faded colors) and with the non-orthogonal subtraction (bright colors); results for orthogonal subtraction, not shown, were almost identical. Clearly, the binding neighborhood is a better discriminator of interactions than a randomly selected set of columns of the same size. The relative discriminative powers of the whole sequence, the whole sequence without binding neighborhood and the binding neighborhood differ between the ROC values, with the whole sequence performing the best on the more practical ROC50. In addition to the above results, it shows that a larger number of MSA columns provide a stronger signal (the number of columns in the randomly selected and binding neighborhood sets are smaller than the number of columns in the full length sequences). Furthermore, from the summary in Figure 4 one can also appreciate the strongly increased power of the mirrortree method when the correction for speciation is applied.
We have shown that binding neighborhoods of interacting proteins have, on average, higher coevolutionary signal compared to those columns outside binding sites. We also found that the sequences without the binding sites still contain some coevolutionary signal, however, the signal coming from a randomly selected set of columns is weaker than that of the binding neighborhoods. Interestingly, our results also show that the coevolutionary signal of randomly selected MSA columns outside the binding neighborhood increases with the number of columns.
Thus, in agreement with Hakes et al. and others, we found that the binding neighborhood alone is not the only contributor to the coevolutionary signal. Hakes et al. concluded that “correlated sequence evolution is most probably due to interacting proteins being constrained in similar ways and, consequently, having similar rates of evolution across their entire sequences.” Our additional experiments lead, however, to the conclusion that the signal is not uniform, instead it is stronger in the binding neighborhood.
Our results indicate that the binding neighborhood is subject to stronger correlated evolution than other regions of the interacting protein domains. However since the mirrortree technique is based on common variation of sequence distances between sequences of interacting proteins rather than on a direct measurement of any of the factors that might contribute to this variation, the exact coevolutionary mechanism which leads to the similarity of evolutionary trees of interacting proteins is still not fully uncovered.
The set of interacting and non-interacting domains has been selected similar to Kann et al.6. In particular, this test set ensures that all orthologous families (interacting or not) contain sequences from the same set of 70 species. Our set of interacting domain families contains a total of 26 interacting pairs and 1291 non-interacting pairs. The interacting domains have been selected rigorously based on the Conserved Binding Mode (CBM) database of Shoemaker et al. 26, which maps the protein domains from the Conserved Domain Database (CDD)30 onto a set of interacting domains from PDB and verifies the interactions via conserved binding modes (CBMs). The CBM analysis uses conserved geometric interfaces to reduce the possibility of including non-biological interactions. All interacting domains in this study were extracted from the CBM database. For cases where multiple conserved binding modes are present for a single interacting pair of domains, we randomly chose one conserved binding mode per pair. For each domain pair extracted from the CBM database an expanded binding neighborhood was determined about the binding interface. Any two residues a and b, from domains A and B respectively, are included if and only if the distance between the closest atoms from a and b does not exceed a given threshold. Threshold values of 6 Å, 8 Å, 10 Å and 12 Å were used. The results given in the main text are for 10 Å. In addition, all pairs selected for this study must contain at least 20 residues satisfying this condition.
Solvent-accessible residues were determined from DSSP 31 calculations available for all structures at ftp://ftp.cmbi.kun.nl/pub/molbio/data/dssp. The solvent accessibility of residue “X” is defined as a ratio of its solvent accessible area in protein structure to that for extended tripeptide Gly-X-Gly. An amino acid is considered as solvent accessible if this ratio is greater than 0.05. To account for the size effect the same number of columns was chosen for both surface and binding neighborhood regions. In addition, we selected only those interacting pairs that contained sufficiently large numbers of surface residues outside of the binding neighborhood and did not contain large disordered regions (surface accessibility could not be calculated for disordered regions). After applying all these restrictions we ended up with 18 interacting pairs (set_18).
For each domain pair reported in the interacting and non-interacting sets, the correlation between their evolutionary trees was measured by computing the correlation coefficient of the corresponding vectors. The sets of residues, and their corresponding MSA columns, were selected as follows. All the columns of the protein domain’s MSA were used for the “full sequence experiment”. For each interacting domain, we randomly selected a set of columns outside the binding neighborhood equal to the number of sites in the binding neighborhood. Only binding neighborhood columns or randomly selected columns were used (or excluded) in the remaining experiments. This selection procedure was repeated 100 times. The correlation coefficient was adjusted using one of the two speciation subtraction approaches: orthogonal subtraction or non-orthogonal subtraction proposed by Sato et al 11 and Pazos et al. 14 respectively. For two domains A and B with n species in common in their MSA, let’s denote Aij as the distance between species i and j for protein family A, and Bij for protein family B.
To implement the speciation signal subtraction, these distance vectors were further modified as follows. First, the background speciation matrix was computed by averaging the evolutionary distance matrices (F) of all protein families (from interacting and non-interacting sets). Thus, for N protein families, the distance between species i and j in the background speciation vector is given by
To reduce the impact of codivergence due to common speciation history, the non-orthogonal speciation subtraction method 14 computes modified distances A′ij and B′ij by A′ij = Aij − sij and B′ij = Bij −sij. When using the orthogonal reference, corresponding distances are defined by and where is the standard representation of the projection of the distance vector for protein families F=A or B into the speciation vector s→ij, and is given by
We point out that the implementations of orthogonal and non-orthogonal subtraction are identical to 6. Finally, given A′ij and B′ij, the correlation between evolutionary histories is estimated by computing a standard Pearson’s correlation coefficient of the upper right triangle of the corresponding matrices.
Table S1 List of the 70 organisms used in all of the experiments.
Table S2 Domain definitions for the interacting protein families.
Table S3 Dependence of ROC values on the size of the binding neighborhood.
Table S4 Number of residues inside and outside the binding neighborhood defined using threshold values of 6 Å, 8 Å, 10 Å and 12 Å.
Table S5 Correlation coefficient for all interacting pairs computed using the orthogonal method for speciation correction.
Table S6 Correlation coefficient for all interacting pairs computed using the non-orthogonal method for speciation correction.
Support for this work was provided in part by the intramural research program of the National Institutes of Health, National Library of Medicine and funding from the University of Maryland, Baltimore County Special Research Initiative Support.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.