Co-evolution of domain pairs in interacting proteins
To assess the degree of co-evolution of domain pairs in interacting proteins, we considered protein–protein interactions in the Saccharomyces cerevisiae (yeast) genome. A schematic overview of assessing the degree of co-evolution between two protein/ domain families is shown in . In this type of analysis, multiple sequence alignments of two proteins/domains for a common set of species are used to construct phylogenetic trees and similarity matrices. The degree of co-evolution of the two domains is measured by computing a linear correlation coefficient of the two similarity matrices, which implicitly compares the evolutionary histories of the two domains.
Figure 1 A schematic overview of the co-evolutionary analysis. Multiple sequence alignments of two yeast proteins for a common set of species are constructed, followed by the construction of their phylogenetic trees and similarity matrices. The extent of agreement (more ...)
To study the relative degree of co-evolution of domain pairs in interacting proteins, interacting proteins are first assigned with Pfam Hidden Markov Model (HMM)51
profiles. Then, as shown in , the correlation (agreement) scores, measuring the degree of co-evolution of all possible domain pairs between the two interacting proteins, are computed. Multiple sequence alignment for domain D
in protein P
is constructed by extracting those regions in P
's multiple sequence alignment that correspond to D
(see Materials and Methods for more details). Under the co-evolution hypothesis, which assumes that interacting domains undergo correlated mutations, domain pairs that are mediating the interaction between two proteins are expected to have co-evolved, and thus are expected to have high correlation score. To test this hypothesis, we began by examining yeast interactions supported by at least one PDB crystal structure. For a given protein–protein interaction, correlation scores of the interacting domain pairs (inferred from crystal structures) are compared against those domain pairs not known to interact to see whether or not the interacting domain pairs do really exhibit relatively high level of co-evolution. If two interacting proteins, P
, have two domains each, then there are a total of four domain–domain interaction possibilities between them.
Figure 2 Relative degree of co-evolution of domains in interacting proteins. (a) Domain architecture of proteins P and Q (shown using gray boxes) that are known to interact (interaction sites are shown as black boxes). (b) Correlation (agreement) scores, measuring (more ...)
Interactions in F1-ATPase complex
The F1-ATP synthase is a five-subunit catalytic core (in a stoichiometry of 3α, 3β, 1γ, 1δ, and 1
), which uses transmembrane proton motive force generated by photosynthesis or oxidative phosphorylation to drive the synthesis of ATP from ADP and inorganic phosphate. The central stalk, comprising 3α, 3β, and 1γ subunits, links F1 complex to the nine-subunit transmembrane channel through which the protons are pumped (F0 complex). The rod-shaped asymmetrical γ-subunit rotates inside a cylinder made of three α and β-subunits, arranged alternately52–56
, making contacts with α and β-subunits.
In this complex, we focused our attention on three interactions among the α, β, and γ chains (genes ATP1, ATP2, ATP3, respectively). The corresponding yeast proteins for the α, β, and γ chains (YBL099w, YJR121w, YBR039w, respectively) were assigned with Pfam domains. The α-subunit (YBL099w) was inferred to contain three domains: beta-barrel domain (PF02874, 2e-18), nucleotide-binding domain (PF00006, 3e-122), and C-terminal domain (PF00306, 2e-37). The β-subunit (YJR121w), a close homolog of the α-subunit, was inferred to contain the same three domains as well. The γ-subunit (YBR039w) contained just the ATP synthase domain (PF00231, 1e-130). depicts the domain architecture of the sequences along with the true domain–domain interactions between the subunits. The results from co-evolutionary analysis (using orthologs from 22 species) are listed as tables in , which clearly shows that those domain pairs that interact do, in fact, have relatively high correlation scores. A cartoon of one of several bovine mitochondrial F1-ATPase crystal structures (PDB: 1h8e), supporting the interactions, is shown in . Despite the complexity of this system, it is remarkable that the degree of co-evolution among the interacting domain pairs is clearly higher than that among the non-interacting domain pairs. In particular, between the α and the β chains, seven out of the nine domain pairs that are known to interact have higher correlation than the two noninteracting domain pairs. In comparison, an approach that picks seven domain pairs out of the nine possibilities at random (without replacement) will have a 0.028 probability (p-value) of getting all its seven picks correct (truly interacting domain pairs).
Figure 3 Interactions among alpha (ATP1), beta (ATP2), and gamma (ATP3) chains of the ATPase. (a) Protein sequences are shown using thick colored lines: red for the alpha chain, green for the beta chain, blue for the gamma chain, and black for alpha or beta chain. (more ...)
Yeast Sec23p/Sec24p heterodimer
Sec23p (YPR181c) and Sec24p (YIL109c) are components of the Sec23p-Sec24p heterodimeric complex of the COPII vesicle, which carries proteins from the endoplasmic reticulum (ER) to the Golgi complex.57
YPR181c and YIL109c are structurally related proteins, consisting of five distinct Pfam domains: zinc finger (PF04810,<1e-19), alpha/beta trunk domain (PF04811,<1e-124), beta-barrel domain (PF08033,<2e-36), helical domain (PF04815, <1e-43), and C-terminal gelsolin-like domain (PF00626, 1e-17). The domain architectures for both the proteins are given in , showing the lone inter-chain domain–domain interaction between the two alpha/beta trunk domains.
Figure 4 Interaction between Sec23 (YPR181c) and Sec24 (YIL109c) components of the COPII coat of ER-golgi vesicles. (a) Protein sequences are shown using thick gray lines, and Pfam domain annotations are shown using colored rectangular boxes (not drawn to scale). (more ...)
There are several intra-chain domain–domain interactions within each chain, involving all five domains (not shown in the Figure). Since, our analysis considers only inter-chain domain pairs (one in each chain), we considered only the lone inter-chain interaction between the alpha/beta trunk domains of the Sec23p and Sec24 chains. The correlation scores for all possible inter-chain domain pairs are listed as a table in , and the cartoon of the PDB crystal structure (PDB: 1m2v), supporting the interaction, is shown in . It is evident from our analysis that the interacting domain pair indeed has relatively high correlation. However, it was not the pair with the highest correlation. The domain pair of beta-barrel domain (PF04811) of Sec23p and alpha/beta trunk domain (PF08033) of Sec24p had the highest correlation. Since the trunk and beta-barrel domains interact within each chain, and both chains are remarkably conserved through evolution and are related, it may not be unreasonable to expect them to have a high correlation. Considering the fact that the lone true interacting domain pair has the second highest correlation score, an approach that picks two domain pairs out of the 25 possibilities at random (without replacement) will have a 0.04 probability (p-value) of picking the lone interacting domain pair.
Interactions in DNA-directed RNA polymerase complex
From the DNA-directed RNA polymerase complex in yeast, we considered two protein–protein interactions involving subunit Rpb8. There are several PDB crystal structures that show Rpb8 subunit interacting with smaller subunits Rpb3 and Rpb11, and the largest subunit Rpb1. The results of our domain-level co-evolutionary analysis for interactions between subunits Rpb8 (YOR224c) and Rpb3 (YIL021w), and subunits Rpb8 and Rpb1 (YDL140c) are shown in , respectively. YOR224c was inferred to contain just one domain: RNA_pol_Rpb8 (PF03870, 4e-92). YIL021w was assigned with dimeri-sation domain (PF01193, 1e-19) and insert domain (PF01000, 7e-41), and YDL140c was assigned with seven domains: clamp domain (PF04997, 7e-178), active site domain (PF00623, 5e-188), pore domain (PF04983, 2e-69), funnel domain (PF05000, 2e-50), cleft domain (PF04998, 2e-170), and two mobile module domains (PF04992, 4e-97 and PF04990, 1e-78).
Figure 5 Inferred domain–domain interactions in DNA-directed RNA polymerase complex. Protein sequences are shown using thick gray lines, and the domain annotations are shown using colored rectangular boxes (not drawn to scale). The names of the protein (more ...)
The correlation results for the interaction between Rpb3 and Rpb8 subunits (using orthologs from a common set of 15 species), listed as a table in , show that the interacting domain pair (PF01193, PF03870), shown in the cartoon of crystal structure (PDB: 1y1v), has a higher level of co-evolution when compared to the non-interacting domain pair (PF01000, PF03870). The correlation results are not that clear for the interaction between the Rpb1 and Rpb8 subunits (refer to the table in ). While two out of the three interacting domain pairs have high correlation scores, the interaction between the funnel domain (PF05000) and RNA_pol_Rpb8 (PF03870) has the lowest correlation score among all possible domain pairs. An approach that picks four domain pairs out of the six possibilities at random (without replacement) will have a 0.086 probability (p-value) of getting three out its four picks correct (truly interacting domain pairs).
An interacting domain pair with a low correlation (false negative in some sense) could be explained using one or both of the following reasons. When assessing the degree of co-evolution between two domains, we tend to ignore the number of interacting partners a domain may have. Even though the co-evolutionary hypothesis for interacting domains assumes that the interacting domains undergo correlated mutations, more specifically, it is actually the binding surfaces that undergo correlated mutations. A domain having multiple interacting partners may use distinct patches on its surface to interact with each of its partners.34,58
Each binding region of a domain is highly specific to its interacting partner. Thus, the surface patches used by a domain to interact with many interacting partners may undergo independent correlated mutations with their corresponding interacting partner. As a result, mutations at different surface patches of a domain need not be correlated (see ). Consequently, the degree of co-evolution between two interacting domains, one or both with multiple interacting partners, may actually be suppressed, resulting in a low correlation score. Thus, it may not always be the case that a pair of interacting domains, each of which has multiple interacting partners, has a high correlation score.
Figure 6 Uncorrelated set of correlated mutations. Each rectangular box is a cartoon representation of a multiple sequence alignment of a family of orthologous proteins/domains. There are a total of six families, A, B, C, D, E, and F. The binding residues of interaction, (more ...)
Mutations occurring in interacting domains may not be correlated due to many other biological constraints imposed on them. There may be cases in which the mutations at a binding surface in a domain may not be followed by compensatory mutations at the binding surface of its interacting partner. Since protein binding surfaces are relatively more conserved than the rest of the sequence,59,60
domains with many interacting partners, and thus many surface patches, are likely to be relatively more conserved than those with few interacting partners.61,62
For each interacting domain, from its multiple sequence alignment, we computed the sequence identity of each orthologous domain in reference to the yeast domain. The alignments for Rpb8 and Rpb1 had orthologs in a common set of 17 species, including yeast. The average sequence identities for the interacting domains, along with the number of known interacting partners are listed in . Interestingly, at least for this example, those domains with many interacting partners are relatively more conserved than those with few partners.
The average sequence identities of interacting domains between subunits Rpb1 (YDL140c) and Rpb8 (YOR224c) of DNA-directed RNA polymerase, along with the number of known interacting partners for each domain
Exportin Cse1p complexed with its cargo
Nuclear pore complexes serve as a medium for exchange of macromolecules between the nucleus and the cytoplasm. Carrier proteins that shuttle between the nucleus and the cytoplasm enable active transport of large molecules through these pore complexes. Importin-alpha Srp1 (YNL189w), which acts as a carrier for many nuclear trafficking processes, binds cargo in the cytoplasm, moves through the nuclear pore and releases the cargo in the nucleus. The nuclear envelope protein Cse1p (YGL238w), a yeast homolog of mammalian CAS, recycles importin-alpha from the nucleus back to the cytoplasm, thereby allowing it to participate in multiple rounds of nuclear import.63,64
To understand the degree of co-evolution between the interacting domains in this complex, we first assigned Pfam domains to the proteins. Cse1 (YGL238w) was assigned with importin-beta N-terminal domain (PF03810, 2e-22), Cse1 domain containing HEAT repeats (PF08506, 7e-275), and Cas/Cse C terminus domain (PF03378, 5e-77). Importin-alpha Srp1 (YNL189w) was assigned with importin beta binding domain (PF01749, 1e-45) and eight Armadillo repeats (PF00514, <5e-6). The domain architecture and the domain-level interactions in this complex are shown in . The co-evolution scores for all domain pairs between these chains are listed as a table in . Three out of the five interacting domain pairs have high correlation, implying high level of co-evolution. The remaining two interacting domain pairs do not have high correlation scores. An approach that picks five domain pairs out of the 29 possibilities at random (without replacement) will have a 0.019 probability (p-value) of getting two out its five picks correct (truly interacting domain pairs).
Figure 7 Interaction between importin alpha Srp1 (YNL189w) and nuclear export receptor Cse1 (YGL238w). (a) Protein sequences are shown using thick gray lines, and Pfam domain annotations are shown using colored rectangular boxes (not drawn to scale). The names (more ...)
For each interacting domain, from its multiple sequence alignment, we computed the sequence identity of each orthologous domain in reference to the yeast domain. The average sequence identities for the interacting domains, along with the number of known interacting partners are listed in , which shows that those domains with many interacting partners are relatively more conserved than those with few partners, which is consistent with similar other findings suggesting that hubs (those proteins/domains with numerous interacting partners) are relatively more conserved.59–62
If true, this could possibly explain why domains PF03378 and PF08506, with average sequence identities ~34% and ~46%, respectively, have low correlation with domain PF00514_8, which happens to have a relatively high average sequence identity at ~75%.
The average sequence identities of interacting domains between importin alpha Srp1 (YNL189w) and nuclear export receptor Cse1 (YGL238w), along with the number of known interacting partners for each domain
Predicting large-scale domain–domain interactions from the yeast interactome
Motivated by our results that interacting domain pairs (in interacting proteins) have higher correlation compared to non-interacting domain pairs, we developed a method to test the generality of the observed trend, and to predict large-scale domain–domain (i.e. Pfam–Pfam) interactions using the yeast interactome. First, we assigned the interacting yeast protein sequences with Pfam domains (see Materials and Methods). We then considered only those interactions involving proteins with at least 50% of their sequence lengths assigned (SLA) with Pfam domain(s), which we refer to as “test set SLA ≥50%”. A cutoff of 50% was chosen as a compromise between being sufficiently small to provide enough interactions, and large enough for assigned domains to contain sufficient binding sites. In addition, we imposed a restriction that the interacting proteins have orthologs in at least a common set of ten species. This resulted in a set of 1180 interactions among 654 proteins.
For each protein–protein interaction, we computed the correlation scores of all possible domain pairs between the two proteins, and inferred the domain pair(s) with the highest correlation score to be the interacting pair that is most likely to mediate the protein–protein interaction. For our set of 1180 protein–protein interactions, we inferred a total of 1222 domain–domain interactions (), 960 of which are unique. In order to validate our predictions, we used the iPfam database,50
which contains the list of known domain–domain interactions inferred from PDB crystal structures. We found that 206 out of our 1222 predictions (109 out of the 960 unique predictions ≈11.35%) are in iPfam.
If we restricted our set to only those interactions involving proteins with at least 75% (instead of 50% before) of their sequence lengths assigned with Pfam domain(s), the percentage of predictions in iPfam jumps by 52%. Our restricted set, referred to as “test set SLA ≥75%”, contained a total of 374 protein–protein interactions among 298 proteins. For this set, we inferred a total of 392 domain–domain interactions (), 336 of which are unique. 58 out of the 336 unique predictions (≈17.26%) are in iPfam. The increase in the prediction accuracy from 11.35% (for the test set SLA ≥50%) to 17.26% (for the test set SLA ≥75%) indicates that the higher the SLA, the better the prediction accuracy. This is understandable considering the fact that the binding region in an interacting protein needs to be contained in one of that protein's domains in order to have any possibility of it being identified.
We compared our prediction results with those of Chen and Liu's RDFF method32
and Riley et al
.'s DPEA method.33
The objective of this comparison is to find what percent of RCDP's predictions are confirmed by the other two methods. Chen and Liu used 9834 yeast protein interactions to infer 4366 domain–domain interactions, out of which 2475 are between Pfam-A domains while the rest involve Pfam-B domains. Riley et al.
's DPEA method is a statistical approach, which uses expectation maximization (EM) algorithm as a subroutine. They used a network of 26,032 protein–protein interactions from 69 organisms to infer a total of 3005 domain–domain interactions, out of which 1812 of them are between Pfam-A domain pairs. The comparison summary is shown in , in which we refer to our method as RCDP (relative co-evolution of domain pairs). Although our analysis shows that the predictions by the RCDP method are more likely to be in iPfam than those by RDFF or DPEA methods, one needs to keep in mind that the prediction accuracies of the three methods are not directly comparable as they all use different datasets of varying sizes. However, the dataset used in our study is a subset of that used by Chen and Liu, and Riley et al
. So, one would expect a good fraction of our predictions to be confirmed by the other two methods. Interestingly, only about 5% of RCDP's predictions are confirmed by both the RDFF and DPEA methods (). About 14% of RCDP's predictions are confirmed by DPEA alone, and about 23% of RCDP's predictions are confirmed by RDFF alone. Overall, 31% of RCDP's predictions are confirmed by DPEA and/or RDFF, indicating that RCDP can predict known domain–domain interactions missed by the other two. Thus, the RCDP method can be used with other methods to detect unrecognized domain–domain interactions on a genome scale with wider coverage.
Figure 8 (a) An indirect comparison of RCDP's prediction results with those of RDFF32 and DPEA33 methods. The predictions were validated against the known domain–domain interactions found in PDB crystal structures (as inferred in iPfam50). The prediction (more ...)
Validation of predicted domain–domain interactions and estimating the true predictive power of the RCDP method
We used domain pairs found to interact in PDB49
crystal structures, as reported in the iPfam database,50
as our gold standard to verify the predicted domain–domain interactions. The iPfam database defines two domains from two different chains to be interacting if and only if they are close enough in at least one PDB complex to form an interaction. We consider a predicted interaction between domain Pi
in protein P
and domain Qj
in protein Q
to be a true interaction (true positive) if and only if iPfam lists this pair of domain to be interacting based on one or more PDB crystal structure evidences.
An interacting protein pair P and Q is said to contain an iPfam domain–domain interaction xy if domain x is present in protein P and domain y is present in protein Q, or vice versa. A given protein–protein interaction may contain more than one iPfam domain–domain interaction, i.e. out of all possible domain pairs between the two interacting proteins, there may be more than one domain pair listed in iPfam.
We consider only those domain pairs found to interact in PDB crystal structures, as reported in iPfam, as true positives. Absence of a domain pair in iPfam does not necessarily mean that the two domains do not interact. Thus, it may not be fair to consider those predictions (without PDB evidence) as false positives. It could very well be that a good fraction of them could be biologically occurring domain interactions. A simple case would be a true protein–protein interaction, none of whose possible domain pairs are in iPfam. In order to estimate the true predictive power of the RCDP method, we tested it on a validation set comprising only those protein–protein interactions satisfying all of the following conditions: (i) is between proteins with at least 50% of their sequence lengths assigned with Pfam domain(s), (ii) is not between two one-domain proteins, (iii) contains a domain pair that is known to interact as per iPfam, and (iv) is between proteins having orthologs in at least a common set of ten species. The validation set contained a total of 109 protein–protein interactions (), comprising a total of 109 unique domain interactions that are in iPfam.
Ideally, a good domain–domain interaction prediction method should be able to recover all 109 unique known domain–domain interactions present in the validation set. To measure the percentage of recovery, we use the sensitivity measure, which is the ratio of the number of unique true positives to the number of unique positives (which is 109). On the validation set of 109 protein–protein interactions, RCDP predicted 109 unique domain–domain interactions, out of which 63 are in iPfam. This resulted in a sensitivity of 63/109=57.8%. This is an underestimation because there may more than one domain pair mediating a given protein–protein interaction, and since RCDP is designed to find only the pair(s) with the highest correlation, it may not be able to recover all 109 unique domain–domain interactions present in the set.
While it is important for any good method to be able to recover the 109 known domain–domain interactions from the validation set, it is equally important that every predicted domain–domain interaction is correct. To measure the accuracy of our predictions, we used the positive predictive value (PPV) defined as:
is the number of predicted domain pairs that are known to be true (in iPfam), and FP
is the number of predicted domain pairs that are not in the iPfam. RCDP predicted a total of 147 domain–domain interactions (109 of them are unique) for the validation set, out of which 94 are in iPfam with a PPV
of 63.95%. To ensure that the 63.95% prediction accuracy of RCDP is not by chance, we compared it against a random method. For this, we used the exact same random strategy used by Nye et al.31
, which, for a given protein–protein interaction, picks a domain pair at random out of all possible domain pairs. Since there could be more than one interacting domain pair within each interacting protein pair, there is a certain probability that the domain pair picked at random is a true interaction. We performed 100,000 runs of this random method on our validation set. The p
-value of obtaining a prediction accuracy of ≥63.95% by chance is 1.05×10−2
-score: 2.39). The performance of RCDP versus
the random method is shown in . On average, the random method is expected to have 55.19 (±0.01)% of its predictions to be in iPfam (). RCDP outperforms random by about 9%, which is significant, considering the fact that Nye et al
. showed, on a different dataset, that the random method performs as good as three other popular methods for predicting domain–domain interactions. In particular, Nye et al.
showed that one can expect their “lowest p
-value” method, Deng et al
.'s MLE method,30
and the random method to have about 55% of their predictions to be true, and Sprinzak et al
.'s association method26
to have about 52% of its predictions to be true. Since the interaction dataset and domain annotations (SCOP domains) used in Nye et al
.'s study are different from those used in this study, the results are not directly comparable.
Figure 9 Domain–domain interaction predictions results for 109 yeast protein–protein interactions, each of which (i) is between proteins with at least 50% of their sequence lengths assigned with Pfam domain(s), (ii) is not an interaction between (more ...)