|Home | About | Journals | Submit | Contact Us | Français|
The characterization of protein interactions is essential for understanding biological systems. While genome-scale methods are available for identifying interacting proteins, they do not pinpoint the interacting motifs (e.g., a domain, sequence segments, a binding site, or a set of residues). Here, we develop and apply a method for delineating the interacting motifs of hub proteins (i.e., highly connected proteins). The method relies on the observation that proteins with common interaction partners tend to interact with these partners through a common interacting motif. The sole input for the method are binary protein interactions; neither sequence nor structure information is needed. The approach is evaluated by comparing the inferred interacting motifs with domain families defined for 368 proteins in the Structural Classification of Proteins (SCOP). The positive predictive value of the method for detecting proteins with common SCOP families is 75% at sensitivity of 10%. Most of the inferred interacting motifs were significantly associated with sequence patterns, which could be responsible for the common interactions. We find that yeast hubs with multiple interacting motifs are more likely to be essential than hubs with one or two interacting motifs, thus rationalizing the previously observed correlation between essentiality and the number of interacting partners of a protein. We also find that yeast hubs with multiple interacting motifs evolve slower than the average protein, contrary to the hubs with one or two interacting motifs. The proposed method will help us discover unknown interacting motifs and provide biological insights about protein hubs and their roles in interaction networks.
Recent advances in experimental methods have produced a deluge of protein–protein interactions data. However, these methods do not supply information on which specific protein regions are physically in contact during the interactions. Identifying these regions (interfaces) is fundamental for scientific disciplines that require detailed characterizations of protein interactions. In this work, we present a computational method that identifies groups of proteins with similar interfaces. This is achieved by relying on the observation that proteins with common interaction partners tend to interact through similar interfaces. The proposed method retrieves protein interactions from public data repositories and groups proteins that share a sensible number of interacting partners. Proteins within the same group are then labeled with the same “interacting motif” identifier (iMotif). The evaluation performed using known protein domains and structural binding sites suggests that the method is better suited for proteins with multiple interacting partners (hubs). Using yeast data, we show that the cellular essentiality of a gene better correlates with the number of interacting motifs than with the absolute number of interactions.
Protein–protein interactions play a central role in many cellular processes, ranging from signal transduction to formation of cellular macrostructures and cell cycle control [1–3]. Recently, several techniques such as two-hybrid assays [4–6] and affinity purifications followed by mass spectrometry [7–9] have enabled large-scale identification of protein–protein interactions. While these efforts provide rich lists of interacting proteins, they do not produce information about the specific interfaces involved in each interaction.
Proteins interact through a limited set of interface types [3,10,11]. These interfaces are usually key determinants of the function. Therefore, narrowing down protein–protein interactions to interactions between specific protein components (e.g., a domain, sequence segments, a binding site, or a set of residues) is important for a more accurate characterization of the function of proteins and their complexes. Identifying the protein interfaces that mediate interactions may also be useful for the prediction of unknown protein–protein interactions [12,13], for homology-based protein annotation methods , and for relating gene essentiality and network topology .
Traditionally, the description of protein interactions in terms of the interacting components has been based on protein structural domains , protein functional sites , and protein patches . However, fully characterizing protein surfaces that are in contact with each other during an interaction requires the determination of the structure of protein complexes by X-ray crystallography or NMR spectroscopy. These methods are not always applicable and thus the number of known 3-D atomic structures of proteins and their complexes is limited. As a result, accurate and general computational methods for identifying motifs involved in protein–protein interactions are needed.
Recently, several methods [19–25] have been developed to describe protein–protein interactions in terms of interacting protein domains, as defined in the Structural Classification of Proteins (SCOP) , PFAM , and InterPro  databases. However, while these methods find interactions between predefined protein domains, interactions between undefined domains remain undetected. Sequence-based methods overcome this problem by identifying sequence signatures that consistently co-occur in pairs of interacting protein sequences , while structure-based methods can predict the amino acid residues that are in contact during a protein–protein interaction, but require information about the structures of both proteins [30–33]. Recently, Kim et al. used known protein interactions and structures to characterize the interfaces between two interacting proteins . They found that some previously accepted relationships between network topology and genomic features [34–36] are actually more reflective of the number of distinct binding interfaces. For example, highly connected proteins in the network (i.e., hubs) with multiple interfaces are twice as likely to be essential as hubs with one or two interfaces. The findings of Kim and coworkers clarify some previous analyses that related the observed essentiality of hubs with their high number of interacting partners [34,37] or with their interactions to other hubs . Kim et al. also demonstrated that the evolutionary rate is significantly lower for multi-interface hubs than for the average protein, but not so for hubs with one or two interfaces.
Here, our basic assumption is that proteins with overlapping sets of interacting partners tend to interact with the common partners through the same interacting motif, such as a domain, sequence segments, a binding site, or a set of residues. A similar assumption has been previously used to annotate protein sequences [14,39–41]. We first tested this assumption based on databases of protein interactions  and protein domains defined in SCOP , observing that the assumption holds true for highly connected proteins (i.e., hubs in a protein–protein interaction network). Building on this validation, we then developed a method for identifying interacting motifs (iMotifs), which has been implemented within the protein–protein interaction framework and integration engine PIANA (Protein Interactions and Network Analysis) . iMotifs are not required to be of any particular structural type or size, thus allowing us to characterize proteins and their interactions at different levels of resolution, ranging from full proteins to small binding sites. In contrast to other methods, our approach is not limited to finding predefined classes of interacting motifs, such as SCOP domains or PROSITE functional sites, and can be used to identify unknown interacting motifs. Moreover, the sole input for our method is binary protein interactions; neither structure nor sequence information is required to assign iMotifs to proteins.
Two main objectives have been addressed in this work. The first objective was to demonstrate whether protein interactions alone can be used to infer interacting motifs. The positive predictive value of our method in detecting proteins with common SCOP families was 75% at sensitivity of 10%, and the Spearman correlation coefficient between the number of iMotifs assigned to proteins and the number of interfaces found by Kim et al.  was 0.57. The second objective was to examine if the conclusions on protein hubs of Kim et al.  hold for our iMotifs assignments. The results demonstrate that protein hubs with multiple iMotifs are more likely to be essential than hubs with one or two iMotifs and that protein hubs with multiple iMotifs evolve slower than the average protein in the dataset, as opposed to hubs with one or two iMotifs.
The basic assumption behind this work is that proteins with overlapping sets of interaction partners tend to interact with those partners through a common interacting motif. The validity of this assumption was tested on a nonredundant set of 368 proteins with known SCOP domains (Material and Methods). Although SCOP does not classify proteins by their interfaces, SCOP domains were used as surrogates for iMotifs because protein interaction types can be defined by the domains in the interacting proteins .
We found the number of common interaction partners (N) to be a good indicator of the probability of two proteins having a domain in the same SCOP family, especially for highly connected proteins (Figure S1). For example, 73% of protein pairs with 50–60 common interaction partners shared a SCOP domain. We also studied other metrics to measure the similarity between two sets of interaction partners, but none of them outperformed N at the identification of protein pairs with a common domain family (Figures S1 and S2A, and Table S1).
It is worth noting that our assumption relies on the binary nature of the input interactions. Two proteins tend to have a common interacting motif only if they share direct physical interactions with the same partner(s). However, the likelihood of two proteins sharing a SCOP domain was lower by solely using yeast two-hybrid experimental data, a detection method that is more likely to contain binary protein interactions than other experimental methods  (Figure S2B).
Based on the observation that highly connected proteins with common interaction partners tend to interact with them through a common interacting motif, we have developed a method that groups proteins with similar interacting motifs (Figure 1 and Material and Methods section). Briefly, the procedure is carried out in four steps: 1) build a protein–protein interaction network; 2) initialize a cluster interaction network by assigning each protein of the network to a cluster; 3) iteratively create new clusters by fusing similar clusters (allowing a protein to be in more than one cluster) until the similarity score drops below a predefined threshold; and 4) label with a different interacting motif identifier (iMotif) each cluster with more than one protein and derive iMotif–iMotif interactions from the clustered network. In step 3), the similarity score between two clusters is their number of common interacting partners in the cluster interaction network (N). Assigning an iMotif to a group of proteins simply establishes that they have a certain feature that allows them to interact with the same set of partners, without determining the size, sequence, or structure of that feature (Figure 2A). Thus, an iMotif can be an interface consisting of a set of domains or only a specific constellation of a small number of residues (Figure 2B).
The definition of iMotifs depends on a similarity metric and its threshold. Thus, different thresholds or metrics produce different iMotifs, corresponding to different levels of resolution in the description of protein–protein interactions. For example, the method can be applied at the resolution of domains from SCOP , and PFAM , or at the higher resolution of functional sites from PROSITE . In this section, we have evaluated the method on a nonredundant set of proteins (Material and Methods) for three different tasks: (i) detecting proteins with common SCOP domain families; (ii) predicting SCOP domain–domain interactions observed in the Protein Data Bank (PDB) ; and (iii) predicting the number of distinct binding interfaces as defined by Kim et al. . Therefore, in the evaluation, iMotifs effectively represent SCOP family domains (for the first two tasks) and structural binding interfaces (for the third task).
We evaluated the ability of the method to detect proteins with a domain in the same SCOP family (Methods). Using an N threshold of 30 common interaction partners, our method achieves a positive predictive value of ~75%, sensitivity of ~10%, and applicability of ~20% (Figure 3). The positive predictive value drops to ~50% for N of 15, indicating that the accuracy of our method in detecting proteins with common SCOP family domains proportionally decreases with the number of common interacting partners. Therefore, the method should be preferentially applied to assigning interacting motifs to highly connected proteins. The growth of the interactome data [47,48] is likely to make the approach more applicable in the future. Nevertheless, the applicability can already be increased at the expense of the positive predictive value by using other similarity metrics (Table S1). We provide a complete list of iMotif assignments for the test set (Table S2).
Domain–domain interactions can be predicted from the iMotif–iMotif interactions found by the method (Materials and Methods). We evaluated the accuracy of these predictions with respect to domain interactions in the PDB. Our method achieves a positive predictive value of ~65% for ~5% of the proteins in the test set (Figure S4), suggesting that the method can be applied to the prediction of domain–domain interactions when a sufficiently large and varied sample of protein interactions is known. However, with the available interaction data, other methods that rely on both interaction networks and predefined domains [19–22] may be better suited than our approach for predicting domain–domain interactions.
Kim et al. used protein 3-D structures and binary protein interactions to make inferences about the number of binding interfaces of proteins . We tested whether there is a correlation between the number of binding interfaces found in their work and the number of iMotifs predicted by our method (Figure 4). The number of protein interfaces indeed correlates with the number of predicted iMotifs per protein (e.g., for N of 20, rs is 0.57 and p-value 0.01). The number of iMotifs assigned to proteins by our method tends to be higher than the number of binding sites defined by Kim et al. This might be attributed to two factors: (i) current structural data do not contain all possible protein–protein interactions, resulting in an underestimation of the number of binding sites assigned by the method in , and (ii) the lack of coverage of the interactome space, which results in an overestimation of the number of iMotifs per protein assigned by our method. The second factor is addressed by using sequence information to merge similar iMotifs (below).
Using an N threshold of 20, our method assigned 12,342 iMotifs to 2,014 of the 5,571 hub proteins in PIANA (i.e., proteins with 20 or more interaction partners), resulting on average in 8.6 iMotifs per hub. The percentage of hubs with one or two iMotifs was 46% (241 hubs had one iMotif; 689 hubs had two iMotifs). We studied the correlation between the number of iMotifs assigned to a hub and its number of interactions, finding no relationship between the two variables (Spearman correlation coefficient is −0.002 with p-value 0.94). A complete list of iMotif assignments for all hub proteins in PIANA is in Table S3 and the number of iMotifs per hub is in Table S4.
Similarly to Kim and co-workers' results, , we found that yeast hubs with multiple iMotifs are more likely to be essential than those with one or two iMotifs (singlish-iMotif) (Table 1). Furthermore, we observed a correlation (rs is 0.61 and p-value is 1.64 × 10−5) between the number of iMotifs in yeast hubs and the fraction of essential proteins (Figure 5A). We compared the correlation between iMotifs and essentiality to the correlation between the number of interactions of hubs and essentiality to confirm that the first was not a direct consequence of the second (Figure 5B). These results suggest that the number of iMotifs predicted for a protein could be used for selecting biologically relevant candidates for gene deletion experiments.
A common measure of evolutionary rate is the dN/dS ratio (the ratio of nonsynonymous to synonymous substitutions) . Kim et al. found that multi-interface hubs have a lower evolutionary rate than the average protein in their data, but the same was not true for singlish-interface hubs. Our results are in agreement with their findings. Multi-iMotif hubs, in contrast to singlish-iMotif, evolve significantly slower than the average protein in our dataset (Table 2). However, the evolutionary rate difference between multi- and singlish-iMotif hubs (i.e., 0.062 and 0.056, respectively) was not found to be significant (p-value of 0.21).
Sequence patterns for each iMotif were generated using the PRATT program  (Methods). Briefly, PRATT identifies sequence patterns common to a set of sequences. In this work, we selected significant sequence patterns for each iMotif by maximizing the number of proteins within the iMotif that contained the pattern. The significance (i.e., p-value) of a sequence pattern assigned to an iMotif depends on the occurrence of the pattern in the iMotif with respect the whole dataset. As shown on Figure 6, 80% of iMotifs had a specific sequence pattern contained in at least 74% of their proteins (using a p-value cutoff of 10−8). A list with the best sequence pattern for each iMotif is provided in Table S5. Interestingly, a similar analysis based on Pfams assignments to iMotifs showed a different trend (i.e., very few iMotifs had most of their proteins described by a Pfam). For example, as shown on Figure S6, only 10% of all iMotifs had a specific Pfam in at least 28% of their proteins (p-value cutoff of 10−8). Such a difference can be explained by the fact that many interactions are carried out by short sequence patches [3,51], while Pfam families usually consist of long structured protein regions.
As indicated above, incompleteness in interaction data may result in artificially high numbers of iMotifs. This overestimation can be reduced by merging iMotifs with a common sequence pattern (Material and Methods). Fusing iMotifs based on sequence pattern similarity decreased the average number of iMotifs per hub from 8.6 to 4.2. This reduction, in turn, increased the correlation between the number of binding sites from Kim et al.  and the number of iMotifs in the test set proteins (Spearman correlation coefficient was 0.59 with p-value of 0.001).
We described, implemented, and evaluated a method that relies solely on binary protein interactions to identify interacting motifs (iMotifs) and their interactions. Our approach obtained high positive predictive value for identifying proteins with domains from the same SCOP family and predicting domain–domain interactions. We also analyzed hub proteins and their properties based on the number of iMotif assigned to them, obtaining similar findings to those in an independent approach that rely on protein structure information .
Recent estimates suggested that only one-fifth of interaction types are known . Therefore, current knowledge of protein structures is not sufficient to describe all protein interaction types. Our approach, in contrast to other previously described methods, accomplishes three different objectives: (i) it predicts the number of different iMotifs in a protein, (ii) it classifies proteins by their predicted iMotifs, and (iii) it predicts interactions between the iMotifs. The method can identify iMotifs independently of structural or sequence information; it can assign an iMotif to two structural domains or two iMotifs to a single domain. Since the resolution at which iMotifs describe protein interfaces depends on the similarity metric used and the threshold applied by the method, iMotif assignments can be used to infer whether the interaction is mediated through multiple, single, or partial domains. On the one hand, setting a high threshold on the number of common interaction partners (N) will assign few iMotifs to reduced sets of proteins (i.e., very specific and restrictive iMotifs). On the other hand, using low N thresholds will assign the same iMotif to broad numbers of proteins (i.e., very unspecific and general iMotifs). We showed that the method works better for highly connected proteins and using high values for N. Moreover, our approach is not limited to finding predefined classes of protein components and thus allows us to predict new types of interacting motifs. For example, an iMotif can be mapped to a predefined class (e.g., a SCOP domain or a PROSITE functional site) by examining the known classes assigned to proteins with that iMotif. Therefore, iMotifs that remain unmapped are likely candidates for unknown classes. Such predictions may prove useful for target selection in structural genomics.
Relying solely on experimentally detected interactions affects the accuracy of our method. It has been shown that high-throughput experiments have limited reliability and that many of the detected interactions are probably not direct (i.e., they are carried out through a third protein) or do not even exist (i.e., false positives) . However, we did not observe an improvement by solely using interactions from yeast two-hybrid assays (Figure S2B), the high-throughput method that is best suited to detect direct interactions. As more interaction data becomes available, we will reexamine the effects of restricting the method to employ interactions specifically labeled as “direct” . One way of avoiding these limitations is to calculate similarity scores using families of proteins instead of absolute numbers of protein partners. This will prevent assigning the same iMotif to proteins that have many common partners but all of them belong to a single protein family. Removal of redundancy from the sets of partners indeed increases the percentage of identified protein pairs with a common domain family (Figure S7).
The combination of iMotif assignments with sequence search methods identified specific sequence patterns in iMotifs. We found that most iMotifs had a significant sequence pattern that was contained in most of the iMotif proteins. These patterns, which could be responsible for the iMotif proteins common interactions, could then be used to: (i) localize the iMotif in the protein sequence, (ii) assign iMotifs to proteins for which no interaction data is yet available, and (iii) predict interactions between proteins that contain patterns assigned to two interacting iMotifs.
Our iMotif assignments are similar to those obtained using an independent approach, which relies not only on known protein–protein interactions, but also on protein structure information . In agreement with the results of Kim et al., we observe different properties between hubs with multiple iMotifs (multi-iMotif) and hubs with one or two iMotifs (singlish-iMotif). In particular, we find that (i) multi-iMotif hubs are more likely than singlish-iMotif hubs to be essential for cell viability, and (ii) multi-iMotif hubs, in contrast to singlish-iMotif hubs, evolve slower than the average protein. Furthermore, we have also observed a correlation between the number of iMotifs of a hub and its essentiality for cell survival. The properties observed for hubs with respect to their number of iMotifs may reflect the difference between proteins with multiple simultaneously possible interactions (multi-iMotif hubs are probably involved in permanent complexes) and proteins with multiple exclusive interactions (for singlish-iMotif hubs involved in transient interactions). This is in agreement with the previous observation that interfaces of transient protein–protein interactions are less restricted in evolution than interfaces in permanent complexes .
Our results extend the findings and conclusions of Kim and co-workers  to proteins of unknown structure. Thus, inferring interacting motifs from protein interactions is likely to be helpful for providing biological insights about hubs for which no structural information is available.
Protein–protein interactions from DIP 2006.01.16 , MIPS 2006.01 , HPRD 2005.09.13 , BIND 2006.01 , and two recent high-throughput experiments [5,6] were integrated using PIANA version 1.2 , allowing us to work with a large set of 363,571 interactions between 42,040 proteins. PIANA represents protein interactions as a network where the nodes are proteins and the edges are interactions between the proteins. In such a network, a set of proteins linked to protein pj (i.e., physically interacting with pj) is named “partners of pj”. In such a network, we define hubs as proteins with 20 or more partners. The average number of interactions per hub in our dataset was ~49. PIANA builds the protein interaction network by retrieving partners for an initial set of proteins. To avoid a positive bias in the method evaluation, interactions inferred from 3-D structures were not used in this work.
Protein domain assignments and classification were obtained from the SCOP release 1.69 . Here, domains are defined at the SCOP family level. Thus, domain–domain interactions refer to SCOP family interactions. The number of protein binding interfaces for hub proteins was obtained from the Structural Interaction Network 2.0 .
A list of ORFs essential for the survival of the yeast cell was obtained from the Saccharomyces Genome Deletion Project . The evolutionary rates (dN/dS) of yeast proteins were taken from the adjusted values given by Wall et al. .
The assignment of iMotifs to a set of proteins is carried out in a four-step procedure (Figure 1):
First, build the protein interaction network.
Second, initialize a cluster interaction network (i.e., nodes are clusters that contain one or more proteins, and edges are interactions between clusters) by assigning each protein of the protein interaction network to a different cluster. Each cluster (containing one protein pj) interacts with those clusters that contain a partner of pj in the protein interaction network.
Third, iteratively create new clusters by fusing the most similar clusters until the similarity score drops below a predefined threshold. Two clusters are similar if they share a minimum number of common interacting partners (N). Thus, the similarity score between two clusters is their number of common partners in the cluster interaction network. Other similarity metrics were considered, but none outperformed the use of N (Figure S1). When fusing two clusters, the resulting cluster inherits interactions that were common to both fused clusters. Since proteins may have multiple interfaces, all initial clusters (from step 2) remain in the cluster interaction network even after being fused to another cluster. Interactions between non-initial clusters are not considered for calculating the similarity scores.
Fourth, each cluster with more than one protein is labeled with a different interacting motif identifier (iMotif), and that iMotif is assigned to all proteins within that cluster. iMotif–iMotif interactions are then derived from interactions in the cluster interaction network where both sides of the interaction have been labeled with an iMotif identifier.
For example (Figure 1), a proteome of six proteins (namely A, B, C, D, E, and F) forms a network of interactions that connects proteins A with B, C, and D, and protein E with B, C, D, and F (step 1). Our method starts by creating a cluster interaction network from the network of protein interactions (i.e., six clusters with seven interactions) (step 2). Next, the clusters that share the largest number of common interactions are fused (i.e., clusters 1 and 5, with three common interactions, are fused into a new cluster 7). This step is then repeated until the maximum similarity score between the clusters drops below a predefined threshold (i.e., N = 2 common interactions). Thus, the iterative process will run for another iteration creating a new cluster (cluster 8) by fusing clusters 2, 3, and 4, which have two common interactions (step 3). Once the iterative process is finished, the method assigns iMotif identifiers to all proteins in clusters with more than one protein (i.e., proteins A and E in cluster 7 share iMotif 1, and proteins B, C, and D in cluster 8 share iMotif 2) (step 4). Moreover, iMotif–iMotif interactions are then derived from the cluster interaction network (i.e., one interaction between iMotif 1 and iMotif 2).
Figure 2 illustrates iMotif assignments from a network perspective (Figure 2A) and from a structural perspective (Figure 2B). A more detailed description of the algorithm is provided as pseudocode in Figure S8.
We have evaluated the method on a test set created by selecting proteins (i) with at least five experimentally detected interactions, (ii) with at least 80% of their sequence covered by the domains defined in SCOP, and (iii) that did not introduce a redundancy bias in the evaluation (i.e., if any two sequences had a sequence identity greater than 30%, a BLAST e-value smaller than 10−5, and the alignment had at least 30 residues, the shortest member of the pair was not selected). The final set contained 368 sequences (Table S6). Due to the restrictions imposed, the test set contains many proteins related to the proteosome and the ribosome.
The SCOP family assignment was evaluated by considering as positive assignments those proteins found by the method to have a common iMotif with the query protein. Among these positives, we define as true positives those proteins that have a common SCOP family code with the query protein. Moreover, we define as false negatives the proteins that have the same SCOP family code as the query protein but were not found by the method to share an iMotif.
iMotif–iMotif interaction predictions were evaluated against interacting SCOP families obtained from the PDB. Two SCOP domains were considered to interact if they were co-crystallized and had at least two atoms within 5 Å distance. Because we are interested in domain interactions at the protein–protein interaction level, we excluded intrachain interactions from this set. Our method creates a list of putative domain–domain interactions for each predicted iMotif–iMotif interaction by assuming that all domains of the query protein with one iMotif interact with all domains of proteins with the other iMotif. In this context, we define as positive any iMotif–iMotif interaction where the query protein is involved. A positive is then considered a true prediction if at least one of its putative domain–domain interactions is observed in the PDB. Finally, false negatives are interactions observed in the PDB for SCOP families of the query protein that do not appear in any list of putative SCOP family interactions.
To avoid biases in the evaluation, only proteins from the test set (before removing redundancy) and their SCOP families were considered when counting positives and negatives. The positive predictive value is defined as the number of true positives over the total number of positives, and sensitivity is the number of true positives over the sum of true positives and false negatives. The positive predictive value and sensitivity were calculated with respect to the similarity score threshold used for stopping the clustering. We also define the applicability of the method as the percentage of proteins with at least one positive under a given threshold.
For each group of protein sequences with a given iMotif, sequence signatures were generated using the PRATT program , a software tool capable of finding flexible sequence patterns from a set of unaligned sequences. Parameters were set to produce patterns covering a maximum of 15 residues with no more than three consecutive unspecified positions (gaps) and a maximum of one flexible (of variable length) gap region. The number of nonredundant patterns for all iMotifs was 80,654. Next, all patterns were searched against all proteins in PIANA with at least one interaction (a dataset of 42,040 sequences) using the ps_scan program .
The significance (i.e., p-value) of the association between a sequence pattern and an iMotif was assessed using the binomial distribution, based on the occurrence of the pattern inside the iMotif with respect the whole dataset :
where M is the number of protein sequences within an iMotif, n represents the number of proteins within the iMotif that contain the sequence pattern, and p is the probability of finding a protein from the whole dataset that contains the same pattern (i.e., the number of proteins containing the pattern divided by the number of proteins in the whole dataset).
In this work, the best sequence pattern for each iMotif (Table S5) was considered to be the pattern that maximized the number of proteins in the iMotif that had the pattern. Two additional considerations were taken into account for selecting the best pattern for an iMotif: (i) the sequence pattern should be found in at least 70% of proteins within the iMotif and (ii) the p-value of the pattern should be lower than 10−8.
Interacting motifs were merged by means of an agglomerative hierarchical clustering. Two iMotifs were considered to be similar if they had a common sequence pattern when applying a p-value cutoff of 1 × 10−5 and requiring the pattern to be found in at least 70% of proteins in both iMotifs. Using more stringent p-value cutoffs did not produce any iMotif fusions for proteins from the test set.
Hidden Markov Models from the Pfam-A database  were assigned to all proteins with at least one known interaction (a dataset of 42,040 sequences) using the HMMER package . The p-value for each HMM in relation with each iMotif was calculated using the binomial distribution, based on the occurrence of the Pfam inside the iMotif and in the whole dataset (above).
All correlations were measured using the Spearman rank correlation coefficient (rs). The assessment of whether two binomial samples of essentiality observations are significantly different was calculated using Fisher's test. The assessment of whether two non-Gaussian samples of evolutionary rate observations come from the same distribution was calculated using the Mann-Whitney U two-sided test. Correlations and differences in the observations were considered significant for p-values lower than 0.05. All tests were performed using the implementation provided by R .
To measure the likelihood of two proteins pi and pj having a common interacting motif, we defined four different similarity metrics: 1) N: the number of interaction partners that are common to pi and pj (long dashed line); 2) Rmax: the ratio between N and the number of partners of the protein with more partners (bold line); 3) Rmin: the ratio between N and the number of partners of the protein with fewer partners (circles); 4) Rave: the average of metrics Rmax and Rmin (dotted line). For each score obtained using the similarity metrics described above, the percentage of protein pairs within that score range is plotted. For example, we observed that using N as the similarity metric, 73% of proteins with 50–60 common interaction partners shared a SCOP domain.
(94 KB TIF)
(A) Proteins that have more than 70 interactions are ignored when performing the analysis.
(B) Only interactions from y2h are used.
(94 KB TIF)
(A) Superposition of the prothrombin and the pancreatic trypsin inhibitor structures (PDB IDs 1BTH and 2HPQ) shows an interaction through the SCOP family domain Eukaryotic proteases (in red).
(B) The structure of the anionic trypsin II interaction with the pancreatic trypsin inhibitor (PDB ID 1BRB) also shows an interaction through the SCOP family domain Eukaryotic proteases (in red).
(233 KB TIF)
The positive predictive value, sensitivity, and applicability are plotted as a function of the number of common interacting partners threshold used for the clustering. The positive predictive value and sensitivity using a trivial approach are also shown (thin lines). The applicability for the trivial approach is ~70%. Sensitivity is highly dependent on the group of proteins for which an iMotif–iMotif interaction can be predicted at a given threshold: if a protein with a prediction has a SCOP code with multiple interactions in the PDB, the sensitivity obtained can vary greatly from one threshold to another. Moreover, we compared our method with the trivial approach of creating putative lists of domain–domain interactions by assuming that all domain families of proteins in the test set interact with all domain families of their interaction partners. The positive predictive value for this trivial approach was 33%, which is below that of our method for thresholds higher than 15.
(93 KB TIF)
Due to the limited number of iMotif assignments with stringent N thresholds, correlations become nonsignificant (i.e., p-value > 0.05) for N thresholds higher than 22.
(35 KB TIF)
Three different significance cutoffs were used for finding Pfams associated with iMotifs: 10–5 (long dashed line), 10–8 (plain line), and 10–10 (short dashed line).
(87 KB TIF)
The procedure followed to remove redundancy was the same as the one used for creating the evaluation set. We observe a significant improvement for all metrics with respect to Figure S1.
(95 KB TIF)
(345 KB TIF)
In parentheses, the number of pairs with at least one domain within the same SCOP family is indicated. We observe that metrics such as Rmin outperform N at detecting a higher number of protein pairs with a domain within the same SCOP family, but this is done at the expense of decreasing the accuracy of the method.
(63 KB PDF)
The number of common interaction partners (N) was set to 15. First column is the iMotif identifier. Second column is the number of proteins within the iMotif. Subsequent colums are the proteins within the iMotif. Proteins are identified using UniProt entry names and NCBI GI identifiers. In parentheses, “yes”, “no”, and “-” indicate whether the proteins had a domain within the same SCOP family.
(251 KB TXT)
The number of common interaction partners (N) was set to 20. See legend on Table S2.
(2.4 MB TXT)
(38 KB TXT)
First column is the iMotif identifier. Second column is the pattern in PROSITE format. Third column is the fraction of proteins within the iMotif that have the pattern. Fourth column is the p-value of the pattern.
(704 KB TXT)
Proteins from the Test Set, using UniProt Accession Numbers
(68 KB PDF)
We thank P. M. Kim for providing the data for Figure 4. We acknowledge all members of the Sali and SBI labs, especially Fred P. Davis and J. García, for helpful discussions and providing data. We also thank the three anonymous reviewers for valuable comments.
A previous version of this article appeared as an Early Online Release on July 30, 2007 (doi:10.1371/journal.pcbi.0030178.eor).
Author contributions. RA conceived of the idea and performed research. AS, MAMR, and BO provided scientific guidance. JB contributed sections to the manuscript. RA, MAMR, JB, and BO analyzed results; RA, AS, MAMR, and BO wrote the paper.
Funding. RA is supported by a grant from the Spanish Ministerio de Ciencia y Tecnología (MCyT, BIO2002–03609). The work has been supported by grants from the Spanish Ministerio de Educación y Ciencia (MEC, BIO02005–00533 and PROFIT PSE0100000-2007-1), the “Programa Gaspar de Portolà (DURSI),” and the research program of the Barcelona Supercomputing Center (BSC) to access the facilities of Mare Nostrum; AS in particular was supported by US National Institutes of Health grants U54 RR022220 and PN2 EY016525, Sandler Family Supporting Foundation, IBM, Intel, Netapp, and Hewlett Packard. MAMR acknowledges support from Generalitat Valenciana (GVA2007/065). BO acknowledges support from Infociencia and European Union grants INFOBIOMED-NoE (IST-507585) and ANEURIST.
Competing interests. The authors have declared that no competing interests exist.