We analyzed the experimental data from 3,290 IP-MS experiments targeting 1,083 antigens (bait proteins) using 1,796 different antibodies. These experiments detected 11,485 non-redundant proteins (Dataset S1). Some of the baits were pulled down with several different antibodies. Some experiments with the same bait and antibody were repeated several times under different conditions, i.e., stimulated or unstimulated cells, or different cell types. Complexes were mostly isolated from nuclear fractions, but some experiments used cytosolic fractions. A summary of the experimental conditions, cell types, antibodies, and baits used, the counts of normalized peptides identified per protein in each experiment, and the sizes of the protein lists identified in each experiment can be obtained directly from the primary publication (Malovannaya et al.).
IP-MS proteomics profiling has several known experimental challenges that need to be considered when applying global functional analyses to such data. First, it is well established that the proteins identified in such experiments are enriched for highly abundant and “sticky” proteins. As a result, numerous proteins appear in almost all pull-downs regardless of the cell type, cellular fraction, or experimental conditions. To address this, we used a list of “non-specific” proteins to filter out protein identifications that appear frequently in many pull-downs (Dataset S1). We removed these proteins from the results for all further analyses. Such a “non-specific” protein list can be a useful guideline for filtering other IP-MS proteomics data from human cells. However, filtering IP-MS proteomics data against a “non-specific” list is only meant as a guide: sticky proteins that appear non-relevant may play an important biological role that would be missed by removing them. In general, proteins on the list are enriched in heat shock proteins, ribosomal proteins, and heterogeneous nuclear ribonucleoproteins (hnRNPs). In addition, the majority of proteins on the non-specific list were selected based on purifications from nuclear extracts, so some abundant cytosolic proteins may not have been removed and may therefore be overrepresented in the protein-protein and domain-domain interaction predictions.
To integrate and visualize the results from the 3,290 IP-MS experiments, we first used the Jaccard Distance (JD) to construct a CoRegs complex similarity graph, where nodes represent the protein lists from each experiment and links represent the overlap between experiments (Fig. S1). Nodes and links are preserved in the network only if the JD-based similarity between the two protein lists exceeds a threshold of 0.7. This retained 491 experiments and 2,233 links between them, a small portion of all possible experiments and their similarities (Fig. S2A). Typically, pull-down experiments reported the identification of ~30–200 proteins, but the distribution has a heavy tail, with a few experiments identifying over 1,000 proteins (Fig. S2B).
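The construction of the experiment similarity graph can be sketched as follows. This is only an illustration under stated assumptions: the score is taken to be the Jaccard index (shared proteins divided by the union of the two lists), the 0.7 cutoff is applied as a minimum similarity, and the experiment identifiers and protein names are made-up placeholders rather than values from the dataset.

```python
from itertools import combinations


def jaccard_index(proteins_a, proteins_b):
    """Shared proteins divided by all proteins seen in either list."""
    a, b = set(proteins_a), set(proteins_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(experiments, threshold=0.7):
    """Keep only experiment pairs whose similarity exceeds the threshold.

    `experiments` maps an experiment id to its list of identified proteins.
    Returns (nodes, edges); experiments without any retained link are dropped.
    """
    edges = {}
    for id_a, id_b in combinations(sorted(experiments), 2):
        score = jaccard_index(experiments[id_a], experiments[id_b])
        if score > threshold:  # assumption: cutoff applied to similarity
            edges[(id_a, id_b)] = score
    nodes = sorted({exp_id for edge in edges for exp_id in edge})
    return nodes, edges


# Hypothetical toy input: three pull-downs with placeholder protein names.
toy_experiments = {
    "ip1": ["P1", "P2", "P3", "P4"],
    "ip2": ["P1", "P2", "P3", "P4", "P5"],
    "ip3": ["P1", "P8", "P9"],
}
toy_nodes, toy_edges = similarity_graph(toy_experiments)
print(toy_nodes, toy_edges)
```

In this toy input only `ip1` and `ip2` share enough proteins (4 of 5) to pass the cutoff, so `ip3` disappears from the graph, mirroring how only a subset of the experiments is retained.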
Our aim in this study is to assign confidence scores to binary prey-prey protein-protein and domain-domain interactions by integrating information from the 3,290 IP-MS experiments. The rationale for this approach is that the experiments, each reporting a list of ~30–200 proteins per pull-down, together provide enough information to reconstruct high-fidelity, small complexes and potentially enough to recover direct physical interactions between pairs of proteins and domains. We reasoned that if we use the information across all experiments to score each pair of proteins for potential direct interaction, we will be able to identify novel associations in addition to recovering known interactions better than expected by chance. In contrast with most prior methods, which focus on scoring bait-prey interactions, our equations predict interactions between prey proteins that repeatedly appear together in different pull-downs. Although the data collected for this study were aimed at recovering interactions between the intended antigens (baits) and other proteins, the majority of primary antibodies cross-react with multiple secondary antigens, and those antigens interact with other proteins. This complicates bait-prey scoring of HT-IP/MS data. Yet, logically, if two proteins reappear together at the top of the lists in many different pull-downs, we can infer that they may physically interact regardless of which baits were used to pull them down. This makes it possible to predict likely binary interactions by utilizing the spectral counts, not just co-occurrence. To encode this logic into mathematical functions, we devised four scoring schemes, each attempting to address the problem in a slightly different way. To evaluate the performance of the four scoring schemes, we used known PPIs that we consolidated from online databases. The overall schema for this approach is depicted in the workflow figure.
Workflow of the analysis of aggregated IP-MS experiments.
To compare the performance of the different scoring methods, we visualized the results as receiver operating characteristic (ROC) curves (Fig. S3), random walks (Fig. S4), or sliding windows (Fig. S5). The visualization of the overlap between a ranked list and a gene set using a random walk was borrowed from the popular Gene-Set Enrichment Analysis method. The three equations AB, E3, and Pr can be combined with the Sørensen coefficient (Sor); doing so slightly improves the predictions of the AB and E3 equations and significantly improves the predictions made with the Pr equation. AB and E3 perform best when combined with the Sørensen coefficient because these equations take into account the quantitative levels of the peptides, rewarding interactions in which both proteins appear at the top of the same pull-downs and penalizing potential interactions in which the two proteins are not present in the same pull-down, or in which one protein appears at the top and the other at the bottom. The different methods recover different sets of interactions and in some cases complement each other, suggesting that a combined weighted score may provide better results than any single equation (Fig. S6, Dataset S2).
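To make the co-occurrence component and the ROC evaluation concrete, the sketch below scores every prey-prey pair with a Sørensen coefficient and then computes the area under the ROC curve against a set of known pairs. It rests on assumptions not spelled out in this section: the coefficient is computed over the sets of pull-downs in which each protein was detected, the AB, E3, and Pr equations are not reproduced, and all identifiers and the "known" pairs are hypothetical placeholders.

```python
from itertools import combinations


def sorensen(pulldowns_a, pulldowns_b):
    """2 * |A and B| / (|A| + |B|) over the pull-downs containing each protein."""
    a, b = set(pulldowns_a), set(pulldowns_b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def score_pairs(experiments):
    """Score all prey-prey pairs; keys are alphabetically ordered tuples."""
    membership = {}  # protein -> set of experiment ids where it was detected
    for exp_id, proteins in experiments.items():
        for protein in proteins:
            membership.setdefault(protein, set()).add(exp_id)
    return {
        (p, q): sorensen(membership[p], membership[q])
        for p, q in combinations(sorted(membership), 2)
    }


def roc_auc(scores, known_pairs):
    """Probability that a known pair outscores an unknown pair (ties count 0.5)."""
    pos = [s for pair, s in scores.items() if pair in known_pairs]
    neg = [s for pair, s in scores.items() if pair not in known_pairs]
    if not pos or not neg:
        raise ValueError("need at least one known and one unknown pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four pull-downs and two "known" interactions.
toy_pulldowns = {
    "ip1": ["A", "B", "C"],
    "ip2": ["A", "B"],
    "ip3": ["A", "B", "D"],
    "ip4": ["C", "D"],
}
pair_scores = score_pairs(toy_pulldowns)
toy_known = {("A", "B"), ("A", "C")}
toy_auc = roc_auc(pair_scores, toy_known)
print(pair_scores[("A", "B")], toy_auc)
```

The pairwise-comparison form of the AUC is used here because it needs no plotting or external library; for large pair lists a rank-based implementation would be preferable.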
Next, we used ball-and-stick diagrams to visualize the results across all experiments. We first visualized all overlapping interactions listed in the top 10% of predicted protein-protein interactions by each method (AB, E3, and Pr combined with Sor). This resulted in a network of 2,509 proteins (nodes) and 28,886 interactions (edges). With Cytoscape's organic layout algorithm, the hubs of this network self-organize into a hierarchical structure that may reflect their complex-formation relationships. This network provides a global view of the CoRegs interactome and allows zooming in to view the identity of high-confidence predicted protein-protein interactions and the complexes that these interactions form. To produce this zoomed-in view, we raised the threshold to include only interactions ranked in the top 1% by all three scoring methods, and we kept only three-node cliques. Three-node cliques are triangles in the network topology, in which three proteins are all connected to one another by three links. The resultant network contains 543 proteins and 1,893 interactions organized into 63 tightly connected protein complexes containing 3 to 25 proteins. Many of the interactions and complexes that emerged are already known from low-throughput protein-protein interaction studies. However, some of the complexes within this network and many of the predicted protein interactions are novel. As a proof of concept, we focused on one predicted complex in which most members were exclusively prey proteins in all experiments and most interactions were not previously known. The complex contains ten densely connected proteins with STRN at the center, predicted to interact with all nine other members. STRN, STRN3, and STRN4 are scaffolding proteins with a calmodulin-binding domain. Interestingly, CTTNBP2NL has previously been reported together with STRN and STRN3 in another IP/MS study. To experimentally validate one of the interactions within this complex, we used IP and western blotting to demonstrate a direct interaction between STRN and CTTNBP2NL, another member of the predicted complex. We chose this interaction based on antibody availability. Our experiment clearly shows that the two proteins interact. Such an experimental demonstration of a physical interaction does not prove that our prediction method works well, but it demonstrates how predicted interactions can be further validated experimentally. To prove that the predictions are of high quality, many such experiments would need to be performed with appropriate controls to show statistically that the combined equations can predict physical interactions with high fidelity.
Network of predicted interactions comprising 2,509 proteins (nodes) and 28,886 interactions (edges) ranked by all three methods in the top 10% of predicted interactions.
A network of predicted protein complexes containing 543 proteins and 1,893 interactions.
Confirmation of a binding interaction within the STRN complex.
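The three-node clique filter used for the zoomed-in network can be illustrated with a short sketch. It assumes the input is already the list of interactions that passed the top 1% cutoff for all three methods, keeps an edge only if it belongs to at least one triangle, and then reads complexes off as connected components. The function names and placeholder protein names are illustrative choices, not the authors' implementation.

```python
from collections import defaultdict


def triangle_edges(edges):
    """Keep edges (u, v) whose endpoints share at least one common neighbor."""
    neighbors = defaultdict(set)
    for u, v in edges:  # assumption: no self-interactions in the input
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {
        frozenset((u, v))
        for u, v in edges
        if neighbors[u] & neighbors[v]  # a common neighbor closes a triangle
    }


def connected_components(edge_set):
    """Group proteins into complexes by traversing the retained edges."""
    neighbors = defaultdict(set)
    for edge in edge_set:
        u, v = sorted(edge)
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical toy network: two triangles plus one dangling edge (P3-P4).
toy_ppis = [
    ("P1", "P2"), ("P2", "P3"), ("P1", "P3"),
    ("P3", "P4"),
    ("P5", "P6"), ("P6", "P7"), ("P5", "P7"),
]
kept_edges = triangle_edges(toy_ppis)
toy_complexes = connected_components(kept_edges)
print(len(kept_edges), toy_complexes)
```

The dangling `P3`-`P4` edge is removed because no third protein connects to both ends, which is the effect the clique requirement has on loosely attached predictions.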
We had access to a subset of the data before the full set of 3,290 IP-MS experiments was published by Malovannaya et al. We therefore developed our analysis methods on this subset of 114 IP-MS experiments. To integrate and visualize the results from these 114 IP-MS experiments, we created a Jaccard Distance (JD) CoRegs complex similarity graph (Fig. S7), similar to the network shown in Fig. S1. Most of these initial 114 experiments used Estrogen Receptor α (ESR1) and nuclear receptor co-activator 3 (NCOA3), also called SRC3, as baits under different cellular conditions. Both proteins play an important role in breast cancer, where SRC3 serves as the main co-activator of estradiol-dependent ESR1. The experiments that used ESR1 and NCOA3 as baits produced protein lists that are more similar to one another than to those of the other pull-downs (clusters in the subnetwork in Fig. S7). Using the same combined prediction scores from the three equations, but with lower thresholds, we identified five distinct high-confidence complexes, which we named SMARC, CSTF, RCOR, MBD, and SIN3A (Fig. S8). These five complexes have been previously reported in CORUM, a database of reported protein complexes, and some have been functionally characterized (Fig. S9). Specifically, the SMARC complex overlaps highly with complex IDs 238, 714, 803, and 806 in CORUM. The CSTF complex is listed as complex 1147 in CORUM, RCOR is listed as 626, and MBD and SIN3A have associated IDs with highly overlapping CORUM entries. The SMARC and CSTF complexes were recovered mostly from ESR1 pull-down experiments, while the other three complexes were formed by combinations of many other types of baits. Notably, the SMARC and CSTF complexes are nearly mutually exclusive between two different antibodies targeting ESR1, and are recovered in the control experiment from HeLa cells, which do not express ESR1. Thus, one antibody is likely cross-reacting with a member of the SMARC complex, whereas the other antibody cross-reacts with a member of the CSTF complex (Fig. S10). This result highlights the importance of reconstructing protein complexes from HT-IP/MS data based on prey-prey co-occurrence alone, independently of the intended baits.
Since PPIs often result from interactions between the structural domains of the interacting proteins, and since most domains of the pulled-down prey proteins are known from their amino-acid sequences, we can use the PPI scores to also score and rank domain-domain interactions (DDIs). Scoring domain interactions is slightly more complex because most proteins have several different domains, and a domain can appear more than once within the same protein. To resolve this, we assigned the score of each PPI to all possible domain pairs formed by taking one domain from each side of the PPI, and normalized the score across all the domains (see Methods). The aggregated score for each DDI was accumulated across and within all 3,290 IP-MS experiments. The idea of predicting DDIs from PPIs is not new. DDIs can also be predicted using structural biology methods or by evolutionary conservation of sequences across organisms. To evaluate which PPI scoring method works best for predicting DDIs, we compared the predicted DDI scores with reported DDIs from the Domine database, which contains both structurally observed and computationally predicted DDIs. ROC curves and random-walk plots were used to evaluate the DDI predictions, similarly to how we evaluated the PPI prediction methods (Fig. S11, Dataset S3).
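One way to propagate PPI scores to domain pairs is sketched below. The exact normalization is defined in the Methods and is not reproduced here; this sketch simply assumes that each PPI score is split equally among all domain pairs formed by one domain from each partner (counting repeated domains) and that the shares are summed per unordered domain pair. The protein names, domain names, and scores are hypothetical placeholders.

```python
from collections import defaultdict
from itertools import product


def score_ddis(ppi_scores, domains):
    """Distribute each PPI score over its cross-protein domain pairs.

    `ppi_scores` maps (protein_a, protein_b) to a score; `domains` maps a
    protein to a list of its domains, with repeats kept for repeated domains.
    """
    ddi_scores = defaultdict(float)
    for (prot_a, prot_b), score in ppi_scores.items():
        doms_a = domains.get(prot_a, [])
        doms_b = domains.get(prot_b, [])
        n_pairs = len(doms_a) * len(doms_b)
        if n_pairs == 0:
            continue  # no annotated domain on one side: nothing to score
        share = score / n_pairs  # assumed normalization: equal split
        for dom_a, dom_b in product(doms_a, doms_b):
            ddi_scores[tuple(sorted((dom_a, dom_b)))] += share
    return dict(ddi_scores)


# Hypothetical toy input; protein B carries the same domain twice.
toy_domains = {
    "A": ["DomX"],
    "B": ["DomY", "DomY"],
    "C": ["DomX", "DomZ"],
}
toy_ppi_scores = {("A", "B"): 0.8, ("B", "C"): 0.4}
toy_ddis = score_ddis(toy_ppi_scores, toy_domains)
print(toy_ddis)
```

With the equal split, a PPI between two multi-domain proteins contributes less to each candidate domain pair than a PPI between two single-domain proteins, so ambiguous evidence is down-weighted.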
The plots show that we can reliably recover known and predicted DDIs. In addition to the four equations used to score PPIs, we introduced another scoring scheme, λ, for scoring DDIs. λ is an index that counts the number of times two predicted interacting prey proteins have a domain on each side of the PPI. Such an index improves DDI predictions. Beyond the type of analysis we did for PPIs, we also attempted to further combine different prediction methods to optimize DDI predictions. Finally, we visualized our predicted DDIs together with known DDIs as a network diagram to explore interactions among all domains (Fig. S13) and within the STRN-centered complex identified by the PPI predictions. To further validate one of the predicted DDIs, we pursued a computational structural biology approach. We attempted to dock the Pkinase domain of STK25 to the HEAT domain of PPP2R1A. We chose these two proteins because both have crystal structures in the PDB. Although the DDI is already listed in Domine, that prediction is based on sequence and homology, so there is no direct evidence of such an interaction between these two proteins and their domains. Using the Molsoft ICM software, we obtained a docking score of −46.75 kcal/mol. This score is considered high and as such supports the interaction. From the conformation of this interaction, it appears that the Pkinase domain of the STK25 protein binds to the HEAT domain of PPP2R1A. The energy gap of approximately 2 kcal/mol (ICM score units) between the best docking score and the next best one suggests strong recognition of the HEAT domain by the Pkinase domain.
Validation of a domain-domain interaction.
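The λ index described above can be read as a simple count, and the sketch below shows one such reading. It assumes that each predicted PPI contributes at most one count to every unordered domain pair that has one domain on each partner, regardless of how many times a domain repeats within a protein; the source does not specify this detail. All names are hypothetical placeholders.

```python
from collections import Counter
from itertools import product


def lambda_index(predicted_ppis, domains):
    """Count, per domain pair, the predicted PPIs with one domain on each side."""
    counts = Counter()
    for prot_a, prot_b in predicted_ppis:
        doms_a = set(domains.get(prot_a, ()))
        doms_b = set(domains.get(prot_b, ()))
        # Assumption: a PPI counts once per distinct cross-protein domain pair.
        pairs = {tuple(sorted(pair)) for pair in product(doms_a, doms_b)}
        counts.update(pairs)
    return counts


# Hypothetical toy input with placeholder protein and domain names.
toy_protein_domains = {
    "A": ["DomX"],
    "B": ["DomY", "DomY"],
    "C": ["DomX", "DomZ"],
}
toy_predicted = [("A", "B"), ("B", "C"), ("A", "C")]
toy_lambda = lambda_index(toy_predicted, toy_protein_domains)
print(toy_lambda.most_common())
```

Here the pair (`DomX`, `DomY`) is supported by two predicted PPIs and therefore ranks above pairs seen only once, which is the kind of recurrence the index is meant to reward.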