MCAM was developed to capitalize on the success unsupervised learning has had on biological inference in the past and apply it to a new challenge in the field, that of understanding the function and regulation of phosphorylation in the ERBB network. Since the dynamic trajectory of a phosphorylation site will be shaped by the combination of the kinase and phosphatase activities, as well as any protective influence accumulated through binding to other proteins, and the exposure to all of these proteins by localization with or away from their regulatory proteins, then co-regulation of multiple phosphorylation sites may yield testable hypotheses regarding one or all of these possible shared traits. Encouragingly, biological information is enriched in dataset partitions derived through separation by clustering, whereas multiple random controls do not demonstrate a significant relationship with biological category enrichment. In addition to using this information to gain understanding of the phosphorylation events in the ERBB network, we also undertook to account for the fact that a large number of mathematically correct solutions can be derived through the use of various data transformations, distance metrics, algorithms, and target set sizes. We found that given a variety of clustering solutions, one could see almost any pair of phosphopeptides co-cluster, which may not necessarily be a product of co-regulation that can be linked directly to spatial localization, shared functionality, or shared enzymatic regulation. Therefore, we feel any single clustering implementation will have hypotheses likely to yield testable and supportable information as well as a good mix of those that may not.
Each clustering set within an MCA represents the dimensionality reduction of a large, multidimensional dataset. However, we found there is unique value in the majority of all clustering sets in
, according to the variety of biological enrichment, and so no single set could be considered as an optimal solution and evaluated in a traditional way. Therefore, the most significant development that had to be made in MCAM was to reduce the complexity of a large number of clustering sets. To this end we focused on two aspects of information: 1) Understanding how the parameters affected the final clustering solution, and how this related the power of any particular type of biological inference that could be made, based on statistically enriched information and 2) How to derive meaningful and testable biological hypotheses, through inference, concerning the function and regulation of protein phosphorylation. We found the resulting methods provided important insights when comparing similar measurements, indirectly, across multiple datasets.
Through the use of this technique we found that indeed, optimal clustering parameters one would choose for clustering a dataset would vary greatly dependent on the type of information that was desired at the outcome. However, for those biological categories chosen in this study, there were a few parameters that performed badly across the board, including Hierarchical clustering and the differential transform. These are somewhat unsurprising since the differential transform will reduce the vector size by one dimension, which in a dataset with few measurements may be detrimental to the ability to separate the measurements. However, it is an important observation to mention that lessons found in one dataset should not be extrapolated to all datasets, even of the same type. Although the cosine and correlation distance metrics were uninformative in producing biological enrichment in the EGF4 dataset, they were useful in clustering various portions of the HER2 dataset. This observation highlights the importance of a broad-spectrum application of multiple parameters of clustering.
It is important to consider the impact the feature selection process has on the final result in the MCAM method, which is applied during the parameter refinement, or pruning, step of the MCAM method. Given the varied relationships we observed between particular types of information and the parameters of clustering that best gave rise to them, it is likely that the final result will change with the addition of new biological features, or the alteration to existing features, such as the improvement of GO annotations. One can imagine the parameters pruned during the feature selection process would decrease with the addition of new features, which would redefine the value of a successful clustering solution. Although this suggests the results may need to be reconsidered as annotations improve, we feel this parameter refinement process helps to avoid the consideration of a large portion of currently uninformative sets. The open source nature of the MCAM software project allows for flexibility in altering the specific categories and the thresholds used during parameter refinement.
We found that there is no single ‘optimum’ clustering solution, or one that performed in the top quartile of all biological metrics of interest. Depending on the application, a small number of solutions could be chosen and analyzed in a way that is more traditionally performed in the field. However, we decided to focus on allowing the agreement of many solutions to highlight a potential area of robust biological inference through the agreement of biological enrichment and alternately the agreement of co-clustering of phosphorylation dynamics. These methods produced a large list of possibly interesting biological inferences, of which we highlighted just a few possibilities to demonstrate MCAM's power. In we highlight two ‘robust’ clusters based on repeated enrichment of categorical terms, which creates a hybrid cluster from a combination of multiple clustering sets within
, based on a particular enrichment label of interest. The hybrid cluster, like any single cluster produced from a single clustering method, represents a cluster of phosphopeptides that are strongly co-regulated. The first cluster, , was produced based on enrichment for co-regulation of phosphorylation sites on proteins involved in the MAPK cascade. For those proteins not currently annotated in the MAPK cascade, there is individual evidence that they are involved in regulating MAPK activity. FAM59A was recently named GAREM, which stands for Grb2-associated and regulator of Erk/MAPK activity 
. Specifically, phosphorylation of Y453 on GAREM was required for association with Grb2 and subsequent activation of Erk by EGF stimulation. PTPN18, a protein tyrosine phosphatase also known as BDP1, has been implicated in regulation of HER2 directed MAPK signaling activation 
. The study specifically found that PTPN18 was capable of inhibiting activation of mitogenic signaling. The robust co-regulation of PTPN18 Y389 phosphorylation with other components of the MAPK cascade, shown here, further implicates PTPN18 in MAPK signaling downstream of EGF stimulation, and highlights a particular mechanism for PTPN18 activity, that of Y389 phosphorylation. This modification may possibly act as a negative regulator of BDP1 activity, thereby relieving its function as a negative regulator of MAPK activity. Alternatively, if Y389 phosphorylation on PTPN18 potentiates its ability to shut down MAPK signaling, then these dynamics suggest it occurs subsequently with MAPK activation. ARHGEF5, also known as TIM, is a RhoGEF, which has been shown to activate Rac, which is upstream of another MAPK family member, JNK. This suggests that the JNK cascade may be concurrently activated or that this particular RhoGEF has a role in the ERK1/ERK2 cascade directly. These results indicate that MCAM has been useful not only in highlighting a known co-regulation event, that of EGFR phosphorylation on sites that recruit SHC1 to the receptor with that of SHC1 phosphorylation sites that are phosphorylated following recruitment to the receptor, but also in highlighting proteins not yet generally recognized as playing a role in the EGFR/MAPK signaling pathway. This result also supports a role for these proteins in the MAPK pathway in human mammary epithelial cells as previous cell lines explored in previous studies. Additionally, the example of highlighting GAREM (FAM59A) phosphorylation on Y453 as playing an important biochemical role in the indicated pathway strengthens the hypothesis that PTPN18 Y389 phosphorylation is also an important biochemical mechanism in the MAPK pathway downstream of EGFR activation. Since we can find no specific study on the effect of PTPN18 Y389 phosphorylation, this hypothesis could not have come from literature mining. Additionally, we observe that traditional application of a single clustering implementation would likely not have highlighted this group of proteins in a way that would have linked PTPN18 Y389 and GAREM Y453 phosphorylation with that of EGFR Y1172 and MAPK Y187, since this event occurs in less than 15% of clustering set solutions in
In we highlight another ‘robust’ cluster, which indicates the dynamics of phosphorylation of Y44 on ENO1 are very similar to that of cortactin and paxillin phosphorylation, two proteins and phosphorylation sites that play a role in cell motility 
and are annotated as being localized to ‘lamellipodium’. However, since there is relatively little evidence that paxillin is localized to lamellipodia, but instead has a strong association with mature focal adhesions 
, it may be that ‘lamellipodium’ has been used as a blanket term for leading edge formations in the Gene Ontology since GO is lacking finer gradations of leading edge compartments. What is common to both paxillin and cortactin is localization in invadopodia 
, a term not currently included in GO. Such similarity of dynamics indicates potential co-regulation, which is dependent on a variety of factors. There is evidence that all three sites may be targets of Src 
. In addition to shared enzymatic control, similarity of dynamics might also be dependent on shared localization, especially when enzyme activation is localized to an area such as focal adhesions, invadopodia, or lamellipodium. To better conjecture where this co-regulation is occurring we looked at additional robust associations with ENO1 phosphorylation and found in addition to paxillin and cortactin this site is most closely associated with phosphorylation on a protein called AFAP1L2, which stands for actin filament associated protein like
-2. Despite several other phosphorylation sites on focal adhesion proteins, including integrin
(ITGB4), p130Cas/BCAR1, and focal adhesion kinase (FAK), none of these sites is similarly regulated. Additional evidence supporting this is that the parental EGF subset of the HER2
dataset also indicates that ENO1 and paxillin phosphorylation are tightly coupled with a third protein, Annexin A2, which has been implicated in cell spreading and migration 
as well as formation of invadopodia 
. Finally, a recent proteomic study found enolase is enriched in invadopodia 
further supporting the hypothesis that co-regulation of enolase, paxillin, and cortactin phosphorylation is through shared cellular localization.
The temporal association of ENO1 phosphorylation with that of cortactin and paxillin is intriguing. ENO1, in full and dimeric form, is a metabolic enzyme. However, a short isoform of the same gene product has been shown to block Myc transcription factor activity by binding the Myc promoter 
. This begs the question of whether ENO1 is playing a role in lamellipodium, invadopodia, or focal adhesions, related to its metabolic activity or some other gene product of yet to be discovered functionality. ENO1 phosphorylation on Y44 has long been known and when originally studied did not show a noticeable affect directly on its catalytic activity 
. However, in these studies it was observed that only a small percentage, roughly 5%, of total enolase was phosphorylated and so catalytic differences due to phosphorylation may have been imperceptible. Perhaps compartmentalization of a fraction of altered enzymatic activity could play an important functional role; this activity may be confined to lamellipodia, invadapodia, or focal adhesions. Despite the excess of ATP in the cell, the induction of an ATP gradient within invadopodia could conceivably act as a method of invadopodia formation, since many components, such as F-actin, would be sensitive to a shift in the ATP to ADP ratio. There are other indications that phosphorylation of metabolic enzymes may indeed be playing a functional role and driving tumor progression 
and our results may indicate a role for enolase phosphorylation specifically in the metastatic potential of tumor progression.
Finally, we found that using the concepts developed in MCAM, we could compare independent measures of the ERBB network and dissect signaling alterations occurring between different perturbations of the network, including ligand and receptor level differences. Using the co-occurrence matrix we were able to look “globally” at the differences between the networks. This study indicated that despite HER2 amplification, EGF stimulation drove signaling that was more similar between the two cell lines than it was for the same cell line under two different ligand stimulations. The greatest difference in signaling occurred when EGF stimulation of wild type cells was compared with HRG stimulation of HER2 overexpressing cells. We can dissect the signaling differences further by looking for those associations that have the most extreme differences between conditions. When we did this, we were fascinated to find that in the presence of HER2 overexpression, EGF drives very different dynamics on two phosphorylation sites of the adaptor protein SHC1 and sites on focal adhesion protein p130Cas/BCAR1. SHC1 is known to be recruited to EGFR by two different phosphopeptide binding events and it subsequently recruits, and activates, members of the MAPK cascade. Specifically, both Y349 and Y427 phosphorylation on SHC1 has been shown to recruit Grb2 
. Through its SH2 domain, SHC1 is recruited to EGFR Y1197 phosphorylation and through its PTB domain, it is recruited to EGFR Y1172 phosphorylation 
. Therefore, it is no surprise that subsequent phosphorylation of SHC1 on Y427 and Y349 would be tightly co-regulated with that of EGFR Y1172 and Y1197 phosphorylation when the network is stimulated with EGF. However, what is surprising is this co-regulation is broken for only Y349 in the presence of HER2 overexpression. Instead of being most closely co-regulated with the receptor phosphorylations it is instead most closely co-regulated with Y228 phosphorylation on catenin delta-1, CTTND1, a protein known to be interact with E-cadherin at cell-cell junctions 
. The authors of the original study found that HER2 amplification drives a higher migratory potential and posited that breakup of E-cadherin junctions would be essential to this process. Our finding may therefore indicate that SHC1 plays an important role in this process and that the differential regulation of Y349 and Y427 is perhaps driven by two populations of SHC1, one of which is localized at cell-cell junctions and which is differentially regulated, indicated by the sustained phosphorylation of SHC1 Y349 relative to Y427 phosphorylation, and a second population that is recruited to the receptor, which is probably the dominant population in EGF stimulation of wild type HMECs. The sequence surrounding catenin delta-1 Y228 matches known preferences for SHC1 SH2 recognition. In addition, multiple sites on the focal adhesion protein p130Cas/BCAR1 experience differential regulation in the presence of HER2 overexpession. Increased migration would come as a result of the disruption of both cell-cell contacts and cell-substrate contacts, so these sites might indicate a particular role in how cell-substrate contacts are disrupted in HER2 overexpressing cells. All of these data may help us in understanding the aggressiveness of tumor cells with HER2 amplification.
The methodology developed here has wide applicability to data mining of all varieties. Permutations of clustering parameters and their judgement of success by pertinent categorical data has the capability of producing a wide array of solutions that together span a meaningful range of data separation. MCAM, as developed, can be applied directly to any dataset currently in the PTMScout database, and any proteomic dataset that measures phosphorylation or lysine acetylation can be added by the public to PTMScout for analysis by MCAM. Extension to any type of multidimensional biological measurements simply requires the alteration of the target categorical data. For example, in addition to using gene and protein annotation information, one could look for known transcription factors when mining gene expression data. Another benefit of MCAM is that it provides a method for comparing the relationship between independent measurements of a system, even if the overlap of measurements is incomplete. This methodology provides a new method for understanding the relationship of quantitative measurements with each other, and importantly provides a means in which to judge the outcome of a parameter of clustering with regards to resulting power of inference. This is a much-needed tool when one lacks a satisfactory ‘gold standard’ by which to evaluate the impact of various parameters of unsupervised learning.