This paper addresses urgent calls to analyze proteomic data with more effective methods, and to integrate these analyses with protein interaction and function databases to elucidate signaling networks that drive diseases such as lung cancer. Combining data interrogation methods with computer visualization tools significantly augments our capacity to make sense of large data sets and their links to genome and protein interaction databases. We describe here effective approaches to explore data structure, select subsets based on statistical relationships, and visualize selections as networks. The combined internal and external evaluations provided strong evidence that clusters of proteins identified here represent functional signaling networks in lung cancer because they contain proteins that are known to interact with each other.
The open-source software platforms R, Cytoscape, and RCytoscape were employed for this study. Scripting languages such as R are much more adept at handling large data sets than spreadsheets, and R has a rich library of statistical analysis tools, including many developed for bioinformatics and systems biology. Cytoscape is arguably the most advanced tool for network graphing, and offers a graphical user interface (GUI) well suited for exploration and analysis of networks. RCytoscape (rcytoscape.systemsbiology.net) links R and Cytoscape, and extends Cytoscape's functionality beyond what is possible with the Cytoscape GUI.
Key steps that resolved informative clusters were:
1) Calculation of distance matrices using NA to represent the absence of data proved appropriate for mass spectrometry-based proteomic data, and would be advantageous for any data set where detection limits significantly compromise confidence about negative results.
2) Dissimilarity matrices were used as feature vectors for embedding. Embedding the dissimilarity representation may resolve data structure more effectively than the distance matrix because no attempt is made to preserve distance.
3) Multiple methods were used for statistical calculation of dissimilarity. A combination of Spearman (or Pearson) and Euclidean distance may increase the resolution of the statistical data structure, or clusters identified by different methods may be combined later.
4) t-SNE was employed for embedding. We found that t-SNE was as good as or better than other methods at resolving clusters from proteins well represented in the data, and far superior for identifying clusters from less well represented proteins. To explore data structure, displaying three-dimensional data structures in PyMOL offered the advantage that the investigator may explore the graph and select clusters of nodes for further analysis (Figure S2, Movie S2). Displaying two-dimensional data structure in Cytoscape had the advantage that individual node names were visible (Figures S1).
5) Data wrangling was performed where necessary to combine and filter clusters by conformity to a pattern in the primary data, membership, and/or signal strength. Inspection of the clusters' primary data (e.g., using heat maps) was crucial at this stage. This step is termed wrangling because manual, hypothesis-driven manipulation, and decisions based on the results, are akin to herding data into clusters.
6) Clusters were analyzed using external databases containing protein interaction data and GO terms.
7) Finally, clusters were visualized as networks to convey a large amount of information in a single graph. Merging edges was useful for clarity where graphs have a large number of edges. STRING and GeneMANIA use different methods to calculate edge weights, but the weights are of similar scale, so merging them is an acceptable way to provide an overview of evidence for interactions.
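Steps 1 to 3 can be illustrated with a small sketch. The Python code below is not the R workflow used in this study; it only shows one way to compute Euclidean and Spearman dissimilarities while treating missing values (here `None`, standing in for R's NA) as absent rather than as zero. The function names, the `1 - |rho|` form of the Spearman dissimilarity, the `min_shared` threshold, and the toy profiles are all assumptions made for illustration. Scaling the Euclidean sum up in proportion to the number of samples used follows the behavior documented for R's `dist`. The t-SNE embedding itself is omitted.

```python
import math

def shared(x, y):
    """Return value pairs for samples in which both proteins were detected."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]

def euclidean_na(x, y):
    """Euclidean distance over shared samples, scaled up for excluded ones."""
    pairs = shared(x, y)
    if not pairs:
        return None  # no shared samples: the distance is undefined, not zero
    total = sum((a - b) ** 2 for a, b in pairs)
    return math.sqrt(total * len(x) / len(pairs))

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(a, b):
    """Pearson correlation; None when either vector has zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)

def spearman_dissimilarity_na(x, y, min_shared=3):
    """1 - |Spearman rho| over shared samples (assumed form of dissimilarity)."""
    pairs = shared(x, y)
    if len(pairs) < min_shared:  # hypothetical cutoff for too little overlap
        return None
    rho = pearson(ranks([a for a, _ in pairs]), ranks([b for _, b in pairs]))
    return None if rho is None else 1 - abs(rho)

def dissimilarity_matrix(profiles, metric):
    """Square matrix of pairwise dissimilarities, rows ordered by protein name."""
    names = sorted(profiles)
    return [[metric(profiles[a], profiles[b]) for b in names] for a in names]

# Toy profiles (made-up values): one list entry per sample, None = not detected.
profiles = {
    "P1": [1.0, 2.0, 3.0, None, 5.0],
    "P2": [2.0, 4.0, 6.0, 8.0, None],
    "P3": [None, None, 9.0, None, None],
}

for row in dissimilarity_matrix(profiles, euclidean_na):
    print(row)
```

In a workflow of this kind, each row of the resulting matrix could serve as the feature vector for one protein in the embedding step. Pairs with too little overlap stay undefined instead of being reported as identical, which is the point of representing absent data as NA.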
This kind of data analysis is an example of pattern recognition, at which human brains can be very adept, whereas computers are functionally more capable of recognizing patterns in large matrices of numbers. Computer algorithms that embed statistical relationships into two- or three-dimensional structures are thus a valuable first step. We found that automated clustering methods were fairly effective for statistically robust data (Figures S4, S5, and 3), but for more difficult clusters, automated methods were less reliable (Figures S6, S7), so it was advantageous to employ the capabilities of the human brain aided by computer graphics.
The human mind's appreciation of shape also comes into play when constructing informative graphics. Networks of clusters with protein-interaction edges convey the amount of phosphorylation and known interactions in a meaningful way, which is much more informative than grids of colored squares adorned with dendrogram trees. Large, complex network graphs can be useful for computer-aided exploration, but rapidly become unwieldy due to their complexity. Simplification of protein interaction edges and filtering nodes made graphs more accessible.
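The simplification described above can be sketched in a few lines. This Python example is illustrative only and is not the RCytoscape code used in this study: it collapses parallel edges between the same pair of nodes into a single edge and then keeps only edges among a chosen set of nodes. Summing the weights is an assumed merging rule, and the node names, source labels, and weights are made-up placeholders.

```python
from collections import defaultdict

def merge_edges(edges):
    """Collapse parallel edges between two nodes into one edge.

    `edges` holds (node_a, node_b, source, weight) tuples. Edges are treated
    as undirected, and weights are summed (an assumed merging rule).
    """
    merged = defaultdict(lambda: {"weight": 0.0, "sources": set()})
    for a, b, source, weight in edges:
        key = tuple(sorted((a, b)))  # same key for A-B and B-A
        merged[key]["weight"] += weight
        merged[key]["sources"].add(source)
    return dict(merged)

def filter_nodes(merged, keep):
    """Keep only edges whose two endpoints are both in `keep`."""
    return {pair: info for pair, info in merged.items()
            if pair[0] in keep and pair[1] in keep}

# Placeholder edges with made-up weights from two evidence sources.
edges = [
    ("A", "B", "STRING", 0.5),
    ("B", "A", "GeneMANIA", 0.25),
    ("A", "C", "STRING", 0.5),
]

merged = merge_edges(edges)
cluster_only = filter_nodes(merged, keep={"A", "B"})
print(merged)
print(cluster_only)
```

Keeping the contributing sources alongside the merged weight preserves which databases support an edge, so a summary graph can still be traced back to its evidence.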
Individual cancerous tumors typically express different combinations of active tyrosine kinases, including multiple receptor tyrosine kinases, which makes it difficult to sort out relationships between signaling pathways for targeted therapy. These analyses provide new insights into mechanisms whereby different combinations of tyrosine kinases may delineate distinct divisions of labor that induce cell proliferation and avoidance of apoptosis and, in many cases, promote metastasis. The data-driven clusters suggest potential links between several different cancer driver RTKs, SRC-family kinases (SFKs), RTK-SFK pairs, and proteins that have not previously been characterized.
GO terms enriched in clusters were not randomly distributed; rather, there were themes that suggest roles in cell proliferation, differentiation, adhesion and migration, as well as strong links to different metabolic processes such as nucleic acid or carbohydrate biosynthesis, RNA processing, DNA replication, and chromatin structure (GO Summary Tables, Information S1). That different groups were associated with different biological processes further validates the clustering technique, and suggests that proteins were activated by distinct pathways or processes in different tumor samples. While a detailed examination of all the clusters identified from these data was beyond the scope of this paper, the cluster membership and GO summary tables provide a starting point for further investigation. Identification of these new clusters provides a rich source of information to formulate hypotheses for further experiments and predict more effective therapies involving combinations of drugs.
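For readers unfamiliar with how enrichment of a GO term in a cluster is commonly assessed, the sketch below computes a one-sided hypergeometric p-value in Python. This is an assumption about a typical approach, not a description of the enrichment statistics used to produce the GO summary tables; the function name and the numbers in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, cluster_size, hits):
    """Probability of seeing at least `hits` annotated proteins in a cluster.

    population   : number of proteins in the background set
    annotated    : background proteins carrying the GO term
    cluster_size : number of proteins in the cluster
    hits         : cluster proteins carrying the GO term
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20 background proteins, 5 carry the term,
# and 3 of the 4 proteins in a cluster carry it.
p_value = hypergeom_enrichment_p(population=20, annotated=5, cluster_size=4, hits=3)
print(p_value)
```

When many terms and clusters are tested, p-values of this kind would normally be corrected for multiple testing before being interpreted.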
Many RTKs shown to be tyrosine phosphorylated in this data set have been identified by other studies to be activated by different mechanisms, for example, INSR, MET, EPHA2, PDGFRA/B, FGFR1, and ALK. The presence of LCK and LYN in clusters containing proteins commonly phosphorylated in lung cancer suggests potential pathways of signal transduction. These are of particular interest in light of studies that justify the use of SFK inhibitors, or a combination of SFK and RTK inhibitors, to treat lung cancer. SFKs associate with RTKs, play a role in transducing their signals, and can phosphorylate RTKs directly, in some cases at sites that mimic those phosphorylated during ligand-induced receptor activation.
The results presented here expand the list of RTKs that potentially collaborate with MET in lung cancer to include EPHA2, ERBB2 (HER2), ERBB3 (HER3), and AXL. MET amplification in lung cancer has recently been shown to be associated with activation of EGFR, ERBB2, ERBB3, and RET. Co-immunoprecipitation of these RTKs with MET suggests that trans-activation of RTKs can occur through hetero-dimerization. Recently, the RTK AXL has been found to have a key role in determining lung cancer chemosensitivity. Tyrosine phosphorylation of AXL was detected concomitant with that of MET, ERBB2, and EPHA2 in a number of samples, as indicated by the cluster containing these proteins.
DDR1, which was itself highly tyrosine phosphorylated in the data analyzed here, clustered with EGFR and LYN. DDR1 was unknown as a cancer driver at the time Rikova et al. was published; yet this RTK is now known to be a cancer driver that promotes cell survival through Notch1. Recently, DDR2 has been shown to exhibit elevated mRNA levels in NSCLC samples. Co-activation of MET, AXL, ERBB2, and EPHA2, and co-activation of DDR1 with EGFR, DDR2, HCK, PDGFRA, and FGR, provide evidence that simultaneous activation of multiple tyrosine kinases may be common in lung cancer. The frequency with which tyrosine phosphorylated driver kinases are detected may suggest priorities for therapies that employ combinations of specific kinase inhibitors, as well as new avenues for research and drug development. Thus, assays for activation of sets of particular kinases in individual tumors may be broadly applicable for indicating appropriate drugs for cancer therapy in the lung and other tissues.
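The idea of ranking kinase combinations by how often they are detected together can be sketched as a simple tally. The Python example below is illustrative only: the function name, sample labels, and kinase placeholders (K1 to K3) are hypothetical and do not reflect the data analyzed here.

```python
from collections import Counter
from itertools import combinations

def co_detection_counts(samples):
    """Count, for every pair of kinases, the samples in which both were detected.

    `samples` maps a sample label to the set of kinases detected as
    tyrosine phosphorylated in that sample.
    """
    counts = Counter()
    for detected in samples.values():
        for pair in combinations(sorted(detected), 2):
            counts[pair] += 1
    return counts

# Placeholder samples with made-up detections.
samples = {
    "tumor_1": {"K1", "K2", "K3"},
    "tumor_2": {"K1", "K2"},
    "tumor_3": {"K2", "K3"},
}

counts = co_detection_counts(samples)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Sorting each sample's kinases before pairing ensures that a given pair is always counted under the same key, regardless of the order in which the kinases were recorded.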
A major challenge for both basic research and cancer therapy is to identify critical signal transduction pathways governing cell fate decisions for specific cell types. The clusters identified here from lung cancer phosphoproteomic data, combined with network and GO analysis, suggest that RTK and SFK pathways have some degree of compartmentalization and functional specialization, and will hopefully guide further research and investment of resources to develop drugs targeted to specific proteins or pathways for cancer therapy.
The novel approaches for clustering sparse phosphoproteomic data described here can enhance resolution of complex data sets, which is an important step towards comprehension of molecular signaling networks in cancer. Our results are consistent with those of Naegle et al., who showed that no single clustering algorithm is sufficient to produce results with biological meaning, and therefore combining and filtering, or wrangling data, and employing external information such as that from protein-protein interaction and GO databases, are crucial for elucidating interesting relationships in the data.