We explore how clusterMaker and Cytoscape might be used together by presenting three example research scenarios. Our focus is on the computational tools rather than on the specific data; the scenarios are based on previously published studies and the results are not meant to represent novel findings. It is also the case that both clusterMaker and Cytoscape are relatively sophisticated tools, with many features that may require some effort to fully master. Our intent is not to illustrate all of the features available in these tools, but rather to provide examples of how they can be applied to gain insight into scientific problems.
Scenario 1: Analysis of Protein Expression Data
A principal goal of gene expression cluster analysis is to identify biologically meaningful groups of co-expressed genes or samples (i.e. transcriptomes) from potentially large data sets. Although downstream analysis of co-expression clusters typically involves exploration of enriched functional groups (e.g., using DAVID [41
] or BiNGO [42
]), another powerful analytical approach is to examine clusters for corresponding molecular interactions. Cluster analysis of data sets that integrate interaction and expression data can identify biomolecular networks with common expression patterns in a single step, and reveal both known and unexpected pathways [43
Hierarchical clustering builds a tree that hierarchically connects every data point [1
], but it does not automatically identify discrete clusters without the use of a tree cutting method (e.g. [44
]). Depending upon the goals of the researcher, it may be desirable to identify discrete clusters from large data sets, especially for functional enrichment and biomolecular pathway analysis.
By contrast, AutoSOME identifies both discrete and fuzzy clusters from large data sets without prior knowledge of cluster number [34
]. The latter feature is useful for exploring transcriptome clusters, for example, to show how different clusters of diverse transcriptomes relate to one another. In the following protocol, application of AutoSOME and hierarchical clustering to a combined protein interactome and gene expression data set is demonstrated, along with an anecdotal downstream analysis.
Scenario 1 Data sources
A mouse protein interactome (SVM-network) was downloaded from the MppDB website (http://bio.scu.edu.cn/mppi/
]. This network is a product of extensive literature mining, prior knowledge of co-expressed genes and interacting domains, and other measures of functional and contextual relatedness. To integrate gene expression data, a whole-genome microarray data set representing diverse mouse cells and tissues [46
] was downloaded from the Gene Expression Omnibus as a Series Matrix file (GSE10246; http://www.ncbi.nlm.nih.gov/geo/
). This microarray data set contains 182 arrays (91 in duplicate) and 45,101 gene probes.
Scenario 1 Protocol
After mapping of UniProtKB identifiers to official gene symbols using DAVID [41
] and removal of duplicate edges, the mouse interactome was imported into Cytoscape. This network consists of 3,347 proteins and 13,088 non-redundant interactions. The GSE10246 expression array was pre-processed by mapping probe set identifiers to gene symbols and taking the highest expressed probe for each gene symbol. The resulting data were log2
-scaled, and all genes were median-centered. These two normalization steps are generally recommended when using AutoSOME clustering, and can also be performed using the AutoSOME implementation within clusterMaker
. Of the 21,864 unique genes in the expression data set, 3,049 genes were successfully mapped to the interaction network when imported into Cytoscape (Additional File 1
Initially, the expression data were clustered hierarchically using the pairwise centroid linkage method and the uncentered correlation distance metric. All 182 array sources (i.e. transcriptomes) were used as input, and nodes without data were ignored. A heat map of the gene co-expression cluster results was rendered as a tree view with the yellow-cyan color scheme (Figure ).
Next, the same expression data were clustered with clusterMaker's AutoSOME implementation using Running Mode = Normal, P-value Threshold = 0.1, 50 Ensemble Runs, and Sum of Squares = 1 normalization (both genes and arrays) and the results were rendered as a yellow-cyan heat map (Figure ). Of 34 clusters and 14 singletons, 97% of all analyzed genes (2,958/3,049) fall into the largest 15 clusters. To map cluster results to the mouse interactome, a new network was created with inter-cluster edges removed (Figure ). In addition, AutoSOME fuzzy clustering was performed on all 182 arrays. Clustering was performed using Distance Metric = Uncentered Correlation, Running Mode = Normal, P-value Threshold = 0.05, 50 Ensemble Runs, and Sum of Squares = 1 normalization (both genes and arrays) identifying 16 fuzzy clusters. After setting the maximum number of edges to display in the fuzzy network to 4,000, 'Network' was selected in the Data Output section, and the fuzzy clusters were rendered by pressing 'Display' (Figure ). For increased legibility, the node and font sizes in Figure were enlarged using VizMapper, a core Cytoscape component that allows for the creation and editing of network visual styles.
Scenario 1 Results
Initially, 3,049 genes from the multi-tissue mouse microarray data set (GSE10246) were hierarchically clustered, and the resulting expression tree was rendered as a heat map (Figure ). Though complex gene co-expression patterns are evident in Figure , it is not immediately obvious how to parse the dendrogram into discrete clusters for further analysis. By contrast, AutoSOME identified 34 discrete co-expression clusters and 14 singleton genes (Figure ). These clusters partition the mouse protein interaction network into 148 subnetworks and 1,432 singleton proteins (Figure ). Composed of 42% of all proteins in the analyzed interactome (1,282/3,049), the ten largest subnetworks are indicated in Figure and their corresponding co-expression clusters are labeled in the heat map of Figure .
Downstream analysis of the ten largest subnetworks (Figure ) using DAVID revealed highly significant functional enrichments for all but one subnetwork (Table ). Subnetwork 1 is highly enriched in genes involved in endoderm and mesoderm differentiation pathways, important for diverse organs, subnetwork 2 genes are robustly associated with immune system functions, subnetwork 3 genes are highly enriched in neuronal processes, and subnetwork 4 genes in cell cycle activities (Table ). To illustrate modularity in gene expression, expression levels for representative cells/tissues were mapped onto the ten largest networks using Cytoscape's VizMapper. As shown in Figure , subnetwork expression profiles clearly distinguish the four selected cell/tissue types. Further, the results of the functional enrichment analysis strongly correlate with patterns of up- and down-regulation. For example, of the four cell/tissue types, subnetwork 3 is only up-regulated in the cerebral cortex sample, consistent with this subnetwork's enrichment in neuronal activity (Table ).
Enriched functional categories according to DAVID analysis, related to Figure 2C.
Finally, in addition to gene co-expression analysis, AutoSOME fuzzy clustering was performed on the 182 transcriptomes, and the 16 resulting clusters are illustrated in Figure . Along with discrete clusters denoted by different colored nodes, the fuzzy network shows how individual clusters and their constituents relate to one another. For example, as shown in Figure , mast cells are more closely related to dendritic cells than macrophages, and neuro2a cells (neuroblastoma cells) are more like embryonic stem cells than cerebral cells. Such fuzzy cluster networks provide an alternative to the conventional hierarchical method for exploring intra- and inter-cluster relationships.
Scenario 2: Identification of Protein Complexes
There are several challenges to finding complexes within a protein-protein interaction data set with clustering. Experimental sources of protein-protein interaction data include yeast two-hybrid (Y2H) [47
] and split-ubiquitin [49
] approaches, high-throughput mass spectrometric protein complex identification (HMS-PCI) [50
] and tandem affinity purification followed by mass spectrometry (TAP-MS) [35
]. Due to the multiplicity of approaches and the varying degrees of false positives and false negatives, it is difficult to have a high confidence in any particular cluster result. One approach to increasing confidence in the results of a clustering algorithm is to use additional independent data to corroborate the cluster selections. Besides increasing confidence in the clusters, combining data of different types and sources can provide additional insight into biomolecular interactions, regulatory mechanisms, and pathways. For example, combining putative protein-protein complex information with gene expression data can provide clues as to the role of individual proteins within a complex. For instance, differential expression in response to various stimuli might indicate a regulatory role for one or more of the proteins.
Scenario 2 Data sources
A high-quality protein-protein interaction data set published in 2007 [52
] forms the core network for this analysis. This data set combines three previously published high-throughput protein interaction data sets [35
] to increase the quality and coverage of the resulting interaction network. The authors assigned a Purification Enrichment (PE) score to reflect the quality of interactions within the combined set.
Two yeast epistatic miniarray profiles (EMAPs) were also used: chromosome biology [53
] and RNA processing [54
] to provide genetic interaction data as a complement to the protein-protein interaction data set.
Scenario 2 Protocol
The combined protein-protein interaction data set was imported into Cytoscape with a PE cutoff of 1.85, which corresponds to the scaled value of 0.20 used by the authors in the original data set [52
]. The result is a network with 2742 genes and 16,218 interactions. The PE score was imported as an edge attribute and in addition to the gene symbol, the systematic name was imported as a node attribute (Additional File 2
The initial network was clustered using clusterMaker's MCL implementation with the following settings: Granularity parameter = 1.8, Array sources = PE Score; and MCL Advanced Settings of: Weak edge weight pruning threshold = 1 × 10-20, maximum residual value = 1 × 10-6, and iterations = 16. MCL's iterations are not uniform, and in this example, iterations 3 and 4 take significantly more time than the other iterations. The resulting network contains 408 clusters, with the largest consisting of 254 nodes. The nodes were colored according to the cluster assignment (Figure ).
Figure 3 Clustering of yeast protein-protein interaction networks in the context of overlapping yeast genetic interaction data reveals possible pathway interactions between three well-known complexes. (A) The overall results of MCL clustering of the Collins et (more ...)
The EMAPs were converted into tab-delimited text files from the original Cluster (.cdt) format files with the strength of interaction imported as an edge attribute. Each EMAP was then clustered hierarchically with pairwise average linkage and uncentered correlation as the distance metric using the imported strength of interaction. The resulting clusters were shown in clusterMaker's tree view with the yellow-cyan color scheme used by convention for EMAPs (Figure and ). clusterMaker links selection of all heat map windows with the current network, facilitating interactive exploration and comparison of the clusters across the data sets.
Scenario 2 Results
To explore the putative complexes derived from combining the physical interactions with genetic data, we chose the cluster formed by GIM3, GIM4, GIM5, PAC10, YKE2, and PFD, which represents the prefoldin complex.
Figure shows the cluster results, with the prefoldin cluster shown in the lower right. These nodes also cluster well in all of the EMAPs where they appear, particularly in the chromosome biology (Figure inset) and RNA processing (Figure inset) EMAPs. In each case, the interaction in the EMAPs is epistatic, which indicates that each of the pairwise double-deletion mutants grows better than might be expected given the growth rate of each single deletion mutant. An epistatic interaction is evidence that the two proteins are part of the same pathway, and the tight clustering strongly suggests that they are in the same complex. Given the results of the clustering and the strong corroboration from the genetic interaction data, it is clear that these proteins are part of the same complex. Each of these proteins is annotated in the Saccharomyces
Genome Database (SGD) (http://www.yeastgenome.org
) as part of the prefoldin complex, consistent with these results.
While this result is only confirmatory and does not provide any new knowledge about prefoldin, it is interesting to explore genetic interactions between the prefoldin complex and other putative complexes. For example, in the chromosome biology EMAP, the genes in the prefoldin complex all show a strongly negative (aggravating) genetic interaction with the genes in the SWR1 chromatin remodeling complex (APR6, SWC3, SWR1, VPS72, VPS71, SWC5, and YAF9) and a positive (epistatic) interaction with the genes in the SET1/COMPASS complex (BRE2, SWD1, and SWD3) (Figure ). Both SET1 and SWR1 are involved in various aspects of chromatin biology. SET1 catalyzes methylation of histone H3 and the SWR1 complex is required for the incorporation of histone variant H2AZ into chromatin. It is interesting to speculate on why SET1 and SWR1 should have opposite genetic interactions with prefoldin. This might relate to the eukaryotic specialization of prefoldin for the correct tubular assembly of actin and related tubular proteins, which are required for cell division. A role in cell division is consistent with one additional negative genetic interaction between the genes in the prefoldin complex and several of the genes involved in kinetochore-microtubule interactions (e.g. MCM16, MCM21) and tubulin folding (CIN1, CIN2, CIN4).
Scenario 3: Protein Similarity
More than 40% of all known proteins lack any annotations within public databases [55
]. As a result, millions of proteins are completely uncharacterized except for sequence and (possibly) predicted domain architectures. A number of approaches have been proposed for classifying proteins by function[7
], and clusterMaker
provides several algorithms well-suited for clustering proteins based on some similarity metric such as BLAST [56
]. While sequence clustering approaches do not provide a definitive categorization of proteins, these approaches can be extremely useful as initial steps in an overall curation pipeline. clusterMaker
allows researchers to rapidly cluster data sets and visualize the results. By mapping protein function annotations to visible node properties, the curator may immediately discern clusters that include both unknowns and functionally characterized proteins. The availability of multiple clustering algorithms allows the curator to assign a greater confidence to those predictions that appear consistently across multiple clustering outputs. This approach can significantly reduce the overall curation timeline, particularly in the early stages before other approaches such as hidden Markov models (HMMs) are applicable.
Scenario 3 Data sources
The Structure-Function-Linkage Database (SFLD) is a gold-standard resource tool linking sequence information from mechanistically diverse enzyme superfamilies to their characterized structural and functional properties [57
]. The SFLD provides a three-level classification for proteins: superfamily - evolutionarily related proteins that catalyze the same partial reaction, family - proteins within a superfamily that catalyze the same overall reaction, and subgroup - a mid-level classification containing multiple families with shared functional residue motifs.
From the superfamilies present in the SLFD, we chose to cluster the vicinal oxygen chelate (VOC) superfamily, a group of metal-dependent enzymes that share a common fold motif and catalyze a variety of reactions [58
]. It is difficult to discriminate specific functions (overall reaction catalyzed and thus family membership) within this superfamily due to multiple, perhaps serial permutations and other rearrangements in its evolutionary history [59
]. The VOC superfamily data set is composed of 683 protein sequences, partially classified among seven subgroups and 17 families. Less than half of these sequences included both a family and subgroup classification, and 224 sequences contained a subgroup classification but not a family classification. The remaining 168 sequences were completely uncharacterized.
Scenario 3 Protocol
The SFLD VOC superfamily was loaded into Cytoscape through the SFLDLoader plugin with an e-value cutoff of 1e-1
(Additional File 5
). Nodes in the network represent individual proteins, with family and subgroup classifications already specified among the properties of the nodes. Edges in the network represent protein similarities based on the BLAST e-values of each pairwise sequence alignment.
was used to select a cutoff based on properties of the network edge weight distribution (Figure ). This cutoff selection heuristic has been shown to improve the accuracy of clustering a protein similarity network into families [39
]. Using the -LOG(value) edge weight conversion, a heuristically determined cutoff of 6.0 was used for all clustering runs.
Figure 4 Protein similarity network clustering indicates possible family membership for uncharacterized proteins. (A) A distribution of edge weights (binned -log(BLAST E-values)) of the VOC superfamily is shown, with a cutoff value of 5.5 indicated by a red vertical (more ...)
MCL, TransClust and SCPS clustering were performed on the VOC protein similarity network. Default parameters were used except that the MCL granularity parameter was set to 2.5. Clustering outputs were visualized by coloring the nodes based on known family assignments (where available), allowing rapid identification of clusters composed of characterized members of a single family plus uncharacterized nodes. Such uncharacterized nodes are potential members of the co-clustered family.
Scenario 3 Results
MCL generated 26 clusters and TransClust generated 28 clusters. These numbers adequately approximate the presence of 17 distinct families in 50% of the VOC data set. SCPS, on the other hand, generated only three clusters, which indicates an overabundance of false positives in the SCPS clustering data. Therefore, further analyses focused only on the MCL and TransClust clustering results. As shown in Figure , these results are dominated by uncharacterized proteins (colored red in the figure). Certain clusters are composed entirely of uncharacterized proteins, while other clusters are composed of uncharacterized proteins as well as two or more known families. The most interesting clusters contain just two colors, representing the grouping of uncharacterized proteins with a single VOC family. These clusters allow us to hypothesize the functions of the uncharacterized proteins.
Three such single-family clusters are present in almost equal measures across both the TransClust and MCL results (Figure ), one of which is the methylmalonyl-CoA epimerase subgroup of 50 proteins (arrow in Figure ). This includes the nine characterized members of the methylmalonyl-CoA epimerase family and 41 sequences that lack a family classification in the SFLD, although they are annotated to be in the same subgroup. The size of the cluster is 52 in the TransClust results and 53 in the MCL results. The additional few nodes represent sequences lacking a subgroup classification and that appear in both the TransClust and MCL results, suggesting that putatively assigning these to the methylmalonyl-CoA epimerase subgroup would be reasonable.
In an effort to seek additional evidence of family and subgroup membership, we explored in some detail a randomly chosen uncharacterized protein on the periphery of the methylmalonyl-CoA epimerase cluster (see Figure ). The hypothetical (predicted) protein BH2212 from Bacillus halodurans (gi:15614775) lacks both a family and subgroup assignment. We aligned its sequence with that of methylmalonyl-CoA epimerase from Propionibacterium shermanii (gi:15826388). Four of the five functionally critical active site residues align perfectly with the uncharacterized sequence. These four residues bind the active-site metal ion needed for catalysis. In the initial alignment, the fifth residue, a glutamic acid that abstracts a proton from the substrate, is shifted by one position, but minor editing can also align this residue without degrading the rest of the alignment. Thus, the unknown protein is likely capable of binding the active site metal and may also perform the epimerization of (2R)-methylamonyl-CoA. The next step in functional annotation of this sequence would be to compare it to the hidden Markov models (HMMs) used to characterize the methylmalonyl-CoA epimerase family and subgroup in the SFLD or experimentally verify the function of the protein. These further analyses are beyond the scope of this paper.