A key step in analyzing most large-scale gene expression studies is clustering or otherwise grouping gene expression data vectors and conditions (individual RNA samples or replicates) into sets that contain members more similar to each other than to the remainder of the data. To do this, biologists now have at their disposal a wide range of computational techniques including supervised and unsupervised machine learning algorithms and various heuristics, such as
k-means, phylogenic-like hierarchical ordering and clustering, Expectation Maximization of Mixture models, self organizing maps, support vector machines, Fourier analysis and more (
1–
6). Their purpose in all cases is to detect underlying relationships in the data, but different algorithms applied to a given dataset typically deliver only partly concordant results. As we show below, it is common to find 20–40% of genes from a high-quality dataset classified differently by two algorithms. These differences can be quite meaningful for a first-pass analysis, in which candidate genes will be selected based on their expression pattern for further detailed study. But clustering classifications are also increasingly important, not as results on their own, but as a key preprocessed input to higher level integrative modeling, such as gene network inference. Clustering results are also becoming important as gene annotations for interpreting entirely different kinds of data. For example, classification of genes as being ‘cell cycle regulated in G
1 phase’ has become part of major databases based on a specific clustering. If such annotations are uncertain or simply incorrect, the uncertainty or errors then ramify through future uses of the data.
The sources of difference between clustering algorithm outputs are many and varied, and the biological implications are also diverse, as illustrated below for cell cycle data. The general challenge is to detect, measure, evaluate and mine the commonalities and differences. Specifically, the biologist usually wants to first know whether one clustering is objectively and significantly ‘better’ than another, and just how big the difference is. If two clusterings are of similar overall quality, yet differ substantially from each other as is often observed, then what specific gene cluster or samples harbor the greatest differences, or are they evenly distributed across the data? At a finer level still, which genes are being assigned to different clusters and why? Importantly, do the distinctions between clusterings highlight properties of biological importance?
To begin to answer such questions, we first needed a way to make systematic quantitative comparisons and then we needed ways to effectively mine the resulting comparisons. We use confusion matrices as the common tool for these comparisons (see below and Methods). A confusion matrix effectively summarizes pairwise intersections between clusters derived from two clustering results. These similarities are quantified by applying scoring functions to the confusion matrix. In this work, we use two different scoring functions for this purpose: (i) normalized mutual information (NMI), which measures the amount of information shared between the two clustering results (
7) and; (ii) a linear assignment (LA) method, which quantifies the similarity of two clusterings by finding the optimal pairing of clusters between two clustering results and measuring the degree of agreement across this pairing (
8,
9). Previous studies have used metrics for evaluating the total number of data point pairs grouped together between two different clusterings to begin to address the need for quantifying overall differences (
10–
13). Ben-Hur
et al. (
13) used this to help determine an optimal number of clusters (
K) and to assess the overall validity of a clustering. These prior techniques did not, however, offer the capacity to isolate and inspect the similarities and differences between two different clusterings, nor did they provide an interactive interface for biology users that would permit them to usefully capture the comparative differences and similarities. We also introduce a new application of receiver operator characteristic (ROC) analysis (
14,
15). As we use it here, ROC enables one to quantify the distinctness of a given cluster relative to another cluster or relative to all non-cluster members. Implemented in this fashion, ROC provides another measure of local cluster quality and shape, and provides another tool for quantitatively dissecting a cluster. Though the methods and tools were worked out for clusterings of large-scale gene expression data, they are applicable to clusterings of other kinds of large-scale data as well.
We have integrated the algorithms and comparative tools into an interactive analysis package collectively called CompClust 1.0. CompClust enables a user to organize, interrogate and visualize the comparisons. In addition to comparative cluster analysis, an important feature of this software is that it establishes and maintains links between the outputs of clustering analyses and the primary expression data, and, critically, with all other desired annotations. In the sense used here, ‘annotations’ include other kinds of primary and metadata of diverse types. This gives a biologist crucial flexibility in data mining and permits analyses that integrate results from other kinds of experiments, such as global protein–DNA interactions (ChIP/Array), protein–protein interactions, comparative genome analysis or information from gene ontologies.
CompClust methods and tools are agnostic about the kinds of microarray data (ratiometric, Affymetrix, etc.) and types of clustering algorithms used. We demonstrate the tools by analyzing two different sets of yeast cell cycle expression data representing both major data platforms, clustered by four different methods: a statistical clustering algorithm [Expectation Maximization of a Mixture of Diagonal Gaussian distributions (EM MoDGs)] (this work), a human-driven heuristic (
1), a Fourier transform algorithm designed to take advantage of a periodic time-course patterns (
16) and an agglomerative version of the Xclust phylogenetic ordering algorithm [Eisen
et al. (
2) modified in this work]. We show that gene groups derived from these comparative analyses can be integrated with data on evolutionarily conserved transcription factor binding sites to infer regulatory modules. These results begin to illustrate how a more quantitative and nuanced understanding of both global and local features in the data can be achieved, and how these can be linked with diverse kinds of data types to infer connectivity between regulators and their target gene modules.