The high dimensionality and exploratory nature of microarray data analysis has led to the application of several unsupervised data clustering techniques to aid in the visualization of gene expression data. Three popular methods are hierarchical clustering (HC) [

1], k-means [

2], and self-organizing maps (SOM) [

3] (others have previously compared these systems [

4]). Implementations of these methods can be found in TreeView [

1], Cluster [

5], and GENECLUSTER [

3]; in more general statistical tools such as R [

6]; or on-line, as part of public microarray repositories such as NCBI's Gene Expression Omnibus (GEO) [

7]. Among these, the most popular is hierarchical clustering (HC), which builds a tree by recursively combining the two most similar objects.

In this work, we describe a scheme primarily for visualization based on minimum spanning trees. Unlike the clustering methods described above, a minimum spanning tree (MST) is more naturally defined as a graph structure. As motivation for our work, we generalize the types of microarray clustering and visualization tools available to researchers in Figure . We classify the methods into three broad categories. Clustering schemes such as k-means and self-organizing maps separate the objects under consideration (i.e., experiments or probes) into independent clusters (Figure (left)). An example of this would be a data set of tumor and normal tissue samples from patients. Hierarchical clustering produces a binary tree called a dendrogram, as shown in Figure (center). This technique is useful if the relationship between samples resemble a hierarchical structure. The last example in the figure presents a minimum spanning tree. Unlike the dendrogram, the graph is unrooted and hierarchical relationships are not necessarily implied. While others have used MSTs for clustering (as we outline below), we believe that such a method would also be ideal for visualizing data with a gradient structure (such as time-series data). Experiments based on cell differentiation would also benefit since MSTs permit direct associations between samples from mixed levels of differentiation without forcing every sample to be at the same level (as hierarchical clustering would). This paper describes our techniques for achieving this, as well as a system called "HAMSTER" (Helpful Abstraction using Minimum Spanning Trees for Expression Relations) which embodies our ideas. Our main aim is to depict a microarray data set as a

*set *of MSTs, instead of a single MST. These MSTs are individually scored and ranked to aid users. HAMSTER is released as open source under the GNU General Public License (GPL). In addition, we also provide a web server version called HAMSTER

^{+ }[

8] that encapsulates the local version within a set of tools for selecting parts of the data set. This web server allows users to evaluate HAMSTER without performing any local software installation. Both systems are covered in this paper, but our focus is on the local version, which we describe by first comparing and contrasting MSTs with a related technique, hierarchical clustering.

Hierarchical Clustering (HC)

A microarray data set can be represented as a two-dimensional data table of *n *experiments and *m *probes. Even though we focus on visualizing experiments, the methods we describe are equally applicable to probes. Of course, issues such as computation time may be significantly different since the number of probes is usually many times larger than the number of available experiments.

Hierarchical clustering (HC) forms a tree called a dendrogram, in either a bottom-up (agglomerative) or top-down (divisive) fashion. In bottom-up construction, each experiment is initialized as being in its own cluster and these clusters form the leaves of the dendrogram. Recursive pairing of clusters grows the dendrogram upwards, until only a single cluster remains. Each merge step adds an internal node to the dendrogram.

The calculations performed by HC requires two steps. The first is to calculate the distance matrix *d *which assigns the level of dissimilarity between every pair of experiments. Some common metrics that are used include Euclidean distance and Pearson correlation coefficient (converted to a distance by subtracting from 1). We illustrate this with a sample data set in Table and the corresponding distance matrix *d *as Figure using Euclidean distance. The distance matrix is calculated once and thereafter never updated. In the second step, as clusters are formed, the dissimilarity between two clusters of experiments is calculated by the linkage type chosen by the user. Three common linkages are single, average, and complete, which represent the shortest distance, average distance, and longest distance between pairs of experiments (from different clusters).

Applying HC to the data set of Table using single linkage is shown in Figure , including the final dendrogram (bottom-right). In every case, the four experiments are shown along the bottom and internal nodes are indicated as smaller circles. In a dendrogram, every experiment is connected to exactly one internal node.

One important property of dendrograms which we will return to later is that the two branches of each internal node can be flipped without any lost in the definition of hierarchical clustering. This gives flexibility in the order in which experiments are shown, but it also can cause problems since a dendrogram could imply an order which is not present. In our example, the positions of nodes A and B can be inverted and the dendrogram would still represent the data of Table .

Minimum Spanning Trees (MSTs)

A graph is a concept in computer science which represents information as a set of nodes and edges, such that each edge indicates some relationship between its two associated nodes (for our purposes, we disallow edges which connect a node to itself). An undirected graph

*G*(

*V, E*) is composed of a set of vertices

*V *and a set of edges

*E*, with no direction on any of the edges. A

*spanning tree *for

*G *is a connected graph with the same vertices but only |

*V*| - 1 edges and no cycles. In a

*connected *graph, each node can be reached from any other node, by following a series of edges. If the edges in the original graph are weighted, a spanning tree whose edges have the minimal total weight is called a

*minimum spanning tree *(MST). We denote a minimum spanning tree of

*G *as

*G*_{M }(

*V*,

*E*_{M}), such that

*E*_{M } *E*.

Several algorithms exist for calculating MSTs, including Prim's algorithm [

9] and Kruskal's algorithm [

10]. Prim's algorithm starts from an arbitrary node and extends the MST a node at a time in a greedy manner by selecting neighboring edges with the lowest weight. Instead of adding nodes, Kruskal's algorithm adds edges. It organizes the dissimilarity matrix

*d *as a sorted list and adds edges, starting from the one with the lowest edge weight, if they connect two previously disconnected components. If all edge weights in

*G *are unique, then the MST produced is unique, regardless of the algorithm employed. These algorithms and a description of MSTs are described in books on graph theory [

11,

12].

MST construction refers to the selection of nodes or edges. If this procedure is coupled with the Euclidean distance measure for determining edge weights, then some authors have referred to the MSTs as EMSTs (Euclidean Minimum Spanning Trees) [

13]. We do not make such a restriction since several other dissimilarity metrics are equally important to microarray data.

In this work, we have selected Kruskal's algorithm, which we illustrate by extending our earlier example.

In Figure , we show the distances of Figure in sorted order. The resulting MST is shown in Figure such that each node is an experiment and the lengths of edges indicate the level of dissimilarity between experiments.

Dendrograms and MSTs

In our example, the dendrogram produced from HC appears similar in structure to the MST. As observed by others [

14], this property can be summarized as follows:

**Property 1 ***Bottom-up hierarchical clustering using single linkage generates a dendrogram which is equivalent in structure to a minimum spanning tree generated using Kruskal's algorithm*.

The reason for Property 1 is straightforward. Hierarchical clustering using single linkage recursively chooses the smallest dissimilarity between two clusters. On the other hand, Kruskal's algorithm sorts the edge weights in decreasing order and adds edges to *G *only if they connect two separated clusters. So, if the dendrogram from hierarchical clustering is treated as a more general graph, the connections added to both graph representations are the same. If there were identical dissimilarities (edge weights), additional tie-breaking rules (based, for example, on node labels) would be required.

Basically, Property 1 implies that the same groups of nodes are connected, whether they are graph components in an MST or sub-trees in a dendrogram. Despite Property 1, dendrograms are *visually *very different from MSTs and it is this difference that HAMSTER aims to leverage.

Some of the differences between dendrograms and MSTs are as follows. A dendrogram has an orientation, so that users can examine it from the root node down to the leaves. MSTs have no obvious starting point and need to be examined as a whole. The most important difference is that dendrograms introduce internal nodes to connect clusters together while MSTs do not. In an MST, experiments are connected directly to each other, allowing users to examine the MST for hubs and neighborhoods. For example, in Figure , it can be seen that experiment B is a central node, in the sense that it is the nearest neighbor of A, C, and D. This point cannot be easily seen from the corresponding dendrogram.

Previous Applications of MSTs

In bioinformatics, MSTs have been used in areas ranging from depicting the sequence repetitions in part of the

*C. elegans *genome [

15] to showing the shapes of disease clusters (e.g., cases of the West Nile virus) on a map [

13]. In the latter case, the authors showed that the disease clusters are allowed to form arbitrary shapes, as determined by distances between objects, instead of circles as implied by Euclidean distance-based methods.

As for microarray data, our application of MSTs is most related to the work on data clustering by researchers at the Oak Ridge National Laboratory/University of Georgia. Initially, they proved that MSTs have the desirable property that all "good" clusters must consist of nodes that form a connected subgraph of the MST [

16]. The sufficient condition of "good" proposed is that if a cluster of nodes from a graph

*G *is split into two non-empty halves

*C*_{1 }and

*C*_{2}, the nearest neighbor of any node in

*G*\

*C*_{1 }is in

*C*_{2}. Their main idea is to construct an MST from the genes in a microarray data set and then apply clustering algorithms on the MST instead of the original data. In essence, the problem of clustering the original data is converted to a simpler tree-partitioning problem. They incorporated their ideas into a system dubbed EXCAVATOR (EXpression data Clustering Analysis and VisualizATiOn Resource) [

17]. Since then, they have also built CUBIC for clustering regulatory binding sites using MSTs where the vertices are

*k*-mers and the dissimilarity between

*k*-mers is calculated using the Hamming distance [

18,

19]. More recently they investigated parallelizing MST construction where the number of vertices is as high as one million [

20]. Varma and Simon [2004] have used MSTs for feature (gene) selection for two-class microarray clustering [

21]. In many microarray-based studies, the number of genes that are up/down-regulated are typically very small. Their aim is to use MSTs to select the genes which are most pertinent to the study. Instead of examining all combinations of genes, they only evaluated the subset of genes obtained by removing a single edge at a time from the MST. Afterwards, the experiments are clustered using hierarchical clustering and this gene subset.

Magwene et al. [2003] presented a method for identifying the time-index of biological samples by combining an MST with a data structure called a PQ-tree [

22]. Assuming that changes in the transcriptome are "smooth and continuous", the samples should form a single path. Deviations from the path (branches) are checked recursively using the PQ-tree.