PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of plosonePLoS OneView this ArticleSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)
 
PLoS One. 2010; 5(12): e14067.
Published online 2010 December 2. doi:  10.1371/journal.pone.0014067
PMCID: PMC2997101

Hierarchical Clustering Using the Arithmetic-Harmonic Cut: Complexity and Experiments

Vladimir Brusic, Editor

Abstract

Clustering, particularly hierarchical clustering, is an important method for understanding and analysing data across a wide variety of knowledge domains with notable utility in systems where the data can be classified in an evolutionary context. This paper introduces a new hierarchical clustering problem defined by a novel objective function we call the arithmetic-harmonic cut. We show that the problem of finding such a cut is An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e001.jpg-hard and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e002.jpg-hard but is fixed-parameter tractable, which indicates that although the problem is unlikely to have a polynomial time algorithm (even for approximation), exact parameterized and local search based techniques may produce workable algorithms. To this end, we implement a memetic algorithm for the problem and demonstrate the effectiveness of the arithmetic-harmonic cut on a number of datasets including a cancer type dataset and a corona virus dataset. We show favorable performance compared to currently used hierarchical clustering techniques such as An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e003.jpg-Means, Graclus and Normalized-Cut. The arithmetic-harmonic cut metric overcoming difficulties other hierarchal methods have in representing both intercluster differences and intracluster similarities.

Introduction

The problem of finding structure in a set of unlabeled data (the so-called clustering problem) appears in various domains of research including bioinformatics, machine learning, image processing and video processing. In the area of bioinformatics, clustering has become increasingly important, as finding genetic subtypes of heterogeneous diseases like breast cancer, ovarian cancer and multiple sclerosis, may be made easier by using suitable clustering methods. This work aims to facilitate this line of research by finding good clusterings of various datasets with known partitions.

The importance of the clustering problem in various areas has given rise to several greedy algorithms such as An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e004.jpg-Means, optimization-based methods such as Normalized-Cut and neural-net based methods amongst others.

In this work, we will use a top-down approach for hierarchical clustering, recursively dividing the elements in the data. In each division step, often a graph partitioning technique is used (a similar approach is used for Normalized-Cut [1]). However, many graph (bi)partitioning problems can be formulated as An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e005.jpg-hard optimization problems, for which there are no polynomial-time algorithms to find the optimal solution unless An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e006.jpg. This is an indication of the difficulty of the clustering problem and the focus of research since the work of Wertheimer [2]. In this work, we propose a new objective function for graph bipartitioning. The motivation for finding a new objective function for graph bipartitioning is that the known bipartitioning methods produce incorrect results for some datasets. For example, two formulations for clustering by graph partitioning are Max-Cut [3] and Normalized-Cut [1]. Max-Cut is already known to provide incorrect results for some datasets [1]. We show that Normalized-Cut also produces some incorrect results for some of the datasets examined.

To achieve a better clustering than agglomerative hierarchical clustering and existing graph partitioning formulations, our proposed objective function seeks to minimize the intra-cluster distances and at the same time it seeks to maximize the inter-cluster distances. Our objective function performs well in clustering diverse types of datasets.

More precisely, we pose the hierarchical clustering problem as a finite number of instances of a graph partitioning problem, called Arithmetic-Harmonic Cut (AH-Cut). In the AH-Cut problem, we start with a distance matrix for a set of objects and compute a weighted graph in which vertices represent objects and edges are weighted by the distance between the corresponding vertices. Our objective function tries to obtain a partition where the weight of the partition is directly proportional to the sum of the weights on the edges between the two partite sets and the sum of the reciprocals of the weights on the edges inside the partite sets. When considered as an optimisation problem, the goal is to maximise the weight of the partition. The recursive application of AH-Cut can then be used to generate a tree-based classification of the data.

As noted many graph bipartioniting problems are An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e007.jpg-hard at least, so a theoretical examination of any proposed clustering problem is necessary to determine whether it constitutes a practical approach to clustering. We give such a classification of AH-Cut and show that although it is An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e008.jpg-hard and hard to approximate, it is fixed-parameter tractable, and therefore still a practical method for clustering.

Related Objective Functions for Hierarchical Clustering

The An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e009.jpg-Means Algorithm

The An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e010.jpg-Means algorithm is one of a group of algorithms called partitioning methods; Given An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e011.jpg objects in a An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e012.jpg-dimensional metric space, we wish to find a partition of the objects into An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e013.jpg groups, or clusters, such that the objects in a cluster are more similar to each other than to objects in different clusters. The value of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e014.jpg may or may not be specified and a clustering criterion, typically the squared-error criterion, must be adopted. The An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e015.jpg-Means algorithm initializes An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e016.jpg clusters by arbitrarily selecting one object to represent each cluster. Each of the remaining objects are assigned to a cluster and the clustering criterion is used to calculate the cluster mean. These means are used as the new cluster points and each object is reassigned to the cluster that it is most similar to. This continues until there is no longer a change when the clusters are recalculated. However, it is well-known that depending on the initial centres of the clusters, clustering results can change significantly. We use Gene Cluster An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e017.jpg [4] for comparing our method with An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e018.jpg-Means.

Max Cut, Ratio Cut and Average Cut

Graph bipartitioning algorithms are also used for clustering [5]. Given a graph An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e019.jpg and perhaps a weighting function An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e020.jpg, a graph bipartitioning problem asks for a partition An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e021.jpg such that some function on the (weights of the) edges between An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e022.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e023.jpg satisfies the given bound, or in the case of an optimisation problem, is optimised. One of the most common formulations is essentially an An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e024.jpg-hard combinatorial problem, called Weighted Max-Cut, which is a simple weighted extension of the Max-Cut problem. If we denote the edges between An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e025.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e026.jpg as An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e027.jpg then the function An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e028.jpg to be optimised in the case of Weighted Max-Cut is:

equation image

Although good algorithms exist for Weighted Max-Cut, Shi and Malik [1] and Wu and Leahy [5] show that (Weighted) Max-Cut 's objective function favours cutting small sets of isolated nodes in the graph. Furthermore, during bipartitioning, sometimes it may also cut small groups and put two parts of the same small group into different partite sets.

Ratio-Cut uses the objective function:

equation image

In this case An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e031.jpg is taken as a similarity metric. Ratio-Cut (and its An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e032.jpg-way extension) has also been employed for image segmentation [6] and circuit partitioning for hierarchical VLSI design [7].

Average-Cut employs the following objective funtion:

equation image

If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e034.jpg is a similarity metric, the the problem becomes a minimisation problem, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e035.jpg expresses distance the goal is maximisation. It turns out that even using the average cut, one cannot simultaneously minimise the inter-cluster similarity while maximizing the similarity within the groups.

Normalized-Cut

In the context of image segmentation, Shi and Malik [1] introduce Normalized-Cut. They use a similarity metric for An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e036.jpg, and thus Normalized-Cut is typically expressed as a minimisation problem with the following objective function:

equation image

It is well-known that by negating weights the Max-Cut problem is equivalent to the corresponding Min-Cut problem where one is supposed to minimise the sum of the weights (given by some similarity measure) between the two partitions (An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e038.jpg) of a set of vertices An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e039.jpg in a graph An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e040.jpg. It is straightforward to see that the same argument holds in case of Normalized-Cut as well, which allows the negation of a distance matrix to be used a similarity matrix, facilitating comparisons for datasets for which only distance matrices are available. However, Shi and Malik [1] start with a Euclidian distance matrix An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e041.jpg and then compute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e042.jpg as their similarity matrix. We use both approaches and demonstrate that the performance of this algorithm varies depending on the dataset and the two similarity measures.

Furthermore, Shi an Malik's [1] implementation relaxes the Normalized-Cut problem into a generalised eigen-value problem by allowing the vertices An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e043.jpg to take real-values, instead of taking values from just the set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e044.jpg where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e045.jpg denotes that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e046.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e047.jpg denotes that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e048.jpg. Then, for bipartitioning, the second smallest eigenvector of the generalized eigen system is the real-valued solution to Normalized-Cut. Finally, they search for the splitting point as follows: first choose An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e049.jpg equidistance splitting points, compute Normalized-Cut's objective value for each of these splits, then choose the one for which Normalized-Cut's objective value is the smallest. In fact, the implementation also allows An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e050.jpg-way Normalized-Cut, Yu and Shi [8] examine this further. It considers the first An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e051.jpg eigen vectors and yields An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e052.jpg partite sets from a discretisation step following it.

Notice that Max-Cut and Ratio-Cut do not cluster by intra-cluster similarity and this results in a poor clustering results for image segmentation in comparison to Normalized-Cut [1]. Therefore, among these three algorithms, we consider only Normalized-Cut for comparison with our algorithm.

Graclus

Graclus [9] implements a multilevel algorithm that directly optimizes various weighted graph clustering objectives, such as the popular ratio cut, normalized cut, etc. This algorithm for multilevel clustering consists of three steps: (a) iteratively merging nodes of the graph (using various criteria of merging) and creating supergraphs with fewer nodes; (b) applying some base-level clustering on the resulting supergraph; and (c) restoring the clusters of original graph iteratively from the clustering of the final supergraph. This algorithm does not use eigenvector computation, and is thus notably faster than existing implementations of normalised and ratio-cuts. However, in most of the examples shown in this paper, Shi and Malik's [1] implementation of Normalized-Cut results in a better clustering than Graclus.

Outline of the Paper

In this paper, after introducing the problem, we first examine the approximability of AH-Cut. In fact, we prove that AH-Cut is An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e053.jpg-hard (and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e054.jpg-complete) via a reduction from the Max-Cut problem, which is already known to be An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e055.jpg-hard [3]. Therefore An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e056.jpg has no polynomial time approximation scheme unless An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e057.jpg. We then demonstrate that AH-Cut is fixed-parameter tractable via a greedy localisation algorithm. Such a complexity analysis provides an indication of what practical methods are suitable for application to the problem. In this case the complexity results indicate that there is unlikely to be a polynomial time algorithm (or even approximation), but that the exponential component of the running time is at worst only a function of a small independent parameter and therefore the problem is likely to still have effective algorithms.

Given the complexity result we use a meta-heuristic approach (namely, a memetic algorithm) for AH-Cut (an outline of which was presented previously [10]). We compare the performance of our algorithm on four diverse types of datasets and compare the results with two recent and highly regarded clustering algorithms: Normalized-Cut; and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e058.jpg-Means. The results indicate that AH-Cut gives a robust and broadly useful hierarchical clustering method.

Preliminaries

Graph Notation and Problem Definition

We consider only simple, undirected graphs, which may or may not be associated with a weight function on the edges. Given a graph An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e059.jpg unless otherwise specified we denote the vertex set of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e060.jpg by An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e061.jpg and the edge set of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e062.jpg by An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e063.jpg. We denote an edge between vertices An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e064.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e065.jpg by An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e066.jpg.

Given a graph An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e067.jpg and two vertex sets An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e068.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e069.jpg we denote the set of edges with one endpoint in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e070.jpg and the other in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e071.jpg by An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e072.jpg. When the graph is clear from context we write An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e073.jpg.

Arithmetic-Harmonic Cut (AH-Cut)

Instance: A graph An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e074.jpg, two positive integers An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e075.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e076.jpg and a weight function An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e077.jpg.

Question: Is there a partition of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e078.jpg into two sets An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e079.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e080.jpg such that

equation image

Given a graph An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e082.jpg and two disjoint vertex sets An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e083.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e084.jpg, for convenience we denote

equation image

by

equation image

The optimisation verion of AH-Cut is identical except that we maximise the function An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e087.jpg.

Approximation and Complexity

If a maximisation problem An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e088.jpg with objective function An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e089.jpg has an polynomial time algorithm which given an instance An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e090.jpg with optimal solution An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e091.jpg guarantees a solution An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e092.jpg where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e093.jpg for some An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e094.jpg then we say An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e095.jpg has a constant factor approximation algorithm. If there is an algorithm that guarantees such a bound for every An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e096.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e097.jpg has a polynomial time approximation scheme (ptas). An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e098.jpg is the class of all problems which have constant factor approximation algorithms. If a problem An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e099.jpg is An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e100.jpg-hard, then An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e101.jpg has no ptas unless An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e102.jpg.

We refer to Ausiello et al. [11] for further reading.

Parameterized Complexity

A parameterized problem is a (decision) problem equipped with an additional input called the parameter. Typically the parameter will numeric and should be independent of the size of the instance and relatively small. A problem An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e103.jpg is fixed-parameter tractable if there is an algorithm that solves the problem in time bounded by An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e104.jpg where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e105.jpg is the parameter, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e106.jpg is the size of the input, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e107.jpg is a computable function and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e108.jpg is a polynomial.

As we do not require the parameterized notion of hardness, we refer the reader to Flum and Grohe [12] for complete coverage.

Results and Discussion

The Complexity of AH-Cut

We first turn to theoretical results for AH-Cut. We show that the optimisation version of the problem is An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e109.jpg-hard, and consquently that the decision version is An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e110.jpg-complete, indicating that AH-Cut is not has no polynomial time algorithm, but has no polynomial time approximation scheme, under standard complexity theoretic assumptions. Under the parameterized complexity framework however we show that AH-Cut is fixed parameter tractable with a An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e111.jpg time algorithm.

NP-Completeness and APX-Hardness

We demonstrate the An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e112.jpg-completeness for AH-Cut via an An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e113.jpg-hardness reduction from Max-Cut which is known to be An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e114.jpg-hard [3] and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e115.jpg-complete [13].

Max-Cut

Instance: A graph An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e116.jpg, a positive integer An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e117.jpg.

Question: Is there a set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e118.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e119.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e120.jpg?

The goal of the optimisation version of Max-Cut is to maximise An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e121.jpg.

Let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e122.jpg be an instance of Max-Cut with An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e123.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e124.jpg (we may assume that there is at least one cycle, as the maximum cut of any forest is trivially An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e125.jpg), we construct an instance An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e126.jpg of AH-Cut where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e127.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e128.jpg (i.e., An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e129.jpg is a complete graph). The elements of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e130.jpg are weighted as follows: if An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e131.jpg, then we set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e132.jpg, if An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e133.jpg we set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e134.jpg and for all other edges An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e135.jpg we set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e136.jpg. We set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e137.jpg. Clearly we can obtain An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e138.jpg in polynomial time.

Before moving to the hardness proof, we first prove some auxilliary lemmas.

Lemma 1

Let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e139.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e140.jpg where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e141.jpg, then An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e142.jpg where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e143.jpg.

Proof. Consider the objective function

equation image

and let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e145.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e146.jpg.

Each of the edges between An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e147.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e148.jpg that are also in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e149.jpg contribute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e150.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e151.jpg, and all other edges that are also in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e152.jpg contribute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e153.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e154.jpg. As An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e155.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e156.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e157.jpg are in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e158.jpg, the edges An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e159.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e160.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e161.jpg contribute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e162.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e163.jpg. The edges between An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e164.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e165.jpg contribute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e166.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e167.jpg. Therefore

equation image

As An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e169.jpg,

equation image

Lemma 2

Assume An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e171.jpg. Let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e172.jpg be a subset of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e173.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e174.jpg. If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e175.jpg then An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e176.jpg.

Proof. Assume An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e177.jpg, without loss of generality (by switching An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e178.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e179.jpg) we may assume that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e180.jpg. Let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e181.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e182.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e183.jpg. Let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e184.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e185.jpg.

As An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e186.jpg we know An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e187.jpg, which contributes An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e188.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e189.jpg. The edges in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e190.jpg contribute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e191.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e192.jpg. As two vertices from An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e193.jpg are in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e194.jpg, the edges between those two vertices and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e195.jpg contribute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e196.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e197.jpg. The third vertex in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e198.jpg contributes An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e199.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e200.jpg.

One of the edges of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e201.jpg is not in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e202.jpg and thus contributes An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e203.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e204.jpg. There are An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e205.jpg edges in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e206.jpg that are not in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e207.jpg (and thus not in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e208.jpg) and so contribute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e209.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e210.jpg. The edges between the two An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e211.jpg vertices and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e212.jpg contribute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e213.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e214.jpg. Finally the edges between the An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e215.jpg vertex and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e216.jpg contribute An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e217.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e218.jpg. Thus in total we have

equation image

and

equation image

As An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e221.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e222.jpg we have

equation image

As we assume that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e224.jpg,

equation image

Lemma 3

Assume An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e226.jpg. Let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e227.jpg be a subset of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e228.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e229.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e230.jpg. In polynomial time we can obtain an An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e231.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e232.jpg.

Proof. If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e233.jpg we may apply the greedy algorithm of Mahajan and Raman [14] which returns a set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e234.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e235.jpg. Therefore we may take An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e236.jpg as An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e237.jpg and we have An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e238.jpg.

If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e239.jpg, we have that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e240.jpg. Then by Lemma 2, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e241.jpg. Without loss of generality we may assume that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e242.jpg (by switching An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e243.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e244.jpg as necessary). Denote An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e245.jpg by An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e246.jpg. As An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e247.jpg we have An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e248.jpg. We also have that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e249.jpg. We may then observe that

equation image

We know that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e251.jpg, and that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e252.jpg contributes 3 edges to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e253.jpg, as An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e254.jpg there are at most An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e255.jpg edges of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e256.jpg that are in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e257.jpg and there are at most An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e258.jpg edges otherwise unnaccounted for in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e259.jpg. Therefore:

equation image

As An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e261.jpg:

equation image

We are now prepared to prove the main theorem of this section.

Theorem 4

AH-Cut is An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e263.jpg-hard and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e264.jpg-complete.

Proof. Assume there is a An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e265.jpg-approximation algorithm An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e266.jpg for AH-Cut. We show that this implies a An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e267.jpg-approximation algorithm for Max-Cut. Let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e268.jpg be an instance of Max-Cut and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e269.jpg be the corresponding instance of AH-Cut derived from the reduction described above. If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e270.jpg or An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e271.jpg, we solve the instance by complete enumeration in constant time. Otherwise assume the optimal cut of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e272.jpg cuts at least An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e273.jpg edges of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e274.jpg and induces the partition An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e275.jpg. By Lemma 1 the partition An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e276.jpg induces a solution for AH-Cut such that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e277.jpg. Algorithm An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e278.jpg will give a solution with An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e279.jpg. Then by Lemma 3 we have a set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e280.jpg such that

equation image

It is clear that AH-Cut is in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e282.jpg. Given a partition, we simply calculate An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e283.jpg for that partition and compare to the target value.

Fixed-Parameter Tractability

We show that AH-Cut is fixed-parameter tractable via a greedy localisation technique. First we compute a greedy solution as follows:

  1. Pick an edge An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e284.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e285.jpg for every An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e286.jpg. Add An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e287.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e288.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e289.jpg to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e290.jpg.
  2. While An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e291.jpg do
    1. Pick a vertex An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e292.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e293.jpg.
    2. If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e294.jpg then set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e295.jpg, otherwise set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e296.jpg.

Note that we assume that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e297.jpg is connected. If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e298.jpg is not connected then all vertices of degree An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e299.jpg can be discarded, and the initial selection of vertices must take an adjacent pair from each connected component, then the algorithm continues as before. After all vertices have been assigned if An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e300.jpg, then we answer Yes. If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e301.jpg we make the following claim:

Lemma 5

Let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e302.jpg be an instance of AH-Cut and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e303.jpg a partition of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e304.jpg such that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e305.jpg, then An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e306.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e307.jpg.

Proof. Let An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e308.jpg be such an instance and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e309.jpg the partition.

If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e310.jpg, then in particular we know that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e311.jpg, therefore An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e312.jpg and we have An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e313.jpg. Furthermore if An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e314.jpg, then there are at most An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e315.jpg edges in An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e316.jpg. Thus the total number of edges is at most An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e317.jpg and we have at most An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e318.jpg vertices in the graph.

If An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e319.jpg, then we immediately have that An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e320.jpg and therefore An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e321.jpg. The case with the most edges with both endpoints in the same partite set is then when there is only one edge between the two partite sets, therefore An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e322.jpg, then An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e323.jpg, therefore An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e324.jpg and there are at most An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e325.jpg edges and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e326.jpg vertices in the graph.

As the instance is bounded by a function of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e327.jpg, we can exhaustively search the instance in time An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e328.jpg where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e329.jpg.

This algorithm immediately gives the following result:

Theorem 6

AH-Cut is fixed-parameter tractable with an algorithm running in time An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e330.jpg where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e331.jpg is the optimisation target value, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e332.jpg is the maximum edge weight and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e333.jpg is the number of vertices in the input graph.

AH-Cut in Practice

We apply our algorithm to four datasets: (i) melanoma-colon-leukemia data from National Cancer Institute, U.S [15] (involving gene expression of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e334.jpg genes for An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e335.jpg samples); (ii) SARS data of Yap et al. [16] and (iii) tissue type data given by Su et al. [17] (involving gene expression of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e336.jpg genes for An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e337.jpg tissue samples).

We also consider a large synthetic dataset consisting of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e338.jpg samples and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e339.jpg features with a known optimal solution with three clusters. Despite the size of this datasets, our algorithm finds all three clusters.

In each case we compare our algorithm to Normalized-Cut and where possible to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e340.jpg-Means and Graclus. Implementation details for the memetic algorithm are given in the Materials and Methods section.

Melanoma-Colon-Leukemia data from NCI

For the first comparison we use a subset of the data for An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e341.jpg cancer samples taken for the National Cancer Institute's (NCI) screening for anti-cancer drugs [15]. The dataset consists of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e342.jpg gene expressions of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e343.jpg melanoma, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e344.jpg colon tumour and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e345.jpg leukaemia samples. The reason for taking these three sets of samples is that others (non-small cell lung cancer, breast cancer, etc.) have heterogeneous profiles and removing these gives an expected solution of three clear clusters. Laan and Pollard [18] show that this simple dataset is already hard to cluster using agglomerative hierarchical clustering methods. Nevertheless, AH-Cut is able to cluster the samples of these three diseases effectively, see Figure 1 for the whole dendrogram generated by AH-Cut. We use centred correlation distance as the distance metric to maintain consistency with Golub et al. [19].

Figure 1
Dendrogram generated from AH-Cut for the melanoma-colon-leukemia dataset.

Conversely, Normalized-Cut behaves inconsistently in allocating samples to the partitions. Using the negated distance matrix as a similarity matrix and choosing two clusters, it either separates melanoma from colon and leukemia samples, or leukemia from colon and melanoma samples, or splits the leukemia sample group. Even when number of clusters is specified as An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e346.jpg, leukemia samples are split between different clusters. Using An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e347.jpg as the similarity matrix, where An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e348.jpg is the distance matrix, gives worse results.

On the other hand, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e349.jpg-Means performs much better than Normalized-Cut and successfully separates melanoma from colon and leukemia samples when An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e350.jpg and gives three distinct clusters of colon, melanoma and leukemia samples when An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e351.jpg.

SARS

Next we analyse Yap et al.'s [16] dataset for Severe Acute Respiratory Syndrome (SARS). To explore the exact origin of SARS, the genomic sequence relationships of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e352.jpg different single-stranded RNA (ssRNA) viruses (both positive and negative strand ssRNA viruses) of various families were studied. Yap et al. [16] generate the tetra-nucleotide usage pattern profile for each virus from which a distance matrix based on correlation coefficients is created. We use this distance matrix for the following performance comparison of AH-Cut and Normalized-Cut. See Figure 2 for the dendrogram generated by AH-Cut for this dataset. It is interesting to note that SARS virus is grouped in the same subtree as other corona viruses and is closest to the Feline Corona Virus (FCoV). Notice that these are all positive strand ssRNA viruses. This group of SARS and Corona viruses also contains other viruses (Porcine epidemic diarrhoea virus (PDV), Transmissible gastroenteritis virus (TGV), Avian infectious bronchitis virus (ABV), Murine hepatitis virus (MHV)). There is also a group of positive strand ssRNA viruses, called “Outliers”, which exhibit differences in their tetra-nucleotide usage pattern from the rest. Yellow Fever Virus (YFV), Avian Encephalomyelitis Virus (AEV), Rabbit Hemorrhagic disease Virus (RHV), Equine arteritis Virus (EV1), Lactate Dehydrogenase-elevating Virus (LDV) were also identified as outliers by Yap et al. [16]. This group also includes other ssRNA positive strand viruses - Igbo ora virus (IOV), Bovine viral diarrheoa (BDV), Foot and mouth disease virus C (FMV) and Simina Haemorrhagic fever virus (SFV). The negative strand ssRNA viruses are clustered in two subgroups, one unmixed with the positive strand ssRNA viruses, the remainder in the group “Mixed”. The first class (called -ve strand ssRNA viruses in Figure 2) covers Canine Distemper Virus (CDV) and Tioman virus (TV2), Reston Ebola Virus (REV), Bovine Ephemeral Fever Virus (BFV), Hantaan Virus (HV1), Bovine Respiratory syncytial Virus (BRV), Human Respiratory syncytial Virus (HRV) and Respiratory Syncytial Virus (RSV).

Figure 2
Dendrogram generated by AH-Cut for the SARS dataset.

Normalized-Cut mixes positive and negative strand ssRNA viruses even in the first partition for the majority of runs. Graclus puts corona viruses in different partitions even when the number of partition specified is An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e353.jpg. As Graclus requires a distance matrix of integers we scaled the dataset's distance matrix by An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e354.jpg to obtain an integral distance matrix. As only a distance matrix was available, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e355.jpg-Means was not applicable.

Tissue dataset

Next we present results of applying AH-Cut to Su at al.'s [17] human tissue dataset. This dataset consists of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e356.jpg tissue specific genes from An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e357.jpg samples collected from An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e358.jpg individuals. The known origin of tissue samples gives a great advantage in validating clusterings of this dataset. For brevity we will compare only the first partitioning of this dataset generated by the different algorithms. The first partition of AH-Cut consists of:

  • brain related tissues;
  • eye related tissues;
  • face related tissues;
  • testis tissues;
  • others (including ovary tissue).

The second partition of AH-Cut consists of:

  • bonemarrow related cells;
  • blood cells;
  • heart related cells;
  • fœtal cells;
  • others (including ovary tissue, uterine tissue and uterine corpus tissue).

The partitioning for this dataset is quite reasonable except occurrence ovary tissues in different partitions. This can be due to the possible outlier nature of ovary tissues. However, Normalized-Cut repeatedly separates the brain related tissues and thus performs even worse. Graclus performs similarly to Normalized-Cut on this dataset.

An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e359.jpg-Means agrees very well with AH-Cut in the partitions, except that it clusters placental tissues within the second partition instead of the first and it puts the two uterine tissues in different partitions. An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e360.jpg-Means also puts the two ovary tissues in two different partitions.

Synthetic large-sampled gene expression data

To test the scalability of our algorithm, we show the results of AH-Cut applied to a large synthetic dataset. Consider An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e361.jpg samples of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e362.jpg synthetic gene expression profiles corresponding to three subtypes of some disease, giving a known optimum clustering with three clusters. To generate the data, we follow Laan and Pollard's [18] method. We sample three groups of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e363.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e364.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e365.jpg samples respectively from three multivariate normal distributions with diagonal covariance matrices, which differed only in their mean vector. The number of samples are chosen keeping in mind that in general there are some predominant subtypes of a disease and some rarer subtypes. All genes had common standard deviation An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e366.jpg, which corresponds to a An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e367.jpg-quantile of all standard deviations in an actual data set. For the first subpopulation, the first An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e368.jpg genes had a mean of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e369.jpg, genes An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e370.jpg had mean of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e371.jpg, and the other An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e372.jpg genes had mean zero. Then for the second subpopulation, genes An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e373.jpg had mean of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e374.jpg, genes An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e375.jpg had mean of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e376.jpg and the other An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e377.jpg genes had mean zero. For the third subpopulation, genes An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e378.jpg had mean of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e379.jpg, genes An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e380.jpg had mean of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e381.jpg and the other An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e382.jpg genes had mean zero. In other words, signature of the three types of cancer is related to An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e383.jpg genes of which An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e384.jpg are under-expressed and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e385.jpg are over-expressed.

The application of AH-Cut on this dataset first separates the first group from the rest. A second application on the rest of the samples yields the second and third group as the two partitions.

When number of clusters is specified as two, Normalized-Cut clusters the first and third subtypes together and the second subtype separately. However, specifying number of clusters as three creates a partitioning which does not correspond to the expected known grouping.

An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e386.jpg-Means puts the first subtype in one partition and the other two subtypes in another partition when An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e387.jpg, and separates three subtypes successfully when An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e388.jpg.

Conclusion

We have introduced a novel objective function for clustering based on graph partitioning. We show that the resulting problem AH-Cut is, unfortunately, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e389.jpg-complete and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e390.jpg-hard, but is however fixed-parameter tractable.

We then gave several test cases demonstrating the potential of the approach using a memetic algorithm. The performance of AH-Cut based clustering exceeds the performance of Normalized-Cut based clustering across a wide variety of datasets, including large scale datasets, and notably datasets with known optimal clusterings. AH-Cut based clustering also has a wider applicability than An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e391.jpg-Means based clustering, and at least equal performance.

There are several avenues for further research from this initial exploration. The fixed-parameter tractability of AH-Cut promises the possibility of a practical exact algorithm, which would give stronger evidence of AH-Cut's performance, as random elements would be removed.

Further studies on datasets of all kinds would also be useful to explore the strengths and weaknesses of AH-Cut based clustering, especially in comparison to other existing methods.

Tangentially, the quality of the memetic algorithm solutions suggest that there may be a link between the fixed-parameter tractability and the performance of the memetic algorithm. As established by the fixed-parameter tractability of AH-Cut, if a simple greedy algorithm does not produce a solution with a sufficiently high objective value, then the instance size must be bounded by an relatively simple function of the parameters. Therefore it is possible that under these conditions the local search component of the memetic algorithm approximates an exhaustive search, or at least has a greater effectiveness. A definite link of this kind would be an interesting development for both parameterized complexity and memetic algorithmics, above and beyond this application.

Materials and Methods

The complexity analysis of the AH-Cut problem employ standard complexity theory techniques [11], [12].

The NCI60 cancer dataset was drawn from the NCI60 anti-cancer drug screening program data [20] and the gene expression data for the cell lines was given by Ross et al. [15].

The tissue dataset is drawn from Su et al. [17].

The synthetic dataset was generated using the methods of Laan and Pollard [18].

For the comparisons we use Gene Cluster's implementation of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e392.jpg-Means [4] (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm#ctv), Dhillon et al.'s Graclus software [9] (http://userweb.cs.utexas.edu/users/dml/Software/graclus.html) and Shi and Malik's implementation of Normalized Cut [1].

All experiments were performed on a Dell laptop with a 1.67GHz processor and 2GB of RAM with the software written in Java.

A memetic algorithm for AH-Cut

We have implemented AH-Cut via a memetic algorithm [21], [22]. Memetic algorithms provide a population-based approach for heuristic search in optimization problems. Broadly speaking they combine local search heuristics with crossover operators used in genetic algorithms [23][25]. The essence of our algorithm is similar to the work of Merz and Freisleben [26] for Graph Bi-partitioning. Differences arise from the fact that we need to remove the constraint of equal partitioning of the graph. The method consists of three main procedures: a greedy algorithm for initialization of a set of solutions for AH-Cut (detailed in the parameterized algorithm); a differential greedy crossover for evolution of the population; and a variable neighborhood local search, influenced by Festa et al. [27], to improve the newly generated solutions.

We use a ternary tree for population similar to Buriol et al. [28] and keep two solutions at each node of this tree. One solution is the best obtained so far at the node, called pocket solution and the other one is the current solution. Essentially, if we generate a current solution by recombination or local search which is better than the pocket solution, we swap the current solution with the pocket solution. Furthermore, each parent node of the tree must have better pocket solution than its children's pocket solutions. Similar tree structures were previously employed successfully for various combinatorially hard problems [28][30].

Differential Greedy Crossover

We allow a crossover of a parent's pocket solution with a child's current solution to ensure the diversity in the population. All vertices that are contained in the same set for both the parents, are included in the same set in the offspring. Then both sets are filled according to a greedy recombination method similar to the greedy algorithm used for the parameterized algorithm. Suppose, the parent solutions An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e393.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e394.jpg have the partitions An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e395.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e396.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e397.jpg respectively (after interchanging the sets suitably). Then the starting set An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e398.jpg (resp. An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e399.jpg) for the offspring is given by the intersection An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e400.jpg (resp. An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e401.jpg), with the remainder of the partition calculated greedily.

Local Search

We employ a variable-neighborhood search (VNS), first proposed by Hansen and Mladenovic [31] for a local search in the neighborhood of the new offspring. Contrary to other local search methods, VNS allows enlargement of the neighborhood structure. A An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e402.jpg-th order neighbor of a paritition giving a solution An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e403.jpg for AH-Cut is obtained by swapping the partite set of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e404.jpg vertices. In VNS, first local search is done starting from each neighbor An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e405.jpg of the current solution An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e406.jpg. If a solution An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e407.jpg is found which is better than An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e408.jpg, then the search moves to the neighborhood of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e409.jpg. Otherwise, the order An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e410.jpg of the neighborhood is increased by one, until some stop criterion holds. We use maximum value of An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e411.jpg for An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e412.jpg.

Diversity

Whenever the population stagnates (fails to improve the objective value), we keep the best solution and re-initialize the rest of solutions (using the greedy algorithm with a randomised starting vertex pair) in the set and run the above process again for certain number of generations (say, An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e413.jpg).

To get the optimal solution for very small sized problems (graphs containing less than An external file that holds a picture, illustration, etc.
Object name is pone.0014067.e414.jpg vertices), we used backtracking. Notice that even though backtracking gives us an optimal solution, a greedy or memetic algorithm may not. By applying this method (backtracking, memetic or greedy algorithm depending on the number of vertices) recursively, we have at each step a graph as input, and the two subgraphs induced by each of the sets of the vertex partition as output; stopping when we arrive to a graph with just one vertex, we generate a hierarchical clustering in a top-down fashion.

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: We acknowledge the support of the Australian Research Council (ARC) Centre of Excellence in Bioinformatics (CE0348221), The University of Newcastle, and ARC Discovery Project DP0773279 (Application of novel exact combinatorial optimisation techniques and metaheuristic methods for problems in cancer research). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Shi J, Malik J. Normalized cuts and image segmentation. IEEE Transactions of Pattern Analysis and Machine Intelligence. 2000;22:888–905.
2. Wertheimer M. Laws of organization in perceptual forms (partial translation). A Sourcebook of Gestalt Psychology. 1938:71–88.
3. Papadimitriou CH, Yannakakis M. Optimization, approximation, and complexity classes. Journal of Computer and System Sciences. 1991;43:425–440.
4. de Hoon MJ, Imoto S, Nolan J, Miyano S. Open source clustering software. Bioinformatics. 2004;20:1453–1454. [PubMed]
5. Wu Z, Leahy R. An optimal graph theoretic approach to data clustering: theory and its application to image segmentation. IEEE Transactions of Pattern Analysis and Machine Intelligence. 1993;15:1101–1113.
6. Wang S, Siskind JF. Image segmentation with ratio cut. IEEE Transactions of Pattern Analysis and Machine Intelligence. 2003;25:675–690.
7. Wei YC, Cheng CK. Ratio cut partitioning for hierarchical designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 1991;10:911–921.
8. Yu SX, Shi J. Multiclass spectral clustering. 2003. In: International Conference on Computer Vision.
9. Dhillon IS, Guan Y, Kulis B. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions of Pattern Analysis and Machine Intelligence. 2007;29:1944–1957. [PubMed]
10. Mahata P, Costa W, Cotta C, Moscato P. Hierarchical clustering, languages and cancer. EvoWorkshops 2006. 2006. pp. 67–78.
11. Ausiello G, Crescenzi P, Gambosi G, Kann V, Spaccamela AM, et al. Complexity and Approximation: Combinatorial Optimization Problems and their Approximability Properties. New York: Springer-Verlag; 1999.
12. Flum J, Grohe M. Parameterized Complexity Theory. New York: Springer; 2006.
13. Karp RM. Proceedings of a symposium on the Complexity of Computer Computations. New York: IBM T. W. Watson Research Center; 1972. Reducibility among combinatorial problems. pp. 85–104.
14. Mahajan M, Raman V. Parameterizing above guaranteed values: Maxsat and maxcut. Electronic Colloquium on Computational Complexity (ECCC) 1997;4
15. Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics. 2000;24:227–235. [PubMed]
16. Yap YL, Zhang XW, Danchin A. Relationship of SARS-CoV to other pathogenic RNA viruses explored by tetranucleotide usage profiling. BMC Bioinformatics. 2003;4:43. [PMC free article] [PubMed]
17. Su AI, Cooke MP, ching KA, Hakak Y, Walker JR, et al. Large scale analysis of the human and mouse transcriptomes. PNAS. 2002;99:4465–4470. [PubMed]
18. Laan MJ, Pollard KS. A new algorithm for hybrid clustering of gene expression data with visualization and bootstrap. Journal of Statistical Planning and Inference. 2003;117:275–303.
19. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. [PubMed]
20. NCI/NIH. Developmental Theraputics Program. 2010. Available: http://dtp.nci.nih.gov/
21. Moscato P. On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. 1989. Technical Report 826, Caltech Concurrent Computation Program.
22. Moscato P, Cotta C. A gentle introduction to memetic algorithms. In: Glover F, Kochenberger G, editors. Handbook of Metaheuristics. Boston MA: Kluwer Academic Publishers; 2003. pp. 105–144.
23. Whitley D. A genetic algorithm tutorial. Statistics and Computing. 1994;4:65–85.
24. Moscato P, Cotta C, Mendes A. Memetic algorithms. In: Onwubolu G, Babu B, editors. New Optimization Techniques in Engineering. Berlin: Springer-Verlag; 2004. pp. 53–85.
25. Moscato P, Cotta C. Memetic algorithms. In: González T, editor. Handbook of Approximation Algorithms and Metaheuristics. Taylor & Francis; 2006.
26. Merz P, Freisleben B. Fitness landscapes, memetic algorithms, and greedy operators for graph bipartitioning. Evolutionary Computation. 2000;8:61–91. [PubMed]
27. Festa P, Pardalos P, Resende MGC, Ribeiro CC. Randomized heuristics for the MAX-CUT problem. Optimization Methods and Software. 2002;7:1033–1058.
28. Buriol L, Franca PM, Moscato P. A new memetic algorithm for the asymmetric traveling salesman problem. Journal of Heuristics. 2004;10:483–506.
29. Berretta R, Cotta C, Moscato P. Metaheuristics: computer decision-making. Norwell, MA, USA: Kluwer Academic Publishers; 2004. Enhancing the performance of memetic algorithms by using a matching-based recombination algorithm. pp. 65–90.
30. Mendes A, Cotta C, Garcia V, Franca P, Moscato P. Gene ordering in microarray data using parallel memetic algorithms. ICPP Workshops. 2005. pp. 604–611.
31. Hansen P, Mladenović N. Developments of variable neighborhood search. Essays and Surveys in Metaheuristics. 2001:415–439.

Articles from PLoS ONE are provided here courtesy of Public Library of Science