Cluster analysis [
110] aims at classifying a set of observations into two or more mutually exclusive
unknown groups based on combinations of variables. Thus, cluster analysis is usually presented in the context of
unsupervised classification [
111]. It can be applied to a wide range of biological study cases, such as microarray, sequence and phylogenetic analysis [
112]. The purpose of clustering is to group different objects together by observing common properties of elements in a system. In biological networks, this can help identify similar biological entities, like proteins that are homologous in different organisms or that belong to the same complex and genes that are co-expressed [
113,
114].
It is generally difficult to predict behavior and properties based on observations of behaviors or properties of other elements in the same system, therefore various approaches for cluster analysis emerge. Clustering algorithms may be
Exclusive, Overlapping, Hierarchical or
Probabilistic. In the first case, data are grouped in an exclusive way, so that a certain element can be assigned to only one group (exclusively). On the other hand, the overlapping clustering uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. A hierarchical clustering algorithm organizes data in hierarchies and is based on the union between the two nearest clusters; it is commonly used for microarray and sequence analysis [
115]. A more analytical categorization of clustering algorithms can be found at [
110,
116].
An important component of a clustering algorithm is the distance measure between data points. If all the components of the data instance vectors have the same physical units, it is then possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. One example is to cluster cities on a map, since in this case Euclidean distance represents real natural distances. However, for higher dimensional data the Euclidean distance can sometimes be misleading. In that case, a popular measure is the
Minkowski metric and is calculated as

where
D is the dimensionality of the data. The
Euclidean can be calculated if we set
p = 2, while
Manhattan metric has
p = 1. There are no general theoretical guidelines for selecting a measure for a given application.
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. There are two different strategies to organize data. These are the
agglomerative and the
divisive:
Agglomerative: It is a "bottom-up" approach. Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This is a "top-down" approach. In this case, all of the observations start by forming one cluster, and then split recursively as one moves down the hierarchy. Some of the most common tree based clustering algorithms that organize data in hierarchies are the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [
117,
118], Neighbor Joining [
112,
119] and Hierarchical Clustering [
120,
121], all of which represent their clusters as tree structures. The results of hierarchical clustering are usually presented in a dendrogram. Figure shows an example of how genes can be clustered.
Let nr be the number of clusters and xri is the ith object in cluster r and cluster r is formed from clusters p and q. In the following, we describe the different methods used to calculate distances between clusters in hierarchical clustering.
Single linkage calculates the smallest distance between objects in the two clusters to merge them:
d(
r,
s) = min(
dist(
xri,
xsj)),
i ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
(
i,...,
nr),
j ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
(
1,....
ns).
Complete linkage calculates the largest distance between objects in the two clusters to merge them:
d(
r,
s) = max(
dist(
xri,
xsj)),
i ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
(
i,...,
nr),
j ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
(
1,....
ns).
Average linkage uses the average distance between all pairs of objects in any two clusters:

. This algorithm is also known as
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [
117,
118].
Centroid linkage finds the Euclidean distance between the centroids of the two clusters:

,

is the Euclidean distance.
Median linkage uses the Euclidean distance between weighted centroids of the two clusters,

are weighted centroids for the clusters
r and
s. If cluster
r was created by combining clusters
p and
q, x
r is defined recursively as

.
Single or complete linkages are the fastest of the linkage methods. However, single linkage tends to produce stringy clusters, which is not always preferable. The centroid or average linkage produce better results regarding the accordance between the produced clusters and the structure present in the data. These methods require much more computations. Average linkage and complete linkage may be the preferred methods for microarray data analysis [
115].
Ward's linkage finds the incremental sum of squares; that is, the increase in the total within-cluster sum of squares as a result of joining two clusters. The within-cluster sum of squares is defined as the sum of the squares of the distances between all objects in the cluster and the centroid of the cluster. The sum of squares measure is equivalent to the following distance measure

,
where || ||
2 is the Euclidean distance and

are the centroids of clusters
r and
s and
nr and
ns are the number of elements in clusters
r and
s.
Weighted average linkage uses a recursive definition for the distance between two clusters. If cluster
r was created by combining clusters
p and
q, the distance between
r and another cluster
s is defined as the average of the distance between
p and
s and the distance between
q and
s:

.
Neighbor Joining [
112,
119] was initially proposed for finding pairs of operational taxonomic units (OTUs) that minimize the total branch length at each stage of clustering of OTUs starting with a star-like tree. The branch lengths as well as the topology of a parsimonious tree can quickly be obtained by using this method [
112].
Known platforms that already share the tree-based algorithms described above are the Hierarchical Clustering Explorer (HCE) [
122,
123], MEGA [
124-
127] or TM4 [
128]. A recent review article shows which file formats, visualization techniques and algorithms can be used for tree analysis [
129].
Another category of clustering algorithms tries to cluster data in separate groups by identifying common properties that the nodes of a network share. Different strategies exist, like for example trying to find dense areas in a graph or areas where message exchange between nodes is easier or to identify strongly connected components or clique-like areas etc. Many of such algorithms have been used in different case studies like for example to identify protein families [
130], to detect protein complexes in PPI networks [
131,
132], or for finding patterns and motifs in a sequence [
133]. Even though many more exist, some of the most famous algorithms are given below.
Markov Clustering [
134] (MCL) algorithm is a fast and scalable unsupervised clustering algorithm based on simulation of stochastic flow in graphs. The MCL algorithm can detect cluster structures in graphs by a mathematical bootstrapping procedure which takes into account the connectivity properties of the underlying network. The process deterministically computes the probabilities of random walks through a graph by alternating two operations: expansion, and inflation of the underlying matrix. The principle behind it is that random walks on a graph are likely to get locked within dense subgraphs rather than move between dense subgraphs via sparse connections. In other words, higher length paths are more often encountered between nodes in the same cluster than between nodes within different clusters, such that the probabilities between nodes in the same complex will typically be higher in expanded matrices. Clusters are identified by alternating expansion and inflation until the graph is partitioned into subsets so that there are no longer paths between these subsets [
135,
136].
k-Means [
137] is a method of cluster analysis which aims to partition
n observations into
k clusters in which each observation belongs to the cluster with the nearest mean. K-means and its modifications are widely used for gene expression data analysis [
138]. It is a supervised method and users need to predefine the number of clusters. Its complexity is
O(nlk) where
k is the number of clusters,
n the size of the dataset and
l the loops of the algorithm. The k-means algorithm is one of the simplest and fastest clustering algorithms. However, it has a major drawback: the results of the k-means algorithm may change in successive runs because the initial clusters are chosen randomly.
Affinity Propagation [
139] takes as input measures of similarity between pairs of data points and simultaneously considers all data points as potential candidates. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges.
Restricted Neighborhood Search Cluster Algorithm [
140]: It tries to find low cost clustering by composing first an initial random clustering. Later it iteratively moves one node from one cluster to another in a random way trying to improve the clustering cost.
Spectral clustering [
141]: This algorithm tries to find clusters in the graph such that the nodes within a cluster are connected with highly-similar edges and the connections between such areas are weak, constituted by edges with low similarity. The aim is to identify these tightly coupled clusters, and cut the inter-cluster edges. Figure shows an example of protein complex prediction from PPI yeast dataset [
12].
Despite the great variety of clustering techniques, many articles directly compare the various clustering methodologies like [
135] and [
142]. Very often we encounter articles that compare similar algorithms using different datasets and come to very diverse conclusions and results i.e [
142,
143].
Concerning the visualization of networks, the availability of clustering techniques and their complex configuration/combination, today to a large extent, there is a lack of visualization platforms or tools that are able to integrate a variety of more advanced algorithms and the development implementation of such implementations emerges [
144]. Platforms that share clustering algorithms are the Network Analysis Tool (NEAT) [
145] and jClust [
146] but they are still poor in the variety of methods they offer. Software like ArrayCluster [
147] and MCODE [
60] is often used in analysis of gene expression profiles and coexpression detection. Many visualization tools [
144] such as Medusa [
148], Cytoscape [
149], Pajek [
98] and many others [
144] visualize networks in both 2D and 3D, but very few of them like Arena3D [
150] try to bridge the gap between clustering analysis and visualization.