Home | About | Journals | Submit | Contact Us | Français |
“Clustering” the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms.
One of the best known problems in the data mining is the clustering. Clustering is the task of categorising objects having several attributes into different classes such that the objects belonging to the same class are similar, and those that are broken down into different classes are not [1]. There are several clustering algorithms that have been proposed till now. Due to no prior information in clustering, the suitable evaluation of the results is necessary. Evaluation means measuring the similarity between clusters, measuring the compactness, and separation between clusters [2]. Evaluation measurement is also proposed as a key feature in internal and external cluster validation indexes [3]. Such a measure can be used to compare the performance of different data clustering algorithms on different real life datasets. These measures are usually tied to the type of criterion being considered in assessing the quality of a clustering method. Three different techniques are available to evaluate the clustering results: external, internal, and relative [4]. Both internal and external criteria are based on statistical methods and they have high computation demand. The external validity methods evaluate the clustering based on some user specific intuitions [4]. The objective of this paper is the comparison of the different clustering schemas that have been already proposed [5] with Canonical PSO based K-means clustering algorithm.
The rest of the paper is organized as follows. The Canonical PSO based K-means algorithm is proposed in Section 2 with some other existing clustering algorithms. Some popular and widely used validity indices are introduced in Section 3. Section 4 demonstrates the clustering compactness measurements on a toy example dataset using K-means and DBSCAN clustering algorithms. Section 5 demonstrates the clustering compactness measurements with experimental results and comparison of the indices is outlined in this section, and Section 7 gives a brief conclusion of this paper. Interested readers may found some significant references at the end section of this paper.
This paper mainly focuses on the implementation of Canonical PSO based K-means algorithm. Some cluster validity indices (e.g., Dunn index, DB index, Silhouette index, Rand index, Mirkin index, etc.) and ANOVA test are analysed. All these indices are individually experimented on the air pollution dataset, customer, wine, and vehicle dataset using typical K-means, DBSCAN, Hierarchical, simple PSO based K-means, and Canonical PSO based K-means algorithms. The overall motivation of this paper is given in Figure 2.
This section introduces a new modified Canonical PSO based K-means clustering algorithm and also describes all these well-known algorithms briefly.
James Kennedy and Russell C. Eberhert originally proposed the Particle Swarm Optimization (PSO) algorithm for optimization. It is a population based, robust, and stochastic optimization technique mainly designed for the balancing weights in neural networks [6]. In data clustering, it is possible to view the clustering problem as an optimization problem that locates the optimal centroids of the clusters rather than to find an optimal partition. This view offers us a probability to apply PSO optimal algorithm on the clustering solution. The use of basic PSO technique in document clustering analysis is proposed in many papers [7, 8].
In the PSO based K-means algorithm, the capability of globalized searching of the PSO algorithm and the fast convergence of the K-means algorithm are combined [9]. The algorithm results in better accuracy than existing algorithms. In the original PSO, at any instance each particle has a position and a velocity. At the beginning, population of particles is initialized with random positions denoted by vectors x _{i} and random velocities v _{i}. The population of such particles is called a swarm, S. Each particle is searching for the optimum. Each particle remembers the position it was in where it had its best result so far (its personal best) [7]. The particles in the swarm cooperate. They exchange information about what they have discovered in the places they have visited.
In each time step, a particle has to move to a new position. It does this by adjusting its velocity. Velocity is updated based on information obtained in previous steps of the algorithm.
This updating of velocity and position can be described by the following set of equations:
where i = 1, 2 … N, j = 1, 2 … n.
Here, v _{ij}(t + 1) is the new velocity at time step (t + 1), v _{ij}(t) is the old velocity at time step t, p _{ij}(t) is the best position of each particle, p _{gj}(t) is the best position of swarm, x _{ij}(t + 1) is current position of each particle, and x _{ij}(t) is old position of each particle.
R1 and R2 are random variables uniformly distributed within [0,1] and C1C2 are weighting factors, also called the cognitive and social parameter, respectively. In the first version of PSO, a single weight, C = C1 = C2, called acceleration constant, was used instead of the two distinct weights in (1).
Oscan and Mohan (1999) focused on the early PSO model of (1) and (2), and they showed that particles were actually moving on sinusoidal waves per coordinate of the search space, while causticity was offering a means to control its frequency and amplitude. They had modified the PSO model and the model is defined by the following equation:
where i = 1,2,…, N particles (vector data) and j = 1, 2,…, n.
Where chi(χ) is a parameter called constriction coefficient or constriction factor, responsible for keeping the particle moving in the same direction, it was originally heading, while the rest of the parameters remain the same as for the previously described PSO models [9].
Input
Output. A set of clusters.
K-means clustering is a method of cluster analysis which aims to partition n observations into K clusters depending on some similarity/dissimilarity metric where the value of K may or may not be known a priori [7]. The objective of K-means clustering algorithm is usually to create one set of clusters that partitions the data into similar groups. As a result, maximum similarity samples are placed in same cluster and low similarity samples are placed in different clusters [10].
In hierarchical clustering clusters are generated by grouping data with similar pattern of expression across a range of samples located near each other. Hierarchical clustering calculates all pairs-wise distance associations between samples and experiments to merge pairs of values that are mainly similar [9].
This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm. DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region (minpts) [11]. It starts with an arbitrary starting point that has not been visited. This point's ε-neighbourhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labelled as noise. If a point is found to be a dense part of a cluster, its ε-neighbourhood is also part of that cluster [12]. Hence, all points that are found within the ε-neighbourhood are added as is their own ε-neighbourhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.
In this paper the behaviour of several clustering validity indices has been examined (Figure 1). Among these indices some of them are clearly illustrated in this section. These indices are used for measuring “goodness” of a clustering result comparing to other ones which were created by other clustering algorithms or by the same algorithms but using different parameter values [13].
Silhouette index is used for cluster analysis formed by different clustering algorithms (Tables (Tables6,6, ,7,7, and and8).8). The silhouette validation technique calculates the silhouette width for each sample, average silhouette width for each cluster, and overall average silhouette width for a total dataset in this paper [14]. The average silhouette width could be applied for evaluation of clustering validity and also could be used to decide how good the number of selected clusters is.
To construct the silhouettes S(i) the following formula is used:
where a(i) is the average dissimilarity of ith object to all other objects in the same cluster and b(i) is the minimum of average dissimilarity of i-object to all objects in other clusters [9].
If silhouette value is close to 1, it means that sample is “well-clustered” and it was assigned to a very appropriate cluster. If silhouette value is about zero, it means that sample could be assigned to another closest cluster as well, and the sample lies equally far away from both clusters. If silhouette value is close to –1, it means that sample is “misclassified” and is merely somewhere in between the clusters. The overall average silhouette width for the entire plot is simply the average of the S(i) for all objects in the whole dataset.
The Davies-Bouldin index (DB) can be calculated by the following formula:
where n is the number of clusters, C _{x} is the centroid of cluster, _{x} is the average distance of all elements in cluster to centroids, and d(C _{i}, C _{j}) is the distance between centroids [15]. Since algorithms that produce clusters with low intracluster distances (high intracluster similarity) and high intercluster distances (low intercluster similarity) will have a low Davies-Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies-Bouldin index is considered the best algorithm based on this criterion [9].
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal intercluster distances to maximal intracluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:
where d(i, j) represents the distance between clusters i and j and d ^{/}(k) measures the intracluster distance of cluster k. The intercluster distance d(i, j) between two clusters may be any number of distance measures, such as the distance between the centroids of the clusters [15]. Similarly, the intracluster distance d ^{/}(k) may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster k. Since internal criterion seeks clusters with high intracluster similarity and low intercluster similarity, algorithms that produce clusters with high Dunn index are more desirable [16].
The Rand index computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. One can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm [16].
Given a set of n elements S = {o _{1},…, o _{n}} and two partitions of S to compare, X = {X _{1},…, X _{r}}, a partition of S into r subsets, and Y = {Y _{1},…, Y _{s}}, a partition of S into s subsets, then the Rand index, R, is
where a is the number of pairs of elements in S that are in the same set in X and in the same set in Y, b is the number of pairs of elements in S that are in different sets in X and in different sets in Y, c is the number of pairs of elements in S that are in the same set in X and in different sets in Y, and d is the number of pairs of elements in S that are in different sets in X and in the same set in Y.
Mirkin index is also known as equivalence mismatch distance. It is defined by
where i = 1,…, k, j = 1,…, l.
Let X be a finite set with cardinality |x | −n. A clustering C is a set{C _{1},…, C _{k}} of nonempty disjoint subsets of X such that their union equals X. Clustering C′ is a refinement of cluster of C, formally:
“Analysis of variance” test is used to compare three or more groups or conditions in an experiment. A one-way ANOVA can help to find out if the means for each group/condition are significantly different from one another or if they are relatively the same. The null hypothesis typically corresponds to a general or default position. One-way ANOVA is a simple special case of the linear model. The one-way ANOVA form of the model is
where y _{ij} is a matrix of observations in which each column represents a different group. α _{.j} is a matrix whose columns are the group means (the “dot j” notation means that applies to all rows of the jth column. That is, the value α _{ij} is the same for all i). € _{ij} is a matrix of random disturbances.
The standard ANOVA table has six columns:
The ANOVA test makes the following assumptions about the data:
Every logical experiment needs a well-defined mathematical illustration. In this section, the Davis-Bouldin (DB) index and Dunn index are applied on three most commonly used clustering algorithms (typical K-means, DBSCAN, and Hierarchical) with the help of mathematical examples.
Eg.1. Let us assume there is a set of 15 data objects, such as 10, 12, 15, 7, 22, 29, 31, 3, 7, 5, 1, 4, 12, 11, and 10.
Sol:
K-MeansClustering
In the first step typical K-means algorithm is applied on this dataset. Suppose the total number of clusters is 3 and also assume their centroids are Clust1 = 10, Clust2 = 22, and Clust3 = 1.
Now use Manhattan distance measure |d _{i} − d _{j}| to calculate the following.
First iteration: see Table 1 and Figure 3.
Now, the new centroids are Clust1 = 10.5, Clust2 = 27.3, and Clust3 = 3.25.
Second iteration: see Table 2.
No change occurs.
So those three clusters are final clusters and now based on those clusters two popular indexes Davis-Bouldin (DB) index and Dunn index which are calculated below.
Davies-Bouldin Index. Consider
where n = 3, C _{i} = 10, C _{j} = 22, C _{k} = 1, _{i} = {(0 + 2 + 5 + 3 + 1 + 2 + 3)/8} = 2, _{j} = {(7 + 9)/3} = 5.3, _{k} = {(2 + 3 + 4)/4} = 2.25, d(C _{i}, C _{j}) = 12, d(C _{j}, C _{k}) = 21, and d(C _{i}, C _{k}) = 9.
So,
Dunn Index. The simple form of Dunn index calculation is
where d _{min } is the minimum distance between two points belonging to different clusters and d _{max } is the maximum distance between any two points selected from the same cluster [13].
According to this form, the calculated values are as shown in Table 3.
So,
DBSCAN Clustering. Now, on the same dataset DBSCAN clustering algorithm is applied with assuming the value of minimum number of points (minpts) = 3 and the radius of a cluster (eps) = 1cm. Initially there are three clusters with the same means and they are shown in Figure 4.
According to the DBSCAN clustering criteria, we have the following.
Here two data objects are treated like outliers because both objects 15 and 31 are ≥eps (1cm) and due to the minimum number points condition does not satisfied by the Clust3 that is why the objects of the clust3 also acts like outliers. But the objects of Clust3 and the objects 15 and 31 all together form a cluster (C _{new}) which satisfies both criteria of DBSCAN clustering. It is also called the cluster of noise.
Now in the same way the DB and DUNN indexes are calculated on those final clusters,
These values may be changed based on the assumption of eps and minpts values.
Evaluation. After checking the values of the indexes it can be said that for that particular set of data objects the typical K-means clustering gives better compactness and better separability than DBSCAN clustering (due to the fact that less value of DB index and high value of Dunn index give more desirable results). But in most of the cases DBSCAN gives bad results compared to K-means clustering because it forms arbitrary shape of clusters.
The experimented air pollution database is taken from the http://www.wbpcb.gov.in/ website of Kolkata District from 01/07/2011 to 01/05/2012. The database contains different air pollutant data like benzene, CO_{2} (carbon-dioxide), NO_{2} (nitrogen-dioxide), O_{3} (ozone), PM (particular matter), and SO_{2} (sulphur-dioxide).
The wholesale customer dataset is taken from UCI Machine Learning Repository. The dataset contains eight attributes like fresh products, milk, grocery, frozen, detergents, delicatessen, channel, and region.
The wholesale wine dataset is taken from UCI Machine Learning Repository. The dataset contains thirteen attributes like alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline.
The wine dataset is taken from UCI Machine Learning Repository. The dataset contains eighteen attributes like compactness, circularity, distance circularity area, radius ratio, Pr.Axis aspect ratio, max. length aspect ratio, scatter ratio, elongatedness, Pr.Axis rectangularity, max. length rectangularity area, scaled variance along major axis, scaled variance along minor axis, scaled radius of gyration, skewness about major axis, skewness about minor axis, kurtosis about minor axis, kurtosis about major axis, and hollows ratio. Here 4 classes are considered: Opel, Saab, Bus, and Van. The 4 real datasets and their main characteristics are shown in “Table 4.” The experiment is based on 150 iterations and five clustering algorithms.
At first analysis, the accuracy of different clustering method is measured using silhouette index depicted in Tables Tables2,2, ,3,3, ,4,4, and and5.5. Probably 150 iterations are considered in each case. One-way ANOVA test is applied on the values of silhouette index at different iterations and null hypothesis has been measured and reported in Tables Tables9,9, ,10,10, ,11,11, and and12.12. The accuracy comparisons of all clustering algorithms on different datasets are explained in Table 13.
The results of ANOVA test is as shown in Tables Tables9,9, ,10,10, ,11,11, and and1212 and Figures Figures5,5, ,6,6, ,7,7, and and88.
The reason for doing an ANOVA test is to see if there is any difference between groups on some variables. If probability > F, we reject null hypothesis. Here in all cases, Prob > F. Therefore, we reject the null hypothesis. The means of the five algorithms are not equal. At least one of the means is different. However, the ANOVA test does not tell where the difference lies. It requires a “t-test” to test each pair of means.
So from Table 13 it can be concluded that Canonical PSO based K-means algorithm provides most desirable results compared to other clustering algorithms stated in Section 2 and DBSCAN clustering provides the worst results than the others (based on the initial values of eps = 25 and minpts = 65). In this part, the results of these algorithms are evaluated by applying other well-known indexes (such as, Dunn, Davies-Bouldin, Rand, and Mirkin indices).
From “Table 14” the results of this experiment can be summerised as follows.
The same approach and results are also valid for the other 3 datasets (such as customer, wine, and vehicle). The detailed analysis of results is tabulated in Tables Tables15,15, ,16,16, and and1717 and their corresponding pictorial forms are represented in Figures Figures10,10, ,11,11, and and1212.
Although there are several proposals for validity indices in the literature, most of them succeed only in certain situations and are based on some certain conditions. This paper introduces the new concept like Canonical PSO based K-means algorithm and also presents ample comparisons of six popular clustering validity approaches. All these analyses are done on a few real time datasets which contain air pollutant particles, wholesale customer, wine, and vehicle data. This proposed algorithm provides better impact to produce desirable clustering results compared to other existing algorithms described in this paper (provided some certain assumptions).
Several extensions to the present work are in progress. Finally, studies of other clustering algorithms and others indices and their effects to measure proper, compact clusters are also underway. In future work, we can amalgamate DBSCAN and Hierarchical algorithms with PSO approach to seek some better ways of producing improved clustering output.
The authors declare that there is no conflict of interests regarding the publication of this paper.
PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |