The availability of microarray technology at an affordable price makes it possible to determine expression of several thousand genes simultaneously. For example, the AFFYMETRIX 430 2.0 array contains oligonucleotide probe sets representing approximately 39,000 mouse gene mRNA transcripts. Gene expression levels for a particular tissue or cell type under different conditions are captured by first isolating RNA from the test sample. Through a series of standardized reaction steps, each RNA sample is labeled fluorescently and used to probe an individual chip. Once the expression levels of the genes are quantified under all conditions, the differentially expressed genes are filtered using one of the several methods, such as fold change from one condition to another. Very often the number of differentially expressed genes in a particular comparison are in the order of hundreds.
The differentially expressed genes are then clustered using the expression profiles compared across the different conditions. Clustering is a technique that groups objects of similar features together and it has been studied thoroughly in statistics and data-mining literature [1
]. Among many clustering algorithms, hierarchal and k-means clustering algorithms are widely used in microarray analysis [2
]. The expression values of genes under k
different conditions may be viewed as a data point in k
dimensional space. A clustering algorithm groups nearby data points in k
-dimensional space together. Several distance measuring metrics have been proposed in the literature and the popular ones among them include Euclidian distance and Pearson correlation distance. Each algorithm clusters the genes differently and the same algorithm may have different results with each different distance metric. With a lack of any guideline for selecting appropriate algorithms and the associated distance metric, biologists and other researchers are confused as to which algorithms and the distance matrices to choose. The problem is further compounded with the influence of data instance over the effectiveness of an algorithm. Visualizing the expression profiles of each cluster for selecting a clustering algorithm is laborious and error prone and can not be done with a large number of genes.
To alleviate the problem in judging the quality of clusters or in validating clusters, several validation methods have been proposed in the literature including c-index [3
], Dunn's based index [4
], Davies-Bouldin index [5
], Silhouette method [6
]. Bezdek et al. [7
] had compared several indices for their effectiveness in validating clusters and had suggested Dunn's index to be the best among those they have tested. Bolshakova [8
] had developed an integrated platform for clustering microarray genes using hierarchal and k-means algorithms and measuring some of these cluster validation indices. All these cluster validation methods directly or indirectly relate the cluster density and separation among different clusters. These measures are generic and are using the same parametric space being used to cluster the objects.
Alternatively, clusters can be validated using its effectiveness in predicting correct membership. Yeung et al. [9
] proposed a method called figure of merit for validating clusters based on an estimate of the predictive power of a clustering algorithm. In their approach they apply the clustering algorithm to all the experimental conditions except for one and then use the left out condition to calibrate the predictive power of the clustering algorithm.
None of these methods proposed to validate clusters or to measure the quality of clusters has any bearing on biological interpretations of the clustered genes. Genes that are regulated by the same transcription factors or sets of transcription factors are expected to express similarly under different conditions. Hence, when genes of similar expression patterns are clustered together, it is expected that they share regulation by some of the same transcription factors and that they share function or are involved in some of the same biological processes. In this paper we investigate how to relate the quality of clusters with the expected outcome of clustering: cohesiveness of molecular function or biological processes in each cluster and the separation of biological behaviors among different clusters. We have used GO ontology to abstract the biological processes and molecular functions of genes and use this information to test the proposed metric to measure the effectiveness of clustering in terms of behavioral cohesiveness in clusters. For a lack of a better word, we use behavior
to refer to either molecular function or biological process in the sequel. Our approach may be viewed as an extension to the recent work of Jakel et al. [10
] in which they refer to an external validation. They used cluster selectivity and cluster sensitivity as a measure of external validation.
Several algorithms have been used in the literature for clustering DNA microarray expression and we will consider two popular algorithms, namely hierarchal and k-means clustering. We briefly describe each of them.
The data points are represented as hierarchical series of nested clusters and this representation has a single cluster at the root level and each branch leads to a cluster from top to leaf node [11
]. There are two ways of building hierarchical clustering namely, bottom up and top down. In the bottom up approach every data point is considered to be a cluster and a cluster is merged into another cluster based on their proximity to each other. The proximity measures include single link, average link, complete link and un-weighted pair group. We use average linkage clustering
, which is defined as the average of all the distances among all the pairs of elements between two clusters, say m
. It is represented formally as
Average link distance = Σ(dik
| * |Cln
|) where dik
is the distance between the elements ei
of cluster m
of cluster n
| is the size or the total number of genes in cluster Clr
When a cluster is merged into another cluster, a branch is formed and the process continues until no more individual clusters remain. Once the hierarchical cluster tree is constructed, only one cluster exists at the root level that includes all the genes. As we go down the tree each branch indicates divisions of a cluster into more clusters and the measure of closeness among the clusters are also increasing.
The data points in m-dimensional space are clustered together into k-groups. The algorithm starts by selecting k-data points randomly and these points are called cluster centers. The distance of each data point, say i, to these cluster centers are computed and the data point i is associated with the cluster of the closest cluster center. When all the data points are associated with the clusters, the new cluster center is computed and the process of associating data points to the closest cluster center continues until there are no significant changes in the cluster center between iterations.
Distance measuring metric
The Euclidian distance di,j between a pair of genes, say gi and gj with expression values under m conditions is given by
the expression value of gene gi
under the condition r
The Pearson correlation coefficient between a pair of gene expressions, say gi and gk, is given by
is the covariance of the gene expression of i
, and σi
are the standard deviation of the expression of gene i
and gene j
The gene ontology (GO) project [12
] provides structured controlled vocabularies to address gene products consistently over several databases including FlyBase (Drosophila
), the Saccharomyces
Genome Database (SGD) and the Mouse Genome Database (MGD). The ontology describes gene products in terms of their associated biological processes, cellular components and molecular functions for each annotated gene. Each description of a gene product is arranged in a hierarchy from more general to very specific.
In this work we identify the molecular function and biological process of each gene and use these behaviors to test the proposed metric in assessing the success of a clustering algorithm.