In recent years, DNA microarray technology has become a vital scientific tool for global analysis of genes and their networks. The new technology allows simultaneous profiling of the expression levels of thousands of genes in a single experiment. At the same time, the successful implementation of microarray technology has required new methods for analyzing such large scale datasets. Clustering is a central analysis method of gene-expressions that has been implemented extensively in various works and applications [
1-
5]. The primary goal is to cluster together genes or tissues that manifest similar expression patterns [
1]. The underlying assumption is that co-expressed genes or tissues with correlated pathways may share common functional tasks and regulatory mechanisms. Similar expression patterns might offer insights into various transcriptional and biological processes [
6-
8].
Many clustering algorithms depend heavily on 'similarity' or 'distance' measures (although not necessarily a distance function that satisfy all mathematical conditions of a metric) that quantify the degree of association between expression profiles. The definition of the distance measure is a key factor for a successful identification of the relationships between genes and networks [
6]. Different similarity measures are likely to result in different clustering, although based on the same expression data.
Despite the crucial influence of the similarity measure upon the clustering results, there are fewer publications on this subject in the bioinformatics literature. Many publications focus on the efforts to optimize and justify the implemented biological processes and the clustering algorithms, while the similarity measures are often selected by default [
6,
8,
16,
37-
39]. As indicated in [
16]: "
Clustering co-expressed genes usually requires the definition of 'distance' or 'similarity' between measured datasets, the most common choices being Pearson correlation or Euclidean distance... it is widely recognized that the choice of the distance may be as crucial as the choice of the clustering algorithm itself (D'haeseleer et al., 2000). However, as pointed out by Brazma and Vilo (2000), the appropriateness of similarity measures has not been systematically explored and these measures are used on an ad-hoc basis." Many publications use traditional clustering algorithms (e.g., K-means, Self Organizing Maps and Artificial Neural networks) that have roots in conventional data-extensive research fields, such as signal or image processing. In these fields, the similarity measures rely on the unique characteristics of the specific data structure. For example, in signal processing it is commonly assumed that identical codeword vectors are distorted by white noise components during transmission. Under such an assumption, it is reasonable to use a vector-quantizer encoder which is based on the Euclidean distance [
9]. We claim that the same reasoning is not necessarily applicable to the analysis of gene expression profiles. Thus, further attention should be paid to the selection of a proper distance measure for analyzing the clustering of gene expression data.
In addition to the Euclidean distance, another widely used measure for analyzing and clustering gene expression data is the Pearson correlation coefficient [
1,
10-
14]. It is used despite its underlying assumption on the linear relationships between genes' expressions. As opposed to these measures, it is well known that
mutual information (MI) provides a general measurement for dependencies in the data, in particular positive, negative and nonlinear correlations (e.g., [
15,
16]). This property is important to identify genes that share inputs to which they respond differently [
17].
Within the large body of research on gene expression clustering, there are few publications that systematically explore the appropriateness of chosen similarity measures. Herzel and Grosse (1995) [
15] analyze the relationships between various correlation functions and the MI. They emphasize that the MI can detect any kind of dependence between patterns. Michaels et al. (1998) [
17] present a strategy for the analysis of large-scale quantitative gene expression data from time course experiments. They consider two distance measures: the well established Euclidean distance and a normalized MI. The authors present their approach mainly to demonstrate the essence of the MI measure, and state that further study is required to assure robustness. This paper follows their suggestion and uses known datasets to measure the robustness of clustering solutions based on the Euclidean distance, the Pearson correlation and a normalized MI measure. Steuer et al. (2002) [
16] and Daub et al. (2004) [
18] investigate the use of MI as a distance measure for gene expression data. They also focus on the comparison between the MI and the Pearson correlation measures.
Most of the above papers, with the exception of Daub et al (2004) [
18], mention the similarity between the MI measure and the conventional ones. In particular, Michaels et al. (1998) [
17] indicate that the Euclidean distance and the MI measure have a high degree of correspondence. Steuer et al. (2002) [
16] conclude that within the investigated dataset there seems to be almost a one-to-one correspondence between the MI and the Pearson correlation measures. A similar observation is supported in this study by finding a high correspondence level in the behaviour (e.g., trends) of average scores that are based on different distance measures. Nevertheless, this study shows that within the analyzed datasets, the MI-based scores better differentiate among clustering solutions of different quality when compared to the other distance measures.
This paper proposes a procedure to evaluate the MI between gene expression patterns. Consequently, by using several public gene expression datasets, it compares the MI measure with respect to both the Euclidean distance and the Pearson correlation. The comparison includes a consistency examination upon clustering solutions of different quality in terms of the number of errors. The clustering is carried out by using normalized homogeneity and separation functions that provide a uniform scale for the examination. The results of the first experiment clearly show that the MI outperforms the conventional measures by yielding a more significant differentiation among clustering solutions. Next, the paper employs the MI measure to evaluate the solutions of four recognized clustering algorithms over a yeast cell-cycle database [
19]. This known database has been traditionally used to examine various algorithms and techniques for gene expression analysis [
5,
20,
21]. The results show that the sIB algorithm [
32,
33], which is originally based on a mutual-information criterion, obtains better MI-based homogeneity and separation scores than those provided by the K-means, the CLICK and the SOM algorithms [
5,
21]. These results totally change when evaluating the same solutions by the Pearson correlation based homogeneity and separation scores.
The remainder of paper is organized as follows. The Results section describes two experiments: the first experiment compares the robustness of the distance measures and the second experiment evaluates the solutions of known clustering algorithms by both the MI based scores and the Pearson correlation based scores. The Discussion and Conclusion sections follow the Results section. The Methods section addresses the compared distance measures and their implementation to clustering; the assessment of the quality of the clustering solutions, and the compared clustering algorithms.