Gene expression microarray is a highly popular technology that allows genome-wide measurement of RNA expression levels in a highly quantitative manner. Gene expression data is typically arranged as an
m ×
n data matrix, with rows corresponding to genes and columns corresponding to experimental conditions. Conditions can be different environmental conditions or different time points corresponding to one or more environmental conditions. The (
m,
n)
th entry of the gene expression matrix represents the expression level of the gene corresponding to row
m under the specific condition corresponding to column
n. The numerical value of the entry is usually the logarithm of the relative amount of the mRNA of the gene under the specific condition. By analyzing the gene expression data, we can potentially determine which genes behave in similar ways, how genes interact, which genes contribute to the same pathway, and so on. The similarity between the genes can be analyzed by clustering the gene expression data. Cluster analysis plays an important role in the microarray data analysis through the grouping of genes into subsets with similar expression patterns or similar function. However, clustering has its limitations. Mainly, clustering works on the assumption that related genes behave similarly across all measured conditions. But a general understanding of cellular processes expects subsets of genes to be co-regulated and co-expressed only under certain experimental conditions, but behaves almost independently under other conditions [
1]. In order to overcome the above shortcoming of clustering, the concept of biclustering is applied to gene expression data.
Biclustering was first described in the literature by Hartigan [
2]. It refers to a distinct class of clustering algorithms that perform simultaneous row column clustering. Cheng and Church [
3] were the first to apply it to gene expression data. Biclustering identifies subsets of genes and subsets of conditions that shares similar expression patterns, by performing simultaneous clustering of both rows and columns of the gene expression matrix. As a result, homogeneous sub matrices of the gene expression matrix are obtained and they are known as biclusters. In [
3], Cheng and Church proposed a similarity score called
mean squared residue score as a measure of coherence of the rows and columns in the bicluster. When all the elements in a bicluster are similar, the mean squared residue score is low. The lower the score, stronger the coherence exhibited by the bicluster, and better is the quality of the bicluster. However, from a biological point of view, the interest resides in biclusters with subset of genes showing similar behaviour and not with similar values. Hence the method aims at finding large and maximal biclusters with mean squared residue score below a certain threshold, δ, and the biclusters thus obtained are called δ-biclusters. The value of δ has to be estimated in advance, and it is different for every dataset [
4].
In [
3], Cheng and Church identified the problem of finding significant biclusters as being NP-Hard and employed a greedy node deletion algorithm in their search. Greedy search algorithms start with an initial solution and find a locally optimal solution by successive transformations that improve some cost function. The survey of biclustering algorithms for biological data analysis by Madeira and Oliveira [
1] also identifies greedy search algorithms as a promising area. But greedy algorithms always make a choice that maximizes the local gain in the hope that this choice will lead to a globally good solution. It may make wrong decisions, gets stuck into local optima and thereby loose good biclusters. Metaheuristics technique such as Greedy Randomized Adaptive Search Procedure (GRASP) improves on pure greedy search due to their potential to escape from local minima.
In this work, we address the biclustering problem with a variant of the GRASP metaheuristics, Reactive GRASP, which is a combination of a semi-greedy heuristics and a local search procedure [
5-
7]. The approach starts from small high quality bicluster seeds, which are tightly co-regulated submatrices of the gene expression matrix. These seeds are further enlarged by adding more rows and columns to them. The seed generation phase is implemented using one-dimensional k-means clustering and the seeds are enlarged using the Reactive GRASP method. The algorithm makes use of
mean squared residue score as the cost function to evaluate the quality of the obtained biclusters. To avoid getting stuck at local optima, Reactive GRASP is equipped with the heuristics for randomizing the search and thereby allows the search process to go beyond local optima. We evaluated our work against the classic work of Cheng and Church and also with our own work based on basic GRASP [
8]. The results show that the Reactive GRASP method can generate larger and better biclusters than the others.