In functional analysis of gene expression data, biclustering methods can provide crucial information by revealing correlated gene expression patterns under a subset of conditions. However, conventional biclustering algorithms are still limited in their ability to produce comprehensive and stable outputs.
We propose a novel biclustering approach called “BIclustering by Correlated and Large number of Individual Clustered seeds (BICLIC)” to find comprehensive sets of correlated expression patterns in biclusters using clustered seeds and their expansion with correlation of gene expression. BICLIC outperformed competing biclustering algorithms by completely recovering implanted biclusters in simulated datasets with various types of correlated patterns: shifting, scaling, and shifting-scaling. Furthermore, in a real yeast microarray dataset and a lung cancer microarray dataset, BICLIC found more comprehensive sets of biclusters that are significantly enriched to more diverse sets of biological terms than those of other competing biclustering algorithms.
BICLIC provides significant benefits in finding comprehensive sets of correlated patterns and their functional implications from a gene expression dataset.
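The three pattern types named above (shifting, scaling, and shifting-scaling) can be illustrated with a small sketch. It is not part of BICLIC; the profile values and the constants 3.0 and 2.5 are hypothetical. It only shows why correlation is a natural criterion here: a profile that is shifted, scaled by a positive factor, or both keeps a Pearson correlation of 1 with the original.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profile of one gene over five conditions.
base = [1.0, 4.0, 2.0, 8.0, 5.0]

patterns = {
    "shifting": [v + 3.0 for v in base],                # additive offset
    "scaling": [2.5 * v for v in base],                 # multiplicative factor
    "shifting-scaling": [2.5 * v + 3.0 for v in base],  # both combined
}

# Each transformed profile stays perfectly correlated with the base profile.
correlations = {name: pearson(base, prof) for name, prof in patterns.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

A negative scaling factor would instead give a correlation of -1, which is why some methods work with the absolute value of the coefficient.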
In a number of domains, like in DNA microarray data analysis, we need to cluster simultaneously rows (genes) and columns (conditions) of a data matrix to identify groups of rows coherent with groups of columns. This kind of clustering is called biclustering. Biclustering algorithms are extensively used in DNA microarray data analysis. More effective biclustering algorithms are highly desirable and needed.
We introduce BiMine, a new enumeration algorithm for biclustering of DNA microarray data. The proposed algorithm is based on three original features. First, BiMine relies on a new evaluation function called Average Spearman's rho (ASR). Second, BiMine uses a new tree structure, called Bicluster Enumeration Tree (BET), to represent the different biclusters discovered during the enumeration process. Third, to avoid the combinatorial explosion of the search tree, BiMine introduces a parametric rule that allows the enumeration process to cut tree branches that cannot lead to good biclusters.
The performance of the proposed algorithm is assessed using both synthetic and real DNA microarray data. The experimental results show that BiMine competes well with several other biclustering methods. Moreover, we test the biological significance using a gene annotation web-tool to show that our proposed method is able to produce biologically relevant biclusters. The software is available upon request from the authors to academic users.
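As a rough illustration of a rank-based evaluation function such as ASR, the sketch below averages Spearman's rho over all pairs of rows and over all pairs of columns of a candidate bicluster and keeps the larger of the two values. This is one plausible reading of "Average Spearman's rho", stated here as an assumption; the exact definition is given in the BiMine paper. The function names and the example matrix are hypothetical.

```python
from itertools import combinations
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mean_pairwise_rho(vectors):
    pairs = list(combinations(vectors, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)

def asr(matrix):
    """Assumed form: max of the mean pairwise rho over rows and over columns."""
    columns = [list(col) for col in zip(*matrix)]
    return max(mean_pairwise_rho(matrix), mean_pairwise_rho(columns))

# Hypothetical bicluster: every row rises across the conditions.
bicluster = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 9.0],
    [10.0, 20.0, 25.0, 40.0],
]
score = asr(bicluster)
print(round(score, 3))
```

Because only ranks are used, the score ignores the magnitude of expression values and responds only to whether genes (or conditions) rise and fall together. A constant row or column has zero rank variance and would need special handling, which this sketch omits.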
Biclustering, or the discovery of subsets of samples and genes that are homogeneous and distinct from the background, has become an important technique in analyzing current microarray datasets. Most existing biclustering methods define a bicluster type as a fixed (predefined) pattern and then try to find it through some search process. In this work, we propose a novel method for finding biclusters or 2-dimensional patterns that are significantly distinct from the background without the need for pre-defining a pattern within the bicluster. The method, named Distinct 2-Dimensional Pattern Finder (D2D), is composed of an iterative reordering step of the rows and columns in the matrix using a new similarity measure, and a flexible scanning-and-growing step to identify the biclusters. Experiments on a large variety of simulation data show that the method works consistently well under different conditions, whereas the existing methods compared may work well under certain conditions but fail under others. The impact of noise levels, the degree of overlap between clusters and different parameter settings was also investigated, which indicated that the D2D method is robust against these factors. The proposed D2D method can efficiently discover many different types of biclusters provided that they have distinctive features from the background. The computer program is available upon request.
gene expression matrices; simulation; biclusters; Distinct 2-Dimensional (D2D); noise
Biclustering extends the traditional clustering techniques by attempting to find (all) subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. Still the real power of this clustering strategy is yet to be fully realized due to the lack of effective and efficient algorithms for reliably solving the general biclustering problem. We report a QUalitative BIClustering algorithm (QUBIC) that can solve the biclustering problem in a more general form, compared to existing algorithms, through employing a combination of qualitative (or semi-quantitative) measures of gene expression data and a combinatorial optimization technique. One key unique feature of the QUBIC algorithm is that it can identify all statistically significant biclusters including biclusters with the so-called ‘scaling patterns’, a problem considered to be rather challenging; another key unique feature is that the algorithm solves such general biclustering problems very efficiently, capable of solving biclustering problems with tens of thousands of genes under up to thousands of conditions in a few minutes of the CPU time on a desktop computer. We have demonstrated a considerably improved biclustering performance by our algorithm compared to the existing algorithms on various benchmark sets and data sets of our own. QUBIC was written in ANSI C and tested using GCC (version 4.1.2) on Linux. Its source code is available at: http://csbl.bmb.uga.edu/∼maqin/bicluster. A server version of QUBIC is also available upon request.
Biclustering is an important analysis procedure to understand the biological mechanisms from microarray gene expression data. Several algorithms have been proposed to identify biclusters, but very little effort was made to compare the performance of different algorithms on real datasets and combine the resultant biclusters into one unified ranking.
In this paper we propose a differential co-expression framework and a differential co-expression scoring function to objectively quantify the quality or goodness of a bicluster of genes, based on the observation that genes in a bicluster are co-expressed in the conditions belonging to the bicluster and not co-expressed in the other conditions. Furthermore, we propose a scoring function to stratify biclusters into three types of co-expression. We used the proposed scoring functions to understand the performance and behavior of four well-established biclustering algorithms on six real datasets from different domains by combining their output into one unified ranking.
The differential co-expression framework provides a quantitative and objective assessment of the goodness of biclusters of co-expressed genes and of the performance of biclustering algorithms in identifying co-expression biclusters. It also helps to combine the biclusters output by different algorithms into one unified ranking, i.e. meta-biclustering.
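The observation behind the framework (genes are co-expressed inside the bicluster's conditions and not outside them) can be sketched as a simple contrast. The score below is an illustrative stand-in and not the scoring function proposed in the paper: it takes the mean pairwise Pearson correlation of the genes over the bicluster's conditions and subtracts the same quantity over the remaining conditions. All names and data values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_pairwise_correlation(profiles):
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def differential_coexpression(matrix, genes, conditions):
    """Illustrative score: co-expression inside minus co-expression outside."""
    all_conditions = range(len(matrix[0]))
    outside = [c for c in all_conditions if c not in conditions]
    inside_profiles = [[matrix[g][c] for c in conditions] for g in genes]
    outside_profiles = [[matrix[g][c] for c in outside] for g in genes]
    return (mean_pairwise_correlation(inside_profiles)
            - mean_pairwise_correlation(outside_profiles))

# Hypothetical data: 3 genes x 8 conditions. The genes move together over
# conditions 0-3 and show no common trend over conditions 4-7.
expression = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 4.0, 2.0],
    [2.0, 4.0, 6.0, 8.0, 1.0, 5.0, 2.0, 4.0],
    [0.0, 1.0, 2.0, 3.0, 3.0, 3.0, 1.0, 5.0],
]
score = differential_coexpression(expression, genes=[0, 1, 2], conditions=[0, 1, 2, 3])
print(round(score, 3))
```

A score near zero would indicate that the genes are about as correlated outside the bicluster as inside it, so the selected conditions add little.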
An important analysis performed on microarray gene-expression data is to discover biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations, such as an inability to explicitly handle errors/noise in the data; difficulty in discovering small biclusters due to their top-down approach; and the inability of some approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining also produces biclusters as its result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which limits its applicability to real-life data sets where the biclusters may be fragmented due to random noise/errors. Moreover, as it only works with binary or boolean attributes, its application to gene-expression data requires transforming real-valued attributes to binary attributes, which often results in loss of information. Many past approaches have tried to address the issues of noise and of handling real-valued attributes independently, but there is no systematic approach that addresses both issues together.
In this paper, we first propose a novel error-tolerant biclustering model, ‘ET-bicluster’, and then propose a bottom-up heuristic-based mining algorithm to sequentially discover error-tolerant biclusters directly from real-valued gene-expression data. The efficacy of our proposed approach is illustrated by comparing it with a recent approach, RAP, in the context of two biological problems: discovery of functional modules and discovery of biomarkers. For the first problem, two real-valued S. cerevisiae microarray gene-expression data sets are used to demonstrate that the biclusters obtained from the ET-bicluster approach not only recover a larger set of genes than those obtained from the RAP approach but also have higher functional coherence as evaluated using GO-based functional enrichment analysis. Two randomization tests, used to estimate the statistical significance of the discovered error-tolerant biclusters, reveal that they are indeed biologically meaningful and statistically significant. For the second problem of biomarker discovery, we used four real-valued breast cancer microarray gene-expression data sets and evaluated the biomarkers obtained using MSigDB gene sets.
The results obtained for both problems, functional module discovery and biomarker discovery, clearly demonstrate the usefulness of the proposed ET-bicluster approach and illustrate the importance of explicitly incorporating noise/errors in discovering coherent groups of genes from gene-expression data.
Motivation: Biclustering of transcriptomic data groups genes and samples simultaneously. It is emerging as a standard tool for extracting knowledge from gene expression measurements. We propose a novel generative approach for biclustering called ‘FABIA: Factor Analysis for Bicluster Acquisition’. FABIA is based on a multiplicative model, which accounts for linear dependencies between gene expression and conditions, and also captures heavy-tailed distributions as observed in real-world transcriptomic data. The generative framework allows the use of well-founded model selection methods and the application of Bayesian techniques.
Results: On 100 simulated datasets with known true, artificially implanted biclusters, FABIA clearly outperformed all 11 competitors. On these datasets, FABIA was able to separate spurious biclusters from true biclusters by ranking biclusters according to their information content. FABIA was tested on three microarray datasets with known subclusters, where it was the best method twice and the second best once among the compared biclustering approaches.
Availability: FABIA is available as an R package on Bioconductor (http://www.bioconductor.org). All datasets, results and software are available at http://www.bioinf.jku.at/software/fabia/fabia.html
Supplementary information: Supplementary data are available at Bioinformatics online.
Identifying a regulatory module (RM), a bi-set of co-regulated genes and co-regulating conditions (or samples), has been an important challenge in functional genomics and bioinformatics. Given a microarray gene-expression matrix, biclustering has been the most common method for extracting RMs. Among biclustering methods, order-preserving biclustering by a sequential pattern mining technique has a natural advantage over conventional biclustering approaches, since it preserves the order of genes (or conditions) according to the magnitude of the expression value. However, previous sequential pattern mining-based biclustering methods have several weak points: they can easily become computationally intractable on microarray data of realistic size, and they are sensitive to inherent noise in the expression values.
In this paper, we propose a novel sequential pattern mining algorithm that is scalable in the size of microarray data and robust with respect to noise. When applied to the microarray data of yeast, the proposed algorithm successfully found long order-preserving patterns, which are biologically significant but cannot be found in randomly shuffled data. The resulting patterns are well enriched to known annotations and are consistent with known biological knowledge. Furthermore, RMs as well as inter-module relations were inferred from the biologically significant patterns.
Our approach for identifying RMs could be valuable for systematically revealing the mechanism of gene regulation at a genome-wide level.
The ability to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses using gene expression time series, obtained from microarray experiments, is critical to advance our understanding of complex biological processes. In this context, biclustering algorithms have been recognized as an important tool for the discovery of local expression patterns, which are crucial to unravel potential regulatory mechanisms. Although most formulations of the biclustering problem are NP-hard, when working with time series expression data the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the design of efficient biclustering algorithms able to identify all maximal contiguous column coherent biclusters.
In this work, we propose e-CCC-Biclustering, a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series gene expression matrix. This polynomial time complexity is achieved by manipulating a discretized version of the original matrix using efficient string processing techniques. We also propose extensions to deal with missing values, discover anticorrelated and scaled expression patterns, and different ways to compute the errors allowed in the expression patterns. We propose a scoring criterion combining the statistical significance of expression patterns with a similarity measure between overlapping biclusters.
We present results in real data showing the effectiveness of e-CCC-Biclustering and its relevance in the discovery of regulatory modules describing the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress. In particular, the results show the advantage of considering approximate patterns when compared to state of the art methods that require exact matching of gene expression time series.
The identification of co-regulated genes, involved in specific biological processes, remains one of the main avenues open to researchers studying gene regulatory networks. The ability of the proposed methodology to efficiently identify sets of genes with similar expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific regulatory mechanisms.
A prototype implementation of the algorithm coded in Java together with the dataset and examples used in the paper is available in .
Biclustering algorithms belong to a distinct class of clustering algorithms that perform simultaneous clustering of both rows and columns of the gene expression matrix and can be a very useful analysis tool when some genes have multiple functions and experimental conditions are diverse. Cheng and Church introduced a measure called the mean squared residue score to evaluate the quality of a bicluster, and it has become one of the most popular measures used to search for biclusters. In this paper, we review the basic concepts of the metaheuristic Greedy Randomized Adaptive Search Procedure (GRASP), namely its construction and local search phases, and propose a new method, a variant of GRASP called Reactive Greedy Randomized Adaptive Search Procedure (Reactive GRASP), to detect significant biclusters in large microarray datasets. The method has two major steps. First, high-quality bicluster seeds are generated by means of k-means clustering. In the second step, these seeds are grown using Reactive GRASP, in which the basic parameter that defines the restrictiveness of the candidate list is self-adjusted, depending on the quality of the solutions found previously.
We performed statistical and biological validations of the biclusters obtained and evaluated the method against the results of basic GRASP as well as the classic work of Cheng and Church. The experimental results indicate that the Reactive GRASP approach outperforms both the basic GRASP algorithm and the Cheng and Church approach.
The Reactive GRASP approach for the detection of significant biclusters is robust and does not require calibration efforts.
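The mean squared residue score of Cheng and Church mentioned above can be written down compactly. For a submatrix with entries a_ij, the residue of a cell is a_ij minus its row mean minus its column mean plus the overall mean, and the score is the mean of the squared residues. The sketch below computes it in plain Python; the function name and the two example matrices are hypothetical, and the seed generation and GRASP search themselves are not shown.

```python
def mean_squared_residue(matrix):
    """Cheng and Church mean squared residue of a bicluster (list of rows)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(sum(row) for row in matrix) / (n_rows * n_cols)
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = matrix[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)

# Rows differ by a constant offset (shifting pattern): the score is zero.
additive = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

# Rows differ by a constant factor (scaling pattern): the score is positive.
multiplicative = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 8.0],
]

print(mean_squared_residue(additive))
print(mean_squared_residue(multiplicative))
```

The second example shows a known limitation of this measure: it is zero for perfect shifting patterns but not for perfect scaling patterns, which is one reason several of the other methods described here use correlation-based measures instead.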
The analysis of massive high throughput data via clustering algorithms is very important for elucidating gene functions in biological systems. However, traditional clustering methods have several drawbacks. Biclustering overcomes these limitations by grouping genes and samples simultaneously. It discovers subsets of genes that are co-expressed in certain samples. Recent studies showed that biclustering has a great potential in detecting marker genes that are associated with certain tissues or diseases. Several biclustering algorithms have been proposed. However, it is still a challenge to find biclusters that are significant based on biological validation measures. Besides that, there is a need for a biclustering algorithm that is capable of analyzing very large datasets in reasonable time.
Here we present a fast biclustering algorithm called DeBi (Differentially Expressed BIclusters). The algorithm is based on a well known data mining approach called frequent itemset. It discovers maximum size homogeneous biclusters in which each gene is strongly associated with a subset of samples. We evaluate the performance of DeBi on a yeast dataset, on synthetic datasets and on human datasets.
We demonstrate that the DeBi algorithm provides functionally more coherent gene sets compared to standard clustering or biclustering algorithms using biological validation measures such as Gene Ontology term and Transcription Factor Binding Site enrichment. We show that DeBi is a computationally efficient and powerful tool in analyzing large datasets. The method is also applicable on multiple gene expression datasets coming from different labs or platforms.
In DNA microarray experiments, discovering groups of genes that share similar transcriptional characteristics is instrumental in functional annotation, tissue classification and motif identification. However, in many situations a subset of genes only exhibits consistent pattern over a subset of conditions. Conventional clustering algorithms that deal with the entire row or column in an expression matrix would therefore fail to detect these useful patterns in the data. Recently, biclustering has been proposed to detect a subset of genes exhibiting consistent pattern over a subset of conditions. However, most existing biclustering algorithms are based on searching for sub-matrices within a data matrix by optimizing certain heuristically defined merit functions. Moreover, most of these algorithms can only detect a restricted set of bicluster patterns.
In this paper, we present a novel geometric perspective for the biclustering problem. The biclustering process is interpreted as the detection of linear geometries in a high dimensional data space. Such a new perspective views biclusters with different patterns as hyperplanes in a high dimensional space, and allows us to handle different types of linear patterns simultaneously by matching a specific set of linear geometries. This geometric viewpoint also inspires us to propose a generic bicluster pattern, i.e. the linear coherent model that unifies the seemingly incompatible additive and multiplicative bicluster models. As a particular realization of our framework, we have implemented a Hough transform-based hyperplane detection algorithm. The experimental results on human lymphoma gene expression dataset show that our algorithm can find biologically significant subsets of genes.
We have proposed a novel geometric interpretation of the biclustering problem. We have shown that many common types of bicluster are just different spatial arrangements of hyperplanes in a high dimensional data space. An implementation of the geometric framework using the Fast Hough transform for hyperplane detection can be used to discover biologically significant subsets of genes under subsets of conditions for microarray data analysis.
Several biclustering algorithms have been proposed to identify biclusters, in which genes share similar expression patterns across a number of conditions. However, different algorithms would yield different biclusters and further lead to distinct conclusions. Therefore, some testing and comparisons between these algorithms are strongly required.
In this study, five biclustering algorithms (i.e. BIMAX, FABIA, ISA, QUBIC and SAMBA) were compared on two expression datasets of different dimensions (GDS1620 and pathway) from Arabidopsis thaliana (A. thaliana).
GO (gene ontology) annotation and PPI (protein-protein interaction) network were used to verify the corresponding biological significance of biclusters from the five algorithms. To compare the algorithms’ performance and evaluate quality of identified biclusters, two scoring methods, namely weighted enrichment (WE) scoring and PPI scoring, were proposed in our study. For each dataset, after combining the scores of all biclusters into one unified ranking, we could evaluate the performance and behavior of the five biclustering algorithms in a better way.
Both the WE and PPI scoring methods proved effective in validating the biological significance of the biclusters, and a significant positive correlation between the two sets of scores demonstrated the consistency of the two methods.
A comparative study of the above five algorithms has revealed that: (1) ISA is the most effective of the five algorithms on the GDS1620 dataset, and BIMAX outperforms the other algorithms on the pathway dataset. (2) Both ISA and BIMAX are data-dependent. The former does not work well on datasets with few genes, while the latter works well on datasets with more conditions. (3) FABIA and QUBIC perform poorly in this study and may be better suited to large datasets with more genes and more conditions. (4) SAMBA is data-independent, as it performs well on both given datasets. The comparison results provide useful information for researchers to choose a suitable algorithm for each given dataset.
Biclustering algorithms for microarray data aim at discovering functionally related gene sets under different subsets of experimental conditions. Due to the problem complexity and the characteristics of microarray datasets, heuristic searches are usually used instead of exhaustive algorithms. Also, the comparison among different techniques is still a challenge. The obtained results vary in relevant features such as the number of genes or conditions, which makes it difficult to carry out a fair comparison. Moreover, existing approaches do not allow the user to specify any preferences on these properties.
Here, we present the first biclustering algorithm in which several bicluster features can be specified in terms of different objectives. This can be done by tuning the specified features in the algorithm or by incorporating new objectives into the search. Furthermore, our approach bases the bicluster evaluation on expression patterns, and can recognize both shifting and scaling patterns, either simultaneously or separately. Evolutionary computation has been chosen as the search strategy, hence the name of our proposal, Evo-Bexpa (Evolutionary Biclustering based in Expression Patterns).
We have conducted experiments on both synthetic and real datasets demonstrating Evo-Bexpa's ability to obtain meaningful biclusters. Synthetic experiments have been designed in order to compare Evo-Bexpa's performance with other approaches when looking for perfect patterns. Experiments with four different real datasets also confirm the proper performance of our algorithm, whose results have been biologically validated through Gene Ontology.
Gene expression data analysis; Shifting and scaling expression patterns; Evolutionary biclustering
The advent of microarray technology has revolutionized the search for genes that are differentially expressed across a range of cell types or experimental conditions. Traditional clustering methods, such as hierarchical clustering, are often difficult to deploy effectively since genes rarely exhibit similar expression patterns across a wide range of conditions. Biclustering of gene expression data (also called co-clustering or two-way clustering) is a non-trivial but promising methodology for the identification of gene groups that show a coherent expression profile across a subset of conditions. Thus, biclustering is a natural methodology for screening for genes that are functionally related, participate in the same pathways, are affected by the same drug or pathological condition, or form modules that are potentially co-regulated by a small group of transcription factors. We have developed a web-enabled service called GEMS (Gene Expression Mining Server) for biclustering microarray data. Users may upload expression data and specify a set of criteria. GEMS then performs bicluster mining based on a Gibbs sampling paradigm. The web server provides a flexible and useful platform for the discovery of co-expressed and potentially co-regulated gene modules. GEMS is open source software and is available at .
Biclustering of gene expression data searches for local patterns of gene expression. A bicluster (or a two-way cluster) is defined as a set of genes whose expression profiles are mutually similar within a subset of experimental conditions/samples. Although several biclustering algorithms have been studied, few are based on rigorous statistical models.
We developed a Bayesian biclustering model (BBC), and implemented a Gibbs sampling procedure for its statistical inference. We showed that the Bayesian biclustering model can correctly identify multiple clusters of gene expression data. Using simulated data, both generated from the model and with realistic characteristics, we demonstrated that the BBC algorithm outperforms other methods in both robustness and accuracy. We also showed that the model is stable for two normalization methods, the interquartile range normalization and the smallest quartile range normalization. Applying the BBC algorithm to the yeast expression data, we observed that the majority of the biclusters we found are supported by significant biological evidence, such as enrichment of gene functions and transcription factor binding sites in the corresponding promoter sequences.
The BBC algorithm is shown to be a robust model-based biclustering method that can discover biologically significant gene-condition clusters in microarray data. The BBC model can easily handle missing data via Monte Carlo imputation and has the potential to be extended to integrated study of gene transcription networks.
With the dramatic increase in microarray data, biclustering has become a promising tool for gene expression analysis. Biclustering has been shown to be superior to clustering in identifying multifunctional genes and searching for co-expressed genes under a few specific conditions; that is, a subgroup of all conditions. Biclustering based on a genetic algorithm (GA) has shown better performance than greedy algorithms, but the overlap state for biclusters must be treated more systematically.
We developed a new biclustering algorithm (binary-iterative genetic algorithm [BIGA]), based on an iterative GA, by introducing a novel, ternary-digit chromosome encoding function. BIGA searches for a set of biclusters by iterative binary divisions that allow the overlap state to be explicitly considered. In addition, the average of the Pearson’s correlation coefficient was employed to measure the relationship of genes within a bicluster, instead of the mean square residual, the popular classical index. As compared to the six existing algorithms, BIGA found highly correlated biclusters, with large gene coverage and reasonable gene overlap. The gene ontology (GO) enrichment showed that most of the biclusters are significant, with at least one GO term over represented.
BIGA is a powerful tool to analyze large amounts of gene expression data, and will facilitate the elucidation of the underlying functional mechanisms in living organisms.
biclustering; microarray data; genetic algorithm; Pearson’s correlation coefficient
DNA microarray technology allows the measurement of expression levels of thousands of genes under tens or hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only. Thus, biclustering, which clusters genes and conditions simultaneously, is preferred over traditional clustering techniques in discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of bicluster, the additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis.
We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as pCluster algorithm. By relaxing the constraint in homogeneity, the proposed algorithm has polynomial-time complexity in the worst case instead of exponential-time complexity as in the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. Comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm.
We have proposed a novel biclustering algorithm which works with PC plots for an interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis enables an optimum parameter determination in the biclustering algorithm so as to achieve the best result. In future, we will modify the proposed algorithm for other bicluster models such as the coherent evolution model.
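The homogeneity constraint of the original pCluster formulation, which the proposed greedy algorithm relaxes, can be sketched as follows. In that formulation the pScore of a 2x2 submatrix is the absolute difference between the two row-wise differences, and a submatrix is a delta-pCluster when every 2x2 submatrix has a pScore of at most delta. The exhaustive check below is for illustration only and is not the greedy algorithm described above; the function names and data are hypothetical.

```python
from itertools import combinations

def pscore(a, b, c, d):
    """pScore of the 2x2 submatrix [[a, b], [c, d]]."""
    return abs((a - b) - (c - d))

def max_pscore(matrix):
    """Largest pScore over all 2x2 submatrices (pairs of rows and columns)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    worst = 0.0
    for r1, r2 in combinations(range(n_rows), 2):
        for c1, c2 in combinations(range(n_cols), 2):
            score = pscore(matrix[r1][c1], matrix[r1][c2],
                           matrix[r2][c1], matrix[r2][c2])
            worst = max(worst, score)
    return worst

def is_delta_pcluster(matrix, delta):
    return max_pscore(matrix) <= delta

# Perfect additive bicluster: each row is another row plus a constant.
perfect = [
    [1.0, 3.0, 6.0],
    [4.0, 6.0, 9.0],
    [0.0, 2.0, 5.0],
]

# Same bicluster with one perturbed entry (9.0 -> 9.5).
noisy = [
    [1.0, 3.0, 6.0],
    [4.0, 6.0, 9.5],
    [0.0, 2.0, 5.0],
]

print(max_pscore(perfect), max_pscore(noisy))
print(is_delta_pcluster(noisy, 0.2), is_delta_pcluster(noisy, 0.5))
```

Checking every pair of rows and columns is affordable for one candidate submatrix, but enumerating all submatrices that satisfy the constraint is what leads to the exponential worst case mentioned above.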
Microarray analysis is an important area of bioinformatics. In the last few years, biclustering has become one of the most popular methods for classifying data from microarrays. Although biclustering can be used in any kind of classification problem, nowadays it is mostly used for microarray data classification. A large number of biclustering algorithms have been developed over the years, however little effort has been devoted to the representation of the results.
We present an interactive framework that helps to infer differences or similarities between biclustering results, to unravel trends and to highlight robust groupings of genes and conditions. These linked representations of biclusters can complement biological analysis and reduce the time spent by specialists on interpreting the results. Within the framework, besides other standard representations, a visualization technique is presented which is based on a force-directed graph where biclusters are represented as flexible overlapped groups of genes and conditions. This microarray analysis framework (BicOverlapper), is available at
The main visualization technique, tested with different biclustering results on a real dataset, allows researchers to extract interesting features of the biclustering results, especially the highlighting of overlapping zones that usually represent robust groups of genes and/or conditions. The visual analytics methodology will permit biology experts to study biclustering results without inspecting an overwhelming number of biclusters individually.
New microarray technologies yield large-scale datasets. Microarray datasets are usually presented as 2D matrices, where rows represent genes and columns represent experimental conditions. Systematic analysis of those datasets provides an increasing amount of information, which is urgently needed in the post-genomic era. Biclustering, a technique developed to allow simultaneous clustering of the rows and columns of a dataset, might be useful to extract more accurate information from those datasets. Biclustering requires the optimization of two conflicting objectives (residue and volume), which can be addressed with a multi-objective artificial immune system capable of performing a multi-population search. As a heuristic search technique, artificial immune systems (AISs) can be considered a new computational paradigm inspired by the immunological system of vertebrates and designed to solve a wide range of optimization problems. During biclustering, several objectives in conflict with each other have to be optimized simultaneously, so a multi-objective optimization model is suitable for solving the biclustering problem.
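The conflict between the two objectives can be seen on a small example. The sketch below assumes that "residue" refers to the mean squared residue of Cheng and Church and that "volume" is the number of cells in the bicluster; the text does not spell out either definition, and the function names and toy matrix are illustrative only.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix (rows, cols); assumed definition."""
    n, m = len(rows), len(cols)
    row_mean = {i: sum(matrix[i][j] for j in cols) / m for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / n for j in cols}
    overall = sum(row_mean.values()) / n
    total = sum(
        (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
        for i in rows
        for j in cols
    )
    return total / (n * m)

def volume(rows, cols):
    """Number of cells covered by the bicluster."""
    return len(rows) * len(cols)

# Toy expression matrix (invented): the first three columns are coherent,
# the fourth column is not.
expr = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 1.0],
    [3.0, 4.0, 5.0, 7.0],
]

small = ([0, 1, 2], [0, 1, 2])
large = ([0, 1, 2], [0, 1, 2, 3])
for rows, cols in (small, large):
    print(volume(rows, cols), mean_squared_residue(expr, rows, cols))
```

Adding the fourth column increases the volume but also raises the residue, so neither bicluster is better on both objectives.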
Based on dynamic population, this paper proposes a novel dynamic multi-objective immune optimization biclustering (DMOIOB) algorithm to mine coherent patterns from microarray data. Experimental results on two common and public datasets of gene expression profiles show that our approach can effectively find significant localized structures related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The mined patterns present a significant biological relevance in terms of related biological processes, components and molecular functions in a species-independent manner.
The proposed DMOIOB algorithm is an efficient tool for analyzing large microarray datasets. It achieves good diversity and rapid convergence.
Accumulated biological research outcomes show that biological functions do not depend on individual genes, but on complex gene networks. Microarray data are widely used to cluster genes according to their expression levels across experimental conditions. However, functionally related genes generally do not show coherent expression across all conditions since any given cellular process is active only under a subset of conditions. Biclustering finds gene clusters that have similar expression levels across a subset of conditions. This paper proposes a seed-based algorithm that identifies coherent genes in an exhaustive, but efficient manner.
In order to find the biclusters in a gene expression dataset, we exhaustively select combinations of genes and conditions as seeds to create candidate bicluster tables. The tables have two columns: (a) a gene set, and (b) the conditions on which the gene set has dissimilar expression levels to the seed. First, the genes with fewer than the maximum number of dissimilar conditions are identified and a table of these genes is created. Second, the rows that have the same dissimilar conditions are grouped together. Third, the table is sorted in ascending order based on the number of dissimilar conditions. Finally, beginning with the first row of the table, a test is run repeatedly to determine whether the cardinality of the gene set in the row is greater than the minimum threshold number of genes in a bicluster. If so, a bicluster is output and the corresponding row is removed from the table. Repeating this process, all biclusters in the table are systematically identified until the table becomes empty.
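The table-building steps above can be sketched for a single seed as follows. This is a simplified illustration under several assumptions that the text does not specify: the seed is one gene paired with one reference condition, a condition counts as dissimilar when the additive difference to the seed exceeds a tolerance `eps`, and a row that fails the size test is simply discarded. All names and the toy data are hypothetical.

```python
from collections import defaultdict

def seed_biclusters(matrix, seed_gene, seed_cond, eps, max_dissimilar, min_genes):
    """Build the candidate table for one seed and emit its biclusters."""
    n_conds = len(matrix[0])
    seed = matrix[seed_gene]
    table = defaultdict(list)  # dissimilar condition set -> gene set
    for g, row in enumerate(matrix):
        # Assumed additive similarity test relative to the reference condition.
        dissimilar = frozenset(
            c for c in range(n_conds)
            if abs((row[c] - row[seed_cond]) - (seed[c] - seed[seed_cond])) > eps
        )
        # Step 1: keep genes with fewer than the maximum number of dissimilar conditions.
        if len(dissimilar) < max_dissimilar:
            # Step 2: genes sharing the same dissimilar conditions form one table row.
            table[dissimilar].append(g)
    # Step 3: sort rows by ascending number of dissimilar conditions.
    ordered = sorted(table.items(), key=lambda kv: (len(kv[0]), sorted(kv[0])))
    # Step 4: consume rows in order, emitting those with enough genes.
    biclusters = []
    for dissimilar, genes in ordered:
        if len(genes) > min_genes:
            conds = [c for c in range(n_conds) if c not in dissimilar]
            biclusters.append((genes, conds))
    return biclusters

# Toy data (invented): gene 0 is the seed; genes 2-4 deviate on condition 3 only.
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [0.0, 1.0, 2.0, 9.0],
    [5.0, 6.0, 7.0, 1.0],
    [3.0, 4.0, 5.0, 2.0],
    [9.0, 1.0, 8.0, 2.0],
]

found = seed_biclusters(toy, seed_gene=0, seed_cond=0, eps=0.1,
                        max_dissimilar=2, min_genes=1)
print(found)
```

A full implementation would repeat this for every seed, which is where the exhaustive cost lies, and would need a rule for combining or discarding rows that do not reach the minimum size.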
This paper presents a novel biclustering algorithm for the identification of additive biclusters. Since it involves exhaustively testing combinations of genes and conditions, the additive biclusters can be found more readily.
Identifying a bicluster, or submatrix of a gene expression dataset wherein the genes express similar behavior over the columns, is useful for discovering novel functional gene interactions. In this article, we introduce a new algorithm for finding biClusters with Linear Patterns (CLiP). Instead of solely maximizing Pearson correlation, we introduce a fitness function that also considers the correlation of complementary genes and conditions. This eliminates the need for a priori determination of the bicluster size. We employ both greedy search and a genetic algorithm for optimization, incorporating resampling for more robust discovery. When applied to both real and simulated datasets, our results show that CLiP is superior to existing methods. In analyzing RNA-seq fly and worm time-course data from modENCODE, we uncover a set of similarly expressed genes suggesting maternal dependence. Supplementary Material is available online (at www.liebertonline.com/cmb).
algorithms; gene clusters; probability
Biclustering has emerged as an important approach to the analysis of large-scale datasets. A biclustering technique identifies a subset of rows that exhibit similar patterns on a subset of columns in a data matrix. Many biclustering methods have been proposed, and most, if not all, algorithms are developed to detect regions of “coherence” patterns. These methods perform unsatisfactorily if the purpose is to identify biclusters of a constant level. This paper presents a two-step biclustering method to identify constant level biclusters for binary or quantitative data. This algorithm identifies the maximal dimensional submatrix such that the proportion of non-signals is less than a pre-specified tolerance δ. In the analysis of two synthetic datasets, the proposed method has much higher sensitivity and slightly lower specificity than several prominent biclustering methods. It was further compared with the Bimax method on two real datasets. The proposed method was shown to be the most robust in terms of sensitivity, number of biclusters and number of serotype-specific biclusters identified. However, dichotomization using different signal level thresholds usually leads to different sets of biclusters; this also occurs in the present analysis.
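The acceptance criterion and its sensitivity to the dichotomization threshold can be illustrated with a short sketch. It covers only the tolerance test on a given submatrix, not the two-step search for the maximal submatrix; the function names, the toy values and the convention that values at or above the threshold count as signals are assumptions.

```python
def dichotomize(matrix, threshold):
    """Convert quantitative data to binary: 1 = signal, 0 = non-signal (assumed convention)."""
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]

def non_signal_proportion(binary, rows, cols):
    """Fraction of non-signal cells in the submatrix (rows, cols)."""
    zeros = sum(1 for i in rows for j in cols if binary[i][j] == 0)
    return zeros / (len(rows) * len(cols))

def within_tolerance(binary, rows, cols, delta):
    """True if the proportion of non-signals is below the tolerance delta."""
    return non_signal_proportion(binary, rows, cols) < delta

# Toy quantitative data (invented).
levels = [
    [0.9, 0.8, 0.7, 0.1],
    [0.8, 0.9, 0.4, 0.2],
    [0.7, 0.6, 0.9, 0.1],
]
rows, cols = [0, 1, 2], [0, 1, 2]

for threshold in (0.5, 0.75):
    binary = dichotomize(levels, threshold)
    print(threshold, within_tolerance(binary, rows, cols, delta=0.2))
```

The same candidate submatrix passes the tolerance test at one threshold and fails at the other, which is consistent with the observation that different signal level thresholds lead to different sets of biclusters.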
The analysis of data generated by microarray technology is very useful for understanding how genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes which are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as the merit function, but patterns that are interesting and relevant from a biological point of view, such as shifting and scaling patterns, may not be detected using this measure. It is nevertheless important to discover these types of patterns, since genes commonly present similar behavior although their expression levels vary in different ranges or magnitudes.
Scatter Search is an evolutionary technique that is based on the evolution of a small set of solutions which are chosen according to quality and diversity criteria. This paper presents a Scatter Search algorithm for finding biclusters in gene expression data. In this algorithm, the proposed fitness function is based on the linear correlation among genes in order to detect shifting and scaling patterns, and an improvement method is included in order to select only positively correlated genes.
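A brief sketch shows why linear correlation captures shifting and scaling patterns. It assumes the average pairwise Pearson correlation as the quality measure and a simple filter against a reference gene as the improvement step; the exact fitness function and improvement method are not given in the text, so these choices and all names are illustrative.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def average_correlation(genes):
    """Mean Pearson correlation over all gene pairs (assumed quality measure)."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)

def keep_positive(genes, reference):
    """Keep only genes positively correlated with the reference profile."""
    return [g for g in genes if pearson(g, reference) > 0]

# Toy profiles (invented) derived from one base pattern.
base = [1.0, 2.0, 4.0, 3.0]
shifted = [v + 5.0 for v in base]   # shifting pattern
scaled = [v * 2.0 for v in base]    # scaling pattern
inverted = [-v for v in base]       # negatively correlated

print(average_correlation([base, shifted, scaled]))
print(len(keep_positive([base, shifted, scaled, inverted], base)))
```

The shifted and scaled profiles are perfectly correlated with the base profile even though their expression ranges differ, while the inverted profile is filtered out.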
The proposed algorithm has been tested with three real datasets: the Yeast Cell Cycle dataset, the human B-cell lymphoma dataset and the Yeast Stress dataset, finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function is compared to that of CC, OPSM, ISA, BiMax, xMotifs and Samba using the Gene Ontology database.
Multi-objective optimization (MOO) involves optimization problems with multiple objectives. Generally, these objectives are used to estimate very different aspects of the solutions, and these aspects are often in conflict with each other. MOO first obtains a Pareto set, and then looks for both commonality and systematic variations across the set. For large-scale datasets, heuristic search algorithms such as evolutionary algorithms (EA) combined with MOO techniques are well suited. New DNA microarray technology can study the transcriptional response of a complete genome to different experimental conditions and yields many large-scale datasets. Biclustering can simultaneously cluster the rows and columns of a dataset, and helps to extract more accurate information from those datasets. Biclustering needs to optimize several conflicting objectives, and can be solved with MOO methods. As a heuristics-based optimization approach, particle swarm optimization (PSO) simulates the movements of a bird flock searching for food. The shuffled frog-leaping algorithm (SFL) is a population-based cooperative search metaphor combining the benefits of the local search of PSO and the global shuffling of information of the complex evolution technique. SFL is used to solve optimization problems on large-scale datasets.
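The notion of a Pareto set can be made concrete with a short sketch that filters non-dominated candidates. It assumes every objective is to be minimized, so a quantity to be maximized, such as bicluster size, is stored with a negative sign; the function names and the candidate values are invented for illustration and do not come from the experiments.

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical candidates as (residue, -size); both entries are minimized.
candidates = [
    (0.0, -9),
    (0.5, -12),
    (0.6, -10),   # dominated by (0.5, -12)
    (2.0, -20),
    (2.5, -20),   # dominated by (2.0, -20)
]

print(pareto_front(candidates))
```

The surviving candidates trade one objective against the other, which is why a multi-objective search returns a set of solutions rather than a single best bicluster.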
This paper integrates a dynamic population strategy and the shuffled frog-leaping algorithm into biclustering of microarray data, and proposes a novel multi-objective dynamic population shuffled frog-leaping biclustering (MODPSFLB) algorithm to mine maximum biclusters from microarray data. Experimental results show that the proposed MODPSFLB algorithm can effectively find significant biological structures in terms of related biological processes, components and molecular functions.
The proposed MODPSFLB algorithm has good diversity and fast convergence of Pareto solutions, and could become a powerful tool for systematic functional analysis in genome research.