# Related Articles

Background

Biclustering, the discovery of sets of objects with a coherent pattern across a subset of conditions, is a critical task to study a wide-set of biomedical problems, where molecular units or patients are meaningfully related with a set of properties. The challenging combinatorial nature of this task led to the development of approaches with restrictions on the allowed type, number and quality of biclusters. Contrasting, recent biclustering approaches relying on pattern mining methods can exhaustively discover flexible structures of robust biclusters. However, these approaches are only prepared to discover constant biclusters and their underlying contributions remain dispersed.

Methods

The proposed BicPAM biclustering approach integrates existing principles made available by state-of-the-art pattern-based approaches with two new contributions. First, BicPAM is the first efficient attempt to exhaustively mine non-constant types of biclusters, including additive and multiplicative coherencies in the presence or absence of symmetries. Second, BicPAM provides strategies to effectively compose different biclustering structures and to handle arbitrary levels of noise inherent to data and with discretization procedures.

Results

Results show BicPAM’s superiority against its peers and its ability to retrieve unique types of biclusters of interest, to efficiently deliver exhaustive solutions and to successfully recover planted biclusters in datasets with varying levels of missing values and noise. Its application over gene expression data leads to unique solutions with heightened biological relevance.

Conclusions

BicPAM approaches integrate existing disperse efforts towards pattern-based biclustering and provides the first critical strategies to efficiently discover exhaustive solutions of biclusters with shifting, scaling and symmetric assumptions with varying quality and underlying structures. Additionally, BicPAM dynamically adapts its behavior to mine data with different levels of missing values and noise.

doi:10.1186/s13015-014-0027-z

PMCID: PMC4302537
PMID: 25649207

Biclustering; Pattern mining; Biomedical data analysis

Background

An important analysis performed on microarray gene-expression data is to discover biclusters, which denote groups of genes that are coherently expressed for a subset of conditions. Various biclustering algorithms have been proposed to find different types of biclusters from these real-valued gene-expression data sets. However, these algorithms suffer from several limitations such as inability to explicitly handle errors/noise in the data; difficulty in discovering small bicliusters due to their top-down approach; inability of some of the approaches to find overlapping biclusters, which is crucial as many genes participate in multiple biological processes. Association pattern mining also produce biclusters as their result and can naturally address some of these limitations. However, traditional association mining only finds exact biclusters, which limits its applicability in real-life data sets where the biclusters may be fragmented due to random noise/errors. Moreover, as they only work with binary or boolean attributes, their application on gene-expression data require transforming real-valued attributes to binary attributes, which often results in loss of information. Many past approaches have tried to address the issue of noise and handling real-valued attributes independently but there is no systematic approach that addresses both of these issues together.

Results

In this paper, we first propose a novel error-tolerant biclustering model, ‘ET-bicluster’, and then propose a bottom-up heuristic-based mining algorithm to sequentially discover error-tolerant biclusters directly from real-valued gene-expression data. The efficacy of our proposed approach is illustrated by comparing it with a recent approach RAP in the context of two biological problems: discovery of functional modules and discovery of biomarkers. For the first problem, two real-valued S.Cerevisiae microarray gene-expression data sets are used to demonstrate that the biclusters obtained from ET-bicluster approach not only recover larger set of genes as compared to those obtained from RAP approach but also have higher functional coherence as evaluated using the GO-based functional enrichment analysis. The statistical significance of the discovered error-tolerant biclusters as estimated by using two randomization tests, reveal that they are indeed biologically meaningful and statistically significant. For the second problem of biomarker discovery, we used four real-valued Breast Cancer microarray gene-expression data sets and evaluate the biomarkers obtained using MSigDB gene sets.

Conclusions

The results obtained for both the problems: functional module discovery and biomarkers discovery, clearly signifies the usefulness of the proposed ET-bicluster approach and illustrate the importance of explicitly incorporating noise/errors in discovering coherent groups of genes from gene-expression data.

doi:10.1186/1471-2105-12-S12-S1

PMCID: PMC3247082
PMID: 22168285

Biclustering extends the traditional clustering techniques by attempting to find (all) subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. Still the real power of this clustering strategy is yet to be fully realized due to the lack of effective and efficient algorithms for reliably solving the general biclustering problem. We report a QUalitative BIClustering algorithm (QUBIC) that can solve the biclustering problem in a more general form, compared to existing algorithms, through employing a combination of qualitative (or semi-quantitative) measures of gene expression data and a combinatorial optimization technique. One key unique feature of the QUBIC algorithm is that it can identify all statistically significant biclusters including biclusters with the so-called ‘scaling patterns’, a problem considered to be rather challenging; another key unique feature is that the algorithm solves such general biclustering problems very efficiently, capable of solving biclustering problems with tens of thousands of genes under up to thousands of conditions in a few minutes of the CPU time on a desktop computer. We have demonstrated a considerably improved biclustering performance by our algorithm compared to the existing algorithms on various benchmark sets and data sets of our own. QUBIC was written in ANSI C and tested using GCC (version 4.1.2) on Linux. Its source code is available at: http://csbl.bmb.uga.edu/∼maqin/bicluster. A server version of QUBIC is also available upon request.

doi:10.1093/nar/gkp491

PMCID: PMC2731891
PMID: 19509312

Background

In a number of domains, like in DNA microarray data analysis, we need to cluster simultaneously rows (genes) and columns (conditions) of a data matrix to identify groups of rows coherent with groups of columns. This kind of clustering is called biclustering. Biclustering algorithms are extensively used in DNA microarray data analysis. More effective biclustering algorithms are highly desirable and needed.

Methods

We introduce BiMine, a new enumeration algorithm for biclustering of DNA microarray data. The proposed algorithm is based on three original features. First, BiMine relies on a new evaluation function called Average Spearman's rho (ASR). Second, BiMine uses a new tree structure, called Bicluster Enumeration Tree (BET), to represent the different biclusters discovered during the enumeration process. Third, to avoid the combinatorial explosion of the search tree, BiMine introduces a parametric rule that allows the enumeration process to cut tree branches that cannot lead to good biclusters.

Results

The performance of the proposed algorithm is assessed using both synthetic and real DNA microarray data. The experimental results show that BiMine competes well with several other biclustering methods. Moreover, we test the biological significance using a gene annotation web-tool to show that our proposed method is able to produce biologically relevant biclusters. The software is available upon request from the authors to academic users.

doi:10.1186/1756-0381-2-9

PMCID: PMC2804695
PMID: 20015398

Background

The DNA microarray technology allows the measurement of expression levels of thousands of genes under tens/hundreds of different conditions. In microarray data, genes with similar functions usually co-express under certain conditions only [1]. Thus, biclustering which clusters genes and conditions simultaneously is preferred over the traditional clustering technique in discovering these coherent genes. Various biclustering algorithms have been developed using different bicluster formulations. Unfortunately, many useful formulations result in NP-complete problems. In this article, we investigate an efficient method for identifying a popular type of biclusters called additive model. Furthermore, parallel coordinate (PC) plots are used for bicluster visualization and analysis.

Results

We develop a novel and efficient biclustering algorithm which can be regarded as a greedy version of an existing algorithm known as pCluster algorithm. By relaxing the constraint in homogeneity, the proposed algorithm has polynomial-time complexity in the worst case instead of exponential-time complexity as in the pCluster algorithm. Experiments on artificial datasets verify that our algorithm can identify both additive-related and multiplicative-related biclusters in the presence of overlap and noise. Biologically significant biclusters have been validated on the yeast cell-cycle expression dataset using Gene Ontology annotations. Comparative study shows that the proposed approach outperforms several existing biclustering algorithms. We also provide an interactive exploratory tool based on PC plot visualization for determining the parameters of our biclustering algorithm.

Conclusion

We have proposed a novel biclustering algorithm which works with PC plots for an interactive exploratory analysis of gene expression data. Experiments show that the biclustering algorithm is efficient and is capable of detecting co-regulated genes. The interactive analysis enables an optimum parameter determination in the biclustering algorithm so as to achieve the best result. In future, we will modify the proposed algorithm for other bicluster models such as the coherent evolution model.

doi:10.1186/1471-2105-9-210

PMCID: PMC2396181
PMID: 18433478

Abstract

The biclustering problem has been extensively studied in many areas, including e-commerce, data mining, machine learning, pattern recognition, statistics, and, more recently, computational biology. Given an n × m matrix A (n ≥ m), the main goal of biclustering is to identify a subset of rows (called objects) and a subset of columns (called properties) such that some objective function that specifies the quality of the found bicluster (formed by the subsets of rows and of columns of A) is optimized. The problem has been proved or conjectured to be NP-hard for various objective functions. In this article, we study a probabilistic model for the implanted additive bicluster problem, where each element in the n × m background matrix is a random integer from [0, L − 1] for some integer L, and a k × k implanted additive bicluster is obtained from an error-free additive bicluster by randomly changing each element to a number in [0, L − 1] with probability θ. We propose an O (n2m) time algorithm based on voting to solve the problem. We show that when \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$k \geq \Omega (\sqrt{n \log n})$$\end{document}, the voting algorithm can correctly find the implanted bicluster with probability at least \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$1 - {\frac {9} {n^ {2}}}$$\end{document}. We also implement our algorithm as a C++ program named VOTE. The implementation incorporates several ideas for estimating the size of an implanted bicluster, adjusting the threshold in voting, dealing with small biclusters, and dealing with overlapping implanted biclusters. Our experimental results on both simulated and real datasets show that VOTE can find biclusters with a high accuracy and speed.

doi:10.1089/cmb.2007.0219

PMCID: PMC3131804
PMID: 19040364

additive bicluster; computational biology; gene expression data analysis; polynomial-time algorithm; probability model

Background

In DNA microarray experiments, discovering groups of genes that share similar transcriptional characteristics is instrumental in functional annotation, tissue classification and motif identification. However, in many situations a subset of genes only exhibits consistent pattern over a subset of conditions. Conventional clustering algorithms that deal with the entire row or column in an expression matrix would therefore fail to detect these useful patterns in the data. Recently, biclustering has been proposed to detect a subset of genes exhibiting consistent pattern over a subset of conditions. However, most existing biclustering algorithms are based on searching for sub-matrices within a data matrix by optimizing certain heuristically defined merit functions. Moreover, most of these algorithms can only detect a restricted set of bicluster patterns.

Results

In this paper, we present a novel geometric perspective for the biclustering problem. The biclustering process is interpreted as the detection of linear geometries in a high dimensional data space. Such a new perspective views biclusters with different patterns as hyperplanes in a high dimensional space, and allows us to handle different types of linear patterns simultaneously by matching a specific set of linear geometries. This geometric viewpoint also inspires us to propose a generic bicluster pattern, i.e. the linear coherent model that unifies the seemingly incompatible additive and multiplicative bicluster models. As a particular realization of our framework, we have implemented a Hough transform-based hyperplane detection algorithm. The experimental results on human lymphoma gene expression dataset show that our algorithm can find biologically significant subsets of genes.

Conclusion

We have proposed a novel geometric interpretation of the biclustering problem. We have shown that many common types of bicluster are just different spatial arrangements of hyperplanes in a high dimensional data space. An implementation of the geometric framework using the Fast Hough transform for hyperplane detection can be used to discover biologically significant subsets of genes under subsets of conditions for microarray data analysis.

doi:10.1186/1471-2105-9-209

PMCID: PMC2386490
PMID: 18433477

Background

The ability to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses using gene expression time series, obtained from microarray experiments, is critical to advance our understanding of complex biological processes. In this context, biclustering algorithms have been recognized as an important tool for the discovery of local expression patterns, which are crucial to unravel potential regulatory mechanisms. Although most formulations of the biclustering problem are NP-hard, when working with time series expression data the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the design of efficient biclustering algorithms able to identify all maximal contiguous column coherent biclusters.

Methods

In this work, we propose e-CCC-Biclustering, a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series gene expression matrix. This polynomial time complexity is achieved by manipulating a discretized version of the original matrix using efficient string processing techniques. We also propose extensions to deal with missing values, discover anticorrelated and scaled expression patterns, and different ways to compute the errors allowed in the expression patterns. We propose a scoring criterion combining the statistical significance of expression patterns with a similarity measure between overlapping biclusters.

Results

We present results in real data showing the effectiveness of e-CCC-Biclustering and its relevance in the discovery of regulatory modules describing the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress. In particular, the results show the advantage of considering approximate patterns when compared to state of the art methods that require exact matching of gene expression time series.

Discussion

The identification of co-regulated genes, involved in specific biological processes, remains one of the main avenues open to researchers studying gene regulatory networks. The ability of the proposed methodology to efficiently identify sets of genes with similar expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific regulatory mechanisms.

Availability

A prototype implementation of the algorithm coded in Java together with the dataset and examples used in the paper is available in .

doi:10.1186/1748-7188-4-8

PMCID: PMC2709627
PMID: 19497096

Abstract

Background

Newly microarray technologies yield large-scale datasets. The microarray datasets are usually presented in 2D matrices, where rows represent genes and columns represent experimental conditions. Systematic analysis of those datasets provides the increasing amount of information, which is urgently needed in the post-genomic era. Biclustering, which is a technique developed to allow simultaneous clustering of rows and columns of a dataset, might be useful to extract more accurate information from those datasets. Biclustering requires the optimization of two conflicting objectives (residue and volume), and a multi-objective artificial immune system capable of performing a multi-population search. As a heuristic search technique, artificial immune systems (AISs) can be considered a new computational paradigm inspired by the immunological system of vertebrates and designed to solve a wide range of optimization problems. During biclustering several objectives in conflict with each other have to be optimized simultaneously, so multi-objective optimization model is suitable for solving biclustering problem.

Results

Based on dynamic population, this paper proposes a novel dynamic multi-objective immune optimization biclustering (DMOIOB) algorithm to mine coherent patterns from microarray data. Experimental results on two common and public datasets of gene expression profiles show that our approach can effectively find significant localized structures related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The mined patterns present a significant biological relevance in terms of related biological processes, components and molecular functions in a species-independent manner.

Conclusions

The proposed DMOIOB algorithm is an efficient tool to analyze large microarray datasets. It achieves a good diversity and rapid convergence.

doi:10.1186/1471-2164-12-S2-S11

PMCID: PMC3194232
PMID: 21989068

Background

High-throughput microarray technologies have generated and accumulated massive amounts of gene expression datasets that contain expression levels of thousands of genes under hundreds of different experimental conditions. The microarray datasets are usually presented in 2D matrices, where rows represent genes and columns represent experimental conditions. The analysis of such datasets can discover local structures composed by sets of genes that show coherent expression patterns under subsets of experimental conditions. It leads to the development of sophisticated algorithms capable of extracting novel and useful knowledge from a biomedical point of view. In the medical domain, these patterns are useful for understanding various diseases, and aid in more accurate diagnosis, prognosis, treatment planning, as well as drug discovery.

Results

In this work we present the CMOPSOB (Crowding distance based Multi-objective Particle Swarm Optimization Biclustering), a novel clustering approach for microarray datasets to cluster genes and conditions highly related in sub-portions of the microarray data. The objective of biclustering is to find sub-matrices, i.e. maximal subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities over a subset of conditions. Since these objectives are mutually conflicting, they become suitable candidates for multi-objective modelling. Our approach CMOPSOB is based on a heuristic search technique, multi-objective particle swarm optimization, which simulates the movements of a flock of birds which aim to find food. In the meantime, the nearest neighbour search strategies based on crowding distance and ϵ-dominance can rapidly converge to the Pareto front and guarantee diversity of solutions. We compare the potential of this methodology with other biclustering algorithms by analyzing two common and public datasets of gene expression profiles. In all cases our method can find localized structures related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The mined patterns present a significant biological relevance in terms of related biological processes, components and molecular functions in a species-independent manner.

Conclusion

The proposed CMOPSOB algorithm is successfully applied to biclustering of microarray dataset. It achieves a good diversity in the obtained Pareto front, and rapid convergence. Therefore, it is a useful tool to analyze large microarray datasets.

doi:10.1186/1471-2105-10-S4-S9

PMCID: PMC2681067
PMID: 19426457

Background

Biclustering algorithms belong to a distinct class of clustering algorithms that perform simultaneous clustering of both rows and columns of the gene expression matrix and can be a very useful analysis tool when some genes have multiple functions and experimental conditions are diverse. Cheng and Church have introduced a measure called mean squared residue score to evaluate the quality of a bicluster and has become one of the most popular measures to search for biclusters. In this paper, we review basic concepts of the metaheuristics Greedy Randomized Adaptive Search Procedure (GRASP)-construction and local search phases and propose a new method which is a variant of GRASP called Reactive Greedy Randomized Adaptive Search Procedure (Reactive GRASP) to detect significant biclusters from large microarray datasets. The method has two major steps. First, high quality bicluster seeds are generated by means of k-means clustering. In the second step, these seeds are grown using the Reactive GRASP, in which the basic parameter that defines the restrictiveness of the candidate list is self-adjusted, depending on the quality of the solutions found previously.

Results

We performed statistical and biological validations of the biclusters obtained and evaluated the method against the results of basic GRASP and as well as with the classic work of Cheng and Church. The experimental results indicate that the Reactive GRASP approach outperforms the basic GRASP algorithm and Cheng and Church approach.

Conclusion

The Reactive GRASP approach for the detection of significant biclusters is robust and does not require calibration efforts.

doi:10.1186/1471-2105-10-S1-S27

PMCID: PMC2648745
PMID: 19208127

Background

In a functional analysis of gene expression data, biclustering method can give crucial information by showing correlated gene expression patterns under a subset of conditions. However, conventional biclustering algorithms still have some limitations to show comprehensive and stable outputs.

Results

We propose a novel biclustering approach called “BIclustering by Correlated and Large number of Individual Clustered seeds (BICLIC)” to find comprehensive sets of correlated expression patterns in biclusters using clustered seeds and their expansion with correlation of gene expression. BICLIC outperformed competing biclustering algorithms by completely recovering implanted biclusters in simulated datasets with various types of correlated patterns: shifting, scaling, and shifting-scaling. Furthermore, in a real yeast microarray dataset and a lung cancer microarray dataset, BICLIC found more comprehensive sets of biclusters that are significantly enriched to more diverse sets of biological terms than those of other competing biclustering algorithms.

Conclusions

BICLIC provides significant benefits in finding comprehensive sets of correlated patterns and their functional implications from a gene expression dataset.

doi:10.1186/1471-2164-14-144

PMCID: PMC3618306
PMID: 23496895

Background

The analysis of large-scale data sets via clustering techniques is utilized in a number of applications. Biclustering in particular has emerged as an important problem in the analysis of gene expression data since genes may only jointly respond over a subset of conditions. Biclustering algorithms also have important applications in sample classification where, for instance, tissue samples can be classified as cancerous or normal. Many of the methods for biclustering, and clustering algorithms in general, utilize simplified models or heuristic strategies for identifying the "best" grouping of elements according to some metric and cluster definition and thus result in suboptimal clusters.

Results

In this article, we present a rigorous approach to biclustering, OREO, which is based on the Optimal RE-Ordering of the rows and columns of a data matrix so as to globally minimize the dissimilarity metric. The physical permutations of the rows and columns of the data matrix can be modeled as either a network flow problem or a traveling salesman problem. Cluster boundaries in one dimension are used to partition and re-order the other dimensions of the corresponding submatrices to generate biclusters. The performance of OREO is tested on (a) metabolite concentration data, (b) an image reconstruction matrix, (c) synthetic data with implanted biclusters, and gene expression data for (d) colon cancer data, (e) breast cancer data, as well as (f) yeast segregant data to validate the ability of the proposed method and compare it to existing biclustering and clustering methods.

Conclusion

We demonstrate that this rigorous global optimization method for biclustering produces clusters with more insightful groupings of similar entities, such as genes or metabolites sharing common functions, than other clustering and biclustering algorithms and can reconstruct underlying fundamental patterns in the data for several distinct sets of data matrices arising in important biological applications.

doi:10.1186/1471-2105-9-458

PMCID: PMC2605474
PMID: 18954459

Biclustering, or the discovery of subsets of samples and genes that are homogeneous and distinct from the background, has become an important
technique in analyzing current microarray datasets. Most existing biclustering methods define a bicluster type as a fixed (predefined) pattern
and then trying to get results in some searching process. In this work, we propose a novel method for finding biclusters or 2-dimensional patterns
that are significantly distinct from the background without the need for pre-defining a pattern within the bicluster. The method named Distinct
2-Dimensional Pattern Finder (D2D) is composed of an iterative reordering step of the rows and columns in the matrix using a new similarity measure,
and a flexible scanning-and-growing step to identify the biclusters. Experiments on a large variety of simulation data show that the method works
consistently well under different conditions, whereas the existing methods compared may work well under some certain conditions but fail under some
other conditions. The impact of noise levels, overlapping degrees between clusters and different setting of parameters were also investigated, which
indicated that the D2D method is robust against these factors. The proposed D2D method can efficiently discover many different types of biclusters given
that they have distinctive features from the background. The computer program is available upon request.

PMCID: PMC2241930
PMID: 18305830

gene expression matrices; simulation; biclusters; Distinct 2-Dimensional (D2D); noise

Background

Biclustering is a popular method for identifying under which experimental conditions biological signatures are co-expressed. However, the general biclustering problem is NP-hard, offering room to focus algorithms on specific biological tasks. We hypothesize that conditional co-regulation of genes is a key factor in determining cell phenotype and that accurately segregating conditions in biclusters will improve such predictions. Thus, we developed a bicluster sampled coherence metric (BSCM) for determining which conditions and signals should be included in a bicluster.

Results

Our BSCM calculates condition and cluster size specific p-values, and we incorporated these into the popular integrated biclustering algorithm cMonkey. We demonstrate that incorporation of our new algorithm significantly improves bicluster co-regulation scores (p-value = 0.009) and GO annotation scores (p-value = 0.004). Additionally, we used a bicluster based signal to predict whether a given experimental condition will result in yeast peroxisome induction. Using the new algorithm, the classifier accuracy improves from 41.9% to 76.1% correct.

Conclusions

We demonstrate that the proposed BSCM helps determine which signals ought to be co-clustered, resulting in more accurately assigned bicluster membership. Furthermore, we show that BSCM can be extended to more accurately detect under which experimental conditions the genes are co-clustered. Features derived from this more accurate analysis of conditional regulation results in a dramatic improvement in the ability to predict a cellular phenotype in yeast. The latest cMonkey is available for download at https://github.com/baliga-lab/cmonkey2. The experimental data and source code featured in this paper is available http://AitchisonLab.com/BSCM. BSCM has been incorporated in the official cMonkey release.

doi:10.1186/1752-0509-9-S2-S1

PMCID: PMC4407105
PMID: 25881257

Background

Identifying a regulatory module (RM), a bi-set of co-regulated genes and co-regulating conditions (or samples), has been an important challenge in functional genomics and bioinformatics. Given a microarray gene-expression matrix, biclustering has been the most common method for extracting RMs. Among biclustering methods, order-preserving biclustering by a sequential pattern mining technique has native advantage over the conventional biclustering approaches since it preserves the order of genes (or conditions) according to the magnitude of the expression value. However, previous sequential pattern mining-based biclustering has several weak points in that they can easily be computationally intractable in the real-size of microarray data and sensitive to inherent noise in the expression value.

Results

In this paper, we propose a novel sequential pattern mining algorithm that is scalable in the size of microarray data and robust with respect to noise. When applied to the microarray data of yeast, the proposed algorithm successfully found long order-preserving patterns, which are biologically significant but cannot be found in randomly shuffled data. The resulting patterns are well enriched to known annotations and are consistent with known biological knowledge. Furthermore, RMs as well as inter-module relations were inferred from the biologically significant patterns.

Conclusions

Our approach for identifying RMs could be valuable for systematically revealing the mechanism of gene regulation at a genome-wide level.

doi:10.1186/1471-2164-12-S3-S5

PMCID: PMC3333188
PMID: 22369275

Background

Microarray analysis is an important area of bioinformatics. In the last few years, biclustering has become one of the most popular methods for classifying data from microarrays. Although biclustering can be used in any kind of classification problem, nowadays it is mostly used for microarray data classification. A large number of biclustering algorithms have been developed over the years, however little effort has been devoted to the representation of the results.

Results

We present an interactive framework that helps to infer differences or similarities between biclustering results, to unravel trends and to highlight robust groupings of genes and conditions. These linked representations of biclusters can complement biological analysis and reduce the time spent by specialists on interpreting the results. Within the framework, besides other standard representations, a visualization technique is presented which is based on a force-directed graph where biclusters are represented as flexible overlapped groups of genes and conditions. This microarray analysis framework (BicOverlapper), is available at

Conclusion

The main visualization technique, tested with different biclustering results on a real dataset, allows researchers to extract interesting features of the biclustering results, especially the highlighting of overlapping zones that usually represent robust groups of genes and/or conditions. The visual analytics methodology will permit biology experts to study biclustering results without inspecting an overwhelming number of biclusters individually.

doi:10.1186/1471-2105-9-247

PMCID: PMC2416653
PMID: 18505552

Background

The analysis of data generated by microarray technology is very useful to understand how the genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes which are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as merit function but interesting and relevant patterns from a biological point of view such as shifting and scaling patterns may not be detected using this measure. However, it is important to discover this type of patterns since commonly the genes can present a similar behavior although their expression levels vary in different ranges or magnitudes.

Methods

Scatter Search is an evolutionary technique that is based on the evolution of a small set of solutions which are chosen according to quality and diversity criteria. This paper presents a Scatter Search with the aim of finding biclusters from gene expression data. In this algorithm the proposed fitness function is based on the linear correlation among genes to detect shifting and scaling patterns from genes and an improvement method is included in order to select just positively correlated genes.

Results

The proposed algorithm has been tested with three real data sets such as Yeast Cell Cycle dataset, human B-cells lymphoma dataset and Yeast Stress dataset, finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function are compared to that of CC, OPSM, ISA, BiMax, xMotifs and Samba using Gene the Ontology Database.

doi:10.1186/1756-0381-4-3

PMCID: PMC3037342
PMID: 21261986

Background

Biclustering algorithms for microarray data aim at discovering functionally related gene sets under different subsets of experimental conditions. Due to the problem complexity and the characteristics of microarray datasets, heuristic searches are usually used instead of exhaustive algorithms. Also, the comparison among different techniques is still a challenge. The obtained results vary in relevant features such as the number of genes or conditions, which makes it difficult to carry out a fair comparison. Moreover, existing approaches do not allow the user to specify any preferences on these properties.

Results

Here, we present the first biclustering algorithm in which it is possible to particularize several biclusters features in terms of different objectives. This can be done by tuning the specified features in the algorithm or also by incorporating new objectives into the search. Furthermore, our approach bases the bicluster evaluation in the use of expression patterns, being able to recognize both shifting and scaling patterns either simultaneously or not. Evolutionary computation has been chosen as the search strategy, naming thus our proposal Evo-Bexpa (Evolutionary Biclustering based in Expression Patterns).

Conclusions

We have conducted experiments on both synthetic and real datasets demonstrating Evo-Bexpa abilities to obtain meaningful biclusters. Synthetic experiments have been designed in order to compare Evo-Bexpa performance with other approaches when looking for perfect patterns. Experiments with four different real datasets also confirm the proper performing of our algorithm, whose results have been biologically validated through Gene Ontology.

doi:10.1186/1748-7188-8-4

PMCID: PMC3668234
PMID: 23433178

Gene expression data analysis; Shifting and scaling expression patterns; Evolutionary biclustering

The need to analyze high-dimension biological data is driving the development of new data mining methods. Biclustering algorithms have been successfully applied to gene expression data to discover local patterns, in which a subset of genes exhibit similar expression levels over a subset of conditions. However, it is not clear which algorithms are best suited for this task. Many algorithms have been published in the past decade, most of which have been compared only to a small number of algorithms. Surveys and comparisons exist in the literature, but because of the large number and variety of biclustering algorithms, they are quickly outdated. In this article we partially address this problem of evaluating the strengths and weaknesses of existing biclustering methods. We used the BiBench package to compare 12 algorithms, many of which were recently published or have not been extensively studied. The algorithms were tested on a suite of synthetic data sets to measure their performance on data with varying conditions, such as different bicluster models, varying noise, varying numbers of biclusters and overlapping biclusters. The algorithms were also tested on eight large gene expression data sets obtained from the Gene Expression Omnibus. Gene Ontology enrichment analysis was performed on the resulting biclusters, and the best enrichment terms are reported. Our analyses show that the biclustering method and its parameters should be selected based on the desired model, whether that model allows overlapping biclusters, and its robustness to noise. In addition, we observe that the biclustering algorithms capable of finding more than one model are more successful at capturing biologically relevant clusters.

doi:10.1093/bib/bbs032

PMCID: PMC3659300
PMID: 22772837

biclustering; microarray; gene expression; clustering

Background

Biclustering is an important analysis procedure to understand the biological mechanisms from microarray gene expression data. Several algorithms have been proposed to identify biclusters, but very little effort was made to compare the performance of different algorithms on real datasets and combine the resultant biclusters into one unified ranking.

Results

In this paper we propose differential co-expression framework and a differential co-expression scoring function to objectively quantify quality or goodness of a bicluster of genes based on the observation that genes in a bicluster are co-expressed in the conditions belonged to the bicluster and not co-expressed in the other conditions. Furthermore, we propose a scoring function to stratify biclusters into three types of co-expression. We used the proposed scoring functions to understand the performance and behavior of the four well established biclustering algorithms on six real datasets from different domains by combining their output into one unified ranking.

Conclusions

Differential co-expression framework is useful to provide quantitative and objective assessment of the goodness of biclusters of co-expressed genes and performance of biclustering algorithms in identifying co-expression biclusters. It also helps to combine the biclusters output by different algorithms into one unified ranking i.e. meta-biclustering.

doi:10.1186/1748-7188-5-23

PMCID: PMC2892498
PMID: 20507637

Video abstract

Video

Objective

With the dramatic increase in microarray data, biclustering has become a promising tool for gene expression analysis. Biclustering has been proven to be superior over clustering in identifying multifunctional genes and searching for co-expressed genes under a few specific conditions; that is, a subgroup of all conditions. Biclustering based on a genetic algorithm (GA) has shown better performance than greedy algorithms, but the overlap state for biclusters must be treated more systematically.

Results

We developed a new biclustering algorithm (binary-iterative genetic algorithm [BIGA]), based on an iterative GA, by introducing a novel, ternary-digit chromosome encoding function. BIGA searches for a set of biclusters by iterative binary divisions that allow the overlap state to be explicitly considered. In addition, the average of the Pearson’s correlation coefficient was employed to measure the relationship of genes within a bicluster, instead of the mean square residual, the popular classical index. As compared to the six existing algorithms, BIGA found highly correlated biclusters, with large gene coverage and reasonable gene overlap. The gene ontology (GO) enrichment showed that most of the biclusters are significant, with at least one GO term over represented.

Conclusion

BIGA is a powerful tool to analyze large amounts of gene expression data, and will facilitate the elucidation of the underlying functional mechanisms in living organisms.

doi:10.2147/AABC.S32622

PMCID: PMC3459542
PMID: 23055751

biclustering; microarray data; genetic algorithm; Pearson’s correlation coefficient

An noticeable number of biclustering approaches have been proposed proposed for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. In this context, recognizing groups of co-expressed or co-regulated genes, that is, genes which follow a similar expression pattern, is one of the main objectives. Due to the problem complexity, heuristic searches are usually used instead of exhaustive algorithms. Furthermore, most of biclustering approaches use a measure or cost function that determines the quality of biclusters. Having a suitable quality metric for bicluster is a critical aspect, not only for guiding the search, but also for establishing a comparison criteria among the results obtained by different biclustering techniques. In this paper, we analyse a large number of existing approaches to quality measures for gene expression biclusters, as well as we present a comparative study of them based on their capability to recognize different expression patterns in biclusters.

doi:10.1371/journal.pone.0115497

PMCID: PMC4357449
PMID: 25763839

Background

Accumulated biological research outcomes show that biological functions do not depend on individual genes, but on complex gene networks. Microarray data are widely used to cluster genes according to their expression levels across experimental conditions. However, functionally related genes generally do not show coherent expression across all conditions since any given cellular process is active only under a subset of conditions. Biclustering finds gene clusters that have similar expression levels across a subset of conditions. This paper proposes a seed-based algorithm that identifies coherent genes in an exhaustive, but efficient manner.

Methods

In order to find the biclusters in a gene expression dataset, we exhaustively select combinations of genes and conditions as seeds to create candidate bicluster tables. The tables have two columns (a) a gene set, and (b) the conditions on which the gene set have dissimilar expression levels to the seed. First, the genes with less than the maximum number of dissimilar conditions are identified and a table of these genes is created. Second, the rows that have the same dissimilar conditions are grouped together. Third, the table is sorted in ascending order based on the number of dissimilar conditions. Finally, beginning with the first row of the table, a test is run repeatedly to determine whether the cardinality of the gene set in the row is greater than the minimum threshold number of genes in a bicluster. If so, a bicluster is outputted and the corresponding row is removed from the table. Repeating this process, all biclusters in the table are systematically identified until the table becomes empty.

Conclusions

This paper presents a novel biclustering algorithm for the identification of additive biclusters. Since it involves exhaustively testing combinations of genes and conditions, the additive biclusters can be found more readily.

doi:10.1371/journal.pone.0042431

PMCID: PMC3411756
PMID: 22879981

Background

Biclustering of gene expression data searches for local patterns of gene expression. A bicluster (or a two-way cluster) is defined as a set of genes whose expression profiles are mutually similar within a subset of experimental conditions/samples. Although several biclustering algorithms have been studied, few are based on rigorous statistical models.

Results

We developed a Bayesian biclustering model (BBC), and implemented a Gibbs sampling procedure for its statistical inference. We showed that Bayesian biclustering model can correctly identify multiple clusters of gene expression data. Using simulated data both from the model and with realistic characters, we demonstrated the BBC algorithm outperforms other methods in both robustness and accuracy. We also showed that the model is stable for two normalization methods, the interquartile range normalization and the smallest quartile range normalization. Applying the BBC algorithm to the yeast expression data, we observed that majority of the biclusters we found are supported by significant biological evidences, such as enrichments of gene functions and transcription factor binding sites in the corresponding promoter sequences.

Conclusions

The BBC algorithm is shown to be a robust model-based biclustering method that can discover biologically significant gene-condition clusters in microarray data. The BBC model can easily handle missing data via Monte Carlo imputation and has the potential to be extended to integrated study of gene transcription networks.

doi:10.1186/1471-2164-9-S1-S4

PMCID: PMC2386069
PMID: 18366617