The rapidly increasing volume of genome scale data has enabled global regulatory network inference and genome-wide prediction of gene function within single organisms. In this work, we exploit another advantage of the growing quantity of genomics data: by comparing genome-wide datasets for closely related organisms, we can add a critical evolutionary component to systems biology data analysis. Whereas several well-developed tools exist for identifying orthologous genes on the basis of sequence similarity, the identification of conserved co-regulated gene groups (modules) is a relatively recent problem requiring development of new methods. Here, we present an algorithm that performs integrative biclustering for multiple-species datasets in order to identify conserved modules and the conditions under which these modules are active. The advantages of this method are that conserved modules are more likely to be biologically significant than co-regulated gene groups lacking detectable conservation, and the identification of these conserved modules can provide a basis for investigating the evolution of gene regulatory networks.
Clustering has long been a popular tool in analyzing systems biology data types (for example, the clustering of microarray data to generate putative co-regulated gene groups). Most genomics studies employ clustering methods that require genes to participate in mutually exclusive clusters, such as hierarchical agglomerative clustering [
1], k-means clustering [
2] and singular value decomposition derived methods [
3-
5]. Because most genes are unlikely to be co-regulated under every possible condition (for instance, bacterial genes can have more than one transcription start site and, in that case, each site will be regulated by a different set of transcription factors depending on the cell's state), defining mutually exclusive gene clusters cannot capture the complexity of transcriptional regulatory networks. Clearly, sophisticated integrative methods are needed to arrive at the identification of more mechanistically meaningful condition-dependent conserved modules.
Biclustering refers to the simultaneous clustering of both genes and conditions [
6,
7]. Early work [
8] introduced the idea of biclustering as 'direct clustering' [
9], node deletion problems on graphs [
10], and biclustering [
11]. More recently, biclustering has been used in several studies to address the biologically relevant condition dependence of co-expression patterns [
6,
12-
19]. Additional genome-wide data (such as association networks and transcription factor binding sites) greatly improves the performance of these approaches [
19-
22]. Examples include the most recent version of SAMBA, which incorporates experimentally validated protein-protein and protein-DNA associations into a Bayesian framework [
19], and cMonkey [
20], an algorithm we recently introduced.
cMonkey integrates expression and sequence data, metabolic and signaling pathways [
23], protein-protein interactions, and comparative genomics networks [
24-
26] to estimate condition dependent co-regulated modules. We have previously shown that cMonkey can be used to 'pre-cluster' genes prior to learning global regulatory networks [
27]. Biclusters are iteratively optimized, starting with a random or semi-random seed, via a Monte Carlo Markov chain process. At each iteration, each bicluster's state is updated based upon conditional probability distributions computed using the bicluster's previous state. This enables cMonkey to determine the probability that a given gene or condition belongs in the bicluster, dependent upon the current state of the bicluster. The components of this conditional probability (one for each of the different data types) are modeled independently as
P-values based upon individual data likelihoods, which are combined to determine the full conditional probability of a given gene or condition belonging to a given bicluster.
Previous multi-species clustering methods generally fall into two classes (for reviews see [
17,
28]). The first class attempts to match conditions between species in order to identify similarities and differences for a given cell process [
29-
32]. By requiring matched conditions, this approach is not well suited to large sets of public experiments, as it is limited to only the conditions that have direct analogs for both species. The second class of multi-species clustering methods employs a strategy where the datasets for each organism are reduced to a unit-less measure of co-expression (for example Pearson's correlation) and are then used to compare co-expression patterns in multiple species [
33-
38]. This second class includes methods analyzing the conservation of individual orthologous pairs [
37,
38] and those seeking to identify larger conserved modules [
33,
34,
36]. The common objective is to gain insight into the evolution of related species, including the role of duplication in regulatory network evolution and the occurrence of convergent evolution versus conserved co-expression [
35,
38]. However, none of these studies can be considered a true multi-species biclustering algorithm; for example, both Bergmann
et al. [
34] and Tanay
et al. [
36] performed the analyses of the different species sequentially. Furthermore, with the exception of Tanay
et al. [
36], the methods were limited to considering only expression data.
Below, we present multi-species cMonkey, a biclustering framework that enables us to integrate data across multiple species and multiple data-types simultaneously. Our approach maintains the independence of the organism-specific data while still allowing for true biclustering. Specifically, gene membership in multiple clusters is possible and integration of a variety of data types remains an integral part of the approach. Once the conserved modules have been identified, our method further allows the discovery of species-specific modifications (which we term 'elaborations', that is, the addition of species-specific genes that fit well with the conserved core of the bicluster according to the multi-data score). The ability to find species specific elaborations of conserved co-regulated core sets of genes is a unique strength of the method and is critical to understanding the evolution and function of conserved modules.
Our multi-species biclustering method was applied to all pairings that are possible for three closely related species of Firmicutes:
Bacillus subtilis,
Bacillus anthracis and
Listeria monocytogenes. As one of the best-studied bacterial model organisms,
B. subtilis was selected due to the wealth of publicly available genomic data and the large amount of knowledge accumulated on this organism over the years. Additionally,
B. subtilis and
B. anthracis have similar life cycles, alternating between vegetative cell and dormant spore states [
39-
43]. The third member of the triplet,
L. monocytogenes, was selected as it shares similar morphology and physiology with
B. subtilis and
B. anthracis, but lacks the ability to form spores. In addition,
B. anthracis and
L. monocytogenes are pathogenic species, while
B. subtilis is non-pathogenic. Evolutionarily, the
Bacillus and
Listeria genera are estimated to have separated more than 1 billion years ago [
44]. Analysis of the biclusters obtained as a result of the procedure revealed several gene groups of interest and led us to formulate new hypotheses about the biology of these organisms. Specifically, we were able to detect a temporal difference between the two
Bacillus species in the expression of a group of metabolic genes involved in spore formation. Furthermore, the unexpected identification of a bicluster for genes required for flagellum formation in
B. anthracis prompted us to re-examine the capacity for flagellar-based motility in that species.