Home | About | Journals | Submit | Contact Us | Français |

**|**Bioinformatics**|**PMC2881394

Formats

Article sections

Authors

Related links

Bioinformatics. 2010 June 15; 26(12): i158–i167.

Published online 2010 June 1. doi: 10.1093/bioinformatics/btq210

PMCID: PMC2881394

Richard S. Savage,^{1} Zoubin Ghahramani,^{2} Jim E. Griffin,^{3} Bernard J. de la Cruz,^{4} and David L. Wild^{1,}^{*}

* To whom correspondence should be addressed.

Copyright © The Author(s) 2010. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

**Motivation:** We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets.

**Results:** We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.

**Availability:** If interested in the code for the work presented in this article, please contact the authors.

**Contact:** ku.ca.kciwraw@dliw.l.d

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

Approaches to the elucidation of gene regulatory networks have often relied on the use of clustering methodologies, grouping genes on the basis of expression patterns over time, treatments and/or tissues. The genes in a given cluster are usually assumed to be potentially functionally related or to be influenced by common upstream factors. For example, Eisen *et al.* (1998) found that in the yeast *Saccharomyces cerevisiae*, genes that clustered together did indeed often share similar biological function, and a large number of subsequent authors have found the same, sometimes even being able to verify the results experimentally (e.g. Ihmels *et al.*, 2002).

Application of these approaches to gene expression data have led to the recognition that gene regulation is often performed by *regulatory programmes* or *transcriptional modules* (TMs); sets of co-regulated genes that share a common biological function and are regulated by a common set of transcription factors. Ihmels *et al.* (2002) devised a method for identifying TMs by assigning genes to clusters in a context-dependent manner. A gene could be assigned to several clusters, resulting in overlapping TMs, a feature which is biologically meaningful since a gene could be involved in multiple biological processes.

Clustering on the basis of expression data alone, however, only indicates *co-expression*, and does not directly identify *co-regulation*. The expression patterns of genes in the same cluster may be correlated for reasons other than co-regulation—the effects of experimental measurement error may be important, for example. Due to the complexity of gene regulatory networks, as well as the limitations of any given source of noisy experimental data, it is advantageous to make TM inferences using multiple sources of data. In addition to gene expression data, a range of other data types have been used to enhance the reconstruction of gene networks. These include information about transcription factor binding derived from experimental techniques such as ChIP-chip, sequence data and even information derived from relevant scientific literature. Both Segal *et al.* (2003a) and Kundaje *et al.* (2005) have described methods to integrate expression and sequence data within the framework of a probabilistic graphical model, using the method of expectation maximization —a statistical technique for maximum likelihood estimation of model parameters from incomplete data. Segal *et al.* (2003b) applied a variant of this approach to infer regulatory modules in *S.cerevisiae*, together with their component regulators, under the assumption that the regulators themselves are transcriptionally regulated, at least under a subset of conditions. Bar-Joseph *et al.* (2003) described a method to integrate ChIP-chip and expression data based on an exhaustive iterative search over possible combinations of regulators, which identifies a subset of gene targets with highly correlated expression patterns.

Dirichlet process mixture models (DPMs; Antoniak, 1974; Ferguson, 1973) are a class of Bayesian non-parametric models that has been widely used for clustering (Dahl, 2006; Liu *et al.*, 2006; Medvedovic and Sivaganesan, 2002; Medvedovic *et al.*, 2004; Qin, 2006; Rasmussen *et al.*, 2009; Rasmussen, 2000; Wild *et al.*, 2002). DPMs have the interesting property that the prior probability of a new data point joining a cluster is proportional to the number of points already in that cluster, thus encoding a natural clustering tendency. Clustering strength is controlled via a hyperparameter α, which sets the expected number of clusters as a function of the number of clustered items. By inferring α we can therefore determine the posterior distribution of the number of clusters.

Hierarchical Dirichlet Process models [HDPMs as defined by Teh *et al.* (2006)] are the hierarchical extension of DPMs. They consist of a DPM for each of a number of different *contexts*, with the mixture components for each context being drawn from a master list of mixtures from the next level of the hierarchy. A wider range of HDPMs are reviewed in Teh and Jordan (2010). Reid *et al.* (2009) use a type of HDPM to identify TMs from transcription factor binding site sequence data. Gerber *et al.* (2007) use HDPMs to model gene expression programs in a variety of tissues.

Effective combination of different datasets can be an effective way to identify TMs. Liu *et al.* (2007) introduced a HDPM that assigns a DPM to each of a pair of datasets (e.g. gene expression and ChIP-chip) and connects them via a common hyperparameter, α. By producing combined results from the sampled clustering for both contexts, they are able to produce a form of flexible data integration. As we shall see below, the Liu *et al.* approach is a special case of the model we present in this article.

Our aim is to cluster genes together on the basis of both gene expression and ChIP-chip (transcription factor binding site) information. We wish to identify the genes that possess the same clustering structure across both datasets, as these are more likely to represent underlying TMs and hence share specific biological function/s. We expect that the information coming from each dataset will be uncertain and possibly contradictory. Therefore, we wish to distinguish (on a gene-by-gene basis) between genes that can sensibly have their data fused and those for which there is contradiction.

We construct our model (shown in Fig. 1) from a two-level hierarchy of DPMs. This model can be regarded as a modified version of the HDPM presented in Teh *et al.* (2006). A naïve use of HDPMs for data integration may fail since it assumes that all data sources are clustered identically. Our principal innovation is to include, for each gene, an indicator variable that determines whether the gene should join a cluster based on both data sources combined (via a product of likelihoods) or whether it should be clustered separately for each dataset.

The genes are assigned to three contexts, each defining a clustering partition via a DPM. One context contains the genes that are fused across the two datasets. The other two contexts are for the unfused genes, one clustering solely on the basis of expression data and the other using the transcription factor binding dataset. The fusion indicator variables are learnt as part of the inference process (carried out via a Gibbs sampler), allowing us to determine the posterior probability that any given gene will be fused.

We define our model as follows [using the same notation as Teh *et al.* (2006)]. We have *n* genes, for each of which we have two sets of measurements (expression and ChIP-chip). Each gene can either be considered for the two datasets separately or we can *fuse* them together, giving a single (product) likelihood. This gives us three contexts overall. Let *x*_{ji} be the observed response for *i*-th gene in the *j*-th context. Note that when sampling, genes can flip from a fused to unfused state and vice versa.

Each context *j*{1, 2, 3} has a countably infinite number of mixture components, of which *K* are occupied. Each component *k*{1, …, *K*} has a mixture weight denoted by π_{jk} and a parameter vector θ_{jk}, where the set of parameters may be different for the different contexts.

Each gene is assigned to a mixture component via the indicator variable *Z*_{ij}, giving us the following equations.

(1)

The conditional likelihood for each gene is then:

(2)

where *L*_{j} is the likelihood we assign to model the data and θ_{jk} are the parameter values for mixture component *k* in context *j*.

These are assigned the stick-breaking prior associated with the Dirichlet process (Teh *et al.*, 2006)

(3)

where *V*_{jk} are mutually independent and *V*_{jk} ~ Be(1, α) which we write Stick(α). Marginalizing over the mixture weights gives us a DPM for each context. Similarly, ϵ are the mixture weights for the second level of the Dirichlet process hierarchy. Again, these are assigned a stick-breaking prior and marginalized over, so that they do not have to be considered explicitly in the analysis.

(4)

(5)

The hyperparameters α_{0} and γ are the concentration parameters for the two levels of the Dirichlet process hierarchy. α_{0} is shared across all three contexts, representing the prior belief that they are all representing the same biological system and hence we should expect the same number of underlying TMs. These can be inferred and therefore are assigned vague gamma priors as follows.

(6)

(7)

We choose the component likelihoods to reflect the nature of the two datasets. For the expression data, we discretize the measured value for each gene into three levels, representing under-, over- or unchanged expression. This is something of a simplifying assumption, but makes our analysis more robust to the non-Gaussian noise typical of gene expression data, and is an approach that has been shown to be effective (see e.g. Gerber *et al.*, 2007; Savage *et al.*, 2009). We therefore choose the component likelihood for the expression data to be a naïve Bayes model, constructed from a product of multinomial distributions.

For the ChIP-chip data, we have statistical (*P*-value) information as to whether or not a given transcription factor binds to a given gene. By thresholding these values, we obtain sparse binary data, indicating whether binding has occurred. We choose to model these data using a so-called ‘bag-of-words’ model, e.g. a multinomial likelihood over transcription factors. This has the advantage that only counts over genes are important, meaning that a large number of zeros in the data can be handled safely. Note that in both cases we choose to apply (conjugate) Dirichlet priors.

For the expression data, we have

(8)

where *B*_{a}=∑_{b}β_{ab} and *N*_{a}=∑_{b} *x*_{ab}, *a* is the index over features and *b* is the index over discrete data values. The β_{ab} are the Dirchlet prior hyperparameters, which in this case are all set to 0.5 (the Jeffreys' value).

For the ChIP-chip data, we have

(9)

where *B*=∑_{a} β_{a} and *N*=∑_{a} *x*_{a}, and *a* is the index over features. The β_{a} are the Dirichlet prior hyperparameters, which in this case are all set to 0.5 (the Jeffreys' value). The above two equations are obtained from Equation (2) for a particular cluster *k* by integrating out the parameters θ.

We encode the notion of data fusion for a given gene *i* by allowing the possibility of taking the product of likelihoods over the two datasets. So, if the likelihood parameters for contexts one and two are given by θ_{1i} and θ_{2i}, we have the following equations.

We introduce an extra latent variable *r*_{i} for each gene with

(10)

If *r*_{i}=1 (corresponding to a fused gene) then:

(11)

And if *r*_{i}=0 (corresponding to an unfused gene) then:

(12)

(13)

This defines three contexts. Unlike the HDPM, we have

(14)

(15)

(16)

(17)

where *F*_{0}^{(j)} represents the marginal distribution of θ_{j} under *F*_{0}. The hierarchical Dirichlet process structure allows sharing of clusters across the unfused and fused contexts. For example, an unfused gene can be allocated to the same cluster as fused genes for gene expression but allocated to a different cluster (shared by different fused genes) for transcription factors.

We choose to fix *w*=0.5 for the analyses in this article, representing that we have no prior knowledge of the degree to which these datasets should fuse. We note that it is also straightforward to sample from *w* and we run a test of this, the results of which can be seen in Table 3. Details of the algorithm to implement this model, in particular a Gibbs sampler, can be found in Appendix A.

Our model has two special cases that are of interest and represent alternative ways of approaching data integration. *w*=0 gives us the model of Liu *et al.* (2007). In this model, there is no direct data fusion (in the sense that all the *r*_{i}=0). Instead, information is shared via a common hyperparameter, α_{0}, between the clustering for each dataset. Each dataset is therefore clustered (almost) separately, but benefiting from this weak sharing of information via the hyperparameter. *w*=1 gives us simple data integration by taking the product of likelihoods and forcing all genes to be part of a single clustering partition. With *w*=1, only the fused context is used (genes can never be unfused) and so we have a straightforward DPM with a product of likelihoods over the two datasets.

Once we have explored the model space using Markov chain Monte Carlo (MCMC) sampling, we wish to extract useful results from the samples. In particular, we wish to identify TMs, which in our model correspond to groups of genes that fused with high probability and that are often found in the same fused cluster. This is a non-trivial challenge, as each MCMC sample contains a large number of parameters (mixture component labels for each gene in each of three contexts, *r*_{i} for each gene, plus the global hyperparameters). We therefore require a way to summarize the results. To do this, we choose to form a posterior similarity matrix (Fritsch and Ickstadt, 2009). From this we will extract a clustering partition, which will correspond to transcriptional modules. The posterior similarity matrix is an (*n*_{genes} × *n*_{genes}) matrix where each element gives the posterior probability that a given pair of genes are found in the same cluster (and hence also in the same context). These values can be estimated simply by counting the MCMC samples in the appropriate way.

A major advantage of our model is that it identifies how likely each gene is to be fused (estimated from the *r*_{i} values over MCMC samples). By rejecting genes with low *P*(*r*_{i}=1 | **x**), we can identify more tightly defined TMs. For this article, we choose to define ‘fused’ as being *P*(*r*_{i}=1 | **x**)≥0.5 (the prior value we assign to *w* for the full model).

From the posterior similarity matrix, we extract the most likely cluster partition using the method of Fritsch and Ickstadt (2009), which minimizes a defined loss function that is equivalent to maximizing the adjusted Rand index between estimated and true clustering partitions. We note that this represents a summarization of the full results implicit in the analysis. Some kind of compromise of this nature is inevitable, simply due to the richness of the posterior distribution of our model. As we shall demonstrate, this approach still leads to superior results and hence biological insight.

We can also extract other useful quantities from the posterior MCMC samples. For example, the 1D marginal distributions of the hyperparameters, the number of fused clusters and the number of fused genes are all easily determined.

We are interested in identifying TMs with well-defined biological function/s. Our quality measures should therefore reflect this. We choose two measures, both using the Gene Ontology (GO) database.

The first measure is the *Biological Homogeneity Index* (BHI; Datta and Datta, 2006). This is a global measure of how biologically homogeneous a given clustering partition is (as measured here using GO annotations). Clusters where many genes share annotations will lead to a high BHI score and vice versa. Perfect agreement for all GO terms (which is highly unlikely) would lead to a score of unity.

We compute error estimates of these BHI scores by performing 10 random combinations of the 20 MCMC chains (chosen via bootstrap sampling, eg. so that a given chain may be selected multiple times), finding in each case the clustering partition and hence the BHI score. This gives us a measure of any uncertainty due to inadequate mixing of the MCMC chains.

The second measure is to find GO terms that are over-represented in any given module, relative to the background population of genes. We use the R package *GOstats* (Falcon and Gentleman, 2007) to apply a hypergeometric test to each GO term in each cluster. We apply this analysis to all GO terms in the three ontologies (cellular component, molecular function and biological process). We account for the dependent, hierarchical structure of the ontologies using the relevant option in the call to the GOstats function ‘hyperGTest’. We correct for multiple hypotheses using a Bonferroni correction.

We perform two different analyses in this article, in each case analysing data from *S.cerevisiae*. To facilitate comparison with earlier work, in both cases we use ChIP-chip data from Lee *et al.* (2002) to provide information on transcription factor binding activity. This dataset contains nearly 4000 interactions between regulators and promoter regions, representing 6270 yeast genes and 106 transcriptional regulators.^{1}.

We use two different gene expression datasets in the two analyses. The first one (referred to subsequently as the *galactose utilization* data) is taken from a subset of the expression dataset of Ideker *et al.* (2001), which has been widely used for the validation of clustering methods (Savage *et al.*, 2009; Yao *et al.*, 2008; Yeung *et al.*, 2003). These data are a series of expression measurements across 20 experiments representing systematic perturbations of the yeast galactose utilization pathway. The subset used consists of 205 genes whose expression patterns reflect four functional categories (biosynthesis, protein metabolism and modification; energy pathways, carbohydrate metabolism and catabolism; nucleobase, nucleoside, nucleotide and nucleic acid metabolism; transport), based on GO annotations. However, as Yeung *et al.* (2003) note, since this array data may not fully reflect these functional categories, these classifications should be used with caution.

For the other analysis (referred to as the *cell cycle data*), we use gene expression measurements of the mitotic yeast cell cycle of Cho *et al.* (1998), which was chosen to facilitate direct comparison with the results of Liu *et al.* (2007), who also analysed this data. For this dataset, we keep only the genes that were identified as having a Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway, following the procedure of Liu *et al.* (2007). We note that there are small differences in the number of genes selected in this way (949 versus 1165 in this article), due to differences in the version of the KEGG database used.

For both expression datasets, we discretize the data into three levels. This has two benefits. First, it makes our analysis more robust to the non-Gaussian (and hence hard to model) noise distribution of the data. Secondly, it makes the task of data modelling more straightforward. To optimize the binning of the data, we use the Bayesian Hierarchical Clustering (BHC) package (Savage *et al.*, 2009) to find an optimized binning and assume that this will then also be reasonably optimal for this analysis.

For the ChIP-chip dataset, we follow the procedure of Lee *et al.* (2002) and threshold the data at a *P*-value of 0.001, giving us a sparse dataset where the detections are robust, at the expense of a larger proportion of false negatives (Lee *et al.* estimate 6–10% false positives and that about one-third of the interactions are missed).

The results of each data integration analysis are rich and detailed. In this section, we consider a number of metrics and highlight significant aspects of the results for each dataset. For comparison, we also run clustering analyses using the BHC algorithm (Savage *et al.*, 2009) to analyse the gene expression data alone. This gives us a baseline result with which to compare the data integration results.

It is important to assess the convergence of any MCMC analysis. All analyses on the galactose dataset comprise 20 MCMC chains, each of 50 000 samples (after removal of 10 000 burn-in samples). All analyses on the cell cycle data set comprise 20 MCMC chains, each of 5000 samples (after removal of 1000 burn-in samples). To speed up our subsequent analyses, we sparse sample each chain by a factor of 10.

We include as Supplementary Material plots of 1D hyperparameter posterior PDFs, both overall and for each MCMC chain. These demonstrate that the MCMC analyses are converging adequately. We also perform Geweke convergence tests (Geweke, 1992) on the hyperparameters of each chain (α_{0} and γ) for the *w*=0.5 case. For the galactose dataset, we find 30 of 40 tests (20 chains, 2 hyperparameters) have a *z*-score <2, with a maximum outlier of 3.48. For the cell cycle dataset, we find 28 of 40 tests have a *z*-score <2, with a maximum outlier of 5.53. We conclude that our chains are reasonably well (although not perfectly) mixed, although in both cases it is important that we combine multiple chains.

The algorithm is implemented in Matlab and run on 3 GHz cluster nodes (20 in parallel, one per MCMC chain). On the galactose utilization dataset (205 genes), the code produces 500 samples per chain per hour. On the cell cycle dataset (1165 genes), the code produces 40 samples per chain per hour. On a larger sample of 2332 genes [from the stress data of Gasch *et al.* (2000)], the code produces 10 samples per chain per hour. This scaling suggests that for large (genome-scale) datasets it may be of value to investigate (for example) fast variational methods as an alternative to MCMC.

In Table 1, we give the BHI scores for different analyses of the galactose utilization dataset. The outcome from our model (fused genes and *w*=0.5) extracts a subset of 51 genes with an overall BHI score that is 9% (3 SDs) higher than for any other method, indicating a greater degree of biological functional coherence. We note that all methods are superior to the BHC result for expression data alone (BHI = 0.323), except for the ChIP only analyses.

Figure 2 shows the variation of the BHI with the clustering partition resulting from keeping the top *n* genes, as sorted by *P*(*r*_{i}=1 | **x**). There is a clear enriching effect on the BHI (and hence biological homogeneity) by selecting a subset of genes in this way.

Plots of the BHI for the galactose dataset, showing the variation with different numbers of fused genes. Shown are the BHI results for each GO separately, plus all three combined. In all cases, selecting 100 or fewer genes leads to an increase in the **...**

In Figure 3, we show a matrix of the significantly over-represented GO terms in each of the clusters we extract for the model (fused genes and *w*=0.5). Notable are the density of hits, and also the distinct block structure, which reflects that each cluster is tending to capture all the significance for given GO terms. These GO terms reflect the four functional categories previously identified in this data, but detailed inspection of the functional annotations of the genes in each cluster reveals a finer level of biological specificity than previously identified. Cluster 1 (counted from the left) comprises four genes involved in glycolysis and the tricarboxylic acid (TCA) cycle. Cluster 2 represents genes involved in replication and RNA processing, while Cluster 3 comprises primarily ribosomal components. Cluster 4 comprises four hexose transporters, including at least one pseudogene, which, despite being non-functional, is nevertheless expressed.

Graphical representation of the significantly over-represented GO terms for each cluster of genes, for the galactose utilization (left) and cell cycle (right) datasets. Black indicates that a given gene is annotated with the relevant GO term and that **...**

Table 4 shows comparisons of over-represented GO terms with those obtained from the Liu method. In general, the data fusion GO terms are more enriched, with lower *P*-values and, in almost all cases, a higher proportion of the genes being annotated with the term.

Over-represented GO terms for one of the fused clusters extracted from the galactose utilization dataset (with *w*=0.5

In Figure 4, we show the ChIP-chip data for the fused genes, sorted by cluster membership. The structure in this plot (horizontally aligned hits) shows certain transcription factors are contributing to the data integration and, like Segal *et al.* (2003b), we find that TMs are characterized by partly overlapping but distinct motif combinations.

The ChIP-chip data for the fused genes of the galactose utilization (left) and cell cycle (right) dataset analyses. The data have been sorted by the clustering partition. Black pixels indicate a transcription factor that binds to that gene. The different **...**

Table 3 shows some comparison analyses carried out using the Harbison *et al.* (2004) ChIP data in place of that of Lee *et al.* While the Harbison *et al.* data analysis finds more fused genes (72 versus 51), the BHI scores are comparable. We also run an analysis where we sample over *w*; in this case, the BHI scores are marginally worse and there are fewer fused genes (56) than for the *w*=0.5 case.

In Table 2, we give the BHI scores for different analyses of the cell cycle dataset. Again, our model (fused genes and *w*=0.5) gives the best results, although in this case the method of Liu *et al.* provides a slightly lower BHI score. In all cases, the data integration provides benefit over simply using gene expression data and the BHC algorithm (BHI = 0.285).

In the Supplementary Material we show the posterior similarity matrix, sorted by cluster membership. The block-diagonal structure shows the core of each cluster clearly defined. In this figure, off-diagonal blocks may indicate one of two possibilities; it may mean that there is *uncertainty* in whether a set of genes should be assigned to one of the two clusters, or it may indicate a set of genes that should really belong *simultaneously* to two clusters. The two clusters in question (Clusters 3 and 4, counted from the left) do indeed share common GO annotations indicating metabolic function (Fig. 3). Cell cycle regulation is a complex interplay of many different external signals and intrinsic cell states (Bähler, 2005). The cell cycle is composed of at least four phases: S, synthesis phase wherein DNA is being replicated; G1, gap 1; M, mitosis where the yeast cell physically pulls chromosomes into the daughters and then separates; and G2, gap 2. The transitions between phases are critical checkpoints. There cell division is blocked by various conditions; for example, signals indicating there is DNA damage or incomplete DNA replication will block cells from going from S→G1. Thus, it would be expected that there may be multiple regulatory pathways, some of which likely overlap.

In Figure 3, we show a matrix of the significantly over-represented GO terms in each of the clusters we extract. As with the galactose utilization dataset, there is good block structure, although in this larger dataset there are some high-level GO terms that are significant in more than one cluster.

We identify 12 fused clusters in the data (excluding singletons). While the functional annotation of many of these correspond to those previously identified by Liu *et al.* (2007), there are some interesting differences. In addition to a cluster of genes involved in methionine and cysteine biosynthesis (Cluster 9), we identify a separate cluster for arginine and glutamine biosynthesis (Cluster 3). Cluster 1 comprises mainly ribosomal proteins, but also includes metabolic genes, which may be an indication of the importance of metabolic state in cell cycle progression. Interestingly, these include the same metabolic genes that comprise Cluster 1 in the galactose utilization dataset, suggesting that these genes represent a TM that is being co-regulated with ribosomal proteins in the cell cycle. This also highlights the value of perturbations (as used in the galactose utilization data) as a better experimental design to uncover underlying TMs than a study involving a natural biological process, such as the cell cycle. Cluster 7 contains several key genes associated with cell cycle regulation, as well as several genes involved with the M-phase, chromosome structure and repair. Cluster 11 contains several genes involved in the M→G2 phase transition.

Both gene expression and ChIP-chip data contain information about the biological functions of different genes, but it is non-trivial to combine them in a sensible way, both due to their noisy nature and also because co-expression and co-regulation may not necessarily be equivalent for all genes.

Our results show that by treating data fusion on a gene-by-gene basis, the model we present here is able to produce superior extraction of functionally coherent groups of genes from a combination of gene expression and ChIP-chip data. Our model also has special cases (given by *w*=0 and *w*=1) that produce data integration results that outperform the single dataset analyses (including a fast BHC clustering using expression data only). However, the model we present is both more flexible and outperforms these special cases in both the examples we have considered in this article.

The key innovation in our model is that the data integration is treated on a gene-by-gene basis. This allows crucial flexibility because we can distinguish between genes that are likely to be fused and those that are not. We can extract genes that are closely related on the basis of both datasets, while rejecting those that are not. It is these genes that are most likely to represent the underlying TMs.

*Funding*: This work was supported by the Engineering and Physical Sciences Research Council (EP/F027400/1, Life Sciences Interface).

*Conflict of Interest*: none declared.

We can perform inference for this model using MCMC sampling, by extending the sampler in section 5.1 of (Teh *et al.*, 2006) in the following way.

Let *Z*_{ji} be the allocation of gene *i* in context *j* to a cluster. We initialize these randomly to one of K (≈ log(*n*)) initial clusters. Using the notation of Teh *et al.*, we have the following equations.

(A1)

(A2)

For convenience, we define the following quantities.

(A3)

(A4)

(A5)

(A6)

(A7)

where for compactness of notation, we make the substitutions *f*_{ji}=*L*_{j}(*x*_{ji}|ϕ_{jk}) and *f*_{qi}=*L*_{q}(*x*_{qi}|ϕ_{k}) (and noting that the integrands are split over the two lines).

Updating w: if the *w* is given a beta prior distribution with parameters *a* and *b* then the full conditional distribution of *w* is beta with parameters *a*+∑*r*_{i} and *b*+∑(1−*r*_{i}). We choose *a*=*b*=2, encoding a weak preference for *w*=0.5.

Updating *r* and *t*: the parameters *r*_{i} and *t* are updated jointly. *r*_{i} is the indicator as to whether or not gene *i* is fused. *t* is an identifier for a given mixture component, such that *Z*_{i}=*t* means that gene *i* belongs to mixture component *t*. The full conditional distribution is

(A8)

(A9)

noting that is a given *Z* is not new, it has already been used previously, and where

(A10)

(A11)

(A12)

(A13)

(A14)

(A15)

(A16)

(A17)

(A18)

If new values are to be drawn then they should be drawn in the following way. If *r*_{i}=1 then

(A19)

If *r*_{i}=0 and only *Z*_{1} is new

(A20)

If *r*_{i}=0 and only *Z*_{2} is new

(A21)

If *r*_{i}=0 and *Z*_{1} and *Z*_{2} are new

(A22)

where *k*_{1}=*k*^{new}, *k*_{2}=*k*^{new}+1 represents the creation of two new clusters of which one contains only *x*_{1i} and the other only contains *x*_{2i}.

- Antoniak C. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Stat. 1974;2:1152–1174.
- Bähler J. Cell-cycle control of gene expression in budding and fission yeast. Ann. Rev. Genet. 2005;39:69–94. [PubMed]
- Bar-Joseph Z, et al. Computational discovery of gene modules and regulatory networks. Nat. Biotechnol. 2003;21:1337–1342. [PubMed]
- Cho R, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. cell. 1998;2:65–73. [PubMed]
- Dahl D. Model-based clustering for expression data via a Dirichlet process mixture model. In: Do K.-A, et al., editors. Bayesian Inference for Gene Expression and Proteomics. Cambridge: Cambridge University Press; 2006. pp. 201–218.
- Datta S, Datta S. Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics. 2006;7:397. [PMC free article] [PubMed]
- Eisen M. Cluster analysis and display of genome-wide expression. Proc. Natl Acad.Sci.USA. 1998;95:14863–14868. [PubMed]
- Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics. 2007;23:257. [PubMed]
- Ferguson T. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1973;1:209–230.
- Fritsch A, Ickstadt K. Improved criteria for clustering based on the posterior similarity matrix. Bayesian Anal. 2009;4:367–392.
- Gasch A, et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell. 2000;11:4241–4257. [PMC free article] [PubMed]
- Gerber G, et al. Automated discovery of functional generality of human gene expression programs. PLoS Comput. Biol. 2007;3:e148. [PubMed]
- Geweke J. Evaluating the accuracy of sampling-based approaches to calcualting posterior moments. In: Bernardo JM, et al., editors. Bayesian Statistics 4. New York: Oxford University Press; 1992. pp. 169–193.
- Harbison C, et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431:99–104. [PMC free article] [PubMed]
- Ideker T, et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science. 2001;292:929–934. [PubMed]
- Ihmels J, et al. Revealing modular organization in the yeast transcriptional network. Nat. Genet. 2002;31:370–377. [PubMed]
- Kundaje A, et al. Combining sequence and time series expression data to learn transcriptional modules. IEEE/ACM Trans. Comput. Biol. Bioinform. 2005;2:202. [PubMed]
- Lee T, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002;298:799. [PubMed]
- Liu X, et al. Context-specific infinite mixtures for clustering gene expression profiles across diverse microarray dataset. Bioinformatics. 2006;22:1737–1744. [PMC free article] [PubMed]
- Liu X, et al. Bayesian hierarchical model for transcriptional module discovery by jointly modeling gene expression and chip-chip data. BMC Bioinformatics. 2007;8:283. [PMC free article] [PubMed]
- Medvedovic M, Sivaganesan S. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics. 2002;18:1194–1206. [PubMed]
- Medvedovic M, et al. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics. 2004;20:1222–1232. [PubMed]
- Qin ZS. Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics. 2006;22:1988–1997. [PubMed]
- Rasmussen C, et al. Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures. IEEE/ACM Trans. Computat. Biol. Bioinform. 2009;6:615–628. [PubMed]
- Rasmussen CE. The infinite Gaussian mixture model. In: Solla SA, et al., editors. Advances in Neural Information Processing Systems 12. Cambridge: MIT Press; 2000. pp. 554–560.
- Reid J, et al. Transcriptional programs: modelling higher order structure in transcriptional control. BMC Bioinformatics. 2009;10:218. [PMC free article] [PubMed]
- Savage RS, et al. R/BHC: fast Bayesian hierarchical clustering for microarray data. BMC Bioinformatics. 2009;10:242. [PMC free article] [PubMed]
- Segal E, et al. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics. 2003a;19:273–282. [PubMed]
- Segal E, et al. Module networks: Discovering regulatory modules and their condition specific regulators from gene expression data. Nat. Genet. 2003b;34:166–176. [PubMed]
- Teh YW, Jordan MI. Hierarchical Bayesian nonparametric models with applications. In: Lid Hjort N, et al., editors. Bayesian Nonparametrics. Cambridge: Cambridge University Press; 2010. pp. 158–207.
- Teh YW, et al. Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 2006;101:1566–1581.
- Wild D, et al. A Bayesian approach to modeling uncertainty in gene expression clusters. 3rd International Conference on Systems Biology. 2002
- Yao J, et al. Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient. BMC Bioinformatics. 2008;9:288. [PMC free article] [PubMed]
- Yeung K, et al. Clustering gene-expression data with repeated measurements. Genome Biol. 2003;4:R34. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |