Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2904753

Formats

Article sections

- Significance of HEM
- Fully conditional distributions of gene (gi) and condition (cj) effects
- Modeling biological/experimental correlations
- References

Authors

Related links

Bioinformatics. Author manuscript; available in PMC 2010 July 15.

Published in final edited form as:

Published online 2006 May 26. doi: 10.1093/bioinformatics/btl173

PMCID: PMC2904753

NIHMSID: NIHMS212677

The publisher's final edited version of this article is available at Bioinformatics

See the article "Bayesian hierarchical error model for analysis of gene expression data." in *Bioinformatics*, volume 20 on page 2016.

Cho and Lee (2004) proposed a Bayesian hierarchical error model (HEM) to account for heterogeneous error variability in oligo-nucleotide microarray experiments. They estimated the parameters of their model using Markov Chain Monte Carlo (MCMC) and proposed an *F*-like summary statistic to identify differentially expressed genes under multiple conditions. Their HEM is one of the emerging Bayesian hierarchical modeling tools that have been developed for the analysis of multiple-level data structures and variation in microarray gene expression data (Broet *et al*., 2002; Tadesse and Ibrahim, 2004; Cho and Lee, 2004). In this letter, we first discuss the significance of the HEM developed by Cho and Lee. Then, we re-derive the fully conditional distributions for gene and conditional effects, since we think that these two fully conditional distributions were not presented properly in their paper. Finally, we expand the HEM to deal with biological or/and experimental correlations in gene expression data. A FORTRAN 90 program was developed to implement our extended method and it is available from the authors upon request.

Many aspects of the HEM presented in the article of Cho and Lee (2004) are significant. The following are the two aspects that we believe are important in the analysis of microarray gene expression data. First, error variability of gene expression data is decomposed into two components, biological and experimental. The former reflects variation owing to biological sample heterogeneity that affects gene expression, while the latter largely comes from variation in sample preparation and sample analysis. Estimating these two sources of errors separately will not only improve estimation of model parameters but also facilitate a better understanding of patterns of over all gene expression.

Second, and more importantly, the use of a Bayesian HEM is an effective way to represent complex relationships that have many parameters, because it allows the parameters to be modeled separately at several levels and provides a link (*S*) between the data (*D*) and the unobserved parameters (*θ*) of interest. In multilevel modeling, the top layer is the data, *p*(*D*|S, *θ*), which is modeled by an appropriate likelihood function. The data are assumed to be generated by some process that depends on certain unknown parameters. Modeling the link or process is the middle layer, where the art of statistical modeling takes place. On the bottom layer are the prior distributions that represent a priori beliefs about these parameters. Following the Bayesian inference, unknown parameters are inferred from the joint distribution of the parameters given the data

$$p(\theta ,S\mid D)\propto p(D\mid S,\theta )P(S\mid \theta )P(\theta )$$

(1)

Thus, Bayesian hierarchical modeling based on multivariate normal distributions effectively handles complex relationships that are governed by many parameters, and provides a tool for evaluating correlated gene expression data. For example, the model of Cho and Lee (2004) can be extended to allow for biological or/and experimental correlations. These will be discussed in more detail in the third section of this letter. Statistically, the estimation of these variables separately is biased when they are correlated (Johnston, 1972; Gianola and Sorensen, 2004). From a practical standpoint, ignoring correlations in gene expression data would result in high variability of statistical estimators and cause the reproducibility to deteriorate (Qiu *et al*., 2005).

We claim that Cho and Lee (2004) made a few mistakes when they derived the fully conditional distributions for the gene (*g _{i}*) and condition (

$$\begin{array}{l}p({\theta}_{L},{\sigma}_{e}^{2},x\mid y)\propto p(y\mid x,{\sigma}_{e}^{2})p(x\mid {\theta}_{L})p({\theta}_{L})p({\sigma}_{e}^{2})\propto p(y\mid x,{\sigma}_{e}^{2})p(x\mid \mu ,g,c,r,\mathrm{\Delta})p(\mu )p(g)p(c)p(r)p(\mathrm{\Delta})p({\sigma}_{e}^{2})\\ \propto {({\sigma}_{e}^{-2})}^{\left(\left({\sum}_{i}{\sum}_{j}{\sum}_{k}{n}_{\mathit{ijk}}\right)/2\right)}exp\left(-\frac{({\sigma}_{e}^{-2}){\sum}_{i}{\sum}_{j}{\sum}_{k}{\sum}_{l}{({y}_{\mathit{ijk}l}-{x}_{\mathit{ijk}})}^{2}}{2}\right)\times \left(\prod _{i}\prod _{j}{({\sigma}_{{b}_{ij}}^{-2})}^{{m}_{ij}/2}\right)\\ \times exp\left(-\frac{{\sum}_{i}{\sum}_{j}{\sum}_{k}({\sigma}_{{b}_{ij}}^{-2}{({x}_{\mathit{ijk}}-\mu -{g}_{i}-{c}_{j}-{r}_{ij})}^{2})}{2}\right)\\ \times exp\left(-\frac{{\sum}_{i}{g}_{i}^{2}}{2{\sigma}_{g}^{2}}\right)\times exp\left(-\frac{{\sum}_{j}{c}_{j}^{2}}{2{\sigma}_{c}^{2}}\right)\times exp\left(-\frac{{\sum}_{i}{\sum}_{j}{r}_{ij}^{2}}{2{\sigma}_{r}^{2}}\right)\times \left(\prod _{i}\prod _{j}({({\sigma}_{{b}_{ij}}^{-2})}^{{\alpha}_{b}-1}exp(-{\beta}_{b}{\sigma}_{{b}_{ij}}^{-2}))\right)\times \left({({\sigma}_{e}^{-2})}^{{\alpha}_{e}-1}exp(-{\beta}_{e}{\sigma}_{e}^{-2})\right),\end{array}$$

(2)

where *θ _{L}* = {

$$\begin{array}{l}p(\mu \mid \xb7)\propto exp\left(-\frac{{\sum}_{i}{\sum}_{j}{\sum}_{k}{\sigma}_{{b}_{ij}}^{-2}{(\mu -({x}_{\mathit{ijk}}-{g}_{i}-{c}_{j}-{r}_{ij}))}^{2}}{2}\right)\\ \propto exp\left(-\frac{{\sum}_{i}{\sum}_{j}({m}_{ij}{\sigma}_{{b}_{ij}}^{-2}){\mu}^{2}-2\mu {\sum}_{i}{\sum}_{j}{\sum}_{k}({\sigma}_{{b}_{ij}}^{-2}({x}_{\mathit{ijk}}-{g}_{i}-{c}_{j}-{r}_{ij}))}{2}\right),\end{array}$$

the fully conditional distribution of the overall mean, given the data and all other parameters, is

$$\begin{array}{l}\mu \mid \xb7\sim N\left(\frac{{\sum}_{i}{\sum}_{j}{\sum}_{k}{\sigma}_{{b}_{ij}}^{-2}({x}_{\mathit{ijk}}-{g}_{i}-{c}_{j}-{r}_{ij})}{{\sum}_{i}{\sum}_{j}({m}_{ij}{\sigma}_{{b}_{ij}}^{-2})},\frac{1}{{\sum}_{i}{\sum}_{j}({m}_{ij}{\sigma}_{{b}_{ij}}^{-2})}\right)\\ \sim N\left(\sum _{i}\sum _{j}\frac{{\sum}_{k}({x}_{\mathit{ijk}}-{g}_{i}-{c}_{j}-{r}_{ij})}{{\mathrm{\Lambda}}_{\xb7\xb7}{\sigma}_{{b}_{ij}}^{2}},{\mathrm{\Lambda}}_{\xb7\xb7}^{-1}\right),\end{array}$$

(3)

where
${\mathrm{\Lambda}}_{\xb7\xb7}={\sum}_{i}{\sum}_{j}({m}_{ij})/({\sigma}_{{b}_{ij}}^{2})$. In (3) and throughout this letter, we use ‘|·’ to mean ‘conditional on the data and all unknown parameters except the one currently under consideration’. For the effect that corresponds to the *i*-th gene, we pick components in (2) that are related to *g _{i}*, which leads to

$$\begin{array}{l}p({g}_{i}\mid \xb7)\propto exp\left(-\frac{{\sum}_{j}{\sum}_{k}({\sigma}_{{b}_{ij}}^{-2}{({g}_{i}-({x}_{\mathit{ijk}}-\mu -{c}_{j}-{r}_{ij}))}^{2})}{2}\right)\times exp\left(-\frac{{g}_{i}^{2}}{2{\sigma}_{g}^{2}}\right)\\ \propto exp\left(-\frac{({\sum}_{j}({m}_{ij}{\sigma}_{{b}_{ij}}^{-2})+{\sigma}_{g}^{-2}){g}_{i}^{2}-2{g}_{i}{\sum}_{j}{\sum}_{k}({\sigma}_{{b}_{ij}}^{-2}({x}_{\mathit{ijk}}-\mu -{c}_{j}-{r}_{ij}))}{2}\right)\end{array}$$

Thus, the fully conditional distribution of *g _{i}* given the data and all other parameters is

$$\begin{array}{l}{g}_{i}\mid \xb7\sim N\left(\frac{{\sum}_{j}{\sum}_{k}{\sigma}_{ij}^{-2}({x}_{\mathit{ijk}}-\mu -{c}_{j}-{r}_{ij})}{{\sigma}_{g}^{-2}+{\sum}_{j}{m}_{ij}{\sigma}_{{b}_{ij}}^{-2}},\frac{1}{{\sigma}_{g}^{-2}+{\sum}_{j}{m}_{ij}{\sigma}_{{b}_{ij}}^{-2}}\right)\\ \sim N\left({\sum}_{j}\frac{{\sum}_{k}({x}_{\mathit{ijk}}-\mu -{c}_{j}-{r}_{ij})}{({\sigma}_{g}^{-2}+{\mathrm{\Lambda}}_{i\xb7}){\sigma}_{{b}_{ij}}^{-2}},{({\sigma}_{g}^{-2}+{\mathrm{\Lambda}}_{i\xb7})}^{-1}\right),\end{array}$$

(4)

where ${\mathrm{\Lambda}}_{i\xb7}={\sum}_{j}({m}_{ij})/({\sigma}_{{b}_{ij}}^{2})$. Similarly, since

$$\begin{array}{l}p({c}_{j}\mid \xb7)\propto exp\left(-\frac{{\sum}_{i}{\sum}_{k}({\sigma}_{{b}_{ij}}^{-2}{({c}_{j}-({x}_{\mathit{ijk}}-\mu -{g}_{i}-{r}_{ij}))}^{2})}{2}\right)\times exp\left(-\frac{{c}_{i}^{2}}{2{\sigma}_{c}^{2}}\right)\\ \propto exp\left(-\frac{({\sum}_{i}({m}_{ij}{\sigma}_{{b}_{ij}}^{-2})+{\sigma}_{c}^{-2}){c}_{j}^{2}-2{c}_{j}{\sum}_{i}{\sum}_{k}({\sigma}_{{b}_{ij}}^{-2}({x}_{\mathit{ijk}}-\mu -{g}_{i}-{r}_{ij}))}{2}\right)\end{array}$$

the fully conditional distribution of *c _{j}*, the effect that corresponds to the

$$\begin{array}{l}{c}_{j}\mid \xb7\sim N\left(\frac{{\sum}_{i}{\sum}_{k}({\sigma}_{{b}_{ij}}^{2}({x}_{\mathit{ijk}}-\mu -{g}_{i}-{r}_{ij}))}{{\sigma}_{c}^{-2}+{\sum}_{i}({m}_{ij}{\sigma}_{ij}^{-2})},\frac{1}{{\sigma}_{c}^{-2}+{\sum}_{i}({m}_{ij}{\sigma}_{{b}_{ij}}^{-2})}\right)\\ \sim N\left({\sum}_{i}\frac{{\sum}_{k}({x}_{\mathit{ijk}}-\mu -{g}_{i}-{r}_{ij})}{({\sigma}_{c}^{-2}+{\mathrm{\Lambda}}_{\xb7j}){\sigma}_{{b}_{ij}}^{-2}},{({\sigma}_{c}^{-2}+{\mathrm{\Lambda}}_{\xb7j})}^{-1}\right)\end{array}$$

(5)

where
${\mathrm{\Lambda}}_{\xb7j}={\sum}_{i}({m}_{ij}/{\sigma}_{{b}_{ij}}^{2})$. Comparing (3), (4) and (5) with the first three equations in Appendix 1 of the paper of Cho and Lee (2004), we found that they applied a common quantity
${\mathrm{\Lambda}}_{\xb7\xb7}={\sum}_{i}{\sum}_{j}({m}_{ij}/{\sigma}_{{b}_{ij}}^{2})$ for the three fully conditional distributions and this introduced systematic biases to the posterior estimates of model parameters. In our simulation study, we found this flaw to be dramatically downward-biased posterior estimations of both condition effects and gene effects, and it consequently affected posterior estimation of interaction effects and biological variances. Figure 1 illustrates how it affected posterior sampling of a randomly chosen gene effect toward zero. In contrast, using the corrected fully conditional distribution as shown in (4), we were able to obtain a posterior estimate for this gene effect that was close to its true value (i.e. 0.7923). This situation was representative of all gene and condition effects. However, we observed no significant influence of this flaw on the posterior estimation of *μ* and
${\sigma}_{e}^{-2}$. In other words, the posterior estimates of *μ* and
${\sigma}_{e}^{-2}$ were very close to their true values using either the algorithm of Cho and Lee (2004) or the corrected one that we propose. This is not very surprising if one considers the restrictions Σ* _{i}g_{i}* =0 and Σ

Comparison of Markov chain sampling of a gene effect using (**a**) previous algorithm by Cho and Lee (2004) and (**b**) corrected algorithm by Wu, Forney and Joyce. The simulated gene effect was 0.7923. The *y*-axis represents posterior estimates of the gene effect **...**

In their paper, the conditional distribution *x _{ijk}* |· was derived with replicates, as shown below:

$${x}_{\mathit{ijk}}\mid \xb7\sim N\left(\frac{{\sigma}_{{b}_{ij}}^{2}{\sum}_{l}{y}_{\mathit{ijk}l}/{n}_{\mathit{ijk}}+({\sigma}_{e}^{2}/{n}_{\mathit{ijk}})(\mu +{g}_{i}+{c}_{j}+{r}_{ij})}{{\sigma}_{{b}_{ij}}^{2}+{\sigma}_{e}^{2}/{n}_{\mathit{ijk}}},{\left(\frac{{n}_{\mathit{ijk}}}{{\sigma}_{e}^{2}}+\frac{1}{{\sigma}_{{b}_{ij}}^{2}}\right)}^{-1}\right).$$

(6)

We think it was improper for them to reduce (6) to

$${x}_{\mathit{ijk}}\mid \xb7\sim N(\mu +{g}_{i}+{c}_{j}+{r}_{ij},{\sigma}_{{b}_{ij}}^{2}),$$

when there is no experimental replication. By our reasoning, no replication is equivalent to putting *n _{ijk}* = 1, not

$${x}_{\mathit{ijk}}=\left(\frac{1}{{n}_{\mathit{ijk}}\equiv 1}\right)\sum _{l\equiv 1}{y}_{\mathit{ijk}}={y}_{\mathit{ijk}}.$$

Cho and Lee (2004) assumed independence for both biological and experimental errors, but these assumptions may be relaxed in the HEM. In this section, we expand their HEM to model biological or/and experimental correlation in gene expression data. First, consider correlations among experimental replicates, which may arise when experimental replicates are exposed to some common environmental factors yet collected at different times or locations. In these cases, the gene expression data may show temporal or spatial variation. In bacterial biofilms, for example, microorganisms undergo significant phenotypic changes during the transition from planktonic to biofilm forms, which reflects altered underlying gene expression and regulation as they adapt to form highly organized and structured sessile communities (Schembri *et al*., 2003; Watnick and Kolter, 2000) with enhanced resistance to antimicrobial treatments and host immune responses. Furthermore, the development of a biofilm may be tuned differently as a response to a nutrient gradient from the bulk media to the inside of the biofilm. Thus, gene expression at different locations in a biofilm may show spatial variation similar to geographic data in such a way that ‘everything is related to everything else, but near things are more related than distant things’ (Tobler, 1970). Biofilms are the dominant mode of growth for bacteria in the environment, and are of great concern in medical, environmental and industry settings. Therefore, considering the spatial correlation in a model would help better understanding the formation of biofims as well as their altered biological activities.

Consider a balanced experimental design with observations on *i* = 1, …, *G* genes under *j* = 1, …, *C* conditions with *k* = 1, …, *m* biological samples each measured in *l* = 1, 2, …, *n* chips. Suppose that the *n* chips represent *n* experimental replicates (e.g. layers of a biofilm). Let *R _{n}* be a

$$p(y\mid x,{R}_{n})\propto \phantom{\rule{0.16667em}{0ex}}{\mid {I}_{m\ast}\otimes {R}_{n}\mid}^{\phantom{\rule{0.16667em}{0ex}}-1/2}exp({(y-x)}^{\prime}{({I}_{m\ast}\otimes {R}_{n})}^{-1}(y-x)).$$

If we assume that the prior distribution of *R _{n}* to be an inverted Wishart distribution with scale matrix

$${R}_{n}\mid \xb7\sim {\text{Wishart}}^{-1}({m}^{\ast}+{\kappa}_{e},{S}_{e}+{V}_{e}),$$

(7)

where *S _{e}* is a

$${x}_{\mathit{ijk}}\mid \xb7\sim N\left(\frac{{\sum}_{l}({\gamma}^{ll}{y}_{\mathit{ijk}l}+({\scriptstyle \frac{1}{2}}){\sum}_{{l}^{\prime}\ne l}{\gamma}^{{ll}^{\prime}}({y}_{\mathit{ijk}l}+{y}_{{\mathit{ijkl}}^{\prime}}))+{\sigma}_{{b}_{ij}}^{-2}{\mu}^{\ast}}{{\sigma}_{{b}_{ij}}^{-2}+{\sum}_{l}({\gamma}^{ll}+{\sum}_{{l}^{\prime}\ne l}{\gamma}^{{ll}^{\prime}})}{\left({\sigma}_{{b}_{ij}}^{-2}+\sum _{l}\left({r}^{ll}+\sum _{{l}^{\prime}\ne l}{\gamma}^{{ll}^{\prime}}\right)\right)}^{-1}\right),$$

(8)

where *μ*^{*} = *μ* + *g _{i}* +

To model correlated biological errors, we similarly define a *m* × *m* covariance matrix Δ* _{m}* such that

$$p(x\mid \mu ,g,c,r,\mathrm{\Delta})\propto \phantom{\rule{0.16667em}{0ex}}{\mid {I}_{C\ast}\otimes {\mathrm{\Delta}}_{m}\mid}^{-1/2}exp({(x-\mu -g-c-r)}^{\prime}{({I}_{C\ast}\otimes {\mathrm{\Delta}}_{m})}^{-1}(x-\mu -g-c-r))$$

Again, we assume the prior distribution for Δ* _{m}* to be an inverted Wishart distribution with scale matrix

$${\mathrm{\Delta}}_{m}\mid \xb7\sim {\text{Wishart}}^{-1}({C}^{\ast}+{\kappa}_{b},{S}_{b}+{V}_{b})$$

(9)

where *S _{b}* is a

$$\mu \mid \xb7\sim N\left({\sum}_{i}{\sum}_{j}{\sum}_{k}\frac{{\delta}^{kk}{x}_{\mathit{ijk}(\mu )}^{\ast}+({\scriptstyle \frac{1}{2}}){\sum}_{{k}^{\prime}\ne k}{\delta}^{{kk}^{\prime}}({x}_{\mathit{ijk}(\mu )}^{\ast}+{x}_{{\mathit{ijk}}^{\prime}(\mu )}^{\ast})}{{\mathrm{\Lambda}}_{\xb7\xb7}^{\ast}},{({\mathrm{\Lambda}}_{\xb7\xb7}^{\ast})}^{-1}\right),$$

(10)

where *δ ^{kk}*

$$\begin{array}{l}{\mathrm{\Lambda}}_{\xb7\xb7}^{\ast}={\sum}_{i}{\sum}_{j}{\sum}_{k}\left({\delta}^{kk}+{\sum}_{{k}^{\prime}\ne k}{\delta}^{{kk}^{\prime}}\right).\\ {g}_{i}\mid \xb7\sim N\left({\sum}_{j}{\sum}_{k}\frac{{\delta}^{kk}{x}_{\mathit{ijk}(g)}^{\ast}+({\scriptstyle \frac{1}{2}}){\sum}_{{k}^{\prime}\ne k}{\delta}^{{kk}^{\prime}}({x}_{\mathit{ijk}(g)}^{\ast}+{x}_{{\mathit{ijk}}^{\prime}(g)}^{\ast})}{{\sigma}_{g}^{-2}+{\mathrm{\Lambda}}_{i}^{\ast}},{({\sigma}_{g}^{-2}+{\mathrm{\Lambda}}_{i\xb7}^{\ast})}^{-1}\right),\end{array}$$

(11)

where

$$\begin{array}{l}{x}_{\mathit{ijk}(g)}^{\ast}={x}_{\mathit{ijk}}-\mu -{c}_{j}-{r}_{ij}\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.38889em}{0ex}}{\mathrm{\Lambda}}_{i\xb7}^{\ast}={\sum}_{j}{\sum}_{k}\left({\delta}^{kk}+{\sum}_{{k}^{\prime}\ne k}{\delta}^{{kk}^{\prime}}\right)\\ {c}_{i}\mid \xb7\sim N\left(\sum _{i}\sum _{k}\frac{{\delta}^{kk}{x}_{\mathit{ijk}(c)}^{\ast}+({\scriptstyle \frac{1}{2}}){\sum}_{{k}^{\prime}\ne k}{\delta}^{{kk}^{\prime}}({x}_{\mathit{ijk}(c)}^{\ast}+{x}_{{\mathit{ijk}}^{\prime}(c)}^{\ast})}{{\sigma}_{c}^{-2}+{\mathrm{\Lambda}}_{\xb7j}^{\ast}},{({\sigma}_{c}^{-2}+{\mathrm{\Lambda}}_{\xb7j}^{\ast})}^{-1}\right),\end{array}$$

(12)

where ${x}_{\mathit{ijk}(c)}^{\ast}={x}_{\mathit{ijk}}-\mu -{g}_{j}-{r}_{ij}$ and ${\mathrm{\Lambda}}_{\xb7j}^{\ast}={\sum}_{j}{\sum}_{k}\left({\delta}^{kk}+{\sum}_{{k}^{\prime}\ne k}{\delta}^{{kk}^{\prime}}\right)$.

$${r}_{ij}\mid \xb7\sim N\left(\sum _{k}\frac{{\delta}^{kk}{x}_{\mathit{ijk}(r)}^{\ast}+({\scriptstyle \frac{1}{2}}){\sum}_{{k}^{\prime}\ne k}{\delta}^{{kk}^{\prime}}({x}_{\mathit{ijk}(r)}^{\ast}+{x}_{{\mathit{ijk}}^{\prime}(r)}^{\ast})}{{\sigma}_{r}^{-2}+{\mathrm{\Lambda}}_{r}^{\ast}},{({\sigma}_{r}^{-2}+{\mathrm{\Lambda}}_{r}^{\ast})}^{-1}\right),$$

(13)

where
${x}_{\mathit{ijk}(c)}^{\ast}={x}_{\mathit{ijk}}-\mu -{g}_{i}-{c}_{j}$ and
${\mathrm{\Lambda}}_{r}^{\ast}={\sum}_{k}({\delta}^{kk}+{\sum}_{{k}^{\prime}\ne k}{\delta}^{{kk}^{\prime}})$. Under the assumption of a non-zero biological covariance, (10), (11) and (12) are reduced to (3), (4) and (5), respectively, and (13) reduced to the fully conditional distribution of *r _{ij}* by Cho and Lee (2004). When both biological and experimental errors are correlated, (8) becomes

$${x}_{\mathit{ijk}}\mid \xb7\sim N\left(\frac{{\sum}_{l}{\gamma}^{ll}{y}_{\mathit{ijk}}+({\scriptstyle \frac{1}{2}}){\sum}_{{l}^{\prime}\ne l}{\gamma}^{{ll}^{\prime}}({y}_{\mathit{ijk}l}+{y}_{\mathit{ijk}{l}^{\prime}}))+({\delta}^{kk}{\mu}^{\ast}-{\sum}_{{k}^{\prime}\ne k}{\delta}^{{kk}^{\prime}}({x}_{{\mathit{ijk}}^{\prime}}-{\mu}^{\ast}))}{{\delta}^{kk}+{\sum}_{l}({\gamma}^{ll}+{\sum}_{{l}^{\prime}\ne l}{\gamma}^{{ll}^{\prime}})},{({\delta}^{kk}+{\sum}_{l}({\gamma}^{ll}+{\sum}_{{l}^{\prime}\ne l}{\gamma}^{{ll}^{\prime}}))}^{-1}\right)$$

(14)

To show how our expanded model works, we simulated the expression data of 1000 genes, which were measured using 5 biological samples with 3 experimental replicates of each and 10 biological samples each with 6 experimental replicates of each. Briefly, the grand mean was given as a constant. Gene effects were generated from
$N(0,{\sigma}_{g}^{2})$, condition effects from
$N(0,{\sigma}_{c}^{2})$, and interaction effects from
$N(0,{\sigma}_{r}^{2})$, where
${\sigma}_{g}^{2}=0.5,{\sigma}_{c}^{2}=0.5$ and
${\sigma}_{r}^{2}=1.0$. Biological and experimental errors were randomly generated either from *b* ~ *N*(0‚*I*_{C*} Δ* _{m}*) and
$e\sim N(0,{I}_{n\ast}\times {\sigma}_{e}^{2})$, or from
$b\sim N(0,{I}_{m\ast}\times {\sigma}_{{b}_{ij}}^{2})$ and

$$\text{MSE}=var({\widehat{g}}_{i})+{(\text{bias}({\widehat{g}}_{i}))}^{2}$$

(15)

where

$$var({\widehat{g}}_{i})=\frac{1}{T}\times \sum _{t=1}^{T}{({\widehat{g}}_{i}^{t}-{\overline{\widehat{g}}}_{i})}^{2}$$

represented sampling variation and

$${(\text{bias}({\widehat{g}}_{i}))}^{2}={\left(\frac{1}{T}\times \sum _{t=1}^{T}{\widehat{g}}_{i}^{t}-{g}_{i}\right)}^{2}$$

represented squared bias, and
${\widehat{g}}_{i}^{t}$ was an estimate of *g _{i}* at saved iteration

Our extended model facilitates estimating patterns of biological or/and experimental (co)variance or correlations (Fig. 2) whereas such information could not be inferred using the previous model (Cho and Lee, 2004). As mentioned before, biological or experimental correlations could be of considerable interest and necessary to help us better understand the nature of correlated gene expression. For example, suppose the pattern of experimental (co)variance in Figure 2b represents the environmental correlations in a biofilm. We might conclude that nutrient differences result in higher correlations between the inner and bottom layers than between the top and the other two layers.

3D patterns of estimated (co)variance components among (**a**) five biological sample and (**b**) three experimental replicates.

Further, we investigated how different model specifications about biological/experimental covariance (i.e. zero versus non-zero) would affect the posterior estimates of parameters of interest. Ignoring experimental covariance in the model did not lead to significant differences in parameter estimation (data not shown). However biological covariance components are important for estimating gene and interaction effects. If the biological covariance components are ignored then gene effects and interaction effects will be estimated with large biases. However, grand mean, condition effects and the common residual variance appear to be more robust, and gave similar results under either model specification about biological (co)variance. As shown in Table 1, considering non-zero biological (co)variance in the model considerably decreased estimation biases of gene and interaction effects. Squared biases, and consequently MSE, of these parameters were smaller in the analysis considering biological covariance components than by ignoring them. Given the variation that occurs during the collection of gene expression data, posterior estimation of gene and interaction effects was better, in terms of MSE, sampling variation and squared biases, when more samples were used (Table 1). This result would challenge the present practice where a very limited number (say three) of samples are used in most gene expression DNA array experiments.

Mean squared error (MSE), sampling variance (SV) and squared bias (SB) in posterior estimation of model effects under the assumption of zero [COV(*b*) = 0] versus non-zero [COV(*b*) ≠ 0] biological covariance^{a},^{b}

Finally, we show how model specifications about biological covariance would affect the estimation of gene expression difference, (*μ* + *g _{i}* +

Regression of estimated to true gene expression difference obtained by assuming (**a**) zero versus (**b**) non-zero biological covariance components.

In conclusion, we extended the model of Cho and Lee (2004) to allow the analysis of gene expression data with biological/environmental correlations, which not only facilitate estimating biological/expression correlation patterns, but it also improved estimation of model parameters and consequently identifying differentially expressed genes in DNA array experiments. A FORTRAN 90 program was developed to implement our extended method and it is freely downloadable at http://www.ibest.uidaho.edu/tools/banal_array/

- Broet P, et al. Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J Comput Biol. 2002;9:671–683. [PubMed]
- Cho H, Lee JK. Bayesian hierarchical error model for analysis of gene expression data. Bioinformatics. 2004;20:2016–2025. [PubMed]
- Gianola D, Sorensen D. Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes. Genetics. 2004;167:1407–1424. [PubMed]
- Johnston JM. Econometric Methods. McGraw-Hill; New York: 1972.
- Oleksiak MF, et al. Variation in gene expression within and among natural populations. Nat Genet. 2002;32:261–266. [PubMed]
- Qiu X, et al. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics. 2005;6:120. [PMC free article] [PubMed]
- Tadesse MG, Ibrahim JG. A Bayesian hierarchical model for the analysis of Affymetrix arrays. Ann N Y Acad Sci. 2004;1020:41–48. [PubMed]
- Tobler W. Review of J. Forrester, ‘Urban Dynamics’ Geogr Anal. 1970;2:198–199.
- Schembri MA, et al. Global gene expression in
*Escherichia*c*oli*biofilms. Mol Microbiol. 2003;48:253–267. [PubMed] - Watnick P, Kolter R. Biofilm: city of microbes. J Bacteriol. 2000;182:2675–2579. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |