The primary goal of our statistical model is to infer probable gene-TF interactions through the integration of available biologic data. Mathematically, we formulate our parameters of interest as unknown indicator variables Cij = 1 if gene i is regulated by TF j or 0 otherwise. Collectively, the matrix C of these indicator variables gives us our clusters of co-regulated genes, because all genes i, where Cij = 1, are estimated to be in a cluster together regulated by TF j. These indicator variables are the basis of a regulatory network, and can be visually represented as edges in a graph that connects TFs to the genes that they regulate. An important aspect of this formulation is that we are explicitly allowing genes to belong to multiple clusters controlled by different transcription factors (for instance, Cij = 1 and Cij' = 1 for j ≠ j').
In order to infer likely values for C, our model incorporates up to three general classes of biologic information: gene expression data, ChIP binding data, and sequence-level binding site data. We denote our expression data as git, the log-expression of gene i (I = 1 ... N) in experiment t (t = 1 ... T). The set of T experiments can be from different tissues, different time course experiments, and different gene knockout experiments, or any combination thereof. Within these expression data, we give special focus on the expression of genes that encode known TF proteins. For our J known TFs, we denote fjt as the regulation activity currently estimated by log expression of TF gene j (j = 1 ... J) in experiment t. In addition to expression data, we have available ChIP experiments, which give information on the physical binding location of specific TFd. We use bij to denote the probability that TF j physically binds in close proximity to gene i, from a ChIP binding experiment for TF j. Finally, we have available sequence-level information in the form of known or putative binding sites for specific TFs located in the upstream regions of target genes. We denote mij as the probability that TF j has a binding site in the regulatory region of gene i. These binding sites could be experimentally verified, or predicted by scanning upstream sequences for similarity to an established PWM for a particular TF. We outline our model in the most general case, where all three of these data types are present, but we also discuss the ramifications on our procedure when only subsets of these data types are available.
Our model consists of multiple levels organized in a hierarchical fashion. The first level of our model incorporates our gene expression data by specifying the observed gene log expression git as a linear function of TF gene log expression, fjt:
We are using the expression fjt
of the gene that encodes TF j
as a proxy for the activity of TF j
, in which case bj
is the linear effect of TF j
on target gene expression. Note that only TFs with probable connections to gene i
= 1) are allowed to influence the expression of gene i
in the above equation. This also means that ai
can be interpreted as the baseline expression for gene i
in the absence of regulation by known TFs (Cij
= 0 for all TFs j
). However, Eqn 1 (above) and similar linear models [6
] are limited by not allowing for combinatorial relationships between TFs. Each TF j
has a single effect (bj
) on the expression of gene i
, which does not take into account the biologic reality that expression is often the result of synergistic or antagonistic binding of multiple TFs. We acknowledge these combinatorial relationships by expanding our linear model to include interaction terms:
Where we now have additional coefficients gjk, which can be interpreted as the synergistic (or antagonistic) effect of both TFs j and k binding together to the same upstream region (in addition to the effects of TF j or k binding in isolation). Considering the large number of factors involved in these linear equations, we should be cautious about over-interpretation of individual bj or gjk coefficients, but these parameters can still provide information about the partial effects of particular TFs on gene expression. It should also be noted that the gene expression data we use is on the log scale, and although this same model could be used for measurements of absolute expression levels (when available), the interpretation of the linear and interaction terms would be quite different in that situation. Our model could also be expanded to allow higher order interaction terms, although at increased computational cost.
As mentioned above, we may have two additional classes of data for a particular gene-TF interaction Cij. We may have bij, the probability that TF j physically binds in proximity to gene i in a ChIP binding experiment, and mij, the probability of a binding site for TF j in the upstream region of gene i. The second level of our model incorporates both bij and mij into a prior distribution for our unknown indicator variable Cij. For a given value of the weight variable wj, the distribution of our clustering indicators Cij given bij and mij can be factored into a product of two prior distributions, one based on bij and one based on mij.
The variable wj
is the relative weight of the prior ChIP-binding information bij
versus the TF binding site information mij
. The weights w
) are TF specific but not gene specific, and are designed to reflect potential global differences in quality between the binding data and PWM scanning data for TF j
. However, because this relative quality is not necessarily known a priori
, we will treat each weight wj
as an unknown variable. Clearly, if only ChIP binding data for TF j
are available then wj
= 1 and Eqn 3 reduces to a function of bij
only, whereas if only TF binding site data for TF j
is available then wj
= 0 and Eqn 3 reduces to a function of mij
only. Our weighted prior methodology could also be used to balance the information from two different ChIP binding experiments, in which case the variable weight would measure the relative quality between the two ChIP datasets. It would also be easy to extend our model to accommodate more than two sources of data into our prior distribution, which would involve the use of multiple weight variables between the different data sources. Often, the available ChIP binding or TF binding site data are not available directly as probabilities, but rather as P
values or scores from a previous statistical analysis. We convert these P
values or scores to probabilities with an EM (expectation maximization) algorithm [30
] based on a mixture model, which is described in detail in the Additional data file 1 (Supplementary Materials).
The Bayesian paradigm gives us a principled framework for connecting these model levels into a single posterior distribution for all unknown parameters:
where Θ denotes the collection of linear model parameters (Θ = α,β,γ,σ2 ...). The term p(g|f,C,Θ) represents our first model level with expression data g = (git) and f = (fjt), and p(C|m,b,w) represents our second model level with ChIP binding data b = (bij) and TF binding site data m = (mij). All that remains is the specification p(Θ,w), the prior distributions for our TF specific prior weights w = (w1 ... wJ) and our linear model parameters Θ (summarized in Table ).
In the absence of additional prior information about these parameters, we can make these prior distributions 'non-informative' (non-influential relative to the data) by setting our prior variance parameters
to be very large (in this study, 10,000) and setting the degrees of freedom for σ2
to be small (in this study, 2). This complicated model is implemented using Gibbs sampling [31
], which is an iterative Markov Chain Monte Carlo algorithm that samples new values for each set of unknown parameters conditional on the current values of all other parameters. The COGRIM R package and supplementary materials are available for download from our COGRIM website [32