Time-series gene expression data for yeast segregants

We applied our method to a set of time-series mRNA expression data measuring the gene expression levels of 95 genotyped haploid yeast segregants perturbed with the macrolide drug rapamycin [

3]. These segregants, along with their genetically diverse parents, BY4716 (BY) and RM11-1a (RM), have been genotyped previously [

72]. Rapamycin was chosen for perturbation because it was expected to induce widespread changes in global transcription, based on a screen of the public microarray data repositories [

78-

80]. This perturbation allowed for the capture of a large subset of all regulatory interactions encoded by the yeast genome. Each yeast culture was sampled at 10-minute intervals for 50

minutes after rapamycin addition. The RNA purified from these samples was profiled with Affymetrix Yeast 2.0 microarrays. Probe signals were summarized into gene expression levels using the Robust Multi-array Average (RMA) method [

81] and genes not exhibiting significant changes in expression were filtered from the data as described in [

3]. The data subset that remained consisted of the time-dependent mRNA expression profiles of 3556 genes. The complete time series gene expression data are publicly available at ArrayExpress (

http://www.ebi.ac.uk/arrayexpress/) with accession number E-MTAB-412.

Bayesian model averaging (BMA)

BMA is a variable selection approach that takes model uncertainty into account by averaging over the posterior distribution of a quantity of interest based on multiple models, weighted by their posterior model probabilities [

82,

83]. In BMA, the posterior distribution of a quantity of interest Θ given the data

*D* is given by

, where

*M*_{1},…,

*M*_{k} are the models considered. Each model consists of a set of candidate regulators. In order to efficiently identify a compact set of promising models

*M*_{k} out of all possible models, two approaches are sequentially applied. First, the leaps and bounds algorithm [

84] is applied to identify the best

*nbest* models for each number of variables (i.e., regulators). Next, Occam’s window is applied to discard models with much lower posterior model probabilities than the best one [

85]. The Bayesian Information Criterion (BIC) [

86] is used to approximate each model's integrated likelihood, from which its posterior model probability can be determined.

While BMA has performed well in many applications [

60], it is hard to apply directly to the current data set in which there are many more variables than samples. Yeung et al. [

62] proposed an iterative version of BMA (iBMA) to resolve this problem. At each iteration, BMA is applied to a small number, say,

*w*=

30, of variables that could be efficiently enumerated by leaps and bounds. Candidate predictor variables with a low posterior inclusion probability are discarded, leaving room for other variables in the candidate list to be considered in subsequent iterations. This procedure continues until all the variables have been processed.

Supervised framework for the integration of external knowledge

We formulated network construction from time series data as a regression problem in which the expression of each gene is predicted by a linear combination of the expression of candidate regulators at the previous time point. Let *D* be the entire data set and *X*_{g,t,s} be the expression of gene *g* at time *t* in segregant *s*. Denote by *R*_{g} the set of regulators for gene *g* in a candidate model. The expression of gene *g* is formulated by the following regression model:

where *E* denotes expectation and *β*’s are regression coefficients. For each gene, we apply iBMA to infer the set of regulators.

To account for external knowledge in the network construction process, Yeung et al. [

3] introduced a supervised framework to estimate the weights of various types of evidence of transcriptional regulation and subsequently derived top candidate regulators. For instance, a target gene is likely to be co-expressed with its regulators across diverse conditions in publicly available, large-scale microarray experiments [

78,

87,

88]. ChIP-chip data [

89] provide supporting evidence for a direct regulatory relationship between a given TF and a gene of interest by showing that the TF directly binds to the promoter of that gene. A candidate regulator with known regulatory roles in curated databases such as the

*Saccharomyces* Genome Database (SGD) [

90] would be favored

*a priori*. Polymorphisms in the amino acid sequence of a candidate regulator that affect its regulatory potential provide further evidence of a regulatory relationship [

44]. Common gene ontology (GO) [

91] annotations for a target gene and candidate regulators also provide evidence of functional relationship.

To study the relative importance of the various types of external knowledge from the supervised framework, we collected 583 positive examples of known regulatory relationships between TFs and target genes from the

*Saccharomyces cerevisiae* Promoter Database (SCPD) [

92] and the Yeast Protein Database (YPD) [

93]. Random sampling of these TF-gene pairs was used to generate 444 negative examples. Logistic regression using BMA was applied to estimate the contribution of each type of external knowledge in the prediction of regulatory relationships. The fitted model was then used to predict the regulatory potential

*π*_{gr} of a candidate regulator

*r* for a gene

*g*, i.e., the prior probability that candidate

*r* regulates gene

*g*, for all possible regulator-gene pairs. Next, the regulatory potentials were used to rank and shortlist the top

*p* candidate regulators for each gene (

*p*=

100 by default in our experiments). The shortlisted candidates were then input to BMA for variable selection in the network construction process.

Incorporating prior probabilities into iBMA

The potential benefit of using information from external knowledge to refine the search for regulators was shown by Yeung et al. and many others [

3,

13,

15-

17,

43,

44]. However, external knowledge was only used to shortlist the top

*p* candidate regulators for each target gene in Yeung et al. Here, we develop a formal framework that fully incorporates external knowledge into the BMA network construction process.

We associate each candidate model *M*_{k} with a prior probability, namely:

where

*π*_{gr} is the regulatory potential of a candidate regulator

*r* for a gene

*g, δ*_{kr}=

1 if

*r* *M*_{k} and

*δ*_{kr}=

0 otherwise [

85,

94]. Intuitively, we consider models consisting of candidate regulators supported by considerable external evidence to be frontrunners. A model that contains many candidate regulators with little support from external knowledge is penalized.

The posterior model probability of model *M*_{k} is given by

where *f*(*D* | *M*_{k}) is the integrated likelihood of the data *D* under model *M*_{k}, and the proportionality constant ensures that the posterior model probabilities sum up to 1.

Then Occam’s window was used to discard any model *M*_{k} having a posterior odds less than 1/*OR* relative to the model with the highest posterior probability, *M*_{opt}. The parameter *OR* controls the compactness of the set of selected models, and here we set it to 20.

Extension of iBMA: cumulative model support

In Yeung et al. [

3], the models selected in an intermediate iteration by iBMA were not recorded once that iteration was completed, and the final set of models selected were chosen only from those considered in the last iteration. While computationally efficient, this strategy overlooked the possibility of accumulated model support over multiple iterations. We improve the model selection process by storing all the models selected in any iteration and applying Occam’s window to this cumulative set of models as the last step in the algorithm.

At the end of each iteration of iBMA, and after applying Occam’s window to all models considered, we compute the posterior inclusion probabilities for each candidate regulator *r* by summing up the posterior probabilities of all models that involve this regulator.

where F is the set of all possible models for gene g, β

_{gr} is the regression coefficient of a candidate regulator

*r* for a gene

*g, δ*_{kr}=

1 if

*r* *M*_{k} and

*δ*_{kr}=

0 otherwise. Finally, we infer regulators for each target gene

*g* by thresholding on the posterior inclusion probability at a predetermined level (50% in all our experiments unless otherwise specified).

Extensions of the supervised framework

We have extended the supervised framework of Yeung et al. [

3] in three ways.

Imputation of missing values in ChIP-chip data

About 9% of the ChIP-chip data used in the training samples were originally undefined. The ChIP-chip data take the form of

*p*-values for the statistical tests of whether candidate regulator

*r* binds to the upstream region of gene

*g in-vivo*. In [

3], those undefined values were regarded as lack of evidence for upstream binding and assigned values of one. Here, we used multiple imputation [

95,

96], in which we sampled with replacement from the empirical distribution of the non-missing ChIP-chip data, conditioning on the presence or absence of regulatory relationships. We used 20 imputations as recommended by Graham et al. [

97] for scenarios with about 10% missing data. Logistic regression was then performed on the training sample filled with the imputed ChIP-chip values.

Truncation of extreme values in external data

Some of the external data types used in the supervised learning stage contained value ranges for individual genes that far exceeded the ranges for these genes in the training samples, e.g. the SNP-level information in Additional file

2: Table S3. Therefore, we truncated all extreme values in the external data to the respective maximum value observed in the training samples.

Adjustment for sampling bias regarding positive and negative cases

In the supervised framework of Yeung et al., the expected number of regulators per target gene, computed as the sum of regulatory potentials of all candidate regulators, mostly fell between 400 and 600 (see Figure (a)). Such an apparent overestimation of positive regulatory relationships was due to the fact that similar numbers of positive and negative examples in the supervised learning stage. Given the sparse nature of a gene regulatory network, we expect the number of TF-gene pairs with regulatory relationships to be a small proportion of the total.

Here, we address this issue by using a strategy that is commonly used in case–control studies, in which disease (positive) cases are usually rare [

98,

99]. Let

*π*_{1} and

*π*_{0} be the sampling rates for positive and negative cases respectively. To adjust for the difference in the sampling rates, we add an offset of -log(

*π*_{1}/

*π*_{0}) to the logistic regression model. Equivalently, we divide the predicted odds by

*π*_{1}/

*π*_{0}. Previous literature has suggested that the in-degree distribution of gene regulatory networks decays exponentially [

100-

102]. Based on regulatory relationships documented in various yeast databases [

90,

92,

93,

103,

104], Guelzim et al. [

100] empirically estimated the in-degree distribution of the regulatory network as 157

*e*^{-0.45m}, where

*m* denotes the number of TFs for a target gene. This implies that each target gene is regulated by approximately 2.76 TFs on average. Since we have 583 positive training examples, 444 negative examples, and 6000 yeast genes, we characterize such a network with density

*τ*=

2.76/6000

=

0.00046, and compute

, and

. Therefore, we divide all the predicted odds by

*π*_{1}/

*π*_{0}=

2853. For instance, if the original predicted probability is 0.9, i.e., the predicted odds is 9, then after scaling the odds adjusted for sampling bias, it becomes 9/2853

=

0.0032, implying an adjusted probability of 0.0032. As shown in Figure (b), the expected number of regulators per target gene has dropped substantially to a level of around 0.5 after our three correction strategies (adjustment of sampling bias, imputation of missing ChIP-chip values and truncation of extreme values) are applied. Additional file

1: Figure S2 shows the incremental merit of our correction strategies. Additional file

2: Table S3 gives the estimated regression coefficient and the posterior probability for each external data type in our revised supervised framework.

To assess the sensitivity of our results to changes in the assumed prior average number of regulators per target gene, we repeated the analysis with various levels of the network density

*τ*, and found that the assessment results were comparable. Please see the

Additional file 3 for complete details.

Summary: outline of algorithm

1. For each gene *g*, rank the candidate regulators based on the regulatory potentials predicted from the supervised framework.

2. Shortlist the top

*p* candidates from the ranked list (

*p*=

100 in our experiments).

3. Fill the BMA window with the top

*w* candidates in the shortlist (

*w*=

30 in our experiments).

4. Apply BMA with prior model probabilities based on the external knowledge:

a. Determine the best

*nbest* models for each number of variables using the leaps and bounds algorithm (

*nbest*=

10 in our experiments).

b. For each selected model, compute its prior probability relative to the *w* candidates in the current BMA window using Equation (5).

c. Remove the

*w* candidate regulators with posterior inclusion probability Pr(

*β*_{gr}≠

0 |

*D*) <5%.

5. Fill the *w*-candidate BMA window with those not considered yet in the shortlist.

6. Repeat steps 4–5 until all the *p* candidates in the shortlist have been processed.

7. Compute the prior probability for all selected models relative to all the *p* shortlisted candidates using Equation (5).

8. Take the collection of all models selected at any iteration of BMA, and apply Occam’s window, reducing the set of models.

9. Compute the posterior inclusion probability for each candidate regulator using the set of selected models, and infer candidates associated with a posterior probability exceeding a pre-specified threshold (50%) to be regulators for target gene *g*.

External knowledge is used in the following ways:

1. All the candidate regulators are ranked according to their regulatory potentials, which were predicted using the available external data sources at the supervised learning stage.

2. Model selection is performed by comparing models against each other based on their posterior odds. As shown by Equation (6), the posterior odds is proportional to a product of the integrated likelihood and the prior odds. The prior probability and, therefore, the prior odds, of a candidate model are formulated as a function of regulatory potentials.

3. The posterior inclusion probability of each candidate regulator, from which inference is made about the presence or absence of a regulatory relationship, is positively related to its regulatory potential. As shown in Equation (5), a factor of *π*_{gr} is contributed to each model in which the candidate *g* is included. Otherwise, a factor of 1- *π*_{gr} is contributed to each model.