A major challenge in computational biology is to reveal the cis
-regulatory logics of gene expression through analysis of high-throughput genomic data, for example, genomic sequences and gene expression data. A common practice is to first identify putatively co-regulated genes by clustering gene expression patterns [1
], and then search for common motifs from the promoter sequences of the genes in the same cluster [4
]. Such enriched motifs, if identified, are often believed to be the binding motifs of a common transcription factor (TF). This strategy has been successful on small datasets, but is limited by its strong assumptions that co-expression means co-regulation and vice versa [8
]. Furthermore, in higher eukaryotes, genes are typically regulated by combinations of TFs, and TF binding motifs are often organized into modular units [10
]. Although some progress has been made [11
], it is still difficult to precisely identify combinatorial motifs. Finally, these methods by themselves do not reveal the actual TFs that bind to the sequence motifs.
Recently, several studies have attempted to build quantitative or qualitative models to predict gene expression levels from the regulatory motifs on their promoter sequences. Bussemaker et al. [15
] and others [9
] proposed to model gene expression levels as a linear regression of binding motif scores, and applied feature selection techniques to find the most significant motifs. These methods have been shown effective for discovering conserved short motifs related to several biological processes in S. cerevisiae
]. However, they are limited by the assumption of a linear additivity of binding motifs, and therefore is unable to represent complex cis
-regulatory logics such as AND and OR relations [17
]. Furthermore, such linear models are often difficult to interpret.
In order to obtain more realistic models with better interpretability, several classification models have been proposed. Decision tree methods have been successfully applied to find motif combinations that best separate two classes of genes [19
]. Other tree-based methods such as multivariate regression trees and bi-dimensional regression trees have been developed to model the transcriptional regulation of gene expressions over several time points simultaneously [21
]. Simonis et al. [23
] combined a string-based motif finding method and linear discriminant analysis to identify motif combinations that can separate true regulons from false ones. Middendorf et al. [24
] used an ensemble of decision trees to model gene expression levels by combining putative binding motifs and the expression levels of putative TFs. Segal et al. [13
] and Beer and Tavazoie [18
] built Bayesian networks to explain gene expression patterns from motifs. In these models, the predictors (features) are the matching scores of promoter sequences to putative binding motifs, and the predictions (responses) can be continuous or discrete gene expression levels or categorical cluster labels. A common goal of these methods is to derive simple and intuitive rules from the classification models, for example, in the form of "if a gene has motif A and motif B, it will be up-regulated under condition c".
The features used in the above models are usually de novo sequence motifs that are identified from the promoters of the genes under study [13
], or existing motifs that are obtained independently [21
]. One can also enumerate all possible words up to a certain length [9
]. The problem with these types of features is that they generally have low quality, are incomplete, and may contain many variations for the same motif. To address this issue, chromatin immunoprecipitation microarray (ChIP-chip) data [25
], which represent the relative binding strengths of TFs to the promoter regions of their target genes, have been used as a substitute of motif scores. For example, Banerjee and Zhang [26
] directly applied the method of Pilpel et al. [12
] to ChIP-chip data to identify TF combinations; Gao et al. [27
] replaced the variables in the linear model of Bussemaker et al. [15
] with ChIP-chip data and identified significant regulators for many experimental conditions. We have recently compared several types of features using decision tree models and showed that using ChIP-chip data as features generally result in better classification accuracy that using other types of features, when gene expression and ChIP-chip data are obtained from similar conditions (e.g., normal growth conditions) [20
]. An additional advantage of using ChIP-chip data than binding motifs is that the former directly creates a link between a TF and its target genes. While discovering binding motifs of TFs is still important, it can be separated from the learning of transcriptional regulatory networks.
In this paper, we propose a novel approach that combines the strengths of several recent methods in order to learn a highly accurate and reliable transcriptional regulation model, and combine the models learned from different time points and/or different experiments to construct a dynamic transcriptional regulatory network. We use TF binding data rather than binding motifs as predictor variables. We use decision trees as our underlying model, because decision trees are efficient to learn, easy to understand, can capture complex regulatory logics, and have feature selection built in [28
]. In order to improve the accuracy and robustness of our model, we use an ensemble learning approach to learn multiple decision trees for each training set. From these learned decision trees, we then extract rules in the form of, for example, "if a gene can be bound by TF A and TF B, it will be up-regulated under condition c at time point t". Furthermore, we propose a profile approach to reveal the condition-specific or time-dependent effects of TFs. We also propose a spline interpolation method to combine results from multiple time series data. Such an integrated approach can substantially eliminate noises in individual data sources and improve modeling accuracy. To validate our approach, we apply it to three sets of yeast cell cycle gene expression data [33
] and whole-genome yeast TF binding data [25
]. It is known that nine TFs – Mbp1, Swi4, Swi6, Mcm1, Fkh1, Fkh2, Ndd1, Swi5 and Ace2 – regulate a large number of yeast cell cycle dependent genes [35
]. Specifically, MBF (a complex of Mbp1 and Swi6) and SBF (a complex of Swi4 and Swi6) control late G1 genes; Mcm1, together with Fkh1 or Fkh2, recruits Ndd1 in late G2 and controls the transcription of G2/M genes; and Swi5 and Ace2 regulate genes at the end of M and early G1. This model was developed using a small set of genes and was recently confirmed by a number of computational studies [25
]. We thus applied our method to the cell cycle data to verify the accuracy of our method. In addition, by performing a large-scale analysis, we expect to construct a more detailed transcriptional regulatory network as well as capturing new, testable hypotheses for yeast cell cycle regulations.
We demonstrate that our method is able to identify biologically significant time-dependent regulatory rules, and the learned regulatory rules can be used as the basic building elements of a dynamic transcriptional regulatory network. Statistical evaluation indicates that the rules are robust and reliable. The transcriptional regulatory network constructed by our method for the yeast cell cycle genes agrees very well with the existing knowledge. Many transcriptional regulatory rules for yeast cell cycle genes discovered by our approach have been confirmed by the literature, while the other rules may yield additional insights into the biological process.