Decades of research on the bacterium
Escherichia coli have led to the accumulation of a large knowledge base about transcriptional regulation within this prokaryotic model organism. Researchers have electronically encoded in databases (such as EcoCyc and RegulonDB) thousands of activation and repression relationships among transcription factors (TFs) and genes
[1]–
[3]. However, while
E. coli has one of the most comprehensive datasets of experimentally verified transcriptional regulatory interactions of any organism, this dataset is still far from complete. For instance, the experimentally verified and curated TF-gene interactions provide regulatory relationships for only approximately 1000 genes, well below the more than 4000 genes predicted to be present in
E. coli. This relatively low coverage of the experimentally verified and curated interaction network presents a challenge when attempting to reconstruct the active regulatory network for a condition of interest based on microarray gene expression data. When analyzing microarray experiments, researchers often need information about the set of genes predicted or known to be regulated by various TFs. This information can then be used to determine the influence of the TFs in the condition of interest by indirectly observing the activity of the regulated genes, even for cases in which the TF is post-transcriptionally regulated
[4]–
[6].
A traditional computational approach to identify additional gene targets of a TF, which has been applied to
E. coli, is to characterize the DNA sequence binding preferences of a TF based on an alignment of known binding sites of the TF, and then use this alignment to scan the promoter region of genes for sites matching the preferences
[7]. In some cases researchers have used conservation as an additional filter
[8]–
[10] or extended the alignment-based approach using a biophysically based model
[11]. While it has been shown that for some TFs in
E. coli the presence of a motif can be highly predictive of true binding
[12], for other TFs the motif pattern is more degenerate, leading to reduced accuracy. An additional limitation in
E. coli, where genes are organized into transcriptional units and many TFs function as both activators and repressors
[2], is that motif scanning only determines the binding site location, which is not sufficient to determine if a specific binding site is being used to activate or repress a specific gene
[13].
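The alignment-and-scan procedure described above can be illustrated with a minimal sketch. It assumes equal-length, gap-free aligned sites, a uniform background base frequency, a pseudocount of 1, and forward-strand scanning only. The function names, the toy binding sites, and the promoter sequence are hypothetical and are not taken from any of the cited methods.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned binding sites.

    Assumes all sites have the same length and contain only A, C, G, T.
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Log-odds of the observed base frequency versus a uniform background.
        pwm.append({b: math.log2((counts[b] / total) / background) for b in "ACGT"})
    return pwm

def scan_promoter(pwm, promoter):
    """Return (best_score, offset) of the best-scoring window on the forward strand."""
    width = len(pwm)
    best_score, best_offset = float("-inf"), -1
    for start in range(len(promoter) - width + 1):
        window = promoter[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score > best_score:
            best_score, best_offset = score, start
    return best_score, best_offset

# Toy, hypothetical aligned binding sites and promoter sequence.
known_sites = ["TGTGA", "TGTGA", "TGCGA", "TGTGA"]
pwm = build_pwm(known_sites)
best_score, best_offset = scan_promoter(pwm, "AACCTGTGATTT")
```

Note that the scan reports only where the best match lies and how strong it is; as the text points out, it carries no information about whether the site activates or represses the downstream gene.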
Another approach to predicting TF-gene interactions uses only mRNA expression data, evaluating whether the expression levels of the TF and the target gene are consistent with a regulatory relationship. Faith et al.
[14] surveyed and evaluated a number of these methods using a compendium of
E. coli gene expression data. They also introduced a new method for this task, the context likelihood of relatedness (CLR), which extends Relevance Networks
[15]. Faith et al. found CLR to be the top-performing method at recovering known interactions. Other methods considered by Faith et al. include ARACNe
[16], Bayesian Networks
[17] and linear regression networks. The Relevance Network approach directly ranks TF-gene interactions based on a statistical measure such as the correlation coefficient or mutual information of the expression profile pairs. CLR extends Relevance Networks by considering the distribution of values obtained by the statistical measure for all pairs involving the same TF or regulated gene. The authors found in their evaluation that for CLR and Relevance Networks the best results were obtained using mutual information and the square of the correlation coefficient, respectively. Because these methods predict network interactions exclusively from expression data, they have the advantage of being broadly applicable to organisms for which prior knowledge of gene regulation is limited. However, in the case of
E. coli these methods are unable to take advantage of known interactions or DNA sequence binding information to improve the accuracy of the predicted interactions. In particular, these methods can only identify interactions for factors that are transcriptionally regulated, so they may miss many interactions for post-transcriptionally regulated factors.
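The difference between Relevance Networks and CLR can be made concrete with a small sketch. It is a simplified illustration under stated assumptions, not the implementation of Faith et al.: the squared correlation coefficient stands in as the statistical measure for both rankings (the evaluation cited above found mutual information to work best for CLR), each pair is rescored by combining its clipped z-scores against the background of all pairs sharing the same TF and all pairs sharing the same gene, and the expression profiles and names are toy, hypothetical values.

```python
import math
from statistics import mean, pstdev

def squared_correlation(x, y):
    """Square of the Pearson correlation coefficient of two expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov * cov / (var_x * var_y)

def relevance_scores(tf_expr, gene_expr):
    """Relevance-Network-style score: one similarity value per TF-gene pair."""
    return {(tf, gene): squared_correlation(tx, gx)
            for tf, tx in tf_expr.items()
            for gene, gx in gene_expr.items()}

def clr_scores(scores):
    """CLR-style rescoring against the per-TF and per-gene score distributions."""
    tfs = sorted({tf for tf, _ in scores})
    genes = sorted({gene for _, gene in scores})

    def clipped_z(value, background):
        sd = pstdev(background)
        return max(0.0, (value - mean(background)) / sd) if sd > 0 else 0.0

    rescored = {}
    for (tf, gene), s in scores.items():
        z_tf = clipped_z(s, [scores[(tf, g)] for g in genes])    # same TF
        z_gene = clipped_z(s, [scores[(t, gene)] for t in tfs])  # same gene
        rescored[(tf, gene)] = math.sqrt(z_tf ** 2 + z_gene ** 2)
    return rescored

# Toy, hypothetical expression profiles across six conditions.
tf_expr = {"tfA": [1, 2, 3, 4, 5, 6], "tfB": [6, 1, 4, 2, 5, 3]}
gene_expr = {"geneX": [2, 4, 6, 8, 10, 12],
             "geneY": [5, 3, 6, 2, 4, 1],
             "geneZ": [1, 1, 2, 1, 2, 1]}
raw = relevance_scores(tf_expr, gene_expr)
clr = clr_scores(raw)
top_raw_pair = max(raw, key=raw.get)
```

Both rankings depend only on the expression profile of the TF itself, which is why, as noted above, interactions of post-transcriptionally regulated factors are likely to be missed.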
In this paper we introduce a new method, SEREND (SEmi-supervised REgulatory Network Discoverer), to predict TF-gene regulatory interactions in
E. coli. SEREND is an iterative semi-supervised computational prediction method that takes advantage of known regulatory interactions in
E. coli and extends them by leveraging TF sequence binding affinities and a compendium of expression data. Similar to other methods
[4]–
[6], SEREND does not assume that a TF is necessarily transcriptionally regulated. Instead, SEREND uses expression data in the context of known or predicted TF-gene interactions. However, these previous methods assume a fixed set of TF-gene interactions, while the purpose of SEREND is to predict additional TF-gene interactions. These predictions can later be used as input to these other methods, as we demonstrate for one method on a new expression dataset. Other methods have performed iterative analysis, as SEREND does here
[18],
[19]. However, unlike SEREND, which focuses on classification, the goal of these prior methods was clustering or gene set module identification, leading to different treatment of the features used and different meanings for the resulting sets. Another method
[20] used curated interactions and expression data along with Gene Ontology (GO) and phylogenetic similarity to predict additional gene targets, but did not use an iterative or semi-supervised approach or motif information as we do here. We chose not to use GO annotations in generating predictions, which allows us to use GO for an unbiased assessment of the functional role of predicted targets.
In evaluating SEREND, we first establish that SEREND can successfully recover many direct gene targets implicated in chromatin immunoprecipitation (ChIP)-chip experiments and compare its ability to do so with that of other methods. To further test the predictive capability of SEREND and to assess the functional relevance of the newly predicted TF-gene interactions, we combine them with new temporal microarray gene expression data obtained during the switch from aerobic to anaerobic growth conditions in
E. coli. For this we use a recently introduced computational method, Dynamic Regulatory Events Miner (DREM)
[4], that allows us to analyze and model the dynamics of the transcriptional regulatory network in response to this environmental change. As we show, the reconstructed network response agrees well with known responses during the
E. coli aerobic-anaerobic switch. Moreover, by using the new TF-gene interactions predicted by SEREND, DREM is also able to suggest additional TFs as controlling different stages of the aerobic-anaerobic switch response in
E. coli.