|Home | About | Journals | Submit | Contact Us | Français|
Transcription factors (TFs) regulate downstream genes in response to environmental stresses in plants. Identification of TF target genes can provide insight on molecular mechanisms of stress response systems, which can lead to practical applications such as engineering crops that thrive in challenging environments. Despite various computational techniques that have been developed for identifying TF targets, it remains a challenge to make best use of available experimental data, especially from time-series transcriptome profiling data, for improving TF target identification.
In this study, we used a novel approach that combined kinetic modelling of gene expression with a statistical meta-analysis to predict targets of 757 TFs using expression data of 14,905 genes in Arabidopsis exposed to different durations and types of abiotic stresses. Using a kinetic model for the time delay between the expression of a TF gene and its potential targets, we shifted a TF's expression profile to make an interacting pair coherent. We found that partitioning the expression data by tissue and developmental stage improved correlation between TFs and their targets. We identified consensus pairs of correlated profiles between a TF and all other genes among partitioned datasets. We applied this approach to predict novel targets of known TFs. Some of these putative targets were validated from the literature, for E2F's targets in particular, while others provide explicit genes as hypotheses for future studies.
Our method provides a general framework for TF target prediction with consideration of the time lag between initiation of a TF and activation of its targets. The framework helps make significant inferences by reducing the effects of independent noises in different experiments and by identifying recurring regulatory relationships under various biological conditions. Our TF target predictions may shed some light on common regulatory networks in abiotic stress responses.
Plants often respond and adapt to different environmental stresses, such as drought, cold and chemicals through various transcriptional regulatory systems . Identification of these regulations not only enhances our knowledge of biological processes in plants, but also helps a great deal in developing bio-engineered crops that can better sustain challenging environments . Typically, a handful of key transcription factors (TFs) control various biological pathways by regulating downstream target genes. In many cases, these target genes share functions or pathways. While basic ideas of these TFs and their target genes' general functions may be known, lack of knowing explicit target genes often limits the experimental design for validating intuitive hypotheses or developing new crop traits. A comprehensive list of putative targets of a TF could be used to provide more insight of a key TF through functional enrichment analysis or mapping these target genes into different biological pathways.
High-throughput expression profiling experiments  have generated large amounts of data that make it possible to develop computational approaches for predicting regulatory relations. Public repositories like NCBI Gene Expression Omnibus (GEO) [4,5], SMD (Stanford Microarray Database) , TAIR , etc. contain extensive microarray data from time series, developmental stages, genetic interventions or manipulative treatments for Arabidopsis thaliana, a model organism for plants [8,9]. These data as well as ChIP-chip data have been used to study interactions of TFs to their downstream genes [10-13]. However, mining microarray data for discovering complicated regulatory relationships is still challenging partially due to the fact that these data are often incomplete, noisy, and contain misleading outliers, all of which likely produce false positives in biological inferences.
Many computational approaches for predicting genome-wide targets of a TF are based on finding co-occurrence of TFs and their targets. These include Standard Pearson correlation technique to measure statistical significance of synchronous co-regulation of genes and order of regulation . However, correlation coefficient is a weak criterion for measuring dependence and can lead to many false positives in predicting TF targets . Another approach is Graphical Gaussian Model (GGM) based on the concept of partial correlation for learning high-dimensional dependency networks from genomic data [16,17], which is valid when number of genes is comparable to number of samples in the microarray data [18,19]. One way to avoid this limitation is to use GGM with regularization and moderation, which is implemented as an R package GeneNet [20,21]. This method has been used to infer genome-scale regulatory network for A. thaliana transcriptome . Some other methods are based on probabilistic models, such as the Bayesian network  and regression tree method . Such methods cannot be directly applied to many time series expression profiling data, because the apparent time lag between initiation of a TF and activation of its targets is not accounted in these models. For example, a study suggests a clear time lag between the mRNA levels of a TF, CBF and its known targets . In part, the time lag is used to translate the mRNAs of a TF into proteins before the proteins can act on activating/repressing TF's targets. To address this issue, it is important to adjust time-series transcription profiling data for detection of TF-target relationship .
Another group of methods to identify TF-target genes are specifically designed for time-series expression profiling data, including a method based on Needleman- Wunsch algorithm  and a dynamic probabilistic model based on chemical kinetics and linear differential equations . The dynamic probabilistic model, introduced by Friedman et al. , is able to learn the kinetic parameters of TFs binding to their target promoters and the structure of gene regulation network simultaneously. However, it requires estimation of a large number of parameters, and it does not provide an explicit way of identifying TFs' targets from predicted active regulator's protein profiles. The linear differential equation model in Ref.  describes the production and degradation of all mRNAs and their corresponding proteins with equations of chemical kinetics. While it is an interesting and promising theoretical approach, it tends to be very complex and requires concentration measurements of both mRNA and protein, at least at the initial state.
Many existing studies for retrieving regulatory information use a large collection of microarray data. A potential problem in using microarray data this way is ignoring the heterogeneity in topology of regulatory network due to biological/experimental factors, which could be different tissues, developmental stages or artificial treatments.
A specific tissue type often has its own set of genes expressed to keep its identity. This may lead to different sets of target genes regulated by the same TF. In our approach, we addressed these issues by performing tissue-wide meta-analysis of expression pattern in at least certain number of tissue types out of all tissue types as shown in Figure Figure1.1. In particular, we first perform statistical analysis on microarray datasets of each tissue type and then combine the statistics of multiple microarray datasets for predicting TF targets. Such an approach allows us to identify recurring and stable regulatory relationships under multiple biological conditions while reducing the effects of noises in gene expression data. To avoid the risk of biasing towards housekeeping genes, which are expressed in all tissues all the times, we consider only those genes whose expression profiles are differentially expressed in at least one tissue. The novelty of this approach lies in combining the meta-analysis technique to find consensus regulatory interactions with the kinetic model to estimate the time lag between a TF and its associated targets. The scope of our work is smaller than general regulatory network construction, as we are only interested in recurring targets of known TFs. The reduced scope is practically useful and makes the problem more tractable. We chose the model plant Arabidopsis Thaliana for this work given its rich availability of biological data and knowledge.
We used a kinetic model combined with statistical meta-analysis to identify TF targets and reconstructed an Arabidopsis global regulatory network using large-scale expression profiles of 14,905 genes. We then evaluated our strategy by comparative and functional analysis of predicted E2F target genes and by comparing our method with other existing methods. Finally, we analyzed the reconstructed network to infer some novel features from the network.
In order to conduct meta-analysis, we partitioned the datasets based on different attributes including tissue, experiment type and developmental stage. The tissue- specific partition of the microarray datasets produced totally 8 tissue types that have sample size of at least 9. We combined the rest of the samples into one group as combined tissues as shown in Table Table11.
We defined the significance level of TF-target pair as number of tissues in which the TF-target pair is significantly co-expressed (p-value < 0.01) after time lag corrections using the kinetic model. We built three networks of ~2K, ~12K and ~59K edges, which correspond to significance levels of more than 9, 8 and 7, respectively. For further analysis, we used the network of ~12K edges to balance the size of network and tolerance of experimental errors in each tissue. This network consists of 12,300 regulatory interactions amongst 4,968 genes, in which 757 genes act as TFs (Figure (Figure2).2). It is interesting to note that the distribution of the network is highly uneven. In some cases (e.g., lower right), a handful of TFs regulate many putative targets, while in other cases (e.g., left edge) many TFs form clusters among themselves.
In order to assess our TF target prediction with known regulatory mechanisms from the literature, we investigated Arabidopsis E2F family transcription factor "At2g36010", which represents a group of proteins that play a crucial role in the control of cell cycle progression and regulate expression of genes required for the G1/S transition. These include enzymes involved in nucleotide synthesis and DNA replication proteins [29-31]. Though it is clear that E2F is highly critical and conserved amongst high eukaryotes, only a few genes induced by E2F are experimentally verified in plants. Vandepoele et al.  combined microarray and promoter motif analyses to identify E2F-targets in plants. To do this, promoter regions of genes that were induced at the transcriptional level in Arabidopsis seedlings were searched for the presence of E2F-binding sites. In another study, Ramirez-Parra et al.  identified potential E2F-responsive genes by a genome-wide search of chromosomal sites containing E2F-binding sites. Using meta-analysis of tissue- specific microarray data, we identified 178 putative E2F-target genes (see Additional file 1). Some of these were also predicted by either Vandepoele et al.  or Ramirez- Parra et al.  as shown in Table Table2.2. As the two other studies used different analytical approaches to identify targets, the overlapping genes among all three methods have more confidence to be true E2F target genes.
We also conducted functional enrichment analysis for the 178 E2F-target genes identified using meta-analysis. We applied the AmiGO's Term Enrichment tool, which is based on GO-TermFinder . We used all the annotated genes in TAIR  as the background set. We selected enriched gene groups with a p-value cutoff of 0.01 and the minimum number of gene products of 2. Our result (Table (Table3)3) supports the previous findings that the E2F pathway plays critical roles in cell cycle regulation, DNA replication, and chromatin dynamics. In addition, we identified other novel genes, which are involved in DNA methylation on cytosine, DNA repair, ribosome biogenesis, etc.
In order to compare performance of meta-analysis with other methods for identifying TF targets, we prepared a benchmark dataset of TF-target pairs in Arabidopsis, which were obtained from the AGRIS database and AtRegNet . The benchmark set has 348 pairs in total. Some of the well-known methods to identify TF target and build regulatory network, including causal regression method, standard Pearson correlation method, and Graphical Gaussian model were used for comparative analysis. To make direct comparison of various methods, we used the exactly same microarray datasets as input to these methods and also exactly the same benchmark data. While using Pearson correlation method, Graphical Gaussian model and regression method, we did not partition the data rather we followed the procedure as previously done in the literature.
In case of microarray data partition and meta-analysis, we used three different ways to partition the microarray data, i.e., tissue based partition, experiment type based partition, and developmental stage based partition. For each type of partition, we identified genome-wide targets for the given set of TFs. While using other methods (Pearson correlation coefficient, causal regression and graphical Gaussian model), we input the microarray data as a single large dataset without partition and identified targets for the same list of TFs. Using these predicted TF-target pairs from each of the methods, we reconstructed two networks of different sizes that is, less than 40,000 edges and less than 70,000 edges. All the same category networks from different methods were then checked against the standard set to count the number of confirmed edges in these networks as shown in Table Table44.
The results show that our method with partitioning microarray data into tissue- specific datasets and then performing tissue-wide meta-analysis contains the most confirmed edges. Particularly, the network of less than 40,000 edges obtained using tissue-wide meta-analysis is 1/3 in size compared to other networks in the same category, but with more confirmed edges than any other network. The comparison clearly demonstrates that tissue-wide partition performs much better than experiment- wide or development-based partition. This indicates that the gene expression is driven more by tissue types than by development stages and experimental conditions, at least in Arabidopsis. It also shows that the tissue-wide meta-analysis could greatly improve network constructions over other methods. Interestingly, a simple method using Pearson correlation cutoff of 0.70, although not as good as meta-analysis, outperformed sophisticated methods of causal linear regression model and graphical Gaussian model. This may be because microarray data are often noisy and sophisticated methods could amplify noises to give incorrect predictions in gene regulatory relationships.
Using Cytoscape , we identified a few major hubs (nodes with many connections) from the medium sized network (~12K) using tissue-wide meta-analysis. In particular, we found regions of significant local density using the MCODE plugin  from Cytoscape. Figure Figure33 shows an example of a major hub cluster, which represents 12 TFs including SCL13, ZAT6, AtERF-1 and anac062 each targeting many genes as found in Table Table55 from further analysis.
Beside network feature analysis using Cytoscape, we analyzed TFs that target significantly more genes than other TFs across different tissues as shown in Table Table5.5. Not surprisingly, most of these TFs are annotated with response to different stimulus in Arabidopsis, such as response to chitin and external stress, given that the microarray data we used were measured in response of Arabidopsis exposed to different abiotic stresses. Some of well-connected TFs are also present in hubs as recognized by the MCODE plugin from Cytoscape and are known to work together for gene regulation. For example, Zat6, Zat10, and Zat12 in the hub of Fig. Fig.33 are activated together in cold and osmotic stresses . WRKY33 and WRKY40 in Table Table55 both function as activators of jasmonic acid-dependent defence pathways and repressors of salicylic acid signalling .
In this paper, we proposed a meta-analysis method for identifying TF targets. The novelty of the proposed method lies in combining two models that is (1) adjusting time lag between a TF and its target and (2) finding consensus regulatory interactions from different experimental sources/conditions including tissue types, developmental stages and experimental settings. Our study shows that tissue-wide partition performs much better than experiment-wide or development-based partition for predicting TF targets. The method successfully identified more known TF-target pairs in Arabidopsis than other methods.
There are some limitations of this study. Like other approaches, our method has both false positives and false negatives in predictions. Without performing a large-scale experimental validation, there is no reliable way to assess the prediction accuracies of the methods like ours. Hence, the value of our study is mainly to provide hypotheses for experimentalists to explore specific gene regulations of their interest, especially as most of the predicted TF targets with high confidence were not reported or predicted previously. Furthermore, our method may not be able to distinguish TF targets from other co-expressed non-target genes, although meta-analysis across multiple tissues reduces such a possibility. From the meta-analysis point of view, tissue-wide meta- analysis does not consider specific regulatory relations in particular tissue types. In plant, some regulations are specific to different tissue types or developmental stages. Since such relations do not exhibit significant correlation across different microarray data, meta-analysis may ignore them. Nevertheless, meta-analysis is more robust to find correlations that are consistent across different tissues. Typically, global regulations are those that are fundamental for the existence of all tissues in general.
In the context of our study, we only applied gene expression data of Arabidopsis exposed to different abiotic stresses. It is known that there are common regulatory mechanisms for abiotic stresses. For example, certain heat-shock proteins are commonly elicited in response to various stress conditions in multiple plants . Conserved regulatory mechanisms among responses to drought, salinity, and extreme temperature in Arabidopsis were identified, such as the DREB transcription factors . Characterizing common gene expression patterns under various abiotic stress conditions in plants can help elucidate these conserved regulatory mechanisms . Hence, the meta-analysis that we provided on gene expression data under different abiotic stress treatments may shed some light on common regulatory networks in abiotic stress responses. In our future studies, we will explore more into meta-analysis of microarray data by applying different statistics like meta correlation instead of chi- square statistics. Another dimension of improvement is to include inferences from other types of data such as promoter motif analysis.
We used publically available microarray data of A. thaliana from NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/) and TAIR (http://www.arabidopsis.org/). The microarray gene expression data were normalized and preprocessed in the databases. We removed genes with missing expression measurement in any tissue type and averaged the replicated expression data. Consequently, we applied our method on 497 arrays in total measuring whole-genome response of Arabidopsis exposed to different durations and types of abiotic stresses. Some 14,905 genes from Arabidopsis genome including 757 TFs were chosen for the analysis as each of these genes was differentially expressed in at least one of the stress conditions. The datasets consist of 27 different microarray experiments, out of which 10 experiments are time series (see Additional file 3).
In eukaryotic cells, the effect of a regulator is usually achieved in multiple steps, including transcription of the regulator genes, transportation of the regulator mRNA(s) out of the nucleus, translation of the transcript(s), transportation of the regulator protein back to the nucleus, and the binding of the regulator protein to the promoter regions of its target genes to achieve transcriptional regulation. Noticeable timing difference exists among changes in concentrations of the regulator mRNA, the regulator protein, and the mRNAs of its targets. A chemical kinetics model naturally fits this context by taking into account of the time lags among these events (Figure (Figure44).
Because the active level of the regulator protein is not measured directly in microarray experiments, the regulator protein concentration is treated as a hidden variable in our model to serve as the link between the measurable mRNA concentrations of a regulator and its target(s). More specifically, the regulator protein concentration can be modelled by the following chemical kinetic equation without considering post-translational regulation:
where RP is the regulator protein concentration, Rm is the regulator mRNA concentration, Ktram is the apparent rate of mRNA translation, and KP is the turnover rate of the regulator protein. Accordingly, the time course of the target mRNA concentration can be modelled with the following equation
where Tm is the concentration of the target mRNA; Bt is the basal transcription rate of the target gene; and Kt is the turnover rate of the target mRNA; f (RP) measures the regulated transcription rate, which is different for activators and repressors. For activators, it has the following Taylor first order approximation when RP is small .
f (Rp= 0) is equal to zero, assuming target gene transcription should not be activated when there is no regulator protein. is the activation rate of regulator protein on the target gene. If it is replaced by parameter Kact for simplicity, f (RP) takes the following form:
f (Rp) = KactRp. (4)
The basal level target transcription rate should satisfy the following condition:
Bt + f (Rpbasal) — KtTmbasal = 0 (5)
where Rpbasal and Tmbasal are the basal concentrations of the regulator protein and target mRNA, respectively.
Usually, what is reported in transcription profiling experiment is not the absolute concentration of mRNA, but rather a fold change compared to basal transcription level of that gene. Thus, we define relative changes of Rm and Tm as Rm' and Tm'
Rm' = Rm / Rmbasal — 1 ; (6)
Tm' = Tm / Tmbasal — 1 . (7)
Combining Equations (1), (2), (4), (5), (6) and (7), and considering the fact that KtranRmbasel - KpRpbaseal = 0, lead to the following second order ordinary differential equation:
where γ = KactKtranRmbasal / Tmbasel .
Given all the model parameters, the relationship between the relative mRNA levels of the regulator and its target, Rm' and Tm', is defined by Equation (8). In other words, for a target gene of a regulator, its relative mRNA level Tm' has to satisfy Equation (8), given the model parameters and the relative regulator mRNA level Rm'. It is interesting to note that the regulator protein concentration, a key variable in the original model equations, is not involved explicitly in the final equation relating the relative mRNA levels of regulator and target. To predict the target of a specific regulator, we can solve Equation (8) to obtain the theoretical target behaviour curve, and then find the genes with mRNA levels similar to the theoretical curve, which will be identified as the potential targets of that regulator.
In the case of transcript expression profiling experiments under stress conditions, the initial conditions should be the following:
Because the target gene mRNA and the regulator protein should be at their basal levels at the onset of stress condition (t=0). It is apparent from Equations (2) and (5) that initial condition (10) should be true.
To approximate Rm , a stepwise linear model can be fit as follows:
Rm'I (t) = α i + βitti ≤ t ≤ti + 1 i = 0, ··· ,n-1 (11)
where ti is i th time point; and αi and βi are the parameters of stepwise linear function in each time interval, which are determined by the measured regulator mRNA levels at the two adjacent time points. Equation (8) has analytic solution
Where Di = βiγ/KpKt and Ci = [αiγ – (Kp + Kt)Di]/ KpKt .
The contiguous restrictions on Tm' are stated in the following equations:
Tmi' (t) = Tmi + 1' (t), where t = ti i = 1,…, n - 1. (13)
After substituting Equation (12) into Equations (9), (10), (13) and (14), Ai and Bi can be obtained by solving sets of linear algebra equations, and are functions of αi, ßi, γ, Kt and KP.
For each regulator-target pair, there are three parameters involved in Equation (8), the target mRNA turnover rate Kt , the active regulator turnover rate KP, and γ, which is equal to KactKtrαnRmbαsl / Tmbαsal . Kαct represents the strength of regulator protein effect on the target gene; Ktram is the translation rate of regulator mRNA. They lump together with the ratio of basal mRNA concentrations of regulator and target to form parameter γ, which determines the magnitude of the relative target mRNA level but not its shape. It is the parameters Kt and KP that determine the shape of the relative target mRNA level, such as how fast the target gene responds to the regulator. For gene expression experiments under stress conditions in plants, the kinetics model can be trained with known regulator-target pair reported in the literature (e.g., CBF and RD17 in Arabidopsis under cold stress) with a non-linear regression model . When the normalized expression profile of a target gene with its maximal response is considered, there is no need to keep γ as a free model parameter (γ1 = nγ2 leads to Tm1'= nTm2' when other parameters are kept the same in Equations (8), (9) and (10)). Therefore, only two parameters Kt and KP are estimated from the non-linear regression model, and are used to predict other regulators and their targets in plant stress response.
The theoretical TF-target mRNA expression profiles are calculated for all the genes annotated as TFs and are substituted in place of TFs' profiles during further computation for co-expression calculation. The theoretical target profile of any TF in terms of relative expression levels among different time points is independent of actual targets of that TF as it is solely calculated based on the kinetic model. According to the model, the theoretical target profile of a TF should match the profile of its actual targets in the trend of expression although not in the absolute abundance. With this assumption, we can use Pearson correlation coefficient to find similarity of co-expression between the theoretical/shifted profile of a TF and rest of the genes to find potential targets of this TF.
We used a statistical meta-analysis technique  to identify highly correlated expression profiles from multiple microarray datasets. Using this technique, we evaluated the statistical significance (right-tailed p-value) of a Pearson correlation coefficient r for two expression profiles in a single dataset based on the standard t- statistics:
where T is a t–random variable with n-2 degree of freedom and n is the number of conditions of the gene expression profiles. Since we assume that the datasets are obtained independently, we apply the inverse chi-square method and obtain the meta chi-square statistics:
where Pi is the p-value obtained from the i th data set for a given gene pair defined in Equation (15). When there is no linear correlation between a gene pair in any of the multiple datasets, the above chi-square statistics follows a central chi-square distribution with degrees of freedom 2n and hence the p-value for meta-analysis can be obtained by
where is a chi-square random variable with 2n degrees of freedom. We calculate significance level of the gene pair from multiple datasets. The significance level of gene pair represents the count of datasets in which that gene pair has significant correlation (p-value < 0.01) based on Equation (15). We used meta p-value statistics (Equation (17)) combined with significance level to rank potential targets for a TF .
The meta p-value combined with significance level and the Pearson correlation coefficient were used as co-expression statistics for finding putative targets for a TF. For a single dataset (without partitioning of microarray data), we ranked all the potential targets of a TF based on Pearson correlation coefficient and select targets such that TF-target correlation > 0.75 (medium size network) or 0.70 (large size network). For multiple datasets, we ranked all TF-target pairs based on the number of individual p-values that are smaller than 0.01 across multiple datasets; for pairs that have the same number of significant p-values, they were ranked by the corresponding meta chi-square statistics defined in Equation (16). Here we used meta chi-square instead of meta p-value since the meta p-value for many gene pairs are very close to zero and hard to distinguish computationally; both meta chi-square and meta p-value should result in the same order when the degrees of freedom for each gene pair is same. In the end, a fixed number of TF-target pairs were selected based on ranking.
In case of meta-analysis, number of target genes for a TF was determined in two methods, i.e., (1) selecting fixed number of targets from top (50 or 75) or (2) choosing targets form top-ranked genes that shows significance correlation as TF-target pair in at least certain number of microarray datasets used for meta-analysis. For example, we used significance cutoff 9 (out of 9 datasets) for small network and cutoff 8 (out of 9) for medium network and cutoff 7 (out of 9) for large network. The second method worked better in general.
The authors declare that they have no competing interests.
JL and PL conceived the initial study and prepared relevant data and their preprocessing. GPS and DX designed the statistical method. PL and JL implemented kinetic model. GPS and DX performed the data analyses. All wrote the manuscript.
A list of identified 178 putative E2F-target genes.
A complete list of putative targets for each TF.
A list of 27 different microarray experiments, out of which 10 experiments are time series.
We would like to thank Jing Qiu from Department of Statistics, and Trupti Joshi and Jingfen Zhang from Christopher S. Bond Life Sciences Center, University of Missouri for helpful discussions and technical assistance. Part of this work was supported by Monsanto systems biology internship, Shumaker fellowship from Computer Science Department, University of Missouri, and a research fund from United Soybean Board.
This article has been published as part of BMC Systems Biology Volume 4 Supplement 2, 2010: Selected articles from the Third International Symposium on Optimization and Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1752-0509/4?issue=S2