We have created an analytical workflow for the analysis of complex transcriptional data sets. GWTM uses a mathematical model to split the transcriptional response into its component parts, then applies a clustering method to group transcripts with similar behaviour. The factors controlling the predicted activities were identified using bioinformatics and verified experimentally. The method not only associates a target with its controlling transcription factor but also defines the confidence with which this association is made, allowing subsequent analysis of a gene list to be prioritized in a rational manner.
We have previously shown that by linking transcript levels measured in a time series, a dynamic picture of network activity can be created. We developed a mathematical modelling approach called HVDM that incorporates RNA production and degradation terms, and allows transcription factor activity to be deduced from microarray data. We term this activity the hidden variable because information needed to derive it is hidden in the transcript values reported in the array data. Derived activity profiles can then be used to predict and rank other putative targets of the same transcription factor, thereby creating a quantitative model of a transcriptional network. We applied HVDM to predict a ranked list of p53 targets that were experimentally validated by depleting cells of p53 using RNAi, with a high success rate.
HVDM, however, is limited to systems in which prior information about transcript control is available, and can only model one behaviour at a time. The key to the success of GWTM was the observation that global measurement of mRNA degradation rates could extract extensive hidden information regarding the entire transcriptional network from microarray data in the absence of prior biological knowledge of the system. Recently, strong evidence has been presented emphasizing on the importance of including degradation rates when modelling gene expression profiles (Yang et al, 2003
; Perez-Ortin, 2007
; Perez-Ortin et al, 2007
; Molina-Navarro et al, 2008
). While studying the transcriptional response to stress in yeast, Shalem et al (2008
) found co-ordinated mRNA production and degradation such that greater message stability could lead to a sustained level of transcript despite a short production period. Very recently, Matsushita et al (2009
) have demonstrated a critical role for an RNAse in regulating the immune response in mice through mediating mRNA degradation of a set of inflammatory genes.
We used microarrays to measure the mRNA degradation rates of all genes in the human T cell line, MOLT4, after treatment with ionizing radiation. By incorporating degradation data into a rearranged HVDM model equation, we were able to isolate production terms for all the transcripts. The time course data yielded transcript production profiles, which we are able to cluster on the basis of correlation. Then by applying a graph representation, we identified three clusters of genes with similar production profiles. By definition, transcript production is controlled by transcription factors, either alone or in combination. The three clusters were, therefore, likely to correspond to three global activity profiles influencing transcript production during the response to ionizing radiation.
We then used bioinformatics approaches to predict the identity of the factors controlling the major activities. The two most dominant activities contained genes known to be regulated by NF-κB, c-Jun/AP-1 and p53. The first activity is an early response that peaks 2 h after treatment with ionizing radiation. The model was unable to distinguish between NF-κB and c-Jun/AP-1 as candidate targets for this activity. When we verified model predictions for activity 1 by inhibiting NF-κB or c-Jun transcription factors using the NF-κB inhibitor, BAY 11-7082, and siRNAcJun, respectively, in the context of ionizing radiation, our results clearly showed a very high degree of crosstalk between the two factors. With c-Jun itself being a target of NF-κB, a likely explanation is that many of these genes are targets of targets such that NF-κB activates c-Jun, which activates its own targets. The kinetics of this pathway must be rapid as GWTM recognizes this regulation as one activity. It may be possible to separate these activities by taking closer time points during the first 2 h after irradiation.
The data from IPA and GSEA indicated that the second global activity identified by GWTM was controlled by p53, in agreement with our previous results obtained using HVDM. Examination of genes comprising the third global activity seems to indicate a portion of cells re-entering the cell cycle in the latter stages of the experiment. E2F is a possible candidate for controlling this activity. However, it is not clear whether this population represents cells surviving irradiation (unlikely at the dose of irradiation applied), or cells aberrantly re-entering cycle before undergoing apoptosis.
The generation of confidence limits for predictions means that appropriate cutoffs can be applied when deciding the importance or biological significance of transcriptional changes in a microarray experiment. Overall, our results predict the controlling dynamics behind the majority of genes whose expression increases as a result of irradiation-induced DNA damage. A total of 76 of the 100 most confidently changing genes were assigned by the model to a controlling factor and verified experimentally. This number falls, as would be expected, with the addition of genes with lower confidence of irradiation-induced change, or with lower confidence of model prediction. Use of confidence limits can, therefore, ensure the maximal cost–benefit analysis of microarray data because although a lower than expected proportion of the behaviour was explained by the model, additional experiments could be performed to extend the proportion of explicable behaviour.
Although it is ideal if the transcription factors controlling the discovered activities can be identified, this may not always be possible due to limitations in the bioinformatics databases. However, by synthesizing the causes of a complex response involving hundreds of genes to a considerably smaller number of activities, the search for activators is substantially simplified. As GWTM attaches probability levels to each transcript in an activity, it is possible to select on a rational basis the strongest members of the group. If one can identify what regulates this subgroup of targets it is likely to be the transcription factor for all of them. Using this approach it may even be possible to analyse co-regulated genes with unknown transcription factors to identify common regulatory motifs—akin to orphan receptors—where the binding domain is known before the transcription factor that binds it.
Around 25% of the 200 most upregulated transcripts were not explained by the model. Measurement error is a likely explanation for some of these discrepancies. Many of the genes at the lower end of the scale are expressed at lower levels in which the limiting factor is the microarray detection technology itself, and so there are uncertainties regarding gene detection, whether transcripts really are upregulated and whether they fit the model. One potential solution to this problem could come with a switch to a less error-prone technology. High throughput sequencing platforms, such as the Illumina GAII, ABI SOLiD and Roche GS-FLX all generate essentially digital signals, eliminating measurement errors resulting from cross-hybridization. We predict that modelling this type of data will lead to even more accurate models, and we are currently investigating this possibility.
We noticed a number of highly upregulated transcripts behaviour of which was also not predicted accurately by the model, but which were knocked down by the verification experiments. In the case of the most significantly changing gene, IER3
, there is biological evidence that co-regulation between the NF-κB and p53 pathways may account for its synergistic behaviour. This is probably also the case for several other targets. Other hints to the importance of co-regulation appear in some of the results presented in the Supplementary information
. For example, FAS, through distinct probe sets, is a constituent of merged cliques 1 and 2 (see Supplementary Tables I and II
), and has indeed been shown to be the target of NF-κB (Kuhnel et al, 2000
), as well as p53. Similarly, of the six genes that are predicted targets of global activity 1 and 2, two (CD70 and PTP4A1) present good verification scores for NF-κB and p53 dependence (Supplementary Tables IV and V
). We are exploring whether introducing another term to the model could potentially account for such behaviour.
Other experimental methods like ChIP-Chip and ChIP-Seq, which measure binding of transcription factors to regulatory domains, are gaining in power and affordability, and are revealing the complexity of gene regulation (Valouev et al, 2008
; Visel et al, 2009
). However, ChIP experiments are typically static and require prior knowledge of the system. Combining ChIP-seq data with data-driven modelling approaches, like GWTM, will allow us to identify targets for Chip-seq, and to quantitatively relate transcriptional dynamics with the kinetics of transcription factor binding. Together these approaches could lead to a better understanding of how gene networks are regulated in complex responses.
In conclusion, the aim of GWTM is to maximize the efficient and systematic use of genomics technologies like microarrays. The approach can dissect complex transcriptional responses into their major controlling activities and identify the targets of each one. Furthermore, the method does not require complex and expensive artificial model systems, and only necessitates a short time course of microarray data measuring transcript levels and degradation rates. Our approach, therefore, offers a data-driven method for accurately identifying gene network components, establishing their connectivity and quantifying the controlling activities at the transcriptome level.