Our approach follows the schematic in Figure . NCA requires two inputs: a set of gene expression profiles and a pre-defined regulatory network, which is a matrix that contains initial estimates of the influence each transcription factor on the target genes. The original gene expression data set is obtained from Calvano et al
], in which peripheral blood leukocytes were obtained from four different individuals prior to and at five time points after injection with endotoxin, 24 profiles in total.
Figure 1 Schematic of the approach. (A) Flowchart describing the steps to reconstruct our initial transcriptional regulatory network. (B) A set of gene expression profiles (matrix E) and about a proposed structure for the underlying transcriptional regulatory (more ...)
To define a regulatory network which could account for a significant percentage of the gene expression response, we identified a set of key transcription factors previously known to be involved in the LPS response, together with a set of known target genes for these factors. Ten transcription factors were chosen for our study (listed here by gene name for continuity). NFKB1 (encoding p50/p105), RELA (encoding p65) and IRF3 were chosen as factors involved in the primary response to endotoxin. Endotoxin binding to Toll-like receptors (TLR) leads to activation of NF-κB dimers, among which p65:p50 is common [18
]. LPS stimulation also induces IRF3 activation through TLR4 [19
]. These transcription factors induce expression of several cytokines which can further activate a secondary transcription response through factors such as STAT1, 3 and 6 [15
]. CREB1 is activated by LPS through the p38 kinase-SAPK2 pathway [21
]. It is known that LPS activates AP-1 complexes consisting of FOS, JUN, JUNB and JUND [22
]. Among these factors, JUN and FOS were chosen for our model. The role of MYC in inflammation response is poorly understood [23
]; however, many genes connected to MYC showed significant changes in their expression levels in the original study and so we included it as well [16
To identify established regulatory interactions between these transcription factors and target genes, we relied largely on the primary literature [15
] (Figure ). However, two knowledge-bases were also used: Ingenuity Systems http://www.ingenuity.com
and Pathway Studio [29
]. Both the Ingenuity and Pathway Studio knowledge-bases consist of regulatory relationships parsed from MEDLINE abstracts; the Ingenuity knowledge-base also includes information from manually-curated peer-reviewed publications. For our ten transcription factors, this strategy resulted in a list of 1,287 target genes, with 2,183 interactions between transcription factors and target genes. To reconcile differences in these different sources of regulatory network information (literature, Ingenuity, Pathway Studio), we only included an interaction in our network if it could be identified in two out of the three resources. This filtering process reduced our list to 219 target genes regulated by 306 interactions with the ten transcription factors. To focus on the most useful expression information, we only considered target genes for which expression changed significantly over time (p-value < 0.01).
The network for the inflammatory response finally included 10 transcription factors, 99 target genes and 149 regulatory relations. This network can be represented in matrix form, with a density of ~15%, or 149 relations/(10 factors × 99 targets). In contrast, the expected density of a genome-wide regulatory relationship matrix, given our current state of knowledge about human transcriptional regulation would be about 0.1% (~20,000 relations in Ingenuity Systems and Pathway Studio databases, ~1,000 transcription factors and ~20,000 target genes). Our network density is therefore relatively high, reflecting the comparatively high level of research interest in this system.
We estimated the activation of the transcription factors in our network over time using NCA (Figure ). NCA decomposes a matrix containing gene expression values (E
) into a matrix which represents the influence of a transcription factor on a target gene (strength matrix S
) and a matrix which contains the transcription factor activities (activity matrix A
]. We found that both outputs of NCA – predicted factor activities A
and regulatory influences S
– have added additional insights to gene expression data where the underlying regulatory network structure is partially known.
Transcription factor activities
Figure and show the estimated activities of our 10 transcription factors. Transcription factor activities clearly showed early-, mid-, and late-phase action in response to LPS. IRF3, NFKB1(p50/p105) and RELA(p60) were activated within 2 hours after the endotoxin was injected. IRF3 activation peaked at 2 hours and returned to its base level at 4 hours. NFKB1 and RELA were also activated early but decreased in activity more slowly. These three factors can induce expression of tumor necrosis factor alpha, which then further activates the NF-κB complex [25
], and could contribute to the extended NF-κB activation. JUN and FOS are known to be activated through the JNK pathway [30
], and had a similar activation profile to NFKB1 and RELA. In contrast, STAT1, STAT3 and CREB1 exhibited a late-phase response. The STAT1 and STAT3 predictions correspond to previous findings that STATs are activated by cytokines transcribed by the NF-κB complex [15
]. It was surprising that predicted CREB1 activation peaked at four hours, given that previous reports detect phosphorylated CREB at 30 minutes [32
]. However, the prediction was the result of late-phase induction of known CREB-dependent gene expression, such as ALAS1 and CEBPD [33
]. Both STAT6 and MYC were predicted to be somewhat deactivated over nine hours. Deactivation of STAT6 was predicted due to repression of MHC-II class genes which are known to be regulated by STAT6 [35
], as well as the expression of SOCS1, which has been reported to lead to deactivation of STAT6 [36
]. MYC expression can be decreased through a STAT1-dependent pathway under IFN-γ stimulation conditions [37
], and it is possible that the deactivation predicted here depends on STAT1 as well.
Figure 2 Transcription factor activities calculated using NCA. (A) Predicted activities of the ten transcription factors used in this study. For each transcription factor, rows represent progression in time and columns correspond to the four human subjects. Activities (more ...)
Transcription factor activities are sometimes, but not always, correlated with the gene expression of the factor. We compared the calculated transcription factor activities with the gene expression data for each factor (Figure ). NFKB1, RELA, STAT1, STAT3 and MYC showed strong positive correlation between activities and expression (correlation coefficient c
> 0.56), possibly due to auto- or cross-regulation. For example, NFKB1 activity and expression are tightly correlated (c
= 0.6022), possibly because the NF-κB p65:p50 complex can regulate NFKB1 [38
]. STAT1 activity and expression are also strongly correlated (c
= 0.8362), which might relate to the transcriptional effect of STAT3 on STAT1 expression [40
], particularly given that STAT1 and STAT3 have highly correlated activities (Figure , c
= 0.9329). On the other hand, the activities and expression show lower or no correlation for IRF3, JUN, FOS and CREB (-0.15 <c
The linear model of gene expression upon which NCA rests does not account for the interactions between transcription factors. However, we wondered if the NCA-predicted correlation in transcription factor activities could be due to the combined action of two transcription factors, either as a complex or otherwise. We therefore checked transcription factor pairs with significant activity correlation to published protein-protein interactions catalogued in the Biomolecular Interaction Network Database (BIND) [41
]. Interestingly, transcription factors known to act together showed high correlation in their activity profiles (Figure ). For example, highly correlated transcription factors NFKB1(p50/p105) and RELA(p65) regulate their target genes as a p65:50 heterodimer form [42
], and STAT1 and STAT3 are also known to act as a dimer [20
], as are JUN and FOS [30
]. Additionally, some transcription factors (STAT1 and CREB1, STAT6 and MYC) showed a positive correlation in their activity even though they are not known to form a complex with other transcription factors. Transcription factors can have similar – and even coordinated – activities without direct interaction, so it may be that these latter predictions reflect an indirect interaction.
On the other hand, it is possible that some of the correlated transcription factor activities may be based on incorrect NCA predictions. The largest possible source of error for NCA decomposition is the initial connectivity matrix, which is based on the current, generally incomplete or erroneous, understanding of the human transcriptional regulatory network. The effect of missing or false data in the connectivity matrix is hard to predict in advance. However, the sensitivity of NCA to the connectivity matrix can be estimated by adding or removing connections randomly from the original matrix, and repeating the NCA calculation multiple times [14
]. Using this approach, we found that transcription factor activities predicted by NCA and our original connectivity matrix were robust, even if 10–15% of the connectivity matrix contained inaccurate connections (Table ). Given that our matrix was limited to only high-confidence interactions, this level of sensitivity was assumed to be tolerable.
NCA simulation with random noisy connections
Regulatory influence matrix and gene expression clustering
We thought that the adjusted strength matrix might be used to enhance typical gene expression clustering techniques. Signed quantitative values of the adjusted strengths were able to form more biological meaningful clusters beyond the prior binary regulatory connections. In Figure , target genes were hierarchically clustered with the adjusted strengths of transcription factors and shown with gene expression. We identified seven major clusters, which correlate to the coordinated action of transcription factors to regulate gene expression. Cluster A highlights the influence of NFKB1(p50/p105) and RELA(p65) on a set of eighteen genes. Interestingly, some genes are linked to p65 only, suggesting that these genes may be under the specific control of the p65:p65 homodimer, rather than the p65:p50 heterodimer. For example, the cluster suggests that CXCL10 expression depends on both p65 and p50, which has been demonstrated experimentally in NFKB1-/-
knockout mice [43
]. Clusters B and C contain the genes regulated by STATs 1 and 3, while Cluster D genes are regulated by JUN and FOS. Clusters E and G are primarily regulated by MYC, but with repression in E and induction in G. Cluster F genes are regulated by STAT6. All of the transcription factors known to act in dimers [20
] – the NF-κB complex of NFKB1-RELA, as well as STAT1-STAT3 and JUN-FOS – were either in the same cluster or closely adjoining clusters, and had correlated activation profiles. However, although STAT6 and MYC had correlating activation profiles, the genes under their influence (Clusters E, F and G) did not cluster closely. Therefore, when studied together, activation profiles and regulatory influences may provide insight on the coordination between transcription factors.
Figure 3 Hierarchical clustering in the context of a defined regulatory network. (A) The adjusted strength matrix was used for clustering, after which the gene expression matrix was appended. Seven major clusters which have more than five associated genes are (more ...)
Although our clustering was based on the matrix of regulatory influence, the clusters also provided a strong basis for interpreting gene expression. Pair-wise correlation tests on expression between genes within a cluster showed significantly higher average correlation than random clusters (Table ). Furthermore, the resulting gene expression clusters can be immediately linked to the specific transcription factors whose action created the expression profile. Importantly, clustering by transcription strength can identify new clusters unobtainable by clustering the expression data alone. For example, Cluster F and G could not be distinguished when the same clustering method was applied to the gene expression data alone (Figure ). However, they formed unmistakable clusters from the regulatory strength matrix, being linked to the regulatory influence of either STAT6 or MYC. Furthermore, our clusters required the NCA-processed strength matrix, and could not be obtained from the initial connectivity matrix, the clustering of which led to groups of genes that did not show common expression patterns (Figure ). We conclude that the estimated transcription factor regulatory strengths can provide unique insights with regard to the regulation underlying gene expression, even between genes with similar expression.
Major clusters formed from the adjusted strength matrix.
Correlation test and prediction over extended regulatory sets
The clusters shown in Figure suggested that we might be able to use our cluster information to discover new regulatory relationships. We first determined the average normalized expression pattern of the genes in each cluster (= model gene group). The expression vector for each gene was normalized to have zero mean and a standard deviation of 1, and then normalized gene expression sets were averaged for each cluster (Figure ). We then divided all human genes measured on the expression array into three groups: those for which we had high-confidence regulatory information linking the dominant transcription factors in the cluster to the gene (model genes); genes for which we had lower-confidence regulatory information (found in only one of the two knowledge-bases), but could still be valid to extend our model (extended genes); and genes where we found no evidence of regulation by the cluster transcription factors (no-evidence genes). If a cluster had more than two dominant transcription factors, only genes which had established regulatory interactions with all factors were collected for the extended gene group.
Figure 4 Identification of new target genes for major clusters. (A) The average expression profiles of the four clusters with > 10 members. (B) Expressions of extended regulatory genes sorted by correlation coefficients(c) with the average expression profile (more ...)
We first wanted to see if a gene in the extended gene group had similar expression to a cluster (Figure ). First, Pearson's correlations were calculated between each gene in the extended gene group and the average normalized gene expression of each cluster. We then also randomly selected one thousand genes from the no-evidence gene group, and calculated correlations between expression of these genes and the clusters. To obtain standard deviations, we performed this step one hundred times. The fraction of genes with a Pearson's correlation > 0.5 was then compared between both groups using Fisher's exact test (Figure ). We found that average gene expression in each cluster was more highly correlated with genes in the corresponding extended gene set than in the no-evidence gene set, particularly for Clusters A, B and G.
Based on an earlier report involving p53 targets [45
], we decided to use the average normalized expression pattern of each cluster to predict new target genes for dominant transcription factors, First, we identified the genes with significant changes in gene expression in each gene group (p
< 0.01). We then identified the subset of genes whose expression best matched each cluster using Pearson's correlations, and determined the relationship between the fraction of accepted genes (based on a range of cutoffs) that was contained in the extended gene set versus the "no evidence" gene set for each cluster (Figure ). As expected, all extended gene sets had higher accepted rates than the "no evidence" gene sets. However, as can be seen in Figure , genes in the extended set for Clusters A and B were many times more likely to be matched the cluster aggregate expression profile than "no evidence" genes. This indicates that Cluster A and B expression profiles are better able to distinguish true member genes than the profiles for Cluster D or G. We identified 12 genes in the extended gene set for Cluster A and 24 for Cluster B that were highly correlated (c
> 0.5) to the cluster aggregate expression profile (Table ).
Predicted genes for Cluster A and B from the extended gene sets.
We also focused on Clusters A and B for predicting new target genes from the "no evidence" group. Some of the predicted new member genes for these clusters are listed in Table [46
]. Although there was no evidence for including these genes in our model initially, we were able to partially validate certain target gene predictions based on evidence beyond the original knowledge-bases that we used to define our sets. Notable among this evidence was the use of genome-scale location analysis [46
], as well as bioinformatics techniques [47
] to detect NF-κB binding to the promoters of several predicted target sites. We conclude that such clustering may be useful for identifying new target genes, particularly in combination with other methods.
Predicted genes for Cluster A and B form the "no evidence" group
Overall regulatory dynamics in response to LPS
Finally, we were able to address our original goal of building an integrated temporal model of the human blood leukocyte response to LPS (Figure ). This required the integration of our calculated transcription factor activities, transcription factor regulatory influences on each gene, clustering on the adjusted strength, and the gene expression data. Endotoxin was administered to the subjects at 0 hours. During the next two-hour period, IRF3, p65 and p50 were activated and interacted to regulate gene expression, as were JUN and FOS as well as CREB1. By two hours, these transcription factors had already affected gene expression, including the genes in Clusters A and D as well as the additional genes we predicted to belong in Cluster A. Between 2 and 4 hours after endotoxin administration, cytokines such as the interleukins (ILs) and tumor necrosis factor (TNF) whose genes were expressed at 2 hours were produced and secreted. These secreted proteins then and maintained or initiated the activity of several transcription factors in the blood leukocytes. Presumably, TNF then reactivated the NF-κB complex and some of ILs stimulated AP-1 complex [49
]. In contrast, IRF3 activation rapidly returned the base level of activity. The ILs could have activated the STATs to initiate a secondary response, inducing expression of the genes in Clusters B and C together with the additional genes predicted to belong to Cluster B. After 4 hours, the transcription factors began to return to their basal level of activity, leading to a near-complete return to initial values of gene expression by 24 hours. The temporal model therefore provided a global view of activation, transcription and resolution of the blood leukocyte response to lipopolysaccharide in humans.
Figure 5 A dynamic network of transcription. At time zero, LPS is injected, giving rise to transcription factor activation, which then leads to induction or repression of gene expression, production and secretion of cytokines, and initiation of secondary signals. (more ...)