Response Variation Dataset
We began with the set of genes previously identified to participate in the pathways involved in the innate immune response to NDV infection of human DCs [12
]. We then used our microarray data from NDV infected primary human DCs [14
] to identify a panel of 34 significantly induced key regulator genes from this set. The microarray study suggested that the expression of most early induced genes reached a plateau by 10 h after infection. The experimental methodology is summarized in Figure . In Figure we show typical histograms of two of the measured genes, a chemokine CCL5 and an antiviral protein MX1, that illustrate the fact that gene expression levels can vary considerably among individuals, varying on the log2 plots by several units for CCL5 and MX1. Moreover the levels at which genes are expressed vary as well significantly, with copy numbers up to 100 for MX1, but up to 1000 for CCL5 (Figure ). The raw data and a summary table are presented in Tables S1 [see Additional file 1
] and S2 [see Additional file 2
], respectively. As the copy number data were not well fitted by Gaussian or other parametric distributions (Figures and ), we computed the median and the median absolute deviation (MAD), which gave robust estimates of the center and the spread of the distributions. For a Gaussian distribution, the median is the same as the mean, and the MAD the same as the standard deviation. We transformed the copy number by the log2
function so that the distribution could be approximated by a Gaussian (Figures and ). The data for log2
copy numbers are summarized by the mean and standard deviation in Table S2. The variation contributed by the experimental procedure (Table S2) is estimated by repeating the same experiment six times on the same individual donor.
Figure 1 Flowchart illustrating the overall methodology. A. Experimental design. B. Illustration of an approach to systematically identify and represent the correlation among the different genes across the population. C. Illustration of TF prediction procedure. (more ...)
Figure 2 Variation in gene response. All qPCR measurements for 34 genes were taken after the DCs were infected with NDV for 10 hours. Results shown are for CCL5 (A and B) and MX1 (C and D). The histograms are drawn for the copy number (A and C) and log2 copy number (more ...)
We filtered out genes showing too low gene expression (gene names in blue with a crossed line in Table S2, copy number<2), and kept the genes with experimental variation
smaller than the donor population variation
(colored in red in the last column of Table S2) at some statistical level. The variance comparison was done by an F-test with H0
. Of the original list of genes, we retained the 13 with copy number greater than 2 and with an uncorrected p-value smaller than 0.05 for the F-tests. The highest false discovery rate [15
] for the 13 genes is 10.5%. Because the effective multiplicity of infection (MOI) defined as the virus/cell ratio may vary among the experiments, we also measured the expression of viral RNA. We examined the effect of adjustment of the gene copy numbers by dividing by NDV expression, which gave a larger list of 27 genes with p-value no greater than 0.05 for the F-test. However, some genes are induced only in infected cells whereas others are expressed similarly in both infected and uninfected cells. The aforementioned NDV correction procedure may introduce bias for the genes whose expression is not directly related to the MOI. The 13 genes that are selected from the data without NDV correction also appear significant in the data after NDV correction. To be conservative, we used the unadjusted 13 filtered genes for our subsequent analysis.
We hypothesized that genes showing similar patterns of change across infected individuals would be likely to share common genetic mechanisms responsible for these patterns. Our first goal was therefore to develop an approach to systematically identify and represent the correlation among the different genes across the population. We tested several metrics for pairwise correlation and represented the results using multidimensional scaling and, for completeness, hierarchical clustering. The approach is shown schematically in Figure .
We started by examining scatter plots comparing the expression levels of pairs of the 13 genes under consideration from the 145 donors. These showed varying levels of correlation (Figure ). For example, the levels of IFNA1 and IFNA2 were strongly correlated (Figure ), whereas IFNAR1 and CCL5 were weakly correlated (Figure ). For quantification, we evaluated three types of correlation coefficients, Pearson's, Spearman's and Kendall's (Table S3; see Additional file 2
). Pearson's correlation reflects a linear relationship between the levels of the two genes. Spearman's correlation is computed by replacing the data points with their rank and captures monotonic correlations. Kendall's correlation is less constrained than Spearman's as it uses relative rankings only between pairs of data, which makes it the most robust correlation measure of the three (See reference [16
] and the Methods section for details of the calculation of these correlations). The results of this analysis showed that the pairs of genes ranged from strongly correlated to weakly correlated (Table S3). It is worth noting that correlation coefficients results were similar for all three.
Figure 3 An example of the scatter plots for two pairs. Each panel gives a scatter plot of a pair of genes. The x-axis and y-axis give the log2 transformed values of the corresponding gene copy number. Panel A displays a strong correlation between IFNA1 and IFNA2 (more ...)
We next used the correlation matrices (Table S3) to visualize the cluster structure of the genes either through hierarchical clustering dendrograms or multidimensional scaling (MDS) (Figure ) For the latter we generated two-dimensional maps using the pairwise dissimilarities (one minus the correlation) as a distance measurement.
Figure 4 Pearson's, Spearman's and Kendall's correlations in MDS and hierarchical clustering. The MDS plots (A, C, E) and dendrograms of hierarchical clustering (B, D, F) of the 13 genes based on a different correlation metric. For the MDS plot, the x-axis and (more ...)
We briefly describe the use of MDS for gene correlation clustering visualization. Let δij
be the dissimilarity between gene i and gene j. The MDS finds the representative coordinates (xi
) of the 13 genes such that the stress function,
is minimized, where di, j
is the geometrical Euclidean distance between the representative points (xi
) and (xj
). Further details of the MDS algorithm can be found in reference [17
]. The geometric distance between any pair of genes reflects the dissimilarity in relative levels between the two genes among the 145 samples (Figure ). The MDS approximation is relatively good, as given by the stress function 0.075, 0.092 and 0.17 for the Pearson's, Spearman's and Kendall's coefficients, respectively. The analysis for the three measures of dissimilarity showed the grouping of two clusters: (CCL5, IFNA1, IFNA2, IFNB1, IFIT2, IL29) and (IFIT1, IKBKE, IL6, IRF7, MX1, TBK1).
The MDS analysis can also be applied to the 145 samples and the 6 experimental repeats of a single sample (Figure S1; see Additional file 2
) to look for correlation between individuals rather than correlation between genes. We can see there is some degree of within experimenter reproducibility, and it seems that when all genes are combined, between experimenter variation is not too different from donor variation. One of the experimental repeats (#4) appears to be an outlier. Therefore, we carried out an analysis both including and excluding this measurement and obtained comparable results.
Based on the MDS analysis of the 13 genes, the two gene clusters showed significant correlation (see Figure ). We assessed the statistical significance of the two groups by generating randomized datasets that leave the distribution of each gene intact. For each randomization, the gene expression levels of the 145 samples of a gene were randomly permuted. An independent permutation was applied for each gene separately. For the measured MDS plot based on Kendall's correlation (Figure ), the observed cluster dimensions, defined as the average of all pairwise distances among the genes, were 0.129 and 0.178 for the aforementioned two clusters. For the first cluster (CCL5, IFIT2, IFNA1, IFNA2, IFNB1, IL29), at each randomization, we recomputed the cluster dimension. Based on 4000 randomizations, we found that only one of the cluster dimensions of this six-member cluster was no greater than the observed dimension (0.129), i.e., the six genes were strongly correlated with high level of significance (p-value about 0.0003). We further evaluated the statistical significance of the observed cluster dimension. In each randomization, instead of computing the cluster dimension, we isolated the six-member cluster that gives the smallest cluster dimension. With the same 4000 randomizations, about four of the cluster dimensions of the best six-member clusters were no greater than the observed one, i.e. the group formed a tight cluster with a significant p-value (0.001). When applied to the second cluster (IFIT1, IKBKE, IL6, IRF7, MX1, TBK1), the same analysis showed that the six genes were highly correlated with a significant p-value of 0.0006 and formed a tight cluster with a significant p-value of 0.016.
Transcription Factor Sharing Analysis
We investigated whether the sharing of transcription factors (TFs) could be responsible for the clustering. Current studies showed many immunological response genes in mice could be categorized based on their expression dependence on transcription factor IRF-3 [18
]. However the underlying gene expression regulation mechanisms were difficult to infer. In agreement with reference [18
], we observed that 5 genes (CCL5, IFIT2, IFNA1, IFNB1, IL29) in the first cluster were IRF-dependent response genes [18
]. This observation motivated our hypothesis that proximity in dissimilarity space results from shared regulatory components such as TFs. TFs for each of the 13 genes were generated using TF site motif analysis constrained by human-chimp phylogenetic conservation (see Methods). The TF analysis was only based on 12 genes as IL29 has no-entry, probably because of false negatives in the TF prediction algorithm. We used Jaccard coefficients to assess the extent of sharing of TFs between pairs of genes (see Methods). A negative Pearson's coefficient of r
= -0.606 was found for all pairs of genes for the correlation between gene expression distance (MDS distance) and Jaccard coefficient (Figure ), while the value was r
= -0.422 if we used the actual one minus Kendall's correlation for gene distance. These results were not changed when the analysis was extended to include more deeply rooted phylogenetic conservation. However, due to the high false positive rate in gene to TF mapping, we introduced additional constraints and used a reduced gene to TF mapping table where the gene and the TF must belong to the three immune response related gene ontology (GO) categories (GO:0009615-Immune system process, GO:0002376-Response to virus, GO:0005132-IFNA and IFNB receptor binding). After this GO filtering, the reduced table contained only 9 genes since IFIT1, IFIT2, IL29 and TBK1 have no TF entry. Pearson's coefficient for the MDS distance and Jaccard coefficient was now r
= -0.753 (Figure ), in comparison with r
= -0.591 for Kendall's correlation based distances. Thus, the negative correlation between MDS distance and Jaccard coefficient was significantly strengthened when the GO constraints were taken into account, presumably through their reduction of the false positive rate of the TF prediction algorithm. The negative correlation confirmed that the closer in distance two genes are in the MDS plot, the more TFs they shared.
Figure 5 The scatter plot of the expression distance and TF sharing Jaccard coefficients. The MDS derived distance for gene expression (on the x-axis) and Jaccard coefficient for sharing TFs (on the y-axis) are plotted for all pairs among 12 genes for A (IL29 (more ...)
Transcription Factor Enrichment Analysis
Since proximity in dissimilarity space was shown to positively correlate with the fact that gene pairs share TFs, we asked whether genes belonging to one group have in common a number of TFs to a larger degree than the genes that do not belong to it. As a methodology we used TF enrichment analysis. The approach is shown schematically in Figure . We considered the following formulation of the enrichment procedure. For a transcription factor, let k1
) be the number of genes in the cluster (or in the remaining genes) that are bound by this TF. An enrichment score s
can be defined as the binding fraction difference, s = k1
, where n1
) is the number of genes in the cluster (or outside of the cluster). For a fixed k
, the score s
is a strictly increasing function of k1
, and therefore the p-value for the enrichment score of s
can be computed by the hypergeometric distribution for k1
. The right sided p-value for score s
, and S
denote the random variables associated with observed s
, respectively. The p-value is not corrected for multiple comparisons as our goal is not to make individual statements whether some cluster is enriched for some specific transcription factor, but to identify which transcription factors can enter the computation of the total significance score (TSSc). For each cluster, TSSc is summarized by adding up all significant enrichment scores,
are the score and p-value computed for transcription factor i
. In almost all cases, the condition pi
≤ 0.05 ensures a nonnegative score si
so TSSc takes into account all significant scores. A cluster's TSSc measures the extent to which genes in the cluster were co-regulated by transcription factors. For example, when evaluating a cluster of size 2 taken from the 12 genes (IL29 was removed because it lacks a TF entry), for which n1
= 2 and n2
= 10, a p-value no greater than 0.05 occurs only when k1
= 2 and k2
= 0, i.e. the TF should bind to all genes in the two-member cluster, but not to any gene outside of it. The results of the enrichment analysis are listed in Table S4A [see Additional file 2
]. For GO term filtered data (see the TF sharing analysis section), the TF enrichment analysis is only based on 9 genes since IFIT1, IFIT2, IL29 and TBK1 have no TF entry, and the results are listed in Table S4B [see Additional file 2
]. For the two clusters identified in the MDS plot, we are left with two four-member clusters (CCL5, IFNA1, IFNA2, IFNB1) and (IKBKE, IL6, IRF7, MX1) after we removed the aforementioned no TF entry genes. In Table S5 [see Additional file 2
], among all 126 possible four-member clusters, the greatest TSSc of 4.8 belongs to (CCL5, IFNA1, IFNA2, IFNB1). The next greatest TSSc belongs to (IKBKE, IL6, IRF7, MX1). Thus the two clusters identified by MDS are most likely to be coregulated in terms of TSSc with a p-value of 1/126 = 0.008, and 2/126 = 0.016, respectively, which is consistent with the randomization results described earlier.
For the GO term filtered TF enrichment analysis, we checked the relation between expression distance as given by MDS and the total significance score TSSc. The correlation for all pairs is relatively weak (r = -0.329), which is probably due to many zero entries (Table S4B), in comparison with the correlation between MDS and TF sharing (r = -0.753). The correlation for the four-member clusters between the cluster dimension (defined as the average distance among all pairs of genes) and the TSSc is somewhat stronger (r = -0.456). This relatively weak correlation with MDS distance for the TF enrichment analysis is probably due to the fact that the genes in our analysis are biologically related, which tends to keep at low values the enrichment scores calculated for genes belonging to a cluster.
The TF enrichment analysis results, when compared with those of MDS clustering, need to be interpreted with caution, even when one ignores the presumed occurrence of many false positives. The analysis described above leads to two groups of 4 TFs each in which each of the two clusters is enriched relative to the other. These TFs however, despite the GO filtering by the main category of Immune system process (the others are Response to virus, and IFNA and IFNB receptor binding), do not appear to belong directly to the NDV immune response. Their appearance reflects the fact that across a wide spectrum of immune responses that includes cancer, the genes in the two clusters considered are effectively enriched, and in such a way that their total enrichment scores are larger than those associated with any cluster of four genes chosen from the set of 9 genes (see Table S5). To search for TF enrichment with a set of TFs with a presumed connection to the NDV immune response, we considered a list based on the transcriptional program in macrophages [19
], and pathogenic [20
] and common responses [21
] (See Methods for details). This list includes 67 TFs and 98 motifs. An enrichment analysis gives IRF1 for cluster (CCL5, IFNA1, IFNA2, IFNB1), and CREM, E2F1, E2F6, E2F7, STAT1, STAT2 for cluster (IKBKE, IL6, IRF7, MX1). The results here are more attuned to the experimental situation at hand. In particular STAT1,2 are directly involved through ISGF3 in the transcription of IRF7 and MX1. For other expected TFs in the NDV immune response such as IRF3 or NF-kB (CCL5, IFNB1) that are part of the list, they bind to motifs in the promoter regions of genes in both clusters, have therefore very low enrichment scores, and do not show up in the final result. Thus enrichment analysis has limitations for interpreting the measured cluster structure. One reason is presumably that the relatively small number of genes central to the NDV immune response that are measured at 10 hours after infection are all more or less synchronously up-regulated, which makes it hard to distinguish between clusters unless enrichment encompasses categories of stimulation and cell types that go beyond the confines of the experiment proper. While our focus here was on TFs with known binding site preferences defined in TRANSFAC, future work could apply ab initio methods on these gene clusters to define novel TF candidate binding sites [22