Given the binding peaks of a TF in the genome, the simple method identifies genes with promoter regions overlapping with one or more peaks, and results in a target gene set without providing the confidence of each individual gene. In contrast, the probabilistic model provides a gene list ranked by a confidence score as well as the false positive rate of the list. In addition to the practical convenience, the target genes from the probabilistic model are of higher confidence as it considers more information than the simple method does, e.g. the distance of TF binding signal from the TSS. Although there is no gold-standard target set for a TF (i.e. a complete target gene list), we can use two criteria to compare the performance of the two methods.
First, we would expect that the expression levels of the target genes are more upregulated (or downregulated) when the TF is activated (or repressed) by cytokine or hormone stimulation or as a consequence of TF perturbation [overexpression, knockout (KO) or siRNA interference]. Second, we would expect overrepresentation of the binding motif for a TF in the promoters of its target genes. In this section, we first apply the probabilistic model to two datasets [STAT and estrogen receptor (ER)] containing both TF genomic occupation and gene expression profiles, and show that the target genes identified by the model are more differentially expressed in response to TF perturbation or activation than those identified by the simple method. Then we show the STAT4 binding motifs are more enriched in the target promoters identified by the probabilistic model. Finally, we apply our analysis to the TCF4 binding data containing two ChIP-seq replicates with substantial read depth difference, and show that the probabilistic model is not sensitive to the sequencing depth and it provides a confident target gene set for dataset with fairly low depth. Thus when applied to the ChIP-seq datasets, the probabilistic model provides a cost-efficient tool for identifying target genes (without requiring high read depth).
In our analysis, we utilized the datasets from the original publications, which provide mapped reads (STAT and TCF4) or signal track files (ER), as well as binding peak data. To identify the binding peaks, two-sample analysis from the CisGenome software package (Ji et al., 2008
) was used for the STAT (Wei et al., 2010
) and the TCF4 data (Mokry et al., 2010
), while the sliding window method was used for the ER data (Stender et al., 2010
). We apply the simple peak-based method to the binding peaks published with the original papers, assuming optimized parameter setting for peak-calling programs by the authors.
3.1 Reduced expression of STAT4 target genes in STAT4-deficient mice
STAT4 and STAT6 are two key TFs in the differentiation of the helper T cells. Wei et al. (2010
) have profiled the genomic occupation of the two factors and have measured gene expression in T helper 1 or T helper 2 (Th2) cells from wild-type (WT) and STAT-deficient mice. We examine the binding preference of STAT4 and STAT6 across the whole genome, concentrating on the DNA regions surrounding the TSS of genes. We find that the ChIP-seq reads are more likely to be mapped to regions nearby TSSs as shown in A for STAT4. This binding profile around the TSS enables us to more accurately evaluate the regulatory relationship of STAT4 to individual genes. For instance, if gene A contains a STAT4 binding peaks 100 bp upstream of its TSS and gene B contains a peak 900 bp upstream of its TSS, then it is reasonable to assume that A is more likely to be the regulatory target of STAT4. The probabilistic model is designed to take into account this binding profile information. We apply the model to identify the target genes of STAT4 and STAT6, and for each gene we calculate a regulatory score. We subsequently transform the regulatory scores into z
-scores and estimate their P
-values and Q
-values (see Section 2
for details). In B, we sorted genes in the increasing order of their z
-scores for STAT4, that is, genes on the left side are more likely to be the regulatory targets of STAT4. Then we examine the differential expression of genes in WT versus STAT4-deficient mice [log2 (WT/KO)]. In principle, the expression of STAT4 target genes would be downregulated as a consequence of the STAT4 KO. As expected, the genes with higher z
-scores (potential STAT4 target genes) tend to be more highly expressed in WT than in KO mice (B).
Fig. 2. Reduced expression of STAT4 target genes in STAT4-deficient mice. (A) The binding profile of STAT4 around the TSS. (B) Reduced expression [high log2 (WT/KO)] of genes with high z-scores in STAT4-deficient mice. Genes are sorted in the decreasing order (more ...)
shows the target gene numbers of STAT4 and STAT6 in four ChIP-seq experiments. Signals from a negative control (using natural rabbit serum as antibody in Th2) have been used for identifying binding peaks. To take into account the non-specific binding in the probabilistic model, we excluded genes that are significant in the control. As shown, compared with the simple model the probabilistic model is more conservative, which gives rise to a relatively smaller target gene set even at a fairly relaxed cut-off value, e.g. at 10% false positive rate (Q < 0.1). The number of target genes identified by the simple method depends on the size of DNA region considered as promoter. Because STAT4 binding peaks tend to be located nearby the TSS of genes, we do not observe dramatic difference in target gene numbers when [−500, 500] and [−2000, 2000] are used as the cut-off. However, for other TFs with broader binding profile around the TSS, the target gene set size might strongly depend on the cut-off value, making it difficult to determine a confident set of target genes.
Number of STAT4 and STAT6 target genes identified by the simple method and the probabilistic model
It should be noted that STAT6 KO is not a KO that completely deletes the whole gene. Instead, only the DNA region encoding the SH2 domain of STAT6 protein is deleted. Thus, a large fraction of the STAT6 targets can still be identified in STAT6 KO (). Even in the clean knockout STAT4 KO, we still detect a few significantly bound genes as shown in . This is due to the fact that the sequencing signal for STAT4 KO ChIP-seq does not reflect a random background distribution. As shown previously, ChIP-seq using non-specific antibody or input DNA would exhibit enrichments of signal proximal to TSSs in a sample specific manner, leading to ‘artificial’ binding peaks or target genes (Rozowsky et al., 2009
To evaluate the performance of the probabilistic model and the simple method, we sort genes in the decreasing order of log2 (WT/KO) and examine the distribution of target genes identified by the two methods. If a target gene set more accurately reflects a real regulatory relationship, we would expect a larger fraction of its genes present at the top of the ranked list (highly expressed in WT with respect to KO of STAT4). As shown in C, the STAT4 target gene sets obtained from the probabilistic model demonstrate higher expression changes, log2 (WT/KO), than those identified by the simple method (Supplementary Fig. S3A
). Moreover, the better performance of the probabilistic model is not due to its relatively smaller target gene set. When we choose the same number of target genes based on the probabilistic model output (top 1461), the probabilistic model still achieves better performance than the simple model (see blue curve in C). We also compared the expression difference of the target genes identified by the two methods in WT versus STAT4-deficient mice (D). As shown, both methods result in gene sets with higher log2 (WT/KO) than the ‘all genes’. In comparison to the simple method, the probabilistic model identifies target genes that show even higher log2 (WT/KO) values (P
< 0.001). The reduced expression of STAT4 target genes identified by the probabilistic model in STAT4-deficient mice suggests that these bound targets captured by ChIP-seq reflect the actual regulatory relationships of STAT4 with genes. Similarly, we compared the expression changes of STAT6 targets in WT with respect to KO of STAT6, and achieved very consistent results (Supplementary Fig. S4
). As we described, the probabilistic model can also take the binding peak data as input, and we identify target genes of STAT4 and examine their expression changes in WT versus KO of STAT4. These target genes show comparable magnitude of expression changes to those by the simple method but are significantly worse than those by the probabilistic model using the binding signal data (Supplementary Fig. S5
). This suggests that improvement of the probabilistic method is mainly contributed by operating on signals instead of binding peaks. As such, if available we would strongly suggest applying the probabilistic model to the binding signal data rather than the binding peaks.
3.2 Induction of ER target genes in response to E2 treatment
Other than the mouse STAT4/6 data, we also performed a similar analysis on the human ER alpha (ERα) data (Stender et al., 2010
). The data contain the genome-wide localization of human ERα and its mutant (mutERα) measured by the ChIP-seq experiment in MDA-MB-231 breast cancer cells. The mutERα harbors point mutations in the DNA binding domain that disable its ability to bind to its DNA response element. At the 1% false positive rate (Q
< 0.01), the probabilistic model results in 312 target genes for the WT ERα and 41 for the mutant ERα (Supplementary Table S6
). The number of target genes identified by the simple method, however, is strongly dependent on the parameter setting. If the DNA region (−500, 500) bp around the TSS is considered as promoter, it results in 349 target genes, whereas 1091 genes are identified when [−2000, 2000] window is used.
To validate these identified ERα -bound genes by ChIP-seq, Stender et al. (2010
) have performed microarray experiments to examine their response to estradiol (E2), the hormone activating ERα. We therefore examined the correlation between ERα regulatory scores of genes and their expression levels. We found that genes with higher ERα regulatory scores are more responsive to E2 stimulation (A). Particularly, the genes with highest scores tend to be upregulated after 24 h treatment with E2. When sorted in the decreasing order of E2 responsiveness [log2 (24 h/0 h)], the target gene sets identified by the probabilistic model are more skewed to the higher responsive side than those gene sets by the simple method (B and Supplementary Fig. S3B
), and they are more highly upregulated by E2 treatment (C).
Fig. 3. Induction of ER α target genes by E2 treatment. (A) Enhanced expression of genes with high z-scores in response to E2 treatment. Genes are sorted and separated into bins as described in . (B) Cumulative distribution of genes in the gene (more ...)
3.3 Enrichment of TF binding motif in the target promoters
Another way to compare the performance of TF target gene identification methods is motif analysis. If the genomic occupation of a TF is mainly attributed to its direct binding to the DNA binding motif rather than mediated by any other DNA-binding proteins, we would expect to see the enrichment of its binding motifs in target gene promoters. Thus, we search the promoter region (1 kb upstream of TSS) of all mouse genes (23 573) for the occurrence of the STAT4 binding motif and compare the proportion of genes containing STAT4 binding motif from the target sets identified by the probabilistic model and the simple method. Overall, 3689 out of the 23 573 mouse genes (15.6%) contain at least one STAT4 binding motif in its promoter region (). As for the target gene sets identified by the simple method, the fractions of STAT4 motif-containing genes are at best slightly higher than the average. In contrast, the target gene sets identified by the probabilistic model include more STAT4 binding motif-containing genes. For example, the model results in 325 target genes at 0.1% false positive rate (Q < 0.001), of which 19.4% genes contain at least one STAT4 motif in their promoter (P=0.04). Thus, in comparison to the simple method, the probabilistic model identifies a target set of higher confidence.
Enrichment of the STAT4 binding motif in the promoter of its target genes. The y-axis indicates the percentage of genes that contain at least one STAT4 binding motif their promoters (DNA region within 1 kb upstream of TSS).
3.4 Consistency of target genes between replicates
As described in Section 2
, the simple method requires a predefined binding peaks from a peak-calling method, whereas the probabilistic model can calculate the regulatory scores of a TF to genes based on either the mapped reads or the binding peaks. In general, the number of binding peaks identified by peak-calling methods depends strongly on the sequencing depth of the ChIP-seq experiment. As a consequence, the target genes identified based on binding peaks is also sensitive to the sequencing depth. In contrast, based on the total mapped reads the probabilistic model results in very consistent target gene sets between replicate experiments with different sequence depths. shows the results for the TCF4 dataset, which contains two technical replicates with 1 160 496 and 16 946 805 mapped reads, respectively (Mokry et al., 2010
). Due to a substantial difference in the sequencing depths, the two replicates result in quite different number of binding peaks—1128 for the former and 10 436 for the latter. Based on these binding peaks, the simple method identified 177 and 1941 target genes for the two replicates, respectively, with [−1000, 500] bp around the TSS as the promoter region (, the right panel). Based on the mapped reads, however, the probabilistic model ends up with 250 and 252 target genes for the two replicates at the significance level of 0.001 (P
< 0.001). Despite the significant difference of the two replicates in their read depths, the target genes identified from them are highly consistent with 215 overlapping genes (, the left panel). This result is further confirmed by a simulation analysis. We generate 21 simulated sequencing datasets for TCF4 by sampling n
= 5, 6,…, 25 M) from the pooled data of the two replicates (~ 18 M mapped reads). We then apply the probabilistic method to identify significant target genes for each dataset. Meanwhile, we call the peaks in these datasets using PeakSeq (Rozowsky et al., 2009
) and subsequently determine the target genes using the simple method. The simulation results indicate that the probabilistic model provides consistent target genes, while the number of peaks called by PeakSeq (increase from 7716 for 5 M to 40 318 for 25 M) and target genes subsequently identified by the simple method (increase from 1458 for 5 M to 7923 for 25 M) are strongly dependent on the read depth (Supplementary Table S7
Consistency of target genes in two experimental replicates of TCF4.
Thus, the probabilistic model based on the mapped reads provides a more cost-efficient tool for identifying the TF target genes. For many ChIP-seq experiments, it might not be necessary to perform such a high sequencing depth, if they are designed to identify target genes instead of specific binding peaks. In fact, with the increase of the read depth, the number of binding peaks from ChIP-seq data will keep growing (as more and more weak binding peaks are identified), while the number of target genes identified by our probabilistic model will not change substantially.