Identifying the genomic regions and regulatory factors that control the transcription of genes is an important, unsolved problem. The current method of choice predicts transcription factor (TF) binding sites using chromatin immunoprecipitation followed by sequencing (ChIP-seq), and then links the binding sites to putative target genes solely on the basis of the genomic distance between them. Evidence from chromatin conformation capture experiments shows that this approach is inadequate due to long-distance regulation via chromatin looping. We present CisMapper, which predicts the regulatory targets of a TF using the correlation between a histone mark at the TF's bound sites and the expression of each gene across a panel of tissues. Using both chromatin conformation capture and differential expression data, we show that CisMapper is more accurate at predicting the target genes of a TF than the distance-based approaches currently used, and is particularly advantageous for predicting the long-range regulatory interactions typical of tissue-specific gene expression. CisMapper also predicts which TF binding sites regulate a given gene more accurately than using genomic distance. Unlike distance-based methods, CisMapper can predict which transcription start site of a gene is regulated by a particular binding site of the TF.
Transcription factors regulate gene transcription by binding to specific regions of DNA called regulatory elements. This binding then activates or inhibits the action of transcriptional machinery at the transcription start site (TSS) of each gene it regulates. Particular TF binding sites are often unique to a specific cell type, condition, developmental stage or tissue (for brevity hereinafter referred to as a ‘tissue’), and defective binding due to mutations in the bound region (e.g. ‘regulatory SNPs’ (1)) or in the TF itself (2) can cause dysregulation of genes and pathological phenotypes. Thus, two key questions are (i) which genes does a given TF regulate in a particular tissue, and, for a given gene, (ii) which binding sites of the TF affect its expression?
The current preferred method for determining the regulatory actions of a TF begins with predicting where it binds the genome in a given tissue using a chromatin immunoprecipitation followed by sequencing (ChIP-seq) assay (3). The next step usually assumes that each such predicted TF binding site (TFBS) regulates the closest gene, or that each gene is regulated by the closest TFBS, where distance is measured in bases (b) along the chromosome between a TSS of the gene and the TFBS.
This ‘nearest neighbor’ assumption works fairly well in practice for predicting the gene targets of a TF, since many TFs regulate by binding in the promoter of the target gene. However, a good deal of regulation is via distal enhancer regions and involves chromatin looping (4,5), which causes these distance-based methods to make incorrect predictions. In one human cell line (GM12878), fully 41% of chromatin loops connecting a non-promoter region to a promoter skip one or more intervening promoters (6), violating the ‘closest gene’ assumption. Similarly, if the target gene has multiple TSSs, distance-based methods cannot tell which TSS is the actual target of a TF bound at a nearby enhancer. Finally, if a TF binds at multiple locations near a gene, there is no guarantee that the closest site actually regulates the gene, as the ‘closest TFBS’ method assumes.
A number of methods for linking regulatory elements (such as enhancers) to target genes have previously been proposed that are not based on distance alone, but none have been tested with TFBSs predicted by TF ChIP-seq. The method of Ernst et al. (7) uses distance plus data for three histone modifications (H3K4me1, H3K4me2 and H3K27ac) and gene expression in a panel of tissues. It requires a supervised learning training step, and was not tested with regulatory elements predicted in a tissue not included in the panel. Similarly, Thurman et al. (8) showed that cross-tissue correlation of DNaseI hypersensitivity (DHS) between DHS regions overlapping promoters and DHS regions not overlapping promoters can predict regulatory relationships, but it is not clear how to extend their approach to linking TFBSs to promoters. DHS data are also available in far fewer organisms than histone modification data, restricting the applicability of that approach. The PreSTIGE algorithm (9) uses cross-tissue correlation of H3K4me1 and expression, but it was designed for linking enhancers (not TFBSs) to genes, requires CTCF binding data and only predicts links when both the H3K4me1 and expression signals are specifically enriched in a given tissue. He et al. (10) and Roy et al. (11) also proposed methods for training predictors of regulatory links between regulatory elements and genes using a large number of input features (e.g. histone modifications, DHS and TF ChIP-seq). These predictors are more accurate than the simple correlation-based approaches like PreSTIGE, but require data from many assays in order to make predictions in a tissue of interest.
We previously described a method for predicting links between enhancers and genes using cross-tissue correlation between histone modifications and gene expression (12), and in the current work we extend and validate that approach for TFBS-gene links. Our primary goal is to provide a method for analyzing peaks from TF ChIP-seq experiments that is as easy to use as distance-based methods, but is substantially more accurate. We propose a method we call CisMapper that, like distance-based methods, only requires the user to provide the genomic locations of predicted TFBSs. Rather than using distance, CisMapper infers regulatory links from the correlation between the presence of a selected histone modification (typically H3K27ac) at the TFBS and the expression of a gene across a panel of tissues in the same organism. We make available for free download the CisMapper software (suitable for OS X, Linux or Unix) and panels of histone and expression data for human (13) and mouse (14) from ENCODE, and for human from the Roadmap Epigenomics Project (15). We show that CisMapper is substantially more accurate than distance-based methods for predicting regulatory links between a TF's binding sites and specific TSSs, that the target tissue need not be present in the tissue panel, and that the target TF need not be expressed in all the panel tissues. We also show that accuracy increases with the number of tissues in the panel, and that CisMapper predictions can improve gene enrichment analyses.
Given a set of ChIP-seq peaks for a TF in some tissue along with auxiliary information in the form of expression and histone modification data for each of a panel of tissues in the same organism, CisMapper computes a score for a (peak, TSS) link using the correlation of expression at the TSS and the presence of the histone modification at the peak across the panel of tissues (Figure 1). Specifically, the score of a (peak, TSS) link is the P-value of the Pearson correlation coefficient between the log of the histone modification signal at the peak and the log of the expression at the TSS. (Details are given in the Supplementary Material.) We also tested using the Spearman rank correlation coefficient, but found it to give worse results (data not shown).
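As a rough illustration of the link score described above, the following Python sketch computes a Pearson correlation on log-transformed signals and converts it to a P-value. It is not the CisMapper implementation: the function names, the pseudocount added before taking logs, the use of a one-sided (positive correlation) P-value and the numerical integration are all assumptions made for this example, since the exact procedure is given only in the Supplementary Material.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, or None if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def upper_tail_p(r, n, steps=2000):
    """P(R >= r) under the null of no correlation, for n >= 4 samples.

    Under bivariate normality, r has density proportional to
    (1 - u^2)^((n - 4) / 2) on [-1, 1]; the tail is integrated with
    Simpson's rule and divided by the normalizing Beta function.
    """
    k = (n - 4) / 2
    norm = math.gamma(0.5) * math.gamma((n - 2) / 2) / math.gamma((n - 1) / 2)
    f = lambda u: max(0.0, 1.0 - u * u) ** k
    h = (1.0 - r) / steps
    total = f(r) + f(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(r + i * h)
    return total * h / 3 / norm

def link_score(histone, expression, pseudocount=1.0):
    """Score one (peak, TSS) link from per-tissue signals (lower is better).

    The pseudocount is an assumption of this sketch. Returns None when the
    correlation is undefined or not positive, i.e. the link is not scored.
    """
    x = [math.log(v + pseudocount) for v in histone]
    y = [math.log(v + pseudocount) for v in expression]
    r = pearson_r(x, y)
    if r is None or r <= 0:
        return None
    return upper_tail_p(min(r, 1.0), len(x))

# Hypothetical signals for one peak and one TSS across an eight-tissue panel.
histone_signal = [1, 2, 4, 8, 16, 32, 64, 128]
tss_expression = [2, 3, 9, 14, 40, 60, 130, 250]
print(link_score(histone_signal, tss_expression))
```

Integrating the null density of r directly avoids any dependency beyond the standard library; with SciPy available, a library routine for the Pearson test would be the more natural choice.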
Here, we study using the active enhancer mark H3K27ac (16), the poised enhancer marks H3K27me3 and H3K36me3 (17), and the active promoter mark H3K4me3 (18), but in principle any histone mark could be used with CisMapper. (Note that CisMapper only uses data for a single histone mark at a time.) Using the P-value of the correlation as the score normalizes for panel size, allowing us to compare the effect of the score threshold across experiments with varying panel sizes. The correlation of a histone mark at a ChIP-seq peak with expression at a TSS can be positive or negative, with positive correlation implying that the mark increases expression. Since we are using histone marks indicative of active enhancers and promoters, we restrict our analyses here to positive correlations.
CisMapper generates four ranked lists of predictions from the set of scored (peak, TSS) links. Two ‘target’ lists rank TSSs and genes, respectively, as potential targets of the ChIP-ed TF. The target score for a TSS is the minimum (best) score of any of its links. The target score for a gene is the minimum (best) target score of any of its TSSs. Two ‘element’ lists rank TF ChIP-seq peaks as potential regulators of TSSs and genes, respectively. The regulatory element lists group all the links for a given TSS or gene together, and sort within each group in increasing order by link score. Details of list creation are given in the Supplementary Material.
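The construction of the four lists can be sketched in a few lines of Python. This is only an illustration of the min-score rules stated above, not the CisMapper code; the tuple layout and identifiers are hypothetical.

```python
from collections import defaultdict

def build_lists(links):
    """Build target and element lists from scored links.

    links: iterable of (peak_id, tss_id, gene_id, score); lower score is better.
    Returns (tss_targets, gene_targets, tss_elements, gene_elements).
    """
    tss_best, gene_best = {}, {}
    tss_elements, gene_elements = defaultdict(list), defaultdict(list)
    for peak, tss, gene, score in links:
        # Target score of a TSS: best (minimum) score of any of its links.
        tss_best[tss] = min(score, tss_best.get(tss, score))
        # Target score of a gene: best target score of any of its TSSs.
        gene_best[gene] = min(score, gene_best.get(gene, score))
        tss_elements[tss].append((score, peak))
        gene_elements[gene].append((score, peak, tss))
    tss_targets = sorted(tss_best.items(), key=lambda kv: kv[1])
    gene_targets = sorted(gene_best.items(), key=lambda kv: kv[1])
    # Element lists: links grouped by TSS or gene, sorted by increasing score.
    for group in (tss_elements, gene_elements):
        for key in group:
            group[key].sort()
    return tss_targets, gene_targets, dict(tss_elements), dict(gene_elements)

example_links = [
    ("p1", "t1", "gA", 1e-4),
    ("p2", "t1", "gA", 1e-6),
    ("p1", "t2", "gA", 0.02),
    ("p3", "t3", "gB", 1e-3),
]
tss_targets, gene_targets, tss_elements, gene_elements = build_lists(example_links)
print(tss_targets)
print(gene_targets)
```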
For practical reasons, it is necessary to restrict the set of possible (peak, TSS) links for which CisMapper computes link scores. First, in this work we restrict CisMapper to links where the TF ChIP-seq peak and the TSS are on the same chromosome and separated by at most 500 Kb. We do this to reduce the required compute time as well as to reduce the number of possible links with low (good) link scores merely due to chance. We note that previous studies that predicted enhancer-promoter links also chose to limit the maximum link length considered for similar reasons (e.g. 125 Kb in Ernst et al. (7), 500 Kb in Thurman et al. (8) and 2 Mb in He et al. (10)). Second, following related work by (19), CisMapper only computes scores for links where there is non-zero variation in the histone level at the peak and the variation in expression at the TSS meets certain criteria. (See Supplementary Methods for details.) Subject to the above caveats, CisMapper computes link scores for all possible (peak, TSS) pairs, so each peak can be linked to multiple TSSs, and vice-versa.
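The first restriction (same chromosome, at most 500 Kb apart) amounts to a simple candidate filter, sketched below in Python. Measuring distance from the peak midpoint is an assumption of this sketch, as are the input tuple layouts; the variation criteria of the second restriction are not modeled here.

```python
import bisect

def candidate_links(peaks, tsss, max_len=500_000):
    """Enumerate (peak_id, tss_id, distance) pairs eligible for scoring.

    peaks: list of (peak_id, chrom, start, end)
    tsss:  list of (tss_id, chrom, position)
    A pair is kept if both are on the same chromosome and the TSS lies
    within max_len bases of the peak midpoint (assumption of this sketch).
    """
    by_chrom = {}
    for tss_id, chrom, pos in tsss:
        by_chrom.setdefault(chrom, []).append((pos, tss_id))
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        index[chrom] = ([p for p, _ in entries], [t for _, t in entries])

    links = []
    for peak_id, chrom, start, end in peaks:
        if chrom not in index:
            continue
        positions, ids = index[chrom]
        center = (start + end) // 2
        # Binary search for the window [center - max_len, center + max_len].
        lo = bisect.bisect_left(positions, center - max_len)
        hi = bisect.bisect_right(positions, center + max_len)
        for i in range(lo, hi):
            links.append((peak_id, ids[i], abs(positions[i] - center)))
    return links

peaks = [("p1", "chr1", 1_000_000, 1_000_200)]
tsss = [
    ("t1", "chr1", 1_400_000),  # 399,900 b away: kept
    ("t2", "chr1", 1_600_000),  # 599,900 b away: dropped
    ("t3", "chr2", 1_000_000),  # other chromosome: dropped
    ("t4", "chr1", 500_100),    # exactly 500,000 b away: kept
]
print(candidate_links(peaks, tsss))
```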
We look for direct evidence of physical contact between CisMapper high-scoring (peak, TSS) pairs from promoter capture Hi-C (CHiC) data. We use these data to study (i) the coverage and accuracy of CisMapper predictions, (ii) the necessity of the target (ChIP-ed) tissue in CisMapper's panel and (iii) whether the ChIP-ed TF needs to be expressed in the panel tissues. The chromatin contact data we use are for GM12878 cells (6), which were the highest-resolution data available when this study was conducted. To measure accuracy, we use the positive predictive value (PPV), which is equal to one minus the false discovery rate (1−FDR), where a predicted link is confirmed if its two ends overlap the two ends of a promoter-other chromatin contact in the Mifsud et al. (6) data. The CisMapper panel consists of eight tissues (GM12878, Ag04450, H1-hESC, HeLa-S3, HepG2, HUVEC, K562 and NHEK), and the histone (H3K27ac) and expression data (CAGE) come from ENCODE (Supplementary Table S2 lists data sources). TF ChIP-seq peaks are for the 19 TFs in Supplementary Table S1 with ENCODE ChIP-seq data in GM12878 cells. Further details are given in Supplementary Methods.
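The PPV computation can be illustrated with a minimal Python sketch. The data layout is hypothetical, and treating the TSS as a single base and intervals as half-open are assumptions of this example rather than details taken from the Supplementary Methods.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end

def is_confirmed(link, contacts):
    """A link is confirmed if both of its ends overlap the two ends of a contact.

    link:     (chrom, peak_start, peak_end, tss_pos)
    contacts: iterable of (chrom, promoter_start, promoter_end,
                           other_start, other_end)
    """
    chrom, p_start, p_end, tss = link
    return any(
        c_chrom == chrom
        and overlaps(tss, tss + 1, pr_start, pr_end)
        and overlaps(p_start, p_end, o_start, o_end)
        for c_chrom, pr_start, pr_end, o_start, o_end in contacts
    )

def ppv(links, contacts):
    """Fraction of predicted links confirmed by a contact (PPV = 1 - FDR)."""
    links = list(links)
    if not links:
        return None
    confirmed = sum(is_confirmed(link, contacts) for link in links)
    return confirmed / len(links)

contacts = [("chr1", 1_000, 2_000, 50_000, 51_000)]
predicted = [
    ("chr1", 50_500, 50_700, 1_500),  # both ends overlap: confirmed
    ("chr1", 60_000, 60_200, 1_500),  # peak misses the distal end
    ("chr1", 50_500, 50_700, 5_000),  # TSS misses the promoter end
    ("chr2", 50_500, 50_700, 1_500),  # wrong chromosome
]
print(ppv(predicted, contacts))
```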
Sikora-Wohlfeld et al. (20) developed the ‘differential TF activity’ evaluation method and used it to evaluate a large number of distance-based predictors of regulatory interactions from TF ChIP-seq data. This evaluation method uses sets of TSSs that are differentially expressed in two tissues in which the ChIP-ed TF is active. They reasoned that if a TF is active in both tissues, some of the changes in gene expression between those tissues should be due to changes in activity of the TF. Hence, the top 500 differentially-expressed TSSs should be enriched for direct targets of the ChIP-ed TF. The figure of merit is the size of the overlap between the top 500 differentially-expressed TSSs and the top 500 predictions of the predictor being evaluated, minus the size of the overlap expected if the predictor guessed randomly. Sikora-Wohlfeld et al. (20) found that the differential TF activity evaluation method gave results consistent with other evaluation methods that use TF perturbation data, functional homogeneity of target genes or consistency of target gene predictions across multiple ChIP-seq data sets, respectively. Note that although we use the evaluation method of Sikora-Wohlfeld et al. (20), we do not use their data or results. A diagram (Supplementary Figure S2) and further details are given in Supplementary Methods.
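The figure of merit can be sketched as follows. The expected overlap is taken here to be the mean of a hypergeometric distribution (two random draws of the given sizes from all TSSs); that choice, the function name and its parameters are assumptions of this example, since the exact computation is described only in the Supplementary Methods.

```python
def adjusted_overlap(predicted, differential, n_tss, top=500):
    """Observed minus expected overlap between two top-ranked TSS sets.

    predicted:    TSS ids ranked by the predictor (best first)
    differential: TSS ids ranked by differential expression (best first)
    n_tss:        total number of TSSs that could have been chosen
    """
    top_pred = set(predicted[:top])
    top_diff = set(differential[:top])
    observed = len(top_pred & top_diff)
    # Expected overlap if the predictor picked its top TSSs at random.
    expected = len(top_pred) * len(top_diff) / n_tss
    return observed - expected

# Toy example: 100 shared TSSs among 10,000, with 25 expected by chance.
predicted_ranking = list(range(500))
differential_ranking = list(range(400, 900))
print(adjusted_overlap(predicted_ranking, differential_ranking, n_tss=10_000))
```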
We analyze the enrichment of genes predicted by CisMapper or GREAT (21) to be associated with TF ChIP-seq peaks for p300 in embryonic (E14.5) mouse neocortical tissue from Table S1 of Wenger et al. (22). For CisMapper we use Mouse ENCODE histone (H3K27ac) and expression (long polyA+) data for a panel of 22 mouse tissues listed in Supplementary Table S11 and a distance limit of 500 Kb. We use the target gene list produced by CisMapper with a link score threshold of 0.01. We then apply the DAVID (23,24) on-line gene enrichment tool to the gene targets predicted by CisMapper to determine enriched Gene Ontology (25) terms. For comparison, we perform enrichment analysis on the same TF peaks using GREAT with its default region-gene association rule. This associates each peak with every gene whose ‘genomic region’ it overlaps. GREAT defines the genomic region of a gene as a basal domain of −5 Kb to +1 Kb around its TSS, which it then extends up to 1 Mb in either direction, stopping if it encounters another gene's basal domain.
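The region-gene association rule just described can be illustrated with a short Python sketch. This is a simplified reading of the rule as stated above, not GREAT's own code: it handles a single chromosome, measures the 1 Mb extension from the TSS, and ignores chromosome ends other than position 0, all of which are assumptions of this example.

```python
def regulatory_domains(genes, upstream=5_000, downstream=1_000, max_ext=1_000_000):
    """Compute a 'basal plus extension' domain for each gene.

    genes: list of (name, tss, strand) on one chromosome, strand '+' or '-'.
    Returns {name: (start, end)}.
    """
    basal = {}
    for name, tss, strand in genes:
        if strand == "+":
            basal[name] = (tss - upstream, tss + downstream, tss)
        else:  # upstream lies at higher coordinates on the minus strand
            basal[name] = (tss - downstream, tss + upstream, tss)

    domains = {}
    for name, (lo, hi, tss) in basal.items():
        others = [b for other, b in basal.items() if other != name]
        # Nearest basal domain edge on each side (none: unbounded).
        left_stop = max((b_hi for b_lo, b_hi, _ in others if b_lo <= lo),
                        default=float("-inf"))
        right_stop = min((b_lo for b_lo, b_hi, _ in others if b_hi >= hi),
                         default=float("inf"))
        # Extend up to max_ext, but never shrink below the basal domain.
        start = min(lo, max(tss - max_ext, left_stop))
        end = max(hi, min(tss + max_ext, right_stop))
        domains[name] = (max(0, start), end)
    return domains

def genes_for_peak(peak_start, peak_end, domains):
    """Every gene whose domain overlaps the peak (half-open intervals)."""
    return sorted(name for name, (s, e) in domains.items()
                  if peak_start < e and s < peak_end)

genes = [("A", 100_000, "+"), ("B", 300_000, "-"), ("C", 2_000_000, "+")]
domains = regulatory_domains(genes)
print(domains)
print(genes_for_peak(1_100_000, 1_100_200, domains))
```

In this toy example a single peak falls inside the extended domains of two genes, which is how a distance-based rule of this kind can assign one peak to several genes.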
We first demonstrate that CisMapper can accurately predict the long-distance contacts between TF-bound regions and promoters to be expected when a distal TFBS regulates a gene. For validation we use CHiC chromatin contact data (see Materials and Methods), and observe that CisMapper predicted (peak, TSS) links are greatly enriched for chromatin contacts compared with links predicted by distance. Using a panel of eight tissues and TF ChIP-seq peaks for 19 TFs in GM12878 cells, the potential regulatory links predicted by CisMapper with link scores less than 0.01 are at least 73% more likely to be confirmed by CHiC chromatin contact data than all links of the same length (Figure 2). High-confidence CisMapper links (score < 10⁻⁵) shorter than 50 Kb have a median PPV of 0.57 across 19 TF ChIP-seq data sets, whereas all potential (peak, TSS) links shorter than 50 Kb have a median accuracy of only 0.21 (2.7-fold improvement).
As shown in Figure 2, the median accuracy of CisMapper-predicted links is higher than that of all similar length links for all tested score thresholds (from 0.1 to 10⁻²⁰) and for all tested link lengths (50–500 Kb). The maximum improvement in accuracy is seen for short links (d < 50 Kb) and score thresholds below 0.001 (2.7-fold improvement in median PPV). Prediction accuracy increases with decreasing link length and increasing score stringency, with a maximum median PPV of 59% for links shorter than 50 Kb and a score threshold of 10⁻⁴ or lower. The higher accuracy of CisMapper predictions relative to distance-based predictions is consistent across the 19 TF ChIP-seq data sets analyzed here (Supplementary Figure S4C). The PPV of all CisMapper links predicted at a score threshold of <10⁻⁵ ranges from a high of 37% for RXRA to a low of 12% for ZBTB33. For all 19 of the TFs studied in this experiment, the PPV of CisMapper predictions is higher than that of links predicted using a distance threshold yielding a similar length distribution (350 Kb).
CisMapper's approach is clearly superior to using distance alone for predicting specific regulatory interactions between bound TFs and TSSs. What is more, predicted links can easily be thresholded on both link score and link length (as done in Figure 2) to select links with high probability (>50%) of corresponding to contacts between promoters and TF-bound regions (Supplementary Figure S6). In this experiment, prediction accuracy for links predicted using a CisMapper score threshold of 0.01 drops below 10% (see Supplementary Figure S5) for the subset of links with lengths in the range 450–500 Kb. While this level of accuracy is still nearly twice as high as using a distance threshold alone, the 500 Kb limit on link length we have chosen here may be a reasonable value in practice.
Although the coverage (recall) of CisMapper is relatively low compared to using a simple distance threshold (Supplementary Figure S4A and B), we would argue that this is a reasonable trade-off in circumstances where a set of predicted regulatory links is desired for further examination. Higher PPV means lower FDR, so if predictions will be tested via expensive wet-lab experimentation, a smaller set of predicted links of higher precision may be preferable to a larger set of links that contains a higher proportion of false positives.
We wondered if CisMapper could successfully predict potential regulatory interactions using TF ChIP-seq data from a tissue not included in its panel. If true, this would greatly expand its utility. To examine this question we repeated the CHiC validation experiment after removing the target tissue (GM12878) from CisMapper's panel. As seen in Figure 3, using a score threshold of 10⁻⁵ CisMapper's predictions are still substantially more accurate at all distance thresholds than distance alone. On the other hand, including the target tissue in the panel does increase accuracy, especially for links shorter than 50 Kb. It is clear, therefore, that CisMapper is useful for analyzing TF ChIP-seq peaks from tissue types not included in its tissue panel, but accuracy will be better if the panel includes the tissue in which the TF was ChIP-ed.
We also wondered if the ChIP-ed TF needs to be expressed across CisMapper's tissue panel. Consequently we examined the relationship between the accuracy of predicted regulatory links and the level of expression of the TF across the panel for the CHiC validation experiments. As can be seen in Figure 4, there is no discernible relationship between accuracy (PPV) and the expression of the ChIP-ed TF across the panel. For example, the median expression of a single TF varies by four orders of magnitude (from 0.01 to 100 reads-per-million, Figure 4, blue), but this has no consistent effect on the accuracy of CisMapper's predictions. The TF for which CisMapper's predictions are most accurate is RXRA, which has the smallest median and third smallest maximum of expression across the panel of tissues used by CisMapper (data not shown). In fact, RXRA has no measurable expression (according to the ENCODE CAGE data used here) in two of the eight tissues, including in GM12878, the tissue in which it was ChIP-ed. Two other TFs have no measurable expression in five out of eight tissues (data not shown), yet they rank third (BCL11A, PPV = 0.33) and eighth (PU.1, PPV = 0.27) in accuracy among the 19 TFs tested here (Supplementary Figure S4C).
Some TFs show highly tissue specific expression, so we wondered if CisMapper could predict regulatory links for them even if they were not expressed in any tissue included in its panel. We therefore repeated our validation using chromatin contacts after removing any tissue from the panel where the ChIP-ed TF showed measurable expression. In this new experiment, we selected five additional TFs (RUNX3, PAX5, IRF4, IKZF1 and BATF) with ENCODE ChIP-seq peaks in GM12878 because these TFs have measurable expression in GM12878 and at most two other panel tissues. When we exclude these tissues, each panel contains at least five of the original eight panel tissues (but the number and identities of the tissues vary depending on the TF). CisMapper's predictions are still more accurate at all distance thresholds than distance alone (Supplementary Figure S8) for these five ‘tissue specific’ (with respect to the panel) TFs. The ability to make predictions for a TF not expressed in any tissue in the histone/expression panel is likely due to the fact that the TF binds in enhancer regions that are active (and varying) across the panel.
We next explore how CisMapper accuracy compares with distance-based approaches. A recent survey of distance-based methods for linking TF ChIP-seq peaks to genes studied six methods and found two—Linear and ClosestGene—to be consistently superior to the others they tested (20). The window-based Linear method simply adds a value between 0 and 1 to a gene's score for each peak within 10 Kb of the gene's TSS, where the value added decreases linearly with the peak-TSS distance. The ClosestGene method assigns each peak to the nearest gene, then scores the peak based on how well the distance fits the observed distribution of peak-TSS distances, and finally sums all the peak scores for each gene. We applied CisMapper, Linear and ClosestGene to ChIP-seq data for 27 TFs in a variety of tissues (Supplementary Table S1), and estimated the accuracy of the predictions using Sikora-Wohlfeld et al. (20)'s ‘differential TF activity’ evaluation method (see Materials and Methods).
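For concreteness, the window-based Linear score can be sketched as below. The weight 1 − d/10 Kb is one natural reading of ‘decreases linearly with the peak-TSS distance’ and is an assumption of this example, as are the function and argument names; it is not the code evaluated in (20).

```python
def linear_scores(peak_positions, tss_by_gene, window=10_000):
    """Score each gene by summing linearly decaying weights of nearby peaks.

    peak_positions: peak midpoints on one chromosome
    tss_by_gene:    {gene: TSS position} on the same chromosome
    Each peak within `window` bases of the TSS adds 1 - d / window.
    """
    scores = {}
    for gene, tss in tss_by_gene.items():
        total = 0.0
        for pos in peak_positions:
            d = abs(pos - tss)
            if d < window:
                total += 1.0 - d / window
        scores[gene] = total
    return scores

example_scores = linear_scores([1_000, 6_000, 30_000], {"g1": 1_000, "g2": 25_000})
print(example_scores)
```

Note that a peak 10 Kb or more from every TSS contributes nothing, which is why a method of this kind cannot recover the long-range links discussed above.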
Overall, CisMapper predictions are substantially more accurate than those made by ClosestGene (Figure 5) or Linear (Supplementary Figure S9). The median accuracy of the TSS target predictions made by CisMapper is higher than that of ClosestGene for 26 out of 27 TFs tested (P < 10⁻⁶, sign test), and higher than that of Linear for 25 of 27 TFs tested (P < 10⁻⁵, sign test). For 26 out of 27 TFs, CisMapper correctly identifies between 1.5 and 26.5 more TSS targets than ClosestGene, and correctly identifies 10 times more TSS targets on average (Supplementary Table S4). CisMapper is also more accurate than ClosestGene for predicting gene (rather than TSS) targets for 20 of 27 TFs (Supplementary Figure S10, P < 0.01, sign test). Here the CisMapper panel of tissues draws from six of the eight following tissues: Ag04450, GM12878, H1-hESC, HeLa-S3, HepG2, HUVEC, K562 and NHEK; the CAGE expression and H3K27ac histone data are from the ENCODE sources listed in Supplementary Table S2 (see Supplementary Methods for details).
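The sign-test P-values quoted above can be reproduced with a few lines of Python. The sketch assumes a one-sided test with no ties (each TF counts as a win or a loss with probability 0.5 under the null); under that assumption the computed values fall below each of the quoted bounds.

```python
from math import comb

def sign_test_p(wins, n):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

for wins in (26, 25, 20):
    print(wins, "of 27:", sign_test_p(wins, 27))
```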
To check the consistency of our two evaluation methods, we looked at how they ranked the accuracy of CisMapper predictions on the 19 TF ChIP-seq data sets that we evaluated using both methods. In both these evaluations, CisMapper based its predictions on an enhancer mark (H3K27ac), so we divided the 19 TFs into two groups according to their preference for binding in enhancer regions, based on data from Ernst et al. (26). Supplementary Table S5 shows that for six of the seven TFs that bind preferentially in enhancer regions, CisMapper predictions are ranked highly by both evaluation methods. The notable exception is that the two evaluation methods disagree strongly on the accuracy of the CisMapper predictions for the RXRA ChIP-seq data set. This anomaly may be due to poor quality of the RXRA ChIP-seq data set. There is no significant enrichment of any of the known motifs for RXRA from the JASPAR database (27) in the RXRA ChIP-seq peaks based on a CentriMo (28) motif enrichment analysis (data not shown). The high PPV of the links predicted by CisMapper in the RXRA data set according to the chromatin contact evaluation method suggests that those ChIP-seq peaks frequently contain regions in contact with neighboring genes. The low accuracy according to the differential TF activity evaluation is not surprising given the lack of evidence of actual RXRA binding in the peaks. Thus, with the exception of the RXRA data set, both evaluation methods estimate the accuracy of CisMapper predictions based on an enhancer mark to be generally highest for TFs binding primarily in enhancer regions, as would be expected.
Thus far we have only presented results based on using the active enhancer histone mark H3K27ac in CisMapper's tissue panel. When we repeat the TSS target prediction experiment above using histone data for the active promoter histone mark H3K4me3 in place of the H3K27ac data used above, CisMapper is more accurate than ClosestGene, although the comparative advantage is smaller than when using H3K27ac (Supplementary Figure S11). For 21 of 27 TFs, the median accuracy of CisMapper predictions is higher than that of ClosestGene (P < 0.003, sign test), compared with 26 of 27 TFs when CisMapper uses H3K27ac data (Figure 5). We also examined using histone marks H3K27me3, associated with poised enhancers (29), and H3K36me3, associated with active enhancers and transcribed genes (17). We found that the accuracy of predicted links was somewhat lower using these two marks (data not shown). These results suggest that CisMapper can be used effectively with ChIP-seq data for histone marks other than H3K27ac should data for that mark not be available for enough tissues to build a panel (see next section).
We assumed that CisMapper coverage and accuracy should increase with the size of the panel of tissues it uses for computing peak-TSS correlations. To test this we again used the differential TF activity method, but switched to data from the more extensive Roadmap Epigenomics Project (15) to allow us to create panels of 5 to 30 tissues using histone ChIP-seq data for H3K4me3, and polyA+ RNA-seq expression data. Since RNA-seq data do not identify the TSS as accurately as CAGE data, we use the gene target list output by CisMapper rather than its TSS target list in this evaluation. (See Supplementary Methods for details.)
The accuracy of CisMapper target predictions increases with the panel size (Figure 6). The median of the adjusted overlap score almost triples over the range of panel sizes we tested (5–30). What is more, the coverage of CisMapper target predictions increases with the panel size (Supplementary Figure S12A), as might be expected due to the increased statistical power of larger panels. A similar increase in accuracy between tissue panels of size 5 and 30 is seen for each of the 19 individual TFs we tested (Supplementary Figure S12B). Although we observe a plateau in the accuracy of CisMapper gene target predictions when the panel size reaches 25 tissues (Figure 6), for most of the 19 TFs we tested, the number of gene targets predicted by CisMapper at a link score threshold of 0.001 more than doubles. This plateau is probably due to limitations in the available data reducing the diversity of any additional tissues added to the panel beyond 25. (See Supplementary Methods for further discussion of this issue.)
Using data from the previous section, we checked that CisMapper scores are ‘calibrated’ in the sense that a given score corresponds to the same accuracy regardless of panel size. This is evidenced by the scatter plot in Figure 7, which shows the accuracy (y-axis) of gene target predictions using the link score threshold given on the x-axis. Each point represents the median CisMapper results for one TF ChIP-seq data set, averaged over the different tissue subset panels, as described above. The X-value of each point is the median of the link score of the 500th gene in the target list, and the Y-value is the median accuracy (adjusted overlap scores).
Two things are clear from Figure 7. First, there is a very strong correlation between the CisMapper link score threshold and gene target prediction accuracy. Secondly, the slope of this correlation is essentially unchanged when CisMapper uses a panel of five tissues (grey points) or 30 tissues (blue points). This implies that the prediction accuracy when using a given link score threshold does not depend strongly on panel size. Therefore, a reasonable choice of link score threshold will remain so regardless of how many paired histone-expression data sets are provided as input to CisMapper. Thus, the main effect of increasing panel size is to increase the coverage (number of predictions) at a given link score, while maintaining the accuracy of those predictions.
Perhaps the most common downstream analysis applied to TF target gene predictions is gene enrichment analysis, and we wondered if this type of analysis would benefit from the improved accuracy of CisMapper predictions. To address this, we compare gene enrichment analysis of gene targets predicted by CisMapper with a similar analysis using the distance-based enrichment analysis tool GREAT (21). The TF ChIP-seq peaks are for p300 in embryonic (E14.5) mouse neocortical tissue (22). Given the tissue and stage of neocortical development, we expect p300-bound regions to regulate many neural-development related functions.
In this example, the gene enrichment analysis based on the CisMapper predicted targets appears more informative than analysis based on distance-based target prediction (see Supplementary Tables S8, S9 and S10). Although the GREAT tool identifies many neural-related biological processes and molecular functions enriched among its predicted 4676 gene targets (22), the 938 gene targets predicted by CisMapper are enriched for important neural-related processes and functions that are not identified by GREAT. For example, only the CisMapper-predicted targets are enriched for genes involved in the neural projection biological process (Supplementary Table S8), a critical process in neuron formation within the cortex (30). CisMapper also scores a key regulator of neural projection in neuron development, Fezf2 (31,32), as a top target.
Furthermore, CisMapper predictions identify genes primarily enriched in ion transport and charge potentiation molecular functions (Supplementary Table S9), crucial to the excitatory function of pyramidal neurons in the neocortex (33). These are missing from the GREAT predictions, which mainly identify transcription-related functions.
Finally, there are no enriched cellular component terms among the GREAT-predicted gene targets, whereas terms highly relevant to neocortical neurons such as ‘neural projection’, ‘plasma membrane’ (the location of ion channels), ‘axon’ and ‘synapse’ are enriched among the CisMapper-predicted gene targets (Supplementary Table S10).
Several previous studies have sought methods for accurately identifying the gene targets of regulatory regions (7–11,34) using auxiliary data on gene expression, TF binding, DNaseI hypersensitivity and histone modifications. Although demonstrably more accurate, these methods have not supplanted simple distance-based association of TF ChIP-seq peaks with putative target genes in practice. This is probably due mainly to the relative simplicity of distance-based methods, as well as to the fact that the more advanced methods have not been explicitly validated on regulatory regions defined by TF ChIP-seq peaks. We developed CisMapper to provide a method that is more accurate than simple distance-based methods, but that places a minimum burden on the user to provide auxiliary data. CisMapper uses only data for a single histone modification and gene expression across a small panel of tissues, requires no training step and has been extensively evaluated here as an alternative to distance-based methods for analyzing TF ChIP-seq peaks.
CisMapper can analyze TF ChIP-seq peaks to predict regulatory links between TF binding sites and the TSSs of genes. It predicts these links using cross-tissue correlation between histone marks overlapping the TF binding site and expression at the TSS. The target lists output by CisMapper can be used to predict either which TSSs or which genes a given TF regulates. Similarly, the regulatory element lists it outputs can be used to predict which specific TF binding sites are most likely to regulate a given TSS or gene.
We have shown that the regulatory links predicted by CisMapper coincide with chromatin contacts at a higher rate than links predicted based on the distance between the binding site and the TSS, the current method of choice. Direct chromatin contact between a bound TF and a TSS is highly suggestive of a possible regulatory interaction, which is what CisMapper is intended to predict. We also report experiments using the differential TF activity evaluation method to show that CisMapper's lists of the gene and TSS targets of a TF have higher accuracy than predictions made by distance-based methods. We have also shown that CisMapper is especially accurate for predicting long-distance regulatory links that are beyond the reach of distance-based prediction methods, and that as more histone and expression data become available across a larger number of tissues, the accuracy of CisMapper's regulatory predictions will improve. Based on these results, we believe that CisMapper is a valuable addition to the standard bioinformatic toolkit for analyzing TF ChIP-seq data.
Importantly, we have shown that CisMapper requires neither histone nor expression data from the tissue of interest, only the genomic loci of the ChIP-seq peaks for a TF in that tissue. However, if such histone and expression data are available, they can and should be included in CisMapper's input, as we expect them to improve prediction accuracy.
We have also shown that CisMapper does not require the TF to be expressed in any of the panel tissues to accurately predict regulatory links to its TFBS. This suggests that even if a TF's expression is tissue specific, CisMapper can still detect when it binds to enhancers showing varying activity across CisMapper's tissue panel.
Suitable compendia of histone mark and expression data currently exist for using CisMapper to analyze TF ChIP-seq data from human, mouse, fly and worm. For analyzing human data, extensive histone and expression data are available from the Roadmap Epigenomics Project (15), from ENCODE (13) and, for expression data only, from FANTOM5 (35). Data for mouse are available from the mouse ENCODE project (14), and a mouse blood-specific compendium has been published recently (36). The modENCODE project provides data for both fly (37) and worm (38). Each of these compendia contains matched histone and expression data from seven to over 100 tissues, and our results show that CisMapper can make useful regulatory predictions when provided with such data for as few as five tissues in the organism of interest.
While we have shown that CisMapper predictions are more accurate than distance-based predictions, the coverage of CisMapper and distance-based methods is quite distinct. On the one hand, distance-based methods are confounded when chromatin looping causes a TF binding site to regulate a TSS other than the nearest one. On the other hand, CisMapper can only predict a regulatory link between a TF binding site and a TSS when there is variation in their histone mark and expression, respectively, across the tissues in the histone/expression compendia provided to CisMapper. Consequently, the regulatory predictions made by CisMapper are somewhat complementary to those made by distance-based methods.
Because the distance- and correlation-based approaches to regulatory interaction prediction are complementary, a future version of CisMapper will integrate genomic distance directly with histone-expression correlation in calculating the link score. We anticipate this will improve CisMapper's coverage. In the meantime, we recommend analyzing TF ChIP-seq peaks with both CisMapper and a distance-based method. The CisMapper predictions will provide a higher-quality set of predicted targets and regulatory binding sites, and the union of those predictions with the distance-based predictions will provide a higher-coverage, albeit less accurate, set.
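The recommended combination of the two kinds of prediction can be sketched as follows. This is an illustrative outline only: the nearest-TSS rule stands in for "a distance-based method", and the function names, data layout and toy coordinates are hypothetical rather than part of CisMapper.

```python
def nearest_tss_links(peaks, tss_positions):
    """Distance-based baseline: link each peak to the closest TSS
    on the same chromosome.
    peaks:         {peak_id: (chrom, midpoint)}
    tss_positions: {tss_id: (chrom, position)}"""
    links = set()
    for peak_id, (chrom, mid) in peaks.items():
        candidates = [
            (abs(pos - mid), tss_id)
            for tss_id, (tss_chrom, pos) in tss_positions.items()
            if tss_chrom == chrom
        ]
        if candidates:
            links.add((peak_id, min(candidates)[1]))
    return links

def combine(correlation_links, distance_links):
    """Return (higher-quality set, higher-coverage set)."""
    high_quality = set(correlation_links)
    high_coverage = high_quality | set(distance_links)
    return high_quality, high_coverage

# Hypothetical toy input.
peaks = {"p1": ("chr1", 1000), "p2": ("chr1", 50000), "p3": ("chr2", 700)}
tss = {"tA": ("chr1", 1200), "tB": ("chr1", 90000), "tC": ("chr2", 900)}
correlation_links = {("p1", "tB"), ("p3", "tC")}  # e.g. a long-range link for p1

distance_links = nearest_tss_links(peaks, tss)
high_quality, high_coverage = combine(correlation_links, distance_links)
print(sorted(high_quality))
print(sorted(high_coverage))
```

Here the correlation-based set retains the long-range link from `p1` to `tB` that the nearest-TSS rule misses, while the union additionally covers `p2`, for which no correlation-based link was available.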
CisMapper predictions of regulatory links are also complementary to those inferred from chromatin conformation capture (CCC) data because they are based on completely different types of evidence. Specifically, the link score that CisMapper calculates for a pair of loci indicates how related histone and expression levels are between the loci, whereas the read count for a pair of loci produced by a conformation capture assay, after conversion to a score that corrects for distance-dependent and other biases, can be used to infer whether the two loci are in contact. Thus, CisMapper and chromatin conformation capture assays (e.g. 3C (39), 4C (40), 5C (41), Hi-C (42), ChIA-PET (43) or CHiC (6)) provide scores that are independent predictions of regulatory interactions between pairs of genomic loci. This independence suggests that intersecting the set of locus pairs predicted by CisMapper with the set predicted by CCC in the same tissue should yield an even more accurate set of predicted regulatory interactions.
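A minimal sketch of such an intersection is given below. It assumes that both sources have already been reduced to scores keyed by the same (binding site, TSS) identifiers, and the contact-score threshold, function name and toy values are hypothetical.

```python
def supported_by_contact(cismapper_links, contact_scores, min_contact_score):
    """Keep CisMapper links whose locus pair also has a bias-corrected
    contact score at or above the threshold.
    cismapper_links: {(site, tss): link_score}
    contact_scores:  {(site, tss): contact_score}"""
    return {
        pair: (link_score, contact_scores[pair])
        for pair, link_score in cismapper_links.items()
        if pair in contact_scores and contact_scores[pair] >= min_contact_score
    }

# Hypothetical toy scores for three predicted links.
cismapper_links = {("p1", "tB"): 0.91, ("p2", "tA"): 0.75, ("p3", "tC"): 0.62}
contact_scores = {("p1", "tB"): 8.4, ("p2", "tA"): 1.2}  # no contact data for p3-tC

confirmed = supported_by_contact(cismapper_links, contact_scores, min_contact_score=5.0)
print(confirmed)
```

Pairs lacking contact data are excluded in this sketch; whether to treat missing data as absence of contact is a design choice that would depend on the resolution and coverage of the conformation capture assay.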
Analyses of the regulation of expression by a transcription factor should benefit from CisMapper's more accurate and highly specific predictions of regulatory links between its binding sites and particular TSSs. For example, when searching for regulatory SNPs, it is reasonable to assume that those contained in TF binding sites predicted by CisMapper to be regulatory are more likely to be important biologically. (Note that we assume that the binding sites can be identified within the TF-bound regions predicted by CisMapper via standard motif-based methods (44).) Likewise, gene ontology analysis (25) performed using the more accurately predicted target gene set provided by CisMapper should better elucidate the biological roles of the ChIP-ed TF. Finally, when validating predicted regulatory binding sites via genome editing (e.g. using CRISPR/Cas (25)), CisMapper's ability to associate specific binding sites with a gene and to rank them by regulatory potential should prove invaluable.
The use of CisMapper need not be restricted to the analysis of TF ChIP-seq data. CisMapper can take as input any set of loci (expressed as a BED file) from the genome of interest, and will generate lists of the TSSs and genes that those regions may regulate. Previously we showed that the cross-tissue histone-expression correlation approach used by CisMapper can predict regulatory links between enhancers and TSSs (12), providing the first validation of this idea (19,45). As noted above, distance-based methods cannot reliably distinguish which TSS might be regulated by a given locus due to the possibility of chromatin looping. This ability to make TSS-specific predictions of regulation by arbitrary genomic loci is a novel feature of CisMapper.
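For readers preparing arbitrary loci as input, the sketch below shows one way to read the first columns of a BED file into a list of regions. It relies only on the standard BED conventions (chromosome, 0-based start, exclusive end, optional name); the function name and the fallback naming scheme are assumptions, and the sketch does not reflect how CisMapper itself parses its input.

```python
import io

def read_bed(handle):
    """Read loci from a BED-formatted text stream.
    Returns a list of (chrom, start, end, name) tuples, where start is
    0-based and end is exclusive, as in the BED format."""
    loci = []
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 fields: {line!r}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end < start:
            raise ValueError(f"invalid coordinates: {line!r}")
        # Assumption: unnamed regions get a coordinate-based name.
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        loci.append((chrom, start, end, name))
    return loci

# Hypothetical two-region example.
example = io.StringIO(
    "track name=example\n"
    "chr1\t1000\t1500\tenhancer_1\n"
    "chr2\t200\t450\n"
)
loci = read_bed(example)
print(loci)
```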
A second novel feature of CisMapper is that it can utilize data for any type of histone mark in making its predictions, and the regulatory links it predicts will depend on the histone mark chosen (e.g. H3K27ac or H3K4me3). By contrast, distance-based methods do not make predictions that take into account the histone state of the predicted regulatory loci. In future work we will explore running CisMapper using a series of distinct histone marks in order to classify links according to their ‘histone profiles’—the set of histone marks that identify the given link. This may allow us to group regulatory links into biologically relevant classes (e.g. activating, repressing, promoter-specific, enhancer-specific, etc.) in a way analogous to previous work that uses histone profiles to assign genomic loci to classes such as promoter, enhancer, insulator, etc. (46,47). In principle, this link-profiling approach might be used to classify links predicted by CisMapper from TF binding sites (ChIP-seq peaks), enhancers, disease-associated SNPs or chromatin conformation contact data.
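As a rough illustration of the link-profiling idea, the sketch below assumes that link predictions have already been obtained separately for each histone mark and simply records, for every link, the set of marks under which it was predicted. The function names and toy links are hypothetical, and no biological class labels are assigned.

```python
from collections import defaultdict

def histone_profiles(links_by_mark):
    """Map each link to the set of histone marks that predicted it.
    links_by_mark: {mark: set of (site, tss) links}"""
    profiles = defaultdict(set)
    for mark, links in links_by_mark.items():
        for link in links:
            profiles[link].add(mark)
    return {link: frozenset(marks) for link, marks in profiles.items()}

def group_by_profile(profiles):
    """Group links that share an identical histone profile."""
    groups = defaultdict(set)
    for link, marks in profiles.items():
        groups[marks].add(link)
    return dict(groups)

# Hypothetical predictions from two separate runs, one per mark.
links_by_mark = {
    "H3K27ac": {("p1", "tA"), ("p2", "tB")},
    "H3K4me3": {("p1", "tA")},
}
profiles = histone_profiles(links_by_mark)
groups = group_by_profile(profiles)
for marks, members in groups.items():
    print(sorted(marks), sorted(members))
```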
Supplementary Data are available at NAR Online.
National Institutes of Health [R01 GM103544 to T.L.B.]. Funding for open access charge: NIH [R01 GM103544 to T.L.B.].
Conflict of interest statement. None declared.