Identifying the globally most highly enriched regions of genes
We generated ChIP-seq data for the H3K9/14 histone acetylation in murine Th2 cells, a cell type that is part of the immune system and can be obtained in large numbers ex vivo
. Supplementary Figure S1A and S1B
demonstrate characterization of the cell type by protein and mRNA expression of cell-type-specific genes (Il13
. To test the experimental background of the ChIP-seq method, we also prepared data from a control sample where we used a non-specific IgG antibody during the immunoprecipitation step. Supplementary Table S1
shows that we were able to map ~10–27 million reads per ChIP-seq reaction of which ~80% were uniquely mappable.
We processed the mapped read positions based on the ‘XSET’ method (22
), by assuming our 36-bp sequence reads to be the ends of fragments of an average size of 200
bp. We then converted this into a distribution of read density along the chromosomes. Supplementary Figure S1C
shows the resulting peak landscape along the Gata3
In order to find the region relative to a specific genomic mark (5′- or 3′-end of genes, exons or introns) where histone modifications are present, we first aligned all the genomic objects (genes, exons or introns) according to either 5′- or 3′-ends and depicted an overall landscape of read distribution along the genomic objects –x
-axis being the base pair distance relative to the alignment point and y
-axis being the number of read-nucleotides per gene at that base pair. Additionally, we stretched or compressed each genomic object to the same length and obtained another set of landscapes in terms of ‘percentage length’. Within these landscapes, global peaks are detected as region with a read enrichment of >40% of the largest value and no <150
bp in width (2% in the case of percentage length). This is used as a window for further analysis (for a further discussion of the window detection procedure, please see the discussion in the Supplementary Data
). Once the windows are identified across all genomic annotations, their importance is determined and ranked by how many reads fall into each window normalized to the window width.
In the case of our H3K9/14ac data set, the most enriched window is from 400-bp upstream of the transcription start site (TSS) to 807-bp downstream. In contrast, the IgG control distribution is much flatter, showing only a slight enrichment at TSSs (A). Therefore, we used this window for further analysis. It is also interesting to note that intron/exons junctions featured small peaks.
Figure 1. ChIP-seq data distribution for the H3K9/14ac histone modification. (A) The cumulative read density for the whole genome is shown from −5kb to +5kb relative to TSSs. The H3K9/14ac sample (blue) shows a strong enrichment within (more ...)
Clearly, this strategy can be applied to any type of chromosomal annotation, including, for example, enhancer annotations. Furthermore, it can be applied to ChIP-seq data for proteins other than histones, e.g. TFs, RNA polymerase and so forth, to search for globally enriched regions relative to gene or enhancer annotations for instance.
Gene-by-gene quantification of histone modification levels: signal versus noise
Based on the optimal window of H3K9/14ac around TSSs, we extracted the peak area within the window for each gene individually. We normalize this value by the total area and define it as the normalized locus-specific chromatin state (NLCS). When we display the global distribution of this value for all genes as a density plot, the number of genes decays rapidly as the value increases for both the control and the H3K9/14ac sample. The H3K9/14ac sample features an extra shoulder (B), which is emphasized by log-transforming the data, and suggests that the sample data is a sum of two separate but overlapping distributions (B). This is reminiscent of flow cytometry data, where protein staining with fluorescent antibodies often leads to two separate distributions of log-transformed fluorescence intensity per cell, one corresponding to auto-fluorescence or background staining, the other to the subpopulation of cells expressing the protein (23
The major question is what the natures of the two subpopulations of our sample are. From the log transformation, we observe that the NLCS of the IgG control displays only one peak (apart from the smaller spikes which are due to log transformation) with NLCS values very similar to the left peak of the H3K9/14ac data. This observation indicates that the left peak is likely due to the background noise. This is also confirmed by a study of the genomic background within the samples. Here, we randomly selected fragments with the optimal window width from the intergenic regions and plotted the distribution of the NLCS values (B, dotted lines). For the IgG control, the resulting distribution largely overlaps with the distribution based on the window (B). The small shift to the left probably reflects the slight TSS enrichment we have seen previously (A). The H3K9/14ac intergenic distribution also lies slightly toward the left of the left peak, and its large overlap with the left peak suggests that the left peak consists of a large proportion of experimental noise and possibly biological noise, which should be filtered from the right peak—the genes with true modification signals. Therefore, we designate these two to-be-separated subpopulations of genes as BG (background) and HM (histone modification) (B).
In order to separate HM from BG, the shape of each distribution has to be elucidated. Although most studies assume that the experimental background of high-throughput sequencing follows Poisson distributions (22
), some studies suggest that the ChIP-seq background is not simple (24
). In order to directly compare a Poisson fit to possible alternatives (normal, log normal or truncated normal), we extracted from our IgG control data the number of complete sequencing reads that map to the TSS window. We excluded genes with zero values (2% in H3K9/14ac and 18% in IgG, which are reasonably low portions) because a discontinuity of the distribution is present at zero due to the lack of resolution and hence not suitable for modeling. We calculated the Bayesian Information Criterion (BIC) (21
) as an indication of the goodness-of-fit. The fitting results suggest that the two-parameter distributions (normal, lognormal and truncated normal) fit much better than the one-parameter Poisson model. Amongst the two-parameter distributions, the truncated (at 0.5) normal distribution fits best to the IgG control distribution (C, see Supplementary Table S2
The shape of the HM distribution of our H3K9/14ac sample suggests that a normal or lognormal function would fit the distribution of this group of genes well too. In order to determine contributions of the BG and HM groups to the total distribution of the data, we fit two-component mixture models of various combinations of normal and lognormal distributions (two normal, two lognormal, mixed lognormal and normal) to the NLCS distributions by using the expectation EM (20). Due to the complexity of parameter estimation, the truncated normal distribution was not included here.
All of the tested models fit the data well, with very close BICs (normal+lognormal, A, for other combinations, see Supplementary Figure S2
and Table S3
). For further analysis, we picked the model that represents BG and HM groups as normal and lognormal (A), respectively, as they best reproduce the shape of the actual distribution.
Figure 2. (A) Mathematical modeling of the H3K9/14ac sample data. (A) combination of a normal (for BG) and a lognormal distribution (for HM) was fit to the NLCS data (from the −400/+807-bp TSS window, as shown in B). The experimental data is shown (more ...)
Distinguishing histone-modified genes from background
Based on the parameterization of the BG and HM distributions, we can now quantify the overlap between the two groups of genes at any point of the data distribution. An FDR can be set which gives a minimum threshold of NLCS above which the probability of finding a BG gene is below the desired FDR value. Genes with NLCS values higher than the minimum threshold can then be classified within the desired FDR. Similarly, BG genes can be obtained by a maximum threshold, which means that for genes below this threshold, the probability to find HM genes is less than the FDR.
We chose an FDR of 0.01 and performed a binary classification of genes for both the BG and HM groups of genes (A). Of the total 21
326 genes, 20
424 could be classified, with the remainder of 902 genes in the intermediate region where the BG and HM distributions overlap extensively. Our classification agrees with the current understanding of T helper cell biology as genes expressed exclusively in Th2 cells such as Gata3
are found in the HM group. Genes such as Ifng
, on the other hand, which are known to be expressed in Th1 and Th17 cells (and not in Th2 cells), respectively, are found in the BG category.
Integration with expression data
In order to probe the biological meaning of the two sets of genes, we extracted the expression levels of genes in each category from published microarray data for the same cell type (19
). Genes in the HM category were expressed at significantly higher levels (P
, one-sided Wilcoxon rank-sum test) than those classified as BG (B).
To further study the relationship between histone modification and expression, we display the combined histone modification and expression data for all genes as a 2D density plot, with the color denoting the density of genes at particular spots. This representation of the NLCS versus gene expression data supports the concept of BG/HM separation, as two clearly identifiable groups of genes are visible, which correspond to either high expression and presence of H3K9/14ac histone modification or low/off expression and histone modification at background levels, respectively (C). Comparison with B shows that the expression levels of the BG/HM classified genes agree very well with the expression levels of the two groups of genes in C.
These results demonstrate that the global pattern of ChIP-seq signal for a histone modification allows us to distinguish experimental or biological background from the actual signal, if the data is extracted from each gene individually based on a sequence window. This in turn makes it possible to classify genes into background and histone-modified genes and yields estimates for the genome-wide fraction of modified genes. The close agreement with the expression data demonstrates the validity and power of our approach. An overview of our analysis strategy is given in .
Based on our observations, we have developed the ‘EpiChIP’ software, which allows the analysis of ChIP-seq experiments in a similar way to the approach described above. EpiChIP is a platform independent desktop application and comes with a user-friendly graphical interface that does not require any programming skills. The user supplies EpiChIP with files of sequencing reads that were mapped by any of three common mapping programs (Maq, Eland, Bowtie) or BED files. Then, based on a gene annotation file (RefSeq for mouse and human are included with EpiChIP) or user-defined custom annotations, one or more sequence windows with respect to genomic coordinates can be chosen. These may include regions around the TSS or end of a gene, the sequences at intron/exon boundaries, etc. EpiChIP then determines the density of paired-end or XSET-processed single-end reads along the sequence window(s) for each gene.
The genome-wide distribution of the data along the window(s) is displayed, peak detection is performed and further window-specific information such as the percentage of total peak area that falls into the various windows is output. This enables the user to decide which window should be used for the further analysis.
EpiChIP allows the user to upload a control sample and overlay it with the plots generated for the actual sample as a further means to test the presence of specific signal in the sample (in addition to the presence of bimodality in the global data distribution).
Based on the selected window, the NLCS value for each individual gene is then extracted and the distribution of NLCS among all genes is displayed on linear or log2 scale. Currently, our program fits combinations of normal or lognormal distributions to this data and displays all relevant fitting parameters. An upgrade allowing for further distributions is planned. EpiChIP lets the user decide on an FDR and saves the resulting gene lists in files that can be used for further analysis.
In the case that the 1D distribution does not allow one to clearly identify two peaks, we also included a feature for linking the data to a second data set, such as gene expression data, and displaying it as a heatmap as in C. The user can then encircle groups of genes and save them to files.
EpiChIP is downloadable for multiple platforms from http://epichip.sourceforge.net/index.html
. The web site will be regularly updated and contains a tutorial, a detailed documentation, an FAQ list and the mapped Th2 sample and control files which can be used as demos for exploring EpiChIP functions.
We have tested EpiChIP on published ChIP-seq datasets for 36 histone modifications, RNA PolII and CTCF binding and H2AZ histones in human Th cells (8
). EpiChIP confirmed previously described genomic distributions of the studied modifications and identified optimal windows varying from <200
bp to >5
kb in length in 14 out of 16 histone acetylations and 14 out of 20 methylations (Supplementary Table S4
). The smaller number of optimal windows that were identified among the methylations probably reflects the more diverse pattern of histone methylations, which, in contrast to the acetylations, can also be associated with repressed transcription and generally show more intricate distributions, such as the marking of expressed exons (11
In all cases where an optimal window was detected, the bimodality in the log NLCS-value distribution was observed. We show screenshots of the curve fittings EpiChIP performed on five different examples. For H3K9me1, we used the window EpiChIP detected automatically, from −987 to +2568
bp with respect to TSSs (A). For H3K27me3, where a peak at TSSs is not strong but present, we manually picked a window from the TSS to 2-kb downstream (B). Finally, we picked a window from the start of exons to +300-bp downstream (the first exons and introns of all genes are excluded by EpiChIP to avoid signal overlaps from TSSs) for the H3K36me3 modification (C). In addition, we show curve fittings to two canonical activating histone modifications, H3K9ac (Supplementary Figure S3A
) and H3K4me3 (Supplementary Figure S3B
). The EpiChIP curve fitting yielded good fits in all cases and allowed successful separation of HM from BG. This illustrates the functionality of our program and demonstrates the wide applicability to different types of histone modifications and other data.
EpiChIP screenshots for analysis examples. Types of histone modifications and analysis windows as indicated (A) H3K9me1, (B) H3K27me3 and (C) H3K36me3.
We further studied the consistency of EpiChIP with one of the most widely used peak-finding program, MACS (28
). We used the list of peaks MACS identified in our H3K9/14ac sample to represent a new starting dataset for EpiChIP and compared its output on the MACS peaks with the EpiChIP output on the original data. We found that the optimal window for the MACS peaks (from −402 to +841
bp at TSSs, Supplementary Figure S4A
) was very similar to the one identified in the original data (see above). Moreover, once we used this window to calculate the NLCS values and perform a classification into BG and HM, the overlap between HM genes and genes that had a MACS peak in the same window was very high (Supplementary Figure S4B
). This good agreement supports our concept of background modeling based on the genome-wide data distribution. Furthermore, it demonstrates that EpiChIP and peak-finding programs can be used in a complementary manner. As a further example, we compared the performance of EpiChIP and MACS on the H3K27me3 dataset as described above. When we select for all genes that have MACS peaks within the region from 0 to 2
kb with respect to TSS and plot the expression levels of those, we find that many of the MACS-selected genes are still expressed at high levels, whereas the proportion of EpiChIP-‘HM’ genes that are expressed are lower and within the range expected from the FDR (Supplementary Figure S5