Identifying Cell-specific Chromatin Modifications
We applied diffReps to analyze the differential sites of H3K4me3 between hESC and K562 from the ENCODE project 
. This dataset (see Text S1
for more details) contains two biological replicates for each condition and the number of uniquely mapped reads ranges between 7 and 16 million (). We used diffReps with default parameter values except that the “–nsd” option (see Design and Implementation) is set to “sharp” because H3K4me3 tends to generate sharp peaks. The hESC was used as the control group while the K562 was used as the treated group. In addition, two input samples are also available in each cell line from the same project (see Text S1
for more details). We mixed one input from hESC with another input from K562 and created two groups of input samples. We used these data to perform mock comparisons to estimate the number of false positives given by a differential analysis method.
Summary of the two benchmark datasets.
First, we set out to determine the sensitivity and specificity (using mock data) of a number of different methods, including diffReps (negative binomial test), diffReps (G-test on pooled replicates), DESeq 
, edgeR 
and ChIPDiff 
. Both DESeq and edgeR are popular R packages that are widely used in differential expression analysis for RNA-seq. They use the negative binomial distribution to estimate the over-dispersion among replicates and further stabilize it using neighboring genes. They can also be applied to ChIP-seq data by splitting the genome into non-overlapping windows (see Text S1
for more details). ChIPDiff is an approach based on hidden Markov models (HMM) to detect differential sites. For completion, we also added a peak-calling based approach by first calling H3K4me3 peaks using CCAT 
, finding the union of peaks from the two cell lines and then identifying differential peaks using DESeq (referred in the following as CCAT+DESeq). This way we can compare the peak-calling based approach with window based methods.
Each method is run repeatedly with its nominal p-value chosen from 1E-2 to 1E-8 (See Figs. S3
& S4 for additional p-values). shows that diffReps is the most sensitive method among all the methods compared, followed by edgeR, DESeq, ChIPDiff and, lastly, CCAT+DESeq. At each cutoff, diffReps (negative binomial test) typically detects a few thousands more differential sites than the secondly ranked method, edgeR. It should be noticed that G-test predicts ~10,000 more sites than negative binomial test on average. However, G-test also generates ~4–5 times more false positives than negative binomial test when using a p-value cutoff of 1E-4 or more stringent (). Because G-test simply considers the ratio of read count between two comparison groups and ignores the variation within a group, some loss of specificity is expected. What is more, the negative binomial test result is almost a subset of the G-test result as shown by Venn diagrams (). This makes G-test a viable alternative to negative binomial test if specificity is less of a concern. Although diffReps seems to produce more false positives than the other methods on the mock data (), we observed an exponential decrease of false positives when the significance cutoff increases. At diffReps’ default cutoff (p<1E-4
), the empirically estimated false discovery rate (FDR) is 28/15,109
0.2% for negative binomial test and 137/25,369
0.5% for G-test.
The number of differential sites found by different methods as p-value cutoff varies from 1E-2 to 1E-8 on the ENCODE H3K4me3 ChIP-seq data.
Another experiment was performed to determine the consistency among various methods in ranking differential sites by statistical significance. diffReps (negative binomial test) was compared against two other approaches on the top 2,500 and 5,000 sites using Venn diagrams. shows that diffReps shows very nice consistency with DESeq and ChIPDiff: 87% (2,500) and 93% (5,000) of the diffReps top sites overlap with another method. In contrast, ChIPDiff shows slightly more disparity than the other methods: 19% (2,500) and 8% (5,000) of its top sites are unique. We performed the same analysis between diffReps (negative binomial test), edgeR and CCAT+DESeq. diffReps shows very nice consistency with edgeR (Fig. S1
): 74% (2,500) and 82% (5,000) of the top sites are overlapped. However, CCAT+DESeq largely disagrees with the other methods: 78% (2,500) and 80% (5,000) of its top sites are unique.
Overlap of the top differential site lists of three methods: diffReps (negative binomial test), DESeq and ChIPDiff on H3K4me3 comparing K562 and hESC.
To further study the differential sites identified by diffReps (negative binomial test), we compared it with ChIPDiff and DESeq at its default significance cutoff (p<1E-4). In total, diffReps found 9,649 increased sites and 5,460 decreased sites in K562 vs. hESC. ChIPDiff detected 2,955 increased and 3049 decreased sites, while DESeq detected 3,880 increased and 3,470 decreased sites. Venn diagram shows that almost all of the differential sites found by ChIPDiff and DESeq are also detected by diffReps ().
Venn diagrams of the differential sites of diffReps (negative binomial test), ChIPDiff and DESeq on H3K4me3 comparing K562 and hESC.
H3K4me3 is known to be associated with transcriptional activation 
and may also relate to alternative splicing 
. To study the functional relevance of the differential sites, we analyzed the transcriptomes of hESC and K562 using RNA-seq. Cufflinks 
was used to identify differential transcriptomic events (FDR<5%) between the two cell lines 
. We separated all differential sites detected by the three approaches (diffReps, ChIPDiff and DESeq) into two categories: “diffReps-specific” and “Overlap” (i.e., detected by both diffReps and another method). To correlate the differential sites with gene expression, we further classified them into “Promoter” (upstream 1Kb to downstream 1Kb of TSS) and “Genebody” (downstream 1Kb of TSS to TES), and ignored the ones that are mapped to intergenic regions. Considering the direction of change, we defined four groups in total: “Promoter Up”, “Promoter Down”, “Genebody Up” and “Genebody Down”.
In the overlap category, the differential sites show very significant and positive correlation with the overall gene expression change ( and Table S3
), validating the function of H3K4me3 as an activation mark. These differential sites also show some correlation with alternative promoter usage but no correlation with alternative splicing ( and Table S3
). In the diffReps-specific category, only increased gene expression is correlated with “Promoter Up” and “Genebody Up” ( and Table S3
). Interestingly, alternative splicing shows correlation with “Genebody Up” and “Promoter Down”, while alternative promoter shows correlation with “Genebody Down” and “Genebody Up” ( and Table S3
). There are 61 and 69 genes that show alternative promoter usage and alternative splicing on the genome, while 20 and 14 of them contain at least one diffReps-specific site. In total, 29 and 37 diffReps-specific sites are located on those genes with alternative promoter usage and alternative splicing, respectively. We identified those sites’ closest TSS or alternative exon (i.e., variant exon and exons with alternative boundaries) and plotted the distribution of the distance between a site and its closest feature (). Clearly, there is a sharp peak centered on TSS or exon in the density plots, indicating spatial proximity of the differential sites to those features. To investigate whether this distribution is specific to the sites that are related with alternative splicing/promoter, we drew the same plots for all 10,025 diffReps-specific sites (Fig. S2
). We observed the same distribution pattern. Indeed, 8,410 (or 84%) of the sites are within 1Kb of TSSs or alternative exons, corresponding to 7,120 unique genes. 638 of these genes show transcriptional changes (i.e. whole gene expression, alternative promoter or alternative splicing) as defined by Cufflinks (FDR<5%).
Heatmaps of correlation between chromatin modification differential sites and differential transcriptomic events on the ENCODE data.
Distance density plot of diffReps-specific sites that are related with alternative promoter usage and alternative splicing.
As a negative control, we performed the same correlation analysis between the differential sites from the mock data and the transcriptional changes. In total, diffReps found 28 (negative binomial test) and 137 (G-test) differential sites of which only 3 and 12 can be mapped to promoter or genebody, respectively. None of these differential sites overlaps with the transcriptional changes.
Presumably the overlap category contains differential sites that are most significant and represent major H3K4me3 differential sites. It is not surprising to find them to correlate with gene expression change. While the diffReps-specific sites represent minor H3K4me3 differential sites. The correlation between diffReps-specific sites and alternative splicing (as well as gene expression) indicates that many functionally important differential sites may be missed by other methods. This is further illustrated by two examples (). MICU1, a gene that encodes a mitochondrial calcium uptake protein, contains two isoforms with alternative TSSs: one TSS is about 100Kb downstream of the other. ENST00000361114 is a major isoform that shows 7.4 fold increased expression in K562 (RPKM
37.2) compared to hESC (RPKM
5.0). While ENST00000418483 is a minor isoform that shows the opposite direction of change (hESC RPKM
2.5; K562 RPKM
0). diffReps found two up-regulated H3K4me3 sites: one (fold change
1.3E-5) is ~500bp upstream and the other (fold change
6.8E-7) is ~1,200bp downstream of the TSS of ENST00000361114 (). This supports the role of H3K4me3 as a positive regulator of alternative promoter usage. FANCI is a gene that belongs to the Fanconi anemia complementation group which is related to DNA repair. This gene’s expression shows a 6.4 fold increase from hESC (RPKM
14.3) to K562 (RPKM
91.4). However, the gene’s 18 isoforms show variable degrees of transcriptional change which is probably regulated by the splicing machinery in a very delicate way. Cufflinks can detect these splicing changes by modeling the isoform expression distributions for the two comparison groups 
and report alternative splicing events. Here the gene is determined by Cufflinks to contain alternative splicing (FDR<0.1%). To evaluate the degree of splicing vs. overall transcriptional output for each isoform, a splicing index can be calculated as:
Figure 6 IGV  genome browser screenshots of two example genes that contain alternative promoter usage and alternative splicing events.
so that a negative SI
means the isoform changes to a lesser degree than the other isoforms; while a positive SI
means the isoform changes to a greater degree than the other isoforms. A major isoform, ENST00000310775, shows a 5.0 fold increase from hESC (RPKM
5.6) to K562 (RPKM
28.2), while a minor isoform, ENST00000564920, shows a 34.0 fold increase from hESC (RPKM
0.1) to K562 (RPKM
3.4). The SI
is calculated for the major and minor isoforms to be −0.4 and 2.4, respectively. diffReps found an up-regulated site (fold change
3.3E-6) which overlaps a variant exon of the minor isoform (). This seems to suggest a positive role for H3K4me3 in the inclusion of the variant exon in K562 cells.
It should be noticed that in , the differential site overlaps with an H3K4me3 peak that sits on the TSS of FANCI. H3K4me3 is a histone mark that strongly associates with promoters (or alternative promoters). In our analysis, we also found most differential sites to be located near TSS. However, an H3K4me3 peak often spans several Kb, raising the possibility for it to affect the splicing procedures of the downstream overlapping exons. Previously, people have reported the histone mark’s association with splicing 
. Here we have shown some statistical association between the mark’s differential sites that extend into genebody and splicing. This phenomenon is further demonstrated by an additional example in Fig. S7
Applying diffReps to Study a Repressive Mark in Mouse Brain
We have previously published a ChIP-seq study 
of H3K9me3 in nucleus accumbens of mouse treated with chronic cocaine or saline injections. This dataset () contains two groups: cocaine and saline. The cocaine group has two replicates and the saline group has three replicates. Each replicate has ~12–15 million uniquely mapped reads. This dataset also contains six DNA input samples. We separated them into two groups with three samples in each group and created a mock dataset. This dataset represents an in vivo
study where the signal-to-noise ratio is presumably much lower than that of cultured cells. We used diffReps with default parameter values and the “–nsd” option is set to “broad” because H3K9me3 tends to generate broad peaks.
First, we performed the same sensitivity and specificity study as above on these data. The results (; See Figs. S5
& S6 for additional p-values) again show that diffReps is more sensitive than the other methods. Because the “–nsd" option is set to a more permissive threshold than above, we observed more false positives from almost all methods (ChIPDiff is still zero) on the mock data (). At the most relaxed cutoffs, i.e. p<1E-2
, the FDR seems to be very high: roughly 10–20% for both negative binomial test and G-test using diffReps. At the default cutoff (p<1E-4
), the FDR decreases to 5% for negative binomial test and 6% for G-test. Similarly, the negative binomial test result is a subset (nearly) of the G-test result ().
The number of differential sites found by different methods as p-value cutoff varies from 1E-2 to 1E-8 on the brain H3K9me3 ChIP-seq data.
We compared diffReps (negative binomial test) with DESeq and edgeR at the default cutoff. diffReps reports more than 3,000 differential sites, while DESeq and edgeR report only 20 and 31 sites. Venn diagrams () show that diffReps is inclusive of the comparison methods except two down sites that are uniquely identified by edgeR. diffReps automatically annotates the >3,000 differential sites based on their genomic locations. Using the annotated output, we could easily create a pie-chart to analyze the domains of this mark’s regulation. As shown in , the majority of the mark’s regulation resides in intergenic regions (76%). This is consistent with the mark’s enrichment in heterochromatic regions. There is also an appreciable portion of the differential sites in genebodies (20%) with only a very small portion in promoters (3%).
Venn diagrams of the differential sites of diffReps, DESeq and edgeR on H3K9me3 comparing cocaine- and saline-treated mouse nucleus accumbens.
Genomic distribution of the H3K9me3 differential sites identified by diffReps.
diffReps further defined 266 hotspots (p<1E-2
; Table S1
) from these differential sites. We annotated the hotspots using the ChIPpeakAnno package 
to identify the overlapping transcripts allowing a 1Kb gap. 132 of the hotspots were found to associate with 627 transcripts which correspond to 198 unique genes. We performed pathway analysis (Ingenuity Systems, www.ingenuity.com
) on these genes and found the top one under physiological function to be “Nervous System Development and Function” (p=2.9E-8
) which involves 52 genes (Table S2
). The first thing we noticed is that the majority of these genes would be considered to be silenced in adult neurons. For example, the top functional annotation is “olfactory response of organism” (p=2.9E-8
) that involves 27 olfactory receptors (OR), which are not expressed appreciably outside of the olfactory epithelium; the top two functional annotation are “abnormal morphology of axis” (p=5.1E-07
) that involves five homeobox domain genes, which are developmentally regulated transcription factors that function during mammalian development 
. Our RNA-seq data confirm that these genes are not expressed at significant levels in the nucleus accumbens (unpublished data). Although it is hard to predict what partial unsilencing, or further silencing, of these genes would do in this brain region, regulation at these sites indicates that repeated cocaine might alter certain developmentally determined gene programs, which, through chromatin structural alterations, may lead to aberrant regulation of silenced loci in adult nucleus accumbens neurons. This observation is consistent with our previous study 
where we also observed global alterations in chromatin/nuclear structure, whereby chronic cocaine exposure resulted in decreased heterochromatic domains and increased nuclear surface area in nucleus accumbens neurons, potentially indicating large scale alterations in nuclear function. What is more, a recent study has shown that OR are being dynamically regulated by H3K9me3, where the choice of expression is mediated via derepression of this histone mark 
. Our results may then shed light on novel functional roles for this group of genes in cocaine-regulated transcriptional perturbation in nucleus accumbens. In the following, we briefly discuss some other genes that we found to be potentially interesting:
- SYNE1 is a multi-isomeric modular protein that forms a network between organelles and the actin cytoskeleton to maintain the subcellular spatial organization . This could also be important for synaptic development, although there is no literature to indicate a role in cocaine responsiveness.
- HDAC2 has been implicated in dendritic/synaptic regulation in adult brain, both in the context of learning and memory and stress behaviors . Also, it has been observed to increase in expression in response to cocaine self-administration .
- RASGRF1 could also be interesting, as Ras signaling (which is induced by increased BDNF signaling in nucleus accumbens) has been shown to be important for cocaine behaviors . RASGRF1 is induced by cocaine, which promotes Ras-Mapk signaling in the nucleus accumbens  and is important for phospho-CREB signaling  and ΔFosB accumulation , both of which are important for cocaine behavior.
In summary, the hotspot-detecting program allows us to identify the chromatin regions that are being heavily regulated. Applying this method to ChIP-seq analysis of H3K9me3 from a drug administration study in mouse brain reveals cocaine-mediated regulation on several genes which may turn out to be novel targets in the future. Therefore, the hotspot-detecting program can be a useful research tool for those who are interested in dynamic regulation of chromatin modifications.