In this study we have described HPeak, an HMM-based algorithm for defining ChIP-enriched peaks from short sequencing read data generated from ChIP-Seq experiments. Distinct from various algorithms currently available [10], HPeak explicitly assumes probability distributions to model coverage profiles of hypothetical DNA fragments (HDFs, see the Methods section) along the genome. After dividing each chromosome into bins, HPeak employs an HMM to distinguish ChIP-enriched regions from the background. Generalized Poisson (GP) [41] or zero-inflated Poisson (ZIP) [42] distributions were used to model observed HDF counts in each bin, allowing for a better fit to the data than a standard Poisson distribution. The end of each HDF was down-weighted when evaluating coverage to account for the uncertainties in ChIP DNA fragment length. These features facilitate the recognition of the core regions that show significant ChIP-enrichment. As a result, HPeak produces more calibrated peaks with higher motif concentration when compared to other peak-finding algorithms. Evaluation of experimental data showed favourable performance in terms of motif enrichment. Because the underlying HMM is quite general, HPeak may be applied to a wide spectrum of ChIP-Seq data with different experimental designs and different sequencing depths, achieving balanced sensitivity and specificity with little or no fine-tuning by the users. In a recent study, Laajala et al. conducted a comprehensive performance comparison of existing peak-calling software [32]. HPeak was included in that study along with eight other published peak-calling algorithms. Overall, HPeak performed encouragingly in that comparison. For example, HPeak showed the best true positive rate and was closest to the optimum when tested on the NRSF ChIP-Seq data. This is consistent with what we have found.
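To make the modelling idea concrete, the following is a minimal sketch, not the HPeak implementation, of a two-state HMM over binned counts with a ZIP emission for the background state and a GP emission for the enriched state, decoded with the Viterbi algorithm. The function names, all parameter values and the toy counts are assumptions chosen for illustration; HPeak's actual parameter estimation and HDF down-weighting are described in the Methods section and are not reproduced here.

```python
import math

def zip_logpmf(k, pi, mu):
    """Log pmf of a zero-inflated Poisson: extra zero mass pi, Poisson mean mu."""
    if k == 0:
        return math.log(pi + (1.0 - pi) * math.exp(-mu))
    return math.log(1.0 - pi) - mu + k * math.log(mu) - math.lgamma(k + 1)

def gp_logpmf(k, theta, lam):
    """Log pmf of the generalized Poisson, theta > 0 and 0 <= lam < 1."""
    return (math.log(theta) + (k - 1) * math.log(theta + k * lam)
            - theta - k * lam - math.lgamma(k + 1))

def viterbi(counts, log_emit, log_trans, log_init):
    """Most likely state path; log_emit is a list of per-state functions of the count."""
    n = len(log_init)
    score = [log_init[s] + log_emit[s](counts[0]) for s in range(n)]
    back = []
    for c in counts[1:]:
        prev = score
        # Best predecessor state for each current state.
        ptr = [max(range(n), key=lambda r: prev[r] + log_trans[r][s]) for s in range(n)]
        score = [prev[ptr[s]] + log_trans[ptr[s]][s] + log_emit[s](c) for s in range(n)]
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Illustrative (assumed) parameters: state 0 = background, state 1 = enriched.
log_emit = [lambda k: zip_logpmf(k, 0.3, 0.5),
            lambda k: gp_logpmf(k, 6.0, 0.3)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
log_init = [math.log(0.9), math.log(0.1)]

counts = [0, 0, 1, 0, 9, 12, 10, 0, 1, 0]  # toy HDF counts per bin
path = viterbi(counts, log_emit, log_trans, log_init)
print(path)
```

In this toy example the three high-count bins are decoded as enriched and the remaining bins as background; runs of consecutive enriched bins would then be merged into candidate peaks.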
A key advantage of model-based methods is that they are compatible with rigorous statistical inference. For example, under our model assumption, we can directly calculate the probability of observing the actual number of HDFs in a bin. Such probabilities can then be used to rank all peaks identified. This is important, as we have shown that higher-ranking peaks are more likely to harbour canonical binding motifs (Figure ). Furthermore, these probabilities can facilitate comparison of peaks across samples and studies. Another advantage of model-based methods is that additional information such as GC content and mappability scores can be easily incorporated by extending the model.
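One plausible way to turn such probabilities into a ranking is sketched below. It assumes, purely for illustration, that each peak is scored by the upper-tail probability of its HDF count under a background ZIP model, so that smaller values rank higher. The peak names, counts and parameter values are hypothetical, and the exact score used by HPeak may differ.

```python
import math

def zip_pmf(k, pi, mu):
    """Zero-inflated Poisson pmf: extra zero mass pi, Poisson mean mu > 0."""
    poisson = math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
    return (pi if k == 0 else 0.0) + (1.0 - pi) * poisson

def zip_upper_tail(k, pi, mu, max_terms=10_000):
    """P(X >= k), summed directly over the tail to avoid cancellation in 1 - CDF."""
    if k <= 0:
        return 1.0
    total = 0.0
    for j in range(k, k + max_terms):
        term = zip_pmf(j, pi, mu)
        total += term
        # Stop once we are past the mode and further terms are negligible.
        if j > mu and term <= total * 1e-17:
            break
    return total

# Hypothetical peaks with their HDF counts; background parameters are assumed.
peak_counts = {"peak_a": 14, "peak_b": 6, "peak_c": 9}
scores = {name: zip_upper_tail(c, pi=0.3, mu=0.5) for name, c in peak_counts.items()}
ranked = sorted(scores, key=scores.get)  # smallest tail probability first
print(ranked)
```

Because the score is a probability under an explicit model, it has the same meaning in every sample, which is what makes comparison across samples and studies straightforward.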
In addition to its ability to identify the ChIP-enriched portion of the genome, HPeak provides more extensive information than other available programs. For example, as an option to users, HPeak provides more comprehensive annotation corresponding to each peak such as GC content, phylogenetic conservation (phastCons scores [43]), genomic features of the region (exon, intron, 5' UTR, 3' UTR, intergenic), and distance to the TSS of nearby genes. HPeak also provides an optional WIG file containing the genomic locations of all identified peaks, easily enabling the visualization of all of the peaks within the UCSC genome browser. Further, HPeak provides an optional FASTA format sequence file containing nucleotide sequences of all peaks to facilitate subsequent motif analysis.
When comparing publicly available STAT1 ChIP-Seq and ChIP-chip data, we found that the ChIP-Seq technique has a clear advantage over the ChIP-chip technique in terms of enriching for an expected motif under the predicted peaks. The improvement can be largely attributed to the increased resolution offered by the new sequencing technology. By enabling the detection of narrower peaks flanking the true binding sites, ChIP-Seq reveals a higher concentration of the predicted binding motif within its peaks. It is known that the significance measure derived from the ChIP-chip data is correlated with the probability that a region contains the canonical binding motif [40]. We found that such correlation is much stronger in ChIP-Seq data (Figure ). This implies that the read coverage profile is very informative on the presence and location of actual functional binding sites.
In this study, rather than the commonly used Chi-square goodness-of-fit test, we proposed the K-S test as an alternative for evaluating the reproducibility of datasets obtained under different conditions. We found that the K-S test is better suited to sequencing data than the Chi-square goodness-of-fit test, since there is no need to divide chromosomes into windows, and correlation/reproducibility can be conveniently visualized by plotting ECDFs (Figure S1 in additional file 1).
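The window-free nature of the K-S comparison can be illustrated with a short sketch of the two-sample K-S statistic, the largest vertical gap between two ECDFs. It assumes the inputs are read positions on a single chromosome from two datasets; the function name and the toy positions are illustrative, and the exact quantity compared in this study is the one defined in the Methods section.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both ECDFs past every observation equal to x (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Toy read positions (hypothetical) from two replicates on one chromosome.
rep1 = [105, 230, 232, 410, 980, 1500]
rep2 = [110, 228, 240, 400, 990, 1490]
print(ks_statistic(rep1, rep2))
```

The statistic uses the raw positions directly, so no bin or window size has to be chosen. In practice a library routine such as `scipy.stats.ks_2samp` would also return a p-value.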
It is worth pointing out that it is challenging to determine criteria for evaluating the performance of various peak-calling algorithms on experimental ChIP-Seq data. This is because, in general, very little information is available on true positive and true negative loci. We chose to use the prevalence of known motifs as a metric for performance. One caveat of this approach is that, as pointed out in Hu et al. 2010 [44], many of the motif patterns stored in the database may not be accurate and there may be novel motifs that do not exist in motif databases. We speculate that the inaccurate motif patterns will affect the results of all peak-calling algorithms equally, but the actual effect remains to be seen and further studies seem warranted. Additionally, our method is not well-suited for quantifying the false detection rate; therefore, some methods may be put in a disadvantaged position in our comparison. Because of this, our performance evaluation results should be interpreted with caution.
The ChIP-Seq technology can be applied to other types of proteins in addition to TFs. For example, multiple studies have utilized ChIP-Seq to identify histone modification sites in the genome [12], as reviewed in Park [45]. Since the underlying two-component HMM is quite general, HPeak can also be applied to data collected from other types of sequencing-based experiments such as MeDIP-seq [46], RNA-seq [47] and methylation pattern discovery [19]. In these experiments, HPeak can be used to identify regions in the genome that are significantly enriched for sequencing reads. Some adjustment of HPeak parameters such as bin size may be needed when analyzing non-ChIP-Seq types of data.
The HMM used in HPeak assumes two different states: enriched and non-enriched. Although such a scheme is well-accepted in ChIP-chip analyses [33], it is possible and of interest to consider more sophisticated HMM schemes where more than two states are allowed. As an example, we may consider a four-state HMM: enrichment of reads on the positive strand, enrichment of reads on the negative strand, enrichment of reads on both strands and no enrichment. By utilizing strand information, we will be able to better identify true binding events, since a symmetric pattern among reads on different strands is expected around the binding sites. Another possibility is to distinguish shapes of peaks, such as sharp peaks and broad, low plateaus. These may help us to distinguish different types of binding events.
We assume HDF counts follow ZIP and GP distributions in background and ChIP-enriched regions, respectively. Other probability distributions such as the negative binomial (NB) have also been used to model ChIP-Seq data [22]. It is of interest to understand whether these distributions fit observed ChIP-Seq data better than the standard Poisson distribution. To investigate, we fit GP, Poisson and NB distributions to the number of HDFs found in NRSF and STAT1 ChIP-Seq data. For the number of HDFs found in the ChIP-enriched regions, we found that the GP distribution shows a much better fit than the Poisson distribution and a slightly better fit than the NB distribution. An example of the model fit can be found in Figure S4 in additional file 1. For the number of HDFs found in background regions, we found that the ZIP distribution produces a slightly better fit than both Poisson and NB distributions (data not shown).
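A comparison of this kind can be sketched as follows for the Poisson and GP cases. The sketch fits the GP by the method of moments, a simplification assumed here for brevity rather than the fitting procedure used in this study, and compares log-likelihoods on a small set of hypothetical, overdispersed counts. It relies on the GP mean θ/(1−λ) and variance θ/(1−λ)³.

```python
import math
import statistics

def poisson_logpmf(k, mu):
    """Log pmf of the Poisson distribution with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def gp_logpmf(k, theta, lam):
    """Log pmf of the generalized Poisson, theta > 0 and 0 <= lam < 1."""
    return (math.log(theta) + (k - 1) * math.log(theta + k * lam)
            - theta - k * lam - math.lgamma(k + 1))

def gp_moment_fit(counts):
    """Method-of-moments estimates (theta, lam); requires variance >= mean."""
    m = statistics.mean(counts)
    v = statistics.variance(counts)
    if v < m:
        raise ValueError("data are underdispersed; this sketch assumes lam >= 0")
    ratio = math.sqrt(m / v)      # equals 1 - lam
    return m * ratio, 1.0 - ratio

# Hypothetical HDF counts from enriched bins (not data from this study).
counts = [2, 3, 5, 8, 12, 20, 4, 6, 30, 10]

mu_hat = statistics.mean(counts)
theta_hat, lam_hat = gp_moment_fit(counts)

pois_ll = sum(poisson_logpmf(k, mu_hat) for k in counts)
gp_ll = sum(gp_logpmf(k, theta_hat, lam_hat) for k in counts)
print(f"Poisson log-likelihood: {pois_ll:.2f}")
print(f"GP log-likelihood:      {gp_ll:.2f}")
```

When λ = 0 the GP reduces to the standard Poisson, so the extra parameter only matters when the counts are overdispersed, as in this toy sample where the variance is several times the mean.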
The current HPeak algorithm does not distinguish reads of different orientations within a peak. Such information has been shown to be informative in pinpointing the summit of the peak and in estimating the DNA fragment length [20]. We plan to incorporate such information in a future release of HPeak, and we believe it will further enhance the performance of the HPeak program.