Biomed Res Int. 2017; 2017: 5346793.
Published online 2017 March 5. doi:  10.1155/2017/5346793
PMCID: PMC5357551

COPAR: A ChIP-Seq Optimal Peak Analyzer

Binhua Tang (1, 2, *), Xihan Wang (1), and Victor X. Jin (3)

Abstract

Sequencing data quality and peak alignment efficiency of ChIP-sequencing profiles are directly related to the reliability and reproducibility of NGS experiments. To date, no tool has been specifically designed for optimal peak alignment estimation and quality-related genomic feature extraction for ChIP-sequencing profiles. We developed COPAR, an open-source, user-friendly package, to statistically investigate, quantify, and visualize the optimal peak alignment and inherent genomic features using ChIP-seq data from NGS experiments. It gives biologists a versatile means to quality-check high-throughput experiments and optimize their experimental design. COPAR processes mapped ChIP-seq read files in BED format and outputs statistically sound results for multiple high-throughput experiments. We verified the package with three public ChIP-seq data sets and have deposited COPAR on GitHub under a GNU GPL license.

1. Introduction

Next-generation sequencing (NGS) integrated with ChIP technology provides a genome-wide perspective for biomedical research and clinical diagnosis applications [1–3].

Data quality and peak alignment of ChIP-sequencing profiles are directly related to the reliability and reproducibility of analysis results. For example, ChIP-seq data capture changes in transcription factor (TF) binding activity in response to chemical or environmental stimuli, but if the ChIP-seq alignment is poorly selected, follow-up analysis may yield inaccurate TF binding results and lose biological meaning [4, 5].

The most frequently examined items in ChIP-seq peak calling are the peak number, false discovery rate (FDR), corresponding bin-size, and other statistical thresholds selected in each analysis. These arguments make it difficult for biologists and bioinformaticians to choose a suitable combination of settings for analyzing experimental results.

To our knowledge, few publications or application notes address this topic; we therefore propose a flexible package based on feature extraction and signal processing algorithms to solve this argument-selection optimization problem in optimal peak alignment.

In summary, COPAR quantitatively measures NGS/ChIP-seq experiment quality through global peak alignment comparison and extracts genomic features with a spectrum-based method for in-depth analysis of ChIP-sequencing profiles.

2. Materials and Methods

2.1. Optimal Peak Alignment Estimation

To determine the optimal ChIP-seq alignment, we analyze peak numbers under specific argument constraints. The optimal peak number is obtained by constraining these arguments, which can be formalized as a class of optimal track analysis:

$$\arg\max_{i} P_i,\quad i \in N,\qquad \text{s.t.}\quad f_i \le \chi,\; b_i = \beta,\; p_i \le \delta, \tag{1}$$

where $P_i$ denotes a set of optimal peak numbers under the corresponding argument constraints, $f_i$ stands for the FDR argument, $b_i$ stands for the bin-size, $p_i$ denotes the p value threshold, and $\chi$, $\beta$, and $\delta$ represent the respective presupposed argument values.
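The selection in (1) can be sketched as a filter followed by a maximum. The following is a minimal illustration only, not COPAR's own code: the `Candidate` type, the `select_optimal` function, and the numbers in the usage example are hypothetical, and the FDR and p value constraints are assumed to be upper bounds.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """One peak-calling run (hypothetical container, not part of COPAR)."""
    bin_size: int        # b_i, in bp
    p_threshold: float   # p_i
    fdr: float           # f_i
    peaks: int           # P_i


def select_optimal(
    candidates: Sequence[Candidate],
    max_fdr: float,
    bin_size: Optional[int] = None,
    max_p: Optional[float] = None,
) -> Optional[Candidate]:
    """Return the feasible candidate with the largest peak number.

    A candidate is feasible if its FDR is at most max_fdr, its bin size
    equals bin_size (when given), and its p value threshold is at most
    max_p (when given). Returns None if no candidate is feasible.
    """
    feasible = [
        c for c in candidates
        if c.fdr <= max_fdr
        and (bin_size is None or c.bin_size == bin_size)
        and (max_p is None or c.p_threshold <= max_p)
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.peaks)


if __name__ == "__main__":
    # Illustrative values only; they are not results from the paper.
    runs = [
        Candidate(bin_size=100, p_threshold=1e-3, fdr=0.02, peaks=5000),
        Candidate(bin_size=200, p_threshold=1e-3, fdr=0.04, peaks=7000),
        Candidate(bin_size=200, p_threshold=1e-4, fdr=0.08, peaks=9000),
    ]
    print(select_optimal(runs, max_fdr=0.05))
```

Leaving `bin_size` and `max_p` unset scans the whole grid, which corresponds to the global comparison; fixing them restricts the search to a single argument setting.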

2.2. Spectrum-Based Genomic Feature Extraction

The power spectrum of a finite random sequence is normally estimated from its autocorrelation sequence by means of the discrete-time Fourier transform (DTFT) [6–8]:

$$P(\omega) = \frac{1}{2\pi} \sum_{n=-\infty}^{\infty} C_{xx}(n)\, e^{-jn\omega}, \tag{2}$$

where $C_{xx}$ denotes the autocorrelation sequence of a discrete signal $x_n$, defined as

$$C_{xx}(i,j) = \frac{E\left[(X_i - \mu_i)(X_j - \mu_j)\right]}{\sigma_i \sigma_j}, \tag{3}$$

where $\mu$ and $\sigma$ stand for the mean and standard deviation, respectively.

In our study, given the characteristics of ChIP-seq data, we use 128 sampling points to calculate the discrete Fourier transform, with a sampling frequency of 1 kHz.
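As a rough illustration of the estimate described in this section, the sketch below computes a normalized autocorrelation sequence and evaluates its Fourier transform on a 128-point frequency grid at a 1 kHz sampling frequency. It is a sketch under stated assumptions, not the package's implementation: the function names are hypothetical, the autocorrelation uses the biased sample estimate, and the lags are truncated to fewer than half the grid size.

```python
import math
from typing import List, Sequence, Tuple


def autocorrelation(x: Sequence[float], max_lag: int) -> List[float]:
    """Biased, variance-normalized sample autocorrelation for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0:
        raise ValueError("constant signal has no defined autocorrelation")
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / (n * var)
        for k in range(max_lag + 1)
    ]


def power_spectrum(
    x: Sequence[float], n_fft: int = 128, fs: float = 1000.0
) -> Tuple[List[float], List[float]]:
    """Estimate the power spectrum from the autocorrelation sequence.

    Returns (frequencies in Hz, power) for the one-sided grid 0..fs/2.
    Assumption: lags are truncated at n_fft // 2 - 1.
    """
    max_lag = min(len(x) - 1, n_fft // 2 - 1)
    c = autocorrelation(x, max_lag)
    freqs, power = [], []
    for k in range(n_fft // 2 + 1):
        omega = 2.0 * math.pi * k / n_fft
        # C(-n) = C(n) for a real signal, so the two-sided sum of
        # C(n) * exp(-j*n*omega) reduces to a cosine series.
        s = c[0] + 2.0 * sum(c[n] * math.cos(n * omega) for n in range(1, len(c)))
        freqs.append(k * fs / n_fft)
        power.append(s / (2.0 * math.pi))
    return freqs, power


if __name__ == "__main__":
    # Synthetic 125 Hz tone sampled at 1 kHz (illustrative input only).
    signal = [math.cos(2.0 * math.pi * 125.0 * t / 1000.0) for t in range(128)]
    freqs, power = power_spectrum(signal)
    print(freqs[power.index(max(power))])
```

With a 1 kHz sampling frequency the one-sided grid spans 0 to 500 Hz, the Nyquist limit. Because the lags are truncated, the estimate can dip slightly below zero at some frequencies, so take the absolute value before converting to decibels.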

3. Results

The COPAR package was developed and open-sourced for academic biologists; it uses built-in functions to determine the optimal peak alignment candidate and to extract genomic features from ChIP-seq datasets.

The package is designed to handle BED-formatted ChIP-seq data as input [9]. It can process a single ChIP-seq sample for optimal peak alignment and feature extraction analysis, and it can perform genome-wide statistical comparison across multiple ChIP-seq samples. The analysis flowchart for the package is given in Figure 1.

Figure 1
Flowchart for optimal peak alignment estimation and genomic feature analysis with COPAR. The package can perform optimal peak estimation based on global alignment of ChIP-seq data; then it can utilize the frequency spectrum approach for genomic feature ...
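To make the BED-formatted input concrete, the following sketch counts mapped reads per fixed-size bin. It is illustrative only and does not reproduce COPAR's internal preprocessing: the function name `bin_read_counts` is hypothetical, and assigning each read to a bin by its start coordinate is an assumption.

```python
from collections import Counter
from typing import Iterable, Tuple


def bin_read_counts(
    bed_lines: Iterable[str], bin_size: int = 100
) -> "Counter[Tuple[str, int]]":
    """Count reads per (chromosome, bin index) from BED-formatted lines.

    BED columns are whitespace-separated: chrom, chromStart (0-based),
    chromEnd (exclusive), then optional fields. Each read is assigned to
    the bin containing its start coordinate (an assumption of this sketch).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts: "Counter[Tuple[str, int]]" = Counter()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start = fields[0], int(fields[1])
        counts[(chrom, start // bin_size)] += 1
    return counts


if __name__ == "__main__":
    # Made-up reads for illustration; not data from the paper.
    demo = [
        "track name=demo",
        "chr1\t10\t46\tread1\t0\t+",
        "chr1\t120\t156\tread2\t0\t-",
        "chr1\t150\t186\tread3\t0\t+",
    ]
    for key, value in sorted(bin_read_counts(demo, bin_size=100).items()):
        print(key, value)
```

Varying `bin_size` over a range such as 100 through 500 bp yields one count profile per setting, which is the kind of grid the bin-size argument spans.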

COPAR automatically determines the optimal peak alignment with a statistically meaningful FDR through fast global alignment comparison, which is subject to two statistical arguments: bin-size and p value threshold.

The functionalities of the package are largely complementary to, and extend, current tools for ChIP-seq data analysis. The optimal peak alignment estimation is shown in Figures 2(a) and 2(b), and the spectrum-based feature extraction in Figures 2(c) and 2(d). Figures 2(a) and 2(b) use heatmaps to represent the peak number and the corresponding FDR candidate, respectively, for each argument pair of bin-size (vertical axis) and p value threshold (horizontal axis). Figure 2(c) shows the spectrum distribution of the global peak alignment candidate sequence, normalized to the frequency range [0, 500] Hz with magnitude within [−40, −3] dB; Figure 2(d) shows the randomized case.

Figure 2
Global optimal peak analysis result subject to the arguments bin-size and FDR. (a) Global distributions for peak number candidates and (b) corresponding false discovery rate, subject to bin-size (vertical axis, from 100 through 500 bp) and p ...

4. Conclusions

Based on global peak alignment, COPAR optimizes argument selection in ChIP-seq analysis; it also uses spectrum-based signal processing to extract genomic features and to statistically compare multiple ChIP-seq samples from NGS high-throughput experiments.

In summary, COPAR processes mapped read files in BED format and outputs statistically sound results for diverse high-throughput sequencing experiments. We verified the package with three GEO ChIP-seq datasets as case studies and included the analysis results in the package manual. COPAR is currently available under a GNU GPL license from https://github.com/gladex/COPAR.

Acknowledgments

This work has been supported by the Natural Science Foundation of Jiangsu, China (BE2016655 and BK20161196), Fundamental Research Funds for China Central Universities (2016B08914), and Changzhou Science & Technology Program (CE20155050). This work made use of the resources supported by the NSFC-Guangdong Mutual Funds for Super Computing Program (2nd Phase) and the Open Cloud Consortium- (OCC-) sponsored project resource, supported in part by grants from Gordon and Betty Moore Foundation and the National Science Foundation (USA) and major contributions from OCC members.

Abbreviations

NGS: Next-generation sequencing
ChIP-seq: Chromatin immunoprecipitation-sequencing
FDR: False discovery rate
TF: Transcription factor
DTFT: Discrete-time Fourier transform

Competing Interests

The authors declare that they have no competing interests.

Authors' Contributions

Binhua Tang and Victor X. Jin conceived the method; Binhua Tang and Xihan Wang wrote and compiled the package; Binhua Tang, Xihan Wang, and Victor X. Jin drafted and proof-checked the manuscript.

References

1. Mardis E. R. ChIP-seq: welcome to the new frontier. Nature Methods. 2007;4(8):613–614. doi: 10.1038/nmeth0807-613.
2. Martinez G. J., Rao A. Cooperative transcription factor complexes in control. Science. 2012;338(6109):891–892. doi: 10.1126/science.1231310.
3. Kilpinen H., Barrett J. C. How next-generation sequencing is transforming complex disease genetics. Trends in Genetics. 2013;29(1):23–30. doi: 10.1016/j.tig.2012.10.001.
4. Chikina M. D., Troyanskaya O. G. An effective statistical evaluation of chipseq dataset similarity. Bioinformatics. 2012;28(5):607–613. doi: 10.1093/bioinformatics/bts009.
5. Furey T. S. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nature Reviews Genetics. 2012;13(12):840–852. doi: 10.1038/nrg3306.
6. Oppenheim A. V., Schafer R. W. Discrete-Time Signal Processing. 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall; 2010.
7. Tang B., Hsu H.-K., Hsu P.-Y., et al. Hierarchical modularity in ERα transcriptional network is associated with distinct functions and implicates clinical outcomes. Scientific Reports. 2012;2, article 875. doi: 10.1038/srep00875.
8. Wang S.-L., Zhu Y.-H., Jia W., Huang D.-S. Robust classification method of tumor subtype by using correlation filters. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2012;9(2):580–591. doi: 10.1109/TCBB.2011.135.
9. Lan X., Bonneville R., Apostolos J., Wu W., Jin V. X. W-ChIPeaks: a comprehensive web application tool for processing ChIP-chip and ChIP-seq data. Bioinformatics. 2011;27(3):428–430. doi: 10.1093/bioinformatics/btq669.

Articles from BioMed Research International are provided here courtesy of Hindawi