Biomed Res Int. 2017; 2017: 5346793.

Published online 2017 March 5. doi: 10.1155/2017/5346793

PMCID: PMC5357551

*Binhua Tang: Email: bh.tang@outlook.com

Academic Editor: Xingming Zhao

Received 2016 October 28; Accepted 2017 February 14.

Copyright © 2017 Binhua Tang et al.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Sequencing data quality and peak alignment efficiency of ChIP-sequencing profiles are directly related to the reliability and reproducibility of NGS experiments. Until now, no tool has been specifically designed for optimal peak alignment estimation and quality-related genomic feature extraction for ChIP-sequencing profiles. We developed COPAR, an open-source, user-friendly package, to statistically investigate, quantify, and visualize the optimal peak alignment and inherent genomic features using ChIP-seq data from NGS experiments. It gives biologists a way to quality-check high-throughput experiments and optimize their experimental design. COPAR processes mapped ChIP-seq read files in BED format and outputs statistically sound results for multiple high-throughput experiments. We verified the package with three public ChIP-seq data sets and have deposited COPAR on GitHub under a GNU GPL license.

Next-generation sequencing (NGS) integrated with ChIP technology provides a genome-wide perspective for biomedical research and clinical diagnosis applications [1–3].

Data quality and peak alignment of ChIP-sequencing profiles are directly related to the reliability and reproducibility of analysis results. For example, ChIP-seq data capture changes in transcription factor (TF) binding activity in response to chemical or environmental stimuli, but if the ChIP-seq alignment is poorly selected, follow-up analysis may yield inaccurate TF binding results and lose biological meaning [4, 5].

The items most often examined in ChIP-seq peak calling are the peak number, the false discovery rate (FDR), the corresponding bin-size, and other statistical thresholds selected in each analysis. Choosing these arguments is a substantial barrier for biologists and bioinformaticians who need a suitable argument pair for analyzing experimental results.

To our knowledge, few publications or application notes address this topic; we therefore propose a flexible package, based on feature extraction and signal processing algorithms, for solving this argument-selection optimization problem in optimal peak alignment.

In summary, the package COPAR can quantitatively measure NGS/ChIP-seq experiment quality through global peak alignment comparison and extract genomic features with a spectrum method for in-depth analysis of ChIP-sequencing profiles.

To determine the optimal ChIP-seq alignment, we analyze peak numbers under specific argument constraints. The optimal peak number is obtained by constraining those arguments, which can be formalized as a class of optimal track analysis, illustrated as

$$\begin{aligned}
\underset{i}{\arg\max}\quad & P_i, \qquad i \in N \\
\text{s.t.}\quad & f_i \le \chi, \\
& b_i = \beta, \\
& p_i \le \delta,
\end{aligned} \tag{1}$$

where *P*_{i} denotes a set of optimal peak numbers under corresponding argument constraints, *f*_{i} stands for argument FDR, *b*_{i} stands for bin-size, *p*_{i} denotes *p* value threshold, and *χ*, *β*, and *δ* represent the presupposed argument values, respectively.
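The constrained selection in (1) can be sketched in a few lines of Python. This is a minimal illustration and not COPAR's own code: the `Candidate` class, the `select_optimal` function, and the example numbers are all hypothetical, chosen only to show how the three constraints filter the candidates before the peak number is maximized.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One peak-calling run (hypothetical container, not a COPAR type)."""
    bin_size: int       # b_i
    p_threshold: float  # p_i
    fdr: float          # f_i
    n_peaks: int        # P_i


def select_optimal(candidates: Iterable[Candidate], chi: float,
                   beta: int, delta: float) -> Optional[Candidate]:
    """Return the feasible candidate with the largest peak number, or None."""
    feasible = [
        c for c in candidates
        if c.fdr <= chi and c.bin_size == beta and c.p_threshold <= delta
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.n_peaks)


# Made-up runs for illustration only; these are not results from the paper.
runs = [
    Candidate(bin_size=100, p_threshold=1e-3, fdr=0.08, n_peaks=5200),
    Candidate(bin_size=100, p_threshold=1e-4, fdr=0.03, n_peaks=4100),
    Candidate(bin_size=100, p_threshold=1e-5, fdr=0.01, n_peaks=2900),
    Candidate(bin_size=200, p_threshold=1e-4, fdr=0.02, n_peaks=4800),
]
best = select_optimal(runs, chi=0.05, beta=100, delta=1e-3)
print(best)
```

In this toy input the run with the most peaks is rejected because its FDR exceeds *χ*, and the run with bin-size 200 is rejected because it does not match *β*, so the selection falls on the second run.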

For a finite random variable sequence, its power spectrum is normally estimated from its autocorrelation sequence by use of discrete-time Fourier transform (DTFT), denoted as [6–8]

$$P(\omega) = \frac{1}{2\pi} \sum_{n=-\infty}^{\infty} C_{xx}(n)\, e^{-jn\omega}, \tag{2}$$

where *C*_{xx} denotes autocorrelation sequence of a discrete signal *x*_{n}, defined as

$$C_{xx}(i,j) = \frac{E\left[\left(X_i - \mu_i\right)\left(X_j - \mu_j\right)\right]}{\sigma_i \sigma_j}, \tag{3}$$

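where *μ* and *σ* stand for the mean and standard deviation, respectively. For a stationary sequence, *C*_{xx}(*i*, *j*) depends only on the lag *n* = *i* − *j*, which gives the single-argument form *C*_{xx}(*n*) used in (2).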

In our study, given the characteristics of ChIP-seq data, we use 128 sampling points to calculate the discrete Fourier transform, with a sampling frequency of 1 kHz.
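As a rough illustration of (2) and (3) with these settings, the sketch below estimates a spectrum as the discrete Fourier transform of a sample autocorrelation sequence (a correlogram estimate), using only the Python standard library. It is an assumption-laden sketch, not COPAR's implementation: the function name `correlogram_spectrum`, the biased autocorrelation estimate, the truncation of lags to fit 128 points, and the synthetic 125 Hz test signal are all choices made here for illustration.

```python
import math


def correlogram_spectrum(x, n_fft=128, fs=1000.0):
    """Spectrum estimate from the normalized sample autocorrelation.

    Returns (freqs, spectrum) at n_fft // 2 + 1 frequencies from 0 to fs / 2.
    Truncating the lags means individual values can be slightly negative.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(x) / n
    xc = [v - mean for v in x]
    var = sum(v * v for v in xc) / n
    if var == 0.0:
        raise ValueError("constant signal has no normalized autocorrelation")

    # Biased estimate of C_xx(k) for lags 0..max_lag, normalized so C_xx(0) = 1.
    max_lag = min(n - 1, n_fft // 2 - 1)
    acf = [
        sum(xc[t] * xc[t + k] for t in range(n - k)) / (n * var)
        for k in range(max_lag + 1)
    ]

    freqs, spectrum = [], []
    for m in range(n_fft // 2 + 1):
        omega = 2.0 * math.pi * m / n_fft
        # C_xx is even in the lag, so the sum over negative and positive lags
        # collapses to a cosine series.
        total = acf[0] + 2.0 * sum(
            acf[k] * math.cos(k * omega) for k in range(1, max_lag + 1)
        )
        freqs.append(m * fs / n_fft)
        spectrum.append(total / (2.0 * math.pi))
    return freqs, spectrum


# Synthetic test signal (not ChIP-seq data): a 125 Hz sinusoid sampled at 1 kHz.
fs = 1000.0
signal = [math.sin(2.0 * math.pi * 125.0 * t / fs) for t in range(512)]
freqs, spectrum = correlogram_spectrum(signal, n_fft=128, fs=fs)
peak_freq = freqs[spectrum.index(max(spectrum))]
print(peak_freq)
```

With a 1 kHz sampling frequency the frequency axis runs from 0 to 500 Hz, and 128 points give a spacing of 7.8125 Hz between adjacent frequencies.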

The COPAR package was developed and open-sourced for academic biologists; it uses built-in functions to determine the optimal peak alignment candidate and to extract genomic features from a ChIP-seq dataset.

The package takes BED-formatted ChIP-seq data as input [9]. It can process a single ChIP-seq sample for optimal peak alignment and feature extraction analysis, and it can perform genome-wide statistical comparison across multiple ChIP-seq samples. The analysis flowchart for the package is given in Figure 1.
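To make the input format concrete, the following sketch reads BED-style lines (chromosome, 0-based start, end, then optional fields) and counts reads per fixed-width bin. The source does not describe how COPAR bins reads internally, so the midpoint rule, the function name `bin_read_counts`, and the example reads below are assumptions for illustration only.

```python
from collections import Counter


def bin_read_counts(bed_lines, bin_size):
    """Count reads per (chromosome, bin index) from BED-formatted lines."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Assumption: each read is assigned to the bin containing its midpoint.
        midpoint = (start + end) // 2
        counts[(chrom, midpoint // bin_size)] += 1
    return counts


# Made-up reads for illustration; real input would come from a mapped BED file.
example = [
    "track name=example",
    "chr1\t100\t136\tread1\t0\t+",
    "chr1\t150\t186\tread2\t0\t-",
    "chr1\t420\t456\tread3\t0\t+",
    "chr2\t10\t46\tread4\t0\t+",
]
counts = bin_read_counts(example, bin_size=200)
print(sorted(counts.items()))
```

For a file on disk, the same function can be fed an open file handle, since iterating over a text file yields its lines.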

Figure 1: Flowchart for optimal peak alignment estimation and genomic feature analysis with COPAR. The package can perform optimal peak estimation based on global alignment of ChIP-seq data; then it can utilize the frequency spectrum approach for genomic feature ...

COPAR can automatically determine the optimal peak alignment with a statistically meaningful FDR through fast global alignment comparison; the global comparison is subject to two statistical arguments, namely, bin-size and *p* value threshold.

The functionalities of the package are largely complementary to, and extend, current tools for ChIP-seq data analysis. The optimal peak alignment estimation is shown in Figures 2(a) and 2(b), and the spectrum-based feature extraction in Figures 2(c) and 2(d). Figures 2(a) and 2(b) use heatmaps to represent the peak number and the corresponding FDR candidate, respectively, for each argument pair of bin-size (vertical axis) and *p* value threshold (horizontal axis). Figure 2(c) shows the spectrum distribution of the global peak alignment candidate sequence, normalized to a frequency range of [0, 500] Hz and a magnitude within [−40, −3] dB; Figure 2(d) shows the randomized case.
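The text does not spell out how the spectrum magnitude is converted to decibels. One common convention, assumed here purely for illustration, is to express each power value relative to the spectrum's maximum and to clip very small values at a floor. The function name `to_decibels` and the example values are hypothetical.

```python
import math


def to_decibels(spectrum, floor_db=-40.0):
    """Convert power values to dB relative to the maximum, clipped at floor_db."""
    peak = max(spectrum)
    if peak <= 0.0:
        raise ValueError("spectrum needs at least one positive value")
    levels = []
    for p in spectrum:
        if p <= 0.0:
            # Non-positive estimates have no logarithm; pin them to the floor.
            levels.append(floor_db)
        else:
            levels.append(max(floor_db, 10.0 * math.log10(p / peak)))
    return levels


# Made-up power values: the peak, half the peak, 1% of the peak, and two tiny values.
levels = to_decibels([1.0, 0.5, 0.01, 0.00001, 0.0])
print(levels)
```

Under this convention, half of the peak power sits at about −3 dB and one percent of the peak power at −20 dB.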

Based on global peak alignment, COPAR optimizes the argument selection in ChIP-seq analysis; meanwhile, COPAR utilizes the signal spectrum processing method to further extract genomic features and statistically compare multiple ChIP-seq samples for NGS high-throughput experiments.

In summary, COPAR can process mapped read files in BED format and output statistically sound results for diverse high-throughput sequencing experiments. We further verified the package with three GEO ChIP-seq datasets as study cases and included the analysis results in the package manual. COPAR is currently available under a GNU GPL license from https://github.com/gladex/COPAR.

This work has been supported by the Natural Science Foundation of Jiangsu, China (BE2016655 and BK20161196), Fundamental Research Funds for China Central Universities (2016B08914), and Changzhou Science & Technology Program (CE20155050). This work made use of the resources supported by the NSFC-Guangdong Mutual Funds for Super Computing Program (2nd Phase) and the Open Cloud Consortium- (OCC-) sponsored project resource, supported in part by grants from Gordon and Betty Moore Foundation and the National Science Foundation (USA) and major contributions from OCC members.

- NGS: Next-generation sequencing
- ChIP-seq: Chromatin immunoprecipitation-sequencing
- FDR: False discovery rate
- TF: Transcription factor
- DTFT: Discrete-time Fourier transform

The authors declare that they have no competing interests.

Binhua Tang and Victor X. Jin conceived the method; Binhua Tang and Xihan Wang wrote and compiled the package; Binhua Tang, Xihan Wang, and Victor X. Jin drafted and proof-checked the manuscript.

1. Mardis E. R. ChIP-seq: welcome to the new frontier. *Nature Methods*. 2007;4(8):613–614. doi: 10.1038/nmeth0807-613.

2. Martinez G. J., Rao A. Cooperative transcription factor complexes in control. *Science*. 2012;338(6109):891–892. doi: 10.1126/science.1231310.

3. Kilpinen H., Barrett J. C. How next-generation sequencing is transforming complex disease genetics. *Trends in Genetics*. 2013;29(1):23–30. doi: 10.1016/j.tig.2012.10.001.

4. Chikina M. D., Troyanskaya O. G. An effective statistical evaluation of chipseq dataset similarity. *Bioinformatics*. 2012;28(5):607–613. doi: 10.1093/bioinformatics/bts009.

5. Furey T. S. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. *Nature Reviews Genetics*. 2012;13(12):840–852. doi: 10.1038/nrg3306.

6. Oppenheim A. V., Schafer R. W. *Discrete-Time Signal Processing*. 3rd ed. Upper Saddle River, NJ, USA: Prentice Hall; 2010.

7. Tang B., Hsu H.-K., Hsu P.-Y., et al. Hierarchical modularity in ER*α* transcriptional network is associated with distinct functions and implicates clinical outcomes. *Scientific Reports*. 2012;2: article 875. doi: 10.1038/srep00875.

8. Wang S.-L., Zhu Y.-H., Jia W., Huang D.-S. Robust classification method of tumor subtype by using correlation filters. *IEEE/ACM Transactions on Computational Biology and Bioinformatics*. 2012;9(2):580–591. doi: 10.1109/TCBB.2011.135.

9. Lan X., Bonneville R., Apostolos J., Wu W., Jin V. X. W-ChIPeaks: a comprehensive web application tool for processing ChIP-chip and ChIP-seq data. *Bioinformatics*. 2011;27(3):428–430. doi: 10.1093/bioinformatics/btq669.

Articles from BioMed Research International are provided here courtesy of **Hindawi**
