Effective genomic size is an important parameter in the calculation of the p-value. Because SIPeS can calculate fragment length and effective genomic size accurately from paired-end sequencing reads, it can improve the identification of DNA-protein binding sites from ChIP-Seq data. Most currently available algorithms cannot calculate the fragment size accurately because they are designed mainly for single-end ChIP-Seq data analysis, in which the fragment length is usually estimated from the direction of the reads in order to identify binding sites [16]. Paired-end sequencing technology generates paired reads with unique tokens, which SIPeS can use to calculate the fragment length. Moreover, SIPeS uses this accurate fragment length to calculate an accurate effective genomic size. Other algorithms rely on recommended estimates: MACS recommends an effective genome size for hg18 of about 90% of the whole genome length [7], SISSRs recommends about 80% [9] and FindPeaks suggests about 70% [12]. However, such estimates are likely to affect the accuracy of the analysis for researchers who use ChIP-Seq technology. In this study, an effective genomic size of 111,755,668 bp of AMS-enriched DNA was observed using the SIPeS preprocessing program, which accounts for approximately 93% of the Arabidopsis whole genome length. In addition, SIPeS calculates the DNA-protein binding sites on the basis of fragment pileup, which is more intuitive and credible, whereas most existing algorithms test enrichment on the basis of tag counts.
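To illustrate why paired-end data make these quantities directly measurable, the following minimal Python sketch builds a fragment pileup from mapped read pairs. It is not the SIPeS implementation. The function names, the 0-based half-open coordinate convention and, in particular, the reading of effective genomic size as the number of positions covered by at least one fragment are assumptions made here for illustration only.

```python
from typing import List, Tuple

# Each fragment is (start, end): the leftmost base of one mate and one past the
# rightmost base of the other mate (0-based, half-open). This convention is an
# assumption of this sketch, not something specified by SIPeS.
Fragment = Tuple[int, int]


def mean_fragment_length(fragments: List[Fragment]) -> float:
    """Fragment length is observed directly from each read pair."""
    return sum(end - start for start, end in fragments) / len(fragments)


def fragment_pileup(fragments: List[Fragment], chrom_length: int) -> List[int]:
    """Per-base count of fragments covering each position."""
    diff = [0] * (chrom_length + 1)
    for start, end in fragments:
        diff[start] += 1   # coverage rises where a fragment begins
        diff[end] -= 1     # and falls just past its last base
    pileup = []
    depth = 0
    for delta in diff[:chrom_length]:
        depth += delta
        pileup.append(depth)
    return pileup


def covered_fraction(pileup: List[int]) -> float:
    """Fraction of positions covered by at least one fragment
    (a hypothetical stand-in for effective genomic size)."""
    return sum(1 for depth in pileup if depth > 0) / len(pileup)


fragments = [(0, 5), (3, 8)]             # toy data, not from the study
pileup = fragment_pileup(fragments, 10)  # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0]
```

With single-end reads the fragment ends are unknown, so both the pileup and any coverage-based genome size would have to be built on an estimated fragment length rather than an observed one.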
Most current peak-finding methods scan the whole genome with a window of a given size and step, calculate the read count in each window and check whether it satisfies the statistical tests. Varying the window size and step length may therefore cause differences in the results. SIPeS determines peak start and end positions on the basis of a dynamic baseline, whereas other algorithms sometimes incorrectly split a true peak into two or more peaks. In addition, SIPeS uses the dynamic baseline to separate closely adjacent, overlapping binding sites. For example, if a baseline of 1 is used, two closely adjacent signal maps, A and B, are misrepresented as a single peak (Additional file 2a - baseline 1, signal map C identified). If a higher baseline is adopted, maps A and B are identified separately (Additional file 2a - baseline 2, signal maps A and B identified). SIPeS can also call a broad peak with high signal levels as one peak, whereas a peak of the same shape but with lower signal values would have every local maximum called (i.e. multiple peaks). For example, one peak with summit 1 will be called when the baseline is below 10 and the peak satisfies the p-value cutoff set by the user. When the baseline is increased to 10, two peaks will be called: one merged peak (summits 1 and 2) and peak 3. When the baseline is increased to 12, three peaks (1, 2 and 3) will be called (Additional file 2b). If the low signal value is not high enough to satisfy the p-value cutoff, then only broad peaks with higher signal will be called.
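The baseline behaviour described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SIPeS source code: the function names, the rule of raising the baseline in steps of 1 until a region has one summit or a user-set maximum is reached, and the toy signal values are all hypothetical, and the p-value and fold tests are omitted.

```python
def regions_above(signal, baseline, lo=0, hi=None):
    """Half-open (start, end) runs within signal[lo:hi] whose values exceed baseline."""
    hi = len(signal) if hi is None else hi
    regions, start = [], None
    for i in range(lo, hi):
        if signal[i] > baseline:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, hi))
    return regions


def count_summits(signal, start, end):
    """Number of local maxima in signal[start:end]; a flat plateau counts once."""
    summits, rising = 0, True
    for i in range(start + 1, end):
        if signal[i] > signal[i - 1]:
            rising = True
        elif signal[i] < signal[i - 1]:
            if rising:
                summits += 1
            rising = False
    return summits + (1 if rising else 0)


def split_peaks(signal, baseline=1, max_baseline=200, lo=0, hi=None):
    """Raise the baseline inside each region until it has a single summit
    or the maximum baseline is reached. Returns (start, end, baseline) tuples."""
    peaks = []
    for start, end in regions_above(signal, baseline, lo, hi):
        if count_summits(signal, start, end) == 1 or baseline >= max_baseline:
            peaks.append((start, end, baseline))
        else:
            peaks.extend(split_peaks(signal, baseline + 1, max_baseline, start, end))
    return peaks


# Toy pileup with three summits (20, 18, 16) and valleys at 12 and 10,
# chosen to mirror the three-summit example in the text.
signal = [0, 15, 20, 12, 18, 10, 16, 0]
print(split_peaks(signal, max_baseline=9))
print(split_peaks(signal, max_baseline=10))
print(split_peaks(signal, max_baseline=12))
```

On this toy signal the three calls return one, two and three regions respectively, which matches the progression described for baselines below 10, at 10 and at 12.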
Therefore, by utilizing a dynamic baseline, SIPeS can theoretically find all the signal maps with a single global maximum (Figure 6). This is of particular importance for high-density genomes, which may have a number of binding sites in close proximity. We found that the motif occurrence percentage is higher when t is increased from 1 to 200, which means that peak results improve with a high t value and suggests that t is a good indicator for finding binding sites (Figure 7). In addition, the peak number tends to stabilize as t is increased, so users can find more genuine DNA-protein binding sites by increasing the t value (Figure 8). From the analysis of our AMS ChIP-Seq data, we recommend that users choose as high a value for t as possible, since this allows the peaks to be identified more accurately, although it may take more computing time. SIPeS also reports the percentage of peaks with a single global maximum for the t set by the user, which can be used to judge whether t has been set reasonably.
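The percentage reported for a given t can be illustrated with a short Python sketch. It assumes that each called peak is available as a list of pileup heights and that "a single global maximum" means exactly one local summit, with flat plateaus counted once. Both points, and the function names, are assumptions of this sketch rather than documented SIPeS behaviour.

```python
from itertools import groupby


def summit_count(profile):
    """Number of local maxima in one peak's pileup profile."""
    heights = [h for h, _ in groupby(profile)]  # collapse flat plateaus
    padded = [float("-inf")] + heights + [float("-inf")]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )


def percent_single_summit(profiles):
    """Percentage of peak profiles that contain exactly one summit."""
    if not profiles:
        return 0.0
    single = sum(1 for p in profiles if summit_count(p) == 1)
    return 100.0 * single / len(profiles)


# Toy profiles (hypothetical values): three single-summit peaks and one
# peak that still contains two summits.
profiles = [[15, 20], [18], [15, 20, 12, 18], [3, 5, 5, 3]]
print(percent_single_summit(profiles))
```

A percentage that stops rising as t is increased would suggest, under these assumptions, that a larger t is unlikely to separate further peaks.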
Figure 6. Relationship between the maximum dynamic baseline t and percentage of signal maps with a single global maximum (p < 1 × 10^-5, fold > 2) for AMS using SIPeS. The percentage of peaks with a single global maximum appears increased ...
Figure 7. Relationship between the maximum dynamic baseline t and percentage of AMS motif occurrence (p < 1 × 10^-5, fold > 2) using SIPeS. When t is increased from the lower to the higher, more AMS motif occurrence percentage is revealed, ...
Figure 8. Relationship between the maximum dynamic baseline t and the peak number for AMS (p < 1 × 10^-5, fold > 2) called by SIPeS. As t is increased from the lower to the higher, more peaks are called by SIPeS using the AMS paired-end ChIP-Seq ...
Like existing algorithms, SIPeS is not suitable for peak finding in wide peak regions, such as those of histone marks, because the statistical tests cannot satisfy the user's threshold (for example, p < 0.01). In addition, the SIPeS algorithm targets paired-end sequencing reads and is not applicable to single-end sequencing data.