We have described Pyicos, a powerful tool for the analysis of mapped HTS reads. Pyicos framework facilitates the analysis of different HTS datatypes. We have described its application to ChIP-Seq, RNA-Seq and CLIP-Seq data using the three corresponding protocols: callpeaks, enrichment and clipseq. In callpeaks, we define a peak score in terms of a Poisson P-value, which is calculated independently for each chromosome, and in terms of one peak property, either the peak height or read count. The peak score therefore takes into account the differences across chromosomes and the fact that peaks with the same height or read count may have different P-values depending on the chromosome in which they are located. Using ChIP-Seq data for PR, CTCF, CEBPA and NRSF, we have shown that the peak score provides an appropriate ranking for peaks. Interestingly, using either the peak height or read count can make a difference depending on the dataset, as we found an increase of ~7% in the fraction of the top 500 PR peaks with motifs, leading to a better peak definition compared with the other methods. We have further shown that the subtraction of the control is effective to increase the fraction of peaks with motif, indicating that it eliminates potential FPs.
Pyicos has the advantage that all operations described are configurable by the user to keep as much flexibility as possible, since not all of them may be applicable to a dataset. For instance, splitting peaks seems to result in improvements only for CTCF. Similarly, although Pyicos allows the user to choose the number of tolerated duplicated reads, we found that in all datasets the best peak definition is achieved by removing all duplicates.
Furthermore, using various measures, we have compared callpeaks with three other methods specifically developed for punctuated ChIP-Seq data: MACS, USeq and FindPeaks. Regarding peak definition, we have found that FindPeaks, Pyicos and MACS show very similar spatial resolutions, which are also higher than those for USeq peaks. Furthermore, we observed that peaks ranked by Pyicos and MACS show a slightly higher fraction of motif-containing peaks than those ranked by FindPeaks and USeq.
Peak detection has been assessed using ChIP-qPCR validated positive and negative regions for NRSF. We found that all methods perform similarly, probably due to high agreement between their generated peaks. The methods with the highest pairwise correlations are Pyicos, MACS and USeq. FindPeaks showed a lower correlation, probably due to a different handling of the control sample: whereas, FindPeaks compares a peak height with the distribution of peak heights from the control sample, the other three methods compare signal to control locally, using windows (MACS and USeq) or at base pair resolution (Pyicos). In summary, our results lead us to conclude that callpeaks provides an accurate protocol for peak detection for punctuated ChIP-Seq data. Moreover, due to the more flexible usage of the various operations compared with other methods, it allows to design a customized analysis of the ChIP-Seq data.
We have further illustrated Pyicos flexibility by applying it to other datatypes. Using data from RNA-Seq and microarray experiments on liver and kidney samples, we have shown that Pyicos can recover DE genes with high accuracy and that is in fact comparable to methods specifically designed for differential expression analysis from RNA-Seq. Moreover, Pyicos performs well using different inputs: replicated or non-replicated, read counts or RPKM. DEGseq, which can also work with RPKM, shows similar accuracy as Pyicos and both correlate well in terms of the predicted Z-scores. The other two methods, DESeq and edgeR, were not designed to work with RPKM, hence could not be included in the comparison. The fact that Pyicos performs well with simulated replicated data presents the advantage of making possible to analyse many of the published datasets that have been produced without replica.
It is surprising that the highest accuracy is achieved on count data, since it is known that using counts leads to a length bias in DE gene detection. This is probably because the genes selected from the microarray for benchmarking are not much affected by the length bias, since the accuracy hardly changes when using RPKM. This also suggests that RPKM alone does not provide the optimal normalization method. However, using a new density definition, TRPK, which combines RPKM with the TMM normalization, the length differences between DE and non-DE genes are reduced to a level close to that of the microarray. Nonetheless, the TMM normalization is based on the assumption that the majority of regions tested do not change significantly, which might not hold true for some pairs of samples or for certain sets of regions. Accordingly, Pyicos implements this as an option to the user.
The flexibility of
Pyicos is further demonstrated by applying the EA protocol to broad ChIP-Seq data. In particular, using ENCODE ChIP-Seq data for H3K36me3 and RNAPII from the cell lines K562 and NHEK, we obtain a high correlation for the
Z-scores of these signals when compared with the enrichment
Z-scores for RNA-Seq between the same cell lines. Thus, the EA of
Pyicos using RPKM provides a tool to analyse broad ChIP-Seq data of various sorts, which would otherwise require a combination of approaches to be analysed (
Young et al., 2011). A further advantage is that
Pyicos can directly accept BED files with mapped reads and regions of interest, unlike DESeq and edgeR.
Pyicos can also calculate enriched regions
de novo genome wide, using the reads from two experiments that are overlapping or sufficiently close in position. This is particularly useful for the analysis of signals for which the user does not know where to expect the enrichment relative to genome annotations, like enhancer elements. We have further shown that
Pyicos can also be used to process CLIP-Seq data without a control. We expect that many of the basic
Pyicos operations could be applicable to other datatypes and could possibly be combined to generate new analysis protocols.
Some of the operations described can become impracticable for some analysis tools due to the amount of reads produced by a single HTS experiment nowadays. The bottleneck of HTS data analysis mostly lies in the memory usage of the software and in the storage and retrieval of data. Indeed, HTS data generation seems to be outpacing the improvements in CPU and disk storage (
Kahn, 2011). Moreover, HTS data has become ubiquitous in genomic research; hence, we should aim to provide software that can be adapted to the available computing resources and, in particular, that can run on the average desktop computer. For these reasons, we developed
Pyicos minimizing RAM usage and maintaining reasonable CPU time usage.
Pyicos callpeaks protocol for ChIP-Seq data outperforms the other tested methods concerning memory usage while it stays competitive in running time. FindPeaks and USeq algorithms load entire files in memory, allowing for better time performance. However, this is only practical on a computer with enough RAM. For example, an experiment with 100 million reads using as input a BED file, would require >4 GB available of RAM. For EA, the main bottleneck lies in the calculation of read counts or RPKM on the input regions. Although
Pyicos is not as fast as BEDTools combined with DESeq or edgeR, its memory performance is far superior, allowing the handling of very large datasets. As made patent in the last few years, read files are increasing enormously in size with the development of the technology, making memory usage a critical feature in HTS analysis software.
We conclude that the added value of having a modular tool is not at the cost of accuracy or performance; hence, Pyicos provides a useful framework for the analysis and integration of heterogeneous HTS data. Finally, as Pyicos is open source, we encourage the addition of new operations in order to combine them with already existing ones and possibly to create new analysis protocols.