The present study evaluated from a practical point of view the performance of the currently available open source software for detecting transcription factor binding sites in ChIP-seq data. A main observation was that the choice of the algorithm may considerably affect the overall conclusions made from the data (Figure ). Moreover, there was no clear winner among the methods that would have outperformed the other approaches systematically in each dataset. Instead, the choice of the best method was strongly dependent on the data under analysis (Figure ). While QuEST performed well in the NRSF data, MACS may be a better choice in the FoxA1 data, whereas FindPeaks showed good performance in the STAT6 data. Below we discuss some practical guidelines for the researchers, including (i) the choice of an appropriate algorithm for different study objectives, (ii) the use of a negative control sample, and (iii) the use of empirical validations.
In most of the currently published ChIP-seq studies, the choice of the peak detection algorithm lacks detailed motivation or description (see Table for a set of representative cases). Our comparison demonstrates, however, that this choice warrants careful attention. While most computational methods perform well under some circumstances, their behaviour can vary markedly depending on the dataset under analysis. This is especially true when the aim is to detect all the potential binding sites of a particular transcription factor of interest. If only a small set of the best candidate targets is to be detected, then all the methods performed relatively well in our comparisons (Figure ). To identify the best candidates, the candidate binding positions can be prioritized using the peak magnitude scores or p-values provided by the peak detection software.
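As a minimal sketch of such prioritization, the following Python snippet ranks a list of peaks either by p-value or by peak magnitude and keeps the top candidates. The `Peak` record and its field names are hypothetical and do not correspond to the output format of any particular peak detection program.

```python
from typing import List, NamedTuple


class Peak(NamedTuple):
    # Hypothetical record; real peak callers use their own output columns.
    chrom: str
    start: int
    end: int
    score: float    # peak magnitude reported by the software
    p_value: float  # significance reported by the software


def top_candidates(peaks: List[Peak], n: int, by: str = "p_value") -> List[Peak]:
    """Return the n highest-priority peaks under the chosen ranking."""
    if by == "p_value":
        # Smaller p-values indicate stronger evidence.
        ranked = sorted(peaks, key=lambda p: p.p_value)
    elif by == "score":
        # Larger peak magnitudes indicate stronger enrichment.
        ranked = sorted(peaks, key=lambda p: p.score, reverse=True)
    else:
        raise ValueError("by must be 'p_value' or 'score'")
    return ranked[:n]


# Toy data for illustration only.
example_peaks = [
    Peak("chr1", 100, 300, score=12.0, p_value=1e-4),
    Peak("chr2", 500, 700, score=40.0, p_value=1e-9),
    Peak("chr3", 900, 1100, score=25.0, p_value=1e-12),
]
best_by_p = top_candidates(example_peaks, 2)
best_by_score = top_candidates(example_peaks, 2, by="score")
```

The two rankings need not agree, which is one reason why the ranking criterion should be reported together with the chosen algorithm.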
When the goal is to identify a comprehensive set of regulatory interactions, the major challenge is to determine a suitable threshold to discriminate true binding sites from background noise. This was exemplified by the large differences in the numbers of peak calls observed with the different approaches (Figure ). PeakFinder and GeneTrack do not provide any statistical estimates of the FDR, making it difficult to choose an appropriate cutoff. Although the other algorithms also estimate the statistical significance of the detections, the accuracy of the estimation can depend heavily on the choice of the null model [5]. While the simplest model assumes that the background read density is uniform along the genome and independent between the strands, several authors have observed that the sequenced control samples show highly non-uniform behaviour and, in some cases, their read density patterns are close to those expected from true binding sites [5]. This can be due to various reasons, such as sequencing and mapping biases, non-specific immunoprecipitation or differences in the chromatin structure [10]. Therefore, the use of separate control samples has been suggested [5], and was also supported in our comparisons (Figure ). If experimentally determined true positive and true negative binding sites are available, then it is also possible to calculate an empirical FDR for the detections (Figure ).
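The two quantities discussed above can be illustrated with a short Python sketch. It assumes, for illustration only, that the uniform background is modelled with a Poisson distribution and that the empirical FDR is defined as the fraction of validated true negatives among the detections that overlap any validated site. The function names and the set-based representation of sites are hypothetical, and the individual programs compared here may use different definitions.

```python
import math
from typing import Optional, Set


def uniform_background_rate(total_reads: int, genome_size: int, window: int) -> float:
    """Expected read count in a window if reads fall uniformly on the genome."""
    return total_reads * window / genome_size


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), i.e. the upper-tail p-value.

    Computed as 1 - P(X <= k - 1); adequate for a sketch, although it
    loses precision for extremely small tail probabilities.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def empirical_fdr(called: Set[str], true_pos: Set[str], true_neg: Set[str]) -> Optional[float]:
    """Fraction of validated negatives among detections with a validation result.

    Returns None if no detection coincides with a validated site.
    """
    tp = len(called & true_pos)
    fp = len(called & true_neg)
    if tp + fp == 0:
        return None
    return fp / (tp + fp)


# Toy numbers: 1e6 reads on a 1e8 bp genome, 200 bp windows.
lam = uniform_background_rate(1_000_000, 100_000_000, 200)
p = poisson_sf(10, lam)  # p-value of observing at least 10 reads in a window
fdr = empirical_fdr({"a", "b", "c", "x"}, {"a", "b", "c", "d"}, {"x", "y"})
```

Because control samples are often far from uniform, a rate estimated locally from a control sample would typically replace the genome-wide rate in practice. Note also that an empirical FDR defined in this way depends on the proportion of positives and negatives in the validation set.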
Besides the inclusion of a control sample, another important decision concerns the type of control. At least three types of controls have been considered: a non-immunoprecipitated fragmented DNA sample (input DNA) [12], a ChIP-seq sample using an unspecific antibody (e.g. IgG) [25], or a ChIP-seq sample under a different cellular condition (e.g. without stimulation) [26]. Further study is needed to determine which control sample type provides the best outcome with the different algorithms. In addition, as the quality of the immunoprecipitating antibody critically affects the results, the actual ChIP experiment may also be repeated with different antibodies [8].
If the binding motif of the transcription factor of interest is known, then it can provide useful information about the relative performance of the different approaches (Figure ). However, there are also transcription factors that do not require a specific binding motif [27]. Further information about the adequacy of the peak detection methods can be obtained by experimental validations (Figure ). Since confirmation studies on candidate binding sites are expensive and time-consuming, however, thorough experimental validations are relatively rarely done when reporting large-scale findings. Moreover, it is worth noting that building an appropriate set of true negatives is a difficult task, and it has been suggested that the previously utilized sets of true negatives may actually also contain true positives despite their low enrichment ratios in the qPCR validations [22]. How to choose the best method directly from the data remains a challenging future research question.
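One simple way to use a known motif for such a comparison is to compute, for each method, the fraction of detected peak sequences that contain a motif match on either strand. The Python sketch below assumes that the motif can be written as a regular expression over A, C, G and T; the function names and the toy motif are hypothetical, and position weight matrix scanning would be a more realistic alternative.

```python
import re
from typing import Sequence

# Translation table for complementing DNA bases (other symbols stay unchanged).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def motif_fraction(peak_seqs: Sequence[str], motif_regex: str) -> float:
    """Fraction of peak sequences with a motif match on either strand."""
    if not peak_seqs:
        return 0.0
    pattern = re.compile(motif_regex)
    hits = 0
    for seq in peak_seqs:
        forward = seq.upper()
        if pattern.search(forward) or pattern.search(reverse_complement(forward)):
            hits += 1
    return hits / len(peak_seqs)


# Toy sequences and a toy, non-palindromic motif for illustration only.
toy_seqs = ["ccGATAAtt", "CCTTATCGG", "CCCCCCCC"]
fraction = motif_fraction(toy_seqs, "GATAA")
```

The fraction is most informative when compared with the same statistic computed on matched background sequences, since short motifs occur frequently by chance.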
From the point of view of an ordinary user, a major complication of the peak detection software is the typically large number of adjustable parameters. While the default parameters are a natural choice and were also applied by us, they may not be optimal for the particular data under analysis. On the other hand, the lack of an easy way to adjust the parameters properly can be regarded as a weakness of a method. Other critical issues in ChIP-seq data analysis are the memory requirements for the computer and the diversity of the current data formats. The required input formats of the peak detection software and their output peak lists are far from being standardized, nor are the output formats produced by the different read alignment software. Further technical challenges include, for instance, the quality of the aligned reads and the required depth of sequencing [22]. The interpretation of the results also poses its own challenges. Even if a comprehensive and unbiased set of binding sites could be determined with ChIP-seq, the identified sites may not all be functional regulatory elements that have an impact on transcription. Instead, it is possible that several non-functional detections are made as a consequence of biological noise [28].
Despite the challenges, next-generation DNA sequencing has great potential to accelerate biological and biomedical research by enabling a comprehensive analysis of genomes, transcriptomes and interactomes to be performed routinely without having the resources of large genomic centres [3]. While several issues remain to be solved regarding, for instance, the optimization of the peak detection algorithms, the current results already support the utility of the ChIP-seq technique. In our comparisons, for example, all the algorithms identified binding sites with highly significant overlap with the corresponding known sequence motif (Figure ), and the most prominent peaks were typically detected robustly across independent experiments (Figure ). In addition to transcription factor binding, a wide range of other biological phenomena can be investigated, such as chromosome conformation, genetic variation, and RNA expression (RNA-seq) to detect, for instance, differential splicing, microRNA and other non-coding RNAs [3].
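The robustness of peaks across independent experiments mentioned above can be quantified, for example, as the fraction of peaks from one experiment that overlap at least one peak from another. The following Python sketch is one such measure under the assumption that peaks are given as hypothetical `(chrom, start, end)` tuples with half-open coordinates; it is not the procedure used to produce the figures.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), end exclusive


def overlap_fraction(peaks_a: Sequence[Interval], peaks_b: Sequence[Interval]) -> float:
    """Fraction of peaks in peaks_a that overlap at least one peak in peaks_b."""
    if not peaks_a:
        return 0.0
    # Group the second peak list by chromosome to limit the comparisons.
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))
    shared = 0
    for chrom, start, end in peaks_a:
        candidates = by_chrom.get(chrom, [])
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < b_end and b_start < end for b_start, b_end in candidates):
            shared += 1
    return shared / len(peaks_a)


# Toy replicate peak lists for illustration only.
replicate_1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
replicate_2 = [("chr1", 150, 250), ("chr2", 300, 400)]
reproducible = overlap_fraction(replicate_1, replicate_2)
```

The measure is asymmetric, so it is usually worth reporting it in both directions, and restricting it to the top-ranked peaks shows how reproducibility changes with peak strength.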
With the growing importance of the technology, rigorous computational approaches that transform the large datasets into biological knowledge are required to truly leverage the potential of these data. Rather than introducing a number of closely related algorithms, it is critical to objectively evaluate their performance, both to provide practical guidance to the researchers analysing their data and to help the developers of the algorithms evaluate their new ideas. An important future direction is also to effectively integrate ChIP-seq data with other types of datasets, such as those generated in RNA interference (siRNA) experiments, to improve the detection of target genes.