In this article we have described a benchmark for testing methods that predict transcription factor binding sites. Our positive set of binding sites is based on ChIP-seq data and computationally predicted ChIP-seq peak regions. Although ChIP-seq is considered state-of-the-art technology for mapping transcription factor binding sites, there are at least four concerns in using such data to create a fair and unbiased benchmark. First, ChIP-seq is cell-context specific, whereas motif detection is not. Which of its potential binding sites a transcription factor actually binds depends on the state of the cells, whereas computational prediction based on sequence motifs does not have this kind of bias. We assume that any bias due to the cell context of the ChIP-seq peak regions has the same effect on the performance of all methods tested and only works to reduce the methods' overall performance. Based on our tests with ChIP-seq data from two different cell lines, this assumption seems to hold.
Second, using ChIP-seq data means that we cannot distinguish between direct and indirect binding. Because a transcription factor can bind via cofactors and without a sequence-specific motif, this indirect binding can introduce false positive peaks that result in more false negatives among the predicted sites of all methods.
Third, a major concern is the quality and correctness of the peak regions. We use ChIP-seq data from the highly standardized ENCODE project [16], so we expect minimal noise in the source data due to differences in experimental procedures between the cell line datasets. Also, our peak detection method has been shown to be highly accurate when tested against other common methods of peak detection [15]. As described in Methods, the set of derived binding sites is not necessarily complete, but is thought to represent the sites with the highest affinity for the transcription factor and should therefore be correlated with TF sequence motifs. In the benchmarks, we removed from consideration any regions of lesser affinity that are predicted to be peaks by MACS or SISSR alone, but that are not called as peaks with our stricter meta-approach. Given that we found similar relative performance between methods when using data from different cell lines, we believe the benchmark gives a fair ranking of the methods. For now, ChIP-seq is probably the best technique available for genome-wide mapping of transcription factor binding sites in mammals.
Fourth, an issue that complicates performance comparison, and that also explains some of the performance difference between the methods tested, is that many PWM models obtain their maximum score so frequently that it becomes impossible to sort the relatively large predicted regions according to score. In our benchmark, we take a conservative approach when calculating the ROC curve and add all negatives before adding positives when scores are equal. This favors the conservation-based methods, whose scoring depends on several genomes and which therefore achieve maximum scores less often and give more fine-grained predictions. PWM scanning, by comparison, is penalized more, especially on the shorter motifs. This can perhaps to some degree explain why conservation-based methods are so much better relative to PWM scanning on the promoter benchmark than they are on the site benchmark.
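The effect of the conservative tie handling can be illustrated with a minimal sketch. This is not the benchmark's own code: the function names `roc_points` and `auc`, the `negatives_first` switch, and the toy scores are assumptions made for illustration only. When scores are equal, negatives are ranked ahead of positives, so a method that assigns the same (for example, maximum) score to many regions gains no credit for the positives inside that tied group until all tied negatives have been counted.

```python
def roc_points(scores, labels, negatives_first=True):
    """Return ROC points as (FPR, TPR) tuples, one per ranked item.

    Items are ranked by descending score. Within a group of equal
    scores, negatives are placed before positives when
    negatives_first is True (the conservative choice described in
    the text); otherwise positives are placed first.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    def sort_key(item):
        score, is_pos = item
        # False sorts before True, so this flag decides which class
        # comes first inside a tied group.
        tie_rank = bool(is_pos) if negatives_first else not is_pos
        return (-score, tie_rank)

    ranked = sorted(zip(scores, labels), key=sort_key)

    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a ROC curve given as (FPR, TPR) points (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): three regions share the maximum score 3.
scores = [3, 3, 3, 2, 1]
labels = [1, 0, 1, 0, 1]  # 1 = bound site, 0 = background

conservative = auc(roc_points(scores, labels, negatives_first=True))
optimistic = auc(roc_points(scores, labels, negatives_first=False))
print(conservative, optimistic)
```

On this toy input the conservative ordering yields an area of 1/3, while placing positives first yields 2/3. The gap comes entirely from the tied top scores, which is why a scoring scheme that saturates often is penalized more than one that produces fine-grained scores.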
Another likely reason for the superiority of the conservation-based methods on the promoter benchmark, as compared to the site benchmark, concerns the peak regions themselves. The promoter peaks are higher than the non-promoter peaks (on average 2.7 times higher, p-value 0.129 on a one-sided Wilcoxon signed-rank test across TFs), and importantly, the promoter peaks have more conserved sequence as measured by phyloP score (p-value

). We therefore expect the motifs to be better conserved in the promoter peaks as well.
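For readers unfamiliar with the test named above, the following is a minimal sketch of an exact one-sided Wilcoxon signed-rank test on paired values. It assumes one pair per transcription factor (for instance, a promoter and a non-promoter summary value); that pairing, the function name `wilcoxon_signed_rank_greater`, and the example numbers are illustrative assumptions, not the data or code behind the p-values reported here. Zero differences are dropped and tied absolute differences receive average ranks, which is one common convention.

```python
def wilcoxon_signed_rank_greater(x, y):
    """Exact one-sided Wilcoxon signed-rank test for paired samples.

    Tests the alternative that values in x tend to exceed their
    paired values in y. Returns (w_plus, p_value), where w_plus is
    the sum of ranks of the positive differences.
    """
    # Drop pairs with zero difference (a common convention).
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Exact null distribution: each rank is positive with probability 1/2.
    # Doubled ranks are used so that averaged (half-integer) ranks
    # become integer dictionary keys.
    counts = {0: 1}
    for r in ranks:
        step = int(round(2 * r))
        updated = {}
        for total, c in counts.items():
            updated[total] = updated.get(total, 0) + c
            updated[total + step] = updated.get(total + step, 0) + c
        counts = updated

    target = int(round(2 * w_plus))
    tail = sum(c for total, c in counts.items() if total >= target)
    return w_plus, tail / float(2 ** n)


# Hypothetical per-TF values, for illustration only.
promoter = [5.0, 6.0, 7.0, 8.0]
non_promoter = [1.0, 1.0, 1.0, 1.0]
print(wilcoxon_signed_rank_greater(promoter, non_promoter))
```

With only a handful of pairs the smallest attainable p-value is 1/2^n, so a test across a small number of transcription factors has limited power even when the average difference is large.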
In sum, we have created comprehensive benchmarks for methods that predict the location of transcription factor binding sites and have used them to evaluate the effects of using different motif representations and of using comparative genomics in predictions. We found that the methods that use conservation generally achieve better performance than methods that only use a single genome as input, especially on high-affinity binding sites. For good, information-rich motifs, however, it might not be necessary or even beneficial to use conservation to predict binding sites.
The benchmarking has shown that the methods for TFBS prediction can and should be improved. As more genomes are made available, comparative genomics approaches, such as the branch length methods and phylogenetic shadowing [27], can be very valuable for improving TFBS prediction. However, given the relatively small performance differences between elaborate and simpler conservation methods in our study, it is likely that new methods could also benefit from integrating more biological data to improve accuracy [28]. We also suspect that the full benefit of more elaborate motif models will be seen as more binding site sequences are made available and incorporated into the motifs.