ChIP-Seq is rapidly becoming the approach of choice for genome-wide discovery of protein-DNA interactions, generating a need for robust and transparent analytical methodologies that leverage its inherent strengths. We developed QuEST to meet this need and used it as part of a workflow that is effective at producing a high-confidence list of specific and active TFBSs.
The high resolution of QuEST peak calls, evident for each of the diverse transcription factors we analyzed, is perhaps the most noteworthy methodological aspect of our results. For example, 89% of peaks that contained a significantly matching canonical TFBS motif in the NRSF polyclonal data were within 25 bp of the motif center, and 56% were within 10 bp. QuEST thereby brings within reach an important goal in annotative functional genomics, which is to identify at high confidence the precise locations at which DNA binding proteins interact with the genome.
One feature that merits some discussion is the score QuEST generates for each peak, according to which the peaks are ranked. The score is directly proportional to the amount of tag enrichment in the set of DNA fragments that yielded sequences. Thus, a peak with a score of 50 is due to a TFBS that was twice as abundant in the DNA sample as a TFBS with a peak score of 25. While both scores may be above the reporting cutoff chosen (by the desired FDR), and are therefore considered real, there is twice the support for (and hence the confidence in) the stronger peak.
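The ranking and the proportional reading of scores can be illustrated with a minimal sketch. The `Peak` record, the helper names, and the cutoff of 20 are hypothetical and chosen only for illustration; they are not part of the QuEST software or its output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    chrom: str
    position: int
    score: float  # assumed proportional to tag enrichment


def rank_peaks(peaks, threshold):
    """Keep peaks scoring at or above the cutoff, strongest first."""
    kept = [p for p in peaks if p.score >= threshold]
    return sorted(kept, key=lambda p: p.score, reverse=True)


def relative_support(peak, reference):
    """Fold difference in enrichment implied by two peak scores."""
    return peak.score / reference.score


# Illustrative values only; the threshold of 20 is an arbitrary example cutoff.
peaks = [
    Peak("chr1", 1000, 25.0),
    Peak("chr2", 5000, 50.0),
    Peak("chr3", 700, 8.0),
]
ranked = rank_peaks(peaks, threshold=20.0)
fold = relative_support(ranked[0], ranked[1])
```

Here both retained peaks pass the cutoff, and `fold` expresses how much more support the top-ranked peak has than the second.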
One potential drawback of QuEST is that it does not convert peak scores into definitive P-values. Instead, the stringency of peak calls is determined by the score threshold at which the peaks are reported, and the FDR is calculated for this threshold. Users can either use the default threshold or specify their own and assess the stringency through the FDR.
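The relationship between a score threshold and its FDR can be sketched as follows. This is a simplified, control-based estimate under stated assumptions: the FDR at a threshold is taken as the number of peaks called in a control sample divided by the number called in the ChIP sample, with both samples assumed to be of comparable sequencing depth. The function names and score values are hypothetical, and the sketch is not necessarily the exact procedure QuEST implements.

```python
def count_above(scores, threshold):
    """Number of peak scores at or above the threshold."""
    return sum(1 for s in scores if s >= threshold)


def empirical_fdr(chip_scores, control_scores, threshold):
    """Control-based FDR estimate at one score threshold (assumed definition)."""
    called = count_above(chip_scores, threshold)
    if called == 0:
        return 0.0  # nothing reported, so no false discoveries
    false_calls = count_above(control_scores, threshold)
    return min(1.0, false_calls / called)


def lowest_threshold_for_fdr(chip_scores, control_scores, target_fdr):
    """Smallest observed ChIP score whose estimated FDR meets the target."""
    for t in sorted(set(chip_scores)):
        if empirical_fdr(chip_scores, control_scores, t) <= target_fdr:
            return t
    return None


# Illustrative scores only.
chip = [60, 50, 40, 30, 25, 12, 9, 8]
control = [11, 9, 7, 5]
fdr_at_8 = empirical_fdr(chip, control, 8)
cutoff = lowest_threshold_for_fdr(chip, control, target_fdr=0.1)
```

A user who supplies a custom threshold would evaluate `empirical_fdr` at that value, whereas a user who starts from a desired FDR would search for a threshold as in `lowest_threshold_for_fdr`. Because the estimate is not guaranteed to decrease monotonically with the threshold, the search scans the candidate thresholds in order rather than bisecting.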
Model-free analysis as implemented in QuEST may be considered less powerful than approaches that rely on an explicit model of the ChIP-Seq data. However, such explicit modeling will likely remain elusive in the near future because of the many experimental and biological factors that influence the enrichment signal ultimately detected by ChIP-Seq. Some part of the enrichment signal ought to reflect occupancy by the transcription factor, but confounding factors such as antibody specificity, epitope accessibility, and susceptibility of TFBS-adjacent DNA to shearing will be difficult to model explicitly. Furthermore, the downstream manipulations necessary for library building, especially library amplification and sequencing, introduce additional biases into the enrichment signal. Together, these factors increase the variance of signal strength across binding sites and complicate the detection of weak binding signals. Application of QuEST or similar approaches will enhance our empirical understanding of ChIP-Seq data over the course of the next few years.