In this study, we presented SISSRs, a novel algorithm for precise identification of binding sites from short reads generated from ChIP-Seq experiments. We used SISSRs to identify 26 814, 5813 and 73 956 binding sites for CTCF, NRSF and STAT1 proteins, respectively. Binding sites identified by SISSRs are of high resolution, i.e. the identified sites are within a few tens of base pairs of the center of the canonical motif (A). For example, >90% of CTCF sites were within 32 bp of the motif center. The resolution of SISSRs-identified binding sites is at least an order of magnitude higher than that of sites identified by previous approaches (3–5). While SISSRs identifies the center of the binding site, previous approaches identify binding ‘regions’ ranging from a few hundred to a few thousand base pairs in length (Supplementary Figure 3), which may contain more than one binding site. This could be one of the reasons that the binding regions identified by these approaches are of low resolution (Supplementary Figure 4A) and that the number of canonical motifs within these binding regions is relatively large (Supplementary Figure 4B).
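The resolution figure quoted above (the share of sites lying within a given distance of the motif center) can be illustrated with a minimal sketch. The function name `fraction_within`, the single-chromosome simplification and the coordinates are assumptions made for illustration; this is not the code used in the study.

```python
import bisect

def fraction_within(site_positions, motif_centers, max_dist):
    """Fraction of sites whose nearest motif center is within max_dist bp.

    Assumes all coordinates lie on one chromosome (a simplification).
    """
    centers = sorted(motif_centers)
    if not site_positions or not centers:
        return 0.0
    hits = 0
    for pos in site_positions:
        i = bisect.bisect_left(centers, pos)
        # Only the neighbors on either side of pos can be the nearest center.
        neighbors = centers[max(i - 1, 0):i + 1]
        if min(abs(pos - c) for c in neighbors) <= max_dist:
            hits += 1
    return hits / len(site_positions)

# Made-up coordinates, for illustration only.
sites = [100, 250, 900]
motifs = [110, 300, 1000]
print(fraction_within(sites, motifs, max_dist=32))
```

Evaluating the same function over a range of `max_dist` values gives a cumulative resolution curve for a set of predicted sites.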
SISSRs’ ability to identify binding sites with high resolution helped it achieve unprecedented sensitivity and specificity, as evidenced by its ability to identify 32–299% more binding sites than previous approaches did using the same dataset (3–5). We found that 82, 68 and 92% of CTCF, NRSF and STAT1 binding regions reported by previous approaches overlap with one or more binding sites identified in this study for the respective proteins. There are two possible reasons, not mutually exclusive, why SISSRs did not recover all binding sites reported by previous studies. First, SISSRs was used with the option that requires at least two directional tags on either side of the binding site (), a more stringent criterion than that used by previous approaches, which did not consider tag direction and simply counted the number of tags mapped to a region. Although SISSRs provides an option to identify binding sites with corresponding tags mapped to only one strand (B), we did not use this option. Using this option and relaxing other constraints could improve the percentages listed above. Second, some of the sites identified by previous approaches may be false positives, which we do not expect SISSRs to identify.
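The directional-tag requirement described above can be sketched as a simple filter. The sketch assumes that sense-strand tags are expected upstream of a site and antisense-strand tags downstream; the function name, the `flank` search distance of 200 bp and the tag representation are hypothetical and are not taken from the SISSRs implementation.

```python
def passes_directional_filter(site, tags, flank=200, min_tags=2,
                              allow_one_sided=False):
    """Check whether a candidate site has enough directional support.

    tags is a list of (position, strand) pairs with strand '+' or '-'.
    flank is a hypothetical search distance in bp, not a SISSRs default.
    """
    sense_left = sum(1 for pos, strand in tags
                     if strand == "+" and site - flank <= pos <= site)
    antisense_right = sum(1 for pos, strand in tags
                          if strand == "-" and site <= pos <= site + flank)
    if allow_one_sided:
        # Relaxed mode: support from a single strand is sufficient.
        return sense_left >= min_tags or antisense_right >= min_tags
    # Strict mode: both sides must reach the minimum tag count.
    return sense_left >= min_tags and antisense_right >= min_tags

# Made-up tags around a candidate site at position 100.
two_sided = [(90, "+"), (95, "+"), (120, "-"), (130, "-")]
one_sided = [(90, "+"), (95, "+")]
print(passes_directional_filter(100, two_sided))
print(passes_directional_filter(100, one_sided))
print(passes_directional_filter(100, one_sided, allow_one_sided=True))
```

A method that only counts tags in a region would accept both toy examples, which is one way the stricter criterion can lead to fewer recovered regions.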
SISSRs is highly accurate, as is evident from the fact that an overwhelming majority of the identified sites contained the previously established consensus binding motifs. For example, 96% and 92% of all CTCF and NRSF sites, respectively, contained the consensus binding motifs. This immediately raises questions about the sites that do not contain the consensus binding motif. Are these false positives? While it is entirely possible that sites without the canonical motif are false positives, one should not discount the possibility that the protein of interest bound to the DNA indirectly via another protein (C), which may be hard to distinguish from direct DNA binding. One also needs to keep in mind that the mere presence of a consensus motif in the predicted region does not necessarily imply that the protein of interest binds directly at this site, unless it can be determined that the binding is more or less independent of other factors. This would mean that the accuracy of the identified sites could be lower than that claimed above. Such a shortfall would not reflect the accuracy of SISSRs; rather, it would reflect a limitation of the ChIP technology, which cannot distinguish between direct and indirect DNA binding.
SISSRs is robust, yet flexible enough to let the user control for factors such as antibody specificity and sequencing errors, which could affect the quality of the generated data and thus the accuracy and resolution of the identified binding sites. This is a very useful attribute because not all ChIP experiments generate high-quality data every time, i.e. the background noise (non-specific reads) usually varies between ChIP experiments. Non-specific reads, which may be due to antibody non-specificity and/or sequencing errors, can be controlled for by adjusting the size of the scanning window. A larger window size reduces the impact of non-specific reads, and thus the number of false positives, at the cost of resolution, whereas a smaller window size provides increased resolution but also increases the number of false positives (A). This feature of SISSRs is especially useful when one needs to salvage information from low-quality data. SISSRs also allows users to submit their own negative control dataset (such as IgG) to be used as background noise, in place of the default random model.
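The window-size trade-off can be illustrated with a toy scan. The sketch assumes, as a simplification not stated in this section, that a candidate site is called where the net tag count per window (sense minus antisense) switches from positive to negative between adjacent windows. The function names, the `min_abs` threshold and the tag coordinates are hypothetical.

```python
def net_counts(tags, window, start, end):
    """Net tag count (sense minus antisense) for consecutive windows."""
    n_windows = (end - start + window - 1) // window
    counts = [0] * n_windows
    for pos, strand in tags:
        if start <= pos < end:
            counts[(pos - start) // window] += 1 if strand == "+" else -1
    return counts

def candidate_sites(tags, window, start, end, min_abs=1):
    """Window boundaries where the net count flips from positive to negative."""
    counts = net_counts(tags, window, start, end)
    sites = []
    for i in range(len(counts) - 1):
        if counts[i] >= min_abs and counts[i + 1] <= -min_abs:
            sites.append(start + (i + 1) * window)
    return sites

# Made-up data: a well-supported site near 500 plus one noisy tag pair near 220.
tags = ([(p, "+") for p in (450, 460, 470, 480, 490)]
        + [(p, "-") for p in (510, 520, 530, 540, 550)]
        + [(215, "+"), (225, "-")])
print(candidate_sites(tags, window=20, start=0, end=1000))
print(candidate_sites(tags, window=100, start=0, end=1000))
```

In this toy case the 20-bp window also reports the noisy pair, while the 100-bp window absorbs it; the price is that a site can only be placed on a 100-bp window boundary.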
SISSRs provides an option to identify binding sites with tags mapped to only the sense or the antisense strand (B). This situation arises when tags cannot be mapped to certain regions of the genome that contain repetitive elements. Since a read aligning to a repetitive element cannot be mapped to a unique genomic location, such tags are usually excluded from further consideration, and as a result certain genomic regions enriched with repetitive elements are left unmapped. SISSRs employs a simple procedure to identify binding sites with tags mapped to just one side of the site (see Methods section). Based on our analysis of many DNA-binding proteins, we found that an additional ~1–2% of binding sites could be identified by selecting this option (this option was not used to identify the sites reported in this study). SISSRs also provides an option to mask out reads that fall within certain regions of the genome. This is especially useful if one needs to ignore tags that fall within, say, satellite repeat regions or regions close to the centromere. Such regions are suspected to contain a disproportionately large number of mapped reads, which could be due to amplification biases or to incorrect mapping of reads with one or two mismatches to regions that have high sequence similarity with repetitive regions usually masked out during mapping. It is therefore sometimes necessary to ignore reads mapped to these regions.
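Masking reads that fall within user-specified regions can be sketched as a plain interval filter. The tuple layouts, the half-open interval convention and the function name `mask_reads` are assumptions for illustration and do not describe the SISSRs input format.

```python
def mask_reads(reads, masked_regions):
    """Drop reads whose position falls inside any masked region.

    reads: list of (chrom, position, strand) tuples.
    masked_regions: list of (chrom, start, end) tuples, half-open [start, end).
    """
    regions_by_chrom = {}
    for chrom, start, end in masked_regions:
        regions_by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, pos, strand in reads:
        regions = regions_by_chrom.get(chrom, [])
        if any(start <= pos < end for start, end in regions):
            continue  # Read lies in a masked region; ignore it.
        kept.append((chrom, pos, strand))
    return kept

# Made-up reads and a made-up masked region.
reads = [("chr1", 150, "+"), ("chr1", 5000, "-"), ("chr2", 150, "+")]
masked = [("chr1", 100, 200)]
print(mask_reads(reads, masked))
```

The linear scan over regions is adequate for a handful of masked intervals; a sorted interval index would be a better fit for genome-wide repeat annotations.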
Our observation that the enrichment of tags at binding sites follows a power-law distribution raised an immediate question as to whether the tag density at an identified binding site is an indicator of the stability or affinity of the protein–DNA interaction. Since stable protein–DNA interactions lead to a prolonged half-life of the protein–DNA complex, and the corresponding fragments are likely to be enriched in the ChIP sample, it is reasonable to expect high tag density at stable protein–DNA binding sites. The stability of the protein–DNA complex could depend on many factors, such as how accessible the binding site is or how similar the binding site is to the canonical site. We could not assess the former possibility as it is outside the scope of this study. However, we observed a good correlation between the tag density and the information content of NRSF and CTCF binding sites, suggesting that tag density is a good indicator of the stability of protein–DNA binding. We have also identified the core residues within the NRSF and CTCF binding sites that are critical for stronger DNA binding.
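As a reminder of what information content measures, the sketch below computes the standard per-position information content (in bits) of a set of aligned DNA sites, assuming a uniform background and no pseudocounts. It is illustrative only: the exact scoring used in this study is not described in this section, and the aligned sequences are made up.

```python
import math

def position_information_content(aligned_sites):
    """Per-position information content (bits) of equal-length DNA sites.

    Uses IC = 2 + sum_b p_b * log2(p_b) with a uniform background and
    no pseudocounts (both are simplifying assumptions).
    """
    length = len(aligned_sites[0])
    n_sites = len(aligned_sites)
    ic = []
    for j in range(length):
        column = [site[j] for site in aligned_sites]
        total = 2.0
        for base in "ACGT":
            p = column.count(base) / n_sites
            if p > 0:
                total += p * math.log2(p)
        ic.append(total)
    return ic

# Made-up aligned sites: three invariant positions and one fully variable one.
sites = ["ACGT", "ACGA", "ACGC", "ACGG"]
print(position_information_content(sites))
```

Positions with values near 2 bits are strongly conserved, while values near 0 indicate no base preference.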
In conclusion, although recent advances in sequencing technology provide us with the ability to map protein–DNA interactions on a genome scale, the development of algorithms to identify the exact binding sites from short reads generated by ultra-high-throughput sequencing techniques is still in its infancy. We believe that SISSRs will serve as a useful tool for precise identification of binding sites from millions of ChIP-Seq reads. Our experiments with SISSRs on ChIP-Seq data for three well-characterized DNA-binding proteins revealed interesting insights, which we believe will serve as guidance for designing ChIP-Seq experiments. While a higher number of reads may increase the sensitivity () and resolution (), it may not necessarily translate to accuracy (), as accuracy may depend on other factors such as antibody specificity and the stability of the protein–DNA complex. The length of the DNA fragments, which has a direct impact on the resolution of the identified binding sites, should preferably be small (~120–150 bp) if high-resolution binding sites are desired.