Analysis of protein-DNA interactions using high-throughput short sequencing poses a number of novel computational challenges. We show that many aspects of the processing pipeline can be specifically tailored to improve detection of binding positions.
The protein binding positions exhibit a strand-specific pattern of tag occurrences. We illustrate that a genome-wide signature of such a pattern can be obtained with strand cross-correlation of tag density, providing a quick assessment of dataset quality and binding characteristics. The proposed alignment procedure also relies on this signature to determine the range of alignment quality that is informative about the binding positions. In our implementation, we have used a simple classification of tag alignment quality, based on the number of nucleotide mismatches. The same procedure can be applied to more elaborate measures of alignment quality, such as those incorporating confidence in specific base calls12
The examination of the input sequencing clearly indicates that experimental assessment of the background tag distribution is necessary for accurate evaluation of the ChIP-seq data. The background distribution is far from uniform and, in some cases, shows tag density patterns similar to those expected from true binding positions. We demonstrate that the knowledge of such distribution is instrumental for accurately assessing and reducing rates of false positive predictions. As additional datasets become available, it will be important to analyze the degree to which tag profiles of input or no-antibody measurements differ between independent experiments.
Comparison of different binding prediction algorithms shows that even though several methods can reach optimal sensitivity, there is a considerable variation in the accuracy of the identified binding positions. While the MTC method provides more accurate positions for CTCF and STAT1 binding, the WTD method is better at identifying precise positions of NRSF binding. The difference can be attributed to the consideration of tag patterns immediately near the center of the binding pattern, which show qualitative differences between NRSF and CTCF/STAT1 datasets (Supplementary Figure 12
). Since the NRSF binding tag pattern is more consistent with the basic expectations, we recommend using WTD method in the cases when the tag pattern cannot be examined beforehand on a set of expected binding positions. It remains to be seen, however, which tag pattern will be typical of other experiments and whether both patterns can be efficiently handled by a single method.
The ability to evaluate and predict the sequencing depth requirement is an important aspect of ChIP-seq studies. Our analyses demonstrate that none of the three examined datasets definitively reach a point of saturation at which the set of determined binding positions stabilizes. The binding positions exhibit very wide range of enrichment ratios so that additional sequencing reveals increasing number of weaker binding sites. This bears some resemblance to other genomics studies. In genome-wide association studies, for instance, increasing the sample size allows one to find more and more loci with smaller LOD scores; in gene expression studies, it leads one to find more and more genes with a statistically significant but smaller fold-change. In practical terms, this lack of saturation point has profound implications in study design. It suggests that it would be difficult to define a “sufficient” depth of sequencing and that other criteria must be specified.
We therefore propose that the sequencing depth requirements should be evaluated with respect to a specific target enrichment ratio of the binding positions. Towards that end, we provide a method to determine the minimal fold enrichment ratio above which the detection of binding positions has been saturated (i.e. stabilized) at a current sequencing depth. We also show that the relationship between saturated fold enrichment and the number of sequenced tags may be extrapolated to estimate the sequencing depth that would be required to reach saturation for lower fold enrichment ratio. It will be important to examine how well such extrapolations describe saturation properties of much larger datasets that are likely to be come available in the near future.
The fold enrichment ratio of a particular binding position may depend on diverse factors, such as binding affinity or efficiency of chromatin extraction. Since its relationship to the functional importance of binding positions is uncertain, the desired fold enrichment ratio target would clearly vary for different experiments. When some functional binding positions are already known for a particular protein, the target enrichment ratio can be chosen based on examination of these positions in the initial sequencing data or with quantitative PCR. If a target enrichment ratio cannot be estimated from other sources, it can be specified relative to the maximum or median enrichment observed in the dataset (e.g. detect binding positions with enrichment 5-fold below the maximum observed enrichment).
As more ChIP-seq datasets are generated, it will be important to analyze additional factors, such as sequencing biases associated with individual sequencing platforms, or stability of ChIP and input tag distributions between replicate experiments. Such data will likely lead to improvements in the binding prediction methods and allow better interpretation of the functional relevance of observed variability in fold enrichment ratios of different binding positions. Finally, it will be important to see whether the described techniques can be adapted for analysis of histone modifications or other widely-distributed chromatin marks that do not fit the models of point binding patterns.