To address the issues just discussed, we developed PscanChIP, based on Pscan, our previous tool for promoter analysis (5). In PscanChIP we introduce different criteria to assess motif overrepresentation and positional bias in the analysis of ChIP-Seq peak regions. Our idea is that motifs actually corresponding to TF binding should first of all be overrepresented in the ChIP-Seq regions with respect to the rest of the genome (that is, the portion of the genome accessible for TF binding), and that the same should hold true for motifs corresponding to other TFs interacting with the ChIP’ed one in a significant number of regions. In addition, matches to a given PWM should not only show a positional bias, but also yield higher scores in the preferred positions; that is, they should fit the matrix better in the positions where they tend to cluster.
PscanChIP takes as input a set of genomic coordinates corresponding to ChIP-Seq peaks, assumes that they are centered on their summits, and considers the 150 bp genomic regions around their centers. The regions are scanned on both strands with the PWMs available in the JASPAR and TRANSFAC databases, and optionally with additional PWMs submitted by users. For each oligo in the input regions the scan returns a score between 0 and 1 (Supplementary Methods), and the best (highest scoring) oligo is selected from each input region. As in the original Pscan, we bypass the need to define matching thresholds for PWMs to predict likely TFBS instances. Instead, for each matrix we compare the mean score of the best-matching oligos in the input regions with expected values defined according to different backgrounds. TFs mostly bind DNA in accessible regions (12), and genome-wide maps of open chromatin can be built through experiments like DNaseI- or FAIRE-Seq (16). The result is a collection of DNA segments that are accessible to regulatory factors and other DNA-interacting molecules. Thus, given a PWM, we compare the matching scores on the input sequences with an expected background value computed from the matching scores in a collection of accessible DNA regions of the same length. In particular, for this task we used the ENCODE regions available at the UCSC Genome Browser database (17), which have been identified through DNaseI Digital Genomic Footprinting (18). The regions differ according to the tissue or cell line studied, and thus yield cell-line-specific expected values. Therefore, users must also select as input the cell line or tissue on which their ChIP experiment was performed, or the closest relative among the available options (e.g. HepG2 for liver cells). If no suitable choice is available, users can select a ‘mixed’ background, a sample of non-overlapping accessible regions built by taking at random a subset of the regions from each of the available cell lines, or a ‘promoter’ background, to be used if the input comes mostly from promoter regions. Each of the background sequence sets (cell-line specific or mixed) is made of ~200 000 regions.
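The scanning step described above can be illustrated with a minimal Python sketch. It is not the PscanChIP implementation: the exact score definition is given in the Supplementary Methods, so the min-max rescaling to [0, 1] used here is an assumption, as are the function names and the representation of a PWM as a list of per-position dictionaries. Sequences are assumed to contain only the uppercase letters A, C, G and T.

```python
from statistics import mean

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def oligo_score(pwm, oligo):
    """Score one oligo against a PWM, rescaled to [0, 1].

    Assumed normalization: (raw - min) / (max - min), where min and max
    are the lowest and highest raw scores attainable with this PWM.
    """
    raw = sum(col[base] for col, base in zip(pwm, oligo))
    lowest = sum(min(col.values()) for col in pwm)
    highest = sum(max(col.values()) for col in pwm)
    return (raw - lowest) / (highest - lowest)


def best_match(pwm, region):
    """Highest oligo score in a region, scanning both strands."""
    width = len(pwm)
    best = 0.0
    for strand in (region, reverse_complement(region)):
        for start in range(len(strand) - width + 1):
            best = max(best, oligo_score(pwm, strand[start:start + width]))
    return best


def mean_best_score(pwm, regions):
    """Mean of the best-match scores over a set of regions."""
    return mean(best_match(pwm, region) for region in regions)
```

Applying `mean_best_score` to the input regions and to a background set of equal-length regions gives the two quantities that are compared, with no matching threshold involved.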
Once the matching scores for a PWM are available in both the input and the background sequence sets, global enrichment can be computed by comparing, with a t-test, the mean and standard deviation of the best-match scores in the input regions with the mean and standard deviation of the best-match scores in the background sequence set, yielding enrichment P-values from a two-tailed t distribution. Global enrichment can be used to identify motifs that are, in general, overrepresented in the regions, with no assumption about their location within the regions or about any positional bias with respect to the peak summits. In any case, sites for the ChIP’ed TF should be the most significantly enriched according to this measure, and PWMs for other ‘co-regulating’ TFs, that is, TFs binding a significant number of the input regions, should also yield low P-values.
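As an illustration of the comparison just described, the following Python sketch computes a t statistic from the means and standard deviations of two sets of best-match scores. Two points are assumptions rather than statements about PscanChIP: the unequal-variance (Welch) form of the statistic, and the use of the standard normal distribution in place of the t distribution for the two-tailed P-value, which is a reasonable approximation only when both samples are large. The function names are hypothetical.

```python
import math
from statistics import mean, stdev


def welch_t(sample_scores, background_scores):
    """t statistic comparing two sets of best-match scores.

    Assumption: unequal-variance (Welch) form; the text specifies only
    that means and standard deviations are compared with a t-test.
    """
    se_sample = stdev(sample_scores) ** 2 / len(sample_scores)
    se_background = stdev(background_scores) ** 2 / len(background_scores)
    difference = mean(sample_scores) - mean(background_scores)
    return difference / math.sqrt(se_sample + se_background)


def two_tailed_p(t):
    """Two-tailed P-value under a normal approximation to the t distribution.

    Assumption: both samples are large, so the t distribution is close
    to the standard normal distribution.
    """
    return math.erfc(abs(t) / math.sqrt(2))
```

A positive statistic with a low P-value indicates that the best matches in the input regions score higher on average than those in the background set.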
PscanChIP also evaluates local enrichment by comparing, with a t-test, the mean and standard deviation of the scores of the best-matching oligos in the input regions with the mean and standard deviation of the best matches in the genomic regions flanking the input ones. Local enrichment can be used to identify motifs with a significant preference for binding within the regions, that is, the motif corresponding to the ChIP’ed TF, as well as motifs for other TFs likely to interact with it and to bind in its neighborhood. It should be noted that, with respect to similar methods, local enrichment assesses not only positional bias but also how well the matches fit the profile used, without establishing a pre-determined matching threshold. Finally, PscanChIP evaluates motif positional bias within the input regions by splitting them into overlapping sub-regions of 10 bp. The mean and standard deviation of the best-match score within each sub-region are compared with the mean and standard deviation across all the sub-regions, again with a t-test. Once again, motifs that show a positional bias and, more importantly, have their best matches associated with a given sub-region can be singled out. The bias can be for the center (as is usually the case for the sites of the ChIP’ed TF), but it can be identified for any other sub-region. Further details on the calculations performed can be found in the Supplementary Methods.
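The positional bias calculation can be sketched in Python as follows. This is an illustration under stated assumptions, not the published procedure: the 5 bp step between overlapping windows, the assignment of each oligo to a window by its start position, and the unequal-variance form of the t statistic are all assumptions, and the function names are hypothetical. The input is one list of oligo scores per region, indexed by start position, with all regions of equal length.

```python
import math
from statistics import mean, stdev


def window_best_scores(position_scores, width=10, step=5):
    """Best score among the oligos starting in each overlapping window."""
    starts = range(0, len(position_scores) - width + 1, step)
    return [max(position_scores[s:s + width]) for s in starts]


def positional_bias(regions_scores, width=10, step=5):
    """One t statistic per window: that window versus all windows pooled.

    Assumptions: all regions have the same length, the step is 5 bp, and
    the statistic uses the unequal-variance form.
    """
    per_region = [window_best_scores(r, width, step) for r in regions_scores]
    per_window = list(zip(*per_region))  # regroup the scores by window
    pooled = [score for window in per_window for score in window]
    pooled_mean, pooled_sd = mean(pooled), stdev(pooled)
    statistics_per_window = []
    for window in per_window:
        se = math.sqrt(stdev(window) ** 2 / len(window)
                       + pooled_sd ** 2 / len(pooled))
        statistics_per_window.append((mean(window) - pooled_mean) / se)
    return statistics_per_window
```

Windows with a large positive statistic are those in which the best matches both cluster and score higher than elsewhere in the regions.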
These different measures can detect different types of overrepresented motifs. The PWM corresponding to the ChIP’ed TF, or to the co-factor(s) recruiting it, should be the highest ranking (lowest P-values) with respect to all three measures. In ChIP-Seq experiments with low resolution (e.g. with low quantities of IP’ed DNA or a weakly specific antibody) and without sharp peak summits, the global P-value should nevertheless be low. PWMs corresponding to possible interactors (or competitors) of the ChIP’ed TF (often, but not necessarily always, binding DNA in its neighborhood) should have low global and local P-values, but rank after the ChIP’ed TF. Positional bias could also appear for these, but not as a rule. ‘Co-regulators’ (TFs that often bind the same promoters/enhancers as the ChIP’ed TF) tend to regulate the same genes but do not have to bind cooperatively with it; they should have low global P-values, but not necessarily low local P-values, and no positional bias. On the other hand, PWMs with low local P-values only (better if substantiated by positional bias) might correspond to TFs interacting with the ChIP’ed one only in a limited subset of the input.