All programs within comb-p
expect files in simple BED format (Kent et al., 2002
) sorted by chromosome and start. Additional columns contain the P
-value(s) of interest based on the study design and generated from any software or statistical test.
first calculates the correlation at varying distance lags (here: auto-correlation—ACF). While Kechris’ (Kechris et al., 2010
) and many ACF implementations rely on fixed offsets of adjacent probes, comb-p
accepts a maximum distance and a single offset that segments that distance into intervals. If a given probe is 40 bases away from two (or more) other probes, then it will appear more than once in the bin containing probe pairs separated by <40 bases. This is useful in cases where the probes generating the P
-values are unevenly spaced, as is common with methylation arrays.
Once the ACF has been calculated, it can be used to perform the Stouffer–Liptak–Kechris correction (slk) where each P-value is adjusted according to adjacent P-values as weighted according to the ACF. The resulting BED file has an additional column containing the corrected P-value. A given P-value will be pulled lower if its neighbors also have low P-values (and little auto-correlation) and likely remain insignificant if the neighboring P-values are also high.
A q-value score based on the Benjamini–Hochberg false discovery (FDR) correction or on a null model from shuffled data may then be calculated. The peak-finding algorithm can then be used to find enrichment regions or peaks on the FDR q-value, the slk-corrected P-value or on the original P-value.
Once the regions are identified, a P
-value for each region can be assigned using the Stouffer–Liptak correction. Note that the first ACF calculated above was only out to the distance specified and may not extend far enough to cover the longer regions. Therefore, this step first calculates the ACF out to a distance equal to the length of the longest region. The corrected P
-value for each region is then calculated using the original P
-values that fall within the peak and the full ACF. Because we use the original, uncorrected P
-values in the calculation of significance for the peak, we side-step issues of altering the distribution in both the slk
and FDR steps. The region_p
program reports the slk
-value and a one-step Šidák (1967)
multiple-testing correction. For a given region, the number of possible tests in the Šidák correction is the total bases covered by all input probes divided by the size of the given region. In short, we define the extents of the region using the FDR q
-values, then we define the significance of the regions using the SLK correction of the original P
comb-p is implemented as a single command-line application that dispatches to multiple independent sub-modules and as a set of python packages that may be used programatically. Where possible, computationally intensive algorithms are parallelized automatically—for the ACF and the slk steps in the analyses described below—this results in a speed-up that is nearly linear with the number of cores available. The implementation has been tuned to minimize memory use and computation time.
When run without arguments, the comb-p executable displays the available programs and a short description of each. Although each of these tools may be used independently, the progression of acf, slk, fdr, peaks, region_p works well for the examples presented. These steps can be run successively with the single command: pipeline.