To generate the kernel density estimation, we consider the problem where we are given
n sample points along a chromosome of length
L. Our goal is to locate regions with high sample density. If we assume the points {
xi}
i=1n are sampled as

, then an estimate of this probability density function (pdf) will provide a significance measure for high density regions. We use the univariate kernel density estimation (kde) to infer the pdf, written as
where
b is a bandwidth parameter controlling the smoothness of the density estimates, and
K() is a Gaussian kernel function with mean 0 and variance 1. Instead of explicitly setting
b, a user provides a feature length parameter (default=600) which controls the sharpness of the pdf estimate. Larger features will naturally lead to smoother density estimates.
Computing the density at each point in the chromosome using all
n points is computationally expensive and exceeds the precision available to common computing platforms. We therefore compute a default window size
w as a function of the bandwidth parameter
b and the Gaussian kernel such that
We expect that window sizes for typical bandwidth settings will be on the order of a few thousand, significantly less than the many millions of bases available.
We also compute a threshold level for evaluating the significance of density regions using the following background model:
- Compute an average number of features for window w as nw=nw/L.
- Calculate the kernel density at a fixed point, xc, within the window given a random and uniform distribution of the nw features.
- Repeat step 2 k times to obtain a distribution of the kernel density estimates for xc. For large k the kdes become normally distributed.
- The threshold is s SDs above the mean of this normal distribution.
Larger values of
s reduce the false discovery rate and provide a natural statistical interpretation to the veracity of these density regions.
F-Seq takes an input a BED format file (
http://genome.ucsc.edu/FAQ/FAQformat#format1) containing aligned sequence tags. Since calculation of kernel density estimation requires a point measure for each sequence, we use the estimated center of the DNA fragment being sequenced. In many cases, such as from ChIP-seq protocols, the aligned sequence represents only the 5′ end of a longer fragment and therefore should be extended to the average fragment size in the experiment. In the case of DNase-Seq protocols where the 5′ end of the sequence represents the point of enrichment, the alignment should be shortened to 1 bp in length. A perl script has been included to perform this task.
Output files can be created either as a continuous probability wiggle format (
http://genome.ucsc.edu/goldenPath/help/wiggle.html) or as a discrete-scored regions BED format. The discrete regions are those where the continuous probability is above the threshold
s SDs above the background mean. These output files are ready for immediate import into the UCSC Genome Browser (Kent
et al.,
2002) (
http://genome.ucsc.edu).