To generate the kernel density estimation, we consider the problem where we are given

*n* sample points along a chromosome of length

*L*. Our goal is to locate regions with high sample density. If we assume the points {

*x*_{i}}

_{i=1}^{n} are sampled as

, then an estimate of this probability density function (pdf) will provide a significance measure for high density regions. We use the univariate kernel density estimation (kde) to infer the pdf, written as

where

*b* is a bandwidth parameter controlling the smoothness of the density estimates, and

*K*() is a Gaussian kernel function with mean 0 and variance 1. Instead of explicitly setting

*b*, a user provides a feature length parameter (default=600) which controls the sharpness of the pdf estimate. Larger features will naturally lead to smoother density estimates.

Computing the density at each point in the chromosome using all

*n* points is computationally expensive and exceeds the precision available to common computing platforms. We therefore compute a default window size

*w* as a function of the bandwidth parameter

*b* and the Gaussian kernel such that

We expect that window sizes for typical bandwidth settings will be on the order of a few thousand, significantly less than the many millions of bases available.

We also compute a threshold level for evaluating the significance of density regions using the following background model:

- Compute an average number of features for window
*w* as *n*_{w}=*nw*/*L*. - Calculate the kernel density at a fixed point,
*x*_{c}, within the window given a random and uniform distribution of the *n*_{w} features. - Repeat step 2
*k* times to obtain a distribution of the kernel density estimates for *x*_{c}. For large *k* the kdes become normally distributed. - The threshold is
*s* SDs above the mean of this normal distribution.

Larger values of

*s* reduce the false discovery rate and provide a natural statistical interpretation to the veracity of these density regions.

F-Seq takes an input a BED format file (

http://genome.ucsc.edu/FAQ/FAQformat#format1) containing aligned sequence tags. Since calculation of kernel density estimation requires a point measure for each sequence, we use the estimated center of the DNA fragment being sequenced. In many cases, such as from ChIP-seq protocols, the aligned sequence represents only the 5′ end of a longer fragment and therefore should be extended to the average fragment size in the experiment. In the case of DNase-Seq protocols where the 5′ end of the sequence represents the point of enrichment, the alignment should be shortened to 1 bp in length. A perl script has been included to perform this task.

Output files can be created either as a continuous probability wiggle format (

http://genome.ucsc.edu/goldenPath/help/wiggle.html) or as a discrete-scored regions BED format. The discrete regions are those where the continuous probability is above the threshold

*s* SDs above the background mean. These output files are ready for immediate import into the UCSC Genome Browser (Kent

*et al.*,

2002) (

http://genome.ucsc.edu).