By mapping high-throughput sequencing reads to a reference genome, following the strategy outlined by Hesselberth et al., we obtained a set of genomic locations of DNaseI cleavages, called tags. The tag count of a particular position indicates the number of tags occurring at that position. However, not every genomic position is assigned a tag count. Because the end sequencing proceeds in a strand-specific manner, sequencing reads are strand-specific, and a genomic position can therefore be associated with two types of reads, one for each strand. A location is called uniquely mappable on one strand if the reads starting at that location on that particular strand cannot be mapped to any other location in the reference genome on either strand. In this work, we consider only uniquely mappable reads. Moreover, we assign tag counts only to positions that are uniquely mappable on both strands; positions that are not uniquely mappable on both strands are treated as missing data. In addition to tag counts and mappability information, the inputs to our method include two parameters, kmin and kmax, that specify the size range of footprints to be identified. In this study, we search for footprints of 8–30 bp, which covers the usual size range of TFBSs.
Given the inputs described above, DBFP generates as output a list of non-overlapping footprints ranked by a statistical confidence score. We use the q-value (Storey and Tibshirani, 2003) to measure the statistical significance of footprints. The q-value of a footprint is defined as the minimal FDR threshold at which the footprint is deemed significant.
Our approach consists of three phases. First, we build a DBN to identify candidate footprint segments of width from kmin to kmax. Second, we compute a depletion score for every footprint segment. Finally, we repeat the entire procedure on the shuffled input data to generate a list of null scores. These null scores are used to estimate q-values. The approach is described in detail in the following sections.
2.1 A dynamic Bayesian network for footprint detection
Given the genome-wide data of tag counts and mappability, we build a DBN for unsupervised segmentation of the genome into three types of non-overlapping regions as follows. Among these three types, footprint segments are our primary targets.
- Footprint: this type of segment is depleted of tags and has low tag count variance. Each segment's width is constrained to lie between kmin and kmax, inclusive. Moreover, footprint segments are flanked on both sides by segments enriched in tags, because footprints correspond to regions of low accessibility in the midst of regions of high accessibility.
- Background: similar to footprint segments, this type of segment has low tag counts and low tag count variance. However, background segments tend to be wide, with widths constrained to be greater than kmax.
- Hypersensitive: this type of segment is enriched for tags and has relatively high tag count variance.
As an example, Fig. 2a shows our network for two consecutive positions. The whole DBN actually extends over all genomic positions, and each position (except the first and last) is modeled by a copy of the same Bayesian network. The DBN is implemented using the Graphical Models ToolKit (GMTK; Bilmes and Zweig, 2002). The details of this network are described below.
Fig. 2. The DBN for footprint identification. (a) The network for two consecutive positions. Observed variables are represented by shadowed nodes. A dashed edge between two nodes indicates that the parent node is a switching parent. (b) The transition diagram.
TagCount is an observed variable representing the number of tags that occur at each position. In order to reduce the variance of tag counts and the computational complexity of training and decoding, we discretize the original tag counts into 20 bins according to the genome-wide distribution. The bin boundaries are selected to make the bin counts as close to uniform as possible. The hidden variable segment has three states, indicating which segment type each position belongs to. These two variables form the backbone of the DBN, which is also a typical HMM. The relationship between them, i.e. the probability of observing a tag count given a segment type Pr[tagCount|segment], is modeled by a conditional probability table (CPT). The same conditional probability is actually shared between two segment states, footprint and background, because both footprint and background segments have low tag counts and low tag count variance. Given that tagCount has 20 discrete values, the dimensions of the CPT are 2 × 20.
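The quantile-based binning step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Poisson-distributed sample counts and the use of numpy quantile edges are assumptions for demonstration.

```python
# Sketch of tag-count discretization: map raw counts into 20 bins whose
# occupancies are as close to uniform as possible, using genome-wide quantiles.
import numpy as np

def discretize_counts(counts, n_bins=20):
    """Assign each tag count a bin index in [0, n_bins)."""
    # Interior quantile boundaries make bin occupancies roughly uniform;
    # ties among discrete counts mean exact uniformity is not achievable.
    edges = np.quantile(counts, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, counts, side="right")

# Illustrative data only (real input would be genome-wide tag counts).
counts = np.random.poisson(3.0, size=100_000)
bins = discretize_counts(counts)
```

Because tag counts are discrete with many tied values, some bins absorb more than 1/20 of the positions; the quantile edges simply get as close to uniform as the data allow.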
Similar to an HMM, the probability of changing segment states between two positions is defined by a transition matrix, which is summarized by the transition diagram shown in Fig. 2b. Importantly, transitions between footprint and background are not allowed, which ensures that footprint segments are flanked by hypersensitive segments on both sides.
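The structural constraint can be encoded directly as zeros in the transition matrix. A minimal sketch, with hypothetical probabilities (the learned values are not given in the text); because state changes occur only when a transition event fires, the diagonal is zero and durations are handled separately:

```python
# Hypothetical transition matrix over the three segment states.
# Direct footprint <-> background transitions are structurally forbidden,
# so footprints are always flanked by hypersensitive segments.
FOOTPRINT, BACKGROUND, HYPERSENSITIVE = 0, 1, 2

trans = [
    # footprint  background  hypersensitive
    [0.0,        0.0,        1.0],  # footprint  -> hypersensitive only
    [0.0,        0.0,        1.0],  # background -> hypersensitive only
    [0.5,        0.5,        0.0],  # hypersensitive -> footprint or background
]

# Every row is a valid probability distribution.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in trans)
```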
As described earlier, unmappable positions are not assigned a tag count. The DBN treats unmappable positions as missing data. More specifically, the observed binary variable mappable is used as a switching parent of tagCount (Bilmes and Zweig, 2002). At any given position, when mappable = 1, tagCount is the child of and relevant to segment; but when mappable = 0, the edge between tagCount and segment is effectively severed, thereby rendering tagCount irrelevant to segment. Therefore, the segment state of an unmappable position is set to whatever best fits the neighboring mappable positions on both sides.
In an HMM, the state duration usually follows a geometric distribution. For example, given a state s and the probability p of self-looping (i.e. staying in the same state), the probability that state s has duration d is simply (1 − p)p^(d−1). In our model, each of the three segment states similarly follows a geometric duration distribution, but with some additional constraints: footprint has minimal and maximal durations of dmin = kmin and dmax = kmax, respectively; background has a minimal duration of dmin = kmax + 1. All these additional constraints are deterministic. Therefore, the only parameter for the duration distribution of a state is the self-looping probability p of that state.
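The constrained geometric duration model can be sketched as a small function; the function name and default bounds are illustrative, and the truncated distribution is left unnormalized for simplicity:

```python
# Sketch of the constrained geometric duration model. The self-looping
# probability p is the only free parameter; d_min and d_max are the
# deterministic constraints described in the text.
def duration_prob(d, p, d_min=1, d_max=None):
    """Unnormalized probability that a state lasts exactly d positions."""
    if d < d_min or (d_max is not None and d > d_max):
        return 0.0
    # Geometric: stay (d - 1) times with probability p, then leave.
    return (1.0 - p) * p ** (d - 1)

# Footprint durations with k_min = 8 and k_max = 30:
assert duration_prob(7, 0.9, d_min=8, d_max=30) == 0.0   # too short
assert duration_prob(8, 0.9, d_min=8, d_max=30) > 0.0    # allowed
assert duration_prob(31, 0.9, d_min=8, d_max=30) == 0.0  # too long
```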
The duration model is implemented using two hidden variables, changeSeg and counter. First, changeSeg is used as a switching parent of segment: if changeSeg is false, then the segment state at the previous position is copied over to the current position; otherwise, a state transition occurs, and segment is assigned a state different from the previous one according to the transition matrix depicted in Fig. 2b. Second, counter counts the number of consecutive positions that share the same segment state. Note that changeSeg is also a switching parent of counter: when changeSeg is false, the current counter depends on both the current segment and the previous counter; when changeSeg is true, the current counter becomes independent of the previous one, and its initial value is determined only by the current segment. Finally, the value of counter, coupled with the geometric duration distribution for each state, determines the value of changeSeg at the next position.
This DBN is trained in an unsupervised manner by applying the EM algorithm to all of the input data. The training procedure finds the maximum likelihood estimates θ* of the network parameters. Next, given the learned parameters θ* and the input data D, the Viterbi algorithm is used to find the most likely assignment of all hidden variables H in the network over all positions: h* = argmax_h Pr[H = h | D, θ*]. This procedure is called Viterbi decoding. In the end, the footprints returned are the regions where the hidden variable segment is assigned the state footprint.
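A heavily simplified illustration of the decoding step: a log-domain Viterbi over three plain HMM states, ignoring the duration constraints and switching parents of the full DBN (which GMTK handles internally). All probabilities here are illustrative placeholders.

```python
# Simplified Viterbi decoding for a discrete-emission HMM in the log domain.
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Return the most likely state path for an observation sequence."""
    n, k = len(obs), log_init.shape[0]
    dp = np.zeros((n, k))             # best log-prob ending in each state
    back = np.zeros((n, k), dtype=int)  # backpointers
    dp[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, n):
        scores = dp[t - 1][:, None] + log_trans  # k x k candidate moves
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    # Trace back from the best final state.
    path = [int(dp[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The real model decodes over all hidden variables (segment, changeSeg, counter) jointly, but the backbone dynamic program is the same.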
2.2 Depletion scoring
The depletion scoring procedure considers all footprint segments identified by the DBN. The score function is identical to the one employed in our previous work (Hesselberth et al.). Given a footprint segment, we assign a depletion score by comparing the number of tags in the segment with the number of tags in its local background. Because a segment may contain unmappable positions, we define the effective segment size as the segment width minus the total number of unmappable positions within that segment.
The scoring function computes the log of the probability that a segment of effective size a contains x or fewer tags, assuming that the tags are drawn randomly from a uniform background. This background distribution of tags is estimated from a background window of fixed width ω centered on the specified segment. In this study, we set the background width ω to 151 bp. Let b be the effective size of the background window. Then the probability that a randomly placed tag falls within a segment of effective size a is a/b. Given that we observe B tags within the background window, the probability that a segment of effective size a contains at most x tags can be estimated using a binomial distribution:

Pr[X ≤ x | H0] = Σ_{i=0}^{x} C(B, i) (a/b)^i (1 − a/b)^(B−i),

where x is the number of tags observed in the target segment, and H0 is the null hypothesis. We use a local background, instead of a background based upon the entire data set, to estimate the null distribution of tags, because the rate of DNaseI cleavage may depend in part upon local chromatin structure. Hence, a segment containing an intermediate number of tags may get a good score if it is located in a region with a high density of tags, whereas the same segment would get a bad score in a low-density region.
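A minimal sketch of this binomial score using only the standard library; the function name and the example parameter values are illustrative assumptions:

```python
# Depletion score: log-probability of observing at most x tags in a segment
# of effective size a, given B tags in a local background window of
# effective size b, under a binomial null with success probability a/b.
import math

def depletion_score(x, a, B, b):
    """log Pr[X <= x | H0] with X ~ Binomial(B, a / b)."""
    p = a / b
    cdf = sum(math.comb(B, i) * p**i * (1 - p)**(B - i)
              for i in range(x + 1))
    return math.log(cdf)

# A strongly depleted segment (0 tags where ~13 are expected) scores much
# lower, i.e. more negative, than a typical one:
assert depletion_score(0, 10, 200, 151) < depletion_score(15, 10, 200, 151)
```

For large B a numerically stable binomial survival/CDF routine would be preferable; the direct sum suffices to show the definition.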
In conjunction with the primary scoring function, we employ a pre-processing step of rank transformation, which replaces each tag count by its rank in the sorted list of tag counts. We perform this transformation for each background window to reduce the effect of outlier positions associated with a large number of tags. We have observed that the rank transformation yields better scores than using the raw tag counts directly (data not shown).
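The rank transformation can be sketched in a few lines; how the published method breaks ties is not specified in the text, so this version simply breaks ties by position:

```python
# Rank transformation within a background window: replace each tag count by
# its rank in the sorted list of counts, damping the influence of outlier
# positions with very large counts. Ties are broken by position (assumption).
def rank_transform(counts):
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    ranks = [0] * len(counts)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks

# The outlier count 1000 contributes only its rank, not its magnitude:
assert rank_transform([5, 1000, 0, 7]) == [1, 3, 0, 2]
```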
2.3 Estimating q-values
To measure the statistical significance of a list of scored footprint candidates, we estimate a q-value for each of the depletion scores generated by our algorithm.
First, we build an empirical null model by shuffling the genomic positions. Note that the distribution of positional tag counts remains the same after shuffling; i.e. if we observe 56 tags at one position in the real data, then after shuffling we will observe those 56 tags at some other position. In addition, because the DNaseI cleavage rate may vary from one genomic region to another, the shuffling is carried out locally within a target region (e.g. an intergenic region or a gene of the yeast genome). In order to obtain an accurate estimate of the null distribution, we generate 100 shuffled data sets in total.
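The local shuffling step can be sketched as follows; the function name and the toy region are illustrative:

```python
# Local shuffling for the empirical null: permute positional tag counts
# within a single target region (e.g. one intergenic region), preserving
# the count distribution while destroying positional structure.
import random

def shuffle_region(counts, seed=None):
    rng = random.Random(seed)
    shuffled = list(counts)
    rng.shuffle(shuffled)
    return shuffled

region = [0, 0, 56, 1, 3, 0, 12]
null = shuffle_region(region, seed=7)
# Same multiset of counts, e.g. the 56-tag position still exists somewhere:
assert sorted(null) == sorted(region)
```

In the full procedure this would be repeated per region, 100 times genome-wide, to build the null score distribution.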
Second, given our DBN model and the parameters learned from the real data, we use Viterbi decoding to identify footprint segments in each shuffled data set. We then compute depletion scores for those footprint segments as described in Section 2.2. The depletion scores calculated from shuffled data are called null scores. The empirical null distribution is finally derived from all null scores of the 100 shuffled data sets.
Third, we use the empirical null distribution to convert the depletion scores generated from the real data into q-values. Given a depletion score x, its corresponding FDR is calculated by dividing the number of null scores better than x by the number of real scores better than x. Note that the total number N of real scores is usually different from the total number N′ of null scores. Therefore, we normalize the two distributions by multiplying our FDR estimates by N/N′. Finally, the q-value of a score x is defined as the minimal FDR at which x is deemed significant.
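This estimation can be sketched directly; the function below is an illustrative quadratic-time version that returns q-values aligned to the real scores in best-first order, assuming lower (more negative) depletion scores are better:

```python
# Empirical FDR / q-value estimation from real and null depletion scores.
# Normalizing by n_real / n_null accounts for the null set being larger
# (e.g. 100 shuffled data sets).
def q_values(real, null):
    n_real, n_null = len(real), len(null)
    fdrs = []
    for x in sorted(real):  # best (lowest) score first
        better_null = sum(1 for s in null if s <= x)
        better_real = sum(1 for s in real if s <= x)  # >= 1, includes x itself
        fdrs.append(better_null / better_real * (n_real / n_null))
    # q-value of a score: minimal FDR over all thresholds at which it
    # would be called significant, i.e. a running minimum from the tail.
    qs, running_min = [], float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    return qs[::-1]
```

A production version would sort both lists once and sweep them jointly rather than recount for every threshold.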
2.4 Gold standard and evaluation metric
To build a gold standard, we first scanned the yeast intergenic regions at a q-value threshold of 0.2, using the motif position weight matrices (PWMs) derived from ChIP-Chip experiments and evolutionary conservation by MacIsaac et al. For this procedure, we used the motif scanning tool FIMO (http://meme.sdsc.edu), which computes p-values assuming a zero-order Markov background model (Staden, 1994) and corrects for multiple testing by converting to q-values. Canonically, a protein-binding footprint should occur only in regions that are not occupied by nucleosomes. Therefore, we further filtered out the motif hits falling in nucleosome-occupied regions, using a set of 55 141 sequenced nucleosomes from a previous study (Mavrich et al.). More specifically, a motif hit is kept only if its distance to the nearest nucleosome dyad is between 74 bp and 220 bp. Finally, we merged overlapping hits of different motifs, yielding a set of 3180 distinct TF binding regions. We call this gold standard the MacIsaac motif hits.
As an alternative gold standard, we directly used the collection of yeast TFBSs originally predicted by MacIsaac et al., filtering out those falling in nucleosome-occupied regions (Mavrich et al.). After merging overlapping binding sites of different factors, this set consists of 1992 distinct TF binding regions. We call this gold standard the MacIsaac binding sites.
Neither of these MacIsaac data sets can be used as an absolute measure of the performance of our approach, for two reasons. First, the MacIsaac gold standard sets presumably miss many real binding sites, because they were conservatively defined using strict thresholds on binding p-value, conservation, and motif scanning q-value. Second, the experimental conditions for the MacIsaac ChIP-Chip data differ from those for our digital genomic footprinting data. Although the MacIsaac sets are imperfect as an absolute measure, if we assume that their inaccuracies are unbiased with respect to each approach under consideration, we can still use them to compare the performance of different footprint detection approaches.
To evaluate a footprint detection algorithm, we first identify footprints by thresholding the scored candidates that the algorithm generates within the yeast intergenic regions. We then count the matches between this set of footprints and the binding regions from a MacIsaac gold standard. The figure below illustrates how true positive (TP), false positive (FP) and false negative (FN) sites are defined when comparing a set of identified footprints with the MacIsaac binding regions. To be called a TP site, a predicted footprint is required to overlap a MacIsaac binding region by at least 1 bp. Using the TP, FP and FN values, we can then compute precision [defined as TP/(TP+FP)] and recall [defined as TP/(TP+FN)]. For each footprint detection approach, we plot the precision as a function of recall by varying the depletion score threshold. Because we are primarily interested in footprints that are identified with a low FDR (i.e. those identified with high precision), we plot the precision-recall curve only up to a recall of 0.1.
Comparing predicted footprints to the MacIsaac binding regions. A footprint is considered a true positive if it overlaps a MacIsaac binding region by at least one base pair.
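The overlap-based matching can be sketched as follows; intervals are represented as (start, end) pairs with inclusive coordinates, which is an assumption of this illustration:

```python
# Evaluation metric sketch: a predicted footprint counts as a true positive
# if it overlaps any gold-standard binding region by at least 1 bp.
def overlaps(a, b):
    """True if closed intervals a and b share at least one base pair."""
    return a[0] <= b[1] and b[0] <= a[1]

def precision_recall(predicted, gold):
    tp = sum(1 for p in predicted if any(overlaps(p, g) for g in gold))
    fp = len(predicted) - tp
    fn = sum(1 for g in gold if not any(overlaps(g, p) for p in predicted))
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall

pred = [(10, 20), (100, 110)]   # illustrative footprint calls
gold = [(18, 30), (200, 210)]   # illustrative binding regions
assert precision_recall(pred, gold) == (0.5, 0.5)
```

Sweeping the depletion score threshold over the ranked candidates and recomputing this pair of values yields the precision-recall curve.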
2.5 Digital genomic footprinting data
We used digital genomic footprinting data for the Saccharomyces cerevisiae genome from Hesselberth et al. These data were generated using DNaseI digestion and massively parallel DNA sequencing, applied to yeast a cells synchronized in the G1 phase of the cell cycle.
Among 3.6 million base pairs of the yeast intergenic regions, 3.0 million of them are uniquely mappable on both strands (i.e. the 27 bp reads starting at those positions on either strand are unique in the yeast genome). The data set contains 11.3 million sequence tags associated with those uniquely mappable positions.
Focusing on coding regions, 8.2 million out of 8.5 million base pairs in the yeast coding regions are uniquely mappable on both strands, and they are associated with 8.7 million sequence tags. This indicates that the sequencing coverage of coding regions is much lower than that of intergenic regions.