PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of sagmbStatistical Applications in Genetics and Molecular BiologySubmit to Statistical Applications in Genetics and Molecular BiologySubscribeStatistical Applications in Genetics and Molecular Biology
 
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): 29.
Published online 2009 June 19. doi:  10.2202/1544-6115.1454
PMCID: PMC2861327
NIHMSID: NIHMS193955

A Non-Homogeneous Hidden-State Model on First Order Differences for Automatic Detection of Nucleosome Positions

Abstract

The ability to map individual nucleosomes accurately across genomes enables the study of relationships between dynamic changes in nucleosome positioning/occupancy and gene regulation. However, the highly heterogeneous nature of nucleosome densities across genomes and short linker regions pose challenges in mapping nucleosome positions based on high-throughput microarray data of micrococcal nuclease (MNase) digested DNA. Previous works rely on additional detrending and careful visual examination to detect low-signal nucleosomes, which may exist in a subpopulation of cells. We propose a non-homogeneous hidden-state model based on first order differences of experimental data along genomic coordinates that bypasses the need for local detrending and can automatically detect nucleosome positions of various occupancy levels. Our proposed approach is applicable to both low and high resolution MNase-Chip and MNase-Seq (high throughput sequencing) data, and is able to map nucleosome-linker boundaries accurately. This automated algorithm is also computationally efficient and only requires a simple preprocessing step. We provide several examples illustrating the pitfalls of existing methods, the difficulties of detrending the observed hybridization signals and demonstrate the advantages of utilizing first order differences in detecting nucleosome occupancies via simulations and case studies involving MNase-Chip and MNase-Seq data of nucleosome occupancy in yeast S. cerevisiae.

1. Introduction

Nucleosomes are the fundamental structural units of chromatin and consist of approximately 146 base pairs of DNA wrapped around a histone octamer (Kornberg and Lorch; 1999; Chakravarthy et al.; 2006). The precise positioning of nucleosomes along the genome has been implicated in the regulation of gene expression (reviewed in Ercan et al. (2004)). Packaging of DNA into nucleosomes may prevent DNA binding proteins from accessing their sites, recruit transcriptional activators or repressors, and bring distant DNA sequences into close proximity to promote transcription (Millar and Grunstein; 2006). In the last couple of years, nucleosome positions have been mapped across the genomes of S. cerevisiae (Yuan et al.; 2005; Lee et al.; 2007; Shivaswamy et al.; 2008), C. elegans (Johnson et al.; 2006), and humans (Schones et al.; 2008) in various cell types and under a variety of physiological perturbations. These studies have revealed various chromatin remodeling patterns in transcriptional regulation at nucleosome resolution. In particular, Shivaswamy et al. (2008) showed that gene activation in yeast is mainly accompanied by the loss of one or two nucleosomes in the promoter regions, while Lee et al. (2007) illustrated that functionally related genes share similar nucleosome occupancy patterns across their promoters. Nucleosome occupancy can hinder the binding of transcription factors to their consensus motifs, and the fraction of bound motifs vary between nucleosomes and nucleosome free regions (Yuan et al.; 2005). These findings illuminated that identifying locations of individual nucleosomes accurately is essential for studying the effect of dynamic changes in nucleosome occupancy in the control of gene regulation. By having a reliable map of nucleosome occupancy, one can investigate various histone modifications at the nucleosome level to uncover the complex mechanism in transcriptional reprogramming. Similarly, for studies investigating the effect of physiological perturbations on nucleosome positioning, the starting point often involves maps of nucleosome occupancy before and after such perturbations. For example, nucleosome mapping experiments of Schones et al. (2008) in resting and activated human CD4+ T cells revealed specific reorganization patterns of nucleosomes in promoter and enhancer regions of the genome.

Numerous high-throughput experiments have been carried out to map nucleosome occupancy in S. cerevisiae via tiling arrays (Liu et al.; 2005; Yuan et al.; 2005; Lee et al.; 2007; Shivaswamy and Iyer; 2008; Kaplan et al.; 2008). More recently, a high resolution whole genome nucleosome map for yeast genome was developed via a high throughput sequencing technology (Albert et al.; 2007; Shivaswamy et al.; 2008). In both platforms, the sample input consists of mono-nucleosomes prepared via micrococcal nuclease (MNase) digestions, which degrades all but the DNA wrapped around histone proteins. Two nucleosomes are connected by linker DNA, which is digested by the enzyme. The digested sample is either sequenced by high-throughput sequencing technologies (MNase-Seq), or competitively hybridized against a control sample using high density tiling arrays in (MNase-Chip). A high percentage of the S.cerevisiae genome is known to be occupied by nucleosomes, however there exists substantial variation in nucleosome density across the genome. In particular, relatively higher density of nucleosomes is observed at transcribed regions and lower density is found in intergenic regions (Lee et al.; 2004; Bernstein et al.; 2004; Lee et al.; 2007; Shivaswamy et al.; 2008). Positions of nucleosomes across the whole genome are characterized by a stretch of consecutive probes encompassing approximately 146 base pairs with higher signals than the background. An interesting feature observed in many of the MNase-Chip experiments for mapping nucleosome positions is that the magnitude of log base 2 ratios for regions occupied by nucleosomes exhibit large variability. Specifically, some regions of the genome thought to be occupied by nucloesomes actually show log base 2 ratios below the baseline. Yuan et al. (2005) provided substantial evidence of this problem and referred to this phenomena as unpredictable trends in hybridization. The variability in the magnitudes of nucleosome occupancy is also observable from the high resolution MNase-Chip data of Lee et al. (2007) and MNase-Seq data of Shivaswamy et al. (2008). This trend in hybridization can be attributed to the heterogeneity of nucleosome densities across the whole genome, resulting in both stable and unstable nucleosome occupancies. Unstable or low-signal nucleosomes are nucleosome peaks having low maxima and may correspond to nucleosomes found only in a subpopulation of cells (Yuan et al.; 2005). These low-signal nucleosomes may be the most important and dynamic in regulating transcription by cycling on and off the DNA. We will refer to these as “low-signal nucleosomes” in our subsequent discussion.

Previous work in identifying nucleosome positions in MNase-Chip data include using a hidden Markov model (HMM) (Yuan et al.; 2005) on the observed log base 2 ratios. Yuan et al. (2005) proposed an HMM that takes into account the length of nucleosomal DNA and allows for one emission distribution for each of the nucleosome and linker states, respectively. To account for a global trend, Yuan et al. (2005) applied the HMM to a sliding window of 40 probes and averaged the estimated model parameters and posterior probabilities over all the windows covering a fixed probe to compute the most likely hidden state path. In addition to the nucleosomes identified by the sliding window HMM, they also included additional low-signal nucleosomes obtained by comparing the median intensities of the peak and trough within a window size of 7 probes, which was regarded as a detrending procedure to further identify low-signal nucleosomes. We show that applying the proposed detrending procedure to the entire tiling array data is undesirable as it introduces higher noise level and spurious linkers/nucleosomes in simulations and a case study of yeast nucleosome occupancy. Yuan et al. (2005) also provided nucleosome positions hand picked via close visual inspection in order to capture all potential nucleosomes. However, this heuristic approach becomes tedious when mapping nucleosome occupancy in larger genomic regions.

To accommodate the serious drawbacks of existing methods, we propose a fully automated approach which identifies nucleosome occupancy and addresses the length of nucleosomal DNA and the observed trends in hybridization signals. At the core of our methodology is a non-homogeneous hiddenstate model (NHSM) tailored for MNase-Chip data measuring nucleosome occupancy. By designing the architecture for the first order (lagged) differences of log base 2 ratios, we bypass the problem of unpredictable trends in the log base 2 ratios. An additional feature of our approach is its applicability to the more recent MNase-Seq data. We illustrate the methodology and benchmark its performance against other available methods in simulations and a case study involving a 20 base pairs resolution yeast MNase-Chip nucleosome occupancy data. The results of these experiments highlight the superiority of our NHSM to other approaches especially in identifying low-signal nucleosomes. We also provide an illustration of its applicability to higher resolution MNase-Chip and MNase-Seq nucleosome occupancy data. Additionally, two consecutive nucleosomes are separated by a linker of variable length. Therefore, a good methodology for mapping nucleosome occupancy should be able to identify nucleosome-linker boundaries accurately. This is usually challenging for the low resolution tiling array design in which a linker is represented by one or two probes (Yuan et al.; 2005; Kaplan et al.; 2008). Our proposed methodology carefully exploits the structure of nucleosomes and accurately maps nucleosome positions and therefore provides a principled framework for studying other epigenetic events that rely on mapping nucleosome occupancy.

2. Motivation

We motivate the idea behind our methodology using the MNase-Chip data from Yuan et al. (2005). We use the normalized median log base 2 ratios of the 8 replicates for illustration. The top panel of Figure 1 shows the nucleosome profile for a region in chromosome 3 in which the nucleosomes identified by Yuan et al. (2005) are marked with black lines (each vertical line representing a probe), and a stable nucleosome is represented by 6 to 8 probes. It is clear from the plot that the magnitude of log base 2 ratios of a nucleosome region exhibits large variability. Despite having heterogeneous hybridization signals, the plot suggests that a nucleosome is characterized by a peak in the local signal intensity, even if the log base 2 ratio is below the baseline. In other words, a nucleosome occupied region exhibit a “bump” shape irrespective of the actual strength in hybridization signal. In addition, this plot also suggests that using a single distribution for each of the nucleosome and linker/nucleosome depleted regions may fail to distinguish short linkers between stable or wellpositioned nucleosomes, (i.e., linkers between well-positioned nucleosomes have comparable hybridization strength to low-signal nucleosomes.)

Figure 1:
Typical characteristics of MNase-chip nucleosome occupancy data from Yuan et al. (2005). Top panel is the original normalized data tiling a region in chromosome 3. The vertical black solid lines represent probes identified as nucleosome state according ...

Given the observed “bump” (or peak with low maxima) characteristic of annotated nucleosomes in the original data, we consider a simple smoothing by replacing the log base 2 ratios of probe i with the average values of probe i − 1, i and i + 1. As evident in the middle panel of Figure 1, the “bump” shape is enhanced in the smoothed data which enable easier mapping of the nucleosome positions. The “bumps” also suggest that a nucleosome occupied region is characterized by a series of decreasing positive slopes, followed by slopes of approximately zero in magnitude and then a series of increasing negative slopes. This observation forms the modeling framework of our proposed methodology. The first order differences automatically take care of the trend in hybridization and thereby bring both the low and stable nucleosomes to a comparable level. Once the nucleosome positions are obtained, one can rank the strength of each nucleosome by the average log base 2 ratios of the probes within the nucleosome.

The trend in hybridization and variability in the magnitudes of nucleosome occupancy are not specific only to lower resolution MNase-Chip data of Yuan et al. (2005) but can also be clearly seen in high resolution MNase-Chip data of Lee et al. (2007) and MNase-Seq data of Shivaswamy et al. (2008), as evident in Figures 15 and and17.17. In addition, we also illustrate that a nucleosome occupied region is characterized by a “bump” shape, which is enhanced upon smoothing in these two data sets (middle and right panels of Figures 15 and and17).17). The choice of smoothing is presented in Section 3. Since all the different data sets share similar defining characteristics of nucleosome occupancy, for expository purposes, our detailed discussion is mainly focused on the 20 base pairs resolution tiling array data of Yuan et al. (2005). This tiling array design is also used in a recent publication by Kaplan et al. (2008) for studying H3K56 acetylation in yeast. We will demonstrate that our proposed method is applicable and works well in detecting nucleosome positions in both the high resolution tiling arrays and sequencing data in Section 5.2.

Figure 15:
Nucleosome occupancy at CHA1 (top row) and HIS3 (bottom row) promoter. Red horizontal lines at y =1.5 are the nucleosome annotation from Lee et al. (2007). Blue horizontal lines at y =2 are the nucleosome annotation from Shivaswamy et al. (2008). Green ...
Figure 17:
Nucleosome occupancy at CHA1 (top row) and HIS3 (bottom row) promoter. Red horizontal lines at y =20 are the nucleosome annotation from Lee et al. (2007). Blue horizontal lines at y =25 are the nucleosome annotation from Shivaswamy et al. (2008). Green ...

3. Hidden-state model for mapping nucleosome positions

To circumvent the problem of decoding nucleosome occupancy locally to accommodate for the observed local trends as in Yuan et al. (2005), we consider an alternative approach to infer nucleosome positions based on first order (lagged) differences, Ot, which we defined as:

Ot=Xt+k1Xtk,Xt=S(Y)t,

where Yt is the observed log base 2 ratio of probe t and Xt is the corresponding smoothed/denoised data for Yt, t = 0, ..., T. The Ot’s are to quantify the slope at midpoint between probe t−1 and probe t to capture the “bump” shape of a nucleosome. We define this midpoint location as mid-probe, i.e., mid-probe t corresponds to the midpoint between probe t−1 and probe t. Although other choices for defining the slope/gradient can be utilized, we show that the simple first order (lagged) differences is generally sufficient in both the low and high resolution MNase-Chip and MNase-Seq data. For Yuan et al. (2005) data, we let Xt to be a moving average statistic in a window size of 2w + 1 probes and Ot to be the first order difference (k = 1). That is,

Ot=XtXt1,Xt=j=twt+wYj/(2w+1).

We observe that substituting the log base 2 ratios by the corresponding moving average statistic Xt’s reduces the noise in the data and enhances the shape of peaks and troughs, but not at the expense of over smoothing the data as shown in the middle panel of Figure 1. On the other hand, for the high resolution MNase-Chip and MNase-Seq data, the simple moving average can be replaced with a Gaussian kernel smoother. A detailed motivation with some analytical results for using a Gaussian kernel smoothing is given in Appendix A.1. In kernel smoothing, the tuning parameter is the bandwidth h. Large bandwidth implies more smoothing, and vice versa. For Gaussian kernel, h is also the standard deviation. That is,

Xt=j=0Tφ(|GtGj|h)Yjj=0Tφ(|GtGj|h),

where [var phi](.) is the standard Gaussian probability density function and Gj is the genomic coordinate corresponding to Yj in base pairs. We propose choosing h based on the size of the nucleosomal DNA. For a Gaussian distribution, 99% of the values are within ±2.5σ, where σ = h. Therefore, we choose h = 146/5 so that the bandwidth spans the size of a nucleosome. The middle and right panels of Figures 15 and and1717 illustrate the resulting smoothed log base 2 ratios from Gaussian kernel smoothing with this choice of bandwidth. The Gaussian kernel smoothing is able to denoise the data and enhance the “bump” characteristics of a nucleosome.

As motivated in Section 2, a nucleosome occupied region is characterized by a series of positive followed by negative slopes or Ot’s, while the boundaries of nucleosomes-linker regions are characterized by steeper slopes. Detecting jumps in Ot’s via segmentation is a potential approach to map nucleosome occupancy but traditional segmentation approaches do not incorporate the length of nucleosomal DNA. In addition, since the data is obtained from tiling arrays, spatial correlations among observations of nearby probes are expected. To account for the length of nucleosomal DNA and the correlation structure, we propose a non-homogeneous hidden-state model (NHSM) based on first order differences Ot’s. Next, we give a detailed characterization of the NHSM architecture.

Consider the state transitions given in Figure 2(a) where Ni’s represent the nucleosome region states, Li’s represent linker or nucleosome depleted region state and Bi’s represent nucleosome-linker boundaries. The self transitions of N1 and N3 is to account for less stable nucleosomes which span a larger region than well-positioned nucleosomes, termed “fuzzy” nucleosomes by Yuan et al. (2005). Recently, Valouev et al. (2008) also provided ample evidence for the existence of fuzzy nucleosomes in the C. elegans genome using highthroughput sequencing data. They attributed this to the lack of constraints in absolute positioning for some fraction of the nucleosomes across the C. elegans genome. Therefore, we introduce state duration d(i) to capture the length of nucleosomal DNA explicitly. Assume that a well-positioned nucleosome (146 base pairs) is characterized by p probes, or equivalently p − 1 first order differences. We require

i{N2a,N2b,N2c}d(i)+2=p1,0d(N2a),d(N2b)p3,

since at least one probe is from N1 and one is from N3 out of p − 1 probes representing a nucleosome.

Figure 2:
State transition representation in NHSM. Ni represents nucleosome states, Li represents linker states, BN and BL represent linker-nucleosome and nucleosome-linker boundaries, respectively.

In most cases, the “bump” shape of a nucleosome on tiling arrays is symmetrical, which implies that d(N2a) = d(N2c). Moreover, given the state duration constraint, the state transitions can be further simplified as in Figure 2(b) by tying states N2a, N2b and N2c as N2 with a trinomial duration density:

pN2(d1,d2,d3)=(p3)!d1!d2!d3!p1d1p2d2p3d3,

where p1 + p2 + p3 = 1 and d1 + d2 + d3 = p − 3.

Let bi(Ot) denote the emission distribution for observed value at mid-probe t = 1, ..., T given unknown state i [set membership] {Ni, Li, Bi}. We model bi(Ot) with Gaussian distributions,

bBN(Ot)N(μ1,σBN2),bN1(Ot)N(μ2,σN12),bN3(Ot)N(μ2,σN52),bBL(Ot)N(μ1,σBL2),bL1(Ot)N(μ2,σL12),bL2(Ot)N(0,σL22),bL3(Ot)N(μ2,σL32),bN2(Ot:t+p4)N(μ˜,),

where

μ˜=(μ2,,μ2d1,0,,0d2,μ2,,μ2p3d1d2),=diag(σN2a2,,σN2a2d1,σN2b2,,σN2b2d2,σN2c2,,σN2c2p3d1d2),

and 0 < μ2 < μ1. The constraint on the mean of emission distributions is to ensure the series of decreasing positive slopes, zero slopes and followed by increasing negative slopes which characterize the “bump” shape of a nucleosome. In the case of symmetric “bump” shape, the duration density for N2 reduces to univariate density p(d1) and

μ˜=(μ2,,μ2d1,0,,0p32d1,μ2,,μ2d1).

The discrete duration density assumption implies that the proposed duration in the hidden states is equivalent to a hidden-state model with a larger hidden state space. We can recast the state transition in Figure 2(a) as Figure 3 which have the same complexity by considering all possible uni-directional paths transiting from N1 and incorporating the constraint ∑i[set membership]{N2a, N2b, N2c}d(i) +2 =p − 1. We can equivalently let bN2a(Ot) ~ N (μ2, σN2a2), bN2b (Ot) ~ N(0, σN2b2) and bN2c (Ot) ~ N(−μ2, σN2c2). In scenarios where we have high resolution experiments for mapping nucleosome occupancy such as the 4 base pairs resolution MNase-chip data of Lee et al. (2007) or 1 base pair resolution MNase-seq data of Shivaswamy et al. (2008), the “bump” shape of nucleosome is relatively well characterized by a few positive slopes, followed by a plateau and a few negative slopes. In such cases, we can reduce the range of d1 by removing some uni-directional paths in Figure 3 and thereby simplify the structure of the hidden state transitions.

Figure 3:
State transition representation in NHSM. An equivalent representation of the discrete duration density in the hidden states of Figure 2(a).

Let Qt denote the hidden state for mid-probe t. Note that if Qt = BN, this indicates that probe t − 1 is in linker region and probe t is in nucleosome region. Since high log base 2 ratios represent regions that are more likely to be occupied by nucleosomes and vice versa for low log base 2 ratios, we model the hidden state transitions as a function of observed log base 2 ratios Xt. This framework implies that while the nucleosome/linker hidden state Qt dictates the observed first order differences (a function of Xt), there is another function of Xt which in turn influences the hidden states and gives rise to a feedback structure. This idea is adapted from Zucchini et al. (2008) who proposed a mechanism to allow for such feedback. We refer the readers to Zucchini et al. (2008) for an excellent motivation of this framework in the context of animal behavior. We define Zt = Ot + Zt−1 for t = 1, ..., T and Z0 = X0. In the case where Ot = XtXt−1, we have Zt = Xt. This feedback structure is best understood by considering the following graphical model representation (Figure 4).

Figure 4:
Graphical model representation of the feedback structure. The directionality of the edges dictates the dependence structure. This model implies that the transition from Qt to Qt+1 depends on Zt.

Let ai,j(z) = P(Qt+1 = j | Qt = i, Zt = z) be the transition probabilities from state i to j between mid-probe t and mid-probe t + 1 given Zt. Also, write O(t) = (O1, ..., Ot). Two assumptions arising from Figure 4 are:

P(Qt+1|Q(t),Z0,O(t))=P(Qt+1|Qt,Zt)  for  t=1,,T1.
(1)

P(Ot|Q(t),Z0,O(t1))=P(Ot|Qt)  for  t=1,,T.
(2)

To avoid overparametrization, only transitions aBL, • (Zt), aL3, • (Zt) and aN3, • (Zt) are functions of Zt’s. Other transition probabilities are assumed to be time homogeneous. We employ a logistic regression model to parametrize the hidden transitions for BL, L3 and N3:

ai,j(Zt)=exp(γi,j+βjZt)k=1Nexp(γi,k+βkZt).

In cases where the data has been median centered at zero, we observe that a simpler version of the non-homogeneous transition probabilities for these three hidden states performs well (see case study). That is, we consider

ai,j(Zt)={ai,jn,      if Zt<0,ai,jp,      if Zt0.

For instance, we can let aL3,BN (Zt) = I(Zt ≥ 0) + aL3,BNn I(Zt < 0) to impose transition into nucleosome states when Zt ≥ 0. Details of model fitting are given in Appendix A.2.

4. Simulation studies

Yuan et al. (2005) attributed the heterogeneous nucleosome density to unpredictable trends in hybridization data. They applied the HMM to a sliding window of 40 consecutive probes to address this issue. Hidden states decoding via the Viterbi algorithm was based on average values of the model parameters and posterior probabilities of all windows containing a fixed probe. We referred to this method as sliding window HMM (SHMM). SHMM is computationally intensive and requires one to select the window size, which depends on the trend in hybridization. Yuan et al. (2005) also proposed detrending the data by comparing the magnitude of peak and trough locally to capture lowsignal nucleosomes. In particular, for each probe, they considered a window size of 7 probes (~ size of a nucleosome) centered at the probe and replaced the observed log base 2 ratio by the difference between the median of log base 2 ratios within the window and the minimum log base 2 ratio of the two probes adjacent to this window. They observed that the trend was effectively eliminated using this procedure. We referred to this method as HMMD (detrending followed by usual HMM to infer nucleosome/linker states).

4.1. Simulation I: Hidden Markov model with trend line

In the first simulation, we generated the data using the HMM hidden states architecture in Figure 5(a) (or Figure S1E of Yuan et al. (2005)), in which well-positioned nucleosomes were represented by 6 to 8 probes (N1-N8) and delocalized nucleosomes (D1-D9) covered at least 9 probes. Nucleosome regions were expected to have high log base 2 ratios whereas linker regions had lower values. The hidden state transitions in Yuan et al. (2005) allowed for linker regions (L) to have variable length. Conditioned on the hidden states, the observed log base 2 ratios were generated from Gaussian distributions, with mean 0.7, standard deviation (s.d.) 0.2 for nucleosome states and mean −0.7, s.d. 0.3 for linker state. We illustrated that although we were simulating the observed log base 2 ratios, and not the first order differences, our proposed NHSM was able to map nucleosome positions accurately.

Figure 5:
HMM architecture in Yuan et al. (2005). D1-D9 represent delocalized high nucleosomes, N1-N8 represent well-positioned high nucleosomes, DL1-DL9 represent delocalized low-signal nucleosomes, NL1-NL8 represent well-positioned low-signal nucleosomes and ...

To simulate heterogeneous nucleosome densities, we added a trend line to the simulated data following Yuan et al. (2005). Figure 1 suggests that the underlying trend line in the observed data resembles a curve. Therefore, instead of adding a linear trend line as in Yuan et al. (2005), we let the trend be a sinusoidal curve so that the synthetic data resembles the observed data to a larger extent (Figure 6 top right panel). The bottom left panel of Figure 6 plots the detrended data obtained by comparing peak to trough in a window size of 7 described above. Although this procedure was able to remove the trend in hybridization, it introduced artificial linkers within delocalized nucleosomes and spurious “bumps” within nucleosome depleted/long linker regions and resulting in data with higher noise level. This suggests that applying the same detrending procedure to the whole data is not desirable. On the other hand, a simple smoothing of the synthetic data preserved the “bump” shape that characterizes a nucleosome (Figure 6 bottom right panel). We considered sinusoidal curves with different periodicity (Figure 7) in this simulation study.

Figure 6:
An example of simulated data from Simulation I. The dotted line in the top right panel is the trend line. Bottom left panel is the data detrended by comparing peak and trough within a window size of 7 probes. Bottom right panel is the smoothed data. Black ...
Figure 7:
Trend lines in the simulated data. The periodicity of the sinusoidal trend lines is varied in each simulation scenario.

4.2. Simulation II: Hidden Markov model with mixture emission distributions

Although adding a trend line results in sythetic data that resembled the actual observed data, it may not be the most realistic model to describe the heterogeneity of nucleosome densities. We considered a more realistic simulation setup to generate nucleosomes with various occupancy levels by using mixture emission distributions for the hidden states. We enlarged the hidden state transitions (Figure 5(b)) by introducing low and high (stable) nucleosome states. The stable nucleosomes (N1-N8, D1-D9) were generated from a Gaussian distribution with mean 0.7 and s.d. 0.2. Low-signal nucleosomes (NL1-NL8, DL1-DL8) were generated from a Gaussian distribution with mean 0.1 and s.d. 0.3 and the linker state was generated from a mixture of 3 Gaussian distributions with means −0.3, −0.5, −0.7 and constant s.d. 0.3 with equal mixing proportion. An example of simulated data is shown in Figure 8. The middle panel again shows that detrending introduces a higher noise level to the original data.

Figure 8:
An example of simulated data from Simulation II. Middle panel is the data detrended by comparing peak and trough within a window size of 7 probes. Bottom panel is the smoothed data. Black vertical lines represent nucleosome probes.

We simulated observations for 1000 probes according to a tiling design of 50-mer probes overlapped by 30 base pairs covering a 20030 base pair region. In both simulations, we decoded the hidden states using the usual HMM with two emission distributions, one for linker and one for nucleosomes (without differentiating fuzzy/well-positioned, low/high), SHMM, HMMD (detrend first, then apply usual HMM) and our proposed NHSM (on first order differences). The most probable path for each method was decoded via the Viterbi algorithm (Appendix A.2.3).

4.3. Results

We compared the performance of each method via the area under a receiver operating characteristic (AUROC) curve, by varying the posterior probabilities of declaring a probe to be in a nucleosome (well positioned and delocalized) state. In addition, we also evaluated the sensitivity and specificity at probe level of the most probably path for each method. The results, averaged over 50 simulated data sets of 1000 probes, are summarized in Table 1.

Table 1:
Mean sensitivity, mean specificity and AUROC from the 50 simulations with the corresponding standard errors for each method. Sensitivity and specificity calculations are based on the most probably path decoding in each method. AUROC illustrates the overall ...

In both simulations, NHSM has a consistent result and outperforms other methods in both the sensitivity/specificity at the 0.5 posterior probability threshold and AUROC, since its main assumption is the “bump” shape that characterizes a nucleosome and this characteristic is preserved irrespective of the underlying trends in hybridization (Simulation I). HMM consistently tends to declare fewer nucleosomes, resulting in lower sensitivity. On the other hand, in cases where the trend line has larger periodicity, comparing the magnitudes of peaks and troughs is able to remove the trend effect and improves the performance of HMMD, although it is still worse than NHSM. The superior performance of SHMM in Simulation I with larger periodicity is not surprising. When the periodicity is large, the simulated data in each segment consisting of 40 probes is very close the the original hidden Markov model generator with scaled mean in the emission distributions, and therefore fitting a usual HMM to each segment in SHMM agrees with the underlying data generator. However, when the trend line oscillates more frequently (Simulation I) or is unpredictable (Simulation II), the performance of SHMM decreases rapidly. This indicates that the sliding window size in SHMM depends heavily on the trend in hybridization. In the actual data analysis, it is hard to calibrate the window size since the exact trend is unknown, and a reasonable number of probes within the window size is required for obtaining reliable parameter estimates in an HMM fit.

5. Case studies

5.1. Mapping nucleosome occupancy inMNase-chip data

We illustrated our proposed NHSM on the normalized median log base 2 ratios of the 8 replicates from Yuan et al. (2005). The data was generated from microarrays which consist of 50-mer oligonucleotides probes tiled at 20 base pairs resolution, covering approximately half megabase of the yeast genome. A moving average in a window size of 3 probes was first applied across the whole data as the smoothing step. A well positioned nucleosome (~146 base pairs) is represented by at least 6 probes (Yuan et al.; 2005), which implies that 0 ≤ d(N2a), d(N2b), d(N2c) ≤ 3. We also assumed that d(N2a) = d(N2c). Therefore, the structure of state transitions in the hidden states is simplified and given in Figure 9. For this case study, we considered the simpler non-parametric transition probabilities for BL, L3 and N3:

aN3,BL(Zt)={1,               if Zt<0,aN3,BLp,     if Zt0,aBL,BN(Zt)={aBL,BNn,     if Zt<0,1,               if Zt0,aL3,BN(Zt)={aL3,BNn,     if Zt<0,1,              if Zt0.

where Zt = Ot + Zt−1. Since Ot = XtXt−1 in this case study, we have Zt = Xt.

Figure 9:
Simplified state transition representation in NHSM for MNase-chip data of Yuan et al. (2005). We assume that d(N2a) = d(N2c).

This transition structure implies that if the current state is in a linker region, a positive log base 2 ratio observed in the immediate probe imposes transition into a nucleosome state. Similarly, if the current state is in N3 nucleosome state, a negative log base 2 ratio observed in the immediate probe imposes transition into a linker state. This transition structure appears to be sufficient and works well on the data. We first illustrated that our proposed NHSM is able to detect low-signal nucleosomes in the HIS3 promoter region as shown in Figure 10. The horizontal black line between positions 721871 and 721971 is the low-signal nucleosome annotated in Figure 1B of Yuan et al. (2005) which was only identified by further detrending the data for low-signal nucleosomes. In particular, Yuan et al. (2005) first applied SHMM to decode nucleosome positions, and further included additional nucleosomes identified exclusively only by HMMD, which they labeled as low-signal nucleosomes. Although Yuan et al. (2005) used the SHMM annotation as baseline nucleosomal set and added low-signal nucleosomes from HMMD to this set, this procedure could be problematic if the boundaries of the low-signal nucleosomes cut across the boundaries of flanking nucleosomes identified in SHMM. In general, combining the two sets of annotation requires careful curation in determining the boundaries of nucleosomes and linkers since the two annotations do not coincide precisely. This low-signal nucleosome was also identified by others according to Yuan et al. (2005) and in the high resolution MNase-Chip experiment of Lee et al. (2007) and MNase-Seq experiment of Shivaswamy et al. (2008), therefore it is not likely to be an artifact of hybridization. Our proposed NHSM is able to map this low-signal nucleosome automatically and accurately without any additional detrending. We provided an example of a low-signal nucleosome that was still missed by further detrending in Yuan et al. (2005) in Figure 11. This low-signal nucleosome was also annotated in high resolution data of Lee et al. (2007) and Shivaswamy et al. (2008) and this provides evidence against it being a hybridization artifact. We also showed that the duration constraint in nucleosome states in our NHSM architecture is able to distinguish real “bumps” which characterize a nucleosome from spurious small “bumps” at positions 103400 (between nucleosomes 1 and 2) and 104400 (between nucleosomes 6 and 7) in the top panels of Figure 12. The problem with detrending the data by comparing peak and trough within a window size of 7 probes is also visible in this region. As evident in the bottom left panel of Figure 12, detrending introduced more noise to the original data and diminished the distinction between linker and nucleosomes.

Figure 10:
Nucleosome occupancy in HIS3 promoter. Top left panel is the original normalized data tiling HIS3 promoter region and using annotation based on“hand picked” nucleosomes in Yuan et al. (2005). Top right panel is similar to top left panel ...
Figure 11:
An example of “hand picked” low-signal nucleosome for a region in chromosome 3. Black horizontal line between positions 49841 and 49961 is an example of “hand picked” low-signal nucleosome by Yuan et al. (2005). Red and ...
Figure 12:
Nucleosome occupancy for a region in chromosome 3 in Yuan et al. (2005). Top panels are based on “hand picked” annotation. Bottom left panel is the detrended data by comparing peak and trough within a window size of 7 probes. Bottom right ...

To compare the performance of the different methods, we used the “hand picked” annotation in Yuan et al. (2005) as the gold standard. Hand picked annotation was based on careful visual inspection (Yuan et al.; 2005), and thus formed a reliable nucleosome map for this tiling array data. However, it is inevitable that there may still exist some uncertainties in mapping nucleosomelinker boundaries even by careful visual inspection as shown in Figure 13. To account for the one/two probes boundary uncertainties in the “hand picked” annotation, we allowed for one probe margin in defining sensitivity and specificity. That is, suppose that the underlying state for probe i is nucleosomal based on “hand picked” annotation. We declare this probe to be correctly inferred if either one of the probes i−1, i or i+1 is annotated as nucleosomal probe for each of the method under comparison. To measure the sensitivity of our proposed method in detecting low-signal nucleosomes, we considered two possible sets of true positives. The first set was defined by using probes annotated as nucleosomes (both low and high signals) in the “hand picked” annotation. The second set was defined by using probes categorized as low-signal nucleosomes by “hand picked” annotation according to Yuan et al. (2005) (that is corresponding to score 0.25 and 0.5 in Yuan et al. (2005)).

Figure 13:
Examples of “hand picked” annotations in Yuan et al. (2005). Left panels are original data based on the “hand picked” annotations in Yuan et al. (2005) for two regions in chromosomes 5 and 7, respectively. Right panels ...

Table 2 summarizes the sensitivity and specificity for these methods using “hand picked” annotation as the gold standard. “Sensitivity(both)” was obtained using all annotated nucleosomes as true positives while “Sensitivity(low)” was obtained using annotated low-signal nucleosomes as true positives. SHMM misses a very large fraction of the low-signal nucleosomes, and thereby has extremely poor sensitivity. HMM has a higher sensitivity than SHMM, but a much lower specificity. The methods are comparable in terms of their specificities, except for HMM. The sensitivity analysis illustrates that the proposed NHSM based on first order differences is able to bypass the need for local detrending and automatically map nucleosome positions accurately. HMMD is the worst among all, which again illustrates that detrending the data is a difficult procedure and could potentially distort the signals in the observed data.

Table 2:
Sensitivity/specificity for the case study. Sensitivity and specificity are computed by treating the “hand picked” annotation of Yuan et al. (2005) as the gold standard.

We also compared the performance of our proposed NHSM, HMM, HMMD and SHMM (from Yuan et al. (2005)) via ROC curves, by varying the posterior probabilities of declaring a probe to be in a nucleosome (well positioned and delocalized) state using the low-signal nucleosomes as true positive set. The results are shown in Figure 14, which demonstrates that the proposed NHSM based on first order differences performs better than all the other methods. Specifically, NHSM has an AUROC statistic of 0.889 whereas the best of the other methods has only a value of 0.808.

Figure 14:
Receiver operating characteristic (ROC) curve. Comparison of various methods on MNase-chip data from Yuan et al. (2005) using the set of “hand picked” annotated low-signal nucleosomes as the true positive set.

The above benchmarking study relied on using the ”hand picked” annotation of Yuan et al. (2005) as the gold standard. We also provide an additional benchmarking experiment in Table 3 by annotating the probes in Yuan et al. (2005) based on the nucleosome calls from the higher resolution experiment of Lee et al. (2007) which used a 4 base pairs resolution tiling array. Comparisons of different methods using this annotation as the gold standard again indicate that NHSM uniformly outperforms other methods by identifying a larger percentage of low-signal nucleosomes.

Table 3:
Sensitivity/specificity for the case study using annotations from Lee et al. (2007). Sensitivity and specificity are computed by treating the annotation of Lee et al. (2007) as the gold standard.

5.2. Application to high resolution MNase-Chip and MNase-Seq data

Next, we will illustrate the applicability of our proposed NHSM in mapping nucleosome occupancy on the high resolution MNase-Chip (Lee et al.; 2007) and MNase-Seq data (Shivaswamy et al.; 2008). In both cases, we smoothed the data via Gaussian kernel smoothing described in Section 3 and defined Ot = Xt+4Xt−5 (k = 5). Details are given in Appendix A.2.4. We first demonstrate the utility of our proposed NHSM in annotating the CHA1 and HIS3 promoters in the 4 base pairs resolution MNase-Chip data of Lee et al. (2007) (Figure 15). We observe that our proposed NHSM detects the lowsignal nucleosomes between positions 17050 and 17200 in the CHA1 promoter (labeled 1 in Figure 15) and between positions 722700 and 722850 in the HIS3 promoter (labeled 2 in Figure 15), which were missed by the original SHMM annotation of Lee et al. (2007). These two low-signal nucleosomes were also identified by Shivaswamy et al. (2008) in their MNase-Seq data, indicating that they are not artifacts of hybridization.

Since our modeling framework utilizes first order differences which capture the “bump” shape of a nucleosome and not the observed log base 2 ratios in the emission distribution, it can be applied to first order (lagged) differences on tag counts/reads in MNase-Seq data. In Shivaswamy et al. (2008), 514803 uniquely aligned reads were generated for the normal cells via the sequencing technology. We considered the following strategy for mapping nucleosome positions onMNase-Seq data. Since each of the 27 base pairs Solexa sequencing read corresponds to a mono-nucleosome of size 150–200 base pairs, we first extended these reads to 150 base pairs according to the sequence orientation for both the plus and minus strands. The total reads for each genomic position is then taken to be the sum of all extended reads at the position, as shown in Figure 16. Therefore, the total reads at every 50 base pairs on the genome is analogous to the observed log base 2 ratios in MNase-chip data of 50 base pairs resolution.

Figure 16:
Illustration of obtaining reads for each genomic position in ChIP-Seq data. White rectangles are reads mapped to the plus strand and the black rectangles are reads mapped to the minus strand. Panel B shows the extended reads (150 base pairs). Panel C ...

We also illustrate the applicability of our proposed NHSM in annotating these two promoter regions in the MNase-Seq data of Shivaswamy et al. (2008) using a 4 base pairs resolution to facilitate direct comparison against the 4 base pairs MNase-Chip data of Lee et al. (2007). Of particular interest is the ability of NHSM to decode three low-signal nucleosomes between positions 722200 and 722700 in the HIS3 promoter (labeled 3, 4 and 5 in Figure 17), which were missed by the original annotation of Shivaswamy et al. (2008) as shown in Figure 17. The NHSM annotation for these three nucleosomes is consistent with the annotation from Lee et al. (2007). Both Figures 15 and and1717 also illustrate that our proposed NHSM results in the most consistent, i.e., greatest overlaps in annotations of nucleosome positions between MNase-Chip and MNase-Seq data. This is desirable given that both data sets measure the same nucleosome occupancy in yeast S. cerevisiae.

6. Discussion

The ability to map nucleosome positions accurately is crucial for investigating changes in nucleosome occupancies and their relationship to gene regulation since losses/gains in occupancy usually occur at one or two nucleosomes as illustrated in Shivaswamy et al. (2008). We introduced a non-homogeneous hidden-state model (NHSM) that automatically maps nucleosome positions based on either high-throughput tiling array or sequencing data and is computationally efficient. The modeling framework utilizes first order (lagged) differences which capture the “bump” shape that characterize a nucleosome and enable accurate mapping of nucleosome-linker boundaries. The NHSM bypasses the need for further local detrending which misses low-signal nucleosomes (Figure 11). We also demonstrated the pitfalls of detrending the data with a simple method of comparing peak and trough within a window size covering a nucleosome (HMMD). Such a detrending introduced higher noise levels to the data in both the simulations and a case study of yeast nucleosome occupancy data. Modeling the emission distribution on first order differences allows our method to be applicable to both the MNase-Chip and MNase-Seq data, since the defining characteristic of a nucleosome in both cases is the “bump” shape. By allowing a duration distribution, NHSM is able to capture the fuzzy nucleosomes, which have heterogenic and dynamic positions.

The only preprocessing step required before applying our proposed NHSM in detecting nucleosome positions is data smoothing. We have illustrated in the case studies that simple smoothing such as moving average in a window size of 3 is generally sufficient for lower resolution tiling arrays, e.g, the MNase-Chip data of Yuan et al. (2005). For high resolution tiling arrays and sequencing data which have lower signal-to-noise ratio, we proposed a Gaussian kernel smoothing with a bandwidth chosen based on the size of nucleosomal DNA, and demonstrated that this smoothing is able to denoise and enhance the “bump” shape of a nucleosome. Recently, Yassour et al. (2008) introduced a method for improving the resolution of the nucleosome positions in low resolution tiling arrays with overlapping probe design. In this paper, we aimed to develop a general method applicable to both low and high resolution nucleosome occupancy data by specifically capturing the defining characteristic, that is the “bump” shape of a nucleosome. However, the main idea of partitioning a probe into smaller fragments in Yassour et al. (2008) can be easily adapted into our framework by generating a pseudo MNase-Chip with higher resolution and thereby improving the resolution of nucleosome positions. We provide an illustration of this point in Appendix A.3.

The numerous examples and extensive simulations provided in this paper demonstrate that our proposed method is able to detect linker regions that are represented by only one/two probes, low-signal nucleosomes (Figures 10 and and11)11) and outperforms currently available methods. Although the underlying architecture of our NHSM is simple, it is effective in detecting nucleosome occupancies in both low and high resolution MNase-Chip and MNase-Seq data. Furthermore, in the datasets that we used for illustration, nucleosomal DNA was isolated by MNase but our approach is applicable to tiling arrays and sequencing technologies regardless of how the nucleosomes are isolated. We conclude by stressing that accurate annotation of nucleosome occupancy based on data from high-throughput experiments under various physiological conditions forms the basis of comparing different samples to elucidate the dynamics of nucleosome occupancy. Some examples of this line of work are by Shivaswamy et al. (2008) and Schones et al. (2008), and we anticipate that the number such examples will keep growing.

Acknowledgments

This research has been supported in part by a PhRMA Foundation Research Starter Grant in Informatics (P.K. and S.K.), the NIH grant HG003747 (P.K. and S.K.), the NSF grant DMS004597 (S.K.), the Morgridge Institute for Research support for Computation and Informatics in Biology and Medicine (P.K), NSF CAREER Award #0447887 (A.P.G and D.J.H.) and an NIGMS grant to the Molecular Biosciences Training Program T32GM007215 (D.J.H.).

A Appendix. 

A.1.  Choice of smoothing for nucleosome occupancy

We investigate various smoothing algorithms for denoising the observed log base 2 ratios in tiling arrays measuring nucleosome occupancy. We will illustrate the performance of the selected smoothing algorithms on the high resolution tiling arrays (4 base pairs resolution) from Lee et al. (2007). The original log base 2 ratios from one replicate/array for the CHA1, HIS3 and SAC7 promoters are shown in the top left panels of Figures 18, ,19,19, and and20,20, respectively.

Figure 18:
Nucleosome occupancy in CHA1 promoter. Different panels illustrate various smoothing algorithms. Vertical dotted lines are boundaries separating nucleosome-linker states from Lee et al. (2007). Red horizontal lines at y =1.5 are the nucleosome annotations ...
Figure 19:
Nucleosome occupancy in HIS3 promoter. Different panels illustrate various smoothing algorithms. Vertical dotted lines are boundaries separating nucleosome-linker states from Lee et al. (2007). Red horizontal lines at y =1.5 are the nucleosome annotations ...
Figure 20:
Nucleosome occupancy in SAC7 promoter. Different panels illustrate various smoothing algorithms. Vertical dotted lines are boundaries separating nucleosome-linker states from Lee et al. (2007). Red horizontal lines at y = 1.5 are the nucleosome annotations ...

We have shown that data smoothing based on moving average in a window size (2w + 1) of 3 probes works well in the lower resolution nucleosome data from Yuan et al. (2005). However, the data from Lee et al. (2007) has lower signal-to-noise ratio compared to Yuan et al. (2005). Therefore, one pass moving average using a window size of 2w + 1 = 3 is not sufficient, as shown in top middle panels of Figures 18, ,19,19, and and20.20. One possible solution is to use a larger w. Since a nucleosomal DNA is 146 base pairs long, a nucleosome occupied region will span approximately 32 probes. Hence, consider moving averages in a window size of 32 probes. As given in the top right panels of Figures 18, ,19,19, and and20,20, using a larger w is able to increase the signal-to-noise ratio, although the denoised log base 2 ratios still exhibit some wiggly pattern.

Next, as an alternative for using a larger window size, we consider iterated moving averages using a window size of 3 probes. Let S and S(Y )t denote the smoother function and the resulting smoothed value at probe t, respectively. For example, S(Y)t=j=twt+wYj/(2w+1) in the moving average approach. Let k denote the number of iterations and Sk be the resulting smoother function at the k-th iteration. Then,

S1(Y)t=13(Yt1+Yt+Yt+1)S2(Y)t=132(Yt2+2Yt1+3Yt+2Yt+1+Yt+2)S3(Y)t=133(Yt3+3Yt2+6Yt1+7Yt+6Yt+1+3Yt+2+Yt+3)S4(Y)t=134(Yt4+4Yt3+10Yt2+16Yt1+19Yt+16Yt+1+10Yt+2            +4Yt+3+Yt+4)          :

The smoother function has the following general form:

Sk(Y)t=13kj=tkt+kcjYj,

where ctj = ct+j, j = 1, ..., k (symmetry), ct−kct−k+1 ≤ . . . ≤ ct−1 and j=tkt+kcj=3k. The bottom left panels of Figures 18, ,19,19, and and2020 plot the resulting smoothed log base 2 ratios for the m-th iteration, where m is minimum k such that |Sk(Y) − Sk−1(Y) |L2 ≤ ε, for ε = 10−6.

Next, we provide a simple analytical result for smoothing based on iterated moving averages. The collection of weights in front of the log base 2 ratios in Sk can be viewed as the resulting probability mass function from the sum of k independent and identically distributed discrete uniform random variables Ui taking values {−1, 0, 1}, denoted by f. In particular, f is related to weights in Sk(Y ) as follows:

f(i=1kui)=cj3k  fori=1kui=j{k,k+1,,k1,k}

We provide the form of f for k = 2 and k = 3:

f(u1+u2)={19for u1+u2{2,2},29for u1+u2{1,1},39for u1+u2{0}.

f(u1+u2+u3)={127  for  u1+u2+u3{3,3},327  for  u1+u2+u3{2,2},627  for  u1+u2+u3{1,1},727  for  u1+u2+u3{0}.

Then, by the Central Limit Theorem,

i=1kUi/kDN(0,σ2)  as  k,

where σ2 = (32 − 1)/12. Therefore, for a sufficiently large fixed K,

f(i=1KUi=j)φ(j|0,Kσ2)j=KKφ(j|0,Kσ2),

i.e., a discretized Gaussian on integer support { −K, ..., K}, where φ(z|μ,σ2)=exp[(zμ)2/2σ2]/σ2π. In other words, the weights cj/3K, j = 1, ..., K, in SK(Y )t are approximately φ(j|t,Kσ2)/j=tKt+Kφ(j|t,Kσ2). As the number of iterations (K) increases, [var phi](j|t, Kσ2) → 0. Therefore,

SK(Y)t=13Kj=tKt+KcjYj         j=tKt+Kφ(j|t,Kσ2)Yjj=tKt+Kφ(j|t,Kσ2),

and

limKj=tKt+Kφ(j|t,Kσ2)Yjj=tKt+Kφ(j|t,Kσ2)=Y¯.

That is, the smoothed log base 2 ratios from iterated moving average become flatter and flatter approaching a constant Y, and the choice of ε above is critical to avoid over-smoothing.

The analytical result above also shows that the iterated moving average is approximately a Gaussian kernel smoother for large k. In kernel smoothing, the tuning parameter is the bandwidth h. Large bandwidth implies more smoothing, and vice versa. For Gaussian kernel, h is also the standard deviation. That is,

Xt=j=0Tφ(|GtGj|h)Yjj=0Tφ(|GtGj|h),

where Gj is the genomic coordinate corresponding to Yj in base pairs. We propose choosing h based on the size of the nucleosomal DNA. For a Gaussian distribution, 99% of the values are within ±2.5σ, where σ = h. We choose h = 146/5 so that the bandwidth spans the size of a nucleosome. The bottom right panels of Figures 18, ,19,19, and and2020 illustrate the resulting smoothed log base 2 ratios from Gaussian kernel smoothing with this choice of bandwidth. The Gaussian kernel smoothing is able to denoise the data and enhance the “bump” characteristics of a nucleosome, i.e., a series of decreasing positive slopes, followed by slopes of approximately zero in magnitude and then a series of increasing negative slopes.

We also investigated another simple non-linear smoothing, i.e., iterated moving median (Tukey; 1977). This smoother is more robust and less sensitive to sudden jumps, and is usually recommended over moving average. Mallow (1979) proved that iterated moving median converges for odd-numbered spans (w). However, as shown in the bottom middle panels of Figures 18, ,19,19, and and20,20, smoothing based on iterated moving median is less desirable since the “bump” shape of the nucleosomes is not well characterized compared to Gaussian kernel smoothing/iterated moving average. Another popular denoising algorithm is wavelet smoothing, which requires more tuning parameters (i.e., wavelet type, decomposition level and thresholds). The wavelet smoothing is utilized by Zhang et al. (2008) recently for detecting nucleosomes in ChIP-Seq. However, since we have illustrated that a Gaussian kernel smoothing is sufficient for denoising the log base 2 ratios from both the high resolution MNase-Chip and MNase-Seq data with a justified choice of bandwidth, we do not explore the more sophisticated wavelet smoothing.

A.2.  Model fitting for NHSMwith the expectation maximization algorithm

Let Qt be the hidden state latent variable for mid-probe t and λ = (π, A, B) denote the model parameters, where A is the transition probability matrix and B is the emission distribution. Also define Zt = Ot + Zt−1 for t = 1, ..., T and Z0 (Z0 = X0 if Ot = XtXt−1 and 0 = X0+...+X2k−2 if Ot = Xt+k−1Xtk. See Appendix A.2.4), and let O(T) = (O1, ...,OT).

Two assumptions arising from Figure 4 are:

P(Qt+1|Q(t),Z0,O(t),λ)=P(Qt+1|Qt,Zt,λ)  for  t=1,,T1,
(3)

P(Ot|Q(t),Z0,O(t1),λ)=P(Ot|Qt,λ)  for  t=1,,T.
(4)

For simplicity, we assume Z0 is fixed. The complete data likelihood is given by:

P(O(T),Q(T)|Z0,λ)=P(Q1|Z0,λ)t=1T[P(Ot|Q(t),Z0,O(t1),λ)]t=1T[P(Ot+1|Q(t),Z0,Q(t),λ)]=P(Q1|Z0,λ)t=1T[P(Ot|Qt,λ)]t=1T1[P(Qt+1|Qt,Zt,λ)]=πQ1t=1TbQt(Ot)t=1T1aQt,Qt+1(Zt).

The first equality follows from repeated application of conditional probability. The second equality follows from the two assumptions above.

Assume that there are N hidden states. Then, the complete data log likelihood is given by:

log P(O(T),Q(T)|Z0,λ)=log[i=1NπiI(Q1=i)] [t=1Ti=1Nbi(Ot)I(Qt=i)]                                      [t=1T1i=1Nj=1Nai,j(Zt)I(Qt=i,Qt+1=j)].

We assume that

bi(Ot)=N(μi,σi2),

ai,j(Zt)=exp(γi,j+βjZt)k=1Nexp(γi,k+βkZt)

Expected complete log likelihood is given by

E[log P(O(T),Q(T)|Z0,λ)]=i=1NP(Q1=i|O(T),Z0,λ) log πi+t=1Ti=1NP(Qt=i|O(T),Z0,λ) log[12πσi2exp((Otμi)22σi2)]+t=1T1i=1Nj=1NP(Qt=i,Qt+1=j|O(T),Z0,λ)log ai,j(Zt).

E-step:

Define two variables γt(i) and ξt+1(i, j):

γt(i)=P(Qt=i|O(T),Z0,λ)=αt(i)βt(i)i=1Nαt(i)βt(i)=αt(i)βt(i)i=1NαT(i)=j=1NP(Qt=i,Qt+1=j|O(T),Z0,λ),

ξt+1(i,j)=P(Qt=i,Qt+1=j|O(T),Zt,λ)=αt(i)ai,j(Zt)bj(Ot+1)βt+1(j)i=1Nj=1Nαt(i)ai,j(Zt)bj(Ot+1)βt+1(j)=αt(i)ai,j(Zt)bj(Ot+1)βt+1(j)i=1NαT(i),

where αt(i) = P(O1, ..., Ot, Qt = i | Z0, λ) and βt(i) = P(Ot+1, ..., OT | Qt = i, Z0, λ).

M-step:

max E[log P(O(T),Q(T)|Z0,λ)]                                      s.t     i=1Nπi=1,                                                j=1Nai,j(Zt)=1,    t=1,,T1

yields

   πi=γ1(i),{μ^1=t=1Tγt(BN)Ott=1Tγt(BL)Ott=1Ti{BN,BL}γt(i),μ^2=t=1Ti{N1,N2a,L3}γt(i)Ott=1Ti{N2c,N3,L1}γt(i)Ott=1Ti{N1,N2a,N2c,N3,L1,L3}γt(i),     if  μ^1μ^2,μ^1=μ^2=t=1Ti{BN,N1,N2a,L3}γt(i)Ott=1Ti{BL,N2c,N3,L1}γt(i)Ott=1Ti{BN,BL,N1,N2a,N2c,N3,L1,L3}γt(i),if     μ^1<μ^2,σ^i2=t=1Tγt(i)(Otμ^i)2t=1Tγt(i).

The non-parametric transition probabilities for the case study are updated as follows:

a^i,j=t=1T1ξt+1(i,j)t=1T1γt(i)  for  iN3,BL,L3,a^N3,BLp=t=1T1ξt+1(N3,BL)I(Zt0)t=1T1γt(N3)I(Zt0),a^BL,BNn=t=1T1ξt+1(BL,BN)I(Zt<0)t=1T1γt(BL)I(Zt<0),a^L3,BNn=t=1T1ξt+1(L3,BN)I(Zt<0)t=1T1γt(L3)I(Zt<0).

Note that the parameters in the logistic regression model cannot be solved analytically. These parameters can be optimized via a conjugate gradient algorithm. The details can be found in Robertson et al. (2004). The computation of αt(i) and βt(i) is based on the well-known forward and backward procedures (Rabiner; 1989).

A.2.1. Forward procedure

Let αt = P(O1, O2, ..., Ot, Qt = i | Z0, λ).

  • Initialization:
    • α1(i) = P(O1, Q1 = i | Z0, λ) = πibi(O1), for 1 ≤ iN.
  • Induction:
    • αt+1(j) = [i=1Nαt(i)ai,j(Zt)]bj(Ot+1), for 1 ≤ tT − 1, 1 ≤ jN.
  • Termination:
    • P(O(T) | Z0, λ) = i=1NαT(i).

A.2.2. Backward procedure

Let βt(i) = P(Ot+1, ..., OT | Qt = i, Z0, λ).

  • Initialization:
    • βT (i) = 1, for 1 ≤ iN.
  • Induction:
    • βt(i) = j=1Nai,j(Zt)bj(Ot+1)βt+1(j), for t = T−1, T−2, ..., 1, 1 ≤ jN.
  • Termination:
    • P(O(T) | Z0, λ) = i=1Nβ1(i)bi(O1)πi.

The optimal hidden state sequence is obtained via Viterbi algorithm (Rabiner; 1989).

A.2.3. Viterbi algorithm

Define δt(i) = maxQ1,...,Qt−1 P(Q1, ..., Qt = i, O1, ..., Ot | Z0, λ).

  • Initialization:
    • δ1(i) = πibi(O1), for 1 ≤ iN,
    • ψ1(i) = 0, for 1 ≤ iN.
  • Recursion:
    • δt(j) = max1≤iNt−1(i)ai,j(Zt) ]bj(Ot), for 2 ≤ tT, 1 ≤ jN,
    • ψt(j) = argmax1≤iNt−1(i)ai,j(Zt)], for 2 ≤ tT, 1 ≤ jN.
  • Termination:
    • P(O(T) | Z0, λ)* = max1≤iNT (i) ],
    • QT*=argmax1iN[δT(i)].
  • Path backtracking:
    • Qt*=ψt+1(Qt+t*), for t = T − 1, T − 2, ..., 1.

A.2.4. Details of the NHSM for high resolution MNase-Chip and MNase-Seq data

For a general first order lagged differences Ot = Xt+k−1Xtk, Ot is defined for t = k, ..., Tk + 1. We define Zt = Ot + Zt−1, and where Zt = Xt+1−k + ...+Xt+k−1. To initialize the chain, let Zk−1 = X0 +...+X2k−2. Let us define t+1−k = Zt and t+1−k = Ot so that the chain starts at 1. Therefore, for the examples in Figure 15 and Figure 17:

  • Initialize with 0 = Z4 = X0 + ... + X8.
  • Let 1 = O5 = X9X01 = Z5 = X1 + ... + X9.
  • Let 2 = O6 = X10X12 = Z6 = X2 + ... + X10 and so forth.

The transition probabilities for Figure 15 are:

aN3,BL(Z˜t)={1,aN3,BLp,if Z˜t/(2k1)<0,if Z˜t/(2k1)0,aBL,BN(Z˜t)={aBL,BNn,1,if Z˜t/(2k1)<0,if Z˜t/(2k1)0,aL3,BN(Z˜t)={aL3,BNn,1,if Z˜t/(2k1)<0,if Z˜t/(2k1)0.

The transition probabilities for Figure 17 are:

aN3,BL(Z˜t)={1,aN3,BLp,if Z˜t/(2k1)<5,if Z˜t/(2k1)5,aBL,BN(Z˜t)={aBL,BNn,1,if Z˜t/(2k1)<5,if Z˜t/(2k1)5,aL3,BN(Z˜t)={aL3,BNn,1,if Z˜t/(2k1)<5,if Z˜t/(2k1)5.

Here, the transition is based on t/(2k − 1) which is the average log base 2 ratio in the window containing probes t, t + 1, ..., t + 2k − 2. The mean tag count in Chromosome 3 is 5.2. Therefore, we choose 5 as the threshold for the transition probabilities.

A.3.  Increasing resolution of MNase-Chip data

Recently, Yassour et al. (2008) introduced a method for mapping nucleosome positions, tailored for constant low resolution tiling arrays with overlapping probe design. Their method aimed to improve the resolution of the nucleosome positions by partitioning a probe into smaller fragments. This idea can be adapted into our model as follows: Let Yi−1, Yi and Yi+1 be the log base 2 ratios for 3 consecutive probes from a design similar to Yuan et al. (2005). Partition each Yi into 5 segments Pi1, Pi2, …, Pi5 (Figure 21) as in Yassour et al. (2008). Since two probes overlap by 30 base pairs, Pi13 and Pi1 represent the same genomic region. Similarly, for each of ( Pi14, Pi2),( Pi15, Pi3, Pi+11),( Pi4, Pi+12) and ( Pi5, Pi+13), we have the same genomic region. We can create a pseudo MNase-Chip with 9 probes having log base 2 ratios Ri, where R1 = R2 = Yi−1, R3 = R4 = (Yi−1 + Yi) / 2, R5 = (Yi−1 + Yi + Yi+1) / 3, R6 = R7 = (Yi + Yi+1)/2 and R8 = R9 = Yi+1 as shown in Figure 21. After generating a pseudo MNase-Chip with higher resolution, our proposed NHSM can be readily used to detect nucleosome occupancy.

Figure 21:
Increasing resolution of tiling arrays via pseudo probes. This is an illustration on how we can adapt the idea of Yassour et al. (2008) in creating a pseudo MNase-Chip data from constant low resolution tiling arrays with overlapping probe design. Yi−1 ...

References

  • Albert I, Mavrich T, Tomsho L, Qi J, Zanton S, Schuster S, Pugh B. Translational and rotational settings of H2A.Z nucleosomes across the saccharomyces cerevisiae genome. Nature. 2007;446:572–576. doi: 10.1038/nature05632. [PubMed] [Cross Ref]
  • Bernstein B, Liu C, Humphrey E, Perlstein E, Schreiber S. Global nucleosome occupancy in yeast. Genome Biology. 2004;5(62) doi: 10.1186/gb-2004-5-9-r62. [PMC free article] [PubMed] [Cross Ref]
  • Chakravarthy S, Park Y, Chodaparambil J, Edayathumangalam R, Luger K. Structure and dynamic properties of nucleosome core particles, FEBS Letters. 2006;579(4):895–898. [PubMed]
  • Ercan S, Carrozza MJ, Workman JL. Global nucleosome distribution and the regulation of transcription in yeast. Genome Biology. 2004;5(10) doi: 10.1186/gb-2004-5-10-243. doi:10.1186/gb–2004–5–10–243. [PMC free article] [PubMed] [Cross Ref]
  • Johnson SM, Tan FJ, McCullough HL, Riordan DP, Fire AZ. Flexibility and constraint in the nucleosome core landscape of caenorhabditis elegans chromatin. Genome Research. 2006;16:1505–1516. doi: 10.1101/gr.5560806. [PubMed] [Cross Ref]
  • Kaplan T, Liu C, Erkmann J, Holik J, Grunstein M, Kaufman P, Friedman N, Rando O. Cell cycle- and chaperone-mediated regulation of h3k56ac incorporation in yeast. PLoS Genetics. 2008 doi: 10.1371/journal.pgen.1000270. [PMC free article] [PubMed] [Cross Ref]
  • Kornberg R, Lorch Y. Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome. Cell. 1999;98:285–294. doi: 10.1016/S0092-8674(00)81958-3. [PubMed] [Cross Ref]
  • Lee C, Shibata Y, Rao B, Strahl B, Lieb J. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nature Genetics. 2004 doi: 10.1038/ng1400. [PubMed] [Cross Ref]
  • Lee W, Tillo D, Bray N, Morse R, Davis R, Hughes T, Nislow C. A high-resolution atlas of nucleosome occupancy in yeast. Nature Genetics. 2007 doi: 10.1038/ng2117. [PubMed] [Cross Ref]
  • Liu C, Kaplan T, Kim M, Buratowski S, Schreiber S, Friedman N, Rando O. Single-nucleosome mapping of histone modifications in s.cerevisiae. PLoS Biol. 2005;3(10):1753–1769. doi: 10.1371/journal.pbio.0030328. [PMC free article] [PubMed] [Cross Ref]
  • Mallow C. 1979. Some theoretical results on Tukeys 3R smootherLecture Notes in Mathematics, Springer Berlin/Heidelberg
  • Millar C, Grunstein M. Genome-wide patterns of histone modifications in yeast. Nature Reviews Molecular Cell Biology. 2006;7:657–666. doi: 10.1038/nrm1986. [PubMed] [Cross Ref]
  • Rabiner L. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989;77(2):257–286. doi: 10.1109/5.18626. [Cross Ref]
  • Robertson A, Kirshner S, Smyth P. Downscaling of daily rainfall occurrence over northeast brazil using a hidden markov model. Journal of Climate. 2004;17(22):4407–4424. doi: 10.1175/JCLI-3216.1. [Cross Ref]
  • Schones DE, Cui K, Cuddapah S, Roh TY, Barski A, Wang Z, Wei G, Zhao K. Dynamic regulation of nucleosome positioning in the human genome. Cell. 2008;132(5):887–898. doi: 10.1016/j.cell.2008.02.022. [PubMed] [Cross Ref]
  • Shivaswamy S, Bhinge A, Zhao Y, Jones S, Hirst M, Iyer V. Dynamic remodeling of individual nucleosomes across a eukaryotic genome in response to transcriptional perturbation. PLOS Biology. 2008;6(3):618–630. doi: 10.1371/journal.pbio.0060065. [PMC free article] [PubMed] [Cross Ref]
  • Shivaswamy S, Iyer V. Stress-dependent dynamics of global chromatin remodeling in yeast: dual role for swi/snf in the heat shock stress response. Molecular and celular biology. 2008;28(7):2221–2234. doi: 10.1128/MCB.01659-07. [PMC free article] [PubMed] [Cross Ref]
  • Tukey J. Exploratory data analysis. Addison-Wesley; 1977.
  • Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, Malek J, Costa G, McKernan K, Sidow A, Fire A, Johnson S. A high-resolution, nucleosome position map of C.elegans reveals a lack of universal sequence-dictated positioning. Genome Research. 2008;18:1051–1063. doi: 10.1101/gr.076463.108. [PubMed] [Cross Ref]
  • Yassour M, Kaplan T, Jaimovich A, Friedman N. Nucleosome positioning from tiling microarray data. Bioinformatics. 2008;24:139–146. doi: 10.1093/bioinformatics/btn151. [PMC free article] [PubMed] [Cross Ref]
  • Yuan G, Liu Y, Dion M, Slack M, Wu L, Altschuler S, Rando O. Genome-scale idenfication of nucleosome positions in s.cerevisiae. Science. 2005;309:626–630. doi: 10.1126/science.1112178. [PubMed] [Cross Ref]
  • Zhang Y, Shin H, Song J, Lei Y, Liu X. Identifying positioned nucleosomes with epigenetic marks in human from chip-seq. BMC Genomics. 2008;9(1) doi: 10.1186/1471-2164-9-537. [PMC free article] [PubMed] [Cross Ref]
  • Zucchini W, Raubenheimer D, MacDonald I. Modeling time series of animal behavior by means of a latent-state model with feedback. Biometrics. 2008;64:807–815. doi: 10.1111/j.1541-0420.2007.00939.x. [PubMed] [Cross Ref]

Articles from Statistical Applications in Genetics and Molecular Biology are provided here courtesy of Berkeley Electronic Press