Home | About | Journals | Submit | Contact Us | Français |

**|**Stat Appl Genet Mol Biol**|**PMC2861327

Formats

Article sections

- Abstract
- 1. Introduction
- 2. Motivation
- 3. Hidden-state model for mapping nucleosome positions
- 4. Simulation studies
- 5. Case studies
- 6. Discussion
- References

Authors

Related links

Stat Appl Genet Mol Biol. 2009 January 1; 8(1): 29.

Published online 2009 June 19. doi: 10.2202/1544-6115.1454

PMCID: PMC2861327

NIHMSID: NIHMS193955

Copyright © 2009 The Berkeley Electronic Press. All rights reserved

This article has been cited by other articles in PMC.

The ability to map individual nucleosomes accurately across genomes enables the study of relationships between dynamic changes in nucleosome positioning/occupancy and gene regulation. However, the highly heterogeneous nature of nucleosome densities across genomes and short linker regions pose challenges in mapping nucleosome positions based on high-throughput microarray data of micrococcal nuclease (MNase) digested DNA. Previous works rely on additional detrending and careful visual examination to detect low-signal nucleosomes, which may exist in a subpopulation of cells. We propose a non-homogeneous hidden-state model based on first order differences of experimental data along genomic coordinates that bypasses the need for local detrending and can automatically detect nucleosome positions of various occupancy levels. Our proposed approach is applicable to both low and high resolution MNase-Chip and MNase-Seq (high throughput sequencing) data, and is able to map nucleosome-linker boundaries accurately. This automated algorithm is also computationally efficient and only requires a simple preprocessing step. We provide several examples illustrating the pitfalls of existing methods, the difficulties of detrending the observed hybridization signals and demonstrate the advantages of utilizing first order differences in detecting nucleosome occupancies via simulations and case studies involving MNase-Chip and MNase-Seq data of nucleosome occupancy in yeast *S. cerevisiae*.

Nucleosomes are the fundamental structural units of chromatin and consist of approximately 146 base pairs of DNA wrapped around a histone octamer (Kornberg and Lorch; 1999; Chakravarthy et al.; 2006). The precise positioning of nucleosomes along the genome has been implicated in the regulation of gene expression (reviewed in Ercan et al. (2004)). Packaging of DNA into nucleosomes may prevent DNA binding proteins from accessing their sites, recruit transcriptional activators or repressors, and bring distant DNA sequences into close proximity to promote transcription (Millar and Grunstein; 2006). In the last couple of years, nucleosome positions have been mapped across the genomes of *S. cerevisiae* (Yuan et al.; 2005; Lee et al.; 2007; Shivaswamy et al.; 2008), *C. elegans* (Johnson et al.; 2006), and humans (Schones et al.; 2008) in various cell types and under a variety of physiological perturbations. These studies have revealed various chromatin remodeling patterns in transcriptional regulation at nucleosome resolution. In particular, Shivaswamy et al. (2008) showed that gene activation in yeast is mainly accompanied by the loss of one or two nucleosomes in the promoter regions, while Lee et al. (2007) illustrated that functionally related genes share similar nucleosome occupancy patterns across their promoters. Nucleosome occupancy can hinder the binding of transcription factors to their consensus motifs, and the fraction of bound motifs vary between nucleosomes and nucleosome free regions (Yuan et al.; 2005). These findings illuminated that identifying locations of individual nucleosomes accurately is essential for studying the effect of dynamic changes in nucleosome occupancy in the control of gene regulation. By having a reliable map of nucleosome occupancy, one can investigate various histone modifications at the nucleosome level to uncover the complex mechanism in transcriptional reprogramming. Similarly, for studies investigating the effect of physiological perturbations on nucleosome positioning, the starting point often involves maps of nucleosome occupancy before and after such perturbations. For example, nucleosome mapping experiments of Schones et al. (2008) in resting and activated human CD4^{+} T cells revealed specific reorganization patterns of nucleosomes in promoter and enhancer regions of the genome.

Numerous high-throughput experiments have been carried out to map nucleosome occupancy in *S. cerevisiae* via tiling arrays (Liu et al.; 2005; Yuan et al.; 2005; Lee et al.; 2007; Shivaswamy and Iyer; 2008; Kaplan et al.; 2008). More recently, a high resolution whole genome nucleosome map for yeast genome was developed via a high throughput sequencing technology (Albert et al.; 2007; Shivaswamy et al.; 2008). In both platforms, the sample input consists of mono-nucleosomes prepared via micrococcal nuclease (MNase) digestions, which degrades all but the DNA wrapped around histone proteins. Two nucleosomes are connected by linker DNA, which is digested by the enzyme. The digested sample is either sequenced by high-throughput sequencing technologies (MNase-Seq), or competitively hybridized against a control sample using high density tiling arrays in (MNase-Chip). A high percentage of the *S.cerevisiae* genome is known to be occupied by nucleosomes, however there exists substantial variation in nucleosome density across the genome. In particular, relatively higher density of nucleosomes is observed at transcribed regions and lower density is found in intergenic regions (Lee et al.; 2004; Bernstein et al.; 2004; Lee et al.; 2007; Shivaswamy et al.; 2008). Positions of nucleosomes across the whole genome are characterized by a stretch of consecutive probes encompassing approximately 146 base pairs with higher signals than the background. An interesting feature observed in many of the MNase-Chip experiments for mapping nucleosome positions is that the magnitude of log base 2 ratios for regions occupied by nucleosomes exhibit large variability. Specifically, some regions of the genome thought to be occupied by nucloesomes actually show log base 2 ratios below the baseline. Yuan et al. (2005) provided substantial evidence of this problem and referred to this phenomena as unpredictable trends in hybridization. The variability in the magnitudes of nucleosome occupancy is also observable from the high resolution MNase-Chip data of Lee et al. (2007) and MNase-Seq data of Shivaswamy et al. (2008). This trend in hybridization can be attributed to the heterogeneity of nucleosome densities across the whole genome, resulting in both stable and unstable nucleosome occupancies. Unstable or low-signal nucleosomes are nucleosome peaks having low maxima and may correspond to nucleosomes found only in a subpopulation of cells (Yuan et al.; 2005). These low-signal nucleosomes may be the most important and dynamic in regulating transcription by cycling on and off the DNA. We will refer to these as “low-signal nucleosomes” in our subsequent discussion.

Previous work in identifying nucleosome positions in MNase-Chip data include using a hidden Markov model (HMM) (Yuan et al.; 2005) on the observed log base 2 ratios. Yuan et al. (2005) proposed an HMM that takes into account the length of nucleosomal DNA and allows for one emission distribution for each of the nucleosome and linker states, respectively. To account for a global trend, Yuan et al. (2005) applied the HMM to a sliding window of 40 probes and averaged the estimated model parameters and posterior probabilities over all the windows covering a fixed probe to compute the most likely hidden state path. In addition to the nucleosomes identified by the sliding window HMM, they also included additional low-signal nucleosomes obtained by comparing the median intensities of the peak and trough within a window size of 7 probes, which was regarded as a detrending procedure to further identify low-signal nucleosomes. We show that applying the proposed detrending procedure to the entire tiling array data is undesirable as it introduces higher noise level and spurious linkers/nucleosomes in simulations and a case study of yeast nucleosome occupancy. Yuan et al. (2005) also provided nucleosome positions hand picked via close visual inspection in order to capture all potential nucleosomes. However, this heuristic approach becomes tedious when mapping nucleosome occupancy in larger genomic regions.

To accommodate the serious drawbacks of existing methods, we propose a fully automated approach which identifies nucleosome occupancy and addresses the length of nucleosomal DNA and the observed trends in hybridization signals. At the core of our methodology is a non-homogeneous hiddenstate model (NHSM) tailored for MNase-Chip data measuring nucleosome occupancy. By designing the architecture for the first order (lagged) differences of log base 2 ratios, we bypass the problem of unpredictable trends in the log base 2 ratios. An additional feature of our approach is its applicability to the more recent MNase-Seq data. We illustrate the methodology and benchmark its performance against other available methods in simulations and a case study involving a 20 base pairs resolution yeast MNase-Chip nucleosome occupancy data. The results of these experiments highlight the superiority of our NHSM to other approaches especially in identifying low-signal nucleosomes. We also provide an illustration of its applicability to higher resolution MNase-Chip and MNase-Seq nucleosome occupancy data. Additionally, two consecutive nucleosomes are separated by a linker of variable length. Therefore, a good methodology for mapping nucleosome occupancy should be able to identify nucleosome-linker boundaries accurately. This is usually challenging for the low resolution tiling array design in which a linker is represented by one or two probes (Yuan et al.; 2005; Kaplan et al.; 2008). Our proposed methodology carefully exploits the structure of nucleosomes and accurately maps nucleosome positions and therefore provides a principled framework for studying other epigenetic events that rely on mapping nucleosome occupancy.

We motivate the idea behind our methodology using the MNase-Chip data from Yuan et al. (2005). We use the normalized median log base 2 ratios of the 8 replicates for illustration. The top panel of Figure 1 shows the nucleosome profile for a region in chromosome 3 in which the nucleosomes identified by Yuan et al. (2005) are marked with black lines (each vertical line representing a probe), and a stable nucleosome is represented by 6 to 8 probes. It is clear from the plot that the magnitude of log base 2 ratios of a nucleosome region exhibits large variability. Despite having heterogeneous hybridization signals, the plot suggests that a nucleosome is characterized by a peak in the local signal intensity, even if the log base 2 ratio is below the baseline. In other words, a nucleosome occupied region exhibit a “bump” shape irrespective of the actual strength in hybridization signal. In addition, this plot also suggests that using a single distribution for each of the nucleosome and linker/nucleosome depleted regions may fail to distinguish short linkers between stable or wellpositioned nucleosomes, (i.e., linkers between well-positioned nucleosomes have comparable hybridization strength to low-signal nucleosomes.)

Given the observed “bump” (or peak with low maxima) characteristic of annotated nucleosomes in the original data, we consider a simple smoothing by replacing the log base 2 ratios of probe *i* with the average values of probe *i* − 1, *i* and *i* + 1. As evident in the middle panel of Figure 1, the “bump” shape is enhanced in the smoothed data which enable easier mapping of the nucleosome positions. The “bumps” also suggest that a nucleosome occupied region is characterized by a series of decreasing positive slopes, followed by slopes of approximately zero in magnitude and then a series of increasing negative slopes. This observation forms the modeling framework of our proposed methodology. The first order differences automatically take care of the trend in hybridization and thereby bring both the low and stable nucleosomes to a comparable level. Once the nucleosome positions are obtained, one can rank the strength of each nucleosome by the average log base 2 ratios of the probes within the nucleosome.

The trend in hybridization and variability in the magnitudes of nucleosome occupancy are not specific only to lower resolution MNase-Chip data of Yuan et al. (2005) but can also be clearly seen in high resolution MNase-Chip data of Lee et al. (2007) and MNase-Seq data of Shivaswamy et al. (2008), as evident in Figures 15 and and17.17. In addition, we also illustrate that a nucleosome occupied region is characterized by a “bump” shape, which is enhanced upon smoothing in these two data sets (middle and right panels of Figures 15 and and17).17). The choice of smoothing is presented in Section 3. Since all the different data sets share similar defining characteristics of nucleosome occupancy, for expository purposes, our detailed discussion is mainly focused on the 20 base pairs resolution tiling array data of Yuan et al. (2005). This tiling array design is also used in a recent publication by Kaplan et al. (2008) for studying H3K56 acetylation in yeast. We will demonstrate that our proposed method is applicable and works well in detecting nucleosome positions in both the high resolution tiling arrays and sequencing data in Section 5.2.

To circumvent the problem of decoding nucleosome occupancy locally to accommodate for the observed local trends as in Yuan et al. (2005), we consider an alternative approach to infer nucleosome positions based on first order (lagged) differences, *O _{t}*, which we defined as:

$$\begin{array}{l}{O}_{t}={X}_{t+k-1}-{X}_{t-k},\\ {X}_{t}=S{(Y)}_{t},\end{array}$$

where *Y _{t}* is the observed log base 2 ratio of probe

$$\begin{array}{l}{O}_{t}={X}_{t}-{X}_{t-1},\\ {X}_{t}=\sum _{j=t-w}^{t+w}{Y}_{j}/(2w+1).\end{array}$$

We observe that substituting the log base 2 ratios by the corresponding moving average statistic *X _{t}*’s reduces the noise in the data and enhances the shape of peaks and troughs, but not at the expense of over smoothing the data as shown in the middle panel of Figure 1. On the other hand, for the high resolution MNase-Chip and MNase-Seq data, the simple moving average can be replaced with a Gaussian kernel smoother. A detailed motivation with some analytical results for using a Gaussian kernel smoothing is given in Appendix A.1. In kernel smoothing, the tuning parameter is the bandwidth

$${X}_{t}=\frac{{\sum}_{j=0}^{T}\phi \left(\frac{|{G}_{t}-{G}_{j}|}{h}\right){Y}_{j}}{{\sum}_{j=0}^{T}\phi \left(\frac{|{G}_{t}-{G}_{j}|}{h}\right)},$$

where (.) is the standard Gaussian probability density function and *G _{j}* is the genomic coordinate corresponding to

As motivated in Section 2, a nucleosome occupied region is characterized by a series of positive followed by negative slopes or *O _{t}*’s, while the boundaries of nucleosomes-linker regions are characterized by steeper slopes. Detecting jumps in

Consider the state transitions given in Figure 2(a) where *N _{i}*’s represent the nucleosome region states,

$$\begin{array}{cc}\sum _{i\in \left\{{N}_{2a},{N}_{2b},{N}_{2c}\right\}}d(i)+2=p-1,& 0\le d({N}_{2a}),d({N}_{2b})\le p-3,\end{array}$$

since at least one probe is from *N*_{1} and one is from *N*_{3} out of *p* − 1 probes representing a nucleosome.

In most cases, the “bump” shape of a nucleosome on tiling arrays is symmetrical, which implies that *d*(*N*_{2}* _{a}*) =

$${p}_{{N}_{2}}({d}_{1},{d}_{2},{d}_{3})=\frac{(p-3)!}{{d}_{1}!{d}_{2}!{d}_{3}!}{p}_{1}^{{d}_{1}}{p}_{2}^{{d}_{2}}{p}_{3}^{{d}_{3}},$$

where *p*_{1} + *p*_{2} + *p*_{3} = 1 and *d*_{1} + *d*_{2} + *d*_{3} = *p* − 3.

Let *b _{i}*(

$$\begin{array}{cc}{b}_{{B}_{N}}({O}_{t})\sim N({\mu}_{1},{\sigma}_{{B}_{N}}^{2}),& {b}_{{N}_{1}}({O}_{t})\sim N({\mu}_{2},{\sigma}_{{N}_{1}}^{2}),\\ {b}_{{N}_{3}}({O}_{t})\sim N(-{\mu}_{2},{\sigma}_{{N}_{5}}^{2}),& {b}_{{B}_{L}}({O}_{t})\sim N(-{\mu}_{1},{\sigma}_{{B}_{L}}^{2}),\\ {b}_{{L}_{1}}({O}_{t})\sim N(-{\mu}_{2},{\sigma}_{{L}_{1}}^{2}),& {b}_{{L}_{2}}({O}_{t})\sim N(0,{\sigma}_{{L}_{2}}^{2}),\\ {b}_{{L}_{3}}({O}_{t})\sim N({\mu}_{2},{\sigma}_{{L}_{3}}^{2}),& {b}_{{N}_{2}}({O}_{t:t+p-4})\sim N(\tilde{\mu},\sum ),\end{array}$$

where

$$\begin{array}{l}\tilde{\mu}=(\underset{{d}_{1}}{\underbrace{{\mu}_{2},\dots ,{\mu}_{2}}},\underset{{d}_{2}}{\underbrace{0,\dots ,0}},\underset{p-3-{d}_{1}-{d}_{2}}{\underbrace{-{\mu}_{2},\dots ,-{\mu}_{2}}}),\\ \sum =\text{diag(}\underset{{d}_{1}}{\underbrace{{\sigma}_{{N}_{2a}}^{2},\dots ,{\sigma}_{{N}_{2a}}^{2}}},\underset{{d}_{2}}{\underbrace{{\sigma}_{{N}_{2b}}^{2},\dots ,{\sigma}_{{N}_{2b}}^{2}}},\underset{p-3-{d}_{1}-{d}_{2}}{\underbrace{{\sigma}_{{N}_{2c}}^{2},\dots ,{\sigma}_{{N}_{2c}}^{2}}}),\end{array}$$

and 0 < *μ*_{2} < *μ*_{1}. The constraint on the mean of emission distributions is to ensure the series of decreasing positive slopes, zero slopes and followed by increasing negative slopes which characterize the “bump” shape of a nucleosome. In the case of symmetric “bump” shape, the duration density for *N*_{2} reduces to univariate density *p*(*d*_{1}) and

$$\tilde{\mu}=(\underset{{d}_{1}}{\underbrace{{\mu}_{2},\dots ,{\mu}_{2}}},\underset{p-3-2{d}_{1}}{\underbrace{0,\dots ,0}},\underset{{d}_{1}}{\underbrace{-{\mu}_{2},\dots ,-{\mu}_{2}}}).$$

The discrete duration density assumption implies that the proposed duration in the hidden states is equivalent to a hidden-state model with a larger hidden state space. We can recast the state transition in Figure 2(a) as Figure 3 which have the same complexity by considering all possible uni-directional paths transiting from *N*_{1} and incorporating the constraint ∑_{i{N2a, N2b, N2c}}*d*(*i*) +2 =*p −* 1. We can equivalently let *b*_{N2a}(*O _{t}*) ~

Let *Q _{t}* denote the hidden state for mid-probe

Let *a _{i,j}*(

$$P({Q}_{t+1}|{Q}^{(t)},{Z}_{0},{O}^{(t)})=P({Q}_{t+1}|{Q}_{t},{Z}_{t})\text{for}t=1,\dots ,T-1.$$

(1)

$$P({O}_{t}|{Q}^{(t)},{Z}_{0},{O}^{(t-1)})=P({O}_{t}|{Q}_{t})\text{for}t=1,\dots ,T.$$

(2)

To avoid overparametrization, only transitions *a _{BL}*, • (

$${a}_{i,j}({Z}_{t})=\frac{\text{exp}({\gamma}_{i,j}+{\beta}_{j}{Z}_{t})}{{\sum}_{k=1}^{N}\text{exp}({\gamma}_{i,k}+{\beta}_{k}{Z}_{t})}.$$

In cases where the data has been median centered at zero, we observe that a simpler version of the non-homogeneous transition probabilities for these three hidden states performs well (see case study). That is, we consider

$${a}_{i,j}({Z}_{t})=\{\begin{array}{l}{a}_{i,j}^{n},\text{if}{Z}_{t}0,\\ {a}_{i,j}^{p},\text{if}{Z}_{t}\ge 0.\end{array}$$

For instance, we can let *a*_{L3},* _{BN}* (

Yuan et al. (2005) attributed the heterogeneous nucleosome density to unpredictable trends in hybridization data. They applied the HMM to a sliding window of 40 consecutive probes to address this issue. Hidden states decoding via the Viterbi algorithm was based on average values of the model parameters and posterior probabilities of all windows containing a fixed probe. We referred to this method as sliding window HMM (SHMM). SHMM is computationally intensive and requires one to select the window size, which depends on the trend in hybridization. Yuan et al. (2005) also proposed detrending the data by comparing the magnitude of peak and trough locally to capture lowsignal nucleosomes. In particular, for each probe, they considered a window size of 7 probes (~ size of a nucleosome) centered at the probe and replaced the observed log base 2 ratio by the difference between the median of log base 2 ratios within the window and the minimum log base 2 ratio of the two probes adjacent to this window. They observed that the trend was effectively eliminated using this procedure. We referred to this method as HMMD (detrending followed by usual HMM to infer nucleosome/linker states).

In the first simulation, we generated the data using the HMM hidden states architecture in Figure 5(a) (or Figure S1E of Yuan et al. (2005)), in which well-positioned nucleosomes were represented by 6 to 8 probes (N1-N8) and delocalized nucleosomes (D1-D9) covered at least 9 probes. Nucleosome regions were expected to have high log base 2 ratios whereas linker regions had lower values. The hidden state transitions in Yuan et al. (2005) allowed for linker regions (L) to have variable length. Conditioned on the hidden states, the observed log base 2 ratios were generated from Gaussian distributions, with mean 0.7, standard deviation (s.d.) 0.2 for nucleosome states and mean −0.7, s.d. 0.3 for linker state. We illustrated that although we were simulating the observed log base 2 ratios, and not the first order differences, our proposed NHSM was able to map nucleosome positions accurately.

To simulate heterogeneous nucleosome densities, we added a trend line to the simulated data following Yuan et al. (2005). Figure 1 suggests that the underlying trend line in the observed data resembles a curve. Therefore, instead of adding a linear trend line as in Yuan et al. (2005), we let the trend be a sinusoidal curve so that the synthetic data resembles the observed data to a larger extent (Figure 6 top right panel). The bottom left panel of Figure 6 plots the detrended data obtained by comparing peak to trough in a window size of 7 described above. Although this procedure was able to remove the trend in hybridization, it introduced artificial linkers within delocalized nucleosomes and spurious “bumps” within nucleosome depleted/long linker regions and resulting in data with higher noise level. This suggests that applying the same detrending procedure to the whole data is not desirable. On the other hand, a simple smoothing of the synthetic data preserved the “bump” shape that characterizes a nucleosome (Figure 6 bottom right panel). We considered sinusoidal curves with different periodicity (Figure 7) in this simulation study.

Although adding a trend line results in sythetic data that resembled the actual observed data, it may not be the most realistic model to describe the heterogeneity of nucleosome densities. We considered a more realistic simulation setup to generate nucleosomes with various occupancy levels by using mixture emission distributions for the hidden states. We enlarged the hidden state transitions (Figure 5(b)) by introducing low and high (stable) nucleosome states. The stable nucleosomes (N1-N8, D1-D9) were generated from a Gaussian distribution with mean 0.7 and s.d. 0.2. Low-signal nucleosomes (NL1-NL8, DL1-DL8) were generated from a Gaussian distribution with mean 0.1 and s.d. 0.3 and the linker state was generated from a mixture of 3 Gaussian distributions with means −0.3, −0.5, −0.7 and constant s.d. 0.3 with equal mixing proportion. An example of simulated data is shown in Figure 8. The middle panel again shows that detrending introduces a higher noise level to the original data.

We simulated observations for 1000 probes according to a tiling design of 50-mer probes overlapped by 30 base pairs covering a 20030 base pair region. In both simulations, we decoded the hidden states using the usual HMM with two emission distributions, one for linker and one for nucleosomes (without differentiating fuzzy/well-positioned, low/high), SHMM, HMMD (detrend first, then apply usual HMM) and our proposed NHSM (on first order differences). The most probable path for each method was decoded via the Viterbi algorithm (Appendix A.2.3).

We compared the performance of each method via the area under a receiver operating characteristic (AUROC) curve, by varying the posterior probabilities of declaring a probe to be in a nucleosome (well positioned and delocalized) state. In addition, we also evaluated the sensitivity and specificity at probe level of the most probably path for each method. The results, averaged over 50 simulated data sets of 1000 probes, are summarized in Table 1.

In both simulations, NHSM has a consistent result and outperforms other methods in both the sensitivity/specificity at the 0.5 posterior probability threshold and AUROC, since its main assumption is the “bump” shape that characterizes a nucleosome and this characteristic is preserved irrespective of the underlying trends in hybridization (Simulation I). HMM consistently tends to declare fewer nucleosomes, resulting in lower sensitivity. On the other hand, in cases where the trend line has larger periodicity, comparing the magnitudes of peaks and troughs is able to remove the trend effect and improves the performance of HMMD, although it is still worse than NHSM. The superior performance of SHMM in Simulation I with larger periodicity is not surprising. When the periodicity is large, the simulated data in each segment consisting of 40 probes is very close the the original hidden Markov model generator with scaled mean in the emission distributions, and therefore fitting a usual HMM to each segment in SHMM agrees with the underlying data generator. However, when the trend line oscillates more frequently (Simulation I) or is unpredictable (Simulation II), the performance of SHMM decreases rapidly. This indicates that the sliding window size in SHMM depends heavily on the trend in hybridization. In the actual data analysis, it is hard to calibrate the window size since the exact trend is unknown, and a reasonable number of probes within the window size is required for obtaining reliable parameter estimates in an HMM fit.

We illustrated our proposed NHSM on the normalized median log base 2 ratios of the 8 replicates from Yuan et al. (2005). The data was generated from microarrays which consist of 50-mer oligonucleotides probes tiled at 20 base pairs resolution, covering approximately half megabase of the yeast genome. A moving average in a window size of 3 probes was first applied across the whole data as the smoothing step. A well positioned nucleosome (~146 base pairs) is represented by at least 6 probes (Yuan et al.; 2005), which implies that 0 ≤ *d*(*N*_{2}* _{a}*),

$$\begin{array}{ccc}{a}_{{N}_{3},{B}_{L}}({Z}_{t})& =& \{\begin{array}{l}1,\text{if}{Z}_{t}0,\\ {a}_{{N}_{3},{B}_{L}}^{p},\text{if}{Z}_{t}\ge 0,\end{array}\\ {a}_{{B}_{L},{B}_{N}}({Z}_{t})& =& \{\begin{array}{l}{a}_{{B}_{L},{B}_{N}}^{n},\text{if}{Z}_{t}0,\\ 1,\text{if}{Z}_{t}\ge 0,\end{array}\\ {a}_{{L}_{3},{B}_{N}}({Z}_{t})& =& \{\begin{array}{l}{a}_{{L}_{3},{B}_{N}}^{n},\text{if}{Z}_{t}0,\\ 1,\text{if}{Z}_{t}\ge 0.\end{array}\end{array}$$

where *Z _{t}* =

This transition structure implies that if the current state is in a linker region, a positive log base 2 ratio observed in the immediate probe imposes transition into a nucleosome state. Similarly, if the current state is in *N*_{3} nucleosome state, a negative log base 2 ratio observed in the immediate probe imposes transition into a linker state. This transition structure appears to be sufficient and works well on the data. We first illustrated that our proposed NHSM is able to detect low-signal nucleosomes in the *HIS3* promoter region as shown in Figure 10. The horizontal black line between positions 721871 and 721971 is the low-signal nucleosome annotated in Figure 1B of Yuan et al. (2005) which was only identified by further detrending the data for low-signal nucleosomes. In particular, Yuan et al. (2005) first applied SHMM to decode nucleosome positions, and further included additional nucleosomes identified exclusively only by HMMD, which they labeled as low-signal nucleosomes. Although Yuan et al. (2005) used the SHMM annotation as baseline nucleosomal set and added low-signal nucleosomes from HMMD to this set, this procedure could be problematic if the boundaries of the low-signal nucleosomes cut across the boundaries of flanking nucleosomes identified in SHMM. In general, combining the two sets of annotation requires careful curation in determining the boundaries of nucleosomes and linkers since the two annotations do not coincide precisely. This low-signal nucleosome was also identified by others according to Yuan et al. (2005) and in the high resolution MNase-Chip experiment of Lee et al. (2007) and MNase-Seq experiment of Shivaswamy et al. (2008), therefore it is not likely to be an artifact of hybridization. Our proposed NHSM is able to map this low-signal nucleosome automatically and accurately without any additional detrending. We provided an example of a low-signal nucleosome that was still missed by further detrending in Yuan et al. (2005) in Figure 11. This low-signal nucleosome was also annotated in high resolution data of Lee et al. (2007) and Shivaswamy et al. (2008) and this provides evidence against it being a hybridization artifact. We also showed that the duration constraint in nucleosome states in our NHSM architecture is able to distinguish real “bumps” which characterize a nucleosome from spurious small “bumps” at positions 103400 (between nucleosomes 1 and 2) and 104400 (between nucleosomes 6 and 7) in the top panels of Figure 12. The problem with detrending the data by comparing peak and trough within a window size of 7 probes is also visible in this region. As evident in the bottom left panel of Figure 12, detrending introduced more noise to the original data and diminished the distinction between linker and nucleosomes.

To compare the performance of the different methods, we used the “hand picked” annotation in Yuan et al. (2005) as the gold standard. Hand picked annotation was based on careful visual inspection (Yuan et al.; 2005), and thus formed a reliable nucleosome map for this tiling array data. However, it is inevitable that there may still exist some uncertainties in mapping nucleosomelinker boundaries even by careful visual inspection as shown in Figure 13. To account for the one/two probes boundary uncertainties in the “hand picked” annotation, we allowed for one probe margin in defining sensitivity and specificity. That is, suppose that the underlying state for probe *i* is nucleosomal based on “hand picked” annotation. We declare this probe to be correctly inferred if either one of the probes *i*−1, *i* or *i*+1 is annotated as nucleosomal probe for each of the method under comparison. To measure the sensitivity of our proposed method in detecting low-signal nucleosomes, we considered two possible sets of true positives. The first set was defined by using probes annotated as nucleosomes (both low and high signals) in the “hand picked” annotation. The second set was defined by using probes categorized as low-signal nucleosomes by “hand picked” annotation according to Yuan et al. (2005) (that is corresponding to score 0.25 and 0.5 in Yuan et al. (2005)).

Table 2 summarizes the sensitivity and specificity for these methods using “hand picked” annotation as the gold standard. “Sensitivity(both)” was obtained using all annotated nucleosomes as true positives while “Sensitivity(low)” was obtained using annotated low-signal nucleosomes as true positives. SHMM misses a very large fraction of the low-signal nucleosomes, and thereby has extremely poor sensitivity. HMM has a higher sensitivity than SHMM, but a much lower specificity. The methods are comparable in terms of their specificities, except for HMM. The sensitivity analysis illustrates that the proposed NHSM based on first order differences is able to bypass the need for local detrending and automatically map nucleosome positions accurately. HMMD is the worst among all, which again illustrates that detrending the data is a difficult procedure and could potentially distort the signals in the observed data.

We also compared the performance of our proposed NHSM, HMM, HMMD and SHMM (from Yuan et al. (2005)) via ROC curves, by varying the posterior probabilities of declaring a probe to be in a nucleosome (well positioned and delocalized) state using the low-signal nucleosomes as true positive set. The results are shown in Figure 14, which demonstrates that the proposed NHSM based on first order differences performs better than all the other methods. Specifically, NHSM has an AUROC statistic of 0.889 whereas the best of the other methods has only a value of 0.808.

The above benchmarking study relied on using the ”hand picked” annotation of Yuan et al. (2005) as the gold standard. We also provide an additional benchmarking experiment in Table 3 by annotating the probes in Yuan et al. (2005) based on the nucleosome calls from the higher resolution experiment of Lee et al. (2007) which used a 4 base pairs resolution tiling array. Comparisons of different methods using this annotation as the gold standard again indicate that NHSM uniformly outperforms other methods by identifying a larger percentage of low-signal nucleosomes.

Next, we will illustrate the applicability of our proposed NHSM in mapping nucleosome occupancy on the high resolution MNase-Chip (Lee et al.; 2007) and MNase-Seq data (Shivaswamy et al.; 2008). In both cases, we smoothed the data via Gaussian kernel smoothing described in Section 3 and defined *O _{t}* =

Since our modeling framework utilizes first order differences which capture the “bump” shape of a nucleosome and not the observed log base 2 ratios in the emission distribution, it can be applied to first order (lagged) differences on tag counts/reads in MNase-Seq data. In Shivaswamy et al. (2008), 514803 uniquely aligned reads were generated for the normal cells via the sequencing technology. We considered the following strategy for mapping nucleosome positions onMNase-Seq data. Since each of the 27 base pairs Solexa sequencing read corresponds to a mono-nucleosome of size 150–200 base pairs, we first extended these reads to 150 base pairs according to the sequence orientation for both the plus and minus strands. The total reads for each genomic position is then taken to be the sum of all extended reads at the position, as shown in Figure 16. Therefore, the total reads at every 50 base pairs on the genome is analogous to the observed log base 2 ratios in MNase-chip data of 50 base pairs resolution.

We also illustrate the applicability of our proposed NHSM in annotating these two promoter regions in the MNase-Seq data of Shivaswamy et al. (2008) using a 4 base pairs resolution to facilitate direct comparison against the 4 base pairs MNase-Chip data of Lee et al. (2007). Of particular interest is the ability of NHSM to decode three low-signal nucleosomes between positions 722200 and 722700 in the *HIS3* promoter (labeled 3, 4 and 5 in Figure 17), which were missed by the original annotation of Shivaswamy et al. (2008) as shown in Figure 17. The NHSM annotation for these three nucleosomes is consistent with the annotation from Lee et al. (2007). Both Figures 15 and and1717 also illustrate that our proposed NHSM results in the most consistent, i.e., greatest overlaps in annotations of nucleosome positions between MNase-Chip and MNase-Seq data. This is desirable given that both data sets measure the same nucleosome occupancy in yeast *S. cerevisiae*.

The ability to map nucleosome positions accurately is crucial for investigating changes in nucleosome occupancies and their relationship to gene regulation since losses/gains in occupancy usually occur at one or two nucleosomes as illustrated in Shivaswamy et al. (2008). We introduced a non-homogeneous hidden-state model (NHSM) that automatically maps nucleosome positions based on either high-throughput tiling array or sequencing data and is computationally efficient. The modeling framework utilizes first order (lagged) differences which capture the “bump” shape that characterize a nucleosome and enable accurate mapping of nucleosome-linker boundaries. The NHSM bypasses the need for further local detrending which misses low-signal nucleosomes (Figure 11). We also demonstrated the pitfalls of detrending the data with a simple method of comparing peak and trough within a window size covering a nucleosome (HMMD). Such a detrending introduced higher noise levels to the data in both the simulations and a case study of yeast nucleosome occupancy data. Modeling the emission distribution on first order differences allows our method to be applicable to both the MNase-Chip and MNase-Seq data, since the defining characteristic of a nucleosome in both cases is the “bump” shape. By allowing a duration distribution, NHSM is able to capture the fuzzy nucleosomes, which have heterogenic and dynamic positions.

The only preprocessing step required before applying our proposed NHSM in detecting nucleosome positions is data smoothing. We have illustrated in the case studies that simple smoothing such as moving average in a window size of 3 is generally sufficient for lower resolution tiling arrays, e.g, the MNase-Chip data of Yuan et al. (2005). For high resolution tiling arrays and sequencing data which have lower signal-to-noise ratio, we proposed a Gaussian kernel smoothing with a bandwidth chosen based on the size of nucleosomal DNA, and demonstrated that this smoothing is able to denoise and enhance the “bump” shape of a nucleosome. Recently, Yassour et al. (2008) introduced a method for improving the resolution of the nucleosome positions in low resolution tiling arrays with overlapping probe design. In this paper, we aimed to develop a general method applicable to both low and high resolution nucleosome occupancy data by specifically capturing the defining characteristic, that is the “bump” shape of a nucleosome. However, the main idea of partitioning a probe into smaller fragments in Yassour et al. (2008) can be easily adapted into our framework by generating a pseudo MNase-Chip with higher resolution and thereby improving the resolution of nucleosome positions. We provide an illustration of this point in Appendix A.3.

The numerous examples and extensive simulations provided in this paper demonstrate that our proposed method is able to detect linker regions that are represented by only one/two probes, low-signal nucleosomes (Figures 10 and and11)11) and outperforms currently available methods. Although the underlying architecture of our NHSM is simple, it is effective in detecting nucleosome occupancies in both low and high resolution MNase-Chip and MNase-Seq data. Furthermore, in the datasets that we used for illustration, nucleosomal DNA was isolated by MNase but our approach is applicable to tiling arrays and sequencing technologies regardless of how the nucleosomes are isolated. We conclude by stressing that accurate annotation of nucleosome occupancy based on data from high-throughput experiments under various physiological conditions forms the basis of comparing different samples to elucidate the dynamics of nucleosome occupancy. Some examples of this line of work are by Shivaswamy et al. (2008) and Schones et al. (2008), and we anticipate that the number such examples will keep growing.

This research has been supported in part by a PhRMA Foundation Research Starter Grant in Informatics (P.K. and S.K.), the NIH grant HG003747 (P.K. and S.K.), the NSF grant DMS004597 (S.K.), the Morgridge Institute for Research support for Computation and Informatics in Biology and Medicine (P.K), NSF CAREER Award #0447887 (A.P.G and D.J.H.) and an NIGMS grant to the Molecular Biosciences Training Program T32GM007215 (D.J.H.).

We investigate various smoothing algorithms for denoising the observed log base 2 ratios in tiling arrays measuring nucleosome occupancy. We will illustrate the performance of the selected smoothing algorithms on the high resolution tiling arrays (4 base pairs resolution) from Lee et al. (2007). The original log base 2 ratios from one replicate/array for the *CHA1*, *HIS3* and *SAC7* promoters are shown in the top left panels of Figures 18, ,19,19, and and20,20, respectively.

We have shown that data smoothing based on moving average in a window size (2*w* + 1) of 3 probes works well in the lower resolution nucleosome data from Yuan et al. (2005). However, the data from Lee et al. (2007) has lower signal-to-noise ratio compared to Yuan et al. (2005). Therefore, one pass moving average using a window size of 2*w* + 1 = 3 is not sufficient, as shown in top middle panels of Figures 18, ,19,19, and and20.20. One possible solution is to use a larger *w*. Since a nucleosomal DNA is 146 base pairs long, a nucleosome occupied region will span approximately 32 probes. Hence, consider moving averages in a window size of 32 probes. As given in the top right panels of Figures 18, ,19,19, and and20,20, using a larger *w* is able to increase the signal-to-noise ratio, although the denoised log base 2 ratios still exhibit some wiggly pattern.

Next, as an alternative for using a larger window size, we consider iterated moving averages using a window size of 3 probes. Let *S* and *S*(*Y* )* _{t}* denote the smoother function and the resulting smoothed value at probe

$$\begin{array}{l}{S}_{1}{(Y)}_{t}=\frac{1}{3}({Y}_{t-1}+{Y}_{t}+{Y}_{t+1})\\ {S}_{2}{(Y)}_{t}=\frac{1}{{3}^{2}}({Y}_{t-2}+2{Y}_{t-1}+3{Y}_{t}+2{Y}_{t+1}+{Y}_{t+2})\\ {S}_{3}{(Y)}_{t}=\frac{1}{{3}^{3}}({Y}_{t-3}+3{Y}_{t-2}+6{Y}_{t-1}+7{Y}_{t}+6{Y}_{t+1}+3{Y}_{t+2}+{Y}_{t+3})\\ {S}_{4}{(Y)}_{t}=\frac{1}{{3}^{4}}({Y}_{t-4}+4{Y}_{t-3}+10{Y}_{t-2}+16{Y}_{t-1}+19{Y}_{t}+16{Y}_{t+1}+10{Y}_{t+2}\\ +4{Y}_{t+3}+{Y}_{t+4})\\ :\end{array}$$

The smoother function has the following general form:

$${S}_{k}{(Y)}_{t}=\frac{1}{{3}^{k}}\sum _{j=t-k}^{t+k}{c}_{j}{Y}_{j},$$

where *c _{t}*

Next, we provide a simple analytical result for smoothing based on iterated moving averages. The collection of weights in front of the log base 2 ratios in *S _{k}* can be viewed as the resulting probability mass function from the sum of

$$f(\sum _{i=1}^{k}{u}_{i})=\frac{{c}_{j}}{{3}^{k}}\text{for}\sum _{i=1}^{k}{u}_{i}=j\in \{-k,-k+1,\dots ,k-1,k\}$$

We provide the form of *f* for *k* = 2 and *k* = 3:

$$f({u}_{1}+{u}_{2})=\{\begin{array}{c}\frac{1}{9}\text{for}{u}_{1}+{u}_{2}\in \left\{-2,2\right\},\\ \frac{2}{9}\text{for}{u}_{1}+{u}_{2}\in \left\{-1,1\right\},\\ \frac{3}{9}\text{for}{u}_{1}+{u}_{2}\in \left\{0\right\}.\hfill \end{array}$$

$$f({u}_{1}+{u}_{2}+{u}_{3})=\{\begin{array}{c}\frac{1}{27}\text{for}{u}_{1}+{u}_{2}+{u}_{3}\in \left\{-3,3\right\},\\ \frac{3}{27}\text{for}{u}_{1}+{u}_{2}+{u}_{3}\in \left\{-2,2\right\},\\ \begin{array}{l}\frac{6}{27}\text{for}{u}_{1}+{u}_{2}+{u}_{3}\in \left\{-1,1\right\},\\ \frac{7}{27}\text{for}{u}_{1}+{u}_{2}+{u}_{3}\in \left\{0\right\}.\end{array}\end{array}$$

Then, by the Central Limit Theorem,

$$\sum _{i=1}^{k}{U}_{i}/\sqrt{k}{\to}_{D}N(0,{\sigma}^{2})\text{as}k\to \infty ,$$

where σ^{2} = (3^{2} − 1)/12. Therefore, for a sufficiently large fixed *K*,

$$f(\sum _{i=1}^{K}{U}_{i}=j)\approx \frac{\phi (j|0,K{\sigma}^{2})}{{\sum}_{j=-K}^{K}\phi (j|0,K{\sigma}^{2})},$$

i.e., a discretized Gaussian on integer support { −*K*, ..., *K*}, where
$\phi (z|\mu ,{\sigma}^{2})=\text{exp}[-{(z-\mu )}^{2}/2{\sigma}^{2}]/\sigma \sqrt{2\pi}$. In other words, the weights *c _{j}*/3

$$\begin{array}{l}{S}_{K}{(Y)}_{t}=\frac{1}{{3}^{K}}\sum _{j=t-K}^{t+K}{c}_{j}{Y}_{j}\\ \approx \frac{{\sum}_{j=t-K}^{t+K}\phi \left(j|t,K{\sigma}^{2}\right){Y}_{j}}{{\sum}_{j=t-K}^{t+K}\phi \left(j|t,K{\sigma}^{2}\right)},\end{array}$$

and

$$\underset{K\to \infty}{\text{lim}}\frac{{\sum}_{j=t-K}^{t+K}\phi \left(j|t,K{\sigma}^{2}\right){Y}_{j}}{{\sum}_{j=t-K}^{t+K}\phi \left(j|t,K{\sigma}^{2}\right)}=\overline{Y}.$$

That is, the smoothed log base 2 ratios from iterated moving average become flatter and flatter approaching a constant , and the choice of ε above is critical to avoid over-smoothing.

The analytical result above also shows that the iterated moving average is approximately a Gaussian kernel smoother for large *k*. In kernel smoothing, the tuning parameter is the bandwidth *h*. Large bandwidth implies more smoothing, and vice versa. For Gaussian kernel, *h* is also the standard deviation. That is,

$${X}_{t}=\frac{{\sum}_{j=0}^{T}\phi \left(\frac{|{G}_{t}-{G}_{j}|}{h}\right){Y}_{j}}{{\sum}_{j=0}^{T}\phi \left(\frac{|{G}_{t}-{G}_{j}|}{h}\right)},$$

where *G _{j}* is the genomic coordinate corresponding to

We also investigated another simple non-linear smoothing, i.e., iterated moving median (Tukey; 1977). This smoother is more robust and less sensitive to sudden jumps, and is usually recommended over moving average. Mallow (1979) proved that iterated moving median converges for odd-numbered spans (*w*). However, as shown in the bottom middle panels of Figures 18, ,19,19, and and20,20, smoothing based on iterated moving median is less desirable since the “bump” shape of the nucleosomes is not well characterized compared to Gaussian kernel smoothing/iterated moving average. Another popular denoising algorithm is wavelet smoothing, which requires more tuning parameters (i.e., wavelet type, decomposition level and thresholds). The wavelet smoothing is utilized by Zhang et al. (2008) recently for detecting nucleosomes in ChIP-Seq. However, since we have illustrated that a Gaussian kernel smoothing is sufficient for denoising the log base 2 ratios from both the high resolution MNase-Chip and MNase-Seq data with a justified choice of bandwidth, we do not explore the more sophisticated wavelet smoothing.

Let *Q _{t}* be the hidden state latent variable for mid-probe

Two assumptions arising from Figure 4 are:

$$P({Q}_{t+1}|{Q}^{(t)},{Z}_{0},{O}^{(t)},\lambda )=P({Q}_{t+1}|{Q}_{t},{Z}_{t},\lambda )\text{for}t=1,\dots ,T-\mathrm{1,}$$

(3)

$$P({O}_{t}|{Q}^{(t)},{Z}_{0},{O}^{(t-1)},\lambda )=P({O}_{t}|{Q}_{t},\lambda )\text{for}t=1,\dots ,T.$$

(4)

For simplicity, we assume *Z*_{0} is fixed. The complete data likelihood is given by:

$$\begin{array}{lll}P({O}^{(T)},{Q}^{(T)}|{Z}_{0},\lambda )& =& P\left({Q}_{1}|{Z}_{0},\lambda \right)\prod _{t=1}^{T}\left[P({O}_{t}|{Q}^{(t)},{Z}_{0},{O}^{(t-1)},\lambda )\right]\\ & & \prod _{t=1}^{T}\left[P({O}_{t+1}|{Q}^{(t)},{Z}_{0},{Q}^{(t)},\lambda )\right]\\ & =& P\left({Q}_{1}|{Z}_{0},\lambda \right)\prod _{t=1}^{T}\left[P({O}_{t}|{Q}_{t},\lambda )\right]\prod _{t=1}^{T-1}\left[P({Q}_{t+1}|{Q}_{t},{Z}_{t},\lambda )\right]\\ & =& {\pi}_{{Q}_{1}}\prod _{t=1}^{T}{b}_{{Q}_{t}}({O}_{t})\prod _{t=1}^{T-1}{a}_{{Q}_{t},{Q}_{t+1}}({Z}_{t}).\end{array}$$

The first equality follows from repeated application of conditional probability. The second equality follows from the two assumptions above.

Assume that there are *N* hidden states. Then, the complete data log likelihood is given by:

$$\begin{array}{l}\text{log}P({O}^{(T)},{Q}^{(T)}|{Z}_{0},\lambda )=\text{log}\left[\prod _{i=1}^{N}{\pi}_{i}^{I({Q}_{1}=i)}\right]\left[\prod _{t=1}^{T}\prod _{i=1}^{N}{b}_{i}{({O}_{t})}^{I({Q}_{t}=i)}\right]\\ \left[\prod _{t=1}^{T-1}\prod _{i=1}^{N}\prod _{j=1}^{N}{a}_{i,j}{({Z}_{t})}^{I({Q}_{t}=i,{Q}_{t+1}=j)}\right].\end{array}$$

We assume that

$${b}_{i}({O}_{t})=N({\mu}_{i},{\sigma}_{i}^{2}),$$

$${a}_{i,j}({Z}_{t})=\frac{\text{exp}({\gamma}_{i,j}+{\beta}_{j}{Z}_{t})}{{\sum}_{k=1}^{N}\text{exp}({\gamma}_{i,k}+{\beta}_{k}{Z}_{t})}$$

Expected complete log likelihood is given by

$$\begin{array}{l}E[\text{log}P({O}^{(T)},{Q}^{(T)}|{Z}_{0},\lambda )]\\ =\sum _{i=1}^{N}P({Q}_{1}=i|{O}^{(T)},{Z}_{0},\lambda )\text{log}{\pi}_{i}\\ +\sum _{t=1}^{T}\sum _{i=1}^{N}P({Q}_{t}=i|{O}^{(T)},{Z}_{0},\lambda )\text{log}\left[\frac{1}{\sqrt{2\pi {\sigma}_{\text{i}}^{2}}}\text{exp}\left(-\frac{{\left({O}_{t}-{\mu}_{i}\right)}^{2}}{2{\sigma}_{i}^{2}}\right)\right]\\ +\sum _{t=1}^{T-1}\sum _{i=1}^{N}\sum _{j=1}^{N}P\left({Q}_{t}=i,{Q}_{t+1}=j|{O}^{(T)},{Z}_{0},\lambda \right)\text{log}{a}_{i,j}({Z}_{t}).\end{array}$$

**E-step:**

Define two variables γ* _{t}*(

$$\begin{array}{lll}{\gamma}_{t}(i)& =& P({Q}_{t}=i|{O}^{(T)},{Z}_{0},\lambda )\\ & =& \frac{{\alpha}_{t}(i){\beta}_{t}(i)}{{\sum}_{i=1}^{N}{\alpha}_{t}(i){\beta}_{t}(i)}\\ & =& \frac{{\alpha}_{t}(i){\beta}_{t}(i)}{{\sum}_{i=1}^{N}{\alpha}_{T}(i)}\\ & =& \sum _{j=1}^{N}P\left({Q}_{t}=i,{Q}_{t+1}=j|{O}^{(T)},{Z}_{0},\lambda \right),\end{array}$$

$$\begin{array}{lll}{\xi}_{t+1}(i,j)& =& P\left({Q}_{t}=i,{Q}_{t+1}=j|{O}^{(T)},{Z}_{t},\lambda \right)\\ & =& \frac{{\alpha}_{t}(i){a}_{i,j}({Z}_{t}){b}_{j}({O}_{t+1}){\beta}_{t+1}(j)}{{\sum}_{i=1}^{N}{\sum}_{j=1}^{N}{\alpha}_{t}(i){a}_{i,j}({Z}_{t}){b}_{j}({O}_{t+1}){\beta}_{t+1}(j)}\\ & =& \frac{{\alpha}_{t}(i){a}_{i,j}({Z}_{t}){b}_{j}({O}_{t+1}){\beta}_{t+1}(j)}{{\sum}_{i=1}^{N}{\alpha}_{T}(i)},\end{array}$$

where α* _{t}*(

**M-step:**

$$\begin{array}{l}\text{max}E\left[\text{log}P\left({O}^{(T)},{Q}^{(T)}|{Z}_{0},\lambda \right)\right]\\ \text{s}.\text{t}\sum _{i=1}^{N}{\pi}_{i}=1,\\ \sum _{j=1}^{N}{a}_{i,j}({Z}_{t})=1,t=1,\dots ,T-1\end{array}$$

yields

$$\begin{array}{l}{\pi}_{i}={\gamma}_{1}(i),\\ \{\begin{array}{l}{\widehat{\mu}}_{1}=\frac{{\sum}_{t=1}^{T}{\gamma}_{t}({B}_{N}){O}_{t}-{\sum}_{t=1}^{T}{\gamma}_{t}({B}_{L}){O}_{t}}{{\sum}_{t=1}^{T}{\sum}_{i\in \left\{{B}_{N},{B}_{L}\right\}}{\gamma}_{t}(i)},\\ {\widehat{\mu}}_{2}=\frac{{\sum}_{t=1}^{T}{\sum}_{i\in \left\{{N}_{1},{N}_{2a},{L}_{3}\right\}}{\gamma}_{t}(i){O}_{t}-{\sum}_{t=1}^{T}{\sum}_{i\in \left\{{N}_{2c},{N}_{3},{L}_{1}\right\}}{\gamma}_{t}(i){O}_{t}}{{\sum}_{t=1}^{T}{\sum}_{i\in \left\{{N}_{1},{N}_{2a},{N}_{2c},{N}_{3},{L}_{1},{L}_{3}\right\}}{\gamma}_{t}(i)},\text{if}{\widehat{\mu}}_{1}\ge {\widehat{\mu}}_{2},\end{array}\\ {\widehat{\mu}}_{1}={\widehat{\mu}}_{2}=\frac{{\sum}_{t=1}^{T}{\sum}_{i\in \left\{{B}_{N},{N}_{1},{N}_{2a},{L}_{3}\right\}}{\gamma}_{t}(i){O}_{t}-{\sum}_{t=1}^{T}{\sum}_{i\in \left\{{B}_{L},{N}_{2c},{N}_{3},{L}_{1}\right\}}{\gamma}_{t}(i){O}_{t}}{{\sum}_{t=1}^{T}{\sum}_{i\in \left\{{B}_{N},{B}_{L},{N}_{1},{N}_{2a},{N}_{2c},{N}_{3},{L}_{1},{L}_{3}\right\}}{\gamma}_{t}(i)},\\ \text{if}{\widehat{\mu}}_{1}{\widehat{\mu}}_{2},\\ {\widehat{\sigma}}_{i}^{2}=\frac{{\sum}_{t=1}^{T}{\gamma}_{t}(i){({O}_{t}-{\widehat{\mu}}_{i})}^{2}}{{\sum}_{t=1}^{T}{\gamma}_{t}(i)}.\end{array}$$

The non-parametric transition probabilities for the case study are updated as follows:

$$\begin{array}{ccc}{\widehat{a}}_{i,j}& =& \frac{{\sum}_{t=1}^{T-1}{\xi}_{t+1}(i,j)}{{\sum}_{t=1}^{T-1}{\gamma}_{t}(i)}\text{for}i\ne {N}_{3},{B}_{L},{L}_{3},\\ {\widehat{a}}_{{N}_{3},{B}_{L}}^{p}& =& \frac{{\sum}_{t=1}^{T-1}{\xi}_{t+1}({N}_{3},{B}_{L})I({Z}_{t}\ge 0)}{{\sum}_{t=1}^{T-1}{\gamma}_{t}({N}_{3})I({Z}_{t}\ge 0)},\\ {\widehat{a}}_{{B}_{L},{B}_{N}}^{n}& =& \frac{{\sum}_{t=1}^{T-1}{\xi}_{t+1}({B}_{L},{B}_{N})I({Z}_{t}0)}{{\sum}_{t=1}^{T-1}{\gamma}_{t}({B}_{L})I({Z}_{t}0)},\\ {\widehat{a}}_{{L}_{3},{B}_{N}}^{n}& =& \frac{{\sum}_{t=1}^{T-1}{\xi}_{t+1}({L}_{3},{B}_{N})I({Z}_{t}0)}{{\sum}_{t=1}^{T-1}{\gamma}_{t}({L}_{3})I({Z}_{t}0)}.\end{array}$$

Note that the parameters in the logistic regression model cannot be solved analytically. These parameters can be optimized via a conjugate gradient algorithm. The details can be found in Robertson et al. (2004). The computation of *α _{t}*(

Let α* _{t}* =

- Initialization:
- α
_{1}(*i*) =*P*(*O*_{1},*Q*_{1}=*i*|*Z*_{0}, λ) = π(_{i}b_{i}*O*_{1}), for 1 ≤*i*≤*N*.

- Induction:
- α
_{t}_{+1}(*j*) = $\left[{\sum}_{i=1}^{N}{\alpha}_{t}(i){a}_{i,j}({Z}_{t})\right]$*b*(_{j}*O*_{t}_{+1}), for 1 ≤*t*≤*T*− 1, 1 ≤*j*≤*N*.

- Termination:
*P*(*O*^{(}^{T}^{)}|*Z*_{0}, λ) = ${\sum}_{i=1}^{N}{\alpha}_{T}(i)$.

Let β* _{t}*(

- Initialization:
- β
(_{T}*i*) = 1, for 1 ≤*i*≤*N*.

- Induction:
- β
(_{t}*i*) = ${\sum}_{j=1}^{N}{a}_{i,j}({Z}_{t}){b}_{j}({O}_{t+1}){\beta}_{t+1}(j)$, for*t*=*T*−1,*T*−2, ..., 1, 1 ≤*j*≤*N*.

- Termination:
*P*(*O*^{(}^{T}^{)}|*Z*_{0}, λ) = ${\sum}_{i=1}^{N}{\beta}_{1}(i){b}_{i}\left({O}_{1}\right){\pi}_{i}$.

The optimal hidden state sequence is obtained via Viterbi algorithm (Rabiner; 1989).

Define δ* _{t}*(

- Initialization:
- δ
_{1}(*i*) = π(_{i}b_{i}*O*_{1}), for 1 ≤*i*≤*N*, - ψ
_{1}(*i*) = 0, for 1 ≤*i*≤*N*.

- Recursion:
- δ
(_{t}*j*) = max_{1≤}_{i}_{≤}[δ_{N}_{t}_{−1}(*i*)*a*(_{i,j}*Z*) ]_{t}*b*(_{j}*O*), for 2 ≤_{t}*t*≤*T*, 1 ≤*j*≤*N*, - ψ
(_{t}*j*) = argmax_{1≤}_{i}_{≤}[δ_{N}_{t}_{−1}(*i*)*a*(_{i,j}*Z*)], for 2 ≤_{t}*t*≤*T*, 1 ≤*j*≤*N*.

- Termination:
*P*(*O*^{(}^{T}^{)}|*Z*_{0}, λ)^{*}= max_{1≤}_{i}_{≤}[δ_{N}(_{T}*i*) ],- ${Q}_{T}^{*}={\text{argmax}}_{1\le i\le N}\left[{\delta}_{T}(i)\right].$

- Path backtracking:
- ${Q}_{t}^{*}={\psi}_{t+1}({Q}_{t+t}^{*})$, for
*t*=*T*− 1,*T*− 2, ..., 1.

For a general first order lagged differences *O _{t}* =

- Initialize with
*Z͂*_{0}=*Z*_{4}=*X*_{0}+ ... +*X*_{8}. - Let
*O͂*_{1}=*O*_{5}=*X*_{9}−*X*_{0}→*Z͂*_{1}=*Z*_{5}=*X*_{1}+ ... +*X*_{9}. - Let
*O͂*_{2}=*O*_{6}=*X*_{10}−*X*_{1}→*Z͂*_{2}=*Z*_{6}=*X*_{2}+ ... +*X*_{10}and so forth.

The transition probabilities for Figure 15 are:

$$\begin{array}{cccc}{a}_{{N}_{3},{B}_{L}}({\tilde{Z}}_{t})& =& \{\begin{array}{l}1,\\ {a}_{{N}_{3},{B}_{L}}^{p},\end{array}& \begin{array}{l}\text{if}{\tilde{Z}}_{t}/(2k-1)0,\\ \text{if}{\tilde{Z}}_{t}/(2k-1)\ge 0,\end{array}\\ {a}_{{B}_{L},{B}_{N}}({\tilde{Z}}_{t})& =& \{\begin{array}{l}{a}_{{B}_{L},{B}_{N}}^{n},\\ 1,\end{array}& \begin{array}{l}\text{if}{\tilde{Z}}_{t}/(2k-1)0,\\ \text{if}{\tilde{Z}}_{t}/(2k-1)\ge 0,\end{array}\\ {a}_{{L}_{3},{B}_{N}}({\tilde{Z}}_{t})& =& \{\begin{array}{l}{a}_{{L}_{3},{B}_{N}}^{n},\\ 1,\end{array}& \begin{array}{l}\text{if}{\tilde{Z}}_{t}/(2k-1)0,\\ \text{if}{\tilde{Z}}_{t}/(2k-1)\ge 0.\end{array}\end{array}$$

The transition probabilities for Figure 17 are:

$$\begin{array}{cccc}{a}_{{N}_{3},{B}_{L}}({\tilde{Z}}_{t})& =& \{\begin{array}{l}1,\\ {a}_{{N}_{3},{B}_{L}}^{p},\end{array}& \begin{array}{l}\text{if}{\tilde{Z}}_{t}/(2k-1)5,\\ \text{if}{\tilde{Z}}_{t}/(2k-1)\ge 5,\end{array}\\ {a}_{{B}_{L},{B}_{N}}({\tilde{Z}}_{t})& =& \{\begin{array}{l}{a}_{{B}_{L},{B}_{N}}^{n},\\ 1,\end{array}& \begin{array}{l}\text{if}{\tilde{Z}}_{t}/(2k-1)5,\\ \text{if}{\tilde{Z}}_{t}/(2k-1)\ge 5,\end{array}\\ {a}_{{L}_{3},{B}_{N}}({\tilde{Z}}_{t})& =& \{\begin{array}{l}{a}_{{L}_{3},{B}_{N}}^{n},\\ 1,\end{array}& \begin{array}{l}\text{if}{\tilde{Z}}_{t}/(2k-1)5,\\ \text{if}{\tilde{Z}}_{t}/(2k-1)\ge 5.\end{array}\end{array}$$

Here, the transition is based on *Z͂ _{t}/*(2

Recently, Yassour et al. (2008) introduced a method for mapping nucleosome positions, tailored for constant low resolution tiling arrays with overlapping probe design. Their method aimed to improve the resolution of the nucleosome positions by partitioning a probe into smaller fragments. This idea can be adapted into our model as follows: Let *Y _{i}*

- Albert I, Mavrich T, Tomsho L, Qi J, Zanton S, Schuster S, Pugh B. Translational and rotational settings of H2A.Z nucleosomes across the saccharomyces cerevisiae genome. Nature. 2007;446:572–576. doi: 10.1038/nature05632. [PubMed] [Cross Ref]
- Bernstein B, Liu C, Humphrey E, Perlstein E, Schreiber S. Global nucleosome occupancy in yeast. Genome Biology. 2004;5(62) doi: 10.1186/gb-2004-5-9-r62. [PMC free article] [PubMed] [Cross Ref]
- Chakravarthy S, Park Y, Chodaparambil J, Edayathumangalam R, Luger K. Structure and dynamic properties of nucleosome core particles, FEBS Letters. 2006;579(4):895–898. [PubMed]
- Ercan S, Carrozza MJ, Workman JL. Global nucleosome distribution and the regulation of transcription in yeast. Genome Biology. 2004;5(10) doi: 10.1186/gb-2004-5-10-243. doi:10.1186/gb–2004–5–10–243. [PMC free article] [PubMed] [Cross Ref]
- Johnson SM, Tan FJ, McCullough HL, Riordan DP, Fire AZ. Flexibility and constraint in the nucleosome core landscape of caenorhabditis elegans chromatin. Genome Research. 2006;16:1505–1516. doi: 10.1101/gr.5560806. [PubMed] [Cross Ref]
- Kaplan T, Liu C, Erkmann J, Holik J, Grunstein M, Kaufman P, Friedman N, Rando O. Cell cycle- and chaperone-mediated regulation of h3k56ac incorporation in yeast. PLoS Genetics. 2008 doi: 10.1371/journal.pgen.1000270. [PMC free article] [PubMed] [Cross Ref]
- Kornberg R, Lorch Y. Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome. Cell. 1999;98:285–294. doi: 10.1016/S0092-8674(00)81958-3. [PubMed] [Cross Ref]
- Lee C, Shibata Y, Rao B, Strahl B, Lieb J. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nature Genetics. 2004 doi: 10.1038/ng1400. [PubMed] [Cross Ref]
- Lee W, Tillo D, Bray N, Morse R, Davis R, Hughes T, Nislow C. A high-resolution atlas of nucleosome occupancy in yeast. Nature Genetics. 2007 doi: 10.1038/ng2117. [PubMed] [Cross Ref]
- Liu C, Kaplan T, Kim M, Buratowski S, Schreiber S, Friedman N, Rando O. Single-nucleosome mapping of histone modifications in s.cerevisiae. PLoS Biol. 2005;3(10):1753–1769. doi: 10.1371/journal.pbio.0030328. [PMC free article] [PubMed] [Cross Ref]
- Mallow C. 1979. Some theoretical results on Tukeys 3R smootherLecture Notes in Mathematics, Springer Berlin/Heidelberg
- Millar C, Grunstein M. Genome-wide patterns of histone modifications in yeast. Nature Reviews Molecular Cell Biology. 2006;7:657–666. doi: 10.1038/nrm1986. [PubMed] [Cross Ref]
- Rabiner L. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989;77(2):257–286. doi: 10.1109/5.18626. [Cross Ref]
- Robertson A, Kirshner S, Smyth P. Downscaling of daily rainfall occurrence over northeast brazil using a hidden markov model. Journal of Climate. 2004;17(22):4407–4424. doi: 10.1175/JCLI-3216.1. [Cross Ref]
- Schones DE, Cui K, Cuddapah S, Roh TY, Barski A, Wang Z, Wei G, Zhao K. Dynamic regulation of nucleosome positioning in the human genome. Cell. 2008;132(5):887–898. doi: 10.1016/j.cell.2008.02.022. [PubMed] [Cross Ref]
- Shivaswamy S, Bhinge A, Zhao Y, Jones S, Hirst M, Iyer V. Dynamic remodeling of individual nucleosomes across a eukaryotic genome in response to transcriptional perturbation. PLOS Biology. 2008;6(3):618–630. doi: 10.1371/journal.pbio.0060065. [PMC free article] [PubMed] [Cross Ref]
- Shivaswamy S, Iyer V. Stress-dependent dynamics of global chromatin remodeling in yeast: dual role for swi/snf in the heat shock stress response. Molecular and celular biology. 2008;28(7):2221–2234. doi: 10.1128/MCB.01659-07. [PMC free article] [PubMed] [Cross Ref]
- Tukey J. Exploratory data analysis. Addison-Wesley; 1977.
- Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, Malek J, Costa G, McKernan K, Sidow A, Fire A, Johnson S. A high-resolution, nucleosome position map of C.elegans reveals a lack of universal sequence-dictated positioning. Genome Research. 2008;18:1051–1063. doi: 10.1101/gr.076463.108. [PubMed] [Cross Ref]
- Yassour M, Kaplan T, Jaimovich A, Friedman N. Nucleosome positioning from tiling microarray data. Bioinformatics. 2008;24:139–146. doi: 10.1093/bioinformatics/btn151. [PMC free article] [PubMed] [Cross Ref]
- Yuan G, Liu Y, Dion M, Slack M, Wu L, Altschuler S, Rando O. Genome-scale idenfication of nucleosome positions in s.cerevisiae. Science. 2005;309:626–630. doi: 10.1126/science.1112178. [PubMed] [Cross Ref]
- Zhang Y, Shin H, Song J, Lei Y, Liu X. Identifying positioned nucleosomes with epigenetic marks in human from chip-seq. BMC Genomics. 2008;9(1) doi: 10.1186/1471-2164-9-537. [PMC free article] [PubMed] [Cross Ref]
- Zucchini W, Raubenheimer D, MacDonald I. Modeling time series of animal behavior by means of a latent-state model with feedback. Biometrics. 2008;64:807–815. doi: 10.1111/j.1541-0420.2007.00939.x. [PubMed] [Cross Ref]

Articles from Statistical Applications in Genetics and Molecular Biology are provided here courtesy of **Berkeley Electronic Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |