Home | About | Journals | Submit | Contact Us | Français |

**|**Biostatistics**|**PMC3114652

Formats

Article sections

- Abstract
- 1. INTRODUCTION
- 2. MODEL FRAMEWORK
- 3. MODEL FITTING AND ESTIMATION PROCEDURE
- 4. APPLICATION TO YEAST NUCLEOSOME ARRAY DATA
- 5. SIMULATION STUDIES
- 6. DISCUSSION
- SUPPLEMENTARY MATERIAL
- FUNDING
- Supplementary Material
- References

Authors

Related links

Biostatistics. 2011 July; 12(3): 462–477.

Published online 2010 December 30. doi: 10.1093/biostatistics/kxq077

PMCID: PMC3114652

Ritendranath Mitra

Department of Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, TX 77230, USA

Mayetri Gupta^{*}

Received 2010 April 26; Revised 2010 November 12; Accepted 2010 November 21.

Copyright © The Author 2010. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

This article has been cited by other articles in PMC.

Nucleosomes are units of chromatin structure, consisting of DNA sequence wrapped around proteins called “histones.” Nucleosomes occur at variable intervals throughout genomic DNA and prevent transcription factor (TF) binding by blocking TF access to the DNA. A map of nucleosomal locations would enable researchers to detect TF binding sites with greater efficiency. Our objective is to construct an accurate genomic map of nucleosome-free regions (NFRs) based on data from high-throughput genomic tiling arrays in yeast. These high-volume data typically have a complex structure in the form of dependence on neighboring probes as well as underlying DNA sequence, variable-sized gaps, and missing data. We propose a novel continuous-index model appropriate for non-equispaced tiling array data that simultaneously incorporates DNA sequence features relevant to nucleosome formation. Simulation studies and an application to a yeast nucleosomal assay demonstrate the advantages of using the new modeling framework, as well as its robustness to distributional misspecifications. Our results reinforce the previous biological hypothesis that higher-order nucleotide combinations are important in distinguishing nucleosomal regions from NFRs.

Genomic DNA in the cell nucleus exists as a complex called chromatin, where DNA is folded and compacted through a series of interactions with proteins called “histones.” At the first level, a stretch of DNA of about 147 bp is wrapped around an octamer of histones, yielding a “nucleosome.” Nucleosome structure has been determined by X-ray crystallography (Richmond and Davey, 2003). The presence of nucleosomes prevents transcription factors (TFs) from binding to the DNA at those sites (Widom, 1992), resulting in interruption of gene regulatory processes. Thus, a well-defined map of nucleosome positions could be used to improve the predictive power of current transcription factor binding site (TFBS) search algorithms. Most commonly used TFBS or motif search algorithms use the sequence conservation at binding sites as the only criterion for TFBS presence (Liu *and others*, 1995), (, guptaliu). However, in complex genomes, these methods often suffer from poor predictability and high false-positive rates. Recent methods have used gene expression measurements through regression-type models (Conlon *and others*, 2003), (, guptaibrahim07) or ChIP-chip (chromatin immunoprecipitation) experimental data (Buck and Lieb, 2004) to improve predictions, but weak relationships in the first approach often lead to missing important TFBSs and the second approach is limited in applicability to a small set of TFs. The connection of nucleosome positioning to TF binding has motivated some researchers to make use of information from nucleosomal mapping for motif search. Narlikar *and others* (2007) have used nucleosome mapping information to create a motif occurrence prior. However, these methods will not work well unless nucleosome positions are based on high-quality predictions from experimental data. Next, we discuss the various experimental assays for nucleosome detection and statistical analysis methods.

Tiling arrays are microarrays of short overlapping probes, covering a genomic region of interest, used to measure positions of genomic-level events, such as TF binding and histone methylation. In Yuan *and others* (2005), based on the nucleosome-free DNA's susceptibility to micrococcal nuclease (“MNase”), nucleosomal DNA from yeast was isolated and labeled with green (cy3) fluorescent dye, while the total genomic DNA was labeled with red (cy5) dye. These segments were hybridized to microarrays printed with 50-mer oligonucleotide probes tiled at 20-bp overlaps. After preprocessing and normalization, the ratio of green-to-red measurement intensity values indicated how likely that locus was to contain a nucleosome. A more recent experimental technology is formaldehyde-assisted isolation of regulatory elements (FAIRE) (Hogan *and others*, 2006). FAIRE is initiated by fixing whole yeast cells in a growth medium by formaldehyde, harvesting them by centrifugation, sonicating the extracts, labeling the purified DNA, and hybridizing them to microarrays. In FAIRE, regions of the genome cross-linked with histones (nucleosomes) are less susceptible to TF binding and get retained at the organic–aqueous interphase, while the nucleosome-free regions (NFRs) get enriched. Hogan *and others* (2006) used 50-mer probes overlapping every 20 bp to tile yeast chromosome III using 4 microarrays (3 biological and 1 technical replicate). The measured enrichment throughout the genome, in terms of median logarithmic intensity ratios, represented NFR positioning. In Yuan *and others* (2005), the enrichment represents nucleosomes—hence the raw data roughly exhibit a negative correlation with FAIRE enrichment values. Both used identical probes in the same region of the genome and can be used for comparison. However, since FAIRE enriches NFRs, the exact model framework used in Yuan *and others* (2005) is not appropriate for FAIRE.

We now discuss computational methods for detection of nucleosome positioning from genomic assays. Most methods are adapted for MNase digestion data (Yuan *and others*, 2005). In Yuan *and others* (2005), a hidden Markov model (HMM) was implemented for classifying the genomic regions into nucleosomal and non-nucleosomal regions; a fully Bayesian approach was developed by Gupta (2007). High-throughput tiling array data have complications such as amplification bias, inaccurate reads, and varying degrees of nonrandom noise and measurement error. Relying solely on signal intensities for classification could propagate errors if the nucleosome predictions are used as input data for motif discovery. Recent work has linked DNA sequence features with nucleosome formation (Sekinger *and others*, 2005). Segal *and others* (2006) constructed a nucleosomal map of yeast from the distribution of dinucleotides in nucleosomal DNA. However, neglecting NFR sequence information weakened their classification approach. Other methods include Peckham *and others* (2007), who used support vector machines, and Lee *and others* (2007), who used Lasso regression on sequence features. Yuan and Liu (2008) used wavelet decomposition to represent the sequence signal as covariates in a logistic model. However, in all these approaches, restricting the sequence features to dinucleotides is a potential reason for high misclassification rates, in spite of clear biological evidence that sequence composition strongly affects nucleosome positioning (Sekinger *and others*, 2005). Also, sequence features were not used in prediction—the relationship was explored only by obtaining nucleosome predictions from intensity data and inferring potential sequence influences.

In FAIRE, employing an HMM with equal probe spacings would misspecify the model and give erroneous state predictions due to gaps and missing probes. In ChIP-chip assays to detect TFBSs, missing probes are not as serious a problem as TF-bound enriched probes constitute a tiny fraction of the total. The conformability of DNA structure has been shown to depend on a large set of nucleotide combinations (Sekinger *and others*, 2005). Here we propose a novel model framework for FAIRE data extending HMMs in 2 directions: (i) an underlying continuous-index Markov process allowing gaps and missing probes and (ii) incorporation of sequence features, not limited to dinucleotides, into nucleosomal state prediction, through dimension reduction techniques. This unified model uses signal intensities and sequence features to predict nucleosome locations. Continuous-index Markov processes have been applied to genomic problems such as copy number variation detection (Stjernqvist *and others*, 2007) and genetic linkage analysis (Lander and Green, 1987), but the present problem presents some unique characteristics. In Sections 2 and 3, we describe the new model framework and estimation. In Section 4, we show that including sequence features of a higher order leads to superior predictive power through real and simulated data applications.

Next, we propose a model to extract positional estimates of NFRs from FAIRE enrichment data (Hogan *and others*, 2006). Although our model is developed with the FAIRE data structure in mind, it is a general model which can be used for a variety of high-throughput microarray settings. Also, from a biological standpoint, since the FAIRE technology is simpler to use than the MNase procedure (Yuan *and others*, 2005), it is likely to be useful in nucleosomal studies for a wide variety of organisms.

Due to experimental and imaging errors, the data often have missing probe measurements, with the intervals between measurements varying in length. For each fully observed probe, the following data are present: (i) *Y*_{ij} = log ratio of the intensity of the replicate *j* of probe *i*; (ii) *t*_{i} = nucleotide index of the center of probe *i*; and (iii) *X*_{i} = *d*-dimensional vector of covariate measurements for probe *i*, for *i* = 1,…,*P* (probes) and *j* = 1,…,*R* (replicates). We define *t*_{a,b} = *t*_{b} − *t*_{a} as the gap between the centers of probes indexed *a* and *b*, which is the genomic distance between 2 probes (measured at a base pair scale). If all probes are of equal length and equispaced (which is typically the case), the gaps are solely due to the existence of missing probes, leading to *t*(*a*,*b*) being a constant multiple of the number of missing probes in between. In this case, it is equivalent to work with the metric of the number of missing probes as a distance measure.

Due to missing probes, the underlying nucleosome occupancy status can be considered an unobserved stochastic process evolving over a continuous index, denoted by [*Z*(*t*),*t* > 0]. We assume *Z*(*t*) to be a 2-state Markov process; *Z*(*t*) = 1(0) represents an NFR (non-NFR) state, that is,

(2.1)

where *P*_{zi − 1zi}(·) denotes the transition probability that state *z*_{i} was occupied at index *t*_{i} given that the process was in state *z*_{i − 1} at index *t*_{i − 1}. First, let us assume that the transition probabilities of *Z* are stationary. Also, conditional on the state of *Z* at index *t*_{i}, let the observed measurement *Y*_{i} be independent of the hidden process and all measurements prior to index *t*_{i}, that is,

(2.2)

Expressions (2.1) and (2.2) constitute a continuous-index hidden Markov process. For notational simplicity, we shall henceforth refer to *Z*(*t*_{i}) as simply *Z*_{i}. Next, assume that the probability of staying in a state is linear with respect to index for an infinitesimal interval. The 2-state hidden process can be parameterized by the transition rate from a nucleosomal state to an NFR (λ) and the reverse transition rate (μ). The matrix of instantaneous transition rates is given by . The matrix of transition probabilities *P*(*t*) over an interval *t* is generated by the matrix exponential of *Q*, that is, *P*(*t*) = exp(*Q**t*). Indexing the nucleosomal (non-NFR) state by 0 and NFR state by 1, we then have the following transition probabilities:

(2.3)

For sampling parameters under the Markov chain Monte Carlo (MCMC) framework, a parameterization in terms of the log intensities is useful, that is, we take log*λ* = *θ*_{l0} and log*μ* = *θ*_{m0}. The initial distributions of the hidden states can be taken to be uniform, that is, *π* = (*π*_{0},*π*_{1}) = (0.5,0.5).

We next specify the emission densities. Since the mean intensity may vary between probe replicates in the same state, we implemented a hierarchical Gaussian model with separate means for each probe, centered around a state-specific mean with a state-specific variance:

(2.4)

where *k*{0,1}. A more general form of the model would involve probe-specific variances. Although it is not possible to estimate the variances belonging to probes in different states (since the states are unobserved), we found that (i) variances within probes were almost uniform throughout the data set (the variance of probe-specific variances was less than 0.01) and (ii) for regions in the FAIRE data corresponding to strongly predicted nucleosomes or NFRs in the data of Yuan *and others* (2005), the variability of probe variances was very small. So we settled for a less complex model which accounted for diversity among probes only in terms of the mean structure.

Underlying characteristics of the DNA sequence that affect chromatin rigidity may affect positioning of nucleosomes. In previous studies, the prevalence of polynucleotide sequences such as poly-A (repetitions of “A”) and poly-T has been shown to affect nucleosome positioning. Hence, sequence features may affect the correct prediction of nucleosomal state. We use 3 levels of models, relating the covariates to the transition rates, or emission probabilities, via link functions.

- Model M0: the original or “base” model ((2.3) and (2.4)) assuming that the covariates do not affect either the state transitions or the emissions.
- Model M1 (“transition model”): Here a multiplicative intensity model associates the covariates to the transition rates, assuming that the transition rate during the interval between 2 points depends on the value of the covariates at the end of the interval, that is, the last probe. Let
*X*_{ij}denote covariate*j*(*j*= 1,…,*d*) for probe*i*. The transition rates over the interval [*t*_{i − 1},*t*_{i}] areIn the later sections, to simplify notation, we shall denote(2.5)*λ*_{i}(*X*_{i}) and*μ*_{i}(*X*_{i}) as*λ*_{i}and*μ*_{i}. - Model M2 (“emission model”): Here the probability distribution of the observations conditional on the hidden state can be affected by the covariate measurements. We assume that the state-specific probe measurement mean can be modeled as a linear function of the covariates, that is,(2.6)
- Model M3 (“full model”): Here we assume that both the transition intensities and the state emission probabilities are affected by the covariates, that is, both expressions (2.5) and (2.6) hold.

None of these models lead to a closed-form analytical expression for the likelihood due to the hidden state variable *Z*, although this can be computed numerically through recursive techniques, as discussed in Section 3. Before the model estimation procedure, we briefly discuss identifiability in the proposed models as it relates to the validity of inference under minimally informative priors.

Nonidentifiability of a model implies that more than one set of parameter values lead to the same likelihood. In a Bayesian framework, using an informative prior usually ensures identifiability. When using flat priors to minimize prior effects, nonidentifiability can lead to serious convergence problems in MCMC and bias inference. To establish identifiability for our proposed models, we extend the results of Teicher (1967), after stating the following results.

DEFINITION Let *f*_{}(*y*) be a parametric family of densities of *Y* with respect to a common dominating measure *μ* and parameter in some set Φ. If *π* is a probability measure on Φ, then the density *f*_{π}(*y*) = ∫_{Φ}*f*_{}(*y*)*π*(d) is called a mixture density. We say that the class of (all) mixtures of *f*_{} is identifiable if *f*_{π} = *f*_{π′}*μ* a.e. iff *π* = *π*^{′}. Furthermore, we say that the class of finite mixtures of *f*_{} is identifiable if for all measures *π* and *π*^{′} with finite support, *f*_{π} = *f*_{π′}*μ* a.e iff *π* = *π*^{′}.

PROPOSITION 1 (Teicher, 1967) The class of joint finite normal mixtures is identifiable.

PROPOSITION 2 (Teicher, 1967) Assume that the class of finite mixtures of the family *f*_{} of densities of *Y* with parameter Φ is identifiable. Then the class of finite mixtures of *n*-fold product densities *f*_{}^{(n)}(*y*) = *f*_{1}(*y*_{1})*f*_{n}(*y*_{n}) with = (_{1},…,_{n})Φ^{n} is identifiable.

Proposition 2 was proved by induction on *n* (Teicher, 1967). Since any HMM is a finite mixture of *n*-fold product densities, where the weights of the mixture are functions of the transition probabilities, we applied the above results to prove the identifiability of models M0–M3.

THEOREM 1 Models M0–M3 are identifiable. In other words, if *η* denotes the total set of all parameters in any of the 4 models and *L*(*η*;*y*) denotes the likelihood of *η*, there does not exist any *η*^{′}≠*η* such that *L*(*η*;*y*) = *L*(*η*^{′};*y*) for all *y*.

Details for the proof of Theorem 1 are provided in the supplementary material available at *Biostatistics* online.

Let *η* = (*θ*,*β*,*ν*,*σ*^{2}) denote the set of all parameters. (For model M0, *θ* = (*λ*,*μ*), *ν* = (*ν*_{00},*ν*_{10}), and for all models except M3, at least one parameter becomes void.) The likelihood for *N* probes, conditional on the covariates *X* = (*X*_{1},…,*X*_{N}) and locations *T* = (*t*_{1},…,*t*_{N}), is

(3.1)

Direct evaluation of (3.1) would involve summing over all possible sequences of hidden states *z*_{1},…,*z*_{N}. Under the Markovian assumption, this can be computed recursively by a forward sum.

Expectation–maximization (EM) (Dempster *and others*, 1977) is often used to estimate HMM parameters. However, due to the complex transition probability expressions here (the complete data likelihood consists of terms arising from the transition equations (2.3)), no closed-form M-step exists, necessitating computationally intensive numerical optimization. Although analytically tractable EM steps exist for some continuous-time HMMs (Roberts and Ephraim, 2008), we chose to avoid this due to model complexity and the tendency of EM to converge to local optima. A Bayesian MCMC sampling–based approach is feasible due to the log-concave forms of many of the conditional distributions. MCMC also allows us to obtain the full posterior probability landscape of the parameters rather than a single maximum likelihood–based point estimate. Next, we discuss prior elicitation and the recursion-based MCMC algorithm.

It can be deduced from the likelihood that the conditional posteriors assume proper densities with a flat prior. The emission densities are normal, so marginalizing the joint density along the axes of emission parameters is equivalent to integrating out normal densities, eventually yielding a weighted summation of finite terms. The transition rate expressions are a finite summation of terms of the type *f*(*a*)exp( − *g*(*a*)), where *f* is a function bounded by [0,1] and *g* is a positive scalar multiple in model M0 and a monotonic function in M1. For all such functions, the integral is finite, leading to a finite summation of finite terms. However in 2 cases, if all probes are classified into either the 0 or the 1 state, integrating out with a flat prior would yield infinity, leading to an improper posterior where Gibbs sampling may fail. This can be avoided through a constrained hidden state model with the state space excluding these 2 cases, yielding a proper posterior. When sampling from the constrained model, the conditional probabilities of the hidden states (excluding the all-0 or all-1 case) would differ from the true conditional probabilities (including the extreme states) by a negligible margin because the probabilities of these 2 conditions are infinitesimally low. In our genomic data application, the log intensities are so distributed that the sampler can almost never assign zeros to all the positions simultaneously (except when we initialize one transition rate to be unrealistically small so that the corresponding transition probability is 0). In practical implementation, our sampling scheme can reach 1 of these 2 states with a low probability—if so, we can ignore the sample and resample. This is a valid step under the constrained sampling setup—using a Gibbs sampler as a proposal for a Metropolis–Hastings step that rejects the proposed move if the state sequence is constant.

Based on the above, we assumed noninformative (uniform) priors for the logarithm of the transition rates and the emission mean and variance (on the logarithmic scale) parameters of our model. We also fit the models assuming a minimally informative normal prior (with a variance of 5) on the log transition rates. Then we sequentially reduced the variance by 0.25 units to assess the change. In the first case, there was no identifiable difference in the results compared to the flat prior. When the prior variance was reduced beyond 1, the log transition rates were pushed toward 0. Hence, we chose the emission and transition parameter priors to be minimally informative, specifically, *p*(log(*λ*))1; *p*(log(*μ*))1; *p*(*ν*_{k0})1 (*k* = 0,1); *p*(*σ*_{k}^{2})*σ*_{k}^{ − 2} (*k* = 0,1); *p*(*τ*_{k}^{2})*τ*_{k}^{ − 2}; *p*(*θ*_{mj})1,*p*(*θ*_{lj})1 (*j* = 1,…,*d*); and *p*(*β*_{kj})1 (*k* = 0,1;*j* = 1,…,*d*).

Let us denote *P*_{jk}(*t*_{i + 1}|*t*_{i}) as the transition probability from state *j* to state *k*, for adjacent probes at positions *t*_{i + 1} and *t*_{i}, which is implicitly dependent on the transition rates *λ* and *μ*. The following steps constitute one iteration of the MCMC sampler:

- Use the forward algorithm (Rabiner, 1989) to recursively calculate the likelihood. This algorithm iteratively computes the likelihood of the data till a step in the sequence, with the last step being in a particular state. The first application of this algorithm for HMMs was by Baum
*and others*(1970) and later extended for a variety of models (Rabiner, 1989). Let*F*_{j}^{i}denote the “partial” likelihood until position*i*of the sequence, where position*i*is in state*j*, that is,*F*_{j}^{i}=*P*(*Z*_{i}=*j*,*Y*_{1},…,*Y*_{i}|*X*_{1},…,*X*_{i}). In our model, this is equivalent to representing the nucleosomal and NFR states. The recursive procedure to calculate the likelihood is given byThe initial conditions are(3.2)*F*_{j}^{1}=*π*_{j}*f*(*Y*_{1}|*X*_{1},*Z*_{1}=*j*,*η*)(*j*= 0,1), and the full likelihood of the entire sequence is given by*F*_{0}^{N}+*F*_{1}^{N}. - Next, we employ a backward sampling procedure (Chib, 1996), (, sscott) to get a sample of the hidden states. Conditional on the sampled states at positions
*t*_{i + 1},…,*t*_{N}, the probability that position*i*is in state*j*is*P*(*Z*_{i}=*j*|*Z*_{i + 1},…,*Z*_{N},*Y*,*X*,*η*)*F*_{j}^{i}*P*_{j,zi + 1}(*t*_{i + 1}|*t*_{i}). The probability that the last position is in state*j*is proportional to*F*_{j}^{N}. In application, the above procedures are reformulated in terms of logarithms to avoid computational underflow. - Conditional on the other parameters, hidden states, and data, update the transition parameters using a Metropolis–Hastings procedure. For models M0 and M2, this is done directly for
*θ*_{l0}and*θ*_{m0}, while for models M1 and M3 this is done for each component in turn. - Conditional on the other parameters, hidden state path, and observed data, update the emission parameters
*ν*_{k0}(*k*= 0,1) for M0 and M1 and additionally,*β*_{k}, for M2 and M3.

Details of the MCMC steps are provided in the supplementary material available at *Biostatistics* online.

The estimation procedure for the base, transition, and emission models was applied to the FAIRE data set from Hogan *and others* (2006) that comprised a total of 13 947 probes. The data structure consists of 2 elements:

- Signal: This gives the log of the hybridization ratios obtained from the microarrays. For each probe, we have 3 replicates each giving a measure of the intensity data at the particular probe. The probes are evenly distributed (i.e. with constant spacing) in the data set. The higher the intensity, the greater the chance for the probe to be in an NFR. The data were preprocessed by a
*z*-score standardization to remove skewness (Hogan*and others*, 2006). The missingness pattern in the probes was random, and the maximum length of a missing block was 13. The non-missing signal values numbered 12 760, about 88% of the total data set. - Sequence: Each probe was of length 50 nucleotides, with an overlap of 20 nucleotides with the adjacent probe. As covariates which may potentially influence the nucleosomal signal, we first extracted an initial feature set consisting of counts of all oligomers upto length 5. For example, a measurement of a covariate corresponding to the dinucleotide TA would be the number of occurrences of this dinucleotide throughout the length of the probe. Each probe thus corresponds to a point in a 1364-dimensional Euclidean space, the coordinate for each probe given by the corresponding oligomer word counts. These covariates, especially for longer size oligomers, have very sparse counts and tend to be highly correlated with counts of words which are smaller segments. To avoid collinearity and also reduce dimensionality of the covariate space, we performed principal components (PCs) analysis on these 1364 covariates for all the probes in our data set. The first 5 PCs explained about 95% of the variability in the covariate space and were retained as the final collection of model covariates.

Next, we compared the performance of the models by applying them to the nucleosomal array data. All analyses were run in the statistical software R (http://www.r-project.org/). The MCMC algorithm was initialized at values from a N(0,1) distribution for log(*λ*), log(*μ*), *θ*, *β*, *ν*_{00}, and *ν*_{10}, while the variance parameters were initialized to 1. Multiple starting points made no significant difference in the results. Autocorrelation plots indicated that convergence was attained for all models within 1000–5000 iterations, which was taken as the burn-in period; 10 000 iterations (after burn-in) were used for inference. We present a subset of the analyses that indicate most strongly the power of the new approach; more details are presented in the supplementary material available at *Biostatistics* online.

Table 1 shows that only the second PC was significant at a 5% level, in models M1 and M2. For M3, MCMC convergence was very slow, hence we did not use this model further. In M0, the *λ* and *μ* estimates imply that if a particular probe was covered by an NFR, the probability of remaining in the state is very high (0.98). This supports the hypothesis that nucleosomes are separated by long NFRs. The predicted nucleosomal states also appear to be long, which is natural as the low resolution of the data leads to detecting enrichment of long nucleosome-free regulatory regions but not of short nucleosome-free linker regions (<10 bp) between adjacent nucleosomes. In M1, the second PC was significant in both nucleosomal and NFR categories. The opposing signs of the estimates imply that the AT oligomers associated with this covariate help in continuation of the NFR subsequences and are detrimental to nucleosome formation. Computing the weights of the dinucleotide counts for this PC shows that oligomers that are important in differentiating between nucleosomal and NFR regions are A, T, AA, TT, AT, TA, AAT, ATA, ATT, TTA, TAT, TAA, TTTTA, TTTAT (all have weights >0.1: AT and TA were highest; trinucleotides were intermediate). In M2, the second PC was significant only in the NFR state.

The 95% credible intervals (CIs) for parameter estimates for models M0 (base), M1 (transition), and M2 (emission). The indices *l* (or 0) and *m* (or 1) indicate the nucleosomal and NFR states; for instance, the term *θ*_{l0} refers to the intercept term **...**

We refitted models M1 and M2 using the 14 oligomer counts associated with the second PC (Tables 3 and 4 in the supplementary material available at *Biostatistics* online). AA and TT were the most significant variables for the NFR state in M2, while none of the nucleosome parameters were significant. This suggests that the intensity data within the nucleosomes do not depend on local DNA sequence—however, in NFRs, the intensity mean is a variable function of the AA and TT dinucleotide counts. We observed that the effect of the other oligomer combinations, which were a part of the second PC, vanishes when we refit the model with oligomer counts. This implies that the effect of the second PC was borne primarily by the counts of these 2 dinucleotide categories. These results reiterate the major importance of the AA and TT dinucleotides in influencing the state lengths. Here, as in the original M1, these 2 features were important for both nucleosomal and the NFR states. Figure 1 shows the empirical cumulative distribution function of the posterior probabilities of probes being classified into the nucleosomal state in the 3 models. The flatness of the center, with sharp rises at the 2 ends, demonstrates that most probes have a very high or low chance of being classified as an NFR and are robust to the actual cutoff. The qq-plots of the nucleosomal and NFR states (Supplementary Figure 1 available at *Biostatistics* online) demonstrate no significant departures from the fitted model.

Empirical cumulative distribution function of the posterior probabilities of probes being classified into the nucleosomal state for the base model (M0), transition model (M1), and emission model (M2).

Simulation study to compare the performance of discrete-index and continuous-index HMMs under 2 gap scenarios. The numbers “1” and “2” in the column “Set” refer to results obtained from data sets simulated **...**

An important question is which model is most appropriate for the data. Computing the Bayes factor analytically would be difficult due to model complexity. A simpler alternative (approximation) is the Bayesian information criterion (BIC). It is straightforward to compute the BIC using , where is the modal posterior likelihood, *k* is the number of parameters, and *n* is the number of data points. The BIC for the 3 models are 17154.27 (M0), 14386.92 (M1), and 13337.19 (M2), showing that incorporation of sequence features is an essential part in determining the structural classification. Model M2 gave the best fit, indicating that local sequence features indeed influence the measured intensities in nucleosomal regions and NFRs; however, the sequence does not exhibit as strong an effect in determining the lengths of the state of neighboring regions.

For biological validation, we extracted a set of yeast regulatory regions from the University of California, Santa Cruz (UCSC) database (http://genome.ucsc.edu/). Although the set of all validated NFRs is not available, the presence of active TFBSs is a useful indicator that the region is likely to be nucleosome free. We compared the overlap of predicted NFRs with known TFBS locations (Fig. 2, and Table 12 in the supplementary material available at *Biostatistics* online). The continuous-index models perform excellently compared with the method of Yuan *and others* (2005): M0 and M1 have a > 90% correct classification rate and M2 has 70%. Since we do not have a corresponding database of nucleosome regions, the false-positive rate is underestimated. This may be why M2 has a lower sensitivity than M0 in this comparison, although it fits the data better. Figure 3 shows the receiver operating characteristic (ROC) for all 3 models, using the UCSC genome database as our reference, assuming the known TFBSs to be the only NFRs. The ROCs show that all models perform equivalently well. Figure 3 shows that M1 has the highest specificity in prediction. M1 allows the sequence information to locally influence the length of the nucleosomal and non-nucleosomal states. The biological significance of this mechanism needs to be explored further to elucidate the relationship between nucleosomes and sequence signals.

Comparison between Yuan *and others* (2005) and FAIRE predictions with the 3 continuous-index models. On the vertical axis, the proportion of a genomic region “matched” with the database is given by the ratio . A cutoff of 0.5 was used for **...**

We next performed simulation studies to determine the (i) consistency of model estimates under different parameter settings, (ii) detection of the correct model under model misspecification, and (iii) importance of the continuous-index assumption in modeling gapped tiling array data.

Data sets were generated first under models M0, M1, and M2. Ten data sets of size 5000 probes were generated for each parameter setting for each model, which were fit using MCMC (Section 3). In each case, the estimated bias and Mean square error (MSE) showed consistent estimation of the model parameters. Here we describe one study in detail. For M0, the emission distributions were assumed to be normal with means −0.4 and 0.4, *λ* and *μ* were set to be − 3, and emission variances were set to be 0.35. For M1 and M2, the intercept parameter for both states was set as 1. Each regression coefficient (*θ* for M1 and *β* for M2) for the 5 covariates was fixed either as 1 or 0, for both nucleosomal and NFR states. Thus, a typical *β* (or *θ*) vector would be, for example, (1,0,1,0,0,0) for the nucleosomal and (1,0,0,0,0,0) for the NFR states. For all 3 models, 2500 iterations were run for each parameter setting, with a burn-in of 500. For evaluating misclassification rates, probes having an average state membership probability greater than 0.5 were rounded off to 1, while others were set at 0. For all data sets under model M0, the maximum MSE of the emission (transition) parameters was less than 0.0001 (0.03) and the maximum misclassification rate was 0, indicating that the model was fit accurately in each case. For M1, the maximum MSE for the corresponding parameters was also 0.03, with the maximum misclassification rate being 0.1; the MSE for *θ* ranged between 0.05 and 0.09. For M2, the maximum MSE for emission and transition parameters was slightly higher (0.08 for *μ*), the maximum misclassification rate was 0.1, and the range of MSE for *β* was 0.001–0.004.

Next, we simulated data under each model and estimated parameters and states from all models in order to judge whether in each case the correct model was most accurate. Five data sets were simulated under each model. The variances were fixed to 0.48 and 0.64 for the nucleosomal and NFR states. For M0, the transition rates were − 3 and − 3. For models M1 and M2, the coefficient for the first PC in the NFR state was fixed to 1, while the others were set to 0. The simulated data sets were of size 5000; the sequence features of the first 5000 probes of the FAIRE data set were used as covariates. When the simulation and estimation models matched, the method performed very accurately, with the maximum MSE of the emission parameters *ν* and *σ* being less than 0.01 and, for *λ* and *μ*, less than 0.05, and the averaged misclassification rates also low (Table 2). In each case, the BIC chose the correct model. Also for a simulation model, say *M*_{A}, when we estimated it by a model which contains it, say *M*_{B}, we obtained a similar value of the maximal log likelihood. In an ideal scenario, the extra components in the bigger model would turn out to be 0, and we would get the same likelihood—however, this is typically not achieved in most data sets, necessitating the use of criteria like the BIC. However, in our case, we see that the estimated log-likelihood values are very close to the true value (the error margin is less than 0.01) and the expected log likelihood, obtained from the averages of the MCMC iterations are equal. One possible reason is the large size of the data set, leading to the errors in estimation being smoothened to a considerable degree and the parameters equaling the true simulation parameters with high probability. Due to MCMC convergence problems, model M3 could not be tested. However, we generated 5 data sets from this model (with similar parameter settings as above) and ran the other 3 estimation models on these data sets. The percent of correctly classified probes under models M0, M1, and M2 were 81%, 92%, and 94%, respectively. In 2 of the 5 data sets, model M1 achieved the lowest BIC and MSE, while in the other 3, model M2 was best; model M0 was weakest.

In order to show that the continuous-index HMM was an essential improvement required to fit the gapped probe data, we simulated a data set of length 5000, where the gap distribution was the same as in the FAIRE data HMM. We also simulated a second data set of 5000 measurements where the gap structure was assigned randomly. Both data sets were analyzed using both discrete- and continuous-index HMMs (Table 3). Comparing the number of predicted matches with the simulated state set, it is clearly seen that the discrete-index HMM fails to achieve as high a correct classification rate as the continuous-index model, in the gapped data framework.

We propose a novel extension of the nucleosomal region prediction method with 2 major improvements: (i) a continuous-index HMM to accommodate missing or variable-gapped probes and (ii) using underlying DNA sequence features that influence transition rates and emission densities. We have demonstrated the model advantages on FAIRE data (Hogan *and others*, 2006), now widely used for prediction of NFRs. Our methods are efficient, given their complexity—the time taken (in seconds) for 1000 MCMC iterations for models M0, M1, and M2 are 15 343, 47 007, and 18 401, respectively. We proved identifiability conditions that allow construction of a straightforward MCMC procedure with minimally informative priors. In applications, models M0 and M1 performed similarly and were efficient at predicting regulatory regions, indicating that a nonhomogeneous transition rate is probably most accurate for predicting NFRs. Estimates from models M1 and M2 show that DNA nucleotide combinations play a significant role in determining nucleosome positions: adenosine and thymine (A and T) appear most influential. Of the different combinations, the dinucleotides (AT and TA) carry the maximum weight, followed by trinucleotides. These results support the biological hypothesis that the presence of A- and T-based nucleotide combinations contributes to the rigidity of the DNA, favoring positioning of NFRs (Sekinger *and others*, 2005). Our models successfully extend the HMM methodology to gapped data sets, opening up the possibility for synergistically modeling multiple complex data sets having varying probe intervals. Recent efforts have focused on combining data from multiple probe replicates into a hidden Markov framework (Johnson *and others*, 2009) and to develop sophisticated technologies to overcome major problems of ChIP-on-chip technology by performing massive parallel sequencing (Park, 2009). However, such technologies are in preliminary stages of development and critical refinement is necessary before fully realizing their potential and limitations.

Supplementary material is available at http://biostatistics.oxfordjournals.org.

The authors are grateful to Jason Lieb and Paul Giresi for many helpful discussions and the FAIRE data and to two anonymous reviewers whose suggestions made significant improvements to the article. *Conflict of Interest:* None declared.

National Institutes of Health (National Human Genome Research Institute) (HG004946) to M.G.

- Baum lE, Petrie T, Soules G, Weiss N. A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. The Annals of Mathematical Statistics. 1970;41:164–170.
- Buck MJ, Lieb JD. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics. 2004;83:349–360. [PubMed]
- Chib S. Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics. 1996;75:79–97.
- Conlon EM, Liu XS, Lieb JD, Liu JS. Integrating regulatory motif discovery and genome-wide expression analysis. Proceedings of the National Academy of Sciences of the United States of America. 2003;100:3339–3344. [PubMed]
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.
- Gupta M. Generalized hierarchical Markov models for the discovery of length-constrained sequence features from genome tiling arrays. Biometrics. 2007;63:797–805. [PubMed]
- Gupta M, Ibrahim JG. Variable selection in regression mixture modeling for the discovery of gene regulatory networks. Journal of the American Statistical Association. 2007;102:867–880.
- Gupta M, Liu JS. Discovery of conserved sequence patterns using a stochastic dictionary model. Journal of the American Statistical Association. 2003;98:55–56.
- Hogan GJ, Lee C-K, Lieb JD. Cell cycle-specified fluctuation of nucleosome occupancy at gene promoters. PLoS Genetics. 2006 2, e158. [PubMed]
- Johnson W, Liu X, Liu JS. Doubly-stochastic continuous-time hidden Markov approach for analyzing genome tiling arrays. Annals of Applied Statistics. 2009;3:1183–1203.
- Lander ES, Green P. Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences of the United States of America. 1987;84:2363–2367. [PubMed]
- Lee W, Tillo D, Bray N, Morse RH, Davis RW, Hughes TR, Nislow C. A high-resolution atlas of nucleosome occupancy in yeast. Nature Genetics. 2007;39:1235–1244. [PubMed]
- Liu JS, Neuwald AF, Lawrence CE. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. Journal of the American Statistical Association. 1995;90:1156–1170.
- Narlikar L, Gordan R, Hartemink AJ. A nucleosome-guided map of transcription factor binding sites in yeast. PLoS Computational Biology. 2007 3, e215. [PubMed]
- Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics. 2009;10:669–680. [PMC free article] [PubMed]
- Peckham HE, Thurman RE, Fu Y, Stamatoyannopoulos JA, Noble WS, Struhl K, Weng Z. Nucleosome positioning signals in genomic DNA. Genome Research. 2007;17:1170–1177. [PubMed]
- Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989;77:257–286.
- Richmond TJ, Davey CA. The structure of DNA in the nucleosome core. Nature. 2003;423:145–150. [PubMed]
- Roberts WJ, Ephraim Y. An EM algorithm for ion-channel current estimation. IEEE Transactions on Signal Processing. 2008;56:26–33.
- Scott SL. Bayesian methods for hidden Markov models. Journal of the American Statistical Association. 2002;97:337–351.
- Segal E, Fondufe-Mittendorf Y, Chen L, Thastrom A, Field Y, Moore IK, Wang JP, Widom J. A genomic code for nucleosome positioning. Nature. 2006;442:772–778. [PMC free article] [PubMed]
- Sekinger EA, Moqtaderi Z, Struhl K. Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast. Molecular Cell. 2005;18:735–748. [PubMed]
- Stjernqvist S, Ryden T, Skold M, Staaf J. Continuous-index hidden Markov modelling of array CGH copy number data. Bioinformatics. 2007;23:1006–1014. [PubMed]
- Teicher H. Identifiability of mixtures of product measures. The Annals of Mathematical Statistics. 1967;38:1300–1302.
- Widom J. A relationship between the helical twist of DNA and the ordered positioning of nucleosomes in all eukaryotic cells. Proceedings of the National Academy of Sciences of the United States of America. 1992;89:1095–1099. [PubMed]
- Yuan GC, Liu JS. Genomic sequence is highly predictive of local nucleosome depletion. PLoS Computational Biology. 2008 4, e13. [PubMed]
- Yuan GC, Liu YJ, Dion MF, Slack MD, Wu LF, Altschuler SJ, Rando OJ. Genome-scale identification of nucleosome positions in S. cerevisiae. Science. 2005;309:626–630. [PubMed]

Articles from Biostatistics (Oxford, England) are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |