|Home | About | Journals | Submit | Contact Us | Français|
The nucleosome is the fundamental packing unit of DNAs in eukaryotic cells. Its detailed positioning on the genome is closely related to chromosome functions. Increasing evidence has shown that genomic DNA sequence itself is highly predictive of nucleosome positioning genome-wide. Therefore a fast software tool for predicting nucleosome positioning can help understanding how a genome's nucleosome organization may facilitate genome function.
We present a duration Hidden Markov model for nucleosome positioning prediction by explicitly modeling the linker DNA length. The nucleosome and linker models trained from yeast data are re-scaled when making predictions for other species to adjust for differences in base composition. A software tool named NuPoP is developed in three formats for free download.
Simulation studies show that modeling the linker length distribution and utilizing a base composition re-scaling method both improve the prediction of nucleosome positioning regarding sensitivity and false discovery rate. NuPoP provides a user-friendly software tool for predicting the nucleosome occupancy and the most probable nucleosome positioning map for genomic sequences of any size. When compared with two existing methods, NuPoP shows improved performance in sensitivity.
Most eukaryotic genomic DNA is wrapped in nucleosomes, which occlude and strongly distort the wrapped DNA. Accumulating evidence shows that the DNA sequence itself is highly predictive of nucleosome positioning in vivo [1-7], and that nucleosome positioning is closely related to chromosome functions [1,8-11]. A fast software tool for predicting nucleosome positioning is highly desirable.
Several statistical methods for nucleosome positioning prediction have been proposed in the literature. In  a method was proposed based on cross-correlation with a nucleosome DNA sequence signature. In  a Markov model was used together with consideration of steric exclusion and thermodynamic equilibrium. In , a support vector machine (SVM) was trained based on the differential k-mer usage between nucleosome and linker DNAs. In , the authors proposed an N-score model to discriminate nucleosome and linker DNAs using wavelet energies as covariates in a logistic regression model. In , a web-interface called "nuScore" for estimation of the affinity of histone core to DNA and prediction of nucleosome positioning was developed based on the DNA deformation energy score. In [6,7], the model from  was improved by incorporation of differential k-mer usage (most notably, poly(dA:dT) tracts, which are strongly disfavored by nucleosomes). This model can be further improved by accounting for nucleosome-nucleosome interaction .
While the nucleosomal features are universal, eukaryotic genomes vary in nucleosomal repeat length  and base composition. The nucleosomal repeat length is dictated by the length distribution of linker DNAs that separate neighboring nucleosomes, and it determines the overall nucleosome density in the chromatin fiber. The contribution of this paper is a duration Hidden Markov Model and a software tool called NuPoP for genome-wide nucleosome positioning prediction. We show that incorporation of linker length information can achieve better sensitivity in prediction. In addition, we propose a re-scaling method to adjust for base composition variation when using yeast models to make predictions for other species. A relatively superior performance of this approach is established by comparing it with other existing tools.
The hidden Markov model (HMM) has been known for decades. An excellent and famous tutorial is Rabiner's 1989 paper , in which the model, algorithms, and applications were carefully and thoroughly reviewed. A conventional HMM implicitly assumes a geometric duration distribution for each state, which can be wrong in real applications. Modeling the explicit duration of each state can improve the prediction of HMMs (e.g., [14-17]). We model each chromosomal DNA sequence with a duration hidden Markov model (dHMM) of two oscillating states: nucleosome (N) and linker (L), where the nucleosome state has a fixed length of 147 bp, and the linker state has a variable length. We assume that at the end of each state, the chain must transit to the other state; additionally, a complete chromatin sequence must start with and end in a linker state. We trained a 4th order time-dependent Markov chain for the the N state, and a homogeneous 4th order Markov chain for the L state to distinguish the k-mer usage preferences for k up to 5 between the nucleosome and linker states as shown in other methods, e.g., [3,6] (see below for details).
Let e = e1, ..., e147 be a nucleosome DNA sequence. Let PN be the probability of observing e as a nucleosome, computed as the product of probabilities for both Watson and Crick strands under the 4th order Markov Chain model. We assume that the linker DNA length of a given species has an unknown distribution FL(k) defined for k = 1, ..., τL (the maximum linker length we allow). An observed linker DNA sequence e = e1, ..., ek carries two pieces of information, the length is k bp, and given which, the emitted letters are e1, ..., ek. Let GL(e|k) denote the homogeneous Markov chain model for the linker DNA (again including both strands). Then observing e as a linker DNA has probability
Suppose x = x1, ..., xn is a genomic DNA sequence of length n, where xi = A/C/G/T. Let z = z1, ..., zn be the corresponding hidden state path, where zi = 1 if xi is covered by a nucleosome state, and 0 otherwise. Suppose that the path z = z1, ..., zn partitions x into k consecutive nucleosome or linker state blocks, in which the nucleosome blocks have a uniform length of 147 bp, whereas the length of linker blocks may vary. We denote these blocks as y = y+, ..., yB, and their state identification as s = s1, ..., sB, where si = 1 if yi is nucleosome state, and 0 otherwise. The probability of observing (x, z) is given by
where π0(s1) and πe(sB) stand for the probabilities that the chain initializes and ends with state s1 and sk respectively, and I is an indicator function. Since we assume that a complete chromatin sequence must start with and end in a linker state, π0(s1 = 0) = πe(sB = 0) = 1. We define the nucleosome occupancy at a specific position i, denoted oi, as the posterior probability that zi = 1, i.e.,
We also define the histone binding affinity score at position i as the log likelihood ratio for the region xi-73, ... xi, ..., xi+73 to be a nucleosome vs. a linker, i.e.,
Given the models PN, GL and FL, the optimal path z can be found by the standard Viterbi algorithm, and the nucleosome occupancy score can be estimated using forward and backward algorithms.
We utilized the 503,264 yeast nucleosome DNA reads from 454 pyrosequencing published in  for model training and assessment. Among 371,914 reads that each were mapped to a unique region of the yeast genome, we first selected reads of length between 146 and 149 bp. If multiple such reads existed for the same nucleosome, we selected the one with the highest BLAST score. The resulting non-redundant set of 18,547 nucleosome sequences were center aligned to train the nucleosome model PN. The 4th order time dependent Markov chain can be defined by the base composition at the first position qN(x1), and the transitional probabilities qN(x2|x1), qN(x3|x1, x2), qN(x4|x1, x2, x3), qN(xk|xk-4, xk-3, xk-2, xk-1), for k = 5, ..., 147, xi = A/C/G/T, i = 1, ..., 147, where the subscript k, i index the positions within a nucleosome. These probabilities are trained using the corresponding observed fractions or conditional fractions based on the center alignment, with a three bp moving average (as explained in [1,18]). We further identified 8,090 reads-free regions of length 7-500 bp to train the linker state model GL. The 4th order homogeneous Markov model for the linker DNAs can be completely defined by the stationary base composition qL(xi), and the transition probabilities qL(xi|xi-1), qL(xi|xi-1, xi-2), qL(xi|xi-1, xi-2, xi-3), qL(xi|xi-1, xi-2, xi-3, xi-4). By "homogeneous", we mean that these probabilities are all constants as functions of i. These probabilities were trained using their observed values as in the nucleosome model. For example, qL(xi|xi-1, xi-2, xi-3, xi-4) was trained by calculating the fraction of occurrences of transitions from any four letters to the fifth letter in the selected putative linker DNA sequences.
Our initial nucleosome/linker model was trained using the yeast data. A complication arises when predicting nucleosomes for other species because organisms may differ significantly in their DNA base composition. We propose to scale up or down the probabilities in the Markov models by a factor determined by the difference of the base composition between the current species and yeast. For example, in C. elegans, the fraction of A plus T bases is 0.645 compared to 0.617 in yeast. For a specific transition probability qN(A|....) at any specific nucleosomal position defined for yeast, we scaled it up as qN(A|....) × 0.645/0.617 for the corresponding transition probability at that given position for C. elegans. Likewise the transition probabilities for G/C will be scaled down by a factor of 0.355/0.383.
All the re-scaled probabilities are then normalized. The same re-scaling applies to the linker model. We shall use simulations below to show that re-scaling improves prediction regarding sensitivity and false discovery rate. Using the trained nucleosome model (PN) and linker model (GL), we further train the linker DNA length distribution as follows. We assume that the linker DNAs in any given species or cell type have a maximum length τL = 500 bp.
This algorithm contains the following steps:
1. Initialize the algorithm with a uniform distribution for FL(k) for k = 1, ... τL where τL is the maximum allowable linker length.
2. Use the forward and backward algorithm to obtain the posterior expectation of FL(k) for each k. For a sequence x = x1, ..., xn, let nk be the number of linker DNAs of length k. Then
for k = 1, ..., τL .
for k = 1, ..., τL .
3. Update the empirical linker length distribution from step 2 using a kernel smoothing method as follows:
where K is the standard Gaussian kernel and h is the bandwidth parameter optimally chosen by the leave-one cross-validation method as in .
4. Use the updated linker length distribution from step 3 to compute the nucleosome occupancy and optimal positioning.
Compared to Viterbi training (i.e., using linker length predicted from the Viterbi algorithm), using the posterior expectation obtained in Eq. (1) combined with the kernel method in Eq. (2) performs overwhelmingly better in minimizing the summed square errors (unpublished work ). In the developed software tool NuPoP, we have trained the linker DNA length distributions for 11 different species including human, mouse, rat, zebrafish, D. melanogaster, C. elegans, S. cerevisiae, C. albicans, S. pombe; A. thaliana and maize. The linker DNA length distribution (FL) for each species has been trained by scanning the corresponding genome sequences based on τ L = 500. We found that the re-scaled nucleosome and linker profiles, together with the trained linker length distribution, not only roughly recover the genome-wide base compositions, but also the dinucleotide frequencies for different species. The frequency of each single or di-nucleotide in simulated genomes typically differs by ≤ 1% from that observed in the corresponding real genomes (results not shown). As different cell types from the same organism (with the same genome) can exhibit quite different linker DNA length distributions , a useful future refinement would utilize high quality nucleosome maps for the given cell type, when such data become available.
We have developed a software tool called NuPoP, implemented in three different formats: an R package tested for Windows XP, Linux and Mac OS X; a stand-alone Fortran program; and an NuPoP web server, all available from http://nucleosome.stats.northwestern.edu. The R package is built upon the Fortran program. It provides additional handy functions to visualize the resulting Viterbi (most probable nucleosome position map) and nucleosome occupancy predictions. Both the R package and Fortran program can handle a genomic sequence of any length with a RAM demand <400 M bytes. The predicted results are stored locally in the working directory. The web server provides an interface through which the user can submit their own sequence up to 500 K bp in length for fast online prediction. When making a prediction, the user is required to specify which species the genomic sequence is from. If the species is not on the list, NuPoP will calculate the base composition of the input DNA sequence and then choose the nucleosome and linker models from a species that has the most similar base composition. An alternative model with a 1st order time-dependent Markov chain for the nucleosome state and a homogeneous 1st order Markov model for the linker state, trained in the same way, is also implemented in NuPoP as an option.
Updating the linker length distribution not only helps recover the true nucleosome density, but also improves prediction. We demonstrated this by simulation as follows. We simulated 10 genomic sequences with the 4th and 1st order yeast models respectively, each containing 10,000 nucleosomes and 10,001 linkers. The linker DNA length was simulated from a Normal distribution (μ = 100, σ = 20) and a Gamma distribution (α = 1, β = 1/40). If a nucleosome is predicted within ±35 bp of a true nucleosome, we define it as a correct positive prediction. The rate of the correct positive prediction, referred to as sensitivity, is defined as the percentage of the 10,000 nucleosomes that are correctly predicted. In addition, we include the false discovery rate (FDR), defined as the fraction of the predicted nucleosomes that reside > ±35 bp away from any true nucleosomes, as the second measure for model performance. In analogy to statistical hypothesis testing, the sensitivity measures the power of prediction, while the FDR measures the fraction of type I errors in the positive claims. The results are presented in Table Table11 and and2.2. In both cases, the linker length was initialized as a uniform distribution with τL = 200. Compared to the Gamma distribution, the Normal distribution is relatively flatter. Thereby updating the linker length distribution did not significantly change the total number of predicted nucleosomes (or nucleosome density). The sensitivity increased on average by ~ 4-5% and the FDR dropped by ~ 5% after one update (for both the first and fourth order models). Further updating continued to improve the prediction until it stabilized after four iterations. In contrast, the Gamma model is much more skewed. The uniform linker length distribution resulted in an under-estimation of the total number of nucleosomes. By four updatings, the sensitivity increased by 8%, while FDR remained relatively more stable.
We also observe that under the same setting and condition, the fourth order model performs slightly but uniformly better than the first order model in both sensitivity and FDR. This is given that the true models are known and we scan the sequence using the true models. In theory, the first order Markov chain model is nested in the fourth order model. Therefore if the true model is the first order, a well trained fourth order model will have the same prediction power as the first order model, but not vice versa. Since training a higher order Markov chain model requires more data, inadequate training can undermine the prediction power.
To illustrate the advantages of re-scaling, we re-scaled the yeast profiles according to the base composition of the maize genome (G/C scaling factor in maize is 1.2). Using the re-scaled profiles we simulated 10 genomic sequences that each contain 10,000 nucleosomes and 10,001 linkers. The linker DNA length followed the same two distributions as in Table Table11 and and2.2. We compare the prediction results from the scaled and non-scaled yeast profiles in Table Table33 and and4.4. We found using the re-scaled ("correct") profile yields a lower FDR than using the yeast profile. In addition, updating the linker length under the correct profile consistently improves the sensitivity and FDR until prediction stabilizes. In contrast, while using the yeast profile to scan the simulated maize-like genome, the prediction drastically deteriorates as the linker length updating proceeds. The same simulation was repeated on other species including human and C. elegans, where the base composition is similar to yeast (A/T scaling factor is 0.96 for C. elegans, and 1.03 for human). Unsurprisingly, the results from the scaled yeast profile were still better than those from the original yeast profile in terms of both sensitivity and FDR, while the difference is much smaller than for the maize case (results not shown).
We briefly assess the prediction performance of NuPoP by comparing it with two existing methods: the N-score method of  (results kindly provided by Dr. G. Yuan, personal communication) and the Markov model/thermodynamic equilibrium method of  (to be called MM/TE method below). As the exact genome-wide nucleosome positioning map is unknown, we utilize the 371,914 454 high-throughput sequence reads to identify well-defined nucleosomes. We first selected sequences of length between 130-160 bp and constructed a reads-based occupancy map. The reads-occupancy score at a specific position is defined as the number of reads that covered this position. Then we calculated the moving average of this reads occupancy score using a 147 bp window. A sharp peak in the average occupancy curve indicates a nucleosome with well-defined positioning. Considering that the average linker DNA length is 20 bp in yeast , we quantified the sharpness of the peak by calculating the slope from the peak point to the up-/down-stream 20 bp point on the average occupancy curve. We set one condition for the peak to be selected as the center of a nucleosome to be that the absolute value of slope from either side should be > 0.01. Secondly, we required that the peak height itself must be ≥ 1.9, i.e., a well-defined nucleosome must be testified by at least two well-overlapped reads. We chose the threshold value as 1.9 instead of 2.0 because the overlap of the two reads can be less than 147 bp, resulting in a peak on the moving average curve slightly lower than 2.0. With these criteria, a total of 20,471 well-defined nucleosomes are selected from the 16 chromosomes of yeast. A snapshot of a region with many selected well-defined nucleosomes is presented in Figure Figure1.1. Figure Figure22 provides a snapshot of nucleosome occupancy predicted by NuPoP together with the reads-occupancy.
We first assess the sensitivity of predictions from NuPoP using the well-defined control set. If there is a predicted nucleosome within ±k bp of any well-defined nucleosomes (center to center), we count this as one correct prediction. We varied k from 5, 10, ... 70, 73 to investigate the sensitivity behavior at different precision thresholds. The N-score model predicted 48,394 nucleosomes. The current software tool for MM/TE method does not provide Viterbi predictions, but only the nucleosome occupancy scores. Therefore we calculated the moving average of the occupancy score using a 147 bp moving window. The resulting peaks were treated as the centers of predicted nucleosomes. If two peaks reside within 127 bp, we discarded the one with smaller moving average of occupancy score. This procedure identified 43,979 predicted nucleosomes (if we had required two nucleosomes to be 147 bp away, even fewer nucleosomes would have been identified). Likewise, using the occupancy scores from NuPoP, we identified 52,327 and 51,380 predicted nucleosomes under the 4th and 1st order models respectively. In Figure Figure3a,3a, we compare the sensitivity estimates from the 4th order model of NuPoP with the N-score and MM/TE methods at different threshold values of prediction accuracy (the fourth order model from NuPoP performs better, but very slightly, than the first order model. Hence the latter was omitted in Figure Figure33 for better presentation). As the sensitivity tends to increase with an increase in the total predictions, we further selected 43,979 and 48,394 best predictions from NuPoP model (here "best" is in the sense of the largest sum of occupancies over 147 bp) to compare with the N-score method and in Figure Figure3b3b and and3c3c respectively.
The sensitivity results suggest that the predictions from NuPoP outperforms the other two methods in two senses. Firstly, the sensitivity from NuPoP is 4.9-8.3% higher than the other two methods at different threshold values (Figure (Figure3a).3a). Secondly, while controlling the total predictions to be the same, NuPoP has ~ 3.2-5.3% higher sensitivity than the MM/TE method (Figure (Figure3b),3b), when the precision threshold is ≤ ±35. As the precision threshold gets less stringent, the difference attenuates and eventually vanishes. The contrast between NuPoP and the N-score method is even larger as shown in Figure Figure3c3c.
As a further comparison, we computed the predictions of the dHMM method under a uniform linker length distribution defined on 1, 2,..., 500. This method predicted 48,334 nucleosomes under the 4th order models, achieving a sensitivity 3.0-6.6% higher than the MM/TE method (Figure (Figure3a).3a). When controlling the total predictions to be the same as MM/TE method or N-score method, the resulting sensitivity curve almost perfectly overlaps with that from NuPoP. Therefore we omitted these results from Figure Figure3b3b and and3c3c.
One could further attempt to evaluate the false positive rate (FPR), measuring the fraction of linker regions that were falsely classified as nucleosome regions (or similarly the false discovery rate, FDR). This task requires well-defined linker regions. A problem, however, is that the average length of linker DNAs in yeast (20 bp; ) is smaller than the dispersion in lengths of the nucleosome DNAs as isolated biochemically (which is often 30-50 bp full width at half maximum, notwithstanding that the nucleosome as defined crystallographically has precisely 147 bp of DNA). Thus existing nucleosome maps lack the precision needed to define such short linker DNAs. Moreover, various sampling biases such as the DNA sequence preferences of the micrococcal nuclease used to liberate nucleosomes biochemically (which preferentially cleaves A/T rich regions) could yield longer genomic regions that are free of recovered nucleosome DNA reads even if they are actually nucleosome occupied . Attempts to evaluate the FPR given these problems in the data could result in misleading conclusions. For these reasons, FPR evaluation is not pursued in this paper.
The duration Hidden Markov model proposed in this paper is a generic model for the oscillating structure of nucleosome and linker DNAs in chromatin fiber. The Markov models can be replaced by any other models for the nucleosome and linker states. The kernel method for linker length training is nonparametric and typically robust. We showed in the simulation that updating the linker length distribution iteratively improves sensitivity and FDR in prediction if appropriate nucleosome and linker models are used. In particular, the first iteration often achieves the most pronounced improvement. In contrast inappropriate nucleosome and linker models could lead to the opposite outcome, as shown in the simulation studies (Table (Table33 and Table Table4).4). In reality, the genomic DNAs are complicated by their biological functions. The models trained based on typical nucleosomes or linkers may not well fit some special genomic regions like repeated elements. To avoid possible risks due to such complications, we trained the linker length distribution less greedily by using only one iteration in NuPoP.
Limitations may still exist in the model training and assessment used in this study. The MNase is known to have strong preference to cleave dinucleotides containing only A/T . Consequently the MNase-mapped nucleosome sequences can be systematically biased in some regions. This bias could undermine the prediction power because of the dampened signal in the trained nucleosome model. The systematic bias may also exist in the well-defined nucleosomes, causing inaccuracy in sensitivity estimation. A better map of nucleosomes is highly desirable for both purposes. For species other than yeast, we currently lack high-quality genome-wide nucleosome sequence data (e.g., like the 454 reads) for model training and model validation. The advantages of the re-scaling method shown using simulation in this paper need to be further assessed once such high-quality data becomes available. Moreover, the results from different methods in this paper were all based on the default settings. The N-score method was originally trained based on a much smaller set of nucleosome and linker sequences. A better training using a larger set could improve this method's predictions. In addition, different settings in the N-score or MM/TE methods can lead to different predictions, which we did not further investigate here. Finally, the software for MM/TE method only provides the occupancy score. Different ways to call a predicted nucleosome based on the occupancy score might lead to different conclusions.
Finally, we address the question of which subset of the available 454 reads data might best be used for training the nucleosome model. In NuPoP, we trained the nucleosome model using selected non-redundant nucleosome reads of length within a short range (146-149 bp), to retain strong high resolution nucleosome sequence signatures, e.g., the _10 bp-periodic dinucleotide signals. As comparisons, we trained two additional nucleosome models: one using the selected non-redundant reads of length 122-177 bp (retaining the non-redundancy but yielding far more training data), and the other using all reads of length 122-177 bp. The resulting models both contain the k-mer usage information that distinguishes nucleosomes from linkers (e.g., [3,4]), while the dinucleotide signals in these models are much weaker due to poor alignment of these reads. Furthermore, as the reads count at a nucleosome site is heavily biased by the G/C content due to MNase specificity and other effects in the experiment, the model trained from the redundant reads tends to be over-enriched in G/C. When combined with the linker model from NuPoP, the two alternative nucleosome models yielded comparable sensitivity as NuPoP in predicting the approximate positioning of nucleosomes, assessed based on the 20,471 well-defined nucleosomes used above. This comparison, however, is not sensitive to spatial precision of the predictions. Therefore, we asked further, given that a nucleosome is predicted within ±73 of a true nucleosome, which model predicts the location more accurately? To investigate this, we simulated genomic sequences using the nucleosome and linker models from NuPoP. We compared the prediction from the three models and found that the true model with strong signals achieves much better prediction accuracy than the two alternative models. For example, 16.1% of the predictions from the true model were prefect (with 0 bp offset), compared to 8.7% and 5.9% respectively from the other two models (results not shown).
The dHMM model proposed in this paper is effective in characterizing the oscillating structure of nucleosome and linker DNAs in chromatin fiber. Explicit modeling of linker length improves the prediction of nucleosome positioning regarding sensitivity. The developed software tool NuPoP provides a user-friendly interface for predicting nucleosome occupancy and the most probable nucleosomes positioning map genome-wide.
NuPoP software tools are freely available from http://nucleosome.stats.northwestern.edu. The R package shall be made available through bioconductor http://www.bioconductor.org upon publication. To run the NuPoP Fortran stand-alone program, a Fortran compiler is required. For the NuPoP R package, an R version later than 2.9 is required.
LXi did all the data analyses. JPW, JW, and LXi wrote the paper. JPW and JW directed the research. YFM conducted all lab work for data generation and validation. JPW and LXi developed the NuPoP software tools. LXia and JF implemented the NuPoP web-server. All authors read and approved the final manuscript.
The research is supported by NIH grant R01GM075313 and NCI grant U54CA143869. We acknowledge with gratitude the gift of a parallel sequencing run from Roche/454 Life Sciences, and thank Ms. Jolene Osterberger (Roche/454 Life Sciences) and Dr. Nadereh Jafari (Northwestern University) for arranging this. The authors would also like to thank Drs. Eran Segal, Guocheng Yuan, Zhiping Weng and Yutao Fu for help in providing data and their codes.