PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bioinformLink to Publisher's site
 
Bioinformation. 2009; 3(10): 454–459.
Published online 2009 August 6.
PMCID: PMC2737495

Prediction of nucleosome positions in the yeast genome based on matched mirror position filtering

Abstract

Nucleosome positioning can affect the accessibility of the underlying DNA to the nuclear environment and as such plays an essential role in the regulation of cellular processes. Specific patterns have been found in the underlying DNA sequences of the nucleosome, and one of the most important patterns includes dinucleotides distributed every 10 to 11 base pairs. Based on this property, we propose to match each dinucleotide in the sequence against its mirror occurrences for 10 to 11 base pairs on both left-hand and right­hand sides. A large number of matches in a local region will then signify the existence of a nucleosome. In this paper, we propose the matched mirror position filters for efficient matching of periodic dinucleotide patterns and computationally predict the nucleosome positions. Experimental results on the Saccharomyces cerevisiae (yeast) genome show that the proposed algorithm can predict nucleosome positions effectively. More than 50% of our predicted nucleosomes are within 35 base pairs of those detected by biological experiments.

Keywords: mirror position filter, nucleosome positioning, Saccharomyces cerevisiae, yeast genome, DNA

Background

The nucleosome, which is the primary unit of a eukaryotic chromatin, contains about 147 base pairs (bp) of DNA which are sharply bent and tightly wrapped around a histone octamer. This sharp bending occurs at every DNA helical repeat (~10 bp) where the major groove of the DNA faces inwards towards the histone octamer, and again ~5 bp away, with opposite direction, the major groove faces outward [1]. Along the chromatin, the neighboring nucleosomes are separated from each other by 10 to 50-bp long stretches of unwrapped linker DNA [Figure 1], thus, 75–90% of genomic DNA is wrapped in nucleosomes and as such affects sequence accessibility [2]. The positioning of nucleosomes along the chromatin is known to play an important role in the regulation of gene expression in eukaryotic cells.

Figure 1
Examples of nucleosomes, each containing 147 bp of nucleotides, connected by linkers of different lengths.

Extensive research has been carried out on nucleosome positioning recently. Some of the research tries to explore the mechanic properties of nucleosomes, while the others develop mathematical models for the prediction of the nucleosome positions [1,3>–5]. In their study of nucleosome placement in chromatin, Vincent Miele proposed that the physical properties of DNA may determine the nucleosome occupancy from yeast to fly [6]. Besides, I.P. Ioshikhes predicted some nucleosomes on the basis of comparative genomics while M. Yassour located the nucleosomes by analyzing microarray data [7,8]. Despite the fact that genome-wide maps of nucleosome locations have been generated [2,6,9, 10], the problem of accurately predicting the nucleosome positions computationally at high resolution remains unresolved.

Statistical analysis suggests that the periodicities of the underlying DNA sequence might help solve this problem [2]. Indeed, there is evidence that the distinctive sequence motifs recurring periodically at the DNA helical repeat facilitate the sharp bending of DNA around the nucleosome and hence favor nucleosome formation. These motifs include ~10-bp periodic AA/TT/TA dinucleotides that oscillate in phase with each other and out of phase with ~10-bp periodic GC dinucleotides [2,4,11] . Based on this property, we propose a computational method using the so-called matched mirror position filter (MMPF) for the prediction of nucleosome positions. An advantage of our method is that it does not require training data and thus expensive wet-lab experiments are not needed. The computational experiment results demonstrate that our approach can detect the positions of the nucleosome effectively. On average, more than 50% of our predicted stable nucleosomes are within 35bp of those detected by biological experiments and reported in literature.

Methodology

Nucleotide base set

Let us define the set B1 = }A, C, G, T{, which contains the four nucleotide bases. Then consequently there should be 4×4 =16 dinucleotides. S. C. Satchwell has validated that in the 16 possible dinucleotides only 10 of them are unique. He explained that it is because some of them are related to the two fold axis that passes between the two strands of the double helix, and the reverse complementary dinucleotides are considered to be equivalent [11]. Considering this factor, we define the unique dinucleotides as set B2 = {AA/TT, AT, AC/GT, AG/CT, TA, TC/GA, TG/CA, CC/GG, CG, GC}.

Dirac delta function

Assume that a DNA sequence is represented by a discrete function x(n), with x(n) єB1 and n = 0, 1, 2, …, N ~1. For each dinucleotide b єB2, a delta function is defined as in Equation 1 (see supplementary material). The notation of this delta function has been used to represent the positions of nucleotide bases [12, 13]. Here we generalize the notation to dinucleotides. For example, if x(n) = {GCAGGTACAGTCGAG}, then xAG(n) = δ(n~2) + δ(n~8) + δ(n~13).

Matched mirror position filter

Ideally, to produce a periodicity of ~10.5 bp, an impulse δ (n~nk) at each position nk should be like a double-side mirror. That is, it should reflect an impulse δ(n~nk~l) on the left-hand side 10 to 11 bp away and an impulse δ(n~nk+r) on the right-hand side 10 to 11 bp away (Figure 2). As defined in Equations (2) and (3) (see supplementary material) respectively, dL(b, nk) is the distance of δ(n~nk) to the impulse closest to the position nk~ 10.5, and dR(b, nk) is the distance of δ(n~nk) to the impulse closest to the position nk 10.5. Then, ideally we should have dL(b, nk) = 10 or 11 and dR(b, nk) = 10 or 11 and there should be many of these impulses in a nucleosome to produce the periodicity of ~10.5 bp and no such impulse in a linker.

Figure 2
Positions of dinucleotides along a DNA sequence. Ideally, a dinucleotide should have two mirror images, on the left-hand and right-hand sides respectively, 10 to 11 bp away to produce the periodicity of ~10.5 bp in a nucleosome.

In practice, dL(b, nk) and dR(b, nk) may deviate from the ideal values. To take this deviation into account, we use a matching function f(d) to measure the contributions of δ(n~nk), δ(n~nk~l) and δ(n~nkr) to the periodicity. The matching function should be large for d close to 10.5 and decrease as d moves away from this optimal value. Several choices of f(d) are shown in Figure 3. Given the matching function, the contributions of the three impulses to the periodicity can be described as Equation 4 (see supplementary material).

Figure 3
Possible choices of the matching function f(d) with d representing the distance from the current position. Diagrams in (a) and (b) show trapezoids with different base lengths. Diagram (c) shows a matching function consisting of the summation of six Gaussian ...

To detect the presence of nucleosomes, we move a window with the size of 2Ws + 1 along the DNA sequence and accumulate the contributions from all dinucleotides within the window. Consequently, the nucleosome matching score function can be defined as Equation 5 (supplementary material). In fact, our proposed method involves flexible matching of mirrored dinucleotide positions. To some extent, it is similar to the matched filter used in radar systems for detecting echo signals. Thus, our system is named the matched mirror position filter (MMPF) with S(n) being the output.

Threshold discriminant

Following the ideas discussed above, the nucleosomes can be predicted by comparing S(n) as defined in Equation 5 (see supplementary material) with a threshold distinguishing the nucleosome and the non-nucleosome. The threshold here is determined empirically by performing a lot of experiments on a number of nucleotide sequences. After a lot of experimental trials, we found that T = 1.2(2Ws + 1) is the optimal threshold. In the implementation of our proposed method, the window length is selected as 2Ws + 1=147 according to the fact that the lengths of nucleosomes in these sequences studied are mostly 147 bp [2,6,9, 10]. A window size in the range of 100 to 200 gives similar results, but the algorithm achieves the best result in terms of the false positives in the sequences studied with the window size of 147. For this window size, the threshold for the nucleosome matching score is 1.22Ws + 1) [equals, similar] 180.

Discussion

In order to verify the performance of the MMPF, we have performed nucleosome position prediction experiments on the Saccharomyces cerevisiae (yeast) genome, which is downloaded from the database of National Center for Biotechnology Information (NCBI) [14]. The genome contains 16 chromosomes with lengths from 230k to 1532k bp. We choose the yeast genome since it has been extensively studied and there are high-resolution nucleosome mapping results reported in literature [7]. A criterion similar to that in [2] is used here to compute the prediction accuracy. That is, we consider a prediction correct if the nucleosome center position determined by the MMPF is within 35 bp of that reported in the highresolution result of [7]. The prediction accuracy is then computed as the ratio of correctly predicted nucleosomes over the total number of nucleosomes predicted by our algorithm.

The prediction results achieved by using different matching functions in the MMPF are summarized in Table 1 (supplementary material). From this table, it can be seen that with the isosceles trapezoid function (upper base 9-12) the least nucleosomes can be predicted and the accuracy is the worst, while with superimposed Gaussian function (delta = 0.5) the accuracy achieves the best occasionally but its predicted nucleosomes are still less than using the isosceles trapezoid function (upper base 8-13). In consideration of both the number of correctly predicted nucleosomes and the prediction accuracy, we choose the isosceles trapezoid function (upper base 8-13) as the optimal matching function f(d) to predict nucleosome positions for further analysis.

For a better understanding of the correlation between the predicted nucleosomes and the regulatory function sites, such as promoters and the transcription factor binding sites, a comparison between the nucleosomes predicted by our method and those obtained from biological experiments and published in literature is shown in FiguresFigures4,4, ,55 and and6.6. Besides, the effect of different thresholds on the accuracy is also discussed. From Figure 4a, it can be seen that in the segment near the genes CHA1 and VAC17 of chromosome III, ten nucleosomes are reported in literature. Seven stable ones are predicted with our proposed method, and four of them coincide well with those determined in biological experiments. In Figure 4b, when the threshold is set to be 190, although the number of false positives decreases, the accuracy of the whole genomic-scale prediction degrades to ~45%.

Figure 4
Genome-wide prediction of nucleosome positions in yeast chromosome III using the MMPF with n denoting the position in the chromosome and S(n) denoting the matching score. (a) The distribution of the nucleosomes near genes CHA1 and VAC17. The upper green ...
Figure 5
Genome-wide prediction of nucleosome positions in chromosome XV and chromosome XIV using the MMPF with n denoting the position in the chromosome and S(n) denoting the matching score: (a) near gene HIS3 and (b) gene MFA2.
Figure 6
Genome-wide prediction of nucleosome positions near the transcription factor motifs [15] in chromosome XVI and chromosome III using the MMPF with n denoting the position in the chromosome and S(n) denoting the matching score.

Figure 5 shows the comparison between our predicted stable nucleosomes and those reported in literature near the genes HIS3 and MFA2 in chromosomes XV and XIV, respectively. A similar comparison of the nucleosomes near some transcription factor motifs [15] (in chromosomes XVI and III) is shown in Figure 6. The results shown in FiguresFigures4,4, ,55 and and66 suggest that the potential nucleosomes can be effectively predicted by the MMPF. Although in some segments the rate of correctly predicting nucleosomes is low (Figure 6b), this algorithm performs well on a genomic scale. Our analysis also indicates that nucleosomes may play a role in the regulation of the DNA sequences. From Figures Figures4a4a and and5,5, it can be seen that there may be correlations existing between the positions of nucleosomes and genes. It can also be seen from these figures that the nucleosomes have a strong affinity to the genes in some genomic locations and exhibit lower occupancy in other positions such as the promoters, usually found ~100 to 500 bp upstream of the start codon in the gene or in the intergenic regions. Besides, our findings in Figure 6 support the hypothesis that the functional transcription factor binding sites may be predominantly nucleosome-free [15].

Conclusion

In this paper, we have presented a computational method based on the MMPF for nucleosome position prediction. This technique, while offering a level of accuracy comparable to existing ones, has some distinct advantages. Being based on the periodicity of nucleosomes and a flexible pattern matching scheme, it is independent of a training set that must be obtained through biological experiments. The MMPF can provide a useful tool to study the eukaryotic genome chromatin structure, protein-DNA interactions, and transcriptional regulations. Our future work aims to improve the prediction accuracy for large genomic sequences based on more sophisticated mathematical models, such as probabilistic relaxation labeling [16].

Supplementary material

Data 1:

Acknowledgments

We thank Dr. Hongya Zhao for helpful suggestions. This work is supported by the Hong Kong Research Grant Council (Project CityU 123408), the National Natural Science Foundation of China (Project 60871086) and the Natural Science Foundation of Jiangsu Province, China (Project BK2008159).

Footnotes

Citation:Wu et al, Bioinformation 3(10): 454-459 (2009)

References

1. Richmond TJ, Davey CA. Nature. 2003;423:145. [PubMed]
2. Segal E, et al. Nature. 2006;442:772. [PMC free article] [PubMed]
3. Chen K, et al. Nucleic Acids Res. 2008;36:6228. [PMC free article] [PubMed]
4. Bina M. J Mol Biol. 1994;235:198. [PubMed]
5. Peckham HE, et al. Genome Res. 2007;17:1170. [PubMed]
6. Miele V, et al. Nucleic Acids Research. 2008;36:3746. [PMC free article] [PubMed]
7. Ioshikhes IP, et al. Nat Genet. 2006;38:1104. [PubMed]
8. Yassour M, et al. Bioinformatics. 2008;24:139. [PMC free article] [PubMed]
9. Yuan GC, et al. Science. 2005;309:626. [PubMed]
10. Pokholok DK. Cell. 2005;122:517. [PubMed]
11. Satchwell SC, et al. J Mol Biol. 1986;191:659. [PubMed]
12. Voss RF. Phys Rev Lett. 1992;68:3805. [PubMed]
Anastassiou D. IEEE Signal Proc Mag. 2001;18:8.
15. Harbison CT, et al. Nature. 2004;431:99. [PMC free article] [PubMed]
16. Zhao H, Yan H. Proc Int'l Multiconf Engineers & Computer Scientists; 2009. 245 pp.

Articles from Bioinformation are provided here courtesy of Biomedical Informatics Publishing Group