|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: WC HL. Performed the experiments: WC HL PMF CD YCZ. Analyzed the data: WC HL PMF CD KCC. Contributed reagents/materials/analysis tools: WC HL. Wrote the paper: WC KCC.
Nucleosome positioning has important roles in key cellular processes. Although intensive efforts have been made in this area, the rules defining nucleosome positioning is still elusive and debated. In this study, we carried out a systematic comparison among the profiles of twelve DNA physicochemical features between the nucleosomal and linker sequences in the Saccharomyces cerevisiae genome. We found that nucleosomal sequences have some position-specific physicochemical features, which can be used for in-depth studying nucleosomes. Meanwhile, a new predictor, called iNuc-PhysChem, was developed for identification of nucleosomal sequences by incorporating these physicochemical properties into a 1788-D (dimensional) feature vector, which was further reduced to a 884-D vector via the IFS (incremental feature selection) procedure to optimize the feature set. It was observed by a cross-validation test on a benchmark dataset that the overall success rate achieved by iNuc-PhysChem was over 96% in identifying nucleosomal or linker sequences. As a web-server, iNuc-PhysChem is freely accessible to the public at http://lin.uestc.edu.cn/server/iNuc-PhysChem. For the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics that were presented just for the integrity in developing the predictor. Meanwhile, for those who prefer to run predictions in their own computers, the predictor's code can be easily downloaded from the web-server. It is anticipated that iNuc-PhysChem may become a useful high throughput tool for both basic research and drug design.
In eukaryotic cells, genomic DNA is highly compacted into several levels of chromatin structures that ultimately make up the chromosomes. At the lowest level of compaction, a ~147 bp DNA sequence is tightly wrapped around the histone-octamer core (Fig. 1) into the elementary structural unit of chromatin, known as nucleosome . The packaging of DNA around the histone-octamer modulates the accessibility of genomic regions to regulatory proteins. There are close relationships between nucleosome positioning and key cellular processes, as demonstrated in mRNA splicing, DNA replication, and DNA repair , , . Consequently, revealing the mechanism involved in controlling nucleosome positioning is fundamentally important for in-depth understanding the subsequent steps of gene expression.
High-resolution genome-wide nucleosome maps are now available for several model organisms, such as Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens , , , , . These high-resolution data provide unprecedented opportunities for further investigating the roles of nucleosome positioning in gene regulation. However, experimental approach is expensive to perform genome-wide analysis of nucleosome distribution. In this regard, computational methods can be applied to the entire genome without this kind of disadvantage. Since the report of the nucleosome positioning code (~10 bp repeating pattern of dinucleotides AA-TT-TA/GC) in yeast , lots of theoretical works have been done attempting to elucidate nucleosome occupancy signals that determine the preference of a particular region in binding to histones and forming a nucleosome , , . Although of great interest and value, sequence-based predictions of nucleosome positioning have been limited in their accuracy and resolution, and to which extent nucleosome positioning in vivo is really dictated by the DNA sequence  is still an issue of controversy .
It was reported by Miele et al.  that DNA physical-chemical properties may determine nucleosome occupancy. Moreover, the recent study by Nozaki et al.  also suggested the existence of a highly bendable, fragile structure for nucleosomal DNA, implying that nucleosomal sequences indeed have distinct structural properties when compared with linker sequences.
In view of this, the present study was initiated in an attempt to develop a new method for predicting nucleosomal sequences based on the physicochemical properties of DNA.
According to a recent review , to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (1) construct or select a valid benchmark dataset to train and test the predictor; (2) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (5) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe how to deal with these steps.
The reference genome sequence of Saccharomyces cerevisiae was obtained from the Saccharomyces Genome Database (http://www.yeastgenome.org/). The nucleosome positions of Saccharomyces cerevisiae were derived from the published data obtained by Lee et al. , where each of the 1,206,683 DNA fragments in the dataset constructed by these authors was assigned a nucleosome formation score using a lasso model, with the high or low score to reflect its high or low propensity in forming nucleosome, respectively. The low score can also be interpreted as the propensity to inhibit the formation of nucleosome. To prepare a high quality benchmark dataset, 5,000 fragments of 150 bp with the highest scores were selected as the nucleosome-forming sequence samples to construct the positive set , and 5,000 fragments of 150 bp with the lowest scores were selected as the nucleosome-inhibiting (or linker) sequence samples to construct the negative set ; i.e., the benchmark dataset in this study can be formulated as
where represents the symbol for “union” in the set theory, and
For the convenience of readers, the 5,000 sequences in and 5,000 sequences in are given in the Information S1.
Owing to their important roles in various different biological processes, the intrinsic physicochemical properties of DNA sequences have been intensively studied by many investigators , , , , . In the present study, the following twelve DNA physicochemical properties are to be considered: (1) A-philicity , (2) base stacking , (3) B-DNA twist , (4) bendability , (5) DNA bending stiffness , (6) DNA denaturation , (7) duplex disrupt energy , (8) duplex free energy , (9) propeller twist , (10) protein deformation , (11) protein-DNA twist , and (12) Z-DNA .
In order to quantitatively analyze the physical and chemical properties of the DNA sequence samples, we firstly converted the retrieved nucleosomal and linker sequences into numerical profiles according to the following schemes as validated by Florquin et al. . The detailed procedures are as following steps.
For any 2 base pair (bp) piece of DNA, there is a corresponding numerical value associated with any one of the aforementioned 12 physicochemical properties. Since the values of the 12 properties were at different levels, to make them easier to be handled, we normalized them into the range by means of the following equation
where is the original value of the i-th DNA physicochemical property (i=1, 2, …, 12) for the j-th (j=AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) dinucleotide (see Information S2); while the corresponding normalized value (see Table 1).
By means of a sliding window ,  approach with a window size of 2 bp and a step size of 1 bp, a DNA sequence was replaced by the corresponding normalized physicochemical values. Thus, each of the sequences in was translated into a numerical vector consisting of components, i.e., a 149-D (dimensional) numerical vector.
After going through the above step with all the 12 physicochemical properties, each of the sequences in was translated to 12 different 149-D vectors corresponding to the 12 physicochemical features. By combining the 12 vectors, we obtain an integrated vector containing components; i.e., each of the nucleosomal sequences in can be formulated as a 1788-D vector
where the first 149 components were derived from the property “A-philicity” or P(1), the second 149 components from the property “base stacking” or P(2), the last 149 components from the property “Z-DNA” or P(12) (cf. Table 1), and T the transposing operator.
The covariant discriminant (CD) or quadratic discriminant (QD) function has been widely used in the realm of bioinformatics, such as protein structural class prediction , , , protein coding region identification , protein subcellular location prediction , , splice site prediction , membrane protein type and location prediction , out membrane protein prediction , enzyme family class prediction , antimicrobial peptide classification , and prediction of protein cellular attributes .
Its formulation can be briefly described as follows. Suppose the standard feature vectors for the DNA sequences in and are, respectively, expressed by
where is the u-th component of the feature vector for the k-th sequence in the positive dataset , that for the k-th sequence in the negative dataset , the total number of DNA sequences in , that in , and the total number of components in a feature vector. For the current case, we have (cf. Information S1) and (cf. Eq.4).
Thus, whether a query DNA sequence belongs to the nucleosome-forming subset or nucleosome-inhibiting subset will be judged by
where is the argument of that minimizes , which is defined by
is the covariance matrix  for the subset , the elements in are given by
is the inverse matrix of , and is the determinant of the matrix . Therefore, the covariance discriminant function is also called the modified Mahalanobis discriminant function , . More description about the covariance discriminant function and its application in biology can be found in a review .
In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling (K-fold cross-validation) test, and jackknife test. However, as elaborated in ref.  and demonstrated by Eqs.28–32 of , among the three cross-validation methods, the jackknife test is deemed the least arbitrary that can always yield a unique result for a given benchmark dataset, and hence has been increasingly used and widely recognized by investigators to examine the accuracy of various predictors (see, e.g., , , , , , , ). However, since the current study would involve feature selection as described below, to reduce the computational time, the 5-fold cross-validation test would be adopted as done by many investigators using SVM (Support Vector Machine) as the prediction engine.
Also, to use a more intuitive and easier-to-understand method to measure the prediction quality, according to the definition , , the rates of correct predictions for the nucleosome-forming dataset and the nucleosome-inhibiting dataset are respectively defined by
where is the total number of nucleosome-forming sequences concerned and the number of nucleosome-forming sequences missed in prediction; the total number of nucleosome-inhibiting sequences concerned and the number of nucleosome-inhibiting sequences missed in prediction. The overall success prediction rate is given by
It is clear from Eqs.12–13 that, if and only if none of nucleosome-forming sequences and nucleosome-inhibiting sequences are mispredicted, i.e., and , we have the overall success rate . Otherwise, the overall success rate would be smaller than 1.
Inclusion of redundant and noisy information would cause poor prediction results and increase computational time. To improve the prediction quality and gain deeper insights into the physicochemical properties of nucleosomal sequences, we performed feature selection using the wrapper-type feature selection algorithm called “fselect.py”, which can be downloaded at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools. The basic idea of this algorithm is to rank each of the features involved according to a score as elaborated by Chen and Lin . The ranked feature with a higher score indicates that it is a more highly relevant one for the target to be predicted. Based on the ranked features, we used the Incremental Feature Selection (IFS)  to determine the optimal number of features. During the IFS procedure, features in the ranked feature set were added one by one from higher to lower rank. A new feature set was composed when one feature had been added. Thus, the N feature sets thus formed would be composed of N ranked features. The feature set can be formulated as
For each of the N feature sets, a CD prediction model (cf. Eq.7) was constructed and examined with the 5-fold cross-validation on the benchmark dataset. By doing so, we obtained an IFS curve in a 2D Cartesian coordinate system with index as its abscissa (or X-coordinate) and the overall success rate as its ordinate (or Y-coordinate). The optimal feature set is defined by
with which the IFS curve reaches its peak. In other words, in the 2D coordinate system, when the value of is the maximum. Thus, we can use the features in Eq.15 to build the final predictor.
The predictor established by going through all the above procedures is called iNuc-PhysChem. Meanwhile, a user-friendly web-server for the predictor was also established as will be describe at the end of the paper.
Different from the previous methods , ,  that were mostly based on the sequence compositional features, we carried out a graphic profile comparison between nucleosomal and linker (non-nucleosomal) sequences in order to explore the specific features possessed by nucleosomal sequences. Using graphic approaches to study biological problems can provide an intuitive picture or useful insights for revealing complicated relations in these systems, as demonstrated by many previous studies on a series of important biological topics, such as enzyme-catalyzed reactions , , , , inhibition of HIV-1 reverse transcriptase , , protein folding kinetics , drug metabolism systems , and using wenxiang diagram or graph  to study protein-protein interactions , , . To introduce graphic approach for the current study, let us use the conversion scheme  to transform the nucleosome and non-nucleosome sequences into the numerical vectors (cf. Eq.4). To intuitively show the difference between these two different types of sequences, a graphic expression of the standard feature vector (cf. Eq.5) for the nucleosomal sequences and that for the non-nucleosomal sequences are given in Fig. 2, which consists of 12 panels corresponding to 12 physicochemical properties of DNA sequences (cf. Section 2 of Materials and Method). The curves in the “A-philicity” panel reflect the first 149 components in the two standard feature vectors, those in the “base stacking” panel reflect the second 149 components, and so forth. It is interesting to note that, except for the “B-DNA twist” panel and “Protein-DNA twist” panel, the differences between the nucleosomal and non-nucleosomal sequences are quite remarkable in all the other 10 panels. These findings suggest that the two physicochemical properties might play a less role in distinguishing nucleosomal and non-nucleosomal sequences than the other 10 properties.
In order to compare the 12 physicochemical properties for the classification performance, the feature vector Eq.4, standard vector Eq.5, and classifier Eq.7 were reduced from the original 1788-D working space to twelve 149-D sub-working spaces. Each of the sub-working spaces corresponds to one of the 12 physicochemical properties. Shown in Fig. 3 are their success rates in the classification performance when examined by the 5-fold cross-validation on the benchmark dataset . As can be seen from Fig. 3, the success rates obtained by using the “B-DNA twist” and “protein-DNA twist” properties are indeed remarkably lower than those by most of the other properties, quite consistent with the graphic profile analysis of last section.
To identify the key features for nucleosomal sequence prediction, we used the wrapper-type feature selection algorithm and IFS approach as described in Section 5 of Materials and Method.
By adding the ranked features one by one according to the scores calculated by fselect.py, we built 1,788 individual CD predictors for the 1,788 sub-feature sets. We then tested the prediction performance for each of the 1,788 predictors and plotted the IFS curve as shown in Fig. 4, from which we can see that, when the top ranked 884 features were used, the overall success rate reached its peak, i.e., (cf. Eq.13), with for the nucleosome-forming sequences and for the nucleosome-inhibiting sequences (cf. Eq.12).
In other words, we have (cf. Eq.15) and the optimal feature set for the current biological system should be
To provide an overall view, a distribution of the 12 physicochemical features and their roles for the prediction model is given in Fig. 5, where the green boxes indicate the features that were not contained in the optimal feature set S884. The red and purple boxes indicate the features that were included in the optimal feature set S884: features in red boxes were positively correlated with nucleosomal sequences, while those in purple boxes were negatively correlated with nucleosomal sequences.
Based on the 2-mer absolute frequency of nucleotides, Zhang et al.  proposed a model to distinguish nucleosomal and linker sequences. When tested by the 5-fold cross-validation on the benchmark dataset, their method achieved an overall success rate of 95.70%, which is lower than that by the present method.
Furthermore, our model trained on the yeast data was also applied to the human genome. According to the human reference genome (hg 18), we randomly extracted 1000 nucleosomal and 1000 linker sequences from the high-resolution experimental data of human CD 4+ T cell . Our model achieved an overall success rate of 98.5% for classifying the experimentally confirmed nucleosomal and linker sequences in the human genome. This result is higher than 93.8% obtained by using the model proposed by Peckham et al. , which has also been applied to predict human nucleosomal sequences by Gupta et al. . All these results indicate that it is a quite promising approach by incorporating the DNA physicochemical features for predicting the nucleosomal sequences, and also suggest a conserved mechanism of nucleosome positioning across genomes.
Different with most current nucleosome positioning prediction methods that were solely relied on local sequence compositional information, in this study we developed a new method by incorporating the physicochemical features of DNA sequences. Our rationale to do so is that, different from the other nucleotide information, the physicochemical properties might affect DNA binding of regulatory proteins, either directly by hampering or favoring complex formation, or indirectly through the modulation of the chromatin structure and hence the DNA accessibility . Therefore, the current method may become a useful vehicle for in-depth studying nucleosomes.
For the convenience of the vast majority of experimental scientists, below let us give a step-by-step guide on how to use the iNuc-PhysChem web-server to get their desired results without the need to follow the complicated mathematic equations that were presented just for the integrity in developing the predictor.
Open the web server at http://lin.uestc.edu.cn/server/iNuc-PhysChem and you will see the top page of iNuc-PhysChem on your computer screen, as shown in Fig. 6. Click on the Read Me button to see a brief introduction about the predictor and the caveat when using it.
Either type or copy and paste the query DNA sequence into the input box at the center of Fig. 6. The input sequence should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with a greater-than symbol (“>”) in the first column, followed by lines of sequence data. The words right after the “>” symbol in the single initial line are optional and only used for the purpose of identification and description. All lines should be no longer than 120 characters and usually do not exceed 80 characters. The sequence ends if another line starting with a “>” appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box.
Click on the Submit button to see the predicted result. For example, if you use the three query DNA sequences in the Example window as the input, after clicking the Submit button, you will see the following shown on the screen of your computer: the outcome for the 1st query sample (with 150 bp long) is “nucleosome”; the outcome for the 2nd query sample (with 150 bp long) is “linker”; the outcome for the 3rd query sample (with 502 bp long) contains sub-results, in which the outcomes for the segments from #1 to #61 are of “linker”, those for the segments from #62 to #198 are of “nucleosome”, and those from #199 to #353 are of “linker”. All these results are fully consistent with the experimental observations as summarized in the Information S1. It takes about few seconds for the above computation before the predicted result appears on your computer screen; the more number of query sequences and longer of each sequence, the more time it is usually needed.
Click on the Citation button to find the relevant papers that document the detailed development and algorithm of iNuc-PhysChem.
Click on the Data button to download the benchmark datasets used to train and test the iNuc-PhysChem predictor.
The program is also available by clicking the button download on the lower panel of Fig. 6.
In this study although iNuc-PhysChem was trained by the dataset derived from Saccharomyces cerevisiae, it can be successfully used to identify nucleosome positioning for an independent DNA segment extracted from the Saccharomyces cerevisiae genome, as demonstrated by the 3rd sequence in the Example window of the iNuc-PhysChem web-server. Particularly, it can be also successfully used to classify nucleosomal and linker sequences in the human genome, as elaborated in Section 4 of Results and Discussion. Therefore, it is anticipated that iNuc-PhysChem can be successfully used to identify nucleosome in the whole genome as well.
The current study was focused on the demonstration that the physicochemical properties of DNA are important for nucleosome positioning prediction. Since the physicochemical properties of DNA can be used to describe the interaction between DNA and chromatin remodeling complexes in vivo, here we just used the in vivo data for the current study. However, it is instructive to point out that although in vivo and in vitro nucleosome maps are similar, promoters and DNA replication regions, where nucleosomal sequences are depleted in vivo, are strongly affected by nucleosome remodeling , . In view of this, we shall consider in our future work to use in vitro nucleosome maps  and the raw data from  to train the prediction model. Also, it is intriguing to analyze the impacts of different conformations (such as B- and Z-form) of DNA to nucleosome positioning, and will be investigated in our future studies as well.
Based on the results as reported in Section 4 of the Results and Discussion, we believe that the user-friendly web-server iNuc-PhysChem as proposed in this paper may serve as a useful tool for studying nucleosome positioning. Or at the very least, it can play a complimentary role to the existing methods in this area. Meanwhile, we also sincerely hope to hear any feedbacks (either positive or negative) from the users in using iNuc-PhysChem to generate their desired data. Their feedbacks will be very useful for us to improve the performance of iNuc-PhysChem.
The benchmark dataset consists of a positive dataset and a negative dataset . The positive dataset contains 5,000 nucleosome-forming DNA segments, while the negative dataset contains 5,000 nucleosome-inhibiting DNA segments. Each of these segments is 150-bp long.
The original numerical values for the 12 physicochemical properties of dinucleotide, where the physicochemical property “A-philicity”  is denoted by P(1); “base stacking”  by P(2); “B-DNA twist”  by P(3); “bendability”  by P(4); “DNA bending stiffness”  by P(5); “DNA denaturation”  by P(6); “duplex disrupt energy”  by P(7); “duplex free energy”  by P(8); “propeller twist”  by P(9); “protein deformation”  by P(10); “protein-DNA twist”  by P(11); and “Z-DNA”  by P(12). Their values were taken from the papers cited above, respectively.
We wish to express our gratitude to the editor and three anonymous reviewers whose constructive comments were very helpful in strengthening the presentation of this paper.