We first performed a feature selection step to identify which sequence features known or believed to influence nucleosome occupancy or positioning correlate with or are strongly associated with the in vitro
nucleosome data of Kaplan et al.[8
]. Table S1
(Additional File 1
) lists the 171 features tested and the results of the tests. The features included: (a) mononucleotide frequency (i.e. G+C content); (b) predicted DNA structural characteristics (each calculated from the dinucleotide content using a simple linear formula[19
]); (c) nucleosome positioning and excluding sequences from the literature[10
]; and (d) the frequency of 4-bp sequences over a 150-bp window. We used 4-mers instead of 5-mers (as in the Kaplan model) in order to limit the number of features, and to obtain inputs that correlate independently with nucleosome occupancy (since each 4-mer occurs more frequently than nucleosomes, on average). We identified 130 features that we deemed to be associated with in vitro
nucleosome occupancy across the yeast genome (see Methods
), including representatives of all categories (a-d) above (Table S1
[Additional File 1
We then used Lasso to learn linear models that relate these 130 features to the Kaplan et al. data. We created eleven different models, using eleven different random samples of 1,000,000 genomic positions selected from subsets of the yeast genome as training data, each with 10-fold internal cross-validation (Lasso itself chooses the number of coefficients using a cross-validation procedure within the training data). In each case, Lasso assigned nonzero weights to a similar set of features (Figures and S1-3 [Additional File 1
]), each of which yielded a roughly comparable correlation to test data. This result indicates that the model chosen is not strongly dependent on the subset of the data used for training. From among the models, we chose the model trained on chromosomes 1-9 for further analysis, on the basis that (a) it was an arbitrary selection, being the first model sorted numerically, and (b) it has 14 features, which is the median number among the eleven runs. Hereafter, we refer to this model as the "14 feature model", the formula for which is given in the Methods
Figure 1 Model feature weights selected by Lasso for eleven different training data sets. Chromosomes from which 1,000,000 random nucleotide positions were taken are given at bottom. Correlation coefficients are given in the middle, using a test set that does (more ...)
The 14 feature model explains a large majority of the variation in nucleosome occupancy over the yeast genome in the Kaplan et al. in vitro
] (R = 0.86 over the test set) (Figure ). This correlation is near the level of experimental reproducibility reported by Kaplan et al. (R = 0.92), and similar to that of the Kaplan model that learned 2,294 parameters (R = 0.89)[8
]. We note that our models with substantially more than 14 features have correlations with the in vitro
data as high as 0.88 (Figure and S1 [Additional File 1
]). The 14 feature model also correlates significantly with in vivo
nucleosome occupancy in yeast (grown in glucose)[8
] (Figure ) (R = 0.72, Spearman P < 2.2 × 10-308
). The Kaplan model has a correlation coefficient of 0.74 over the same test data. Thus, the 14 feature model encapsulates the vast majority of the information in the Kaplan model. Indeed, the correlation between the 14 feature model and the Kaplan model over the entire yeast genome is 0.88 (Figure ).
Figure 2 Performance of a 14 feature linear model of intrinsic nucleosome sequence preference. (A) Scatter plot vs. test set (yeast chromosomes 10-16), shown as a heat-map. Axis values are log2 normalized nucleosome occupancy (see Methods). (B) Model scores (probabilistic[ (more ...)
In order to further benchmark our model, we compared the performance of the 14 feature model with published models[2
] on other in vitro
and in vivo
nucleosome occupancy data sets, using Pearson correlation between predicted and actual data. These results are summarized in Table . In all cases, the 14 feature model has performance comparable to the Kaplan model and to another model (the Field model) from the same lab with a similar number of parameters as the Kaplan model[8
]. Since the 14 feature model is trained on Illumina/Solexa sequencing data, which may have inherent biases[33
], it is important to note that it also correlates well with an in vivo
nucleosome organization from a tiling array study in yeast[2
] and a sequencing-based study in C. elegans
that was normalized using naked genomic DNA processed in the same fashion as the nucleosomal DNA[34
], performing the best out of all models tested on the latter data set. Thus, our model is comparable to the Kaplan model on multiple data sets, including those generated in vivo
, using other methods, and/or in an organism distantly related to yeast.
Comparison of nucleosome occupancy prediction models on different data sets
The results from this comparison also confirm that models that combine aperiodic signals perform much better at predicting nucleosome occupancy than models based primarily on periodic dinucleotide signals[22
]. The one exception is the model of Yuan and Liu[26
], which is based on periodic dinucleotide signals in nucleosomal and linker sequences identified using wavelet analysis. We note, however, that the dinucleotide features with most predictive power and the highest regression coefficients in the Yuan and Liu model have frequencies at the single base scale (i.e. have a length scale of 1)[26
], suggesting that aperiodic dinucleotide composition is, perhaps unintentionally, a major component.
The most critical features in the 14 feature model are G+C content and frequency of AAAA, on the basis of two criteria. First, these two features correlate highly with nucleosome occupancy in vitro (R = 0.71 and 0.63, respectively), independently of all other features (Figure ). Second, a procedure in which we iteratively removed the least critical feature(s) of the model (i.e. those with the least influence on the basis of re-trained model performance after their removal) resulted in AAAA and G+C being the last two components removed (data not shown). A two-feature linear model (trained on G+C and AAAA) retained a correlation on test data of 0.72, only a marginal improvement over G+C alone (Figure ). From this analysis, we conclude that G+C content independently accounts for approximately half of the variation in intrinsic nucleosome occupancy (R2 = 0.712 = 0.50). We note that the Kaplan model weights for individual 5-mers also scale highly with G+C content (R = 0.78, Spearman P = 3.33 × 10-284; data not shown) and that the scores assigned by the Kaplan model to 147-base windows across the yeast genome correlate highly with G+C content (R = 0.87, Figure ). Table shows that other models that correlate highly with G+C content (Table , last column) perform well at predicting nucleosome occupancy in vitro and in vivo, and that G+C content itself is a good predictor in all data sets (Table ): in all data sets examined, %G+C had a higher correlation that the majority of published models tested at predicting nucleosome occupancy.
Figure 3 Correlation of each of the 14 features with nucleosome occupancy. (A) Graphic illustration of the correlation of each of the 14 sequence features with nucleosome occupancy in vitro and in vivo across the yeast genome (data from Kaplan et al.). (B- (more ...)
We next sought to understand why these 14 features are repeatedly retained in linear models (Figure ). Manual inspection of the components of the 14 feature model suggests a small number of overarching themes. All 11 of the 4-mers are A/T rich (eight are entirely A/T), and models of DNA structure suggest that they should retain some of the structural character of poly-A sequences (data not shown). Poly-A stretches are believed to exclude nucleosomes because they are both rigid and bent, making them less compatible with the extreme bending required for nucleosome formation, regardless of their local sequence context[14
]. Sequences high in G+C will tend to lack these (and related) sequences, which may partly explain why G+C content has high overall predictive value; however, it is possible for sequences to be both G+C rich and contain small nucleosome excluding sequences, which would negatively impact nucleosome formation, explaining why a variety of poly-A-like 4-mers are retained in the model.
The importance of G+C may also be explained by the fact that this single parameter affects virtually all aspects of DNA structure. Indeed, the two overall DNA structural properties selected (propeller twist, which describes angular displacement of bases in a pair relative to each other, and slide, which describes lateral translation of base pairs relative to each other), both correlate well with G+C content when calculated as an average over a tiling window using dinucleotide tables[19
] (data not shown). These and the majority of other DNA structural properties also correlate either positively or negatively with both G+C content and nucleosome occupancy in vitro
and in vivo
(Figure and data not shown). Thus, the 14 feature model is also likely to be dominated by G+C because this parameter influences a large number of structural attributes of DNA, perhaps most critically propeller twist and slide, which may also be sufficiently important that their deviations from simple G+C content cause them to be retained in the Lasso regression. There is prior evidence for the importance of one of these features in nucleosome formation: Poly-A and related sequences are rigid and bent precisely because they are high in propeller twist, resulting in a continuous network of bifurcated hydrogen bonds[36
To gain more direct evidence for separability between G+C content and poly-A sequences as determinants of intrinsic nucleosome occupancy, we examined G+C content and poly-A sequences in an independent data set in the Kaplan et al. paper, in which nucleosomes were assembled with synthetic 150-mer sequences designed to have a broader range of unusual sequence attributes than are present in the yeast genome. Since the synthetic 150-mer nucleosome occupancy data was described by Kaplan et al. as noisier than the yeast genomic DNA occupancy data[8
], due to two rounds of PCR required in the experiment, we first confirmed that the synthetic 150-mer data set displays the same global trends with respect to DNA structural parameters as does yeast genomic DNA, both in vitro
or in vivo
(Figure ). We then asked whether G+C content and poly-A sequences act independently by examining the effect of one variable while holding the other within a narrow range. Figure and show that these parameters do act independently to a considerable degree; G+C has a major effect even if there are no poly-A tracts of length greater than three, and poly-A tracts have a clear effect even if placed in a 150-mer with neutral G+C content. We note that the behaviour at the extremes of G+C content in Figure is inconsistent with the dependence of G+C shown in Figure ; however, there are very few data points at the extremes (Figure ). The in vivo
relevance of these extremes may be very small: there are no nucleosome-sized sequence windows in yeast that are greater than 80% or less than 20% G+C, and the same is nearly true in much larger genomes (e.g. human; Figure ). Even human CpG islands are only 66% G+C on average. CpG-like sequences[37
] among the ~27,000 oligonucleotides in this analysis[8
] do have high intrinsic nucleosome occupancy overall, even if they contain poly-A sequence (Figure ).
Figure 5 Relative nucleosome preference of different subsets of synthetic 150-mers. (A) and (B) Dependence of relative nucleosome preference (as log2(occupancy ratio)) on G+C content (A) and maximum poly-A length (B). Oligonucleotides categorized as "Neutral %G+C" (more ...)
Our model confirms and extends previous indications that G+C content is a major determinant of nucleosome sequence preference, demonstrating the importance of G+C content on intrinsic nucleosome occupancy. We propose that it represents a "summary feature" that both biases against poly-A-like tracts and encapsulates multiple DNA structural attributes. The 14 feature model we derive provides an extremely simple means to assess the intrinsic preference for nucleosomes to form on a given segment of DNA. Moreover, it can be used to evaluate why
the segment has an intrinsic preference, in comparison to other sequences; the expected distribution of values for all of the model features in random sequence or across a genome is easily determined. We note that the 14 feature model does not contain any periodic component; Kaplan et al. also found that periodic signal added little to the probabilistic model[8
]. We previously proposed that the predominant role of this signal may be to reinforce local translational or rotational settings[2
], and we emphasize that our 14 feature model does not explicitly predict either nucleosome positioning or translational settings, nor does it account for steric effects. Nonetheless, the model scores closely mirror actual in vitro
occupancy data obtained for the entire yeast genome, and also have strong correlations to in vivo
nucleosome occupancy in yeast and C. elegans
as shown in Figure and Table similarly or more strongly than any previous model or algorithm, and much higher than most previous approaches, particularly those that rely solely on periodic signals.
Finally, we note that G+C content as a major determinant of nucleosome occupancy has major implications for genome organisation. Our analysis indicates that in yeast simple nucleotide composition plays a direct role in nucleosome exclusion, and presumably in demarcation of promoters. Local biases in nucleotide composition have been reported in other eukaryotes, including CpG islands[37
], and transcription start sites[39
]. It will be of interest to examine how variation in base content impacts nucleosome occupancy and chromatin structure in other genomes, whether there are functional consequences, and how the intrinsic nucleosome formation signals interact with overlapping regulatory signals in the genome.