The structural unit of chromatin packaging is the nucleosome, which consists of approximately 147 bp of DNA wrapped around a core histone octamer. Nucleosome-associated DNA is less accessible to regulatory proteins such as transcription factors; nucleosome positioning, histone modifications and histone variants (e.g. H2A.Z, H3.3) therefore influence cellular processes that depend on chromatin accessibility. Because nucleosome positions depend on cellular processes as well as on intrinsic factors (e.g. DNA sequence), understanding how these positions influence cell states can require determining nucleosome locations within individual genomic regions.
Currently, genome-wide nucleosome-based data are typically generated by high-throughput short-read sequencing of DNA obtained either by MNase digestion (MNase-Seq) or by chromatin immunoprecipitation (ChIP-Seq) of MNase-digested or sonicated DNA. MNase digests linker DNA with relatively high specificity, and this specificity is reflected in the narrow spatial distribution of aligned reads. However, sonication protocols are also widely used, for example in work to identify classes of functional genomic regions by integrated analysis of diverse sets of short-read sequence data.
Some methods proposed for inferring nucleosome positions from short-read data are heuristic and are based on simple pile-up profiles. More elaborate approaches are available or have been described, such as NPS, TemplateFilter (TpF), and methods based on Hidden Markov Models (HMMs). However, these methods have been applied to data generated with protocols that use MNase-Seq or MNase with ChIP-Seq, and their effectiveness with sonicated ChIP-Seq data has not been demonstrated.
Recently we described PICS, a probabilistic peak-caller for identifying transcription factor binding sites in ChIP-Seq data. PICS models bi-directional read densities, uses mixture models to resolve adjacent binding events, and imputes reads that are not mapped because of repetitive genome sequences. We anticipated that its model-based framework could be extended to address both MNase-digested and sonicated nucleosome-based short-read data. We were interested in assessing how effectively the model could be adapted to the two data types, how robust the new algorithm would be to lower read densities, and what types of biological inferences it would support from sonicated data. To address these issues, we developed PING, a method for probabilistic inference of nucleosome positioning from nucleosome-based sequence data. Like PICS, PING models bi-directional read densities, uses mixture models, and imputes missing reads. However, it uses a new prior specification for the spatial positioning of nucleosomes, and it differs in its model selection criteria, model parameters, and post-processing of estimated parameters. In addition, PING includes novel statistical methods to identify nucleosomes whose read densities are lower than those of neighboring nucleosomes.
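To make the idea of bi-directional read densities concrete, the following minimal sketch assumes a single isolated nucleosome whose forward-strand reads start near its upstream edge and whose reverse-strand reads start near its downstream edge. It estimates the nucleosome center and fragment length by simple averaging. This is an illustration only: it is not the PICS or PING model, which use mixture models with priors and imputation, and the function name `estimate_nucleosome` and the example coordinates are hypothetical.

```python
from statistics import mean


def estimate_nucleosome(forward_starts, reverse_starts):
    """Toy single-nucleosome estimate from strand-specific read starts.

    forward_starts: 5' positions of reads aligned to the + strand
    reverse_starts: 5' positions of reads aligned to the - strand
    Returns (center, fragment_length).
    Assumption: all reads come from one nucleosome; no mixture, no prior.
    """
    if not forward_starts or not reverse_starts:
        raise ValueError("reads are required on both strands")
    f = mean(forward_starts)  # average upstream edge
    r = mean(reverse_starts)  # average downstream edge
    center = (f + r) / 2      # midpoint between the two strand peaks
    fragment_length = r - f   # distance between the two strand peaks
    return center, fragment_length


# Hypothetical read starts around a nucleosome centered near position 1000
center, fragment_length = estimate_nucleosome(
    [927, 925, 929], [1073, 1075, 1071]
)
print(center, fragment_length)
```

A mixture-model approach generalizes this by fitting several such forward/reverse pairs at once, so that adjacent nucleosomes with overlapping read distributions can be resolved.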
In the work described here, we apply the new algorithm to three published short-read nucleosome-based data sets. We focus on regions around transcriptional start sites and in vivo transcription factor binding sites, which have well-defined nucleosome distributions. We demonstrate that PING performs well in identifying nucleosome positions in both MNase-Seq data from yeast and sonicated H3K4me1 ChIP-Seq data from mouse, and that it compares favorably to NPS and TpF in robustness to lower read densities. Then, using published data from a mouse cell line, we consider global changes in nucleosome positioning relative to in vivo binding sites for SPI1 (also known as PU.1) and CEBPB, and show that PING predictions from sonicated H3K4me1 ChIP-Seq data are consistent with published results from MNase-Seq data. Next, we apply PING to sonicated H3K4me1 ChIP-Seq data from mouse pancreatic islet tissue. We distinguish in vivo Foxa2 and Pdx1 binding sites that lie between flanking H3K4me1-marked nucleosomes from sites that lie within nucleosomal DNA. We show that genes associated with flanked TF-bound loci are more abundantly expressed than those associated with nucleosomal loci, consistent with flanked sites being active enhancer elements. Finally, we compare spatial distributions of binding sites on nucleosomal DNA for Pdx1 and for the pioneer transcription factor Foxa2.
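The distinction between flanked and nucleosomal binding sites can be sketched as a simple distance rule applied to predicted nucleosome centers. The sketch below is a hypothetical illustration, not the classification procedure used in this work: the function name `classify_site`, the 73 bp half-width (derived from the approximately 147 bp of nucleosomal DNA) and the 250 bp flanking distance are all assumptions chosen for the example.

```python
from bisect import bisect_left

# Assumption: ~147 bp of nucleosomal DNA, i.e. center +/- 73 bp
NUCLEOSOME_HALF_WIDTH = 73


def classify_site(site, nucleosome_centers, max_flank_distance=250):
    """Label a binding site as 'nucleosomal', 'flanked', or 'other'.

    site: genomic coordinate of the binding site
    nucleosome_centers: predicted nucleosome centers, sorted ascending
    max_flank_distance: hypothetical cutoff for a flanking nucleosome
    """
    i = bisect_left(nucleosome_centers, site)
    left = nucleosome_centers[i - 1] if i > 0 else None
    right = nucleosome_centers[i] if i < len(nucleosome_centers) else None

    # Site falls within the DNA wrapped by a neighboring nucleosome
    for c in (left, right):
        if c is not None and abs(site - c) <= NUCLEOSOME_HALF_WIDTH:
            return "nucleosomal"

    # Site lies in a gap with a nearby nucleosome on each side
    if (left is not None and right is not None
            and site - left <= max_flank_distance
            and right - site <= max_flank_distance):
        return "flanked"

    return "other"


# Hypothetical nucleosome centers and binding sites
centers = [1000, 1400]
for s in (1030, 1200, 5000):
    print(s, classify_site(s, centers))
```

Using a sorted list with binary search keeps each lookup logarithmic in the number of nucleosomes, which matters when many binding sites are compared against genome-wide predictions.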