Nucleosomes are DNA-protein complexes that form the building blocks of eukaryotic chromatin. They are involved in genome condensation, and play a major role in the regulation of gene expression [1
]. Each nucleosome is made up of eight histone proteins that together form a structural unit able to accommodate 147 base pairs of DNA wound around it. The DNA sequence has to have the flexibility and curvature that allows it to circle around a nucleosome [2
]. Empirical and theoretical studies have both shown that there are certain DNA sequence patterns that are too rigid to form such loops [3
]. These patterns include GC-rich motifs as well as poly-A and poly-T tracts, and we have previously compiled them into NXSensor, a web tool that predicts which DNA sequences would not be conducive to nucleosome binding; we called these motifs nucleosome exclusion sequences
Transcriptional regulation in eukaryotes is a complex process, as exemplified by recent publications [5
]. The formation and positioning of nucleosomes are crucial steps in gene regulation, in that they influence access to DNA by the transcriptional machinery. Experimental work on nucleosome positioning in yeast [7
] and fly [12
] has yielded significant results, and technological progress is such that we are quickly learning more about nucleosome positioning in the human genome [13
Experimental work that verifies where a nucleosome is positioned is dependent upon when the cells were sampled, and on which tissue or cell line the analysis was carried out. In addition, it is known that nucleosomes slide to allow certain regulatory mechanisms to take place [15
], and it has been shown in yeast that nucleosomes are only occasionally positioned by intrinsic sequence signals [16
]. We therefore chose nucleosome exclusion sequences as our predictive method, rather than nucleosome positioning sequences. We can with a certain level of certainty predict where nucleosomes would not bind, and it is therefore inferred that they can, and probably do, bind elsewhere.
Reported studies in [4
] observed certain trends in the nucleosome exclusion patterns of promoter regions. Both studies showed that there is a peak of nucleosome exclusion sequences just upstream of the transcriptional start site of genes. This pattern has subsequently been found to be true also in yeast [9
]. The studies in [4
] found that widely expressed genes, sometimes referred to as housekeeping genes, had a higher nucleosome exclusion potential than did tissue specific genes. This implies that the promoter regions of widely expressed genes were less likely to have nucleosomes in them than were the promoter regions of genes that had a narrow tissue distribution. This may allow easy access of the transcriptional machinery to the DNA of ubiquitously expressed genes. However, these studies had taken relatively small numbers of carefully selected human genes: 100 of each category in the case of [4
], and 500 each in [17
], and they both relied on manual selection and categorization of genes. The question remained whether there is a genome-wide trend of a gradient of nucleosome positioning potentials, and what implications this may have for the specificity of gene expression. These were the initial questions that we set out to answer in this study.
The objective of the present study, therefore, was to carry out a whole genome annotation of nucleosome exclusion regions (NXRs) in the human genome, and to correlate the results with tissue specificity, gene expression levels, and DNaseI hypersensitive sites. We calculated nucleosome exclusion scores (NXScores) across the whole genome, and observed NXScore trends in promoter regions. We classified tissue specific and widely expressed genes according to a new method proposed here based on Grubbs' outliers test, and validated the results using a previously described method based on Shannon's entropy [18
]. From a computational perspective, patterns such as NXRs and NXSs are fuzzy, non-exact, and overlapping, which poses a challenge for the analysis of all 3.4 billion base pairs of the whole human genome. We therefore developed a pilot grid architecture that can carry out such computationally intensive tasks. In this paper we report our results in the context of the regulation of gene expression.