We first surveyed published nucleosome positions in yeast and fly [9]. When genes are aligned at the TSS, the position of the +1 nucleosome differs significantly between the two species [10] (Additional data file 1a). However, aligning by the coding region places the first coding nucleosome in similar positions in the two species, that is, just downstream of the start codon (Additional data file 1b). We also identified a highly conserved nucleosome immediately upstream of the 3' coding end in both species (Additional data file 1c).
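The alignment step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `aligned_average`, the per-base occupancy lists, and the `(chromosome, position, strand)` anchor tuples are all hypothetical. Each window is flipped for minus-strand genes so that the averaged profile always reads 5' to 3', whether the anchor is the TSS, the start codon, or the stop codon.

```python
def aligned_average(occupancy, anchors, flank):
    """Average occupancy in a window around each anchor, oriented 5'->3'.

    occupancy: dict mapping chromosome name -> list of per-base values
    anchors:   list of (chromosome, position, strand), strand is '+' or '-'
    flank:     number of bases kept on each side of the anchor
    (All names and data layouts here are illustrative assumptions.)
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for chrom, pos, strand in anchors:
        track = occupancy[chrom]
        # Skip anchors whose window would run off the chromosome.
        if pos - flank < 0 or pos + flank >= len(track):
            continue
        window = track[pos - flank : pos + flank + 1]
        if strand == "-":
            window = window[::-1]  # flip so index 0 is always upstream
        for i, value in enumerate(window):
            totals[i] += value
        used += 1
    if used == 0:
        raise ValueError("no anchor had a complete window")
    return [t / used for t in totals]
```

Running the same function once with TSS anchors and once with start-codon anchors gives the two kinds of profile compared in the text.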
Analyzing the H2A.Z map for human T cells [14] also revealed nucleosomal peaks just downstream of start codons and just upstream of stop codons, marking both ends of the coding sequences (Figure ). Meanwhile, boundaries at both ends of transcripts are tightly coupled with nucleosome-free regions, potentially allowing access of the initiation and termination complexes (Figure ). The nucleosome-free region at the TSS was followed by the +1 nucleosome. However, the association between the +1 nucleosome and the TSS appears to be weaker than that between the first coding nucleosome and the start codon. The patterns of nucleosome positioning for some individual genes are shown in Additional data file 2.
Figure 1 Epigenetic peaks near coding region boundaries. (a, b) Genome-wide average of nucleosome occupancy in T cells for genes aligned at the (a) coding ends or (b) transcript ends. The inner coding region is outlined in yellow. (c) Genome-wide average of methylation ...
We carried out Solexa sequencing of methylated DNA from human T cells and found methylation peaks at the same positions (Figure ). We also profiled the mouse liver and found the same patterns (Additional data file 3). Taken together, the nucleosomal peaks are observed in human, fly, and yeast, and the methylation peaks in human, mouse, and plants.
To examine their role in regulating transcription, we first related the level of the epigenetic peaks to expression level. We found that highly expressed genes are depleted of the epigenetic peaks (Figure ), consistent with findings of a nucleosomal barrier against high transcription rate [15]. However, the overall correlation was not strong. We then estimated elongation efficiency as mRNA production per unit density of elongating Pol II. Upon initiation, Pol II is phosphorylated at Ser5 in its carboxy-terminal domain, switching to an elongation-competent form. Thus, we calculated the ratio of expression level to the density of Ser5-phosphorylated Pol II within the transcript body. Genes with high elongation efficiency will show high expression levels even with a low density of elongating Pol II across the transcribed region, whereas genes with low elongation efficiency will show the opposite. A strong association was found between the level of the epigenetic peaks and elongation efficiency (Figure ; Additional data file 4).
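The elongation efficiency ratio and the binning used to relate it to peak level can be illustrated with a short sketch. The dictionaries keyed by gene, the `min_density` cutoff that avoids division by near-zero densities, and the equal-sized rank bins are assumptions made for this example. The text specifies only the ratio of expression level to Ser5-phosphorylated Pol II density.

```python
def elongation_efficiency(expression, ser5p_density, min_density=1e-9):
    """Expression level divided by Ser5P Pol II density in the transcript body.

    Genes without a usable density value are skipped (an assumption of this
    sketch, made to avoid dividing by zero).
    """
    efficiency = {}
    for gene, level in expression.items():
        density = ser5p_density.get(gene)
        if density is None or density < min_density:
            continue
        efficiency[gene] = level / density
    return efficiency


def mean_peak_by_bin(efficiency, peak_level, n_bins):
    """Mean peak level in equal-sized bins of genes ranked by efficiency."""
    genes = sorted((g for g in efficiency if g in peak_level),
                   key=efficiency.get)
    means = []
    for b in range(n_bins):
        chunk = genes[b * len(genes) // n_bins:(b + 1) * len(genes) // n_bins]
        if chunk:  # bins can be empty when there are fewer genes than bins
            means.append(sum(peak_level[g] for g in chunk) / len(chunk))
    return means
```

The returned list runs from the least efficient bin to the most efficient one, so a decreasing sequence would correspond to the association reported here.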
Figure 2 Correlation of elongation inhibition with epigenetic peaks. (a, b) The average of nucleosome level (left panel) and methylation level (right panel) were plotted (a) within each expression percentile and (b) within each bin of elongation efficiency. The ...
Without any interference, elongating Pol II should be distributed evenly across the transcript body except at the initiating and terminating sites. Faced with roadblocks, however, Pol II pauses and a pileup of Pol II forms, which can be observed as a peak of Pol II density. Thus, to demonstrate Pol II pausing at epigenetic marks, three criteria should be met: the presence of a Pol II peak; the presence of an epigenetic peak; and a correspondence between the positions of the two peaks.
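The three criteria can be expressed as a simple check on a pair of aligned profiles. The sketch below is illustrative only: the peak rule (maximum at least `min_fold` times the profile mean), the default thresholds, and the function names are hypothetical and are not taken from the analysis. Profiles are assumed to be oriented 5' to 3', so a pileup in front of a roadblock appears at an index at or before the index of the epigenetic peak.

```python
def peak_position(profile, min_fold=1.5):
    """Index of the maximum if it reaches min_fold times the mean, else None."""
    mean = sum(profile) / len(profile)
    best = max(range(len(profile)), key=profile.__getitem__)
    if mean > 0 and profile[best] >= min_fold * mean:
        return best
    return None


def shows_pausing(pol2, mark, max_offset=200, min_fold=1.5):
    """Apply the three criteria to Pol II and epigenetic-mark profiles.

    1. a Pol II peak exists; 2. an epigenetic peak exists;
    3. the Pol II peak lies at or upstream of the mark, within max_offset.
    The thresholds are placeholder assumptions for this sketch.
    """
    pol2_peak = peak_position(pol2, min_fold)
    mark_peak = peak_position(mark, min_fold)
    if pol2_peak is None or mark_peak is None:
        return False
    return 0 <= mark_peak - pol2_peak <= max_offset
```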
Elongating Pol II appears to pile up immediately upstream of the nucleosomal peaks at both ends of protein coding units (Figure ), satisfying the three criteria. However, there seem to be confounding effects from Pol II enriched at nearby transcription initiation and termination sites. For example, the Pol II tail downstream of the stop codon (arrow above the right panel of Figure ) might reflect Pol II waiting to be released from the transcription termination site. We thus selected genes with a long (> 5 kb) 5' untranslated region (UTR) and examined Pol II density around the TSS and start codon separately. Both unphosphorylated and elongating Pol II were enriched at the TSS, but only elongating Pol II showed high downstream density (Additional data file 5) with a pileup upstream of the start codon (left panel of Figure ). Another Pol II peak upstream of the start codon (arrow above the left panel of Figure ) seems to reflect Pol II at the TSS. A pileup of Pol II was also found before a long (> 5 kb) 3' UTR (right panel of Figure ), indicative of Pol II blockage that occurs independently of the transcription termination site.
Pol II pausing was not observed with low nucleosome occupancy (Figure ), indicating that elongating Pol II is indeed impeded by the boundary nucleosome. To observe the specific effect of the boundary nucleosome, we computed relative occupancy at the coding end compared to the surrounding region. Higher or lower nucleosome occupancy near the coding end directly led to higher or lower Pol II density in the immediate upstream region (Additional data file 6).
We roughly estimated the percentage of genes that are affected by Pol II pausing by comparing the average Pol II density around boundaries and that across surrounding regions. We found that 54% of genes exhibit higher Pol II density near the start codon than in the flanking region and 41% of genes have a Pol II peak near the stop codon.
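A rough estimate of this kind can be reproduced with a per-gene comparison of mean density near the boundary against mean density in the flanks. The window layout, the `half_width` parameter, and the strict "greater than" rule below are assumptions for illustration. The text does not specify the window sizes used.

```python
def fraction_with_boundary_peak(profiles, boundary, half_width):
    """Fraction of genes whose mean density near the boundary exceeds the flanks.

    profiles:   list of per-gene density lists, all aligned at the same index
    boundary:   index of the start or stop codon within each profile
    half_width: bases on each side of the boundary counted as "near"
    (Parameter names and window choices are hypothetical.)
    """
    counted = 0
    hits = 0
    for profile in profiles:
        lo = max(0, boundary - half_width)
        hi = min(len(profile), boundary + half_width + 1)
        near = profile[lo:hi]
        flank = profile[:lo] + profile[hi:]
        if not near or not flank:
            continue  # nothing to compare against
        counted += 1
        if sum(near) / len(near) > sum(flank) / len(flank):
            hits += 1
    return hits / counted if counted else 0.0
```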
Nucleosome positioning is governed by DNA sequences [11]. Methylation level is dependent on the CpG content of the target sequence [17]. Given the distinctive patterns of nucleosome positioning and methylation maintained in specific regions among different species, there should be strong constraints on the underlying DNA sequences. Being under strong natural selection, protein coding sequences could be better candidates than UTRs for conserved epigenetic targets downstream of transcription initiation and upstream of termination. Coding region boundaries might be subject to considerable negative selection that purges sequence changes detrimental to nucleosome deposition or DNA methylation.
We examined two sequence characteristics deemed to be involved in epigenetic programming: DNA-bending propensity and CpG density. DNA-bending propensity, the ability of nucleotide sequences to wrap around a histone complex, is an important determinant of nucleosome formation [18]. DNase I digestion experiments indicate that bending parameters for the start codon and three stop codons are among the 8 highest out of those for the 32 trinucleotides [20]. Therefore, they can significantly contribute to the high bendability of coding boundaries (Additional data file 7). The boundary sequences with higher bendability tend to be more enriched for nucleosomes (upper panel in Figure ). Unexpectedly, DNA methylation level was also proportional to bending propensity.
Figure 3 Correlation of genetic and epigenetic characteristics at coding region boundaries. (a) Bending propensity and CpG density were calculated for flanking sequences downstream of the start codon or upstream of the stop codon. The number of genes (gray curve ...
CpG density should be a determinant of methylation level. The boundary sequences with intermediate CpG density were densely methylated (lower panel in Figure ). In contrast, nucleosome occupancy was dominant among genes with lower CpG density. While the two marks commonly have affinity for base compositions with high bending propensity, DNA methylation at CpG sites might affect structural DNA bending and nucleosome formation. The proportions of genes marked by both nucleosomes and methylation or by just one of these are shown in Figure . More genes are specifically marked by nucleosomes than methylation, possibly because many boundary regions have relatively low CpG density (gray curve in lower panels of Figure ).
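As an illustration of how a CpG density value can be computed for a boundary sequence, the sketch below uses the common observed/expected CpG ratio. This definition is an assumption: the text does not state the exact formula behind the CpG density axis, and the function name is hypothetical.

```python
def cpg_observed_expected(sequence):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G)).

    This is one widely used CpG measure, assumed here for illustration;
    it may differ from the definition used for the figures.
    """
    seq = sequence.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    if n == 0 or c == 0 or g == 0:
        return 0.0  # ratio is undefined without both C and G; report zero
    return seq.count("CG") * n / (c * g)
```

Applying the function to the flanking sequence downstream of each start codon, or upstream of each stop codon, would give one value per gene that can then be binned against nucleosome or methylation level.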
A group of genes had the highest CpG density at the 5' end (gray curve at CpG density > 0.8 in bottom left panel of Figure ). These genes showed a markedly reduced level of DNA methylation, reflecting the fact that CpG islands are typically unmethylated. Indeed, 97.2% of these genes contained a CpG island within their promoter (-1,000 bp to 500 bp from the TSS) and 92.8% had a short (< 500 bp) UTR (P < 10^-100), an indication that promoter CpG islands are overlapping or located very close to the start codon. These genes exhibited high expression compared to the rest of the genes (P < 10^-80), even higher than the genes with a promoter CpG island (P = 1.4 × 10^-10) (Additional data file 8), indicating additional effects of elongation control.
Next, we explored the intragenic distribution of the marks. Although significantly higher than its flanking region, the 5' peak is generally lower than the 3' peak (Additional data file 9a). Meanwhile, k-means clustering shows that most genes have higher peaks at both ends compared to the central region (Additional data file 9b). We then examined these patterns according to the size of the coding region (Figure ). We found that genes with a short coding sequence (< 1 kb) have nucleosomes and methylation in their inner region over a large portion of the gene body. In particular, their 3' ends lack both marks, in sharp contrast to most other genes, in which the marks are shifted toward both ends with a bias toward the 3' end. Unlike at the 5' end, both marks commonly peaked at the 3' end, especially in many genes of intermediate size (Figure ).
Figure 4 Epigenetic characteristics within coding regions of varying size. (a) Heatmaps showing the coding-region profiles of nucleosomes (left side) or methylation (right side) averaged over 100 neighboring genes ordered by size. A total of 25,883 genes were ...
Notably, nucleosome composition within the coding region of yeast or fly genes is not sharply shifted to the coding boundaries (the 3' peak in particular is not very prominent; upper panel in Figure ), a pattern similar to that seen for human genes of similar size (1 to 2 kb). Intragenic DNA methylation in plant genomes is also concentrated in the central region of protein coding sequences (lower panel in Figure ). Arabidopsis genes share a similar pattern with human genes of similar size (1 to 3 kb). Rice genes are longer than Arabidopsis genes (Figure ) and have detectable, if not complete, peak patterns, which can explain the observed 5'-end peak [4