|Home | About | Journals | Submit | Contact Us | Français|
DNA sequence variations in individual genomes within the same species give rise to different phenotypes. One mechanism in this process is alteration of chromatin structure due to sequence variation that impacts gene regulation downstream. In this study, we compose a high-confidence collection of human indels and SNPs based on the analysis of a large set of publicly available sequencing data and investigate whether the DNA loci associated with stable nucleosome positions are protected against sequence mutations. We address how the sequence variation is reflected in the occupancy profiles of nucleosomes of different types at regulatory sequences and genome-wide. We find that indels are depleted around nucleosome positions of all considered types; SNPs, on the other hand, are enriched around the positions of bulk nucleosomes but depleted around the positions preferentially occupied by epigenetically modified nucleosomes. Such a behavior indicates an increased level of conservation for the sequences associated with epigenetically modified nucleosomes and highlights complex organization of the human chromatin.
Growing evidence indicates that structural organization of chromatin, and the presence of regular nucleosome-positioning patterns in particular, are crucial for faithful gene regulation1–3. There is an on-going debate in the literature about the role of DNA sequence in establishing such patterns in vivo in different organisms4–6. For this question, analysis of the variability of the genomic sequences associated with stable nucleosome positions within the population can provide important clues. Mutations in genomic DNA can disrupt nucleosome positioning signal encoded in DNA and/or binding targets of transcription factors which are often coordinated with the stable nucleosomes. Therefore, functionally significant positions of stable nucleosomes are likely to be reflected in the density of sequence variations.
Two types of genomic sequence variation are the most relevant in this context: single nucleotide polymorphism (SNP) and short insertions and deletions (indels). Recent analysis of sequence variability in the yeast genome has shown that SNP density is higher by 10–15% in the DNA fragments associated with nucleosome cores than in linkers7. The analysis of the SNP distribution in the human genome revealed presence of a periodic signal close to the nucleosomal length in the promoter proximal regions and increased SNP density in the closed chromatin enriched with nucleosomes8,9. Association of both SNPs and indels with chromatin structure was recently characterized for regions around gene starts in the fish genome10. It was shown that unlike SNP density, the density of indels is decreased within stable nucleosome positions as compared to linker DNA.
Association of sequence variation with nucleosome organization in the human genome has not been studied comprehensively yet. Earlier studies focused exclusively on SNPs8, and the direct comparison of the genome variability and nucleosome occupancy profile was not possible for the human genome due to the lack of genome-scale nucleosome profiles. However, recent advances in high-throughput sequencing technology have made it possible to map nucleosomes and to accurately identify sequence variants on a genome-scale in humans11–14.
To this end, we collected sequencing data available from the NCBI Trace Archive and composed high-confidence non-redundant data sets of SNPs and indels from 1 to 100 bp in length (see Methods for detail). These sets comprise data from different sequencing centers obtained for multiple unrelated genomes and thus any biases due to possible experimental artifacts or genome sampling should be significantly reduced. We also use the data from a recently published set of genome variations based on the analysis of 8 individual genomes for the validation of our findings14 (results for the ‘8-genome’ set are presented as Supplementary material). Nucleosome occupancy has been profiled in the human genome for several types of epigenetically modified nucleosomes and for ‘bulk’ nucleosomes not selected for any histone variant or modification11,12. Based on these data we have recently identified with high resolution the stable positions for bulk nucleosomes and for the nucleosomes containing the H2A.Z histone variant and the histone H3 tri-methylated at lysine 46. The H2A.Z and H3K4me3 nucleosomes are associated with transcription activation and are enriched at gene starts, while the bulk nucleosomes are distributed throughout the genome more evenly, making the combination of these sets a convenient tool for our analysis.
Availability of these data allows us to address the question of how sequence variations are distributed relative to nucleosome positions on genome scale. For the first time, we consider nucleosomes of different types, both epigenetically modified and bulk, which play different roles in gene regulation. We also compare patterns of sequence variability and their association with chromatin structure in different regulatory genome regions such as transcription start and end sites (TSS and TES) and splicing sites.
Distributions of the genome variation instances around stable nucleosome positions follow different patterns for indels and SNPs (Figure 1). Frequencies of indels are decreased inside core sequences for all types of nucleosomes as compared to the frequencies in linker DNA (Figure 1A). The distribution of SNP frequency is more complex: the SNP frequency is higher inside bulk nucleosomes, while it does not show significant variation around H2A.Z and H3K4me3 nucleosomes (Figure 1B).
The coordination of indel and SNP distributions with nucleosome positioning is further illustrated in Figures 1C and 1D, where genome-wide autocorrelations of indel and SNP occurrences are shown. Autocorrelation is a measure of probability to find two and more instances of a variation separated by the specific distance in the genome. Therefore, if a coordinated pattern exists in the distribution of the variations, it should be reflected in the autocorrelation function. Unlike the monotonic autocorrelation plot for SNPs, the plot for indels features two pronounced maxima at distances of 170 bp and 318 bp, which agrees reasonably well with nucleosomal repeat length in the human chromatin15.
We performed a number of further analyses to confirm our results. We checked that the frequency profiles around stable nucleosome positions share the same features when indels were split into insertions and deletions and SNPs were split into transitions and transversions and analyzed independently (Supplementary Figure 1A–D). Since nucleosomes are known to favor GC-rich sequences16,17 we stratified nucleosome positions by GC-content and verified that the distribution of indels and SNPs is similar for GC-rich and GC-poor sequences (Supplementary Figure 1E–G).
A recent study showed that tandem repeats can interfere with nucleosome positioning in yeast18. To check how the presence of repeats affects our findings, we examined the association of the genome variations located in the unique and in the repetitive regions of different classes with nucleosome positioning (Figure 2). This analysis reveals that the indel frequencies in core nucleosomes are lower than the indel frequencies in linker DNA for all classes of repeats (Figure 2). At the same time, the indels associated with repeats are further excluded from the nucleosome cores than the indels from unique regions. We also note that the indels from interspersed repeats contribute the most to the maxima in the autocorrelation function shown in Figure 1C (data not shown). In contrast, SNPs of all classes show nucleosome-to-linker occurrence ratio close to one, with the exception of the SNPs associated with the simple repeats in epigenetic nucleosomes which are depleted in nucleosome cores (Figure 2B). It is noteworthy that although the profile of SNP occurrences around bulk nucleosomes is not flat, the overall nucleosome-to-linker ratio is close to one for the linker DNA, defined in this study as 50-bp fragments flanking the core nucleosome.
Since a noticeably weaker coordination between nucleosome positioning and frequency of the 1-bp indels was reported for the fish genome10, we compared the relative occurrences of indels inside and outside the nucleosome core for different indel lengths (Supplementary Figure 2). In our dataset, the nucleosome-to-linker ratio is about 0.6 for all the indel lengths for which such a ratio could be determined with statistical significance. We note that for the ‘8-genome’ set, such a ratio shows greater variability for different indel lengths and is close to 1 for the 1-bp indels. Nonetheless, the DNA sequences around stable nucleosome positions are still moderately depleted of the 1-bp indels from the ‘8-genome’ set compared to the fragment located 100–200 bp away from the nucleosome center (Supplementary Figure 2C,D).
One may expect that the 5-bp indels are excluded from the core nucleosome sequences to a higher extent than the 10-bp indels since the 5-bp shift would disrupt the sequence patterns determining rotational phasing of nucleosomes, while the 10-bp shift would preserve them19–21. However, we do not observe such dependence in the ratio of indel occurrences (Supplementary Figure 2A,C). The absolute number of occurrences of indels longer than 5 bp is relatively small in our dataset, therefore we do not completely rule out the stronger exclusion of the 5-bp indels than that of the 10-bp ones. A more likely explanation for the lack of the noticeable difference, however, is a relatively low number of the sequences that encode the 10-bp signal in the human genome6.
Intron-exon and exon-intron boundaries are among the mostly conserved genomic regions. Nucleosome positioning in these regions was recently studied22,23. Our analysis reveals that the nucleosome density is different at intron-exon and exon-intron junctions, while the pattern of genome variation is similar (Figure 3). We observe a pronounced stable nucleosome position at the exon-intron junction, while the intron-exon boundary seems to align with a trough in the nucleosome density, flanked by two positioned nucleosomes. This difference in the profile of stable nucleosome positions at the two boundaries is consistent with the distribution of nucleosome-disfavoring sequences, which has a stronger and wider peak at the intron-exon junctions than at the exon-intron junctions22.
Distributions of SNPs and indels reach minima at both intron-exon and exon-intron junctions and feature wide trough on the exon side of the splice site. The appearance of the trough inside exons is consistent with their coding function and the presence of the conserved regulatory elements such as exonic splicing enhancers and silencers located in these regions24. Overall the distribution of genome variation around splice sites seems to be driven by the requirement on the conservation of splicing signals rather than by the nucleosomal patterns. Nevertheless, indels follow the trends observed on the genome scale (Figure 1) and are excluded from such positions at the splicing sites. On the other hand, the distribution of SNPs differs from that expected from the genome-wide analysis since we do not observe any increase in the SNP frequency at the nucleosome positions, even though more than half of the nucleosomes at the splicing sites are bulk. These observations suggest that strong selective pressure at specific genomic locations can overcome the traits in the genome variation profile imposed by nucleosome positioning.
Comparison of the distributions of SNPs, indels, and stable nucleosome positions around transcription starts reveals two levels of coordination between genome variability and nucleosome positioning (Figure 4A). First, the overall increase in the nucleosome density around TSS is correlated with the decrease in density of both SNPs and indels in this region. It should be noted that the increase in the nucleosome density corresponds to stable positions only and may not represent the overall density of nucleosomes. Also, higher accessibility of open chromatin at TSS for nuclease digestion used to produce mono-nucleosome fragments for sequencing can contribute to the appearance of such an increase.
Second, genome variation and nucleosome profiles are negatively correlated at the level of individual nucleosome positions, specifically at the nucleosome-free region and at the positions +1,+2, and +3 downstream of TSS. The Pearson correlations calculated inside the 1.5 kb region around TSS between genome variations and nucleosome occupancy indicate that both SNPs and indels are depleted at stable nucleosome positions at gene starts (Supplementary Table 1). Of the two, indels are anti-correlated with nucleosome occupancy to a higher degree than SNPs for all types of nucleosomes, in agreement with the results shown in Figure 1. Here, the profiles were de-trended before the calculation of correlation coefficients (see Methods for detail), and therefore our results are not influenced by the ‘overall’ coordination described above.
The exact location of the position +1 for bulk nucleosomes has been shown to depend on the transcription status of the gene12. The genes that are highly transcribed in a broad range of tissues often have their TSS encompassed by CpG islands25,26, implying that the transcription status of those genes is reflected in the underlying DNA sequences. In this context, it is interesting to compare the profiles of the sequence variation and nucleosome positioning around TSS for the CpG and non-CpG genes. We focus this analysis on bulk nucleosomes because most of the epigenetic nucleosomes considered in the current study are associated with the transcriptionally active genes and the nucleosome occupancy profiles are nearly identical around TSS of CpG and non-CpG genes for these nucleosomes6.
Since the number of stable nucleosome positions determined from the experimental data for bulk nucleosomes is not sufficient to obtain a reliable average profile around TSS for each gene group, we treat all sequenced tags as independent nucleosome fragments (Figures 3B). This approach allows increased statistical power to detect small changes in average profiles although it may reduce accuracy at the level of individual nucleosomes. This comparison shows that position +1 of bulk nucleosomes is shifted downstream in CpG genes as compared to non-CpG genes, as expected for the genes with increased expression level12. The minimum in the indel distribution for non-CpG nucleosomes aligns well with the bulk nucleosome position +1 for this group and the indel frequency minimum moves in the direction of the nucleosome position +1 for CpG genes (Figure 4B). This shift of the minimum in the indel profile indicates that the nucleosome positioning at TSS of CpG genes has evolved together with DNA sequence, presumably to accommodate high levels of transcription in a broad range of tissues26. The distribution of SNPs does not exhibit the same level of coordination with nucleosome occupancy at TSS of CpG and non-CpG genes as indel distribution does (Supplementary Figure 3), in accordance with the lower correlation between SNP and nucleosome density observed previously (Supplementary Table 1).
Around TES, indel density is negatively correlated with stable nucleosome positions, while SNP density is positively correlated (Figure 4C, Supplementary Table 1). We note that most nucleosomes at TES are bulk, and thus the positive correlation between SNPs and nucleosome density agrees with our finding that the SNP occurrence is higher on average inside the core sequences of the bulk nucleosomes (Figure 1B).
There are two possible explanations of the differences in SNP occurrence profiles around bulk and epigenetic nucleosomes observed in our analysis. One possibility is that the sequences associated with epigenetic nucleosomes of the types considered in this study are themselves conserved to a higher extent than the positions of ‘less important’ bulk nucleosomes. Another possibility for the lower frequency of SNPs detected for epigenetic nucleosome positions is simply the result of higher conservation of the TSS region, where most of such nucleosomes are located. To clarify this issue, we calculated the distributions of genome variations around nucleosome positions of each type in the regions that are proximal to and distant from TSS (Figure 5).
We observed a clear decrease in SNP density for the epigenetic positions proximal to TSS (Figure 5A). Even far from TSS, the epigenetic positions are not associated with an increase in SNP rate, which is characteristic of bulk nucleosomes (Figure 5B). In the TSS proximal regions, most of the stable nucleosome positions in our set are epigenetic and only a small number of bulk positions are available. Despite the resulting jaggedness of the SNP density profile for bulk nucleosomes, it is obvious that there is no clear dip at TSS proximal regions, consistent with the first explanation given above.
Our results indicate that while indels are depleted on average in all types of nucleosomes at TSS, TES, and genome-wide, the density of SNPs exhibits a more intricate behavior. In particular, SNPs are negatively correlated with nucleosome occupancy at TSS and positively correlated with nucleosome occupancy at TES (Figure 4 and Supplementary Table 1). In agreement with this observation, the density of SNPs is increased in the core sequences associated with bulk but not epigenetic nucleosomes, which constitute a majority of the nucleosomes in our set at TES and TSS respectively (Figure 1). The positive correlation between SNP density and nucleosome occupancy was reported earlier for the fish and yeast genomes7,10. We note that the nucleosome positions used in earlier studies correspond to bulk positions in our notation; thus, the results between the previous studies and ours are consistent. However, in the current paper we show that this rule does not hold in a number of important cases in the human genome. At least some classes of epigenetic nucleosomes are associated with a decrease rather than increase in the SNP density (Figure 5). We also report that SNPs may be negatively correlated with the nucleosome density in the DNA regions that are under strong selective pressure, such as exon-intron boundaries (Figure 3). Thus, our findings highlight the complexities in the interplay between the mechanisms that control SNP appearance and nucleosome positioning in humans.
It is interesting to consider why nucleosomal sequences in bulk are strongly depleted of one type of mutations, indels, while they are either only moderately depleted or even enriched in another type of mutations, SNPs. In general, two mechanisms are potentially responsible for the difference in the density of genome variations inside and outside nucleosomes27. One is connected to the alteration of the mutation rate in nucleosomal DNA, e.g. due to physical interaction the nucleosomal DNA with histones7,10,28. Another assumes that the DNA sequences that contain nucleosome positioning signals and/or binding sites of transcription factors are evolutionarily conserved to a higher extent than the adjacent DNA fragments29. These mechanisms are not mutually exclusive and can both contribute as discussed below.
Our observation of roughly the same frequency of indels inside nucleosomes of different types (Figure 1) suggests alteration of mutation rate rather than action of purifying selection for indels. Indeed, our results provide little support for the hypothesis that the selection pressure excludes indels from nucleosomes. We did not detect a dependence of the nucleosome-to-linker ratio of the indel occurrences on indel length (Supplementary Figure 2), which would be suggestive of this mechanism. Overall, our results indicate that the stable nucleosome positions are reflected in the indel frequency profile regardless of the local base composition or details of regulatory pathways in which a specific DNA locus is involved. This is illustrated by a shift of the nucleosome position +1 at starts of the CpG genes relative to the corresponding position at starts of the non-CpG genes in the indel frequency profile (Figure 4B). The sequence composition of the TSS proximal regions of CpG and non-CpG genes is quite different and CpG genes are actively transcribed in a broader range of cell types than the non-CpG genes26, yet the nucleosome position +1 is reflected in the indel frequency profile in each of these groups.
On the other hand, the density of SNPs appears to be affected by natural selection. A single nucleotide mutation can disrupt a transcription factor binding site to interfere with regulatory pathways. It is less likely that such a mutation would significantly alter the positioning properties of a 147-bp sequence associated with a nucleosome. Furthermore, even if a mutation changes the position of a bulk nucleosome by several base pairs, this may not have any biological effect. As a result, mutations would be tolerated in the core sequence of bulk nucleosomes but would be excluded from the linkers where many transcription factors bind3,30,31. In contrast, correct placement of epigenetically modified nucleosomes is important for gene regulation, and the positions preferentially occupied by these nucleosomes are likely to be conserved to the same or greater extent compared to the linker sequences. It should be emphasized that our results do not imply a complete absence of selective pressure on the bulk nucleosome sequences but rather that the pressure is stronger in linkers than in the nucleosomes of this type. Neither do we suggest that the SNP occurrence rate is not changed in nucleosome core sequences at all. Rather, our results provide information about the mechanisms contributing the most to the observed features of sequence variation profiles.
The interpretation of a stronger conservation of epigenetic nucleosome positions, rather than the difference in mutation rates in the bulk and epigenetic nucleosomes is further supported by two lines of evidence. The fraction of SNPs rarely occurring in population, in particular those associated with only one genome in our data set, is higher for the epigenetic than for bulk nucleosomes (Supplementary Figure 4). This indicates a stronger selection against SNPs from the epigenetic nucleosomes. As discussed above, we also observe a clear drop in SNP density at the nucleosome positions coinciding with exon-intron boundaries (Figure 3), which is likely to result from the strong selection pressure acting on the splicing sites. Since the greater part of the nucleosomes proximal to exon-intron junctions are bulk, the anti-correlation of SNP frequency with nucleosome occupancy argues against the idea that the presence of nucleosomes of this type necessarily increases the SNP accumulation rate.
Taken together, our results suggest that a combination of purifying selection acting on biologically important sequences and the alteration of the mutation rate in nucleosomal DNA determine the pattern of sequence variation in the human genome (Figure 6). Further studies are required, however, to unambiguously prove or disprove the involvement of the above mechanisms in the evolution of nucleosome positioning sequences in the human genome. In particular, characterization of molecular mechanisms that can underlie chromatin-directed mutational bias will undoubtedly advance our understanding of the principles of genome evolution.
We identified SNPs and indels by comparing trace sequences with the sequence of the reference human genome (NCBI version 36.2). The trace data from the human libraries produced in 8 different sequencing centers (Agencourt Biosciences (ABC), Baylor College of Medicine (BCM), Celera (CRA), Cold Spring Harbor Laboratory –Watson Genome (CSHL), J. Craig Venter Institute (JCVI), Santa Cruz Genome Center (SC), Whitehead Center for Biomedical Research (WIBR), Washington University Sequencing Center (WUGSC) – referred to as sources) were downloaded from the Trace Archive (NCBI). The traces were mapped to the genomic reference DNA using GMAP32 software and the high score alignments were detected by the previously described procedure33. The GMAP alignments were parsed using the following parameters (i) distance of the reported variation from the end of the alignment – more than 20 bp; (ii) perfect alignment of 5 bp of flanking regions on both sides of the variations. All SNPs and indels of lengths from 1 bp to 100 bp were taken for analysis. The repeats content of the indels/SNPs loci was analyzed by comparing variations positions with the RepeatMasker annotation of Human genome. The indels that have lengths of more than 5 bp and contain mono- and dinucleotide repeats were filtered out from the final set. All variations were reported on the positive strand, so each chromosomal position represents a separate event of specific length, type (SNP, insertion or deletion) and allele. The events corresponding to SNPs and indels were clustered separately by the 5'-end for each source and for all sources together. The final data set includes 907,324 indel and 4,068,654 SNP events that were supported by at least 3 traces covering the variation from at least 2 sources (Supplementary Table 2). The histogram showing the frequency of SNP/indel events relative to the reference sequence is shown in Supplementary Figure 5A,B. The distribution of the indel lengths is shown in Supplementary Figure 5C.
We also used a recently published set14 of indels and SNPs based on analysis of 8 human genomes for comparison and validation of the results (the results of the analysis obtained for this dataset, shown in Supplementary Figures 6–8, support our findings described above). The genomic positions of the sequence variation events in the ‘8-genome’ set originally are presented in the coordinates that correspond to the human genome build hg17. The coordinates were converted to the hg18 coordinate frame with UCSC utility liftOver. Both data sets were generated by the analyses of the sequence alignments of the trace sequences to the reference Human genome. We found that the overlaps between the genome loci from both data sets constitute ~50–60% (Supplementary Figure 9). The difference between the sets is explained by the differences in the original traces data used for variation analysis and the definition of the indels/SNPs calling parameters. The first data set was produced based on the libraries from 8 centers including the ABC, which was the only source of the second data set (list of all libraries is given in Supplementary Table 3). The distinction in the indel/SNP calling procedure includes different alignment tools (GMAP and ssahaSNP) applied in the analysis and different classification of the alignments (see Supplementary Material for detail).
Stable positions for nucleosomes bearing H3K4me3 mark (28,976 positions), H2A.Z variant (17,667 positions), and bulk nucleosomes not selected for a specific epigenetic mark or histone variant (27,486 positions) were taken from a recent analysis of ChIP-Seq and MNase-Seq data6. In addition we composed an aggregative set of nucleosome positions that comprises all three individual sets. In the cases of two or more positions located closer than 150 bp to each other only the position that is associated with the largest number of sequenced tags was retained. The final set included 63,554 nucleosome positions.
The autocorrelations in the sequence variation positioning were computed for different lag distances for each chromosome separately and then averaged genome-wide accounting for chromosome sizes, similar to our previous analysis of nucleosome positions6. The frequency profiles of genome variations around stable nucleosome positions represent the indel and SNP occurrences normalized to the number of nucleosome positions in the corresponding set and smoothed in the 75-bp running window. The frequency profiles of the nucleosome positions and sequence variations around TSS, TES, and splicing sites were normalized to the number of genes or exons in the corresponding sets. The genes were oriented in the direction of transcription prior to averaging. Smoothing in the 100-bp running window was used for TSS, TES profiles and for nucleosome frequency in the splicing site profiles. The smaller running window of 11-bp was used in case of the genome variation profiles around splicing sites to allow for a better resolution in this case. The profiles were scaled to the interval from zero to one for easier comparison. Additional loess smoothing in 11-bp window, which does not affect positions of the major minima and maxima on the plots, was applied to reduce the jaggedness in the TSS, TES, and splicing site profiles. For the calculation of Pearson correlations between nucleosome and sequence variation frequencies and creating heatmaps, the profiles were de-trended by subtracting the same profile smoothed in the 750-bp running window.