|Home | About | Journals | Submit | Contact Us | Français|
The mapping of expression quantitative trait loci (eQTLs) has emerged as an important tool for linking genetic variation to changes in gene regulation1-5. However, it remains difficult to identify the causal variants underlying eQTLs and little is known about the regulatory mechanisms by which they act. To address this gap, we used DNaseI sequencing to measure chromatin accessibility in 70 Yoruba lymphoblastoid cell lines (LCLs), for which genome-wide genotypes and estimates of gene expression levels are also available6-8. We obtained a total of 2.7 billion uniquely mapped DNase-seq reads, which allowed us to produce genome-wide maps of chromatin accessibility for each individual. We identified 9,595 locations at which DNase-seq read depth correlates significantly with genotype at a nearby SNP or indel (FDR=10%). We call such variants “DNaseI sensitivity Quantitative Trait Loci” (dsQTLs). We found that dsQTLs are strongly enriched within inferred transcription factor binding sites and are frequently associated with allele-specific changes in transcription factor binding. A substantial fraction (16%) of dsQTLs are also associated with variation in the expression levels of nearby genes, (namely, these loci are also classified as eQTLs). Conversely, we estimate that as many as 55% of eQTL SNPs are also dsQTLs. Our observations indicate that dsQTLs are highly abundant in the human genome, and are likely to be important contributors to phenotypic variation.
It is now well-established that eQTLs are abundant in a wide range of cell-types and in diverse organisms, and recent studies have implicated human eQTLs as important contributors to phenotypic variation1-5. However, the underlying regulatory mechanisms by which eQTLs impact gene expression remain poorly understood. One mechanism that may be important is when the alternative alleles at a particular SNP lead to different levels of transcription factor binding or nucleosome occupancy at regulatory sites; this in turn may lead to allele-specific differences in transcription rates9-12. In this study, we used high-depth DNaseI-sequencing (DNase-seq) in a panel of 70 individuals and find that indeed a large fraction of eQTLs are likely caused by this type of mechanism.
DNase-seq is a genome-wide extension of the classical DNaseI footprinting method13-15. This assay identifies regions of chromatin that are accessible (or “sensitive”) to cleavage by the DNaseI enzyme. Such regions are referred to as DNaseI hypersensitive sites (DHSs). DNaseI-sensitivity provides a precise, quantitative marker of regions of open chromatin, and correlates well with a variety of other markers of active regulatory regions including promoter and enhancer-associated histone marks. Furthermore, bound transcription factors protect the DNA sequence within a binding site from DNaseI cleavage, often producing recognizable “footprints” of reduced DNaseI sensitivity13,15-17.
We collected DNase-seq data for 70 HapMap Yoruba lymphoblastoid cell lines (LCLs) for which gene expression data and genome-wide genotypes were already available6-8. We obtained an average of 39M uniquely mapped DNase-seq reads per sample, providing individual maps of chromatin accessibility for each cell line (see Supplementary Information for all analysis details). Our data allowed us to characterize the distribution of DNaseI cuts within individual hypersensitive sites at extremely high resolution. As expected, the DHSs coincide to a great extent with previously annotated regulatory regions, and DNaseI sensitivity is positively correlated with the expression levels of nearby genes (Figures S6&7). Overall, the locations of hypersensitive sites are highly correlated across individuals (Supplementary Information)11.
We tested for genetic variants that impact local chromatin accessibility. To do this, we divided the genome into non-overlapping 100 bp windows, and then focused our analysis on the 5% of windows with the highest DNaseI sensitivity (see Supplementary Information). For each individual, we treated the number of DNase-seq reads in a given window, divided by the total number of mapped reads, as a quantitative trait that estimates the level of chromatin accessibility. We then tested for association between individual-specific DNaseI sensitivity in each window and genotypes of all SNPs and indels in a cis candidate region of 40kb centered on the target window.
Using this procedure, we identified associations between genotypes and inter-individual variation in DNase-seq read depth in 9,595 windows at FDR=10% (corresponding to 8,902 distinct DHSs, once we combine adjacent windows whose hypersensitivity data is associated with the same SNP or indel; Figure 1A). We refer to these 8,902 loci as “DNaseI sensitivity QTLs”, or dsQTLs. We additionally considered a much smaller cis-candidate region of just 2kb around each target window, and found that the majority of the dsQTLs are detected within this smaller region (7,088 associated windows in 6,070 DHSs), suggesting that most dsQTLs lie close to the target DHS. In contrast, we find only weak evidence of trans-acting dsQTLs, likely because our experiment is underpowered for detecting these (Supplementary Information). For dsQTLs with enough DNase-seq reads overlapping the most significant SNP (n=892), we confirmed that the fraction of reads carrying each allele in heterozygotes correlates well with the dsQTL effect sizes (Figure 1B, correlation coefficient r=0.72, P<<10−16).
We observed that dsQTLs typically affect chromatin accessibility for about 200-300 bp (Figure 2A). Of the DHSs affected by dsQTLs, 77% lie in chromatin regions predicted by Ernst et al. to be functional in LCLs18: 41% in predicted enhancers, 26% in promoters, and 10% in insulators, even though those chromatin states together cover only 6.7% of the genome overall (and 38% of our hypersensitive sites).
We next studied the properties of cis-acting variants that generate dsQTLs, using a Bayesian hierarchical model that accounts for the uncertainty in which sites are causal19(Supplementary Information). This model obtains unbiased estimates of the average properties of causal sites even though, because of linkage disequilibrium, it is typically uncertain which site is causal for any individual dsQTL (Supplementary Information). As shown in Figures 2B&C, most dsQTLs are generated by variants that are close to the target window. We estimate that 56% of the dsQTLs are due to variants that lie within the same DHSs and that 67% lie within 1 kb of the target window. dsQTLs that lie more than 1kb from the target window are themselves significantly enriched in non-adjacent DHS windows (2.4-fold compared to matched random SNPs), and are often associated with changes in sensitivity in multiple non-adjacent DHS windows (Figure S15).
One intuitive mechanism for dsQTLs is that these may be caused by variants that strengthen or weaken individual transcription factor binding sites, thereby changing transcription factor affinity and local nucleosome occupancy20-22 and hence DNaseI cut rates. Consistent with this model, an aggregated plot of DNaseI sensitivity at dsQTLs shows a distinct drop in chromatin accessibility around putatively causal SNPs that is reminiscent of transcription factor binding footprints, especially in the genotypes associated with high sensitivity15-17.
To test the importance of disruption of transcription factor binding sites as a mechanism underlying dsQTLs, we again turned to the Bayesian hierarchical model. We used the union of all published footprint locations in lymphoblastoid cell lines16-17, and a set of footprints that we identified using the DNase-seq data reported in this study (Supplementary Methods). Analysis using the hierarchical model indicated a 3.6-fold enrichment of dsQTLs within transcription factor binding footprints (P<<10−16), controlling for the overall enrichment within DHSs. Additionally, the allele associated with a higher score of the position weight matrix (PWM) is typically associated with higher chromatin accessibility (P<<10−16), consistent with the expectation that higher transcription factor binding affinity leads to more open chromatin (Figure 2D). Of the dsQTLs that fall within DNase-seq footprints tied to specific transcription factors motifs (using CENTIPEDE17), CTCF, CRE and ISRE are the most enriched while MEF2 is significantly depleted.
To further understand the functional consequences of dsQTLs, we examined ChIP-seq data for nine transcription factors collected by the ENCODE Project in one or more lymphoblastoid cell lines10,23. Overall, the alleles that are associated with increased DNaseI sensitivity are highly associated with increased transcription factor binding (P<10−16 ; Figure 2E), indicating that dsQTLs are strong predictors of changes in occupancy by a range of DNA-binding proteins.
Given that dsQTLs produce sequence-specific changes in chromatin accessibility and, frequently, changes in transcription factor binding, we hypothesized that a fraction of the dsQTL variants might also affect expression levels of nearby genes. We examined this by testing for associations between the most significant variant at each of the dsQTLs detected using the 2kb window size and expression levels of nearby genes (i.e., genes with transcription start sites, TSSs, within 100kb) estimated by sequencing RNA from the same cell lines8. Using this approach, we found that 16% of dsQTL SNPs are also significantly associated with variation in expression levels of at least one nearby gene (FDR=10%). This represents a huge enrichment over random expectation (450-fold, P<<10−16 ; Figure 3). One example of a joint dsQTL-eQTL is illustrated in Figure 3A, in which a SNP disrupts an interferon-sensitive response element (ISRE) located in the first intron of the SLFN5 gene, leading to both a strong dsQTL and an eQTL for SLFN5. Conversely, out of 1,271 eQTLs detected using RNA-seq data from these cell lines8, 23% of the most significant SNPs are also dsQTLs (FDR=10%). Using the method of Storey et al.24 for estimating the proportion of tests where the null hypothesis is false (while accounting for incomplete power), we estimate that 55% of the most significant eQTL SNPs are also dsQTLs and that 39% of the dsQTLs are also eQTLs. Hence dsQTLs are a major mechanism by which genetic variation may impact gene expression levels.
We observed that for most (70%) of the joint dsQTL-eQTLs, the allele that is associated with increased chromatin accessibility is also associated with increased gene expression levels (Figure 3B). Since higher DNaseI-sensitivity generally correlates with higher transcription factor occupancy, this suggests that transcription factors that are bound to DHSs usually act as enhancers. CRE-box and GABP/ETS-box were the most enriched motifs among repressors and enhancers respectively. The dsQTLs that are also eQTLs (FDR=10%) are highly enriched around the TSSs of the target genes: for 23% of the joint dsQTL-eQTLs, the associated DHS is within 1kb, and for 39% it is within 10kb of the TSS (Figure 4A). This is consistent with previous work showing strong clustering of eQTLs around TSSs19,25-26. Nonetheless, there is a significant signal of long-range regulation as far as 100kb. Additionally, 14% of the joint dsQTL-eQTLs are significant eQTLs for two or more genes, suggesting that some regulatory regions affect more than one gene.
We sought to identify additional factors that may influence whether a dsQTL regulates gene expression of nearby genes, while controlling for the very strong effect of distance from TSS (Figure 4B). We observed that a dsQTL is more likely to be an eQTL for the gene with the nearest TSS (1.6-fold, P=3×10−4) and is more likely to be an eQTL if it is located within the transcribed region of the gene (2.7-fold, P=2×10−9). Further, a dsQTL is 2.6 fold more likely to be an eQTL if it is associated with a DHS that overlaps a DNA methylation QTL27 (P=4×10−4), and shows a 2.4-fold increase if the associated DHS overlaps a PolII ChIP-seq peak10 (P=4×10−4). Conversely, a dsQTL is significantly less likely to be an eQTL for a gene if an active binding site for the insulator protein CTCF17 lies between the dsQTL and the gene’s TSS (2.4-fold decrease, P=1×10−12). Finally, the presence of the enhancer mark P300 (from ENCODE ChIP-seq data28) in the dsQTL window increases the probability that a distal dsQTL (TSS>1.5kb) is an eQTL (1.7-fold, P=1×10−5).
In summary, we have shown that common genetic variants impact chromatin accessibility at thousands of hypersensitive regions across the human genome. The putative causal variants most often lie within or very near the hypersensitive regions, and frequently act by changing the binding affinity of transcription factors. Mapping of DNaseI sensitivity QTLs provides a powerful tool for detecting potentially functional changes in a variety of different types of regulatory elements, and roughly 50% of eQTLs are also dsQTLs. Furthermore, analysis of significantly associated SNPs from genome-wide association studies additionally implicates some of these dsQTLs as potentially underlying a variety of GWAS hits (Supplementary Information). Changes in chromatin accessibility may be a major mechanism linking genetic variation to changes in gene regulation and, ultimately, organismal phenotypes.
DNase-seq libraries were created as previously described29, with small modifications. Each library was sequenced on at least two lanes of an Illumina GAIIx. Resulting 20bp sequencing reads were mapped to the human genome sequence (hg18) using an algorithm that we designed specifically to eliminate mappability biases between sequence variants. We divided the genome into 100 bp windows and selected the top 5% in terms of total DNaseI sensitivity. DNaseI sensitivity for each individual in each window was normalized by the total number of mapped reads for that individual. For QTL mapping, the data were further rescaled within and across individuals, and we adjusted the data for an observed individual × GC interaction, as well as for the top four principal components of the DNaseI sensitivity matrix. Genotypes for all available SNPs and indels were obtained from HapMap and 1000 Genomes data and imputed where necessary6,7,30. We performed DNase-seq association mapping by regressing the adjusted sensitivity in each window against the genotypes at variants in a 40 kb region centered on each DHS. As validation, we used our DNase-seq reads as well as ChIP-seq reads and DNase-seq reads from ENCODE to confirm that allele-specific reads spanning heterozygous sites at dsQTLs are consistent with the association analysis. We also used RNA-seq data from the same cell lines8 to study the links between dsQTLs and eQTLs. Finally, we explored the properties of dsQTLs that make them more or less likely to influence gene expression by fitting a logistic model on all dsQTLs, where the eQTL status of each dsQTL-eQTL test is modeled as a function of distance from the TSS and a variety of other annotations. For full details of all methods see the Supplementary Information.
This work was supported by grants from the National Institutes of Health to YG (HG006123) and JKP (MH090951), by the Howard Hughes Medical Institute, by the Chicago Fellows Program (RPR), by the American Heart Association (AAP), and by the NIH Genetics and Regulation Training grant (AAP and JFD). We thank members of the Pritchard, Przeworski, Stephens and Gilad labs as well as three anonymous reviewers for many helpful comments or discussions, and the ENCODE Project for publicly available ChIP-seq data.
Author Contributions A.A.P led the data collection with assistance from S.D.L, K.M, and N.L. The data analysis was performed jointly by J.F.D and R.P.R., with contributions from A.A.P., J.B.V., D.J.G, and J.K.Pi. G.E.C and M.S. provided technical assistance and discussion of methods and results. The manuscript was written by J.F.D., A.A.P., R.P.R., Y.G., and J.K.Pr. The project was jointly supervised by Y.G. and J.K.Pr.
Author information All raw data and tables of all dsQTL are available under GEO accession number GSE31388.