We analyzed gene expression data measured using Illumina WG6 microarrays in 210 HapMap lymphoblastoid cell lines from unrelated individuals, first published by Stranger et al.
]. Compared with other existing data sets, these data include expression measurements for the largest set of individuals that have been resequenced by the 1000 Genomes Consortium and thus provide the greatest power to identify and localize eQTLs. Following expression data cleaning (Materials and methods) we were left with expression measurements on 8,526 genes. Expression normalization and removal of unknown confounders greatly increased our power to detect modest associations [17
] (Figure S1 in Additional file 1
). Our genotype data consisted of HapMap genotypes at 3.3 million SNPs for all 210 individuals along with additional genotype calls made by the 1000 Genomes Project for 141 individuals. For SNPs that were called in both the HapMap and 1000 Genomes data, we used the HapMap genotype calls. The genotypes of 1000 Genomes SNPs were imputed in the remaining 69 individuals using BIMBAM [31
], yielding a total of 13.6 M SNPs per individual. For each of 8,526 expressed genes we tested for eQTLs at all SNPs between 100 kb upstream of the TSS and 100 kb downstream of the transcription end site (nearly all of the compelling signals of eQTLs in this data set lie within this region [14
In an initial analysis, we used standard linear regression to identify 2,708 eQTLs at a gene-level false discovery rate (FDR) of 1% (corresponding to a P
-value threshold of P
= 4 × 10-6
). Of these eQTLs, 96% were also detected using HapMap SNPs only (at the same P
-value threshold). However, in many cases, the lowest P
-value 1000 Genomes SNPs were substantially more significant than the lowest P
-value HapMap SNPs (791 of the genes have a 1000 Genomes P
-value at least an order of magnitude smaller than the best HapMap P
-value (Figures S2A and S3 in Additional file 1
)). These observations support the expectation that HapMap SNPs provide good power to detect eQTLs, but frequently miss the functional sites.
In this paper, we will refer to an 'eQTL' as a locus for which at least one SNP shows an association between genotype and gene expression. We assume that each eQTL can be explained by a single causal site, which we will refer to as an 'eQTN' (expression quantitative trait nucleotide). Our primary interest is in understanding the properties of eQTNs. (In this paper we focus on the effects of SNP variation, while recognizing that a modest fraction of eQTLs are caused by other types of variants such as deletions or duplications; see Materials and methods.) In practice, however, there is usually ambiguity as to which SNP is actually driving the observed association. For example, in about 80% of significant eQTLs (at FDR = 1%) there is at least one additional SNP with a P
-value within a factor of 10 of the most significant P
-value (Figure S4A in Additional file 1
). Moreover, the distance between the significant SNPs for a given eQTL is often tens of kilobases or more (Figure S4B in Additional file 1
). This uncertainty poses a serious difficulty for determining whether eQTLs are enriched in any given type of functional element since most functional elements are far smaller than the typical extent of linkage disequilibrium.
The hierarchical model
To account for this uncertainty, we used a Bayesian hierarchical model, similar to that previously developed by our group [15
]. Because it is usually not possible to determine the eQTN for any given eQTL with complete confidence, the hierarchical model instead assigns a posterior probability to each SNP that it is the eQTN and the enrichment estimates are summed over these posterior probabilities. Assigning posterior probabilities allows us to estimate the fraction of eQTNs in an annotation while accounting for uncertainty in determining which SNP is the eQTN.
In brief, the model consists of two levels (a cartoon of this is shown in Figure ). At the level of individual genes, we perform Bayesian regression to test whether the genotypes of each SNP are associated with expression of the gene [33
]. The Bayesian regression yields a Bayes factor for each SNP that measures relative support for a model in which that SNP is the eQTN compared to a model in which that gene has no eQTN. We also compute a prior probability that each SNP is the eQTN, based on a variety of annotations (for example, whether the SNP lies within a conserved region or a DNaseI hypersensitive site). The prior probability for each SNP is computed as a logistic function of the SNP's membership in these annotations; the coefficients of the logistic function (denoted by λl
for annotation l) are estimated across all genes. By combining the Bayes factors with the prior probabilities we can compute a posterior probability that each SNP is an eQTN for a given gene.
Figure 1 A schematic outline of the hierarchical model. (a) Two SNPs that are significantly associated with expression level at the adjacent gene (in our method, association is measured using Bayes factors). (b) SNP 1 is located in regulatory annotations I, II (more ...)
The higher level of the hierarchical model uses all genes with expression data to estimate the coefficients of the logistic prior (that is, the λl). For each annotation, we will refer to the corresponding value of λl as our estimate of the enrichment of eQTNs in that annotation, while controlling for all the other annotations included in the model. eQTN enrichments can be interpreted in the same fashion as a coefficient in a logistic regression. In our case, it is defined as the odds of a SNP being an eQTN given that it is in a certain annotation, divided by the odds if it is not in that annotation, holding all other parameters in the model constant. Estimates of eQTN enrichments are the maximum likelihood estimates of parameters of the hierarchical model. These are computed during the maximization of likelihood of the hierarchical model.
We fit the hierarchical model by maximizing the joint likelihood of the expression data across all genes. This corresponds to setting the λl
to their maximum likelihood estimates. At the same time, for each individual eQTL, the posterior probabilities shift towards SNPs that lie in annotations that are enriched for eQTNs in other genes; the amount of shifting of the posterior is weighted by the degree of enrichment of that annotation genome-wide. We have previously shown with simulated data that this approach provides accurate estimates of the genome-wide enrichment of eQTNs within particular features, despite the uncertainty at individual genes [15
An additional challenge is that both eQTNs and many regulatory annotations are nonrandomly distributed with respect to the TSS and so eQTNs may appear enriched in some annotations by virtue of this spatial distribution alone. We wanted to test whether existing regulatory annotations had explanatory power beyond that expected from their distribution with respect to the TSS. As part of our analysis we therefore developed a background model that captured the effects of distance to the TSS as well as the exon/intron structure of the gene (Materials and methods).
For all annotations discussed in the following sections (DNaseI, histone marks, ChIP-seq, DNaseI foot-prints, core promoter elements and evolutionarily conserved sites) we tested the effect of each annotation separately within the hierarchical model, considering the annotation and the background effects alone. In our final analysis (see 'A combined model of eQTN location' below), we combined all annotations that were significantly enriched in eQTNs, as detected in the first stages of our analysis, in a single model, which we refer to as the combined model. For all analyses using the hierarchical model, we excluded 100 genes with strong eQTLs that we used to test our prior model at the end of the paper (see below for details).
eQTNs in active chromatin: DNaseI hypersensitivity and histone modifications
DNaseI hypersensitivity and a variety of histone modifications can mark regulatory elements and regions of active transcription or repression [34
]. We collated publicly available data for eight histone modifications (H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K36me3, H3K27me3 and H4K20me1) and DNase-seq data, all collected in HapMap lymphoblastoid cell lines (LCLs). These data were generated by the Bernstein and Crawford groups for the ENCODE project [39
] and supplemented with additional DNaseI sequencing by our own group [41
]. To analyze these data we used the hierarchical model considering each annotation separately.
We find that SNPs located within open chromatin, as marked by DNaseI hypersensitivity, are approximately four-fold more likely to be associated with variation in gene expression levels than SNPs outside these regions (Figure ). Histone marks that have been previously associated with active promoters and enhancers (H3K9ac, H3K4me1, H3K4me2, H3K4me3, H3K27ac) [35
] are also significantly enriched in eQTNs (Figure ; Figure S5 in Additional file 1
). In contrast, as might be expected, there is no enrichment for eQTNs in regions marked by the repressive marks H3K27me3 and H4K20me1 (there is instead a weak signal of depletion, albeit nonsignificant, of eQTNs in such regions). The enrichment of eQTNs in regions marked by DNaseI and active histone marks is higher (four- to seven-fold) at distances of > 5 kb upstream of a gene's TSS (Figure ; Figure S5 in Additional file 1
); the enrichment is strongest for H3K27ac, a modification associated with gene enhancers [35
Figure 2 Estimated fold enrichment of eQTNs in putative regions of active chromatin. The plots show the enrichment of eQTNs within DNaseI hypersensitive peaks, or within regions marked by a number of histone modifications. (a) All locations within the cis region (more ...)
Summing over the posterior eQTN probabilities for all eQTLs, we estimate that approximately 20% of all eQTNs occur within DNaseI hypersensitive sites, even though this annotation covers just 1% of the genome. Similarly, over 40% of all eQTNs occur within either a DNaseI hypersensitive site or within a histone-modified region, while this combined annotation covers just 4.5% of the genome (Table S1 in Additional file 1
eQTNs and transcription factor binding: ChIP-seq and DNase-seq footprints
Our analysis of regions of open chromatin suggested that a large fraction of eQTNs impact the function of promoters and enhancers, perhaps by modifying protein-DNA interactions that occur in these regions. We next focused on loci of active TF binding identified using two assays: ChIP-seq and DNase-seq footprinting. ChIP-seq identifies fragments of DNA that are bound by a known protein. While ChIP-seq provides binding information for specific proteins of interest, the resolution is somewhat limited as the signal peaks may be hundreds of base pairs in size. In contrast, individual active TF binding sites can be mapped at the motif level by DNase-seq footprinting [41
]. Here the precise location of TF binding is predicted by identifying DNase-seq 'footprints' detected as protected areas of otherwise hypersensitive regions, which mark the exact location of protein-DNA interaction. DNase-seq footprinting can provide base-pair resolution of the location of factor binding sites; however, there is frequently ambiguity about the active binding factor if multiple factors bind to similar DNA sequence motifs. We used publicly available ChIP-seq data from the ENCODE project for nine TFs [27
] as well as DNaseI-based inferences of individual binding sites from the 'Centipede' algorithm [41
]. Binding sites were grouped into clusters using sequence similarity (Table S2 in Additional file 1
Interestingly, our results suggest that many eQTNs influence binding of specific groups of TFs (Figure ). We find that regions bound by the TF Jun-D are highly enriched for eQTNs (approximately 8.2-fold enrichment above background); strong enrichment is also seen for the immune response factor NF-κB (3.3-fold) (Figure ). Our analysis of individual DNase-seq footprints also shows that overall TF binding sites identified using these methods are enriched in eQTNs (2.2-fold; Figure ). We also find that specific TFs and groups of TFs are substantially more likely to produce eQTNs. Specifically, we find striking enrichments in binding sites of the ETS family of TFs (approximately 7.5-fold enrichment), interferon stimulated response elements (ISREs; approximately 7.5-fold enrichment), CTCF binding sites (approximately 9.4-fold enrichment) and motifs that bind NF-κB (approximately 4.5-fold enrichment). The most enriched signal is for the ETS TF family of TFs, which are known to be closely involved in B-cell development [44
]. Other TFs, including the ISRE TFs and NF-κB are key components of the immune response, in particular the cellular response to viral challenge (ISRE, NF-κB, JunD) [48
Figure 3 eQTN enrichments in regulatory elements directly related to transcription factor binding. (a, b) eQTN enrichments in regulatory elements directly related to transcription factor binding as annotated by ChIP-seq (a) or DNase-seq footprinting (b). Of the (more ...)
eQTNs within the core promoter
A large fraction of eQTNs occur very close to the TSS [15
], and presumably affect the core and distal promoter architecture. The core promoter is usually defined as the collection of regulatory elements within approximately 50 bp either side of the gene TSS, which serve to position RNA polymerase II correctly [51
]. We identified individual functional elements in the core promoter using the following computational approaches. We first generated annotations based on known core promoter motifs, such as the TATA box and the initiator (Inr) element. Next we mapped the locations of the 1,000 hexamer words that are most enriched in the core promoter versus the region immediately upstream. Finally, we also identified evolutionarily conserved regions [53
], conserved TF binding sites [54
], known regulatory elements from the literature [56
] and upstream ORF-causing mutations [57
] within the core promoter region. Our results show that regions of 'high regulatory potential' [58
], overrepresented hexamers and the downstream promoter element (DPE) core promoter motif are significantly enriched in eQTNs (Figure ). Interestingly, of five known core promoter elements we included here, we find a strong enrichment in only a single motif type, the DPE, with a suggestive but weak enrichment in the Initiator (Inr) motif. DPE has the consensus sequence RGWYV and is typically located 20 to 30 bp downstream of the TSS. Experimental work has suggested that this motif may function as a TATA box in TATA-less Drosophila
]. Perhaps surprisingly, the remaining known core promoter motifs, including the TATA box itself, are not predictive of eQTN location (Figure ).
Figure 4 eQTN enrichment in regulatory elements of the core promoter. (a) The fold enrichments of eQTNs in a variety of predicted regulatory elements based on published methods, sequence motifs and evolutionary conservation. See main text for further details. (more ...)
eQTNs in evolutionarily conserved sites
Evolutionarily conserved regions can often provide valuable information on the location of regulatory elements [60
]. We obtained phastCons conserved elements [53
], phyloP negatively selected sites [62
], conserved TF binding sites ('tfbsCons' and 'MotifMap') [54
] and regions of high 'regulatory potential' [58
]. In general, we find that conservation provides surprisingly little information for predicting eQTN location. Only the 'regulatory potential' annotation was marginally significantly enriched in eQTNs (Figure S7 in Additional file 1
). We suggest that the relatively small effect of conservation is a result of accounting for a distance from TSS effect in our background model, which may diminish the utility of conservation as an indicator of regulatory elements.
A combined model of eQTN location
Our survey of existing regulatory annotations identified a number of computational and experimental assays that are significantly enriched in eQTNs. We next assembled these regulatory annotations into a single 'combined' model to reduce uncertainty around putatively causal eQTNs. The annotations included were: DNaseI peaks; the H3K27ac, H3K36me3, K3K4me1, K3K4me2, K3K4me3 and H3K9ac histone marks; known motifs, overrepresented hexamers and high regulatory potential sequences at the core promoter; all the TF ChIP-seq data; and DNase-seq footprints from the ETS, ISRE, CTCF and NF-κB TF groups. In addition to these experimental annotations, we also included our background model, which incorporated distance from the TSS as well as the gene structure.
When parameters are estimated from data, models with a greater number of parameters will always produce a likelihood equal to or greater than a simpler model and so likelihood alone cannot be used to compare combined and background models, which differ in their number of parameters. Instead, we used the Akaike information criterion (AIC), which penalizes models with more parameters. The model with the lowest AIC is the best fit, and a difference of greater than two units of AIC is typically considered significant. Using AIC, our combined model is a significantly better fit to the data than the background model and all the annotation models we used in this study (Figure S8 in Additional file 1
). To test for overfitting, we adopted a ten-fold cross validation approach. In cross-validation, because no parameters are estimated from the test data, the likelihood can be directly used to compare these models. In every case the combined model produces a higher likelihood than the background model on the test data set (Figure S9 in Additional file 1
). This suggests that our combined model adds significant predictive power beyond the background model.
Many of these annotations are correlated and, as a result, their estimated levels of enrichment shrink when included in the same model (Figure ). This is particularly the case for many of the TF ChIP-seq peaks, only two of which (Jun-D and NF-κB) remain significant when included along with the more generic marks of active chromatin, namely DNaseI and the histone marks. It is also clear that in the combined model, some annotations are substantially more informative than others. For example, in the region > 5kb upstream of the TSS it seems that the best indicator of active regions is the putative enhancer histone mark H3K27ac [35
]. The other marks add relatively little when H3K27ac is included in the model, although when tested individually most marks are enriched in eQTNs (Figure S5 in Additional file 1
). We note that the correlations between genomic marks will be averaged over by the model, such that the posterior probabilities will accurately reflect the combined effects of all annotations included.
eQTN enrichments in all functional annotations included in the combined model, ordered by annotation type. Error bars show 95% confidence intervals. Arrows indicate that the confidence interval extends beyond the left end of the x-axis.
Figure illustrates how the hierarchical model combines information from regulatory annotations with Bayes factors to identify high posterior eQTNs. Here, we selected two example high posterior eQTNs (Pr > 0.5) located in NF-κB ChIP-seq regions (identified using ENCODE data in HapMap individual NA12878). We note that, in this case, we are specifically selecting genes where our model places high weight on an individual SNP being the eQTN. A natural way to identify such SNPs is to select those where the posterior probability is > 0.5 - in other words, our data indicate that this SNP is more likely to be the eQTN for that gene than all other SNPs combined. In both cases, the model selects these SNPs because they are strongly associated with variation in expression and they lie within a number of enriched annotations, including DNaseI hypersensitive regions, multiple histone marks and ChIP-seq peaks.
Figure 6 Examples of two high posterior eQTNs, rs473407 and rs28362527, in two genes. The first row shows the Bayes factors for each SNP located within a 35-kb window either side of the gene. The second row shows the position, and marginal enrichment level, of (more ...)
Independent NF-κB ChIP-seq data from nine individuals [63
] are shown in the bottom two panels. These data show that, looking across individuals, the strength of ChIP-seq signal for NF-κB in this region is significantly correlated with the putative eQTN genotypes (P
= 4.2 × 10-3
= 3.4 × 10-4
for rs473407 and rs28362527, respectively).
More generally, for high-confidence eQTNs within NF-κB peaks we see a significant enrichment of positive correlation between eQTN genotype and NF-κB read-depth (P
= 0.013, Kolmogorov-Smirnov test) (Figures S10 and S11 in Additional file 1
). For a large fraction of the eQTNs that are significantly correlated with change in binding, the direction of the change is the same as the direction of change in expression (74%; P
= 6.3 × 10-4
, sign-test), consistent with the generally accepted role of NF-κB as an activator [64
]. Our results therefore suggest that the functions of this group of eQTNs may frequently involve changes in binding level of NF-κB at these locations.
Prediction of eQTN location using only prior information
The hierarchical model combines regulatory annotations (in the form of a prior model) with the association signal derived from eQTL mapping. We tested the extent to which this prior model (that is, excluding the association signal) places a sensible ranking on which SNPs are most likely to generate eQTLs. Before our analysis, we selected 100 genes with a strong eQTL and for which there was a single strong candidate eQTN SNP. These genes were withheld from all analyses using the hierarchical model. The criteria for selecting these genes were that we required (i) at least one SNP with a P-value < 5 × 10-8 (this corresponds to an FDR of 0.01%), and (ii) that the P-value difference between the most significant SNP and the next most significant SNP for that gene be at least two orders of magnitude. This P-value difference corresponds to requiring that the most associated SNP has a roughly 100-fold higher Bayes factor than any other SNP for that gene.
Simulations indicate that, in the absence of genotyping error, this procedure would correctly identify causal SNPs for > 99% of genes; the corresponding rate with realistic genotyping and imputation error rates is > 90% (Figure S12 in Additional file 1
). We may also miss some causal variants (such as structural variants or variants in highly repetitive regions) if they are not included in the SNP data. Note that misidentification of the causal variant will cause our analysis to be somewhat conservative, in the sense that we will tend to underestimate the performance of our prior. These genes will also tend to have lower than average linkage disequilibrium, although this would not seem to have any obvious biasing effect on the performance of the prior.
In the entire data set, 198 genes meet both criteria; the 100 genes that we used were sampled at random from the set of 198 (see Figure S13 in Additional file 1
for examples). We then tested the ability of our prior models to predict the location of the lowest P
-value SNP. This effectively tests whether the prior can distinguish low P
-value SNPs using only regulatory annotations, but without information on gene expression variation.
For 50% of genes the putative causal site is among the top 3% of SNPs in the genic region based on the prior model, and for a large fraction (70%) the putative causal site is ranked among the top 10% of SNPs in the region (Figure ). The model with experimental data is significantly better than the distance model alone (P = 1 × 10-5), and both models are far better than a random prior (P < < 10-16). Our results suggest that, by itself, regulatory annotation can already provide a meaningful selection of putative eQTNs. Combining this prior with gene expression association signals is therefore a powerful approach for identifying causal variants.
Figure 7 Prior rankings of SNPs for 100 genes where a single SNP is a clear best candidate for being the 'true' eQTN using the prior probability from the hierarchical model. The histogram shows the percentage of genes for which the putative causal site is ranked (more ...)