We analyzed gene expression measurements from lymphoblastoid cell lines representing 210 unrelated individuals studied by the International HapMap Project
[24],
[25]. These expression data, first reported by
[19], were generated using the Illumina Sentrix Human-6 Expression BeadChip. For each sample we also used SNP genotype data from the Phase II HapMap Project, consisting of 3.3 million genotypes per individual
[25].
After remapping the Illumina probes onto human mRNA sequences from RefSeq, we created a cleaned set of expression data for 12,227 distinct autosomal genes that had a unique RNA sequence in RefSeq (see
Methods). For most analyses we removed 634 genes that had one or more HapMap SNPs within the expression probe and 147 very large genes (>500 kb), leaving us with a core data set of 11,446 genes.
We then set out to identify SNPs that affect measured mRNA levels in
cis. As an operational definition, we considered the “
cis-candidate region” to start 500 kb upstream of the transcription start site (TSS) and to end 500 kb downstream of the transcription end site (TES). Consistent with previous work
[20],
[12], our preliminary analysis found that most detectable eQTLs lie within this region.
Although the HapMap samples represent four different populations, originating from Africa, Europe and east Asia, our main analyses pooled the data into a single sample. To avoid false positives due to population-level expression differences
[26],
[20],
[27], for each gene we transformed the African, European and east Asian expression data separately to standard normal distributions prior to combining the samples (
Methods). Our rationale for combining samples was that we should achieve better power and better localization of signals than if we analyzed the populations separately. In doing so, we assume that functional variants usually have similar effects in different populations, an assumption that is parsimonious, and has empirical support
[20],
Figure S1. The overall results for analyses of individual populations are very similar (see
Figures S2,
S3, and
S4).
The Distribution of cis-Acting eQTLs
For each of the 11,446 genes, we tested for putative
cis-acting eQTLs by regressing measured mRNA levels against SNP genotypes, independently for each SNP in the
cis-candidate region, using a standard linear regression model. Consistent with previous reports
[20], we found a substantial number of genes with strong evidence for containing at least one eQTL. A total of 744 genes (6.5%) had at least one SNP with a p-value <7×10
−6. If the smallest p-value in each gene is treated as a summary statistic, this threshold yields a gene-level false discovery rate of 5%
[28].
We also observed that, in many cases, the SNPs most strongly associated with mRNA levels for a particular gene lie in a restricted region, allowing relatively precise localisation of eQTLs. plots examples of p-values in three genes, illustrating both the strong association signal that is often achieved, and the relatively localised nature of many of the signals (
Figure S5).
Encouraged by the potential for these data to localise eQTLs, we next examined the distribution of the physical location of putative eQTLs within the
cis-candidate region. For each gene with an eQTL (defined as having at least one p-value <7×10
−6) we took the position of the
most significant SNP as an estimate of the location of the functional site. In practice, we expect that the most significant SNP will sometimes be the actual functional site, but usually it will not since (1) HapMap contains only ≈1/3 of common SNPs
[25]; (2) some eQTLs may be due to SNPs in LD with nearby copy number variants, though in practice few of the copy number variants known to be associated with expression are well-tagged by SNPs in these data (data not shown;
[19]); (3) a non-functional SNP in strong LD with the functional site may have a smaller p-value by chance. Using simulations we estimate that the median distance between the functional SNP and the most significant SNP in our data is 7.5 kb (
Figures S6 and
S7). As expected, local recombination rates are strongly inversely correlated with the distance between the functional SNP and the most significant SNP (
Figure S8).
shows histograms of the locations of the most significant eQTL SNPs, as a function of gene size. (The plots incorporate a correction factor for the possibility of spurious signals due to undetected SNPs in the expression probes; see
Methods.) Several interesting features emerge. First, the distribution of the most significant eQTL SNPs is roughly centered on the transcribed region. Second, nearly all such eQTL SNPs lie close to genes: we find relatively few that are >50 kb from the corresponding gene. Third, as shown in
Figure S9, there is a significant enrichment of eQTL SNPs in exons compared to introns. We will return to this observation later in the paper.
Finally, for all three gene sizes, the highest density of eQTLs is around the TSS and immediately upstream of the TES, as reported previously in yeast
[22]. The TSS peak was reported in a previous plot of these data
[20], but in that previous analysis the TES signal peak was concealed due to the variability of gene lengths (see
Figure S10). The TES signal is quite asymmetric: among genes with an eQTL, 10% (75) have the most significant eQTL in the 4 kb upstream of the TES, compared with just 4% (29) in the 4 kb immediately downstream.
A Hierarchical Model of eQTLs in the cis-Candidate Region
While reveals the broad distribution of eQTLs and makes few modeling assumptions, it does not easily enable formal model testing about which aspects of gene structure (or other sequence features) are most important for generating eQTLs. Moreover, since the most significant SNP is not always close to the functional site, this approach can be expected to flatten out the true peaks of eQTLs and to increase the numbers of eQTLs that appear to lie far from the target genes.
Consequently, we next developed a Bayesian hierarchical modeling approach that solves many of these problems (see the
Methods for further details). We considered a collection of models in which the parameters predict the prior probability that any given SNP in the
cis-candidate region will be an “eQTN” (i.e., the functional
nucleotide that creates an eQTL). Each model incorporates information about the physical locations of SNPs and, in some of our models, additional functional annotation of the SNPs. (Our calculations assume that the actual functional site is included in the HapMap genotype data; see below for further discussion.) The model parameters are estimated by maximizing the overall likelihood of the expression data, across all genes.
To implement our hierarchical approach, we switched to using Bayesian regression to test for association between SNPs and gene expression
[29] (
Methods). For each SNP in the
cis-candidate region around a gene, we computed a Bayes factor that measures the relative support for the alternative hypothesis (the SNP is an eQTN) compared against the null (the SNP is independent of gene expression). For these data, the Bayes factors are highly correlated with p-values from standard linear regression. However, a key advantage of Bayes factors is that, combined with the prior probabilities specified by the model, they allow us to compute the posterior probability that each SNP is the actual eQTN.
The hierarchical model shares information across all genes about the distribution of signals and this in turn allows better weighting of which SNPs in individual genes are most likely to be eQTNs. For example, consider a hypothetical gene in which two SNPs that are associated with expression are in perfect LD (
r2
=

1). Suppose that one SNP is very close to the TSS, and the other is 30 kb upstream. In the p-value analysis, we would assign each of these SNPs 50% weight. In contrast, the hierarchical model downweights the upstream SNP because it is apparent from the overall data that eQTNs are much more abundant near the TSS, suggesting that the SNP near the TSS is much more likely to be responsible for the signal. Simulations show that the hierarchical model provides a more accurate profile of the distribution of eQTNs (see
Figures S5 and
S11).
Of course, some degree of complication is added by the fact that current HapMap data do not yet contain all SNPs. Therefore, the sites that we infer to be “eQTNs” in this study surely include many SNPs that are tags of nearby functional SNPs that are not in HapMap. This effect will systematically reduce our estimates of the importance of any particular factor in predicting eQTNs. In the case of factors relating to physical location (such as distance from the TSS) simulations show that this has a modest impact on spreading out the signal peaks that we observe, and that the overall distribution of signals is still estimated very well (see
Figure S5,
S11, and
S12). In contrast, in the case of factors relating to functional categories (e.g., whether a SNP lies in a conserved element) we would expect the impact to be much more serious since functional elements are usually small and tag SNPs are unlikely to fall within the same element as a functional site. A second complication is caused by the possibility that undetected SNPs in the expression array probes might create spurious signals
[30]. Our hierarchical model includes an explicit correction for this, using the 634 genes with a known SNP in the probe as training data.
Distribution of eQTNs with Respect to the Transcribed Region
We first set out to get a more refined view of the distribution of eQTNs across the cis-candidate region. The basic versions of our hierarchical model described the positions of SNPs relative to a single “anchor” point such as the TSS. SNPs were grouped into discrete bins based on their distance upstream of the anchor, or downstream (treated separately). The bins nearest the anchor point were just 1 kb wide, to accommodate rapid changes in the rate of eQTNs, while more distant bins were wider (this improves the parameter estimates since the distant bins generally contain few eQTNs). Each bin was associated with a single parameter that relates to the proportion of SNPs in that bin that are eQTNs. The rationale for this model was that it would provide a good description of the data if, for example, the abundance of regulatory elements could be well predicted by distance from the TSS alone.
We also considered models with pairs of anchor points (e.g., the TSS and the TES). In those models, each SNP belonged to two bins, each corresponding to the distance from one anchor point. This model treats the probability that a SNP is an eQTN as the sum of an effect due to the first anchor plus an effect due to the second anchor. Recall that our gene set includes only genes with a single annotated transcript, so that this analysis does not incorporate alternative transcription start or end sites.
compares eight different models using either a single anchor point (e.g., TSS or TES), or pairs of anchors (TSS and one other anchor). We used AIC (Akaike Information Criterion) to penalize the two-anchor models for the extra parameters that they use.
| Table 1Candidate models of eQTN locations, ranked by AIC. |
In summary, the results provide compelling support for a model including both the TSS and TES over all other models (). Two other two-anchor models (namely TSS+probe location, and TSS+coding sequence end) also performed well, presumably because the Illumina probes and the coding sequence end positions are usually near to the TES. However, given that the TSS+TES model had by far the strongest support, we use this model in our subsequent analyses.
We next replotted the locations of eQTNs, using the posterior probabilities estimated by the hierarchical model (). Compared to the p-value-based analysis, the two strong peaks of signal near the TSS and TES are considerably strengthened. Also, in the hierarchical model, the level of background signal upstream and downstream of the gene is greatly reduced, presumably because most of the background signal in the p-value analysis can be explained as resulting from LD with SNPs near the TSS and TES. The hierarchical model estimates that the total number of eQTLs is considerably larger than the number that we detected by linear regression at the rather stringent false discovery rate of 5% (1586 vs. 744). This difference is partly because the hierarchical model adds fractional probabilities for eQTLs that have only partial support for being true eQTLs, and partly because the hierarchical model is more sensitive to signals in locations that are likely a priori.
Another view of the hierarchical model results is shown in the cumulative plots in , which plot the cumulative distribution of eQTNs across the gene region. Most eQTNs lie close to the gene, with less than 7% of the detected cis-eQTNs located more than 20 kb outside the gene. Overall, there are about 3-fold more eQTNs in the upstream region of the gene (5′ of the TSS) than downstream (3′ of the TES) (30% vs. 9%).
We next investigated the peaks of signal near the TSS and TES in more detail, using a finer bin partition close to the TSS and TES (see and
Methods). At this finer scale, the TES signal is extremely sharply peaked over a region of just ~100 bp immediately upstream of the TES. The data strongly reject a model in which the signal is symmetric around the TES (p

=

3×10
−7). In contrast, the TSS signal is more spread out, and spans both sides of the TSS. There is no evidence of asymmetry in the TSS signal (p

=

0.34).
We also observed that the TSS and TES peaks both correspond with two parts of the typical gene structure that, averaging across all 11,446 genes, tend to be highly conserved across the mammalian phylogeny (). The correspondence of the two eQTN peaks with the peaks of conservation suggests that there may be a causal link between these two types of signals. We propose that the sequence conservation reflects, at least in part, the roles of these two locations in regulating mRNA levels, though further work will be needed to verify the connection.
Similarly, the TSS peak also matches up closely with the peak binding densities of a collection of transcription factors that are involved in transcription initiation (reported previously by the ENCODE group, based on ChIP-chip data collected for a set of regions spanning ~1% of the genome
[4]). As might be expected, the ENCODE data identified almost no transcription factor binding near the TES. We return to these latter observations in the
Discussion.
Distribution of eQTNs with Respect to Functional Annotation
We next used our hierarchical model to examine the impact of various types of functional annotation on the probability that a SNP is an eQTN. We first classified SNPs that lie inside genes into categories based on the exon/intron structure (e.g, first, coding and last exons; first, internal, and last introns;
Figure S13). In order to make the model fully identifiable, we estimate the effect of each annotation
relative to the abundance of eQTNs in internal introns (as this category has the greatest number of SNPs). Since gene position is highly predictive of eQTN abundance, we controlled for SNP position using the TSS+TES model. In effect, the hierarchical model now tests whether the annotation adds any predictive value beyond the basic position information. As noted above, incomplete SNP ascertainment in HapMap means that we will generally underestimate–perhaps substantially–the impact of relevant annotations.
The main result of this first analysis is that internal introns have a deficit of eQTNs, compared to coding exons, as well as first and last exons and introns (,
Table S1). For example, SNPs in coding exons are ~2-fold more likely than SNPs in internal introns to be eQTNs. First introns are also relatively enriched for eQTNs compared to internal introns (controlling for position). However, since the total amount of sequence contained in introns vastly exceeds that in exons, 53% of genic eQTNs lie in internal introns compared to 10% in coding exons (see
Table S1). SNP density differs slightly between exons and introns, but not nearly enough to generate a 2-fold difference in eQTN abundance (
Table S2). Overall, the hierarchical model that includes the gene structure annotation as well as position effects relative to the TSS and TES is substantially better than the TSS+TES-only model (by 7 units of AIC).
We then considered the impact of a variety of other types of SNP annotation (see
Methods and
Figure S14). None of these annotations shows convincing signals of enrichment (
Table S3). We estimate a 1.9-fold enrichment of eQTNs inside conserved noncoding elements, as might be expected if these identify functional elements, however the 95% confidence interval narrowly overlaps 1. We also tested for an enrichment of eQTNs at computationally predicted microRNA binding sites, reasoning that SNPs in these binding sites might affect mRNA degradation. We found a suggestive, but non-significant, enrichment of eQTNs in these sites (1.4-fold). It is unclear whether the absence of significant effects in these analyses indicates that these types of annotation are not strongly associated with eQTNs or instead reflects the incompleteness of HapMap and the limitations of current functional annotations.
Finally, based on ENCODE results showing that the promoter regions of genes with CpG islands tend to have more accessible chromatin and greater occupancy by transcription factors
[4], we predicted that CpG status might also provide relevant annotation. Indeed, we find that genes with a CpG island spanning the TSS are expressed at higher average levels, and are ~50% more likely than genes without a CpG island on the TSS to have a
cis-eQTN (15% vs 11%). This effect is consistent with the observation that genes with CpG islands are more likely to be expressed in a wide range of tissues than are genes without CpG islands
[31]. After adjusting for the different overall rates of eQTNs, the distribution of signal locations in the two classes of genes is very similar (
Figure S15).