Datasets and cis-eQTL mapping
For this analysis we used data from the 210 unrelated HapMap samples in the original HapMap Phase I/II cell lines 
. These include 60 CEPH (CEU), 60 Yoruba (YRI), 45 Chinese (CHB) and 45 Japanese (JPT) samples. The SNP genotypes were based on HapMap genotypes for all HapMap SNPs, combined with whole-genome sequence data from the 1000 Genomes Project 
. For individuals not sequenced by the 1000 Genomes Project, missing SNP genotypes were imputed using BIMBAM 
. See the Methods
for further details.
We analyzed expression measurements obtained using three distinct technologies:
- Illumina gene array data from a total of 210 CEU, CHB, JPT and YRI samples  and hereafter referred to as the Illumina data set. This is the expression dataset we used for our previous study ;
- Affymetrix exon array data from 117 CEU and YRI samples  and hereafter referred to as the Affymetrix data set;
- RNA sequence data from 102 CEU and YRI samples , , hereafter referred to as the RNA-seq data set.
Note that the 117 individuals in the Affymetrix data set, and 102 individuals in the RNA-seq data set are both subsets of the 210 individuals in the Illumina data set (and in both cases include the majority of the CEU and YRI samples). The original RNA-seq data sets included a few individuals that were not in either the Illumina data or Phase I/II HapMap; in order to simplify the genotype imputation pipeline these individuals were excluded from the analysis.
To avoid the impact of spurious associations caused by SNPs falling within the probes of the two array data sets (Illumina and Affymetrix), we systematically removed all probes containing at least one SNP. We also removed all probes impacted by short insertions/deletions or copy number variations (CNV) based on the genomic coordinates of these structural variants as provided by 
and the 1000 Genomes Consortium 
. Finally, for all expression data we removed probes (Illumina and Affymetrix) or exons (RNA-seq) that appear to be “non-expressed” (see Materials and Methods
). For each dataset, expression levels were quantile normalized to a standard normal distribution within populations prior to performing the mapping of cis-eQTLs, in order to avoid false positives due to population structure 
. More extensive details on the data processing are provided in the Materials and Methods
section. Table S1
provides a summary of the content of each expression dataset.
For eQTL mapping, we used standard linear regression to test every SNP within the transcript or 100 kb from either end of the transcribed region for association with gene expression. Table S1
reports the number of eQTLs we found for each expression dataset for an empirically estimated False Discovery Rate (FDR) of 5%. For each gene with at least one significant SNP, we treated the most significant SNP as an estimate of the major eQTN (Quantitative Trait Nucleotide) for that gene. When more than one SNP had exactly the same lowest p-value (e.g., several SNPs in perfect LD) we equally shared the probability of being the major eQTN among all these SNPs.
The TES peak is strongest in the Illumina data and absent in the RNA-seq data
The left panels of display the distribution of locations of the best SNP with respect to the target gene (similar to of 
): the vertical red arrow on each panel highlights the location of the Illumina TES peak. As is evident, the peak upstream of the TES is strongest in the Illumina data, weaker in the Affymetrix data, and essentially absent from the RNA-seq data.
Expression QTN distributions estimated using three different technologies for measuring gene expression.
Expression QTN distribution estimated using only those Affymetrix probes that are located within the same exon as an Illumina probe creates an apparent 3′ signal peak.
To assess the evidence for a TES peak more quantitatively, we computed the AIC (Akaike Information Criterion) for a model with, and without a special TES effect (; compare the “TSS, intron, exon” model to the “TSS, intron, exon, last exon” model). As illustrated in the figure, the model with a special effect for the last exon (i.e, the exon ending at the TES) is strongly preferred for the Illumina data (by 188 units of likelihood), it is weakly preferred for the Affymetrix data (by 7 units), and weakly disfavored for the RNA-seq data (by 2 units).
One major difference between the Illumina data and the other two data sets is that a large fraction of the Illumina probes are positioned in the last exon (85% for Illumina compared to 21% for Affymetrix). To assess whether the Illumina probe placement might have helped to create the peak of signals at the TES, we filtered the Affymetrix data to include only those Affymetrix probes that are in the same exon as an Illumina probe (and hence the filtered data set includes mainly probes in the last exons of genes). When we did this, we observed that indeed the filtered Affymetrix data set showed a much stronger peak of eQTLs in the last exon (). This latter observation strongly suggests that the probe distribution on the Illumina array has helped to create an apparent peak of eQTLs in the last exon that is not supported by the other data sets.
In principle, one plausible explanation might be that ungenotyped SNPs in Illumina probes could generate spurious eQTL signals, and that these would often be detected by nearby SNPs; such an effect might in principle generate a spurious 3′ peak of eQTLs. However, this does not appear to be the case. The original analysis of Veyrieras et al
included a correction factor for unmeasured SNPs-in-probes that suggested that the effect of such SNPs was relatively modest 
. That conclusion is confirmed by the analysis presented here, for which we removed all probes containing 1000 Genomes SNPs (which should include nearly all common SNPs in probes); the distribution of eQTNs in the Illumina data is very similar to the corrected distribution reported previously 
eQTNs within last exons frequently have exon-specific effects on expression
Recent work has shown that there are many SNPs that impact the expression levels of individual exons, while not necessarily affecting the overall expression levels of genes 
. Such QTLs are often referred to as “splicing-QTLs” (sQTLs) although in some cases they arise through other mechanisms than splicing changes per se
(for example, changing the transcription end site position 
). Fraser and Xie, who also analyzed the Affymetrix data, reported that the last exon is particularly prone to exon-level QTLs 
. Given these observations, we conjectured that the TES peak of eQTLs in the Illumina data may be caused by SNPs that lie in or near the last exon, and that affect expression levels of the last exon only.
To evaluate this hypothesis, we computed exon-specific expression levels in each individual using both the Affymetrix and the RNA-seq data sets, controlling for the overall expression level of the gene in that individual to remove the impact of gene-level eQTLs (see Materials and Methods
). We then tested whether eQTNs identified using the Illumina data would replicate at either the exon level (as sQTLs) or at the gene level (as eQTLs) in the Affymetrix and RNA-seq data sets. For each gene with a significant eQTL in the Illumina data set (FDR
5%), we designated the most significant SNP as a putative Illumina eQTN. (In many cases the putative eQTNs will not be the true causal sites, but they should at least be in strong LD with the causal sites). For this set of eQTNs we classified each SNP according to its position within the corresponding gene: i.e., the SNP was either in the first, internal or last exon, or it was intronic, or intergenic. We tested each eQTN for association both at the gene level (for the gene identified by the Illumina analysis) and at the exon level (for each exon from that gene that was within 10 kb of the eQTN). Each exon-level test was treated as an independent test.
(left panel) shows clearly that last exon Illumina eQTNs are more likely than other Illumina eQTNs to be sQTNs in the Affymetrix data. Conversely, last exon Illumina eQTNs are less likely than other categories of SNPs to replicate at the gene level in the Affymetrix data (, right panel). Although less striking, shows that the RNA-seq data exhibit a very similar pattern. (Note that the RNA-seq analysis of splicing-QTLs is thought to be underpowered in this data set, except for the most highly expressed genes (see Supplementary Figure 15 of 
), which may explain why the enrichment of last-exon QTLs is lower for the RNA-seq data than for the Affymetrix data.)
Illumina last exon expression-QTLs are more likely to be splicing-QTLs.
As an illustration, provides an example of a highly significant Illumina eQTL (p
), for which the most significant SNP lies within the 3′ UTR. However, analysis of the Affymetrix data indicates that the effect of this SNP is primarily through an exon-specific effect on the last exon (and indeed there may be a separate opposite
effect on exon 2). In summary, we interpret as evidence that the 3′ peak of eQTLs that was detected in the Illumina data is largely due to a large number of exon-level QTLs that were detected by 3′ UTR probes in the Illumina arrays.
SNP rs8984, located within the last exon of gene CHURC1, primarily affects expression of the last exon, but is interpreted by the Illumina analysis (which has only one probe in this gene) as a gene level QTL.
eQTN enrichment within exons is reproducible between platforms
In our previous work, we also reported a 2-fold enrichment of eQTNs within internal exons compared to introns 
. Given our newer observations regarding the TES peak, we thus asked whether this exon enrichment is robust across platforms. To evaluate this, we performed a gene-level analysis of all three data sets using our empirical Bayesian framework (see Materials and Methods
). We considered four different nested models for the distributions of eQTNs:
- TSS: the model accounts only for distance from the TSS;
- TSS, intragenic: same as previously plus an annotation for SNP being within the transcribed region of the target gene;
- TSS, intron, exon: as previously but the intragenic annotation is split into exclusive intron and exon categories;
- TSS, intron, exon, last exon: as previously but the exon category is split into exclusive exon (except the last) and last exon categories.
For each of the three data sets, we ran each model separately and then selected the best model based on the Akaike Information Criterion (AIC).
plots for each data set the difference between each model and the best one (i.e., the model with the smallest AIC value). also reports the odds ratio estimates of the parameters under each model and for each dataset. As noted above, Model 3 (which includes the special last exon effect) is strongly favored by the Illumina data, weakly favored by the Affymetrix data, and weakly disfavored by the RNA-seq data. However, in contrast, all three data sets agree that Model 1 is significantly better than Model 0 (implying a general enrichment of eQTNs within transcribed regions), and that Model 2 is significantly better than Model 1 (implying an enrichment of eQTNs within exons compared to introns).
eQTN enrichment within exons is strongly supported by all three datasets while there is relatively weak evidence that the last exon is special.