Structure of gene expression variation among populations
We assessed the global landscape of expression using principal components analysis (PCA) (Figure S1
). Unlike SNP-based PCA plots for the same populations, all populations in the expression-based PCA plot do not separate distinctly by their continental ancestry. We assessed correlation of principal component (PC) 1 from the SNP-based PCA which separates African/non-African populations against all principal components from the expression PCA and found decay of correlation from the first 50 principal components maximized at PC3 and PC7 highlighting that gene expression differences, while not distinguishable by heredity alone, are partly shaped by it (Figure S2
To quantify population differentiation with respect to distinct gene expression levels, we calculated the statistic VST
for each of 21,800 probes, corresponding to 18,226 unique autosomal Ensembl genes (see Methods
), for each pairwise combination of populations. For each pairwise comparison of populations, the distribution of VST
values was heavily skewed toward values near 0, with a long narrow tail comprised of values between 0 and 1 (Table S1
and Figure S3
). Individual pairwise combinations of populations differ with respect to numbers of genes exhibiting high VST
genes (Table S1
) such that the amount of VST
between a pair of populations is correlated with the degree of genetic distance; For example, the CHB-JPT combination only has 13 genes with VST
greater than 0.2, whereas the CHB-MKK combination has 4031 genes with VST
greater than 0.2. Together these analyses indicate that the vast majority of genes do not exhibit highly differentiated expression variation between populations, however every pairwise combination of populations has genes with highly structured expression variation. Analysis of the union of probes exhibiting top 5% VST
scores from each pairwise population comparison indicates a significant enrichment of Gene Ontology (GO) terms, including nucleus, protein binding, RNA binding, nucleotide binding, RNA splicing (Table S2
). Each of the eight populations also exhibited significant population-specific GO term enrichment when top population-specific VST
scores were analyzed (Table S3
). For example, CEU exhibits an enrichment in immune response (GO:0006955, p-value 6.7×10−6
) and regulation of immune response (GO:0050776, p-value 4.05×10−5
), indicating strong structure of expression variation for genes of these categories in CEU versus other populations, and that this structure is not seen between any other populations. We also addressed the question of whether those functions which are significantly differentiated in one population inform differentiation in a closely-related population (i.e., differentiation of function may be shared by closely-related populations). For example, differentiation might be similar for each of CHB and JPT when individually compared to all the other six populations (or for LWK, MKK, and YRI). In general, GO terms corresponding to genes which are significantly diverged in expression in one population relative to the others are also diverged in expression in the other, closely-related populations (Figure S4
). A caveat of these analyses is that the cell lines of individual populations were initially transformed at different time points and have been subject to differing numbers of passages, so it is possible that some of the population-specific signals reflect technical issues as opposed to true population-level divergence of function. However, when we consider the number of genes with VST
greater than 0.2 in relation to median FST
, we do not observe that comparisons involving CEU, the oldest cell lines exhibit unusually high VST
Cis associations of gene expression with SNPs
For each of the 21,800 probes (18,226 unique autosomal genes) selected for analysis, we performed a cis
- association test between expression and common SNP genotypes using a Spearman Rank Correlation (SRC) model (See Methods
). The model was applied to each population separately: 1) For the normalized and stratification-corrected expression data, and 2) for the REDUCED expression data. The purpose of presenting both sets of results is to compare the properties of the two sets, as opposed to simply choosing one approach over the other. We analyzed in depth those associations significant at the 0.01 permutation threshold. At this level of significance, we expect roughly 182 genes to have at least one significant association by chance, and we detected 657, 774, 698, 795, 773, 472, 947, 799 genes with a significant association in CEU, CHB, GIH, JPT, LWK, MEX, MKK, and YRI, respectively with a false discovery rate (FDR) of 18–39% per population (). Similar FDR values were obtained when evaluating the degree of replication of an eQTL discovered in one population by replication in another (see Methods
) where FDR was estimated to range from 31% to 40%. A lower FDR was observed in non-African populations as expected, given that only a subset of African eQTLs will be represented in non-African samples. In total, at the 0.01 threshold we detected a non-redundant set of 3,130 genes exhibiting a significant cis
association in at least one population. At higher stringency (permutation threshold 0.001), the per-population FDR ranged from 4%–11%; with a total of 1,132 genes exhibiting a significant cis
association in at least one population.
Cis- associations detected with Spearman Rank Correlation analysis of normalized and PCA-corrected expression data.
As cis-eQTLs associated with probes overlaying SNPs need to be interpreted with caution, we examined our significant associations for potential artifacts. At the 0.01 permutation threshold, of the 3,292 probes with significant associations (corresponding to 3,130 genes), 249 probes (245 genes), overlaid a SNP. This means that 7.5% of the probes with a significant cis-eQTL overlaid a common SNP (compared to 6.5% among all probes), which did not represent a significant enrichment of these probes among our significant associations. We did, however, consider the effect of these associations on across-population sharing of associations, and report levels of sharing both with and without these probes. At the 0.01 permutation threshold, of the 3,130 genes exhibiting a significant cis association, 1,074 genes had a significant association in at least two populations, and 63 in all eight populations. This indicates that 34% of genes with a significant cis-association had an association in at least two of the populations, and 2% of genes in all eight populations. If we conservatively exclude those probes known to overlap common SNPs, we reduce the number of non-redundant genes exhibiting a cis-association to 2,900, but observe that 957 (33%) of the remaining genes had a significant cis-association in at least two of the populations, and 54 (2%) in all eight populations. Thus, probes with underlying SNPs are not contributing significantly to our estimates of across-population sharing (i.e., replication) of eQTLs. At higher stringency (permutation threshold 0.001), we note that none of the significant associations involve probes with underlying SNPs, and 48% of genes with a significant cis-association had an association in at least two of the populations, and 2% of genes in all eight populations ().
To increase the power of our analysis we ran the cis
- association analysis using the REDUCED data (see Methods
) and the same analysis parameters, and at the 0.01 permutation threshold detected 1,966, 1,950, 1,984, 2,131, 1,794, 1,131, 2,562, 2,415, genes with a significant association in CEU, CHB, GIH, JPT, LWK, MEX, MKK, and YRI, respectively with a false discovery rate (FDR) of 7–16% per population (). In total, there is a non-redundant set of 5,691 genes showing a significant cis
association in at least one population, 3,240 in at least two populations, and 331 in all eight populations. This indicates that 57% of genes with a significant cis
-association had an association in at least two of the populations, which is an increase over the 34% replication observed using the normalized and PCA-corrrected data (). As expected given each population's sample size, all significant detected effects are relatively large; the range of Spearman's rho, the correlation coefficient, is 0.338–0.919 for the normalized and PCA-corrected data, and 0.337–0.933 for the ‘REDUCED data’ (Table S4
). There is substantial overlap between genes detected from the normalized and PCA corrected data with that of the REDUCED data (Table S5
). Of the genes with significant cis
- associations in the REDUCED data analysis, 70–77% of the genes are novel, i.e., were not identified as having a significant cis
- association in the normalized and PCA-corrected data analysis, though the vast majority of these significant p-values were close to significance thresholds in the PCA corrected data analysis (Figure S6
). The additional cis
-eQTLs detected in the REDUCED analysis are likely due to the increased sensitivity of the analysis (see 
). Of the 22 to 35 percent of the cis
-eQTLs that did not replicate in the REDUCED analysis, a portion were likely artifactual associations in the first analysis, or else weak effects with low power for replication (as evidenced by the lowest replication in the analysis of the smallest population, the MEX). Together, these results demonstrate that by applying dimension reduction to the expression data, we do not introduce bias, rather we increase power and replication.
Cis- associations detected with Spearman Rank Correlation analysis of “REDUCED” data.
Multiple effects underlying cis-eQTLs
To quantify the degree to which multiple cis
-associated SNPs comprising a given cis
-eQTL represent multiple, independent effects, we applied a stepwise regression framework to each probe that had at least two significant cis
-eQTL SNPs at the 0.01 permutation threshold. We identified a total of 33 genes with multiple eQTLs (0.15% of all 21,800 genes tested), corresponding to 1.1% of genes with significant cis
-eQTLs, or 1.7% of genes overall that had more than two significant cis
-eQTL SNPs in at least one population. In CEU, fourteen genes exhibited multiple independent cis
-effects (corresponding to 1.8% of genes with significant cis
-eQTLs). In total, 10 genes (1.2%), 1 (0.14%), 7 (0.83%), 1 (0.12%), 0 (0%), 7 (0.71%), and 13 (1.5%) with multiple cis
-eQTLs were detected for CHB, GIH, JPT, LWK, MEX, MKK and YRI respectively (Table S6
). Taken together, at the 0.01 permutation threshold for all eight populations, we observed that ~0–2% of genes with an expression association possess multiple independent cis
-eQTLs effects. At most, a single gene had five independently associated SNPs, i.e., expression of TACO1
), translational activator of mitchondrially encoded cytochrome c oxidase subunit I, a gene with a role in Leigh Syndrome, was independently associated with five SNPs in LWK.
Population sharing of cis-eQTLs
As described above, at the 0.01 permutation threshold we detected 3,130 genes that had a significant cis
- association, 1,074 (34%) of which had a significant cis
-eQTL in at least two populations. We evaluated whether the extent to which pairs of populations shared significant cis
- associations was related to their distance as defined by SNPs, with the goal of assessing the degree of sharing of functional variation across populations of varying ancestry. Qualitatively, we observed that more closely-related populations tend to share more cis
- associated genes than more distantly-related populations (Figure S7
). We further considered sharing using parsimony. Assessing eQTLs discovered at the 0.01 permutation threshold and shared at least at the 0.1 permutation threshold, we used an estimate of the general population structure of these eight populations to identify significant subpopulation sharing between CHB+JPT, CEU+GIH and CEU+GIH+JPT+CHB (Figure S8
). For eQTLs estimated as ancestral (or recurrent) to all populations using this methodology we did not find any enrichment in particular functional categories (no GO terms had a 0.05 significance after Bonferroni correction).
For each pairwise combination of populations, we examined those SNP-probe pairs that were significant in both populations and determined the proportion that had the allelic effect in the same direction in both populations. As described above, at the 0.01 permutation threshold we detected 1,074 genes with a significant cis- association in at least two populations. Among the 28 population pairs, we observed 98.9–100% concordance of allelic direction (). Evaluation of allelic direction concordance from the ‘REDUCED data’ analysis produced nearly identical results (not shown). These results suggest that regulatory variation affects gene expression in the same direction across populations.
Spearman's rho for each significant SNP-probe cis- association shared by at least two populations.
To quantify the degree of concordance of effect size across populations, for those SNP-probe associations significant in multiple populations, we asked whether the SNP exerts the same effect size in each of the populations, as quantified by expression level fold-change differences between homozygote genotype categories in each of the populations. We observe that the effect size (fold difference between homozygotes of the two different genotypic states of a SNP) is shared between any two populations when the association is also shared (), and furthermore, larger effect sizes were slightly more likely to be shared (Figure S9
). In addition, for SNP-probe pairs discovered in one population, if we consider the p-value distribution in the other seven for the same SNP-probe pairs, we observe extensive enrichment of low p-values as indicated by the fraction of expected true positives pi1 (Figure S10
), indicating that our threshold-based estimates of across-population cis
-eQTL sharing are underestimates. This result, paired with the result that effect sizes (fold-change) are similar among populations, suggests that the driving force behind the discovery of an eQTL in one population but not another is mainly due to allele frequency differences and not due to differences in absolute effect size.
Expression level fold-change for significant SNP-probe cis- associations shared by pairs of populations.
Genomic properties of eQTLS
The distribution of cis
- associations relative to the transcription start site (TSS) shows that the majority of association signals are approximately symmetrically centered on the TSS (), with the majority within 100 Kb of the TSS, as has been previously observed 
. Significant associations extend out to 1 Mb (the limit tested in this analysis), with the strongest statistical signals located directly at the TSS. Note the smallest population sample, MEX, provided the weakest statistical signals as expected. We observe a pattern to this distribution when we partition associations into categories based on the number of populations in which the gene is found to have a significant cis
- association. We observe that for those genes found to have a significant cis
-eQTL in only one population, the distribution of most-significantly associated SNPs are uniformly distributed throughout the 2 Mb window ( and Figure S11
). For genes with significant cis
-eQTL associations in all eight populations, the distribution is centered on the TSS, with few associations extending beyond +/−200 Kb from the TSS. As population sharing increases from genes with significant associations in only one population to genes with significant associations in all eight populations, we see a gradual tightening of the distribution around the TSS. This is true for single populations (Figure S11
) and when examined in aggregate across populations (). This is unlikely to be driven simply by false positive associations that fail to replicate, as the tightening pattern is observed even when comparing associations shared in six versus seven versus eight populations, which are themselves unlikely to be false positives. This observation suggests that the genomic distribution of cis
-regulatory variants with respect to their contribution to genome function is different even when deviations of allele frequency are taken into account.
Distribution of cis- associations in each population relative to the transcription start site (TSS).
Distribution of cis- associations relative to the transcription start site (TSS) and in relation to population sharing.
We also observed that a substantial number of SNPs were associated with more than one gene. A total of 264 genes have eQTLs at the 0.01 permutation threshold organized in clusters of 2 or more genes, where the eQTL is identical among genes, suggesting same functional variant. Of these, 52 clusters of 2 or more genes were observed (and therefore replicated) in at least two populations and the distance to TSS of such eQTL-SNPs was larger relative to all other eQTL-SNPs for single genes. This signal suggests the presence of regulatory domains that influence multiple genes in a coordinated fashion and these tend to do so from long distances. Such coordinated regulation from distance is well known in the HOX cluster 
, but our data suggests that this may be a more general phenomenon.
eQTLs and disease
The integration of eQTL results with GWAS has been proposed as a way to move toward biological and mechanistic understanding of complex trait etiology 
, and is already achieving success [e.g.,34]
. It has also recently been demonstrated that genome-wide association signals are enriched for eQTLs 
. We compared our cis
-eQTL results to the National Institute of Health's catalog of genome-wide association studies 
, and found that of 4,772 GWAS SNPs representing 475 traits (available as of August 4, 2011), 62 SNPs were also the most-significant SNP of a cis
-eQTL in at least one population (0.01 permutation threshold, ‘REDUCED’ analysis). These 62 SNPs associate with expression of 57 Ensembl genes, and 51 traits, including Alcohol dependence, Crohn's disease, Coronary Heart Disease, HDL cholesterol, Prostate Cancer, Trigylcerides, and many others (Table S7
). The majority of GWAS studies have been performed in populations of Caucasian ancestry, however the overlap of GWAS SNPs to the strongest-associated cis
-eQTLs did not reflect this; instead all populations were represented (CEU: N
14 SNPs, CHB: N
10, GIH: N
9, JPT: N
21, LWK: N
8, MEX: N
10, MKK: N
9, YRI: N
8). Of the 62 SNPs that were the most significant cis
-eQTL for a given gene as well as associated to a trait in the GWAS catalog, we observed that 15 (~24%) were the most significant SNP of the same gene in at least one additional population, a proportion which is likely an underestimate of the true value, given that the same SNPs weren't necessarily tested in all populations. We asked whether across-population replication of the most significant SNP per gene differed depending on whether that SNP was a GWAS SNP, and observed no difference (Fisher's 2-tailed p-value 0.128). The eQTLs identified through our analyses contribute significantly to the available functional regulatory data that could be used to fine-map and elucidate the function of genetic variants contributing to complex phenotypes.