The expression levels of genes determine the distinctive characteristics of cells. Recent studies have shown that gene expression levels in humans differ not only among cell types within an individual but also among individuals1,2. This observation led to analysis of gene expression as a phenotype and to the identification of polymorphic genetic variants that influence individual differences in expression level3–8. However, these studies of the genetics of human gene expression have been restricted to individuals from one European-derived sample, the families collected by the Centre d’Etude du Polymorphisme Humain (CEPH). Differences between populations in gene expression phenotypes have not been characterized. We present an analysis of such differences.
Much of the recognized genetic variation among populations is in DNA polymorphisms with no known functional significance. On the other hand, some allele frequency differences between populations have highly significant phenotypic consequences. Among the best-established are the differences in allele frequencies for mendelian genetic diseases. The marked population differences in prevalence of these qualitative phenotypes (such as cystic fibrosis9 and Tay-Sachs disease10) are entirely due to differences in frequencies of the mutant alleles. However, genetic differences among populations in quantitative phenotypes are potentially just as important functionally.
Here we extend the comparative genetic analysis of population differences from qualitative phenotypes to a particular quantitative phenotype, the expression level of genes. The choice of gene expression as a phenotype provides a large set of comparable traits, all measured at the same time in each individual. Our goals are to determine what proportion of gene expression phenotypes differs significantly between populations and to what extent the phenotypic differences are attributable to specific genetic polymorphisms. We find that at least 25% of the gene expression phenotypes differ significantly between the major populations studied, and specific genetic variation (in allele frequency) accounts for the difference in the most significant instances among the phenotypes that are cis regulated.
We measured the expression of genes in Epstein-Barr virus (EBV)-transformed lymphoblastoid cell lines from three populations that are part of the samples from the International HapMap Project11. These include 60 European-derived individuals from the Utah pedigrees of the Centre d’Etude du Polymorphisme Humain (CEU), 41 Han Chinese in Beijing (CHB) and 41 Japanese in Tokyo (JPT).
We used the Affymetrix Genome Focus Array, which contains ~8,500 annotated human genes, to measure gene expression in the 142 individuals from the three populations. We focused on 4,197 genes that are expressed in lymphoblastoid cell lines. There were 939 genes whose expression was significantly different by the t-test (Pc < 0.05 after Šidák correction12) between the CEU and CHB samples and 756 genes that differed between the CEU and JPT samples. In contrast, there were only 27 genes whose expression differed significantly (Pc < 0.05) between the CHB and JPT samples. Because the mean expression levels of most genes are similar between the CHB and JPT samples, we combined the samples as ‘CHB+JPT’ for subsequent analysis, as did the International HapMap Consortium11. At Pc < 0.05, there were 1,097 genes that differed between CEU and the combined CHB+JPT samples (Supplementary Table 1 online). Figure 1 shows eight of the gene expression phenotypes with the largest differences between the CEU and CHB+JPT samples. Even when the mean expression differed significantly between populations, the magnitude of the difference was quite small for most genes, and the distributions of the two populations overlapped substantially. Table 1 describes the 35 genes whose mean expression differs by twofold or more between the CEU and CHB+JPT samples.
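The Šidák adjustment used for the per-gene comparisons can be sketched as follows. This is a minimal illustration, assuming the correction is applied across the 4,197 expressed genes and that the tests are treated as independent; the function names are hypothetical and are not taken from the study's own software.

```python
import math


def sidak_per_test_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at
    family_alpha across n_tests independent tests: 1 - (1 - alpha)^(1/m)."""
    # log1p/expm1 avoid the rounding loss of computing 1 - (1 - x) directly.
    return -math.expm1(math.log1p(-family_alpha) / n_tests)


def sidak_corrected_p(nominal_p: float, n_tests: int) -> float:
    """Sidak-corrected P value for one nominal P value: 1 - (1 - p)^m."""
    return -math.expm1(n_tests * math.log1p(-nominal_p))


n_genes = 4197  # expressed genes tested, from the text
per_gene_alpha = sidak_per_test_alpha(0.05, n_genes)
print(f"nominal per-gene threshold: {per_gene_alpha:.3e}")
```

Under these assumptions, a gene is counted as differentially expressed when its nominal t-test P value falls below `per_gene_alpha`, which is equivalent to a corrected value Pc < 0.05.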
Figure 1 Gene expression in CHB+JPT (open circles) and CEU (filled circles) for eight genes that differ in mean expression level between the populations (N = 82 for CHB+JPT, and N = 60 for CEU). Additional information is found in Table 1.
Table 1 Thirty-five genes with greater than twofold difference in mean expression between CEU and CHB+JPT samples
The gene with the greatest difference between the CEU and CHB+JPT samples was UGT2B17; its mean expression in the CEU individuals was 22 times higher than in the CHB+JPT samples. In both populations, there is a polymorphism for deletion of this gene13. Homozygotes for the deletion are more common in CHB+JPT than in the CEU samples14, accounting for the lower average expression of this gene in CHB+JPT.
We considered it essential to replicate the marked similarity of the Asian-derived populations and their distinctness from the CEU. We followed up the initial findings with an analysis of 24 samples from the Han Chinese of Los Angeles (CHLA) who are part of the Human Variation Panel15. Among the 35 genes in Table 1, only one (3%) differed significantly (Pc < 0.05) between the CHLA and the CHB+JPT samples, but 32 (91%) differed significantly between CHLA and CEU.
To investigate the population differences in a multilocus fashion, we carried out cluster analysis16 and grouped the samples from 60 CEU, 41 CHB, 41 JPT and 24 CHLA by similarity of expression level for the 1,097 genes that are differentially expressed between the HapMap CEU and CHB+JPT samples. We expected that the CHB and JPT would cluster together, separately from the CEU. However, we were most interested in how the CHLA samples would be grouped, as they were not used in identifying the 1,097 genes. There were two main clusters (Fig. 2): one consisted entirely of CEU individuals (59 of the 60 CEU), and the other consisted of all the Asian-derived individuals (82 CHB+JPT and 24 CHLA individuals) plus one CEU. Thus, the samples from the Han Chinese of Los Angeles were much more similar in expression profile to the HapMap CHB+JPT samples than to the HapMap CEU samples. This confirms that there is a characteristic expression pattern that the CHLA samples share with the CHB+JPT. The CHLA samples were collected separately from the CHB and JPT samples; therefore, the expression differences between the European- and Asian-derived samples are not an artifact of how the cells were processed.
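Grouping samples by similarity of expression profile can be sketched with a small agglomerative procedure. The text does not say which clustering algorithm or distance measure was used, so the average linkage and Euclidean distance below are assumptions, and the sample labels and expression values are invented toy data.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles, cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(profiles, n_clusters=2):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                profiles, clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


# Hypothetical toy profiles (three genes per sample), not data from the study.
labels = ["CEU_1", "CEU_2", "CHB_1", "JPT_1", "CHLA_1"]
profiles = [
    [9.0, 5.0, 7.5],
    [8.8, 5.2, 7.4],
    [6.0, 7.0, 5.0],
    [6.1, 6.9, 5.2],
    [5.9, 7.1, 5.1],
]
clusters = agglomerate(profiles, n_clusters=2)
for members in clusters:
    print(sorted(labels[i] for i in members))
```

In this toy example the held-out CHLA profile joins the CHB and JPT profiles, which mirrors the logic of the replication: samples that played no part in selecting the genes are assigned by their expression profile alone.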
Figure 2 Results of cluster analysis. The 166 individuals are represented by columns, and the 1,097 genes of the main analysis are represented by rows. For each gene, expression level for each individual is indicated by color; intensity of red is proportional ...
Our second goal was to determine to what extent the expression-phenotype differences are associated with, and possibly attributable to, specific genetic differences. A large catalog of population differences at the DNA level is available (in the form of SNP frequencies11) for the same HapMap samples we studied at the expression level. We did the analysis in two steps. First, we carried out genome-wide association (GWA) analysis with these SNPs for each of the 1,097 phenotypes to localize the genetic determinants of variation in gene expression. We did this analysis separately in CEU and in CHB+JPT. Then we compared the results for the two samples in order to identify the genetic differences that might explain expression differences between the populations.
We carried out the GWA analysis with the SNP markers as follows. For each expression phenotype, we tested ~2 million SNPs for association by linear regression of expression level on SNP genotype (coded 0, 1, 2). To adjust for the large number of tests, we set the significance level at nominal P = 2.5 × 10⁻⁸ (Pc = 0.05 after Šidák correction), which is conservative. Among 1,097 phenotypes tested, we would expect approximately 55 (0.05 × 1,097) to appear significant by chance. We found 104 phenotypes that showed significant association with one or more markers in the CEU samples: 10 phenotypes with ‘cis’ association and 94 with ‘trans’ association. In the CHB+JPT samples, we found 89 phenotypes with significant association: 23 with cis association and 66 with trans association. We have operationally defined a cis-regulated gene by the presence of significant association with SNP(s) in the region 500 kb upstream of the start of the transcript to 500 kb downstream of the 3′ end. (This definition allows for linkage disequilibrium between a marker and the actual regulatory variants, and for long-range cis regulators.) Among the findings for either population alone, we expect some to be false positive results as indicated above. However, when we found the identical marker (among 2 million) to be significantly associated with the same expression phenotype in both populations, we considered the result very unlikely to be a false positive; instead, it is likely to be the ‘true’ regulatory variant, or in very strong linkage disequilibrium with a regulatory variant.
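The three ingredients of this paragraph (regression on genotype coded 0, 1, 2; the genome-wide threshold; the operational cis window) can be sketched together. This is an illustrative outline only: the function names, the toy values and the coordinate convention (transcript start less than or equal to transcript end, SNP on the same chromosome) are assumptions rather than details from the study.

```python
import math


def regress_expression_on_genotype(genotypes, expression):
    """Least-squares fit of expression = a + b * genotype (genotype coded
    0, 1, 2). Returns (slope, intercept, r_squared)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    return slope, intercept, (sxy * sxy) / (sxx * syy)


def family_wise_level(nominal_p, n_tests):
    """Sidak family-wise level implied by a nominal threshold: 1 - (1 - p)^m."""
    return -math.expm1(n_tests * math.log1p(-nominal_p))


def is_cis(snp_pos, tx_start, tx_end, window=500_000):
    """True when a SNP lies between 500 kb before the transcript start and
    500 kb past its end (same chromosome assumed, tx_start <= tx_end)."""
    return tx_start - window <= snp_pos <= tx_end + window


# Hypothetical toy values for one SNP and one expression phenotype.
toy_genotypes = [0, 0, 1, 1, 2, 2]
toy_expression = [6.0, 6.2, 6.9, 7.1, 7.8, 8.0]
slope, intercept, r_squared = regress_expression_on_genotype(
    toy_genotypes, toy_expression)

# Nominal P = 2.5e-8 over about 2 million SNPs gives a level close to 0.05.
pc = family_wise_level(2.5e-8, 2_000_000)
print(f"slope={slope:.2f}, R^2={r_squared:.3f}, family-wise level={pc:.4f}")
```

The sketch stops at the slope and R²; converting the slope to a P value requires the t distribution, which is left to a statistics library.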
The most direct comparison between CEU and CHB+JPT, with respect to regulatory differences at the DNA level, is possible when a gene expression phenotype is associated with the same SNP in both populations. We restricted our attention to the 11 phenotypes of this kind, where the SNPs were the most highly significant in both populations. (In our data, these SNPs were all in cis to the expressed gene.) For these phenotypes, the association was either significant at P < 2.5 × 10⁻⁸ in both populations or significant at approximately P < 2.5 × 10⁻⁸ in one population and somewhat less significant in the other (Table 2).
Table 2 Contribution (R²) of SNP genotypes to differences in mean expression between populations for 11 cis-regulated gene expression phenotypes
As the same cis markers are associated with the expression phenotypes in both populations, we assumed that the actual cis regulators were the same in both populations. At this point, however, we did not know whether the mean differences between populations were due mainly to different SNP genotype frequencies or to different mean expression levels for the same SNP genotypes (‘population-specific genotype effects’).
We used nested linear models for gene expression level to partition the overall expression variation sequentially into three components: (i) the effect of genotype variation, allowing for population differences in genotype frequencies, but not for population-specific genotype effects; (ii) additional variation explained by population-specific genotype effects; and (iii) additional variation explained by departures from genetic additivity (dominance). The contributions of these components were represented by the fraction R² of the total sum of squares (see Methods).
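One way to read this sequential partition is as three nested least-squares fits, each reducing the residual sum of squares left by the previous one. The sketch below assumes a specific form for each model (a pooled additive line, then a separate additive line per population, then a separate mean for every population-by-genotype class); the exact models are given in the Methods, which are not reproduced here, so this structure and the function names are assumptions.

```python
def _rss_line(points):
    """Residual sum of squares after fitting y = a + b * g to (g, y) pairs."""
    n = len(points)
    mean_g = sum(g for g, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((g - mean_g) ** 2 for g, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in points)
    return syy if sxx == 0 else syy - sxy * sxy / sxx


def _rss_cell_means(points):
    """Residual sum of squares after fitting one mean per genotype class."""
    by_genotype = {}
    for g, y in points:
        by_genotype.setdefault(g, []).append(y)
    return sum(
        sum((y - sum(ys) / len(ys)) ** 2 for y in ys)
        for ys in by_genotype.values()
    )


def partition_r2(records):
    """records: iterable of (population, genotype, expression).
    Returns the sequential R^2 for (i) pooled additive genotype effect,
    (ii) population-specific genotype effects, (iii) dominance."""
    records = list(records)
    pooled = [(g, y) for _, g, y in records]
    grand_mean = sum(y for _, y in pooled) / len(pooled)
    ss_total = sum((y - grand_mean) ** 2 for _, y in pooled)
    by_pop = {}
    for pop, g, y in records:
        by_pop.setdefault(pop, []).append((g, y))
    rss_1 = _rss_line(pooled)                                  # one shared line
    rss_2 = sum(_rss_line(p) for p in by_pop.values())         # line per population
    rss_3 = sum(_rss_cell_means(p) for p in by_pop.values())   # all class means
    return ((ss_total - rss_1) / ss_total,
            (rss_1 - rss_2) / ss_total,
            (rss_2 - rss_3) / ss_total)


# Toy data: identical additive effect in both populations, different
# genotype frequencies. Population labels and values are hypothetical.
toy = ([("A", g, 6.0 + g) for g in (0, 1, 1, 2, 2, 2)]
       + [("B", g, 6.0 + g) for g in (0, 0, 0, 1, 1, 2)])
r2_freq, r2_pop_specific, r2_dominance = partition_r2(toy)
print(r2_freq, r2_pop_specific, r2_dominance)
```

In the toy data the populations differ only in genotype frequency, so essentially all of the variation falls in the first component; shifting the expression of one population for the same genotypes would move part of it into the second.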
Except for one gene (TPP2), the highly significant expression differences between CEU and CHB+JPT were due to differences in genotype (allele) frequency much more than to population-specific genotype effects (Table 2). For five of the genes, the genotype frequency difference accounted for 50% or more of the variation. For example, the G allele of SNP rs2005354 was associated with higher expression of POMZP3 in both populations (Fig. 3). However, the frequency of the G allele was appreciably greater in CEU (0.28) than in CHB+JPT (0.06), with corresponding differences in genotype frequencies. The result is that the mean expression level was higher in CEU (7.3) than in CHB+JPT (6.6). Additional examples are shown in . We assumed that in these cases the SNP was itself a regulator of gene expression or was in strong linkage disequilibrium with a regulator. For most of these genes, therefore, there was little evidence that the regulators themselves differed. Instead, different frequencies of allelic forms of the regulator accounted for the population differences in expression levels for these expression phenotypes. (We did not find large contributions from dominance for any of the expression phenotypes in Table 2; the largest R² for this component was only 0.08, for UGT2B17.)
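The link between allele frequency and mean expression in the POMZP3 example can be made concrete with a short calculation. The sketch assumes Hardy-Weinberg genotype proportions in each population, which the text does not state; only the two G-allele frequencies (0.28 and 0.06) come from the text, and the function names are hypothetical.

```python
def hwe_genotype_frequencies(allele_freq):
    """Expected frequencies of carrying 0, 1 or 2 copies of an allele under
    Hardy-Weinberg proportions (an assumption, not stated in the text)."""
    p = allele_freq
    q = 1.0 - p
    return {0: q * q, 1: 2.0 * p * q, 2: p * p}


def mean_allele_copies(allele_freq):
    """Mean number of copies per individual; equals 2p under Hardy-Weinberg."""
    freqs = hwe_genotype_frequencies(allele_freq)
    return sum(copies * f for copies, f in freqs.items())


ceu = hwe_genotype_frequencies(0.28)      # G-allele frequency in CEU
chb_jpt = hwe_genotype_frequencies(0.06)  # G-allele frequency in CHB+JPT
copy_gap = mean_allele_copies(0.28) - mean_allele_copies(0.06)

print({k: round(v, 4) for k, v in ceu.items()})
print({k: round(v, 4) for k, v in chb_jpt.items()})
print(round(copy_gap, 2))
```

If each copy of the G allele raised expression by the same amount b in both populations, the expected difference in mean expression would be b multiplied by `copy_gap`. This is the sense in which a frequency difference alone, without any difference in the regulator's effect, can produce a difference in population means.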
Figure 3 The population difference in expression of POMZP3 is accounted for by the allele frequency difference at the very closely linked SNP rs2005354. The left panel shows the distribution of expression level in the same format as in Figure 1. The right panel ...
In addition to the variation analyzed above, some variation in expression phenotypes between populations can probably be attributed to different regulatory mechanisms. For four phenotypes, we found significant cis association in the CHB+JPT sample but not in the CEU sample (Supplementary Table 2 online). We note, however, that the CHB+JPT sample size is larger (n = 82) than the CEU sample (n = 60), so it is possible that the corresponding cis effects exist in CEU but were not strong enough to be detected with the smaller sample size.
There were four additional phenotypes with significant evidence for trans regulators in both populations, but all four mapped to different genomic regions in the two populations (Supplementary Table 3 online). In several cases, the results were highly significant even after correction for multiple testing. This evidence for different locations suggests that different regulatory mechanisms may account for the variation in expression levels between populations. Nevertheless, we recognize that the genetic analysis for trans regulators has been much less conclusive than for cis regulators, and the apparent differences in location of regulators may be due to association findings that are false positives.
What can we conclude about the relationship between DNA sequence variation and variation in expression phenotype? Our previous studies3,4 showed that expression variation within the CEPH Utah sample is associated with polymorphic variation at the DNA level (that is, with SNPs). Here we have found that 1,097 expression phenotypes (~25% of those tested) differ significantly between the populations studied. Because so many phenotypes differ, when we combine them for analysis, we are able to classify individuals very accurately (as in Fig. 2). However, our primary interest is not in classification but rather in accounting for the expression differences that we found between the populations and in the implications of this finding.
We found that the difference in expression for a set of phenotypes is accounted for by a simple aspect of population genetics. There are marked between-population differences in allele frequencies of the same SNPs that are associated with within-population regulation of expression. In the 11 phenotypes we investigated in detail, these allele frequency differences explain 18%–81% of the total variation in expression level. For five phenotypes, allele frequency differences at SNPs associated with the regulators account for more than half the total variation in expression. In other words, the population differences in these expression phenotypes are largely attributable to frequency differences at the DNA sequence level. Similar results have been found for differences between two strains of Drosophila melanogaster17.
In our analysis, we tested a large set of quantitative phenotypes. By our very stringent criteria, we identified specific genetic polymorphisms strongly associated with the differences between human populations in at least a dozen of these phenotypes ( and Supplementary Tables 1). Our approach yields a large collection of comparable measurements, consisting of gene expression phenotypes that can be examined simultaneously and compared among individuals. There are a few other polymorphisms that seem to account for population differences in quantitative traits: these include several examples for skin color (for a review, see ref. 18), which has also been attributed to at least one SNP variant19. Unlike expression profiles, however, these quantitative traits do not lend themselves to analysis as a ‘collection,’ and very few can be confidently associated with a specific genetic polymorphism. A collection of differences in gene expression therefore provides a distinctive way to approach the subject of genetically determined population differences.
The findings for gene expression are relevant for understanding the genetics of disease susceptibility, in particular susceptibility to complex genetic diseases. In discussions of the genetics of complex disease, it has been noted20 that variants in coding regions of candidate genes do not account for a large proportion of disease susceptibility. A reasonable conclusion is that variation in gene expression is responsible instead. There are well-known population differences in the prevalence of complex genetic diseases such as hypertension and type 2 diabetes mellitus. Our results suggest that genetically determined differences in gene expression contribute to these population differences. Analysis of variation in gene expression will enhance understanding of both the underlying genetics and the population differences observed in complex genetic diseases.