Variation in DNA sequence contributes to individual differences in quantitative traits, but in humans the specific sequence variants are known for very few traits. We characterized variation in gene expression in cells from individuals belonging to three major population groups. This quantitative phenotype differs significantly between European-derived and Asian-derived populations for 1,097 of 4,197 genes tested. For the phenotypes with the strongest evidence of cis determinants, most of the variation is due to allele frequency differences at cis-linked regulators. The results show that specific genetic variation among populations contributes appreciably to differences in gene expression phenotypes. Populations differ in prevalence of many complex genetic diseases, such as diabetes and cardiovascular disease. As some of these are probably influenced by the level of gene expression, our results suggest that allele frequency differences at regulatory polymorphisms also account for some population differences in prevalence of complex diseases.
The expression levels of genes determine the distinctive characteristics of cells. Recent studies have shown that gene expression levels in humans differ not only among cell types within an individual but also among individuals1,2. This observation led to analysis of gene expression as a phenotype and to the identification of polymorphic genetic variants that influence individual differences in expression level3–8. However, these studies of the genetics of human gene expression have been restricted to individuals from one European-derived sample, the families collected by the Centre d’Etude du Polymorphisme Humain (CEPH). Differences between populations in gene expression phenotypes have not been characterized. We present an analysis of such differences.
Much of the recognized genetic variation among populations is in DNA polymorphisms with no known functional significance. On the other hand, some allele frequency differences between populations have highly significant phenotypic consequences. Among the best-established are the differences in allele frequencies for mendelian genetic diseases. The marked population differences in prevalence of these qualitative phenotypes (such as cystic fibrosis9 and Tay-Sachs disease10) are entirely due to differences in frequencies of the mutant alleles. However, genetic differences among populations in quantitative phenotypes are potentially just as important functionally.
Here we extend the comparative genetic analysis of population differences from qualitative phenotypes to a particular quantitative phenotype, the expression level of genes. The choice of gene expression as a phenotype provides a large set of comparable traits, all measured at the same time in each individual. Our goals are to determine what proportion of gene expression phenotypes differs significantly between populations and to what extent the phenotypic differences are attributable to specific genetic polymorphisms. We find that at least 25% of the gene expression phenotypes differ significantly between the major populations studied, and specific genetic variation (in allele frequency) accounts for the difference in the most significant instances among the phenotypes that are cis regulated.
We measured the expression of genes in Epstein-Barr virus (EBV)-transformed lymphoblastoid cell lines from three populations that are part of the samples from the International HapMap Project11. These include 60 European-derived individuals from the Utah pedigrees of the Centre d’Etude du Polymorphisme Humain (CEU), 41 Han Chinese in Beijing (CHB) and 41 Japanese in Tokyo (JPT).
We used the Affymetrix Genome Focus Array that contains ~8,500 annotated human genes to measure expression of genes in the 142 individuals from the three populations. We focused on 4,197 genes that are expressed in lymphoblastoid cell lines. There were 939 genes whose expression was significantly different by the t test (P < 10⁻⁵; Pc < 0.05 after Šidák correction12) between the CEU and CHB samples and 756 genes that differed between the CEU and JPT samples. In contrast, there were only 27 genes whose expression differed significantly (P < 10⁻⁵) between the CHB and JPT samples. Because the mean expression levels of most genes are similar between the CHB and JPT samples, we combined the samples as ‘CHB+JPT’ for subsequent analysis, as did the International HapMap Consortium11. At P < 10⁻⁵, there were 1,097 genes that differed between CEU and the combined CHB+JPT samples (Supplementary Table 1 online). Figure 1 shows eight of the gene expression phenotypes with the largest differences between the CEU and CHB+JPT samples. Even when the mean expression differed significantly between populations, the magnitude of the difference was quite small for most genes, and the area of overlap was large. Table 1 describes the 35 genes whose mean expression differs by twofold or more between the CEU and CHB+JPT samples.
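The relationship between the nominal and corrected thresholds quoted above can be checked with a short calculation. The sketch below assumes the standard Šidák formula, Pc = 1 − (1 − P)^m, with m = 4,197 genes tested; the function name is ours and is used only for illustration.

```python
import math


def sidak_corrected(p_nominal: float, n_tests: int) -> float:
    """Family-wise P value after Sidak correction: 1 - (1 - p)^m."""
    # log1p/expm1 keep precision when p_nominal is very small.
    return -math.expm1(n_tests * math.log1p(-p_nominal))


# Nominal threshold used for the expression comparisons, 4,197 genes tested.
pc = sidak_corrected(1e-5, 4197)
print(f"Corrected P for nominal 1e-5 over 4,197 tests: {pc:.4f}")
```

Under this formula, a nominal P of 10⁻⁵ over 4,197 tests corresponds to a corrected value of about 0.041, consistent with the statement Pc < 0.05.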
The gene with the greatest difference between the CEU and CHB+JPT samples was UGT2B17; its mean expression in the CEU individuals was 22 times higher than in the CHB+JPT samples. In both populations, there is a polymorphism for deletion of this gene13. Homozygotes for the deletion are more common in CHB+JPT than in the CEU samples14, accounting for the lower average expression of this gene in CHB+JPT (Fig. 1).
We considered it essential to replicate the marked similarity of the Asian-derived populations and their distinctness from the CEU. We followed up the initial findings with an analysis of 24 samples from the Han Chinese of Los Angeles (CHLA) who are part of the Human Variation Panel15. Among the 35 genes in Table 1, only one (3%) differed significantly (P < 0.05) between the CHLA and the CHB+JPT samples, but 32 (91%) differed significantly between CHLA and CEU.
To investigate the population differences in a multilocus fashion, we carried out cluster analysis16 and grouped the samples from 60 CEU, 41 CHB, 41 JPT and 24 CHLA by similarity of expression level for the 1,097 genes that are differentially expressed between the HapMap CEU and CHB+JPT samples. We expected that the CHB and JPT would cluster together, separately from the CEU. However, we were most interested in how the CHLA samples would be grouped, as they were not used in identifying the 1,097 genes. There were two main clusters (Fig. 2): one consisted entirely of CEU individuals (59 of the 60 CEU), and the other consisted of all the Asian-derived individuals (82 CHB+JPT and 24 CHLA individuals) plus one CEU. Thus, the samples from the Han Chinese of Los Angeles were much more similar in expression profile to the HapMap CHB+JPT samples than to the HapMap CEU samples. This confirms that there is a characteristic expression pattern that the CHLA samples share with the CHB+JPT. The CHLA samples were collected separately from the CHB and JPT samples; therefore, the expression differences between the European- and Asian-derived samples are not an artifact of how the cells were processed.
Our second goal was to determine to what extent the expression-phenotype differences are associated with, and possibly attributable to, specific genetic differences. A large catalog of population differences at the DNA level is available (in the form of SNP frequencies11) for the same HapMap samples we studied at the expression level. We did the analysis in two steps. First, we carried out genome-wide association (GWA) analysis with these SNPs for each of the 1,097 phenotypes to localize the genetic determinants of variation in gene expression. We did this analysis in CEU and CHB+JPT. Then we compared the results for the two samples in order to identify the genetic differences that might explain expression differences between the populations.
We carried out the GWA analysis with the SNP markers as follows. For each expression phenotype, we tested ~2 million SNPs for association by linear regression of expression level on SNP genotype (coded 0, 1, 2). To adjust for the large number of tests, we set the significance level at nominal P = 2.5 × 10⁻⁸ (Pc = 0.05 after Šidák correction), which is conservative. Among 1,097 phenotypes tested, we would expect approximately 55 (0.05 × 1,097) to appear significant by chance. We found 104 phenotypes that showed significant association with one or more markers in the CEU samples: 10 phenotypes with ‘cis’ association and 94 with ‘trans’ association. In the CHB+JPT samples, we found 89 phenotypes with significant association: 23 with cis association and 66 with trans association. We have operationally defined a cis-regulated gene by the presence of significant association with SNP(s) in the region 500 kb upstream of the start of the transcript to 500 kb downstream of the 3′ end. (This definition allows for linkage disequilibrium between a marker and the actual regulatory variants, and for long-range cis regulators.) Among the findings for either population alone, we expect some to be false positive results as indicated above. However, when we found the identical marker (among 2 million) to be significantly associated with the same expression phenotype in both populations, we considered the result very unlikely to be a false positive; instead, it is likely to be the ‘true’ regulatory variant, or in very strong linkage disequilibrium with a regulatory variant.
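The per-SNP test described above is an ordinary least-squares regression of expression level on genotype coded 0, 1, 2. A minimal sketch for a single SNP follows; the data are hypothetical toy values, the function name is ours, and the P value (which would come from a t distribution with n − 2 degrees of freedom) is not computed here.

```python
from statistics import mean


def additive_regression(genotypes, expression):
    """Least-squares fit of expression = a + b * genotype.

    Returns (intercept, slope, r_squared).
    """
    g_bar, e_bar = mean(genotypes), mean(expression)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    sxy = sum((g - g_bar) * (e - e_bar) for g, e in zip(genotypes, expression))
    syy = sum((e - e_bar) ** 2 for e in expression)
    slope = sxy / sxx
    intercept = e_bar - slope * g_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical data: genotype counts of one allele and log2 expression levels.
genotypes = [0, 0, 0, 1, 1, 1, 2, 2]
expression = [6.4, 6.6, 6.5, 7.0, 7.2, 7.1, 7.7, 7.9]
a, b, r2 = additive_regression(genotypes, expression)
print(f"intercept={a:.2f}, slope per allele={b:.2f}, R^2={r2:.2f}")
```

In a genome-wide scan this fit would be repeated for every SNP and every expression phenotype, which is why the stringent nominal threshold is needed.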
The most direct comparison between CEU and CHB+JPT, with respect to regulatory differences at the DNA level, is possible when a gene expression phenotype is associated with the same SNP in both populations. We restricted our attention to the 11 phenotypes of this kind, where the SNPs were the most highly significant in both populations. (In our data, these SNPs were all in cis to the expressed gene.) For these phenotypes, the association was either significant at P < 2.5 × 10⁻⁸ in both populations or significant at approximately P < 2.5 × 10⁻⁸ in one population and somewhat less significant in the other (Table 2).
As the same cis markers are associated with the expression phenotypes in both populations, we assumed that the actual cis regulators were the same in both populations. At this point, however, we did not know whether the mean differences between populations were due mainly to different SNP genotype frequencies or to different mean expression levels for the same SNP genotypes (‘population-specific genotype effects’).
We used nested linear models for gene expression level to partition the overall expression variation sequentially into three components: (i) the effect of genotype variation, allowing for population differences in genotype frequencies, but not for population-specific genotype effects; (ii) additional variation explained by population-specific genotype effects; and (iii) additional variation explained by departures from genetic additivity (dominance). The contributions of these components were represented by the fraction R² of the total sum of squares (see Methods).
Except for one gene (TPP2), the highly significant expression differences between CEU and CHB+JPT were due to differences in genotype (allele) frequency much more than to population-specific genotype effects (Table 2). For five of the genes, the genotype frequency difference accounted for 50% or more of the variation. For example, the G allele of SNP rs2005354 was associated with higher expression of POMZP3 in both populations (Fig. 3). However, the frequency of the G allele was appreciably greater in CEU (0.28) than in CHB+JPT (0.06), with corresponding differences in genotype frequencies. The result is that the mean expression level was higher in CEU (7.3) than in CHB+JPT (6.6). Additional examples are shown in Table 2. We assumed that in these cases the SNP was itself a regulator of gene expression or was in strong linkage disequilibrium with a regulator. For most of these genes, therefore, there was little evidence that the regulators themselves differed. Instead, different frequencies of allelic forms of the regulator accounted for the population differences in expression levels for these expression phenotypes. (We did not find large contributions from dominance for any of the expression phenotypes in Table 2; the largest R² for this component was only 0.08, for UGT2B17.)
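The reasoning in the POMZP3 example can be illustrated with a small calculation: if genotype effects are identical in two populations, a difference in allele frequency alone shifts the population mean. The sketch below is illustrative only. It assumes Hardy-Weinberg genotype proportions, and the genotype means (6.5, 7.5, 8.5) are hypothetical values that are not taken from the data; only the allele frequencies 0.28 and 0.06 come from the text.

```python
def hwe_genotype_freqs(p):
    """Frequencies of genotypes carrying 0, 1 and 2 copies of an allele
    with frequency p, assuming Hardy-Weinberg proportions."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


def population_mean(p, genotype_means):
    """Mean expression as the frequency-weighted average of genotype means."""
    return sum(f * m for f, m in zip(hwe_genotype_freqs(p), genotype_means))


# Hypothetical mean log2 expression for 0, 1 and 2 copies of the G allele,
# assumed to be the same in both populations (no population-specific effects).
genotype_means = [6.5, 7.5, 8.5]

mean_ceu = population_mean(0.28, genotype_means)      # G allele frequency in CEU
mean_chb_jpt = population_mean(0.06, genotype_means)  # G allele frequency in CHB+JPT
print(f"CEU: {mean_ceu:.2f}, CHB+JPT: {mean_chb_jpt:.2f}")
```

With these hypothetical genotype means the two population means differ even though the effect of each genotype is the same, which is the pattern attributed to component (i) of the variance partition.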
In addition to the variation analyzed above, some variation in expression phenotypes between populations can probably be attributed to different regulatory mechanisms. For four phenotypes, we found significant cis association in the CHB+JPT sample but not in the CEU sample (Supplementary Table 2 online). We note, however, that the CHB+JPT sample size is larger (n = 82) than the CEU sample (n = 60), so it is possible that the corresponding cis effects exist in CEU but were not strong enough to be detected with the smaller sample size.
There were four additional phenotypes with significant evidence for trans regulators in both populations, but all four mapped to different genomic regions in the two populations (Supplementary Table 3 online). In several cases, the results were highly significant even after correction for multiple testing. This evidence for different locations suggests that different regulatory mechanisms may account for the variation in expression levels between populations. Nevertheless, we recognize that the genetic analysis for trans regulators has been much less conclusive than for cis regulators, and the apparent differences in location of regulators may be due to association findings that are false positives.
What can we conclude about the relationship between DNA sequence variation and variation in expression phenotype? Our previous studies3,4 showed that expression variation within the CEPH Utah sample is associated with polymorphic variation at the DNA level (that is, with SNPs). Here we have found that 1,097 expression phenotypes (~25% of those tested) differ significantly between the populations studied. Because so many phenotypes differ, when we combine them for analysis, we are able to classify individuals very accurately (as in Fig. 2). However, our primary interest is not in classification but rather in accounting for the expression differences that we found between the populations and in the implications of this finding.
We found that the difference in expression for a set of phenotypes is accounted for by a simple aspect of population genetics. There are marked between-population differences in allele frequencies of the same SNPs that are associated with within-population regulation of expression. In the 11 phenotypes we investigated in detail, these allele frequency differences explain 18%–81% of the total variation in expression level. For five phenotypes, allele frequency differences at SNPs associated with the regulators account for more than half the total variation in expression. In other words, the population differences in these expression phenotypes are largely attributable to frequency differences at the DNA sequence level. Similar results have been found for differences between two strains of Drosophila melanogaster17.
In our analysis, we tested a large set of quantitative phenotypes. By our very stringent criteria, we identified specific genetic polymorphisms strongly associated with the differences between human populations in at least a dozen of these phenotypes (Table 2 and Supplementary Tables 1 and 2). Our approach yields a large collection of comparable measurements, consisting of gene expression phenotypes that can be examined simultaneously and compared among individuals. There are a few other polymorphisms that seem to account for population differences in quantitative traits: these include several examples for skin color (for a review, see ref. 18), which has also been attributed to at least one SNP variant19. Unlike expression profiles, however, these quantitative traits do not lend themselves to analysis as a ‘collection,’ and very few can be confidently associated with a specific genetic polymorphism. A collection of differences in gene expression therefore provides a distinctive way to approach the subject of genetically determined population differences.
The findings for gene expression are relevant for understanding the genetics of disease susceptibility, in particular susceptibility to complex genetic diseases. In discussions of the genetics of complex disease, it has been noted20 that variants in coding regions of candidate genes do not account for a large proportion of disease susceptibility. A reasonable conclusion is that variation in gene expression is responsible instead. There are well-known population differences in the prevalence of complex genetic diseases such as hypertension and type 2 diabetes mellitus. Our results suggest that genetically determined differences in gene expression contribute to these population differences. Analysis of variation in gene expression will enhance understanding of both the underlying genetics and the population differences observed in complex genetic diseases.
Lymphoblastoid cell lines for 60 HapMap CEU, 41 HapMap CHB, 41 HapMap JPT and 24 Han Chinese of Los Angeles (CHLA) were obtained from Coriell Cell Repositories and grown to a density of 5 × 10⁵ cells/ml in RPMI 1640 with 15% FBS (vol/vol), 100 units penicillin/ml, 100 μg streptomycin sulfate/ml and 1% L-glutamine (wt/vol). Several CHB and JPT samples from the HapMap collection were excluded because cell lines were not available at the time of the study. Total RNA was extracted with the RNeasy Mini-Kit (Qiagen) and hybridized onto Affymetrix Genome Focus arrays according to the manufacturer’s protocol. The growing and processing of the HapMap cell lines was randomized by population group to eliminate batch effects that may contribute to apparent population differences in gene expression; the CHLA cells were studied later and were grown and processed at one time.
Expression arrays were analyzed using the Affymetrix MAS 5.0 software. The expression intensity was scaled to 500 and log2 transformed. The 4,197 genes that were called ‘Present’ in at least 80% of the samples in one population were used for further analysis.
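The preprocessing and filtering steps can be sketched as follows. This is a simplified illustration, not the MAS 5.0 algorithm: it scales each array so that its plain mean intensity equals 500 (an assumption; the software's own scaling procedure may differ), and it reads "in one population" as "in at least one population". Function names and the toy values are ours.

```python
import math


def scale_and_log2(intensities, target=500.0):
    """Scale one array so its mean intensity equals `target`, then take log2.

    Simplification: uses the plain mean as the scaling reference.
    """
    factor = target / (sum(intensities) / len(intensities))
    return [math.log2(x * factor) for x in intensities]


def passes_present_filter(calls_by_population, min_fraction=0.8):
    """Keep a gene if it is called Present ('P') in at least `min_fraction`
    of the samples of at least one population (our reading of the text)."""
    return any(
        sum(call == "P" for call in calls) / len(calls) >= min_fraction
        for calls in calls_by_population.values()
    )


# Hypothetical detection calls for one gene ('P' = Present, 'A' = Absent).
calls = {"CEU": ["P", "P", "P", "P", "A"], "CHB+JPT": ["A", "P", "A", "A", "A"]}
print(passes_present_filter(calls))
print(scale_and_log2([250.0, 750.0]))
```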
The significance of the difference between sample means was first tested by the t test. To assess the effect of possible departures from the assumptions for the parametric test, we compared the results with those from the nonparametric Wilcoxon rank-sum test and found very similar results. With the Wilcoxon test, there are 1,104 genes that are significantly different between the CHB+JPT and CEU samples at P < 10⁻⁵. More than 90% of the genes that are significant by the t test are also significant by Wilcoxon test. For several randomly selected phenotypes, we calculated empirical P values by a permutation test. The empirical P values and those from the t test did not differ appreciably. We conclude that the t test is a satisfactory test for significant differences between CHB+JPT and CEU.
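An empirical P value of the kind mentioned above can be obtained by repeatedly shuffling the population labels and recomputing the difference in means. The sketch below is one common form of such a test, assuming a two-sided comparison of means; the number of permutations, the seed and the toy data are our choices for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of population labels
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point rounding in ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one estimate so the empirical P value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical log2 expression values for one gene in two populations.
ceu = [7.4, 7.1, 7.6, 7.2, 7.5, 7.3]
chb_jpt = [6.5, 6.8, 6.4, 6.7, 6.6, 6.9]
print(permutation_p_value(ceu, chb_jpt))
```

The smallest empirical P value such a test can return is 1/(n_permutations + 1), so resolving values near 10⁻⁵ requires well over 100,000 permutations per phenotype.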
The pairwise similarity of all 166 subjects was calculated as the Pearson correlation coefficient of the expression levels of the 1,097 genes that were found to be differentially expressed between the CEU and CHB+JPT samples. The individuals were then grouped by hierarchical clustering using the average linkage method, as implemented in MultiExperiment Viewer.
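The clustering procedure can be illustrated on a toy scale. The sketch below is not the MultiExperiment Viewer implementation: it assumes a distance of 1 − r (Pearson), merges clusters by average linkage, and stops at a fixed number of clusters rather than building a full dendrogram. The profiles are hypothetical.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_linkage(profiles, n_clusters=2):
    """Agglomerative clustering with distance 1 - r and average linkage."""
    n = len(profiles)
    dist = [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                d = mean(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical profiles: 4 individuals (rows) measured on 4 genes (columns).
profiles = [
    [8.0, 7.9, 5.1, 5.0],
    [8.2, 8.0, 5.0, 5.2],
    [5.0, 5.2, 8.1, 7.9],
    [5.1, 5.0, 8.0, 8.2],
]
print(average_linkage(profiles))
```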
Log2 of expression level as the dependent variable was regressed on SNP genotype (coded 0, 1, and 2). Genotypes from release 19 of the International HapMap Project were used. All markers with minor allele frequency >5% were included. Analysis was carried out separately for the CEU and CHB+JPT samples. Correction for multiple testing was performed by the Šidák procedure for 2,050,366 markers in the CHB+JPT samples and 2,246,676 markers in the CEU samples. The corrected P value of 0.05 corresponds to a nominal P value of ~2.5 × 10⁻⁸.
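The nominal threshold quoted above follows from inverting the Šidák formula for the stated marker counts. The sketch below assumes the standard relation 1 − (1 − p)^m = 0.05, solved for the per-test value p; the function name is ours.

```python
import math


def sidak_nominal_threshold(alpha_family: float, n_tests: int) -> float:
    """Per-test P threshold p satisfying 1 - (1 - p)^m = alpha_family."""
    # log1p/expm1 avoid loss of precision for very small thresholds.
    return -math.expm1(math.log1p(-alpha_family) / n_tests)


# Marker counts given in the text for the two samples.
for label, n_markers in [("CHB+JPT", 2_050_366), ("CEU", 2_246_676)]:
    threshold = sidak_nominal_threshold(0.05, n_markers)
    print(f"{label}: nominal P threshold = {threshold:.2e}")
```

Both values are close to the rounded figure of ~2.5 × 10⁻⁸, with the CEU threshold slightly lower because more markers were tested in that sample.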
With nested multiple regression models, the total sums of squares of expression variation can be decomposed sequentially into contributing sums of squares. Each sum of squares measures the additional part of the total variation accounted for when one predictor variable is added to the model, given that the previous predictors are already included. Thus the contribution of each predictor is measured after adjusting for inclusion of previous predictors. In our data, which contain unequal numbers in the genotype categories, different ordering of predictors gives slightly different results.
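We present the contributions from three components: (i) genotype variation, allowing for population differences in genotype frequencies but not for population-specific genotype effects; (ii) additional variation explained by population-specific genotype effects; and (iii) additional variation explained by departures from genetic additivity (dominance).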
This analysis was carried out in SAS for each of the expression phenotypes in Table 2. We report component (i) as ‘R² due to genotype variation’ and component (ii) as ‘additional R² due to population-specific genotype effects’. We did not find large contributions from (iii) for any of the expression phenotypes in Table 2.
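The sequential decomposition can be sketched outside SAS as a series of nested least-squares fits. The model terms below are our assumption of one reasonable parameterization, not the exact models used in the analysis: model 1 has an additive genotype term; model 2 adds a population term and a genotype-by-population term; model 3 adds heterozygote (dominance) terms. The data are hypothetical, and every genotype must occur in both populations for the largest model to be estimable.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rss(design, y):
    """Residual sum of squares of the least-squares fit of y on `design`."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((v - sum(bk * xk for bk, xk in zip(beta, r))) ** 2
               for r, v in zip(design, y))


def sequential_r2(genotype, population, y):
    """Fractions of the total sum of squares added by (i) additive genotype,
    (ii) population-specific genotype effects, (iii) dominance.
    genotype is coded 0/1/2 and population 0/1."""
    het = [1.0 if g == 1 else 0.0 for g in genotype]
    m1 = [[1.0, float(g)] for g in genotype]
    m2 = [[1.0, float(g), float(p), float(g * p)]
          for g, p in zip(genotype, population)]
    m3 = [row + [h, h * p] for row, h, p in zip(m2, het, population)]
    y_bar = sum(y) / len(y)
    tss = sum((v - y_bar) ** 2 for v in y)
    r1, r2, r3 = rss(m1, y), rss(m2, y), rss(m3, y)
    return (tss - r1) / tss, (r1 - r2) / tss, (r2 - r3) / tss


# Hypothetical data: same genotype effect in both populations, but the
# higher-expression allele is more frequent in population 0.
genotype = [0, 0, 1, 1, 1, 2, 2, 0, 0, 0, 0, 0, 1, 2]
population = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
expression = [6.55, 6.45, 7.6, 7.4, 7.5, 8.55, 8.45,
              6.6, 6.4, 6.5, 6.55, 6.45, 7.5, 8.5]
print(sequential_r2(genotype, population, expression))
```

Because each fraction is computed after the preceding terms are already in the model, changing the order in which terms are entered would change the individual fractions when genotype counts are unbalanced, as noted above.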
We thank T. Weber for carrying out the microarray hybridizations, V. Mancuso for processing samples and data analysis and H. H. Kazazian and K. Ewens for comments on the manuscript. This work was supported by US National Institutes of Health grants (to R.S.S. and V.G.C.) and by the W.W. Smith Chair (V.G.C.).
Accession codes. Gene Expression Omnibus (GEO): GSE5859.
Note: Supplementary information is available on the Nature Genetics website.
COMPETING INTERESTS STATEMENT
The authors declare that they have no competing financial interests.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions