|Home | About | Journals | Submit | Contact Us | Français|
Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
GWASs may utilize either case-control cohorts to test for associations with diseases or population cohorts to identify associations with quantitative traits. In both cases, it is assumed that the cohorts consist of unrelated individuals that share the same population background, although this may not hold in practice for cohorts used in many current GWASs. The presence of related individuals within a study sample results in sample structure, a term that encompasses population stratification and hidden relatedness. Population stratification refers to the inclusion of individuals from different populations within the same study sample. Hidden relatedness refers to the presence of unknown genetic relationships between individuals within the study sample1,2. The effects of sample structure present in cohorts used for genetic association studies have been well documented and identified as a cause for some spurious associations3,4.
Although limiting study samples entirely to unrelated individuals may be difficult or impossible, genotype data provides valuable information on the sample structure that can inform genetic association analysis. For example, the STRUCTURE software5 uses genotype data to partition the sample into subpopulations within which there is no sample structure and subsequently carries out association tests within the identified subpopulations. To eliminate the effects of hidden relatedness, one can estimate the proportion of genes identical by descent (IBD) between any pair of individuals in the sample and exclude from the analysis those individuals that appear closely related1,6. Population stratification and hidden relatedness, however, constitute just two extreme manifestations of sample structure, and methods are needed to correct for other forms of sample structure. In the genomic control approach7,8, which has been widely adopted, the distribution of test statistics from the single-marker analysis is used to estimate the inflation factor, λ, with which the test statistics are subsequently rescaled, constraining the risk of false positives. The EIGENSTRAT software9,10 uses principal components analysis (PCA) to detect and describe sample structure and has been widely used in GWASs. Some principal components may represent broad differences across individuals within a given data set, effectively capturing a few major axes of population structure, but it is unclear how to interpret the rest of the principal components as surrogates of sample structure11,12. Currently, association studies typically use a combination of these strategies, first identifying close relatives to remove them from analysis, then correcting for broad sample structure using principal components or spatial information and finally correcting for the residual inflation with genomic control6,13,14.
If we knew the complete genealogy of the population, we could, in principle, apply a variance component method to model the effects of the genetic relationships on the phenotypes; this approach would be similar in spirit to the classical polygenic model15 directly applied to association mapping16. The variance component would capture the complex mixture of both population stratification and hidden relatedness that directly results from the genealogy and would correct for these relationships during the mapping. Although the exact genetic relationships between individuals in the samples are unknown, we could take advantage of the high-density genotype information to empirically estimate the level of relatedness between reportedly unrelated individuals.
We report here an approach for correcting for sample structure within GWASs, based on a linear mixed model (also sometimes referred to as a mixed linear model) with an empirically estimated relatedness matrix to model the correlation between phenotypes of sample subjects. Similar variance component approaches have been used successfully in animal models17-19. However, applying even an efficient implementation of a variance component approach, such as EMMA (ref. 19), is computationally intractable for data sets consisting of thousands of individuals, owing to the heavy computational burden in the estimation of variance parameters. Capitalizing on the characteristics of complex traits in humans, we make a few simplifying assumptions that allow us to markedly increase the speed of computations, making our approach readily applicable to GWASs with tens of thousands of individuals assayed at hundreds of thousands of SNPs. For most genetic association studies in humans, because the effect of any given locus on the trait is very small20, we need to estimate the variance parameters only once for each data set, and we can globally apply them to each marker. Our computational improvements reduce the running time for the analysis of a typical GWAS data set using a variance component model from years to hours. The advantage of the variance component approach is that the empirical relatedness matrix encodes a wide range of sample structures, including both hidden relatedness and population stratification. Principal component–based methods, in contrast, by estimating major axes of the pairwise genetic similarity matrix, capture some, but not all, of the sample structure, as we show below.
We evaluate our method using two human GWAS data sets, from the 1966 Northern Finland Birth Cohort (NFBC66)13,21 and the Wellcome Trust Case Control Consortium (WTCCC)6. The NFBC66 is based on a founder population, which is expected to minimize genetic heterogeneity, increasing the chances of mapping genes underlying traits of interest22. This is an ideal sample to evaluate our method because a detailed study23 of this data set has revealed the presence of substantial population structure that could influence the results of genetic association studies. In addition, we apply our method to the case-control studies for seven common complex diseases conducted by the WTCCC6. In both data sets, our method consistently outperforms both genomic control and principal component analysis. We term our method EMMA eXpedited (EMMAX) because it builds on the previous approach EMMA (ref. 19) and markedly reduces the computational cost.
To more closely examine the extent of sample structure within the NFBC66, we used PCA of the genotype covariance matrix9 and multidimensional scaling analysis (MDS) of the identity-by-state (IBS) matrix from NFBC66 samples. The first two coordinates identified by MDS are known to correlate well with the geographical location of the linguistic groups13. The first two principal components in the current sample correlate well with latitude and longitude of parental birthplaces for the subset of individuals with known ancestry (Fig. 1). Indeed, we noted that PCA of genotypes and classical MDS of the IBS matrix lead to very similar results. There is a correlation coefficient of 0.9993 between the first components from PCA and MDS and a correlation coefficient of 0.9978 between the second components. The first five principal components separate to varying degrees the linguistic and geographic subgroups comprising northern Finland (Supplementary Fig. 1), consistent with the previous analysis using MDS13. Despite the clear correlation between geographical regions of origin and the first two principal components, clustering analyses of the IBS matrix using PLINK software or hierarchical clustering in R did not identify separate subgroups.
Performing a simple uncorrected association test for each of the nine phenotypes originally examined in ref. 13, we made the following estimates of the genomic control parameters λ: body mass index, 1.031; C-reactive protein (CRP), 1.007; diastolic blood pressure, 1.031; glucose, 1.045; high-density lipoprotein (HDL), 1.052; insulin plasma levels, 1.029; low-density lipoprotein, 1.098; systolic blood pressure, 1.066; triglyceride, 1.023. These values are all higher than the ones obtained previously with a smaller sample size13 and are substantially higher than what one would expect in a sample with no structure. In addition, the height phenotype, which was not analyzed in the previous study13, has a λ value of 1.187. For reference, note that a conservative estimate of the 95% confidence interval of the inflation factor is between 0.992 and 1.008, assuming independence between the markers.
As hidden relatedness is a possible cause of inflated genomic control parameters, we reanalyzed the data after excluding a larger number of possibly related subjects (a genome-wide IBD estimate of >10% was used as a cutoff with PLINK software, excluding an additional 611 individuals). This resulted in a slight reduction of λ for some phenotypes (Table 1).
As suggested in ref. 9, we explored the effect of including a variable number of principal components in the association tests. Although including two or five principal components are included has a considerable effect on the λ values, further augmenting the number of principal components does not substantially decrease the genomic control parameter (Fig. 2). It is often suggested that only principal components having predictive power for the phenotype should be included in the regression11. We identified principal components for each phenotype that have a t-test P < 0.005 as predictors; the results of their inclusion in the association tests are reported in Figure 2.
We analyzed the ten NFBC66 phenotypes with EMMAX using a three-step procedure (see Online Methods). First, we computed a pairwise relatedness matrix from high-density markers, which we used to represent the sample structure. Second, we estimated the contribution of the sample structure to the phenotype using a variance component model, resulting in an estimated covariance matrix of phenotypes that models the effect of genetic relatedness on the phenotypes. Third, we applied a generalized least square (GLS) F-test24, or alternatively a score test25, at each marker to detect associations accounting for the sample structure using the covariance matrix.
The second step also provides us with the fraction of phenotypic variance explained by the empirically estimated relatedness matrix. We call this fraction pseudoheritability because it resembles the heritability estimated from a pedigree26, although this is not directly interchangeable with heritability of the trait because the estimated pairwise relatedness does not correspond exactly to the kinship coefficients. Nonetheless, the pseudoheritability estimates are concordant with the previous heritability estimates from a large family based study of Kosrae and Sardinian populations27,28. Different methods for estimating the pairwise relatedness provide slightly different but highly correlated estimates of pseudoherit-ability across the ten traits. (Supplementary Table 1).
Using the estimated covariance matrix, we proceeded with the GLS F-test to test the effect of each marker on the phenotype and then applied genomic control to quantify the amount of residual inflation. The genomic control λ parameters we obtained with EMMAX are much lower than those obtained using either standard association methods or regression analysis including 100 principal components (Table 1). Figure 3 and Supplementary Figure 2 illustrate the results using quantile-quantile plots of the P value distributions from these three tests. Only one of the ten phenotypes showed λ values within the 95% confidence interval of 0.992–1.008 with uncorrected or principal component analysis, whereas all of them fell in the confidence interval with EMMAX.
Unlike genomic control, the EMMAX model alters the ranking of SNPs by their association statistics. This is especially important as many recent GWAS follow-up and multistage design studies take the approach of genotyping all SNPs exceeding some predefined threshold29-31. We examined the extent to which the adoption of the EMMAX model changes the SNP rankings in comparison to the uncorrected and principal component analyses. We took the top k markers from the results of EMMAX, the uncorrected method, and regression including 100 principal components (as implemented in EIGENSOFT software), for k between 10 and 5,000. For each of these sets, we calculated the number of SNPs shared between the lists and the fraction of these shared SNPs relative to the number of unique SNPs in each pair of lists. Although many of the top SNPs reported by each method overlap, a considerable number of highly ranked SNPs differ between the methods (Fig. 4 and Supplementary Table 2). In general, EMMAX results are similar to uncorrected analysis when the inflation of test statistics is small, but they become more similar to the PCA as the inflation increases. Notably, the PCA consistently shows larger departures from the uncorrected analysis than EMMAX does across all ten phenotypes. For example, when the overdispersion of test statistics was negligible, such as in the CRP phenotype, only 66% of the top 2,000 hits were concordant between the principal component and the uncorrected analysis, whereas 89% were concordant between EMMAX and the uncorrected analysis.
EMMAX prevents the overdispersion of test statistics using a statistical model that explicitly takes into account sample structure, rather than correcting the overdispersed test statistics caused by not taking into account genetic relatedness in the statistical model. Consequently, EMMAX can also prevent the overcorrection that would remove true positive associations. We identified 15 genome-wide significant loci with at least one of the uncorrected, 100 principal components–corrected, or EMMAX, analyses after genomic control at the suggested P threshold32 of 7.2 × 10−8 across the ten phenotypes (Table 2). In 13 of the 15 loci, EMMAX P values become smaller than the uncorrected analysis. The two-sided binomial P value of the observed asymmetry is 9.8 × 10−4 if two methods have the same statistical power. With the 100 principal components–corrected analysis, 10 of the 15 loci show smaller P values than the uncorrected analysis (binomial P value of 0.12). Although 12 of the 15 loci are found by all methods to be genome-wide significant at P < 7.2 × 10−8, two known loci33, APOB (with triglyceride) and HNF4A (with HDL), pass the threshold only with EMMAX. In contrast, the locus NR1H3 (with HDL), which is genome-wide significant only with uncorrected analysis, turns out to be the only locus whose association has not yet been replicated by an independent study among the 15 loci.
Because EMMAX estimates the variance parameters under the null hypothesis, one may suspect that it is underpowered compared to the full mixed model, which estimates the variance parameters under the alternative hypothesis. This is comparable to the difference between the score statistic and the efficient score statistic25,34,35. As most genetic variants associated to date with human complex traits are estimated to explain only a small fraction of phenotypic variance20, the difference between the two approaches will be negligible in most cases. To assess the seriousness of this concern, we ran the original EMMA, which uses a full mixed effect model, on the 15 peak SNPs and compared the resulting P values to those estimated with EMMAX using GLS. Overall, as expected, the P values from the full mixed effect model tended to be smaller than the P value from the GLS model, but the magnitude of the difference was very small (Supplementary Fig. 3a). However, the running times for EMMA were substantially longer. Because the original EMMA re-estimates the variance parameters at each marker, given the size of the NFBC data set, it took more than 10 min of CPU time per marker on an Intel Xeon 3-GHz processor, even with an efficient C implementation of EMMA. A simple extrapolation suggests that it would take more than 6 years of CPU time to analyze a single GWAS data set using EMMA, taking a full mixed model approach. The total computational time using EMMAX for this data was 6.6 h in a single CPU, and the procedure could easily be parallelized to speed it up further.
We also applied our method to the WTCCC data set consisting of case-control studies for seven common diseases6. To analyze case-control phenotypes, we applied a linear model to the binary phenotypes, in the spirit of Armitage’s test (see Online Methods). We performed association testing over the seven disease phenotypes using EMMAX, EIGENSTRAT and uncorrected analysis. The values we observed for inflation factors λ were very similar to those in the original study, in which the test statistics were uncorrected: bipolar disease, 1.11; coronary artery disease, 1.06; Crohn’s disease, 1.10; hypertension, 1.06; rheumatoid arthritis, 1.03; type 1 diabetes, 1.04; and type 2 diabetes, 1.07. Consistent with our observations over the NFBC66 data, correcting for 100 principal components only partially reduced the inflation factors (Table 3 and Supplementary Fig. 2). When EMMAX was applied, the estimated inflation factors were below the upper bound of the confidence interval, suggesting that none of the phenotypes show significant inflation of test statistics.
However, we noticed that two of the phenotypes, rheumatoid arthritis and type 1 diabetes, show significant deflation of test statistics beyond the 95% confidence interval (λ = 0.965 for rheumatoid arthritis, λ = 0.946 for type 1 diabetes). This is not unexpected, considering that a substantial fraction of the phenotypic variance in these autoimmune diseases is explained by the HLA loci, leading to inaccurate estimation of variance parameters under the null hypothesis when the HLA effect is not accounted for. In fact, the set of genome-wide significant SNPs (P < 7.2 × 10−8; ref. 32) in this region account for 47% and 60% of the phenotypic variance of rheumatoid arthritis and type 1 diabetes, respectively6. We re-estimated the variance parameters by conditioning on the 57 and 134 SNPs within the extended human MHC region36 that explain more than 1% of phenotypic variance of rheumatoid arthritis and type 1 diabetes, respectively (as described in the Supplementary Note). As a result, the genomic control λ increased to 0.989 for rheumatoid arthritis and 0.991 for type 1 diabetes. We performed this conditioning procedure only for estimating variance parameters and not in the SNP association test so that the P values would be consistent with the unconditioned analysis. Conditioning on the SNPs with such a strong effect may further improve the power to identify novel loci. A more sophisticated conditional analysis—for example, one including haplotype effects or epistatic interactions into covariates—may also better account for the strong effects in the autoimmune diseases37.
Under certain conditions, one can expect the variance of the test statistics to be inflated by a constant across the genome7,8. A formal model of hidden relatedness based on the coalescent theory1 also suggests a constant inflation across the genome when the sample structure is entirely due to hidden relatedness7. However, for a more complex genealogical relationship among individuals, it is not clear how the inflation of test statistics will behave.
Using the same variance component framework, we developed a method to estimate the marker-specific inflation of test statistics using the correlation between each marker and the empirically estimated kinship matrix (described in the Supplementary Note). These estimates are concordant with the genome-wide genomic control inflation factor on average but show substantial differences across the SNPs (Fig. 5a). In the height phenotype, for example, the estimated marker-specific inflation factors have a mean of 1.107, s.d. of 0.090 and median value of 1.093. In light of this, we explored the relationship between marker-specific inflation factors and the overdispersion of test statistics with the uncorrected analysis. The distribution of height association P values for SNPs with inflation factors <1.05 shows a less marked departure from uniform distribution than does the distribution for SNPs with inflation factors >1.20 (Fig. 5b,c). Considering that SNPs with a higher inflation factor were identified without consideration of their possible association with the phenotype, it is reasonable to conclude that this excess of small P values reflects overdispersion of test statistics.
These results underscore how correcting the test statistics using a single inflation factor may be inappropriate, possibly reducing power and not sufficiently controlling for false positives. To further demonstrate this point, we ran a simple simulation using the variance component model on which EMMAX is based. Although simulating data under this model puts our method at an advantage, and the approach is therefore less suited for comparison to other models, it does demonstrate that under some circumstances uniformly deflating P values may be inappropriate. We randomly simulated 100 sets of phenotypes solely from the sample structure with no SNP effects and examined the quantile-quantile plots across different methods. Although the inflation for most of the SNPs is corrected by genomic control as expected, we observed substantial fluctuations of the test statistics at the tail of the distribution (Supplementary Fig. 4a,b).
More than 25% of the phenotypes showed inflation or deflation beyond the 95% confidence interval. This is because the SNPs with higher per-marker inflation are not sufficiently corrected by the constant genomic control inflation factor. In contrast, EMMAX results in P values close to the expected distribution (Supplementary Fig. 4c).
The finding that marker-specific inflation factors vary substantially across the genome has notable implications for meta-analyses and multistage analyses. Such studies typically combine the test statistics after correcting for potential inflation using genomic control30,31,38. The disadvantages of using the same global correction rather than a marker-specific one can become more serious when this step is done repeatedly. To better understand these effects in the context of meta-analysis, we first compared the marker-specific inflation factors between the two WTCCC control groups, collected from essentially the same population. We observed a very strong correlation (r = 0.95; Supplementary Fig. 5a). We further compared the inflation factors across different populations and different genotyping platforms using the NFBC66 samples and WTCCC control samples. We observed a strong correlation of r = 0.70 (Supplementary Fig. 5b), suggesting that the marker-specific inflation factors can be correlated across the multiple data sets used in meta-analysis or multistage analysis owing to the shared genetic history. If this is the case, the standard approach that corrects with genomic control before merging the P values from different studies may lead to further inaccuracies: tests at some markers would be excessively, or not sufficiently, deflated multiple times, resulting in an accumulation of errors.
We report here the development of the EMMAX program, taking an expedited mixed linear model approach to correct for sample structure within human GWASs. We demonstrate its effectiveness with the analysis of two human GWAS data sets, including quantitative as well as disease traits. The proposed approach differs substantially from genomic control in that it accounts for inflation owing to population structure in a marker-specific manner, resulting in a modified ranking of association results. Accounting for marker-specific effects can reduce both false positives and false negatives. We discuss this issue in more detail in the Supplementary Note.
There are several other methods that take into account pedigree-based or empirically estimated kinship matrices in the statistical test39-42. One of the key differences between these methods and the mixed model methods, including EMMAX, is that the mixed model methods have a procedure for estimating the contribution of the kinship matrix to the phenotypes, whereas the other methods do not. Estimating the phenotypic variance contributed by the sample structure enabled us to avoid undercorrection or overcorrection of the sample structure in the NFBC66 and WTCCC data sets.
The effective application of our method depends on an appropriate estimate of the variance parameters. The IBS or Balding-Nichols matrix43 appears to be better than IBD estimates at capturing the long-distance relationships that result in variations at the population level. However, when the structure of the sample at hand is better described in terms of fairly recent hidden relatedness, methods based on the estimation of IBD may have an advantage. In principle, our approach is also suitable for association testing in a data set including individuals from a heterogeneous population with admixed background. In such cases, it is important to consider SNP ascertainment bias in estimating the degree of relatedness between individuals. Because many SNP probes in genotyping arrays are selected from European populations, the marker-based pairwise distance between two individuals may appear to be larger between unrelated European samples than between unrelated individuals from other populations. To resolve the resulting ascertainment bias, each SNP may be differently weighted when the IBS similarity matrix is computed. A general framework has been presented19 for computing the similarity matrix with a different weight for each marker. Different weighting schemes can also be used to account for heterogeneous distribution of effect size from each marker or each genomic region.
Besides the choice of the kinship matrix, the estimation of variance parameters is also a crucial part of the EMMAX approach. In our analysis of the NFBC data, we show that estimating these parameters under the null hypothesis does not lead to appreciable bias in the association P values. The example of rheumatoid arthritis and type 1 diabetes in the WTCCC data set, in contrast, reveals the difficulties encountered by EMMAX when there are SNPs explaining a large fraction of phenotypic variance. In such cases, we show that estimating variance parameters conditionally on the SNPs with stronger effects alleviates these concerns.
Finally, whereas the analysis presented here relies on decomposing the variance into two terms, a genetic relatedness component and a component representing residual effects, future studies may need to account for additional variance components to more precisely model the heterogeneous phenotypic variance. In expression quantitative trait loci mapping, for example, one may want to add additional variance components to account for technical bias44. When multiple variance components are involved, one would need to make use of algorithms such as PROC MIXED implemented in SAS, as EMMA is developed for two variance components only; this would increase the running time of the first step of our procedure. However, because the same variance components estimated from the null hypothesis would be used across the genome-wide markers, the overall computational time should still be acceptable.
Methods and any associated references are available in the online version of the paper at http://www.nature.com/naturegenetics/.
We thank the NFBC66 team for access to phenotype and genotype data used in the analyses presented here. The genotype data were generated at the Broad Institute with support from National Heart, Lung, and Blood Institute grant 6R01HL087679-03. We thank D. Clayton for reading through the manuscript and for providing important suggestions. We acknowledge the WTCCC for allowing us to use their data set. H.M.K., N.A.Z., J.H.S. and E.E. are supported by National Science Foundation grants 0513612, 0731455 and 0729049, and National Institutes of Health (NIH) grants 1K25HL080079 and U01-DA024417. N.A.Z. is supported by the Microsoft Research Fellowship. H.M.K. is supported by the Samsung Scholarship, National Human Genome Research Institute grant HG00521401, National Institute for Mental Health grant NH084698 and GlaxoSmithKline. C.S. is partially supported by NIH grants GM053275-14, HL087679-01, P30 1MH083268, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. N.B.F. and S.K.S. are supported by NIH grants HL087679-03, 5PL1NS062410-03, 5UL1DE019580-03 and 5RL1MH083268-03. This research was supported in part by the University of California, Los Angeles subcontract of contract N01-ES-45530 from the National Toxicology Program and National Institute of Environmental Health Sciences to Perlegen Sciences.
AUTHOR CONTRIBUTIONS H.M.K., J.H.S., C.S. and E.E. designed the methods and experiments; H.M.K., J.H.S., S.K.S., S.-y.K., N.B.F., C.S. and E.E. jointly analyzed the NFBC66 data set; H.M.K., J.H.S., N.A.Z., C.S. and E.E. jointly analyzed the WTCCC data set; H.M.K., J.H.S., S.K.S., N.B.F., C.S. and E.E. wrote the manuscript; all authors contributed their critical reviews of the manuscript during its preparation.
Note: Supplementary information is available on the Nature Genetics website.
COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.