|Home | About | Journals | Submit | Contact Us | Français|
Single nucleotide polymorphisms (SNPs) discovered by genome-wide association studies (GWASs) account for only a small fraction of the genetic variation of complex traits in human populations. Where is the remaining heritability? We estimated the proportion of variance for human height explained by 294,831 SNPs genotyped on 3,925 unrelated individuals using a linear model analysis, and validated the estimation method by simulations based upon the observed genotype data. We show that 45% of variance can be explained by considering all SNPs simultaneously. Thus, most of the heritability is not missing but has not previously been detected because the individual effects are too small to pass stringent significance tests. We provide evidence that the remaining heritability is due to incomplete linkage disequilibrium (LD) between causal variants and genotyped SNPs, exacerbated by causal variants having lower minor allele frequency (MAF) than the SNPs explored to date.
Genome-wide association studies in human populations have discovered hundreds of SNPs significantly associated with complex traits1,2, yet for any one trait they typically account for only a small fraction of the genetic variation. Where is the missing heritability, the so called dark matter of the genome3,4? Suggested explanations include the existence of gene-by-gene or gene-by-environment interactions5, the common disease-rare variant hypothesis6 and the possibility that inherited epigenetic factors cause resemblance between relatives7,8. However, the variance explained by the validated SNPs is usually much less than the narrow-sense heritability, the proportion of phenotypic variance due to additive genetic variance. Non-additive genetic effects do not contribute to the narrow-sense heritability, so explanations based on non-additive effects are not relevant to the problem of missing heritability (Supplementary Note). There are two explanations for the failure of validated SNP associations to explain the estimated heritability: either the causal variants each explain such a small amount of variation that their effects fail to reach stringent significance thresholds and/or the causal variants are not in complete linkage disequilibrium (LD) with the SNPs that have been genotyped. Lack of complete LD might, for instance, occur if causal variants have lower minor allele frequency (MAF) than genotyped SNPs. Here we test these two hypotheses and estimate the contribution of each to the heritability of height in humans as a model complex trait.
Height in humans is a classical quantitative trait, easy to measure and studied for well over a century as a model for investigating the genetic basis of complex traits9,10. The heritability of height has been estimated to be ~0.8 (refs. 9,11–13). Rare mutations that cause extreme short or tall stature have been found14,15, but these do not explain much of the variation in the general population. Recent GWASs on tens of thousands of individuals have detected ~50 variants that are associated with height in the population, but these in total account for only ~5% of phenotypic variance16–19.
Data from a GWAS that are collected to detect statistical associations between SNPs and complex traits are usually analysed by testing each SNP individually for an association with the trait. To account for the large number of significance tests carried out, a very stringent P value is used. This reduces the occurrence of false positives, but may cause many real associations to be missed, especially if individual SNPs have a small effect on the trait. An alternative approach designed to overcome this problem is to fit all the SNPs simultaneously20. The effects of the SNPs are treated statistically as random, and the variance explained by all the SNPs together is estimated. This approach, which we use here, does not attempt to test the significance of individual SNPs but provides an unbiased estimate of the variance explained by the SNPs in total.
From a number of GWASs, we selected 4,259 individuals who were not knowingly related to one another and confirmed this with SNP data. We then estimated their pairwise genetic relationships using all autosomal markers (MAF ≥ 0.01), and retained 3,925 individuals (3,248 adults and 677 16-year-olds) whose pairwise relationship was estimated at less than 0.025 (maximum relatedness approximately corresponding to cousins two to three times removed: Supplementary Fig. 1). We fitted a linear model to the height data and used restricted maximum likelihood (REML)21 to estimate the variance explained by the SNPs. (In Online Methods, we show how this can be conveniently implemented with a mathematically equivalent model that uses the SNPs to calculate the genomic relationship between pairs of subjects). Using this approach, we estimated the proportion of phenotypic variance explained by the SNPs as 0.45 (s.e. = 0.08, Table 1), a nearly tenfold increase relative to the 5% explained by published and validated individual SNPs.
Our estimate of 45% is still less than the 80% of phenotypic variance due to additive genetic effects (that is, the estimated heritability). One reason why the SNPs do not explain the full estimated heritability is that the SNPs on the arrays are not in complete LD with causal variants. The ability of the SNPs to explain the phenotypic variation caused by causal variants depends on the LD between all the causal variants and all the SNPs. Lack of complete LD is manifested as a difference between the genomic relationship between each pair of subjects j and k at the causal variants (Gjk) and the relationship between the same individuals calculated from the SNPs (Ajk). As causal variants are unknown, we cannot estimate their LD with observed SNPs directly. However, we can mimic it by considering the LD of the genotyped SNPs with one another. It is likely that the causal variants and the SNPs have different properties, so LD among SNPs is only a guide to LD between causal variants and SNPs. One way in which the causal variants may differ from the SNPs is in MAF. To investigate how the difference between Gjk and Ajk depends on the number of SNPs used and the MAF of the causal variants, we randomly sampled five sets of SNPs (50K, 100K, …, 250K, where K = 1,000) in the adult dataset and ten sets of SNPs in the adolescent dataset (50K, 100K, .., 500K). For each SNP set, we randomly split the SNPs into two groups, the first representing SNPs and the second representing causal variants, and estimated genetic relationships using all of the SNPs in the first group (Ajk) and using SNPs with MAF ≤ θ in the second group (proxy for Gjk), where θ = 0.1, 0.2, 0.3, 0.4 or 0.5. We calibrated the prediction error by calculating the regression of Gjk on Ajk. We established empirically that the regression coefficient (Fig. 1), where N is the number of SNPs used to calculate Ajk and the term in c depends on the MAF of the causal variants (Online Methods). If the causal loci that have the same spectrum of allele frequency as the genotyped SNPs (θ = 0.5), then c = 0, and 1/N can be interpreted as the sampling error for estimating the relationship over the whole genome from N random SNPs. The parameter c is > 0 if θ < 0.5 because the relationship at causal variants with low MAF is typically less than the average relationship over the whole genome.
Therefore, given the number of SNPs used, we can correct the estimate of the variance explained by the SNPs for incomplete LD with causal variants, if causal variants have the same allelic frequency spectrum as genotyped SNPs. Using the same linear model as above, but corrected for this incomplete LD (c = 0), we estimated the proportion of variance explained by causal variants to be 0.54 (s.e. = 0.10, Table 1). This estimate assumes that the LD between SNPs and causal variants is as strong as that between the genotyped SNPs. However, if the causal polymorphisms tend to have lower MAF than the SNPs that have been assayed, as expected from neutral and selection theories of quantitative genetic variation6,22, we expect the LD between SNPs and causal variants to be reduced. When we used SNPs with a MAF < 0.1 as proxies for causal variants we found c = 6.2 × 10−6. Using this value of c to correct for incomplete LD, we estimated the proportion of variance in height explained by causal variants to be 0.84 (s.e. = 0.16; Supplementary Table 1). Although the standard error is high, this result is consistent with causal variants being, on average, at lower frequency than the SNPs used on commercial arrays and therefore in less LD with these SNPs than the LD of the SNPs with other SNPs. This does not prove that the causal variants have MAF < 0.1, but it shows that if this were the case, they could explain the estimated heritability of height (~0.8).
If our procedure for correcting for incomplete LD between SNPs and causal variants is correct, the variance explained by the causal variants should not depend on the number of SNPs used. To show that this is so, we randomly sampled 10%, 20%, …, and 100% of all the ~295K SNPs and estimated the variance explained by causal variants for each group of SNPs using both raw and adjusted estimates of relationships (assuming c = 0; Fig. 2). For the raw estimates of relationships, the proportion of variance explained increases with the number of SNPs because prediction error is reduced through inclusion of more SNPs. When the relationship estimates are adjusted for prediction error, the proportions of variance explained are independent from the number of SNPs and agree with an estimate of ~0.54 but have larger s.e. when fewer SNPs are used.
In addition, 1,318 of the 3,925 individuals were genotyped with ~516K SNPs, so we estimated relationships among these individuals (641 adults and 677 16-year-olds) with 516,345 SNPs and estimated the remaining pairwise relationships with 294,831 SNPs. We adjusted the two parts of the relationship matrix according to the number of SNPs used (assuming c = 0). The resulting estimate of proportion of variance explained by causal variants is no different from that using all the individuals with ~295K SNPs (Table 1).
We used simulation studies to validate the method of estimating the variance explained by causal variants using genome-wide SNPs. We simulated a quantitative trait on the basis of the observed genotype data of 3,925 individuals and 294,831 SNPs in two ways: (i) randomly sampling causal variants from all the SNPs, and (ii) randomly sampling causal variants from the SNPs with MAF ≤ 0.1 (Supplementary Note). Table 2 shows that in case (i), if we included the causal variants in estimating the genetic relationships, we obtained an unbiased estimate of the proportion of phenotypic variance explained by the causal variants (in this case this is the heritability of the trait, because in a simulation we know that these causal variants explain all the genetic variance). When we excluded the causal variants, we underestimated heritability, as the relationship derived from SNPs overestimated the variation of the relationship at causal loci owing to imperfect LD. However, the heritability estimate recovered when we adjusted relationship estimates using equation  (Online Methods; c = 0). In case (ii), even if we included the causal variants in the analysis, we still underestimated heritability, because the causal variants have lower frequency than the SNPs, on average, and have less LD with the SNPs than the SNPs have with other SNPs. Similarly, when we adjusted the relationship estimates with equation  (c = 6.2 × 10−6), we obtained unbiased estimates of h2. These results are consistent with the inference we draw from the empirical data. The results show that the estimate of variance caused by causal variants is unbiased regardless of the number of SNPs used, provided the method proposed here is employed.
Highly significant and well-replicated SNPs identified to date explain only ~5% of the phenotypic variance for height19. Our results show that common SNPs in total explain another ~40% of phenotypic variance. Hence, 88% (40/45) of the variation due to SNPs has been undetected in published GWASs because the effects of the SNPs are too small to be statistically significant. Our results also suggest that the discrepancy between 80% heritability and 45% accounted for by all SNPs is due to incomplete LD between causal variants and the SNPs, possibly because the causal variants have a lower MAF on average than the SNPs typed on the array. We cannot tell from these results whether or not some of this discrepancy is due to causal variants with very low frequency – for example, MAF < 0.001 (ref. 4). However, the results show that the total genetic variance could be explained by causal variants similar to the SNPs, with MAF < 0.1. If causal variants affecting height had no effect on fitness, they would show a complete range of MAF but with a higher proportion at low MAF than the SNPs on commercial arrays. If variants affecting height are subject to selection for either allele, there will still be a spectrum of MAF, but with an even greater proportion at low MAF. Thus, we do not conclude that all causal variants have MAF <0.1, but that the spectrum of MAF at causal variants is more concentrated at low values than it is for the SNPs used as markers.
The power to detect individual SNPs as significantly associated with a trait such as height depends on the variance associated with the SNP. This, in turn, depends on the LD between the SNP and the causal variant, the effect of the causal variant and its frequency. Causal variants with small effects or rare alleles with large effects (including rare Mendelian variants) will explain little variance and so will tend not to be significant even if they are in high LD with an assayed SNP. However, the cumulative effect of these SNPs will be included as part of the 45% of phenotypic variance explained by the SNPs in our analysis. Despite the use of ~295K SNPs, many causal variants, especially if they have low MAF, will not be in perfect LD with the assayed SNPs. This reduces the power of a conventional GWAS to detect them and reduces the variance estimated for the SNPs collectively in our study. The results imply that most causal variants explain such a small proportion of the variance that many causal variants affecting height must exist. The results of published GWASs are consistent with this finding, as high test statistics are distributed over much of the genome16.
Could our results be biased because of ascertainment in the data, data analysis or interpretation? We carefully adjusted phenotypes for systematic differences and applied thorough quality control to the SNP data (Online Methods). We show by principal component analysis (PCA) of African, Asian and European populations that all of our samples are of European ancestry (Supplementary Fig. 2a,b). We demonstrate further by PCA of European populations only that our samples show close relationship to the UK population and do not show an apparent cline across Europe (Supplementary Fig. 2c,d). We performed REML analysis by fitting the first two, four and ten eigenvectors from the European-only PCA as covariates. The results show little to no systematic difference in the estimates of the variance explained by fitting up to ten eigenvectors (Supplementary Table 1). Furthermore, we performed single-SNP association analysis between 1,286 ancestry-informative markers (AIMs) and height, and did not detect a significant inflation of the test statistic for these AIMs (Supplementary Fig. 3; P = 0.219). All these results suggest that our estimate of variance explained by all SNPs is unlikely to be biased by population stratification. A subtle form of stratification in GWASs might occur because subjects are distantly related. We excluded any subjects with a relationship to another subject > 0.025. If distant pedigree relationships were an important cause of the estimated relationships, then all chromosomes of a pair of subjects should reflect this relationship. We found no correlation between relatedness estimated from different chromosomes (Supplementary Table 2). Thus, the relationships we estimate from SNPs are driven by LD among the SNPs. It is the same LD that causes a SNP that is not a causal variant to show an association with a trait such as height. In other words, our estimate of the variance explained by the SNPs is based on the same phenomenon as the SNP associations reported from GWASs (LD between SNPs and causal variants). However, we accumulate the variance explained by all SNPs and so are not limited by the need for individual SNPs to pass stringent significance tests.
We also verified that the estimates of variance explained by the SNPs are not driven by a few outlier individuals that are similar in height and in SNP genotypes (Fig. 3). We regressed the squared difference in height between each pair of individuals on the estimate of their relatedness. The intercept and slope are estimates of twice the phenotypic variance and minus twice the additive genetic variance explained by the SNPs, respectively23, so the estimate of variance explained by the SNPs from this regression analysis is ~0.51. The signal on the slope of the regression line comes from many points and is not due to a few outliers. Note that our maximum likelihood estimate is more accurate than this regression analysis; we show the latter only to illustrate the robustness of the estimate. In addition, we performed REML analysis using subsets of individuals by randomly splitting the whole sample into two and four groups and by sampling 1,000, 2,000 and 3,000 individuals with replacement for four replicates (Supplementary Fig. 4). The average estimates of variance explained by all SNPs are not affected by sample size, but, as expected, the sampling error increases as sample size decreases.
Heritability is the proportion of phenotypic variation due to additive genetic factors24; we therefore fitted a model in which SNPs have additive effects. Non-additive genetic variation and variation due to gene-environment interactions may exist, but they are not part of the missing heritability because they do not contribute to the heritability. Epigenetic mutations may cause resemblance between relatives and contribute to heritability if stably inherited, but in that case they would be equivalent to DNA sequence variants, would show LD with the assayed SNPs and would not contribute to missing heritability25.
The method we have presented could be misinterpreted as a method for estimating the heritability of height. Actually, we estimate the variance in height explained by the SNPs. We show that these SNPs do explain over half the estimated heritability of height and that the missing proportion can be explained by incomplete LD between the SNPs and causal variants.
If other complex traits in humans, including common diseases, have genetic architecture similar to that of height, then our results imply that larger GWASs will be needed to find individual SNPs that are significantly associated with these traits, because the variance typically explained by each SNP is so small. Even then, some of the genetic variance of the trait will be undetected because the genotyped SNPs are not in perfect LD with the causal variants. Deep resequencing studies are likely to uncover more polymorphisms, including causal variants that will be represented on future genotyping arrays. Our data provide strong evidence that the variation contributed by many of these causal variants is likely to be small and that very large sample sizes will be required to show that their individual effects are statistically significant. A similar conclusion was drawn recently for schizophrenia26. In some cases the small variance will be due to a large effect for a rare allele, but this will still require a large sample size to reach significance. Genome-wide approaches like those used in our study can advance our understanding of the nature of complex-trait variation and can be exploited for selection programs in agriculture27 and individual risk prediction in humans28.
In a GWAS of a quantitative trait, we test for associations between individual SNPs and the trait by the following simple regression model,
where yj is the phenotypic value of the j-th individual; μ is the mean term; ai is the allele substitution effect of SNP i; xij is an indicator variable that takes value of 0, 1 or 2 if the genotype of the j-th individual at SNP i is bb, Bb or BB (alleles are arbitrarily called B or b), respectively; and ej is the residual effect, ., with σe2 being the residual variance.
Supposing that we could genotype subjects at the causal variants, we can include them all in the model
where gi is the total genetic effect of an individual j; m is the number of causal loci; ui is the scaled additive effect of the i-th causal variant; zij takes value of or if the genotype of the j-th individual at locus i is qq, Qq or QQ, respectively, with fi the frequency of Q allele at locus i (alleles are arbitrarily called Q or q)20,29; E(zij)= 0 and var(zij)= 1. In matrix notation, y=μ1+ g+ e and g= Zu. We treat u as random effects and assume , with σu2 being the variance of causal effects; then , where is variance of total additive genetic effects, and the variance-covariance matrix of y (the vector of observations) can be expressed as
where G is the genetic relationship matrix between pairs of individuals at causal loci. This equation shows the equivalence between the classical definition of heritability ( ) with σP2 being the phenotypic variance, and the proportion of phenotypic variance explained by the causal variants altogether.
In practice, we know little about the number and positions of the causal variants, so we are unable to obtain the G matrix directly. However, we can calculate the relationship from a genome-wide sample of SNPs (A) using the same formula as for G. That is.
where N is the number of SNPs and with pi the allele frequency at SNP i. This formula for A ignores the sampling error associated with each SNP. We can improve the estimate of A by calculating a weighted average across SNPs. For a SNP i, when j ≠ k (individuals j and k), ; in other words, it is the same for all SNPs regardless of allele frequency. When j = k, ; in other words, it is dependent on the allele frequency of the SNP. We therefore use the following equation to calculate Aij,
which provides an unbiased estimate of inbreeding coefficient (F) with mean of 1 + F, and has sampling variance of 1 when F = 0.
To obtain a genome-wide relationship, we combine Aijk for all of the SNPs using a common-sense weighting scheme,
Estimates of relationships are always relative to an arbitrary base population in which the average relationship is zero. We use the individuals in the sample as the base so that the average relationship between all pairs of individuals is zero and the average relationship of an individual with him/herself is 1.
If we knew the genotypes at the causal variants, we could fit model  and estimate the genetic variance . Instead we will use a modified version (A*) of the relationship matrix based on the SNPs A. Although we will use REML to estimate , the requirements of A* to obtain an unbiased estimate of are more easily understood for the method illustrated in Fig. 3. In this method for each pair of subjects is regressed on Gjk. The slope of this regression is . If we replace Gjk by an estimate such that then , and the regression of on is still , so the estimate of remains the same −b/2. To obtain an unbiased estimate of Gjk with the required property, we use linear regression of Gjk on Ajk. We cannot calculate G, so instead we use one set of SNPs to mimic causal variants using the following steps:
If the relationship at causal loci is predicted without error by the observed SNPs, β should equal one. When we applied this approach in our data, we found that for any of the MAF threshold θ, var(Ajk) is proportional to N whereas cov(Gjk, Ajk) is constant, irrespective of N (Supplementary Fig. 5). Consequently, we established an empirical linear relationship between β and the number of SNPs,
where c is constant for a certain MAF threshold θ —for example c = 6.2 × 10−6 when θ = 0.1 and c = 0 when θ = 0.5 (Fig. 1). The regression coefficient β is less than 1.0 because of two effects. First, the term in 1/N is due to the sampling error in estimating A from only N SNPs. This corresponds to the sampling error for Aijk at a single SNP calculated above as 1. If c = 0 and N = infinite, β = 1. In this case Ajk is the genomic relationship averaged over all positions in the genome. As the causal variants are a sample of such positions, Ajk is an unbiased estimated of Gjk. Second, the term in c occurs because the causal variants are not a random sample of all SNPs but a sample with low MAF. This causes the causal variants to have lower LD with the SNPs than random SNPs do with one another. Thus, even if Ajk was calculated from an infinite number of SNPs, it would still tend to overestimate the variance in relationships at the causal variants and consequently underestimate the genetic variance. We therefore adjust Ajk as
with the property of unbiasedness in the sense that .
Height measurements, self-reported or clinically measured, from 35,189 Australian adults and 2,036 Australian adolescents (around 16-years-old) were collected by the Queensland Institute of Medical Research. Of these individuals, 8,884 adults and 1,668 adolescents have been genotyped using Illumina SNP chips in several genome-wide association studies. All the samples were collected with informed consent and appropriate ethical approval. The adult samples were genotyped by HumanCNV370-Quad v3.0 BeadChips (~351K SNPs) or Human610-Quad v1.0 BeadChips (~582K SNPs), and the adolescent samples were all genotyped by Human610-Quad v1.0 BeadChips.
We included only the genotyped individuals of European descent, as verified by ancestry analysis using genome-wide SNP data.30,31 We selected a set of 3,535 “unrelated” adults (1,421 males and 2,114 females; from 18 to 91 years old, with mean of 45) and 724 “unrelated” 16-year-old adolescents (354 males and 370 females), for a combined dataset of 4,259 “unrelated” individuals according to the pedigree information.
We excluded SNPs in each individual dataset that had a mean GenCall Score < 0.7, missingness > 5%, a minor allele frequency (MAF) < 0.01 or a Hardy-Weinberg equilibrium (HWE) test P-value < 10−6, using PLINK32. A total of 304,013 SNPs in the adult dataset and 529,379 SNPs in the adolescent dataset passed this process, but only those in the autosomes were included in the analysis (295,400 SNPs for the adult dataset, 516,345 SNPs for the adolescent dataset and an intersect of 294,831 SNPs for the combined dataset).
We estimated the genetic relationships among all of the 4,259 individuals in the combined dataset by equation . The estimated relationships (off-diagonal elements of the relationship matrix) ranged from −0.024 to 0.585, suggesting that some close relatives still remained. The mean of genetic relationships of “unrelated” individuals should be close to zero, so the lower-bound of the range can be roughly regarded as the maximum deviation of an estimate from the mean. We estimated the two-tailed 95% confidence interval of relationships (adjusted for multiple tests by Bonferroni correction) to be from about −0.027 to 0.027. Therefore, to avoid having any close relatives in the data, we chose a cut-off value of 0.025 and selectively excluded one of any pair of individuals with an estimated relationship > 0.025 to maximize the remaining sample size. We excluded 287 individuals from the adult dataset and 47 individuals from the adolescent dataset. A total of 3,248 “unrelated” adults and 677 “unrelated” adolescents, with a combined dataset of 3,925 “unrelated” individuals, was retained for analysis.
The phenotypes were corrected for age and sex, and standardized to z-scores in each adult and adolescent dataset separately. We used a two-tailed 90% Winsorisation33 to adjust the z-scores of four individuals in the adult dataset with absolute values greater than 4.17, the (100 – 5/3248)th percentile of the standard normal distribution based on Bonferroni correction, and combined the z-scores in both adult and adolescent datasets for the combined dataset of height (Supplementary Fig. 1e).
We are grateful to the twins and their families for their generous participation in these studies. We would like to thank staff at the Queensland Institute of Medical Research: Dixie Statham, Ann Eldridge and Marlene Grace for sample collection, Megan Campbell, Lisa Bowdler, Steven Crooks and staff of the Molecular Epidemiology Laboratory for sample processing and preparation, Belinda Cornes for height data preparation, David Smyth and Harry Beeby for IT support, and Allan McRae and Hong Lee for discussions. We thank Naomi Wray for helpful comments on the manuscript. We acknowledge funding from the Australian National Health and Medical Research Council (NHMRC grants 241944, 389875, 389891, 389892, 389938, 442915, 442981, 496739, 496688 and 552485), the U.S. National Institute of Health (grants AA07535, AA10248, AA014041, AA13320, AA13321, AA13326 and DA12854) and the Australian Research Council (ARC grant DP0770096).
AUTHOR CONTRIBUTIONSP.M.V. and M.E.G. designed the study. J.Y. performed statistical analyses. B.B., B.P.M., A.K.H., D.R.N. and S.G. performed quality control analyses and prepared data. D.R.N., P.A.M., A.C.H. and N.G.M. contributed genotype and phenotype data. J.Y., G.W.M., M.E.G. and P.M.V. contributed to writing the paper.