Highly significant and well-replicated SNPs identified to date explain only ~5% of the phenotypic variance for height (ref. 19). Our results show that common SNPs in total explain another ~40% of phenotypic variance. Hence, 88% (40/45) of the variation due to SNPs has been undetected in published GWASs because the effects of the SNPs are too small to be statistically significant. Our results also suggest that the discrepancy between 80% heritability and 45% accounted for by all SNPs is due to incomplete LD between causal variants and the SNPs, possibly because the causal variants have a lower MAF on average than the SNPs typed on the array. We cannot tell from these results whether or not some of this discrepancy is due to causal variants with very low frequency, for example MAF < 0.001 (ref. 4). However, the results show that the total genetic variance could be explained by causal variants similar to the SNPs, with MAF < 0.1. If causal variants affecting height had no effect on fitness, they would show a complete range of MAF but with a higher proportion at low MAF than the SNPs on commercial arrays. If variants affecting height are subject to selection for either allele, there will still be a spectrum of MAF, but with an even greater proportion at low MAF. Thus, we do not conclude that all causal variants have MAF < 0.1, but that the spectrum of MAF at causal variants is more concentrated at low values than it is for the SNPs used as markers.
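The link between a MAF mismatch and incomplete LD can be illustrated with a short calculation. The sketch below uses the standard upper bound on r^2 between two biallelic loci given their minor allele frequencies; the function name `max_r2` and the example frequencies are hypothetical and are not taken from the analysis reported here.

```python
def max_r2(maf_causal: float, maf_snp: float) -> float:
    """Upper bound on r^2 between two biallelic loci, given their minor allele frequencies."""
    for maf in (maf_causal, maf_snp):
        if not 0.0 < maf <= 0.5:
            raise ValueError("minor allele frequencies must lie in (0, 0.5]")
    lo, hi = sorted((maf_causal, maf_snp))
    # When the rarer minor allele always sits on haplotypes carrying the other
    # minor allele, D reaches lo * (1 - hi), so
    # r^2_max = lo * (1 - hi) / (hi * (1 - lo)).
    return lo * (1.0 - hi) / (hi * (1.0 - lo))


if __name__ == "__main__":
    snp_maf = 0.3  # hypothetical array SNP
    for causal_maf in (0.3, 0.1, 0.05, 0.01):  # hypothetical causal variants
        bound = max_r2(causal_maf, snp_maf)
        print(f"causal MAF {causal_maf}: max r^2 with a MAF-{snp_maf} SNP = {bound:.3f}")
```

The bound equals 1 only when the two frequencies match and shrinks as the causal variant becomes rarer than the marker, which is the mechanism proposed above for the gap between 45% and 80%.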
The power to detect individual SNPs as significantly associated with a trait such as height depends on the variance associated with the SNP. This, in turn, depends on the LD between the SNP and the causal variant, the effect of the causal variant and its frequency. Causal variants with small effects or rare alleles with large effects (including rare Mendelian variants) will explain little variance and so will tend not to be significant even if they are in high LD with an assayed SNP. However, the cumulative effect of these SNPs will be included as part of the 45% of phenotypic variance explained by the SNPs in our analysis. Despite the use of ~295K SNPs, many causal variants, especially if they have low MAF, will not be in perfect LD with the assayed SNPs. This reduces the power of a conventional GWAS to detect them and reduces the variance estimated for the SNPs collectively in our study. The results imply that most causal variants explain such a small proportion of the variance that many causal variants affecting height must exist. The results of published GWASs are consistent with this finding, as high test statistics are distributed over much of the genome (ref. 16).
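The dependence of power on LD, effect size and allele frequency can be made concrete with a rough calculation. The sketch below is an approximation under stated assumptions, not the procedure used in this study: it assumes an additive effect `beta` in phenotypic standard deviation units, Hardy-Weinberg proportions, a two-sided 1-df test, and a significance threshold of 5e-8. The function names, sample sizes and effect sizes are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def snp_variance_explained(beta: float, maf: float, r2: float) -> float:
    """Fraction of phenotypic variance captured by a marker SNP.

    beta: additive effect of the causal allele in phenotypic s.d. units (assumed).
    maf:  frequency of the causal allele.
    r2:   squared LD correlation between the causal variant and the marker.
    """
    return r2 * 2.0 * maf * (1.0 - maf) * beta ** 2


def gwas_power(n: int, q2: float, alpha: float = 5e-8) -> float:
    """Approximate power of a two-sided 1-df test for a SNP explaining fraction q2."""
    ncp_root = sqrt(n * q2 / (1.0 - q2))  # square root of the non-centrality parameter
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper = 1.0 - _STD_NORMAL.cdf(z_crit - ncp_root)
    lower = _STD_NORMAL.cdf(-z_crit - ncp_root)
    return upper + lower


if __name__ == "__main__":
    # Hypothetical variants: a common allele of small effect and a rare allele of large effect.
    common = snp_variance_explained(beta=0.05, maf=0.3, r2=0.8)
    rare = snp_variance_explained(beta=0.5, maf=0.002, r2=0.8)
    for n in (4_000, 40_000, 400_000):  # hypothetical sample sizes
        print(n, f"{gwas_power(n, common):.3g}", f"{gwas_power(n, rare):.3g}")
```

Both hypothetical variants explain a similarly small fraction of the variance, so under these assumptions both need a similarly large sample to reach significance, whichever way the small variance arises.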
Could our results be biased because of ascertainment in the data, data analysis or interpretation? We carefully adjusted phenotypes for systematic differences and applied thorough quality control to the SNP data (Online Methods). We show by principal component analysis (PCA) of African, Asian and European populations that all of our samples are of European ancestry (Supplementary Fig. 2a,b). We demonstrate further by PCA of European populations only that our samples are closely related to the UK population and do not show an apparent cline across Europe (Supplementary Fig. 2c,d). We performed REML analysis by fitting the first two, four and ten eigenvectors from the European-only PCA as covariates. The results show little to no systematic difference in the estimates of the variance explained by fitting up to ten eigenvectors (Supplementary Table 1). Furthermore, we performed single-SNP association analysis between 1,286 ancestry-informative markers (AIMs) and height, and did not detect a significant inflation of the test statistic for these AIMs (Supplementary Fig. 3; P = 0.219). All these results suggest that our estimate of variance explained by all SNPs is unlikely to be biased by population stratification. A subtle form of stratification in GWASs might occur because subjects are distantly related. We excluded any subject whose estimated relationship to another subject exceeded 0.025. If distant pedigree relationships were an important cause of the estimated relationships, then all chromosomes of a pair of subjects should reflect this relationship. We found no correlation between relatedness estimated from different chromosomes (Supplementary Table 2). Thus, the relationships we estimate from SNPs are driven by LD among the SNPs. It is the same LD that causes a SNP that is not a causal variant to show an association with a trait such as height. In other words, our estimate of the variance explained by the SNPs is based on the same phenomenon as the SNP associations reported from GWASs (LD between SNPs and causal variants). However, we accumulate the variance explained by all SNPs and so are not limited by the need for individual SNPs to pass stringent significance tests.
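The chromosome-level check described above can be sketched as follows. This is an illustration only: it assumes a commonly used SNP-based relationship estimator (the mean over SNPs of standardized genotype cross-products), whereas the exact estimator and adjustments used in this study are those given in the Online Methods. The function names and the simulated data are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt


def relationship(geno_j, geno_k, freqs):
    """Mean over SNPs of (x_j - 2p)(x_k - 2p) / (2p(1 - p)), with allele counts x in {0, 1, 2}."""
    terms = [
        (xj - 2.0 * p) * (xk - 2.0 * p) / (2.0 * p * (1.0 - p))
        for xj, xk, p in zip(geno_j, geno_k, freqs)
    ]
    return sum(terms) / len(terms)


def pairwise_relationships(genotypes, freqs):
    """Relationship estimate for every pair of individuals (rows of `genotypes`)."""
    return [
        relationship(genotypes[j], genotypes[k], freqs)
        for j, k in combinations(range(len(genotypes)), 2)
    ]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    rng = random.Random(1)
    n_individuals, n_snps = 30, 200  # hypothetical sizes

    def simulate(freqs):
        # Unrelated individuals: two independent allele draws per SNP.
        return [[sum(rng.random() < p for _ in range(2)) for p in freqs]
                for _ in range(n_individuals)]

    freqs_a = [rng.uniform(0.1, 0.5) for _ in range(n_snps)]  # "chromosome A"
    freqs_b = [rng.uniform(0.1, 0.5) for _ in range(n_snps)]  # "chromosome B"
    rel_a = pairwise_relationships(simulate(freqs_a), freqs_a)
    rel_b = pairwise_relationships(simulate(freqs_b), freqs_b)
    print("cross-chromosome correlation of relatedness:", pearson(rel_a, rel_b))
```

For genuinely unrelated individuals the two sets of estimates are independent, so the correlation should be close to zero; a clearly positive value would point to shared pedigree relationships.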
We also verified that the estimates of variance explained by the SNPs are not driven by a few outlier individuals that are similar in height and in SNP genotypes (Fig. 3). We regressed the squared difference in height between each pair of individuals on the estimate of their relatedness. The intercept and slope are estimates of twice the phenotypic variance and minus twice the additive genetic variance explained by the SNPs, respectively (ref. 23), so the estimate of variance explained by the SNPs from this regression analysis is ~0.51. The signal on the slope of the regression line comes from many points and is not due to a few outliers. Note that our maximum likelihood estimate is more accurate than this regression analysis; we show the latter only to illustrate the robustness of the estimate. In addition, we performed REML analysis using subsets of individuals by randomly splitting the whole sample into two and four groups and by sampling 1,000, 2,000 and 3,000 individuals with replacement for four replicates (Supplementary Fig. 4). The average estimates of variance explained by all SNPs are not affected by sample size, but, as expected, the sampling error increases as sample size decreases.
Figure 3. All pairwise comparisons contribute to the estimate of genetic variance. Shown are the squared z-score differences between individuals plotted against the adjusted estimates of genetic relationship. The blue line is the linear regression line.
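The regression of squared phenotypic differences on relatedness described above can be sketched in a few lines. This is a minimal illustration of that regression, not the REML analysis on which the main estimate rests; the function names are hypothetical, and the inputs are assumed to be standardized phenotypes and a symmetric matrix of pairwise relationship estimates.

```python
from itertools import combinations


def ols(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pair_data(phenotypes, rel_matrix):
    """Relatedness and squared phenotype difference for every pair j < k."""
    pairs = list(combinations(range(len(phenotypes)), 2))
    relatedness = [rel_matrix[j][k] for j, k in pairs]
    squared_diffs = [(phenotypes[j] - phenotypes[k]) ** 2 for j, k in pairs]
    return relatedness, squared_diffs


def variance_from_pairs(relatedness, squared_diffs):
    """Return (phenotypic variance, SNP-explained variance, their ratio).

    The intercept estimates twice the phenotypic variance and the slope
    estimates minus twice the additive variance explained by the SNPs.
    """
    intercept, slope = ols(relatedness, squared_diffs)
    var_p = intercept / 2.0
    var_g = -slope / 2.0
    return var_p, var_g, var_g / var_p
```

Because every pair of individuals contributes one point to the fit, the slope reflects the whole cloud of pairwise comparisons rather than a handful of extreme pairs.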
Heritability is the proportion of phenotypic variation due to additive genetic factors (ref. 24); we therefore fitted a model in which SNPs have additive effects. Non-additive genetic variation and variation due to gene-environment interactions may exist, but they are not part of the missing heritability because they do not contribute to the heritability. Epigenetic mutations may cause resemblance between relatives and contribute to heritability if stably inherited, but in that case they would be equivalent to DNA sequence variants, would show LD with the assayed SNPs and would not contribute to missing heritability (ref. 25).
The method we have presented could be misinterpreted as a method for estimating the heritability of height. Actually, we estimate the variance in height explained by the SNPs. We show that these SNPs do explain over half the estimated heritability of height and that the missing proportion can be explained by incomplete LD between the SNPs and causal variants.
If other complex traits in humans, including common diseases, have genetic architecture similar to that of height, then our results imply that larger GWASs will be needed to find individual SNPs that are significantly associated with these traits, because the variance typically explained by each SNP is so small. Even then, some of the genetic variance of the trait will be undetected because the genotyped SNPs are not in perfect LD with the causal variants. Deep resequencing studies are likely to uncover more polymorphisms, including causal variants that will be represented on future genotyping arrays. Our data provide strong evidence that the variation contributed by many of these causal variants is likely to be small and that very large sample sizes will be required to show that their individual effects are statistically significant. A similar conclusion was drawn recently for schizophrenia (ref. 26). In some cases the small variance will be due to a large effect for a rare allele, but this will still require a large sample size to reach significance. Genome-wide approaches like those used in our study can advance our understanding of the nature of complex-trait variation and can be exploited for selection programs in agriculture (ref. 27) and individual risk prediction in humans (ref. 28).