

Hum Genet. Author manuscript; available in PMC 2013 October 1.

Published in final edited form as:

Published online 2012 July 21. doi: 10.1007/s00439-012-1199-6

PMCID: PMC3432754

NIHMSID: NIHMS395516

Address correspondence to: Noah Zaitlen, Program in Molecular and Genetic Epidemiology, Harvard School of Public Health, 665 Huntington Avenue - Building 2 Room 209, Boston, MA 02115, phone: 617-432-6848, fax: 617-432-1722, Email: nzaitlen@hsph.harvard.edu

Heritability, the fraction of phenotypic variation explained by genetic variation, has been estimated for many phenotypes in a range of populations, organisms, and time points. The recent development of efficient genotyping and sequencing technology has led researchers to attempt to identify the genetic variants responsible for the genetic component of phenotype directly via GWAS. The gap between the phenotypic variance explained by GWAS results and that estimated from classical heritability methods has been termed the “missing heritability problem”. In this work, we examine modern methods for estimating heritability, which use the genotype and sequence data directly. We discuss them in the context of classical heritability methods and the missing heritability problem, and describe their implications for understanding the genetic architecture of complex phenotypes.

Since their debut in 2005, genome-wide association studies (GWAS) have identified thousands of single nucleotide polymorphisms (SNPs) associated with hundreds of different phenotypes (Hindorff et al. 2009). Despite this success, the total fraction of the phenotypic variation explained for most phenotypes remains small relative to the published heritability estimates, which are estimated using the trait covariance among relatives (Eichler et al. 2010; Maher 2008; Manolio et al. 2009). This “missing heritability problem” raises questions about the methods used to estimate heritability as well as the genetic architecture of complex phenotypes.

Many explanations for the sources of missing heritability have been proposed including structural variations, gene-environment interactions, epistatic interactions, parent of origin effects, and errors in narrow-sense heritability estimates (Eichler et al. 2010; Manolio et al. 2009; Zuk et al. 2012). Of particular interest is the distribution of causal variants along the genome, their number, and their frequency spectrum. GWAS are particularly suited to capture common variants and so violation of the common disease common variant model may lead to missing heritability. In Fisher’s infinitesimal model there are expected to be a large number of rare variants associated with disease. The rare-allele model proposes that rare variants of large effect account for a significant fraction of phenotypic variation, and it has been proposed that these can give rise to synthetic association in common variants (Dickson et al. 2010; Gibson 2011).

Determining which combination of these hypotheses is correct and where the majority of phenotypic variation lies has significant implications for the future success of association studies as well as the clinical utility of genetic risk prediction. It is possible to decouple some of these proposed genetic architectures without directly identifying the causal variants themselves. For example, Wray et al. show the potential for comparing heritability and sibling relative risk estimates to determine the validity of a rare-variant model (Gibson 2011; Wray and Goddard 2010; Wray et al. 2011).

Recently, Yang et al. (2010) proposed using linear mixed models (LMM) to estimate a lower bound on the total narrow-sense heritability from GWAS data, as well as to determine how much of the phenotypic variation is due to SNPs in LD with those on genotyping platforms. The results of this approach have broad implications for the genetic architecture of phenotypes as well as the future success of GWAS.

In this work we examine the problem of heritability estimation in the GWAS era and how it relates to the missing heritability problem. We briefly review the classical methods of heritability estimation and contrast them with the relatively recent use of genotype data to estimate the component of heritability explained by common SNPs via the LMM approach. We discuss the relative merits of the different methods in terms of potential confounding factors as well as what they tell us about the distribution of causal variants and the potential returns of future GWAS. Finally, we discuss the prospects for using LMM to predict human traits, including disease risk.

Heritability is a measure of the contribution of genetics to phenotype. Wright and Fisher formalized the concept by writing phenotypic variance as the sum of genetic variance and environmental variance, ${\sigma}_{P}^{2}={\sigma}_{G}^{2}+{\sigma}_{\epsilon}^{2}$. Broad-sense heritability *H*^{2} is the ratio of total genetic variance to phenotypic variance, ${H}^{2}=\frac{{\sigma}_{G}^{2}}{{\sigma}_{P}^{2}}$. This measure includes the effects of gene-gene interactions (epistatic effects) ${\sigma}_{I}^{2}$, dominance effects ${\sigma}_{D}^{2}$, and additive effects ${\sigma}_{g}^{2}$, such that ${\sigma}_{G}^{2}={\sigma}_{g}^{2}+{\sigma}_{D}^{2}+{\sigma}_{I}^{2}$. Narrow-sense heritability *h*^{2} measures just the additive contribution of genetic variation to phenotype, ${h}^{2}=\frac{{\sigma}_{g}^{2}}{{\sigma}_{P}^{2}}$ (Falconer 1989; Lynch and Walsh 1998).

In this work we discuss estimates of narrow-sense heritability *h*^{2} unless stated otherwise, because our focus is on GWAS and the missing heritability problem. Most traditional estimates of heritability using the correlations among related individuals are presumed to estimate *h*^{2}, although these estimates can be biased. For example, the classical estimate involving the regression of offspring trait values on the mean parental values does not include the dominance component of variance, but the epistatic component does contribute to the estimate. The epistatic component is typically (and perhaps incorrectly) assumed to be 0 for identifiability purposes (Falconer 1989; Zuk et al. 2012). GWAS estimates of individual-marker effect sizes are generally measured marginally, ignoring dominance and interaction effects, so the “bottom up” heritability estimates from GWAS (defined below) are narrow-sense estimates.

In a GWAS, we are given a set of *N _{s}* SNPs

In an additive model, the phenotype of each individual is defined by a sum of linear effects

$${y}_{j}=m+\sum _{i\in C}{z}_{ij}{\alpha}_{i}+{\epsilon}_{j}$$

[Eqn. 1]

where ${z}_{ij}=\frac{{g}_{ij}-2{p}_{i}}{\sqrt{2{p}_{i}(1-{p}_{i})}}$ are the normalized genotypes, *α _{i}* is the effect size of SNP

The genetic variance in an additive model is computed as the sum of the squared effect sizes of the normalized genotypes, ${\sigma}_{g}^{2}={\displaystyle \sum _{i}}{\alpha}_{i}^{2}$, and the heritability is the ratio of the genetic variance to the total phenotypic variance, ${h}^{2}=\frac{{\sigma}_{g}^{2}}{{\sigma}_{g}^{2}+{\sigma}_{\epsilon}^{2}}={\sigma}_{g}^{2}$, where ${\sigma}_{\epsilon}^{2}$ is the environmental contribution to phenotype and ${\sigma}_{g}^{2}+{\sigma}_{\epsilon}^{2}={\sigma}_{Y}^{2}=1$.
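As a minimal sketch of the additive model in Eqn. 1, the following Python simulates phenotypes from normalized genotypes and checks that the phenotypic variance is close to 1 when ${\sigma}_{g}^{2}+{\sigma}_{\epsilon}^{2}=1$. The function names, the sampling of independent SNPs (no LD) in Hardy-Weinberg equilibrium, and the chosen allele frequencies and effect sizes are illustrative assumptions, not part of the original analysis.

```python
import math
import random


def additive_h2(alphas):
    """Genetic variance (equal to h^2 when var(y) = 1): sum of squared effect sizes."""
    return sum(a * a for a in alphas)


def normalize(g, p):
    """Normalized genotype z = (g - 2p) / sqrt(2p(1-p)) for allele count g."""
    return (g - 2.0 * p) / math.sqrt(2.0 * p * (1.0 - p))


def simulate_phenotypes(n, freqs, alphas, seed=0):
    """Draw n phenotypes y = sum_i z_i * alpha_i + eps, with mean m = 0.

    Assumes independent causal SNPs in Hardy-Weinberg equilibrium and
    var(eps) = 1 - sigma_g^2, so that the total phenotypic variance is 1.
    """
    rng = random.Random(seed)
    sigma_e = math.sqrt(max(0.0, 1.0 - additive_h2(alphas)))
    phenotypes = []
    for _ in range(n):
        genetic = 0.0
        for p, a in zip(freqs, alphas):
            g = (rng.random() < p) + (rng.random() < p)  # allele count 0, 1, or 2
            genetic += normalize(g, p) * a
        phenotypes.append(genetic + rng.gauss(0.0, sigma_e))
    return phenotypes


def variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


# Ten hypothetical causal SNPs, each explaining 5% of the variance, so h^2 = 0.5.
freqs = [0.1, 0.2, 0.3, 0.4, 0.5] * 2
alphas = [math.sqrt(0.05)] * 10
y = simulate_phenotypes(5000, freqs, alphas)
print(additive_h2(alphas), variance(y))
```

Because the genotypes are normalized to unit variance, each SNP contributes exactly ${\alpha}_{i}^{2}$ to the genetic variance regardless of its allele frequency.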

Given a GWAS one can compute an estimate of the genetic variance ${\widehat{\sigma}}_{g}^{2}$ by using the effect size estimates from the markers with a pre-specified genome-wide significance level. This can be used to compute an estimate of the heritability ${h}_{\mathit{GWAS}}^{2}=\frac{{\widehat{\sigma}}_{g}^{2}}{{\sigma}_{Y}^{2}}$, which is defined as “bottom up” heritability estimation by Zuk et al. (2012).
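The bottom-up calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a published procedure: the effect sizes are taken to be per normalized genotype, the selected markers are treated as independent, the function name `bottom_up_h2` is hypothetical, and the default threshold of 5e-8 is only the conventional genome-wide level.

```python
def bottom_up_h2(effect_sizes, p_values, threshold=5e-8, var_y=1.0):
    """Sum squared effect estimates over markers passing the significance threshold.

    effect_sizes -- estimated effects on the normalized-genotype scale
    p_values     -- single-marker association p-values, in the same order
    var_y        -- phenotypic variance (1.0 for a standardized phenotype)
    """
    sigma_g2_hat = sum(
        beta * beta
        for beta, p in zip(effect_sizes, p_values)
        if p < threshold
    )
    return sigma_g2_hat / var_y


# Three hypothetical markers; only the first and third reach significance.
betas = [0.10, 0.20, 0.05]
p_vals = [1e-9, 1e-3, 1e-10]
h2_gwas = bottom_up_h2(betas, p_vals)
print(h2_gwas)
```

Note that the second marker has the largest estimated effect but does not contribute, which is one way real additive variance goes uncounted in this estimate.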

Unfortunately, the full set of causal variants and their effect sizes are not known, so ${h}_{\mathit{GWAS}}^{2}$ will typically underestimate the total heritability. (The winner’s curse (Ioannidis 2007, 2008; Kraft 2008) and the inclusion of false-positive markers in the bottom-up estimate of genetic variance could in principle lead to an overestimate of heritability.) The difference between *h*^{2} and ${h}_{\mathit{GWAS}}^{2}$ is known as the “missing heritability”. It is the additive genetic variance not yet captured with GWAS or other methods of identifying associated variants.

The classical methods of heritability estimation are based on an intuitive concept. Phenotypes that are highly correlated amongst relatives in patterns consistent with Mendelian inheritance are more heritable than those that are weakly correlated amongst relatives. The formalization of this idea by Fisher (1918) and Wright (1921) is the foundation of heritability estimation.

Consider the correlation between the phenotype of two individuals in the additive model above:

$$\text{cor}({y}_{j},{y}_{k})=cov({y}_{j},{y}_{k})=cov\left(\sum _{i\in C}{z}_{ij}{\alpha}_{i},\sum _{i\in C}{z}_{ik}{\alpha}_{i}\right)=\frac{{\sigma}_{g}^{2}}{{N}_{C}}\sum _{i\in C}\frac{cov({z}_{ij},{z}_{ik})}{var({z}_{i})}={\sigma}_{g}^{2}{K}_{\mathit{Causal}}[j,k].$$

*K _{Causal}* is the genetic covariance matrix (Kang et al. 2010; Price et al. 2006; Yang et al. 2010) defined at the causal SNPs. The entry for element

$${K}_{\mathit{Causal},jk}=\frac{1}{{N}_{C}}\sum _{i\in C}\frac{({g}_{ij}-2{\widehat{p}}_{i})({g}_{ik}-2{\widehat{p}}_{i})}{2{\widehat{p}}_{i}(1-{\widehat{p}}_{i})}.$$
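The formula above translates directly into code. The sketch below computes the full matrix from a table of allele counts in plain Python; the function name, the input layout, and the choice to skip monomorphic SNPs (whose normalization is undefined) are assumptions made for illustration. Applied to causal SNPs it gives *K _{Causal}*, and the same computation applies to any other SNP set.

```python
def genetic_covariance(genotypes):
    """Return K with K[j][k] = (1/N) * sum_i (g_ij - 2p_i)(g_ik - 2p_i) / (2p_i(1 - p_i)).

    genotypes[j][i] is the allele count (0, 1, or 2) of individual j at SNP i.
    Allele frequencies p_i are estimated from the sample. Monomorphic SNPs are
    skipped, and N counts only the SNPs actually used.
    """
    n_ind = len(genotypes)
    n_snps = len(genotypes[0])
    K = [[0.0] * n_ind for _ in range(n_ind)]
    used = 0
    for i in range(n_snps):
        p_hat = sum(row[i] for row in genotypes) / (2.0 * n_ind)
        denom = 2.0 * p_hat * (1.0 - p_hat)
        if denom == 0.0:
            continue
        used += 1
        centered = [row[i] - 2.0 * p_hat for row in genotypes]
        for j in range(n_ind):
            for k in range(n_ind):
                K[j][k] += centered[j] * centered[k] / denom
    if used == 0:
        raise ValueError("no polymorphic SNPs")
    return [[value / used for value in row] for row in K]


# Three hypothetical individuals typed at two SNPs.
K = genetic_covariance([[0, 2], [2, 0], [1, 1]])
for row in K:
    print(row)
```

In this toy example the first two individuals carry opposite homozygous genotypes at both SNPs, so their off-diagonal entry is negative, while the double heterozygote sits at the sample mean and has zero covariance with everyone.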

Until recently, the genotypes of individuals were unavailable and even now the set of causal variants is unknown, so alternative means of estimating *K _{Causal}* are required. The classical and still widely used approach is to collect sets of related individuals from known pedigrees. The estimate of

It is worth stressing that the entries in *K _{Ped}* are the sums of the

When multiple classes of relationship are measured, as is the case in extended pedigrees, one can take advantage of all the relationships simultaneously via a linear mixed model (Lange 2002; Shaw 1987), where the 1×*N*_{subjects} phenotype vector Y is distributed as a multivariate normal random variable with mean M and variance-covariance matrix Σ. The mean vector M captures the fixed effects of observed covariates (e.g. sex, age, or principal components of genetic variation). The variance-covariance matrix is:

$$\Sigma =var\left(\sum _{i\in C}{z}_{i}{\alpha}_{i}\right)+var(\epsilon )={K}_{\mathit{Causal}}{\sigma}_{g}^{2}+I{\sigma}_{\epsilon}^{2}.$$

To estimate heritability via a linear mixed model the restricted maximum likelihood (REML) estimate of ${\widehat{\sigma}}_{g}^{2}$ is computed and the heritability estimate is ${\widehat{h}}^{2}=\frac{{\widehat{\sigma}}_{g}^{2}}{{\sigma}_{Y}^{2}}$. REML is used to estimate the components of variance instead of maximum likelihood to avoid a bias introduced by the fixed effects (Shaw 1987). Since *K _{Causal}* is not known

There are many extensions to this linear mixed model approach that allow estimation of different components of heritability. These include dominance effects (Lynch and Walsh 1998), gene-gene interaction (Yang et al. 2011a), the shared genetic basis of multiple phenotypes (cross heritability) (Boehnke et al. 1986; Deary et al. 2012; Lange and Boehnke 1983; Macgregor et al. 2006; Price et al. 2011), heritability from different genomic regions (Yang et al. 2011b), and the effects of shared environment (Lynch and Walsh 1998).
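For reference, the REML criterion used in the basic single-component model can be written out explicitly. The following is the standard restricted log-likelihood for a Gaussian linear mixed model, stated here as background rather than taken from the original text. The design matrix $X$ and coefficient vector $b$ (so that the mean is $M=Xb$), and the treatment of $Y$ as a column vector, are notational assumptions introduced for this sketch, and $K$ stands for whichever estimate of the genetic covariance matrix is used:

$$\ell_R({\sigma}_{g}^{2},{\sigma}_{\epsilon}^{2})=-\frac{1}{2}\left[\log|\Sigma|+\log|X^{T}\Sigma^{-1}X|+(Y-X\widehat{b})^{T}\Sigma^{-1}(Y-X\widehat{b})\right]+\text{const},\qquad \widehat{b}=(X^{T}\Sigma^{-1}X)^{-1}X^{T}\Sigma^{-1}Y,$$

with $\Sigma=K{\sigma}_{g}^{2}+I{\sigma}_{\epsilon}^{2}$. The term $\log|X^{T}\Sigma^{-1}X|$, which is absent from the ordinary log-likelihood, accounts for the degrees of freedom spent on estimating the fixed effects.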

Top down heritability estimates are susceptible to a range of confounding factors, which can bias estimates. These include gene-environment correlations, selection, non-random mating, and inbreeding (Lynch and Walsh 1998; Visscher et al. 2008). Recently, Zuk et al. (2012) showed that certain types of epistatic interactions can inflate estimates of narrow-sense heritability.

The availability of genotype data over large collections of individuals has opened up new approaches to estimating heritability. These methods apply the same linear mixed model method described above, but replace the *K _{Ped}* estimate of

When genetic data are collected over the set of individuals in the study it is possible to estimate the total fraction of the genome shared identical by descent (IBD). Siblings for example do not share exactly 50% of their genome with each other (Visscher et al. 2006). Using the genetic data to estimate the fraction of genome shared IBD gives another means of estimating *K _{Causal}*, which we call

To illustrate this approach, Visscher et al. (2007) used the software package Merlin (Abecasis et al. 2002) to estimate IBD for a collection of twins to generate *K _{IBD}* from 791 autosomal markers and estimate several components of the heritability of height.

Recently, linear mixed models (LMMs) have been applied to GWAS data in an attempt to partition the “missing” heritability into variants tagged by GWAS SNPs (mostly common) and those that are not (mostly rare) (Yang et al. 2010). This use of the LMM links modern statistical approaches for high-dimensional data analysis (penalized regression) with classical models in statistical genetics (de los Campos et al. 2010).

This LMM approach uses the same REML-based estimate of ${\widehat{\sigma}}_{g}^{2}$ given above, but the matrix used is an empirical estimate of the genetic covariance (*K _{GCV}*) instead of

This approach—which we refer to as the Yang-Visscher or LMM-*K _{GCV}* approach—relies on the equivalence between the LMM,

$${y}_{j}=\alpha +{g}_{j}+{\epsilon}_{j},$$

with cov(*g _{j}*,

$${y}_{j}=\alpha +\sum _{i\in S}{\beta}_{i}{z}_{ij}+{\epsilon}_{j},$$

with the *β _{i}* i.i.d. N(0,

Note that hidden relatedness between individuals would bias *σ*_{g}^{2} since untagged causal variants would still tend to have the same correlation structure (related to *K _{IBD}*) as the causal variants that are tagged thereby inflating the estimate of the portion of variability explained by the measured SNPs. Moreover, epistatic effects may also confound estimates of the additive genetic component

The LMM just described is a special case of a general class of regression models defined by any similarity matrix *K* calculated from the GWAS data, with cov(*g _{j}*,

One of the advantages of this LMM approach using *K _{GCV}* is that individuals may be selected randomly with respect to their environmental exposures preventing confounding from shared environments that can affect pedigree-based estimates. In addition, they can inform researchers about the potential success of future GWAS conducted on the phenotype of interest. The

The application of *K _{GCV}* to heritability estimation was proposed by Hayes et al. (2009) in the context of related individuals. In this case,

Each of the heritability estimation methods described above makes different assumptions about the model generating phenotype. The estimates of heritability may be biased when these assumptions are violated.

While pedigree-based estimates of heritability have been examined for decades, the Yang-Visscher approach is a very recent development and there are many open questions about the factors that can affect these estimates of heritability. Here we give several examples of such factors and perform some simple experiments to examine their effects. These are in no way meant to be exhaustive or conclusive, but rather to inform the reader of potential issues.

Zuk et al. (2012) show that when certain types of epistatic (gene-gene) interactions exist, the estimates of heritability found from pedigree estimates, such as MZ versus DZ twins, will be upwardly biased. In this situation, bottom-up estimates will never reach the top-down estimate of heritability. They propose that this is a possible element of the “missing heritability problem”, and that the true narrow-sense heritability may be substantially lower than current estimates for certain phenotypes (Zuk et al. 2012).

To examine this problem in the context of Yang-Visscher heritability estimates we simulated data sets using the epistatic “limiting pathway” models of Zuk et al. (2012), LP(1), LP(3), and LP(4). We simulated case-control genotypes and phenotypes of 2000 randomly ascertained unrelated individuals with 200 causal variants in each pathway, an effect size of 0.1, a minor allele frequency of 0.5, and prevalence of 50%. We computed a bottom-up adjusted *h*^{2} estimate via linear regression as well as a Yang-Visscher estimate of heritability, using all causal variants to estimate *K _{GCV}*. The results are shown in Table 1 and demonstrate that the Yang-Visscher approach is not susceptible to confounding from epistatic interaction under the LP model of interaction. If closely related individuals were used then the Yang-Visscher estimate would be upwardly biased from the epistatic component of variance.

Table 1. Yang-Visscher and bottom-up estimates of heritability (and their standard error over 1000 replications) under three limiting pathway models of phenotype. For K>1 the pedigree-based top-down estimates of heritability will be inflated. An LP(4) ...

Thus, the LMM estimates of heritability from unrelated individuals provide a benchmark to assess how much of the total narrow-sense heritability currently-known GWAS-identified trait markers explain—a benchmark that is not influenced by “phantom heritability” due to epistatic interactions. The ratio of the bottom-up additive genetic variance estimated using GWAS-identified markers to the LMM estimate of the additive genetic variance estimates the proportion of GWAS-identifiable markers that have been identified to date.

The Yang-Visscher approach assumes a polygenic model of disease in which many markers of small effect contribute to variance in genetic risk. Specifically, it assumes marker effect sizes are all drawn from the same normal distribution, $\beta \sim N(0,{\sigma}_{g}^{2}/{N}_{s})$. There are, however, many diseases where there are outlier markers with strikingly different effects. For example, GWAS have identified dozens of markers associated with Type 1 Diabetes and rheumatoid arthritis, most of which have very small effects relative to the long-established risk variants in the MHC; for both of these diseases, the variants in the MHC have per-allele relative risks roughly three times larger than the relative risks for the GWAS-identified risk variants (Barrett et al. 2009; Stahl et al. 2010).
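The normal effect-size distribution assumed above, with variance ${\sigma}_{g}^{2}/{N}_{s}$ per SNP, is what ties the SNP-level model to the genetic covariance matrix. A short derivation makes this explicit. As a sketch, it assumes that *K _{GCV}* is computed over the ${N}_{s}$ genotyped SNPs $S$ in the same way as *K _{Causal}* is computed over the causal SNPs, that the normalized genotypes ${z}_{ij}$ are treated as fixed, and that the effects ${\beta}_{i}$ are independent with mean zero:

$$cov({g}_{j},{g}_{k})=cov\left(\sum _{i\in S}{\beta}_{i}{z}_{ij},\sum _{i\in S}{\beta}_{i}{z}_{ik}\right)=\sum _{i\in S}{z}_{ij}{z}_{ik}\,var({\beta}_{i})=\frac{{\sigma}_{g}^{2}}{{N}_{s}}\sum _{i\in S}{z}_{ij}{z}_{ik}={\sigma}_{g}^{2}{K}_{\mathit{GCV}}[j,k].$$

Under these assumptions, a random-effects model on the SNPs and a model with a single polygenic term ${g}_{j}$ whose covariance is ${\sigma}_{g}^{2}{K}_{\mathit{GCV}}$ describe the same distribution of phenotypes. An outlier marker of large effect violates the common-variance assumption in the third step.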

To examine the effect of such extreme variants we simulated 1000 GWAS of 1500 individuals with a single causal variant. The genotypes at 1,000 marker loci (including the causal locus) were generated by random binomials with minor allele frequencies drawn uniformly between 0.05 and 0.5. The true heritability of the phenotype was 0.5 and the average estimate over the 1000 GWAS was 0.50, suggesting that violations of the infinitesimal model do not strongly affect estimates of heritability.

For many phenotypes *K _{GCV}* will contain a large number of variants unlinked to any causal variants. To examine the effect of these variants on the estimates of heritability we repeated the experiment above with 10 causal variants and 10

To investigate the precision of LMM estimates of *h*^{2} using *K _{GCV}* in real-world situations, we used GWAS data on 10,503 individuals from two European-ancestry cohorts, the Nurses’ Health Study and Health Professionals Follow-up Study. We simulated continuous phenotypes as a function of 500 SNPs, according to Eqn. 1, constraining the SNP effects so that the resulting phenotype had the desired heritability (

Results from single replicates are shown in Table 2. Precision increases roughly linearly with increasing log sample size. For sample sizes under 2,000, the 95% confidence intervals are wide (>0.40), and, for modest heritabilities (under 25%, consistent with the observed heritabilities for many complex traits), they include 0. This suggests that accurate estimation of narrow-sense heritabilities will require large sample sizes, on the order of 5,000 to 10,000 or more, at least as big as those needed to identify individual markers with modest effects. Published studies using the LMM-*K _{GCV}* approach to estimate the narrow-sense heritability due to GWAS markers for continuous traits like height and body mass index used between 4,000 and 11,500 subjects (Yang et al. 2010; Yang et al. 2011b). Care must be taken when combining studies to reach such large sample sizes, as this may introduce population substructure and corresponding environmental variation of non-genetic risk factors, potentially biasing estimates of heritability.

The additive model assumes that all of the tested variants are independent. In reality, there is extensive LD between causal and non-causal variants in the genome. To examine the potential for LD to affect heritability estimates we repeated the experiment above with 4 causal variants, 1 additional causal variant repeated 100 times to simulate extensive LD for a particular SNP, and 10^{4} non-causal variants. The true heritability was 0.5 and the average estimated heritability was 0.40, showing that LD patterns can significantly affect heritability estimates. We note that this is an extreme example meant to demonstrate the potential for bias. Yang et al. simulated phenotypes over real GWAS data (i.e. with real LD patterns) and found estimates within two standard errors of the true heritability (Yang et al. 2010).

Provided that the individuals in a GWAS are unrelated, the matrix *K _{GCV}* contains no information about SNPs out of LD with the genotyped SNPs. If the study contains related individuals, however, the LMM estimate of heritability will contain some additional genetic variance due to variants not tagged by the GWAS SNPs. This is because

We simulated 1000 pairs of individuals that shared 0.5, 0.1, 0.05, or 0.025 of their genome IBD and computed K_{GCV} for each pair. We repeated this experiment using 10^{4}, 10^{5}, and 10^{6} SNPs. The results are presented in Table 3. In each case the mean estimate of IBD is close to the true IBD, showing that K_{GCV} is a good estimate of K_{IBD}. The standard error is independent of the true IBD and decreases as a function of the number of independent SNPs.

Table 3. The genetic covariance between pairs of individuals with a range of IBDs, estimated from *N _{s}* SNPs. GCV is an unbiased estimate of IBD, and the variance of the estimate (in parentheses) is a function of the number of available SNPs.
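A much smaller version of this kind of experiment can be sketched in Python to show how the spread of the genetic covariance estimate shrinks as SNPs are added. This is not the simulation used for Table 3. It rests on simplifying assumptions that are labeled in the code: SNPs are independent, allele frequencies are known and drawn uniformly from [0.05, 0.5], and IBD sharing is modeled by copying each allele of the first individual to the second with probability `ibd`. All function names are hypothetical.

```python
import math
import random


def simulate_pair_gcv(ibd, n_snps, rng):
    """Genetic covariance estimate for one pair sharing a fraction `ibd` of alleles.

    Simplifying assumptions: independent SNPs, known allele frequencies, and
    each allele of the second individual is a copy of the first individual's
    allele with probability `ibd` (otherwise it is drawn from the population).
    """
    total = 0.0
    for _ in range(n_snps):
        p = rng.uniform(0.05, 0.5)
        first = [rng.random() < p, rng.random() < p]
        second = [a if rng.random() < ibd else (rng.random() < p) for a in first]
        g_j, g_k = sum(first), sum(second)
        total += (g_j - 2 * p) * (g_k - 2 * p) / (2 * p * (1 - p))
    return total / n_snps


def gcv_mean_and_sd(ibd, n_snps, n_pairs=100, seed=0):
    """Mean and standard deviation of the estimate across simulated pairs."""
    rng = random.Random(seed)
    estimates = [simulate_pair_gcv(ibd, n_snps, rng) for _ in range(n_pairs)]
    mean = sum(estimates) / n_pairs
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n_pairs - 1))
    return mean, sd


# Compare a small and a larger SNP panel for pairs sharing 10% of alleles.
for n_snps in (125, 2000):
    print(n_snps, gcv_mean_and_sd(0.1, n_snps))
```

Under this copying model the expected value of each per-SNP term equals `ibd`, so the average over SNPs is unbiased and its standard deviation falls roughly with the square root of the number of independent SNPs.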

For distantly related individuals, the signal from IBD will typically be small relative to the signal from the causal variants. Here a concern is confounding due to cryptic relatedness, where more closely related individuals tend to have similar trait values for non-genetic reasons (Kang et al. 2010). The influence of low levels of IBD in the Yang-Visscher approach remains an open question. It is possible to test explicitly for inflation due to relatedness, by simulating phenotypes over odd chromosomes and estimating heritability over even chromosomes (Visscher et al. 2010).

Individuals from different populations have different minor allele frequencies as well as different environmental exposures. In a case-control study this can lead to significant confounding if there is a difference in the phenotypic mean between the populations, and is usually corrected with a principal component adjustment. Browning and Browning (2011) show that under certain extreme population differences this can lead to biases in heritability estimates. Yang et al. show that using PC adjustment will mitigate this inflation. They also propose to estimate the effects of population stratification and cryptic relatedness by performing heritability estimation over each chromosome (Yang et al. 2011b). This procedure has not yet been examined in detail in the published literature.

Another type of population stratification arises when there is a difference in the phenotypic variance (but not necessarily mean phenotype) between the populations. In this case PCA will not adequately adjust for population substructure, leading to inflation in standard GWAS (McPeek and Abney, 2008). Furthermore, the interpretation of heritability may be ambiguous in this scenario, since each of the subpopulations will likely have different heritability estimates.

Heritability is defined with respect to a population at a particular time. The heritability of lung cancer will be dramatically different between a population where some people smoke and a population of only non-smokers. Thus bottom-up GWAS heritability estimates and those from published heritability studies can only be compared if they come from the same population and are conducted at similar times.

To date, the Yang-Visscher approach has been performed using observed SNPs, genotyped using the same platform (Yang et al. 2010; Yang et al. 2011b). Given the success of imputation in the GWAS community, one open question is whether external reference panels such as the HapMap can be leveraged to determine if additional signal lies within the additional SNPs genotyped in the panel. High-throughput sequencing data are becoming available and with them a large number of rare variants. The proper way to include dense maps of common markers and rare variants in heritability estimation, notably in light of the discussion of the impact of linkage disequilibrium patterns above, is an area of current research.

For binary traits, the percent of trait variance captured by *K _{GCV}* when analyzing a discontinuous 1-0 case-control phenotype in the LMM framework is not directly comparable to commonly-quoted heritabilities from some family studies (e.g. MZ-DZ twin comparisons), which are measures of the percentage of the underlying

For the linear mixed model, the phenotypic variance captured by *K _{GCV}* depends on disease prevalence and sampling scheme. By construction, the heritability of liability is independent of prevalence. When estimating the heritability of case-control phenotypes, the ascertainment strategy and prevalence of disease will affect the final heritability estimate. To address this issue, it is possible to transform the disease scale heritability estimate to a liability scale heritability estimate, which accounts for both ascertainment and prevalence (Dempster and Lerner 1949; Lee et al. 2011):

$${h}_{\mathit{liability}}^{2}={h}_{\mathit{Obs}}^{2}\frac{F(1-F)}{{\left[\varphi ({\mathrm{\Phi}}^{-1}(F))\right]}^{2}}\frac{F(1-F)}{P(1-P)}.$$

F is the prevalence, *ϕ* is the normal pdf, Φ is the normal cdf, and P is the proportion of cases in the sample. The justification for this elegant adjustment depends on a rather simple model for ascertainment, namely, that selection for inclusion is independent of all other covariates conditional on disease status. This will not be the case in many practical situations (e.g. matched case-control studies), where ascertainment depends on other factors that are usually associated with disease risk and may also be associated with genotype. The impact of violations of this assumption is unclear.
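The transformation is straightforward to compute. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function and argument names are hypothetical, the normal density at the threshold is squared as a whole, and the example inputs are made up for illustration.

```python
from statistics import NormalDist


def liability_scale_h2(h2_obs, prevalence, case_fraction):
    """Convert an observed-scale h^2 from a case-control sample to the liability scale.

    h2_obs        -- heritability estimated on the observed 0/1 scale
    prevalence    -- F, the population prevalence of disease
    case_fraction -- P, the proportion of cases in the sample
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(prevalence)  # Phi^{-1}(F)
    density = std_normal.pdf(threshold)         # phi(Phi^{-1}(F))
    scale = prevalence * (1.0 - prevalence) / density ** 2
    ascertainment = prevalence * (1.0 - prevalence) / (
        case_fraction * (1.0 - case_fraction)
    )
    return h2_obs * scale * ascertainment


# Hypothetical inputs: 1% prevalence and a balanced case-control sample.
print(liability_scale_h2(0.2, prevalence=0.01, case_fraction=0.5))
```

When the sample is not ascertained (P = F) the second factor equals 1 and only the change of scale remains.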

The LMM using *K _{GCV}* also offers a means of phenotypic prediction using the best linear unbiased predictors or BLUPs (Lynch and Walsh 1998). The expected trait value for a new individual (who did not contribute to the data set used to fit the LMM) is given by:

$$\widehat{y}=\alpha +\widehat{g}=\alpha +\sum _{i\in S}{\widehat{\beta}}_{i}{z}_{i}.$$

This is similar to the “polygenic” models proposed by Purcell et al. (2009) and Evans et al. (2009), in that the predictor uses information contained in SNPs that do not reach the genome-wide significance threshold. But where the “polygenic” models perform feature selection, only building predictors using markers with single-SNP (marginal) p-values below some threshold (often much larger than the stringent GWAS threshold), the LMM approach builds predictors using all available SNPs simultaneously. The LMM predictor is closely related to ridge regression, a penalized regression procedure that often outperforms variable selection procedures in terms of minimizing prediction error in new data sets (Harrell 2001; Hastie et al. 2001).
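The connection to ridge regression can be written down directly. As a sketch, let $Z$ denote the $N\times {N}_{s}$ matrix of normalized genotypes in the training sample and $\mathbf{1}$ a vector of ones (both notational assumptions introduced here), and take the SNP effects to be independent $N(0,{\sigma}_{g}^{2}/{N}_{s})$ with independent residuals of variance ${\sigma}_{\epsilon}^{2}$. The BLUP of the SNP effects is then the ridge estimator

$$\widehat{\beta}=(Z^{T}Z+\lambda I)^{-1}Z^{T}(y-\alpha \mathbf{1}),\qquad \lambda =\frac{{N}_{s}{\sigma}_{\epsilon}^{2}}{{\sigma}_{g}^{2}},$$

so the penalty is set by the ratio of the residual variance to the per-SNP genetic variance: the lower the heritability, the more strongly the estimated effects are shrunk toward zero. In practice the variance components are unknown and are replaced by their estimates.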

The accuracy of the LMM predictor is a function of narrow-sense heritability, the number of markers included in the LMM, the true genetic architecture, and the sample size in the data set used to fit the LMM. The sample size determines the accuracy with which *β _{i}* can be estimated. The squared correlation between the LMM predictor and trait values in new observations is typically far smaller than the heritability estimate from the LMM (the theoretical maximum of the squared correlation); this is because of the variability in the estimated

The Yang-Visscher approach to heritability estimation provides a means of estimating the contribution of SNPs in LD with those on genotyping platforms to the total phenotypic variation. In the context of GWAS these estimates answer questions about the genetic architecture of complex phenotypes. The growing number of GWAS identified loci, as well as their small effect sizes, has led to speculation about genetic models of disease.

There has been significant recent debate about the success or failure of GWAS (Eichler et al. 2010; Gibson 2011; Visscher et al. 2012). This has in turn reinvigorated the debate about the distribution of causal variants. Goldstein demonstrated the possibility for rare variants to induce synthetic associations (Dickson et al. 2010), and there have been several recent works discussing the common disease common variant, strong and weak rare variants, the infinitesimal, and other disease models (Gibson 2011).

There has also been speculation about the location of the “missing heritability” with discussions of parent of origin effects, epistatic interactions, gene-environment interactions, structural variation, and other caches of genetic variation not well captured by current GWAS or their analysis methods (Eichler et al. 2010; Visscher et al. 2012; Zuk et al. 2012).

The work of Yang and Visscher discussed here as well as other GWAS-based approaches (Lango Allen et al. 2010; So et al. 2011a; So et al. 2011b; Yang et al. 2011c) provide insights relevant to these questions. They estimate heritability restricted to a certain class of SNPs (i.e. those in LD with genotyped SNPs), are not confounded by many of the factors biasing traditional methods of heritability estimation, and are fundamentally different than bottom up methods. In principle these procedures could also be used to build phenotype prediction algorithms incorporating markers beyond the small number identified at genome-wide significance levels. However, very large sample sizes will be needed to obtain accurate estimates and precise prediction algorithms.

The authors thank Alkes Price, Eli Stahl and Dan Stram for helpful discussions, and Poorva Mudgal for programming support. NZ was supported by NIH fellowship 5T32ES007142-27, PK by NIH grant R21 DK084529.

- Abecasis GR, Cherny SS, Cookson WO, Cardon LR. Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002;30:97–101. [PubMed]
- Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, Erlich HA, Julier C, Morahan G, Nerup J, Nierras C, Plagnol V, Pociot F, Schuilenburg H, Smyth DJ, Stevens H, Todd JA, Walker NM, Rich SS. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat Genet. 2009;41:703–7. [PMC free article] [PubMed]
- Boehnke M, Moll PP, Lange K, Weidman WH, Kottke BA. Univariate and bivariate analyses of cholesterol and triglyceride levels in pedigrees. Am J Med Genet. 1986;23:775–92. [PubMed]
- Browning SR, Browning BL. Population structure can inflate SNP-based heritability estimates. Am J Hum Genet. 2011;89:191–3. author reply 193–5. [PubMed]
- Daetwyler HD, Villanueva B, Woolliams JA. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One. 2008;3:e3395. [PMC free article] [PubMed]
- de los Campos G, Gianola D, Allison DB. Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat Rev Genet. 2010;11:880–6. [PubMed]
- Deary IJ, Yang J, Davies G, Harris SE, Tenesa A, Liewald D, Luciano M, Lopez LM, Gow AJ, Corley J, Redmond P, Fox HC, Rowe SJ, Haggarty P, McNeill G, Goddard ME, Porteous DJ, Whalley LJ, Starr JM, Visscher PM. Genetic contributions to stability and change in intelligence from childhood to old age. Nature. 2012. [PubMed]
- Dempster E, Lerner I. Heritability of threshold characters. Genetics. 1949;35:212–236. [PubMed]
- Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB. Rare variants create synthetic genome-wide associations. PLoS Biol. 2010;8:e1000294. [PMC free article] [PubMed]
- Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010;11:446–50. [PMC free article] [PubMed]
- Evans DM, Visscher PM, Wray NR. Harnessing the information contained within genome-wide association studies to improve individual prediction of complex disease risk. Hum Mol Genet. 2009;18:3525–31. [PubMed]
- Falconer DS. Introduction to quantitative genetics. 3rd ed. Longman Wiley; Burnt Mill, Harlow, Essex, England New York: 1989.
- Fisher R. The correlation among relatives on the supposition of Mendelian inheritance. Trans Roy Soc Edinburgh. 1918;52:399–433.
- Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2011;13:135–45. [PubMed]
- Harrell F. Regression modeling strategies. Springer; New York: 2001.
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer; New York: 2001.
- Hayes BJ, Visscher PM, Goddard ME. Increased accuracy of artificial selection by using the realized relationship matrix. Genet Res (Camb) 2009;91:47–60. [PubMed]
- Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106:9362–7. [PubMed]
- Ioannidis JP. Non-replication and inconsistency in the genome-wide association setting. Hum Hered. 2007;64:203–13. [PubMed]
- Ioannidis JP. Why most discovered true associations are inflated. Epidemiology. 2008;19:640–8. [PubMed]
- Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010;42:348–54. [PMC free article] [PubMed]
- Kraft P. Curses--winner’s and otherwise--in genetic epidemiology. Epidemiology. 2008;19:649–51. discussion 657–8. [PubMed]
- Lange K. Mathematical and statistical methods for genetic analysis. Springer; New York: 2002.
- Lange K, Boehnke M. Extensions to pedigree analysis. IV. Covariance components models for multivariate traits. Am J Med Genet. 1983;14:513–24. [PubMed]
- Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, Willer CJ, Jackson AU, Vedantam S, Raychaudhuri S, Ferreira T, Wood AR, Weyant RJ, Segre AV, Speliotes EK, Wheeler E, Soranzo N, Park JH, Yang J, Gudbjartsson D, Heard-Costa NL, Randall JC, Qi L, Vernon Smith A, Magi R, Pastinen T, Liang L, Heid IM, Luan J, Thorleifsson G, Winkler TW, Goddard ME, Sin Lo K, Palmer C, Workalemahu T, Aulchenko YS, Johansson A, Zillikens MC, Feitosa MF, Esko T, Johnson T, Ketkar S, Kraft P, Mangino M, Prokopenko I, Absher D, Albrecht E, Ernst F, Glazer NL, Hayward C, Hottenga JJ, Jacobs KB, Knowles JW, Kutalik Z, Monda KL, Polasek O, Preuss M, Rayner NW, Robertson NR, Steinthorsdottir V, Tyrer JP, Voight BF, Wiklund F, Xu J, Hua Zhao J, Nyholt DR, Pellikka N, Perola M, Perry JR, Surakka I, Tammesoo ML, Altmaier EL, Amin N, Aspelund T, Bhangale T, Boucher G, Chasman DI, Chen C, Coin L, Cooper MN, Dixon AL, Gibson Q, Grundberg E, Hao K, Juhani Junttila M, Kaplan LM, Kettunen J, Konig IR, Kwan T, Lawrence RW, Levinson DF, Lorentzon M, McKnight B, Morris AP, Muller M, Suh Ngwa J, Purcell S, Rafelt S, Salem RM, Salvi E, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010;467:832–8. [PMC free article] [PubMed]
- Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet. 2011;88:294–305. [PubMed]
- Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D. FaST linear mixed models for genome-wide association studies. Nat Methods. 2011;8:833–5. [PubMed]
- Lynch M, Walsh B. Genetics and analysis of quantitative traits. Sinauer; Sunderland, Mass: 1998.
- Macgregor S, Cornes BK, Martin NG, Visscher PM. Bias, precision and heritability of self-reported and clinically measured height in Australian twins. Hum Genet. 2006;120:571–80. [PubMed]
- Maher B. Personal genomes: The case of the missing heritability. Nature. 2008;456:18–21. [PubMed]
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461:747–53. [PMC free article] [PubMed]
- Park JH, Gail MH, Weinberg CR, Carroll RJ, Chung CC, Wang Z, Chanock SJ, Fraumeni JF, Jr, Chatterjee N. Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proc Natl Acad Sci U S A. 2011;108:18026–31. [PubMed]
- Pharoah PD, Antoniou A, Bobrow M, Zimmern RL, Easton DF, Ponder BA. Polygenic susceptibility to breast cancer and implications for prevention. Nat Genet. 2002;31:33–6. [PubMed]
- Pharoah PD, Antoniou AC, Easton DF, Ponder BA. Polygenes, risk prediction, and targeted prevention of breast cancer. N Engl J Med. 2008;358:2796–803. [PubMed]
- Powell JE, Visscher PM, Goddard ME. Reconciling the analysis of IBD and IBS in complex trait studies. Nat Rev Genet. 2010. [PubMed]
- Price AL, Helgason A, Thorleifsson G, McCarroll SA, Kong A, Stefansson K. Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet. 2011;7:e1001317. [PMC free article] [PubMed]
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–9. [PubMed]
- Purcell SM, Wray NR, Stone JL, Visscher PM, O’Donovan MC, Sullivan PF, Sklar P. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460:748–52. [PubMed]
- Shaw R. Maximum-likelihood approaches applied to quantitative genetics of natural populations. Evolution. 1987;41:812–826.
- So HC, Gui AH, Cherny SS, Sham PC. Evaluating the heritability explained by known susceptibility variants: a survey of ten complex diseases. Genet Epidemiol. 2011a;35:310–7. [PubMed]
- So HC, Li M, Sham PC. Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genet Epidemiol. 2011b;35:447–56. [PubMed]
- Stahl EA, Raychaudhuri S, Remmers EF, Xie G, Eyre S, Thomson BP, Li Y, Kurreeman FA, Zhernakova A, Hinks A, Guiducci C, Chen R, Alfredsson L, Amos CI, Ardlie KG, Barton A, Bowes J, Brouwer E, Burtt NP, Catanese JJ, Coblyn J, Coenen MJ, Costenbader KH, Criswell LA, Crusius JB, Cui J, de Bakker PI, De Jager PL, Ding B, Emery P, Flynn E, Harrison P, Hocking LJ, Huizinga TW, Kastner DL, Ke X, Lee AT, Liu X, Martin P, Morgan AW, Padyukov L, Posthumus MD, Radstake TR, Reid DM, Seielstad M, Seldin MF, Shadick NA, Steer S, Tak PP, Thomson W, van der Helm-van Mil AH, van der Horst-Bruinsma IE, van der Schoot CE, van Riel PL, Weinblatt ME, Wilson AG, Wolbink GJ, Wordsworth BP, Wijmenga C, Karlson EW, Toes RE, de Vries N, Begovich AB, Worthington J, Siminovitch KA, Gregersen PK, Klareskog L, Plenge RM. Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat Genet. 2010;42:508–14. [PubMed]
- Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90:7–24. [PubMed]
- Visscher PM, Hill WG, Wray NR. Heritability in the genomics era--concepts and misconceptions. Nat Rev Genet. 2008;9:255–66. [PubMed]
- Visscher PM, Macgregor S, Benyamin B, Zhu G, Gordon S, Medland S, Hill WG, Hottenga JJ, Willemsen G, Boomsma DI, Liu YZ, Deng HW, Montgomery GW, Martin NG. Genome partitioning of genetic variation for height from 11,214 sibling pairs. Am J Hum Genet. 2007;81:1104–10. [PubMed]
- Visscher PM, Medland SE, Ferreira MA, Morley KI, Zhu G, Cornes BK, Montgomery GW, Martin NG. Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2006;2:e41. [PubMed]
- Visscher PM, Yang J, Goddard ME. A commentary on ‘common SNPs explain a large proportion of the heritability for human height’ by Yang et al. (2010). Twin Res Hum Genet. 2010;13:517–24. [PubMed]
- Wray NR, Goddard ME. Multi-locus models of genetic risk of disease. Genome Med. 2010;2:10. [PMC free article] [PubMed]
- Wray NR, Purcell SM, Visscher PM. Synthetic associations created by rare variants do not explain most GWAS results. PLoS Biol. 2011;9:e1000579. [PMC free article] [PubMed]
- Wright S. Systems of Mating. Genetics. 1921;6:111–78. [PubMed]
- Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42:565–9. [PMC free article] [PubMed]
- Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A Tool for Genome-wide Complex Trait Analysis. Am J Hum Genet. 2011a;88:76–82. [PubMed]
- Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM, de Andrade M, Feenstra B, Feingold E, Hayes MG, Hill WG, Landi MT, Alonso A, Lettre G, Lin P, Ling H, Lowe W, Mathias RA, Melbye M, Pugh E, Cornelis MC, Weir BS, Goddard ME, Visscher PM. Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet. 2011b;43:519–25. [PubMed]
- Yang J, Weedon MN, Purcell S, Lettre G, Estrada K, Willer CJ, Smith AV, Ingelsson E, O’Connell JR, Mangino M, Magi R, Madden PA, Heath AC, Nyholt DR, Martin NG, Montgomery GW, Frayling TM, Hirschhorn JN, McCarthy MI, Goddard ME, Visscher PM. Genomic inflation factors under polygenic inheritance. Eur J Hum Genet. 2011c;19:807–12. [PMC free article] [PubMed]
- Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci U S A. 2012. [PubMed]
