Although recent genome-wide studies have provided valuable insights into the genetic basis of human disease, they have explained relatively little of the heritability of most complex traits, and the variants identified through these studies have small effect sizes. This has led to the important and hotly debated issue of where the ‘missing heritability’ of complex diseases might be found. Here, seven leading geneticists offer their opinion about where this heritability is likely to lie, what this could tell us about the underlying genetic architecture of common diseases and how this could inform research strategies for uncovering genetic risk factors.
The field of genomics has entered a new era in which the ability to identify genetic variants that impact complex human traits and disease in an unbiased fashion using genome-wide approaches is widely accessible. To date, the workhorse of these efforts has been the genome-wide association study (GWAS), which has quickly moved from novel to routine, and has provided key insights into aspects of the underlying allelic architecture of complex traits. The main lesson learned from the early GWAS efforts is that though many disease-associated variants are often discovered, most have only a minor effect on disease, and in total explain only a small amount of the apparent heritability. Here we provide a brief overview of the genetic variation classes that may harbor the heritability missing from GWAS, and touch on approaches that will be leveraged in the coming years as genomics—and by extension medicine—becomes increasingly personalized.
Genetics; genomics; GWAS; liver; complex disease
The common genetic variants identified through genome-wide association studies explain only a small proportion of the genetic risk for complex diseases. The advancement of next-generation sequencing technologies has enabled the detection of rare variants that are expected to contribute significantly to the missing heritability. Some genetic association studies provide multiple correlated traits for analysis. Multiple trait analysis has the potential to improve the power to detect pleiotropic genetic variants that influence multiple traits. We propose a gene-level association test for multiple traits that accounts for correlation among the traits. Gene- or region-level testing for association involves both common and rare variants. Statistical tests for common variants may have limited power for individual rare variants because of their low frequency and multiple testing issues. To address these concerns, we use the weighted-sum pooling method to test the joint association of multiple rare and common variants within a gene. The proposed method is applied to the Genetic Association Workshop 17 (GAW17) simulated mini-exome data to analyze multiple traits. Because of the nature of the GAW17 simulation model, increased power was not observed for multiple-trait analysis compared to single-trait analysis. However, multiple-trait analysis did not result in a substantial loss of power because of the testing of multiple traits. We conclude that this method would be useful for identifying pleiotropic genes.
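As a rough illustration of weighted-sum pooling, the sketch below collapses the variants in one gene into a single per-individual score that up-weights rarer variants. It is a minimal sketch under stated assumptions, not the test proposed in this abstract: the weight 1/sqrt(q(1-q)), the smoothed whole-sample frequency q, the function names and the toy genotypes are all illustrative choices.

```python
import math

def variant_weights(genotypes):
    """Return one weight per variant; rarer variants get larger weights.

    genotypes: list of per-individual lists of minor-allele counts (0, 1 or 2).
    Assumption: allele frequencies are estimated from the whole sample with
    +1/+2 smoothing, so a variant with no observed minor alleles still has
    a finite weight.
    """
    n = len(genotypes)
    n_variants = len(genotypes[0])
    weights = []
    for j in range(n_variants):
        minor_count = sum(person[j] for person in genotypes)
        q = (minor_count + 1) / (2 * n + 2)  # smoothed minor allele frequency
        weights.append(1.0 / math.sqrt(q * (1.0 - q)))
    return weights

def weighted_sum_scores(genotypes):
    """Collapse all variants in a gene into one weighted score per individual."""
    w = variant_weights(genotypes)
    return [sum(w_j * g for w_j, g in zip(w, person)) for person in genotypes]

# Hypothetical toy gene: 4 individuals, 2 variants (variant 0 is the rarer one).
toy = [
    [0, 1],
    [0, 1],
    [1, 0],
    [0, 0],
]
weights = variant_weights(toy)
scores = weighted_sum_scores(toy)
```

In a multiple-trait setting, a score of this kind would then be tested against the correlated traits jointly; that testing step is not shown here.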
Genetically tractable model organisms from phages to mice have taught us invaluable lessons about fundamental biological processes and disease-causing mutations. Owing to technological and computational advances, human biology and the causes of human diseases have become accessible as never before. Progress in identifying genetic determinants for human diseases has been most remarkable for Mendelian traits. In contrast, identifying genetic determinants for complex diseases such as diabetes, cancer, and cardiovascular and neurological diseases has remained challenging, despite the fact that these diseases cluster in families. Hundreds of variants associated with complex diseases have been found in genome-wide association studies (GWAS), yet most of these variants explain only a modest amount of the observed heritability, a phenomenon known as “missing heritability.” The missing heritability has been attributed to many factors, mainly inadequacies in genotyping and phenotyping. We argue that lessons learned about complex traits in model organisms offer an alternative explanation for missing heritability in humans. In diverse model organisms, phenotypic robustness differs among individuals, and those with decreased robustness show increased penetrance of mutations and express previously cryptic genetic variation. We propose that phenotypic robustness also differs among humans and that individuals with lower robustness will be more responsive to genetic and environmental perturbations and hence susceptible to disease. Phenotypic robustness is a quantitative trait that can be accurately measured in model organisms, but not as yet in humans. We propose feasible approaches to measure robustness in large human populations, proof-of-principle experiments for robustness markers in model organisms, and a new GWAS design that takes differences in robustness into account.
Many common human diseases and complex traits are highly heritable and influenced by multiple genetic and environmental factors. Although genome-wide association studies (GWAS) have successfully identified many disease-associated variants, these genetic variants explain only a small proportion of the heritability of most complex diseases. Genetic interactions (gene-gene and gene-environment) substantially contribute to complex traits and diseases and could be one of the main sources of the missing heritability. This paper provides an overview of the available statistical methods and related computer software for identifying genetic interactions in animal and plant experimental crosses and human genetic association studies. The main discussion falls under the three broad issues in statistical analysis of genetic interactions: the definition, detection and interpretation of genetic interactions. Recently developed methods based on modern techniques for high-dimensional data are reviewed, including penalized likelihood approaches and hierarchical models; the relationships among these methods are also discussed. I conclude this review by highlighting some areas of future research.
Bayesian methods; Complex traits; Epistasis; Gene-environment interactions; Genetic association; High-dimensionality; Hierarchical models; Penalized likelihood; Quantitative trait loci
Recent genome-wide association studies (GWAS) have identified a number of novel genetic associations with complex human diseases. In spite of these successes, results from GWAS generally explain only a small proportion of disease heritability, an observation termed the ‘missing heritability problem’. Several sources for the missing heritability have been proposed, including the contribution of many common variants with small individual effect sizes, which cannot be reliably found using the standard GWAS approach. The goal of our study was to explore a complementary approach, which combines GWAS results with functional data in order to identify novel genetic associations with small effect sizes. To do so, we conducted a GWAS for lymphocyte count, a physiologic quantitative trait associated with asthma, in 462 Hutterites. In parallel, we performed a genome-wide gene expression study in lymphoblastoid cell lines from 96 Hutterites. We found significant support for genetic associations using the GWAS data when we considered variants near the 193 genes whose expression levels across individuals were most correlated with lymphocyte counts. Interestingly, these variants are also enriched with signatures of an association with asthma susceptibility, an observation we were able to replicate. The associated loci include genes previously implicated in asthma susceptibility as well as novel candidate genes enriched for functions related to T cell receptor signaling and adenosine triphosphate synthesis. Our results, therefore, establish a new set of asthma susceptibility candidate genes. More generally, our observations support the notion that many loci of small effects influence variation in lymphocyte count and asthma susceptibility.
Genome-wide association studies (GWAS) in humans have identified hundreds of single nucleotide polymorphisms associated with complex traits, yet for most traits studied, the sum total of all these identified variants fails to explain a significant portion of the heritable variation. Reasons for this “missing heritability” are thought to include the existence of rare causative variants not captured by current genotyping arrays, structural variants that go undetected by existing technology, insufficient power to identify multi-gene interactions, small sample sizes, and the influence of environmental and epigenetic effects. As genotyping technologies have evolved, it has become inexpensive and relatively straightforward to perform GWAS in mice. Mice offer a powerful tool for elucidating the genetic architecture of behavioral and physiological traits, and are complementary to human studies. Unlike F2 crosses of inbred strains, advanced intercross lines, heterogeneous stocks, outbred, and wild-caught mice have more rapid breakdown of linkage disequilibrium, which allows for increasingly high-resolution mapping. Because some of these populations are created using a small number of founder chromosomes, they are not expected to harbor rare alleles. We discuss the differences between these mouse populations and examine their potential to overcome some of the pitfalls that have plagued human GWAS.
GWAS; quantitative trait loci; complex traits; forward genetics; advanced intercross lines; heterogeneous stock; outbred mice; wild mice
Genome-wide association studies have been successful in identifying common variants for common complex traits in recent years. However, common variants have generally failed to explain substantial proportions of the trait heritabilities. Rare variants, structural variations, and gene-gene and gene-environment interactions, among others, have been suggested as potential sources of the so-called missing heritability. With the advent of exome-wide and whole-genome next-generation sequencing technologies, finding rare variants in functionally important sites (e.g., protein-coding regions) becomes feasible. We investigate the role of linkage information to select families enriched for rare variants using the simulated Genetic Analysis Workshop 17 data. In each replicate of simulated phenotypes Q1 and Q2 on 697 subjects in 8 extended pedigrees, we select one pedigree with the largest family-specific LOD score. Across all 200 replications, we compare the probability that rare causal alleles will be carried in the selected pedigree versus a randomly chosen pedigree. One example of successful enrichment was exhibited for gene VEGFC. The causal variant had minor allele frequency of 0.0717% in the simulated unrelated individuals and explained about 0.1% of the phenotypic variance. However, it explained 7.9% of the phenotypic variance in the eight simulated pedigrees and 23.8% in the family that carried the minor allele. The carrier’s family was selected in all 200 replications. Thus our results show that family-specific linkage information is useful for selecting families for sequencing, thus ensuring that rare functional variants are segregating in the sequencing samples.
Genome-wide association studies (GWAS) have identified many common variants associated with complex traits in human populations. Thus far, most reported variants have relatively small effects and explain only a small proportion of phenotypic variance, leading to the issues of ‘missing’ heritability and its explanation. Using height as an example, we examined two possible sources of missing heritability: first, variants with smaller effects whose associations with height failed to reach genome-wide significance and second, allelic heterogeneity due to the effects of multiple variants at a single locus. Using a novel analytical approach we examined allelic heterogeneity of height-associated loci selected from SNPs of different significance levels based on the summary data of the GIANT (stage 1) studies. In a sample of 1,304 individuals collected from an island population of the Adriatic coast of Croatia, we assessed the extent of height variance explained by incorporating the effects of less significant height loci and multiple effective SNPs at the same loci. Our results indicate that approximately half of the 118 loci that achieved stringent genome-wide significance (p-value < 5×10⁻⁸) showed evidence of allelic heterogeneity. Additionally, including less significant loci (i.e., p-value < 5×10⁻⁴) and accounting for effects of allelic heterogeneity substantially improved the variance explained in height.
The last decade of human genetic research witnessed the completion of hundreds of genome-wide association studies (GWASs). However, the genetic variants discovered through these efforts account for only a small proportion of the heritability of complex traits. One explanation for the missing heritability is that the common analysis approach, assessing the effect of each single-nucleotide polymorphism (SNP) individually, is not well suited to the detection of small effects of multiple SNPs. Gene set analysis (GSA) is one of several approaches that may contribute to the discovery of additional genetic risk factors for complex traits. Complex phenotypes are thought to be controlled by networks of interacting biochemical and physiological pathways influenced by the products of sets of genes. By assessing the overall evidence of association of a phenotype with all measured variation in a set of genes, GSA may identify functionally relevant sets of genes corresponding to relevant biomolecular pathways, which will enable more focused studies of genetic risk factors. This approach may thus contribute to the discovery of genetic variants responsible for some of the missing heritability. With the increased use of these approaches for the secondary analysis of data from GWAS, it is important to understand the different GSA methods and their strengths and weaknesses, and consider challenges inherent in these types of analyses. This paper provides an overview of GSA, highlighting the key challenges, potential solutions, and directions for ongoing research.
pathway analysis; multilocus; complex traits; genetic association studies
The variance explained by genetic variants as identified in (genome-wide) genetic association studies is typically small compared to family-based heritability estimates. Explanations of this ‘missing heritability’ have been mainly genetic, such as genetic heterogeneity and complex (epi-)genetic mechanisms.
We used comprehensive simulation studies to show that three phenotypic measurement issues also provide viable explanations of the missing heritability: phenotypic complexity, measurement bias, and phenotypic resolution. We identify the circumstances in which the use of phenotypic sum-scores and the presence of measurement bias lower the power to detect genetic variants. In addition, we show how the differential resolution of psychometric instruments (i.e., whether the instrument includes items that resolve individual differences in the normal range or in the clinical range of a phenotype) affects the power to detect genetic variants.
We conclude that careful phenotypic data modelling can improve the genetic signal, and thus the statistical power to identify genetic variants by 20–99%.
Genome-wide association studies (GWAS) have been successful in detecting common genetic variants underlying common traits and diseases. Despite the GWAS success stories, the percentage of trait variance explained by GWAS signals has been, at best, modest, leaving the so-called “missing heritability”. Also, the predictive power of common variants identified by GWAS has not been encouraging. Given these observations, the fact that the effects of rare variants are often, by design, unaccounted for by GWAS, and the availability of sequence data, there is a growing need for robust analytic approaches to evaluate the contribution of rare variants to common complex diseases. Here we propose a new method that enables the simultaneous analysis of the association of rare and common variants with disease. We refer to this method as SCARVA (simultaneous common and rare variants analysis). SCARVA is simple to use and is efficient. We used SCARVA to analyze two independent real datasets to identify rare and common variants underlying variation in obesity among participants in the Africa America Diabetes Mellitus (AADM) study and plasma triglyceride levels in the Dallas Heart Study (DHS). We found common and rare variants associated with both traits, consistent with published results.
association; common variant; haplotype; rare variant
Genome-wide association studies (GWAS) have identified many common polymorphisms associated with complex traits. However, these associated common variants explain only a small fraction of the phenotypic variances, leaving a substantial portion of genetic heritability unexplained. As a result, searches for "missing" heritability are drawing increasing attention, particularly for rare variant studies that often require a large sample size and, thus, extensive sequencing effort. Although the development of next generation sequencing (NGS) technologies has made it possible to sequence a large number of reads economically and efficiently, it is still often cost prohibitive to sequence the thousands of individuals that are generally required for association studies. A more efficient and cost-effective design would involve pooling the genetic materials of multiple individuals together and then sequencing the pools, instead of the individuals. This pooled sequencing approach has improved the feasibility of association studies for rare variants while, at the same time, posing a great challenge for the analysis of pooled sequencing data, essentially because individual sample identity is lost and NGS sequencing errors can be hard to distinguish from low frequency alleles.
A unified approach for estimating minor allele frequency, SNP calling and association studies based on pooled sequencing data using an expectation maximization (EM) algorithm is developed in this paper. This approach makes it possible to study the effects of minor allele frequency, sequencing error rate, number of pools, number of individuals in each pool, and the sequencing depth on the estimation accuracy of minor allele frequencies. We show that the naive method of estimating minor allele frequencies by taking the fraction of observed minor alleles can be significantly biased, especially for rare variants. In contrast, our EM approach can give an unbiased estimate of the minor allele frequency under all scenarios studied in this paper. A SNP calling approach, EM-SNP, for pooled sequencing data based on the EM algorithm is then developed and compared with another recent SNP calling method, SNVer. We show that EM-SNP outperforms SNVer in terms of the fraction of db-SNPs among the called SNPs, as well as transition/transversion (Ti/Tv) ratio. Finally, the EM approach is used to study the association between variants and type I diabetes.
The EM-based approach for the analysis of pooled sequencing data can accurately estimate minor allele frequencies, call SNPs, and find associations between variants and complex traits. This approach is especially useful for studies involving rare variants.
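To make the bias of the naive read-fraction estimate concrete, the sketch below runs an EM iteration on a deliberately simplified model. It is not the EM-SNP method described above: it assumes a single pool, a known symmetric per-read error rate, and treats each read as an independent draw, with the unobserved true allele of each read as the latent variable. The function name and the toy counts are hypothetical.

```python
def em_minor_allele_frequency(minor_reads, total_reads, error_rate, n_iter=200):
    """Estimate minor allele frequency from read counts by EM.

    Simplifying assumptions (not from the source): one pool, a known
    symmetric error rate in (0, 0.5), independent reads.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must lie strictly between 0 and 0.5")
    if total_reads <= 0 or not 0 <= minor_reads <= total_reads:
        raise ValueError("need 0 <= minor_reads <= total_reads and total_reads > 0")

    e = error_rate
    major_reads = total_reads - minor_reads
    p = 0.5  # starting value
    for _ in range(n_iter):
        # E-step: posterior probability that a read truly carries the minor
        # allele, given what was observed on that read.
        prob_obs_minor = p * (1 - e) + (1 - p) * e
        post_if_obs_minor = p * (1 - e) / prob_obs_minor
        post_if_obs_major = p * e / (1 - prob_obs_minor)
        # M-step: the new frequency is the expected fraction of true minor reads.
        expected_true_minor = (minor_reads * post_if_obs_minor
                               + major_reads * post_if_obs_major)
        p = expected_true_minor / total_reads
    return p

# Hypothetical counts: 30 minor-allele reads out of 1000, 1% error rate.
naive_estimate = 30 / 1000
em_estimate = em_minor_allele_frequency(30, 1000, 0.01)
```

Under this toy model the EM estimate settles below the naive fraction, because some of the observed minor-allele reads are attributed to sequencing error; the gap matters most when the true frequency is close to the error rate.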
Although genome-wide association studies have uncovered variants associated with more than 150 traits, the percentage of phenotypic variation explained by these associations remains small. This has led to the search for the dark matter that explains this missing genetic component of heritability. One potential explanation for dark matter is rare variants, and several statistics have been devised to detect associations resulting from aggregations of rare variants in relatively short regions of interest, such as candidate genes. In this paper we investigate the feasibility of extending this approach in an agnostic way, in which we consider all variants within a much broader region of interest, such as an entire chromosome or even the entire exome. Our method searches for subsets of variant sites using either Markov chain Monte Carlo or genetic algorithms. The analysis was performed with knowledge of the Genetic Analysis Workshop 17 answers.
Single nucleotide polymorphisms (SNPs) discovered by genome-wide association studies (GWASs) account for only a small fraction of the genetic variation of complex traits in human populations. Where is the remaining heritability? We estimated the proportion of variance for human height explained by 294,831 SNPs genotyped on 3,925 unrelated individuals using a linear model analysis, and validated the estimation method by simulations based upon the observed genotype data. We show that 45% of variance can be explained by considering all SNPs simultaneously. Thus, most of the heritability is not missing but has not previously been detected because the individual effects are too small to pass stringent significance tests. We provide evidence that the remaining heritability is due to incomplete linkage disequilibrium (LD) between causal variants and genotyped SNPs, exacerbated by causal variants having lower minor allele frequency (MAF) than the SNPs explored to date.
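One common way to write a linear model that considers all SNPs simultaneously is the random-effects form below. It is offered as a sketch and an assumption about the general form, since the abstract does not spell out the model; the symbols are illustrative. Here y is the vector of heights, X the fixed-effect covariates, g the total additive effect captured by the genotyped SNPs, A a SNP-derived relationship matrix between individuals j and k, x_ij the minor allele count of individual j at SNP i, p_i the allele frequency of SNP i, and N the number of SNPs.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\varepsilon},\\
\operatorname{var}(\mathbf{y}) &= \mathbf{A}\,\sigma_g^{2} + \mathbf{I}\,\sigma_\varepsilon^{2},\\
A_{jk} &= \frac{1}{N}\sum_{i=1}^{N}\frac{(x_{ij}-2p_i)(x_{ik}-2p_i)}{2p_i(1-p_i)},\\
h^{2}_{\mathrm{SNP}} &= \frac{\sigma_g^{2}}{\sigma_g^{2}+\sigma_\varepsilon^{2}}.
\end{aligned}
```

In this notation, the 45% figure reported above would correspond to the estimated ratio in the last line.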
Human genome resequencing technologies are becoming ever more affordable and provide a valuable source of data about rare genetic variants in the human genome. Such rare variation may play an important role in explaining the missing heritability of complex human traits. We implement an existing method for analyzing rare variants by testing for association with the mutational load across genes. In this study, we make use of simulated data from the Genetic Analysis Workshop 17 to assess the power of this approach to detect association with simulated quantitative and dichotomous phenotypes and to evaluate the impact of missing genotypes on the power of the analysis. According to our results, the mutational load based rare variant analysis method is relatively robust to call-rate and is adequately powered for genome-wide association analysis.
Complex diseases are often highly heritable. However, for many complex traits only a small proportion of the heritability can be explained by observed genetic variants in traditional genome-wide association (GWA) studies. Moreover, for some of those traits few significant SNPs have been identified. Single SNP association methods test for association at a single SNP, ignoring the effect of other SNPs. We show using a simple multi-locus odds model of complex disease that moderate to large effect sizes of causal variants may be estimated as relatively small effect sizes in single SNP association testing. This underestimation effect is most severe for diseases influenced by numerous risk variants. We relate the underestimation effect to the concept of non-collapsibility found in the statistics literature. As described, continuous phenotypes generated with linear genetic models are not affected by this underestimation effect. Since many GWA studies apply single SNP analysis to dichotomous phenotypes, previously reported results potentially underestimate true effect sizes, thereby impeding identification of true effect SNPs. Therefore, when a multi-locus model of disease risk is assumed, a multi SNP analysis may be more appropriate.
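The underestimation effect can be reproduced with a small numerical example. The sketch below is not the multi-locus odds model used in this abstract; it assumes a two-locus logistic model with binary carrier status, independent loci, and hypothetical parameter values, and compares the odds ratio for locus 1 conditional on locus 2 with the odds ratio obtained when locus 2 is ignored.

```python
import math

def logistic(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def marginal_odds_ratio(b0, b1, b2, freq2=0.5):
    """Odds ratio for locus 1 when disease risk is averaged over locus 2.

    Assumes locus 2 is independent of locus 1 and is carried with
    probability freq2 (illustrative parameters, not from the source).
    """
    def risk(g1):
        risk_noncarrier = logistic(b0 + b1 * g1)
        risk_carrier = logistic(b0 + b1 * g1 + b2)
        return (1 - freq2) * risk_noncarrier + freq2 * risk_carrier

    r0, r1 = risk(0), risk(1)
    return (r1 / (1 - r1)) / (r0 / (1 - r0))

# Locus 1 has a true (conditional) odds ratio of 3; locus 2 has a strong effect.
conditional_or = 3.0
marginal_or = marginal_odds_ratio(b0=-1.0, b1=math.log(conditional_or), b2=2.0)
# marginal_or comes out at about 2.4, below the conditional value of 3.
```

With these values a single-SNP analysis of locus 1 would report an odds ratio noticeably closer to 1 than the conditional value, even though the two loci are independent and there is no confounding.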
Genome-wide association studies have thus far failed to explain the observed heritability of complex human diseases. This is referred to as the “missing heritability” problem. However, these analyses have usually neglected to consider a role for epigenetic variation, which has been associated with many human diseases. We extend models of epigenetic inheritance to investigate whether environment-sensitive epigenetic modifications of DNA might explain observed patterns of familial aggregation. We find that variation in epigenetic state and environmental state can result in highly heritable phenotypes through a combination of epigenetic and environmental inheritance. These two inheritance processes together can produce familial covariances significantly higher than those predicted by models of purely epigenetic inheritance and similar to those expected from genetic effects. The results suggest that epigenetic variation, inherited both directly and through shared environmental effects, may make a key contribution to the missing heritability.
Genome-wide association studies (GWAS) have identified around 60 common variants associated with multiple sclerosis (MS), but these loci only explain a fraction of the heritability of MS. Some missing heritability may be caused by rare variants that have been suggested to play an important role in the aetiology of complex diseases such as MS. However, current genetic and statistical methods for detecting rare variants are expensive and time consuming. ‘Population-based linkage analysis’ (PBLA), or so-called identity-by-descent (IBD) mapping, is a novel way to detect rare variants in extant GWAS datasets. We employed BEAGLE fastIBD to search for rare MS variants utilising IBD mapping in a large GWAS dataset of 3,543 cases and 5,898 controls. We identified a genome-wide significant linkage signal on chromosome 19 (LOD = 4.65; p = 1.9×10⁻⁶). Network analysis of cases and controls sharing haplotypes on chromosome 19 further strengthened the association, as there are more large networks of cases sharing haplotypes than controls. This linkage region includes a cluster of zinc finger genes of unknown function. Analysis of genome-wide transcriptome data suggests that genes in this zinc finger cluster may be involved in very early developmental regulation of the CNS. Our study also indicates that BEAGLE fastIBD allowed identification of rare variants in a large population of unrelated individuals with moderate computational intensity. Even with the development of whole-genome sequencing, IBD mapping may still be a promising way to narrow down the region of interest for sequencing priority.
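For readers who want to relate the LOD score to the p-value, the sketch below applies the usual asymptotic conversion for a one-parameter linkage test. This is an assumption: the abstract does not say how its p-value was derived, and the function name is hypothetical. The conversion treats 2 ln(10) × LOD as a chi-square statistic with one degree of freedom and halves the tail probability for a one-sided test.

```python
import math

def lod_to_p_value(lod):
    """Approximate one-sided p-value for a LOD score.

    Assumes the standard asymptotic result that 2*ln(10)*LOD follows a
    50:50 mixture of a point mass at 0 and a chi-square with 1 df.
    """
    chi2 = 2.0 * math.log(10.0) * lod
    # Upper tail of a 1-df chi-square at x equals erfc(sqrt(x / 2)).
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))

p_from_lod = lod_to_p_value(4.65)
```

Applied to LOD = 4.65, this conversion lands close to the reported p-value, which suggests (but does not confirm) that a conversion of this kind was used.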
Hundreds of genome-wide association studies have been performed in recent years in order to try to identify common variants that associate with complex disease. These have met with varying success. Some of the strongest effects of common variants have been found in late-onset diseases and in drug response. The major histocompatibility complex has also shown very strong association with a variety of disorders. Although there have been some notable success stories in neuropsychiatric genetics, on the whole, common variation has explained little of the high heritability of these traits. In contrast, early studies of rare copy number variants have led rapidly to a number of genes and loci that strongly associate with neuropsychiatric disorders. It is likely that the use of whole-genome sequencing to extend the study of rare variation in neuropsychiatry will greatly advance our understanding of neuropsychiatric genetics.
genome-wide association study; rare variant; neuropsychiatric; schizophrenia; sequencing
Genome-wide association studies (GWAS) have successfully identified a large number of genetic variants associated with complex traits, but these only explain a small proportion of the total heritability. It has been recently proposed that rare variants can create ‘synthetic association' signals in GWAS, by occurring more often in association with one of the alleles of a common tag single nucleotide polymorphism. While the ultimate evaluation of this hypothesis will require the completion of large-scale sequencing studies, it is informative to place it in the broader context of what is known about the genetic architecture of complex disease. In this review, we draw from empirical and theoretical data to summarize evidence showing that synthetic associations do not underlie many reported GWAS associations.
Osteoporosis is a complex disorder and commonly leads to fractures in elderly persons. Genome-wide association studies (GWAS) have become an unbiased approach to identify variations in the genome that potentially affect health. However, the genetic variants identified so far only explain a small proportion of the heritability for complex traits. Due to the modest genetic effect size and inadequate power, true association signals may not be revealed based on a stringent genome-wide significance threshold. Here, we take advantage of SNP and transcript arrays and integrate GWAS and expression signature profiling relevant to the skeletal system in cellular and animal models to prioritize the discovery of novel candidate genes for osteoporosis-related traits, including bone mineral density (BMD) at the lumbar spine (LS) and femoral neck (FN), as well as geometric indices of the hip (femoral neck-shaft angle, NSA; femoral neck length, NL; and narrow-neck width, NW). A two-stage meta-analysis of GWAS from 7,633 Caucasian women and 3,657 men revealed three novel loci associated with osteoporosis-related traits, including chromosome 1p13.2 (RAP1A, p = 3.6×10⁻⁸), 2q11.2 (TBC1D8), and 18q11.2 (OSBPL1A), and confirmed a previously reported region near the TNFRSF11B/OPG gene. We also prioritized 16 suggestive genome-wide significant candidate genes based on their potential involvement in skeletal metabolism. Among them, 3 candidate genes were associated with BMD in women. Notably, 2 out of these 3 genes (GPR177, p = 2.6×10⁻¹³; SOX6, p = 6.4×10⁻¹⁰) associated with BMD in women have been successfully replicated in a large-scale meta-analysis of BMD, but none of the non-prioritized candidates (associated with BMD) were. Our results support the concept of our prioritization strategy.
In the absence of direct biological support for identified genes, we highlighted the efficiency of subsequent functional characterization using publicly available expression profiling relevant to the skeletal system in cellular or whole animal models to prioritize candidate genes for further functional validation.
BMD and hip geometry are two major predictors of osteoporotic fractures, the most severe consequence of osteoporosis in elderly persons. We performed sex-specific genome-wide association studies (GWAS) for BMD at the lumbar spine and femoral neck skeletal sites as well as hip geometric indices (NSA, NL, and NW) in the Framingham Osteoporosis Study and then replicated the top findings in two independent studies. Three novel loci were significant: in women, chromosome 1p13.2 (RAP1A) for NW; in men, 2q11.2 (TBC1D8) for NSA and 18q11.2 (OSBPL1A) for NW. We confirmed a previously reported region on 8q24.12 (TNFRSF11B/OPG) for lumbar spine BMD in women. In addition, we integrated GWAS signals with eQTL in several tissues and publicly available expression signature profiling in cellular and whole-animal models, and prioritized 16 candidate genes/loci based on their potential involvement in skeletal metabolism. Among three prioritized loci (GPR177, SOX6, and CASR genes) associated with BMD in women, GPR177 and SOX6 were later successfully replicated in a large-scale meta-analysis, but none of the non-prioritized candidates (associated with BMD) were. Our results support the concept of using expression profiling to support the candidacy of suggestive GWAS signals that may contain important genes of interest.
Genome-wide association studies (GWAS) have been used successfully in detecting associations between common genetic variants and complex diseases. However, common SNPs detected by current GWAS only explain a small proportion of heritable variability. With the development of next-generation sequencing technologies, researchers find more and more evidence to support the role played by rare variants in heritable variability. However, rare and common variants are often studied separately. The objective of this paper is to develop a robust strategy to analyze association between complex traits and genetic regions using both common and rare variants.
We propose a weighted selective collapsing strategy for both candidate gene studies and genome-wide association scans. The strategy considers genetic information from both common and rare variants, selectively collapses all variants in a given region by a forward selection procedure, and uses an adaptive weight to favor more likely causal rare variants. Under this strategy, two tests are proposed. One test, denoted BwSC, is sensitive to the directions of genetic effects and separates the deleterious and protective effects into two components. The other, denoted BwSCd, is robust to the directions of genetic effects and considers the difference of the two components. In our simulation studies, BwSC achieves higher power when the causal variants have the same genetic effect, while BwSCd is as powerful as several existing tests when a mixed genetic effect exists. Both proposed tests work well with and without genetic effects from common variants.
Two tests using a weighted selective collapsing strategy provide potentially powerful methods for association studies of sequencing data. The tests have higher power when both common and rare variants contribute to the heritable variability and the effect of common variants is not strong enough to be detected by traditional methods. Our simulation studies have demonstrated substantially higher power for both tests in all scenarios, regardless of whether the common SNPs are associated with the trait.
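To make the idea of collapsing with forward selection more concrete, the following is a minimal Python sketch. It is not the BwSC or BwSCd test from the abstract above. It assumes a frequency-based weight of 1/sqrt(p(1-p)), a single direction-agnostic case-control statistic, and a permutation p-value, none of which are specified in the source. All function names (`maf_weights`, `burden`, `score_stat`, `forward_collapse`, `permutation_pvalue`) are hypothetical.

```python
import math
import random
from statistics import mean, variance


def maf_weights(genotypes):
    """Assumed weight 1/sqrt(p(1-p)), so rarer variants count more."""
    n, m = len(genotypes), len(genotypes[0])
    weights = []
    for j in range(m):
        minor = sum(row[j] for row in genotypes)
        p = (minor + 1) / (2 * n + 2)  # smoothed so p is never 0 or 1
        weights.append(1 / math.sqrt(p * (1 - p)))
    return weights


def burden(genotypes, weights, subset):
    """Collapse the chosen variants into one weighted score per person."""
    return [sum(weights[j] * row[j] for j in subset) for row in genotypes]


def score_stat(scores, labels):
    """Absolute standardized case-control difference in mean score."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    se = math.sqrt(variance(cases) / len(cases)
                   + variance(controls) / len(controls))
    return abs(mean(cases) - mean(controls)) / se if se > 0 else 0.0


def forward_collapse(genotypes, labels, weights):
    """Greedily add variants while the collapsed statistic keeps improving."""
    selected, best = [], 0.0
    remaining = list(range(len(weights)))
    while remaining:
        stats = [score_stat(burden(genotypes, weights, selected + [j]), labels)
                 for j in remaining]
        k = max(range(len(stats)), key=stats.__getitem__)
        if stats[k] <= best:
            break
        best = stats[k]
        selected.append(remaining.pop(k))
    return selected, best


def permutation_pvalue(genotypes, labels, n_perm=200, seed=0):
    """Repeat the whole selection on shuffled labels to calibrate the statistic."""
    rng = random.Random(seed)
    weights = maf_weights(genotypes)  # uses all samples, so unaffected by shuffling
    _, observed = forward_collapse(genotypes, labels, weights)
    shuffled, hits = list(labels), 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if forward_collapse(genotypes, shuffled, weights)[1] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Here `genotypes` is a list of per-person rows of minor-allele counts (0, 1, or 2) and `labels` is a list of 0/1 case-control indicators, with at least two people in each group. The selection step is rerun inside every permutation because choosing the best subset inflates the observed statistic, so comparing it with a standard reference distribution would likely be anti-conservative.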
Genome-wide association studies have been successful in identifying loci contributing effects to a range of complex human traits. The majority of reproducible associations within these loci are with common variants, each of modest effect, which together explain only a small proportion of heritability. It has been suggested that much of the unexplained genetic component of complex traits can thus be attributed to rare variation. However, genome-wide association study genotyping chips have been designed primarily to capture common variation, and thus are underpowered to detect the effects of rare variants. Nevertheless, we demonstrate here, by simulation, that imputation from an existing scaffold of genome-wide genotype data up to high-density reference panels has the potential to identify rare variant associations with complex traits, without the need for costly re-sequencing experiments. By application of this approach to genome-wide association studies of seven common complex diseases, imputed up to publicly available reference panels, we identify genome-wide significant evidence of rare variant association in PRDM10 with coronary artery disease and multiple genes in the major histocompatibility complex (MHC) with type 1 diabetes. The results of our analyses highlight that genome-wide association studies have the potential to offer an exciting opportunity for gene discovery through association with rare variants, conceivably leading to substantial advancements in our understanding of the genetic architecture underlying complex human traits.
genome-wide association study; rare variants; imputation
Most common human traits and diseases have a polygenic pattern of inheritance: DNA sequence variants at many genetic loci influence phenotype. Genome-wide association (GWA) studies have identified >600 variants associated with human traits [1], but these typically explain small fractions of phenotypic variation, raising questions about the utility of further studies. Here, using 183,727 individuals, we show that hundreds of genetic variants, in at least 180 loci, influence adult height, a highly heritable and classic polygenic trait [2,3]. The large number of loci reveals patterns with important implications for genetic studies of common human diseases and traits. First, the 180 loci are not random, but instead are enriched for genes that are connected in biological pathways (P=0.016), and that underlie skeletal growth defects (P<0.001). Second, the likely causal gene is often located near the most strongly associated variant: in 13 of 21 loci containing a known skeletal growth gene, that gene was closest to the associated variant. Third, at least 19 loci have multiple independently associated variants, suggesting that allelic heterogeneity is a frequent feature of polygenic traits, that comprehensive explorations of already-discovered loci should discover additional variants, and that an appreciable fraction of associated loci may have been identified. Fourth, associated variants are enriched for likely functional effects on genes, being over-represented amongst variants that alter amino acid structure of proteins and expression levels of nearby genes. Our data explain ∼10% of the phenotypic variation in height, and we estimate that unidentified common variants of similar effect sizes would increase this figure to ∼16% of phenotypic variation (∼20% of heritable variation).
Although additional approaches are needed to fully dissect the genetic architecture of polygenic human traits, our findings indicate that GWA studies can identify large numbers of loci that implicate biologically relevant genes and pathways.
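The two percentages quoted for height in the abstract above can be related by simple arithmetic. The sketch below assumes that "∼16% of phenotypic variation" and "∼20% of heritable variation" describe the same set of variants, so their ratio gives the heritability implied by those figures. The function name `implied_heritability` is hypothetical, and the inputs are the rounded values from the text, so the results are approximate.

```python
def implied_heritability(phenotypic_fraction, heritable_fraction):
    """Heritability implied when the same variants explain both fractions."""
    return phenotypic_fraction / heritable_fraction


# Rounded figures taken from the abstract.
h2 = implied_heritability(0.16, 0.20)

# Share of heritable variation covered by the ~10% already explained.
explained_share = 0.10 / h2

print(f"implied heritability: about {h2:.2f}")
print(f"share of heritable variation already explained: about {explained_share:.3f}")
```

Under these assumptions the implied heritability of height is roughly 0.8, and the ∼10% of phenotypic variation already explained corresponds to roughly one eighth of the heritable variation.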