Genome-wide association studies (GWAS) have successfully identified a large number of genetic variants associated with complex traits, but these explain only a small proportion of the total heritability. It has recently been proposed that rare variants can create ‘synthetic association’ signals in GWAS by occurring more often in association with one of the alleles of a common tag single nucleotide polymorphism. While the ultimate evaluation of this hypothesis will require the completion of large-scale sequencing studies, it is informative to place it in the broader context of what is known about the genetic architecture of complex disease. In this review, we draw from empirical and theoretical data to summarize evidence showing that synthetic associations do not underlie many reported GWAS associations.
Numerous common human diseases and phenotypic traits are believed to arise from a combination of genetic and environmental factors. The unravelling of the genetic predisposition to complex traits is a major challenge, and it could lead to better prevention, diagnosis and treatment of disease.
Recently, advances in genotyping technologies, reductions in genotyping costs and the availability of data on genome-wide sequence variation through the International HapMap Project and the 1000 Genomes Project have made genome-wide association studies (GWAS) possible. GWAS have emerged as a powerful tool for identifying genetic variants associated with complex traits. In the past few years, more than 500 loci have been found to be associated with common human diseases and traits (1). GWAS have proven to be much more successful than linkage studies, which were underpowered to detect variants of modest effect (2), and candidate gene studies, which are non-systematic and biased due to our limited knowledge of the biological pathways implicated in disease pathogenesis (3).
GWAS are based on the common disease–common variant (CDCV) hypothesis (4), which states that relatively common genetic variants [minor allele frequency (MAF) > 5%] of relatively low penetrance are important contributors to the genetic susceptibility to common diseases. Well-powered GWAS, which capture a substantial majority of common variation in the genome, have now been conducted for many common diseases. However, for the majority of these diseases, common variants explain only a small proportion of heritability (5), due to small individual effect sizes. It has been estimated that only 13% of all identified susceptibility loci have odds ratios (OR) above 2, and only 1% have OR above 10 (6). For example, if we consider a total estimated sibling recurrence risk ratio (λs) of 5–10 for rheumatoid arthritis (RA) (7), 15 for type 1 diabetes (T1D) (8), 17–35 for Crohn's disease (CD) (9) and 3 for type 2 diabetes (T2D) (10), their established susceptibility loci would contribute ~33–47%, 55.6%, 10–12.6% and 11.9% of the total heritability, respectively (Table 1).
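The link between per-locus effect sizes and the sibling recurrence risk ratio can be illustrated numerically. The following is a minimal Python sketch, assuming a multiplicative model in which each risk allele multiplies disease risk by a fixed factor and loci combine multiplicatively. Under that assumption, the standard variance-component result gives the locus-specific λs as (1 + w)², where w = pq(γ − 1)² / [2(q + pγ)²]. The function names and the example loci are hypothetical and are not taken from Table 1, and the log-scale share used here is one common convention that may differ from the calculation behind the percentages quoted above.

```python
import math


def locus_sibling_risk(risk_allele_freq, allelic_rr):
    """Sibling recurrence risk ratio contributed by one biallelic locus.

    Assumes a multiplicative model: genotype risks are 1, rr and rr**2.
    """
    p = risk_allele_freq
    q = 1.0 - p
    w = p * q * (allelic_rr - 1.0) ** 2 / (2.0 * (q + p * allelic_rr) ** 2)
    return (1.0 + w) ** 2


def fraction_explained(loci, total_lambda_s):
    """Share of log(total lambda_s) accounted for by the listed loci.

    Assumes that locus-specific ratios multiply, so their logs add.
    `loci` is an iterable of (risk_allele_freq, allelic_rr) pairs.
    """
    log_sum = sum(math.log(locus_sibling_risk(p, rr)) for p, rr in loci)
    return log_sum / math.log(total_lambda_s)


# Hypothetical loci with effect sizes typical of GWAS hits (illustrative only).
example_loci = [(0.30, 1.2), (0.10, 1.5), (0.45, 1.1)]
share = fraction_explained(example_loci, total_lambda_s=3.0)
print(f"Share of log(lambda_s) explained: {share:.1%}")
```

With effect sizes of this magnitude, each locus contributes a λs only slightly above 1, which is why a handful of common variants accounts for a small share of the familial clustering.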
Explaining this ‘missing heritability’ of complex diseases (11–13) is an area of active research, and there are likely to be multiple contributing factors. Part of the explanation is likely to be an underestimate of the contribution made by the types of variants targeted by GWAS. For instance, there may be large numbers of variants of very small effect that remain to be found because early GWAS were underpowered to detect them. This idea is supported by the observation that meta-analyses of published GWAS are discovering a substantial number of new susceptibility loci (14–25). In addition, for most loci, causal variants and potential independent additional markers within the region have not yet been identified. New ways of analysing the genetic architecture of complex traits using GWAS data suggest that a large proportion of heritability can indeed be explained by common variants and that larger GWAS will yield many more validated loci for complex traits (26,27).
Of course, GWAS only interrogate a portion of the types of variation that could underlie disease risk. Analysis of GWAS data has mainly focused on single nucleotide polymorphisms (SNPs), but there are other types of genetic variation, such as structural variants, that have not been studied in depth. However, recent studies of common (MAF > 5%) copy number variants (CNVs) have shown that they seem unlikely to account for a substantial proportion of the ‘missing heritability’ (28). Similarly, the analysis of gene–environment and gene–gene interactions (epistasis) might improve the fraction of heritability explained by loci documented thus far. Several epistatic interactions have been identified in humans [e.g. between the RET proto-oncogene and endothelin receptor type B genes in Hirschsprung disease (29), the interleukin 4 receptor variants and interleukin 13 promoter variants in asthma (30) and the alpha- and beta-adrenergic receptors in congestive heart failure (31)], although they have not been replicated. However, this phenomenon has not been thoroughly explored through large-scale analysis of genome-wide SNP interactions, first because current sample sizes are underpowered to detect modest interaction effects and secondly because of the paucity of sample collections with both genetic and detailed environmental exposure data. Complex patterns of inheritance, such as parent-of-origin effects (32), as well as inherited epigenetic modifications of the genome, the presence of phenotype heterogeneity in the cohorts used in the first wave of GWAS, or even an initial over-estimation of the heritability of complex traits (33) can also contribute to the missing heritability.
While the above-mentioned plausible contributors seem unlikely to play a substantial role in explaining missing heritability, rare variants are increasingly thought to account for a large proportion of it (34–36). Contrary to the CDCV hypothesis, the multiple rare variant (MRV) hypothesis argues that the summation of the effects of low-frequency polymorphisms, each conferring an intermediate increase in risk (i.e. incompletely penetrant, but greater than those observed for common variants), can explain a significant proportion of the genetic susceptibility to common diseases and traits. Some studies analysing rare variants using GWAS data have been carried out, but these have proven to be underpowered to detect robust associations. Re-sequencing approaches are more suitable for rare variant analysis, and, as these are becoming more cost-effective and new analysis methods are being developed (37,38), they will soon be applied to large-scale studies of rare variants. Indeed, several targeted sequencing studies have already proven successful for the identification of associations between rare variants and some human diseases and disease-related phenotypes (39–43). The same argument can also be extended to other forms of genetic variation, and it has been recently proposed that rare CNVs may be responsible for some fraction of the missing heritability of complex traits (44,45).
It has been recently proposed that GWAS signals that have been credited to common variants could instead reflect the effect of MRVs. Dickson et al. (46) argue that rare variants can create ‘synthetic association’ signals in GWAS, by occurring more often in association with one of the alleles of a common tag SNP (Fig. 1), which would thus synthetically confer an increased risk for disease. This might also mean that the causal variants could be megabases away from the common variants detected in GWAS, and that the real effect size could be much stronger than that implied by the common tag SNP. If true, the synthetic association hypothesis would suggest that follow-up studies from GWAS hits should encompass a much larger region than the linkage disequilibrium region surrounding the detected common variant (6).
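The dilution of a strong rare-variant effect into a weak common-variant signal can be shown with a small calculation. The following is a minimal Python sketch rather than the model used by Dickson et al.; it assumes that all rare risk variants sit on haplotypes carrying the same tag allele, that no haplotype carries more than one of them, that risk is multiplicative across haplotypes, and that the disease is rare enough for control allele frequencies to match those of the general population. The function name and the parameter values are hypothetical.

```python
def apparent_tag_odds_ratio(tag_freq, rare_freq, rare_rr):
    """Allelic odds ratio observed at a common tag allele.

    tag_freq  -- population frequency of the tag allele
    rare_freq -- combined frequency of the rare risk variants, all assumed
                 to lie on haplotypes carrying the tag allele
    rare_rr   -- relative risk conferred by each rare-variant haplotype
    """
    if not 0.0 < rare_freq <= tag_freq < 1.0:
        raise ValueError("require 0 < rare_freq <= tag_freq < 1")

    # Haplotype classes weighted by population frequency times relative risk.
    carrier = rare_freq * rare_rr          # tag allele plus a rare variant
    tag_noncarrier = tag_freq - rare_freq  # tag allele, no rare variant
    other = 1.0 - tag_freq                 # alternative tag allele

    # Tag-allele frequency among case haplotypes.
    case_tag = (carrier + tag_noncarrier) / (carrier + tag_noncarrier + other)

    case_odds = case_tag / (1.0 - case_tag)
    control_odds = tag_freq / (1.0 - tag_freq)  # rare-disease assumption
    return case_odds / control_odds


# Hypothetical scenario: rare variants with a fourfold risk and a combined
# frequency of 1%, all on the background of a tag allele at 30% frequency.
print(f"{apparent_tag_odds_ratio(0.30, 0.01, 4.0):.2f}")
```

Under these assumptions the expression simplifies to 1 + r(γ − 1)/f, where f is the tag allele frequency, r the combined rare-variant frequency and γ their relative risk, so a fourfold rare-variant effect appears as an odds ratio of about 1.1 at the tag SNP.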
There are very few documented examples showing that MRVs may be responsible for a common variant GWAS signal (47). It therefore seems sensible to evaluate this hypothesis in the broader context of human disease genetics, including historical study designs, functional annotations of GWAS regions and experiments in human populations with diverse ancestry. While sequencing experiments currently underway or in planning will ultimately resolve the role of synthetic association, the balance of evidence available today is already illuminating.
One line of evidence suggesting that synthetic associations do not underlie many reported GWAS associations is provided by linkage scans that have been conducted in the past. The genetic model that underpins synthetic association (allelic heterogeneity caused by several low-frequency variants with larger effects than commonly seen in GWAS) is highly tractable by linkage analysis, which combines information from all causal variants at a particular locus. This relationship is highlighted by the widely replicated linkage between the NOD2 gene and CD, which is driven by three independent, low-frequency causal variants (48–50) that cause a synthetic association signal in GWAS of CD (Fig. 1). NOD2 is the exception that proves the rule that, despite many attempts, very few replicable linkages to complex diseases have been discovered (51). This dearth of findings is informative when considering the likelihood of synthetic associations because it rules out a class of genetic models from playing a substantial role in complex disease.
Power calculations comparing a large-scale linkage scan (52) with the largest GWAS considered by Dickson et al. (46) show that only a small fraction of the genetic models that can give rise to synthetic associations would not be detected by linkage. Furthermore, the scenario where synthetic associations could have escaped linkage comprises models with a small number of causal variants with genotype relative risk <2.5 (53). While these observations do not entirely rule out synthetic associations, they seriously confine the parameter space in which they might exist. In addition, comparisons of even modest linkage signals with GWAS regions have shown only a few overlaps, and even these are largely driven by atypically large effects like the MHC in autoimmunity. Moreover, attempts to explicitly use linkage information to boost the power of GWAS (54) have not been successful. This contrast between the largely overlapping genetic models that linkage and synthetic association are well powered to detect and the almost completely non-overlapping results from linkage and GWAS strongly suggests that synthetic associations do not underlie many GWAS signals.
Another prediction made by the synthetic association hypothesis is that the most significantly associated common variant identified by GWAS might be located several megabases away from the underlying low-frequency functional variants. The empirical properties of linkage disequilibrium between low-frequency and common variants are not fully understood, although the completed 1000 Genomes Project (http://www.1000genomes.org/) will soon provide the information necessary to evaluate this question directly. Nevertheless, two indirect pieces of evidence suggest that most GWAS hit SNPs are within a few hundred kilobases (and many within tens of kilobases) of their tagged functional alleles. First, a large number of GWAS signals across a variety of traits lie near genes previously established to cause Mendelian forms of the same trait (55). Secondly, genes involved in key pathways repeatedly arise in GWAS of some diseases. For example, 8 of 10 proteins involved in the Th17-differentiation signalling pathway have been associated with one or more auto-inflammatory diseases (56). As with many aspects of evaluating the prevalence of synthetic associations, deeper sequence data sets will be needed to fully answer the question of the distance between GWAS tag SNPs and causal variants, but these early patterns imply that the tag SNP often resides in the proximity of the relevant functional genomic element.
Under the synthetic associations model, common variant signals reflecting single or multiple rare alleles are unlikely to be consistent across populations of different ancestry. This is based on the fact that many of these rare variants would have arisen recently and will therefore not be shared across diverged populations. The majority of GWAS to date have focused on populations of European descent. However, data on more diverse populations are now starting to arise. For example, a study from early 2010 clearly demonstrated that common variant signals for T2D are reproducible and have similar effect sizes across East Asian populations including Chinese, Malays and Asian-Indians in Singapore (57). In fact, T2D-associated variants have been found to be associated with disease in diverse populations (ranging from African-Americans to Chinese) by several studies (58–62). Similarly, in RA, the STAT4 locus, as an example, has shown reproducible association with disease in the USA (63), UK (64), Spanish, Swedish, Dutch (65), Korean (66), Colombian (67), Japanese (68) and Greek (69) populations.
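Whether effect sizes are similar across populations is commonly assessed with a heterogeneity statistic. The following is a minimal Python sketch of Cochran's Q and the derived I² for per-population log odds ratios, using inverse-variance weights. It is offered as an illustration and is not the analysis performed in the cited studies; the function name and the input values are hypothetical and do not correspond to any published estimates.

```python
def cochran_q(log_odds_ratios, standard_errors):
    """Fixed-effect pooled estimate, Cochran's Q and I^2 across populations.

    Each population contributes a log odds ratio and its standard error.
    Returns (pooled_log_or, q, i_squared).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * b for w, b in zip(weights, log_odds_ratios)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, log_odds_ratios))
    df = len(log_odds_ratios) - 1
    # I^2 is the share of variation beyond what sampling error alone predicts.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


# Hypothetical log odds ratios for one SNP in three populations.
pooled, q, i_squared = cochran_q([0.18, 0.22, 0.20], [0.05, 0.06, 0.08])
print(f"pooled log OR = {pooled:.3f}, Q = {q:.3f}, I^2 = {i_squared:.2f}")
```

A Q value close to or below its degrees of freedom (the number of populations minus one) is what consistent effects would produce, whereas a signal driven by population-specific rare variants would be expected to show marked heterogeneity.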
Although synthetic associations explaining common GWAS signals for complex polygenic traits are certainly plausible and can occur under specific circumstances (e.g. NOD2 in CD), results from studies thus far suggest that these scenarios are actually a rarity. The idea that MRVs at a particular locus may be associated with complex traits of interest has been around for over a decade. We are now starting to accrue a growing body of empirical evidence in support of this hypothesis. The field of complex trait genetics has over the last few months engaged in discussions on the controversial topic of synthetic associations, but it transpires that there is little evidence to support this as a widespread scenario.
Empowered by advances in sequencing technologies, attention is currently shifting towards the comprehensive study of low-frequency and rare variants. Resources such as the 1000 Genomes Project and emerging large-scale studies like the UK10K project will undoubtedly facilitate the examination of variants at this end of the allele frequency spectrum. In parallel, improved strategies for accurate imputation and powerful aggregate analysis of low-frequency and rare variants are being developed and fine-tuned to the needs of these next-generation, truly genome-wide scans for association.
Conflict of Interest statement. None declared.
G.O. is funded by the European Union (Marie Curie IEF Fellowship PIEF-GA-2009-235662). E.Z. and J.C.B. are supported by the Wellcome Trust (WT088885/Z/09/Z, WT089120/Z/09/Z). Funding to pay the Open Access Charge was provided by The Wellcome Trust.