Complex traits and diseases, such as body-mass index, height, diabetes, heart disease, and psychiatric disorders, are undoubtedly caused by multiple genetic and environmental factors, although it has been a major challenge to identify specific genes. Recently, genome-wide association studies (GWAS) have resulted in the detection of many robustly associated single nucleotide polymorphism (SNP) variants across a range of outcomes, although for any particular disease or trait the SNP variants detected explain only a fraction of the total genetic variance calculated from family studies. The gap between the two has been termed the “missing heritability”. Many reasons for the missing heritability have been given. One plausible explanation is that rare variants, which existing GWAS platforms are not designed to capture, make significant contributions to the heritability of many traits and diseases. It is indeed likely that many multifactorial and heterogeneous phenotypes will be influenced by a diverse array of genetic factors that span the spectrum from private mutation to common variant. Dickson and colleagues recently took a step further, arguing that rare variants might not only explain some of the heritability that is currently missing, but may also be the cause of a proportion of detected associations between complex traits and common SNPs from GWAS. Based on computer simulations, they proposed that some constellations of variants within a narrow frequency and effect size range can account for “many” of the observed associations between complex traits and common SNPs from GWAS. This is a strong claim and one that they say has important implications for the “design of future studies to detect causal variants.” It is of great importance to the research community to establish whether “many” represents an important proportion of GWAS results to date, since this can affect decisions of experimental design and allocation of research funds.
Dickson et al. define synthetic association as the association of a genotyped common marker resulting from multiple unobserved low-frequency causal variants (see Figure 1). The variance contributed by the causal variants would be much higher than the variance explained by the associated genotyped SNP, because the genotyped SNPs will not “tag” (see Box 2) the causal variants with great precision, thus leading to the “missing” heritability from GWAS. Importantly, synthetic associations may arise many hundreds of kilobases (kb) from the site of the causal variant(s), which would hamper attempts to locate the causal variants responsible for association signals by fine-mapping. Dickson et al. claim that rare variants can give rise to synthetic associations that are similar to many observed GWAS associations. As we show below, however, synthetic associations in fact tend to differ in some important ways from observations from GWAS. Furthermore, even if rare variants can, in principle, give rise to associations detectable in GWAS, the converse proposition (that, for a given trait, many, or even any, detected GWAS associations arise from rare variants) does not automatically follow.
Box 1. The Dickson et al. Genetic Model and Simulations
Dickson et al. used coalescence theory (Box 2) to simulate patterns of LD that are consistent with an evolutionary process, and then mimicked a GWAS by simulating cases and controls and testing for association between disease status and common tagging SNPs (MAF>0.05). Specifically, each simulation was of a genomic region of length 100 kb (representing only 1/30,000th of the genome). To generate realistic patterns of SNP frequencies they assumed an effective population size of 10,000 and a mutation rate of $10^{-8}$. Within a 100 kb region up to 9 SNPs, each with frequency between 0.005 and 0.02, were allocated to influence disease (causal SNPs). Therefore, at a locus with 9 such variants, ~20% of the general population would be expected to carry at least one disease risk allele. The baseline probability of disease was 1% or 10%, and each risk variant had the same increased risk for disease (genotype relative risk, GRR, see Box 3) compared to the baseline. Each simulation generated 10,000 haplotypes of the 100 kb region. Individuals in the population were simulated by sampling, with replacement, pairs of haplotypes; these were allocated case or control status based on the probability of disease associated with the number of risk alleles they carried, with GRR combining multiplicatively when an individual carried multiple risk alleles (this is not a common event: only about 1% of individuals will carry more than one risk allele when there are 9 causal SNPs in the 100 kb region). A case-control study was simulated by selecting equal numbers of cases and controls. The simulations varied three parameters: the number of causal SNPs (1, 3, 5, 7, 9), the sample size of the case-control study (2,000, 4,000, 6,000), and the GRR associated with each risk allele (2, 3, 4, 5, 6). Most simulations were conducted in the absence of recombination. The more realistic scenario of recombination (comparing different rates) was considered only when GRR=4.
The simulation of recombination divided the 100 kb region into 200 fragments of 500 bp, with recombination allowed only between fragments and not within them. Additional simulations also considered 9 causal variants of GRR=4 in a 10 Mb region with a recombination rate of 1 cM/Mb.
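The ~20% carrier figure quoted in this box can be reproduced with a back-of-envelope calculation. The sketch below is ours and is not the simulation code of Dickson et al.; it assumes, for illustration only, that the causal alleles occur independently and that each has the midpoint frequency (0.0125) of the stated 0.005 to 0.02 range. The function name `carrier_fraction` is hypothetical.

```python
# Back-of-envelope check of the carrier frequency at a multi-variant locus.
# Assumptions (ours, for illustration): causal alleles are independent and
# each has the midpoint frequency of the simulated range 0.005 to 0.02.


def carrier_fraction(n_causal: int, allele_freq: float) -> float:
    """Probability that a diploid individual carries at least one risk allele."""
    n_draws = 2 * n_causal  # two haplotypes, n_causal sites on each
    return 1.0 - (1.0 - allele_freq) ** n_draws


midpoint_freq = (0.005 + 0.02) / 2  # 0.0125

# Numbers of causal SNPs considered in the simulations.
for n_causal in (1, 3, 5, 7, 9):
    print(n_causal, round(carrier_fraction(n_causal, midpoint_freq), 3))
```

With 9 causal SNPs the result is close to 0.2, in line with the figure given above. Under the no-recombination simulations the independence assumption is only approximate.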
Box 2. Glossary of Linkage Disequilibrium
We consider two loci on a chromosome. The causal locus has alleles C and c and the genotyped marker (SNP) has alleles M and m. These alleles have frequencies $p_C$, $1-p_C$, $p_M$, and $1-p_M$. The loci can form four possible haplotypes, CM, cM, Cm, and cm, with frequencies $p_{CM}$, $p_{cM}$, $p_{Cm}$, and $p_{cm}$.
Linkage Equilibrium – When the frequencies of haplotypes are the frequencies expected from the random association of the alleles, e.g., $p_{CM} = p_C p_M$.
Linkage disequilibrium (LD) – The non-random association between alleles on a chromosome, e.g., $p_{CM} > p_C p_M$. Recombination breaks down linkage disequilibrium.
Recombination – Chromosomal cross-over between the paired chromosomes during meiosis so that the chromosomes passed to offspring comprise a mixture of the chromosomes inherited from its two parents. If the cross-over event occurs between loci C and M, then the LD between them is broken down in the transmitted chromosome. It may take several generations or multiple recombination events to have a substantial impact on the LD in the population.
Coupled alleles – Alleles at two loci that tend to be found together on a chromosome. For example, a locus with one rare allele (rare allele C, common allele c) will usually form only three chromosomal haplotypes with any other locus (minor allele M, major allele m): CM, cM, cm. In this example, the rare allele C is only found in the population coupled with the allele M. This is called complete LD. Recombination breaks down the coupling of alleles, so that all four haplotypes exist in the population. However, while there is linkage disequilibrium, the coupled alleles are those forming haplotypes with frequency greater than expected under linkage equilibrium.
Measures of LD – The two commonly used measures of LD are $r^2$ and $|D'|$. Both scale the covariance between the loci, $D = p_{CM} - p_C p_M$, but in different ways. $r^2 = D^2/(p_C p_M (1-p_C)(1-p_M))$, so $r$ is the correlation between the loci, which scales $D$ by the standard deviations of allele frequency at the two loci. When $p_C < p_M$ and C and M are coupled, $|D'| = D/(p_C(1-p_M))$, so that $D$ is scaled by the maximum allelic association possible given the allele frequencies at the two loci. Rare variants often form only three haplotypes with common SNPs; in this case $r^2$ can be close to zero while $|D'| = 1$.
Perfect LD – When the alleles at one locus (C and c) have the same frequency as the alleles at another locus (M and m) and when the alleles are perfectly coupled so that only two haplotypes exist, CM and cm. In this case $r^2 = |D'| = 1$.
Complete LD – When the alleles at one locus (C and c) have different frequencies from the alleles at another locus (M and m), but alleles from the C and M loci are coupled as much as is possible given the different allele frequencies. In this case, only three haplotypes exist in the population, e.g., CM, cM, cm. Here $|D'| = 1$ and $r^2$ can range from very close to zero to 1 (when $r^2 = 1$, the allele frequencies of the two loci are equal and there is perfect LD). The value of $r^2$ depends on the allele frequency difference between the two loci.
Maximum $r^2$ – The maximum $r^2$ possible between two loci given their allele frequencies occurs when the two loci form only three haplotypes, so that there is complete LD. If C has the lowest frequency out of C, c, M, and m, and if allele C is coupled with allele M (where M might be either the minor or major allele at this locus), then the difference in allele frequencies between the coupled loci is $v = p_M - p_C$. The maximum $r^2$ between them is $r^2_{\max} = p_C(1-p_C-v)/((p_C+v)(1-p_C))$. If allele C is very rare then $r^2_{\max} \approx p_C(1-v)/v$, and when $p_M$ is close to 0.5, $r^2_{\max} \approx p_C$.
Tagging – When a genotyped SNP is in LD with a non-genotyped variant, the genotyped SNP tags the non-genotyped variant.
Coalescence theory – A population genetics model of inheritance relationships among alleles at a given locus. The coalescence of two alleles is the most recent point (going back in time) at which they shared a common ancestor. Simulation under coalescence theory is an efficient way to generate a realistic distribution of SNP frequencies and LD between them.
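The LD measures defined in this glossary can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `ld_measures` and the example frequencies are our own, and the general normalisation used for |D'| is the standard one, which reduces to D/(pC(1−pM)) when pC < pM and C and M are coupled.

```python
# Illustrative calculation of D, r^2 and |D'| from haplotype and allele
# frequencies. Names and example values are hypothetical.


def ld_measures(p_CM: float, p_C: float, p_M: float) -> dict:
    """Return D, r^2 and |D'| for alleles C and M at two biallelic loci."""
    D = p_CM - p_C * p_M  # covariance between the two loci
    r2 = D ** 2 / (p_C * (1 - p_C) * p_M * (1 - p_M))
    # Largest |D| attainable given the allele frequencies and the sign of D.
    if D >= 0:
        d_max = min(p_C * (1 - p_M), (1 - p_C) * p_M)
    else:
        d_max = min(p_C * p_M, (1 - p_C) * (1 - p_M))
    return {"D": D, "r2": r2, "D_prime": abs(D) / d_max}


# Complete LD: rare allele C (frequency 0.01) found only on M haplotypes
# (frequency 0.3), so the haplotype Cm does not exist and p_CM = p_C.
complete = ld_measures(p_CM=0.01, p_C=0.01, p_M=0.3)

# Perfect LD: equal allele frequencies and only haplotypes CM and cm.
perfect = ld_measures(p_CM=0.2, p_C=0.2, p_M=0.2)

print(complete)
print(perfect)
```

In the complete LD example |D'| equals 1 while r² is only about 0.02, which illustrates why a common SNP can be a poor tag for a rare variant even when the two are fully coupled.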
Box 3. Glossary of Terms Underlying Variance Explained by a Locus on the Liability Scale
We assume a single locus with two alleles, the non-risk allele, c, and the risk allele, C. The frequency of the risk allele is $p$, so that the frequencies of the genotypes cc, Cc, and CC in the population are $(1-p)^2$, $2p(1-p)$, and $p^2$, assuming Hardy-Weinberg equilibrium.
Genotype relative risk (GRR):
GRR expresses the increased risk of disease associated with a single risk allele and is represented by the single character $\gamma$, so that under a multiplicative model of disease, the probabilities of disease for the three genotypes are $P(D|cc) = \phi$, $P(D|Cc) = \phi\gamma$, and $P(D|CC) = \phi\gamma^2$. If the disease prevalence in the population is $K$, with $P(D) = K = (1-p)^2 P(D|cc) + 2p(1-p)P(D|Cc) + p^2 P(D|CC)$, then $\phi = K/(1+p(\gamma-1))^2$. For high GRR, $\phi\gamma^2$ can exceed 1; in this case $P(D|CC)$ should be constrained to 1, and then $\phi = (K-p^2)/((1-p)^2 + 2p(1-p)\gamma)$. Dickson et al. chose to allow disease prevalence to vary, by fixing $\phi$ as a defined baseline probability.
Odds Ratio (OR):
The OR for heterozygotes compared to homozygotes for the non-risk allele is the ratio of the odds of disease, i.e., the probability of disease (D) over the probability of not disease (ND), in the two genotype classes: $\mathrm{OR} = [P(D|Cc)/P(ND|Cc)]/[P(D|cc)/P(ND|cc)] = \gamma(1-\phi)/(1-\phi\gamma)$.
Equivalence of GRR and OR:
As K→0, OR→γ. Since K is small for most complex genetic diseases, the GRR for heterozygotes and the OR are used interchangeably. The OR can be estimated from case-control data because it is robust to the inflated P(D) in such studies, where the frequency of cases is often ~0.5 rather than K.
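The convergence of OR towards γ at low prevalence can be seen numerically. This is a minimal sketch using the multiplicative-model expressions ϕ = K/(1+p(γ−1))² and OR = γ(1−ϕ)/(1−ϕγ) from this box; the function names and the example values of p, γ, and K are our own illustrative choices, not values taken from any study.

```python
# Heterozygote odds ratio implied by a genotype relative risk (GRR) under
# the multiplicative model. Function names and inputs are hypothetical.


def baseline_risk(K: float, p: float, gamma: float) -> float:
    """Baseline risk phi = P(D|cc) given prevalence K, allele frequency p, GRR gamma."""
    return K / (1 + p * (gamma - 1)) ** 2


def heterozygote_or(phi: float, gamma: float) -> float:
    """Odds ratio of Cc versus cc: gamma * (1 - phi) / (1 - phi * gamma)."""
    if phi * gamma >= 1:
        raise ValueError("phi * gamma must be below 1 for a valid probability")
    return gamma * (1 - phi) / (1 - phi * gamma)


p, gamma = 0.01, 4.0  # example rare risk allele with GRR = 4
for K in (0.1, 0.01, 0.001):
    phi = baseline_risk(K, p, gamma)
    print(K, round(heterozygote_or(phi, gamma), 4))
```

The OR always exceeds γ when γ > 1, and the excess shrinks as the prevalence falls.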
Variance explained on the liability scale:
If the associated variant has effect size GRR and allele frequency $p$, then the genetic variance in liability explained by the variant can be calculated from the mean liability associated with each genotype class (Table 4), but can be approximated as $V_G = 2p(1-p)[\ln(\mathrm{OR})]^2/i^2$, where $i$ is the mean liability (expressed in standard deviation units) of the diseased group, calculated from normal distribution theory assuming a disease prevalence $K$. Here $i = z/K$, where $z$ is the height of the standard normal curve at the liability threshold ($T$) that truncates the proportion $K$ of the standard normal curve. The residual is assumed to be normally distributed with variance 1, so the variance explained by the locus on the liability scale is $h^2 = V_G/(1+V_G)$. The assumptions of normality used in the liability threshold model break down when each rare locus contributes a large proportion of the variance.
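The approximation for variance explained on the liability scale can be evaluated with the Python standard library. The sketch below is illustrative only: the function names and input values are hypothetical, and the mean liability of the diseased group is taken as i = z/K, the standard truncated-normal result, with K the disease prevalence.

```python
# Approximate variance explained on the liability scale by a single locus,
# following V_G = 2p(1-p) ln(OR)^2 / i^2 and h2 = V_G / (1 + V_G).
# Function names and example inputs are hypothetical.
from math import log
from statistics import NormalDist


def mean_liability_of_cases(K: float) -> float:
    """Mean liability i of the diseased group for prevalence K (i = z / K)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    T = std_normal.inv_cdf(1 - K)  # threshold cutting off the upper proportion K
    z = std_normal.pdf(T)  # height of the normal curve at the threshold
    return z / K


def liability_variance_explained(p: float, odds_ratio: float, K: float) -> float:
    """Proportion of liability variance explained, h2 = V_G / (1 + V_G)."""
    i = mean_liability_of_cases(K)
    V_G = 2 * p * (1 - p) * log(odds_ratio) ** 2 / i ** 2
    return V_G / (1 + V_G)


# Example: a common SNP (p = 0.3) with OR = 1.2 for a disease with K = 0.01.
h2_common = liability_variance_explained(p=0.3, odds_ratio=1.2, K=0.01)
print(round(mean_liability_of_cases(0.01), 3), round(h2_common, 5))
```

As noted above, the normality assumptions behind this approximation break down when a rare locus contributes a large proportion of the variance, so the sketch is best read as a guide for loci of modest effect.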
Variance explained at causal versus marker loci:
If the variance in disease liability explained by a causal locus is $V_C$, then the variance explained at the genotyped locus is $V_M = r^2 V_C$ (where $r^2$ is the linkage disequilibrium measure described in Box 2). Therefore, if we estimate the variance explained by a common genotyped genetic marker, $V_M$, then the variance explained by the causal variant is expected to be $V_C = V_M/r^2$. This relationship holds for quantitative traits but breaks down for disease traits when the difference between $V_M$ and $V_C$ is large, and so cannot then be used to calculate the variance explained by the causal variant. Instead we calculate the odds ratio at the causal locus and calculate the variance explained from that.
OR at the causal locus given the estimate of the OR at the genotyped SNP:
The OR at the causal locus, $\mathrm{OR}_C$, can be calculated as a function of the OR at the genotyped locus: $\mathrm{OR}_C = 1 + (\mathrm{OR}-1)p_M/p_C$.
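To give a sense of scale, the relationship between the marker and causal odds ratios can be evaluated directly. This is a minimal sketch with hypothetical input values (a marker OR of 1.2 at frequency 0.3 and a single causal allele at frequency 0.01); the function name is our own.

```python
# Odds ratio implied at a causal variant by the odds ratio observed at a
# genotyped marker, OR_C = 1 + (OR - 1) * p_M / p_C.
# Function name and example inputs are hypothetical.


def causal_odds_ratio(or_marker: float, p_marker: float, p_causal: float) -> float:
    """Return OR_C given the marker OR and the two allele frequencies."""
    return 1 + (or_marker - 1) * p_marker / p_causal


# A modest OR at a very common marker implies a large OR at a rare variant.
implied_or = causal_odds_ratio(or_marker=1.2, p_marker=0.3, p_causal=0.01)
print(round(implied_or, 2))
```

When the two allele frequencies are equal the function returns the marker OR unchanged, as expected for perfect LD.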
The study of Dickson et al. is the first to consider, in detail, a genetic architecture of multiple rare variants within the framework of GWAS analyses. For ease of discussion, we use the terms rare, common, and very common alleles, but the cut-offs between them are necessarily somewhat arbitrary. For the purposes of simulation, Dickson et al. define rare variants as having risk allele frequency (RAF) 0.005–0.02 and define common SNPs to be representative of those used in GWAS (minor allele frequency, MAF>0.05). An important proportion of GWAS associations have risk alleles in the very common frequency spectrum (RAF>0.3) (Figure 2a). We will show that it is unlikely that such associations are driven by synthetic associations with single or multiple rare causal variants. We set out to understand and clarify their model and its implications in order to answer three questions:
- What is the expected frequency distribution of the most associated genotyped SNP under the Dickson et al. model?
- How many loci explain total genetic variance of complex disease under the Dickson et al. model?
- Using results from the GWAS of the International Schizophrenia Consortium as an example, are the results of Dickson et al. supported by empirical observation?