Our current search in genome-wide association studies (GWAS) is based on the common-disease common-variant model. It might be argued that the distribution of validated SNPs supports this model [5]; for example, 18 of the 20 validated SNPs for type-2 diabetes have MAFs >10%. Of the 383 SNPs from the recent GWAS (see Introduction), 87% (335/383) have MAFs

. This observation is of course biased, since statistical power is higher for larger MAFs and current genotyping technology prioritizes SNPs with larger MAFs. The current array technology from Affymetrix and Illumina, directly and indirectly via LD, has good coverage of the HapMap 4 M SNPs. However, an assessment in a resequenced region of 76 genes [11] shows that the current products, including Affymetrix 6.0 and Illumina 1 M, have substantially lower coverage of the complete common variation with MAFs

. So there could still be other common causal variants that are not yet covered by existing arrays.
We have used heritability as the basis to estimate the number of remaining variants, where heritability is defined as the genetic contribution to the variance of the liability of the disease. In comparison, Yang et al. [12] used the population attributable fraction (PAF), roughly the genetic contribution to the proportion of the disease in the population. While it is straightforward to compute the PAF from a set of known SNPs, it is not obvious how to obtain the total PAF from all the (known and unknown) causal variants. This is a disadvantage compared to our approach, since heritability is commonly reported for most diseases.
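As a rough illustration of why the PAF is easy to obtain for a set of known SNPs, the sketch below applies Levin's formula to each variant and then combines the per-variant values. It assumes Hardy-Weinberg proportions, independent loci and multiplicative effects; the function names and the example frequencies and relative risks are hypothetical, and this is not necessarily the computation used in [12].

```python
def levin_paf(exposure_freq, relative_risk):
    """Levin's formula: PAF = p(RR - 1) / (1 + p(RR - 1))."""
    excess = exposure_freq * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def carrier_freq(allele_freq):
    """Fraction carrying at least one risk allele (Hardy-Weinberg assumed)."""
    return 1.0 - (1.0 - allele_freq) ** 2


def combined_paf(pafs):
    """Joint PAF, assuming independent loci with multiplicative effects."""
    remaining = 1.0
    for paf in pafs:
        remaining *= 1.0 - paf
    return 1.0 - remaining


# Hypothetical (risk-allele frequency, carrier relative risk) pairs;
# these are illustrative values, not data from the paper.
known_snps = [(0.30, 1.20), (0.10, 1.35), (0.45, 1.10)]
pafs = [levin_paf(carrier_freq(f), rr) for f, rr in known_snps]
total_paf = combined_paf(pafs)
print(f"per-SNP PAFs: {[round(p, 3) for p in pafs]}, combined: {total_paf:.3f}")
```

The combined value covers only the variants that are listed, which is exactly the limitation noted above: nothing in this calculation indicates how much PAF remains for variants that are still unknown.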
Our computation shows that a large number of low-penetrant variants are needed to account for a heritability of 30–40%. This poses a major challenge, requiring enormous sample sizes (e.g., model B) to discover these variants. While such large samples are feasible in some existing consortia, a complicating factor that comes with ever larger studies is the potential dilution of signal that results from the need to include heterogeneous populations and/or heterogeneous phenotypes. For example, it is clear from studies on the hereditary forms of breast cancer that mutations in the BRCA1 and BRCA2 genes are often specific to individual populations [13]. If distinct sub-phenotypes are due to different susceptibility genes, a study that combines these heterogeneous phenotypes will yield diluted effects.
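The link between heritability and the number of variants can be sketched with a simple additive liability model. The code below is a back-of-envelope illustration, not the computation used in this paper: it assumes a liability with total variance 1, Hardy-Weinberg proportions, and identical variants each contributing 2p(1-p)β² to the variance. The allele frequencies and per-allele effects β (in liability standard deviations) are hypothetical.

```python
import math


def variance_explained(allele_freq, beta):
    """Liability variance from one additive biallelic variant under HWE.

    Assumes total liability variance is 1, so the result is also the
    fraction of variance (heritability) attributable to the variant.
    """
    return 2.0 * allele_freq * (1.0 - allele_freq) * beta ** 2


def variants_needed(heritability, allele_freq, beta):
    """Number of identical, independent variants needed to reach a target."""
    return math.ceil(heritability / variance_explained(allele_freq, beta))


# Hypothetical scenarios: common variants with small effects versus
# rare variants with larger effects, both targeting a heritability of 30%.
n_common = variants_needed(0.30, allele_freq=0.30, beta=0.05)
n_rare = variants_needed(0.30, allele_freq=0.01, beta=0.50)
print(n_common, n_rare)
```

Under these assumed values, far more common small-effect variants than rare larger-effect variants are required to reach the same heritability.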
A smaller number of rare medium- to high-penetrant variants are needed to account for the heritability. The current SNP array platforms are not able to genotype very rare SNPs, but, surprisingly, if denser arrays were available and the ORs were of medium size (e.g., 1.28 to 2.01 in model D), we would only need modestly large sample sizes to detect these rare variants. Such sample sizes are comparable to those of many existing genome-wide association studies, so they are well within reach. We might also search for higher-penetrant variants in subsets of populations, for example, by using more strictly defined phenotypes or by studying familial cases.
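To give a feel for how the required sample size depends on the OR, the following sketch uses the standard normal-approximation formula for comparing two proportions, applied to allele counts in cases and controls. It is not the power calculation used in this paper: the significance level of 5e-8, the 80% power, the equal numbers of cases and controls, and the control allele frequency of 0.01 are all assumptions made for illustration.

```python
from statistics import NormalDist


def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    return odds_ratio * control_freq / (1.0 + control_freq * (odds_ratio - 1.0))


def cases_needed(control_freq, odds_ratio, alpha=5e-8, power=0.8):
    """Approximate number of cases (with equal controls), two-sided allelic test."""
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    p_bar = (p0 + p1) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * (2.0 * p_bar * (1.0 - p_bar)) ** 0.5
        + z_beta * (p0 * (1.0 - p0) + p1 * (1.0 - p1)) ** 0.5
    ) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return alleles_per_group / 2.0  # each individual contributes two alleles


# The ORs are the end points quoted for model D; the allele frequency
# of 0.01 is a hypothetical value chosen for illustration.
for odds_ratio in (1.28, 2.01):
    n = cases_needed(control_freq=0.01, odds_ratio=odds_ratio)
    print(f"OR {odds_ratio}: about {n:,.0f} cases and as many controls")
```

The required sample size falls steeply as the OR grows, because the difference in allele frequency between cases and controls enters the denominator squared.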
One natural question about the rare-variant model with large effect sizes (e.g., model E) is whether existing data already rule it out. Is it possible to miss such rare alleles using the existing tagging SNPs? The case of the CHEK2 1100delC mutation is a relevant example. It has an allele frequency of approximately 0.5% and an OR of 2.7 for sporadic breast cancer and 4.8 for familial breast cancer [14]. Yet the CHEK2 gene does not appear among the top SNPs in the largest, most recent breast cancer association study [3]. So the rare-variant model with large effect sizes is still a possibility.
Very rare variants (MAFs <0.01) will create methodological problems. First of all, they are not represented in the current highest-density genotyping arrays. Another problem is measurement accuracy: since genotype calling is based on fluorescent intensity and clustering, it will be hard to distinguish very rare variants from genotyping errors. Also, as they are likely to have arisen after the out-of-Africa migration, rare variants are likely to be population specific, which means that we cannot simply combine different study cohorts. Some of these problems might be solved by complete sequencing, but this technology is still too expensive for large studies.
Age-related macular degeneration [15] and exfoliation glaucoma [16] are unusual among phenotypes studied through GWAS, with large effects from common variants that have been identified in limited samples. Nonetheless, they show that there are traits with marked allelic homogeneity. Another very recent example is transferrin concentration [17], where 40% of the variance is explained by a single locus. However, it is impossible to judge beforehand which complex traits will display such a genetic architecture.
To appreciate the scope of our challenge in the genetic dissection of complex phenotypes, it is useful to consider the genetics of cystic fibrosis (CF), a ‘simple’ Mendelian disease of the mucus glands of the lungs, liver and pancreas. CF is a recessive disorder, caused by mutations in CFTR, a 230,000-base-long gene on chromosome 7q31.2. Deletion of codon 508 (phenylalanine), first identified in 1988 [18], is found in 66% of the cases. However, there are more than 1000 other deleterious mutations, the great majority of which are very rare variants. It is known that the clinical manifestations of the disease, for example prognosis, vary substantially; while these correlate with the type of mutations [19], [20], the genotype explains only a small portion of the clinical variability.
This highlights two salient points: (i) If a simple genetic disease such as CF can have more than 1000 functional deleterious variants, are there reasons to believe that the number and spectrum of functional mutations (in terms of non-synonymous substitutions, stop mutations, deletions, splice mutations, etc.) should be different for genes with more subtle effects on complex diseases? (ii) Monogenic diseases such as CF also have phenotypic diversity, and this diversity is still poorly explained by the underlying genetics. If anything, the phenotypic diversity within each complex disease tends to be wider than that of simple Mendelian diseases, so our challenge will be even greater. Different disease subtypes are likely due to different (combinations of) causal variants; however, because of sample-size constraints, our case-control samples are combined over these subtypes, so the effects of the functional variants will be diluted. In conclusion, substantial challenges remain in finding genetic explanations for common diseases.
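The dilution effect described above can be illustrated with a simple mixture calculation. The sketch assumes that a variant raises risk only for one subtype, which makes up a given fraction of the pooled cases, while the remaining cases have the same allele frequency as controls. The function names and numerical values are hypothetical.

```python
def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    return odds_ratio * control_freq / (1.0 + control_freq * (odds_ratio - 1.0))


def diluted_odds_ratio(control_freq, true_or, affected_fraction):
    """Allelic OR observed when only a fraction of pooled cases carry the effect."""
    p0 = control_freq
    p1 = case_allele_freq(p0, true_or)
    # Pooled cases are a mixture of the affected subtype and other subtypes
    # whose allele frequency is assumed to equal that of controls.
    p_mix = affected_fraction * p1 + (1.0 - affected_fraction) * p0
    return (p_mix / (1.0 - p_mix)) / (p0 / (1.0 - p0))


# Hypothetical example: a variant with OR 2.0 in one subtype only.
for fraction in (1.0, 0.5, 0.25):
    observed = diluted_odds_ratio(control_freq=0.10, true_or=2.0,
                                  affected_fraction=fraction)
    print(f"subtype fraction {fraction:.2f}: observed OR {observed:.2f}")
```

When the subtype makes up all cases the true OR is recovered; as the fraction shrinks, the observed OR moves toward 1, which in turn raises the sample size needed to detect the variant.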