Although genetic determinants of risk are known to exist, most of this risk cannot be explained by the specific genetic variants identified so far. Numerous hypotheses have been proposed to explain this ‘dark matter’, as summarized in [1].
What accounts for the genetic ‘dark matter’ in cancer studies?a
The ability of GWAS to detect associations with common SNPs can be reduced if the phenotypes are poorly or inconsistently defined and ascertained, and/or if controls are suboptimally screened for exclusion of disease. Even with large sample sizes of several thousand cases and controls, there is usually limited power to detect alleles of modest effect size (odds ratios of 1.20), and minimal power to detect risk allele odds ratios of <1.10, even for very common variants. Power is also limited for detecting epistatic interactions among multiple genes of modest effect. The detection of gene-environment interactions is hampered not only by limited power, but also by the lack of concurrent availability of both genetic data and high-quality, standardized, and consistently collected environmental exposure data [6]. Residual population stratification or genotyping error can also lead to the attrition of some associations, although current GWAS investigations have dramatically improved study performance on these fronts. The genetic architecture can differ substantially across populations, and most GWAS to date have targeted European-descent populations [41]; there is some evidence that different loci can emerge, and that the implicated haplotype blocks and strength of association can vary, when populations of other ancestry are examined [42]. Moreover, despite generally good coverage of the whole genome by currently used genotyping platforms, some areas of the genome are still imperfectly covered, and variants lurking in these areas would remain undiscovered. This is more of a concern for African-descent populations than for those of European descent. Finally, parent-of-origin-specific genetic effects would have been largely missed by the mode of association analysis used in most GWAS investigations. For example, the association of rs157935[T] at 7q32 with the risk of cutaneous basal cell carcinoma seems to depend on the parent of origin of the risk allele, estimated in silico
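The power statements above can be illustrated with a rough calculation. The following is a minimal sketch, not the method of any cited study: it assumes a simple two-sided allelic (two-proportion) z-test, a risk-allele frequency of 0.30 in controls, 3000 cases and 3000 controls, and a genome-wide significance threshold of 5e-8. All of these values, and the function names, are illustrative assumptions rather than figures taken from the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def case_allele_freq(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def allelic_test_power(n_cases, n_controls, control_freq, odds_ratio, alpha=5e-8):
    """Approximate power of a two-sided allelic (two-proportion) z-test.

    Each person contributes two alleles, so allele counts are 2 * n.
    alpha=5e-8 is an assumed genome-wide threshold, not taken from the text.
    """
    p1 = case_allele_freq(control_freq, odds_ratio)
    p0 = control_freq
    se = sqrt(p1 * (1 - p1) / (2 * n_cases) + p0 * (1 - p0) / (2 * n_controls))
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability of crossing the threshold in the direction of the true effect;
    # the negligible contribution of the opposite tail is ignored.
    return STD_NORMAL.cdf(abs(p1 - p0) / se - z_crit)


# Hypothetical scenario: 3000 cases, 3000 controls, control allele frequency 0.30.
for odds_ratio in (1.10, 1.20, 1.50):
    power = allelic_test_power(3000, 3000, control_freq=0.30, odds_ratio=odds_ratio)
    print(f"OR {odds_ratio:.2f}: approximate power {power:.3f}")
```

Under these assumptions, power is modest at an odds ratio of 1.20 and close to zero at 1.10, in line with the qualitative statement above. The normal approximation is crude, so the numbers should be read as orders of magnitude only.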
In particular, the hypothesis of insufficient sample size, a potential limiting factor in the statistical power to detect weak genetic loci, has gathered considerable supporting evidence from the GWAS conducted to date and deserves further elaboration. The general rule has been that the larger the sample size, the greater the yield of newly discovered loci. The pattern of discoveries to date, in terms of the minor allele frequency and odds ratio of the risk alleles that emerge, could be largely explained by power considerations alone. This is leading GWAS research groups worldwide to assemble increasingly large series of cases and controls. As the technology becomes cheaper, performing GWAS with the whole genotyping platform on very large samples (e.g. 100 000 cases and as many controls), instead of using multi-stage designs, might become feasible and cost-efficient [5]. However, the practical issues in accumulating such a huge number of cases and controls are not easy to solve. For the most common cancers, such as breast cancer and colorectal cancer, consortia with sample sizes of up to 30 000 cases and as many controls are already in place, and enrollment of additional teams should increase these numbers further. For less common cancers, it will be a challenge to obtain such numbers, even if new large population cohorts are established [44]. Moreover, it is possible that even with 100 000 cases and 100 000 controls, the cumulative proportion of variance explained by the discovered variants might still not exceed 20%. Whether ever larger cohorts are the main solution to identifying the elusive dark matter is, however, a topic of debate in the community. One view is that many additional common variants are unlikely to be discovered or are not worth discovering, and that most genetic control is attributable to variants that are not represented at all in current studies [45]. This is an interesting speculation, but there are still no data to support or refute it.
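To make the link between sample size, allele frequency, and odds ratio concrete, the sketch below inverts the usual normal-approximation power formula for an allelic z-test and returns the number of cases needed (with an equal number of controls). It is an illustration under stated assumptions only: the 5e-8 significance threshold, the 80% target power, the example frequencies, and the function name `required_cases` are all hypothetical choices, not values from the text.

```python
from math import ceil
from statistics import NormalDist


def required_cases(odds_ratio, control_freq, alpha=5e-8, power=0.80):
    """Cases needed (with as many controls) for a two-sided allelic z-test.

    alpha and power are assumed targets chosen for illustration.
    """
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    p1 = odds / (1.0 + odds)  # risk-allele frequency in cases
    p0 = control_freq         # risk-allele frequency in controls
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_terms = p1 * (1 - p1) + p0 * (1 - p0)
    # Divide by 2 because every individual contributes two alleles.
    return ceil((z_alpha + z_beta) ** 2 * variance_terms / (2 * (p1 - p0) ** 2))


# Hypothetical grid of allele frequencies and odds ratios.
for freq in (0.05, 0.30):
    for odds_ratio in (1.05, 1.10, 1.20):
        n = required_cases(odds_ratio, freq)
        print(f"frequency {freq:.2f}, OR {odds_ratio:.2f}: about {n:,} cases")
```

Under these assumptions, the required number of cases rises steeply as the odds ratio approaches 1 and as the allele becomes less frequent, which is consistent with the argument that power alone can shape which loci are discovered.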
Rare variants are an obvious contender for the source of the missing heritability. By default, variants with a frequency of <5-10% are usually excluded from current GWAS analyses. Many rare variants (especially those with minor allele frequencies of 0.5-5%) could be captured by full sequencing, provided that a sufficient number of individuals are genotyped. Indeed, the 1000 Genomes Project (http://www.1000genomes.org/page.php) aims to identify rare variants and thus facilitate association studies. However, unless the effects that they convey are large, the power to detect them would be practically zero. Various analytical approaches have been proposed to improve power, typically by generating composite scores that merge many rare variants that might share a common function [46]. However, such merging usually yields speculative results, since no clear functional evidence exists to group the variants with certainty.
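The idea of a composite score can be sketched with a simple collapsing approach: each individual is flagged as a carrier if they have at least one minor allele at any rare variant in a predefined group, and carrier proportions are then compared between cases and controls. This is one generic way to build such a score, not necessarily the approach of the cited work. The toy genotypes, the grouping of three variants, and the function names are hypothetical, and the normal approximation used here is rough for small carrier counts.

```python
from math import sqrt
from statistics import NormalDist


def carrier_flags(genotypes):
    """Collapse per-variant minor-allele counts (0/1/2) into one carrier flag per person."""
    return [1 if any(count > 0 for count in person) else 0 for person in genotypes]


def collapsing_test(case_genotypes, control_genotypes):
    """Two-proportion z-test on carrier status; returns (z, two-sided p-value)."""
    case_flags = carrier_flags(case_genotypes)
    control_flags = carrier_flags(control_genotypes)
    n1, n0 = len(case_flags), len(control_flags)
    x1, x0 = sum(case_flags), sum(control_flags)
    pooled = (x1 + x0) / (n1 + n0)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    if se == 0:
        return 0.0, 1.0  # no carriers, or everyone a carrier: nothing to test
    z = (x1 / n1 - x0 / n0) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical toy data: 200 cases and 200 controls, three rare variants
# assumed (for illustration only) to share a common function.
toy_cases = [[1, 0, 0]] * 12 + [[0, 1, 0]] * 10 + [[0, 0, 1]] * 8 + [[0, 0, 0]] * 170
toy_controls = [[1, 0, 0]] * 4 + [[0, 1, 0]] * 3 + [[0, 0, 1]] * 3 + [[0, 0, 0]] * 190

z_all, p_all = collapsing_test(toy_cases, toy_controls)
z_one, p_one = collapsing_test([[g[0]] for g in toy_cases], [[g[0]] for g in toy_controls])
print(f"collapsed group: z = {z_all:.2f}, p = {p_all:.4f}")
print(f"first variant alone: z = {z_one:.2f}, p = {p_one:.4f}")
```

In this toy example the collapsed score gives a stronger signal than any single variant, but only because all three variants were constructed to act in the same direction. If the grouping is wrong, the pooled signal is diluted, which is the concern raised above.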
Structural variants also need to be considered as an underlying cause of cancer. Several investigators have already performed association studies that capture copy number variants (CNVs), which can correspond to either common or rare alleles. So far, few strong associations with common CNVs have been found, although common CNVs can be in strong LD with common SNPs (minor allele frequency >0.1) [47], suggesting that the detection power for common CNVs is probably adequate. Conversely, a considerable number of rare CNVs have been proposed to be associated with various neuropsychiatric traits, including schizophrenia, mental retardation, autism, and epilepsy [48]. In all of these cases, the CNVs have been seen at a frequency of 0.2-1%, although they are exceedingly rare in the general population (generally ≤0.03%). No CNVs have yet been robustly associated with cancer phenotypes, but no large studies pursuing this avenue have yet been published in cancer patients.
Power considerations are also important for detecting CNV associations; for rare CNVs, only associations with very large effect sizes can currently be detected.
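The size of the effects involved can be gauged from the frequencies quoted for the neuropsychiatric CNVs. The sketch below computes the odds ratio implied by a carrier frequency of 0.2-1% in patients against 0.03% in the general population. Treating the general-population frequency as the comparison group, and taking 0.03% as a fixed value rather than an approximate upper bound, are simplifying assumptions; the function name is hypothetical.

```python
def carrier_odds_ratio(freq_cases, freq_population):
    """Odds ratio comparing carrier frequency in patients with that in the population."""
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_population = freq_population / (1.0 - freq_population)
    return odds_cases / odds_population


population_freq = 0.0003  # 0.03%, quoted above as a general upper bound
for case_freq in (0.002, 0.01):  # the quoted range of 0.2-1%
    implied = carrier_odds_ratio(case_freq, population_freq)
    print(f"{case_freq:.1%} in patients vs {population_freq:.2%}: implied OR of at least {implied:.1f}")
```

Because the population frequency is quoted as a ceiling, the implied odds ratios are lower bounds. They are far larger than the odds ratios of 1.10-1.20 typical of common SNPs, which is why such rare CNV associations are detectable at all.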