Genome-wide association studies (GWAS) using population-based designs have identified many genetic loci associated with risk of a range of complex diseases including cancer; however, each locus exerts a very small effect and most heritability remains unexplained. Family-based pedigree studies have also suggested tentative loci linked to increased cancer risk, often characterized by pedigree-specificity. However, a comparison between the results of population-based and family-based studies shows little concordance. Explanations for this unidentified genetic ‘dark matter’ of cancer include phenotype ascertainment issues, limited power, gene-gene and gene-environment interactions, population heterogeneity, parent-of-origin-specific effects, and rare and unexplored variants. Many of these reasons converge towards the concept of genetic heterogeneity that might implicate hundreds of genetic variants in regulating cancer risk. Dissecting the dark matter is a challenging task. Further insights can be gained from both population association and pedigree studies.
The advent of genome-wide association studies (GWAS) has revolutionized research on genetic determinants of risk for common diseases. Hundreds of associations of common genetic variants with extremely impressive P-values have been published in the past four years. However, the yield of associations has varied for different diseases and phenotypes, ranging from just one association discovered for pancreatic cancer to >25 for prostate cancer. As the dust settles after the first waves of enthusiasm, it is becoming evident that for many diseases, much of the genetic risk remains unexplained, representing the so-called ‘dark matter’ of genetic risk. Cancer is a prominent example where, despite many important successes in >50 GWAS published recently, this dark matter remains very prominent. Even with the current exciting pace of discovery, it seems unlikely that currently available samples and genotyping platforms will be able to explain this dark matter, unless additional breakthroughs and/or amendments in the strategy of discovery are adopted. As genotyping capacity evolves towards the feasibility of full sequencing in large population samples, there is debate about how to tackle the unexplained genetic dark matter.
These new insights and evolutions prompt a re-evaluation of our concepts about the genetic architecture of cancer. Are common cancers heritable diseases? What have we learnt from GWAS studies? How extensive is the dark matter? What are possible explanations for its presence? How could the dark matter be deciphered in future research? These are important questions that we address here.
Inherited cancer syndromes are associated with rare and highly penetrant monogenic mutations, but genetic factors also play a role in sporadic cancer, as reported in numerous family-based studies. The contribution of inherited factors has been quantified in modeling studies among twins. Although these studies are not fully consistent regarding the heritability (see glossary) of specific common cancers, overall they suggest that at least for some cancers the heritability is considerable.
Among >90 000 Swedish, Danish and Finnish twins, statistically significant effects of heritable factors were observed for prostate, colorectal and breast cancer, with inherited genetic differences among participants (heritability) accounting for 42%, 35%, and 27% of the phenotypic variance in the respective cancers; conversely, inherited factors played no statistically significant role for other types of cancer and were not involved in cervical cancer. For thyroid cancer, no concordant monozygotic or dizygotic pairs were observed.
A study of 9.6 million individuals in the Swedish family-cancer database showed that the risk of thyroid cancer has the highest contribution of heritable factors (53% of the variance explained). In that study, other cancers with a high heritability included those of the endocrine system (28%), testis (25%), breast (25%) and melanoma (20%), whereas a low-to-moderate contribution of hereditary factors was estimated for cancers of colon (13%), nervous system (13%), rectum (12%), non-Hodgkin lymphoma (10%), and lung (8%); prostate cancer was excluded from the analysis.
In common cancers, environmental risk factors play an important role that, in most if not all cases, exceeds that of genetics [2-4]. However, the genetic component is important to decipher for estimating the individual genetic risk of cancer, improving early prevention and diagnosis, and understanding the underlying biochemical pathways as a first step to design new cancer therapies. In this regard, the advent of GWAS kindled hopes that association analyses of common variants tagging the genome with markers of high density would reveal this genetic component of cancer risk (see Box 1 for methodological aspects).
Linkage originally referred to the physical proximity of loci along the chromosome, i.e., sufficiently close (physically connected) for their alleles to co-segregate within families. Genetic linkage analysis, which is aimed at identifying the chromosomal location of loci affecting a particular phenotype, is carried out in pedigrees and is based on analysis of recombination frequency between a disease locus and marker loci. Recombination between two loci is a function of their distance. Owing to the relatively small number of generations that might be available in a given pedigree and the consequent small number of possible recombinations, related individuals tend to share large regions of the genome inherited from the same founders. Thus, linkage analysis covering the complete genome can usually be done based on genotyping <1000 informative markers. Pedigree analysis can detect rare high-risk disease alleles and can provide formal proof of the genetic modulation of a given phenotype.
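The relationship between map distance, recombination and linkage evidence can be illustrated with a minimal sketch. It assumes Haldane's map function (no crossover interference) and the simplest possible setting of phase-known, fully informative meioses; real pedigree likelihoods are considerably more involved, and the function names below are illustrative only.

```python
import math


def haldane_recombination_fraction(distance_morgans: float) -> float:
    """Recombination fraction for a map distance in Morgans.

    Assumes Haldane's map function, i.e. crossovers occur independently.
    """
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """LOD score for phase-known, fully informative meioses (simplified case).

    Compares the likelihood of the data at recombination fraction `theta`
    with the likelihood under free recombination (theta = 0.5).
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    non_recombinants = meioses - recombinants
    log_lik_theta = (recombinants * math.log10(theta)
                     + non_recombinants * math.log10(1.0 - theta))
    log_lik_null = meioses * math.log10(0.5)
    return log_lik_theta - log_lik_null


if __name__ == "__main__":
    theta = haldane_recombination_fraction(0.10)  # marker 10 cM from the locus
    print(f"theta = {theta:.3f}")
    print(f"LOD   = {lod_score(recombinants=1, meioses=12, theta=theta):.2f}")
```

Because each informative meiosis contributes only a fraction of a LOD unit, small pedigrees accumulate linkage evidence slowly, which is consistent with the modest LOD scores typical of single-family studies.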
Association analysis is carried out at the general population level in unrelated individuals and is based on the assumption that individuals sharing a particular phenotype, e.g., cancer, also share the disease allele originating from a common ancestral founder and causing or modulating the phenotype. Mapping of loci affecting a phenotype by association is based on the existence of linkage disequilibrium (LD) between the disease allele and marker alleles. Association analysis can identify disease-risk alleles only when such alleles show strong LD with marker alleles. Because LD decays very rapidly with distance in unrelated individuals, association analysis requires a much higher marker density (i.e. >100 000) than genetic linkage analysis and it is not suited to detect the role of rare variants.
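A short sketch can make the notion of LD and its decay concrete. It uses the standard definitions of the disequilibrium coefficient D and the squared correlation r², together with the textbook result that D shrinks by a factor (1 − θ) per generation in a large random-mating population; the haplotype frequencies in the example are invented for illustration.

```python
def ld_r_squared(freq_ab: float, freq_a: float, freq_b: float) -> float:
    """Squared correlation (r^2) between alleles A and B at two loci.

    freq_ab is the frequency of the A-B haplotype; freq_a and freq_b are the
    marginal allele frequencies (each strictly between 0 and 1).
    """
    d = freq_ab - freq_a * freq_b  # disequilibrium coefficient D
    return d * d / (freq_a * (1.0 - freq_a) * freq_b * (1.0 - freq_b))


def ld_after_generations(d0: float, theta: float, generations: int) -> float:
    """Expected D after random mating, given recombination fraction theta."""
    return d0 * (1.0 - theta) ** generations


if __name__ == "__main__":
    # Hypothetical marker in partial LD with a disease allele.
    print(f"r^2 = {ld_r_squared(freq_ab=0.15, freq_a=0.20, freq_b=0.30):.3f}")
    # Tightly linked (theta = 0.001) versus loosely linked (theta = 0.05) loci.
    for theta in (0.001, 0.05):
        d_now = ld_after_generations(d0=0.09, theta=theta, generations=100)
        print(f"theta = {theta}: D after 100 generations = {d_now:.5f}")
```

Over the many generations separating unrelated individuals from a common founder, only very closely spaced markers retain appreciable LD with a disease allele, which is why association mapping needs far denser marker panels than linkage analysis.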
As of December 18, 2009, the NHGRI catalog of GWAS lists 446 GWAS for different types of diseases or common phenotypes, including 2097 GWAS-discovered associations with any P-value of 10⁻⁵ or less (http://www.genome.gov/26525384). More than 50 GWAS have evaluated cancer phenotypes. Some recurring themes are becoming clear from these studies: the few variants discovered in each GWAS, the small effect sizes of the identified variants, and the relative lack of overlap in variants discovered by different GWAS investigations on the same phenotype.
Each of the cancer GWAS has detected one or a few loci associated with the risk of the particular cancer type and most, if not all, of these putative cancer risk loci showed small effects; that is, the per-allele odds ratios of most of them were ≤1.4, and usually <1% of the phenotypic variance could be attributed to a single locus.
Small genetic effects were observed not only for pathologic phenotypes with relatively low heritability, such as common cancers, but also for physiological phenotypes characterized by high heritability, such as height. As an illustrative comparison, human height, a complex trait with heritability estimated to be ~80%, has been investigated in three GWAS involving analysis of hundreds of thousands of single nucleotide polymorphisms (SNPs) in ~63 000 people (for a review, see Ref ). Although the studies identified 54 loci affecting height variation in the population, these variants combined accounted for ~5% of height heritability. Moreover, the overlap among the three GWAS was quite poor, with only four loci identified in all three studies. Although the genetic effects for height can also be modulated by environmental exposures (e.g. malnutrition), for well-nourished populations, this is probably less of a concern. However, cancer risk is a much more complex phenotype than height. Beyond the possibility of incomplete penetrance of the inherited cancer susceptibility alleles, exposure to environmental risk factors could markedly modify genetic predisposition. Depending on environmental exposures, an individual at high genetic risk of developing cancer might never be affected, whereas an individual at low genetic risk for cancer might experience the disease.
Another generic feature of GWAS investigations is that different GWAS often discover different loci for the same phenotype. This leads to an accumulation of more discovered variants as more GWAS investigations are performed. The lack of overlap among different GWAS does not necessarily mean lack of replication. An aspect to consider is the limited power to detect small effects at the stringent required levels of genome-wide significance, even with large studies. Thus, variants that pass stringent genome-wide significance thresholds, such as P<10⁻⁸, are thought to be real [7, 8], whereas variants associated with substantially more modest P-values (e.g. 10⁻⁵ to 10⁻⁶) might represent false positives; not surprisingly the majority of such variants will fail to be replicated when tested in further samples. For example, in GWAS a variant with P = 10⁻⁵ is <1% likely to represent a true association [9, 10].
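Why a modest P-value so often marks a false positive can be illustrated with a simple Bayesian calculation. The sketch below is not the method of the cited references; it assumes a hypothetical prior probability that a tested variant is truly associated and a hypothetical power, and treats the result only as "significant at threshold alpha". All numerical inputs are assumptions chosen for illustration.

```python
def posterior_true_association(prior: float, power: float, alpha: float) -> float:
    """Posterior probability that a 'significant' variant is truly associated.

    prior: assumed prior probability of a true association for one variant
    power: assumed probability of reaching the threshold if the effect is real
    alpha: significance threshold (probability of reaching it under the null)
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * (power / alpha)
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Hypothetical inputs: one in a million tested variants is truly
    # associated, and small effects give only 10% power.
    for alpha in (1e-5, 1e-8):
        post = posterior_true_association(prior=1e-6, power=0.10, alpha=alpha)
        print(f"alpha = {alpha:g}: posterior = {post:.3f}")
```

Under these assumed inputs, tightening the threshold by three orders of magnitude moves the posterior from a small fraction to a clear majority, which mirrors the distinction drawn above between variants at P<10⁻⁸ and those at 10⁻⁵ to 10⁻⁶. Different priors or power values would shift the exact figures.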
With these general principles in mind, we review some of the major successes of GWAS in identifying loci related to the risk of specific cancers, so as to illustrate both the extent of progress made and the remaining caveats. For illustrative purposes, we discuss two cancers where GWAS have revealed a substantial number of new loci (breast cancer and colorectal cancer) and two others where GWAS have revealed few loci, one where heritability seems to be limited (lung cancer) and another where heritability is considered to be more prominent (thyroid cancer).
Several GWAS have identified multiple common genetic variants influencing breast cancer risk. Easton et al. identified five loci mapping to 10q26, 16q12.1, 5q11.2, 8q24 and 11p15.5. Based on the NHGRI catalog, at a P<10⁻⁵ threshold, 14 additional regions were associated with breast cancer in subsequent GWAS publications [12-17], only four of which (2q35, 5q11.2, 10q26.13 and 16q12.1) were seen in two or more GWAS [12, 14, 17] (Figure 1a). A recent study found strong evidence for additional susceptibility loci on 3p and 17q by testing >800 promising associations derived from a previous GWAS. Finally, combined analysis of suggestive loci using three published GWAS led to the identification of an additional locus on 5p12 that seemed to be associated specifically with estrogen-receptor positive breast cancer. Claims for associations with specific subsets need even more careful replication, as subgroup differences can be spurious. For example, although the 2q35 locus was originally proposed to be associated specifically with estrogen-receptor positive breast cancer, a recent study found similar effects regardless of estrogen-receptor status. All of the identified variants have small effects (per allele odds ratios <1.41, some even <1.10) and explain less of the heritability of breast cancer than the previously known breast cancer 1, early onset (BRCA1) and breast cancer 2, early onset (BRCA2) mutations can explain, perhaps with the exception of a fibroblast growth factor receptor 2 (FGFR2) common variant, where the contribution to explaining risk might be of similar magnitude as that of BRCA1/2.
One may argue that even limited information can be informative in selected populations, where decision-making (e.g. whether mammography should be performed or not) might depend on slight modifications of risk [21, 22]. Nevertheless, much of the genetic component of breast cancer risk currently remains uncharacterized and is thought to arise from combinations of common low-penetrance variants. These might interact with environmental exposures to cause disease risk. However, these environmental exposures remain elusive, as many of the previously proposed environmental and lifestyle risk factors (e.g. nutrition) for breast cancer have been refuted in large studies in the last decade.
Most of the heritability of colorectal cancer (CRC) cannot be explained by monogenic syndromes caused by high-risk germ-line mutations in adenomatous polyposis coli (APC) or mismatch repair genes, which account for <5% of colorectal cancer cases. GWAS, carried out in large series including either sporadic colorectal cancer cases or colorectal cancer cases with a family history, have identified multiple loci at which common variants can influence the risk of developing CRC. When limited to loci included in the NHGRI catalog (threshold P<10⁻⁵), ten different susceptibility loci have been detected by GWAS [25-30] and by a meta-analysis of GWAS: these include loci at 8q24.21, 18q21.1, 15q13.3, 11q23.1, 8q23.3, 10p14, 19q13.11, 20p12.3, 14q22.2 and 16q22.1 (Figure 1b). In all of these loci, the best SNP markers exhibit very modest odds ratios for colorectal cancer predisposition (range 1.10-1.26). Most likely because of the limited power to detect such modest effects, few loci (8q24 and 18q21) have been found consistently by several GWAS [27-29]. Cumulatively, these variants explain only a very small fraction of colorectal cancer risk. It remains unclear how these or other risk variants might interact with environmental exposures (e.g. red meat intake) to modulate colorectal cancer risk.
Lung cancer is the prototype of a malignancy where environmental rather than genetic factors are apparently far more important. Three GWAS confirmed only one locus on chromosome 15q25, where nicotinic acetylcholine receptor genes map, as associated with lung cancer risk with an odds ratio of about 1.3 (for a review, see Ref ). However, the same locus is strongly associated with the main environmental risk factor for lung cancer (i.e. smoking), whose association with lung cancer is much greater, with odds ratios >20 for ~20 cigarettes per day. It is possible that the effect of the 15q25 locus on lung cancer risk is mediated entirely through its influence on nicotine dependence. Another two risk loci mapping to chromosomes 5p15 and 6p21 were detected by combining data from several GWAS [35-37]. Their effects on cancer risk were even smaller, corresponding to odds ratios of 1.15-1.24 for the best-associated markers in these loci.
Although thyroid cancer displays the highest heritability among solid tumors (up to 53% in the Swedish family-cancer database study), little is known about genetic variants affecting the risk of this cancer. A GWAS for thyroid cancer detected only two loci (9q22.33 and 14q13.3) where SNPs reached genome-wide significance. The per allele odds ratios were greater than those seen for most other common variants in association with cancer risk (1.75 and 1.37, respectively), but with only two variants available, the proportion of the variance explained is still limited. The GWAS used only 192 cases in the discovery stage, and thus the power is extremely limited despite the availability of over 30 000 controls; one might speculate that several more variants would be discovered if the sample size could be enlarged.
Although we know that genetic determinants of risk exist, we cannot explain the majority of this risk through specific identified genetic variants. Numerous hypothetical arguments have been proposed to explain this dark matter, as summarized in Table 1 [1, 39, 40].
The ability of GWAS to detect associations with common SNPs can be reduced if the phenotypes are poorly or inconsistently defined and ascertained, and/or if controls are also suboptimally screened for exclusion of disease. Even with large sample sizes of several thousand cases and controls, there is usually limited power to detect alleles of modest effect sizes (odds ratios of 1.20), and minimal power to detect risk allele odds ratios of <1.10 even for very common variants. Power is also limited to detect epistatic interactions of multiple modest effect genes. The detection of gene-environment interactions is hampered not only by limited power, but also by the lack of concurrent availability of both genetic and high-quality, standardized, and consistently collected environmental exposure data. Residual population stratification or genotyping error can also lead to the attrition of some associations, although current GWAS investigations have dramatically improved study performance on these fronts. The genetic architecture can differ substantially across different populations, and most GWAS to date have targeted European-descent populations, while there is some evidence that different loci can emerge and the implicated haplotype blocks and strength of association can vary when populations of other ancestry are examined. Moreover, despite generally good coverage of the whole genome in currently used genotyping platforms, some areas of the genome are still imperfectly covered and thus variants lurking in these areas would remain undiscovered. This is more of a concern for African-descent populations than for those of European descent. Finally, parent-of-origin-specific genetic effects would have been largely missed with the current mode of association analysis in most GWAS investigations.
For example, the association of rs157935[T] at 7q32 with the risk of cutaneous basal cell carcinoma seems to depend on the parent of origin of the risk allele, estimated in silico.
In particular, the hypothesis of insufficient sample size, a potential limiting factor in the statistical power to detect weak genetic loci, has gathered considerable supporting evidence based on GWAS conducted to date and deserves some more elaboration. The general rule has been that the larger the sample size, the greater the yield of newly discovered loci. The pattern of discoveries to date, in terms of the minor allele frequency and odds ratio of the risk alleles that emerge, could be largely explained based on power considerations alone. This is leading research groups worldwide to assemble huge numbers of patients and controls to conduct GWAS on increasingly large series. As the technology becomes cheaper, performing GWAS with the whole genotyping platform on very large samples (e.g. 100 000 cases and as many controls), instead of using multi-stage designs, might become feasible and cost-efficient. However, the practical issues in accumulating such a huge number of cases and controls are not easy to solve. For the most common cancers such as breast cancer and colorectal cancer, consortia with sample sizes in the range of up to 30 000 cases and as many controls are already in place, and enrollment of additional teams will be able to increase this further. For less common cancers, it will be a challenge to obtain such numbers, even if new large population cohorts are established. Moreover, it is possible that even with 100 000 cases and 100 000 controls, the proportion of variance explained by the discovered variants, cumulatively, might still not exceed 20%. Whether ever larger cohorts provide the main solution to the identification of the elusive dark matter is a topic of debate in the community. One view is that many additional common variants are unlikely to be discovered or are not worth discovering, and that most genetic control is caused by variants that are not represented at all in the current studies.
This is an interesting speculation, but there are still no data to support it or refute it.
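The dependence of discovery yield on sample size can be illustrated with an approximate power calculation. The sketch below assumes a simple allelic test (a two-sample comparison of allele frequencies under a normal approximation), a per-allele odds ratio applied to the control allele frequency, and a genome-wide threshold of 5 × 10⁻⁸, a widely used convention that is taken here as an assumption. The allele frequency of 0.30 and the function names are illustrative.

```python
from statistics import NormalDist


def case_allele_frequency(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def allelic_test_power(n_cases: int, n_controls: int, control_freq: float,
                       odds_ratio: float, alpha: float = 5e-8) -> float:
    """Approximate power of a two-sided allelic test (normal approximation).

    Each individual contributes two alleles; alpha is the significance
    threshold (default: an assumed genome-wide level of 5e-8).
    """
    std_normal = NormalDist()
    p0 = control_freq
    p1 = case_allele_frequency(p0, odds_ratio)
    standard_error = (p1 * (1.0 - p1) / (2 * n_cases)
                      + p0 * (1.0 - p0) / (2 * n_controls)) ** 0.5
    z_effect = abs(p1 - p0) / standard_error
    z_critical = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return (std_normal.cdf(z_effect - z_critical)
            + std_normal.cdf(-z_effect - z_critical))


if __name__ == "__main__":
    # Equal numbers of cases and controls; risk-allele frequency 0.30.
    for odds_ratio in (1.20, 1.10):
        for n in (2_000, 30_000, 100_000):
            power = allelic_test_power(n, n, control_freq=0.30,
                                       odds_ratio=odds_ratio)
            print(f"OR = {odds_ratio:.2f}, n = {n:>7}: power = {power:.3f}")
```

The calculation ignores genotyping error, imperfect LD between marker and causal variant, and multi-stage designs, all of which would lower the achievable power further. Rarer alleles or weaker effects shift the curve towards the very large sample sizes discussed above.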
Rare variants are an obvious contender for the source of the missing heritability. By default, variants with frequency <5-10% are usually excluded from current GWAS analyses. Many rare variants (especially those with minor allele frequencies of 0.5-5%) could be captured using full sequencing, if a sufficient number of individuals are sequenced. Indeed, the 1000 Genomes Project (http://www.1000genomes.org/page.php) aims to identify rare variants and thus facilitate association studies. However, unless the effects that they convey are large, the power to detect them would be practically zero. Various analytical approaches have been proposed to improve power, typically generating composite scores by merging many rare variants that might share a common function. However, such merging usually provides speculative results, and no clear functional evidence exists to perform the grouping of variants with certainty.
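The idea of merging rare variants into a composite score can be sketched with the simplest form of collapsing: each individual is scored as a carrier if any rare allele is present in the chosen group of variants, and carrier counts are compared between cases and controls. This is only one of several proposed approaches and is not taken from the cited work; the grouping of variants is assumed to be given, and the one-sided Fisher exact test is used purely for illustration.

```python
from math import comb


def carrier_flags(genotypes):
    """Collapse per-variant rare-allele counts into one carrier flag each.

    genotypes: one list per individual, holding rare-allele counts (0, 1 or 2)
    for every variant in the assumed functional group.
    """
    return [any(count > 0 for count in person) for person in genotypes]


def fisher_one_sided(case_carriers, case_noncarriers,
                     control_carriers, control_noncarriers):
    """One-sided Fisher exact P-value for an excess of carriers among cases."""
    n_cases = case_carriers + case_noncarriers
    n_carriers = case_carriers + control_carriers
    n_total = n_cases + control_carriers + control_noncarriers
    upper = min(n_cases, n_carriers)
    # Hypergeometric tail: probability of at least this many case carriers.
    tail = sum(comb(n_cases, k) * comb(n_total - n_cases, n_carriers - k)
               for k in range(case_carriers, upper + 1))
    return tail / comb(n_total, n_carriers)


def burden_test(case_genotypes, control_genotypes):
    """Collapsing burden test on carrier status (illustrative sketch)."""
    case_flags = carrier_flags(case_genotypes)
    control_flags = carrier_flags(control_genotypes)
    a = sum(case_flags)
    c = sum(control_flags)
    return fisher_one_sided(a, len(case_flags) - a, c, len(control_flags) - c)


if __name__ == "__main__":
    # Toy data: three rare variants in a hypothetical gene.
    cases = [[1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0]]
    controls = [[0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
    print(f"P = {burden_test(cases, controls):.3f}")
```

The result depends entirely on which variants are grouped together, which is the practical weakness noted above: without functional evidence for the grouping, neutral variants dilute the signal and risk and protective alleles can cancel each other out.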
Structural variants also need to be considered as an underlying cause of cancer. Several investigators have already performed association studies that capture copy number variants (CNVs), which can correspond to either common or rare alleles. So far, strong associations with common CNVs are limited, although common CNVs can be in strong LD with common SNPs (frequencies of the minor allele >0.1), suggesting that the detection power for common CNVs is probably adequate. Conversely, a considerable number of rare CNVs have been proposed to be associated with various neuropsychiatric traits, including schizophrenia, mental retardation, autism, and epilepsy. In all of these cases, the CNVs have been seen at a frequency of 0.2-1% among affected individuals, although they are exceedingly rare in the general population (generally ≤0.03%). No CNVs have yet been associated robustly with cancer phenotypes, but no large studies have yet been published in cancer patients to pursue this avenue.
Power considerations are also important for the detection of CNVs in association studies: for rare CNVs, only associations with very large effect sizes can currently be detected.
Genetic linkage analysis in pedigrees containing multiple affected members can complement association analyses. Such studies have been traditionally hampered by their relatively small size and much lower marker density as compared with recent GWAS. Despite the relatively low power, results of pedigree analyses can provide strong and convincing indications of genetic effects, since they are based on genetic transmission of disease alleles within a family and thus do not have to make the population assumptions of association analyses. How do results from family-linkage studies compare with those of GWAS?
Inherited mutations in the two major susceptibility genes for breast cancer, namely, BRCA1 and BRCA2, lead to a high risk of breast cancer, but account for only about 20% of familial breast cancer. Other genes that have fairly robust evidence for conferring susceptibility to familial breast cancer through inherited mutations include ataxia telangiectasia mutated (ATM) and tumor protein p53 (TP53); moreover, a 1100delC variant in CHK2 checkpoint homolog (Schizosaccharomyces pombe) (CHEK2) with minor allele frequency close to 1% has strong evidence for increasing breast cancer risk, and even more so familial cancer risk [50, 51]. All these known mutations still explain only a minority of familial breast cancer and probably <5% of all breast cancer risk.
Several linkage studies have reported candidate regions containing breast cancer susceptibility genes. However, the logarithm of the odds (LOD) scores obtained for these regions were either not significant or of borderline statistical significance, and the percentage of families putatively linked to each region was low. In a recent linkage study in Spanish breast cancer families, three regions of interest, located on 3q25, 6q24, and 21q22, were observed. Overall, 20 distinct putative breast cancer susceptibility loci have been proposed, but these do not overlap among studies and independent loci cluster in each family (Table 2).
One potential explanation for these results is genetic heterogeneity, with several putative breast cancer susceptibility genes playing an important role in the genetic risk of breast cancer but relevant only in a small number of families. The effects can be large in the specific families, but they would be completely lost once diluted in a large association population sample.
Hereditary non-polyposis colorectal cancer (HNPCC), or Lynch syndrome, is a hereditary condition that predisposes to colorectal cancer. Inherited mutations causing defects in the DNA mismatch repair machinery and occurring at different genes, such as mutL homolog 1, colon cancer, nonpolyposis type 2 (E. coli) (MLH1), mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli) (MSH2), mutS homolog 6 (E. coli) (MSH6), PMS2 postmeiotic segregation increased 2 (S. cerevisiae) (PMS2) and possibly mutL homolog 3 (E. coli) (MLH3) and PMS1 postmeiotic segregation increased 1 (S. cerevisiae) (PMS1), are the main cause of HNPCC. However, microsatellite instability or alterations in expression of DNA mismatch repair proteins (i.e., markers of inactivating mutations at these genes) are not identified in ~60% of HNPCC families fulfilling the Amsterdam criteria of HNPCC. This suggests the existence of genetic variations at as yet unidentified genes, leading to a possible family-specific autosomal dominant trait of HNPCC risk. In addition, many families with multiple CRC cases but not completely fulfilling the Amsterdam criteria have been described [55, 56], suggesting a wider spectrum of familial disease.
Linkage studies in colorectal cancer families with no evidence of deficiency in DNA mismatch repair genes have mapped putative loci on several chromosomes (Table 2). Among these loci, most have very modest LOD scores and might well represent false-positives, but those on chromosome 3q21.1-q26.2 and 9q22.32-31.1 have shown the most consistent findings [57-59] (Table 2). Moreover, specific and different loci have been detected not only in different studies but also in single pedigrees within the same study (Table 2), providing another example of genetic heterogeneity.
Lung cancer occurs mostly in a sporadic form, although several familial clusters have been reported. Genetic linkage analysis of families with multiple cases of lung cancer detected a locus influencing lung cancer risk on chromosome 6q23-25. Statistically significant linkage (LOD = 4.26) was observed in a subset of 23 families with five or more affected individuals in two or more generations, out of a total of 52 families. Linkage heterogeneity was detected, as no linkage was found in 14 families with only three affected relatives. No other loci were detected, suggesting that a single, relatively weak locus might affect familial lung cancer.
Estimates indicate that ~5% of non-medullary thyroid carcinoma (NMTC) is familial. Familial NMTC is often characterized by an earlier age of onset, higher aggressiveness, and more frequent multifocal disease and recurrence as compared to sporadic NMTC. Linkage studies in pedigrees of different geographical origin identified loci on chromosomes 1q21, 2q21, 8p23.1-p22, and 19p13.2 [61-64]. Typically, a single locus was detected in each pedigree, and LOD scores were quite low, ranging from 3.01 to 4.41, pointing to weak effects and to the possibility of genetic heterogeneity.
Although larger population-based GWAS will continue to be an important avenue to pursue in identifying more risk variants, large family-based case-control studies might represent an alternative design that incorporates the advantages of studying sporadic cancer, of avoiding problems with population structure, and of analyzing hundreds of thousands of genetic markers as in the case of population-based GWAS. However, collecting family-based samples might be more difficult than collecting a series of unrelated cases and controls, and the matching of transmitted with nontransmitted alleles from the same family might reduce statistical power (Table 1). Nevertheless, if properly collected, this information would be valuable, especially if effect sizes are much stronger within pedigrees, which could cut sample size requirements considerably.
Pedigree and family-based studies might be best suited to detect gene-gene or gene-environment interactions, because the ability of such detection depends not only on the size of the study population and the number of examined polymorphisms but even more so on the accuracy of the phenotype and of the environmental exposure measurements. Evaluation of environmental exposures might be more uniform and standardized in the setting of pedigrees where subjects have been born and have lived in the same location (Table 1).
Parent-of-origin-specific effects can suitably be dissected by pedigree studies, whereas population-based studies usually take no parental genotypes or family structure into account. In family-based association studies, parent-of-origin effects can also be detected when the design of the study includes genetic information from the parents (Table 1).
If rare variants are a hallmark of cancer risk, pedigrees but not family-based association studies would be expected to be particularly enriched in such rare variants. Also, pedigree studies do not suffer from problems related to different populations or ethnicity (Table 1).
Pedigree-based and population-based GWAS can also be combined (e.g. by weighting loci in GWAS differentially based on pedigree-derived signals). This would reduce the stringency of the P-values required for loci supported by pedigree evidence. Such approaches would benefit from verification of their efficiency and discovery yield with appropriate replication studies.
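One way such weighting could work is a weighted Bonferroni correction, sketched below. The text does not specify a weighting scheme, so this is an assumed illustration: variants lying in a pedigree-derived linkage region receive a larger weight, weights are rescaled to average 1, and each variant is tested at alpha × weight / m. The boost factor of 10 and the function name are hypothetical.

```python
def weighted_thresholds(in_linkage_region, alpha=0.05, boost=10.0):
    """Per-variant significance thresholds under weighted Bonferroni.

    in_linkage_region: one boolean per tested variant, True if the variant
        lies under a pedigree-derived linkage signal.
    alpha: family-wise error rate to be controlled across all variants.
    boost: assumed relative up-weighting of variants in linkage regions.
    """
    raw = [boost if flagged else 1.0 for flagged in in_linkage_region]
    mean_weight = sum(raw) / len(raw)
    weights = [w / mean_weight for w in raw]  # rescaled to average 1
    m = len(weights)
    return [alpha * w / m for w in weights]


if __name__ == "__main__":
    # Toy example: 10 variants, the first under a linkage peak.
    flags = [True] + [False] * 9
    thresholds = weighted_thresholds(flags)
    print(f"unweighted threshold:      {0.05 / len(flags):.5f}")
    print(f"variant in linkage region: {thresholds[0]:.5f}")
    print(f"variant elsewhere:         {thresholds[1]:.5f}")
```

Because the thresholds still sum to alpha, the overall error rate is preserved: the relaxed threshold for prioritized loci is paid for by a slightly stricter threshold everywhere else. The approach gains power only if the pedigree signals are informative, which is why verification in replication studies matters.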
Overall, comparison of the results from GWAS and pedigree studies shows hardly any overlap for breast, colon, lung and thyroid cancer (Figure 1 and Table 2). The relative success of GWAS versus familial approaches in discovering loci varies across cancers. Going beyond the examples of the four cancers that we discussed in detail above, the cancer for which we currently have the largest number of discovered loci through GWAS is prostate cancer, with multiple GWAS identifying more than 25 independent loci [65-67], albeit all of them with very small effects. Conversely, for prostate cancer no genes with strong evidence have been identified through familial approaches, and our knowledge of environmental and lifestyle determinants also remains very rudimentary. At the other end of the spectrum, for testicular cancer, GWAS have identified fewer genes, but several of them have relatively strong effects [68, 69], although family studies demonstrated no statistically significant regions of linkage.
To some extent, the lack of overlap might simply mean that associations emerging from GWAS with modest P-values and modest LOD scores emerging from family-based linkage studies could well represent false-positives. However, false-positives are unlikely to represent the whole story. Numerous GWAS signals are definitely genuine, and several of the linkage signals have considerable credibility. One possibility is that many of the loci identified by pedigree analysis would not be detectable in population-based studies, if they are specific for each pedigree. Thus genetic heterogeneity might play an important role in genetic predisposition to different cancer types.
Although the idea that genetic heterogeneity can hamper the detection of loci relevant for complex disease is not new, insights from recent studies allow us to have a better grasp of its possible manifestations and implications for cancer heritability and to speculate on the extent of its complexity. The accumulating data suggest that we can exclude with confidence the scenario where one or a few loci alone can explain the majority of the genetic risk for any common cancer. Regardless of frequency (i.e. common, rare, very rare) and type (i.e. SNP or CNV), it is likely that hundreds or even thousands of genetic variants are implicated in cancer risk. One could envision a model, based on the example of colorectal cancer, in which genetic heterogeneity plays a major role in both familial cancer due to monogenic mutations and in polygenic inheritance of sporadic cancer, modulated by a complex architecture of a multiplicity of genetic loci (Figure 2). The relative contribution of low-penetrance low-risk common variants and high-penetrance, high-risk uncommon/rare variants is unknown and it might vary from one type of cancer to another and between sporadic and familial disease (acknowledging that in some cases, the exact boundaries of the definition of familial disease might still be unclear).
Moreover, it is possible that in some situations low-penetrance, low-risk common variants identified through GWAS could simply be markers of high-penetrance, high-risk uncommon/rare variants with which they are in linkage disequilibrium. These linked variants might be in the same genetic locus, but linkage can sometimes extend to very distant positions. The coexistence of mutations and common markers of risk in the same gene has been extensively documented in GWAS for some metabolic traits such as lipid levels. The extent of this coexistence in cancer is less well documented, given the relatively small number of bona fide mutated genes with familial risk that have been conclusively identified to date, but some evidence already exists (e.g. for pigmentation-related genes and melanoma).
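The relationship between a common marker and a rare variant on the same haplotype can be made concrete with the standard pairwise linkage disequilibrium measures D, D' and r². The following is a minimal sketch only: the function name `ld_stats` and all allele and haplotype frequencies are hypothetical values chosen for illustration, not data from any study discussed here.

```python
def ld_stats(p_rare, p_marker, p_both):
    """Return (D, D_prime, r2) for two biallelic sites.

    p_rare   : frequency of the rare risk allele (hypothetical input)
    p_marker : frequency of the common marker allele (hypothetical input)
    p_both   : frequency of haplotypes carrying both alleles
    Assumes both allele frequencies lie strictly between 0 and 1.
    """
    # D: excess of the joint haplotype over its expectation under independence
    d = p_both - p_rare * p_marker
    # D' rescales D by the largest magnitude it could take given the allele frequencies
    if d >= 0:
        d_max = min(p_rare * (1 - p_marker), (1 - p_rare) * p_marker)
    else:
        d_max = min(p_rare * p_marker, (1 - p_rare) * (1 - p_marker))
    d_prime = d / d_max if d_max > 0 else 0.0
    # r^2: squared correlation between the two alleles
    r2 = d * d / (p_rare * (1 - p_rare) * p_marker * (1 - p_marker))
    return d, d_prime, r2


# Illustrative scenario: a rare allele (1%) found only on haplotypes
# that carry a common marker allele (30%).
d, d_prime, r2 = ld_stats(p_rare=0.01, p_marker=0.30, p_both=0.01)
print(f"D = {d:.4f}, D' = {d_prime:.2f}, r^2 = {r2:.3f}")
```

With these illustrative frequencies, D' equals 1 (the rare allele never occurs without the marker) while r² is only about 0.02. A marker of this kind would be expected to carry only a weak population-level signal even if the underlying rare variant had a strong effect, which is consistent with the scenario described above.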
Overall, the recognition of genetic heterogeneity should be seen as an opportunity rather than a problem in genetic epidemiology. Indeed, the analysis of candidate genes identified in population-based and in pedigree studies could make it possible to trace the biochemical pathways affecting a specific type of cancer. This could reduce the extreme multi-dimensionality of the genetic architecture to fewer pathways. For example, although several independent loci cause Lynch syndrome, the known germ-line mutations are all observed in genes involved in the control of DNA repair. Moreover, given the complexity of regulation of gene expression, variants of many different independent loci can bind to the same regulatory elements and affect expression of the same target gene.
The identification of common pathways underlying genetic heterogeneity might result in the design of new tests for early diagnosis that are based on biochemical targets of the causal loci modulating the disease. In addition, identification of such pathways could enable the design of new therapeutic strategies to control cancer outcome.
Improvements on the sequencing front might also change our understanding of the genetic architecture. Family trees, for example, can be successfully analyzed with exome sequencing, the approach used recently to understand the genetics of Miller syndrome, a rare Mendelian disorder. It will be interesting to see to what extent, if any, various cancers might be a conglomerate of a large number of rare Mendelian syndromes that can be dissected with a similar sequencing approach.
However, while such advances are desirable, the difficulties of achieving them should not be understated. If genetic heterogeneity is extreme, then cancer will become an example of ‘private epidemiology’, where what happens to one individual with cancer is not applicable to others. As we add more, even if incremental, discoveries, we should be able to gain increasing insights about the complex puzzle of cancer genetic architecture.
This work was funded in part by grants from Associazione and Fondazione Italiana Ricerca Cancro (AIRC and FIRC), and Fondo Investimenti Ricerca di Base (FIRB), Italy, to TAD. Scientific support for this project to JPAI was provided through the Tufts Clinical and Translational Science Institute (Tufts CTSI) under funding from the National Institutes of Health/National Center for Research Resources (UL1 RR025752). Points of view or opinions in this paper are those of the authors and do not necessarily represent the official position or policies of the Tufts CTSI.