By early 2006, the tools were in place and studies were under way in many laboratories to resolve the hotly debated issue (75) of whether genetic mapping of common SNPs would shed light on common disease. Since then, scores of publications have reported the localization of common SNPs associated with a wide range of common diseases and clinical conditions (age-related macular degeneration, type 1 and type 2 diabetes, obesity, inflammatory bowel disease, prostate cancer, breast cancer, colorectal cancer, rheumatoid arthritis, systemic lupus erythematosus, celiac disease, multiple sclerosis, atrial fibrillation, coronary disease, glaucoma, gallstones, asthma, and restless leg syndrome) as well as various individual traits (height, hair color, eye color, freckles, and HIV viral set point). Figure 3 illustrates data from a paradigmatic genome-wide association study of Crohn’s disease performed by the Wellcome Trust Case Control Consortium.
Fig. 3. GWAS for Crohn's disease. The panels show data from the study of Crohn's disease by the Wellcome Trust Case Control Consortium. (A) Significance level (P value on log10 scale) for each of the 500,000 SNPs tested across the genome. SNP locations reflect ...
Various lessons have already emerged about genetic mapping by GWAS:
1) GWASs work. Before 2006, only about two dozen reproducible associations outside the HLA locus had been discovered (25). By early 2008, more than 150 relationships were identified between common SNPs and disease traits (table S1). In most diseases studied, GWASs have revealed multiple independent loci, although some traits have not yet yielded associations that meet stringent thresholds (e.g., hypertension). It is not clear whether this reflects inadequate sample size, phenotypic definition, or a different genetic architecture.
2) Effect sizes for common variants are typically modest. In a few cases, common variants with effects of a factor of ≥2 per allele have been found: APOE4 in Alzheimer’s disease (23), CFH in age-related macular degeneration (77), and LOXL1 in exfoliative glaucoma (80). In the vast majority of cases, however, the estimated effects are much smaller, mostly increases in risk by a factor of 1.1 to 1.5 per associated allele.
3) The power to detect associations has been low. Given the effect sizes now known to exist, and the need to exceed stringent statistical thresholds, the first wave of GWASs provided low power to discover disease-causing loci (81). For example, achieving 90% power to detect an allele with 20% frequency and a factor of 1.2 effect at a statistical significance of 10⁻⁸ requires 8600 samples. Thus, although it is unlikely that common alleles of large effect have been missed, GWASs of hundreds to several thousand cases have necessarily identified only a fraction of the loci that can be found with larger sample sizes. This prediction has been empirically confirmed in T2D (83), serum lipids (84), Crohn’s disease (86), and height (87). Across these four traits and diseases, individual GWASs together documented 29 associations. Increasing the power by pooling the samples to perform meta-analysis and replication genotyping has increased this yield to more than 100 replicated loci for these four conditions.
4) Association signals have identified small regions for study but have not yet identified causal genes and mutations. Genetic mapping is a double-edged sword: Local correlation of genetic variants facilitates the initial identification of a region but makes it difficult to distinguish causal mutation(s). Luckily, whereas family-based linkage methods typically yield regions of 2 to 10 Mb in span, GWASs typically yield more manageable regions of 10 to 100 kb.
These regions have yet to be scrutinized by fine-mapping and resequencing to identify the specific gene and variants responsible. Even when a locus is identified by SNP association, the causal mutation itself need not be a SNP. For example, the IRGM gene was associated with Crohn’s disease on the basis of GWAS. Subsequent study suggests that the causal mutation is a deletion upstream of the promoter affecting tissue-specific expression (91).
5) A single locus can contain multiple independent common risk variants. Intensive study has already identified seven independent alleles at 8q24 for prostate cancer (92), three at complement factor H (CFH) for age-related macular degeneration (93), three at IRF5 for systemic lupus erythematosus (95), and two at IL23R for Crohn’s disease (96). Multiple distinct alleles with different frequencies and risk ratios may well be the rule.
6) A single locus can harbor both common variants of weak effect and rare variants of large effect. In recent GWASs, studies of common SNPs enabled the identification of 19 loci as influencing low- or high-density lipoprotein (LDL, HDL) or triglycerides (84). Nine of these 19 were already known to carry rare Mendelian mutations with large effects, such as the loci for the LDL receptor (LDLR) and familial hypercholesterolemia (FH). Similarly, the genes encoding Kir6.2, WFS1, and TCF2 are all known to carry rare mutations that cause Mendelian syndromes including T2D, as well as common SNPs with modest effects.
7) Because allele frequencies vary across human populations, the relative roles of common susceptibility genes can vary among ethnic groups. One example is the association of prostate cancer at 8q24: SNPs in the region play a role in all ethnic groups, but the contribution is greater in African Americans. This is not because the risk alleles yet found confer greater susceptibility in African Americans, but because they occur at higher frequencies (92), contributing to the higher incidence among African American men than among men of European ancestry.
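Two of the quantitative points above, the sample sizes behind lesson 3 and the role of allele frequency in lesson 7, can be illustrated with a short calculation. The sketch below is not the method used in the cited studies. It assumes a simple allelic two-proportion z-test with equal numbers of cases and controls, a per-allele odds ratio, Hardy-Weinberg proportions, and a multiplicative risk model. The function names `cases_needed` and `attributable_fraction` are hypothetical, and the frequencies 0.1 and 0.4 in the usage lines are illustrative values, not data from any study.

```python
from math import ceil
from statistics import NormalDist


def cases_needed(control_freq, odds_ratio, alpha, power):
    """Approximate number of cases (with as many controls) needed for an
    allelic two-proportion z-test. Assumption: simple normal approximation."""
    # Risk-allele frequency in cases implied by the per-allele odds ratio.
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    case_freq = odds / (1.0 + odds)

    std_normal = NormalDist()
    z_alpha = -std_normal.inv_cdf(alpha / 2.0)  # two-sided significance threshold
    z_power = std_normal.inv_cdf(power)

    variance = case_freq * (1.0 - case_freq) + control_freq * (1.0 - control_freq)
    alleles_per_group = (
        (z_alpha + z_power) ** 2 * variance / (case_freq - control_freq) ** 2
    )
    # Each individual contributes two alleles.
    return ceil(alleles_per_group / 2.0)


def attributable_fraction(allele_freq, risk_per_allele):
    """Population attributable fraction under a multiplicative model.

    Genotype risks are 1, r, r^2 at Hardy-Weinberg frequencies, so the mean
    risk relative to non-carriers is (1 - p + p * r) ** 2.
    """
    mean_relative_risk = (1.0 - allele_freq + allele_freq * risk_per_allele) ** 2
    return 1.0 - 1.0 / mean_relative_risk


if __name__ == "__main__":
    # Lesson 3: 20% allele frequency, effect of 1.2, significance 1e-8, 90% power.
    print(cases_needed(0.20, 1.2, 1e-8, 0.90))
    # Lesson 7: same per-allele risk, different (illustrative) allele frequencies.
    print(attributable_fraction(0.10, 1.2), attributable_fraction(0.40, 1.2))
```

Under these assumptions the first calculation gives a requirement of several thousand cases, in the same range as the figure quoted in lesson 3; it is not expected to match that figure exactly, because the model behind the quoted number is not specified here. The second calculation shows how an allele with the same per-allele risk accounts for a larger share of disease in a population where it is more frequent.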
Lessons have also emerged about the functions and phenotypic associations of genes related to common diseases:
1) A subset of associations involve genes previously related to the disease. Of 19 loci meeting genome-wide significance in a recent GWAS of LDL, HDL, or triglyceride levels, 12 contained genes with known functions in lipid biology (84). The gene for 3-hydroxy-3-methyl glutaryl–coenzyme A reductase (HMGCR), encoding the rate-limiting enzyme in cholesterol biosynthesis and the target of statin medications, was found by GWAS to carry common genetic variation influencing LDL levels (84). Similarly, SNPs in the β-cell zinc transporter encoded by SLC30A8 were associated with risk of T2D (97).
2) Most associations do not involve previous candidate genes. In some cases, GWAS results immediately suggest new biological hypotheses: for example, the role of complement factor H in age-related macular degeneration (77), FGFR2 in breast cancer (98), and CDKN2A and CDKN2B in T2D (99). In many other cases, such as LOC387715/HTRA1 with age-related macular degeneration (102), nearby genes have no known function.
3) Many associations implicate non-protein-coding regions. Although some associated non-coding SNPs may ultimately prove attributable to LD with nearby coding mutations, many are sufficiently far from nearby exons to make this outcome unlikely. Examples include the region at 8q24 associated with prostate, breast, and colon cancer, 300 kb from the nearest gene (103), and the region at 9q21 associated with myocardial infarction and T2D, 150 kb from the nearest genes encoding CDKN2A and CDKN2B (99).
A role for noncoding sequence in disease risk is not surprising: Comparative genome analysis has shown that 5% of the human genome is evolutionarily conserved and thus functional; less than one-third of this 5% consists of genes that encode proteins (108). Noncoding mutations with roles in disease susceptibility will likely open new doors to understanding genome biology and gene regulation. Regulatory variation also suggests different therapeutic strategies: Modulating levels of gene expression may prove more tractable than replacing a fully defective protein or turning off a gain-of-function allele.
4) Some regions contain expected associations across diseases and traits. Crohn’s disease, psoriasis, and ankylosing spondylitis have long been recognized to share clinical features; the association of the same common polymorphisms in IL23R in all three diseases points to a shared molecular cause (96). SNPs in STAT4 (signal transducer and activator of transcription 4) are associated with rheumatoid arthritis and systemic lupus, two diseases that share clinical features. Multiple variants associated with T2D are associated with insulin secretion defects in nondiabetic individuals (101), highlighting the role of β-cell failure in the pathogenesis of T2D.
5) Some regions reveal surprising associations. For example, unexpected connections have emerged among T2D, inflammatory diseases (two loci), and cancer (four loci). A single intron of CDKAL1 was found to contain a SNP associated with T2D and insulin secretion defects (99), and another with Crohn’s disease and psoriasis (117). A coding variant in glucokinase regulatory protein is associated with triglyceride levels and fasting glucose (101) but also with C-reactive protein levels (118) and Crohn’s disease (86). A SNP in TCF2 is associated with protection from T2D, as expected on the basis of Mendelian mutations at the same gene (120). Unexpectedly, the same association signal turned up in a GWAS for prostate cancer (121). Similarly, JAZF1 was identified as containing SNPs associated with T2D (83) and prostate cancer (122), and TCF7L2 with T2D (123) and colon cancer (124).