A central focus of complex disease genetics after genome-wide association studies (GWAS) is to identify low frequency and rare risk variants, which may account for an important fraction of disease heritability unexplained by GWAS. A profusion of studies using next-generation sequencing are seeking such risk alleles. We describe how already-known complex trait loci (largely from GWAS) can be used to guide the design of these new studies by selecting cases, controls, or families who are most likely to harbor undiscovered risk alleles. We show that genetic risk prediction can select unrelated cases from large cohorts who are enriched for unknown risk factors, or multiply-affected families that are more likely to harbor high-penetrance risk alleles. We derive the frequency of an undiscovered risk allele in selected cases and controls, and show how this relates to the variance explained by the risk score, the disease prevalence and the population frequency of the risk allele. We also describe a new method for informing the design of sequencing studies using genetic risk prediction in large partially-genotyped families using an extension of the Inside-Outside algorithm for inference on trees. We explore several study design scenarios using both simulated and real data, and show that in many cases genetic risk prediction can provide significant increases in power to detect low-frequency and rare risk alleles. The same approach can also be used to aid discovery of non-genetic risk factors, suggesting possible future utility of genetic risk prediction in conventional epidemiology. Software implementing the methods in this paper is available in the R package Mangrove.
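As a rough illustration of the risk-score idea above, the following sketch computes an additive score from known loci and picks the cases with the lowest scores. It does not use the Mangrove API. The function names, the log-odds weighting, and the rule of prioritising low-scoring cases are illustrative assumptions rather than the paper's stated procedure.

```python
import math

def genetic_risk_score(genotypes, odds_ratios):
    """Additive score on the log-odds scale (an assumed model).

    genotypes: risk-allele counts (0, 1 or 2) at each known locus.
    odds_ratios: per-allele odds ratio for each of those loci.
    """
    if len(genotypes) != len(odds_ratios):
        raise ValueError("one odds ratio is needed per genotype")
    return sum(g * math.log(o) for g, o in zip(genotypes, odds_ratios))

def select_low_score_cases(case_scores, n):
    """Indices of the n cases with the lowest known-locus score.

    Assumption: cases poorly explained by known loci are the ones
    prioritised for sequencing.
    """
    order = sorted(range(len(case_scores)), key=lambda i: case_scores[i])
    return order[:n]

# Hypothetical data: three loci, four cases.
ors = [1.2, 1.5, 1.1]
cases = [[0, 1, 2], [2, 2, 2], [0, 0, 0], [1, 0, 1]]
scores = [genetic_risk_score(g, ors) for g in cases]
chosen = select_low_score_cases(scores, 2)
```

Here `chosen` holds the indices of the two cases whose disease is least accounted for by the known loci under this assumed model.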
The molecular mechanisms involved in the development of type 2 diabetes are poorly understood. Starting from genome-wide genotype data for 1,924 diabetic cases and 2,938 population controls generated by the Wellcome Trust Case Control Consortium, we set out to detect replicated diabetes association signals through analysis of 3,757 additional cases and 5,346 controls, and by integration of our findings with equivalent data from other international consortia. We detected diabetes susceptibility loci in and around the genes CDKAL1, CDKN2A/CDKN2B and IGF2BP2 and confirmed the recently described associations at HHEX/IDE and SLC30A8. Our findings provide insights into the genetic architecture of type 2 diabetes, emphasizing the contribution of multiple variants of modest effect. The regions identified underscore the importance of pathways influencing pancreatic beta cell development and function in the etiology of type 2 diabetes.
We genotyped 2,861 cases from the UK PBC consortium and 8,514 UK population controls across 196,524 variants within 186 known autoimmune risk loci. We identified three loci newly associated with primary biliary cirrhosis (PBC) (with P < 5 × 10⁻⁸), increasing the number of known susceptibility loci to 25. The most associated variant at 19p12 is a low-frequency non-synonymous SNP in TYK2, further implicating JAK/STAT and cytokine signalling in disease pathogenesis. A further five loci contained non-synonymous variants in high linkage disequilibrium (LD) (r² > 0.8) with the most associated variant at the locus. We found multiple independent common, low-frequency and rare variant association signals at five loci. Of the 26 independent non-HLA signals tagged on Immunochip, 15 have SNPs in B-lymphoblastoid open-chromatin regions in high LD (r² > 0.8) with the most associated variant. This study demonstrates how dense fine-mapping arrays coupled with functional genomic data can be utilized to identify candidate causal variants for functional follow-up.
Motivation: The existence of families with many individuals affected by the same complex disease has long suggested the possibility of rare alleles of high penetrance. In contrast to Mendelian diseases, however, linkage studies have identified very few reproducibly linked loci in diseases such as diabetes and autism. Genome-wide association studies have had greater success with such diseases, but these results explain neither the extreme disease load nor the within-family linkage peaks of some large pedigrees. Combining linkage information with exome or genome sequencing from large complex disease pedigrees might finally identify family-specific, high-penetrance mutations.
Results: Olorin is a tool that integrates gene flow within families with next-generation sequencing data to enable the analysis of complex disease pedigrees. Users can interactively filter and prioritize variants based on haplotype sharing across selected individuals and other measures of importance, including predicted functional consequence and population frequency.
Imputation allows the inference of unobserved genotypes in low-density data sets, and is often used to test for disease association at variants that are poorly captured by standard genotyping chips (such as low-frequency variants). Although much effort has gone into developing the best imputation algorithms, less is known about the effects of reference set choice on imputation accuracy. We assess the improvements afforded by increases in reference size and diversity, specifically comparing the HapMap2 data set, which has been used to date for imputation, and the new HapMap3 data set, which contains more samples from a more diverse range of populations. We find that, for imputation into Western European samples, the HapMap3 reference provides more accurate imputation with better-calibrated quality scores than HapMap2, and that increasing the number of HapMap3 populations included in the reference set grants further improvements. Improvements are most pronounced for low-frequency variants (frequency <5%), with the largest and most diverse reference sets bringing the accuracy of imputation of low-frequency variants close to that of common ones. For low-frequency variants, reference set diversity can improve the accuracy of imputation, independent of reference sample size. HapMap3 reference sets provide significant increases in imputation accuracy relative to HapMap2, and are of particular use if highly accurate imputation of low-frequency variants is required. Our results suggest that, although the sample sizes from the 1000 Genomes Pilot Project will not allow reliable imputation of low-frequency variants, the larger sample sizes of the main project will do so.
imputation; reference sets; rare variants
The Fc receptor like 3 (FCRL3) molecule, involved in controlling B cell signalling, may contribute to the autoimmune disease process. Recently a genome wide screen detected association of neighbouring gene FCRL5 with Graves’ disease (GD). To determine whether FCRL5 represents a further independent B cell signalling GD susceptibility locus, we screened 12 tag SNPs, capturing all known common variation within FCRL5, in 5192 UK Caucasian GD index cases and controls.
A case control association study investigating twelve tag SNPs within FCRL5 which captured the majority of known common variation within this gene region.
A dataset comprising 2504 UK Caucasian GD patients and 2688 geographically matched controls taken from the 1958 British Birth cohort.
We used the chi-squared test and haplotype analysis to investigate association between the tag SNPs and GD before performing regression analysis to determine if association at FCRL5 was independent of the known FCRL3 association.
Three of the FCRL5 tag SNPs, rs6667109, rs3811035 and rs6692977, showed association with GD (P=0.015-0.001, OR=1.15-1.16). Logistic regression performed on all FCRL5 and previously screened FCRL3 tag SNPs revealed that association with FCRL5 was secondary to linkage disequilibrium with the FCRL3 SNPs rs11264798 and rs10489678.
FCRL5 does not appear to be exerting an independent effect on the development of GD in the UK. Fine mapping of the entire FCRL region is required to determine the exact location of the etiological variant/s present.
Linkage disequilibrium; FCRL3; FCRL5; Graves’ disease; genome wide screening
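The chi-squared association test mentioned in the methods above can be sketched for a single SNP as a 2 × 2 test on allele counts. This is a minimal illustration, not the study's analysis pipeline. The function name and the allele counts are hypothetical, and the test is assumed to be the standard one-degree-of-freedom allelic test without continuity correction.

```python
import math

def allelic_chi_squared(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-squared test (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value, odds_ratio). No continuity correction.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return chi2, p_value, odds_ratio

# Hypothetical allele counts, not taken from the study.
chi2, p_value, odds_ratio = allelic_chi_squared(30, 70, 20, 80)
```

Testing whether one signal is independent of another, as done here for FCRL5 and FCRL3, needs a regression model with both SNPs as covariates and is outside the scope of this sketch.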
Attempting to classify patients into high or low risk for disease onset or outcomes is one of the cornerstones of epidemiology. For some (but by no means all) diseases, clinically usable risk prediction can be performed using classical risk factors such as body mass index, lipid levels, smoking status, family history and, under certain circumstances, genetics (e.g. BRCA1/2 in breast cancer). The advent of genome-wide association studies (GWAS) has led to the discovery of common risk loci for the majority of common diseases. These discoveries raise the possibility of using these variants for risk prediction in a clinical setting. We discuss the different ways in which the predictive accuracy of these loci can be measured, and survey the predictive accuracy of GWAS variants for 18 common diseases. We show that predictive accuracy from genetic models varies greatly across diseases, but that the range is similar to that of non-genetic risk-prediction models. We discuss what factors drive differences in predictive accuracy, and how much value these predictions add over classical predictive tests. We also review the uses and pitfalls of idealized models of risk prediction. Finally, we look forward towards possible future clinical implementation of genetic risk prediction, and discuss realistic expectations for future utility.
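One widely used measure of predictive accuracy is the area under the ROC curve (AUC): the probability that a randomly chosen case has a higher predicted risk than a randomly chosen control. The sketch below computes it directly from that definition. The abstract does not say which measures the review covers, so the choice of AUC is an assumption, and the scores are hypothetical.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve by pairwise comparison.

    Counts the fraction of (case, control) pairs in which the case has
    the higher score; ties count as half a win.
    """
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical predicted risks.
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.4])
```

An AUC of 0.5 corresponds to a prediction no better than chance, and 1.0 to perfect separation of cases from controls.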
Genome-wide association studies, which produce huge volumes of data, are now being carried out by many groups around the world, creating a need for user friendly tools for data quality control and analysis. One critical aspect of GWAS quality control is evaluating genotype cluster plots to verify sensible genotype calling in putatively associated SNPs. Evoker is a tool for visualizing genotype cluster plots, and provides a solution to the computational and storage problems related to working with such large datasets.
Synthetic associations have been posited as a possible explanation for missing heritability in complex disease. We show several lines of evidence which suggest that, while possible, these synthetic associations are not common.
Type 1 diabetes (T1D) is a common autoimmune disorder that arises from the action of multiple genetic and environmental risk factors. We report the findings of a new genome-wide association study of T1D, combined in a meta-analysis with two previously published studies. The total sample set included 7,514 cases and 9,045 reference samples. Forty-one distinct genomic locations provided evidence for association to T1D in the meta-analysis (P < 10⁻⁶). After excluding previously reported associations, 27 regions were further tested in an independent set of 4,267 cases, 4,463 controls and 2,319 affected sib-pair (ASP) families. Of these, 18 regions were replicated (P < 0.01; overall P < 5 × 10⁻⁸) and four additional regions provided nominal evidence of replication (P < 0.05). The many new candidate genes suggested by these results include IL10, IL19, IL20, GLIS3, CD69 and IL27.
Genome-wide association studies (GWAS) have successfully identified a large number of genetic variants associated with complex traits, but these only explain a small proportion of the total heritability. It has been recently proposed that rare variants can create ‘synthetic association’ signals in GWAS, by occurring more often in association with one of the alleles of a common tag single nucleotide polymorphism. While the ultimate evaluation of this hypothesis will require the completion of large-scale sequencing studies, it is informative to place it in the broader context of what is known about the genetic architecture of complex disease. In this review, we draw from empirical and theoretical data to summarize evidence showing that synthetic associations do not underlie many reported GWAS associations.
Background & Aims
Identifying shared and disease-specific susceptibility loci for Crohn’s disease (CD) and ulcerative colitis (UC) would help define the biologic relationship between the inflammatory bowel diseases. More than 30 CD susceptibility loci have been identified. These represent important candidate susceptibility loci for UC. Loci discovered by the index genome scans in CD have previously been tested for association with UC, but those identified in the recent meta-analysis await such investigation. Furthermore, the recently identified UC locus at ECM1 requires formal testing for association with CD.
We analyzed 45 single nucleotide polymorphisms, tagging 29 of the loci recently associated with CD in 2527 UC cases and 4070 population controls. We also genotyped the UC-associated ECM1 variant rs11205387 in 1560 CD patients and 3028 controls.
Nine regions showed association with UC at a threshold corrected for the 29 loci tested (P < .0017). The strongest association (P = 4.13 × 10⁻⁸; odds ratio = 1.27) was identified with a 170-kilobase region on chromosome 1q32 that contains 3 genes. We also found association with JAK2 and replicated a recently reported association with STAT3, further implicating the role of this signaling pathway in inflammatory bowel disease. Additional novel UC susceptibility genes were LYRM4 and CDKAL1. Twenty of the loci were not associated with UC, and several appear to be specific to CD. ECM1 variation was not associated with CD.
Collectively, these data help define the genetic relationship between CD and UC and characterize common, as well as disease-specific mechanisms of pathogenesis.
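The significance threshold quoted in the results is consistent with a Bonferroni correction of a nominal 0.05 level across the 29 loci tested. The short check below makes the arithmetic explicit. The nominal level of 0.05 and the use of Bonferroni are assumptions inferred from the reported threshold, and the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

# Assumed nominal level 0.05, divided across the 29 loci tested.
threshold = bonferroni_threshold(0.05, 29)
print(f"{threshold:.4f}")  # prints 0.0017

# The strongest reported association easily passes this threshold.
strongest_p = 4.13e-8
passes = strongest_p < threshold
```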
We report results of a nonsynonymous SNP scan for ulcerative colitis and identify a previously unknown susceptibility locus at ECM1. We also show that several risk loci are common to ulcerative colitis and Crohn’s disease (IL23R, IL12B, HLA, NKX2-3 and MST1), whereas autophagy genes ATG16L1 and IRGM, along with NOD2 (also known as CARD15), are specific for Crohn’s disease. These data provide the first detailed illustration of the genetic relationship between these common inflammatory bowel diseases.
To provide more power to detect type 1 diabetes (T1D) loci, we performed a meta-analysis of data from three genome-wide association (GWA) studies. We tested 305,090 SNPs in 3,561 T1D cases and 4,646 controls of European ancestry. We obtained further support for 4q27/IL2-IL21 (P = 1.9 × 10⁻⁸) and, after genotyping 6,225 cases, 6,946 controls and 2,828 families, convincing evidence for four previously unknown and distinct loci in chromosome regions 6q15/BACH2 (4.7 × 10⁻¹²), 10p15/PRKCQ (3.7 × 10⁻⁹), 15q24/CTSH (3.2 × 10⁻¹⁵) and 22q13/C1QTNF6 (2.0 × 10⁻⁸).
Several new risk factors for Crohn's disease have been identified in recent genome-wide association studies. To advance gene discovery further we have combined the data from three studies (a total of 3,230 cases and 4,829 controls) and performed replication in 3,664 independent cases with a mixture of population-based and family-based controls. The results strongly confirm 11 previously reported loci and provide genome-wide significant evidence for 21 new loci, including the regions containing STAT3, JAK2, ICOSLG, CDKAL1, and ITLN1. The expanded molecular understanding of the basis of disease offers promise for informed therapeutic development.
Obesity is a serious international health problem that increases the risk of several common diseases. The genetic factors predisposing to obesity are poorly understood. A genome-wide search for type 2 diabetes–susceptibility genes identified a common variant in the FTO (fat mass and obesity associated) gene that predisposes to diabetes through an effect on body mass index (BMI). An additive association of the variant with BMI was replicated in 13 cohorts with 38,759 participants. The 16% of adults who are homozygous for the risk allele weighed about 3 kilograms more and had 1.67-fold increased odds of obesity when compared with those not inheriting a risk allele. This association was observed from age 7 years upward and reflects a specific increase in fat mass.
A genome-wide association scan in Crohn disease by the Wellcome Trust Case Control Consortium detected strong association at 6 novel loci. We tested 37 SNPs from these and other loci for association in an independent case control sample. Replication was obtained for the IRGM gene on chromosome 5q33.1, which induces autophagy (replication P = 6.6 × 10⁻⁴, combined P = 2.1 × 10⁻¹⁰), and for 9 other loci including NKX2-3 and gene deserts on chromosomes 1q and 5p13.
A wealth of genetic associations for cardiovascular and metabolic phenotypes in humans has been accumulating over the last decade, in particular a large number of loci derived from recent genome wide association studies (GWAS). True complex disease-associated loci often exert modest effects, so their delineation currently requires integration of diverse phenotypic data from large studies to ensure robust meta-analyses. We have designed a gene-centric 50 K single nucleotide polymorphism (SNP) array to assess potentially relevant loci across a range of cardiovascular, metabolic and inflammatory syndromes. The array utilizes a “cosmopolitan” tagging approach to capture the genetic diversity across ∼2,000 loci in populations represented in the HapMap and SeattleSNPs projects. The array content is informed by GWAS of vascular and inflammatory disease, expression quantitative trait loci implicated in atherosclerosis, pathway based approaches and comprehensive literature searching. The custom flexibility of the array platform facilitated interrogation of loci at differing stringencies, according to a gene prioritization strategy that allows saturation of high priority loci with a greater density of markers than the existing GWAS tools, particularly in African HapMap samples. We also demonstrate that the IBC array can be used to complement GWAS, increasing coverage in high priority CVD-related loci across all major HapMap populations. DNA from over 200,000 extensively phenotyped individuals will be genotyped with this array with a significant portion of the generated data being released into the academic domain facilitating in silico replication attempts, analyses of rare variants and cross-cohort meta-analyses in diverse populations. These datasets will also facilitate more robust secondary analyses, such as explorations with alternative genetic models, epistasis and gene-environment interactions.
Combining data from genome-wide association studies (GWAS) conducted at different locations, using genotype imputation and fixed-effects meta-analysis, has been a powerful approach for dissecting complex disease genetics in populations of European ancestry. Here we investigate the feasibility of applying the same approach in Africa, where genetic diversity, both within and between populations, is far more extensive. We analyse genome-wide data from approximately 5,000 individuals with severe malaria and 7,000 population controls from three different locations in Africa. Our results show that the standard approach is well powered to detect known malaria susceptibility loci when sample sizes are large, and that modern methods for association analysis can control the potential confounding effects of population structure. We show that the pattern of association around the haemoglobin S allele differs substantially across populations due to differences in haplotype structure. Motivated by these observations we consider new approaches to association analysis that might prove valuable for multicentre GWAS in Africa: we relax the assumptions of SNP–based fixed effect analysis; we apply Bayesian approaches to allow for heterogeneity in the effect of an allele on risk across studies; and we introduce a region-based test to allow for heterogeneity in the location of causal alleles.
Malaria kills nearly a million people every year, most of whom are young children in Africa. The risk of developing severe malaria is known to be affected by genetics, but so far only a handful of genetic risk factors for malaria have been identified. We studied over a million DNA variants in over 5,000 individuals with severe malaria from the Gambia, Malawi, and Kenya, and about 7,000 healthy individuals from the same countries. Because the populations of Africa are far more genetically diverse than those in Europe, it is necessary to use statistical models that can account for both broad differences between countries and subtler differences between ethnic groups within the same community. We identified known associations at the genes ABO (which affects blood type) and HBB (which causes sickle cell disease), and showed that the latter is heterogeneous across populations. We used these findings to guide the development of statistical tests for association that take this heterogeneity into account, by modelling differences in the strength and genomic location of effect across and within African populations.
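The standard fixed-effects meta-analysis referred to above is usually computed by inverse-variance weighting of per-study effect estimates, which assumes the same true effect in every study. The sketch below shows that baseline calculation only. It is not the study's code and does not implement the Bayesian or region-based extensions. The function name and the input values are hypothetical.

```python
import math

def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance weighted fixed-effects meta-analysis.

    betas: per-study effect estimates (for example log odds ratios).
    standard_errors: their standard errors, in the same order.
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return pooled_beta, pooled_se, z, p_value

# Hypothetical log odds ratios from three sites, not taken from the study.
pooled_beta, pooled_se, z, p_value = fixed_effects_meta(
    [0.2, 0.4, 0.3], [0.1, 0.1, 0.1]
)
```

Because the weights depend only on the standard errors, a signal that is strong in one population and absent in another is averaged towards zero, which is the limitation that motivates the heterogeneity-aware approaches described above.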
Genome-wide association (GWA) studies have identified numerous, replicable, genetic associations between common single nucleotide polymorphisms (SNPs) and risk of common autoimmune and inflammatory (immune-mediated) diseases, some of which are shared between two diseases. Along with epidemiological and clinical evidence, this suggests that some genetic risk factors may be shared across diseases—as is the case with alleles in the Major Histocompatibility Locus. In this work we evaluate the extent of this sharing for 107 immune disease-risk SNPs in seven diseases: celiac disease, Crohn's disease, multiple sclerosis, psoriasis, rheumatoid arthritis, systemic lupus erythematosus, and type 1 diabetes. We have developed a novel statistic for Cross Phenotype Meta-Analysis (CPMA) which detects association of a SNP to multiple, but not necessarily all, phenotypes. With it, we find evidence that 47/107 (44%) immune-mediated disease risk SNPs are associated to multiple—but not all—immune-mediated diseases (SNP-wise P_CPMA < 0.01). We also show that distinct groups of interacting proteins are encoded near SNPs which predispose to the same subsets of diseases; we propose these as the mechanistic basis of shared disease risk. We are thus able to leverage genetic data across diseases to construct biological hypotheses about the underlying mechanism of pathogenesis.
Over the last five years we have found over 100 genetic variants predisposing to common diseases affecting the immune system. In this study we analyze 107 such variants across seven diseases and find that almost half are shared across diseases. We also find that the patterns of sharing across diseases cluster these variants into groups; proteins encoded near variants in the same group tend to interact. This suggests that genetic variation may influence entire pathways to create risk to multiple diseases.