Mutations in nucleotide-binding oligomerisation domain-containing protein 2 (NOD2) remain the strongest genetic determinants for Crohn’s disease (CD). Having previously identified vimentin as a novel NOD2-interacting protein, we aimed to investigate the regulatory effects of vimentin on NOD2 function and the association of variants in Vim to CD susceptibility.
Co-immunoprecipitation, fluorescent microscopy and fractionation were used to confirm the interaction between NOD2 and vimentin. HEK293 cells stably expressing wild-type NOD2 or NOD2-frameshift variant (L1007fs) and SW480 colonic epithelial cells were used alongside the vimentin inhibitor Withaferin-A (WFA) to assess effects on NOD2 function using nuclear factor-kappaB (NF-κB) reporter gene, GFP-LC3-based autophagy, and bacterial gentamicin protection assays. International GWAS meta-analysis data were used to test for association of SNPs in Vim to CD susceptibility.
The leucine-rich repeat (LRR) domain of NOD2 contained the elements required for vimentin binding; CD-associated polymorphisms disrupted this interaction. NOD2 and vimentin co-localised at the cell plasma membrane, and cytosolic mislocalisation of the L1007fs and R702W variants correlated with an inability to interact with vimentin. Use of WFA demonstrated that vimentin was required for NOD2-dependent NF-κB activation and MDP-induced autophagy, and that NOD2 and vimentin regulated the invasion and survival properties of a CD-associated adherent-invasive E. coli strain. Genetic analysis revealed an association signal across the haplotype block containing Vim.
Vimentin is an important regulator of NOD2 function and a potential novel therapeutic target in the treatment of CD. Additionally, Vim is a candidate susceptibility gene for CD, supporting the functional data.
inflammatory bowel disease; Crohn’s disease; NOD2; vimentin; E.coli; autophagy; genetic association studies
Fetal hemoglobin (HbF) is an important modulator of sickle cell disease (SCD). HbF has previously been shown to be affected by variants at three loci on chromosomes 2, 6 and 11, but it is likely that additional loci remain to be discovered.
Methods and Findings
We conducted a genome-wide association study (GWAS) in 1,213 sickle cell anemia (SCA; HbSS/HbSβ⁰) patients in Tanzania. Genotyping was performed with the Illumina Omni2.5 array, and imputation used 1000 Genomes Phase I release data. Association with HbF was analysed using a linear mixed model to control for complex population structure within our study. We successfully replicated known associations for HbF near BCL11A and the HBS1L-MYB intergenic polymorphisms (HMIP), including multiple independent effects near BCL11A, consistent with previous reports. We observed eight additional associations with P<10⁻⁶. These associations could not be replicated in an SCA population in the UK.
This is the largest GWAS of SCA in Africa. We have confirmed known associations and identified new genetic associations with HbF that require further replication in SCA populations in Africa.
Exome sequencing studies in complex diseases are challenged by the allelic heterogeneity, large number and modest effect sizes of associated variants on disease risk and the presence of large numbers of neutral variants, even in phenotypically relevant genes. Isolated populations with recent bottlenecks offer advantages for studying rare variants in complex diseases as they have deleterious variants that are present at higher frequencies as well as a substantial reduction in rare neutral variation. To explore the potential of the Finnish founder population for studying low-frequency (0.5–5%) variants in complex diseases, we compared exome sequence data on 3,000 Finns to the same number of non-Finnish Europeans and discovered that, despite having fewer variable sites overall, the average Finn has more low-frequency loss-of-function variants and complete gene knockouts. We then used several well-characterized Finnish population cohorts to study the phenotypic effects of 83 enriched loss-of-function variants across 60 phenotypes in 36,262 Finns. Using a deep set of quantitative traits collected on these cohorts, we show 5 associations (P<5×10⁻⁸), including splice variants in LPA that lowered plasma lipoprotein(a) levels (P = 1.5×10⁻¹¹⁷). Through accessing the national medical records of these participants, we evaluate the LPA finding via Mendelian randomization and confirm that these splice variants confer protection from cardiovascular disease (OR = 0.84, P = 3×10⁻⁴), demonstrating for the first time a correlation between very low levels of LPA in humans and cardiovascular protection, with potential therapeutic implications for cardiovascular diseases. More generally, this study articulates substantial advantages for studying the role of rare variation in complex phenotypes in founder populations like the Finns, combining a unique population genetic history with data from large population cohorts and centralized research access to National Health Registers.
We explored the coding regions of 3,000 Finnish individuals with 3,000 non-Finnish Europeans (NFEs) using whole-exome sequence data, in order to understand how an individual from a bottlenecked population might differ from an individual from an out-bred population. We provide empirical evidence that there are more rare and low-frequency deleterious alleles in Finns compared to NFEs, such that an average Finn has almost twice as many low-frequency complete knockouts of a gene. As such, we hypothesized that some of these low-frequency loss-of-function variants might have important medical consequences in humans and genotyped 83 of these variants in 36,000 Finns. In doing so, we discovered that completely knocking out the TSFM gene might result in inviability or a very severe phenotype in humans and that knocking out the LPA gene might confer protection against coronary heart diseases, suggesting that LPA is likely to be a good potential therapeutic target.
Genetic mutations cause primary immunodeficiencies (PIDs), which predispose to infections. Here we describe Activated PI3K-δ Syndrome (APDS), a PID associated with a dominant gain-of-function mutation E1021K in the p110δ protein, the catalytic subunit of phosphoinositide 3-kinase δ (PI3Kδ), encoded by the PIK3CD gene. We found E1021K in 17 patients from seven unrelated families, but not among 3,346 healthy subjects. APDS was characterized by recurrent respiratory infections, progressive airway damage, lymphopenia, increased circulating transitional B cells, increased IgM and reduced IgG2 levels in serum and impaired vaccine responses. The E1021K mutation enhanced membrane association and kinase activity of p110δ. Patient-derived lymphocytes had increased levels of phosphatidylinositol 3,4,5-trisphosphate and phosphorylated AKT protein and were prone to activation-induced cell death. Selective p110δ inhibitors IC87114 and GS-1101 reduced the activity of the mutant enzyme in vitro, suggesting a therapeutic approach for patients with APDS.
Zebrafish have become a popular organism for the study of vertebrate gene function1,2. The virtually transparent embryos of this species, and the ability to accelerate genetic studies by gene knockdown or overexpression, have led to the widespread use of zebrafish in the detailed investigation of vertebrate gene function and increasingly, the study of human genetic disease3–5. However, for effective modelling of human genetic disease it is important to understand the extent to which zebrafish genes and gene structures are related to orthologous human genes. To examine this, we generated a high-quality sequence assembly of the zebrafish genome, made up of an overlapping set of completely sequenced large-insert clones that were ordered and oriented using a high-resolution high-density meiotic map. Detailed automatic and manual annotation provides evidence of more than 26,000 protein-coding genes6, the largest gene set of any vertebrate so far sequenced. Comparison to the human reference genome shows that approximately 70% of human genes have at least one obvious zebrafish orthologue. In addition, the high quality of this genome assembly provides a clearer understanding of key genomic features such as a unique repeat content, a scarcity of pseudogenes, an enrichment of zebrafish-specific genes on chromosome 4 and chromosomal regions that influence sex determination.
A central focus of complex disease genetics after genome-wide association studies (GWAS) is to identify low frequency and rare risk variants, which may account for an important fraction of disease heritability unexplained by GWAS. A profusion of studies using next-generation sequencing are seeking such risk alleles. We describe how already-known complex trait loci (largely from GWAS) can be used to guide the design of these new studies by selecting cases, controls, or families who are most likely to harbor undiscovered risk alleles. We show that genetic risk prediction can select unrelated cases from large cohorts who are enriched for unknown risk factors, or multiply-affected families that are more likely to harbor high-penetrance risk alleles. We derive the frequency of an undiscovered risk allele in selected cases and controls, and show how this relates to the variance explained by the risk score, the disease prevalence and the population frequency of the risk allele. We also describe a new method for informing the design of sequencing studies using genetic risk prediction in large partially-genotyped families using an extension of the Inside-Outside algorithm for inference on trees. We explore several study design scenarios using both simulated and real data, and show that in many cases genetic risk prediction can provide significant increases in power to detect low-frequency and rare risk alleles. The same approach can also be used to aid discovery of non-genetic risk factors, suggesting possible future utility of genetic risk prediction in conventional epidemiology. Software implementing the methods in this paper is available in the R package Mangrove.
The molecular mechanisms involved in the development of type 2 diabetes are poorly understood. Starting from genome-wide genotype data for 1,924 diabetic cases and 2,938 population controls generated by the Wellcome Trust Case Control Consortium, we set out to detect replicated diabetes association signals through analysis of 3,757 additional cases and 5,346 controls, and by integration of our findings with equivalent data from other international consortia. We detected diabetes susceptibility loci in and around the genes CDKAL1, CDKN2A/CDKN2B and IGF2BP2 and confirmed the recently described associations at HHEX/IDE and SLC30A8. Our findings provide insights into the genetic architecture of type 2 diabetes, emphasizing the contribution of multiple variants of modest effect. The regions identified underscore the importance of pathways influencing pancreatic beta cell development and function in the etiology of type 2 diabetes.
Combining data from genome-wide association studies (GWAS) conducted at different locations, using genotype imputation and fixed-effects meta-analysis, has been a powerful approach for dissecting complex disease genetics in populations of European ancestry. Here we investigate the feasibility of applying the same approach in Africa, where genetic diversity, both within and between populations, is far more extensive. We analyse genome-wide data from approximately 5,000 individuals with severe malaria and 7,000 population controls from three different locations in Africa. Our results show that the standard approach is well powered to detect known malaria susceptibility loci when sample sizes are large, and that modern methods for association analysis can control the potential confounding effects of population structure. We show that the pattern of association around the haemoglobin S allele differs substantially across populations due to differences in haplotype structure. Motivated by these observations, we consider new approaches to association analysis that might prove valuable for multicentre GWAS in Africa: we relax the assumptions of SNP-based fixed-effects analysis; we apply Bayesian approaches to allow for heterogeneity in the effect of an allele on risk across studies; and we introduce a region-based test to allow for heterogeneity in the location of causal alleles.
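The fixed-effects meta-analysis mentioned above can be illustrated with a minimal sketch. It assumes the common inverse-variance weighting scheme and uses Cochran's Q as a simple heterogeneity check; neither is confirmed as the authors' exact implementation, and the function names and effect sizes are hypothetical.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted fixed-effects estimate across studies.

    betas: per-study effect sizes (e.g. log odds ratios) for one SNP.
    ses:   matching standard errors.
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal null.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


def cochran_q(betas, ses):
    """Cochran's Q: weighted squared deviations of study effects from the pooled effect."""
    pooled_beta = fixed_effects_meta(betas, ses)[0]
    return sum(((b - pooled_beta) / se) ** 2 for b, se in zip(betas, ses))


# Hypothetical log odds ratios from three sites (illustrative values only).
betas = [0.20, 0.25, 0.05]
ses = [0.05, 0.06, 0.05]
print(fixed_effects_meta(betas, ses))
print(cochran_q(betas, ses))
```

A large Q relative to a chi-squared distribution with (number of studies − 1) degrees of freedom would suggest the kind of between-site heterogeneity that motivates relaxing the fixed-effects assumption.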
Malaria kills nearly a million people every year, most of whom are young children in Africa. The risk of developing severe malaria is known to be affected by genetics, but so far only a handful of genetic risk factors for malaria have been identified. We studied over a million DNA variants in over 5,000 individuals with severe malaria from the Gambia, Malawi, and Kenya, and about 7,000 healthy individuals from the same countries. Because the populations of Africa are far more genetically diverse than those in Europe, it is necessary to use statistical models that can account for both broad differences between countries and subtler differences between ethnic groups within the same community. We identified known associations at the genes ABO (which affects blood type) and HBB (which causes sickle cell disease), and showed that the latter is heterogeneous across populations. We used these findings to guide the development of statistical tests for association that take this heterogeneity into account, by modelling differences in the strength and genomic location of effect across and within African populations.
Crohn’s disease (CD) and ulcerative colitis (UC), the two common forms of inflammatory bowel disease (IBD), affect over 2.5 million people of European ancestry with rising prevalence in other populations1. Genome-wide association studies (GWAS) and subsequent meta-analyses of CD and UC2,3 as separate phenotypes implicated previously unsuspected mechanisms, such as autophagy4, in pathogenesis and showed that some IBD loci are shared with other inflammatory diseases5. Here we expand knowledge of relevant pathways by undertaking a meta-analysis of CD and UC genome-wide association scans, with validation of significant findings in more than 75,000 cases and controls. We identify 71 new associations, for a total of 163 IBD loci that meet genome-wide significance thresholds. Most loci contribute to both phenotypes, and both directional and balancing selection effects are evident. Many IBD loci are also implicated in other immune-mediated disorders, most notably with ankylosing spondylitis and psoriasis. We also observe striking overlap between susceptibility loci for IBD and mycobacterial infection. Gene co-expression network analysis emphasizes this relationship, with pathways shared between host responses to mycobacteria and those predisposing to IBD.
We genotyped 2,861 cases from the UK PBC consortium and 8,514 UK population controls across 196,524 variants within 186 known autoimmune risk loci. We identified three loci newly associated with primary biliary cirrhosis (PBC) (with P<5×10⁻⁸), increasing the number of known susceptibility loci to 25. The most associated variant at 19p12 is a low-frequency non-synonymous SNP in TYK2, further implicating JAK/STAT and cytokine signalling in disease pathogenesis. A further five loci contained non-synonymous variants in high linkage disequilibrium (LD) (r²>0.8) with the most associated variant at the locus. We found multiple independent common, low-frequency and rare variant association signals at five loci. Of the 26 independent non-HLA signals tagged on Immunochip, 15 have SNPs in B-lymphoblastoid open-chromatin regions in high LD (r²>0.8) with the most associated variant. This study demonstrates how dense fine-mapping arrays coupled with functional genomic data can be utilized to identify candidate causal variants for functional follow-up.
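The r² threshold for linkage disequilibrium used above can be made concrete with a short sketch. It assumes phased haplotype counts for two biallelic SNPs are available; the function name, argument names and counts are hypothetical, and the study's actual LD estimation procedure is not described here.

```python
def ld_r2(n_ab, n_aB, n_Ab, n_AB):
    """Squared correlation (r^2) between two biallelic SNPs.

    Arguments are counts of the four haplotypes, where A/a are the alleles
    at the first SNP and B/b the alleles at the second.
    """
    n = n_ab + n_aB + n_Ab + n_AB
    p_A = (n_Ab + n_AB) / n   # frequency of allele A
    p_B = (n_aB + n_AB) / n   # frequency of allele B
    p_AB = n_AB / n           # frequency of haplotype AB
    denom = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denom == 0.0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    d = p_AB - p_A * p_B      # LD coefficient D
    return d * d / denom


# Hypothetical haplotype counts (illustrative values only).
r2 = ld_r2(n_ab=480, n_aB=20, n_Ab=30, n_AB=470)
print(r2, r2 > 0.8)
```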
Motivation: The existence of families with many individuals affected by the same complex disease has long suggested the possibility of rare alleles of high penetrance. In contrast to Mendelian diseases, however, linkage studies have identified very few reproducibly linked loci in diseases such as diabetes and autism. Genome-wide association studies have had greater success with such diseases, but these results explain neither the extreme disease load nor the within-family linkage peaks, of some large pedigrees. Combining linkage information with exome or genome sequencing from large complex disease pedigrees might finally identify family-specific, high-penetrance mutations.
Results: Olorin is a tool that integrates gene flow within families with next-generation sequencing data to enable the analysis of complex disease pedigrees. Users can interactively filter and prioritize variants based on haplotype sharing across selected individuals and other measures of importance, including predicted functional consequence and population frequency.
Genome sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2,951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in non-essential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes, and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
We densely genotyped, using 1000 Genomes Project pilot CEU and additional re-sequencing study variants, 183 reported immune-mediated disease non-HLA risk loci in 12,041 celiac disease cases and 12,228 controls. We identified 13 new celiac disease risk loci at genome wide significance, bringing the total number of known loci (including HLA) to 40. Multiple independent association signals are found at over a third of these loci, attributable to a combination of common, low frequency, and rare genetic variants. In comparison with previously available data such as HapMap3, our dense genotyping in a large sample size provided increased resolution of the pattern of linkage disequilibrium, and suggested localization of many signals to finer scale regions. In particular, 29 of 54 fine-mapped signals appeared localized to specific single genes - and in some instances to gene regulatory elements. We define a complex genetic architecture of risk regions, and refine risk signals, providing a next step towards elucidating causal disease mechanisms.
Imputation allows the inference of unobserved genotypes in low-density data sets, and is often used to test for disease association at variants that are poorly captured by standard genotyping chips (such as low-frequency variants). Although much effort has gone into developing the best imputation algorithms, less is known about the effects of reference set choice on imputation accuracy. We assess the improvements afforded by increases in reference size and diversity, specifically comparing the HapMap2 data set, which has been used to date for imputation, and the new HapMap3 data set, which contains more samples from a more diverse range of populations. We find that, for imputation into Western European samples, the HapMap3 reference provides more accurate imputation with better-calibrated quality scores than HapMap2, and that increasing the number of HapMap3 populations included in the reference set grants further improvements. Improvements are most pronounced for low-frequency variants (frequency <5%), with the largest and most diverse reference sets bringing the accuracy of imputation of low-frequency variants close to that of common ones. For low-frequency variants, reference set diversity can improve the accuracy of imputation, independent of reference sample size. HapMap3 reference sets provide significant increases in imputation accuracy relative to HapMap2, and are of particular use if highly accurate imputation of low-frequency variants is required. Our results suggest that, although the sample sizes from the 1000 Genomes Pilot Project will not allow reliable imputation of low-frequency variants, the larger sample sizes of the main project will.
imputation; reference sets; rare variants
The Fc receptor-like 3 (FCRL3) molecule, involved in controlling B cell signalling, may contribute to the autoimmune disease process. Recently a genome-wide screen detected association of the neighbouring gene FCRL5 with Graves’ disease (GD). To determine whether FCRL5 represents a further independent B cell signalling GD susceptibility locus, we screened 12 tag SNPs, capturing all known common variation within FCRL5, in 5192 UK Caucasian GD index cases and controls.
A case control association study investigating twelve tag SNPs within FCRL5 which captured the majority of known common variation within this gene region.
A dataset comprising 2504 UK Caucasian GD patients and 2688 geographically matched controls taken from the 1958 British Birth cohort.
We used the chi-squared test and haplotype analysis to investigate association between the tag SNPs and GD before performing regression analysis to determine if association at FCRL5 was independent of the known FCRL3 association.
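As an illustration of the chi-squared association test described above, the following minimal sketch tests a 2×2 table of allele counts in cases and controls and reports the allelic odds ratio. The allelic (rather than genotypic) form of the test, the function name and the counts are assumptions made for illustration.

```python
import math


def allelic_chi2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-squared test (1 df) on a 2x2 table of allele counts.

    case_a/case_b: counts of the tested and other allele in cases.
    ctrl_a/ctrl_b: the same counts in controls.
    Returns (chi2, p_value, odds_ratio).
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    num = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    den = ((case_a + case_b) * (ctrl_a + ctrl_b)
           * (case_a + ctrl_a) * (case_b + ctrl_b))
    chi2 = num / den
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    return chi2, p_value, odds_ratio


# Hypothetical allele counts (illustrative values only, not study data).
print(allelic_chi2(case_a=60, case_b=40, ctrl_a=40, ctrl_b=60))
```

No continuity correction is applied here; with sample sizes in the thousands its effect would be negligible.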
Three of the FCRL5 tag SNPs, rs6667109, rs3811035 and rs6692977, showed association with GD (P=0.015-0.001, OR=1.15-1.16). Logistic regression performed on all FCRL5 and previously screened FCRL3 tag SNPs revealed that association with FCRL5 was secondary to linkage disequilibrium with the FCRL3 SNPs rs11264798 and rs10489678.
FCRL5 does not appear to be exerting an independent effect on the development of GD in the UK. Fine mapping of the entire FCRL region is required to determine the exact location of the etiological variant/s present.
Linkage disequilibrium; FCRL3; FCRL5; Graves’ disease; genome wide screening
Genome-wide association studies (GWAS) and candidate gene studies in ulcerative colitis (UC) have identified 18 susceptibility loci. We conducted a meta-analysis of 6 UC GWAS, comprising 6,687 cases and 19,718 controls, and followed up the top association signals in 9,628 cases and 12,917 controls. We identified 29 additional risk loci (P<5×10⁻⁸), increasing the number of UC-associated loci to 47. After annotating associated regions using GRAIL, eQTL data and correlations with non-synonymous SNPs, we identified many candidate genes providing potentially important insights into disease pathogenesis, including IL1R2, IL8RA/B, IL7R, IL12B, DAP, PRDM1, JAK2, IRF5, GNA12 and LSP1. The total number of confirmed inflammatory bowel disease (IBD) risk loci is now 99, including a minimum of 28 shared association signals between Crohn’s disease (CD) and UC.
Attempting to classify patients into high or low risk for disease onset or outcomes is one of the cornerstones of epidemiology. For some (but by no means all) diseases, clinically usable risk prediction can be performed using classical risk factors such as body mass index, lipid levels, smoking status, family history and, under certain circumstances, genetics (e.g. BRCA1/2 in breast cancer). The advent of genome-wide association studies (GWAS) has led to the discovery of common risk loci for the majority of common diseases. These discoveries raise the possibility of using these variants for risk prediction in a clinical setting. We discuss the different ways in which the predictive accuracy of these loci can be measured, and survey the predictive accuracy of GWAS variants for 18 common diseases. We show that predictive accuracy from genetic models varies greatly across diseases, but that the range is similar to that of non-genetic risk-prediction models. We discuss what factors drive differences in predictive accuracy, and how much value these predictions add over classical predictive tests. We also review the uses and pitfalls of idealized models of risk prediction. Finally, we look forward towards possible future clinical implementation of genetic risk prediction, and discuss realistic expectations for future utility.
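One widely used measure of predictive accuracy is the area under the ROC curve (AUC): the probability that a randomly chosen case has a higher risk score than a randomly chosen control. The sketch below assumes AUC is among the measures of interest and computes it by direct pairwise comparison; the function name and scores are hypothetical.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (case, control) pairs in which the case has the
    higher risk score; tied pairs contribute one half.
    """
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical genetic risk scores (illustrative values only).
cases = [0.8, 0.3]
controls = [0.5, 0.1]
print(auc(cases, controls))
```

An AUC of 0.5 corresponds to a score that carries no information, and 1.0 to perfect separation of cases from controls. The pairwise loop is quadratic in sample size, which is fine for a sketch; a rank-based calculation would be preferable for large cohorts.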
Genome-wide association (GWA) studies have identified numerous, replicable, genetic associations between common single nucleotide polymorphisms (SNPs) and risk of common autoimmune and inflammatory (immune-mediated) diseases, some of which are shared between two diseases. Along with epidemiological and clinical evidence, this suggests that some genetic risk factors may be shared across diseases, as is the case with alleles in the Major Histocompatibility Locus. In this work we evaluate the extent of this sharing for 107 immune disease-risk SNPs in seven diseases: celiac disease, Crohn's disease, multiple sclerosis, psoriasis, rheumatoid arthritis, systemic lupus erythematosus, and type 1 diabetes. We have developed a novel statistic for Cross Phenotype Meta-Analysis (CPMA) which detects association of a SNP to multiple, but not necessarily all, phenotypes. With it, we find evidence that 47/107 (44%) immune-mediated disease risk SNPs are associated to multiple (but not all) immune-mediated diseases (SNP-wise CPMA P<0.01). We also show that distinct groups of interacting proteins are encoded near SNPs which predispose to the same subsets of diseases; we propose these as the mechanistic basis of shared disease risk. We are thus able to leverage genetic data across diseases to construct biological hypotheses about the underlying mechanism of pathogenesis.
Over the last five years we have found over 100 genetic variants predisposing to common diseases affecting the immune system. In this study we analyze 107 such variants across seven diseases and find that almost half are shared across diseases. We also find that the patterns of sharing across diseases cluster these variants into groups; proteins encoded near variants in the same group tend to interact. This suggests that genetic variation may influence entire pathways to create risk to multiple diseases.
Genome-wide association studies, which produce huge volumes of data, are now being carried out by many groups around the world, creating a need for user friendly tools for data quality control and analysis. One critical aspect of GWAS quality control is evaluating genotype cluster plots to verify sensible genotype calling in putatively associated SNPs. Evoker is a tool for visualizing genotype cluster plots, and provides a solution to the computational and storage problems related to working with such large datasets.
Synthetic associations have been posited as a possible explanation for missing heritability in complex disease. We show several lines of evidence which suggest that, while possible, these synthetic associations are not common.
We performed a second-generation genome-wide association study of 4,533 celiac disease cases and 10,750 controls. We genotyped 113 selected SNPs with GWAS P<10⁻⁴, and 18 SNPs from 14 known loci, in a further 4,918 cases and 5,684 controls. Variants from 13 new regions reached genome-wide significance (combined P<5×10⁻⁸); most contain immune function genes (BACH2, CCR4, CD80, CIITA/SOCS1/CLEC16A, ICOSLG, ZMIZ1), with ETS1, RUNX3, THEMIS and TNFRSF14 playing key roles in thymic T cell selection. A further 13 regions had suggestive association evidence. In an expression quantitative trait meta-analysis of 1,469 whole blood samples, 20 of 38 (52.6%) tested loci had celiac risk variants correlated (P<0.0028, FDR 5%) with cis gene expression.