Purely epistatic multi-locus interactions cannot generally be detected via single-locus analysis in case-control studies of complex diseases. Recently, many two-locus and multi-locus analysis techniques have been shown to be promising for epistasis detection. However, exhaustive multi-locus analysis requires prohibitively large computational effort when problems involve large-scale or genome-wide data. Furthermore, there is no explicit proof that a combination of multiple two-locus analyses can lead to the correct identification of multi-locus interactions.
The proposed 2LOmb algorithm performs an omnibus permutation test on ensembles of two-locus analyses. The algorithm consists of four main steps: two-locus analysis, a permutation test, global p-value determination and a progressive search for the best ensemble. 2LOmb is benchmarked against an exhaustive two-locus analysis technique, a set association approach, a correlation-based feature selection (CFS) technique and a tuned ReliefF (TuRF) technique. The simulation results indicate that 2LOmb produces a low false-positive error. Moreover, 2LOmb has the best performance in terms of an ability to identify all causative single nucleotide polymorphisms (SNPs) and a low number of output SNPs in purely epistatic two-, three- and four-locus interaction problems. The interaction models constructed from the 2LOmb outputs via a multifactor dimensionality reduction (MDR) method are also included for the confirmation of epistasis detection. 2LOmb is subsequently applied to a type 2 diabetes mellitus (T2D) data set, which is obtained as a part of the UK genome-wide genetic epidemiology study by the Wellcome Trust Case Control Consortium (WTCCC). After primarily screening for SNPs that locate within or near 372 candidate genes and exhibit no marginal single-locus effects, the T2D data set is reduced to 7,065 SNPs from 370 genes. The 2LOmb search in the reduced T2D data reveals that four intronic SNPs in PGM1 (phosphoglucomutase 1), two intronic SNPs in LMX1A (LIM homeobox transcription factor 1, alpha), two intronic SNPs in PARK2 (Parkinson disease (autosomal recessive, juvenile) 2, parkin) and three intronic SNPs in GYS2 (glycogen synthase 2 (liver)) are associated with the disease. The 2LOmb result suggests that there is no interaction between each pair of the identified genes that can be described by purely epistatic two-locus interaction models. 
Moreover, there are no interactions between these four genes that can be described by purely epistatic multi-locus interaction models with marginal two-locus effects. The findings provide an alternative explanation for the aetiology of T2D in a UK population.
An omnibus permutation test on ensembles of two-locus analyses can detect purely epistatic multi-locus interactions with marginal two-locus effects. The study also reveals that SNPs from large-scale or genome-wide case-control data which are discarded after single-locus analysis detects no association can still be useful for genetic epidemiology studies.
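The permutation idea summarised above can be illustrated with a minimal sketch. It is not the 2LOmb implementation: it assumes genotypes coded 0/1/2, case-control labels coded 0/1, and uses the maximum two-locus chi-square statistic over all SNP pairs as a simplified omnibus statistic. The function names (`two_locus_chi2`, `max_pair_statistic`, `omnibus_permutation_p`) are hypothetical, and the progressive search for the best ensemble is omitted.

```python
import itertools
import random


def two_locus_chi2(g1, g2, labels):
    """Pearson chi-square statistic for a (joint genotype) x (control/case) table."""
    counts = {}
    for a, b, y in zip(g1, g2, labels):
        cell = counts.setdefault((a, b), [0, 0])
        cell[y] += 1
    n = len(labels)
    n_case = sum(labels)
    col_totals = [n - n_case, n_case]
    stat = 0.0
    for cell in counts.values():
        row_total = cell[0] + cell[1]
        for y in (0, 1):
            expected = row_total * col_totals[y] / n
            if expected > 0:
                stat += (cell[y] - expected) ** 2 / expected
    return stat


def max_pair_statistic(snps, labels):
    """Largest two-locus statistic over all SNP pairs (simplified omnibus statistic)."""
    return max(
        two_locus_chi2(snps[i], snps[j], labels)
        for i, j in itertools.combinations(range(len(snps)), 2)
    )


def omnibus_permutation_p(snps, labels, n_perm=200, seed=0):
    """Global p-value: share of label permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = max_pair_statistic(snps, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_pair_statistic(snps, shuffled) >= observed:
            exceed += 1
    # The +1 terms keep the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy data (assumed): disease status is the XOR of two SNPs, so neither SNP
# shows a marginal single-locus effect, while the pair is fully informative.
snp_a = [0, 0, 1, 1] * 10
snp_b = [0, 1, 0, 1] * 10
snp_c = [0] * 20 + [1] * 20  # unrelated SNP
labels = [a ^ b for a, b in zip(snp_a, snp_b)]
observed, p_global = omnibus_permutation_p([snp_a, snp_b, snp_c], labels)
```

Permuting the labels rather than the genotypes keeps the correlation structure among SNPs intact, which is why one permutation distribution can serve for the whole ensemble of pairwise tests.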
This article presents the ability of an omnibus permutation test on ensembles of two-locus analyses (2LOmb) to detect pure epistasis in the presence of genetic heterogeneity. The performance of 2LOmb is evaluated in various simulation scenarios covering two independent causes of complex disease where each cause is governed by a purely epistatic interaction. Different scenarios are set up by varying the number of available single nucleotide polymorphisms (SNPs) in data, number of causative SNPs and ratio of case samples from two affected groups. The simulation results indicate that 2LOmb outperforms multifactor dimensionality reduction (MDR) and random forest (RF) techniques in terms of a low number of output SNPs and a high number of correctly-identified causative SNPs. Moreover, 2LOmb is capable of identifying the number of independent interactions in tractable computational time and can be used in genome-wide association studies. 2LOmb is subsequently applied to a type 1 diabetes mellitus (T1D) data set, which is collected from a UK population by the Wellcome Trust Case Control Consortium (WTCCC). After screening for SNPs that locate within or near genes and exhibit no marginal single-locus effects, the T1D data set is reduced to 95,991 SNPs from 12,146 genes. The 2LOmb search in the reduced T1D data set reveals that 12 SNPs, which can be divided into two independent sets, are associated with the disease. The first SNP set consists of three SNPs from MUC21 (mucin 21, cell surface associated), three SNPs from MUC22 (mucin 22), two SNPs from PSORS1C1 (psoriasis susceptibility 1 candidate 1) and one SNP from TCF19 (transcription factor 19). A four-locus interaction between these four genes is also detected. The second SNP set consists of three SNPs from ATAD1 (ATPase family, AAA domain containing 1). 
Overall, the findings indicate the detection of pure epistasis in the presence of genetic heterogeneity and provide an alternative explanation for the aetiology of T1D in the UK population.
Attribute selection; Complex disease; Epistasis; Genetic heterogeneity; Genome-wide association study; Pattern recognition; Permutation test; Single nucleotide polymorphism; Type 1 diabetes mellitus
Genome-wide association studies (GWASs) have identified low-penetrance common variants (i.e., single nucleotide polymorphisms, SNPs) associated with breast cancer susceptibility. Although GWASs are primarily focused on single-locus effects, gene-gene interactions (i.e., epistasis) are also assumed to contribute to the genetic risks for complex diseases including breast cancer. While it has been hypothesized that moderately ranked (P value based) weak single-locus effects in GWASs could potentially harbor valuable information for evaluating epistasis, we lack systematic efforts to investigate SNPs showing consistent associations with weak statistical significance across independent discovery and replication stages. The objectives of this study were i) to select SNPs showing single-locus effects with weak statistical significance for breast cancer in a GWAS and/or candidate-gene studies; ii) to replicate these SNPs in an independent set of breast cancer cases and controls; and iii) to explore their potential SNP-SNP interactions contributing to breast cancer susceptibility. A total of 17 SNPs related to DNA repair, modification and metabolism pathway genes were selected since these pathways offer a priori knowledge for potential epistatic interactions and an overall role in breast carcinogenesis. The study design included predominantly Caucasian women (2,795 cases and 4,505 controls) from Alberta, Canada. We observed two two-way SNP-SNP interactions (APEX1-rs1130409 and RPAP1-rs2297381; MLH1-rs1799977 and MDM2-rs769412) in logistic regression that conferred elevated risks for breast cancer (Pinteraction<7.3×10−3). Logic regression identified an interaction involving four SNPs (MBD2-rs4041245, MLH1-rs1799977, MDM2-rs769412, BRCA2-rs1799943) (Ppermutation = 2.4×10−3). 
SNPs involved in SNP-SNP interactions also showed single-locus effects with weak statistical significance, while BRCA2-rs1799943 showed stronger statistical significance (Pcorrelation/trend = 3.2×10−4) than the others. These single-locus effects were independent of body mass index. Our results provide a framework for evaluating SNPs showing statistically weak but reproducible single-locus effects for epistatic effects contributing to disease susceptibility.
OBJECTIVE—This study examined how differences in the BMI distribution of type 2 diabetic case subjects affected genome-wide patterns of type 2 diabetes association and considered the implications for the etiological heterogeneity of type 2 diabetes.
RESEARCH DESIGN AND METHODS—We reanalyzed data from the Wellcome Trust Case Control Consortium genome-wide association scan (1,924 case subjects, 2,938 control subjects: 393,453 single-nucleotide polymorphisms [SNPs]) after stratifying case subjects (into “obese” and “nonobese”) according to median BMI (30.2 kg/m2). Replication of signals in which alternative case-ascertainment strategies generated marked effect size heterogeneity in type 2 diabetes association signal was sought in additional samples.
RESULTS—In the “obese-type 2 diabetes” scan, FTO variants had the strongest type 2 diabetes effect (rs8050136: relative risk [RR] 1.49 [95% CI 1.34–1.66], P = 1.3 × 10−13), with only weak evidence for TCF7L2 (rs7901695 RR 1.21 [1.09–1.35], P = 0.001). This situation was reversed in the “nonobese” scan, with FTO association undetectable (RR 1.07 [0.97–1.19], P = 0.19) and TCF7L2 predominant (RR 1.53 [1.37–1.71], P = 1.3 × 10−14). These patterns, confirmed by replication, generated strong combined evidence for between-stratum effect size heterogeneity (FTO: PDIFF = 1.4 × 10−7; TCF7L2: PDIFF = 4.0 × 10−6). Other signals displaying evidence of effect size heterogeneity in the genome-wide analyses (on chromosomes 3, 12, 15, and 18) did not replicate. Analysis of the current list of type 2 diabetes susceptibility variants revealed nominal evidence for effect size heterogeneity for the SLC30A8 locus alone (RRobese 1.08 [1.01–1.15]; RRnonobese 1.18 [1.10–1.27]: PDIFF = 0.04).
CONCLUSIONS—This study demonstrates the impact of differences in case ascertainment on the power to detect and replicate genetic associations in genome-wide association studies. These data reinforce the notion that there is substantial etiological heterogeneity within type 2 diabetes.
Non-hereditary colorectal cancer (CRC) is a complex disorder resulting from the combination of genetic and non-genetic factors. Genome–wide association studies (GWAS) are useful for identifying such genetic susceptibility factors. However, the single loci so far associated with CRC only represent a fraction of the genetic risk for CRC development in the general population. Therefore, many other genetic risk variants alone and in combination must still remain to be discovered. The aim of this work was to search for genetic risk factors for CRC, by performing single-locus and two-locus GWAS in the Spanish population.
A total of 801 controls and 500 CRC cases were included in the discovery GWAS dataset. Seventy-seven single nucleotide polymorphisms (SNPs) from single-locus and 243 SNPs from two-locus association analyses were selected for replication in 423 additional CRC cases and 1382 controls. In the meta-analysis, one SNP, rs3987 at 4q26, reached a genome-wide significant p-value (p = 4.02×10−8), and one SNP pair, rs1100508 CG and rs8111948 AA, showed a trend for two-locus association (p = 4.35×10−11). Additionally, our GWAS confirmed the previously reported association with CRC of five SNPs located at 3q26.2 (rs10936599), 8q24 (rs10505477), 8q24.21 (rs6983267), 11q13.4 (rs3824999) and 14q22.2 (rs4444235).
Our GWAS for CRC patients from Spain confirmed some previously reported associations for CRC and yielded a novel candidate risk SNP, located at 4q26. Epistasis analyses also yielded several novel candidate susceptibility pairs that need to be validated in independent analyses.
Genome-wide association studies (GWAS) using single nucleotide polymorphism (SNP) markers provide opportunities to detect epistatic SNPs associated with quantitative traits and to detect the exact mode of an epistasis effect. Computational difficulty is the main bottleneck for epistasis testing in large scale GWAS.
The EPISNPmpi and EPISNP computer programs were developed for testing single-locus and epistatic SNP effects on quantitative traits in GWAS, including tests of three single-locus effects for each SNP (SNP genotypic effect, additive and dominance effects) and five epistasis effects for each pair of SNPs (two-locus interaction, additive × additive, additive × dominance, dominance × additive, and dominance × dominance) based on the extended Kempthorne model. EPISNPmpi is the parallel computing program for epistasis testing in large scale GWAS and achieved excellent scalability for large scale analysis and portability for various parallel computing platforms. EPISNP is the serial computing program based on the EPISNPmpi code for epistasis testing in small scale GWAS using commonly available operating systems and computer hardware. Three serial computing utility programs were developed for graphical viewing of test results and epistasis networks, and for estimating CPU time and disk space requirements.
The EPISNPmpi parallel computing program provides an effective computing tool for epistasis testing in large scale GWAS, and the epiSNP serial computing programs are convenient tools for epistasis analysis in small scale GWAS using commonly available computer hardware.
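As a rough illustration of the additive, dominance and additive × additive effects named above, the sketch below uses the classical textbook contrasts on genotype-class means. It is not the extended Kempthorne model used by EPISNP, which additionally accounts for allele and genotype frequencies. The function names are hypothetical, and genotypes are assumed to be coded 0/1/2 with every required genotype class present in the data.

```python
from collections import defaultdict


def genotype_means(genotypes, phenotypes):
    """Mean phenotype per genotype class (a class is any hashable genotype code)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, y in zip(genotypes, phenotypes):
        sums[g] += y
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


def additive_dominance(genotypes, phenotypes):
    """Classical single-locus contrasts; raises KeyError if a genotype class is missing."""
    m = genotype_means(genotypes, phenotypes)
    additive = (m[2] - m[0]) / 2           # half the difference between homozygotes
    dominance = m[1] - (m[2] + m[0]) / 2   # heterozygote deviation from the midpoint
    return additive, dominance


def additive_by_additive(g1, g2, phenotypes):
    """Additive x additive contrast on the four double-homozygote class means."""
    m = genotype_means(list(zip(g1, g2)), phenotypes)
    return (m[(2, 2)] - m[(2, 0)] - m[(0, 2)] + m[(0, 0)]) / 4


# Toy usage with assumed data: the heterozygote equals the upper homozygote,
# so the additive and dominance contrasts are both 1.
a, d = additive_dominance([0, 0, 1, 1, 2, 2], [1.0, 1.0, 3.0, 3.0, 3.0, 3.0])
```

A production tool would test each contrast against its standard error; this sketch only computes the point estimates.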
Motivation: Gene–gene interactions (epistasis) are thought to be important in shaping complex traits, but they have been under-explored in genome-wide association studies (GWAS) due to the computational challenge of enumerating billions of single nucleotide polymorphism (SNP) combinations. Fast screening tools are needed to make epistasis analysis routinely available in GWAS.
Results: We present BiForce to support high-throughput analysis of epistasis in GWAS for either quantitative or binary disease (case–control) traits. BiForce achieves great computational efficiency by using memory efficient data structures, Boolean bitwise operations and multithreaded parallelization. It performs a full pair-wise genome scan to detect interactions involving SNPs with or without significant marginal effects using appropriate Bonferroni-corrected significance thresholds. We show that BiForce is more powerful and significantly faster than published tools for both binary and quantitative traits in a series of performance tests on simulated and real datasets. We demonstrate BiForce in analysing eight metabolic traits in a GWAS cohort (323 697 SNPs, >4500 individuals) and two disease traits in another (>340 000 SNPs, >1750 cases and 1500 controls) on a 32-node computing cluster. BiForce completed analyses of the eight metabolic traits within 1 day, identified nine epistatic pairs of SNPs in five metabolic traits and 18 SNP pairs in two disease traits. BiForce can make the analysis of epistasis a routine exercise in GWAS and thus improve our understanding of the role of epistasis in the genetic regulation of complex traits.
Availability and implementation: The software is free and can be downloaded from http://bioinfo.utu.fi/BiForce/.
Supplementary data are available at Bioinformatics online.
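The Boolean bitwise counting mentioned in the BiForce abstract can be sketched as follows. This is an illustrative reconstruction of the general technique, not BiForce's actual data structures: each SNP is stored as three bitmasks (one per genotype, one bit per individual), and each cell of the pairwise genotype-by-status table is obtained with a bitwise AND followed by a population count. All function names are hypothetical.

```python
def genotype_bitmasks(genotypes):
    """Return three integers; bit i of masks[g] is set when individual i has genotype g."""
    masks = [0, 0, 0]
    for i, g in enumerate(genotypes):
        masks[g] |= 1 << i
    return masks


def popcount(x):
    """Number of set bits (portable alternative to int.bit_count in Python 3.10+)."""
    return bin(x).count("1")


def pair_table(masks_a, masks_b, case_mask, n):
    """9 x 2 table of (controls, cases) for each joint genotype, in row-major order."""
    all_mask = (1 << n) - 1
    control_mask = all_mask & ~case_mask
    table = []
    for ma in masks_a:
        for mb in masks_b:
            joint = ma & mb  # individuals carrying this genotype combination
            table.append((popcount(joint & control_mask), popcount(joint & case_mask)))
    return table


# Toy usage with assumed data: four individuals, cases are individuals 0 and 2.
snp_a = genotype_bitmasks([0, 1, 2, 0])
snp_b = genotype_bitmasks([0, 1, 2, 1])
case_mask = (1 << 0) | (1 << 2)
table = pair_table(snp_a, snp_b, case_mask, n=4)
```

The masks are built once per SNP and reused for every pair, so a full pairwise scan costs nine AND operations and eighteen population counts per pair rather than a pass over all individuals.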
Identifying gene-gene interactions is a hot topic in genome-wide association studies. Two fundamental challenges are: (1) how to smartly identify combinations of variants that may be associated with the trait from the astronomical number of all possible combinations; and (2) how to test epistatic interaction when all potential combinations are available. We developed AprioriGWAS, which brings two innovations. (1) Based on Apriori, a successful method in the field of Frequent Itemset Mining (FIM) in which a pattern growth strategy is leveraged to effectively and accurately reduce the search space, AprioriGWAS can efficiently identify genetically associated genotype patterns. (2) To test the hypotheses of epistasis, we adopt a new conditional permutation procedure to obtain reliable statistical inference of Pearson's chi-square test for the contingency table generated by associated variants. By applying AprioriGWAS to age-related macular degeneration (AMD) data, we found that: (1) angiopoietin 1 (ANGPT1) and four retinal genes interact with Complement Factor H (CFH). (2) GO term “glycosaminoglycan biosynthetic process” was enriched in AMD interacting genes. The epistatic interactions newly found by AprioriGWAS on AMD data are likely true interactions, since genes interacting with CFH are retinal genes, and GO term enrichment also verified that interaction between glycosaminoglycans (GAGs) and CFH plays an important role in disease pathology of AMD. By applying AprioriGWAS to bipolar disorder in WTCCC data, we found that variants without marginal effect show significant interactions. Examples include multiple-SNP genotype patterns inside the genes GABRB2 and GRIA1 (AMPA subunit 1 receptor gene). AMPARs are found in many parts of the brain and are the most commonly found receptor in the nervous system. GABRB2 mediates the fastest inhibitory synaptic transmission in the central nervous system. Multiple lines of evidence support the relevance of GRIA1 and GABRB2 to mental disorders.
Genes do not operate in a vacuum. They interact with each other in many ways. Therefore, to figure out genetic causes of disease by case-control association studies, it is important to take interactions into account. There are two fundamental challenges in interaction-focused analysis. The first is that the number of possible combinations of genetic variants easily becomes astronomical, beyond current computational capacity; this is referred to as “the curse of dimensionality” in the field of computer science. The other is that, even if all potential combinations could be exhaustively checked, genuine signals are likely to be buried by false positives composed of a single variant with a large main effect and some other irrelevant variant. In this work, we propose AprioriGWAS, which employs Apriori, an algorithm that pioneered the branch of “Frequent Itemset Mining” in computer science, to cope with daunting numbers of combinations, and conditional permutation to make real signals stand out. By applying AprioriGWAS to age-related macular degeneration (AMD) data and bipolar disorder (BD) in WTCCC data, we found interesting interactions between genes that are plausible in terms of disease. Consequently, AprioriGWAS could be a good tool for finding epistatic interactions in GWA data.
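The level-wise pruning that Apriori contributes can be sketched as below. This is a generic frequent itemset miner, not the AprioriGWAS code: it assumes each individual is represented as a set of hypothetical `(snp_id, genotype)` items, counts support as the number of individuals carrying a pattern, and omits the conditional permutation test.

```python
from itertools import combinations


def apriori(transactions, min_support, max_size=3):
    """Return {pattern: support} for every pattern carried by at least min_support individuals.

    transactions: list of frozensets of hashable, sortable items.
    """
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while current and size <= max_size:
        # Count support only for the surviving candidates at this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: a larger pattern is a candidate only if every sub-pattern
        # one item smaller is itself frequent (the Apriori property).
        candidates = set()
        for x, y in combinations(list(level), 2):
            union = x | y
            if len(union) == size + 1 and all(
                frozenset(sub) in level for sub in combinations(union, size)
            ):
                candidates.add(union)
        current = list(candidates)
        size += 1
    return frequent


# Toy usage with assumed genotype patterns for four individuals.
individuals = [
    frozenset({("s1", 0), ("s2", 1), ("s3", 2)}),
    frozenset({("s1", 0), ("s2", 1), ("s3", 0)}),
    frozenset({("s1", 0), ("s2", 1), ("s3", 2)}),
    frozenset({("s1", 1), ("s2", 0), ("s3", 2)}),
]
patterns = apriori(individuals, min_support=3)
```

Because any pattern containing an infrequent sub-pattern is never generated, the number of candidates examined at each level can be far smaller than the full set of combinations.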
Cholesterol concentrations in blood are related to cardiovascular diseases. Recent genome-wide association studies (GWAS) of cholesterol levels identified a number of single-locus effects on total cholesterol (TC) and high-density lipoprotein cholesterol (HDL-C) levels. Here, we report single-locus and epistasis SNP effects on TC and HDL-C using the Framingham Heart Study (FHS) data.
Single-locus effects and pairwise epistasis effects of 432,096 SNP markers were tested for their significance on log-transformed TC and HDL-C levels. Twenty-nine additive SNP effects reached single-locus genome-wide significance (p < 7.2 × 10−8) and no dominance effect reached genome-wide significance. Two new gene regions were detected, the RAB3GAP1-R3HDM1-LCT-MCM6 region of chr02 for TC identified by six new SNPs, and the OSBPL8-ZDHHC17 region (chr12) for HDL-C identified by one new SNP. The remaining 22 single-locus SNP effects confirmed previously reported genes or gene regions. For TC, three SNPs identified two gene regions that were tightly linked with previously reported genes associated with TC, including rs599839 that was 10 bases downstream of PSRC1 and 3.498 kb downstream of CELSR2, rs4970834 in CELSR2, and rs4245791 in ABCG8 that slightly overlapped with ABCG5. For HDL-C, LPL was confirmed by 12 SNPs 8-45 kb downstream, CETP by two SNPs 0.5-11 kb upstream, and the LIPG-ACAA2 region by five SNPs inside this region. Two epistasis effects on TC and thirteen epistasis effects on HDL-C reached the significance of "suggestive linkage". The most significant epistasis effect (p = 5.72 × 10−13) was close to reaching "significant linkage" and was a dominance × dominance effect on HDL-C between LMBRD1 (chr06) and the LRIG3 region (chr12), and this pair of gene regions had six other D × D effects with "suggestive linkage".
Genome-wide association analysis of the FHS data detected two new gene regions with genome-wide significance, detected epistatic SNP effects on TC and HDL-C with the significance of suggestive linkage in seven pairs of gene regions, and confirmed some previously reported gene regions associated with TC and HDL-C.
Although genome-wide association studies (GWAS) have identified a significant number of single-nucleotide polymorphisms (SNPs) associated with many complex human traits, the susceptibility loci identified so far can explain only a small fraction of the genetic risk. Among other possible explanations, the lack of a comprehensive examination of gene–gene interaction (G×G) is often considered a source of the missing heritability. Previously, we reported a model-free Generalized Multifactor Dimensionality Reduction (GMDR) approach for detecting G×G in both dichotomous and quantitative phenotypes. However, the computational burden and less efficient implementation of the original programs make them impossible to use for GWAS. In this study, we developed a graphics processing unit (GPU)-based GMDR program (named GMDR-GPU), which is able not only to analyze GWAS data but also to run much faster than the earlier version of the GMDR program. As a demonstration of the program, we used the GMDR-GPU software to analyze a publicly available GWAS dataset on type 2 diabetes (T2D) from the Wellcome Trust Case Control Consortium. Through an exhaustive search of pair-wise interactions and a selected search of three- to five-way interactions conditioned on significant pair-wise results, we identified 24 core SNPs in six genes (FTO: rs9939973, rs9940128, rs9922047, rs1121980, rs9939609, rs9930506; TSPAN8: rs1495377; TCF7L2: rs4074720, rs7901695, rs4506565, rs4132670, rs10787472, rs11196205, rs10885409, rs11196208; L3MBTL3: rs10485400, rs4897366; CELF4: rs2852373, rs608489; RUNX1: rs445984, rs1040328, rs990074, rs2223046, rs2834970) that appear to be important for T2D. Of these core SNPs, 11 in FTO, TSPAN8, and TCF7L2 have been reported to be associated with T2D, obesity, or both, providing an independent replication of previously reported SNPs. 
Importantly, we identified three new susceptibility genes; i.e., L3MBTL3, CELF4, and RUNX1, for T2D, a finding that warrants further investigation with independent samples.
Type 2 diabetes (T2D) disproportionally affects African Americans (AfA) but, to date, genetic variants identified from genome-wide association studies (GWAS) are primarily from European and Asian populations. We examined the single nucleotide polymorphism (SNP) and locus transferability of 40 reported T2D loci in six AfA GWAS consisting of 2,806 T2D case subjects with or without end-stage renal disease and 4,265 control subjects from the Candidate Gene Association Resource Plus Study. Our results revealed that seven index SNPs at the TCF7L2, KLF14, KCNQ1, ADCY5, CDKAL1, JAZF1, and GCKR loci were significantly associated with T2D (P < 0.05). The strongest association was observed at TCF7L2 rs7903146 (odds ratio [OR] 1.30; P = 6.86 × 10−8). Locus-wide analysis demonstrated significant associations (Pemp < 0.05) at regional best SNPs in the TCF7L2, KLF14, and HMGA2 loci as well as suggestive signals in KCNQ1 after correction for the effective number of SNPs at each locus. Of these loci, the regional best SNPs were in differential linkage disequilibrium (LD) with the index and adjacent SNPs. Our findings suggest that some loci discovered in prior reports affect T2D susceptibility in AfA with similar effect sizes. The reduced and differential LD pattern in AfA compared with European and Asian populations may facilitate fine mapping of causal variants at loci shared across populations.
Genome-wide associations have shown a lot of promise in dissecting the genetics of complex traits in humans with single variants, yet a large fraction of the genetic effects is still unaccounted for. Analyzing genetic interactions between variants (epistasis) is one of the potential ways forward. We investigated the abundance and functional impact of a specific type of epistasis, namely the interaction between regulatory and protein-coding variants. Using genotype and gene expression data from the 210 unrelated individuals of the original four HapMap populations, we have explored the combined effects of regulatory and protein-coding single nucleotide polymorphisms (SNPs). We predict that about 18% (1,502 out of 8,233 nsSNPs) of protein-coding variants are differentially expressed among individuals and demonstrate that regulatory variants can modify the functional effect of a coding variant in cis. Furthermore, we show that such interactions in cis can affect the expression of downstream targets of the gene containing the protein-coding SNP. In this way, a cis interaction between regulatory and protein-coding variants has a trans impact on gene expression. Given the abundance of both types of variants in human populations, we propose that joint consideration of regulatory and protein-coding variants may reveal additional genetic effects underlying complex traits and disease and may shed light on causes of differential penetrance of known disease variants.
The ultimate goal of genome-wide association studies (GWAS) is to explain the proportion of variation in a phenotypic trait that can be attributed to genetic factors. The past two years have seen a plethora of successes in this field, yet, for most traits, a large fraction of variation remains unexplained. Epistasis, or interaction between genetic variants, is a largely under-explored factor, which may shed some light in this area. We use the HapMap populations to investigate interactions between regulatory and protein-coding variants and their impact on gene expression. We show that if a specific protein-coding variant has a functional impact, this can be modified by a co-segregating regulatory variant (cis interaction). Furthermore, we demonstrate that such modification effects between variants at one locus may affect the expression of other genes in the cell in a trans manner. The aim of this article is to present a framework through which variation can be considered in the context of GWAS. Viewing variation from this underappreciated angle may, in some cases, provide an explanation for differential penetrance of complex disease traits, but also for non-replication of GWAS results that may arise as a consequence of such interactions.
We investigated the variation in neuropsychological function explained by risk alleles at the psychosis susceptibility gene ZNF804A and its interacting partners using single nucleotide polymorphisms (SNPs), polygenic score and epistatic analyses. Of particular importance was the relative contribution of the polygenic score versus epistasis in variation explained.
The objectives were twofold: first, to assess the association between SNPs in ZNF804A and the ZNF804A polygenic score with measures of cognition in cases with psychosis. The second was to assess whether epistasis within the ZNF804A pathway could explain additional variation above and beyond that explained by the polygenic score.
Design, Setting and Participants
Patients with psychosis (N = 424) were assessed in areas of cognitive ability impaired in schizophrenia including IQ, memory, attention and social cognition. We used the Psychiatric GWAS Consortium (PGC1) schizophrenia GWAS to calculate a polygenic score based on identified risk variants within this genetic pathway. Cognitive measures significantly associated with the polygenic score were tested for an epistatic component using a training set (N = 170), which was used to develop linear regression models containing the polygenic score and two-SNP interactions. The best-fitting models were tested for replication in two independent test sets of cases: 1) 170 individuals with schizophrenia or schizoaffective disorder and 2) 84 patients with broad psychosis (including bipolar disorder, major depressive disorder and other psychosis).
Higher polygenic scores were associated with poorer performance amongst patients on IQ, memory and social cognition, explaining 1-3% of variation on these scores (p-values ranged from 0.012-0.034). Using a narrow psychosis training set and independent test sets of narrow phenotype psychosis (schizophrenia and schizoaffective disorder), broad psychosis, and controls (N = 89) respectively, the addition of two interaction terms containing two SNPs each increased the R2 for spatial working memory (SWM) strategy in the independent psychosis test sets from 1.2% using the polygenic score only to 4.8% (p-values = 0.11 and 0.0012), but did not explain additional variation in controls.
Conclusions and Relevance
These data support a role for the ZNF804A pathway in IQ, memory and social cognition in cases. Further we show that epistasis increases variation explained above the contribution of the polygenic score.
schizophrenia; working memory; epistasis; polygenic score; ZNF804A; cognition
The transcription factor 7-like 2 (TCF7L2) locus is strongly implicated in the pathogenesis of type 2 diabetes (T2D). We previously mapped the genomic regions bound by TCF7L2 using ChIP (chromatin immunoprecipitation)-seq in the colorectal carcinoma cell line, HCT116, revealing an unexpected highly significant over-representation of genome-wide association studies (GWAS) loci associated primarily with endocrine (in particular T2D) and cardiovascular traits.
In order to further explore if this observed phenomenon occurs in other cell lines, we carried out ChIP-seq in HepG2 cells and leveraged ENCODE data for five additional cell lines. Given that only a minority of the predicted genetic component to most complex traits has been identified to date, plus our GWAS-related observations with respect to TCF7L2 occupancy, we investigated if restricting association analyses to the genes yielded from this approach, in order to reduce the constraints of multiple testing, could reveal novel T2D loci.
We found strong evidence for the continued enrichment of endocrine and cardiovascular GWAS categories, with additional support for cancer. When investigating all the known GWAS loci bound by TCF7L2 in the shortest gene list, derived from HCT116, the coronary artery disease-associated variant, rs46522 at the UBE2Z-GIP-ATP5G1-SNF8 locus, yielded significant association with T2D within DIAGRAM. Furthermore, when we analyzed tag-SNPs (single nucleotide polymorphisms) in genes not previously implicated by GWAS but bound by TCF7L2 within 5 kb, we observed a significant association of rs4780476 within CPPED1 in DIAGRAM.
ChIP-seq data generated with this GWAS-implicated transcription factor provided a biologically plausible method to limit multiple testing in the assessment of genome-wide genotyping data to uncover two novel T2D-associated loci.
Genetic Association; Type 2; Transcription Factor; Gene
Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including >100,000 individuals of European ancestry. Combining all lipid biomarkers, our re-analysis supported 26 out of 38 reported colocalisation results with eQTLs and identified 14 new colocalisation results, hence highlighting the value of a formal statistical test. In three cases of reported eQTL-lipid pairs (SYPL2, IFT172, TBKBP1) for which our analysis suggests that the eQTL pattern is not consistent with the lipid association, we identify alternative colocalisation results with SORT1, GCKR, and KPNB1, indicating that these genes are more likely to be causal in these genomic intervals. A key feature of the method is the ability to derive the output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at http://coloc.cs.ucl.ac.uk/coloc/). 
Our methodology provides information about candidate causal genes in associated intervals and has direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.
Genome-wide association studies (GWAS) have found a large number of genetic regions (“loci”) affecting clinical end-points and phenotypes, many outside coding intervals. One approach to understanding the biological basis of these associations has been to explore whether GWAS signals from intermediate cellular phenotypes, in particular gene expression, are located in the same loci (“colocalise”) and are potentially mediating the disease signals. However, it is not clear how to assess whether the same variants are responsible for the two GWAS signals or whether it is distinct causal variants close to each other. In this paper, we describe a statistical method that can use simply single variant summary statistics to test for colocalisation of GWAS signals. We describe one application of our method to a meta-analysis of blood lipids and liver expression, although any two datasets resulting from association studies can be used. Our method is able to detect the subset of GWAS signals explained by regulatory effects and identify candidate genes affected by the same GWAS variants. As summary GWAS data are increasingly available, applications of colocalisation methods to integrate the findings will be essential for functional follow-up, and will also be particularly useful to identify tissue specific signals in eQTL datasets.
Genome-wide association studies (GWAS) have emerged as a powerful approach for identifying susceptibility loci associated with polygenic diseases such as type 2 diabetes mellitus (T2DM). However, it is still a daunting task to prioritize single nucleotide polymorphisms (SNPs) from GWAS for further replication in different populations. Several recent studies have shown that genetic variation often affects gene expression at proximal (cis) as well as distal (trans) genomic locations through different mechanisms, such as altering the rate of transcription, splicing, or transcript stability.
To prioritize SNPs from GWAS, we combined results from two GWAS related to T2DM, the Diabetes Genetics Initiative (DGI) and the Wellcome Trust Case Control Consortium (WTCCC), with genome-wide expression data from pancreas, adipose tissue, liver and skeletal muscle of individuals with or without T2DM or animal models thereof to identify T2DM susceptibility loci.
We identified 1,170 SNPs associated with T2DM with P < 0.05 in both GWAS and 243 genes that were located in the vicinity of these SNPs. Out of these 243 genes, we identified 115 that were differentially expressed in publicly available gene expression profiling data. Notably, five of them, IGF2BP2, KCNJ11, NOTCH2, TCF7L2 and TSPAN8, have subsequently been shown to be associated with T2DM in different populations. To provide further validation of our approach, we reversed it and started with 26 known SNPs associated with T2DM and related traits. We could show that 12 (57%) (HHEX, HNF1B, IGF2BP2, IRS1, KCNJ11, KCNQ1, NOTCH2, PPARG, TCF7L2, THADA, TSPAN8 and WFS1) out of 21 genes located in the vicinity of these SNPs showed aberrant expression in T2DM in the gene expression profiling studies.
Utilizing gene expression profiling data from different tissues of individuals with or without T2DM, or animal models thereof, is a powerful tool for prioritizing SNPs from GWAS for further replication studies.
Genome-wide association studies (GWAS) have been successful in finding numerous new risk variants for complex diseases, but the results almost exclusively rely on single-marker scans. Methods that can analyze joint effects of many variants in GWAS data are still being developed and trialed. To evaluate the performance of such methods it is essential to have a GWAS data simulator that can rapidly simulate a large number of samples, and capture key features of real GWAS data such as linkage disequilibrium (LD) among single-nucleotide polymorphisms (SNPs) and joint effects of multiple loci (multilocus epistasis). In the current study, we combine techniques for specifying high-order epistasis among risk SNPs with an existing program, GWAsimulator [Li and Li 2008], to achieve rapid whole-genome simulation with accurate modeling of complex interactions. We considered various approaches to specifying interaction models, including: departure from the product of marginal effects for pair-wise interactions, product terms in logistic regression models for low-order interactions, and penetrance tables conforming to marginal effect constraints for high-order interactions or prescribing known biological interactions. Methods for conversion among different model specifications are developed using the penetrance table as the fundamental characterization of disease models. The new program, called simGWA, is capable of efficiently generating large samples of GWAS data with high precision. We show that data simulated by simGWA are faithful to template LD structures, and conform to pre-specified disease models with (or without) interactions.
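To illustrate what it means for a penetrance table to conform to marginal effect constraints, the following minimal sketch computes the single-locus (marginal) penetrances implied by a two-locus table and draws case/control status from it. It is not the simGWA implementation: the table values, the function names, and the assumptions of Hardy-Weinberg proportions and independent (unlinked) loci are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical two-locus penetrance table: rows are genotypes at locus A
# (0, 1, 2 minor alleles), columns are genotypes at locus B.
TABLE = np.array([
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
])


def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    q = maf
    p = 1.0 - q
    return np.array([p * p, 2.0 * p * q, q * q])


def marginal_penetrances(table, maf_a, maf_b):
    """Marginal penetrances at each locus and overall prevalence.

    Assumes the two loci are independent (no LD between them).
    """
    fa = genotype_freqs(maf_a)
    fb = genotype_freqs(maf_b)
    table = np.asarray(table, dtype=float)
    marg_a = table @ fb          # P(disease | genotype at A)
    marg_b = fa @ table          # P(disease | genotype at B)
    prevalence = fa @ table @ fb
    return marg_a, marg_b, prevalence


def simulate(table, maf_a, maf_b, n, rng):
    """Draw n individuals: genotypes at both loci and disease status."""
    ga = rng.choice(3, size=n, p=genotype_freqs(maf_a))
    gb = rng.choice(3, size=n, p=genotype_freqs(maf_b))
    status = rng.random(n) < np.asarray(table)[ga, gb]
    return ga, gb, status


marg_a, marg_b, prevalence = marginal_penetrances(TABLE, 0.5, 0.5)
print("marginal penetrances at A:", marg_a)
print("marginal penetrances at B:", marg_b)
print("prevalence:", prevalence)
```

With both minor allele frequencies set to 0.5, every marginal penetrance in this table equals the prevalence, so neither locus shows a single-locus effect even though disease risk depends on the genotype combination.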
genome-wide simulation; epistasis; gene-gene interaction; genome-wide association
Amyotrophic lateral sclerosis (ALS) is a fatal, degenerative neuromuscular disease characterized by a progressive loss of voluntary motor activity. About 95% of ALS patients have the "sporadic form", meaning that their disease is not associated with a family history of the disease. To date, the genetic factors of the sporadic form of ALS are poorly understood.
We proposed a two-stage approach based on seventeen biologically plausible models to search for two-locus combinations that have significant joint effects on the disease in a genome-wide association study (GWAS). We used a two-stage strategy to reduce the computational burden associated with performing an exhaustive two-locus search across the genome. In the first stage, all SNPs were screened using a single-marker test. In the second stage, all pairs made from the 1000 SNPs with the lowest p-values from the first stage were evaluated under each of the 17 two-locus models.
We performed the two-stage approach on a GWAS data set of sporadic ALS from the SNP Database at the NINDS Human Genetics Resource Center DNA and Cell Line Repository http://ccr.coriell.org/ninds/. Our two-locus analysis showed that two two-locus combinations, rs4363506 (SNP1) with rs3733242 (SNP2) and rs4363506 with rs16984239 (SNP3), were significantly associated with sporadic ALS. After adjusting for multiple tests and multiple models, the combination of SNP1 and SNP2 had a p-value of 0.032 under the Dom∩Dom epistatic model; SNP1 and SNP3 had a p-value of 0.042 under the Dom × Dom multiplicative model.
The proposed two-stage analytical method can be used to search for joint effects of genes in GWAS. The two-stage strategy decreased the computational time and the multiple-testing burden associated with GWAS. We have also observed that the loci identified by our two-stage strategy cannot be detected by single-locus tests.
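The two-stage strategy described above can be sketched as follows. Keeping 1000 SNPs yields 499,500 pairs, or 8,491,500 pair-model tests across 17 models. The sketch is illustrative only and makes several assumptions that are not taken from the study: the single-marker test is a chi-square test on dominant-coded genotypes, only one two-locus model is evaluated (a hypothetical reading of Dom∩Dom as "carrier at both loci"), and all function names are invented here.

```python
from itertools import combinations
from math import comb

import numpy as np
from scipy.stats import chi2_contingency


def table_pvalue(feature, status):
    """Chi-square p-value for a binary feature against case/control status."""
    table = np.zeros((2, 2))
    for f in (0, 1):
        for s in (0, 1):
            table[f, s] = np.sum((feature == f) & (status == s))
    # A table with an empty row or column carries no information.
    if (table.sum(axis=0) == 0).any() or (table.sum(axis=1) == 0).any():
        return 1.0
    return chi2_contingency(table)[1]


def two_stage(genotypes, status, keep):
    """Stage 1: single-marker screen. Stage 2: all pairs among the top SNPs.

    genotypes: (n_samples, n_snps) array of minor-allele counts 0/1/2.
    status:    (n_samples,) array with 1 for cases and 0 for controls.
    keep:      number of SNPs carried forward to stage 2.
    """
    n_snps = genotypes.shape[1]
    carriers = (genotypes > 0).astype(int)  # dominant coding (assumption)

    # Stage 1: rank SNPs by their single-marker p-value.
    stage1 = np.array(
        [table_pvalue(carriers[:, j], status) for j in range(n_snps)]
    )
    top = np.argsort(stage1)[:keep]

    # Stage 2: test every pair of retained SNPs under one two-locus model.
    results = []
    for i, j in combinations(sorted(top), 2):
        both = carriers[:, i] & carriers[:, j]
        results.append((table_pvalue(both, status), int(i), int(j)))
    results.sort()
    return results


print("pairs among 1000 SNPs:", comb(1000, 2))
print("pair-model tests for 17 models:", 17 * comb(1000, 2))
```

The stage 2 p-values are unadjusted; any real analysis would still need to correct them for the number of pairs and models tested.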
Hereditary Multiple Exostoses (HME) is an autosomal-dominant disorder characterized by benign cartilage tumors (exostoses) forming near the growth plates, leading to severe health problems. EXT1 and EXT2 are the two genes known to harbor heterozygous loss-of-function mutations that account for the vast majority of the primary genetic component of HME. However, patients present with wide clinical heterogeneity, suggesting that modifier genes play a role in determining severity. Our previous work has pointed to an imbalance of β-catenin signaling being involved in the pathogenesis of osteochondroma formation. TCF7L2 is one of the key ‘gate-keeper’ TCF family members for the Wnt/β-catenin signaling pathway, and TCF7L2 and EXT2 are among the earliest associated loci reported in genome-wide appraisals of type 2 diabetes (T2D). We therefore investigated whether the key T allele of single nucleotide polymorphism (SNP) rs7903146 within the TCF7L2 locus, which is strongly over-represented among T2D cases, was also associated with HME. We leveraged genotype data available from ongoing GWAS efforts from genomics and orthopaedic centers in the US, Canada and Italy. Collectively 213 cases and 1,890 controls were analyzed and, surprisingly, the T allele was in fact significantly under-represented in the HME patient group [P=0.009; odds ratio=0.737 (95% C.I. 0.587 - 0.926)]; in addition, the direction of effect was consistent within each individual cohort. Immunohistochemical analyses revealed that TCF7L2 is differentially expressed and distributed in normal human growth plate zones, and exhibits substantial variability in human exostoses in terms of staining intensity and distribution. In summary, the data indicate that there is a putative genetic connection between TCF7L2 and EXT in the context of HME.
Given this observation, we suggest that these loci could possibly modulate shared pathways, in particular with respect to β-catenin, and their respective variants interplay to influence HME pathogenesis as well as T2D.
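As a consistency check on the reported association, the P value can be approximately recovered from the odds ratio and its 95% confidence interval. The sketch below assumes the interval is a Wald interval on the log-odds scale; the abstract does not state which test was used, so this is an assumption and the function name is hypothetical.

```python
from math import erfc, log, sqrt


def wald_from_ci(odds_ratio, lower, upper, z_crit=1.959964):
    """Recover the standard error, z statistic and two-sided p-value.

    Assumes a symmetric Wald interval on the log-odds scale:
    log(OR) +/- z_crit * SE.
    """
    se = (log(upper) - log(lower)) / (2.0 * z_crit)
    z = log(odds_ratio) / se
    p = erfc(abs(z) / sqrt(2.0))  # two-sided normal tail probability
    return se, z, p


# Values reported for the T allele of rs7903146 in HME cases vs controls.
se, z, p = wald_from_ci(0.737, 0.587, 0.926)
print(f"SE(log OR) = {se:.4f}, z = {z:.3f}, p = {p:.4f}")
```

Under that assumption the recovered p-value rounds to the reported 0.009, and the negative z statistic reflects the under-representation of the T allele among cases.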
Transcription factor 7 like 2 (TCF7L2); Hereditary Multiple Exostoses (HME); osteochondroma; Exostosin (EXT); type 2 diabetes (T2D)
Motivation: Gene–gene interactions are of potential biological and medical interest, as they can shed light on both the inheritance mechanism of a trait and on the underlying biological mechanisms. Evidence of epistatic interactions has been reported in both humans and other organisms. Unlike single-locus genome-wide association studies (GWAS), which proved efficient in detecting numerous genetic loci related with various traits, interaction-based GWAS have so far produced very few reproducible discoveries. Such studies introduce a great computational and statistical burden by necessitating a large number of hypotheses to be tested including all pairs of single nucleotide polymorphisms (SNPs). Thus, many software tools have been developed for interaction-based case–control studies, some leading to reliable discoveries. For quantitative data, on the other hand, only a handful of tools exist, and the computational burden is still substantial.
Results: We present an efficient algorithm for detecting epistasis in quantitative GWAS, achieving a substantial runtime speedup by avoiding the need to exhaustively test all SNP pairs using metric embedding and random projections. Unlike previous metric embedding methods for case–control studies, we introduce a new embedding, where each SNP is mapped to two Euclidean spaces. We implemented our method in a tool named EPIQ (EPIstasis detection for Quantitative GWAS), and we show by simulations that EPIQ requires hours of processing time where other methods require days and sometimes weeks. Applying our method to a dataset from the Ludwigshafen risk and cardiovascular health study, we discovered a pair of SNPs with a near-significant interaction (P = 2.2 × 10^−13), in only 1.5 h on 10 processors.
Recent large genome-wide association studies (GWAS) have identified multiple loci
which harbor genetic variants associated with type 2 diabetes mellitus (T2D),
many of which encode proteins not previously suspected to be involved in the
pathogenesis of T2D. Most GWAS for T2D have focused on populations of European
descent, and GWAS conducted in other populations with different ancestry offer a
unique opportunity to study the genetic architecture of T2D. We performed
genome-wide association scans for T2D in 3,955 Chinese (2,010 cases, 1,945
controls), 2,034 Malays (794 cases, 1,240 controls), and 2,146 Asian Indians
(977 cases, 1,169 controls). In addition to the search for novel variants
implicated in T2D, these multi-ethnic cohorts serve to assess the
transferability and relevance of the previous findings from European descent
populations in the three major ethnic populations of Asia, comprising half of
the world's population. Of the SNPs associated with T2D in previous GWAS,
only variants at CDKAL1 and
HHEX/IDE/KIF11 showed the strongest
association with T2D in the meta-analysis including all three ethnic groups.
However, consistent direction of effect was observed for many of the other SNPs
in our study and in those carried out in European populations. Close examination
of the associations at both the CDKAL1 and
HHEX/IDE/KIF11 loci provided some evidence of locus and
allelic heterogeneity in relation to the associations with T2D. We also detected
variation in linkage disequilibrium between populations for most of these loci
that have been previously identified. These factors, combined with limited
statistical power, may contribute to the failure to detect associations across
populations of diverse ethnicity. These findings highlight the value of
surveying across diverse racial/ethnic groups towards the fine-mapping efforts
for the causal variants and also of the search for variants, which may be
Type 2 diabetes mellitus (T2D) is a chronic disease which can lead to
complications such as heart disease, stroke, hypertension, blindness due to
diabetic retinopathy, amputations from peripheral vascular diseases, and kidney
disease from diabetic nephropathy. The increasing prevalence and complications
of T2D are likely to increase the health and economic burden of individuals,
families, health systems, and countries. Our study carried out in three major
Asian ethnic groups (Chinese, Malays, and Indians) in Singapore suggests that
the findings of studies carried out in populations of European ancestry (which
represents most studies to date) may be relevant to populations in Asia.
However, our study also raises the possibility that different genes, and within
the genes different variants, may confer susceptibility to T2D in these
populations. These findings are particularly relevant in Asia, where the
greatest growth of T2D is expected in the coming years, and emphasize the
importance of studying diverse populations when trying to localize the regions
of the genome associated with T2D. In addition, we may need to consider novel
methods for combining data across populations.
A number of studies have found that BMI in early life influences the risk of developing type 2 diabetes later in life. Our goal was to investigate if any type 2 diabetes variants uncovered through genome-wide association studies (GWAS) impact BMI in childhood.
RESEARCH DESIGN AND METHODS
Using data from an ongoing GWAS of pediatric BMI in our cohort, we investigated the association of pediatric BMI with 20 single nucleotide polymorphisms at 18 type 2 diabetes loci uncovered through GWAS, consisting of ADAMTS9, CDC123-CAMK1D, CDKAL1, CDKN2A/B, EXT2, FTO, HHEX-IDE, IGF2BP2, the intragenic region on 11p12, JAZF1, KCNQ1, LOC387761, MTNR1B, NOTCH2, SLC30A8, TCF7L2, THADA, and TSPAN8-LGR5. We randomly partitioned our cohort exactly in half in order to have a discovery cohort (n = 3,592) and a replication cohort (n = 3,592).
Our data show that the major type 2 diabetes risk–conferring G allele of rs7923837 at the HHEX-IDE locus was associated with higher pediatric BMI in both the discovery (P = 0.0013 and survived correction for 20 tests) and replication (P = 0.023) sets (combined P = 1.01 × 10^−4). Association was not detected with any other known type 2 diabetes loci uncovered to date through GWAS except for the well-established FTO.
Our data show that the same genetic HHEX-IDE variant, which is associated with type 2 diabetes from previous studies, also influences pediatric BMI.
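The split into equal discovery and replication halves and the correction for 20 tests can be sketched as below. The abstract does not name the correction method or the significance level, so the Bonferroni rule and alpha = 0.05 are assumptions, as are the function names and the random seed.

```python
import numpy as np


def split_in_half(n, rng):
    """Randomly partition indices 0..n-1 into two equal halves (n even)."""
    order = rng.permutation(n)
    half = n // 2
    return order[:half], order[half:]


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# 3,592 discovery plus 3,592 replication subjects.
discovery, replication = split_in_half(2 * 3592, np.random.default_rng(0))

threshold = bonferroni_threshold(0.05, 20)  # assumed alpha and method
discovery_p = 0.0013                        # reported for rs7923837
print("threshold:", threshold)
print("discovery P survives correction:", discovery_p < threshold)
```

Under these assumptions the per-test threshold is 0.0025, which the discovery P value of 0.0013 passes; the replication P value of 0.023 is evaluated as a single test and so needs only the nominal 0.05 level.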
Genome-wide association studies (GWAS) do not provide a full account of the heritability of genetic diseases, since gene-gene interactions, also known as epistasis, are not considered in single-locus GWAS. To address this problem, a considerable number of methods have been developed for identifying disease-associated gene-gene interactions. However, these methods typically fail to identify interacting markers that explain more of the disease heritability than single-locus GWAS, since many of the interactions significant for disease are obscured by uninformative marker interactions, e.g., linkage disequilibrium (LD).
In this study, we present a novel SNP interaction prioritization algorithm, named iLOCi (Interacting Loci). This algorithm accounts for marker dependencies separately in case and control groups. Disease-associated interactions are then prioritized according to a novel ranking score calculated from the difference in marker dependencies for every possible pair between case and control groups. The analysis of a typical GWAS dataset can be completed in less than a day on a standard workstation with parallel processing capability. The proposed framework was validated using simulated data and applied to real GWAS datasets from the Wellcome Trust Case Control Consortium (WTCCC). The results from simulated data showed the ability of iLOCi to identify various types of gene-gene interactions, especially high-order interactions. From the WTCCC data, we found that among the top-ranked interacting SNP pairs, several mapped to genes previously known to be associated with disease and, interestingly, others mapped to previously unreported genes with biologically related roles.
iLOCi is a powerful tool for uncovering true disease interacting markers and thus can provide a more complete understanding of the genetic basis underlying complex disease. The program is available for download at http://www4a.biotec.or.th/GI/tools/iloci.
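The idea of ranking SNP pairs by the difference in marker dependency between cases and controls can be illustrated with a simple stand-in. The sketch below is not the iLOCi score, whose definition is not given here: it uses the absolute difference of Pearson correlations as a hypothetical dependency measure, and the function names are invented for illustration.

```python
import numpy as np


def dependency_difference(genotypes, status):
    """Absolute difference in pairwise correlation between cases and controls.

    genotypes: (n_samples, n_snps) array of minor-allele counts 0/1/2.
    status:    (n_samples,) array with 1 for cases and 0 for controls.
    A SNP that is monomorphic within a group yields NaN entries.
    """
    cases = genotypes[status == 1]
    controls = genotypes[status == 0]
    r_case = np.corrcoef(cases, rowvar=False)
    r_ctrl = np.corrcoef(controls, rowvar=False)
    return np.abs(r_case - r_ctrl)


def rank_pairs(score):
    """List (i, j, score) for all pairs i < j, highest score first."""
    n = score.shape[0]
    i, j = np.triu_indices(n, k=1)
    order = np.argsort(-score[i, j])  # NaN scores sort to the end
    return [(int(i[k]), int(j[k]), float(score[i[k], j[k]])) for k in order]


# Toy data: SNPs 0 and 1 are perfectly dependent in cases only.
rng = np.random.default_rng(2)
genotypes = rng.integers(0, 3, size=(1000, 6))
status = np.repeat([0, 1], 500)
genotypes[status == 1, 1] = genotypes[status == 1, 0]

ranked = rank_pairs(dependency_difference(genotypes, status))
print("top pair:", ranked[0][:2])
```

Dependencies that are present in both groups, such as LD between nearby markers, cancel in the difference, which is why a case-versus-control contrast can down-weight uninformative marker pairs.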
Recent advances in genetic studies have brought the number of confirmed susceptibility loci for type 2 diabetes to eighteen. In this study, we attempt to analyze the independent and joint effects of variants from these loci on type 2 diabetes and clinical phenotypes related to glucose metabolism.
Twenty-one single nucleotide polymorphisms (SNPs) from fourteen loci were successfully genotyped in 1,849 subjects with type 2 diabetes and 1,785 subjects with normal glucose regulation. We analyzed the allele and genotype distributions of these SNPs between the cases and controls, as well as the joint effects of the susceptibility loci on type 2 diabetes risk. The associations between SNPs and type 2 diabetes were examined by logistic regression. The associations between SNPs and quantitative traits were examined by linear regression. The discriminative accuracy of the prediction models was assessed by the area under the receiver operating characteristic curve. We confirmed the effects of SNPs from PPARG, KCNJ11, CDKAL1, CDKN2A-CDKN2B, IDE-KIF11-HHEX, IGF2BP2 and SLC30A8 on risk for type 2 diabetes, with odds ratios ranging from 1.114 to 1.406 (P values ranging from 0.0335 to 1.37E-12). However, no significant association was detected between type 2 diabetes and SNPs from WFS1, FTO, JAZF1, TSPAN8-LGR5, THADA, ADAMTS9 and NOTCH2-ADAM30. Analyses of the quantitative traits in the control subjects showed that THADA SNP rs7578597 was associated with 2-h insulin during oral glucose tolerance tests (P = 0.0005, empirical P = 0.0090). The joint effect analysis of SNPs from eleven loci showed that individuals carrying more risk alleles had a significantly higher risk for type 2 diabetes. In addition, type 2 diabetes patients with more risk alleles tended to be diagnosed at an earlier age (P = 0.0006).
The current study confirmed the association between PPARG, KCNJ11, CDKAL1, CDKN2A-CDKN2B, IDE-KIF11-HHEX, IGF2BP2 and SLC30A8 and type 2 diabetes. These type 2 diabetes risk loci contributed to the disease additively.
Genome-wide association studies have identified a large number of single-nucleotide polymorphisms (SNPs) that individually predispose to diseases. However, many genetic risk factors remain unaccounted for. Proteins coded by genes interact in the cell, and it is most likely that certain variants mainly affect the phenotype in combination with other variants, termed epistasis. An exhaustive search for epistatic effects is computationally demanding, as several billions of SNP pairs exist for typical genotyping chips. In this study, the experimental knowledge on biological networks is used to narrow the search for two-locus epistasis. We provide evidence that this approach is computationally feasible and statistically powerful. By applying this method to the Wellcome Trust Case–Control Consortium data sets, we report four significant cases of epistasis between unlinked loci, in susceptibility to Crohn's disease, bipolar disorder, hypertension and rheumatoid arthritis.
association studies; genome-wide scan; epistasis; biological network