We hypothesize that imputation based on data from the 1000 Genomes Project can identify novel association signals on a genome-wide scale due to the dense marker map and the large number of haplotypes. To test the hypothesis, the Wellcome Trust Case Control Consortium (WTCCC) Phase I genotype data were imputed using 1000 Genomes as reference (20100804 EUR), and seven case/control association studies were performed using imputed dosages. We observed two ‘missed’ disease-associated variants that were undetectable by the original WTCCC analysis, but were reported by later studies after the 2007 WTCCC publication. One is within the IL2RA gene for association with type 1 diabetes and the other in proximity to the CDKN2B gene for association with type 2 diabetes. We also identified two refined associations. One is SNP rs11209026 in exon 9 of IL23R for association with Crohn's disease, which is predicted to be probably damaging by PolyPhen2. The other refined variant is in the CUX2 gene region for association with type 1 diabetes, where the newly identified top SNP rs1265564 has an association P-value of 1.68 × 10−16. The new lead SNPs for the two refined loci provide a more plausible explanation for the disease association. We demonstrated that 1000 Genomes-based imputation could indeed identify both novel (in our case, ‘missed’ because they were detected and replicated by studies after 2007) and refined signals. We anticipate that the findings of this study will provide timely information as individual groups and consortia begin to engage in 1000 Genomes-based imputation.
genome-wide association study; the 1000 Genomes project; imputation
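The association tests in the abstract above are run on imputed dosages rather than hard genotype calls. As a minimal sketch of what a dosage is, the snippet below converts per-individual genotype posterior probabilities into the expected count of the alternate allele. The function names and the toy probabilities are illustrative assumptions, not taken from the study or from any imputation tool.

```python
def expected_dosage(p_ref_hom, p_het, p_alt_hom):
    """Expected alternate-allele count (between 0 and 2) for one individual,
    given posterior probabilities of the three genotypes."""
    total = p_ref_hom + p_het + p_alt_hom
    if total <= 0:
        raise ValueError("genotype probabilities must sum to a positive value")
    # Normalise in case the three probabilities do not sum exactly to 1.
    return (p_het + 2.0 * p_alt_hom) / total


def mean_dosage(genotype_probs):
    """Average dosage over a group of individuals (hypothetical helper)."""
    return sum(expected_dosage(*g) for g in genotype_probs) / len(genotype_probs)


# Toy posterior probabilities (invented for illustration only).
toy_group = [(0.1, 0.8, 0.1), (0.0, 0.2, 0.8), (0.9, 0.1, 0.0)]
print(mean_dosage(toy_group))
```

Using the expected count keeps imputation uncertainty in the test statistic instead of discarding it by rounding to the most likely genotype.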
The human leukocyte antigen (HLA) class II genes HLA-DRB1, -DQA1 and -DQB1 are the strongest genetic factors for type 1 diabetes (T1D). Additional loci in the major histocompatibility complex (MHC) are difficult to identify due to the region’s high gene density and complex linkage disequilibrium (LD). To facilitate the association analysis, two novel algorithms were implemented in this study: one for phasing the multi-allelic HLA genotypes in trio families, and one for partitioning the HLA strata in conditional testing. Screening and replication were performed on two large and independent datasets: the Wellcome Trust Case–Control Consortium (WTCCC) dataset of 2,000 cases and 1,504 controls, and the T1D Genetics Consortium (T1DGC) dataset of 2,300 nuclear families. After imputation, the two datasets have 1,941 common SNPs in the MHC, of which 22 were successfully tested and replicated based on the statistical testing stratifying on the detailed DRB1 and DQB1 genotypes. Further conditional tests using the combined dataset confirmed eight novel SNP associations around 31.3 Mb on chromosome 6 (rs3094663, p = 1.66 × 10−11 and rs2523619, p = 2.77 × 10−10 conditional on the DR/DQ genotypes). A subsequent LD analysis established TCF19, POU5F1, CCHCR1 and PSORS1C1 as potential causal genes for the observed association.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-010-0908-2) contains supplementary material, which is available to authorized users.
Although they have demonstrated success in searching for common variants for complex diseases, Genome-Wide Association (GWA) studies are less successful in detecting rare genetic variants because of the poor statistical power of most current methods. We developed a two-stage method that can be applied to GWA studies for detecting rare variants. Here we report the results of applying this two-stage method to the Wellcome Trust Case Control Consortium (WTCCC) dataset, which includes 7 complex diseases: Bipolar disorder, Cardiovascular disease, Hypertension, Rheumatoid Arthritis, Crohn’s disease, Type 1 Diabetes and Type 2 Diabetes. We identified 24 genes or regions that reach genome-wide significance. Eight of them are novel and were not reported in the WTCCC study. The cumulative risk (or protective) haplotype frequency for each of the 8 genes or regions is small, being at most 11%. For each of the novel genes, the risk (or protective) haplotype set cannot be tagged by the common SNPs available on chips (r2 < 0.32). The gene identified in hypertension was further replicated in the Framingham Heart Study (FHS), and is also significantly associated with Type 2 Diabetes. Our analysis suggests that searching for rare genetic variants is feasible in current genome-wide association studies and candidate gene studies, and the results can serve as guides to future resequencing studies to identify the underlying rare functional variants.
The aim of this study was to perform quality control (QC) and initial family-based association analyses on the major histocompatibility complex (MHC) single nucleotide polymorphism (SNP) and microsatellite marker data for the MHC Fine Mapping Workshop through the Type 1 Diabetes Genetics Consortium (T1DGC).
A random sample of blind duplicates was sent for analysis of QC. DNA samples collected from participants were shipped to the genotyping laboratory from several T1DGC DNA Repository sites. Quality checks including examination of plate-panel yield, marker yield, Hardy–Weinberg equilibrium, mismatch error rate, Mendelian error rate and allele distribution across plates were performed.
Genotypes from 2325 families within nine cohorts were obtained and subjected to QC procedures. The MHC project consisted of three marker panels – two 1536 SNP sets (Illumina Golden Gate platform performed at the Wellcome Trust Sanger Institute, Cambridge, UK) and one 66 microsatellite marker panel (performed at deCODE). In the raw SNP data, the overall concordance rate was 99.1% (±0.02).
The T1DGC MHC Fine Mapping project resulted in a database of 2300 families and 9992 genotyped individuals, comprising two 1536-SNP panels and a 66-microsatellite panel that densely cover the 4 Mb MHC core region, for use in statistical genetic analyses.
association; HLA; microsatellite; quality control; SNP; type 1 diabetes
It is generally known that risk variants segregate together with a disease within families, but this information has not been used in the existing statistical methods for detecting rare variants. Here we introduce two weighted sum statistics that can be applied to either genome-wide association data or resequencing data for identifying rare disease variants, with weights calculated from sibpairs and from odds ratios, respectively. We evaluated the two methods via extensive simulations under different disease models. We compared the proposed methods with the weighted sum statistic (WSS) proposed by Madsen and Browning, keeping the same genotyping or resequencing cost. Our methods clearly demonstrate more statistical power than the WSS. In addition, we found that using sibpair information can increase power over using only unrelated samples by more than 40%. We applied our methods to the Framingham Heart Study (FHS) and Wellcome Trust Case Control Consortium (WTCCC) hypertension datasets. Although we did not identify any genes reaching a genome-wide significance level, we found variants in the candidate gene angiotensinogen (AGT) significantly associated with hypertension at P = 6.9 × 10−4, whereas the most significant single-SNP association is P = 0.063. We further applied the odds ratio weighted method to the IFIH1 gene for type 1 diabetes in the WTCCC data. Our method yielded a P value of 4.82 × 10−4, much more significant than that obtained by haplotype-based methods. We demonstrated that family data are extremely informative in searching for rare variants underlying complex traits, and that the odds ratio weighted sum statistic is more efficient than currently existing methods.
In the presence of epistasis, multilocus association tests of human complex traits can provide powerful methods to detect susceptibility variants. We undertook multilocus analyses in 1924 type 2 diabetes cases and 2938 controls from the Wellcome Trust Case Control Consortium (WTCCC). We performed a two-dimensional genome-wide association (GWA) scan using joint two-locus tests of association including main and epistatic effects in 70,236 markers tagging common variants. We found two-locus association at 79 SNP-pairs at a Bonferroni-corrected P-value = 0.05 (uncorrected P-value = 2.14 × 10−11). The 79 pair-wise results always contained rs11196205 in TCF7L2 paired with 79 variants including confirmed variants in FTO, TSPAN8, and CDKAL1, which are associated in the absence of epistasis. However, the majority (82%) of the 79 variants did not have compelling single-locus association signals (P-value = 5 × 10−4). Analyses conditional on the single-locus effects at TCF7L2 established that the joint two-locus results could be attributed to single-locus association at TCF7L2 alone. Interaction analyses among the peak 80 regions and among 23 previously established diabetes candidate genes identified five SNP-pairs with case-control and case-only epistatic signals. Our results demonstrate the feasibility of systematic scans in GWA data, but confirm that single-locus association can underlie and obscure multilocus findings.
Epistasis; simultaneous search; joint effects; genome-wide association
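The multiple-testing burden of a two-dimensional scan can be made concrete with a little arithmetic. The sketch below assumes that every unordered pair of the 70,236 markers is tested once and applies a plain Bonferroni correction at 0.05. The function name is hypothetical, and the exact number of tests used by the authors is not given in the abstract.

```python
from math import comb


def pairwise_bonferroni_threshold(n_markers, alpha=0.05):
    """Per-test P-value threshold when all unordered marker pairs are tested."""
    n_pairs = comb(n_markers, 2)  # n * (n - 1) / 2 two-locus tests
    return alpha / n_pairs, n_pairs


threshold, n_pairs = pairwise_bonferroni_threshold(70236)
print(n_pairs)     # number of two-locus tests under the all-pairs assumption
print(threshold)   # per-test threshold, on the order of 2e-11
```

Under this assumption the threshold comes out near 2.0 × 10−11, the same order as the reported 2.14 × 10−11; the small gap would be consistent with a slightly different test count, which cannot be confirmed from the text.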
The Type I Diabetes Genetics Consortium (T1DGC) is an international, multicenter research program with two primary goals. The first goal is to identify genomic regions and candidate genes whose variants modify an individual’s risk of type I diabetes (T1D) and help explain the clustering of the disease in families. The second goal is to make research data available to the research community and to establish resources that can be used by, and that are fully accessible to, the research community. To facilitate the access to these resources, the T1DGC has developed a Consortium Agreement (http://www.t1dgc.org) that specifies the rights and responsibilities of investigators who participate in Consortium activities. The T1DGC has assembled a resource of affected sib-pair families, parent–child trios, and case–control collections with banks of DNA, serum, plasma, and EBV-transformed cell lines. In addition, both candidate gene and genome-wide (linkage and association) studies have been performed and displayed in T1DBase (http://www.t1dbase.org) for all researchers to use in their own investigations. In this supplement, a subset of the T1DGC collection has been used to investigate earlier published candidate genes for T1D, to confirm the results from a genome-wide association scan for T1D, and to determine associations with candidate genes for other autoimmune diseases or with type II diabetes that may be involved with β-cell function.
type I diabetes; autoantibodies; HLA; families; linkage; association
Purely epistatic multi-locus interactions cannot generally be detected via single-locus analysis in case-control studies of complex diseases. Recently, many two-locus and multi-locus analysis techniques have been shown to be promising for the epistasis detection. However, exhaustive multi-locus analysis requires prohibitively large computational efforts when problems involve large-scale or genome-wide data. Furthermore, there is no explicit proof that a combination of multiple two-locus analyses can lead to the correct identification of multi-locus interactions.
The proposed 2LOmb algorithm performs an omnibus permutation test on ensembles of two-locus analyses. The algorithm consists of four main steps: two-locus analysis, a permutation test, global p-value determination and a progressive search for the best ensemble. 2LOmb is benchmarked against an exhaustive two-locus analysis technique, a set association approach, a correlation-based feature selection (CFS) technique and a tuned ReliefF (TuRF) technique. The simulation results indicate that 2LOmb produces a low false-positive error. Moreover, 2LOmb has the best performance in terms of an ability to identify all causative single nucleotide polymorphisms (SNPs) and a low number of output SNPs in purely epistatic two-, three- and four-locus interaction problems. The interaction models constructed from the 2LOmb outputs via a multifactor dimensionality reduction (MDR) method are also included for the confirmation of epistasis detection. 2LOmb is subsequently applied to a type 2 diabetes mellitus (T2D) data set, which is obtained as a part of the UK genome-wide genetic epidemiology study by the Wellcome Trust Case Control Consortium (WTCCC). After primarily screening for SNPs that locate within or near 372 candidate genes and exhibit no marginal single-locus effects, the T2D data set is reduced to 7,065 SNPs from 370 genes. The 2LOmb search in the reduced T2D data reveals that four intronic SNPs in PGM1 (phosphoglucomutase 1), two intronic SNPs in LMX1A (LIM homeobox transcription factor 1, alpha), two intronic SNPs in PARK2 (Parkinson disease (autosomal recessive, juvenile) 2, parkin) and three intronic SNPs in GYS2 (glycogen synthase 2 (liver)) are associated with the disease. The 2LOmb result suggests that there is no interaction between each pair of the identified genes that can be described by purely epistatic two-locus interaction models. 
Moreover, there are no interactions between these four genes that can be described by purely epistatic multi-locus interaction models with marginal two-locus effects. The findings provide an alternative explanation for the aetiology of T2D in a UK population.
An omnibus permutation test on ensembles of two-locus analyses can detect purely epistatic multi-locus interactions with marginal two-locus effects. The study also reveals that SNPs from large-scale or genome-wide case-control data which are discarded after single-locus analysis detects no association can still be useful for genetic epidemiology studies.
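To illustrate the idea of an omnibus permutation test over an ensemble of two-locus analyses, here is a deliberately simplified sketch. It is not the 2LOmb implementation: it assumes a chi-square test on the 9 joint genotype cells as the two-locus analysis, uses the maximum statistic over the ensemble as the omnibus statistic, and omits the progressive search for the best ensemble. All function names and the toy data are invented for illustration.

```python
import random


def chi2_2xk(case_counts, control_counts):
    """Pearson chi-square for a 2 x k table (both groups must be non-empty)."""
    n_case, n_ctrl = sum(case_counts), sum(control_counts)
    n = n_case + n_ctrl
    stat = 0.0
    for a, b in zip(case_counts, control_counts):
        col = a + b
        if col == 0:
            continue  # empty genotype cell contributes nothing
        exp_a, exp_b = col * n_case / n, col * n_ctrl / n
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat


def two_locus_stat(g1, g2, labels):
    """Chi-square over the 9 joint genotype cells of two SNPs coded 0/1/2."""
    case, ctrl = [0] * 9, [0] * 9
    for a, b, y in zip(g1, g2, labels):
        (case if y == 1 else ctrl)[3 * a + b] += 1
    return chi2_2xk(case, ctrl)


def omnibus_permutation_p(genotypes, pairs, labels, n_perm=200, seed=0):
    """Global p-value: how often the permuted max statistic reaches the observed one."""
    rng = random.Random(seed)

    def max_stat(lab):
        return max(two_locus_stat(genotypes[i], genotypes[j], lab) for i, j in pairs)

    observed = max_stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        if max_stat(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: status depends only on whether SNP 0 and SNP 1 differ (no marginal effect).
snp0 = [0] * 10 + [1] * 10 + [0] * 10 + [1] * 10
snp1 = [0] * 10 + [1] * 10 + [1] * 10 + [0] * 10
snp2 = [0, 1] * 20  # unrelated marker
labels = [0] * 20 + [1] * 20
p_global = omnibus_permutation_p([snp0, snp1, snp2], [(0, 1), (0, 2), (1, 2)], labels)
print(p_global)
```

In the toy data each of the first two SNPs is evenly split between cases and controls on its own, so a single-locus test would see nothing, while the joint table separates the groups.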
It has been postulated that multiple-marker methods may have added ability, over single-marker methods, to detect genetic variants associated with disease. The Wellcome Trust Case Control Consortium (WTCCC) provided the first successful large genome-wide association studies (GWAS), which included single-marker association analyses for seven common complex diseases. Of those signals detected, only one was associated with coronary artery disease (CAD), and none were identified for hypertension (HTN). Our objective was to find additional genetic associations and pathways for cardiovascular disease by examining the WTCCC data for variants associated with CAD and HTN using two-marker testing methods. We applied two-marker association testing to the WTCCC dataset, which includes ~2,000 affected individuals with each disorder and a shared pool of ~3,000 controls, all genotyped using Affymetrix GeneChip 500K arrays. For CAD, we detected single nucleotide polymorphism (SNP) pairs in three genes showing genome-wide significance: HFE2, STK32B, and DIPC2. The most notable SNP pairs in a non-protein-coding region were at 9p21, a known major CAD-associated region. For HTN, we detected SNP pairs in five genes: GPR39, XRCC4, MYO6, ZFAT, and MACROD2. Four further associated SNP pair regions were at least 70 kb from any known gene. We have shown that novel, multiple-marker, statistical methods can be of use in finding variants in GWAS. We describe many new, associated variants for both CAD and HTN and describe their known genetic mechanisms.
Pathway analyses of genome-wide association studies aggregate information over sets of related genes, such as genes in common pathways, to identify gene sets that are enriched for variants associated with disease. We develop a model-based approach to pathway analysis, and apply this approach to data from the Wellcome Trust Case Control Consortium (WTCCC) studies. Our method offers several benefits over existing approaches. First, our method not only interrogates pathways for enrichment of disease associations, but also estimates the level of enrichment, which yields a coherent way to promote variants in enriched pathways, enhancing discovery of genes underlying disease. Second, our approach allows for multiple enriched pathways, a feature that leads to novel findings in two diseases where the major histocompatibility complex (MHC) is a major determinant of disease susceptibility. Third, by modeling disease as the combined effect of multiple markers, our method automatically accounts for linkage disequilibrium among variants. Interrogation of pathways from eight pathway databases yields strong support for enriched pathways, indicating links between Crohn's disease (CD) and cytokine-driven networks that modulate immune responses; between rheumatoid arthritis (RA) and “Measles” pathway genes involved in immune responses triggered by measles infection; and between type 1 diabetes (T1D) and IL2-mediated signaling genes. Prioritizing variants in these enriched pathways yields many additional putative disease associations compared to analyses without enrichment. For CD and RA, 7 of 8 additional non-MHC associations are corroborated by other studies, providing validation for our approach. For T1D, prioritization of IL-2 signaling genes yields strong evidence for 7 additional non-MHC candidate disease loci, as well as suggestive evidence for several more. 
Of the 7 strongest associations, 4 are validated by other studies, and 3 (near IL-2 signaling genes RAF1, MAPK14, and FYN) constitute novel putative T1D loci for further study.
Genome-wide association studies have helped locate gene variants that affect our susceptibility to diseases. The analysis of these studies is typically straightforward: test whether each genetic variant is correlated with predisposition to disease. This approach often works well for identifying commonly occurring variants with moderate effects on disease risk. However, the effects of many variants are so small that they fail to register statistically significant correlations. This is a concern because many diseases are modulated by many genetic factors with small effects on disease risk. An alternative is to examine groups of variants, such as variants sharing a common pathway, and assess whether these groups are “enriched” for correlations with disease. This can be a more effective approach to identifying genetic factors relevant to disease. However, it does not tell us which genes are associated with disease. To address this limitation, we describe an approach that integrates enrichment analysis with tests for disease-variant correlations within a single framework. We illustrate this approach in genome-wide studies of seven complex diseases. We show that our approach supports enriched pathways in several diseases, and uncovers disease-susceptibility genes in these pathways not identified in conventional analyses of the same data.
Recent genome-wide association studies have resulted in a dramatic increase in our knowledge of the genetic loci involved in type 2 diabetes. In a complementary approach to these single-marker studies, we attempted to identify biological pathways associated with type 2 diabetes. This approach could allow us to identify additional risk loci.
RESEARCH DESIGN AND METHODS
We used individual level genotype data generated from the Wellcome Trust Case Control Consortium (WTCCC) type 2 diabetes study, consisting of 393,143 autosomal SNPs, genotyped across 1,924 case subjects and 2,938 control subjects. We sought additional evidence from summary level data available from the Diabetes Genetics Initiative (DGI) and the Finland-United States Investigation of NIDDM Genetics (FUSION) studies. Statistical analysis of pathways was performed using a modification of the Gene Set Enrichment Algorithm (GSEA). A total of 439 pathways were analyzed from the Kyoto Encyclopedia of Genes and Genomes, Gene Ontology, and BioCarta databases.
After correcting for the number of pathways tested, we found no strong evidence for any pathway showing association with type 2 diabetes (top Padj = 0.31). The candidate WNT-signaling pathway ranked top (nominal P = 0.0007, excluding TCF7L2; P = 0.002), containing a number of promising single gene associations. These include CCND2 (rs11833537; P = 0.003), SMAD3 (rs7178347; P = 0.0006), and PRICKLE1 (rs1796390; P = 0.001), all expressed in the pancreas.
Common variants involved in type 2 diabetes risk are likely to occur in or near genes in multiple pathways. Pathway-based approaches to genome-wide association data may be more successful for some complex traits than others, depending on the nature of the underlying disease physiology.
The Wellcome Trust Case Control Consortium (WTCCC) primary genome-wide association (GWA) scan (ref. 1) on seven diseases, including the multifactorial, autoimmune disease type 1 diabetes (T1D), shows significant association (P < 5 × 10−7) between T1D and six chromosome regions: 12q24, 12q13, 16p13, 18p11, 12p13 and 4q27. Here, we attempted to validate these and six other top findings in 4,000 individuals with T1D, 5,000 controls and 2,997 family trios that were independent of the WTCCC study. We confirmed unequivocally the associations of 12q24, 12q13, 16p13 and 18p11 (Pfollow-up ≤ 1.35 × 10−9; Poverall ≤ 1.15 × 10−14), leaving eight regions with small effects or false-positive associations with T1D. We also obtained evidence for chromosome 18q22 (Poverall = 1.38 × 10−8) from a genome-wide association study of nonsynonymous SNPs. Several regions, including 18q22 and 18p11, showed association with autoimmune thyroid disease. This study increases the number of T1D loci with compelling evidence from six to at least ten.
Motivation: Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available.
Results: In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1–5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case–control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ2 association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.
Availability and implementation: Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/.
Supplementary materials are available at Bioinformatics online.
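The core of Gaussian imputation from summary statistics can be sketched as a conditional mean under a multivariate normal model: the z-score at an untyped SNP is predicted from the z-scores at typed SNPs using LD (correlation) estimated in a reference panel. The code below is a minimal standard-library sketch of that idea, not the released software. The ridge term added to the typed-SNP LD matrix is an assumed, simple way to reflect the limited size of the reference panel, and all names and numbers are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def impute_z(ld_typed, ld_cross, z_typed, ridge=0.1):
    """Conditional-mean z-score at one untyped SNP.

    ld_typed: LD matrix among typed SNPs (from a reference panel)
    ld_cross: LD between the untyped SNP and each typed SNP
    ridge:    assumed regularisation added to the diagonal of ld_typed
    Returns the imputed z-score and its variance under the null.
    """
    n = len(z_typed)
    reg = [[ld_typed[i][j] + (ridge if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    w = solve(reg, ld_cross)  # weights w = (Sigma_tt + ridge * I)^-1 Sigma_ti
    z = sum(wi * zi for wi, zi in zip(w, z_typed))
    var = sum(w[i] * ld_typed[i][j] * w[j] for i in range(n) for j in range(n))
    return z, var


# Toy example: two uncorrelated typed SNPs, each with LD 0.5 to the untyped SNP.
z_hat, var_hat = impute_z([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5], [2.0, 4.0], ridge=0.0)
print(z_hat, var_hat)
```

The returned variance is below 1 whenever the typed SNPs tag the target imperfectly, which is why an imputed z-score is usually rescaled or reported together with an accuracy measure before being interpreted as a test statistic.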
Phenotypic misclassification (between cases) has been shown to reduce the power to detect association in genetic studies. However, it is conceivable that complex traits are heterogeneous with respect to individual genetic susceptibility and disease pathophysiology, and that the effect of heterogeneity has a larger magnitude than the effect of phenotyping errors. Although an intuitively clear concept, the effect of heterogeneity on genetic studies of common diseases has received little attention. Here we investigate the impact of phenotypic and genetic heterogeneity on the statistical power of genome wide association studies (GWAS). We first performed a study of simulated genotypic and phenotypic data. Next, we analyzed the Wellcome Trust Case-Control Consortium (WTCCC) data for diabetes mellitus (DM) type 1 (T1D) and type 2 (T2D), using varying proportions of each type of diabetes in order to examine the impact of heterogeneity on the strength and statistical significance of association previously found in the WTCCC data. In both simulated and real data, heterogeneity (presence of “non-cases”) reduced the statistical power to detect genetic association and greatly decreased the estimates of risk attributed to genetic variation. This finding was also supported by the analysis of loci validated in subsequent large-scale meta-analyses. For example, heterogeneity of 50% increases the required sample size by approximately three times. These results suggest that accurate phenotype delineation may be more important for detecting true genetic associations than increase in sample size.
Genome-wide association studies (GWAS) do not provide a full account of the heritability of genetic diseases, since gene-gene interactions, also known as epistasis, are not considered in single-locus GWAS. To address this problem, a considerable number of methods have been developed for identifying disease-associated gene-gene interactions. However, these methods typically fail to identify interacting markers that explain more of the disease heritability than single-locus GWAS, since many of the interactions significant for disease are obscured by uninformative marker interactions, e.g., linkage disequilibrium (LD).
In this study, we present a novel SNP interaction prioritization algorithm, named iLOCi (Interacting Loci). This algorithm accounts for marker dependencies separately in case and control groups. Disease-associated interactions are then prioritized according to a novel ranking score calculated from the difference in marker dependencies for every possible pair between case and control groups. The analysis of a typical GWAS dataset can be completed in less than a day on a standard workstation with parallel processing capability. The proposed framework was validated using simulated data and applied to real GWAS datasets using the Wellcome Trust Case Control Consortium (WTCCC) data. The results from simulated data showed the ability of iLOCi to identify various types of gene-gene interactions, especially for high-order interaction. From the WTCCC data, we found that among the top ranked interacting SNP pairs, several mapped to genes previously known to be associated with disease, and interestingly, other previously unreported genes with biologically related roles.
iLOCi is a powerful tool for uncovering true disease interacting markers and thus can provide a more complete understanding of the genetic basis underlying complex disease. The program is available for download at http://www4a.biotec.or.th/GI/tools/iloci.
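The ranking idea described above, scoring each SNP pair by how differently the two markers depend on each other in cases versus controls, can be sketched as follows. This is not the iLOCi code: the abstract does not specify the dependency measure, so Pearson correlation of genotype codes is used here purely as a stand-in, and the function names and toy genotypes are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length genotype vectors (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def rank_pairs(case_geno, ctrl_geno):
    """Rank SNP pairs by |dependency in cases - dependency in controls|.

    case_geno / ctrl_geno: dict mapping SNP name -> list of genotype codes (0/1/2).
    """
    scores = []
    for a, b in combinations(sorted(case_geno), 2):
        diff = abs(pearson(case_geno[a], case_geno[b])
                   - pearson(ctrl_geno[a], ctrl_geno[b]))
        scores.append((diff, a, b))
    return sorted(scores, reverse=True)


# Toy genotypes: snpA and snpB move together in cases but oppositely in controls.
cases = {"snpA": [0, 1, 2, 0, 1, 2], "snpB": [0, 1, 2, 0, 1, 2], "snpC": [1, 1, 0, 2, 0, 2]}
controls = {"snpA": [0, 1, 2, 0, 1, 2], "snpB": [2, 1, 0, 2, 1, 0], "snpC": [1, 1, 0, 2, 0, 2]}
ranking = rank_pairs(cases, controls)
print(ranking[0])
```

Because dependencies shared by both groups cancel in the difference, background correlation such as LD that is present in cases and controls alike contributes little to the score.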
OBJECTIVE— The Type 1 Diabetes Genetics Consortium (T1DGC) has assembled and genotyped a large collection of multiplex families for the purpose of mapping genomic regions linked to type 1 diabetes. In the current study, we tested for evidence of loci associated with type 1 diabetes utilizing genome-wide linkage scan data and family-based association methods.
RESEARCH DESIGN AND METHODS— A total of 2,496 multiplex families with type 1 diabetes were genotyped with a panel of 6,090 single nucleotide polymorphisms (SNPs). Evidence of association to disease was evaluated by the pedigree disequilibrium test. Significant results were followed up by genotyping and analyses in two independent sets of samples: 2,214 parent-affected child trio families and a panel of 7,721 case and 9,679 control subjects.
RESULTS— Three of the SNPs most strongly associated with type 1 diabetes localized to previously identified type 1 diabetes risk loci: INS, IFIH1, and KIAA0350. A fourth strongly associated SNP, rs876498 (P = 1.0 × 10−4), occurred in the sixth intron of the UBASH3A locus at chromosome 21q22.3. Support for this disease association was obtained in two additional independent sample sets: families with type 1 diabetes (odds ratio [OR] 1.06 [95% CI 1.00–1.11]; P = 0.023) and case and control subjects (1.14 [1.09–1.19]; P = 7.5 × 10−8).
CONCLUSIONS— The T1DGC 6K SNP scan and follow-up studies reported here confirm previously reported type 1 diabetes associations at INS, IFIH1, and KIAA0350 and identify an additional disease association on chromosome 21q22.3 in the UBASH3A locus (OR 1.10 [95% CI 1.07–1.13]; P = 4.4 × 10−12). This gene and its flanking regions are now validated targets for further resequencing, genotyping, and functional studies in type 1 diabetes.
Motivation: Genome-wide association studies (GWASs) are commonly applied on human genomic data to understand the causal gene combinations statistically connected to certain diseases. Patients involved in these GWASs could be re-identified when the studies release statistical information on a large number of single-nucleotide polymorphisms. Subsequent work, however, found that such privacy attacks are theoretically possible but unsuccessful and unconvincing in real settings.
Results: We derive the first practical privacy attack that can successfully identify specific individuals from limited published associations from the Wellcome Trust Case Control Consortium (WTCCC) dataset. For GWAS results computed over 25 randomly selected loci, our algorithm always pinpoints at least one patient from the WTCCC dataset. Moreover, the number of re-identified patients grows rapidly with the number of published genotypes. Finally, we discuss prevention methods to disable the attack, thus providing a solution for enhancing patient privacy.
Availability and implementation: Proofs of the theorems and additional experimental results are available in the online supporting documents. The attack algorithm code is publicly available at https://sites.google.com/site/zhangzhenjie/GWAS_attack.zip. The genomic dataset used in the experiments is available at http://www.wtccc.org.uk/ on request.
Supplementary data are available from Bioinformatics online.
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration, which selects SNPs within functional modules from genome-wide SNPs based on expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected using optimal filtration criteria, with an error rate of 10.25%. Matching the 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration do not differ significantly from the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
This article presents the ability of an omnibus permutation test on ensembles of two-locus analyses (2LOmb) to detect pure epistasis in the presence of genetic heterogeneity. The performance of 2LOmb is evaluated in various simulation scenarios covering two independent causes of complex disease where each cause is governed by a purely epistatic interaction. Different scenarios are set up by varying the number of available single nucleotide polymorphisms (SNPs) in data, number of causative SNPs and ratio of case samples from two affected groups. The simulation results indicate that 2LOmb outperforms multifactor dimensionality reduction (MDR) and random forest (RF) techniques in terms of a low number of output SNPs and a high number of correctly identified causative SNPs. Moreover, 2LOmb is capable of identifying the number of independent interactions in tractable computational time and can be used in genome-wide association studies. 2LOmb is subsequently applied to a type 1 diabetes mellitus (T1D) data set, which is collected from a UK population by the Wellcome Trust Case Control Consortium (WTCCC). After screening for SNPs that are located within or near genes and exhibit no marginal single-locus effects, the T1D data set is reduced to 95,991 SNPs from 12,146 genes. The 2LOmb search in the reduced T1D data set reveals that 12 SNPs, which can be divided into two independent sets, are associated with the disease. The first SNP set consists of three SNPs from MUC21 (mucin 21, cell surface associated), three SNPs from MUC22 (mucin 22), two SNPs from PSORS1C1 (psoriasis susceptibility 1 candidate 1) and one SNP from TCF19 (transcription factor 19). A four-locus interaction between these four genes is also detected. The second SNP set consists of three SNPs from ATAD1 (ATPase family, AAA domain containing 1).
Overall, the findings indicate the detection of pure epistasis in the presence of genetic heterogeneity and provide an alternative explanation for the aetiology of T1D in the UK population.
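The general idea of a permutation test over an ensemble of two-locus analyses can be sketched as follows. The abstract does not specify the test statistic, the ensemble construction or the multiple-testing correction used by 2LOmb, so this is a generic max-statistic permutation test under stated assumptions, not the 2LOmb algorithm itself: each SNP pair is scored with a Pearson chi-square on the joint-genotype by case/control table, the omnibus statistic is the maximum over all pairs, and the null distribution comes from shuffling case/control labels. All function names and the defaults `n_perm=200` and `seed=0` are hypothetical.

```python
import random
from itertools import combinations


def two_locus_chi2(g1, g2, labels):
    """Pearson chi-square for a (joint genotype) x (control/case) table.
    labels must be 0 (control) or 1 (case)."""
    counts = {}  # joint genotype -> [n_controls, n_cases]
    for a, b, y in zip(g1, g2, labels):
        counts.setdefault((a, b), [0, 0])[y] += 1
    n = len(labels)
    n_case = sum(labels)
    col_totals = [n - n_case, n_case]
    chi2 = 0.0
    for row in counts.values():
        row_total = row[0] + row[1]
        for k in (0, 1):
            expected = row_total * col_totals[k] / n
            if expected > 0:
                chi2 += (row[k] - expected) ** 2 / expected
    return chi2


def max_pair_stat(genotypes, labels):
    """Omnibus statistic: the largest two-locus chi-square over all SNP pairs."""
    return max(
        two_locus_chi2(genotypes[i], genotypes[j], labels)
        for i, j in combinations(range(len(genotypes)), 2)
    )


def omnibus_permutation_p(genotypes, labels, n_perm=200, seed=0):
    """Permutation p-value for the omnibus statistic (labels are shuffled)."""
    rng = random.Random(seed)
    observed = max_pair_stat(genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_pair_stat(genotypes, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy example of pure epistasis: disease status is the XOR of two binary loci,
# so neither locus shows a marginal effect on its own.
snp_a = [0] * 20 + [1] * 20
snp_b = ([0] * 10 + [1] * 10) * 2
snp_c = [0, 1] * 20  # unrelated SNP
status = [a ^ b for a, b in zip(snp_a, snp_b)]
p_value = omnibus_permutation_p([snp_a, snp_b, snp_c], status)
```

Because the maximum is taken over all pairs inside each permutation, the resulting p-value accounts for the number of pairs tested. The cost grows quadratically with the number of SNPs, which is one reason a pre-screening step such as the one described above is useful before an exhaustive pairwise search.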
Attribute selection; Complex disease; Epistasis; Genetic heterogeneity; Genome-wide association study; Pattern recognition; Permutation test; Single nucleotide polymorphism; Type 1 diabetes mellitus
Rheumatoid arthritis (RA) is an archetypal, common, complex autoimmune disease with both genetic and environmental contributions to disease aetiology. Two novel RA susceptibility loci have been reported from recent genome-wide and candidate gene association studies. We, therefore, investigated the evidence for association of the STAT4 and TRAF1/C5 loci with RA using imputed data from the Wellcome Trust Case Control Consortium (WTCCC). No evidence for association of variants mapping to the TRAF1/C5 gene was detected in the 1860 RA cases and 2930 control samples tested in that study. Variants mapping to the STAT4 gene did show evidence for association (rs7574865, P = 0.04). Given the association of the TRAF1/C5 locus in two previous large case–control series from populations of European descent and the evidence for association of the STAT4 locus in the WTCCC study, single nucleotide polymorphisms mapping to these loci were tested for association with RA in an independent UK series comprising DNA from >3000 cases with disease and >3000 controls and a combined analysis including the WTCCC data was undertaken. We confirm association of the STAT4 and the TRAF1/C5 loci with RA bringing to 5 the number of confirmed susceptibility loci. The effect sizes are less than those reported previously but are likely to be a more accurate reflection of the true effect size given the larger size of the cohort investigated in the current study.
Gene discovery, estimation of heritability captured by SNP arrays, inference on genetic architecture and prediction analyses of complex traits are usually performed using different statistical models and methods, leading to inefficiency and loss of power. Here we use a Bayesian mixture model that simultaneously allows variant discovery, estimation of genetic variance explained by all variants and prediction of unobserved phenotypes in new samples. We apply the method to simulated data of quantitative traits and Wellcome Trust Case Control Consortium (WTCCC) data on disease and show that it provides accurate estimates of SNP-based heritability, produces unbiased estimators of risk in new samples, and that it can estimate genetic architecture by partitioning variation across hundreds to thousands of SNPs. We estimated that, depending on the trait, 2,633 to 9,411 SNPs explain all of the SNP-based heritability in the WTCCC diseases. The majority of those SNPs (>96%) had small effects, confirming a substantial polygenic component to common diseases. The proportion of the SNP-based variance explained by large effects (each SNP explaining 1% of the variance) varied markedly between diseases, ranging from almost zero for bipolar disorder to 72% for type 1 diabetes. Prediction analyses demonstrate that for diseases with major loci, such as type 1 diabetes and rheumatoid arthritis, Bayesian methods outperform profile scoring or mixed model approaches.
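For readers unfamiliar with the profile-scoring baseline mentioned above, a minimal sketch is given here. It shows only the generic form of a profile (polygenic) score, a weighted sum of allele dosages, and says nothing about how the Bayesian mixture model estimates its effects. The function name and the example effect sizes are hypothetical.

```python
def profile_score(dosages, effects):
    """Profile score for one individual: sum over SNPs of dosage * effect size.

    dosages -- risk-allele counts or imputed dosages (0 to 2) for each SNP
    effects -- per-SNP effect estimates (for example log odds ratios) taken
               from a discovery analysis; values below are made up
    """
    if len(dosages) != len(effects):
        raise ValueError("dosages and effects must have the same length")
    return sum(d * b for d, b in zip(dosages, effects))


# Illustrative values only: three SNPs for a single individual.
score = profile_score([2, 1, 0], [0.5, -0.2, 0.1])
```

In a profile-scoring analysis the effects are usually single-SNP estimates restricted to SNPs passing a p-value threshold, whereas a joint model estimates all SNP effects simultaneously. Which SNPs and weights were used in the comparison is not stated in the abstract.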
Most genome-wide association studies performed to date have focused on testing individual genetic markers for associations with phenotype. Recently, methods that analyse the joint effects of multiple markers on genetic variation have provided further insights into the genetic basis of complex human traits. In addition, there is increasing interest in using genotype data for genetic risk prediction of disease. Often disparate analytical methods are used for each of these tasks. We propose a flexible novel approach that simultaneously identifies susceptibility loci, infers the genetic architecture and provides polygenic risk prediction in the same statistical model. We illustrate the broad applicability of the approach by considering both simulated and real data. In the analysis of seven common diseases we show large differences in the proportion of genetic variation due to loci with different effect sizes and differences in prediction accuracy between complex traits. These findings are important for future studies and the understanding of the complex genetic architecture of common diseases.
To identify susceptibility alleles associated with rheumatoid arthritis, we genotyped 397 individuals with rheumatoid arthritis for 116,204 SNPs and carried out an association analysis in comparison to publicly available genotype data for 1,211 related individuals from the Framingham Heart Study1. After evaluating and adjusting for technical and population biases, we identified a SNP at 6q23 (rs10499194, ∼150 kb from TNFAIP3 and OLIG3) that was reproducibly associated with rheumatoid arthritis both in the genome-wide association (GWA) scan and in 5,541 additional case-control samples (P = 10−3, GWA scan; P < 10−6, replication; P = 10−9, combined). In a concurrent study, the Wellcome Trust Case Control Consortium (WTCCC) has reported strong association of rheumatoid arthritis susceptibility to a different SNP located 3.8 kb from rs10499194 (rs6920220; P = 5 × 10−6 in WTCCC)2. We show that these two SNP associations are statistically independent, are each reproducible in the comparison of our data and WTCCC data, and define risk and protective haplotypes for rheumatoid arthritis at 6q23.
The advent of next-generation sequencing technologies affords the ability to sequence thousands of subjects cost-effectively, and is revolutionizing the landscape of genetic research. With the evolving genotyping/sequencing technologies, it is not unrealistic to expect that we will soon obtain a pair of fully phased diploid genome sequences from each subject. Here, in light of this potential, we propose an analytic framework called recursive organizer (ROR), which recursively groups sequence variants, based upon sequence similarities and their empirical disease associations, into fewer and potentially more interpretable super sequence variants (SSV). As an illustration, we applied ROR to assess an association between HLA-DRB1 and type 1 diabetes (T1D), discovering SSVs of HLA-DRB1 with sequence data from the Wellcome Trust Case Control Consortium (WTCCC). Specifically, ROR reduces 36 observed unique HLA-DRB1 sequences into 8 SSVs that empirically associate with T1D, a four-fold reduction of sequence complexity. Using HLA-DRB1 data from the Type 1 Diabetes Genetics Consortium (T1DGC) as cases and data from the Fred Hutchinson Cancer Research Center as controls, we are able to validate associations of these SSVs with T1D. Further, SSVs consist of nine nucleotides, and each associates with its corresponding amino acids. Detailed examination of these selected amino acids reveals their potential functional roles in protein structures and possible implications for the mechanism of T1D.
Gene; DNA sequence; HLA; Structural variations; Type 1 diabetes
While genome-wide association studies (GWASs) have been successful in identifying novel variants associated with various diseases, it has been much more difficult to determine the biological mechanisms underlying these associations. Expression quantitative trait loci (eQTL) provide another dimension to these data by associating single nucleotide polymorphisms (SNPs) with gene expression. We hypothesised that integrating SNPs known to be associated with type 2 diabetes with eQTLs and coexpression networks would enable the discovery of novel candidate genes for type 2 diabetes.
We selected 32 SNPs associated with type 2 diabetes in two or more independent GWASs. We used previously described eQTLs mapped from genotype and gene expression data collected from 1,008 morbidly obese patients to find genes with expression associated with these SNPs. We linked these genes to coexpression modules, and ranked the other genes in these modules using an inverse sum score.
We found 62 genes with expression associated with type 2 diabetes SNPs. We validated our method by linking highly ranked genes in the coexpression modules back to SNPs through a combined eQTL dataset. We showed that the eQTLs highlighted by this method are significantly enriched for association with type 2 diabetes in data from the Wellcome Trust Case Control Consortium (WTCCC, p = 0.026) and the Gene Environment Association Studies (GENEVA, p = 0.042), validating our approach. Many of the highly ranked genes are also involved in the regulation or metabolism of insulin, glucose or lipids.
We have devised a novel method, involving the integration of datasets of different modalities, to discover novel candidate genes for type 2 diabetes.
Genetics of type 2 diabetes; Genomics/proteomics; Mathematical modelling and simulation
The Type I Diabetes Genetics Consortium (T1DGC) is an international collaboration whose primary goal is to identify genes whose variants modify an individual’s risk of type I diabetes (T1D). An integral part of the T1DGC’s mission is the establishment of clinical and data resources that can be used by, and that are fully accessible to, the T1D research community (http://www.t1dgc.org). The T1DGC has organized the collection and analyses of study samples and conducted several major research projects focused on T1D gene discovery: a genome-wide linkage scan, an intensive evaluation of the human major histocompatibility complex, a detailed examination of published candidate genes, and a genome-wide association scan. These studies have provided important information to the scientific community regarding the function of specific genes or chromosomal regions on T1D risk. The results are continually being updated and displayed (http://www.t1dbase.org). The T1DGC welcomes all investigators interested in using these data for scientific endeavors on T1D. The T1DGC resources provide a framework for future research projects, including examination of structural variation, re-sequencing of candidate regions in a search for T1D-associated genes and causal variants, correlation of T1D risk genotypes with biomarkers obtained from T1DGC serum and plasma samples, and in-depth bioinformatics analyses.
type I diabetes; sequence analysis; HLA; structural variants; expression