Motivation: Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
Results: We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
Availability: The software implementing GWASelect is available at http://www.bios.unc.edu/~lin.
Access to WTCCC data: http://www.wtccc.org.uk/
Supplementary information: Supplementary data are available at Bioinformatics Online.
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration that selects SNPs within functional modules from genome-wide SNPs based expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset are selected using optimal filtration criteria, with an error rate of 10.25%. Matching 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration record no significance compared to the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
Genome-wide association studies (GWAS) test for disease-trait associations and estimate effect sizes at tag single-nucleotide polymorphisms (SNPs), which imperfectly capture variation at causal SNPs. Sequencing studies can examine potential causal SNPs directly; however, sequencing the whole genome or exome can be prohibitively expensive. Costs can be limited by using a GWAS to detect the associated region(s) at tag SNPs followed by targeted sequencing to identify and estimate the effect size of the causal variant. Genetic effect estimates obtained from association studies can be inflated because of a form of selection bias known as the winner’s curse. Conversely, estimates at tag SNPs can be attenuated compared to the causal SNP because of incomplete linkage disequilibrium. These two effects oppose each other. Analysis of rare SNPs further complicates our understanding of the winner’s curse because rare SNPs are difficult to tag and analysis can involve collapsing over multiple rare variants. In two-stage analysis of Genetic Analysis Workshop 17 simulated data sets, we find that selection at the tag SNP produces upward bias in the estimate of effect at the causal SNP, even when the tag and causal SNPs are not well correlated. The bias similarly carries through to effect estimates for rare variant summary measures. Replication studies designed with sample sizes computed using biased estimates will be under-powered to detect a disease-causing variant. Accounting for bias in the original study is critical to avoid discarding disease-associated SNPs at follow up.
Researchers continue to use genome-wide association studies (GWAS) to find the genetic markers associated with disease. Recent studies have added to the typical two-stage analysis a third stage that uses targeted resequencing on a randomly selected subset of the cases to detect the causal single-nucleotide polymorphism (SNP). We propose a design for targeted resequencing that increases the power to detect the causal variant. The design features an ascertainment scheme wherein only those cases with the presence of a risk allele are selected for targeted resequencing. We simulated a disease with a single causal SNP to evaluate our method versus a targeted resequencing design using randomly selected individuals. The simulation studies showed that ascertaining individuals for the targeted resequencing can substantially increase the power to detect a causal SNP, without increasing the false-positive rate.
Ascertainment; Genome-Wide Association Study; Causal Polymorphism; Targeted Resequencing
Genome-wide association studies (GWAS) are now feasible for studying the genetics underlying complex diseases. For many diseases, a list of candidate genes or regions exists and incorporation of such information into data analyses can potentially improve the power to detect disease variants. Traditional approaches for assessing the overall statistical significance of GWAS results ignore such information by inherently treating all markers equally.
We propose the prioritized subset analysis (PSA), in which a prioritized subset of markers is pre-selected from candidate regions, and the false discovery rate (FDR) procedure is carried out in the prioritized subset and its complementary subset, respectively.
The PSA is more powerful than the whole-genome single-step FDR adjustment for a range of alternative models. The degree of power improvement depends on the fraction of associated SNPs in the prioritized subset and their nominal power, with higher fraction of associated SNPs and higher nominal power leading to more power improvement. The power improvement can be substantial; for disease loci not included in the prioritized subset, the power loss is almost negligible.
The PSA has the flexibility of allowing investigators to combine prior information from a variety of sources, and will be a useful tool for GWAS.
Association analysis; False discovery rate; HapMap
Many complex diseases are influenced by genetic variations in multiple genes, each with only a small marginal effect on disease susceptibility. Pathway analysis, which identifies biological pathways associated with disease outcome, has become increasingly popular for genome-wide association studies (GWAS). In addition to combining weak signals from a number of SNPs in the same pathway, results from pathway analysis also shed light on the biological processes underlying disease. We propose a new pathway-based analysis method for GWAS, the supervised principal component analysis (SPCA) model. In the proposed SPCA model, a selected subset of SNPs most associated with disease outcome is used to estimate the latent variable for a pathway. The estimated latent variable for each pathway is an optimal linear combination of a selected subset of SNPs; therefore, the proposed SPCA model provides the ability to borrow strength across the SNPs in a pathway. In addition to identifying pathways associated with disease outcome, SPCA also carries out additional within-category selection to identify the most important SNPs within each gene set. The proposed model operates in a well-established statistical framework and can handle design information such as covariate adjustment and matching information in GWAS. We compare the proposed method with currently available methods using data with realistic linkage disequilibrium structures and we illustrate the SPCA method using the Wellcome Trust Case-Control Consortium Crohn Disease (CD) dataset.
SNPs; genome-wide association; pathway analysis; principal component analysis
Several lines of evidence suggest that genome-wide association studies (GWAS) have the potential to explain more of the “missing heritability” of common complex phenotypes. However, reliable methods to identify a larger proportion of single nucleotide polymorphisms (SNPs) that impact disease risk are currently lacking. Here, we use a genetic pleiotropy-informed conditional false discovery rate (FDR) method on GWAS summary statistics data to identify new loci associated with schizophrenia (SCZ) and bipolar disorders (BD), two highly heritable disorders with significant missing heritability. Epidemiological and clinical evidence suggest similar disease characteristics and overlapping genes between SCZ and BD. Here, we computed conditional Q–Q curves of data from the Psychiatric Genome Consortium (SCZ; n = 9,379 cases and n = 7,736 controls; BD: n = 6,990 cases and n = 4,820 controls) to show enrichment of SNPs associated with SCZ as a function of association with BD and vice versa with a corresponding reduction in FDR. Applying the conditional FDR method, we identified 58 loci associated with SCZ and 35 loci associated with BD below the conditional FDR level of 0.05. Of these, 14 loci were associated with both SCZ and BD (conjunction FDR). Together, these findings show the feasibility of genetic pleiotropy-informed methods to improve gene discovery in SCZ and BD and indicate overlapping genetic mechanisms between these two disorders.
Genome-wide association studies (GWAS) have thus far identified only a small fraction of the heritability of common complex disorders, such as severe mental disorders. We used a conditional false discovery rate approach for analysis of GWAS data, exploiting “genetic pleiotropy” to increase discovery of common gene variants associated with schizophrenia and bipolar disorders. Leveraging the increased power from combining GWAS of two associated phenotypes, we found a striking overlap in polygenic signals, allowing for the discovery of several new common gene variants associated with bipolar disorder and schizophrenia that were not identified in the original analysis using traditional GWAS methods. Some of the gene variants have been identified in other studies with large targeted replication samples, validating the present findings. Our pleiotropy-informed method may be of significant importance for detecting effects that are below the traditional genome-wide significance level in GWAS, particularly in highly polygenic, complex phenotypes, such as schizophrenia and bipolar disorder, where most of the genetic signal is missing (i.e., “missing heritability”). The findings also offer insights into mechanistic relationships between bipolar disorder and schizophrenia pathogenesis.
Though multiple interacting loci are likely involved in the etiology of complex diseases, early genome-wide association studies (GWAS) have depended on the detection of the marginal effects of each locus. Here, we evaluate the power of GWAS in the presence of two linked and potentially associated causal loci for several models of interaction between them and find that interacting loci may give rise to marginal relative risks that are not generally considered in a one-locus model. To derive power under realistic situations, we use empirical data generated by the HapMap ENCODE project for both allele frequencies and LD structure. The power is also evaluated in situations where the causal SNPs may not be genotyped, but rather detected by proxy using a SNP in linkage disequilibrium (LD). A common simplification for such power computations assumes that the sample size necessary to detect the effect at the tSNP is the sample size necessary to detect the causal locus directly divided by the LD measure r2 between the two. This assumption, which we call the “proportionality assumption”, is a simplification of the many factors that contribute to the strength of association at a marker, and has recently been criticized as unreasonable [Terwilliger and Hiekkalinna 2006], in particular in the presence of interacting and associated loci. We find that this assumption does not introduce much error in single locus models of disease, but may do so in so in certain two-locus models.
Genetic Predisposition to Disease; Genome, Human; genetics; Genotype; Humans; Linkage Disequilibrium; Models, Genetic; Polymorphism, Single Nucleotide; Quantitative Trait Loci; genetics; linkage disequilibrium; genome-wide; tagSNPs
Genome-wide association study (GWAS) is widely utilized to identify genes involved in human complex disease or some other trait. One key challenge for GWAS data interpretation is to identify causal SNPs and provide profound evidence on how they affect the trait. Currently, researches are focusing on identification of candidate causal variants from the most significant SNPs of GWAS, while there is lack of support on biological mechanisms as represented by pathways. Although pathway-based analysis (PBA) has been designed to identify disease-related pathways by analyzing the full list of SNPs from GWAS, it does not emphasize on interpreting causal SNPs. To our knowledge, so far there is no web server available to solve the challenge for GWAS data interpretation within one analytical framework. ICSNPathway is developed to identify candidate causal SNPs and their corresponding candidate causal pathways from GWAS by integrating linkage disequilibrium (LD) analysis, functional SNP annotation and PBA. ICSNPathway provides a feasible solution to bridge the gap between GWAS and disease mechanism study by generating hypothesis of SNP → gene → pathway(s). The ICSNPathway server is freely available at http://icsnpathway.psych.ac.cn/.
Genome-wide association (GWA) studies, where hundreds of thousands of single-nucleotide polymorphisms (SNPs) are tested simultaneously, are becoming popular for identifying disease loci for common diseases. Most commonly, a GWA study involves two stages: the first stage includes testing the association between all SNPs and the disease and the second stage includes replication of SNPs selected from the first stage to validate associations in an independent sample. The first stage is considered to be more fundamental since the second stage is contingent on the results of the first stage. Selection of SNPs from stage one for genotyping in stage two is typically based on an arbitrary threshold or controlling type I errors. These strategies can be inefficient and have potential to exclude genotyping of disease-associated SNPs in stage two. We propose an approach for selecting top SNPs that uses a strategy based on the false-negative rate (FNR). Using the FNR approach, we proposed the number of SNPs that should be selected based on the observed p-values and a pre-specified multi-testing power in the first stage. We applied our method to simulated data and a GWA study of glioma (a rare form of brain tumor) data. Results from simulation and the glioma GWA indicate that the proposed approach provides an FNR-based way to select SNPs using pre-specified power.
False negative rate; SNP selection; Two-stage genome-wide association study
A key challenge for genome-wide association studies (GWAS) is to understand how single nucleotide polymorphisms (SNPs) mechanistically underpin complex diseases. While this challenge has been addressed partially by Gene Ontology (GO) enrichment of large list of host genes of SNPs prioritized in GWAS, these enrichment have not been formally evaluated. Here, we develop a novel computational approach anchored in information theoretic similarity, by systematically mining lists of host genes of SNPs prioritized in three adult-onset diabetes mellitus GWAS. The “gold-standard” is based on GO associated with 20 published diabetes SNPs’ host genes and on our own evaluation. We computationally identify 69 similarity-predicted GO independently validated in all three GWAS (FDR<5%), enriched with those of the gold-standard (odds ratio=5.89, P=4.81e-05), and these terms can be organized by similarity criteria into 11 groupings termed “biomolecular systems”. Six biomolecular systems were corroborated by the gold-standard and the remaining five were previously uncharacterized. http://lussierlab.org/publications/ITS-GWAS
Genome-wide gene-gene interaction analysis using single nucleotide polymorphisms (SNPs) is an attractive way for identification of genetic components that confers susceptibility of human complex diseases. Individual hypothesis testing for SNP-SNP pairs as in common genome-wide association study (GWAS) however involves difficulty in setting overall p-value due to complicated correlation structure, namely, the multiple testing problem that causes unacceptable false negative results. A large number of SNP-SNP pairs than sample size, so-called the large p small n problem, precludes simultaneous analysis using multiple regression. The method that overcomes above issues is thus needed.
We adopt an up-to-date method for ultrahigh-dimensional variable selection termed the sure independence screening (SIS) for appropriate handling of numerous number of SNP-SNP interactions by including them as predictor variables in logistic regression. We propose ranking strategy using promising dummy coding methods and following variable selection procedure in the SIS method suitably modified for gene-gene interaction analysis. We also implemented the procedures in a software program, EPISIS, using the cost-effective GPGPU (General-purpose computing on graphics processing units) technology. EPISIS can complete exhaustive search for SNP-SNP interactions in standard GWAS dataset within several hours. The proposed method works successfully in simulation experiments and in application to real WTCCC (Wellcome Trust Case–control Consortium) data.
Based on the machine-learning principle, the proposed method gives powerful and flexible genome-wide search for various patterns of gene-gene interaction.
Increasing evidence suggests that single nucleotide polymorphisms (SNPs) associated with complex traits are more likely to be expression quantitative trait loci (eQTLs). Incorporating eQTL information hence has potential to increase power of genome-wide association studies (GWAS). In this paper, we propose using eQTL weights as prior information in SNP based association tests to improve test power while maintaining control of the family-wise error rate (FWER) or the false discovery rate (FDR). We apply the proposed methods to the analysis of a GWAS for childhood asthma consisting of 1296 unrelated individuals with German ancestry. The results confirm that eQTLs are enriched for previously reported asthma SNPs. We also find that some SNPs are insignificant using procedures without eQTL weighting, but become significant using eQTL-weighted Bonferroni or Benjamini–Hochberg procedures, while controlling the same FWER or FDR level. Some of these SNPs have been reported by independent studies in recent literature. The results suggest that the eQTL-weighted procedures provide a promising approach for improving power of GWAS. We also report the results of our methods applied to the large-scale European GABRIEL consortium data.
asthma; family-wise error rate; false discovery rate; eQTL; genome-wide association study; weighted hypothesis test
A major motivation for seeking disease-associated genetic variation is to identify novel risk processes. Although rare copy number variants (CNVs) appear to contribute to attention deficit hyperactivity disorder (ADHD), common risk variants (single-nucleotide polymorphisms [SNPs]) have not yet been detected using genome-wide association studies (GWAS). This raises the concern as to whether future larger-scale, adequately powered GWAS will be worthwhile. The authors undertook a GWAS of ADHD and examined whether associated SNPs, including those below conventional levels of significance, influenced the same biological pathways affected by CNVs.
The authors analyzed genome-wide SNP frequencies in 727 children with ADHD and 5,081 comparison subjects. The gene sets that were enriched in a pathway analysis of the GWAS data (the top 5% of SNPs) were tested for an excess of genes spanned by large, rare CNVs in the children with ADHD.
No SNP achieved genome-wide significance levels. As previously reported in a subsample of the present study, large, rare CNVs were significantly more common in case subjects than comparison subjects. Thirteen biological pathways enriched for SNP association significantly overlapped with those enriched for rare CNVs. These included cholesterol-related and CNS development pathways. At the level of individual genes, CHRNA7, which encodes a nicotinic receptor subunit previously implicated in neuropsychiatric disorders, was affected by six large duplications in case subjects (none in comparison subjects), and SNPs in the gene had a gene-wide p value of 0.0002 for association in the GWAS.
Both common and rare genetic variants appear to be relevant to ADHD and index-shared biological pathways.
Motivation: High-dimensional data are frequently generated in genome-wide association studies (GWAS) and other studies. It is important to identify features such as single nucleotide polymorphisms (SNPs) in GWAS that are associated with a disease. Random forests represent a very useful approach for this purpose, using a variable importance score. This importance score has several shortcomings. We propose an alternative importance measure to overcome those shortcomings.
Results: We characterized the effect of multiple SNPs under various models using our proposed importance measure in random forests, which uses maximal conditional chi-square (MCC) as a measure of association between a SNP and the trait conditional on other SNPs. Based on this importance measure, we employed a permutation test to estimate empirical P-values of SNPs. Our method was compared to a univariate test and the permutation test using the Gini and permutation importance. In simulation, the proposed method performed consistently superior to the other methods in identifying of risk SNPs. In a GWAS of age-related macular degeneration, the proposed method confirmed two significant SNPs (at the genome-wide adjusted level of 0.05). Further analysis showed that these two SNPs conformed with a heterogeneity model. Compared with the existing importance measures, the MCC importance measure is more sensitive to complex effects of risk SNPs by utilizing conditional information on different SNPs. The permutation test with the MCC importance measure provides an efficient way to identify candidate SNPs in GWAS and facilitates the understanding of the etiology between genetic variants and complex diseases.
Supplementary information: Supplementary data are available at Bioinformatics online.
Genome-wide association studies (GWAS) yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex Linkage Disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175), when compared with the largest published meta-GWAS (n>100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed a strong association of SORT1 with TG-APOB and LIPC with TG-HDL phenotypic groups, which were overlooked in the larger meta-GWAS and not revealed by competing approaches, associations that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants. This provides a powerful tool for the analysis of diverse genomic features, for instance including gene expression and exome sequencing data, where complex dependencies are present in the predictor space.
Nowadays, the availability of cheaper and accurate assays to quantify multiple (endo)phenotypes in large population cohorts allows multi-trait studies. However, these studies are limited by the lack of flexible models integrated with efficient computational tools for genome-wide multi SNPs-traits analyses. To overcome this problem, we propose a novel Bayesian analysis strategy and a new algorithmic implementation which exploits parallel processing architecture for fully multivariate modeling of groups of correlated phenotypes at the genome-wide scale. In addition to increased power of our algorithm over alternative Bayesian and well-established non-Bayesian multi-phenotype methods, we provide an application to a real case study of several blood lipid traits, and show how our method recovered most of the major associations and is better at refining multi-trait polygenic associations than alternative methods. We reveal and replicate in independent cohorts new associations with two phenotypic groups that were not detected by competing multivariate approaches and not noticed by a large meta-GWAS. We also discuss the applicability of the proposed method to large meta-analyses involving hundreds of thousands of individuals and to diverse genomic datasets where complex dependencies in the predictor space are present.
Motivation: Most complex diseases involve multiple genes and their interactions. Although genome-wide association studies (GWAS) have shown some success for identifying genetic variants underlying complex diseases, most existing studies are based on limited single-locus approaches, which detect single nucleotide polymorphisms (SNPs) essentially based on their marginal associations with phenotypes.
Results: In this article, we propose an ensemble approach based on boosting to study gene–gene interactions. We extend the basic AdaBoost algorithm by incorporating an intuitive importance score based on Gini impurity to select candidate SNPs. Permutation tests are used to control the statistical significance. We have performed extensive simulation studies using three interaction models to evaluate the efficacy of our approach at realistic GWAS sizes, and have compared it with existing epistatic detection algorithms. Our results indicate that our approach is valid, efficient for GWAS and on disease models with epistasis has more power than existing programs.
Issues of multiple-testing and statistical significance in genome-wide association studies (GWAS) have prompted statistical methods utilizing prior data to increase the power of association results. Using prior findings from genome-wide linkage studies on bipolar disorder (BPD), we employed a weighted false discovery approach (wFDR; (Roeder et al. 2006)) to previously reported GWAS data drawn from the Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD). Using this method, association signals are up or down-weighted given the linkage score in that genomic region. Although no SNPs in our sample reached genome-wide significance through the wFDR approach, the strongest single SNP result from the original GWAS results (rs4939921 in myosin VB) is strongly up-weighted as it occurs on a linkage peak of chromosome 18. We also identify regions on chromosome 9, 17, and 18 where modestly associated SNP clusters coincide with strong linkage scores, implicating them as possible candidate regions for further analysis. Moving forward, we believe the application of prior linkage information will be increasingly useful to future GWAS studies that incorporate rarer variants into their analysis.
Identification of causal SNPs in most genome wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly modern computationally expensive regression methods are employed for SNP selection that consider all markers simultaneously and thus incorporate dependencies among SNPs.
We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine the effectiveness in finding true causal SNPs.
Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here is available from
Genome-wide association studies (GWAS) have identified thousands of single nucleotide polymorphisms (SNPs) associated with the risk of hundreds of diseases. However, there is currently no database that enables non-specialists to answer the following simple questions: which SNPs associated with diseases are in linkage disequilibrium (LD) with a gene of interest? Which chromosomal regions have been associated with a given disease, and which are the potentially causal genes in each region? To answer these questions, we use data from the HapMap Project to partition each chromosome into so-called LD blocks, so that SNPs in LD with each other are preferentially in the same block, whereas SNPs not in LD are in different blocks. By projecting SNPs and genes onto LD blocks, the DistiLD database aims to increase usage of existing GWAS results by making it easy to query and visualize disease-associated SNPs and genes in their chromosomal context. The database is available at http://distild.jensenlab.org/.
Genome-Wide Association studies (GWAS) offer an unbiased means to understand the genetic basis of traits by identifying single nucleotide polymorphisms (SNPs) linked to causal variants of complex phenotypes. GWAS have identified a host of susceptibility SNPs associated with many important human diseases, including diseases associated with aging. In an effort to understand the genetics of broad resistance to age-associated diseases (i.e. ‘wellness’), we performed a meta-analysis of human GWAS. Toward that end, we compiled 372 GWAS that identified 1,775 susceptibility SNPs to 105 unique diseases and used these SNPs to create a genomic landscape of disease susceptibility. This map was constructed by partitioning the genome into 200 kb ‘bins’ and mapping the 1,775 susceptibility SNPs to bins based on their genomic location. Investigation of these data revealed significant heterogeneity of disease association within the genome, with 92% of bins devoid of disease-associated SNPs. In contrast, 10 bins (0.06%) were significantly (p<0.05) enriched for susceptibility to multiple diseases, 5 of which formed two highly significant peaks of disease association (p<0.0001). These peaks mapped to the Major Histocompatibility (MHC) locus on 6p21 and the INK4/ARF (CDKN2a/b) tumor suppressor locus on 9p21.3. Provocatively, all 10 significantly enriched bins contained genes linked to either inflammation or cellular senescence pathways, and SNPs near regulators of senescence were particularly associated with disease of aging (e.g. cancer, atherosclerosis, type 2 diabetes, glaucoma). This analysis suggests that germline genetic heterogeneity in the regulation of immunity and cellular senescence influences the human health span.
p16INK4a; p14ARF; ANRIL; TERT; longevity
Motivation: One of the fundamental questions in genetics study is to identify functional DNA variants that are responsible to a disease or phenotype of interest. Results from large-scale genetics studies, such as genome-wide association studies (GWAS), and the availability of high-throughput sequencing technologies provide opportunities in identifying causal variants. Despite the technical advances, informatics methodologies need to be developed to prioritize thousands of variants for potential causative effects.
Results: We present regSNPs, an informatics strategy that integrates several established bioinformatics tools, for prioritizing regulatory SNPs, i.e. the SNPs in the promoter regions that potentially affect phenotype through changing transcription of downstream genes. Comparing to existing tools, regSNPs has two distinct features. It considers degenerative features of binding motifs by calculating the differences on the binding affinity caused by the candidate variants and integrates potential phenotypic effects of various transcription factors. When tested by using the disease-causing variants documented in the Human Gene Mutation Database, regSNPs showed mixed performance on various diseases. regSNPs predicted three SNPs that can potentially affect bone density in a region detected in an earlier linkage study. Potential effects of one of the variants were validated using luciferase reporter assay.
Supplementary data are available at Bioinformatics online
A large body of functional and epidemiological evidence have previously illustrated the impact of specific MHC class I subtypes on clinical outcome during HIV-1 infection, and these observations have recently been re-iterated in genome wide association studies (GWAS). Yet because of the complexities surrounding GWAS-based approaches and the lack of knowledge relating to the identity of rarer single nucleotide polymorphism (SNP) variants, it has proved difficult to discover independent causal variants associated with favourable immune control. This is especially true of the candidate variants within the HLA region where many of the recently proposed disease influencing SNPs appear to reflect linkage with ‘protective’ MHC class I alleles. Yet causal MHC-linked SNPs may exist but remain overlooked owing to the complexities associated with their identification. Here we focus on the ancestral TNFα promoter −237A variant (rs361525), shown historically to be in complete linkage disequilibrium with the ‘protective’ HLA-B*5701 allele. Many of the ancestral SNPs within the extended TNFα promoter have been associated with both autoimmune conditions and disease outcomes, however, the direct role of these variants on TNFα expression remains controversial. Yet, because of the important role played by TNFα in HIV-1 infection, and given the proximity of the −237 SNP to the core promoter, its location within a putative repressor region previously characterized in mice, and its disruption of a methylation-susceptible CpG dinucleotide motif, we chose to carefully evaluate its impact on TNFα production. Using a variety of approaches we now demonstrate that carriage of the A SNP is associated with lower TNFα production, via a mechanism not readily explained by promoter methylation nor the binding of transcription factors or repressors. We propose that the −237A variant could represent a minor causal SNP that additionally contributes to the HLA-B*5701-mediated ‘protective’ effect during HIV-1 infection.
Following the widespread use of genome-wide association studies (GWAS), focus is turning towards identification of causal variants rather than simply genetic markers of diseases and traits. As a step towards a high-throughput method to identify genome-wide, non-coding, functional regulatory variants, we describe the technique of allele-specific FAIRE, utilising large-scale genotyping technology (FAIRE-gen) to determine allelic effects on chromatin accessibility and regulatory potential. FAIRE-gen was explored using lymphoblastoid cells and the 50,000 SNP Illumina CVD BeadChip. The technique identified an allele-specific regulatory polymorphism within NR1H3 (coding for LXR-α), rs7120118, coinciding with a previously GWAS-identified SNP for HDL-C levels. This finding was confirmed using FAIRE-gen with the 200,000 SNP Illumina Metabochip and verified with the established method of TaqMan allelic discrimination. Examination of this SNP in two prospective Caucasian cohorts comprising 15,000 individuals confirmed the association with HDL-C levels (combined beta = 0.016; p = 0.0006), and analysis of gene expression identified an allelic association with LXR-α expression in heart tissue. Using increasingly comprehensive genotyping chips and distinct tissues for examination, FAIRE-gen has the potential to aid the identification of many causal SNPs associated with disease from GWAS.
The identification of genetic variants associated with complex diseases has rapidly grown through lowering costs of genome sequencing and the use of large-scale genotyping chips based on this sequencing data. There have not been corresponding advances in the identification of causal genetic variants compared to variants simply associated with diseases or traits. Most of these causal variants are thought to be located not within regions coding for proteins, but within genomic regions that regulate the level of protein. We have combined the use of large-scale gene chips with functional analysis, to determine regions of the genome that confer a greater potential for controlling gene regulation dependent on the genotype of that individual. Combining this data with population data and gene expression data, we identify a potential causal variant that alters regulation of LXR-α, a key mediator in lipid metabolism, and show that this variant is associated with HDL-C levels. This methodology provides a model for future analyses to identify further causal variants for disease.
Gene-based association approach could be regarded as a complementary analysis to the single SNP association analysis. We meta-analyzed the findings from the gene-based association approach using the genome-wide association studies (GWAS) data from Chinese and European subjects, confirmed several well established bone mineral density (BMD) genes, and suggested several novel BMD genes.
The introduction of GWAS has greatly increased the number of genes that are known to be associated with common diseases. Nonetheless, such a single SNP GWAS has a lower power to detect genes with multiple causal variants. We aimed to assess the association of each gene with BMD variation at the spine and hip using gene-based GWAS approach.
We studied 778 Hong Kong Southern Chinese (HKSC) women and 5,858 Northern Europeans (dCG); age, sex, and weight were adjusted in the model. The main outcome measure was BMD at the spine and hip.
Nine genes showed suggestive p value in HKSC, while 4 and 17 genes showed significant and suggestive p values respectively in dCG. Meta-analysis using weighted Z-transformed test confirmed several known BMD genes and suggested some novel ones at 1q21.3, 9q22, 9q33.2, 20p13, and 20q12. Top BMD genes were significantly associated with connective tissue, skeletal, and muscular system development and function (p < 0.05). Gene network inference revealed that a large number of these genes were significantly connected with each other to form a functional gene network, and several signaling pathways were strongly connected with these gene networks.
Our gene-based GWAS confirmed several BMD genes and suggested several novel BMD genes. Genetic contribution to BMD variation may operate through multiple genes identified in this study in functional gene networks. This finding may be useful in identifying and prioritizing candidate genes/loci for further study.
Electronic supplementary material
The online version of this article (doi:10.1007/s00198-011-1779-7) contains supplementary material, which is available to authorized users.
Association study; Bone mineral density; Genetic epidemiology; Meta-analysis; Osteoporosis