A challenging problem after a genome-wide association study (GWAS) is to balance the statistical evidence of genotype-phenotype correlation with a priori evidence of biological relevance.
We introduce a method for systematically prioritizing single nucleotide polymorphisms (SNPs) for further study after a GWAS. The method combines evidence across multiple domains, including statistical evidence of genotype-phenotype correlation, known pathways in the pathologic development of disease, SNP/gene functional properties, comparative genomics, prior evidence of genetic linkage, and linkage disequilibrium. We apply this method to a GWAS of nicotine dependence, and use simulated data to test it on several commercial SNP microarrays.
SPOT (http://spot.cgsmd.isi.edu), the SNP prioritization online tool, is a web site for integrating biological databases into the prioritization of single nucleotide polymorphisms (SNPs) for further study after a genome-wide association study (GWAS). Typically, the next step after a GWAS is to genotype the top signals in an independent replication sample. Investigators will often incorporate information from biological databases so that biologically relevant SNPs, such as those in genes related to the phenotype or those with potentially non-neutral effects on gene expression, such as splice sites, are given higher priority. We recently introduced the genomic information network (GIN) method for systematically implementing this kind of strategy. The SPOT web site allows users to upload a list of SNPs and GWAS P-values and returns a prioritized list of SNPs using the GIN method. Users can specify candidate genes or genomic regions with custom levels of prioritization. The results can be downloaded or viewed in the browser, where users can interactively explore the details of each SNP, including graphical representations of the GIN method. For investigators interested in incorporating biological databases into a post-GWAS SNP selection strategy, the SPOT web tool is an easily implemented and flexible solution.
The identification of true causal loci to unravel the statistical evidence of genotype-phenotype correlations and the biological relevance of selected single-nucleotide polymorphisms (SNPs) is a challenging issue in genome-wide association studies (GWAS). Here, we introduced a novel method for the prioritization of SNPs based on p-values from GWAS. The method uses functional evidence from populations, including phenotype-associated gene expressions. Based on the concept of genetic interactions, such as perturbation of gene expression by genetic variation, phenotype and gene expression related SNPs were prioritized by adjusting the p-values of SNPs. We applied our method to GWAS data related to drug-induced cytotoxicity. Then, we prioritized loci that potentially play a role in drug-induced cytotoxicity. By generating an interaction model, our approach allowed us not only to identify causal loci, but also to find intermediate nodes that regulate the flow of information among causal loci, perturbed gene expression, and resulting phenotypic variation.
genome-wide association study; interaction network; prioritization; SNP
Single nucleotide polymorphisms (SNPs) are increasingly used to tag genetic loci associated with phenotypes such as risk of complex diseases. Technically, this is done genome-wide without prior restriction or knowledge of biological feasibility in scans referred to as genome-wide association studies (GWAS). Depending on the linkage disequilibrium (LD) structure at a particular locus, such tagSNPs may be surrogates for many thousands of other SNPs, and it is difficult to distinguish those that may play a functional role in the phenotype from those simply genetically linked. Because a large proportion of tagSNPs have been identified within non-coding regions of the genome, distinguishing functional from non-functional SNPs has been an even greater challenge. A strategy was recently proposed that prioritizes surrogate SNPs based on non-coding chromatin and epigenomic mapping techniques that have become feasible with the advent of massively parallel sequencing. Here, we introduce an R/Bioconductor software package that enables the identification of candidate functional SNPs by integrating information from tagSNP locations, lists of linked SNPs from the 1000 genomes project and locations of chromatin features which may have functional significance. Availability: FunciSNP is available from Bioconductor (bioconductor.org).
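The integration step described above can be illustrated with a minimal Python sketch. It is not the FunciSNP implementation (which is an R/Bioconductor package); the record types, the `min_r2` cutoff of 0.5, and all identifiers and coordinates below are hypothetical and chosen only for illustration. The sketch keeps linked SNPs whose LD with the tagSNP passes the cutoff and that fall inside a chromatin feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkedSnp:
    snp_id: str
    tag_snp: str   # tagSNP this SNP is linked to
    chrom: str
    pos: int       # 1-based position
    r2: float      # LD (r^2) with the tagSNP


@dataclass(frozen=True)
class Feature:
    name: str
    chrom: str
    start: int     # inclusive
    end: int       # inclusive


def candidate_functional_snps(linked, features, min_r2=0.5):
    """Return (snp_id, tag_snp, feature_name) for SNPs in LD that overlap a feature."""
    hits = []
    for snp in linked:
        if snp.r2 < min_r2:
            continue  # too weakly linked to the tagSNP
        for feat in features:
            if feat.chrom == snp.chrom and feat.start <= snp.pos <= feat.end:
                hits.append((snp.snp_id, snp.tag_snp, feat.name))
    return hits


# Hypothetical toy data, not taken from the 1000 Genomes Project.
linked = [
    LinkedSnp("rsA", "rsTag", "chr1", 1050, 0.90),
    LinkedSnp("rsB", "rsTag", "chr1", 5000, 0.95),  # outside every feature
    LinkedSnp("rsC", "rsTag", "chr1", 1100, 0.20),  # fails the r^2 cutoff
]
features = [Feature("H3K4me1_peak", "chr1", 1000, 1200)]

print(candidate_functional_snps(linked, features))
```

A nested loop is sufficient for a toy example; a genome-scale analysis would normally use an interval index instead.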
Genome-wide association (GWA) using large numbers of single nucleotide polymorphisms (SNPs) is now a powerful, state-of-the-art approach to mapping human disease genes. When a GWA study detects association between a SNP and the disease, this signal usually represents association with a set of several highly correlated SNPs in strong linkage disequilibrium. The challenge we address is to distinguish among these correlated loci to highlight potential functional variants and prioritize them for follow-up.
We implemented a systematic method for testing association across diverse population samples having differing histories and LD patterns, using a logistic regression framework. The hypothesis is that important underlying biological mechanisms are shared across human populations, and we can filter correlated variants by testing for heterogeneity of genetic effects in different population samples. This approach formalizes the descriptive comparison of p-values that has typified similar cross-population fine-mapping studies to date. We applied this method to correlated SNPs in the cholinergic nicotinic receptor gene cluster CHRNA5-CHRNA3-CHRNB4, in a case-control study of cocaine dependence composed of 504 European-American and 583 African-American samples. Of the 10 SNPs genotyped in the r2 ≥ 0.8 bin for rs16969968, three demonstrated significant cross-population heterogeneity and are filtered from priority follow-up; the remaining SNPs include rs16969968 (heterogeneity p = 0.75). Though the power to filter out rs16969968 is reduced due to the difference in allele frequency in the two groups, the results nevertheless focus attention on a smaller group of SNPs that includes the non-synonymous SNP rs16969968, which retains a similar effect size (odds ratio) across both population samples.
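The idea of filtering on cross-population heterogeneity can be sketched as follows. Note that this sketch swaps in a simpler technique than the logistic regression framework described above: an inverse-variance (Woolf-style) test of whether the log odds ratios from two 2x2 allele-count tables differ. The function names and the counts in the example are hypothetical.

```python
import math


def log_odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (log odds ratio, approximate variance) from 2x2 allele counts."""
    log_or = math.log((case_alt * ctrl_ref) / (case_ref * ctrl_alt))
    variance = 1 / case_alt + 1 / case_ref + 1 / ctrl_alt + 1 / ctrl_ref
    return log_or, variance


def heterogeneity_p(table_pop1, table_pop2):
    """P-value for equal odds ratios in two populations (chi-square, 1 df)."""
    estimates = [log_odds_ratio(*table_pop1), log_odds_ratio(*table_pop2)]
    weights = [1 / var for _, var in estimates]
    pooled = sum(w * lo for w, (lo, _) in zip(weights, estimates)) / sum(weights)
    q = sum(w * (lo - pooled) ** 2 for w, (lo, _) in zip(weights, estimates))
    # For a chi-square variable with 1 df, P(X > q) = erfc(sqrt(q / 2)).
    return math.erfc(math.sqrt(q / 2))


# Hypothetical allele counts: (case_alt, case_ref, ctrl_alt, ctrl_ref).
similar = heterogeneity_p((120, 80, 90, 110), (120, 80, 90, 110))
different = heterogeneity_p((200, 100, 100, 200), (100, 200, 200, 100))
print(similar, different)
```

Under this sketch, a SNP with a small heterogeneity p-value would be filtered from priority follow-up, while a SNP with a consistent effect across samples would be retained.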
Filtering out SNPs that demonstrate cross-population heterogeneity enriches for variants more likely to be important and causative. Our approach provides an important and effective tool to help interpret results from the many GWA studies now underway.
Most recently, with the maturing of bovine genome sequencing and high-throughput SNP genotyping technologies, a large number of significant SNPs associated with economically important traits can be identified by genome-wide association studies (GWAS). To further determine true association findings in GWAS, the common strategy is to sift out the most promising SNPs for follow-up replication studies. Hence, it is crucial to explore the functional significance of the candidate SNPs in order to screen and select the potentially functional ones. To systematically prioritize these statistically significant SNPs and facilitate follow-up replication studies, we developed a bovine SNP annotation tool (Snat) based on a web interface.
With Snat, various sources of genomic information are integrated and retrieved from several leading online databases, including SNP information from dbSNP, gene information from Entrez Gene, protein features from UniProt, linkage information from AnimalQTLdb, conserved elements from UCSC Genome Browser Database and gene functions from Gene Ontology (GO), KEGG PATHWAY and Online Mendelian Inheritance in Animals (OMIA). Snat provides two different applications, including a CGI-based web utility and a command-line version, to access the integrated database, target any single nucleotide loci of interest and perform multi-level functional annotations. For further validation of the practical significance of our study, SNPs involved in two commercial bovine SNP chips, i.e., the Affymetrix Bovine 10K chip array and the Illumina 50K chip array, have been annotated by Snat, and the corresponding outputs can be directly downloaded from Snat website. Furthermore, a real dataset involving 20 identified SNPs associated with milk yield in our recent GWAS was employed to demonstrate the practical significance of Snat.
To the best of our knowledge, Snat is one of the first tools focusing on SNP annotation for livestock. Snat provides researchers with a convenient and powerful platform to aid functional analyses and accurate evaluation of genes/variants related to SNPs, and facilitates follow-up replication studies in the post-GWAS era.
Correlating specific genotypes with human diseases remains a complex issue despite the advantages offered by high-throughput technologies, such as genome-wide association studies (GWAS). New tools for interpreting genetic variants and for prioritizing single nucleotide polymorphisms (SNPs) are needed. Given a list of the most relevant SNPs statistically associated with a specific pathology as the result of a genotype study, a critical issue is the identification of genes that are effectively related to the disease by re-scoring the importance of the identified genetic variations. Vice versa, given a list of genes, it can be of great importance to predict which SNPs may be involved in the onset of a particular disease, in order to focus the research on their effects.
We propose a new bioinformatics approach to support biological data mining in the analysis and interpretation of SNPs associated with pathologies. This system can be employed to design custom genotyping chips for disease-oriented studies and to re-score GWAS results. The proposed method relies (1) on the data integration of public resources using a gene-centric database design, (2) on the evaluation of a set of static biomolecular annotations, defined as features, and (3) on the SNP scoring function, which computes SNP scores using parameters and weights set by users. We employed a machine learning classifier to set default feature weights and an ontological annotation layer to enable the enrichment of the input gene set. We implemented our method as a web tool called SNPranker 2.0 (http://www.itb.cnr.it/snpranker), improving our first published release of this system. A user-friendly interface allows users to input a list of genes, SNPs or a biological process, and to customize the feature set with relative weights. As a result, SNPranker 2.0 returns a list of SNPs, localized within input and ontologically enriched genes, combined with their prioritization scores.
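A user-weighted scoring function of the kind described above can be sketched in a few lines of Python. The actual SNPranker 2.0 scoring function is not specified here, so the weighted sum, the feature names, and the weights below are assumptions made purely for illustration.

```python
def score_snp(feature_values, weights):
    """Weighted sum of feature values; features without a weight count as zero."""
    return sum(weights.get(name, 0.0) * value
               for name, value in feature_values.items())


def rank_snps(snp_features, weights):
    """Return (snp_id, score) pairs sorted from highest to lowest score."""
    scored = [(snp, score_snp(feats, weights))
              for snp, feats in snp_features.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical features and user-chosen weights.
weights = {"nonsynonymous": 3.0, "conserved": 1.5, "in_input_gene": 2.0}
snp_features = {
    "rs1": {"nonsynonymous": 1, "conserved": 1, "in_input_gene": 0},
    "rs2": {"nonsynonymous": 0, "conserved": 1, "in_input_gene": 1},
    "rs3": {"nonsynonymous": 0, "conserved": 0, "in_input_gene": 0},
}

print(rank_snps(snp_features, weights))
# → [('rs1', 4.5), ('rs2', 3.5), ('rs3', 0.0)]
```

Changing a weight and re-running `rank_snps` shows how user-set parameters shift the prioritization without altering the underlying annotations.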
Different databases and resources are already available for SNPs annotation, but they do not prioritize or re-score SNPs relying on a-priori biomolecular knowledge. SNPranker 2.0 attempts to fill this gap through a user-friendly integrated web resource. End users, such as researchers in medical genetics and epidemiology, may find in SNPranker 2.0 a new tool for data mining and interpretation able to support SNPs analysis. Possible scenarios are GWAS data re-scoring, SNPs selection for custom genotyping arrays and SNPs/diseases association studies.
Alcohol dependence is a complex disease, and although linkage and candidate gene studies have identified several genes associated with the risk for alcoholism, these explain only a portion of the risk.
We carried out a genome-wide association study (GWAS) on a case-control sample drawn from the families in the Collaborative Study on the Genetics of Alcoholism. The cases all met diagnostic criteria for alcohol dependence according to the Diagnostic and Statistical Manual of the American Psychiatric Association Fourth Edition (DSM-IV); controls all consumed alcohol but were not dependent on alcohol or illicit drugs. To prioritize among the strongest candidates, we genotyped most of the top 199 SNPs (p ≤ 2.1 × 10−4) in a sample of alcohol dependent families and performed pedigree-based association analysis. We also examined whether the genes harboring the top SNPs were expressed in human brain or were differentially expressed in the presence of ethanol in lymphoblastoid cells.
Although no single SNP met genome-wide criteria for significance, there were several clusters of SNPs that provided mutual support. Combining evidence from the case-control study, the follow-up in families, and gene expression provided the strongest support for the association of a cluster of genes on chromosome 11 (SLC22A18, PHLDA2, NAP1L4, SNORA54, CARS, and OSBPL5) with alcohol dependence. Several SNPs nominated as candidates in earlier GWAS studies replicated in ours, including CPE, DNASE2B, SLC10A2, ARL6IP5, ID4, GATA4, SYNE1 and ADCY3.
We have identified several promising associations that warrant further examination in independent samples.
alcohol dependence; genome-wide association study; case-control study; family study; gene expression
The genome-wide association study (GWAS) is nowadays widely used to identify genes involved in human complex disease. The standard GWAS analysis examines SNPs/genes independently and identifies only a number of the most significant SNPs. It ignores the combined effect of weaker SNPs/genes, which makes it difficult to explore biological function and mechanism from a systems point of view. Although gene set enrichment analysis (GSEA) has been introduced to GWAS to overcome these limitations by identifying the correlation between pathways/gene sets and traits, its heavy dependence on genotype data, which are not easily available for most published GWAS investigations, has limited its application. In order to perform GSEA on a simple list of GWAS SNP P-values, we implemented GSEA by using SNP label permutation. We further improved GSEA (i-GSEA) by focusing on pathways/gene sets with a high proportion of significant genes. To provide researchers with an open platform to analyze GWAS data, we developed the i-GSEA4GWAS (improved GSEA for GWAS) web server. i-GSEA4GWAS implements the i-GSEA approach and aims to provide new insights in complex disease studies. i-GSEA4GWAS is freely available at http://gsea4gwas.psych.ac.cn/.
Genome-wide association studies (GWAS) are now feasible for studying the genetics underlying complex diseases. For many diseases, a list of candidate genes or regions exists and incorporation of such information into data analyses can potentially improve the power to detect disease variants. Traditional approaches for assessing the overall statistical significance of GWAS results ignore such information by inherently treating all markers equally.
We propose the prioritized subset analysis (PSA), in which a prioritized subset of markers is pre-selected from candidate regions, and the false discovery rate (FDR) procedure is carried out in the prioritized subset and its complementary subset, respectively.
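The two-subset procedure can be sketched as follows. The sketch assumes the Benjamini-Hochberg step-up procedure as "the FDR procedure", which the text does not specify; the function names and the toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the set of indices rejected by the BH step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank meeting the BH condition
    return set(order[:cutoff])


def prioritized_subset_analysis(pvalues, prioritized, alpha=0.05):
    """Run BH separately inside and outside the prioritized subset.

    pvalues: dict mapping SNP id to p-value; prioritized: set of SNP ids.
    """
    def run(ids):
        rejected = benjamini_hochberg([pvalues[s] for s in ids], alpha)
        return {ids[i] for i in rejected}

    inside = [s for s in pvalues if s in prioritized]
    outside = [s for s in pvalues if s not in prioritized]
    return run(inside) | run(outside)


# Hypothetical data: two candidate-region SNPs plus 98 background SNPs.
pvalues = {"cand1": 0.004, "cand2": 0.6}
pvalues.update({f"bg{i}": 0.01 + 0.0099 * i for i in range(98)})

whole_genome = benjamini_hochberg(list(pvalues.values()))
psa = prioritized_subset_analysis(pvalues, {"cand1", "cand2"})
print(len(whole_genome), psa)
```

In this toy example the single-step adjustment over all 100 markers rejects nothing, whereas the candidate SNP with p = 0.004 is rejected within the small prioritized subset.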
The PSA is more powerful than the whole-genome single-step FDR adjustment for a range of alternative models. The degree of power improvement depends on the fraction of associated SNPs in the prioritized subset and their nominal power, with higher fraction of associated SNPs and higher nominal power leading to more power improvement. The power improvement can be substantial; for disease loci not included in the prioritized subset, the power loss is almost negligible.
The PSA has the flexibility of allowing investigators to combine prior information from a variety of sources, and will be a useful tool for GWAS.
Association analysis; False discovery rate; HapMap
Genome-wide association studies (GWAS) have provided a large set of genetic loci
influencing the risk for many common diseases. Association studies typically
analyze one specific trait in single populations in an isolated fashion without
taking into account the potential phenotypic and genetic correlation between
traits. However, GWA data can be efficiently used to identify overlapping loci
with analogous or contrasting effects on different diseases.
Here, we describe a new approach to systematically prioritize and interpret
available GWA data. We focus on the analysis of joint and disjoint genetic
determinants across diseases. Using network analysis, we show that variant-based
approaches are superior to locus-based analyses. In addition, we provide a
prioritization of disease loci based on network properties and discuss the roles
of hub loci across several diseases. We demonstrate that, in general, agonistic
associations appear to reflect current disease classifications, and present the
potential use of effect sizes in refining and revising these agonistic signals. We
further identify potential branching points in disease etiologies based on
antagonistic variants and describe plausible small-scale models of the underlying
The observation that a surprisingly high fraction (>15%) of the SNPs considered in
our study are associated both agonistically and antagonistically with related as
well as unrelated disorders indicates that the molecular mechanisms influencing
causes and progress of human diseases are in part interrelated. Genetic overlaps
between two diseases also suggest the importance of the affected entities in the
specific pathogenic pathways and should be investigated further.
Genome-wide association study; Genetic overlap; Shared variant network; Disease comorbidity
Genome-wide association studies (GWAS) do not provide a full account of the heritability of genetic diseases, since gene-gene interactions, also known as epistasis, are not considered in single-locus GWAS. To address this problem, a considerable number of methods have been developed for identifying disease-associated gene-gene interactions. However, these methods typically fail to identify interacting markers that explain more of the disease heritability than single-locus GWAS, since many of the interactions significant for disease are obscured by uninformative marker interactions, e.g., linkage disequilibrium (LD).
In this study, we present a novel SNP interaction prioritization algorithm, named iLOCi (Interacting Loci). This algorithm accounts for marker dependencies separately in case and control groups. Disease-associated interactions are then prioritized according to a novel ranking score calculated from the difference in marker dependencies for every possible pair between case and control groups. The analysis of a typical GWAS dataset can be completed in less than a day on a standard workstation with parallel processing capability. The proposed framework was validated using simulated data and applied to real GWAS datasets using the Wellcome Trust Case Control Consortium (WTCCC) data. The results from simulated data showed the ability of iLOCi to identify various types of gene-gene interactions, especially for high-order interaction. From the WTCCC data, we found that among the top ranked interacting SNP pairs, several mapped to genes previously known to be associated with disease, and interestingly, other previously unreported genes with biologically related roles.
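The ranking idea, scoring each SNP pair by how much its dependency differs between cases and controls, can be sketched as follows. The actual iLOCi ranking score is not given here; this sketch assumes the Pearson correlation of genotype codes (0/1/2) as the dependency measure, and the genotype data are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_pairs(cases, controls):
    """Rank SNP pairs by |dependency in cases - dependency in controls|."""
    scores = []
    for a, b in combinations(sorted(cases), 2):
        diff = abs(pearson(cases[a], cases[b]) - pearson(controls[a], controls[b]))
        scores.append(((a, b), diff))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical genotype codes: snp1 and snp2 are correlated only in cases.
cases = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],
    "snp3": [2, 0, 1, 1, 2, 0],
}
controls = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 1, 1, 0, 0],
    "snp3": [2, 0, 1, 1, 2, 0],
}

for pair, score in rank_pairs(cases, controls):
    print(pair, round(score, 3))
```

Because dependency is measured separately in each group, a pair that is equally correlated in cases and controls (for example, through LD alone) receives a score near zero.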
iLOCi is a powerful tool for uncovering true disease interacting markers and thus can provide a more complete understanding of the genetic basis underlying complex disease. The program is available for download at http://www4a.biotec.or.th/GI/tools/iloci.
In the past decade, high-throughput genotyping and next-generation sequencing platforms have enabled genome-wide association studies (GWAS) of many complex human diseases. These studies have discovered many disease susceptibility loci and unveiled unexpected disease mechanisms. Despite these successes, the identified variants explain only a small proportion of the genetic contributions to these diseases, and many more remain to be found. This is largely due to the small effect sizes of most disease-associated variants and limited sample sizes. As a result, it is critical to leverage other information to more effectively prioritize GWAS signals to increase replication rates and better understand disease mechanisms. In this review, we introduce the biological/genomic features that have been found to be informative for post-GWAS prioritization, and discuss available tools to utilize these features for prioritization.
genome-wide association studies; prioritization; eQTL; DNase I hypersensitive site; non-coding
Genome-wide association studies (GWAS) have become a preferred method to identify new genetic susceptibility loci. This technique aims at understanding the molecular etiology of common diseases, but in many cases, it has led to the identification of loci with no obvious biological relevance. Herein, we show that previously unrecognized sequence homologies have caused single-nucleotide polymorphism (SNP) microarrays to incorrectly associate a phenotype with a given locus when in fact the linkage is to another distant locus. Using genetic differences between male and female subjects as a model to study the effect of one specific genomic region on the whole SNP microarray, we provide strong evidence that the use of standard methods for GWAS can be misleading. We suggest a new systematic quality control step in the biological interpretation of previous and future GWAS.
A key challenge for genome-wide association studies (GWAS) is to understand how single nucleotide polymorphisms (SNPs) mechanistically underpin complex diseases. While this challenge has been addressed partially by Gene Ontology (GO) enrichment of large lists of host genes of SNPs prioritized in GWAS, these enrichments have not been formally evaluated. Here, we develop a novel computational approach anchored in information theoretic similarity, by systematically mining lists of host genes of SNPs prioritized in three adult-onset diabetes mellitus GWAS. The “gold-standard” is based on GO terms associated with 20 published diabetes SNPs’ host genes and on our own evaluation. We computationally identify 69 similarity-predicted GO terms independently validated in all three GWAS (FDR<5%), enriched with those of the gold-standard (odds ratio=5.89, P=4.81e-05), and these terms can be organized by similarity criteria into 11 groupings termed “biomolecular systems”. Six biomolecular systems were corroborated by the gold-standard and the remaining five were previously uncharacterized. http://lussierlab.org/publications/ITS-GWAS
Given that genome wide association studies (GWAS) of psychiatric disorders have identified only a small number of convincingly associated variants, there is interest in seeking additional evidence for associated variants using tests of gene-gene interaction. Comprehensive pair-wise SNP-SNP interaction analysis is computationally intensive and the penalty for multiple testing is severe given the number of interactions possible. Aiming to minimize these statistical and computational burdens, we have explored approaches to prioritise SNPs for interaction analyses.
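The scale of the multiple-testing burden for exhaustive pairwise analysis can be made concrete with a short calculation. The panel size of 500,000 SNPs and the Bonferroni correction used below are illustrative assumptions, not figures taken from the study.

```python
from math import comb


def pairwise_tests(n_snps):
    """Number of distinct SNP-SNP pairs among n_snps markers."""
    return comb(n_snps, 2)


def bonferroni_threshold(n_snps, alpha=0.05):
    """Per-test significance level if every pair is tested once."""
    return alpha / pairwise_tests(n_snps)


# Hypothetical panel of 500,000 SNPs versus a prioritized set of 1,000.
print(pairwise_tests(500_000))        # → 124999750000
print(bonferroni_threshold(500_000))
print(pairwise_tests(1_000))          # → 499500
```

Restricting the analysis to a prioritized subset shrinks the number of tests roughly with the square of the subset size, which is the motivation for the prioritization strategies explored here.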
Primary interaction analyses were performed using the Wellcome Trust Case Control Consortium Bipolar Disorder GWAS (1868 cases, 2938 controls). Replication analyses were performed using the Genetic Association Information Network BD dataset (1001 cases, 1033 controls). SNPs were prioritized for interaction analysis that showed evidence for association that surpassed a number of nominally significant thresholds, are within genome-wide significant genes, or are within genes that are functionally related.
For no set of prioritized SNPs did we obtain evidence to support the hypothesis that the selection strategy identified pairs of variants that were enriched for true (statistical) interactions.
SNPs prioritized according to a number of criteria do not have a raised prior probability for significant interaction that is detectable in samples of this size. As is now widely accepted for single SNP analysis, we argue the use of significance levels reflecting only the number of tests performed does not offer an appropriate degree of protection against the potential for GWAS studies to generate an enormous number of false positive interactions.
GWAS; SNP; epistasis; association; interaction; gene
The typical objective of genome-wide association (GWA) studies is to identify single-nucleotide polymorphisms (SNPs) and corresponding genes with the strongest evidence of association (the 'most-significant SNPs/genes' approach). Borrowing ideas from micro-array data analysis, we propose a new method, named RS-SNP, for detecting sets of genes enriched in SNPs moderately associated with the phenotype. RS-SNP assesses whether the number of significant SNPs, with p-value P ≤ α, belonging to a given SNP set is statistically significant. The rationale of the proposed method is that two kinds of null hypotheses are taken into account simultaneously. In the first null model, the genotype and the phenotype are assumed to be independent random variables, and the null distribution gives the probability that the number of significant SNPs in the set is greater than that observed by chance. The second null model assumes that the number of significant SNPs in the set depends on the size of the set and not on the identity of the SNPs in it. Statistical significance is assessed using non-parametric permutation tests.
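The second null model lends itself to a compact illustration: draw random SNP sets of the same size and count how often they contain at least as many significant SNPs as the observed set. This is a minimal sketch under that reading, not the RS-SNP implementation; the first null model (which requires permuting phenotypes against genotype data) is not covered, and the function names, the number of permutations, and the toy p-values are assumptions.

```python
import random


def count_significant(pvalues, snp_set, alpha):
    """Number of SNPs in snp_set with p-value at or below alpha."""
    return sum(1 for snp in snp_set if pvalues[snp] <= alpha)


def set_size_permutation_pvalue(pvalues, snp_set, alpha=0.05,
                                n_perm=2000, seed=0):
    """Empirical p-value against random SNP sets of the same size."""
    rng = random.Random(seed)
    observed = count_significant(pvalues, snp_set, alpha)
    all_snps = sorted(pvalues)
    at_least = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_snps, len(snp_set))
        if count_significant(pvalues, random_set, alpha) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least + 1) / (n_perm + 1)


# Hypothetical data: 200 SNPs, of which the first 10 are significant.
pvalues = {f"rs{i}": (0.01 if i < 10 else 0.5) for i in range(200)}
enriched_set = [f"rs{i}" for i in range(10)]
null_set = [f"rs{i}" for i in range(100, 110)]

print(set_size_permutation_pvalue(pvalues, enriched_set))
print(set_size_permutation_pvalue(pvalues, null_set))
```

A set with no significant SNPs always yields an empirical p-value of 1, because every random set matches or exceeds an observed count of zero.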
We applied RS-SNP to the Crohn's disease (CD) data set collected by the Wellcome Trust Case Control Consortium (WTCCC) and compared the results with GENGEN, an approach recently proposed in literature. The enrichment analysis using RS-SNP and the set of pathways contained in the MSigDB C2 CP pathway collection highlighted 86 pathways rich in SNPs weakly associated to CD. Of these, 47 were also indicated to be significant by GENGEN. Similar results were obtained using the MSigDB C5 pathway collection. Many of the pathways found to be enriched by RS-SNP have a well-known connection to CD and often with inflammatory diseases.
The proposed method is a valuable alternative to other techniques for enrichment analysis of SNP sets. It is well founded from a theoretical and statistical perspective. Moreover, the experimental comparison with GENGEN highlights that it is more robust with respect to false positive findings.
Motivation: Genome-wide association studies (GWAS) generate relationships between hundreds of thousands of single nucleotide polymorphisms (SNPs) and complex phenotypes. The contribution of the traditionally overlooked copy number variations (CNVs) to complex traits is also being actively studied. To facilitate the interpretation of the data and the designing of follow-up experimental validations, we have developed a database that enables the sensible prioritization of these variants by combining several approaches, involving not only publicly available physical and functional annotations but also multilocus linkage disequilibrium (LD) annotations as well as annotations of expression quantitative trait loci (eQTLs).
Results: For each SNP, the SCAN database provides: (i) summary information from eQTL mapping of HapMap SNPs to gene expression (evaluated by the Affymetrix exon array) in the full set of HapMap CEU (Caucasians from UT, USA) and YRI (Yoruba people from Ibadan, Nigeria) samples; (ii) LD information, in the case of a HapMap SNP, including what genes have variation in strong LD (pairwise or multilocus LD) with the variant and how well the SNP is covered by different high-throughput platforms; (iii) summary information available from public databases (e.g. physical and functional annotations); and (iv) summary information from other GWAS. For each gene, SCAN provides annotations on: (i) eQTLs for the gene (both local and distant SNPs) and (ii) the coverage of all variants in the HapMap at that gene on each high-throughput platform. For each genomic region, SCAN provides annotations on: (i) physical and functional annotations of all SNPs, genes and known CNVs within the region and (ii) all genes regulated by the eQTLs within the region.
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: One of the fundamental questions in genetic studies is to identify functional DNA variants that are responsible for a disease or phenotype of interest. Results from large-scale genetics studies, such as genome-wide association studies (GWAS), and the availability of high-throughput sequencing technologies provide opportunities for identifying causal variants. Despite the technical advances, informatics methodologies need to be developed to prioritize thousands of variants for potential causative effects.
Results: We present regSNPs, an informatics strategy that integrates several established bioinformatics tools, for prioritizing regulatory SNPs, i.e. the SNPs in the promoter regions that potentially affect phenotype through changing transcription of downstream genes. Compared with existing tools, regSNPs has two distinct features. It considers degenerative features of binding motifs by calculating the differences in binding affinity caused by the candidate variants, and it integrates potential phenotypic effects of various transcription factors. When tested using the disease-causing variants documented in the Human Gene Mutation Database, regSNPs showed mixed performance on various diseases. regSNPs predicted three SNPs that can potentially affect bone density in a region detected in an earlier linkage study. Potential effects of one of the variants were validated using a luciferase reporter assay.
Supplementary data are available at Bioinformatics online.
In genome-wide association studies (GWAS), the association between each single nucleotide polymorphism (SNP) and a phenotype is assessed statistically. To further explore genetic associations in GWAS, we considered two specific forms of biologically plausible SNP-SNP interactions, ‘SNP intersection’ and ‘SNP union,’ and analyzed the Crohn's Disease (CD) GWAS data of the Wellcome Trust Case Control Consortium for these interactions using a limited form of logic regression. We found strong evidence of CD-association for 195 genes, identifying novel susceptibility genes (e.g., ISX, SLCO6A1, TMEM183A) as well as confirming many previously identified susceptibility genes in CD GWAS (e.g., IL23R, NOD2, CYLD, NKX2-3, IL12RB2, ATG16L1). Notably, 37 of the 59 chromosomal locations indicated for CD-association by a meta-analysis of CD GWAS, involving over 22,000 cases and 29,000 controls, were represented in the 195 genes, as well as some chromosomal locations previously indicated only in linkage studies, but not in GWAS. We repeated the analysis with two smaller GWASs from the Database of Genotype and Phenotype (dbGaP): in spite of differences of populations and study power across the three datasets, we observed some consistencies across the three datasets. Notable examples included TMEM183A and SLCO6A1 which exhibited strong evidence consistently in our WTCCC and both of the dbGaP SNP-SNP interaction analyses. Examining these specific forms of SNP interactions could identify additional genetic associations from GWAS. R codes, data examples, and a ReadMe file are available for download from our website: http://www.ualberta.ca/~yyasui/homepage.html.
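The two interaction forms can be illustrated by constructing the combined predictors directly. This sketch does not reproduce the logic regression analysis; it assumes each SNP is coded as a Boolean carrier indicator (at least one copy of a chosen allele), which is an assumption made here, and the data are hypothetical.

```python
def combine(carriers_a, carriers_b, mode):
    """Combine two Boolean carrier vectors as an intersection (AND) or union (OR)."""
    if mode == "intersection":
        return [a and b for a, b in zip(carriers_a, carriers_b)]
    if mode == "union":
        return [a or b for a, b in zip(carriers_a, carriers_b)]
    raise ValueError(f"unknown mode: {mode}")


def carrier_rate(feature, is_case, in_cases=True):
    """Fraction of cases (or controls) for whom the combined feature is True."""
    selected = [f for f, c in zip(feature, is_case) if c == in_cases]
    return sum(selected) / len(selected)


# Hypothetical carrier indicators for four subjects (two cases, two controls).
snp_a = [True, True, False, False]
snp_b = [True, False, True, False]
is_case = [True, True, False, False]

both = combine(snp_a, snp_b, "intersection")   # carriers of A and B
either = combine(snp_a, snp_b, "union")        # carriers of A or B
print(both, either)
print(carrier_rate(either, is_case), carrier_rate(either, is_case, in_cases=False))
```

Each combined indicator can then be tested for association with disease status in the same way as a single SNP.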
With the advent of cost-effective genotyping technologies, genome-wide association studies allow researchers to examine hundreds of thousands of single nucleotide polymorphisms (SNPs) for association with human disease. Recently, many researchers applying this strategy have detected strong associations to disease with SNP markers that are either not in linkage disequilibrium with any nonsynonymous SNP or large distances from any annotated gene. In such cases, no well-established standard practice for effective SNP selection for follow-up studies exists. We aim to identify and prioritize groups of SNPs that are more likely to affect phenotypes in order to facilitate efficient SNP selection for follow-up studies.
Based on the annotations available in the Ensembl database, we categorized SNPs in the human genome into classes related to regulatory attributes, such as epigenetic modifications and transcription factor binding sites, in addition to classes related to gene structure and cross-species conservation. Using the distribution of derived allele frequencies (DAF) within each class, we assessed the strength of natural selection for each class relative to the genome as a whole. We applied this DAF analysis to Perlegen resequenced SNPs genome-wide. Regulatory elements annotated by Ensembl such as specific histone methylation sites as well as classes defined by cross-species conservation showed negative selection in comparison to the genome as a whole.
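One simple way to compare derived allele frequency distributions between an annotation class and the genome as a whole is to compare the fraction of SNPs with a low DAF, since an excess of rare derived alleles is consistent with purifying selection. The sketch below takes this approach; the 0.1 cutoff, the ratio statistic, and the toy frequencies are assumptions and may differ from the analysis actually performed.

```python
def low_daf_fraction(dafs, threshold=0.1):
    """Fraction of SNPs whose derived allele frequency is below threshold."""
    return sum(1 for d in dafs if d < threshold) / len(dafs)


def low_daf_enrichment(daf_by_class, threshold=0.1):
    """Ratio of each class's low-DAF fraction to the fraction over all SNPs."""
    all_dafs = [d for dafs in daf_by_class.values() for d in dafs]
    genome_fraction = low_daf_fraction(all_dafs, threshold)
    return {cls: low_daf_fraction(dafs, threshold) / genome_fraction
            for cls, dafs in daf_by_class.items()}


# Hypothetical DAF values for two annotation classes.
daf_by_class = {
    "conserved_element": [0.02, 0.05, 0.08, 0.30],
    "intergenic": [0.05, 0.40, 0.60, 0.90],
}

print(low_daf_enrichment(daf_by_class))
# → {'conserved_element': 1.5, 'intergenic': 0.5}
```

A ratio above 1 indicates that the class carries more rare derived alleles than the pooled set of SNPs, which under this sketch is read as a sign of negative selection.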
These results highlight which annotated classes are under purifying selection, have putative functional importance, and contain SNPs that are strong candidates for follow-up studies after genome-wide association. Such SNP annotation may also be useful in interpreting results of whole-genome sequencing studies.
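As a rough illustration of the DAF comparison described above, the sketch below measures how strongly low-frequency derived alleles are over-represented in each annotation class relative to all SNPs pooled. The summary statistic (fraction of SNPs with DAF below a cutoff), the cutoff of 0.1, the class labels, and the toy frequencies are all assumptions made for illustration; the abstract does not specify how the DAF distributions were compared.

```python
def low_daf_fraction(dafs, cutoff=0.1):
    """Fraction of SNPs whose derived allele frequency is below `cutoff`."""
    return sum(1 for f in dafs if f < cutoff) / len(dafs)


def low_daf_enrichment(snps, cutoff=0.1):
    """Ratio of each class's low-DAF fraction to the pooled (genome) fraction.

    `snps` is a list of (derived_allele_frequency, class_label) pairs.
    A ratio above 1 means the class holds an excess of rare derived alleles,
    which is the pattern expected under purifying selection.
    """
    genome = low_daf_fraction([daf for daf, _ in snps], cutoff)
    if genome == 0:
        raise ValueError("no SNP falls below the cutoff; ratio is undefined")
    by_class = {}
    for daf, label in snps:
        by_class.setdefault(label, []).append(daf)
    return {label: low_daf_fraction(dafs, cutoff) / genome
            for label, dafs in by_class.items()}


# Toy data (hypothetical): four SNPs per class.
toy_snps = [
    (0.02, "conserved"), (0.05, "conserved"),
    (0.08, "conserved"), (0.40, "conserved"),
    (0.03, "intergenic"), (0.30, "intergenic"),
    (0.55, "intergenic"), (0.90, "intergenic"),
]
enrichment = low_daf_enrichment(toy_snps)
print(enrichment)
```

In this toy input the "conserved" class has a ratio above 1 and the "intergenic" class a ratio below 1, so only the former would be flagged as a candidate for negative selection.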
A central issue in genome-wide association (GWA) studies is assessing statistical significance while adjusting for multiple hypothesis testing. An equally important question is the statistical efficiency of the GWA design as compared to the traditional sequential approach in which genome-wide linkage analysis is followed by region-wise association mapping. Nevertheless, GWA is becoming more popular due in part to cost efficiency: commercially available 1M chips are nearly as inexpensive as a custom-designed 10K chip. It is becoming apparent, however, that most of the on-going GWA studies with 2,000 to 5,000 samples are in fact underpowered. As a means to improve power, we emphasize the importance of utilizing prior information such as results of previous linkage studies via a stratified false discovery rate (FDR) control. The essence of the stratified FDR control is to prioritize the genome and maintain power to interrogate candidate regions within the GWA study. These candidate regions can be defined as, but are by no means limited to, linkage-peak regions. Furthermore, we theoretically unify the stratified FDR approach and the weighted p-value method, and we show that stratified FDR can be formulated as a robust version of weighted FDR. Finally, we demonstrate the utility of the methods in two GWA datasets: Type 2 Diabetes (FUSION) and an on-going study of long-term diabetic complications (DCCT/EDIC). The methods are implemented as a user-friendly software package, SFDR. The same stratification framework can be readily applied to other types of studies, for example, using GWA results to improve the power of sequencing data analyses.
genome-wide association; genome-wide linkage; statistical power; prior information; false discovery rate
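The idea of stratified FDR control can be sketched as follows, assuming that each stratum is handled with the standard Benjamini-Hochberg step-up procedure at the same level. This is a minimal illustration, not the SFDR package: the function names, the stratum labels, and the p-values are hypothetical.

```python
def bh_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up: indices of hypotheses rejected at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value falls under its BH threshold
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            k = rank
    return set(order[:k])


def stratified_fdr(pvalues, strata, alpha=0.05):
    """Apply BH separately within each stratum, all at the same level alpha."""
    groups = {}
    for i, label in enumerate(strata):
        groups.setdefault(label, []).append(i)
    rejected = set()
    for members in groups.values():
        local = bh_reject([pvalues[i] for i in members], alpha)
        rejected.update(members[j] for j in local)
    return rejected


# Toy example (hypothetical): two SNPs under a linkage peak, eight elsewhere.
pvals = [0.004, 0.03, 0.2, 0.6, 0.8, 0.5, 0.9, 0.7, 0.4, 0.3]
strata = ["linkage", "linkage"] + ["other"] * 8

pooled = bh_reject(pvals)                # one BH pass over all ten tests
stratified = stratified_fdr(pvals, strata)
print(sorted(pooled), sorted(stratified))
```

With these toy numbers the pooled analysis rejects only the first SNP, while the stratified analysis also rejects the second SNP in the small candidate stratum, which illustrates how stratification can preserve power in prioritized regions.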
Genome-wide association studies (GWAS) have implicated ANK3 as a susceptibility gene for bipolar disorder (BP). We examined whether epistasis with ANK3 may contribute to the “missing heritability” in BP. We first identified via the STRING database 14 genes encoding proteins with prior biological evidence that they interact molecularly with ANK3. We then tested for statistical evidence of interactions between SNPs in these genes in association with BP in a discovery GWAS dataset and two replication GWAS datasets. The most significant interaction in the discovery GWAS was between SNPs in ANK3 and KCNQ2 (p = 3.18 × 10⁻⁸). A total of 31 pair-wise interactions involving combinations between two SNPs from KCNQ2 and 16 different SNPs in ANK3 were significant after permutation. Of these, 28 pair-wise interactions were significant in the first replication GWAS. None were significant in the second replication GWAS, but the two SNPs from KCNQ2 were found to significantly interact with five other SNPs in ANK3, suggesting possible allelic heterogeneity. KCNQ2 forms homo- and hetero-meric complexes with KCNQ3 that constitute voltage-gated potassium channels in neurons. ANK3 is an adaptor protein that, through its interaction with KCNQ2 and KCNQ3, directs the localization of this channel in the axon initial segment (AIS). At the AIS, the KCNQ2/3 complex gives rise to the M-current, which stabilizes the neuronal resting potential and inhibits repetitive firing of action potentials. Thus, these channels act as “dampening” components and prevent neuronal hyperactivity. The interactions between ANK3 and KCNQ2 merit further investigation, and if confirmed, may motivate a new line of research into a novel therapeutic target for BP.
epistasis; interaction; bipolar disorder; ANK3; KCNQ2; channelopathy; ion channel
Results from genome-wide association studies (GWAS) represent a potential resource for etiological and treatment research. GWAS of obesity-related phenotypes have been especially successful. To translate this success into a research tool, we developed and tested a “genetic risk score” (GRS) that summarizes an individual’s genetic predisposition to obesity.
Different GWAS of obesity-related phenotypes report different sets of single nucleotide polymorphisms (SNPs) as the best genomic markers of obesity risk. Therefore, we applied a 3-stage approach that pooled results from multiple GWAS to select SNPs to include in our GRS. The 3 stages are: (1) Extraction: SNPs with evidence of association are compiled from published GWAS; (2) Clustering: SNPs are grouped according to patterns of linkage disequilibrium; (3) Selection: tag SNPs are selected from clusters that meet specific criteria. We applied this 3-stage approach to results from 16 GWAS of obesity-related phenotypes in European-descent samples to create a GRS. We then tested the GRS in the Atherosclerosis Risk in Communities (ARIC) Study cohort (N=10,745, 55% female, 77% white, 23% African American).
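The extraction, clustering, and selection stages can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: the p-value cutoff, the r² cutoff, the cluster criterion (support from at least two studies), the choice of the smallest p-value as the tag SNP, and all SNP and study names are hypothetical.

```python
def extract(reports, p_cutoff):
    """Stage 1: keep SNPs reported below p_cutoff; track best p and supporting studies."""
    best_p, studies = {}, {}
    for snp, p, study in reports:
        if p < p_cutoff:
            best_p[snp] = min(p, best_p.get(snp, 1.0))
            studies.setdefault(snp, set()).add(study)
    return best_p, studies


def cluster_by_ld(snps, ld_r2, r2_cutoff):
    """Stage 2: connected components of the graph linking SNPs with r2 >= r2_cutoff."""
    neighbors = {s: set() for s in snps}
    for (a, b), r2 in ld_r2.items():
        if a in neighbors and b in neighbors and r2 >= r2_cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, clusters = set(), []
    for start in sorted(snps):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            current = stack.pop()
            component.append(current)
            for nxt in sorted(neighbors[current]):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


def select_tags(clusters, best_p, studies, min_studies):
    """Stage 3: from clusters seen in >= min_studies studies, take the lowest-p SNP."""
    tags = []
    for component in clusters:
        support = set().union(*(studies[s] for s in component))
        if len(support) >= min_studies:
            tags.append(min(component, key=lambda s: best_p[s]))
    return tags


# Toy inputs (hypothetical SNP and study names).
reports = [("snp1", 1e-9, "studyA"), ("snp2", 3e-8, "studyB"),
           ("snp3", 2e-8, "studyA"), ("snp4", 0.2, "studyC")]
ld_r2 = {("snp1", "snp2"): 0.9, ("snp1", "snp3"): 0.1}

best_p, studies = extract(reports, p_cutoff=5e-8)
clusters = cluster_by_ld(best_p.keys(), ld_r2, r2_cutoff=0.8)
tags = select_tags(clusters, best_p, studies, min_studies=2)
print(clusters, tags)
```

Here snp1 and snp2 form one cluster supported by two studies, so snp1 is kept as its tag, while snp3 forms a singleton cluster seen in only one study and is dropped.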
Our 32-locus GRS was a statistically significant predictor of body mass index (BMI) and obesity among ARIC whites (for BMI, r=0.13, p<1×10⁻³⁰; for obesity, area under the receiver operating characteristic curve (AUC)=0.57 [95% CI 0.55–0.58]). The GRS improved prediction of obesity (as measured by delta-AUC and integrated discrimination index) when added to models that included demographic and geographic information, FTO- and MC4R-linked SNPs, and a non-genetic risk assessment consisting of a socioeconomic index (p<0.01 for all comparisons). The GRS also predicted increased mortality risk over 17 years of follow-up. The GRS performed less well among African Americans.
The obesity GRS derived using our 3-stage approach is not useful for clinical risk prediction, but may have value as a tool for etiological and treatment research.
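To make the reported quantities concrete, the sketch below computes a simple GRS and its AUC for a binary outcome. It assumes an unweighted score (the count of risk alleles across loci), which the abstract does not confirm, and it uses the rank-based definition of AUC (the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half). The genotypes and labels are toy values.

```python
def genetic_risk_score(risk_allele_counts):
    """Unweighted GRS: total number of risk alleles (0, 1, or 2 per locus)."""
    return sum(risk_allele_counts)


def auc(scores, labels):
    """Rank-based AUC: P(case score > control score), ties counted as 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data (hypothetical): four individuals genotyped at three loci.
genotypes = [[2, 1, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
obese = [1, 1, 0, 0]

scores = [genetic_risk_score(g) for g in genotypes]
toy_auc = auc(scores, obese)
print(scores, toy_auc)
```

An AUC of 0.5 corresponds to no discrimination, so a value such as the 0.57 reported above indicates only modest separation between cases and controls, consistent with the conclusion that the score is not useful for clinical risk prediction.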
Motivation: There is growing momentum to develop statistical learning (SL) methods as an alternative to conventional genome-wide association studies (GWAS). Methods such as random forests (RF) and gradient boosting machine (GBM) result in variable importance measures that indicate how well each single-nucleotide polymorphism (SNP) predicts the phenotype. For RF, it has been shown that variable importance measures are systematically affected by minor allele frequency (MAF) and linkage disequilibrium (LD). To establish RF and GBM as viable alternatives for analyzing genome-wide data, it is necessary to address this potential bias and show that SL methods do not significantly under-perform conventional GWAS methods.
Results: Both LD and MAF have a significant impact on the variable importance measures commonly used in RF and GBM. Dividing SNPs into overlapping subsets with approximate linkage equilibrium and applying SL methods to each subset successfully reduces the impact of LD. A welcome side effect of this approach is a dramatic reduction in parallel computing time, increasing the feasibility of applying SL methods to large datasets. The created subsets also facilitate a potential correction for the effect of MAF using pseudocovariates. Simulations using simulated SNPs embedded in empirical data—assessing varying effect sizes, minor allele frequencies and LD patterns—suggest that the sensitivity to detect effects is often improved by subsetting and does not significantly under-perform the Armitage trend test, even under ideal conditions for the trend test.
Availability: Code for the LD subsetting algorithm and pseudocovariate correction is available at http://www.nd.edu/~glubke/code.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
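The LD subsetting idea from the Results can be illustrated with a greedy sketch: each SNP is placed in the first subset in which its pairwise r² with every member stays below a threshold. This is not the published algorithm. In particular it yields disjoint rather than overlapping subsets, it uses the squared Pearson correlation of genotype dosages as a stand-in for LD, and the 0.1 threshold, SNP names, and genotypes are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    return sxy * sxy / (sxx * syy)


def ld_subsets(genotypes, max_r2=0.1):
    """Greedily assign SNPs to subsets whose members are pairwise below max_r2."""
    subsets = []
    for snp, dosages in genotypes.items():
        for subset in subsets:
            if all(genotype_r2(dosages, genotypes[other]) < max_r2
                   for other in subset):
                subset.append(snp)
                break
        else:
            subsets.append([snp])  # in LD with some member of every subset
    return subsets


# Toy genotypes (hypothetical): snpB duplicates snpA, snpC is uncorrelated with both.
toy_genotypes = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],
    "snpC": [2, 0, 2, 2, 0, 2],
}
subsets = ld_subsets(toy_genotypes)
print(subsets)
```

A statistical learning method such as RF or GBM would then be fitted to each subset separately, so that strongly correlated SNPs never compete within the same model.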