PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-25 (2022220)

Clipboard (0)
None

Related Articles

1.  Systematic biological prioritization after a genome-wide association study: an application to nicotine dependence 
Bioinformatics  2008;24(16):1805-1811.
Motivation: A challenging problem after a genome-wide association study (GWAS) is to balance the statistical evidence of genotype–phenotype correlation with a priori evidence of biological relevance.
Results: We introduce a method for systematically prioritizing single nucleotide polymorphisms (SNPs) for further study after a GWAS. The method combines evidence across multiple domains including statistical evidence of genotype–phenotype correlation, known pathways in the pathologic development of disease, SNP/gene functional properties, comparative genomics, prior evidence of genetic linkage, and linkage disequilibrium. We apply this method to a GWAS of nicotine dependence, and use simulated data to test it on several commercial SNP microarrays.
Availability: A comprehensive database of biological prioritization scores for all known SNPs is available at http://zork.wustl.edu/gin. This can be used to prioritize nicotine dependence association studies through a straightforward mathematical formula—no special software is necessary.
Contact: ssaccone@wustl.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn315
PMCID: PMC2610477  PMID: 18565990
2.  Convergence of genetic influences in comorbidity 
BMC Bioinformatics  2012;13(Suppl 2):S8.
Background
Predisposition to complex diseases is explained in part by genetic variation, and complex diseases are frequently comorbid, consistent with pleiotropic genetic variation influencing comorbidity. Genome Wide Association (GWA) studies typically assess association between SNPs and a single-disease phenotype. Fisher meta-analysis combines evidence of association from single-disease GWA studies, assuming that each study is an independent test of the same hypothesis. The Rank Product (RP) method overcomes limitations posed by Fisher assumptions, though RP was not designed for GWA data.
Methods
We modified RP to accommodate GWA data, and we call it modRP. Using p-values output from GWA studies, we aggregate evidence for association between SNPs and related phenotypes. To assess significance, RP randomly samples the observed ranks to develop the null distribution of the RP statistic, and then places the observed RPs into the null distribution. ModRP eliminates the effect of linkage disequilibrium and controls for differences in power at tested SNPs, to meet RP assumptions in application to GWA data.
Results
After validating modRP based on both positive and negative control studies, we searched for pleiotropic influences on comorbid substance use disorders in a novel study, and found two SNPs to be significantly associated with comorbid cocaine, opium, and nicotine dependence. Placing these SNPs into biological context, we developed a protein network modeling the interaction of cocaine, nicotine, and opium with these variants.
Conclusions
ModRP is a novel approach to identifying pleiotropic genetic influences on comorbid complex diseases. It can be used to assess association for related phenotypes where raw data is unavailable or inappropriate for analysis using other approaches. The method is conceptually simple and produces statistically significant, biologically relevant results.
doi:10.1186/1471-2105-13-S2-S8
PMCID: PMC3375629  PMID: 22536871
3.  In search of causal variants: refining disease association signals using cross-population contrasts 
BMC Genetics  2008;9:58.
Background
Genome-wide association (GWA) using large numbers of single nucleotide polymorphisms (SNPs) is now a powerful, state-of-the-art approach to mapping human disease genes. When a GWA study detects association between a SNP and the disease, this signal usually represents association with a set of several highly correlated SNPs in strong linkage disequilibrium. The challenge we address is to distinguish among these correlated loci to highlight potential functional variants and prioritize them for follow-up.
Results
We implemented a systematic method for testing association across diverse population samples having differing histories and LD patterns, using a logistic regression framework. The hypothesis is that important underlying biological mechanisms are shared across human populations, and we can filter correlated variants by testing for heterogeneity of genetic effects in different population samples. This approach formalizes the descriptive comparison of p-values that has typified similar cross-population fine-mapping studies to date. We applied this method to correlated SNPs in the cholinergic nicotinic receptor gene cluster CHRNA5-CHRNA3-CHRNB4, in a case-control study of cocaine dependence composed of 504 European-American and 583 African-American samples. Of the 10 SNPs genotyped in the r2 ≥ 0.8 bin for rs16969968, three demonstrated significant cross-population heterogeneity and are filtered from priority follow-up; the remaining SNPs include rs16969968 (heterogeneity p = 0.75). Though the power to filter out rs16969968 is reduced due to the difference in allele frequency in the two groups, the results nevertheless focus attention on a smaller group of SNPs that includes the non-synonymous SNP rs16969968, which retains a similar effect size (odds ratio) across both population samples.
Conclusion
Filtering out SNPs that demonstrate cross-population heterogeneity enriches for variants more likely to be important and causative. Our approach provides an important and effective tool to help interpret results from the many GWA studies now underway.
doi:10.1186/1471-2156-9-58
PMCID: PMC2556340  PMID: 18759969
4.  SPOT: a web-based tool for using biological databases to prioritize SNPs after a genome-wide association study 
Nucleic Acids Research  2010;38(Web Server issue):W201-W209.
SPOT (http://spot.cgsmd.isi.edu), the SNP prioritization online tool, is a web site for integrating biological databases into the prioritization of single nucleotide polymorphisms (SNPs) for further study after a genome-wide association study (GWAS). Typically, the next step after a GWAS is to genotype the top signals in an independent replication sample. Investigators will often incorporate information from biological databases so that biologically relevant SNPs, such as those in genes related to the phenotype or with potentially non-neutral effects on gene expression such as a splice sites, are given higher priority. We recently introduced the genomic information network (GIN) method for systematically implementing this kind of strategy. The SPOT web site allows users to upload a list of SNPs and GWAS P-values and returns a prioritized list of SNPs using the GIN method. Users can specify candidate genes or genomic regions with custom levels of prioritization. The results can be downloaded or viewed in the browser where users can interactively explore the details of each SNP, including graphical representations of the GIN method. For investigators interested in incorporating biological databases into a post-GWAS SNP selection strategy, the SPOT web tool is an easily implemented and flexible solution.
doi:10.1093/nar/gkq513
PMCID: PMC2896195  PMID: 20529875
5.  Candidate Causal Regulatory Effects by Integration of Expression QTLs with Complex Trait Genetic Associations 
PLoS Genetics  2010;6(4):e1000895.
The recent success of genome-wide association studies (GWAS) is now followed by the challenge to determine how the reported susceptibility variants mediate complex traits and diseases. Expression quantitative trait loci (eQTLs) have been implicated in disease associations through overlaps between eQTLs and GWAS signals. However, the abundance of eQTLs and the strong correlation structure (LD) in the genome make it likely that some of these overlaps are coincidental and not driven by the same functional variants. In the present study, we propose an empirical methodology, which we call Regulatory Trait Concordance (RTC) that accounts for local LD structure and integrates eQTLs and GWAS results in order to reveal the subset of association signals that are due to cis eQTLs. We simulate genomic regions of various LD patterns with both a single or two causal variants and show that our score outperforms SNP correlation metrics, be they statistical (r2) or historical (D'). Following the observation of a significant abundance of regulatory signals among currently published GWAS loci, we apply our method with the goal to prioritize relevant genes for each of the respective complex traits. We detect several potential disease-causing regulatory effects, with a strong enrichment for immunity-related conditions, consistent with the nature of the cell line tested (LCLs). Furthermore, we present an extension of the method in trans, where interrogating the whole genome for downstream effects of the disease variant can be informative regarding its unknown primary biological effect. We conclude that integrating cellular phenotype associations with organismal complex traits will facilitate the biological interpretation of the genetic effects on these traits.
Author Summary
Genome-wide association studies have led to the identification of susceptibility loci for a variety of human complex traits. What is still largely missing, however, is the understanding of the biological context in which these candidate variants act and of how they determine each trait. Given the localization of many GWAS loci outside coding regions and the important role of regulatory variation in shaping phenotypic variance, gene expression has been proposed as a plausible informative intermediate phenotype. Here we show that for a subset of the currently published GWAS this is indeed the case, by observing a significant excess of regulatory variants among disease loci. We propose an empirical methodology (regulatory trait concordance—RTC) able to integrate expression and disease data in order to detect causal regulatory effects. We show that the RTC outperforms simple correlation metrics under various simulated linkage disequilibrium (LD) scenarios. Our method is able to recover previously suspected causal regulatory effects from the literature and, as expected given the nature of the tested tissue, an overrepresentation of immunity-related candidates is observed. As the number of available tissues will increase, this prioritization approach will become even more useful in understanding the implication of regulatory variants in disease etiology.
doi:10.1371/journal.pgen.1000895
PMCID: PMC2848550  PMID: 20369022
6.  Genome-wide association study combined with biological context can reveal more disease-related SNPs altering microRNA target seed sites 
BMC Genomics  2014;15(1):669.
Background
Emerging studies demonstrate that single nucleotide polymorphisms (SNPs) resided in the microRNA recognition element seed sites (MRESSs) in 3′UTR of mRNAs are putative biomarkers for human diseases and cancers. However, exhaustively experimental validation for the causality of MRESS SNPs is impractical. Therefore bioinformatics have been introduced to predict causal MRESS SNPs. Genome-wide association study (GWAS) provides a way to detect susceptibility of millions of SNPs simultaneously by taking linkage disequilibrium (LD) into account, but the multiple-testing corrections implemented to suppress false positive rate always sacrificed the sensitivity. In our study, we proposed a method to identify candidate causal MRESS SNPs from 12 GWAS datasets without performing multiple-testing corrections. Alternatively, we used biological context to ensure credibility of the selected SNPs.
Results
In 11 out of the 12 GWAS datasets, MRESS SNPs were over-represented in SNPs with p-value ≤ 0.05 (odds ratio (OR) ranged from 1.1 to 2.4). Moreover, host genes of susceptible MRESS SNPs in each of the 11 GWAS dataset shared biological context with reported causal genes. There were 286 MRESS SNPs identified by our method, while only 13 SNPs were identified by multiple-testing corrections with a given threshold of 1 × 10−5, which is a common cutoff used in GWAS. 27 out of the 286 candidate SNPs have been reported to be deleterious while only 2 out of 13 multiple-testing corrected SNPs were documented in PubMed. MicroRNA-mRNA interactions affected by the 286 candidate SNPs were likely to present negatively correlated expression. These SNPs introduced greater alternation of binding free energy than other MRESS SNPs, especially when grouping by haplotypes (4210 vs. 4105 cal/mol by mean, 9781 vs. 8521 cal/mol by mean, respectively).
Conclusions
MRESS SNPs are promising disease biomarkers in multiple GWAS datasets. The method of integrating GWAS p-value and biological context is stable and effective for selecting candidate causal MRESS SNPs, it reduces the loss of sensitivity compared to multiple-testing corrections. The 286 candidate causal MRESS SNPs provide researchers a credible source to initialize their design of experimental validations in the future.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-669) contains supplementary material, which is available to authorized users.
doi:10.1186/1471-2164-15-669
PMCID: PMC4246476  PMID: 25106527
microRNA; Genome-wide association study; Single nucleotide polymorphisms; Human diseases and cancers
7.  A noise-reduction GWAS analysis implicates altered regulation of neurite outgrowth and guidance in autism 
Molecular Autism  2011;2:1.
Background
Genome-wide Association Studies (GWAS) have proved invaluable for the identification of disease susceptibility genes. However, the prioritization of candidate genes and regions for follow-up studies often proves difficult due to false-positive associations caused by statistical noise and multiple-testing. In order to address this issue, we propose the novel GWAS noise reduction (GWAS-NR) method as a way to increase the power to detect true associations in GWAS, particularly in complex diseases such as autism.
Methods
GWAS-NR utilizes a linear filter to identify genomic regions demonstrating correlation among association signals in multiple datasets. We used computer simulations to assess the ability of GWAS-NR to detect association against the commonly used joint analysis and Fisher's methods. Furthermore, we applied GWAS-NR to a family-based autism GWAS of 597 families and a second existing autism GWAS of 696 families from the Autism Genetic Resource Exchange (AGRE) to arrive at a compendium of autism candidate genes. These genes were manually annotated and classified by a literature review and functional grouping in order to reveal biological pathways which might contribute to autism aetiology.
Results
Computer simulations indicate that GWAS-NR achieves a significantly higher classification rate for true positive association signals than either the joint analysis or Fisher's methods and that it can also achieve this when there is imperfect marker overlap across datasets or when the closest disease-related polymorphism is not directly typed. In two autism datasets, GWAS-NR analysis resulted in 1535 significant linkage disequilibrium (LD) blocks overlapping 431 unique reference sequencing (RefSeq) genes. Moreover, we identified the nearest RefSeq gene to the non-gene overlapping LD blocks, producing a final candidate set of 860 genes. Functional categorization of these implicated genes indicates that a significant proportion of them cooperate in a coherent pathway that regulates the directional protrusion of axons and dendrites to their appropriate synaptic targets.
Conclusions
As statistical noise is likely to particularly affect studies of complex disorders, where genetic heterogeneity or interaction between genes may confound the ability to detect association, GWAS-NR offers a powerful method for prioritizing regions for follow-up studies. Applying this method to autism datasets, GWAS-NR analysis indicates that a large subset of genes involved in the outgrowth and guidance of axons and dendrites is implicated in the aetiology of autism.
doi:10.1186/2040-2392-2-1
PMCID: PMC3035032  PMID: 21247446
8.  GPA: A Statistical Approach to Prioritizing GWAS Results by Integrating Pleiotropy and Annotation 
PLoS Genetics  2014;10(11):e1004787.
Results from Genome-Wide Association Studies (GWAS) have shown that complex diseases are often affected by many genetic variants with small or moderate effects. Identifications of these risk variants remain a very challenging problem. There is a need to develop more powerful statistical methods to leverage available information to improve upon traditional approaches that focus on a single GWAS dataset without incorporating additional data. In this paper, we propose a novel statistical approach, GPA (Genetic analysis incorporating Pleiotropy and Annotation), to increase statistical power to identify risk variants through joint analysis of multiple GWAS data sets and annotation information because: (1) accumulating evidence suggests that different complex diseases share common risk bases, i.e., pleiotropy; and (2) functionally annotated variants have been consistently demonstrated to be enriched among GWAS hits. GPA can integrate multiple GWAS datasets and functional annotations to seek association signals, and it can also perform hypothesis testing to test the presence of pleiotropy and enrichment of functional annotation. Statistical inference of the model parameters and SNP ranking is achieved through an EM algorithm that can handle genome-wide markers efficiently. When we applied GPA to jointly analyze five psychiatric disorders with annotation information, not only did GPA identify many weak signals missed by the traditional single phenotype analysis, but it also revealed relationships in the genetic architecture of these disorders. Using our hypothesis testing framework, statistically significant pleiotropic effects were detected among these psychiatric disorders, and the markers annotated in the central nervous system genes and eQTLs from the Genotype-Tissue Expression (GTEx) database were significantly enriched. We also applied GPA to a bladder cancer GWAS data set with the ENCODE DNase-seq data from 125 cell lines. GPA was able to detect cell lines that are biologically more relevant to bladder cancer. The R implementation of GPA is currently available at http://dongjunchung.github.io/GPA/.
Author Summary
In the past 10 years, many genome wide association studies (GWAS) have been conducted to identify the genetic bases of complex human traits. As of January, 2014, more than 12,000 single-nucleotide polymorphisms (SNPs) have been reported to be significantly associated with at least one complex trait/disease. On one hand, about 85% of identified risk variants are located in non-coding regions, which motivates a systematic understanding of the function of non-coding variants in regulatory elements in the human genome. On the other hand, complex diseases are often affected by many genetic variants with small or moderate effects. To address these issues, we propose a statistical approach, GPA, to integrating information from multiple GWAS datasets and functional annotation. Notably, our approach only requires marker-wise p-values as input, making it especially useful when only summary statistics, instead of the full genotype and phenotype data, are available. We applied GPA to analyze GWAS datasets of five psychiatric disorders and bladder cancer, where the central nervous system genes, eQTLs from the Genotype-Tissue Expression (GTEx), and the ENCODE DNase-seq data from 125 cell lines were used as functional annotation. The analysis results suggest that GPA is an effective method for integrative data analysis in the post-GWAS era.
doi:10.1371/journal.pgen.1004787
PMCID: PMC4230845  PMID: 25393678
9.  Making sense of GWAS: using epigenomics and genome engineering to understand the functional relevance of SNPs in non-coding regions of the human genome 
Considerable progress towards an understanding of complex diseases has been made in recent years due to the development of high-throughput genotyping technologies. Using microarrays that contain millions of single-nucleotide polymorphisms (SNPs), Genome Wide Association Studies (GWASs) have identified SNPs that are associated with many complex diseases or traits. For example, as of February 2015, 2111 association studies have identified 15,396 SNPs for various diseases and traits, with the number of identified SNP-disease/trait associations increasing rapidly in recent years. However, it has been difficult for researchers to understand disease risk from GWAS results. This is because most GWAS-identified SNPs are located in non-coding regions of the genome. It is important to consider that the GWAS-identified SNPs serve only as representatives for all SNPs in the same haplotype block, and it is equally likely that other SNPs in high linkage disequilibrium (LD) with the array-identified SNPs are causal for the disease. Because it was hoped that disease-associated coding variants would be identified if the true casual SNPs were known, investigators have expanded their analyses using LD calculation and fine-mapping. However, such analyses also identified risk-associated SNPs located in non-coding regions. Thus, the GWAS field has been left with the conundrum as to how a single-nucleotide change in a non-coding region could confer increased risk for a specific disease. One possible answer to this puzzle is that the variant SNPs cause changes in gene expression levels rather than causing changes in protein function. This review provides a description of (1) advances in genomic and epigenomic approaches that incorporate functional annotation of regulatory elements to prioritize the disease risk-associated SNPs that are located in non-coding regions of the genome for follow-up studies, (2) various computational tools that aid in identifying gene expression changes caused by the non-coding disease-associated SNPs, and (3) experimental approaches to identify target genes of, and study the biological phenotypes conferred by, non-coding disease-associated SNPs.
doi:10.1186/s13072-015-0050-4
PMCID: PMC4696349  PMID: 26719772
GWAS; Enhancers; Non-coding SNPs; Genome engineering
10.  Prioritization of SNPs for Genome-Wide Association Studies Using an Interaction Model of Genetic Variation, Gene Expression, and Trait Variation 
Molecules and Cells  2012;33(4):351-361.
The identification of true causal loci to unravel the statistical evidence of genotype-phenotype correlations and the biological relevance of selected single-nucleotide polymorphisms (SNPs) is a challenging issue in genome-wide association studies (GWAS). Here, we introduced a novel method for the prioritization of SNPs based on p-values from GWAS. The method uses functional evidence from populations, including phenotype-associated gene expressions. Based on the concept of genetic interactions, such as perturbation of gene expression by genetic variation, phenotype and gene expression related SNPs were prioritized by adjusting the p-values of SNPs. We applied our method to GWAS data related to drug-induced cytotoxicity. Then, we prioritized loci that potentially play a role in drug-induced cytotoxicity. By generating an interaction model, our approach allowed us not only to identify causal loci, but also to find intermediate nodes that regulate the flow of information among causal loci, perturbed gene expression, and resulting phenotypic variation.
doi:10.1007/s10059-012-2264-7
PMCID: PMC3887803  PMID: 22460606
genome-wide association study; interaction network; prioritization; SNP
11.  Replication of genetic loci for ages at menarche and menopause in the multi-ethnic Population Architecture using Genomics and Epidemiology (PAGE) study 
Human Reproduction (Oxford, England)  2013;28(6):1695-1706.
STUDY QUESTION
Do genetic associations identified in genome-wide association studies (GWAS) of age at menarche (AM) and age at natural menopause (ANM) replicate in women of diverse race/ancestry from the Population Architecture using Genomics and Epidemiology (PAGE) Study?
SUMMARY ANSWER
We replicated GWAS reproductive trait single nucleotide polymorphisms (SNPs) in our European descent population and found that many SNPs were also associated with AM and ANM in populations of diverse ancestry.
WHAT IS KNOWN ALREADY
Menarche and menopause mark the reproductive lifespan in women and are important risk factors for chronic diseases including obesity, cardiovascular disease and cancer. Both events are believed to be influenced by environmental and genetic factors, and vary in populations differing by genetic ancestry and geography. Most genetic variants associated with these traits have been identified in GWAS of European-descent populations.
STUDY DESIGN, SIZE, DURATION
A total of 42 251 women of diverse ancestry from PAGE were included in cross-sectional analyses of AM and ANM.
MATERIALS, SETTING, METHODS
SNPs previously associated with ANM (n = 5 SNPs) and AM (n = 3 SNPs) in GWAS were genotyped in American Indians, African Americans, Asians, European Americans, Hispanics and Native Hawaiians. To test SNP associations with ANM or AM, we used linear regression models stratified by race/ethnicity and PAGE sub-study. Results were then combined in race-specific fixed effect meta-analyses for each outcome. For replication and generalization analyses, significance was defined at P < 0.01 for ANM analyses and P < 0.017 for AM analyses.
MAIN RESULTS AND THE ROLE OF CHANCE
We replicated findings for AM SNPs in the LIN28B locus and an intergenic region on 9q31 in European Americans. The LIN28B SNPs (rs314277 and rs314280) were also significantly associated with AM in Asians, but not in other race/ethnicity groups. Linkage disequilibrium (LD) patterns at this locus varied widely among the ancestral groups. With the exception of an intergenic SNP at 13q34, all ANM SNPs replicated in European Americans. Three were significantly associated with ANM in other race/ethnicity populations: rs2153157 (6p24.2/SYCP2L), rs365132 (5q35/UIMC1) and rs16991615 (20p12.3/MCM8). While rs1172822 (19q13/BRSK1) was not significant in the populations of non-European descent, effect sizes showed similar trends.
LIMITATIONS, REASONS FOR CAUTION
Lack of association for the GWAS SNPs in the non-European American groups may be due to differences in locus LD patterns between these groups and the European-descent populations included in the GWAS discovery studies; and in some cases, lower power may also contribute to non-significant findings.
WIDER IMPLICATIONS OF THE FINDINGS
The discovery of genetic variants associated with the reproductive traits provides an important opportunity to elucidate the biological mechanisms involved with normal variation and disorders of menarche and menopause. In this study we replicated most, but not all reported SNPs in European descent populations and examined the epidemiologic architecture of these early reported variants, describing their generalizability and effect size across differing ancestral populations. Such data will be increasingly important for prioritizing GWAS SNPs for follow-up in fine-mapping and resequencing studies, as well as in translational research.
STUDY FUNDING/COMPETING INTEREST(S)
The Population Architecture Using Genomics and Epidemiology (PAGE) program is funded by the National Human Genome Research Institute (NHGRI), supported by U01HG004803 (CALiCo), U01HG004798 (EAGLE), U01HG004802 (MEC), U01HG004790 (WHI) and U01HG004801 (Coordinating Center), and their respective NHGRI ARRA supplements. The authors report no conflicts of interest.
doi:10.1093/humrep/det071
PMCID: PMC3657124  PMID: 23508249
menopause; menarche; genome-wide association study; race/ethnicity; single nucleotide polymorphism
12.  SNPranker 2.0: a gene-centric data mining tool for diseases associated SNP prioritization in GWAS 
BMC Bioinformatics  2013;14(Suppl 1):S9.
Background
The capability of correlating specific genotypes with human diseases is a complex issue in spite of all advantages arisen from high-throughput technologies, such as Genome Wide Association Studies (GWAS). New tools for genetic variants interpretation and for Single Nucleotide Polymorphisms (SNPs) prioritization are actually needed. Given a list of the most relevant SNPs statistically associated to a specific pathology as result of a genotype study, a critical issue is the identification of genes that are effectively related to the disease by re-scoring the importance of the identified genetic variations. Vice versa, given a list of genes, it can be of great importance to predict which SNPs can be involved in the onset of a particular disease, in order to focus the research on their effects.
Results
We propose a new bioinformatics approach to support biological data mining in the analysis and interpretation of SNPs associated to pathologies. This system can be employed to design custom genotyping chips for disease-oriented studies and to re-score GWAS results. The proposed method relies (1) on the data integration of public resources using a gene-centric database design, (2) on the evaluation of a set of static biomolecular annotations, defined as features, and (3) on the SNP scoring function, which computes SNP scores using parameters and weights set by users. We employed a machine learning classifier to set default feature weights and an ontological annotation layer to enable the enrichment of the input gene set. We implemented our method as a web tool called SNPranker 2.0 (http://www.itb.cnr.it/snpranker), improving our first published release of this system. A user-friendly interface allows the input of a list of genes, SNPs or a biological process, and to customize the features set with relative weights. As result, SNPranker 2.0 returns a list of SNPs, localized within input and ontologically enriched genes, combined with their prioritization scores.
Conclusions
Different databases and resources are already available for SNPs annotation, but they do not prioritize or re-score SNPs relying on a-priori biomolecular knowledge. SNPranker 2.0 attempts to fill this gap through a user-friendly integrated web resource. End users, such as researchers in medical genetics and epidemiology, may find in SNPranker 2.0 a new tool for data mining and interpretation able to support SNPs analysis. Possible scenarios are GWAS data re-scoring, SNPs selection for custom genotyping arrays and SNPs/diseases association studies.
doi:10.1186/1471-2105-14-S1-S9
PMCID: PMC3548692  PMID: 23369106
13.  Detection of quantitative trait loci in Bos indicus and Bos taurus cattle using genome-wide association studies 
Background
The apparent effect of a single nucleotide polymorphism (SNP) on phenotype depends on the linkage disequilibrium (LD) between the SNP and a quantitative trait locus (QTL). However, the phase of LD between a SNP and a QTL may differ between Bos indicus and Bos taurus because they diverged at least one hundred thousand years ago. Here, we test the hypothesis that the apparent effect of a SNP on a quantitative trait depends on whether the SNP allele is inherited from a Bos taurus or Bos indicus ancestor.
Methods
Phenotype data on one or more traits and SNP genotype data for 10 181 cattle from Bos taurus, Bos indicus and composite breeds were used. All animals had genotypes for 729 068 SNPs (real or imputed). Chromosome segments were classified as originating from B. indicus or B. taurus on the basis of the haplotype of SNP alleles they contained. Consequently, SNP alleles were classified according to their sub-species origin. Three models were used for the association study: (1) conventional GWAS (genome-wide association study), fitting a single SNP effect regardless of subspecies origin, (2) interaction GWAS, fitting an interaction between SNP and subspecies-origin, and (3) best variable GWAS, fitting the most significant combination of SNP and sub-species origin.
Results
Fitting an interaction between SNP and subspecies origin resulted in more significant SNPs (i.e. more power) than a conventional GWAS. Thus, the effect of a SNP depends on the subspecies that the allele originates from. Also, most QTL segregated in only one subspecies, suggesting that many mutations that affect the traits studied occurred after divergence of the subspecies or the mutation became fixed or was lost in one of the subspecies.
Conclusions
The results imply that GWAS and genomic selection could gain power by distinguishing SNP alleles based on their subspecies origin, and that only few QTL segregate in both B. indicus and B. taurus cattle. Thus, the QTL that segregate in current populations likely resulted from mutations that occurred in one of the subspecies and can have both positive and negative effects on the traits. There was no evidence that selection has increased the frequency of alleles that increase body weight.
doi:10.1186/1297-9686-45-43
PMCID: PMC4176739  PMID: 24168700
14.  Winner's Curse Correction and Variable Thresholding Improve Performance of Polygenic Risk Modeling Based on Genome-Wide Association Study Summary-Level Data 
PLoS Genetics  2016;12(12):e1006493.
Recent heritability analyses have indicated that genome-wide association studies (GWAS) have the potential to improve genetic risk prediction for complex diseases based on polygenic risk score (PRS), a simple modelling technique that can be implemented using summary-level data from the discovery samples. We herein propose modifications to improve the performance of PRS. We introduce threshold-dependent winner’s-curse adjustments for marginal association coefficients that are used to weight the single-nucleotide polymorphisms (SNPs) in PRS. Further, as a way to incorporate external functional/annotation knowledge that could identify subsets of SNPs highly enriched for associations, we propose variable thresholds for SNPs selection. We applied our methods to GWAS summary-level data of 14 complex diseases. Across all diseases, a simple winner’s curse correction uniformly led to enhancement of performance of the models, whereas incorporation of functional SNPs was beneficial only for selected diseases. Compared to the standard PRS algorithm, the proposed methods in combination led to notable gain in efficiency (25–50% increase in the prediction R2) for 5 of 14 diseases. As an example, for GWAS of type 2 diabetes, winner’s curse correction improved prediction R2 from 2.29% based on the standard PRS to 3.10% (P = 0.0017) and incorporating functional annotation data further improved R2 to 3.53% (P = 2×10−5). Our simulation studies illustrate why differential treatment of certain categories of functional SNPs, even when shown to be highly enriched for GWAS-heritability, does not lead to proportionate improvement in genetic risk-prediction because of non-uniform linkage disequilibrium structure.
Author Summary
Large GWAS have identified tens or even hundreds of common SNPs significantly associated with individual complex diseases; however, these SNPs typically explain a small proportion of phenotypic variance. Recently, heritability analyses based on GWAS data suggest that common SNPs have the potential to explain substantially larger fraction of phenotypic variance and to improve the genetic risk prediction. Because of the polygenic nature, improving genetic risk prediction for complex diseases typically requires substantially increasing the sample size in the discovery set. Thus, it is crucial to develop more efficient algorithms using existing GWAS summary data. In this article, we extend the polygenic risk score (PRS) method by adjusting the marginal effect size of SNPs for winner’s curse and by incorporating external functional annotation data. Theoretical analysis and simulation studies show that the performance improvement depends on the genetic architecture of the trait, sample size of the discovery sample set and the degree of enrichment of association for SNPs annotated as “high-prior” and the linkage disequilibrium patterns of these SNPs. We applied our method to the summary data of 14 GWAS. Our method achieved 25–50% gain in efficiency (measured in the prediction R2) for 5 of 14 diseases compared to the standard PRS.
doi:10.1371/journal.pgen.1006493
PMCID: PMC5201242  PMID: 28036406
15.  Finding type 2 diabetes causal single nucleotide polymorphism combinations and functional modules from genome-wide association data 
Background
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
Methods
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration that selects SNPs within functional modules from genome-wide SNPs based expanded GSEA.
Results
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset are selected using optimal filtration criteria, with an error rate of 10.25%. Matching 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration record no significance compared to the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
Conclusions
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
doi:10.1186/1472-6947-13-S1-S3
PMCID: PMC3618247  PMID: 23566118
16.  On the identification of potential regulatory variants within genome wide association candidate SNP sets 
BMC Medical Genomics  2014;7:34.
Background
Genome wide association studies (GWAS) are a population-scale approach to the identification of segments of the genome in which genetic variations may contribute to disease risk. Current methods focus on the discovery of single nucleotide polymorphisms (SNPs) associated with disease traits. As there are many SNPs within identified risk loci, and the majority of these are situated within non-coding regions, a key challenge is to identify and prioritize variants affecting regulatory sequences that are likely to contribute to the phenotype assessed.
Methods
We focused investigation on SNPs within lung and breast cancer GWAS loci that reached genome-wide significance for potential roles in gene regulation with a specific focus on SNPs likely to disrupt transcription factor binding sites. Within risk loci, the regulatory potential of sub-regions was classified using relevant open chromatin and epigenetic high throughput sequencing data sets from the ENCODE project in available cancer and normal cell lines. Furthermore, transcription factor affinity altering variants were predicted by comparison of position weight matrix scores between disease and reference alleles. Lastly, ChIP-seq data of transcription associated factors and topological domains were included as binding evidence and potential gene target inference.
Results
The sets of SNPs, including both the disease-associated markers and those in high linkage disequilibrium with them, were significantly over-represented in regulatory sequences of cancer and/or normal cells; however, over-representation was generally not restricted to disease-relevant tissue specific regions. The calculated regulatory potential, allelic binding affinity scores and ChIP-seq binding evidence were the three criteria used to prioritize candidates. Fitting all three criteria, we highlighted breast cancer susceptibility SNPs and a borderline lung cancer relevant SNP located in cancer-specific enhancers overlapping multiple distinct transcription associated factor ChIP-seq binding sites.
Conclusion
Incorporating high throughput sequencing epigenetic and transcription factor data sets from both cancer and normal cells into cancer genetic studies reveals potential functional SNPs and informs subsequent characterization efforts.
doi:10.1186/1755-8794-7-34
PMCID: PMC4066296  PMID: 24920305
GWAS; Lung cancer; Regulatory regions; Gene regulation; Transcription factor binding site alteration; Enhancer; Topological domains
17.  Development and Evaluation of a Genetic Risk Score for Obesity 
Biodemography and social biology  2013;59(1):10.1080/19485565.2013.774628.
Background
Results from genome-wide association studies (GWAS) represent a potential resource for etiological and treatment research. GWAS of obesity-related phenotypes have been especially successful. To translate this success into a research tool, we developed and tested a “genetic risk score” (GRS) that summarizes an individual’s genetic predisposition to obesity.
Methods
Different GWAS of obesity-related phenotypes report different sets of single nucleotide polymorphisms (SNPs) as the best genomic markers of obesity risk. Therefore, we applied a 3-stage approach that pooled results from multiple GWAS to select SNPs to include in our GRS: The 3 stages are (1) Extraction. SNPs with evidence of association are compiled from published GWAS; (2) Clustering. SNPs are grouped according to patterns of linkage disequilibrium; (3) Selection. Tag SNPs are selected from clusters that meet specific criteria. We applied this 3-stage approach to results from 16 GWAS of obesity-related phenotypes in European-descent samples to create a GRS. We then tested the GRS in the Atherosclerosis Risk in the Communities (ARIC) Study cohort (N=10,745, 55% female, 77% white, 23% African American).
Results
Our 32-locus GRS was a statistically significant predictor of body mass index (BMI) and obesity among ARIC whites (for BMI, r=0.13, p<1×10−30; for obesity, area under the receiver operating characteristic curve (AUC)=0.57 [95% CI 0.55–0.58]). The GRS improved prediction of obesity (as measured by delta-AUC and integrated discrimination index) when added to models that included demographic and geographic information. FTO- and MC4R-linked SNPs, and a non-genetic risk assessment consisting of a socioeconomic index (p<0.01 for all comparisons). The GRS also predicted increased mortality risk over 17 years of follow-up. The GRS performed less well among African Americans.
Conclusions
The obesity GRS derived using our 3-stage approach is not useful for clinical risk prediction, but may have value as a tool for etiological and treatment research.
doi:10.1080/19485565.2013.774628
PMCID: PMC3671353  PMID: 23701538
18.  A variable selection method for genome-wide association studies 
Bioinformatics  2010;27(1):1-8.
Motivation: Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
Results: We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
Availability: The software implementing GWASelect is available at http://www.bios.unc.edu/~lin.
Access to WTCCC data: http://www.wtccc.org.uk/
Contact: lin@bios.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics Online.
doi:10.1093/bioinformatics/btq600
PMCID: PMC3025714  PMID: 21036813
19.  A variable selection method for genome-wide association studies 
Biometrics  2011;27(1):1-8.
Motivation
Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
Results
We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false-positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium (LD) patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
doi:10.1093/bioinformatics/btq600
PMCID: PMC3025714  PMID: 21036813
20.  Re-Ranking Sequencing Variants in the Post-GWAS Era for Accurate Causal Variant Identification 
PLoS Genetics  2013;9(8):e1003609.
Next generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website.
Author Summary
As next-generation sequencing (NGS) costs continue to fall and genome-wide association study (GWAS) platform coverage improves, the human genetics community is positioned to identify potentially causal variants. However, current NGS or imputation-based studies of either the whole genome or regions previously identified by GWAS have not yet been very successful in identifying causal variants. A major hurdle is the development of methods to distinguish disease-causing variants from their highly-correlated proxies within an associated region. We show that various common factors, such as differential sequencing or imputation accuracy rates and linkage disequilibrium patterns, with or without GWAS-informed region selection, can substantially decrease the probability of identifying the correct causal SNP, often by more than half. We then describe a novel and easy-to-implement re-ranking procedure that can double the probability that the causal SNP is top-ranked in many settings. Application to the NCI Breast and Prostate Cancer (BPC3) Cohort Consortium aggressive prostate cancer data identified new top SNPs within two associated loci previously established via GWAS, as well as several additional possible causal SNPs that had been previously overlooked.
doi:10.1371/journal.pgen.1003609
PMCID: PMC3738448  PMID: 23950724
21.  Genetic variants associated with idiopathic pulmonary fibrosis susceptibility and mortality: a genome-wide association study 
The lancet. Respiratory medicine  2013;1(4):309-317.
Summary
Background
Idiopathic pulmonary fibrosis (IPF) is a devastating disease that probably involves several genetic loci. Several rare genetic variants and one common single nucleotide polymorphism (SNP) of MUC5B have been associated with the disease. Our aim was to identify additional common variants associated with susceptibility and ultimately mortality in IPF.
Methods
First, we did a three-stage genome-wide association study (GWAS): stage one was a discovery GWAS; and stages two and three were independent case-control studies. DNA samples from European-American patients with IPF meeting standard criteria were obtained from several US centres for each stage. Data for European-American control individuals for stage one were gathered from the database of genotypes and phenotypes; additional control individuals were recruited at the University of Pittsburgh to increase the number. For controls in stages two and three, we gathered data for additional sex-matched European-American control individuals who had been recruited in another study. DNA samples from patients and from control individuals were genotyped to identify SNPs associated with IPF. SNPs identified in stage one were carried forward to stage two, and those that achieved genome-wide significance (p<5 × 10−8) in a meta-analysis were carried forward to stage three. Three case series with follow-up data were selected from stages one and two of the GWAS using samples with follow-up data. Mortality analyses were done in these case series to assess the SNPs associated with IPF that had achieved genome-wide significance in the meta-analysis of stages one and two. Finally, we obtained gene-expression profiling data for lungs of patients with IPF from the Lung Genomics Research Consortium and analysed correlation with SNP genotypes.
Findings
In stage one of the GWAS (542 patients with IPF, 542 control individuals matched one-by-one to cases by genetic ancestry estimates), we identified 20 loci. Six SNPs reached genome-wide significance in stage two (544 patients, 687 control individuals): three TOLLIP SNPs (rs111521887, rs5743894, rs5743890) and one MUC5B SNP (rs35705950) at 11p15.5; one MDGA2 SNP (rs7144383) at 14q21.3; and one SPPL2C SNP (rs17690703) at 17q21.31. Stage three (324 patients, 702 control individuals) confirmed the associations for all these SNPs, except for rs7144383. Linkage disequilibrium between the MUC5B SNP (rs35705950) and TOLLIP SNPs (rs111521887 [r2=0.07], rs5743894 [r2=0.16], and rs5743890 [r2=0.01]) was low. 683 patients from the GWAS were included in the mortality analysis. Individuals who developed IPF despite having the protective TOLLIP minor allele of rs5743890 carried an increased mortality risk (meta-analysis with fixed-effect model: hazard ratio 1.72 [95% CI 1.24–2.38]; p=0.0012). TOLLIP expression was decreased by 20% in individuals carrying the minor allele of rs5743890 (p=0.097), 40% in those with the minor allele of rs111521887 (p=3.0 × 10−4), and 50% in those with the minor allele of rs5743894 (p=2.93 × 10−5) compared with homozygous carriers of common alleles for these SNPs.
Interpretation
Novel variants in TOLLIP and SPPL2C are associated with IPF susceptibility. One novel variant of TOLLIP, rs5743890, is also associated with mortality. These associations and the reduced expression of TOLLIP in patients with IPF who carry TOLLIP SNPs emphasise the importance of this gene in the disease.
Funding
National Institutes of Health; National Heart, Lung, and Blood Institute; Pulmonary Fibrosis Foundation; Coalition for Pulmonary Fibrosis; and Instituto de Salud Carlos III.
doi:10.1016/S2213-2600(13)70045-6
PMCID: PMC3894577  PMID: 24429156
22.  Genome-Wide Association Study SNPs in the Human Genome Diversity Project Populations: Does Selection Affect Unlinked SNPs with Shared Trait Associations? 
PLoS Genetics  2011;7(1):e1001266.
Genome-wide association studies (GWAS) have identified more than 2,000 trait-SNP associations, and the number continues to increase. GWAS have focused on traits with potential consequences for human fitness, including many immunological, metabolic, cardiovascular, and behavioral phenotypes. Given the polygenic nature of complex traits, selection may exert its influence on them by altering allele frequencies at many associated loci, a possibility which has yet to be explored empirically. Here we use 38 different measures of allele frequency variation and 8 iHS scores to characterize over 1,300 GWAS SNPs in 53 globally distributed human populations. We apply these same techniques to evaluate SNPs grouped by trait association. We find that groups of SNPs associated with pigmentation, blood pressure, infectious disease, and autoimmune disease traits exhibit unusual allele frequency patterns and elevated iHS scores in certain geographical locations. We also find that GWAS SNPs have generally elevated scores for measures of allele frequency variation and for iHS in Eurasia and East Asia. Overall, we believe that our results provide evidence for selection on several complex traits that has caused changes in allele frequencies and/or elevated iHS scores at a number of associated loci. Since GWAS SNPs collectively exhibit elevated allele frequency measures and iHS scores, selection on complex traits may be quite widespread. Our findings are most consistent with this selection being either positive or negative, although the relative contributions of the two are difficult to discern. Our results also suggest that trait-SNP associations identified in Eurasian samples may not be present in Africa, Oceania, and the Americas, possibly due to differences in linkage disequilibrium patterns. This observation suggests that non-Eurasian and non-East Asian sample populations should be included in future GWAS.
Author Summary
Natural selection exerts its influence by changing allele frequencies at genomic polymorphisms. Alleles associated with harmful traits decrease in frequency while those associated with beneficial traits become more common. In a simple case, selection acts on a trait controlled by a single polymorphism; a large change in allele frequency at this polymorphism can eliminate a deleterious phenotype from a population or fix a beneficial one. However, many phenotypes, including diseases like Type 2 Diabetes, Crohn's disease, and prostate cancer, and physiological traits like height, weight, and hair color, are controlled by multiple genomic loci. Selection may act on such traits by influencing allele frequencies at a single associated polymorphism or by altering allele frequencies at many associated polymorphisms. To search for cases of the latter, we assembled groups of genomic polymorphisms sharing a common trait association and examined their allele frequencies across 53 globally distributed populations looking for commonalities in allelic behavior across geographical space. We find that variants associated with blood pressure tend to correlate with latitude, while those associated with HIV/AIDS progression correlate well with longitude. We also find evidence that selection may be acting worldwide to increase the frequencies of alleles that elevate autoimmune disease risk.
doi:10.1371/journal.pgen.1001266
PMCID: PMC3017115  PMID: 21253569
23.  Integrating Functional Data to Prioritize Causal Variants in Statistical Fine-Mapping Studies 
PLoS Genetics  2014;10(10):e1004722.
Standard statistical approaches for prioritization of variants for functional testing in fine-mapping studies either use marginal association statistics or estimate posterior probabilities for variants to be causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms the current state-of-the-art fine-mapping methods, reducing the number of variants that need to be selected to capture 90% of the causal variants from an average of 13.3 to 10.4 SNPs per locus (as compared to the next-best performing strategy). Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulation data. We validate our findings using a large scale meta-analysis of four blood lipids traits and find that the relative probability for causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in this data.
Author Summary
Genome-wide association studies (GWAS) have successfully identified numerous regions in the genome that harbor genetic variants that increase risk for various complex traits and diseases. However, it is generally the case that GWAS risk variants are not themselves causally affecting the trait, but rather, are correlated to the true causal variant through linkage disequilibrium (LD). Plausible causal variants are identified in fine-mapping studies through targeted sequencing followed by prioritization of variants for functional validation. In this work, we propose methods that leverage two sources of independent information, the association strength and genomic functional location, to prioritize causal variants. We demonstrate in simulations and empirical data that our approach reduces the number of SNPs that need to be selected for follow-up to identify the true causal variants at GWAS risk loci.
doi:10.1371/journal.pgen.1004722
PMCID: PMC4214605  PMID: 25357204
24.  e-GRASP: an integrated evolutionary and GRASP resource for exploring disease associations 
BMC Genomics  2016;17(Suppl 9):770.
Background
Genome-wide association studies (GWAS) have become a mainstay of biological research concerned with discovering genetic variation linked to phenotypic traits and diseases. Both discrete and continuous traits can be analyzed in GWAS to discover associations between single nucleotide polymorphisms (SNPs) and traits of interest. Associations are typically determined by estimating the significance of the statistical relationship between genetic loci and the given trait. However, the prioritization of bona fide, reproducible genetic associations from GWAS results remains a central challenge in identifying genomic loci underlying common complex diseases. Evolutionary-aware meta-analysis of the growing GWAS literature is one way to address this challenge and to advance from association to causation in the discovery of genotype-phenotype relationships.
Description
We have created an evolutionary GWAS resource to enable in-depth query and exploration of published GWAS results. This resource uses the publically available GWAS results annotated in the GRASP2 database. The GRASP2 database includes results from 2082 studies, 177 broad phenotype categories, and ~8.87 million SNP-phenotype associations. For each SNP in e-GRASP, we present information from the GRASP2 database for convenience as well as evolutionary information (e.g., rate and timespan). Users can, therefore, identify not only SNPs with highly significant phenotype-association P-values, but also SNPs that are highly replicated and/or occur at evolutionarily conserved sites that are likely to be functionally important. Additionally, we provide an evolutionary-adjusted SNP association ranking (E-rank) that uses cross-species evolutionary conservation scores and population allele frequencies to transform P-values in an effort to enhance the discovery of SNPs with a greater probability of biologically meaningful disease associations.
Conclusion
By adding an evolutionary dimension to the GWAS results available in the GRASP2 database, our e-GRASP resource will enable a more effective exploration of SNPs not only by the statistical significance of trait associations, but also by the number of studies in which associations have been replicated, and the evolutionary context of the associated mutations. Therefore, e-GRASP will be a valuable resource for aiding researchers in the identification of bona fide, reproducible genetic associations from GWAS results. This resource is freely available at http://www.mypeg.info/egrasp.
doi:10.1186/s12864-016-3088-1
PMCID: PMC5073857  PMID: 27766955
Polymorphism; SNP; GWAS; GRASP; Conservation; Disease; Phenotype
25.  Systematic permutation testing in GWAS pathway analyses: identification of genetic networks in dilated cardiomyopathy and ulcerative colitis 
BMC Genomics  2014;15:622.
Background
Genome wide association studies (GWAS) are applied to identify genetic loci, which are associated with complex traits and human diseases. Analogous to the evolution of gene expression analyses, pathway analyses have emerged as important tools to uncover functional networks of genome-wide association data. Usually, pathway analyses combine statistical methods with a priori available biological knowledge. To determine significance thresholds for associated pathways, correction for multiple testing and over-representation permutation testing is applied.
Results
We systematically investigated the impact of three different permutation test approaches for over-representation analysis to detect false positive pathway candidates and evaluate them on genome-wide association data of Dilated Cardiomyopathy (DCM) and Ulcerative Colitis (UC). Our results provide evidence that the gold standard - permuting the case–control status – effectively improves specificity of GWAS pathway analysis. Although permutation of SNPs does not maintain linkage disequilibrium (LD), these permutations represent an alternative for GWAS data when case–control permutations are not possible. Gene permutations, however, did not add significantly to the specificity. Finally, we provide estimates on the required number of permutations for the investigated approaches.
Conclusions
To discover potential false positive functional pathway candidates and to support the results from standard statistical tests such as the Hypergeometric test, permutation tests of case control data should be carried out. The most reasonable alternative was case–control permutation, if this is not possible, SNP permutations may be carried out. Our study also demonstrates that significance values converge rapidly with an increasing number of permutations. By applying the described statistical framework we were able to discover axon guidance, focal adhesion and calcium signaling as important DCM-related pathways and Intestinal immune network for IgA production as most significant UC pathway.
doi:10.1186/1471-2164-15-622
PMCID: PMC4223581  PMID: 25052024
DCM; UC; GWAS; Permutation tests; Pathway analysis

Results 1-25 (2022220)