Related Articles

1.  Systematic biological prioritization after a genome-wide association study 
Bioinformatics (Oxford, England)  2008;24(16):1805-1811.
Motivation
A challenging problem after a genome-wide association study (GWAS) is to balance the statistical evidence of genotype-phenotype correlation with a priori evidence of biological relevance.
Results
We introduce a method for systematically prioritizing single nucleotide polymorphisms (SNPs) for further study after a GWAS. The method combines evidence across multiple domains, including statistical evidence of genotype-phenotype correlation, known pathways in the pathologic development of disease, SNP/gene functional properties, comparative genomics, prior evidence of genetic linkage, and linkage disequilibrium. We apply this method to a GWAS of nicotine dependence, and use simulated data to test it on several commercial SNP microarrays.
doi:10.1093/bioinformatics/btn315
PMCID: PMC2610477  PMID: 18565990
2.  SPOT: a web-based tool for using biological databases to prioritize SNPs after a genome-wide association study 
Nucleic Acids Research  2010;38(Web Server issue):W201-W209.
SPOT (http://spot.cgsmd.isi.edu), the SNP prioritization online tool, is a web site for integrating biological databases into the prioritization of single nucleotide polymorphisms (SNPs) for further study after a genome-wide association study (GWAS). Typically, the next step after a GWAS is to genotype the top signals in an independent replication sample. Investigators will often incorporate information from biological databases so that biologically relevant SNPs, such as those in genes related to the phenotype or with potentially non-neutral effects on gene expression such as splice sites, are given higher priority. We recently introduced the genomic information network (GIN) method for systematically implementing this kind of strategy. The SPOT web site allows users to upload a list of SNPs and GWAS P-values and returns a prioritized list of SNPs using the GIN method. Users can specify candidate genes or genomic regions with custom levels of prioritization. The results can be downloaded or viewed in the browser where users can interactively explore the details of each SNP, including graphical representations of the GIN method. For investigators interested in incorporating biological databases into a post-GWAS SNP selection strategy, the SPOT web tool is an easily implemented and flexible solution.
doi:10.1093/nar/gkq513
PMCID: PMC2896195  PMID: 20529875
3.  Convergence of genetic influences in comorbidity 
BMC Bioinformatics  2012;13(Suppl 2):S8.
Background
Predisposition to complex diseases is explained in part by genetic variation, and complex diseases are frequently comorbid, consistent with pleiotropic genetic variation influencing comorbidity. Genome Wide Association (GWA) studies typically assess association between SNPs and a single-disease phenotype. Fisher meta-analysis combines evidence of association from single-disease GWA studies, assuming that each study is an independent test of the same hypothesis. The Rank Product (RP) method overcomes limitations posed by Fisher assumptions, though RP was not designed for GWA data.
Methods
We modified RP to accommodate GWA data, and we call it modRP. Using p-values output from GWA studies, we aggregate evidence for association between SNPs and related phenotypes. To assess significance, RP randomly samples the observed ranks to develop the null distribution of the RP statistic, and then places the observed RPs into the null distribution. ModRP eliminates the effect of linkage disequilibrium and controls for differences in power at tested SNPs, to meet RP assumptions in application to GWA data.
Results
After validating modRP based on both positive and negative control studies, we searched for pleiotropic influences on comorbid substance use disorders in a novel study, and found two SNPs to be significantly associated with comorbid cocaine, opium, and nicotine dependence. Placing these SNPs into biological context, we developed a protein network modeling the interaction of cocaine, nicotine, and opium with these variants.
Conclusions
ModRP is a novel approach to identifying pleiotropic genetic influences on comorbid complex diseases. It can be used to assess association for related phenotypes where raw data is unavailable or inappropriate for analysis using other approaches. The method is conceptually simple and produces statistically significant, biologically relevant results.
doi:10.1186/1471-2105-13-S2-S8
PMCID: PMC3375629  PMID: 22536871
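The rank-product idea summarized in the abstract above (rank SNPs within each study, combine the ranks, then compare against ranks drawn at random) can be illustrated with a minimal sketch. This is the plain RP statistic with a permutation null, not the authors' modRP: it applies no linkage disequilibrium pruning and no power adjustment, and the function names `rank_products` and `permutation_pvalues` are hypothetical.

```python
import bisect
import math
import random


def _geo_mean(values):
    """Geometric mean of positive numbers, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def rank_products(pvalue_lists):
    """Rank product per SNP; rank 1 is the smallest p-value in a study.

    pvalue_lists holds one list of p-values per study, all in the same
    SNP order. Ties are broken by input order (a simplification).
    """
    n_snps = len(pvalue_lists[0])
    ranks_per_study = []
    for pvals in pvalue_lists:
        order = sorted(range(n_snps), key=lambda i: pvals[i])
        ranks = [0] * n_snps
        for rank, idx in enumerate(order, start=1):
            ranks[idx] = rank
        ranks_per_study.append(ranks)
    return [_geo_mean([r[i] for r in ranks_per_study]) for i in range(n_snps)]


def permutation_pvalues(pvalue_lists, n_perm=1000, seed=0):
    """Empirical p-value per SNP from a null built by shuffling ranks."""
    rng = random.Random(seed)
    observed = rank_products(pvalue_lists)
    n_snps, n_studies = len(observed), len(pvalue_lists)
    null = []
    for _ in range(n_perm):
        # Each study gets an independent random permutation of the ranks.
        shuffled = [rng.sample(range(1, n_snps + 1), n_snps)
                    for _ in range(n_studies)]
        null.extend(_geo_mean([s[i] for s in shuffled]) for i in range(n_snps))
    null.sort()
    total = len(null)
    # Small rank products are extreme; add-one smoothing avoids p = 0.
    return [(bisect.bisect_right(null, rp) + 1) / (total + 1) for rp in observed]
```

A SNP ranked first in every study has a rank product of 1, the smallest value the null distribution can produce, so it receives the smallest empirical p-value.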
4.  SNPranker 2.0: a gene-centric data mining tool for diseases associated SNP prioritization in GWAS 
BMC Bioinformatics  2013;14(Suppl 1):S9.
Background
The capability of correlating specific genotypes with human diseases is a complex issue in spite of all advantages arisen from high-throughput technologies, such as Genome Wide Association Studies (GWAS). New tools for genetic variants interpretation and for Single Nucleotide Polymorphisms (SNPs) prioritization are actually needed. Given a list of the most relevant SNPs statistically associated to a specific pathology as result of a genotype study, a critical issue is the identification of genes that are effectively related to the disease by re-scoring the importance of the identified genetic variations. Vice versa, given a list of genes, it can be of great importance to predict which SNPs can be involved in the onset of a particular disease, in order to focus the research on their effects.
Results
We propose a new bioinformatics approach to support biological data mining in the analysis and interpretation of SNPs associated to pathologies. This system can be employed to design custom genotyping chips for disease-oriented studies and to re-score GWAS results. The proposed method relies (1) on the data integration of public resources using a gene-centric database design, (2) on the evaluation of a set of static biomolecular annotations, defined as features, and (3) on the SNP scoring function, which computes SNP scores using parameters and weights set by users. We employed a machine learning classifier to set default feature weights and an ontological annotation layer to enable the enrichment of the input gene set. We implemented our method as a web tool called SNPranker 2.0 (http://www.itb.cnr.it/snpranker), improving our first published release of this system. A user-friendly interface allows the input of a list of genes, SNPs or a biological process, and to customize the features set with relative weights. As result, SNPranker 2.0 returns a list of SNPs, localized within input and ontologically enriched genes, combined with their prioritization scores.
Conclusions
Different databases and resources are already available for SNPs annotation, but they do not prioritize or re-score SNPs relying on a-priori biomolecular knowledge. SNPranker 2.0 attempts to fill this gap through a user-friendly integrated web resource. End users, such as researchers in medical genetics and epidemiology, may find in SNPranker 2.0 a new tool for data mining and interpretation able to support SNPs analysis. Possible scenarios are GWAS data re-scoring, SNPs selection for custom genotyping arrays and SNPs/diseases association studies.
doi:10.1186/1471-2105-14-S1-S9
PMCID: PMC3548692  PMID: 23369106
5.  In search of causal variants: refining disease association signals using cross-population contrasts 
BMC Genetics  2008;9:58.
Background
Genome-wide association (GWA) using large numbers of single nucleotide polymorphisms (SNPs) is now a powerful, state-of-the-art approach to mapping human disease genes. When a GWA study detects association between a SNP and the disease, this signal usually represents association with a set of several highly correlated SNPs in strong linkage disequilibrium. The challenge we address is to distinguish among these correlated loci to highlight potential functional variants and prioritize them for follow-up.
Results
We implemented a systematic method for testing association across diverse population samples having differing histories and LD patterns, using a logistic regression framework. The hypothesis is that important underlying biological mechanisms are shared across human populations, and we can filter correlated variants by testing for heterogeneity of genetic effects in different population samples. This approach formalizes the descriptive comparison of p-values that has typified similar cross-population fine-mapping studies to date. We applied this method to correlated SNPs in the cholinergic nicotinic receptor gene cluster CHRNA5-CHRNA3-CHRNB4, in a case-control study of cocaine dependence composed of 504 European-American and 583 African-American samples. Of the 10 SNPs genotyped in the r2 ≥ 0.8 bin for rs16969968, three demonstrated significant cross-population heterogeneity and are filtered from priority follow-up; the remaining SNPs include rs16969968 (heterogeneity p = 0.75). Though the power to filter out rs16969968 is reduced due to the difference in allele frequency in the two groups, the results nevertheless focus attention on a smaller group of SNPs that includes the non-synonymous SNP rs16969968, which retains a similar effect size (odds ratio) across both population samples.
Conclusion
Filtering out SNPs that demonstrate cross-population heterogeneity enriches for variants more likely to be important and causative. Our approach provides an important and effective tool to help interpret results from the many GWA studies now underway.
doi:10.1186/1471-2156-9-58
PMCID: PMC2556340  PMID: 18759969
6.  BioQ: tracing experimental origins in public genomic databases using a novel data provenance model 
Bioinformatics  2012;28(8):1189-1191.
Motivation: Public genomic databases, which are often used to guide genetic studies of human disease, are now being applied to genomic medicine through in silico integrative genomics. These databases, however, often lack tools for systematically determining the experimental origins of the data.
Results: We introduce a new data provenance model that we have implemented in a public web application, BioQ, for assessing the reliability of the data by systematically tracing its experimental origins to the original subjects and biologics. BioQ allows investigators to both visualize data provenance as well as explore individual elements of experimental process flow using precise tools for detailed data exploration and documentation. It includes a number of human genetic variation databases such as the HapMap and 1000 Genomes projects.
Availability and implementation: BioQ is freely available to the public at http://bioq.saclab.net
Contact: ssaccone@wustl.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts117
PMCID: PMC3324523  PMID: 22426342
7.  GPA: A Statistical Approach to Prioritizing GWAS Results by Integrating Pleiotropy and Annotation 
PLoS Genetics  2014;10(11):e1004787.
Results from Genome-Wide Association Studies (GWAS) have shown that complex diseases are often affected by many genetic variants with small or moderate effects. Identification of these risk variants remains a very challenging problem. There is a need to develop more powerful statistical methods to leverage available information to improve upon traditional approaches that focus on a single GWAS dataset without incorporating additional data. In this paper, we propose a novel statistical approach, GPA (Genetic analysis incorporating Pleiotropy and Annotation), to increase statistical power to identify risk variants through joint analysis of multiple GWAS data sets and annotation information because: (1) accumulating evidence suggests that different complex diseases share common risk bases, i.e., pleiotropy; and (2) functionally annotated variants have been consistently demonstrated to be enriched among GWAS hits. GPA can integrate multiple GWAS datasets and functional annotations to seek association signals, and it can also perform hypothesis testing to test the presence of pleiotropy and enrichment of functional annotation. Statistical inference of the model parameters and SNP ranking is achieved through an EM algorithm that can handle genome-wide markers efficiently. When we applied GPA to jointly analyze five psychiatric disorders with annotation information, not only did GPA identify many weak signals missed by the traditional single phenotype analysis, but it also revealed relationships in the genetic architecture of these disorders. Using our hypothesis testing framework, statistically significant pleiotropic effects were detected among these psychiatric disorders, and the markers annotated in the central nervous system genes and eQTLs from the Genotype-Tissue Expression (GTEx) database were significantly enriched. We also applied GPA to a bladder cancer GWAS data set with the ENCODE DNase-seq data from 125 cell lines. GPA was able to detect cell lines that are biologically more relevant to bladder cancer. The R implementation of GPA is currently available at http://dongjunchung.github.io/GPA/.
Author Summary
In the past 10 years, many genome wide association studies (GWAS) have been conducted to identify the genetic bases of complex human traits. As of January, 2014, more than 12,000 single-nucleotide polymorphisms (SNPs) have been reported to be significantly associated with at least one complex trait/disease. On one hand, about 85% of identified risk variants are located in non-coding regions, which motivates a systematic understanding of the function of non-coding variants in regulatory elements in the human genome. On the other hand, complex diseases are often affected by many genetic variants with small or moderate effects. To address these issues, we propose a statistical approach, GPA, to integrating information from multiple GWAS datasets and functional annotation. Notably, our approach only requires marker-wise p-values as input, making it especially useful when only summary statistics, instead of the full genotype and phenotype data, are available. We applied GPA to analyze GWAS datasets of five psychiatric disorders and bladder cancer, where the central nervous system genes, eQTLs from the Genotype-Tissue Expression (GTEx), and the ENCODE DNase-seq data from 125 cell lines were used as functional annotation. The analysis results suggest that GPA is an effective method for integrative data analysis in the post-GWAS era.
doi:10.1371/journal.pgen.1004787
PMCID: PMC4230845  PMID: 25393678
8.  SNP-SNP Interactions Discovered by Logic Regression Explain Crohn's Disease Genetics 
PLoS ONE  2012;7(10):e43035.
In genome-wide association studies (GWAS), the association between each single nucleotide polymorphism (SNP) and a phenotype is assessed statistically. To further explore genetic associations in GWAS, we considered two specific forms of biologically plausible SNP-SNP interactions, ‘SNP intersection’ and ‘SNP union,’ and analyzed the Crohn's Disease (CD) GWAS data of the Wellcome Trust Case Control Consortium for these interactions using a limited form of logic regression. We found strong evidence of CD-association for 195 genes, identifying novel susceptibility genes (e.g., ISX, SLCO6A1, TMEM183A) as well as confirming many previously identified susceptibility genes in CD GWAS (e.g., IL23R, NOD2, CYLD, NKX2-3, IL12RB2, ATG16L1). Notably, 37 of the 59 chromosomal locations indicated for CD-association by a meta-analysis of CD GWAS, involving over 22,000 cases and 29,000 controls, were represented in the 195 genes, as well as some chromosomal locations previously indicated only in linkage studies, but not in GWAS. We repeated the analysis with two smaller GWASs from the Database of Genotype and Phenotype (dbGaP): in spite of differences of populations and study power across the three datasets, we observed some consistencies across the three datasets. Notable examples included TMEM183A and SLCO6A1 which exhibited strong evidence consistently in our WTCCC and both of the dbGaP SNP-SNP interaction analyses. Examining these specific forms of SNP interactions could identify additional genetic associations from GWAS. R codes, data examples, and a ReadMe file are available for download from our website: http://www.ualberta.ca/~yyasui/homepage.html.
doi:10.1371/journal.pone.0043035
PMCID: PMC3470545  PMID: 23071489
9.  regSNPs: a strategy for prioritizing regulatory single nucleotide substitutions 
Bioinformatics  2012;28(14):1879-1886.
Motivation: One of the fundamental questions in genetic studies is how to identify functional DNA variants that are responsible for a disease or phenotype of interest. Results from large-scale genetics studies, such as genome-wide association studies (GWAS), and the availability of high-throughput sequencing technologies provide opportunities for identifying causal variants. Despite the technical advances, informatics methodologies need to be developed to prioritize thousands of variants for potential causative effects.
Results: We present regSNPs, an informatics strategy that integrates several established bioinformatics tools, for prioritizing regulatory SNPs, i.e. the SNPs in the promoter regions that potentially affect phenotype through changing transcription of downstream genes. Compared with existing tools, regSNPs has two distinct features. It considers degenerative features of binding motifs by calculating the differences in binding affinity caused by the candidate variants, and it integrates potential phenotypic effects of various transcription factors. When tested using the disease-causing variants documented in the Human Gene Mutation Database, regSNPs showed mixed performance on various diseases. regSNPs predicted three SNPs that can potentially affect bone density in a region detected in an earlier linkage study. Potential effects of one of the variants were validated using a luciferase reporter assay.
Contact: yunliu@iupui.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts275
PMCID: PMC3389767  PMID: 22611130
10.  Candidate Causal Regulatory Effects by Integration of Expression QTLs with Complex Trait Genetic Associations 
PLoS Genetics  2010;6(4):e1000895.
The recent success of genome-wide association studies (GWAS) is now followed by the challenge to determine how the reported susceptibility variants mediate complex traits and diseases. Expression quantitative trait loci (eQTLs) have been implicated in disease associations through overlaps between eQTLs and GWAS signals. However, the abundance of eQTLs and the strong correlation structure (LD) in the genome make it likely that some of these overlaps are coincidental and not driven by the same functional variants. In the present study, we propose an empirical methodology, which we call Regulatory Trait Concordance (RTC) that accounts for local LD structure and integrates eQTLs and GWAS results in order to reveal the subset of association signals that are due to cis eQTLs. We simulate genomic regions of various LD patterns with both a single or two causal variants and show that our score outperforms SNP correlation metrics, be they statistical (r2) or historical (D'). Following the observation of a significant abundance of regulatory signals among currently published GWAS loci, we apply our method with the goal to prioritize relevant genes for each of the respective complex traits. We detect several potential disease-causing regulatory effects, with a strong enrichment for immunity-related conditions, consistent with the nature of the cell line tested (LCLs). Furthermore, we present an extension of the method in trans, where interrogating the whole genome for downstream effects of the disease variant can be informative regarding its unknown primary biological effect. We conclude that integrating cellular phenotype associations with organismal complex traits will facilitate the biological interpretation of the genetic effects on these traits.
Author Summary
Genome-wide association studies have led to the identification of susceptibility loci for a variety of human complex traits. What is still largely missing, however, is the understanding of the biological context in which these candidate variants act and of how they determine each trait. Given the localization of many GWAS loci outside coding regions and the important role of regulatory variation in shaping phenotypic variance, gene expression has been proposed as a plausible informative intermediate phenotype. Here we show that for a subset of the currently published GWAS this is indeed the case, by observing a significant excess of regulatory variants among disease loci. We propose an empirical methodology (regulatory trait concordance—RTC) able to integrate expression and disease data in order to detect causal regulatory effects. We show that the RTC outperforms simple correlation metrics under various simulated linkage disequilibrium (LD) scenarios. Our method is able to recover previously suspected causal regulatory effects from the literature and, as expected given the nature of the tested tissue, an overrepresentation of immunity-related candidates is observed. As the number of available tissues will increase, this prioritization approach will become even more useful in understanding the implication of regulatory variants in disease etiology.
doi:10.1371/journal.pgen.1000895
PMCID: PMC2848550  PMID: 20369022
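The two SNP correlation metrics that the RTC abstract above uses as baselines, r2 and D', have standard definitions in terms of haplotype and allele frequencies. The sketch below computes both for a pair of biallelic loci; it does not implement the RTC score itself, and the function name `ld_metrics` and its argument names are hypothetical.

```python
def ld_metrics(p_ab, p_a, p_b):
    """Return (r2, |D'|) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    # D is the deviation of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # D' rescales D by the largest value it could take given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return r2, abs(d) / d_max
```

For example, `ld_metrics(0.2, 0.5, 0.2)` gives D' = 1 but r2 = 0.25: allele B occurs only on haplotypes carrying A, yet the differing allele frequencies keep the correlation low. This kind of disagreement is one reason a single LD metric can be an imperfect guide to whether two signals share a causal variant.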
11.  SCAN: SNP and copy number annotation 
Bioinformatics  2009;26(2):259-262.
Motivation: Genome-wide association studies (GWAS) generate relationships between hundreds of thousands of single nucleotide polymorphisms (SNPs) and complex phenotypes. The contribution of the traditionally overlooked copy number variations (CNVs) to complex traits is also being actively studied. To facilitate the interpretation of the data and the designing of follow-up experimental validations, we have developed a database that enables the sensible prioritization of these variants by combining several approaches, involving not only publicly available physical and functional annotations but also multilocus linkage disequilibrium (LD) annotations as well as annotations of expression quantitative trait loci (eQTLs).
Results: For each SNP, the SCAN database provides: (i) summary information from eQTL mapping of HapMap SNPs to gene expression (evaluated by the Affymetrix exon array) in the full set of HapMap CEU (Caucasians from UT, USA) and YRI (Yoruba people from Ibadan, Nigeria) samples; (ii) LD information, in the case of a HapMap SNP, including what genes have variation in strong LD (pairwise or multilocus LD) with the variant and how well the SNP is covered by different high-throughput platforms; (iii) summary information available from public databases (e.g. physical and functional annotations); and (iv) summary information from other GWAS. For each gene, SCAN provides annotations on: (i) eQTLs for the gene (both local and distant SNPs) and (ii) the coverage of all variants in the HapMap at that gene on each high-throughput platform. For each genomic region, SCAN provides annotations on: (i) physical and functional annotations of all SNPs, genes and known CNVs within the region and (ii) all genes regulated by the eQTLs within the region.
Availability: http://www.scandb.org
Contact: ncox@medicine.bsd.uchicago.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp644
PMCID: PMC2852202  PMID: 19933162
12.  Prioritization of SNPs for Genome-Wide Association Studies Using an Interaction Model of Genetic Variation, Gene Expression, and Trait Variation 
Molecules and Cells  2012;33(4):351-361.
The identification of true causal loci to unravel the statistical evidence of genotype-phenotype correlations and the biological relevance of selected single-nucleotide polymorphisms (SNPs) is a challenging issue in genome-wide association studies (GWAS). Here, we introduced a novel method for the prioritization of SNPs based on p-values from GWAS. The method uses functional evidence from populations, including phenotype-associated gene expressions. Based on the concept of genetic interactions, such as perturbation of gene expression by genetic variation, phenotype and gene expression related SNPs were prioritized by adjusting the p-values of SNPs. We applied our method to GWAS data related to drug-induced cytotoxicity. Then, we prioritized loci that potentially play a role in drug-induced cytotoxicity. By generating an interaction model, our approach allowed us not only to identify causal loci, but also to find intermediate nodes that regulate the flow of information among causal loci, perturbed gene expression, and resulting phenotypic variation.
doi:10.1007/s10059-012-2264-7
PMCID: PMC3887803  PMID: 22460606
genome-wide association study; interaction network; prioritization; SNP
13.  Detection of quantitative trait loci in Bos indicus and Bos taurus cattle using genome-wide association studies 
Background
The apparent effect of a single nucleotide polymorphism (SNP) on phenotype depends on the linkage disequilibrium (LD) between the SNP and a quantitative trait locus (QTL). However, the phase of LD between a SNP and a QTL may differ between Bos indicus and Bos taurus because they diverged at least one hundred thousand years ago. Here, we test the hypothesis that the apparent effect of a SNP on a quantitative trait depends on whether the SNP allele is inherited from a Bos taurus or Bos indicus ancestor.
Methods
Phenotype data on one or more traits and SNP genotype data for 10 181 cattle from Bos taurus, Bos indicus and composite breeds were used. All animals had genotypes for 729 068 SNPs (real or imputed). Chromosome segments were classified as originating from B. indicus or B. taurus on the basis of the haplotype of SNP alleles they contained. Consequently, SNP alleles were classified according to their sub-species origin. Three models were used for the association study: (1) conventional GWAS (genome-wide association study), fitting a single SNP effect regardless of subspecies origin, (2) interaction GWAS, fitting an interaction between SNP and subspecies-origin, and (3) best variable GWAS, fitting the most significant combination of SNP and sub-species origin.
Results
Fitting an interaction between SNP and subspecies origin resulted in more significant SNPs (i.e. more power) than a conventional GWAS. Thus, the effect of a SNP depends on the subspecies that the allele originates from. Also, most QTL segregated in only one subspecies, suggesting that many mutations that affect the traits studied occurred after divergence of the subspecies or the mutation became fixed or was lost in one of the subspecies.
Conclusions
The results imply that GWAS and genomic selection could gain power by distinguishing SNP alleles based on their subspecies origin, and that only few QTL segregate in both B. indicus and B. taurus cattle. Thus, the QTL that segregate in current populations likely resulted from mutations that occurred in one of the subspecies and can have both positive and negative effects on the traits. There was no evidence that selection has increased the frequency of alleles that increase body weight.
doi:10.1186/1297-9686-45-43
PMCID: PMC4176739  PMID: 24168700
14.  Evaluation of genome-wide association study results through development of ontology fingerprints 
Bioinformatics  2009;25(10):1314-1320.
Motivation: Genome-wide association (GWA) studies may identify multiple variants that are associated with a disease or trait. To narrow down candidates for further validation, quantitatively assessing how identified genes relate to a phenotype of interest is important.
Results: We describe an approach to characterize genes or biological concepts (phenotypes, pathways, diseases, etc.) by ontology fingerprint—the set of Gene Ontology (GO) terms that are overrepresented among the PubMed abstracts discussing the gene or biological concept together with the enrichment p-value of these terms generated from a hypergeometric enrichment test. We then quantify the relevance of genes to the trait from a GWA study by calculating similarity scores between their ontology fingerprints using enrichment p-values. We validate this approach by correctly identifying corresponding genes for biological pathways with a 90% average area under the ROC curve (AUC). We applied this approach to rank genes identified through a GWA study that are associated with the lipid concentrations in plasma as well as to prioritize genes within linkage disequilibrium (LD) block. We found that the genes with highest scores were: ABCA1, lipoprotein lipase (LPL) and cholesterol ester transfer protein, plasma for high-density lipoprotein; low-density lipoprotein receptor, APOE and APOB for low-density lipoprotein; and LPL, APOA1 and APOB for triglyceride. In addition, we identified genes relevant to lipid metabolism from the literature even in cases where such knowledge was not reflected in current annotation of these genes. These results demonstrate that ontology fingerprints can be used effectively to prioritize genes from GWA studies for experimental validation.
Contact: zhengw@musc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp158
PMCID: PMC2732313  PMID: 19349285
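The hypergeometric enrichment test mentioned in the abstract above can be sketched as an upper-tail probability. The mapping of counts to abstracts and GO terms in the comments is an assumption made for illustration, not a description of the authors' pipeline, and the function name `enrichment_pvalue` is hypothetical.

```python
from math import comb


def enrichment_pvalue(k, n_total, n_term, n_gene):
    """P(X >= k) under a hypergeometric null.

    Assumed setup (illustrative only):
      n_total: abstracts in the corpus
      n_term:  abstracts annotated with a given GO term
      n_gene:  abstracts that discuss the gene of interest
      k:       abstracts that discuss the gene AND carry the term
    """
    upper = min(n_term, n_gene)
    # Count the ways to draw i term-abstracts and (n_gene - i) others.
    favorable = sum(
        comb(n_term, i) * comb(n_total - n_term, n_gene - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(n_total, n_gene)
```

With a corpus of 10 abstracts, 3 of them carrying the term, and 4 discussing the gene, observing an overlap of at least 2 has probability 70/210 = 1/3 under the null, so `enrichment_pvalue(2, 10, 3, 4)` returns approximately 0.333.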
15.  Snat: a SNP annotation tool for bovine by integrating various sources of genomic information 
BMC Genetics  2011;12:85.
Background
Recently, with the maturing of bovine genome sequencing and high-throughput SNP genotyping technologies, a large number of significant SNPs associated with economically important traits can be identified by genome-wide association studies (GWAS). To further determine true association findings in GWAS, the common strategy is to sift out the most promising SNPs for follow-up replication studies. Hence it is crucial to explore the functional significance of the candidate SNPs in order to screen and select the potentially functional ones. To systematically prioritize these statistically significant SNPs and facilitate follow-up replication studies, we developed a bovine SNP annotation tool (Snat) based on a web interface.
Results
With Snat, various sources of genomic information are integrated and retrieved from several leading online databases, including SNP information from dbSNP, gene information from Entrez Gene, protein features from UniProt, linkage information from AnimalQTLdb, conserved elements from UCSC Genome Browser Database and gene functions from Gene Ontology (GO), KEGG PATHWAY and Online Mendelian Inheritance in Animals (OMIA). Snat provides two different applications, including a CGI-based web utility and a command-line version, to access the integrated database, target any single nucleotide loci of interest and perform multi-level functional annotations. For further validation of the practical significance of our study, SNPs involved in two commercial bovine SNP chips, i.e., the Affymetrix Bovine 10K chip array and the Illumina 50K chip array, have been annotated by Snat, and the corresponding outputs can be directly downloaded from Snat website. Furthermore, a real dataset involving 20 identified SNPs associated with milk yield in our recent GWAS was employed to demonstrate the practical significance of Snat.
Conclusions
To the best of our knowledge, Snat is one of the first tools focusing on SNP annotation for livestock. Snat provides researchers with a convenient and powerful platform to aid functional analyses and accurate evaluation of genes/variants related to SNPs, and facilitates follow-up replication studies in the post-GWAS era.
doi:10.1186/1471-2156-12-85
PMCID: PMC3224132  PMID: 21982513
16.  Association Study of 167 Candidate Genes for Schizophrenia Selected by a Multi-Domain Evidence-Based Prioritization Algorithm and Neurodevelopmental Hypothesis 
PLoS ONE  2013;8(7):e67776.
Integrating evidence from multiple domains is useful in prioritizing disease candidate genes for subsequent testing. We ranked all known human genes (n = 3819) under linkage peaks in the Irish Study of High-Density Schizophrenia Families using three different evidence domains: 1) a meta-analysis of microarray gene expression results using the Stanley Brain collection, 2) a schizophrenia protein-protein interaction network, and 3) a systematic literature search. Each gene was assigned a domain-specific p-value and ranked after evaluating the evidence within each domain. For comparison to this ranking process, a large-scale candidate gene hypothesis was also tested by including genes with Gene Ontology terms related to neurodevelopment. Subsequently, genotypes of 3725 SNPs in 167 genes from a custom Illumina iSelect array were used to evaluate the top ranked vs. hypothesis selected genes. Seventy-three genes were both highly ranked and involved in neurodevelopment (category 1) while 42 and 52 genes were exclusive to neurodevelopment (category 2) or highly ranked (category 3), respectively. The most significant associations were observed in genes PRKG1, PRKCE, and CNTN4 but no individual SNPs were significant after correction for multiple testing. Comparison of the approaches showed an excess of significant tests using the hypothesis-driven neurodevelopment category. Random selection of similar sized genes from two independent genome-wide association studies (GWAS) of schizophrenia showed the excess was unlikely by chance. In a further meta-analysis of three GWAS datasets, four candidate SNPs reached nominal significance. Although gene ranking using integrated sources of prior information did not enrich for significant results in the current experiment, gene selection using an a priori hypothesis (neurodevelopment) was superior to random selection. As such, further development of gene ranking strategies using more carefully selected sources of information is warranted.
doi:10.1371/journal.pone.0067776
PMCID: PMC3726675  PMID: 23922650
17.  Functional annotation signatures of disease susceptibility loci improve SNP association analysis 
BMC Genomics  2014;15(1):398.
Background
Genetic association studies are conducted to discover genetic loci that contribute to an inherited trait, identify the variants behind these associations and ascertain their functional role in determining the phenotype. To date, functional annotations of the genetic variants have rarely played more than an indirect role in assessing evidence for association. Here, we demonstrate how these data can be systematically integrated into an association study’s analysis plan.
Results
We developed a Bayesian statistical model for the prior probability of phenotype–genotype association that incorporates data from past association studies and publicly available functional annotation data regarding the susceptibility variants under study. The model takes the form of a binary regression of association status on a set of annotation variables whose coefficients were estimated through an analysis of associated SNPs in the GWAS Catalog (GC). The functional predictors examined included measures that have been demonstrated to correlate with the association status of SNPs in the GC and some whose utility in this regard is speculative: summaries of the UCSC Human Genome Browser ENCODE super–track data, dbSNP function class, sequence conservation summaries, proximity to genomic variants in the Database of Genomic Variants and known regulatory elements in the Open Regulatory Annotation database, PolyPhen–2 probabilities and RegulomeDB categories. Because we expected that only a fraction of the annotations would contribute to predicting association, we employed a penalized likelihood method to reduce the impact of non–informative predictors and evaluated the model’s ability to predict GC SNPs not used to construct the model. We show that the functional data alone are predictive of a SNP’s presence in the GC. Further, using data from a genome–wide study of ovarian cancer, we demonstrate that their use as prior data when testing for association is practical at the genome–wide scale and improves power to detect associations.
Conclusions
We show how diverse functional annotations can be efficiently combined to create ‘functional signatures’ that predict the a priori odds of a variant’s association to a trait and how these signatures can be integrated into a standard genome–wide–scale association analysis, resulting in improved power to detect truly associated variants.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-398) contains supplementary material, which is available to authorized users.
doi:10.1186/1471-2164-15-398
PMCID: PMC4041996  PMID: 24886216
Association study; GWAS; SNPs; Functional annotations; Bayesian analysis; ENCODE project
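The idea in entry 17 of turning functional annotations into prior odds of association can be illustrated with a small sketch. It assumes a logistic link for the binary regression and combines the resulting prior with a Bayes factor from the association data; the coefficient values, annotation names, and function names are hypothetical and are not taken from the article.

```python
import math

def prior_probability(intercept, coefficients, annotations):
    """Prior probability of association from annotation values (logistic link assumed)."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in annotations.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))

def posterior_probability(prior, bayes_factor):
    """Combine a prior probability with a Bayes factor: posterior odds = prior odds * BF."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical coefficients; real values would be estimated from catalogued SNPs.
COEFFICIENTS = {"conserved": 0.8, "in_regulatory_element": 1.2}

annotated = prior_probability(-4.0, COEFFICIENTS, {"conserved": 1, "in_regulatory_element": 1})
unannotated = prior_probability(-4.0, COEFFICIENTS, {"conserved": 0, "in_regulatory_element": 0})
```

With the same Bayes factor, the SNP carrying both annotations ends up with the higher posterior probability, which is the sense in which a functional signature can raise power for truly associated variants.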
18.  SNP-based pathway enrichment analysis for genome-wide association studies 
BMC Bioinformatics  2011;12:99.
Background
Recently we have witnessed a surge of interest in using genome-wide association studies (GWAS) to discover the genetic basis of complex diseases. Many genetic variations, mostly in the form of single nucleotide polymorphisms (SNPs), have been identified in a wide spectrum of diseases, including diabetes, cancer, and psychiatric diseases. A common theme arising from these studies is that the genetic variations discovered by GWAS can only explain a small fraction of the genetic risks associated with the complex diseases. New strategies and statistical approaches are needed to address this lack of explanation. One such approach is pathway analysis, which considers the genetic variations underlying a biological pathway jointly, rather than separately as in traditional GWAS. A critical challenge in pathway analysis is how to combine evidence of association over multiple SNPs within a gene and multiple genes within a pathway. Most current methods choose the most significant SNP from each gene as a representative, ignoring the joint action of multiple SNPs within a gene. This approach leads to preferential identification of genes with a greater number of SNPs.
Results
We describe a SNP-based pathway enrichment method for GWAS studies. The method consists of the following two main steps: 1) for a given pathway, using an adaptive truncated product statistic to identify all representative (potentially more than one) SNPs of each gene, calculating the average number of representative SNPs for the genes, then re-selecting the representative SNPs of genes in the pathway based on this number; and 2) ranking all selected SNPs by the significance of their statistical association with a trait of interest, and testing if the set of SNPs from a particular pathway is significantly enriched with high ranks using a weighted Kolmogorov-Smirnov test. We applied our method to two large genetically distinct GWAS data sets of schizophrenia, one from European-American (EA) and the other from African-American (AA). In the EA data set, we found 22 pathways with nominal P-value less than or equal to 0.001 and corresponding false discovery rate (FDR) less than 5%. In the AA data set, we found 11 pathways by controlling the same nominal P-value and FDR threshold. Interestingly, 8 of these pathways overlap with those found in the EA sample. We have implemented our method in a JAVA software package, called SNP Set Enrichment Analysis (SSEA), which contains a user-friendly interface and is freely available at http://cbcl.ics.uci.edu/SSEA.
Conclusions
The SNP-based pathway enrichment method described here offers a new alternative approach for analysing GWAS data. By applying it to schizophrenia GWAS studies, we show that our method is able to identify statistically significant pathways, and importantly, pathways that can be replicated in large genetically distinct samples.
doi:10.1186/1471-2105-12-99
PMCID: PMC3102637  PMID: 21496265
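The second step described in entry 18, testing whether a pathway's SNPs are enriched near the top of a ranked list, can be sketched as a weighted Kolmogorov-Smirnov running sum. This is a minimal illustration in the style of gene set enrichment analysis; whether SSEA uses exactly this weighting is an assumption, and the function and parameter names are hypothetical.

```python
def weighted_ks_enrichment(ranked_snps, scores, pathway_snps, weight=1.0):
    """Return the signed maximum deviation of a weighted running sum.

    ranked_snps: SNP identifiers sorted from most to least significant.
    scores: association scores in the same order (for example -log10 p).
    pathway_snps: set of SNP identifiers belonging to the pathway.
    """
    in_set = [snp in pathway_snps for snp in ranked_snps]
    n_miss = in_set.count(False)
    hit_total = sum(abs(s) ** weight for s, hit in zip(scores, in_set) if hit)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one pathway SNP and one non-pathway SNP")

    running = 0.0
    best = 0.0
    for s, hit in zip(scores, in_set):
        if hit:
            # Step up by the SNP's share of the total pathway weight.
            running += abs(s) ** weight / hit_total
        else:
            # Step down by an equal share for every non-pathway SNP.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

In practice the observed statistic would be compared against a null distribution, for example from permuted phenotype labels, to obtain a P-value; that step is omitted here.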
19.  An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data 
Bioinformatics  2012;28(20):2615-2623.
Motivation: There is growing momentum to develop statistical learning (SL) methods as an alternative to conventional genome-wide association studies (GWAS). Methods such as random forests (RF) and gradient boosting machine (GBM) result in variable importance measures that indicate how well each single-nucleotide polymorphism (SNP) predicts the phenotype. For RF, it has been shown that variable importance measures are systematically affected by minor allele frequency (MAF) and linkage disequilibrium (LD). To establish RF and GBM as viable alternatives for analyzing genome-wide data, it is necessary to address this potential bias and show that SL methods do not significantly under-perform conventional GWAS methods.
Results: Both LD and MAF have a significant impact on the variable importance measures commonly used in RF and GBM. Dividing SNPs into overlapping subsets with approximate linkage equilibrium and applying SL methods to each subset successfully reduces the impact of LD. A welcome side effect of this approach is a dramatic reduction in parallel computing time, increasing the feasibility of applying SL methods to large datasets. The created subsets also facilitate a potential correction for the effect of MAF using pseudocovariates. Simulations using simulated SNPs embedded in empirical data—assessing varying effect sizes, minor allele frequencies and LD patterns—suggest that the sensitivity to detect effects is often improved by subsetting and does not significantly under-perform the Armitage trend test, even under ideal conditions for the trend test.
Availability: Code for the LD subsetting algorithm and pseudocovariate correction is available at http://www.nd.edu/~glubke/code.html.
Contact: glubke@nd.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts483
PMCID: PMC3467741  PMID: 22847933
20.  Replication of genetic loci for ages at menarche and menopause in the multi-ethnic Population Architecture using Genomics and Epidemiology (PAGE) study 
Human Reproduction (Oxford, England)  2013;28(6):1695-1706.
STUDY QUESTION
Do genetic associations identified in genome-wide association studies (GWAS) of age at menarche (AM) and age at natural menopause (ANM) replicate in women of diverse race/ancestry from the Population Architecture using Genomics and Epidemiology (PAGE) Study?
SUMMARY ANSWER
We replicated GWAS reproductive trait single nucleotide polymorphisms (SNPs) in our European descent population and found that many SNPs were also associated with AM and ANM in populations of diverse ancestry.
WHAT IS KNOWN ALREADY
Menarche and menopause mark the reproductive lifespan in women and are important risk factors for chronic diseases including obesity, cardiovascular disease and cancer. Both events are believed to be influenced by environmental and genetic factors, and vary in populations differing by genetic ancestry and geography. Most genetic variants associated with these traits have been identified in GWAS of European-descent populations.
STUDY DESIGN, SIZE, DURATION
A total of 42 251 women of diverse ancestry from PAGE were included in cross-sectional analyses of AM and ANM.
MATERIALS, SETTING, METHODS
SNPs previously associated with ANM (n = 5 SNPs) and AM (n = 3 SNPs) in GWAS were genotyped in American Indians, African Americans, Asians, European Americans, Hispanics and Native Hawaiians. To test SNP associations with ANM or AM, we used linear regression models stratified by race/ethnicity and PAGE sub-study. Results were then combined in race-specific fixed effect meta-analyses for each outcome. For replication and generalization analyses, significance was defined at P < 0.01 for ANM analyses and P < 0.017 for AM analyses.
MAIN RESULTS AND THE ROLE OF CHANCE
We replicated findings for AM SNPs in the LIN28B locus and an intergenic region on 9q31 in European Americans. The LIN28B SNPs (rs314277 and rs314280) were also significantly associated with AM in Asians, but not in other race/ethnicity groups. Linkage disequilibrium (LD) patterns at this locus varied widely among the ancestral groups. With the exception of an intergenic SNP at 13q34, all ANM SNPs replicated in European Americans. Three were significantly associated with ANM in other race/ethnicity populations: rs2153157 (6p24.2/SYCP2L), rs365132 (5q35/UIMC1) and rs16991615 (20p12.3/MCM8). While rs1172822 (19q13/BRSK1) was not significant in the populations of non-European descent, effect sizes showed similar trends.
LIMITATIONS, REASONS FOR CAUTION
Lack of association for the GWAS SNPs in the non-European American groups may be due to differences in locus LD patterns between these groups and the European-descent populations included in the GWAS discovery studies; and in some cases, lower power may also contribute to non-significant findings.
WIDER IMPLICATIONS OF THE FINDINGS
The discovery of genetic variants associated with the reproductive traits provides an important opportunity to elucidate the biological mechanisms involved with normal variation and disorders of menarche and menopause. In this study we replicated most, but not all reported SNPs in European descent populations and examined the epidemiologic architecture of these early reported variants, describing their generalizability and effect size across differing ancestral populations. Such data will be increasingly important for prioritizing GWAS SNPs for follow-up in fine-mapping and resequencing studies, as well as in translational research.
STUDY FUNDING/COMPETING INTEREST(S)
The Population Architecture Using Genomics and Epidemiology (PAGE) program is funded by the National Human Genome Research Institute (NHGRI), supported by U01HG004803 (CALiCo), U01HG004798 (EAGLE), U01HG004802 (MEC), U01HG004790 (WHI) and U01HG004801 (Coordinating Center), and their respective NHGRI ARRA supplements. The authors report no conflicts of interest.
doi:10.1093/humrep/det071
PMCID: PMC3657124  PMID: 23508249
menopause; menarche; genome-wide association study; race/ethnicity; single nucleotide polymorphism
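The significance thresholds quoted in entry 20 (P < 0.01 for 5 ANM SNPs and P < 0.017 for 3 AM SNPs) are numerically consistent with a Bonferroni correction of a 0.05 level by the number of SNPs tested. The abstract does not name the correction, so treating it as Bonferroni is an assumption; the sketch below only reproduces the arithmetic.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

# Assumed family-wise level of 0.05, divided by the SNP counts from the abstract.
anm_threshold = bonferroni_threshold(0.05, 5)  # 5 ANM SNPs
am_threshold = bonferroni_threshold(0.05, 3)   # 3 AM SNPs
```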
21.  FunciSNP: an R/bioconductor tool integrating functional non-coding data sets with genetic association studies to identify candidate regulatory SNPs 
Nucleic Acids Research  2012;40(18):e139.
Single nucleotide polymorphisms (SNPs) are increasingly used to tag genetic loci associated with phenotypes such as risk of complex diseases. Technically, this is done genome-wide without prior restriction or knowledge of biological feasibility in scans referred to as genome-wide association studies (GWAS). Depending on the linkage disequilibrium (LD) structure at a particular locus, such tagSNPs may be surrogates for many thousands of other SNPs, and it is difficult to distinguish those that may play a functional role in the phenotype from those simply genetically linked. Because a large proportion of tagSNPs have been identified within non-coding regions of the genome, distinguishing functional from non-functional SNPs has been an even greater challenge. A strategy was recently proposed that prioritizes surrogate SNPs based on non-coding chromatin and epigenomic mapping techniques that have become feasible with the advent of massively parallel sequencing. Here, we introduce an R/Bioconductor software package that enables the identification of candidate functional SNPs by integrating information from tagSNP locations, lists of linked SNPs from the 1000 genomes project and locations of chromatin features which may have functional significance. Availability: FunciSNP is available from Bioconductor (bioconductor.org).
doi:10.1093/nar/gks542
PMCID: PMC3467035  PMID: 22684628
22.  is-rSNP: a novel technique for in silico regulatory SNP detection 
Bioinformatics  2010;26(18):i524-i530.
Motivation: Determining the functional impact of non-coding disease-associated single nucleotide polymorphisms (SNPs) identified by genome-wide association studies (GWAS) is challenging. Many of these SNPs are likely to be regulatory SNPs (rSNPs): variations which affect the ability of a transcription factor (TF) to bind to DNA. However, experimental procedures for identifying rSNPs are expensive and labour intensive. Therefore, in silico methods are required for rSNP prediction. By scoring two alleles with a TF position weight matrix (PWM), it can be determined which SNPs are likely rSNPs. However, predictions in this manner are noisy and no method exists that determines the statistical significance of a nucleotide variation on a PWM score.
Results: We have designed an algorithm for in silico rSNP detection called is-rSNP. We employ novel convolution methods to determine the complete distributions of PWM scores and ratios between allele scores, facilitating assignment of statistical significance to rSNP effects. We have tested our method on 41 experimentally verified rSNPs, correctly predicting the disrupted TF in 28 cases. We also analysed 146 disease-associated SNPs with no known functional impact in an attempt to identify candidate rSNPs. Of the 11 significantly predicted disrupted TFs, 9 had previous evidence of being associated with the disease in the literature. These results demonstrate that is-rSNP is suitable for high-throughput screening of SNPs for potential regulatory function. This is a useful and important tool in the interpretation of GWAS.
Availability: is-rSNP software is available for use at: www.genomics.csse.unimelb.edu.au/is-rSNP
Contact: gmaci@csse.unimelb.edu.au; adam.kowalczyk@nicta.com.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq378
PMCID: PMC2935445  PMID: 20823317
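The starting point described in entry 22, scoring both alleles of a SNP against a transcription factor position weight matrix, can be sketched as below. The toy matrix, the uniform background, and the function names are hypothetical. The sketch covers only the raw score difference, not the convolution-based significance calculation that is-rSNP contributes.

```python
import math

def pwm_log_odds(site, pwm, background=0.25):
    """Log-odds score of a DNA site under a PWM of per-position base probabilities."""
    if len(site) != len(pwm):
        raise ValueError("site and PWM must have the same length")
    return sum(math.log2(col[base] / background) for base, col in zip(site, pwm))

def allele_score_difference(ref_site, alt_site, pwm):
    """Score change caused by the variant (negative means weaker predicted binding)."""
    return pwm_log_odds(alt_site, pwm) - pwm_log_odds(ref_site, pwm)

# Hypothetical 3-column matrix preferring the site "AGC".
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

# A G>T change at the middle position lowers the score.
delta = allele_score_difference("AGC", "ATC", TOY_PWM)
```

A raw difference like this is noisy on its own, which is the gap the abstract says a statistical significance measure is meant to close.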
23.  On the identification of potential regulatory variants within genome wide association candidate SNP sets 
BMC Medical Genomics  2014;7:34.
Background
Genome wide association studies (GWAS) are a population-scale approach to the identification of segments of the genome in which genetic variations may contribute to disease risk. Current methods focus on the discovery of single nucleotide polymorphisms (SNPs) associated with disease traits. As there are many SNPs within identified risk loci, and the majority of these are situated within non-coding regions, a key challenge is to identify and prioritize variants affecting regulatory sequences that are likely to contribute to the phenotype assessed.
Methods
We focused investigation on SNPs within lung and breast cancer GWAS loci that reached genome-wide significance for potential roles in gene regulation with a specific focus on SNPs likely to disrupt transcription factor binding sites. Within risk loci, the regulatory potential of sub-regions was classified using relevant open chromatin and epigenetic high throughput sequencing data sets from the ENCODE project in available cancer and normal cell lines. Furthermore, transcription factor affinity altering variants were predicted by comparison of position weight matrix scores between disease and reference alleles. Lastly, ChIP-seq data of transcription associated factors and topological domains were included as binding evidence and potential gene target inference.
Results
The sets of SNPs, including both the disease-associated markers and those in high linkage disequilibrium with them, were significantly over-represented in regulatory sequences of cancer and/or normal cells; however, over-representation was generally not restricted to disease-relevant tissue specific regions. The calculated regulatory potential, allelic binding affinity scores and ChIP-seq binding evidence were the three criteria used to prioritize candidates. Fitting all three criteria, we highlighted breast cancer susceptibility SNPs and a borderline lung cancer relevant SNP located in cancer-specific enhancers overlapping multiple distinct transcription associated factor ChIP-seq binding sites.
Conclusion
Incorporating high throughput sequencing epigenetic and transcription factor data sets from both cancer and normal cells into cancer genetic studies reveals potential functional SNPs and informs subsequent characterization efforts.
doi:10.1186/1755-8794-7-34
PMCID: PMC4066296  PMID: 24920305
GWAS; Lung cancer; Regulatory regions; Gene regulation; Transcription factor binding site alteration; Enhancer; Topological domains
24.  Robust relationship inference in genome-wide association studies 
Bioinformatics  2010;26(22):2867-2873.
Motivation: Genome-wide association studies (GWASs) have been widely used to map loci contributing to variation in complex traits and risk of diseases in humans. Accurate specification of familial relationships is crucial for family-based GWAS, as well as in population-based GWAS with unknown (or unrecognized) family structure. The family structure in a GWAS should be routinely investigated using the SNP data prior to the analysis of population structure or phenotype. Existing algorithms for relationship inference have a major weakness of estimating allele frequencies at each SNP from the entire sample, under a strong assumption of homogeneous population structure. This assumption is often untenable.
Results: Here, we present a rapid algorithm for relationship inference using high-throughput genotype data typical of GWAS that allows the presence of unknown population substructure. The relationship of any pair of individuals can be precisely inferred by robust estimation of their kinship coefficient, independent of sample composition or population structure (sample invariance). We present simulation experiments to demonstrate that the algorithm has sufficient power to provide reliable inference on millions of unrelated pairs and thousands of relative pairs (up to 3rd-degree relationships). Application of our robust algorithm to HapMap and GWAS datasets demonstrates that it performs properly even under extreme population stratification, while algorithms assuming a homogeneous population give systematically biased results. Our extremely efficient implementation performs relationship inference on millions of pairs of individuals in a matter of minutes, dozens of times faster than the most efficient existing algorithm known to us.
Availability: Our robust relationship inference algorithm is implemented in a freely available software package, KING, available for download at http://people.virginia.edu/~wc9c/KING.
Contact: wmchen@virginia.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq559
PMCID: PMC3025716  PMID: 20926424
25.  Genome-wide association study combined with biological context can reveal more disease-related SNPs altering microRNA target seed sites 
BMC Genomics  2014;15(1):669.
Background
Emerging studies demonstrate that single nucleotide polymorphisms (SNPs) residing in the microRNA recognition element seed sites (MRESSs) in the 3′UTR of mRNAs are putative biomarkers for human diseases and cancers. However, exhaustive experimental validation of the causality of MRESS SNPs is impractical. Therefore bioinformatics methods have been introduced to predict causal MRESS SNPs. A genome-wide association study (GWAS) provides a way to detect the susceptibility of millions of SNPs simultaneously by taking linkage disequilibrium (LD) into account, but the multiple-testing corrections implemented to suppress the false positive rate always sacrifice sensitivity. In our study, we proposed a method to identify candidate causal MRESS SNPs from 12 GWAS datasets without performing multiple-testing corrections. Instead, we used biological context to ensure the credibility of the selected SNPs.
Results
In 11 out of the 12 GWAS datasets, MRESS SNPs were over-represented in SNPs with p-value ≤ 0.05 (odds ratio (OR) ranged from 1.1 to 2.4). Moreover, host genes of susceptible MRESS SNPs in each of the 11 GWAS datasets shared biological context with reported causal genes. There were 286 MRESS SNPs identified by our method, while only 13 SNPs were identified by multiple-testing corrections with a given threshold of 1 × 10^−5, which is a common cutoff used in GWAS. 27 out of the 286 candidate SNPs have been reported to be deleterious while only 2 out of 13 multiple-testing corrected SNPs were documented in PubMed. MicroRNA-mRNA interactions affected by the 286 candidate SNPs were likely to present negatively correlated expression. These SNPs introduced greater alteration of binding free energy than other MRESS SNPs, especially when grouping by haplotypes (4210 vs. 4105 cal/mol by mean, 9781 vs. 8521 cal/mol by mean, respectively).
Conclusions
MRESS SNPs are promising disease biomarkers in multiple GWAS datasets. The method of integrating GWAS p-value and biological context is stable and effective for selecting candidate causal MRESS SNPs; it reduces the loss of sensitivity compared to multiple-testing corrections. The 286 candidate causal MRESS SNPs provide researchers with a credible source to initialize their design of experimental validations in the future.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-669) contains supplementary material, which is available to authorized users.
doi:10.1186/1471-2164-15-669
PMCID: PMC4246476  PMID: 25106527
microRNA; Genome-wide association study; Single nucleotide polymorphisms; Human diseases and cancers
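The over-representation result in entry 25 is reported as an odds ratio. As a minimal sketch of how such a value is obtained from a 2×2 table of SNP counts, the function below computes it directly; the counts are made up for illustration and do not come from the article.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: MRESS SNPs with p <= 0.05     b: MRESS SNPs with p > 0.05
    c: other SNPs with p <= 0.05     d: other SNPs with p > 0.05
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)

# Hypothetical counts: 30 of 100 MRESS SNPs vs. 20 of 100 other SNPs reach p <= 0.05.
example_or = odds_ratio(30, 70, 20, 80)
```

A value above 1 indicates that MRESS SNPs are more likely than other SNPs to fall below the p-value cutoff.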
