PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-25 (638870)

Clipboard (0)
None

Related Articles

1.  A variable selection method for genome-wide association studies 
Bioinformatics  2010;27(1):1-8.
Motivation: Genome-wide association studies (GWAS) involving half a million or more single nucleotide polymorphisms (SNPs) allow genetic dissection of complex diseases in a holistic manner. The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict risk of disease. Existing methods for joint analysis of GWAS data tend to miss causal SNPs that are marginally uncorrelated with disease and have high false discovery rates (FDRs).
Results: We introduce GWASelect, a statistically powerful and computationally efficient variable selection method designed to tackle the unique challenges of GWAS data. This method searches iteratively over the potential SNPs conditional on previously selected SNPs and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false positive findings. Simulation studies demonstrate that the GWASelect performs well under a wide spectrum of linkage disequilibrium patterns and can be substantially more powerful than existing methods in capturing causal variants while having a lower FDR. In addition, the regression models based on the GWASelect tend to yield more accurate prediction of disease risk than existing methods. The advantages of the GWASelect are illustrated with the Wellcome Trust Case-Control Consortium (WTCCC) data.
Availability: The software implementing GWASelect is available at http://www.bios.unc.edu/~lin.
Access to WTCCC data: http://www.wtccc.org.uk/
Contact: lin@bios.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics Online.
doi:10.1093/bioinformatics/btq600
PMCID: PMC3025714  PMID: 21036813
2.  Two-stage study designs combining genome-wide association studies, tag single-nucleotide polymorphisms, and exome sequencing: accuracy of genetic effect estimates 
BMC Proceedings  2011;5(Suppl 9):S64.
Genome-wide association studies (GWAS) test for disease-trait associations and estimate effect sizes at tag single-nucleotide polymorphisms (SNPs), which imperfectly capture variation at causal SNPs. Sequencing studies can examine potential causal SNPs directly; however, sequencing the whole genome or exome can be prohibitively expensive. Costs can be limited by using a GWAS to detect the associated region(s) at tag SNPs followed by targeted sequencing to identify and estimate the effect size of the causal variant. Genetic effect estimates obtained from association studies can be inflated because of a form of selection bias known as the winner’s curse. Conversely, estimates at tag SNPs can be attenuated compared to the causal SNP because of incomplete linkage disequilibrium. These two effects oppose each other. Analysis of rare SNPs further complicates our understanding of the winner’s curse because rare SNPs are difficult to tag and analysis can involve collapsing over multiple rare variants. In two-stage analysis of Genetic Analysis Workshop 17 simulated data sets, we find that selection at the tag SNP produces upward bias in the estimate of effect at the causal SNP, even when the tag and causal SNPs are not well correlated. The bias similarly carries through to effect estimates for rare variant summary measures. Replication studies designed with sample sizes computed using biased estimates will be under-powered to detect a disease-causing variant. Accounting for bias in the original study is critical to avoid discarding disease-associated SNPs at follow up.
doi:10.1186/1753-6561-5-S9-S64
PMCID: PMC3287903  PMID: 22373407
3.  Prioritized Subset Analysis: Improving Power in Genome-wide Association Studies 
Human Heredity  2007;65(3):129-141.
Background
Genome-wide association studies (GWAS) are now feasible for studying the genetics underlying complex diseases. For many diseases, a list of candidate genes or regions exists and incorporation of such information into data analyses can potentially improve the power to detect disease variants. Traditional approaches for assessing the overall statistical significance of GWAS results ignore such information by inherently treating all markers equally.
Methods
We propose the prioritized subset analysis (PSA), in which a prioritized subset of markers is pre-selected from candidate regions, and the false discovery rate (FDR) procedure is carried out in the prioritized subset and its complementary subset, respectively.
Results
The PSA is more powerful than the whole-genome single-step FDR adjustment for a range of alternative models. The degree of power improvement depends on the fraction of associated SNPs in the prioritized subset and their nominal power, with higher fraction of associated SNPs and higher nominal power leading to more power improvement. The power improvement can be substantial; for disease loci not included in the prioritized subset, the power loss is almost negligible.
Conclusion
The PSA has the flexibility of allowing investigators to combine prior information from a variety of sources, and will be a useful tool for GWAS.
doi:10.1159/000109730
PMCID: PMC2858373  PMID: 17934316
Association analysis; False discovery rate; HapMap
4.  Using Ascertainment for Targeted Resequencing to Increase Power to Identify Causal Variants 
Statistics and its interface  2011;4(3):285-294.
Summary
Researchers continue to use genome-wide association studies (GWAS) to find the genetic markers associated with disease. Recent studies have added to the typical two-stage analysis a third stage that uses targeted resequencing on a randomly selected subset of the cases to detect the causal single-nucleotide polymorphism (SNP). We propose a design for targeted resequencing that increases the power to detect the causal variant. The design features an ascertainment scheme wherein only those cases with the presence of a risk allele are selected for targeted resequencing. We simulated a disease with a single causal SNP to evaluate our method versus a targeted resequencing design using randomly selected individuals. The simulation studies showed that ascertaining individuals for the targeted resequencing can substantially increase the power to detect a causal SNP, without increasing the false-positive rate.
PMCID: PMC3316326  PMID: 22468169
Ascertainment; Genome-Wide Association Study; Causal Polymorphism; Targeted Resequencing
5.  PATHWAY-BASED ANALYSIS FOR GENOME-WIDE ASSOCIATION STUDIES USING SUPERVISED PRINCIPAL COMPONENTS 
Genetic epidemiology  2010;34(7):716-724.
Many complex diseases are influenced by genetic variations in multiple genes, each with only a small marginal effect on disease susceptibility. Pathway analysis, which identifies biological pathways associated with disease outcome, has become increasingly popular for genome-wide association studies (GWAS). In addition to combining weak signals from a number of SNPs in the same pathway, results from pathway analysis also shed light on the biological processes underlying disease. We propose a new pathway-based analysis method for GWAS, the supervised principal component analysis (SPCA) model. In the proposed SPCA model, a selected subset of SNPs most associated with disease outcome is used to estimate the latent variable for a pathway. The estimated latent variable for each pathway is an optimal linear combination of a selected subset of SNPs; therefore, the proposed SPCA model provides the ability to borrow strength across the SNPs in a pathway. In addition to identifying pathways associated with disease outcome, SPCA also carries out additional within-category selection to identify the most important SNPs within each gene set. The proposed model operates in a well-established statistical framework and can handle design information such as covariate adjustment and matching information in GWAS. We compare the proposed method with currently available methods using data with realistic linkage disequilibrium structures and we illustrate the SPCA method using the Wellcome Trust Case-Control Consortium Crohn Disease (CD) dataset.
doi:10.1002/gepi.20532
PMCID: PMC3480088  PMID: 20842628
SNPs; genome-wide association; pathway analysis; principal component analysis
6.  Power of genome-wide association studies in the presence of interacting loci 
Genetic Epidemiology  2007;31(7):748-762.
Summary
Though multiple interacting loci are likely involved in the etiology of complex diseases, early genome-wide association studies (GWAS) have depended on the detection of the marginal effects of each locus. Here, we evaluate the power of GWAS in the presence of two linked and potentially associated causal loci for several models of interaction between them and find that interacting loci may give rise to marginal relative risks that are not generally considered in a one-locus model. To derive power under realistic situations, we use empirical data generated by the HapMap ENCODE project for both allele frequencies and LD structure. The power is also evaluated in situations where the causal SNPs may not be genotyped, but rather detected by proxy using a SNP in linkage disequilibrium (LD). A common simplification for such power computations assumes that the sample size necessary to detect the effect at the tSNP is the sample size necessary to detect the causal locus directly divided by the LD measure r2 between the two. This assumption, which we call the “proportionality assumption”, is a simplification of the many factors that contribute to the strength of association at a marker, and has recently been criticized as unreasonable [Terwilliger and Hiekkalinna 2006], in particular in the presence of interacting and associated loci. We find that this assumption does not introduce much error in single locus models of disease, but may do so in so in certain two-locus models.
doi:10.1002/gepi.20238
PMCID: PMC3101367  PMID: 17508359
Genetic Predisposition to Disease; Genome, Human; genetics; Genotype; Humans; Linkage Disequilibrium; Models, Genetic; Polymorphism, Single Nucleotide; Quantitative Trait Loci; genetics; linkage disequilibrium; genome-wide; tagSNPs
7.  ICSNPathway: identify candidate causal SNPs and pathways from genome-wide association study by one analytical framework 
Nucleic Acids Research  2011;39(Web Server issue):W437-W443.
Genome-wide association study (GWAS) is widely utilized to identify genes involved in human complex disease or some other trait. One key challenge for GWAS data interpretation is to identify causal SNPs and provide profound evidence on how they affect the trait. Currently, researches are focusing on identification of candidate causal variants from the most significant SNPs of GWAS, while there is lack of support on biological mechanisms as represented by pathways. Although pathway-based analysis (PBA) has been designed to identify disease-related pathways by analyzing the full list of SNPs from GWAS, it does not emphasize on interpreting causal SNPs. To our knowledge, so far there is no web server available to solve the challenge for GWAS data interpretation within one analytical framework. ICSNPathway is developed to identify candidate causal SNPs and their corresponding candidate causal pathways from GWAS by integrating linkage disequilibrium (LD) analysis, functional SNP annotation and PBA. ICSNPathway provides a feasible solution to bridge the gap between GWAS and disease mechanism study by generating hypothesis of SNP → gene → pathway(s). The ICSNPathway server is freely available at http://icsnpathway.psych.ac.cn/.
doi:10.1093/nar/gkr391
PMCID: PMC3125783  PMID: 21622953
8.  False-Negative-Rate Based Approach for Selecting Top Single-Nucleotide Polymorphisms in the First Stage of a Two-Stage Genome-Wide Association Study 
Statistics and its interface  2011;4(3):359-371.
Genome-wide association (GWA) studies, where hundreds of thousands of single-nucleotide polymorphisms (SNPs) are tested simultaneously, are becoming popular for identifying disease loci for common diseases. Most commonly, a GWA study involves two stages: the first stage includes testing the association between all SNPs and the disease and the second stage includes replication of SNPs selected from the first stage to validate associations in an independent sample. The first stage is considered to be more fundamental since the second stage is contingent on the results of the first stage. Selection of SNPs from stage one for genotyping in stage two is typically based on an arbitrary threshold or controlling type I errors. These strategies can be inefficient and have potential to exclude genotyping of disease-associated SNPs in stage two. We propose an approach for selecting top SNPs that uses a strategy based on the false-negative rate (FNR). Using the FNR approach, we proposed the number of SNPs that should be selected based on the observed p-values and a pre-specified multi-testing power in the first stage. We applied our method to simulated data and a GWA study of glioma (a rare form of brain tumor) data. Results from simulation and the glioma GWA indicate that the proposed approach provides an FNR-based way to select SNPs using pre-specified power.
PMCID: PMC3467022  PMID: 23060946
False negative rate; SNP selection; Two-stage genome-wide association study
9.  Biomolecular Systems of Disease Buried Across Multiple GWAS Unveiled by Information Theory and Ontology 
A key challenge for genome-wide association studies (GWAS) is to understand how single nucleotide polymorphisms (SNPs) mechanistically underpin complex diseases. While this challenge has been addressed partially by Gene Ontology (GO) enrichment of large list of host genes of SNPs prioritized in GWAS, these enrichment have not been formally evaluated. Here, we develop a novel computational approach anchored in information theoretic similarity, by systematically mining lists of host genes of SNPs prioritized in three adult-onset diabetes mellitus GWAS. The “gold-standard” is based on GO associated with 20 published diabetes SNPs’ host genes and on our own evaluation. We computationally identify 69 similarity-predicted GO independently validated in all three GWAS (FDR<5%), enriched with those of the gold-standard (odds ratio=5.89, P=4.81e-05), and these terms can be organized by similarity criteria into 11 groupings termed “biomolecular systems”. Six biomolecular systems were corroborated by the gold-standard and the remaining five were previously uncharacterized. http://lussierlab.org/publications/ITS-GWAS
PMCID: PMC3041547  PMID: 21347143
10.  Ultrahigh-dimensional variable selection method for whole-genome gene-gene interaction analysis 
BMC Bioinformatics  2012;13:72.
Background
Genome-wide gene-gene interaction analysis using single nucleotide polymorphisms (SNPs) is an attractive way for identification of genetic components that confers susceptibility of human complex diseases. Individual hypothesis testing for SNP-SNP pairs as in common genome-wide association study (GWAS) however involves difficulty in setting overall p-value due to complicated correlation structure, namely, the multiple testing problem that causes unacceptable false negative results. A large number of SNP-SNP pairs than sample size, so-called the large p small n problem, precludes simultaneous analysis using multiple regression. The method that overcomes above issues is thus needed.
Results
We adopt an up-to-date method for ultrahigh-dimensional variable selection termed the sure independence screening (SIS) for appropriate handling of numerous number of SNP-SNP interactions by including them as predictor variables in logistic regression. We propose ranking strategy using promising dummy coding methods and following variable selection procedure in the SIS method suitably modified for gene-gene interaction analysis. We also implemented the procedures in a software program, EPISIS, using the cost-effective GPGPU (General-purpose computing on graphics processing units) technology. EPISIS can complete exhaustive search for SNP-SNP interactions in standard GWAS dataset within several hours. The proposed method works successfully in simulation experiments and in application to real WTCCC (Wellcome Trust Case–control Consortium) data.
Conclusions
Based on the machine-learning principle, the proposed method gives powerful and flexible genome-wide search for various patterns of gene-gene interaction.
doi:10.1186/1471-2105-13-72
PMCID: PMC3531267  PMID: 22554139
11.  Investigating the Contribution of Common Genetic Variants to the Risk and Pathogenesis of ADHD 
The American Journal of Psychiatry  2012;169(2):186-194.
Objective:
A major motivation for seeking disease-associated genetic variation is to identify novel risk processes. Although rare copy number variants (CNVs) appear to contribute to attention deficit hyperactivity disorder (ADHD), common risk variants (single-nucleotide polymorphisms [SNPs]) have not yet been detected using genome-wide association studies (GWAS). This raises the concern as to whether future larger-scale, adequately powered GWAS will be worthwhile. The authors undertook a GWAS of ADHD and examined whether associated SNPs, including those below conventional levels of significance, influenced the same biological pathways affected by CNVs.
Method:
The authors analyzed genome-wide SNP frequencies in 727 children with ADHD and 5,081 comparison subjects. The gene sets that were enriched in a pathway analysis of the GWAS data (the top 5% of SNPs) were tested for an excess of genes spanned by large, rare CNVs in the children with ADHD.
Results:
No SNP achieved genome-wide significance levels. As previously reported in a subsample of the present study, large, rare CNVs were significantly more common in case subjects than comparison subjects. Thirteen biological pathways enriched for SNP association significantly overlapped with those enriched for rare CNVs. These included cholesterol-related and CNS development pathways. At the level of individual genes, CHRNA7, which encodes a nicotinic receptor subunit previously implicated in neuropsychiatric disorders, was affected by six large duplications in case subjects (none in comparison subjects), and SNPs in the gene had a gene-wide p value of 0.0002 for association in the GWAS.
Conclusions:
Both common and rare genetic variants appear to be relevant to ADHD and index-shared biological pathways.
doi:10.1176/appi.ajp.2011.11040551
PMCID: PMC3601404  PMID: 22420046
12.  Maximal conditional chi-square importance in random forests 
Bioinformatics  2010;26(6):831-837.
Motivation: High-dimensional data are frequently generated in genome-wide association studies (GWAS) and other studies. It is important to identify features such as single nucleotide polymorphisms (SNPs) in GWAS that are associated with a disease. Random forests represent a very useful approach for this purpose, using a variable importance score. This importance score has several shortcomings. We propose an alternative importance measure to overcome those shortcomings.
Results: We characterized the effect of multiple SNPs under various models using our proposed importance measure in random forests, which uses maximal conditional chi-square (MCC) as a measure of association between a SNP and the trait conditional on other SNPs. Based on this importance measure, we employed a permutation test to estimate empirical P-values of SNPs. Our method was compared to a univariate test and the permutation test using the Gini and permutation importance. In simulation, the proposed method performed consistently superior to the other methods in identifying of risk SNPs. In a GWAS of age-related macular degeneration, the proposed method confirmed two significant SNPs (at the genome-wide adjusted level of 0.05). Further analysis showed that these two SNPs conformed with a heterogeneity model. Compared with the existing importance measures, the MCC importance measure is more sensitive to complex effects of risk SNPs by utilizing conditional information on different SNPs. The permutation test with the MCC importance measure provides an efficient way to identify candidate SNPs in GWAS and facilitates the understanding of the etiology between genetic variants and complex diseases.
Contact: heping.zhang@yale.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq038
PMCID: PMC2832825  PMID: 20130032
13.  Using Linkage Information to Weight a Genome-Wide Association of Bipolar Disorder 
Issues of multiple-testing and statistical significance in genome-wide association studies (GWAS) have prompted statistical methods utilizing prior data to increase the power of association results. Using prior findings from genome-wide linkage studies on bipolar disorder (BPD), we employed a weighted false discovery approach (wFDR; (Roeder et al. 2006)) to previously reported GWAS data drawn from the Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD). Using this method, association signals are up or down-weighted given the linkage score in that genomic region. Although no SNPs in our sample reached genome-wide significance through the wFDR approach, the strongest single SNP result from the original GWAS results (rs4939921 in myosin VB) is strongly up-weighted as it occurs on a linkage peak of chromosome 18. We also identify regions on chromosome 9, 17, and 18 where modestly associated SNP clusters coincide with strong linkage scores, implicating them as possible candidate regions for further analysis. Moving forward, we believe the application of prior linkage information will be increasingly useful to future GWAS studies that incorporate rarer variants into their analysis.
doi:10.1002/ajmg.b.31183
PMCID: PMC3082625  PMID: 21480485
14.  Detecting epistatic effects in association studies at a genomic level based on an ensemble approach 
Bioinformatics  2011;27(13):i222-i229.
Motivation: Most complex diseases involve multiple genes and their interactions. Although genome-wide association studies (GWAS) have shown some success for identifying genetic variants underlying complex diseases, most existing studies are based on limited single-locus approaches, which detect single nucleotide polymorphisms (SNPs) essentially based on their marginal associations with phenotypes.
Results: In this article, we propose an ensemble approach based on boosting to study gene–gene interactions. We extend the basic AdaBoost algorithm by incorporating an intuitive importance score based on Gini impurity to select candidate SNPs. Permutation tests are used to control the statistical significance. We have performed extensive simulation studies using three interaction models to evaluate the efficacy of our approach at realistic GWAS sizes, and have compared it with existing epistatic detection algorithms. Our results indicate that our approach is valid, efficient for GWAS and on disease models with epistasis has more power than existing programs.
Contact: jingli@case.edu
doi:10.1093/bioinformatics/btr227
PMCID: PMC3117367  PMID: 21685074
15.  DistiLD Database: diseases and traits in linkage disequilibrium blocks 
Nucleic Acids Research  2011;40(D1):D1036-D1040.
Genome-wide association studies (GWAS) have identified thousands of single nucleotide polymorphisms (SNPs) associated with the risk of hundreds of diseases. However, there is currently no database that enables non-specialists to answer the following simple questions: which SNPs associated with diseases are in linkage disequilibrium (LD) with a gene of interest? Which chromosomal regions have been associated with a given disease, and which are the potentially causal genes in each region? To answer these questions, we use data from the HapMap Project to partition each chromosome into so-called LD blocks, so that SNPs in LD with each other are preferentially in the same block, whereas SNPs not in LD are in different blocks. By projecting SNPs and genes onto LD blocks, the DistiLD database aims to increase usage of existing GWAS results by making it easy to query and visualize disease-associated SNPs and genes in their chromosomal context. The database is available at http://distild.jensenlab.org/.
doi:10.1093/nar/gkr899
PMCID: PMC3245128  PMID: 22058129
16.  regSNPs: a strategy for prioritizing regulatory single nucleotide substitutions 
Bioinformatics  2012;28(14):1879-1886.
Motivation: One of the fundamental questions in genetics study is to identify functional DNA variants that are responsible to a disease or phenotype of interest. Results from large-scale genetics studies, such as genome-wide association studies (GWAS), and the availability of high-throughput sequencing technologies provide opportunities in identifying causal variants. Despite the technical advances, informatics methodologies need to be developed to prioritize thousands of variants for potential causative effects.
Results: We present regSNPs, an informatics strategy that integrates several established bioinformatics tools, for prioritizing regulatory SNPs, i.e. the SNPs in the promoter regions that potentially affect phenotype through changing transcription of downstream genes. Comparing to existing tools, regSNPs has two distinct features. It considers degenerative features of binding motifs by calculating the differences on the binding affinity caused by the candidate variants and integrates potential phenotypic effects of various transcription factors. When tested by using the disease-causing variants documented in the Human Gene Mutation Database, regSNPs showed mixed performance on various diseases. regSNPs predicted three SNPs that can potentially affect bone density in a region detected in an earlier linkage study. Potential effects of one of the variants were validated using luciferase reporter assay.
Contact: yunliu@iupui.edu
Supplementary information: Supplementary data are available at Bioinformatics online
doi:10.1093/bioinformatics/bts275
PMCID: PMC3389767  PMID: 22611130
17.  A novel algorithm for simultaneous SNP selection in high-dimensional genome-wide association studies 
BMC Bioinformatics  2012;13:284.
Background
Identification of causal SNPs in most genome wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly modern computationally expensive regression methods are employed for SNP selection that consider all markers simultaneously and thus incorporate dependencies among SNPs.
Results
We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine the effectiveness in finding true causal SNPs.
Conclusions
Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here is available from http://strimmerlab.org/software/care/.
doi:10.1186/1471-2105-13-284
PMCID: PMC3558454  PMID: 23113980
18.  Functional Differences Exist between TNFα Promoters Encoding the Common −237G SNP and the Rarer HLA-B*5701-Linked A Variant 
PLoS ONE  2012;7(7):e40100.
A large body of functional and epidemiological evidence have previously illustrated the impact of specific MHC class I subtypes on clinical outcome during HIV-1 infection, and these observations have recently been re-iterated in genome wide association studies (GWAS). Yet because of the complexities surrounding GWAS-based approaches and the lack of knowledge relating to the identity of rarer single nucleotide polymorphism (SNP) variants, it has proved difficult to discover independent causal variants associated with favourable immune control. This is especially true of the candidate variants within the HLA region where many of the recently proposed disease influencing SNPs appear to reflect linkage with ‘protective’ MHC class I alleles. Yet causal MHC-linked SNPs may exist but remain overlooked owing to the complexities associated with their identification. Here we focus on the ancestral TNFα promoter −237A variant (rs361525), shown historically to be in complete linkage disequilibrium with the ‘protective’ HLA-B*5701 allele. Many of the ancestral SNPs within the extended TNFα promoter have been associated with both autoimmune conditions and disease outcomes, however, the direct role of these variants on TNFα expression remains controversial. Yet, because of the important role played by TNFα in HIV-1 infection, and given the proximity of the −237 SNP to the core promoter, its location within a putative repressor region previously characterized in mice, and its disruption of a methylation-susceptible CpG dinucleotide motif, we chose to carefully evaluate its impact on TNFα production. Using a variety of approaches we now demonstrate that carriage of the A SNP is associated with lower TNFα production, via a mechanism not readily explained by promoter methylation nor the binding of transcription factors or repressors. We propose that the −237A variant could represent a minor causal SNP that additionally contributes to the HLA-B*5701-mediated ‘protective’ effect during HIV-1 infection.
doi:10.1371/journal.pone.0040100
PMCID: PMC3396654  PMID: 22808100
19.  Estimating the posterior probability that genome-wide association findings are true or false 
Bioinformatics  2009;25(14):1807-1813.
Motivation: A limitation of current methods used to declare significance in genome-wide association studies (GWAS) is that they do not provide clear information about the probability that GWAS findings are true of false. This lack of information increases the chance of false discoveries and may result in real effects being missed.
Results: We propose a method to estimate the posterior probability that a marker has (no) effect given its test statistic value, also called the local false discovery rate (FDR), in the GWAS. A critical step involves the estimation the parameters of the distribution of the true alternative tests. For this, we derived and implemented the real maximum likelihood function, which turned out to provide us with significantly more accurate estimates than the widely used mixture model likelihood. Actual GWAS data are used to illustrate properties of the posterior probability estimates empirically. In addition to evaluating individual markers, a variety of applications are conceivable. For instance, posterior probability estimates can be used to control the FDR more precisely than Benjamini–Hochberg procedure.
Availability: The codes are freely downloadable from the web site http://www.people.vcu.edu/∼jbukszar.
Contact: jbukszar@vcu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp305
PMCID: PMC2705227  PMID: 19420056
20.  Meta-analysis of gene-based genome-wide association studies of bone mineral density in Chinese and European subjects 
Osteoporosis International  2011;23(1):131-142.
Summary
Gene-based association approach could be regarded as a complementary analysis to the single SNP association analysis. We meta-analyzed the findings from the gene-based association approach using the genome-wide association studies (GWAS) data from Chinese and European subjects, confirmed several well established bone mineral density (BMD) genes, and suggested several novel BMD genes.
Introduction
The introduction of GWAS has greatly increased the number of genes that are known to be associated with common diseases. Nonetheless, such a single SNP GWAS has a lower power to detect genes with multiple causal variants. We aimed to assess the association of each gene with BMD variation at the spine and hip using gene-based GWAS approach.
Methods
We studied 778 Hong Kong Southern Chinese (HKSC) women and 5,858 Northern Europeans (dCG); age, sex, and weight were adjusted in the model. The main outcome measure was BMD at the spine and hip.
Results
Nine genes showed suggestive p value in HKSC, while 4 and 17 genes showed significant and suggestive p values respectively in dCG. Meta-analysis using weighted Z-transformed test confirmed several known BMD genes and suggested some novel ones at 1q21.3, 9q22, 9q33.2, 20p13, and 20q12. Top BMD genes were significantly associated with connective tissue, skeletal, and muscular system development and function (p < 0.05). Gene network inference revealed that a large number of these genes were significantly connected with each other to form a functional gene network, and several signaling pathways were strongly connected with these gene networks.
Conclusion
Our gene-based GWAS confirmed several BMD genes and suggested several novel BMD genes. Genetic contribution to BMD variation may operate through multiple genes identified in this study in functional gene networks. This finding may be useful in identifying and prioritizing candidate genes/loci for further study.
Electronic supplementary material
The online version of this article (doi:10.1007/s00198-011-1779-7) contains supplementary material, which is available to authorized users.
doi:10.1007/s00198-011-1779-7
PMCID: PMC3249198  PMID: 21927923
Association study; Bone mineral density; Genetic epidemiology; Meta-analysis; Osteoporosis
21.  Use of Allele-Specific FAIRE to Determine Functional Regulatory Polymorphism Using Large-Scale Genotyping Arrays 
PLoS Genetics  2012;8(8):e1002908.
Following the widespread use of genome-wide association studies (GWAS), focus is turning towards identification of causal variants rather than simply genetic markers of diseases and traits. As a step towards a high-throughput method to identify genome-wide, non-coding, functional regulatory variants, we describe the technique of allele-specific FAIRE, utilising large-scale genotyping technology (FAIRE-gen) to determine allelic effects on chromatin accessibility and regulatory potential. FAIRE-gen was explored using lymphoblastoid cells and the 50,000 SNP Illumina CVD BeadChip. The technique identified an allele-specific regulatory polymorphism within NR1H3 (coding for LXR-α), rs7120118, coinciding with a previously GWAS-identified SNP for HDL-C levels. This finding was confirmed using FAIRE-gen with the 200,000 SNP Illumina Metabochip and verified with the established method of TaqMan allelic discrimination. Examination of this SNP in two prospective Caucasian cohorts comprising 15,000 individuals confirmed the association with HDL-C levels (combined beta = 0.016; p = 0.0006), and analysis of gene expression identified an allelic association with LXR-α expression in heart tissue. Using increasingly comprehensive genotyping chips and distinct tissues for examination, FAIRE-gen has the potential to aid the identification of many causal SNPs associated with disease from GWAS.
Author Summary
The identification of genetic variants associated with complex diseases has rapidly grown through lowering costs of genome sequencing and the use of large-scale genotyping chips based on this sequencing data. There have not been corresponding advances in the identification of causal genetic variants compared to variants simply associated with diseases or traits. Most of these causal variants are thought to be located not within regions coding for proteins, but within genomic regions that regulate the level of protein. We have combined the use of large-scale gene chips with functional analysis, to determine regions of the genome that confer a greater potential for controlling gene regulation dependent on the genotype of that individual. Combining this data with population data and gene expression data, we identify a potential causal variant that alters regulation of LXR-α, a key mediator in lipid metabolism, and show that this variant is associated with HDL-C levels. This methodology provides a model for future analyses to identify further causal variants for disease.
doi:10.1371/journal.pgen.1002908
PMCID: PMC3420950  PMID: 22916038
22.  Evaluating diabetes and hypertension disease causality using mouse phenotypes 
BMC Systems Biology  2010;4:97.
Background
Genome-wide association studies (GWAS) have found hundreds of single nucleotide polymorphisms (SNPs) associated with common diseases. However, it is largely unknown what genes linked with the SNPs actually implicate disease causality. A definitive proof for disease causality can be demonstration of disease-like phenotypes through genetic perturbation of the genes or alleles, which is obviously a daunting task for complex diseases where only mammalian models can be used.
Results
Here we tapped the rich resource of mouse phenotype data and developed a method to quantify the probability that a gene perturbation causes the phenotypes of a disease. Using type II diabetes (T2D) and hypertension (HT) as study cases, we found that the genes, when perturbed, having high probability to cause T2D and HT phenotypes tend to be hubs in the interactome networks and are enriched for signaling pathways regulating metabolism but not metabolic pathways, even though the genes in these metabolic pathways are often the most significantly changed in expression levels in these diseases.
Conclusions
Compared to human genetic disease-based predictions, our mouse phenotype based predictors greatly increased the coverage while keeping a similarly high specificity. The disease phenotype probabilities given by our approach can be used to evaluate the likelihood of disease causality of disease-associated genes and genes surrounding disease-associated SNPs.
doi:10.1186/1752-0509-4-97
PMCID: PMC2917432  PMID: 20642857
23.  A hidden Markov random field model for genome-wide association studies 
Biostatistics (Oxford, England)  2009;11(1):139-150.
Genome-wide association studies (GWAS) are increasingly utilized for identifying novel susceptible genetic variants for complex traits, but there is little consensus on analysis methods for such data. Most commonly used methods include single single nucleotide polymorphism (SNP) analysis or haplotype analysis with Bonferroni correction for multiple comparisons. Since the SNPs in typical GWAS are often in linkage disequilibrium (LD), at least locally, Bonferroni correction of multiple comparisons often leads to conservative error control and therefore lower statistical power. In this paper, we propose a hidden Markov random field model (HMRF) for GWAS analysis based on a weighted LD graph built from the prior LD information among the SNPs and an efficient iterative conditional mode algorithm for estimating the model parameters. This model effectively utilizes the LD information in calculating the posterior probability that an SNP is associated with the disease. These posterior probabilities can then be used to define a false discovery controlling procedure in order to select the disease-associated SNPs. Simulation studies demonstrated the potential gain in power over single SNP analysis. The proposed method is especially effective in identifying SNPs with borderline significance at the single-marker level that nonetheless are in high LD with significant SNPs. In addition, by simultaneously considering the SNPs in LD, the proposed method can also help to reduce the number of false identifications of disease-associated SNPs. We demonstrate the application of the proposed HMRF model using data from a case–control GWAS of neuroblastoma and identify 1 new SNP that is potentially associated with neuroblastoma.
doi:10.1093/biostatistics/kxp043
PMCID: PMC2800164  PMID: 19822692
Empirical Bayes; False discovery; Iterative conditional model; Linkage disequilibrium
24.  is-rSNP: a novel technique for in silico regulatory SNP detection 
Bioinformatics  2010;26(18):i524-i530.
Motivation: Determining the functional impact of non-coding disease-associated single nucleotide polymorphisms (SNPs) identified by genome-wide association studies (GWAS) is challenging. Many of these SNPs are likely to be regulatory SNPs (rSNPs): variations which affect the ability of a transcription factor (TF) to bind to DNA. However, experimental procedures for identifying rSNPs are expensive and labour intensive. Therefore, in silico methods are required for rSNP prediction. By scoring two alleles with a TF position weight matrix (PWM), it can be determined which SNPs are likely rSNPs. However, predictions in this manner are noisy and no method exists that determines the statistical significance of a nucleotide variation on a PWM score.
Results: We have designed an algorithm for in silico rSNP detection called is-rSNP. We employ novel convolution methods to determine the complete distributions of PWM scores and ratios between allele scores, facilitating assignment of statistical significance to rSNP effects. We have tested our method on 41 experimentally verified rSNPs, correctly predicting the disrupted TF in 28 cases. We also analysed 146 disease-associated SNPs with no known functional impact in an attempt to identify candidate rSNPs. Of the 11 significantly predicted disrupted TFs, 9 had previous evidence of being associated with the disease in the literature. These results demonstrate that is-rSNP is suitable for high-throughput screening of SNPs for potential regulatory function. This is a useful and important tool in the interpretation of GWAS.
Availability: is-rSNP software is available for use at: www.genomics.csse.unimelb.edu.au/is-rSNP
Contact: gmaci@csse.unimelb.edu.au; adam.kowalczyk@nicta.com.au
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq378
PMCID: PMC2935445  PMID: 20823317
25.  High-throughput analysis of epistasis in genome-wide association studies with BiForce 
Bioinformatics  2012;28(15):1957-1964.
Motivation: Gene–gene interactions (epistasis) are thought to be important in shaping complex traits, but they have been under-explored in genome-wide association studies (GWAS) due to the computational challenge of enumerating billions of single nucleotide polymorphism (SNP) combinations. Fast screening tools are needed to make epistasis analysis routinely available in GWAS.
Results: We present BiForce to support high-throughput analysis of epistasis in GWAS for either quantitative or binary disease (case–control) traits. BiForce achieves great computational efficiency by using memory efficient data structures, Boolean bitwise operations and multithreaded parallelization. It performs a full pair-wise genome scan to detect interactions involving SNPs with or without significant marginal effects using appropriate Bonferroni-corrected significance thresholds. We show that BiForce is more powerful and significantly faster than published tools for both binary and quantitative traits in a series of performance tests on simulated and real datasets. We demonstrate BiForce in analysing eight metabolic traits in a GWAS cohort (323 697 SNPs, >4500 individuals) and two disease traits in another (>340 000 SNPs, >1750 cases and 1500 controls) on a 32-node computing cluster. BiForce completed analyses of the eight metabolic traits within 1 day, identified nine epistatic pairs of SNPs in five metabolic traits and 18 SNP pairs in two disease traits. BiForce can make the analysis of epistasis a routine exercise in GWAS and thus improve our understanding of the role of epistasis in the genetic regulation of complex traits.
Availability and implementation: The software is free and can be downloaded from http://bioinfo.utu.fi/BiForce/.
Contact: wenhua.wei@igmm.ed.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts304
PMCID: PMC3400955  PMID: 22618535

Results 1-25 (638870)