Gene–gene interactions have an important role in complex human diseases. Detection of gene–gene interactions has long been a challenge due to their complexity. The standard method aiming at detecting SNP–SNP interactions may be inadequate as it does not model linkage disequilibrium (LD) among SNPs in each gene and may lose power due to a large number of comparisons. To improve power, we propose a principal component (PC)-based framework for gene-based interaction analysis. We analytically derive the optimal weight for both quantitative and binary traits based on pairwise LD information. We then use PCs to summarize the information in each gene and test for interactions between the PCs. We further extend this gene-based interaction analysis procedure to allow the use of imputation dosage scores obtained from a popular imputation software package, MACH, which incorporates multilocus LD information. To evaluate the performance of the gene-based interaction tests, we conducted extensive simulations under various settings. We demonstrate that gene-based interaction tests are more powerful than SNP-based tests when more than two variants interact with each other; moreover, tests that incorporate external LD information are generally more powerful than those that use genotyped markers only. We also apply the proposed gene-based interaction tests to a candidate gene study on high-density lipoprotein. As our method operates at the gene level, it can be applied to a genome-wide association setting and used as a screening tool to detect gene–gene interactions.
gene–gene interaction; linkage disequilibrium; imputation
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration that selects SNPs within functional modules from genome-wide SNPs based expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset are selected using optimal filtration criteria, with an error rate of 10.25%. Matching 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration record no significance compared to the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
Genetic association studies achieve an unprecedented level of resolution in mapping disease genes by genotyping dense SNPs in a gene region. Meanwhile, these studies require new powerful statistical tools that can optimally handle a large amount of information provided by genotype data. A question that arises is how to model interactions between two genes. Simply modeling all possible interactions between the SNPs in two gene regions is not desirable because a greatly increased number of degrees of freedom can be involved in the test statistic. We introduce an approach to reduce the genotype dimension in modeling interactions. The genotype compression of this approach is built upon the information on both the trait and the cross-locus gametic disequilibrium between SNPs in two interacting genes, in such a way as to parsimoniously model the interactions without loss of useful information in the process of dimension reduction. As a result, it improves power to detect association in the presence of gene-gene interactions. This approach can be similarly applied for modeling gene-environment interactions. We compare this method with other approaches: the corresponding test without modeling any interaction, that based on a saturated interaction model, that based on principal component analysis, and that based on Tukey’s 1-df model. Our simulations suggest that this new approach has superior power to that of the other methods. In an application to endometrial cancer case-control data from the Women’s Health Initiative (WHI), this approach detected AKT1 and AKT2 as being significantly associated with endometrial cancer susceptibility by taking into account their interactions with BMI.
Genetic association analysis; Interaction; Gene; Environment; Endometrial cancer
For genome-wide association studies in family-based designs, we propose a powerful two-stage testing strategy that can be applied in situations in which parent-offspring trio data are available and all offspring are affected with the trait or disease under study. In the first step of the testing strategy, we construct estimators of genetic effect size in the completely ascertained sample of affected offspring and their parents that are statistically independent of the family-based association/transmission disequilibrium tests (FBATs/TDTs) that are calculated in the second step of the testing strategy. For each marker, the genetic effect is estimated (without requiring an estimate of the SNP allele frequency) and the conditional power of the corresponding FBAT/TDT is computed. Based on the power estimates, a weighted Bonferroni procedure assigns an individually adjusted significance level to each SNP. In the second stage, the SNPs are tested with the FBAT/TDT statistic at the individually adjusted significance levels. Using simulation studies for scenarios with up to 1,000,000 SNPs, varying allele frequencies and genetic effect sizes, the power of the strategy is compared with standard methodology (e.g., FBATs/TDTs with Bonferroni correction). In all considered situations, the proposed testing strategy demonstrates substantial power increases over the standard approach, even when the true genetic model is unknown and must be selected based on the conditional power estimates. The practical relevance of our methodology is illustrated by an application to a genome-wide association study for childhood asthma, in which we detect two markers meeting genome-wide significance that would not have been detected using standard methodology.
The current state of genotyping technology has enabled researchers to conduct genome-wide association studies of up to 1,000,000 SNPs, allowing for systematic scanning of the genome for variants that might influence the development and progression of complex diseases. One of the largest obstacles to the successful detection of such variants is the multiple comparisons/testing problem in the genetic association analysis. For family-based designs in which all offspring are affected with the disease/trait under study, we developed a methodology that addresses this problem by partitioning the family-based data into two statistically independent components. The first component is used to screen the data and determine the most promising SNPs. The second component is used to test the SNPs for association, where information from the screening is used to weight the SNPs during testing. This methodology is more powerful than standard procedures for multiple comparisons adjustment (i.e., Bonferroni correction). Additionally, as only one data set is required for screening and testing, our testing strategy is less susceptible to study heterogeneity. Finally, as many family-based studies collect data only from affected offspring, this method addresses a major limitation of previous methodologies for multiple comparisons in family-based designs, which require variation in the disease/trait among offspring.
The identification of several hundred genomic regions affecting disease risk has proven the ability of genome-wide association studies have proven their ability to identify genetic contributors to disease. Currently, single-nucleotide polymorphism (SNP) association analysis is the most widely used method of genome-wide association data, but recent research shows that multi-marker tests of association may provide greater power, especially when more than one mutation is present within a gene and the mutations are in low linkage disequilibrium with each other. Here we use a multi-marker association test based on regression to SNPs located within known genes to obtain a gene-level score of association. We then perform pathway analysis using this score as a measure of gene importance. We use two tests of pathway enrichment - a binomial test and a random set method. By utilizing publicly available gene and pathway information, we identify B cell, cytokine and inflammation response, and antigen presentation pathways as being associated with rheumatoid arthritis. These results confirm known biological mechanisms for auto-immunity disorders, of which rheumatoid arthritis is one.
Various methods have been developed for identifying gene–gene interactions in genome-wide association studies (GWAS). However, most methods focus on individual markers as the testing unit, and the large number of such tests drastically erodes statistical power. In this study, we propose novel interaction tests of quantitative traits that are gene-based and that confer advantage in both statistical power and biological interpretation. The framework of gene-based gene–gene interaction (GGG) tests combine marker-based interaction tests between all pairs of markers in two genes to produce a gene-level test for interaction between the two. The tests are based on an analytical formula we derive for the correlation between marker-based interaction tests due to linkage disequilibrium. We propose four GGG tests that extend the following P value combining methods: minimum P value, extended Simes procedure, truncated tail strength, and truncated P value product. Extensive simulations point to correct type I error rates of all tests and show that the two truncated tests are more powerful than the other tests in cases of markers involved in the underlying interaction not being directly genotyped and in cases of multiple underlying interactions. We applied our tests to pairs of genes that exhibit a protein–protein interaction to test for gene-level interactions underlying lipid levels using genotype data from the Atherosclerosis Risk in Communities study. We identified five novel interactions that are not evident from marker-based interaction testing and successfully replicated one of these interactions, between SMAD3 and NEDD9, in an independent sample from the Multi-Ethnic Study of Atherosclerosis. We conclude that our GGG tests show improved power to identify gene-level interactions in existing, as well as emerging, association studies.
Epistasis is likely to play a significant role in complex diseases or traits and is one of the many possible explanations for “missing heritability.” However, epistatic interactions have been difficult to detect in genome-wide association studies (GWAS) due to the limited power caused by the multiple-testing correction from the large number of tests conducted. Gene-based gene–gene interaction (GGG) tests might hold the key to relaxing the multiple-testing correction burden and increasing the power for identifying epistatic interactions in GWAS. Here, we developed GGG tests of quantitative traits by extending four P value combining methods and evaluated their type I error rates and power using extensive simulations. All four GGG tests are more powerful than a principal component-based test. We also applied our GGG tests to data from the Atherosclerosis Risk in Communities study and found five gene-level interactions associated with the levels of total cholesterol and high-density lipoprotein cholesterol (HDL-C). One interaction between SMAD3 and NEDD9 on HDL-C was further replicated in an independent sample from the Multi-Ethnic Study of Atherosclerosis.
Linkage disequilibrium (LD)-based association mapping is often performed by analyzing either individual SNPs or block-based multi-SNP haplotypes. Sliding windows of several fixed sizes (in terms of SNP numbers) were also applied to a few simulated or real data sets. In comparison, exhaustively testing based on variable sized sliding windows (VSW) of all possible sizes of SNPs over a genomic region has the best chance to capture the optimum markers (single SNPs or haplotypes) that are most significantly associated with the traits under study. However, the cost is the increased number of multiple tests and computation. Here a strategy of VSW of all possible sizes is proposed and its power is examined, in comparison with those using only haplotype blocks (BLK) or single SNP loci (SGL) tests. Critical values for statistical significance testing that account for multiple testing are simulated. We demonstrated that, over a wide range of parameters simulated, VSW increased power for the detection of disease variants by ∼1-15% over the BLK and SGL approaches. The improved performance was more significant in regions with high recombination rates. In an empirical data set, VSW obtained the most significant signal and identified the LRP5 gene as strongly associated with osteoporosis. With the use of computational techniques such as parallel algorithms and clustering computing, it is feasible to apply VSW to large genomic regions or those regions preliminarily identified by traditional SGL/BLK methods.
sliding window; association mapping; statistical power
Linkage disequilibrium (LD)-based association mapping is often performed by analyzing either individual SNPs or block-based multi-SNP haplotypes. Sliding windows of several fixed sizes (in terms of SNP numbers) were also applied to a few simulated or real data sets. In comparison, exhaustively testing based on variable-sized sliding windows (VSW) of all possible sizes of SNPs over a genomic region has the best chance to capture the optimum markers (single SNPs or haplotypes) that are most significantly associated with the traits under study. However, the cost is the increased number of multiple tests and computation. Here, a strategy of VSW of all possible sizes is proposed and its power is examined, in comparison with those using only haplotype blocks (BLK) or single SNP loci (SGL) tests. Critical values for statistical significance testing that account for multiple testing are simulated. We demonstrated that, over a wide range of parameters simulated, VSW increased power for the detection of disease variants by ∼1–15% over the BLK and SGL approaches. The improved performance was more significant in regions with high recombination rates. In an empirical data set, VSW obtained the most significant signal and identified the LRP5 gene as strongly associated with osteoporosis. With the use of computational techniques such as parallel algorithms and clustering computing, it is feasible to apply VSW to large genomic regions or those regions preliminarily identified by traditional SGL/BLK methods.
sliding window; association mapping; statistical power
It is quite common that the genetic architecture of complex traits involves many genes and their interactions. Therefore, dealing with multiple unlinked genomic regions simultaneously is desirable.
In this paper we develop a regression-based approach to assess the interactions of haplotypes that belong to different unlinked regions, and we use score statistics to test the null hypothesis of non-genetic association. Additionally, multiple marker combinations at each unlinked region are considered. The multiple tests are settled via the minP approach. The P value of the "best" multi-region multi-marker configuration is corrected via Monte-Carlo simulations. Through simulation studies, we assess the performance of the proposed approach and demonstrate its validity and power in testing for haplotype interaction association.
Our simulations showed that, for binary trait without covariates, our proposed methods prove to be equal and even more powerful than htr and hapcc which are part of the FAMHAP program. Additionally, our model can be applied to a wider variety of traits and allow adjustment for other covariates. To test the validity, our methods are applied to analyze the association between four unlinked candidate genes and pig meat quality.
For many complex diseases, quantitative traits contain more information than dichotomous traits. One of the approaches used to analyse these traits in family-based association studies is the quantitative transmission disequilibrium test (QTDT). The QTDT is a regression-based approach that models simultaneously linkage and association. It splits up the association effect in a between- and a within-family genetic component to adjust and test for population stratification and includes a variance components method to model linkage. We extend this approach to detect gene–gene interactions between two unlinked QTLs by adjusting the definition of the between- and within-family component and the variance components included in the model. We simulate data to investigate the influence of the epistasis model, linkage disequilibrium patterns between the markers and the QTLs, and allele frequencies on the power and type I error rates of the approach. Results show that for some of the investigated settings, power gains are obtained in comparison with FAM-MDR. We conclude that our approach shows promising results for candidate-gene studies where too few markers are available to correct for population stratification using standard methods (for example EIGENSTRAT). The proposed method is applied to real-life data on hypertension from the FLEMENGHO study.
QTDT; epistasis; association; linkage
Association studies can focus on candidate gene(s), a particular genomic region, or adopt a genome wide association approach, each of which has implications for marker selection. The strategy for marker selection will affect the statistical power of the study to detect a disease association and is a crucial element of study design. The abundant single nucleotide polymorphisms (SNPs) are the markers of choice in genetic case-control association studies. The genotypes of neighbouring SNPs are often highly correlated (‘in linkage disequilibrium’ – LD) within a population which is utilised for selecting specific ‘tagSNPs’ to serve as proxies for other nearby SNPs in high LD. General guidelines for SNP selection in candidate genes/regions and genome-wide studies are provided in this protocol, along with illustrative examples. Publicly available web-based resources are utilised to browse and retrieve data and software such as Haploview and Goldsurfer2, are applied to investigate LD and to select tagSNPs.
gene; genetic marker; SNP; case-control study; association; design
Multivariate methods ranging from joint SNP to principal components analysis (PCA) have been developed for testing multiple markers in a region for association with disease and disease-related traits. However, these methods suffer from low power and/or the inability to identify the subset of markers contributing to evidence for association under various scenarios.
We introduce or-thoblique principal components-based clustering (OPCC) as an alternative approach to identify specific subsets of markers showing association with a quantitative outcome of interest. We demonstrate the utility of OPCC using simulation studies and an example from the literature on type 2 diabetes.
Compared to traditional methods, OPCC has similar or improved power under various scenarios of linkage disequilibrium structure and genotype availability. Most importantly, our simulations show how OPCC accurately parses large numbers of markers to a subset containing the causal variant or its proxy.
OPCC is a powerful and efficient data reduction method for detecting associations between gene variants and disease-related traits. Unlike alternative methodologies, OPCC has the ability to isolate the effect of causal SNP(s) from among large sets of markers in a candidate region. Therefore, OPCC is an improvement over PCA for testing multiple SNP associations With phenotypes Of interest.
Complex trait; Genotypes; Principal components; Cluster analysis
Recently, the use of linkage disequilibrium (LD) to locate genes which affect quantitative traits (QTL) has received an increasing interest, but the plausibility of fine mapping using linkage disequilibrium techniques for QTL has not been well studied. The main objectives of this work were to (1) measure the extent and pattern of LD between a putative QTL and nearby markers in finite populations and (2) investigate the usefulness of LD in fine mapping QTL in simulated populations using a dense map of multiallelic or biallelic marker loci. The test of association between a marker and QTL and the power of the test were calculated based on single-marker regression analysis. The results show the presence of substantial linkage disequilibrium with closely linked marker loci after 100 to 200 generations of random mating. Although the power to test the association with a frequent QTL of large effect was satisfactory, the power was low for the QTL with a small effect and/or low frequency. More powerful, multi-locus methods may be required to map low frequent QTL with small genetic effects, as well as combining both linkage and linkage disequilibrium information. The results also showed that multiallelic markers are more useful than biallelic markers to detect linkage disequilibrium and association at an equal distance.
linkage disequilibrium; quantitative trait locus; fine mapping
For more fruitful discoveries of genetic variants associated with diseases in genome-wide association studies, it is important to know whether joint analysis of multiple markers is more powerful than the commonly used single-marker analysis, especially in the presence of gene-gene interactions. This article provides a statistical framework to rigorously address this question through analytical power calculations for common model search strategies to detect binary trait loci: marginal search, exhaustive search, forward search, and two-stage screening search. Our approach incorporates linkage disequilibrium, random genotypes, and correlations among score test statistics of logistic regressions. We derive analytical results under two power definitions: the power of finding all the associated markers and the power of finding at least one associated marker. We also consider two types of error controls: the discovery number control and the Bonferroni type I error rate control. After demonstrating the accuracy of our analytical results by simulations, we apply them to consider a broad genetic model space to investigate the relative performances of different model search strategies. Our analytical study provides rapid computation as well as insights into the statistical mechanism of capturing genetic signals under different genetic models including gene-gene interactions. Even though we focus on genetic association analysis, our results on the power of model selection procedures are clearly very general and applicable to other studies.
model selection; statistical power; random predictor; genome-wide association studies; gene-gene interaction
Genome-wide association studies (GWAS) aim to detect single nucleotide polymorphisms (SNP) associated with trait variation. However, due to the large number of tests, standard analysis techniques impose highly stringent significance thresholds, leaving potentially associated SNPs undetected, and much of the trait genetic variation unexplained. Pathway- and network-based methodologies applied to GWAS aim to detect associations missed by standard single-marker approaches. The complex and non-random architecture of the genome makes it a challenge to derive an appropriate testing framework for such methodologies. We developed a rapid and simple permutation approach that uses GWAS SNP association results to establish the significance of pathway associations while accounting for the linkage disequilibrium structure of SNPs and the clustering of functionally related elements in the genome. All SNPs used in the GWAS are placed in a “circular genome” according to their location. Then the complete set of SNP association P values are permuted by rotation with respect to the genomic locations of the SNPs. Once these “simulated” P values are assigned, the joint gene P values are calculated using Fisher’s combination test, and the association of pathways is tested using the hypergeometric test. The circular genomic permutation approach was applied to a human genome-wide association dataset. The data consists of 719 individuals from the ORCADES study genotyped for ∼300,000 SNPs and measured for 51 traits ranging from physical to biochemical measurements. KEGG pathways (n = 225) were used as the sets of pathways to be tested. Our results demonstrate that the circular genomic permutations provide robust association P values. The non-permuted hypergeometric analysis generates ∼1400 pathway-trait combination results with an association P value more significant than P ≤ 0.05, whereas applying circular genomic permutation reduces the number of significant results to a more credible 40% of that value. The circular permutation software (“genomicper”) is available as an R package at http://cran.r-project.org/.
GWAS; pathway-based; permutation method; genomicper R package; cardiac disease
For a dense set of genetic markers such as single nucleotide polymorphisms (SNPs) on high linkage disequilibrium within a small candidate region, a haplotype-based approach for testing association between a disease phenotype and the set of markers is attractive in reducing the data complexity and increasing the statistical power. However, due to unknown status of the underlying disease variant, a comprehensive association test may require consideration of various combinations of the SNPs, which often leads to severe multiple testing problems. In this paper, we propose a latent variable approach to test for association of multiple tightly linked SNPs in case-control studies. First, we introduce a latent variable into the penetrance model to characterize a putative disease susceptible locus (DSL) that may consist of a marker allele, a haplotype from a subset of the markers, or an allele at a putative locus between the markers. Next, through using of a retrospective likelihood to adjust for the case-control sampling ascertainment and appropriately handle the Hardy-Weinberg equilibrium constraint, we develop an expectation-maximization (EM)-based algorithm to fit the penetrance model and estimate the joint haplotype frequencies of the DSL and markers simultaneously. With the latent variable to describe a flexible role of the DSL, the likelihood ratio statistic can then provide a joint association test for the set of markers without requiring an adjustment for testing of multiple haplotypes. Our simulation results also reveal that the latent variable approach may have improved power under certain scenarios comparing with classical haplotype association methods.
haplotype association; retrospective likelihood; latent variable; logistic mixture model; EM algorithm
Whole genome association studies using highly dense single nucleotide polymorphisms (SNPs) are a set of methods to identify DNA markers associated with variation in a particular complex trait of interest. One of the main outcomes from these studies is a subset of statistically significant SNPs. Finding the potential biological functions of such SNPs can be an important step towards further use in human and agricultural populations (e.g., for identifying genes related to susceptibility to complex diseases or genes playing key roles in development or performance). The current challenge is that the information holding the clues to SNP functions is distributed across many different databases. Efficient bioinformatics tools are therefore needed to seamlessly integrate up-to-date functional information on SNPs. Many web services have arisen to meet the challenge but most work only within the framework of human medical research. Although we acknowledge the importance of human research, we identify there is a need for SNP annotation tools for other organisms.
We introduce an R package called FunctSNP, which is the user interface to custom built species-specific databases. The local relational databases contain SNP data together with functional annotations extracted from online resources. FunctSNP provides a unified bioinformatics resource to link SNPs with functional knowledge (e.g., genes, pathways, ontologies). We also introduce dbAutoMaker, a suite of Perl scripts, which can be scheduled to run periodically to automatically create/update the customised SNP databases. We illustrate the use of FunctSNP with a livestock example, but the approach and software tools presented here can be applied also to human and other organisms.
Finding the potential functional significance of SNPs is important when further using the outcomes from whole genome association studies. FunctSNP is unique in that it is the only R package that links SNPs to functional annotation. FunctSNP interfaces to local SNP customised databases which can be built for any species contained in the National Center for Biotechnology Information dbSNP database.
Multi-marker methods for genetic association analysis can be performed for common and low frequency SNPs to improve power. Regression models are an intuitive way to formulate multi-marker tests. In previous studies we evaluated regression-based multi-marker tests for common SNPs, and through identification of bins consisting of correlated SNPs, developed a multi-bin linear combination (MLC) test that is a compromise between a 1 df linear combination test and a multi-df global test. Bins of SNPs in high linkage disequilibrium (LD) are identified, and a linear combination of individual SNP statistics is constructed within each bin. Then association with the phenotype is represented by an overall statistic with df as many or few as the number of bins. In this report we evaluate multi-marker tests for SNPs that occur at low frequencies. There are many linear and quadratic multi-marker tests that are suitable for common or low frequency variant analysis. We compared the performance of the MLC tests with various linear and quadratic statistics in joint or marginal regressions. For these comparisons, we performed a simulation study of genotypes and quantitative traits for 85 genes with many low frequency SNPs based on HapMap Phase III. We compared the tests using (1) set of all SNPs in a gene, (2) set of common SNPs in a gene (MAF ≥ 5%), (3) set of low frequency SNPs (1% ≤ MAF < 5%). For different trait models based on low frequency causal SNPs, we found that combined analysis using all SNPs including common and low frequency SNPs is a good and robust choice whereas using common SNPs alone or low frequency SNP alone can lose power. MLC tests performed well in combined analysis except where two low frequency causal SNPs with opposing effects are positively correlated. Overall, across different sets of analysis, the joint regression Wald test showed consistently good performance whereas other statistics including the ones based on marginal regression had lower power for some situations.
genetic association analysis; multi-marker association analysis; rare variant analysis; common variant analysis; multi-bin multi-marker tests; generalized Wald test; minimum p-value test; indirect association
Admixture mapping is a popular tool to identify regions of the genome associated with traits in a recently admixed population. Existing methods have been developed primarily for identification of a single locus influencing a dichotomous trait within a case-control study design. We propose a generalized admixture mapping (GLEAM) approach, a flexible and powerful regression method for both quantitative and qualitative traits, which is able to test for association between the trait and local ancestries in multiple loci simultaneously and adjust for covariates. The new method is based on the generalized linear model and uses a quadratic normal moment prior to incorporate admixture prior information. Through simulation, we demonstrate that GLEAM achieves lower type I error rate and higher power than ANCESTRYMAP both for qualitative traits and more significantly for quantitative traits. We applied GLEAM to genome-wide SNP data from the Illumina African American panel derived from a cohort of black women participating in the Healthy Pregnancy, Healthy Baby study and identified a locus on chromosome 2 associated with the averaged maternal mean arterial pressure during 24 to 28 weeks of pregnancy.
generalized linear model; local ancestry; mapping by admixture linkage disequilibrium; quadratic normal moment prior; quantitative traits
Next generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website.
As next-generation sequencing (NGS) costs continue to fall and genome-wide association study (GWAS) platform coverage improves, the human genetics community is positioned to identify potentially causal variants. However, current NGS or imputation-based studies of either the whole genome or regions previously identified by GWAS have not yet been very successful in identifying causal variants. A major hurdle is the development of methods to distinguish disease-causing variants from their highly-correlated proxies within an associated region. We show that various common factors, such as differential sequencing or imputation accuracy rates and linkage disequilibrium patterns, with or without GWAS-informed region selection, can substantially decrease the probability of identifying the correct causal SNP, often by more than half. We then describe a novel and easy-to-implement re-ranking procedure that can double the probability that the causal SNP is top-ranked in many settings. Application to the NCI Breast and Prostate Cancer (BPC3) Cohort Consortium aggressive prostate cancer data identified new top SNPs within two associated loci previously established via GWAS, as well as several additional possible causal SNPs that had been previously overlooked.
Osteoporosis is characterized by low bone mass and compromised bone structure, heritable traits that contribute to fracture risk. There have been no genome-wide association and linkage studies for these traits using high-density genotyping platforms.
We used the Affymetrix 100K SNP GeneChip marker set in the Framingham Heart Study (FHS) to examine genetic associations with ten primary quantitative traits: bone mineral density (BMD), calcaneal ultrasound, and geometric indices of the hip. To test associations with multivariable-adjusted residual trait values, we used additive generalized estimating equation (GEE) and family-based association tests (FBAT) models within each sex as well as sexes combined. We evaluated 70,987 autosomal SNPs with genotypic call rates ≥80%, HWE p ≥ 0.001, and MAF ≥10% in up to 1141 phenotyped individuals (495 men and 646 women, mean age 62.5 yrs). Variance component linkage analysis was performed using 11,200 markers.
Heritability estimates for all bone phenotypes were 30–66%. LOD scores ≥3.0 were found on chromosomes 15 (1.5 LOD confidence interval: 51,336,679–58,934,236 bp) and 22 (35,890,398–48,603,847 bp) for femoral shaft section modulus. The ten primary phenotypes had 12 associations with 100K SNPs in GEE models at p < 0.000001 and 2 associations in FBAT models at p < 0.000001. The 25 most significant p-values for GEE and FBAT were all less than 3.5 × 10-6 and 2.5 × 10-5, respectively. Of the 40 top SNPs with the greatest numbers of significantly associated BMD traits (including femoral neck, trochanter, and lumbar spine), one half to two-thirds were in or near genes that have not previously been studied for osteoporosis. Notably, pleiotropic associations between BMD and bone geometric traits were uncommon. Evidence for association (FBAT or GEE p < 0.05) was observed for several SNPs in candidate genes for osteoporosis, such as rs1801133 in MTHFR; rs1884052 and rs3778099 in ESR1; rs4988300 in LRP5; rs2189480 in VDR; rs2075555 in COLIA1; rs10519297 and rs2008691 in CYP19, as well as SNPs in PPARG (rs10510418 and rs2938392) and ANKH (rs2454873 and rs379016). All GEE, FBAT and linkage results are provided as an open-access results resource at .
The FHS 100K SNP project offers an unbiased genome-wide strategy to identify new candidate loci and to replicate previously suggested candidate genes for osteoporosis.
Genome-wide association studies of discrete traits generally use simple methods of analysis based on chi-square tests for contingency tables or logistic regression, at least for an initial scan of the entire genome. Nevertheless, more power might be obtained by using various methods that analyze multiple markers in combination. Methods based on sliding windows, wavelets, Bayesian shrinkage, or penalized likelihood methods, among others, were explored by various participants of Genetic Analysis Workshop 16 Group 1 to combine information across multiple markers within a region, while others used Bayesian variable selection methods for genome-wide multivariate analyses of all markers simultaneously. Imputation can be used to fill in missing markers on individual subjects within a study or in a meta-analysis of studies using different panels. Although multiple imputation theoretically should give more robust tests of association, one participant contribution found little difference between results of single and multiple imputation. Careful control of population stratification is essential, and two contributions found that previously reported associations with two genes disappeared after more precise control. Other issues considered by this group included subgroup analysis, gene-gene interactions, and the use of biomarkers.
rheumatoid arthritis; single-nucleotide polymorphisms; multi-marker associations; imputation; population stratification; gene-gene interactions; biomarkers
Genome-wide association studies (GWAS) may benefit from utilizing haplotype information for making marker-phenotype associations. Several rationales for grouping single nucleotide polymorphisms (SNPs) into haplotype blocks exist, but any advantage may depend on such factors as genetic architecture of traits, patterns of linkage disequilibrium in the study population, and marker density. The objective of this study was to explore the utility of haplotypes for GWAS in barley (Hordeum vulgare) to offer a first detailed look at this approach for identifying agronomically important genes in crops. To accomplish this, we used genotype and phenotype data from the Barley Coordinated Agricultural Project and constructed haplotypes using three different methods. Marker-trait associations were tested by the efficient mixed-model association algorithm (EMMA). When QTL were simulated using single SNPs dropped from the marker dataset, a simple sliding window performed as well or better than single SNPs or the more sophisticated methods of blocking SNPs into haplotypes. Moreover, the haplotype analyses performed better 1) when QTL were simulated as polymorphisms that arose subsequent to marker variants, and 2) in analysis of empirical heading date data. These results demonstrate that the information content of haplotypes is dependent on the particular mutational and recombinational history of the QTL and nearby markers. Analysis of the empirical data also confirmed our intuition that the distribution of QTL alleles in nature is often unlike the distribution of marker variants, and hence utilizing haplotype information could capture associations that would elude single SNPs. We recommend routine use of both single SNP and haplotype markers for GWAS to take advantage of the full information content of the genotype data.
We describe composite likelihood-based analysis of a genome-wide breast cancer case–control sample from the Cancer Genetic Markers of Susceptibility project. We determine 14 380 genome regions of fixed size on a linkage disequilibrium (LD) map, which delimit comparable levels of LD. Although the numbers of single-nucleotide polymorphisms (SNPs) are highly variable, each region contains an average of ∼35 SNPs and an average of ∼69 after imputation of missing genotypes. Composite likelihood association mapping yields a single P-value for each region, established by a permutation test, along with a maximum likelihood disease location, SE and information weight. For single SNP analysis, the nominal P-value for the most significant SNP (msSNP) requires substantial correction given the number of SNPs in the region. Therefore, imputing genotypes may not always be advantageous for the msSNP test, in contrast to composite likelihood. For the region containing FGFR2 (a known breast cancer gene) the largest χ2 is obtained under composite likelihood with imputed genotypes (χ22 increases from 20.6 to 22.7), and compares with a single SNP-based χ22 of 19.9 after correction. Imputation of additional genotypes in this region reduces the size of the 95% confidence interval for location of the disease gene by ∼40%. Among the highest ranked regions, SNPs in the NTSR1 gene would be worthy of examination in additional samples. Meta-analysis, which combines weighted evidence from composite likelihood in different samples, and refines putative disease locations, is facilitated through defining fixed regions on an underlying LD map.
composite likelihood; association mapping; breast cancer; imputed genotypes; FGFR2 gene
Genome-wide association studies (GWAS) test hundreds of thousands of single-nucleotide polymorphisms (SNPs) for association to a trait, treating each marker equally and ignoring prior evidence of association to specific regions. Typically, promising regions are selected for further investigation based on p-values obtained from simple tests of association. However, loci that exert only a weak, low-penetrant role on the trait, producing modest evidence of association, are not detectable in the context of a GWAS. Implementing prior knowledge of association in GWAS could increase power, help distinguish between false and true positives, and identify better sets of SNPs for follow-up studies.
Here we performed a GWAS on rheumatoid arthritis (RA) patients and controls (Problem 1, Genetic Analysis Workshop 16). In order to include prior information in the analysis, we applied four methods that distinctively deal with markers in candidate genes in the context of GWAS. SNPs were divided into a random and a candidate subset, then we applied empirical correction by permutation, false-discovery rate, false-positive report probability, and posterior odds of association using different prior probabilities. We repeated the same analyses on two different sets of candidate markers defined on the basis of previously reported association to RA following two different approaches. The four methods showed similar relative behavior when applied to the two sets, with the proportion of candidate SNPs ranked among the top 2,000 varying from 0 to 100%. The use of different prior probabilities changed the stringency of the methods, but not their relative performance.