Gene–gene interactions have an important role in complex human diseases. Detection of gene–gene interactions has long been a challenge due to their complexity. The standard method aiming at detecting SNP–SNP interactions may be inadequate as it does not model linkage disequilibrium (LD) among SNPs in each gene and may lose power due to a large number of comparisons. To improve power, we propose a principal component (PC)-based framework for gene-based interaction analysis. We analytically derive the optimal weight for both quantitative and binary traits based on pairwise LD information. We then use PCs to summarize the information in each gene and test for interactions between the PCs. We further extend this gene-based interaction analysis procedure to allow the use of imputation dosage scores obtained from a popular imputation software package, MACH, which incorporates multilocus LD information. To evaluate the performance of the gene-based interaction tests, we conducted extensive simulations under various settings. We demonstrate that gene-based interaction tests are more powerful than SNP-based tests when more than two variants interact with each other; moreover, tests that incorporate external LD information are generally more powerful than those that use genotyped markers only. We also apply the proposed gene-based interaction tests to a candidate gene study on high-density lipoprotein. As our method operates at the gene level, it can be applied to a genome-wide association setting and used as a screening tool to detect gene–gene interactions.
gene–gene interaction; linkage disequilibrium; imputation
Genetic association studies achieve an unprecedented level of resolution in mapping disease genes by genotyping dense SNPs in a gene region. Meanwhile, these studies require new powerful statistical tools that can optimally handle a large amount of information provided by genotype data. A question that arises is how to model interactions between two genes. Simply modeling all possible interactions between the SNPs in two gene regions is not desirable because a greatly increased number of degrees of freedom can be involved in the test statistic. We introduce an approach to reduce the genotype dimension in modeling interactions. The genotype compression of this approach is built upon the information on both the trait and the cross-locus gametic disequilibrium between SNPs in two interacting genes, in such a way as to parsimoniously model the interactions without loss of useful information in the process of dimension reduction. As a result, it improves power to detect association in the presence of gene-gene interactions. This approach can be similarly applied for modeling gene-environment interactions. We compare this method with other approaches: the corresponding test without modeling any interaction, that based on a saturated interaction model, that based on principal component analysis, and that based on Tukey’s 1-df model. Our simulations suggest that this new approach has superior power to that of the other methods. In an application to endometrial cancer case-control data from the Women’s Health Initiative (WHI), this approach detected AKT1 and AKT2 as being significantly associated with endometrial cancer susceptibility by taking into account their interactions with BMI.
Genetic association analysis; Interaction; Gene; Environment; Endometrial cancer
For more fruitful discoveries of genetic variants associated with diseases in genome-wide association studies, it is important to know whether joint analysis of multiple markers is more powerful than the commonly used single-marker analysis, especially in the presence of gene-gene interactions. This article provides a statistical framework to rigorously address this question through analytical power calculations for common model search strategies to detect binary trait loci: marginal search, exhaustive search, forward search, and two-stage screening search. Our approach incorporates linkage disequilibrium, random genotypes, and correlations among score test statistics of logistic regressions. We derive analytical results under two power definitions: the power of finding all the associated markers and the power of finding at least one associated marker. We also consider two types of error controls: the discovery number control and the Bonferroni type I error rate control. After demonstrating the accuracy of our analytical results by simulations, we apply them to consider a broad genetic model space to investigate the relative performances of different model search strategies. Our analytical study provides rapid computation as well as insights into the statistical mechanism of capturing genetic signals under different genetic models including gene-gene interactions. Even though we focus on genetic association analysis, our results on the power of model selection procedures are clearly very general and applicable to other studies.
model selection; statistical power; random predictor; genome-wide association studies; gene-gene interaction
Various methods have been developed for identifying gene–gene interactions in genome-wide association studies (GWAS). However, most methods focus on individual markers as the testing unit, and the large number of such tests drastically erodes statistical power. In this study, we propose novel interaction tests of quantitative traits that are gene-based and that confer advantage in both statistical power and biological interpretation. The framework of gene-based gene–gene interaction (GGG) tests combine marker-based interaction tests between all pairs of markers in two genes to produce a gene-level test for interaction between the two. The tests are based on an analytical formula we derive for the correlation between marker-based interaction tests due to linkage disequilibrium. We propose four GGG tests that extend the following P value combining methods: minimum P value, extended Simes procedure, truncated tail strength, and truncated P value product. Extensive simulations point to correct type I error rates of all tests and show that the two truncated tests are more powerful than the other tests in cases of markers involved in the underlying interaction not being directly genotyped and in cases of multiple underlying interactions. We applied our tests to pairs of genes that exhibit a protein–protein interaction to test for gene-level interactions underlying lipid levels using genotype data from the Atherosclerosis Risk in Communities study. We identified five novel interactions that are not evident from marker-based interaction testing and successfully replicated one of these interactions, between SMAD3 and NEDD9, in an independent sample from the Multi-Ethnic Study of Atherosclerosis. We conclude that our GGG tests show improved power to identify gene-level interactions in existing, as well as emerging, association studies.
Epistasis is likely to play a significant role in complex diseases or traits and is one of the many possible explanations for “missing heritability.” However, epistatic interactions have been difficult to detect in genome-wide association studies (GWAS) due to the limited power caused by the multiple-testing correction from the large number of tests conducted. Gene-based gene–gene interaction (GGG) tests might hold the key to relaxing the multiple-testing correction burden and increasing the power for identifying epistatic interactions in GWAS. Here, we developed GGG tests of quantitative traits by extending four P value combining methods and evaluated their type I error rates and power using extensive simulations. All four GGG tests are more powerful than a principal component-based test. We also applied our GGG tests to data from the Atherosclerosis Risk in Communities study and found five gene-level interactions associated with the levels of total cholesterol and high-density lipoprotein cholesterol (HDL-C). One interaction between SMAD3 and NEDD9 on HDL-C was further replicated in an independent sample from the Multi-Ethnic Study of Atherosclerosis.
Due to the low statistical power of individual markers from a genome-wide association study (GWAS), detecting causal single nucleotide polymorphisms (SNPs) for complex diseases is a challenge. SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity.
We aim to detect type 2 diabetes (T2D) causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. T2D causal SNP combinations and genome-wide SNPs are mapped into functional modules using expanded gene set enrichment analysis (GSEA) considering pathway, transcription factor (TF)-target, miRNA-target, gene ontology, and protein complex functional modules. The prediction error rates are measured for SNP sets from functional module-based filtration that selects SNPs within functional modules from genome-wide SNPs based expanded GSEA.
A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset are selected using optimal filtration criteria, with an error rate of 10.25%. Matching 101 SNPs with known T2D genes and functional modules reveals the relationships between T2D and SNP combinations. The prediction error rates of SNP sets from functional module-based filtration record no significance compared to the prediction error rates of randomly selected SNP sets and T2D causal SNP combinations from optimal filtration.
We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
For genome-wide association studies in family-based designs, we propose a powerful two-stage testing strategy that can be applied in situations in which parent-offspring trio data are available and all offspring are affected with the trait or disease under study. In the first step of the testing strategy, we construct estimators of genetic effect size in the completely ascertained sample of affected offspring and their parents that are statistically independent of the family-based association/transmission disequilibrium tests (FBATs/TDTs) that are calculated in the second step of the testing strategy. For each marker, the genetic effect is estimated (without requiring an estimate of the SNP allele frequency) and the conditional power of the corresponding FBAT/TDT is computed. Based on the power estimates, a weighted Bonferroni procedure assigns an individually adjusted significance level to each SNP. In the second stage, the SNPs are tested with the FBAT/TDT statistic at the individually adjusted significance levels. Using simulation studies for scenarios with up to 1,000,000 SNPs, varying allele frequencies and genetic effect sizes, the power of the strategy is compared with standard methodology (e.g., FBATs/TDTs with Bonferroni correction). In all considered situations, the proposed testing strategy demonstrates substantial power increases over the standard approach, even when the true genetic model is unknown and must be selected based on the conditional power estimates. The practical relevance of our methodology is illustrated by an application to a genome-wide association study for childhood asthma, in which we detect two markers meeting genome-wide significance that would not have been detected using standard methodology.
The current state of genotyping technology has enabled researchers to conduct genome-wide association studies of up to 1,000,000 SNPs, allowing for systematic scanning of the genome for variants that might influence the development and progression of complex diseases. One of the largest obstacles to the successful detection of such variants is the multiple comparisons/testing problem in the genetic association analysis. For family-based designs in which all offspring are affected with the disease/trait under study, we developed a methodology that addresses this problem by partitioning the family-based data into two statistically independent components. The first component is used to screen the data and determine the most promising SNPs. The second component is used to test the SNPs for association, where information from the screening is used to weight the SNPs during testing. This methodology is more powerful than standard procedures for multiple comparisons adjustment (i.e., Bonferroni correction). Additionally, as only one data set is required for screening and testing, our testing strategy is less susceptible to study heterogeneity. Finally, as many family-based studies collect data only from affected offspring, this method addresses a major limitation of previous methodologies for multiple comparisons in family-based designs, which require variation in the disease/trait among offspring.
The identification of several hundred genomic regions affecting disease risk has proven the ability of genome-wide association studies have proven their ability to identify genetic contributors to disease. Currently, single-nucleotide polymorphism (SNP) association analysis is the most widely used method of genome-wide association data, but recent research shows that multi-marker tests of association may provide greater power, especially when more than one mutation is present within a gene and the mutations are in low linkage disequilibrium with each other. Here we use a multi-marker association test based on regression to SNPs located within known genes to obtain a gene-level score of association. We then perform pathway analysis using this score as a measure of gene importance. We use two tests of pathway enrichment - a binomial test and a random set method. By utilizing publicly available gene and pathway information, we identify B cell, cytokine and inflammation response, and antigen presentation pathways as being associated with rheumatoid arthritis. These results confirm known biological mechanisms for auto-immunity disorders, of which rheumatoid arthritis is one.
Whole genome association studies using highly dense single nucleotide polymorphisms (SNPs) are a set of methods to identify DNA markers associated with variation in a particular complex trait of interest. One of the main outcomes from these studies is a subset of statistically significant SNPs. Finding the potential biological functions of such SNPs can be an important step towards further use in human and agricultural populations (e.g., for identifying genes related to susceptibility to complex diseases or genes playing key roles in development or performance). The current challenge is that the information holding the clues to SNP functions is distributed across many different databases. Efficient bioinformatics tools are therefore needed to seamlessly integrate up-to-date functional information on SNPs. Many web services have arisen to meet the challenge but most work only within the framework of human medical research. Although we acknowledge the importance of human research, we identify there is a need for SNP annotation tools for other organisms.
We introduce an R package called FunctSNP, which is the user interface to custom built species-specific databases. The local relational databases contain SNP data together with functional annotations extracted from online resources. FunctSNP provides a unified bioinformatics resource to link SNPs with functional knowledge (e.g., genes, pathways, ontologies). We also introduce dbAutoMaker, a suite of Perl scripts, which can be scheduled to run periodically to automatically create/update the customised SNP databases. We illustrate the use of FunctSNP with a livestock example, but the approach and software tools presented here can be applied also to human and other organisms.
Finding the potential functional significance of SNPs is important when further using the outcomes from whole genome association studies. FunctSNP is unique in that it is the only R package that links SNPs to functional annotation. FunctSNP interfaces to local SNP customised databases which can be built for any species contained in the National Center for Biotechnology Information dbSNP database.
Genome-wide association studies (GWAS) may benefit from utilizing haplotype information for making marker-phenotype associations. Several rationales for grouping single nucleotide polymorphisms (SNPs) into haplotype blocks exist, but any advantage may depend on such factors as genetic architecture of traits, patterns of linkage disequilibrium in the study population, and marker density. The objective of this study was to explore the utility of haplotypes for GWAS in barley (Hordeum vulgare) to offer a first detailed look at this approach for identifying agronomically important genes in crops. To accomplish this, we used genotype and phenotype data from the Barley Coordinated Agricultural Project and constructed haplotypes using three different methods. Marker-trait associations were tested by the efficient mixed-model association algorithm (EMMA). When QTL were simulated using single SNPs dropped from the marker dataset, a simple sliding window performed as well or better than single SNPs or the more sophisticated methods of blocking SNPs into haplotypes. Moreover, the haplotype analyses performed better 1) when QTL were simulated as polymorphisms that arose subsequent to marker variants, and 2) in analysis of empirical heading date data. These results demonstrate that the information content of haplotypes is dependent on the particular mutational and recombinational history of the QTL and nearby markers. Analysis of the empirical data also confirmed our intuition that the distribution of QTL alleles in nature is often unlike the distribution of marker variants, and hence utilizing haplotype information could capture associations that would elude single SNPs. We recommend routine use of both single SNP and haplotype markers for GWAS to take advantage of the full information content of the genotype data.
It is quite common that the genetic architecture of complex traits involves many genes and their interactions. Therefore, dealing with multiple unlinked genomic regions simultaneously is desirable.
In this paper we develop a regression-based approach to assess the interactions of haplotypes that belong to different unlinked regions, and we use score statistics to test the null hypothesis of non-genetic association. Additionally, multiple marker combinations at each unlinked region are considered. The multiple tests are settled via the minP approach. The P value of the "best" multi-region multi-marker configuration is corrected via Monte-Carlo simulations. Through simulation studies, we assess the performance of the proposed approach and demonstrate its validity and power in testing for haplotype interaction association.
Our simulations showed that, for binary trait without covariates, our proposed methods prove to be equal and even more powerful than htr and hapcc which are part of the FAMHAP program. Additionally, our model can be applied to a wider variety of traits and allow adjustment for other covariates. To test the validity, our methods are applied to analyze the association between four unlinked candidate genes and pig meat quality.
Linkage disequilibrium (LD)-based association mapping is often performed by analyzing either individual SNPs or block-based multi-SNP haplotypes. Sliding windows of several fixed sizes (in terms of SNP numbers) were also applied to a few simulated or real data sets. In comparison, exhaustively testing based on variable sized sliding windows (VSW) of all possible sizes of SNPs over a genomic region has the best chance to capture the optimum markers (single SNPs or haplotypes) that are most significantly associated with the traits under study. However, the cost is the increased number of multiple tests and computation. Here a strategy of VSW of all possible sizes is proposed and its power is examined, in comparison with those using only haplotype blocks (BLK) or single SNP loci (SGL) tests. Critical values for statistical significance testing that account for multiple testing are simulated. We demonstrated that, over a wide range of parameters simulated, VSW increased power for the detection of disease variants by ∼1-15% over the BLK and SGL approaches. The improved performance was more significant in regions with high recombination rates. In an empirical data set, VSW obtained the most significant signal and identified the LRP5 gene as strongly associated with osteoporosis. With the use of computational techniques such as parallel algorithms and clustering computing, it is feasible to apply VSW to large genomic regions or those regions preliminarily identified by traditional SGL/BLK methods.
sliding window; association mapping; statistical power
Linkage disequilibrium (LD)-based association mapping is often performed by analyzing either individual SNPs or block-based multi-SNP haplotypes. Sliding windows of several fixed sizes (in terms of SNP numbers) were also applied to a few simulated or real data sets. In comparison, exhaustively testing based on variable-sized sliding windows (VSW) of all possible sizes of SNPs over a genomic region has the best chance to capture the optimum markers (single SNPs or haplotypes) that are most significantly associated with the traits under study. However, the cost is the increased number of multiple tests and computation. Here, a strategy of VSW of all possible sizes is proposed and its power is examined, in comparison with those using only haplotype blocks (BLK) or single SNP loci (SGL) tests. Critical values for statistical significance testing that account for multiple testing are simulated. We demonstrated that, over a wide range of parameters simulated, VSW increased power for the detection of disease variants by ∼1–15% over the BLK and SGL approaches. The improved performance was more significant in regions with high recombination rates. In an empirical data set, VSW obtained the most significant signal and identified the LRP5 gene as strongly associated with osteoporosis. With the use of computational techniques such as parallel algorithms and clustering computing, it is feasible to apply VSW to large genomic regions or those regions preliminarily identified by traditional SGL/BLK methods.
sliding window; association mapping; statistical power
Meta-analysis of genome-wide association studies involves testing single nucleotide polymorphisms (SNPs) using summary statistics that are weighted sums of site-specific score or Wald statistics. This approach avoids having to pool individual-level data. We describe the weights that maximize the power of the summary statistics. For small effect-sizes, any choice of weights yields summary Wald and score statistics with the same power, and the optimal weights are proportional to the square roots of the sites' Fisher information for the SNP's regression coefficient. When SNP effect size is constant across sites, the optimal summary Wald statistic is the well-known inverse-variance-weighted combination of estimated regression coefficients, divided by its standard deviation. We give simple approximations to the optimal weights for various phenotypes, and show that weights proportional to the square roots of study sizes are suboptimal for data from case-control studies with varying case-control ratios, for quantitative trait data when the trait variance differs across sites, for count data when the site-specific mean counts differ, and for survival data with different proportions of failing subjects. Simulations suggest that weights that accommodate inter-site variation in imputation error give little power gain compared to those obtained ignoring imputation uncertainties. We note advantages to combining site-specific score statistics, and we show how they can be used to assess effect-size heterogeneity across sites. The utility of the summary score statistic is illustrated by application to a meta-analysis of schizophrenia data in which only site-specific p-values and directions of association are available.
combining GWAS; effect-size heterogeneity; meta-analysis; noncentrality parameter; score statistics; Wald statistics
The potential benefits of using population isolates in genetic mapping, such as reduced genetic, phenotypic and environmental heterogeneity, are offset by the challenges posed by the large amounts of direct and cryptic relatedness in these populations confounding basic assumptions of independence. We have evaluated four representative specialized methods for association testing in the presence of relatedness; (i) within-family (ii) within- and between-family and (iii) mixed-models methods, using simulated traits for 2906 subjects with known genome-wide genotype data from an extremely isolated population, the Island of Kosrae, Federated States of Micronesia. We report that mixed models optimally extract association information from such samples, demonstrating 88% power to rank the true variant as among the top 10 genome-wide with 56% achieving genome-wide significance, a >80% improvement over the other methods, and demonstrate that population isolates have similar power to non-isolate populations for observing variants of known effects. We then used the mixed-model method to reanalyze data for 17 published phenotypes relating to metabolic traits and electrocardiographic measures, along with another 8 previously unreported. We replicate nine genome-wide significant associations with known loci of plasma cholesterol, high-density lipoprotein, low-density lipoprotein, triglycerides, thyroid stimulating hormone, homocysteine, C-reactive protein and uric acid, with only one detected in the previous analysis of the same traits. Further, we leveraged shared identity-by-descent genetic segments in the region of the uric acid locus to fine-map the signal, refining the known locus by a factor of 4. Finally, we report a novel associations for height (rs17629022, P< 2.1 × 10−8).
For many complex diseases, quantitative traits contain more information than dichotomous traits. One of the approaches used to analyse these traits in family-based association studies is the quantitative transmission disequilibrium test (QTDT). The QTDT is a regression-based approach that models simultaneously linkage and association. It splits up the association effect in a between- and a within-family genetic component to adjust and test for population stratification and includes a variance components method to model linkage. We extend this approach to detect gene–gene interactions between two unlinked QTLs by adjusting the definition of the between- and within-family component and the variance components included in the model. We simulate data to investigate the influence of the epistasis model, linkage disequilibrium patterns between the markers and the QTLs, and allele frequencies on the power and type I error rates of the approach. Results show that for some of the investigated settings, power gains are obtained in comparison with FAM-MDR. We conclude that our approach shows promising results for candidate-gene studies where too few markers are available to correct for population stratification using standard methods (for example EIGENSTRAT). The proposed method is applied to real-life data on hypertension from the FLEMENGHO study.
QTDT; epistasis; association; linkage
Association studies can focus on candidate gene(s), a particular genomic region, or adopt a genome wide association approach, each of which has implications for marker selection. The strategy for marker selection will affect the statistical power of the study to detect a disease association and is a crucial element of study design. The abundant single nucleotide polymorphisms (SNPs) are the markers of choice in genetic case-control association studies. The genotypes of neighbouring SNPs are often highly correlated (‘in linkage disequilibrium’ – LD) within a population which is utilised for selecting specific ‘tagSNPs’ to serve as proxies for other nearby SNPs in high LD. General guidelines for SNP selection in candidate genes/regions and genome-wide studies are provided in this protocol, along with illustrative examples. Publicly available web-based resources are utilised to browse and retrieve data and software such as Haploview and Goldsurfer2, are applied to investigate LD and to select tagSNPs.
gene; genetic marker; SNP; case-control study; association; design
This work demonstrates how gene association studies can be analyzed to map a global landscape of genetic interactions among protein complexes and pathways. Despite the immense potential of gene association studies, they have been challenging to analyze because most traits are complex, involving the combined effect of mutations at many different genes. Due to lack of statistical power, only the strongest single markers are typically identified. Here, we present an integrative approach that greatly increases power through marker clustering and projection of marker interactions within and across protein complexes. Applied to a recent gene association study in yeast, this approach identifies 2,023 genetic interactions which map to 208 functional interactions among protein complexes. We show that such interactions are analogous to interactions derived through reverse genetic screens and that they provide coverage in areas not yet tested by reverse genetic analysis. This work has the potential to transform gene association studies, by elevating the analysis from the level of individual markers to global maps of genetic interactions. As proof of principle, we use synthetic genetic screens to confirm numerous novel genetic interactions for the INO80 chromatin remodeling complex.
One of the most important problems in biology and medicine is to identify the genetic mutations that affect human traits such as blood pressure, longevity, and onset of disease. Currently, large scientific teams are examining the genomes of thousands of people in an attempt to find mutations present only in individuals with certain traits. Until now, mutations have been largely examined in isolation, without regard to how they work together inside the cell. However, large pathway maps are now available which describe in detail the network of genes and proteins that underlies cell function. Here we show how to take advantage of these pathway maps to better identify relevant mutations and to show how these mutations work mechanistically. This basic approach of combining genetic information with known maps of the cell will have wide-ranging applications in understanding and treating disease.
Multivariate methods ranging from joint SNP to principal components analysis (PCA) have been developed for testing multiple markers in a region for association with disease and disease-related traits. However, these methods suffer from low power and/or the inability to identify the subset of markers contributing to evidence for association under various scenarios.
We introduce or-thoblique principal components-based clustering (OPCC) as an alternative approach to identify specific subsets of markers showing association with a quantitative outcome of interest. We demonstrate the utility of OPCC using simulation studies and an example from the literature on type 2 diabetes.
Compared to traditional methods, OPCC has similar or improved power under various scenarios of linkage disequilibrium structure and genotype availability. Most importantly, our simulations show how OPCC accurately parses large numbers of markers to a subset containing the causal variant or its proxy.
OPCC is a powerful and efficient data reduction method for detecting associations between gene variants and disease-related traits. Unlike alternative methodologies, OPCC has the ability to isolate the effect of causal SNP(s) from among large sets of markers in a candidate region. Therefore, OPCC is an improvement over PCA for testing multiple SNP associations With phenotypes Of interest.
Complex trait; Genotypes; Principal components; Cluster analysis
The information provided by dense genome-wide markers using high throughput technology is of considerable potential in human disease studies and livestock breeding programs. Genome-wide association studies relate individual single nucleotide polymorphisms (SNP) from dense SNP panels to individual measurements of complex traits, with the underlying assumption being that any association is caused by linkage disequilibrium (LD) between SNP and quantitative trait loci (QTL) affecting the trait. Often SNP are in genomic regions of no trait variation. Whole genome Bayesian models are an effective way of incorporating this and other important prior information into modelling. However a full Bayesian analysis is often not feasible due to the large computational time involved.
This article proposes an expectation-maximization (EM) algorithm called emBayesB which allows only a proportion of SNP to be in LD with QTL and incorporates prior information about the distribution of SNP effects. The posterior probability of being in LD with at least one QTL is calculated for each SNP along with estimates of the hyperparameters for the mixture prior. A simulated example of genomic selection from an international workshop is used to demonstrate the features of the EM algorithm. The accuracy of prediction is comparable to a full Bayesian analysis but the EM algorithm is considerably faster. The EM algorithm was accurate in locating QTL which explained more than 1% of the total genetic variation. A computational algorithm for very large SNP panels is described.
emBayesB is a fast and accurate EM algorithm for implementing genomic selection and predicting complex traits by mapping QTL in genome-wide dense SNP marker data. Its accuracy is similar to Bayesian methods but it takes only a fraction of the time.
Recent advances in high-throughput genotyping and transcript profiling technologies have enabled the inexpensive production of genome-wide dense marker maps in tandem with huge amounts of expression profiles. These large-scale data encompass valuable information about the genetic architecture of important phenotypic traits. Comprehensive models that combine molecular markers and gene transcript levels are increasingly advocated as an effective approach to dissecting the genetic architecture of complex phenotypic traits. The simultaneous utilization of marker and gene expression data to explain the variation in clinical quantitative trait, known as clinical quantitative trait locus (cQTL) mapping, poses challenges that are both conceptual and computational. Nonetheless, the hierarchical Bayesian (HB) modeling approach, in combination with modern computational tools such as Markov chain Monte Carlo (MCMC) simulation techniques, provides much versatility for cQTL analysis. Sillanpää and Noykova (2008) developed a HB model for single-trait cQTL analysis in inbred line cross-data using molecular markers, gene expressions, and marker-gene expression pairs. However, clinical traits generally relate to one another through environmental correlations and/or pleiotropy. A multi-trait approach can improve on the power to detect genetic effects and on their estimation precision. A multi-trait model also provides a framework for examining a number of biologically interesting hypotheses. In this paper we extend the HB cQTL model for inbred line crosses proposed by Sillanpää and Noykova to a multi-trait setting. We illustrate the implementation of our new model with simulated data, and evaluate the multi-trait model performance with regard to its single-trait counterpart. The data simulation process was based on the multi-trait cQTL model, assuming three traits with uncorrelated and correlated cQTL residuals, with the simulated data under uncorrelated cQTL residuals serving as our test set for comparing the performances of the multi-trait and single-trait models. The simulated data under correlated cQTL residuals were essentially used to assess how well our new model can estimate the cQTL residual covariance structure. The model fitting to the data was carried out by MCMC simulation through OpenBUGS. The multi-trait model outperformed its single-trait counterpart in identifying cQTLs, with a consistently lower false discovery rate. Moreover, the covariance matrix of cQTL residuals was typically estimated to an appreciable degree of precision under the multi-trait cQTL model, making our new model a promising approach to addressing a wide range of issues facing the analysis of correlated clinical traits.
Bayesian multilevel modeling; genetic architecture; linked marker-expression pairs; pleiotropy
Recently, the use of linkage disequilibrium (LD) to locate genes which affect quantitative traits (QTL) has received an increasing interest, but the plausibility of fine mapping using linkage disequilibrium techniques for QTL has not been well studied. The main objectives of this work were to (1) measure the extent and pattern of LD between a putative QTL and nearby markers in finite populations and (2) investigate the usefulness of LD in fine mapping QTL in simulated populations using a dense map of multiallelic or biallelic marker loci. The test of association between a marker and QTL and the power of the test were calculated based on single-marker regression analysis. The results show the presence of substantial linkage disequilibrium with closely linked marker loci after 100 to 200 generations of random mating. Although the power to test the association with a frequent QTL of large effect was satisfactory, the power was low for the QTL with a small effect and/or low frequency. More powerful, multi-locus methods may be required to map low frequent QTL with small genetic effects, as well as combining both linkage and linkage disequilibrium information. The results also showed that multiallelic markers are more useful than biallelic markers to detect linkage disequilibrium and association at an equal distance.
linkage disequilibrium; quantitative trait locus; fine mapping
For a dense set of genetic markers such as single nucleotide polymorphisms (SNPs) on high linkage disequilibrium within a small candidate region, a haplotype-based approach for testing association between a disease phenotype and the set of markers is attractive in reducing the data complexity and increasing the statistical power. However, due to unknown status of the underlying disease variant, a comprehensive association test may require consideration of various combinations of the SNPs, which often leads to severe multiple testing problems. In this paper, we propose a latent variable approach to test for association of multiple tightly linked SNPs in case-control studies. First, we introduce a latent variable into the penetrance model to characterize a putative disease susceptible locus (DSL) that may consist of a marker allele, a haplotype from a subset of the markers, or an allele at a putative locus between the markers. Next, through using of a retrospective likelihood to adjust for the case-control sampling ascertainment and appropriately handle the Hardy-Weinberg equilibrium constraint, we develop an expectation-maximization (EM)-based algorithm to fit the penetrance model and estimate the joint haplotype frequencies of the DSL and markers simultaneously. With the latent variable to describe a flexible role of the DSL, the likelihood ratio statistic can then provide a joint association test for the set of markers without requiring an adjustment for testing of multiple haplotypes. Our simulation results also reveal that the latent variable approach may have improved power under certain scenarios comparing with classical haplotype association methods.
haplotype association; retrospective likelihood; latent variable; logistic mixture model; EM algorithm
Genome-wide association studies (GWAS) aim to detect single nucleotide polymorphisms (SNP) associated with trait variation. However, due to the large number of tests, standard analysis techniques impose highly stringent significance thresholds, leaving potentially associated SNPs undetected, and much of the trait genetic variation unexplained. Pathway- and network-based methodologies applied to GWAS aim to detect associations missed by standard single-marker approaches. The complex and non-random architecture of the genome makes it a challenge to derive an appropriate testing framework for such methodologies. We developed a rapid and simple permutation approach that uses GWAS SNP association results to establish the significance of pathway associations while accounting for the linkage disequilibrium structure of SNPs and the clustering of functionally related elements in the genome. All SNPs used in the GWAS are placed in a “circular genome” according to their location. Then the complete set of SNP association P values are permuted by rotation with respect to the genomic locations of the SNPs. Once these “simulated” P values are assigned, the joint gene P values are calculated using Fisher’s combination test, and the association of pathways is tested using the hypergeometric test. The circular genomic permutation approach was applied to a human genome-wide association dataset. The data consists of 719 individuals from the ORCADES study genotyped for ∼300,000 SNPs and measured for 51 traits ranging from physical to biochemical measurements. KEGG pathways (n = 225) were used as the sets of pathways to be tested. Our results demonstrate that the circular genomic permutations provide robust association P values. The non-permuted hypergeometric analysis generates ∼1400 pathway-trait combination results with an association P value more significant than P ≤ 0.05, whereas applying circular genomic permutation reduces the number of significant results to a more credible 40% of that value. The circular permutation software (“genomicper”) is available as an R package at http://cran.r-project.org/.
GWAS; pathway-based; permutation method; genomicper R package; cardiac disease
Since 2001, the use of more and more dense maps has made researchers aware that combining linkage and linkage disequilibrium enhances the feasibility of fine-mapping genes of interest. So, various method types have been derived to include concepts of population genetics in the analyses. One major drawback of many of these methods is their computational cost, which is very significant when many markers are considered. Recent advances in technology, such as SNP genotyping, have made it possible to deal with huge amount of data. Thus the challenge that remains is to find accurate and efficient methods that are not too time consuming. The study reported here specifically focuses on the half-sib family animal design. Our objective was to determine whether modelling of linkage disequilibrium evolution improved the mapping accuracy of a quantitative trait locus of agricultural interest in these populations. We compared two methods of fine-mapping. The first one was an association analysis. In this method, we did not model linkage disequilibrium evolution. Therefore, the modelling of the evolution of linkage disequilibrium was a deterministic process; it was complete at time 0 and remained complete during the following generations. In the second method, the modelling of the evolution of population allele frequencies was derived from a Wright-Fisher model. We simulated a wide range of scenarios adapted to animal populations and compared these two methods for each scenario.
Our results indicated that the improvement produced by probabilistic modelling of linkage disequilibrium evolution was not significant. Both methods led to similar results concerning the location accuracy of quantitative trait loci which appeared to be mainly improved by using four flanking markers instead of two.
Therefore, in animal half-sib designs, modelling linkage disequilibrium evolution using a Wright-Fisher model does not significantly improve the accuracy of the QTL location when compared to a simpler method assuming complete and constant linkage between the QTL and the marker alleles. Finally, given the high marker density available nowadays, the simpler method should be preferred as it gives accurate results in a reasonable computing time.
Summary: GWAF, Genome-Wide Association analyses with Family, is an R package designed for GWAF. It implements association tests between a batch of genotyped or imputed single nucleotide polymorphisms (SNPs) and a binary or continuous trait with user specified genetic model, and generates informative results from the analyses. In addition, GWAF provides functions to visualize results. We evaluated GWAF using a simulated continuous trait and a binary trait dichotomized from the simulated continuous trait with real genotype data from the Framingham Heart Study's SNP Health Association Resource project.
Supplementary information: Supplementary data are available at Bioinformatics online.