For genome-wide association studies in family-based designs, we propose a powerful two-stage testing strategy that can be applied in situations in which parent-offspring trio data are available and all offspring are affected with the trait or disease under study. In the first step of the testing strategy, we construct estimators of genetic effect size in the completely ascertained sample of affected offspring and their parents that are statistically independent of the family-based association/transmission disequilibrium tests (FBATs/TDTs) that are calculated in the second step of the testing strategy. For each marker, the genetic effect is estimated (without requiring an estimate of the SNP allele frequency) and the conditional power of the corresponding FBAT/TDT is computed. Based on the power estimates, a weighted Bonferroni procedure assigns an individually adjusted significance level to each SNP. In the second stage, the SNPs are tested with the FBAT/TDT statistic at the individually adjusted significance levels. Using simulation studies for scenarios with up to 1,000,000 SNPs, varying allele frequencies and genetic effect sizes, the power of the strategy is compared with standard methodology (e.g., FBATs/TDTs with Bonferroni correction). In all considered situations, the proposed testing strategy demonstrates substantial power increases over the standard approach, even when the true genetic model is unknown and must be selected based on the conditional power estimates. The practical relevance of our methodology is illustrated by an application to a genome-wide association study for childhood asthma, in which we detect two markers meeting genome-wide significance that would not have been detected using standard methodology.
The current state of genotyping technology has enabled researchers to conduct genome-wide association studies of up to 1,000,000 SNPs, allowing for systematic scanning of the genome for variants that might influence the development and progression of complex diseases. One of the largest obstacles to the successful detection of such variants is the multiple comparisons/testing problem in the genetic association analysis. For family-based designs in which all offspring are affected with the disease/trait under study, we developed a methodology that addresses this problem by partitioning the family-based data into two statistically independent components. The first component is used to screen the data and determine the most promising SNPs. The second component is used to test the SNPs for association, where information from the screening is used to weight the SNPs during testing. This methodology is more powerful than standard procedures for multiple comparisons adjustment (i.e., Bonferroni correction). Additionally, as only one data set is required for screening and testing, our testing strategy is less susceptible to study heterogeneity. Finally, as many family-based studies collect data only from affected offspring, this method addresses a major limitation of previous methodologies for multiple comparisons in family-based designs, which require variation in the disease/trait among offspring.
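The two ingredients named above, the TDT statistic and a weighted Bonferroni allocation, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the weighting scheme, so weights proportional to the conditional power estimates are an assumption, and the function names `tdt_statistic` and `weighted_bonferroni` are hypothetical.

```python
def tdt_statistic(b, c):
    """Classical TDT chi-square (1 df).

    b: heterozygous parents who transmitted the allele of interest.
    c: heterozygous parents who did not transmit it.
    """
    if b + c == 0:
        return 0.0  # no informative transmissions
    return (b - c) ** 2 / (b + c)


def weighted_bonferroni(weights, alpha=0.05):
    """Split the overall level alpha across SNPs in proportion to weights.

    Assumption: the weights (e.g. conditional power estimates) come from
    data that are independent of the test statistics. The per-SNP levels
    sum to alpha, which is what keeps the family-wise error rate at alpha.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [alpha * w / total for w in weights]


# Hypothetical power estimates for three SNPs
levels = weighted_bonferroni([0.8, 0.1, 0.1], alpha=0.05)
stat = tdt_statistic(60, 40)
```

A SNP with a high estimated power receives a larger share of the overall significance level, while the sum of the individual levels stays at alpha.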
The two-phase sampling design is a cost-efficient way of collecting expensive covariate information on a judiciously selected sub-sample. It is natural to apply such a strategy for collecting genetic data in a sub-sample enriched for exposure to environmental factors for gene-environment interaction (G × E) analysis. In this paper, we consider two-phase studies of G × E interaction where phase I data are available on exposure, covariates and disease status. Stratified sampling is done to prioritize individuals for genotyping at phase II conditional on disease and exposure. We consider a Bayesian analysis based on the joint retrospective likelihood of phase I and phase II data. We address several important statistical issues: (i) we consider a model with multiple genes, environmental factors and their pairwise interactions. We employ a Bayesian variable selection algorithm to reduce the dimensionality of this potentially high-dimensional model; (ii) we use the assumption of gene-gene and gene-environment independence to trade-off between bias and efficiency for estimating the interaction parameters through use of hierarchical priors reflecting this assumption; (iii) we posit a flexible model for the joint distribution of the phase I categorical variables using the non-parametric Bayes construction of Dunson and Xing (2009). We carry out a small-scale simulation study to compare the proposed Bayesian method with weighted likelihood and pseudo likelihood methods that are standard choices for analyzing two-phase data. The motivating example originates from an ongoing case-control study of colorectal cancer, where the goal is to explore the interaction between the use of statins (a drug used for lowering lipid levels) and 294 genetic markers in the lipid metabolism/cholesterol synthesis pathway. The sub-sample of cases and controls on which these genetic markers were measured is enriched in terms of statin users. 
The example and simulation results illustrate that the proposed Bayesian approach has a number of advantages for characterizing joint effects of genotype and exposure over existing alternatives and makes efficient use of all available data in both phases.
Biased sampling; Colorectal cancer; Dirichlet prior; Exposure enriched sampling; Gene-environment independence; Joint effects; Multivariate categorical distribution; Spike and slab prior
In genetic studies of rare complex diseases it is common to ascertain familial data from population based registries through all incident cases diagnosed during a pre-defined enrollment period. Such an ascertainment procedure is typically taken into account in the statistical analysis of the familial data by constructing either a retrospective or prospective likelihood expression, which conditions on the ascertainment event. Both of these approaches lead to a substantial loss of valuable data.
Methodology and Findings
Here we consider instead the possibilities provided by a Bayesian approach to risk analysis, which incorporates both the ascertainment procedure and reference information concerning the genetic composition of the target population into the statistical model. Furthermore, the proposed Bayesian hierarchical survival model does not require the genotype or haplotype effects to be expressed as functions of the corresponding allelic effects. Our modeling strategy is illustrated by a risk analysis of type 1 diabetes mellitus (T1D) in the Finnish population, based on the HLA-A, HLA-B and DRB1 human leucocyte antigen (HLA) information available for both ascertained sibships and a large number of unrelated individuals from the Finnish bone marrow donor registry. The heterozygous genotype DR3/DR4 at the DRB1 locus was associated with the lowest predictive probability of T1D-free survival to the age of 15, the estimate being 0.936 (95% credible interval 0.926 to 0.945), compared to the average population T1D-free survival probability of 0.995.
The proposed statistical method can be adapted to other population-based family data ascertained from a disease registry, provided that the ascertainment process is well documented and that external information concerning the sizes of birth cohorts and a suitable reference sample are available. We confirm the earlier findings from the same data concerning the HLA-DR3/4 related risks for T1D, and also provide estimated predictive probabilities of disease-free survival as a function of age.
The information provided by dense genome-wide markers using high throughput technology is of considerable potential in human disease studies and livestock breeding programs. Genome-wide association studies relate individual single nucleotide polymorphisms (SNP) from dense SNP panels to individual measurements of complex traits, with the underlying assumption being that any association is caused by linkage disequilibrium (LD) between SNP and quantitative trait loci (QTL) affecting the trait. Often SNP are in genomic regions of no trait variation. Whole genome Bayesian models are an effective way of incorporating this and other important prior information into modelling. However a full Bayesian analysis is often not feasible due to the large computational time involved.
This article proposes an expectation-maximization (EM) algorithm called emBayesB which allows only a proportion of SNP to be in LD with QTL and incorporates prior information about the distribution of SNP effects. The posterior probability of being in LD with at least one QTL is calculated for each SNP along with estimates of the hyperparameters for the mixture prior. A simulated example of genomic selection from an international workshop is used to demonstrate the features of the EM algorithm. The accuracy of prediction is comparable to a full Bayesian analysis but the EM algorithm is considerably faster. The EM algorithm was accurate in locating QTL which explained more than 1% of the total genetic variation. A computational algorithm for very large SNP panels is described.
emBayesB is a fast and accurate EM algorithm for implementing genomic selection and predicting complex traits by mapping QTL in genome-wide dense SNP marker data. Its accuracy is similar to Bayesian methods but it takes only a fraction of the time.
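To make the idea of a per-SNP posterior probability under a mixture prior concrete, here is a minimal sketch. It is not emBayesB: the abstract does not spell out the mixture components, so a point mass at zero plus a normal "slab" is assumed here, and the names `posterior_inclusion`, `gamma`, `tau2` and `se2` are hypothetical.

```python
import math


def normal_pdf(x, var):
    """Density of a zero-mean normal with variance var, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def posterior_inclusion(b_hat, se2, gamma, tau2):
    """Posterior probability that a SNP has a non-zero effect.

    Assumed model (illustrative only):
      b_hat | b ~ N(b, se2)          estimated effect with sampling variance
      b = 0 with probability 1 - gamma
      b ~ N(0, tau2) with probability gamma
    Marginally, b_hat ~ N(0, se2) under the null component and
    b_hat ~ N(0, se2 + tau2) under the slab component.
    """
    slab = gamma * normal_pdf(b_hat, se2 + tau2)
    null = (1.0 - gamma) * normal_pdf(b_hat, se2)
    return slab / (slab + null)


p_small = posterior_inclusion(0.0, se2=1.0, gamma=0.05, tau2=4.0)
p_large = posterior_inclusion(5.0, se2=1.0, gamma=0.05, tau2=4.0)
```

In an EM scheme, probabilities of this kind would play the role of the E-step responsibilities, with the hyperparameters (here `gamma` and `tau2`) updated in the M-step.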
In quantitative trait mapping and genomic prediction, Bayesian variable selection methods have gained popularity in conjunction with the increase in marker data and computational resources. Whereas shrinkage-inducing methods are common tools in genomic prediction, rigorous decision making in mapping studies using such models is not well established and the robustness of posterior results is subject to misspecified assumptions because of weak biological prior evidence.
Here, we evaluate the impact of prior specifications in a shrinkage-based Bayesian variable selection method which is based on a mixture of uniform priors applied to genetic marker effects that we presented in a previous study. Unlike most other shrinkage approaches, the use of a mixture of uniform priors provides a coherent framework for inference based on Bayes factors. To evaluate the robustness of genetic association under varying prior specifications, Bayes factors are compared as signals of positive marker association, whereas genomic estimated breeding values are considered for genomic selection. The impact of specific prior specifications is reduced by calculation of combined estimates from multiple specifications. A Gibbs sampler is used to perform Markov chain Monte Carlo estimation (MCMC) and a generalized expectation-maximization algorithm as a faster alternative for maximum a posteriori point estimation. The performance of the method is evaluated by using two publicly available data examples: the simulated QTLMAS XII data set and a real data set from a population of pigs.
Combined estimates of Bayes factors were very successful in identifying quantitative trait loci, and the ranking of Bayes factors was fairly stable among markers with positive signals of association under varying prior assumptions, but their magnitudes varied considerably. Genomic estimated breeding values using the mixture of uniform priors compared well to other approaches for both data sets and loss of accuracy with the generalized expectation-maximization algorithm was small as compared to that with MCMC.
Since no error-free method to specify priors is available for complex biological phenomena, exploring a wide variety of prior specifications and combining results provides some solution to this problem. For this purpose, the mixture of uniform priors approach is especially suitable, because it comprises a wide and flexible family of distributions and computationally intensive estimation can be carried out in a reasonable amount of time.
Bayesian approaches for predicting genomic breeding values (GEBV) have been proposed that allow for different variances for individual markers, resulting in a shrinkage procedure that uses prior information to coerce negligible effects towards zero. These approaches have generally assumed application to high-density genotype data on all individuals, which may not be the case in practice. In this study, three approaches were compared for their predictive power in computing GEBV when training at high SNP marker density and predicting at high or low densities: the well-known Bayes-A, a generalization of Bayes-A where scale and degrees of freedom are estimated from the data (Student-t) and a Bayesian implementation of the Lasso method. Twelve scenarios were evaluated for predicting GEBV using low-density marker subsets, including selection of SNP based on genome spacing or size of additive effect and the inclusion of unknown genotype information in the form of genotype probabilities from pedigree and genotyped ancestors.
The GEBV accuracy (calculated as correlation between GEBV and traditional breeding values) was highest for Lasso, followed by Student-t and then Bayes-A. When comparing GEBV to true breeding values, Student-t was most accurate, though differences were small. In general the shrinkage applied by the Lasso approach was less conservative than Bayes-A or Student-t, indicating that Lasso may be more sensitive to QTL with small effects. In the reduced-density marker subsets the ranking of the methods was generally consistent. Overall, low-density, evenly-spaced SNPs did a poor job of predicting GEBV, but SNPs selected based on additive effect size yielded accuracies similar to those at high density, even when coverage was low. The inclusion of genotype probabilities to the evenly-spaced subsets showed promising increases in accuracy and may be more useful in cases where many QTL of small effect are expected.
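The accuracy measure used above, the correlation between GEBV and a reference breeding value, can be computed as in the following sketch. The helper name `pearson` and the example values are hypothetical and serve only to illustrate the calculation.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up values: predicted GEBV versus reference breeding values
gebv = [0.2, 1.1, -0.4, 0.9, 0.0]
reference = [0.1, 1.3, -0.6, 0.7, 0.2]
accuracy = pearson(gebv, reference)
```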
In this dataset the Student-t approach slightly outperformed the other methods when predicting GEBV at both high and low density, but the Lasso method may have particular advantages in situations where many small QTL are expected. When markers were selected at low density based on genome spacing, the inclusion of genotype probabilities increased GEBV accuracy, which would allow a single low-density marker panel to be used across traits.
Although rheumatoid arthritis, a chronic and inflammatory disease affecting numerous adults, has a complex genetic component involving the human leukocyte antigen region, additional genomic regions most likely affect susceptibility. Whole-genome scans may assist in identifying these additional candidate regions, but a large number of false positives are likely to occur using traditional statistical methods. Therefore, novel statistical approaches are needed. Here, we used a single replicate from the Genetic Analysis Workshop 15 simulated data to assess marker-disease associations in 1500 rheumatoid arthritis cases and 2000 controls on chromosome 6. The statistical methods included a maximum-likelihood estimation approach and a novel Bayesian latent class analysis. The Bayesian analysis "borrows strength" from multiple loci to estimate association parameters and can incorporate differences across loci in the prior probability of association. Because of this, we hypothesized that the Bayesian analysis might be better able to detect true associations while minimizing false positives. The Bayesian posterior means for the log allelic odds ratios were less variable than the maximum likelihood estimates, but the posterior probabilities were not as good as the simple p-values in distinguishing a signal from a non-signal. Overall, Bayesian latent class analyses provided no obvious improvement over maximum-likelihood estimation. However, our results may not generalize because of the large effect simulated at the human leukocyte antigen-DR locus.
Genes interact with each other as basic building blocks of life, forming a complicated network. The relationship between groups of genes with different functions can be represented as gene networks. With the deposition of huge microarray data sets in public domains, the study of gene networks is now possible. In recent years, there has been an increasing interest in the reconstruction of gene networks from gene expression data. Recent work includes linear models, Boolean network models, and Bayesian networks. Among them, Bayesian networks seem to be the most effective in constructing gene networks. A major problem with the Bayesian network approach is the excessive computational time. This problem is due to the iterative nature of the method, which requires a large search space. Since fitting a model by using copulas does not require iterations, elicitation of priors, or complicated calculations of posterior distributions, the need for extensive search spaces can be eliminated, leading to a manageable computational effort. The Bayesian network approach produces a discrete expression of conditional probabilities. Discreteness of the characteristics is not required in the copula approach, which uses a uniform representation of the continuous random variables. Our method is able to overcome a limitation of the Bayesian network method for gene-gene interaction, namely the information loss due to binary transformation.
We analyzed the gene interactions for two gene data sets (one group is eight histone genes and the other group is 19 genes which include DNA polymerases, DNA helicase, type B cyclin genes, DNA primases, radiation sensitive genes, repair-related genes, replication protein A encoding gene, DNA replication initiation factor, securin gene, nucleosome assembly factor, and a subunit of the cohesin complex) by adopting a measure of directional dependence based on a copula function. We have compared our results with those from other methods in the literature. Although microarray results show a transcriptional co-regulation pattern and do not imply that the gene products are physically interactive, this tight genetic connection may suggest that each gene product has either direct or indirect connections with the other gene products. Indeed, a recent comprehensive analysis of a protein interaction map revealed that those histone genes are physically connected with each other, supporting the results obtained by our method.
The results illustrate that our method can be an alternative to Bayesian networks in modeling gene interactions. One advantage of our approach is that dependence between genes is not assumed to be linear. Another advantage is that our approach can detect directional dependence. We expect that our study may help to design artificial drug candidates, which can block or activate biologically meaningful pathways. Moreover, our copula approach can be extended to investigate the effects of local environments on protein-protein interactions. The copula mutual information approach will help to propose a new variant of ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks), an algorithm for the reconstruction of gene regulatory networks.
A birth and death process is frequently used for modeling the size of a gene family that may vary along the branches of a phylogenetic tree. Under the birth and death model, maximum likelihood methods have been developed to estimate the birth and death rate and the sizes of ancient gene families (numbers of gene copies at the internodes of the phylogenetic tree). This paper aims to provide a Bayesian approach for estimating parameters in the birth and death model.
We develop a Bayesian approach for estimating the birth and death rate and other parameters in the birth and death model. In addition, a Bayesian hypothesis test is developed to identify the gene families that are unlikely under the birth and death process. Simulation results suggest that the Bayesian estimate is more accurate than the maximum likelihood estimate of the birth and death rate. The Bayesian approach was applied to a real dataset of 3517 gene families across genomes of five yeast species. The results indicate that the Bayesian model assuming a constant birth and death rate among branches of the phylogenetic tree cannot adequately explain the observed pattern of the sizes of gene families across species. The yeast dataset was thus analyzed with a Bayesian heterogeneous rate model that allows the birth and death rate to vary among the branches of the tree. The unlikely gene families identified by the Bayesian heterogeneous rate model are different from those given by the maximum likelihood method.
Compared to the maximum likelihood method, the Bayesian approach can produce more accurate estimates of the parameters in the birth and death model. In addition, the Bayesian hypothesis test is able to identify unlikely gene families based on Bayesian posterior p-values. As a powerful statistical technique, the Bayesian approach can effectively extract information from gene family data and thereby provide useful information regarding the evolutionary process of gene families across genomes.
We consider the inference problem of estimating covariate and genetic effects in a family-based case-control study where families are ascertained on the basis of the number of cases within the family. However, our interest lies not only in estimating the fixed covariate effects but also in estimating the random effects parameters that account for varying correlations among family members. These random effects parameters, though weakly identifiable in a strict theoretical sense, are often hard to estimate due to the small number of observations per family. A hierarchical Bayesian paradigm is a very natural route in this context with multiple advantages compared with a classical mixed effects estimation strategy based on the integrated likelihood. We propose a fully flexible Bayesian approach allowing nonparametric modeling of the random effects distribution using a Dirichlet process prior and provide estimation of both fixed effect and random effects parameters using a Markov chain Monte Carlo numerical integration scheme. The nonparametric Bayesian approach not only provides inference that is less sensitive to parametric specification of the random effects distribution but also allows possible uncertainty around a specific genetic correlation structure. The Bayesian approach has certain computational advantages over its mixed-model counterparts. Data from the Prostate Cancer Genetics Project, a family-based study at the University of Michigan Comprehensive Cancer Center including families having one or more members with prostate cancer, are used to illustrate the proposed methods. A small-scale simulation study is carried out to compare the proposed nonparametric Bayes methodology with a parametric Bayesian alternative.
conditional logistic regression; Dirichlet process prior; integrated likelihood; matched case-control studies; random effects model
Combinatorial gene perturbations provide rich information for a systematic exploration of genetic interactions. Despite successful applications to bacteria and yeast, the scalability of this approach remains a major challenge for higher organisms such as humans. Here, we report a novel experimental and computational framework to efficiently address this challenge by limiting the ‘search space’ for important genetic interactions. We propose to integrate rich phenotypes of multiple single gene perturbations to robustly predict functional modules, which can subsequently be subjected to further experimental investigations such as combinatorial gene silencing. We present posterior association networks (PANs) to predict functional interactions between genes estimated using a Bayesian mixture modelling approach. The major advantage of this approach over conventional hypothesis tests is that prior knowledge can be incorporated to enhance predictive power. We demonstrate in a simulation study and on biological data, that integrating complementary information greatly improves prediction accuracy. To search for significant modules, we perform hierarchical clustering with multiscale bootstrap resampling. We demonstrate the power of the proposed methodologies in applications to Ewing's sarcoma and human adult stem cells using publicly available and custom generated data, respectively. In the former application, we identify a gene module including many confirmed and highly promising therapeutic targets. Genes in the module are also significantly overrepresented in signalling pathways that are known to be critical for proliferation of Ewing's sarcoma cells. In the latter application, we predict a functional network of chromatin factors controlling epidermal stem cell fate. Further examinations using ChIP-seq, ChIP-qPCR and RT-qPCR reveal that the basis of their genetic interactions may arise from transcriptional cross regulation. 
A Bioconductor package implementing PAN is freely available online at http://bioconductor.org/packages/release/bioc/html/PANR.html.
Synthetic genetic interactions estimated from combinatorial gene perturbation screens provide systematic insights into synergistic interactions of genes in a biological process. However, this approach lacks scalability for large-scale genetic interaction profiling in metazoan organisms such as humans. We contribute to this field by proposing a more scalable and affordable approach, which takes the advantage of multiple single gene perturbation data to predict coherent functional modules followed by genetic interaction investigation using combinatorial perturbations. We developed a versatile computational framework (PAN) to robustly predict functional interactions and search for significant functional modules from rich phenotyping screens of single gene perturbations under different conditions or from multiple cell lines. PAN features a Bayesian mixture model to assess statistical significance of functional associations, the capability to incorporate prior knowledge as well as a generalized approach to search for significant functional modules by multiscale bootstrap resampling. In applications to Ewing's sarcoma and human adult stem cells, we demonstrate the general applicability and prediction power of PAN to both public and custom generated screening data.
To describe the utility of twin studies for attention-deficit/hyperactivity disorder (ADHD) research and demonstrate their potential for the identification of alternative phenotypes suitable for genomewide association, developmental risk assessment, treatment response, and intervention targets.
Brief descriptions of the classic twin study and genetic association study methods are provided, with illustrative findings from ADHD research. Biometrical genetics refers to the statistical modeling of data gathered from one or more groups of known biological relation; the term was apparently coined by Francis Galton in the 1860s and led to the “Biometrical School” at the University of London. Twin studies use genetic correlations between pairs of relatives, derived using this theoretical framework, to parse the individual differences in a trait into latent (unmeasured) genetic and environmental influences. This method enables the estimation of heritability, i.e., the percentage of variance due to genetic influences. It is usually implemented with structural equation modeling, a statistical technique for fitting models to data, typically using maximum likelihood estimation. Genetic association studies aim to identify those genetic variants that account for the heritability estimated in twin studies. Measurements other than those used for the clinical diagnosis of the disorder are popular phenotype choices in current ADHD research. It is argued that twin studies have great potential to refine phenotypes relevant to ADHD.
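As a rough illustration of how twin correlations are parsed into genetic and environmental components, the sketch below uses Falconer's classical formulas rather than the structural equation modeling described above. It assumes an additive (ACE) model in which monozygotic twins share all, and dizygotic twins half, of their additive genetic variance; the function name `falconer_ace` and the example correlations are hypothetical.

```python
def falconer_ace(r_mz, r_dz):
    """Falconer estimates from twin correlations under an ACE model.

    r_mz: trait correlation in monozygotic twin pairs.
    r_dz: trait correlation in dizygotic twin pairs.
    Returns (a2, c2, e2): additive genetic, shared environmental and
    non-shared environmental proportions of variance.
    These are simple moment estimates; a structural equation model fitted
    by maximum likelihood would also give standard errors and fit tests.
    """
    a2 = 2.0 * (r_mz - r_dz)   # heritability
    c2 = 2.0 * r_dz - r_mz     # shared environment
    e2 = 1.0 - r_mz            # non-shared environment and error
    return a2, c2, e2


# Made-up correlations, for illustration only
a2, c2, e2 = falconer_ace(r_mz=0.8, r_dz=0.5)
```

The three components sum to one by construction, which is a quick sanity check on any pair of input correlations.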
Prior studies have consistently found that the majority of the variance in ADHD symptoms is due to genetic factors. To date, genomewide association studies of ADHD have not identified replicable associations that account for the heritable variation. Possibly, the application of genomewide association studies to these alternative phenotypic measurements will assist in identifying the pathways from genetic variants to ADHD.
Power to detect associations should be improved by the study of highly heritable endophenotypes for ADHD and by reducing the number of phenotypes to be considered. Therefore, twin studies are an important research tool in the development of endophenotypes, defined as alternative, more highly heritable traits that act at earlier stages of the pathway from genes to behavior. Although genetic variation in liability to ADHD is likely polygenic, the proposed approach should help to identify improved alternative measurements for genetic association studies.
twin studies; ADHD; genomewide association; endophenotype; translational research
For many molecularly targeted agents, the probability of response may be assumed to either increase or increase and then plateau in the tested dose range. Therefore, identifying the maximum effective dose, defined as the lowest dose that achieves a pre-specified target response and beyond which improvement in the response is unlikely, becomes increasingly important. Recently, a class of Bayesian designs for single-arm phase II clinical trials based on hypothesis tests and nonlocal alternative prior densities has been proposed and shown to outperform common Bayesian designs based on posterior credible intervals and common frequentist designs. We extend this and related approaches to the design of phase II oncology trials, with the goal of identifying the maximum effective dose among a small number of pre-specified doses.
We propose two new Bayesian designs with continuous monitoring of response rates across doses to identify the maximum effective dose, assuming monotonicity of the response rate across doses. The first design is based on Bayesian hypothesis tests. To determine whether each dose level achieves a pre-specified target response rate and whether the response rates between doses are equal, multiple statistical hypotheses are defined using nonlocal alternative prior densities. The second design is based on Bayesian model averaging and also uses nonlocal alternative priors. We conduct simulation studies to evaluate the operating characteristics of the proposed designs, and compare them with three alternative designs.
In terms of the likelihood of drawing a correct conclusion using similar between-design average sample sizes, the performance of our proposed design based on Bayesian hypothesis tests and nonlocal alternative priors is more robust than that of the other designs. Specifically, the proposed Bayesian hypothesis test-based design has the largest probability of being the best design among all designs under comparison and the smallest probability of being an inadequate design, under sensible definitions of the best design and an inadequate design, respectively.
The use of Bayesian hypothesis tests and nonlocal alternative priors under ordering constraints between dose groups results in a robust performance of the design, which is thus superior to other common designs.
Bayesian hypothesis test; Bayesian model averaging; Nonlocal alternative prior density; Plateau; Efficacy; Toxicity
The distribution of genetic variation among populations is conveniently measured by Wright’s FST, which is a scaled variance taking on values in [0,1]. For certain types of genetic markers, and for single-nucleotide polymorphisms (SNPs) in particular, it is reasonable to presume that allelic differences at most loci are selectively neutral. For such loci, the distribution of genetic variation among populations is determined by the size of local populations, the pattern and rate of migration among those populations, and the rate of mutation. Because the demographic parameters (population sizes and migration rates) are common across all autosomal loci, locus-specific estimates of FST will depart from a common distribution only for loci with unusually high or low rates of mutation or for loci that are closely associated with genomic regions having a relationship with fitness. Thus, loci that are statistical outliers showing significantly more among-population differentiation than others may mark genomic regions subject to diversifying selection among the sample populations. Similarly, statistical outliers showing significantly less differentiation among populations than others may mark genomic regions subject to stabilizing selection across the sample populations. We propose several Bayesian hierarchical models to estimate locus-specific effects on FST, and we apply these models to single nucleotide polymorphism data from the HapMap project. Because loci that are physically associated with one another are likely to show similar patterns of variation, we introduce conditional autoregressive models to incorporate the local correlation among loci for high-resolution genomic data. We estimate the posterior distributions of model parameters using Markov chain Monte Carlo (MCMC) simulations. 
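The description of FST as a scaled variance can be made concrete with a small sketch for one biallelic locus. It is not one of the Bayesian hierarchical models proposed here: it assumes known allele frequencies, equal weighting of populations and no sampling correction, and the function name `fst_scaled_variance` is hypothetical.

```python
def fst_scaled_variance(freqs):
    """FST for one biallelic locus as a scaled variance.

    freqs: allele frequency of the same allele in each population.
    FST = Var(p) / (p_bar * (1 - p_bar)), with the variance taken across
    populations using equal weights (an assumption of this sketch).
    """
    k = len(freqs)
    if k < 2:
        raise ValueError("need at least two populations")
    if any(p < 0.0 or p > 1.0 for p in freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = sum(freqs) / k
    if p_bar == 0.0 or p_bar == 1.0:
        return 0.0  # locus is monomorphic overall, no differentiation
    var_p = sum((p - p_bar) ** 2 for p in freqs) / k
    return var_p / (p_bar * (1.0 - p_bar))


fixed_differences = fst_scaled_variance([0.0, 1.0])   # complete differentiation
no_differences = fst_scaled_variance([0.5, 0.5])      # identical populations
intermediate = fst_scaled_variance([0.2, 0.4, 0.9])
```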
Model comparison using several criteria, including DIC and LPML, reveals that a model with locus- and population-specific effects is superior to other models for the data used in the analysis. To detect statistical outliers we propose an approach that measures divergence between the posterior distributions of locus-specific effects and the common FST with the Kullback-Leibler divergence measure. We calibrate this measure by comparing values with those produced from the divergence between a biased and a fair coin. We conduct a simulation study to illustrate the performance of our approach for detecting loci subject to stabilizing/divergent selection, and we apply the proposed models to low- and high-resolution SNP data from the HapMap project. Model comparison using DIC and LPML reveals that CAR models are superior to alternative models for the high resolution data. For both low and high resolution data, we identify statistical outliers that are associated with known genes.
Bayesian approach; Hierarchical model; SNP; Wright’s Fst; MCMC
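The coin calibration mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the divergence is taken from a fair coin to a biased coin and is measured in nats; the abstract does not state the direction or the units, and the function names `kl_fair_vs_biased` and `calibrate_kl` are hypothetical. Under that assumption, a divergence k corresponds to a coin with bias q satisfying k = -0.5 log(4q(1-q)).

```python
import math


def kl_fair_vs_biased(q):
    """KL divergence (in nats) from a fair coin to a coin with P(heads) = q."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    # 0.5*log(0.5/q) + 0.5*log(0.5/(1-q)) simplifies to the expression below.
    return -0.5 * math.log(4.0 * q * (1.0 - q))


def calibrate_kl(k):
    """Return the bias q >= 0.5 whose divergence from a fair coin equals k.

    Assumption: k was computed in nats and in the same direction as
    kl_fair_vs_biased; this is an illustrative convention, not
    necessarily the one used by the authors.
    """
    if k < 0.0:
        raise ValueError("a KL divergence cannot be negative")
    return 0.5 * (1.0 + math.sqrt(1.0 - math.exp(-2.0 * k)))
```

Read this way, a locus whose divergence calibrates to a bias near 0.5 is hard to distinguish from the common FST, while a bias close to 1 indicates a clear outlier.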
Genomewide association (GWA) studies assay hundreds of thousands of single nucleotide polymorphisms (SNPs) simultaneously across the entire genome and associate them with diseases or other biological or clinical traits. The association analysis usually tests each SNP as an independent entity and ignores biological information such as linkage disequilibrium. Although the Bonferroni correction and other approaches have been proposed to address the multiple comparisons that arise from testing many SNPs, there is a lack of understanding of the distribution of an association test statistic when an entire genome is considered together. In other words, there have been extensive efforts in hypothesis testing but almost no attempts to estimate the density under the null hypothesis. By estimating the true null distribution, we can apply the result directly to hypothesis testing, better assess the existing approaches to multiple comparisons, and evaluate the impact of linkage disequilibrium on GWA studies. To this end, we estimate the empirical null distribution of an association test statistic in GWA studies using simulated population data. We further propose a convenient and accurate method based on adaptive splines to estimate the empirical value in GWA studies and validate our findings using a real data set. Our method enables us to fully characterize the null distribution of an association test, which not only can be used to test the null hypothesis of no association but also provides important information about the impact of the density of the genetic markers on the significance of the tests. Our method does not require users to perform computationally intensive permutations, and hence provides a timely solution to an important and difficult problem in GWA studies.
critical value; generalized extreme-value distribution; genomewide association
In the last decade, numerous genome-wide linkage and association studies of complex diseases have been completed. The critical question remains of how best to use this potentially valuable information to improve study design and statistical analysis in current and future genetic association studies. Because genetic effect sizes for complex diseases are relatively small, the use of all available information is essential to untangling the genetic architecture of complex diseases. One promising approach to incorporating prior knowledge from linkage scans, or other information, is to up- or down-weight p-values resulting from a genetic association study in either a frequentist or Bayesian manner. As an alternative to these methods, we propose a fully Bayesian mixture model to incorporate previous knowledge into ongoing association analysis. In this approach, both the data and previous information collectively inform the association analysis, in contrast to modifying the association results (p-values) to conform to the prior knowledge. By using a Bayesian framework, one has flexibility in modeling, and is able to comprehensively assess the impact of model specification on posterior inferences. We illustrate the use of this method through a genome-wide linkage study of colorectal cancer, and a genome-wide association study of colorectal polyps.
Bayesian; genetic association; linkage; mixture model; prior information
Genome-wide association studies (GWAS) have yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex linkage disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175), when compared with the largest published meta-GWAS (n > 100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed strong associations of SORT1 with the TG-APOB and of LIPC with the TG-HDL phenotypic groups, which were overlooked in the larger meta-GWAS and not revealed by competing approaches, associations that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants.
This provides a powerful tool for the analysis of diverse genomic features, for instance including gene expression and exome sequencing data, where complex dependencies are present in the predictor space.
Nowadays, the availability of cheaper and accurate assays to quantify multiple (endo)phenotypes in large population cohorts allows multi-trait studies. However, these studies are limited by the lack of flexible models integrated with efficient computational tools for genome-wide multi SNPs-traits analyses. To overcome this problem, we propose a novel Bayesian analysis strategy and a new algorithmic implementation which exploits parallel processing architecture for fully multivariate modeling of groups of correlated phenotypes at the genome-wide scale. In addition to increased power of our algorithm over alternative Bayesian and well-established non-Bayesian multi-phenotype methods, we provide an application to a real case study of several blood lipid traits, and show how our method recovered most of the major associations and is better at refining multi-trait polygenic associations than alternative methods. We reveal and replicate in independent cohorts new associations with two phenotypic groups that were not detected by competing multivariate approaches and not noticed by a large meta-GWAS. We also discuss the applicability of the proposed method to large meta-analyses involving hundreds of thousands of individuals and to diverse genomic datasets where complex dependencies in the predictor space are present.
Molecular marker information is a common source for drawing inferences about the relationship between genetic and phenotypic variation. Genetic effects are often modelled as additively acting marker allele effects. The true mode of biological action can, of course, differ from this simple assumption. One way to better understand the genetic architecture of complex traits is to include intra-locus (dominance) and inter-locus (epistasis) interactions of alleles as well as the additive genetic effects when fitting a model to a trait. Several Bayesian MCMC approaches exist for the genome-wide estimation of genetic effects with high accuracy of genetic value prediction. Including pairwise interactions for thousands of loci would probably go beyond the scope of such a sampling algorithm, because millions of effects would then have to be estimated simultaneously, leading to months of computation time. Alternative solving strategies are required when epistasis is studied.
We extended a fast Bayesian method (fBayesB), which was previously proposed for a purely additive model, to include non-additive effects. The fBayesB approach was used to estimate genetic effects on the basis of simulated datasets. Different scenarios were simulated to study the loss of prediction accuracy when epistatic effects were modelled but not simulated, and vice versa.
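To make the scale of the pairwise-interaction problem concrete, the following minimal sketch counts the effects in a model with additive, dominance and pairwise epistatic terms. The function name `count_effects`, the number of interaction terms per locus pair and the example of 2000 loci are illustrative assumptions, not values taken from the study.

```python
import math


def count_effects(n_loci, terms_per_pair=1):
    """Count effects in a model with additive, dominance and pairwise terms.

    terms_per_pair is a hypothetical knob: 1 for a single interaction term
    per locus pair, 4 if additive-by-additive, additive-by-dominance,
    dominance-by-additive and dominance-by-dominance terms are all fitted.
    """
    additive = n_loci
    dominance = n_loci
    # Every unordered pair of distinct loci contributes terms_per_pair effects.
    pairwise = terms_per_pair * math.comb(n_loci, 2)
    return additive + dominance + pairwise


# Illustrative marker count, chosen only to show the order of magnitude.
print(count_effects(2000))      # one interaction term per pair
print(count_effects(2000, 4))   # four interaction terms per pair
```

Because the pairwise term grows quadratically with the number of loci, even a few thousand markers push the model into the millions of effects.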
If 23 QTL were simulated to cause additive and dominance effects, both fBayesB and a conventional MCMC sampler BayesB yielded similar results in terms of accuracy of genetic value prediction and bias of variance component estimation based on a model including additive and dominance effects. Applying fBayesB to data with epistasis, accuracy could be improved by 5% when all pairwise interactions were modelled as well. The accuracy decreased more than 20% if genetic variation was spread over 230 QTL. In this scenario, accuracy based on modelling only additive and dominance effects was generally superior to that of the complex model including epistatic effects.
This simulation study showed that the fBayesB approach is convenient for genetic value prediction. Jointly estimating additive and non-additive effects (especially dominance) has reasonable impact on the accuracy of prediction and the proportion of genetic variation assigned to the additive genetic source.
Given moderately strong genetic contributions to variation in alcoholism and heaviness of drinking (50–60% heritability), with high correlation of genetic influences, we have conducted a quantitative trait genomewide association study for phenotypes related to alcohol use and dependence.
Diagnostic interview and blood/buccal samples were obtained from sibships ascertained through the Australian Twin Registry. Genomewide SNP genotyping was performed with 8754 individuals [2062 alcohol dependent cases] selected for informativeness for alcohol use disorder and associated quantitative traits. Family-based association tests were performed for alcohol dependence, dependence factor score and heaviness of drinking factor score, with confirmatory case-population control comparisons using an unassessed population control series of 3393 Australians with genomewide SNP data.
No findings reached genomewide significance (p = 8.4×10⁻⁸ for this study), with the lowest p-value for primary phenotypes being 1.2×10⁻⁷. Convergent findings for quantitative consumption and diagnostic and quantitative dependence measures suggest possible roles for a transmembrane protein gene (TMEM108) and for ANKS1A. The major finding, however, was small effect sizes estimated for individual SNPs, suggesting that hundreds of genetic variants make modest contributions (0.25% of variance or less) to alcohol dependence risk.
We conclude that (i) meta-analyses of consumption data may contribute usefully to gene-discovery; (ii) translation of human alcoholism GWAS results to drug discovery or clinically useful prediction of risk will be challenging; (iii) through accumulation across studies, GWAS data may become valuable for improved genetic risk differentiation in research in biological psychiatry (e.g. prospective high-risk or resilience studies).
Alcoholism; genome-wide association; quantitative-trait; non-replication
Pharmacogenetic clinical trials seek to identify genetic modifiers of treatment effects. When a trial has collected data on many potential genetic markers, a first step in analysis is to screen for evidence of pharmacogenetic effects by testing for treatment-by-marker interactions in a statistical model for the outcome of interest. This approach is potentially problematic because i) individual significance tests can be overly sensitive, particularly when sample sizes are large; and ii) standard significance tests fail to distinguish between markers that are likely, on biological grounds, to have an effect, and those that are not. One way to address these concerns is to perform Bayesian hypothesis tests (Berger 1985; Kass and Raftery 1995), which are typically more conservative than standard uncorrected frequentist tests, less conservative than multiplicity-corrected tests, and make explicit use of relevant biological information through specification of the prior distribution. In this article we use a Bayesian testing approach to screen a panel of genetic markers recorded in a randomized clinical trial of bupropion versus placebo for smoking cessation. From a panel of 59 single-nucleotide polymorphisms (SNPs) located on 11 candidate genes, we identify four SNPs (one each on CHRNA5 and CHRNA2 and two on CHAT) that appear to have pharmacogenetic relevance. Of these, the SNP on CHRNA5 is most robust to specification of the prior. An unadjusted frequentist test identifies seven SNPs, including these four, none of which remains significant upon correction for multiplicity. In a panel of 43 randomly selected control SNPs, none is significant by either the Bayesian or the corrected frequentist test.
Bayes factor; Bayesian hypothesis test; bupropion; importance sampling; pharmacogenomics; single-nucleotide polymorphism
Imputation-based association methods provide a powerful framework for testing untyped variants for association with phenotypes and for combining results from multiple studies that use different genotyping platforms. Here, we consider several issues that arise when applying these methods in practice, including: (i) factors affecting imputation accuracy, including choice of reference panel; (ii) the effects of imputation accuracy on power to detect associations; (iii) the relative merits of Bayesian and frequentist approaches to testing imputed genotypes for association with phenotype; and (iv) how to quickly and accurately compute Bayes factors for testing imputed SNPs. We find that imputation-based methods can be robust to imputation accuracy and can improve power to detect associations, even when average imputation accuracy is poor. We explain how ranking SNPs for association by a standard likelihood ratio test gives the same results as a Bayesian procedure that uses an unnatural prior assumption—specifically, that difficult-to-impute SNPs tend to have larger effects—and assess the power gained from using a Bayesian approach that does not make this assumption. Within the Bayesian framework, we find that good approximations to a full analysis can be achieved by simply replacing unknown genotypes with a point estimate—their posterior mean. This approximation considerably reduces computational expense compared with published sampling-based approaches, and the methods we present are practical on a genome-wide scale with very modest computational resources (e.g., a single desktop computer). The approximation also facilitates combining information across studies, using only summary data for each SNP. Methods discussed here are implemented in the software package BIMBAM, which is available from http://stephenslab.uchicago.edu/software.html.
Genotype imputation is becoming a popular approach to comparing and combining results of multiple association studies that used different SNP genotyping platforms. The basic idea is to exploit the fact that, due to correlation among untyped and typed SNPs, genotypes of untyped SNPs in each study can be inferred (“imputed”) from the genotypes at typed SNPs, often with high accuracy. In this paper, we consider several issues that arise when applying these methods in practice, including factors affecting imputation accuracy, the importance of taking account of imputation uncertainty when testing for association between imputed SNPs and phenotype, how imputation accuracy affects power, and how to combine results across studies when only single-SNP summary data can be shared among research groups.
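The point-estimate approximation described above, replacing an unknown genotype with its posterior mean, can be sketched as follows. This is a minimal illustration that assumes genotypes are coded as 0, 1 or 2 copies of a reference allele and that imputation supplies the three posterior probabilities; the function name is hypothetical and is not part of BIMBAM.

```python
def posterior_mean_genotype(probs):
    """Expected allele count given (P(g=0), P(g=1), P(g=2)).

    Assumes a 0/1/2 genotype coding; the input is expected to sum to one.
    """
    p0, p1, p2 = probs
    if min(p0, p1, p2) < 0.0 or abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("probs must be non-negative and sum to 1")
    # E[g] = 0*p0 + 1*p1 + 2*p2
    return p1 + 2.0 * p2


# A confidently imputed genotype gives an integer; an uncertain one does not.
print(posterior_mean_genotype((0.0, 0.0, 1.0)))
print(posterior_mean_genotype((0.1, 0.3, 0.6)))
```

The resulting value can then be used in place of the unobserved genotype in a single-SNP association model, which is what keeps the computation light compared with sampling over possible genotypes.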
Many complex diseases are likely to be a result of the interplay of genes and environmental exposures. The standard analysis in a genome-wide association study (GWAS) scans for main effects and ignores the potentially useful information in the available exposure data. Two recently proposed methods that exploit environmental exposure information involve a two-step analysis aimed at prioritizing the large number of SNPs tested to highlight those most likely to be involved in a G×E interaction. For example, Murcray et al. (2009) proposed screening on a test that models the G-E association induced by an interaction in the combined case-control sample. Alternatively, Kooperberg et al. (2008) suggested screening on genetic marginal effects. In both methods, SNPs that pass the respective screening step at a pre-specified significance threshold are followed up with a formal test of interaction in the second step. We propose a hybrid method that combines these two screening approaches by allocating a proportion of the overall genomewide significance level to each test. We show that the Murcray et al. approach is often the most efficient method, but that the hybrid approach is a powerful and robust method for nearly any underlying model. As an example, for a GWAS of 1 million markers including a single true disease SNP with minor allele frequency of 0.15, and a binary exposure with prevalence 0.3, the Murcray, Kooperberg and hybrid methods are 1.90, 1.27, and 1.87 times as efficient, respectively, as the traditional case-control analysis to detect an interaction effect size of 2.0.
G×E interaction; case-control; genome-wide association study; efficiency
Genomewide association studies have become the primary tool for discovering the genetic basis of complex human diseases. Such studies are susceptible to the confounding effects of population stratification, in that the combination of allele-frequency heterogeneity with disease-risk heterogeneity among different ancestral subpopulations can induce spurious associations between genetic variants and disease. This article provides a statistically rigorous and computationally feasible solution to this challenging problem of unmeasured confounders. We show that the odds ratio of disease with a genetic variant is identifiable if and only if the genotype is independent of the unknown population substructure conditional on a set of observed ancestry-informative markers in the disease-free population. Under this condition, the odds ratio of interest can be estimated by fitting a semiparametric logistic regression model with an arbitrary function of a propensity score relating the genotype probability to ancestry-informative markers. Approximating the unknown function of the propensity score by B-splines, we derive a consistent and asymptotically normal estimator for the odds ratio of interest with a consistent variance estimator. Simulation studies demonstrate that the proposed inference procedures perform well in realistic settings. An application to the well-known Wellcome Trust Case-Control Study is presented. Supplemental materials are available online.
B-spline; Case-control study; Principal components; Propensity score; Semiparametric logistic regression; Single nucleotide polymorphism
Multilocus analysis of single nucleotide polymorphism haplotypes is a promising approach to dissecting the genetic basis of complex diseases. We propose a coalescent-based model for association mapping that potentially increases the power to detect disease-susceptibility variants in genetic association studies. The approach uses Bayesian partition modelling to cluster haplotypes with similar disease risks by exploiting evolutionary information. We focus on candidate gene regions with densely spaced markers and model chromosomal segments in high linkage disequilibrium therein assuming a perfect phylogeny. To make this assumption more realistic, we split the chromosomal region of interest into sub-regions or windows of high linkage disequilibrium. The haplotype space is then partitioned into disjoint clusters, within which the phenotype–haplotype association is assumed to be the same. For example, in case-control studies, we expect chromosomal segments bearing the causal variant on a common ancestral background to be more frequent among cases than controls, giving rise to two separate haplotype clusters. The novelty of our approach arises from the fact that the distance used for clustering haplotypes has an evolutionary interpretation, as haplotypes are clustered according to the time to their most recent common ancestor. Our approach is fully Bayesian and we develop a Markov Chain Monte Carlo algorithm to sample efficiently over the space of possible partitions. We compare the proposed approach to both single-marker analyses and recently proposed multi-marker methods and show that the Bayesian partition modelling performs similarly in localizing the causal allele while yielding lower false-positive rates. Also, the method is computationally quicker than other multi-marker approaches. 
We present an application to real genotype data from the CYP2D6 gene region, which has a confirmed role in drug metabolism, where we succeed in mapping the location of the susceptibility variant within a small error.
Genetic association studies offer great promise in dissecting the genetic contribution to complex diseases. The underlying idea of such studies is to search for genetic variants along the genome that appear to be associated with a trait of interest, e.g., disease status for a binary trait. One then proceeds by genotyping unrelated individuals at several marker sites, searching for positions where single markers or combinations of multiple markers on the paternally and maternally inherited chromosomes (or haplotypes) appear to discriminate among affected and unaffected individuals, flagging genomic regions that may harbour disease susceptibility variants. The statistical analysis of such studies, however, poses several challenges, such as multiplicity and false-positive issues, due to the large number of markers considered. Focusing on case-control studies, we present a novel evolution-based Bayesian partition model that clusters haplotypes with similar disease risks. The novelty of this approach lies in the use of perfect phylogenies, which offers a sensible and computationally efficient approximation of the ancestry of a sample of chromosomes. We show that the incorporation of phylogenetic information leads to low false-positive rates, while our model fitting offers computational advantages over similar recently proposed coalescent-based haplotype clustering methods.
The identification of quantitative trait loci (QTLs) of small effect size that underlie complex traits poses a particular challenge for geneticists due to the large sample sizes and large numbers of genetic markers required for genomewide association scans. An efficient solution for screening purposes is to combine single nucleotide polymorphism (SNP) microarrays and DNA pooling (SNP-MaP), an approach that has been shown to be valid, reliable and accurate in deriving relative allele frequency estimates from pooled DNA for groups such as cases and controls for 10K SNP microarrays. However, in order to conduct a genomewide association study many more SNP markers are needed. To this end, we assessed the validity and reliability of the SNP-MaP method using the Affymetrix GeneChip® Mapping 100K Array set. Interpretable results emerged for 95% of the SNPs (nearly 110 000 SNPs). We found that SNP-MaP allele frequency estimates correlated 0.939 with allele frequencies for 97 605 SNPs that were genotyped individually in an independent population; the correlation was 0.971 for 26 SNPs that were genotyped individually for the 1028 individuals used to construct the DNA pools. We conclude that extending the SNP-MaP method to the Affymetrix GeneChip® Mapping 100K Array set provides a useful screen of >100 000 SNP markers for QTL association scans.