We evaluate four association tests for rare variants—the combined multivariate and collapsing (CMC) method, two weighted-sum methods, and a variable threshold method—by applying them to the simulated data sets of unrelated individuals in the Genetic Analysis Workshop 17 (GAW17) data. The family-wise error rate (FWER) and average power are used as criteria for evaluation. Our results show that when all nonsynonymous SNPs (rare variants and common variants) in a gene are jointly analyzed, the CMC method fails to control the FWER; when only rare variants (single-nucleotide polymorphisms with minor allele frequency less than 0.05) are analyzed, all four methods can control FWER well. All four methods have comparable power, which is low for the analysis of the GAW17 data sets. Three of the methods (not including the CMC method) involve estimation of p-values using permutation procedures that either can be computationally intensive or generate inflated FWERs. We adapt a fast permutation procedure into these three methods. The results show that using the fast permutation procedure can produce FWERs and average powers close to the values obtained from the standard permutation procedure on the GAW17 data sets. The standard permutation procedure is computationally intensive.
Aggregating information across multiple variants in a gene or region can improve power for rare variant association testing. Power is maximized when the aggregation region contains many causal variants and few neutral variants. In this paper, we present a method for the localization of the association signal in a region using a sliding-window based approach to rare variant association testing in a region. We first introduce a novel method for analysis of rare variants, the Difference in Minor Allele Frequency test (DMAF), which allows combined analysis of common and rare variants, and makes no assumptions about the direction of effects. In whole-region analyses of simulated data with risk and protective variants, DMAF and other methods which pool data across individuals were found to outperform methods which pool data across variants. We then implement a sliding-window version of DMAF, using a step-down permutation approach to control type I error with the testing of multiple windows. In simulations, the sliding-window DMAF improved power to detect a causal sub-region, compared to applying DMAF to the whole region. Sliding-window DMAF was also effective in localizing the causal sub-region. We also applied the DMAF sliding-window approach to test for an association between response to the drug gemcitabine and variants in the gene FKBP5 sequenced in 91 lymphoblastoid cell lines derived from white non-Hispanic individuals. The application of the sliding-window test procedure detected an association in a sub-region spanning an exon and two introns, when rare and common variants were analyzed together.
rare variants; region-based analysis; multiple testing
There is solid evidence that rare variants contribute to complex disease etiology. Next-generation sequencing technologies make it possible to uncover rare variants within candidate genes, exomes, and genomes. Working in a novel framework, the kernel-based adaptive cluster (KBAC) was developed to perform powerful gene/locus based rare variant association testing. The KBAC combines variant classification and association testing in a coherent framework. Covariates can also be incorporated in the analysis to control for potential confounders including age, sex, and population substructure. To evaluate the power of KBAC: 1) variant data was simulated using rigorous population genetic models for both Europeans and Africans, with parameters estimated from sequence data, and 2) phenotypes were generated using models motivated by complex diseases including breast cancer and Hirschsprung's disease. It is demonstrated that the KBAC has superior power compared to other rare variant analysis methods, such as the combined multivariate and collapsing and weight sum statistic. In the presence of variant misclassification and gene interaction, association testing using KBAC is particularly advantageous. The KBAC method was also applied to test for associations, using sequence data from the Dallas Heart Study, between energy metabolism traits and rare variants in ANGPTL 3,4,5 and 6 genes. A number of novel associations were identified, including the associations of high density lipoprotein and very low density lipoprotein with ANGPTL4. The KBAC method is implemented in a user-friendly R package.
It has been demonstrated that both rare and common variants are involved in complex disease etiology. Until recently it was only possible to perform large scale analysis of common variants. With the development of next-generation sequencing technologies, detection and mapping of rare variants have been made possible. However, methods used to analyze common variants are not powerful for the analysis of rare variants. To address the problems of rare variant analysis working in a novel framework, the kernel-based adaptive cluster (KBAC) method was developed to perform gene/locus based analysis. The KBAC combines variant classification and association testing in a coherent framework. Through simulations motivated by population genetic and disease data, it is demonstrated that the KBAC has superior power to other rare variant analysis methods, especially in the presence of variant misclassification and gene interaction. Using data from the Dallas Heart Study, the KBAC method was applied to test for associations between energy metabolism traits and rare variants in ANGPTL 3,4,5 and 6 genes. A number of novel associations were identified. The KBAC method is implemented in a user-friendly R package.
We are interested in investigating the involvement of multiple rare variants within a given region by conducting analyses of individual regions with two goals: (1) to determine if regional rare variation in aggregate is associated with risk; and (2) conditional upon the region being associated, to identify specific genetic variants within the region that are driving the association. In particular, we seek a formal integrated analysis that achieves both of our goals. For rare variants with low minor allele frequencies, there is very little power to statistically test the null hypothesis of equal allele or genotype counts for each variant. Thus, genetic association studies are often limited to detecting association within a subset of the common genetic markers. However, it is very likely that associations exist for the rare variants that may not be captured by the set of common markers. Our framework aims at constructing a risk index based on multiple rare variants within a region. Our analytical strategy is novel in that we use a Bayesian approach to incorporate model uncertainty in the selection of variants to include in the index as well as the direction of the associated effects. Additionally, the approach allows for inference at both the group and variant-specific levels. Using a set of simulations, we show that our methodology has added power over other popular rare variant methods to detect global associations. In addition, we apply the approach to sequence data from the WECARE Study of second primary breast cancers.
genetic association studies; Bayesian model uncertainty; Bayes factors; multiplicity correction; sequence analysis; WECARE
With the advent of next-generation sequencing (NGS) technologies, researchers are now generating a deluge of data on high dimensional genomic variations, whose analysis is likely to reveal rare variants involved in the complex etiology of disease. Standing in the way of such discoveries, however, is the fact that statistics for rare variants are currently designed for use with population-based data. In this paper, we introduce a pedigree-based statistic specifically designed to test for rare variants in family-based data. The additional power of pedigree-based statistics stems from the fact that while rare variants related to diseases or traits of interest occur only infrequently in populations, in families with multiple affected individuals, such variants are enriched. Note that while the proposed statistic can be applied with and without statistical weighting, our simulations show that its power increases when weighting (WSS and VT) are applied.
Our working hypothesis was that, since rare variants are concentrated in families with multiple affected individuals, pedigree-based statistics should detect rare variants more powerfully than population-based statistics. To evaluate how well our new pedigree-based statistics perform in association studies, we develop a general framework for sequence-based association studies capable of handling data from pedigrees of various types and also from unrelated individuals. In short, we developed a procedure for transforming population-based statistics into tests for family-based associations. Furthermore, we modify two existing tests, the weighted sum-square test and the variable-threshold test, and apply both to our family-based collapsing methods. We demonstrate that the new family-based tests are more powerful than corresponding population-based test and they generate a reasonable type I error rate.
To demonstrate feasibility, we apply the newly developed tests to a pedigree-based GWAS data set from the Framingham Heart Study (FHS). FHS-GWAS data contain approximately 5000 uncommon variants with frequencies less than 0.05. Potential association findings in these data demonstrate the feasibility of the software PB-STAR (note, PB-STAR is now freely available to the public).
Our tests show that when analyzing for rare variants, a pedigree-based design is more powerful than a population-based case–control design. We further demonstrate that a pedigree-based statistic’s power to detect rare variants increases in direct relation to the proportion of affected individuals within the pedigree.
Pedigree; Next-generation sequencing; GWAS; Rare Variants; Collapsing
Rapid advances in sequencing technologies set the stage for the large-scale medical sequencing efforts to be performed in the near future, with the goal of assessing the importance of rare variants in complex diseases. The discovery of new disease susceptibility genes requires powerful statistical methods for rare variant analysis. The low frequency and the expected large number of such variants pose great difficulties for the analysis of these data. We propose here a robust and powerful testing strategy to study the role rare variants may play in affecting susceptibility to complex traits. The strategy is based on assessing whether rare variants in a genetic region collectively occur at significantly higher frequencies in cases compared with controls (or vice versa). A main feature of the proposed methodology is that, although it is an overall test assessing a possibly large number of rare variants simultaneously, the disease variants can be both protective and risk variants, with moderate decreases in statistical power when both types of variants are present. Using simulations, we show that this approach can be powerful under complex and general disease models, as well as in larger genetic regions where the proportion of disease susceptibility variants may be small. Comparisons with previously published tests on simulated data show that the proposed approach can have better power than the existing methods. An application to a recently published study on Type-1 Diabetes finds rare variants in gene IFIH1 to be protective against Type-1 Diabetes.
Risk to common diseases, such as diabetes, heart disease, etc., is influenced by a complex interaction among genetic and environmental factors. Most of the disease-association studies conducted so far have focused on common variants, widely available on genotyping platforms. However, recent advances in sequencing technologies pave the way for large-scale medical sequencing studies with the goal of elucidating the role rare variants may play in affecting susceptibility to complex traits. The large number of rare variants and their low frequencies pose great challenges for the analysis of these data. We present here a novel testing strategy, based on a weighted-sum statistic, that is less sensitive than existing methods to the presence of both risk and protective variants in the genetic region under investigation. We show applications to simulated data and to a real dataset on Type-1 Diabetes.
The genome-wide association studies (GWAS) designed for next-generation sequencing data involve testing association of genomic variants, including common, low frequency, and rare variants. The current strategies for association studies are well developed for identifying association of common variants with the common diseases, but may be ill-suited when large amounts of allelic heterogeneity are present in sequence data. Recently, group tests that analyze their collective frequency differences between cases and controls shift the current variant-by-variant analysis paradigm for GWAS of common variants to the collective test of multiple variants in the association analysis of rare variants. However, group tests ignore differences in genetic effects among SNPs at different genomic locations. As an alternative to group tests, we developed a novel genome-information content-based statistics for testing association of the entire allele frequency spectrum of genomic variation with the diseases. To evaluate the performance of the proposed statistics, we use large-scale simulations based on whole genome low coverage pilot data in the 1000 Genomes Project to calculate the type 1 error rates and power of seven alternative statistics: a genome-information content-based statistic, the generalized T2, collapsing method, multivariate and collapsing (CMC) method, individual χ2 test, weighted-sum statistic, and variable threshold statistic. Finally, we apply the seven statistics to published resequencing dataset from ANGPTL3, ANGPTL4, ANGPTL5, and ANGPTL6 genes in the Dallas Heart Study. We report that the genome-information content-based statistic has significantly improved type 1 error rates and higher power than the other six statistics in both simulated and empirical datasets.
genome; GWAS; rare variants; sequencing
The advent of next generation sequencing (NGS) technologies enabled the investigation of the rare variant-common disease hypothesis in unrelated individuals, even on the genome-wide level. Analysis of this hypothesis requires tailored statistical methods as single marker tests fail on rare variants. An entire class of statistical methods collapses rare variants from a genomic region of interest (ROI), thereby aggregating rare variants. In an extensive simulation study using data from the Genetic Analysis Workshop 17 we compared the performance of 15 collapsing methods by means of a variety of pre-defined ROIs regarding minor allele frequency thresholds and functionality. Findings of the simulation study were additionally confirmed by a real data set investigating the association between methotrexate clearance and the SLCO1B1 gene in patients with acute lymphoblastic leukemia. Our analyses showed substantially inflated type I error levels for many of the proposed collapsing methods. Only four approaches yielded valid type I errors in all considered scenarios. None of the statistical tests was able to detect true associations over a substantial proportion of replicates in the simulated data. Detailed annotation of functionality of variants is crucial to detect true associations. These findings were confirmed in the analysis of the real data. Recent theoretical work showed that large power is achieved in gene-based analyses only if large sample sizes are available and a substantial proportion of causing rare variants is present in the gene-based analysis. Many of the investigated statistical approaches use permutation requiring high computational cost. There is a clear need for valid, powerful and fast to calculate test statistics for studies investigating rare variants.
collapsing; rare variants; simulation study; comparison; burden test; SLCO1B1
Current family-based association tests for sequencing data were mainly developed for identifying rare variants associated with a complex disease. As the disease can be influenced by the joint effects of common and rare variants, common variants with modest effects may not be identified by the methods focusing on rare variants. Moreover, variants can have risk, neutral, or protective effects. Association tests that can effectively select groups of common and rare variants that are likely to be causal and consider the directions of effects have become important. We developed the Ordered Subset - Variable Threshold - Pedigree Disequilibrium Test (OVPDT), a combination of three algorithms, for association analysis in family sequencing data. The ordered subset algorithm is used to select a subset of common variants based on their relative risks, calculated using only parental mating types. The variable threshold algorithm is used to search for an optimal allele frequency threshold such that rare variants below the threshold are more likely to be causal. The PDT statistics from both rare and common variants selected by the two algorithms are combined as the OVPDT statistic. A permutation procedure is used in OVPDT to calculate the p-value. We used simulations to demonstrate that OVPDT has the correct type I error rates under different scenarios and compared the power of OVPDT with two other family-based association tests. The results suggested that OVPDT can have more power than the other tests if both common and rare variants have effects on the disease in a region.
Association tests that pool minor alleles into a measure of burden at a locus have been proposed for case-control studies using sequence data containing rare variants. However, such pooling tests are not robust to the inclusion of neutral and protective variants, which can mask the association signal from risk variants. Early studies proposing pooling tests dismissed methods for locus-wide inference using nonnegative single-variant test statistics based on unrealistic comparisons. However, such methods are robust to the inclusion of neutral and protective variants and therefore may be more useful than previously appreciated. In fact, some recently proposed methods derived within different frameworks are equivalent to performing inference on weighted sums of squared single-variant score statistics. In this study, we compared two existing methods for locus-wide inference using nonnegative single-variant test statistics to two widely cited pooling tests under more realistic conditions. We established analytic results for a simple model with one rare risk and one rare neutral variant, which demonstrated that pooling tests were less powerful than even Bonferroni-corrected single-variant tests in most realistic situations. We also performed simulations using variants with realistic minor allele frequency and linkage disequilibrium spectra, disease models with multiple rare risk variants and extensive neutral variation, and varying rates of missing genotypes. In all scenarios considered, existing methods using nonnegative single-variant test statistics had power comparable to or greater than two widely cited pooling tests. Moreover, in disease models with only rare risk variants, an existing method based on the maximum single-variant Cochran-Armitage trend chi-square statistic in the locus had power comparable to or greater than another existing method closely related to some recently proposed methods. We conclude that efficient locus-wide inference using single-variant test statistics should be reconsidered as a useful framework for devising powerful association tests in sequence data with rare variants.
The common genetic variants identified through genome-wide association studies explain only a small proportion of the genetic risk for complex diseases. The advancement of next-generation sequencing technologies has enabled the detection of rare variants that are expected to contribute significantly to the missing heritability. Some genetic association studies provide multiple correlated traits for analysis. Multiple trait analysis has the potential to improve the power to detect pleiotropic genetic variants that influence multiple traits. We propose a gene-level association test for multiple traits that accounts for correlation among the traits. Gene- or region-level testing for association involves both common and rare variants. Statistical tests for common variants may have limited power for individual rare variants because of their low frequency and multiple testing issues. To address these concerns, we use the weighted-sum pooling method to test the joint association of multiple rare and common variants within a gene. The proposed method is applied to the Genetic Association Workshop 17 (GAW17) simulated mini-exome data to analyze multiple traits. Because of the nature of the GAW17 simulation model, increased power was not observed for multiple-trait analysis compared to single-trait analysis. However, multiple-trait analysis did not result in a substantial loss of power because of the testing of multiple traits. We conclude that this method would be useful for identifying pleiotropic genes.
Fast and cheaper next-generation sequencing technologies will generate unprecedentedly massive and highly dimensional genetic variation data that allow nearly complete evaluation of genetic variation including both common and rare variants. There are two types of association tests: variant-by-variant test and group test. The variant-by-variant test is designed to test the association of common variants, while the group test is suitable to collectively test the association of multiple rare variants. We propose here a smoothed functional principal component analysis (SFPCA) statistic as a general approach for testing association of the entire allelic spectrum of genetic variation (both common and rare variants), which utilizes the merits of both variant-by-variant analysis and group tests. By intensive simulations, we demonstrate that the SFPCA statistic has the correct type 1 error rates and much higher power than the existing methods to detect association of (1) common variants, (2) rare variants, (3) both common and rare variants and (4) variants with opposite directions of effects. To further evaluate its performance, the SFPCA statistic is applied to ANGPTL4 sequence and six continuous phenotypes data from the Dallas Heart Study as an example for testing association of rare variants and a GWAS of schizophrenia data as an example for testing association of common variants. The results show that the SFPCA statistic has much smaller P-values than many existing statistics in both real data analysis examples.
smoothed functional principal component analysis; rare variants; association studies; next-generation sequencing
Genome wide association (GWA) studies, which test for association between common genetic markers and a disease phenotype, have shown varying degrees of success. While many factors could potentially confound GWA studies, we focus on the possibility that multiple, rare variants (RVs) may act in concert to influence disease etiology. Here, we describe an algorithm for RV analysis, RareCover. The algorithm combines a disparate collection of RVs with low effect and modest penetrance. Further, it does not require the rare variants be adjacent in location. Extensive simulations over a range of assumed penetrance and population attributable risk (PAR) values illustrate the power of our approach over other published methods, including the collapsing and weighted-collapsing strategies. To showcase the method, we apply RareCover to re-sequencing data from a cohort of 289 individuals at the extremes of Body Mass Index distribution (NCT00263042). Individual samples were re-sequenced at two genes, FAAH and MGLL, known to be involved in endocannabinoid metabolism (187Kbp for 148 obese and 150 controls). The RareCover analysis identifies exactly one significantly associated region in each gene, each about 5 Kbp in the upstream regulatory regions. The data suggests that the RVs help disrupt the expression of the two genes, leading to lowered metabolism of the corresponding cannabinoids. Overall, our results point to the power of including RVs in measuring genetic associations.
We focus on the problem of detecting multiple rare variants (RVs) that act together to influence disease phenotypes. In considering this problem, we argue that the detection of causal rare variants must necessarily be different from typical single-marker analysis used for common variants and propose a novel algorithm, RareCover, to accomplish this analysis. RareCover combines a disparate collection of RVs, each with very low effect and modest penetrance. Extensive simulations over a range of values for penetrance and population attributable risk (PAR) illustrate the power of our approach over other published methods, including the collapsing and weighted-sum strategies. To showcase the method, we applied RareCover to data from 289 individuals at the extremes of Body Mass Index distribution (NCT00263042), sequenced around the FAAH and MGLL genes. RareCover analysis identified exactly one significantly associated region in each gene, each about 5Kbp in the upstream regulatory regions. The data suggests that the RVs help disrupt the expression of the two genes leading to lowered metabolism of the corresponding endocannabinoids previously linked with obesity. Overall, our results point to the power of including RVs in measuring genetic associations, and suggest that whole genome, DNA sequencing-based association studies investigating RV effects are feasible.
Both common and rare genetic variants have been shown to contribute to the etiology of complex diseases. Recent genome-wide association studies (GWAS) have successfully investigated how common variants contribute to the genetic factors associated with common human diseases. However, understanding the impact of rare variants, which are abundant in the human population (one in every 17 bases), remains challenging. A number of statistical tests have been developed to analyze collapsed rare variants identified by association tests. Here, we propose a haplotype-based approach. This work inspired by an existing statistical framework of the pedigree disequilibrium test (PDT), which uses genetic data to assess the effects of variants in general pedigrees. We aim to compare the performance between the haplotype-based approach and the rare variant-based approach for detecting rare causal variants in pedigrees.
Extensive simulations in the sequencing setting were carried out to evaluate and compare the haplotype-based approach with the rare variant methods that drew on a more conventional collapsing strategy. As assessed through a variety of scenarios, the haplotype-based pedigree tests had enhanced statistical power compared with the rare variants based pedigree tests when the disease of interest was mainly caused by rare haplotypes (with multiple rare alleles), and vice versa when disease was caused by rare variants acting independently. For most of other situations when disease was caused both by haplotypes with multiple rare alleles and by rare variants with similar effects, these two approaches provided similar power in testing for association.
The haplotype-based approach was designed to assess the role of rare and potentially causal haplotypes. The proposed rare variants-based pedigree tests were designed to assess the role of rare and potentially causal variants. This study clearly documented the situations under which either method performs better than the other. All tests have been implemented in a software, which was submitted to the Comprehensive R Archive Network (CRAN) for general use as a computer program named rvHPDT.
Haplotype; Pedigree disequilibrium test; Rare variants; Weights; Association
Analyzing sequencing data is difficult because of the low frequency of rare variants, which may result in low power to detect associations. We consider pathway analysis to detect multiple common and rare variants jointly and to investigate whether analysis at the pathway level provides an alternative strategy for identifying susceptibility genes. Available pathway analysis methods for data from genome-wide association studies might not be efficient because these methods are designed to detect common variants. Here, we investigate the performance of several existing pathway analysis methods for sequencing data. In particular, we consider the global test, which does not consider linkage disequilibrium between the variants in a gene. We improve the performance of the global test by assigning larger weights to rare variants, as proposed in the weighted-sum approach. Our conclusion is that straightforward application of pathway analysis is not satisfactory; hence, when common and rare variants are jointly analyzed, larger weights should be assigned to rare variants.
The development of genotyping arrays containing hundreds of thousands of rare variants across the genome and advances in high-throughput sequencing technologies have made feasible empirical genetic association studies to search for rare disease susceptibility alleles. As single variant testing is underpowered to detect associations, the development of statistical methods to combine analysis across variants – so-called “burden tests” - is an area of active research interest. We previously developed a method, the admixture maximum likelihood test, to test multiple, common variants for association with a trait of interest. We have extended this method, called the rare admixture maximum likelihood test (RAML), for the analysis of rare variants. In this paper we compare the performance of RAML with six other burden tests designed to test for association of rare variants.
We used simulation testing over a range of scenarios to test the power of RAML compared to the other rare variant association testing methods. These scenarios modelled differences in effect variability, the average direction of effect and the proportion of associated variants. We evaluated the power for all the different scenarios. RAML tended to have the greatest power for most scenarios where the proportion of associated variants was small, whereas SKAT-O performed a little better for the scenarios with a higher proportion of associated variants.
The RAML method makes no assumptions about the proportion of variants that are associated with the phenotype of interest or the magnitude and direction of their effect. The method is flexible and can be applied to both dichotomous and quantitative traits and allows for the inclusion of covariates in the underlying regression model. The RAML method performed well compared to the other methods over a wide range of scenarios. Generally power was moderate in most of the scenarios, underlying the need for large sample sizes in any form of association testing.
With the advent of next-generation sequencing technology, rare variant association analysis is increasingly being conducted to identify genetic variants associated with complex traits. In recent years, significant effort has been devoted to develop powerful statistical methods to test such associations for population-based designs. However, there has been relatively little development for family-based designs although family data have been shown to be more powerful to detect rare variants. This study introduces a blocking approach that extends two popular family-based common variant association tests to rare variants association studies. Several options are considered to partition a genomic region (gene) into “independent” blocks by which information from SNVs is aggregated within a block and an overall test statistic for the entire genomic region is calculated by combining information across these blocks. The proposed methodology allows different variants to have different directions (risk or protective) and specification of minor allele frequency threshold is not needed. We carried out a simulation to verify the validity of the method by showing that type I error is well under control when the underlying null hypothesis and the assumption of independence across blocks are satisfied. Further, data from the Genetic Analysis Workshop are utilized to illustrate the feasibility and performance of the proposed methodology in a realistic setting.
Recently there has been great interest in identifying rare variants associated with common diseases. We apply several collapsing-based and kernel-based single-gene association tests to Genetic Analysis Workshop 17 (GAW17) rare variant association data with unrelated individuals without knowledge of the simulation model. We also implement modified versions of these methods using additional information, such as minor allele frequency (MAF) and functional annotation. For each of four given traits provided in GAW17, we use the Bayesian mixed-effects model to estimate the phenotypic variance explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests. After obtaining information on the GAW17 simulation model, we compare the performance of all methods and examine the top genes identified by those methods. We find that collapsing-based methods with weights based on MAFs are sensitive to the “lower MAF, larger effect size” assumption, whereas kernel-based methods are more robust when this assumption is violated. In addition, many false-positive genes identified by multiple methods often contain variants with exactly the same genotype distribution as the causal variants used in the simulation model. When the sample size is much smaller than the number of rare variants, it is more likely that causal and noncausal variants will share the same or similar genotype distribution. This likely contributes to the low power and large number of false-positive results of all methods in detecting causal variants associated with disease in the GAW17 data set.
Complex diseases and traits are likely influenced by many common and rare genetic variants and environmental factors. Detecting disease susceptibility variants is a challenging task, especially when their frequencies are low and/or their effects are small or moderate. We propose here a comprehensive hierarchical generalized linear model framework for simultaneously analyzing multiple groups of rare and common variants and relevant covariates. The proposed hierarchical generalized linear models introduce a group effect and a genetic score (i.e., a linear combination of main-effect predictors for genetic variants) for each group of variants, and jointly they estimate the group effects and the weights of the genetic scores. This framework includes various previous methods as special cases, and it can effectively deal with both risk and protective variants in a group and can simultaneously estimate the cumulative contribution of multiple variants and their relative importance. Our computational strategy is based on extending the standard procedure for fitting generalized linear models in the statistical software R to the proposed hierarchical models, leading to the development of stable and flexible tools. The methods are illustrated with sequence data in gene ANGPTL4 from the Dallas Heart Study. The performance of the proposed procedures is further assessed via simulation studies. The methods are implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
Complex diseases and traits are likely influenced by many common and rare genetic variants and environmental factors. Next-generation sequencing technologies have provided unparalleled tools to sequence a large number of individuals, allowing for comprehensive studies of both common and rare variants. However, detecting disease-associated rare variants and common variants of small effects poses unique statistical challenges. We propose here a comprehensive hierarchical generalized linear model framework for simultaneously analyzing multiple groups of rare and common variants and relevant covariates. The proposed hierarchical generalized linear models introduce a group effect and a genetic score for each group of variants, and jointly they estimate the group effects and the weights of the genetic scores. This framework includes various previous methods as special cases, and it can effectively deal with both risk and protective variants in a group and can simultaneously estimate the cumulative contribution of multiple variants and their relative importance. The methods are illustrated with sequence data in gene ANGPTL4 from the Dallas Heart Study and are further assessed via simulation studies. The methods have been implemented in a freely available R package BhGLM (http://www.ssg.uab.edu/bhglm/).
Multiple rare variants either within or across genes have been hypothesised to collectively influence complex human traits. The increasing availability of high throughput sequencing technologies offers the opportunity to study the effect of rare variants on these traits. However, appropriate and computationally efficient analytical methods are required to account for collections of rare variants that display a combination of protective, deleterious and null effects on the trait. We have developed a novel method for the analysis of rare genetic variation in a gene, region or pathway that, by simply aggregating summary statistics at each variant, can: (i) test for the presence of a mixture of effects on a trait; (ii) be applied to both binary and quantitative traits in population-based and family-based data; (iii) adjust for covariates to allow for non-genetic risk factors and; (iv) incorporate imputed genetic variation. In addition, for preliminary identification of promising genes, the method can be applied to association summary statistics, available from meta-analysis of published data, for example, without the need for individual level genotype data. Through simulation, we show that our method is immune to the presence of bi-directional effects, with no apparent loss in power across a range of different mixtures, and can achieve greater power than existing approaches as long as summary statistics at each variant are robust. We apply our method to investigate association of type-1 diabetes with imputed rare variants within genes in the major histocompatibility complex using genotype data from the Wellcome Trust Case Control Consortium.
Rapid advances in sequencing technology mean that it is now possible to directly assay rare genetic variation. In addition, the availability of almost fully sequenced human genomes by the 1000 Genomes Project allows genotyping at rare variants that are not present on arrays commonly used in genome-wide association studies. Rare variants within a gene or region may act to collectively influence a complex trait. Methods for testing these rare variants should be able to account for a combination of those that serve to either increase, decrease or have no effect on the trait of interest. Here, we introduce a method for the analysis of a collection of rare genetic variants, within a gene or region, which assesses evidence for a mixture of effects. Our method simply aggregates summary statistics at each variant and, as such, can be applied to both population and family-based data, to binary or quantitative traits and to either directly genotyped or imputed data. In addition, it does not require individual level genotype or phenotype data, and can be adjusted for non-genetic risk factors. We illustrate our approach by examining imputed rare variants in the major histocompatibility complex for association with type-1 diabetes using genotype data from the Wellcome Trust case Control Consortium.
Genome-wide association studies (GWAS) have been used successfully in detecting associations between common genetic variants and complex diseases. However, common SNPs detected by current GWAS only explain a small proportion of heritable variability. With the development of next-generation sequencing technologies, researchers find more and more evidence to support the role played by rare variants in heritable variability. However, rare and common variants are often studied separately. The objective of this paper is to develop a robust strategy to analyze association between complex traits and genetic regions using both common and rare variants.
We propose a weighted selective collapsing strategy for both candidate gene studies and genome-wide association scans. The strategy considers genetic information from both common and rare variants, selectively collapses all variants in a given region by a forward selection procedure, and uses an adaptive weight to favor more likely causal rare variants. Under this strategy, two tests are proposed. One test denoted by BwSC is sensitive to the directions of genetic effects, and it separates the deleterious and protective effects into two components. Another denoted by BwSCd is robust in the directions of genetic effects, and it considers the difference of the two components. In our simulation studies, BwSC achieves a higher power when the casual variants have the same genetic effect, while BwSCd is as powerful as several existing tests when a mixed genetic effect exists. Both of the proposed tests work well with and without the existence of genetic effects from common variants.
Two tests using a weighted selective collapsing strategy provide potentially powerful methods for association studies of sequencing data. The tests have a higher power when both common and rare variants contribute to the heritable variability and the effect of common variants is not strong enough to be detected by traditional methods. Our simulation studies have demonstrated a substantially higher power for both tests in all scenarios regardless whether the common SNPs are associated with the trait or not.
Next-generation sequencing has made possible the detection of rare variant (RV) associations with quantitative traits (QT). Due to high sequencing cost, many studies can only sequence a modest number of selected samples with extreme QT. Therefore association testing in individual studies can be underpowered. Besides the primary trait, many clinically important secondary traits are often measured. It is highly beneficial if multiple studies can be jointly analyzed for detecting associations with commonly measured traits. However, analyzing secondary traits in selected samples can be biased if sample ascertainment is not properly modeled. Some methods exist for analyzing secondary traits in selected samples, where some burden tests can be implemented. However p-values can only be evaluated analytically via asymptotic approximations, which may not be accurate. Additionally, potentially more powerful sequence kernel association tests, variable selection-based methods, and burden tests that require permutations cannot be incorporated. To overcome these limitations, we developed a unified method for analyzing secondary trait associations with RVs (STAR) in selected samples, incorporating all RV tests. Statistical significance can be evaluated either through permutations or analytically. STAR makes it possible to apply more powerful RV tests to analyze secondary trait associations. It also enables jointly analyzing multiple cohorts ascertained under different study designs, which greatly boosts power. The performance of STAR and commonly used RV association tests were comprehensively evaluated using simulation studies. STAR was also implemented to analyze a dataset from the SardiNIA project where samples with extreme low-density lipoprotein levels were sequenced. A significant association between LDLR and systolic blood pressure was identified, which is supported by pharmacogenetic studies. In summary, for sequencing studies, STAR is an important tool for detecting secondary-trait RV associations.
Next-generation sequencing has greatly expanded our ability to identify missing heritability due to rare variants. In order to increase the power to detect associations, one desirable study design is to combine samples from multiple cohorts for mapping commonly measured traits. However, many current studies sequence selected samples (e.g. samples with extreme QT), which can bias the analysis of secondary traits, unless the sampling ascertainment mechanisms are properly adjusted. We developed a unified method for detecting secondary trait associations with rare variants (STAR) in selected and random samples, which can flexibly incorporate all rare variant association tests and allow joint analysis of multiple cohorts ascertained under different study designs. We demonstrate via simulations that STAR greatly boosts the power for detecting secondary trait associations. As an application of STAR, a dataset from the SardiNIA project was analyzed, where DNA samples from well-phenotyped individuals with extreme low-density lipoprotein levels were sequenced. LDLR was identified to be significantly associated with systolic blood pressure, which is supported by a previous pharmacogenetics study. In conclusion, STAR is an important tool for sequence-based association studies.
We summarize the methodological contributions from Group 3 of Genetic Analysis Workshop 17 (GAW17). The overarching goal of these methods was the evaluation and enhancement of state-of-the-art approaches in integration of biological knowledge into association studies of rare variants. We found that methods loosely fell into three major categories: (1) hypothesis testing of index scores based on aggregating rare variants at the gene level, (2) variable selection techniques that incorporate biological prior information, and (3) novel approaches that integrate external (i.e., not provided by GAW17) prior information, such as pathway and single-nucleotide polymorphism (SNP) annotations. Commonalities among the findings from these contributions are that gene-based analysis of rare variants is advantageous to single-SNP analysis and that the minor allele frequency threshold to identify rare variants may influence power and thus needs to be carefully considered. A consistent increase in power was also identified by considering only nonsynonymous SNPs in the analyses. Overall, we found that no single method had an appreciable advantage over the other methods. However, methods that carried out sensitivity analyses by comparing biologically informative to noninformative prior probabilities demonstrated that integrating biological knowledge into statistical analyses always, at the least, enabled subtle improvements in the performance of any statistical method applied to these simulated data. Although these statistical improvements reflect the simulation model assumed for GAW17, our hope is that the simulation models provide a reasonable representation of the underlying biology and that these methods can thus be of utility in real data.
exome sequencing; pathway analysis; gene association
In spite of the success of genome-wide association studies in finding many common variants associated with disease, these variants seem to explain only a small proportion of the estimated heritability. Data collection has turned toward exome and whole genome sequencing, but it is well known that single marker methods frequently used for common variants have low power to detect rare variants associated with disease, even with very large sample sizes. In response, a variety of methods have been developed that attempt to cluster rare variants so that they may gather strength from one another under the premise that there may be multiple causal variants within a gene. Most of these methods group variants by gene or proximity, and test one gene or marker window at a time. We propose a penalized regression method (PeRC) that analyzes all genes at once, allowing grouping of all (rare and common) variants within a gene, along with subgrouping of the rare variants, thus borrowing strength from both rare and common variants within the same gene. The method can incorporate either a burden-based weighting of the rare variants or one in which the weights are data driven. In simulations, our method performs favorably when compared to many previously proposed approaches, including its predecessor, the sparse group lasso [Friedman et al., 2010].
penalized likelihood; lasso; elastic net; association analysis; rare variants
For rare-variant association analysis, due to extreme low frequencies of these variants, it is necessary to aggregate them by a prior set (e.g., genes and pathways) in order to achieve adequate power. In this paper, we consider hierarchical models to relate a set of rare variants to phenotype by modeling the effects of variants as a function of variant characteristics while allowing for variant-specific effect (heterogeneity). We derive a set of two score statistics, testing the group effect by variant characteristics and the heterogeneity effect. We make a novel modification to these score statistics so that they are independent under the null hypothesis and their asymptotic distributions can be derived. As a result, the computational burden is greatly reduced compared with permutation-based tests. Our approach provides a general testing framework for rare variants association, which includes many commonly used tests, such as the burden test [Li and Leal, 2008] and the sequence kernel association test [Wu et al., 2011], as special cases. Furthermore, in contrast to these tests, our proposed test has an added capacity to identify which components of variant characteristics and heterogeneity contribute to the association. Simulations under a wide range of scenarios show that the proposed test is valid, robust and powerful. An application to the Dallas Heart Study illustrates that apart from identifying genes with significant associations, the new method also provides additional information regarding the source of the association. Such information may be useful for generating hypothesis in future studies.