There is great interest in sequencing unrelated or pedigree samples to detect rare variant quantitative trait associations. To reduce the cost of sequencing and improve power, many studies sequence selected samples with extreme traits. Existing methods for detecting rare variant associations were developed for unrelated samples. Methods are needed to analyze (selected or randomly ascertained) pedigree samples.
We propose a unified framework of modeling extreme trait genetic associations (MEGA) with rare variants. Using MEGA and appropriate permutation algorithms, many rare variant tests can be extended to family data. As an application, we compared study designs using both sib-pairs and unrelated individuals. Extensive simulations were carried out using realistic population genetic and complex trait models.
It is demonstrated that when extreme sampling is implemented within equal-sized cohorts of unrelated individuals or sib-pairs, analyzing unrelated individuals is consistently more powerful than studying sib-pairs. A higher proportion of rare variants can be identified by sequencing unrelated samples than by sequencing sibs. Alternatively, if samples are ascertained using fixed thresholds from an infinite-sized population, sequencing one sib with the most extreme trait from each extreme concordant sib-pair is consistently the most powerful design.
MEGA will play an important role in the analysis of sequence-based genetic association studies.
Extreme sampling; Next-generation sequencing; Pedigree samples; Quantitative trait loci; Rare variants
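The two ascertainment schemes contrasted in the abstract above (extremes taken within a finite cohort versus fixed trait thresholds) can be illustrated with a minimal Python sketch for unrelated individuals. The function names, the sampling fraction, and the thresholds are assumptions made for illustration; they are not taken from MEGA, and sib-pair selection is not modeled.

```python
def select_cohort_extremes(traits, fraction):
    """Return (low, high) index lists holding the lowest and highest
    `fraction` of trait values within a finite cohort."""
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    k = int(len(traits) * fraction)
    # Indices ordered from smallest to largest trait value.
    order = sorted(range(len(traits)), key=lambda i: traits[i])
    return order[:k], order[len(order) - k:]


def select_by_thresholds(traits, lower, upper):
    """Return (low, high) index lists for individuals whose trait falls
    at or below `lower`, or at or above `upper` (fixed thresholds)."""
    low = [i for i, t in enumerate(traits) if t <= lower]
    high = [i for i, t in enumerate(traits) if t >= upper]
    return low, high


traits = [5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0, 0.0]
low, high = select_cohort_extremes(traits, 0.2)
fixed_low, fixed_high = select_by_thresholds(traits, 1.0, 8.0)
```

With cohort-based selection the number of sequenced individuals is fixed in advance, whereas with fixed thresholds it depends on how many individuals happen to exceed the cutoffs.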
With the advent of next-generation sequencing (NGS) technologies, researchers are now generating a deluge of data on high dimensional genomic variations, whose analysis is likely to reveal rare variants involved in the complex etiology of disease. Standing in the way of such discoveries, however, is the fact that statistics for rare variants are currently designed for use with population-based data. In this paper, we introduce a pedigree-based statistic specifically designed to test for rare variants in family-based data. The additional power of pedigree-based statistics stems from the fact that while rare variants related to diseases or traits of interest occur only infrequently in populations, such variants are enriched in families with multiple affected individuals. Note that while the proposed statistic can be applied with and without statistical weighting, our simulations show that its power increases when weighting schemes (WSS and VT) are applied.
Our working hypothesis was that, since rare variants are concentrated in families with multiple affected individuals, pedigree-based statistics should detect rare variants more powerfully than population-based statistics. To evaluate how well our new pedigree-based statistics perform in association studies, we developed a general framework for sequence-based association studies capable of handling data from pedigrees of various types and also from unrelated individuals. In short, we developed a procedure for transforming population-based statistics into tests for family-based associations. Furthermore, we modified two existing tests, the weighted sum-square test and the variable-threshold test, and applied both to our family-based collapsing methods. We demonstrate that the new family-based tests are more powerful than the corresponding population-based tests and that they maintain a reasonable type I error rate.
To demonstrate feasibility, we apply the newly developed tests to a pedigree-based GWAS data set from the Framingham Heart Study (FHS). FHS-GWAS data contain approximately 5000 uncommon variants with frequencies less than 0.05. Potential association findings in these data demonstrate the feasibility of the software PB-STAR (note, PB-STAR is now freely available to the public).
Our tests show that when analyzing for rare variants, a pedigree-based design is more powerful than a population-based case–control design. We further demonstrate that a pedigree-based statistic’s power to detect rare variants increases in direct relation to the proportion of affected individuals within the pedigree.
Pedigree; Next-generation sequencing; GWAS; Rare Variants; Collapsing
It is generally known that risk variants segregate together with a disease within families, but this information has not been used in the existing statistical methods for detecting rare variants. Here we introduce two weighted sum statistics that can be applied to either genome-wide association data or resequencing data for identifying rare disease variants, with weights calculated based on sibpairs and odds ratios, respectively. We evaluated the two methods via extensive simulations under different disease models. We compared the proposed methods with the weighted sum statistic (WSS) proposed by Madsen and Browning, keeping the same genotyping or resequencing cost. Our methods clearly demonstrate more statistical power than the WSS. In addition, we found that using sibpair information can increase power over using only unrelated samples by more than 40%. We applied our methods to the Framingham Heart Study (FHS) and Wellcome Trust Case Control Consortium (WTCCC) hypertension datasets. Although we did not identify any genes as reaching a genome-wide significance level, we found variants in the candidate gene angiotensinogen (AGT) significantly associated with hypertension at P=6.9×10⁻⁴, whereas the most significant single SNP association evidence is P=0.063. We further applied the odds ratio weighted method to the IFIH1 gene for type 1 diabetes in the WTCCC data. Our method yielded a P value of 4.82×10⁻⁴, much more significant than that obtained by haplotype-based methods. We demonstrated that family data are extremely informative in searching for rare variants underlying complex traits, and the odds ratio weighted sum statistic is more efficient than currently existing methods.
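For orientation, the baseline that the abstract above compares against, the Madsen and Browning weighted sum statistic, can be sketched as follows. This is a minimal Python illustration of the commonly described form of that baseline (weights estimated from unaffected individuals, per-individual weighted scores, and the rank sum of affected individuals); it is not the sibpair-based or odds-ratio-based weighting proposed in the abstract, whose formulas are not given here. The sketch assumes complete genotypes coded as 0, 1, or 2 mutant alleles and at least one unaffected individual; the permutation step needed for a P value is omitted.

```python
import math


def wss_weights(genotypes, affected):
    """Per-variant weights w_i = sqrt(n * q_i * (1 - q_i)), where
    q_i = (m_i + 1) / (2 * n_u + 2) is estimated in unaffected individuals
    (m_i mutant alleles among n_u unaffected; n individuals in total)."""
    n_total = len(genotypes)
    n_unaffected = sum(1 for a in affected if not a)
    weights = []
    for i in range(len(genotypes[0])):
        m_unaffected = sum(g[i] for g, a in zip(genotypes, affected) if not a)
        q = (m_unaffected + 1) / (2 * n_unaffected + 2)
        weights.append(math.sqrt(n_total * q * (1 - q)))
    return weights


def wss_scores(genotypes, weights):
    """Genetic score per individual: mutant allele counts divided by weights."""
    return [sum(g_i / w for g_i, w in zip(g, weights)) for g in genotypes]


def rank_sum_of_affected(scores, affected):
    """Sum of (mid)ranks of the affected individuals' scores."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend over a run of tied scores and assign them the average rank.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        midrank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = midrank
        pos = end + 1
    return sum(r for r, a in zip(ranks, affected) if a)


genotypes = [[1, 0], [0, 1], [0, 0], [0, 0]]
affected = [True, True, False, False]
weights = wss_weights(genotypes, affected)
scores = wss_scores(genotypes, weights)
x = rank_sum_of_affected(scores, affected)
```

In practice the rank sum would be compared with its distribution under permuted affection labels; how the proposed methods replace the weights is described only qualitatively in the abstract.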
Multiple rare variants either within or across genes have been hypothesised to collectively influence complex human traits. The increasing availability of high throughput sequencing technologies offers the opportunity to study the effect of rare variants on these traits. However, appropriate and computationally efficient analytical methods are required to account for collections of rare variants that display a combination of protective, deleterious and null effects on the trait. We have developed a novel method for the analysis of rare genetic variation in a gene, region or pathway that, by simply aggregating summary statistics at each variant, can: (i) test for the presence of a mixture of effects on a trait; (ii) be applied to both binary and quantitative traits in population-based and family-based data; (iii) adjust for covariates to allow for non-genetic risk factors and; (iv) incorporate imputed genetic variation. In addition, for preliminary identification of promising genes, the method can be applied to association summary statistics, available from meta-analysis of published data, for example, without the need for individual level genotype data. Through simulation, we show that our method is immune to the presence of bi-directional effects, with no apparent loss in power across a range of different mixtures, and can achieve greater power than existing approaches as long as summary statistics at each variant are robust. We apply our method to investigate association of type-1 diabetes with imputed rare variants within genes in the major histocompatibility complex using genotype data from the Wellcome Trust Case Control Consortium.
Rapid advances in sequencing technology mean that it is now possible to directly assay rare genetic variation. In addition, the availability of almost fully sequenced human genomes by the 1000 Genomes Project allows genotyping at rare variants that are not present on arrays commonly used in genome-wide association studies. Rare variants within a gene or region may act to collectively influence a complex trait. Methods for testing these rare variants should be able to account for a combination of those that serve to either increase, decrease or have no effect on the trait of interest. Here, we introduce a method for the analysis of a collection of rare genetic variants, within a gene or region, which assesses evidence for a mixture of effects. Our method simply aggregates summary statistics at each variant and, as such, can be applied to both population and family-based data, to binary or quantitative traits and to either directly genotyped or imputed data. In addition, it does not require individual level genotype or phenotype data, and can be adjusted for non-genetic risk factors. We illustrate our approach by examining imputed rare variants in the major histocompatibility complex for association with type-1 diabetes using genotype data from the Wellcome Trust Case Control Consortium.
Next-generation sequencing of large numbers of individuals presents challenges in data preparation, quality control, and statistical analysis because of the rarity of the variants. The Genetic Analysis Workshop 17 (GAW17) data provide an opportunity to survey existing methods and compare these methods with novel ones. Specifically, the GAW17 Group 2 contributors investigate existing and newly proposed methods and study design strategies to identify rare variants, predict functional variants, and/or examine quality control. We introduce the eight Group 2 papers, summarize their approaches, and discuss their strengths and weaknesses. For these investigations, some groups used only the genotype data, whereas others also used the simulated phenotype data. Although the eight Group 2 contributions covered a wide variety of topics under the general idea of identifying rare variants, they can be grouped into three broad categories according to their common research interests: functionality of variants and quality control issues, family-based analyses, and association analyses of unrelated individuals. The aims of the first subgroup were quite different. These were population structure analyses that used rare variants to predict functionality and examine the accuracy of genotype calls. The aims of the family-based analyses were to select which families should be sequenced and to identify high-risk pedigrees; the aim of the association analyses was to identify variants or genes with regression-based methods. However, power to detect associations was low in all three association studies. Thus this work shows opportunities for incorporating rare variants into the genetic and statistical analyses of common diseases.
1000 Genomes Project; association; collection of rare variants; family data; next-generation sequencing; regression; quality control
For almost all complex traits studied in humans, the identified genetic variants discovered to date have accounted for only a small portion of the estimated trait heritability. Consequently, several methods have been developed to identify rare single-nucleotide variants associated with complex traits for population-based designs. Because rare disease variants tend to be enriched in families containing multiple affected individuals, family-based designs can play an important role in the identification of rare causal variants. In this study, we utilize Genetic Analysis Workshop 18 simulated data to examine the performance of some existing rare variant identification methods for unrelated individuals, including our recent method (rPLS). The simulated data is used to investigate whether there is an advantage to using family data compared to case-control data. The results indicate that population-based methods suffer from power loss, especially when the sample size is small. The family-based method employed in this paper results in higher power but fails to control type I error. Our study also highlights the importance of the phenotype choice, which can affect the power of detecting causal genes substantially.
There is solid evidence that rare variants contribute to complex disease etiology. Next-generation sequencing technologies make it possible to uncover rare variants within candidate genes, exomes, and genomes. Working in a novel framework, the kernel-based adaptive cluster (KBAC) was developed to perform powerful gene/locus based rare variant association testing. The KBAC combines variant classification and association testing in a coherent framework. Covariates can also be incorporated in the analysis to control for potential confounders including age, sex, and population substructure. To evaluate the power of KBAC: 1) variant data were simulated using rigorous population genetic models for both Europeans and Africans, with parameters estimated from sequence data, and 2) phenotypes were generated using models motivated by complex diseases including breast cancer and Hirschsprung's disease. It is demonstrated that the KBAC has superior power compared to other rare variant analysis methods, such as the combined multivariate and collapsing and weighted sum statistic. In the presence of variant misclassification and gene interaction, association testing using KBAC is particularly advantageous. The KBAC method was also applied to test for associations, using sequence data from the Dallas Heart Study, between energy metabolism traits and rare variants in ANGPTL 3,4,5 and 6 genes. A number of novel associations were identified, including the associations of high density lipoprotein and very low density lipoprotein with ANGPTL4. The KBAC method is implemented in a user-friendly R package.
It has been demonstrated that both rare and common variants are involved in complex disease etiology. Until recently it was only possible to perform large scale analysis of common variants. With the development of next-generation sequencing technologies, detection and mapping of rare variants have been made possible. However, methods used to analyze common variants are not powerful for the analysis of rare variants. To address the problems of rare variant analysis working in a novel framework, the kernel-based adaptive cluster (KBAC) method was developed to perform gene/locus based analysis. The KBAC combines variant classification and association testing in a coherent framework. Through simulations motivated by population genetic and disease data, it is demonstrated that the KBAC has superior power to other rare variant analysis methods, especially in the presence of variant misclassification and gene interaction. Using data from the Dallas Heart Study, the KBAC method was applied to test for associations between energy metabolism traits and rare variants in ANGPTL 3,4,5 and 6 genes. A number of novel associations were identified. The KBAC method is implemented in a user-friendly R package.
Although next-generation sequencing technology allows sequencing the whole genome of large groups of individuals, the development of powerful statistical methods for rare variant association studies is still underway. Even though many statistical methods have been developed for mapping rare variants, most of these methods are for unrelated individuals only, whereas family data have been shown to improve power to detect rare variants. The majority of the existing methods for unrelated individuals are essentially testing the effect of a weighted combination of variants with different weighting schemes. The performance of these methods depends on the weights being used. Recently, Sha et al. [Genet Epidemiol, 2012] proposed a test for Testing the effect of an Optimally Weighted combination of variants (TOW) for unrelated individuals. In this article, we extend our previously developed TOW for unrelated individuals to family-based data and propose a novel test for Testing the effect of an Optimally Weighted combination of variants for Family-based designs (TOW-F). The optimal weights are analytically derived. The results of extensive simulation studies show that TOW-F is robust to population stratification in a wide range of population structures, is robust to the direction and magnitude of the effects of causal variants, and is relatively robust to the percentage of neutral variants.
rare variants; general families; association studies; quantitative traits; population stratification
Technological advances make it possible to use high-throughput sequencing as a primary discovery tool of medical genetics, specifically for assaying rare variation. Still this approach faces the analytic challenge that the influence of very rare variants can only be evaluated effectively as a group. A further complication is that any given rare variant could have no effect, could increase risk, or could be protective. We propose here the C-alpha test statistic as a novel approach for testing for the presence of this mixture of effects across a set of rare variants. Unlike existing burden tests, C-alpha, by testing the variance rather than the mean, maintains consistent power when the target set contains both risk and protective variants. Through simulations and analysis of case/control data, we demonstrate good power relative to existing methods that assess the burden of rare variants in individuals.
Developments in sequencing technology now enable us to assay all genetic variation, much of which is extremely rare. We propose to test the distribution of rare variants we observe in cases versus controls. To do so, we present a novel application of the C-alpha statistic to test these rare variants. C-alpha aims to determine whether the set of variants observed in cases and controls is a mixture, such that some of the variants confer risk or protection or are phenotypically neutral. Risk variants are expected to be more common in cases; protective variants more common in controls. C-alpha is sensitive to this imbalance, regardless of its origin—risk, protective, or both—but is ideally suited for a mixture of protective and risk variants. Variation in APOB nicely illustrates a mixture, in that certain rare variants increase triglyceride levels while others decrease it. The hallmark feature of C-alpha is that it uses the distribution of variation observed in cases and controls to detect the presence of a mixture, thus implicating genes or pathways as risk factors for disease.
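The idea of testing the variance rather than the mean can be made concrete with a small Python sketch. It uses the commonly cited form of the C-alpha statistic, T = Σ_i [(y_i − n_i p0)² − n_i p0 (1 − p0)], where y_i is the number of copies of variant i seen in cases, n_i the total number of copies, and p0 the proportion of cases in the sample, standardized by the variance of T under a binomial null. The two abstracts above do not state the formula, so this form, the function name, and the toy counts are assumptions for illustration only.

```python
import math


def c_alpha(case_counts, total_counts, p0):
    """Return (T, variance, Z) for the C-alpha statistic.

    case_counts[i]  -- copies of variant i observed in cases (y_i)
    total_counts[i] -- copies of variant i observed overall (n_i)
    p0              -- expected proportion of copies in cases under the null
    """
    t = 0.0
    variance = 0.0
    for y, n in zip(case_counts, total_counts):
        binom_var = n * p0 * (1 - p0)
        # Excess of the observed squared deviation over its binomial expectation.
        t += (y - n * p0) ** 2 - binom_var
        # Null variance of that term, summed over all possible case counts u.
        for u in range(n + 1):
            prob = math.comb(n, u) * p0 ** u * (1 - p0) ** (n - u)
            variance += ((u - n * p0) ** 2 - binom_var) ** 2 * prob
    z = t / math.sqrt(variance) if variance > 0 else None
    return t, variance, z


# One variant seen only in cases and one seen only in controls:
# the total burden is identical (4 copies each), but the split is extreme.
t, variance, z = c_alpha([4, 0], [4, 4], 0.5)
```

In this toy example a pure burden comparison sees no difference between cases and controls, while the overdispersion captured by T is large, which is the situation with mixed risk and protective variants that the abstracts describe.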
With the development of sequencing technologies, the direct testing of rare variant associations has become possible. Many statistical methods for detecting associations between rare variants and complex diseases have recently been developed, most of which are population-based methods for unrelated individuals. A limitation of population-based methods is that spurious associations can occur when there is a population structure. For rare variants, this problem can be more serious, because the spectrum of rare variation can be very different in diverse populations and because there are currently no methods to control for population stratification in population-based rare variant association studies. A solution to the problem of population stratification is to use family-based association tests, which use family members to control for population stratification. In this article, we propose a novel test for Testing the Optimally Weighted combination of variants based on data of Parents and Affected Children (TOW-PAC). TOW-PAC is a family-based association test that tests the combined effect of rare and common variants in a genomic region, and is robust to the directions of the effects of causal variants. Simulation studies confirm that, for rare variant associations, family-based association tests are robust to population stratification while population-based association tests can be seriously confounded by population stratification. The results of power comparisons show that the power of TOW-PAC increases with the number of affected children in each family and that TOW-PAC based on multiple affected children per family is more powerful than TOW based on unrelated individuals.
optimal weights; rare variants; association studies; family-based design
Recently there has been great interest in identifying rare variants associated with common diseases. We apply several collapsing-based and kernel-based single-gene association tests to Genetic Analysis Workshop 17 (GAW17) rare variant association data with unrelated individuals without knowledge of the simulation model. We also implement modified versions of these methods using additional information, such as minor allele frequency (MAF) and functional annotation. For each of four given traits provided in GAW17, we use the Bayesian mixed-effects model to estimate the phenotypic variance explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests. After obtaining information on the GAW17 simulation model, we compare the performance of all methods and examine the top genes identified by those methods. We find that collapsing-based methods with weights based on MAFs are sensitive to the “lower MAF, larger effect size” assumption, whereas kernel-based methods are more robust when this assumption is violated. In addition, many false-positive genes identified by multiple methods often contain variants with exactly the same genotype distribution as the causal variants used in the simulation model. When the sample size is much smaller than the number of rare variants, it is more likely that causal and noncausal variants will share the same or similar genotype distribution. This likely contributes to the low power and large number of false-positive results of all methods in detecting causal variants associated with disease in the GAW17 data set.
High-throughput sequencing technology allows researchers to test associations between phenotypes and all the variants identified throughout the genome, and is especially useful for analyzing rare variants. However, the statistical power to identify phenotype-associated rare variants is very low with typical genome-wide association studies because of their low allele frequencies among unrelated individuals. In contrast, a family-based design may have more power because rare variants are more likely to be enriched in families than among unrelated individuals. Regardless, an analysis of family-based association studies needs to account appropriately for relatedness between family members. We analyzed the observed quantitative trait systolic blood pressure as well as the simulated Q1 data in the Genetic Analysis Workshop 18 data set using 4 tests: (a) a single-variant test, (b) a collapsing test, (c) a single-variant test where familial relatedness was accounted for, and (d) a collapsing test where familial relatedness was accounted for. We then compared the results of the 4 methods and observed that adjusting for familial relatedness could appropriately control the false-positive rate while maintaining reasonable power to detect several strongly associated variants/genes.
In the past few years, case-control studies of common diseases have shifted their focus from single genes to whole exomes. New sequencing technologies now routinely detect hundreds of thousands of sequence variants in a single study, many of which are rare or even novel. The limitation of classical single-marker association analysis for rare variants has been a challenge in such studies. A new generation of statistical methods for case-control association studies has been developed to meet this challenge. A common approach to association analysis of rare variants is to use burden-style collapsing methods, which combine rare variant data within individuals across or within genes. Here, we propose a new hybrid likelihood model that combines a burden test with a test of the position distribution of variants. In extensive simulations and on empirical data from the Dallas Heart Study, the new model demonstrates consistently good power, in particular when applied to a gene set (e.g., multiple candidate genes with shared biological function or pathway), when rare variants cluster in key functional regions of a gene, and when protective variants are present. When applied to data from an ongoing sequencing study of bipolar disorder (191 cases, 107 controls), the model identifies seven gene sets with nominal p-values < 0.05, of which one, the MAPK signaling pathway (KEGG), reaches trend-level significance after correcting for multiple testing.
Inexpensive, high-throughput sequencing has transformed the field of case-control association studies. For the first time, it may be possible to identify the genetic underpinnings of complex diseases, by sequencing the DNA of hundreds (even thousands) of cases and controls and comparing patterns of DNA sequence variation. However, complex diseases are likely to be caused by many variants, some of which are very rare. Taken one at a time, the association between variant and disease phenotype may not be detectable by current statistical methods. One strategy is to identify regions where important variants occur by “collapsing” variants into groups. Here, we present a new collapsing approach, capable of detecting subtle genetic differences between cases and controls. We show, in extensive simulations and using a benchmark set of genes involved in human triglyceride levels, that the approach is potentially more powerful than existing methods. We apply the new method to an ongoing sequencing study of bipolar cases and controls and identify a set of genes found in neuronal synapses, which may be implicated in bipolar disorder.
Identifying genetic variants associated with complex diseases is an important task in genetic research. Although association studies based on unrelated individuals (i.e., case-control genome-wide association studies) have successfully identified common single-nucleotide polymorphisms for many complex diseases, these studies are less likely to identify rare genetic variants. In contrast, family-based association studies are particularly useful for identifying rare-variant associations. Recently, there has been some interest in employing multilevel models in family-based genetic association studies. However, the performance of such models in these studies, especially for longitudinal family-based sequence data, has not been fully investigated. Therefore, in this study, we investigated the performance of the multilevel model in family-based genetic association analysis and compared it with the conventional family-based association test, by examining the power and type I error rates of the 2 approaches using 3 data sets from the Genetic Analysis Workshop 18 simulated data: genome-wide association single-nucleotide polymorphism data, sequence data, and rare-variants-only data. Compared with the univariate family-based association test, the multilevel model had slightly higher power to identify most of the causal genetic variants using the genome-wide association single-nucleotide polymorphism data and sequence data. However, both approaches had low power to identify most of the causal single-nucleotide polymorphisms, especially those among the relatively rare genetic variants. Therefore, we suggest a unified method that combines both approaches and incorporates a collapsing strategy, which may be more powerful than either approach alone for studying genetic associations using family-based data.
Family-based study design will play a key role in identifying rare causal variants, because rare causal variants can be enriched in families with multiple affected subjects. Furthermore, different from population-based studies, family studies are robust to bias induced by population substructure. It is well known that rare causal variants are difficult to detect from single-locus tests. Therefore, burden tests and non-burden tests have been developed, by combining signals of multiple variants in a chromosomal region or a functional unit. This inevitably incorporates some neutral variants into the test statistics, which can dilute the power of statistical methods. To guard against the noise caused by neutral variants, we here propose an ‘adaptive combination of P-values method’ (abbreviated as ‘ADA’). This method combines per-site P-values of variants that are more likely to be causal. Variants with large P-values (which are more likely to be neutral variants) are discarded from the combined statistic. In addition to performing extensive simulation studies, we applied these tests to the Genetic Analysis Workshop 17 data sets, where real sequence data were generated according to the 1000 Genomes Project. Compared with some existing methods, ADA is more robust to the inclusion of neutral variants. This is a merit especially when dichotomous traits are analyzed. However, there are some limitations for ADA. First, it is more computationally intensive. Second, pedigree structures and founders' sequence data are required for the permutation procedure. Third, unrelated controls cannot be included. We here show that, for family-based studies, the application of ADA is limited to dichotomous trait analyses with full pedigree information.
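The core step described above, discarding variants with large per-site P-values and combining the rest, can be sketched in Python as a truncated sum of negative log P-values evaluated at several candidate thresholds. This is an assumption-laden illustration: the function names, the strict inequality, and the threshold grid are hypothetical, the abstract does not specify the exact combining function or any weighting, and the permutation procedure that the method needs for significance is not shown.

```python
import math


def truncated_log_p_statistic(p_values, threshold):
    """Sum of -log(p) over variants whose per-site P-value is below
    `threshold`; variants with larger P-values are discarded."""
    return -sum(math.log(p) for p in p_values if p < threshold)


def statistics_over_thresholds(p_values, thresholds):
    """Evaluate the truncated statistic at each candidate threshold."""
    return {t: truncated_log_p_statistic(p_values, t) for t in thresholds}


per_site_p = [0.001, 0.04, 0.2, 0.8]          # hypothetical per-site P-values
candidate_thresholds = [0.05, 0.10, 0.25]     # hypothetical truncation grid
stats = statistics_over_thresholds(per_site_p, candidate_thresholds)
```

Because the best threshold is chosen from the data, the resulting statistic has no simple null distribution, which is consistent with the abstract's remark that the method relies on a permutation procedure and is computationally intensive.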
Use of trait-dependent sampling designs in whole-genome association studies of sequence data can reduce total sequencing costs with modest losses of statistical efficiency. In a quantitative trait (QT) analysis of data from the Genetic Analysis Workshop 17 mini-exome for unrelated individuals in the Asian subpopulation, we investigate alternative designs that sequence only 50% of the entire cohort. In addition to a simple random sampling design, we consider extreme-phenotype designs that are of increasing interest in genetic association analysis of QTs, especially in studies concerned with the detection of rare genetic variants. We also evaluate a novel sampling design in which all individuals have a nonzero probability of being selected into the sample but in which individuals with extreme phenotypes have a proportionately larger probability. We take differential sampling of individuals with informative trait values into account by inverse probability weighting using standard survey methods which thus generalizes to the source population. In replicate 1 data, we applied the designs in association analysis of Q1 with both rare and common variants in the FLT1 gene, based on knowledge of the generating model. Using all 200 replicate data sets, we similarly analyzed Q1 and Q4 (which is known to be free of association with FLT1) to evaluate relative efficiency, type I error, and power. Simulation study results suggest that the QT-dependent selection designs generally yield greater than 50% relative efficiency compared to using the entire cohort, implying cost-effectiveness of 50% sample selection and worthwhile reduction of sequencing costs.
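Inverse probability weighting, as used above to correct for trait-dependent selection, can be illustrated with a short Python sketch: each sequenced individual is weighted by the reciprocal of their selection probability, and the weights enter a weighted mean and a weighted least-squares slope of trait on genotype. The function names and the example probabilities are assumptions for illustration; the abstract refers to standard survey methods without giving details, and variance estimation is not shown.

```python
def ipw_weights(selection_probs):
    """Weights 1 / pi_i for individuals sampled with probability pi_i."""
    if any(p <= 0 or p > 1 for p in selection_probs):
        raise ValueError("selection probabilities must be in (0, 1]")
    return [1.0 / p for p in selection_probs]


def ipw_mean(values, selection_probs):
    """Weighted (ratio-type) estimate of the source-population mean."""
    w = ipw_weights(selection_probs)
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


def ipw_slope(genotypes, traits, selection_probs):
    """Weighted least-squares (slope, intercept) of trait on genotype."""
    w = ipw_weights(selection_probs)
    x_bar = ipw_mean(genotypes, selection_probs)
    y_bar = ipw_mean(traits, selection_probs)
    s_xy = sum(wi * (x - x_bar) * (y - y_bar)
               for wi, x, y in zip(w, genotypes, traits))
    s_xx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, genotypes))
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Individuals with extreme traits were sampled with higher probability.
genotypes = [0, 1, 2, 0, 1]
traits = [1.0, 3.0, 5.0, 1.0, 3.0]
probs = [0.9, 0.3, 0.9, 0.3, 0.5]
slope, intercept = ipw_slope(genotypes, traits, probs)
```

A design in which every individual has a nonzero selection probability, as in the novel design mentioned above, is what keeps these weights finite.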
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
1000 Genomes Project; association; collapsing methods; next-generation sequencing
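The collapsing idea summarized in the abstract above can be sketched as follows. This is an illustrative simplification, not any specific method evaluated by the GAW17 group: it assumes genotypes are coded as minor-allele counts (0, 1, 2), treats a variant as rare when its minor allele frequency is below an assumed threshold of 0.05, and uses the hypothetical helper name `collapse_gene`.

```python
def collapse_gene(genotypes, mafs, maf_threshold=0.05):
    """Collapse the rare variants of one gene into per-individual summaries.

    genotypes: one row per individual, one minor-allele count per variant.
    mafs: minor allele frequency of each variant, in the same column order.
    Returns (carriers, counts): a 0/1 indicator of carrying any rare minor
    allele, and the total number of rare minor alleles carried.
    """
    # Columns that qualify as rare under the assumed threshold.
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    counts = [sum(row[j] for j in rare) for row in genotypes]
    carriers = [1 if c > 0 else 0 for c in counts]
    return carriers, counts
```

Either summary can then be tested against the phenotype with a single-variable test, which is how collapsing sidesteps the low allele counts of individual rare variants.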
The role of rare genetic variation in the etiology of complex disease remains unclear. However, the development of next-generation sequencing technologies offers the experimental opportunity to address this question. Several novel statistical methodologies have been recently proposed to assess the contribution of rare variation to complex disease etiology. Nevertheless, no empirical estimates comparing their relative power are available. We therefore assessed the parameters that influence their statistical power in 1,998 individuals Sanger-sequenced at seven genes by modeling different distributions of effect, proportions of causal variants, and direction of the associations (deleterious, protective, or both) in simulated continuous trait and case/control phenotypes. Our results demonstrate that the power of recently proposed statistical methods depends strongly on the underlying hypotheses concerning the relationship of phenotypes with each of these three factors. No method demonstrates consistently acceptable power despite this large sample size, and the performance of each method depends upon the underlying assumption of the relationship between rare variants and complex traits. Sensitivity analyses are therefore recommended to compare the stability of the results arising from different methods, and promising results should be replicated using the same method in an independent sample. These findings provide guidance in the analysis and interpretation of the role of rare base-pair variation in the etiology of complex traits and diseases.
There is now evidence that rare variants can contribute to the etiology of complex disease. Next generation sequencing technologies have enabled their detection in large cohorts, and new statistical methods have been proposed to ascertain their association with complex diseases and traits in order to improve power over single-marker analysis. Each of these new methods assumes a particular nature of the relationship between rare variants and complex disease, yet these hypotheses have been largely unverified. Therefore we sought to compare the power of commonly used and novel statistical methods for rare variants using Sanger sequencing data from 1,998 individuals sequenced at 7 genes by simulating several phenotypes under models spanning a spectrum of the common hypotheses concerning such associations. While all methods perform reasonably well under their own model-specific hypotheses, no single method gives consistently acceptable power when these hypotheses are violated. Unlike GWAS, wherein all variants can often be tested using the same method across the entire genome, the analysis and interpretation of sequencing studies will therefore be considerably more challenging.
Next-generation sequencing technologies are rapidly changing the field of genetic epidemiology and enabling exploration of the full allele frequency spectrum underlying complex diseases. Although sequencing technologies have shifted our focus toward rare genetic variants, statistical methods traditionally used in genetic association studies are inadequate for estimating effects of low minor allele frequency variants. For our study, we use the Genetic Analysis Workshop 17 data from 697 unrelated individuals (genotypes for 24,487 autosomal variants from 3,205 genes). We apply a Bayesian hierarchical mixture model to identify genes associated with a simulated binary phenotype using a transformed genotype design matrix weighted by allele frequencies. A Metropolis-Hastings algorithm is used to jointly sample each indicator variable and additive genetic effect pair from its conditional posterior distribution, and the remaining parameters are sampled by Gibbs sampling. This method identified 58 genes with a posterior probability greater than 0.8 for being associated with the phenotype. One of these 58 genes, PIK3C2B, was correctly identified as being associated with affected status based on the simulation process. This project demonstrates the utility of Bayesian hierarchical mixture models using a transformed genotype matrix to detect genes containing rare and common variants associated with a binary phenotype.
Identity by descent (IBD) can be reliably detected for long shared DNA segments, which are found in related individuals. However, many studies contain cohorts of unrelated individuals that share only short IBD segments. New sequencing technologies facilitate identification of short IBD segments through rare variants, which convey more information on IBD than common variants. Current IBD detection methods, however, are not designed to use rare variants for the detection of short IBD segments. Short IBD segments reveal genetic structures at high resolution. Therefore, they can help to improve imputation and phasing, to increase genotyping accuracy for low-coverage sequencing and to increase the power of association studies. Since short IBD segments are further assumed to be old, they can shed light on the evolutionary history of humans. We propose HapFABIA, a computational method that applies biclustering to identify very short IBD segments characterized by rare variants. HapFABIA is designed to detect short IBD segments in genotype data that were obtained from next-generation sequencing, but can also be applied to DNA microarray data. Especially in next-generation sequencing data, HapFABIA exploits rare variants for IBD detection. HapFABIA significantly outperformed competing algorithms at detecting short IBD segments on artificial and simulated data with rare variants. HapFABIA identified 160 588 different short IBD segments characterized by rare variants with a median length of 23 kb (mean 24 kb) in data for chromosome 1 of the 1000 Genomes Project. These short IBD segments contain 752 000 single nucleotide variants (SNVs), which account for 39% of the rare variants and 23.5% of all variants. The vast majority—152 000 IBD segments—are shared by Africans, while only 19 000 and 11 000 are shared by Europeans and Asians, respectively. 
IBD segments that match the Denisova or the Neandertal genome are found significantly more often in Asians and Europeans but also, in some cases exclusively, in Africans. The lengths of IBD segments and their sharing between continental populations indicate that many short IBD segments from chromosome 1 existed before humans migrated out of Africa. Thus, rare variants that tag these short IBD segments predate human migration from Africa. The software package HapFABIA is available from Bioconductor. All data sets, result files and programs for data simulation, preprocessing and evaluation are supplied at http://www.bioinf.jku.at/research/short-IBD.
ARNT, a member of the basic helix-loop-helix family of transcription factors, is located on human chromosome 1q21–q24, a region which showed well replicated linkage to type 2 diabetes. We hypothesized that common polymorphisms in the ARNT gene might increase the susceptibility to type 2 diabetes through impaired glucose-stimulated insulin secretion.
We selected 9 single nucleotide polymorphisms to tag common variation across the ARNT gene. Additionally we searched for novel variants in functional coding domains in European American and African American samples. Case-control studies were performed in 191 European American individuals with type 2 diabetes and 187 nondiabetic European American control individuals, and in 372 African American individuals with type 2 diabetes and 194 African American control individuals. Metabolic effects of ARNT variants were examined in 122 members of 26 European American families from Utah and in 225 unrelated individuals from Arkansas. Gene expression was tested in 8 sibling pairs discordant for type 2 diabetes.
No nonsynonymous variants or novel polymorphisms were identified. No SNP was associated with type 2 diabetes in either African Americans or European Americans, but among nondiabetic European American individuals, ARNT SNPs rs188970 and rs11204735 were associated with acute insulin response (AIRg; p ≤ 0.005). SNP rs2134688 interacted with body mass index to alter β-cell compensation to insulin resistance (disposition index; p = 0.004). No significant difference in ARNT mRNA levels was observed in transformed lymphocytes from sibling pairs discordant for type 2 diabetes.
Common ARNT variants are unlikely to explain the linkage signal on chromosome 1q, but may alter insulin secretion in nondiabetic subjects. Our studies cannot exclude a role for rare variants or variants of small (< 1.6) effect size.
We evaluate four association tests for rare variants—the combined multivariate and collapsing (CMC) method, two weighted-sum methods, and a variable threshold method—by applying them to the simulated data sets of unrelated individuals in the Genetic Analysis Workshop 17 (GAW17) data. The family-wise error rate (FWER) and average power are used as criteria for evaluation. Our results show that when all nonsynonymous SNPs (rare variants and common variants) in a gene are jointly analyzed, the CMC method fails to control the FWER; when only rare variants (single-nucleotide polymorphisms with minor allele frequency less than 0.05) are analyzed, all four methods can control FWER well. All four methods have comparable power, which is low for the analysis of the GAW17 data sets. Three of the methods (not including the CMC method) involve estimation of p-values using permutation procedures that either can be computationally intensive or generate inflated FWERs. We adapt a fast permutation procedure into these three methods. The results show that using the fast permutation procedure can produce FWERs and average powers close to the values obtained from the standard permutation procedure on the GAW17 data sets. The standard permutation procedure is computationally intensive.
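The standard permutation procedure mentioned above can be sketched as follows. This is a generic illustration, not the fast permutation procedure adapted in the study. It assumes a binary case/control label and a per-individual gene score (for example a weighted sum of rare alleles), and the names `mean_difference` and `permutation_p_value` are hypothetical.

```python
import random


def mean_difference(scores, labels):
    """Absolute difference in mean gene score between cases (1) and controls (0)."""
    cases = [s for s, lab in zip(scores, labels) if lab == 1]
    controls = [s for s, lab in zip(scores, labels) if lab == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(scores, labels, n_permutations=999, seed=0):
    """Empirical p-value obtained by shuffling phenotype labels.

    The statistic is recomputed on every shuffle, which is why the
    standard procedure becomes costly when many genes are tested.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean_difference(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if mean_difference(scores, shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_permutations + 1)
```

With 999 permutations the smallest attainable p-value is 0.001, so controlling a family-wise error rate across thousands of genes requires far more permutations per gene, which illustrates the computational burden noted in the abstract.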
Previously described methods for the combined analysis of common and rare variants have disadvantages such as requiring an arbitrary classification of variants or permutation testing to assess statistical significance. Here we propose a novel method which implements a weighting scheme based on allele frequencies observed in both cases and controls. Because the test is unbiased, scores can be analyzed with a standard t-test. To test its validity we applied it to data for common, rare, and very rare variants simulated under the null hypothesis. To test its power we applied it to simulated data in which association was present, including data using the observed allele frequencies of common and rare variants in NOD2 previously reported in cases of Crohn’s disease and controls. The method produced results that conformed well to those expected under the null hypothesis. It demonstrated more power to detect association when rare and common variants were analyzed jointly, the power further increasing when rare variants were assigned higher weights. 20,000 analyses of a gene containing 62 variants could be performed in 80 minutes on a laptop. This approach shows promise for the analysis of data currently emerging from genome wide sequencing studies.
common; rare; variant; sequence; genome; exome
A large number of rare genetic variants have been discovered with the development of sequencing technology and the lowering of sequencing costs. Rare variant analysis may help identify novel genes associated with diseases and quantitative traits, adding to our understanding of the heritability of these phenotypes. Many statistical methods for rare variant analysis have been developed in recent years, but some of them require the strong assumption that all rare variants in the analysis share the same direction of effect, and others, which require permutation to calculate p-values, are computationally intensive. Among these methods, the sequence kernel association test (SKAT) is a powerful method under many different scenarios. It does not require any assumption on the directionality of effects, and statistical significance is computed analytically. In this paper, we extend SKAT to be applicable to family data. The family-based SKAT (famSKAT) has a different test statistic and null distribution compared to SKAT, but is equivalent to SKAT when there is no familial correlation. Our simulation studies show that SKAT has inflated type I error if familial correlation is inappropriately ignored, but has appropriate type I error if applied to a single individual per family to obtain an unrelated subset. In contrast, famSKAT has the correct type I error when analyzing correlated observations, and it has higher power than competing methods in many different scenarios. We illustrate our approach by analyzing the association of rare genetic variants with glycemic traits from the Framingham Heart Study.
rare variant analysis; quantitative traits; family samples; heritability; linear mixed effects model
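As a point of reference for the abstract above, the SKAT statistic for unrelated individuals with a weighted linear kernel can be written as Q = Σ_j w_j (Σ_i g_ij (y_i − μ))², where g_ij is the genotype of individual i at variant j. The sketch below computes only this quantity, under the simplifying assumption of an intercept-only null model (μ is the sample mean); it does not implement the famSKAT statistic or either null distribution, and the name `skat_q` and the caller-supplied weights are hypothetical.

```python
def skat_q(traits, genotypes, weights):
    """Variance-component score statistic for unrelated individuals.

    traits: quantitative trait value per individual.
    genotypes: one row per individual, one minor-allele count per variant.
    weights: one non-negative weight per variant (supplied by the caller).
    """
    n = len(traits)
    mu = sum(traits) / n  # intercept-only null model (assumption)
    residuals = [y - mu for y in traits]
    q = 0.0
    for j, w in enumerate(weights):
        # Score for variant j: genotype-weighted sum of null residuals.
        score = sum(genotypes[i][j] * residuals[i] for i in range(n))
        # Squaring makes the statistic insensitive to the direction of effect.
        q += w * score * score
    return q
```

Because each variant's score is squared before summing, risk and protective variants do not cancel, which is the property the abstract refers to when it says SKAT makes no assumption on the directionality of effects.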
Rapid advances in sequencing technologies set the stage for the large-scale medical sequencing efforts to be performed in the near future, with the goal of assessing the importance of rare variants in complex diseases. The discovery of new disease susceptibility genes requires powerful statistical methods for rare variant analysis. The low frequency and the expected large number of such variants pose great difficulties for the analysis of these data. We propose here a robust and powerful testing strategy to study the role rare variants may play in affecting susceptibility to complex traits. The strategy is based on assessing whether rare variants in a genetic region collectively occur at significantly higher frequencies in cases compared with controls (or vice versa). A main feature of the proposed methodology is that, although it is an overall test assessing a possibly large number of rare variants simultaneously, the disease variants can be both protective and risk variants, with moderate decreases in statistical power when both types of variants are present. Using simulations, we show that this approach can be powerful under complex and general disease models, as well as in larger genetic regions where the proportion of disease susceptibility variants may be small. Comparisons with previously published tests on simulated data show that the proposed approach can have better power than the existing methods. An application to a recently published study on Type-1 Diabetes finds rare variants in gene IFIH1 to be protective against Type-1 Diabetes.
Risk of common diseases, such as diabetes and heart disease, is influenced by a complex interaction among genetic and environmental factors. Most of the disease-association studies conducted so far have focused on common variants, which are widely available on genotyping platforms. However, recent advances in sequencing technologies pave the way for large-scale medical sequencing studies with the goal of elucidating the role rare variants may play in affecting susceptibility to complex traits. The large number of rare variants and their low frequencies pose great challenges for the analysis of these data. We present here a novel testing strategy, based on a weighted-sum statistic, that is less sensitive than existing methods to the presence of both risk and protective variants in the genetic region under investigation. We show applications to simulated data and to a real dataset on Type-1 Diabetes.