Genetic Analysis Workshop 17 (GAW17) provided a platform for evaluating existing statistical genetic methods and for developing novel methods to analyze rare variants that modulate complex traits. In this article, we present an overview of the 1000 Genomes Project exome data and simulated phenotype data that were distributed to GAW17 participants for analyses, the different issues addressed by the participants, and the process of preparation of manuscripts resulting from the discussions during the workshop.
We propose a nonparametric Bayes-based clustering algorithm to detect associations with rare and common single-nucleotide polymorphisms (SNPs) for quantitative traits. Unlike current methods, our approach identifies associations with rare genetic variants at the variant level, not the gene level. In this method, we use a Dirichlet process prior for the distribution of SNP-specific regression coefficients, conduct hierarchical clustering with a distance measure derived from posterior pairwise probabilities of two SNPs having the same regression coefficient, and explore data-driven approaches to select the number of clusters. SNPs falling inside the largest cluster have relatively low or close to zero estimates of regression coefficients and are considered not associated with the trait. SNPs falling outside the largest cluster have relatively high estimates of regression coefficients and are considered potential risk variants. Using the data from the Genetic Analysis Workshop 17, we successfully detected associations with both rare and common SNPs for a quantitative trait. We conclude that our method provides a novel and broadly applicable strategy for obtaining association results with a reasonably low proportion of false discovery and that it can be routinely used in resequencing studies.
Genetic markers with rare variants are spread out in the genome, making it necessary and difficult to consider them in genetic association studies. Consequently, wisely combining rare variants into “composite” markers may facilitate meaningful analyses. In this paper, we propose a novel approach of analyzing rare variant data by incorporating the least absolute shrinkage and selection operator technique. We applied this method to the Genetic Analysis Workshop 17 data, and our results suggest that this new approach is promising. In addition, we took advantage of having 200 phenotype replications and assessed the performance of our approach by means of repeated classification tree analyses. Our method and analyses were performed without knowledge of the underlying simulating model. Our method identified 38 markers (in 65 genes) that are significantly associated with the phenotype Affected and correctly identified two causal genes, SIRT1 and PDGFD.
The unrelated individuals sample from Genetic Analysis Workshop 17 consists of a small number of subjects from eight population samples and genetic data composed mostly of rare variants. We compare two simple approaches to collapsing rare variants within genes for their utility in identifying genes that affect phenotype. We also compare results from stratified analyses to those from a pooled analysis that uses ethnicity as a covariate. We found that the two collapsing approaches were similarly effective in identifying genes that contain causative variants in these data. However, including population as a covariate was not an effective substitute for analyzing the subpopulations separately when only one subpopulation contained a rare variant linked to the phenotype.
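To illustrate what collapsing rare variants within a gene can look like, the following minimal Python sketch builds two per-subject gene scores: an indicator (the subject carries at least one rare minor allele) and a count (the total number of rare minor alleles). The abstract does not say which two collapsing approaches were compared, so these two rules, the 0/1/2 genotype coding, the frequency threshold, and all function names are assumptions made for illustration.

```python
def minor_allele_freq(genotypes):
    """Frequency of the counted allele, given 0/1/2 allele counts per subject.

    Assumes the counted allele is the minor allele (an assumption of this sketch).
    """
    return sum(genotypes) / (2 * len(genotypes))


def collapse_gene(variants, maf_threshold=0.01):
    """Collapse the rare variants of one gene into per-subject scores.

    variants: list of per-variant genotype lists, one entry per subject.
    Returns (indicator, count), each a list with one value per subject.
    """
    # Keep only polymorphic variants below the (hypothetical) rarity threshold.
    rare = [g for g in variants if 0 < minor_allele_freq(g) < maf_threshold]
    n_subjects = len(variants[0])
    # Count score: total number of rare minor alleles carried by each subject.
    count = [sum(g[i] for g in rare) for i in range(n_subjects)]
    # Indicator score: 1 if the subject carries any rare minor allele.
    indicator = [1 if c > 0 else 0 for c in count]
    return indicator, count


# Toy data: 5 subjects, 3 variants; the threshold is raised to 0.25
# only because the toy sample is tiny.
toy_variants = [
    [0, 1, 0, 0, 0],  # frequency 0.1, treated as rare
    [0, 0, 0, 2, 0],  # frequency 0.2, treated as rare
    [1, 1, 0, 1, 1],  # frequency 0.4, treated as common and ignored
]
toy_indicator, toy_count = collapse_gene(toy_variants, maf_threshold=0.25)
```

Either score can then be used as a single predictor per gene, within each subpopulation separately or in a pooled model with population as a covariate.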
Existing methods for analyzing rare variant data focus on collapsing a group of rare variants into a single common variant; collapsing is based on an intuitive function of the rare variant genotype information, such as an indicator function or a weighted sum. It is more natural, however, to take into account the single-nucleotide polymorphism (SNP) interactions informed directly by the data. We propose a novel tree-based method that automatically detects SNP interactions and generates candidate markers from the original pool of rare variants. In addition, we take advantage of having 200 phenotype replications in the Genetic Analysis Workshop 17 data to assess the candidate markers by means of repeated logistic regressions. This new approach shows potential for rare variant analysis. We correctly identify the association between the gene FLT1 and the phenotype Affected, although our results also contain false positives. Our analyses are performed without knowledge of the underlying simulating model.
Identifying rare variants that are responsible for complex disease has been promoted by advances in sequencing technologies. However, statistical methods that can handle the vast amount of data generated and that can interpret the complicated relationship between disease and these variants have lagged. We apply a zero-inflated Poisson regression model to take into account the excess of zeros caused by the extremely low frequency of the 24,487 exonic variants in the Genetic Analysis Workshop 17 data. We grouped the 697 subjects in the data set as Europeans, Asians, and Africans based on principal components analysis and found the total number of rare variants per gene for each individual. We then analyzed these collapsed variants based on the assumption that rare variants are enriched in a group of people affected by a disease compared to a group of unaffected people. We also tested the hypothesis with quantitative traits Q1, Q2, and Q4. Analyses performed on the combined 697 individuals and on each ethnic group yielded different results. For the combined population analysis, we found that UGT1A1, which was not part of the simulation model, was associated with disease liability and that FLT1, which was a causal locus in the simulation model, was associated with Q1. Of the causal loci in the simulation models, FLT1 and KDR were associated with Q1 and VNN1 was correlated with Q2. No significant genes were associated with Q4. These results show the feasibility and capability of our new statistical model to detect multiple rare variants influencing disease risk.
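The zero-inflated Poisson distribution mentioned above mixes a point mass at zero with an ordinary Poisson count, which suits per-gene rare variant counts that are zero for most subjects. The sketch below shows only the probability mass function and an intercept-only log-likelihood; the regression on covariates used in the abstract is not reproduced, and the names `zip_pmf`, `zip_log_likelihood`, `lam`, and `pi_zero` are hypothetical.

```python
import math


def zip_pmf(k, lam, pi_zero):
    """P(Y = k) for a zero-inflated Poisson.

    With probability pi_zero the count is a structural zero;
    otherwise it is drawn from Poisson(lam).
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson


def zip_log_likelihood(counts, lam, pi_zero):
    """Log-likelihood of observed counts under one shared (lam, pi_zero)."""
    return sum(math.log(zip_pmf(k, lam, pi_zero)) for k in counts)


# Toy per-subject rare variant counts for one gene (mostly zeros).
toy_counts = [0, 0, 0, 0, 1, 0, 0, 2, 0, 0]
ll_zip = zip_log_likelihood(toy_counts, lam=1.0, pi_zero=0.6)
ll_poisson = zip_log_likelihood(toy_counts, lam=1.0, pi_zero=0.0)
```

Setting `pi_zero=0.0` recovers the ordinary Poisson model, so the two log-likelihoods can be compared directly on the same counts.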
Machine learning approaches are an attractive option for analyzing large-scale data to detect genetic variants that contribute to variation of a quantitative trait, without requiring specific distributional assumptions. We evaluate two machine learning methods, random forests and logic regression, and compare them to standard simple univariate linear regression, using the Genetic Analysis Workshop 17 mini-exome data. We also apply these methods after collapsing multiple rare variants within genes and within gene pathways. Linear regression and the random forest method performed better when rare variants were collapsed based on genes or gene pathways than when each variant was analyzed separately. Logic regression performed better when rare variants were collapsed based on genes rather than on pathways.
The common disease/rare variant hypothesis predicts that rare variants with large effects will have a strong impact on corresponding phenotypes. Therefore it is assumed that rare functional variants are enriched in the extremes of the phenotype distribution. In this analysis of the Genetic Analysis Workshop 17 data set, my aim is to detect genes with rare variants that are associated with quantitative traits using two general approaches: analyzing the association with the complete distribution of values by means of linear regression and using statistical tests based on the tails of the distribution (bottom 10% of values versus top 10%). Three methods are used for this extreme phenotype approach: Fisher’s exact test, weighted-sum method, and beta method. Rare variants were collapsed on the gene level. Linear regression including all values provided the highest power to detect rare variants. Of the three methods used in the extreme phenotype approach, the beta method performed best. Furthermore, the sample size was enriched in this approach by adding additional samples with extreme phenotype values. Doubling the sample size using this approach, which corresponds to only 40% of sample size of the original continuous trait, yielded a comparable or even higher power than linear regression. If samples are selected primarily for sequencing, enriching the analysis by gathering a greater proportion of individuals with extreme values in the phenotype of interest rather than in the general population leads to a higher power to detect rare variants compared to analyzing a population-based sample with equivalent sample size.
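The tail-based comparison described above can be sketched as follows: pick the subjects in the bottom and top fraction of the trait distribution, cross-tabulate tail membership against carrier status for a collapsed gene, and apply Fisher's exact test. This is a minimal standard-library sketch under assumptions: the function names are hypothetical, carrier status is taken as a 0/1 flag per subject, and the two-sided p-value sums the probabilities of all tables no more likely than the observed one. The weighted-sum and beta methods from the abstract are not shown.

```python
from math import comb


def tail_indices(trait, fraction=0.10):
    """Indices of subjects in the bottom and top `fraction` of trait values."""
    order = sorted(range(len(trait)), key=lambda i: trait[i])
    k = max(1, int(len(trait) * fraction))
    return order[:k], order[-k:]


def tail_carrier_table(trait, carrier, fraction=0.10):
    """2x2 counts (a, b, c, d): top carriers, top non-carriers,
    bottom carriers, bottom non-carriers."""
    low, high = tail_indices(trait, fraction)
    a = sum(carrier[i] for i in high)
    c = sum(carrier[i] for i in low)
    return a, len(high) - a, c, len(low) - c


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


# Toy example: 6 subjects, tails widened to 50% because the sample is tiny.
toy_trait = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_carrier = [0, 0, 0, 1, 1, 1]
toy_table = tail_carrier_table(toy_trait, toy_carrier, fraction=0.5)
toy_p = fisher_exact_two_sided(*toy_table)
```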
Both common variants and rare variants are involved in the etiology of most complex diseases in humans. Developments in sequencing technology have led to the identification of a high density of rare variant single-nucleotide polymorphisms (SNPs) on the genome, each of which affects at most 1% of the population. Genotypes derived from these SNPs allow one to study the involvement of rare variants in common human disorders. Here, we propose an association screening approach that treats genes as units of analysis. SNPs within a gene are used to create partitions of individuals, and inverse-probability weighting is used to overweight genotypic differences observed on rare variants. Association between a phenotype trait and the constructed partition is then evaluated. We consider three association tests (one-way ANOVA, chi-square test, and the partition retention method) and compare these strategies using the simulated data from the Genetic Analysis Workshop 17. Several genes that contain causal SNPs were identified by the proposed method as top genes.
Human genome resequencing technologies are becoming ever more affordable and provide a valuable source of data about rare genetic variants in the human genome. Such rare variation may play an important role in explaining the missing heritability of complex human traits. We implement an existing method for analyzing rare variants by testing for association with the mutational load across genes. In this study, we make use of simulated data from the Genetic Analysis Workshop 17 to assess the power of this approach to detect association with simulated quantitative and dichotomous phenotypes and to evaluate the impact of missing genotypes on the power of the analysis. According to our results, the mutational-load-based rare variant analysis method is relatively robust to call rate and is adequately powered for genome-wide association analysis.
Genome-wide association studies have been firmly established in investigations of the associations between common genetic variants and complex traits or diseases. However, a large portion of complex traits and diseases cannot be explained well by common variants. Detecting rare functional variants has therefore become both a trend and a necessity. Because rare variants have such a small minor allele frequency (e.g., <0.05), detecting functional rare variants is challenging. Group iterative sure independence screening (ISIS), a fast group selection tool, was developed to select important genes and the single-nucleotide polymorphisms within them. The performance of the group ISIS and group penalization methods is compared for detecting important genes in the Genetic Analysis Workshop 17 data. The results suggest that the group ISIS is an efficient tool to discover genes and single-nucleotide polymorphisms associated with phenotypes.
Using single-nucleotide polymorphism (SNP) genotypes from the 1000 Genomes Project pilot3 data provided for Genetic Analysis Workshop 17 (GAW17), we applied Bayesian network structure learning (BNSL) to identify potential causal SNPs associated with the Affected phenotype. We focus on the setting in which target genes that harbor causal variants have already been chosen for resequencing; the goal was to detect true causal SNPs from among the measured variants in these genes. Examining all available SNPs in the known causal genes, BNSL produced a Bayesian network from which subsets of SNPs connected to the Affected outcome were identified and measured for statistical significance using the hypergeometric distribution. The exploratory phase of analysis for pooled replicates sometimes identified a set of involved SNPs that contained more true causal SNPs than expected by chance in the Asian population. Analyses of single replicates gave inconsistent results. No nominally significant results were found in analyses of African or European populations. Overall, the method was not able to identify sets of involved SNPs that included a higher proportion of true causal SNPs than expected by chance alone. We conclude that this method, as currently applied, is not effective for identifying causal SNPs that follow the simulation model for the GAW17 data set, which includes many rare causal SNPs.
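The hypergeometric significance check described above asks how likely it is that a randomly drawn SNP set of the same size would contain at least as many true causal SNPs as the set linked to the outcome. A minimal sketch of that upper-tail probability follows; the function name, argument names, and the toy numbers are illustrative assumptions, not values from the study.

```python
from math import comb


def hypergeom_upper_tail(n_snps, n_causal, n_selected, n_hits):
    """P(X >= n_hits) when n_selected SNPs are drawn without replacement
    from n_snps SNPs, of which n_causal are truly causal."""
    upper = min(n_causal, n_selected)
    favorable = sum(
        comb(n_causal, x) * comb(n_snps - n_causal, n_selected - x)
        for x in range(n_hits, upper + 1)
    )
    return favorable / comb(n_snps, n_selected)


# Toy numbers: 6 SNPs, 3 causal; the selected set of 3 contains all 3 causal.
toy_enrichment_p = hypergeom_upper_tail(6, 3, 3, 3)
```

In this toy case only one of the 20 possible three-SNP sets contains all three causal SNPs, so the probability is 1/20.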
We aim to identify rare variants that have large effects on trait variance using a cost-efficient strategy. We use an oligogenic segregation analysis as a prioritizing tool for whole-exome sequencing studies to identify families more likely to harbor rare variants, by estimating the mean number of quantitative trait loci (QTLs) in each family. We hypothesize that families with additional QTLs, relative to the other families, are more likely to segregate functional rare variants. We test the association of rare variants with the traits only in regions where at least modest evidence of linkage with the trait is observed, thereby reducing the number of tests performed. We found that family 7 harbored an estimated two, one, and zero additional QTLs for traits Q1, Q2, and Q4, respectively. Two rare variants (C4S4935 and C6S2981) segregating in family 7 were associated with Q1 and explained a substantial proportion of the observed linkage signal. These rare variants have 31 and 22 carriers, respectively, in the 128-member family and entered through a single but different founder. For Q2, we found one rare variant unique to family 7 that showed small effect and weak evidence of association; this was a false positive. These results are a proof of principle that prioritizing the sequencing of carefully selected extended families is a simple and cost-efficient design strategy for sequencing studies aiming at identifying functional rare variants.
Genome-wide association studies have successfully identified numerous loci at which common variants influence disease risks or quantitative traits of interest. Despite these successes, the variants identified by these studies have generally explained only a small fraction of the variations in the phenotype. One explanation may be that many rare variants that are not included in the common genotyping platforms may contribute substantially to the genetic variations of the diseases. Next-generation sequencing, which would better allow for the analysis of rare variants, is now becoming available and affordable; however, the presence of a large number of rare variants challenges the statistical endeavor to stably identify these disease-causing genetic variants. We conduct a genome-wide association study of Genetic Analysis Workshop 17 case-control data produced by the next-generation sequencing technique and propose that collapsing rare variants within each genetic region through a supervised dimension reduction algorithm leads to several macrovariants constructed for rare variants within each genetic region. A simultaneous association of the phenotype to all common variants and macrovariants is undertaken using a linear discriminant analysis using the penalized orthogonal-components regression algorithm. The results suggest that the proposed analysis strategy shows promise but needs further development.
Use of trait-dependent sampling designs in whole-genome association studies of sequence data can reduce total sequencing costs with modest losses of statistical efficiency. In a quantitative trait (QT) analysis of data from the Genetic Analysis Workshop 17 mini-exome for unrelated individuals in the Asian subpopulation, we investigate alternative designs that sequence only 50% of the entire cohort. In addition to a simple random sampling design, we consider extreme-phenotype designs that are of increasing interest in genetic association analysis of QTs, especially in studies concerned with the detection of rare genetic variants. We also evaluate a novel sampling design in which all individuals have a nonzero probability of being selected into the sample but in which individuals with extreme phenotypes have a proportionately larger probability. We account for the differential sampling of individuals with informative trait values by inverse probability weighting using standard survey methods, so that inferences generalize to the source population. In replicate 1 data, we applied the designs in association analysis of Q1 with both rare and common variants in the FLT1 gene, based on knowledge of the generating model. Using all 200 replicate data sets, we similarly analyzed Q1 and Q4 (which is known to be free of association with FLT1) to evaluate relative efficiency, type I error, and power. Simulation study results suggest that the QT-dependent selection designs generally yield greater than 50% relative efficiency compared to using the entire cohort, implying cost-effectiveness of 50% sample selection and worthwhile reduction of sequencing costs.
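Inverse probability weighting, as used above, gives each sequenced individual a weight equal to the reciprocal of that individual's selection probability, so that over-sampled extreme phenotypes do not dominate the estimate. The sketch below shows the weights and a weighted least-squares slope of a trait on a genotype score. It is a simplified illustration under assumptions: the function names and toy numbers are hypothetical, and the design-based variance estimation that standard survey methods provide is omitted.

```python
def ipw_weights(selection_prob):
    """Inverse probability weights; every probability must be nonzero."""
    return [1.0 / p for p in selection_prob]


def weighted_slope(x, y, w):
    """Weighted least-squares slope of y on x, with an intercept."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx


# Toy sample: extreme trait values were selected with probability 0.9,
# intermediate ones with probability 0.3 (illustrative numbers only).
toy_genotype = [0, 1, 2, 1]
toy_trait = [1.0, 3.0, 5.0, 3.0]
toy_weights = ipw_weights([0.9, 0.3, 0.9, 0.3])
toy_slope = weighted_slope(toy_genotype, toy_trait, toy_weights)
```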
We develop statistical methods for detecting rare variants that are associated with quantitative traits. We propose two strategies and their combination for this purpose: the iterative regression strategy and the extreme values strategy. In the iterative regression strategy, we use iterative regression on residuals and a multimarker association test to identify a group of significant variants. In the extreme values strategy, we use individuals with extreme trait values to select candidate genes and then test only these candidate genes. These two strategies are integrated into a hybrid approach through a weighting technology. We apply the proposed methods to analyze the Genetic Analysis Workshop 17 data set. The results show that the hybrid approach is the most powerful approach. Using the hybrid approach, the average power to detect causal genes for Q1 is about 40% and the powers to detect FLT1 and KDR are 100% and 68% for Q1, respectively. The powers to detect VNN3 and BCHE are 34% and 30% for Q2, respectively.
With recent advances in technology, deep sequencing data will be widely used to further the understanding of genetic influence on traits of interest. Therefore not only common variants but also rare variants need to be better used to exploit the new information provided by deep sequencing data. Recently, statistical approaches for analyzing rare variants in genetic association studies have been proposed, but many of them were designed only for dichotomous outcomes. We compare the type I error and power of several statistical approaches applicable to quantitative traits for collapsing and analyzing rare variant data within a defined gene region. In addition to comparing methods that consider only rare variants, such as indicator, count, and data-adaptive collapsing methods, we also compare methods that incorporate the analysis of common variants along with rare variants, such as CMC and LASSO regression. We find that the three methods used to collapse rare variants perform similarly in this simulation setting where all risk variants were simulated to have effects in the same direction. Further, we find that incorporating common variants is beneficial and using a LASSO regression to choose which common variants to include is most useful when there are few common risk variants compared to the total number of risk variants.
Genome-wide association studies have been used successfully to detect associations between common genetic variants and complex diseases, but common single-nucleotide polymorphisms (SNPs) detected by these studies explain only 5–10% of disease heritability. Alternatively, the common disease/rare variants hypothesis suggests that complex diseases are often caused by multiple rare variants with moderate to high effects. Under this hypothesis, the analysis of the cumulative effect of rare variants may thus help us discover the missing genetic variations. Collapsing all rare variants across a functional region is currently a popular method to find rare variants that may have a causal effect on certain diseases. However, the power of tests based on collapsing methods is often impaired by misclassification of functional variants. We develop a data-adaptive forward selection procedure that selectively chooses only variants that improve the association signal between functional regions and the disease risk. We apply our strategy to the Genetic Analysis Workshop 17 unrelated individuals data with quantitative traits. The type I error rate and the power of different collapsing functions are evaluated. The substantially higher power of the proposed strategy was demonstrated. The new method provides a useful strategy for the association study of sequencing data by taking advantage of the selection of rare variants.
Novel technologies allow sequencing of whole genomes and are considered an emerging approach for the identification of rare disease-associated variants. Recent studies have shown that multiple rare variants can explain a particular proportion of the genetic basis for disease. Following this assumption, we compare five collapsing approaches to test for groupwise association with disease status, using simulated data provided by Genetic Analysis Workshop 17 (GAW17). Variants are collapsed in different scenarios per gene according to different minor allele frequency (MAF) thresholds and their functionality. For comparing the different approaches, we consider the family-wise error rate and the power. Most of the methods could maintain the nominal type I error levels well for small MAF thresholds, but the power was generally low. Although the methods considered in this report are common approaches for analyzing rare variants, they performed poorly with respect to the simulated disease phenotype in the GAW17 data set.
Principal components analysis (PCA) has been successfully used to correct for population stratification in genome-wide association studies of common variants. However, rare variants also have a role in common disease etiology. Whether PCA successfully controls population stratification for rare variants has not been addressed. Thus we evaluate the effect of population stratification analysis on false-positive rates for common and rare variants at the single-nucleotide polymorphism (SNP) and gene level. We use the simulation data from Genetic Analysis Workshop 17 and compare false-positive rates with and without PCA at the SNP and gene level. We found that SNPs’ minor allele frequency (MAF) influenced the ability of PCA to effectively control false discovery. Specifically, PCA reduced false-positive rates more effectively in common SNPs (MAF > 0.05) than in rare SNPs (MAF < 0.01). Furthermore, at the gene level, although false-positive rates were reduced, power to detect true associations was also reduced using PCA. Taken together, these results suggest that sequence-level data should be interpreted with caution, because extremely rare SNPs may exhibit sporadic association that is not controlled using PCA.
Recently there has been great interest in identifying rare variants associated with common diseases. We apply several collapsing-based and kernel-based single-gene association tests to Genetic Analysis Workshop 17 (GAW17) rare variant association data with unrelated individuals without knowledge of the simulation model. We also implement modified versions of these methods using additional information, such as minor allele frequency (MAF) and functional annotation. For each of four given traits provided in GAW17, we use the Bayesian mixed-effects model to estimate the phenotypic variance explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests. After obtaining information on the GAW17 simulation model, we compare the performance of all methods and examine the top genes identified by those methods. We find that collapsing-based methods with weights based on MAFs are sensitive to the “lower MAF, larger effect size” assumption, whereas kernel-based methods are more robust when this assumption is violated. In addition, many false-positive genes identified by multiple methods often contain variants with exactly the same genotype distribution as the causal variants used in the simulation model. When the sample size is much smaller than the number of rare variants, it is more likely that causal and noncausal variants will share the same or similar genotype distribution. This likely contributes to the low power and large number of false-positive results of all methods in detecting causal variants associated with disease in the GAW17 data set.
Genome-wide association studies have successfully identified many common variants associated with complex human diseases. However, a large portion of the remaining heritability cannot be explained by these common variants. Exploring rare variants associated with diseases is now catching more attention. Several methods have been recently proposed for identification of rare variants. Among them, the fixed-threshold, weighted-sum, and variable-threshold methods are effective in combining the information of multiple variants into a functional unit; these approaches are commonly used. We evaluate the performance of these three methods. Based on our analyses of the Genetic Analysis Workshop 17 data, we find that no method is universally better than the others. Furthermore, adjusting for potential covariates can not only increase the true-positive proportions but also reduce the false-positive proportions. Our study concludes that there is no uniformly most powerful test among the three methods we compared (the fixed-threshold, weighted-sum, and variable-threshold methods), and their performances depend on the underlying genetic architecture of a disease.
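To illustrate the idea behind a weighted-sum score, the sketch below up-weights rarer variants by dividing each genotype by an estimate of its binomial standard deviation, with a pseudocount so that the weight stays finite. This is an assumption-laden simplification: the exact weights, the reference group used to estimate allele frequencies, and the permutation testing used in the evaluated methods are not given in the abstract and are not reproduced here, and all names are hypothetical.

```python
from math import sqrt


def smoothed_maf(genotypes):
    """Allele frequency with a +1 pseudocount (0/1/2 coding assumed)."""
    return (sum(genotypes) + 1) / (2 * len(genotypes) + 2)


def variant_weights(variants):
    """One weight per variant: larger for rarer variants."""
    return [1.0 / sqrt(q * (1.0 - q)) for q in map(smoothed_maf, variants)]


def weighted_sum_score(variants):
    """Per-subject gene score: weighted sum of minor-allele counts."""
    weights = variant_weights(variants)
    n_subjects = len(variants[0])
    return [
        sum(w * g[i] for w, g in zip(weights, variants))
        for i in range(n_subjects)
    ]


# Toy gene with 2 variants observed in 4 subjects.
toy_gene = [
    [0, 1, 0, 0],  # smoothed frequency 0.2
    [1, 1, 1, 0],  # smoothed frequency 0.4
]
toy_weights = variant_weights(toy_gene)
toy_scores = weighted_sum_score(toy_gene)
```

A fixed-threshold score, by contrast, would give weight 1 to every variant below a chosen frequency cutoff and weight 0 to the rest.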
A number of rare variant statistical methods have been proposed for analysis of the impending wave of next-generation sequencing data. To date, there are few direct comparisons of these methods on real sequence data. Furthermore, there is a strong need for practical advice on the proper analytic strategies for rare variant analysis. We compare four recently proposed rare variant methods (combined multivariate and collapsing, weighted sum, proportion regression, and cumulative minor allele test) on simulated phenotype and next-generation sequencing data as part of Genetic Analysis Workshop 17. Overall, we find that all analyzed methods have serious practical limitations on identifying causal genes. Specifically, no method has more than a 5% true discovery rate (percentage of truly causal genes among all those identified as significantly associated with the phenotype). Further exploration shows that all methods suffer from inflated false-positive error rates (chance that a noncausal gene will be identified as associated with the phenotype) because of population stratification and gametic phase disequilibrium between noncausal SNPs and causal SNPs. Furthermore, observed true-positive rates (chance that a truly causal gene will be identified as significantly associated with the phenotype) for each of the four methods were very low (<19%). The combination of larger than anticipated false-positive rates, low true-positive rates, and only about 1% of all genes being causal yields poor discriminatory ability for all four methods. Gametic phase disequilibrium and population stratification are important areas for further research in the analysis of rare variant data.
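The link between the three quantities named above (fraction of causal genes, true-positive rate, false-positive rate) and the true discovery rate is simple arithmetic, sketched below. The causal fraction of 1% and the true-positive rate of 19% are taken from the abstract as an upper bound; the false-positive rate of 5% is an assumed value chosen only for illustration, since the abstract does not report one.

```python
def true_discovery_rate(causal_fraction, tpr, fpr):
    """Expected share of truly causal genes among all genes declared significant."""
    true_positives = causal_fraction * tpr
    false_positives = (1.0 - causal_fraction) * fpr
    return true_positives / (true_positives + false_positives)


# 1% of genes causal, 19% true-positive rate, ASSUMED 5% false-positive rate.
toy_tdr = true_discovery_rate(causal_fraction=0.01, tpr=0.19, fpr=0.05)
```

Under these inputs the true discovery rate is about 3.7%, which shows how a small causal fraction keeps the rate low even when the false-positive rate looks modest.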
The Genetic Analysis Workshop 17 data we used comprise 697 unrelated individuals genotyped at 24,487 single-nucleotide polymorphisms (SNPs) from a mini-exome scan, using real sequence data for 3,205 genes annotated by the 1000 Genomes Project and simulated phenotypes. We studied 200 sets of simulated phenotypes of trait Q2. An important feature of this data set is that most SNPs are rare, with 87% of the SNPs having a minor allele frequency less than 0.05. For rare SNP detection, in this study we performed a least absolute shrinkage and selection operator (LASSO) regression and F tests at the gene level and calculated the generalized degrees of freedom to avoid any selection bias. For comparison, we also carried out linear regression and the collapsing method, which sums the rare SNPs, modified for a quantitative trait and with two different allele frequency thresholds. The aim of this paper is to evaluate these four approaches in this mini-exome data and compare their performance in terms of power and false positive rates. In most situations the LASSO approach is more powerful than linear regression and collapsing methods. We also note the difficulty in determining the optimal threshold for the collapsing method and the significant role that linkage disequilibrium plays in detecting rare causal SNPs. If a rare causal SNP is in strong linkage disequilibrium with a common marker in the same gene, power will be much improved.
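The selection behavior of LASSO regression comes from a soft-thresholding step that sets small coefficients exactly to zero. The sketch below implements plain coordinate descent for the objective (1/(2n))·||y − Xb||² + λ·||b||₁ on centered data without an intercept. It illustrates the generic technique only: the gene-level F tests and generalized degrees of freedom used in the study are not shown, and the function names, penalty value, and toy data are assumptions.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/(2n))*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.

    X: list of rows (subjects) of predictor values (e.g., SNP genotypes),
    assumed centered; y: centered trait values. No intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb for the current beta
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue  # monomorphic predictor carries no information
            # Correlation of predictor j with the partial residual.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new_beta = soft_threshold(rho, lam) / norm
            delta = new_beta - beta[j]
            resid = [r - c * delta for r, c in zip(resid, col)]
            beta[j] = new_beta
    return beta


# Toy data with two orthogonal predictors: a strong effect and a weak one.
toy_X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
toy_y = [2.0, -2.0, 0.1, -0.1]
toy_beta = lasso_coordinate_descent(toy_X, toy_y, lam=0.1)
```

In the toy data the weak predictor's coefficient is shrunk exactly to zero while the strong predictor's is kept but shrunk, which is the behavior that makes the method usable for variant selection.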