Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
1000 Genomes Project; association; collapsing methods; next-generation sequencing
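The collapsing idea described in the abstract above, combining multiple rare variants within a gene into one variable, can be illustrated with a minimal sketch. The function name, the 0/1/2 genotype coding, and the 0.01 frequency threshold are assumptions chosen for illustration; they are not taken from any specific GAW17 contribution.

```python
def collapse_rare_variants(genotypes, mafs, maf_threshold=0.01):
    """Return a 0/1 rare-variant carrier indicator per individual for one gene.

    genotypes: list of rows, one per individual; each row holds minor-allele
               counts (0, 1, or 2) for every variant in the gene.
    mafs:      minor allele frequency of each variant, in column order.
    maf_threshold: illustrative cutoff below which a variant counts as rare.
    """
    rare = [j for j, maf in enumerate(mafs) if maf < maf_threshold]
    # An individual is a carrier if any rare variant has at least one minor allele.
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]


# Toy gene with one common variant (column 0) and two rare variants.
genotypes = [
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [2, 0, 0],
]
mafs = [0.20, 0.004, 0.008]
carriers = collapse_rare_variants(genotypes, mafs)
print(carriers)
```

The resulting indicator can then be tested against a phenotype with any standard single-variable test, which is what lets collapsing sidestep the sparsity of individual rare alleles.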
Next-generation sequencing of large numbers of individuals presents challenges in data preparation, quality control, and statistical analysis because of the rarity of the variants. The Genetic Analysis Workshop 17 (GAW17) data provide an opportunity to survey existing methods and compare these methods with novel ones. Specifically, the GAW17 Group 2 contributors investigate existing and newly proposed methods and study design strategies to identify rare variants, predict functional variants, and/or examine quality control. We introduce the eight Group 2 papers, summarize their approaches, and discuss their strengths and weaknesses. For these investigations, some groups used only the genotype data, whereas others also used the simulated phenotype data. Although the eight Group 2 contributions covered a wide variety of topics under the general idea of identifying rare variants, they can be grouped into three broad categories according to their common research interests: functionality of variants and quality control issues, family-based analyses, and association analyses of unrelated individuals. The aims of the first subgroup were quite different. These were population structure analyses that used rare variants to predict functionality and examine the accuracy of genotype calls. The aims of the family-based analyses were to select which families should be sequenced and to identify high-risk pedigrees; the aim of the association analyses was to identify variants or genes with regression-based methods. However, power to detect associations was low in all three association studies. Thus this work shows opportunities for incorporating rare variants into the genetic and statistical analyses of common diseases.
1000 Genomes Project; association; collection of rare variants; family data; next-generation sequencing; regression; quality control
Recently there has been great interest in identifying rare variants associated with common diseases. We apply several collapsing-based and kernel-based single-gene association tests to Genetic Analysis Workshop 17 (GAW17) rare variant association data with unrelated individuals without knowledge of the simulation model. We also implement modified versions of these methods using additional information, such as minor allele frequency (MAF) and functional annotation. For each of four given traits provided in GAW17, we use the Bayesian mixed-effects model to estimate the phenotypic variance explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests. After obtaining information on the GAW17 simulation model, we compare the performance of all methods and examine the top genes identified by those methods. We find that collapsing-based methods with weights based on MAFs are sensitive to the “lower MAF, larger effect size” assumption, whereas kernel-based methods are more robust when this assumption is violated. In addition, many false-positive genes identified by multiple methods often contain variants with exactly the same genotype distribution as the causal variants used in the simulation model. When the sample size is much smaller than the number of rare variants, it is more likely that causal and noncausal variants will share the same or similar genotype distribution. This likely contributes to the low power and large number of false-positive results of all methods in detecting causal variants associated with disease in the GAW17 data set.
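To make the "lower MAF, larger effect size" assumption concrete, the sketch below computes a MAF-weighted burden score per individual. The weight 1/sqrt(p(1 - p)) is one commonly used choice and is an assumption here; the abstract does not state which weighting the authors applied, and the function name is hypothetical.

```python
import math


def maf_weighted_burden(genotypes, mafs):
    """Sum minor-allele counts per individual, up-weighting rarer variants.

    genotypes: list of rows (one per individual) of minor-allele counts.
    mafs:      minor allele frequencies, each strictly between 0 and 1.
    """
    # Assumed weighting: rarer variants (small p) receive larger weights.
    weights = [1.0 / math.sqrt(p * (1.0 - p)) for p in mafs]
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


genotypes = [
    [1, 0],  # carries the common variant only
    [0, 1],  # carries the rare variant only
    [1, 1],  # carries both
]
mafs = [0.5, 0.1]
scores = maf_weighted_burden(genotypes, mafs)
print(scores)
```

Because the rare variant receives the larger weight, a score built this way helps only when rarer variants really do have larger effects, which is consistent with the sensitivity the abstract reports when that assumption is violated.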
Currently there is a great deal of interest in developing methods for testing the role that rare variation plays in disease development. Here we propose a weighted association test that accumulates genetic variation across a signaling pathway. We evaluate our approach by analyzing simulated phenotype data from an exome sequencing study of 697 unrelated individuals from the Genetic Analysis Workshop 17 (GAW17) data set. Although our weighted approach identifies several interesting pathways associated with phenotype Q1, so does an alternative unweighted accumulation approach. Such a result is not unexpected because there is no systematic relationship between the allele frequency of a variant and its effect on phenotype in the GAW17 simulation model.
Genetic Analysis Workshop 17 (GAW17) focused on the transition from genome-wide association study designs and methods to the study designs and statistical genetic methods that will be required for the analysis of next-generation sequence data including both common and rare sequence variants. In the 166 contributions to GAW17, a wide variety of statistical methods were applied to simulated traits in population- and family-based samples, and results from these analyses were compared to the known generating model. In general, many of the statistical genetic methods used in the population-based sample identified causal sequence variants (SVs) when the estimated locus-specific heritability, as measured in the population-based sample, was greater than about 0.08. However, SVs with locus-specific heritabilities less than 0.03 were rarely identified consistently. In the family-based samples, many of the methods detected SVs that were rarer than those detected in the population-based sample, but the estimated locus-specific heritabilities for these rare SVs, as measured in the family-based samples, were substantially higher (>0.2) than their corresponding heritabilities in the population-based samples. Substantial inflation of the type I error rate was observed across a wide variety of statistical methods. Although many of the contributions found little inflation in type I error for Q4, a trait with no causal SVs, type I error rates for Q1 and Q2 were well above their nominal levels with the inflation for Q1 being higher than that for Q2. It seems likely that this inflation in type I error is due to correlations among SVs.
linkage; association; next-generation sequencing; computer simulation
We summarize the work done by the contributors to Group 13 at Genetic Analysis Workshop 17 (GAW17) and provide a synthesis of their data analyses. The Group 13 contributors used a variety of approaches to test associations of both rare variants and common single-nucleotide polymorphisms (SNPs) with the GAW17 simulated traits, implementing analytic methods that incorporate multiallelic genotypes and haplotypes. In addition to using a wide variety of statistical methods and approaches, the contributors exhibited a remarkable amount of flexibility and creativity in coding the variants and their genes and in evaluating their proposed approaches and methods. We describe and contrast their methods along three dimensions: (1) selection and coding of genetic entities for analysis, (2) method of analysis, and (3) evaluation of the results. The contributors consistently presented a strong rationale for using multiallelic analytic approaches. They indicated that power was likely to be increased by capturing the signals of multiple markers within genetic entities defined by sliding windows, haplotypes, genes, functional pathways, and the entire set of SNPs and rare variants taken in aggregate. Despite this variability, the methods were fairly consistent in their ability to identify two associated genes for each simulated trait. The first gene was selected for the largest number of causal alleles and the second for a high-frequency causal SNP. The presumed model of inheritance and choice of genetic entities are likely to have a strong effect on the outcomes of the analyses.
rare variants; sequence data; multiallelic data; Bayesian regression; penalized regression; tree-based clustering; pathway analysis; haplotypes
Genetic Analysis Workshop 16 (GAW16) was held September 17-20, 2008, in St. Louis, Missouri. The focus of GAW16 was on methods and challenges in analysis of single-nucleotide polymorphism (SNP) data from genome-wide scans. GAW16 attracted 221 participants from 12 countries. The 168 contributions were organized into 17 discussion groups of 6 to 17 papers each. Three data sets were available for analysis. Two of these were data from ongoing studies, generously provided by the investigators. The North American Rheumatoid Arthritis Consortium provided case-control data on rheumatoid arthritis, and the Framingham Heart Study made available information on cardiovascular risk factors for participants in three generations of pedigree data. The third data set included simulated phenotypes for participants in the Framingham Heart Study, using actual pedigree structures and genotypes. This volume includes a paper for each of the 17 discussion groups, summarizing their main findings.
single-nucleotide polymorphism; SNP; genome-wide scan; association; linkage; haplotype
The data set simulated for Genetic Analysis Workshop 17 was designed to mimic a subset of data that might be produced in a full exome screen for a complex disorder and related risk factors in order to permit workshop participants to investigate issues of study design and statistical genetic analysis. Real sequence data from the 1000 Genomes Project formed the basis for simulating a common disease trait with a prevalence of 30% and three related quantitative risk factors in a sample of 697 unrelated individuals and a second sample of 697 individuals in large, extended pedigrees. Called genotypes for 24,487 autosomal markers assigned to 3,205 genes and simulated affection status, quantitative traits, age, sex, pedigree relationships, and cigarette smoking were provided to workshop participants. The simulating model included both common and rare variants with minor allele frequencies ranging from 0.07% to 25.8% and a wide range of effect sizes for these variants. Genotype-smoking interaction effects were included for variants in one gene. Functional variants were concentrated in genes selected from specific biological pathways and were selected on the basis of the predicted deleteriousness of the coding change. For each sample, unrelated individuals and family, 200 replicates of the phenotypes were simulated.
In addition to methods that can identify common variants associated with susceptibility to common diseases, there has been increasing interest in approaches that can identify rare genetic variants. We use the simulated data provided to the participants of Genetic Analysis Workshop 17 (GAW17) to identify both rare and common single-nucleotide polymorphisms (SNPs) and pathways associated with disease status. We apply a rare variant collapsing approach and the usual association tests for common variants to identify candidates for further analysis using pathway-based and tree-based ensemble approaches. We use the mean log p-value approach to identify a top set of pathways and compare it to the pathways used in the simulation of the GAW17 data set. We conclude that the mean log p-value approach is able to identify those pathways in the top list, along with related pathways. We also apply the stochastic gradient boosting approach to the selected subset of SNPs. When we compare the result of this tree-based method with the list of SNPs used in the data set simulation, we observe a number of false positives in addition to the correct SNPs.
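A mean log p-value pathway score can be sketched as follows. The abstract does not give the exact formula, so this assumes the score is the average of -log10(p) over the members of a pathway; the pathway names and p-values are made up for illustration.

```python
import math


def mean_log_p(p_values):
    """Average of -log10(p) over the genes or SNPs assigned to one pathway."""
    return sum(-math.log10(p) for p in p_values) / len(p_values)


# Hypothetical pathways with made-up member p-values.
pathways = {
    "pathway_A": [0.001, 0.01, 0.1],
    "pathway_B": [0.5, 0.2, 0.8],
}
# Rank pathways from strongest to weakest average signal.
ranked = sorted(pathways, key=lambda name: mean_log_p(pathways[name]), reverse=True)
print(ranked)
```

Averaging on the log scale lets a pathway with several moderately small p-values rank highly even when no single member is individually significant.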
We present a new statistical method to identify genes in which one or more variants influence quantitative traits. We use the Genetic Analysis Workshop 17 (GAW17) data set of unrelated individuals as a test of the method on the raw GAW17 phenotypes and on residuals after fitting linear models to individual-based covariates. By performing appropriate randomization tests, we found many significant results for a proportion of the genes that contain variants that directly contribute to disease but that have an increased type I error for analyses of raw phenotypes. Power calculations show that our methods have the ability to reliably identify a subset of the loci contributing to disease. When we applied our method to derived phenotypes, we removed many false positives, giving appropriate type I error rates at little cost to power. The correlation between genome-wide heterozygosity and the value of the trait Q1 appears to drive much of the type I error in this data set.
Genetic Analysis Workshop 16 (GAW16) Problem 2 presented data from the Framingham Heart Study (FHS), an observational, prospective study of risk factors for cardiovascular disease begun in 1948. Data have been collected in three generations of family participants in the study and the data presented for GAW16 included phenotype data from all three generations, with four examinations of data collected repeatedly for the first two generations. The trait data consisted of information on blood pressure, hypertension treatment, lipid levels, diabetes and blood glucose, smoking, alcohol consumed, weight, and coronary heart disease incidence. Additionally, genotype data obtained through a genome-wide scan (FHS SHARe) of 550,000 single-nucleotide polymorphisms from Affymetrix chips were included with the GAW16 data. The genotype data were also used for GAW16 Problem 3, where simulated phenotypes were generated using the actual FHS genotypes. These data served to provide investigators with a rich resource to study the behavior of genome-wide scans with longitudinally collected family data and to develop and apply new procedures
We evaluate four association tests for rare variants—the combined multivariate and collapsing (CMC) method, two weighted-sum methods, and a variable threshold method—by applying them to the simulated data sets of unrelated individuals in the Genetic Analysis Workshop 17 (GAW17) data. The family-wise error rate (FWER) and average power are used as criteria for evaluation. Our results show that when all nonsynonymous SNPs (rare variants and common variants) in a gene are jointly analyzed, the CMC method fails to control the FWER; when only rare variants (single-nucleotide polymorphisms with minor allele frequency less than 0.05) are analyzed, all four methods can control FWER well. All four methods have comparable power, which is low for the analysis of the GAW17 data sets. Three of the methods (not including the CMC method) involve estimation of p-values using permutation procedures that either can be computationally intensive or generate inflated FWERs. We adapt a fast permutation procedure into these three methods. The results show that using the fast permutation procedure can produce FWERs and average powers close to the values obtained from the standard permutation procedure on the GAW17 data sets. The standard permutation procedure is computationally intensive.
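The standard permutation procedure mentioned above can be sketched for a single gene as follows. This is a generic label-permutation test on a burden score, not the authors' fast procedure; the test statistic (absolute difference in mean burden between cases and controls), the function name, and the defaults are assumptions.

```python
import random


def permutation_p_value(burden, affected, n_perm=999, seed=0):
    """Empirical p-value for a case-control difference in mean burden.

    burden:   per-individual burden score for one gene.
    affected: per-individual 0/1 status with at least one case and one control.
    """
    rng = random.Random(seed)

    def stat(labels):
        cases = [b for b, a in zip(burden, labels) if a == 1]
        controls = [b for b, a in zip(burden, labels) if a == 0]
        return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

    observed = stat(affected)
    labels = list(affected)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any genotype-phenotype link
        if stat(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


burden = [1, 1, 1, 1, 0, 0, 0, 0]
affected = [1, 1, 1, 1, 0, 0, 0, 0]
p = permutation_p_value(burden, affected)
print(p)
```

Each gene needs its own set of permutations, and the smallest attainable p-value is 1/(n_perm + 1), which is why the standard procedure becomes computationally heavy at stringent significance thresholds.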
The advance of high-throughput next-generation sequencing technology makes possible the analysis of rare variants. However, the investigation of rare variants in unrelated-individuals data sets faces the challenge of low power, and most methods circumvent the difficulty by using various collapsing procedures based on genes, pathways, or gene clusters. We suggest a new way to identify causal rare variants using the F-statistic and sliced inverse regression. The procedure is tested on the data set provided by the Genetic Analysis Workshop 17 (GAW17). After preliminary data reduction, we ranked markers according to their F-statistic values. Top-ranked markers were then subjected to sliced inverse regression, and those with higher absolute coefficients in the most significant sliced inverse regression direction were selected. The procedure yields good false discovery rates for the GAW17 data and thus is a promising method for future study on rare variants.
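The first step of this procedure, ranking markers by an F-statistic, can be sketched as below. The abstract does not specify how the statistic is defined, so this assumes a one-way ANOVA F comparing a quantitative trait across genotype groups; the marker names and data are hypothetical, and the sliced inverse regression step is not shown.

```python
def f_statistic(genotype, trait):
    """One-way ANOVA F-statistic of a trait across genotype groups."""
    groups = {}
    for g, y in zip(genotype, trait):
        groups.setdefault(g, []).append(y)
    n, k = len(trait), len(groups)
    if k < 2 or n <= k:
        return 0.0  # monomorphic marker or too few observations
    grand_mean = sum(trait) / n
    ss_between = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in groups.values()
    )
    ss_within = sum(
        (y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys
    )
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


trait = [1.0, 2.0, 3.0, 4.0]
markers = {
    "marker_1": [0, 0, 1, 1],  # genotype tracks the trait
    "marker_2": [0, 1, 0, 1],  # genotype weakly related to the trait
    "marker_3": [0, 0, 0, 0],  # monomorphic in this sample
}
ranking = sorted(markers, key=lambda m: f_statistic(markers[m], trait), reverse=True)
print(ranking)
```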
As part of Genetic Analysis Workshop 17 (GAW17), our group considered the application of novel and standard approaches to the analysis of genotype-phenotype association in next-generation sequencing data. Our group identified a major issue in the analysis of the GAW17 next-generation sequencing data: type I error and false-positive report probability rates higher than those expected based on empirical type I error levels (as high as 90%). Two main causes emerged: population stratification and long-range correlation (gametic phase disequilibrium) between rare variants. Population stratification was expected because of the diverse sample. Correlation between rare variants was attributable to both random causes (e.g., nearly 10,000 of 25,000 markers were private variants, and the sample size was small [n = 697]) and nonrandom causes (more correlation was observed than was expected by random chance). Principal components analysis was used to control for population structure and helped to minimize type I errors, but this was at the expense of identifying fewer causal variants. A novel multiple regression approach showed promise to handle correlation between markers. Further work is needed, first, to identify best practices for the control of type I errors in the analysis of sequencing data and then to explore and compare the many promising new aggregating approaches for identifying markers associated with disease phenotypes.
population structure; correlated markers; next-generation sequencing
We studied rheumatoid arthritis (RA) in the North American Rheumatoid Arthritis Consortium (NARAC) data (1499 subjects; 757 families). Identical methods were applied for studying RA in the Genetic Analysis Workshop 15 (GAW15) simulated data (with a priori knowledge of the simulation answers). Fifty replications of GAW15 simulated data had 3497 ± 20 subjects in 1500 nuclear families. Two new statistical methods were applied to transform the original phenotypes in these data: item response theory (IRT), to create a latent variable from nine classifying predictors, and a Blom transformation of the anti-CCP (anti-cyclic citrullinated peptide) variable. We fitted linear mixed-effects (LME) models to study the additive associations of 404 Illumina-genotyped single-nucleotide polymorphisms (SNPs) in the NARAC data, and of 17,820 SNPs in the GAW15 simulated data. In the GAW15 simulated data, the association with the Blom-transformed anti-CCP variable showed 100% sensitivity for SNP1, located in the major histocompatibility complex gene. In contrast, the association of SNP1 with the IRT latent variable showed only 24% sensitivity. From the simulated data, we conclude that the Blom transformation of the anti-CCP variable produced more reliable results than the latent variable from the qualitative combination of a group of RA risk factors. In the NARAC data, the significant RA-SNP associations found with both phenotype-transformation methods provided a trend that may point toward dynein and energy control genes. Finer genotyping in the NARAC data would provide more exact evidence for the contributions of chromosome 6 to RA.
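For readers unfamiliar with the Blom transformation, the sketch below shows the usual rank-based form, in which a value with rank r among n observations maps to the standard normal quantile of (r - 3/8)/(n + 1/4). Assigning average ranks to ties is an assumption, since the abstract does not say how ties were handled, and the input values are made up.

```python
from statistics import NormalDist


def blom_transform(values):
    """Map values to normal scores using Blom's rank-based formula."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


anti_ccp = [3.2, 1.1, 2.0]  # made-up measurements
scores = blom_transform(anti_ccp)
print(scores)
```

The transformed scores depend only on the ranks, so a heavily skewed variable becomes approximately normal before it enters a linear mixed-effects model.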
We applied our method of pairwise shared genomic segment (pSGS) analysis to high-risk pedigrees identified from the Genetic Analysis Workshop 17 (GAW17) mini-exome sequencing data set. The original shared genomic segment method focused on identifying regions shared by all case subjects in a pedigree; thus it can be sensitive to sporadic cases. Our new method examines sharing among all pairs of case subjects in a high-risk pedigree and then uses the mean sharing as the test statistic; in addition, the significance is assessed empirically based on the pedigree structure and linkage disequilibrium pattern of the single-nucleotide polymorphisms. Using all GAW17 replicates, we identified 18 unilineal high-risk pedigrees that contained excess disease (p < 0.01) and at least 15 meioses between case subjects. Eighteen rare causal variants were polymorphic in this set of pedigrees. Based on a significance threshold of 0.001, 72.2% (13/18) of these pedigrees were successfully identified with at least one region that contains a true causal variant. The regions identified included 4 of the possible 18 polymorphic causal variants. On average, 1.1 true positives and 1.7 false positives were identified per pedigree. In conclusion, we have demonstrated the potential of our new pSGS method for localizing rare disease causal variants in common disease using high-risk pedigrees and exome sequence data.
The identification of susceptibility genes for common, chronic disease presents great challenges. The development of novel statistical and computational methodologies to help identify these genes is an area of great necessity. Much research is ongoing, and the Genetic Analysis Workshop (GAW) is a venue for the dissemination and comparison of many of these methods. GAW15 included real data sets to look for disease susceptibility genes for rheumatoid arthritis (RA). RA is a complex, chronic inflammatory disease with several replicated disease genes, but much of the genetic variation in the phenotype remains unexplained. We applied two computational methods, namely multifactor dimensionality reduction (MDR) and grammatical evolution neural networks (GENN), to three data sets from GAW15. Although these analytic methods were applied with the intention of detecting multilocus models of association, both methods identified a strong single-locus effect of a single-nucleotide polymorphism (SNP) in PTPN22 that is significantly associated with RA. This SNP has previously been associated with RA in several other published studies. These results demonstrate that both MDR and GENN are capable of identifying a single-locus main effect, in addition to multilocus models of association. This is the first published comparison of the two methods. Because GENN employs an evolutionary computation search strategy in comparison to the exhaustive search strategy of MDR, it is encouraging that the two methods produced similar results. This comparison should be extended in future studies with both simulated and real data.
Using single-nucleotide polymorphism (SNP) genotypes from the 1000 Genomes Project pilot3 data provided for Genetic Analysis Workshop 17 (GAW17), we applied Bayesian network structure learning (BNSL) to identify potential causal SNPs associated with the Affected phenotype. We focus on the setting in which target genes that harbor causal variants have already been chosen for resequencing; the goal was to detect true causal SNPs from among the measured variants in these genes. Examining all available SNPs in the known causal genes, BNSL produced a Bayesian network from which subsets of SNPs connected to the Affected outcome were identified and measured for statistical significance using the hypergeometric distribution. The exploratory phase of analysis for pooled replicates sometimes identified a set of involved SNPs that contained more true causal SNPs than expected by chance in the Asian population. Analyses of single replicates gave inconsistent results. No nominally significant results were found in analyses of African or European populations. Overall, the method was not able to identify sets of involved SNPs that included a higher proportion of true causal SNPs than expected by chance alone. We conclude that this method, as currently applied, is not effective for identifying causal SNPs that follow the simulation model for the GAW17 data set, which includes many rare causal SNPs.
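The hypergeometric assessment described above asks whether a selected set of SNPs contains more true causal SNPs than chance would predict. A minimal sketch of the upper-tail probability follows; the function and parameter names are hypothetical, and the counts in the example are illustrative rather than taken from the GAW17 analysis.

```python
from math import comb


def hypergeom_upper_tail(n_total, n_causal, n_selected, n_causal_selected):
    """P(X >= n_causal_selected) when n_selected SNPs are drawn at random,
    without replacement, from n_total SNPs of which n_causal are causal."""
    denominator = comb(n_total, n_selected)
    upper = min(n_causal, n_selected)
    return sum(
        comb(n_causal, x) * comb(n_total - n_causal, n_selected - x)
        for x in range(n_causal_selected, upper + 1)
    ) / denominator


# Example: 10 SNPs, 3 causal; the network flags 4 SNPs, 2 of them causal.
p = hypergeom_upper_tail(10, 3, 4, 2)
print(p)
```

A small tail probability indicates that the flagged set is enriched for causal SNPs; a value near 1 indicates no enrichment beyond chance.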
This study, part of the Genetic Analysis Workshop 14 (GAW14), explored real Collaborative Study on the Genetics of Alcoholism data for linkage and association mapping between genetic polymorphisms (microsatellites and single-nucleotide polymorphisms [SNPs]) and beta (16.5–20 Hz) oscillations of the brain rhythms (ecb21). The ecb21 phenotype was adjusted for participant age and transformed to approximate a normal distribution. Phenotypes available for a total of 1,000 subjects were included in linkage analysis with microsatellite markers. Linkage analysis was performed only for chromosome 4, where a quantitative trait locus with a LOD score of 5.01 had been previously reported. Previous findings related this location to the γ-aminobutyric acid type A (GABAA) receptor. At the same location, our analysis showed a LOD score of 2.2. This decrease in the LOD score is the result of a drastic reduction (one-third) of the available GAW14 phenotypic data. We performed SNP and haplotype association analyses with the same phenotypic data under the linkage peak region on chromosome 4. Seven Affymetrix and two Illumina SNPs showed significant associations with the ecb21 phenotype. A haplotype combining SNPs TSC0044171 and TSC0551006 (the latter almost under the region of the GABAA genes) showed a significant association with ecb21 (p = 0.015) and a relatively high frequency in the sample studied. Our results affirmed that the GABA region has the potential to harbor genes that contribute quantitatively to the beta oscillation of the brain rhythms. Including the remaining 614 subjects, whose ecb21 data were missing in GAW14, could strengthen the associations, because these subjects have already been shown to contribute important information to the linkage analysis.
The Genetic Analysis Workshop 17 data we used comprise 697 unrelated individuals genotyped at 24,487 single-nucleotide polymorphisms (SNPs) from a mini-exome scan, using real sequence data for 3,205 genes annotated by the 1000 Genomes Project and simulated phenotypes. We studied 200 sets of simulated phenotypes of trait Q2. An important feature of this data set is that most SNPs are rare, with 87% of the SNPs having a minor allele frequency less than 0.05. For rare SNP detection, in this study we performed a least absolute shrinkage and selection operator (LASSO) regression and F tests at the gene level and calculated the generalized degrees of freedom to avoid any selection bias. For comparison, we also carried out linear regression and the collapsing method, which sums the rare SNPs, modified for a quantitative trait and with two different allele frequency thresholds. The aim of this paper is to evaluate these four approaches in this mini-exome data and compare their performance in terms of power and false positive rates. In most situations the LASSO approach is more powerful than linear regression and collapsing methods. We also note the difficulty in determining the optimal threshold for the collapsing method and the significant role that linkage disequilibrium plays in detecting rare causal SNPs. If a rare causal SNP is in strong linkage disequilibrium with a common marker in the same gene, power will be much improved.
The Framingham Heart Study is a very successful longitudinal study of cardiovascular disease. The completion of a 10-cM genome scan in Framingham families provided an opportunity to evaluate linkage using longitudinal data. Several descriptive traits based on simulated longitudinal data from the Genetic Analysis Workshop 13 (GAW13) were generated, and linkage analyses were performed for these traits. We compared the power of detecting linkage for baseline and slope genes in the simulated data of GAW13 using these traits. We found that using longitudinal traits based on multiple follow-ups may not be more powerful than using cross-sectional traits for genetic linkage analysis.
We propose a method to perform linkage genome scans for many correlated traits in the Genetic Analysis Workshop 15 (GAW15) data. The proposed method has two steps: first, we use a clustering method to find the tight clusters of the traits and use the first principal component (PC) of the traits in each cluster to represent the cluster; second, we perform a linkage scan for each cluster by using the representative trait of the cluster. The results of applying the method to the GAW15 Problem 1 data indicate that most of the traits in the same cluster have the same regulators, and the representative trait measure, the first PC, can explain a large part of the total variation of all the traits in each cluster. Furthermore, considering one cluster of traits at a time may yield more linkage signals than considering traits individually.
Joint analyses of correlated phenotypes in genetic epidemiology studies are common. However, these analyses primarily focus on genetic correlation between traits and do not take into account environmental correlation. We describe a method that optimizes the genetic signal by accounting for stochastic environmental noise through joint analysis of a discrete trait and a correlated quantitative marker. We conducted bivariate analyses where heritability and the environmental correlation between the discrete and quantitative traits were calculated using Genetic Analysis Workshop 17 (GAW17) family data. The resulting inverse value of the environmental correlation between these traits was then used to determine a new β coefficient for each quantitative trait and was constrained in a univariate model. We conducted genetic association tests on 7,087 nonsynonymous SNPs in three GAW17 family replicates for Affected status with the β coefficient fixed for three quantitative phenotypes and compared these to an association model where the β coefficient was allowed to vary. Bivariate environmental correlations were 0.64 (± 0.09) for Q1, 0.798 (± 0.076) for Q2, and −0.169 (± 0.18) for Q4. Heritability of Affected status improved in each univariate model where a constrained β coefficient was used to account for stochastic environmental effects. No genome-wide significant associations were identified for either method but we demonstrated that constraining β for covariates slightly improved the genetic signal for Affected status. This environmental regression approach allows for increased heritability when the β coefficient for a highly correlated quantitative covariate is constrained and increases the genetic signal for the discrete trait.
Exome sequencing is emerging as a popular approach to study the effect of rare coding variants on complex phenotypes. The promise of exome sequencing is grounded in theoretical population genetics and in empirical successes of candidate gene sequencing studies. Many projects aimed at common diseases are underway, and their results are eagerly anticipated. In this Perspective, using exome sequencing data from 438 individuals, we discuss several aspects of exome sequencing studies that we view as particularly important. We review processing and quality control of raw sequence data, evaluate the statistical properties of exome sequencing studies, discuss rare variant burden tests to detect association to phenotypes, and demonstrate the importance of accounting for population stratification in the analysis of rare variants. We conclude that enthusiasm for exome sequencing studies of complex traits should be combined with the caution that thousands of samples may be required to reach sufficient statistical power.
Novel technologies allow sequencing of whole genomes and are considered as an emerging approach for the identification of rare disease-associated variants. Recent studies have shown that multiple rare variants can explain a particular proportion of the genetic basis for disease. Following this assumption, we compare five collapsing approaches to test for groupwise association with disease status, using simulated data provided by Genetic Analysis Workshop 17 (GAW17). Variants are collapsed in different scenarios per gene according to different minor allele frequency (MAF) thresholds and their functionality. For comparing the different approaches, we consider the family-wise error rate and the power. Most of the methods could maintain the nominal type I error levels well for small MAF thresholds, but the power was generally low. Although the methods considered in this report are common approaches for analyzing rare variants, they performed poorly with respect to the simulated disease phenotype in the GAW17 data set.
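The two evaluation criteria named above, family-wise error rate and power, can be estimated across simulated replicates roughly as follows. This sketch assumes one p-value per gene per replicate, a known list of causal genes, and a fixed significance threshold (Bonferroni-style in the example); every rejection of a noncausal gene is counted as a false positive. The gene names and p-values are made up.

```python
def fwer_and_power(replicates, causal_genes, threshold):
    """Estimate family-wise error rate and average power over replicates.

    replicates:   list of dicts mapping gene name -> p-value, one per replicate.
    causal_genes: gene names that truly affect the phenotype.
    threshold:    per-test significance threshold.
    """
    causal = set(causal_genes)
    n_rep = len(replicates)
    replicates_with_false_hit = 0
    detections = 0
    for p_values in replicates:
        significant = {gene for gene, p in p_values.items() if p < threshold}
        if significant - causal:
            replicates_with_false_hit += 1  # at least one noncausal rejection
        detections += len(significant & causal)
    fwer = replicates_with_false_hit / n_rep
    average_power = detections / (len(causal) * n_rep)
    return fwer, average_power


replicates = [
    {"GENE_A": 0.001, "GENE_B": 0.30, "GENE_C": 0.50},
    {"GENE_A": 0.200, "GENE_B": 0.004, "GENE_C": 0.70},
    {"GENE_A": 0.002, "GENE_B": 0.60, "GENE_C": 0.90},
    {"GENE_A": 0.400, "GENE_B": 0.80, "GENE_C": 0.10},
]
fwer, power = fwer_and_power(replicates, causal_genes=["GENE_A"], threshold=0.05 / 3)
print(fwer, power)
```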