Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
1000 Genomes Project; association; collapsing methods; next-generation sequencing
The common genetic variants identified through genome-wide association studies explain only a small proportion of the genetic risk for complex diseases. The advancement of next-generation sequencing technologies has enabled the detection of rare variants that are expected to contribute significantly to the missing heritability. Some genetic association studies provide multiple correlated traits for analysis. Multiple trait analysis has the potential to improve the power to detect pleiotropic genetic variants that influence multiple traits. We propose a gene-level association test for multiple traits that accounts for correlation among the traits. Gene- or region-level testing for association involves both common and rare variants. Statistical tests for common variants may have limited power for individual rare variants because of their low frequency and multiple testing issues. To address these concerns, we use the weighted-sum pooling method to test the joint association of multiple rare and common variants within a gene. The proposed method is applied to the Genetic Association Workshop 17 (GAW17) simulated mini-exome data to analyze multiple traits. Because of the nature of the GAW17 simulation model, increased power was not observed for multiple-trait analysis compared to single-trait analysis. However, multiple-trait analysis did not result in a substantial loss of power because of the testing of multiple traits. We conclude that this method would be useful for identifying pleiotropic genes.
With the advance of next-generation sequencing technologies in recent years, rare genetic variant data have now become available for genetic epidemiology studies. For family samples however, only a few statistical methods for association analysis of rare genetic variants have been developed. Rare variant approaches are of great interest particularly for family data because samples enriched for trait-relevant variants can be ascertained and rare variants are putatively enriched through segregation. To facilitate the evaluation of existing and new rare variant testing approaches for analyzing family data, Genetic Analysis Workshop 18 (GAW18) provided genotype and next-generation sequencing data and longitudinal blood pressure traits from extended pedigrees of Mexican-American families from the San Antonio Family Study. Our GAW18 group members analyzed real and simulated phenotype data from GAW18 by using generalized linear mixed-effects models or principal components to adjust for familial correlation or by testing binary traits using a correction factor for familial effects. With one exception, approaches dealt with the extended pedigrees in their original state using information based on the kinship matrix or alternative genetic similarity measures. For simulated data, our group demonstrated that the family-based kernel machine score test is superior in power to family-based single-marker or burden tests, except in a few specific scenarios. For real data, three contributions identified significant associations. They substantially reduced the number of tests before performing the association analysis. We conclude from our real data analyses that further development of strategies for targeted testing or more focused screening of genetic variants is strongly desirable.
extended pedigrees; rare variant analysis; family-based association test; linear mixed effects model; kernel machine score test; principal components
Next-generation sequencing of large numbers of individuals presents challenges in data preparation, quality control, and statistical analysis because of the rarity of the variants. The Genetic Analysis Workshop 17 (GAW17) data provide an opportunity to survey existing methods and compare these methods with novel ones. Specifically, the GAW17 Group 2 contributors investigate existing and newly proposed methods and study design strategies to identify rare variants, predict functional variants, and/or examine quality control. We introduce the eight Group 2 papers, summarize their approaches, and discuss their strengths and weaknesses. For these investigations, some groups used only the genotype data, whereas others also used the simulated phenotype data. Although the eight Group 2 contributions covered a wide variety of topics under the general idea of identifying rare variants, they can be grouped into three broad categories according to their common research interests: functionality of variants and quality control issues, family-based analyses, and association analyses of unrelated individuals. The aims of the first subgroup were quite different. These were population structure analyses that used rare variants to predict functionality and examine the accuracy of genotype calls. The aims of the family-based analyses were to select which families should be sequenced and to identify high-risk pedigrees; the aim of the association analyses was to identify variants or genes with regression-based methods. However, power to detect associations was low in all three association studies. Thus this work shows opportunities for incorporating rare variants into the genetic and statistical analyses of common diseases.
1000 Genomes Project; association; collection of rare variants; family data; next-generation sequencing; regression; quality control
Recently there has been great interest in identifying rare variants associated with common diseases. We apply several collapsing-based and kernel-based single-gene association tests to Genetic Analysis Workshop 17 (GAW17) rare variant association data with unrelated individuals without knowledge of the simulation model. We also implement modified versions of these methods using additional information, such as minor allele frequency (MAF) and functional annotation. For each of four given traits provided in GAW17, we use the Bayesian mixed-effects model to estimate the phenotypic variance explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests. After obtaining information on the GAW17 simulation model, we compare the performance of all methods and examine the top genes identified by those methods. We find that collapsing-based methods with weights based on MAFs are sensitive to the “lower MAF, larger effect size” assumption, whereas kernel-based methods are more robust when this assumption is violated. In addition, many false-positive genes identified by multiple methods often contain variants with exactly the same genotype distribution as the causal variants used in the simulation model. When the sample size is much smaller than the number of rare variants, it is more likely that causal and noncausal variants will share the same or similar genotype distribution. This likely contributes to the low power and large number of false-positive results of all methods in detecting causal variants associated with disease in the GAW17 data set.
Using single-nucleotide polymorphism (SNP) genotypes from the 1000 Genomes Project pilot3 data provided for Genetic Analysis Workshop 17 (GAW17), we applied Bayesian network structure learning (BNSL) to identify potential causal SNPs associated with the Affected phenotype. We focus on the setting in which target genes that harbor causal variants have already been chosen for resequencing; the goal was to detect true causal SNPs from among the measured variants in these genes. Examining all available SNPs in the known causal genes, BNSL produced a Bayesian network from which subsets of SNPs connected to the Affected outcome were identified and measured for statistical significance using the hypergeometric distribution. The exploratory phase of analysis for pooled replicates sometimes identified a set of involved SNPs that contained more true causal SNPs than expected by chance in the Asian population. Analyses of single replicates gave inconsistent results. No nominally significant results were found in analyses of African or European populations. Overall, the method was not able to identify sets of involved SNPs that included a higher proportion of true causal SNPs than expected by chance alone. We conclude that this method, as currently applied, is not effective for identifying causal SNPs that follow the simulation model for the GAW17 data set, which includes many rare causal SNPs.
Genetic Analysis Workshop 18 (GAW18) focused on identification of genes and functional variants that influence complex phenotypes in human sequence data. Data for the workshop were donated by the T2D-GENES Consortium and included whole genome sequences for odd-numbered autosomes in 464 key individuals selected from 20 Mexican American families, a dense set of single-nucleotide polymorphisms in 959 individuals in these families, and longitudinal data on systolic and diastolic blood pressure measured at 1-4 examinations over a period of 20 years. Simulated phenotypes were generated based on the real sequence data and pedigree structures. In the design of the simulation model, gene expression measures from the San Antonio Family Heart Study (not distributed as part of the GAW18 data) were used to identify genes whose mRNA levels were correlated with blood pressure. Observed variants within these genes were designated as functional in the GAW18 simulation if they were nonsynonymous and predicted to have deleterious effects on protein function or if they were noncoding and associated with mRNA levels. Two simulated longitudinal phenotypes were modeled to have the same trait distributions as the real systolic and diastolic blood pressure data, with effects of age, sex, and medication use, including a genotype-medication interaction. For each phenotype, more than 1000 sequence variants in more than 200 genes present on the odd-numbered autosomes individually explained less than 0.01-2.78% of phenotypic variance. Cumulatively, variants in the most influential gene explained 7.79% of trait variance. An additional simulated phenotype, Q1, was designed to be correlated among family members but to not be associated with any sequence variants. Two hundred replicates of the phenotypes were simulated, with each including data for 849 individuals.
Currently there is a great deal of interest in developing methods for testing the role that rare variation plays in disease development. Here we propose a weighted association test that accumulates genetic variation across a signaling pathway. We evaluate our approach by analyzing simulated phenotype data from an exome sequencing study of 697 unrelated individuals from the Genetic Analysis Workshop 17 (GAW17) data set. Although our weighted approach identifies several interesting pathways associated with phenotype Q1, so does an alternative unweighted accumulation approach. Such a result is not unexpected because there is no systematic relationship between the allele frequency of a variant and its effect on phenotype in the GAW17 simulation model.
Genetic Analysis Workshop 16 (GAW16) Problem 2 presented data from the Framingham Heart Study (FHS), an observational, prospective study of risk factors for cardiovascular disease begun in 1948. Data have been collected in three generations of family participants in the study and the data presented for GAW16 included phenotype data from all three generations, with four examinations of data collected repeatedly for the first two generations. The trait data consisted of information on blood pressure, hypertension treatment, lipid levels, diabetes and blood glucose, smoking, alcohol consumed, weight, and coronary heart disease incidence. Additionally, genotype data obtained through a genome-wide scan (FHS SHARe) of 550,000 single-nucleotide polymorphisms from Affymetrix chips were included with the GAW16 data. The genotype data were also used for GAW16 Problem 3, where simulated phenotypes were generated using the actual FHS genotypes. These data served to provide investigators with a rich resource to study the behavior of genome-wide scans with longitudinally collected family data and to develop and apply new procedures
We applied our method of pairwise shared genomic segment (pSGS) analysis to high-risk pedigrees identified from the Genetic Analysis Workshop 17 (GAW17) mini-exome sequencing data set. The original shared genomic segment method focused on identifying regions shared by all case subjects in a pedigree; thus it can be sensitive to sporadic cases. Our new method examines sharing among all pairs of case subjects in a high-risk pedigree and then uses the mean sharing as the test statistic; in addition, the significance is assessed empirically based on the pedigree structure and linkage disequilibrium pattern of the single-nucleotide polymorphisms. Using all GAW17 replicates, we identified 18 unilineal high-risk pedigrees that contained excess disease (p < 0.01) and at least 15 meioses between case subjects. Eighteen rare causal variants were polymorphic in this set of pedigrees. Based on a significance threshold of 0.001, 72.2% (13/18) of these pedigrees were successfully identified with at least one region that contains a true causal variant. The regions identified included 4 of the possible 18 polymorphic causal variants. On average, 1.1 true positives and 1.7 false positives were identified per pedigree. In conclusion, we have demonstrated the potential of our new pSGS method for localizing rare disease causal variants in common disease using high-risk pedigrees and exome sequence data.
We summarize the methodological contributions from Group 3 of Genetic Analysis Workshop 17 (GAW17). The overarching goal of these methods was the evaluation and enhancement of state-of-the-art approaches in integration of biological knowledge into association studies of rare variants. We found that methods loosely fell into three major categories: (1) hypothesis testing of index scores based on aggregating rare variants at the gene level, (2) variable selection techniques that incorporate biological prior information, and (3) novel approaches that integrate external (i.e., not provided by GAW17) prior information, such as pathway and single-nucleotide polymorphism (SNP) annotations. Commonalities among the findings from these contributions are that gene-based analysis of rare variants is advantageous to single-SNP analysis and that the minor allele frequency threshold to identify rare variants may influence power and thus needs to be carefully considered. A consistent increase in power was also identified by considering only nonsynonymous SNPs in the analyses. Overall, we found that no single method had an appreciable advantage over the other methods. However, methods that carried out sensitivity analyses by comparing biologically informative to noninformative prior probabilities demonstrated that integrating biological knowledge into statistical analyses always, at the least, enabled subtle improvements in the performance of any statistical method applied to these simulated data. Although these statistical improvements reflect the simulation model assumed for GAW17, our hope is that the simulation models provide a reasonable representation of the underlying biology and that these methods can thus be of utility in real data.
exome sequencing; pathway analysis; gene association
The data set simulated for Genetic Analysis Workshop 17 was designed to mimic a subset of data that might be produced in a full exome screen for a complex disorder and related risk factors in order to permit workshop participants to investigate issues of study design and statistical genetic analysis. Real sequence data from the 1000 Genomes Project formed the basis for simulating a common disease trait with a prevalence of 30% and three related quantitative risk factors in a sample of 697 unrelated individuals and a second sample of 697 individuals in large, extended pedigrees. Called genotypes for 24,487 autosomal markers assigned to 3,205 genes and simulated affection status, quantitative traits, age, sex, pedigree relationships, and cigarette smoking were provided to workshop participants. The simulating model included both common and rare variants with minor allele frequencies ranging from 0.07% to 25.8% and a wide range of effect sizes for these variants. Genotype-smoking interaction effects were included for variants in one gene. Functional variants were concentrated in genes selected from specific biological pathways and were selected on the basis of the predicted deleteriousness of the coding change. For each sample, unrelated individuals and family, 200 replicates of the phenotypes were simulated.
We summarize the work done by the contributors to Group 13 at Genetic Analysis Workshop 17 (GAW17) and provide a synthesis of their data analyses. The Group 13 contributors used a variety of approaches to test associations of both rare variants and common single-nucleotide polymorphisms (SNPs) with the GAW17 simulated traits, implementing analytic methods that incorporate multiallelic genotypes and haplotypes. In addition to using a wide variety of statistical methods and approaches, the contributors exhibited a remarkable amount of flexibility and creativity in coding the variants and their genes and in evaluating their proposed approaches and methods. We describe and contrast their methods along three dimensions: (1) selection and coding of genetic entities for analysis, (2) method of analysis, and (3) evaluation of the results. The contributors consistently presented a strong rationale for using multiallelic analytic approaches. They indicated that power was likely to be increased by capturing the signals of multiple markers within genetic entities defined by sliding windows, haplotypes, genes, functional pathways, and the entire set of SNPs and rare variants taken in aggregate. Despite this variability, the methods were fairly consistent in their ability to identify two associated genes for each simulated trait. The first gene was selected for the largest number of causal alleles and the second for a high-frequency causal SNP. The presumed model of inheritance and choice of genetic entities are likely to have a strong effect on the outcomes of the analyses.
rare variants; sequence data; multiallelic data; Bayesian regression; penalized regression; tree-based clustering; pathway analysis; haplotypes
Genetic Analysis Workshop 17 (GAW17) focused on the transition from genome-wide association study designs and methods to the study designs and statistical genetic methods that will be required for the analysis of next-generation sequence data including both common and rare sequence variants. In the 166 contributions to GAW17, a wide variety of statistical methods were applied to simulated traits in population- and family-based samples, and results from these analyses were compared to the known generating model. In general, many of the statistical genetic methods used in the population-based sample identified causal sequence variants (SVs) when the estimated locus-specific heritability, as measured in the population-based sample, was greater than about 0.08. However, SVs with locus-specific heritabilities less than 0.03 were rarely identified consistently. In the family-based samples, many of the methods detected SVs that were rarer than those detected in the population-based sample, but the estimated locus-specific heritabilities for these rare SVs, as measured in the family-based samples, were substantially higher (>0.2) than their corresponding heritabilities in the population-based samples. Substantial inflation of the type I error rate was observed across a wide variety of statistical methods. Although many of the contributions found little inflation in type I error for Q4, a trait with no causal SVs, type I error rates for Q1 and Q2 were well above their nominal levels with the inflation for Q1 being higher than that for Q2. It seems likely that this inflation in type I error is due to correlations among SVs.
linkage; association; next-generation sequencing; computer simulation
As the availability of cost-effective high-throughput sequencing technology increases, genetic research is beginning to focus on identifying the contributions of rare variants (RVs) to complex traits. Using RVs to detect associated genes requires statistical approaches that mitigate the lack of power with the analysis of single RVs. Here we report the development and application of an approach that aggregates and evaluates the transmissions of RVs in parent-child trios. An initial score that incorporates the distortion in transmission of the observed RVs from the parents to their offspring is calculated for each variant. The scores are analyzed using a support vector machine that handles these data by mapping the transmission distortion of the multiple RVs into a one-dimensional score in a nonlinear fashion when parent-child trios with affected and nonaffected children are contrasted. We refer to this approach as Trio-SVM. A total of 275 trios were available in the Genetic Analysis Workshop 18 data for analysis. Because of their nonindependence and the extended linkage disequilibrium (LD) within pedigrees, Trio-SVM was vulnerable to type I errors in detecting association. Using the GAW18 data with simulated trait values, Trio-SVM has an appropriate type I error, but it lacks power with a sample of 267 trios. Larger samples of 500 to 1000 trios, derived from combining the simulated data, provided sufficient power. Two chromosome 3 candidate genes were tested in the real GAW18 data with Trio-SVM, and they showed marginal associations with hypertension.
Recent evidence suggests that many complex diseases are caused by genetic variations that play regulatory roles in controlling gene expression. Most genetic studies focus on nonsynonymous variations that can alter the amino acid composition of a protein and are therefore believed to have the highest impact on phenotype. Synonymous variations, however, can also play important roles in disease pathogenesis by regulating pre-mRNA processing and translational control. In this study, we systematically survey the effects of single-nucleotide variations (SNVs) on binding affinity of RNA-binding proteins (RBPs). Among the 10,113 synonymous SNVs identified in 697 individuals in the 1,000 Genomes Project and distributed by Genetic Analysis Workshop 17 (GAW17), we identified 182 variations located in alternatively spliced exons that can significantly change the binding affinity of nine RBPs whose binding preferences on 7-mer RNA sequences were previously reported. We found that the minor allele frequencies of these variations are similar to those of nonsynonymous SNVs, suggesting that they are in fact functional. We propose a workflow to identify phenotype-associated regulatory SNVs that might affect alternative splicing from exome-sequencing-derived genetic variations. Based on the affecting SNVs on the quantitative traits simulated in GAW17, we further identified two and four functional SNVs that are predicted to be involved in alternative splicing regulation in traits Q1 and Q2, respectively.
This study, part of the Genetic Analysis Workshop 14 (GAW14), explored real Collaborative Study on the Genetics of Alcoholism data for linkage and association mapping between genetic polymorphisms (microsatellite and single-nucleotide polymorphisms (SNPs)) and beta (16.5–20 Hz) oscillations of the brain rhythms (ecb21). The ecb21 phenotype underwent the statistical adjustments for the age of participants, and for attaining a normal distribution. A total of 1,000 subjects' available phenotypes were included in linkage analysis with microsatellite markers. Linkage analysis was performed only for chromosome 4 where a quantitative trait locus with 5.01 LOD score had been previously reported. Previous findings related this location with the γ-aminobutyric acid type A (GABAA) receptor. At the same location, our analysis showed a LOD score of 2.2. This decrease in the LOD score is the result of a drastic reduction (one-third) of the available GAW14 phenotypic data. We performed SNP and haplotype association analyses with the same phenotypic data under the linkage peak region on chromosome 4. Seven Affymetrix and two Illumina SNPs showed significant associations with ecb21 phenotype. A haplotype, a combination of SNPs TSC0044171 and TSC0551006 (the latter almost under the region of GABAA genes), showed a significant association with ecb21 (p = 0.015) and a relatively high frequency in the sample studied. Our results affirmed that the GABA region has potential of harboring genes that contribute quantitatively to the beta oscillation of the brain rhythms. The inclusion of the remaining 614 subjects, which in the GAW14 had missing data for the ecb21, can improve the strength of the associations as they have already shown that they contribute quite important information in the linkage analysis.
Statistical genetic methods incorporating temporal variation allow for greater understanding of genetic architecture and consistency of biological variation influencing development of complex diseases. This study proposes a bivariate association method jointly testing association of two quantitative phenotypic measures from different time points. Measured genotype association was analyzed for single-nucleotide polymorphisms (SNPs) for systolic blood pressure (SBP) from the first and third visits using 200 simulated Genetic Analysis Workshop 18 (GAW18) replicates. Bivariate association, in which the effect of an SNP on the mean trait values of the two phenotypes is constrained to be equal for both measures and is included as a covariate in the analysis, was compared with a bivariate analysis in which the effect of an SNP was estimated separately for the two measures and univariate association analyses in 9 SNPs that explained greater than 0.001% SBP variance over all 200 GAW18 replicates.The SNP 3_48040283 was significantly associated with SBP in all 200 replicates with the constrained bivariate method providing increased signal over the unconstrained bivariate method. This method improved signal in all 9 SNPs with simulated effects on SBP for nominal significance (p-value <0.05). However, this appears to be determined by the effect size of the SNP on the phenotype. This bivariate association method applied to longitudinal data improves genetic signal for quantitative traits when the effect size of the variant is moderate to large.
Analysis of longitudinal family data is challenging because of 2 sources of correlations: correlations across longitudinal measurements and correlations among related individuals. We investigated whether analysis using long-term average (average of all 3 visits) can enhance gene discovery compared with a single-visit analysis. We analyzed all 200 replicates of simulated systolic blood pressure (SBP) in Genetic Analysis Workshop 18 (GAW18) family data using both single-marker and collapsing methods. We considered 2 collapsing approaches: collapsing all variants and collapsing low-frequency variants. Analysis using long-term average performed slightly better than SBP measured at a single visit. Collapsing all variants performed much better than collapsing low-frequency variants at MAP4 and FLNB, which included a common variant with a relatively large effect. For several variants in gene MAP4, single-marker analysis also provided high power. In contrast, collapsing only low-frequency variants performed much better for SCAP, DNASE1L3, and LOC152217, where rare variants in these genes had larger effect than common variants. However, for other causal variants, all approaches provided disappointingly poor performance. This poor performance appeared to occur because most of these causal variants explained a very small fraction of phenotypic variance. We also found that collapsing multiple variants did worse than single-marker analysis for several genes when they contained causal single-nucleotide polymorphisms (SNPs) with both positive and negative effects. Because half of causal SNPs were not found in the annotation file based on the 1000 Genomes Project, we found that power was also affected by our use of incomplete annotation information.
Genetic Analysis Workshop 16 GAW16) was held September 17-20, 2008 in St. Louis, Missouri. The focus of GAW16 was on methods and challenges in analysis of single-nucleotide polymorphism (SNP) data from genome-wide scans. GAW16 attracted 221 participants from 12 countries. The 168 contributions were organized into 17 discussion groups of 6 to 17 papers each. Three data sets were available for analysis. Two of these were data from ongoing studies, generously provided by the investigators. The North American Rheumatoid Arthritis Consortium provided case-control data on rheumatoid arthritis, and the Framingham Heart Study made available information on cardiovascular risk factors for participants in three generations of pedigree data. The third data set included simulated phenotypes for participants in the Framingham Heart Study, using actual pedigree structures and genotypes. This volume includes a paper for each of the 17 discussion groups, summarizing their main findings.
single-nucleotide polymorphism; SNP; genome-wide scan; association; linkage; haplotype
We studied rheumatoid arthritis (RA) in the North American Rheumatoid Arthritis Consortium (NARAC) data (1499 subjects; 757 families). Identical methods were applied for studying RA in the Genetic Analysis Workshop 15 (GAW15) simulated data (with a prior knowledge of the simulation answers). Fifty replications of GAW15 simulated data had 3497 ± 20 subjects in 1500 nuclear families. Two new statistical methods were applied to transform the original phenotypes on these data, the item response theory (IRT) to create a latent variable from nine classifying predictors and a Blom transformation of the anti-CCP (anti-cyclic citrinullated protein) variable. We performed linear mixed-effects (LME) models to study the additive associations of 404 Illumina-genotyped single-nucleotide polymorphisms (SNPs) on the NARAC data, and of 17,820 SNPs of the GAW15 simulated data. In the GAW15 simulated data, the association with anti-CCP Blom transformation showed a 100% sensitivity for SNP1 located in the major histocompatibility complex gene. In contrast, the association of SNP1 with the IRT latent variable showed only 24% sensitivity. From the simulated data, we conclude that the Blom transformation of the anti-CCP variable produced more reliable results than the latent variable from the qualitative combination of a group of RA risk factors. In the NARAC data, the significant RA-SNPs associations found with both phenotype-transformation methods provided a trend that may point toward dynein and energy control genes. Finer genotyping in the NARAC data would grant more exact evidence for the contributions of chromosome 6 to RA.
In addition to methods that can identify common variants associated with susceptibility to common diseases, there has been increasing interest in approaches that can identify rare genetic variants. We use the simulated data provided to the participants of Genetic Analysis Workshop 17 (GAW17) to identify both rare and common single-nucleotide polymorphisms and pathways associated with disease status. We apply a rare variant collapsing approach and the usual association tests for common variants to identify candidates for further analysis using pathway-based and tree-based ensemble approaches. We use the mean log p-value approach to identify a top set of pathways and compare it to those used in simulation of GAW17 dataset. We conclude that the mean log p-value approach is able to identify those pathways in the top list and also related pathways. We also use the stochastic gradient boosting approach for the selected subset of single-nucleotide polymorphisms. When compared the result of this tree-based method with the list of single-nucleotide polymorphisms used in dataset simulation, in addition to correct SNPs we observe number of false positives.
We evaluate four association tests for rare variants—the combined multivariate and collapsing (CMC) method, two weighted-sum methods, and a variable threshold method—by applying them to the simulated data sets of unrelated individuals in the Genetic Analysis Workshop 17 (GAW17) data. The family-wise error rate (FWER) and average power are used as criteria for evaluation. Our results show that when all nonsynonymous SNPs (rare variants and common variants) in a gene are jointly analyzed, the CMC method fails to control the FWER; when only rare variants (single-nucleotide polymorphisms with minor allele frequency less than 0.05) are analyzed, all four methods can control FWER well. All four methods have comparable power, which is low for the analysis of the GAW17 data sets. Three of the methods (not including the CMC method) involve estimation of p-values using permutation procedures that either can be computationally intensive or generate inflated FWERs. We adapt a fast permutation procedure into these three methods. The results show that using the fast permutation procedure can produce FWERs and average powers close to the values obtained from the standard permutation procedure on the GAW17 data sets. The standard permutation procedure is computationally intensive.
In statistical data analysis, penalized regression is considered an attractive approach for its ability of simultaneous variable selection and parameter estimation. Although penalized regression methods have shown many advantages in variable selection and outcome prediction over other approaches for high-dimensional data, there is a relative paucity of the literature on their applications to hypothesis testing, e.g., in genetic association analysis. In this study, we apply several new penalized regression methods with a novel penalty, called Truncated L1-penalty (TLP) (Shen et al., 2012), for either variable selection, or both variable selection and parameter grouping, in a data-adaptive way to test for association between a quantitative trait and a group of rare variants. The performance of the new methods are compared with some existing tests, including some recently proposed global tests and penalized regression-based methods, via simulations and an application to the real sequence data of the Genetic Analysis Workshop 17 (GAW17). Although our proposed penalized methods can improve over some existing penalized methods, often they do not outperform some existing global association tests. Some possible problems with utilizing penalized regression methods in genetic hypothesis testing are discussed. Given the capability of penalized regression in selecting causal variants and its sometimes promising performance, further studies are warranted.
GWAS; SSU test; SSUw test; Sum test; TLP
Novel technologies allow sequencing of whole genomes and are considered as an emerging approach for the identification of rare disease-associated variants. Recent studies have shown that multiple rare variants can explain a particular proportion of the genetic basis for disease. Following this assumption, we compare five collapsing approaches to test for groupwise association with disease status, using simulated data provided by Genetic Analysis Workshop 17 (GAW17). Variants are collapsed in different scenarios per gene according to different minor allele frequency (MAF) thresholds and their functionality. For comparing the different approaches, we consider the family-wise error rate and the power. Most of the methods could maintain the nominal type I error levels well for small MAF thresholds, but the power was generally low. Although the methods considered in this report are common approaches for analyzing rare variants, they performed poorly with respect to the simulated disease phenotype in the GAW17 data set.