Related Articles

1.  Identification of Genetic Association of Multiple Rare Variants Using Collapsing Methods 
Genetic Epidemiology  2011;35(Suppl 1):S101-S106.
Next-generation sequencing technology allows investigation of both common and rare variants in humans. Exomes are sequenced on the population level or in families to further study the genetics of human diseases. Genetic Analysis Workshop 17 (GAW17) provided exomic data from the 1000 Genomes Project and simulated phenotypes. These data enabled evaluations of existing and newly developed statistical methods for rare variant sequence analysis for which standard statistical methods fail because of the rareness of the alleles. Various alternative approaches have been proposed that overcome the rareness problem by combining multiple rare variants within a gene. These approaches are termed collapsing methods, and our GAW17 group focused on studying the performance of existing and novel collapsing methods using rare variants. All tested methods performed similarly, as measured by type I error and power. Inflated type I error fractions were consistently observed and might be caused by gametic phase disequilibrium between causal and noncausal rare variants in this relatively small sample as well as by population stratification. Incorporating prior knowledge, such as appropriate covariates and information on functionality of SNPs, increased the power of detecting associated genes. Overall, collapsing rare variants can increase the power of identifying disease-associated genes. However, studying genetic associations of rare variants remains a challenging task that requires further development and improvement in data collection, management, analysis, and computation.
PMCID: PMC3289287  PMID: 22128049
1000 Genomes Project; association; collapsing methods; next-generation sequencing
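The collapsing idea summarized in entry 1 can be illustrated with a minimal sketch. It assumes a case-control trait, a simple carrier indicator per gene, and a one-sided Fisher exact test; the function names and the `maf_cutoff` value are hypothetical and are not taken from any GAW17 contribution.

```python
from math import comb

def collapse_rare(genotypes, mafs, maf_cutoff=0.01):
    """Return 1 for each individual carrying at least one rare minor allele.

    genotypes: one list per individual of minor-allele counts (0, 1, 2),
               with one entry per variant in the gene.
    mafs:      minor allele frequency of each variant.
    maf_cutoff is an illustrative threshold, not a recommended value.
    """
    rare = [j for j, f in enumerate(mafs) if f < maf_cutoff]
    return [int(any(g[j] > 0 for j in rare)) for g in genotypes]

def fisher_one_sided(a, b, c, d):
    """P(carrier cases >= a) for the 2x2 table [[a, b], [c, d]].

    a: carrier cases, b: non-carrier cases,
    c: carrier controls, d: non-carrier controls.
    """
    n_cases, n_carriers, n = a + b, a + c, a + b + c + d
    upper = min(n_cases, n_carriers)
    total = comb(n, n_cases)
    tail = sum(comb(n_carriers, k) * comb(n - n_carriers, n_cases - k)
               for k in range(a, upper + 1))
    return tail / total

# Toy data: 6 individuals, 3 variants; the third variant is common.
genotypes = [[0, 1, 0], [0, 0, 0], [1, 0, 2], [0, 0, 1], [0, 0, 0], [0, 0, 0]]
mafs = [0.005, 0.008, 0.20]
status = [1, 0, 1, 1, 0, 0]  # 1 = case, 0 = control

carrier = collapse_rare(genotypes, mafs)
a = sum(1 for x, s in zip(carrier, status) if x and s)
b = sum(1 for x, s in zip(carrier, status) if not x and s)
c = sum(1 for x, s in zip(carrier, status) if x and not s)
d = sum(1 for x, s in zip(carrier, status) if not x and not s)
p_value = fisher_one_sided(a, b, c, d)
```

Collapsing trades per-variant resolution for a single, better-powered test per gene, which is why noncausal rare variants that share a genotype pattern with causal ones can inflate false positives in small samples.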
2.  Gene-based multiple trait analysis for exome sequencing data 
BMC Proceedings  2011;5(Suppl 9):S75.
The common genetic variants identified through genome-wide association studies explain only a small proportion of the genetic risk for complex diseases. The advancement of next-generation sequencing technologies has enabled the detection of rare variants that are expected to contribute significantly to the missing heritability. Some genetic association studies provide multiple correlated traits for analysis. Multiple trait analysis has the potential to improve the power to detect pleiotropic genetic variants that influence multiple traits. We propose a gene-level association test for multiple traits that accounts for correlation among the traits. Gene- or region-level testing for association involves both common and rare variants. Statistical tests for common variants may have limited power for individual rare variants because of their low frequency and multiple testing issues. To address these concerns, we use the weighted-sum pooling method to test the joint association of multiple rare and common variants within a gene. The proposed method is applied to the Genetic Association Workshop 17 (GAW17) simulated mini-exome data to analyze multiple traits. Because of the nature of the GAW17 simulation model, increased power was not observed for multiple-trait analysis compared to single-trait analysis. However, multiple-trait analysis did not result in a substantial loss of power because of the testing of multiple traits. We conclude that this method would be useful for identifying pleiotropic genes.
PMCID: PMC3287915  PMID: 22373189
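As a rough illustration of the weighted-sum pooling mentioned in entry 2, the sketch below scores each individual by summing minor-allele counts divided by a frequency-based weight, so that rarer variants contribute more. The weight formula follows the style of the Madsen and Browning weighted-sum statistic but estimates allele frequencies from the whole sample; that choice, and the function name, are assumptions rather than the authors' exact procedure.

```python
from math import sqrt

def weighted_sum_scores(genotypes):
    """Per-individual gene score: sum of minor-allele counts over weights.

    genotypes: one list per individual of minor-allele counts (0, 1, 2).
    Weight for variant j (assumed form): sqrt(n * q * (1 - q)), where
    q = (minor allele count + 1) / (2n + 2) is a smoothed frequency.
    """
    n = len(genotypes)
    n_var = len(genotypes[0])
    weights = []
    for j in range(n_var):
        minor = sum(g[j] for g in genotypes)
        q = (minor + 1) / (2 * n + 2)  # smoothing keeps q away from 0
        weights.append(sqrt(n * q * (1 - q)))
    return [sum(g[j] / weights[j] for j in range(n_var)) for g in genotypes]

# Toy data: variant 0 is rarer than variant 1, so it gets a smaller weight
# and a carrier of it receives a larger score.
scores = weighted_sum_scores([[1, 0], [0, 1], [0, 1], [0, 0]])
```

The resulting scores would then be tested against one or more traits; significance is usually assessed by permutation because the weights are estimated from the same data.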
3.  Testing Genetic Association with Rare and Common Variants in Family Data 
Genetic Epidemiology  2014;38(Suppl 1):S37-S43.
With the advance of next-generation sequencing technologies in recent years, rare genetic variant data have now become available for genetic epidemiology studies. For family samples however, only a few statistical methods for association analysis of rare genetic variants have been developed. Rare variant approaches are of great interest particularly for family data because samples enriched for trait-relevant variants can be ascertained and rare variants are putatively enriched through segregation. To facilitate the evaluation of existing and new rare variant testing approaches for analyzing family data, Genetic Analysis Workshop 18 (GAW18) provided genotype and next-generation sequencing data and longitudinal blood pressure traits from extended pedigrees of Mexican-American families from the San Antonio Family Study. Our GAW18 group members analyzed real and simulated phenotype data from GAW18 by using generalized linear mixed-effects models or principal components to adjust for familial correlation or by testing binary traits using a correction factor for familial effects. With one exception, approaches dealt with the extended pedigrees in their original state using information based on the kinship matrix or alternative genetic similarity measures. For simulated data, our group demonstrated that the family-based kernel machine score test is superior in power to family-based single-marker or burden tests, except in a few specific scenarios. For real data, three contributions identified significant associations. They substantially reduced the number of tests before performing the association analysis. We conclude from our real data analyses that further development of strategies for targeted testing or more focused screening of genetic variants is strongly desirable.
PMCID: PMC4324976  PMID: 25112186
extended pedigrees; rare variant analysis; family-based association test; linear mixed effects model; kernel machine score test; principal components
4.  Quality Control Issues and the Identification of Rare Functional Variants with Next-Generation Sequencing Data 
Genetic Epidemiology  2011;35(Suppl 1):S22-S28.
Next-generation sequencing of large numbers of individuals presents challenges in data preparation, quality control, and statistical analysis because of the rarity of the variants. The Genetic Analysis Workshop 17 (GAW17) data provide an opportunity to survey existing methods and compare these methods with novel ones. Specifically, the GAW17 Group 2 contributors investigate existing and newly proposed methods and study design strategies to identify rare variants, predict functional variants, and/or examine quality control. We introduce the eight Group 2 papers, summarize their approaches, and discuss their strengths and weaknesses. For these investigations, some groups used only the genotype data, whereas others also used the simulated phenotype data. Although the eight Group 2 contributions covered a wide variety of topics under the general idea of identifying rare variants, they can be grouped into three broad categories according to their common research interests: functionality of variants and quality control issues, family-based analyses, and association analyses of unrelated individuals. The aims of the first subgroup were quite different. These were population structure analyses that used rare variants to predict functionality and examine the accuracy of genotype calls. The aims of the family-based analyses were to select which families should be sequenced and to identify high-risk pedigrees; the aim of the association analyses was to identify variants or genes with regression-based methods. However, power to detect associations was low in all three association studies. Thus this work shows opportunities for incorporating rare variants into the genetic and statistical analyses of common diseases.
PMCID: PMC3268158  PMID: 22128054
1000 Genomes Project; association; collection of rare variants; family data; next-generation sequencing; regression; quality control
5.  Collapsing-based and kernel-based single-gene analyses applied to Genetic Analysis Workshop 17 mini-exome data 
BMC Proceedings  2011;5(Suppl 9):S117.
Recently there has been great interest in identifying rare variants associated with common diseases. We apply several collapsing-based and kernel-based single-gene association tests to Genetic Analysis Workshop 17 (GAW17) rare variant association data with unrelated individuals without knowledge of the simulation model. We also implement modified versions of these methods using additional information, such as minor allele frequency (MAF) and functional annotation. For each of four given traits provided in GAW17, we use the Bayesian mixed-effects model to estimate the phenotypic variance explained by the given environmental and genotypic data and to infer an individual-specific genetic effect to use directly in single-gene association tests. After obtaining information on the GAW17 simulation model, we compare the performance of all methods and examine the top genes identified by those methods. We find that collapsing-based methods with weights based on MAFs are sensitive to the “lower MAF, larger effect size” assumption, whereas kernel-based methods are more robust when this assumption is violated. In addition, many false-positive genes identified by multiple methods often contain variants with exactly the same genotype distribution as the causal variants used in the simulation model. When the sample size is much smaller than the number of rare variants, it is more likely that causal and noncausal variants will share the same or similar genotype distribution. This likely contributes to the low power and large number of false-positive results of all methods in detecting causal variants associated with disease in the GAW17 data set.
PMCID: PMC3287841  PMID: 22373309
6.  Application of Bayesian network structure learning to identify causal variant SNPs from resequencing data 
BMC Proceedings  2011;5(Suppl 9):S109.
Using single-nucleotide polymorphism (SNP) genotypes from the 1000 Genomes Project pilot3 data provided for Genetic Analysis Workshop 17 (GAW17), we applied Bayesian network structure learning (BNSL) to identify potential causal SNPs associated with the Affected phenotype. We focus on the setting in which target genes that harbor causal variants have already been chosen for resequencing; the goal was to detect true causal SNPs from among the measured variants in these genes. Examining all available SNPs in the known causal genes, BNSL produced a Bayesian network from which subsets of SNPs connected to the Affected outcome were identified and measured for statistical significance using the hypergeometric distribution. The exploratory phase of analysis for pooled replicates sometimes identified a set of involved SNPs that contained more true causal SNPs than expected by chance in the Asian population. Analyses of single replicates gave inconsistent results. No nominally significant results were found in analyses of African or European populations. Overall, the method was not able to identify sets of involved SNPs that included a higher proportion of true causal SNPs than expected by chance alone. We conclude that this method, as currently applied, is not effective for identifying causal SNPs that follow the simulation model for the GAW17 data set, which includes many rare causal SNPs.
PMCID: PMC3287832  PMID: 22373088
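The hypergeometric significance check described in entry 6 can be sketched as an upper-tail probability: given a pool of SNPs of which some are truly causal, how likely is it that a selected subset contains at least the observed number of causal SNPs by chance? The function and parameter names below are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(n_total, n_causal, n_selected, n_hits):
    """P(X >= n_hits) when n_selected SNPs are drawn without replacement
    from n_total SNPs, n_causal of which are truly causal."""
    denom = comb(n_total, n_selected)
    upper = min(n_causal, n_selected)
    tail = sum(comb(n_causal, k) * comb(n_total - n_causal, n_selected - k)
               for k in range(n_hits, upper + 1))
    return tail / denom

# Toy example: 6 SNPs, 2 causal, 3 selected, both causal SNPs among them.
p_enrichment = hypergeom_upper_tail(6, 2, 3, 2)
```

A small tail probability would indicate that the SNPs connected to the outcome node contain more causal variants than random selection would give.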
7.  Data for Genetic Analysis Workshop 18: human whole genome sequence, blood pressure, and simulated phenotypes in extended pedigrees 
BMC Proceedings  2014;8(Suppl 1):S2.
Genetic Analysis Workshop 18 (GAW18) focused on identification of genes and functional variants that influence complex phenotypes in human sequence data. Data for the workshop were donated by the T2D-GENES Consortium and included whole genome sequences for odd-numbered autosomes in 464 key individuals selected from 20 Mexican American families, a dense set of single-nucleotide polymorphisms in 959 individuals in these families, and longitudinal data on systolic and diastolic blood pressure measured at 1-4 examinations over a period of 20 years. Simulated phenotypes were generated based on the real sequence data and pedigree structures. In the design of the simulation model, gene expression measures from the San Antonio Family Heart Study (not distributed as part of the GAW18 data) were used to identify genes whose mRNA levels were correlated with blood pressure. Observed variants within these genes were designated as functional in the GAW18 simulation if they were nonsynonymous and predicted to have deleterious effects on protein function or if they were noncoding and associated with mRNA levels. Two simulated longitudinal phenotypes were modeled to have the same trait distributions as the real systolic and diastolic blood pressure data, with effects of age, sex, and medication use, including a genotype-medication interaction. For each phenotype, more than 1000 sequence variants in more than 200 genes present on the odd-numbered autosomes individually explained less than 0.01-2.78% of phenotypic variance. Cumulatively, variants in the most influential gene explained 7.79% of trait variance. An additional simulated phenotype, Q1, was designed to be correlated among family members but to not be associated with any sequence variants. Two hundred replicates of the phenotypes were simulated, with each including data for 849 individuals.
PMCID: PMC4145406  PMID: 25519314
8.  A weighted accumulation test for associating rare genetic variation with quantitative phenotypes 
BMC Proceedings  2011;5(Suppl 9):S6.
Currently there is a great deal of interest in developing methods for testing the role that rare variation plays in disease development. Here we propose a weighted association test that accumulates genetic variation across a signaling pathway. We evaluate our approach by analyzing simulated phenotype data from an exome sequencing study of 697 unrelated individuals from the Genetic Analysis Workshop 17 (GAW17) data set. Although our weighted approach identifies several interesting pathways associated with phenotype Q1, so does an alternative unweighted accumulation approach. Such a result is not unexpected because there is no systematic relationship between the allele frequency of a variant and its effect on phenotype in the GAW17 simulation model.
PMCID: PMC3287898  PMID: 22373271
9.  Genetics Analysis Workshop 16 Problem 2: the Framingham Heart Study data 
BMC Proceedings  2009;3(Suppl 7):S3.
Genetic Analysis Workshop 16 (GAW16) Problem 2 presented data from the Framingham Heart Study (FHS), an observational, prospective study of risk factors for cardiovascular disease begun in 1948. Data have been collected in three generations of family participants in the study and the data presented for GAW16 included phenotype data from all three generations, with four examinations of data collected repeatedly for the first two generations. The trait data consisted of information on blood pressure, hypertension treatment, lipid levels, diabetes and blood glucose, smoking, alcohol consumed, weight, and coronary heart disease incidence. Additionally, genotype data obtained through a genome-wide scan (FHS SHARe) of 550,000 single-nucleotide polymorphisms from Affymetrix chips were included with the GAW16 data. The genotype data were also used for GAW16 Problem 3, where simulated phenotypes were generated using the actual FHS genotypes. These data served to provide investigators with a rich resource to study the behavior of genome-wide scans with longitudinally collected family data and to develop and apply new procedures.
PMCID: PMC2795927  PMID: 20018020
10.  Pairwise shared genomic segment analysis in high-risk pedigrees: application to Genetic Analysis Workshop 17 exome-sequencing SNP data 
BMC Proceedings  2011;5(Suppl 9):S9.
We applied our method of pairwise shared genomic segment (pSGS) analysis to high-risk pedigrees identified from the Genetic Analysis Workshop 17 (GAW17) mini-exome sequencing data set. The original shared genomic segment method focused on identifying regions shared by all case subjects in a pedigree; thus it can be sensitive to sporadic cases. Our new method examines sharing among all pairs of case subjects in a high-risk pedigree and then uses the mean sharing as the test statistic; in addition, the significance is assessed empirically based on the pedigree structure and linkage disequilibrium pattern of the single-nucleotide polymorphisms. Using all GAW17 replicates, we identified 18 unilineal high-risk pedigrees that contained excess disease (p < 0.01) and at least 15 meioses between case subjects. Eighteen rare causal variants were polymorphic in this set of pedigrees. Based on a significance threshold of 0.001, 72.2% (13/18) of these pedigrees were successfully identified with at least one region that contains a true causal variant. The regions identified included 4 of the possible 18 polymorphic causal variants. On average, 1.1 true positives and 1.7 false positives were identified per pedigree. In conclusion, we have demonstrated the potential of our new pSGS method for localizing rare disease causal variants in common disease using high-risk pedigrees and exome sequence data.
PMCID: PMC3287931  PMID: 22373081
11.  Incorporating Biological Information into Association Studies of Sequencing Data 
Genetic Epidemiology  2011;35(Suppl 1):S29-S34.
We summarize the methodological contributions from Group 3 of Genetic Analysis Workshop 17 (GAW17). The overarching goal of these methods was the evaluation and enhancement of state-of-the-art approaches in integration of biological knowledge into association studies of rare variants. We found that methods loosely fell into three major categories: (1) hypothesis testing of index scores based on aggregating rare variants at the gene level, (2) variable selection techniques that incorporate biological prior information, and (3) novel approaches that integrate external (i.e., not provided by GAW17) prior information, such as pathway and single-nucleotide polymorphism (SNP) annotations. Commonalities among the findings from these contributions are that gene-based analysis of rare variants is advantageous to single-SNP analysis and that the minor allele frequency threshold to identify rare variants may influence power and thus needs to be carefully considered. A consistent increase in power was also identified by considering only nonsynonymous SNPs in the analyses. Overall, we found that no single method had an appreciable advantage over the other methods. However, methods that carried out sensitivity analyses by comparing biologically informative to noninformative prior probabilities demonstrated that integrating biological knowledge into statistical analyses always, at the least, enabled subtle improvements in the performance of any statistical method applied to these simulated data. Although these statistical improvements reflect the simulation model assumed for GAW17, our hope is that the simulation models provide a reasonable representation of the underlying biology and that these methods can thus be of utility in real data.
PMCID: PMC3635488  PMID: 22128055
exome sequencing; pathway analysis; gene association
12.  Genetic Analysis Workshop 17 mini-exome simulation 
BMC Proceedings  2011;5(Suppl 9):S2.
The data set simulated for Genetic Analysis Workshop 17 was designed to mimic a subset of data that might be produced in a full exome screen for a complex disorder and related risk factors in order to permit workshop participants to investigate issues of study design and statistical genetic analysis. Real sequence data from the 1000 Genomes Project formed the basis for simulating a common disease trait with a prevalence of 30% and three related quantitative risk factors in a sample of 697 unrelated individuals and a second sample of 697 individuals in large, extended pedigrees. Called genotypes for 24,487 autosomal markers assigned to 3,205 genes and simulated affection status, quantitative traits, age, sex, pedigree relationships, and cigarette smoking were provided to workshop participants. The simulating model included both common and rare variants with minor allele frequencies ranging from 0.07% to 25.8% and a wide range of effect sizes for these variants. Genotype-smoking interaction effects were included for variants in one gene. Functional variants were concentrated in genes selected from specific biological pathways and were selected on the basis of the predicted deleteriousness of the coding change. For each sample, unrelated individuals and family, 200 replicates of the phenotypes were simulated.
PMCID: PMC3287854  PMID: 22373155
13.  Detecting Rare Variant Associations: Methods for Testing Haplotypes and Multiallelic Genotypes 
Genetic Epidemiology  2011;35(Suppl 1):S85-S91.
We summarize the work done by the contributors to Group 13 at Genetic Analysis Workshop 17 (GAW17) and provide a synthesis of their data analyses. The Group 13 contributors used a variety of approaches to test associations of both rare variants and common single-nucleotide polymorphisms (SNPs) with the GAW17 simulated traits, implementing analytic methods that incorporate multiallelic genotypes and haplotypes. In addition to using a wide variety of statistical methods and approaches, the contributors exhibited a remarkable amount of flexibility and creativity in coding the variants and their genes and in evaluating their proposed approaches and methods. We describe and contrast their methods along three dimensions: (1) selection and coding of genetic entities for analysis, (2) method of analysis, and (3) evaluation of the results. The contributors consistently presented a strong rationale for using multiallelic analytic approaches. They indicated that power was likely to be increased by capturing the signals of multiple markers within genetic entities defined by sliding windows, haplotypes, genes, functional pathways, and the entire set of SNPs and rare variants taken in aggregate. Despite this variability, the methods were fairly consistent in their ability to identify two associated genes for each simulated trait. The first gene was selected for the largest number of causal alleles and the second for a high-frequency causal SNP. The presumed model of inheritance and choice of genetic entities are likely to have a strong effect on the outcomes of the analyses.
PMCID: PMC3274416  PMID: 22128065
rare variants; sequence data; multiallelic data; Bayesian regression; penalized regression; tree-based clustering; pathway analysis; haplotypes
14.  Lessons Learned from Genetic Analysis Workshop 17: Transitioning from Genome-Wide Association Studies to Whole-Genome Statistical Genetic Analysis 
Genetic Epidemiology  2011;35(Suppl 1):S107-S114.
Genetic Analysis Workshop 17 (GAW17) focused on the transition from genome-wide association study designs and methods to the study designs and statistical genetic methods that will be required for the analysis of next-generation sequence data including both common and rare sequence variants. In the 166 contributions to GAW17, a wide variety of statistical methods were applied to simulated traits in population- and family-based samples, and results from these analyses were compared to the known generating model. In general, many of the statistical genetic methods used in the population-based sample identified causal sequence variants (SVs) when the estimated locus-specific heritability, as measured in the population-based sample, was greater than about 0.08. However, SVs with locus-specific heritabilities less than 0.03 were rarely identified consistently. In the family-based samples, many of the methods detected SVs that were rarer than those detected in the population-based sample, but the estimated locus-specific heritabilities for these rare SVs, as measured in the family-based samples, were substantially higher (>0.2) than their corresponding heritabilities in the population-based samples. Substantial inflation of the type I error rate was observed across a wide variety of statistical methods. Although many of the contributions found little inflation in type I error for Q4, a trait with no causal SVs, type I error rates for Q1 and Q2 were well above their nominal levels with the inflation for Q1 being higher than that for Q2. It seems likely that this inflation in type I error is due to correlations among SVs.
PMCID: PMC3277851  PMID: 22128050
linkage; association; next-generation sequencing; computer simulation
15.  Identifying rare-variant associations in parent-child trios using a Gaussian support vector machine 
BMC Proceedings  2014;8(Suppl 1):S98.
As the availability of cost-effective high-throughput sequencing technology increases, genetic research is beginning to focus on identifying the contributions of rare variants (RVs) to complex traits. Using RVs to detect associated genes requires statistical approaches that mitigate the lack of power with the analysis of single RVs. Here we report the development and application of an approach that aggregates and evaluates the transmissions of RVs in parent-child trios. An initial score that incorporates the distortion in transmission of the observed RVs from the parents to their offspring is calculated for each variant. The scores are analyzed using a support vector machine that handles these data by mapping the transmission distortion of the multiple RVs into a one-dimensional score in a nonlinear fashion when parent-child trios with affected and nonaffected children are contrasted. We refer to this approach as Trio-SVM. A total of 275 trios were available in the Genetic Analysis Workshop 18 data for analysis. Because of their nonindependence and the extended linkage disequilibrium (LD) within pedigrees, Trio-SVM was vulnerable to type I errors in detecting association. Using the GAW18 data with simulated trait values, Trio-SVM has an appropriate type I error, but it lacks power with a sample of 267 trios. Larger samples of 500 to 1000 trios, derived from combining the simulated data, provided sufficient power. Two chromosome 3 candidate genes were tested in the real GAW18 data with Trio-SVM, and they showed marginal associations with hypertension.
PMCID: PMC4143758  PMID: 25519420
16.  Prioritizing single-nucleotide variations that potentially regulate alternative splicing 
BMC Proceedings  2011;5(Suppl 9):S40.
Recent evidence suggests that many complex diseases are caused by genetic variations that play regulatory roles in controlling gene expression. Most genetic studies focus on nonsynonymous variations that can alter the amino acid composition of a protein and are therefore believed to have the highest impact on phenotype. Synonymous variations, however, can also play important roles in disease pathogenesis by regulating pre-mRNA processing and translational control. In this study, we systematically survey the effects of single-nucleotide variations (SNVs) on binding affinity of RNA-binding proteins (RBPs). Among the 10,113 synonymous SNVs identified in 697 individuals in the 1000 Genomes Project and distributed by Genetic Analysis Workshop 17 (GAW17), we identified 182 variations located in alternatively spliced exons that can significantly change the binding affinity of nine RBPs whose binding preferences on 7-mer RNA sequences were previously reported. We found that the minor allele frequencies of these variations are similar to those of nonsynonymous SNVs, suggesting that they are in fact functional. We propose a workflow to identify phenotype-associated regulatory SNVs that might affect alternative splicing from exome-sequencing-derived genetic variations. Based on the SNVs affecting the quantitative traits simulated in GAW17, we further identified two and four functional SNVs that are predicted to be involved in alternative splicing regulation in traits Q1 and Q2, respectively.
PMCID: PMC3287877  PMID: 22373210
17.  Microsatellite linkage analysis, single-nucleotide polymorphisms, and haplotype associations with ECB21 in the COGA data 
BMC Genetics  2005;6(Suppl 1):S94.
This study, part of the Genetic Analysis Workshop 14 (GAW14), explored real Collaborative Study on the Genetics of Alcoholism data for linkage and association mapping between genetic polymorphisms (microsatellite and single-nucleotide polymorphisms (SNPs)) and beta (16.5–20 Hz) oscillations of the brain rhythms (ecb21). The ecb21 phenotype underwent the statistical adjustments for the age of participants, and for attaining a normal distribution. A total of 1,000 subjects' available phenotypes were included in linkage analysis with microsatellite markers. Linkage analysis was performed only for chromosome 4 where a quantitative trait locus with 5.01 LOD score had been previously reported. Previous findings related this location with the γ-aminobutyric acid type A (GABAA) receptor. At the same location, our analysis showed a LOD score of 2.2. This decrease in the LOD score is the result of a drastic reduction (one-third) of the available GAW14 phenotypic data. We performed SNP and haplotype association analyses with the same phenotypic data under the linkage peak region on chromosome 4. Seven Affymetrix and two Illumina SNPs showed significant associations with ecb21 phenotype. A haplotype, a combination of SNPs TSC0044171 and TSC0551006 (the latter almost under the region of GABAA genes), showed a significant association with ecb21 (p = 0.015) and a relatively high frequency in the sample studied. Our results affirmed that the GABA region has potential of harboring genes that contribute quantitatively to the beta oscillation of the brain rhythms. The inclusion of the remaining 614 subjects, which in the GAW14 had missing data for the ecb21, can improve the strength of the associations as they have already shown that they contribute quite important information in the linkage analysis.
PMCID: PMC1866685  PMID: 16451710
18.  Bivariate association analysis of longitudinal phenotypes in families 
BMC Proceedings  2014;8(Suppl 1):S90.
Statistical genetic methods incorporating temporal variation allow for greater understanding of genetic architecture and consistency of biological variation influencing development of complex diseases. This study proposes a bivariate association method jointly testing association of two quantitative phenotypic measures from different time points. Measured genotype association was analyzed for single-nucleotide polymorphisms (SNPs) for systolic blood pressure (SBP) from the first and third visits using 200 simulated Genetic Analysis Workshop 18 (GAW18) replicates. Bivariate association, in which the effect of an SNP on the mean trait values of the two phenotypes is constrained to be equal for both measures and is included as a covariate in the analysis, was compared with a bivariate analysis in which the effect of an SNP was estimated separately for the two measures and univariate association analyses in 9 SNPs that explained greater than 0.001% SBP variance over all 200 GAW18 replicates. The SNP 3_48040283 was significantly associated with SBP in all 200 replicates with the constrained bivariate method providing increased signal over the unconstrained bivariate method. This method improved signal in all 9 SNPs with simulated effects on SBP for nominal significance (p-value <0.05). However, this appears to be determined by the effect size of the SNP on the phenotype. This bivariate association method applied to longitudinal data improves genetic signal for quantitative traits when the effect size of the variant is moderate to large.
PMCID: PMC4143799  PMID: 25519346
19.  Whole genome sequence analysis of the simulated systolic blood pressure in Genetic Analysis Workshop 18 family data: long-term average and collapsing methods 
BMC Proceedings  2014;8(Suppl 1):S12.
Analysis of longitudinal family data is challenging because of 2 sources of correlations: correlations across longitudinal measurements and correlations among related individuals. We investigated whether analysis using long-term average (average of all 3 visits) can enhance gene discovery compared with a single-visit analysis. We analyzed all 200 replicates of simulated systolic blood pressure (SBP) in Genetic Analysis Workshop 18 (GAW18) family data using both single-marker and collapsing methods. We considered 2 collapsing approaches: collapsing all variants and collapsing low-frequency variants. Analysis using long-term average performed slightly better than SBP measured at a single visit. Collapsing all variants performed much better than collapsing low-frequency variants at MAP4 and FLNB, which included a common variant with a relatively large effect. For several variants in gene MAP4, single-marker analysis also provided high power. In contrast, collapsing only low-frequency variants performed much better for SCAP, DNASE1L3, and LOC152217, where rare variants in these genes had larger effect than common variants. However, for other causal variants, all approaches provided disappointingly poor performance. This poor performance appeared to occur because most of these causal variants explained a very small fraction of phenotypic variance. We also found that collapsing multiple variants did worse than single-marker analysis for several genes when they contained causal single-nucleotide polymorphisms (SNPs) with both positive and negative effects. Because half of causal SNPs were not found in the annotation file based on the 1000 Genomes Project, we found that power was also affected by our use of incomplete annotation information.
PMCID: PMC4143632  PMID: 25519365
20.  Genetic Analysis Workshop 16: Introduction to Workshop Summaries 
Genetic Epidemiology  2009;33(Suppl 1):S1-S7.
Genetic Analysis Workshop 16 (GAW16) was held September 17-20, 2008, in St. Louis, Missouri. The focus of GAW16 was on methods and challenges in analysis of single-nucleotide polymorphism (SNP) data from genome-wide scans. GAW16 attracted 221 participants from 12 countries. The 168 contributions were organized into 17 discussion groups of 6 to 17 papers each. Three data sets were available for analysis. Two of these were data from ongoing studies, generously provided by the investigators. The North American Rheumatoid Arthritis Consortium provided case-control data on rheumatoid arthritis, and the Framingham Heart Study made available information on cardiovascular risk factors for participants in three generations of pedigree data. The third data set included simulated phenotypes for participants in the Framingham Heart Study, using actual pedigree structures and genotypes. This volume includes a paper for each of the 17 discussion groups, summarizing their main findings.
PMCID: PMC2987734  PMID: 19924709
single-nucleotide polymorphism; SNP; genome-wide scan; association; linkage; haplotype
21.  Rheumatoid arthritis, item response theory, Blom transformation, and mixed models 
BMC Proceedings  2007;1(Suppl 1):S116.
We studied rheumatoid arthritis (RA) in the North American Rheumatoid Arthritis Consortium (NARAC) data (1499 subjects; 757 families). Identical methods were applied for studying RA in the Genetic Analysis Workshop 15 (GAW15) simulated data (with prior knowledge of the simulation answers). Fifty replications of GAW15 simulated data had 3497 ± 20 subjects in 1500 nuclear families. Two new statistical methods were applied to transform the original phenotypes in these data: item response theory (IRT), to create a latent variable from nine classifying predictors, and a Blom transformation of the anti-CCP (anti-cyclic citrullinated peptide) variable. We fitted linear mixed-effects (LME) models to study the additive associations of 404 Illumina-genotyped single-nucleotide polymorphisms (SNPs) in the NARAC data and of 17,820 SNPs in the GAW15 simulated data. In the GAW15 simulated data, the association with the anti-CCP Blom transformation showed 100% sensitivity for SNP1, located in the major histocompatibility complex gene. In contrast, the association of SNP1 with the IRT latent variable showed only 24% sensitivity. From the simulated data, we conclude that the Blom transformation of the anti-CCP variable produced more reliable results than the latent variable from the qualitative combination of a group of RA risk factors. In the NARAC data, the significant RA-SNP associations found with both phenotype-transformation methods provided a trend that may point toward dynein and energy control genes. Finer genotyping in the NARAC data would grant more exact evidence for the contributions of chromosome 6 to RA.
PMCID: PMC2367565  PMID: 18466457
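The Blom transformation mentioned in this abstract is commonly defined as a rank-based inverse normal transform, z = Φ⁻¹((r − 3/8) / (n + 1/4)), where r is the rank of an observation among n values. The sketch below illustrates that textbook form only. The function name `blom_transform` and the averaging of tied ranks are assumptions for illustration, not the authors' implementation.

```python
from statistics import NormalDist


def blom_transform(values):
    """Rank-based inverse normal transform with Blom's offset (3/8).

    Assumption: ties receive the average of the ranks they span.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]


# Hypothetical phenotype values, for illustration only.
scores = blom_transform([3.0, 1.0, 2.0])
```

The transformed scores are symmetric around zero, so the median observation maps to 0 and the extremes map to equal and opposite normal quantiles.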
22.  Rare variant collapsing in conjunction with mean log p-value and gradient boosting approaches applied to Genetic Analysis Workshop 17 data 
BMC Proceedings  2011;5(Suppl 9):S94.
In addition to methods that can identify common variants associated with susceptibility to common diseases, there has been increasing interest in approaches that can identify rare genetic variants. We use the simulated data provided to the participants of Genetic Analysis Workshop 17 (GAW17) to identify both rare and common single-nucleotide polymorphisms (SNPs) and pathways associated with disease status. We apply a rare variant collapsing approach and the usual association tests for common variants to identify candidates for further analysis using pathway-based and tree-based ensemble approaches. We use the mean log p-value approach to identify a top set of pathways and compare it to the pathways used in the simulation of the GAW17 data set. We conclude that the mean log p-value approach is able to identify those pathways in the top list, as well as related pathways. We also apply the stochastic gradient boosting approach to the selected subset of SNPs. When we compare the result of this tree-based method with the list of SNPs used in the data set simulation, we observe a number of false positives in addition to the correct SNPs.
PMCID: PMC3287936  PMID: 22373203
23.  Evaluation of association tests for rare variants using simulated data sets in the Genetic Analysis Workshop 17 data 
BMC Proceedings  2011;5(Suppl 9):S86.
We evaluate four association tests for rare variants: the combined multivariate and collapsing (CMC) method, two weighted-sum methods, and a variable threshold method. We apply them to the simulated data sets of unrelated individuals in the Genetic Analysis Workshop 17 (GAW17) data. The family-wise error rate (FWER) and average power are used as criteria for evaluation. Our results show that when all nonsynonymous SNPs (rare variants and common variants) in a gene are jointly analyzed, the CMC method fails to control the FWER; when only rare variants (single-nucleotide polymorphisms with minor allele frequency less than 0.05) are analyzed, all four methods can control the FWER well. All four methods have comparable power, which is low for the analysis of the GAW17 data sets. Three of the methods (not including the CMC method) involve estimation of p-values using permutation procedures that either can be computationally intensive or generate inflated FWERs. We adapt a fast permutation procedure into these three methods. The results show that the fast permutation procedure produces FWERs and average powers close to the values obtained from the computationally intensive standard permutation procedure on the GAW17 data sets.
PMCID: PMC3287927  PMID: 22373475
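As a rough illustration of the collapsing step that these rare-variant tests share, the sketch below reduces the rare variants in one gene to a single carrier indicator per individual, using the 0.05 minor allele frequency cutoff quoted in the abstract. The function name `collapse_rare`, the 0/1/2 genotype coding, and the assumption that the coded allele is the minor allele are all illustrative. The association test itself (for example, the multivariate test used by CMC) is not shown.

```python
def collapse_rare(genotypes, maf_threshold=0.05):
    """Collapse rare variants in one gene into a per-individual indicator.

    genotypes: list of rows, one per individual; each entry is the count
    (0, 1, or 2) of the coded allele at a variant. Assumption: the coded
    allele is the minor allele, so its frequency is the MAF.
    Returns (indicator, rare_columns).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Allele frequency per variant: allele count over 2n chromosomes.
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    # Keep polymorphic variants below the frequency threshold.
    rare = [j for j, f in enumerate(freqs) if 0 < f < maf_threshold]
    # 1 if the individual carries at least one rare allele, else 0.
    indicator = [int(any(row[j] > 0 for j in rare)) for row in genotypes]
    return indicator, rare


# Hypothetical data: 20 individuals, 2 variants; only individual 3
# carries the first (rare) variant, and everyone carries the second.
rows = [[0, 1] for _ in range(20)]
rows[3][0] = 1
indicator, rare_columns = collapse_rare(rows)
```

The resulting indicator can then be tested against disease status like a single marker, which is what makes collapsing attractive when individual rare alleles are too sparse to test one at a time.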
24.  Penalized regression approaches to testing for quantitative trait-rare variant association 
Frontiers in Genetics  2014;5:121.
In statistical data analysis, penalized regression is considered an attractive approach for its ability to perform variable selection and parameter estimation simultaneously. Although penalized regression methods have shown many advantages in variable selection and outcome prediction over other approaches for high-dimensional data, there is a relative paucity of literature on their applications to hypothesis testing, e.g., in genetic association analysis. In this study, we apply several new penalized regression methods with a novel penalty, called Truncated L1-penalty (TLP) (Shen et al., 2012), for either variable selection, or both variable selection and parameter grouping, in a data-adaptive way to test for association between a quantitative trait and a group of rare variants. The performance of the new methods is compared with that of some existing tests, including some recently proposed global tests and penalized regression-based methods, via simulations and an application to the real sequence data of the Genetic Analysis Workshop 17 (GAW17). Although our proposed penalized methods can improve over some existing penalized methods, they often do not outperform some existing global association tests. Some possible problems with utilizing penalized regression methods in genetic hypothesis testing are discussed. Given the capability of penalized regression in selecting causal variants and its sometimes promising performance, further studies are warranted.
PMCID: PMC4026747  PMID: 24860593
GWAS; SSU test; SSUw test; Sum test; TLP
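For orientation, the truncated L1-penalty is usually written as J(β) = Σ min(|β_j| / τ, 1): it behaves like an L1 penalty for small coefficients and stops growing once a coefficient exceeds the threshold τ. The sketch below evaluates only that penalty term, assuming this standard form. The function name `tlp_penalty` is hypothetical, and the abstract does not describe the fitting procedure or test statistics, so none are shown.

```python
def tlp_penalty(beta, tau):
    """Truncated L1 penalty: sum over j of min(|beta_j| / tau, 1).

    Assumption: this is the standard TLP form; tau > 0 is the
    truncation threshold chosen by the analyst.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    return sum(min(abs(b) / tau, 1.0) for b in beta)


# Hypothetical coefficients: one zero, one small, one large.
penalty = tlp_penalty([0.0, 0.05, 2.0], tau=0.1)
```

Because the penalty is capped at 1 per coefficient, large effects are not shrunk further, in contrast to the ordinary L1 (lasso) penalty.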
25.  Comparison of collapsing methods for the statistical analysis of rare variants 
BMC Proceedings  2011;5(Suppl 9):S115.
Novel technologies allow sequencing of whole genomes and are considered as an emerging approach for the identification of rare disease-associated variants. Recent studies have shown that multiple rare variants can explain a particular proportion of the genetic basis for disease. Following this assumption, we compare five collapsing approaches to test for groupwise association with disease status, using simulated data provided by Genetic Analysis Workshop 17 (GAW17). Variants are collapsed in different scenarios per gene according to different minor allele frequency (MAF) thresholds and their functionality. For comparing the different approaches, we consider the family-wise error rate and the power. Most of the methods could maintain the nominal type I error levels well for small MAF thresholds, but the power was generally low. Although the methods considered in this report are common approaches for analyzing rare variants, they performed poorly with respect to the simulated disease phenotype in the GAW17 data set.
PMCID: PMC3287839  PMID: 22373249
