Gene set analysis methods continue to be a popular and powerful means of evaluating genome-wide transcriptomics data. These approaches require a priori grouping of genes into biologically meaningful sets, with downstream analyses then conducted at the set (rather than gene) level. Gene set analysis methods have been shown to yield more powerful statistical conclusions than single-gene analyses, due both to reduced multiple testing penalties and to potentially larger observed effects from aggregating effects across multiple genes in the set. Traditionally, gene set analysis methods have been applied directly to normalized, log-transformed transcriptomics data. Recently, efforts have been made to transform transcriptomics data to scales yielding more biologically interpretable results. For example, recently proposed models transform log-transformed transcriptomics data to a confidence metric (ranging from 0 to 100%) that a gene is active (roughly speaking, that the gene product is part of an active cellular mechanism). In this manuscript, we demonstrate, on both real and simulated transcriptomics data, that tests for differential expression between sets of genes are typically more powerful when using gene activity state estimates as opposed to log-transformed gene expression data. Our analysis suggests further exploration of techniques to transform transcriptomics data to meaningful quantities for improved downstream inference.
The use of posterior probabilities to summarize genotype uncertainty is pervasive across genotype, sequencing and imputation platforms. Prior work in many contexts has shown the utility of incorporating genotype uncertainty (posterior probabilities) in downstream statistical tests. Typical approaches to incorporating genotype uncertainty when testing Hardy-Weinberg equilibrium tend to lack calibration in the type I error rate, especially as genotype uncertainty increases. We propose a new approach in the spirit of genomic control that properly calibrates the type I error rate, while yielding improved power to detect deviations from Hardy-Weinberg equilibrium. We demonstrate the improved performance of our method on both simulated and real genotypes.
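To illustrate the genomic-control idea invoked in the abstract above: a genome-wide set of 1-df chi-square statistics is deflated by an inflation factor estimated from its median. The sketch below is a generic illustration under assumed function names and toy null data, not the calibration procedure actually proposed in the paper.

```python
import numpy as np
from scipy import stats

def hwe_chisq(n_aa, n_ab, n_bb):
    """Standard 1-df chi-square statistic for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # allele frequency of A
    exp = np.array([n * p ** 2, 2 * n * p * (1 - p), n * (1 - p) ** 2])
    obs = np.array([n_aa, n_ab, n_bb])
    return float(((obs - exp) ** 2 / exp).sum())

def genomic_control(chisq_stats):
    """Deflate test statistics by the inflation factor lambda: the ratio of
    the median observed statistic to the median of the 1-df chi-square
    distribution (~0.455). Lambda is floored at 1 so well-calibrated
    statistics are left unchanged."""
    lam = max(1.0, np.median(chisq_stats) / stats.chi2.ppf(0.5, df=1))
    return np.asarray(chisq_stats) / lam

# Toy example: statistics simulated under the null give lambda near 1.
rng = np.random.default_rng(0)
null_stats = rng.chisquare(df=1, size=5000)
adjusted = genomic_control(null_stats)
```

Counts in perfect Hardy-Weinberg proportions (e.g., 25/50/25) give a statistic of exactly zero, as expected.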
Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. An important step toward meeting the challenge of understanding gene function and regulation is the identification of sets of genes that are always co-expressed. These gene sets, Atomic Regulons (ARs), represent fundamental units of function within a cell and could be used to associate genes of unknown function with cellular processes and to enable rational genetic engineering of cellular systems. Here, we describe an approach for inferring ARs that leverages large-scale expression data sets, gene context, and functional relationships among genes. We computed ARs for Escherichia coli based on 907 gene expression experiments and compared our results with gene clusters produced by two prevalent data-driven methods: hierarchical clustering and k-means clustering. We compared ARs and purely data-driven gene clusters to the curated set of regulatory interactions for E. coli found in RegulonDB, showing that ARs are more consistent with gold standard regulons than are data-driven gene clusters. We further examined the consistency of ARs and data-driven gene clusters in the context of gene interactions predicted by Context Likelihood of Relatedness (CLR) analysis, finding that the ARs show better agreement with CLR predicted interactions. We also determined the impact of increasing amounts of expression data on AR construction, finding that while more data improve ARs, it is not necessary to use the full set of gene expression experiments available for E. coli to produce high quality ARs. To explore the conservation of co-regulated gene sets across different organisms, we computed ARs for Shewanella oneidensis, Pseudomonas aeruginosa, Thermus thermophilus, and Staphylococcus aureus, each of which represents increasing degrees of phylogenetic distance from E. coli.
Comparison of the organism-specific ARs showed that the consistency of AR gene membership correlates with phylogenetic distance, but there is clear variability in the regulatory networks of closely related organisms. As large scale expression data sets become increasingly common for model and non-model organisms, comparative analyses of atomic regulons will provide valuable insights into fundamental regulatory modules used across the bacterial domain.
atomic regulon; clustering; gene expression analysis; transcriptomic data; Escherichia coli; hierarchical clustering; CLR; k-means clustering
Current rare-variant, gene-based tests of association often suffer from a lack of statistical power to detect genotype–phenotype associations as a result of a lack of prior knowledge of genetic disease models combined with limited observations of extremely rare causal variants in population-based samples. The use of pedigree data, in which rare variants are often more highly concentrated than in population-based data, has been proposed as 1 possible method for enhancing power. Methods for combining multiple gene-based tests of association into a single summary p value are a robust approach to different genetic architectures when little a priori knowledge is available about the underlying genetic disease model. To date, however, little consideration has been given to combining gene-based tests of association for the analysis of pedigree data. We propose a flexible framework for combining any number of family-based rare-variant tests of association into a single summary statistic and for assessing the significance of that statistic. We show that this approach maintains type I error and improves the robustness, to different genetic architectures, of the statistical power of family- and gene-based rare-variant tests through application to simulated phenotype data from Genetic Analysis Workshop 19.
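One classical way to combine multiple gene-based p values into a single summary p value, as described in the abstract above, is Fisher's method. The sketch below assumes independent tests and is purely illustrative; the paper's framework is more general (any number and type of family-based rare-variant tests, with significance assessed empirically).

```python
import math
from scipy import stats

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null that no test in the set is
    associated (assuming the k tests are independent)."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return stats.chi2.sf(stat, df=2 * k)

# Combining three gene-based p values into one summary p value.
combined = fisher_combine([0.01, 0.20, 0.47])
```

A single p value of 1 combines to exactly 1, and several small p values reinforce each other into a much smaller summary p value.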
The aggregation of functionally associated variants given a priori biological information can aid in the discovery of rare variants associated with complex diseases. Many methods exist that aggregate rare variants into a set and compute a single p value summarizing association between the set of rare variants and a phenotype of interest. These methods are often called gene-based, rare variant tests of association because the variants in the set are often all contained within the same gene. A reasonable extension of these approaches involves aggregating variants across an even larger set of variants (eg, all variants contained in genes within a pathway). Testing sets of variants such as pathways for association with a disease phenotype reduces multiple testing penalties, may increase power, and allows for straightforward biological interpretation. However, a significant variant-set association test does not indicate precisely which variants contained within that set are causal. Because pathways often contain many variants, it may be helpful to follow up significant pathway tests by conducting gene-based tests on each gene in that pathway to narrow down the region of causal variants. In this paper, we propose such a multistep approach for variant-set analysis that can also account for covariates and complex pedigree structure. We demonstrate this approach on simulated phenotypes from Genetic Analysis Workshop 19. We find generally better power for the multistep approach when compared to a more conventional, single-step approach that simply runs gene-based tests of association on each gene across the genome. Further work is necessary to evaluate the multistep approach on different data sets with different characteristics.
Machine learning methods continue to show promise in the analysis of data from genetic association studies because of the high number of variables relative to the number of observations. However, few best practices exist for the application of these methods. We extend a recently proposed supervised machine learning approach for predicting disease risk by genotypes to be able to incorporate gene expression data and rare variants. We then apply 2 different versions of the approach (radial and linear support vector machines) to simulated data from Genetic Analysis Workshop 19 and compare performance to logistic regression. Method performance was not radically different across the 3 methods, although the linear support vector machine tended to show small gains in predictive ability relative to a radial support vector machine and logistic regression. Importantly, as the number of genes in the models was increased, even when those genes contained causal rare variants, model predictive ability showed a statistically significant decrease in performance for both the radial support vector machine and logistic regression. The linear support vector machine showed more robust performance to the inclusion of additional genes. Further work is needed to evaluate machine learning approaches on larger samples and to evaluate the relative improvement in model prediction from the incorporation of gene expression data.
Numerous methods for classifying gene activity states based on gene expression data have been proposed for use in downstream applications, such as incorporating transcriptomics data into metabolic models in order to improve resulting flux predictions. These methods often attempt to classify gene activity for each gene in each experimental condition as belonging to one of two states: active (the gene product is part of an active cellular mechanism) or inactive (the cellular mechanism is not active). These existing methods of classifying gene activity states suffer from multiple limitations, including enforcing unrealistic constraints on the overall proportions of active and inactive genes, failing to leverage a priori knowledge of gene co-regulation, failing to account for differences between genes, and failing to provide statistically meaningful confidence estimates. We propose a flexible Bayesian approach to classifying gene activity states based on a Gaussian mixture model. The model integrates genome-wide transcriptomics data from multiple conditions and information about gene co-regulation to provide activity state confidence estimates for each gene in each condition. We compare the performance of our novel method to existing methods on both simulated data and real data from 907 E. coli gene expression arrays, as well as a comparison with experimentally measured flux values in 29 conditions, demonstrating that our method provides more consistent and accurate results than existing methods across a variety of metrics.
metabolic modeling; gene expression; bacteria; gene activity; Bayesian model
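The core of the mixture-model classification described in the abstract above is Bayes' rule applied to a two-component Gaussian mixture: the posterior probability that an expression value came from the "active" component serves as the activity confidence. The sketch below uses fixed, illustrative parameters; the actual model is Bayesian, fitted across conditions, and incorporates co-regulation information.

```python
import numpy as np
from scipy.stats import norm

def activity_confidence(expr, mu_off, sd_off, mu_on, sd_on, prior_on=0.5):
    """Posterior probability that each log-expression value was generated
    by the 'active' component of a two-component Gaussian mixture.
    All parameters here are illustrative, not fitted from data."""
    like_on = prior_on * norm.pdf(expr, mu_on, sd_on)
    like_off = (1.0 - prior_on) * norm.pdf(expr, mu_off, sd_off)
    return like_on / (like_on + like_off)

# Values near the 'active' mean get confidences near 1; values midway
# between the two component means get a confidence of 0.5.
expr = np.array([3.0, 6.0, 9.0])
conf = activity_confidence(expr, mu_off=4.0, sd_off=1.0, mu_on=8.0, sd_on=1.0)
```

In a full implementation the component means, variances, and mixing weight would be estimated (e.g., by EM or MCMC) from the genome-wide expression data rather than supplied by hand.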
Most genome-wide association studies have explored relationships between genetic variants and plasma phospholipid fatty acid proportions, but few have examined apparent genetic influences on the membrane fatty acid profile of red blood cells (RBC). Using RBC fatty acid data from the Framingham Offspring Study, we analyzed over 2.5 million single nucleotide polymorphisms (SNPs) for association with 14 RBC fatty acids, identifying 191 different SNPs associated with at least 1 fatty acid. Significant associations (p<1×10−8) were located within five distinct 1-MB regions. Of particular interest were novel associations between (1) arachidonic acid and PCOLCE2 (regulates apoA-I maturation and modulates apoA-I levels), and (2) oleic and linoleic acid and LPCAT3 (mediates the transfer of fatty acids between glycerol-lipids). We also replicated previously identified strong associations between SNPs in the FADS (chromosome 11) and ELOVL (chromosome 6) regions. Multiple SNPs explained 8–14% of the variation in 3 high abundance (>11%) fatty acids, but only 1–3% in 4 low abundance (<3%) fatty acids, with the notable exception of dihomo-gamma linolenic acid with 53% of variance explained by SNPs. Further studies are needed to determine the extent to which variations in these genes influence tissue fatty acid content and pathways modulated by fatty acids.
PUFA; apoA-I maturation; oleic acid; linoleic acid; arachidonic acid; glycerol-lipids
In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated by two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to efficiently deal with the current wealth of data.
In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in the application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.
Prolonged, severe, unalleviated fatigue may be disabling whether it occurs on its own or in conjunction with medical or psychiatric conditions. This paper compares the prevalence and correlates of prolonged fatigue in general population samples in Ukraine versus the U.S.
Population surveys were conducted in 2002 in both Ukraine (Ukraine World Mental Health [WMH] Survey) and the U.S. (National Comorbidity Survey-Replication; NCS-R). Both surveys administered the Composite International Diagnostic Interview (CIDI 3.0), which contained modules assessing: neurasthenia (prolonged fatigue); mood, anxiety, and alcohol/drug use disorders; chronic medical conditions; and demographic characteristics. Multivariable logistic regression was used to examine risk factors in each country.
The lifetime prevalence of prolonged fatigue was higher in Ukraine (5.2%) than the U.S. (3.7%). In both countries, one-fifth of individuals with prolonged fatigue had no medical or DSM-IV psychiatric condition. Also in both settings, fatigue was significantly associated with sociodemographic characteristics (being female, not working, and being previously married) as well as early onset and adult episodes of mood/anxiety disorder. Fatigue prevalence in Ukraine increased with age, but decreased in the U.S. at age 70. Unique risk factors for fatigue in Ukraine included lower socio-economic status, Ukrainian vs Russian ethnicity, and cardiovascular disease. Unique risk factors in the U.S. were parental depression/anxiety, adult episodes of alcohol/drug use disorders, pain conditions, and other health problems.
The lifetime prevalence of prolonged fatigue in Ukraine was 40% higher than that found in U.S. data. In addition, fatigue prevalence increased sharply with age in Ukraine, perhaps due to limited social and medical resources and greater comorbidity.
prolonged fatigue; neurasthenia; chronic fatigue syndrome; prevalence; Ukraine; comorbidities
Little is known about the epidemiology of late life depression in Eastern Europe. This study examined the 12-month prevalence and correlates of DSM-IV major depressive episode (MDE) in adults age 50 years and over in Ukraine. The correlates included demographic factors, mental health and alcohol history, physical conditions, and impairments in functioning.
A cross-sectional survey was conducted in Ukraine using the Composite International Diagnostic Interview (CIDI-3.0) as part of the World Health Organization-World Mental Health Survey Initiative. The sample included 1,843 respondents age 50–91. Unadjusted and adjusted odds ratios were used to examine associations of the risk factors with 12-month MDE in men and women separately.
The 12-month prevalence of MDE was 14.4% in women and 7.1% in men. In both sexes, history of MDE before age 50 and poor self-assessed mental/physical health were significantly associated with MDE. Additionally, in men, living alone, 5+ physician visits, and role impairment, but not alcoholism, were associated with depression; in women, poverty, history of anxiety disorder, medical conditions, and cognitive and self-care impairment were significant.
The 12-month prevalence of late life MDE was substantially higher in Ukraine than in Western Europe and other developed countries. The risk factors, however, were similar to those found outside Ukraine. Depression is a recurrent condition, and history of depression was the strongest risk factor. Overall, the results show that older people in Ukraine constitute a high risk group for MDE and would therefore benefit from targeted interventions by primary care physicians.
major depression; late life depression; Ukraine; risk factors; living alone; poverty
When analyzing family data, we dream of perfectly informative data, even whole genome sequences (WGS) for all family members. Reality intervenes, and we find next-generation sequence (NGS) data have error, and are often too expensive or impossible to collect on everyone. Genetic Analysis Workshop 18 groups “Quality Control” and “Dropping WGS through families using GWAS framework” focused on finding, correcting, and using errors within the available sequence and family data, developing methods to infer and analyze missing sequence data among relatives, and testing for linkage and association with simulated blood pressure.
We found that single nucleotide polymorphism (SNP) array, NGS, and imputed data are generally concordant, but that errors are particularly likely at rare variants, homozygous genotypes, within regions with repeated sequences or structural variants, and within sequence data imputed from unrelated individuals. Admixture complicated identification of cryptic relatedness, but information from Mendelian transmission improved error detection and provided an estimate of the de novo mutation rate. Both genotype and pedigree errors had an adverse effect on subsequent analyses. Computationally fast rules-based imputation was accurate, but could not cover as many loci or subjects as more computationally demanding probability-based methods. Incorporating population-level data into pedigree-based imputation methods improved results. Observed data outperformed imputed data in association testing, but imputed data were also useful.
We discuss the strengths and weaknesses of existing methods, and suggest possible future directions. Topics include improving communication between those performing data collection and analysis, establishing thresholds for and improving imputation quality, and incorporating error into imputation and analytical models.
Inference; type 1 error; power; next-generation sequence data; de novo mutation
Pathway analysis, broadly defined as a group of methods incorporating a priori biological information from public databases, has emerged as a promising approach for analyzing high-dimensional genomic data. As part of Genetic Analysis Workshop 18 (GAW18), seven research groups applied pathway analysis techniques to whole genome sequence data from the San Antonio Family Study. Overall, the groups found that the potential of pathway analysis to improve detection of causal variants by lowering the multiple testing burden and incorporating biologic insight remains largely unrealized. Specifically, there is a lack of best practices at each stage of the pathway approach: annotation, analysis, interpretation, and follow-up. Annotation of genetic variants is inconsistent across databases, incomplete, and biased towards known genes. At the analysis stage, insufficient statistical power remains a major challenge. Analyses combining rare and common variants may have an inflated type 1 error rate and may not improve detection of causal genes. Inclusion of known causal genes may not improve statistical power, although the fraction of explained phenotypic variance may be a more appropriate metric. Interpretation of findings is further complicated by evidence in support of interactions between pathways as well as the lack of consensus on how to best incorporate functional information. Finally, all presented approaches warranted follow-up studies, both to reduce the likelihood of false positive findings and to identify specific causal variants within a given pathway. Despite the initial promise of pathway analysis for modeling biological complexity of disease phenotypes, many methodological challenges currently remain to be addressed.
pathway analysis; whole genome sequence; hypertension; family studies
To examine unique and relative predictive values of demographic, social learning, developmental, psychopathology, and dyadic variables as risk factors for perpetration of intimate partner physical aggression in a national sample of married or cohabitating individuals.
Men (n=798) and women (n=770) were selected from the public use data file of the 2003 National Comorbidity Survey Replication (NCS-R) which used a multi-stage cluster sampling design.
Eight percent of women and 5% of men reported perpetrating physical aggression in the past year. Based on multivariable regression analyses, among men, the unique risk factors for perpetrating physical aggression were parental violence, dating before age 14, dating aggression, Intermittent Explosive Disorder (IED) before and after age 20, and being victimized by partner. Among women, significant risk factors were younger age, dating aggression, IED before age 20, cohabiting, victimization by partner, and marital/relationship strain.
A number of social learning, developmental, adult psychopathology, and dyadic factors were significant. Two dyadic variables, victimization and marital strain, had by far the strongest associations with perpetration of partner aggression. That dating aggression and early IED were risk factors for male and female IPV much later in life suggests the value of early interventions for those at risk.
partner violence; risk factors; Intermittent Explosive Disorder; family violence; dating aggression
Background: Rapid growth in the availability of genome-wide transcript abundance levels through gene expression microarrays and RNAseq promises to provide deep biological insights into the complex, genome-wide transcriptional behavior of single-celled organisms. However, this promise has not yet been fully realized.
Results: We find that computation of pairwise gene associations (correlation; mutual information) across a set of 2782 total genome-wide expression samples from six diverse bacteria produces unexpectedly large variation in estimates of pairwise gene association—regardless of the metric used, the organism under study, or the number and source of the samples. We pinpoint the cause to sampling bias. In particular, in repositories of expression data (e.g., Gene Expression Omnibus, GEO), many individual genes show small differences in absolute gene expression levels across the set of samples. We demonstrate that these small differences are due mainly to “noise” instead of “signal” attributable to environmental or genetic perturbations. We show that downstream analysis using gene expression levels of genes with small differences yields biased estimates of pairwise association.
Conclusions: We propose flagging genes with small differences in absolute, RMA-normalized, expression levels (e.g., standard deviation less than 0.5), as potentially yielding biased pairwise association metrics. This strategy has the potential to substantially improve the confidence in genome-wide conclusions about transcriptional behavior in bacterial organisms. Further work is needed to refine strategies for identifying genes with small differences in expression levels prior to computing gene-gene association metrics.
Pearson correlation; operon prediction; mutual information; regulatory network inference; co-regulation
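The flagging rule proposed in the conclusions above can be sketched directly: compute each gene's standard deviation of RMA-normalized expression across samples and flag those below the cutoff before estimating pairwise associations. The function name and toy data below are illustrative.

```python
import numpy as np

def flag_low_variation_genes(expr_matrix, sd_cutoff=0.5):
    """Return a boolean mask over genes (rows) whose standard deviation of
    RMA-normalized log2 expression across samples (columns) falls below
    the cutoff; pairwise associations for these genes are likely to be
    noise-dominated and hence biased."""
    sds = expr_matrix.std(axis=1, ddof=1)
    return sds < sd_cutoff

# Toy matrix: gene 0 barely varies across 50 samples; gene 1 responds
# strongly across conditions.
rng = np.random.default_rng(1)
expr = np.vstack([
    6.0 + 0.1 * rng.standard_normal(50),  # flat gene -> flagged
    6.0 + 2.0 * rng.standard_normal(50),  # variable gene -> kept
])
mask = flag_low_variation_genes(expr)
```

Downstream, correlation or mutual information would then be computed only among the unflagged genes (or flagged pairs would be reported with a caveat).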
Until very recently, few methods existed to analyze rare-variant association with binary phenotypes in complex pedigrees. We consider a set of recently proposed methods applied to the simulated and real hypertension phenotype as part of the Genetic Analysis Workshop 18. Minimal power of the methods is observed for genes containing variants with weak effects on the phenotype. Application of the methods to the real hypertension phenotype yielded no genes meeting a strict Bonferroni cutoff of significance. Some prior literature connects 3 of the 5 most associated genes (p < 1 × 10−4) to hypertension or related phenotypes. Further methodological development is needed to extend these methods to handle covariates, and to explore more powerful test alternatives.
Genotype errors are well known to increase type I errors and/or decrease power in related tests of genotype-phenotype association, depending on whether the genotype error mechanism is associated with the phenotype. These relationships hold for both single and multimarker tests of genotype-phenotype association. To assess the potential for genotype errors in Genetic Analysis Workshop 18 (GAW18) data, where no gold standard genotype calls are available, we explored concordance rates between sequencing, imputation, and microarray genotype calls. Our analysis shows that missing data rates for sequenced individuals are high and that there is a modest amount of called genotype discordance between the 2 platforms, with discordance most common for lower minor allele frequency (MAF) single-nucleotide polymorphisms (SNPs). Some evidence for discordance rates that were different between phenotypes was observed, and we identified a number of cases where different technologies identified different bases at the variant site. Type I errors and power loss are possible as a result of missing genotypes and errors in called genotypes in downstream analysis of GAW18 data.
Pathway analysis approaches for sequence data typically either operate in a single stage (all variants within all genes in the pathway are combined into a single, very large set of variants that can then be analyzed using standard "gene-based" test statistics) or in 2 stages (gene-based p values are computed for all genes in the pathway, and then the gene-based p values are combined into a single pathway p value). To date, little consideration has been given to the performance of gene-based tests (typically designed for a smaller number of single-nucleotide variants [SNVs]) when the number of SNVs in the gene or in the pathway is very large and the genotypes come from sequence data organized in large pedigrees. We consider recently proposed gene-based tests for rare variants from complex pedigrees that test for association between a large set of SNVs and a qualitative phenotype of interest (1-stage analyses) as well as 2-stage approaches. We find that many of these methods show inflated type I errors when the number of SNVs in the gene or the pathway is large (>200 SNVs) and when using standard approaches to estimate the genotype covariance matrix. Alternative methods are needed when testing very large sets of SNVs in 1-stage approaches.
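To make the 2-stage structure described above concrete, the sketch below shows one standard stage-2 combiner, Stouffer's Z, applied to per-gene p values from stage 1. This is an illustration assuming equal weights and independent genes (an idealization for real sequence data), not necessarily the combination rule evaluated in the paper.

```python
import numpy as np
from scipy import stats

def two_stage_pathway_p(gene_p_values):
    """Stage 2 of a 2-stage pathway analysis: convert each per-gene p value
    to a one-sided Z score, average with equal weights, and convert the
    resulting Z back to a single pathway p value (Stouffer's method;
    assumes the gene-based tests are independent)."""
    z = stats.norm.isf(np.asarray(gene_p_values))  # p -> one-sided Z
    return float(stats.norm.sf(z.sum() / np.sqrt(len(z))))

# Four gene-based p values from stage 1 collapse to one pathway p value.
pathway_p = two_stage_pathway_p([0.03, 0.40, 0.08, 0.65])
```

Uniformly null inputs (all p = 0.5) combine to exactly 0.5, which is a quick sanity check on the transform.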
Genetic Analysis Workshop 18 provided a platform for developing and evaluating statistical methods to analyze whole-genome sequence data from a pedigree-based sample. In this article we present an overview of the data sets and the contributions that analyzed these data. The family data, donated by the Type 2 Diabetes Genetic Exploration by Next-Generation Sequencing in Ethnic Samples Consortium, included sequence-level genotypes based on sequencing and imputation, genome-wide association genotypes from prior genotyping arrays, and phenotypes from longitudinal assessments. The contributions from individual research groups were extensively discussed before, during, and after the workshop in theme-based discussion groups before being submitted for publication.
The wave of next-generation sequencing data has arrived. However, many questions still remain about how to best analyze sequence data, particularly the contribution of rare genetic variants to human disease. Numerous statistical methods have been proposed to aggregate association signals across multiple rare variant sites in an effort to increase statistical power; however, the precise relation between the tests is often not well understood. We present a geometric representation for rare variant data in which rare allele counts in case and control samples are treated as vectors in Euclidean space. The geometric framework facilitates a rigorous classification of existing rare variant tests into two broad categories: tests for a difference in the lengths of the case and control vectors, and joint tests for a difference in either the lengths or angles of the two vectors. We demonstrate that genetic architecture of a trait, including the number and frequency of risk alleles, directly relates to the behavior of the length and joint tests. Hence, the geometric framework allows prediction of which tests will perform best under different disease models. Furthermore, the structure of the geometric framework immediately suggests additional classes and types of rare variant tests. We consider two general classes of tests which show robustness to non-causal and protective variants. The geometric framework introduces a novel and unique method to assess current rare variant methodology and provides guidelines for both applied and theoretical researchers.
rare variants; sequencing; burden tests
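A minimal sketch of the geometric quantities underlying the framework in the abstract above: the lengths of the per-variant rare allele count vectors in cases and controls, and the angle between them. Length tests compare the norms; joint tests are also sensitive to the angle. The vectorization and toy counts here are illustrative.

```python
import numpy as np

def length_and_angle(case_counts, control_counts):
    """Treat per-variant rare allele counts in cases and controls as
    vectors in Euclidean space. Return the two vector lengths and the
    angle (radians) between them."""
    u = np.asarray(case_counts, dtype=float)
    v = np.asarray(control_counts, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return (float(np.linalg.norm(u)),
            float(np.linalg.norm(v)),
            float(np.arccos(np.clip(cos, -1.0, 1.0))))

# Cases carry an excess of rare alleles at the first two variant sites,
# so the case vector is longer than the control vector.
case_len, ctrl_len, angle = length_and_angle([4, 3, 0, 1], [1, 1, 1, 1])
```

Under a burden-style disease model the case vector simply grows longer; non-causal or protective variants instead rotate the vectors apart, which is what joint tests pick up.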
The new class of rare variant tests has usually been evaluated assuming perfect genotype information. In reality, rare variant genotypes may be incorrect, and so rare variant tests should be robust to imperfect data. Errors and uncertainty in SNP genotyping are already known to dramatically impact statistical power for single marker tests on common variants and, in some cases, inflate the type I error rate. Recent results show that uncertainty in genotype calls derived from sequencing reads is dependent on several factors, including read depth, calling algorithm, number of alleles present in the sample, and the frequency at which an allele segregates in the population. We have recently proposed a general framework for the evaluation and investigation of rare variant tests of association, classifying most rare variant tests into one of two broad categories (length or joint tests). We use this framework to relate factors affecting genotype uncertainty to the power and type I error rate of rare variant tests. We find that non-differential genotype errors (an error process that occurs independent of phenotype) decrease power, with larger decreases for extremely rare variants and for the common homozygote to heterozygote error. Differential genotype errors (an error process that is associated with phenotype status) lead to inflated type I error rates, which are more likely to occur at sites with more common homozygote to heterozygote errors than vice versa. Finally, our work suggests that certain rare variant tests and study designs may be more robust to the inclusion of genotype errors. Further work is needed to directly integrate genotype calling algorithm decisions, study costs and test statistic choices to provide comprehensive design and analysis advice which appropriately accounts for the impact of genotype errors.
SKAT; gene-based; genotype uncertainty; misclassification; dosage
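The two error processes contrasted in the abstract above can be mimicked in simulation. The sketch below (illustrative names; common-homozygote-to-heterozygote flips only) injects errors either independently of phenotype (non-differential) or only into cases (differential), which is the setting that inflates type I error.

```python
import numpy as np

def inject_errors(genotypes, phenotypes, err_rate, differential=False, seed=0):
    """Flip common-homozygote (0) calls to heterozygote (1) at err_rate.
    Non-differential: errors hit cases and controls alike.
    Differential: errors occur only in cases (phenotype == 1), mimicking
    an error process associated with phenotype status."""
    rng = np.random.default_rng(seed)
    g = np.array(genotypes).copy()
    eligible = (g == 0)
    if differential:
        eligible &= (np.asarray(phenotypes) == 1)
    flips = eligible & (rng.random(g.shape) < err_rate)
    g[flips] = 1
    return g

# 500 cases followed by 500 controls, all truly common homozygotes.
geno = np.zeros(1000, dtype=int)
pheno = np.r_[np.ones(500, dtype=int), np.zeros(500, dtype=int)]
nondiff = inject_errors(geno, pheno, err_rate=0.05)
diff = inject_errors(geno, pheno, err_rate=0.05, differential=True)
```

Under the differential process the spurious heterozygotes appear only among cases, creating an artificial case-control difference at a truly null site.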
The use of haplotypes to impute the genotypes of unmeasured single nucleotide variants continues to rise in popularity. Simulation results suggest that the use of the dosage as a one-dimensional summary statistic of imputation posterior probabilities may be optimal both in terms of statistical power and computational efficiency; however, little theoretical understanding is available to explain and unify these simulation results. In our analysis, we provide a theoretical foundation for the use of the dosage as a one-dimensional summary statistic of genotype posterior probabilities from any technology.
We analytically evaluate the dosage, the mode, and the more general set of all one-dimensional summary statistics of the genotype posterior probability vector (three posterior probabilities constrained to sum to 1, and hence two-dimensional).
We prove that the dosage is an optimal one-dimensional summary statistic under a typical linear disease model and is robust to violations of this model. Simulation results confirm our theoretical findings.
Our analysis provides a strong theoretical basis for the use of the dosage as a one-dimensional summary statistic of genotype posterior probability vectors in related tests of genetic association across a wide variety of genetic disease models.
Imputation; dosage; genome-wide association studies
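The dosage discussed above is simply the expected minor-allele count under the posterior probability vector, in contrast to the mode (a hard genotype call), which discards the uncertainty. A minimal sketch, with hypothetical function names:

```python
def dosage(p_vec):
    """Dosage = expected minor-allele count E[G] under the posterior
    probability vector (P(G=0), P(G=1), P(G=2)), which must sum to 1."""
    p0, p1, p2 = p_vec
    assert abs(p0 + p1 + p2 - 1.0) < 1e-8
    return p1 + 2.0 * p2

def mode_genotype(p_vec):
    """Hard-call alternative: the most probable genotype (0, 1, or 2)."""
    return max(range(3), key=lambda g: p_vec[g])

# A fairly uncertain imputed call: the mode collapses it to 1 copy,
# while the dosage retains the uncertainty as a fractional count.
probs = (0.10, 0.55, 0.35)
dosage(probs)         # ≈ 1.25
mode_genotype(probs)  # 1
```

Regressing a phenotype on this fractional dosage, rather than on the hard call, is the one-dimensional summary whose optimality the abstract addresses.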
Gene-based tests of association are frequently applied to common SNPs (MAF > 5%) as an alternative to single-marker tests. In this analysis, we conduct a variety of simulation studies applied to five popular gene-based tests, investigating general trends in their performance in realistic situations. In particular, we focus on the impact of non-causal SNPs and a variety of LD structures on the behavior of these tests. We find that non-causal SNPs can significantly reduce the power of all gene-based tests. On average, the “noise” from 6–12 non-causal SNPs will cancel out the “signal” of one causal SNP across the five popular gene-based tests. Furthermore, we find complex and differing behavior of the methods in the presence of LD within and between non-causal and causal SNPs. Ultimately, better approaches for a priori prioritization of potentially causal SNPs (e.g., predicting the functionality of non-synonymous SNPs), application of these methods to sequenced or fully imputed datasets, and limited use of window-based methods for assigning intergenic SNPs to genes will improve power. However, significant power loss from non-causal SNPs may remain unless alternative statistical approaches robust to the inclusion of non-causal SNPs are developed.
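The dilution effect described above can be motivated with a back-of-the-envelope calculation. This is not the paper's simulation model (which includes LD and five specific tests); it is an idealized sketch assuming independent SNPs of equal variance combined by an unweighted sum (burden-style) statistic, where the non-centrality of the resulting 1-df test shrinks by the factor k/(k+m) as m null SNPs are added to k causal ones.

```python
def burden_ncp_ratio(k_causal, m_noncausal):
    """Idealized non-centrality ratio for an unweighted sum test,
    assuming independent SNPs with equal variance. The sum's
    standardized effect is k*beta/sqrt(k+m), so the chi-square
    non-centrality falls by k/(k+m) relative to testing the
    k causal SNPs alone (hypothetical simplification, no LD)."""
    return k_causal / (k_causal + m_noncausal)

burden_ncp_ratio(1, 0)   # 1.0: no dilution
burden_ncp_ratio(1, 6)   # ~0.14: six null SNPs swamp most of the signal
burden_ncp_ratio(1, 12)  # ~0.08: twelve null SNPs, signal nearly gone
```

Under this simplification, a handful of null SNPs already removes most of one causal SNP's contribution, consistent with the 6–12 figure reported above.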
Genotyping errors are well known to impact the power and type I error rate of single marker tests of association. Genotyping errors that occur according to the same process in cases and controls are known as non-differential genotyping errors, whereas errors that occur according to different processes in cases and controls are known as differential genotyping errors. For single marker tests, non-differential genotyping errors reduce power, while differential genotyping errors increase the type I error rate. However, little is known about the behavior of the new generation of rare variant tests of association in the presence of genotyping errors. In this manuscript, we use a comprehensive simulation study to explore the effects of numerous factors on the type I error rate of rare variant tests of association in the presence of differential genotyping error. We find that increased sample size, decreased minor allele frequency, and an increased number of single nucleotide variants (SNVs) included in the test all inflate the type I error rate in the presence of differential genotyping errors. We also find that the greater the relative difference in case-control genotyping error rates, the larger the type I error rate. Lastly, as is the case for single marker tests, errors that misclassify the common homozygote as the heterozygote inflate the type I error rate significantly more than errors that misclassify the heterozygote as the common homozygote. In general, our findings are in line with results from single marker tests. To prevent type I error inflation when analyzing next-generation sequencing data, careful consideration of study design (e.g., use of randomization), caution in meta-analyses and in the use of publicly available controls, and the application of standard quality control metrics are critical.
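Why differential errors inflate type I error can be seen in expectation with a short calculation. The sketch below is an illustration under simplifying assumptions (HWE, only common homozygote to heterozygote errors, each such error adding one spurious minor allele), not the paper's simulation design; function names are hypothetical.

```python
def expected_mean_dosage(maf, e_hom_to_het):
    """Expected minor-allele dosage per person under HWE when common
    homozygotes (frequency (1 - maf)^2) are miscalled as heterozygotes
    at rate e, each error adding one spurious minor allele."""
    q = 1.0 - maf
    return 2.0 * maf + e_hom_to_het * q * q

def spurious_difference(maf, e_case, e_control):
    """Expected case-control dosage difference under the null hypothesis
    (identical true MAF in both groups). Nonzero only when the error
    process is differential, which is what drives type I error inflation."""
    return expected_mean_dosage(maf, e_case) - expected_mean_dosage(maf, e_control)

spurious_difference(0.01, 0.010, 0.001)  # > 0: a spurious "association"
spurious_difference(0.01, 0.005, 0.005)  # 0.0: non-differential, no bias
```

Because the spurious difference scales with (1 - maf)^2, it is essentially constant as MAF shrinks while the true signal vanishes, consistent with the finding that lower minor allele frequencies worsen the inflation.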
We aim to quantify the effect of non-differential genotyping errors on the power of rare variant tests and identify the situations in which genotyping errors are most harmful.
We simulated genotype and phenotype data for a range of sample sizes, minor allele frequencies, disease relative risks and numbers of rare variants. Genotype errors were then simulated using five different error models covering a wide range of error rates.
Even at very low error rates, misclassifying a common homozygote as a heterozygote translates into a substantial loss of power, a result that is exacerbated as the minor allele frequency decreases. While the power loss from heterozygote to common homozygote errors tends to be smaller for a given error rate, in practice heterozygote to common homozygote errors are more frequent and will thus have a measurable impact on power.
Error rates from genotype-calling technology for next-generation sequencing data suggest that substantial power loss may be seen when applying current rare variant tests of association to called genotypes.
Sequencing data; Power; Case-control; Misclassification
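The asymmetry between the two error directions, and its interaction with minor allele frequency, can be illustrated with an expected-dosage calculation. This is a simplified proxy for the power simulations above (HWE, per-genotype error rates, bias measured relative to the true mean dosage 2*maf), with hypothetical function names.

```python
def relative_dosage_bias(maf, e, direction):
    """Relative bias in mean minor-allele dosage under HWE for the two
    error directions studied above, at per-genotype error rate e."""
    q = 1.0 - maf
    if direction == "hom_to_het":    # AA miscalled as Aa: +1 allele each
        bias = e * q * q
    elif direction == "het_to_hom":  # Aa miscalled as AA: -1 allele each
        bias = -e * 2.0 * maf * q
    else:
        raise ValueError(direction)
    return bias / (2.0 * maf)

# The same 0.1% error rate distorts rarer variants far more severely
# in the common homozygote -> heterozygote direction:
relative_dosage_bias(0.001, 0.001, "hom_to_het")  # ~0.50: ~50% inflation
relative_dosage_bias(0.050, 0.001, "hom_to_het")  # ~0.009
relative_dosage_bias(0.001, 0.001, "het_to_hom")  # ~-0.001: bounded by -e
```

In this simplified view, the relative distortion from hom-to-het errors grows without bound as MAF decreases, while het-to-hom errors stay bounded by the error rate itself, mirroring the MAF dependence reported in the results.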