In the increasing number of sequencing studies aimed at identifying rare variants associated with complex traits, the power of the test can be improved by guided sampling procedures. We confirm both analytically and numerically that sampling individuals with extreme phenotypes can enrich the presence of causal rare variants and can therefore lead to an increase in power compared to random sampling. While application of traditional rare variant association tests to these extreme phenotype samples requires dichotomizing the continuous phenotypes before analysis, the dichotomization procedure can decrease the power by reducing the information in the phenotypes. To avoid this, we propose a novel statistical method based on optimal SKAT (SKAT-O) that allows us to test for rare variant effects using continuous phenotypes in the analysis of extreme phenotype samples. The increase in power of this method is demonstrated through simulation of a wide range of scenarios as well as in the triglyceride data of the Dallas Heart Study.
Complex trait associations; Selective sampling; Rare genetic variants; Extreme phenotype sampling
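To make the sampling argument above concrete, the following is a minimal sketch (not the authors' SKAT-O-based method) of extreme-phenotype versus random sampling followed by a simple burden-style regression on the continuous phenotype. All simulation parameters (sample size, number of variants, effect sizes, the 10% sampling fraction) are illustrative assumptions, and the sketch ignores the ascertainment correction that the proposed method provides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical simulation: 10,000 individuals, 20 rare variants, 5 causal.
n, n_var, maf = 10_000, 20, 0.005
G = rng.binomial(2, maf, size=(n, n_var))        # rare-variant genotypes
beta = np.zeros(n_var); beta[:5] = 0.8           # 5 causal variants
y = G @ beta + rng.normal(size=n)                # continuous phenotype

def burden_test(G_sub, y_sub):
    """Regress the continuous phenotype on the per-individual count of
    rare alleles (a simple burden score); return the Wald p-value."""
    burden = G_sub.sum(axis=1)
    slope, _, _, p, _ = stats.linregress(burden, y_sub)
    return p

# Extreme-phenotype sample: top and bottom 10% of the phenotype distribution.
k = int(0.10 * n)
order = np.argsort(y)
extreme = np.concatenate([order[:k], order[-k:]])
random_sample = rng.choice(n, size=2 * k, replace=False)

print("extreme-sample p:", burden_test(G[extreme], y[extreme]))
print("random-sample  p:", burden_test(G[random_sample], y[random_sample]))
```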
Recent advances in next generation sequencing technologies make it affordable to search systematically for rare and functional variants for common complex diseases. We investigated strategies for enriching rare variants in the samples selected for sequencing so as to optimize the power for their discovery. In particular, we investigated the roles of alternative sources of enrichment in families through computer simulations. We showed that linkage information, extreme phenotypes, and non-random ascertainment, such as multiply affected families, constitute different sources for enriching rare and functional variants in a sequencing study design. Linkage is well known to have limited power for detecting small genetic effects, and hence is not considered to be a powerful tool for discovering variants for common complex diseases. However, families with some degree of family-specific linkage evidence offer an effective sampling strategy: sub-selecting the most linkage-informative families for sequencing. Compared with selecting subjects with extreme phenotypes, linkage evidence performs better with larger families, while the extreme-phenotype method is more efficient with smaller families. Families with multiple affected siblings were found to provide the largest enrichment of rare variants. Finally, we showed that combined strategies, such as selecting linkage-informative families from among multiply affected families, provide an optimal strategy with much higher enrichment of rare functional variants than either strategy alone.
Next generation sequencing; rare variants; enrichment; study design; complex diseases; linkage
Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient re-use of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute (NHGRI)-supported consortium composed of five sites that perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms for a core set of fourteen phenotypes, used to extract study samples from its DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or the Center for Inherited Disease Research (CIDR) using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample quality, marker quality, and various batch effects. Upon completion of the genotyping and QC analyses for each site's primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset re-entered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion of the eMERGE QC pipeline for merged datasets. These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II, and will also serve as a starting point for investigators merging multiple genotype datasets accessible through the National Center for Biotechnology Information (NCBI) in the database of Genotypes and Phenotypes (dbGaP). Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
quality control; genome-wide association (GWAS); eMERGE; dbGaP; merging datasets
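As an illustration of the kind of marker- and sample-level checks such a QC pipeline performs, here is a minimal sketch of call-rate, minor allele frequency, and Hardy-Weinberg filters. The thresholds are placeholders, not the eMERGE pipeline's actual cutoffs, and the real pipeline additionally examines batch effects, sex checks, relatedness, and population structure.

```python
import numpy as np
from scipy.stats import chi2

def hwe_pvalue(genotypes):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium for one SNP,
    given genotypes coded 0/1/2 (missing values already removed)."""
    n = len(genotypes)
    counts = np.array([(genotypes == g).sum() for g in (0, 1, 2)])
    p = (2 * counts[2] + counts[1]) / (2 * n)          # allele frequency
    expected = n * np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
    stat = ((counts - expected) ** 2 / np.maximum(expected, 1e-12)).sum()
    return chi2.sf(stat, df=1)

def qc_filter(G, snp_call=0.98, sample_call=0.98, maf_min=0.01, hwe_min=1e-6):
    """G: samples x SNPs matrix coded 0/1/2 with np.nan for missing calls.
    Returns boolean masks of samples and SNPs that pass the filters."""
    sample_ok = np.mean(~np.isnan(G), axis=1) >= sample_call
    G = G[sample_ok]
    snp_ok = np.mean(~np.isnan(G), axis=0) >= snp_call
    for j in np.where(snp_ok)[0]:
        g = G[:, j][~np.isnan(G[:, j])]
        freq = g.mean() / 2
        maf = min(freq, 1 - freq)
        if maf < maf_min or hwe_pvalue(g) < hwe_min:
            snp_ok[j] = False
    return sample_ok, snp_ok
```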
We present a novel method, IBDLD, for estimating the probability of identity by descent (IBD) for a pair of related individuals at a locus, given dense genotype data and a pedigree of arbitrary size and complexity. IBDLD overcomes the challenges of exact multipoint estimation of IBD in pedigrees of potentially large size and eliminates the difficulty of accommodating the background linkage disequilibrium (LD) that is present in high-density genotype data. We show that IBDLD is much more accurate at estimating the true IBD sharing than methods that remove LD by pruning SNPs and is highly robust to pedigree errors or other forms of misspecified relationships. The method is fast and can be used to estimate the probability for each possible IBD sharing state at every SNP from a high-density genotyping array for hundreds of thousands of pairs of individuals. We use it to estimate point-wise and genome-wide IBD sharing between 185,745 pairs of subjects all of whom are related through a single, large and complex 13-generation pedigree and genotyped with the Affymetrix 500K chip. We find that we are able to identify the true pedigree relationship for individuals who were misidentified in the collected data and estimate empirical kinship coefficients that can be used in follow-up QTL mapping studies. IBDLD is implemented as an open source software package and is freely available.
linkage disequilibrium; IBD; pedigrees; Hidden Markov Models; SNP; relatedness
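The core of such a method is a hidden Markov model over the IBD sharing states (0, 1, or 2 alleles shared IBD) along the chromosome. The sketch below implements generic forward-backward posterior decoding for a three-state HMM; in IBDLD the emission probabilities additionally model background LD and the transition and prior probabilities come from the pedigree, whereas here they are treated as given inputs.

```python
import numpy as np

def ibd_posteriors(emit, trans, prior):
    """Forward-backward posterior P(IBD state | data) at each SNP.
    emit:  (n_snps, 3) P(observed genotype pair | IBD state 0/1/2)
    trans: (3, 3) state transition matrix between adjacent SNPs
    prior: (3,) initial state distribution
    """
    n, k = emit.shape
    fwd = np.zeros((n, k)); bwd = np.zeros((n, k))
    fwd[0] = prior * emit[0]; fwd[0] /= fwd[0].sum()
    for t in range(1, n):
        fwd[t] = (fwd[t - 1] @ trans) * emit[t]
        fwd[t] /= fwd[t].sum()                 # rescale to avoid underflow
    bwd[-1] = 1.0
    for t in range(n - 2, -1, -1):
        bwd[t] = trans @ (emit[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()                 # rescaling cancels in the posterior
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)
```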
The genetic etiology of complex human diseases has been commonly viewed as a process that involves multiple genetic variants and environmental factors, as well as their interactions. Statistical approaches, such as the multifactor dimensionality reduction (MDR) and generalized MDR (GMDR), have recently been proposed to test the joint association of multiple genetic variants with either dichotomous or continuous traits. In this paper, we propose a novel Forward U-Test to evaluate the combined effect of multiple loci on quantitative traits with consideration of gene-gene/gene-environment interactions. In this new approach, a U-Statistic-based forward algorithm is first used to select potential disease-susceptibility loci and then a weighted U statistic is used to test the joint association of the selected loci with the disease. Through a simulation study, we found that the Forward U-Test outperformed GMDR in terms of greater power. In addition, our approach is less computationally intensive, making it feasible for high-dimensional gene-gene/gene-environment research. We illustrate our method with a real data application to Nicotine Dependence (ND), using three independent datasets from the Study of Addiction: Genetics and Environment. Our gene-gene interaction analysis of 155 SNPs in 67 candidate genes identified two SNPs, rs16969968 within gene CHRNA5 and rs1122530 within gene NTRK2, jointly associated with the level of ND (p-value = 5.31e-7). The association, which involves essential interaction, is replicated in two independent datasets with p-values of 1.08e-5 and 0.02, respectively. Our finding suggests that joint action may exist between the two gene products.
gene-gene interaction; Forward U-Test; Nicotine Dependence
As a result of the availability of very large numbers of single nucleotide polymorphisms, there has been increasing interest in genetic associations involving several closely linked loci. Methods for detection of association between traits and multiple genetic polymorphisms are being rapidly developed, including Hotelling's T2 test and the LD contrast (LDC) tests. Hotelling's T2 test can be considered a test that compares the means of the genotypic score in cases and controls, while the LDC tests can be considered tests that compare the variance-covariance matrices of the genotypic score in cases and controls. In this article, we propose a likelihood ratio test which simultaneously compares the means and the variance-covariance matrices of the genotypic score in cases and controls. We use simulation studies to evaluate the type I error rate of the proposed test, and compare the power of the test with Hotelling's T2 test and the LDC tests. The simulation results show that when marginal effects of the disease loci are strong, the proposed test is more powerful than the LDC tests and similar to or slightly less powerful than Hotelling's T2 test. If there are interaction effects and weak or no marginal effects, the proposed method is more powerful than Hotelling's T2 test and slightly more powerful than the LDC tests.
likelihood ratio test; principal component analysis; association study; complex disease
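A rough sketch of the likelihood ratio idea follows: fit multivariate Gaussians to the genotypic scores of cases and controls separately versus jointly, and compare twice the log-likelihood difference to a chi-square. This is only the generic Gaussian LRT for equal means and covariance matrices, not the authors' exact statistic (which involves principal components and may use a different null distribution).

```python
import numpy as np
from scipy.stats import chi2

def gaussian_loglik(X):
    """Maximized multivariate-Gaussian log-likelihood of the rows of X."""
    n, p = X.shape
    S = np.cov(X, rowvar=False, bias=True) + 1e-8 * np.eye(p)  # MLE covariance
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * n * (p * np.log(2 * np.pi) + logdet + p)

def lrt_mean_cov(X_cases, X_controls):
    """LRT of whether cases and controls share the same mean vector and
    covariance matrix of the genotypic score."""
    p = X_cases.shape[1]
    ll_alt = gaussian_loglik(X_cases) + gaussian_loglik(X_controls)
    ll_null = gaussian_loglik(np.vstack([X_cases, X_controls]))
    stat = 2 * (ll_alt - ll_null)
    df = p + p * (p + 1) // 2          # extra mean and covariance parameters
    return stat, chi2.sf(stat, df)
```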
Complex diseases are presumed to be the result of interactions among several genes and environmental factors, with each gene having only a small effect on the disease. Thus, methods that can account for gene-gene interactions when searching for a set of marker loci in different genes or across the genome, and that analyze these loci jointly, are critical. In this article, we propose an ensemble learning approach (ELA) to detect a set of loci whose main and interaction effects jointly have a significant association with the trait. In the ELA, we first search for “base learners” and then combine the effects of the base learners by a linear model. Each base learner represents a main effect or an interaction effect. The result of the ELA is easy to interpret. When the ELA is applied to analyze a data set, we get a final model, an overall P-value of the association test between the set of loci involved in the final model and the trait, and an importance measure for each base learner and each marker involved in the final model. The final model is a linear combination of some base learners. We know which base learner represents a main effect and which one represents an interaction effect. The importance measure of each base learner or marker tells us the relative importance of that base learner or marker in the final model. We used intensive simulation studies as well as a real data set to evaluate the performance of the ELA. Our simulation studies demonstrated that the ELA is more powerful than the single-marker test in all the simulation scenarios. The ELA also outperformed the other three existing multi-locus methods in almost all cases. In an application to a large-scale case-control study of Type 2 diabetes, the ELA identified 11 single nucleotide polymorphisms that have a significant multi-locus effect (P-value = 0.01), while none of the single nucleotide polymorphisms showed significant marginal effects and none of the two-locus combinations showed significant two-locus interaction effects.
epistasis; association study; complex disease; Type 2 diabetes
We investigate testing gene-disease outcome associations in situations where the genetic relationship potentially varies among subjects with differing environmental or clinical attributes. We propose a strategy that modestly increases multiple testing by evaluating weighted test statistics that focus (or enrich) association tests within subgroups, and that uses a Monte-Carlo method, based on simulating from the approximate large-sample distribution of the statistics, to control Type I error. We also introduce a stage-wise calculated test statistic which allows more complex weighting on multiple environmental variables. Results from simulation studies confirm the improved power of the proposed approaches compared with marginal testing in many situations.
score tests; association tests; data adaptive; gene-environment interactions
Sub-Saharan Africa has been identified as the part of the world with the greatest human genetic diversity. This high level of diversity causes difficulties for genome-wide association (GWA) studies in African populations—for example, by reducing the accuracy of genotype imputation in African populations compared to non-African populations. Here, we investigate haplotype variation and imputation in Africa, using 253 unrelated individuals from 15 Sub-Saharan African populations. We identify the populations that provide the greatest potential for serving as reference panels for imputing genotypes in the remaining groups. Considering reference panels comprising samples of recent African descent in Phase 3 of the HapMap Project, we identify mixtures of reference groups that produce the maximal imputation accuracy in each of the sampled populations. We find that optimal HapMap mixtures and maximal imputation accuracies identified in detailed tests of imputation procedures can instead be predicted by using simple summary statistics that measure relationships between the pattern of genetic variation in a target population and the patterns in potential reference panels. Our results provide an empirical basis for facilitating the selection of reference panels in GWA studies of diverse human populations, especially those of African ancestry.
haplotype variation; imputation; linkage disequilibrium
Genetic imputation has become standard practice in modern genetic studies. However, several important issues have not been adequately addressed, including the utility of study-specific reference panels, performance in admixed populations, and quality for less common (minor allele frequency [MAF] 0.005–0.05) and rare (MAF < 0.005) variants. These issues only recently became addressable with genome-wide association study (GWAS) follow-up studies using dense genotyping or sequencing in large samples of non-European individuals. In this work, we constructed a study-specific reference panel of 3,924 haplotypes using African Americans in the Women's Health Initiative (WHI) genotyped on both the Metabochip and the Affymetrix 6.0 GWAS platform. We used this reference panel to impute into 6,459 WHI SNP Health Association Resource (SHARe) study subjects with only GWAS genotypes. Our analysis confirmed the imputation quality metric Rsq (estimated r2, specific to each SNP) as an effective post-imputation filter. We recommend different Rsq thresholds for different MAF categories such that the average (across SNPs) Rsq is above the desired dosage r2 (squared Pearson correlation between imputed and experimental genotypes). With a desired dosage r2 of 80%, 99.9% (97.5%, 83.6%, 52.0%, 20.5%) of SNPs with MAF > 0.05 (0.03–0.05, 0.01–0.03, 0.005–0.01, and 0.001–0.005) passed the post-imputation filter. The average dosage r2 for these SNPs is 94.7%, 92.1%, 89.0%, 83.1%, and 79.7%, respectively. These results suggest that for African Americans imputation of Metabochip SNPs from GWAS data, including low-frequency SNPs with MAF 0.005–0.05, is feasible and worthwhile for increasing power in downstream association analysis, provided a sizable reference panel is available.
genotype imputation; Metabochip; internal reference; African Americans; rare variants
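The MAF-stratified post-imputation filtering described above can be applied as in the sketch below; the column names and the cutoff values are illustrative placeholders rather than the paper's recommended thresholds.

```python
import pandas as pd

# Illustrative MAF-stratified Rsq cutoffs (placeholders, not the paper's values):
# each tuple is (maf_lower, maf_upper, minimum Rsq required to keep the SNP).
RSQ_CUTOFFS = [
    (0.05, 0.50, 0.30),
    (0.03, 0.05, 0.50),
    (0.01, 0.03, 0.60),
    (0.005, 0.01, 0.80),
    (0.001, 0.005, 0.90),
]

def filter_imputed(snps: pd.DataFrame) -> pd.DataFrame:
    """snps: DataFrame with columns 'maf' and 'rsq' (e.g. parsed from an
    imputation info file). Keeps SNPs whose Rsq meets the cutoff for their
    MAF stratum; SNPs below the lowest MAF bound are dropped."""
    keep = pd.Series(False, index=snps.index)
    for lo, hi, min_rsq in RSQ_CUTOFFS:
        in_bin = (snps["maf"] > lo) & (snps["maf"] <= hi)
        keep |= in_bin & (snps["rsq"] >= min_rsq)
    return snps[keep]
```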
Recent studies suggest that rare variants play an important role in the etiology of many traits. Although a number of methods have been developed for genetic association analysis of rare variants, they all assume a relatively homogeneous population under study. Such an assumption may not be valid for samples collected from admixed populations such as African Americans and Hispanic Americans as there is a great extent of local variation in ancestry in these populations. To ensure valid and more powerful rare variant association tests performed in admixed populations, we have developed a local ancestry-based weighted dosage test, which is able to take into account local ancestry of rare alleles, uncertainties in rare variant imputation when imputed data are included, and the direction of effect that rare variants exert on phenotypic outcome. We used simulated sequence data to show that our proposed test has controlled type I error rates, whereas naïve application of existing rare variants tests and tests that adjust for global ancestry lead to inflated type I error rates. We showed that our test has higher power than tests without proper adjustment of ancestry. We also applied the proposed method to a candidate gene study on low-density lipoprotein cholesterol. Our results suggest that it is important to appropriately control for potential population stratification induced by local ancestry difference in the analysis of rare variants in admixed populations.
admixed population; rare variants; population stratification
Imputation in admixed populations is an important but challenging problem due to the complex linkage disequilibrium (LD) patterns. The emergence of large reference panels such as that from the 1000 Genomes Project enables more accurate imputation in general, and in particular for admixed populations and for uncommon variants. To efficiently benefit from these large reference panels, one key issue to consider in a modern genotype imputation framework is the selection of effective reference panels. In this work, we consider a number of methods for effective reference panel construction inside a hidden Markov model and specific to each target individual. These methods fall into two categories: identity-by-state (IBS)-based and ancestry-weighted approaches. We evaluated the performance on individuals from recently admixed populations. Our target samples include 8,421 African Americans and 3,587 Hispanic Americans from the Women's Health Initiative, which allow assessment of imputation quality for uncommon variants. Our experiments include both large and small reference panels; large, medium, and small target samples; and genome regions of varying levels of LD. We also include BEAGLE and IMPUTE2 for comparison. Experimental results with the large reference panel suggest that our novel piecewise IBS method yields consistently higher imputation quality than other methods/software. The advantage is particularly noteworthy among uncommon variants, where we observe up to 5.1% information gain, with the difference being highly significant (Wilcoxon signed rank test P-value < 0.0001). Our work is the first that considers various sensible approaches for imputation in admixed populations and presents a comprehensive comparison.
genotype imputation; admixed populations; large reference panel; uncommon variants; MaCH-Admix
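A sketch of the identity-by-state idea behind target-specific reference construction: score each reference haplotype by its allele agreement with the target and keep the closest ones. MaCH-Admix's piecewise IBS method is more elaborate (it scores within windows along the chromosome); the function below only illustrates the selection criterion, and the array layout and cutoff are assumptions.

```python
import numpy as np

def select_reference_by_ibs(target_hap, ref_haps, n_keep=200):
    """target_hap: (n_snps,) 0/1 allele vector for one target haplotype.
    ref_haps:   (n_ref, n_snps) reference haplotype matrix.
    Returns indices of the n_keep reference haplotypes with the highest
    identity-by-state (fraction of matching alleles) with the target."""
    ibs = (ref_haps == target_hap).mean(axis=1)
    return np.argsort(ibs)[::-1][:n_keep]

# A piecewise variant would apply the same scoring within sliding windows,
# letting the chosen reference haplotypes change along the chromosome.
```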
Over the past several years, genome-wide association studies (GWAS) have succeeded in identifying hundreds of genetic markers associated with common diseases. However, most of these markers confer relatively small increments of risk and explain only a small proportion of familial clustering. To identify obstacles to future progress in genetic epidemiology research and provide recommendations to NIH for overcoming these barriers, the National Cancer Institute sponsored a workshop entitled “Next Generation Analytic Tools for Large-Scale Genetic Epidemiology Studies of Complex Diseases” on September 15–16, 2010. The goal of the workshop was to facilitate discussions on (1) statistical strategies and methods to efficiently identify genetic and environmental factors contributing to the risk of complex disease; and (2) how to develop, apply, and evaluate these strategies for the design, analysis, and interpretation of large-scale complex disease association studies in order to guide NIH in setting the future agenda in this area of research. The workshop was organized as a series of short presentations covering scientific (gene-gene and gene-environment interaction, complex phenotypes, and rare variants and next generation sequencing) and methodological (simulation modeling and computational resources and data management) topic areas. Specific needs to advance the field were identified during each session and are summarized.
gene-gene interactions; gene-environment interactions; rare variants; next generation sequencing; complex phenotypes; simulations; computational resources
Gene-set analyses have been widely used in gene expression studies, and some of the developed methods have been extended to genome-wide association studies (GWAS). Yet, complications due to linkage disequilibrium (LD) among single nucleotide polymorphisms (SNPs), and variable numbers of SNPs per gene and genes per gene-set, have plagued current approaches, often leading to ad hoc “fixes”. To overcome some of the current limitations, we developed a general approach to scan GWAS SNP data for both gene-level and gene-set analyses, building on score statistics for generalized linear models, and taking advantage of the directed acyclic graph structure of the gene ontology when creating gene-sets. However, other types of gene-set structures can be used, such as the popular Kyoto Encyclopedia of Genes and Genomes (KEGG). Our approach combines SNPs into genes, and genes into gene-sets, but ensures that positive and negative effects of genes on a trait do not cancel. To control for multiple testing of many gene-sets, we use an efficient computational strategy that accounts for LD and provides accurate step-down adjusted p-values for each gene-set. Application of our methods to two different GWAS provides guidance on the potential strengths and weaknesses of our proposed gene-set analyses.
gene-sets; genome wide association; pathways; score statistics
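One generic way to obtain the non-cancellation property mentioned above is to combine squared per-SNP score statistics within a gene, so that protective and deleterious SNPs both add signal. The sketch below uses a Gaussian working model with an intercept-only null; it illustrates the idea rather than reproducing the authors' exact statistic or their step-down multiple-testing adjustment.

```python
import numpy as np

def snp_score_stats(G, y):
    """Per-SNP standardized score statistics for a trait y under an
    intercept-only null model (Gaussian working model).
    G: samples x SNPs genotype matrix, y: phenotype vector."""
    r = y - y.mean()                        # null-model residuals
    u = G.T @ r                             # score contribution of each SNP
    sigma2 = r @ r / len(y)                 # MLE of the residual variance
    v = sigma2 * ((G - G.mean(axis=0)) ** 2).sum(axis=0)   # score variances
    return u / np.sqrt(v)

def gene_statistic(z_scores):
    """Sum of squared per-SNP scores: effects in opposite directions add
    up rather than cancel (cf. the SSU / variance-component idea)."""
    return np.sum(np.asarray(z_scores) ** 2)
```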
Poly(ADP-ribose) polymerase-1 (PARP-1) catalyzes poly(ADP-ribosyl)ation of various proteins involved in many cellular processes, including DNA damage detection and repair, and cell proliferation and death. PARP-1 has been implicated in human carcinogenesis, but the association between the most-studied PARP-1 V762A polymorphism (rs1136410) and the risk of various cancers has been reported with inconclusive results.
To assess the association between the PARP-1 V762A polymorphism and cancer risk.
A meta-analysis of 21 studies with 12,027 cancer patients and 14,106 cancer-free controls was conducted to evaluate the strength of the association using odds ratio (OR) with 95% confidence interval (CI).
Overall, no significant association was found between the PARP-1 V762A polymorphism and cancer risk. In the stratified analyses, however, it was found that the variant A allele of the PARP-1 V762A polymorphism was associated with an increased risk of cancer among Asian populations (VA+AA vs. VV: OR = 1.11, 95% CI: 1.01-1.23; Pheterogeneity = 0.210) but a decreased risk of cancer among Caucasian populations (VA+AA vs. VV: OR = 0.89, 95% CI: 0.80-1.00; Pheterogeneity = 0.004), especially for glioma risk (OR = 0.79, 95% CI: 0.69-0.90; Pheterogeneity = 0.800).
This meta-analysis found evidence for an association of the PARP-1 V762A polymorphism with an increased risk of cancer among Asians but a decreased risk of cancer among Caucasians, particularly for glioma. Further well-designed studies with large sample sizes in different ethnic populations and different cancer types are warranted to confirm these findings.
DNA repair; Case-control study; Meta-analysis; Polymorphism; Susceptibility
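For reference, the fixed-effect inverse-variance pooling typically used in such meta-analyses can be sketched as below; the three study values in the example are made up for illustration and are not from the analyzed studies.

```python
import numpy as np
from scipy.stats import norm

def pooled_or(ors, ci_lowers, ci_uppers):
    """Fixed-effect inverse-variance meta-analysis on the log-OR scale.
    Standard errors are recovered from the 95% confidence limits."""
    log_or = np.log(ors)
    se = (np.log(ci_uppers) - np.log(ci_lowers)) / (2 * 1.96)
    w = 1.0 / se ** 2
    pooled = np.sum(w * log_or) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    z = pooled / pooled_se
    ci = np.exp([pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se])
    return np.exp(pooled), ci, 2 * norm.sf(abs(z))

# Hypothetical example with three made-up studies:
print(pooled_or(np.array([1.15, 0.92, 1.30]),
                np.array([0.95, 0.78, 1.02]),
                np.array([1.39, 1.09, 1.66])))
```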
Most disease association mapping algorithms are based on hypothesis testing procedures that test one variant at a time. Those methods lose power when the disease mutations are jointly tagged by multiple variants, or when gene-gene interactions exist. Nearby variants are also correlated, so procedures that ignore the dependence between variants will inevitably produce redundant results. With a large number of variants genotyped in current genome-wide disease association studies, simultaneous multi-variant association mapping algorithms are strongly desired. We present a novel Bayesian method for automatic detection of multi-variant joint association in genome-wide case-control studies. Our method has improved power and specificity over existing tools. We fit a joint probabilistic model to the entire data and identify disease variants simultaneously. The method dynamically accounts for the strong linkage disequilibrium (LD) between variants. As a result, only the primary disease variants will be identified, with all secondary associations due to LD effects filtered out. Our method better pinpoints the disease variants with improved resolution. The method is also computationally efficient for genome-wide studies. When applied to a real dataset of inflammatory bowel disease (IBD) containing 401,473 variants in 4,720 individuals, our method detected all previously reported IBD loci in the same data, and recovered two missed loci. We further detected two novel inter-chromosome interactions. The first is between STAT3 and PARD6G, and the second is between DLG5 and an intergenic region at 5p14. We further validated the two interactions in an independent study.
disease association mapping; Bayesian graph; linkage disequilibrium; Markov chain Monte Carlo
The ongoing controversy surrounding direct-to-consumer (DTC) personal genomic tests intensified last year when the U.S. Government Accountability Office (GAO) released results of an undercover investigation of four companies that offer such testing. Among their findings, they reported that some of their donors received DNA-based predictions that conflicted with their actual medical histories. We aimed to more rigorously evaluate the relationship between DTC genomic risk estimates and self-reported disease by leveraging data from the Scripps Genomic Health Initiative (SGHI). We prospectively collected self-reported personal and family health history data for 3,416 individuals who went on to purchase a commercially available DTC genomic test. For 5 out of 15 total conditions studied, we found that risk estimates from the test were significantly associated with self-reported family and/or personal health history. The 5 conditions included Graves' disease, Type 2 Diabetes, Lupus, Alzheimer's disease, and Restless Leg Syndrome. To further investigate these findings, we ranked each of the 15 conditions based on published heritability estimates and conducted post-hoc power analyses based on the number of individuals in our sample who reported significant histories of each condition. We found that high heritability, coupled with high prevalence in our sample and thus adequate statistical power, explained the pattern of associations observed. Our study represents one of the first evaluations of the relationship between risk estimates from a commercially available DTC personal genomic test and self-reported health histories in the consumers of that test.
direct-to-consumer; genetic testing; genetic risk estimates; clinical validity; consumer genomics
The advent of next-generation sequencing technologies has facilitated the detection of rare variants. Despite the significant cost reduction, sequencing cost is still high for large-scale studies. In this article, we examine DNA pooling as a cost-effective strategy for rare variant detection. We consider the optimal number of individuals in a DNA pool to detect an allele with a specific minor allele frequency (MAF) under a given coverage depth and detection threshold. We found that the optimal number of individuals in a pool does not depend on the MAF for a given coverage depth and detection threshold. In addition, when the individual contributions to each pool are equal, the total number of individuals across different pools required in an optimal design to detect a variant with a desired power is similar at different coverage depths. When the contributions are more variable, more individuals tend to be needed for higher coverage depths. Our study provides general guidelines on using DNA pooling for more cost-effective identification of rare variants.
optimal pooling designs; rare variant detection; next-generation sequencing
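The reasoning behind the pooled design can be sketched with a simple binomial model: a pool of m individuals contributes 2m chromosomes, reads covering the site carry the variant allele with probability k/(2m) when k chromosomes are carriers, and the variant is called "detected" if at least t of d reads show it. The parameter values in the example are illustrative, and the model assumes equal DNA contributions and no sequencing error.

```python
from scipy.stats import binom

def detect_prob(pool_size, depth, threshold, maf):
    """Probability that a pool of `pool_size` individuals, sequenced to
    `depth` reads, shows at least `threshold` variant reads, for a variant
    with population minor allele frequency `maf` (equal DNA contributions)."""
    n_chrom = 2 * pool_size
    prob = 0.0
    for k in range(1, n_chrom + 1):                       # k carrier chromosomes
        p_k = binom.pmf(k, n_chrom, maf)
        p_detect = binom.sf(threshold - 1, depth, k / n_chrom)
        prob += p_k * p_detect
    return prob

# e.g. probability of detecting a MAF 0.5% variant in a 25-person pool
# sequenced to 500x, requiring at least 3 variant reads (illustrative values):
print(detect_prob(pool_size=25, depth=500, threshold=3, maf=0.005))
```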
We describe implementation of a set-based method to assess the significance of findings from genome-wide association study data. Our method, implemented in PLINK, is based on a theoretical approximation of Fisher's statistic such that the combination of p-values at a gene or across a pathway is done in a manner that accounts for the correlation structure, or linkage disequilibrium, between SNPs. We compare our method to a permutation-based product of p-values approach and show a typical correlation in excess of 0.98 for a number of comparisons. The method gives Type I error rates that are less than or equal to the corresponding nominal significance levels, making it robust to the effects of false positives. We show that in broadly similar populations, reference datasets of markers are an appropriate substrate for deriving marker-marker LD, negating the need to access individual-level genotypes and greatly facilitating the method's generic applicability. We show that the method is thus robust to LD-associated bias and has equivalent performance to permutation-based methods, with a significantly shorter runtime. This is particularly relevant at a time of increasing public availability of significantly larger genetic datasets and should go a long way to assist in the rapid analysis of these datasets.
GWAS; set-based analysis; multiple dependent tests
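The idea of combining correlated p-values can be sketched as follows: Fisher's statistic T = -2 Σ log p has mean 2m under the null, but its variance is inflated when SNPs are in LD, so a Brown-style moment match rescales T to a chi-square with a reduced effective degrees of freedom. This is a generic illustration, not PLINK's implementation; the polynomial covariance approximation below is a commonly cited one for positively correlated tests and should be treated as an assumption.

```python
import numpy as np
from scipy.stats import chi2

def fisher_combined_ld(pvals, geno_corr):
    """Combine SNP p-values with Fisher's statistic, adjusting its null
    distribution for LD via Brown-style moment matching.
    pvals: (m,) per-SNP p-values; geno_corr: (m, m) SNP correlation matrix."""
    m = len(pvals)
    T = -2.0 * np.sum(np.log(pvals))
    mean_T = 2.0 * m
    # Approximate cov(-2 log p_i, -2 log p_j) from the genotype correlation
    # (polynomial approximation; treat the coefficients as illustrative).
    r = np.abs(geno_corr)
    cov = 3.25 * r + 0.75 * r ** 2
    np.fill_diagonal(cov, 4.0)               # var(-2 log p_i) = 4 under the null
    var_T = cov.sum()
    scale = var_T / (2.0 * mean_T)           # match the first two moments
    df = 2.0 * mean_T ** 2 / var_T           # effective degrees of freedom
    return chi2.sf(T / scale, df)
```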
In multi-locus association analysis, since some markers may not be associated with a trait, it seems attractive to use penalized regression with its capability for automatic variable selection. On the other hand, in spite of a rapidly growing body of literature on penalized regression, most of this work focuses on variable selection and outcome prediction, for which penalized methods are generally more effective than their non-penalized counterparts. However, for statistical inference, i.e. hypothesis testing and interval estimation, it is less clear how penalized methods perform, or even how best to apply them, largely due to a lack of studies on this topic. In our motivating data from a cohort of kidney transplant recipients, it is of primary interest to assess whether a group of genetic variants is associated with a binary clinical outcome, acute rejection at 6 months. In this paper, we study some technical issues and alternative implementations of hypothesis testing in Lasso-penalized logistic regression, and compare their performance with each other and with several existing global tests, some of which are specifically designed as variance component tests for high-dimensional data. The most interesting, and perhaps surprising, conclusion of this study is that, for low to moderately high-dimensional data, statistical tests based on Lasso-penalized regression are not necessarily more powerful than some existing global tests. In addition, in penalized regression, rather than building a test based on a single selected “best” model, combining multiple tests, each of which is built on a candidate model, might be more promising.
Lasso; Logistic kernel machine regression; Logistic regression; Random-effects model; Score test; Sum of squared score (SSU) test
Quantitative traits (QT) are an important focus of human genetic studies both because of interest in the traits themselves, and because of their role as risk factors for many human diseases. For large-scale QT association studies including genome-wide association studies (GWAS), investigators usually focus on genetic loci showing significant evidence for SNP-QT association, and genetic effect size tends to be overestimated as a consequence of the winner’s curse. In this paper, we study the impact of the winner’s curse on QT association studies in which the genetic effect size is parameterized as the slope in a linear regression model. We demonstrate by analytical calculation that the overestimation in the regression slope estimate decreases as power increases. To reduce the ascertainment bias, we propose a three-parameter maximum likelihood method and then simplify this to a one-parameter method by excluding nuisance parameters. We show that both methods reduce the bias when power to detect association is low or moderate, and that the one-parameter model generally results in smaller variance in the estimate.
quantitative trait; winner’s curse; ascertainment bias; genome-wide association study; linear regression; maximum likelihood
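A sketch of the one-parameter conditional-likelihood idea: treat the reported slope as normal with known standard error, condition on it having exceeded the significance threshold, and maximize the resulting truncated likelihood over the true slope. This is a generic implementation of the ascertainment-corrected MLE, not the authors' exact estimator, and the numbers in the example are made up.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def corrected_slope(beta_hat, se, alpha=5e-8):
    """One-parameter conditional MLE of the true slope, given that the
    estimate beta_hat (standard error se) reached p < alpha (two-sided)."""
    c = norm.isf(alpha / 2)                   # significance threshold in SE units

    def neg_cond_loglik(beta):
        # Density of the estimate, truncated to the 'significant' region.
        log_dens = norm.logpdf(beta_hat, loc=beta, scale=se)
        p_ascertain = norm.sf(c - beta / se) + norm.cdf(-c - beta / se)
        return -(log_dens - np.log(p_ascertain))

    res = minimize_scalar(neg_cond_loglik,
                          bounds=(-10 * abs(beta_hat) - 1, 10 * abs(beta_hat) + 1),
                          method="bounded")
    return res.x

# Example (illustrative numbers): a slope of 0.12 with SE 0.02 that just
# cleared genome-wide significance is shrunk toward zero.
print(corrected_slope(0.12, 0.02))
```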
It is now understood that virtually all human cancer types are the result of the accumulation of both genetic and epigenetic changes. DNA methylation is a molecular modification of DNA that is crucial for normal development. Genes that are rich in CpG dinucleotides are usually not methylated in normal tissues, but are frequently hypermethylated in cancer. With the advent of high-throughput platforms, the large-scale structure of genomic methylation patterns is available through genome-wide scans, and a tremendous amount of DNA methylation data has recently been generated. However, sophisticated statistical methods to handle complex DNA methylation data are very limited. Here we developed a likelihood-based Uniform-Normal mixture model to select differentially methylated loci between case and control groups using Illumina arrays. The idea is to model the data as arising from three types of methylation loci: unmethylated, completely methylated, and partially methylated. A three-component mixture model with two Uniform distributions and one truncated normal distribution was used to model the three types. The mixture probabilities and the mean of the normal distribution were used to make inference about differentially methylated loci. Through extensive simulation studies, we demonstrated the feasibility and power of the proposed method. An application to a recently published study on ovarian cancer identified several methylation loci that were missed by the existing method.
DNA methylation; mixture model; case-control designs
Meta-analysis of genome-wide association studies involves testing single nucleotide polymorphisms (SNPs) using summary statistics that are weighted sums of site-specific score or Wald statistics. This approach avoids having to pool individual-level data. We describe the weights that maximize the power of the summary statistics. For small effect-sizes, any choice of weights yields summary Wald and score statistics with the same power, and the optimal weights are proportional to the square roots of the sites' Fisher information for the SNP's regression coefficient. When SNP effect size is constant across sites, the optimal summary Wald statistic is the well-known inverse-variance-weighted combination of estimated regression coefficients, divided by its standard deviation. We give simple approximations to the optimal weights for various phenotypes, and show that weights proportional to the square roots of study sizes are suboptimal for data from case-control studies with varying case-control ratios, for quantitative trait data when the trait variance differs across sites, for count data when the site-specific mean counts differ, and for survival data with different proportions of failing subjects. Simulations suggest that weights that accommodate inter-site variation in imputation error give little power gain compared to those obtained ignoring imputation uncertainties. We note advantages to combining site-specific score statistics, and we show how they can be used to assess effect-size heterogeneity across sites. The utility of the summary score statistic is illustrated by application to a meta-analysis of schizophrenia data in which only site-specific p-values and directions of association are available.
combining GWAS; effect-size heterogeneity; meta-analysis; noncentrality parameter; score statistics; Wald statistics
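The weighted summary score statistic discussed above can be sketched as below: site Z-scores are combined as Σ w_i Z_i / sqrt(Σ w_i²), and the abstract's result is that power is maximized when w_i is proportional to the square root of site i's Fisher information, which for case-control data is roughly the square root of the "effective" sample size rather than of the total sample size. The site values in the example are made up.

```python
import numpy as np
from scipy.stats import norm

def combined_z(z_scores, weights):
    """Weighted summary score statistic and its two-sided p-value."""
    w = np.asarray(weights, dtype=float)
    z = np.asarray(z_scores, dtype=float)
    z_comb = np.sum(w * z) / np.sqrt(np.sum(w ** 2))
    return z_comb, 2 * norm.sf(abs(z_comb))

def case_control_weight(n_cases, n_controls):
    """Square root of the 'effective sample size' for a case-control site,
    used as a rough proxy for the square root of its Fisher information."""
    return np.sqrt(n_cases * n_controls / (n_cases + n_controls))

z_sites = [2.1, 1.4, -0.3]                         # made-up site Z-scores
w_sites = [case_control_weight(*nc)
           for nc in [(900, 1100), (400, 1600), (250, 250)]]
print(combined_z(z_sites, w_sites))
```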
In anticipation of the availability of next-generation sequencing data, there is increasing interest in investigating association between complex traits and rare variants (RVs). In contrast to association studies of common variants (CVs), due to the low frequencies of RVs, common wisdom suggests that existing statistical tests for CVs might not work, motivating the recent development of several new tests for analyzing RVs, most of which are based on the idea of pooling/collapsing RVs. However, there is a lack of evaluations of, and thus guidance on the use of, existing tests. Here we provide a comprehensive comparison of various statistical tests using simulated data. We consider both independent and correlated rare mutations, and representative tests for both CVs and RVs. As expected, if there are no or few non-causal (i.e. neutral or non-associated) RVs in a locus of interest, while the effects of causal RVs on the trait are all (or mostly) in the same direction (i.e. either protective or deleterious, but not both), then the simple pooled association tests (without selecting RVs and their association directions) and a new test called kernel-based adaptive clustering (KBAC) perform similarly and are most powerful; KBAC is more robust than simple pooled association tests in the presence of non-causal RVs; however, as the number of non-causal RVs increases and/or in the presence of opposite association directions, the winners are two methods originally proposed for CVs and a new test called the C-alpha test proposed for RVs, each of which can be regarded as testing on a variance component in a random-effects model. Interestingly, several methods based on sequential model selection (i.e. selecting causal RVs and their association directions), including two new methods proposed here, perform robustly and often have statistical power between those of the above two classes.
C-alpha test; kernel machine regression; logistic regression; model selection; permutation; pooled association tests; random-effects models; SSU test; Sum test; statistical power
Genome-wide association studies (GWAS) have been frequently conducted on general or isolated populations with related individuals. However, there is a lack of consensus on which strategy is most appropriate for analyzing dichotomous phenotypes in general pedigrees. Using simulation studies, we compared several strategies, including generalized estimating equations (GEE) with various working correlation structures, a generalized linear mixed model (GLMM), and a variance component strategy (denoted LMEBIN) that treats dichotomous outcomes as continuous, with special attention to their performance with rare variants, rare diseases, and small sample sizes. In our simulations, when the sample size is not small, only GEE and LMEBIN maintain the nominal type I error rate in most cases, with exceptions for GEE when the disease and the genetic variants are very rare. GEE and LMEBIN have similar statistical power and slightly outperform GLMM when the prevalence is low. In terms of computational efficiency, GEE with the sandwich variance estimator outperforms GLMM and LMEBIN. We apply the strategies to a GWAS of gout in the Framingham Heart Study. Based on our results, we recommend GEE ind-san for common variants and GEE ind-fij or LMEBIN for rare variants in GWAS of dichotomous outcomes with general pedigrees.
genetic association; dichotomous phenotypes; familial relatedness