Although high-throughput genotyping arrays have made whole-genome association studies (WGAS) feasible, only a small proportion of SNPs in the human genome are actually surveyed in such studies. In addition, various SNP arrays assay different sets of SNPs, which leads to challenges in comparing results and merging data for meta-analyses. Genome-wide imputation of untyped markers allows us to address these issues in a direct fashion.
384 Caucasian American liver donors were genotyped using Illumina 650Y (Ilmn650Y) arrays, from which we also derived the genotypes corresponding to the Ilmn317K array. On these data, we compared two imputation methods: MACH and BEAGLE. We imputed 2.5 million HapMap Release 22 SNPs and conducted GWAS on ~40,000 liver mRNA expression traits (eQTL analysis). In addition, 200 Caucasian American and 200 African American subjects were genotyped using the Affymetrix 500K array plus a custom 164K fill-in chip. We then imputed the HapMap SNPs and quantified imputation accuracy by randomly masking observed SNPs.
MACH and BEAGLE perform similarly with respect to imputation accuracy. The Ilmn650Y array yields excellent imputation performance, outperforming both the Affx500K and Ilmn317K sets. For Caucasian Americans, 90% of the HapMap SNPs were imputed at 98% accuracy. As expected, imputation of poorly tagged SNPs (untyped SNPs in weak LD with typed markers) was less successful. It was more challenging to impute genotypes in the African American population, given (1) shorter LD blocks and (2) admixture with Caucasian populations. To address issue (2), we pooled HapMap CEU and YRI data as an imputation reference set, which greatly improved overall performance. The approximately 40,000 phenotypes scored in these populations provide a path to determine empirically how the power to detect associations is affected by the imputation procedures. That is, at a fixed false discovery rate, the number of cis-eQTL discoveries detected by various methods can be interpreted as their relative statistical power in the GWAS. In this study, we find that imputation offers modest additional power (~4%) on top of either Ilmn317K or Ilmn650Y, much less than the power gain from Ilmn317K to Ilmn650Y (13%).
Current algorithms can accurately impute genotypes for untyped markers, which enables researchers to pool data between studies conducted using different SNP sets. While genotyping itself carries a small error rate (e.g. 0.5%), imputed genotypes are surprisingly accurate. We found that dense marker sets (e.g. Ilmn650Y) outperform sparser ones (e.g. Ilmn317K) in terms of imputation yield and accuracy. We also found it harder to impute genotypes for African American samples, partially due to population admixture, although using a pooled reference boosts performance. Interestingly, GWAS carried out using imputed genotypes only slightly increased power on top of assayed SNPs, likely because adding markers via imputation yields only a modest gain in genetic coverage while worsening the multiple-testing penalty. Furthermore, cis-eQTL mapping using the dense imputed SNP set achieves finer resolution, locating association peaks closer to causal variants than the conventional approach.
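The power comparison above counts cis-eQTL discoveries at a fixed false discovery rate. As an illustrative sketch (not the authors' pipeline), the Benjamini–Hochberg step-up rule gives the discovery count for a list of per-gene p-values; comparing these counts between marker sets at the same FDR serves as a proxy for relative power:

```python
def bh_discoveries(pvals, fdr=0.05):
    """Benjamini-Hochberg step-up: find the largest rank k with
    p_(k) <= k * fdr / m, and reject the k smallest p-values."""
    m = len(pvals)
    k = 0
    for rank, p in enumerate(sorted(pvals), start=1):
        if p <= rank * fdr / m:
            k = rank
    return k
```

For example, running this on the p-values from the Ilmn317K-only analysis and on those from the imputed analysis, at the same FDR, gives the two discovery counts whose ratio reflects the relative power of the two designs.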
Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes, covariate adjustment, and can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500K SNPs, a real genome-wide data set of 300K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.
Tests of association with disease status are normally conducted one SNP at a time, ignoring the effects of all other genotyped SNPs. We developed a computationally efficient method to simultaneously analyse all SNPs, either in a genome-wide association (GWA) study, or a fine-mapping study based on re-sequencing and/or imputation. The method selects a subset of SNPs that best predicts disease status, while controlling the type-I error of the selected SNPs. This brings many advantages over standard single-SNP approaches, because the signal from a particular SNP can be more clearly assessed when other SNPs associated with disease status are already included in the model. Thus, in comparison with single-SNP analyses, power is increased and the false positive rate is reduced because of reduced residual variation. Localisation is also greatly improved. We demonstrate these advantages over the widely used single-SNP Armitage Trend Test using GWA simulation studies, a real GWA dataset, and a sequence-based fine-mapping simulation study.
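The simultaneous-analysis idea can be illustrated with a sparsity-inducing penalty. The sketch below uses an L1 (lasso) penalty fitted by proximal gradient descent as a simple stand-in for the NEG-prior penalised likelihood described above; the function names, learning rate, and penalty strength are illustrative, not the authors' implementation:

```python
import math
import random

def soft_threshold(x, t):
    # proximal operator of the L1 penalty: shrink toward zero by t, clip at zero
    return math.copysign(max(abs(x) - t, 0.0), x)

def l1_logistic(X, y, lam=0.05, lr=0.1, iters=2000):
    """Proximal-gradient fit of L1-penalised logistic regression.
    X: rows of genotype codes (0/1/2); y: 0/1 disease status.
    SNPs with non-zero coefficients are 'selected' into the model."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        g0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            z = b0 + sum(b * x for b, x in zip(beta, xi))
            r = 1.0 / (1.0 + math.exp(-z)) - yi   # residual: fitted prob - outcome
            g0 += r
            for j in range(p):
                grad[j] += r * xi[j]
        b0 -= lr * g0 / n                          # intercept is not penalised
        beta = [soft_threshold(b - lr * g / n, lr * lam)
                for b, g in zip(beta, grad)]
    return b0, beta
```

As in the paper's approach, a SNP enters the model only when its signal survives the shrinkage toward zero, so the signal of each SNP is assessed conditional on the others already in the model.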
Imputation of genome-wide single-nucleotide polymorphism (SNP) arrays to a larger known reference panel of SNPs has become a standard and an essential part of genome-wide association studies. However, little is known about the behavior of imputation in African Americans with respect to the different imputation algorithms, the reference population(s) and the reference SNP panels used. Genome-wide SNP data (Affymetrix 6.0) from 3207 African American samples in the Atherosclerosis Risk in Communities Study (ARIC) was used to systematically evaluate imputation quality and yield. Imputation was performed with the imputation algorithms MACH, IMPUTE and BEAGLE using several combinations of three reference panels of HapMap III (ASW, YRI and CEU) and 1000 Genomes Project (pilot 1 YRI June 2010 release, EUR and AFR August 2010 and June 2011 releases) panels with SNP data on chromosomes 18, 20 and 22. About 10% of the directly genotyped SNPs from each chromosome were masked, and SNPs common between the reference panels were used for evaluating the imputation quality using two statistical metrics—concordance accuracy and Cohen’s kappa (κ) coefficient. The dependencies of these metrics on the minor allele frequencies (MAF) and specific genotype categories (minor allele homozygotes, heterozygotes and major allele homozygotes) were thoroughly investigated to determine the best panel and method for imputation in African Americans. In addition, the power to detect imputed SNPs associated with simulated phenotypes was studied using the mean genotype of each masked SNP in the imputed data. Our results indicate that the genotype concordances after stratification into each genotype category and Cohen’s κ coefficient are considerably better equipped to differentiate imputation performance compared with the traditionally used total concordance statistic, and both statistics improved with increasing MAF irrespective of the imputation method. 
We also find that MACH and IMPUTE performed equally well and consistently better than BEAGLE, irrespective of the reference panel used. Among the various reference panel combinations, for both HapMap III and the 1000 Genomes Project, multi-ethnic panels gave better imputation accuracy than panels containing samples from a single ethnic group. The most recent 1000 Genomes Project release (June 2011) yielded a substantially higher number of imputed SNPs than HapMap III and performed as well as or better than the best combined HapMap III reference panels and previous releases of the 1000 Genomes Project.
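Both evaluation metrics are straightforward to compute from the masked genotypes. A minimal sketch (genotypes coded 0/1/2; function names are illustrative):

```python
def concordance_and_kappa(true_g, imputed_g):
    """Overall concordance and Cohen's kappa between true and imputed calls."""
    n = len(true_g)
    po = sum(t == i for t, i in zip(true_g, imputed_g)) / n   # observed agreement
    # expected chance agreement from the two marginal genotype distributions
    pe = sum((true_g.count(g) / n) * (imputed_g.count(g) / n) for g in (0, 1, 2))
    return po, (po - pe) / (1.0 - pe)

def category_concordance(true_g, imputed_g, g):
    """Concordance restricted to one true genotype category (0, 1 or 2)."""
    pairs = [(t, i) for t, i in zip(true_g, imputed_g) if t == g]
    return sum(t == i for t, i in pairs) / len(pairs)
```

This makes the paper's point concrete: at a low-MAF SNP, most calls are major-allele homozygotes, so total concordance is high even when the imputation is poor; kappa and the per-category rates expose that failure.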
concordance; GWAS; HapMap; imputation; imputation accuracy; kappa; 1000 Genomes
Next generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website.
As next-generation sequencing (NGS) costs continue to fall and genome-wide association study (GWAS) platform coverage improves, the human genetics community is positioned to identify potentially causal variants. However, current NGS or imputation-based studies of either the whole genome or regions previously identified by GWAS have not yet been very successful in identifying causal variants. A major hurdle is the development of methods to distinguish disease-causing variants from their highly-correlated proxies within an associated region. We show that various common factors, such as differential sequencing or imputation accuracy rates and linkage disequilibrium patterns, with or without GWAS-informed region selection, can substantially decrease the probability of identifying the correct causal SNP, often by more than half. We then describe a novel and easy-to-implement re-ranking procedure that can double the probability that the causal SNP is top-ranked in many settings. Application to the NCI Breast and Prostate Cancer (BPC3) Cohort Consortium aggressive prostate cancer data identified new top SNPs within two associated loci previously established via GWAS, as well as several additional possible causal SNPs that had been previously overlooked.
Genotype imputation substantially increases available markers for analysis in genome-wide association studies (GWAS) by leveraging linkage disequilibrium from a reference panel. We sought to (i) investigate the performance of imputation from the August 2010 release of the 1000 Genomes Project (1000GP) in an existing GWAS of prostate cancer, (ii) look for novel associations with prostate cancer risk, (iii) fine-map known prostate cancer susceptibility regions using an approximate Bayesian framework and stepwise regression, and (iv) compare power and efficiency of imputation and de novo sequencing.
We used 2,782 aggressive prostate cancer cases and 4,458 controls from the NCI Breast and Prostate Cancer Cohort Consortium aggressive prostate cancer GWAS to infer 5.8 million well-imputed autosomal single nucleotide polymorphisms.
Imputation quality, as measured by correlation between imputed and true allele counts, was higher among common variants than rare variants. We found no novel prostate cancer associations among a subset of 1.2 million well-imputed low-frequency variants. At a genome-wide sequencing cost of $2,500, imputation from SNP arrays is a more powerful strategy than sequencing for detecting disease associations of SNPs with minor allele frequencies above 1%.
1000GP imputation provided dense coverage of previously-identified prostate cancer susceptibility regions, highlighting its potential as an inexpensive first-pass approach to fine-mapping in regions such as 5p15 and 8q24. Our study shows 1000GP imputation can accurately identify low-frequency variants and stresses the importance of large sample size when studying these variants.
rare variants; association; fine mapping
Imputation-based association methods provide a powerful framework for testing untyped variants for association with phenotypes and for combining results from multiple studies that use different genotyping platforms. Here, we consider several issues that arise when applying these methods in practice, including: (i) factors affecting imputation accuracy, including choice of reference panel; (ii) the effects of imputation accuracy on power to detect associations; (iii) the relative merits of Bayesian and frequentist approaches to testing imputed genotypes for association with phenotype; and (iv) how to quickly and accurately compute Bayes factors for testing imputed SNPs. We find that imputation-based methods can be robust to imputation accuracy and can improve power to detect associations, even when average imputation accuracy is poor. We explain how ranking SNPs for association by a standard likelihood ratio test gives the same results as a Bayesian procedure that uses an unnatural prior assumption—specifically, that difficult-to-impute SNPs tend to have larger effects—and assess the power gained from using a Bayesian approach that does not make this assumption. Within the Bayesian framework, we find that good approximations to a full analysis can be achieved by simply replacing unknown genotypes with a point estimate—their posterior mean. This approximation considerably reduces computational expense compared with published sampling-based approaches, and the methods we present are practical on a genome-wide scale with very modest computational resources (e.g., a single desktop computer). The approximation also facilitates combining information across studies, using only summary data for each SNP. Methods discussed here are implemented in the software package BIMBAM, which is available from http://stephenslab.uchicago.edu/software.html.
Genotype imputation is becoming a popular approach to comparing and combining results of multiple association studies that used different SNP genotyping platforms. The basic idea is to exploit the fact that, due to correlation among untyped and typed SNPs, genotypes of untyped SNPs in each study can be inferred (“imputed”) from the genotypes at typed SNPs, often with high accuracy. In this paper, we consider several issues that arise when applying these methods in practice, including factors affecting imputation accuracy, the importance of taking account of imputation uncertainty when testing for association between imputed SNPs and phenotype, how imputation accuracy affects power, and how to combine results across studies when only single-SNP summary data can be shared among research groups.
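Replacing each unknown genotype by its posterior mean, as described above, reduces imputation-aware testing to ordinary regression on the "dosage". A minimal sketch for a quantitative phenotype (variable names and the OLS test are illustrative, not BIMBAM's Bayes-factor computation):

```python
import math

def posterior_mean(p0, p1, p2):
    """Expected minor-allele count under the imputed genotype probabilities."""
    return p1 + 2.0 * p2

def dosage_regression(dosages, phenos):
    """Slope and standard error from regressing phenotype on dosage (OLS)."""
    n = len(dosages)
    mx, my = sum(dosages) / n, sum(phenos) / n
    sxx = sum((x - mx) ** 2 for x in dosages)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, phenos))
    beta = sxy / sxx
    rss = sum((y - my - beta * (x - mx)) ** 2 for x, y in zip(dosages, phenos))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se
```

Because each study only needs to report the per-SNP slope and standard error, this point-estimate approximation also supports the summary-data meta-analysis described above.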
Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.
Large association studies have proven to be effective tools for identifying parts of the genome that influence disease risk and other heritable traits. So-called “genotype imputation” methods form a cornerstone of modern association studies: by extrapolating genetic correlations from a densely characterized reference panel to a sparsely typed study sample, such methods can estimate unobserved genotypes with high accuracy, thereby increasing the chances of finding true associations. To date, most genome-wide imputation analyses have used reference data from the International HapMap Project. While this strategy has been successful, association studies in the near future will also have access to additional reference information, such as control sets genotyped on multiple SNP chips and dense genome-wide haplotypes from the 1,000 Genomes Project. These new reference panels should improve the quality and scope of imputation, but they also present new methodological challenges. We describe a genotype imputation method, IMPUTE version 2, that is designed to address these challenges in next-generation association studies. We show that our method can use a reference panel containing thousands of chromosomes to attain higher accuracy than is possible with the HapMap alone, and that our approach is more accurate than competing methods on both current and next-generation datasets. We also highlight the modeling issues that arise in imputation datasets.
Whole-genome sequencing technologies are being increasingly applied to Plasmodium falciparum clinical isolates to identify genetic determinants of malaria pathogenesis. However, genome-wide discovery methods, such as haplotype scans for signatures of natural selection, are hindered by missing genotypes in sequence data. Poor correlation between single nucleotide polymorphisms (SNPs) in the P. falciparum genome complicates efforts to apply established missing-genotype imputation methods that leverage patterns of linkage disequilibrium (LD). The accuracy of state-of-the-art, LD-based imputation methods (IMPUTE, Beagle) was assessed by measuring allelic r2 for 459 P. falciparum samples from malaria patients in 4 countries: Thailand, Cambodia, Gambia, and Malawi. In restricting our analysis to 86k high-quality SNPs across the populations, we found that the complete-case analysis was restricted to 21k SNPs (24.5%), despite no single SNP having more than 10% missing genotypes. The accuracy of Beagle in filling in missing genotypes was consistently high across all populations (allelic r2, 0.87-0.96), but the performance of IMPUTE was mixed (allelic r2, 0.34-0.99) depending on reference haplotypes and population. Positive selection analysis using Beagle-imputed haplotypes identified loci involved in resistance to chloroquine (crt) in Thailand, Cambodia, and Gambia, sulfadoxine-pyrimethamine (dhfr, dhps) in Cambodia, and artemisinin (kelch13) in Cambodia. Tajima’s D-based analysis identified genes under balancing selection that encode well-characterized vaccine candidates: apical merozoite antigen 1 (ama1) and merozoite surface protein 1 (msp1). In contrast, the complete-case analysis failed to identify any well-validated drug resistance or candidate vaccine loci, except kelch13. In a setting of low LD and modest levels of missing genotypes, using Beagle to impute P. falciparum genotypes is a viable strategy for conducting accurate large-scale population genetics and association analyses, and supporting global surveillance for drug resistance markers and candidate vaccine antigens.
Characterizing genetic diversity and function in Plasmodium falciparum, including identifying determinants of emerging drug resistance, is crucial to informing public health strategies to contain and eliminate this malaria parasite. The lack of a robust framework to handle missing P. falciparum genotypes arising from next-generation sequencing efforts impedes genome-wide methods that depend on complete genotype information and often leads to analyses that discard entire regions of the genome. This study is the first to evaluate the performance of missing data imputation or “filling in” in the P. falciparum genome, where the correlation between genetic markers is generally lower than in the human genome. We considered 86k markers in 459 clinical isolates from 4 malaria-endemic populations of Africa and Southeast Asia. Although low genotype missingness per SNP (<10%) results in complete datasets for only 25% of SNPs, imputation is accurate. This finding is corroborated by the ability of imputed haplotype analysis to recover several well-established vaccine candidates and drug resistance loci, including kelch13—a recently-validated gene involved in artemisinin resistance. Our work demonstrates that imputation can assist the application of genome-wide methods to identify the determinants of P. falciparum diversity, including those involved in drug resistance, immune evasion, and host virulence.
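The allelic r2 metric used in this evaluation is the squared Pearson correlation between the true allele counts and the imputed dosages at masked sites. A minimal sketch:

```python
def allelic_r2(true_counts, dosages):
    """Squared Pearson correlation between true allele counts (0/1/2)
    and imputed dosages at a masked SNP."""
    n = len(true_counts)
    mt, md = sum(true_counts) / n, sum(dosages) / n
    cov = sum((t - mt) * (d - md) for t, d in zip(true_counts, dosages))
    vt = sum((t - mt) ** 2 for t in true_counts)
    vd = sum((d - md) ** 2 for d in dosages)
    return (cov * cov) / (vt * vd)
```

Unlike raw concordance, allelic r2 rewards dosages that track the true allele counts even when hard genotype calls would be ambiguous, which is why it is the standard accuracy metric for masked-SNP experiments like this one.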
Genome-wide association studies are revolutionizing the search for the genes underlying human complex diseases. The main decisions to be made at the design stage of these studies are the choice of the commercial genotyping chip to be used and the numbers of case and control samples to be genotyped. The most common method of comparing different chips is using a measure of coverage, but this fails to properly account for the effects of sample size, the genetic model of the disease, and linkage disequilibrium between SNPs. In this paper, we argue that the statistical power to detect a causative variant should be the major criterion in study design. Because of the complicated pattern of linkage disequilibrium (LD) in the human genome, power cannot be calculated analytically and must instead be assessed by simulation. We describe in detail a method of simulating case-control samples at a set of linked SNPs that replicates the patterns of LD in human populations, and we used it to assess power for a comprehensive set of available genotyping chips. Our results allow us to compare the performance of the chips to detect variants with different effect sizes and allele frequencies, look at how power changes with sample size in different populations or when using multi-marker tags and genotype imputation approaches, and how performance compares to a hypothetical chip that contains every SNP in HapMap. A main conclusion of this study is that marked differences in genome coverage may not translate into appreciable differences in power and that, when taking budgetary considerations into account, the most powerful design may not always correspond to the chip with the highest coverage. We also show that genotype imputation can be used to boost the power of many chips up to the level obtained from a hypothetical “complete” chip containing all the SNPs in HapMap. 
Our results have been encapsulated into an R software package that allows users to design future association studies, and our methods provide a framework with which new chip sets can be evaluated.
Genome-wide association studies are a powerful and now widely-used method for finding genetic variants that increase the risk of developing particular diseases. These studies are complex and must be planned carefully in order to maximize the probability of finding novel associations. The main design choices to be made relate to sample sizes and choice of commercially available genotyping chip and are often constrained by cost, which can currently be as much as several million dollars. No comprehensive comparisons of chips based on their power for different sample sizes or for fixed study cost are currently available. We describe in detail a method for simulating large genome-wide association samples that accounts for the complex correlations between SNPs due to LD, and we used this method to assess the power of current genotyping chips. Our results highlight the differences between the chips under a range of plausible scenarios, and we demonstrate how our results can be used to design a study with a budget constraint. We also show how genotype imputation can be used to boost the power of each chip and that this method decreases the differences between the chips. Our simulation method and software for comparing power are being made available so that future association studies can be designed in a principled fashion.
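The simulation-based power assessment can be illustrated at a single SNP: sample retrospective case-control genotypes under a multiplicative risk model, apply the Armitage trend test (whose statistic equals n·r2), and count the fraction of replicates reaching significance. This is a toy sketch, not the authors' LD-preserving simulator; the sample sizes, genotype relative risk, and thresholds are illustrative:

```python
import math
import random

def trend_test_p(genos, status):
    """Armitage trend test p-value; statistic = n * r^2 ~ chi-square(1)."""
    n = len(genos)
    mg, ms = sum(genos) / n, sum(status) / n
    cov = sum((g - mg) * (s - ms) for g, s in zip(genos, status))
    vg = sum((g - mg) ** 2 for g in genos)
    vs = sum((s - ms) ** 2 for s in status)
    stat = n * cov * cov / (vg * vs)
    return math.erfc(math.sqrt(stat / 2.0))   # chi-square(1) upper tail

def simulate_power(n_cases, n_controls, maf, grr, alpha, reps=100, seed=7):
    """Fraction of replicates with p < alpha at a multiplicative-risk SNP."""
    rng = random.Random(seed)
    q = 1.0 - maf
    # genotype frequencies: controls ~ HWE; cases weighted by genotype relative risk
    ctrl = [q * q, 2 * q * maf, maf * maf]
    case = [ctrl[0], ctrl[1] * grr, ctrl[2] * grr * grr]
    tot = sum(case)
    case = [w / tot for w in case]

    def draw(probs):
        u, c = rng.random(), 0.0
        for g, pr in enumerate(probs):
            c += pr
            if u <= c:
                return g
        return 2

    hits = 0
    for _ in range(reps):
        genos = [draw(case) for _ in range(n_cases)] + \
                [draw(ctrl) for _ in range(n_controls)]
        status = [1] * n_cases + [0] * n_controls
        if trend_test_p(genos, status) < alpha:
            hits += 1
    return hits / reps
```

Extending this idea to whole chips requires simulating linked SNPs with realistic LD, which is the harder problem the paper addresses; the per-SNP machinery, however, is exactly this kind of replicate counting.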
Most genetic association studies only genotype a small proportion of cataloged single-nucleotide polymorphisms (SNPs) in regions of interest. With the catalogs of high-density SNP data (e.g., HapMap) available to researchers today, it has become possible to impute genotypes at untyped SNPs. This in turn allows us to test those untyped SNPs, the motivation being to increase power in association studies. Several imputation methods and corresponding software packages have been developed for this purpose. The objective of our study is to apply three widely used imputation methods and corresponding software packages to data from a genome-wide association study of rheumatoid arthritis from the North American Rheumatoid Arthritis Consortium in Genetic Analysis Workshop 16, to compare the performance of the three methods, to evaluate their strengths and weaknesses, and to identify additional susceptibility loci underlying rheumatoid arthritis. The software packages used in this paper included a program for Bayesian imputation-based association mapping (BIMBAM), a program for imputing unobserved genotypes in case-control association studies (IMPUTE), and a program for testing untyped alleles (TUNA). We found some untyped SNPs that showed significant association with rheumatoid arthritis. A few of these were not located near any significant typed SNP and thus may be worth further investigation.
Several methods have been proposed to impute genotypes at untyped markers using observed genotypes and genetic data from a reference panel. We used the Genetic Analysis Workshop 16 rheumatoid arthritis case-control dataset to compare the performance of four of these imputation methods: IMPUTE, MACH, PLINK, and fastPHASE. We compared the methods' imputation error rates and performance of association tests using the imputed data, in the context of imputing completely untyped markers as well as imputing missing genotypes to combine two datasets genotyped at different sets of markers. As expected, all methods performed better for single-nucleotide polymorphisms (SNPs) in high linkage disequilibrium with genotyped SNPs. However, MACH and IMPUTE generated lower imputation error rates than fastPHASE and PLINK. Association tests based on allele "dosage" from MACH and tests based on the posterior probabilities from IMPUTE provided results closest to those based on complete data. However, in both situations, none of the imputation-based tests provide the same level of evidence of association as the complete data at SNPs strongly associated with disease.
Imputation is a statistical process used to predict genotypes at loci not directly assayed in a sample of individuals. Our goal is to measure the performance of imputation in predicting the genotypes of the best-known gene polymorphisms involved in drug metabolism, using a common SNP array genotyping platform of the kind widely used in genome-wide association studies.
Thirty-nine individuals were genotyped with both Affymetrix Genome Wide Human SNP 6.0 (AFFY) and Affymetrix DMET Plus (DMET) platforms. AFFY and DMET contain nearly 900,000 and 1,931 markers, respectively. We used a 1000 Genomes Pilot + HapMap 3 reference panel. Imputation was performed using IMPUTE version 2. SNPs contained in DMET but not imputed were analysed by examining markers in their surrounding chromosomal regions. The efficacy of the imputation was measured by evaluating the number of successfully imputed SNPs (SSNPs).
The imputation predicted the genotypes of 654 SNPs not present in the AFFY array but contained in the DMET array. Approximately 1,000 SNPs were not annotated in the reference panel and therefore could not be directly imputed. After testing three different imputed genotype calling thresholds (IGCT), we observed that imputation performed best at an IGCT of 50%, with a rate of SSNPs (MAF > 0.05) equal to 85%.
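An imputed genotype calling threshold of this kind simply discards calls whose best posterior probability falls below the cut-off. A minimal sketch (an IGCT of 0.5 matches the 50% threshold above; function names are illustrative):

```python
def call_genotypes(prob_rows, igct=0.5):
    """Keep a call only when the max posterior probability meets the threshold.
    prob_rows: per-SNP (p0, p1, p2) imputed genotype probabilities."""
    calls = []
    for probs in prob_rows:
        best = max(range(3), key=lambda g: probs[g])
        calls.append(best if probs[best] >= igct else None)
    return calls

def success_rate(calls):
    """Fraction of SNPs with a confident (non-missing) call."""
    return sum(c is not None for c in calls) / len(calls)
```

Raising the IGCT trades yield for accuracy: more calls are set to missing, but the retained calls are more reliable, which is the trade-off explored by testing several thresholds.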
Most of the genes involved in drug metabolism can be imputed with high efficacy using standard genome-wide genotyping platforms and imputing procedures.
Genotyping-by-sequencing (GBS) represents a highly cost-effective high-throughput genotyping approach. By nature, however, GBS is subject to generating sizeable amounts of missing data and these will need to be imputed for many downstream analyses. The extent to which such missing data can be tolerated in calling SNPs has not been explored widely. In this work, we first explore the use of imputation to fill in missing genotypes in GBS datasets. Importantly, we use whole genome resequencing data to assess the accuracy of the imputed data. Using a panel of 301 soybean accessions, we show that over 62,000 SNPs could be called when tolerating up to 80% missing data, a five-fold increase over the number called when tolerating up to 20% missing data. At all levels of missing data examined (between 20% and 80%), the resulting SNP datasets were of uniformly high accuracy (96–98%). We then used imputation to combine complementary SNP datasets derived from GBS and a SNP array (SoySNP50K). We thus produced an enhanced dataset of >100,000 SNPs and the genotypes at the previously untyped loci were again imputed with a high level of accuracy (95%). Of the >4,000,000 SNPs identified through resequencing 23 accessions (among the 301 used in the GBS analysis), 1.4 million tag SNPs were used as a reference to impute this large set of SNPs on the entire panel of 301 accessions. These previously untyped loci could be imputed with around 90% accuracy. Finally, we used the 100K SNP dataset (GBS + SoySNP50K) to perform a GWAS on seed oil content within this collection of soybean accessions. Both the number of significant marker-trait associations and the peak significance levels were improved considerably using this enhanced catalog of SNPs relative to a smaller catalog resulting from GBS alone at ≤20% missing data. 
Our results demonstrate that imputation can be used to fill in both missing genotypes and untyped loci with very high accuracy and that this leads to more powerful genetic analyses.
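The accuracy-assessment strategy used above — hiding known genotypes, imputing, and comparing — can be sketched as follows. The mode-fill imputer is a deliberately naive stand-in for a real imputation program (Beagle, MACH, etc.), and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def masked_accuracy(genos, mask_frac, impute):
    """Estimate imputation accuracy by hiding known genotypes.

    genos: (n_samples, n_snps) matrix of 0/1/2 genotype calls.
    impute: callable that fills np.nan entries; here a per-SNP mode
    fill stands in for a real imputation program.
    """
    genos = genos.astype(float)
    mask = rng.random(genos.shape) < mask_frac
    hidden = genos.copy()
    hidden[mask] = np.nan              # hide a fraction of known calls
    filled = impute(hidden)
    return np.mean(filled[mask] == genos[mask])

def mode_fill(g):
    """Fill each missing entry with the most common genotype at that SNP."""
    out = g.copy()
    for j in range(g.shape[1]):
        col = g[:, j]
        obs = col[~np.isnan(col)]
        vals, counts = np.unique(obs, return_counts=True)
        out[np.isnan(col), j] = vals[counts.argmax()]
    return out
```

The same loop generalizes to any imputer and any masking fraction, which is how accuracy at 20% versus 80% missing data can be compared on equal footing.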
While genome-wide association studies (GWAS) have primarily examined populations of European ancestry, more recent studies often involve additional populations, including admixed populations such as African Americans and Latinos. In admixed populations, linkage disequilibrium (LD) exists both at a fine scale in ancestral populations and at a coarse scale (admixture-LD) due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered SNP association (LD mapping) or admixture association (mapping by admixture-LD), but not both. Here, we introduce a new statistical framework for combining SNP and admixture association in case-control studies, as well as methods for local ancestry-aware imputation. We illustrate the gain in statistical power achieved by these methods by analyzing data of 6,209 unrelated African Americans from the CARe project genotyped on the Affymetrix 6.0 chip, in conjunction with both simulated and real phenotypes, as well as by analyzing the FGFR2 locus using breast cancer GWAS data from 5,761 African-American women. We show that, at typed SNPs, our method yields an 8% increase in statistical power for finding disease risk loci compared to the power achieved by standard methods in case-control studies. At imputed SNPs, we observe an 11% increase in statistical power for mapping disease loci when our local ancestry-aware imputation framework and the new scoring statistic are jointly employed. Finally, we show that our method increases statistical power in regions harboring the causal SNP in the case when the causal SNP is untyped and cannot be imputed. Our methods and our publicly available software are broadly applicable to GWAS in admixed populations.
This paper presents improved methodologies for the analysis of genome-wide association studies in admixed populations, which are populations that came about by the mixing of two or more distant continental populations over a few hundred years (e.g., African Americans or Latinos). Studies of admixed populations offer the promise of capturing additional genetic diversity compared to studies over homogeneous populations such as Europeans. In admixed populations, correlation between genetic variants exists both at a fine scale in the ancestral populations and at a coarse scale due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered either one or the other type of correlation, but not both. In this work we develop novel statistical methods that account for both types of genetic correlation, and we show that the combined approach attains greater statistical power than that achieved by applying either approach separately. We provide analysis of simulated and real data from major studies performed in African-American men and women to show the improvement obtained by our methods over the standard methods for analyzing association studies in admixed populations.
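One simple way to see why combining the two correlation scales helps is to pool two association signals into a single test. The sketch below combines a SNP-association chi-square and an admixture-association chi-square under an independence assumption; this is an illustration only, not the paper's actual joint statistic:

```python
from scipy import stats

def combined_pvalue(chi2_snp, chi2_admix):
    """Illustrative 2-df combination of a SNP-association chi-square
    and an admixture-association chi-square, assuming the two signals
    are independent. The paper's statistic models the two jointly and
    is not reproduced here."""
    return stats.chi2.sf(chi2_snp + chi2_admix, df=2)
```

A locus with modest evidence on each scale separately can reach significance when the two are pooled, which is the intuition behind the reported power gains.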
Gene–gene interactions play an important role in complex human diseases. Detecting them has long been a challenge because of their complexity. The standard approach of testing SNP–SNP interactions may be inadequate, as it does not model linkage disequilibrium (LD) among the SNPs in each gene and can lose power because of the large number of comparisons. To improve power, we propose a principal component (PC)-based framework for gene-based interaction analysis. We analytically derive the optimal weight for both quantitative and binary traits based on pairwise LD information. We then use PCs to summarize the information in each gene and test for interactions between the PCs. We further extend this gene-based interaction analysis procedure to allow the use of imputation dosage scores obtained from a popular imputation software package, MACH, which incorporates multilocus LD information. To evaluate the performance of the gene-based interaction tests, we conducted extensive simulations under various settings. We demonstrate that gene-based interaction tests are more powerful than SNP-based tests when more than two variants interact with each other; moreover, tests that incorporate external LD information are generally more powerful than those that use genotyped markers only. We also apply the proposed gene-based interaction tests to a candidate gene study on high-density lipoprotein. As our method operates at the gene level, it can be applied in a genome-wide association setting and used as a screening tool to detect gene–gene interactions.
gene–gene interaction; linkage disequilibrium; imputation
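The core of the gene-based test above — summarize each gene by its leading PC, then test the PC × PC interaction term — can be sketched for the quantitative-trait case. This is a bare ordinary-least-squares illustration without the paper's optimal LD-based weights; all function names are hypothetical:

```python
import numpy as np
from scipy import stats

def first_pc(G):
    """First principal component scores of a (samples x SNPs) genotype matrix."""
    Gc = G - G.mean(axis=0)
    u, s, _ = np.linalg.svd(Gc, full_matrices=False)
    return u[:, 0] * s[0]

def gene_interaction_test(G1, G2, y):
    """Wald-type test of the interaction between the first PCs of two
    genes on a quantitative trait y, via ordinary least squares."""
    p1, p2 = first_pc(G1), first_pc(G2)
    X = np.column_stack([np.ones_like(y), p1, p2, p1 * p2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = len(y) - X.shape[1]
    sigma2 = resid @ resid / df
    cov = sigma2 * np.linalg.inv(X.T @ X)     # OLS covariance of beta
    t = beta[3] / np.sqrt(cov[3, 3])          # interaction coefficient
    return beta[3], 2 * stats.t.sf(abs(t), df)
```

Replacing the genotype matrices `G1`/`G2` with matrices of imputation dosage scores is what allows the test to borrow multilocus LD information, as the abstract describes.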
Interactions among genetic loci are believed to play an important role in disease risk. While many methods have been proposed for detecting such interactions, their relative performance remains largely unclear, mainly because different data sources, detection performance criteria, and experimental protocols were used in the papers introducing these methods and in subsequent studies. Moreover, there have been very few studies strictly focused on comparison of existing methods. Given the importance of detecting gene-gene and gene-environment interactions, a rigorous, comprehensive comparison of performance and limitations of available interaction detection methods is warranted.
We report a comparison of eight representative methods, of which seven were specifically designed to detect interactions among single nucleotide polymorphisms (SNPs), with the last a popular main-effect testing method used as a baseline for performance evaluation. The selected methods, multifactor dimensionality reduction (MDR), full interaction model (FIM), information gain (IG), Bayesian epistasis association mapping (BEAM), SNP harvester (SH), maximum entropy conditional probability modeling (MECPM), logistic regression with an interaction term (LRIT), and logistic regression (LR) were compared on a large number of simulated data sets, each, consistent with complex disease models, embedding multiple sets of interacting SNPs, under different interaction models. The assessment criteria included several relevant detection power measures, family-wise type I error rate, and computational complexity. There are several important results from this study. First, while some SNPs in interactions with strong effects are successfully detected, most of the methods miss many interacting SNPs at an acceptable rate of false positives. In this study, the best-performing method was MECPM. Second, the statistical significance assessment criteria, used by some of the methods to control the type I error rate, are quite conservative, thereby limiting their power and making it difficult to fairly compare them. Third, as expected, power varies for different models and as a function of penetrance, minor allele frequency, linkage disequilibrium and marginal effects. Fourth, the analytical relationships between power and these factors are derived, aiding in the interpretation of the study results. Fifth, for these methods the magnitude of the main effect influences the power of the tests. Sixth, most methods can detect some ground-truth SNPs but have modest power to detect the whole set of interacting SNPs.
This comparison study provides new insights into the strengths and limitations of current methods for detecting interacting loci. This study, along with freely available simulation tools we provide, should help support development of improved methods. The simulation tools are available at: http://code.google.com/p/simulation-tool-bmc-ms9169818735220977/downloads/list.
Association mapping is a powerful approach for dissecting the genetic architecture of complex quantitative traits using high-density SNP markers in maize. Here, we expanded our association panel from 368 to 513 inbred lines with 0.5 million high-quality SNPs using a two-step data-imputation method that combines identity-by-descent (IBD)-based projection with a k-nearest neighbor (KNN) algorithm. Genome-wide association studies (GWAS) were carried out for 17 agronomic traits on the panel of 513 inbred lines, applying both a mixed linear model (MLM) and a new method, the Anderson-Darling (A-D) test. Ten loci for five traits were identified using the MLM method at the Bonferroni-corrected threshold −log10(P) > 5.74 (α = 1). Using the A-D test on 556,809 SNPs at the Bonferroni-corrected threshold −log10(P) > 7.05 (α = 0.05), between one and 34 loci per trait were identified for the 17 traits (107 loci for plant height). Many known loci and new candidate loci were observed only by the A-D test, a few of which were also detected in independent linkage analysis. This study indicates that combining IBD-based projection with the KNN algorithm is an efficient imputation method for inferring large missing genotype segments. In addition, we showed that the A-D test is a useful complement for GWAS analysis of complex quantitative traits. Especially for traits with an abnormal phenotype distribution, controlled by moderate-effect loci or rare variants, the A-D test balances false positives and statistical power. The candidate SNPs and associated genes also provide a rich resource for maize genetics and breeding.
Genotype imputation has been used widely in the analysis of genome-wide association studies (GWAS) to boost power and fine-map associations. We developed a two-step data-imputation method to meet the challenge of a large proportion of missing genotypes. In the past few years, GWAS have uncovered an extensive genetic architecture of complex quantitative traits using high-density SNP markers in maize. Here, GWAS were carried out for 17 agronomic traits with a panel of 513 inbred lines, applying both a mixed linear model and a new method, the Anderson-Darling (A-D) test. We intend to show that the A-D test is a complement to current GWAS methods, especially for complex quantitative traits controlled by moderate-effect loci or rare variants and with abnormal phenotype distributions. In addition, the trait-associated QTL identified here provide a rich resource for maize genetics and breeding.
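The KNN step of the two-step procedure can be sketched as follows: for each line with missing genotypes, find the k most similar lines over shared non-missing loci and fill each gap by majority vote among those neighbours. This is a simplified stand-in for the full IBD-projection + KNN pipeline; the similarity measure and function names are illustrative:

```python
import numpy as np

def knn_impute(G, k=5):
    """Fill missing genotypes (np.nan) in a (lines x SNPs) matrix using
    the k most similar lines, where similarity is the fraction of
    matching genotypes at loci observed in both lines."""
    G = G.astype(float)
    n = G.shape[0]
    out = G.copy()
    for i in range(n):
        miss = np.isnan(G[i])
        if not miss.any():
            continue
        sims = np.full(n, -1.0)              # line i itself stays at -1
        for j in range(n):
            if j == i:
                continue
            shared = ~np.isnan(G[i]) & ~np.isnan(G[j])
            if shared.any():
                sims[j] = np.mean(G[i, shared] == G[j, shared])
        nbrs = np.argsort(sims)[::-1][:k]    # k most similar lines
        for m in np.where(miss)[0]:
            vals = G[nbrs, m]
            vals = vals[~np.isnan(vals)]
            if vals.size:
                # majority vote among the neighbours' genotypes
                out[i, m] = np.bincount(vals.astype(int)).argmax()
    return out
```

In an inbred panel, close relatives share long identical haplotypes, which is why neighbour-based voting can recover even large missing segments.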
Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted ‘glmnet’). We selected glmnet to serve as a benchmark imputation method for this reason. 
We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition.
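The site-level and individual-level correlation metrics described above can be sketched directly. This assumes observed and imputed genotype dosages are held in matching (samples × SNPs) matrices; names are illustrative:

```python
import numpy as np

def dosage_correlations(observed, imputed):
    """Pearson correlation between observed and imputed dosages,
    computed per SNP (site level) and per sample (individual level).
    Monomorphic vectors yield NaN, since correlation is undefined."""
    def corr(a, b):
        if a.std() == 0 or b.std() == 0:
            return np.nan
        return np.corrcoef(a, b)[0, 1]
    site = np.array([corr(observed[:, j], imputed[:, j])
                     for j in range(observed.shape[1])])
    indiv = np.array([corr(observed[i], imputed[i])
                      for i in range(observed.shape[0])])
    return site, indiv
```

Reporting both levels matters: site-level correlation exposes poorly imputed (often low-MAF) variants, while individual-level correlation exposes samples that impute badly, for example because of sparse read depth.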
The main goal of our study was to investigate the implementation, prospects, and limits of marker imputation for quantitative genetic studies contrasting map-independent and map-dependent algorithms. We used a diversity panel consisting of 372 European elite wheat (Triticum aestivum L.) varieties, which had been genotyped with SNP arrays, and performed intensive simulation studies.
Our results clearly showed that imputation accuracy was substantially higher for map-dependent than for map-independent methods. The accuracy of marker imputation depended strongly on the linkage disequilibrium between the markers in the reference panel and the markers to be imputed. Given the decay of linkage disequilibrium present in European wheat, we concluded that around 45,000 markers are needed for low-cost, low-density marker profiling. This will facilitate high imputation accuracy, also for rare alleles. Genomic selection and diversity studies profited only marginally from imputing missing values. In contrast, the power of association mapping increased substantially when missing values were imputed.
Imputing missing values is of particular interest for the economical implementation of association mapping in breeding populations.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1366-y) contains supplementary material, which is available to authorized users.
Elite wheat; Map-dependent imputation; Map-independent imputation; Intensive simulation; Genomic selection; Association mapping
Rhesus macaques are widely used in biomedical research, but the application of genomic information in this species to better understand human disease is still in its infancy. Whole-genome sequence (WGS) data in large pedigreed macaque colonies could provide substantial experimental power for genetic discovery, but the collection of WGS data in large cohorts remains a formidable expense. Here, we describe a cost-effective approach that selects the most informative macaques in a pedigree for 30X WGS, followed by low-cost genotyping-by-sequencing (GBS) at 30X on the remaining macaques in order to generate sparse genotype data at high accuracy. Dense variants from the selected macaques with WGS data are then imputed into macaques having only sparse GBS data, resulting in dense genome-wide genotypes throughout the pedigree.
We developed GBS for the macaque genome using a digestion with PstI, followed by sequencing of size-selected fragments at 30X coverage. From GBS sequence data collected on all individuals in a 16-member pedigree, we characterized high-confidence genotypes at 22,455 single nucleotide variant (SNV) sites that were suitable for guiding imputation of dense sequence data from WGS. To characterize dense markers for imputation, we performed WGS at 30X coverage on nine of the 16 individuals, yielding 10,193,425 high-confidence SNVs. To validate the use of GBS data for facilitating imputation, we initially focused on chromosome 19 as a test case, using an optimized panel of 833 sparse, evenly spaced markers from GBS and 5,010 dense markers from WGS. Using the method of “Genotype Imputation Given Inheritance” (GIGI), we evaluated the effects on imputation accuracy of 3 different strategies for selecting individuals for WGS, including 1) using “GIGI-Pick” to select the most informative individuals, 2) using the most recent generation, or 3) using founders only. We also evaluated the effects on imputation accuracy of using a range of 1 to 9 WGS individuals for imputation. We found that the GIGI-Pick algorithm for selection of WGS individuals outperformed common heuristic approaches, and that genotype numbers and accuracy improved very little when using >5 WGS individuals for imputation. Informed by our findings, we used 4 macaques with WGS data to impute variants at up to 7,655,491 sites spanning all 20 autosomes in the 12 remaining macaques, based on their GBS genotypes at only 17,158 loci. Using a strict confidence threshold, we imputed an average of 3,680,238 variants per individual at >99 % accuracy, or an average of 4,458,883 variants per individual at a more relaxed threshold, yielding >97 % accuracy.
We conclude that an optimal tradeoff between genotype accuracy, number of imputed genotypes, and overall cost exists at the ratio of one individual selected for WGS using the GIGI-Pick algorithm, per 3–5 relatives selected for GBS. This approach makes feasible the collection of accurate, dense genome-wide sequence data in large pedigreed macaque cohorts without the need for more expensive WGS data on all individuals.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-016-2966-x) contains supplementary material, which is available to authorized users.
Whole-genome sequencing; genotyping-by-sequencing; imputation; macaque; pedigree
Meta-analysis (MA) is widely used to pool genome-wide association studies (GWASes) in order to (a) increase the power to detect strong or weak genotype effects or (b) verify results. Because genotyping chips differ in their SNP panels, imputation is the method of choice within GWAS consortia to avoid losing too many SNPs in a MA. YAMAS (Yet Another Meta Analysis Software), however, enables cross-GWAS conclusions prior to finished and polished imputation runs, which can be time-consuming.
Here we present a fast method to avoid forfeiting SNPs present in only a subset of studies, without relying on imputation. This is accomplished by using reference linkage disequilibrium data from the 1000 Genomes/HapMap projects to find proxy-SNPs, together with in-phase alleles, for SNPs missing in at least one study. MA is conducted by combining association effect estimates of a SNP with those of its proxy-SNPs. Our algorithm is implemented in the MA software YAMAS. Association results from GWAS analysis applications can be used as input files for MA, tremendously speeding up MA compared with the conventional imputation approach. We show that our proxy algorithm is well-powered and yields valuable ad hoc results, possibly providing an incentive for follow-up studies. We propose our method as a quick screening step prior to imputation-based MA, as well as an additional main approach for studies without available reference data matching the ethnicities of study participants. As a proof of principle, we analyzed six dbGaP Type II Diabetes GWAS and found that the proxy algorithm clearly outperforms naïve MA at the p-value level: in 17 out of 23 cases we observed an improvement in the p-value by a factor of more than two, with a maximum improvement by a factor of 2,127.
YAMAS is an efficient and fast meta-analysis program which offers various methods, including conventional MA as well as inserting proxy-SNPs for missing markers to avoid unnecessary power loss. MA with YAMAS can be readily conducted as YAMAS provides a generic parser for heterogeneous tabulated file formats within the GWAS field and avoids cumbersome setups. In this way, it supplements the meta-analysis process.
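Once per-study effect estimates are in hand (using a proxy-SNP's estimate, aligned via its in-phase allele, for studies missing the SNP), pooling reduces to standard inverse-variance-weighted fixed-effect meta-analysis. A minimal sketch, illustrative rather than YAMAS code:

```python
import numpy as np

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    betas, ses: per-study effect estimates and standard errors, already
    aligned to the same reference allele (a proxy-SNP's aligned estimate
    substitutes for a study missing the SNP).
    """
    w = 1.0 / np.square(ses)                 # inverse-variance weights
    beta = np.sum(w * betas) / np.sum(w)     # pooled effect
    se = np.sqrt(1.0 / np.sum(w))            # pooled standard error
    return beta, se, beta / se               # z-score for the pooled test
```

The allele alignment step is the critical one: combining a proxy's estimate with the wrong in-phase allele flips the sign of its contribution and cancels rather than reinforces the signal.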
Testing gene-gene interaction in genome-wide association studies generally yields lower power than testing marginal association. Meta-analysis that combines different genotyping platforms is one method used to increase power when assessing gene-gene interactions, and it requires a test for interaction at untyped SNPs. To date, however, formal statistical tests for gene-gene interaction at untyped SNPs have not been thoroughly addressed. The key concern in testing gene-gene interaction at untyped SNPs located on different chromosomes is that the pair of genes might not be independent, while the current generation of imputation methods provides genotypes that are accurate only marginally, for each SNP separately.
In this study we address this challenge and describe a novel method for testing gene-gene interaction on marginally imputed values of untyped SNPs. We show that our novel Wald-type test statistics for interactions, with and without constraints on the interaction parameters, follow the same asymptotic distributions as the corresponding tests for typed SNPs. Through simulations, we show that the proposed tests properly control the type I error and are more powerful than the extension of the classical dosage method to interaction tests. The increase in power results from properly correcting for imputation uncertainty through a jackknife (resampling-based) variance estimator. We apply the method to detect interactions between SNPs on chromosomes 5 and 15 in lung cancer data. The inclusion of results at the untyped SNPs provides much more detailed information in the regions of interest.
As demonstrated by the simulation studies and real data analysis, our approaches outperform the application of traditional dosage method to detection of gene-gene interaction in terms of power while providing control of the type I error.
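The delete-one jackknife underlying the variance correction above can be sketched generically: re-estimate the statistic with each observation left out, then scale the spread of the replicates. This is an illustration of the resampling idea, not the paper's exact estimator:

```python
import numpy as np

def jackknife_se(data, estimator):
    """Delete-one jackknife standard error of an arbitrary estimator.

    data: array of observations (rows are deleted one at a time).
    estimator: callable mapping a data array to a scalar estimate.
    """
    n = len(data)
    reps = np.array([estimator(np.delete(data, i, axis=0))
                     for i in range(n)])
    mean = reps.mean()
    # jackknife variance: (n-1)/n times the sum of squared deviations
    var = (n - 1) / n * np.sum((reps - mean) ** 2)
    return np.sqrt(var)
```

Because the replicates re-run the whole estimation, the resulting variance absorbs sources of uncertainty, such as imputation error, that a model-based standard error would miss.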
Electronic supplementary material
The online version of this article (doi:10.1186/s12863-015-0225-9) contains supplementary material, which is available to authorized users.
Jackknife-based testing framework; Untyped SNP; Imputation-based testing; Gene-gene interaction
We hypothesize that imputation based on data from the 1000 Genomes Project can identify novel association signals on a genome-wide scale due to the dense marker map and the large number of haplotypes. To test the hypothesis, the Wellcome Trust Case Control Consortium (WTCCC) Phase I genotype data were imputed using the 1000 Genomes reference (20100804 EUR), and seven case/control association studies were performed using imputed dosages. We observed two 'missed' disease-associated variants that were undetectable by the original WTCCC analysis but were reported by later studies after the 2007 WTCCC publication. One is within the IL2RA gene for association with type 1 diabetes and the other in proximity to the CDKN2B gene for association with type 2 diabetes. We also identified two refined associations. One is SNP rs11209026 in exon 9 of IL23R for association with Crohn's disease, which is predicted to be probably damaging by PolyPhen2. The other refined variant is in the CUX2 gene region for association with type 1 diabetes, where the newly identified top SNP rs1265564 has an association P-value of 1.68 × 10⁻¹⁶. The new lead SNPs for the two refined loci provide more plausible explanations for the disease associations. We demonstrated that 1000 Genomes-based imputation could indeed identify both novel signals (in our case, 'missed' ones that were detected and replicated by studies after 2007) and refined signals. We anticipate that the findings from this study will provide timely information as individual groups and consortia begin to engage in 1000 Genomes-based imputation.
genome-wide association study; the 1000 Genomes project; imputation
Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed-data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods, or in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data.
Genetic research has focused on the analysis of data points that are assumed to be known deterministically. However, the majority of current high-throughput data is known only probabilistically, and proper methods for handling such uncertain genotypes are limited. Here, we build on existing theory from the field of statistics to introduce a general framework for handling probabilistic genotype data obtained through genotype imputation. This framework, called Multiple Imputation, matches or improves upon existing methods for handling uncertainty in basic analyses of genetic association. Unlike those methods, our approach furthermore extends to more advanced analyses, such as mixed-effects models, with no additional complication. Importantly, it generates posterior probabilities of association that are intrinsically weighted by the certainty of the underlying data, a feature unmatched by other existing methods. Multiple Imputation is also fully compatible with meta-analysis. Finally, our analysis of probabilistic genotype data brings into focus the accuracy, and at times unreliability, of imputation's estimated probabilities. Taken together, these results substantially increase the utility of imputed genotypes in statistical genetics, and may have strong implications for the analysis of sequencing data moving forward.
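In Multiple Imputation, each analysis is run once per imputed dataset and the results are pooled with Rubin's rules, which inflate the variance by the between-imputation spread. A minimal sketch of the pooling step (illustrative, not the paper's implementation):

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine per-imputation results with Rubin's rules.

    estimates, variances: length-m arrays holding the effect estimate
    and its squared standard error from each of m imputed datasets.
    """
    m = len(estimates)
    qbar = np.mean(estimates)          # pooled point estimate
    ubar = np.mean(variances)          # within-imputation variance
    b = np.var(estimates, ddof=1)      # between-imputation variance
    t = ubar + (1 + 1 / m) * b         # total variance (Rubin's rule)
    return qbar, np.sqrt(t)
```

When the m imputations disagree, the between-imputation term `b` grows and the pooled standard error widens, which is exactly how MI weights the final inference by the certainty of the underlying genotypes.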
The technical progress in the last decade has made it possible to sequence millions of DNA reads in a relatively short time frame. Several variant callers based on different algorithms have emerged and have made it possible to extract single nucleotide polymorphisms (SNPs) out of the whole-genome sequence. Often, only a few individuals of a population are sequenced completely and imputation is used to obtain genotypes for all sequence-based SNP loci for other individuals, which have been genotyped for a subset of SNPs using a genotyping array.
First, we compared the sets of variants detected with different variant callers, namely GATK, freebayes and SAMtools, and checked the quality of genotypes of the called variants in a set of 50 fully sequenced white and brown layers. Second, we assessed the imputation accuracy (measured as the correlation between imputed and true genotype per SNP and per individual, and genotype conflict between father-progeny pairs) when imputing from high density SNP array data to whole-genome sequence using data from around 1000 individuals from six different generations. Three different imputation programs (Minimac, FImpute and IMPUTE2) were checked in different validation scenarios.
There were 1,741,573 SNPs detected by all three callers on the studied chromosomes 3, 6, and 28, which was 71.6 % (81.6 %, 88.0 %) of all SNPs detected by GATK (SAMtools, freebayes). Genotype concordance (GC), defined as the proportion of individuals whose array-derived genotypes are the same as the sequence-derived genotypes over all non-missing SNPs on the array, was 0.98 (GATK), 0.97 (freebayes) and 0.98 (SAMtools). Furthermore, the percentage of variants with high values (>0.9) for another three measures (non-reference sensitivity, non-reference genotype concordance and precision) was 90 (88, 75) for GATK (SAMtools, freebayes). With all imputation programs, the correlation between original and imputed genotypes was >0.95 on average when 1,000 SNPs from the SNP array were randomly masked, and >0.85 in a leave-one-out cross-validation within the sequenced individuals.
Overall, the performance of all variant callers studied was very good, particularly for GATK and SAMtools. FImpute performed slightly worse than Minimac and IMPUTE2 in terms of genotype correlation, especially for SNPs with low minor allele frequency, but produced the lowest number of Mendelian conflicts in the available father-progeny pairs. Correlations between real and imputed genotypes remained consistently high even when the individuals to be imputed were several generations away from the sequenced individuals.
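The genotype concordance metric used above has a direct implementation: compare array-derived and sequence-derived calls over the loci where both are non-missing. A minimal sketch with an assumed encoding of missing calls as -1 (names are illustrative):

```python
import numpy as np

def genotype_concordance(array_g, seq_g):
    """Proportion of loci, non-missing in both sources, where the
    array-derived genotype equals the sequence-derived genotype.
    Missing genotypes are encoded as -1."""
    keep = (array_g >= 0) & (seq_g >= 0)   # loci called by both sources
    return np.mean(array_g[keep] == seq_g[keep])
```

The same comparison, restricted to sites where the sequence call is non-reference, gives the stricter non-reference concordance measure also reported in the study.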
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-2059-2) contains supplementary material, which is available to authorized users.
Whole-genome sequencing data; Variant calling; Imputation accuracy; Layer chicken