Genome-wide association studies (GWAS) are a widely used study design for detecting genetic causes of complex diseases. Current studies provide good coverage of common causal SNPs, but not rare ones. A popular method to detect rare causal variants is haplotype testing. A disadvantage of this approach is that many parameters are estimated simultaneously, which can mean a loss of power and slower fitting to large datasets.
Haplotype testing effectively tests both the allele frequencies and the linkage disequilibrium (LD) structure of the data. LD has previously been shown to be mostly attributable to LD between adjacent SNPs. We propose a generalised linear model (GLM) which models the effects of each SNP in a region as well as the statistical interactions between adjacent pairs. This is compared to two other commonly used multimarker GLMs: one with a main-effect parameter for each SNP; one with a parameter for each haplotype.
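As a rough illustration of how the covariates of the proposed model could be laid out, the sketch below builds one design-matrix row per individual: an intercept, one main-effect term per SNP, and one interaction term per adjacent SNP pair. The additive 0/1/2 genotype coding, the use of a simple product for each interaction, and the function name `adjacent_interaction_design` are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List


def adjacent_interaction_design(genotypes: List[List[int]]) -> List[List[int]]:
    """Return rows of [intercept, main effects..., adjacent-pair products...]."""
    rows = []
    for g in genotypes:
        main = list(g)  # one main-effect column per SNP
        # One interaction column per adjacent pair (assumed: product of codes).
        adjacent = [g[j] * g[j + 1] for j in range(len(g) - 1)]
        rows.append([1] + main + adjacent)
    return rows


# Three hypothetical individuals, four SNPs coded as minor-allele counts.
example = [[0, 1, 2, 0], [1, 1, 0, 2], [2, 0, 0, 1]]
design = adjacent_interaction_design(example)
```

For m SNPs this layout has m main-effect columns and m - 1 interaction columns, so the parameter count grows linearly with the region size, whereas the number of possible haplotypes of m biallelic SNPs can be as large as 2^m.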
We show the haplotype model has higher power for rare untyped causal SNPs, the main-effects model has higher power for common untyped causal SNPs, and the proposed model generally has power in between the two others. We show that the relative power of the three methods is dependent on the number of marker haplotypes the causal allele is present on, which depends on the age of the mutation. Except in the case of a common causal variant in high LD with markers, all three multimarker models are superior in power to single-SNP tests.
Including the adjacent statistical interactions results in lower inflation in test statistics when a realistic level of population stratification is present in a dataset.
Using the multimarker models, we analyse data from the Molecular Genetics of Schizophrenia study. The multimarker models find potential associations that are not found by single-SNP tests. However, multimarker models also require stricter control of data quality since biases can have a larger inflationary effect on multimarker test statistics than on single-SNP test statistics.
Analysing a GWAS with multimarker models can yield candidate regions which may contain rare untyped causal variants. This is useful for increasing prior odds of association in future whole-genome sequence analyses.
Current genome-wide association studies still heavily rely on a single-marker strategy, in which each single nucleotide polymorphism (SNP) is tested individually for association with a phenotype. Although methods and software packages that consider multimarker models have become available, they have been slow to become widely adopted and their efficacy in real data analysis is often questioned. Based on conducting extensive simulations, here we endeavor to provide more insights into the performance of simple multimarker association tests as compared to single-marker tests. The results reveal the power advantage as well as disadvantage of the two- vs. the single-marker test. Power differentials depend on the correlation structure among tag SNPs, as well as that between tag SNPs and causal variants. A two-marker test has relatively better performance than single-marker tests when the correlation of the two adjacent markers is high. However, using HapMap data, two-marker tests tended to have a greater chance of being less powerful than single-marker tests, due to constraints on the number of actual possible haplotypes in the HapMap data. Yet, the average power difference was small whenever the one-marker test is more powerful, while there were many situations where the two-marker test can be much more powerful. These findings can be useful to guide analyses of future studies.
Asymptotic power; single-marker test; two-marker test; genome-wide association
Multimarker Transmission/Disequilibrium Tests (TDTs) are association tests that are very robust to population admixture and structure, and they may be used to identify susceptibility loci in genome-wide association studies. Multimarker TDTs using several markers may increase power by capturing high-degree associations. However, there is also a risk of spurious associations and power reduction due to the increase in degrees of freedom. In this study we show that associations found by tests built on simple null hypotheses are highly reproducible in a second independent data set regardless of the number of markers. As a test exhibiting this feature to its maximum, we introduce the multimarker -Groups TDT (), a test which, under the hypothesis of no linkage, asymptotically follows a distribution whose degrees of freedom do not depend on the number of markers. The statistic requires the division of parental haplotypes into two groups: disease susceptibility and disease protective haplotype groups. We assessed the test behavior by performing an extensive simulation study as well as a real-data study using several data sets of two complex diseases. We show that the test is highly efficient and that it achieves the highest power among all the tests used, even when the null hypothesis is tested in a second independent data set. Therefore, it turns out to be a very promising multimarker TDT for genome-wide searches for disease susceptibility loci, and it may be used as a preprocessing step in the construction of more accurate genetic models to predict individual susceptibility to complex diseases.
Four custom Axiom genotyping arrays were designed for a genome-wide association (GWA) study of 100,000 participants from the Kaiser Permanente Research Program on Genes, Environment and Health. The array optimized for individuals of European race/ethnicity was previously described. Here we detail the development of three additional microarrays optimized for individuals of East Asian, African American, and Latino race/ethnicity. For these arrays, we decreased redundancy of high-performing SNPs to increase SNP capacity. The East Asian array was designed using greedy pairwise SNP selection. However, removing SNPs from the target set based on imputation coverage is more efficient than pairwise tagging. Therefore, we developed a novel hybrid SNP selection method for the African American and Latino arrays utilizing rounds of greedy pairwise SNP selection, followed by removal from the target set of SNPs covered by imputation. The arrays provide excellent genome-wide coverage and are valuable additions for large-scale GWA studies.
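The greedy pairwise selection step mentioned above can be sketched as a set-cover loop: repeatedly pick the SNP that tags the most still-untagged targets. This is a minimal sketch under stated assumptions, not the authors' array-design pipeline: the r² threshold of 0.8, the dictionary layout of the LD table, the SNP names, and the function name `greedy_pairwise_tags` are all hypothetical, and the imputation-based removal step of the hybrid method is not modelled.

```python
from typing import Dict, List, Set


def greedy_pairwise_tags(r2: Dict[str, Dict[str, float]],
                         threshold: float = 0.8) -> List[str]:
    """Pick tag SNPs until every target SNP is tagged at r^2 >= threshold."""
    # Each SNP covers itself plus every SNP it is in sufficient LD with.
    covers: Dict[str, Set[str]] = {
        s: {s} | {t for t, v in partners.items() if v >= threshold}
        for s, partners in r2.items()
    }
    uncovered: Set[str] = set(r2)
    tags: List[str] = []
    while uncovered:
        # Choose the SNP covering the most still-uncovered targets;
        # ties go to the alphabetically first name for determinism.
        best = max(sorted(covers), key=lambda s: len(covers[s] & uncovered))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Hypothetical pairwise r^2 values; missing pairs are treated as untagged.
ld = {
    "snp1": {"snp2": 0.90, "snp3": 0.85},
    "snp2": {"snp1": 0.90, "snp3": 0.50},
    "snp3": {"snp1": 0.85, "snp2": 0.50},
    "snp4": {},
}
tags = greedy_pairwise_tags(ld)
```

In this toy table `snp1` tags `snp2` and `snp3`, and `snp4` has no partner, so it must be placed on the array itself.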
Microarray; Genome-wide association study; Coverage; Imputation; Single nucleotide polymorphism; Throughput
Until recently, genome-wide association studies (GWAS) have been restricted to research groups with the budget necessary to genotype hundreds, if not thousands, of samples. Replacing individual genotyping with genotyping of DNA pools in Phase I of a GWAS has proven successful, and dramatically altered the financial feasibility of this approach. When conducting a pool-based GWAS, how well SNP allele frequency is estimated from a DNA pool will influence a study's power to detect associations. Here we address how to control the variance in allele frequency estimation when DNAs are pooled, and how to plan and conduct the most efficient well-powered pool-based GWAS.
By examining the variation in allele frequency estimation on SNP arrays between and within DNA pools, we determine how array variance [var(e_array)] and pool-construction variance [var(e_construction)] contribute to the total variance of allele frequency estimation. This information is useful in deciding whether replicate arrays or replicate pools are more useful in reducing variance. Our analysis is based on 27 DNA pools ranging in size from 74 to 446 individual samples, genotyped on a collective total of 128 Illumina beadarrays: 24 1M-Single, 32 1M-Duo, and 72 660-Quad.
For all three Illumina SNP array types our estimates of var(e_array) were similar, between 3 × 10⁻⁴ and 4 × 10⁻⁴ for normalized data. Var(e_construction) accounted for between 20% and 40% of pooling variance across 27 pools in normalized data.
We conclude that relative to var(e_array), var(e_construction) is of less importance in reducing the variance in allele frequency estimation from DNA pools; however, our data suggest that on average it may be more important than previously thought. We have prepared a simple online tool, PoolingPlanner (available at http://www.kchew.ca/PoolingPlanner/), which calculates the effective sample size (ESS) of a DNA pool given a range of replicate array values. ESS can be used in a power calculator to perform pool-adjusted calculations. This allows one to quickly calculate the loss of power associated with a pooling experiment and to make an informed decision on whether a pool-based GWAS is worth pursuing.
Meta-analysis (MA) is widely used to pool genome-wide association studies (GWASes), either a) to increase the power to detect strong or weak genotype effects or b) to verify results. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS consortia to avoid losing too many SNPs in an MA. YAMAS (Yet Another Meta Analysis Software), however, enables cross-GWAS conclusions before imputation runs, which are time-consuming, have been finished and polished.
Here we present a fast method to avoid forfeiting SNPs present in only a subset of studies, without relying on imputation. This is accomplished by using reference linkage disequilibrium data from 1,000 Genomes/HapMap projects to find proxy-SNPs together with in-phase alleles for SNPs missing in at least one study. MA is conducted by combining association effect estimates of a SNP and those of its proxy-SNPs. Our algorithm is implemented in the MA software YAMAS. Association results from GWAS analysis applications can be used as input files for MA, tremendously speeding up MA compared to the conventional imputation approach. We show that our proxy algorithm is well-powered and yields valuable ad hoc results, possibly providing an incentive for follow-up studies. We propose our method as a quick screening step prior to imputation-based MA, as well as an additional main approach for studies without available reference data matching the ethnicities of study participants. As a proof of principle, we analyzed six dbGaP Type II Diabetes GWAS and found that the proxy algorithm clearly outperforms naïve MA on the p-value level: for 17 out of 23 we observe an improvement on the p-value level by a factor of more than two, and a maximum improvement by a factor of 2127.
YAMAS is an efficient and fast meta-analysis program which offers various methods, including conventional MA as well as inserting proxy-SNPs for missing markers to avoid unnecessary power loss. MA with YAMAS can be readily conducted as YAMAS provides a generic parser for heterogeneous tabulated file formats within the GWAS field and avoids cumbersome setups. In this way, it supplements the meta-analysis process.
Genotype imputation is a vital tool in genome-wide association studies (GWAS) and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous and fastest-growing minority group in the US. However, genotype imputation resources for Latinos are at present rather limited compared to those for individuals of European ancestry, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angeles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR + CEU + YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation-based analyses in Latinos.
genotype imputation; Latino; HapMap Project; 1000 Genomes Project
Typically, the first phase of a genome wide association study (GWAS) includes genotyping across hundreds of individuals and validation of the most significant SNPs. Allelotyping of pooled genomic DNA is a common approach to reduce the overall cost of the study. Knowledge of haplotype structure can provide additional information to single locus analyses. Several methods have been proposed for estimating haplotype frequencies in a population from pooled DNA data.
We introduce a technique for haplotype frequency estimation in a population from pooled DNA samples focusing on datasets containing a small number of individuals per pool (2 or 3 individuals) and a large number of markers. We compare our method with the publicly available state-of-the-art algorithms HIPPO and HAPLOPOOL on datasets of varying number of pools and marker sizes. We demonstrate that our algorithm provides improvements in terms of accuracy and computational time over competing methods for large number of markers while demonstrating comparable performance for smaller marker sizes. Our method is implemented in the "Tree-Based Deterministic Sampling Pool" (TDSPool) package which is available for download at http://www.ee.columbia.edu/~anastas/tdspool.
Using a tree-based deterministic sampling technique, we present an algorithm for haplotype frequency estimation from pooled data. Our method demonstrates superior performance in datasets with a large number of markers and could be the method of choice for haplotype frequency estimation in such datasets.
Multimarker transmission/disequilibrium tests (TDTs) are powerful association and linkage tests used to perform genome-wide filtering in the search for disease susceptibility loci. In contrast to case/control studies, they have a low rate of false positives for population stratification and admixture. However, the length of a region found in association with a disease is usually very large because of linkage disequilibrium (LD). Here, we define a multimarker proportional TDT (mTDTP) designed to improve locus specificity in complex diseases that has good power compared to the most powerful multimarker TDTs. The test is a simple generalization of a multimarker TDT in which haplotype frequencies are used to weight the effect that each haplotype has on the whole measure. Two concepts underlie the features of the metric: the ‘common disease, common variant’ hypothesis and the decrease in LD with chromosomal distance. Because of this decrease, the frequency of haplotypes in strong LD with common disease variants decreases with increasing distance from the disease susceptibility locus. Thus, our haplotype proportional test has higher locus specificity than common multimarker TDTs that assume a uniform distribution of haplotype probabilities. Because of the common variant hypothesis, risk haplotypes at a given locus are relatively frequent and a metric that weights partial results for each haplotype by its frequency will be as powerful as the most powerful multimarker TDTs. Simulations and real data sets demonstrate that the test has good power compared with the best tests but has remarkably higher locus specificity, so that the association rate decreases at a higher rate with distance from a disease susceptibility or disease protective locus.
Imputation has been widely used in genome-wide association studies (GWAS) to infer genotypes of un-genotyped variants based on the linkage disequilibrium in external reference panels such as the HapMap and 1000 Genomes. However, imputation has only rarely been performed based on family relationships to infer genotypes of un-genotyped individuals. Using 8998 Framingham Heart Study (FHS) participants genotyped with Affymetrix 550K SNPs, we imputed genotypes of the same set of SNPs for an additional 3121 participants, most of whom were never genotyped due to lack of a DNA sample. Prior to imputation, 122 pedigrees were too large to be handled by the imputation software Merlin. Therefore, we developed a novel pedigree splitting algorithm that can maximize the number of genotyped relatives for imputing each un-genotyped individual, while keeping new sub-pedigrees under a pre-specified size. In GWAS of four phenotypes available in FHS (Alzheimer disease, circulating levels of fibrinogen, high-density lipoprotein cholesterol, and uric acid), we compared results using genotyped individuals only with results using both genotyped and imputed individuals. We studied the impact of applying different imputation quality filtering thresholds on the association results and did not find a universal threshold that always resulted in a more significant p-value for previously identified loci. However, most of these loci had a lower p-value when we only included imputed genotypes with ≥60% SNP- and ≥50% person-specific imputation certainty. In summary, we developed a novel algorithm for splitting large pedigrees for imputation and found a plausible imputation quality filtering threshold based on FHS. Further examination may be required to generalize this threshold to other studies.
By applying an imputation strategy based on the 1000 Genomes project to two genome-wide association studies (GWAS), we detected a susceptibility locus for venous thrombosis on chromosome 11p11.2 that was missed by previous GWAS analyses that had been conducted on the same datasets. A comprehensive linkage disequilibrium and haplotype analysis of the whole locus, where twelve SNPs exhibited association p-values lower than 2.23 × 10⁻¹¹, together with the use of independent case-control samples, demonstrated that the culprit variant was a rare variant located ∼1 Mb away from the original hits, not tagged by current genome-wide genotyping arrays and not even well imputed in the original GWAS samples. This variant was in fact rs1799963, also known as the FII G20210A prothrombin mutation. This work may be of major interest not only for its scientific impact but also for its methodological findings.
High coverage whole genome sequencing provides near complete information about genetic variation. However, other technologies can be more efficient in some settings by (a) reducing redundant coverage within samples and (b) exploiting patterns of genetic variation across samples. To characterize as many samples as possible, many genetic studies therefore employ lower coverage sequencing or SNP array genotyping coupled to statistical imputation. To compare these approaches individually and in conjunction, we developed a statistical framework to estimate genotypes jointly from sequence reads, array intensities, and imputation. In European samples, we find similar sensitivity (89%) and specificity (99.6%) from imputation with either 1× sequencing or 1 M SNP arrays. Sensitivity is increased, particularly for low-frequency polymorphisms (), when low coverage sequence reads are added to dense genome-wide SNP arrays — the converse, however, is not true. At sites where sequence reads and array intensities produce different sample genotypes, joint analysis reduces genotype errors and identifies novel error modes. Our joint framework informs the use of next-generation sequencing in genome wide association studies and supports development of improved methods for genotype calling.
In this work we address a series of questions prompted by the rise of next-generation sequencing as a data collection strategy for genetic studies. How does low coverage sequencing compare to traditional microarray based genotyping? Do studies increase sensitivity by collecting both sequencing and array data? What can we learn about technology error modes based on analysis of SNPs for which sequence and array data disagree? To answer these questions, we developed a statistical framework to estimate genotypes from sequence reads, array intensities, and imputation. Through experiments with intensity and read data from the Hapmap and 1000 Genomes (1000 G) Projects, we show that 1 M SNP arrays used for genome wide association studies perform similarly to 1× sequencing. We find that adding low coverage sequence reads to dense array data significantly increases rare variant sensitivity, but adding dense array data to low coverage sequencing has only a small impact. Finally, we describe an improved SNP calling algorithm used in the 1000 G project, inspired by a novel next-generation sequencing error mode identified through analysis of disputed SNPs. These results inform the use of next-generation sequencing in genetic studies and model an approach to further improve genotype calling methods.
Genome-wide association studies of pooled DNA samples have been shown to be a valuable tool for identifying candidate SNPs associated with a phenotype. No such study has so far been applied to childhood allergic asthma, even though the very high complexity of asthma genetics makes it an appropriate field in which to explore the potential of the pooled GWAS approach.
We performed a pooled GWAS and individual genotyping in 269 children with allergic respiratory diseases comparing allergic children with and without asthma. We used a modular approach to identify the most significant loci associated with asthma by combining silhouette statistics and physical distance method with cluster-adapted thresholding. We found 97% concordance between pooled GWAS and individual genotyping, with 36 out of 37 top-scoring SNPs significant at individual genotyping level. The most significant SNP is located inside the coding sequence of C5, an already identified asthma susceptibility gene, while the other loci regulate functions that are relevant to bronchial physiopathology, as immune- or inflammation-mediated mechanisms and airway smooth muscle contraction. Integration with gene expression data showed that almost half of the putative susceptibility genes are differentially expressed in experimental asthma mouse models.
Combined silhouette statistics and cluster-adapted physical distance threshold analysis of pooled GWAS data is an efficient method to identify candidate SNPs associated with asthma development in an allergic pediatric population.
Genome-wide association studies (GWAS) using array-based genotyping technology are widely used to identify genetic loci associated with complex diseases or other phenotypes. The costs of GWAS projects based on individual genotyping are still comparatively high and increase with the size of study populations. Genotyping using pooled DNA samples, also referred to as the allelotyping approach, offers an alternative at affordable cost. In the present study, data from 100 DNA samples individually genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0 were used to estimate the error of the pooling approach by comparing the results with those obtained using the same array type but with DNA pools each composed of 50 of the same samples. Newly developed and established methods for signal intensity correction were applied. Furthermore, the relative allele intensity signals (RAS) obtained by allelotyping were compared to the corresponding values derived from individual genotyping. Similarly, differences in RAS values between pools were determined and compared.
Regardless of the intensity correction method applied, the pooling-specific error of the pool intensity values was larger for single pools than for the comparison of the intensity values of two pools, which reflects the scenario of a case–control study. Using 50 pooled samples and analyzing 10,000 SNPs with a minor allele frequency of >1% and applying the best correction method for the corresponding type of comparison, the 90% quantile (median) of the pooling-specific absolute error of the RAS values for single sub-pools and the SNP-specific difference in allele frequency comparing two pools was 0.064 (0.026) and 0.056 (0.021), respectively.
Correction of the RAS values reduced the error of the RAS values when analyzing single pool intensities. We developed a new correction method with high accuracy but low computational costs. Correction of RAS, however, only marginally reduced the error of true differences between two sample groups and those obtained by allelotyping. Exclusion of SNPs with a minor allele frequency of ≤1% notably reduced the pooling-specific error. Our findings allow for improving the estimation of the pooling-specific error and may help in designing allelotyping studies using the Affymetrix Genome-Wide Human SNP Array 6.0.
In genome-wide association studies (GWAS), it is common practice to impute the genotypes of untyped single-nucleotide polymorphisms (SNPs) by exploiting the linkage disequilibrium structure among SNPs. Use of imputed genotypes improves genome coverage and makes it possible to perform meta-analysis combining results from studies genotyped on different platforms. A popular way of using imputed data is the “expectation-substitution” method, which treats the imputed dosage as if it were the true genotype. In current practice, the estimates given by the expectation-substitution method are usually combined in meta-analysis using an inverse variance weighting scheme. However, inverse variance weighting is not optimal, as the estimates given by the expectation-substitution method are generally biased. The optimal weight is, in fact, proportional to the inverse variance and the expected value of the effect size estimates. We show both theoretically and numerically that the bias of the estimates is very small under practical conditions of low effect sizes in GWAS. This finding validates the use of the expectation-substitution method, and shows that the inverse variance is a good approximation of the optimal weight. Through simulation, we compared the power of the inverse variance weighting method with several methods including the optimal weight, the regular z-score meta-analysis and a recently proposed “imputation aware” meta-analysis method [Zaitlen and Eskin (2010)]. Our results show that the performance of the inverse variance weight is always indistinguishable from the optimal weight and similar to or better than the other two methods.
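For readers unfamiliar with the inverse variance weighting scheme discussed above, the following minimal sketch shows the standard fixed-effect form: each study's effect estimate is weighted by the reciprocal of its squared standard error. It illustrates the textbook formula only; the "optimal" weight and the imputation-aware method compared in the abstract are not implemented, and the function name and the example numbers are hypothetical.

```python
import math
from typing import List, Tuple


def inverse_variance_meta(betas: List[float],
                          ses: List[float]) -> Tuple[float, float, float]:
    """Combine per-study effect estimates with weights w_i = 1 / se_i^2."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    # Pooled effect: weighted mean of the per-study estimates.
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    # Standard error of the pooled effect.
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se  # pooled effect, its SE, and the z statistic


# Two hypothetical studies of the same SNP (e.g. effects on imputed dosages).
beta, se, z = inverse_variance_meta([0.10, 0.20], [0.05, 0.10])
```

Here the first study has one quarter of the variance of the second, so it receives four times the weight, and the pooled estimate (0.12) lies closer to 0.10 than to 0.20.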
GWAS; imputation; bias; meta-analysis; weight
Correlations between Educational Attainment (EA) and measures of cognitive performance are as high as 0.8. This makes EA an attractive alternative phenotype for studies wishing to map genes affecting cognition due to the ease of collecting EA data compared to other cognitive phenotypes such as IQ.
In an Australian family sample of 9538 individuals we performed a genome-wide association scan (GWAS) using the imputed genotypes of ∼2.4 million single nucleotide polymorphisms (SNP) for a 6-point scale measure of EA. Top hits were checked for replication in an independent sample of 968 individuals. A gene-based test of association was then applied to the GWAS results. Additionally we performed prediction analyses using the GWAS results from our discovery sample to assess the percentage of EA and full scale IQ variance explained by the predicted scores.
The best SNP fell short of having a genome-wide significant p-value (p = 9.77 × 10⁻⁷). In our independent replication sample, six SNPs among the top 50 hits pruned for linkage disequilibrium (r² < 0.8) had a p-value < 0.05, but only one of these SNPs survived correction for multiple testing: rs7106258 (p = 9.7 × 10⁻⁴), located in an intergenic region of chromosome 11q14.1. The gene-based test results were non-significant, and our prediction analyses show that the predicted scores explained little variance in EA in our replication sample.
While we have identified a polymorphism on chromosome 11q14.1 associated with EA, further replication is warranted. Overall, the absence of genome-wide significant p-values in our large discovery sample confirmed the highly polygenic architecture of EA. Only the assembly of large samples or meta-analytic efforts will be able to assess the implication of common DNA polymorphisms in the etiology of EA.
Genome-wide association studies (GWAS) aim to detect single nucleotide polymorphisms (SNP) associated with trait variation. However, due to the large number of tests, standard analysis techniques impose highly stringent significance thresholds, leaving potentially associated SNPs undetected, and much of the trait genetic variation unexplained. Pathway- and network-based methodologies applied to GWAS aim to detect associations missed by standard single-marker approaches. The complex and non-random architecture of the genome makes it a challenge to derive an appropriate testing framework for such methodologies. We developed a rapid and simple permutation approach that uses GWAS SNP association results to establish the significance of pathway associations while accounting for the linkage disequilibrium structure of SNPs and the clustering of functionally related elements in the genome. All SNPs used in the GWAS are placed in a “circular genome” according to their location. Then the complete set of SNP association P values are permuted by rotation with respect to the genomic locations of the SNPs. Once these “simulated” P values are assigned, the joint gene P values are calculated using Fisher’s combination test, and the association of pathways is tested using the hypergeometric test. The circular genomic permutation approach was applied to a human genome-wide association dataset. The data consists of 719 individuals from the ORCADES study genotyped for ∼300,000 SNPs and measured for 51 traits ranging from physical to biochemical measurements. KEGG pathways (n = 225) were used as the sets of pathways to be tested. Our results demonstrate that the circular genomic permutations provide robust association P values. 
The non-permuted hypergeometric analysis generates ∼1400 pathway-trait combination results with an association P value more significant than P ≤ 0.05, whereas applying circular genomic permutation reduces the number of significant results to a more credible 40% of that value. The circular permutation software (“genomicper”) is available as an R package at http://cran.r-project.org/.
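The rotation and gene-level combination steps described above can be sketched as follows. This is a simplified illustration, not the genomicper implementation: it stops at an empirical gene-level P value and omits the hypergeometric pathway test, and the function names, the number of permutations, and the toy P values are all assumptions. SNPs are assumed to be already ordered by genomic position, so rotating the list keeps neighbouring P values together and preserves their local correlation.

```python
import math
import random
from typing import List, Sequence


def rotate(p_values: Sequence[float], offset: int) -> List[float]:
    """Shift P values around the circular genome, preserving their order."""
    k = offset % len(p_values)
    return list(p_values[-k:]) + list(p_values[:-k]) if k else list(p_values)


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Fisher's combination: -2*sum(ln p) is chi-square with 2k d.f."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half the chi-square statistic
    # Closed-form upper tail of a chi-square with an even number (2k) of d.f.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def permuted_gene_p(p_values: Sequence[float], gene_idx: Sequence[int],
                    n_perm: int = 1000, seed: int = 0) -> float:
    """Empirical P value of a gene's Fisher statistic under random rotations."""
    rng = random.Random(seed)
    observed = fisher_combined_p([p_values[i] for i in gene_idx])
    hits = 0
    for _ in range(n_perm):
        rotated = rotate(p_values, rng.randrange(1, len(p_values)))
        if fisher_combined_p([rotated[i] for i in gene_idx]) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy example: eight SNPs in genomic order; the gene spans SNPs 1 and 2.
p = [0.5, 0.01, 0.02, 0.7, 0.9, 0.3, 0.6, 0.8]
empirical = permuted_gene_p(p, gene_idx=[1, 2], n_perm=200)
```

Because only the starting point of the P value sequence changes, each permuted data set has the same run structure of small and large P values as the observed one, which is what distinguishes this scheme from shuffling P values independently.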
GWAS; pathway-based; permutation method; genomicper R package; cardiac disease
The high-throughput genotyping chips have contributed greatly to genome-wide association (GWA) studies to identify novel disease susceptibility single nucleotide polymorphisms (SNPs). The high-density chips are designed using two different SNP selection approaches, the direct gene-centric approach, and the indirect quasi-random SNPs or linkage disequilibrium (LD)-based tagSNPs approaches. Although all these approaches can provide high genome coverage and ascertain variants in genes, it is not clear to which extent these approaches could capture the common genic variants. It is also important to characterize and compare the differences between these approaches.
In our study, by using both the Phase II HapMap data and the disease variants extracted from OMIM, a gene-centric evaluation was first performed to evaluate the ability of the approaches to capture the disease variants in the Caucasian population. Then the distribution patterns of SNPs were also characterized in genic regions, evolutionarily conserved introns and nongenic regions, ontologies and pathways. The results show that, no matter which SNP selection approach is used, the current high-density SNP chips provide very high coverage in genic regions and can capture most of the known common disease variants under the HapMap frame. The results also show that the differences between the direct and the indirect approaches are relatively small. Both have similar SNP distribution patterns in these gene-centric characteristics.
This study suggests that the indirect approaches not only have the advantage of high coverage but also are useful for studies focusing on various functional SNPs either in genes or in the conserved regions that the direct approach supports. The study and the annotation of characteristics will be helpful for designing and analyzing GWA studies that aim to identify genetic risk factors involved in common diseases, especially variants in genes and conserved regions.
Genome-wide association studies with single nucleotide polymorphisms (SNPs) show great promise to identify genetic determinants of complex human traits. In current analyses, genotype calling and imputation of missing genotypes are usually considered as two separate tasks. The genotypes of SNPs are first determined one at a time from allele signal intensities. Then the missing genotypes, i.e., no-calls caused by imperfectly separated signal clouds, are imputed based on the linkage disequilibrium (LD) between multiple SNPs. Although many statistical methods have been developed to improve either genotype calling or imputation of missing genotypes, treating the two steps independently can lead to loss of genetic information.
We propose a novel genotype calling framework. In this framework, we consider the signal intensities and underlying LD structure of SNPs simultaneously by estimating both cluster parameters and haplotype frequencies. As a result, our new method outperforms some existing algorithms in terms of both call rates and genotyping accuracy. Our studies also suggest that jointly analyzing multiple SNPs in LD provides more accurate estimation of haplotypes than haplotype reconstruction methods that only use called genotypes.
Our study demonstrates that jointly analyzing signal intensities and LD structure of multiple SNPs is a better way to determine genotypes and estimate LD parameters.
Motivation: Genome-wide association studies (GWASs) have been widely used to map loci contributing to variation in complex traits and risk of diseases in humans. Accurate specification of familial relationships is crucial for family-based GWAS, as well as in population-based GWAS with unknown (or unrecognized) family structure. The family structure in a GWAS should be routinely investigated using the SNP data prior to the analysis of population structure or phenotype. Existing algorithms for relationship inference have a major weakness of estimating allele frequencies at each SNP from the entire sample, under a strong assumption of homogeneous population structure. This assumption is often untenable.
Results: Here, we present a rapid algorithm for relationship inference using high-throughput genotype data typical of GWAS that allows the presence of unknown population substructure. The relationship of any pair of individuals can be precisely inferred by robust estimation of their kinship coefficient, independent of sample composition or population structure (sample invariance). We present simulation experiments to demonstrate that the algorithm has sufficient power to provide reliable inference on millions of unrelated pairs and thousands of relative pairs (up to 3rd-degree relationships). Application of our robust algorithm to HapMap and GWAS datasets demonstrates that it performs properly even under extreme population stratification, while algorithms assuming a homogeneous population give systematically biased results. Our extremely efficient implementation performs relationship inference on millions of pairs of individuals in a matter of minutes, dozens of times faster than the most efficient existing algorithm known to us.
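The abstract does not state the estimator itself. As an illustration of how a kinship coefficient can be estimated from a pair's genotypes alone, without sample-wide allele frequencies, the sketch below uses the identity E[(X_i − X_j)²] = 4p(1 − p)(1 − 2φ), which holds under Hardy-Weinberg equilibrium when both individuals come from the same subpopulation, and replaces the unknown sum of 2p(1 − p) across SNPs with the smaller of the two individuals' heterozygote counts. The function name and input encoding are hypothetical, and the details may differ from what KING actually implements.

```python
def robust_kinship(geno_i, geno_j):
    """Pairwise kinship estimate from two genotype vectors.

    Genotypes are coded as 0, 1 or 2 copies of one allele, with None
    for a missing call. Only SNPs called in both individuals are used.
    """
    sq_diff = het_i = het_j = 0
    for a, b in zip(geno_i, geno_j):
        if a is None or b is None:
            continue
        sq_diff += (a - b) ** 2
        het_i += (a == 1)
        het_j += (b == 1)
    n_het = min(het_i, het_j)  # stand-in for the sum of 2p(1 - p)
    if n_het == 0:
        raise ValueError("no heterozygous genotypes; estimate undefined")
    return 0.5 - sq_diff / (4 * n_het)

# Duplicate samples (or monozygotic twins) give the maximum value of 0.5.
dup = [0, 1, 2, 1, 0, 1]
print(robust_kinship(dup, dup))
```

Expected values under this definition are 0.25 for parent-offspring and full-sibling pairs and 0 for unrelated pairs, so an estimate from real data is interpreted by its distance from these reference points.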
Availability: Our robust relationship inference algorithm is implemented in a freely available software package, KING, available for download at http://people.virginia.edu/~wc9c/KING.
Supplementary information: Supplementary data are available at Bioinformatics online.
New sequencing technologies have tremendously increased the number of known molecular markers (single nucleotide polymorphisms; SNPs) in a variety of species. Concurrently, improvements to genotyping technology have now made it possible to efficiently genotype large numbers of genome-wide distributed SNPs, enabling genome-wide association studies (GWAS). However, genotyping significant numbers of individuals with large numbers of SNPs remains prohibitively expensive for many research groups. A possible solution to this problem is to determine allele frequencies from pooled DNA samples; such ‘allelotyping’ has been presented as a cost-effective alternative to individual genotyping and has become popular in human GWAS. In this article we have tested the effectiveness of DNA pooling to obtain accurate allele frequency estimates for Atlantic salmon (Salmo salar L.) populations using an Illumina SNP-chip.
In total, 56 Atlantic salmon DNA pools from 14 populations were analyzed on an Atlantic salmon SNP-chip containing probes for 5568 SNP markers, 3928 of which were bi-allelic. We developed an efficient quality control filter which enables exclusion of loci showing a high error rate or a minor allele frequency (MAF) close to zero. After applying multiple quality control filters we obtained allele frequency estimates for 3631 bi-allelic loci. We observed high concordance (r > 0.99) between allele frequency estimates derived from individual genotyping and DNA pools. Our results also indicate that even relatively small DNA pools (35 individuals) can provide accurate allele frequency estimates for a given sample.
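The concordance check described above can be sketched as follows. This is a simplified illustration, not the study's pipeline: the pooled estimate is taken as the raw ratio of the two allele-specific signal intensities (with no correction for unequal amplification of the two alleles), the MAF cutoff of 0.01 is an arbitrary assumption, and all function names and numbers are made up for the example.

```python
import math

def freq_from_genotypes(genotypes):
    """Allele frequency from individual genotypes coded as 0/1/2 allele copies."""
    return sum(genotypes) / (2 * len(genotypes))

def freq_from_pool(intensity_a, intensity_b):
    """Naive pooled estimate: share of the total signal carried by allele A."""
    return intensity_a / (intensity_a + intensity_b)

def maf(freq):
    """Minor allele frequency for a bi-allelic locus."""
    return min(freq, 1.0 - freq)

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up per-locus values: individual-based frequency and pool intensities.
individual = [0.10, 0.25, 0.50, 0.75, 0.005]
pool_signal = [(120, 900), (260, 740), (510, 490), (730, 270), (4, 990)]
pooled = [freq_from_pool(a, b) for a, b in pool_signal]

# Drop loci whose individual-based MAF is close to zero, then compare.
keep = [k for k, f in enumerate(individual) if maf(f) >= 0.01]
r = pearson_r([individual[k] for k in keep], [pooled[k] for k in keep])
```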
Despite the higher level of variation associated with array replicates compared with pool construction, we suggest that both sources of variation should be taken into account. This study demonstrates that DNA pooling allows fast and high-throughput determination of allele frequencies in Atlantic salmon, enabling cost-efficient identification of informative markers for discrimination of populations at various geographical scales, as well as identification of loci controlling ecologically and economically important traits.
DNA pooling; Atlantic salmon; SNP; Allele frequency estimation; Allelotyping; Population genomics
Motivation: A challenging problem after a genome-wide association study (GWAS) is to balance the statistical evidence of genotype–phenotype correlation with a priori evidence of biological relevance.
Results: We introduce a method for systematically prioritizing single nucleotide polymorphisms (SNPs) for further study after a GWAS. The method combines evidence across multiple domains including statistical evidence of genotype–phenotype correlation, known pathways in the pathologic development of disease, SNP/gene functional properties, comparative genomics, prior evidence of genetic linkage, and linkage disequilibrium. We apply this method to a GWAS of nicotine dependence, and use simulated data to test it on several commercial SNP microarrays.
Availability: A comprehensive database of biological prioritization scores for all known SNPs is available at http://zork.wustl.edu/gin. This can be used to prioritize nicotine dependence association studies through a straightforward mathematical formula—no special software is necessary.
Supplementary information: Supplementary data are available at Bioinformatics online.
Testing a relatively small genomic region with a few hundred SNPs provides limited information. Genome-wide association studies (GWAS) provide an opportunity to overcome the limitation of candidate gene association studies. Here, we report the results of a GWAS for the responses to an NSAID analgesic.
Materials & methods
European Americans (60 females and 52 males) undergoing oral surgery were genotyped with the Affymetrix 500K SNP assay. Additional SNP genotyping was performed in the gene in linkage disequilibrium with the candidate SNP revealed by the GWAS.
GWAS revealed a candidate SNP (rs2562456) associated with analgesic onset, which is in linkage disequilibrium with a gene encoding a zinc finger protein. Additional SNP genotyping of ZNF429 confirmed the association with analgesic onset in humans (p = 1.8 × 10⁻¹⁰, degrees of freedom = 103, F = 28.3). We also found candidate loci for the maximum post-operative pain rating (rs17122021, p = 6.9 × 10⁻⁷) and post-operative pain onset time (rs6693882, p = 2.1 × 10⁻⁶); however, these genetic associations were not sustained after correcting for multiple comparisons.
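The abstract does not say which multiple-comparison correction was applied. As a rough check of why the two pain-rating loci fall short, the sketch below applies a Bonferroni correction, assuming a family-wise α of 0.05 and roughly 500,000 tests (the nominal size of the array; the number of SNPs actually tested after quality control is not given). Both numbers are assumptions for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

ALPHA = 0.05        # assumed family-wise error rate
N_TESTS = 500_000   # assumed number of SNPs tested (nominal array size)

threshold = bonferroni_threshold(ALPHA, N_TESTS)

# P values reported above for the two candidate loci.
candidates = {"rs17122021": 6.9e-7, "rs6693882": 2.1e-6}
survives = {snp: p <= threshold for snp, p in candidates.items()}

for snp, ok in survives.items():
    print(snp, "passes" if ok else "does not pass", f"threshold {threshold:.1e}")
```

Under these assumptions the per-test threshold is 1 × 10⁻⁷, and neither candidate locus reaches it.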
GWAS for acute clinical pain followed by additional SNP genotyping of a neighboring gene suggests that genetic variations in or near the loci encoding DNA binding proteins play a role in the individual variations in responses to analgesic drugs.
analgesic onset; GWAS; pain
Next generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website.
As next-generation sequencing (NGS) costs continue to fall and genome-wide association study (GWAS) platform coverage improves, the human genetics community is positioned to identify potentially causal variants. However, current NGS or imputation-based studies of either the whole genome or regions previously identified by GWAS have not yet been very successful in identifying causal variants. A major hurdle is the development of methods to distinguish disease-causing variants from their highly-correlated proxies within an associated region. We show that various common factors, such as differential sequencing or imputation accuracy rates and linkage disequilibrium patterns, with or without GWAS-informed region selection, can substantially decrease the probability of identifying the correct causal SNP, often by more than half. We then describe a novel and easy-to-implement re-ranking procedure that can double the probability that the causal SNP is top-ranked in many settings. Application to the NCI Breast and Prostate Cancer (BPC3) Cohort Consortium aggressive prostate cancer data identified new top SNPs within two associated loci previously established via GWAS, as well as several additional possible causal SNPs that had been previously overlooked.
Genome-wide association studies (GWAS) based on linkage disequilibrium (LD) provide a promising tool for the detection and fine mapping of quantitative trait loci (QTL) underlying complex agronomic traits. In this study we explored the genetic basis of variation for the traits heading date, plant height, thousand grain weight, starch content and crude protein content in a diverse collection of 224 spring barleys of worldwide origin. The whole panel was genotyped with a customized oligonucleotide pool assay containing 1536 SNPs using Illumina's GoldenGate technology resulting in 957 successful SNPs covering all chromosomes. The morphological trait "row type" (two-rowed spike vs. six-rowed spike) was used to confirm the high level of selectivity and sensitivity of the approach. This study describes the detection of QTL for the above mentioned agronomic traits by GWAS.
Population structure in the panel was investigated by various methods, which identified six subgroups that are mainly based on spike morphology and region of origin. We explored the patterns of linkage disequilibrium (LD) across the whole panel for all seven barley chromosomes. Average LD was observed to decay below a critical level (r² value of 0.2) within a map distance of 5-10 cM. Phenotypic variation within the panel was reasonably large for all the traits. The heritabilities calculated for each trait over multi-environment experiments ranged from 0.90 to 0.95. Different statistical models were tested to control spurious LD caused by population structure and to calculate the P-value of marker-trait associations. Using a mixed linear model with kinship for controlling spurious LD effects, we found a total of 171 significant marker-trait associations, which delineate into 107 QTL regions. Across all traits these can be grouped into 57 novel QTL and 50 QTL that are congruent with previously mapped QTL positions.
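For readers unfamiliar with the LD measure used above, the sketch below computes r² for a pair of bi-allelic loci from haplotype and allele frequencies, and then reads a decay distance off binned averages against the 0.2 threshold. The function names and the binned values are illustrative assumptions, not data or code from the study.

```python
def ld_r2(p_ab, p_a, p_b):
    """Squared correlation r^2 between alleles A and B at two loci.

    p_ab is the frequency of the A-B haplotype; p_a and p_b are the
    allele frequencies. D = p_ab - p_a * p_b is the LD coefficient.
    """
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def decay_distance(binned_r2, threshold=0.2):
    """First map distance at which mean r^2 drops below the threshold.

    binned_r2 is a list of (distance_cM, mean_r2) pairs sorted by distance.
    Returns None if LD never falls below the threshold.
    """
    for distance, mean_r2 in binned_r2:
        if mean_r2 < threshold:
            return distance
    return None

# Two loci in complete LD, and two loci in linkage equilibrium.
print(ld_r2(0.5, 0.5, 0.5))    # D = 0.25
print(ld_r2(0.25, 0.5, 0.5))   # D = 0

# Made-up binned averages, for illustration only.
print(decay_distance([(1, 0.6), (5, 0.3), (10, 0.15)]))
```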
Our results demonstrate that the described diverse barley panel can be efficiently used for GWAS of various quantitative traits, provided that population structure is appropriately taken into account. The observed significant marker trait associations provide a refined insight into the genetic architecture of important agronomic traits in barley. However, individual QTL account only for a small portion of phenotypic variation, which may be due to insufficient marker coverage and/or the elimination of rare alleles prior to analysis. The fact that the combined SNP effects fall short of explaining the complete phenotypic variance may support the hypothesis that the expression of a quantitative trait is caused by a large number of very small effects that escape detection. Notwithstanding these limitations, the integration of GWAS with biparental linkage mapping and an ever increasing body of genomic sequence information will facilitate the systematic isolation of agronomically important genes and subsequent analysis of their allelic diversity.