With the development of high-throughput sequencing and genotyping technologies, the number of markers collected in genetic association studies is growing rapidly, increasing the importance of methods for correcting for multiple hypothesis testing. The permutation test is widely considered the gold standard for accurate multiple testing correction, but it is often computationally impractical for these large datasets. Recently, several studies proposed efficient alternative approaches to the permutation test based on the multivariate normal distribution (MVN). However, they cannot accurately correct for multiple testing in genome-wide association studies for two reasons. First, these methods require partitioning of the genome into many disjoint blocks and ignore all correlations between markers from different blocks. Second, the true null distribution of the test statistic often fails to follow the asymptotic distribution at the tails of the distribution. We propose an accurate and efficient method for multiple testing correction in genome-wide association studies—SLIDE. Our method accounts for all correlation within a sliding window and corrects for the departure of the true null distribution of the statistic from the asymptotic distribution. In simulations using the Wellcome Trust Case Control Consortium data, the error rate of SLIDE's corrected p-values is more than 20 times smaller than the error rate of the previous MVN-based methods' corrected p-values, while SLIDE is orders of magnitude faster than the permutation test and other competing methods. We also extend the MVN framework to the problem of estimating the statistical power of an association study with correlated markers and propose an efficient and accurate power estimation method SLIP. SLIP and SLIDE are available at http://slide.cs.ucla.edu.
In genome-wide association studies, it is important to account for the fact that a large number of genetic variants are tested in order to adequately control for false positives. The simplest way to correct for multiple hypothesis testing is the Bonferroni correction, which multiplies the p-values by the number of markers assuming the markers are independent. Since the markers are correlated due to linkage disequilibrium, this approach leads to a conservative estimate of false positives, thus adversely affecting statistical power. The permutation test is considered the gold standard for accurate multiple testing correction, but is often computationally impractical for large association studies. We propose a method that efficiently and accurately corrects for multiple hypotheses in genome-wide association studies by fully accounting for the local correlation structure between markers. Our method also corrects for the departure of the true distribution of test statistics from the asymptotic distribution, which dramatically improves the accuracy, particularly when many rare variants are included in the tests. Our method shows a near identical accuracy to permutation and shows greater computational efficiency than previously suggested methods. We also provide a method to accurately and efficiently estimate the statistical power of genome-wide association studies.
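The Bonferroni correction described above can be sketched in a few lines. This is a minimal illustration rather than code from any of the cited tools; the function name `bonferroni_adjust` and the example p-values are assumptions made here.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Three hypothetical marker p-values; each is multiplied by 3.
adjusted = bonferroni_adjust([1e-8, 2e-4, 0.4])
```

Because the multiplier is the raw marker count, correlated markers are each charged as a full independent test, which is why the correction becomes conservative under linkage disequilibrium.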
By assaying hundreds of thousands of single nucleotide polymorphisms, genome wide association studies (GWAS) allow for a powerful, unbiased review of the entire genome to localize common genetic variants that influence health and disease. Although it is widely recognized that some correction for multiple testing is necessary, in order to control the family-wide Type 1 Error in genetic association studies, it is not clear which method to utilize. One simple approach is to perform a Bonferroni correction using all n single nucleotide polymorphisms (SNPs) across the genome; however this approach is highly conservative and would "overcorrect" for SNPs that are not truly independent. Many SNPs fall within regions of strong linkage disequilibrium (LD) ("blocks") and should not be considered "independent".
We proposed to approximate the number of "independent" SNPs by counting 1 SNP per LD block, plus all SNPs outside of blocks (interblock SNPs). We examined the effective number of independent SNPs for Genome Wide Association Study (GWAS) panels. In the CEPH Utah (CEU) population, by considering the interdependence of SNPs, we could reduce the total number of effective tests within the Affymetrix and Illumina SNP panels from 500,000 and 317,000 to 67,000 and 82,000 "independent" SNPs, respectively. For the Affymetrix 500 K and Illumina 317 K GWAS SNP panels we recommend using 10^-5, 10^-7 and 10^-8, and for the Phase II HapMap CEPH Utah and Yoruba populations we recommend using 10^-6, 10^-7 and 10^-9, as "suggestive", "significant" and "highly significant" p-value thresholds to properly control the family-wide Type 1 error.
By approximating the effective number of independent SNPs across the genome we are able to 'correct' for a more accurate number of tests and therefore develop 'LD adjusted' Bonferroni corrected p-value thresholds that account for the interdependence of SNPs on well-utilized commercially available SNP "chips". These thresholds will serve as guides to researchers trying to decide which regions of the genome should be studied further.
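The counting rule (one SNP per LD block plus every interblock SNP) and the resulting threshold can be sketched as follows. The block labels, function names and the default alpha of 0.05 are assumptions for illustration; the abstract does not state how blocks were encoded or how its rounded thresholds were derived.

```python
def effective_snp_count(block_ids):
    """Count one SNP per LD block plus every SNP outside a block.

    block_ids holds one entry per SNP: a block label, or None for an
    interblock SNP (a hypothetical encoding chosen for this sketch).
    """
    n_blocks = len({b for b in block_ids if b is not None})
    n_interblock = sum(1 for b in block_ids if b is None)
    return n_blocks + n_interblock


def ld_adjusted_threshold(n_effective, alpha=0.05):
    """Bonferroni threshold using the effective number of independent SNPs."""
    return alpha / n_effective


# Six SNPs: two blocks ("A", "B") and two interblock SNPs give 4 effective tests.
n_eff = effective_snp_count(["A", "A", "B", None, None, "B"])

# With the effective counts quoted above, 0.05 / 67,000 and 0.05 / 82,000
# both fall between 1e-7 and 1e-6.
affymetrix_threshold = ld_adjusted_threshold(67_000)
illumina_threshold = ld_adjusted_threshold(82_000)
```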
Large-scale genetic association studies can test hundreds of thousands of genetic markers for association with a trait. Since the genetic markers may be correlated, a Bonferroni correction is typically too stringent a correction for multiple testing. Permutation testing is a standard statistical technique for determining statistical significance when performing multiple correlated tests for genetic association. However, permutation testing for large-scale genetic association studies is computationally demanding and calls for optimized algorithms and software. PRESTO is a new software package for genetic association studies that performs fast computation of multiple-testing adjusted P-values via permutation of the trait.
PRESTO is an order of magnitude faster than other existing permutation testing software, and can analyze a large genome-wide association study (500 K markers, 5 K individuals, 1 K permutations) in approximately one hour of computing time. PRESTO has several unique features that are useful in a wide range of studies: it reports empirical null distributions for the top-ranked statistics (i.e. order statistics), it performs user-specified combinations of allelic and genotypic tests, it performs stratified analysis when sampled individuals are from multiple populations and each individual's population of origin is specified, and it determines significance levels for one and two-stage genotyping designs. PRESTO is designed for case-control studies, but can also be applied to trio data (parents and affected offspring) if transmitted parental alleles are coded as case alleles and untransmitted parental alleles are coded as control alleles.
PRESTO is a platform-independent software package that performs fast and flexible permutation testing for genetic association studies. The PRESTO executable file, Java source code, example data, and documentation are freely available at .
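Multiple-testing adjustment by permuting the trait can be sketched as below. This is not PRESTO's implementation: the test statistic (absolute difference in mean allele count between cases and controls), the function names and the toy data are all assumptions chosen to keep the example short. The essential step is that only the trait labels are shuffled, so the correlation between markers is preserved in every permutation.

```python
import random


def mean_diff_stats(genotypes, labels):
    """Per-marker |mean allele count in cases - mean allele count in controls|.

    genotypes[m][i] is the allele count of individual i at marker m;
    labels[i] is 1 for a case and 0 for a control.
    """
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    stats = []
    for marker in genotypes:
        case_mean = sum(marker[i] for i in cases) / len(cases)
        control_mean = sum(marker[i] for i in controls) / len(controls)
        stats.append(abs(case_mean - control_mean))
    return stats


def permutation_adjusted_p(genotypes, labels, n_perm=1000, seed=0):
    """Adjusted p-value per marker from the permutation null of the maximum statistic."""
    rng = random.Random(seed)
    observed = mean_diff_stats(genotypes, labels)
    exceed = [0] * len(observed)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the trait, keep genotypes fixed
        max_stat = max(mean_diff_stats(genotypes, shuffled))
        for m, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[m] += 1
    # Add-one estimate so that no adjusted p-value is exactly zero.
    return [(1 + e) / (1 + n_perm) for e in exceed]
```

Comparing each observed statistic with the maximum over all markers in a permutation is what makes the resulting p-values adjusted for the full set of correlated tests.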
To date, 39 SNPs have been associated with blood pressure (BP) or hypertension (HTN) in genome-wide association studies (GWAS) in Caucasians. Our hypothesis is that the loci/SNPs associated with BP/HTN are also associated with BP response to antihypertensive drugs.
Methods and Results
We assessed the association of these loci with BP response to atenolol or hydrochlorothiazide monotherapy in 768 hypertensive participants in the Pharmacogenomic Evaluation of Antihypertensive Responses (PEAR) study. Linear regression analysis was performed in Caucasians for each SNP in an additive model adjusting for baseline BP, age, gender and principal components for ancestry. Genetic scores were constructed to include SNPs with nominal associations and empirical p values were determined by permutation test. Genotypes of 37 loci were obtained from Illumina 50K cardiovascular or Omni1M GWAS chips. In Caucasians, no SNPs reached the Bonferroni-corrected alpha of 0.0014, six reached nominal significance (p < 0.05) and three were associated with atenolol BP response at p < 0.01. The genetic score of the atenolol BP lowering alleles was associated with response to atenolol (p = 3.3×10^-6 for SBP; p = 1.6×10^-6 for DBP). The genetic score of the HCTZ BP lowering alleles was associated with response to HCTZ (p = 0.0006 for SBP; p = 0.0003 for DBP). Both risk score p values were < 0.01 based on the empirical distribution from the permutation test.
These findings suggest selected signals from hypertension GWAS may predict BP response to atenolol and HCTZ when assessed through risk scoring.
beta-blocker; diuretics; hypertension; pharmacogenetics; polymorphisms; blood pressure
Non-syndromic cleft lip with or without cleft palate (NSCL/P) is a common disorder with complex etiology. The Bone Morphogenetic Protein 4 gene (BMP4) has been considered a prime candidate gene with evidence accumulated from animal experimental studies, human linkage studies, as well as candidate gene association studies. The aim of the current study is to test for linkage and association between BMP4 and NSCL/P that could be missed in genome-wide association studies (GWAS) when genotypic (G) main effects alone were considered.
We performed the analysis considering G and interactions with multiple maternal environmental exposures using additive conditional logistic regression models in 895 Asian and 681 European complete NSCL/P trios. Single nucleotide polymorphisms (SNPs) that passed the quality control criteria among 122 genotyped and 25 imputed single nucleotide variants in and around the gene were used in analysis. Selected maternal environmental exposures during 3 months prior to and through the first trimester of pregnancy included any personal tobacco smoking, any environmental tobacco smoke in home, work place or any nearby places, any alcohol consumption and any use of multivitamin supplements. A novel significant association held for rs7156227 among Asian NSCL/P and non-syndromic cleft lip and palate (NSCLP) trios after Bonferroni correction which was not seen when G main effects alone were considered in either allelic or genotypic transmission disequilibrium tests. Odds ratios for carrying one copy of the minor allele without maternal exposure to any of the four environmental exposures were 0.58 (95% CI = 0.44, 0.75) and 0.54 (95% CI = 0.40, 0.73) for Asian NSCL/P and NSCLP trios, respectively. The Bonferroni P values corrected for the total number of 117 tested SNPs were 0.0051 (asymptotic P = 4.39×10^-5) and 0.0065 (asymptotic P = 5.54×10^-5), respectively. In European trios, no significant association was seen for any SNPs after Bonferroni corrections for the total number of 120 tested SNPs.
Our findings add evidence from GWAS to support the role of BMP4 in susceptibility to NSCL/P originally identified in linkage and candidate gene association studies.
Genome-wide association studies (GWAS) have identified many loci associated with breast cancer risk. These studies have primarily been conducted in populations of European descent.
To determine whether previously reported susceptibility loci in other ethnic groups are also risk factors for breast cancer in a Chinese population.
We genotyped 21 previously reported single nucleotide polymorphisms (SNPs) within a female Chinese cohort of 1203 breast cancer cases and 2525 healthy controls using the Sequenom iPlex platform. Fourteen SNPs passed the quality control test. These SNPs were subjected to statistical analysis for the entire cohort and were further analyzed for estrogen receptor (ER) status. The associations of the SNPs with disease susceptibility were assessed using logistic regression, adjusting for age. The Bonferroni correction was used to conservatively account for multiple testing, and the threshold for statistical significance was P < 3.57×10^-3 (0.05/14).
Although none of the SNPs showed an overall association with breast cancer, an analysis of the ER status of the breast cancer patients revealed that the SIAH2 locus (rs6788895; P = 5.73×10^-4, odds ratio [OR] = 0.81) is associated with ER-positive breast cancer.
A common variant in the SIAH2 locus is associated with ER-positive breast cancer in the Chinese Han population. The replication of published GWAS results in other ethnic groups provides important information regarding the genetic etiology of breast cancer.
Association mapping is a powerful approach for dissecting the genetic architecture of complex quantitative traits using high-density SNP markers in maize. Here, we expanded our association panel size from 368 to 513 inbred lines with 0.5 million high quality SNPs using a two-step data-imputation method which combines identity by descent (IBD) based projection and the k-nearest neighbor (KNN) algorithm. Genome-wide association studies (GWAS) were carried out for 17 agronomic traits with a panel of 513 inbred lines applying both the mixed linear model (MLM) and a new method, the Anderson-Darling (A-D) test. Ten loci for five traits were identified using the MLM method at the Bonferroni-corrected threshold −log10(P) > 5.74 (α = 1). Between one and 34 loci per trait (107 loci for plant height) were identified for the 17 traits using the A-D test at the Bonferroni-corrected threshold −log10(P) > 7.05 (α = 0.05) using 556809 SNPs. Many known loci and new candidate loci were only observed by the A-D test, a few of which were also detected in independent linkage analysis. This study indicates that combining IBD based projection and the KNN algorithm is an efficient imputation method for inferring large missing genotype segments. In addition, we showed that the A-D test is a useful complement for GWAS analysis of complex quantitative traits. Especially for traits with abnormal phenotype distribution, controlled by moderate effect loci or rare variations, the A-D test balances false positives and statistical power. The candidate SNPs and associated genes also provide a rich resource for maize genetics and breeding.
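The two Bonferroni thresholds quoted above follow from −log10(α / N), where N is the number of SNPs tested. The short sketch below reproduces the arithmetic for N = 556809; the function name is an assumption, and the abstract does not say how its values were rounded.

```python
import math


def bonferroni_neglog10_threshold(alpha, n_tests):
    """Return -log10(alpha / n_tests), the Bonferroni threshold on the -log10(P) scale."""
    return -math.log10(alpha / n_tests)


n_snps = 556809
mlm_threshold = bonferroni_neglog10_threshold(1.0, n_snps)   # alpha = 1, about 5.7
ad_threshold = bonferroni_neglog10_threshold(0.05, n_snps)   # alpha = 0.05, about 7.05
```

Setting α = 1 corresponds to a threshold of 1/N, so the two thresholds differ by exactly −log10(0.05), about 1.3 units.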
Genotype imputation has been used widely in the analysis of genome-wide association studies (GWAS) to boost power and fine-map associations. We developed a two-step data imputation method to meet the challenge of large proportion missing genotypes. GWAS have uncovered an extensive genetic architecture of complex quantitative traits using high-density SNP markers in maize in the past few years. Here, GWAS were carried out for 17 agronomic traits with a panel of 513 inbred lines applying both mixed linear model and a new method, the Anderson-Darling (A-D) test. We intend to show that the A-D test is a complement to current GWAS methods, especially for complex quantitative traits controlled by moderate effect loci or rare variations and with abnormal phenotype distribution. In addition, the traits associated QTL identified here provide a rich resource for maize genetics and breeding.
Genome-wide association studies (GWAS) are applied to identify genetic loci that are associated with complex traits and human diseases. Analogous to the evolution of gene expression analyses, pathway analyses have emerged as important tools to uncover functional networks of genome-wide association data. Usually, pathway analyses combine statistical methods with a priori available biological knowledge. To determine significance thresholds for associated pathways, correction for multiple testing and over-representation permutation testing are applied.
We systematically investigated the impact of three different permutation test approaches for over-representation analysis to detect false positive pathway candidates and evaluate them on genome-wide association data of Dilated Cardiomyopathy (DCM) and Ulcerative Colitis (UC). Our results provide evidence that the gold standard - permuting the case–control status – effectively improves specificity of GWAS pathway analysis. Although permutation of SNPs does not maintain linkage disequilibrium (LD), these permutations represent an alternative for GWAS data when case–control permutations are not possible. Gene permutations, however, did not add significantly to the specificity. Finally, we provide estimates on the required number of permutations for the investigated approaches.
To discover potential false positive functional pathway candidates and to support the results from standard statistical tests such as the hypergeometric test, permutation tests of case–control data should be carried out. The most reasonable choice was case–control permutation; if this is not possible, SNP permutations may be carried out. Our study also demonstrates that significance values converge rapidly with an increasing number of permutations. By applying the described statistical framework we were able to discover axon guidance, focal adhesion and calcium signaling as important DCM-related pathways and Intestinal immune network for IgA production as the most significant UC pathway.
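The hypergeometric over-representation test mentioned above can be written directly from binomial coefficients. This is a generic sketch of the standard upper-tail test, not the authors' pipeline; the function and argument names are assumptions.

```python
from math import comb


def hypergeom_upper_tail(n_total, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric.

    n_total:    genes in the background set
    n_pathway:  background genes that belong to the pathway
    n_selected: genes called associated (drawn without replacement)
    n_overlap:  associated genes that fall in the pathway
    """
    upper = min(n_pathway, n_selected)
    numerator = sum(
        comb(n_pathway, k) * comb(n_total - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return numerator / comb(n_total, n_selected)


# Hypothetical numbers: 4 background genes, 2 in the pathway, 2 selected, both in the pathway.
p_value = hypergeom_upper_tail(4, 2, 2, 2)  # 1 / 6
```

A permutation check in the spirit of the text would recompute the overlap after shuffling case–control labels and rerunning the association analysis, then compare the observed p-value with the permuted ones.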
DCM; UC; GWAS; Permutation tests; Pathway analysis
Substantial genotyping data produced by current high-throughput technologies have brought opportunities and difficulties. With the number of single-nucleotide polymorphisms (SNPs) going into millions comes the harsh challenge of multiple-testing adjustment. However, even with the false discovery rate (FDR) control approach, a genome-wide association study (GWAS) may still fall short of discovering any true positive gene, particularly when it has a relatively small sample size.
To counteract such a harsh multiple-testing penalty, in this report, we incorporate findings from previous linkage and association studies to re-analyze a GWAS on age-related macular degeneration. While previous Bonferroni correction and the traditional FDR approach detected only one significant SNP (rs380390), here we have been able to detect seven significant SNPs with an easy-to-implement prioritized subset analysis (PSA) with the overall FDR controlled at 0.05. These include SNPs within three genes: CFH, CFHR4, and SGCD.
Based on the success of this example, we advocate using the simple method of PSA to facilitate discoveries in future GWASs.
Multiple testing corrections are an active research topic in genetic association studies, especially for genome-wide association studies (GWAS), where tests of association with traits are now conducted at millions of imputed SNPs with estimated allelic dosages. Failure to address multiple comparisons appropriately can introduce excess false positive results and make subsequent studies following up those results inefficient. Permutation tests are considered the gold standard in multiple testing adjustment; however, this procedure is computationally demanding, especially for GWAS. Notably, the permutation thresholds for the huge number of estimated allelic dosages in real data sets have not been reported. Although many researchers have recently developed algorithms to rapidly approximate the permutation thresholds with accuracy similar to the permutation test, these methods have not been verified with estimated allelic dosages. In this study, we compare recently published multiple testing correction methods using 2.5M estimated allelic dosages. We also derive permutation significance levels based on 10,000 GWAS results under the null hypothesis of no association. Our results show that the simpleM method works well with estimated allelic dosages and gives the closest approximation to the permutation threshold while requiring the least computation time.
multiple testing; genome-wide association studies; imputed SNPs; allelic dosages
Genome-wide association studies commonly involve simultaneous tests of millions of single nucleotide polymorphisms (SNP) for disease association. The SNPs in nearby genomic regions, however, are often highly correlated due to linkage disequilibrium (LD, a genetic term for correlation). Simple Bonferroni correction for multiple comparisons is therefore too conservative. Permutation tests, which are often employed in practice, are both computationally expensive for genome-wide studies and limited in their scope. We present an accurate and computationally efficient method, based on Poisson de-clumping heuristics, for approximating genome-wide significance of SNP associations. Compared with permutation tests and other multiple comparison adjustment approaches, our method computes the most accurate and robust p-value adjustments for millions of correlated comparisons within seconds. We demonstrate analytically that the accuracy and the efficiency of our method are nearly independent of the sample size, the number of SNPs, and the scale of p-values to be adjusted. In addition, our method can be easily adapted to estimate false discovery rate. When applied to genome-wide SNP datasets, we observed highly variable p-value adjustment results evaluated from different genomic regions. The variation in adjustments along the genome, however, is well conserved between the European and the African populations. The p-value adjustments are significantly correlated with LD among SNPs, recombination rates, and SNP densities. Given the large variability of sequence features in the genome, we further discuss a novel approach of using SNP-specific (local) thresholds to detect genome-wide significant associations. This article has supplementary material online.
Genome-wide association study; Multiple comparison; Poisson approximation
Genome-wide association studies (GWAS) aim to detect single nucleotide polymorphisms (SNP) associated with trait variation. However, due to the large number of tests, standard analysis techniques impose highly stringent significance thresholds, leaving potentially associated SNPs undetected, and much of the trait genetic variation unexplained. Pathway- and network-based methodologies applied to GWAS aim to detect associations missed by standard single-marker approaches. The complex and non-random architecture of the genome makes it a challenge to derive an appropriate testing framework for such methodologies. We developed a rapid and simple permutation approach that uses GWAS SNP association results to establish the significance of pathway associations while accounting for the linkage disequilibrium structure of SNPs and the clustering of functionally related elements in the genome. All SNPs used in the GWAS are placed in a “circular genome” according to their location. Then the complete set of SNP association P values are permuted by rotation with respect to the genomic locations of the SNPs. Once these “simulated” P values are assigned, the joint gene P values are calculated using Fisher’s combination test, and the association of pathways is tested using the hypergeometric test. The circular genomic permutation approach was applied to a human genome-wide association dataset. The data consists of 719 individuals from the ORCADES study genotyped for ∼300,000 SNPs and measured for 51 traits ranging from physical to biochemical measurements. KEGG pathways (n = 225) were used as the sets of pathways to be tested. Our results demonstrate that the circular genomic permutations provide robust association P values. 
The non-permuted hypergeometric analysis generates ∼1400 pathway-trait combination results with an association P value more significant than P ≤ 0.05, whereas applying circular genomic permutation reduces the number of significant results to a more credible 40% of that value. The circular permutation software (“genomicper”) is available as an R package at http://cran.r-project.org/.
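The rotation step of the circular genomic permutation can be sketched as below. This is an illustration of the idea only and does not use the genomicper package's API; the function name and the toy P values are assumptions. The list must already be ordered by genomic position so that the rotation keeps neighbouring SNPs, and hence their LD-induced similarity, together.

```python
import random


def circular_permutation(p_values, rng):
    """Rotate genome-ordered P values by a random non-zero offset.

    The order along the "circular genome" is preserved; only the starting
    point changes, so each SNP position receives the P value of a SNP a
    fixed number of steps away. Requires at least two P values.
    """
    offset = rng.randrange(1, len(p_values))
    return p_values[offset:] + p_values[:offset]


rng = random.Random(42)
ordered_p = [0.01, 0.20, 0.50, 0.03, 0.90]  # hypothetical, in genomic order
simulated_p = circular_permutation(ordered_p, rng)
```

Each rotated set would then be passed through the same gene-level and pathway-level tests as the observed data to build the null distribution.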
GWAS; pathway-based; permutation method; genomicper R package; cardiac disease
Determination of the relevance of both demanding classical epidemiologic criteria for control selection and robust handling of population stratification (PS) represents a major challenge in the design and analysis of genome-wide association studies (GWAS). Empirical data from two GWAS in European Americans of the Cancer Genetic Markers of Susceptibility (CGEMS) project were used to evaluate the impact of PS in studies with different control selection strategies. In each of the two original case-control studies nested in corresponding prospective cohorts, a minor confounding effect due to PS (inflation factor λ of 1.025 and 1.005) was observed. In contrast, when the control groups were exchanged to mimic a cost-effective but theoretically less desirable control selection strategy, the confounding effects were larger (λ of 1.090 and 1.062). A panel of 12,898 autosomal SNPs common to both the Illumina and Affymetrix commercial platforms and with low local background linkage disequilibrium (pair-wise r^2 < 0.004) was selected to infer population substructure with principal component analysis. A novel permutation procedure was developed for the correction of PS that identified a smaller set of principal components and achieved a better control of type I error (to λ of 1.032 and 1.006, respectively) than currently used methods. The overlap between sets of SNPs in the bottom 5% of p-values based on the new test and the test without PS correction was about 80%, with the majority of discordant SNPs having both ranks close to the threshold. Thus, for the CGEMS GWAS of prostate and breast cancer conducted in European Americans, PS does not appear to be a major problem in well-designed studies. A study using suboptimal controls can have acceptable type I error when an effective strategy for the correction of PS is employed.
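The inflation factor λ reported above is commonly computed as the median observed association statistic divided by the median expected under the null. The sketch below assumes that usual genomic-control definition with 1-df chi-square statistics; the abstract itself does not spell out the formula, and the names are illustrative.

```python
from statistics import NormalDist, median

# Median of a 1-df chi-square distribution: the square of the 75th
# percentile of the standard normal, about 0.455.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(chi2_stats):
    """Genomic inflation factor: observed median over the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

A value near 1 (such as the 1.005 and 1.025 above) indicates little inflation of the test statistics; values further above 1 suggest confounding such as population stratification.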
Through recent genome-wide association studies (GWAS), several groups have reported significant association between variants in the alpha 1C subunit of the L-type voltage-gated calcium channel (CACNA1C) and bipolar disorder (BP) in European and European-American cohorts. We performed a family-based association study to determine whether CACNA1C is associated with BP in the Latino population.
This study consisted of 913 individuals from 215 Latino pedigrees recruited from the United States, Mexico, Guatemala, and Costa Rica. The Illumina GoldenGate Genotyping Assay was used to genotype 58 single-nucleotide polymorphisms (SNPs) that spanned a 602.9 kb region encompassing the CACNA1C gene including two SNPs (rs7297582 and rs1006737) previously shown to associate with BP. Individual SNP and haplotype association analyses were performed using Family-Based Association Test (version 2.0.3) and Haploview (version 4.2) software.
An eight-locus haplotype block that included these two markers showed significant association with BP (global marker permuted p = 0.0018) in the Latino population. For individual SNPs, this sample had insufficient power (10%) to detect associations with SNPs with minor effect (odds ratio = 1.15).
Although we were not able to replicate findings of association between individual CACNA1C SNPs rs7297582 and rs1006737 and BP, we were able to replicate the GWAS signal reported for CACNA1C through a haplotype analysis that encompassed these previously reported significant SNPs. These results provide additional evidence that CACNA1C is associated with BP and provide the first evidence that variations in this gene might play a role in the pathogenesis of this disorder in the Latino population.
bipolar disorder; calcium channels; genetic association studies; haplotypes; Hispanic Americans; L-type; pedigree; polymorphism; single nucleotide
Permutation testing is a robust and popular approach for significance testing in genomic research, which has the broad advantage of estimating significance non-parametrically, thereby safeguarding against inflated type I error rates. However, computational efficiency remains a challenging issue that limits its wide application, particularly in genome-wide association studies (GWAS). Because of this, adaptive permutation strategies can be employed to make permutation approaches feasible. While these approaches have been used in practice, there is little research into the statistical properties of these approaches, and little guidance into the proper application of such a strategy for accurate p-value estimation at the GWAS level.
In this work, we advocate an adaptive permutation procedure that is statistically valid as well as computationally feasible in GWAS. We perform extensive simulation experiments to evaluate the robustness of the approach to violations of modeling assumptions and compare the power of the adaptive approach versus standard approaches. We also evaluate the parameter choices in implementing the adaptive permutation approach to provide guidance on proper implementation in real studies. Additionally, we provide an example of the application of adaptive permutation testing on real data.
The results provide sufficient evidence that the adaptive test is robust to violations of modeling assumptions. In addition, even when modeling assumptions are correct, the power achieved by adaptive permutation is identical to the parametric approach over a range of significance thresholds and effect sizes under the alternative. A framework for proper implementation of the adaptive procedure is also generated.
While the adaptive permutation approach presented here is not novel, the current study provides evidence of the validity of the approach, and importantly provides guidance on the proper implementation of such a strategy. Additionally, tools are made available to aid investigators in implementing these approaches.
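One common form of adaptive permutation stops early once enough permuted statistics have reached the observed one, so that clearly non-significant markers consume few permutations. The sketch below assumes that stopping rule (in the style of sequential Monte Carlo tests); the abstract does not state the exact rule or parameter values, and `draw_null_stat`, `min_hits` and `max_perm` are hypothetical names.

```python
import random


def adaptive_permutation_p(observed, draw_null_stat, max_perm=100_000, min_hits=10):
    """Estimate a permutation p-value with early stopping.

    draw_null_stat is a callable returning the test statistic from one fresh
    permutation of the data. Sampling stops as soon as min_hits permuted
    statistics are at least as large as the observed one, or after max_perm
    permutations. Returns (p_estimate, permutations_used).
    """
    hits = 0
    for n_done in range(1, max_perm + 1):
        if draw_null_stat() >= observed:
            hits += 1
            if hits >= min_hits:
                return hits / n_done, n_done
    # Budget exhausted: add-one estimate keeps the p-value above zero.
    return (hits + 1) / (max_perm + 1), max_perm


# Toy usage: a standard normal draw stands in for a permuted statistic.
rng = random.Random(1)
p_estimate, n_used = adaptive_permutation_p(0.5, lambda: rng.gauss(0.0, 1.0))
```

The choice of `min_hits` and `max_perm` trades precision of small p-values against computing time, which is the kind of parameter guidance the study sets out to provide.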
New technology for large-scale genotyping has created new challenges for statistical analysis. Correcting for multiple comparison without discarding true positive results and extending methods to triad studies are among the important problems facing statisticians. We present a one-sample permutation test for testing transmission disequilibrium hypotheses in triad studies, and show how this test can be used for multiple single nucleotide polymorphism (SNP) testing. The resulting multiple comparison procedure is shown in the case of the transmission disequilibrium test to control the familywise error. Furthermore, this procedure can handle multiple possible modes of risk inheritance per SNP. The resulting permutational procedure is shown through simulation of SNP data to be more powerful than the Bonferroni procedure when the SNPs are in linkage disequilibrium. Moreover, permutations implicitly avoid any multiple comparison correction penalties when the SNP has a rare allele. The method is illustrated by analyzing a large candidate gene study of neural tube defects and an independent study of oral clefts, where the smallest adjusted p-values using the permutation procedure are approximately half those of the Bonferroni procedure. We conclude that permutation tests are more powerful for identifying disease-associated SNPs in candidate gene studies and are useful for analysis of triad studies.
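A one-sample permutation test for transmission data can be sketched with random sign flips: under the null, a heterozygous parent is equally likely to transmit either allele, so the sign of each family's transmitted-minus-untransmitted difference is arbitrary. The sketch below is an assumption-laden illustration, not the authors' procedure: the statistic, the data layout and the function names are chosen here, and it covers only a single mode of inheritance.

```python
import random


def tdt_like_stats(diffs):
    """Per-SNP |sum of differences| / sqrt(sum of squared differences).

    diffs[f][s] is the transmitted minus untransmitted minor-allele count
    for family f at SNP s.
    """
    stats = []
    for s in range(len(diffs[0])):
        total = sum(fam[s] for fam in diffs)
        squares = sum(fam[s] ** 2 for fam in diffs)
        stats.append(abs(total) / squares ** 0.5 if squares else 0.0)
    return stats


def sign_flip_adjusted_p(diffs, n_perm=1000, seed=0):
    """Familywise-adjusted p-values from the sign-flip null of the maximum statistic."""
    rng = random.Random(seed)
    observed = tdt_like_stats(diffs)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        flipped = []
        for fam in diffs:
            sign = rng.choice((-1, 1))  # one sign per family keeps LD between SNPs intact
            flipped.append([sign * d for d in fam])
        max_stat = max(tdt_like_stats(flipped))
        for s, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[s] += 1
    return [(1 + e) / (1 + n_perm) for e in exceed]
```

Flipping all SNPs of a family together preserves the correlation between SNPs, which is where the gain over Bonferroni under linkage disequilibrium comes from.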
Exchangeable; familywise error rate; linkage disequilibrium; power
For genome-wide association studies in family-based designs, we propose a powerful two-stage testing strategy that can be applied in situations in which parent-offspring trio data are available and all offspring are affected with the trait or disease under study. In the first step of the testing strategy, we construct estimators of genetic effect size in the completely ascertained sample of affected offspring and their parents that are statistically independent of the family-based association/transmission disequilibrium tests (FBATs/TDTs) that are calculated in the second step of the testing strategy. For each marker, the genetic effect is estimated (without requiring an estimate of the SNP allele frequency) and the conditional power of the corresponding FBAT/TDT is computed. Based on the power estimates, a weighted Bonferroni procedure assigns an individually adjusted significance level to each SNP. In the second stage, the SNPs are tested with the FBAT/TDT statistic at the individually adjusted significance levels. Using simulation studies for scenarios with up to 1,000,000 SNPs, varying allele frequencies and genetic effect sizes, the power of the strategy is compared with standard methodology (e.g., FBATs/TDTs with Bonferroni correction). In all considered situations, the proposed testing strategy demonstrates substantial power increases over the standard approach, even when the true genetic model is unknown and must be selected based on the conditional power estimates. The practical relevance of our methodology is illustrated by an application to a genome-wide association study for childhood asthma, in which we detect two markers meeting genome-wide significance that would not have been detected using standard methodology.
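The weighted Bonferroni step can be sketched as splitting the overall significance level across SNPs in proportion to a weight. The proportional rule and the names below are assumptions for illustration; the abstract says only that the individually adjusted levels are based on the conditional power estimates, not how the weights are formed.

```python
def weighted_bonferroni_levels(weights, alpha=0.05):
    """Split alpha across tests in proportion to non-negative weights.

    The per-test levels sum to alpha, so the familywise error rate stays
    controlled as long as the weights are independent of the test statistics.
    """
    total = sum(weights)
    return [alpha * w / total for w in weights]


def weighted_bonferroni_reject(p_values, weights, alpha=0.05):
    """Reject each hypothesis whose p-value is at or below its own level."""
    levels = weighted_bonferroni_levels(weights, alpha)
    return [p <= level for p, level in zip(p_values, levels)]


# Hypothetical power-based weights: the first SNP receives most of alpha.
decisions = weighted_bonferroni_reject([0.02, 0.02, 0.001], [8.0, 1.0, 1.0])
```

With equal weights this reduces to the ordinary Bonferroni correction; the power gain comes from giving promising SNPs a larger share of alpha.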
The current state of genotyping technology has enabled researchers to conduct genome-wide association studies of up to 1,000,000 SNPs, allowing for systematic scanning of the genome for variants that might influence the development and progression of complex diseases. One of the largest obstacles to the successful detection of such variants is the multiple comparisons/testing problem in the genetic association analysis. For family-based designs in which all offspring are affected with the disease/trait under study, we developed a methodology that addresses this problem by partitioning the family-based data into two statistically independent components. The first component is used to screen the data and determine the most promising SNPs. The second component is used to test the SNPs for association, where information from the screening is used to weight the SNPs during testing. This methodology is more powerful than standard procedures for multiple comparisons adjustment (i.e., Bonferroni correction). Additionally, as only one data set is required for screening and testing, our testing strategy is less susceptible to study heterogeneity. Finally, as many family-based studies collect data only from affected offspring, this method addresses a major limitation of previous methodologies for multiple comparisons in family-based designs, which require variation in the disease/trait among offspring.
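A weighted Bonferroni procedure can be sketched in a few lines. The version below assumes the simplest scheme, in which each SNP's significance level is proportional to a non-negative weight (for example a conditional power estimate from the screening step) and the levels sum to the overall alpha. The weighting rule used in the study itself may differ, and the function names are hypothetical.

```python
def weighted_bonferroni_levels(weights, alpha=0.05):
    """Split alpha across tests in proportion to non-negative weights."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    # The levels sum to alpha, so the union bound still controls the FWER,
    # provided the weights are independent of the test statistics.
    return [alpha * w / total for w in weights]


def weighted_bonferroni_reject(pvalues, weights, alpha=0.05):
    """Return one rejection decision per test."""
    levels = weighted_bonferroni_levels(weights, alpha)
    return [p <= level for p, level in zip(pvalues, levels)]


# Toy example: the first SNP has a high screening weight.
toy_pvalues = [0.02, 0.5, 0.5, 0.5]
toy_weights = [0.7, 0.1, 0.1, 0.1]
toy_weighted = weighted_bonferroni_reject(toy_pvalues, toy_weights)
toy_unweighted = weighted_bonferroni_reject(toy_pvalues, [1, 1, 1, 1])
```

In the toy example the first SNP is rejected under the weighted levels but not under equal weights, where each test is held to alpha divided by four. The validity of the procedure rests on the screening information being statistically independent of the test statistics, as the two-component partition described above is designed to guarantee.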
Total cholesterol, low-density lipoprotein cholesterol, triglyceride, and high-density lipoprotein cholesterol (HDL-C) levels are among the most important risk factors for coronary artery disease. We tested for gene–gene interactions affecting the level of these four lipids based on prior knowledge of established genome-wide association study (GWAS) hits, protein–protein interactions, and pathway information. Using genotype data from 9,713 European Americans from the Atherosclerosis Risk in Communities (ARIC) study, we identified an interaction between HMGCR and a locus near LIPC in their effect on HDL-C levels (Bonferroni corrected Pc = 0.002). Using an adaptive locus-based validation procedure, we successfully validated this gene–gene interaction in the European American cohorts from the Framingham Heart Study (Pc = 0.002) and the Multi-Ethnic Study of Atherosclerosis (MESA; Pc = 0.006). The interaction between these two loci is also significant in the African American sample from ARIC (Pc = 0.004) and in the Hispanic American sample from MESA (Pc = 0.04). Both HMGCR and LIPC are involved in the metabolism of lipids, and genome-wide association studies have previously identified LIPC as associated with levels of HDL-C. However, the effect on HDL-C of the novel gene–gene interaction reported here is twice as pronounced as that predicted by the sum of the marginal effects of the two loci. In conclusion, based on a knowledge-driven analysis of epistasis, together with a new locus-based validation method, we successfully identified and validated an interaction affecting a complex trait in multi-ethnic populations.
Genome-wide association studies (GWAS) have identified many loci associated with complex human traits or diseases. However, the fraction of heritable variation explained by these loci is often relatively low. Gene–gene interactions might play a significant role in complex traits or diseases and are one of the many possible factors contributing to the missing heritability. However, to date only a few interactions have been found and validated in GWAS due to the limited power caused by the need for multiple-testing correction for the very large number of tests conducted. Here, we used three types of prior knowledge, known GWAS hits, protein–protein interactions, and pathway information, to guide our search for gene–gene interactions affecting four lipid levels. We identified an interaction between HMGCR and a locus near LIPC in their effect on high-density lipoprotein cholesterol (HDL-C) and another pair of loci that interact in their effect on low-density lipoprotein cholesterol (LDL-C). We validated the interaction on HDL-C in a number of independent multiple-ethnic populations, while the interaction underlying LDL-C did not validate. The prior knowledge-driven searching approach and a locus-based validation procedure show the potential for dissecting and validating gene–gene interactions in current and future GWAS.
A genome-wide association study (GWAS) for heading date (HD) was performed with a panel of 358 European winter wheat (Triticum aestivum L.) varieties and 14 spring wheat varieties through the phenotypic evaluation of HD in field tests in eight environments. Genotyping data consisted of 770 mapped microsatellite loci and 7934 mapped SNP markers derived from the 90K iSelect wheat chip. Best linear unbiased estimations (BLUEs) were calculated across all trials and ranged from 142.5 to 159.6 days after the 1st of January with an average value of 151.4 days. Considering only associations with a −log10 (P-value) ≥ 3.0, a total of 340 SSR and 2983 SNP marker-trait associations (MTAs) were detected. After Bonferroni correction for multiple testing, a total of 72 SSR and 438 SNP marker-trait associations remained significant. Highly significant MTAs were detected for the photoperiodism gene Ppd-D1, which was genotyped in all varieties. Consistent associations were found on all chromosomes with the highest number of MTAs on chromosome 5B. Linear regression showed a clear dependence of the HD score BLUEs on the number of favorable alleles (decreasing HD) and unfavorable alleles (increasing HD) per variety, meaning that genotypes with a higher number of favorable or a lower number of unfavorable alleles showed lower HD and therefore flowered earlier. Co-locating MTAs were detected for the vernalization gene Vrn-A2 on chromosome 5A, as well as for the photoperiodism genes Ppd-A1 and Ppd-B1 on chromosomes 2A and 2B. After the construction of an integrated map of the SSR and SNP markers and by exploiting the synteny to sequenced species, such as rice and Brachypodium distachyon, we were able to demonstrate that a marker locus on wheat chromosome 5BL with homology to the rice photoperiodism gene Hd6 played a significant role in the determination of the heading date in wheat.
genome wide associations; Triticum aestivum L.; photoperiodism; environmental adaptation; flowering time
It has been hypothesized that multivariate analysis and systematic detection of epistatic interactions between explanatory genotyping variables may help resolve the problem of "missing heritability" currently observed in genome-wide association studies (GWAS). However, even the simplest bivariate analysis is still held back by significant statistical and computational challenges that are often addressed by reducing the set of analysed markers. Theoretically, it has been shown that combinations of loci may exist that show weak or no effects individually, but show significant (even complete) explanatory power over phenotype when combined. Reducing the set of analysed SNPs before bivariate analysis could easily omit such critical loci.
We have developed an exhaustive bivariate GWAS analysis methodology that yields a manageable subset of candidate marker pairs for subsequent analysis using other, often more computationally expensive techniques. Our model-free filtering approach is based on classification using ROC curve analysis, an alternative to much slower regression-based modelling techniques. Exhaustive analysis of studies containing approximately 450,000 SNPs and 5,000 samples requires only 2 hours using a desktop CPU or 13 minutes using a GPU (Graphics Processing Unit). We validate our methodology with analysis of simulated datasets as well as the seven Wellcome Trust Case-Control Consortium datasets that represent a wide range of real life GWAS challenges. We have identified SNP pairs that have considerably stronger association with disease than their individual component SNPs that often show negligible effect univariately. When compared against previously reported results in the literature, our methods re-detect most significant SNP-pairs and additionally detect many pairs absent from the literature that show strong association with disease. The high overlap suggests that our fast analysis could substitute for some slower alternatives.
We demonstrate that the proposed methodology is robust, fast and capable of exhaustive search for epistatic interactions using a standard desktop computer. First, our implementation is significantly faster than timings for comparable algorithms reported in the literature, especially as our method allows simultaneous use of multiple statistical filters with low computing time overhead. Second, for some diseases, we have identified hundreds of SNP pairs that pass formal multiple test (Bonferroni) correction and could form a rich source of hypotheses for follow-up analysis.
A web-based version of the software used for this analysis is available at http://bioinformatics.research.nicta.com.au/gwis.
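To give a sense of scale for the Bonferroni correction in an exhaustive bivariate scan, the short calculation below counts the SNP pairs for a study of roughly 450,000 SNPs and derives the per-pair threshold. The family-wise level of 0.05 is an assumption made for illustration; the abstract does not state the level used.

```python
import math

n_snps = 450_000   # approximate study size quoted above
alpha = 0.05       # assumed family-wise significance level

# Every unordered pair of distinct SNPs is one test.
n_pairs = math.comb(n_snps, 2)

# Bonferroni: each pair must reach alpha divided by the number of tests.
pair_threshold = alpha / n_pairs

print(f"pairs tested:       {n_pairs:,}")
print(f"per-pair threshold: {pair_threshold:.3e}")
```

With about 10^11 pairs, a pair has to reach a p-value on the order of 10^-13 to pass, which is why filtering the pairs quickly before applying more expensive models matters.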
Genome-wide association studies (GWAS) have identified genetic factors in type 2 diabetes (T2D), mostly among individuals of European ancestry. We tested whether previously identified T2D-associated single nucleotide polymorphisms (SNPs) replicate and whether SNPs in regions near known T2D SNPs were associated with T2D within the Singapore Chinese Health Study.
2338 T2D cases and 2339 controls from the Singapore Chinese Health Study were genotyped for 507,509 SNPs. Imputation extended the genotyped SNPs to 7,514,461 with high estimated certainty (r² > 0.8). Replication of known index SNP associations in T2D was attempted. Risk scores were computed as the sum of index risk alleles. SNPs in regions ±100 kb around each index were tested for associations with T2D in conditional fine-mapping analysis.
Of 69 index SNPs, 20 were genotyped directly and genotypes at 35 others were well imputed. Among the 55 SNPs with data, disease associations were replicated (at p<0.05) for 15 SNPs, while 32 more were directionally consistent with previous reports. Risk score was a significant predictor, with a 2.03-fold higher risk of T2D (CI 1.69–2.44) comparing the highest to the lowest quintile of risk allele burden (p = 5.72×10⁻¹⁴). Two improved SNPs around index rs10923931 and 5 new candidate SNPs around indices rs10965250 and rs1111875 passed simple Bonferroni corrections for significance in conditional analysis. Nonetheless, only a small fraction (2.3% on the disease liability scale) of T2D burden in Singapore is explained by these SNPs.
While diabetes risk in Singapore Chinese involves genetic variants, most disease risk remains unexplained. Further genetic work is ongoing in the Singapore Chinese population to identify unique common variants not already seen in earlier studies. However, rapid increases in T2D risk have occurred in recent decades in this population, indicating that dynamic environmental influences and possibly gene-by-environment interactions complicate the genetic architecture of this disease.
Obesity is a major health problem. Although heritability is substantial, genetic mechanisms predisposing to obesity are not very well understood. We have performed a genome wide association study (GWA) for early onset (extreme) obesity.
a) GWA (Genome-Wide Human SNP Array 5.0 comprising 440,794 single nucleotide polymorphisms) for early onset extreme obesity based on 487 extremely obese young German individuals and 442 healthy lean German controls; b) confirmatory analyses on 644 independent families with at least one obese offspring and both parents. We aimed to identify and subsequently confirm the 15 SNPs (minor allele frequency ≥10%) with the lowest p-values of the GWA by four genetic models: additive, recessive, dominant and allelic. Six single nucleotide polymorphisms (SNPs) in FTO (fat mass and obesity associated gene) within one linkage disequilibrium (LD) block including the GWA SNP rendering the lowest p-value (rs1121980; log-additive model: nominal p = 1.13×10⁻⁷, corrected p = 0.0494; odds ratio (OR) for CT 1.67, 95% confidence interval (CI) 1.22–2.27; OR for TT 2.76, 95% CI 1.88–4.03) belonged to the 15 SNPs showing the strongest evidence for association with obesity. For confirmation we genotyped 11 of these in the 644 independent families (of the six FTO SNPs we chose only two representing the LD block). For both FTO SNPs the initial association was confirmed (both Bonferroni corrected p<0.01). However, none of the nine non-FTO SNPs revealed significant transmission disequilibrium.
Our GWA for extreme early onset obesity substantiates that variation in FTO strongly contributes to early onset obesity. This is a further proof of concept for GWA to detect genes relevant for highly complex phenotypes. We concurrently show that nine additional SNPs with initially low p-values in the GWA were not confirmed in our family study, thus suggesting that of the best 15 SNPs in the GWA only the FTO SNPs represent true positive findings.
Genome-wide association studies of obesity measures have identified associations with single nucleotide polymorphisms (SNPs). However, no large-scale evaluation of gene-environment interactions has been performed. We conducted a search of gene-environment (G×E) interactions in post-menopausal African-American and Hispanic women from the Women’s Health Initiative SNP Health Association Resource GWAS study. Single SNP linear regression on body mass index (BMI) and waist-to-hip circumference ratio (WHR) adjusted for multidimensional-scaling-derived axes of ancestry and age was run in race-stratified data with 871,512 SNPs available from African-Americans (N=8,203) and 786,776 SNPs from Hispanics (N=3,484). Tests of G×E interaction at all SNPs for recreational physical activity (met-hrs/wk), dietary energy intake (kcal/day), alcohol intake (categorical), cigarette smoking years, and cigarette smoking (ever vs. never) were run in African-Americans and Hispanics adjusted for ancestry and age at interview, followed by meta-analysis of G×E interaction terms. The strongest evidence for concordant G×E interactions in African-Americans and Hispanics was for smoking and marker rs10133840 (Q statistic P=0.70, beta=−0.01, P=3.81×10⁻⁷) with BMI as the outcome. The strongest evidence for G×E interaction within a cohort was in African-Americans with WHR as outcome for dietary energy intake and rs9557704 (SNP×kcal = −0.04, P=2.17×10⁻⁷). No results exceeded the Bonferroni-corrected statistical significance threshold.
BMI; WHR; genetic epidemiology; disparity; obesity; GWAS
Motivation: Gene–gene interactions (epistasis) are thought to be important in shaping complex traits, but they have been under-explored in genome-wide association studies (GWAS) due to the computational challenge of enumerating billions of single nucleotide polymorphism (SNP) combinations. Fast screening tools are needed to make epistasis analysis routinely available in GWAS.
Results: We present BiForce to support high-throughput analysis of epistasis in GWAS for either quantitative or binary disease (case–control) traits. BiForce achieves great computational efficiency by using memory efficient data structures, Boolean bitwise operations and multithreaded parallelization. It performs a full pair-wise genome scan to detect interactions involving SNPs with or without significant marginal effects using appropriate Bonferroni-corrected significance thresholds. We show that BiForce is more powerful and significantly faster than published tools for both binary and quantitative traits in a series of performance tests on simulated and real datasets. We demonstrate BiForce in analysing eight metabolic traits in a GWAS cohort (323 697 SNPs, >4500 individuals) and two disease traits in another (>340 000 SNPs, >1750 cases and 1500 controls) on a 32-node computing cluster. BiForce completed analyses of the eight metabolic traits within 1 day, identified nine epistatic pairs of SNPs in five metabolic traits and 18 SNP pairs in two disease traits. BiForce can make the analysis of epistasis a routine exercise in GWAS and thus improve our understanding of the role of epistasis in the genetic regulation of complex traits.
Availability and implementation: The software is free and can be downloaded from http://bioinfo.utu.fi/BiForce/.
Supplementary data are available at Bioinformatics online.
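The role of Boolean bitwise operations in a pair-wise scan can be illustrated with a small sketch. It assumes each SNP is stored as three bit masks (one per genotype 0, 1, 2) with one bit per individual, so that a cell of the 3×3 genotype table for a SNP pair is the population count of a bitwise AND. This is an illustration of the general technique only, not BiForce's actual data structures, and all names are hypothetical.

```python
def encode_snp(genotypes):
    """Return three integer bit masks, one per genotype value 0, 1, 2."""
    masks = [0, 0, 0]
    for individual, genotype in enumerate(genotypes):
        masks[genotype] |= 1 << individual
    return masks


def encode_group(flags):
    """Return a bit mask with a set bit for every truthy flag."""
    mask = 0
    for individual, flag in enumerate(flags):
        if flag:
            mask |= 1 << individual
    return mask


def popcount(value):
    """Count the set bits of a non-negative integer."""
    return bin(value).count("1")


def pair_table(snp_a, snp_b, group_mask):
    """3x3 table of joint genotype counts within one group of individuals."""
    return [
        [popcount(snp_a[a] & snp_b[b] & group_mask) for b in range(3)]
        for a in range(3)
    ]


# Toy data: six individuals, the first three are cases.
geno_a = [0, 1, 2, 1, 0, 2]
geno_b = [0, 0, 1, 1, 2, 2]
is_case = [1, 1, 1, 0, 0, 0]

case_table = pair_table(encode_snp(geno_a), encode_snp(geno_b),
                        encode_group(is_case))
control_table = pair_table(encode_snp(geno_a), encode_snp(geno_b),
                           encode_group([not c for c in is_case]))
```

Each table cell costs a handful of word-level operations regardless of how the individuals are ordered, and the per-pair work is independent across pairs, which is what makes this kind of counting easy to parallelize.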
Linkage disequilibrium (LD) based mapping is a powerful approach for the identification and characterization of the genetic basis of morphological shape, which usually involves multiple genetic markers. However, multiple testing corrections substantially reduce the power of the associated tests. In addition, principal component analysis (PCA), used to quantify the shape variations into several principal phenotypes, further increases the number of tests. As a result, a powerful multiple testing correction for simultaneous large-scale gene-shape association tests is an essential part of determining statistical significance. Bonferroni adjustments and permutation tests are the most popular approaches to correcting for multiple tests within LD-based Quantitative Trait Loci (QTL) models. However, permutations are extremely computationally expensive and may mislead in the presence of family structure. The Bonferroni correction, though simple and fast, is conservative and has low power for large-scale testing.
We propose a new multiple testing approach, constructed by combining an Intersection Union Test (IUT) with the Holm correction, which strongly controls the family-wise error rate (FWER) without any additional assumptions on the joint distribution of the test statistics or dependence structure of the markers. The power improvement for the Holm correction, as compared to the standard Bonferroni correction, is examined through a simulation study. A consistent and moderate increase in power is found under the majority of simulated circumstances, including various sample sizes, heritabilities, and numbers of markers. The power gains are further demonstrated on real leaf shape data from a natural population of poplar, Populus szechuanica var. tibetica, where more significant QTL associated with morphological shape are detected than under the previously applied Bonferroni adjustment.
The Holm correction is a valid and powerful method for assessing gene-shape association involving multiple markers, which not only controls the FWER in the strong sense but also improves statistical power.
Bonferroni; Holm; QTL mapping; LD; Multiple correction
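The difference between the Bonferroni and Holm corrections can be seen in a minimal sketch of the two adjustments. It covers only the standard Holm step-down procedure, not the Intersection Union Test that the abstract combines it with, and the function names and toy p-values are illustrative.

```python
def bonferroni_adjust(pvalues):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * pvalues[index])
        # Enforce monotonicity: an adjusted value never drops below an
        # earlier (smaller-p) one.
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


toy_p = [0.01, 0.02, 0.015, 0.005]
toy_bonferroni = bonferroni_adjust(toy_p)
toy_holm = holm_adjust(toy_p)

n_reject_bonferroni = sum(p <= 0.05 for p in toy_bonferroni)
n_reject_holm = sum(p <= 0.05 for p in toy_holm)
```

Holm-adjusted p-values are never larger than their Bonferroni counterparts, so Holm rejects at least as many hypotheses at the same level while making no extra assumption about dependence between tests. On the toy p-values above, Holm rejects all four hypotheses at the 0.05 level and Bonferroni rejects two.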