In the past few years, case-control studies of common diseases have shifted their focus from single genes to whole exomes. New sequencing technologies now routinely detect hundreds of thousands of sequence variants in a single study, many of which are rare or even novel. The limitation of classical single-marker association analysis for rare variants has been a challenge in such studies. A new generation of statistical methods for case-control association studies has been developed to meet this challenge. A common approach to association analysis of rare variants is to use burden-style collapsing methods, which combine rare variant data within individuals across or within genes. Here, we propose a new hybrid likelihood model that combines a burden test with a test of the position distribution of variants. In extensive simulations and on empirical data from the Dallas Heart Study, the new model demonstrates consistently good power, in particular when applied to a gene set (e.g., multiple candidate genes with shared biological function or pathway), when rare variants cluster in key functional regions of a gene, and when protective variants are present. When applied to data from an ongoing sequencing study of bipolar disorder (191 cases, 107 controls), the model identifies seven gene sets with nominal p-values < 0.05, of which one, the MAPK signaling pathway (KEGG), reaches trend-level significance after correcting for multiple testing.
Inexpensive, high-throughput sequencing has transformed the field of case-control association studies. For the first time, it may be possible to identify the genetic underpinnings of complex diseases, by sequencing the DNA of hundreds (even thousands) of cases and controls and comparing patterns of DNA sequence variation. However, complex diseases are likely to be caused by many variants, some of which are very rare. Taken one at a time, the association between variant and disease phenotype may not be detectable by current statistical methods. One strategy is to identify regions where important variants occur by “collapsing” variants into groups. Here, we present a new collapsing approach, capable of detecting subtle genetic differences between cases and controls. We show, in extensive simulations and using a benchmark set of genes involved in human triglyceride levels, that the approach is potentially more powerful than existing methods. We apply the new method to an ongoing sequencing study of bipolar cases and controls and identify a set of genes found in neuronal synapses, which may be implicated in bipolar disorder.
There is increasing evidence that rare variants play a role in some complex traits, but their analysis is not straightforward. Locus-based tests become necessary due to low power in rare variant single-point association analyses. In addition, variant quality scores are available for sequencing data, but are rarely taken into account. Here, we propose two locus-based methods that incorporate variant quality scores: a regression-based collapsing approach and an allele-matching method.
Using simulated sequencing data we compare 4 locus-based tests of trait association under different scenarios of data quality. We test two collapsing-based approaches and two allele-matching-based approaches, taking into account variant quality scores and ignoring variant quality scores. We implement the collapsing and allele-matching approaches accounting for variant quality in the freely available ARIEL and AMELIA software.
The incorporation of variant quality scores in locus-based association tests has power advantages over weighting each variant equally. The allele-matching methods are robust to the presence of both protective and risk variants in a locus, while collapsing methods exhibit a dramatic loss of power in this scenario.
The incorporation of variant quality scores should be a standard protocol when performing locus-based association analysis on sequencing data. The ARIEL and AMELIA software implement collapsing and allele-matching locus association analysis methods, respectively, that allow the incorporation of variant quality scores.
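As an illustration of how variant quality scores might enter a collapsing statistic, the following minimal sketch computes a quality-weighted proportion of rare alleles for one individual. It is not the ARIEL implementation: the function name `weighted_burden`, the weights scaled to [0, 1] and the normalisation are all assumptions made for this example.

```python
from typing import Sequence


def weighted_burden(genotypes: Sequence[int], qualities: Sequence[float]) -> float:
    """Quality-weighted proportion of rare alleles carried by one individual.

    genotypes: minor-allele counts (0, 1 or 2) at each rare variant in the locus.
    qualities: hypothetical per-variant weights in [0, 1]; 1 means fully trusted.
    """
    if len(genotypes) != len(qualities):
        raise ValueError("genotypes and qualities must have the same length")
    total_weight = sum(qualities)
    if total_weight == 0:
        return 0.0
    # Each rare allele contributes in proportion to the quality of its variant call.
    weighted_alleles = sum(g * q for g, q in zip(genotypes, qualities))
    # Normalise by the largest attainable weighted count (two alleles per variant).
    return weighted_alleles / (2.0 * total_weight)


if __name__ == "__main__":
    # Three rare variants; the second and third calls are trusted half as much.
    print(weighted_burden([1, 0, 2], [1.0, 0.5, 0.5]))
```

The resulting score could then serve as the predictor in a regression on the trait. Setting every weight to 1 recovers an unweighted proportion, which corresponds to the scenario of ignoring variant quality.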
Whole-genome sequencing; Exome sequencing; Association analysis; Accounting for uncertainty; Complex trait
Although variations in allele frequencies at common SNPs have been extensively studied in different populations, little is known about the stratification of rare variants and its impact on association tests. In this paper, we used Affymetrix 500K genotype data from the WTCCC to investigate whether variants in three different frequency categories (below 1%, between 1 and 5%, above 5%) show different stratification patterns in the UK population. We found that these patterns are indeed different. The top principal component extracted from the rare variant category shows poor correlations with any principal component or combination of principal components from the low frequency or common variant categories. These results could suggest that a suitable solution to avoid false positive association due to population stratification would involve adjusting for the respective PCs when testing for variants in different allele frequency categories. However, we found this was not the case both on type 2 diabetes data and on simulated data. Indeed, adjusting rare variant association tests on PCs derived from rare variants does no better to correct for population stratification than adjusting on PCs derived from more common variants. Mixed models perform slightly better for low frequency variants than PC based adjustments but less well for the rarest variants. These results call for new methodological developments specifically devoted to addressing rare variant stratification issues in association tests.
A new study successfully applies complementary whole-genome sequencing and imputation approaches to establish robust disease associations in a population isolate. This strategy is poised to help elucidate the genetic architecture of complex traits in the low end of the allele frequency spectrum.
In the presence of epistasis, multilocus association tests of human complex traits can provide powerful methods to detect susceptibility variants. We undertook multilocus analyses in 1924 type 2 diabetes cases and 2938 controls from the Wellcome Trust Case Control Consortium (WTCCC). We performed a two-dimensional genome-wide association (GWA) scan using joint two-locus tests of association including main and epistatic effects in 70,236 markers tagging common variants. We found two-locus association at 79 SNP-pairs at a Bonferroni-corrected P-value = 0.05 (uncorrected P-value = 2.14 × 10−11). The 79 pair-wise results always contained rs11196205 in TCF7L2 paired with 79 variants including confirmed variants in FTO, TSPAN8, and CDKAL1, which are associated in the absence of epistasis. However, the majority (82%) of the 79 variants did not have compelling single-locus association signals (P-value = 5 × 10−4). Analyses conditional on the single-locus effects at TCF7L2 established that the joint two-locus results could be attributed to single-locus association at TCF7L2 alone. Interaction analyses among the peak 80 regions and among 23 previously established diabetes candidate genes identified five SNP-pairs with case-control and case-only epistatic signals. Our results demonstrate the feasibility of systematic scans in GWA data, but confirm that single-locus association can underlie and obscure multilocus findings.
Epistasis; simultaneous search; joint effects; genome-wide association
Large-scale meta-analyses of genome-wide association scans (GWAS) have been successful in discovering common risk variants with modest and small effects. The detection of lower frequency signals will undoubtedly require concerted efforts of at least similar scale. We investigate the sample size-dictated power limits of GWAS meta-analyses, in the presence and absence of modest levels of heterogeneity and across a range of different allelic architectures. We find that data combination through large-scale collaboration is vital in the quest for complex trait susceptibility loci, but that effect size heterogeneity across meta-analysed studies drawn from similar populations does not appear to have a profound effect on sample size requirements.
genetic study; sample size; heterogeneity; replication; study design
Background Genetic differences between men and women may contribute to sex differences in prevalence and progression of many common complex diseases.
Using the WTCCC GWAS, we analysed whether there are sex-specific differences in effect size estimates at 142 established loci for seven complex diseases: rheumatoid arthritis, type 1 diabetes (T1D), Crohn’s disease, type 2 diabetes (T2D), hypertension, coronary artery disease and bipolar disorder.
Methods For each single nucleotide polymorphism (SNP), we calculated the per-allele odds ratio for each sex and the relative odds ratio (ROR; an ROR greater than one indicates a larger effect size in men). RORs were then meta-analysed across loci within each disease and across diseases.
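The ROR calculation can be sketched as follows, assuming that sex-specific log odds ratios and their standard errors are already available. The function names are hypothetical, and the inverse-variance fixed-effect pooling is an assumption made for illustration, since the meta-analysis model is not specified here.

```python
import math
from typing import Sequence, Tuple


def relative_odds_ratio(log_or_men: float, se_men: float,
                        log_or_women: float, se_women: float) -> Tuple[float, float]:
    """Return (ROR, standard error of log ROR); ROR > 1 means a larger effect in men."""
    log_ror = log_or_men - log_or_women
    # Men and women are independent samples, so the variances add.
    se_log_ror = math.sqrt(se_men ** 2 + se_women ** 2)
    return math.exp(log_ror), se_log_ror


def pooled_ror(log_rors: Sequence[float], ses: Sequence[float]) -> Tuple[float, float, float]:
    """Inverse-variance fixed-effect summary ROR with a 95% confidence interval."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * x for w, x in zip(weights, log_rors)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * pooled_se),
            math.exp(pooled + 1.96 * pooled_se))


if __name__ == "__main__":
    # Hypothetical locus: per-allele OR of 1.30 in men and 1.10 in women.
    ror, se = relative_odds_ratio(math.log(1.30), 0.08, math.log(1.10), 0.09)
    print(ror, se)
```

A random-effects model would be the natural alternative when between-SNP heterogeneity in the RORs is expected.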
Results For each disease, summary RORs were not different from one, but there was between-SNP heterogeneity in the RORs for T1D and T2D. Four loci in T1D, three in Crohn’s disease and three in T2D showed differences in the genetic effect between men and women (P < 0.05). We probed these differences in additional independent replication samples for T1D and T2D. The differences remained for the T1D loci CTSH, 17q21 and 20p13 and the T2D locus BCL11A, when WTCCC data and replication data were meta-analysed. Only CTSH showed different genetic effect between men and women in the replication data alone.
Conclusion Our results exclude the presence of large and frequent differences in the effect size estimates between men and women for the established loci in the seven common diseases explored. Documenting small differences in genetic effects between men and women requires large studies and systematic evaluation.
Genetic Predisposition to Disease; Genome-Wide Association Study; Odds ratio; Sex
Next-generation sequencing has opened the possibility of large-scale sequence-based disease association studies. A major challenge in interpreting whole-exome data is predicting which of the discovered variants are deleterious or neutral. To address this question in silico, we have developed a score called Combined Annotation scoRing toOL (CAROL), which combines information from 2 bioinformatics tools: PolyPhen-2 and SIFT, in order to improve the prediction of the effect of non-synonymous coding variants.
We used a weighted Z method that combines the probabilistic scores of PolyPhen-2 and SIFT. We defined 2 dataset pairs to train and test CAROL using information from the db-SNP: ‘HGMD-PUBLIC’ and 1000 Genomes Project databases. The training pair comprises a total of 980 positive control (disease-causing) and 4,845 negative control (non-disease-causing) variants. The test pair consists of 1,959 positive and 9,691 negative controls.
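For readers unfamiliar with the weighted Z approach, the sketch below shows the generic (Stouffer-type) form for combining one-sided p-values. It does not reproduce how CAROL transforms or weights the PolyPhen-2 and SIFT scores, which is not detailed here; the function name and the example weights are assumptions.

```python
import math
from statistics import NormalDist
from typing import Sequence


def weighted_z(p_values: Sequence[float], weights: Sequence[float]) -> float:
    """Combine one-sided p-values with the generic weighted Z method."""
    if len(p_values) != len(weights) or not p_values:
        raise ValueError("need one weight per p-value")
    if any(not 0.0 < p < 1.0 for p in p_values):
        raise ValueError("p-values must lie strictly between 0 and 1")
    norm = NormalDist()
    # Convert each p-value to a standard normal deviate.
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    # Convert the combined deviate back to a one-sided p-value.
    return 1.0 - norm.cdf(numerator / denominator)


if __name__ == "__main__":
    # Two hypothetical predictors, the first trusted twice as much as the second.
    print(weighted_z([0.02, 0.10], [2.0, 1.0]))
```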
CAROL has higher predictive power and accuracy for the effect of non-synonymous variants than each individual annotation tool (PolyPhen-2 and SIFT) and benefits from higher coverage.
The combination of annotation tools can help improve automated prediction of whole-genome/exome non-synonymous variant functional consequences.
CAROL; PolyPhen-2; SIFT; Weighted Z method
Meta-analysis has proven a useful tool in genetic association studies. Allelic heterogeneity can arise from ethnic background differences across populations being meta-analyzed (for example, in search of common frequency variants through genome-wide association studies), and through the presence of multiple low frequency and rare associated variants in the same functional unit of interest (for example, within a gene or a regulatory region). The latter challenge will be increasingly relevant in whole-genome and whole-exome sequencing studies investigating association with complex traits. Here, we evaluate the performance of different approaches to meta-analysis in the presence of allelic heterogeneity. We simulate allelic heterogeneity scenarios in three populations and examine the performance of current approaches to the analysis of these data. We show that current approaches can detect only a small fraction of common frequency causal variants. We also find that for low-frequency variants with large effects (odds ratios 2–3), single-point tests have high power, but also high false-positive rates. P-value based meta-analysis of summary results from allele-matching locus-wide tests outperforms collapsing approaches. We conclude that current strategies for the combination of genetic association data in the presence of allelic heterogeneity are insufficiently powered.
genetic association; trans-ethnic mapping; multiple rare variants
The role of rare genetic variation in the etiology of complex disease remains unclear. However, the development of next-generation sequencing technologies offers the experimental opportunity to address this question. Several novel statistical methodologies have been recently proposed to assess the contribution of rare variation to complex disease etiology. Nevertheless, no empirical estimates comparing their relative power are available. We therefore assessed the parameters that influence their statistical power in 1,998 individuals Sanger-sequenced at seven genes by modeling different distributions of effect, proportions of causal variants, and direction of the associations (deleterious, protective, or both) in simulated continuous trait and case/control phenotypes. Our results demonstrate that the power of recently proposed statistical methods depends strongly on the underlying hypotheses concerning the relationship of phenotypes with each of these three factors. No method demonstrates consistently acceptable power despite this large sample size, and the performance of each method depends upon the underlying assumption of the relationship between rare variants and complex traits. Sensitivity analyses are therefore recommended to compare the stability of the results arising from different methods, and promising results should be replicated using the same method in an independent sample. These findings provide guidance in the analysis and interpretation of the role of rare base-pair variation in the etiology of complex traits and diseases.
There is now evidence that rare variants can contribute to the etiology of complex disease. Next generation sequencing technologies have enabled their detection in large cohorts, and new statistical methods have been proposed to ascertain their association with complex diseases and traits in order to improve power over single-marker analysis. Each of these new methods assumes a particular nature of the relationship between rare variants and complex disease, yet these hypotheses have been largely unverified. Therefore we sought to compare the power of commonly used and novel statistical methods for rare variants using Sanger sequencing data from 1,998 individuals sequenced at 7 genes by simulating several phenotypes under models spanning a spectrum of the common hypotheses concerning such associations. While all methods perform reasonably well under their own model-specific hypotheses, no single method gives consistently acceptable power when these hypotheses are violated. Unlike GWAS, wherein all variants can often be tested using the same method across the entire genome, the analysis and interpretation of sequencing studies will therefore be considerably more challenging.
Pooled sequencing can be a cost-effective approach to disease variant discovery, but its applicability in association studies remains unclear. We compare sequence enrichment methods coupled to next-generation sequencing in non-indexed pools of 1, 2, 10, 20 and 50 individuals and assess their ability to discover variants and to estimate their allele frequencies. We find that pooled resequencing is most usefully applied as a variant discovery tool due to limitations in estimating allele frequency with high enough accuracy for association studies, and that in-solution hybrid-capture performs best among the enrichment methods examined regardless of pool size.
Human height is a classic, highly heritable quantitative trait. To begin to identify genetic variants influencing height, we examined genome-wide association data from 4,921 individuals. Common variants in the HMGA2 oncogene, exemplified by rs1042725, were associated with height (P = 4 × 10−8). HMGA2 is also a strong biological candidate for height, as rare, severe mutations in this gene alter body size in mice and humans, so we tested rs1042725 in additional samples. We confirmed the association in 19,064 adults from four further studies (P = 3 × 10−11, overall P = 4 × 10−16, including the genome-wide association data). We also observed the association in children (P = 1 × 10−6, N = 6,827) and a tall/short case-control study (P = 4 × 10−6, N = 3,207). We estimate that rs1042725 explains ~0.3% of population variation in height (~0.4 cm increased adult height per C allele). There are few examples of common genetic variants reproducibly associated with human quantitative traits; these results represent, to our knowledge, the first consistently replicated association with adult and childhood height.
Imputation is an extremely valuable tool in conducting and synthesising genome-wide association studies (GWASs). Directly typed SNP quality control (QC) is thought to affect imputation quality. It is, therefore, common practice to use quality-controlled (QCed) data as an input for imputing genotypes. This study aims to determine the effect of commonly applied QC steps on imputation outcomes. We performed several iterations of imputing SNPs across chromosome 22 in a dataset consisting of 3177 samples with Illumina 610k (Illumina, San Diego, CA, USA) GWAS data, applying different QC steps each time. The imputed genotypes were compared with the directly typed genotypes. In addition, we investigated the correlation between alternatively QCed data. We also applied a series of post-imputation QC steps balancing elimination of poorly imputed SNPs and information loss. We found that the difference between the unQCed data and the fully QCed data on imputation outcome was minimal. Our study shows that imputation of common variants is generally very accurate and robust to GWAS QC, which is not a major factor affecting imputation outcome. A minority of common-frequency SNPs with particular properties cannot be accurately imputed regardless of QC stringency. These findings may not generalise to the imputation of low frequency and rare variants.
genome-wide association study; imputation; quality control; single nucleotide polymorphism
The study of rare variants holds the promise of accounting for some of the missing heritability in complex traits. Next-generation sequencing technologies enable probing of variation across the full spectrum of allele frequencies. Multiple methods for the analysis of rare variants have been proposed and, recently, Ionita-Laza et al. have presented an approach with the theoretical capacity to detect risk and protective variants. The identification of rare risk variants could have major implications in understanding complex disease etiopathogenesis.
By combining genome-wide association data from 8,130 individuals with type 2 diabetes (T2D) and 38,987 controls of European descent and following up previously unidentified meta-analysis signals in a further 34,412 cases and 59,925 controls, we identified 12 new T2D association signals with combined P < 5 × 10−8. These include a second independent signal at the KCNQ1 locus; the first report, to our knowledge, of an X-chromosomal association (near DUSP9); and a further instance of overlap between loci implicated in monogenic and multifactorial forms of diabetes (at HNF1A). The identified loci affect both beta-cell function and insulin action, and, overall, T2D association signals show evidence of enrichment for genes involved in cell cycle regulation. We also show that a high proportion of T2D susceptibility loci harbor independent association signals influencing apparently unrelated complex traits.
Interpretation of dense single nucleotide polymorphism (SNP) follow-up of genome-wide association or linkage scan signals can be facilitated by establishing expectation for the behaviour of primary mapping signals upon fine-mapping, under both null and alternative hypotheses. We examined the inferences that can be made regarding the posterior probability of a real genetic effect and considered different disease-mapping strategies and prior probabilities of association. We investigated the impact of the extent of linkage disequilibrium between the disease SNP and the primary analysis signal and the extent to which the disease gene can be physically localised under these scenarios. We found that large increases in significance (>2 orders of magnitude) appear in the exclusive domain of genuine genetic effects, especially in the follow-up of genome-wide association scans or consensus regions from multiple linkage scans. Fine-mapping significant association signals that reside directly under linkage peaks yields little improvement in an already high posterior probability of a real effect. Following fine-mapping, those signals that increase in significance also demonstrate improved localisation. We found local linkage disequilibrium patterns around the primary analysis signal(s) and tagging efficacy of typed markers to play an important role in determining a suitable interval for fine-mapping. Our findings help inform the interpretation and design of dense SNP-mapping follow-up studies, thus facilitating discrimination between a genuine genetic effect and chance fluctuation (false positive).
genome-wide association; false positive; localization; disease gene; linkage disequilibrium; haplotype; fine-scale mapping
Common variation in the FTO gene is associated with BMI and type 2 diabetes. Increased BMI is associated with diabetes risk factors, including raised insulin, glucose, and triglycerides. We aimed to test whether FTO genotype is associated with variation in these metabolic traits.
RESEARCH DESIGN AND METHODS
We tested the association between FTO genotype and 10 metabolic traits using data from 17,037 white European individuals. We compared the observed effect of FTO genotype on each trait to that expected given the FTO-BMI and BMI-trait associations.
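The comparison of observed and expected effects can be illustrated with a short sketch. If FTO influences a trait only through BMI, the expected per-allele effect on the trait is the product of the FTO-BMI effect and the BMI-trait effect. The function names and all numbers in the usage example are hypothetical, and the confidence-interval check is a simplification that ignores uncertainty in the expected value.

```python
def expected_trait_effect(fto_bmi_effect: float, bmi_trait_effect: float) -> float:
    """Expected per-allele trait effect if FTO acts on the trait only through BMI.

    fto_bmi_effect: change in BMI per copy of the risk allele.
    bmi_trait_effect: change in the trait (in SD units) per unit of BMI.
    """
    return fto_bmi_effect * bmi_trait_effect


def consistent_with_expected(ci_low: float, ci_high: float, expected: float) -> bool:
    """True if the expected effect lies inside the observed 95% confidence interval."""
    return ci_low <= expected <= ci_high


if __name__ == "__main__":
    # Hypothetical inputs: 0.4 BMI units per allele, 0.08 SD of trait per BMI unit.
    expected = expected_trait_effect(0.4, 0.08)
    print(expected, consistent_with_expected(0.013, 0.064, expected))
```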
Each copy of the FTO rs9939609 A allele was associated with higher fasting insulin (0.039 SD [95% CI 0.013–0.064]; P = 0.003), glucose (0.024 [0.001–0.048]; P = 0.044), and triglycerides (0.028 [0.003–0.052]; P = 0.025) and lower HDL cholesterol (0.032 [0.008–0.057]; P = 0.009). There was no evidence of these associations when adjusting for BMI. Associations with fasting alanine aminotransferase, γ-glutamyl-transferase, LDL cholesterol, A1C, and systolic and diastolic blood pressure were in the expected direction but did not reach P < 0.05. For all metabolic traits, effect sizes were consistent with those expected for the per allele change in BMI. FTO genotype was associated with a higher odds of metabolic syndrome (odds ratio 1.17 [95% CI 1.10–1.25]; P = 3 × 10−6).
FTO genotype is associated with metabolic traits to an extent entirely consistent with its effect on BMI. Sample sizes of >12,000 individuals were needed to detect associations at P < 0.05. Our findings highlight the importance of using appropriately powered studies to assess the effects of a known diabetes or obesity variant on secondary traits correlated with these conditions.
The P-value approach has been employed to prioritize genome-wide association (GWA) scan signals, with genome-wide significance defined by a prior P-value threshold, although this is not ideal. A rationale put forward is that association signals should instead be expected to give less support for single nucleotide polymorphisms (SNPs) that are rare (with associated low-power tests) than for common SNPs with equivalent P-values, unless investigators believe, a priori, that rare causative variants contribute to the disease and have more pronounced effects.
Using data from a GWA scan for type 2 diabetes (1924 cases, 2938 controls, 393 453 SNPs), we compared P-values with four alternative signal measures: likelihood ratio (LR), Bayes factor (BF; with a specified prior distribution for true effects), ‘frequentist factor’ (FF; reflecting the ratio between estimated—post-data— ‘power’ and P-value) and probability of pronounced effect size (PrPES).
The 19 common SNPs [minor allele frequency (MAF) among the controls >29%] yielding strong P-value signals (P<5×10−7) were also top ranked by the other approaches. There was a strong similarity between the P-values, LR and BF signals, in terms of ranking SNPs. In contrast, FF and PrPES signals down-weighted rare SNPs (control MAF<10%) with low P-values.
For prioritization of signals that do not achieve compelling levels of evidence for association, the main driving force behind observed differences between the various association signals appears to be SNP MAF. The statistical power afforded by follow-up samples for establishing replication should be taken into account when tailoring the signal selection strategy.
Bayes factor; effect size; likelihood ratio; single nucleotide polymorphism; statistical power; statistics
Synthetic associations have been posited as a possible explanation for missing heritability in complex disease. We show several lines of evidence which suggest that, while possible, these synthetic associations are not common.
The association between common variants in the FTO gene with weight, adiposity and body mass index (BMI) has now been widely replicated. Although the causal variant has yet to be identified, it most likely maps within a 47 kb region of intron 1 of FTO. We performed a genome-wide association study in the Sorbian population and evaluated the relationships between FTO variants and BMI and fat mass in this isolate of Slavonic origin resident in Germany. In a sample of 948 Sorbs, we could replicate the earlier reported associations of intron 1 SNPs with BMI (eg, P-value=0.003, β=0.02 for rs8050136). However, using genome-wide association data, we also detected a second independent signal mapping to a region in intron 2/3 about 40–60 kb away from the originally reported SNPs (eg, for rs17818902 association with BMI P-value=0.0006, β=−0.03 and with fat mass P-value=0.0018, β=−0.079). Both signals remain independently associated in the conditioned analyses. In conclusion, we extend the evidence that FTO variants are associated with BMI by putatively identifying a second susceptibility allele independent of that described earlier. Although further statistical analysis of these findings is hampered by the finite size of the Sorbian isolate, these findings should encourage other groups to seek alternative susceptibility variants within FTO (and other established susceptibility loci) using the opportunities afforded by analyses in populations with divergent mutational and/or demographic histories.
FTO; BMI; Sorbs
Genome-wide association studies have been successful in finding common variants influencing common traits. However, these associations only account for a fraction of trait heritability. There has been a shift in the field towards studying low frequency and rare variants, which are now widely recognised as putative complex trait determinants. Despite this increasing focus on examining the role of low frequency and rare variants in complex disease susceptibility, there is a lack of user-friendly analytical packages implementing powerful association tests for the analysis of rare variants.
We have developed two software tools, CCRaVAT (Case-Control Rare Variant Analysis Tool) and QuTie (Quantitative Trait), which enable efficient large-scale analysis of low frequency and rare variants. Both programs implement a collapsing method examining the accumulation of low frequency and rare variants across a locus of interest that has more power than single variant analysis. CCRaVAT carries out case-control analyses whereas QuTie has been developed for continuous trait analysis.
CCRaVAT and QuTie are easy to use software tools that allow users to perform genome-wide association analysis on low frequency and rare variants for both binary and quantitative traits. The software is freely available and provides the genetics community with a resource to perform association analysis on rarer genetic variants.
Elevated blood pressure is a common, heritable cause of cardiovascular disease worldwide. To date, identification of common genetic variants influencing blood pressure has proven challenging. We tested 2.5m genotyped and imputed SNPs for association with systolic and diastolic blood pressure in 34,433 subjects of European ancestry from the Global BPgen consortium and followed up findings with direct genotyping (N≤71,225 European ancestry, N=12,889 Indian Asian ancestry) and in silico comparison (CHARGE consortium, N=29,136). We identified association between systolic or diastolic blood pressure and common variants in 8 regions near the CYP17A1 (P=7×10−24), CYP1A2 (P=1×10−23), FGF5 (P=1×10−21), SH2B3 (P=3×10−18), MTHFR (P=2×10−13), c10orf107 (P=1×10−9), ZNF652 (P=5×10−9) and PLCD3 (P=1×10−8) genes. All variants associated with continuous blood pressure were associated with dichotomous hypertension. These associations between common variants and blood pressure and hypertension offer mechanistic insights into the regulation of blood pressure and may point to novel targets for interventions to prevent cardiovascular disease.
Genome-wide association studies (GWAS) conducted using commercial single nucleotide polymorphism (SNP) arrays have proven to be a powerful tool for the detection of common disease susceptibility variants. However, their utility for the detection of lower frequency variants is yet to be practically investigated. Here we describe the application of a rare variant collapsing method to a large genome-wide SNP dataset, the Wellcome Trust Case Control Consortium rheumatoid arthritis (RA) GWAS. We partitioned the data into gene-centric bins and collapsed genotypes of low frequency variants (defined here as MAF ≤0.05) into a single count coupled with univariate analysis. We then prioritised gene regions for further investigation in an independent cohort of 3,355 cases and 2,427 controls based on rare variant signal p value and prior evidence to support involvement in RA. A total of 14,536 gene bins were investigated in the primary analysis and signals mapping to the TNFAIP3 and chr17q24 loci were selected for further investigation. We detected replicating association with low frequency variants in the TNFAIP3 gene (combined p = 6.6 × 10−6). Even though rare variants are not well-represented and can be difficult to genotype in GWAS, our study supports the application of low frequency variant collapsing methods to genome-wide SNP datasets as a means of exploiting data that are routinely ignored.
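The collapsing step for a single gene bin can be sketched as below. This is a simplified illustration and not the analysis pipeline used in the study: it collapses to carrier status (presence or absence of any low frequency minor allele) and applies a Pearson chi-square test on the resulting 2×2 table. The function names and the data layout are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def collapse_carriers(genotypes: Sequence[Sequence[int]], mafs: Sequence[float],
                      maf_threshold: float = 0.05) -> List[int]:
    """Flag individuals carrying at least one minor allele at a low frequency variant.

    genotypes: one row per individual, one minor-allele count (0, 1, 2) per variant.
    mafs: minor allele frequency of each variant in the bin.
    """
    rare = [j for j, maf in enumerate(mafs) if maf <= maf_threshold]
    return [int(any(row[j] > 0 for j in rare)) for row in genotypes]


def chi_square_2x2(carrier: Sequence[int], is_case: Sequence[int]) -> Tuple[float, float]:
    """Pearson chi-square statistic (1 df, no continuity correction) and its p-value."""
    a = sum(1 for c, y in zip(carrier, is_case) if c and y)          # case carriers
    b = sum(1 for c, y in zip(carrier, is_case) if not c and y)      # case non-carriers
    c_ = sum(1 for c, y in zip(carrier, is_case) if c and not y)     # control carriers
    d = sum(1 for c, y in zip(carrier, is_case) if not c and not y)  # control non-carriers
    n = a + b + c_ + d
    denom = (a + b) * (c_ + d) * (a + c_) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a margin is empty, so the test is uninformative
    chi2 = n * (a * d - b * c_) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # Hypothetical bin with three variants; the third is too common to be collapsed.
    flags = collapse_carriers([[1, 0, 0], [0, 0, 2], [0, 0, 0], [0, 1, 0]],
                              [0.01, 0.03, 0.20])
    print(flags, chi_square_2x2(flags, [1, 1, 0, 0]))
```

Collapsing to a count of rare alleles per individual, rather than a carrier flag, would follow the same pattern with a different univariate test.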
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-010-0889-1) contains supplementary material, which is available to authorized users.