|Home | About | Journals | Submit | Contact Us | Français|
Genome-wide association studies (GWASs) have confirmed the ubiquitous existence of genetic heterogeneity for common disease: multiple common genetic variants have been identified to be associated, while many more are yet expected to be uncovered. On the other hand, the single SNP-based trend test (or its variants) that has been dominantly used in GWASs is based on contrasting the allele frequency difference between the case and control groups, completely ignoring possible genetic heterogeneity. In spite of the widely accepted notion of genetic heterogeneity, we are not aware of any previous attempt to apply genetic heterogeneity-motivated methods in GWAS. Here, to explicitly account for unknown genetic heterogeneity, we applied a mixture model-based single SNP test to the WTCCC GWAS data with traits Crohn’s disease, bipolar disease, coronary artery disease and type 2 diabetes, identifying much larger numbers of significant SNPs and risk loci for each trait than those of the popular trend test, demonstrating potential power gain of the mixture model-based test.
Genome-wide association studies (GWASs) have been extremely successful in identifying thousands of common genetic variants, mostly single nucleotide polymorphisms (SNPs), associated with common disease and complex traits (NHGRI Catalog: http://wwww.genome.gov/gwastudies/). Some important discoveries on the genetic architecture for complex traits are the following. First, genetic heterogeneity is everywhere: multiple SNPs and risk loci have been identified for many common disease and complex traits, suggesting the plausibility and even ubiquity of a polygenic model. Second, the effect sizes of most causal SNPs are estimated to be from moderate to small, many of which with smaller effect sizes are yet to be identified. Consequently, larger sample sizes and more powerful statistical tests are always needed in order to identify more risk loci. On the other hand, almost all GWASs have adopted single SNP-based analysis without explicitly accounting for (unknown) genetic heterogeneity, leading to possible loss of power as convincingly shown in simulation studies (Zhou and Pan 2009; Londono et al 2012; Qian and Shao 2013). Note here we consider unknown genetic heterogeneity here, rather than known phenotype heterogeneity as discussed in Darabi and Humphreys (2011). Based on a two-component binomial mixture model of Zhou and Pan (2009), if any given locus contains a causal SNP, then the patient population is decomposed into two subpopulations: the first subpopulation consists of the patients whose disease is associated with the disease allele at the locus, while the second includes those with disease caused by other unknown alleles at other unknown risk loci. In addition to contrasting the allele frequencies (i.e. means or first moments) between the control and case groups as targeted by the most popular 1df trend test, the proposed mixture model can capture some other distributional differences of the allele (i.e. second moments) between the two groups. For example, in an extreme case, even if there is barely any difference of the allele frequencies between the two groups, leading to no power of the trend test, if the mixture model assumption holds, then a mixture model-based likelihood ratio test (LRT-H) may still be able to detect the distributional differences of the allele between the two groups. Qian and Shao (2013) extended the two-component binomial mixture model to one with more than two components, for which an LRT-H statistic with a simple closed-form and an asymptotic null distribution was derived, facilitating its application to GWAS. Surprisingly, to our knowledge, it has not yet been applied to any GWAS. In fact, we are not aware of any other analyses of GWASs that explicitly account for genetic heterogeneity. Here we review the LRT-H and apply it to the WTCCC GWAS data. To ensure that our conclusion is not limited to any specific disease, we considered multiple diseases. We demonstrate that the LRT-H test can be much more powerful than the popular trend test in identifying a much larger number of associated SNPs and risk loci.
For each subject i, suppose Xi = 0, 1 or 2 is the number of the minor allele of a SNP to be tested, and Yi = 0 or 1 is the disease indicator. The methods are all based on single SNP analysis by testing on each SNP individually and separately, hence we can focus on only one SNP. The goal is to test for possible association between the SNP and disease.
Most of the existing association tests ignore possible genetic heterogeneity due to the disease. For example, the most popular Cochran-Armitage 1df trend test can be formulated as the Score test in a logistic regression model (Wellek and Ziegler 2011) (or more generally a GLM or Cox PHM for other types of traits)
to test the null hypothesis H0: β1 = 0. It is well known (Clayton et al 2004) that the Score test is
where (d) is the sample mean of Xi’s with Yi = d for d = 0 or 1. The Score test is asymptotically equivalent to the Z-test for one SNP (and equivalent to Hotelling’s T2 test for multiple SNPs (Xiong et al 2002; Fan and Knapp 2003)). Specifically, one can model the conditional distribution of Xi as binomial:
for which we’d like to test the null hypothesis , which is equivalent to the original H0. It is easy to see that, the Wald test for is
which differs from TS in the variance estimates used in the denominator, but nonetheless is asymptotically equivalent to TS. Furthermore, the asymptotically equivalent likelihood ratio test (LRT) can be also applied:
where ng and mg are the genotype frequencies as summarized in Table 1, the maximum likelihood estimates (MLEs) of the minor allele frequencies (MAFs) under H0 and H1 are
is the probability mass function for a binomial distribution X ~ Bin(2, p).
The above three tests all share the same (asymptotic) null distribution as a chi-squared distribution with 1 df.
Rather than using a trend test based on the additive genetic model for Xi, a more general 2df test can be formulated by fitting an expanded regression model:
and we test with one of the three asymptotically equivalent Score test, Wald test and LRT, all with an asymptotically null distribution of with df=2. Interestingly, as pointed out by Kim et al (2010), β2 measures the difference of Hardy-Weinberg coefficients in the disease and control groups, hence a 2df test can be regarded as testing on both allele frequency difference and Hardy-Weinberg coefficient difference between the two groups. In this paper, we used R function glm() to fit a logistic regression model and applied the Wald test.
A Hardy-Weinberg Equilibrium (HWE) test can be applied to the case group (with notation shown in Table 1) for association analysis (Nielsen et al 1998). Given that the total number of observed minor allele is na = 2n2 + n1, under the assumption of HWE, the probability of observing n1 heterozygotes:
and the p-value is calculated as
We conducted the HWE exact test of Wigginton et al (2005) as implemented in function hwexact() from R package hwde.
To fully and explicitly account for genetic heterogeneity, Zhou and Pan (2009) proposed a binomial mixture model for the disease group while using a usual binomial model for the control group:
where θ0 is the background MAF for the controls. In contrast, for the case group, we assume θ is the probability of having the minor allele on a chromosome for a subpopulation of cases with disease caused by (or associated with) the minor allele, while for other subpopulations of cases the disease is caused by (unknown) variants at other unlinked loci, and thus for them the probability of having the minor allele at the locus of interest is the same as that for the controls. We test H0 : θ = θ0 or π = 0. Zhou and Pan (2009) considered more general scenarios with different θ0’s for cases and controls, or with more than one non-null component for cases, but recommended the above two-component mixture model due to the non-identifiability issues with the more general models.
There are several important implications from the mixture model. First, the mixture model differs from the usual (implicit) assumption of with θ* = πθ + (1 − π)θ0 for cases. Although E(X) = E(X*), it is shown (Zhou and Pan 2009) that
where the strict inequality holds for the non-degenerated case with θ* ≠ θ0, π ≠ 0 and π ≠ 1. Hence, the binomial mixture model introduces an overdispersion of the minor allele as compared to a binomial distribution with the same MAF. While the most popular 1df trend test compares the mean difference of the genotype scores between the control and case groups, it ignores the possible genetic heterogeneity in the case group and thus possible differences in high moments of the genotype scores between the two groups. Hence, taking advantage of the existing genetic heterogeneity (as modeled by the mixture model), an association test can gain power in detecting differences in both the means (i.e. first moments) and higher-order moments (e.g. second moments) between the control and case groups, which is closely related to the expanded regression model (2). Second, as shown by Zhou and Pan (2010), the mixture model also implies that HWE is violated in the case group under genetic heterogeneity, suggesting its connection to the HWE test.
Since complex diseases can be caused by a large number of genetic variants, it may be desirable to use a mixture model with more than two components to capture the complex heterogeneity. Qian and Shao (2013) extended the two-component mixture model to a more general form for the disease group as follows:
with J ≥ 2, 0 < θj < 1 and πj ≥ 0 for any j = 1, …, J, and . Note that J ≥ 2 is unknown and does not need to be specified.
To test the null hypothesis H0: θ1 = θ2 = ··· = θJ (or πj = 1 and πk = 0 ∀k ≠ j), Qian and Shao (2013) developed a likelihood ratio test under genetic heterogeneity (LRT-H). The likelihoods LH and L0 are the same as before except
Under H0, the LRT-H statistic TLRT-H = 2 log(LDLH) − 2 log L0 asymptotically follows a mixture of two chi-squared distributions with 1 and 2 dfs respectively, i.e. . Note that since the model (5) is not equivalent to the model (6), the asymptotic null distribution of the LRT-H statistics for the two models may be different.
We applied the methods to the Wellcome Trust Case Control Consortium (WTCCC) data (Burton et al 2007). The data include two control groups, called the National UK Blood Services (NBS) and 1958 British Birth Cohort (58C). To illustrate that a conclusion is not limited to a specific disease, we considered four traits: Bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), and type 2 diabetes (T2D).
We followed the quality control procedures of Burton et al (2007) to screen for subjects and SNPs. In addition, we removed any SNP with a p-value < 5.7×10−7 by the LRT-H test contrasting the two control groups (58C vs NBS) or (NBS vs 58C); the same cutoff 5.7×10−7 was used by Burton et al (2007) for the HWE test applied to the combined control group to remove SNPs. After QC, the genotyping rates were greater than 99.9% for all the four datasets (each with a combined case and control sample). The numbers of subjects in the control group (i.e. NBS+58C) before and after QC were 3,004 and 2,938 respectively. The numbers of autosomal SNPs and subjects for each disease group before and after QC are summarized as in Table 2.
A genome-wide scan was conducted for each dataset with each of the four tests applied to each SNP. The corresponding Q-Q plots are shown in Figure 1, confirming no obvious population structures, as supported by the estimated inflation factors all close to 1 (Table 3).
After identifying significant SNPs at the usual genome-wide significance level of 5 × 10−8, we applied the method of Psychiatric Genomics Consortium (PGC, 2014) to define LD-independent (index) SNPs and risk loci. Briefly, first, among all significant SNPs, an SNP is defined to be an LD-independent SNP if it is in weak LD with r2 < 0.1 with a more significant SNP within a 0.5Mb window; second, a risk locus is defined as a basepair (BP) interval including all the SNPs with r2 > 0.6 to an LD-independent SNP, and any two risk loci within the distance of 0.25Mb are merged.
The numbers of significant SNPs and risk loci identified by each test are shown in Figures 2 and and3.3. It is clear that HWE test identified the largest numbers of significant SNPs and risk loci, most of which overlapped with those of the LRT-H test; this can be explained by the close connection between the two tests: a binomial mixture model implies the Hardy-Weinberg disequilibrium. Second, by the close relationship between the binomial mixture model and the expanded regression model (2), most of the significant SNPs and risk loci identified by the 2df test were also uncovered by the LRT-H test. Third, perhaps most importantly, since the LRT-H test also contrasts the allele frequency differences between the case and control groups as does the 1df trend test, the significant SNPs and risk loci identified by the popular trend test were almost all recovered by the LRT-H test. Finally, the LRT-H test also discovered some unique significant SNPs and risk loci.
Some example SNPs identified to be significant by LRT-H, but not by other tests, are shown in Table 4. Some significant risk loci identified by the LRT-H or HWE test, but not by the other two tests, are confirmed to be within 0.25 Mb of some previously identified SNPs or risk loci, as shown in Table 5. The LocusZoom plots (Pruim et al 2010) for the significant risk loci uniquely identified by LRT-H are shown in Figures 4 – 7. To facilitate interpretation, we also added their predicted GenoCanyon scores (Lu et al 2015); a higher score predicts a higher likelihood of the SNP’s being functional. As GenoCanyon only supports hg19 while the original WTCCC data were all based on hg18, we lifted the annotation for the WTCCC data to hg19 using the UCSC web interface (http://genome.ucsc.edu/cgi-bin/hgLiftOver) before generating LocusZoom and GenoCanyon plots. We can see that some of the significant risk loci are in the regions with high functional scores.
We have shown possible power gain of the proposed LRT-H test which is closely related to the Hardy-Weinberg equilibrium (HWE) test. Briefly speaking, when applied to the WTCCC GWAS data, we found that the HWE test on the case group could identify the largest number of associated SNPs and risk loci for each of the four diseases considered, most of which overlapped with those of LRT-H, though LRT-H could uniquely detect some risk loci too. Although the HWE test has long been proposed as a possible choice for association testing (Nielsen et al 1998; Wittke-Thompson et al 2005), it has seldom been used and “often under-exploited” (Balding 2006) for such a purpose but most widely used only for SNP genotyping quality control. It is likely that some of the significant SNPs identified by the HWE test and LRT-H are due to genotyping errors, but some may be true positives. A challenge is to validate and interpret them as for any new discovery in GWAS.
R code will be available on the corresponding author’s website (http://www.biostat.umn.edu/~weip/prog.html).
The authors thank the reviewers for constructive comments. This research was supported by NIH grants R01GM113250, R01HL105397 and R01HL116720, and by the Minnesota Supercomputing Institute at the Univeristy of Minnesota.
“This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113.”