Genetic mutations underlying most human complex diseases are non-Mendelian. These mutations may show little main effect on the disease, with low penetrance, but may interact with each other in complex ways. We refer to multilocus interactions associated with the disease as epistasis. In particular, under epistasis the joint association of 2 or more single nucleotide polymorphisms (SNPs) is greater than their marginal associations. It has been speculated (Moore and Williams, 2002) and reported (Ritchie and others, 2001), (Zee and others, 2002), (Williams and others, 2004), (Tsai and others, 2004), (Cho and others, 2004) that epistasis contributes to many human complex traits. Simulation studies (Marchini and others, 2005), (Zhang and Liu, 2007), (Wanwan and others, 2009) further showed that epistasis mapping is computationally feasible in large-scale case–control studies.
The power of epistasis association mapping can be greatly affected by the tagging SNPs genotyped in a case–control study. Assume a disease is affected by an interaction between 2 mutations, a and b, at 2 loci. If each of the 2 mutations, a and b, is strongly correlated with a tagging SNP, A and B, respectively, an interaction association between the 2 loci can be detected between A and B (using a saturated test of 9 possible genotype combinations). If, however, one disease mutation a (or b) is strongly correlated with a combination of tagging SNPs A1 and A2 (or B1 and B2) but weakly correlated with each SNP individually, the interaction between the 2 loci will only be detectable by testing a combination of A1, A2, and B (or A, B1, and B2). Testing more than 2 tagging SNPs is statistically more challenging because the number of possible genotype combinations and the number of multiple comparisons grow exponentially. One way to improve the power of epistasis mapping is to impute untyped SNPs from a reference sample via linkage disequilibrium (LD) and test disease associations on both tagging SNPs and imputed SNPs (Marchini and others, 2007). SNP imputation is statistically feasible given the LD between closely located SNPs. Many algorithms have been developed for imputation-based association mapping, including IMPUTE (Marchini and others, 2007), Bayesian Imputation-Based Association Mapping (BIMBAM) (Servin and Stephens, 2007), Testing Untyped Alleles (TUNA) (Nicolae, 2006), BEAGLE (Browning, 2008), and SNPMStat (Lin and others, 2008). Most methods follow a 2-step approach, where the missing genotypes at untyped SNPs are first imputed from a reference panel, without distinguishing cases and controls. Association tests are then performed on the imputed genotypes. Alternatively, methods that impute the missing genotypes and test disease associations simultaneously have been developed (Lin and others, 2008), and these were shown to be more powerful than 2-step approaches.
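As an illustration of the saturated pairwise test mentioned in this paragraph, the following minimal sketch computes a Pearson chi-square statistic on the 2 x 9 table of case–control status by joint genotype at two SNPs. It is a generic contingency-table test, not the method proposed in this paper; the function name and the coding of genotypes as minor-allele counts 0, 1, 2 are assumptions made for the example.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # assumed coding: minor-allele count at a biallelic SNP


def saturated_chi_square(cases, controls):
    """Pearson chi-square over the joint genotypes of two SNPs.

    cases, controls: non-empty lists of (genotype_A, genotype_B) tuples.
    Returns (statistic, degrees_of_freedom).
    """
    cells = list(product(GENOTYPES, repeat=2))  # the 9 joint genotypes
    case_counts = {cell: 0 for cell in cells}
    control_counts = {cell: 0 for cell in cells}
    for genotype in cases:
        case_counts[genotype] += 1
    for genotype in controls:
        control_counts[genotype] += 1

    n_case, n_control = len(cases), len(controls)
    total = n_case + n_control
    statistic = 0.0
    used_cells = 0
    for cell in cells:
        column_total = case_counts[cell] + control_counts[cell]
        if column_total == 0:
            continue  # unobserved joint genotypes contribute nothing
        used_cells += 1
        for observed, row_total in (
            (case_counts[cell], n_case),
            (control_counts[cell], n_control),
        ):
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    # (rows - 1) * (columns - 1) with 2 rows and `used_cells` columns
    return statistic, used_cells - 1
```

When all 9 joint genotypes are observed the test has 8 degrees of freedom. A p-value can be obtained from the chi-square distribution with the returned degrees of freedom.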
SNP imputation can potentially improve the power of epistasis mapping. Rather than directly testing epistasis on tagging SNPs, which may involve a large number of genotype combinations and multiple comparisons, testing epistasis on imputed SNPs can be useful. For the example given above, if we impute an untyped SNP Aimp from the tagging SNPs A1 and A2 (or impute Bimp from B1 and B2), we can test the interaction association between Aimp and B (or A and Bimp), such that the interaction involves only 9 genotype combinations (as opposed to 27 for 3 tagging SNPs) and the multiple comparison problem is restricted to pairwise comparisons. As a result, we gain power by using imputed SNPs as long as they capture the epistasis information from the tagging SNPs. This is a dimension reduction approach that projects the high-dimensional information of many tagging SNPs onto a low-dimensional set of a few untyped SNPs, where the projection retains a maximum amount of association information.
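The arithmetic behind this dimension reduction can be checked with a short sketch. It counts the joint genotypes of a set of biallelic SNPs (3 genotypes each) and the number of SNP sets that must be tested at a given interaction order. The function names and the panel size of 1000 SNPs are hypothetical and chosen only for illustration.

```python
from math import comb


def genotype_combinations(num_snps):
    """Number of joint genotypes for `num_snps` biallelic SNPs."""
    return 3 ** num_snps


def num_tests(panel_size, order):
    """Number of distinct SNP sets of size `order` in a panel."""
    return comb(panel_size, order)


if __name__ == "__main__":
    panel_size = 1000  # hypothetical number of SNPs in a region
    for order in (2, 3):
        print(
            f"order {order}: "
            f"{genotype_combinations(order)} joint genotypes per test, "
            f"{num_tests(panel_size, order)} tests"
        )
```

Moving from pairs to triples both triples the number of cells in each test (9 to 27) and multiplies the number of tests by roughly a third of the panel size, which is why replacing a pair of tagging SNPs with one imputed SNP keeps the search pairwise.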
In this paper, we propose a new Bayesian model for joint SNP imputation and epistasis association mapping. Our method uses SNP blocks (Zhang and others, 2010) to account for LD among SNPs and imputes untyped SNPs iteratively using Markov chain Monte Carlo (MCMC) algorithms. Our method is Bayesian and models all SNPs and their LD using a full probability function, such that the imputations of all untyped SNPs are consistent. Most existing methods do not model LD between untyped SNPs, and hence a single disease association signal captured by tagging SNPs can be overly reproduced at all nearby untyped SNPs. Our model treats the reference data as random, with observation uncertainty properly accounted for using prior distributions. The imputation result is therefore more robust to outliers than methods that take the reference at face value. Our method performs imputation conditional on the disease association status of SNPs. The reference data represent only the normal population, and hence conditional imputation is more appropriate for imputing SNPs around disease loci. Our approach is similar to Bayesian fine mapping, which models the observed SNPs conditional on some unobserved “causal” mutations. We further utilize a dynamic block partitioning algorithm to account for the variability of LD among SNPs, which is particularly useful in genomic regions that show a vague block pattern of LD. Our method effectively explores all plausible block partitions, such that an untyped SNP is imputed over different partitions of tagging SNPs, and hence the power of association mapping can be improved by finding the most plausible block partitions.
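To give some intuition for why treating the reference as random can stabilize imputation, the following toy sketch compares nothing more than a Beta-prior posterior mean with raw reference frequencies. It is not the model proposed here: it assumes phased haplotypes, a single untyped biallelic SNP, and a single fixed block of tagging SNPs, and the function name and the default pseudocounts of 0.5 are hypothetical.

```python
def impute_allele_probability(reference_haplotypes, tag_haplotype,
                              prior_a=0.5, prior_b=0.5):
    """Posterior mean probability that the untyped allele is 1.

    reference_haplotypes: list of (tag_alleles, untyped_allele) pairs,
        where tag_alleles is a tuple of 0/1 alleles at the tagging SNPs
        and untyped_allele is 0 or 1.
    tag_haplotype: tuple of tagging alleles seen in a study haplotype.
    prior_a, prior_b: Beta prior pseudocounts (illustrative defaults).
    """
    ones = 0
    zeros = 0
    for tag_alleles, untyped_allele in reference_haplotypes:
        if tag_alleles != tag_haplotype:
            continue  # only reference haplotypes matching the tags count
        if untyped_allele == 1:
            ones += 1
        else:
            zeros += 1
    # Beta(prior_a, prior_b) prior updated with the matching counts
    return (ones + prior_a) / (ones + zeros + prior_a + prior_b)


if __name__ == "__main__":
    reference = (
        [((0, 1), 1)] * 3 + [((0, 1), 0)] * 1 + [((1, 1), 0)] * 2
    )
    for tags in [(0, 1), (1, 1), (0, 0)]:
        print(tags, impute_allele_probability(reference, tags))
```

With few matching reference haplotypes the prior keeps the imputed probability away from 0 or 1, and a tagging haplotype absent from the reference falls back to the prior mean instead of being undefined.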
We demonstrate that imputing untyped SNPs in cases and controls can consistently and sometimes substantially improve the power of epistasis mapping. Comparing nonepistasis and epistasis disease models, we show that the power improvement from SNP imputation is greater for detecting epistasis than for detecting single-SNP associations. For single-SNP associations, imputing additional SNPs may not in general improve power over using tagging SNPs alone. For epistasis associations, on the other hand, we observed a consistent gain in power using our method, despite the greatly increased number of multiple comparisons induced by imputation. Our simulation results showed that the power of epistasis mapping can be strongly affected by the tagging SNPs genotyped around the disease loci. By imputing additional SNPs, the disease information jointly captured by many tagging SNPs can be summarized by a few imputed SNPs, which results in simpler tests that reach statistical significance more easily. We further show that the resolution of association mapping can be consistently improved by SNP imputation, for both single-SNP and epistasis mapping. We demonstrate the application of our method using a data set of inflammatory bowel disease (IBD) from The Wellcome Trust Case Control Consortium (WTCCC, 2007).