Home | About | Journals | Submit | Contact Us | Français |

**|**Biostatistics**|**PMC3062153

Formats

Article sections

- Abstract
- 1. INTRODUCTION
- 2. MATERIALS AND METHODS
- 3. RESULTS
- 4. DISCUSSION
- SUPPLEMENTARY MATERIAL
- FUNDING
- References

Authors

Related links

Biostatistics. 2011 April; 12(2): 211–222.

Published online 2010 October 5. doi: 10.1093/biostatistics/kxq063

PMCID: PMC3062153

Yu Zhang^{}

Department of Statistics, The Pennsylvania State University, 422A Thomas Building, University Park, PA 16802, USA

Email: ude.usp.tats@gnahzuy

Received 2010 March 25; Revised 2010 August 21; Accepted 2010 August 31.

Copyright © The Author 2010. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

This article has been cited by other articles in PMC.

Genetic mutations may interact to increase the risk of human complex diseases. Mapping of multiple interacting disease loci in the human genome has recently shown promise in detecting genes with little main effects. The power of interaction association mapping, however, can be greatly influenced by the set of single nucleotide polymorphism (SNP) genotyped in a case–control study. Previous imputation methods only focus on imputation of individual SNPs without considering their joint distribution of possible interactions. We present a new method that simultaneously detects multilocus interaction associations and imputes missing SNPs from a full Bayesian model. Our method treats both the case–control sample and the reference data as random observations. The output of our method is the posterior probabilities of SNPs for their marginal and interacting associations with the disease. Using simulations, we show that the method produces accurate and robust imputation with little overfitting problems. We further show that, with the type I error rate maintained at a common level, SNP imputation can consistently and sometimes substantially improve the power of detecting disease interaction associations. We use a data set of inflammatory bowel disease to demonstrate the application of our method.

Genetic mutations underlying most human complex diseases are non-Mendelian. These mutations may show little main effects to the disease with low penetrance but possibly interact with each other in a complex way. We refer to multilocus interactions with the disease as epistasis. In particular, the joint association of 2 or more single nucleotide polymorphisms (SNPs) is greater than their marginal associations. It has been speculated (Moore and Williams, 2002) and reported (Ritchie *and others*, 2001), (Zee *and others*, 2002), (Williams *and others*, 2004), (Tsai *and others*, 2004), (Cho *and others*, 2004) that epistasis contributes to many human complex traits. Simulation studies (Marchini *and others*, 2005), (Zhang and Liu, 2007), (Wanwan *and others*, 2009) further showed that epistasis mapping is computationally feasible in large-scale case–control studies.

The power for epistasis association mapping can be greatly affected by the tagging SNPs genotyped in a case–control study. Assume a disease is affected by an interaction between 2 mutations, *a* and *b*, at 2 loci. If each of the 2 mutations *a* and *b* is strongly correlated with a tagging SNP, *A* and *B*, respectively, an interaction association between the 2 loci can be detected between *A* and *B* (using a saturated test of 9 possible genotype combinations). If one disease mutation *a* (or *b*) is strongly correlated with a combination of tagging SNPs *A*_{1} and *A*_{2} (or *B*_{1} and *B*_{2}) but weakly correlated with each SNP individually, however, the interaction between the 2 loci will only be detectable by testing a combination of *A*_{1}, *A*_{2}, *B* (or *A*, *B*_{1}, *B*_{2}). Testing more than 2 tagging SNPs is statistically more challenging because the number of possible genotype combinations and the number of multiple comparisons grows exponentially. One way to improve the power of epistasis mapping is to impute untyped SNPs from a reference sample via linkage disequilibrium (LD) and test disease associations on both tagging SNPs and imputed SNPs (Marchini *and others*, 2007). SNP imputation is statistically feasible given the LD between closely located SNPs. Many algorithms have been developed for imputation-based association mapping, including IMPUTE (Marchini *and others*, 2007), Bayesian Imputation-Based Association Mapping (BIMBAM) (Servin and Stephens, 2007), Testing Untyped Alleles (TUNA) (Nicolae, 2006), BEAGLE (Browning, 2008), and SNPMStat (Lin *and others*, 2008). Most methods follow a 2-step approach, where the missing genotypes at untyped SNPs are first imputed from a reference panel, without distinguishing cases and controls. Association tests are then performed on the imputed genotypes. Alternatively, methods that impute the missing genotypes and test disease associations simultaneously have be developed (Lin *and others*, 2008), which were shown to be more powerful than 2-stage approaches.

SNP imputation can potentially improve the power of epistasis mapping. Rather than directly testing epistasis on tagging SNPs, which may involve a large number of genotype combinations and multiple comparisons, testing epistasis on imputed SNPs can be useful. For the example given above, if we impute an untyped SNP *A*_{imp} from the tagging SNPs *A*_{1} and *A*_{2}, (or to impute *B*_{imp} from *B*_{1} and *B*_{2}), we can test interaction association between *A*_{imp} and *B* (or *A* and *B*_{imp}) such that the interaction only involves 9 genotype combinations (as oppose to 27 for 3 tagging SNPs), and the multiple comparison problem is within pairwise comparisons. As a result, we gain power by using imputed SNPs as long as they capture the epistasis information from tagging SNPs. This is a dimension reduction approach that projects high-dimensional information of many tagging SNPs into a low dimension of a few untyped SNPs, where the projection retains a maximum amount of association information.

In this paper, we propose a new Bayesian model for joint SNP imputation and epistasis association mapping. Our method uses SNP blocks (Zhang *and others*, 2010) to account for LD among SNPs and imputes untyped SNPs iteratively using Markov chain Monte Carlo (MCMC) algorithms. Our method is Bayesian that models all SNPs and their LD using a full probability function, such that the imputation of all untyped SNPs are consistent. Most existing methods do not model LD between untyped SNPs, and hence a single disease association signal captured by tagging SNPs can be overly reproduced at all nearby untyped SNPs. Our model treats the reference data as random, with observation uncertainty properly accounted for using prior distributions. The imputation result is therefore more robust to outliers than methods that make a *bona fide* use of the reference. Our method performs imputation conditional on the disease association status of SNPs. The reference data only represents the normal population, and hence the conditional imputation is more appropriate to impute SNPs around disease loci. Our approach is similar to Bayesian fine mapping that models the observed SNPs conditional on some unobserved “causal” mutations. We further utilize a dynamic block partitioning algorithm to account for the variability of LD among SNPs, which is particularly useful at genomic regions that demonstrate vague block pattern of LD. Our method effectively explores all plausible block partitions, such that an untyped SNP is imputed over different partitions of tagging SNPs, and hence the power of association mapping can be improved by finding the most plausible block partitions.

We demonstrate that imputing untyped SNPs in cases and controls can consistently and sometimes substantially improve the power of epistasis mapping. Comparing nonepistasis and epistasis disease models, we show that the power improvement by SNP imputation is greater for detecting epistasis than detecting single SNP associations. For single SNPs associations, imputing additional SNPs may not in general improve the power than using tagging SNPs alone. For epistasis associations, on the other hand, we observed a consistent gain of power using our method, despite of the greatly increased number of multiple comparisons induced by imputation. Our simulation results showed that the power of epistasis mapping can be strongly affected by the tagging SNPs genotyped around the disease loci. By imputing additional SNPs, the disease information jointly captured by many tagging SNPs can be summarized by a few imputed SNPs, which result in simpler tests and hence easier to reach statistical significance. We further show that the resolution of association mapping can be consistently improved by SNP imputation, for both single SNP and epistasis mapping. We demonstrate the application of our method using a data set of inflammatory bowel disease (IBD) from The Wellcome Trust Case Control Consortium (WTCCC, 2007).

We first define a probability function for multilocus genotypes or haplotypes (multilocus alleles) over a set of SNPs. Suppose there are *m* SNPs under consideration, the total number of possible genotype combinations is *d* = 3^{m} (and *d* = 2^{m} for haplotypes). For simplicity, we describe our model using multilocus genotypes throughout the paper, but it can easily handle haplotypes if such data are available. Let *π* = {*π*_{i}}_{i = 1,…,d} denote the frequencies of *d* genotype combinations, *X*_{m} denotes the observed genotype data over *m* SNPs, and *n* = {*n*_{i}}_{i = 1,…,d} denotes the observed number of genotype combinations in *X*_{m}. We write the probability function of *X*_{m} as . We assign a Dirichlet prior to *π* with parameters (or ${\alpha}_{i}=\frac{1}{d}$ for haplotypes) and integrate out *π* to obtain

(2.1)

Here, |*x*| denotes the summation of all elements in *x*.

We next define a more general probability function for a set of *m*_{2} SNPs conditional on a set of *m*_{1} SNPs. Let *d*_{1} = 3^{m1} and *d*_{2} = 3^{m2} denote the number of possible genotype combinations over each set of SNPs, respectively. Let *q* = {*q*_{i·}}_{i = 1,…,d1}, where *q*_{i·} = {*q*_{ij}}_{j = 1,…,d2}, denote the probabilities of observing genotype combination *j* over the *m*_{2} SNPs within individuals carrying genotype combination *i* over the *m*_{1} SNPs, and let *n* = {*n*_{i·}}_{i = 1,…,d1}, where *n*_{i·} = {*n*_{ij}}_{j = 1,…,d2}, denote the corresponding counts of genotype combinations in the sample. We write the probability function of genotypes at the *m*_{2} SNPs (*X*_{m2}), conditional on the genotypes at the *m*_{1} SNPs (*X*_{m1}), as . Without knowing *q*, we again assign a Dirichlet prior to each *q*_{i·} with parameters , and we integrate out *q* to obtain the marginalized conditional probability function:

(2.2)

SNP imputation relies on the LD between typed and untyped SNPs. We first describe an SNP block model (Zhang *and others*, 2010) to account for SNP LD, motivated by the block-like LD structure in the human genome (The International HapMap Consortium, 2005). Let *B* = {*b*_{0} = 0,*b*_{1},…,*b*_{k − 1},*b*_{k} = *L*} denote a partition of *L* physically ordered SNPs into *k* consecutive blocks, where SNPs with indices between [*b*_{i},*b*_{i + 1}) belong to the *i*th block. For notation simplicity, we will use *b*_{i} to denote the block [*b*_{i},*b*_{i + 1}). Within blocks, we assume that SNPs are highly correlated, and hence we model their genotype distributions jointly over the space of multilocus genotype combinations. Between blocks, we assume that SNPs are uncorrelated and we model SNPs in different blocks independently. We can express the SNP block model of genotypes *X* at *L* SNPs as *P*(*X*|*B*) = _{i = 1}^{k − 1}Pr(*X*_{bi}), where Pr(*X*_{bi}) is defined in formula (2.1).

To further model disease associations, we use the same SNP partition model introduced in BEAM (Zhang and Liu, 2007). SNPs are assigned into 3 nonoverlapping groups: group 1 contains SNPs marginally associated with the disease, group 2 contains SNPs jointly associated with the disease, that is, interactions, and group 0 contains all other SNPs unassociated with the disease. We introduce a vector *I* = {*I*_{i} = 0,1,2}_{i = 1,…,L} denoting the group memberships of all SNPs. Our interest is to infer the posterior distribution of *I*. For simplicity, we first only consider marginal disease associations (i.e. no group 2). It suffices to describe our association model for a single block, and the full model is the product of probabilities of individual blocks. Within a block *b*_{i}, let *b*_{i,1} denote the set of SNPs belonging to group 1 and *b*_{i,0} denotes the remaining SNPs in the block belonging to group 0, where *b*_{i} = {*b*_{i,0},*b*_{i,1}} are determined by the block variable *B* and the group variable *I*. Let *D* and *U* denote the cases and controls, respectively. We model the case–control data (*D*_{bi},*U*_{bi}) in block *b*_{i} as Pr(*D*_{bi},*U*_{bi}) = Pr(*D*_{bi,1})Pr(*U*_{bi,1})Pr((*D* + *U*)_{bi,0}|(*D* + *U*)_{bi,1}), where (*D* + *U*) denotes the union of case and control data. Similar expression will be used hereafter. The first 2 probabilities Pr(*D*_{bi,1}) and Pr(*U*_{bi,1}) are defined in formula (2.1), indicating that the distributions of cases and controls at the *b*_{i,1} SNPs are different. The last probability function Pr((*D* + *U*)_{bi,0}|(*D* + *U*)_{bi,1}) is defined in formula (2.2), indicating that the conditional distributions of the *b*_{i,0} SNPs are the same between cases and controls given the *b*_{i,1} SNPs.

To further allow interaction association in the model (i.e. group 2), let *b*_{i,2} denote the SNPs belonging to group 2 in block *b*_{i}, we can modify the model as a conditional distribution given group 2 SNPs as

(2.3)

That is, blocks are conditionally independent given the group 2 SNPs. The full likelihood function of cases and controls over all SNPs, given *B* and *I*, can be expressed as

(2.4)

The probability function in the first product is defined in formula (2.3). We model group 2 SNPs in cases jointly as because they are interacting in the affected population. In contrast, we model group 2 SNPs in controls as independent across blocks. Both and Pr(*U*_{bi,2}) are defined in formula (2.2).

We call the above SNP block model as BEAM2 (Zhang *and others*, 2010), based on which we further extend to impute untyped SNPs for disease epistasis mapping. Untyped SNPs are completely missing in cases and controls, but they are observed in a reference panel like HapMap (The International HapMap Consortium, 2007). From a Bayesian perspective, unobserved genotypes can be treated as missing variables, the posterior distribution of which can be learned from the model using MCMC algorithms.

We first extend our block model to include both typed and untyped SNPs. Among *L* SNPs, assume some SNPs are completely missing in cases and controls but are observed in a group of reference individuals. Within a block *b*_{i}, we again assign SNPs into 3 groups (no association, marginal association, and epistasis association), regardless of whether the SNPs are observed. Let *b*_{i,j} and *b*_{i,j}^{*} denote the typed and untyped SNPs within block *b*_{i} in group *j* = 0,1,2, respectively, and *b*_{i} = {*b*_{i,0},*b*_{i,1},*b*_{i,2},*b*_{i,0}^{*},*b*_{i,1}^{*},*b*_{i,2}^{*}}. We write the probability function of *b*_{i,0,1} and *b*_{i,0,1}^{*} (group 0 and group 1 SNPs) conditional on *b*_{i,2} and *b*_{i,2}^{*} (group 2 SNPs) as

(2.5)

The probability functions in formula (2.5) are defined in formulas (2.1) and (2.2). *U* includes both controls and the reference data. Unlike many existing methods, our model imputes group 0 SNPs conditional on groups 1 2 SNPs, irrespective of whether the SNPs are typed or untyped. Our model is similar to fine-mapping approaches that model the observed genotypes conditional on some local unobserved “causal” mutations, but we limit the “causal” mutations to be either the typed or the untyped SNPs of interests.

With the probability function (2.5) of one block defined, we write the joint model of all SNPs (typed and untyped) conditional on a block partition *B* and SNP membership *I*, as

(2.6)

We assign a uniform prior distribution to *B* over all possible block partitions of SNPs, and we assign an independent identical multinomial distribution to the group membership variable *I*_{i} for each SNP *i*, with Pr(*I*_{i} = 0,1,2) = 0.95,0.025,0.025, respectively.

We randomly choose the initial values of all model parameters according to their prior distributions, and we initialize the missing genotypes at each untyped SNP separately according to the reference data. We then use MCMC algorithms to iteratively update the missing values of model parameters and untyped SNPs according to model (2.6) as follows:

- Update block partitions
*B*using one of the following Metropolis–Hastings (MH) moves on randomly chosen blocks: merge 2 adjacent blocks; split 1 block into 2; shift a block boundary at random. For each randomly chosen MH move, let*B*_{*}denote the new block configuration. We accept*B*_{*}with probability , where*π*(*B*) is computed from formula (2.6) multiplied by the block prior, and*q*(*B*→*B*^{*}) = 0.33 denoting the MH move frequency. We update each block once per iteration. - Impute untyped SNPs by a collapsed Gibbs sampler. For each untyped SNP of each individual, we calculate the posterior probabilities of all possible missing genotypes from formula (2.6), treating other untyped SNPs as “observed.” We then sample a genotype accordingly for the individual at the untyped SNP. We update all untyped SNPs for all individuals once per iteration.
- Update group membership (
*I*) according to formula (2.6) multiplied by the prior of*I*.

Directly imputing untyped SNPs from model (2.6) will be very inefficient because the number of untyped SNPs is much larger than the number of typed SNPs, and SNPs are highly correlated. We simplify our calculation as follows. Under the block independence assumption, when updating blocks, all blocks but the ones being updated are ignored. When updating untyped SNPs or SNP memberships, only data within the same block of the SNP and interacting SNPs are needed. It is worthy noting that, given any configuration of SNP membership *I*, untyped SNPs in group 0 (which is the majority case) can be ignored in imputation. We show in supplementary material available at *Biostatistics* online how to integrate out the untyped group 0 SNPs from model (2.6). Of course, we do not know which SNPs belong to group 0 *a priori*. We therefore impute SNPs based on the current SNP memberships, and we update SNP memberships simultaneously. We restrict that the number of typed SNPs within a block must double the number of untyped SNPs allowed in groups 1 and 2 in the same block. We also restrict that the observed number of distinct genotype combinations within a block, over typed and associated untyped SNPs, must be less than 10% of the sample size.

We considered 3 disease models in our simulation study (Table S1 of the supplementary material available at *Biostatistics* online). Model 1 is an additive model of 2 disease loci independently contributing to the disease risk. Model 2 is a 2-way interaction model with disease risk increased only when both disease loci have mutations. Model 3 is a 3-way interaction model with 3 disease loci that have no marginal effects when their minor allele frequencies (MAF) all equal to 0.5. We simulated 50 data sets (1000 cases, 1000 controls, approximately 250 typed SNPs and 750 untyped SNPs) for each disease model at MAF = 0.05, 0.1, 0.20, 0.50, respectively. Detailed simulation procedure can be found in the supplementary material available at *Biostatistics* online.

We first evaluated the imputation accuracy of our method using 200 null data sets. The accuracy is calculated as $\frac{1}{N}\sum _{i=1}^{N}{p}_{{g}_{i}}$, where *N* denotes the number of individuals, *g*_{i} denotes the true genotype of individual *i* at a untyped SNP, and *p*_{gi} denotes the estimated posterior probability of individual *i* carrying genotype *g*_{i}. Overall, the median imputation accuracy of our method is 94.3%. We also checked the imputation accuracy of our method using data sets generated under the 3 disease models and observed very similar results. For example, the imputation accuracy of disease model 2 is shown in Figure 1(a), which has median accuracy 94.1% for cases and 94.2% for controls. We also checked the convergence of our Markov chains (Liu, 2001), where the chains stabilized within 150 iterations for the simulated data sets. (Figure S1 of the supplementary material available at *Biostatistics* online.)

Accuracy of SNP imputation. (a) Imputation accuracy of our method calculated separately for cases (dot) and controls (line). Data sets from disease model 2 were used. (b) Quantile–quantile plot of single SNP association test statistics from 200 **...**

We next calculated the single SNP disease association statistics using a chi-square test of 2 degrees of freedom (df) on imputed genotypes. As shown in Figure 1(b), from the 200 null data sets, the chi-square statistics closely followed the expected null distribution. We also calculated the family-wise type I error rate of our method for both marginal and epistasis associations. At the Bonferroni corrected significance level of 0.05, we observed 9 significant marginal associations out of 200 null data sets, yielding a family-wise type I error rate of 0.045. For testing epistasis, we tested 2-way and 3-way interactions among SNPs whose posterior association probability (marginal association plus epistasis) were ≥0.05. We observed only one significant 2-way interaction and no 3-way interactions from the 200 null data sets. This result indicates that the Bonferroni correction for epistasis is conservative. To further evaluate the false-positive rate of imputation in presence of disease association, we simulated additional 200 data sets containing 2 parts: the first 500 SNPs are simulated from the disease model 2 at 4 different MAFs (50 data sets for each MAF), the second 500 SNPs are simulated from the null distribution, and the SNPs between 2 parts are simulated independently. We observed 7 marginal associations and 5 interactions either within the second component or between the 2 components, yielding a false-positive rate of 0.035 for marginal associations and 0.025 for interaction associations at level 0.05.

We next compare the power of our method with 3 existing algorithms: IMPUTE (Marchini *and others*, 2007), BIMBAM (Servin and Stephens, 2007), and SNPMStat (Lin *and others*, 2008). IMPUTE is a 2-stage approach that imputes missing genotypes first, followed by association tests treating the imputed SNPs as “observed.” BIMBAM is a Bayesian approach that imputes missing genotypes and evaluates associations by Bayes factors. SNPMStat is an expectation–maximization approach. We compared the power of all methods before and after imputing SNPs. At Bonferroni adjusted 0.05 level, the power of each method was calculated as the proportion of true disease loci detected. A disease locus is regarded as detected if at least one significant SNP within its neighborhood is detected by a method. SNPs in LD tend to show strong association signals simultaneously, we therefore only count a significant SNP if it is the most significant compared to other SNPs within 20 kb. For IMPUTE, we used a 2-df chi-square test to map single SNP disease associations. For BIMBAM, we used permutation to empirically determine the statistical significance of its Bayes factors for both single SNP and 2-way interaction associations. For SNPMStat, we specified the additive model for imputation, as it consistently produced better results than other models allowed by SNPMStat. Both IMPUTE and SNPMStat cannot test interactions, and we use their results to benchmark the performance of single SNP mapping via imputation. BEAM2 and BIMBAM are both Bayesian methods that impute untyped SNPs and detect interactions. It was computationally very expensive for BIMBAM to test high-order interactions, we therefore only compared the power of 2-way interactions by BIMBAM and BEAM2. For BEAM2, we tested 2-way interactions among SNPs with posterior association probability (marginal association plus epistasis) ≥0.05.

The powers of the 4 methods before and after imputation are shown in Figure 2. In model 1 (no interactions), we observed that BIMBAM and SNPMStat performed slightly better than IMPUTE and BEAM2. This may be explained by the fact that both BIMBAM and SNPMStat specifically tested the additive effects of disease alleles, where IMPUTE and BEAM2 tested genotype associations without assuming Hardy–Weinberg equilibrium. Comparing the power before and after imputation (dotted and solid lines in Figure 2, respectively), no methods clearly gained power by imputing SNPs (except for MAF = 0.2). Interestingly, within a small distance (20 kb or less) to the true disease loci, all methods showed improved power after imputing SNPs, suggesting that imputing untyped SNPs can improve the mapping resolution of true disease loci. For the interaction models 2 and 3, our method clearly obtained the best power among the 4 methods. Since IMPUTE and SNPMStat only tested single SNP associations, they were expected to loose power. BIMBAM tested both single SNP and 2-way interactions, yet we did not observe a clear power improvement. In comparison, BEAM2 consistently gained power in detecting epistasis. This is nontrivial given the amount of extra 2-way interaction tests induced by imputation. Comparing the outputs of each method, we also observed that BEAM2 most accurately pinpointed the disease loci and best distinguishes between marginal and epistasis effects (Figure S2 of the supplementary material available at *Biostatistics* online).

Power of BEAM2 (“°”), BIMBAM (“”), SNPMStat (“ × ”), and IMPUTE (“•”) for 3 disease models (a color version is available in the Supplementary Material at **...**

A possible explanation for the power gain through imputation is that, before imputing untyped SNPs, the epistasis is jointly captured by multiple tagging SNPs. A direct testing on those tagging SNPs involves large complexity and thus is difficult to reach family-wise significance. After imputing SNPs, the epistasis information is projected into a lower dimension of a few untyped SNPs from which the interactions become detectable. Using an exhaustive 2-way interaction search, we confirmed that the smallest *p*-values obtained by our imputation are often smaller than the smallest *p*-values obtained without imputation, after multiple testing correction (Figure S3 of the supplementary material available at *Biostatistics* online). We further tested a combined approach that uses IMPUTE to infer missing genotypes and uses our method to map interactions treating the imputed SNPs as “real.” As shown in Figure 3, the combined approach is more powerful than the single SNP test of IMPUTE but is less powerful than our method when epistasis is present. This result illustrates the advantage of simultaneous SNP imputation and association mapping by our approach.

IBD is a common disease in the human population. Previous genome-wide association studies have identified 32 loci in the genome that demonstrate strong marginal associations with IBD (Barrett *and others*, 2008). In a replication study using 3664 cases, Barrett *and others* (2008) further performed pairwise interaction test between the 32 IBD susceptible loci, and concluded that the *p*-values of pairwise interactions, although nominally significant, did not withstand the Bonferroni correction. We picked 3 pairs of loci with smallest *p*-values from their report: rs2188962 versus rs2872507, rs4613763 versus rs7758080, and rs11190140 versus rs2542151, the unadjusted *p*-values of which by Barrett *and others* (2008) are shown in Table 1. For each pair of loci, we applied our method to the IBD data set obtained from WTCCC1 (The Wellcome Trust Case Control Consortium, 2007) (typed SNPs), which is independent of the data used by Barrett *and others* (2008), and we imputed HapMap Phase II SNPs (untyped SNPs) within 400 Kb of each of the 2 loci to look for independent evidence of epistasis. We removed nonpolymorphic SNPs and SNPs with more than 5% bad quality genotypes (score <0.8). The data contained 2005 cases and 3004 controls, and we used 60 unrelated HapMap Utah residents with ancestry from northern and western Europe as the reference.

Imputing epistasis in WTCCC1 IBD data set. Three pairs of SNPs with smallest interaction p-values from Barrett and others (2008) were tested in WTCCC1 by imputing HapMap Phase II SNPs. Barrett P: nominal p-value reported in Barrett and others (2008); **...**

As shown in Table 1, our method detected interaction associations with IBD between two 400 Kb regions around rs2188962 at 15q31.1 and rs2872507 at 17q21.2. The most significant interaction we found had a nominal *p*-value 5.57×10^{ − 8} calculated by B-stat (Zhang and Liu, 2007), and the Bonferroni adjusted *p*-value was 9.58×10^{ − 3}. The interaction was between 2 imputed SNPs. To confirm that this finding is not an artifact created by BEAM2, we reran the method in another 5 Markov chains independently, and the method consistently reported interactions between the 2 regions with similar interaction *p*-values. We further shuffled individuals within cases and controls, separately, for the data within the region around rs2872507 only. The shuffled data therefore has no correlation between the 2 regions but retains the marginal association and LD of SNPs within each region. We ran our method again in 6 independent chains on the shuffled data and observed no interactions between the 2 regions in all chains. It therefore suggests that the epistasis we found between the 2 regions around rs2188962 and rs2872507 were not an artifact of BEAM2. Given that the 2 SNPs have demonstrated interaction with *p*-value 8.17 ×10^{ − 5} in an independent study of Barrett *and others* (2008), we believe that the epistasis between 15q31.1 and 17q21.2 is potentially real. Further replication studies with more SNPs genotyped in the 2 regions are needed to confirm this result. For the other 2 pairs of loci we picked, however, we were not able to find significant epistasis, which could be due to either no epistasis or lack of power.

We introduced a full Bayesian model for simultaneous SNP imputation and disease epistasis mapping for case–control studies. Compared with existing methods, our approach has the following advantages: (i) full Bayesian probabilistic model for all SNPs, reference data, and model parameters, such that the results are consistent and robust; (ii) simultaneous imputation of untyped SNPs in consideration of LD, such that associations created by LD effects are filtered; (iii) conditional imputation of SNPs based on disease association status of SNPs, such that imputation in the affected population is not biased toward the normal population; (iv) dynamic partitioning of SNP blocks, such that untyped SNPs are imputed from multiple plausible sets of tagging SNPs; and (v) detection of both marginal associations and epistasis via imputation. The output of our method is the posterior distributions of missing genotypes, single SNP associations, SNP–SNP interactions, and SNP block partitions.

Compared with existing imputation algorithms, our method has better power in detecting interaction associations when epistasis exist, and the method performs similarly with existing algorithms for noninteractive disease models. We further used IBD data to demonstrate an application of our method. In terms of computation time, in our study, BEAM2 took 50 min to impute one simulated data set of 2000 individuals and 700~800 untyped SNPs, where IMPUTE took 15 min, SNPMStat took 30 min, and BIMBAM took 40 min, respectively. While BEAM2 and BIMBAM spent most computation at imputing interactions, IMPUTE and SNPMStat only considered marginal associations. The speeds reported here were based on the default settings used in our study.

Our Bayesian model can be extended to accommodate additional complexities in case–control studies. First, the SNP block model can be generalized to incorporate unknown haplotypes within blocks, where haplotypes can be phased from genotype data. Haplotypes are much less diverse than multilocus genotypes, which can potentially improve imputation accuracy and power of association mapping. Second, current model can be used to impute low quality genotypes. The underlying assumption however is that the low quality genotypes occur at random. Nonrandom genotyping errors will bias the imputation and very likely induce spurious associations. It is thus necessary to further detect and adjust for nonrandom genotyping errors. Third, population structure is a well-known confounder in case–control studies. Our model currently assumes a homogeneous sample structure. SNP imputation in stratified sample is a challenging problem that warrants further investigation.

National Institutes of Health (R01-HG004718 to Y. Z.); the Wellcome Trust (076113).

Supplementary material is available at http://biostatistics.oxfordjournals.org.

The author is grateful to Prof. James L Rosenberger, 2 anonymous reviewers, and the editors of *Biostatistics* for their insightful comments that lead to a substantial improvement of the manuscript. The study presented in this paper makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. All the chromosomal positions are in National Center for Biotechnology Information build 35 coordinates. *Conflict of Interest:* None declared.

- Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, Brant SR, Silverberg MS, Taylor KD, Barmada MM. and others. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nature Genetics. 2008;40:955–962. [PMC free article] [PubMed]
- Browning SR. Missing data imputation and haplotype phase inference for genome-wide association studies. Human Genetics. 2008;124:439–450. [PMC free article] [PubMed]
- Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS. Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004;47:549–554. [PubMed]
- Lin DY, Hu Y, Huang BE. Simple and efficient analysis of disease association with missing genotype data. American Journal of Human Genetics. 2008;82:444–452. [PubMed]
- Liu JS. Monte Carlo Strategies in Scientific Computing. New York: Springer; 2001.
- Marchini J, Donnelly P, Cardon LR. Genotype-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics. 2005;37:413–417. [PubMed]
- Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics. 2007;39:906–913. [PubMed]
- Moore JH, Williams SM. New strategies for identifying gene-gene interactions in hypertension. Annals of Medicine. 2002;34:88–95. [PubMed]
- Nicolae DL. Testing untyped alleles (TUNA)—applications to genome-wide association studies. Genetic Epidemiology. 2006;30:718–727. [PubMed]
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics. 2001;69:138–147. [PubMed]
- Servin B, Stephens M. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genetics. 2007 3, e114. [PMC free article] [PubMed]
- The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320. [PMC free article] [PubMed]
- The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. [PMC free article] [PubMed]
- The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007:661–678. [PMC free article] [PubMed]
- Tsai CT, Lai LP, Lin JL, Chiang FT, Hwang JJ, Ritchie MD, Moore JH, Hsu KL, Tseng CD, Liau CS. and others. Renin-angiotensin system gene polymorphisms and atrial fibrillation. Circulation. 2004;109:1640–1646. [PubMed]
- Wanwan T, Xuebing W, Rui J, Yanda L. Epistatic module detection for case-control studies: a Bayesian module and a Gibbs sampling strategie. PLoS Genetics. 2009 5, e1000464. [PMC free article] [PubMed]
- Williams SM, Ritchie MD, Phillips JA, 3rd, Dawson E, Prince M, Dzhura E, Willis A, Semenya A, Summar M, White BC. and others. Multilocus analysis of hypertension: a hierarchical approach. Human Heredity. 2004;57:28–38. [PubMed]
- Zee RY, Hoh J, Cheng S, Reynolds R, Grow MA, Silbergleit A, Walker K, Steiner L, Zangenberg G, Fernandez-ortiz A. and others. Multi-locus interactions predict risk for post-PTCA restenosis: an approach to the genetic analysis of common complex disease. The Pharmacogenomics Journal. 2002;2:197–201. [PubMed]
- Zhang Y, Liu JS. Bayesian inference of epistatic interactions in case-control studies. Nature Genetics. 2007;39:1167–1173. [PubMed]
- Zhang Y, Zhang J, Liu JS. Bayesian detection of epistatic interactions associated with type 1 diabetes (under revision) 2010

Articles from Biostatistics (Oxford, England) are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |